I just recently downloaded the platform and have been playing around with it. I was wondering if you guys could give me a head start.
I'm web scraping data for a project from multiple different sites, then aggregating it into a CSV.
On every site (generally, at least) the product is called something similar but not the same.
So it would look something like this:

Site A: Asus ROG Crosshair VIII Hero | Site A URL | Form factor ATX | Socket AM4
Site B: ASUS ROG X570 Crosshair VIII Hero | Site B URL | Form factor ATX | Socket AM4
Site C: Asus Crosshair Hero | Site C URL | Form factor ATX | Socket AM4 | Chipset
All the same motherboard, but with different names on different sites. Is there a way to identify all the duplicate rows and then merge them into one row? So it would look like this:
Asus ROG Crosshair VIII Hero | Site A URL | Site B URL | Site C URL | Form factor ATX | Socket AM4
I have about 5,000 listings and have been doing them manually. It's obviously extremely time-consuming. Several people have recommended using pandas, but I'm not great with Python. Many have also recommended EDT!
I know the general logic: if a name is at least a 65% match for another name, and the socket and form factor are the same, merge the rows. I just don't know how to do it or where to start.
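That "65% name match plus exact socket and form factor" logic can be sketched in plain Python using `difflib` from the standard library (no pandas needed). This is just a sketch: the column names (`Name`, `URL`, `Form Factor`, `Socket`) are assumptions, and you would load the dicts from your real CSV instead of the hard-coded sample rows.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.65):
    """True if two names are at least `threshold` similar (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def merge_listings(rows):
    """Group rows whose names are ~65% similar and whose Socket and
    Form Factor match exactly; collect all URLs onto one merged row."""
    merged = []
    for row in rows:
        for group in merged:
            if (row["Socket"] == group["Socket"]
                    and row["Form Factor"] == group["Form Factor"]
                    and similar(row["Name"], group["Name"])):
                group["URLs"].append(row["URL"])
                break
        else:  # no existing group matched, so start a new one
            merged.append({"Name": row["Name"],
                           "Socket": row["Socket"],
                           "Form Factor": row["Form Factor"],
                           "URLs": [row["URL"]]})
    return merged

# Sample rows standing in for the scraped CSV data
rows = [
    {"Name": "Asus ROG Crosshair VIII Hero", "URL": "site-a",
     "Form Factor": "ATX", "Socket": "AM4"},
    {"Name": "ASUS ROG X570 Crosshair VIII Hero", "URL": "site-b",
     "Form Factor": "ATX", "Socket": "AM4"},
    {"Name": "Asus Crosshair Hero", "URL": "site-c",
     "Form Factor": "ATX", "Socket": "AM4"},
]
result = merge_listings(rows)  # all three collapse into one row
```

Note this keeps the first name it sees as the "canonical" one, and the result depends on row order; for 5,000 rows it's fast enough, but the comparison is quadratic in the worst case.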
(you can also use regular expressions in the filters)
You can then re-run EDT any time the input changes. You can also add a new filter if a new product appears.
Obviously it would be nice to have some form of ‘fuzzy matching’ to do all this automatically. We don’t currently support fuzzy matching. Maybe one day.
Fuzzy matching is tricky. It requires a lot of domain knowledge to understand that "Asus ROG Crosshair VIII Hero form factor ATX Socket AM4" is the same as "ASUS ROG X570 Crosshair VIII Hero form factor ATX Socket AM4" but different to "ASUS ROG X670 Crosshair VIII Hero form factor ATX Socket AM4". If you are feeling brave you could try to code up your own solution in Python. Or maybe there is a library that can do it. Try looking at https://en.wikipedia.org/wiki/Levenshtein_distance. But I think that is mostly used for single words.
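For what it's worth, Levenshtein distance does work on whole strings, not just single words; it counts the minimum number of single-character edits between them. A rough sketch (this is a textbook dynamic-programming version, not any particular library's implementation), which also shows why it needs domain knowledge on top: "X570" vs "X670" is only one edit apart, yet they are different chipsets.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete,
    substitute) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

d1 = levenshtein("Crosshair VIII Hero", "Crosshair Hero")  # 5: "VIII " removed
d2 = levenshtein("X570", "X670")  # 1: tiny edit, big real-world difference
```

That last pair is the catch: a pure edit-distance threshold would happily merge two different chipsets, which is why exact matching on fields like socket and chipset matters alongside the fuzzy name match.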