Example: matching dirty data

Often we need to match data that is dirty. For example, we might want to match the company name in two datasets to do a Lookup.

In the first dataset we have company names:

Smith Industries
Weyland-Yutani
ACME LTD

In the second dataset we have company names:

Smith Industries Ltd
Weyland-yutani,Inc.
Acme

We can try to match these using the fuzzy matching option in Lookup but this is slow for big datasets and can result in false positives. So it is much better to try to clean the data as much as possible first.

We can do this by using a Replace transform to replace any of the following with a space:

  • punctuation characters
  • ‘inc’, ‘incorporated’, ‘ltd’, ‘limited’ etc (use \b to denote a word boundary in a regex, so we only replace a whole word, not part of a word)

Then use a Whitespace transform to remove leading and trailing whitespace and convert consecutive spaces into a single space.

We can then do a Lookup with exact match:

dirty.transform (5.5 KB)

If there are still issues (e.g. typos in company names) we could try fuzzy match.

2 Likes