I am working with a dataset of 1.3 million birthweight rows, from which I need to remove about 1.2 k matching rows for subjects with a specific condition listed in another dataset, in order to do statistical analyses.
Do a lookup from the big dataset to the smaller one. You can use whatever filled field in that dataset (or create a new field with NewCol containing text like "remove").
The looked-up value then marks the records to be removed. With this you can use a filter to keep only rows where the lookup value is empty. Afterwards you can remove the lookup column.
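For anyone reproducing this outside EDT, here is a minimal pandas sketch of the lookup-and-filter approach; the file names and the subject_id join column are hypothetical, just for illustration:

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
big = pd.read_csv("birthweights.csv")          # ~1.3 million rows
small = pd.read_csv("condition_subjects.csv")  # ~1.2 k rows, subject_id assumed unique

# The "lookup": give every row of the small set a marker value,
# like creating a field with NewCol containing the text "remove".
small["flag"] = "remove"
marked = big.merge(small[["subject_id", "flag"]], on="subject_id", how="left")

# Keep only rows where the looked-up value is empty, then drop the helper column.
result = marked[marked["flag"].isna()].drop(columns="flag")
```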
Say you want to remove all rows with jaundice or premature in the Condition column. Then just use the Subtract transform with a second dataset containing all the values you want to remove.
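Outside EDT, Subtract corresponds to an anti-join on the key column. A minimal pandas sketch, assuming the Condition column from the example above and a hypothetical input file:

```python
import pandas as pd

big = pd.read_csv("birthweights.csv")  # hypothetical file name

# Second dataset with all the values to remove.
to_remove = ["jaundice", "premature"]

# Subtract: keep only rows whose Condition is not in the removal list.
result = big[~big["Condition"].isin(to_remove)]
```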
And now the Subtract version, which is even simpler, but a form of brute force. If the removal is based on additional criteria in the second set, the lookup might be the better choice, as you can look up the criteria value.
The smaller dataset now has an additional column with a criterion, and the lookup selects this criterion. The filter is then changed to a remove filter: remove all records where the criterion matches.
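In the same hypothetical pandas terms, the criteria variant looks like this; the column names and the criterion value "exclude" are made up:

```python
import pandas as pd

big = pd.read_csv("birthweights.csv")
small = pd.read_csv("condition_subjects.csv")  # now has a "criteria" column

# The lookup copies the criterion over from the small set.
marked = big.merge(small[["subject_id", "criteria"]], on="subject_id", how="left")

# Remove-style filter: drop all records where the looked-up criterion matches.
result = marked[~marked["criteria"].eq("exclude")].drop(columns="criteria")
```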
I was curious about the execution time difference between the two solutions; I thought the lookup and filter might be expensive (even when that solution has fewer transforms). I built a big dataset of 1 million records in which each value occurred 40 times, with the values in random order (after generation I added a second column filled by the random function and sorted by it). Then I generated a small dataset of about 1,800 unique values taken from the big set. Both were written to CSV files.
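The details of the generation script aren't given, so here is one plausible way to build the same test data in pandas; 1 million rows at 40 repeats each implies 25,000 distinct values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng()

# 25,000 distinct values x 40 repeats = 1 million rows.
big = pd.DataFrame({"value": np.repeat(np.arange(25_000), 40)})

# Randomise the sequence: add a second column with random numbers,
# sort by it, then drop it again.
big["rand"] = rng.random(len(big))
big = big.sort_values("rand").drop(columns="rand")
big.to_csv("big.csv", index=False)

# Small set: about 1,800 unique values taken from the big one.
small = pd.DataFrame({"value": rng.choice(25_000, size=1_800, replace=False)})
small.to_csv("small.csv", index=False)
```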
Then I ran both solutions with the CSV files as input. EDT is unbelievably fast handling these 1 million records. I repeated the execution multiple times. On average the execution time is about 2.0 s for the subtract solution and 2.1 s for the lookup, so a 5% difference (on a Mac with an M2 processor). For sure the user doesn't notice the difference.
Thank you Admin and Olaf!
The latest transform from Admin works perfectly - and honestly I am not looking at shaving fractions of a second off transforms that are in any case very fast, but I see Olaf's point for larger datasets.
As for the wish list for the Remove transform (remove from X the content in Y):
Given X:
A 1
B 1
C 2
D 4
E 2
F 3
G 2
H 1
and Y:
A 2
B 1
C 2
D 5
My preferred options would be:
Remove all matching (would leave in X rows D & F)
Remove single match(es) (starting from the top would leave in X rows B, D, F, G, H)
Remove first match(es) (would leave in X rows B, D, F, G, H)
Remove last match(es) (would leave in X rows A, B, C, D, F)
I think it is important to be able to remove single matches, as you may have datasets coming from different sources, and for analysis the matching rows may need to be removed from the larger dataset; the sketch at the end of this post illustrates that behaviour.
And in the Remove transform's Warning pane it might help to indicate unmatched rows in Y (here, D 5).
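For what it's worth, "Remove single match(es)" amounts to multiset subtraction: each row of Y cancels at most one row of X. A pandas sketch of that semantics on the X and Y above, numbering duplicates so the anti-join pairs them one-to-one:

```python
import pandas as pd

x = pd.DataFrame({"row": list("ABCDEFGH"),
                  "val": [1, 1, 2, 4, 2, 3, 2, 1]})
y = pd.DataFrame({"row": list("ABCD"),
                  "val": [2, 1, 2, 5]})

# Number duplicate values within each frame (0, 1, 2, ...), so that the
# n-th occurrence of a value in X is only cancelled by an n-th occurrence in Y.
x["n"] = x.groupby("val").cumcount()
y["n"] = y.groupby("val").cumcount()

# Anti-join on (value, occurrence number) removes one X row per Y row.
merged = x.merge(y[["val", "n"]], on=["val", "n"], how="left", indicator=True)
result = merged[merged["_merge"] == "left_only"].drop(columns=["n", "_merge"])
print(result["row"].tolist())  # ['B', 'D', 'F', 'G', 'H']
```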
I am not clear on the difference between Remove single match(es) and Remove first match(es). Perhaps you could include an example where they give different results.
Also, it might not be worth adding extra options that are equivalent to adding a single Sort or Dedupe transform beforehand.