Remove rows from one dataset matching rows in another dataset

I am working with a 1.3 million rows dataset with birthweights from which I need to remove about 1.2 k of matching birthweight rows of subjects with a specific condition in another dataset, in order to do statistical analyses.

Any idea on how to do that?

Thank you in advance,

GLS

Do from the big dataset a lookup to the smaller one, you can use what ever filled field in that data set (or you create with NewCol a new field with text like „remove“).

The looked up value should now mark the records to be removed. With this you can use filter to keep only rows where the lookup value is empty. Afterwards you can remove the column with the lookup value.

2 Likes

If you main dataset looks like this:

ID Weight Condition
001 6.5
002 8.1 jaundice
003 7.5 premature
004 6.9

And you want to remove all rows with jaundice or premature in the Condition column. Then just use the Subtract transform with a second data with all the values you want to remove:

Condition
jaundice
premature

babyweight.transform (2.2 KB)

If the data is less clean you might have to use one of the new ‘contains’ variants in Lookup, as suggested by @Olaf .

2 Likes

Example:

Remove rows by second table.transform (4.6 KB)

1 Like

And now with the subtract, which is even more simple, but a form of brute force. If it is based on additional criteria in the second set, the lookup might be the best choice as you can lookup for the criteria value.

Remove rows by second table 2.transform (5.1 KB)

1 Like

now I playing around …
Remove rows by second table 3.transform (5.2 KB)

The smaller data set has no an additional column with a criteria and the lookup select this criteria. The filter is now changed to a remove one, remove all records with the criteria match.

Thank you for your prompt answer!

Subtract removes ALL matching values in the small dataset from the big dataset
Let’s say
Big dataset
4.250
3.150
2.800
3.450
3.500
3.150
2.800
3.150

Small dataset
3.150
2.800

The substract will produce
4.250
3.450
3.500

Whereas I need just to remove only one match and I would like to obtain
4.250
3.450
3.500
3.150
2.800
3.150

And the lookup does also the same.

The option to remove one, instead of remove all is an option I have considered for Subtract.

Until then, you can workaround it by creating a synthetic key.

babyweight2.transform (5.1 KB)

2 Likes

Thank you very much for the transform, it works perfectly!
And it would be grand to have this option in Substract.

Have a great weekend!

1 Like

solution with lookup


babyweight3.transform (4.5 KB)

I was curious about the execution time differences between the two different solution, I thought lookup and filter might be expensive (even when the solution has less transformations). I build a big data set of 1 million records, each value was there 40 times, the sequence of the values was random (added after generation an second column with random function and sorted them by the random value). Second I generated a small data set by unique about 1.800 values out of the big set. Both were written in csv files.

Then I used both solution with the csv file input. EDT is unbelievable fast handling this 1 mio records. I repeated the execution multiple time. In average execution time for the subtract solution is about 2.0 sec and for the lookup 2.1 sec, so 5% difference (on a Mac with with M2 processor). For sure the user doesn’t recognize the difference.

Most expensive is the total transformation:

So for the problem of @GLS an additional option in Substract would give further benefit.

1 Like

It is on the wishlist! Probably add a drop-down with:

  • Remove all
  • Remove first
  • Remove last

With Remove all as the default.

2 Likes

Thank you Admin and Olaf!
The latest transform from Admin works perfectly - and honestly I am not looking at shaving fractions of seconds for transforms that anyhow are very fast, but I see Olaf’s point for larger datasets.

As for the wish list for the Remove transform (remove from X content in Y) :
Given X:
A 1
B 1
C 2
D 4
E 2
F 3
G 2
H 1

and Y:
A 2
B 1
C 2
D 5

My preferred options would be:

  • Remove all matching (would leave in X rows D & F)
  • Remove single match(es) (starting from the top would leave in X rows B, D, F, G, H)
  • Remove first match(es) (would leave in X rows B, D, F, G, H)
  • Remove last match(es) (would leave in X rows A, B, C, D, F)

I think important to remove single matches as you may need have datasets coming from different sources, and for analysis matching rows may need to be removed from the larger dataset.

And in the Remove transform Warning pane it may help to indicate unmatched rows in Y (D 5)

Thank you again!

I am not clear on the different between Remove single match(es) and Remove first match(es). Perhaps you could include an example where they give different results.

Also it might not be worth adding extra options that are equivalent to adding a single Sort or Dedupe transform before.

You are right - removing single matches would remove rows from data set X from the top as remove first match would do!

1 Like

Would be very helpful !

We are working on it now.

1 Like

The new Subtract option will be in the next release.

1 Like

Slowly the list of features for v2 builds :smiley:

I tried the version 2.0.1

The old solution from Admin is now down to:

an the the time required is down to < .5 second :+1: :clap:

Even the old transform solution is faster in 2.0.1 !!!

1 Like