Filter rows by value from Stats

I have a working solution, so this question is about a more efficient method or else a possible feature request.

The nub of the problem is to filter rows by a value derived from a summary value or statistical output, in this case everything above a percentile value. The core transform with some sample data is attached. The general logic is to get the percentile value (Stats), trim rows and columns to isolate that value (Filter and Remove Col), then use Cross to add that value as a new column in the original data. This can now be Filtered normally before output formatting.

I am hoping for a more concise yet general solution for those four transform steps.

Another way to go about it would be to Sort the table on the column of interest then take the top 20%. For this I need an immediately available input of number of rows to avoid essentially the same steps as above.

For another option, I have never needed/wanted/bothered to learn javascript. I could but it is not my first preference.

Percentile.transform (5.3 KB)

Here is solution based on your suggested way of taking the top 20% from the sorted list.

To get the number of rows in your data set, you can use the Meta info drop down in the input From Clipboard

Then use the Scale transform to turn the row numbers to their equivalent 20% and use that as filter

image

Transform file
Percentile2.transform (3.8 KB)

Here is solution on your original requirement.

image

Transform file.
Percentile3.transform (3.2 KB)

@Anonymous. I had not earlier noticed that Meta Info is available, so that is immediately useful thanks.

I played around with Scale, then changed the number of input rows. Unless I missed something it appears that the scaling value must be adjusted manually for the number of input rows to get the right result. I replaced Scale with Calculate, multiplying Meta Info by 0.2 (i.e. one minus the desired percentile). After that (with related naming adjustments) it worked with different row counts.

Percentile 2.1.transform (4.3 KB)

I have just picked up your second solution which I shall check out later this evening thank you.

A great use of If with Fill up. Technically I suppose this is only a step or two shorter but feels more elegant than the value isolation steps I originally used. The latter feels like chopping bits up on the side to reintroduce later, rather than progressive change on the original.

A good exercise for me thank you. It is interesting to see there are (at least) three clear solutions.

It would be useful if you could take a value from one dataset (e.g. the number of rows) and set it as the value of a parameter in a downstream transform. However it is a challenge to think of a way to do it that is simple for the user. It is something we hope to find time to think about a bit more.

The Outliers transform has the option to keep or remove rows outside a range. But the range is set by Inter Quartile Range or Standard Deviation, rather than percentage. Also it includes/excludes above and below the range, rather than just above below a cut off point.

I looked at Outliers in the initial phase and saw that problem. Would it be a reasonable pair of changes to:

  • provide an option to make it one or two-tailed, e.g. x > 75% rather than only x < 25% OR x > 75%?
  • make the percentile a user option on the proviso it will be symmetrical? That is, if the user enters 70 then the range is 30 - 70, 95 is 5 - 95.

I tend to be playing with data that is not normally distributed, so use medians and other percentiles quite a bit.

Here is another way to do it, using meta info.

Percentile4.transform (3.5 KB)

Yes, that is definitely a possibility for future releases. But I need to do it in a way that doesn’t make Outliers too complex and confusing. Perhaps with a little text summary, similar to Filter.

image