Outliers transform

We are considering adding an Outliers transform and would like some feedback.

This would allow you to look for outlier values in a selected column. It would only work for numeric values.

A value would be identified as an outlier if it was above or below a threshold, where the threshold is:

  • N times the Inter Quartile Range above or below the column mean; or
  • N times the Standard Deviation above or below the column mean
The user chooses Inter Quartile Range or Standard Deviation and the value of N.
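To make the threshold rule concrete, here is a minimal Python sketch of how the two options could be computed. It is only an illustration of the description above, not how the transform is implemented: the function names, the use of the sample standard deviation, and the quartile method (the default of Python's `statistics.quantiles`) are all assumptions.

```python
import statistics

def outlier_thresholds(values, n=2.0, method="sd"):
    """Return (lower, upper) thresholds centred on the column mean.

    method="sd"  -> mean +/- n * standard deviation (sample SD, an assumption)
    method="iqr" -> mean +/- n * inter quartile range
    """
    mean = statistics.fmean(values)
    if method == "sd":
        spread = statistics.stdev(values)
    elif method == "iqr":
        q1, _, q3 = statistics.quantiles(values, n=4)
        spread = q3 - q1
    else:
        raise ValueError(f"unknown method: {method!r}")
    return mean - n * spread, mean + n * spread

def is_outlier(value, lower, upper):
    """A value is an outlier if it falls outside the two thresholds."""
    return value < lower or value > upper

# Hypothetical numeric column with one obviously extreme value.
column = [10, 12, 11, 13, 12, 11, 95]
lower, upper = outlier_thresholds(column, n=2.0, method="sd")
flags = [is_outlier(v, lower, upper) for v in column]
```

With N = 2 and Standard Deviation, only the value 95 is flagged in this made-up column.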

The user would also have a choice of what to do with outliers, selecting one of:

  • remove outlier rows
  • keep only outlier rows
  • change outlier values to column mean
  • change outlier values to the threshold
  • change outlier values to a value provided by the user
  • add a new column marking outliers
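The options above could be sketched roughly as follows in Python. This is a hypothetical illustration, not the transform's actual behaviour: the action names, the `is_outlier` column name, and the choice to compute the replacement mean over the whole column (outliers included) are assumptions. The function also returns the number of outliers found, so the count can be reported.

```python
from statistics import fmean

def handle_outliers(rows, col, lower, upper, action, fill=None):
    """Apply one outlier action to a list of dict rows.

    Returns (new_rows, outlier_count). The input rows are not modified.
    """
    def is_out(row):
        return row[col] < lower or row[col] > upper

    count = sum(1 for r in rows if is_out(r))

    if action == "remove":            # remove outlier rows
        out = [dict(r) for r in rows if not is_out(r)]
    elif action == "keep_only":       # keep only outlier rows
        out = [dict(r) for r in rows if is_out(r)]
    elif action == "flag":            # add a new column marking outliers
        out = [{**r, "is_outlier": is_out(r)} for r in rows]
    elif action in ("mean", "threshold", "value"):
        # Assumption: the mean is taken over all rows, outliers included.
        mean = fmean([r[col] for r in rows])
        out = []
        for r in rows:
            new = dict(r)
            if is_out(r):
                if action == "mean":
                    new[col] = mean
                elif action == "threshold":
                    new[col] = lower if r[col] < lower else upper
                else:
                    new[col] = fill   # value provided by the user
            out.append(new)
    else:
        raise ValueError(f"unknown action: {action!r}")
    return out, count

# Hypothetical data and thresholds, for illustration only.
rows = [{"id": i, "x": v} for i, v in enumerate([10, 12, 11, 13, 12, 11, 95])]
clipped, changed = handle_outliers(rows, "x", lower=0, upper=20, action="threshold")
print(f"{changed} outlier(s) changed")
```

Returning a copy rather than editing the rows in place keeps the raw data intact, which makes it easier to see afterwards what was altered.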

Would this be useful to anyone? How do you currently handle outliers? Is this something you deal with much, @DanFeliciano?

First, I’m very interested in the transform. I can see many valid and invalid uses of it.
The manipulation of raw data is concerning to me, especially if the manipulated data is shared with folks who don’t realize the data set has been altered.

I teach a class at Dartmouth on Forensic Analytics and Fraudulent Data using Benford’s law as a way of identifying data sets that have been tampered with. I can’t believe how much “scientific data” has been manipulated.
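For readers unfamiliar with Benford’s law: it predicts that the leading digit d of many naturally occurring data sets appears with probability log10(1 + 1/d), so about 30% of values start with 1. A rough Python sketch of comparing a data set against that expectation is below. The function names are made up for illustration, and a large gap between observed and expected shares is only a hint that the data deserves a closer look, not proof of tampering.

```python
import math
from collections import Counter
from decimal import Decimal

def benford_expected():
    """Expected share of each leading digit 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x):
    """First significant (non-zero) digit of a non-zero number."""
    digits = Decimal(str(abs(x))).as_tuple().digits
    return next(d for d in digits if d != 0)

def first_digit_shares(values):
    """Observed share of each leading digit 1..9, ignoring zeros."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

# Powers of 2 are a classic sequence that follows Benford's law closely.
sample = [2 ** k for k in range(1, 101)]
expected = benford_expected()
observed = first_digit_shares(sample)
for d in range(1, 10):
    print(d, f"expected {expected[d]:.3f}", f"observed {observed[d]:.3f}")
```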

Would it be possible to have a comment generated in the output/processing window indicating the number of outliers that were changed?

“If you torture the data long enough, it will confess to anything”

Yes. Pretty much all the transforms tell you how many rows have been modified.

In future we might add a detailed report that you can generate that gives you an outline of what was done during a run. It will have to be an outline to keep it a manageable size.

I knocked up a quick Benford’s Law example here. ;0)

You can now try the new Outliers transform: