We are looking into the possibility of adding an Outliers transform and are looking for some feedback.
This would allow you to look for outlier values in a selected column. It would only work for numeric values.
A value would be identified as an outlier if its value was above or below a threshold, where the threshold is:
N times the Inter Quartile Range above or below the column mean; or
N times the Standard Deviation above or below the column mean
The user chooses Inter Quartile Range or Standard Deviation and the value of N.
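To make the threshold rule concrete, here is a rough Python sketch of how the two thresholds could be computed. It is only an illustration of the description above, not how the transform is implemented. The function names are made up for this example, the sample standard deviation is an assumption (the proposal does not say sample or population), and the quartiles use the default method of Python's statistics module, which may differ from other tools.

```python
import statistics


def outlier_thresholds(values, method="sd", n=3.0):
    """Return (lower, upper) thresholds centred on the column mean.

    method: "sd" for standard deviation, "iqr" for interquartile range.
    n: the multiplier N chosen by the user.
    """
    values = list(values)
    mean = statistics.fmean(values)
    if method == "sd":
        # Assumption: sample standard deviation (divides by len - 1).
        spread = statistics.stdev(values)
    elif method == "iqr":
        # Assumption: quartiles from the module's default 'exclusive' method.
        q1, _median, q3 = statistics.quantiles(values, n=4)
        spread = q3 - q1
    else:
        raise ValueError("method must be 'sd' or 'iqr'")
    return mean - n * spread, mean + n * spread


def is_outlier(value, lower, upper):
    """A value is an outlier if it is strictly below or above the thresholds."""
    return value < lower or value > upper
```

For example, outlier_thresholds(column_values, "iqr", 1.5) would give the pair of thresholds for N = 1.5 using the interquartile range, where column_values is whatever list of numbers you have.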
The user would also have a choice of what to do with outliers, selecting one of:
remove outlier rows
keep only outlier rows
change outlier values to column mean
change outlier values to the threshold
change outlier values to a value provided by the user
add a new column marking outliers
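The six options above could be sketched in Python roughly as follows. This is only an illustration under some assumptions: rows are plain dictionaries, the action names and the "_outlier" column suffix are invented for the example, the lower and upper thresholds are passed in already computed, and the column mean is taken over all values including the outliers (the proposal does not say which).

```python
import statistics

ACTIONS = {"remove", "keep", "mean", "threshold", "value", "mark"}


def handle_outliers(rows, column, lower, upper, action, replacement=None):
    """Apply one outlier action to a list of dict rows and return new rows.

    The input rows are not modified.
    """
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")

    def is_outlier(v):
        return v < lower or v > upper

    if action == "remove":
        return [dict(r) for r in rows if not is_outlier(r[column])]
    if action == "keep":
        return [dict(r) for r in rows if is_outlier(r[column])]

    # Assumption: the mean is computed over every value, outliers included.
    mean = statistics.fmean([r[column] for r in rows])
    result = []
    for r in rows:
        new_row = dict(r)
        v = r[column]
        if action == "mark":
            # Hypothetical name for the added marker column.
            new_row[column + "_outlier"] = is_outlier(v)
        elif is_outlier(v):
            if action == "mean":
                new_row[column] = mean
            elif action == "threshold":
                # Clamp to whichever threshold was crossed.
                new_row[column] = lower if v < lower else upper
            else:  # action == "value"
                new_row[column] = replacement
        result.append(new_row)
    return result
```

Returning new rows rather than editing in place keeps the original data available, which matters if you want to compare before and after.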
Would this be useful to anyone? How do you handle outliers currently? Is this something you deal with much?
First, I’m very interested in the transform. I can see many valid and invalid uses of it.
The manipulation of raw data is concerning to me, especially if the manipulated data is shared with folks who don’t realize the data set has been altered.
I teach a class at Dartmouth on Forensic Analytics and Fraudulent Data using Benford’s law as a way of identifying data sets that have been tampered with. I can’t believe how much “scientific data” has been manipulated.
Would it be possible to have a comment generated in the output/processing window indicating the number of outliers that were changed?
“If you torture the data long enough, it will confess to anything.”
Yes. Pretty much all the transforms tell you how many rows have been modified.
In future we might add a detailed report that you can generate that gives you an outline of what was done during a run. It will have to be an outline to keep it a manageable size.
I knocked up a quick Benford’s Law example
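For anyone who has not come across it, Benford's Law says that in many naturally occurring data sets the leading digit d appears with probability log10(1 + 1/d), so about 30% of values start with 1. The Python below is a minimal sketch of that check and is separate from the example mentioned above. The function names are made up for illustration, and how large a deviation counts as suspicious is left to the reader.

```python
import math
from collections import Counter


def benford_expected():
    """Expected first-digit proportions under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """Leading non-zero digit of a non-zero number."""
    # Scientific notation puts the leading digit first, e.g. '3.14...e+02'.
    return int(f"{abs(x):.16e}"[0])


def first_digit_frequencies(values):
    """Observed first-digit proportions, ignoring zeros."""
    digits = [first_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("no non-zero values to analyse")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}
```

Comparing first_digit_frequencies(your_values) against benford_expected() digit by digit gives a quick first look. A formal comparison would normally use a statistical test, which is not shown here.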
You can now try the new
There is a new snapshot release for Windows and Mac. It adds to the previous snapshot release:
A new Outliers transform, for dealing with outliers in numerical data.
More control over tooltips in the Center pane.
For more details and to download this release go to:
The downloads are called:
You may need to refresh the page, if you have visited it previously.