Data verification

Admin · October 9, 2024, 12:38pm

Everything is stored as a string (text) in Easy Data Transform. Some transforms will try to convert strings to integers (or reals or dates) temporarily for the purposes of that transform. But then it stores it as text again. This isn’t as fast as storing a column as integers, but it is much more flexible and less hassle (as you don’t have to go through and define the type of every column).

Setting the column type as ‘Integer’ just changes the verification rules available for that column. It doesn’t change how that column is stored. So you shouldn’t lose any leading zeros, unless perhaps you output to Excel and explicitly set the column format to integer.
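
To illustrate the point about text storage, here is a minimal Python sketch (not Easy Data Transform's actual code; the sample values and the helper name `is_integer` are made up). It shows that a temporary conversion for checking leaves the stored text alone, whereas converting the stored values themselves drops leading zeros.

```python
# Values kept as text, as described above; the sample data is invented.
codes = ["00123", "042", "7"]


def is_integer(text: str) -> bool:
    """Temporarily try to interpret the text as an integer (hypothetical helper)."""
    try:
        int(text)
        return True
    except ValueError:
        return False


# Checking does not modify the stored strings.
all_integers = all(is_integer(c) for c in codes)

# Actually storing the values as integers and writing them back loses the zeros.
round_tripped = [str(int(c)) for c in codes]

print(all_integers)   # True
print(codes)          # ['00123', '042', '7']
print(round_tripped)  # ['123', '42', '7']
```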

Admin · October 9, 2024, 12:39pm

Would it help if columns with >0 rules were shown as bold? So you can see at a glance which columns have rules defined?

joker · October 9, 2024, 12:48pm

Definitely. But…

Why do you want to auto-set a rule when switching a column type to Integer?

Most helpful would be if the config pane of the new transform would not move when an update occurs.

Admin · October 9, 2024, 1:07pm

If you set Column type to integer, wouldn’t you usually want to check that they are all integers? So we set that rule on by default for Numeric (integer) columns.

The only 3 rules that are checked by default are:

Integer except listed special values for Numeric (integer) columns
Numeric except listed special values for Numeric (real) columns
Valid date in format for Date columns
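
As a rough idea of what a rule like "Integer except listed special values" might do, here is a small Python sketch. It is an assumption about the behaviour, not the program's implementation; the function name, the integer pattern, and the sample column are hypothetical.

```python
import re

# Assumed definition of "integer": optional sign followed by ASCII digits.
INTEGER_PATTERN = re.compile(r"[+-]?[0-9]+")


def integer_except_special(values, special_values=()):
    """Return the indexes of values that are neither integers nor listed special values."""
    special = set(special_values)
    return [
        i
        for i, v in enumerate(values)
        if v not in special and not INTEGER_PATTERN.fullmatch(v)
    ]


column = ["12", "-3", "nan", "", "4.5", "007"]

# With "nan" listed as a special value it is accepted.
failures_with_nan_allowed = integer_except_special(column, special_values=["nan"])

# With no special values listed, "nan" is reported as well.
failures_strict = integer_except_special(column)

print(failures_with_nan_allowed)  # [3, 4]
print(failures_strict)            # [2, 3, 4]
```

Whether a value such as "nan" belongs in the special values list depends on whether you treat it as expected or as an error to catch.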

Admin · October 9, 2024, 1:07pm

That was an oversight on our part and should be easily fixed. Thanks.

Admin · October 9, 2024, 1:16pm

I think it would be useful to have a default set of Verify checks based on column name. So, for example:

If a column is called ‘Purchase Date’ then set Verification column type to Date and check rules Valid date in format and no empty values.
If a column is called ‘barcode’ then set Verification column type to Numeric (integer) and check is valid EAN13.

They would just be the default rules for each column when you add a new Verify. You could change them. Sort of a simple data dictionary, but it won’t be in v2.0!
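
A simple data dictionary of this kind could be sketched as a lookup from column name to default verification settings. The Python below only illustrates the idea; the class, the dictionary, and the case-insensitive matching are all assumptions, and the rule names are taken loosely from the two examples above.

```python
from dataclasses import dataclass, field


@dataclass
class VerifyDefaults:
    """Default verification settings for one column name (hypothetical structure)."""
    column_type: str
    rules: list = field(default_factory=list)


# Hypothetical user-defined dictionary, keyed by lower-cased column name.
DEFAULTS_BY_NAME = {
    "purchase date": VerifyDefaults("Date", ["Valid date in format", "No empty values"]),
    "barcode": VerifyDefaults("Numeric (integer)", ["Is valid EAN13"]),
}


def defaults_for(column_name: str):
    """Look up defaults for a column; returns None when the name is not in the dictionary."""
    return DEFAULTS_BY_NAME.get(column_name.strip().lower())


print(defaults_for("Purchase Date"))
print(defaults_for("EAN"))  # None: unknown names get no default rules
```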

joker · October 9, 2024, 3:33pm

Admin wrote: “That was an oversight on our part and should be easily fixed. Thanks.”

That is by far the best solution

joker · October 9, 2024, 3:37pm

Admin wrote: “If you set Column type to integer, wouldn’t you usually want to check that they are all integers?”

You are probably right. I guess this only comes up because of the triggered update and the display glitch you said you would look into.

But I would personally not set “nan” as a default exclude, because that is probably the sort of error I would be looking for.

joker · October 9, 2024, 3:38pm

sounds clever, but my EAN-column never will be called ‘barcode’

Might be pretty tricky

Admin · October 9, 2024, 3:42pm

You would be able to define your own column names. And it would be optional.

If you get your files mostly from a single source, I think this feature would be quite useful. But if every dataset you get has different column names, it won’t help.

joker · October 9, 2024, 3:52pm

mhmm… when I try to validate some EAN13 values, I instantly get warnings from the auto test “Integer except listed special…”. EAN validation then works like a charm…

verify EAN.transform (6.9 KB)
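
For reference, the standard EAN-13 check digit calculation can be sketched in Python as follows. This is the generic algorithm working on text, not necessarily how Easy Data Transform implements its rule; the function name is made up.

```python
def is_valid_ean13(text: str) -> bool:
    """Check length, digits, and check digit of an EAN-13 code given as text."""
    if len(text) != 13 or not (text.isascii() and text.isdigit()):
        return False
    digits = [int(ch) for ch in text]
    # The first 12 digits are weighted 1, 3, 1, 3, ... from the left.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    # The check digit brings the weighted sum up to a multiple of 10.
    return (10 - total % 10) % 10 == digits[12]


print(is_valid_ean13("4006381333931"))  # True
print(is_valid_ean13("4006381333932"))  # False: wrong check digit
print(is_valid_ean13("400638133393"))   # False: only 12 characters
```

Because the check runs on the text itself, leading zeros and the 13-digit length are preserved.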

joker · October 9, 2024, 3:55pm

BTW: the Info coloring does not work out well in night mode…

Admin · October 9, 2024, 4:04pm

True. You can click the button to change it, but we should probably come up with something that works in both dark and light mode. Thanks.

Admin · October 9, 2024, 4:13pm

Fair point. It is a bit tricky to be consistent on this. Is a telephone number text or an integer? What if it has spaces in it? What if it has a ‘+’?

We could move EAN13 to the text type. But it does mean there are a lot of rules for text. Especially if we add all the various GTIN flavors.

I will give it some thought.

Admin · October 9, 2024, 4:19pm

I am struggling to come up with 3 colors that:

- work with both dark and light UIs
- are visually distinctive from each other
- can be intuitively associated with error/warning/info

Suggestions welcome.

joker · October 9, 2024, 4:29pm

see… it’s a (formatted) text. You don’t want to do math with it, either

That’s already true… and it’s “only” the first beta… it will only become worse over time. And that many rules for every column in a data set… massive configuration. The only help I can think of would be a text search filter like the one you came up with for column headers…

Admin · October 9, 2024, 4:38pm

I’m not sure how well that would work with a tree.

Admin · October 9, 2024, 6:40pm

I have a fix for that. It snuck in unnoticed when we changed something else. That is why we do betas!

Admin · October 9, 2024, 6:53pm

Experimenting with that now. Do you think this is a useful additional cue?

Admin · October 9, 2024, 7:00pm

As it isn’t clear if EAN13 or UPC-A are integers or barcodes, I am going to list them under both (as we do for telephone numbers). That seems the best compromise for various reasons.

Also, @joker persuaded me, there will be no rules checked by default.