Dealing with schema drift

Admin · September 5, 2024, 2:44pm

‘Schema drift’ is where the columns in an input change over time. E.g.:

new columns are added
existing columns are deleted
the column order changes

It can be a real nuisance for transformations that you run regularly.

In Easy Data Transform you can handle this using Stack. But this can be quite tedious to do.

So we are adding a new Schema feature to inputs in v2. This gives you the option to store an ordered list of column names with each input and say what you want to do if the input does not match the schema.

add missing columns (in the schema, but not in the input) with empty values
rearrange input columns into the same order as the schema
add or ignore extra columns (in the input, but not in the schema)
stop with an error

We hope to have a beta version that customers can try in not too long.

prashant · September 7, 2024, 2:05am

now you just teasing us

My take on Rearrange

will be very helpful , but you will have to show the user the existing column order also to help
Give the option to enter ‘dummy’ column with ‘dummy’ value so in case a column is deleted , user can insert that too , to maintain integrity in calculation downstream.

Admin · September 7, 2024, 8:14am

Yes, still thinking about that.

That has already been added.

prashant · September 7, 2024, 9:59am

oooh la la ! I am now looking forward to the big bad beta , which I hope is paid !

MikeA · September 11, 2024, 2:40pm

This looks fantastic. Looking forward to v2.

joker · October 14, 2024, 12:02pm

I can’t stop thinking about this, as schema drift is quite an issue for me. But as my situation is more working towards pretty big/complex schemas:

Can we have this as a transform please??? Check the data set against a given schema?

Would be just so great.

Anonymous · October 14, 2024, 1:45pm

You can do that by using Stack transform.

Have a clipboard input with only header i.e your schema like column names
Input your data file, have a Sample transform and set only for few records (even 1 record is enough)
Stack them together using Stack transform
Values for the same column header will fall into their respective columns
Any column which does not match will get added to the right

This way you will know your schema drift.

Admin · October 14, 2024, 5:33pm

As @Anonymous says, you can Stack columns by name. But the new schema versioning feature built into inputs should be more powerful and easier to use. Have you tried it?

joker · October 14, 2024, 7:37pm

yes, I know and tried it.
The new schema on import is so much more powerful and helpful! But as I said: I am more often working towards a given schema. The import schema does not help too much on this.

Admin · October 14, 2024, 8:37pm

You could set Verify to fail if you don’t end up with the expected column names.