Dealing with schema drift

‘Schema drift’ is where the columns in an input change over time. E.g.:

  • new columns are added
  • existing columns are deleted
  • the column order changes

It can be a real nuisance for transformations that you run regularly.

In Easy Data Transform you can handle this using Stack. But this can be quite tedious to do.

So we are adding a new Schema feature to inputs in v2. This gives you the option to store an ordered list of column names with each input and say what you want to do if the input does not match the schema.

  • add missing columns (in the schema, but not in the input) with empty values
  • rearrange input columns into the same order as the schema
  • add or ignore extra columns (in the input, but not in the schema)
  • stop with an error

We hope to have a beta version that customers can try in not too long.

5 Likes

now you just teasing us

My take on Rearrange

  1. will be very helpful , but you will have to show the user the existing column order also to help
  2. Give the option to enter ‘dummy’ column with ‘dummy’ value so in case a column is deleted , user can insert that too , to maintain integrity in calculation downstream.

Yes, still thinking about that.

That has already been added.

image

oooh la la ! I am now looking forward to the big bad beta , which I hope is paid !

This looks fantastic. Looking forward to v2.

1 Like

I can’t stop thinking about this, as schema drift is quite an issue for me. But as my situation is more working towards pretty big/complex schemas:

Can we have this as a transform please??? Check the data set against a given schema?

Would be just so great.

You can do that by using Stack transform.

  1. Have a clipboard input with only header i.e your schema like column names
  2. Input your data file, have a Sample transform and set only for few records (even 1 record is enough)
  3. Stack them together using Stack transform
  4. Values for the same column header will fall into their respective columns
  5. Any column which does not match will get added to the right

This way you will know your schema drift.

As @Anonymous says, you can Stack columns by name. But the new schema versioning feature built into inputs should be more powerful and easier to use. Have you tried it?

yes, I know and tried it.
The new schema on import is so much more powerful and helpful! But as I said: I am more often working towards a given schema. The import schema does not help too much on this.

You could set Verify to fail if you don’t end up with the expected column names.