How to Specify Target Output Schema

My company uses EDT to migrate our customers' data into our database as part of the onboarding process. For each new customer the data source is different, often just an Excel file they created themselves. These are one-off migrations: once a customer is migrated to the new database, we effectively throw the migration away, because the next customer's input will look completely different.

The target output, however, is fixed. It needs to be a defined set of CSV files with an exact definition of columns and data types; effectively, the output maps directly to the tables and columns in our database. These files are then loaded into the database with a load script. Currently this process is brittle, because our EDT developers have "too much freedom". They tend to work "from input to output" and end up with outputs that don't match the specs.

What I'd like to do: somehow specify/validate the output schema in EDT and force developers to transform the input into that specific schema.

In other words: I want to specify the desired tables and columns (incl. data types, null/not null) as the last (or second-to-last) validation step, just before the CSVs are produced. Developers would use these specs as a template and build their transformations to feed into these final, predefined schema validations.
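For concreteness, this is roughly the kind of spec I have in mind, sketched here as a small Python check we could run over the finished CSVs today. The table and column names below are invented examples (our real spec lives in the database DDL); something equivalent built into EDT as a final validation step is what I'm after.

```python
import csv

# Invented example: the target file "customer.csv" must have exactly these columns,
# in this order, with these nullability rules.
CUSTOMER_SPEC = [
    ("customer_id", "integer", False),   # NOT NULL
    ("signup_date", "date",    True),    # nulls allowed
    ("first_name",  "text",    False),   # NOT NULL
]

def check_header_and_nulls(path, spec):
    """Verify column names/order and NOT NULL rules for one output CSV."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        expected = [name for name, _, _ in spec]
        if header != expected:
            return [f"header mismatch: expected {expected}, got {header}"]
        problems = []
        for line_no, row in enumerate(reader, start=2):
            for value, (name, _, nullable) in zip(row, spec):
                if value == "" and not nullable:
                    problems.append(f"line {line_no}: '{name}' must not be null")
        return problems
```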

Is something like that possible with EDT? Or alternatively, what do you use to make sure that the output meets the specifications of the target?

Note: I'm not using EDT myself. Apologies if the question is too generic or obvious. I'm the database guy who ends up dealing with the "incompatible" files, so I'm just looking for hints and tips on how to improve our process and what tools EDT might offer to address this issue.

Read in a fixed file with the output format (just the headers) in the required sequence, then use the Stack transform with this header on top and the read-in file on the bottom. This only works if the header names are identical. It will sort the input file into the right sequence and drop all fields that are not required.
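Outside EDT, the effect of that trick is roughly the following (the file names here are just placeholders): the template header decides the column order, anything not named in it is dropped, and the match is by exact header name only.

```python
import csv

# "template.csv" contains only the target header row, in the required sequence.
with open("template.csv", newline="") as f:
    target_header = next(csv.reader(f))

with open("customer_input.csv", newline="") as src, \
     open("customer_output.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=target_header)
    writer.writeheader()
    for row in reader:
        # Keep only the template columns, in template order; unknown input columns
        # are dropped, and columns missing from the input come out empty.
        writer.writerow({name: row.get(name, "") for name in target_header})
```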

One of the things that is high on the wishlist for v2 is to be able to validate the structure of a dataset.

e.g.
-has 10 columns
-first column is called 'Date' and contains only valid dates or nulls
-second column is called 'ID' and contains 10-digit numeric IDs which are all unique
-third column is called 'First name' and contains no nulls
-fourth column matches a regular expression
etc

This will probably be an additional transform, with the option not to pass any data on if a rule is not met.

Any rules that are not met will generate a helpful warning.

It should be possible to place it anywhere in the data flow, e.g. after an input or before an output.

But we are still working on the design.
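To give a flavour of what those rules could mean in practice, here is a rough sketch of the same checks scripted in Python/pandas. This is purely illustrative, not how the transform will actually be implemented, and it assumes the first three columns carry the example names listed above.

```python
import pandas as pd

def check_structure(path):
    """Illustrative checks mirroring the example rules above; returns a list of warnings."""
    df = pd.read_csv(path, dtype=str)   # read every column as text first
    problems = []

    # has 10 columns
    if len(df.columns) != 10:
        problems.append(f"expected 10 columns, found {len(df.columns)}")

    # the per-column checks below assume the first three columns carry these names
    expected = ["Date", "ID", "First name"]
    if list(df.columns[:3]) != expected:
        problems.append(f"first three columns should be {expected}, got {list(df.columns[:3])}")
        return problems

    # 'Date' contains only valid dates or nulls
    if pd.to_datetime(df["Date"].dropna(), errors="coerce").isna().any():
        problems.append("'Date' contains values that are not valid dates")

    # 'ID' contains 10-digit numeric IDs which are all unique
    if not df["ID"].str.fullmatch(r"\d{10}", na=False).all():
        problems.append("'ID' contains values that are not 10-digit numbers")
    if not df["ID"].is_unique:
        problems.append("'ID' contains duplicate values")

    # 'First name' contains no nulls
    if df["First name"].isna().any():
        problems.append("'First name' contains nulls")

    return problems
```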


Hi @DHofmann, and welcome!

It would be nice if you could provide a sample input and expected output, so that we can suggest solutions.