My company uses EDT to migrate our customers' data into our database as part of the onboarding process. The data source is different for each new customer, often just an Excel file they created themselves. These are one-off migrations: once a customer is migrated to the new database, we effectively throw the migration away, because the next customer's input will look completely different.
The target output, however, is fixed: a defined set of CSV files with exactly specified columns and data types. Effectively, the output corresponds directly to the tables and columns in our database. These files are then loaded into the database with a load script. Currently, this process is brittle because our EDT developers have "too much freedom". They tend to work "from input to output" and end up with outputs that don't match the specs.
What I'd like to do: somehow specify/validate the output schema in EDT and force developers to transform the input into that specific schema.
In other words, I want to specify the desired tables and columns (incl. data types, null/not null) as the last (or second-to-last) validation steps, just before the CSVs are produced. Developers would use these specs as a template and build their transformations to feed into these final, predefined schema validations.
Is something like that possible with EDT? Alternatively, what do you use to make sure that the output meets the specifications of the target?
Note: I'm not using EDT myself, so apologies if the question is too generic or obvious. I'm the database guy who ends up dealing with the "incompatible" files, so I'm just looking for hints and tips on how to improve our process and what tools EDT might offer to address this issue.
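To make it concrete, here is a rough sketch of the kind of final check I have in mind. The column names are made up and it's plain Python (what I know from the database side), not anything EDT-specific:

```python
# Sketch of the "last step" schema check I'd like to enforce before loading.
# Column names and formats below are hypothetical examples.
import csv
from datetime import datetime

# expected schema: column name -> (parser, nullable)
CUSTOMER_SCHEMA = {
    "customer_id": (int, False),
    "first_name": (str, False),
    "signup_date": (lambda s: datetime.strptime(s, "%Y-%m-%d"), True),
}

def validate_csv(path, schema):
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # column names and order must match exactly
        if reader.fieldnames != list(schema):
            problems.append(f"header mismatch: {reader.fieldnames} != {list(schema)}")
            return problems
        for line_no, row in enumerate(reader, start=2):
            for col, (parse, nullable) in schema.items():
                value = row[col]
                if value == "":
                    if not nullable:
                        problems.append(f"line {line_no}: {col} must not be null")
                    continue
                try:
                    parse(value)  # type/format check
                except ValueError:
                    problems.append(f"line {line_no}: {col} has bad value {value!r}")
    return problems

print(validate_csv("customer.csv", CUSTOMER_SCHEMA))
```

Ideally the same rules would live inside the EDT project itself, so the developers see the failures before the files ever reach me.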
Read in a fixed file with the output format (just the headers) in the required sequence, then use the Stack transform with this header file as the top input and the read-in data file as the bottom input. This only works if the header names are identical. It will sort the input file's columns into the required sequence and drop all fields that are not required.
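If you want to prototype the same header-template idea outside EDT, it looks roughly like this (a sketch assuming pandas; the file names are placeholders):

```python
# Use a header-only template file to force column order and drop anything
# not in the template (pandas assumed; file names are placeholders).
import pandas as pd

template_cols = pd.read_csv("output_template.csv", nrows=0).columns  # headers only
data = pd.read_csv("customer_input.csv")

# reindex keeps the template order, drops extra columns, and inserts missing
# columns as empty so a mismatch is visible rather than silent
aligned = data.reindex(columns=template_cols)
aligned.to_csv("customer_output.csv", index=False)
```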
One of the things that is high on the wishlist for v2 is to be able to validate the structure of a dataset.
e.g.
-has 10 columns
-first column is called "Date" and contains only valid dates or nulls
-second column is called "ID" and contains 10-digit numeric IDs which are all unique
-third column is called "First name" and contains no nulls
-fourth column matches a regular expression
etc
This will probably be an additional transform, with the option not to pass any data through if a rule is not met.
Any rule that is not met will generate a helpful warning.
It should be possible to place it anywhere in the data flow, e.g. after an input or before an output.
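As a rough illustration of the kind of rules listed above (Python purely to show the idea, not how the EDT transform will actually be implemented; the date format and fourth-column pattern are assumptions):

```python
# Illustrative sketch of the example rules: column count, column names,
# dates-or-nulls, unique 10 digit IDs, no nulls, regex match.
import csv
import re
from datetime import datetime

def verify(path):
    warnings = []
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]

    if len(header) != 10:
        warnings.append(f"expected 10 columns, found {len(header)}")
    if header[:3] != ["Date", "ID", "First name"]:
        warnings.append(f"unexpected column names: {header[:3]}")

    seen_ids = set()
    for line, row in enumerate(body, start=2):
        if len(row) != len(header):
            warnings.append(f"line {line}: wrong number of fields")
            continue
        date, id_, first_name, code = row[0], row[1], row[2], row[3]
        if date:  # nulls (empty values) are allowed for Date
            try:
                datetime.strptime(date, "%Y-%m-%d")  # assumes ISO dates
            except ValueError:
                warnings.append(f"line {line}: invalid Date {date!r}")
        if not re.fullmatch(r"\d{10}", id_):
            warnings.append(f"line {line}: ID is not a 10 digit number: {id_!r}")
        elif id_ in seen_ids:
            warnings.append(f"line {line}: duplicate ID {id_!r}")
        else:
            seen_ids.add(id_)
        if not first_name:
            warnings.append(f"line {line}: First name is null")
        if not re.fullmatch(r"[A-Z]{2}-\d{4}", code):  # example pattern only
            warnings.append(f"line {line}: column 4 does not match pattern: {code!r}")
    return warnings
```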
@DHofmann
Note that we are now starting to design a data verification feature for v2 (this is separate from the input schema). It will allow you to verify that column names, types, values, etc. are as expected.