How validate / verify the CSV template did not change (column names)

itoldusoandso · January 16, 2023, 12:57am

I am wondering if Easy Data Transform could help me to verify if the CSV file columns headers are what they are expected to be. I download a CSV file regularly. The company producing the CSV is notorious for making unannounced changes to the CSV file without giving any sort of warning in advance or after the matter.

I am less concerned about the columns moving around and more concerned about the names of columns (fields) change. If columns name change, the values will likely change as well. So without being warned about it my import would import the CSV without importing the changed columns properly. I use Airtable to import the CSVs. The import in Airtable works well for CSVs if they remain same. If column moves around, it’s not an issue, but if the column header name changes, the import would not stop but would continue and would simply skip the columns that are missing and will not import any changed columns names.

I don’t think I can expect Airtable to add any CSV checking process because that is really not Airtable problem and they are a database company, so having an import tool for CSV, that is what they do have is all they would provide. Making sure the CSV is correct, would need to be done before that.

So I am thinking Easy Data Transform could be a great tool to do that since I can see it already does so many things.

I looked at the Javascript option and I don’t know if this is something that could do that, or is there anything else I should take a look at.

From what I read I may be able to take my original CSV template (empty) file and then either compare that against the newly downloaded CSV (after removing data) and then use e.g. Araxis Merge app to compare the files. Or I could just use EDT to compare the column names, but it requires quite a long javascript statement.

All I would be happy if I could create somehow a CRC value or something of all column names in the CSV file (or even count the total number of characters would suffice) and then compare it with a control number (which is what I would know my original template file has).

I am not a programmer so I am pretty much improvising here. Maybe a solution to what I am looking is pretty simple, which would be great. I am all ear.

I see here this page has description about Hash check but it seems only capable to handle values but now the column names?

Anonymous · January 16, 2023, 2:12pm

Hi,

You can use Stack to check for the columns that have name change or spelling issues or new columns

For Example:

These are your orignal columns
col1,col2,col3,col4,col5,col6
v1,v2,v3,v4,v5,v6

Now in the new csv
col 1,cola,colb,col2,col3,col4,col5,col6
v1,va,vb,v2,v3,v4,v5,v6

Transform file
MyTransform.transform (2.1 KB)

Anonymous · January 16, 2023, 7:15pm

Forgot to mention that, the output of stack will show the original column blank, if it is not in the new csv, otherwise it will show the values under the column. Also any new (or old with new spelling or with extra spaces or whatever) will show after the last original column, this way you will know which are the new columns and then can make adjustment as necessary.

As a suggestion, you can have only the column headers for the original columns and then for the new csv, you can use Sample Tranform to get only 1 or few records and then use Stack Transform and you will know which ever original columns have value, that is unchanged and whichever are blank will mean, either they are not present or have different name or spelling, which you will find after the original columns.

Admin · January 16, 2023, 10:17pm

As @Anonymous says, you can do a Stack transform with Align columns by set to Header name. If the column names change, you will then get additional columns.

We have some ideas to handle this better in v2 without using Stack by storing the expected schema (column names and order) in an input and rearranging columns or throwing an error if the schema doesn’t match the expected one. Perhaps we will add an option to send some form of email or notification if things go wrong. Watch this space.

See also:

itoldusoandso · March 8, 2023, 9:24pm

Thanks that is good and will check the Stacks option that could work find for me right now.

itoldusoandso · January 10, 2024, 7:31am

I solved this a bit further, I have separated 2 flows, one verification control flow and the actual the CSV main flow filtered out the resulting stacked table and transposed it, then filtered out again and I only have now the rows showing pairs of headers not matching. I added a condition column to fill in with a larger block of text if there is a mismatch.

Now the output files are two, one is checking results and the other is normal processed file. The control checking results file is either 7 bits or so large if it is empty, if there are mismatches, the file size will have like 500 bits or more…

It’s a small visual trick, so I don’t need to check the flow actually can see on the file size of the control file.

It’s a crazy work-around but pretty nifty. Saves time.

Admin · January 10, 2024, 8:33am

Thanks for sharing that. We will be looking into making this sort of ‘schema validation’ much easier in v2.

DHofmann · March 5, 2024, 10:21am

We will be looking into making this sort of ‘schema validation’ much easier in v2

My company would benefit greatly from this feature. Do you know the roadmap for this? Also, please make sure to add this validation feature for outputs too. For my use case, we need to produce a specified output from varying inputs.

Admin · March 5, 2024, 10:43am

It is high on the wishlist for v2 and we have started v2 development. We might have something you can try in a few months. But we don’t make promises on future releases, as a matter of policy. It only leads to tears!

itoldusoandso · July 6, 2024, 2:46am

Here is the flow chart:

Basically my validation output is this:

The fact it says 2=3 is irrelevant, this is the remnants of the manipulation done to minimize the size of the validation file to minimum so it is easy to spot if the results of the validation are as expected or not.

If the validation went through, the file size is around 16 bits. So I don’t even need to open it because if the file is so small then there are no issue.
If there were some issue in the validation, then the output file would contain much more text, like the list of columns names that have changed or shifted, and the output file would be bigger and I would know there is something wrong.

Here is 16 bits so that tells me the file is validated.

Also - not necessary related but somewhat related to this topic:
If the CSV changes, I don’t want EDT to capitulate on me. That’s what it is doing now. If Ebay for instnace changes the Transaction report (they do it every couple of week, it’s crazy and they don’t tell anybody about it…!!.. when they change it and the CSV is loaded, now validation catches the error and I check the file, fix the template and fix the flow.

But I need to go through a circus of opening 2 separate windows of EDT, in one window load the old version of the flow with the old version of the input file and the old version of the template and then need open second window with the new version of the input file and new version of the template.

I need this because otherwise how can I know what I need to fix. My flow is long and I don’t even sometimes remember what is the reason there are some of the commands there so every time I open it I am going through it and I end up making a small adjustment and it works again.

Usually the issue is e.g. Totals or Calculate step takes the wrong input column and puts the output into the wrong column.

This is because e.g. the number of columns has changed in the new input file compared to the old version. There may be a new column or column missing etc.

Often, the column added or missing are non-essential. However, because the total count of columns changes in the CSV, the whole flow breaks.

The reason it breaks is because the way EDT works, it links the configuration to a specific column by column letter code or column number… This topic was beaten up badly already a few times here and I had insisted myself this is an unfortunate, but very reasonable and necessary choice.

HOWEVER - there is a better way doing this. My suggestion is you provide the user the choice whether the configuration is e.g. ABSOLUTE or RELATIVE. Relative link will be based on the column code/number (as it is now) and the ABSOLUTE link will be based on the column NAME.

That’s easy. If my flow contains the column names, I should be able to address the configuration of the steps based on column names, if I choose to do so.

This would be of tremendous help because this would ensure the flow doesn’t break apart if Ebay adds some new field (which I don’t care about anyhow) or removes one field I don’t even use.

Now, since you would have validation feature in your product in the future, you can easily validate if the field names are present in the CSV file. You can warn user if they are missing or perhaps if there are 2 columns with the same name and warn the user the flow will assume the 1st column with the matching name is the one that is being linked in the command configuration…

Admin · July 6, 2024, 3:49am

Do the column names stay the same? if so, can’t you use a Stack by header name to keep them in the same order?