Processing multiple files and concatenating the results

I often need to evaluate sets of files and collect the results in a “collection” file, which then needs to be evaluated further.
Currently I have to build a transform script that evaluates the source files in a batch (or via an external script) and writes the results in append mode to the “collection” file. Before running it, I have to delete the old file, otherwise it would append to the already existing one. Then a second transform script has to be built to process the “collection” file. I create such flows a lot and script the activities, the necessary file deletions, etc., and it works fine.

It would be good to have the possibility to mark and run only part of an EDT script as a batch, collecting the outcome in an append table within the script. Further steps would then be done on this table (these steps should only execute once the batch part is finished). This would avoid the interim file and the file-deletion step, and some of the handling would be more intuitive.

For sure this would have a major design impact, if it is possible at all.

You can at least do that currently by connecting the output of your various transforms to a Stack transform (I am assuming here that all the outputs have the same structure). It will collect all your outputs in the order you make the connections, then you can do further transformations on the output of the Stack transform as required and finally save the result.

@Anonymous as I see it, your suggestion would only work with a fixed number of input files, and the EDT script would need a lot of duplicated branches. If anything needs adapting it would be a nightmare, and in my case the number of input files can change every other day. Therefore it is better to run in batch mode with a wildcard for the input file names, or in a terminal script that also uses a wildcard for the file names.

I will have to think about whether there is a simple way to do this. A lot of people already find the batch processing feature complicated.

Perhaps batch processing could have the option to run a selected .transform file once the batch processing is complete? This could then process the output file created by batch processing.

But probably the best way to handle the above is to write a script that calls Easy Data Transform via the command line. You can loop over all files matching a wildcard in a script on Windows or Mac. So:

delete appended.csv
foreach file matching *.csv in folder
   call .transform file which appends to appended.csv
call .transform file which processes appended.csv to final.csv

Some examples of script loops in:
Windows: Reference > Command line arguments
Mac: Reference > Command line arguments
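
On a Mac, for example, the pseudocode above might translate into something like the following minimal sketch. Here ‘append.transform’ and ‘process.transform’ are hypothetical names for the two .transform files, and In and Out are assumed to be the aliases of their input and output (in practice you would keep appended.csv outside the input folder):

#!/bin/bash
# Delete the collection file first, otherwise the appending
# transform would add to the output of the previous run.
rm -f appended.csv
# Run the appending transform once per matching input file.
for f in *.csv; do
  /Applications/EasyDataTransform.app/Contents/MacOS/EasyDataTransform \
    append.transform -file "In=$f" -file Out=appended.csv -cli
done
# Then run the second transform over the collected file.
/Applications/EasyDataTransform.app/Contents/MacOS/EasyDataTransform \
  process.transform -file In=appended.csv -file Out=final.csv -cli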

In v2 I would like to make this easier by optionally allowing wildcards (e.g. *.csv) to be supplied to an input instead of a fixed file name; it would then do the stack automatically.

As stated, I script it. I like the flexibility available with the batch processing and the scripting. Here is one of those scripts:

cd /_____/TMP_Reporting
rm Total_TC_Exec.xlsx
cd All
for f in *.xlsx; do
  /Applications/EasyDataTransform.app/Contents/MacOS/EasyDataTransform \
    '/_____/TMP_Reporting/01_Combine_Execution_templates.transform' \
    -file 'HeaderConcat_w_Meta=/_____/TMP_Reporting/HeaderConcat_w_Meta.xlsx[Tabelle1]' \
    -file 'HeaderConcat=/_____/TMP_Reporting/HeaderConcat.xlsx[Tabelle1]' \
    -file "In=$f" \
    -file Out=/_____/TMP_Reporting/Total_TC_Exec.xlsx \
    -cli -verbose
done

cd ..

'/Applications/EasyDataTransform.app/Contents/MacOS/EasyDataTransform' \
  '/_____/TMP_Reporting/02_TC_execution_evaluation.transform' \
  -file 'TC Execution=/_____/TMP_Reporting/Total_TC_Exec.xlsx[Easy Data Transform]' \
  -file 'Header=/_____/TMP_Reporting/Header.xlsx[Tabelle1]' \
  -file 'Report=/_____/TMP_Reporting/Report.xlsx' \
  -cli -verbose

And I like the power of it. The for f in *.xlsx loop processes more than 250 files (and the number of files has changed over time). One disadvantage of the scripting is that it now starts and closes EDT more than 200 times; while the script runs you can take a short coffee break :wink:

I even put these scripts into Keyboard Maestro flows on my Mac, which afterwards open the result file and do further activities.

So I was just asking whether there is a possibility to get it done in one transformation. By the way, following the other discussions about resource use, I will try to change the interim file from Excel to another format.

The command line starts the GUI program each time, which was a bit of a kludge to save development time. In v2 I hope to have a command line program separate from the GUI program.

I think supporting wildcards for the input file, without having to use the command line or batch processing, would help here.

Reading and writing Excel files is slow. Probably the fastest format we support is delimited text without quoting, which is faster to parse. E.g. .tsv (tab separated values) where:

  • delimiters are tabs
  • no tabs allowed in data so there is no need for quoting

You can use the Whitespace transform, checking the option that replaces tabs with spaces, to ensure there are no tabs in the data.
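
Applied to the earlier script, the change would look something like the sketch below. I am assuming here that the .transform’s output has first been re-saved to write tab-separated text, with the Out alias then just pointing it at the new file name:

rm -f /_____/TMP_Reporting/Total_TC_Exec.tsv
cd All
for f in *.xlsx; do
  # other -file arguments (HeaderConcat etc.) as in the original script
  /Applications/EasyDataTransform.app/Contents/MacOS/EasyDataTransform \
    '/_____/TMP_Reporting/01_Combine_Execution_templates.transform' \
    -file "In=$f" \
    -file Out=/_____/TMP_Reporting/Total_TC_Exec.tsv \
    -cli -verbose
done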

I am not sure exactly how much faster it is without quoting. I’ll try running some tests if I get a chance.

One question on the loop:

In batch mode I can run it for a defined sheet in the input Excel file. If I use the terminal script with the loop, I can’t make it work when the sheet to process is not the first one. I didn’t find a way to get the [Sheet_X] into the “f” term. Do you have any idea?

Until now I have run such cases individually in batch mode.

This should work:

-file <myinput>=file1.xlsx[Sheet_X]

Or to get the 3rd sheet:

-file <myinput>=file1.xlsx[3]

You can’t pass a wildcard here. And you can’t automatically loop over all the sheets in your script unless the operating system can somehow work out the sheet names.
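
So in the loop from your earlier script, one option is to append the sheet reference when expanding the loop variable. A minimal sketch, assuming the sheet you want is always named Tabelle2 (substitute your real sheet name, or a [3]-style index):

for f in *.xlsx; do
  # appends the sheet reference to each expanded file name
  /Applications/EasyDataTransform.app/Contents/MacOS/EasyDataTransform \
    '/_____/TMP_Reporting/01_Combine_Execution_templates.transform' \
    -file "In=${f}[Tabelle2]" \
    -file Out=/_____/TMP_Reporting/Total_TC_Exec.xlsx \
    -cli -verbose
done

This only helps if every file uses the same sheet name or the same sheet position.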

I ran some tests and Excel is really slow compared to delimited text.

I took “script” to mean the transform file, and that is why I suggested Stack for the append table. That append-table requirement could be solved by stacking all the outputs before doing further work within the transform.

I have not used the batch processing capability of EDT and, to be honest, it is confusing for me. I tried it once but just could not get a handle on it, so I left it alone. So far all my needs have been met within the transform, so I never invested much time in it.

The batch processing feature uses your .transform as a template and applies it to multiple files/Excel sheets. You use the input and output aliases to refer to the files in your ‘template’. It ended up a bit more complex than I would ideally like, but it has to handle a lot of different scenarios.

There is a video here:

This is where I am at a loss. I did see that video and read the documentation on it, but for some reason it is lost on me.

I do see in posts how useful it is and how it has made life easier for people, but for me it is rocket science :slightly_smiling_face:

Rocket science is easy. It’s the rocket engineering that’s hard. ;0)

Let me rephrase: watching the video for batch processing is easy; it is the batch engineering with multiple inputs and multiple outputs that is hard for me :slight_smile:


If you are using a shell script and have a folder of CSV files you want to concatenate, I have used another CLI program called Miller - https://miller.readthedocs.io.
So if it is too cumbersome to make EDT do the concatenation, you could look in this direction.
EDT could then convert the concatenated CSV to Excel or do other manipulation afterwards.

Here is a command I have used to concatenate a folder of CSVs:

mlr --csv --fs semicolon unsparsify *.csv >out.csv
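
The concatenated file could then be handed straight to EDT in the same script. A sketch, where ‘convert.transform’ is a hypothetical transform whose In alias reads the CSV and whose Out alias writes the Excel file:

# note: on re-runs, keep out.csv outside the input folder so *.csv doesn't pick it up
mlr --csv --fs semicolon unsparsify *.csv > out.csv
/Applications/EasyDataTransform.app/Contents/MacOS/EasyDataTransform \
  convert.transform -file In=out.csv -file Out=out.xlsx -cli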