I would like to suggest the idea of encapsulating several (selected) sections of a transformation within one “Super Transformation”. This could be especially useful in big transformation streams where you want to consolidate sections of transforms and get a cleaner visual layout.
I have given this some thought already and I can see how it would be useful. It is on the ‘wishlist’ for v2.
What are the plans for v2, @Admin? Is this something you can share? Ideas / timeline?
Nothing is set in stone. But the big ones are things like:
- input and output with more data sources, such as:
- web APIs
- more file formats
- more transforms
- more options on existing transforms
- user interface improvements (such as grouping/folding transforms as discussed above)
- visualisation via charts and graphs
- others we are not ready to discuss at present
A subroutine transform would be neat.
Imagine a dozen input files, each representing a year’s worth of similar data. You want to apply the same chain of transforms to each file, then maybe do something unique per file. Finally, you stack them for your analysis.
If that chain of transforms in the middle could be defined once and chained into multiple data flows, that would be cool.
Just a thought.
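As a rough illustration of the idea (a hypothetical Python sketch, not EDT's actual behaviour): the shared chain is defined once, applied per file, followed by a per-file tweak, and the results are stacked.

```python
# Hypothetical sketch of a reusable "subroutine" transform chain.
# Function names and data shapes are illustrative, not part of EDT.

def shared_chain(rows):
    """The chain defined once: e.g. trim whitespace, drop blank rows."""
    rows = [[cell.strip() for cell in row] for row in rows]    # clean
    return [row for row in rows if any(cell for cell in row)]  # drop empties

def per_file_tweak(rows, year):
    """Something unique per file, e.g. tag each row with its year."""
    return [row + [str(year)] for row in rows]

# Several input files, one per year (shown here as in-memory data).
files = {
    2019: [[" a ", "1"], ["", ""]],
    2020: [[" b ", "2"]],
}

# Apply the same chain to each file, then stack the results.
stacked = []
for year, rows in files.items():
    stacked.extend(per_file_tweak(shared_chain(rows), year))

print(stacked)  # [['a', '1', '2019'], ['b', '2', '2020']]
```

The point is that `shared_chain` exists in exactly one place, so a later change to it automatically reaches every data flow that uses it.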
It would be useful to save a sequence of transforms as a custom named transform to apply in new processes instead of copying and pasting from another process.
You can do that already, assuming I understand you correctly.
If you have an input and a chain of transforms, you can create a duplicate by:
- selecting the input
- right clicking and selecting ‘duplicate branch’
- you can then change the file the input points to and any transform options in the new branch
Then you can add your Stack transform.
That point about saving the column-related options is really a big bottleneck. I am not trying to argue that there is an obvious solution, but I have to say (and it's obviously an unfair comparison) that you can do this in SPSS Modeler, for example (I have no technical details about how). That is, if you set column-related parameters in a given node, the options you set do not change even if you detach it from the upstream nodes. That's incredibly useful when you want to copy and/or fix anything. I would say this is the ONE setback for EDT in my particular workflow, even though there are ways to mitigate it.
I’m not sure how SPSS does it, but the only ways I can think of to preserve the column-related options when a transform is reattached are by column index (position) or by column name. For example, you could store the names of the columns set when a transform is disconnected and then try to match them up again when it is reconnected. It is definitely something to consider for v2.
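The name-first, index-fallback matching described above could look something like this (a speculative sketch, not how EDT or SPSS actually implements it):

```python
# Sketch of one possible reconnect-matching strategy: match each saved
# column name against the new input by name first, by position second.

def match_columns(saved, current):
    """Map each saved column name to an index in the current column list."""
    mapping = {}
    for i, name in enumerate(saved):
        if name in current:
            mapping[name] = current.index(name)  # match by name
        elif i < len(current):
            mapping[name] = i                    # fall back to position
    return mapping

# Saved options referenced "amount" and "date"; the new file renamed "date".
print(match_columns(["amount", "date"], ["amount", "timestamp"]))
# {'amount': 0, 'date': 1}
```

A mapping like this would then be shown to the user for confirmation rather than applied silently, since the positional fallback can easily guess wrong.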
Applying such a custom transform would either apply the transforms right away if the column names match OR show a popup asking which columns to match.
At any rate, it may be useful to have a popup just confirming that you really want to apply the transforms to the matching columns.
Yes, getting the user to confirm the mapping between old and new columns when they reconnect (defaulting to a name match, if possible, and an index match, if not) is probably the right way to do it. But it's not trivial to get working! Is that how SPSS does it, @Johnnycash?
Right, but if you adjust the common transform chain later you have to make the same change in multiple places.
You could also use the batch processing feature to apply the same set of transforms to multiple input files and append them all to one output file (rather than doing a Stack).
Another interesting possibility would be to specify multiple files using a wildcard in the input (e.g. “*.csv”) and have it effectively do the stack transform on all matching files before the first transform. That way you would only need one set of transforms. It is a possibility for v2.
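For what it's worth, the wildcard-input idea amounts to something like the following (a plain-Python stand-in for what the input step might do, not EDT code):

```python
# Sketch of the wildcard-input idea: stack every file matching "*.csv"
# into one table before any transforms run.
import csv
import glob
import os
import tempfile

# Create two small example files in a temporary folder.
tmp = tempfile.mkdtemp()
for year, rows in [(2019, [["a", "1"]]), (2020, [["b", "2"]])]:
    with open(os.path.join(tmp, f"{year}.csv"), "w", newline="") as f:
        csv.writer(f).writerows(rows)

# One "input" matching a wildcard, stacked into a single table.
stacked = []
for path in sorted(glob.glob(os.path.join(tmp, "*.csv"))):
    with open(path, newline="") as f:
        stacked.extend(csv.reader(f))

print(stacked)  # [['a', '1'], ['b', '2']]
```

With the files merged up front, a single transform chain covers all of them, at the cost of losing any per-file steps.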
True, and the way I’ve been handling this is either duplicate copies of a branch, if it is simple and unlikely to change, or batch processing to output files. That applies a single transform chain in parallel.
Another thing that would be nice is if you could restructure a fixed record file.
In my case, I’ve got one input file per year. Every record is 4096 characters. Position 2000-and-something is a flag indicating which of six types the record is, each of the six types having a different fixed-length layout.
I read the input, break it into three fields, make six filters, and write each filter out in text format. That way I get the 4096 character records sorted into the six types they belong in.
Those six files are where I get my input for other transform chains.
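The flag-based split described above can be sketched in a few lines of Python (illustrative only; the record length and flag position are shortened here for readability, whereas the real file uses 4096-character records with the flag near position 2000):

```python
# Sketch of the workflow above: fixed-length records carry a type flag
# at a known position, and get sorted into one bucket per type.

RECORD_LEN = 16
FLAG_POS = 8  # 0-based position of the single-character type flag

records = [
    "AAAA....1.......",
    "BBBB....2.......",
    "CCCC....1.......",
]

buckets = {}
for rec in records:
    assert len(rec) == RECORD_LEN      # every record is fixed-length
    buckets.setdefault(rec[FLAG_POS], []).append(rec)

# Each bucket can now be written out as text and re-read as a
# fixed-length file with its own column layout.
print(sorted(buckets))    # ['1', '2']
print(len(buckets["1"]))  # 2
```

Writing each bucket back out as plain text is what drops the original column boundaries, so the six files can each be re-read with their own layout.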
But please take none of that as complaint. I can slice, dice, and skewer our local tax collector with what I’m learning with EDT.
That’s not just a good thing, it’s kind of fun!
I’m not sure how we could improve on that.
I suppose, in theory, you could tell Easy Data Transform to create N filters from the N values in a column. But it would be a one-shot deal, as it would be completely impractical for Easy Data Transform to create and destroy filter transforms dynamically as the column values change. So I'm not sure how useful it would be.
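The one-shot "N filters from N values" idea is essentially a partition of the rows by the distinct values in one column, something like this hypothetical sketch (not an actual EDT feature):

```python
# Sketch of "create N filters from the N values in a column":
# partition rows into groups keyed by the value in the chosen column.

def partition_by_column(rows, col):
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row)
    return groups

rows = [["x", "red"], ["y", "blue"], ["z", "red"]]
groups = partition_by_column(rows, 1)
print(sorted(groups))  # ['blue', 'red']
print(groups["red"])   # [['x', 'red'], ['z', 'red']]
```

This captures why it would be a one-shot operation: the set of groups is fixed by whatever values happen to be in the column at the moment it runs.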
It’s not a big deal and probably very special purpose. In my case, I’m writing fixed length records as text to drop the layout information, then I’m reading them as fixed length records with new column boundaries.
Probably very rare. It’s needed in this case because the rows don’t all have the same field layout. Which is crazy.
Having multiple different fixed length record formats in the same file is a bit bizarre.