Joining files with cases that are identified by two columns

Dear All,

I’m a long time owner but first time user of EDT. (mostly solved problems with SPSS or files not complicated / large enough to feel I needed to put it into practice)

No I think I should have learned earlier as now I really need it. I have a research file that has cases labeled i.e. case 1 / case 2 in the identifier column. file is in “long format” as each case has multiple rows that correspond with timepoints column. I.e. time point 1 / 2 / and so forth.

There are multiple files that have different measurements information. and even though one file has 3 time points for a case others might have 4, some have only one. Therefore you can’t just sort the files and paste them together.

i.e. data looks like

case ID. timepoints. variable A
case 1 time 1 45
case 1 time 2 39
case 2 time 1 45

and in file two
case ID. timepoints. variable B
case 1 time 2 X
case 1 time 3 Y
case 2 time 1 C

What I would like to do is merge the info in file two to file one if the case and timepoints match. That means here that Case one time 3 from file two would remain unmatched (or ideally I would create a new tow in file one including that date and the “new” timepoints.) and that case 1, timepoint 2 would have a new variable column introduced containing variable B.

is this possible, that is matching information on two conditions (case ID and timepoint)?

many thanks for your time in advance,
doc_c

Given those 2 inputs, what would you expect the output to look like?

Are the inputs plain text with a single value per row? Or do the spaces above imply columns breaks?

is this possible, that is matching information on two conditions (case ID and timepoint)?

Yes, just combine them into a new column with a delimiter and use that as the Join key.

Hi and thanks for the swift reply.
The spaces imply column breaks i.e. each of those in separate columns. Hope I’m expressing myself correctly. Each entry starting with a case is a row.

This is what it should end up as output.

Column A Column B Column C Column D
case ID timepoints variable A variable B
case 1 time 1 45 no value
case 1 time 2 39 X
case 1 time 3 no value Y
case 2 time 1 45 C

So I would need a logic routine somewhat like: check if case ID and timepoint column matches, if yes, then insert value for variable B. If no match for case ID and timepoints insert new row with variable B anyways leaving variable A blank (as there is no matching case in the original file there wouldn’t be a value for A).

Hope this makes sense!

Best wishes and thanks

You can do it like this:

image

two-key-join.transform (6.1 KB)

If the row order matters, you will need to add a Sort.

1 Like

Cool. I’ll try this out.
for my educational benefit, why is there a remove cols? can I just import the workflow and adjust the terms to match the actual names in the files? can I reverse engineer what you have to learn how to do this myself? sorry for being a complete noob. I’ve started to watch the videos and normally I’d tinker myself for a while but have some timescale this needs to be done by unfortunately. Your help in this is much appreciated!

if you don’t do it, you need to do it later and you have unnecessary data in the further transaction, If you try it without the remove you need to do the split anyhow otherwise you have other problem:

The duplicated columns case ID and timepoints are inconsistent to each other when keeping them.

1 Like

I think I understood this while going over the file you send. You merge the case ID and timepoints to create one single column that you can then easily merge and then split the unified column to get back to the original format. Do I get this? You can then expand this to any amount of variables you want to include in the merger, right? Genius. Thank you so much.

The original columns are surplus to requirements once you have done the Concat Cols. You could get rid of them later on to reduce the number of transforms.

That is correct.

last question, can I use your template and just plunge my files into it and adapt the sample process to the current names or should I start from scratch?

Currently you can’t replace Clipboard input with file input without loosing the downstream column settings (something we hope to fix in future). So you will need to:
-delete the clipboard inputs
-drag on your files
-reconnect the connections ( Reference > Connections )
-reset the column related parameters
-probably add an output

1 Like