Missing Values after concatenating 1318 attributes

Hi,

I am a bit confused. I concatenate around 1318 values, because this is a large dataset that I have to compare to yesterday’s values. To do this, I create a hash value for today and compare it to the hash value from yesterday. If the hash is different, we set the attribute “data_change” to true.

This works, but at some point new attributes come in from the source system, and then the last values are blank after concatenation, even though the attributes have values.
I assume some sort of cache interferes here, because if I have 8 new values, they aren’t always added to the end of the CSV, but somewhere in between.

When I get 10 new attributes, I’d have 1328 values, and then, although the source values are shown correctly, a couple of them go wrong in the export: not missing as such, but empty. So if the changes apply to the new attributes, the hash value is the same for today and yesterday, and the value changes aren’t detected.

I gave it a shot and created a new concatenation, and guess what, I got all values.

I have 6 transformations running, each with around 1,400 attributes, and I would like to avoid creating a new transformation whenever new attributes come in. What’s the way to do it?

Any idea how to get around this?

Best

I am struggling to understand what the issue is. Can you show a simplified example of what you are doing?

I have a couple of attributes like

Article number, product name, product short text, length, height, width, weight, price UK, price US, price Europe, price Switzerland, tariff code.

Today we have a price change, so when we compare the data to yesterday’s data, the column product_change will be true, and then today’s data, being a delta, will be posted to the website. To achieve this, I concatenate all attributes into one field, with a semicolon as the delimiter. Then I create a hash of that field and compare it to yesterday’s hash to figure out whether any field has changed, instead of hashing each attribute and comparing those. With 1318 attributes that wouldn’t work…

Tia and best
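For readers who want to see the idea outside the tool, here is a minimal Python sketch of the concatenate-then-hash comparison described above. It is not how the tool implements it; the column names, the sample values and the `row_hash` helper are all made up for illustration, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib

def row_hash(row, columns, delimiter=";"):
    """Concatenate the named columns in a fixed order and hash the result."""
    # Look values up by column name, so the position of a column in the
    # source does not matter. A missing column becomes an empty string.
    joined = delimiter.join(str(row.get(col, "")) for col in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Hypothetical column list and data (assumptions, not from the thread).
columns = ["article_number", "product_name", "price_uk", "price_us"]
yesterday = {"article_number": "A100", "product_name": "Widget",
             "price_uk": "9.99", "price_us": "12.49"}
today = dict(yesterday, price_uk="10.49")  # simulate a price change

product_change = row_hash(today, columns) != row_hash(yesterday, columns)
print(product_change)
```

One thing to watch in any variant of this: if a value can itself contain the delimiter, two different rows can produce the same concatenated string.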

In this simple example I concat 1500 column values and compute a hash.

concat-hash.transform (26.7 KB)

What is it that doesn’t work?

Is the issue that the number of columns is changing?

If I concatenate the attributes, the content of the attributes is not identical…

If you add a couple of attributes to your series, not at the end but somewhere in between, you will miss the content of some of them after concatenating.

Let’s say you have 10 attributes with this content:

A, 1, B, 2, C, 3, D, 4, E, 5

Now I place some other values randomly in this list:

A, 1, k, B, v, 2, C, 3, D, Y, 4, G, E, 5, 9

If I do the concatenation in the same box, by just adding those values, and open a new box with the hashing feature, then I will miss content at the end of the row. Instead of ending with 4, G, E, 5, 9 we will only have A, 1, k, B, v, 2, C, 3, D, Y, , , ..

Not the number of columns, but the content.

I’m still a bit vague on what you are trying to do. You are trying to work out if any values in named columns have changed between 2 datasets, but the 2 datasets might have different numbers of columns?

You could set up a schema for the input and tell it to return an error if the columns are different to expected.
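As a rough illustration of what such a schema check does (this is a plain Python sketch, not the tool’s schema feature; `check_columns` and the column names are hypothetical):

```python
def check_columns(actual, expected):
    """Raise ValueError if the incoming header differs from the expected one."""
    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    if missing or unexpected:
        raise ValueError(
            f"missing columns: {missing}; unexpected columns: {unexpected}"
        )
    if list(actual) != list(expected):
        raise ValueError("same columns, but in a different order")

# Hypothetical expected header (an assumption for this example).
expected = ["article_number", "product_name", "price_uk"]
check_columns(["article_number", "product_name", "price_uk"], expected)  # passes

try:
    # A new attribute has arrived in the middle of the header.
    check_columns(["article_number", "tariff_code", "product_name", "price_uk"],
                  expected)
except ValueError as err:
    print(err)
```

Failing loudly like this means a new attribute gets noticed on the day it arrives, instead of quietly producing blank values in the concatenation.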

Or you could create a dataset with just a header that is more than 1318 columns (say 1400) and Stack your dataset under that (by column name or index) then Concat Cols those 1400 cols.
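The fixed-header idea can be sketched in Python like this, stacking by column name. Everything here is an assumption for illustration: the `align_to_header` helper, the spare column names and the sample row are not from the thread.

```python
def align_to_header(rows, header):
    """Lay each row out in the fixed header order, padding gaps with ''."""
    return [[str(row.get(col, "")) for col in header] for row in rows]

# Hypothetical fixed header: known columns plus spare slots for future ones.
header = ["article_number", "price_uk", "price_us", "spare_1", "spare_2"]

# The incoming row has its columns in a different order and no spare columns.
rows = [{"price_us": "12.49", "article_number": "A100", "price_uk": "9.99"}]

aligned = align_to_header(rows, header)
print(";".join(aligned[0]))
```

Because values are placed by name, the order in which the source delivers the columns does not change the concatenated string. Note that in this sketch a source column that is not in the header is silently ignored, so the header would still need updating when a genuinely new attribute appears.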

Is your data a delimited text file? If so, open it as a plain text file. Then all your data (with delimiters) will be in a single column. Compute the hash for that.
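A minimal Python sketch of that plain-text approach, hashing each raw line without splitting it into columns. The sample data and the `line_hashes` helper are invented for illustration, and the comparison assumes both files list the rows in the same order.

```python
import hashlib
import io

def line_hashes(text_file):
    """Return the SHA-256 hex digest of each raw line, in file order."""
    hashes = []
    for line in text_file:
        raw = line.rstrip("\r\n")  # keep the delimiters, drop the line ending
        hashes.append(hashlib.sha256(raw.encode("utf-8")).hexdigest())
    return hashes

# In-memory stand-ins for yesterday's and today's files (made-up data).
yesterday = io.StringIO("A100;Widget;9.99\nA200;Gadget;4.50\n")
today = io.StringIO("A100;Widget;10.49\nA200;Gadget;4.50\n")

pairs = zip(line_hashes(yesterday), line_hashes(today))
changed = [i for i, (old, new) in enumerate(pairs) if old != new]
print(changed)
```

Since the line is never split, the number of columns stops mattering for the hash. The flip side is that any new column changes every line’s hash on the day it first appears.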

Or use Reshape on the existing import to create one column, while retaining the original data for possible transforms in another branch, depending on the outcome of the hash comparison.

I mention it because Reshape solved for me a recent problem which at first looked more complicated.

Reshape would put my 1318 attributes, which I have as columns, into 1318 fields in one column, and then I cannot identify which dataset is involved. I would need a hash value per line of 1384 fields, to compare for each line whether the data has changed. The hash value would then be my filter for a delta import… Given that there might be a couple of thousand lines, it won’t work, or I wasn’t able to figure out how to do this.
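To make the goal concrete, here is a hedged Python sketch of a per-line hash keyed by an identifier, used to pick out the delta rows. It is only an illustration of the logic, not a description of any tool; `hashes_by_key`, the `article_number` key and the sample data are assumptions.

```python
import csv
import hashlib
import io

def hashes_by_key(csv_text, key, delimiter=";"):
    """Return {key value: hash of the row's fields, ordered by column name}."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text), delimiter=delimiter):
        # Sort by column name so the column order in the file is irrelevant.
        joined = delimiter.join(f"{col}={row[col]}" for col in sorted(row))
        result[row[key]] = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return result

yesterday_csv = ("article_number;price_uk;price_us\n"
                 "A100;9.99;12.49\n"
                 "A200;4.50;5.99\n")
# Today the columns arrive in a different order and A200 has a new US price.
today_csv = ("price_us;article_number;price_uk\n"
             "12.49;A100;9.99\n"
             "6.49;A200;4.50\n")

old = hashes_by_key(yesterday_csv, "article_number")
new = hashes_by_key(today_csv, "article_number")
delta = [k for k, h in new.items() if old.get(k) != h]
print(delta)
```

A few thousand lines is not a problem for this kind of loop. Because the column names are part of the hashed string, a newly added attribute changes every row’s hash once, which would trigger one full export on the day it appears.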

Figuring out how to do this would be easier with sample inputs and expected outputs.

Edit: I have read through the thread again. I am unclear where the problem lies. It seems plausible that it could be in one of a few places. For example, putting all the data into a single cell (I used Reshape + Concat for that) enables a file comparison hash. After that, what is the question? Number of columns provided? Identifying a difference column? Row? Cell?

Some input data with typical variations, plus the expected end product, will help.
