Pseudonymizing data

Currently, you can use the Hash function to pseudonymize data (replace values with a pseudonym). But:

  • Hashes are long.
  • There is a (very small) chance of a hash collision (where two different values have the same hash).
  • It is possible to reverse the hash with a lookup table of precomputed hashes (only practical if the values being hashed are short, or drawn from a small set).
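
To illustrate the first and last points, here is a minimal Python sketch. It assumes SHA-256 as the hash (the algorithm behind the Hash function is not stated above) and a made-up column of 4-digit codes. Because only 10,000 values are possible, hashing all of them in advance gives a table that maps any hash straight back to its original value.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the SHA-256 hash of a string as hexadecimal text."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical data: the column only ever holds 4-digit codes.
candidates = [f"{n:04d}" for n in range(10_000)]

# The attacker hashes every possible value once, in advance.
reverse_table = {sha256_hex(c): c for c in candidates}

# A "pseudonymized" value: 64 hex characters for a 4-character input.
pseudonym = sha256_hex("0427")

# Looking the hash up in the table recovers the original value.
recovered = reverse_table[pseudonym]
```

With long or unpredictable values the table becomes too large to build, which is why the attack is only practical for short values.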

You can also use Row Num or UUID transforms. But these have their own issues.

So we are working on a new Pseudonymize transform:

This allows you to create a lookup table from the pseudonym to the original value.

  • The pseudonyms are generally much shorter and more human-friendly than hashes.
  • There is no chance of a collision.
  • You cannot reverse it without the lookup table.

You can control what the pseudonyms look like using Prefix and Start at.

The order in which indexes are assigned to values is pseudo-random, controlled by the Seed value.
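
For anyone who wants to reason about the behaviour, here is a rough Python sketch of the idea. It is not the actual implementation of the transform: the function name, the parameter names (prefix, start_at, seed) and the shuffle-then-number approach are assumptions made for illustration only.

```python
import random


def pseudonymize(values, prefix="ID-", start_at=1, seed=0):
    """Return (pseudonymized values, lookup table from pseudonym to original).

    Illustrative only: each distinct value gets one index, and the order
    in which indexes are handed out is shuffled using the seed.
    """
    distinct = sorted(set(values))        # fixed starting order
    random.Random(seed).shuffle(distinct)  # pseudo-random, repeatable order
    lookup = {f"{prefix}{start_at + i}": v for i, v in enumerate(distinct)}
    forward = {v: p for p, v in lookup.items()}
    return [forward[v] for v in values], lookup


emails = ["ann@example.com", "bob@example.com", "ann@example.com"]
pseudonyms, lookup = pseudonymize(emails, prefix="Email-", start_at=1, seed=42)
```

Equal inputs always get the same pseudonym, distinct inputs never share one, and the originals can only be recovered by someone who holds the lookup table.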

We would be interested in feedback on this new transform from anyone who needs to pseudonymize data.

Like it.

If the pseudonym has to be built on more than one column, the columns can be concatenated first.

My personal preference: if the largest number generated for a pseudonym has n digits, every pseudonym should have that number of digits. E.g. for 4 digits, the first one would be “Email-0001”, so that all the pseudonym values have the same length.

The example doesn’t show it, but the ids are padded with 0s to all be the same length.
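
As a rough illustration of that padding, here is a short Python sketch. The helper name and the rule of taking the width from the largest index are assumptions, not a description of how the transform works internally.

```python
def padded_ids(prefix: str, start_at: int, count: int) -> list[str]:
    """Build `count` ids, zero-padded to the width of the largest index."""
    width = len(str(start_at + count - 1))
    return [f"{prefix}{n:0{width}d}" for n in range(start_at, start_at + count)]


# 1500 distinct values: the largest index has 4 digits, so all ids get 4 digits.
ids = padded_ids("Email-", start_at=1, count=1500)
first, last = ids[0], ids[-1]
```

Here first is “Email-0001” and last is “Email-1500”, so every id has the same length and sorts correctly as text.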
