[Req] Generate UUID V4

Hello,
Request: a way to generate a version 4 UUID (GUID)?

I believe JavaScript can currently do this, but a native option is always preferred.

Lord ChatGPT gave me this:

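// 'x' becomes a random hex digit; 'y' becomes 8, 9, a or b (the variant bits).
return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, c => {
  const r = Math.random() * 16 | 0;           // random integer 0..15
  const v = c === 'x' ? r : (r & 0x3 | 0x8);  // for 'y', force the top two bits to 10
  return v.toString(16);
});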
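The snippet above relies on Math.random(), which is not a cryptographic generator. As a rough alternative sketch, assuming a Node.js runtime (not EDT's embedded script engine), the same version 4 layout can be built from 16 bytes of crypto randomness. The function name uuidV4FromBytes is made up for illustration; recent Node versions also ship crypto.randomUUID(), which does this in one call.

```javascript
const { randomBytes } = require("crypto");

// Hypothetical helper: formats 16 bytes as a version 4 UUID string.
function uuidV4FromBytes(bytes = randomBytes(16)) {
  const b = Uint8Array.from(bytes);  // copy so the caller's buffer is untouched
  b[6] = (b[6] & 0x0f) | 0x40;       // high nibble of byte 6 = version 4
  b[8] = (b[8] & 0x3f) | 0x80;       // top two bits of byte 8 = variant 10
  const hex = Array.from(b, (x) => x.toString(16).padStart(2, "0")).join("");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join("-");
}

console.log(uuidV4FromBytes());
```

Passing your own 16 bytes makes the formatting testable; calling it with no argument gives a fresh value every run.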

What do you need UUIDs for?

There seems to be more than one type of UUID/GUID: NCS/DCE/Microsoft. Then various sub-types and versions.

BTW I think the JavaScript above is going to give you a different UUID every time you run it.

I need to transfer files to object storage with obfuscated names. What better than a UUID?

Yes!!!

Do you want a row to have a different UUID every run?

If it is to be used as a unique reference, e.g. to anonymise data, is that not just a sufficiently large hash function of [some of] the data?

Or is it supposed to be reversible, weak encryption or just non-obvious to the casual observer, e.g. base64?

Yes. Technically, once in production, the file is generated once and passed on to other scripts. If you do consider this, generating UUIDs could also be part of the SHUFFLE feature.

UUID 4 at least is completely random, not requiring any input from the user side. Hence no reversibility.

Do it in a single transform by hashing a suitable column (or concatenation), or generate a random number and hash or base64 that. What is the length you desire?

As I understand it, UUID (4) is a standard: it's always 32 hex digits grouped 8-4-4-4-12. Hashing is not required here.

You asked for a unique (32 character) alphanumeric. EDT can do that directly. What is the additional structure you need?

Hello,
Could you please explain how to do it directly with EDT (without JavaScript)?
I also need to anonymise some columns with random unique values.
Thank you in advance

Hash and Chop

Unique ID.transform (2.4 KB)

SHA1 produces 40 characters so I have chopped it to 32 in case that is a criterion.

If you want to retain a separate lookup table of the original codes against the hashes, then copy the original data, join, and delete the columns that are not required.
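For anyone doing this outside EDT, here is a minimal sketch of the same "hash and chop" idea, assuming Node.js and its crypto module. The names hashAndChop and buildLookup are invented for the example and are not part of the attached transform.

```javascript
const { createHash } = require("crypto");

// SHA1 gives 40 hex characters; keep only the first `length` of them.
function hashAndChop(value, length = 32) {
  return createHash("sha1")
    .update(String(value), "utf8")
    .digest("hex")
    .slice(0, length);
}

// Optional lookup table: chopped hash -> original code.
// Keep this somewhere secure if the data is meant to stay anonymous.
function buildLookup(values) {
  return new Map(values.map((v) => [hashAndChop(v), v]));
}

const lookup = buildLookup(["CUST-001", "CUST-002"]);
console.log([...lookup.keys()]);
```

The same input always yields the same key, which is what makes the lookup table possible.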

A hash can be used to generate a long alphanumeric. It is useful for anonymising things. But the same input to the hash will always give the same output. Also, you can get collisions, where 2 different values could conceivably return the same hash value, although this is unlikely with a well-designed hash.

UUIDs produce a large random number[1], based on no input. They are incredibly unlikely to produce the same UUID twice.

[1]Mostly. There are various different types of UUID.

I don’t think the third party library we use (Qt) has any way to change the UUID seed. If it did that would make it more likely to generate the same UUID twice, which would defeat the object of using UUIDs.

I also need to anonymise some columns with random unique values.

You might consider using Random to add a column of random values, Concat Cols with, say, their name to get values like:

98430238JohnSmith
32980712JaneSmith

Then hash those.

Note that it is possible that 2 different strings could result in the same hash (a ‘hash collision’). But this is very unlikely and you can check for it with a Verify transform.
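As an illustration only, the Random + Concat Cols + hash + Verify steps might look like this in script form, assuming Node.js. The helper name anonymiseColumn is hypothetical, and SHA256 chopped to 32 characters is my choice for the sketch rather than anything EDT prescribes.

```javascript
const { createHash, randomInt } = require("crypto");

// Hypothetical helper: maps each input value to a unique 32-character hex token.
function anonymiseColumn(values) {
  const used = new Set();
  return values.map((value) => {
    let token;
    do {
      // 8 random digits in front of the value, e.g. "98430238JohnSmith".
      const prefix = String(randomInt(0, 100000000)).padStart(8, "0");
      token = createHash("sha256")
        .update(prefix + value, "utf8")
        .digest("hex")
        .slice(0, 32);
    } while (used.has(token)); // the "verify" step: retry on a duplicate token
    used.add(token);
    return token;
  });
}

console.log(anonymiseColumn(["JohnSmith", "JaneSmith", "JohnSmith"]));
```

Because of the random prefix, the two JohnSmith rows get different tokens, and a rerun gives different tokens again.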

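For full SHA1 (160 bits), an accidental collision only becomes likely after about 2^80 (roughly 10^24) hashes. For 100,000 records that comes down to about a 3.4x10^-39 probability of any collision. Chopping to 32 hex characters (128 bits) raises that to roughly 1.5x10^-29, which is still negligible.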
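The 3.4x10^-39 figure follows from the usual birthday approximation, p ≈ n(n-1)/2 divided by 2^bits, which holds while p is tiny. A small sketch to reproduce it for other sizes (the function name is just for illustration):

```javascript
// Birthday approximation: probability that at least two of n random
// values collide in a space of 2^bits possibilities (valid for small p).
function collisionProbability(n, bits) {
  const pairs = (n * (n - 1)) / 2;
  return pairs / Math.pow(2, bits);
}

console.log(collisionProbability(100000, 160)); // full SHA1
console.log(collisionProbability(100000, 128)); // chopped to 32 hex characters
console.log(collisionProbability(100000, 122)); // random bits in a version 4 UUID
```

All three come out far below anything worth worrying about at 100,000 records.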

If that is not sufficiently rare then use SHA256 or greater on padded data. The randomness of any arbitrary number of alphanumeric characters, whether or not case-sensitive, is not varied by calling it a UUID rather than a hash. Other than encoding bits, the example I gave represents UUID V5.

I noted earlier that hashing a key is for anonymisation, where you keep a separate secure file for lookups. Some Census bureaux do this for current anonymity yet for the benefit of historians after 99 or so years.

If non-repeatable keys are desired then hash a random number, as Andy and I both mentioned.

The 8-4-4-4-12 structure to which Prashant referred is the addition of separators (given Prashant did not express the need for codification bits), probably achievable with a bit of Regex. I have not looked more closely at that.
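For what it is worth, the separators can be added with one regex replace. A minimal JavaScript sketch, with groupAsUuid as a made-up name; it only inserts hyphens and does not set any version or variant bits:

```javascript
// Insert hyphens into a 32-character hex string to get the 8-4-4-4-12 layout.
function groupAsUuid(hex32) {
  if (!/^[0-9a-f]{32}$/i.test(hex32)) {
    throw new Error("expected exactly 32 hex characters");
  }
  return hex32
    .toLowerCase()
    .replace(/^(.{8})(.{4})(.{4})(.{4})(.{12})$/, "$1-$2-$3-$4-$5");
}

console.log(groupAsUuid("0123456789abcdef0123456789abcdef"));
// 01234567-89ab-cdef-0123-456789abcdef
```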

Just bear in mind that Random could generate the same value for 2 different rows. Also it is only pseudo random (based on a seed + an algorithm).

I guess part of the problem here is ill definition. Is a random key to be generated from scratch or from existing unique data? How much data? What collision probability is acceptable? UUID itself has slightly fewer than the theoretically calculable options for its length because strictly it reserves a nibble for the version (plus two bits for the variant), which leaves 122 random bits in a version 4 UUID.

Math.random() in JS has a cycle length of 2^128 apparently (2^30 for older versions) so the question is the seed, which comes back to my original request for easy variation of it or even to have it as metadata.

Speculation on a change:
The default seed EDT generates for a new file could be retained in Random while a separate seed is generated from a hash of the input data (extracting digits) and made available through metadata. The suggestion of a “Generate now” button in Random would also be retained.

Here it is with 8-4-4-4-12, I used HMAC-MD5 hash, as you can set whatever secret you like.

Transform file.
GenerateUUID4.transform (2.2 KB)
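For readers without EDT to hand, a rough script equivalent of "HMAC-MD5, then 8-4-4-4-12" could look like the following, assuming Node.js. This is a guess at the general approach, not a transcription of the transform file, and the function name is invented. Note that the result is UUID-shaped but deterministic for a given value and secret, and it does not set the version 4 bits.

```javascript
const { createHmac } = require("crypto");

// HMAC-MD5 yields 32 hex characters, which are then grouped 8-4-4-4-12.
function hmacMd5AsUuidShape(value, secret) {
  const hex = createHmac("md5", secret)
    .update(String(value), "utf8")
    .digest("hex");
  return hex.replace(/^(.{8})(.{4})(.{4})(.{4})(.{12})$/, "$1-$2-$3-$4-$5");
}

// "my-secret" and the row value are placeholders for the example.
console.log(hmacMd5AsUuidShape("row-1", "my-secret"));
```

Changing the secret changes every output, which is the appeal of HMAC over a plain hash here.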

Thank you all for the feedback and for showing how to do the same in EDT currently. My entire request was based on the idea of using NO seed.

We recently uploaded 12 years of ERP images and financial documents from our database to S3 storage for Company 1:

Images - 245,839
Documents - 134,304

Both sets contain many duplicate file names, kept in different locations in the ERP, so for object storage we cannot use the file name as a seed.
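To make the no-seed idea concrete, here is a small sketch, assuming Node.js, of giving every file its own random UUID as the object key while keeping a mapping back to the original location. The paths and the assignObjectKeys name are invented for the example; the upload itself is left out.

```javascript
const { randomUUID } = require("crypto");
const path = require("path");

// Each source path gets a fresh version 4 UUID as its object key,
// so duplicate file names in different folders never clash.
function assignObjectKeys(sourcePaths) {
  return sourcePaths.map((source) => ({
    source,
    key: randomUUID() + path.extname(source).toLowerCase(),
  }));
}

// Hypothetical ERP paths sharing the same file name.
console.log(assignObjectKeys([
  "/erp/2012/invoices/scan.PDF",
  "/erp/2019/receipts/scan.PDF",
]));
```

The mapping needs to be stored somewhere, since nothing in the key can be traced back to the file.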

YES!!

Thank You as always