Human readable alternative for UUIDs

tobib · Mar 27, 2018

I am working on a system that makes heavy use of pseudonyms to make privacy-critical data available to researchers. These pseudonyms should have the following properties:

  1. They should not contain any information (e.g. time of creation, relation to other pseudonyms, encoded data, …).
  2. It should be easy to create unique pseudonyms.
  3. They should be human readable. That means they should be easy for humans to compare, copy, and understand when read aloud.

My first idea was to use UUID4. They are quite good on (1) and (2), but not so much on (3).

One variant is to encode UUIDs with a larger alphabet, resulting in shorter strings (see, for example, shortuuid). But I am not sure whether this actually improves readability.
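
As a minimal sketch of the larger-alphabet idea (not shortuuid's actual implementation), a UUID4 can be re-encoded from 32 hex digits into 22 characters using a 58-symbol alphabet that omits look-alike characters such as 0/O and 1/l/I. The alphabet and the function names below are assumptions chosen for illustration.

```python
import uuid

# Hypothetical alphabet: digits and letters without 0, O, I and l (58 symbols).
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
WIDTH = 22  # 58**22 > 2**128, so 22 symbols always suffice for a UUID


def encode_uuid(u: uuid.UUID) -> str:
    """Write the UUID's 128-bit integer in base 58, left-padded to a fixed width."""
    n = u.int
    digits = []
    while n:
        n, rem = divmod(n, len(ALPHABET))
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits)).rjust(WIDTH, ALPHABET[0])


def decode_uuid(text: str) -> uuid.UUID:
    """Inverse of encode_uuid."""
    n = 0
    for ch in text:
        n = n * len(ALPHABET) + ALPHABET.index(ch)
    return uuid.UUID(int=n)


if __name__ == "__main__":
    u = uuid.uuid4()
    print(u, "->", encode_uuid(u))
```

The result is shorter, but it mixes upper and lower case, so it is not obviously easier to read aloud than the hex form.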

Another approach I am currently looking into is a paper from 2005 titled "An optimal code for patient identifiers" which aims to tackle exactly my problem. The algorithm described there creates 8-character pseudonyms with 30 bits of entropy. I would prefer to use a more widely reviewed standard though.

Then there is also the git approach: only display the first few characters of the actual pseudonym. But this would mean that a pseudonym could lose its uniqueness after some time.
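
The uniqueness problem with truncated display can be sketched as follows. The helper below is hypothetical and only illustrates the idea: it finds the shortest prefix length at which all issued pseudonyms are still distinguishable, and that length can grow whenever a new pseudonym is issued.

```python
def shortest_unique_prefix_len(ids, minimum=4):
    """Return the smallest prefix length >= minimum that keeps all ids distinct."""
    ids = list(ids)
    if not ids or len(set(ids)) != len(ids):
        raise ValueError("need a non-empty collection of distinct identifiers")
    longest = max(len(i) for i in ids)
    for k in range(minimum, max(minimum, longest) + 1):
        # At length k, every identifier must map to a different prefix.
        if len({i[:k] for i in ids}) == len(ids):
            return k
    raise AssertionError("unreachable: full-length identifiers are distinct")


if __name__ == "__main__":
    issued = ["3f2a9c1e", "b71d04aa"]
    print(shortest_unique_prefix_len(issued))  # 4-character prefixes suffice
    issued.append("3f2a77b0")                  # shares the prefix "3f2a"
    print(shortest_unique_prefix_len(issued))  # now 5 characters are needed
```

A 4-character prefix that was unambiguous yesterday may refer to two pseudonyms today, which is the failure mode described above.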

So my question is: Is there any widely-used standard for human-readable unique ids?

Answer

Vasiliy Faronov · May 12, 2018

I’m not aware of any widely used standard for this. Here’s a less widely used one:

Proquints

https://arxiv.org/html/0901.4016

https://github.com/dsw/proquint

A UUID4 (128 bits) would be converted into 8 proquints. If that’s too many, you can take the last 64 bits of the UUID4 (or, more simply, just generate 64 random bits). This doesn’t make the identifier magically lose uniqueness; it only increases the likelihood of collisions, which was non-zero to begin with, and which you can estimate mathematically to decide whether it is still OK for your purposes.
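
As a rough illustration, the Python sketch below encodes 64 random bits as four proquints and estimates the collision probability with the standard birthday approximation, p ≈ 1 − exp(−n(n−1) / 2^(bits+1)). The alphabets follow the proquint proposal linked above (16 consonants carrying 4 bits each, 4 vowels carrying 2 bits each); the function names are made up for this example and are not taken from the reference implementation.

```python
import math
import secrets

CONSONANTS = "bdfghjklmnprstvz"  # 16 symbols -> 4 bits each
VOWELS = "aiou"                  # 4 symbols  -> 2 bits each


def word_to_proquint(word: int) -> str:
    """Encode one 16-bit value as consonant-vowel-consonant-vowel-consonant."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("word must fit in 16 bits")
    return (
        CONSONANTS[(word >> 12) & 0xF]
        + VOWELS[(word >> 10) & 0x3]
        + CONSONANTS[(word >> 6) & 0xF]
        + VOWELS[(word >> 4) & 0x3]
        + CONSONANTS[word & 0xF]
    )


def to_proquints(value: int, bits: int) -> str:
    """Encode a `bits`-wide integer (multiple of 16) as dash-separated proquints."""
    if bits <= 0 or bits % 16:
        raise ValueError("bits must be a positive multiple of 16")
    if not 0 <= value < 1 << bits:
        raise ValueError("value does not fit in the given width")
    words = [(value >> shift) & 0xFFFF for shift in range(bits - 16, -1, -16)]
    return "-".join(word_to_proquint(w) for w in words)


def random_pseudonym(bits: int = 64) -> str:
    """Draw `bits` random bits from the OS CSPRNG and encode them."""
    return to_proquints(secrets.randbits(bits), bits)


def collision_probability(n: int, bits: int) -> float:
    """Birthday approximation for at least one collision among n random ids."""
    return -math.expm1(-n * (n - 1) / 2 ** (bits + 1))


if __name__ == "__main__":
    print(random_pseudonym())
    for bits in (30, 64, 128):
        print(bits, collision_probability(10**6, bits))
```

With this encoding, the 32-bit value 0x7F000001 becomes lusab-babad. Under the approximation, one million 64-bit pseudonyms collide with probability on the order of 10^-8, whereas one million 30-bit pseudonyms would almost certainly collide, so truncating further needs an explicit uniqueness check at creation time.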