I'm not great with statistical mathematics, etc. I've been wondering, if I use the following:
import uuid
unique_str = str(uuid.uuid4())
double_str = ''.join([str(uuid.uuid4()), str(uuid.uuid4())])
Is the double_str string squared as unique as unique_str, or just some amount more unique? Also, is there any negative implication in doing something like this (some birthday problem situation, etc.)? This may sound ignorant, but I simply would not know, as my math spans Algebra 2 at best.
The uuid4 function returns a UUID created from 16 random bytes, and it is extremely unlikely to produce a collision, to the point where you probably shouldn't even worry about it.
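As a small illustration, assuming the standard library `uuid` module, the sketch below shows that a version 4 UUID carries a fixed version and variant, so slightly fewer than all 128 bits are random. The variable names are only illustrative.

```python
import uuid

u = uuid.uuid4()

# The version field is always 4 for uuid4, and the variant is the RFC 4122 one.
print(u.version)
print(u.variant == uuid.RFC_4122)

# 4 version bits + 2 variant bits are fixed, leaving the rest random.
random_bits = 128 - 6
print(random_bits)
```

This does not change the conclusion in practice: 2**122 possible values is still an enormous space.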
If for some reason uuid4 does produce a duplicate, it is far more likely to be a programming error, such as a failure to correctly initialize the random number generator, than genuine bad luck. In that case your approach will not make things any better: an incorrectly initialized random number generator can still produce duplicates even when you concatenate two UUIDs.
If you use the default implementation, random.seed(None), you can see in the source that only 16 bytes of randomness are used to initialize the random number generator, so this is an issue you would have to solve first. Also, if the OS doesn't provide a source of randomness, the system time will be used, which is not very random at all.
But ignoring these practical issues, you are basically along the right lines. To use a mathematical approach, we first have to define what you mean by "uniqueness". I think a reasonable definition is the number of ids n you need to generate before the probability of generating a duplicate exceeds some probability p. An approximate formula for this is:

$$n \approx \sqrt{2d \ln\left(\frac{1}{1-p}\right)}$$

where d is the number of possible ids: 2**(16*8) for a single randomly generated uuid (strictly 2**122, since uuid4 fixes 6 bits for the version and variant) and 2**(16*2*8) with your suggested approach.
with your suggested approach. The square root in the formula is indeed due to the Birthday Paradox. But if you work it out you can see that if you square the range of values d
while keeping p
constant then you also square n
.
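As a rough check of this reasoning, here is a minimal sketch that evaluates the standard birthday-bound approximation n ≈ sqrt(2 d ln(1/(1-p))) for both cases. The function name `ids_before_collision` and the choice p = 0.5 are illustrative assumptions, and d is taken as the idealized 2**(16*8) and 2**(16*2*8) from the discussion above.

```python
import math


def ids_before_collision(d: int, p: float) -> float:
    """Approximate number of ids drawn uniformly from d possible values
    before the chance of at least one duplicate reaches p."""
    # -log1p(-p) equals ln(1 / (1 - p)) and is accurate for small p.
    return math.sqrt(2 * d * -math.log1p(-p))


p = 0.5  # assumed threshold, chosen only for illustration

n_single = ids_before_collision(2 ** (16 * 8), p)      # one uuid4
n_double = ids_before_collision(2 ** (16 * 2 * 8), p)  # two uuid4s joined

print(f"single: about {n_single:.3e} ids")
print(f"double: about {n_double:.3e} ids")

# Squaring d squares n up to a constant that depends only on p.
print(n_double / n_single ** 2)
```

The last line prints the same ratio no matter how large d is, which is the sense in which the concatenated string is "squared as unique".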