Collision probability of ObjectId vs UUID in a large distributed system

mongodb guid uuid rfc4122

SystematicFrank · Mar 24, 2014 · Viewed 12.7k times · Source

Considering that an UUID rfc 4122 (16 bytes) is much larger than a MongoDB ObjectId (12 bytes), I am trying to find out how their collision probability compare.

I know that is something around quite unlikely, but in my case most ids will be generated within a large number of mobile clients, not within a limited set of servers. I wonder if in this case, there is a justified concern.

Compared to the normal case where all ids are generated by a small number of clients:

It might take months to detect a collision since the document creation
IDs are generated from a much larger client base
Each client has a lower ID generation rate

Answer

in my case most ids will be generated within a large number of mobile clients, not within a limited set of servers. I wonder if in this case, there is a justified concern.

That sounds like very bad architecture to me. Are you using a two-tier architecture? Why would the mobile clients have direct access to the db? Do you really want to rely on network-based security?

Anyway, some deliberations about the collision probability:

Neither UUID nor ObjectId rely on their sheer size, i.e. both are not random numbers, but they follow a scheme that tries to systematically reduce collision probability. In case of ObjectIds, their structure is:

4 byte seconds since unix epoch
3 byte machine id
2 byte process id
3 byte counter

This means that, contrary to UUIDs, ObjectIds are monotonic (except within a single second), which is probably their most important property. Monotonic indexes will cause the B-Tree to be filled more efficiently, it allows paging by id and allows a 'default sort' by id to make your cursors stable, and of course, they carry an easy-to-extract timestamp. These are the optimizations you should be aware of, and they can be huge.

As you can see from the structure of the other 3 components, collisions become very likely if you're doing > 1k inserts/s on a single process (not really possible, not even from a server), or if the number of machines grows past about 10 (see birthday problem), or if the number of processes on a single machine grows too large (then again, those aren't random numbers, but they are truly unique on a machine, but they must be shortened to two bytes).

Naturally, for a collision to occur, they must match in all these aspects, so even if two machines have the same machine hash, it'd still require a client to insert with the same counter value in the exact same second and the same process id, but yes, these values could collide.

Collision probability of ObjectId vs UUID in a large distributed system

Answer

Related questions