Collision probability of ObjectId vs UUID in a large distributed system

SystematicFrank picture SystematicFrank · Mar 24, 2014 · Viewed 12.7k times · Source

Considering that an UUID rfc 4122 (16 bytes) is much larger than a MongoDB ObjectId (12 bytes), I am trying to find out how their collision probability compare.

I know that is something around quite unlikely, but in my case most ids will be generated within a large number of mobile clients, not within a limited set of servers. I wonder if in this case, there is a justified concern.

Compared to the normal case where all ids are generated by a small number of clients:

  • It might take months to detect a collision since the document creation
  • IDs are generated from a much larger client base
  • Each client has a lower ID generation rate

Answer

mnemosyn picture mnemosyn · Mar 24, 2014

in my case most ids will be generated within a large number of mobile clients, not within a limited set of servers. I wonder if in this case, there is a justified concern.

That sounds like very bad architecture to me. Are you using a two-tier architecture? Why would the mobile clients have direct access to the db? Do you really want to rely on network-based security?

Anyway, some deliberations about the collision probability:

Neither UUID nor ObjectId rely on their sheer size, i.e. both are not random numbers, but they follow a scheme that tries to systematically reduce collision probability. In case of ObjectIds, their structure is:

  • 4 byte seconds since unix epoch
  • 3 byte machine id
  • 2 byte process id
  • 3 byte counter

This means that, contrary to UUIDs, ObjectIds are monotonic (except within a single second), which is probably their most important property. Monotonic indexes will cause the B-Tree to be filled more efficiently, it allows paging by id and allows a 'default sort' by id to make your cursors stable, and of course, they carry an easy-to-extract timestamp. These are the optimizations you should be aware of, and they can be huge.

As you can see from the structure of the other 3 components, collisions become very likely if you're doing > 1k inserts/s on a single process (not really possible, not even from a server), or if the number of machines grows past about 10 (see birthday problem), or if the number of processes on a single machine grows too large (then again, those aren't random numbers, but they are truly unique on a machine, but they must be shortened to two bytes).

Naturally, for a collision to occur, they must match in all these aspects, so even if two machines have the same machine hash, it'd still require a client to insert with the same counter value in the exact same second and the same process id, but yes, these values could collide.