Kafka DefaultPartitioner algorithm

nucatus · Sep 30, 2016 · Viewed 11.6k times

There is a very small but very powerful detail in the Kafka org.apache.kafka.clients.producer.internals.DefaultPartitioner implementation that bugs me a lot.

It is this line of code:

return DefaultPartitioner.toPositive(Utils.murmur2(keyBytes)) % numPartitions;

To be more precise, the last `% numPartitions`. I keep asking myself: what is the reason for introducing such a strong constraint by making the partition ID a function of the number of existing partitions? Is it just for the convenience of having small, human-readable/traceable numbers relative to the total partition count? Does anyone here have broader insight into this?
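As a concrete illustration of what that line computes, here is a minimal sketch (not Kafka's actual source; the helper names are assumptions mirroring `Utils.toPositive` and the modulo step from the question):

```java
// Sketch of the DefaultPartitioner's mapping step: the murmur2 hash is
// first masked to a non-negative int, then reduced modulo the partition
// count so the result is always a valid partition ID.
public class PartitionMapping {

    // Clears the sign bit, like Kafka's Utils.toPositive.
    static int toPositive(int number) {
        return number & 0x7fffffff;
    }

    // Maps an arbitrary hash value into [0, numPartitions - 1].
    static int partitionFor(int hash, int numPartitions) {
        return toPositive(hash) % numPartitions;
    }
}
```

With 4 partitions, for example, `partitionFor(10, 4)` yields 2, and even a negative hash like `-1` is first mapped to a non-negative value before the modulo is applied.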

I'm asking this because in our implementation, the key we use to store data in Kafka is domain-sensitive, and we retrieve information from Kafka based on it. For instance, we have consumers that need to subscribe ONLY to the partitions that are of interest to them, and we establish that link using such keys.

Would it be safe to use a custom partitioner that doesn't perform that modulo operation? Would we notice any performance degradation? Does this have any implications on the Producer and/or Consumer side?

Any ideas and comments are welcome.

Answer

Matthias J. Sax · Oct 1, 2016

Partitions in a Kafka topic with N partitions are numbered 0 to N-1. Thus, if a key is hashed to determine a partition, the resulting value must fall in the interval [0, N-1] -- it must be a valid partition number.

Using the modulo operation is a standard technique in hashing: it maps an arbitrarily large hash value into a fixed range.
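To see the answer's point in code, here is a quick check (an illustrative sketch, not Kafka source) that the mask-then-modulo step lands every hash, including negative ones, inside the valid partition range:

```java
import java.util.stream.IntStream;

// For a topic with N partitions numbered 0..N-1, (hash & 0x7fffffff) % N
// always yields a valid partition ID, no matter what int the hash is.
public class ModuloCheck {

    static boolean allInRange(int[] hashes, int numPartitions) {
        return IntStream.of(hashes)
                .map(h -> (h & 0x7fffffff) % numPartitions)
                .allMatch(p -> p >= 0 && p < numPartitions);
    }
}
```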