Is it possible to create a kafka topic with dynamic partition count?

vivek_jonam · Sep 24, 2015 · Viewed 12.2k times

I am using Kafka to stream page-visit events from website users to an analytics service. Each event will contain the following details for the consumer:

  • user id
  • IP address of the user

I need very high throughput, so I decided to partition the topic with a partition key of the form userId-ipAddress, i.e.

For a userId 1000 and ip address 10.0.0.1, the event will have partition key as "1000-10.0.0.1"

In this use case the partition key is dynamic, so I cannot specify the number of partitions upfront when creating the topic. Is it possible to create a topic in Kafka with a dynamic partition count?

Is this kind of partitioning good practice, or is there another way to achieve this?

Answer

Lukáš Havrlant · Sep 26, 2015

It's not possible to create a Kafka topic with dynamic partition count. When you create a topic you have to specify the number of partitions. You can change it later manually using Replication Tools.

But I don't understand why you need a dynamic partition count in the first place. The partition key is not related to the number of partitions: you can use the same partition key with ten partitions or with a thousand partitions. When you send a message to a Kafka topic, Kafka must send it to a specific partition. Every partition is identified by its ID, which is simply a number. Kafka computes something like this:

partition_id = hash(partition_key) % number_of_partitions

and it sends the message to partition partition_id. If you have far more users than partitions, you should be OK. More suggestions:

  • Use userId as the partition key. You probably don't need the IP address as part of the partition key; what would it be good for? Typically you want all messages from a single user to end up in a single partition. If the IP address is part of the partition key, then messages from a single user could end up in multiple partitions. I don't know your use case, but in general that's not what you want.
  • Measure how many partitions you need to process all messages, then create, say, ten times more. You can create more partitions than you actually need; Kafka won't mind and there are no performance penalties. See How to choose the number of topics/partitions in a Kafka cluster?
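The key-to-partition mapping above can be sketched in a few lines of Python. This is a simplified stand-in, not Kafka's actual code: Kafka's default partitioner hashes the key bytes with murmur2, while here `zlib.crc32` plays that role, and `partition_for` is a hypothetical helper name, not a Kafka API.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Simplified version of: partition_id = hash(partition_key) % number_of_partitions
    # (Kafka's default partitioner uses murmur2; crc32 is just a deterministic stand-in.)
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# The same key always maps to the same partition, whatever the count:
p10 = partition_for("1000", 10)
assert p10 == partition_for("1000", 10)

# With far more users than partitions, the load spreads across all of them:
counts = [0] * 10
for user_id in range(100_000):
    counts[partition_for(str(user_id), 10)] += 1
```

The point of the simulation: the partition count is an independent knob; the key only decides *which* of the existing partitions a message lands in.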

Right now you should be able to process all messages in your system. If traffic grows, you can add more Kafka brokers and use the replication tools to change leaders/replicas for partitions. If traffic grows more than tenfold, you will have to create new partitions.
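One caveat worth seeing concretely: because the partition is computed as hash modulo the partition count, adding partitions later remaps many existing keys to different partitions. A small sketch, again using `zlib.crc32` as a stand-in hash and a hypothetical `partition_for` helper:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Simplified version of: partition_id = hash(partition_key) % number_of_partitions
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Map 1,000 user ids with 10 partitions, then again after growing to 20.
before = {str(u): partition_for(str(u), 10) for u in range(1000)}
after = {str(u): partition_for(str(u), 20) for u in range(1000)}

# Many keys now hash to a different partition, so per-key ordering is
# only guaranteed for messages produced after the partition count changed.
moved = sum(1 for u in before if before[u] != after[u])
```

This is why it pays to over-provision partitions upfront, as suggested above, rather than growing the count under live traffic.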