Kafka Topic vs Partition topic

Anil · Jan 7, 2015 · Viewed 11k times

I would like to know the difference between a simple topic and a partitioned topic. As I understand it, a topic is partitioned to balance the load. Each message has an offset, and the consumer acknowledges offsets to confirm that the previous messages have been consumed. If the number of partitions and the number of consumers do not match, does the rebalance performed by Kafka handle this efficiently?

If multiple topics are created instead of partitions, does that affect operational efficiency?

Answer

user2720864 · Jan 7, 2015

From the Kafka documentation:

The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data

Having multiple partitions for a topic allows Kafka to distribute that topic across the cluster. As a result, requests for data in different partitions can be divided among multiple servers. Each partition can also be replicated across multiple servers to minimize data loss. Again from the documentation:

The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.

So a topic with a single partition cannot take advantage of these benefits. Note also that in a real-life environment you can use different topics to hold different categories of messages (though it is also possible to use a single topic with multiple partitions, where each partition holds specific categories of messages, by setting the message key when producing).
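To illustrate the keyed approach, here is a minimal sketch of how a message key can determine the partition. It is only an illustration: `partition_for` is a hypothetical helper, and the CRC32 hash stands in for whatever hash the real Kafka producer uses. The only point carried over from Kafka is that messages with the same key land in the same partition.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a message key to a partition index.

    Assumption: CRC32 is an illustrative hash, not Kafka's own partitioner.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical messages: (key, value), where the key is the message category.
messages = [("sports", "m1"), ("politics", "m2"), ("sports", "m3")]

# Group message values by the partition their key maps to.
partitions = {}
for key, value in messages:
    partitions.setdefault(partition_for(key, 3), []).append(value)
```

Both "sports" messages end up in the same partition. Two different keys can still hash to the same partition, so a partition is not guaranteed to hold a single category unless there are enough partitions or a custom partitioner is used.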

I don't think creating multiple topics instead of partitions will have much impact on overall performance. But imagine you want to keep track of all the tweets made by users on your site. You can have one topic named "User_tweet" with multiple partitions, so that when producing messages Kafka distributes the data across the partitions, and on the consumer side you need only one consumer group pulling data from that topic. Keeping separate topics such as "User_tweet_1", "User_tweet_2", and "User_tweet_3" instead will only make both producing and consuming more complex.
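On the question of mismatched partition and consumer counts, the sketch below shows roughly how the partitions of one topic could be split among the consumers in a group. It is a simplified, range-style split; `assign_partitions` and the consumer names are hypothetical, and in a real cluster the assignment is done by Kafka's own rebalancing, not by application code. The rule it reflects is that each partition is read by only one consumer within a group.

```python
from typing import Dict, List


def assign_partitions(num_partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Split partition indices into contiguous ranges, one range per consumer."""
    members = sorted(consumers)
    # Every consumer gets `base` partitions; the first `extra` consumers get one more.
    base, extra = divmod(num_partitions, len(members))
    assignment = {}
    start = 0
    for i, member in enumerate(members):
        count = base + (1 if i < extra else 0)
        assignment[member] = list(range(start, start + count))
        start += count
    return assignment


even = assign_partitions(4, ["c1", "c2"])              # partitions divide evenly
uneven = assign_partitions(3, ["c1", "c2"])            # one consumer gets an extra partition
surplus = assign_partitions(2, ["c1", "c2", "c3"])     # one consumer is left idle
```

When there are more consumers than partitions, the extra consumers receive nothing, so adding consumers beyond the partition count does not increase parallelism.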