heartbeat failed for group because it's rebalancing

user677715 · Oct 20, 2016

What exactly causes a heartbeat to fail "for group because it's rebalancing"? And what triggers a rebalance when all the consumers in the group are up?

Thank you.

Answer

Matthias J. Sax · Oct 23, 2016

Heartbeats are the basic mechanism for checking whether all consumers in a group are still up and running. If you get a heartbeat failure because the group is rebalancing, it indicates that your consumer instance took too long to send its next heartbeat, was considered dead, and a rebalance was triggered as a result.

If you want to prevent this from happening, you can either increase the timeout (session.timeout.ms) or make sure your consumer sends heartbeats more often (heartbeat.interval.ms). Heartbeats are basically embedded in poll(), so you need to make sure you call poll() frequently enough. This can usually be achieved by limiting the number of records a single poll() returns via max.poll.records, which shortens the time it takes to process all the data that was fetched.
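As a rough illustration of the three settings mentioned above, here is a minimal Java sketch that collects them in a java.util.Properties object. The class name HeartbeatTuning, the method name, and all numeric values are assumptions made up for this example, not recommendations; the right values depend on your workload and on the limits your brokers allow.

```java
import java.util.Properties;

// Hypothetical helper; the numeric values below are illustrative assumptions.
class HeartbeatTuning {

    // Builds consumer settings that allow more time before the consumer is
    // considered dead and hand out less work per poll().
    static Properties tunedConsumerProps() {
        Properties props = new Properties();
        // Time without a heartbeat after which the consumer is considered dead.
        props.setProperty("session.timeout.ms", "30000");
        // How often heartbeats are sent; kept well below the session timeout.
        props.setProperty("heartbeat.interval.ms", "10000");
        // Upper bound on the records returned by a single poll().
        props.setProperty("max.poll.records", "100");
        return props;
    }

    public static void main(String[] args) {
        Properties props = tunedConsumerProps();
        for (String name : props.stringPropertyNames()) {
            System.out.println(name + "=" + props.getProperty(name));
        }
    }
}
```

In a real application these entries would be combined with the usual required settings (bootstrap.servers, group.id, deserializers) before being handed to the consumer.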

Update

Since Kafka 0.10.1, heartbeats are sent from a background thread and no longer as part of poll() (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-62%3A+Allow+consumer+to+send+heartbeats+from+a+background+thread). In this new design, the session.timeout.ms and heartbeat.interval.ms settings still work as before. Additionally, there is max.poll.interval.ms, which sets the maximum time allowed between two consecutive poll() calls; a consumer that exceeds it is considered failed, which triggers a rebalance.
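To get a feel for how max.poll.interval.ms interacts with max.poll.records, the following Java sketch estimates whether a full batch can be processed before the next poll() is due. The class PollBudget, its methods, the per-record processing times, and the configuration values are all assumptions for illustration; it is a back-of-the-envelope check, not something the Kafka client does for you.

```java
import java.util.Properties;

// Hypothetical helper for a rough worst-case estimate.
class PollBudget {

    // Worst case: poll() returns a full batch and every record takes perRecordMillis.
    static long worstCaseBatchMillis(int maxPollRecords, long perRecordMillis) {
        return (long) maxPollRecords * perRecordMillis;
    }

    // True if a full batch can be processed before max.poll.interval.ms elapses.
    static boolean fitsPollInterval(Properties props, long perRecordMillis) {
        int maxPollRecords = Integer.parseInt(props.getProperty("max.poll.records"));
        long maxPollIntervalMs = Long.parseLong(props.getProperty("max.poll.interval.ms"));
        return worstCaseBatchMillis(maxPollRecords, perRecordMillis) < maxPollIntervalMs;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("max.poll.records", "500");        // illustrative value
        props.setProperty("max.poll.interval.ms", "300000"); // illustrative value

        System.out.println(fitsPollInterval(props, 200));  // 500 * 200 ms = 100 s, prints true
        System.out.println(fitsPollInterval(props, 1000)); // 500 * 1000 ms = 500 s, prints false
    }
}
```

If the check fails for your measured processing time, either lower max.poll.records or raise max.poll.interval.ms.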

For more details, cf. "Difference between session.timeout.ms and max.poll.interval.ms for Kafka 0.10.0.0 and later versions".