Storm max spout pending

Naresh · Jun 25, 2014 · Viewed 14.2k times

This is a question regarding how Storm's max spout pending works. I currently have a spout that reads a file and emits a tuple for each line in the file (I know Storm is not the best solution for dealing with files but I do not have a choice for this problem).

I set topology.max.spout.pending to 50k to throttle how many tuples enter the topology at a time. However, the setting appears to have no effect: every record in the file is emitted each time. My guess is that this is due to a loop in my nextTuple() method that emits all the records in the file.

My question is: does Storm simply stop calling nextTuple() on the spout task once topology.max.spout.pending is reached? Does this mean I should emit only one tuple each time the method is called?

Answer

John Gilmore · Jul 21, 2014

Exactly! Storm can only throttle your spout by controlling when it calls nextTuple(), so if you emit everything on the first call, there is no way for Storm to throttle your spout.

The Storm developers recommend emitting a single tuple per nextTuple() call. The Storm framework will then hold off on calling nextTuple() as needed to honor the "max spout pending" limit. If you're emitting a high number of tuples, you can batch your emits, keeping each batch to at most a tenth of your max spout pending, to give Storm the chance to throttle.
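To make the difference concrete, here is a toy model in plain Java. It is only a sketch, not Storm code: ToyFileSpout, peakPending and the driver loop are hypothetical stand-ins that assume the caller stops calling nextTuple() once the number of un-acked tuples reaches the limit. Also worth checking in a real topology: as far as I know, max spout pending only counts tuples that the spout emits with a message ID, so unanchored emits are not throttled at all.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;

// Toy model only: every type here is a hypothetical stand-in, not a Storm class.
class MaxSpoutPendingSketch {

    /** Stand-in for a file-reading spout that emits up to maxPerCall lines per nextTuple() call. */
    static class ToyFileSpout {
        private final Iterator<String> lines;
        private final int maxPerCall;

        ToyFileSpout(List<String> lines, int maxPerCall) {
            if (maxPerCall < 1) throw new IllegalArgumentException("maxPerCall must be >= 1");
            this.lines = lines.iterator();
            this.maxPerCall = maxPerCall;
        }

        boolean exhausted() {
            return !lines.hasNext();
        }

        /** Emits into the pending queue; the spout itself never checks the limit. */
        void nextTuple(Queue<String> pending) {
            for (int i = 0; i < maxPerCall && lines.hasNext(); i++) {
                pending.add(lines.next());
            }
        }
    }

    /**
     * Simplified driver: calls nextTuple() only while fewer than maxPending
     * tuples are un-acked, otherwise simulates one downstream ack.
     * Returns the highest number of pending tuples observed.
     */
    static int peakPending(ToyFileSpout spout, int maxPending) {
        if (maxPending < 1) throw new IllegalArgumentException("maxPending must be >= 1");
        Queue<String> pending = new ArrayDeque<>();
        int peak = 0;
        while (!spout.exhausted() || !pending.isEmpty()) {
            if (!spout.exhausted() && pending.size() < maxPending) {
                spout.nextTuple(pending);
                peak = Math.max(peak, pending.size());
            } else {
                pending.poll(); // simulate an ack freeing one slot
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h", "i", "j");
        int maxPending = 3;

        // Looping over the whole file inside one call: the limit is never consulted.
        int emitAll = peakPending(new ToyFileSpout(lines, lines.size()), maxPending);
        // One line per call: the driver can pause between calls.
        int emitOne = peakPending(new ToyFileSpout(lines, 1), maxPending);

        System.out.println("emit everything per call: peak pending = " + emitAll); // 10
        System.out.println("emit one line per call:   peak pending = " + emitOne); // 3
    }
}
```

In this model the overshoot past the limit is bounded by the batch size minus one, which is the reasoning behind keeping batches small relative to max spout pending.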