Best Spring Batch scaling strategy

user509755 · Mar 17, 2015 · Viewed 11.8k times

We have simple batch processes that are working fine. Recently we got a new requirement to implement a batch process that generates reports. The reports are prepared from several different data sources; specifically, we might have one view for each report.

Now we want to scale this process so that it completes as early as possible.

I am familiar with the multi-threaded step, but I am not sure about the other strategies (remote chunking and partitioned steps) or when to use each one.

In our case, processing plus writing to a file is more resource intensive than reading.

Which approach is best suited to such a case?

Or, if we find that reading data from the database is as resource intensive as processing plus writing to a file, what is the best option to improve or scale this process?

Answer

FGreg · Mar 17, 2015

TL;DR

Based on your description, I think you could try a Multi-threaded Step with Synchronized Reader, since you mention that processing and writing are the more expensive parts of your step.

However, seeing as your reader is a database, I think getting a partitioned step configured and working would be very beneficial. It takes a little more work to get set up but will scale better in the long run.

Multi-threaded Step

Use For:

  • Speeding up an individual step
  • When load balancing can be handled by the reader (e.g. JMS or AMQP)
  • When using custom reader that manually partitions the data being read

Don't Use For:

  • Stateful item readers

Multi-threaded steps utilize the chunk-oriented processing employed by Spring Batch. When you multi-thread a step, Spring Batch executes each chunk in its own thread. Note that this means the entire read-process-write cycle for your chunks of data will occur in parallel, so there is no guaranteed order for processing your data. Also note that this will not work with stateful ItemReaders (JdbcCursorItemReader and JdbcPagingItemReader are both stateful).
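To make the "whole chunk per thread" idea concrete, here is a minimal plain-Java sketch. It does not use the Spring Batch API; the class name, the `run` method, and the `"row-"` processing are all made up for illustration, and a thread-safe queue stands in for a reader that can safely be shared between threads.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class MultiThreadedStepSketch {

    /** Runs read-process-write chunks on several threads and returns what was "written". */
    static List<String> run(List<Integer> input, int chunkSize, int threads) {
        // Thread-safe source (an assumption): poll() hands each item to exactly one thread.
        Queue<Integer> reader = new ConcurrentLinkedQueue<>(input);
        List<String> written = Collections.synchronizedList(new ArrayList<>());

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                while (true) {
                    // Read: collect up to chunkSize items.
                    List<Integer> chunk = new ArrayList<>();
                    Integer item;
                    while (chunk.size() < chunkSize && (item = reader.poll()) != null) {
                        chunk.add(item);
                    }
                    if (chunk.isEmpty()) {
                        return; // input exhausted
                    }
                    // Process: transform every item of the chunk.
                    List<String> processed = new ArrayList<>();
                    for (Integer i : chunk) {
                        processed.add("row-" + i);
                    }
                    // Write: hand over the whole chunk at once.
                    written.addAll(processed);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for chunks", e);
        }
        return written;
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 1; i <= 20; i++) {
            input.add(i);
        }
        // The order of the printed rows may differ from run to run.
        System.out.println(run(input, 5, 4));
    }
}
```

Every item is written exactly once, but the order of the output depends on thread scheduling, which is the ordering caveat described above.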

Multi-threaded Step with Synchronized Reader

Use For:

  • Speeding up processing and writing for an individual step
  • When reading is stateful

Don't Use For:

  • Speeding up reading

There is one way around the limitation of not being able to use multi-threaded steps with stateful item readers: you can synchronize their read() method. This essentially causes reads to happen serially (still with no guarantee on order) while still allowing processing and writing to happen in parallel. This can be a good option when reading is not the bottleneck but processing or writing is.
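The wrapper idea can be sketched in plain Java as follows. The `Reader` interface, `CursorReader`, and `SynchronizedReader` are hypothetical stand-ins written for this example, not Spring Batch types; newer Spring Batch releases also include a `SynchronizedItemStreamReader` wrapper that serves the same purpose.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class SynchronizedReaderSketch {

    /** Hypothetical stand-in for an item reader: returns null once input is exhausted. */
    interface Reader<T> {
        T read();
    }

    /** Stateful and NOT thread-safe: the cursor position is plain mutable state. */
    static class CursorReader implements Reader<Integer> {
        private final List<Integer> rows;
        private int position = 0;

        CursorReader(List<Integer> rows) {
            this.rows = rows;
        }

        @Override
        public Integer read() {
            return position < rows.size() ? rows.get(position++) : null;
        }
    }

    /** Wrapper that lets only one thread at a time call the delegate's read(). */
    static class SynchronizedReader<T> implements Reader<T> {
        private final Reader<T> delegate;

        SynchronizedReader(Reader<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public synchronized T read() {
            return delegate.read();
        }
    }

    /** Reads serially through the wrapper, but processes and writes on several threads. */
    static List<Integer> run(List<Integer> rows, int threads) {
        Reader<Integer> reader = new SynchronizedReader<>(new CursorReader(rows));
        List<Integer> written = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                Integer item;
                while ((item = reader.read()) != null) {
                    // "Process" and "write" happen outside the reader's lock.
                    written.add(item * 2);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return written;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 1; i <= 1000; i++) {
            rows.add(i);
        }
        System.out.println(run(rows, 4).size());
    }
}
```

Only the read is serialized, so this helps when the per-item work after the read dominates, and does nothing to speed up the read itself.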

Partitioning

Use For:

  • Speeding up an individual step
  • When reading is stateful
  • When input data can be partitioned

Don't Use For:

  • When input data cannot be partitioned

Partitioning a step behaves slightly differently from a multi-threaded step. With a partitioned step you actually have completely distinct StepExecutions. Each StepExecution works on its own partition of the data. This way the readers do not run into problems reading the same data, because each reader is only looking at a specific slice of the data. This method is extremely powerful but is also more complicated to set up than a multi-threaded step.
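One common way to slice database input is by ranges of a numeric key. The sketch below shows only that range arithmetic in plain Java; the class and method names are invented for this example, and it assumes the data has a reasonably evenly distributed numeric id. In Spring Batch the equivalent logic would live in a `Partitioner`, which returns one named `ExecutionContext` per partition instead of the plain arrays used here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RangePartitionerSketch {

    /**
     * Splits the inclusive id range [minId, maxId] into at most gridSize
     * contiguous, non-overlapping sub-ranges, keyed by partition name.
     * Assumes minId <= maxId and gridSize > 0.
     */
    static Map<String, long[]> partition(long minId, long maxId, int gridSize) {
        Map<String, long[]> partitions = new LinkedHashMap<>();
        long targetSize = (maxId - minId) / gridSize + 1;
        long start = minId;
        int number = 0;
        while (start <= maxId) {
            // The last slice is capped at maxId and may be smaller than the others.
            long end = Math.min(start + targetSize - 1, maxId);
            partitions.put("partition" + number, new long[] {start, end});
            start = end + 1;
            number++;
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Each worker's reader would then be limited to its own slice, for example
        // with a query such as WHERE id BETWEEN ? AND ? (hypothetical column name).
        partition(1, 100, 4).forEach((name, range) ->
                System.out.println(name + ": " + range[0] + ".." + range[1]));
    }
}
```

Because the slices never overlap, each reader can stay stateful without any synchronization between partitions.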

Remote Chunking

Use For:

  • Speeding up processing and writing for an individual step
  • Stateful readers

Don't Use For:

  • Speeding up reading

Remote chunking is very advanced Spring Batch usage. It requires some form of durable middleware to send and receive messages on (e.g. JMS or AMQP). With remote chunking, reading is still single-threaded, but as each chunk is read it is sent to another JVM for processing. In practice this is very similar to how a multi-threaded step works; however, remote chunking can utilize more than one process as opposed to more than one thread. This means that remote chunking allows you to scale your application horizontally as opposed to vertically. (TBH I think if you are thinking about implementing remote chunking, you should consider taking a look at something like Hadoop.)
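The shape of that message flow can be imitated inside a single JVM. In the rough sketch below an in-memory `BlockingQueue` stands in for the middleware and threads stand in for the remote worker processes, so it shows only the hand-off pattern, not real remote chunking; all names are invented for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class RemoteChunkingSketch {

    // An empty chunk tells a worker that no more work is coming.
    private static final List<Integer> STOP = Collections.emptyList();

    /** Assumes chunkSize > 0 and workers > 0. */
    static List<String> run(List<Integer> input, int chunkSize, int workers) {
        // Stand-in for the durable middleware (a JMS or AMQP queue in practice).
        BlockingQueue<List<Integer>> queue = new LinkedBlockingQueue<>();
        List<String> written = Collections.synchronizedList(new ArrayList<>());

        // Worker side: take a chunk, process it, write it. Threads here, JVMs in reality.
        List<Thread> threads = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            Thread worker = new Thread(() -> {
                try {
                    List<Integer> chunk;
                    while (!(chunk = queue.take()).isEmpty()) {
                        List<String> processed = new ArrayList<>();
                        for (Integer item : chunk) {
                            processed.add("row-" + item);
                        }
                        written.addAll(processed);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.start();
            threads.add(worker);
        }

        try {
            // Master side: reading stays single-threaded; only whole chunks are sent.
            for (int from = 0; from < input.size(); from += chunkSize) {
                int to = Math.min(from + chunkSize, input.size());
                queue.put(new ArrayList<>(input.subList(from, to)));
            }
            for (int w = 0; w < workers; w++) {
                queue.put(STOP); // one stop signal per worker
            }
            for (Thread worker : threads) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while dispatching chunks", e);
        }
        return written;
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 1; i <= 20; i++) {
            input.add(i);
        }
        System.out.println(run(input, 5, 3).size());
    }
}
```

In a real setup the queue would be durable and the workers would run in separate processes, which is what brings in the extra operational complexity mentioned above.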

Parallel Step

Use For:

  • Speeding up overall job execution
  • When there are independent steps that don't rely on each other

Don't Use For:

  • Speeding up step execution
  • Dependent steps

Parallel steps are useful when you have two or more steps that can execute independently. Spring Batch can easily execute such steps in parallel in separate threads.
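For a job like the one in the question, where each report comes from its own view, the idea might look like the plain-Java sketch below. The step names `reportA`, `reportB`, and `summary` are invented for illustration. In Spring Batch itself this is typically configured as a split flow with a task executor rather than coded by hand.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ParallelStepsSketch {

    /** Runs two independent "steps" in parallel, then a step that depends on both. */
    static List<String> runJob() {
        List<String> completed = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Independent steps: neither needs the other's output.
            CompletableFuture<Void> reportA =
                    CompletableFuture.runAsync(() -> completed.add("reportA"), pool);
            CompletableFuture<Void> reportB =
                    CompletableFuture.runAsync(() -> completed.add("reportB"), pool);

            // Dependent step: must wait until both reports are finished.
            CompletableFuture.allOf(reportA, reportB).join();
            completed.add("summary");
        } finally {
            pool.shutdown();
        }
        return completed;
    }

    public static void main(String[] args) {
        // "summary" is always last; the order of the two reports may vary.
        System.out.println(runJob());
    }
}
```

This shortens the job as a whole but does not make any single step faster, which is why it combines well with the per-step strategies above.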