Deciding between Spring Batch Step, Tasklet or Chunks

Vimal · Jun 17, 2013 · Viewed 30.3k times

I have a straightforward requirement: I need to read a list of items from a DB, process them, and write the processed results back to the DB.

I'm thinking of using a Spring Batch chunk-oriented step with a reader, processor and writer. My reader will return one item at a time from the list and pass it to the processor; once processing is over, the item goes to the writer, which updates the DB.

I may multithread it later, at some synchronization cost in these methods.

Here I foresee a few concerns.

  1. The number of items to process could be large, maybe in the tens of thousands or more.
  2. Some logical calculation is required in the processor, hence processing one item at a time. I'm not sure about the performance, even if it is multithreaded with 10 threads.
  3. The writer can update the DB with the result for a processed item, but I'm not sure how to do batch updates because it always has only one processed item ready.

Is this approach correct for this kind of use case, or can something better be done? Is there any other way of processing a bunch of items in one call of the reader, processor and writer? If so, do I need to create some mechanism where I extract, say, 10 items from the list and give them to the processor? It seems the writer updates each record as it comes, and batch updates make sense only if the writer receives a bunch of processed items. Any suggestions?

Please shed some light on this design for better performance.

Thanks,

Answer

Cygnusx1 · Jun 17, 2013

Spring Batch is the perfect tool to do what you need.

The chunk-oriented step lets you configure how many items to read, process and write per transaction with the commit-interval property.

    <batch:step id="step1" next="step2">
        <batch:tasklet transaction-manager="transactionManager" start-limit="100">
            <batch:chunk reader="myReader" processor="myProcessor" writer="MyWriter" commit-interval="800" />
            <batch:listeners>
                <batch:listener ref="myListener" />
            </batch:listeners>
        </batch:tasklet>
    </batch:step>

Let's say your reader issues a SELECT statement that returns 10,000 records, and you set commit-interval="500" (the configuration above uses 800; 500 is just the value for this example).

Spring Batch will then call the read() method of MyReader 500 times. In practice, the reader implementation typically fetches the next row from the result set on each call. Every item that is read is also passed once to the process() method of MyProcessor.

But it will not call the write() method of MyWriter until the commit-interval is reached.
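The read/process/write sequence described above can be sketched in plain Java. This is only an illustration of the chunking idea, not Spring Batch's actual implementation: the class `ChunkLoopSketch`, the method `run`, and the use of `Supplier`, `Function` and `Consumer` as stand-ins for the reader, processor and writer are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Plain-Java illustration of a chunk-oriented loop.
// NOT Spring Batch code: every name in this class is hypothetical.
class ChunkLoopSketch {

    /**
     * Reads up to commitInterval items, processes each of them, then hands
     * the whole chunk to the writer in a single call.
     * Returns the number of times the writer was called.
     */
    static <I, O> int run(Supplier<I> reader, Function<I, O> processor,
                          Consumer<List<O>> writer, int commitInterval) {
        int writeCalls = 0;
        boolean exhausted = false;
        while (!exhausted) {
            // Read phase: one read per item, null signals end of input.
            List<I> items = new ArrayList<>();
            for (int i = 0; i < commitInterval; i++) {
                I item = reader.get();
                if (item == null) {
                    exhausted = true;
                    break;
                }
                items.add(item);
            }
            // Process phase: one call per item that was read.
            List<O> processed = new ArrayList<>();
            for (I item : items) {
                processed.add(processor.apply(item));
            }
            // Write phase: one call per chunk, skipped for an empty chunk.
            if (!processed.isEmpty()) {
                writer.accept(processed);
                writeCalls++;
            }
        }
        return writeCalls;
    }

    public static void main(String[] args) {
        List<Integer> source = new ArrayList<>();
        for (int i = 1; i <= 1200; i++) {
            source.add(i);
        }
        Iterator<Integer> it = source.iterator();
        List<Integer> chunkSizes = new ArrayList<>();

        Supplier<Integer> reader = () -> it.hasNext() ? it.next() : null;
        Function<Integer, Integer> processor = n -> n * 2;
        Consumer<List<Integer>> writer = chunk -> chunkSizes.add(chunk.size());

        int calls = run(reader, processor, writer, 500);
        System.out.println(calls + " write calls, sizes " + chunkSizes);
    }
}
```

With 1200 input items and a commit interval of 500, the demo prints `3 write calls, sizes [500, 500, 200]`: the writer is called once per chunk, and the last chunk is smaller because the input ran out.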

If you look at the definition of the interface ItemWriter:

    public interface ItemWriter<T> {

        /**
         * Process the supplied data element. Will not be called with any null items
         * in normal operation.
         *
         * @throws Exception if there are errors. The framework will catch the
         * exception and convert or rethrow it as appropriate.
         */
        void write(List<? extends T> items) throws Exception;

    }

You can see that write() receives a List of items. The list has as many items as your commit-interval, or fewer if the end of the input is reached.
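Because the writer gets the whole list at once, a batch update becomes straightforward. Here is a minimal sketch using plain JDBC batching, the kind of logic you might put inside a write() implementation. The `Result` type, the `items` table and its `status` and `id` columns are hypothetical, and the caller is assumed to have prepared the statement from `UPDATE_SQL`. Spring Batch also ships a ready-made `JdbcBatchItemWriter` that may save you from writing this by hand.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Sketch of a JDBC batch update over one chunk of processed items.
// The Result type, table name and column names are hypothetical.
class BatchUpdateSketch {

    // Hypothetical processed item: a row id plus the computed status.
    static final class Result {
        final long id;
        final String status;

        Result(long id, String status) {
            this.id = id;
            this.status = status;
        }
    }

    // Hypothetical statement; the caller prepares it on its own connection.
    static final String UPDATE_SQL = "UPDATE items SET status = ? WHERE id = ?";

    /**
     * Queues one UPDATE per item, then sends them all to the database in a
     * single executeBatch() call. Returns the per-statement update counts.
     */
    static int[] writeChunk(PreparedStatement ps, List<? extends Result> items)
            throws SQLException {
        for (Result r : items) {
            ps.setString(1, r.status);
            ps.setLong(2, r.id);
            ps.addBatch();
        }
        return ps.executeBatch();
    }
}
```

With a commit-interval of 500, a method like this would run once per chunk, so the database sees one batch of up to 500 updates instead of 500 separate statements.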

And by the way, 10,000 records is nothing. You may consider multithreading if you have to deal with millions of records. But even then, just tuning the commit-interval value to find the sweet spot will probably be enough.

Hope it helps