Best strategy for processing large CSV files in Apache Camel

Taka · Nov 14, 2011 · Viewed 16.9k times

I'd like to develop a route that polls a directory containing CSV files and, for every file, unmarshals each row using Bindy and queues it in ActiveMQ.

The problem is that the files can be pretty large (a million rows), so I'd prefer to queue one row at a time. But what I get from Bindy is all the rows in a single java.util.ArrayList, which causes memory problems.

So far I have a little test and the unmarshalling works, so the Bindy configuration using annotations is OK.
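
For reference, the annotated record class looks roughly like this (the class name and fields here are just illustrative, not my actual model):

package com.ess.myapp.core;

import org.apache.camel.dataformat.bindy.annotation.CsvRecord;
import org.apache.camel.dataformat.bindy.annotation.DataField;

@CsvRecord(separator = ",")
public class TrafficRecord {

    // position of the column in the CSV row, starting at 1
    @DataField(pos = 1)
    private String id;

    @DataField(pos = 2)
    private String value;

    // getters and setters omitted for brevity
}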

Here is the route:

from("file://data/inbox?noop=true&maxMessagesPerPoll=1&delay=5000")
  .unmarshal()
  .bindy(BindyType.Csv, "com.ess.myapp.core")           
  .to("jms:rawTraffic");

Environment is: Eclipse Indigo, Maven 3.0.3, Camel 2.8.0

Thank you

Answer

Claus Ibsen · Nov 14, 2011

If you use the Splitter EIP, you can enable streaming mode, which means Camel will process the file row by row:

from("file://data/inbox?noop=true&maxMessagesPerPoll=1&delay=5000")
  .split(body().tokenize("\n")).streaming()
    .unmarshal().bindy(BindyType.Csv, "com.ess.myapp.core")           
    .to("jms:rawTraffic");