I have to process a directory containing about 2 million XML files.
I've already solved the processing itself by distributing the work between machines and threads using queues, and everything works fine.
But now the bottleneck is reading the directory with the 2 million files in order to fill the queues incrementally.
I've tried using the File.listFiles() method, but it gives me a java.lang.OutOfMemoryError: Java heap space. Any ideas?
First of all, is there any possibility of using Java 7? There you have a FileVisitor and Files.walkFileTree, which should work within your memory constraints, since the tree is walked one entry at a time instead of being materialized as a huge array.
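A minimal sketch of that approach, assuming (hypothetically) that the XML files live under /data/xml and that handing a path to your queue is represented here by a placeholder println:

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class XmlTreeWalker {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get("/data/xml"); // hypothetical root directory

        // Visits one file at a time; no array of 2 million entries is ever built.
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                if (file.toString().endsWith(".xml")) {
                    // Replace with: hand the path off to your producer/consumer queue.
                    System.out.println(file);
                }
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```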
Otherwise, the only way I can think of is to use File.listFiles(FileFilter filter) with a filter that always returns false (ensuring that the full array of File objects is never built), but that captures the files to be processed as a side effect, perhaps putting them into a producer/consumer queue or writing the file names to disk for later traversal.
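A rough sketch of that trick, again assuming a hypothetical /data/xml directory and a bounded BlockingQueue that your worker threads consume from:

```java
import java.io.File;
import java.io.FileFilter;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueingLister {
    public static void main(String[] args) {
        // Bounded queue so the lister blocks when the consumers fall behind.
        final BlockingQueue<File> queue = new LinkedBlockingQueue<File>(10000);
        File dir = new File("/data/xml"); // hypothetical directory

        dir.listFiles(new FileFilter() {
            @Override
            public boolean accept(File f) {
                if (f.getName().endsWith(".xml")) {
                    try {
                        queue.put(f); // hand the file off to the consumers
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
                return false; // never accept, so the returned array stays empty
            }
        });
    }
}
```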
Alternatively, if you control the names of the files, or if they are named in some convenient way, you could process the files in chunks using a filter that accepts filenames of the form file0000000 to file0001000, then file0001000 to file0002000, and so on, as sketched below.
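A sketch of such a chunking filter, assuming (hypothetically) that the files are named file0000000.xml through file1999999.xml; adjust the parsing to your actual naming scheme:

```java
import java.io.File;
import java.io.FilenameFilter;

public class ChunkFilter implements FilenameFilter {
    private final int lo; // inclusive lower bound of the chunk
    private final int hi; // exclusive upper bound of the chunk

    public ChunkFilter(int lo, int hi) {
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    public boolean accept(File dir, String name) {
        if (!name.startsWith("file") || !name.endsWith(".xml")) {
            return false;
        }
        try {
            // Extract the numeric part between "file" and ".xml".
            int n = Integer.parseInt(name.substring(4, name.length() - 4));
            return n >= lo && n < hi;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        File dir = new File("/data/xml"); // hypothetical directory
        for (int start = 0; start < 2000000; start += 1000) {
            // Each call returns at most 1000 files, which fits comfortably in memory.
            File[] chunk = dir.listFiles(new ChunkFilter(start, start + 1000));
            // process this chunk, then move on to the next range
        }
    }
}
```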
If the files are not named in such a convenient way, you could instead filter on the hash code of the file name, which should be fairly evenly distributed over the set of integers, and process one hash bucket at a time.
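A sketch of that variant, splitting the 2 million files into a hypothetical 2000 buckets (about 1000 files each on average) and listing one bucket per pass:

```java
import java.io.File;
import java.io.FilenameFilter;

public class HashBucketFilter implements FilenameFilter {
    private final int buckets; // total number of passes over the directory
    private final int bucket;  // which bucket this pass accepts

    public HashBucketFilter(int buckets, int bucket) {
        this.buckets = buckets;
        this.bucket = bucket;
    }

    @Override
    public boolean accept(File dir, String name) {
        int h = name.hashCode() % buckets;
        if (h < 0) {
            h += buckets; // hashCode() can be negative
        }
        return h == bucket;
    }

    public static void main(String[] args) {
        File dir = new File("/data/xml"); // hypothetical directory
        int buckets = 2000;
        for (int b = 0; b < buckets; b++) {
            File[] batch = dir.listFiles(new HashBucketFilter(buckets, b));
            // process this batch, then move on to the next bucket
        }
    }
}
```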