Binary search in a sorted (memory-mapped?) file in Java

sds · Apr 10, 2009 · Viewed 17.1k times

I am struggling to port a Perl program to Java, learning Java as I go. A central component of the original program is a Perl module that does string prefix lookups in a 500+ GB sorted text file using binary search (essentially, "seek" to a byte offset in the middle of the file, backtrack to the nearest newline, compare the line's prefix with the search string, then "seek" to the middle of the upper or lower half of the remaining range, and repeat until found...)
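For reference, here is a minimal sketch of that lookup strategy in plain Java, using java.io.RandomAccessFile. The class and method names are mine, the byte-at-a-time backtracking is deliberately naive, and it assumes ASCII-compatible keys (RandomAccessFile.readLine decodes bytes as ISO-8859-1), so treat it as illustrative rather than production code:

import java.io.IOException;
import java.io.RandomAccessFile;

public class SortedFilePrefixSearch {

    // Return the first line starting with `prefix`, or null if none exists.
    // Assumes the file is sorted line-by-line.
    public static String lookup(RandomAccessFile file, String prefix) throws IOException {
        long low = 0, high = file.length();
        while (low < high) {
            long mid = low + (high - low) / 2;
            long lineStart = startOfLine(file, mid);
            file.seek(lineStart);
            String line = file.readLine();
            if (line == null || line.compareTo(prefix) < 0) {
                low = file.getFilePointer();  // line sorts before prefix: search later lines
            } else {
                high = lineStart;             // line >= prefix: this line is still a candidate
            }
        }
        file.seek(low);
        String line = file.readLine();
        return (line != null && line.startsWith(prefix)) ? line : null;
    }

    // Backtrack from an arbitrary offset to the start of the containing line.
    // Naive one-byte-per-read scan; a buffered backward read would be faster.
    private static long startOfLine(RandomAccessFile file, long pos) throws IOException {
        while (pos > 0) {
            file.seek(pos - 1);
            if (file.read() == '\n') break;  // previous byte ends the prior line
            pos--;
        }
        return pos;
    }
}

The loop converges on the offset of the first line that sorts greater than or equal to the prefix, so the same search also gives you a starting point for scanning all lines sharing that prefix.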

I have experimented with several database solutions but found that nothing beats this in sheer lookup speed with data sets of this size. Do you know of any existing Java library that implements such functionality? Failing that, could you point me to some idiomatic example code that does random access reads in text files?

Alternatively, I am not familiar with the new (?) Java I/O libraries, but would it be an option to memory-map the 500 GB text file (I'm on a 64-bit machine with memory to spare) and do binary search on the memory-mapped byte array? I would be very interested to hear any experiences you have to share about this and similar problems.

Answer

Stu Thompson · Apr 10, 2009

I am a big fan of Java's MappedByteBuffers for situations like this; they are blazing fast. Below is a snippet I put together for you that maps a buffer to the file, seeks to the middle, and then searches backwards to a newline character. This should be enough to get you going?

I have similar code (seek, read, repeat until done) in my own application. I benchmarked java.io streams against MappedByteBuffer in a production environment and posted the results on my blog (Geekomatic posts tagged 'java.nio'), with raw data, graphs and all.

Two-second summary? My MappedByteBuffer-based implementation was about 275% faster. YMMV.

To work with files larger than ~2 GB, which is a problem because a single MappedByteBuffer is limited to Integer.MAX_VALUE bytes (its .position(int pos) takes an int), I've crafted a paging algorithm backed by a list of MappedByteBuffers. You'll need to be working on a 64-bit system for this to work with files larger than 2-4 GB, because MBBs use the OS's virtual memory system to work their magic.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

import static java.nio.channels.FileChannel.MapMode.READ_ONLY;

public class StusMagicLargeFileReader {
    private static final long PAGE_SIZE = Integer.MAX_VALUE;
    private final List<MappedByteBuffer> buffers = new ArrayList<MappedByteBuffer>();
    private final byte raw[] = new byte[1];

    public static void main(String[] args) throws IOException {
        File file = new File("/Users/stu/test.txt");
        FileChannel fc = (new FileInputStream(file)).getChannel();
        StusMagicLargeFileReader buffer = new StusMagicLargeFileReader(fc);
        long position = file.length() / 2;
        String candidate = buffer.getString(position--);
        while (position >= 0 && !candidate.equals("\n"))  // compare with the String "\n", not the char '\n'
            candidate = buffer.getString(position--);
        //have newline position or start of file...do other stuff
    }

    StusMagicLargeFileReader(FileChannel channel) throws IOException {
        long start = 0, length = 0;
        for (long index = 0; start + length < channel.size(); index++) {
            if ((channel.size() / PAGE_SIZE) == index)
                length = (channel.size() - index * PAGE_SIZE);  // last, possibly partial, page
            else
                length = PAGE_SIZE;
            start = index * PAGE_SIZE;
            buffers.add(channel.map(READ_ONLY, start, length)); // pages are appended in order
        }
    }

    public String getString(long bytePosition) {
        int page  = (int) (bytePosition / PAGE_SIZE);   // which mapped page
        int index = (int) (bytePosition % PAGE_SIZE);   // offset within that page
        raw[0] = buffers.get(page).get(index);
        return new String(raw);                         // one byte, platform default charset
    }
}
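For completeness, here is one hypothetical way to build on getString once the backward scan in main has found a line start; readLineAt is my own illustrative helper, not part of the class above:

// Hypothetical helper: reads forward from lineStart, one byte per
// getString() call, collecting characters up to the next newline.
static String readLineAt(StusMagicLargeFileReader reader, long lineStart, long fileLength) {
    StringBuilder sb = new StringBuilder();
    for (long pos = lineStart; pos < fileLength; pos++) {
        String ch = reader.getString(pos);
        if (ch.equals("\n")) break;   // end of line reached
        sb.append(ch);
    }
    return sb.toString();
}

Byte-at-a-time reads are slow, of course; in practice you'd pull a chunk from the page at a time, but this keeps the example aligned with the snippet above.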