Why the odd performance curve differential between ByteBuffer.allocate() and ByteBuffer.allocateDirect()

Stu Thompson · Sep 6, 2010 · Viewed 11.6k times

I'm working on some SocketChannel-to-SocketChannel code which will do best with a direct byte buffer--long lived and large (tens to hundreds of megabytes per connection.) While hashing out the exact loop structure with FileChannels, I ran some micro-benchmarks on ByteBuffer.allocate() vs. ByteBuffer.allocateDirect() performance.

There was a surprise in the results that I can't really explain. In the graph below, there is a very pronounced cliff at 256KB and 512KB for the ByteBuffer.allocate() transfer implementation--the performance drops by ~50%! There also seems to be a smaller performance cliff for ByteBuffer.allocateDirect(). (The %-gain series helps to visualize these changes.)

[Graph: Buffer Size (bytes) versus Time (ms)]

[Graph: The Pony Gap]

Why the odd performance curve differential between ByteBuffer.allocate() and ByteBuffer.allocateDirect()? What exactly is going on behind the curtain?

It may very well be hardware and OS dependent, so here are those details:

  • MacBook Pro w/ Dual-core Core 2 CPU
  • Intel X25M SSD drive
  • OSX 10.6.4

Source code, by request:

package ch.dietpizza.bench;

import static java.lang.String.format;
import static java.lang.System.out;
import static java.nio.ByteBuffer.*;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.UnknownHostException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

public class SocketChannelByteBufferExample {
    private static WritableByteChannel target;
    private static ReadableByteChannel source;
    private static ByteBuffer          buffer;

    public static void main(String[] args) throws IOException, InterruptedException {
        long timeDirect;
        long normal;
        out.println("start");

        for (int i = 512; i <= 1024 * 1024 * 64; i *= 2) {
            buffer = allocateDirect(i);
            timeDirect = copyShortest();

            buffer = allocate(i);
            normal = copyShortest();

            out.println(format("%d, %d, %d", i, normal, timeDirect));
        }

        out.println("stop");
    }

    private static long copyShortest() throws IOException, InterruptedException {
        int result = 0;
        for (int i = 0; i < 100; i++) {
            int single = copyOnce();
            result = (i == 0) ? single : Math.min(result, single);
        }
        return result;
    }


    private static int copyOnce() throws IOException, InterruptedException {
        initialize();

        long start = System.currentTimeMillis();

        while (source.read(buffer) != -1) {
            buffer.flip();
            target.write(buffer);
            buffer.clear();  //pos = 0, limit = capacity
        }

        long time = System.currentTimeMillis() - start;

        rest();

        return (int)time;
    }   


    private static void initialize() throws UnknownHostException, IOException {
        InputStream  is = new FileInputStream(new File("/Users/stu/temp/robyn.in"));//315 MB file
        OutputStream os = new FileOutputStream(new File("/dev/null"));

        target = Channels.newChannel(os);
        source = Channels.newChannel(is);
    }

    private static void rest() throws InterruptedException {
        System.gc();
        Thread.sleep(200);      
    }
}

Answer

bestsss · Jun 12, 2012

How ByteBuffer works, and why direct (byte) buffers are the only truly useful ones now.

First, I am a bit surprised this is not common knowledge, but bear with me.

Direct byte buffers allocate an address outside the java heap.

This is of utmost importance: all OS (and native C) functions can use that address without locking the object on the heap and copying the data. A short example of the copying: in order to send any data via Socket.getOutputStream().write(byte[]), the native code has to "lock" the byte[], copy it outside the Java heap, and then call the OS function, e.g. send. The copy is performed either on the stack (for smaller byte[]) or via malloc/free for larger ones. DatagramSockets are no different and they also copy - except they are limited to 64KB and allocated on the stack, which can even kill the process if the thread stack is not large enough or the call is deep in recursion. Note: the locking prevents the JVM/GC from moving/reallocating the object around the heap.
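The difference is observable from plain Java: a minimal sketch contrasting the two allocation paths, showing that a heap buffer exposes a GC-managed byte[] while a direct buffer does not:

```java
import java.nio.ByteBuffer;

public class DirectVsHeap {
    public static void main(String[] args) {
        // Heap buffer: backed by a byte[] inside the Java heap;
        // the GC may move it, so native I/O must lock-and-copy it.
        ByteBuffer heap = ByteBuffer.allocate(1024);

        // Direct buffer: memory lives outside the Java heap at a
        // stable native address, so the OS can read/write it in place.
        ByteBuffer direct = ByteBuffer.allocateDirect(1024);

        System.out.println(heap.isDirect());    // false
        System.out.println(heap.hasArray());    // true - heap byte[] is reachable
        System.out.println(direct.isDirect());  // true
        System.out.println(direct.hasArray());  // false - no heap array behind it
    }
}
```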

So with the introduction of NIO, the idea was to avoid the copy and the multitudes of stream pipelining/indirection - often there are 3-4 buffered types of streams before the data reaches its destination. (Yay, Poland equalizes(!) with a beautiful shot.) By introducing direct buffers, Java can communicate straight with native C code without any locking/copying necessary: the send function can take the address of the buffer plus the position, and the performance is much the same as native C. That's the direct buffer.

The main issue with direct buffers: they are expensive to allocate, expensive to deallocate, and quite cumbersome to use - nothing like byte[].

Non-direct buffers do not offer the true essence the direct buffers do - i.e. the direct bridge to the native/OS code. Instead they are lightweight and share exactly the same API - and even more, they can wrap a byte[], and their backing array is available for direct manipulation - what's not to love? Well, they have to be copied!
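The wrapping mentioned above can be sketched in a few lines - wrap() takes an existing array with no copy, and the backing array stays directly accessible:

```java
import java.nio.ByteBuffer;

public class WrapExample {
    public static void main(String[] args) {
        byte[] raw = {1, 2, 3, 4};

        // wrap() creates a heap buffer view over the existing array:
        // no copy is made; the array itself is the backing store.
        ByteBuffer buf = ByteBuffer.wrap(raw);

        // Writes through the buffer are visible in the original array...
        buf.put(0, (byte) 42);
        System.out.println(raw[0]);          // 42

        // ...and the backing array is available for direct manipulation.
        System.out.println(buf.array() == raw);  // true
    }
}
```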

So how do Sun/Oracle handle non-direct buffers, since the OS/native code can't use them? Well, naively. When a non-direct buffer is used, a direct counterpart has to be created. The implementation is smart enough to use a ThreadLocal and cache a few direct buffers via SoftReference* to avoid the hefty cost of creation. The naive part comes when copying them - it attempts to copy the entire buffer (remaining()) each time.

Now imagine: a 512 KB non-direct buffer going to a 64 KB socket buffer - the socket buffer won't take more than its size. So the 1st time 512 KB is copied from non-direct to thread-local-direct, but only 64 KB of it is used. The next time 512-64 KB is copied but only 64 KB used, the third time 512-64×2 KB is copied but only 64 KB used, and so on... and that's optimistic, assuming the socket buffer always empties entirely. So you are not copying just n KB in total, but roughly n × n ÷ m KB (n = 512, m = 16, the average space the socket buffer has left).
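A quick back-of-the-envelope sketch of that copy amplification - the loop below just models the behavior described above (copy everything remaining, drain a fixed chunk per write); the exact total differs from the rough n × n ÷ m figure but is of the same quadratic order:

```java
public class CopyCost {
    public static void main(String[] args) {
        int n = 512;  // non-direct buffer size, KB
        int m = 16;   // assumed average free space in the socket buffer, KB

        // Each write drains ~m KB, but the naive bridge copies
        // everything still remaining() into the thread-local
        // direct buffer on every call.
        long copied = 0;
        for (int remaining = n; remaining > 0; remaining -= m) {
            copied += remaining;
        }
        // Roughly n*n/(2m) KB copied to actually move n KB.
        System.out.println(copied + " KB copied to move " + n + " KB");
        // prints "8448 KB copied to move 512 KB"
    }
}
```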

The copying part is a common/abstract path for all non-direct buffers, so the implementation never knows the target capacity. The copying thrashes the caches, reduces memory bandwidth, and so on.

*A note on SoftReference caching: it depends on the GC implementation and the experience can vary. Sun's GC uses the free heap memory to determine the lifespan of the SoftReferences, which leads to some awkward behavior when they are freed - the application needs to allocate the previously cached objects again, i.e. more allocation. (Direct ByteBuffers take a minor part in the heap, so at least they do not add to the extra cache thrashing, but they do get affected by it.)

My rule of thumb: use a pooled direct buffer sized to match the socket read/write buffer. The OS never copies more than necessary.
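A minimal sketch of that rule of thumb - a small pool of direct buffers sized to the socket's send buffer, so each write hands the OS exactly as much as it can take with no intermediate copy. The pool class and the 64 KB size are illustrative assumptions, not any real API:

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

public class DirectBufferPool {
    private final ArrayDeque<ByteBuffer> pool = new ArrayDeque<>();
    private final int bufferSize;

    public DirectBufferPool(int socketBufferSize) {
        this.bufferSize = socketBufferSize;
    }

    // Reuse a cached direct buffer if available; allocateDirect is
    // expensive, so the pool amortizes the cost across connections.
    public synchronized ByteBuffer acquire() {
        ByteBuffer b = pool.poll();
        return (b != null) ? b : ByteBuffer.allocateDirect(bufferSize);
    }

    public synchronized void release(ByteBuffer b) {
        b.clear();  // reset pos/limit before returning it to the pool
        pool.push(b);
    }

    public static void main(String[] args) {
        // 64 KB is a placeholder; in real code query
        // socket.getSendBufferSize() instead of hard-coding.
        DirectBufferPool pool = new DirectBufferPool(64 * 1024);
        ByteBuffer b = pool.acquire();
        System.out.println(b.isDirect() + " " + b.capacity());  // true 65536
        pool.release(b);
    }
}
```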

This micro-benchmark is mostly a memory-throughput test; the OS will have the file entirely in cache, so it mostly tests memcpy. Once the buffers outgrow the L2 cache, the drop in performance becomes noticeable. Also, running the benchmark like this imposes increasing, accumulating GC collection costs. (rest() will not collect the soft-referenced ByteBuffers.)