Compare Direct and Non-Direct ByteBuffer get/put operations

Asked by user882659 · Jun 24, 2012

Is get/put from a non-direct ByteBuffer faster than get/put from a direct ByteBuffer?

If I have to read/write from a direct ByteBuffer, is it better to first read/write into a thread-local byte array and then (for writes) update the direct ByteBuffer in bulk from the byte array?

Answer

Peter Lawrey · Jun 24, 2012

Is get/put from a non-direct ByteBuffer faster than get/put from a direct ByteBuffer?

If you are comparing a heap buffer with a direct buffer that does not use the native byte order (most systems are little endian, while the default for a direct ByteBuffer is big endian), the performance is very similar.

If you use natively ordered byte buffers, the performance can be significantly better for multi-byte values. For single bytes it makes little difference no matter what you do.
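
To make the distinction concrete, here is a minimal sketch (the class name ByteOrderDemo is just for illustration) showing that a direct buffer defaults to big endian unless you explicitly request the native order:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteOrderDemo {
    public static void main(String[] args) {
        // Direct buffers default to big endian regardless of the platform.
        ByteBuffer defaultOrder = ByteBuffer.allocateDirect(64);
        System.out.println(defaultOrder.order()); // BIG_ENDIAN

        // Requesting the native order enables the fast path on most hardware.
        ByteBuffer nativeOrder = ByteBuffer.allocateDirect(64)
                                           .order(ByteOrder.nativeOrder());
        System.out.println(nativeOrder.order()); // LITTLE_ENDIAN on x86
    }
}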

In HotSpot/OpenJDK, ByteBuffer uses the Unsafe class, and many of its native methods are treated as intrinsics. This is JVM dependent; AFAIK recent versions of the Android VM treat them as intrinsics as well.

If you dump the generated assembly, you can see that the intrinsics in Unsafe are turned into a single machine-code instruction, i.e. they don't have the overhead of a JNI call.
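
For example, on HotSpot the generated code can be inspected with the diagnostic flags below; note that this requires the hsdis disassembler library to be installed, and the main class name here is a placeholder:

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly YourBenchmark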

In fact, if you are into micro-tuning, you may find that most of the time of a ByteBuffer getXxx or putXxx call is spent in the bounds checking, not the actual memory access. For this reason I still use Unsafe directly when I need maximum performance (note: this is discouraged by Oracle).

If I have to read/write from a direct ByteBuffer, is it better to first read/write into a thread-local byte array and then (for writes) update the direct ByteBuffer in bulk from the byte array?

I would hate to see what that is better than. ;) It sounds very complicated.

Often the simplest solutions are better and faster.
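
That said, if you do need to move data through an array, ByteBuffer's bulk get/put copies a whole region in one call, which is far cheaper than a loop of single-element operations. A minimal sketch (the method name copyViaArray and the 8 KB chunk size are assumptions for illustration):

import java.nio.ByteBuffer;

public class BulkCopy {
    // Copies everything remaining in src into dst via a reusable array.
    // Assumes dst has at least src.remaining() bytes free.
    public static void copyViaArray(ByteBuffer src, ByteBuffer dst) {
        byte[] staging = new byte[8 * 1024]; // hypothetical chunk size
        while (src.hasRemaining()) {
            int n = Math.min(staging.length, src.remaining());
            src.get(staging, 0, n); // bulk read from the (direct) source
            dst.put(staging, 0, n); // bulk write into the destination
        }
    }
}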


You can test this yourself with this code.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteBufferBench {
    public static void main(String... args) {
        ByteBuffer bb1 = ByteBuffer.allocateDirect(256 * 1024).order(ByteOrder.nativeOrder());
        ByteBuffer bb2 = ByteBuffer.allocateDirect(256 * 1024).order(ByteOrder.nativeOrder());
        for (int i = 0; i < 10; i++) // repeat so the JIT has a chance to warm up
            runTest(bb1, bb2);
    }

    private static void runTest(ByteBuffer bb1, ByteBuffer bb2) {
        bb1.clear();
        bb2.clear();
        long start = System.nanoTime();
        while (bb2.remaining() > 0)
            bb2.putInt(bb1.getInt()); // one relative get and one relative put
        long time = System.nanoTime() - start;
        int operations = bb1.capacity() / 4 * 2; // a get and a put per int
        System.out.printf("Each putInt/getInt took an average of %.1f ns%n", (double) time / operations);
    }
}

prints

Each putInt/getInt took an average of 83.9 ns
Each putInt/getInt took an average of 1.4 ns
Each putInt/getInt took an average of 34.7 ns
Each putInt/getInt took an average of 1.3 ns
Each putInt/getInt took an average of 1.2 ns
Each putInt/getInt took an average of 1.3 ns
Each putInt/getInt took an average of 1.2 ns
Each putInt/getInt took an average of 1.2 ns
Each putInt/getInt took an average of 1.2 ns
Each putInt/getInt took an average of 1.2 ns

I am pretty sure a JNI call takes longer than 1.2 ns.


To demonstrate that it's not the "JNI" call but the guff around it that causes the delay, you can write the same loop using Unsafe directly.

import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

import sun.misc.Unsafe;
import sun.nio.ch.DirectBuffer;

public class UnsafeBench {
    public static void main(String... args) {
        ByteBuffer bb1 = ByteBuffer.allocateDirect(256 * 1024).order(ByteOrder.nativeOrder());
        ByteBuffer bb2 = ByteBuffer.allocateDirect(256 * 1024).order(ByteOrder.nativeOrder());
        for (int i = 0; i < 10; i++)
            runTest(bb1, bb2);
    }

    private static void runTest(ByteBuffer bb1, ByteBuffer bb2) {
        Unsafe unsafe = getTheUnsafe();
        long start = System.nanoTime();
        // The raw off-heap addresses of the two direct buffers.
        long addr1 = ((DirectBuffer) bb1).address();
        long addr2 = ((DirectBuffer) bb2).address();
        for (int i = 0, len = Math.min(bb1.capacity(), bb2.capacity()); i < len; i += 4)
            unsafe.putInt(addr1 + i, unsafe.getInt(addr2 + i)); // no bounds checks
        long time = System.nanoTime() - start;
        int operations = bb1.capacity() / 4 * 2; // a get and a put per int
        System.out.printf("Each putInt/getInt took an average of %.1f ns%n", (double) time / operations);
    }

    public static Unsafe getTheUnsafe() {
        try {
            // Reflective access to the JDK-internal singleton; discouraged by Oracle.
            Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
            theUnsafe.setAccessible(true);
            return (Unsafe) theUnsafe.get(null);
        } catch (Exception e) {
            throw new AssertionError(e);
        }
    }
}

prints

Each putInt/getInt took an average of 40.4 ns
Each putInt/getInt took an average of 44.4 ns
Each putInt/getInt took an average of 0.4 ns
Each putInt/getInt took an average of 0.3 ns
Each putInt/getInt took an average of 0.3 ns
Each putInt/getInt took an average of 0.3 ns
Each putInt/getInt took an average of 0.3 ns
Each putInt/getInt took an average of 0.3 ns
Each putInt/getInt took an average of 0.3 ns
Each putInt/getInt took an average of 0.3 ns

So you can see that the native call is much faster than you might expect for a JNI call; the main remaining cost is likely the speed of the L2 cache. ;)

All tests were run on an i3 at 3.3 GHz.