ByteBuffer, CharBuffer, String and Charset

mins · Jun 30, 2014 · Viewed 13.6k times

I'm trying to sort out characters, their representation as byte sequences according to character sets, and how to convert from one character set to another in Java, and I'm running into some difficulties.

For instance,

ByteBuffer bybf = ByteBuffer.wrap("Olé".getBytes());

My understanding is that:

  • Strings are always stored as UTF-16 byte sequences in Java (2 bytes per char, big endian)
  • getBytes() returns this same UTF-16 byte sequence
  • wrap() preserves this sequence
  • bybf is therefore a UTF-16 big-endian representation of the string Olé

Thus in this code:

Charset utf16 = Charset.forName("UTF-16");  
CharBuffer chbf = utf16.decode(bybf);  
System.out.println(chbf);  

decode() should

  • Interpret bybf as a UTF-16 string representation
  • "convert" it back to the original string Olé.

In fact, no byte should be altered, since everything is already stored as UTF-16 and the UTF-16 Charset should act as a kind of "neutral operator". However, the result is printed as:

??

How can that be?

Additional question: for the conversion to work, it seems Charset.decode(ByteBuffer bb) requires bb to be a UTF-16 big-endian byte-sequence image of a string. Is that correct?
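
From testing (my own observation, not something I found stated in the docs), the plain "UTF-16" decoder seems to assume big endian when the buffer carries no byte order mark, while "UTF-16BE"/"UTF-16LE" fix the order explicitly. A minimal check:

import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class Utf16ByteOrder {
    public static void main(String[] args) {
        byte[] be = {0, 'O', 0, 'l', 0, (byte) 0xE9}; // "Olé" in UTF-16BE, no BOM

        // Plain "UTF-16" assumes big endian when no byte order mark is present
        System.out.println(Charset.forName("UTF-16").decode(ByteBuffer.wrap(be)));   // Olé
        // "UTF-16BE" fixes the byte order explicitly (and never consumes a BOM)
        System.out.println(Charset.forName("UTF-16BE").decode(ByteBuffer.wrap(be))); // Olé
        // "UTF-16LE" pairs the same bytes the other way round -> wrong characters
        System.out.println(Charset.forName("UTF-16LE").decode(ByteBuffer.wrap(be)));
    }
}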


Edit: following the answers, I did some testing to print a ByteBuffer's content and the chars obtained by decoding it. The bytes [encoded with "Olé".getBytes(charsetName)] are printed on the first line of each group; the following line(s) show the strings obtained by decoding those bytes back [with Charset#decode(ByteBuffer)] using various Charsets.

I also confirmed that the default encoding used when converting a String to a byte[] on a Windows 7 computer is windows-1252.

Default VM encoding: windows-1252  
Sample string: "Olé"  


  getBytes() no CS provided : 79 108 233  <-- default (windows-1252), 1 byte per char
     Decoded as windows-1252: Olé         <-- using the same CS as getBytes()
           Decoded as UTF-16: ??          <-- using another CS (doesn't work, as expected)

  getBytes with windows-1252: 79 108 233  <-- same as getBytes()
     Decoded as windows-1252: Olé

         getBytes with UTF-8: 79 108 195 169  <-- 'é' takes 2 bytes in UTF-8
            Decoded as UTF-8: Olé

        getBytes with UTF-16: 254 255 0 79 0 108 0 233 <-- each char takes 2 bytes in UTF-16
           Decoded as UTF-16: Olé                          (254 255 is the byte order mark)
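
The exact test code is not reproduced here; a sketch of the kind of harness that produces output in this shape (a reconstruction, method names are mine) could be:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class GetBytesTest {
    public static void main(String[] args) {
        String sample = "Olé";
        System.out.println("Default VM encoding: " + Charset.defaultCharset().name());
        System.out.println("Sample string: \"" + sample + "\"");

        // getBytes() with no charset uses the platform default
        dump("getBytes() no CS provided", sample.getBytes(), "windows-1252", "UTF-16");
        dump("getBytes with windows-1252", sample.getBytes(Charset.forName("windows-1252")), "windows-1252");
        dump("getBytes with UTF-8", sample.getBytes(Charset.forName("UTF-8")), "UTF-8");
        dump("getBytes with UTF-16", sample.getBytes(Charset.forName("UTF-16")), "UTF-16");
    }

    // Print the unsigned byte values, then decode them back with each given charset
    static void dump(String label, byte[] bytes, String... charsets) {
        StringBuilder line = new StringBuilder(label + ": ");
        for (byte b : bytes) line.append(b & 0xFF).append(' ');
        System.out.println(line);
        for (String cs : charsets) {
            CharBuffer decoded = Charset.forName(cs).decode(ByteBuffer.wrap(bytes));
            System.out.println("    Decoded as " + cs + ": " + decoded);
        }
    }
}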

Answer

BevynQ · Jun 30, 2014

You are mostly correct.

The native character representation in Java is UTF-16. However, when converting characters to bytes you either specify the charset you are using, or the system uses its default, which has usually been UTF-8 whenever I have checked. This will yield interesting results if you mix and match.

e.g. on my system the following

System.out.println(Charset.defaultCharset().name());
// getBytes() with no argument encodes with the default charset (UTF-8 here)...
ByteBuffer bybf = ByteBuffer.wrap("Olé".getBytes());
Charset utf16 = Charset.forName("UTF-16");
// ...but the bytes are then decoded as UTF-16: the charsets don't match
CharBuffer chbf = utf16.decode(bybf);
System.out.println(chbf);
// Encoding and decoding with the same charset round-trips correctly
bybf = ByteBuffer.wrap("Olé".getBytes(utf16));
chbf = utf16.decode(bybf);
System.out.println(chbf);

produces

UTF-8
佬쎩
Olé

So this part:

  "getBytes() result is this same UTF-16 byte sequence"

is only correct if UTF-16 is the default charset.
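
What happens in the first output above, as far as I can tell, is that the four UTF-8 bytes 79 108 195 169 get paired into the two 16-bit code units 0x4F6C and 0xC3A9, i.e. the characters 佬 and 쎩. A minimal sketch of that mix-up (my example, using the StandardCharsets constants from Java 7):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class MixedCharsets {
    public static void main(String[] args) {
        // Encode with UTF-8: 'O' -> 79, 'l' -> 108, 'é' -> 195 169
        byte[] utf8 = "Olé".getBytes(StandardCharsets.UTF_8);
        // Decode as UTF-16 (no BOM -> big endian): the bytes pair up as
        // 0x4F6C ('佬') and 0xC3A9 ('쎩') instead of the original characters
        System.out.println(StandardCharsets.UTF_16.decode(ByteBuffer.wrap(utf8)));
    }
}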

So either always specify the charset you are using (which is safest, as you will always know what is going on), or always use the default consistently.
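
For example, a small sketch of the "always specify" style (my example; java.nio.charset.StandardCharsets is available since Java 7):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ExplicitCharset {
    public static void main(String[] args) {
        // Encode and decode with the same, explicitly named charset
        byte[] bytes = "Olé".getBytes(StandardCharsets.UTF_8);
        String back = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(back);                   // Olé
        System.out.println(Arrays.toString(bytes)); // [79, 108, -61, -87] (signed bytes)
    }
}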