I'm trying to sort out characters, their representation as byte sequences according to character sets, and how to convert from one character set to another in Java. I'm having some difficulties.
For instance,
ByteBuffer bybf = ByteBuffer.wrap("Olé".getBytes());
My understanding is that:
getBytes() result is this same UTF-16 byte sequence
wrap() maintains this sequence
bybf is therefore a UTF-16 big endian representation of the string Olé
Thus in this code:
Charset utf16 = Charset.forName("UTF-16");
CharBuffer chbf = utf16.decode(bybf);
System.out.println(chbf);
decode() should interpret bybf as a UTF-16 representation of the string Olé. In fact, no byte should be altered, since everything is stored as UTF-16 and the UTF-16 Charset should act as a kind of "neutral operator". However, the result is printed as:
??
How can that be?
Additional question: To convert correctly, it seems Charset.decode(ByteBuffer bb) requires bb to be a UTF-16 big endian byte sequence image of a string. Is that correct?
Edit: Following the answers provided, I did some testing to print the content of a ByteBuffer and the chars obtained by decoding it. In each group, the first line shows the bytes [encoded with "Olé".getBytes(charsetName)], and the following line(s) show the strings obtained by decoding those bytes back [with Charset#decode(ByteBuffer)] using various Charsets.
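I also confirmed that the default encoding used by getBytes() to store a String into a byte[] on my Windows 7 computer is windows-1252. (getBytes() does not switch to UTF-8 for characters outside windows-1252; without an argument it always uses the default charset.)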
Default VM encoding: windows-1252
Sample string: "Olé"
getBytes() no CS provided : 79 108 233 <-- default (windows-1252), 1 byte per char
Decoded as windows-1252: Olé <-- using the same CS as getBytes()
Decoded as UTF-16: ?? <-- using another CS (indeed doesn't work)
getBytes with windows-1252: 79 108 233 <-- same as getBytes()
Decoded as windows-1252: Olé
getBytes with UTF-8: 79 108 195 169 <-- 'é' in UTF-8 uses 2 bytes
Decoded as UTF-8: Olé
getBytes with UTF-16: 254 255 0 79 0 108 0 233 <-- each char uses 2 bytes with UTF-16
Decoded as UTF-16: Olé (254 255 is the byte order mark)
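For reference, here is a minimal sketch of how a dump like the one above could be produced. It is not the exact test program: the class name EncodingDump and the helpers unsigned and roundTrip are made up for illustration, and ISO-8859-1 stands in for windows-1252 because it is guaranteed to exist on every JVM and encodes é as the same single byte 233.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDump {
    // Render bytes as unsigned decimal values separated by spaces.
    public static String unsigned(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(b & 0xFF); // mask so values >= 128 are not printed as negatives
        }
        return sb.toString();
    }

    // Encode with one charset, then decode the bytes with another.
    public static String roundTrip(String s, Charset encodeWith, Charset decodeWith) {
        ByteBuffer bytes = ByteBuffer.wrap(s.getBytes(encodeWith));
        return decodeWith.decode(bytes).toString();
    }

    public static void main(String[] args) {
        // "Olé", escaped so the source file encoding does not matter
        String sample = "Ol\u00e9";

        // Assumption: ISO-8859-1 used in place of windows-1252 (same byte for é)
        System.out.println(unsigned(sample.getBytes(StandardCharsets.ISO_8859_1))); // 79 108 233
        System.out.println(unsigned(sample.getBytes(StandardCharsets.UTF_8)));      // 79 108 195 169
        System.out.println(unsigned(sample.getBytes(StandardCharsets.UTF_16)));     // 254 255 0 79 0 108 0 233
        System.out.println(unsigned(sample.getBytes(StandardCharsets.UTF_16BE)));   // 0 79 0 108 0 233
        System.out.println(unsigned(sample.getBytes(StandardCharsets.UTF_16LE)));   // 79 0 108 0 233 0

        // Matching charsets round-trip; mismatched ones do not.
        System.out.println(roundTrip(sample, StandardCharsets.UTF_16, StandardCharsets.UTF_16).equals(sample));   // true
        System.out.println(roundTrip(sample, StandardCharsets.UTF_16LE, StandardCharsets.UTF_16).equals(sample)); // false
    }
}
```

The last two lines touch on the additional question: Java's "UTF-16" charset honours a byte order mark when decoding and assumes big endian when there is none, so little-endian bytes without a mark come out garbled. UTF-16BE and UTF-16LE fix the byte order and write no mark.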
You are mostly correct.
The native character representation in Java is UTF-16. However, when converting characters to bytes you either specify the charset you are using, or the system uses its default, which has usually been UTF-8 whenever I checked. Mixing and matching charsets between encoding and decoding will yield interesting results.
For example, on my system the following
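import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

// Show which charset getBytes() uses when none is given
System.out.println(Charset.defaultCharset().name());
// Encoded with the default charset but decoded as UTF-16: mismatch
ByteBuffer bybf = ByteBuffer.wrap("Olé".getBytes());
Charset utf16 = Charset.forName("UTF-16");
CharBuffer chbf = utf16.decode(bybf);
System.out.println(chbf);
// Encoded and decoded with the same charset: round-trips correctly
bybf = ByteBuffer.wrap("Olé".getBytes(utf16));
chbf = utf16.decode(bybf);
System.out.println(chbf);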
produces
UTF-8
佬쎩
Olé
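So this part of your understanding is only correct if UTF-16 is the default charset, because getBytes() encodes with the default charset rather than copying the internal UTF-16 representation:
"getBytes() result is this same UTF-16 byte sequence."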
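To see where 佬쎩 comes from, here is a rough sketch that pairs up the UTF-8 bytes by hand, the way Java's UTF-16 decoder does when there is no byte order mark. The class PairBytes and the method pairAsUtf16BE are hypothetical names, and the helper ignores surrogate pairs and odd trailing bytes, so treat it as an illustration rather than a real decoder.

```java
import java.nio.charset.StandardCharsets;

class PairBytes {
    // Combine each pair of bytes into one big-endian 16-bit code unit.
    // Simplification: no surrogate handling, a trailing odd byte is dropped.
    public static String pairAsUtf16BE(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < bytes.length; i += 2) {
            int unit = ((bytes[i] & 0xFF) << 8) | (bytes[i + 1] & 0xFF);
            sb.append((char) unit);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // UTF-8 bytes of "Olé": 4F 6C C3 A9
        byte[] utf8 = "Ol\u00e9".getBytes(StandardCharsets.UTF_8);

        String manual = pairAsUtf16BE(utf8);
        String decoded = new String(utf8, StandardCharsets.UTF_16);

        // Both approaches read the four bytes as two code units: 4F6C and C3A9.
        for (char c : manual.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
        System.out.println(manual.equals(decoded));
    }
}
```

The four bytes that mean O, l, é in UTF-8 are read as the two code units U+4F6C and U+C3A9, which are the two characters printed above.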
So either always specify the charset you are using, which is safest because you will always know what is going on, or always use the default.
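As for converting from one character set to another, one way to do it with explicit charsets is to decode the bytes to chars first and then encode those chars again. This is only a sketch: the class Transcode and its transcode method are illustrative names, and ISO-8859-1 is used as the source charset on the assumption that it behaves like windows-1252 for this string.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Transcode {
    // Convert bytes from one charset to another by going through chars.
    public static byte[] transcode(byte[] input, Charset from, Charset to) {
        CharBuffer chars = from.decode(ByteBuffer.wrap(input)); // bytes -> chars
        ByteBuffer out = to.encode(chars);                      // chars -> bytes
        byte[] result = new byte[out.remaining()];
        out.get(result);
        return result;
    }

    public static void main(String[] args) {
        // "Olé" as single-byte ISO-8859-1: 79 108 233
        byte[] latin1 = {79, 108, (byte) 233};

        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        // Decoding with the charset the bytes are actually in gives the string back.
        String text = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(text.equals("Ol\u00e9")); // true
        System.out.println(utf8.length);             // 4
    }
}
```

Because both charsets are named at every step, the result does not depend on the platform default.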