Is there a drastic difference between UTF-8 and UTF-16

Kraken · Mar 14, 2014 · Viewed 9.3k times

I call a web service that gives me back a response XML with UTF-8 encoding. I checked that in Java using the getAllHeaders() method.

Now, in my Java code, I take that response, do some processing on it, and later pass it on to a different service.

Now, I googled a bit and found out that by default the encoding in Java for strings is UTF-16.

In my response XML, one of the elements had the character É. This got mangled in the post-processing request that I make to a different service.

Instead of sending É, it sent some gibberish. Now I wanted to know, is there really a big difference between these two encodings? And if I wanted to know what É converts to when going from UTF-8 to UTF-16, how can I do that?

Thanks

Answer

Arjun Chaudhary · Mar 14, 2014

Both UTF-8 and UTF-16 are variable-length encodings. However, in UTF-8 a character occupies a minimum of 8 bits (one byte), while in UTF-16 a character occupies a minimum of 16 bits (two bytes).
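
To make the minimum sizes concrete, here is a small comparison in Java (UTF_16BE is used below so the byte counts are not inflated by the byte-order mark that StandardCharsets.UTF_16 adds):

    import java.nio.charset.StandardCharsets;

    public class EncodingSizes {
        public static void main(String[] args) {
            // Plain ASCII letter: 1 byte in UTF-8, 2 bytes in UTF-16.
            System.out.println("A".getBytes(StandardCharsets.UTF_8).length);    // 1
            System.out.println("A".getBytes(StandardCharsets.UTF_16BE).length); // 2

            // É (U+00C9): 2 bytes in UTF-8, 2 bytes in UTF-16.
            System.out.println("\u00C9".getBytes(StandardCharsets.UTF_8).length);    // 2
            System.out.println("\u00C9".getBytes(StandardCharsets.UTF_16BE).length); // 2
        }
    }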

Main UTF-8 pros:

  1. Basic ASCII characters such as digits and unaccented Latin letters occupy one byte, identical to their US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
  2. No null bytes, which allows the use of null-terminated strings; this introduces a great deal of backwards compatibility too (see the snippet after this list).
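
A quick way to see both points, using java.nio.charset.StandardCharsets:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class AsciiCompat {
        public static void main(String[] args) {
            String ascii = "Hello";
            // Point 1: an ASCII string encodes to exactly the same bytes in US-ASCII and UTF-8.
            System.out.println(Arrays.equals(
                    ascii.getBytes(StandardCharsets.US_ASCII),
                    ascii.getBytes(StandardCharsets.UTF_8)));            // true
            // Point 2: none of those bytes is zero, so null-terminated APIs keep working.
            System.out.println(Arrays.toString(ascii.getBytes(StandardCharsets.UTF_8)));
            // [72, 101, 108, 108, 111]
        }
    }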

Main UTF-8 cons:

  1. Many common characters have different lengths, which slows down indexing into a string and calculating its length in characters, as the example below shows.
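
For example, the byte count of a UTF-8 encoded string tells you nothing about the number of characters in it; you have to decode or walk the bytes to find out:

    import java.nio.charset.StandardCharsets;

    public class Utf8Length {
        public static void main(String[] args) {
            String s = "na\u00EFve caf\u00E9";                 // "naïve café"
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            // 'ï' and 'é' each take two bytes in UTF-8, so the byte count is
            // larger than the character count, and the Nth character does not
            // sit at byte offset N.
            System.out.println(utf8.length); // 12
            System.out.println(s.length());  // 10
        }
    }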

Main UTF-16 pros:

  1. Most reasonable characters, like Latin, Cyrillic, Chinese, and Japanese, can be represented with 2 bytes. Unless really exotic characters are needed, this means that the 16-bit subset of UTF-16 can be used as a fixed-length encoding, which speeds up indexing.
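
In Java terms: every character in the Basic Multilingual Plane fits in a single char (one 16-bit code unit), while characters outside it need a surrogate pair:

    public class Utf16Units {
        public static void main(String[] args) {
            String cyrillic = "\u0416";        // Ж, fits in one 16-bit code unit
            String clef = "\uD834\uDD1E";      // U+1D11E MUSICAL SYMBOL G CLEF, needs a surrogate pair
            System.out.println(cyrillic.length());                     // 1
            System.out.println(clef.length());                         // 2 code units...
            System.out.println(clef.codePointCount(0, clef.length())); // ...but 1 code point
        }
    }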

Main UTF-16 cons:

  1. Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.
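
You can see the null bytes directly by encoding an ASCII string as UTF-16:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class Utf16Nulls {
        public static void main(String[] args) {
            byte[] utf16 = "ABC".getBytes(StandardCharsets.UTF_16BE);
            // Every ASCII character gets a leading zero byte: twice the size,
            // and poison for anything that treats 0x00 as a terminator.
            System.out.println(Arrays.toString(utf16)); // [0, 65, 0, 66, 0, 67]
        }
    }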

In general, UTF-16 is usually better for an in-memory representation, while UTF-8 is extremely good for text files and network protocols.
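
As for the É in your response: Java Strings are UTF-16 internally, so the "conversion" happens when you decode the incoming bytes and encode the outgoing ones. A minimal sketch (the hard-coded bytes stand in for whatever your web service client actually returns); the important part is always naming the charset explicitly instead of relying on the platform default:

    import java.nio.charset.StandardCharsets;

    public class ConvertDemo {
        public static void main(String[] args) {
            // É (U+00C9) is 0xC3 0x89 in UTF-8.
            byte[] responseBytes = {(byte) 0xC3, (byte) 0x89};

            // Decode explicitly as UTF-8; the String is held as UTF-16 internally.
            String text = new String(responseBytes, StandardCharsets.UTF_8);

            // Encode explicitly with whatever the downstream service expects.
            byte[] asUtf8  = text.getBytes(StandardCharsets.UTF_8);    // C3 89
            byte[] asUtf16 = text.getBytes(StandardCharsets.UTF_16BE); // 00 C9

            // Gibberish in place of É is the classic sign that UTF-8 bytes were
            // decoded with a legacy single-byte charset somewhere in between.
        }
    }

If the gibberish still shows up with explicit charsets on both sides, check that the outgoing request also declares that charset in its Content-Type header.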