How many bytes does one Unicode character take?

Mar 13, 2011 · Viewed 317.9k times

I am a bit confused about encodings. As far as I know old ASCII characters took one byte per character. How many bytes does a Unicode character require?

I assume that one Unicode character can contain every possible character from any language - am I correct? So how many bytes does it need per character?

And what do UTF-7, UTF-8, UTF-16 etc. mean? Are they different versions of Unicode?

I read the Wikipedia article about Unicode but it is quite difficult for me. I am looking forward to seeing a simple answer.

Answer

paul.ago · Oct 26, 2015

Strangely enough, nobody pointed out how to calculate how many bytes one Unicode character takes. Here is the rule for UTF-8 encoded strings:

Binary    Hex          Comments
0xxxxxxx  0x00..0x7F   Only byte of a 1-byte character encoding
10xxxxxx  0x80..0xBF   Continuation byte: one of 1-3 bytes following the first
110xxxxx  0xC0..0xDF   First byte of a 2-byte character encoding
1110xxxx  0xE0..0xEF   First byte of a 3-byte character encoding
11110xxx  0xF0..0xF7   First byte of a 4-byte character encoding

So the quick answer is: in UTF-8, one character takes 1 to 4 bytes, and the first byte indicates how many bytes the whole sequence occupies.
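To make the rule concrete, here is a minimal Python sketch of the table above. The function name `utf8_sequence_length` is made up for this illustration and is not part of any library. It only looks at the bit pattern of the first byte, so it does not reject lead bytes that fall inside these ranges but never occur in valid UTF-8 (0xC0, 0xC1, and 0xF5..0xF7).

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence occupies, judged by its first byte.

    Hypothetical helper for illustration: it follows the bit patterns in the
    table and does not perform full UTF-8 validation.
    """
    if first_byte < 0x80:   # 0xxxxxxx: the only byte of a 1-byte character
        return 1
    if first_byte < 0xC0:   # 10xxxxxx: continuation byte, cannot start a character
        raise ValueError("continuation byte, not the start of a character")
    if first_byte < 0xE0:   # 110xxxxx: first byte of a 2-byte character
        return 2
    if first_byte < 0xF0:   # 1110xxxx: first byte of a 3-byte character
        return 3
    if first_byte < 0xF8:   # 11110xxx: first byte of a 4-byte character
        return 4
    raise ValueError("not a UTF-8 lead byte")


# Sample characters: U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign),
# U+1F600 (an emoji). Escapes are used so the source file stays pure ASCII.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    # The length predicted from the first byte matches the real encoded length.
    assert utf8_sequence_length(encoded[0]) == len(encoded)
    print(f"U+{ord(ch):04X} takes {len(encoded)} byte(s): {encoded.hex()}")
```

The four sample characters encode to 1, 2, 3 and 4 bytes respectively. If you only need the size and not the reasoning behind it, `len(ch.encode("utf-8"))` gives the same number directly.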