Are all Kanji characters in UTF-8 3 bytes long?

TopCoder · Sep 9, 2010 · Viewed 14.7k times

Can someone please confirm that all Kanji characters in Chinese are 3 bytes long in UTF-8?

Answer

dan04 · Sep 10, 2010

The commonly used Hanzi/Kanji characters are in the "CJK Unified Ideographs" block between U+4E00 and U+9FFF, and take 3 bytes in UTF-8, as do all characters from U+0800 to U+FFFF. (The Japanese Hiragana and Katakana characters also take 3 bytes.)

However, there are also some very rarely used characters in the "CJK Unified Ideographs Extension B" and "CJK Compatibility Ideographs Supplement" blocks. These lie above U+FFFF, so they take 4 bytes in UTF-8.

Also be aware that Chinese text often contains ASCII characters like the digits 0-9, which take only 1 byte each.
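
To check these byte counts yourself, here is a minimal Python sketch. The helper name `utf8_len` and the sample characters are my own choices for illustration, not anything from the question; it simply encodes one character from each block mentioned above and counts the resulting bytes.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes needed to encode `text` in UTF-8."""
    return len(text.encode("utf-8"))


# One illustrative character per block discussed above (sample picks are arbitrary).
samples = {
    "\u6f22": "CJK Unified Ideographs",
    "\u3042": "Hiragana",
    "\u30ab": "Katakana",
    "\U00020000": "CJK Unified Ideographs Extension B",
    "\U0002f800": "CJK Compatibility Ideographs Supplement",
    "7": "ASCII digit",
}

for ch, block in samples.items():
    print(f"U+{ord(ch):04X} ({block}): {utf8_len(ch)} byte(s)")

# Mixed text: four ASCII digits, one ideograph, one ASCII digit, one ideograph.
mixed = "2010\u5e749\u6708"
print(f"{len(mixed)} characters, {utf8_len(mixed)} bytes")
```

So if you need to size a buffer or a database column, it is probably safer to measure the encoded length than to assume 3 bytes per character.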