What is the encoding of Chinese characters on Wikipedia?

laurent · Apr 10, 2011 · Viewed 88.2k times

I was looking at the encoding of Chinese characters on Wikipedia and I'm having trouble figuring out what they are using. For instance "的" is encoded as "%E7%9A%84" (see here). That's three bytes, however none of the encodings described on this page uses three bytes to represent Chinese characters. UTF-8 for instance uses 2 bytes.

I'm basically trying to match these three bytes to an actual character. Any suggestion on what encoding it could be?

Answer

jcomeau_ictx · Apr 10, 2011

>>> c='\xe7\x9a\x84'.decode('utf8')
>>> c
u'\u7684'
>>> print c
的


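It is UTF-8. The Unicode code point for this character, U+7684, fits in 16 bits, but UTF-8 encodes it as three bytes. Every code point from U+0800 to U+FFFF takes three bytes in UTF-8, and that range covers the common Chinese characters; only code points below U+0800 fit in one or two bytes.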
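
The session above is Python 2. As a rough Python 3 sketch of the same idea, the snippet below starts from the percent-encoded string in the question, decodes it with the standard `urllib.parse` module, and then rebuilds the code point by hand from the three-byte UTF-8 bit layout (`1110xxxx 10xxxxxx 10xxxxxx`). The variable names are only illustrative, and it assumes the URL really is UTF-8 percent-encoded, which matches the bytes shown in the question.

```python
from urllib.parse import quote, unquote_to_bytes

encoded = "%E7%9A%84"  # the string from the question

# Step 1: undo the percent-encoding to get the raw bytes.
raw = unquote_to_bytes(encoded)

# Step 2: interpret those bytes as UTF-8.
char = raw.decode("utf-8")
code_point = ord(char)

# Step 3 (by hand): a three-byte UTF-8 sequence carries 4 + 6 + 6 payload bits.
b0, b1, b2 = raw
manual = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

print(char, hex(code_point), hex(manual))

# Going the other way: quote() encodes as UTF-8 by default.
print(quote(char))
```

Both the library decode and the manual bit arithmetic land on U+7684, so the three bytes and the 16-bit code point are two representations of the same character.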