UTF-8: how many bytes are used by languages to represent a visible character?

sid_com · Jan 23, 2013 · Viewed 13.9k times

Is there a table or similar reference that shows how many bytes, on average, different languages need to represent a visible character (glyph) when the text is encoded as UTF-8?

Answer

Celada · Jan 23, 2013

If you want something general, I think you should stick with this:

  • English takes very slightly more than 1 byte per character (there is the occasional non-ASCII character, often punctuation or symbols embedded in the text).
  • Most other languages that use the Latin alphabet take somewhat more than 1 byte per character, but I would be surprised to see an average above, say, 1.5.
  • Languages written in some of the other alphabetic scripts (Greek, Cyrillic, Hebrew, Arabic, etc.) take around 2 bytes per character.
  • East Asian languages take about 3 bytes per character (spacing, control characters, and embedded ASCII pull the average down; characters outside the BMP take 4 bytes each and push it up).

That's all very incomplete, approximate, and non-quantitative.
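As a quick sanity check on the figures in the list above, here is a minimal Python sketch that encodes a few short strings as UTF-8 and divides the byte count by the number of code points. The sample strings and the `bytes_per_char` helper are my own illustrations, not measurements of real text, and a code point is only an approximation of a "visible character".

```python
def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in text."""
    if not text:
        raise ValueError("text must not be empty")
    return len(text.encode("utf-8")) / len(text)


# Illustrative samples only; far too short to represent a language.
samples = {
    "English": "hello world",
    "German": "Straße",          # one 2-byte letter among ASCII letters
    "Greek": "αβγ",              # 2 bytes per letter
    "Japanese": "日本語",         # 3 bytes per character
    "Emoji": "\U0001F600",       # outside the BMP: 4 bytes
}

for name, text in samples.items():
    print(f"{name}: {bytes_per_char(text):.2f} bytes per character")
```

Real text mixes in spaces, digits, and punctuation, which is why the averages in practice land between these per-letter values.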

If you need something more quantitative, I think you will have to research each language individually. I doubt you will find precomputed results out there that already apply to a host of different languages.

If you have a corpus of text for a language, it's easy to calculate the average number of bytes required: encode the text as UTF-8 and divide the byte count by the character count. Start with the "Text corpus" Wikipedia page. It links to at least one good freely available corpus for English, and there might be some available for other languages as well (I didn't hunt through the links to find out).
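The corpus calculation could look something like the following Python sketch. It rests on assumptions of mine: the text is normalized to NFC first, and a "visible character" is approximated as any code point that is not a combining mark (Unicode general category starting with "M"). That is cruder than true grapheme-cluster segmentation, and the function name `avg_bytes_per_visible_char` is made up for this example.

```python
import unicodedata


def avg_bytes_per_visible_char(text: str) -> float:
    """UTF-8 bytes divided by an approximate count of visible characters.

    Assumption: combining marks (category M*) attach to the preceding
    character, so they add bytes but are not counted as characters.
    """
    text = unicodedata.normalize("NFC", text)
    total_bytes = len(text.encode("utf-8"))
    visible = sum(
        1 for ch in text if not unicodedata.category(ch).startswith("M")
    )
    if visible == 0:
        raise ValueError("text contains no countable characters")
    return total_bytes / visible
```

To run it over a corpus file, read the file as text and pass it in, for example `avg_bytes_per_visible_char(Path("corpus.txt").read_text(encoding="utf-8"))` with `from pathlib import Path`, where `corpus.txt` is a placeholder name for whatever corpus you download.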

Incidentally, I don't recommend using this information to truncate the length of a database field, as you indicated (in comments) that you intend to do. If you used a corpus made up of literature to come up with your expected number of bytes per character, you might find that the corpus is not at all representative of the short little text strings that end up in your database, throwing off your expectation. Just fetch the whole database column. Most results will be much shorter than the maximum length, and when they're not, I don't think the optimization is worth it to save a hundred bytes or so.