Does Unicode have a defined maximum number of code points?

user4344762 picture user4344762 · Dec 11, 2014 · Viewed 13.3k times · Source

I have read many articles in order to know what is the maximum number of the Unicode code points, but I did not find a final answer.

I understood that the Unicode code points were minimized to make all of the UTF-8 UTF-16 and UTF-32 encodings able to handle the same number of code points. But what is this number of code points?

The most frequent answer I encountered is that Unicode code points are in the range of 0x000000 to 0x10FFFF (1,114,112 code points) but I have also read in other places that it is 1,112,114 code points. So is there a one number to be given or is the issue more complicated than that?

Answer

Jonathan Leffler picture Jonathan Leffler · Dec 11, 2014

The maximum valid code point in Unicode is U+10FFFF, which makes it a 21-bit code set (but not all 21-bit integers are valid Unicode code points; specifically the values from 0x110000 to 0x1FFFFF are not valid Unicode code points).

This is where the number 1,114,112 comes from: U+0000 .. U+10FFFF is 1,114,112 values.

However, there are also a set of code points that are the surrogates for UTF-16. These are in the range U+D800 .. U+DFFF. This is 2048 code points that are reserved for UTF-16.

1,114,112 - 2,048 = 1,112,064

There are also 66 non-characters. These are defined in part in Corrigendum #9: 34 values of the form U+nFFFE and U+nFFFF (where n is a value 0x00000, 0x10000, … 0xF0000, 0x100000), and 32 values U+FDD0 - U+FDEF. Subtracting those too yields 1,111,998 allocatable characters. There are three ranges reserved for 'private use': U+E000 .. U+F8FF, U+F0000 .. U+FFFFD, and U+100000 .. U+10FFFD. And the number of values actually assigned depends on the version of Unicode you're looking at. You can find information about the latest version at the Unicode Consortium. Amongst other things, the Introduction there says:

The Unicode Standard, Version 7.0, contains 112,956 characters

So only about 10% of the available code points have been allocated.

I can't account for why you found 1,112,114 as the number of code points.

Incidentally, the upper limit U+10FFFF is chosen so that all the values in Unicode can be represented in one or two 2-byte coding units in UTF-16, using one high surrogate and one low surrogate to represent values outside the BMP or Basic Multilingual Plane, which is the range U+0000 .. U+FFFF.