When to use Unicode Normalization Forms NFC and NFD?

string unicode normalization unicode-normalization

Jesse Hallam · Apr 13, 2013 · Viewed 10.4k times · Source

The Unicode Normalization FAQ includes the following paragraph:

Programs should always compare canonical-equivalent Unicode strings as equal ... The Unicode Standard provides well-defined normalization forms that can be used for this: NFC and NFD.

and continues...

The choice of which to use depends on the particular program or system. NFC is the best form for general text, since it is more compatible with strings converted from legacy encodings. ... NFD and NFKD are most useful for internal processing.

My questions are:

What makes NFC best for "general text." What defines "internal processing" and why is it best left to NFD? And finally, never minding what is "best," are the two forms interchangable as long as two strings are compared using the same normalization form?

Answer

The FAQ is somewhat misleading, starting from its use of “should” followed by the inconsistent use of “requirement” about the same thing. The Unicode Standard itself (cited in the FAQ) is more accurate. Basically, you should not expect programs to treat canonically equivalent strings as different, but neither should you expect all programs to treat them as identical.

In practice, it really depends on what your software needs to do. In most situations, you don’t need to normalize at all, and normalization may destroy essential information in the data.

For example, U+0387 GREEK ANO TELEIA (·) is defined as canonical equivalent to U+00B7 MIDDLE DOT (·). This was a mistake, as the characters are really distinct and should be rendered differently and treated differently in processing. But it’s too late to change that, since this part of Unicode has been carved into stone. Consequently, if you convert data to NFC or otherwise discard differences between canonically equivalent strings, you risk getting wrong characters.

There are risks that you take by not normalizing. For example, the letter “ä” can appear as a single Unicode character U+00E4 LATIN SMALL LETTER A WITH DIAERESIS or as two Unicode characters U+0061 LATIN SMALL LETTER A U+0308 COMBINING DIAERESIS. It will mostly be the former, i.e. the precomposed form, but if it is the latter and your code tests for data containing “ä”, using the precomposed form only, then it will not detect the latter. But in many cases, you don’t do such things but simply store the data, concatenate strings, print them, etc. Then there is a risk that the two representations result in somewhat different renderings.

It also matters whether your software passes character data to other software somehow. The recipient might expect, due to naive implicit assumptions or consciously and in a documented manner, that its input is normalized.

When to use Unicode Normalization Forms NFC and NFD?

Answer

Related questions