Set of unambiguous-looking letters & numbers for user input

Brian M. Hunt · Aug 12, 2012 · Viewed 18.5k times

Is there an existing subset of the alphanumerics that is easier to read? In particular, is there a subset with fewer visually ambiguous characters, so that removing (or equating) certain characters reduces human error?

I know "visually ambiguous" is a somewhat vague expression, but it is fairly evident that D, O and 0 are all similar, and that 1 and I are also similar. I would like to keep the set of alphanumerics as large as possible while minimizing the number of characters that are likely to be misread.

The only precedent I am aware of for such a set is the Canadian postal code system, which omits the letters D, F, I, O, Q, and U; that subset was created to aid the postal system's OCR process.

My initial thought is to use only capital letters and numbers as follows:

A
B = 8
C = G
D = 0 = O = Q
E = F
H
I = J = L = T = 1 = 7
K = X
M
N
P
R
S = 5
U = V = Y
W
Z = 2
3
4
6
9
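
As a rough illustration of what equating characters could mean in practice, here is a minimal Python sketch of the list above. The names GROUPS, CANONICAL and canonicalize are made up for this example, and using the first character of each group as the canonical form is an arbitrary assumption.

```python
# Equivalence groups copied from the list above. The first character of
# each group is treated as canonical (an arbitrary choice for this sketch).
GROUPS = [
    "A", "B8", "CG", "D0OQ", "EF", "H", "IJLT17", "KX", "M", "N",
    "P", "R", "S5", "UVY", "W", "Z2", "3", "4", "6", "9",
]

# Map every character to the canonical character of its group.
CANONICAL = {ch: group[0] for group in GROUPS for ch in group}


def canonicalize(text: str) -> str:
    """Upper-case the input and collapse each character to its group's canonical form.

    Raises KeyError for anything that is not an ASCII letter or digit.
    """
    return "".join(CANONICAL[ch] for ch in text.upper())


print(len(GROUPS))           # 20 distinguishable symbols remain
print(canonicalize("q1z5"))  # DIZS
```

Two inputs that differ only by characters in the same group then compare equal after canonicalization, at the cost of shrinking the alphabet from 36 symbols to 20.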

This problem may be difficult to separate from the given typeface. The distinctiveness of the characters in the chosen typeface could significantly affect the potential visual ambiguity of any two characters, but I expect that in most modern typefaces the characters equated above will look similar enough to warrant equating them.

I would be grateful for thoughts on the above – are the above equations suitable, or perhaps are there more characters that should be equated? Would lowercase characters be more suitable?

Answer

trepidacious · Dec 13, 2014

I needed a replacement for hexadecimal (base 16) for similar reasons (e.g. for encoding a key). The best I could come up with is the following set of 16 characters, which can be used in place of the hexadecimal digits:

0 1 2 3 4 5 6 7 8 9 A B C D E F     Hexadecimal
H M N 3 4 P 6 7 R 9 T W C X Y F     Replacement

In the replacement set, we consider the following:

All characters used have major distinguishing features that would only be omitted in a truly awful font.

Vowels A E I O U omitted to avoid accidentally spelling words.

Sets of characters that could potentially be very similar or identical in some fonts are avoided completely (none of the characters in any set are used at all):

0 O D Q 
1 I L J
8 B 
5 S
2 Z

By avoiding these characters completely, the hope is that the user will enter the correct characters, rather than trying to correct mis-entered characters.

For sets of less similar but potentially confusing characters, we only use one character in each set, hopefully the most distinctive:

Y U V 

Here Y is used, since it always has the lower vertical section, and has a serif in serif fonts.

C G         

Here C is used, since it seems less likely that a C would be entered as G than vice versa.

X K         

Here X is used, since it is more consistent in most fonts.

F E         

Here F is used, since it is not a vowel.

In the case of these similar sets, entry of any character in the set could be automatically converted to the one that is actually used (the first one listed in each set). Note that E must not be automatically converted to F if hexadecimal input might be used (see below).
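
A minimal Python sketch of that input conversion, assuming case-insensitive input and that anything outside the allowed characters should be rejected. The names SIMILAR, normalize and the mixed_hex flag are made up for this example.

```python
REPLACEMENT = "HMN34P67R9TWCXYF"

# Less-similar look-alikes, folded into the one character that is actually used.
SIMILAR = {"U": "Y", "V": "Y", "G": "C", "K": "X", "E": "F"}


def normalize(text: str, mixed_hex: bool = False) -> str:
    """Upper-case the input and fold look-alikes into the replacement set.

    With mixed_hex=True, hexadecimal digits are also accepted, so E is
    left alone instead of being folded into F.
    """
    folds = dict(SIMILAR)
    allowed = set(REPLACEMENT)
    if mixed_hex:
        del folds["E"]  # E is a genuine hex digit in this mode
        allowed |= set("0123456789ABCDEF")

    result = []
    for ch in text.upper():
        ch = folds.get(ch, ch)
        if ch not in allowed:
            raise ValueError(f"unexpected character: {ch!r}")
        result.append(ch)
    return "".join(result)


print(normalize("hvgke"))                 # HYCXF
print(normalize("e9", mixed_hex=True))    # E9
```

Characters from the avoided sets (such as 0 or S) are rejected rather than guessed at, in keeping with the aim of having the user enter the correct character.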

Note that there are still similar-sounding letters in the replacement set; this is pretty much unavoidable. When reading aloud, a phonetic alphabet should be used.

Where characters that are also present in standard hexadecimal are used in the replacement set, they are used for the same base-16 value. In theory mixed input of hexadecimal and replacement characters could be supported, provided E is not automatically converted to F.

Since this is just a character replacement, it should be easy to convert to/from hexadecimal.
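
For example, a conversion along these lines could be sketched in Python with a pair of translation tables. The function names encode and decode are made up for this example, and rejecting characters outside the replacement set is an assumption (mixed hexadecimal input is not handled here).

```python
HEX = "0123456789ABCDEF"
REPLACEMENT = "HMN34P67R9TWCXYF"

# One-to-one character tables in both directions.
TO_REPLACEMENT = str.maketrans(HEX, REPLACEMENT)
FROM_REPLACEMENT = str.maketrans(REPLACEMENT, HEX)


def encode(data: bytes) -> str:
    """Render bytes as upper-case hex, then swap in the replacement characters."""
    return data.hex().upper().translate(TO_REPLACEMENT)


def decode(text: str) -> bytes:
    """Swap replacement characters back to hex digits and parse the bytes."""
    text = text.upper()
    unknown = set(text) - set(REPLACEMENT)
    if unknown:
        raise ValueError(f"not in the replacement set: {sorted(unknown)}")
    return bytes.fromhex(text.translate(FROM_REPLACEMENT))


print(encode(b"\x00\xff\x1a"))   # HHFFMT
print(decode("hhffmt"))          # b'\x00\xff\x1a'
```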

Upper case seems best as the "canonical" form for output. Lower case also looks reasonable; the only pair of some concern is "h" and "n", which should still be relatively clear in most fonts:

h m n 3 4 p 6 7 r 9 t w c x y f

Input can of course be case-insensitive.

There are several similar systems for base 32; see http://en.wikipedia.org/wiki/Base32. However, these necessarily introduce more similar-looking characters, in return for 25% more information per character (5 bits instead of 4).

Apparently the following set was also used for Windows product keys in base 24, but again has more similar-looking characters:

B C D F G H J K M P Q R T V W X Y 2 3 4 6 7 8 9
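
To put the trade-off between alphabet size and key length in numbers, here is a small Python sketch. The 128-bit key size and the helper name chars_needed are assumptions chosen purely for illustration.

```python
import math


def chars_needed(bits: int, alphabet_size: int) -> int:
    """Smallest number of characters that can represent every value of `bits` bits."""
    count = 0
    capacity = 1
    while capacity < 2 ** bits:
        capacity *= alphabet_size
        count += 1
    return count


for size in (16, 24, 32):
    print(f"base {size}: {math.log2(size):.2f} bits per character, "
          f"{chars_needed(128, size)} characters for a 128-bit key")
```

For a 128-bit key this gives 32 characters in base 16, 28 in base 24 and 26 in base 32, so the larger alphabets save only a handful of characters in exchange for admitting more similar-looking ones.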