How to make tesseract to recognize only numbers, when they are mixed with letters?

zkunov picture zkunov · Feb 9, 2011 · Viewed 99.4k times · Source

I want to use tesseract to recognize only numbers. The problem is that I have mixture of numbers & letters and when I use SetVariable("tessedit_char_whitelist", "0123456789")
for every symbol tesseract returns wrong digit.

Can I set a threshold value so that tesseract omits the symbols with low resemblance?

NOTE: I set tesseract to recognize only digits so there is no confusion between O and 0.

Answer

Jerry picture Jerry · Sep 21, 2011

Recognizing only numbers is actually answered on the tesseract FAQ page. See that page for more info, but if you have the version 3 package, the config files are already set up. You just specify on the commandline:

tesseract image.tif outputbase nobatch digits

As for the threshold value, I'm not sure which you mean. If your input is an unusual font, perhaps you might retrain with a sample of your input. An alternative is to change tesseract's pruning threshold. Both options are also mentioned in the FAQ.