Tesseract SetVariable tessedit_char_whitelist in another language

best · Apr 30, 2013 · Viewed 10.8k times

Tesseract's SetVariable whitelist works fine for English. For example, I use this to recognize only digits and letters in an image (excluding special characters such as &*^%!):

_ocr.SetVariable("tessedit_char_whitelist",
"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ");

But I can't do the same thing for Thai:

_ocr.SetVariable("tessedit_char_whitelist","0123456789กขคงจฉ");

Does Thai work on a different principle? This does not work: instead of all the whitelisted characters, I get only digits in the output. Tesseract ignores every Thai letter I put into the whitelist.

How can I pass this variable correctly?

Answer

user2578701 · Dec 4, 2013

You might need to install the Thai language package first. See the download list here: https://code.google.com/p/tesseract-ocr/downloads/list

Then replace "eng" with "tha" in your code so that the OCR engine uses the new language data.
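Before switching the language code, it may help to confirm that the Thai data file is actually where the engine will look for it. Below is a minimal sketch in plain standard C++, assuming the usual layout in which each language lives in a file named `<lang>.traineddata` inside a `tessdata` directory. The helper names `traineddata_path` and `has_language_data` are hypothetical and are not part of Tesseract or of any wrapper; the same check is easy to port to whichever wrapper you use.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not a Tesseract API): builds the expected file name
// "<tessdata_dir>/<lang>.traineddata", assuming the conventional layout.
std::string traineddata_path(const std::string& tessdata_dir,
                             const std::string& lang) {
    std::string path = tessdata_dir;
    // Add a separator only if the directory does not already end with one.
    if (!path.empty() && path.back() != '/' && path.back() != '\\') {
        path += '/';
    }
    return path + lang + ".traineddata";
}

// Hypothetical helper: returns true if the language file can be opened
// for reading. It does not validate the file contents.
bool has_language_data(const std::string& tessdata_dir,
                       const std::string& lang) {
    std::ifstream file(traineddata_path(tessdata_dir, lang),
                       std::ios::binary);
    return file.good();
}
```

If `has_language_data("tessdata", "tha")` returns false, the Thai package is probably not installed in that directory yet, and there is little point in testing the whitelist until it is.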