Tesseract OCR - Handwritten font

Poru · Jan 4, 2012 · Viewed 10.5k times

I'm trying to use Tesseract-OCR to recognize the text in images that contain nothing but text, but the text is set in a handwritten font called Journal.

Example:

[image: sample text in the Journal font]

The result is not the best:

Maxima! size` W (35)

Is there any way to improve the result, or better yet, to get the exact text?

Answer

Andrew Cash · Jan 7, 2012

I am surprised Tesseract is doing so well. With a little bit of training, you should be able to get the lowercase 'l' recognised correctly.

The main problem you have is the top of the large T character. The horizontal line extends across 2 (possibly 3) other character cells and this would cause a problem for any OCR engine when it tries to segment the characters for recognition. Training may be able to help in this case.
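To see why an overhanging stroke hurts segmentation, here is a minimal sketch of a naive column-projection segmenter. It is only an illustration: the function name `segment_columns` and the toy bitmaps are made up for this example, and Tesseract's real segmentation is considerably more sophisticated than this.

```python
def segment_columns(bitmap):
    """Split a binary bitmap into character cells at blank columns.

    bitmap is a list of equal-length strings; '#' is ink, '.' is background.
    Returns a list of (start, end) column ranges, with end exclusive.
    """
    width = len(bitmap[0])
    # A column counts as inked if any row has ink in it.
    inked = [any(row[x] == "#" for row in bitmap) for x in range(width)]

    cells, start = [], None
    for x, has_ink in enumerate(inked):
        if has_ink and start is None:
            start = x                  # a new cell begins
        elif not has_ink and start is not None:
            cells.append((start, x))   # a blank column closes the cell
            start = None
    if start is not None:
        cells.append((start, width))   # cell runs to the right edge
    return cells


# Three glyphs separated by blank columns (hypothetical toy data).
separate = [
    "###..#..#.",
    ".#...#..#.",
    ".#...#..#.",
]

# Same glyphs, but the top bar of the first one overhangs its neighbours.
overhang = [
    "#########.",
    ".#...#..#.",
    ".#...#..#.",
]

print(segment_columns(separate))
print(segment_columns(overhang))
```

In the first bitmap the blank columns give three cells. In the second, the long top bar fills every gap, so the naive segmenter sees one wide blob instead of three characters.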

The next problem is the '.' and ':' characters, which are very light and thin and are possibly being removed by image pre-processing before the OCR even starts.
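As a rough illustration of how that can happen, the sketch below applies a simple "despeckle" step that erases small connected blobs of ink. This is an assumption about the kind of noise removal a pre-processing stage might do, not a description of Tesseract's actual pipeline; `remove_small_blobs` and the toy bitmap are invented for the example.

```python
def remove_small_blobs(bitmap, min_pixels):
    """Erase 4-connected ink blobs that have fewer than min_pixels pixels.

    bitmap is a list of equal-length strings; '#' is ink, '.' is background.
    """
    height, width = len(bitmap), len(bitmap[0])
    grid = [list(row) for row in bitmap]
    seen = set()

    for y in range(height):
        for x in range(width):
            if grid[y][x] != "#" or (y, x) in seen:
                continue
            # Flood fill to collect one connected blob of ink.
            blob, stack = [], [(y, x)]
            seen.add((y, x))
            while stack:
                cy, cx = stack.pop()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and grid[ny][nx] == "#" and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            # Treat blobs below the threshold as noise and wipe them.
            if len(blob) < min_pixels:
                for by, bx in blob:
                    grid[by][bx] = "."

    return ["".join(row) for row in grid]


# A tall stroke followed by a colon made of two single-pixel dots.
page = [
    "#....",
    "#..#.",
    "#....",
    "#..#.",
]

cleaned = remove_small_blobs(page, min_pixels=2)
print("\n".join(cleaned))
```

With a threshold of two pixels the tall stroke survives but both dots of the colon are erased, so the recogniser never sees them. If pre-processing is the culprit, scaling the image up or darkening thin strokes before OCR may help, but that would need to be verified on the actual image.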

Overall, the only chance to improve the results with Tesseract would be to investigate training. Here are some links that may help:

Alternative to Tesseract OCR Training?
Tesseract OCR Library learning font
Tesseract confuses two numbers