Cleaning up an image for OCR with ImageMagick and 'textcleaner'

Edi · May 14, 2015 · Viewed 12.2k times

I have the following image that I'd like to prepare for OCR with tesseract: [image]

The objective is to clean up the image and remove all of the noise. I'm using the textcleaner script that uses ImageMagick with the following parameters:

./textcleaner -g -e normalize -f 30 -o 12 -s 2 original.jpg output.jpg

The output is still not very clean: [image]

I tried all kinds of variations of the parameters, but with no luck. Does anyone have an idea?

Answer

Kurt Pfeifle · May 16, 2015

If you convert to JPEG, you will always have the type of artifacts you are seeing.

This is a typical "feature" of JPEG's lossy compression. JPEG is a poor fit for images with sharp lines, strong contrast between uniformly colored areas, and only a few colors, which is exactly what black-and-white text looks like. JPEG is only "good" for typical photos, with lots of different colors and shading...
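One way to see this effect for yourself is to count the distinct colors in the same cleaned image saved both ways. The following is a minimal sketch, assuming an ImageMagick 6 style install with `convert` and `identify` on the PATH (on ImageMagick 7 the commands are usually `magick` and `magick identify`). The function names `jpeg_name`, `count_colors` and `compare_formats`, and the quality value 85, are made up for illustration.

```shell
#!/bin/sh
# jpeg_name FILE: print FILE's name with its extension replaced by .jpg
jpeg_name() {
    printf '%s.jpg\n' "${1%.*}"
}

# count_colors FILE: print the number of unique colors in FILE,
# using ImageMagick's %k format escape.
count_colors() {
    identify -format '%k\n' "$1"
}

# compare_formats PNG: re-save a PNG as JPEG and print both color counts.
compare_formats() {
    jpg=$(jpeg_name "$1")
    convert "$1" -quality 85 "$jpg"      # 85 is an arbitrary example quality
    echo "PNG:  $(count_colors "$1") unique colors"
    echo "JPEG: $(count_colors "$jpg") unique colors"
}

# Only run when ImageMagick is installed and an existing file was passed.
if command -v identify >/dev/null 2>&1 && [ -f "${1:-}" ]; then
    compare_formats "$1"
fi
```

Saved as, say, `compare.sh` and called as `sh compare.sh out.png`, this usually reports more unique colors for the JPEG copy of a cleaned text image: the extra shades are the compression artifacts around the letter edges.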

Your problem will most likely be resolved completely if you use PNG as the output format. The following image demonstrates this. I generated it with the same parameters as your example command, but with PNG as the output format:

textcleaner -g -e normalize -f 30 -o 12 -s 2 \
    http://i.stack.imgur.com/ficx7.jpg       \
    out.png

[image: PNG instead of JPEG output]

Here is a similar zoom into the output:

[image: zoomed PNG]

You can very likely improve the output even more if you play with the parameters of the textcleaner script. But that is your job... :-)
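As a starting point for that experimentation, here is a minimal sketch of a parameter sweep. It assumes `textcleaner` is on the PATH and that the input is a local file. The helper names `out_name` and `sweep` are hypothetical, and the value grids for `-f` and `-o` are arbitrary examples, not recommended settings. Every result is written as PNG so that no JPEG artifacts are reintroduced.

```shell
#!/bin/sh
# out_name INPUT F O: print a PNG file name that records the parameters used.
out_name() {
    base=${1##*/}                        # strip any directory part
    printf '%s_f%s_o%s.png\n' "${base%.*}" "$2" "$3"
}

# sweep INPUT: run textcleaner over a small grid of -f and -o values,
# keeping the other options from the command shown earlier.
sweep() {
    for f in 15 30 45; do                # example filter sizes (assumption)
        for o in 6 12 18; do             # example offsets (assumption)
            textcleaner -g -e normalize -f "$f" -o "$o" -s 2 \
                "$1" "$(out_name "$1" "$f" "$o")"
        done
    done
}

# Only run when textcleaner is installed and an existing file was passed.
if command -v textcleaner >/dev/null 2>&1 && [ -f "${1:-}" ]; then
    sweep "$1"
fi
```

Each resulting file can then be fed to tesseract, for example `tesseract original_f30_o12.png result`, which writes the recognized text to `result.txt`, so the variants can be compared by their OCR output rather than by eye.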