I have the following image that I'd like to prepare for OCR with Tesseract:
The objective is to clean up the image and remove all of the noise.
I'm using the textcleaner script, which uses ImageMagick, with the following parameters:
./textcleaner -g -e normalize -f 30 -o 12 -s 2 original.jpg output.jpg
The output is still not very clean:
I tried all kinds of variations of the parameters, but with no luck. Does anyone have an idea?
If you convert to JPEG, you will always get the kind of artifacts you are seeing.
This is a typical "feature" of JPEG compression. JPEG is a poor fit for images with sharp lines, strong contrast between uniformly colored areas, and only very few colors, which is exactly what black-and-white text looks like. JPEG is only "good" for typical photos, with lots of different colors and shading.
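To see the effect in isolation, here is a minimal sketch that draws a pure black rectangle on a white canvas, saves it once as PNG and once as JPEG, and counts the unique colors in each file. The file names, the canvas size, and the helper functions `im_cmd` and `count_colors` are made up for this demo; it assumes ImageMagick is installed as either `magick` or `convert`.

```shell
#!/bin/sh
# Sketch: count unique colors before and after a JPEG round trip.
# bw.png / bw.jpg and both helper functions are placeholders for this demo.

# Pick the ImageMagick entry point: "magick" (v7) or "convert" (v6).
im_cmd() {
    if command -v magick >/dev/null 2>&1; then
        echo magick
    elif command -v convert >/dev/null 2>&1; then
        echo convert
    fi
}

# Print the number of unique colors in an image file (%k escape).
count_colors() {
    "$(im_cmd)" "$1" -format '%k\n' info:
}

if [ -n "$(im_cmd)" ]; then
    tmp=$(mktemp -d)
    # Pure black rectangle on white, antialiasing switched off.
    "$(im_cmd)" -size 100x60 xc:white +antialias -fill black \
        -draw 'rectangle 21,19 77,43' "$tmp/bw.png"
    # Same picture, re-encoded as JPEG.
    "$(im_cmd)" "$tmp/bw.png" -quality 75 "$tmp/bw.jpg"
    echo "PNG colors:  $(count_colors "$tmp/bw.png")"
    echo "JPEG colors: $(count_colors "$tmp/bw.jpg")"
    rm -rf "$tmp"
else
    echo "ImageMagick not found; skipping demo" >&2
fi
```

The PNG should keep just the two colors that were drawn, while the JPEG normally reports more, because the compression adds intermediate gray levels around the edges. Those extra grays are the "noise" you see around the letters.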
Your problem will most likely be resolved completely if you use PNG as the output format. The following image demonstrates this. I generated it with the same parameters as your last example command, but with PNG as the output format:
textcleaner -g -e normalize -f 30 -o 12 -s 2 \
http://i.stack.imgur.com/ficx7.jpg \
out.png
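If you have more than one scan, the same idea can be wrapped in a small script. This is only a sketch: the function names `png_name` and `clean_and_ocr` are made up here, it assumes `textcleaner` sits in the current directory as in your command, and it assumes the `tesseract` command-line tool is on your PATH.

```shell
#!/bin/sh
# Sketch: clean each scan to PNG (not JPEG), then run OCR on the result.
# png_name and clean_and_ocr are hypothetical helper names.

# Derive a .png output name by replacing the input's last extension.
png_name() {
    printf '%s.png\n' "${1%.*}"
}

# Clean one scan with the parameters from the question, then OCR it.
clean_and_ocr() {
    in=$1
    out=$(png_name "$in")
    ./textcleaner -g -e normalize -f 30 -o 12 -s 2 "$in" "$out" &&
        tesseract "$out" "${out%.png}"   # tesseract writes <name>.txt
}

# Process every file given on the command line.
for f in "$@"; do
    clean_and_ocr "$f"
done
```

Saved as, say, clean.sh, it would be called as `sh clean.sh original.jpg`, which produces original.png and original.txt. Keeping the intermediate file as PNG means Tesseract reads the cleaned pixels unchanged instead of a recompressed copy.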
Here is a similar zoom into the output:
You can very likely improve the output even more if you experiment with the parameters of the textcleaner script. But that is your job... :-)