How can I remove all lines and borders from an image (keeping the text) programmatically?

wind · Nov 27, 2015 · Viewed 23.8k times

I'm trying to extract text from an image using Tesseract OCR. With the original input image (below), the output quality is very poor (about 50%). But when I remove all lines and borders from the input image (using Photoshop), the output improves a lot (~90%). So is there any way to remove all lines and borders from an image (keeping the text) programmatically (using OpenCV, ImageMagick, ...)?

Original image: [image not included]

Expected image: [image not included]

Answer

Mark Setchell · Nov 28, 2015

This doesn't use OpenCV, just a one-liner of ImageMagick in the Terminal, but it may give you an idea of how to do it in OpenCV. ImageMagick is installed on most Linux distros and is available for OSX and Windows.

The crux of the concept is to create a new image where each pixel is set to the median of the 100 neighbouring pixels to its left and the 100 neighbouring pixels to its right. That way, pixels that have lots of black horizontal neighbours (i.e. horizontal black lines) will be white in the output image, while the short strokes of text are left alone. Then the same processing is applied in the vertical direction to remove vertical lines.
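To see why a long median window picks out lines but not text, here is a minimal pure-Python sketch (not ImageMagick or OpenCV code). It works on a single row of a thresholded, inverted mask where 1 means ink. The name `median_1d`, the tiny radius of 3 and the treatment of out-of-row neighbours as background are assumptions made for illustration; the actual command uses a 200-pixel window.

```python
def median_1d(row, radius):
    """Median of each pixel and its `radius` neighbours on either side.

    Neighbours that fall outside the row count as 0 (background);
    that edge handling is an assumption of this sketch.
    """
    n = len(row)
    out = []
    for i in range(n):
        window = sorted(row[j] if 0 <= j < n else 0
                        for j in range(i - radius, i + radius + 1))
        out.append(window[radius])  # middle element of 2*radius + 1 values
    return out

# 1 = ink in the thresholded, inverted mask; 0 = background.
text_row = [0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0]  # short runs, like letter strokes
line_row = [1] * 12                               # one long run, like a table border

print(median_1d(text_row, 3))  # all zeros: the text drops out of the mask
print(median_1d(line_row, 3))  # all ones: the line stays in the mask
```

Only the long run survives the median, so the resulting mask marks the line and nothing else.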

The command that you type into the Terminal will be:

convert input.png                                                 \
   \( -clone 0 -threshold 50% -negate -statistic median 200x1 \)  \
   -compose lighten -composite                                    \
   \( -clone 0 -threshold 50% -negate -statistic median 1x200 \)  \
   -composite result.png

The first line says to load your original image.

The second line starts some "aside-processing" that copies the original image, thresholds it and inverts it, then calculates the median of the neighbouring pixels 100 either side of each pixel.

The third line then takes the result of the second line and composites it over the original image, choosing the lighter of the pixels at each location - i.e. the ones that my horizontal line mask has whitened out.

The next two lines do the same thing again but oriented vertically for vertical lines.
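The steps of the command can be mirrored in a small pure-Python sketch on a toy image stored as lists of 0-255 grey values. This is only an illustration of the idea under stated assumptions, not how ImageMagick implements it: the helper names (`ink_mask`, `lighten`, `remove_lines`) are hypothetical, the radius is 3 instead of 100, and neighbours outside the image are treated as background.

```python
WHITE, BLACK = 255, 0

def median_1d(row, radius):
    """Median over `radius` neighbours each side; outside the row counts as 0 (assumption)."""
    n = len(row)
    out = []
    for i in range(n):
        window = sorted(row[j] if 0 <= j < n else 0
                        for j in range(i - radius, i + radius + 1))
        out.append(window[radius])
    return out

def ink_mask(img, level=128):
    """Roughly `-threshold 50% -negate`: dark pixels become 255, light ones 0."""
    return [[WHITE if p < level else BLACK for p in row] for row in img]

def transpose(img):
    return [list(col) for col in zip(*img)]

def lighten(a, b):
    """Roughly `-compose lighten`: keep the lighter pixel at each location."""
    return [[max(p, q) for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def remove_lines(img, radius):
    # Horizontal pass: only long horizontal runs of ink survive the median.
    h_mask = [median_1d(row, radius) for row in ink_mask(img)]
    img = lighten(img, h_mask)
    # Vertical pass on the intermediate result: same filter along columns.
    v_mask = transpose([median_1d(col, radius) for col in transpose(ink_mask(img))])
    return lighten(img, v_mask)

# Toy 12x12 page: one horizontal border, one vertical border, three "text" pixels.
page = [[WHITE] * 12 for _ in range(12)]
for i in range(12):
    page[6][i] = BLACK
    page[i][8] = BLACK
text = [(2, 2), (2, 3), (3, 2)]
for r, c in text:
    page[r][c] = BLACK

cleaned = remove_lines(page, radius=3)
for row in cleaned:
    print("".join("#" if p == BLACK else "." for p in row))
```

On this toy page both borders are whitened and the three text pixels remain. With real scans the radius has to be long enough that no stroke of text fills more than half the window.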

The result looks like this:

[Image: result with the lines removed]

If I difference that with your original image, like this, I can see what it did:

convert input.png result.png -compose difference -composite diff.png

[Image: difference between the original and the result]
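For readers unfamiliar with the difference operation, it amounts to the absolute per-pixel difference of two same-sized images. A minimal pure-Python sketch on made-up 3x3 data (the function name `difference` and the pixel values are assumptions for illustration):

```python
def difference(a, b):
    """Roughly `-compose difference`: absolute per-pixel difference of two images."""
    return [[abs(p - q) for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

original = [
    [255,   0, 255],   # one dark "text" pixel
    [  0,   0,   0],   # a border line
    [255, 255, 255],
]
result = [
    [255,   0, 255],   # text untouched
    [255, 255, 255],   # border whitened
    [255, 255, 255],
]

diff = difference(original, result)
for row in diff:
    print(row)
```

Pixels that were changed show up bright in `diff`, and everything left untouched is black, so the difference image is effectively a picture of what was removed.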

I guess, if you wanted to remove a bit more of the lines, you could actually blur the difference image a little and apply that to the original. Of course, you can play with the filter lengths and the thresholds and stuff too.
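One possible reading of that suggestion, sketched in pure Python on toy data: grow the difference mask slightly and lighten the original with it, so that faint fringes next to a removed line are whitened too. Note the swap: this sketch uses a one-pixel dilation (a maximum filter) as a simple stand-in for blurring, and the names `dilate` and `lighten` as well as the pixel values are assumptions for illustration.

```python
def lighten(a, b):
    """Keep the lighter pixel at each location."""
    return [[max(p, q) for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def dilate(mask, radius=1):
    """Maximum filter: each pixel takes the brightest value within `radius` of it."""
    h, w = len(mask), len(mask[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            best = 0
            for y in range(max(0, r - radius), min(h, r + radius + 1)):
                for x in range(max(0, c - radius), min(w, c + radius + 1)):
                    best = max(best, mask[y][x])
            row.append(best)
        out.append(row)
    return out

# A dark line (0) with a lighter grey fringe (200) above and below it.
original = [
    [255] * 4,
    [200] * 4,
    [0] * 4,
    [200] * 4,
    [255] * 4,
]
# Difference image from a first removal pass: only the dark line was caught.
diff = [
    [0] * 4,
    [0] * 4,
    [255] * 4,
    [0] * 4,
    [0] * 4,
]

cleaned = lighten(original, dilate(diff, radius=1))
for row in cleaned:
    print(row)
```

The trade-off is that anything within `radius` pixels of a removed line is whitened as well, including text that touches a border, so the radius should stay small.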