Detect table with OpenCV

Datageek · Oct 31, 2015 · Viewed 11.8k times

I often work with scanned papers. They contain tables (similar to Excel tables) that I have to type into the computer by hand. To make matters worse, the number of columns varies from table to table. Entering them into Excel manually is tedious, to say the least.

I thought I could save myself a week of work if I put together a program to OCR them. Would it be possible to detect the header text areas with OpenCV and then OCR the text at the detected image coordinates?

Can I achieve this with OpenCV, or do I need an entirely different approach?

Edit: The example table is just a standard table like those in Excel and other spreadsheet applications; see below.

[Image: example table]

Answer

flamelite · Oct 18, 2017

This question is a little old, but I was working on a similar problem and came up with my own solution, which I explain here.

Getting good accuracy from any OCR engine involves several challenges. The main ones are:

  1. Noise from poor image quality, or unwanted elements/blobs in the background. This calls for pre-processing such as noise removal, which can be done with a Gaussian filter or a median filter. Both are available in OpenCV.

  2. Wrong image orientation: if the image is rotated or skewed, the OCR engine fails to segment the lines and words correctly, and accuracy suffers badly.

  3. Presence of lines: during word or line segmentation, the OCR engine sometimes merges words and lines together, so it processes the wrong content and gives wrong results. There are other issues as well, but these are the basic ones.

In this case the scanned image is simple and of fairly good quality, so the following steps can be used to solve the problem.

  1. Simple image binarization removes the background, leaving only the necessary content, as shown here: [Image: binary image]
  2. Next we have to remove the lines, which in this case form the table grid. They can be identified with connected-component analysis: the grid shows up as large connected components, which are then removed. The final image to feed to the OCR engine looks like this.

    [Image: table with the grid lines removed]

  3. For OCR we can use Tesseract, an open-source OCR engine. I got the following results from OCR:

    Caption title

    header! header2 header3

    row1cell1 row1cell2 row1cell3

    row2cell1 row2cell2 row2cell3

  4. The result is quite accurate, but there are some issues, such as header! where header1 was expected: the OCR engine mistook the 1 for a !. Problems like this can be fixed by post-processing the result with regex-based operations.

After post-processing, the OCR result can be parsed to read the row and column values.
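As an illustration of the post-processing and parsing, here is a small sketch using only the Python standard library on the OCR output shown above. The correction rule (a `!` directly after a letter at the end of a token becomes `1`) is an assumption that fits this sample and would also rewrite genuine exclamation marks. The parser assumes the first non-empty line is the caption, the second is the header, and cell values contain no spaces. All names are hypothetical.

```python
import re

OCR_TEXT = """Caption title

header! header2 header3

row1cell1 row1cell2 row1cell3

row2cell1 row2cell2 row2cell3
"""

# Assumed rule: "!" right after a letter and at the end of a token was a "1".
BANG_AFTER_LETTER = re.compile(r"(?<=[A-Za-z])!(?=\s|$)")


def fix_ocr_confusions(text: str) -> str:
    """Replace the '!' misreads described above with '1'."""
    return BANG_AFTER_LETTER.sub("1", text)


def parse_table(text: str):
    """Split OCR text into (caption, header cells, list of row cells)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    caption, header, *body = lines  # needs at least a caption and a header line
    return caption, header.split(), [row.split() for row in body]


caption, header, rows = parse_table(fix_ocr_confusions(OCR_TEXT))
print(caption)  # Caption title
print(header)   # ['header1', 'header2', 'header3']
print(rows[0])  # ['row1cell1', 'row1cell2', 'row1cell3']
```

If real cell values contain spaces, splitting on whitespace is not enough; in that case the column boundaries would have to come from the detected grid positions instead.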

In this case, font information can also be used to classify the sheet title, headings, and normal cell values.