I often work with scanned papers. The papers contain tables (similar to Excel tables) which I need to type into the computer manually. To make the task worse, the tables can have different numbers of columns. Manually entering them into Excel is mundane, to say the least.
I thought I could save myself a week of work if I could write a program to OCR them. Would it be possible to detect the header text areas with OpenCV and then OCR the text inside the detected image coordinates?
Can I achieve this with the help of OpenCV, or do I need an entirely different approach?
Edit: The example table is really just a standard table similar to what you can see in Excel and other spreadsheet applications, see below.
This question seems a little old, but I was working on a similar problem and came up with my own solution, which I explain here.
With any OCR engine there are several challenges in getting good accuracy. The main cases are:
Presence of noise due to poor image quality or unwanted elements/blobs in the background region. This requires some pre-processing such as noise removal, which can easily be done with a Gaussian filter or a median filter. Both are available in OpenCV.
Wrong orientation of the image: because of the wrong orientation, the OCR engine fails to segment the lines and words in the image correctly, which gives very poor accuracy.
In this case I think the scanned image is quite clean and simple, and the following steps can be used to solve the problem.
Now we have to remove the lines, which in this case form the table grid. The grid can be identified with connected component analysis and removed by deleting the large connected components. The final image that is fed to the OCR engine will then look like this.
For OCR we can use Tesseract, an open-source OCR engine. I got the following results from OCR:
Caption title
header! header2 header3
row1cell1 row1cell2 row1cell3
row2cell1 row2cell2 row2cell3
As we can see, the result is quite accurate, but there are some issues such as header!, which should be header1; the OCR engine misread the 1 as !. This problem can be solved by further processing the result with regex-based operations.
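A small sketch of such a regex-based fix follows. The rule is an assumption made for this example: a "!" or "|" that directly follows a letter or digit at the end of a token is taken to be a misread 1. That is reasonable for table labels like the ones above, but it would be wrong for ordinary prose that really ends in an exclamation mark.

```python
import re

# '!' or '|' glued to the end of an alphanumeric token, followed by whitespace or end of text.
_MISREAD_ONE = re.compile(r"(?<=[A-Za-z0-9])[!|](?=\s|$)")


def fix_ocr_digits(text: str) -> str:
    """Replace a trailing '!' or '|' on a token with '1' (hypothetical helper)."""
    return _MISREAD_ONE.sub("1", text)


print(fix_ocr_digits("header! header2 header3"))
```

Other common confusions (O versus 0, l versus 1, S versus 5) can be handled the same way, but each rule should be restricted to contexts where a digit is actually expected.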
After post-processing, the OCR result can be parsed to read the row and column values.
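Parsing could be sketched as below. This assumes the layout of the sample output: the first non-empty line is the caption, the second is the header row, the remaining lines are data rows, and cells contain no spaces. Real tables with multi-word cells would need column positions from the OCR engine instead of whitespace splitting. The function name parse_table is just for illustration.

```python
def parse_table(ocr_text: str):
    """Split cleaned OCR text into caption, headers and row dicts (hypothetical helper)."""
    lines = [line.strip() for line in ocr_text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected at least a caption line and a header line")

    caption, header_line, *row_lines = lines
    headers = header_line.split()
    # Pair each cell with its column header; assumes one whitespace-free token per cell.
    rows = [dict(zip(headers, line.split())) for line in row_lines]
    return caption, headers, rows


sample = """Caption title
header1 header2 header3
row1cell1 row1cell2 row1cell3
row2cell1 row2cell2 row2cell3"""

caption, headers, rows = parse_table(sample)
print(caption)
print(headers)
print(rows[0]["header2"])
```

From here the rows can be written out with the csv module and opened directly in Excel.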
Also, in this case the font information can be used to classify the sheet title, the headings and the normal cell values.