Invoice / OCR: Detect two important points in invoice image

skiwi · Oct 1, 2013 · Viewed 12.5k times

I am currently working on OCR software and my idea is to use templates to try to recognize data inside invoices.

However, scanned invoices can have several flaws:

  • Invoices based on the same template are not always aligned the same way under the scanner.
  • People can write on invoices.
  • etc.

Example of an invoice (taken from Google; sadly I cannot add a more concrete one, as client data is confidential):

Example invoice

I find my data in the invoices based on the x-values of the text.

However, I need to know the scale of the invoice and its offset from the left and right edges before I can do any real calculations with the data I have retrieved.

What have I tried so far?

1) Making the image monochrome and using the left and right bounds of the first appearance of a black pixel. This fails because people can write on invoices.

2) Dividing the invoice into vertical sections and using the sections with the highest number of black pixels. This fails because the distribution is not always uniform among invoices of the same template.

I could really use your help on (1) how to identify important points in invoices and (2) which points I should treat as the important ones.

I hope the question is clear enough as it is quite hard to explain.

Answer

MvG · Oct 1, 2013

Detecting rotation

I would suggest you start by detecting straight lines.

Look (perhaps randomly) for small areas with high contrast, i.e. mostly white but with a fair number of very black pixels as well. Then try to fit a line to these black pixels, e.g. using the least squares method. Drop the outliers, and fit another line to the remaining points. Iterate this as required. Evaluate how good the fit is, i.e. how many of the pixels in the observed area are really close to the line, and how far the line extends beyond the observed area. Do this for a number of regions, and you should get a weighted list of lines.
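The fit-and-drop-outliers loop could be sketched roughly as follows. This is only an illustration under some assumptions of mine: the black pixels are given as (x, y) tuples, the fit is an orthogonal (total) least-squares fit so that near-vertical lines work too, and the names fit_line, distance and robust_fit as well as the tol and rounds parameters are made up for the example. The weight here is just the fraction of pixels kept; it ignores how far the line extends beyond the region.

```python
import math

def fit_line(points):
    """Orthogonal least-squares fit. Returns (cx, cy, theta): a point on the
    line (the centroid) and the line direction in radians."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sxx = sum((x - cx) ** 2 for x, _ in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    # Direction of the principal axis of the point cloud.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return cx, cy, theta

def distance(point, line):
    """Perpendicular distance from a point to the fitted line."""
    cx, cy, theta = line
    x, y = point
    return abs((x - cx) * math.sin(theta) - (y - cy) * math.cos(theta))

def robust_fit(points, tol=1.5, rounds=5):
    """Fit, drop points farther than tol from the line, refit, repeat.
    Returns the line and a weight: the fraction of points that were kept."""
    kept = list(points)
    line = fit_line(kept)
    for _ in range(rounds):
        inliers = [p for p in kept if distance(p, line) <= tol]
        if len(inliers) < 2 or len(inliers) == len(kept):
            break  # nothing left to fit, or nothing more to drop
        kept = inliers
        line = fit_line(kept)
    return line, len(kept) / len(points)

# Usage: a gently sloped line plus two stray pixels (e.g. handwriting).
pixels = [(float(x), 0.1 * x) for x in range(50)] + [(10.0, 9.0), (40.0, -4.0)]
line, weight = robust_fit(pixels)
print(math.degrees(line[2]), weight)
```

A fixed distance threshold is the simplest way to define an outlier; a threshold relative to the spread of the residuals would be another reasonable choice.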

For each line, you can compute the direction of the line itself and the direction orthogonal to it. One of these two angles lies in the interval [0°, 90°), and the other is 90° plus that value, so storing one is enough. Take all these directions, and find one angle which best matches all of them. You can do that using a sliding window of e.g. 5°: slide across that (cyclic) interval and find a position where the maximal number of lines fall within the window, then compute the average or median of the angles within that window. All of this computation can take the weights of the lines into account.
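As a rough sketch of the sliding window, assuming the line directions are available in degrees with optional weights: every angle is folded into [0°, 90°), each observed angle is tried as the start of a window, and the weighted mean inside the heaviest window is returned. The name dominant_angle and its parameters are made up for this example. Working with offsets from the window start keeps the averaging correct when the window wraps around from just below 90° to just above 0°.

```python
def dominant_angle(angles_deg, weights=None, window=5.0):
    """Weighted mean (in degrees, modulo 90) of the angles inside the
    sliding window that collects the most weight."""
    if weights is None:
        weights = [1.0] * len(angles_deg)
    folded = [a % 90.0 for a in angles_deg]
    best_weight, best_angle = 0.0, None
    # The best window can always be shifted to start at an observed angle,
    # so only those starting positions need to be tried.
    for start in folded:
        total = 0.0
        offset_sum = 0.0
        for angle, weight in zip(folded, weights):
            offset = (angle - start) % 90.0  # cyclic offset from window start
            if offset <= window:
                total += weight
                offset_sum += weight * offset
        if total > best_weight:
            best_weight = total
            best_angle = (start + offset_sum / total) % 90.0
    return best_angle

# Usage: most lines sit near 0 degrees (or 90/180, which are equivalent here),
# while two unrelated directions are ignored.
print(dominant_angle([1.0, 2.0, 91.5, 179.0, 45.0, 30.0]))
```

This uses the weighted average; taking the median of the angles inside the winning window instead would be less sensitive to a stray direction near the window edge.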

Once you have found the direction of lines, you can rotate your image so that the lines are perfectly aligned to the coordinate axes.

Detecting translation

Assuming the image wasn't scaled at any point, you can then try to use an FFT-based correlation of the image to match it to the template. Convert both images to gray, and pad them with zeros until the originals take up at most 1/2 the edge length of the padded image, which preferably should be a power of two. Apply a 2D FFT to both images, multiply one spectrum element-wise by the complex conjugate of the other (without the conjugate you would compute a convolution rather than a correlation), and apply the inverse FFT. The resulting image encodes how well the two images agree for each possible shift relative to one another. Simply find the maximum, and you know how to make them match.
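To make the correlation step concrete, here is a small self-contained sketch. It assumes gray images stored as lists of rows of floats, and it brings its own radix-2 FFT only so that it runs without dependencies; in practice a library routine such as numpy.fft.fft2 and numpy.fft.ifft2 would be the obvious choice. The names fft, fft2, pad and find_shift are made up for this example. One spectrum is multiplied by the complex conjugate of the other, which is what makes the result a correlation.

```python
import cmath

def fft(values, inverse=False):
    """Recursive radix-2 FFT, unnormalised; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def fft2(grid, inverse=False):
    """2D FFT: transform every row, then every column."""
    rows = [fft(row, inverse) for row in grid]
    cols = [fft(list(col), inverse) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to [y][x]

def pad(image, size):
    """Place the image in the top-left corner of a size x size zero grid."""
    return [[image[y][x] if y < len(image) and x < len(image[0]) else 0.0
             for x in range(size)] for y in range(size)]

def find_shift(scan, template, size):
    """Return (dy, dx) such that the scan looks like the template moved by
    dy rows and dx columns. size: padded edge length, a power of two."""
    f_scan = fft2(pad(scan, size))
    f_tmpl = fft2(pad(template, size))
    product = [[a * b.conjugate() for a, b in zip(row_a, row_b)]
               for row_a, row_b in zip(f_scan, f_tmpl)]
    # The missing 1/size^2 scaling does not change where the maximum is.
    corr = fft2(product, inverse=True)
    _, dy, dx = max((corr[y][x].real, y, x)
                    for y in range(size) for x in range(size))
    # Indices past the middle wrap around and stand for negative shifts.
    if dy > size // 2:
        dy -= size
    if dx > size // 2:
        dx -= size
    return dy, dx

# Usage: the same small blob, moved down by 1 row and right by 2 columns.
template = [[1.0, 2.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
scan = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 2.0],
        [0.0, 0.0, 3.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
print(find_shift(scan, template, 8))
```

Padding the 4x4 images to 8x8 follows the rule of thumb that the originals take up at most half the padded edge length, which keeps wrapped-around overlaps from polluting the result.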

Added text should cause little trouble, since handwriting contributes only weakly to the correlation compared with the printed template. This method works best for large areas, like the company logo and gray background boxes. Thin lines provide a poorer match, so in those cases you might have to blur the picture before doing the correlation, to broaden the features. You don't have to use the blurred image for further processing; once you know the offset you can return to the rotated but unblurred version.
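Any blur will do for broadening thin lines. As a minimal sketch, assuming a gray image stored as a list of rows of floats, a plain box blur could look like this; the name box_blur and the radius parameter are made up for the example, and a Gaussian blur from an imaging library would serve just as well.

```python
def box_blur(image, radius=1):
    """Replace each pixel by the mean of the (2*radius+1) square around it,
    clipped at the image border."""
    height, width = len(image), len(image[0])
    blurred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, count = 0.0, 0
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += image[ny][nx]
                    count += 1
            blurred[y][x] = total / count
    return blurred

# Usage: a one-pixel-wide vertical line becomes three pixels wide.
line_image = [[9.0 if x == 2 else 0.0 for x in range(5)] for _ in range(5)]
for row in box_blur(line_image):
    print(row)
```

A line that is three pixels wide still overlaps its counterpart in the template when the two are off by a pixel, which is why the correlation peak becomes easier to find.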

Now you know both rotation and translation, and assumed no scaling or shearing, so you know exactly which portion of the template corresponds to which portion of the scan. Proceed.