Image classification in Python

Kyle · Oct 11, 2010 · Viewed 7.5k times

I'm looking for a method of classifying scanned pages that consist largely of text.

Here are the particulars of my problem. I have a large collection of scanned documents and need to detect the presence of certain kinds of pages within these documents. I plan to "burst" the documents into their component pages (each of which is an individual image) and classify each of these images as either "A" or "B". But I can't figure out the best way to do this.

More details:

  • I have numerous examples of "A" and "B" images (pages), so I can do supervised learning.
  • It's unclear to me how best to extract features from these images for training. What should those features be?
  • The pages are occasionally rotated slightly, so it would be great if the classification was somewhat insensitive to rotation and (to a lesser extent) scaling.
  • I'd like a cross-platform solution, ideally in pure Python or using common libraries.
  • I've thought about using OpenCV, but this seems like a "heavyweight" solution.

EDIT:

  • The "A" and "B" pages differ in that the "B" pages have forms on them with the same general structure, including the presence of a bar code. The "A" pages are free text.

Answer

whatnick · Oct 11, 2010

I will answer in three parts, since your problem is clearly a large one. That said, if the collection does not exceed about 1,000 pages, I would strongly recommend sorting them manually with cheap labour.

Part 1: Feature Extraction - You have a very large array of features to choose from in the object detection field. Since one of your requirements is rotation invariance, I would recommend the SIFT/SURF class of features. You might also find Harris corners and similar detectors suitable. Deciding which features to use can require expert knowledge, so if you have the computing power I would recommend building a broad melting pot of features and passing it through an importance estimator based on classifier training.
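To make the Harris corner idea concrete, here is a minimal pure-Python sketch of the corner response on a tiny grayscale grid. It is an illustration only, not OpenCV's implementation and not a substitute for SIFT/SURF. The function name `harris_response`, the 3x3 summation window, and `k = 0.04` are assumptions chosen for this example.

```python
def harris_response(img, k=0.04):
    """Harris corner response for a grayscale image given as a list of rows.

    Hypothetical helper for illustration: uses central differences and a
    plain 3x3 sum instead of the Gaussian window a real library would use.
    """
    h, w = len(img), len(img[0])
    # Image gradients by central differences (left at zero on the border).
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0
    resp = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Sum the structure tensor entries over the 3x3 neighbourhood.
            sxx = sxy = syy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    gx, gy = ix[y + dy][x + dx], iy[y + dy][x + dx]
                    sxx += gx * gx
                    sxy += gx * gy
                    syy += gy * gy
            det = sxx * syy - sxy * sxy
            trace = sxx + syy
            # Positive at corners, negative along edges, zero in flat areas.
            resp[y][x] = det - k * trace * trace
    return resp


# Synthetic test page: a white 8x8 square on a black 16x16 background.
page = [[1.0 if 4 <= y <= 11 and 4 <= x <= 11 else 0.0 for x in range(16)]
        for y in range(16)]
response = harris_response(page)
```

On this synthetic page the response is positive at the square's corner `response[4][4]`, negative on the straight edge at `response[7][4]`, and zero in the flat background. A count of strong corners per page region could then serve as one entry in the feature pot.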

Part 2: Classifier Selection - I am a great fan of the Random Forest classifier. The concept is very simple to grasp and it is highly flexible and non-parametric. Tuning requires very few parameters and you can also run it in a parameter selection mode during supervised training.
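The concept can be sketched in pure Python as follows: each tree is trained on a bootstrap sample, each split considers only a random subset of features, and the forest predicts by majority vote. This is a toy version meant to show the idea, not a replacement for a tested library. All function names and the two example features (a corner count and an ink fraction) are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def best_split(rows, labels, feature_ids):
    """Return (impurity, feature, threshold) of the best split, or None."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [lab for r, lab in zip(rows, labels) if r[f] <= t]
            right = [lab for r, lab in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(rows, labels, rng, n_try, depth=0, max_depth=5):
    """Grow one tree; a leaf is a label, an inner node is a 4-tuple."""
    if len(set(labels)) == 1 or depth == max_depth:
        return majority(labels)
    # Only a random subset of features is considered at each split.
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), n_try))
    if split is None:
        return majority(labels)
    _, f, t = split
    lo = [i for i, r in enumerate(rows) if r[f] <= t]
    hi = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            build_tree([rows[i] for i in lo], [labels[i] for i in lo],
                       rng, n_try, depth + 1, max_depth),
            build_tree([rows[i] for i in hi], [labels[i] for i in hi],
                       rng, n_try, depth + 1, max_depth))


def predict_tree(node, row):
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_try = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        # Bootstrap: sample the training set with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], rng, n_try))
    return forest


def predict_forest(forest, row):
    return majority([predict_tree(tree, row) for tree in forest])


# Made-up feature vectors: [corner_count, ink_fraction] per page.
rows = [[3, 0.10], [5, 0.12], [2, 0.09], [4, 0.11],
        [40, 0.30], [38, 0.28], [45, 0.33], [41, 0.31]]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
forest = train_forest(rows, labels)
```

With this made-up data, `predict_forest(forest, [1, 0.08])` votes for "A" and `predict_forest(forest, [50, 0.35])` votes for "B". The only knobs are the number of trees, the features tried per split, and the depth limit, which is what keeps tuning light.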

Part 3: Implementation - Python is in essence a glue language. Pure Python implementations of image processing are never going to be very fast. I recommend using a combination of OpenCV for feature detection and R for the statistical work and classifiers.
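As a small illustration of the glue role, the sketch below writes per-page feature vectors to CSV so that a separate statistics tool can pick them up. The record layout, the `write_feature_table` name, and the `f0`, `f1` column names are assumptions made for this example; the feature values are made up.

```python
import csv
import io


def write_feature_table(records, out):
    """Write (page_id, label, features) records as CSV, one row per page."""
    n_features = len(records[0][2])
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["page_id", "label"] + ["f%d" % i for i in range(n_features)])
    for page_id, label, features in records:
        writer.writerow([page_id, label] + list(features))


# Hypothetical records; in practice the features would come from the
# feature-extraction step and `out` would be an open file.
records = [("doc1_p1", "A", [3, 0.12]), ("doc1_p2", "B", [41, 0.30])]
buffer = io.StringIO()
write_feature_table(records, buffer)
csv_text = buffer.getvalue()
```

If the same text is written to a file, R can load it with `read.csv`, which keeps the feature extraction and the statistical modelling loosely coupled.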

The solution may seem over-engineered but machine learning has never been a simple task even when the difference between pages is simply that they are the left-hand and right-hand pages of a book.