I'm looking for a method of classifying scanned pages that consist largely of text.
Here are the particulars of my problem. I have a large collection of scanned documents and need to detect the presence of certain kinds of pages within these documents. I plan to "burst" the documents into their component pages (each of which is an individual image) and classify each of these images as either "A" or "B". But I can't figure out the best way to do this.
I will answer in three parts, since your problem is clearly a large one. That said, I would highly recommend a manual approach with cheap labour if the collection does not exceed about 1,000 pages.
Part 1: Feature Extraction - You have a very large array of features to choose from in the object detection field. Since one of your requirements is rotation invariance, I would recommend the SIFT/SURF class of features. You might also find Harris corners and similar interest-point detectors suitable. Deciding which features to use can require expert knowledge, and if you have the computing power I would recommend building a melting pot of many feature types and passing them through a classifier-based feature-importance estimator during training.
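To make Part 1 concrete, here is a minimal sketch of SIFT descriptor extraction with OpenCV. It assumes opencv-python 4.4 or later (where SIFT ships in the main build), and the file name is a placeholder, not something from your actual collection.

```python
# Sketch: extract SIFT descriptors from one scanned page with OpenCV.
import cv2

def page_descriptors(image_path):
    """Return an (N, 128) array of SIFT descriptors for one page image, or None."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise IOError(f"could not read {image_path}")
    sift = cv2.SIFT_create()
    # Keypoints are rotation- and scale-invariant interest points;
    # descriptors is None when no keypoints are found (e.g. a blank page).
    keypoints, descriptors = sift.detectAndCompute(img, None)
    return descriptors

descs = page_descriptors("page_001.png")  # hypothetical file name
print(None if descs is None else descs.shape)
```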
Part 2: Classifier Selection - I am a great fan of the Random Forest classifier. The concept is simple to grasp, and the model is highly flexible and non-parametric. Tuning requires very few parameters, and you can also run parameter and feature selection during supervised training.
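A minimal sketch of Part 2 follows. I use scikit-learn here rather than R only to keep the examples in one language; the feature matrix is a random placeholder standing in for the fixed-length per-page feature vectors you would build from your labelled pages.

```python
# Sketch: train a Random Forest on per-page feature vectors and inspect
# feature importances (the "importance estimator" mentioned in Part 1).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 64))          # 200 pages, 64 features each (placeholder data)
y = rng.integers(0, 2, size=200)   # 0 = class "A", 1 = class "B" (placeholder labels)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())

clf.fit(X, y)
# Drop low-importance features and retrain if the melting pot gets too big.
top = np.argsort(clf.feature_importances_)[::-1][:10]
print("most useful feature indices:", top)
```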
Part 3: Implementation - Python is, in essence, a glue language, and pure Python implementations of image processing are never going to be very fast. I recommend using OpenCV for feature detection in combination with R for the statistical work and classifiers.
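One step worth spelling out is how the variable-length descriptor sets from Part 1 become the fixed-length vectors the classifier in Part 2 expects. A common bridge (my suggestion, not something required by the above) is a bag-of-visual-words histogram; a rough sketch, again in Python with scikit-learn, where the vocabulary size and the `page_descriptors()` helper from the earlier sketch are assumptions:

```python
# Sketch: build a visual vocabulary from pooled SIFT descriptors, then encode
# each page as a fixed-length histogram of visual-word assignments.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(all_descriptors, k=64):
    """Cluster descriptors pooled from a sample of pages into k visual words."""
    return MiniBatchKMeans(n_clusters=k, random_state=0).fit(all_descriptors)

def page_histogram(descriptors, vocabulary):
    """Normalised histogram of visual words; zeros if the page had no keypoints."""
    k = vocabulary.n_clusters
    if descriptors is None or len(descriptors) == 0:
        return np.zeros(k)
    words = vocabulary.predict(descriptors.astype(np.float64))
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()

# Hypothetical usage:
#   vocab = build_vocabulary(np.vstack(sample_descriptor_sets))
#   X = np.array([page_histogram(d, vocab) for d in all_page_descriptor_sets])
```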
The solution may seem over-engineered, but machine learning has never been a simple task, even when the only difference between pages is that they are the left-hand and right-hand pages of a book.