I'm looking for a method of classifying scanned pages that consist largely of text.
Here are the particulars of my problem. I have a large collection of scanned documents and need to detect the presence of certain kinds of pages within these documents. I plan to "burst" the documents into their component pages (each of which is an individual image) and classify each of these images as either "A" or "B". But I can't figure out the best way to do this.
I will answer in three parts, since your problem is clearly a large one. That said, I would highly recommend a manual approach with cheap labour if the collection does not exceed about 1,000 pages.
Part 1: Feature Extraction - You have a very large array of features to choose from in the object detection field. Since one of your requirements is rotation invariance, I would recommend the SIFT/SURF class of features. You might also find Harris corners and similar interest-point detectors suitable. Deciding which features to use can require expert knowledge, and if you have the computing power I would recommend building a melting pot of many feature types and passing them through a classifier-based feature-importance estimator during training.
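To make Part 1 concrete, here is a minimal sketch of SIFT descriptor extraction with OpenCV. It assumes opencv-python 4.4 or later (where SIFT ships in the main build), and the file name is a placeholder, not something from your actual collection.

```python
# Sketch: extract SIFT descriptors from one scanned page with OpenCV.
import cv2

def page_descriptors(image_path):
    """Return an (N, 128) array of SIFT descriptors for one page image, or None."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise IOError(f"could not read {image_path}")
    sift = cv2.SIFT_create()
    # Keypoints are rotation- and scale-invariant interest points;
    # descriptors is None when no keypoints are found (e.g. a blank page).
    keypoints, descriptors = sift.detectAndCompute(img, None)
    return descriptors

descs = page_descriptors("page_001.png")  # hypothetical file name
print(None if descs is None else descs.shape)
```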
Part 2: Classifier Selection - I am a great fan of the Random Forest classifier. The concept is simple to grasp, and the model is highly flexible and non-parametric. Tuning requires very few parameters, and you can also run parameter and feature selection during supervised training.
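A minimal sketch of Part 2 follows. I use scikit-learn here rather than R only to keep the examples in one language; the feature matrix is a random placeholder standing in for the fixed-length per-page feature vectors you would build from your labelled pages.

```python
# Sketch: train a Random Forest on per-page feature vectors and inspect
# feature importances (the "importance estimator" mentioned in Part 1).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 64))          # 200 pages, 64 features each (placeholder data)
y = rng.integers(0, 2, size=200)   # 0 = class "A", 1 = class "B" (placeholder labels)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())

clf.fit(X, y)
# Drop low-importance features and retrain if the melting pot gets too big.
top = np.argsort(clf.feature_importances_)[::-1][:10]
print("most useful feature indices:", top)
```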
Part 3: Implementation - Python is, in essence, a glue language, and pure Python implementations of image processing are never going to be very fast. I recommend using OpenCV for feature detection in combination with R for the statistical work and classifiers.
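One step worth spelling out is how the variable-length descriptor sets from Part 1 become the fixed-length vectors the classifier in Part 2 expects. A common bridge (my suggestion, not something required by the above) is a bag-of-visual-words histogram; a rough sketch, again in Python with scikit-learn, where the vocabulary size and the `page_descriptors()` helper from the earlier sketch are assumptions:

```python
# Sketch: build a visual vocabulary from pooled SIFT descriptors, then encode
# each page as a fixed-length histogram of visual-word assignments.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(all_descriptors, k=64):
    """Cluster descriptors pooled from a sample of pages into k visual words."""
    return MiniBatchKMeans(n_clusters=k, random_state=0).fit(all_descriptors)

def page_histogram(descriptors, vocabulary):
    """Normalised histogram of visual words; zeros if the page had no keypoints."""
    k = vocabulary.n_clusters
    if descriptors is None or len(descriptors) == 0:
        return np.zeros(k)
    words = vocabulary.predict(descriptors.astype(np.float64))
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()

# Hypothetical usage:
#   vocab = build_vocabulary(np.vstack(sample_descriptor_sets))
#   X = np.array([page_histogram(d, vocab) for d in all_page_descriptor_sets])
```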
The solution may seem over-engineered, but machine learning has never been a simple task, even when the only difference between pages is that they are the left-hand and right-hand pages of a book.