views:

119

answers:

3

I'm looking for a method of classifying scanned pages that consist largely of text.

Here are the particulars of my problem. I have a large collection of scanned documents and need to detect the presence of certain kinds of pages within these documents. I plan to "burst" the documents into their component pages (each of which is an individual image) and classify each of these images as either "A" or "B". But I can't figure out the best way to do this.

More details:

  • I have numerous examples of "A" and "B" images (pages), so I can do supervised learning.
  • It's unclear to me how best to extract features from these images for training. E.g., what should those features be?
  • The pages are occasionally rotated slightly, so it would be great if the classification were somewhat insensitive to rotation and (to a lesser extent) scaling.
  • I'd like a cross-platform solution, ideally in pure Python or using common libraries.
  • I've thought about using OpenCV, but this seems like a "heavyweight" solution.

EDIT:

  • The "A" and "B" pages differ in that the "B" pages have forms on them with the same general structure, including the presence of a bar code. The "A" pages are free text.
+1  A: 

First, I would like to say that in my opinion OpenCV is a very good tool for this kind of manipulation. Moreover, it has a Python interface, well described here.

OpenCV is highly optimized and your problem is not an easy one.

[GLOBAL EDIT: reorganization of my ideas]

Here are a few ideas for features that could be used:

  • For detecting the barcodes, you could try a distance transform (DistTransform in OpenCV) if the barcodes are isolated. You may be able to find interest points easily with Match or MatchShapes. I think this is feasible because the barcodes should all have the same shape (size, etc.). The score of the interest points could be used as a feature.

  • The moments of the image could be useful here because the two page types have different global structures. This alone may be sufficient for distinguishing between A and B pages (see there for the OpenCV function), and you will get invariant descriptors by the way. :)

  • You could also try computing the vertical and horizontal gradients. A barcode is a specific zone where the vertical gradient == 0 while the horizontal gradient != 0. The main advantage is the low cost of these operations, since your goal is only to check whether such a zone exists on your page. You can find the zone of interest and use its score as a feature.
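The gradient heuristic in the last bullet can be sketched in plain NumPy (no OpenCV needed for a first test). This is a minimal sketch, not a tuned detector: the `barcode_score` and `find_barcode_zone` names, the block size, and the synthetic test page are all my own illustrative choices.

```python
import numpy as np

def barcode_score(block):
    """Score how 'barcode-like' a block is: strong horizontal
    gradients (vertical bars) but weak vertical gradients."""
    gx = np.abs(np.diff(block.astype(float), axis=1)).mean()
    gy = np.abs(np.diff(block.astype(float), axis=0)).mean()
    return gx - gy

def find_barcode_zone(page, block=32):
    """Slide a coarse grid over the page and return the best block
    score; thresholding it decides whether a barcode is present."""
    h, w = page.shape
    best = -np.inf
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            best = max(best, barcode_score(page[r:r+block, c:c+block]))
    return best

# Synthetic test: a blank page versus one with a striped "barcode" patch.
page = np.zeros((128, 128))
page[40:72, 40:104:4] = 255          # vertical bars every 4 px
print(find_barcode_zone(page) > 10)  # → True (barcode-like zone found)
```

The returned score can be fed directly to the classifier as one feature, alongside the moment-based descriptors above.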

Once you have your features, you can try supervised learning and test how well it generalizes. Your problem requires very few false negatives (because you are going to throw away some pages), so you should evaluate your performance with ROC curves and look carefully at the sensitivity (which should be high). For the classification, you could use regression with lasso penalization to find the best features. whatnick's post also gives good ideas and other (maybe more general) descriptors.
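To illustrate the ROC evaluation, here is a minimal NumPy sketch that computes the TPR/FPR points and the area under the curve. The classifier scores and labels below are made up purely for the example; in practice they would come from your trained model.

```python
import numpy as np

def roc_points(scores, labels):
    """TPR/FPR pairs over all thresholds; labels are 1 for a "B"
    page (barcode present), 0 for an "A" page."""
    order = np.argsort(-scores)            # sort by descending score
    labels = labels[order]
    tp = np.cumsum(labels)                 # true positives so far
    fp = np.cumsum(1 - labels)             # false positives so far
    tpr = tp / labels.sum()
    fpr = fp / (1 - labels).sum()
    return fpr, tpr

# Hypothetical classifier scores for 6 pages (higher = more "B"-like).
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])
labels = np.array([1,   1,   0,   1,   0,   0  ])
fpr, tpr = roc_points(scores, labels)
# Trapezoidal area under the ROC curve.
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(round(auc, 2))                       # → 0.89
```

For the "few false negatives" requirement, you would pick the threshold whose TPR (sensitivity) is near 1 and accept the corresponding FPR.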

Elenaher
+2  A: 

So you want to be able to distinguish between two kinds of pages using specific elements - basically, the presence of barcodes. There are two steps:

  1. feature extraction (computer vision): find interest points or lines that are specific to barcodes rather than to text.

  2. binary classification (statistical learning): determine whether there is a barcode or not, based on the extracted features.


Dealing with the first step, you should definitely have a look at the Hough transform. It is ideal for identifying lines in an image and could be useful for barcode detection. Read these two pages, for instance. Here are examples with OpenCV.
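To make the idea concrete (in practice you would use OpenCV's HoughLines rather than roll your own), here is a tiny NumPy Hough transform that votes in (rho, theta) space and reports the dominant line angle. The function name and the synthetic edge image are illustrative only.

```python
import numpy as np

def hough_vertical_peak(edges, n_theta=180):
    """Tiny Hough transform: accumulate (rho, theta) votes for the
    edge pixels and return the angle (degrees) with the most votes.
    A barcode full of vertical bars should peak near theta = 0."""
    ys, xs = np.nonzero(edges)
    thetas = np.deg2rad(np.arange(n_theta) - 90)     # -90..89 degrees
    diag = int(np.hypot(*edges.shape)) + 1
    acc = np.zeros((2 * diag, n_theta), dtype=int)
    for t, theta in enumerate(thetas):
        rho = np.round(xs * np.cos(theta) + ys * np.sin(theta)).astype(int)
        np.add.at(acc[:, t], rho + diag, 1)          # vote per (rho, theta)
    return int(acc.max(axis=0).argmax()) - 90

# Synthetic edge image: three vertical lines, like barcode bars.
img = np.zeros((50, 50), dtype=bool)
img[:, [10, 20, 30]] = True
print(hough_vertical_peak(img))   # → 0 (dominant lines are vertical)
```

Because slight page rotation only shifts the peak by a few degrees, a feature like "strength of the strongest near-vertical line" stays usable on rotated scans.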


About the second step, the most useful classifiers would be based on:

  • k nearest neighbours
  • logistic regression
  • random forest (really well implemented in R, but I do not know about Python)
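Of these, k nearest neighbours is only a few lines with NumPy. A hedged sketch: the two features below (a hypothetical "barcode score" and "line count" per page) are placeholders for whatever features step 1 actually extracts, and the training values are invented.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify feature vector x by majority vote among the k
    nearest training examples (Euclidean distance)."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(d)[:k]]
    return np.bincount(nearest).argmax()

# Hypothetical 2-D features per page: (barcode score, line count).
# Label 0 = free-text page "A", label 1 = form page "B".
X = np.array([[0.1, 2], [0.2, 3], [0.9, 40], [0.8, 35], [0.15, 5], [0.95, 38]])
y = np.array([0, 0, 1, 1, 0, 1])
print(knn_predict(X, y, np.array([0.85, 37])))   # → 1 (form page)
```

With features on very different scales (as here), normalizing each column first would be advisable; it is omitted to keep the sketch short.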
wok
@wok Orange Learning kit has a nice random forest implementation which I used to use before I found the one in R.
whatnick
+4  A: 

I will answer in 3 parts, since your problem is clearly a large one, and I would highly recommend a manual method with cheap labour if the collection does not exceed about 1,000 pages.

Part 1: Feature Extraction - You have a very large array of features to choose from in the object-detection field. Since one of your requirements is rotation invariance, I would recommend the SIFT/SURF class of features. You might also find Harris corners, etc., suitable. Deciding which features to use can require expert knowledge, and if you have the computing power I would recommend creating a nice melting pot of features and passing it through a classifier-training-based importance estimator.

Part 2: Classifier Selection - I am a great fan of the Random Forest classifier. The concept is very simple to grasp and it is highly flexible and non-parametric. Tuning requires very few parameters and you can also run it in a parameter selection mode during supervised training.
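As a toy illustration of the random-forest idea only (a real random forest grows full decision trees over random feature subsets, and in practice you would use R's randomForest or another library implementation rather than this), here is an ensemble of bootstrap-trained decision stumps in NumPy. All names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y):
    """Best single-feature threshold split by 0/1 classification error."""
    best = (0, 0.0, 1, np.inf)               # (feature, threshold, sign, error)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = (sign * (X[:, j] - t) > 0).astype(int)
                err = np.mean(pred != y)
                if err < best[3]:
                    best = (j, t, sign, err)
    return best[:3]

def forest_fit(X, y, n_trees=25):
    """Train one stump per bootstrap sample of the training data."""
    models = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), len(X))   # bootstrap resample
        models.append(fit_stump(X[idx], y[idx]))
    return models

def forest_predict(models, x):
    """Majority vote over the ensemble."""
    votes = [int(s * (x[j] - t) > 0) for j, t, s in models]
    return int(np.mean(votes) > 0.5)

# Hypothetical 2-D page features; label 0 = page "A", 1 = page "B".
X = np.array([[0.10, 2], [0.15, 5], [0.20, 3], [0.25, 4], [0.12, 6],
              [0.80, 35], [0.90, 40], [0.95, 38], [0.85, 37], [0.88, 41]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
models = forest_fit(X, y)
print(forest_predict(models, np.array([0.90, 39])))   # → 1
```

The bootstrap averaging is what gives the ensemble its robustness; the library versions add deeper trees, random feature selection, and out-of-bag error estimates on top of this skeleton.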

Part 3: Implementation - Python in essence is a glue language. Pure python implementations for image processing are never going to be very fast. I recommend using a combination of OpenCV for feature detection and R for statistical work and classifiers.

The solution may seem over-engineered but machine learning has never been a simple task even when the difference between pages is simply that they are the left-hand and right-hand pages of a book.

whatnick
SIFT features are certainly a good idea, but in this case we can maybe define more customized features directly, because of our prior knowledge (presence of a barcode or of plain text, etc.; cf. my post). Using classifier training to find how to combine our features into an answer is a good choice. (+1 in general for the post)
Elenaher
@wok: I think whatnick wanted to propose a more general (and cleaner) approach to the problem instead of diving directly into the question of "what features should I use?". We must keep in mind that the barcode is not the only way to solve this problem, and we should try to combine different approaches. Your link is very interesting in any case.
Elenaher
Kyle
@Kyle Patented does not imply unusable; you will just have to pay a royalty if SIFT actually turns out to be useful.
whatnick