I am working on a fairly large corpus of articles numbering in the tens of thousands. I am currently using PDFBox to extract text with varying success, and I am looking for a way to programmatically check each file to see whether the extraction was at least moderately successful. I'm currently thinking of running a spell checker on each of them, but the language can differ, and I am not yet sure which languages I'm dealing with. Natural language detection with confidence scores may also be an idea.

Oh, and any method also has to play nice with Java, be fast, and be relatively quick to integrate.
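For context, the extraction step currently looks roughly like this (a minimal sketch, assuming the PDFBox 2.x API; the load call differs between versions):

    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class Extractor {

        // Minimal sketch of the extraction step (PDFBox 2.x API assumed).
        public static String extractText(File pdf) throws IOException {
            try (PDDocument document = PDDocument.load(pdf)) {
                PDFTextStripper stripper = new PDFTextStripper();
                return stripper.getText(document);
            }
        }
    }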

+1  A: 

Of course no method will be perfect.

There are usually two classes of text extraction problems:

1 - Nothing gets extracted. This can be because you have a scanned document or because something in the PDF is invalid.

This is usually easy to detect; you should not need complicated code to check for it.
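Something as simple as a length check on the extracted text usually covers this first class. A rough sketch (the 100-character threshold is just a guess you would tune):

    // Rough heuristic for class 1: flag files whose extracted text is
    // essentially empty. The 100-character threshold is an arbitrary guess.
    static boolean looksEmpty(String extractedText) {
        if (extractedText == null) {
            return true;
        }
        String withoutWhitespace = extractedText.replaceAll("\\s+", "");
        return withoutWhitespace.length() < 100;
    }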

2 - You get garbage. Most of the time this is because the PDF file is weirdly encoded. It can be a homemade encoding that is not properly declared, or the PDF author may have needed characters that PDF did not support (for example, the Turkish S with cedilla was missing from the Adobe glyph list for some time: you could not create a correctly encoded file containing it, so you had to cheat to get it to appear visually on the page).

I use an n-gram-based method to detect the language of PDF files based on the extracted text (with different technologies, but the idea is the same). Files whose language is not recognized are usually good suspects for a problem...
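If you want to stay in Java, a library such as Apache Tika ships an n-gram language identifier that can serve the same purpose. A sketch, assuming the Tika 1.x LanguageIdentifier API:

    import org.apache.tika.language.LanguageIdentifier;

    // Sketch: treat documents whose language cannot be identified with
    // reasonable certainty as suspect extractions (Tika 1.x API assumed).
    static boolean languageLooksOk(String extractedText) {
        LanguageIdentifier identifier = new LanguageIdentifier(extractedText);
        String language = identifier.getLanguage(); // e.g. "en", "de"
        return identifier.isReasonablyCertain();
    }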

As for spell checking, I suspect it will give you tons of false positives, especially if you have multiple languages!

siukurnin
+2  A: 

Try a self-learning spell checker. That's not as scary as it sounds: start with a big dictionary containing all the words you're likely to encounter. It can draw from several languages.

When scanning a PDF, allow for a certain number of unknown words (say 5%). If any of these words are repeated often enough (say 5 times), add them to the dictionary. If the PDF contains more than 5% unknown words, it's very likely something that couldn't be processed.

The scanner will learn over time, allowing you to reduce the allowed share of unknown words if that should be necessary. If that is too much hassle, a very big dictionary should work well, too.

If you don't have a dictionary, manually process a couple of documents and let the scanner learn from them. After a dozen files or so, your new dictionary should be large enough to give a reasonable baseline.
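A minimal sketch of this idea in Java, using the 5% and 5-repetition thresholds mentioned above (the tokenization is deliberately crude):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Sketch of the self-learning dictionary check described above.
    // Thresholds (5% unknown, 5 repetitions) follow the answer; tune as needed.
    class LearningSpellChecker {

        private final Set<String> dictionary = new HashSet<>();

        void addWord(String word) {
            dictionary.add(word.toLowerCase());
        }

        // Returns true if the text looks like a successful extraction.
        boolean looksOk(String text) {
            String[] words = text.toLowerCase().split("[^\\p{L}]+");
            Map<String, Integer> unknownCounts = new HashMap<>();
            int total = 0;
            int unknown = 0;
            for (String word : words) {
                if (word.isEmpty()) {
                    continue;
                }
                total++;
                if (!dictionary.contains(word)) {
                    unknown++;
                    unknownCounts.merge(word, 1, Integer::sum);
                }
            }
            if (total == 0) {
                return false;
            }
            // Learn: unknown words repeated often enough are probably real words.
            for (Map.Entry<String, Integer> entry : unknownCounts.entrySet()) {
                if (entry.getValue() >= 5) {
                    dictionary.add(entry.getKey());
                }
            }
            return (double) unknown / total <= 0.05;
        }
    }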

Aaron Digulla
+2  A: 

You could just run the corpus against a list of stop words (the most frequent words that search engines ignore, like "and" and "the"), but then you obviously need stop word lists for all possible/probable languages first.
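A rough sketch of that check in Java; the tiny inline stop word lists and the 1% threshold are placeholders you would replace with real lists per language:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch: if almost none of the tokens are common stop words in any of
    // the candidate languages, the extraction is probably garbage.
    // The tiny word lists and the 1% threshold are placeholder assumptions.
    static boolean containsStopWords(String text) {
        Set<String> stopWords = new HashSet<>(Arrays.asList(
                "the", "and", "of",      // English
                "der", "die", "und",     // German
                "le", "la", "et"));      // French
        int hits = 0;
        int total = 0;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            total++;
            if (stopWords.contains(token)) {
                hits++;
            }
        }
        return total > 0 && (double) hits / total >= 0.01;
    }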

tobiasvl