My application allows users to upload PDF files and stores them on the web server for later viewing. I store the name of the file, its location, size, upload date, user name, etc. in a SQL Server database.

I'd like to be able to programmatically, just after a file is uploaded, generate a list of keywords (maybe everything except common words) and store them in the SQL database as well, so that subsequent users can do keyword searches...

Any suggestions on how to approach this task? Do routines of this type already exist?

EDIT: Just to clarify my requirements: I'm not concerned with doing OCR. I don't know the insides of PDFs, but I understand that if a PDF was generated by an app, such as printing from Word to PDF, the text of the document is searchable... so really my first task, and the intent of my question, is: how do I access the text of a PDF file from an ASP.NET app? OCR on scanned PDFs is probably beyond my requirements at this point.

A: 

The only problem I foresee with grabbing every non-common word is that you'll dilute your search results and end up pulling back more PDFs from the DB than are actually relevant. One website to look at is Scribd, which does something similar to what you are talking about: users upload files and other people can view them online via a Flash app.

A: 

That is a very interesting topic. The question is how many keywords you need to define one PDF. If you say:

  • 3 to 10 - I would check text categorization methods such as a naive Bayes classifier or k-NN (the latter groups similar PDF files into clusters). Similar algorithms are used to filter spam. But these are systems that need training input: for example, if you add keywords to 100 PDFs by hand, the system will learn the patterns. I am not an expert, but this is one way to do it.

  • more than 10 - then I would suggest brute force: filter out common words, then take the most frequent words for a specific document (see the sketch below).

I would explore the first option. Either way, it is worth researching terms such as "text categorization", "auto tagging", "text mining", and "automatic keyword extraction".
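For the brute-force route, here is a minimal C# sketch of frequency-based keyword extraction. It assumes the text has already been pulled out of the PDF, and the stopword list is deliberately tiny and just illustrative; a real list would have hundreds of entries:

    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    class KeywordExtractor
    {
        // Tiny illustrative stopword list; a real one would be much larger.
        static readonly HashSet<string> StopWords = new HashSet<string>
        {
            "the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on"
        };

        // Returns the 'count' most frequent non-stopwords (3+ letters) in 'text'.
        public static IEnumerable<string> TopKeywords(string text, int count)
        {
            return Regex.Matches(text.ToLowerInvariant(), "[a-z]{3,}")
                        .Cast<Match>()
                        .Select(m => m.Value)
                        .Where(w => !StopWords.Contains(w))
                        .GroupBy(w => w)
                        .OrderByDescending(g => g.Count())
                        .Take(count)
                        .Select(g => g.Key);
        }
    }

The resulting keywords could then be inserted into the SQL database next to the file's metadata.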

Some links:

http://en.wikipedia.org/wiki/Naive_Bayes_classifier

Keyword Extraction Using Naive Bayes

Pawel
+2  A: 

As a first step you should extract all text from the PDF. Ghostscript and pdftotext can do this; PDFBox is another option. There are certainly other tools as well.

Then you can remove all stopwords and duplicates and write the remaining words to the database.
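Since the question mentions ASP.NET, here is a rough C# sketch of the extraction step that shells out to pdftotext. It assumes pdftotext (from the Xpdf/Poppler tools) is installed on the web server and available on the PATH; the trailing "-" argument makes it write the extracted text to standard output instead of a file:

    using System.Diagnostics;

    static string ExtractPdfText(string pdfPath)
    {
        // "-" tells pdftotext to write the extracted text to stdout
        // instead of a .txt file next to the PDF.
        var psi = new ProcessStartInfo("pdftotext", "\"" + pdfPath + "\" -")
        {
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(psi))
        {
            // Read stdout before waiting, to avoid deadlocking on a full pipe.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }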

It has been mentioned that this does not work for scanned PDF documents, but this is only half the truth. On the one hand, there are lots of scanned PDFs that have the text embedded as well, because that is what some scanner drivers do (Canon's CanoScan drivers perform OCR and generate searchable PDFs). On the other hand, documents generated with LaTeX that contain non-ASCII characters return garbage in my experience (even when I copy and paste in Acrobat).

Ludwig Weinzierl
What I said is that there are PDF scans without text. I never said /all/ PDF scans lacked text.
Matthew Flaschen
Matthew Flaschen, you are right. My comment wasn't meant as disagreement but as an addition. One half of the truth is that there are PDF scans without text; the other is that text extraction does not necessarily work even with typeset PDFs.
Ludwig Weinzierl
A: 

If you are planning on indexing PDF documents, you should consider using a dedicated text search engine like Lucene. Lucene provides features that would be difficult to implement using only SQL and a relational database. You will still need to extract the text from the PDF documents, but you won't have to worry about filtering out common words yourself: if you strip common words before indexing, you completely lose the ability to do phrase searches.
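As a sketch of what the indexing step might look like from .NET, here is an example using Lucene.NET (the .NET port of Lucene, which fits the asker's ASP.NET setup). The exact API varies between versions; this assumes the 3.x line and that the text has already been extracted:

    using System.IO;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Store;
    using Version = Lucene.Net.Util.Version;

    static void IndexPdfText(string indexDir, string fileName, string extractedText)
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        using (var writer = new IndexWriter(FSDirectory.Open(new DirectoryInfo(indexDir)),
                                            analyzer,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var doc = new Document();
            // Store the file name so a search hit can be mapped back to the DB record.
            doc.Add(new Field("filename", fileName,
                              Field.Store.YES, Field.Index.NOT_ANALYZED));
            // Index (but don't store) the full text; the same analyzer is used
            // at query time, so no manual stopword filtering is needed.
            doc.Add(new Field("content", extractedText,
                              Field.Store.NO, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }
    }

Searches would then go through Lucene (e.g. a QueryParser over the "content" field), while the SQL database keeps holding the file metadata.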

Francois Gravel