I want to programmatically parse a PDF file, look for certain phrases, and find the page number each phrase appears on. Is this possible (I understand that a PDF is not like a text file)? If so, are there libraries that can help?

A: 

Apache Tika (originally part of the Apache Lucene project) includes PDFBox, which extracts the text from a PDF so that you can work with it.
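Once the text is extracted one page at a time, finding the page numbers is plain string matching. Below is a rough sketch of that second step only, not a definitive implementation. It assumes you have already collected one string per page, for example by calling PDFBox's PDFTextStripper with setStartPage and setEndPage set to the same page before each getText call. The class name PhrasePageFinder and the method findPages are made up for illustration. Matching here is case-insensitive and ignores line breaks, which is an assumption about what you want.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps each phrase to the 1-based page numbers it occurs on.
// The per-page strings are assumed to come from a PDF text extractor such as PDFBox.
class PhrasePageFinder {

    // Collapse all whitespace runs (including line breaks) to single spaces and
    // lowercase the text, so a phrase wrapped across two lines still matches.
    private static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    public static Map<String, List<Integer>> findPages(List<String> pageTexts,
                                                       List<String> phrases) {
        // Every phrase gets an entry, even if it is never found.
        Map<String, List<Integer>> hits = new LinkedHashMap<>();
        for (String phrase : phrases) {
            hits.putIfAbsent(phrase, new ArrayList<>());
        }

        for (int i = 0; i < pageTexts.size(); i++) {
            String page = normalize(pageTexts.get(i));
            for (String phrase : hits.keySet()) {
                if (page.contains(normalize(phrase))) {
                    hits.get(phrase).add(i + 1); // page numbers start at 1
                }
            }
        }
        return hits;
    }
}
```

Calling PhrasePageFinder.findPages(pageTexts, phrases) returns a map from each phrase to the list of pages it was found on, with an empty list for phrases that never appear. Note that a phrase split across a page boundary will not be found this way, and extracted text from some PDFs may contain hyphenation or odd spacing that defeats simple matching.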

bmargulies