I want to programmatically parse a PDF file, look for certain phrases, and find out which page each phrase is on. Is this possible (I understand that a PDF is not like a text file)? If so, are there libraries out there that can help?
A:
Apache Tika, which you can find at the Apache Lucene project, includes PDFBox, which will extract the text from the PDF so that you can work with it.
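For the page-number part, one approach is to extract the text one page at a time and search each page separately. PDFBox's `PDFTextStripper` can be restricted to a single page with `setStartPage` and `setEndPage` before calling `getText`; how you open the document differs between PDFBox versions, so check the docs for the one you use. The sketch below covers only the search step and assumes you already have the per-page text in a list (index 0 = page 1). The class and method names (`PhrasePageFinder`, `findPages`, `normalize`) are made up for illustration, and the matching is a plain case-insensitive substring test, which is my assumption about what "look for certain phrases" means here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps each phrase to the 1-based page numbers it occurs on.
final class PhrasePageFinder {

    // Collapse whitespace runs (extracted PDF text often breaks lines mid-phrase)
    // and lower-case the text so the match is case-insensitive.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    // pageTexts.get(0) is assumed to hold the extracted text of page 1, and so on.
    static Map<String, List<Integer>> findPages(List<String> pageTexts, List<String> phrases) {
        Map<String, List<Integer>> hits = new LinkedHashMap<>();
        for (String phrase : phrases) {
            hits.put(phrase, new ArrayList<>());
        }
        for (int i = 0; i < pageTexts.size(); i++) {
            String page = normalize(pageTexts.get(i));
            for (String phrase : hits.keySet()) {
                if (page.contains(normalize(phrase))) {
                    hits.get(phrase).add(i + 1); // report 1-based page numbers
                }
            }
        }
        return hits;
    }
}
```

Calling `PhrasePageFinder.findPages(pageTexts, phrases)` returns a map from each phrase to the list of pages it was found on (an empty list if it never occurs). Note that a phrase split across a page boundary will not be found this way.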
bmargulies
2009-12-30 03:30:28