Is there a package or library for Python that would let me open a PDF and search its text for certain words?

+2  A: 

I don't think you can do it in one step, but you can certainly get the text out of a PDF with pdfminer. You can then run whatever text search you like on the extracted text.
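
A minimal sketch of that two-step approach, in case it helps. It assumes the maintained pdfminer.six fork, which provides `pdfminer.high_level.extract_text`; older pdfminer releases need more setup. The helper names `find_words` and `search_pdf` are made up for illustration.

```python
import re


def find_words(text, words):
    """Return {word: [character offsets]} for each word found, ignoring case."""
    hits = {}
    for word in words:
        # \b keeps "cat" from matching inside "concatenate".
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        offsets = [m.start() for m in pattern.finditer(text)]
        if offsets:
            hits[word] = offsets
    return hits


def search_pdf(path, words):
    """Step 1: extract the text. Step 2: search it."""
    # Imported here so find_words() works without pdfminer installed.
    # Assumption: pdfminer.six is the installed package.
    from pdfminer.high_level import extract_text

    return find_words(extract_text(path), words)
```

Something like `search_pdf("report.pdf", ["invoice", "total"])` (hypothetical file name) would then return the offsets of each word that occurs in the extracted text.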

shylent
+4  A: 

With pyPdf you can call the extractText() method on each page to extract the PDF's text and then work on it.
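
Roughly like this, as a sketch rather than tested production code. It uses the legacy pyPdf names (`PdfFileReader`, `getNumPages`, `getPage`, `extractText`); pyPdf itself targets Python 2, so adjust the import for whatever successor package you have installed. `extract_pages` and `pages_containing` are hypothetical helper names.

```python
def extract_pages(path):
    """Return a list with the extracted text of each page."""
    # Imported here so the search helper below works without pyPdf installed.
    # Assumption: the legacy pyPdf package is available.
    from pyPdf import PdfFileReader

    with open(path, "rb") as f:
        reader = PdfFileReader(f)
        return [reader.getPage(i).extractText()
                for i in range(reader.getNumPages())]


def pages_containing(page_texts, word):
    """Return 1-based page numbers whose text contains word, ignoring case."""
    needle = word.lower()
    return [n for n, text in enumerate(page_texts, start=1)
            if needle in text.lower()]
```

Usage would be along the lines of `pages_containing(extract_pages("report.pdf"), "invoice")`, where the file name is only a placeholder.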

cartman
@cartman: do you have any idea how to deal with the fact that pyPdf does not put a space between lines? For example, if one line in the PDF says 'hello' and the next line says 'world', the text I extract is 'helloworld' instead of 'hello world', which pretty much kills any text mining...
hatorade
If I remember correctly, pyPdf reads some newlines in some PDFs as '\x00'.
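
If that is what is happening in your file, a small cleanup pass might be enough. This sketch assumes the line breaks really do come through as '\x00' (worth checking with `repr()` on the extracted text first); `normalize_extracted` is a made-up name.

```python
import re


def normalize_extracted(text):
    """Turn NUL characters into spaces, then collapse runs of whitespace."""
    # Assumption: '\x00' stands in for a line break in the extracted text.
    text = text.replace("\x00", " ")
    return re.sub(r"\s+", " ", text).strip()
```

If no '\x00' shows up between the lines at all, this will not help, because there is then nothing in the string to split on.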
PhilS
+1 for pyPdf: it's a _very_ handy module, even if it is a bit outdated for Python 2.6 (the sources are available anyway, and it only needs a few adaptations).
RedGlyph