Is there a package or library for Python that would let me open a PDF and search its text for certain words?
+4
A:
With pyPdf, you can call the extractText() method on each page to get the PDF's text, then search that text however you like.
cartman
2009-11-04 07:39:34
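A minimal sketch of the approach in the answer above, not a definitive implementation. It assumes the pypdf package (the maintained successor to pyPdf), where the equivalent of extractText() is extract_text(); the helper names find_words and extract_page_texts are made up for illustration. The search itself is plain Python and works on any list of page strings.

```python
import re


def find_words(page_texts, words):
    """Map each word to the 1-based page numbers where it appears.

    Matching is case-insensitive and on whole words only.
    """
    hits = {}
    for word in words:
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        hits[word] = [
            number
            for number, text in enumerate(page_texts, start=1)
            if pattern.search(text)
        ]
    return hits


def extract_page_texts(path):
    """Return one string of extracted text per page of the PDF at `path`.

    Assumption: pypdf is installed. It is imported lazily so that
    find_words() stays usable without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(path)
    # extract_text() can come back empty for image-only pages
    return [page.extract_text() or "" for page in reader.pages]
```

Usage would look something like find_words(extract_page_texts("report.pdf"), ["invoice", "total"]), where "report.pdf" is a placeholder file name. Scanned PDFs without a text layer will yield no matches, since there is no text to extract.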
@cartman: Do you have any idea how to deal with the fact that pyPdf does not put a space between lines? For example, if one line in the PDF says 'hello' and the next line says 'world', the text I extract is 'helloworld' instead of 'hello world', which pretty much kills any text mining.
hatorade
2009-11-04 08:24:43
If I remember correctly, pyPdf reads some newlines in some PDFs as '\x00'.
PhilS
2009-11-04 08:53:04
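If the line breaks do come through as '\x00', as suggested in the comment above, one possible cleanup is to turn those characters back into whitespace before searching. This is only a sketch under that assumption; normalize_extracted_text is a hypothetical helper name, and it cannot help when the extractor emits no separator at all (the 'helloworld' case).

```python
import re


def normalize_extracted_text(text):
    """Turn NUL characters into line breaks, then collapse whitespace runs."""
    # Assumption: '\x00' stands in for a newline in the extracted text
    text = text.replace("\x00", "\n")
    # Collapse spaces, tabs, and newlines into single spaces
    return re.sub(r"\s+", " ", text).strip()
```

Running each page's text through this before matching means a phrase that spans two lines is searched as 'hello world' with a single space between the words.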
+1 for pyPdf: it's a _very_ handy module, even if a bit outdated for Python 2.6 (the sources are available anyway, so it only takes a few adaptations).
RedGlyph
2009-11-04 09:27:07