(python) text-mine PDF files with Python? | ansaurus

Q:

(python) text-mine PDF files with Python?

Is there a Python package or library that lets me open a PDF and search its text for certain words?

+2 A:

I don't think you can do it in one step, but you can certainly get the text out of a PDF with pdfminer. You can then run whatever text search you like on the extracted text.
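A minimal sketch of that two-step approach, assuming the maintained pdfminer.six fork and its `pdfminer.high_level.extract_text` helper. The function names `pdf_to_text` and `count_words` are made up for illustration, and the word matching is deliberately simple.

```python
import re
from collections import Counter


def pdf_to_text(path):
    """Step 1: return all text from the PDF at `path` (assumes pdfminer.six)."""
    # Imported lazily so count_words() is usable without pdfminer installed.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def count_words(text, words):
    """Step 2: count case-insensitive whole-word occurrences of each word."""
    tokens = Counter(re.findall(r"\w+", text.lower()))
    return {w: tokens[w.lower()] for w in words}
```

Usage would look something like `count_words(pdf_to_text("report.pdf"), ["invoice", "total"])`, where `report.pdf` is a placeholder path.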

shylent 2009-11-04 07:38:39

+4 A:

With pyPdf, you can call the extractText() method on a page to extract its text and then work on it.
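A rough sketch of page-by-page extraction. It assumes pypdf, the current successor to pyPdf, where the reader class is `PdfReader` and the page method is spelled `extract_text()` rather than `extractText()`. The helper names `page_texts` and `pages_containing` are hypothetical.

```python
def page_texts(path):
    """Return a list with the extracted text of each page (assumes pypdf)."""
    # Imported lazily so pages_containing() is usable without pypdf installed.
    from pypdf import PdfReader
    reader = PdfReader(path)
    # Fall back to "" in case a page yields no text.
    return [page.extract_text() or "" for page in reader.pages]


def pages_containing(texts, word):
    """Return 1-based page numbers whose text contains `word`, ignoring case."""
    needle = word.lower()
    return [i + 1 for i, text in enumerate(texts) if needle in text.lower()]
```

Keeping the text per page, rather than joining it all into one string, makes it possible to report where in the document a word occurs.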

cartman 2009-11-04 07:39:34

@cartman: do you have any idea how to deal with the fact that pyPdf does not put a space between lines? For example, if one line in the PDF says 'hello' and the next line says 'world', the text I extract is 'helloworld' instead of 'hello world', which kind of kills any text mining...

hatorade 2009-11-04 08:24:43

If I remember correctly, pyPdf reads some newlines in some PDFs as '\x00'.
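If that is what is happening, one possible workaround is to turn the '\x00' characters back into line breaks before searching. This is only a sketch under that assumption. The helpers `normalize` and `contains` are hypothetical, and the substring match tolerates fused lines like 'helloworld' at the cost of false positives (it also finds 'world' inside 'worldwide').

```python
import re


def normalize(text):
    """Treat NUL characters as line breaks, then collapse all whitespace runs."""
    # Assumption: '\x00' stands in for a newline in the extracted text.
    return re.sub(r"\s+", " ", text.replace("\x00", "\n")).strip()


def contains(text, word):
    """Case-insensitive substring search on the normalized text."""
    return word.lower() in normalize(text).lower()
```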

PhilS 2009-11-04 08:53:04

+1 for pyPdf: it's a _very_ handy module, even if a bit outdated for Python 2.6 (the sources are available anyway, and it only needs a few adaptations).

RedGlyph 2009-11-04 09:27:07
