Extracting text data from PDF files | ansaurus

+2 Q:

Extracting text data from PDF files

Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?

In Python there is PDFMiner, but I would like to keep this analysis all in R if possible.

Any suggestions?

+3 A:

Linux systems have pdftotext, which I have had reasonable success with. By default, it creates foo.txt from a given foo.pdf.
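To stay inside R, pdftotext can be driven with `system2` and the resulting text file read back in. This is only a rough sketch: it assumes pdftotext (shipped with xpdf or poppler-utils) is on the PATH, and the helper names `txt_path_for` and `pdf_to_text` are made up for illustration.

```r
# Hypothetical helper: foo.pdf -> foo.txt, mirroring pdftotext's default naming.
txt_path_for <- function(pdf_path) {
  paste0(tools::file_path_sans_ext(pdf_path), ".txt")
}

# Hypothetical helper: run pdftotext on one file and return its text as lines.
pdf_to_text <- function(pdf_path, txt_path = txt_path_for(pdf_path)) {
  if (!nzchar(Sys.which("pdftotext"))) {
    stop("pdftotext was not found on the PATH")
  }
  # Equivalent to the shell command: pdftotext <pdf_path> <txt_path>
  status <- system2("pdftotext", args = shQuote(c(pdf_path, txt_path)))
  if (status != 0) {
    stop("pdftotext exited with status ", status)
  }
  readLines(txt_path, warn = FALSE)
}

# Usage, guarded so nothing runs unless both the tool and the file exist.
if (nzchar(Sys.which("pdftotext")) && file.exists("foo.pdf")) {
  txt <- pdf_to_text("foo.pdf")
  print(head(txt))
}
```

How well this works depends on the PDF; scanned documents without a text layer will come back empty.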

That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.

Dirk Eddelbuettel 2010-10-04 01:56:22

You were right to suggest the text mining packages. The `tm` package has `readPDF`, which seems to do exactly what I need.
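For anyone else landing here, a minimal sketch of what that can look like. It assumes a recent version of `tm`, where `readPDF` takes an `engine` argument and returns a reader function (older releases had a different signature), and it assumes the xpdf command-line tools pdftotext and pdfinfo are installed for `engine = "xpdf"`. The wrapper name `read_pdf_text` is made up.

```r
# Hypothetical wrapper: read one PDF via tm::readPDF and return its text.
read_pdf_text <- function(pdf_path, engine = "xpdf") {
  if (!requireNamespace("tm", quietly = TRUE)) {
    stop("the tm package is not installed")
  }
  # readPDF() does not read anything itself; it builds a reader function.
  reader <- tm::readPDF(engine = engine)
  doc <- reader(elem = list(uri = pdf_path),
                language = "en",
                id = basename(pdf_path))
  # content() comes from NLP, which tm depends on.
  NLP::content(doc)
}

# Usage, guarded so nothing runs unless tm and the file are available.
if (requireNamespace("tm", quietly = TRUE) && file.exists("foo.pdf")) {
  txt <- read_pdf_text("foo.pdf")
  print(head(txt))
}
```

The same reader can also be passed to a corpus constructor through `readerControl` if you have a whole directory of PDFs.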

DrewConway 2010-10-04 02:46:28
