Is it possible to "print a PDF to a file" so that the file contains plain text of the content?
We have a 96-page PDF file and we would like a text file containing all of its text. Is there a way to print the PDF to a file so that the file contains only the text of the PDF?
A Google search for pdf2txt will turn up plenty of tools. I've personally used the command-line program found at http://www.pdf2txt.com and found it very good. However, if the PDF is heavily formatted, the formatting makes text retrieval much harder, and it can be difficult to parse out the information you want.
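
For anyone who wants to script the conversion, here is a minimal sketch. It does not use the program linked above. It assumes the separate `pdftotext` utility (shipped with Poppler/Xpdf) is installed and on the PATH. The function names `build_command` and `pdf_to_text` are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, txt_path, keep_layout=True):
    """Return the argument list for a pdftotext call (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical page layout as far as it can
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def pdf_to_text(pdf_path, txt_path=None, keep_layout=True):
    """Convert a PDF to a text file; assumes pdftotext is installed."""
    pdf_path = Path(pdf_path)
    # Default to the same file name with a .txt extension
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    # check=True raises CalledProcessError if the conversion fails
    subprocess.run(build_command(pdf_path, txt_path, keep_layout), check=True)
    return txt_path
```

Calling `pdf_to_text("report.pdf")` should write `report.txt` next to the PDF. Passing `keep_layout=False` may give cleaner running text for simple single-column documents.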
Adobe Reader has a "save as text" option, and Foxit Reader has a "view as text" option. The Adobe website offers a conversion service too, and many other options exist.
The only problem with those (and I have tested them on a few documents) is that, with the layout discarded, it is sometimes impossible to know exactly where a piece of text belongs in a plain text file, so the tools must guess. Examples: repeating headers, captions under images in the middle of a text block, paged multi-column text, etc.
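
The repeating-header case can often be cleaned up after extraction. The sketch below is one possible heuristic, not something any of the tools above do for you. It assumes you already have the text of each page as a separate string and drops any line that shows up on most pages. The function name `strip_repeated_lines` and the `min_fraction` threshold are assumptions for illustration.

```python
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that recur on at least min_fraction of the pages.

    pages: list of strings, one per page of extracted text.
    """
    counts = Counter()
    for page in pages:
        # Count each distinct non-blank line once per page
        counts.update({line.strip() for line in page.splitlines() if line.strip()})

    # Require at least two occurrences so single-page documents are untouched
    threshold = max(2, int(min_fraction * len(pages)))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned
```

This only catches headers and footers that are identical on every page. Lines that include a page number change from page to page and would need extra handling, and captions or multi-column text are not addressed at all.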
Adobe also offers a free online service that will do this:
http://www.adobe.com/products/acrobat/access_onlinetools.html
It even lets you do this by email. Just send an email with the PDF attached to [email protected]. It will respond with the text attached.