How can I extract the text content (not images) from a PDF while (roughly) maintaining the style and layout like Google Docs can?

+3  A: 

Have you tried the pyPDF or ReportLab PDF libraries? I have not used them personally, but they may be worth a try. Here is useful too

ghostdog74
+2  A: 

If you don't have your heart set on doing this with Python, Ghostscript can do it for you. Check out ps2ascii (a script that ships with Ghostscript) to get the plain text. Styles are more complicated, as they can be specified in a few different ways.
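
For anyone who still wants to drive this from Python, here is a minimal sketch that shells out to Ghostscript's text-extraction script. It assumes the script is on the PATH under the name `ps2ascii` (the name used in the Ghostscript distributions I know of) and that it accepts an input file followed by an output file. `build_command` and `pdf_to_text` are illustrative names, not part of Ghostscript.

```python
import shutil
import subprocess


def build_command(pdf_path, txt_path, script="ps2ascii"):
    """Return the argument list for the Ghostscript text-extraction script.

    Assumption: the script takes the input file followed by the output file.
    """
    return [script, pdf_path, txt_path]


def pdf_to_text(pdf_path, txt_path, script="ps2ascii"):
    """Run the script and return the extracted plain text."""
    if shutil.which(script) is None:
        raise FileNotFoundError(f"{script!r} was not found on PATH")
    # check=True raises CalledProcessError if the script reports a failure.
    subprocess.run(build_command(pdf_path, txt_path, script), check=True)
    with open(txt_path, encoding="utf-8", errors="replace") as handle:
        return handle.read()
```

Calling `pdf_to_text("input.pdf", "output.txt")` should give you the plain text only; fonts, sizes, and layout are not preserved by this route.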

fatcat1111
+3  A: 

If you want to do it just like Google:

Google converts the PDF to an image and then overlays it, where the text used to be, with JavaScript-highlightable areas (which is about like voodoo magic). The areas appear to be text when you move your cursor over them, but they're not. This might not help you, but that's how they do it.

If you want to reverse engineer it, you might start with http://mercurial.selenic.com/. On the home page, they do the same thing with JavaScript to make the text highlightable and copyable.

You can extract the text from the PDF and find its location on the page with one of the libraries mentioned in the other answers. Then you can overlay an extracted image of the page with the same style of JavaScript areas.
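
The overlay idea can be sketched roughly as follows. This is not Google's actual code, just one way to layer invisible but selectable text over a page image. It assumes you already have the text fragments with their bounding boxes in PDF points (origin at the bottom-left of the page) and a page image rendered at one pixel per point; `overlay_html` and its parameters are hypothetical names.

```python
from html import escape


def overlay_html(image_url, page_width, page_height, boxes):
    """Build HTML that layers selectable text over a page image.

    boxes: iterable of (text, x0, y0, x1, y1) in PDF points, with the
    origin at the bottom-left corner of the page.
    """
    parts = [
        f'<div style="position:relative;width:{page_width}px;height:{page_height}px">',
        f'<img src="{escape(image_url)}" style="width:100%;height:100%" alt="">',
    ]
    for text, x0, y0, x1, y1 in boxes:
        # PDF y grows upward, CSS top grows downward, so flip the axis.
        top = page_height - y1
        parts.append(
            f'<span style="position:absolute;left:{x0}px;top:{top}px;'
            f'width:{x1 - x0}px;height:{y1 - y0}px;color:transparent">'
            f"{escape(text)}</span>"
        )
    parts.append("</div>")
    return "\n".join(parts)
```

The only subtle step is the vertical flip: a box whose top edge is at y1 = 712 on a 792-point page ends up 80 pixels from the top of the image.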

orokusaki
Ah, you're right: they are using images, which is not what I want, because I need to manipulate the text.
Plumo
+3  A: 

To extract the text from the PDF and get its position, you can use PDFMiner. PDFMiner can also export the PDF directly to HTML, keeping the text in the right position.
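
As a rough illustration of getting text together with its position, here is a sketch that assumes the maintained `pdfminer.six` fork and its `extract_pages` helper; older PDFMiner releases expose a different, lower-level API. `TextBox` and `iter_text_boxes` are names made up for this example.

```python
from typing import Iterator, NamedTuple


class TextBox(NamedTuple):
    """One block of text and its bounding box in PDF points."""
    page: int
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def iter_text_boxes(pdf_path: str) -> Iterator[TextBox]:
    """Yield every text block in the PDF with its page number and bbox."""
    # Imported lazily so TextBox can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            # Skip figures, lines, and other non-text layout objects.
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox
                yield TextBox(page_number, element.get_text().strip(), x0, y0, x1, y1)
```

Something like `for box in iter_text_boxes("input.pdf"): print(box)` would then list each block with its coordinates, measured from the bottom-left corner of the page.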

I don't know your use case, but there are a lot of problems you can run into when doing this: PDF is presentation-oriented rather than content-oriented, so the text flow is not continuous. If you want the text to be editable, it will not be an easy task.
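
To give an idea of what "not continuous" means in practice, here is a deliberately naive sketch that rebuilds reading order from positioned fragments by grouping them into lines on their vertical position. The input format, the function name, and the 3-point tolerance are all assumptions made for the example; multi-column pages, tables, and rotated text will defeat it.

```python
def reading_order(boxes, line_tolerance=3.0):
    """Join positioned fragments into lines, top to bottom, left to right.

    boxes: iterable of (text, x0, y0), with y0 measured up from the page
    bottom. A fragment joins the current line when its y0 is within
    line_tolerance of the first fragment on that line.
    """
    lines = []  # each entry: [line_y, [(x0, text), ...]]
    # Higher y0 means nearer the top of the page, so visit those first.
    for text, x0, y0 in sorted(boxes, key=lambda b: -b[2]):
        if lines and abs(lines[-1][0] - y0) <= line_tolerance:
            lines[-1][1].append((x0, text))
        else:
            lines.append([y0, [(x0, text)]])
    # Within each line, order the fragments from left to right.
    return [" ".join(t for _, t in sorted(frags)) for _, frags in lines]
```

For example, the fragments `("world", 60, 700)`, `("Second", 10, 686)`, `("Hello", 10, 701)`, and `("line", 55, 685.5)` come back as `["Hello world", "Second line"]`, even though they were stored out of order.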

Etienne