How might one extract all images from a PDF document, at native resolution and format? (Meaning extract TIFF as TIFF, JPEG as JPEG, etc., without resampling.) Layout is unimportant; I don't care where the source image is located on the page.

I'm using Python 2.6 but can use 3.x if required.

thanks

Summarized Responses

There is a Java library, JPedal, which does this with a feature called PDF Clipped Image Extraction. The author, Mark Stephens, has a concise high-level overview of how images are stored in PDF, which may help someone building a Python extractor.

For PDFs that have JPEGs stored in place "as is", Ned Batchelder has a quick-and-dirty JPEG extractor.

+2  A: 

Often in a PDF, the image is simply stored as-is. For example, a PDF with a jpg inserted will have a range of bytes somewhere in the middle that when extracted is a valid jpg file. You can use this to very simply extract byte ranges from the PDF. I wrote about this some time ago, with sample code: Extracting JPGs from PDFs.
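The byte-range idea can be sketched roughly as follows. This is a minimal illustration, not the code from the linked post: the function name `extract_jpegs` is made up here, and it assumes each JPEG sits unmodified inside a `stream` ... `endstream` block and starts within a few bytes of the `stream` keyword. It will miss images stored any other way.

```python
def extract_jpegs(pdf_bytes):
    """Return a list of byte strings, one per JPEG found stored as-is."""
    jpegs = []
    pos = 0
    while True:
        stream_start = pdf_bytes.find(b"stream", pos)
        if stream_start < 0:
            break
        # Assumption: the JPEG start-of-image marker (FF D8) follows the
        # "stream" keyword almost immediately (after the line ending).
        soi = pdf_bytes.find(b"\xff\xd8", stream_start, stream_start + 20)
        if soi < 0:
            pos = stream_start + len(b"stream")
            continue
        stream_end = pdf_bytes.find(b"endstream", soi)
        if stream_end < 0:
            break
        # Take the last end-of-image marker (FF D9) before "endstream".
        eoi = pdf_bytes.rfind(b"\xff\xd9", soi, stream_end)
        if eoi >= 0:
            jpegs.append(pdf_bytes[soi:eoi + 2])
        pos = stream_end + len(b"endstream")
    return jpegs
```

To use it, read the PDF in binary mode (`open(path, "rb").read()`), pass the bytes in, and write each returned item to its own `.jpg` file.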

Ned Batchelder
Thanks Ned. It looks like the particular PDFs I need this for are not storing JPEGs in situ, but I'll keep your sample around in case it matches other files that turn up.
matt wilkie
+1  A: 

There is an article explaining how images are stored inside a PDF at http://pdf.jpedal.org/java-pdf-blog/bid/27708/Understanding-the-PDF-file-format-how-are-images-stored

mark stephens
Thanks Mark, that is an informative page, making it clear this is a more complicated operation than I thought: "All this means that if you want to extract images from a PDF, you need to assemble the image from all the raw data - it is *not* stored as a complete image file you can just rip out." [emphasis added] He has a Java program which does what I want (http://www.jpedal.org/gplSrc/org/jpedal/examples/images/ExtractClippedImages.java.html), not that I know a thing about Java :)
matt wilkie
If you have some control over the PDFs, you might get them to limit the images to DCTDecode (JPEG) data in DeviceRGB, in which case you could just rip them out. You might also see if something like ImageMagick has Python bindings.
mark stephens
A: 

Libpoppler comes with a tool called "pdfimages" that does exactly this.

(On Ubuntu systems it's in the poppler-utils package.)
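Since the question asks about Python, here is a rough sketch of driving pdfimages from a script. The helper names `build_pdfimages_command` and `extract_with_pdfimages` are invented for illustration, and it assumes pdfimages is on the PATH. The `-j` option writes DCT-encoded (JPEG) images out as .jpg files; other images are converted to PPM/PBM by default. Newer Poppler releases also document an `-all` option for keeping more formats native, so check `pdfimages -h` on your system.

```python
import subprocess


def build_pdfimages_command(pdf_path, output_prefix, native_jpeg=True):
    """Build the argument list for a pdfimages call."""
    command = ["pdfimages"]
    if native_jpeg:
        # -j: save JPEG (DCT) images as .jpg instead of converting them.
        command.append("-j")
    # pdfimages takes the input PDF followed by a prefix for output names.
    command.extend([pdf_path, output_prefix])
    return command


def extract_with_pdfimages(pdf_path, output_prefix):
    """Run pdfimages; raises CalledProcessError on a non-zero exit."""
    subprocess.check_call(build_pdfimages_command(pdf_path, output_prefix))
```

For example, `extract_with_pdfimages("input.pdf", "img")` should produce files named along the lines of img-000.jpg, img-001.ppm and so on in the current directory.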

http://poppler.freedesktop.org/

http://en.wikipedia.org/wiki/Pdfimages

dkagedal