I need to extract all the images from a PDF file on my server. I don't want the PDF pages, only the images at their original size and resolution.

How could I do this with Perl, PHP or any other UNIX based app (which I would invoke with the exec function from PHP)?

+5  A: 

With regard to Perl, have you checked CPAN?

Kent Fredric
+8  A: 

pdfimages does just that. It is part of the poppler-utils and xpdf-utils packages.

From the manpage:

Pdfimages saves images from a Portable Document Format (PDF) file as Portable Pixmap (PPM), Portable Bitmap (PBM), or JPEG files.

Pdfimages reads the PDF file PDF-file, scans one or more pages, and writes one PPM, PBM, or JPEG file for each image, image-root-nnn.xxx, where nnn is the image number and xxx is the image type (.ppm, .pbm, .jpg).

NB: pdfimages extracts the raw image data from the PDF file, without performing any additional transforms. Any rotation, clipping, color inversion, etc. done by the PDF content stream is ignored.
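
For anyone who wants to script this, here is a minimal sketch of how the call described in the manpage might be wrapped. It is written in Python rather than PHP's exec, and the helper names build_pdfimages_command and extract_images are hypothetical. It assumes only the -j flag and the "PDF-file image-root" argument order from the manpage, and that pdfimages is on the PATH.

```python
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path, image_root, jpeg=True):
    """Return the argv list for pdfimages (hypothetical helper)."""
    cmd = ["pdfimages"]
    if jpeg:
        # -j asks pdfimages to write DCT-encoded images as .jpg files;
        # other images are still written as .ppm or .pbm.
        cmd.append("-j")
    cmd += [str(pdf_path), str(image_root)]
    return cmd


def extract_images(pdf_path, out_dir, prefix="img"):
    """Run pdfimages and return the files it wrote, sorted by name."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    image_root = out_dir / prefix
    # check=True raises CalledProcessError if pdfimages exits non-zero.
    subprocess.run(build_pdfimages_command(pdf_path, image_root), check=True)
    # Output files are named image-root-nnn.xxx, e.g. img-000.ppm.
    return sorted(out_dir.glob(prefix + "-*"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.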

Luis Melgratti
I think the package gets installed when you install xpdf.
PolyThinker
That is correct too; both packages include pdfimages.
Luis Melgratti
+1  A: 

pdfimages is nice because it does not re-encode anything; it only extracts the JPEGs as they are. But there is a bug:

pdfimages comes from the "poppler-utils" package or from the larger "xpdf-utils" package. At least on Ubuntu, "poppler-utils" comes pre-installed. The pdfimages in poppler-utils 10.0.3 (Ubuntu 9.04 Jaunty) still does not honor the "-j" option for extracting ".jpg" files. It always writes ".ppm".

As a workaround you may replace "poppler-utils" with "xpdf-utils": $ sudo apt-get install xpdf-utils
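
One rough way to tell whether an installed pdfimages is affected is to look at the extensions of the files it produced after a run with "-j". The Python sketch below is only a heuristic, and the helper names count_by_extension and jpeg_flag_ignored are hypothetical. A PDF that contains no DCT-encoded images will also produce no ".jpg" files, even with a working "-j".

```python
from collections import Counter
from pathlib import Path


def count_by_extension(paths):
    """Count extracted files per lowercase extension (hypothetical helper)."""
    return Counter(Path(p).suffix.lower() for p in paths)


def jpeg_flag_ignored(paths):
    """Return True if files were written but none of them is a .jpg.

    Heuristic only: a PDF without DCT-encoded images gives the same result.
    """
    counts = count_by_extension(paths)
    return sum(counts.values()) > 0 and counts[".jpg"] == 0
```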

with kind regards,

+++ Oliver

A: 

annaatae, how do I extract images from a PDF on shared web hosting?

Jayapal Chandran
A: 

Hi, I am trying to use the command:

pdfimages <pdfname>.pdf /pdfimages

However, I am getting an error:

Error: Couldn't open image file '/pdfimages-000.ppm'
Error: Couldn't open image file '/pdfimages-001.ppm'
Error: Couldn't open image file '/pdfimages-002.ppm'

etc...

Why would pdfimages be trying to open an image file? Isn't it supposed to be writing the image files extracted from the .pdf?

etech
A typo was the problem: I was missing the / before the output path.
etech
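
The "Couldn't open image file" message most likely refers to opening the output file for writing, so it can help to check the directory part of the image-root before running pdfimages. With an image-root of /pdfimages, that directory is the filesystem root, which a normal user usually cannot write to. A minimal Python sketch, assuming a hypothetical helper named check_image_root:

```python
import os
from pathlib import Path


def check_image_root(image_root):
    """Return None if the directory part of image_root is usable,
    otherwise a short description of the problem (hypothetical helper)."""
    parent = Path(image_root).parent
    if not parent.is_dir():
        return f"directory does not exist: {parent}"
    # Creating files needs write and search (execute) permission on the directory.
    if not os.access(parent, os.W_OK | os.X_OK):
        return f"directory is not writable: {parent}"
    return None
```

For example, check_image_root("/pdfimages") inspects "/", while check_image_root("./pdfimages") inspects the current directory.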