ansaurus

Question

Check whether a PDF-File is valid (Python)

Answer 1

+2 A:

If you're on a Linux or OS X box, you could use pdftotext (part of Xpdf). If you pass a non-PDF to pdftotext, it will certainly bark at you, and you can use commands.getstatusoutput (subprocess.getstatusoutput in Python 3) to capture the exit status and output and parse it for these warnings.

If you're looking for a platform-independent solution, you might be able to make use of pyPdf.

Edit: It's not elegant, but it looks like pyPdf's PdfFileReader will throw an IOError(22) if you attempt to load a non-PDF.
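A minimal sketch of the external-tool approach, assuming pdftotext is installed and on the PATH. The helper name tool_accepts_file and its command parameter are made up for illustration; the only thing relied on is that the tool exits with a non-zero status when it cannot open the file as a PDF.

```python
import subprocess


def tool_accepts_file(path, command=("pdftotext",)):
    """Return True if the external tool exits with status 0 for `path`.

    `command` defaults to pdftotext (an assumption: it must be installed).
    The trailing "-" asks pdftotext to write the text to stdout, which is
    discarded here because only the exit status matters.
    """
    result = subprocess.run(
        list(command) + [path, "-"],
        stdout=subprocess.DEVNULL,  # extracted text is not needed
        stderr=subprocess.PIPE,     # keep warnings out of the console
    )
    return result.returncode == 0
```

Something like tool_accepts_file("upload.pdf") would then stand in for parsing the warning text by hand. If the tool is missing, subprocess.run raises FileNotFoundError, which this sketch deliberately does not swallow.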

Cal Jacobson 2009-02-17 23:00:05

Answer 2

+2 A:

In a project of mine I need to check the MIME type of uploaded files. I simply use the file command like this:

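from subprocess import Popen, PIPE
# uploaded_file is the open upload (binary mode); its first 1024 bytes are enough for file(1) to detect the type
proc = Popen(["/usr/bin/file", "-b", "--mime", "-"], stdout=PIPE, stdin=PIPE)
filetype = proc.communicate(uploaded_file.read(1024))[0].strip()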

You might of course want to move the actual command into a configuration file, since the command-line options vary among operating systems (e.g. on the Mac).

If you just need to know whether it's a PDF or not and do not need to process it anyway, I think the file command is a faster solution than a library. Doing it by hand is of course also possible, but the file command gives you more flexibility if you want to check for different types.
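For the "by hand" route, a rough sketch might look like the following. The function name looks_like_pdf is hypothetical. It only checks for the "%PDF-" header marker near the start and the "%%EOF" marker near the end, so it says nothing about whether the content in between is well formed.

```python
def looks_like_pdf(data):
    """Cheap check on raw bytes: '%PDF-' near the start, '%%EOF' near the end.

    This is a heuristic, not a validator: a truncated or corrupt PDF can
    still pass, and it is meant only to reject obvious non-PDF uploads.
    """
    head = data[:1024]   # many readers tolerate junk before the header
    tail = data[-1024:]  # the end-of-file marker sits at the very end
    return b"%PDF-" in head and b"%%EOF" in tail
```

Called as looks_like_pdf(open("upload.pdf", "rb").read()), this avoids forking a process at all, at the cost of only recognising one file type.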

MrTopf 2009-02-17 23:19:52

Answer 3

+3 A:

The two most commonly used PDF libraries for Python are:

Both are pure Python, so they should be easy to install as well as being cross-platform.

With pyPdf it would probably be as simple as doing:

from pyPdf import PdfFileReader
doc = PdfFileReader(open("upload.pdf", "rb"))

This should be enough, but doc will now have getDocumentInfo() and getNumPages() methods (also exposed as the documentInfo and numPages properties) if you want to do further checking.
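To turn that into a yes/no check, one option is to wrap the constructor call in a try/except. This is only a sketch: can_parse is a made-up helper, and the reader is passed in as a parameter so the same wrapper works with pyPdf's PdfFileReader or any similar parser that raises on bad input. Catching Exception broadly is an assumption made for brevity; narrowing it to the errors your library actually raises would be safer.

```python
def can_parse(path, reader):
    """Return True if `reader` can load the file at `path` without raising.

    `reader` is any callable that takes an open binary file object,
    for example pyPdf's PdfFileReader (assumed to be installed by the caller).
    """
    try:
        with open(path, "rb") as fh:
            reader(fh)
    except Exception:
        # Broad on purpose for this sketch; a non-PDF lands here.
        return False
    return True
```

With pyPdf imported, the call would be can_parse("upload.pdf", PdfFileReader).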

As Cal answered, pdftotext is also a good solution, and would probably be faster on very large documents (especially ones with many cross-references). However, it might be a little slower on small PDFs due to the system overhead of forking a new process, etc.

Van Gale 2009-02-18 01:10:35

Answer 4

A:

By valid do you mean that it can be displayed by a PDF viewer, or that the text can be extracted? They are two very different things.

If you just want to check that it really is a PDF file that has been uploaded, then the pyPdf solution, or something similar, will work.

If, however, you want to check that the text can be extracted, then you have found a whole world of pain! Using pdftotext would be a simple solution that would work in a majority of cases, but it is by no means 100% successful. We have found many examples of PDFs that pdftotext cannot extract from but Java libraries such as iText and PDFBox can.

Steve Claridge 2009-02-25 00:10:24
