ansaurus

Question

Check if a pdf file is valid using PdfBox by Apache

Answer 1

A:

PDF files begin with "%PDF" (open one in TextPad or a similar editor and take a look).

Is there any reason you can't just read the start of the file with a FileReader and check for this?

cagcowboy 2009-06-02 21:15:33

I have tried this, and it appears that PDF files can use a variety of encodings, so the text read does not always match %PDF, even for valid and readable PDF files.

dodger 2009-06-02 21:19:31

Not all files that begin with %PDF are valid PDF files.

Kyle W. Cartmell 2009-06-02 22:03:11
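
The encoding problem described in the comments above does not arise if the check compares raw bytes instead of decoded text. Below is a minimal sketch of that idea in Java. The class name PdfHeaderCheck, the method hasPdfHeader and the 1024-byte search window are illustrative assumptions (the window mirrors the leniency of Acrobat-family viewers, which accept leading junk before the header). As the last comment points out, a matching header only shows that the file claims to be a PDF, not that it is valid.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class PdfHeaderCheck {
    // "%PDF-" as raw bytes; the file content itself is never decoded as text.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Assumption: search only the first 1024 bytes, tolerating junk before the header.
    private static final int SEARCH_WINDOW = 1024;

    /** Returns true if "%PDF-" lies entirely within the first SEARCH_WINDOW bytes. */
    static boolean hasPdfHeader(byte[] data) {
        if (data == null) {
            return false;
        }
        int lastStart = Math.min(data.length, SEARCH_WINDOW) - MAGIC.length;
        for (int start = 0; start <= lastStart; start++) {
            int matched = 0;
            while (matched < MAGIC.length && data[start + matched] == MAGIC[matched]) {
                matched++;
            }
            if (matched == MAGIC.length) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: PdfHeaderCheck <file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(hasPdfHeader(data) ? "PDF header found" : "no PDF header");
    }
}
```

Reading the whole file keeps the sketch short; for large files, reading only the first 1024 bytes from an InputStream would be enough.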

Answer 2

A:

What do you mean by a valid PDF file? Beyond the header, it also needs to contain a valid cross-reference (xref) table that correctly points to all the objects in the file.

2009-06-03 06:51:56

Exactly. Is there a method to check that this is in fact the case?

dodger 2009-06-03 11:13:11
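
One cheap, partial way to test the cross-reference requirement is to read the startxref value near the end of the file and see whether that byte offset really holds a cross-reference section. The Java sketch below does only that. The names XrefPointerCheck and startxrefPointsAtXref and the 1024-byte tail window are illustrative assumptions, and the check does not verify the individual object offsets stored in the table. Fully parsing the file with PDFBox (for example with PDDocument.load in the 1.x and 2.x releases, catching the IOException) would be a stronger test.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class XrefPointerCheck {
    // Assumption: the startxref keyword sits within the last 1024 bytes of the file.
    private static final int TAIL = 1024;
    private static final String KEYWORD = "startxref";

    /** Heuristic: does the last startxref offset point at an xref table or xref stream? */
    static boolean startxrefPointsAtXref(byte[] data) {
        if (data == null) {
            return false;
        }
        // ISO-8859-1 maps every byte to exactly one char, so string indices equal byte offsets.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        int key = text.lastIndexOf(KEYWORD);
        if (key < 0 || key < text.length() - TAIL) {
            return false;
        }
        int pos = key + KEYWORD.length();
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        int digitsStart = pos;
        while (pos < text.length() && text.charAt(pos) >= '0' && text.charAt(pos) <= '9') {
            pos++;
        }
        if (pos == digitsStart || pos - digitsStart > 10) {
            return false; // no offset, or an implausibly long one
        }
        long offset = Long.parseLong(text.substring(digitsStart, pos));
        if (offset >= text.length()) {
            return false; // offset points past the end of the file
        }
        int from = (int) offset;
        String target = text.substring(from, Math.min(text.length(), from + 32));
        // Either a classic "xref" table or a cross-reference stream object ("N G obj").
        return target.startsWith("xref") || target.matches("(?s)\\d+\\s+\\d+\\s+obj.*");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: XrefPointerCheck <file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(startxrefPointsAtXref(data) ? "startxref looks consistent" : "startxref check failed");
    }
}
```

Files with junk before the %PDF header or with incremental updates may need extra handling that this sketch leaves out.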

Answer 3

+1 A:

You can find out the MIME type of a file (or byte array), so you don't rely blindly on the extension. I do it with Aperture's MimeExtractor (http://aperture.sourceforge.net/), and a few days ago I saw a library made just for that (http://sourceforge.net/projects/mime-util).

I use Aperture to extract text from a variety of files, not only PDFs, but I have had to tweak things for PDFs: Aperture uses PDFBox, and I added another library as a fallback for when PDFBox fails.

raticulin 2009-06-06 13:12:46

Thanks, I'll try it.

dodger 2009-06-07 15:36:44

Oh, I forgot to mention that there is now an Apache project for text extraction, http://lucene.apache.org/tika/, in case you prefer it to Aperture.

raticulin 2009-06-08 09:51:45

Answer 4

+2 A:

Here is what I use in my NUnit tests, which must validate multiple versions of PDF files generated by Crystal Reports:

// requires: using NUnit.Framework;
public static void CheckIsPDF(byte[] data)
{
    Assert.IsNotNull(data);
    // The checks below read data[7] and data[data.Length - 7],
    // so at least 8 bytes are required.
    Assert.Greater(data.Length, 7);

    // header (NUnit's AreEqual takes expected first, then actual)
    Assert.AreEqual(0x25, data[0]); // %
    Assert.AreEqual(0x50, data[1]); // P
    Assert.AreEqual(0x44, data[2]); // D
    Assert.AreEqual(0x46, data[3]); // F
    Assert.AreEqual(0x2D, data[4]); // -

    if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x33) // version is 1.3 ?
    {
        // file terminator: "%%EOF" + SPACE + EOL
        Assert.AreEqual(0x25, data[data.Length - 7]); // %
        Assert.AreEqual(0x25, data[data.Length - 6]); // %
        Assert.AreEqual(0x45, data[data.Length - 5]); // E
        Assert.AreEqual(0x4F, data[data.Length - 4]); // O
        Assert.AreEqual(0x46, data[data.Length - 3]); // F
        Assert.AreEqual(0x20, data[data.Length - 2]); // SPACE
        Assert.AreEqual(0x0A, data[data.Length - 1]); // EOL
        return;
    }

    if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x34) // version is 1.4 ?
    {
        // file terminator: "%%EOF" + EOL
        Assert.AreEqual(0x25, data[data.Length - 6]); // %
        Assert.AreEqual(0x25, data[data.Length - 5]); // %
        Assert.AreEqual(0x45, data[data.Length - 4]); // E
        Assert.AreEqual(0x4F, data[data.Length - 3]); // O
        Assert.AreEqual(0x46, data[data.Length - 2]); // F
        Assert.AreEqual(0x0A, data[data.Length - 1]); // EOL
        return;
    }

    Assert.Fail("Unsupported file format");
}

NinjaCross 2010-02-09 11:10:05

Thanks, this just helped me figure out what was going wrong with the PDF I was generating -- an EOL problem that only showed up in Adobe Reader, not in Foxit/GoogleApps/Sumatra.

Michael Greene 2010-06-08 02:07:02
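
The fixed offsets in the test above depend on exactly which bytes the generator writes after %%EOF, which is where EOL differences like the one just mentioned tend to bite. A more tolerant variant skips trailing whitespace first and then looks for the marker. The Java sketch below shows this. The names PdfEofCheck and endsWithEofMarker are illustrative, and the set of bytes treated as trailing whitespace (space, tab, CR, LF) is an assumption, not something taken from the answer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class PdfEofCheck {
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if data ends with "%%EOF" followed only by spaces, tabs, CR or LF. */
    static boolean endsWithEofMarker(byte[] data) {
        if (data == null) {
            return false;
        }
        int end = data.length;
        // Skip trailing whitespace so "\n", "\r\n" and " \n" endings are all accepted.
        while (end > 0 && isTrailingWhitespace(data[end - 1])) {
            end--;
        }
        if (end < EOF_MARKER.length) {
            return false;
        }
        for (int i = 0; i < EOF_MARKER.length; i++) {
            if (data[end - EOF_MARKER.length + i] != EOF_MARKER[i]) {
                return false;
            }
        }
        return true;
    }

    private static boolean isTrailingWhitespace(byte b) {
        return b == ' ' || b == '\t' || b == '\r' || b == '\n';
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: PdfEofCheck <file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(endsWithEofMarker(data) ? "%%EOF marker found" : "no %%EOF marker at end");
    }
}
```

This accepts both terminators the NUnit test distinguishes by version, so it trades strictness for portability across generators.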
