I have a Windows .NET application that manages many PDF files. Some of the files are corrupt.

Two issues (I'll try to explain in my worst English, sorry):

1.)

How can I detect whether a PDF file is correct?

I want to read the header of the PDF and detect whether it is correct.

var okPDF = PDFCorrect(@"C:\temp\pdfile1.pdf");

2.)

How can I tell whether a byte[] (byte array) holds a PDF file or not?

For example, for ZIP files, you could examine the first four bytes and see if they match the local header signature, i.e. in hex

50 4b 03 04

if (buffer[0] == 0x50 && buffer[1] == 0x4b && buffer[2] == 0x03 && buffer[3] == 0x04)

If you are loading it into a long, this is (0x04034b50). by David Pierson

I want the same for PDF files.

byte[] dataPDF = ...

var okPDF = PDFCorrect(dataPDF);

Any sample code in .NET, please.

+4  A: 

The first line of a PDF file is a header identifying the version of the PDF specification to which the file conforms: %PDF-1.0, %PDF-1.1, %PDF-1.2, %PDF-1.3, %PDF-1.4 etc.

You could check this by reading a few bytes from the start of the file and seeing whether they match that header. See the PDF reference from Adobe for more details.

I don't have a .NET example for you (I haven't touched the thing in some years now), but even if I had, a header check cannot tell you whether the whole content of the file is valid. The header might be OK while the rest of the file is messed up (as you said yourself, some files are corrupt).

dpb
Oh, yeah, I'll try to search for more information; I haven't found any yet. You're right, there are two issues: 1. checking the PDF file header; 2. the header is right but the PDF file is still corrupt. I'll search the Adobe forums, or maybe someone here can give the solution.
alhambraeidos
+1  A: 

You could use iTextSharp to open and attempt to parse the file (e.g. try and extract text from it) but that's probably overkill. You should also be aware that it's GNU Affero GPL unless you purchase a commercial licence.

Rup
+2  A: 

Hello alhambraeidos!

1) Unfortunately there is no easy way to determine whether a PDF file is corrupt. Problem files usually have a correct header, so the real causes of corruption lie elsewhere. A PDF file is effectively a dump of PDF objects. The file contains a cross-reference table giving the exact byte offset of each object from the start of the file. So corrupted files most probably have broken offsets, or some object is missing.

The best way to determine whether a file is corrupted is to use a specialized PDF library. There are lots of such libraries for .NET, both free and commercial. You can simply try to load the PDF file with one of them. iTextSharp would be a good choice.
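A cheap structural check in the same spirit, short of loading the file with a library, is to look at the end of the file: a PDF normally ends with the keyword startxref, the byte offset of the cross-reference section, and %%EOF. The sketch below only tests whether that offset exists and points at something plausible. The class and method names are made up for illustration, the 1024-byte tail window is an assumption, and passing this check does not prove the file is valid.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class PdfTrailerCheck
{
    // Returns the byte offset written after the last "startxref" keyword,
    // or -1 if the keyword, the number, or the "%%EOF" marker is missing.
    public static long ReadStartXref(byte[] data)
    {
        if (data == null) return -1;

        // Only inspect the tail of the file (window size is an assumption).
        int tailLen = Math.Min(data.Length, 1024);
        // ASCII decoding maps every byte to exactly one char, so indexes stay aligned.
        string tail = Encoding.ASCII.GetString(data, data.Length - tailLen, tailLen);

        int key = tail.LastIndexOf("startxref", StringComparison.Ordinal);
        if (key < 0) return -1;
        int eof = tail.IndexOf("%%EOF", key, StringComparison.Ordinal);
        if (eof < 0) return -1;

        int start = key + "startxref".Length;
        string number = tail.Substring(start, eof - start).Trim();
        long offset;
        return long.TryParse(number, NumberStyles.None, CultureInfo.InvariantCulture, out offset)
            ? offset
            : -1;
    }

    // True if the recorded offset lies inside the data and points at either a
    // classic "xref" table or an object header such as "12 0 obj"
    // (which is how a cross-reference stream starts).
    public static bool TrailerLooksSane(byte[] data)
    {
        long offset = ReadStartXref(data);
        if (offset < 0 || offset >= data.Length) return false;

        int len = (int)Math.Min(32, data.Length - offset);
        string target = Encoding.ASCII.GetString(data, (int)offset, len);
        return target.StartsWith("xref", StringComparison.Ordinal)
            || Regex.IsMatch(target, @"^\d+\s+\d+\s+obj");
    }
}
```

A truncated download typically fails this test because the trailer is gone, while a file with a damaged object in the middle will still pass, so treat a failure as a strong hint and a success as no guarantee.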

2) According to the PDF reference, the header of a PDF file usually has the form %PDF-1.X (where X is a number, for the present from 0 to 7), and 99% of PDF files have such a header. But there are some other kinds of headers that Acrobat Viewer accepts, and even the absence of a header isn't a real problem for PDF viewers. So you shouldn't treat a file as corrupted just because it has no header. E.g. the header may appear somewhere within the first 1024 bytes of the file, or be in the form %!PS-Adobe-N.n PDF-M.m
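For the byte[] case from the question, a minimal sketch of such a tolerant header check could look like this. It searches for the ASCII marker %PDF- anywhere in the first 1024 bytes instead of only at offset 0. The names PdfHeaderCheck, FindHeaderOffset and LooksLikePdf are hypothetical, and the sketch only says "this is probably meant to be a PDF", not that the file is intact.

```csharp
using System;
using System.IO;

public static class PdfHeaderCheck
{
    // ASCII bytes of "%PDF-".
    private static readonly byte[] Marker = { 0x25, 0x50, 0x44, 0x46, 0x2D };

    // How far into the data to look (taken from the 1024-byte remark above).
    private const int SearchWindow = 1024;

    // Returns the offset of "%PDF-" if the whole marker lies within the
    // first 1024 bytes, or -1 if it does not.
    public static int FindHeaderOffset(byte[] data)
    {
        if (data == null) return -1;

        int limit = Math.Min(data.Length, SearchWindow);
        for (int i = 0; i + Marker.Length <= limit; i++)
        {
            int j = 0;
            while (j < Marker.Length && data[i + j] == Marker[j]) j++;
            if (j == Marker.Length) return i;
        }
        return -1;
    }

    // byte[] variant: true if the marker was found.
    public static bool LooksLikePdf(byte[] data)
    {
        return FindHeaderOffset(data) >= 0;
    }

    // File variant: reads at most the first 1024 bytes, then reuses the byte[] check.
    public static bool LooksLikePdf(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            var buffer = new byte[SearchWindow];
            int total = 0, read;
            // Stream.Read may return fewer bytes than requested, so loop until full or EOF.
            while (total < buffer.Length
                   && (read = fs.Read(buffer, total, buffer.Length - total)) > 0)
            {
                total += read;
            }
            Array.Resize(ref buffer, total);
            return LooksLikePdf(buffer);
        }
    }
}
```

If you want the strict behaviour instead, test FindHeaderOffset(data) == 0.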

Just for your information I am a developer of the Docotic PDF library.

Vitaliy Shibaev
A: 

I check the PDF header like this. What do you think about it? Thanks.

 // Requires: using System.IO;
 public bool EsCabeceraPDF(string fileName)
    {
        // "%PDF-" in ASCII: 0x25 0x50 0x44 0x46 0x2D
        byte[] expected = { 0x25, 0x50, 0x44, 0x46, 0x2D };

        // Dispose the stream and the reader even if reading throws.
        using (var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
        using (var br = new BinaryReader(fs))
        {
            // Read only the first five bytes; ReadBytes returns fewer
            // if the file is shorter than that.
            byte[] buffer = br.ReadBytes(expected.Length);
            if (buffer.Length < expected.Length)
            {
                return false;
            }

            for (int i = 0; i < expected.Length; i++)
            {
                if (buffer[i] != expected[i])
                {
                    return false;
                }
            }
            return true;
        }
    }
alhambraeidos
+1  A: 

Well-behaved PDFs start with the 9 bytes %PDF-1.x plus a newline (where x is in 0..7). 1.x is supposed to give you the version of the PDF file format. The second line holds some binary bytes that help applications (editors) identify the PDF as a non-ASCII-text file type.

However, you cannot trust this tag at all. There are lots of applications out there which use features from PDF-1.7 but claim to be PDF-1.4, and thus mislead some viewers into spitting out bogus error messages. (Most likely these PDFs are the result of a mismanaged conversion of the file from a higher to a lower PDF version.)

There is no such section as a "header" in PDF (maybe the initial 9 bytes of %PDF-1.x are what you meant by "header"?). A PDF may contain an embedded structure for metadata, giving you info about Author, CreationDate, ModDate, Title and some other stuff.

My way to reliably check for PDF corruption

There is no other way to check for validity and un-corrupted-ness of a PDF than to render it.

A "cheap" and rather reliable way to check for such validity for me personally is to use Ghostscript.

However: you want this to happen fast and automatically. And you want to use the method programmatically or via a scripted approach to check many PDFs.

Here is the trick:

  • Don't let Ghostscript render the file to a display or to a real (image) file.
  • Use Ghostscript's nullpage device instead.

Here's an example commandline:

gswin32c.exe ^
    -o nul ^
    -sDEVICE=nullpage ^
    -r36x36 ^
    "c:/path to /input.pdf"

This example is for Windows; on Unix use gs instead of gswin32c.exe and -o /dev/null.

Using -o nul -sDEVICE=nullpage will not output any rendering result. But all the stderr and stdout output of Ghostscript's processing the input.pdf will still appear in your console. -r36x36 sets resolution to 36 dpi to speed up the check.

%errorlevel% (or $? on Linux) will be 0 for an uncorrupted file. It will be non-0 for corrupted files. And any warning or error messages appearing on stdout may help you to identify problems with the input.pdf.
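Since the question asks for .NET, here is a rough sketch of running the same command from C# and looking at the exit code. It assumes gswin32c.exe is on the PATH (pass a full path otherwise), and the names GhostscriptCheck, BuildArguments and RendersCleanly are invented for this example. The arguments are exactly the ones from the command line above, so -o nul is Windows-specific.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class GhostscriptCheck
{
    // Same switches as the command line above; the path is quoted
    // so that it may contain spaces.
    public static string BuildArguments(string pdfPath)
    {
        return "-o nul -sDEVICE=nullpage -r36x36 \"" + pdfPath + "\"";
    }

    // Returns true if Ghostscript exits with code 0.
    // 'messages' receives whatever Ghostscript wrote to stdout and stderr.
    // 'gsExe' is an assumption: adjust it to your installation.
    public static bool RendersCleanly(string pdfPath, out string messages,
                                      string gsExe = "gswin32c.exe")
    {
        var psi = new ProcessStartInfo(gsExe, BuildArguments(pdfPath))
        {
            UseShellExecute = false,
            CreateNoWindow = true,
            RedirectStandardOutput = true,
            RedirectStandardError = true
        };

        var log = new StringBuilder();
        using (var process = Process.Start(psi))
        {
            // Read both streams asynchronously so a full pipe cannot block Ghostscript.
            DataReceivedEventHandler collect = (sender, e) =>
            {
                if (e.Data != null)
                {
                    lock (log) { log.AppendLine(e.Data); }
                }
            };
            process.OutputDataReceived += collect;
            process.ErrorDataReceived += collect;
            process.BeginOutputReadLine();
            process.BeginErrorReadLine();

            process.WaitForExit();
            messages = log.ToString();
            return process.ExitCode == 0;
        }
    }
}
```

Usage would be something like `string log; bool ok = GhostscriptCheck.RendersCleanly(@"C:\temp\pdfile1.pdf", out log);`. Keep the log around for files that fail, since the messages usually say what is wrong.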

There is no other way to check for a PDF file's corruption than to somehow render it...

pipitas