Well-behaved PDFs start with the 9 bytes %PDF-1.x plus a newline (where x is in 0..7). The 1.x is supposed to give you the version of the PDF file format. The second line consists of some binary bytes that help applications (editors) identify the PDF as a non-ASCII-text file type.
However, you cannot trust this tag at all. There are lots of applications out there which use features from PDF-1.7 but claim to be PDF-1.4, thus misleading some viewers into spitting out unwarranted error messages. (Most likely these PDFs are the result of a mismanaged conversion of the file from a higher to a lower PDF version.)
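As a rough illustration of how little the first line tells you, here is a minimal Python sketch that only extracts the version a file claims. The function names are made up for this example. A match means the file says it is a PDF of that version, nothing more.

```python
import re

# Matches "%PDF-" followed by a major.minor version at the very start of the data.
_HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")


def claimed_pdf_version(data):
    """Return the version string claimed by the first line, or None if absent."""
    match = _HEADER_RE.match(data)
    return match.group(1).decode("ascii") if match else None


def read_claimed_version(path):
    """Read only the first bytes of a file and return its claimed PDF version."""
    with open(path, "rb") as f:
        return claimed_pdf_version(f.read(16))
```

A file that passes this test can still be truncated or otherwise broken further in, so treat it as a quick pre-filter at best.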
There is no such section as a "header" in PDF (maybe the initial 9 bytes of %PDF-1.x are what you meant by "header"?). A structure for holding metadata may be embedded inside the PDF, giving you info about Author, CreationDate, ModDate, Title and some other things.
My way to reliably check for PDF corruption
There is no other way to check for validity and un-corrupted-ness of a PDF than to render it.
A "cheap" and rather reliable way to check for such validity for me personally is to use Ghostscript.
However, you want this to happen fast and automatically. And you want to use the method programmatically or via a scripted approach to check many PDFs.
Here is the trick:
- Don't let Ghostscript render the file to a display or to a real (image) file.
- Use Ghostscript's nullpage device instead.
Here's an example commandline:
gswin32c.exe ^
-o nul ^
-sDEVICE=nullpage ^
-r36x36 ^
"c:/path to /input.pdf"
This example is for Windows; on Unix use gs instead of gswin32c.exe and -o /dev/null instead of -o nul.
Using -o nul -sDEVICE=nullpage will not output any rendering result. But all the stderr and stdout output from Ghostscript's processing of input.pdf will still appear in your console. -r36x36 sets the resolution to 36 dpi to speed up the check.
%errorlevel% (or $? on Linux) will be 0 for an uncorrupted file. It will be non-0 for corrupted files. And any warning or error messages appearing on stdout may help you to identify problems with the input.pdf.
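For checking many PDFs from a script, the same command line can be wrapped in a few lines of Python. This is only a sketch: the helper names build_gs_command and check_pdf are invented here, and it assumes the Ghostscript console binary is on your PATH as gswin32c.exe on Windows or gs elsewhere (pass gs_exe yourself if yours is named differently).

```python
import os
import subprocess
import sys


def build_gs_command(pdf_path, gs_exe=None):
    """Build the Ghostscript command line shown above as an argument list."""
    if gs_exe is None:
        # Assumption: gswin32c.exe on Windows, gs everywhere else.
        gs_exe = "gswin32c.exe" if sys.platform == "win32" else "gs"
    return [
        gs_exe,
        "-o", os.devnull,        # "nul" on Windows, "/dev/null" on Unix
        "-sDEVICE=nullpage",     # process the pages but produce no output
        "-r36x36",               # low resolution to speed up the check
        pdf_path,
    ]


def check_pdf(pdf_path, gs_exe=None):
    """Return (ok, messages); ok is True when Ghostscript exits with status 0."""
    result = subprocess.run(
        build_gs_command(pdf_path, gs_exe),
        capture_output=True,
        text=True,
        errors="replace",
    )
    return result.returncode == 0, result.stdout + result.stderr
```

Calling check_pdf("input.pdf") for each file and logging the returned messages whenever ok is False gives you the same information as inspecting the exit code and console output by hand.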
There is no other way to check for a PDF file's corruption than to somehow render it...