It is not clear from the PDF ISO standard document (PDF32000-2008) whether a comment may follow the startxref keyword:

startxref
Byte_offset_of_last_cross-reference_section
%%EOF

The standard does seem to imply that comments may appear anywhere:

7.2.3 Comments

Any occurrence of the PERCENT SIGN (25h) outside a string or stream introduces a comment. The comment consists of all characters after the PERCENT SIGN and up to but not including the end of the line, including regular, delimiter, SPACE (20h), and HORIZONTAL TAB characters (09h). A conforming reader shall ignore comments, and treat them as single white-space characters. That is, a comment separates the token preceding it from the one following it.

EXAMPLE The PDF fragment in this example is syntactically equivalent to just the tokens abc and 123.

abc% comment ( /%) blah blah blah
123

Comments (other than the %PDF-n.m and %%EOF comments described in 7.5, "File Structure") have no semantics. They are not necessarily preserved by applications that edit PDF files.
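
To make the quoted rule concrete, here is a minimal Python sketch of "a comment acts as a single white-space character", run on the example from the standard. The function name `tokens_ignoring_comments` is made up for illustration, and the sketch assumes the input contains no strings or streams (where a PERCENT SIGN is literal), so it is not a real PDF lexer.

```python
import re

# A comment runs from '%' up to, but not including, the end of the line.
COMMENT = re.compile(rb"%[^\r\n]*")

# PDF white-space bytes: NUL, TAB, LF, FF, CR, SPACE.
WHITE_SPACE_RUN = re.compile(rb"[\x00\t\n\x0c\r ]+")


def tokens_ignoring_comments(data: bytes) -> list:
    """Replace each comment with one space, then split on white space.

    Assumption: `data` contains no string or stream, so every '%'
    really does start a comment.
    """
    without_comments = COMMENT.sub(b" ", data)
    return [tok for tok in WHITE_SPACE_RUN.split(without_comments) if tok]


fragment = b"abc% comment ( /%) blah blah blah\n123"
print(tokens_ignoring_comments(fragment))  # [b'abc', b'123']
```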

If comments are allowed after startxref, parsing the file becomes harder: a reader cannot know how far to back up from the %%EOF marker before it starts scanning for the byte offset.

Any ideas?

+1  A: 

ISO 32000-1 (7.5.5, "File Trailer") says the two lines preceding %%EOF shall contain, one per line and in order, the keyword startxref and the byte offset of the xref keyword in the last cross-reference section. So, comments are not permitted there. I checked the source for several PDF parsers (iText, Xpdf, and a commercial library), and all of them expect the byte offset immediately after startxref plus white space.

Dwight Kelly
How much white space? If it is only a carriage return and line feed, then backing up about 30 bytes and searching forward for `startxref` would work. If it can be thousands of bytes, it is still just as hard. Or can you simply search backward byte by byte from the end of the file until you hit the first "startxref"?
Ralph
We look for startxref in the last 1048 bytes of the file. Once found, we skip ANY white space, then parse the number that is the xref offset. White space in PDF syntax (ISO 32000-1, 7.2.2, Table 1) is any of the following characters: NULL (00h), HORIZONTAL TAB (09h), LINE FEED (0Ah), FORM FEED (0Ch), CARRIAGE RETURN (0Dh), and SPACE (20h).
Dwight Kelly
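
The approach described in the last comment can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the code of any of the parsers mentioned: the function name `find_startxref_offset` is hypothetical, the 1048-byte window is taken from the comment, and the white-space set is the six bytes from ISO 32000-1, Table 1. The sketch takes the last `startxref` in the tail, which is the one a backward search from the end of the file would hit first.

```python
import re

# 'startxref', one or more PDF white-space bytes, then decimal digits.
# A comment between the keyword and the number makes the match fail,
# which mirrors the strict behaviour described in the answer.
STARTXREF = re.compile(rb"startxref[\x00\t\n\x0c\r ]+(\d+)")


def find_startxref_offset(data: bytes, tail_size: int = 1048) -> int:
    """Return the byte offset that follows the last startxref keyword.

    Only the final `tail_size` bytes are searched; 1048 is the window
    quoted in the comment, not a value required by the standard.
    """
    tail = data[-tail_size:]
    matches = list(STARTXREF.finditer(tail))
    if not matches:
        raise ValueError("no startxref followed by an offset in the file tail")
    return int(matches[-1].group(1))


sample = b"%PDF-1.7\n(body omitted)\nstartxref\r\n  116\r\n%%EOF\n"
print(find_startxref_offset(sample))  # 116
```

Because the regular expression accepts any run of white space, the amount of padding between the keyword and the number does not matter as long as both fall inside the searched tail.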