I ran into this while trying to throw together a simple Automator script to combine several one-page PDF files. I had 88 files to combine, each almost exactly 300KB, so I expected the final product to be about 30MB; instead, the PDF file produced by the Combine PDFs Automator action was 300+MB.

Poking around, I found that the Automator action uses a Python script, with Foundation bindings, to create the new PDF document with the CoreGraphics PDF APIs. Nothing seems out of place. Basically, it's doing this (simplified, but these are the high points):

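# Assumes the PyObjC Quartz bindings; outURL and inURLs are file URLs (CFURL/NSURL).
from Quartz import *

writeContext = CGPDFContextCreateWithURL(outURL, None, None)
for url in inURLs:
    doc = CGPDFDocumentCreateWithURL(url)
    page = CGPDFDocumentGetPage(doc, 1)  # PDF page numbers start at 1
    mediaBox = CGPDFPageGetBoxRect(page, kCGPDFMediaBox)
    CGContextBeginPage(writeContext, mediaBox)  # new output page, same size as the source page
    CGContextDrawPDFPage(writeContext, page)
    CGContextEndPage(writeContext)
CGPDFContextClose(writeContext)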

I can't imagine that CGContextDrawPDFPage, when drawing to a PDF context, would do anything but copy the PDF data for that page (with some window-dressing).

Even when "combining" just one PDF, the output is 2.8MB, compared to the 300KB original one-page PDF.
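
For scale, the figures quoted above work out roughly as follows. This is only a back-of-the-envelope sketch: it assumes decimal units (1 KB = 1000 bytes) and treats "300+MB" as exactly 300 MB, so the combined growth factor is a lower bound. The variable names are made up for illustration.

```python
# Back-of-the-envelope check of the sizes quoted in the question.
# Assumptions: decimal units, and "300+MB" taken as exactly 300 MB.
KB = 1000
MB = 1000 * KB

page_size = 300 * KB          # each one-page input PDF
page_count = 88               # number of input files
expected_total = page_count * page_size

observed_total = 300 * MB     # lower bound for the combined output
observed_single = 2800 * KB   # output when "combining" a single PDF

growth_combined = observed_total / expected_total
growth_single = observed_single / page_size

print(f"expected total: {expected_total / MB:.1f} MB")
print(f"growth, {page_count} pages: at least {growth_combined:.1f}x")
print(f"growth, 1 page: about {growth_single:.1f}x")
```

Both cases grow by roughly the same factor, which suggests the overhead is per page rather than something that accumulates as more pages are added.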

The resulting PDFs look exactly the same page-by-page as the original pages: text is selectable in the same places, graphics look identical, the pages are exactly the same size.

Any ideas?

A: 

Do the input PDFs contain the same set of fonts, or different sets? Maybe if the originals don't contain embedded fonts, but the output does, that could account for some of the growth.

JWWalker
Do you know of a good tool to examine this sort of information about a PDF file?
qwzybug
Nothing specific to PDF. I've sometimes just used the hex editor of Resorcerer.
JWWalker
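
As a rough stand-in for the hex editor approach, a short script can count the embedded-font markers in the raw bytes of the input and output files. This is only a sketch under some assumptions: it looks for the /FontFile, /FontFile2, and /FontFile3 keys that the PDF format uses to point at embedded font programs, it will miss any that sit inside compressed object streams, and the function name is made up for illustration.

```python
import re
import sys
from collections import Counter

# /FontFile, /FontFile2 and /FontFile3 are the font descriptor keys that
# reference embedded font programs. The lookahead stops the pattern from
# matching a longer, unrelated name that merely starts the same way.
FONT_FILE_PATTERN = re.compile(rb"/(FontFile[23]?)(?![A-Za-z0-9])")


def count_embedded_font_markers(data: bytes) -> Counter:
    """Count embedded-font keys visible in the uncompressed part of a PDF."""
    names = FONT_FILE_PATTERN.findall(data)
    return Counter(name.decode("ascii") for name in names)


if __name__ == "__main__":
    # Usage: python fontcount.py original.pdf combined.pdf
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            counts = count_embedded_font_markers(handle.read())
        print(path, dict(counts))
```

If the combined file shows many more markers than the originals, embedded fonts are a plausible contributor to the growth; if the counts match, the extra size is probably coming from somewhere else.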