Nowadays it is more practical to purchase an ebook than the dead-tree version. But the PDFs frequently contain the blank pages used by the print edition. I typically see between 10 and 30 blank pages (or pages with the text "This page intentionally left blank.") per ebook. Is it possible to programmatically remove these blank pages? Currently I manually identify the blank pages and then run the file through this:

pdftops orig.pdf - | psselect "$range_of_non_blank_pages" | ps2pdf - new.pdf

So the hard part is identifying the blank pages. pdftotext would work for the most part, except where the page has only images and no text.

Also, although the resulting file is smaller after removing many pages, once I shrink both the original file and the new version (using various methods found on the internets), the original usually ends up smaller by several hundred KB or more. So it appears the method I'm using to remove the blank pages doesn't create an optimal PDF. I've also tried various GUI programs and see the same result.

A: 

Partial answer: you don't need to go via PostScript (the round trip is probably the reason why you get a bigger file). One possibility is

pdftk orig.pdf cat "$range_of_non_blank_pages" output new.pdf
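If the pages to keep are available as a plain list of numbers, they can be collapsed into ranges before being handed to pdftk. The following is only a minimal sketch: `compress_ranges` is a hypothetical helper name, not part of pdftk, and it assumes the page numbers are passed in ascending order without duplicates.

```shell
#!/bin/sh
# Hypothetical helper: collapse ascending page numbers into "a-b" ranges.
# Assumes the arguments are integers in ascending order with no duplicates.
compress_ranges() {
    start=""; prev=""; out=""
    for p in "$@"; do
        if [ -z "$start" ]; then
            # First page opens the first run.
            start=$p
        elif [ "$p" -ne $((prev + 1)) ]; then
            # Gap found: close the current run and open a new one.
            if [ "$start" -eq "$prev" ]; then
                out="$out $start"
            else
                out="$out $start-$prev"
            fi
            start=$p
        fi
        prev=$p
    done
    # Close the final run, if any pages were given at all.
    if [ -n "$start" ]; then
        if [ "$start" -eq "$prev" ]; then
            out="$out $start"
        else
            out="$out $start-$prev"
        fi
    fi
    # Drop the leading space added while accumulating.
    echo "${out# }"
}
```

A possible use, where the command substitution is deliberately left unquoted so that each range becomes its own argument: `pdftk orig.pdf cat $(compress_ranges 1 2 3 5 7 8) output new.pdf`.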

To identify blank pages, you'd need to use a tool that can go beyond selecting and reassembling pages. Try a library for a scripting language, for example CAM::PDF or PDF::API2 in Perl.
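Short of a full PDF library, a rough first pass may be possible with the poppler command-line tools alone. The sketch below is an assumption-laden heuristic, not a tested solution: it treats a page as blank when pdftotext extracts no text (or only the "intentionally left blank" notice) and pdfimages lists no images on it. The function names `is_blank_text` and `non_blank_pages` are made up for illustration, and the code assumes `pdfimages -list` prints two header lines before the image rows. Pages containing only vector drawings would be misjudged as blank, so check the result before deleting anything.

```shell
#!/bin/sh
# Heuristic sketch; assumes pdfinfo, pdftotext and pdfimages (poppler-utils).

# Succeeds if the text on stdin is empty, whitespace only, or just the
# "This page intentionally left blank" notice.
is_blank_text() {
    # Turn newlines, form feeds, carriage returns and tabs into spaces,
    # then squeeze runs of spaces and trim both ends.
    text=$(tr '\n\f\r\t' '    ' | sed 's/  */ /g; s/^ //; s/ $//')
    case $text in
        ""|"This page intentionally left blank"|"This page intentionally left blank.")
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Print the numbers of the pages of "$1" that do not look blank.
non_blank_pages() {
    pdf=$1
    total=$(pdfinfo "$pdf" | awk '/^Pages:/ { print $2 }')
    n=1; keep=""
    while [ "$n" -le "${total:-0}" ]; do
        # Count image rows, skipping the two header lines (assumed format).
        images=$(pdfimages -list -f "$n" -l "$n" "$pdf" |
                 awk 'NR > 2 { c++ } END { print c + 0 }')
        # Keep the page if it has an image or any meaningful text.
        if [ "$images" -gt 0 ] ||
           ! pdftotext -f "$n" -l "$n" "$pdf" - | is_blank_text; then
            keep="$keep $n"
        fi
        n=$((n + 1))
    done
    echo "${keep# }"
}
```

The list printed by `non_blank_pages orig.pdf` can then be passed, unquoted, as the page arguments of the pdftk command above. Running one pdftotext and one pdfimages process per page is slow on long books, which is one reason a scripting library remains the more robust route.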

Gilles
A: 

I don't know of an open source solution that can detect and remove blank pages. However, Apago's commercial PDF Enhancer can automatically remove blank pages, both vector and scanned. For scanned pages, it can remove scan artifacts such as black edges, hole punches and noise before determining whether a page is blank.

Dwight Kelly