I have several PDFs with the following properties:

Each PDF contains a variable number of "documents", each with a different number of pages.

Each page in a "document" has text such as "Page 3 of 26".

I want to automatically identify the first and last page of each "document" within a PDF and extract those pages into a new PDF for later printing and archival. (Note: this is not the same as the first and last page of the PDF itself, since each PDF may contain several "documents".)
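To make the goal concrete, here is a minimal sketch of the selection logic in Python. It assumes the text of each page has already been extracted by some other tool and that the marker always looks like "Page X of Y". The names `PAGE_RE` and `first_and_last_pages` are made up for illustration.

```python
import re

# Matches markers such as "Page 3 of 26" (assumed format, case-insensitive).
PAGE_RE = re.compile(r"Page\s+(\d+)\s+of\s+(\d+)", re.IGNORECASE)


def first_and_last_pages(page_texts):
    """Return 1-based PDF page numbers that start or end a "document".

    page_texts is a list with one string of extracted text per PDF page.
    Pages without a recognizable marker are skipped.
    """
    keep = []
    for index, text in enumerate(page_texts, start=1):
        match = PAGE_RE.search(text)
        if match is None:
            continue  # no "Page X of Y" marker found on this page
        current, total = int(match.group(1)), int(match.group(2))
        # A one-page document is both first and last; it is kept once.
        if current == 1 or current == total:
            keep.append(index)
    return keep


if __name__ == "__main__":
    sample = ["Page 1 of 3", "Page 2 of 3", "Page 3 of 3", "Page 1 of 1"]
    print(first_and_last_pages(sample))
```

The returned page numbers refer to positions in the whole PDF, so they can be passed directly to whatever tool does the actual page extraction.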

I'm not sure which tools or libraries are available for this problem.

Any recommendations? Preferably something free that can be used to build a tool that runs on Windows.

A: 

You can try using pdftk to decompress the PDF, parse the data, split it, and then recompress it.

Adam Rosenfield
A: 

Check this library out.

For a commercial solution, you could try this one.

Konstantinos
A: 

I managed to come up with a horrible Unix hack that should work:

  • use pdftk to decompress the PDF and explode it into separate pages
  • use pdftotext to convert each page into text
  • write a script to identify the appropriate string in the text and copy the corresponding PDF page into a sub-directory [in progress]
  • find some tool to recombine the pages [to be investigated, pdftk can probably do it]
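The steps above could be glued together roughly as follows. This is only a sketch: it assumes `pdftk` and `pdftotext` are installed and on the PATH, and the helper names (`burst_command`, `text_command`, `cat_command`, `run`) are invented here. It skips the explicit decompress step, since `pdftotext` reads compressed PDFs directly, and it uses pdftk's `cat` operation to pull the chosen pages straight from the original file instead of copying single-page files around.

```python
import subprocess


def burst_command(pdf_path, pattern="pg_%04d.pdf"):
    """pdftk command that explodes a PDF into one file per page."""
    return ["pdftk", pdf_path, "burst", "output", pattern]


def text_command(page_pdf):
    """pdftotext command; "-" sends the extracted text to stdout."""
    return ["pdftotext", page_pdf, "-"]


def cat_command(pdf_path, pages, out_path):
    """pdftk command that copies the given 1-based pages into a new PDF."""
    return ["pdftk", pdf_path, "cat", *[str(p) for p in pages], "output", out_path]


def run(command):
    """Run a command and return its stdout; raises if the tool fails."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Only prints the commands; call run(...) on them once the tools are installed.
    print(burst_command("input.pdf"))
    print(text_command("pg_0001.pdf"))
    print(cat_command("input.pdf", [1, 3, 4], "first_last.pdf"))
```

Building the argument lists separately from running them makes the script easy to check without the tools present, and passing lists rather than shell strings avoids quoting problems with file names that contain spaces.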

This should work on my Unix platform, but I'm not sure whether it is acceptable to bring all these tools into the Windows environment.

One possibility is to use an email gateway that receives PDFs and returns the processed PDF, which makes it even uglier.

Does anyone have a native Win32 solution?

A: 

Java has a nice free PDF library. Check out iText.

From iText's site:

You can use iText to:

  • Serve PDF to a browser
  • Generate dynamic documents from XML files or databases
  • Use PDF's many interactive features
  • Add bookmarks, page numbers, watermarks, etc.
  • Split, concatenate, and manipulate PDF pages
  • Automate filling out of PDF forms
  • Add digital signatures to a PDF file
  • And much more...

Since it's Java, there should be no issues running on Windows, or anywhere else for that matter.

Steve K