Does anyone know of a PDF file parser that I could use to pull out sections of text from a plaintext PDF file? Specifically, I want a way to reliably pull out the text that belongs to annotations.

Delphi, C#, or RegEx; I don't mind.

+1  A: 

Not sure if it supports the functionality you need, but we've been using abcPDF with some success.

Jeremy
I don't think abcPDF supports parsing.
Richard Szalay
@Richard Szalay, I wasn't sure. The feature matrix says it supports reading PDFs, but whether it gives you an object model in the API to access parts of the PDF is something I can't say for certain.
Jeremy
I wouldn't go so far as to reject its advertised feature set :) It didn't support it when I used it last, but its writing capabilities certainly did the job well.
Richard Szalay
ABCpdf does expose an object model, it's what they call Atoms.
Mark S. Rasmussen
+3  A: 

The PDF File Parser article on xactpro seems to be exactly what you need. It explains the format of the PDF and comes with full source code for a parser (and another project for visualisation of the model).

The parser uses format-specific terms, but you could easily use the visualiser to learn what to look for.
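
As a rough illustration of "what to look for" (this is not code from the article), annotations in an uncompressed PDF usually show up as objects whose dictionary has a `/Subtype` and a `/Contents` string. The Python sketch below is my own and rests on several assumptions: the file is plaintext (no object streams or compressed content), each annotation dictionary carries the optional `/Type /Annot` entry, and `/Contents` is a literal `(...)` string without nested unescaped parentheses. The name `find_annotations` is made up for this example.

```python
import re

# "N G obj ... endobj" bodies in an uncompressed (plaintext) PDF.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
# Assumption: the annotation dictionary includes the optional /Type /Annot entry.
ANNOT_RE = re.compile(rb"/Type\s*/Annot\b")
SUBTYPE_RE = re.compile(rb"/Subtype\s*/(\w+)")
# /Contents as a literal string; backslash escapes are allowed, but nested
# unescaped parentheses and hex strings (<...>) are not handled here.
CONTENTS_RE = re.compile(rb"/Contents\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def find_annotations(pdf_bytes):
    """Return (object number, subtype, contents text) for each annotation found."""
    results = []
    for obj in OBJ_RE.finditer(pdf_bytes):
        body = obj.group(3)
        if not ANNOT_RE.search(body):
            continue
        subtype = SUBTYPE_RE.search(body)
        contents = CONTENTS_RE.search(body)
        # Only the \( \) and \\ escapes are undone in this sketch.
        raw = contents.group(1) if contents else b""
        text = re.sub(rb"\\([()\\])", rb"\1", raw).decode("latin-1")
        name = subtype.group(1).decode("ascii") if subtype else ""
        results.append((int(obj.group(1)), name, text))
    return results
```

You would call it as `find_annotations(open("file.pdf", "rb").read())`. For anything beyond simple files (compressed streams, Unicode strings, indirect `/Contents` references) a real parser is the safer route.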

Richard Szalay
+2  A: 

You can also take a look at Xpdf (http://www.foolabs.com/xpdf/download.html).

Mihai Nita
+1  A: 

Check out PDFBox.

Abhijith