I want to read an existing PDF file and get not only the text but also the formatting information, such as font style (bold, italic, ...) and paragraph structure. Is there a code library for doing this, and is it open source or commercial?

I am on Windows and favor C# libraries, but C/C++ is also acceptable.

A: 

I'd echo Mr. Meyers on this. There appear to be a number of them; search for "pdf parser library" (plus your language) in your favorite search engine.

A few top hits:

http://www.lowagie.com/iText/

http://search.cpan.org/~antro/PDF-111/PDF/Parse.pm

http://podofo.sourceforge.net/

http://www.vicman.net/download/13733/ (several for .NET)

Note that if you want to edit an existing PDF, you might want to read this first:

http://1t3xt.info/tutorials/faq.php?branch=faq.pdf%5Fin%5Fgeneral&node=replace%5Fword
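
To give a feel for why the formatting has to be inferred: PDF has no "bold" or "paragraph" attribute as such. A page's content selects a font by name and places glyphs at coordinates, so style is usually guessed from the font's BaseFont name. Below is a minimal sketch of that heuristic in Python. The function name `guess_style` is hypothetical and not part of any of the libraries above, and it assumes you have already obtained the font name from whichever parser you pick.

```python
import re

def guess_style(base_font: str) -> dict:
    """Guess family/bold/italic from a PDF BaseFont name (heuristic only)."""
    # Embedded font subsets carry a six-uppercase-letter prefix, e.g. "ABCDEF+".
    name = re.sub(r"^[A-Z]{6}\+", "", base_font)
    lowered = name.lower()
    return {
        # Family is whatever precedes the first "-" or "," separator.
        "family": re.split(r"[-,]", name, maxsplit=1)[0],
        # Assumption: style words appear literally in the name.
        "bold": "bold" in lowered,
        "italic": "italic" in lowered or "oblique" in lowered,
    }

print(guess_style("ABCDEF+Arial-BoldItalic"))
print(guess_style("Times-Roman"))
```

This will misfire on fonts whose names do not follow the usual conventions, so treat the result as a hint rather than a fact. Paragraphs need a similar guess based on the positions of the text lines.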

TrueWill
Thanks, TrueWill. I've searched before and found that some of these libraries are powerful at creating PDFs but much weaker at parsing them. I was hoping for guidance from someone experienced so I could head in the right direction without spending too much time evaluating all of those libraries.
lz_prgmr
And after reading the article you recommended, I am pessimistic about whether such a library exists.
lz_prgmr
A: 

I can very much recommend PDFlib (http://www.pdflib.com/). It's commercial, but it also has a lite version which you can use for free privately. It offers a lot of functionality and is available for all platforms.

RED SOFT ADAIR