I see many questions and answers about using C# to generate PDF files.
I have a related, but different task.

I have a large number of PDF files already created, and I would like to validate certain parts of the content with Regular Expressions (RegExs). I want to open the PDFs in C#, and be able to read out the text in something approaching a linear fashion.

If headers, footers, any sidebars, etc, get skipped or read out of order, it doesn't matter. I'm just after as much of the main-body text as I can retrieve.

Can you point me towards tools, libraries, APIs, etc., that will enable me to programmatically read text in PDF files?

+1  A: 

There is a library for .NET called PDF Clown.

There is also a nice article over at CodeProject that details a few other libraries and approaches for reading PDF documents.

Development 4.0
A: 

Here is another one:

http://csharp-source.net/open-source/pdf-libraries

Joe Pitz
@Joe: you'll get more upvotes if you do more than just post links.
John Saunders
+2  A: 

I have successfully used two different libraries for this purpose. One is PDFBox (an Apache project), and the other is from Snowtide Informatics.

Both are Java libraries, but you can use them with .NET in combination with IKVM.

Nick
+3  A: 

I used PDFSharp as recently as last autumn and found it very easy to use in comparison to others. Home page for PDFSharp.

Will Marcouiller
A: 

Looks like iTextSharp was a popular answer in "Reading PDF documents in .NET".
Also check out "Reading/Writing PDF files in Visual C# Windows Forms".
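
Whichever library you end up with, the RegEx side can stay independent of it. Here is a rough sketch of that part only, not a definitive implementation. The names `PdfTextValidator`, `FindFailedRules`, and `ValidateFile` are made up for illustration, and `extractText` is a placeholder delegate that you would back with your chosen library (with iTextSharp 5.x, I believe that would mean looping `PdfTextExtractor.GetTextFromPage(reader, page)` over `reader.NumberOfPages`, but check that against the version you install). Whitespace is collapsed first, on the assumption that extracted text has line breaks in unpredictable places.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: validates already-extracted PDF text against named patterns.
public static class PdfTextValidator
{
    // Collapse every run of whitespace (including line breaks) into a single space,
    // so patterns do not depend on where the extractor happened to break lines.
    public static string Normalize(string text)
    {
        return Regex.Replace(text ?? string.Empty, @"\s+", " ").Trim();
    }

    // Returns the names of the rules whose pattern was NOT found in the text.
    public static List<string> FindFailedRules(string text, IDictionary<string, Regex> rules)
    {
        string normalized = Normalize(text);
        var failed = new List<string>();
        foreach (KeyValuePair<string, Regex> rule in rules)
        {
            if (!rule.Value.IsMatch(normalized))
            {
                failed.Add(rule.Key);
            }
        }
        return failed;
    }

    // extractText is an assumption: any function mapping a file path to the
    // document's text, implemented with whichever PDF library you choose.
    public static List<string> ValidateFile(
        string path,
        Func<string, string> extractText,
        IDictionary<string, Regex> rules)
    {
        return FindFailedRules(extractText(path), rules);
    }
}
```

You would call `ValidateFile` once per PDF with the same dictionary of rules; an empty result means every pattern was found. Keeping the extractor behind a delegate also lets you swap libraries later, or test your patterns against plain strings, without touching the validation code.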

SwDevMan81