Programmatic Reading of PDFs in C#

+6 Q:

I see many questions and answers about using C# to generate PDF files.
I have a related but different task.

I have a large number of existing PDF files, and I would like to validate certain parts of their content with regular expressions. I want to open the PDFs in C# and read out the text in something approaching linear order.

If headers, footers, sidebars, etc., get skipped or read out of order, it doesn't matter. I'm just after as much of the main body text as I can retrieve.

Can you point me towards tools, libraries, APIs, etc., that will enable me to programmatically read text in PDF files?

+1 A:

There is a library for .NET called PDF Clown.

There is also a nice article on CodeProject that details a few other libraries and approaches for reading PDF documents.

Development 4.0 2010-03-09 18:47:37

Here is another one:

http://csharp-source.net/open-source/pdf-libraries

Joe Pitz 2010-03-09 18:49:10

@Joe: you'll get more upvotes if you do more than just post links.

John Saunders 2010-03-10 03:30:45

+2 A:

I have successfully used two different libraries for this purpose. One is PDFBox (an Apache project), and the other is from Snowtide Informatics.

Both are Java libraries, but you can use them from .NET in combination with IKVM.

Nick 2010-03-09 18:49:36

+3 A:

I used PDFSharp as recently as last autumn and found it very easy to use in comparison to others. Home page for PDFSharp.

Will Marcouiller 2010-03-09 18:50:45

It looks like iTextSharp was a popular answer in "Reading PDF documents in .NET".
Also check out "Reading/Writing PDF files in Visual C# Windows Forms".
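
For what it's worth, here is a minimal sketch of how the text extraction plus regex check might look with iTextSharp. It assumes the iTextSharp 5.x assembly is referenced (the `PdfTextExtractor` class lives in the `iTextSharp.text.pdf.parser` namespace in that line of releases). The class name, the command-line argument and the regex pattern are made up for illustration.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;          // assumption: iTextSharp 5.x is referenced
using iTextSharp.text.pdf.parser;

class PdfTextCheck
{
    // Concatenates the extracted text of every page, in page order.
    static string ExtractText(string path)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            StringBuilder text = new StringBuilder();
            // iTextSharp page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
            }
            return text.ToString();
        }
        finally
        {
            reader.Close();
        }
    }

    static void Main(string[] args)
    {
        // Hypothetical pattern; replace it with whatever you need to validate.
        Regex pattern = new Regex(@"Invoice\s+No\.\s*\d+");

        string text = ExtractText(args[0]);
        Console.WriteLine(pattern.IsMatch(text) ? "match" : "no match");
    }
}
```

The extracted text follows the order of the drawing operations in each page, so headers, footers and multi-column layouts may come out interleaved or out of order. For loose validation of main-body text that is often good enough.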

SwDevMan81 2010-03-09 19:15:30
