Hi, I want to perform a simple (ideally RegEx) search and replace over a large number of PDF documents in a WinForms application.

I've got as far as using iTextSharp to read and tokenise existing documents, from which I can search for the text. The problem is that it doesn't seem to support generating a new document from these tokens (it only seems to support a GetImportedPage method, which doesn't allow for modification).

Can anyone help me use iTextSharp to do this, or suggest a library (ideally free, but commercial if it needs to be) to do this simple task?

Thanks! Mark.

A: 

iText# is a .NET port of iText, a Java library that definitely lets you generate and manipulate PDFs. I believe it uses IKVM for the port, so maybe with just a little more digging through the API you'll find it!

tonyz
As I said, I used this - it doesn't allow changing of existing PDFs, unfortunately.
A: 

Unfortunately, I have tried iText# (iTextSharp), as stated. Whilst it does seem to be a very powerful library, everything I've read about it points to the fact that modification of existing content is not supported.

I have come across code examples that showed how to break down the existing document into tokens, and from there locate the text that I wish to replace, but there doesn't seem to be any support for re-creating the document with these tokens, only for importing complete pages from the source document - which doesn't help me.

So, really I guess I need to find an alternative (or write my own, which isn't really something I want, or have time, to do!)
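
To make the token-level idea above concrete, here is a minimal sketch of what a regex replace over page content could look like. It is written in Python with only the standard library so it stays self-contained; it is not iTextSharp code, and the names replace_in_content_stream and TJ_PATTERN are hypothetical. It assumes an uncompressed content stream in which text is shown with simple literal strings and the Tj operator. Real documents often compress streams, use custom font encodings, and split a phrase across several operators, which is why this is only an illustration of the approach and not a general solution.

```python
import re

# Matches a PDF literal string operand followed by the Tj (show text) operator,
# e.g. b"(Hello World) Tj". Backslash-escaped characters inside the string are allowed.
TJ_PATTERN = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def replace_in_content_stream(stream: bytes, pattern: str, repl: str) -> bytes:
    """Apply a regex substitution to every literal string shown with Tj.

    Assumption: `stream` is an already-decompressed content stream whose
    strings map directly to Latin-1 text (often not true for real PDFs).
    """
    text_re = re.compile(pattern)

    def _unescape(raw: bytes) -> str:
        # Handle only the simplest escapes: \( \) and \\
        return re.sub(rb"\\([()\\])", rb"\1", raw).decode("latin-1")

    def _escape(text: str) -> bytes:
        # Re-escape parentheses and backslashes before writing the string back
        return re.sub(r"([()\\])", r"\\\1", text).encode("latin-1")

    def _fix(match: "re.Match[bytes]") -> bytes:
        original = _unescape(match.group(1))
        updated = text_re.sub(repl, original)
        # Whitespace between the string and Tj is normalised to one space
        return b"(" + _escape(updated) + b") Tj"

    return TJ_PATTERN.sub(_fix, stream)


if __name__ == "__main__":
    sample = b"BT /F1 12 Tf 72 720 Td (Invoice 2008-001) Tj ET"
    print(replace_in_content_stream(sample, r"\d{4}-(\d{3})", r"2009-\1"))
```

Even when this works, the replaced text keeps the original font and position, so a longer replacement can overlap whatever follows it on the page.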

A: 

You might try PDFsharp; it gives you access to the text and allows you to modify existing content.

Kris Erickson
A: 

Hi Mark, have you found an alternative to solve your problem? I need to do the same, possibly using .NET technology. Please let me know... Andrea

Andrea
Look at PDFsharp (http://www.pdfsharp.com/PDFsharp/) or pdfNet (http://www.pdftron.com/pdfnet/index.html). I've used pdfNet; it's a simple but rich API.
Jason D
A: 

bump @ Kris Erickson - I've had to do this for a client (they wanted their own search crawler). It's a pain, but PDFsharp is the most straightforward method.

JERiv
A: 

From past experience, I know of a commercial product that also makes this same claim, and it's remarkably easy to program to. pdfNet by pdfTron is the product I'm most familiar with. The API is rather large; however, that's understandable given all that PDF can do.

Adobe also has a third-party reseller of their own library. However, it's about as friendly as C migrated to C++ migrated to C# can be. (Imagine an ill-tempered ferret which thinks your finger looks like a snack... It's not quite as bad as that...)

-- Jason

Jason D