I have PDF documents from a scanner. Each PDF contains forms filled out and signed by staff for a day's work. I want to place a barcode, or a standard area for OCR text, on every form type so that the batch scan can be programmatically broken apart into separate PDF documents based on form type.

I would like to do this in Microsoft .NET 2.0.

I can purchase the required Adobe or other namespaces/DLLs needed to accomplish the task if no open source ones are available.

+2  A: 

Not a free or open source option, but you might also look at ABCpdf by WebSupergoo as an alternative to Adobe.

Brian Genisio
+1  A: 

You can research the iTextSharp library, which can split PDF files. It isn't very good at reading the actual PDF content, though, so I have no idea how it would know where to split them.

There are companies that already do this for you. You can research KwikTag, for example.

Will Rickards
+1  A: 

iTextSharp will help you split, reassemble, and apply barcodes to PDFs in .NET languages. I don't think it can OCR a document, but I haven't looked (I used the ABBYY FineReader Engine).

StingyJack
+1  A: 

From the title of your question I'm assuming that you just need to break apart PDF files and that they are already OCR'd. There are a few open source .NET PDF libraries out there. I have successfully used PDFSharp in a project of my own.

Here is a quick snippet that shows how to split each page of a PDF document into its own file using PDFSharp:

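using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

string filePath = @"c:\file.pdf";

// Open in Import mode: PDFSharp only allows pages to be copied out of a document opened this way.
using (PdfDocument ipdf = PdfReader.Open(filePath, PdfDocumentOpenMode.Import))
{
    int i = 1;
    foreach (PdfPage page in ipdf.Pages)
    {
        // One output document per input page.
        using (PdfDocument opdf = new PdfDocument())
        {
            opdf.Version = ipdf.Version;
            opdf.AddPage(page);

            // Writes "page 1.pdf", "page 2.pdf", ... to the current directory.
            opdf.Save("page " + i++ + ".pdf");
        }
    }
}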

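Assuming also that you need to access the text in the document for grouping, you can start from the PdfPage.Contents property. Note that it exposes the page's raw content streams, so any text still has to be parsed out of them.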
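
Once you know the form code for each page, the grouping itself is plain bookkeeping. Here is a rough sketch of that step, with no PDF library involved. FormRange, BatchSplitter and GroupPages are names I made up for illustration. I'm also assuming that a page carrying a code starts a new form, that pages without a code (null) belong to the form before them, and that any pages ahead of the first code get labelled "UNKNOWN". It sticks to C# 2.0 syntax.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper type: one output document is a form type plus a run of pages.
public class FormRange
{
    public string FormType;
    public int FirstPage;   // zero-based index into the scanned batch
    public int PageCount;

    public FormRange(string formType, int firstPage)
    {
        FormType = formType;
        FirstPage = firstPage;
        PageCount = 1;
    }
}

public static class BatchSplitter
{
    // pageCodes[i] is the form code read from page i (barcode or OCR zone),
    // or null when the page has no code and continues the previous form.
    public static List<FormRange> GroupPages(string[] pageCodes)
    {
        List<FormRange> ranges = new List<FormRange>();
        FormRange current = null;

        for (int i = 0; i < pageCodes.Length; i++)
        {
            string code = pageCodes[i];
            if (code != null || current == null)
            {
                // A coded page (or the very first page) starts a new output document.
                current = new FormRange(code != null ? code : "UNKNOWN", i);
                ranges.Add(current);
            }
            else
            {
                // An uncoded page is appended to the form that precedes it.
                current.PageCount++;
            }
        }
        return ranges;
    }
}
```

Each FormRange can then be written out with the same AddPage/Save calls as in the snippet above, adding pages FirstPage through FirstPage + PageCount - 1 to one new PdfDocument instead of creating one document per page.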

joshperry