views:

8033

answers:

10

I have a PDF file which contains data that we need to import into a database. The file seems to be a PDF scan of printed alphanumeric text. It looks like 10 pt. Times New Roman.

Are there any tools or components that will allow me to parse this text? Any advice is appreciated. Thanks.

A: 

A quick Google search turns up this promising result: http://www.pdftron.com/net/index.html

Sijin
A: 

You can use a module like Perl's PDF to extract the text, then use another tool to import the pertinent info into the database.

I am sure there are PDF components for .NET, but I have not tried any, so I don't know what is good.

J.J.
+1  A: 

At a company I used to work for, we used ActivePDF toolkit with some success:

http://www.activepdf.com/products/serverproducts/toolkit/index.cfm

I think you'd need at least the Standard or Pro version but they have trials so you can see if it'll do what you want it to.

Dana
+1  A: 

If the PDF contains scanned images, can these solutions actually work? I think they require that the data be stored internally as text. The solution that sheebz is looking for will probably require some kind of OCR.

Or am I missing something obvious?
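
One way to check which case applies is to ask a text extractor for the embedded text and see whether anything comes back. A minimal Python sketch, assuming Xpdf's pdftotext is on the PATH; the helper names and the 20-character threshold are made up for illustration, and the heuristic is only a rough guess, not a guarantee:

```python
import subprocess


def extract_text_layer(pdf_path, run=subprocess.run):
    """Return whatever embedded text pdftotext finds in the PDF."""
    # "-" as the output file tells pdftotext to write to stdout.
    result = run(["pdftotext", pdf_path, "-"],
                 capture_output=True, text=True, check=True)
    return result.stdout


def needs_ocr(extracted_text, min_chars=20):
    """Heuristic: almost no visible characters means no usable text layer."""
    # split() also drops the form feeds pdftotext emits between pages.
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars
```

If needs_ocr(extract_text_layer("scan.pdf")) comes back True, the pages are most likely images and the PDF libraries mentioned above will not find any text on their own.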

wcm
Nope, you have to OCR. But if they have a consistent font, this isn't too hard. In fact, if the PDF has a complex layout, it can be simpler to create an image and OCR that with an existing library rather than trying to decode the PDF format.
Martin Beckett
A: 

I've recently found ReportLab for Python.

Walter
A: 

If the PDF is a scan of printed text, it will be hard to do it yourself (it involves image processing, character recognition, etc.). A PDF will generally store scanned documents as JPEGs internally. You are better off using a third-party OCR tool that does this.

Vivek
+4  A: 

You can't extract scanned text from a PDF. You need OCR software. The good news is there are a few open source applications you can try and the OCR route will most likely be easier than using a PDF library to extract text. Check out Tesseract and GOCR.
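
As a rough illustration of the Tesseract route, here is a minimal Python sketch that shells out to the tesseract command-line program. It assumes a reasonably recent Tesseract on the PATH that accepts "stdout" as the output base; the function names are hypothetical, and the page is assumed to have already been saved as an image file:

```python
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build the command line for OCR-ing one image."""
    # Using "stdout" as the output base makes tesseract print the
    # recognized text instead of writing a .txt file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng", run=subprocess.run):
    """Run tesseract on one image and return the recognized text."""
    result = run(tesseract_command(image_path, lang),
                 capture_output=True, text=True, check=True)
    return result.stdout
```

Calling ocr_image("page-000.png") would then return the page text as a string, ready for whatever parsing the database import needs.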

jm4
+3  A: 

I've used pdftohtml to successfully strip tables out of PDF into CSV. It's based on Xpdf, which is a more general purpose tool, that includes pdftotext. I just wrap it as a Process.Start call from C#.
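
For what it's worth, a similar table-to-CSV step can be sketched in Python on top of pdftotext instead of pdftohtml (a swapped-in approach, not what I actually ran). It assumes the text came from something like `pdftotext -layout file.pdf -`, where columns end up separated by runs of two or more spaces; layout_to_csv is a made-up helper name and the splitting rule is only a heuristic:

```python
import csv
import io
import re


def layout_to_csv(text):
    """Turn layout-preserved text into CSV, one row per non-blank line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        if line.strip():
            # Two or more whitespace characters are treated as a column gap,
            # so single spaces inside a cell ("Widget A") are preserved.
            writer.writerow(re.split(r"\s{2,}", line.strip()))
    return out.getvalue()
```

This breaks down when a cell contains wide gaps itself or when columns are separated by a single space, so check the output against a few real pages first.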

If you're looking for something a little more DIY, there's the iTextSharp library - a port of Java's iText - and PDFBox (yes, it says Java - but they have a .NET version by way of IKVM.NET). There are some CodeProject articles on using iTextSharp and PDFBox from C#.

And, if you're really a masochist, you could call into Adobe's PDF IFilter with COM interop. The IFilter spec is pretty simple, but I would guess that the interop overhead would be significant.

Edit: After re-reading the question and subsequent answers, it's become clear that the OP is dealing with images in his PDF. In that case, you'll need to extract the images (the PDF libraries above are able to do that fairly easily) and run them through an OCR engine.
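
The extract-then-OCR flow could look roughly like this in Python. This is only a sketch: it assumes Xpdf's pdfimages is on the PATH, and `ocr` is a caller-supplied placeholder function (image path in, text out) standing in for whichever OCR engine you pick:

```python
import subprocess
import tempfile
from pathlib import Path


def ocr_pdf_images(pdf_path, ocr, run=subprocess.run):
    """Extract the embedded images with pdfimages, then OCR each one.

    `ocr` is a hypothetical callable (image path -> text) wrapping the
    OCR engine of your choice. Returns one string per extracted image.
    """
    with tempfile.TemporaryDirectory() as workdir:
        prefix = str(Path(workdir) / "page")
        # -j writes JPEG-compressed images as .jpg files; other images
        # are written in PBM/PPM format. Files are named <prefix>-NNN.<ext>.
        run(["pdfimages", "-j", pdf_path, prefix], check=True)
        images = sorted(Path(workdir).glob("page-*"))
        return [ocr(str(image)) for image in images]
```

The temporary directory is deleted once the function returns, so the OCR has to happen inside it, as it does here.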

I've used MODI interactively before, with decent results. It's COM, so calling it from C# via interop is also doable and pretty simple:

' lifted from http://en.wikipedia.org/wiki/Microsoft_Office_Document_Imaging
Dim inputFile As String = "C:\test\multipage.tif"
Dim strRecText As String = ""
Dim Doc1 As MODI.Document

Doc1 = New MODI.Document
Doc1.Create(inputFile)
Doc1.OCR()  ' this will ocr all pages of a multi-page tiff file
Doc1.Save() ' this will save the deskewed reoriented images, and the OCR text, back to the inputFile

For imageCounter As Integer = 0 To (Doc1.Images.Count - 1) ' work your way through each page of results
   strRecText &= Doc1.Images(imageCounter).Layout.Text    ' this puts the ocr results into a string
Next

System.IO.File.AppendAllText("C:\test\testmodi.txt", strRecText) ' write the OCR text out to disk (fully qualified, so no Imports System.IO is needed)

Doc1.Close() ' clean up
Doc1 = Nothing

Others like Tesseract, but I have no direct experience with it. I've heard both good and bad things about it, so I imagine it greatly depends on your source quality.

Mark Brackett
This was an excellent list of resources. Thanks!
torial
A: 

If I get it right, sheebz is asking how to extract PDF fields and load the data into a database. Have you looked at iTextSharp? - http://sourceforge.net/projects/itextsharp/

MarlonRibunal
+5  A: 

I have posted about parsing PDFs on one of my blogs. See this link:

http://devpinoy.org/blogs/marl/archive/2008/03/04/pdf-to-text-using-open-source-library-pdfbox-another-sample-for-grade-1-pupils.aspx

MarlonRibunal
thanks! Saved me a headache!
JasonS
The link above no longer works. I get an "Unable to open connection to data provider" error message.
jontsnz