Hello, I need to parse a large text (about 1,000 pages in a Word or PDF document) and place some of the text from it into database fields.

The only thing that distinguishes the text I want to extract is its format: it is always "Helvetica-Condensed", size 12.

Can I do that? I know how to use the string functions, but what should I use to test the format?

As I said, the text is stored in a Word document or a PDF.

If there is a third-party component that can do this, please point me to it.

Thanks

A: 

There is QuickPDF. The price is $249,00.

ikkebr
A: 

The other option is to code it yourself. The file specification is available online, and if you're only trying to pull the text out of the document, it should guide you most of the way.

The one thing to be careful of is documents that are built entirely from images. In that case (no matter what you use to read the file) you will also need an OCR application. To check whether this applies, open a sample of the type of file you want to extract text from, select some text, copy it, and try to paste it into Notepad. If nothing can be selected or nothing pastes, the pages are images.
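If you do code it yourself, the filtering step could look roughly like the sketch below. It is in Python and assumes you have already pulled the text out as a sequence of runs, each with its font name and size; the `Span` type, the function names, and the sample data are all made up for illustration, and how you obtain the runs depends on the parser you use. The one detail worth knowing is that embedded subset fonts in a PDF carry a six-letter tag in front of the name (for example "ABCDEF+Helvetica-Condensed"), so the sketch strips that before comparing.

```python
import re
from typing import Iterable, List, NamedTuple


class Span(NamedTuple):
    """One run of text with uniform formatting (hypothetical structure)."""
    text: str
    font: str
    size: float


# PDF subset fonts carry a six-letter tag, e.g. "ABCDEF+Helvetica-Condensed".
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def matches_format(span: Span, font: str = "Helvetica-Condensed",
                   size: float = 12.0, tolerance: float = 0.05) -> bool:
    """Return True if the span uses the target font at the target size."""
    base_font = SUBSET_PREFIX.sub("", span.font)
    return base_font == font and abs(span.size - size) <= tolerance


def extract_fields(spans: Iterable[Span]) -> List[str]:
    """Join consecutive matching spans into one string per field."""
    fields: List[str] = []
    current: List[str] = []
    for span in spans:
        if matches_format(span):
            current.append(span.text)
        elif current:
            # A non-matching run ends the field collected so far.
            fields.append("".join(current).strip())
            current = []
    if current:
        fields.append("".join(current).strip())
    return fields


# Made-up sample data standing in for real parser output.
sample = [
    Span("Chapter 1 ", "Times-Roman", 10.0),
    Span("Invoice ", "ABCDEF+Helvetica-Condensed", 12.0),
    Span("Number", "Helvetica-Condensed", 12.0),
    Span(" is shown in the header.", "Times-Roman", 10.0),
    Span("Total", "Helvetica-Condensed", 12.0),
    Span("Footnote", "Helvetica-Condensed", 9.0),
]
fields = extract_fields(sample)
print(fields)  # ['Invoice Number', 'Total']
```

Each string in `fields` would then go into its database field. The size comparison uses a small tolerance because sizes may not come back as exact whole numbers.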

skamradt
A: 

You can use PilotEdit to extract strings matching a regular expression. Check this picture to see whether PilotEdit is what you need:

http://www.pilotedit.com/uploads/PilotEditCopyStringsMatchingRegularExpressionToTheClipboard.JPG
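For comparison, the same kind of regular-expression extraction can be sketched in a few lines of Python. The pattern and the sample text below are invented for illustration. Note that a regular expression only sees the characters, not the font, so this helps only if the wanted text also follows a recognizable textual pattern.

```python
import re

# Hypothetical pattern: codes made of three capitals, a hyphen, four digits.
PATTERN = re.compile(r"\b[A-Z]{3}-\d{4}\b")


def find_matches(text: str) -> list:
    """Return every non-overlapping match of PATTERN, in order."""
    return PATTERN.findall(text)


# Made-up sample text.
sample_text = "See REF-2041 and INV-0007; ignore ref-1 and ABCD-1234."
matches = find_matches(sample_text)
print(matches)  # ['REF-2041', 'INV-0007']
```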

Dracoder