Is it possible to search for words in PDF files with Delphi?

I have code that can search many other file types (exe, dll, txt), but it doesn't work with PDF files.

+2  A: 

It depends on the structure of the specific PDF.

If the PDF is made of images (scanned pages), then you have to OCR each image and build a full-text index inside the PDF. (To see whether it's image-based, open it in Notepad and look for obj tags full of random characters.) There are a few utilities and apps that do this kind of work for you; CVision PDF Compressor is one that I have used before.

If the PDF is a standard text-based PDF, then you may be able to open it like any other text file and search for the words, provided its content streams are not compressed or encrypted.
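To tell the two cases apart programmatically rather than by eye, a rough sketch follows. It is in Python only for brevity (the same byte checks port directly to Delphi), and it uses a cruder heuristic than the Notepad inspection above: it assumes a scanned PDF contains image objects but no font resources. The function name `looks_image_only` is hypothetical. The check can misreport on PDFs that store their dictionaries in compressed object streams, and on scanned PDFs that already carry an OCR text layer.

```python
import re


def looks_image_only(pdf_bytes: bytes) -> bool:
    """Heuristic guess: image XObjects present, but no font resources.

    Assumption: the object dictionaries are stored uncompressed, so the
    /Subtype /Image and /Font names are visible in the raw bytes.
    """
    has_images = re.search(rb"/Subtype\s*/Image", pdf_bytes) is not None
    has_fonts = b"/Font" in pdf_bytes
    return has_images and not has_fonts


# Hand-built fragments for illustration only (not complete PDF files).
scanned_like = b"5 0 obj\n<< /Type /XObject /Subtype /Image /Width 100 >>\nendobj\n"
text_like = b"3 0 obj\n<< /Type /Page /Resources << /Font << /F1 6 0 R >> >> >>\nendobj\n"

scanned_guess = looks_image_only(scanned_like)
text_guess = looks_image_only(text_like)
```

A result of True only suggests that OCR is probably needed; it is not proof either way.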

Here is a page that details some of the structure of a PDF. This is an SO post on the same topic.

StingyJack
-1 for being a tool.
alamodey
Don't be so salty, just cos your answer was vague. I was nice enough to explain why I gave you a downvote, so you could either revise it (and get the DV removed) or remove it of your own volition. --- Seriously, you need to not go around adding snide comments to every post of mine.
StingyJack
A: 

PDF is not just a binary representation. Think of it as a tree of objects, where an object node has some metadata and some content information. Some of these objects have string data, some don't. Some of these are even encrypted, and some are compressed. So, there's very little chance your string finder will work on any arbitrary PDF.
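To illustrate the compression point, here is a minimal sketch in Python (chosen only for brevity; Delphi's zlib units could do the same job). It hand-builds a fragment of a PDF object whose stream uses the FlateDecode filter, shows that a plain byte search misses the word, and then finds it after inflating the stream. The function name `find_in_streams` and the sample fragment are made up for this example. Real PDFs add further obstacles that this does not handle: encryption, other filters, object streams, and font-specific string encodings.

```python
import re
import zlib

# Captures the raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def find_in_streams(pdf_bytes: bytes, needle: bytes) -> bool:
    """Return True if needle occurs in any stream, inflating where possible."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # FlateDecode data is zlib data; decompressobj tolerates the
            # end-of-line bytes that trail the compressed payload.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # not Flate-compressed: search the bytes as they are
        if needle in data:
            return True
    return False


# Hypothetical fragment of one PDF object (not a complete file).
content = b"BT /F1 12 Tf (Hello Delphi) Tj ET"
packed = zlib.compress(content)
fragment = (
    b"4 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(packed)
    + packed
    + b"\nendstream\nendobj\n"
)

naive_hit = b"Delphi" in fragment                   # what a generic string finder does
stream_hit = find_in_streams(fragment, b"Delphi")   # search after inflating
```

Here the naive search fails while the stream-aware search succeeds, which is the situation a generic exe/dll/txt string finder runs into on most PDFs.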

dirkgently
+2  A: 

The components/libraries mentioned in the answer to this question should do what you need.

Craig Stuntz
+1  A: 

I'm currently working on a project that does this. The method I use is to convert the PDF file to plain text (with pdftotext.exe) and build an index on the resulting text. We do the same with Word and other Office files; it works pretty well!
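The convert-then-index approach can be sketched roughly as follows. This is not the project's actual code: it is a Python illustration (from Delphi you would launch pdftotext.exe with CreateProcess or similar), and the names `pdf_to_text` and `build_index` are hypothetical. It assumes a pdftotext binary from Xpdf or Poppler is on the PATH, and that passing `-` as the output file makes it write to standard output.

```python
import re
import subprocess
from collections import defaultdict


def pdf_to_text(pdf_path: str) -> str:
    """Run the external pdftotext tool and return the extracted text.

    Assumption: pdftotext is installed and reachable via PATH.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,  # raise CalledProcessError if the conversion fails
    )
    return result.stdout.decode("utf-8", errors="replace")


def build_index(text: str) -> dict:
    """Map each lowercased word to the 1-based line numbers containing it."""
    index = defaultdict(list)
    for line_no, line in enumerate(text.splitlines(), start=1):
        # set() so a word repeated on one line is recorded only once
        for word in set(re.findall(r"\w+", line.lower())):
            index[word].append(line_no)
    return dict(index)


# The indexing step works on any plain text, so it can be tried without a PDF.
sample = "Searching PDF files\nwith Delphi\nPDF text extraction"
index = build_index(sample)
```

For a real file you would call `build_index(pdf_to_text("some.pdf"))` and look words up in the resulting dictionary.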

Searching directly in PDF files from Delphi (without an external app) is more difficult, I think. If you find anything, please update here, as I would also be very interested in that!

birger
+1  A: 

One option I have used is Microsoft's IFilter technology, which is used by Windows Desktop Search and many other products such as SharePoint and SQL Server full-text search.

It supports almost any office/office-like file format, even dwg, msg, pdf, and files in zip/rar archives.

The easiest way to use it is to run FiltDump.exe on any files you have, and index the text output.

To find out which filters are installed on your PC, you can use IFilter Explorer. Wikipedia has some links on its IFilter page.

Osama ALASSIRY
+1  A: 

Quick PDF Library's GetPageText function can give you the words from a PDF as well as the page number and the co-ordinates of those words - sometimes useful for highlighting.

Bing