Hi,

I'm developing a tool that searches a given site for a keyword entered by the user. My problem is that it only searches the HTML/web pages, not the PDF or MS Word files found on the site.

Can anyone suggest an API or tool, or provide code, that can search the text of an online PDF, MS Word, or plain-text file?

A: 

You could probably use Antiword for Word files.

pdftotext can be used for PDF files.

Both commands are available through apt: sudo apt-get install xpdf-utils antiword
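
As a rough illustration of how the two commands could be wired into a keyword search, here is a minimal Python sketch. It assumes pdftotext and antiword are installed and on the PATH, that the file is already on local disk, and that Antiword is only given legacy .doc files. The function names (extractor_command, extract_text, contains_keyword) are made up for this example and are not part of either tool.

```python
import subprocess
from pathlib import Path


def extractor_command(path):
    """Return the command that prints the file's text to stdout, or None.

    Hypothetical helper: the mapping from extension to tool is an assumption
    of this sketch.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        # A trailing "-" tells pdftotext to write the text to stdout.
        return ["pdftotext", str(path), "-"]
    if suffix == ".doc":
        # antiword prints the extracted text to stdout by default.
        return ["antiword", str(path)]
    return None  # treated as plain text by extract_text


def extract_text(path):
    """Extract text with the external tool, or read the file directly."""
    command = extractor_command(path)
    if command is None:
        return Path(path).read_text(encoding="utf-8", errors="replace")
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def contains_keyword(text, keyword):
    """Case-insensitive substring match."""
    return keyword.casefold() in text.casefold()
```

Usage would be along the lines of contains_keyword(extract_text("report.pdf"), "budget"), where report.pdf is a placeholder file name.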

aioobe
But I don't want to download every file; I want to download only those files that contain the keyword. That is, I need to search the PDFs online and download a PDF only if it contains the keyword the user is searching for.
Saubhagya
Wow... do you honestly think you can search for a keyword in a file without downloading the file? The actual search would then obviously need to be done on the server.
aioobe
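
To make the point in the comments concrete: unless the server does the search, the client has to fetch the bytes first, but it can fetch them into a temporary file and keep the file only on a match. The Python sketch below shows one way that might look. The names fetch_if_matches and local_name are hypothetical, and extract_text is any caller-supplied function that turns a local file path into text (for example, a wrapper around pdftotext).

```python
import os
import shutil
import tempfile
import urllib.request
from urllib.parse import urlparse


def local_name(url):
    """File name taken from the URL path, e.g. 'report.pdf'."""
    return os.path.basename(urlparse(url).path) or "download"


def fetch_if_matches(url, keyword, extract_text, dest_dir,
                     open_url=urllib.request.urlopen):
    """Download url to a temp file; keep it in dest_dir only on a match.

    Returns the final path if the keyword was found, otherwise None.
    open_url is injectable so the function can be tried without a network.
    """
    name = local_name(url)
    suffix = os.path.splitext(name)[1]
    # The file still has to be transferred; it is just not kept by default.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        with open_url(url) as response:
            shutil.copyfileobj(response, tmp)
        tmp_path = tmp.name
    try:
        text = extract_text(tmp_path)
        if keyword.casefold() in text.casefold():
            final_path = os.path.join(dest_dir, name)
            shutil.move(tmp_path, final_path)
            return final_path
        return None
    finally:
        # Remove the temp file if it was not moved into dest_dir.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

The bandwidth cost is the same as a plain download; the only saving is that non-matching files never end up in the destination directory.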
A: 

If you are developing in anything that runs on the JVM, you would probably do best with Apache POI for parsing MS Office documents and PDFBox, JPedal, or PDF Clown for parsing PDFs.

For general indexing, you won't go wrong with Lucene and Nutch.

Tomislav Nakic-Alfirevic