ansaurus

Question

What is the best approach to implementing search over documents (PDF, XML, HTML, MS Word)?

Answer 1

+4 A:

Why re-invent the wheel?

Also, search Stack Overflow for "full text search" and you'll find a lot of other very similar questions. Here's another one, for example: http://stackoverflow.com/questions/34314/how-do-i-implement-search-functionality-in-a-website

womp 2009-05-06 21:09:21

Answer 2

+2 A:

You could use Solr, which sits on top of Lucene and is a full web search engine application, whereas Lucene itself is a library. However, neither Solr nor Lucene parses Word documents, PDFs, etc. to extract metadata. You have to index each document according to a pre-defined document schema.

2009-05-07 00:48:58

Tika, another project in the Lucene family, handles extracting metadata and semantic content from documents in various formats.

grigory 2009-09-17 14:10:09

Answer 3

+2 A:

As for extracting the text content of Office documents (which you need to do before giving it to Lucene), there is the Apache Tika project, which supports quite a few file formats, including Microsoft's.
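To make the "extract first, then index" order concrete, here is a minimal sketch. It assumes the text has already been pulled out of the file (by Tika, for example) and uses a toy in-memory inverted index as a stand-in for Lucene. The TinyIndex class and its add and search methods are made up for illustration and are not part of Lucene, Solr, or Tika.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy stand-in for a real indexer such as Lucene; not Lucene's API.
final class TinyIndex {
    // term -> ids of the documents whose extracted text contains that term
    private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();

    // Index text that has ALREADY been extracted from the original file;
    // this class never sees PDF or Word bytes.
    public void add(String docId, String extractedText) {
        for (String term : tokenize(extractedText)) {
            postings.computeIfAbsent(term, t -> new TreeSet<String>()).add(docId);
        }
    }

    // Return the ids of documents that contain every term in the query (AND semantics).
    public Set<String> search(String query) {
        Set<String> result = null;
        for (String term : tokenize(query)) {
            Set<String> docs = postings.getOrDefault(term, Collections.<String>emptySet());
            if (result == null) {
                result = new TreeSet<String>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }

    // Lowercase the text and split on anything that is not a letter or digit.
    private static Set<String> tokenize(String text) {
        Set<String> terms = new TreeSet<String>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        // In a real setup these strings would come from the extraction step.
        index.add("report.pdf", "Quarterly sales report for the search team");
        index.add("notes.doc", "Meeting notes about full text search");
        System.out.println(index.search("search"));    // [notes.doc, report.pdf]
        System.out.println(index.search("full text")); // [notes.doc]
    }
}
```

A real index adds ranking, stemming, on-disk storage and much more, which is the reason to reach for Lucene or Solr rather than rolling your own. The point here is only that the indexer works on plain text, so extraction has to happen first.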

Thilo 2009-05-07 09:32:59

Answer 4

+1 A:

Using Tika, the code to get the text from a file is quite simple:

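import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

// exception handling not shown; `file` is the java.io.File to read
Parser parser = new AutoDetectParser();        // picks a parser based on the detected format
StringWriter textBuffer = new StringWriter();  // collects the extracted plain text
InputStream input = new FileInputStream(file);
Metadata md = new Metadata();
// the file name is passed as a hint for format detection
md.set(Metadata.RESOURCE_NAME_KEY, file.getName());
parser.parse(input, new BodyContentHandler(textBuffer), md);
input.close();
String text = textBuffer.toString();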

So far, Tika 0.3 seems to work great. Just throw any file at it and it will give you back what makes the most sense for that format. I have been able to get indexable text out of everything I've thrown at it so far, including PDFs and the new MS Office files. If there are problems with some formats, I believe they mainly concern extracting formatted text rather than raw plain text.

Jegschemesch 2009-05-23 12:06:49
