What would be a good way to implement search functionality for documents in a Java web application?

Is 'tagged search' a good fit for this kind of search functionality?

+4  A: 

Why re-invent the wheel?

Check out Apache Lucene.

Also, search Stack Overflow for "full text search" and you'll find a lot of other very similar questions. Here's another one, for example: http://stackoverflow.com/questions/34314/how-do-i-implement-search-functionality-in-a-website
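
To give a rough feel for what a full-text library like Lucene does for you, here is a toy inverted index in plain Java. It is only a sketch: the class name ToyInvertedIndex and its methods are made up for illustration and are not Lucene's API, and a real library adds proper analysis, ranking, and on-disk storage on top of this idea.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each lower-cased term to the ids of the documents containing it.
// Hypothetical illustration only; this is not Lucene's API.
class ToyInvertedIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Split on anything that is not a letter or digit; a real analyzer does far more.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    // Registers every term of the text under the given document id.
    void add(String docId, String text) {
        for (String term : tokenize(text)) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Returns the ids of documents that contain every term of the query (AND semantics).
    Set<String> search(String query) {
        Set<String> result = null;
        for (String term : tokenize(query)) {
            if (term.isEmpty()) continue;
            Set<String> docs = postings.getOrDefault(term, Collections.<String>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? Collections.<String>emptySet() : result;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("report.pdf", "Quarterly sales report for the Java team");
        index.add("notes.doc", "Meeting notes about the search feature");
        System.out.println(index.search("java report")); // prints [report.pdf]
        System.out.println(index.search("search"));      // prints [notes.doc]
    }
}
```

Everything beyond this (stemming, relevance scoring, phrase queries, incremental updates) is the wheel you would otherwise be re-inventing.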

womp
+2  A: 

You could use Solr, which sits on top of Lucene and is an actual search engine web application, while Lucene is a library. However, neither Solr nor Lucene parses Word documents, PDFs, etc. to extract metadata. You need to index each document according to a pre-defined document schema.
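
As a rough sketch of what "a pre-defined document schema" means in practice, the plain-Java snippet below only accepts documents whose fields are declared up front. The field names (id, title, author, tags, body) and the SchemaSketch class are assumptions made up for this example; this is not Solr's configuration format or API, just the idea behind it.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical schema check; field names are invented and this is not Solr's API.
class SchemaSketch {
    // The pre-defined schema: the only fields a document may carry.
    static final Set<String> FIELDS =
            new LinkedHashSet<>(Arrays.asList("id", "title", "author", "tags", "body"));

    // Builds an indexable document, rejecting unknown fields and a missing id.
    static Map<String, String> toIndexable(Map<String, String> raw) {
        for (String key : raw.keySet()) {
            if (!FIELDS.contains(key)) {
                throw new IllegalArgumentException("field not in schema: " + key);
            }
        }
        if (!raw.containsKey("id")) {
            throw new IllegalArgumentException("missing required field: id");
        }
        Map<String, String> doc = new LinkedHashMap<>();
        for (String field : FIELDS) { // copy in schema order
            String value = raw.get(field);
            if (value != null) {
                doc.put(field, value.trim());
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("title", " Sales report ");
        raw.put("id", "42");
        raw.put("tags", "sales,2009");
        // prints {id=42, title=Sales report, tags=sales,2009}
        System.out.println(toIndexable(raw));
    }
}
```

A tags field like the one here is also one way a 'tagged search' could live alongside full-text search on the body field.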

Tika, another project in the Lucene family, handles extracting metadata and text content from documents in various formats.
grigory
+2  A: 

As for extracting the text content of Office documents (which you need to do before giving it to Lucene), there is the Apache Tika project, which supports quite a few file formats, including Microsoft's.

Thilo
+1  A: 

Using Tika, the code to get the text from a file is quite simple:

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

// exception handling not shown; `file` is a java.io.File
Parser parser = new AutoDetectParser();
StringWriter textBuffer = new StringWriter();
InputStream input = new FileInputStream(file);
Metadata md = new Metadata();
// the file name is a hint that helps Tika detect the format
md.set(Metadata.RESOURCE_NAME_KEY, file.getName());
// BodyContentHandler writes the extracted plain text into textBuffer
parser.parse(input, new BodyContentHandler(textBuffer), md);
input.close();
String text = textBuffer.toString();

So far, Tika 0.3 seems to work great. Just throw any file at it and it will give you back what makes the most sense for that format. I have been able to get indexable text out of everything I've thrown at it so far, including PDFs and the new MS Office files. If there are problems with some formats, I believe they mainly concern extracting formatted text rather than raw plain text.

Jegschemesch