What would be a good way to implement document search in a Java web application?
Is 'tagged search' a good fit for this kind of search functionality?
Why re-invent the wheel?
Check out Apache Lucene.
Also, search Stack Overflow for "full text search" and you'll find a lot of other very similar questions. Here's another one, for example: http://stackoverflow.com/questions/34314/how-do-i-implement-search-functionality-in-a-website
You could use Solr, which sits on top of Lucene and is a complete search server application, whereas Lucene is a library. However, neither Solr nor Lucene parses Word documents, PDFs, etc. to extract text or metadata on its own. You have to extract that information yourself and then index each document according to a pre-defined document schema.
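To make "indexing against a pre-defined schema" a bit more concrete, here is a minimal, standard-library-only sketch of an inverted index with two fields, `content` and `tag`. It is only an illustration of the idea and of how a tag field could sit next to full-text search. It is not the Lucene or Solr API, and the class and method names (`TinyIndex`, `add`, `search`) are made up for this example.

```java
import java.util.*;

// Hypothetical illustration only: a tiny in-memory inverted index with two
// schema fields ("content" and "tag"). Lucene and Solr do far more than this.
public class TinyIndex {
    // field name -> term -> ids of the documents containing that term
    private final Map<String, Map<String, Set<String>>> postings = new HashMap<>();

    // Index one document: free text is split into tokens, tags are kept whole.
    public void add(String docId, String content, Collection<String> tags) {
        for (String token : content.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                put("content", token, docId);
            }
        }
        for (String tag : tags) {
            put("tag", tag.toLowerCase(Locale.ROOT), docId);
        }
    }

    private void put(String field, String term, String docId) {
        postings.computeIfAbsent(field, f -> new HashMap<>())
                .computeIfAbsent(term, t -> new TreeSet<>())
                .add(docId);
    }

    // Return the ids of the documents whose given field contains the term.
    public Set<String> search(String field, String term) {
        return postings.getOrDefault(field, Collections.emptyMap())
                .getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("report.pdf", "Quarterly sales report", Arrays.asList("finance", "2009"));
        index.add("notes.doc", "Meeting notes about sales targets", Arrays.asList("meetings"));

        System.out.println(index.search("content", "sales")); // [notes.doc, report.pdf]
        System.out.println(index.search("tag", "finance"));   // [report.pdf]
    }
}
```

In a real application the text passed to `add` would come from a separate extraction step, and a library such as Lucene would replace this class entirely; the sketch only shows why the fields have to be decided before indexing.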
As for extracting the text content of Office documents (which you need to do before giving it to Lucene), there is the Apache Tika project, which supports quite a few file formats, including Microsoft's.
Using Tika, the code to get the text from a file is quite simple:
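import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

// exception handling not shown; `file` is the java.io.File to extract text from
Parser parser = new AutoDetectParser();
StringWriter textBuffer = new StringWriter();
InputStream input = new FileInputStream(file);
Metadata md = new Metadata();
// passing the file name gives Tika a hint for detecting the document type
md.set(Metadata.RESOURCE_NAME_KEY, file.getName());
// collect the plain-text body of the document into textBuffer
parser.parse(input, new BodyContentHandler(textBuffer), md);
input.close();
String text = textBuffer.toString();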
So far, Tika 0.3 seems to work great. Just throw any file at it and it will give you back what makes the most sense for that format. I have been able to get indexable text out of everything I've tried so far, including PDFs and the new MS Office files. If there are problems with some formats, I believe they mainly concern extracting formatted text rather than raw plain text.