Also, I want to know how to add metadata while indexing so that I can boost some parameters.

A: 

Lucene indexes text, not files. You'll need some other process to extract the text from the file, and then run Lucene over that text.

Michael Shimmins
+2  A: 

There are several frameworks for extracting text suitable for Lucene indexing from rich text files (pdf, ppt etc.)

  • One of them is Apache Tika, which started as a sub-project of Lucene and is now a top-level Apache project.
  • Apache POI is another Apache project; it reads and writes Microsoft Office formats directly.
  • There are also some commercial alternatives.
Yuval F
A: 

You can use Apache Tika. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Supported Document Formats

  • HyperText Markup Language
  • XML and derived formats
  • Microsoft Office document formats
  • OpenDocument Format
  • Portable Document Format
  • Electronic Publication Format
  • Rich Text Format
  • Compression and packaging formats
  • Text formats
  • Audio formats
  • Image formats
  • Video formats
  • Java class files and archives
  • The mbox format

The code will look like this:

    Reader reader = new Tika().parse(stream);
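As for the metadata part of the question: in Lucene, each metadata value would normally go into its own field on the document, and the boost is then applied per field. The exact boost API differs between Lucene versions, so the sketch below deliberately uses only the JDK and is not Lucene code. It only illustrates the idea: drain the Reader into text, keep the metadata in separate fields, and weight matches by field. The class FieldBoostSketch, the field names, the sample strings, and the scoring formula are all assumptions made up for illustration; a StringReader stands in for the Reader that Tika would return.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy model of per-field boosting; not Lucene's API or scoring formula.
class FieldBoostSketch {

    /** Drains a Reader (such as the one returned by Tika's parse) into a String. */
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        try {
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Counts whole-word, case-insensitive occurrences of term in text. */
    static int termFrequency(String text, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.equals(needle)) {
                count++;
            }
        }
        return count;
    }

    /** Toy score: sum over fields of (boost * term frequency); unlisted fields get boost 1.0. */
    static double score(Map<String, String> fields, Map<String, Double> boosts, String term) {
        double total = 0.0;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            double boost = boosts.getOrDefault(e.getKey(), 1.0);
            total += boost * termFrequency(e.getValue(), term);
        }
        return total;
    }

    public static void main(String[] args) {
        // Stand-in for: Reader reader = new Tika().parse(stream);
        Reader reader = new StringReader("Lucene indexes text. Tika extracts text from files.");

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "Extracting text with Tika"); // hypothetical metadata value
        fields.put("content", readAll(reader));           // extracted body text

        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("title", 3.0); // a match in the title counts three times as much

        System.out.println(score(fields, boosts, "tika")); // prints 4.0
    }
}
```

Here the term matches once in the title (weighted 3.0) and once in the content (weighted 1.0). With real Lucene you would get the same effect by indexing title and content as separate fields and boosting the title field, either at query time or through whatever mechanism your Lucene version provides.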

Sergey