Also, I want to know how to add metadata while indexing so that I can boost some parameters.
views: 222
answers: 3
A:
Lucene indexes text, not files. You'll need some other process to extract the text from the file, and then run Lucene over that text.
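To make the two-step split concrete, here is a minimal sketch that uses only the Java standard library. It is not Lucene code: the class ExtractThenIndexSketch, the extractText helper (which just strips tags from a string), and the toy inverted index are all assumptions made up for illustration. In a real setup the extraction step would be done by a parser library and the indexing step by Lucene.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Stand-in for the two-step pipeline: extract text first, then index it.
// This is NOT Lucene code; all names here are hypothetical.
class ExtractThenIndexSketch {

    // Token -> ids of the documents that contain it (a toy inverted index).
    private final Map<String, List<String>> postings = new HashMap<>();

    // Step 1 (hypothetical extractor): strip markup so only plain text remains.
    // For PDF, PPT and similar formats a parser library would do this work.
    static String extractText(String rawFileContent) {
        return rawFileContent.replaceAll("<[^>]*>", " ").trim();
    }

    // Step 2: index the extracted text, never the file itself.
    void index(String docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) continue;
            List<String> ids = postings.computeIfAbsent(token, k -> new ArrayList<>());
            if (!ids.contains(docId)) ids.add(docId);
        }
    }

    // Returns the ids of all documents whose text contains the term.
    List<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new ArrayList<>());
    }

    public static void main(String[] args) {
        ExtractThenIndexSketch idx = new ExtractThenIndexSketch();
        idx.index("a.html", extractText("<p>Lucene indexes text</p>"));
        idx.index("b.html", extractText("<p>Tika extracts text</p>"));
        System.out.println(idx.search("text"));   // [a.html, b.html]
        System.out.println(idx.search("lucene")); // [a.html]
    }
}
```

The point of the split is that the index only ever sees strings, so the extractor can be swapped per file type without touching the indexing code.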
Michael Shimmins
2010-04-06 06:11:35
+2
A:
There are several frameworks for extracting text suitable for Lucene indexing from rich document formats (PDF, PPT, etc.):
- One of them is Apache Tika, which was a sub-project of Lucene and is now a top-level Apache project.
- Apache POI is another Apache project; it reads and writes Microsoft Office formats.
- There are also some commercial alternatives.
Yuval F
2010-04-06 07:56:58
A:
You can use Apache Tika. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
Supported Document Formats
- HyperText Markup Language
- XML and derived formats
- Microsoft Office document formats
- OpenDocument Format
- Portable Document Format
- Electronic Publication Format
- Rich Text Format
- Compression and packaging formats
- Text formats
- Audio formats
- Image formats
- Video formats
- Java class files and archives
- The mbox format
The code will look like this, where stream is an InputStream:
Reader reader = new Tika().parse(stream);
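On the boosting part of the question: once metadata (such as a title) and body text have been extracted, the usual idea is to keep them as separate named fields so that each one can be weighted differently. The sketch below shows that idea with the Java standard library only. It is not the Lucene API; the class FieldBoostSketch, the field names "title" and "body", and the scoring formula are assumptions chosen for illustration. In Lucene the extracted values would go into separate fields of a document, and how boosts are applied depends on the Lucene version.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Toy illustration of field-level boosting; NOT the Lucene API.
class FieldBoostSketch {

    // Counts how often `term` occurs as a whole word in `text` (case-insensitive).
    static int termFrequency(String text, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.equals(wanted)) count++;
        }
        return count;
    }

    // Score = sum over fields of (term frequency in the field * the field's boost).
    // Fields without a configured boost count with weight 1.0.
    static double score(Map<String, String> fields, Map<String, Double> boosts, String term) {
        double total = 0.0;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            double boost = boosts.getOrDefault(field.getKey(), 1.0);
            total += boost * termFrequency(field.getValue(), term);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical fields: "title" stands for extracted metadata, "body" for extracted text.
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "Lucene in Action");
        doc.put("body", "This book covers Lucene indexing and Lucene search.");

        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("title", 3.0); // a title match counts three times as much

        // title: 1 match * 3.0, body: 2 matches * 1.0
        System.out.println(score(doc, boosts, "lucene")); // 5.0
    }
}
```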
Sergey
2010-04-16 14:04:38