Hi, I have a document structure in which each line of text has some metadata associated with it. The search results must show each matching line together with its metadata.

Currently I store each line as its own Lucene Document and keep the metadata in a stored, non-indexed field. That is, I create and add one Lucene Document per line. My concern is that I may end up with too many Documents in the index.

Is there a more elegant approach?

Thanks

+1  A: 

How many is "too many"? Lucene has been known to handle hundreds of millions of records in a single index, so I doubt that you should have a problem. That being said, there's no substitute for testing and benchmarking yourself to see if this approach is good for your needs.

bajafresh4life
+1  A: 

Personally I'd index the documents as normal, and figure out the metadata / line number later.
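A minimal sketch of the "figure it out later" step, in plain Java with no Lucene calls. It assumes you can get the character offset of a match in the original text (for example from term offsets, if you chose to store them at index time) and that you keep one metadata entry per line. The class name `LineIndex` and its methods are hypothetical, made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: maps a character offset in a document back to its line.
public class LineIndex {
    private final int[] lineStarts;      // offset at which each line begins
    private final List<String> metadata; // assumed: one metadata entry per line

    public LineIndex(String text, List<String> metadata) {
        List<Integer> starts = new ArrayList<>();
        starts.add(0); // the first line always starts at offset 0
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                starts.add(i + 1); // the next line starts right after the newline
            }
        }
        this.lineStarts = starts.stream().mapToInt(Integer::intValue).toArray();
        this.metadata = metadata;
    }

    // Zero-based line number containing the given non-negative offset.
    public int lineOf(int offset) {
        int pos = Arrays.binarySearch(lineStarts, offset);
        // Exact hit: offset is a line start. Otherwise binarySearch returns
        // -(insertionPoint) - 1, and the line is the one before that point.
        return pos >= 0 ? pos : -pos - 2;
    }

    public String metadataAt(int offset) {
        return metadata.get(lineOf(offset));
    }
}
```

With something like this, the index holds one Document per file, and the line number and metadata are looked up only for the hits you actually display.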

There is no question that Lucene can cope with that many documents, but splitting the text this way may degrade the search results somewhat. For example, you can normally search for multiple terms in close proximity to each other, but this obviously won't work when the terms are split across multiple documents (lines).
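To make the problem concrete, here is a toy simulation in plain Java. It does not use Lucene and is not how Lucene matches queries internally. It only illustrates that an "all terms must be present" query is evaluated per indexed unit, so the choice of unit (whole file or single line) changes the result. The class `SplitDemo` and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model: a "unit" is whatever you index as one document.
public class SplitDemo {
    // Lowercased word tokens of one unit.
    public static Set<String> tokens(String unit) {
        return new HashSet<>(Arrays.asList(unit.toLowerCase(Locale.ROOT).split("\\W+")));
    }

    // True if this single unit contains every term (an AND query).
    public static boolean matchesAll(String unit, String... terms) {
        Set<String> present = tokens(unit);
        for (String term : terms) {
            if (!present.contains(term.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    // True if at least one unit satisfies the AND query on its own.
    public static boolean anyUnitMatches(List<String> units, String... terms) {
        for (String unit : units) {
            if (matchesAll(unit, terms)) {
                return true;
            }
        }
        return false;
    }
}
```

Given a file whose first line contains "this" and whose second line contains "that", `anyUnitMatches` finds a match when the whole file is one unit, and finds none when each line is its own unit.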

Kragen
You are correct. I tried creating multiple documents, one per line, with the metadata stored as part of the index. That did not work well, as the queries started to produce unacceptable results. For example, a query for "This" and "That" would fail because 'this' and 'that' might both exist in the file but sit in two different Lucene docs. And span queries were simply out of the question. So you are right: indexing the documents as normal and figuring out the metadata / line number later is the right approach.