For a discussion forum, does it work better to index each entry inside a discussion thread as a separate Lucene document, or to simply concatenate all entries within a discussion into one big block of text and index the whole thread as a single Lucene document?

A: 

If you concatenate all entries within a discussion, you run into the problem that you cannot pinpoint the exact entry you want to retrieve.

Lucene should be able to quickly index and search each entry (post/thread/whatever). Mashing them all together just seems like overkill.

Ryan Ternier
With your suggestion the problem comes about when you want to display search results. You could potentially get 20 entries for the same discussion thread in your search result because all entries simply included the same word. Can you see what I mean?
Am
True, it could return multiple entries, but you still have control over what is ultimately displayed, and you should be able to select a single entry per thread.
Ryan Ternier
+1  A: 

It depends on what kind of search capabilities you are looking for. For example, if you want users to be able to search for keywords that occurred in threads on some particular date, then you must index all entries as separate documents with a date (as a NumericField, searchable using a NumericRangeFilter).

Indexing every entry as a separate document will also let you score each entry using the Lucene scorers, which will help in retrieving the most relevant entries (and not threads) in response to a query. Additionally, you can add the thread topic as a separate field to each entry-document (at the cost of a little more space).

Concatenating all entries is not a good idea if you want to point the user to the exact entry of interest. As to your concern (in the comment on Ryan's answer) about returning multiple entries from the same thread, you can add a thread id to each entry while indexing. Then, when displaying results, you can show only one entry per thread id (for example, the entry with the highest score, displayed along with the thread topic).
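
To make the last point concrete, here is a minimal sketch of the display-time step in plain Java. It does not use the Lucene API: `Hit` is a hypothetical stand-in for a search hit whose thread id and score you have already read from the stored fields, and `ThreadCollapser` and `collapseByThread` are names made up for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ThreadCollapser {

    /** Hypothetical stand-in for one search hit (not a Lucene class). */
    static final class Hit {
        final String threadId; // thread id stored with the entry at index time
        final String entryId;  // id of the individual post
        final float score;     // relevance score reported by the search

        Hit(String threadId, String entryId, float score) {
            this.threadId = threadId;
            this.entryId = entryId;
            this.score = score;
        }
    }

    /**
     * Keeps only the highest-scoring hit for each thread id and returns
     * the survivors ordered by descending score.
     */
    static List<Hit> collapseByThread(List<Hit> hits) {
        Map<String, Hit> bestPerThread = new LinkedHashMap<>();
        for (Hit hit : hits) {
            Hit current = bestPerThread.get(hit.threadId);
            if (current == null || hit.score > current.score) {
                bestPerThread.put(hit.threadId, hit);
            }
        }
        List<Hit> result = new ArrayList<>(bestPerThread.values());
        result.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return result;
    }
}
```

Since collapsing shrinks the result list, you would probably need to fetch more hits than one page displays so that enough distinct threads remain after this step.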

athena
A: 

If you decide to index them separately, you can use Solr, which is about to support search result collapsing:

http://www.lucidimagination.com/blog/2010/09/16/2446/

bajafresh4life
A: 

I would prefer to index each entry separately. It makes the design more flexible, since your system should already have some kind of topic entity to group the entries in the same thread. Another issue with concatenation is that the whole thread would need to be re-indexed every time a new entry is posted, which has a performance impact.

Sheng Chien