I have a StackOverflow-like system where content is organised into threads, each thread having content of its own (the question body / text), and posts / replies.

I'm adding the ability to search this content via Lucene, and if possible I would like to index individual posts rather than entire threads (it makes the index easier to update, and gives me more control to tweak the results). The problem, however, is that I want the search to display a list of threads, rather than a list of posts.

How can I get Lucene to return only unique threads as results, while also searching the content of the posts?

+1  A: 

Each document can have a "threadId" field. After running a search, you can loop through your result set and return all the unique threadIds.

The tricky part is deciding how many results to request. If you want to show, say, 10 results on your results page, you'll probably need Lucene to return 10 + m results, since some of the returned posts will belong to the same thread and be de-duped out. You'll need some extra logic that runs another Lucene search if the de-duped set contains fewer than 10 threads.

This is what the Nutch project does when collapsing multiple search results that belong to the same domain.
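
The over-fetch-and-dedupe loop described above could be sketched roughly as follows. This is only an illustration, not Lucene code: `PostSearcher` is a hypothetical stand-in for whatever runs the Lucene query and returns the "threadId" of each of the top n posts (best first), and the doubling factor is an arbitrary guess rather than a tuned value.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a Lucene search: thread ids of the top n posts, best first. */
interface PostSearcher {
    List<String> topThreadIds(int n);
}

final class ThreadDeduper {
    /** Returns up to `wanted` distinct thread ids, ordered by their best-ranked post. */
    static List<String> uniqueThreads(PostSearcher searcher, int wanted) {
        if (wanted <= 0) {
            return new ArrayList<>();
        }
        int fetch = wanted * 2; // initial over-fetch ("10 + m"); the factor is an assumption
        while (true) {
            List<String> hits = searcher.topThreadIds(fetch);
            // LinkedHashSet drops duplicates while keeping rank order.
            Set<String> seen = new LinkedHashSet<>();
            for (String threadId : hits) {
                seen.add(threadId);
                if (seen.size() == wanted) {
                    break;
                }
            }
            // Stop when we have enough threads, or the search has no more hits to give.
            if (seen.size() == wanted || hits.size() < fetch) {
                return new ArrayList<>(seen);
            }
            fetch *= 2; // too many duplicates: search again with a larger window
        }
    }
}
```

Re-running the search from the top each time is the simplest approach; it costs more work when many posts share a thread, but keeps the ranking consistent.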

bajafresh4life
+1  A: 

When you index the threads, break each thread into posts and make each post a Document with a field containing a unique id identifying the thread to which it belongs.

For the search implementation, I would recommend using Lucene 2.9 or later, which lets you use a Collector. A Collector lets you process the retrieved documents as they are collected, so you can group together posts that originate from the same thread id.
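
To make the grouping idea concrete, here is a minimal Lucene-free sketch of the bookkeeping such a collector might do: keep only the best score seen per thread, then rank threads by that score. The class name and the `collect(threadId, score)` signature are made up for illustration. Lucene's own `Collector.collect(int doc)` receives only a document number, so the thread id and score would have to be looked up separately, and that wiring is left out here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: mimics the per-hit callback of a collector, without Lucene. */
final class ThreadGroupingCollector {
    // Best post score seen so far for each thread id.
    private final Map<String, Float> bestScoreByThread = new HashMap<>();

    /** Called once per matching post (hypothetical signature). */
    void collect(String threadId, float score) {
        bestScoreByThread.merge(threadId, score, Float::max);
    }

    /** Returns up to n thread ids, ordered by their best post score, highest first. */
    List<String> topThreads(int n) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(bestScoreByThread.entrySet());
        entries.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Float> entry : entries) {
            if (result.size() == n) {
                break;
            }
            result.add(entry.getKey());
        }
        return result;
    }
}
```

Because every matching post passes through the collector, no second search is needed to fill the page; the trade-off is that it holds one map entry per matching thread.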

Steen