ansaurus

Question

How would one use Lucene to help implement search on a site like StackOverflow?

Answer 1

A:

You've probably done more thinking on this subject than most folks who will try to answer you (which is probably part of the reason it's been a day and I'm your first response). I'm only going to tackle your final three questions, because there's a lot here that I don't have time to go into, and I think those three are the most interesting. (The physical implementation questions will probably come down to 'pick something, and then tweak it as you learn more'.)

Vote data: I'm not sure that votes make something more relevant to a search; frankly, they just make it more popular. What I'm trying to say is that whether a given post is relevant to your question is mostly independent of whether it was relevant to other people. That said, there's probably at least a weak correlation between interesting questions and those that folks would want to find. Vote data is probably most useful in searches based purely on data, e.g. "most popular" type searches. In generic text-based searches, I'd probably not give votes any weight at first, but would consider working on an algorithm that gives them a slight weight in the sorting (so votes would not change which results are returned, only give a minor boost to their ordering).

Replies: I'd agree with your approach here, subject to some testing. Remember that this is going to have to be an iterative process based on user feedback, so you'll need to collect metrics on whether searches returned successful results for the searcher.

Other: Don't forget the user's score. Users get points on SO too, and that influences their default rank among the answers to each question they answer (it looks like it's mostly used for tiebreaking between replies that have the same number of votes).
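As a rough illustration of the "slight weight for the sorting" idea in the vote data point above, here is a minimal Python sketch. It assumes the search engine has already returned a list of hits with relevance scores; the function name `rerank_with_votes`, the `vote_weight` value, and the sample data are all made up for illustration and would need tuning against real user feedback.

```python
import math

def rerank_with_votes(hits, vote_weight=0.05):
    """Reorder hits by relevance with a small, dampened vote boost.

    hits: list of (doc_id, relevance, votes) tuples already returned by
    the text search. Votes never add or remove results; they only nudge
    the ordering. vote_weight is an assumed tuning knob.
    """
    def blended(hit):
        _, relevance, votes = hit
        # log1p dampens large vote counts; negative totals are treated as 0.
        return relevance * (1.0 + vote_weight * math.log1p(max(votes, 0)))
    return sorted(hits, key=blended, reverse=True)

# Hypothetical hits: q2 is almost as relevant as q1 but far more upvoted;
# q3 is hugely upvoted but barely relevant.
hits = [("q1", 2.00, 0), ("q2", 1.98, 40), ("q3", 0.50, 900)]
for doc_id, relevance, votes in rerank_with_votes(hits):
    print(doc_id, relevance, votes)
```

With a small weight like this, votes can swap two nearly tied results, but a heavily upvoted post that barely matches the query stays near the bottom.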

Paul 2010-02-21 15:27:47

@Paul: I've updated the question to reflect how vote data (which is a confidence score) relates to the relevance score as well as thoughts on replies. I don't think I'm going to use the user's reputation to weigh the search sort results, but in terms of tiebreaking replies in terms of votes, it's easy enough to do in SQL Server.

casperOne 2010-03-04 00:53:37

Answer 2

A:

Determining relevance is always tricky. You need to figure out what you're trying to accomplish. Is your search trying to provide an exact match for a problem someone might have or is it trying to provide a list of recent items on a topic?

Once you've figured out what you want to return, you can look at the relative effect of each feature you're indexing. That will get a rough search going. From there you tweak based on user feedback (I suggest using implicit feedback instead of explicit; otherwise you'll annoy the user).

As to indexing, you should try to put the data in so that each item has all the information necessary to rank it. This means you'll need to grab the data from a number of locations to build it up. Some indexing systems can add values to existing items, which would make it easy to add scores to questions when subsequent answers come in. The simpler option is to just rebuild the question's document every so often.

Epsilon Prime 2010-02-22 21:49:29

Answer 3

+2 A:

The answers you are looking for really cannot be found using Lucene alone. You need ranking and grouping algorithms to filter and understand the data and how it relates. Lucene can help you get normalized data, but you need the right algorithm after that.

I would recommend you check out one or all of the following books. They will help you with the math and get you pointed in the right direction:

Algorithms of the Intelligent Web

Collective Intelligence in Action

Programming Collective Intelligence

Mike Glenn 2010-02-25 03:00:21

Mike Glenn: You make a very good point. I'm going to update the question to reflect that fact later in the day, so your input on the updated question would be appreciated. Also, I've read "Programming Collective Intelligence" (and consulted it before writing this) and I've found that it doesn't do much to help with this situation (where you have some sort of relevance score vs. ranking of the item that is relevant), but I'll probably take a second look.

casperOne 2010-02-25 18:58:40

Algorithms of the Intelligent Web is a book you are really going to want to check out. As far as the question goes, here's how I would start out using Lucene. The question text would be placed in a tokenized field, and the same goes for the replies, comments and tags, each in their own field on the doc. By default, Lucene weights a match on a field with fewer terms higher than a match on a field with many terms. Which is to say that a match on the tags field will have more relevance than a match on the question or reply fields.
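To make the field-length effect concrete, here is a small Python sketch of the length normalization used by Lucene's classic TF-IDF similarity, which, as far as I know, is 1 / sqrt(number of terms in the field). The field lengths below are invented, and the real score also multiplies in other factors (term frequency, inverse document frequency, boosts), so treat this as an illustration of the trend only.

```python
import math

def length_norm(num_terms):
    """Classic TF-IDF style length normalization: 1 / sqrt(term count)."""
    return 1.0 / math.sqrt(num_terms) if num_terms > 0 else 0.0

# Hypothetical term counts for one question document.
field_lengths = {"tags": 4, "question": 100, "replies": 400}
for field, count in field_lengths.items():
    print(f"{field}: {length_norm(count):.3f}")
```

A match in the short tags field gets a much larger normalization factor than the same match in the long replies field, which is the behaviour described above.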

Mike Glenn 2010-02-28 03:50:44

Now, bringing in the votes and possibly the user's rep score COULD make your search more accurate. But you would need to set up a randomized test and determine a way to measure how effective each method is at delivering the result the user was searching for.
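One way such a test could be wired up is sketched below in Python. It assigns each user to a ranking variant by hashing the user id (pseudo-random but stable across visits) and compares the share of searches that ended in a click. The variant names, the event format, and the use of clicks as the success signal are assumptions for illustration; a real test would also need a significance check before drawing conclusions.

```python
import hashlib

def assign_variant(user_id, variants=("relevance_only", "relevance_plus_votes")):
    """Deterministically bucket a user so they always see the same ranking."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def success_rate(events):
    """events: iterable of (variant, clicked) pairs, one per search.

    Returns {variant: fraction of searches where a result was clicked}.
    """
    totals, successes = {}, {}
    for variant, clicked in events:
        totals[variant] = totals.get(variant, 0) + 1
        successes[variant] = successes.get(variant, 0) + (1 if clicked else 0)
    return {variant: successes[variant] / totals[variant] for variant in totals}

# Hypothetical log of searches: (variant shown, did the user click a result?)
log = [
    ("relevance_only", True), ("relevance_only", False),
    ("relevance_plus_votes", True), ("relevance_plus_votes", True),
]
print(assign_variant("user-42"))
print(success_rate(log))
```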

Mike Glenn 2010-02-28 03:57:30

@Mike Glenn: I've updated the answer to go more into relevance score and confidence score using vote data. I've looked over "Programming Collective Intelligence" again, but the most it says is "if you want to use confidence data to enhance relevance data, you have to use one of the techniques mentioned earlier". It doesn't go into much detail. I'm going to look into the other titles, but if you have specific references to parts of the book that would be relevant, I'd appreciate it, and any feedback on the updated question above, of course. +1 for the book references.

casperOne 2010-03-04 00:56:28

Answer 4

A:

I think that Lucene is not good for this job. You need something really fast with high availability... like SQL. But you want open source?

I would suggest you use Sphinx - http://www.sphinxsearch.com/ It's much better, and I am speaking from experience; I have used them both.

Sphinx is amazing. Really is.

Yuki 2010-03-02 09:23:43

@Yuki: It should be noted I am using SQL Server 2008 on the back end, so Sphinx is not helpful.

casperOne 2010-03-03 21:18:31

Hi again. Well, you are wrong! I used Sphinx with SQL Server 2008 (and even 2005)! When you configure it, you can specify the connection and the select statement in the index ini files... Sphinx does not mind which database is used to get the data. Search in Google and you can find some examples.

Yuki 2010-03-09 08:49:27

@Yuki: Sorry, you are right, it will work with SQL Server 2008, but I am also on a shared hosting environment, so I can't run a service (which I believe is required).

casperOne 2010-08-16 20:40:42

Answer 5

+1 A:

The Lucene index will have the following fields:

Title
Question
Accepted Answer (Or highly voted answer if there is no accepted answer)
All answers combined

All of these fields are analyzed. Length normalization is disabled to get better control over the scoring.

The order of the fields above also reflects their importance, in descending order. That is, a query match in the title is more important than one in the accepted answer, everything else being equal.

The number of upvotes for the question and for the top answer can be captured by boosting those fields. But the raw upvote count cannot be used as the boost value, as it could skew results dramatically. (A question with 4 upvotes would get twice the score of one with 2 upvotes.) These values need to be dampened aggressively before they can be used as boost factors. Using something like the natural logarithm (for upvotes > 3) looks good.

The title can be boosted by a value a little higher than that of the question.
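A minimal Python sketch of that dampening might look like the following. The natural logarithm and the "more than 3 upvotes" threshold come from the paragraph above; giving everything at or below the threshold a neutral boost of 1.0 is my own assumption, as is the function name `vote_boost`.

```python
import math

def vote_boost(upvotes, threshold=3):
    """Dampened boost factor for a field, derived from an upvote count.

    At or below the threshold the boost is neutral (1.0, an assumption);
    above it the natural logarithm keeps large counts from dominating.
    """
    if upvotes <= threshold:
        return 1.0
    return math.log(upvotes)

for upvotes in (2, 4, 50, 1000):
    print(upvotes, round(vote_boost(upvotes), 2))
```

With raw counts, a post with 1000 upvotes would outweigh one with 4 upvotes by a factor of 250; after dampening the factor is roughly 5.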

Though inter-linking of questions is not very common, having a basic pagerank-like weight for a question could throw up some interesting results.

I do not consider the tags of a question to be very valuable information for search. Tags are nice when you just want to browse the questions. Most of the time, tags are part of the text, so a search for a tag will match the question anyway. This is open to discussion, though.

A typical search query will be performed on all four fields.

+(title:query question:query accepted_answer:query all_combined:query)
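For illustration, a query string of that shape could be assembled as in the Python sketch below, using the classic Lucene query parser's `field:(terms)^boost` syntax. The boost numbers are placeholders rather than recommendations, and putting field importance into the query string is only one option (the vote-based boosts discussed above would instead be applied at indexing time). The escaping covers the special characters documented for the classic query parser.

```python
# Placeholder boosts reflecting the field order above; real values need tuning.
FIELD_BOOSTS = {"title": 3.0, "question": 2.0, "accepted_answer": 1.5, "all_combined": 1.0}

# Characters with special meaning in the classic Lucene query syntax.
SPECIAL_CHARS = set('\\+-!(){}[]^"~*?:/&|')

def escape(text):
    """Backslash-escape query syntax characters in user input."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)

def build_query(user_text, boosts=FIELD_BOOSTS):
    """Build one required clause that searches every field with its boost."""
    terms = escape(user_text.strip())
    if not terms:
        raise ValueError("empty search text")
    clauses = [f"{field}:({terms})^{boost}" for field, boost in boosts.items()]
    return "+(" + " ".join(clauses) + ")"

print(build_query("reduce java heap size"))
```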

This is a broad sketch and will require significant tuning to arrive at the right boost values and the right weights for queries, if required. Experimentation will show the right weights for the two dimensions of quality: relevance and importance. You can make things more complicated by introducing recency as a ranking parameter. The idea here is that if a problem occurs in a particular version of a product and is fixed in later revisions, newer questions could be more useful to the user.

Some interesting twists could be added to search. Some form of basic synonym search could be helpful if only a "few" matching results are found. For example, "decrease java heap size" is the same as "reduce java heap size." But then it will also mean that "map reduce" starts matching "map decrease." (A spell checker is the obvious addition, but I suppose programmers would spell their queries correctly.)
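If recency is brought in, one simple option is an exponential decay on the age of the question, sketched below in Python. The half-life of one year and the function name `recency_factor` are assumptions for illustration; the factor would be multiplied into the score (or used as a boost), and the right half-life is something only experimentation can settle.

```python
from datetime import datetime, timedelta, timezone

def recency_factor(posted_at, now, half_life_days=365.0):
    """Return a multiplier in (0, 1] that halves every half_life_days."""
    age_days = max((now - posted_at).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)

now = datetime(2010, 3, 2, tzinfo=timezone.utc)
for age in (0, 365, 730):
    posted = now - timedelta(days=age)
    print(age, recency_factor(posted, now))
```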
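The "only when few results are found" part can be sketched as a fallback, shown below in Python. The synonym table, the `min_hits` threshold, and the toy in-memory `make_search` helper are all hypothetical stand-ins; in a real system the search function would call the index.

```python
# Hypothetical, hand-maintained synonym table.
SYNONYMS = {"decrease": ["reduce"], "reduce": ["decrease"]}

def expand_query(terms, synonyms=SYNONYMS):
    """Turn each term into a group of acceptable alternatives."""
    return [[term] + synonyms.get(term, []) for term in terms]

def search_with_fallback(terms, search_fn, min_hits=5):
    """Run the strict query first; retry with synonyms only if too few hits."""
    hits = search_fn([[term] for term in terms])
    if len(hits) >= min_hits:
        return hits
    return search_fn(expand_query(terms))

def make_search(docs):
    """Toy search: a doc matches if every group has at least one word in it."""
    def search(groups):
        return [doc for doc in docs
                if all(any(alt in doc.split() for alt in group) for group in groups)]
    return search

docs = ["reduce java heap size", "map reduce tutorial"]
search = make_search(docs)
print(search_with_fallback(["decrease", "java", "heap", "size"], search, min_hits=1))
print(search_with_fallback(["map", "decrease"], search, min_hits=1))
```

The second call shows the caveat mentioned above: once synonyms kick in, "map decrease" matches the "map reduce" document.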

Shashikant Kore 2010-03-02 11:53:49

@Shashikant Kore: Would it be better to boost on the query level or on the document/field level? I think that it might be better on the field or query level, as I think that I want to weight the relevance score differently based on the changes to the question I've made above.

casperOne 2010-03-04 00:58:10

Yeah, my preference is for field level boosting, as suggested above. Since we already have vote information at the time of indexing, it is a good idea to use it at that time.

Shashikant Kore 2010-03-04 02:54:06
