views:

512

answers:

7

The related questions that appear after entering the title, and those that are in the right side bar when viewing a question seem to suggest very apt questions.

Stackoverflow only does a SQL search for it and uses no special algorithms, said Spolsky in a talk.

What algorithms exist to give good answers in such a case. How do U do database search in such a case? Make the title searchable and search on the keywords or search on tags and those questions with many votes on top?

+1  A: 

I do not know how SO implements it, but my hunch is that they use a variant of approximate string matching.

lothar
+2  A: 

This post will help you http://stackoverflow.com/questions/62328/is-there-an-algorithm-that-tells-the-semantic-similarity-of-two-phrases

victor hugo
in fact that's the top ranked related question now ;)
SillyMonkey
@SillyMonkey Exactly what I was going to say!
Lakshman Prasad
Haha I see! That's funny
victor hugo
+1  A: 

The related questions sidebar will be building on the tags for each question (probably by ranking them based on tag overlap, so 5 tags in common > 4 tags in common etc).

The rest will be building on heuristics and algorithms suitable for natural language processing. These aren't normally very good in general purpose language, but most of them are VERY good once the vocabulary is reduced down to a single technical area such as programming.

workmad3
tag overlap is probably not the only stuff involved; as the top related question to this question has no tag in common with this question.:)
Lakshman Prasad
The top question (Is there an algorithm that tells you...) has the tag nlp, the same as this question and two other tags. The one below that has nlp and 4 other tags. #3 has nlp, 4 other tags and less upvotes than #2. There is something else though, as #4 has nlp, 3 other tags and more upvotes than #3, so probably some processing on title as well :)
workmad3
A: 

Have a look at Porter stemming for a stemming algorithm if you are looking to get into "related" algorithms.

A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word, "fish".

Once you have processed a document and done stemming on it, you can index the stemmed words by count and then compare against other documents. This is the most basic approach to tackling this problem.

Also take care to ignore stop words like "the", "an", "of" etc.

aleemb
A: 

Use the fulltext search feature of SQL Server.

Hemanshu Bhojak
A: 

Such problems are solved by making a "bag of words" of the stemmed words. That is basically a word count vector. Those words are preprocessed (stemmed) and weighted with their probability to occur in a sentence ("the" has a higher probability than "probability" and should thus be weighted less). You then can perceive this bag of words either as a vector in euclidean space or as a sample of a probability density.

You can apply algorithms as nearest neighbour search or semantic hashing. The latter seems to be SOTA (see http://www.cs.toronto.edu/~rsalakhu/papers/semantic_final.pdf).

bayer
+4  A: 

If you listen to the Stack Overflow podcast 32 (unfortunately the transcript doesn't have much in) you can hear Jeff Atwood say a little about how he does it.

It seems like the algorithm is something like:

  • Take the question
  • Remove the most common words in English (from a list he got from google)
  • submit a full text search to the SQL server 2008 full text search engine

More details about the full text search can be found here: http://msdn.microsoft.com/en-us/library/ms142571.aspx

This may be out of date by now - they were talking about moving to a better/faster full text search such as Lucene, and I vaguely remember Jeff saying in the podcast that this had been done.

Nick Fortescue