tags:

views:

466

answers:

8

I am trying to implement a feature similar to the "Related Questions" on Stackoverflow.

How do I go about writing the SQL statement that will search the Title and Summary field of my database for similar questions?

If my questions is: "What is the SQL used to do a search similar to "Related Questions" on Stackoverflow".

Steps that I can think of are;

  1. Strip the quotation marks
  2. Split the sentence into an array of words and run a SQL search on each word.

If I do it this way, I am guessing that I wouldn't get any meaningful results. I am not sure if Full Text Search is enabled on the server, so I am not using that. Will there be an advantage of using Full Text Search?

I found a similar question but there was no answer: similar question

Using SQL 2005

+1  A: 

They probably relate based on tags that are added to the questions...

Ropstah
The related questions dialog pops up when you enter a title, not tags.
Daniel Lew
True, they search the title for 'unique' words. They probably filter out words like: "how","the","what" etc. and search the rest of the words in other titles while they don't take word positions into account. But they also probably search those unique words in tags, those matches are probably higher rated in search results.
Ropstah
A: 

The SQL very well may be just "SELECT * FROM questions;". I find it hard to imagine that the algorithm for finding similar questions is implemented in SQL.

Daniel Straight
True, but something more constructive would help here. As evidenced by the other answers, a Full Text engine would really help do the job.
Patrick Harrington
A: 

I'm pretty sure it would be most efficient to implement the feature based on the tags associated with each post.

lush
Except for the fact that it does the search before you enter your tags.
musicfreak
A: 

It's probably done using a full text search which matches like words/phrases. I've used it in MySQL and SQL Server with decent success with out of the box functionality.

You can find more on MySQL full text searches at:

http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html

Or just google Full Text search and you will find a lot of information.

Ryan Smith
A: 

It looks keyword based on the title you enter, queried against titles and content of other questions. Probably easier (and more appropriate) to do in Lucene (or similar) then in a relational database.

Gandalf
A: 

I'd say it's probably a full text search on the question title and the question content and answers as well using the individual words (not the whole title) you enter. Then, using the ranking features of full-text, the top 10 or so questions that rank the highest are displayed.

As tydok pointed out, it looks like they are using full-text searching (I couldn't imagine any other way).

Here's the MSDN reference on Full-Text Searching, nailing the specific query used probably isn't going to happen.

scottm
+6  A: 

Check out this podcast.

One of our major performance optimizations for the “related questions” query is removing the top 10,000 most common English dictionary words (as determined by Google search) before submitting the query to the SQL Server 2008 full text engine. It’s shocking how little is left of most posts once you remove the top 10k English dictionary words. This helps limit and narrow the returned results, which makes the query dramatically faster.

Nick D
You don't want to remove too many 'common' words - 'net, class, switch, case, data' are all common. First pass would probably to search for keyword sin your question and lookup questions with similair keywords
Martin Beckett
This just supports my (inexplicably downvoted) answer: it's not about the SQL, it's about what's done before and after the SQL.
Daniel Straight
A: 

After enabling Full Text search on my SQL 2005 server, I am using the following stored procedure to search for text.

ALTER PROCEDURE [dbo].[GetSimilarIssues] 
(
 @InputSearch varchar(255)
)
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;

DECLARE @SearchText varchar(500);

SELECT @SearchText = '"' + @InputSearch + '*"'

SELECT  PostId, Summary, [Description], 
Created
FROM Issue

WHERE FREETEXT (Summary, @SearchText);
END
Picflight