views:

840

answers:

5

I have a website that has over 400,000 items. Some similar, some vastly different. We want to provide a way to search these items the best way possible. After being delivered the website it was using full text indexing. The solution is basic at best, woefully inadequate at worst.

So what is the best way to search these items? They are stored in a SQL Server Database (2005). Our website is designed in C# 2.0.

Currently here is the process:

  1. User enters value into text box.
  2. We 'clean' this entry. Removing 'scary' characters that could be an attempted hack. Remove key words (and, or, etc..)
  3. Pass value into a stored procedure to return results.
  4. Return results.
+6  A: 

Look at Lucene.NET. I think it's a vast improvement over full-text search in SQL Server.

tvanfosson
I hope Jeff will read this question :) StackOverflow is a great demonstration of crappy build-in search based on SQLServer FTS :)
aku
@aku -- agreed. I generally use Google with site:stackoverflow.com now.
tvanfosson
I think they're still going to use lucene.net
Scott Cowan
A: 

Second step is quite controversial - what words you consider as 'scary'? If you use SQL server build-in full text search then instead of manually removing key words from input query you can setup lists of nose/stop words inside sql server.

Here is one feature I want to see here on StackOverflow as well as on any other site that provides search functionality:

  • give more priority(weight) to some fields of your documents (in case of stackoverflow - search should prioritize topic title)

Also consider to use 3rd party solution for FTS such as Lucene or Sphinx - they can provide much better user experience than build-in functionality. Some advantages of 3rd party FTS components are: reduced database load, better relevance of search results, better indexing speed, smaller size of database.

aku
+1  A: 

You can have a look at Lucene.net, it will minimize the calls to the database for the search queries.

Following from http://incubator.apache.org/lucene.net/

Lucene.Net is a source code, class-per-class, API-per-API and algorithmatic port of the Java Lucene search engine to the C# and .NET platform utilizing Microsoft .NET Framework.

Lucene.Net sticks to the APIs and classes used in the original Java implementation of Lucene. The API names as well as class names are preserved with the intention of giving Lucene.Net the look and feel of the C# language and the .NET Framework. For example, the method Hits.length() in the Java implementation now reads Hits.Length() in the C# port.

In addition to the APIs and classes port to C#, the algorithm of Java Lucene is ported to C# Lucene. This means an index created with Java Lucene is back-and-forth compatible with the C# Lucene; both at reading, writing and updating. In fact a Lucene index can be concurrently searched and updated using Java Lucene and C# Lucene processes.

renegadeMind
+2  A: 

SQL Server Central has a nice article on creating a Google-like Full Text Search using SQL Server. Unfortunately you have to register view the full article, but registration is free and they post a lot of good information. Here is the link:

http://www.sqlservercentral.com/articles/Full-Text+Search+(2008)/64248/

Excerpt:

...

Google Style

The key to a successful application is to make it easy to use but powerful. Google has done this with their Web search engine. The syntax for queries is simple and intuitive, but full-featured. Though the basic building blocks of a Google query are simple you can combine them in powerful ways. I'll begin with basic Google query syntax and add some additional operators to take advantage of the power of SQL Server CONTAINS predicate syntax. The full Google syntax is defined in the Google Help:Cheat Sheet at http://www.google.com/help/cheatsheet.html.

...

The article has full example code and even a link to download it. Its an interesting read even if you don't plan on implementing it.

Dan Rigby
+1  A: 

You could use Google site search to deliver your search results. Doesn't always give you the flexibility to display the results as you want, but for many is good enough.

Sam Meldrum