I need to create a full-text search form for a database of emails / support tickets (in C#) and I'm looking for advice and articles on how to approach this. In particular I'd like to know how to approach the classic full-text search problems, for example:

  • Making sure that matches are sensible, for example if someone enters "big head" and a document contains "big hairy head", making sure that document is returned in the search.
  • Ordering results by relevancy.
  • How best to display matches, for example by highlighting matching terms.
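To make the highlighting point concrete, here is a minimal sketch of the kind of behaviour meant. It is in Python only for brevity (the same regular-expression approach carries over to C#), and the `highlight` function, its parameters and the `<b>` tags are all illustrative assumptions rather than part of any library. It assumes the query consists of plain words and that the result is displayed as HTML.

```python
import html
import re


def highlight(text, query, open_tag="<b>", close_tag="</b>"):
    """Wrap whole-word, case-insensitive matches of any query term in tags.

    Hypothetical helper: assumes plain-word query terms and HTML output.
    """
    terms = [re.escape(t) for t in query.split()]
    if not terms:
        return html.escape(text)
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    out = []
    pos = 0
    for m in pattern.finditer(text):
        # Escape the untouched text so stray '<' or '&' cannot break the page.
        out.append(html.escape(text[pos:m.start()]))
        out.append(open_tag + html.escape(m.group(0)) + close_tag)
        pos = m.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)
```

For example, `highlight("A big hairy head", "big head")` wraps "big" and "head" separately and leaves "hairy" alone.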

I know that full-text search is a fairly mammoth subject area in itself; I'm just looking for simple articles and advice on how to create something that is at least marginally useful and usable.

I've used things like Lucene.Net before - obviously some sort of full-text index is going to be needed - the challenging bit is taking the list of documents that Lucene returns and presenting it in a useful way.

UPDATE: I want to clarify slightly what I mean - there are hundreds of generic full-text search forms that all perform a very similar function, for example:

  • The search button on each and every internet forum
  • The search button on each and every wiki
  • Windows / Google desktop search
  • Google

Each of those searches takes information from different sources and displays it by different means (HTML, Windows Forms etc...), but each of them solves the same problems with methods of varying complexity, and for the most part (with the possible exception of desktop search) the input data is in the same format: HTML or text.

I'm looking for advice and common strategies on how to do things like rank search results in ways that are likely to be useful to the user.

Alternatively, one strategy I had considered was taking some wiki software, exporting my entire data set as text into that wiki, and just using the wiki to search. The sort of search I'm after is, for all intents and purposes, functionally identical to 99% of the searches that already exist; I just want to give it a different input data source and format the output slightly differently (both of which I already know how to do).

Surely there must be some advice on how those sorts of searches are done?

A: 

Your question is database-specific: you need to specify which database you will use. You can pass the search keywords to the database engine instead of doing the search in your own program.

Henry Gao
+2  A: 

SQL Server (including the Express versions) has a full-text search facility. This can search text within columns but can also harness IFilters to search within embedded documents. You can use the FREETEXTTABLE function in T-SQL to intelligently search within content and return it in ranking order:

"Returns a table of zero, one, or more rows for those columns containing character-based data types for values that match the meaning, but not the exact wording, of the text in the specified freetext_string. FREETEXTTABLE can only be referenced in the FROM clause of a SELECT statement like a regular table name.

Queries using FREETEXTTABLE specify freetext-type full-text queries that return a relevance ranking value (RANK) and full-text key (KEY) for each row."

e.g.

SELECT FT_TBL.CategoryName 
    ,FT_TBL.Description
    ,KEY_TBL.RANK
FROM dbo.Categories AS FT_TBL 
    INNER JOIN FREETEXTTABLE(dbo.Categories, Description, 
        'sweetest candy bread and dry meat') AS KEY_TBL
        ON FT_TBL.CategoryID = KEY_TBL.[KEY];

For more info have a read of Understanding SQL Server Full-Text Indexing.

Dan Diplo
This looks OK, but it has a couple of issues. For one, it doesn't index HTML properly: searching for 'html', 'br', 'span' etc. returns millions of duff results (every document whose markup contains those tags) rather than just the results where the rendered HTML contains 'html', 'br', 'span' etc.
Kragen
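One common way around the markup problem described above is to strip the HTML down to its rendered text before it reaches the index, so tag names are never indexed as words. The following is only a sketch, in Python for brevity, using the standard library's `html.parser`; the `TextExtractor` class and `html_to_text` function are hypothetical names, and a C# version would need an HTML parser of your choosing.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text a browser would render, dropping tags."""

    SKIPPED = {"script", "style"}  # their contents are never shown to a reader

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Join fragments with a space so "a<br>b" does not become "ab",
        # then collapse the whitespace left behind by removed markup.
        return " ".join(" ".join(self._parts).split())


def html_to_text(markup):
    """Return the visible text of an HTML fragment, ready for indexing."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

Indexing the output of `html_to_text` instead of the raw column means a search for 'span' only hits documents that actually mention the word. The trade-off is that inline tags inside a word (for example `<b>H</b>ello`) get split into two tokens.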
Also, the way weighting / ranking works looks tricky to deal with. If I enter a search like "html boolean WebRequest", how tricky is it going to be to get it to rank results where all 3 appear together over ones where only 1 appears, or results where all 3 appear spread out in a large document?
Kragen
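The ordering asked for in that comment can be illustrated with a small re-ranking sketch: sort first by how many distinct query terms a document contains, then by how close together those terms appear. This is not how SQL Server or Lucene compute their scores; it is a toy in Python under the assumption that you re-rank a modest list of candidate documents yourself, and `tokenize`, `smallest_window`, `score` and `rank` are all made-up names.

```python
import re


def tokenize(text):
    """Lower-case word tokens; a deliberately naive analyzer."""
    return re.findall(r"\w+", text.lower())


def smallest_window(tokens, terms):
    """Length in tokens of the shortest span containing every term, or None."""
    needed = set(terms)
    last_seen = {}
    best = None
    for i, tok in enumerate(tokens):
        if tok in needed:
            last_seen[tok] = i
            if len(last_seen) == len(needed):
                # Shortest span ending at i starts at the oldest last-seen term.
                width = i - min(last_seen.values()) + 1
                if best is None or width < best:
                    best = width
    return best


def score(document, query):
    """Sort key: (distinct query terms matched, closeness of those terms)."""
    terms = set(tokenize(query))
    tokens = tokenize(document)
    matched = terms & set(tokens)
    window = smallest_window(tokens, matched) if matched else None
    closeness = 1.0 / window if window else 0.0
    return (len(matched), closeness)


def rank(documents, query):
    """Return documents ordered best match first."""
    return sorted(documents, key=lambda d: score(d, query), reverse=True)
```

With this key, a document containing all three of "html boolean WebRequest" always outranks one containing a single term, and among documents with all three, the one where they sit in the shortest span comes first. It also covers the "big head" case from the question: "big hairy head" matches both terms within a three-word window.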
+1  A: 

You can use Lucene.Net, a great library from Apache. The Linq to Lucene extensions can also simplify your work.

ArsenMkrt
This is what I've ended up using - it was less work than I was imagining, but it's also more flexible.
Kragen
A: 

Have a look at CONTAINSTABLE too, as it supports wildcards and weighting etc...

http://msdn.microsoft.com/en-us/library/ms189760.aspx

TimS
A: 

If you don't want to go the SQL route then also consider Microsoft Search Server 2008 Express - it's free, powerful and looks easy to use. It matches all your requirements and handles things like ranking etc. automatically.

Dan Diplo