views:

2082

answers:

5

Hi all,

In a pretty typical scenario, I have a 'Search' text box on my web application which has user input passed directly to a stored procedure which then uses full text indexing to search on two fields in two tables, which are joined using appropriate keys.

I am using the CONTAINS predicate to search the fields. Before passing the search string in, I do the following:

SET @ftQuery = '"' + REPLACE(@query,' ', '*" OR "') + '*"'

Changing the castle to "the*" OR "castle*", for example. This is necessary because I want people to be able to search on cas and get results for castle.

WHERE CONTAINS(Building.Name, @ftQuery) OR CONTAINS(Road.Name, @ftQuery)

The problem is that now that I have appended a wildcard to the end of each word, noise words (e.g. the) also have a wildcard appended and therefore no longer appear to get dropped. This means that a search for the castle will return items with words such as theatre etc.

Changing OR to AND was my first thought, but that appears to simply return no matches if a noise word is then used in the query.

All I am trying to achieve is to allow the user to enter multiple, space separated words that respresent either the entirety or a prefix of the words they are searching on, in any order - and drop noise words such as the from their input (otherwise when they search for the castle they get a big list of items with the result they need somewhere in the middle of the list.

I could go ahead and implement my own noise word removal procedure, but it seems like something that full text indexing ought to be able to handle.

Grateful for any help!

Jamie

+3  A: 

Noise words are stripped out before the indexing is stored. So it is impossible to write a query that searches on a stop word. If you REALLY want to enable this behavior, you need to edit the list of stop words. (http://msdn.microsoft.com/en-us/library/ms142551.aspx) and then re-build your index.

galuvian
+1  A: 

I had the same question and after a thorough search I've come to the conclusion there's no good solution.

As a compromise, I'm implementing the brute force solution:

1) Open C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData\noiseENU.txt and copy all the text in there.

2) Paste into a code file in the application, replace line breaks with "," to get a List initializer like this:

public static List<string> _noiseWords = new List<string>{ "about", "1", "after", "2", "all", "also", "3", "an", "4", "and", "5", "another", "6", "any", "7", "are", "8", "as", "9", "at", "0", "be", "$", "because", "been", "before", "being", "between", "both", "but", "by", "came", "can", "come", "could", "did", "do", "does", "each", "else", "for", "from", "get", "got", "has", "had", "he", "have", "her", "here", "him", "himself", "his", "how", "if", "in", "into", "is", "it", "its", "just", "like", "make", "many", "me", "might", "more", "most", "much", "must", "my", "never", "no", "now", "of", "on", "only", "or", "other", "our", "out", "over", "re", "said", "same", "see", "should", "since", "so", "some", "still", "such", "take", "than", "that", "the", "their", "them", "then", "there", "these", "they", "this", "those", "through", "to", "too", "under", "up", "use", "very", "want", "was", "way", "we", "well", "were", "what", "when", "where", "which", "while", "who", "will", "with", "would", "you", "your", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z" };

3) Before you submit the search string, break it into words and remove any words in the noise words like this:

List<string> goodWords = new List<string>();
string[] words = searchString.Split(' ');
foreach (string word in words)
{
   if (!_noiseWords.Contains(word))
      goodWords.Add(word);
}

Not an ideal solution, but should work as long as the noise words file doesn't change. Multiple language support would use a dictionary of lists by language.

amsimmon
A: 

Similar to my approach.

While I wish to use Full Text Indexing for its ability to perform stemming, its speed and multiple word search and so on I'm actually only indexing a couple of nvarchar(100) fields in two tables. Each table will easily remain below 50,000 rows.

My solution was to remove all noise words from the text file and allow the indexer to compile an index including all words. It still only consists of a few thousand entries.

I then do the replace on spaces in the search string as described in my original post to get CONTAINS to work on multiple words, and to stem words individually.

Seems to work really well, but I will be keeping a close eye on performance.

Jamie
A: 

Here's a working function. The file noiseENU.txt is copied as-is from \Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData .

    Public Function StripNoiseWords(ByVal s As String) As String
        Dim NoiseWords As String = ReadFile("/Standard/Core/Config/noiseENU.txt").Trim
        Dim NoiseWordsRegex As String = Regex.Replace(NoiseWords, "\s+", "|") ' about|after|all|also etc.
        NoiseWordsRegex = String.Format("\s?\b(?:{0})\b\s?", NoiseWordsRegex)
        Dim Result As String = Regex.Replace(s, NoiseWordsRegex, " ", RegexOptions.IgnoreCase) ' replace each noise word with a space
        Result = Regex.Replace(Result, "\s+", " ") ' eliminate any multiple spaces
        Return Result
    End Function
Herb Caudill
A: 

You can also remove noise words BEFORE making the query. List of language id: http://msdn.microsoft.com/en-us/library/ms190303.aspx

Dim queryTextWithoutNoise As String = removeNoiseWords(queryText, ConnectionString, 1033)

Public Function removeNoiseWords(ByVal inputText As String, ByVal cnStr As String, ByVal languageID As Integer) As String

    Dim r As New System.Text.StringBuilder
    Try
        If inputText.Contains(CChar("""")) Then
            r.Append(inputText)
        Else
            Using cn As New SqlConnection(cnStr)

                Const q As String = "SELECT display_term,special_term FROM sys.dm_fts_parser(@q,@l,0,0)"
                cn.Open()
                Dim cmd As New SqlCommand(q, cn)
                With cmd.Parameters
                    .Add(New SqlParameter("@q", """" & inputText & """"))
                    .Add(New SqlParameter("@l", languageID))
                End With
                Dim dr As SqlDataReader = cmd.ExecuteReader
                While dr.Read
                    If Not (dr.Item("special_term").ToString.Contains("Noise")) Then
                        r.Append(dr.Item("display_term").ToString)
                        r.Append(" ")
                    End If
                End While
            End Using
        End If
    Catch ex As Exception
        ' ...        
    End Try
    Return r.ToString

End Function
Manuel Alves