views:

788

answers:

2

I am attempting to use a full text search over a series of names in my database. This is my first attempt at using full text search. Currently I take the search string entered and put a NEAR condition between each term (i.e. entered phrase of "Kings of Leon" becomes "Kings NEAR of NEAR Leon").

Unfortunately I have discovered that this tactic results in a false negative search result because the word "of" is being dropped by SQL Server when it creates the indexes because it is a noise word. Thus, "Kings Leon" will match correctly, but "Kings of Leon" will not.

My co-worker suggests taking all the noise words as defined in the MSSQL\FTData\noiseENG.txt and putting them in the .Net code so the noise words can be stripped out before the full text search is executed.

Is this the best solution? Is there not some auto-magic setting I can change in SQL server to do this for me? Or maybe just a better solution that doesn't feel as hacky?

+2  A: 

Full Text is going to work off of the search criteria you provide it. You can remove the noise word from the file, but you really risk bloating your index size by doing that. Robert Cain has a lot of good information on his blog regarding this:

http://arcanecode.com/2008/05/29/creating-and-customizing-noise-words-in-sql-server-2005-full-text-search/

To save some time you can look at how this method removes them and copy the code and words:

        public string PrepSearchString(string sOriginalQuery)
    {
        string strNoiseWords = @" 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0 | $ | ! | @ | # | $ | % | ^ | & | * | ( | ) | - | _ | + | = | [ | ] | { | } | about | after | all | also | an | and | another | any | are | as | at | be | because | been | before | being | between | both | but | by | came | can | come | could | did | do | does | each | else | for | from | get | got | has | had | he | have | her | here | him | himself | his | how | if | in | into | is | it | its | just | like | make | many | me | might | more | most | much | must | my | never | now | of | on | only | or | other | our | out | over | re | said | same | see | should | since | so | some | still | such | take | than | that | the | their | them | then | there | these | they | this | those | through | to | too | under | up | use | very | want | was | way | we | well | were | what | when | where | which | while | who | will | with | would | you | your | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z ";

        string[] arrNoiseWord = strNoiseWords.Split("|".ToCharArray());

        foreach (string noiseword in arrNoiseWord)
        {
            sOriginalQuery = sOriginalQuery.Replace(noiseword, " ");
        }
        sOriginalQuery = sOriginalQuery.Replace("  ", " ");
        return sOriginalQuery.Trim();
    }

however, I would probably go with a Regex.Replace for this which should be much faster than looping. I just don't have a quick example to post.

Jonathan Kehayias
Once you add the following line to the start of your method, it works correctly: sOriginalQuery = " " + sOriginalQuery + " ";This is needed to allow matching of noise words that are the first or last word of the search phrase.
Brian Hinchey
A: 

Here's a working function. The file noiseENU.txt is copied as-is from \Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData .

    Public Function StripNoiseWords(ByVal s As String) As String
        Dim NoiseWords As String = ReadFile("/Standard/Core/Config/noiseENU.txt").Trim
        Dim NoiseWordsRegex As String = Regex.Replace(NoiseWords, "\s+", "|") ' about|after|all|also etc.
        NoiseWordsRegex = String.Format("\s?\b(?:{0})\b\s?", NoiseWordsRegex)
        Dim Result As String = Regex.Replace(s, NoiseWordsRegex, " ", RegexOptions.IgnoreCase) ' replace each noise word with a space
        Result = Regex.Replace(Result, "\s+", " ") ' eliminate any multiple spaces
        Return Result
    End Function
Herb Caudill