I'm trying to improve our search capabilities for short phrases (in our case movie titles) and am currently looking at SQL Server 2008 Full Text Search, which provides some of the functionality we would like:
- Word stemming (e.g. "saw" also means "see", "seen", etc.)
- Synonyms (e.g. "6" is synonymous with "VI")
However, the ranking algorithm is proving problematic. I'm using FREETEXTTABLE with the search term and extracting the RANK field. For example, when the user enters "saw", the results we get with our catalogue are:
RANK | Title
---------------------------------------------------------------------
180 | The Exorcist: The version you've never seen
180 | Saw IV
180 | Saw V
180 | Anybody Here Seen Jeannie?
180 | Seeing Red
All of these have the same rank, even though it would be clear to a person that the second and third entries are a better match than the other stemmed terms.
Similarly entering "moon" gives the following results:
RANK | Title
---------------------------------------------------------------------
144 | Pink Floyd - The Dark Side of the Moon
144 | Fly Me To The Moon 3D
144 | Twilight: New Moon
144 | Moon
Here, although there are no stemming matches, it would be clear to a person that the best match for "moon" is "Moon" rather than the longer titles that merely contain it, yet FTS ranks them all equally.
I'm guessing this comes down to the way SQL Server ranks results: it appears to give stemmed words and synonyms the same weight as the original term, and it takes word density into account, which would be good for long passages of text but doesn't really apply to short phrases like these. So I'm starting to think that FTS isn't suitable for this job, unfortunately.
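To make concrete the ordering I'd expect, here is a rough Python sketch of a re-ranking pass over the rows FTS returns. It is not what SQL Server does internally. The names `tokenize`, `score` and `rerank`, the 1.0/0.5 weights and the hand-supplied set of stemmed forms are all assumptions made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase a title and split it into words (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def score(query, title, stem_forms):
    """Hypothetical score: exact word hits beat stemmed hits, short titles beat long ones."""
    words = tokenize(title)
    if not words:
        return 0.0
    q = query.lower()
    exact = sum(1 for w in words if w == q)
    stemmed = sum(1 for w in words if w != q and w in stem_forms)
    # Arbitrary weights: an exact hit counts 1.0, a stemmed hit 0.5.
    # Dividing by the word count favours titles that are mostly the query.
    return (1.0 * exact + 0.5 * stemmed) / len(words)

def rerank(query, titles, stem_forms):
    """Sort candidate titles (e.g. the FTS result rows) by the score above, best first."""
    return sorted(titles, key=lambda t: score(query, t, stem_forms), reverse=True)

# Stemmed forms are supplied by hand here; a real system would get them from a stemmer.
saw_hits = [
    "The Exorcist: The version you've never seen",
    "Saw IV",
    "Saw V",
    "Anybody Here Seen Jeannie?",
    "Seeing Red",
]
saw_ranked = rerank("saw", saw_hits, {"see", "sees", "seen", "seeing"})

moon_hits = [
    "Pink Floyd - The Dark Side of the Moon",
    "Fly Me To The Moon 3D",
    "Twilight: New Moon",
    "Moon",
]
moon_ranked = rerank("moon", moon_hits, set())
```

With these made-up weights, the two "Saw" titles come out ahead of the stemmed matches and "Moon" comes out ahead of the longer titles, which is the behaviour I'm after.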
I don't really want to re-invent the wheel, so are there any existing search solutions that would work for titles and give good rankings plus the stemming/thesaurus functionality? It would also be nice if it had a spell checker to implement "did you mean..." functionality like Google, so "saww" would be corrected to "saw" and "mon" to "moon", etc.
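As a sketch of the "did you mean..." behaviour I have in mind, the snippet below uses Python's standard `difflib.get_close_matches` to suggest the closest known word from a vocabulary built out of the titles. The helper names `build_vocabulary` and `did_you_mean` and the 0.75 cutoff are assumptions for illustration, not part of any existing product.

```python
import difflib
import re

def build_vocabulary(titles):
    """Collect the distinct lowercase words that appear in the catalogue titles."""
    vocab = set()
    for title in titles:
        vocab.update(re.findall(r"[a-z0-9']+", title.lower()))
    return sorted(vocab)

def did_you_mean(word, vocabulary, cutoff=0.75):
    """Return the closest vocabulary word, or None if the word is known or nothing is close."""
    word = word.lower()
    if word in vocabulary:
        return None
    # get_close_matches returns candidates ordered best-first by similarity ratio.
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

titles = [
    "Saw IV",
    "Saw V",
    "Seeing Red",
    "Twilight: New Moon",
    "Moon",
    "Fly Me To The Moon 3D",
]
vocabulary = build_vocabulary(titles)

suggestion_for_saww = did_you_mean("saww", vocabulary)
suggestion_for_mon = did_you_mean("mon", vocabulary)
```

A similarity ratio like this only looks at character overlap, so it would presumably need to be combined with title popularity or keyboard-distance heuristics to get anywhere near Google-style suggestions.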