ansaurus

Question

Searching and ranking short phrases (e.g. movie titles)

Answer 1

+4 A:

Sounds like SQL FTS's ranking is close, but not exactly, what you're looking for, and that you've narrowed down the "not exactly" cases to three:

inflections are ranked identically to non-inflected forms
words are ranked identically to their synonyms
exact matches (or short titles) are ranked identically as one-word matches within longer titles

What all three of these have in common is that a very simple, automated post-processor on results could use these rules to break ties between identically-ranked results: if there's an exact match, rank it above a non-exact match, and rank shorter titles ahead of longer ones. You may want to consider keeping FTS, and simply putting some code (either in a stored proc or in your app) on top of FTS which sorts groups of tied results by the criteria you mentioned. This would probably be easier than switching to Lucene or another non-Microsoft full-text search implementation.

If I were in your shoes, since you've already got something working with FTS, I'd try the post-processing hack above and see if it's "good enough" to meet your needs, since it'd probably be the easiest thing to do.

If it's not good enough, I'd start by looking at Lucene.NET (free), Solr (free), and dtSearch ($$$). Note that none will be as easy as FTS is, though-- especially Lucene.NET which is AFAIK the most popular and is very full-featured but requires a fair amount of coding, configuration, maintenance, etc. You can see this SO thread for some other opinions, there are probabaly more threads like this on SO and elsewhere if you want more opinions.

If you're looking for a "did you mean..." spelling-suggestion feature. There's an example of building this kind of feature on top of FTS in Pro Full-Text Search in SQL Server 2008 (link contains some excerpts from Google Books). Would this meet your needs? If not, there are lots of other options both free and not.

Justin Grant 2009-11-23 18:49:59

I can speak from experience about the ease of entering queries into dtSearch... It's far easier than SQL server. Unfortunately plugging the results into a computer program and setting it up in a server environment are not easier than SQL server... (Of course I'm not a DB guy so that could indicate a bias in my thinking...)

Jason D 2009-11-27 06:15:18

Answer 2

+1 A:

I know you're not interested in reinventing the wheel, but I wanted to contribute something that might at least get your wheels turning.

'How to Strike a Match' is one of my favorite posts on this topic. In it, the author matches strings based on the similarity of sequential doublets between words.

For example, 'search' and 'smirch' are both broken into doublets by letter: se, ea, ar, rc, ch for search and sm, mi, ir, rc, ch for smirch. Then the number of matching doublets is multiplied by two ( rc and ch match, so 2 * 2) and divided by the total number of doublets (5 + 5 = 10 in this case). 4/10 = 40% match between search and smirch.

This penalizes long, unrelated strings because they increase the denominator without increasing the numerator.

In your second example, this algorithm would highlight moon as the best example, but would not fail to retain Dark Side of the Moon, etc - they would just rank lower. In your first example, you would have to apply some sort of lexical conversion before invoking this algorithm, because it will fail to find similar words that stem-change (like see / saw / seen) although it will do well with non-stem-changers (France / French).

I have not contemplated how to implement this directly in an SQL application.

jpp 2009-11-26 17:22:18

Answer 3

A:

Having worked with both SQL Server (2005) FullText and Lucene(.NET) in a production environment, i really think that Lucene is the better choice:

SQL Server FTS is nice and quick; but you can't really manipulate the way that the indexes are generated. Also, you can't just "see" the index tables. All the implementation is hidden away and as such this tool is great for out-of-the-box generic FTS but harder for specific applications.

Lucene on the other hand has been used and tested in a multitude of scenarios (I highly recommend Lucene in Action if you decide to take that route). Even if the existing implementations do not fit exactly your needs you can always create a "new" specific implementation(write your own analyzer/tokenizer/filter - stemmer!! - 1), although the amount of parametrization of lucene is plenty (2) and you can always check the contents of the index using Luke (3). You also get an search application that is independent of the datastore (4), and it works equally well for Java && .NET (5). Additionally, and if that makes you tick, there is also Hibernate && NHibernate implementation (Hibernate Search - 6).

Jaguar 2009-11-30 08:22:02

ansaurus

tags:

views:

answers:

Searching and ranking short phrases (e.g. movie titles)

related questions