If I have a table like this...

create table #words (
id int identity,
word varchar(1024)
)

insert into #words (word) values ('dock')
insert into #words (word) values ('dockable')

and I run a LIKE query

select id, word from #words where word like '%dock%'

Is there a way to tell which result would be the most accurate?

+1  A: 

You can use the SOUNDEX and DIFFERENCE T-SQL functions to compare words, but you may still need a way to determine which is "most accurate".

For example, run the following queries:

SELECT DIFFERENCE('dock','dock');
SELECT DIFFERENCE('dock','dockable');

DIFFERENCE returns a value from 0 through 4, where 4 means the SOUNDEX codes are most alike. 'dock' against 'dock' gives 4, the best possible result; 'dock' against 'dockable' gives 2, which indicates a weaker match.
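
To apply this to the question, here is a minimal sketch that ranks the LIKE matches by their DIFFERENCE score against the search term. It assumes the #words table from the question; the score alias is just my naming. Because 4 is the closest match, the sort is descending.

```sql
-- Rank the LIKE matches by how closely their SOUNDEX code matches 'dock'.
-- DIFFERENCE returns 0 through 4; 4 is the closest match.
SELECT id,
       word,
       DIFFERENCE(word, 'dock') AS score
FROM #words
WHERE word LIKE '%dock%'
ORDER BY score DESC;
```

With the two rows from the question, 'dock' (score 4) sorts ahead of 'dockable' (score 2). Since there are only five possible scores, expect many ties on a larger table.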

LittleBobbyTables
I hadn't realised T-SQL supported SOUNDEX. Yuck. :)
Will A
+2  A: 

You could try using similarity metrics to get a distance score for each result as compared to the search string. SOUNDEX and the like give you some primitive options, but there are much more sophisticated alternatives, depending on your requirement. The SimMetrics library of functions allows you to compare strings by Hamming distance, Levenshtein distance, etc. Here's a thorough article describing the installation and usage of the library.

Ken Redler
+1  A: 

I would look at using Full Text Search (FTS). CONTAINS is more precise than FREETEXT.

CONTAINS

WHERE CONTAINS(word, 'dock') 

FREETEXT

WHERE FREETEXT (word, 'dock') 

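Once the column has a full-text index, both will be faster than LIKE '%...%', which cannot use an ordinary index seek. FTS also computes a rank for each match (exposed through CONTAINSTABLE and FREETEXTTABLE), so you can order results by relevance. You'll have to test and see whether the results fit your needs.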

OMG Ponies
+4  A: 

For complex multi-word criteria you should use Full Text Search and CONTAINSTABLE. The output of this table function contains a RANK column:

The table produced by CONTAINSTABLE includes a column named RANK. The RANK column is a value (from 0 through 1000) for each row indicating how well a row matched the selection criteria. This rank value is typically used in one of these ways in the SELECT statement:

  • In the ORDER BY clause to return the highest-ranking rows as the first rows in the table.
  • In the select list to see the rank value assigned to each row.
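
A minimal sketch of both uses in one query follows. It assumes a permanent table dbo.words (a hypothetical name) with a unique key index on id and a full-text index on word; as far as I know a full-text index cannot be created on a temp table like #words, so the question's table would need to be made permanent first. Note that the prefix term "dock*" matches words that start with dock, which is narrower than LIKE '%dock%'.

```sql
-- Assumption: dbo.words is a permanent copy of #words with a unique key
-- index on id and a full-text index on the word column.
SELECT w.id,
       w.word,
       ft.[RANK]                       -- rank value in the select list
FROM dbo.words AS w
INNER JOIN CONTAINSTABLE(dbo.words, word, '"dock*"') AS ft
    ON w.id = ft.[KEY]                 -- [KEY] holds the full-text key (id)
ORDER BY ft.[RANK] DESC;               -- highest-ranking rows first
```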

For simple single-word criteria you should implement a Levenshtein distance function in SQL CLR and use it to find the most similar words (or use the one from Ken Redler's linked project).
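
If you just want to see the idea before setting up CLR, here is a rough pure T-SQL sketch of the same distance calculation. This is not the CLR approach and it will be far slower (it fills a table variable with one row per character pair), so treat it as an illustration only. The name dbo.LevenshteinSketch is my own, the syntax assumes SQL Server 2008 or later, and character comparisons follow the database collation, so they are typically case-insensitive.

```sql
-- Hypothetical helper: classic dynamic-programming Levenshtein distance.
CREATE FUNCTION dbo.LevenshteinSketch (@s varchar(1024), @t varchar(1024))
RETURNS int
AS
BEGIN
    -- DATALENGTH counts trailing spaces, which LEN would ignore.
    DECLARE @n int = DATALENGTH(@s), @m int = DATALENGTH(@t);
    DECLARE @i int, @j int, @cost int;
    -- @d holds the distance between the first i chars of @s and first j chars of @t.
    DECLARE @d TABLE (i int NOT NULL, j int NOT NULL, v int NOT NULL,
                      PRIMARY KEY (i, j));

    -- Base cases: turning a prefix into the empty string costs its length.
    SET @i = 0;
    WHILE @i <= @n BEGIN INSERT INTO @d VALUES (@i, 0, @i); SET @i += 1; END
    SET @j = 1;
    WHILE @j <= @m BEGIN INSERT INTO @d VALUES (0, @j, @j); SET @j += 1; END

    SET @i = 1;
    WHILE @i <= @n
    BEGIN
        SET @j = 1;
        WHILE @j <= @m
        BEGIN
            SET @cost = CASE WHEN SUBSTRING(@s, @i, 1) = SUBSTRING(@t, @j, 1)
                             THEN 0 ELSE 1 END;
            -- Cheapest of deletion, insertion, or substitution.
            INSERT INTO @d (i, j, v)
            SELECT @i, @j, MIN(x.v)
            FROM (SELECT v + 1 AS v FROM @d WHERE i = @i - 1 AND j = @j
                  UNION ALL
                  SELECT v + 1 FROM @d WHERE i = @i AND j = @j - 1
                  UNION ALL
                  SELECT v + @cost FROM @d WHERE i = @i - 1 AND j = @j - 1) AS x;
            SET @j += 1;
        END
        SET @i += 1;
    END

    RETURN (SELECT v FROM @d WHERE i = @n AND j = @m);
END
GO

-- Smallest distance first: the closest word to 'dock' comes out on top.
SELECT id, word, dbo.LevenshteinSketch(word, 'dock') AS distance
FROM #words
WHERE word LIKE '%dock%'
ORDER BY distance;
```

For the rows in the question, 'dock' has distance 0 and 'dockable' has distance 4 (four inserted characters), so 'dock' is returned first.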

Remus Rusanu