Hi,

I have a column, EntityName, and I want users to be able to search it by entering words separated by spaces. Each space is implicitly treated as an 'AND' operator: the returned rows must contain all of the specified words, though not necessarily in the given order.

For example, if we have rows like these:

  1. abba nina pretty balerina
  2. acdc you shook me all night long
  3. sth you are me
  4. dream theater it's all about you

When the user enters "me you" or "you me" (the results must be equivalent), the result should contain rows 2 and 3.

I know I can write:

WHERE Col1 LIKE '%' + word1 + '%'
  AND Col1 LIKE '%' + word2 + '%'

but I wanted to know if there's some more optimal solution.

The CONTAINS would require a full text index, which (for various reasons) is not an option.

Maybe SQL Server 2008 has some built-in, semi-hidden solution for these cases?

+2  A: 

The only thing I can think of is to write a CLR function that does the LIKE comparisons. This should be many times faster.

Update: Now that I think about it, it makes sense CLR would not help. Two other ideas:

1 - Try indexing Col1 and do this:

WHERE (Col1 LIKE word1 + '%' or Col1 LIKE '%' + word1 + '%')
  AND (Col1 LIKE word2 + '%' or Col1 LIKE '%' + word2 + '%')

Depending on the most common searches (starts with vs. substring), this may offer an improvement.

2 - Add your own full text indexing table where each word is a row in the table. Then you can index properly.
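
A minimal sketch of idea 2, assuming SQL Server 2008 syntax. Every object name here (dbo.EntityWord, EntityId, @Terms) is hypothetical, and the sketch assumes the word table is filled by the application or by a splitter routine that is not shown. Note that it matches whole words rather than arbitrary substrings, so it is not a drop-in equivalent of the '%word%' search.

```sql
-- Hypothetical word table: one row per (word, entity) pair.
-- The clustered key on (Word, EntityId) lets lookups by word use a seek.
CREATE TABLE dbo.EntityWord (
    Word     varchar(100) NOT NULL,
    EntityId int          NOT NULL,
    CONSTRAINT PK_EntityWord PRIMARY KEY CLUSTERED (Word, EntityId)
);

-- Search terms entered by the user, one row per distinct term.
DECLARE @Terms TABLE (Term varchar(100) NOT NULL PRIMARY KEY);
INSERT INTO @Terms (Term) VALUES ('me'), ('you');

-- Keep only the entities that matched every term,
-- regardless of the order in which the terms were entered.
SELECT w.EntityId
FROM dbo.EntityWord AS w
JOIN @Terms AS t
  ON w.Word = t.Term
GROUP BY w.EntityId
HAVING COUNT(DISTINCT t.Term) = (SELECT COUNT(*) FROM @Terms);
```

If prefix matching is acceptable, the join condition could become w.Word LIKE t.Term + '%', which can still use the index because the pattern has no leading wildcard.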

RedFilter
Even though I was against it at first, it seems that's the best solution so far...
veljkoz
After I've tried it, I want to add an update to this - it's just incredibly slow... if the 'like' method finishes in 10 seconds, this CLR function needs ...well I don't know - I just stopped it after 20 mins... so this solution is shelved as well...
veljkoz
Post your code...
RedFilter
@veljkoz: see my update
RedFilter
Option 1 doesn't cover the cases where the rows don't start with the search word (but it is faster because it can use the index in that case). Option 2 is OK, and we were already thinking about it. Thanks for the updates!
veljkoz
@veljkoz: That is incorrect, #1 does cover substring matches, see the `OR` clause.
RedFilter
Oh, yes, you're right - sorry...
veljkoz
+1  A: 

You're going to end up with a full table scan anyway.

The collation can make a big difference apparently. Kalen Delaney in the book "Microsoft SQL Server 2008 Internals" says:

Collation can make a huge difference when SQL Server has to look at almost all characters in the strings. For instance, look at the following:

SELECT COUNT(*) FROM tbl WHERE longcol LIKE '%abc%'

This may execute 10 times faster or more with a binary collation than a nonbinary Windows collation. And with varchar data, this executes up to seven or eight times faster with a SQL collation than with a Windows collation.
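
As a hedged illustration of the point in the quote, a collation can be applied to a single comparison with the COLLATE clause instead of changing the column definition. The table and column names are the ones from the quoted query; Latin1_General_BIN2 is just one binary collation chosen for the example. A binary collation compares code points, so the match becomes case- and accent-sensitive, which may or may not be acceptable for a name search.

```sql
-- Same query as in the quote, but forcing a binary collation
-- for this one comparison only.
SELECT COUNT(*)
FROM tbl
WHERE longcol COLLATE Latin1_General_BIN2 LIKE '%abc%';
```

Whether this actually helps depends on the data and on the collation the column already uses, so it is worth timing both variants on real data.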

Martin Smith
This is a good point, but we already have collations set up appropriately...
veljkoz
+1  A: 
WITH Tokens AS (SELECT 'you' AS Token UNION ALL SELECT 'me')
SELECT ...
FROM YourTable AS t
-- Keep a row only if the number of tokens it matches equals the total number of tokens.
WHERE (SELECT COUNT(*) FROM Tokens WHERE t.Col1 LIKE '%' + Tokens.Token + '%')
    = (SELECT COUNT(*) FROM Tokens);
AlexKuznetsov
An interesting approach, but unfortunately painfully slow...
veljkoz