I have a DB query that will cause a full table scan because it uses a LIKE clause, and it raised a question I was curious about...

Which of the following should run faster in MySQL, or would they both run at the same speed? Benchmarking might answer it in my case, but I'd like to know the reason behind the answer. The column being filtered contains a couple thousand characters, if that's important.

SELECT * FROM users WHERE data LIKE '%=12345%'

or

SELECT * FROM users WHERE data LIKE '%proileId=12345%'

I can come up with reasons why each of these might outperform the other, but I'm curious to know the logic.

+2  A: 

All things being equal, a longer match string should run faster, since it lets the matcher skip through the text being searched in bigger steps and perform fewer comparisons.

For an example of the algorithms behind string matching, see the Boyer-Moore algorithm on Wikipedia.

Of course not all things are equal, so I would definitely benchmark it.

A quick check of the MySQL reference docs turned up the following paragraph:

If you use ... LIKE '%string%' and string is longer than three characters, MySQL uses the Turbo Boyer-Moore algorithm to initialize the pattern for the string and then uses this pattern to perform the search more quickly.
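To illustrate why a longer pattern can mean fewer comparisons, here is a minimal sketch of Boyer-Moore-Horspool, a simplified relative of the Turbo Boyer-Moore variant the docs mention. It is not MySQL's actual code; the function name `horspool_search` and the sample row are made up for illustration, and the patterns are copied verbatim from the question.

```python
def horspool_search(text, pattern):
    """Return (index of first match or -1, number of character comparisons)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0, 0
    # Bad-character table: for each character of the pattern (except the last
    # position), how far the window may move when that character is the text
    # character aligned with the window's last position.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    comparisons = 0
    pos = 0
    while pos <= n - m:
        # Compare the window right to left.
        j = m - 1
        while j >= 0:
            comparisons += 1
            if text[pos + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            return pos, comparisons
        # Characters absent from the pattern allow a full-length jump of m.
        pos += shift.get(text[pos + m - 1], m)
    return -1, comparisons


# Hypothetical column value: 2000 filler characters, then the target.
row = "x" * 2000 + "proileId=12345" + "x" * 10

short_idx, short_cmp = horspool_search(row, "=12345")
long_idx, long_cmp = horspool_search(row, "proileId=12345")
print("short pattern:", short_idx, short_cmp)
print("long pattern: ", long_idx, long_cmp)
```

On this sample the longer pattern needs fewer comparisons because most windows are rejected after one comparison and the window then jumps by the full pattern length. Real data with characters that also occur in the pattern will shrink that advantage, so benchmarking still applies.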

Peter Tillemans
Thanks for the information, though benchmarking wouldn't tell me that they were using that specific algorithm. Though they are likely using something similar.
Allain Lalonde
I just cross-referenced and found that MySQL does use Boyer-Moore, and under which conditions. I updated the answer.
Peter Tillemans
Fantastic. Thanks.
Allain Lalonde
+1  A: 

No difference whatsoever. Because you've got a % sign at the beginning of your LIKE expression, that completely rules out the use of indexes, which can only be used to match a prefix of the string.

So it will be a full table scan either way.

In a significantly sized database (i.e. one which doesn't fit in RAM on your 32 GB server), I/O is the biggest cost by a very large margin, so I'm afraid the string pattern-matching algorithm will not be relevant.

MarkR
True, but it still burns fewer CPU cycles, which is nice to know in the time of Green IT ;-).
Peter Tillemans
In which case, it depends which occurs more often in your field, 'p' or '='. It has to compare every character in the string with the first literal character. If it doesn't find it, it can stop. If you have lots of = but few 'p's, then the '%p' expression is better and vice versa.
MarkR
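The first-character argument in the last comment describes a naive left-to-right scan rather than Boyer-Moore. A minimal sketch of that model, with a made-up function name `naive_search` and made-up sample data (lots of '=' and no 'p' before the target), shows how the frequency of the leading character drives the comparison count. It is an illustration of the reasoning only, not a claim about what MySQL executes.

```python
def naive_search(text, pattern):
    """Return (index of first match or -1, number of character comparisons)."""
    comparisons = 0
    for start in range(len(text) - len(pattern) + 1):
        for j, ch in enumerate(pattern):
            comparisons += 1
            if text[start + j] != ch:
                break  # mismatch: give up on this start position
        else:
            return start, comparisons  # every pattern character matched
    return -1, comparisons


# Hypothetical column value with many '=' characters before the target.
field = "a=1&b=2&c=3&" * 100 + "proileId=12345"

eq_idx, eq_cmp = naive_search(field, "=12345")
p_idx, p_cmp = naive_search(field, "proileId=12345")
print("pattern starting with '=':", eq_idx, eq_cmp)
print("pattern starting with 'p':", p_idx, p_cmp)
```

With this data the pattern starting with 'p' is rejected after a single comparison at every filler position, while the pattern starting with '=' does extra work each time it meets an '='.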