views:

69

answers:

2

Hi:

My db is running on mysql v5.x. I have a table T1 with 5 columns and column C1 is the primary key. C1 is of type varchar(20). It contains about 2000 rows with values like:

fxg
axt3
tru56
and so on.. 

Now my application's job is to read input data and find if the input data has a starting pattern similar to that found in column C1 in table T1. For example: my input may appear as:

    trx879478986
    fxg87698x84
    784xtr783utr
    axt3487ghty
... and so on

So for the above input, I have to return true for 'fxg87698x84' and 'axt3487ghty' and false for others. The query I use is:

select 1 from T1 where (? like concat(C1,'%'));
note: the ? is replaced by the input value got from the application.

The issue is my input is huge (about 1 million records to be processed in 30 minutes) and my query is not fast enough. Any ideas on how to re-write the query or force it to use indexes? Even if I have to use a different object structure, I can do it, if that helps. So any help will be appreciated. Thx.

+1  A: 

the way your problem is set up, you almost by definition need to check every row in the database against every input doing it the way you are currently doing it. The index doesn't really matter in this case since any row could be a match.

I'm not sure it would be faster, but one thing you could try would be to query the database for an exact match on every possibly valid substring of your input.

For example, if you know that your substrings have to be at least length 3 to match, start with the first 3 characters: trx879478986 => trx, trx8, trx87, ...

Build an array of those possible matches and use the IN() operator to query for them:

SELECT 1 FROM T1 WHERE c1 IN ($array_of_strings);

I'm pretty sure mysql can use an index to match against a list of values given to IN()

Ty W
What I was going to suggest - only I'd have added an 'ORDER BY CHAR_LENGTH (c1)' to favour a match of 'ATX12345' against 'ATX'
symcbean
+1, what I was typing. This does use indexes, and by avoiding `LIKE` you avoid the problem of what happens if C1 contains a `%` or `_` character.
bobince
Thx for the input. But the combination of valid substrings I have to do is between 3 chars and 20 characters and that extra processing sort of offsets the performance gain that I may get.
Abdullah
do you know that to be true or are you guessing at performance costs?
Ty W
A: 

you could try a Top-N query to find the first candidate, and then apply that candidate only to the actual pattern:

select 1 
  from (select c1 
          from junk 
         where c1 <= 'fxg87698x84'
         order by c1 desc limit 1) tmp 
 where 'fxg87698x84' like concat(c1, '%');

the top-n query should use a regular index on c1.

EDIT: Explained that in more detail in my blog: http://blog.fatalmind.com/2010/09/29/finding-the-best-match-with-a-top-n-query/

Markus Winand
Beautiful!!. This really helps to reduce my full table scan. thx again. - Abdullah
Abdullah