views:

171

answers:

4

I have a field in my table having text data type.

Is there a difference in performance for the following two sql queries:

 select * from tablename where fieldname="xyz%";
 select * from tablename where fieldname="%zyx";

If we were to implement the execution of these queries, this is what I think we would need to do:

We have to match the two regexes (xyz* and *zyx).

We will have to check the string chars one by starting from the beginning.

For the first query we will have to read the first three characters to see if there is a match but for the second one we will have to read till the we get the end of the string to determine if the match has occurred. But if we have the length of the string stored somewhere we can directly read the last three characters giving similar performance as the first case.

My question is whether commercial databases like mysql and oracle show any difference in the performance in the execution of the queries.

+2  A: 

If fieldname is indexed, most of commercial databases can transform the first query into an interval search

select * from tablename where fieldname>="xyz" and fieldname<"xy{"

which is very fast.

grep
what about the second one, does it have any effect.
iamrohitbanga
btw xyz could be anything. It could be '2wssdfj,fsf@sef34' for example.
iamrohitbanga
@iamrohitbanga this conversation become more and more abstract. Actually nobody knows what do you mean or want. Can you please devise an example on some real life grounds? Thanks
Col. Shrapnel
In which case your interval search would be >='2wssdfj,fsf@sef34' and < '2wssdfj,fsf@sef35'. The key point is that incrementing the ASCII value of the last character and doing a range search is identical to the 'string%'.
JulesLt
+6  A: 

There is definitely difference between performance on all DB's. First case will be definitely faster if column is indexed.

I had similar instance in my project where user was also allowed to search "ends with" (like your second query).

As this was frequently used operation and query was slow,

  1. We added additional column to table which stored reverse of fieldname.
  2. indexed this column
  3. whenever ends with was searched , we searched in this new column :) (by reversing original search string)

so your second query becomes:

 select * from tablename where fieldname_rev="xyz%";

This approach made it as fast as starts with query.

YoK
@Col. Shrapnel thats what i am looking for. @YoK is this documented anywhere in the theory or in db manuals.Can you cite some such source? Thanks a lot.
iamrohitbanga
@iamrohitbanga I don't know is it in theory or db manuals ? But as I told I am writing this from my project experience. I will try to find is it in some theory.
YoK
@iamrohitbanga lol so you can't even tell what are you looking for without a help of someone? You need to work on that. Knowing exactly what you want to do is always helps
Col. Shrapnel
Hey Col. I mentioned the same thing before, it is just that YoK has had a practical experience with this issue. He has additionally reworded the problem.I was just curious to find out if there is any difference in the performance. The use case is general. I just want to know if a starts with match is diff from an ends with match. I feel that if the length of the string is used in implementing the ends with, there should not be much difference in the implementations. However if you use a lexical analyzer it would read the string from beginning to the end. But what about dbs? what do they use?
iamrohitbanga
+3  A: 

Picking up from your comment : " I just want to know if a starts with match is diff from an ends with match".

Firstly - remember that we are not looking for the best algorithm to match a string. We are looking for the best algorithm to find all matching strings in a set of N rows. We want to do better than 'Do algorithm X, N times'.

If fieldname is NOT indexed, then there will be very little difference in performance between the two queries - the SQL engine is just going to do a match on the first 3 or last 3 bytes of the string, which is simply a matter of offsetting to the right memory location.

If the fieldname IS indexed, there will be a huge difference in performance between the two searches, because rather than examining all N rows, we can discard most of the data.

i.e. for the "xyz%" version, we can use a binary search.

We start at the middle element, which happens to be 'peter'. We can immediately discard everything before 'peter' and get the middle element on the remainder - 'samantha', and so on, until we find the entries starting 'xyz'.

With the "%xyz" version, we cannot do this, as ANY string could potentially match at the end, we need to look at every string.

As the size of our table expands, the difference between these two approaches becomes large.

The solution of creating a field/index for the reverse of fieldname allows us to use the binary search technique again. (In some databases it is actual possible to do this without creating an extra field, but through using particular index types, virtual columns, etc).

This is simplified a lot - for detail on the actual implementation of database indexes, look into B-Tree and B*Tree indexes.

JulesLt
+1  A: 

Yes, there is a difference between the following two queries:

select * from tablename where fieldname LIKE "xyz%";
select * from tablename where fieldname LIKE "%zyx";
  1. The equals ("=") operator doesn't allow wildcards in SQL - you need to use LIKE
  2. The queries are entirely different
    • "xyz%" will return records that start with "xyz"
    • "%xyz" will return records that end with "xyz"
  3. Assuming an index exists on the fieldname column, "%xyz" can not use the index - but"xyz%" could, which means it would be faster.

The fastest means of finding substrings within text is to use Full Text Search (FTS) - both Oracle and MySQL have their own native functionality, and there are 3rd party tools like Sphinx and Solr.

OMG Ponies