I am wondering what the performance of a query would be like using the LIKE keyword with the bare wildcard as the value, compared to having no WHERE clause at all.

Consider a WHERE clause such as "WHERE a LIKE '%'". This will match all possible values of the column 'a'. How does this compare to not having the WHERE clause at all?

The reason I ask this is that I have an application where there are some fields that the user may specify values to search on. In some cases the user would like all the possible results. I am currently using a single query like this:

SELECT * FROM TableName WHERE a LIKE ? AND b LIKE ?

A value of '%' can be supplied for either parameter to match all possible values of a and/or b. This is convenient since I can use a single named query in my application. I wonder what the performance considerations are. Does the query optimizer reduce LIKE '%' to simply "match all"? I realize that because I'm using a named query (prepared statement), that may also affect the answer, and that the answer is likely database-specific. So, specifically, how would this work in Oracle, MS SQL Server, and Derby?

The alternative approach would be to use four separate queries (the original plus three variants), chosen based on whether the user supplied a wildcard:

When a is a wildcard:

SELECT * FROM TableName WHERE b LIKE ?

When b is a wildcard:

SELECT * FROM TableName WHERE a LIKE ?

When both a and b are wildcards:

SELECT * FROM TableName

When there are no wildcards:

SELECT * FROM TableName WHERE a LIKE ? AND b LIKE ?

Obviously having a single query is the simplest and easiest to maintain. I would rather use just the one query if performance will still be good.
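(For concreteness, here is a sketch of the single-query approach, using Python with SQLite purely for illustration. SQLite is not one of the databases asked about, and the table, columns, and data below are invented, but the parameter-binding pattern is the same.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableName (a TEXT, b TEXT)")
conn.executemany("INSERT INTO TableName VALUES (?, ?)",
                 [("apple", "blue"), ("apricot", "black"), ("cherry", "blue")])

def search(a=None, b=None):
    # Treat a missing search term as "match everything" by binding '%',
    # so a single prepared statement covers every combination of inputs.
    return conn.execute(
        "SELECT * FROM TableName WHERE a LIKE ? AND b LIKE ?",
        (a or "%", b or "%"),
    ).fetchall()

print(search())                  # all three rows
print(search(a="ap%"))           # apple and apricot
print(search(a="ap%", b="bl%"))  # apple and apricot again
```

Leaving a parameter unset simply binds '%', which is exactly the convenience being weighed against the multi-query alternative below.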

+1  A: 

Any DBMS worth its salt would strip out LIKE '%' clauses before even trying to run the query. I'm fairly certain I've seen DB2/z do this in its execution plans.

The prepared statement shouldn't make a difference since it should be turned into real SQL before it gets to the execution engine.

But, as with all optimization questions, measure, don't guess! DBAs exist because they constantly tune the DBMS based on actual data (which changes over time). At a bare minimum, you should time (and get the execution plans) for all variations with suitable static data to see if there's a difference.

I know that queries like:

select c from t where ((1 = 1) or (c = ?))

are optimized to remove the entire WHERE clause before execution (on DB2, anyway). And before you ask: the construct is useful when you need to neutralize the WHERE clause but still keep the parameter placeholder (we use BIRT with JavaScript to modify the queries for wildcards).
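(Side note: this construct is not quite the same as LIKE '%' with respect to NULLs, because TRUE OR NULL evaluates to TRUE, so NULL rows are kept. A quick check in SQLite via Python, with an invented one-column table:)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("x",), (None,)])

# The placeholder must still be bound, even though its value can never matter.
n = conn.execute(
    "SELECT COUNT(*) FROM t WHERE ((1 = 1) OR (c = ?))", ("anything",)
).fetchone()[0]
print(n)  # 2: the NULL row is included, unlike with LIKE '%'
```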

paxdiablo
I don't think this is true, especially if the column being compared is NULLable. LIKE '%' should not return those rows where the column is NULL, so stripping the criteria away should not suddenly reintroduce them to the results.
Aaron Bertrand
The DBMS should (and it *is* "should") be able to tell whether the column is nullable and, if it is, skip that optimization. In any case, that would render the OP's question moot, since they wouldn't be able to drop the clause either.
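(The NULL behaviour being debated here is easy to verify. A minimal check in SQLite via Python, with an invented table; the standard SQL three-valued logic applies: NULL LIKE '%' evaluates to NULL, which is not true, so the row is dropped.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT)")  # nullable column
conn.executemany("INSERT INTO t VALUES (?)", [("x",), ("y",), (None,)])

total    = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
like_all = conn.execute("SELECT COUNT(*) FROM t WHERE a LIKE '%'").fetchone()[0]
not_null = conn.execute("SELECT COUNT(*) FROM t WHERE a IS NOT NULL").fetchone()[0]

print(total, like_all, not_null)  # 3 2 2: LIKE '%' silently drops the NULL row
```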
paxdiablo
+1  A: 

Depending on how the LIKE predicate is structured and on the field you're testing, you might need a full table scan. Semantically, a '%' might imply a full table scan, but SQL Server does all sorts of internal optimization on queries. So the question becomes: does SQL Server optimize a LIKE predicate formed with '%' and throw it out of the WHERE clause?

Paul Sasik
+5  A: 

SQL Server will generally see

WHERE City LIKE 'A%'

and treat it as

WHERE City >= 'A' AND City < 'B'

...and happily use an index seek if appropriate. I say 'generally', because I've seen it fail to do this simplification in certain cases.

If someone's trying to do:

WHERE City LIKE '%ville'

...then an index seek will be essentially impossible.

But something as simple as:

WHERE City LIKE '%'

will be considered equivalent to:

WHERE City IS NOT NULL
Rob Farley
DB2 (at least) has the concept of reverse indexes, where '%ville' is easily optimized (by storing the reversed values in the index and internally rewriting the query to 'elliv%'). You can emulate the same on other DBMSes with an extra column and insert/update triggers.
paxdiablo
Sure, but then %ville% becomes more complicated. If you're looking for whole words, then FullText searching becomes a nicer option.
Rob Farley
+1 for pointing out that `LIKE '%'` will only return rows with non-null values.
Martin B
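(The seek-versus-scan behaviour described in this answer can be observed in other engines too. A sketch using SQLite's EXPLAIN QUERY PLAN via Python; the table and index are invented. Note the PRAGMA, which SQLite needs before it will rewrite a prefix LIKE into an index range, and be aware that the exact plan wording varies between SQLite versions.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT)")
conn.execute("CREATE INDEX idx_city_name ON city(name)")
# SQLite only rewrites LIKE 'A%' into an index range scan when LIKE is
# case-sensitive (or the index is declared COLLATE NOCASE).
conn.execute("PRAGMA case_sensitive_like = ON")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail).
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

print(plan("SELECT * FROM city WHERE name LIKE 'A%'"))     # expect SEARCH (index seek)
print(plan("SELECT * FROM city WHERE name LIKE '%ville'")) # expect SCAN (no seek possible)
```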
+3  A: 

You can use whatever query analysis the DBMS offers (e.g. EXPLAIN for MySQL, SET SHOWPLAN_ALL ON (or one of the other methods) for MS SQL Server, and EXPLAIN PLAN FOR for Oracle) to see how the query will be executed.

outis
A: 

What if a column has a non-NULL blank value? Your query will probably match it.

If this is a query for a real-world application, try the full-text indexing features of most modern SQL databases. The performance issues will become insignificant.

A simple if statement:

if (A and B) search on a and b
else if (A) search on a
else if (B) search on b
else tell the user they didn't specify anything

is trivial to maintain and is much easier to understand than relying on assumptions about the LIKE operator. You are probably going to branch like that in the UI anyway when you display the results ("Your search for A found x" or "Your search for A B found...").

james
A: 

I'm not sure of the value of using a prepared statement with the kind of parameters you're describing. The reason is that you might fool the query optimizer into preparing an execution plan that would be completely wrong depending on which of the parameters were '%'.

For instance, if the statement were prepared with an execution plan using the index on column A, but the parameter for column A turned out to be '%' you may experience poor performance.

Larry Lustig
A: 

A WHERE clause with LIKE '%' as the only predicate will behave exactly the same as no WHERE clause at all.

This is wrong; only non-NULL values would match.
IronGoofy
thanks for the correction!
I meant it will behave the same from a performance perspective, but I obviously wasn't clear enough. Even that is not true in some cases.
+2  A: 

Derby also offers tools for examining the actual query plan that was used, so you can run experiments using Derby and look at the query plan that Derby chose. You can run Derby with -Dderby.language.logQueryPlan=true, and Derby will write the query plan to derby.log, or you can use the RUNTIMESTATISTICS facility, as described here: http://db.apache.org/derby/docs/10.5/tuning/ctundepth853133.html

I'm not sure if Derby will strip out the A LIKE '%' ahead of time, but I also don't think that the presence of that clause will introduce much of a slowdown in the execution speed.

I'd be quite interested to see the actual query plan output that you get in your environment, with and without the A LIKE '%' clause in place.

Bryan Pendleton
+2  A: 

Oracle 10gR2 does not appear to perform a special optimisation for this situation, but it does recognise that LIKE '%' excludes nulls.

create table like_test (col1)
as select cast(dbms_random.string('U',10) as varchar2(10))
from dual
connect by level <= 1000
/
insert into like_test values (null)
/
commit
/

exec dbms_stats.gather_table_stats(user,'like_test')

explain plan for
select count(*)
from   like_test
/
select plan_table_output from table(dbms_xplan.display)
/
explain plan for
select count(*)
from   like_test
where  col1 like '%'
/
select plan_table_output from table(dbms_xplan.display)
/
explain plan for
select count(*)
from   like_test
where  col1 is not null
/
select plan_table_output from table(dbms_xplan.display)
/

... giving ...

Plan hash value: 3733279756

------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Cost (%CPU)| Time     |
------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |     1 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |           |     1 |            |          |
|   2 |   TABLE ACCESS FULL| LIKE_TEST |  1001 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------

... and ...

Plan hash value: 3733279756

--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |     1 |    10 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |           |     1 |    10 |            |          |
|*  2 |   TABLE ACCESS FULL| LIKE_TEST |  1000 | 10000 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("COL1" LIKE '%')

... and ...

Plan hash value: 3733279756

--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |     1 |    10 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |           |     1 |    10 |            |          |
|*  2 |   TABLE ACCESS FULL| LIKE_TEST |  1000 | 10000 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("COL1" IS NOT NULL)

Note the cardinality (Rows) on the TABLE ACCESS FULL line: 1001 with no predicate, but 1000 with either LIKE '%' or IS NOT NULL, since the NULL row is excluded.

David Aldridge
+1  A: 

One aspect that I think is missing from the discussion is the fact that the OP wants to use a prepared statement. At the time the statement is prepared, the database/optimizer cannot apply the simplifications others have mentioned, and so cannot optimize away the a LIKE '%', because the actual value is not known at prepare time.

Therefore:

  • when using prepared statements, have four different statements available (0, only a, only b, both) and use the appropriate one when needed
  • if you stick to just one statement, see whether you get better performance without a prepared statement (although then it would be pretty easy to simply not include the 'empty' conditions)
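(A sketch of the first option: rather than maintaining four statements by hand, the WHERE clause can be assembled from whichever conditions the user actually supplied, so the optimizer never sees a redundant LIKE '%'. Python with SQLite and invented data, purely for illustration.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableName (a TEXT, b TEXT)")
conn.executemany("INSERT INTO TableName VALUES (?, ?)",
                 [("apple", "blue"), (None, "black")])

def search(a=None, b=None):
    # Build only the conditions the user actually supplied.
    conditions, params = [], []
    if a is not None:
        conditions.append("a LIKE ?")
        params.append(a)
    if b is not None:
        conditions.append("b LIKE ?")
        params.append(b)
    sql = "SELECT * FROM TableName"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return conn.execute(sql, params).fetchall()

print(len(search()))       # 2: no WHERE clause at all, NULL row included
print(len(search(a="%")))  # 1: LIKE '%' still excludes the NULL row
```

This also sidesteps the semantic difference: search() returns rows with NULLs, while binding '%' does not.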
IronGoofy
+1  A: 

I was hoping there would be a textbook answer to this, but it sounds like it largely varies between database types. Most of the responses indicated that I should run a test, so that is exactly what I did.

My application primarily targets the Derby, MS SQL and Oracle databases. Since Derby can be run embedded and is easy to set up, I tested performance on it first. The results were surprising. I tested the worst-case scenario against a fairly large table, ran the test 1000 times, and averaged the results.

Query 1:

SELECT * FROM TableName

Query 2 (With values of a="%" and b="%"):

SELECT * FROM TableName WHERE a LIKE ? AND b LIKE ?

Query 1 average time: 178ms

Query 2 average time: 181ms

So performance on Derby is almost the same between the two queries.

Chris Dail