ansaurus

Question

Answer 1

+3 A:

Most (good) optimizers will make a stab at using an index for a LIKE clause where the wild card is not appearing first. If the pattern starts with a wild card, there is much less that they can do.

If the index is a B-Tree index, as opposed to a hash index (ISAM systems usually use B-Trees), then the leading characters of the clause can be used to constrain the index search. If the system uses hash indexes, then you can't work on partial strings easily, unless you create separate indexes on the first character, then on the first two characters, then the first three characters, ... of the column. An ISAM system might allow you that flexibility; most SQL systems do not and you would have to create columns of 1, 2, 3, ... characters containing the first 1, 2, 3 ... characters of the part number field.

Added: Comments ask "which DBMS?", which is fair. I can vouch for IBM Informix Dynamic Server (IDS) and Standard Engine (SE) in any version you can lay hands on. I would expect both IBM DB2 (LUW or z/OS) to do it; I would expect Oracle to do it. Comments indicate that PostgreSQL 8.0 and above does it - subject to caveats. I can't answer of my own knowledge for Sybase, Ingres, MS SQL Server, Firebird or MySQL. There may be caveats associated with every DBMS about when the index can be used.

Note that if there is another index that provides selectivity, then that may be used in preference to the one that provides access to the wildcard search.

Jonathan Leffler 2009-01-04 21:42:07

Can you mention specific DBMS engines? I tried with PostgreSQL and it does not seem to do this optimization.

bortzmeyer 2009-01-05 12:52:08

PostgreSQL will do it since 8.0, but only for a btree index, an anchored regex (starting with ^) or LIKE clause with no wildcard at the beginning, and the "C" locale.

puetzk 2009-01-05 13:40:41

An index simply on the PartNumber column will be used for a partial match like this in almost every SQL RDBMS I've used (which is about 8 different DBs, including the most popular).

Tom H. 2009-01-05 17:05:17

Answer 2

+1 A:

How about you partition the table on the partnumber field. You could split the table on to different volumes.

Volume A holds a-m
Volume B holds n-z

EDIT Never done this by the way.

See this for theory http://msdn.microsoft.com/en-us/library/ms345146.aspx

John Nolan 2009-01-04 21:42:46

Sounds interesting, but I don't know much about that... exactly what is the benefit there? Would it would run two parallel queries? Would this require a multi-processor box? How many partitions could I split before it becomes inefficient? Any pointers to web reasources I could read?

Michael Bray 2009-01-04 21:45:19

Way premature optimization. Use hardware solutions first - and defer those until you know a lot more about actual performance.

le dorfier 2009-01-04 21:50:13

le dorfier - I agree it probably is premature.

John Nolan 2009-01-04 21:54:46

If you take a little thought about your structure (identify a numerical or date field that's a good candidate) you can retrofit partitioning later on.

ConcernedOfTunbridgeWells 2009-01-04 22:35:48

parts list does seem like a good candidate in this case though - suppose it depends on the values in the table

John Nolan 2009-01-04 23:21:20

Answer 3

+1 A:

This query looks fine! If the field is indexed and you're performing a LIKE 'term%' query, where the wildcard is at the end, you should get optimized execution plans.

Depending on your DBMS, you can check what the optimizer really does with the EXPLAIN keyword.

Henning 2009-01-04 21:45:32

Answer 4

+1 A:

Experiment by partition your table using the first 2 or 3 characters of the part number. Experiment with partition local indexes vs global indexes.

EvilTeach 2009-01-04 21:50:30

Answer 5

+4 A:

I'd guess that your primary key (the GUID) probably has a clustered index. You may want to consider making the primary key NOT be clustered. Instead, you could cluster the index you created for the PartNumber. (there can only be one clustered index per table)

You should also consider adding a TOP predicate to the query, so that only the top 100 (or so) rows are returned. I'm thinking... if the user first types an M, there could be a couple hundred thousand matches, which would be slow to load. By limiting the number of rows, you should get better performance.

G Mastros 2009-01-04 21:52:10

looking at the execution plan, it doesn't appear to be clustered. (the ShowPlan XML Statistics profile says "Index seek" not "clustered...".) Do you know what determines this? Does SQL determine based on the number of records or something else? I am using TOP, but it's last in the plan.

Michael Bray 2009-01-04 21:57:07

oh wait... that is the PartNumber index that says "Index seek" I don't actually see any index seek on that table - instead it says "Key Lookup".

Michael Bray 2009-01-04 21:57:55

Primary Keys are enforced through indexes. By default, SQL Server will cluster the primary key index. By 'un-clustering' it, and clustering the index on the PartNumber column, you will probably get some improvement in performance. Without actually trying it, it's impossible to guess how much.

G Mastros 2009-01-04 22:02:59

Answer 6

A:

I'm curious,

Can you expand your question to include durations for the following 4 queries:

SELECT top 100 NeededFields FROM Parts WHERE PartNumber LIKE '%'
SELECT top 100 NeededFields FROM Parts WHERE PartNumber LIKE 'M%'
SELECT top 100 NeededFields FROM Parts WHERE PartNumber LIKE 'ML%'
SELECT top 100 NeededFields FROM Parts WHERE PartNumber LIKE 'ML0833%'

If it turns out that the first/second query is a ton slower than the last one, you could look at introducing a cache table that maintains these cases (updating it with a trigger or a job)

Also, I think I just noticed something, is your index fully covered? If it is not you are probably getting table scans whenever the result count hits a threshold.

Sam Saffron 2009-01-04 22:41:18

Answer 7

A:

Consider not using SQL for this.

Create some dumps of the data you'll need for your user-interface. A million records isn't very much- consider making 26 files- one for each "initial letter" and store the information. Have it include the most popular entries.

If you want more specificity (or the user starts scrolling), you can make 26*26 more files (AB, AC, AD, etc) for each of the remainder records.

If you insist on using SQL for this, profile it. Try to create an index just for the leading character, e.g.

CREATE INDEX partno_idx ON parts (SUBSTRING(partnumber,0,1))

(or whatever your local SQL dialect is) then use a query like WHERE SUBSTRING(partnumber,0,1)='M'. The result will avoid prefix searches which are rarely faster than whole-string (or whole number) indexes (which might use a hashtable or a skiplist).

Profiling this is important: Perhaps converting the character to its ASCII-code will be faster. Your data set, and your SQL server software and version will be highly relevant.

geocar 2009-01-04 22:45:52

Answer 8

+1 A:

I would use full text search. Your results will be almost instant with this kind of queries.

Thuglife 2009-01-04 22:47:52

Answer 9

A:

PostgreSQL does not seem able to optimize LIKE queries when the wildcard is at the end.

Here, the table Zones is indexed on the column name:

CREATE INDEX index_zones_name ON Zones(name);

And the index is used for = queries:

EXPLAIN SELECT id FROM Zones WHERE name = 'toto.fr';
                              QUERY PLAN                                   
-------------------------------------------------------------------------------
 Index Scan using index_zones_name on zones  (cost=0.00..21.06 rows=4 width=4)
   Index Cond: (name = 'toto.fr'::text)

The query is instantaneous, despite the three millions rows.

But, for LIKE queries:

EXPLAIN SELECT id FROM Zones WHERE name LIKE 'toto%';
                   QUERY PLAN                        
---------------------------------------------------------
 Seq Scan on zones  (cost=0.00..75991.43 rows=1 width=4)
   Filter: (name ~~ 'toto%'::text)

And the query takes much longer.

bortzmeyer 2009-01-05 12:51:32

Answer 10

A:

If you are using mysql, consider creating a partial index:

mysql> CREATE INDEX part_of_name ON customer (name(10));

The index will only match the 10 first characters

MatthieuP 2009-01-05 13:00:38

ansaurus

tags:

views:

answers:

Optimal LIKE search in SQL

related questions