views:

859

answers:

10

I have a parts database that I am going to be constantly querying for a quoting system. The parts database has 1,400,000+ records in it. The users are just going to start typing part numbers, which they expect the system to be able to find after only a few characters, so I need to be able to do a wildcard search, something like:

SELECT NeededFields FROM Parts WHERE PartNumber LIKE 'ML%'

Is there any kind of optimization that I can perform to try to wring the most performance out of this type of query? I have the PartNumber field indexed, but I'm not sure if that is the best that I can get. I'd be willing to consider alternate indexing structures built into the database separate from SQL indexes too. The primary key is a Guid, but I need this for replication and because of specific data structures that I use.

+3  A: 

Most (good) optimizers will make a stab at using an index for a LIKE clause where the wild card is not appearing first. If the pattern starts with a wild card, there is much less that they can do.

If the index is a B-Tree index, as opposed to a hash index (ISAM systems usually use B-Trees), then the leading characters of the clause can be used to constrain the index search. If the system uses hash indexes, then you can't work on partial strings easily, unless you create separate indexes on the first character, then on the first two characters, then the first three characters, ... of the column. An ISAM system might allow you that flexibility; most SQL systems do not and you would have to create columns of 1, 2, 3, ... characters containing the first 1, 2, 3 ... characters of the part number field.

Added: Comments ask "which DBMS?", which is fair. I can vouch for IBM Informix Dynamic Server (IDS) and Standard Engine (SE) in any version you can lay hands on. I would expect both IBM DB2 (LUW or z/OS) to do it; I would expect Oracle to do it. Comments indicate that PostgreSQL 8.0 and above does it - subject to caveats. I can't answer of my own knowledge for Sybase, Ingres, MS SQL Server, Firebird or MySQL. There may be caveats associated with every DBMS about when the index can be used.

Note that if there is another index that provides selectivity, then that may be used in preference to the one that provides access to the wildcard search.

Jonathan Leffler
Can you mention specific DBMS engines? I tried with PostgreSQL and it does not seem to do this optimization.
bortzmeyer
PostgreSQL will do it since 8.0, but only for a btree index, an anchored regex (starting with ^) or LIKE clause with no wildcard at the beginning, and the "C" locale.
puetzk
An index simply on the PartNumber column will be used for a partial match like this in almost every SQL RDBMS I've used (which is about 8 different DBs, including the most popular).
Tom H.
+1  A: 

How about you partition the table on the partnumber field. You could split the table on to different volumes.

Volume A holds a-m
Volume B holds n-z

EDIT Never done this by the way.

See this for theory http://msdn.microsoft.com/en-us/library/ms345146.aspx

John Nolan
Sounds interesting, but I don't know much about that... exactly what is the benefit there? Would it would run two parallel queries? Would this require a multi-processor box? How many partitions could I split before it becomes inefficient? Any pointers to web reasources I could read?
Michael Bray
Way premature optimization. Use hardware solutions first - and defer those until you know a lot more about actual performance.
le dorfier
le dorfier - I agree it probably is premature.
John Nolan
If you take a little thought about your structure (identify a numerical or date field that's a good candidate) you can retrofit partitioning later on.
ConcernedOfTunbridgeWells
parts list does seem like a good candidate in this case though - suppose it depends on the values in the table
John Nolan
+1  A: 

This query looks fine! If the field is indexed and you're performing a LIKE 'term%' query, where the wildcard is at the end, you should get optimized execution plans.

Depending on your DBMS, you can check what the optimizer really does with the EXPLAIN keyword.

Henning
+1  A: 

Experiment by partition your table using the first 2 or 3 characters of the part number. Experiment with partition local indexes vs global indexes.

EvilTeach
+4  A: 

I'd guess that your primary key (the GUID) probably has a clustered index. You may want to consider making the primary key NOT be clustered. Instead, you could cluster the index you created for the PartNumber. (there can only be one clustered index per table)

You should also consider adding a TOP predicate to the query, so that only the top 100 (or so) rows are returned. I'm thinking... if the user first types an M, there could be a couple hundred thousand matches, which would be slow to load. By limiting the number of rows, you should get better performance.

G Mastros
looking at the execution plan, it doesn't appear to be clustered. (the ShowPlan XML Statistics profile says "Index seek" not "clustered...".) Do you know what determines this? Does SQL determine based on the number of records or something else? I am using TOP, but it's last in the plan.
Michael Bray
oh wait... that is the PartNumber index that says "Index seek" I don't actually see any index seek on that table - instead it says "Key Lookup".
Michael Bray
Primary Keys are enforced through indexes. By default, SQL Server will cluster the primary key index. By 'un-clustering' it, and clustering the index on the PartNumber column, you will probably get some improvement in performance. Without actually trying it, it's impossible to guess how much.
G Mastros
A: 

I'm curious,

Can you expand your question to include durations for the following 4 queries:

SELECT top 100 NeededFields FROM Parts WHERE PartNumber LIKE '%'
SELECT top 100 NeededFields FROM Parts WHERE PartNumber LIKE 'M%'
SELECT top 100 NeededFields FROM Parts WHERE PartNumber LIKE 'ML%'
SELECT top 100 NeededFields FROM Parts WHERE PartNumber LIKE 'ML0833%'

If it turns out that the first/second query is a ton slower than the last one, you could look at introducing a cache table that maintains these cases (updating it with a trigger or a job)

Also, I think I just noticed something, is your index fully covered? If it is not you are probably getting table scans whenever the result count hits a threshold.

Sam Saffron
A: 

Consider not using SQL for this.

Create some dumps of the data you'll need for your user-interface. A million records isn't very much- consider making 26 files- one for each "initial letter" and store the information. Have it include the most popular entries.

If you want more specificity (or the user starts scrolling), you can make 26*26 more files (AB, AC, AD, etc) for each of the remainder records.

If you insist on using SQL for this, profile it. Try to create an index just for the leading character, e.g.

CREATE INDEX partno_idx ON parts (SUBSTRING(partnumber,0,1))

(or whatever your local SQL dialect is) then use a query like WHERE SUBSTRING(partnumber,0,1)='M'. The result will avoid prefix searches which are rarely faster than whole-string (or whole number) indexes (which might use a hashtable or a skiplist).

Profiling this is important: Perhaps converting the character to its ASCII-code will be faster. Your data set, and your SQL server software and version will be highly relevant.

geocar
+1  A: 

I would use full text search. Your results will be almost instant with this kind of queries.

Thuglife
A: 

PostgreSQL does not seem able to optimize LIKE queries when the wildcard is at the end.

Here, the table Zones is indexed on the column name:

CREATE INDEX index_zones_name ON Zones(name);

And the index is used for = queries:

EXPLAIN SELECT id FROM Zones WHERE name = 'toto.fr';
                              QUERY PLAN                                   
-------------------------------------------------------------------------------
 Index Scan using index_zones_name on zones  (cost=0.00..21.06 rows=4 width=4)
   Index Cond: (name = 'toto.fr'::text)

The query is instantaneous, despite the three millions rows.

But, for LIKE queries:

EXPLAIN SELECT id FROM Zones WHERE name LIKE 'toto%';
                   QUERY PLAN                        
---------------------------------------------------------
 Seq Scan on zones  (cost=0.00..75991.43 rows=1 width=4)
   Filter: (name ~~ 'toto%'::text)

And the query takes much longer.

bortzmeyer
A: 

If you are using mysql, consider creating a partial index:

mysql> CREATE INDEX part_of_name ON customer (name(10));

The index will only match the 10 first characters

MatthieuP