views:

548

answers:

5

Suppose I have a database table with columns a, b, and c. I plan on doing queries on all three columns, but I'm not sure which columns in particular I'm querying. There's enough rows in the table that an index immensely speeds up the search, but it feels wrong to make all the permutations of possible indexes (like this):

a
b
c
a, b
a, c
b, c
a, b, c

Is there a better way to handle this problem? (It's very possible that I'll be just fine indexing a, b, c alone, since this will cut down on the number of rows quickly, but I'm wondering if there's a better way.)

If you need more concrete examples, in the real-life data, the columns are city, state, and zip code. Also, I'm using a MySQL database.

+12  A: 

In MS SQL the index "a, b, c" will cover you for scenarios "a"; "a, b"; and "a, b, c". So you would only need the following indexes:

a, b, c
b, c
c

Not sure if MySQL works the same way, but I would assume so.

Todd Ropog
This is the correct answer. MySQL works the same way, and this technique is called "Leftmost Prefixing". From the MySQL manual at http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html: "If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to find rows. For example, if you have a three-column index on (col1, col2, col3), you have indexed search capabilities on (col1), (col1, col2), and (col1, col2, col3). "
zombat
Hmm, I should have known this. ;) Very awesome, I will give this a shot.
Daniel Lew
You might also need a, c, but it depends on what your queries look like. You may also need the individual index to cover the OR scenario mentioned by Andriyev, not sure.
Todd Ropog
+1  A: 

Hi

The more the indexes you create the more your performance will be hit during update and delete operations. Because the index itself might get updated.

Yes, you can use multiple-column indexes. Something like

CREATE TABLE temp (
    id         INT NOT NULL,
    a          INT NULL,
    b          INT NULL,
    c          INT NULL,
    PRIMARY KEY (id),
    INDEX ind1 (a,b,c),
    INDEX ind2 (a,b)
);

This type of index i.e. ind1 will surely help you in queries like

SELECT * FROM temp WHERE a=2 AND b=3 AND c=4;

Similarly, ind2 will help you in queries like

SELECT * FROM temp WHERE a=2 AND b=3;

But these indexes won't be used if the query is some thing like

SELECT * FROM temp WHERE a=2 OR b=3 OR c=4;

Here you will need separate indexes on a, b, and c.

So instead of having so many indexes, I would agree with what John said i.e. have indexes on a,b,c and if you feel that your workload covers more multi-column queries then you can switch to multi-column indexes.

cheers

Andriyev
This table is rarely updated, so it's of no concern to me if updating is slow.
Daniel Lew
+1  A: 

Given that your columns are actually City, State and Zip Code, I would suggest just the following indexes:

INDEX(ZipCode)

If I am correct, Zip Codes are not duplicated across the USA, so it's pointless adding City or State information to the index as well because they will be the same value for all Zip Codes. E.g., 90210 is always Los Angeles, CA.

INDEX(City(5)) or INDEX(City(5)), State)

This is just an index on the first five letters of the city name. In many cases, this will be specific enough that having the State indexed wouldn't provide any useful filtering. E.g., 'Los A' will almost certainly be records from Los Angeles, CA. Maybe there is another small town in the USA starting with 'Los A', but there will be so few records it's not worth cluttering the index with State data as well. On the other hand, some city names appear in many states (Springfield comes to mind), so in those cases it is better to have the State indexed as well. You will need to figure out for yourself which index is most suited to your set of data. If in doubt, I would go with the second index (City and State).

INDEX(State, *sort_field*)

State is a pretty broad index (quite possibly NY and CA alone will have 30% of the records). If you plan displaying this information to the user, say, 30 records at a time, then you would have a query ending in

... WHERE STATE = "NY"
ORDER BY <sort_field>
LIMIT <number>, 30

To make that query efficient, you need to include the sorting column in the State index. So if you're showing pages ordered by Last Name (presuming you have that column), then you would use INDEX(State, LastName(3)), otherwise MySQL has to sort all of the 'NY' records before it can give you the 30 you want.

too much php
Your info on ZIP codes is not strictly correct. Many ZIP codes have more than one "acceptable place name". For example, "Hollywood, CA" is an acceptable place name for 90028, even though Hollywood is only a district of Los Angeles and not an actual city. The "default place name" for 90028 is actually "Los Angeles, CA". As well, sometimes two cities or portions of two cities will fall within the same ZIP code.It IS true that each ZIP code has exactly one "default place name", but you can't rely on that for user-entered data.
Geerad
As long as there are (in most cases) no more than two or three place names for each Zip Code, the index will still be fine.
too much php
Not sure what the percentages are, but my zip code has four allowable names. And I know of another that also has four.
Todd Ropog
+1  A: 

It's depend on your sql-query.

index (a, b, c) is different to index(b, c, a) or index(a, c, b)

ZA
+3  A: 

To use indexes for all possible equality conditions on N columns, you will need C([N/2], N) indexes, that is N! / ([N/2]! * (N - [N/2])!)

See this article in my blog for detailed explanations:

You can also read the strict mathematical proof by Russian mathematician Egor Timoshenko (update: now in English).

One can, however, get decent performance with less indexes using the following techniques:

Index merging

If the columns col1, col2 and col3 are selective, then this query

SELECT  *
FROM    mytable
WHERE   col1 = :value1
        AND col2 = :value2
        AND col3 = :value3

can use three separate indexes on col1, col2 and col3, select the ROWID's that match each condition separately and them find their intersection, like in:

SELECT  *
FROM    (
        SELECT  rowid
        FROM    mytable
        WHERE   col1 = :value1
        INTERSECT
        SELECT  rowid
        FROM    mytable
        WHERE   col2 = :value2
        INTERSECT
        SELECT  rowid
        FROM    mytable
        WHERE   col3 = :value3
        ) mo
JOIN    mytable mi
ON      mi.rowid = mo.rowid

Bitmap indexing

PostgreSQL can build temporary bitmap indexes in memory right during the query.

A bitmap index is quite a compact contiguous bit array.

Each bit set for the the array tells that the corresponging tid should be selected from the table.

Such an index can take but 128M of temporary storage for a table with 1G rows.

The following query:

SELECT  *
FROM    mytable
WHERE   col1 = :value1
        AND col2 = :value2
        AND col3 = :value3

will first allocate a zero-filled bitmap large enough to cover all possible tid's in the table (that is large enough to take all tid's from (0, 0) to the last tid, not taking missing tid's into account).

Then it will seek the first index, setting the bits to 1 if they satisfy the first condition.

Then it will scan the second index, AND'ing the bits that satisfy the second condition with a 1. This will leave 1 only for those bits that satisfy both conditions.

Same for the third index.

Finally, it will just select rows with the tid's corresponding to the bits set.

The tid's will be fetched sequentially, so it's very efficient.

Quassnoi