Hi,

I don't know much about database optimization, but I'm trying to understand this case.

Say I have the following table:

cities
===========
state_id integer
name varchar(32)
slug varchar(32)

Now, say I want to perform queries like this:

SELECT * FROM cities WHERE state_id = 123 AND slug = 'some_city'
SELECT * FROM cities WHERE state_id = 123

If I want the "slug" for a city to be unique within its particular state, I'd add a unique index on state_id and slug.
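Something like this, I imagine (I'm not sure I have the exact syntax right, and the index name is just an example):

CREATE UNIQUE INDEX cities_state_id_slug_idx ON cities (state_id, slug);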

Is that index enough? Or should I also add another on state_id so the second query is optimized? Or does the second query automatically use the unique index?

I'm working with PostgreSQL, but I feel this case is so simple that most DBMSs work similarly.

Also, I know this surely doesn't make a difference on small tables, but my example is just a simplified one. Think of tables with 200k+ rows.

Thanks!

A: 

For optimization, use EXPLAIN (http://www.postgresql.org/docs/7.4/static/sql-explain.html) and see for yourself. But optimization is not the most important reason to create the unique index; first and foremost it is a constraint that keeps the data consistent, preventing two cities in the same state from sharing a slug.
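For example (the exact plan you get will depend on your data and your PostgreSQL version):

EXPLAIN SELECT * FROM cities WHERE state_id = 123 AND slug = 'some_city';
-- or run the query for real and get actual timings:
EXPLAIN ANALYZE SELECT * FROM cities WHERE state_id = 123;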

+1  A: 

[EDIT: Misread the question... Hopefully my answer is more relevant now!]

In your case, I'd suggest 1 index on (state_id, slug). If you ever need to search just by slug, add an index on just that column. If you have those, then adding another index on state_id is unnecessary as the first index already covers it.
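If slug-only lookups do turn out to be needed, that extra index might look like this (the index name is just an example):

CREATE INDEX cities_slug_idx ON cities (slug);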

An index can be used whenever an initial segment of its columns appears in a WHERE clause. So e.g. an index on columns A, B and C will optimise queries with WHERE clauses involving A, B and C, WHERE clauses with just A and B, or WHERE clauses with just A. Note that the order in which the columns appear in the index definition is very important -- this example index cannot be used for WHERE clauses involving just B and/or C.

(Of course it's up to the query optimiser whether or not a particular index actually gets used, but in your case with 200k rows you can be all but certain that a simple search by state_id or slug or both will use one of the indices.)

j_random_hacker
+1  A: 

Any decent optimizer will see an index on three columns - say:

CREATE INDEX idx_1 ON SomeTable(Col1, Col2, Col3);

and will use that index for any of the following conditions:

WHERE Col1 = ...something...

WHERE Col1 = ...something... AND Col2 = ...otherthing...

WHERE Col3 = ....whatnot....
  AND Col1 = ...something....
  AND Col2 = ...otherthing...

That is, it will use the index if there are conditions applied to any contiguous leading subset of the index's columns. Although I used equality, this also applies to range conditions, whether open (just greater than some value, for example) or closed (between two values).
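For instance, a sketch against the same hypothetical table -- the closed range on Col2 can still use idx_1 because the leading column Col1 is also constrained:

SELECT * FROM SomeTable
 WHERE Col1 = 42                  -- equality on the leading column
   AND Col2 BETWEEN 10 AND 20;    -- closed range on the second column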

Jonathan Leffler
A: 

Say there were millions of rows: how would it matter?

I just did a quick test with explain as Remco suggests, and I found something interesting.

If I have only a unique index on both columns, the second query shows a "bitmap heap scan" along with a "bitmap index scan".

If I add another index on state_id, the query shows only an "index scan".

So I guess adding that index does help a bit? I didn't think this would be the case. Would the query optimizer work differently if there were hundreds of thousands (or millions) of rows? Anyone care to explain?
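For reference, my test looked roughly like this (the index name is just an example):

-- with only the unique index on (state_id, slug) in place:
EXPLAIN SELECT * FROM cities WHERE state_id = 123;
-- then after adding a single-column index:
CREATE INDEX cities_state_id_idx ON cities (state_id);
EXPLAIN SELECT * FROM cities WHERE state_id = 123;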

Thanks!

Ivan
+1  A: 

A single unique index on (state_id, slug) should be sufficient. To be sure, of course, you'll need to run EXPLAIN and/or ANALYZE (perhaps with the help of something like http://explain.depesz.com/), but ultimately what indexes are appropriate depends very closely on what kind of queries you will be running. Remember, indexes make SELECTs faster and INSERTs, UPDATEs, and DELETEs slower, so you ideally want only as many indexes as are actually necessary.

Also, PostgreSQL has a smart query optimizer: it will use radically different search plans for queries on small tables and huge tables. If the table is small, it will just do a sequential scan and not even bother with any indexes, since the overhead of working with them is higher than just brute-force sifting through the table. This changes to a different plan once the table size passes a threshold, and may change again if the table grows further, or if you change your SELECT, or....
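If you want to see this behaviour for yourself while experimenting, you can nudge the planner with the enable_seqscan setting (a real PostgreSQL planner knob, but only useful for testing, not something to leave on in production):

EXPLAIN SELECT * FROM cities WHERE state_id = 123;  -- the plan the optimizer prefers
SET enable_seqscan = off;                           -- discourage sequential scans
EXPLAIN SELECT * FROM cities WHERE state_id = 123;  -- should now pick an index-based plan if one exists
SET enable_seqscan = on;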

Summary: you can't trust the results of EXPLAIN and ANALYZE on datasets much smaller or different than your actual data. Make it work, then make it fast later (if you need to).

kquinn