views:

765

answers:

3

In ActiveRecord there are two ways to declare indexes for multiple columns:

add_index :classifications, [:species, :family, :trivial_names]
add_index :classifications, :species
add_index :classifications, :family
add_index :classifications, :trivial_names

Is there any difference between the first approach and the second one? If so, when should I use the first and when the second?

+1  A: 

From the docs:

When creating an index on multiple columns, the first column is used as a name for the index. For example, when you specify an index on two columns [:first, :last], the DBMS creates an index for both columns as well as an index for the first column :first. Using just the first name for this index makes sense, because you will never have to create a singular index with this name.

Use the first method when creating a compound index, and the second when creating indexes on single attributes.

There are some good points here on when to use compound indexes, but the gist is that they are good when utilizing a where on multiple attributes. Note that they should be used alongside other indexes (always index your foriegn keys) - not as a replacement.

Mr. Matt
Thanks! But... when does it make sense to use a compound index instead of an index on a single attribute? Can you give me some examples?
collimarco
Updated my answer
Mr. Matt
You said not to use them as a replacement, but what if I only use them in this query:SELECT * FROM classifications WHERE species LIKE '%sth%' OR family LIKE '%sth%' OR trivial_names LIKE '%sth%'In this case, is it correct to use only the compound index?
collimarco
Yup - that should be fine.
Mr. Matt
+4  A: 

The two approaches are different. The first creates a single index on three attributes, the second creates three single-attribute indices. Storage requirements will be different, although without distributions it's not possible to say which would be larger.

Indexing three columns [A, B, C] works well when you need to access for values of A, A+B and A+B+C. It won't be any good if your query (or find conditions or whatever) doesn't reference A.

When A, B and C are indexed separately, some DBMS query optimizers will consider combining two or more indices (subject to the optimizer's estimate of efficiency) to give a similar result to a single multi-column index.

Suppose you have some e-commerce system. You want to query orders by purchase_date, customer_id and sometimes both. I'd start by creating two indices: one for each attribute.

On the other hand, if you always specify purchase_date and customer_id, then a single index on both columns would probably be most efficient. The order is significant: if you also wanted to query orders for all dates for a customer, then make the customer_id the first column in the index.

Mike Woodhouse
+3  A: 

You are comparing a composite index with a set of independent indices. They are just different.

Think of it this way: a compound index gives you rapid look-up of the first field in a nested set of fields followed by rapid look-up of the second field within ONLY the records already selected by the first field, followed by rapid look-up of the third field - again, only within the records selected by the previous two indices.

Lets take an example. Your database engine will take no more than 20 steps to locate a unique value within 1,000,000 records (if memory serves) if you are using an index. This is true whether you are using using a composite or and independent index - but ONLY for the first field ("species" in your example although I'd think you'd want Family, Species, and then Common Name).

Now, let's say that there are 100,000 matching records for this first field value. If you have only single indices, then any lookup within these records will take 100,000 steps: one for each record retrieved by the first index. This is because the second index will not be used (in most databases - this is a bit of a simplification) and a brute force match must be used.

If you have a composite index then your search is much faster because your second field search will have an index within the first set of values. In this case you'll need no more than 17 steps to get to your first matching value on field 2 within the 100,000 matches on field 1 (log base 2 of 100,000).

So: steps needed to find a unique record out of a database of 1,000,000 records using a composite index on 3 nested fields where the first retrieves 100,000 and the second retrieves 10,000 = 20 + 17 + 14 = 51 steps.

Steps needed under the same conditions with just independent indices = 20 + 100,000 + 10,000 = 110,020 steps.

Big difference, eh?

Now, don't go nuts putting composite indices everywhere. First, they are expensive on inserts and updates. Second, they are only brought to bear if you are truly searching across nested data (for another example, I use them when pulling data for logins for a client over a given date range). Also, they are not worth it if you are working with relatively small data sets.

Finally, check your database documentation. Databases have grown extremely sophisticated in the ability to deploy indices these days and the Database 101 scenario I described above may not hold for some (although I always develop as if it does just so I know what I am getting).

Mark Brittingham
Thanks for the explanation! See what I asked Mr. Matt: the WHERE clause contains OR. In this case is it useful a compound index? I would say no, because the db always have to search all the items and not only the rows that results from the first condition (it would have been different if there was the AND operator because it "filters" the rows and reduce the scope). Am I wrong?
collimarco
collimarco - in the example you provide to Mr. Matt, the independent indices would provide better performance as each would be used independently as part of the SQL execution plan. Think of it this way: AND is compositional, OR is independent. To give another example, if your where clause was "WHERE (Family=X AND Species=Y) OR (CommonName=Z)" then you'd want a composite index on Family|Species and an independent index on CommonName.
Mark Brittingham
BTW: SQL Execution plans are available in more sophisticated databases such as SQL Server and Oracle and can be quite valuable both as a teaching tool (to help you see what goes on under the covers as the db attempt to optimize its search plan) and as a mechanism for testing out various indexing strategies.
Mark Brittingham