views:

995

answers:

5
  1. When should I use a composite index in a database?
  2. What are the performance ramification by using a composite index)?
  3. Why should I use use a composite index?

For example, I have a homes table:

CREATE TABLE IF NOT EXISTS `homes` (
  `home_id` int(10) unsigned NOT NULL auto_increment,
  `sqft` smallint(5) unsigned NOT NULL,
  `year_built` smallint(5) unsigned NOT NULL,
  `geolat` decimal(10,6) default NULL,
  `geolng` decimal(10,6) default NULL,
  PRIMARY KEY  (`home_id`),
  KEY `geolat` (`geolat`),
  KEY `geolng` (`geolng`),
) ENGINE=InnoDB  ;

Does it make sense for me to use a composite index for both geolat and geolng, such that:

I replace:

  KEY `geolat` (`geolat`),
  KEY `geolng` (`geolng`),

with:

KEY `geolat_geolng` (`geolat`, `geolng`)

If so:

  • Why?
  • What is the performance ramification by using a composite index)?

UPDATE:

Since many people have stated it entirely dependent upon the queries I perform, below is the most common query performed:

SELECT * FROM homes
WHERE geolat BETWEEN ??? AND ???
AND geolng BETWEEN ??? AND ???

UPDATE 2:

With the following database schema:

CREATE TABLE IF NOT EXISTS `homes` (
  `home_id` int(10) unsigned NOT NULL auto_increment,
  `primary_photo_group_id` int(10) unsigned NOT NULL default '0',
  `customer_id` bigint(20) unsigned NOT NULL,
  `account_type_id` int(11) NOT NULL,
  `address` varchar(128) collate utf8_unicode_ci NOT NULL,
  `city` varchar(64) collate utf8_unicode_ci NOT NULL,
  `state` varchar(2) collate utf8_unicode_ci NOT NULL,
  `zip` mediumint(8) unsigned NOT NULL,
  `price` mediumint(8) unsigned NOT NULL,
  `sqft` smallint(5) unsigned NOT NULL,
  `year_built` smallint(5) unsigned NOT NULL,
  `num_of_beds` tinyint(3) unsigned NOT NULL,
  `num_of_baths` decimal(3,1) unsigned NOT NULL,
  `num_of_floors` tinyint(3) unsigned NOT NULL,
  `description` text collate utf8_unicode_ci,
  `geolat` decimal(10,6) default NULL,
  `geolng` decimal(10,6) default NULL,
  `display_status` tinyint(1) NOT NULL,
  `date_listed` timestamp NOT NULL default CURRENT_TIMESTAMP,
  `contact_email` varchar(100) collate utf8_unicode_ci NOT NULL,
  `contact_phone_number` varchar(15) collate utf8_unicode_ci NOT NULL,
  PRIMARY KEY  (`home_id`),
  KEY `customer_id` (`customer_id`),
  KEY `city` (`city`),
  KEY `num_of_beds` (`num_of_beds`),
  KEY `num_of_baths` (`num_of_baths`),
  KEY `geolat` (`geolat`),
  KEY `geolng` (`geolng`),
  KEY `account_type_id` (`account_type_id`),
  KEY `display_status` (`display_status`),
  KEY `sqft` (`sqft`),
  KEY `price` (`price`),
  KEY `primary_photo_group_id` (`primary_photo_group_id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=8 ;

Using the following SQL:

EXPLAIN SELECT  homes.home_id,
        address,
        city,
        state,
        zip,
        price,
        sqft,
        year_built,
        account_type_id,
        num_of_beds,
        num_of_baths,
        geolat,
        geolng,
        photo_id,
        photo_url_dir
      FROM homes
      LEFT OUTER JOIN home_photos ON homes.home_id = home_photos.home_id
       AND homes.primary_photo_group_id = home_photos.home_photo_group_id
       AND home_photos.home_photo_type_id = 2
      WHERE homes.display_status = true
      AND homes.geolat BETWEEN -100 AND 100
      AND homes.geolng BETWEEN -100 AND 100

EXPLAIN returns:

id  select_type  table        type  possible_keys                                    key                  key_len  ref     rows  Extra
----------------------------------------------------------------------------------------------------------
1   SIMPLE       homes        ref   geolat,geolng,display_status                     display_status       1        const   2     Using where
1  SIMPLE        home_photos  ref   home_id,home_photo_type_id,home_photo_group_id   home_photo_group_id  4        homes.primary_photo_group_id   4

I don't quite understand how to read the EXPLAIN command. Does this look good or bad. Right now, I am NOT using a composite index for geolat and geolng. Should I be?

A: 

There is no Black and White, one size fits all answer.

You should use a composite index, when your query work load would benefit from one.

You need to profile your query work load in order to determine this.

A composite index comes into play when queries can be satisfied entirely from that index.

UPDATE (in response to edit to posted question): If you are selecting * from the table the composite index may be used, it may not. You will need to run EXPLAIN PLAN to be sure.

Mitch Wheat
Teddy
It depends entirely on what queries are being made against that table.
Mitch Wheat
I've updated my original post to include the most common query performed. See above.
Teddy
+1  A: 

Imagine you have the following three queries:

Query I:

SELECT * FROM homes WHERE `geolat`=42.9 AND `geolng`=36.4

Query II:

SELECT * FROM homes WHERE `geolat`=42.9

Query III:

SELECT * FROM homes WHERE `geolng`=36.4

If you have seperate index per column, all three queries use indexes. In MySQL, if you have composite index (geolat, geolng), only query I and query II (which is using the first part of the composit index) uses indexes. In this case, query III requires full table search.

On Multiple-Column Indexes section of manual, it is clearly explained how multiple column indexes work, so I don't want to retype manual.

From the MySQL Reference Manual page:

A multiple-column index can be considered a sorted array containing values that are created by concatenating the values of the indexed columns.

If you use seperated index for geolat and geolng columns, you have two different index in your table which you can search independent.

INDEX geolat
-----------
VALUE RRN
36.4  1
36.4  8
36.6  2
37.8  3
37.8  12
41.4  4

INDEX geolng
-----------
VALUE RRN
26.1  1
26.1  8
29.6  2
29.6  3
30.1  12
34.7  4

If you use composite index you have only one index for both columns:

INDEX (geolat, geolng)
-----------
VALUE      RRN
36.4,26.1  1
36.4,26.1  8
36.6,29.6  2
37.8,29.6  3
37.8,30.1  12
41.4,34.7  4

RRN is relative record number (to simplify, you can say ID). The first two index generated seperate and the third index is composite. As you can see you can search based on geolng on composite one since it is indexed by geolat, however it's possible to search by geolat or "geolat AND geolng" (since geolng is second level index).

Also, have a look at How MySQL Uses Indexes manual section.

eyazici
Actually, I don't have any of those queries. My query is listed in the original post. My query is to return homes within a square grid. I know about spatial and I'm not trying to calculate distances. I simply want to know if using a composite index make sense when I'm trying to display all homes within a particular geo grid (e.g. neighborhood/city/county)
Teddy
Eyazici, I've updated my original post (update 2). This is my actual query. My actual db schema. And what the EXPLAIN command returns. So, with this information - should I be using a composite index. I'm still unclear. Thanks in advance
Teddy
@"Actually, I don't have any of those queries.". Actually you have, I have used simple WHERE condition to explaing base logic. When using a conditional(ie. WHERE) on a column MySQL tries to use indexes whenever possible. "x BETWEEN a AND b" is similar to "x>a AND x<b". You have used both geolng and geolat columns in your query conditional. If you use composit index "(geolat, geolng)" your "AND geolng BETWEEN ??? AND ???" conditional does not gain advantages of index (this is for MySQL). So you *should* use separate index per column for your scenario.
eyazici
I don't understand. Why should I use separate indexes for geolat and geolng when I'll ALWAYS perform a query that include both columns
Teddy
I have modified my post with the citation of why you should seperate indexes for your purpose. Composite indexes are not stored seperate place, they are stored concatenated in same place(this is the only difference from seperate indexes).
eyazici
So that seems like I SHOULD use composite (concatenated) geo lat/lng indexes. No?
Teddy
Since you use "AND" between your conditionals, "Yes, you should use composite once".
eyazici
I'm unfortunately now even more confused :( 3 comments ago, you said I should 'separate' indexes. Now your last comment says I should use 'composite' indexes (once?).
Teddy
Yes I am sorry, it is because misunderstanding of your query. It doesn't make sense to use different indexes for your given purpose.
eyazici
Teddy
Yes, if your conditional is just "geolat BETWEEN ??? AND ??? AND geolng BETWEEN ??? AND ???" you should use composite index.
eyazici
A: 

I'm with @Mitch, depends entirely your queries. Fortunately you can create and drop indexes at any time, and you can prepend the EXPLAIN keyword to your queries to see if the query analyzer uses the indexes.

If you'll be looking up an exact lat/long pair this index would likely make sense. But you're probably going to be looking for homes within a certain distance of a particular place, so your queries will look something like this (see source):

select *, sqrt(  pow(h2.geolat - h1.geolat,  2) 
               + pow(h2.geolng - h1.geolng, 2) ) as distance
from homes h1, homes h2
where h1.home_id = 12345 and h2.home_id != h1.home_id
order by distance

and the index very likely won't be helpful at all. For geospatial queries, you need something like this.

Update: with this query:

SELECT * FROM homes
WHERE geolat BETWEEN ??? AND ???
AND geolng BETWEEN ??? AND ???

The query analyzer could use an index on geolat alone, or an index on geolng alone, or possibly both indexes. I don't think it would use a composite index. But it's easy to try out each of these permutations on a real data set and then (a) see what EXPLAIN tells you and (b) measure the time the query really takes.

Jim Ferrans
I'm simply using wanting to return homes within a square grid. I know about spatial, so I'm not trying to calculate distance. I simply want to return homes within the square grid and want that to perform quickly. As such, I want to make sure I have my indexes setup correctly.Does that help?
Teddy
+6  A: 

You should use a composite index when you are using queries that benefit from it. A composite index that looks like this:

index( column_A, column_B, column_C )

will benefit a query that uses those fields for joining, filtering, and sometimes selecting. It will also benefit queries that use left-most subsets of columns in that composite. So the above index will also satisfy queries that need

index( column_A, column_B, column_C )
index( column_A, column_B )
index( column_A )

But it will not (at least not directly, maybe it can help partially if there are no better indices) help for queries that need

index( column_A, column_C )

Notice how column_B is missing.

In your original example, a composite index for two dimensions will mostly benefit queries that query on both dimensions or the leftmost dimension by itself, but not the rightmost dimension by itself. If you're always querying two dimensions, a composite index is the way to go, doesn't really matter which is first (most probably).

Mark Canlas
Mark, I've updated my original post (update 2). This is my actual query. My actual db schema. And what the EXPLAIN command returns. So, with this information - should I be using a composite index. I'm still unclear. Thanks in advance.
Teddy
A: 

To do spacial searches you need an R-Tree algorithm, which allows searching geographical areas very quickly. Exactly what you need for this job.

Some databases have spacial indexes built in. A quick google shows MySql 5 has them (which looking at your SQL I'm guessing you're using MySql).

Will