views:

72

answers:

2

I have one table that maps locations to postal codes. For example, New York State has about 2000 postal codes. I have another table that maps mail to the postal codes it was sent to, but this table has about 5 million rows. I want to find all the mail that was sent to New York State, which seems simple enough, but the query is unbelievably slow. I haven't been able to even wait long enough for it to finish. Is the problem that there are 5 million rows? I can't help but think that 5 million shouldn't be such a large number for a computer these days... Oh, and everything is indexed. Is SQL just not designed to handle such large joins?

UPDATE: as people have asked, I've updated this question with the table definitions and the query that I'm using.

-- Roughly 70,000 rows
CREATE TABLE `mail_zip` (
  `mail_id` int(11) default NULL,
  `zip` int(11) default NULL,
  KEY `index_mail_zip_on_mail_id` (`mail_id`),
  KEY `index_mail_zip_on_zip` (`zip`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

-- Roughly 5,000,000 rows
CREATE TABLE `geographies` (
  `city_id` int(11) default NULL,
  `postal_code` int(11) default NULL,
  KEY `index_geographies_on_city_id` (`city_id`),
  KEY `index_geographies_on_postal_code` (`postal_code`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

-- Query
select mz.mail_id from mail_zip mz join geographies g on mz.zip = g.postal_code where g.city_id = 36 limit 10;

UPDATE 2: okay, I lied. With the proper indices, the above query works fine. The problem is actually the order by clause. See the two nearly identical queries below: the only difference is "order by m.sent_on desc" which adds an extra 4 minutes and 30 seconds to the query! Also, using explain, adding the order by uses a filesort which must be what's slowing it down. However, sent_on is indexed, so why isn't it using the index? I must not be making the index properly.

-- Roughly 350,000 rows
CREATE TABLE `mail` (
  `id` int(11) NOT NULL auto_increment,
  `sent_on` datetime default NULL,
  `title` varchar(255) default NULL,
  PRIMARY KEY  (`id`),
  KEY `index_mail_on_sent_on` (`sent_on`),
) ENGINE=InnoDB DEFAULT CHARSET=latin1

-- Runs in 0.19 seconds
-- Query
select distinct(m.id), m.title from mail m join mail_zip mz on mz.mail_id = m.id join geographies g on g.postal_code = mz.zip where g.city_id = 36 limit 10;

+----+-------------+-------+--------+--------------------------------------------------------+---------+---------+----------------------+---------+-----------------------+
| id | select_type | table | type   | possible_keys                                          | key     | key_len | ref                  | rows    | Extra                 |
+----+-------------+-------+--------+--------------------------------------------------------+---------+---------+----------------------+---------+-----------------------+
|  1 | SIMPLE      | mz    | ALL    | index_mail_zip_on_com_id,index_mail_zip_on_zip         | NULL    | NULL    | NULL                 | 5260053 | Using temporary       | 
|  1 | SIMPLE      | m     | eq_ref | PRIMARY                                                | PRIMARY | 4       |            mz.com_id |       1 |                       | 
|  1 | SIMPLE      | g     | ref    | index_geographies_on_city_id,zip                       | zip     | 5       |            mz.zip    |       1 | Using where; Distinct | 
+----+-------------+-------+--------+--------------------------------------------------------+---------+---------+----------------------+---------+-----------------------+

-- Runs in 4 minutes and 30 seconds
-- Query
select distinct(m.id), m.title from mail m join mail_zip mz on mz.mail_id = m.id join geographies g on g.postal_code = mz.zip where g.city_id = 36 order by m.sent_on desc limit 10;

+----+-------------+-------+--------+--------------------------------------------------------+---------+---------+----------------------+---------+---------------------------------+
| id | select_type | table | type   | possible_keys                                          | key     | key_len | ref                  | rows    | Extra                           |
+----+-------------+-------+--------+--------------------------------------------------------+---------+---------+----------------------+---------+---------------------------------+
|  1 | SIMPLE      | mz    | ALL    | index_mail_zip_on_com_id,index_mail_zip_on_zip         | NULL    | NULL    | NULL                 | 5260053 | Using temporary; Using filesort | 
|  1 | SIMPLE      | m     | eq_ref | PRIMARY                                                | PRIMARY | 4       |            mz.com_id |       1 |                                 | 
|  1 | SIMPLE      | g     | ref    | index_geographies_on_city_id,zip                       | zip     | 5       |            mz.zip    |       1 | Using where; Distinct           | 
+----+-------------+-------+--------+--------------------------------------------------------+---------+---------+----------------------+---------+---------------------------------+
+5  A: 

MySQL is perfectly capable of handling joins involving 5 million rows or even many more than this.

Your problem is probably one of two things:

  • You are missing an index.
  • You are writing your query in such a way that the optimizer is unable to use the best index, for example if you are using functions that are not sargable on your indexed column in your join condition.

Since you claim that "everything is indexed" I'd guess that it is the second. Post your table information and your query and we should be able to help you fix it.

You can also run EXPLAIN on your query to see which indexes it is using.

Mark Byers
As people have asked, I've updated this question with the table definitions and the query that I'm using.
Philip Brocoum
A: 

You should be able to join, for sake of argument, 10 tables with the biggest table(s) having rows in the multiple millions and upwards, and you should be able to get results fast.

Assume that there is something up with the indexing strategy or the query operation or the query plan.

It has nothing to do with SQL per se; it might be related to MySQL, or the particular storage engine you are using in MySQL.

Did you know that the SQL standard doesn't define anything related to indexes? You could argue that anything related to indexes is non-standard, though 'extra to the standard' might be a better way to look at it.

Jonathan Leffler
As people have asked, I've updated this question with the table definitions and the query that I'm using.
Philip Brocoum