Consider the following database tables:

  • Table "messages" with 13,000,000 rows (one row per message).
  • Table "users" with 3,000,000 rows (one row per user).

The following query is used to fetch a bunch of messages and the corresponding users:

SELECT messages.id, messages.message, users.id, users.username
FROM messages
INNER JOIN users ON messages.user_id=users.id 
WHERE messages.id IN (?, ?, ?, ? ... a total of 100 "?" placeholders);

100 messages are fetched in each query.

"messages" is indexed on id (primary key, BIGINT not auto-generated) and user_id.

"users" is indexed on id (primary key, INT auto-generated).

The database is MySQL using MyISAM.

Currently the query takes well over 3000 ms to execute, which puzzles me: since "messages" is indexed on "id", retrieving the correct rows should be very quick.

My question is: Given the described scenario and setup, is a 3000 ms query time "normal", or am I missing something? Please let me know if further details are required.

Update #1: Here are the table definitions:

CREATE TABLE messages (
  id bigint(20) NOT NULL DEFAULT '0',
  user_id int(11) NOT NULL DEFAULT '0',
  message varchar(160) NOT NULL DEFAULT '',
  PRIMARY KEY (id),
  KEY user_id (user_id)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

CREATE TABLE users (
  id int(11) NOT NULL DEFAULT '0',
  username varchar(32) NOT NULL DEFAULT '',
  PRIMARY KEY (id),
  UNIQUE KEY username (username)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

The only "non-standard" thing I observe in the definitions is that "messages.id" is a BIGINT rather than an INT. Could that be a hint?
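Several of the answers below ask for the execution plan. A minimal sketch of how to obtain it, using placeholder literal IDs rather than real values: the "type", "key", and "rows" columns of the output show whether the PRIMARY key is actually being used or whether a scan is happening.

```sql
-- Sketch: prefix the query with EXPLAIN to see the plan.
-- The IDs here are placeholders, not real message IDs.
EXPLAIN
SELECT messages.id, messages.message, users.id, users.username
FROM messages
INNER JOIN users ON messages.user_id = users.id
WHERE messages.id IN (1, 2, 3);
```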

+2  A: 
SELECT  messages.id, messages.message, users.id, users.username
FROM    messages
INNER JOIN
        users
ON      users.id = messages.user_id
WHERE   messages.id IN (?, ?, ?, ? ... a total of 100 "?" placeholders);

It seems that your messages have datatype TEXT and are long.

Long TEXT columns are stored out of row, which means retrieving them requires extra page reads, and that can take a long time.

Could you please check two things:

  1. Performance of this query:

    SELECT  messages.id, users.id, users.username
    FROM    messages
    INNER JOIN
            users
    ON      users.id = messages.user_id
    WHERE   messages.id IN (?, ?, ?, ? ... a total of 100 "?" placeholders);
    
  2. Execution plans generated by this query and your original query.

Quassnoi
I've posted an update with exact table definitions. Note that "messages.message" is simply VARCHAR(160). "messages.id" is a BIGINT - could that be a hint?
knorv
@knorv: could you please post the plans for both queries? Run EXPLAIN SELECT messages.id ... and post the output here.
Quassnoi
this would've been my guess too...
Brian Rudolph
A: 

Currently the query takes well over 3000 ms to execute

Every time, or just the first query? Could it be that the first query is incurring the cost of loading the indices etc?

And for comparison, how long does it take to do the same query for a specific message ID?

Also, depending on the spec of the box you're running this on, as well as looking at the execution plans as suggested by others, it may be worth looking at the memory usage of mysqld and making sure that it's not simply swapping.
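From inside MySQL, a quick sketch of checks along these lines (the variable names are standard MySQL status/system variables; the interpretation is a rule of thumb, not a prescription):

```sql
-- How big is the configured key cache?
SHOW VARIABLES LIKE 'key_buffer_size';
-- How often do index reads miss the cache and go to disk?
SHOW STATUS LIKE 'Key_reads';           -- index block reads from disk
SHOW STATUS LIKE 'Key_read_requests';   -- index block read requests overall
-- A high Key_reads / Key_read_requests ratio suggests the key buffer
-- is too small and index lookups are hitting the disk.
```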

frankodwyer
3000+ ms is *every* time the query is issued. See my update wrt table definitions.
knorv
+1  A: 

Well, the query and table design themselves are likely not the cause. Though the query could use some help (like moving the "in list" into the join predicate to eliminate a late-stage filter, though I'd guess the optimizer returns the same plan regardless).

My guess is this is a symptom of other issues: index/table fragmentation or out-of-date statistics. Are rows often deleted from these tables? It might help to defragment the tables and indexes; otherwise you may be a victim of pages that are only 10% or less full, which would cause a lot of disk I/O.

Note: with an integer seed for a primary key, you don't often see a lot of fragmentation unless the rows see a lot of deletes and updates.
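A sketch of what the defragmentation and statistics refresh suggested above would look like in MySQL. Note that OPTIMIZE TABLE locks a MyISAM table while it runs, so this belongs in a maintenance window:

```sql
-- Defragment the data file and rebuild/sort the index pages (MyISAM):
OPTIMIZE TABLE messages;
OPTIMIZE TABLE users;
-- Refresh the key distribution statistics the optimizer relies on:
ANALYZE TABLE messages;
ANALYZE TABLE users;
```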

Brian Rudolph
A: 

Since this is usually re-written by the parser as:

SELECT messages.id, messages.message, users.id, users.username
FROM messages
INNER JOIN users ON messages.user_id=users.id 
WHERE messages.id = ?
OR messages.id = ?
OR messages.id = ? etc.

I'd be interested to see the execution plan and the performance for a single case:

SELECT messages.id, messages.message, users.id, users.username
FROM messages
INNER JOIN users ON messages.user_id=users.id 
WHERE messages.id = ?

In which case you might be better off doing a UNION or creating a table containing the IDs and performing a JOIN.
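A sketch of the "table containing the IDs" variant described above; the table name wanted_ids is hypothetical, and the 100 IDs would be inserted in place of the placeholder values:

```sql
-- Load the wanted message IDs into a temporary table, then join on it
-- instead of using a 100-element IN (...) list.
CREATE TEMPORARY TABLE wanted_ids (id BIGINT NOT NULL, PRIMARY KEY (id));
INSERT INTO wanted_ids (id) VALUES (1), (2), (3);  -- ... up to 100 IDs
SELECT messages.id, messages.message, users.id, users.username
FROM wanted_ids
INNER JOIN messages ON messages.id = wanted_ids.id
INNER JOIN users    ON users.id = messages.user_id;
DROP TEMPORARY TABLE wanted_ids;
```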

Cade Roux
A: 

How long does this take?

SELECT a.id, a.message, b.id, b.username
FROM (SELECT id, user_id, message FROM messages WHERE id IN (?????)) a
INNER JOIN users b
ON b.id = a.user_id

And post the query plan if it's slow. If the messages table has multiple indexes, then the index on id should be forced in the inner SELECT if the database is not using it.

Sai Ganesh
+1  A: 
  1. We NEED that explain.
  2. MyISAM offers poor concurrency. Consider that concurrent inserts may be causing you headaches. With such a large database, InnoDB may be the direction to go.
  3. If messages are being inserted and deleted, this could be causing things to skew if your tables aren't occasionally optimized. Also, MyISAM primary keys aren't clustered. Again, with such a large database, InnoDB may be the direction to go.
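For reference, a sketch of the engine conversion suggested in points 2 and 3. This rebuilds the entire table, which can take a long time on 13,000,000 rows, so test it on a copy first:

```sql
-- Convert both tables from MyISAM to InnoDB (full table rebuild):
ALTER TABLE messages ENGINE=InnoDB;
ALTER TABLE users ENGINE=InnoDB;
```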
Autocracy
A: 

What hardware are you looking at here? I assume you have a server with a reasonable amount of RAM and key_buffer set quite large (say, bigger than the combined index sizes of the two modestly sized tables). I assume the server is an otherwise idle performance-test server.

Can you measure the amount of IOs?

If you repeat the exact same query, is it fast?

If you load the entire database into a RAM disk (small tables with only 15M rows combined are going to fit in a RAM disk quite easily), is it faster?

Also (as others have noted), post the EXPLAIN plan.

But such a small database should always be fast, as it will fit in RAM on all but the puniest of servers.

MarkR
+6  A: 

I've worked on MyISAM tables with billions of rows, and one of the things I found after passing some row-count threshold was that the optimizer took far too long to decide how to approach a query, and wrongly performed some table scans. I can't find the exact page I found describing it, but I started always using FORCE INDEX on each segment of queries where I knew how it should request objects.

http://dev.mysql.com/doc/refman/5.1/en/index-hints.html

The fact of the matter is, if you're using tables that large, you need to be designing each and every query to work with your indices, so there's nothing wrong with forcing an index. It will still scan the table if it has to, but FORCE INDEX tells it not to unless it absolutely has to.
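Applied to the query in question, the hint would look roughly like this (syntax per the MySQL index-hints page linked above; placeholder IDs, and only worth doing if EXPLAIN shows the primary key is not being used):

```sql
-- Force the optimizer to use the primary key on messages:
SELECT messages.id, messages.message, users.id, users.username
FROM messages FORCE INDEX (PRIMARY)
INNER JOIN users ON messages.user_id = users.id
WHERE messages.id IN (1, 2, 3);
```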

Also, if your tables are large, I assume your indices are also large. You absolutely need to make sure you have proper configuration settings, that your key_buffer is an adequate size, and that you have enough I/O. If you're running 32-bit MySQL (which you shouldn't be), set your key_buffer as large as 1GB (assuming you have 1GB to spare) and check its usage with 'mysqlreport'.

If you're running 64bit mysql, opt for making it as large as you can, while still leaving space for the OS to cache files and whatever other applications you have running, so maybe a couple of GB if you can.
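A sketch of the resizing described above; the 1 GB value is an example, not a recommendation for this particular server, and to survive a restart it would also need to be set as key_buffer_size under [mysqld] in my.cnf:

```sql
-- Resize the key buffer at runtime (1 GB, as an example value):
SET GLOBAL key_buffer_size = 1073741824;
SHOW VARIABLES LIKE 'key_buffer_size';
```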

Even if your queries use indices, if the index can't be properly buffered in memory, you're still hitting the disk and there's a performance hit proportional to the index size and speed of disk / available i/o.

As for INT vs. BIGINT, the only noticeable performance difference I've seen is in performing calculations on BIGINTs, like SUM. SUM is appreciably slower on BIGINT than on INT, so much so that I'd look at storing numbers at a different magnitude, or splitting them into two INTs, if you need to perform frequent calculations on them.

Mike Venzke