Hi,

I've been doing some research and testing on how to do fast random selection in MySQL. In the process I've faced some unexpected results and now I am not fully sure I know how ORDER BY RAND() really works.

I always thought that when you do ORDER BY RAND() on a table, MySQL adds a new column filled with random values, sorts the data by that column, and then you take e.g. the top row, which ended up there at random. I've done lots of googling and testing and finally found that the query Jay offers in his blog is indeed the fastest solution:

SELECT * FROM Table T JOIN (SELECT CEIL(MAX(ID)*RAND()) AS ID FROM Table) AS x ON T.ID >= x.ID LIMIT 1;

While common ORDER BY RAND() takes 30-40 seconds on my test table, his query does the work in 0.1 seconds. He explains how this functions in the blog so I'll just skip this and finally move to the odd thing.

My table is a common table with a PRIMARY KEY id and other non-indexed stuff like username, age, etc. Here's the thing I am struggling to explain:

SELECT * FROM table ORDER BY RAND() LIMIT 1; /*30-40 seconds*/
SELECT id FROM table ORDER BY RAND() LIMIT 1; /*0.25 seconds*/
SELECT id, username FROM table ORDER BY RAND() LIMIT 1; /*90 seconds*/

I was sort of expecting to see approximately the same time for all three queries since I am always sorting on a single column. But for some reason this didn't happen. Please let me know if you have any ideas about this. I have a project where I need to do fast ORDER BY RAND() and personally I would prefer to use

SELECT id FROM table ORDER BY RAND() LIMIT 1;
SELECT * FROM table WHERE id=ID_FROM_PREVIOUS_QUERY LIMIT 1;

which, yes, is slower than Jay's method, but it is smaller and easier to understand. My queries are rather big, with several JOINs and a WHERE clause, and while Jay's method still works, the query grows really big and complex because I need to repeat all the JOINs and the WHERE in the JOINed subquery (called x in his query).
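For illustration, the two statements above could probably be folded into a single query by picking the random id in a derived table. This is only a sketch: the table name users is a placeholder for the real table, and the timing would still be dominated by the inner ORDER BY RAND().

```sql
-- Sketch only: `users` is a hypothetical table with PRIMARY KEY id.
-- The derived table picks one random id (cheap when only id is selected),
-- and the outer query fetches the full row by primary key.
SELECT t.*
FROM users AS t
JOIN (
    SELECT id
    FROM users
    ORDER BY RAND()
    LIMIT 1
) AS r ON t.id = r.id;
```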

Thanks for your time!

+1  A: 

While there's no such thing as a "fast order by rand()", there is a workaround for your specific task.

For getting any single random row, you can do what this German blogger does: http://rhcms.de/blog.php (I couldn't find a permalink. If anyone finds one, feel free to edit the link.)

The text is in German, but the SQL code is a bit further down the page in big white boxes, so it's not hard to find.

Basically, he creates a procedure that does the job of getting a valid row: it generates a random number between 0 and max_id, tries to fetch the row with that id, and if it doesn't exist, keeps going until it hits one that does. He allows for fetching x random rows by storing them in a temp table, so you can probably rewrite the procedure to be a bit faster when fetching only one row.

The downside is that if you delete A LOT of rows and there are huge gaps, chances are it will miss many times, making it inefficient.
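A single-row version of that idea might look roughly like the sketch below. This is not the blogger's actual code: the table name users, the procedure name get_random_user and the v_ variables are made up for illustration, and it assumes ids are positive integers starting at 1. DELIMITER is a mysql command-line client directive, so other clients may need a different way to define the procedure.

```sql
-- Sketch only: `users` and `get_random_user` are hypothetical names.
DELIMITER //

CREATE PROCEDURE get_random_user()
BEGIN
  DECLARE v_max_id INT;
  DECLARE v_candidate INT;
  DECLARE v_hits INT DEFAULT 0;

  SELECT MAX(id) INTO v_max_id FROM users;

  -- Keep guessing ids until one exists. An empty table leaves
  -- v_max_id NULL, so the loop is skipped and no row is returned.
  WHILE v_max_id IS NOT NULL AND v_hits = 0 DO
    SET v_candidate = FLOOR(1 + RAND() * v_max_id);  -- integer in 1..v_max_id
    SELECT COUNT(*) INTO v_hits FROM users WHERE id = v_candidate;
  END WHILE;

  SELECT * FROM users WHERE id = v_candidate;
END //

DELIMITER ;

CALL get_random_user();
```

Each miss costs one primary key lookup, so the expected number of iterations grows with the share of deleted ids.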

Update: Different execution times

SELECT * FROM table ORDER BY RAND() LIMIT 1; /*30-40 seconds*/

SELECT id FROM table ORDER BY RAND() LIMIT 1; /*0.25 seconds*/

SELECT id, username FROM table ORDER BY RAND() LIMIT 1; /*90 seconds*/

I was sort of expecting to see approximately the same time for all three queries since I am always sorting on a single column. But for some reason this didn't happen. Please let me know if you have any ideas about this.

It may have to do with indexing. id is indexed and quick to access, whereas adding username to the result means it needs to read that from each row and put it in the memory table. With the * it also has to read everything into memory, but it doesn't need to jump around the data file, meaning there's no time lost seeking.

This makes a difference only if there are variable-length columns (varchar/text), which means it has to check the length, then skip that length, as opposed to just skipping a set length (or 0) between each row.

Tor Valamo
Basically, Jay's query that I posted above does pretty much the same thing (if I understand it correctly, of course), but it is done completely in MySQL. So if I am not mistaken, the German blogger does the same manually in PHP code. Unfortunately, I do delete from my table, so it will definitely have gaps. I think the gaps won't be large, but this is something I can't fully control, so I can't use that method.
Eugene
No, he does it as a MySQL procedure, which means that you don't have to duplicate it for every query. And if it's a table like a "posts" table in a forum, then it's fine: deletes will happen fairly often, but not often enough to pose a problem.
Tor Valamo
I see, thanks, I'll look into the solution that guy offers. Anyway, I still wonder why ORDER BY RAND() is fast when I select only `id`, very slow when I select `id` with `username`, and slow when I select all columns. Why does the time vary so much depending on which columns I select?
Eugene
It may have to do with indexing. `id` is indexed and quick to access, whereas adding `username` to the result means it needs to read that from each row and put it in the memory table. With the `*` it also has to read everything into memory, but it doesn't need to jump around the data file, meaning there's no time lost seeking. This makes a difference only if there are variable-length columns, which means it has to check the length, then skip that length, as opposed to just skipping a set length (or 0) between each row.
Tor Valamo
A: 

I can tell you why SELECT id FROM ... is much faster than the other two, but I am not sure why SELECT id, username is 2-3 times slower than SELECT *.

When you have an index (the primary key in your case) and the result includes only columns from that index, the MySQL optimizer can use the data from the index alone and does not even look into the table itself. The more expensive each row is to read, the larger the effect you will observe, since you replace filesystem I/O operations with pure in-memory operations. If you add an additional index on (id, username), you should see similar performance in the third case as well.
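As a rough sketch of that suggestion (the table name users and the index name idx_id_username are placeholders, not taken from the question), the index could be added and the plan checked like this:

```sql
-- Sketch only: `users` and `idx_id_username` are hypothetical names.
-- Add an index that contains both selected columns.
ALTER TABLE users ADD INDEX idx_id_username (id, username);

-- Check whether the optimizer now reads the index instead of the table.
EXPLAIN SELECT id, username FROM users ORDER BY RAND() LIMIT 1;
```

If the plan reports type index with "Using index" in the Extra column, the query is being answered from the index alone. The temporary table and filesort needed for ORDER BY RAND() would still be expected to appear.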

newtover
+1  A: 

It may have to do with indexing. id is indexed and quick to access, whereas adding username to the result means it needs to read that from each row and put it in the memory table. With the * it also has to read everything into memory, but it doesn't need to jump around the data file, meaning there's no time lost seeking. This makes a difference only if there are variable-length columns, which means it has to check the length, then skip that length, as opposed to just skipping a set length (or 0) between each row.

Practice is better than any theory! Why not just check the plans? :)

mysql> explain select name from avatar order by RAND() limit 1;
+----+-------------+--------+-------+---------------+-----------------+---------+------+-------+----------------------------------------------+
| id | select_type | table  | type  | possible_keys | key             | key_len | ref  | rows  | Extra                                        |
+----+-------------+--------+-------+---------------+-----------------+---------+------+-------+----------------------------------------------+
|  1 | SIMPLE      | avatar | index | NULL          | IDX_AVATAR_NAME | 302     | NULL | 30062 | Using index; Using temporary; Using filesort |
+----+-------------+--------+-------+---------------+-----------------+---------+------+-------+----------------------------------------------+
1 row in set (0.00 sec)

mysql> explain select * from avatar order by RAND() limit 1;
+----+-------------+--------+------+---------------+------+---------+------+-------+---------------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows  | Extra                           |
+----+-------------+--------+------+---------------+------+---------+------+-------+---------------------------------+
|  1 | SIMPLE      | avatar | ALL  | NULL          | NULL | NULL    | NULL | 30062 | Using temporary; Using filesort |
+----+-------------+--------+------+---------------+------+---------+------+-------+---------------------------------+
1 row in set (0.00 sec)

mysql> explain select name, experience from avatar order by RAND() limit 1;
+----+-------------+--------+------+---------------+------+---------+------+-------+---------------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows  | Extra                           |
+----+-------------+--------+------+---------------+------+---------+------+-------+---------------------------------+
|  1 | SIMPLE      | avatar | ALL  | NULL          | NULL | NULL    | NULL | 30064 | Using temporary; Using filesort |
+----+-------------+--------+------+---------------+------+---------+------+-------+---------------------------------+
Random
A: 

Why don't you add an index on (id, username) to the table and see if that forces MySQL to use the index rather than just a filesort and temp table?

jmoz