I'm trying to get distinct ids out of a 6 million row table. The query is pretty simple, and the explain seems OK. The distinct column is indexed as part of a composite index of uid-date-time.

The query is

SELECT DISTINCT uid FROM events;
and returns 334117 rows in 17 min 15.54 seconds. The explain is
+----+-------------+--------+-------+---------------+-------------+---------+------+------+--------------------------+
| id | select_type | table  | type  | possible_keys | key         | key_len | ref  | rows | Extra                    |
+----+-------------+--------+-------+---------------+-------------+---------+------+------+--------------------------+
|  1 | SIMPLE      | events | range | NULL          | dateuid_idx | 4       | NULL |   10 | Using index for group-by |
+----+-------------+--------+-------+---------------+-------------+---------+------+------+--------------------------+

Would this be faster if the uid were held in a separate index? Or is there another way to speed this up?

+2  A: 

It looks like your index is by Date and then UID, which doesn't help. Indexing by UID first, or having a separate index on UID, will improve performance greatly.

In addition, you might consider using GROUP BY instead of DISTINCT.
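To make both suggestions concrete, a sketch (the `events` table is from the question; the index name `uid_idx` is made up):

```sql
-- Hypothetical index name: put uid first (or alone) so the optimizer
-- can skip from one distinct uid to the next instead of reading every row
ALTER TABLE events ADD INDEX uid_idx (uid);

-- GROUP BY form of the same de-duplication
SELECT uid FROM events GROUP BY uid;
```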

g .
Thanks G. Everything I had read previously while looking into this issue (and my tests on GROUP BY with a JOIN) led me to believe that DISTINCT and GROUP BY were the same. However, I just ran the query with GROUP BY (no JOINs), and the query is now 3.11 seconds. Fast enough for my maintenance scripts. Thanks
pedalpete
A date + uid index does help, a little, since such an index may be scanned in lieu of the main table, requiring less I/O. Also, DISTINCT and GROUP BY, although related, do not provide the same type of output.
mjv
@mjv - Good point. Scanning the index is better than scanning the table.
g .
@pedalpete - You went from 17 minutes to 3 seconds just by changing DISTINCT to GROUP BY?
g .
apparently yes, but then I forgot about the mysql caching, so that may have made a significant difference.
pedalpete
@pedalpete if such a jump from 17 minutes to 3 _seconds_! is only due to caching, plus the typically slight difference, if any, between DISTINCT and GROUP BY, it means a few things: a) you have really slow I/O b) you have really big memory c) you don't seem all that curious after all...
mjv
+1  A: 

Create an index on uid.
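For example (the index name here is illustrative):

```sql
CREATE INDEX uid_idx ON events (uid);

-- Then check the plan; the key column should show the new index
EXPLAIN SELECT DISTINCT uid FROM events;
```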

Manu
+2  A: 

Yes, having uid as a separate index can be much faster. Without an index a full-table scan is required; that is, MySQL actually iterates through each row and checks whether that row's uid has already been selected.

However, with uid being indexed, it only has to traverse the index tree, which can be a lot faster when there are many duplicate uids.
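A sketch of how to confirm this: with a uid-only index, MySQL can use a loose index scan, reported as "Using index for group-by" in EXPLAIN (as in the question's plan), jumping from one distinct uid value to the next rather than touching all 6 million rows:

```sql
-- Either form may qualify for a loose index scan over a uid-only index;
-- look for "Using index" / "Using index for group-by" in the Extra column
EXPLAIN SELECT DISTINCT uid FROM events;
EXPLAIN SELECT uid FROM events GROUP BY uid;
```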

reko_t
A: 

A separate uid index would speed things up. Creating the index, however, would probably take on the order of 17 minutes (since roughly the same amount of "reading" must be done to create the index).

Therefore creating such an index all depends on expected usage. It may be a good "investment" if the relevant queries are run more than a few times (if only during this current "campaign" of duplicate record hunting).

mjv