views:

533

answers:

9

Overview:

I have three tables 1) subscribers, bios, and shirtsizes and i need to find the subscribers without a bio or shirtsizes

the tables are laid out such as

subscribers

| season_id |  user_id |

bio

| bio_id | user_id |

shirt sizes

| bio_id | shirtsize |

And I need to find all users who do not have a bio or shirtsize, (if no bio; then no shirtsize via relation) for any given season.

I originally wrote a query like:

SELECT *
   FROM subscribers s 
   LEFT JOIN bio b ON b.user_id = subscribers.user_id 
   LEFT JOIN shirtsizes ON shirtsize.bio_id = bio.bio_id 
WHERE s.season_id = 185181 AND (bio.bio_id IS NULL OR shirtsize.size IS NULL);

but it is taking 10 seconds to complete now.

I am wondering how I can restructure the query (or possibly the problem) so that it will preform reasonably.

Here is the mysql explain: (ogu = subscribers, b = bio, tn = shirtshize)

| id | select_type | table | type  | possible_keys | key     | key_len | ref         | rows   | Extra       |   
+----+-------------+-------+-------+---------------+---------+---------+-------------+--------+-------------+    
|  1 | SIMPLE      | ogu   | ref   | PRIMARY       | PRIMARY | 4       | const       |    133 | Using where |
|  1 | SIMPLE      | b     | index | NULL          | PRIMARY | 8       | NULL        | 187644 | Using index |
|  1 | SIMPLE      | tn    | ref   | nid           | nid     | 4       | waka2.b.nid |      1 | Using where |

The above is pretty sanitized, here's the realz info:

mysql> DESCRIBE subscribers
+-----------+---------+------+-----+---------+-------+
| Field     | Type    | Null | Key | Default | Extra |
+-----------+---------+------+-----+---------+-------+
| subscribers  | int(11) | NO   | PRI |         |       | 
| uid       | int(11) | NO   | PRI |         |       | 


mysql> DESCRIBE bio;
+-------+------------------+------+-----+---------+-------+
| Field | Type             | Null | Key | Default | Extra |
+-------+------------------+------+-----+---------+-------+
| bio_id   | int(10) unsigned | NO   | PRI | 0       |       | 
| uid   | int(10) unsigned | NO   | PRI | 0       |       | 


mysql> DESCRIBE shirtsize;
+-------+------------------+------+-----+---------+-------+
| Field | Type             | Null | Key | Default | Extra |
+-------+------------------+------+-----+---------+-------+
| bio_id   | int(10) unsigned | NO   | PRI | 0       |       | 
| shirtsize   | int(10) unsigned | NO   | PRI | 0       |       |

and the real query looks like:

SELECT ogu.nid, ogu.is_active, ogu.uid, b.nid AS bio_node, tn.nid AS size
                  FROM og_uid ogu
                  LEFT JOIN bio b ON b.uid = ogu.uid
                  LEFT JOIN term_node tn ON tn.nid = b.nid
                  WHERE ogu.nid = 185033 AND ogu.is_admin = 0
                  AND (b.nid IS NULL OR tn.tid IS NULL)

nid is season_id or bio_id (with a type); term_node is going to be the shirtsize

+8  A: 

The query should be OK. I would run it through a query analyzer and refine the indexes on the tables.

Tor Haugen
I guess this is what I get for trusting contributed modules to have correct index. Thanks!
jskulski
+1  A: 

If you define what you are looking for exactly rather than SELECT * it might speed it up a bit... also OR is not the fastest query to be doing, if you can re-write it without the OR it will be faster.

Also... you could try unions instead of left joins maybe?

SELECT s.user_id
   FROM subscribers s 
   LEFT JOIN bio b ON b.user_id = s.user_id 
   LEFT JOIN shirtsizes ON shirtsize.bio_id = bio.bio_id 
WHERE s.season_id = 185181 AND (bio.bio_id IS NULL OR shirtsize.size IS NULL);

would be something like:

(SELECT s.user_id FROM subscribers s WHERE s.season_id = 185181)
UNION
(SELECT b.user_id, b.bio_id FROM bio b WHERE bio.bio_id IS NULL)
UNION
(SELECT shirtsizes.bio_id FROM shirtsizes WHERE shirtsizes.size is NULL)

(to be honest that doesn't look right to me... but then I never use joins or join syntax or unions...)

I would do:

SELECT *
FROM subscribers s, bio b, shirtsizes sh
WHERE s.season_id = 185181
AND shirtsize.bio_id = bio.bio_id 
AND b.user_id = s.user_id 
AND (bio.bio_id IS NULL 
     OR 
     shirtsize.size IS NULL);
SeanJA
"I never use joins or unions" Of course you do use joins. FROM subscribers s, bio b, shirtsizes shWHERE s.season_id = 185181AND shirtsize.bio_id = bio.bio_id AND b.user_id = s.user_id is a set of joins. It is is using the outdated style of join but it is still a join.
HLGEM
I guess it would have been more correct to write 'I avoid using the join syntax and I never use unions'
SeanJA
+7  A: 

Joins are one of the most expensive operations that you can perform on an SQL query. While it should be able to automatically optimize your query somewhat, maybe try restructuring it. First of all, I would instead of SELECT *, be sure to specify which columns you need from which relations. This will speed things up quite a bit.

If you only need the user ID for example:

SELECT s.user_id
   FROM subscribers s 
   LEFT JOIN bio b ON b.user_id = subscribers.user_id 
   LEFT JOIN shirtsizes ON shirtsize.bio_id = bio.bio_id 
WHERE s.season_id = 185181 AND (bio.bio_id IS NULL OR shirtsize.size IS NULL);

That will allow the SQL database to restructure your query a little more efficiently on its own.

Brian
Regular joins are moderately expensive; outer joins are more expensive; the typical correlated sub-query is diabolical.
Jonathan Leffler
I posted the real query and am only grabbing a few columns, but thanks.
jskulski
+1  A: 

Is bio_id the primary key of bios? Is it really possible for there to be a bios row with b.user_id = subscribers.user_id but with b.bio_id NULL?

Are there shirtsize rows with shirtsize.bio_id NULL? Do those rows ever have shirtsize.size not NULL?

John Saunders
+4  A: 

Obviously I haven't checked this but it seems to be that what you want is to select any subscriber where there there isn't a matching bio or the join between bios and shirtsizes fails. I would consider using NOT EXISTS for this condition. You'll probably want indexes on bio.user_id and shirtsizes.bio_id.

select *
from subscribers
where s.season_id = 185181
      and not exists (select *
                      from bio join shirtsizes on bio.bio_id = shirtsizes.bio_id
                      where bio.user_id = subscribers.user_id)

EDIT:

Based on your update, you may want to create separate keys on each column instead of/in addition to having compound primary keys. It's possible that the joins aren't able to take optimal advantage of the compound primary indexes and an index on the join columns themselves may speed things up.

tvanfosson
Quite a few people have investigated this and it turns out that SELECT * is normally slightly faster than SELECT 1 / SELECT NULL / etc in an EXISTS query. But yes, using EXISTS can often be faster than join, but not always, it's a weapon to try out in each scenario...
Dems
Point taken. I've updated the answer.
tvanfosson
+1  A: 

Would it be any quicker to do a difference between the list of subscribers for the relevant season and the list of subscribers for the season with bios and shirt sizes?

SELECT *
   FROM Subscribers
   WHERE season_id = 185181
     AND user_id NOT IN
         (SELECT DISTINCT s.user_id
             FROM subscribers s
             JOIN bios b ON s.user_id = b.user_id
             JOIN shirtsizes z ON b.bio_id = z.bio_id
             WHERE s.season_id = 185181
         )

This avoids outer joins, which are not as fast as inner joins, and may therefore be quicker. On the other hand, it might be creating two large lists with very few differences between them. It is not clear whether the DISTINCT in the sub-query would improve or harm performance. It implies a sort operation (expensive) but paves the way for a merge-join if the MySQL optimizer supports such things.

There might be other notations available - MINUS or DIFFERENCE, for example.

Jonathan Leffler
There is no reason to use DISTINCT in the subquery.
Bill Karwin
A: 
select * from subscribers where user_id not in (
  select user_id from bio where bio_id not in (
 select bio_id from shirt_sizes
  )
) and season_id=185181
Hafthor
A: 

I presume that your "big table" is subscribers, and that season_id is probably neither selective nor indexed (indexing it is rather meaningless if it's not selective, anyway), which means that you'll have to fully scan subscribers, anyway. Parting, I would join (with an inner join) the two other tables - note that if there is no bio_id in shirt_size it's exactly the same for your query as if there were no bio. First bit:

select uid
from bio
     inner join shirtsizes
             on shirtsizes.bio_id = bio.bio_id

At which point you want to check that shirtsizes is indexed on bio_id. Now you can left outer join this query to subscribers:

select *
from subscribers s
     left outer join (select uid
                      from bio
                      inner join shirtsizes
                              on shirtsizes.bio_id = bio.bio_id) x
                  on x.uid = s.uid
where s.season_id = 185181
  and x.uid is null

which is likely to run reasonably fast if neither bio nor shirtsizes are gigantic ...

A: 

Your query, as it is written now, evaluates all bio's and term_node's if they exist, and then filters them out.

But what you want is just find og_uid's that don't have term_node's (not having a bio also implies not having a term_node)

So you just want to stop evaluating bio's and term_node's as soon as you find the first existing term_node:

SELECT  *
FROM    (
        SELECT  ogu.nid, ogu.is_active, ogu.uid,
                (
                SELECT  1
                FROM    bio b, term_node tn
                WHERE   b.uid = ogu.uid
                        AND tn.nid = b.nid
                LIMIT   1
                ) AS ex
        FROM    og_uid ogu
        WHERE   ogu.nid = 185033
                AND ogu.is_admin = 0
        ) ogu1
WHERE   ex IS NULL

This will evaluate at most one bio and at most one term_node for each og_uid, instead of evaluating all existing thousands and the filtering them out.

Should work much faster.

Quassnoi