views: 59 · answers: 4

I'm trying to determine the best general approach for querying two joined tables that each have a lot of data, where each table contributes a column to the WHERE clause. Imagine a simple schema w/ two tables:

posts
 id (int)
 blog_id (int)
 published_date (datetime)
 title (varchar)
 body (text)

posts_tags 
 post_id (int)
 tag_id (int)

With the following indexes:

posts: [blog_id, published_date]
posts_tags: [tag_id, post_id]

We want to SELECT the 10 most recent posts on a given blog that were tagged with "foo". For the sake of this discussion, assume the blog has 10 million posts, and 1 million of those have been tagged with "foo". What is the most efficient way to query for this data?

The naive approach would be to do this:

 SELECT 
  id, blog_id, published_date, title, body
 FROM 
  posts p
 INNER JOIN
  posts_tags pt 
  ON pt.post_id = p.id
 WHERE
  p.blog_id = 1
  AND pt.tag_id = 1
 ORDER BY
  p.published_date DESC
 LIMIT 10

MySQL will use our indexes, but will still end up scanning millions of records. Is there a more efficient way to retrieve this data w/o denormalizing the schema?

+2  A: 

Most likely MySQL will first use the index (blog_id, published_date) to scan all the rows satisfying the condition blog_id = 1, starting with the row with the newest published_date. To do this it just needs to scan backwards through the index from the right place. For each row it must join to the posts_tags table; at that point both the tag_id and the post_id are known, so it is just a lookup in that table's index to see whether the row exists. Since 10% of the rows have the tag foo, on average about 100 rows in the posts table will have to be checked before the first 10 rows of the result set are found.
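That estimate can be sketched with a little arithmetic (a rough back-of-the-envelope model, not MySQL's actual cost estimator; the function name is mine):

```python
def expected_rows_scanned(limit, tag_fraction):
    """Rough estimate of index rows examined before LIMIT is satisfied.

    Assumes tagged posts are spread evenly through the
    (blog_id, published_date) index, so on average one row in
    1/tag_fraction matches. Not MySQL's real cost model.
    """
    return limit / tag_fraction

# 10 posts wanted, 10% tagged "foo": about 100 rows checked.
print(expected_rows_scanned(10, 0.10))
# A much rarer tag (0.1%) pushes that to about 10,000 rows.
print(expected_rows_scanned(10, 0.001))
```

The same formula explains why a rare tag makes this plan slow: the expected scan length grows as 1/frequency.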

I would expect the query you posted to run quite quickly if the tag foo is common. I don't think it will check millions of rows - perhaps a few hundred, or a few thousand if you are unlucky. As soon as it has found 10 matching rows it can stop without checking any more rows.

On the other hand, if you choose a tag that has fewer than 10 occurrences it will be slow as it will have to scan all the rows in that blog.

Do you have performance measurements that show the query is particularly slow even when the tag you are searching for occurs often? Can you post the output of EXPLAIN for the query?
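For readers without the original data set, the access path can be sketched with SQLite's EXPLAIN QUERY PLAN, a rough analogue of MySQL's EXPLAIN (the schema and index names mirror the question; the plan format and planner details differ between engines):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, blog_id INT,
                        published_date TEXT, title TEXT, body TEXT);
    CREATE TABLE posts_tags (post_id INT, tag_id INT);
    CREATE INDEX idx_blog_date ON posts (blog_id, published_date);
    CREATE INDEX idx_tag_post ON posts_tags (tag_id, post_id);
""")
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT p.id
    FROM posts p
    JOIN posts_tags pt ON pt.post_id = p.id
    WHERE p.blog_id = 1 AND pt.tag_id = 1
    ORDER BY p.published_date DESC LIMIT 10
""").fetchall()
for row in plan:
    print(row[-1])  # one line per step, showing which index each table uses
```

The interesting part is whether each table is accessed via an index seek or a full scan; on MySQL, EXPLAIN's `type`, `key`, and `rows` columns answer the same question.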

Mark Byers
I agree; don't disrupt your data model for this until you are quite sure you have a real problem. The query you propose may work very well indeed.
Ollie Jones
This is a general case of a problem I have seen before, so I don't have EXPLAIN output. Your point about 10% of the rows being tagged w/ "foo" is good. Let's say the frequency of that tag was much lower, like 0.1%. In that case, would denormalization be a sensible next step (i.e., duplicating blog_id and published_date on the posts_tags table, with an appropriate index)? Or is there a better way to get this data out of this schema?
Newt
A: 

If the query planner estimates that the number of rows being joined is small, it may not use the index. Because a scan is a linear operation, it performs better for small numbers of rows, whereas an index performs better for large numbers of rows. So, as others have suggested, look at the query plan to see how many rows it estimates.

It is possible to add the blog_id and tag_id conditions to the ON criteria, although it looks strange. I'm not sure whether it will change anything, but I usually experiment with such things.

You might also experiment with reversing the order of the columns in the index, as the order matters. A phonebook is a (LastName, FirstName) index, which is very different from a phonebook indexed by (FirstName, LastName).

It's hard to say deterministically what will perform best without experimenting; I usually work through these kinds of questions with experimentation and benchmarking. Sometimes I find the results run counter to what I expected from the documentation, and digging deeper reveals some subtle behavior or feature that I didn't realize applied to the particular situation.
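The phonebook analogy can be made concrete (a sketch using SQLite; MySQL's B-tree indexes behave the same way in this respect): a composite index can be seeked only when the query constrains its leading column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, blog_id INT, published_date TEXT);
    -- "LastName, FirstName": blog_id is the leading column.
    CREATE INDEX idx_blog_date ON posts (blog_id, published_date);
""")

def plan_for(sql):
    """Return SQLite's query plan as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

# Filter on the leading column: the planner can SEARCH (seek) the index.
print(plan_for("SELECT id FROM posts WHERE blog_id = 1"))
# Filter on the trailing column only: no seek is possible, so it SCANs.
print(plan_for("SELECT id FROM posts WHERE published_date = '2010-01-01'"))
```

Swap the two columns in the CREATE INDEX and the two plans trade places, which is exactly the effect worth benchmarking.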

AaronLS
+1  A: 

if performance is paramount then denormalise as suggested:

table:

create table posts_tags
(
blog_id int unsigned not null, -- denormalise
tag_id smallint unsigned not null,
post_id int unsigned not null,
primary key(blog_id, tag_id, post_id) -- clustered composite PK
)
engine=innodb;

denormalisation trigger:

delimiter #

create trigger posts_tags_before_ins_trig before insert on posts_tags
for each row
proc_main:begin

declare b_id int unsigned default 0;

   select blog_id into b_id from posts where post_id = new.post_id;

   set new.blog_id = b_id;

end proc_main #

delimiter ;

query stored procedure (this and the trigger assume posts.post_id is an auto_increment PK, where the question's schema uses posts.id):

delimiter ;

drop procedure if exists get_latest_blog_posts_by_tag;

delimiter #

create procedure get_latest_blog_posts_by_tag
(
in p_blog_id int unsigned,
in p_tag_id smallint unsigned
)
proc_main:begin

  select
    p.*
  from
    posts p
  inner join 
  (
    select distinct
      pt.post_id
    from
      posts_tags pt
    where
      pt.blog_id = p_blog_id and pt.tag_id = p_tag_id
    order by
      pt.post_id desc limit 10
  ) rp on p.post_id = rp.post_id
  order by
    p.post_id desc;

end proc_main #

delimiter ;

call get_latest_blog_posts_by_tag(1,1);
f00
A: 

Any filter on a joined table should go in the join condition. Strictly speaking, the WHERE clause should contain only conditions on the primary (driving) table, or conditions that span more than one table. While it may not speed up every query, it helps ensure MySQL optimizes the query properly. Applied to the query from the question:

 SELECT
  id, blog_id, published_date, title, body
 FROM
  posts p
 INNER JOIN
  posts_tags pt
  ON pt.post_id = p.id
  AND pt.tag_id = 1
 WHERE
  p.blog_id = 1
 ORDER BY
  p.published_date DESC
 LIMIT 10
Brent Baisley