views: 59 · answers: 4

I'm trying to determine the best general approach for querying two joined tables that each have a lot of data, where each table contributes a column to the WHERE clause. Imagine a simple schema w/ two tables:

posts
 id (int)
 blog_id (int)
 published_date (datetime)
 title (varchar)
 body (text)

posts_tags 
 post_id (int)
 tag_id (int)

With the following indexes:

posts: [blog_id, published_date]
posts_tags: [tag_id, post_id]

We want to SELECT the 10 most recent posts on a given blog that were tagged with "foo". For the sake of this discussion, assume the blog has 10 million posts, and 1 million of those have been tagged with "foo". What is the most efficient way to query for this data?

The naive approach would be to do this:

 SELECT 
  id, blog_id, published_date, title, body
 FROM 
  posts p
 INNER JOIN
  posts_tags pt 
  ON pt.post_id = p.id
 WHERE
  p.blog_id = 1
  AND pt.tag_id = 1
 ORDER BY
  p.published_date DESC
 LIMIT 10

MySQL will use our indexes, but will still end up scanning millions of records. Is there a more efficient way to retrieve this data w/o denormalizing the schema?

+2  A: 

Most likely MySQL will first use the index (blog_id, published_date) to scan all the rows satisfying the condition blog_id = 1, starting with the row with the newest published_date. To do this it just needs to scan backwards through the index from the right place. For each row it must join to the posts_tags table; at that point both the tag_id and the post_id are known, so it is just a lookup in that table's index to see whether the row exists. Since 10% of the rows have the tag foo, on average about 100 rows in the posts table will have to be checked before the first 10 rows of the result set are found.
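That estimate can be sketched with a little arithmetic (a rough back-of-the-envelope model, not MySQL's actual cost estimator; the function name is mine):

```python
def expected_rows_scanned(limit, tag_fraction):
    """Rough estimate of index rows examined before LIMIT is satisfied.

    Assumes tagged posts are spread evenly through the
    (blog_id, published_date) index, so on average one row in
    1/tag_fraction matches. Not MySQL's real cost model.
    """
    return limit / tag_fraction

# 10 posts wanted, 10% tagged "foo": about 100 rows checked.
print(expected_rows_scanned(10, 0.10))
# A much rarer tag (0.1%) pushes that to about 10,000 rows.
print(expected_rows_scanned(10, 0.001))
```

The same formula explains why a rare tag makes this plan slow: the expected scan length grows as 1/frequency.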

I would expect the query you posted to run quite quickly if the tag foo is common. I don't think it will check millions of rows - perhaps a few hundred, or a few thousand if you are unlucky. As soon as it has found 10 matching rows it can stop without checking any more rows.

On the other hand, if you choose a tag that has fewer than 10 occurrences it will be slow as it will have to scan all the rows in that blog.

Do you have performance measurements that show the query is particularly slow even when the tag you are searching for occurs often? Can you post the output of EXPLAIN for the query?
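For readers without the original data set, the access path can be sketched with SQLite's EXPLAIN QUERY PLAN, a rough analogue of MySQL's EXPLAIN (the schema and index names mirror the question; the plan format and planner details differ between engines):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, blog_id INT,
                        published_date TEXT, title TEXT, body TEXT);
    CREATE TABLE posts_tags (post_id INT, tag_id INT);
    CREATE INDEX idx_blog_date ON posts (blog_id, published_date);
    CREATE INDEX idx_tag_post ON posts_tags (tag_id, post_id);
""")
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT p.id
    FROM posts p
    JOIN posts_tags pt ON pt.post_id = p.id
    WHERE p.blog_id = 1 AND pt.tag_id = 1
    ORDER BY p.published_date DESC LIMIT 10
""").fetchall()
for row in plan:
    print(row[-1])  # one line per step, showing which index each table uses
```

The interesting part is whether each table is accessed via an index seek or a full scan; on MySQL, EXPLAIN's `type`, `key`, and `rows` columns answer the same question.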

Mark Byers
I agree; don't disrupt your data model for this until you are quite sure you have a real problem. The query you propose may work very well indeed.
Ollie Jones
This is a general case of a problem I have seen before, so I don't have EXPLAIN output. Your point about 10% of the rows being tagged w/ "foo" is good. Let's say the frequency of that tag was much lower, like 0.1%. In that case, would denormalization be a sensible next step (i.e., duplicating blog_id and published_date on the posts_tags table, with an appropriate index)? Or is there a better way to get this data out of this schema?
Newt
A: 

If the query planner estimates that the number of rows being joined is small, it may not use the index. Because a scan is a linear operation, it performs better for small numbers of rows, whereas an index performs better for large numbers of rows. So, as others have suggested, look at the query plan to see how many rows it estimates.

It is possible to add the blog_id and tag_id conditions to the ON criteria, although it looks strange. I'm not sure whether it will change anything, but I usually experiment with such things.

You might also experiment with reversing the order of the columns in the index, as the order matters. A phonebook is a (LastName, FirstName) index, which is very different from a phonebook indexed by (FirstName, LastName).

It's hard to say deterministically what will perform best without experimenting; I usually work through these kinds of questions with experimentation and benchmarking. Sometimes I find the results run counter to what I expected from the documentation, and digging deeper reveals some subtle behavior or feature that I didn't realize applied to the particular situation.
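The phonebook analogy can be made concrete (a sketch using SQLite; MySQL's B-tree indexes behave the same way in this respect): a composite index can be seeked only when the query constrains its leading column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, blog_id INT, published_date TEXT);
    -- "LastName, FirstName": blog_id is the leading column.
    CREATE INDEX idx_blog_date ON posts (blog_id, published_date);
""")

def plan_for(sql):
    """Return SQLite's query plan as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

# Filter on the leading column: the planner can SEARCH (seek) the index.
print(plan_for("SELECT id FROM posts WHERE blog_id = 1"))
# Filter on the trailing column only: no seek is possible, so it SCANs.
print(plan_for("SELECT id FROM posts WHERE published_date = '2010-01-01'"))
```

Swap the two columns in the CREATE INDEX and the two plans trade places, which is exactly the effect worth benchmarking.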

AaronLS
+1  A: 

if performance is paramount then denormalise as suggested:

table:

create table posts_tags
(
blog_id int unsigned not null, -- denormalise
tag_id smallint unsigned not null,
post_id int unsigned not null,
primary key(blog_id, tag_id, post_id) -- clustered composite PK
)
engine=innodb;

denormalisation trigger:

delimiter #

create trigger posts_tags_before_ins_trig before insert on posts_tags
for each row
proc_main:begin

declare b_id int unsigned default 0;

   select blog_id into b_id from posts where post_id = new.post_id;

   set new.blog_id = b_id;

end proc_main #

delimiter ;

query stored procedure (this and the trigger assume posts.post_id is an auto_increment PK, where the question's schema uses posts.id):

delimiter ;

drop procedure if exists get_latest_blog_posts_by_tag;

delimiter #

create procedure get_latest_blog_posts_by_tag
(
in p_blog_id int unsigned,
in p_tag_id smallint unsigned
)
proc_main:begin

  select
    p.*
  from
    posts p
  inner join 
  (
    select distinct
      pt.post_id
    from
      posts_tags pt
    where
      pt.blog_id = p_blog_id and pt.tag_id = p_tag_id
    order by
      pt.post_id desc limit 10
  ) rp on p.post_id = rp.post_id
  order by
    p.post_id desc;

end proc_main #

delimiter ;

call get_latest_blog_posts_by_tag(1,1);
f00
A: 

Any filter on a joined table should go in the join condition. Strictly speaking, the WHERE clause should contain only conditions on the primary (driving) table, or conditions that span more than one table. While it may not speed up every query, it helps ensure MySQL optimizes the query properly. Applied to the query from the question:

 SELECT
  id, blog_id, published_date, title, body
 FROM
  posts p
 INNER JOIN
  posts_tags pt
  ON pt.post_id = p.id
  AND pt.tag_id = 1
 WHERE
  p.blog_id = 1
 ORDER BY
  p.published_date DESC
 LIMIT 10
Brent Baisley