I'm trying to determine the best general approach for querying two large joined tables when the WHERE clause filters on a column from each table. Imagine a simple schema with two tables:
posts
    id (int)
    blog_id (int)
    published_date (datetime)
    title (varchar)
    body (text)
posts_tags
    post_id (int)
    tag_id (int)
With the following indexes:
posts: [blog_id, published_date]
posts_tags: [tag_id, post_id]
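For concreteness, the schema and indexes above could be written roughly as the following MySQL DDL. This is only a sketch: the VARCHAR length, the NOT NULL constraints, the primary key on posts.id, and the index names are assumptions for illustration, and the second index is assumed to live on posts_tags.

```sql
-- Sketch of the schema described above (MySQL syntax).
-- Assumed, not stated in the question: VARCHAR(255), NOT NULL,
-- AUTO_INCREMENT, the primary key on id, and both index names.
CREATE TABLE posts (
    id             INT          NOT NULL AUTO_INCREMENT,
    blog_id        INT          NOT NULL,
    published_date DATETIME     NOT NULL,
    title          VARCHAR(255) NOT NULL,
    body           TEXT         NOT NULL,
    PRIMARY KEY (id),
    -- Composite index: filter by blog, then order by publish date.
    KEY idx_posts_blog_published (blog_id, published_date)
);

CREATE TABLE posts_tags (
    post_id INT NOT NULL,
    tag_id  INT NOT NULL,
    -- Composite index: find all posts for a tag, or test one (tag, post) pair.
    KEY idx_posts_tags_tag_post (tag_id, post_id)
);
```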
We want to SELECT the 10 most recent posts on a given blog that were tagged with "foo". For the sake of this discussion, assume the blog has 10 million posts, and 1 million of those have been tagged with "foo". What is the most efficient way to query for this data?
The naive approach would be to do this (assuming the tag "foo" has tag_id 1):
SELECT
    id, blog_id, published_date, title, body
FROM
    posts p
INNER JOIN
    posts_tags pt
    ON pt.post_id = p.id
WHERE
    p.blog_id = 1
    AND pt.tag_id = 1
ORDER BY
    p.published_date DESC
LIMIT 10
MySQL will use our indexes, but it still ends up scanning millions of rows. Is there a more efficient way to retrieve this data without denormalizing the schema?
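For anyone who wants to reproduce this, the plan can be inspected by prefixing the query with EXPLAIN. This is a minimal sketch that assumes the tag "foo" has tag_id 1. No output is shown because the plan depends on the MySQL version and on the table statistics.

```sql
-- Ask MySQL how it intends to execute the naive query.
-- In the result, the `key` column shows which index is chosen per table,
-- `rows` is the optimizer's estimate of rows examined (not an exact count),
-- and `Extra` shows notes such as "Using temporary" or "Using filesort".
EXPLAIN
SELECT
    p.id, p.blog_id, p.published_date, p.title, p.body
FROM
    posts p
INNER JOIN
    posts_tags pt
    ON pt.post_id = p.id
WHERE
    p.blog_id = 1
    AND pt.tag_id = 1
ORDER BY
    p.published_date DESC
LIMIT 10;
```

The columns are qualified with the p alias here only for readability; the query is otherwise the same as the one above.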