I'm trying to optimize a slow query that was generated by the Django ORM. It is a many-to-many query. It takes over 1 min to run.

The tables have a good amount of data, but they aren't huge (400k rows in sp_article and 300k rows in sp_article_categories).

# The Django ORM call: categories.article_set.filter(post_count__lte=50)

EXPLAIN ANALYZE SELECT * 
                  FROM "sp_article" 
            INNER JOIN "sp_article_categories" ON ("sp_article"."id" = "sp_article_categories"."article_id") 
                WHERE ("sp_article_categories"."category_id" = 1081  
                  AND "sp_article"."post_count" <= 50 )

Nested Loop  (cost=0.00..6029.01 rows=656 width=741) (actual time=0.472..25.724 rows=1266 loops=1)
  ->  Index Scan using sp_article_categories_category_id on sp_article_categories  (cost=0.00..848.82 rows=656 width=12) (actual time=0.015..1.305 rows=1408 loops=1)
        Index Cond: (category_id = 1081)
  ->  Index Scan using sp_article_pkey on sp_article  (cost=0.00..7.88 rows=1 width=729) (actual time=0.014..0.015 rows=1 loops=1408)
        Index Cond: (sp_article.id = sp_article_categories.article_id)
        Filter: (sp_article.post_count <= 50)
Total runtime: 26.536 ms

I have indexes on:

sp_article_categories.article_id (type: btree)
sp_article_categories.category_id
sp_article.post_count (type: btree)

Any suggestions on how I can tune this to speed the query up?

Thanks!

A: 

Put an index on sp_article_categories.category_id
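
In PostgreSQL that would be something like the following (the index name is just illustrative):

CREATE INDEX sp_article_categories_category_id
    ON sp_article_categories (category_id);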

Steven
Already have that one. I forgot to include it in my post...
erikcw
A: 

From a pure SQL perspective, a join is more efficient when the base table has fewer rows and the WHERE conditions are applied to it before it is joined to the other table.

So see if you can get Django to select from the categories first, then filter the category_id before joining to the article table.

Pseudo-code follows:

SELECT *
  FROM sp_article_categories c
 INNER JOIN sp_article a
    ON c.category_id = 1081
   AND c.article_id = a.id
 WHERE a.post_count <= 50

And put an index on category_id like Steven suggests.

Randolph Potter
Didn't seem to make a difference: SELECT * FROM sp_article_categories c INNER JOIN sp_article a ON c.category_id = 1081 AND c.article_id = a.id WHERE a.post_count <= 50;
erikcw
You may need a composite index on the join table, so that category_id is included alongside article_id in the btree index.
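In raw SQL that would be something like (index name is illustrative):

CREATE INDEX sp_article_categories_category_article
    ON sp_article_categories (category_id, article_id);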
Randolph Potter
A: 

You can use field names instead of * too.

select [fields] from....
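
Applied to the query in the question, that would look roughly like this (any column names other than id and post_count are placeholders):

SELECT a.id, a.post_count
  FROM sp_article a
 INNER JOIN sp_article_categories c ON c.article_id = a.id
 WHERE c.category_id = 1081
   AND a.post_count <= 50;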

pedrorezende
I'm using field names in the actual code; I just used * to keep things short in the post. It doesn't seem to make a difference performance-wise when I benchmark it.
erikcw
A: 

Hi Erik!

I assume you have run ANALYZE on the database to get fresh statistics.

It seems that the join between sp_article.id and sp_article_categories.article_id is costly. What data type is the article id? If it isn't numeric, you should perhaps consider making it numeric (integer or bigint, whatever suits your needs). In my experience it can make a big difference in performance. Hope it helps.
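
One way to check the column types is a quick query against the standard information_schema views, for example:

SELECT table_name, column_name, data_type
  FROM information_schema.columns
 WHERE table_name IN ('sp_article', 'sp_article_categories')
   AND column_name IN ('id', 'article_id');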

Cheers! // John

John P
+1  A: 

You've provided the vital information here: the EXPLAIN ANALYZE output. It isn't showing a one-minute runtime, though; it's showing about 26 milliseconds. So either that isn't the query that's actually being run, or the problem is elsewhere.

The only difference between EXPLAIN ANALYZE and a real application is that the result rows aren't actually returned to the client. You would need a lot of data to slow that down to over a minute, though.

The other suggestions are all off the mark, since they ignore the fact that the query isn't slow. You have the relevant indexes (both sides of the join are using an index scan) and the planner is perfectly capable of filtering on the category table first (that's the whole point of having a half-decent query planner).

So you first need to figure out what exactly is slow...
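
For example, a rough way to time the full fetch from Django and see exactly which SQL is being run (a sketch only; it assumes DEBUG = True so that connection.queries is populated, and reuses the categories object from the question):

import time
from django.db import connection, reset_queries

reset_queries()
start = time.time()
# Force the lazy queryset to execute and fetch every row, as a real page would.
articles = list(categories.article_set.filter(post_count__lte=50))
elapsed = time.time() - start
print("fetched %d rows in %.3f s" % (len(articles), elapsed))
print(connection.queries[-1])  # the SQL Django actually ran, and how long it took

If the fetch itself turns out to be fast, the slowness is somewhere else in the view or template.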

Richard Huxton