views: 220

answers: 2

I'm archiving a web forum that normally gets purged about once a week, so I'm screen-scraping it and storing the posts in my PostgreSQL database.

I also do a little analysis on the data, with some graphs for users to enjoy, such as what time of day the forum is most active, and so forth.

So I have a posts table, like so:

   Column   |            Type
------------+------------------------------
 id         | integer
 body       | text
 created_at | timestamp without time zone
 topic_id   | integer
 user_name  | text
 user_id    | integer

And I now want to have a post count for each user, for my little top 10 posters table.

I came up with this:

SELECT user_id, user_name, count(*)
FROM posts
GROUP BY user_id, user_name
ORDER BY count DESC LIMIT 10

This turns out to be very slow: about 9 seconds, with only around 300,000 rows in the posts table at the moment.
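To see where the time actually goes, the plan can be inspected with EXPLAIN ANALYZE (a diagnostic sketch; the plan output itself depends on your data and PostgreSQL version, so it isn't reproduced here):

```sql
-- Show the actual execution plan and per-node timings for the slow query.
EXPLAIN ANALYZE
SELECT user_id, user_name, count(*)
FROM posts
GROUP BY user_id, user_name
ORDER BY count DESC LIMIT 10;
```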

It takes only half a second if I group on just one of the columns, but I need both.
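One thing that might help (an assumption, not something verified against this table) is a composite index covering both grouped columns, so the planner can aggregate in index order instead of sorting the whole table. The index name here is made up:

```sql
-- Hypothetical composite index; whether the planner actually uses it
-- depends on the PostgreSQL version and the table's statistics.
CREATE INDEX posts_user_id_name_idx ON posts (user_id, user_name);
ANALYZE posts;  -- refresh planner statistics after creating the index
```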

I'm rather new to relational databases and SQL, so I'm not sure whether this is expected, or what I'm doing wrong.

+9  A: 

There's probably only one user with a particular ID, so max(user_name) should equal user_name. Then you can group on a single column, which your post indicates works faster:

SELECT user_id, max(user_name), count(*)
FROM posts
GROUP BY user_id
ORDER BY count(*) DESC
LIMIT 10
Andomar
+1 yup, that is the way to do it :). In SQL:1999 and later, it is actually allowed to list un-aggregated columns in the `SELECT` list that do not also appear in the `GROUP BY` list, as long as those columns are functionally dependent on the `GROUP BY` list. Shameless plug: http://rpbouman.blogspot.com/2007/05/debunking-group-by-myths.html
Roland Bouman
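To make the functional-dependence point concrete, here is a sketch assuming a hypothetical `users` table with primary key `id` (which this schema doesn't actually have). PostgreSQL supports this part of SQL:1999 from version 9.1: an un-aggregated column is allowed when it is functionally dependent on a grouped primary key:

```sql
-- Hypothetical users(id PRIMARY KEY, name) table; u.name may appear
-- unaggregated because it is functionally dependent on u.id.
SELECT u.id, u.name, count(*)
FROM users u
JOIN posts p ON p.user_id = u.id
GROUP BY u.id
ORDER BY count(*) DESC
LIMIT 10;
```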
A: 

You could also add a `HAVING count(*) > 0` clause so that only users who have actually posted are returned.

Matt
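Note that `count(*) > 0` is always true for a grouped row, since every group contains at least one row; `HAVING` only filters anything when given a real threshold. A sketch (the threshold of 10 is arbitrary, for illustration only):

```sql
-- HAVING filters groups after aggregation; count(*) > 0 is always
-- true, so a meaningful filter needs an actual threshold.
SELECT user_id, user_name, count(*)
FROM posts
GROUP BY user_id, user_name
HAVING count(*) >= 10
ORDER BY count(*) DESC;
```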