views:

177

answers:

3

I saw this question on meta: http://meta.stackoverflow.com/questions/33101/how-does-so-query-comments

I wanted to set the record straight and ask the question in a proper technical way.

Say I have 2 tables:

Posts
 id
 content
 parent_id           (null for questions, question_id for answer)  

Comments
 id 
 body 
 is_deleted
 post_id 
 upvotes 
 date 

Note: I think this is how the schema for SO is setup, answers have a parent_id which is the question, questions have null there. Questions and answers are stored in the same table.

How do I pull out comments stackoverflow style in a very efficient way with minimal round trips?

The rules:

  1. A single query should pull out all the comments needed for a page with multiple posts to render
  2. Needs to only pull out 5 comments per answer, with pref for upvotes
  3. Needs to provide enough information to inform the user there are more comments beyond the 5 that are there. (and the actual count - eg. 2 more comments)
  4. Sorting is really hairy for comments, as you can see on the comments in this question. The rules are, display comments by date, HOWEVER if a comment has an upvote it is to get preferential treatment and be displayed as well at the bottom of the list. (this is nasty hard to express in sql)

If any denormalizations make stuff better what are they? What indexes are critical?

+1  A: 

Use:

WITH post_hierarchy AS (
  SELECT p.id,
         p.content,
         p.parent_id,
         1 AS post_level
    FROM POSTS p
   WHERE p.parent_id IS NULL
  UNION ALL
  SELECT p.id,
         p.content,
         p.parent_id,
         ph.post_level + 1 AS post_level
    FROM POSTS p
    JOIN post_hierarchy ph ON ph.id = p.parent_id)  
SELECT ph.id, 
       ph.post_level,
       c.upvotes,
       c.body
  FROM COMMENTS c
  JOIN post_hierarchy ph ON ph.id = c.post_id
ORDER BY ph.post_level, c.date

Couple of things to be aware of:

  1. StackOverflow displays the first 5 comments, doesn't matter if they were upvoted or not. Subsequent comments that were upvoted are immediately displayed
  2. You can't accommodate a limit of 5 comments per post without devoting a SELECT to each post. Adding TOP 5 to what I posted will only return the first five rows based on the ORDER BY statement
OMG Ponies
+3  A: 

I wouldn't bother to filter the comments using SQL (which may surprise you because I'm an SQL advocate). Just fetch them all sorted by CommentId, and filter them in application code.

It's actually pretty infrequent that there are more than five comments for a given post, so that they need to be filtered. In StackOverflow's October data dump, 78% of posts have zero or one comment, and 97% have five or fewer comments. Only 20 posts have >= 50 comments, and only two posts have over 100 comments.

So writing complex SQL to do that kind of filtering would increase complexity when querying all posts. I'm all for using clever SQL when appropriate, but this would be penny-wise and pound-foolish.

You could do it this way:

SELECT q.PostId, a.PostId, c.CommentId
FROM Posts q
LEFT OUTER JOIN Posts a
  ON (a.ParentId = q.PostId)
LEFT OUTER JOIN Comments c
  ON (c.PostId IN (q.PostId, a.PostId))
WHERE q.PostId = 1234
ORDER BY q.PostId, a.PostId, c.CommentId;

But this gives you redundant copies of q and a columns, which is significant because those columns include text blobs. The extra cost of copying redundant text from the RDBMS to the app becomes significant.

So it's probably better to not do this in two queries. Instead, given that the client is viewing a Question with PostId = 1234, do the following:

SELECT c.PostId, c.Text
FROM Comments c
JOIN (SELECT 1234 AS PostId UNION ALL 
    SELECT a.PostId FROM Posts a WHERE a.ParentId = 1234) p
  ON (c.PostId = p.PostId);

And then sort through them in application code, collecting them by the referenced post and filtering out extra comments beyond the five most interesting ones per post.


I tested these two queries against a MySQL 5.1 database loaded with StackOverflow's data dump from October. The first query takes about 50 seconds. The second query is pretty much instantaneous (after I pre-cached indexes for the Posts and Comments tables).

The bottom line is that insisting on fetching all the data you need using a single SQL query is an artificial requirement (probably based on a misconception that the round-trip of issuing a query against an RDBMS is overhead that must be minimized at any cost). Often a single query is a less efficient solution. Do you try to write all your application code in a single function? :-)

Bill Karwin
I agree with your sentiments here, my implementation would be actually a slight optimization, I would store comment_count in the posts table. on the client side pull out all the posts for rendering, whizz through them and then do a select * from comments where post_id in (id1,id2,id3) - for all posts with more than 0 comments) this makes stuff ultra simple and very efficient for the general case
Sam Saffron
+1  A: 

the real question is not the query, but the schema, specially the clustered indexes. The comment ordering requirements are ambuigous as you defined them (is it only 5 per answer or not?). I interpreted the requirements as 'pull 5 comments per post (answer or question) and give preference to upvoted ones, then to newer ones. I know this is not how SO comments are showen, but you gotta define your requirements more precisesly.

Here is my query:

declare @postId int;
set @postId = ?;

with cteQuestionAndReponses as (
  select post_id
  from Posts
  where post_id = @postId
  union all
  select post_id
  from Posts
  where parent_id = @postId)
select * from
cteQuestionAndReponses p
outer apply (
  select count(*) as CommentsCount
  from Comments c 
  where is_deleted = 0
  and c.post_id = p.post_id) as cc
outer apply (
  select top(5) *
  from Comments c 
  where is_deleted = 0
  and p.post_id = c.post_id
  order by upvotes desc, date desc
  ) as c

I have some 14k posts and 67k comments in my test tables, the query gets the posts in 7ms:

Table 'Comments'. Scan count 12, logical reads 50, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Posts'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 7 ms.

Here is the schema I tested with:

create table Posts (
 post_id int identity (1,1) not null
 , content varchar(max) not null
 , parent_id int null -- (null for questions, question_id for answer) 
 , constraint fkPostsParent_id 
    foreign key (parent_id)
    references Posts(post_id)
 , constraint pkPostsId primary key nonclustered (post_id)
);
create clustered index cdxPosts on 
  Posts(parent_id, post_id);
go

create table Comments (
 comment_id int identity(1,1) not null
 , body varchar(max) not null
 , is_deleted bit not null default 0
 , post_id int not null
 , upvotes int not null default 0
 , date datetime not null default getutcdate()
 , constraint pkComments primary key nonclustered (comment_id)
 , constraint fkCommentsPostId
    foreign key (post_id)
    references Posts(post_id)
 );
create clustered index cdxComments on 
  Comments (is_deleted, post_id,  upvotes, date, comment_id);
go

and here is my test data generation:

insert into Posts (content)
select 'Lorem Ipsum' 
from master..spt_values;

insert into Posts (content, parent_id)
select 'Ipsum Lorem', post_id
from Posts p
cross apply (
  select top(checksum(newid(), p.post_id) % 10) Number
  from master..spt_values) as r
where parent_id is NULL  

insert into Comments (body, is_deleted, post_id, upvotes, date)
select 'Sit Amet'
  -- 5% deleted comments
  , case when abs(checksum(newid(), p.post_id, r.Number)) % 100 > 95 then 1 else 0 end
  , p.post_id
  -- up to 10 upvotes
  , abs(checksum(newid(), p.post_id, r.Number)) % 10
  -- up to 1 year old posts
  , dateadd(minute, -abs(checksum(newid(), p.post_id, r.Number) % 525600), getutcdate()) 
from Posts p
cross apply (
  select top(abs(checksum(newid(), p.post_id)) % 10) Number
  from master..spt_values) as r
Remus Rusanu