I have a stored proc (called sprocGetArticles) which returns a list of articles from the articles table. This stored proc does not have any parameters.

Users can leave comments for each article and I store these comments in the comments table linked by the article id.

Is there any way I can do a comment count for each articleid in the returned list from within the sprocGetArticles stored procedure so I only have to make one call to the database?

My problem is that I need the article id to do the count, which I seem unable to declare.

Is this the best approach anyway?

+2  A: 

Well, without knowing what you are selecting or your general schema (and assuming you are at least using SQL Server 2005), something like this should work:

WITH CommentCounts AS
(
   SELECT COUNT(*) CommentCount, ac.ArticleID
   FROM Articles a
   INNER JOIN ArticleComments ac
      ON ac.ArticleID = a.ID
   GROUP BY ac.ArticleID
)

SELECT a.*,
       c.CommentCount
FROM Articles a
INNER JOIN CommentCounts c
   ON a.ID = c.ArticleID

This is a Common Table Expression or CTE. You can read more about them here: http://msdn.microsoft.com/en-us/library/ms190766.aspx

AndyMcKenna
That isn't valid CTE syntax per your link.
OMG Ponies
Thanks, don't know how I missed that.
AndyMcKenna
+5  A: 

SQL allows scalar subqueries to be returned as projected columns, and subqueries can be correlated with the parent query. So it is easy to get the count with a correlated subquery that counts the comments for each article id:

SELECT a.*,
       (SELECT COUNT(*)
        FROM Comments c
        WHERE c.article_id = a.article_id) AS CountComments
FROM Articles a;

Note that counting the comments each time can be quite expensive; it is better to keep the count as a property of the Article itself.
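
As an illustration of that denormalized approach (a hypothetical sketch, not part of the original answer - the CommentCount column, trigger name, and backfill are assumed), a trigger on Comments could keep the stored count in sync:

-- Hypothetical: add a denormalized count column (backfill existing rows once).
ALTER TABLE Articles ADD CommentCount INT NOT NULL DEFAULT 0;
GO
CREATE TRIGGER trgCommentsCount ON Comments
AFTER INSERT, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Recount only the articles touched by this statement.
    UPDATE a
       SET CommentCount = (SELECT COUNT(*)
                             FROM Comments c
                            WHERE c.article_id = a.article_id)
      FROM Articles a
     WHERE a.article_id IN (SELECT article_id FROM inserted
                            UNION
                            SELECT article_id FROM deleted);
END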

Remus Rusanu
Can you please explain your comment a bit further about keeping the count as an article property? Why does this mean I wouldn't need to count the comments each time? Thanks.
Cunners
Thank you. This is what I was looking for!
David Stratton
+1  A: 

The following will work on SQL Server 2005+ or Oracle 9i+:

WITH COMMENT_COUNT AS (
      SELECT ac.article_id,
             COUNT(*) AS numComments
        FROM ARTICLE_COMMENTS ac
    GROUP BY ac.article_id)
SELECT t.description,
       cc.numComments
  FROM ARTICLES t
  JOIN COMMENT_COUNT cc ON cc.article_id = t.article_id

SQL Server calls it a Common Table Expression (CTE); Oracle calls it subquery factoring.

Alternative:

SELECT t.description,
       cc.numComments
  FROM ARTICLES t
  JOIN (SELECT ac.article_id,
               COUNT(*) AS numComments
          FROM ARTICLE_COMMENTS ac
      GROUP BY ac.article_id) cc ON cc.article_id = t.article_id

Performing the subquery in the SELECT statement works, but will likely perform the worst of all the suggestions, because it executes once for every row.

OMG Ponies
+2  A: 

Maybe I'm missing something, but what's with all the subqueries and inline views? Why not just do a straightforward left-join, e.g.:

  SELECT a.ArticleId
       , a.ArticleName
       , (other a columns)
       , COUNT(c.ArticleId)  -- counts 0, not 1, for articles with no comments
    FROM Articles a
         LEFT JOIN Comments c
                ON c.ArticleId = a.ArticleId
GROUP BY a.ArticleId
       , a.ArticleName
       , (other a columns);
Steve Broberg
+1  A: 

One option no one has mentioned so far would be a computed column on your article table which counts the number of comments. This is in general much faster than actually computing the number of comments every time, and if you really need to query that number frequently, it could save you a lot of processing overhead!

In SQL Server 2005 and up, what you could do in this case is create a small stored function to count the number of comments for each article, and then add this as a computed column to your article table. You could then use that as a normal column and trust me - it's a lot quicker than using subqueries all the time!

CREATE FUNCTION dbo.CountComments(@ArticleID INT)
RETURNS INT 
WITH SCHEMABINDING
AS BEGIN
    DECLARE @ArticleCommentCount INT

    SELECT @ArticleCommentCount = COUNT(*)
    FROM dbo.ArticleComments
    WHERE ArticleID = @ArticleID

    RETURN @ArticleCommentCount
END
GO

Add this to your article table as a column:

ALTER TABLE dbo.Articles
    ADD CommentCount AS dbo.CountComments(ArticleID)

and from then on, just use it as a normal column:

SELECT ArticleID, ArticleTitle, ArticlePostDate, CommentCount 
FROM dbo.Articles

To make it even faster, you could add this column as a persisted column to your table, and then it really rocks! :-)

ALTER TABLE dbo.Articles
    ADD CommentCount AS dbo.CountComments(ArticleID) PERSISTED

It's a bit more work upfront, but if you need this often and all the time, it could be well worth the trouble! It also works great for e.g. reading out certain bits of information from an XML column stored in your database table and exposing it as a regular INT column or whatever.
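
A quick sketch of that XML case (hypothetical - the Orders table, OrderData column, and XPath here are made up for illustration): XML methods can't appear directly in a computed-column definition, so you wrap the .value() call in a schema-bound scalar function first:

CREATE FUNCTION dbo.GetPriority(@Data XML)
RETURNS INT
WITH SCHEMABINDING
AS BEGIN
    -- Pull a single INT out of the stored XML document
    RETURN @Data.value('(/order/priority)[1]', 'INT')
END
GO

ALTER TABLE dbo.Orders
    ADD Priority AS dbo.GetPriority(OrderData)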

Highly recommend! It's a feature often overlooked in SQL Server.

Marc

marc_s
thanks that worked really well!
Cunners
Steve Broberg
Obviously, SQL Server optimizes these queries. If you make the column PERSISTED, then you're actually storing the value - and I suspect SQL Server will only recalculate it when the count changes.
marc_s
I would be happy to share my execution plans with you if you don't believe me - the same query using subqueries uses 90% of the total time, while the one with the computed column only uses 10% - it's almost 10x faster. It's a fact.
marc_s
I don't doubt you - I'd be interested in someone who understands the internals explaining how this is possible. Intuitively, it doesn't make sense that a computed column would be faster, as there is less information available to the optimizer (or at least, more obfuscated information). The function dictates that the subquery must be run for each row, since it is possible the function may not return the same value every time. Similarly, it would seem a complex problem for a DBMS to determine under what conditions a refresh would be necessary for an arbitrary function.
Steve Broberg
Also, I see from doing a search that SQL Server mandates that persisted CCs be deterministic, which answers part of my question. However, I can still envision writing a deterministic function that is nontrivial to "reverse" in order to determine which row(s) in a table with a persisted CC need updating. Perhaps SS uses a scheme where it marks a row as "needs updating" and then recomputes the persisted column the next time it is referenced (which may result in all rows needing recomputation even if they are not strictly affected by the update).
Steve Broberg
@Steve: I've wondered myself, and it does appear to me that somehow, SQL Server must be storing and reusing values, instead of recomputing them every time they're needed. But how they do that internally in detail escapes me.
marc_s
@Marc: I've run an experiment using your solution, and my results do not agree with what you've reported here - I'd be interested in you looking at my long post below and hearing your comments.
Steve Broberg
SQL Server cannot maintain an aggregate computed and persisted column as defined in the post. The column will be populated with the value at record creation time and will not be further updated when a comment is added or removed. SQL Server can, however, maintain an indexed view over article comments to similar effect, but there are some restrictions on creating indexed views.
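
A minimal sketch of that indexed-view alternative (assuming the ArticleComments schema used in the other answers; indexed views require schema binding, two-part names, and COUNT_BIG(*)):

CREATE VIEW dbo.vArticleCommentCounts
WITH SCHEMABINDING
AS
SELECT ArticleID, COUNT_BIG(*) AS CommentCount
FROM dbo.ArticleComments
GROUP BY ArticleID
GO
CREATE UNIQUE CLUSTERED INDEX IX_vArticleCommentCounts
    ON dbo.vArticleCommentCounts (ArticleID)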
Remus Rusanu
A: 

Regarding the use of computed columns mentioned in the answer above, I wanted to verify the claim that using a computed column would produce better performance (it didn't make sense to me, but I'm no SQL Server guru). The results I got indicate that using a computed column is indeed slower - much slower - than a simple GROUP BY or subquery. I ran a test on a SQL Server instance I have on my own PC - here are the methodology and results:

CREATE TABLE smb_header (keycol INTEGER NOT NULL
                        , name1 VARCHAR(255)
                        , name2 VARCHAR(255));

INSERT INTO smb_header
  VALUES (1
        , 'This is column 1'
        , 'This is column 2'
         );

INSERT INTO smb_header
   SELECT (SELECT MAX(keycol)
             FROM smb_header
          ) + keycol
        , name1
        , name2
     FROM smb_header;
-- (repeat 20 times to generate ~1 million rows)

ALTER TABLE smb_header ADD PRIMARY KEY (keycol);

CREATE TABLE smb_detail (keycol INTEGER
                        , commentno INTEGER
                        , commenttext VARCHAR(255));

INSERT INTO smb_detail
   SELECT keycol
        , 1
        , 'A comment that describes this issue'
     FROM smb_header;

ALTER TABLE smb_detail ADD PRIMARY KEY (keycol, commentno);

ALTER TABLE smb_detail ADD FOREIGN KEY (keycol) 
                           REFERENCES smb_header (keycol);

INSERT INTO smb_detail
   SELECT keycol
        , (SELECT MAX(commentno)
             FROM smb_detail sd2
            WHERE sd2.keycol = sd1.keycol
          ) + commentno
        , 'A comment that follows comment number ' 
          + CAST(sd1.commentno AS VARCHAR(32))
     FROM smb_detail sd1
    WHERE keycol % 31 = 0;

-- repeat 5 times, to create some records that have 64 comments
-- where others have one.

At this point, there will be around 1 million rows in the header, and either 1 or 64 comments for each.

Now I create the function (the same as yours above, only with my column & table names), and the computed column:

ALTER TABLE dbo.smb_header ADD CommentCountPersist AS dbo.CountComments(keycol)

By the way, PERSISTED will not work for this column, as I suspected in my comments above - it is not possible or too difficult for SQL Server to keep track of which rows need updating if you refer to other tables in your function. Using the PERSISTED keyword produces the error:

Msg 4934, Level 16, State 3, Line 1
Computed column 'CommentCountPersist' in table 'smb_header' cannot be 
persisted because the column does user or system data access.

This makes sense to me - I don't see how SQL Server could determine what rows need updating when other rows change, for any function that could be implemented, without the process of updates being horribly inefficient.

Now, for the tests. I create a temp table #holder to insert the rows into - I want to make sure when my queries run, I process the entire result set, not just the first few rows that would appear in the Mgmt Studio grid control.

  SELECT h.keycol
       , h.name1
       , CommentCountPersist
    INTO #holder
    FROM smb_header h
   WHERE h.keycol < 0

Here are the results of my queries. First, the computed column:

  INSERT
    INTO #holder
  SELECT h.keycol
       , h.name1
       , CommentCountPersist
    FROM smb_header h
   WHERE h.keycol between 5000 and 10000

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
Table 'Worktable'. Scan count 1, logical reads 10160, physical reads 0, 
                   read-ahead  reads 0, lob logical reads 0, 
                   lob physical reads 0, lob read-ahead reads 0.
Table 'smb_header'. Scan count 1, logical reads 44, physical reads 0, 
                    read-ahead reads 0, lob logical reads 0, 
                    lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 265 ms,  elapsed time = 458 ms.

(5001 row(s) affected)
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

Now the GROUP BY version:

  INSERT
    INTO #holder
  SELECT h.keycol
       , h.name1
       , COUNT(*)
    FROM smb_header h
       , smb_detail d 
   WHERE h.keycol between 5000 and 10000
     AND h.keycol = d.keycol 
GROUP BY h.keycol, h.name1

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
Table 'smb_header'. Scan count 1, logical reads 44, physical reads 0, 
                    read-ahead reads 0, lob logical reads 0, 
                    lob physical reads 0, lob read-ahead reads 0.
Table 'smb_detail'. Scan count 1, logical reads 366, physical reads 0, 
                    read-ahead reads 0, lob logical reads 0, 
                    lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 15 ms,  elapsed time = 13 ms.

(5001 row(s) affected)
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

Writing the query with the subquery in the SELECT clause as Remus did above yields the same plan & performance as the GROUP BY (which would be expected).

As you can see, the computed column version performs significantly worse. This makes sense to me, as the optimizer is forced to call the function and do the COUNT(*) for every row in the header, instead of using more sophisticated methods of resolving the two sets of data.

It is possible that I'm doing something wrong here. I'd be interested in marc_s contributing his findings.

Steve Broberg
A: 
marc_s
Marc, I think you may be misinterpreting the meaning of those "query cost (relative to batch)" messages in the plan window. I believe those costs are not reflective of the actual costs of the query - they're either estimated costs at query compilation time, or they represent the CPU costs in parsing/processing the query. Try this: turn on statistics I/O and statistics time (in the Query->Query Options...->Advanced dialog) and run the queries again, looking at the actual measured time elapsed. You'll see that the query with the computed columns is slower. (continued)
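For reference, the same options can also be switched on per session in T-SQL (standard SQL Server commands, not part of the original comment):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;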
Steve Broberg
(you need to look on the "Messages" tab to see these results). To make this more dramatic, change the constraint on keycol to "between 5000 and 200000". On my machine (a 2GHz quad-core Intel, with the db files on a 2-disk RAID 0), these are the stats I got: ------- Computed Column: SQL Server Execution Times: CPU time = 70437 ms, elapsed time = 80569 ms. ------- GROUP BY: SQL Server Execution Times: CPU time = 297 ms, elapsed time = 3628 ms. The GROUP BY runs 22 times faster than the CC.
Steve Broberg
As I think about it, I believe that the "query cost (relative to batch)" is not taking into account any I/O incurred by function calls. If you look at the i/o stats on the Messages tab, you'll see that it's only counting i/o from the smb_header table, whereas the group by is counting from both tables.
Steve Broberg
Finally, when you run the queries as I did (inserting into the #holder table, which runs faster for both due to avoiding the server/client traffic), you'll see that the missing i/o of the functional query seems to be accounted for in the mysterious "Worktable" line item.
Steve Broberg
Steve: I'm using the **ACTUAL** execution plans - **NOT** the guesstimates. I do think these are indeed at least an indication (if not 100% accurate)
marc_s
and as I said - I don't know enough about the internals of SQL Server to be really in a position to argue this with you - I can only speak from my experiences of 10+ years with SQL Server and in my experience, computed columns (especially if they are persisted - which in this scenario won't work, I admit) do tend to speed up things significantly.
marc_s
But I'd really love to get a well founded and authoritative answer to this one. Your arguments do make good sense, Steve - I just lack the insights to either support or counter them.... let's see who else might get involved....
marc_s
I also selected the actual plans. As I mentioned above, I don't think the % of batch reports are entirely accurate. Although I'm not current on SQL Server, I was a Sybase developer 10 years ago, and I'm familiar with the SET STATISTICS IO and SET STATISTICS TIME options. Regardless of what the GUIs are telling you, do you see on your side that the CC version of the query takes very long compared to the GROUP BY when you increase the number of rows from 5000 to 195000? Unfortunately, I don't think anyone else is going to see our discussion here.
Steve Broberg
It appears there are issues with the "relative to batch" reporting, at least according to this article: http://www.sql-server-performance.com/articles/per/Breaking_Down_Complex_Execution_Plans_p1.aspx
Steve Broberg
Very interesting article, thanks, Steve! Gotta learn yet another way of tallying up query performance, it seems!
marc_s