I have a stored proc (called sprocGetArticles) which returns a list of articles from the articles table. This stored proc does not have any parameters.

Users can leave comments for each article and I store these comments in the comments table linked by the article id.

Is there any way I can do a comment count for each articleid in the returned list from within the sprocGetArticles stored procedure so I only have to make one call to the database?

My problem is that I need the article id to do the count, which I seem unable to declare.

Is this the best approach anyway?

+2  A: 

Well, without knowing what you are selecting or your general schema (and assuming you are at least using SQL Server 2005), something like this should work:

WITH CommentCounts AS
(
   SELECT COUNT(*) CommentCount, ac.ArticleID
   FROM Articles a
   INNER JOIN ArticleComments ac
      ON ac.ArticleID = a.ID
   GROUP BY ac.ArticleID
)

SELECT a.*,
       c.CommentCount
FROM Articles a
INNER JOIN CommentCounts c
   ON a.ID = c.ArticleID

This is a Common Table Expression or CTE. You can read more about them here: http://msdn.microsoft.com/en-us/library/ms190766.aspx

AndyMcKenna
That isn't valid CTE syntax per your link.
OMG Ponies
Thanks, don't know how I missed that.
AndyMcKenna
+5  A: 

SQL allows scalar subqueries to be returned as projected columns, and subqueries can be correlated with the parent query. So it is easy to get the count with a correlated subquery that counts the comments for each article id:

SELECT a.*,
       (SELECT COUNT(*)
        FROM Comments c
        WHERE c.article_id = a.article_id) AS CountComments
FROM Articles a;

Note that counting the comments each time can be quite expensive; it is better to keep the count as a property of the Article itself.
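
As an illustration of that denormalized approach (a hypothetical sketch, not part of the original answer - the CommentCount column, trigger name, and backfill are assumed), a trigger on Comments could keep the stored count in sync:

-- Hypothetical: add a denormalized count column (backfill existing rows once).
ALTER TABLE Articles ADD CommentCount INT NOT NULL DEFAULT 0;
GO
CREATE TRIGGER trgCommentsCount ON Comments
AFTER INSERT, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Recount only the articles touched by this statement.
    UPDATE a
       SET CommentCount = (SELECT COUNT(*)
                             FROM Comments c
                            WHERE c.article_id = a.article_id)
      FROM Articles a
     WHERE a.article_id IN (SELECT article_id FROM inserted
                            UNION
                            SELECT article_id FROM deleted);
END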

Remus Rusanu
Can you please explain your comment a bit further about keeping the count as an article property? Why does this mean I wouldn't need to count the comments each time? Thanks.
Cunners
Thank you. This is what I was looking for!
David Stratton
+1  A: 

The following will work on SQL Server 2005+ or Oracle 9i+:

WITH COMMENT_COUNT AS (
      SELECT ac.article_id,
             COUNT(*) AS numComments
        FROM ARTICLE_COMMENTS ac
    GROUP BY ac.article_id)
SELECT t.description,
       cc.numComments
  FROM ARTICLES t
  JOIN COMMENT_COUNT cc ON cc.article_id = t.article_id

SQL Server calls it a Common Table Expression (CTE); Oracle calls it subquery factoring.

Alternative:

SELECT t.description,
       cc.numComments
  FROM ARTICLES t
  JOIN (SELECT ac.article_id,
               COUNT(*) AS numComments
          FROM ARTICLE_COMMENTS ac
      GROUP BY ac.article_id) cc ON cc.article_id = t.article_id

Performing the subquery in the SELECT statement works, but will likely perform the worst of all the suggestions, because it executes once for every row.

OMG Ponies
+2  A: 

Maybe I'm missing something, but what's with all the subqueries and inline views? Why not just do a straightforward left-join, e.g.:

  SELECT a.ArticleId
       , a.ArticleName
       , (other a columns)
       , COUNT(c.ArticleId)  -- counts 0, not 1, for articles with no comments
    FROM Articles a
         LEFT JOIN Comments c
                ON c.ArticleId = a.ArticleId
GROUP BY a.ArticleId
       , a.ArticleName
       , (other a columns);
Steve Broberg
+1  A: 

One option no one has mentioned so far would be a computed column on your article table which counts the number of comments. This is in general much faster than actually computing the number of comments every time, and if you really need to query that number frequently, it could save you a lot of processing overhead!

In SQL Server 2005 and up, what you could do in this case is create a small stored function to count the number of comments for each article, and then add this as a computed column to your article table. You could then use that as a normal column and trust me - it's a lot quicker than using subqueries all the time!

CREATE FUNCTION dbo.CountComments(@ArticleID INT)
RETURNS INT 
WITH SCHEMABINDING
AS BEGIN
    DECLARE @ArticleCommentCount INT

    SELECT @ArticleCommentCount = COUNT(*)
    FROM dbo.ArticleComments
    WHERE ArticleID = @ArticleID

    RETURN @ArticleCommentCount
END
GO

Add this to your article table as a column:

ALTER TABLE dbo.Articles
    ADD CommentCount AS dbo.CountComments(ArticleID)

and from then on, just use it as a normal column:

SELECT ArticleID, ArticleTitle, ArticlePostDate, CommentCount 
FROM dbo.Articles

To make it even faster, you could add this column as a persisted column to your table, and then it really rocks! :-)

ALTER TABLE dbo.Articles
    ADD CommentCount AS dbo.CountComments(ArticleID) PERSISTED

It's a bit more work upfront, but if you need this often and all the time, it could be well worth the trouble! It also works great for e.g. reading out certain bits of information from an XML column stored in your database table and exposing it as a regular INT column or whatever.
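
A quick sketch of that XML case (hypothetical - the Orders table, OrderData column, and XPath here are made up for illustration): XML methods can't appear directly in a computed-column definition, so you wrap the .value() call in a schema-bound scalar function first:

CREATE FUNCTION dbo.GetPriority(@Data XML)
RETURNS INT
WITH SCHEMABINDING
AS BEGIN
    -- Pull a single INT out of the stored XML document
    RETURN @Data.value('(/order/priority)[1]', 'INT')
END
GO

ALTER TABLE dbo.Orders
    ADD Priority AS dbo.GetPriority(OrderData)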

Highly recommend! It's a feature often overlooked in SQL Server.

Marc

marc_s
thanks that worked really well!
Cunners
Steve Broberg
Obviously, SQL Server optimizes these queries. If you make the column PERSISTED, then you're actually storing the value - and I suspect SQL Server will only recalculate it when the count changes.
marc_s
I would be happy to share my execution plans with you if you don't believe me - the same query using subqueries uses 90% of the total time, while the one with the computed column only uses 10% - it's almost 10x faster. It's a fact.
marc_s
I don't doubt you - I'd be interested in someone who understands the internals explaining how this is possible. Intuitively, it doesn't make sense that a computed column would be faster, as there is less information available to the optimizer (or at least, more obfuscated information). The function dictates that the subquery must be run for each row, since it is possible the function may not return the same value every time. Similarly, it would seem a complex problem for a DBMS to determine under what conditions a refresh would be necessary for an arbitrary function.
Steve Broberg
Also, I see from doing a search that SQL Server mandates that persisted CCs be deterministic, which answers part of my question. However, I can still envision writing a deterministic function that is nontrivial to "reverse" in order to determine which row(s) in a table with a persisted CC need updating. Perhaps SS uses a scheme where it marks a row as "needs updating" and then recomputes the persisted column the next time it is referenced (which may result in all rows needing recomputation even if they are not strictly affected by the update).
Steve Broberg
@Steve: I've wondered myself, and it does appear to me that somehow, SQL Server must be storing and reusing values, instead of recomputing them every time they're needed. But how they do that internally in detail escapes me.
marc_s
@Marc: I've run an experiment using your solution, and my results do not agree with what you've reported here - I'd be interested in you looking at my long post below and hearing your comments.
Steve Broberg
SQL Server cannot maintain an aggregate computed and persisted column as defined in the post. The column will be populated with the value at record creation time and will not be further updated when a comment is added or removed. SQL Server can, however, maintain an indexed view over article comments to similar effect, but there are some restrictions on creating indexed views.
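
A minimal sketch of that indexed-view alternative (assuming the ArticleComments schema used in the other answers; indexed views require schema binding, two-part names, and COUNT_BIG(*)):

CREATE VIEW dbo.vArticleCommentCounts
WITH SCHEMABINDING
AS
SELECT ArticleID, COUNT_BIG(*) AS CommentCount
FROM dbo.ArticleComments
GROUP BY ArticleID
GO
CREATE UNIQUE CLUSTERED INDEX IX_vArticleCommentCounts
    ON dbo.vArticleCommentCounts (ArticleID)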
Remus Rusanu
A: 

Regarding the use of computed columns mentioned in the answer above, I wanted to verify the claim that using a computed column would produce better performance (it didn't make sense to me, but I'm no SQL Server guru). The results I got indicate that using a computed column is indeed slower - much slower - than a simple GROUP BY or subquery. I ran a test on a SQL Server instance I have on my own PC - here are the methodology and results:

CREATE TABLE smb_header (keycol INTEGER NOT NULL
                        , name1 VARCHAR(255)
                        , name2 VARCHAR(255));

INSERT INTO smb_header
  VALUES (1
        , 'This is column 1'
        , 'This is column 2'
         );

INSERT INTO smb_header
   SELECT (SELECT MAX(keycol)
             FROM smb_header
          ) + keycol
        , name1
        , name2
     FROM smb_header;
-- (repeat 20 times to generate ~1 million rows)

ALTER TABLE smb_header ADD PRIMARY KEY (keycol);

CREATE TABLE smb_detail (keycol INTEGER
                        , commentno INTEGER
                        , commenttext VARCHAR(255));

INSERT INTO smb_detail
   SELECT keycol
        , 1
        , 'A comment that describes this issue'
     FROM smb_header;

ALTER TABLE smb_detail ADD PRIMARY KEY (keycol, commentno);

ALTER TABLE smb_detail ADD FOREIGN KEY (keycol) 
                           REFERENCES smb_header (keycol);

INSERT INTO smb_detail
   SELECT keycol
        , (SELECT MAX(commentno)
             FROM smb_detail sd2
            WHERE sd2.keycol = sd1.keycol
          ) + commentno
        , 'A comment that follows comment number ' 
          + CAST(sd1.commentno AS VARCHAR(32))
     FROM smb_detail sd1
    WHERE keycol % 31 = 0;

-- repeat 5 times, to create some records that have 64 comments
-- where others have one.

At this point, there will be around 1 million rows in the header, and either 1 or 64 comments for each.

Now I create the function (the same as yours above, only with my column & table names), and the computed column:

ALTER TABLE dbo.smb_header ADD CommentCountPersist AS dbo.CountComments(keycol)

By the way, PERSISTED will not work for this column, as I suspected in my comments above - it is not possible or too difficult for SQL Server to keep track of which rows need updating if you refer to other tables in your function. Using the PERSISTED keyword produces the error:

Msg 4934, Level 16, State 3, Line 1
Computed column 'CommentCountPersist' in table 'smb_header' cannot be 
persisted because the column does user or system data access.

This makes sense to me - I don't see how SQL Server could determine what rows need updating when other rows change, for any function that could be implemented, without the process of updates being horribly inefficient.

Now, for the tests. I create a temp table #holder to insert the rows into - I want to make sure when my queries run, I process the entire result set, not just the first few rows that would appear in the Mgmt Studio grid control.

  SELECT h.keycol
       , h.name1
       , CommentCountPersist
    INTO #holder
    FROM smb_header h
   WHERE h.keycol < 0

Here are the results of my queries. First, the computed column:

  INSERT
    INTO #holder
  SELECT h.keycol
       , h.name1
       , CommentCountPersist
    FROM smb_header h
   WHERE h.keycol between 5000 and 10000

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
Table 'Worktable'. Scan count 1, logical reads 10160, physical reads 0, 
                   read-ahead  reads 0, lob logical reads 0, 
                   lob physical reads 0, lob read-ahead reads 0.
Table 'smb_header'. Scan count 1, logical reads 44, physical reads 0, 
                    read-ahead reads 0, lob logical reads 0, 
                    lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 265 ms,  elapsed time = 458 ms.

(5001 row(s) affected)
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

Now the GROUP BY version:

  INSERT
    INTO #holder
  SELECT h.keycol
       , h.name1
       , COUNT(*)
    FROM smb_header h
       , smb_detail d 
   WHERE h.keycol between 5000 and 10000
     AND h.keycol = d.keycol 
GROUP BY h.keycol, h.name1

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
Table 'smb_header'. Scan count 1, logical reads 44, physical reads 0, 
                    read-ahead reads 0, lob logical reads 0, 
                    lob physical reads 0, lob read-ahead reads 0.
Table 'smb_detail'. Scan count 1, logical reads 366, physical reads 0, 
                    read-ahead reads 0, lob logical reads 0, 
                    lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 15 ms,  elapsed time = 13 ms.

(5001 row(s) affected)
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

Writing the query with the subquery in the SELECT clause as Remus did above yields the same plan & performance as the GROUP BY (which would be expected).

As you can see, the computed column version performs significantly worse. This makes sense to me, as the optimizer is forced to call the function and do the COUNT(*) for every row in the header, instead of using more sophisticated methods of resolving the two sets of data.

It is possible that I'm doing something wrong here. I'd be interested in marc_s contributing his findings.

Steve Broberg
A: 
marc_s
Marc, I think you may be misinterpreting the meaning of those "query cost (relative to batch)" messages in the plan window. I believe those costs are not reflective of the actual costs of the query - they're either estimated costs at query compilation time, or they represent the CPU costs in parsing/processing the query. Try this: turn on statistics I/O and statistics time (in the Query->Query Options...->Advanced dialog) and run the queries again, looking at the actual measured time elapsed. You'll see that the query with the computed columns is slower. (continued)
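For reference, the same options can also be switched on per session in T-SQL (standard SQL Server commands, not part of the original comment):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;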
Steve Broberg
(you need to look on the "Messages" tab to see these results). To make this more dramatic, change the constraint on keycol to "between 5000 and 200000". On my machine (a 2GHz quad-core Intel, with the db files on a 2-disk RAID 0), these are the stats I got: ------- Computed Column: SQL Server Execution Times: CPU time = 70437 ms, elapsed time = 80569 ms. ------- GROUP BY: SQL Server Execution Times: CPU time = 297 ms, elapsed time = 3628 ms. The GROUP BY runs 22 times faster than the CC.
Steve Broberg
As I think about it, I believe that the "query cost (relative to batch)" is not taking into account any I/O incurred by function calls. If you look at the i/o stats on the Messages tab, you'll see that it's only counting i/o from the smb_header table, whereas the group by is counting from both tables.
Steve Broberg
Finally, when you run the queries as I did (inserting into the #holder table, which runs faster for both due to avoiding the server/client traffic), you'll see that the missing i/o of the functional query seems to be accounted for in the mysterious "Worktable" line item.
Steve Broberg
Steve: I'm using the **ACTUAL** execution plans - **NOT** the guesstimates. I do think these are indeed at least an indication (if not 100% accurate)
marc_s
and as I said - I don't know enough about the internals of SQL Server to be really in a position to argue this with you - I can only speak from my experiences of 10+ years with SQL Server and in my experience, computed columns (especially if they are persisted - which in this scenario won't work, I admit) do tend to speed up things significantly.
marc_s
But I'd really love to get a well founded and authoritative answer to this one. Your arguments do make good sense, Steve - I just lack the insights to either support or counter them.... let's see who else might get involved....
marc_s
I also selected the actual plans. As I mentioned above, I don't think the % of batch reports are entirely accurate. Although I'm not current on SQL Server, I was a Sybase developer 10 years ago, and I'm familiar with the SET STATISTICS IO and SET STATISTICS TIME options. Regardless of what the GUIs are telling you, do you see on your side that the CC version of the query takes very long compared to the GROUP BY when you increase the number of rows from 5000 to 195000? Unfortunately, I don't think anyone else is going to see our discussion here.
Steve Broberg
It appears there are issues with the "relative to batch" reporting, at least according to this article: http://www.sql-server-performance.com/articles/per/Breaking_Down_Complex_Execution_Plans_p1.aspx
Steve Broberg
Very interesting article, thanks, Steve! Gotta learn yet another way of tallying up query performance, it seems!
marc_s