Preemptive apologies for the nonsensical table/column names on these queries. If you've ever worked with the DB backend of Remedy, you'll understand.

I'm having a problem where a COUNT(DISTINCT ...) returns null when I expect a value somewhere in the 20s (23, I believe). Below is a series of queries and their return values.

SELECT count(distinct t442.c1)
      FROM t442, t658, t631
     WHERE t442.c1 = t658.c536870930
       AND t442.c200000003 = 'Network'
       AND t442.c536871139 < 2
       AND t631.c536870913 = t442.c1
       AND t658.c536870925 = 1
       AND (t442.c7 = 6 OR t442.c7 = 5)
       AND t442.c536870954 > 1141300800
       AND (t442.c240000010 = 0)

Result = 497.

Add table t649 and make sure it has records linked back to table t442:

 SELECT COUNT (DISTINCT t442.c1)
              FROM t442, t658, t631, t649
             WHERE t442.c1 = t658.c536870930
               AND t442.c200000003 = 'Network'
               AND t442.c536871139 < 2
               AND t631.c536870913 = t442.c1
               AND t658.c536870925 = 1
               AND (t442.c7 = 6 OR t442.c7 = 5)
               AND t442.c536870954 > 1141300800
               AND (t442.c240000010 = 0)
               AND t442.c1 = t649.c536870914

Result = 263.

Filter out records in table t649 where column c536870939 <= 1:

SELECT COUNT (DISTINCT t442.c1)
          FROM t442, t658, t631, t649
         WHERE t442.c1 = t658.c536870930
           AND t442.c200000003 = 'Network'
           AND t442.c536871139 < 2
           AND t631.c536870913 = t442.c1
           AND t658.c536870925 = 1
           AND (t442.c7 = 6 OR t442.c7 = 5)
           AND t442.c536870954 > 1141300800
           AND (t442.c240000010 = 0)
           AND t442.c1 = t649.c536870914
           AND t649.c536870939 > 1

Result = 24.

Filter on the HAVING statement:

SELECT COUNT (DISTINCT t442.c1)
          FROM t442, t658, t631, t649
         WHERE t442.c1 = t658.c536870930
           AND t442.c200000003 = 'Network'
           AND t442.c536871139 < 2
           AND t631.c536870913 = t442.c1
           AND t658.c536870925 = 1
           AND (t442.c7 = 6 OR t442.c7 = 5)
           AND t442.c536870954 > 1141300800
           AND (t442.c240000010 = 0)
           AND t442.c1 = t649.c536870914
           AND t649.c536870939 > 1
        HAVING COUNT (DISTINCT t631.c536870922) =
                                              COUNT (DISTINCT t649.c536870931)

Result = null.

If I run the following query, I can't see anything in the result list that would explain why I'm not getting any kind of return value. This is true even if I remove the DISTINCT from the SELECT. (I get 25 and 4265 rows of data back, respectively).

SELECT DISTINCT t442.c1, t631.c536870922, t649.c536870931
          FROM t442, t658, t631, t649
         WHERE t442.c1 = t658.c536870930
           AND t442.c200000003 = 'Network'
           AND t442.c536871139 < 2
           AND t631.c536870913 = t442.c1
           AND t658.c536870925 = 1
           AND (t442.c7 = 6 OR t442.c7 = 5)
           AND t442.c536870954 > 1141300800
           AND (t442.c240000010 = 0)
           AND t442.c1 = t649.c536870914
           AND t649.c536870939 > 1

I have several other places where the query is set up exactly like the one that returns null, and it works perfectly fine there, returning usable numbers that are the correct values. I have to assume that whatever is unique in this situation is related to the data and not the query itself, but I'm not sure what to look for in the data to explain it. I haven't been able to find any null values in the raw data before aggregation. I don't know what else would cause this.

Any help would be appreciated.

A: 

COUNT(DISTINCT column) doesn't count NULL values:

SELECT  COUNT(DISTINCT val1)
FROM    (
        SELECT  NULL AS val1
        FROM    dual
        )

---
0

Could it be the case?

Quassnoi
A: 

I would try putting the HAVING clause conditions in the WHERE clause instead. Is there any reason you chose HAVING? Just FYI, HAVING is a filter that is done after the result set is returned which may cause unexpected results. Also it is not used in the optimization of the query. If you don't have to use HAVING I would suggest not using it.

I would suggest adding the counts to the SELECT clause then joining them in the WHERE clause.

northpole
(A) He is using HAVING so he can do a condition on aggregate functions, i.e. COUNT. (B) I don't know if you're basing your information on some other DBMS, but in Oracle the HAVING clause is most certainly a part of the query that is parsed and optimized along with everything else.
Dave Costa
In ORACLE, SQL statements can utilize both a WHERE clause and an Oracle HAVING clause. The WHERE clause will filter rows as they are selected from the table, and before grouping, the Oracle HAVING clause will filter rows after the grouping.
northpole
additionally, HAVING is for aggregation and is not used in the optimization of the query in MySQL or ORACLE.
northpole
Putting the HAVING clause in the WHERE throws an error. ORA-00934: group function is not allowed here
Dave
no, I mean removing the HAVING all together and using the SELECT and WHERE clauses to handle it
northpole
Yes, that's what I did. SELECT COUNT(DISTINCT t442.c1) ... AND t649.c536870939 > 1 AND count(distinct t631.c536870922) = count(distinct t649.c536870931)
Dave
ahhh, yes, you are correct. Testing this now in my environment shows the same. I was hopeful that you could alias the count and use that in the where clause....let me continue to think about other solutions.
northpole
+1  A: 

What is the result of:

SELECT COUNT (DISTINCT t631.c536870922),
       COUNT (DISTINCT t649.c536870931)
          FROM t442, t658, t631, t649
         WHERE t442.c1 = t658.c536870930
           AND t442.c200000003 = 'Network'
           AND t442.c536871139 < 2
           AND t631.c536870913 = t442.c1
           AND t658.c536870925 = 1
           AND (t442.c7 = 6 OR t442.c7 = 5)
           AND t442.c536870954 > 1141300800
           AND (t442.c240000010 = 0)
           AND t442.c1 = t649.c536870914
           AND t649.c536870939 > 1

If the two columns there never have equal values, then it makes sense that adding the HAVING clause would eliminate all rows from the result set.

Dave Costa
4 and 3, respectively. See next comment.
Dave
Also, even if the HAVING clause did eliminate all rows, shouldn't I get 0 instead of null? One of the other places I use this query does correctly return 0.
Dave
Your query is functionally equivalent to one like SELECT x FROM (SELECT 3 AS x, 4 AS y FROM dual) WHERE x = y. Your result set has one row (the COUNT results) with two columns, and you're saying "show me the rows where they're equal". If they're not equal, no rows are returned. If you're getting a 0 in one place, that must be because both of those counts are 0 there.
Steve Broberg
A: 

If I do this:

SELECT distinct t442.c1, count(distinct t631.c536870922), 
    count (distinct t649.c536870931)
          FROM t442, t658, t631, t649
         WHERE t442.c1 = t658.c536870930
           AND t442.c200000003 = 'Network'
           AND t442.c536871139 < 2
           AND t631.c536870913 = t442.c1
           AND t658.c536870925 = 1
           AND (t442.c7 = 6 OR t442.c7 = 5)
           AND t442.c536870954 > 1141300800
           AND (t442.c240000010 = 0)
           AND t442.c1 = t649.c536870914
           AND t649.c536870939 > 1
           group by t442.c1
           having count(distinct t631.c536870922)= 
                         count (distinct t649.c536870931)

I see the 23 rows that should be counted. Removing the HAVING clause returns 24 rows; the extra one does not meet the HAVING criterion.

EDIT: Results of the query, as requested per Steve Broberg:

row | t442.c1         | cnt t631 | cnt t649
-------------------------------------------
1   | CHG000000230378 |    2     |    1
2   | CHG000000230846 |    1     |    1
3   | CHG000000232562 |    1     |    1
4   | CHG000000232955 |    1     |    1
5   | CHG000000232956 |    1     |    1
6   | CHG000000232958 |    1     |    1
7   | CHG000000233027 |    1     |    1
8   | CHG000000233933 |    1     |    1
9   | CHG000000233934 |    1     |    1
10  | CHG000000233997 |    1     |    1
11  | CHG000000233998 |    1     |    1
12  | CHG000000233999 |    1     |    1
13  | CHG000000234001 |    1     |    1
14  | CHG000000234005 |    1     |    1
15  | CHG000000234009 |    1     |    1
16  | CHG000000234012 |    1     |    1
17  | CHG000000234693 |    1     |    1
18  | CHG000000234696 |    1     |    1
19  | CHG000000234730 |    1     |    1
20  | CHG000000234839 |    1     |    1
21  | CHG000000235115 |    1     |    1
22  | CHG000000235224 |    1     |    1
23  | CHG000000235488 |    1     |    1
24  | CHG000000235847 |    1     |    1 

The first row is filtered out properly if I include the HAVING clause.

Dave
Given that there are only 23 rows, can you include the results of that query?
Steve Broberg
See answer post below.
Dave
Or just this edited post...still learning how to use this site.
Dave
See my answer below
Steve Broberg
+2  A: 

I understand now. Your problem in the original query is that it is highly unusual (if not, in fact, wrong) to use a HAVING clause without a GROUP BY clause. The answer lies in the order in which the various parts of the query are performed.

In the original query, you do this:

SELECT COUNT(DISTINCT t442.c1)
  FROM ...
 WHERE ...
HAVING COUNT(DISTINCT t631.c536870922) = COUNT(DISTINCT t649.c536870931);

The database performs your joins and applies the WHERE constraints, and only then does any grouping and aggregation. In this case you are not grouping, so the COUNT operations run across the whole data set. Per your comment above, COUNT(DISTINCT t631.c536870922) = 4 and COUNT(DISTINCT t649.c536870931) = 3. The HAVING clause is then applied to that single aggregate row and nothing matches: you're asking whether the counts over the total set (even though there are multiple c1s) are equal, and they are not. The one row is filtered out, so you get nothing (no rows at all, rather than a 0).
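The same effect can be reproduced on a few made-up rows. This is only a sketch: the `toy` data and the column names `a` and `b` are hypothetical stand-ins, not taken from the Remedy tables. Across the whole set there are 4 distinct `a` values and 3 distinct `b` values, so the ungrouped HAVING rejects the single aggregate row and the query should return no rows:

```sql
-- Hypothetical toy data, not taken from the Remedy tables.
WITH toy AS (
  SELECT 'X' AS c1, 1 AS a, 10 AS b FROM dual UNION ALL
  SELECT 'X',       2,      10      FROM dual UNION ALL
  SELECT 'Y',       3,      20      FROM dual UNION ALL
  SELECT 'Z',       4,      30      FROM dual
)
-- No GROUP BY: the aggregates are computed once over all four rows,
-- and HAVING compares 4 (distinct a) with 3 (distinct b).
SELECT COUNT(DISTINCT c1)
  FROM toy
HAVING COUNT(DISTINCT a) = COUNT(DISTINCT b);
```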

What you really want is just a version of what you posted in the example that spit out the row counts:

SELECT count(*)
  FROM (SELECT t442.c1     
          FROM t442
             , t658
             , t631
             , t649
         WHERE t442.c1 = t658.c536870930
           AND t442.c200000003 = 'Network'
           AND t442.c536871139 < 2
           AND t631.c536870913 = t442.c1
           AND t658.c536870925 = 1
           AND (   t442.c7 = 6
                OR t442.c7 = 5)
           AND t442.c536870954 > 1141300800
           AND (t442.c240000010 = 0)
           AND t442.c1 = t649.c536870914
           AND t649.c536870939 > 1
         GROUP BY t442.c1
        HAVING COUNT(DISTINCT t631.c536870922) = COUNT(DISTINCT t649.c536870931)
       );

The inner query gives you the list of c1 values that have equal numbers of t631 and t649 entries, and the outer COUNT(*) counts them. Note: You should be very careful about the use of DISTINCT in your queries. For example, in the query whose results you posted above, the DISTINCT on t442.c1 is completely unnecessary, because the GROUP BY already returns one row per c1; oftentimes DISTINCT acts as a kind of wallpaper to cover over errors in queries that don't return results the way you want due to a missed constraint in the WHERE clause ("Hmm, my query is returning dupes for all these values. Well, a DISTINCT will fix that problem").
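To see the grouped form at work on a small scale, here is a sketch on made-up rows. The `toy` data and the columns `a` and `b` are hypothetical and stand in for the t631 and t649 columns. Group X has 2 distinct `a` values but only 1 distinct `b` value, so it is dropped by the HAVING; Y and Z have 1 of each and survive, so the outer count should come back as 2:

```sql
-- Hypothetical toy data, not taken from the Remedy tables.
WITH toy AS (
  SELECT 'X' AS c1, 1 AS a, 10 AS b FROM dual UNION ALL
  SELECT 'X',       2,      10      FROM dual UNION ALL
  SELECT 'Y',       3,      20      FROM dual UNION ALL
  SELECT 'Z',       4,      30      FROM dual
)
-- GROUP BY makes HAVING compare the counts per c1 rather than
-- over the whole set; the outer COUNT(*) counts the surviving groups.
SELECT COUNT(*)
  FROM (SELECT c1
          FROM toy
         GROUP BY c1
        HAVING COUNT(DISTINCT a) = COUNT(DISTINCT b));
```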

Steve Broberg
edit: fixed my final query to return the count you were originally looking for.
Steve Broberg
Ok, I think I understand. It does look like in the places where the query was working (with a different set of WHERE clauses), the result set was identical with or without the HAVING. The query you just provided gives the individual rows and not just the count, but if I wrap that query in a SELECT COUNT(*) FROM (query) I get the results I need.
Dave
Hah. Beat me to it. :-)
Dave
+1, great job. This one was bothering me. I just couldn't figure it out. Thanks for the lesson :D
northpole
+1 Don't use HAVING without GROUP BY.
Carl Manaster