views:

182

answers:

4

I have a statement that SELECTs distinct rows from a table with a two value index

SELECT distinct Reference, Value1, Value2, Value3, Value4 FROM [tblHistory]

Where Reference is an index with another field "Project". For a particular system this data is inserted into another table using only Reference as the index because Value1 through Value4 SHOULD always be the same for the same Reference - however in about 1/500 it is not.

In the case where there is a duplicate Reference AND differences in one or more of the Value1-Value4 fields I need to pick the row with the most completed Value1-Value4 fields as they are often NULL. If all instances have the same number of populated columnsI can return the first row found.

Other than using temporary tables and code like

case when Value1 is null then 1 else 0 end 
+ case when Value2 is null then 1 else 0 end 
+ case when Value3 is null then 1 else 0 end
+ case when Value4 is null then 1 else 0 end
as CountOfNulls

Is there a way to filter the data so I only get the most populated row?

I'm running MS SQL Server 2000.

+1  A: 

You could look into the Coalesce function but honestly I would probably do the Case statement as you have above.

What reason would you not want to use it?

According to the comments, the table is more than just 4 value fields. But the fear is that a temporary table keeping the highest count of nulls will be necessary.

I feel that the Case solution, potentially implemented into a view is still a viable and good solution.

Jeff Martin
Firstly because in reality there are many more columns that 4!Secondly, I would like it to be a simple procedure rather than one involving a temporary table. Maybe there is a clean way to do it without making a temporary table and keeping the highest CountOfNulls per Reference?
PCurd
don't really need a temp table, just create the view with the calculation in it, it should be a 1 time thing.
Jeff Martin
Jeff is right - you've got the right technique, just don't use a temp table. This could easily be a view.
le dorfier
Seems there isn't really a simple fix to my problem so I'm marking this as the "answer" as it (with the comments) is the best coverage.
PCurd
Thanks for the answer, gonna move some of the salient comments into the answer for the search engines...
Jeff Martin
+1  A: 

This is just a hunch, since I haven't seen your database, but it looks like this is one table that should actually be two. As I said, this may have been done for performance or other legit reasons, but the way you need to report over it suggests this table should be split.

Nonetheless, if I had no other alternative, I'd create a column with the count of filled in columns for the row, and programmatically update it whenever the particular record is updated.

Bork Blatt
A: 

Well it's not a great way to do it, but I have put together this code and it works:

SELECT distinct Reference, Value1, Value2, Value3, Value4

FROM [tblHistory]
WHERE Reference+cast(4-(case when Value1 is null then 1 else 0 end 
+ case when Value2 is null then 1 else 0 end 
+ case when Value3 is null then 1 else 0 end
+ case when Value4 is null then 1 else 0 END) AS varchar) IN (

SELECT ro_custref + CAST(MAX(CountOfNonNulls) AS VARCHAR) FROM
(

SELECT ro_custref, 4-(case when Value1 is null then 1 else 0 end 
+ case when Value2 is null then 1 else 0 end 
+ case when Value3 is null then 1 else 0 end
+ case when Value4 is null then 1 else 0 end)
as CountOfNonNulls

FROM [tblHistory]
)l
GROUP BY Reference
)

I don't actually have this table structure so I haven't tested it but it seems to work. The idea is to make a "new" key by adding the highest CountOfNonNulls to the Reference field and using that to limit the select - it means the nasty CASE code is run twice but the other filters I have (not shown) limit the population to roughly 80 rows in my system so I can live with this.

I have yet to see what it would do if there were two rows with the same CountOfNonNulls value but different Value1-Value4 fields - I think it would break. In that case I would probably add the Value1-Value4 fields to my "new" key but that's a bit silly.

Any suggestions of improvement would be much appreciated!

PCurd
A: 

-- count() will not include NULL, so we can avoid making complex conditions
;
with
sum_cnt
(
    Reference,
    cnt
)
as
(
    select 
     Reference, 
     count(Value1) + count(Value2) + count(Value3) + count(Value4) 
    from 
     tblHistory 
    group by 
     Reference
)
select top 1
    Reference
from
    sum_cnt 
order by
    cnt desc


Irawan Soetomo
That's a really neat solution - sadly I'm using SQL Server 2000 and can't use CTEs.
PCurd