views:

199

answers:

2

Hi,

Thanks to everyone who helped me with my last question. This is a similar question, but now I have a better idea of what I want. Again I'm using MS SQL Server 2008, and I'm trying to figure out a query that is a little beyond my week of SQL experience.

Right now I have the following simple query

SELECT pl.val, pl.txt_val, pl.id
FROM dm.labs pl
  WHERE pl.lab_nm = 'CK (CPK)' 
    AND pl.val < 999999;
    AND pl.val IS NOT NULL
ORDER BY pl.val;

In this table, each row corresponds to the results of a patient's lab value. The problem is that for some patients, there are rows that are actually multiple copies of the same lab reading. There is a column pl.lab_filed_dts (datetime data type) giving the date the lab was taken.

What I want to do is:

Edit: if two (or more) rows have the same id, the same val, and the same lab_filed_dts, regardless of their values in any other column, I want to return val, txt_val, and id from only one of the rows (it doesn't matter which one).

Edit: if two rows have the same id, the same val, and the same day portion (but not necessarily time) of lab_filed_dts AND the time portion of lab_filed_dts for one of them is midnight, I want to return val, txt_val, and id from the row whose lab_filed_dts time is not midnight.

Thanks again for everyone's help!

+1  A: 

If I understand your question correctly:

SELECT pl.val, pl.txt_val, pl.id
FROM dm.labs pl
  WHERE pl.lab_nm = 'CK (CPK)' 
    AND pl.val < 999999;
    AND pl.val IS NOT NULL
GROUP BY  pl.val, pl.txt_val, pl.id, cast(lab_filed_dts as date)
HAVING lab_filed_dts =  max(lab_filed_dts)
ORDER BY pl.val;

/* changed pl.vak to pl.val as you can't order on a column not in the select and I assumed it was just a type*/

jmoreno
@jmoreno: I hope you don't mind that I edited your post. If you indent each line of a block of code by four spaces you'll get the nice formatting.
Jason
I don't think that this will work. If two labs have different txt_val values, they will both appear in the results even if the important columns are the same.
Tom H.
I think another problem with this is that if use `HAVING lab_filed_dts = max(lab_filed_dts)` then I'll only get the most recent CK lab for each patient. Patients have multiple labs done over time and I want to see all of the lab results. What I'm trying to get rid of are rows in the table that are multiple copies of a single lab value, that is, for some reason readings for a patient were taken once but a nurse entered the results into the data records twice.
raoulcousins
@raoulcousins: What Tom is looking for is a natural key, some combination of one or more of the RELEVANT fields that uniquely identify the row. An identify field is a surrogate key and while that can be useful, you should always try to identify a natural key. In this case, if you could know when the lab was completed, that would provide your natural key, it wouldn't matter when the data was entered, the time it was completed would be unique, and you could thus prevent duplicates from ever being entered.
jmoreno
@raoulcousins: I've changed the query so that you'll only get the last lab of the day. I didn't do the midnight check, because that would make the query a lot more complex and I'm not sure of the actual utility. Tell me what you think of the current version...
jmoreno
@jmoreno: I'll have to look at the data more carefully, but there doesn't seem to be a combination of relevant keys that I can consistently use as a unique identifier. That's the big problem with this messy data set. Your change is very helpful. If I can pull one lab for each day this is a huge help. Where does the difficulty come from in the midnight thing? Would it be any easier if it didn't matter which row is returned if one is midnight?
raoulcousins
@jmoreno: Also, are there any books you recommend so I can really learn to do this myself? I've read through O'Reilly Learning SQL, but maybe I need something more SQL Server specific.
raoulcousins
@jmoreno: this one gives me the error Column 'dm.labs.lab_filed_dts' is invalid in the HAVING clause because it is not contained in either an aggregate function or the GROUP BY clause. There was also an unrelated extra semicolon at `AND pl.val < 999999` but I'm not sure why I'm getting this error. I'm trying the second method now.
raoulcousins
A: 

A different approach, using the lab_id mentioned later.

SELECT DISTINCT pl.val, pl.txt_val, pl.id
FROM dm.labs pl
  WHERE pl.lab_nm = 'CK (CPK)' 
    AND pl.val < 999999;
    AND pl.val IS NOT NULL
    AND lab_id NOT IN (SELECT midnight.lab_id 
                       FROM dm.labs midnight 
                             INNER JOIN dm.labs nm
                               ON  midnight.val     = nm.val AND 
                                   midnight.id      = nm.id AND 
                                   midnight.lab_id <> nm.lab_id
                       WHERE midnight.lab_filed_dts = CAST(CAST(midnight.lab_filed_dts AS DATE) 
                                                     AS DateTime)

ORDER BY pl.val;
jmoreno
@jmoreno: Fantastic! I'm not sure if this is getting exactly the data I want, but it sure looks like it. I'll spend some time dissecting it and hopefully learn some SQL from it. Thank you so much for your help. I would upvote if I had the reputation.
raoulcousins
Ok so looking over this more carefully I now have a question about it. If I'm understanding the subquery correctly, the subquery selects rows where there are two results that match in everything except that one of them has the time as midnight and the other has a time other than midnight. By doing ``AND lab_id NOT in...`` then won't I skip all rows selected in the subquery, where I actually want to select one of them?
raoulcousins
Your understanding of the subquery is off by a bit. In particular only the row with midnight as the date is selected. If you have 2 rows, 1 with a midnight date and one without, then the 1 with the midnight date will be excluded. If you only have 1 row and it has a midnight date it will NOT be excluded. This does potentially open you up to a problem (and this might be what you are referring to): if you have 2 (or more) rows with midnight dates and no row without a midnight date, you won't get either row. This problem is of course fixable if it is a problem, but it gets more complicated...
jmoreno