views:

86

answers:

2

I have the following tables:

PERSON_T              DISEASE_T               DRUG_T
=========             ==========              ========
PERSON_ID             DISEASE_ID              DRUG_ID
GENDER                PERSON_ID               PERSON_ID
NAME                  DISEASE_START_DATE      DRUG_START_DATE
                      DISEASE_END_DATE        DRUG_END_DATE

I want to write a query that takes an input of a disease id and returns one row for each person in the database with a column for the gender, a column for whether or not they have ever had the disease, and a column for each drug which specifies if they took the drug before contracting the disease. I.E. true would mean drug_start_date < disease_start_date. False would mean drug_start_date>disease_start_date or the person never took that particular drug.

We currently pull all of the data from the database and use Java to create a 2D array with all of these values. We are investigating moving this logic into the database. Is it possible to create a query that will return the result set as I want it or would I have to create a stored procedure? We are using Postgres, but I assume an SQL answer for another database will easily translate to Postgres.

+3  A: 

Based on the info provided:

   SELECT p.name,
          p.gender,
          CASE WHEN d.disease_id IS NULL THEN 'N' ELSE 'Y' END AS had_disease,
          dt.drug_id
     FROM PERSON p
LEFT JOIN DISEASE d ON d.person_id = p.person_id
                   AND d.disease_id = ?
LEFT JOIN DRUG_T dt ON dt.person_id = p.person_id
                   AND dt.drug_start_date < d.disease_start_date

..but there's going to be a lot of rows that will look duplicate except for the drug_id column.

OMG Ponies
Shouldn't the second join be an outer join for the case that the person never took the drug?
Jay Askren
@Jay Askren: LEFT JOIN is an OUTER join, the `OUTER` keyword is optional.
OMG Ponies
I find it troubling that someone who is programming a system that could have life and death results has to ask how to write a query this simple.
HLGEM
@OMG: the LEFT JOIN will bring back PERSON records which have no matching drug treatments of any kind - it won't bring back PERSON records that have not had a *specific* drug treatment. (This is an inevitable issure with the structure as supplied.)
Mark Bannister
@Mark Bannister: In order to show persons where a column is to indicate T/F for drug use around the time of diagnosis, you have to LEFT JOIN onto `DRUG_T`. NULL == F[alse]. Isolating a specific drug is easy - add `AND dt.drug_id = ?` to the `LEFT JOIN DRUG_T ...`
OMG Ponies
@OMG: Yes, if you are after a single drug. If you are interested in **all** drugs not administered, the existing structures specified by the OP are not sufficient for the requirements.
Mark Bannister
This query doesn't really give me what I need, but it shows me that I likely need to write a stored procedure.
Jay Askren
+1  A: 

You're essentially looking to create a cross-tab query with the drugs. While there are plenty of OLAP tools out there that can do this sort of thing (among all sorts of other slicing and dicing of the data), doing something like this in traditional SQL is not easy (and, in general, impossible to do without some sort of procedural syntax in all but the simplest scenarios).

You essentially have two options when doing this with SQL (well, more accurately, you have one option, and another more complicated but flexible option that derives from it):

  1. Use a series of CASE statements in your query to produce columns that are representative of each individual drug. This requires knowing the list of variable values (i.e. drugs) ahead of time
  2. Use a procedural SQL language, such as T-SQL, to dynamically construct a query that uses case statements as described above, but along with obtaining that list of values from the data itself.

The two options essentially do the same thing, you're just trading simplicity and ease of maintenance for flexibility in the second option.

For example, using option 1:

select
    p.NAME,
    p.GENDER,
    (case when d.DISEASE_ID is null then 0 else 1 end) as HAD_DISEASE,
    (case when sum(case when dr.DRUG_ID = 1 then 1 else 0 end) > 0 then 1 else 0 end) as TOOK_DRUG_1,
    (case when sum(case when dr.DRUG_ID = 2 then 1 else 0 end) > 0 then 1 else 0 end) as TOOK_DRUG_2,
    (case when sum(case when dr.DRUG_ID = 3 then 1 else 0 end) > 0 then 1 else 0 end) as TOOK_DRUG_3

from PERSON_T p

left join DISEASE_T d on d.PERSON_ID = p.PERSON_ID and d.DISEASE_ID = @DiseaseId
left join DRUG_T dr on dr.PERSON_ID = p.PERSON_ID and dr.DRUG_START_DATE < d.DISEASE_START_DATE

group by p.PERSON_ID, p.NAME, p.GENDER, d.DISEASE_ID

As you can tell, this gets a little laborious as you get outside of just a few potential values.

The other option is to construct this query dynamically. I don't know PostgreSQL and what, if any, procedural capabilities it has, but the overall procedure would be this:

  1. Gather list of potential DRUG_ID values along with names for the columns
  2. Prepare three string values: the SQL prefix (everything before the first drug-related CASE statement, the SQL stuffix (everything after the last drug-related CASE statement), and the dynamic portion
  3. Construct the dynamic portion by combining drug CASE statements based upon the previously retrieved list
  4. Combine them into a single (hopefully valid) SQL statement and execute
Adam Robinson
PostgreSQL has PgPLSQL, with lots of similarities to Oracle's PLSQL.
OMG Ponies