views:

410

answers:

10

I have the following database table with information about people, diseases, and drugs:

PERSON_T              DISEASE_T               DRUG_T
=========             ==========              ========
PERSON_ID             DISEASE_ID              DRUG_ID
GENDER                PERSON_ID               PERSON_ID
NAME                  DISEASE_START_DATE      DRUG_START_DATE
                      DISEASE_END_DATE        DRUG_END_DATE

From these tables, I run some statistics about which individuals have taken which drugs and had which diseases. From this I can figure out which patterns are interesting for me to delve further into. For instance, below is a simplified example of the boolean pattern I might find for disease 52:

( (Drug 234 = false AND Drug 474 = true AND Drug 26 = false) OR 
  (Drug 395 = false AND Drug 791 = false AND Drug 371 = true) )

Edit: Here is another example:

( (Drug 234 = true AND Drug 474 = true AND Drug 26 = false) OR 
      (Drug 395 = false AND Drug 791 = false AND Drug 371 = true) )

Now I want to convert this pattern into a sql query and find all the people who match this pattern.
For example, I want to find all of the people in PERSON_T who had the disease and ((who did not take drug 234 and 26 before exhibiting symptoms, but did take drug 474 before exhibiting symptoms) or (who took drug 371 before exhibiting symptoms, but not drug 791 and 395 before exhibiting symptoms))

How would I go about translating this pattern back into the original query?

Here's my first attempt, but I get stuck on the first term:

SELECT * FROM PERSON_T, DRUG_T, DISEASE_T 
  WHERE DISEASE_ID = 52 AND 
    PERSON_T.PERSON_ID = DISEASE_T.PERSON_ID AND 
    PERSON_T.PERSON_ID = DRUG_T.PERSON_ID  AND 
    (DRUG_T.DRUG_ID=234 AND (DRUG_T.DRUG_START_DATE>DISEASE_T.END_DATE || ???)

I need this to work in PostgreSql, but I assume that any given answer can be translated from a given database to PostgreSql.

Response to comments

  1. I fixed formatting of database tables. Thank you.
  2. I need to be able to take an arbitrary boolean statement and translate it to SQL. The boolean statements we are actually creating are much longer than the example I gave. Any new tables I create will be in a new database and need to have the same schema as the original tables. This way to the end user, they can run their same code on the new tables and it work the same as if it ran on the original tables. This is a requirement from the customer. I'm hoping I can create a view which is just a query to the original tables. If we can't get that to work, I may create a copy of the tables and filter the data as I'm copying it over to the new table. We are not using neural networks to do the analysis. We are using our own custom algorithms which scale much better than neural networks.
  3. The Disease_Start_Date is the date when the person gets the diseaes which is likely when the symptons start appearing. Disease_End_Date is when the person is recovered which is likely when symptoms disappear.
  4. Drug_start_date is when the person starts taking the drugs. Drug_end_date is when the person stops taking the drugs.

Edit I added my own answer. Can anyone come up with a simpler answer?

A: 

Pardon any mistakes but I think something like this would work (in T-SQL):

SELECT col1, col2, col3...
FROM PERSON_T AS P, DRUG_T AS DR, DISEASE_T AS DI
WHERE disease_id = 52
AND P.person_id = DI.person_id
AND P.person_id = DR.person_id
AND drug_id NOT IN(234, 26)
AND drug_id = 474
AND disease_start_date < drug_start_date
UNION
SELECT col1, col2, col3...
FROM PERSON_T AS P, DRUG_T AS DR, DISEASE_T AS DI
WHERE disease_id = 52
AND P.person_id = DI.person_id
AND P.person_id = DR.person_id
AND drug_id NOT IN(791, 395)
AND drug_id = 371
AND disease_start_date < drug_start_date

Now it doesn't have to be done with a UNION but for readibility I thought this was the easiest given your conditions. Maybe this will lead you in the right direction.

ajdams
This will not handle my second pattern I just added. My patterns might specify that a person has taken drug 234 and 474 before getting the disease, but not drug 26. This query results in zero results in this case.
Jay Askren
A: 
SELECT per.person_id, per.name, per.gender
FROM person_t per
INNER JOIN disease_t dis
USING (person_id)
INNER JOIN drug_t drug
USING (person_id)
WHERE dis.disease_id = 52 AND drug.drug_start_date < dis.disease_start_date AND ((drug.drug_id IN (234, 474) AND drug.drug_id NOT IN (26)) OR (drug.drug_id IN (371) AND drug.drug_id NOT IN (395, 791)));

This will do what you are asking for. The IN statements at the end are pretty self explanatory.

Peter Hanneman
This will not handle my second pattern I just added. My patterns might specify that a person has taken drug 234 and 474 before getting the disease, but not drug 26. This query results in zero results in this case.
Jay Askren
Just had to move the 234 to the other IN clause - just put them in backwards - it works now.The first IN is the drugs you want to see before symptoms, 2nd IN are the drugs you want to see after symptoms began. 3rd and 4th IN's are your next pattern - same input format.
Peter Hanneman
Peter Hanneman
+5  A: 

To me, the straightforward (if ugly) solution is to use EXISTS and NOT EXISTS clauses:

SELECT *
FROM PERSON_T INNER JOIN DISEASE_T
     USING (PERSON_ID)
WHERE DISEASE_ID = 52
  AND EXISTS (SELECT 1 FROM DRUG_T
              WHERE DRUG_T.PERSON_ID = PERSON_T.PERSON_ID
                AND DRUG_ID = 474
                AND [time condition])
  AND NOT EXISTS (SELECT 1 FROM DRUG_T
              WHERE DRUG_T.PERSON_ID = PERSON_T.PERSON_ID
                AND DRUG_ID = 234
                AND [time condition])

...and so on. In the example, we're asking for people who have taken drug 474 but not 234. Obviously, you can group the clauses with ANDs and ORs according to what you need.

Aside: I find all caps difficult to read. I usually use uppercase for SQL keywords and lowercase for table and column names.

Justin K
I didn't think this worked at first, but it appears to give the correct answer. Thanks.
Jay Askren
A: 

None of the answers given seem to work. Again here is the pattern I want to implement: ( (Drug 234 = true AND Drug 474 = true AND Drug 26 = false) OR (Drug 395 = false AND Drug 791 = false AND Drug 371 = true) )

I believe the following query will work for (Drug 234 = true AND Drug 474 = true AND Drug 26 = false). Given that, it is pretty easy to add the second half of the query.

SELECT  p.person_id, p.gender FROM person_t as p 
    join drug_t as dr on dr.person_id = p.person_id 
    join disease_t as ds on ds.person_id=p.person_id 
    WHERE dr.drug_start_date < ds.disease_start_date AND disease_id = 52 AND dr.drug_id=234
INTERSECT
SELECT  p.person_id, p.gender FROM person_t as p 
    join drug_t as dr on dr.person_id = p.person_id 
    join disease_t as ds on ds.person_id=p.person_id 
    WHERE dr.drug_start_date < ds.disease_start_date AND disease_id = 52 AND dr.drug_id=474
INTERSECT (
SELECT p.person_id, p.gender
    FROM person_t as p 
    JOIN disease_t as ds on ds.person_id = p.person_id 
    LEFT JOIN drug_t as dr ON dr.person_id = p.person_id  AND dr.drug_id = 26
    WHERE disease_id = 52 AND dr.person_id is null 
UNION 
SELECT p.person_id, p.gender
    FROM person_t as p 
    JOIN disease_t as ds on ds.person_id = p.person_id 
    JOIN drug_t as dr ON dr.person_id = p.person_id  AND dr.drug_id = 26
    WHERE disease_id = 52 AND dr.drug_start_date > ds.disease_start_date)

This query works, but is quite ugly. I also suspect it will be extremely slow once I have a production database with 100 million people. Is there anything I can do to simplify/optimize this query?

Jay Askren
Why has disease_id suddenly changed to 26 half way through? That wasn't in the question?
Martin Smith
That was typo. It is fixed now. Thanks for catching that.
Jay Askren
+1  A: 

I have no idea how this will perform with large tables (I imagine it will be pretty lousy as date comparisons are typically pretty expensive), but here is a method that should work. It is relatively verbose, but is very easy to modify for different boolean cases.

Example 1:

SELECT dis.*
FROM disease_t dis
LEFT JOIN drug d1 ON d1.person_id = dis.person_id AND d1.drug_id = 234
LEFT JOIN drug d2 ON d2.person_id = dis.person_id AND d2.drug_id = 474
LEFT JOIN drug d3 ON d3.person_id = dis.person_id AND d3.drug_id = 26
LEFT JOIN drug d4 ON d4.person_id = dis.person_id AND d4.drug_id = 395
LEFT JOIN drug d5 ON d5.person_id = dis.person_id AND d5.drug_id = 791
LEFT JOIN drug d6 ON d6.person_id = dis.person_id AND d6.drug_id = 371
WHERE dis.disease_id = 52
AND (((d1.person_id IS NULL OR dis.startdate < d1.startdate) AND
      (d2.person_id IS NOT NULL AND d2.startdate < dis.startdate) AND
      (d3.person_id IS NULL OR dis.startdate < d3.startdate)) 
     OR
     ((d4.person_id IS NULL OR dis.startdate < d4.startdate) AND
      (d5.person_id IS NULL OR dis.startdate < d5.startdate) AND
      (d6.person_id IS NOT NULL AND d6.startdate < dis.startdate)))

Example 2:

SELECT dis.*
FROM disease_t dis
LEFT JOIN drug d1 ON d1.person_id = dis.person_id AND d1.drug_id = 234
LEFT JOIN drug d2 ON d2.person_id = dis.person_id AND d2.drug_id = 474
LEFT JOIN drug d3 ON d3.person_id = dis.person_id AND d3.drug_id = 26
LEFT JOIN drug d4 ON d4.person_id = dis.person_id AND d4.drug_id = 395
LEFT JOIN drug d5 ON d5.person_id = dis.person_id AND d5.drug_id = 791
LEFT JOIN drug d6 ON d6.person_id = dis.person_id AND d6.drug_id = 371
WHERE dis.disease_id = 52
AND (((d1.person_id IS NOT NULL AND d1.startdate < dis.startdate) AND
      (d2.person_id IS NOT NULL AND d2.startdate < dis.startdate) AND
      (d3.person_id IS NULL OR dis.startdate < d3.startdate)) 
     or
     ((d4.person_id IS NULL OR dis.startdate < d4.startdate) AND
      (d5.person_id IS NULL OR dis.startdate < d5.startdate) AND
      (d6.person_id IS NOT NULL AND d6.startdate < dis.startdate)))
sgriffinusa
This appears to work. Thanks.
Jay Askren
+2  A: 

Here is a query that handles ( (Drug 234 = true AND Drug 474 = true AND Drug 26 = false) OR (Drug 395 = false AND Drug 791 = false AND Drug 371 = true) ), as you posted.

/*
-- AS DEFINED BY JOINS
-- All "person_id"'s match
-- Drug 1 is not Drug 2
-- Drug 1 is not Drug 3
-- Drug 2 is not Drug 3
-- All Drugs are optional as far as the SELECT statement is concerned (left join)
   -- Drug IDs will be defined in the WHERE clause
-- All Diseases for "person_id"

-- AS DEFINED IN WHERE STATEMENT
-- Disease IS 52
-- AND ONE OF THE FOLLOWING:
--   1) Disease started AFTER Drug 1
--      Disease started AFTER Drug 2
--      Drug 1 IS 234
--      Drug 2 IS 474
--      Drug 3 IS NOT 26 (AND NOT 234 or 474, as defined in JOINs)
--   2) Disease started AFTER Drug 3
--      Drug 1 IS NOT 395
--      Drug 2 IS NOT 791
--      Drug 3 IS 371
*/

SELECT p.person_id, p.gender FROM person_t as p
LEFT JOIN drug_t    AS dr1 ON (p.person_id = dr1.person_id)
LEFT JOIN drug_t    AS dr2 ON (p.person_id = dr2.person_id AND dr1.drug_id != dr2.drug_id)
LEFT JOIN drug_t    AS dr3 ON (p.person_id = dr3.person_id AND dr1.drug_id != dr3.drug_id AND dr2.drug_id != dr3.drug_id)
JOIN      disease_t AS ds  ON (p.person_id = ds.person_id)
WHERE ds.disease_id = 52
AND (   (    (dr1.drug_start_date < ds.disease_start_date AND dr2.drug_start_date < ds.disease_start_date)
        AND (dr1.drug_id = 234 AND dr2.drug_id = 474 AND dr3.drug_id != 26)
        )
    OR
        (    (dr3.drug_start_date < ds.disease_start_date)
        AND (dr1.drug_id != 395 AND dr2.drug_id != 791 AND dr3.drug_id = 371)
        )
    )
pferate
A: 

I don't have test data real handy to try this out on, but I think you could do something like:

SELECT *
FROM DISEASE_T D
INNER JOIN DRUG_T DR ON D.PERSON_ID = DR.PERSON_ID AND D.DRUG_ID=52
INNER JOIN PERSON_T P ON P.PERSON_ID = D.PERSON_ID
GROUP BY PERSON_ID
HAVING SUM(
    CASE WHEN DRUG_ID=234 AND DRUG_START_DATE<DISEASE_START_DATE THEN -1 
    WHEN DRUG_ID=474 AND DRUG_START_DATE<DISEASE_START_DATE THEN 1 
    WHEN DRUG_ID=26 AND DRUG_START_DATE<DISEASE_START_DATE THEN -1 
    ELSE 0 END) = 1
    OR
    SUM(
    CASE WHEN DRUG_ID=395 AND DRUG_START_DATE<DISEASE_START_DATE THEN -1 
    WHEN DRUG_ID=791 AND DRUG_START_DATE<DISEASE_START_DATE THEN -1 
    WHEN DRUG_ID=371 AND DRUG_START_DATE<DISEASE_START_DATE THEN 1 
    ELSE 0 END) = 1

The case I know will fail there is if you have multiple records for the same person and the same drug/disease in the drug/disease tables. If that's the case, you could also change the HAVING clause to look more like:

(SUM(CASE WHEN DRUG_ID=234 AND DRUG_START_DATE<DISEASE_START_DATE THEN 1 ELSE 0 END) = 0
AND SUM(CASE WHEN DRUG_ID=474 AND DRUG_START_DATE<DISEASE_START_DATE THEN 1 ELSE 0 END) > 0
AND SUM(CASE WHEN DRUG_ID=26 AND DRUG_START_DATE<DISEASE_START_DATE THEN 1 ELSE 0 END) = 0)
OR
(SUM(CASE WHEN DRUG_ID=395 AND DRUG_START_DATE<DISEASE_START_DATE THEN 1 ELSE 0 END) = 0
AND SUM(CASE WHEN DRUG_ID=791 AND DRUG_START_DATE<DISEASE_START_DATE THEN 1 ELSE 0 END) = 0
AND SUM(CASE WHEN DRUG_ID=371 AND DRUG_START_DATE<DISEASE_START_DATE THEN 1 ELSE 0 END) > 0)
joelt
A person can take the same drug more than once and there would be a row for each time they took the drug. I believe the same is true for diseases.
Jay Askren
A: 

I would probably approach this problem from some direction similar to this. It's pretty flexible.

DRUG_DISEASE_CORRELATION_QUERY
===============================
DRUG_DISEASE_CORRELATION_QUERY_ID
DISEASE_ID
DESCRIPTION

(1, 52, 'What this query does.')
(2, 52, 'Add some more results.')

DRUG_DISEASE_CORRELATION_QUERY_INCLUDE_DRUG
===========================================
DRUG_DISEASE_CORRELATION_QUERY_ID
DRUG_ID

(1, 234)
(1, 474)
(2, 371)

DRUG_DISEASE_CORRELATION_QUERY_EXCLUDE_DRUG
===========================================
DRUG_DISEASE_CORRELATION_QUERY_ID
DRUG_ID

(1, 26)
(2, 395)
(2, 791)



CREATE VIEW DRUG_DISEASE_CORRELATION
AS
SELECT 
    p.*,
    q.DRUG_DISEASE_CORRELATION_QUERY_ID
FROM 
    DRUG_DISEASE_CORRELATION_QUERY q
    INNER JOIN DISEASE_T ds on ds.DISEASE_ID = q.DISEASE_ID
    INNER JOIN PERSON_T p ON p.PERSON_ID = ds.PERSON_ID
  WHERE 
    AND EXISTS (SELECT * FROM DRUG_T dr WHERE dr.PERSON_ID = p.PERSON_ID AND dr.DRUG_ID IN
        (SELECT qid.DRUG_ID FROM DRUG_DISEASE_CORRELATION_QUERY_INCLUDE_DRUG qid WHERE 
        qid.DRUG_DISEASE_CORRELATION_QUERY_ID = q.DRUG_DISEASE_CORRELATION_QUERY_ID)
        AND DRUG_START_DATE < ds.DISEASE_START_DATE)
   AND NOT EXISTS (SELECT * FROM DRUG_T dr WHERE dr.PERSON_ID = p.PERSON_ID AND dr.DRUG_ID IN
        (SELECT qed.DRUG_ID FROM DRUG_DISEASE_CORRELATION_QUERY_EXCLUDE_DRUG qed WHERE 
        qed.DRUG_DISEASE_CORRELATION_QUERY_ID = q.DRUG_DISEASE_CORRELATION_QUERY_ID)
        AND DRUG_START_DATE < ds.DISEASE_START_DATE)
GO


SELECT * FROM DRUG_DISEASE_CORRELATION WHERE DRUG_DISEASE_CORRELATION_QUERY_ID = 1
UNION
SELECT * FROM DRUG_DISEASE_CORRELATION WHERE DRUG_DISEASE_CORRELATION_QUERY_ID = 2
Jeffrey L Whitledge
A: 

If I have it right, you want to:

  • Select out those Persons
  • Who have been infected with one (1) specific disease
  • Who have been treated with one or more specified drugs
  • And who have NOT been been treated with one or more specified other drugs

This could be simplified by converting your "drug requirements" into a temporary table of some form. This would allow for the use of any number of "good" and "bad" drugs to query against. What I have below could be implemented as a stored procedure, but if that's not an option a number of convoluted options are available.

Breaking down the steps:

First, here's how the desired patients are selected. We'll use this as a subquery later:

SELECT [PersonData]
 from DISEASE_T di
  inner join PERSON_T pe
   on pe.Person_Id = di.Person_Id
 where di.Disease_Id = [TargetDisease]
  and [TimeConstraints]

Second, for each set of "target" drugs that you have ANDed together, set up a temporary table like so (this is SQL Server syntax, Postgres should have something similar):

CREATE TABLE #DrugSet
 (
   Drug_Id  [KeyDataType]
  ,Include  int   not null
 )

populate it with one row for every drug you are considering:

  • Drug_Id = the drug you're checking
  • Include = 1 if the person is to have taken the drug, and 0 if they are not to have taken it

and calculate two values:

@GoodDrugs, the number of drugs you want the patient to have taken
@BadDrugs, the number of drugs you want the patient to have not taken

Now, stitch all the above together in the following query:

SELECT pe.[PersonData]  --  All the desired columns from PERSON_T and elsewhere
 from DRUG_T dr
  --  Filter to only include "persons of interest"
  inner join (select [PersonData]
               from DISEASE_T di
                inner join PERSON_T pe
                 on pe.Person_Id = di.Person_Id
               where di.Disease_Id = [TargetDisease]
                and [TimeConstraints]) pe
   on pe.Person_Id = dr.Person_ID
 --  Join with any of the drugs we are intersted in
 left outer join #DrugSet ta  
  on ta.Drug_Id = dr.Drug_Id
 group by pe.[PersonData]  --  Same as in the SELECT clause
 having sum(case ta.Include
              when 1 then 1  --  This patient has been given a drug that we're looking to match
              else 0         --  This patient has not been given this drug (catches NULLs, too)
            end) = @GoodDrugs
  and  sum(case ta.Include
              when 0 then 1  --  This patient has been given this drug that we're NOT looking to match
              else 0         --  This patient has not been given this drug (catches NULLs, too)
            end) = @BadDrugs

I've intentionally ignored the time criteria as you didn't go into details on them, but they should be fairly simple to add (though I hope that's not famous last words). Further optimizations may be possible, but a lot depends on the data and other possible criteria.

You would need to run this once for each "drug set" (that is, sets of TRUE or FALSE drugs ANDed together), concatenating the list with each pass. You could probably expand #DrugSet to factor in each drug set you are checking, but I'm reluctant to try and code that without some serious data to test it against.

*/

Philip Kelley
I disregarded the date columns and asssumed that there where intuitive primary keys present. If this is not the case, some distinct and group by clauses will be called for.
Philip Kelley
A: 

I've tried to break down the problem, and follow through as logically as I could.

Firstly the three tables (Person_T, Drugs_T, Disease_T) can be thought of as shown in Fig 1.0:

A Person can have multiple drugs and multiple diseases. Each drug and disease has a start date & end date.

Therfore I would first de-normalize the three tables into one table (Table_dn) thus:

dnId | PersonId | DrugId | DiseaseId | DgSt | DgEn | DiSt | DiEn
----   --------   ------   ---------   ----   ----   ----   ----

This de-normalized table can be a temp table if needed, regardless Table_dn now contains all the entire Global dataset as shown in Fig 2.0 (denoted as G).

From my understanding of your description, I can see essentially a two layer filter.

Filter 1

This filter is simply a boolean set of drugs Combinations, as you have already stated in your question description. eg:

(drug a = 1 & drug b = 0 & etc) OR (.....

Filter 2

This filter is a little more complicated than the first, it is date range criteria. Fig 3.0 shows this date range in RED. The yellow represents record dates that span in multiple ways:

  • before RED period
  • After RED period
  • Between RED period
  • Ending before end of RED period
  • Starting after start of RED period

Now the YELLOW date periods could be the drugs period or disease period OR combination of both.

This filter should be applied to the set of results obtained from the first results.

Of course depending on your exact question those two Filters may need to go the other way round (e.g f2 first then f1).

SQL pseudo code:

Select sub.*
From    
      (select    * 
       from      Table_dn 
       where     [Filter 1]
      ) as sub

where [Filter 2]

alt text

Darknight