ansaurus

Question

Medical information extraction using Python

Answer 1

+8 A:

Here are some possible ways you can solve this:

Using Regular Expressions - Define them according to the patterns in your text. Match the expressions, extract the pattern, and repeat for all records. This approach needs a good understanding of the data format and, of course, of regular expressions :)
String Manipulation - This approach is relatively simple. Again, you need a good understanding of the data format. This is what I have done below.
Machine Learning - You could define all your rules and train a model on them. The model then tries to extract data using the rules you provided. This is a much more generic approach than the first two, and also the toughest to implement.

See if this works for you. It might need some adjustments.

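# 'with' closes both files even if a record fails to parse.
with open('parsed_file', 'w') as new_file, open('your_csv_file') as records:
    for rec in records:
        rec = rec.strip()
        if not rec:
            continue  # skip blank lines
        # Each record looks like "<date> : <He|She> got <symptoms> and ..."
        tmp = rec.split(' : ')
        date = tmp[0]
        reason = tmp[1]

        if reason[:2] == 'He':
            sex = 'Male'
            symptoms = reason.split(' and ')[0].split('He got ')[1]
        else:
            sex = 'Female'
            symptoms = reason.split(' and ')[0].split('She got ')[1]
        symptoms = [i.strip() for i in symptoms.split(',')]
        symptoms = '\n'.join(symptoms)
        died = 'True' if 'died' in rec else 'False'
        # Note: 'date' is the time the record was written, not the computed time of death.
        new_file.write("Sex: %s\nSymptoms: %s\nDeath: %s\nDeath Time: %s\n\n" % (sex, symptoms, died, date))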

Each field is separated by a newline (\n) and, since you did not mention an output format, each patient record is separated from the next by two newlines (\n\n).
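
The regular-expression approach from the list above could be sketched roughly as follows. This is only an illustration: the pattern `RECORD_PAT` and the function name `parse_with_regex` are made up here, and the sketch assumes every record follows the "<date> : He/She got <symptoms> and ..." shape of the simplified examples.

```python
import re

# Assumed record shape: "<date> : He|She got <symptom>, <symptom> and <rest>"
RECORD_PAT = re.compile(
    r'^(?P<date>.+?)\s+:\s+'          # everything before the " : " separator
    r'(?P<pronoun>He|She)\s+got\s+'   # the pronoun gives the sex
    r'(?P<symptoms>.+?)\s+and\s+'     # comma-separated symptoms up to the first " and "
    r'(?P<rest>.*)$'                  # whatever follows, e.g. "died 4 hours later"
)

def parse_with_regex(record):
    """Return a dict of fields, or None if the record does not match."""
    match = RECORD_PAT.match(record.strip())
    if match is None:
        return None
    return {
        'Date': match.group('date'),
        'Sex': 'Male' if match.group('pronoun') == 'He' else 'Female',
        'Symptoms': [s.strip() for s in match.group('symptoms').split(',')],
        'Death': 'died' in match.group('rest'),
    }

print(parse_with_regex(
    '11/11/2010 - 09:00am : He got nausea, vomiting and died 4 hours later'))
```

Records that do not fit the pattern return None instead of raising, so they can be collected and inspected separately.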

LATER: @Nurse what did you end up doing? Just curious.

MovieYoda 2010-10-25 02:47:36

I was just about to comment and say basically this: it looks like `string.split(...)` and a simple state machine (like this one) will give you the most bang for your buck.

David Wolever 2010-10-25 02:49:21

That's about it: basic string munching. If your records follow the pattern you describe, this should work out of the box. If discrepancies arise (I don't know the data), you might need to tweak it a bit to match your data.

MovieYoda 2010-10-25 02:51:53

I would do it that way if it were consistent, but there are a lot of rules and they don't appear in the same order. We can't process it linearly, if you know what I mean.

Nurse 2010-10-25 02:56:58

@Nurse then it's better to mention all the rules/cases. People can't suggest perfect solutions without knowing all the rules.

MovieYoda 2010-10-25 03:03:08

Does your output only include the 'Sex', 'Symptoms', 'Death', 'Death Time' fields? Or do you sometimes need to output other information, such as treatment?

philosodad 2010-10-25 03:03:46

Yeah, we have treatment patterns, complications, referrals, investigation results, drug dosages, etc. What I am putting here are just simplified examples. Please read the lowermost part of the main post and you will find example rules. I thought that if there were a way to make Python understand the text, this could be easy.

Nurse 2010-10-25 03:09:06

It should be reasonably easy. I may not be able to suggest a perfect answer, but maybe I can help you establish a question that can be perfectly answered! Do I understand correctly that the rules are basically as follows? 1. Every record starts with the time, followed after a ':' by the sex. 2. Every field has a keyword identifier, such as 'got', followed by strings separated by a consistent token, such as ','. 3. Fields are separated by the word 'and'. So you might have `date: sex 'keyword' string, string and 'keyword' string, string, string and 'keyword' string.` Is that correct?

philosodad 2010-10-25 03:15:25

Yeah, that's correct, but sometimes we have only a keyword without strings, like "kept under observation", or a keyword, a <variable> and some text, like "died 10 hours later". In that case I am interested in knowing that he died and how many hours later he died.

Nurse 2010-10-25 03:26:06

Then the answer should work fairly well for you. You want the record to split into fields on the keyword 'and', and you can establish a dictionary to map keywords to labels ('got' => 'Symptoms: ') and so on. Extract the date and sex as above, split the record using the 'and' separator as shown above, put the fields into an array [field, field, field], and then iterate through the array, using the first word in the field to decide what to output.
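
A rough sketch of that idea, assuming the simplified record shape discussed in these comments. The keyword table `LABELS`, the label names, and the function name `split_fields` are made up for illustration; the real rules would need their own entries.

```python
import re

# Hypothetical keyword -> label table; extend it with the real rules.
LABELS = {
    'got': 'Symptoms',
    'died': 'Death',
    'kept': 'Status',
}

def split_fields(record):
    """Split one record into labelled fields using the first word of each field."""
    date, text = record.strip().split(' : ', 1)
    pronoun, text = text.split(' ', 1)
    result = {'Time': date, 'Sex': 'Male' if pronoun == 'He' else 'Female'}
    for field in text.split(' and '):
        keyword, _, rest = field.strip().partition(' ')
        label = LABELS.get(keyword)
        if label is None:
            # Unknown keyword: keep the raw text so nothing is silently lost.
            result.setdefault('Unparsed', []).append(field.strip())
        elif label == 'Death':
            # "died 10 hours later" -> record the number of hours, if given.
            hours = re.search(r'(\d+)\s+hours?', rest)
            result['Death'] = True
            result['Hours Until Death'] = int(hours.group(1)) if hours else None
        elif label == 'Symptoms':
            result[label] = [s.strip() for s in rest.split(',')]
        else:
            # Keyword-only fields such as "kept under observation".
            result[label] = rest
    return result

print(split_fields(
    '11/11/2010 - 09:00am : He got nausea, vomiting and died 4 hours later'))
```

Because the fields are looked up by keyword, they can appear in any order. A symptom that itself contains " and " would be split in the wrong place, so this only holds while 'and' is used purely as a field separator.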

philosodad 2010-10-25 03:41:32

@MovieYoda: I am still looking for a better way than splitting the text. I have heard good things about lexers and things like that. We have a lot of patterns, and parsing them one by one using text splitting is somewhat hard for me.

Nurse 2010-10-26 11:54:46

Answer 2

+2 A:

This uses dateutil to parse the date (e.g. '11/11/2010 - 09:00am'), and parsedatetime to parse the relative time (e.g. '4 hours later'):

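import dateutil.parser as dparser
# These two imports match the older parsedatetime layout this answer was
# written against; newer releases expose Calendar at the package top level
# (import parsedatetime as pdt; pdt.Calendar()).
import parsedatetime.parsedatetime as pdt
import parsedatetime.parsedatetime_consts as pdc
import sys
import time
import datetime
import re
import pprint
pdt_parser = pdt.Calendar(pdc.Constants())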
record_time_pat=re.compile(r'^(.+)\s+:')
sex_pat=re.compile(r'\b(he|she)\b',re.IGNORECASE)
death_time_pat=re.compile(r'died\s+(.+hours later).*$',re.IGNORECASE)
symptom_pat=re.compile(r'[,-]')

def parse_record(astr):    
    match=record_time_pat.match(astr)
    if match:
        record_time=dparser.parse(match.group(1))
        astr,_=record_time_pat.subn('',astr,1)
    else: sys.exit('Can not find record time')
    match=sex_pat.search(astr)    
    if match:
        sex=match.group(1)
        sex='Female' if sex.lower().startswith('s') else 'Male'
        astr,_=sex_pat.subn('',astr,1)
    else: sys.exit('Can not find sex')
    match=death_time_pat.search(astr)
    if match:
        death_time,date_type=pdt_parser.parse(match.group(1),record_time)
        if date_type==2:
            death_time=datetime.datetime.fromtimestamp(
                time.mktime(death_time))
        astr,_=death_time_pat.subn('',astr,1)
        is_dead=True
    else:
        death_time=None
        is_dead=False
    # Remove only the standalone word 'and', not 'and' inside words like 'hand'.
    astr=re.sub(r'\band\b','',astr)
    symptoms=[s.strip() for s in symptom_pat.split(astr)]
    return {'Record Time': record_time,
            'Sex': sex,
            'Death Time':death_time,
            'Symptoms': symptoms,
            'Death':is_dead}


if __name__=='__main__':
    tests=[('11/11/2010 - 09:00am : He got nausea, vomiting and died 4 hours later',
            {'Sex':'Male',
             'Symptoms':['got nausea', 'vomiting'],
             'Death':True,
             'Death Time':datetime.datetime(2010, 11, 11, 13, 0),
             'Record Time':datetime.datetime(2010, 11, 11, 9, 0)}),
           ('11/11/2010 - 09:00am : She got heart burn, vomiting of blood and died 1 hours later in the operation room',
           {'Sex':'Female',
             'Symptoms':['got heart burn', 'vomiting of blood'],
             'Death':True,
             'Death Time':datetime.datetime(2010, 11, 11, 10, 0),
             'Record Time':datetime.datetime(2010, 11, 11, 9, 0)})
           ]

    for record,answer in tests:
        result=parse_record(record)
        pprint.pprint(result)
        assert result==answer
        print()

yields:

{'Death': True,
 'Death Time': datetime.datetime(2010, 11, 11, 13, 0),
 'Record Time': datetime.datetime(2010, 11, 11, 9, 0),
 'Sex': 'Male',
 'Symptoms': ['got nausea', 'vomiting']}

{'Death': True,
 'Death Time': datetime.datetime(2010, 11, 11, 10, 0),
 'Record Time': datetime.datetime(2010, 11, 11, 9, 0),
 'Sex': 'Female',
 'Symptoms': ['got heart burn', 'vomiting of blood']}

Note: Be careful parsing dates. Does '8/9/2010' mean August 9th, or September 8th? Do all the record keepers use the same convention? If you choose to use dateutil (and I really think that's the best option if the date string is not rigidly structured), be sure to read the section on "Format precedence" in the dateutil documentation so you can (hopefully) resolve '8/9/2010' properly. If you can't guarantee that all the record keepers use the same convention for specifying dates, then the results of this script would have to be checked manually. That might be wise in any case.
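
To make the ambiguity concrete, here is a small standard-library illustration (it deliberately does not use dateutil): the same string yields two different dates depending on which convention is assumed. The two format strings are the assumptions here. With dateutil, the `dayfirst` argument of `dateutil.parser.parse` selects between the same two readings.

```python
import datetime

ambiguous = '8/9/2010'

# Month-first reading: August 9th.
month_first = datetime.datetime.strptime(ambiguous, '%m/%d/%Y')
# Day-first reading: September 8th.
day_first = datetime.datetime.strptime(ambiguous, '%d/%m/%Y')

print(month_first.date())  # 2010-08-09
print(day_first.date())    # 2010-09-08

# Some dates can only be read one way, which can help detect the convention.
try:
    datetime.datetime.strptime('25/10/2010', '%m/%d/%Y')
    only_day_first = False
except ValueError:
    only_day_first = True
    print('25/10/2010 cannot be month-first')
```

Scanning a batch of records for dates like '25/10/2010' is one way to check which convention a given record keeper uses before trusting the parsed results.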

unutbu 2010-10-25 03:36:51

+1 Just for the effort in doing all this.

MovieYoda 2010-10-25 03:44:38

Answer 3

A:

Maybe this can help you too; it's not tested.

import collections
import datetime
import re

retrieved_data = []

Data = collections.namedtuple('Patient', 'Sex, Symptoms, Death, Death_Time')

with open('data.txt') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines

        # Fresh dictionary for every record, so no values leak between patients.
        dict_data = {'Death': False,
                     'Death_Time': None,
                     'Sex': '',
                     'Symptoms': ''}

        date, text = line.split(" : ", 1)
        if 'died' in text:
            dict_data['Death'] = True
            # Start from the record time (day/month/year order is assumed) ...
            dict_data['Death_Time'] = datetime.datetime.strptime(
                date, '%d/%m/%Y - %I:%M%p')
            # ... and add the "N hours later" offset if one is present.
            hours = re.findall(r'\d+', text[text.index('died'):])
            if hours:
                dict_data['Death_Time'] += datetime.timedelta(hours=int(hours[0]))

        # Compare in lower case: the records start with "He" or "She".
        if text.lower().startswith('she'):
            dict_data['Sex'] = 'Female'
        else:
            dict_data['Sex'] = 'Male'

        # Symptoms sit between 'got' and the next ' and '.
        start = text.index('got') + len('got')
        symptoms = [s.strip() for s in text[start:text.index(' and ')].split(',')]
        dict_data['Symptoms'] = '\n'.join(symptoms)

        retrieved_data.append(Data(**dict_data))

singularity 2010-10-25 04:11:09

Answer 4

A:

It would be relatively easy to do most of the processing with regards to sex, date/time, etc., as those before you have shown, since you can really just define a set of keywords that would indicate these things and use those keywords.

However, the matter of processing symptoms is a bit different, as a definitive list of keywords representing symptoms would be difficult, and most likely impossible, to compile.

Here's the choice you have to make: does processing this data really represent enough work to justify spending days writing a program to do it? If that's the case, then you should look into natural language processing (or machine learning, as someone before me said). I've heard pretty good things about nltk, a natural language toolkit for Python. If the format is as consistent as you say it is, the natural language processing might not be too difficult.

But, if you're not willing to expend the time and effort to tackle a truly difficult CS problem (and believe me, natural language processing is), then you ought to do most of the processing in Python by parsing dates, gender-specific pronouns, etc. and enter in the tougher parts by hand (e.g. symptoms).

Again, it depends on whether or not you think the programmatic or the manual solution will take less time in the long run.
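
One way to combine the two, sketched under the assumption that symptoms always sit between 'got' and the next ' and ' (as in the simplified examples): extract what the pattern can handle and queue everything else for manual entry. The names `triage` and `SYMPTOM_PAT` are hypothetical.

```python
import re

# Assumption: symptoms are the comma-separated strings between "got" and " and ".
SYMPTOM_PAT = re.compile(r'\bgot\s+(.+?)\s+and\s+')

def triage(records):
    """Split records into (parsed, needs_manual_review)."""
    parsed, manual = [], []
    for record in records:
        match = SYMPTOM_PAT.search(record)
        if match:
            symptoms = [s.strip() for s in match.group(1).split(',')]
            parsed.append((record, symptoms))
        else:
            manual.append(record)  # leave the hard cases for a human
    return parsed, manual

parsed, manual = triage([
    '11/11/2010 - 09:00am : He got nausea, vomiting and died 4 hours later',
    '11/11/2010 - 10:00am : She was kept under observation',
])
print(len(parsed), 'parsed automatically;', len(manual), 'left for manual entry')
```

The size of the manual queue after a first run gives a rough measure of how much work the programmatic route would actually save.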

Rafe Kettler 2010-10-25 04:45:09

but if I understand the format, the symptoms would just be any delineated strings between a keyword 'got' and the next 'and'.

philosodad 2010-10-26 03:56:24

Hopefully that's true, in which case you could just use normal string processing or regex.

Rafe Kettler 2010-10-26 04:04:27

That's something I was searching for. I tried nltk already, but the documentation is very technical and I am not able to follow it. Actually, I don't mind spending a month or two developing a tool that will help me in the long run. I have about 7000-10000 records to insert into a database every month, so investing some time in learning wouldn't be a waste.

Nurse 2010-10-26 11:57:38
