tags:

views:

250

answers:

2

Hi there.

I'm dealing with cleaning relatively large (30ish lines) blocks of text. Here's an excerpt:

PID|1||06225401^^^PA0^MR||PATIENT^FAKE R|||F

PV1|1|I|||||025631^DoctorZ^^^^^^^PA0^^^^DRH|DRH||||...

ORC|RE||CYT-09-06645^AP||||||200912110333|INTERFACE07

OBR|1||CYT09-06645|8104^^L|||20090602|||||||200906030000[conditio...

OBX|1|TX|8104|1|SOURCE OF SPECIMEN:[source]||||||F|||200912110333|CYT ...

I currently have a script that takes out illegal characters or terms. Here's an example.

    infile = open(thisFile,'r')
    m = infile.read()

    #remove junk headers
    m = m.replace("4þPATHþ", "")
    m = m.replace("10þALLþ", "")

My goal is to modify this script so that I can add 4 digits to the end of one of the fields. In specific, the date field ("20090602") in the OBR line. The finished script will be able to work with any file that follows this same format. Is this possible with the way I currently handle the file input or do I have to use some different logic?

+2  A: 

You may find the answers here helpful.

http://stackoverflow.com/questions/1175540/iterative-find-replace-from-a-list-of-tuples-in-python/1175547

Unknown
Thanks. I don't think this is exactly what I'm looking for, but it's good to know nonetheless.
Sean O'Hollaren
+1  A: 

Here's an outline (untested) ... basically you do it a line at a time

for line in infile:
    data = line.rstrip("\n").split("|")
    kind = data[0]
    # start of changes
    if kind == "OBR":
        data[7] += "0000" # check that 7 is correct!
    # end of changes
    outrecord = "|".join(data)
    outfile.write(outrecord + "\n")

The above assumes that you are selecting fix-up targets by line-type (example: "OBR") and column index (example: 7). If there are only a few such targets, just add more similar fix statements. If there are many targets, you could specify them like this:

fix_targets = {
    "OBR": [7],
    "XYZ": [1, 42],
    }

and the fix_up code would look like this:

if kind in fix_targets:
    for col_index in fix_targets[kind]:
        data[col_index] += "0000"

You may like in any case to add code to check that data[col_index] really is a date in YYYYMMDD format before changing it.

None of the above addresses removing the unwanted headers, because you didn't show example data. I guess that applying your replacements to each line (and avoiding writing the line if it became only whitespace after replacements) would do the trick.

John Machin
Perfect. I figured it might be something like this, but I couldn't really visualize it. Thanks for the answer.
Sean O'Hollaren
Good. outfile.write(outrecord + '\n') has a built-in Python equivalent: print >> outfile, outrecord, which is simpler.
EOL
@EOL - except that the "print chevron" syntax, which is not often used, is also deprecated.
JimB