I want to parse a text file that contains unstructured text. I need to extract the address, date of birth, name, sex, and ID from records like these:

. 55 MORILLO ZONE VIII,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
F
01/16/1952
ALOMO, TERESITA CABALLES
3412-00000-A1652TCA2
12    
. 22 FABRICANTE ST. ZONE
VIII LUISIANA LAGROS,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
M
10/14/1967
AMURAO, CALIXTO MANALO13

In the example above, the first three lines are the address, the line containing just "F" is the sex, the line after that is the date of birth (DOB), followed by the name and then the ID. The number 12 under the ID is the index/record number.

However, the format is not consistent. In the second group, the address spans four lines instead of three, and the index/record number is appended directly to the name (this happens when the person has no ID field).

I want to rewrite each record in the following format:

name, ID, address, sex, DOB
+3  A: 

You have to exploit whatever regularity and structure the text does have.

I suggest reading one line at a time and matching each line against regular expressions to determine its type, then filling in the corresponding field of a person object. Write out that object and start a new one whenever you encounter a field that is already filled in.
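
As a rough sketch of that idea (one possible reading, not a definitive implementation): the names classify and parse_people are made up for illustration, the ID pattern is only guessed from the single ID in the question, and free text is treated as the name only when it directly follows the DOB. A new person is started when an address line arrives after the sex field has already been filled in.

```python
import re

SAMPLE = """\
. 55 MORILLO ZONE VIII,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
F
01/16/1952
ALOMO, TERESITA CABALLES
3412-00000-A1652TCA2
12
. 22 FABRICANTE ST. ZONE
VIII LUISIANA LAGROS,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
M
10/14/1967
AMURAO, CALIXTO MANALO13
"""

SEX_RE = re.compile(r'^[MF]$')
DOB_RE = re.compile(r'^\d{2}/\d{2}/\d{4}$')
ID_RE = re.compile(r'^\d{4}-\d{5}-\w+$')   # assumption: shape guessed from one sample ID
RECNUM_RE = re.compile(r'^\d+$')
NAME_RECNUM_RE = re.compile(r'^(?P<name>\D+?)\s*(?P<recnum>\d+)$')


def classify(line, person):
    """Return the field name a line most likely belongs to."""
    if SEX_RE.match(line):
        return 'sex'
    if DOB_RE.match(line):
        return 'dob'
    if ID_RE.match(line):
        return 'id'
    if RECNUM_RE.match(line):
        return 'recnum'
    # Free text: the name directly follows the DOB, anything else is address.
    if 'dob' in person and 'name' not in person:
        return 'name'
    return 'address'


def parse_people(lines):
    people, person = [], {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        kind = classify(line, person)
        if kind == 'address':
            if 'sex' in person:
                # An address line after later fields means a new record begins.
                people.append(person)
                person = {}
            person.setdefault('address', []).append(line.lstrip('. '))
        elif kind == 'name':
            m = NAME_RECNUM_RE.match(line)
            if m:  # record number glued to the end of the name
                person['name'] = m.group('name')
                person['recnum'] = m.group('recnum')
            else:
                person['name'] = line
        else:
            person[kind] = line
    if person:
        people.append(person)
    return people


people = parse_people(SAMPLE.splitlines())
for p in people:
    print(p['name'], p.get('id'), ' '.join(p['address']), p['sex'], p['dob'])
```

Keeping the address as a list of lines until the end makes it easy to join them with whatever separator the output format needs.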

Nathan
+2  A: 

You can probably do this with regular expressions without too much difficulty. If you have never used them before, check out the Python documentation for the re module, then fire up redemo.py (on my computer, it's in c:\python26\Tools\scripts).

The first task is to split the flat file into a list of entities (one chunk of text per record). From the snippet of text you gave, you could split the file with a pattern matching the beginning of a line, where the first character is a dot:

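import re
# re.MULTILINE makes ^ match at the start of every line, not only at the start of the file
re_entity_splitter = re.compile(r'^\.', re.MULTILINE)

# textfile is the path to the flat file; drop the empty chunk before the first dot
with open(textfile) as f:
    entities = [e for e in re_entity_splitter.split(f.read()) if e.strip()]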

Note that the dot must be escaped (it is a wildcard character by default). Note also the r before the pattern: it marks a raw string, which spares you from having to escape the backslashes themselves and so avoids the so-called 'backslash plague.'

Once you have the file split into individual people, picking out the gender and birthdate is a snap. Use these:

re_gender     = re.compile(r'^[MF]$', re.MULTILINE)  # a line containing only M or F
re_birth_Date = re.compile(r'\d\d/\d\d/\d{4}')       # four-digit year, as in the sample

And away you go. You can paste the flat file into the redemo GUI and experiment with patterns until they match what you need. You'll have it parsed in no time. Once you get comfortable with this, you can use symbolic group names (see the docs) to pick out individual elements quickly and cleanly.
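
To illustrate symbolic group names, here is a small sketch. The pattern names DOB_RE and NAME_RE and the group names are invented for this example, and the name pattern assumes the "LAST, GIVEN NAMES" layout seen in the sample, with an optional record number glued to the end.

```python
import re

# Each (?P<name>...) group can be fetched by name instead of by position.
DOB_RE = re.compile(r'^(?P<month>\d\d)/(?P<day>\d\d)/(?P<year>\d{4})$')
NAME_RE = re.compile(r'^(?P<last>[^,]+),\s*(?P<given>\D+?)\s*(?P<recnum>\d+)?$')

dob = DOB_RE.match('01/16/1952').groupdict()
print(dob['year'], dob['month'], dob['day'])

with_recnum = NAME_RE.match('AMURAO, CALIXTO MANALO13')
print(with_recnum.group('last'), '|', with_recnum.group('given'),
      '|', with_recnum.group('recnum'))

# When no record number is attached, the optional group is None.
plain = NAME_RE.match('ALOMO, TERESITA CABALLES')
print(plain.group('given'), plain.group('recnum'))
```

Named groups keep the calling code readable when a pattern grows, since adding a group does not shift the numbering of the others.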

twneale
Thanks. I already have some experience with regular expressions. How do I deal with the address part? Some entities have 3 lines of address and others have 4.
Francis
Once you split the file into a list of people, for each person I would try this:
1. Split the person's text into a list of lines.
2. While the last item in the list doesn't match re_gender, pop it off the end of the list.
3. Pop the gender line itself; the remaining list items are the address.

lst = person.splitlines()
while not re_gender.search(lst[-1].strip()):
    lst.pop()
lst.pop()
address_list = lst
twneale
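
A variation on the same idea, as a sketch only: instead of popping from the end, scan forward for the gender line and slice the list there, which also keeps the remaining fields. The helper name split_record is hypothetical, and the code assumes a gender line is always a line consisting of just M or F.

```python
import re

SAMPLE = """\
. 55 MORILLO ZONE VIII,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
F
01/16/1952
ALOMO, TERESITA CABALLES
3412-00000-A1652TCA2
12
. 22 FABRICANTE ST. ZONE
VIII LUISIANA LAGROS,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
M
10/14/1967
AMURAO, CALIXTO MANALO13
"""

# re.MULTILINE is needed so that ^ matches at the start of every line.
RE_SPLIT = re.compile(r'^\.', re.MULTILINE)
RE_GENDER = re.compile(r'^[MF]$')


def split_record(chunk):
    """Return (address_lines, remaining_lines) for one person's text."""
    lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
    for i, ln in enumerate(lines):
        if RE_GENDER.match(ln):
            return lines[:i], lines[i:]
    return lines, []  # no gender line found: treat everything as address


chunks = [c for c in RE_SPLIT.split(SAMPLE) if c.strip()]
addresses = []
for chunk in chunks:
    address_lines, rest = split_record(chunk)
    addresses.append(' '.join(address_lines))
    print(len(address_lines), 'address lines, then:', rest[:2])
```

This works the same way whether the address has three lines or four, because the slice point is found by content rather than by position.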
+1  A: 

Here's a quick hack job.

import re  # used by re.match() in process()

f = open('data.txt')

def process(file):
    address = ""

    for line in file:
        if line == '': raise StopIteration
        line = line.rstrip() # to ignore \n
        if line in ('M','F'):
            sex = line
            break
        else:
            address += line

    DOB = file.readline().rstrip() # to ignore \n
    name = file.readline().rstrip()

    if name[-1].isdigit():
        name = re.match(r'^([^\d]+)\d+', name).group(1)
        ID = None
    else:
        ID = file.readline().rstrip()
        file.readline() # ignore the record #

    print (name, ID, address, sex, DOB)

while True:
    process(f)
Unknown
Fails on the line DOB = file.readline().rstrip() with the error ValueError: Mixing iteration and read methods would lose data
Francis
@Francis: in that case simply turn the for loop into a while loop and use file.readline().rstrip().
Unknown
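
A sketch of what that readline-only variant might look like. The function name read_record is made up, io.StringIO stands in for the real file so the example is self-contained, and the loop stops cleanly at end of file by returning None instead of looping forever.

```python
import io
import re

SAMPLE = """\
. 55 MORILLO ZONE VIII,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
F
01/16/1952
ALOMO, TERESITA CABALLES
3412-00000-A1652TCA2
12
. 22 FABRICANTE ST. ZONE
VIII LUISIANA LAGROS,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
M
10/14/1967
AMURAO, CALIXTO MANALO13
"""


def read_record(f):
    """Read one record using readline() only; return None at end of file."""
    address_lines = []
    while True:
        line = f.readline()
        if line == '':          # readline() returns '' only at EOF
            return None
        line = line.strip()
        if line in ('M', 'F'):
            sex = line
            break
        if line:
            address_lines.append(line.lstrip('. '))

    dob = f.readline().strip()
    name = f.readline().strip()

    m = re.match(r'^(\D+?)\s*\d+$', name)
    if m:
        # Record number is glued to the name, so there is no ID line.
        name, rec_id = m.group(1), None
    else:
        rec_id = f.readline().strip()
        f.readline()            # skip the record number line
    return (name, rec_id, ' '.join(address_lines), sex, dob)


f = io.StringIO(SAMPLE)        # stand-in for open('data.txt')
records = []
while True:
    rec = read_record(f)
    if rec is None:
        break
    records.append(rec)
    print(rec)
```

Joining the address lines with a space avoids running the last word of one line into the first word of the next.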
+1  A: 

It may be overkill, but the leading-edge machine learning algorithms for this type of problem are based on conditional random fields (CRFs). See, for example, the paper "Accurate Information Extraction from Research Papers using Conditional Random Fields."

There is software out there that makes training these models relatively easy. See MALLET or CRF++.

Tristan
+4  A: 

Here is a first stab at a pyparsing solution (easy-to-copy code at the pyparsing pastebin). The interleaved comments walk through the separate parts.

data = """\
. 55 MORILLO ZONE VIII,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
F
01/16/1952
ALOMO, TERESITA CABALLES
3412-00000-A1652TCA2
12
. 22 FABRICANTE ST. ZONE
VIII LUISIANA LAGROS,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
M
10/14/1967
AMURAO, CALIXTO MANALO13
"""

from pyparsing import LineEnd, oneOf, Word, nums, Combine, restOfLine, \
    alphanums, Suppress, empty, originalTextFor, OneOrMore, alphas, \
    Group, ZeroOrMore

NL = LineEnd().suppress()
gender = oneOf("M F")
integer = Word(nums)
date = Combine(integer + '/' + integer + '/' + integer)

# define the simple line definitions
gender_line = gender("sex") + NL
dob_line = date("DOB") + NL
name_line = restOfLine("name") + NL
id_line = Word(alphanums+"-")("ID") + NL
recnum_line = integer("recnum") + NL

# define forms of address lines
first_addr_line = Suppress('.') + empty + restOfLine + NL
# a subsequent address line is any line that is not a gender definition
subsq_addr_line = ~(gender_line) + restOfLine + NL

# a line with a name and a recnum combined, if there is no ID
name_recnum_line = originalTextFor(OneOrMore(Word(alphas+',')))("name") + \
    integer("recnum") + NL

# defining the form of an overall record, either with or without an ID
record = Group((first_addr_line + ZeroOrMore(subsq_addr_line))("address") + 
    gender_line + 
    dob_line +
    ((name_line +
        id_line + 
        recnum_line) |
      name_recnum_line))

# parse data
records = OneOrMore(record).parseString(data)

# output the desired results (note that address is actually a list of lines)
for rec in records:
    if rec.ID:
        print("%(name)s, %(ID)s, %(address)s, %(sex)s, %(DOB)s" % rec)
    else:
        print("%(name)s, , %(address)s, %(sex)s, %(DOB)s" % rec)
print()

# how to access the individual fields of the parsed record
for rec in records:
    print(rec.dump())
    print(rec.name, 'is', rec.sex)
    print()

Prints:

ALOMO, TERESITA CABALLES, 3412-00000-A1652TCA2, ['55 MORILLO ZONE VIII,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS'], F, 01/16/1952
AMURAO, CALIXTO MANALO, , ['22 FABRICANTE ST. ZONE', 'VIII LUISIANA LAGROS,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS'], M, 10/14/1967

['55 MORILLO ZONE VIII,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS', 'F', '01/16/1952', 'ALOMO, TERESITA CABALLES', '3412-00000-A1652TCA2', '12']
- DOB: 01/16/1952
- ID: 3412-00000-A1652TCA2
- address: ['55 MORILLO ZONE VIII,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS']
- name: ALOMO, TERESITA CABALLES
- recnum: 12
- sex: F
ALOMO, TERESITA CABALLES is F

['22 FABRICANTE ST. ZONE', 'VIII LUISIANA LAGROS,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS', 'M', '10/14/1967', 'AMURAO, CALIXTO MANALO', '13']
- DOB: 10/14/1967
- address: ['22 FABRICANTE ST. ZONE', 'VIII LUISIANA LAGROS,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS']
- name: AMURAO, CALIXTO MANALO
- recnum: 13
- sex: M
AMURAO, CALIXTO MANALO is M
Paul McGuire
Hi Paul. Thanks for the solution! I have some lines that start with "# | *" that I want to ignore/skip. How do I do this?
Francis
Do you mean they start with the string "# | *"? Or that they start with any one of these characters? If the first, define a comment as comment = "# | *" + restOfLine + NL; if the second define a comment as comment = oneOf("# | *") + restOfLine + NL. Then do: record.ignore(comment) - easy-peasy!
Paul McGuire