I have a comma delimited text file. The 5th field on each line contains the name and address information. The name is separated from the street information by a '¬' character. The same character also separates the city|state|zip. A sample field would be:
"¬BOL¬MICKEY M MOUSE¬123 TOMORROW LANE¬ORLANDO FL 12345-6789¬¬¬¬EOL¬"

I need to separate the name into parts and the city|state|zip into parts. However, the name may or may not have a middle initial so:

m = l[4].split("¬")
firstName, mi, lastName = m[2].split()

won't work if there is no middle initial. Also, the name of the city may or may not have spaces so:

city, state, zipCode = m[4].split()

won't work if the city is 'San Antonio' or 'Rio de Janeiro' for instance.

Bottom line, how do I parse sections of a field where the section is not always in the same format?

+3  A: 

In your examples it seems that you can in both cases solve the problem by getting the 'first fields', the 'last fields' and 'everything in between':

m = line.split("¬")[2].split()
firstname = m[0]
surname = m[-1]
initials = m[1:-1] # Maybe just keep this as a list?

And:

m = line.split("¬")[4].split()
city = ' '.join(m[:-2])
state = m[-2]
zipCode = m[-1]

In general you can handle a single field containing spaces by taking the 'fixed' fields from both the start and the end; whatever is left over is the field that can contain spaces. As soon as you have two fields containing spaces in the same column, there's nothing you can do: it's ambiguously defined.
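The 'fixed fields at both ends' idea can be wrapped in a small helper. This is only a sketch: `split_flexible` and its `head`/`tail` parameters are names made up for illustration, and it assumes the fields are separated by plain whitespace and that at most one field per column may contain spaces.

```python
def split_flexible(text, head=0, tail=0):
    """Split on whitespace. The first `head` and last `tail` tokens are
    treated as fixed fields; everything between them is rejoined into
    the one field that is allowed to contain spaces."""
    tokens = text.split()
    if len(tokens) < head + tail:
        raise ValueError("too few tokens in %r" % text)
    end = len(tokens) - tail  # computed explicitly so tail=0 also works
    return tokens[:head], " ".join(tokens[head:end]), tokens[end:]


# Sample field from the question (hypothetical variable names).
field = "¬BOL¬MICKEY M MOUSE¬123 TOMORROW LANE¬ORLANDO FL 12345-6789¬¬¬¬EOL¬"
parts = field.split("¬")

# City may contain spaces; state and zip are the two fixed trailing fields.
_, city, (state, zip_code) = split_flexible(parts[4], tail=2)

# First name and surname are fixed; whatever is between them is the middle part.
(first_name,), middle, (last_name,) = split_flexible(parts[2], head=1, tail=1)
```

When the flexible field is absent (for example a name with no middle initial), the middle value simply comes back as an empty string, so the same call covers both cases.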

With the data format you have, you may have some problems if there are people with first or last names containing spaces, such as Robert Van de Graaff. This can be solved if you have an initial, by looking for words containing only one letter (as in Robert J. Van de Graaff) and using those to define where the first and last names start and end. But in general you may have problems.

Also there's an internationalization issue hidden here: not everyone writes their 'first name' first - sometimes they write their family name first.

Mark Byers
I get an AttributeError on the "split split" line: AttributeError: 'list' object has no attribute 'split'. I'm using ActiveState's ActivePython 2.6 under Windows XP.
Count Boxer
Sorry, had the square brackets in the wrong place.
Mark Byers
You are correct. In testing your suggestion, some of the last names turned out to have spaces (e.g. Ronald H Mc Donald). I am going to write those cases, where the middle initial comes back as a list like ['H', 'Mc'], to an error file to be handled differently.
Count Boxer
@CountBoxer: The Mc should almost certainly be part of the surname in this case. It looks like someone entered it in incorrectly. That happens a lot in real data. There's nothing wrong with two middle initials though - I have two middle names, for example. You could search for all words containing one letter and say those are initials, anything before is the first name, anything after is the last name. If the first name or last name is missing, that's an error. If there are no initials, you'll just have to make a best guess.
Mark Byers
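The initials-based rule from the comment above could look roughly like this. It is a sketch, not a definitive parser: `parse_name` is a hypothetical helper, it assumes an initial is a single letter optionally followed by a period, and the fallback for names without initials (first word is the first name, the rest is the surname) is just one possible best guess.

```python
def parse_name(full_name):
    """Return (first, middle, last). Single-letter tokens are treated as
    initials; words before them form the first name, words after them
    the last name. Raises ValueError if either name is missing."""
    tokens = full_name.split()
    if not tokens:
        raise ValueError("empty name")
    # Positions of tokens that look like initials, e.g. 'J' or 'J.'
    idx = [i for i, t in enumerate(tokens)
           if len(t.rstrip(".")) == 1 and t[0].isalpha()]
    if idx:
        first = " ".join(tokens[:idx[0]])
        middle = [t.rstrip(".") for t in tokens[idx[0]:idx[-1] + 1]]
        last = " ".join(tokens[idx[-1] + 1:])
    else:
        # Assumption: with no initials, guess that only the first word
        # is the first name and everything else is the surname.
        first, middle, last = tokens[0], [], " ".join(tokens[1:])
    if not first or not last:
        raise ValueError("missing first or last name in %r" % full_name)
    return first, middle, last
```

With this rule, "Ronald H Mc Donald" comes back with "Mc Donald" as the surname instead of 'Mc' ending up among the initials, and names that raise ValueError can be written to the error file for manual handling.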
A: 

Basically what Anon suggests, you can implement it like this:

cityInfo = m[4].split()
city, state, zipCode = ' '.join(cityInfo[:-2]), cityInfo[-2], cityInfo[-1]
Wim
If I split on whitespace and test the length of the array:

name = m[2].split()
if len(name) == 2:
    firstName, lastName = m[2].split()
    mi = ""
elif len(name) == 3:
    firstName, mi, lastName = m[2].split()
else:
    print "Error in name: %s" % (m[2])
    firstName, mi, lastName = "", "", ""

But this is impractical for the city|state|zip data... and ugly code.
Count Boxer