I was originally going to post a pyparsing example using the Each class (which picks out expressions that can be in any order), but then I saw that there was intermixed garbage, so searching through your string using searchString seemed a better fit. This intrigued me because searchString returns a sequence of ParseResults, one for each match (including any corresponding named results). So I thought, "What if I combine the returned ParseResults using sum - what a hack!", er, "How novel!" So here's a never-before-seen pyparsing hack:
from pyparsing import *
# define the separate expressions to be matched, with results names
dob_ref = "DOB" + Regex(r"\d{2}-\d{2}-\d{4}")("dob")
id_ref = "ID" + Word(alphanums,exact=12)("id")
info_ref = "-" + restOfLine("info")
# create an overall expression
person_data = dob_ref | id_ref | info_ref
for test in (samplestr1, samplestr2, samplestr3, samplestr4):
    # retrieve a list of separate matches
    separate_results = person_data.searchString(test)
    # combine the results using sum
    # (NO ONE HAS EVER DONE THIS BEFORE!)
    person = sum(separate_results, ParseResults([]))
    # now we have an uber-ParseResults object!
    print(person.id)
    print(person.dump())
    print()
Giving this output:
PARI12345678
['DOB', '10-10-2010', 'ID', 'PARI12345678']
- dob: 10-10-2010
- id: PARI12345678
PARI12345678
['ID', 'PARI12345678', 'DOB', '10-10-2010']
- dob: 10-10-2010
- id: PARI12345678
['DOB', '10-10-2010']
- dob: 10-10-2010
PARI12345678
['ID', 'PARI12345678', '-', ' I am cool']
- id: PARI12345678
- info: I am cool
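The second argument to sum in the code above matters: sum starts from 0 by default, so adding up objects that are not numbers needs an explicit empty start value. Here is a minimal sketch of that mechanic using a hypothetical Result class as a stand-in. It is not how pyparsing implements ParseResults, just an illustration of merging tokens and named values with +.

```python
class Result:
    """Hypothetical stand-in for a parse result: a token list plus named values."""

    def __init__(self, tokens=(), names=None):
        self.tokens = list(tokens)
        self.names = dict(names or {})

    def __add__(self, other):
        # Concatenate the tokens and merge the named values; later names win.
        merged = dict(self.names)
        merged.update(other.names)
        return Result(self.tokens + other.tokens, merged)


# Made-up matches, shaped like the separate results in the example above.
matches = [
    Result(["DOB", "10-10-2010"], {"dob": "10-10-2010"}),
    Result(["ID", "PARI12345678"], {"id": "PARI12345678"}),
]

# sum() would begin with 0 + matches[0], which raises TypeError for this class,
# so an empty Result is passed as the start value.
combined = sum(matches, Result())
print(combined.tokens)
print(combined.names)
```

Leaving out the start value here raises a TypeError, which is why the pyparsing version passes ParseResults([]).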
But I do also speak regex. Here is a similar approach using re's.
import re
# define each individual re, with group names
dobRE = r"DOB +(?P<dob>\d{2}-\d{2}-\d{4})"
idRE = r"ID +(?P<id>[A-Z0-9]{12})"
infoRE = r"- (?P<info>.*)"
# one re to rule them all
person_dataRE = re.compile('|'.join([dobRE, idRE, infoRE]))
# using findall with person_dataRE will return a list of 3-tuples, one per
# match, so let's create a tuple-merger
merge = lambda a, b: tuple(aa or bb for aa, bb in zip(a, b))
# let's create a Person class to collect the different data bits
# (or, if you are running Py2.6 or later, use a namedtuple)
class Person:
    def __init__(self, *args):
        self.dob, self.id, self.info = args
    def __str__(self):
        return "- id: %s\n- dob: %s\n- info: %s" % (self.id, self.dob, self.info)
for test in (samplestr1, samplestr2, samplestr3, samplestr4):
    # could have used reduce here, but let's err on the side of explicitness
    persontuple = ('', '', '')
    for data in person_dataRE.findall(test):
        persontuple = merge(persontuple, data)
    # make a person
    person = Person(*persontuple)
    # print out the collected results
    print(person.id)
    print(person)
    print()
With this output:
PARI12345678
- id: PARI12345678
- dob: 10-10-2010
- info:
PARI12345678
- id: PARI12345678
- dob: 10-10-2010
- info:
- id:
- dob: 10-10-2010
- info:
PARI12345678
- id: PARI12345678
- dob:
- info: I am cool
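The comments in the regex version mention two alternatives that were not shown: a namedtuple in place of the Person class, and reduce in place of the explicit merge loop. Here is a rough sketch of both together, written for Python 3. The sample strings and the parse_person helper are made up for illustration, since the original sample strings are not reproduced here.

```python
import re
from collections import namedtuple
from functools import reduce

# Field order matches the order of the named groups in the pattern.
Person = namedtuple("Person", ["dob", "id", "info"])

person_data_re = re.compile(
    r"DOB +(?P<dob>\d{2}-\d{2}-\d{4})"
    r"|ID +(?P<id>[A-Z0-9]{12})"
    r"|- (?P<info>.*)"
)


def merge(a, b):
    # Keep whichever side has a non-empty value in each position.
    return tuple(aa or bb for aa, bb in zip(a, b))


def parse_person(text):
    # findall returns one (dob, id, info) tuple per match, with '' for the
    # groups that did not take part in that match; reduce folds them together.
    matches = person_data_re.findall(text)
    return Person(*reduce(merge, matches, ("", "", "")))


# Hypothetical input with garbage mixed in.
sample = "garbage DOB 10-10-2010 more garbage ID PARI12345678"
person = parse_person(sample)
print(person.id)
print(person)
```

The namedtuple gives you attribute access and a readable repr for free, at the cost of the custom __str__ layout used in the class version.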