I have data in a CSV file. One of the columns lists a person's name, and all the rows that follow in that column provide descriptive attributes about that person until the next person's name shows up. I can tell whether a row holds a name or an attribute from the LTYPE column: N indicates that the NAME value in that row is actually a name, and A indicates that the value in the NAME column is an attribute. The attributes are coded, and I have 600K lines of data. The data is grouped, and the beginning of each grouping is indicated by RID resetting to 1. Here is a sample:
{'LTYPE': 'N', 'RID': '1', 'NAME': 'Jason Smith'}
{'LTYPE': 'A', 'RID': '2', 'NAME': 'DA'}
{'LTYPE': 'A', 'RID': '3', 'NAME': 'B'}
{'LTYPE': 'N', 'RID': '4', 'NAME': 'John Smith'}
{'LTYPE': 'A', 'RID': '5', 'NAME': 'BC'}
{'LTYPE': 'A', 'RID': '6', 'NAME': 'CB'}
{'LTYPE': 'A', 'RID': '7', 'NAME': 'DB'}
{'LTYPE': 'A', 'RID': '8', 'NAME': 'DA'}
{'LTYPE': 'N', 'RID': '9', 'NAME': 'Robert Smith'}
{'LTYPE': 'A', 'RID': '10', 'NAME': 'BC'}
{'LTYPE': 'A', 'RID': '11', 'NAME': 'DB'}
{'LTYPE': 'A', 'RID': '12', 'NAME': 'CB'}
{'LTYPE': 'A', 'RID': '13', 'NAME': 'RB'}
{'LTYPE': 'A', 'RID': '14', 'NAME': 'VC'}
{'LTYPE': 'N', 'RID': '15', 'NAME': 'Harvey Smith'}
{'LTYPE': 'A', 'RID': '16', 'NAME': 'SA'}
{'LTYPE': 'A', 'RID': '17', 'NAME': 'AS'}
{'LTYPE': 'N', 'RID': '18', 'NAME': 'Lukas Smith'}
{'LTYPE': 'A', 'RID': '19', 'NAME': 'BC'}
{'LTYPE': 'A', 'RID': '20', 'NAME': 'AS'}
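For reference, here is a minimal sketch of how rows like these might end up in the list that the code below calls thedata. It assumes the file has a header row with the columns LTYPE, RID and NAME, and it is written for Python 3. The inline text stands in for the real file, and `load_rows` is a hypothetical helper name, not something from my actual script.

```python
import csv
import io

# Stand-in for the real CSV file; the header names are taken from the sample above.
SAMPLE_CSV = """LTYPE,RID,NAME
N,1,Jason Smith
A,2,DA
A,3,B
N,4,John Smith
A,5,BC
"""

def load_rows(handle):
    """Read every CSV row into a plain dict keyed by the header names."""
    return [dict(row) for row in csv.DictReader(handle)]

# With a real file this would be an open file handle instead of StringIO.
thedata = load_rows(io.StringIO(SAMPLE_CSV))
print(thedata[0])
```

All values come back as strings, which is why RID is converted with int() further down.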
I want to create the following:
{'PERSON_ATTRIBUTES': 'DA B ', 'LTYPE': 'N', 'RID': '1', 'PERSON_NAME': 'Jason Smith', 'NAME': 'Jason Smith'}
{'PERSON_ATTRIBUTES': 'DA B ', 'LTYPE': 'A', 'RID': '2', 'PERSON_NAME': 'Jason Smith', 'NAME': 'DA'}
{'PERSON_ATTRIBUTES': 'DA B ', 'LTYPE': 'A', 'RID': '3', 'PERSON_NAME': 'Jason Smith', 'NAME': 'B'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'N', 'RID': '4', 'PERSON_NAME': 'John Smith', 'NAME': 'John Smith'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'A', 'RID': '5', 'PERSON_NAME': 'John Smith', 'NAME': 'BC'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'A', 'RID': '6', 'PERSON_NAME': 'John Smith', 'NAME': 'CB'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'A', 'RID': '7', 'PERSON_NAME': 'John Smith', 'NAME': 'DB'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'A', 'RID': '8', 'PERSON_NAME': 'John Smith', 'NAME': 'DA'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'N', 'RID': '9', 'PERSON_NAME': 'Robert Smith', 'NAME': 'Robert Smith'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '10', 'PERSON_NAME': 'Robert Smith', 'NAME': 'BC'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '11', 'PERSON_NAME': 'Robert Smith', 'NAME': 'DB'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '12', 'PERSON_NAME': 'Robert Smith', 'NAME': 'CB'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '13', 'PERSON_NAME': 'Robert Smith', 'NAME': 'RB'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '14', 'PERSON_NAME': 'Robert Smith', 'NAME': 'VC'}
{'PERSON_ATTRIBUTES': 'SA AS ', 'LTYPE': 'N', 'RID': '15', 'PERSON_NAME': 'Harvey Smith', 'NAME': 'Harvey Smith'}
{'PERSON_ATTRIBUTES': 'SA AS ', 'LTYPE': 'A', 'RID': '16', 'PERSON_NAME': 'Harvey Smith', 'NAME': 'SA'}
{'PERSON_ATTRIBUTES': 'SA AS ', 'LTYPE': 'A', 'RID': '17', 'PERSON_NAME': 'Harvey Smith', 'NAME': 'AS'}
{'PERSON_ATTRIBUTES': 'BC AS ', 'LTYPE': 'N', 'RID': '18', 'PERSON_NAME': 'Lukas Smith', 'NAME': 'Lukas Smith'}
{'PERSON_ATTRIBUTES': 'BC AS ', 'LTYPE': 'A', 'RID': '19', 'PERSON_NAME': 'Lukas Smith', 'NAME': 'BC'}
{'PERSON_ATTRIBUTES': 'BC AS ', 'LTYPE': 'A', 'RID': '20', 'PERSON_NAME': 'Lukas Smith', 'NAME': 'AS'}
I started off by getting the index positions of the rows for each LTYPE:
nameIndex=[]
attributeIndex=[]
for line in thedata:
    if line['LTYPE']=='N':
        nameIndex.append(int(line["RID"])-1)
    if line['LTYPE']=='A':
        attributeIndex.append(int(line["RID"])-1)
So I have the list index of each row classified as a name in one list, and the list index of each row classified as an attribute in another. It is then easy to attach the name to each observation as follows:
for counter, row in enumerate(thedata):
    if counter in nameIndex:
        row['PERSON_NAME']=row['NAME']
        person_NAME=row['NAME']
    if counter not in nameIndex:
        row['PERSON_NAME']=person_NAME
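The same assignment can probably be sketched without the index lists at all, by testing LTYPE directly and carrying the most recent name forward. That would also avoid `counter in nameIndex`, which scans a list on every row and may get slow at 600K lines. A minimal sketch, assuming the first row is always a name row; `attach_names` is a hypothetical helper name:

```python
def attach_names(rows):
    """Add PERSON_NAME to every row, carrying the last seen name forward."""
    person_name = None  # assumption: the first row is a name row, so this gets overwritten
    for row in rows:
        if row['LTYPE'] == 'N':
            person_name = row['NAME']
        row['PERSON_NAME'] = person_name
    return rows

sample = [
    {'LTYPE': 'N', 'RID': '1', 'NAME': 'Jason Smith'},
    {'LTYPE': 'A', 'RID': '2', 'NAME': 'DA'},
    {'LTYPE': 'N', 'RID': '4', 'NAME': 'John Smith'},
    {'LTYPE': 'A', 'RID': '5', 'NAME': 'BC'},
]
attach_names(sample)
```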
I am struggling to determine and assign the list of attributes to each person.
First I need to combine the attributes that belong together, so I did this:
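newAttribute=[]
for counter, row in enumerate(thedata):
    if counter in attributeIndex:
        # attribute row: extend the string for the current person
        tempAttribute=tempAttribute+' '+row['NAME']
    if counter not in attributeIndex:
        if counter==0:
            # first name row: nothing collected yet
            tempAttribute=""
        if counter!=0:
            # a new name row closes the previous person's group
            newAttribute.append(tempAttribute.lstrip())
            tempAttribute=""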
One problem with my approach is that I still have to add the last group to the newAttribute list, since the loop finishes before it is added. So to get the complete list of grouped attributes I have to run:
newAttribute.append(tempAttribute.lstrip())
But even then I can't seem to find a clean way to add the attributes; I have to do it in two steps. First, I create a dictionary with the nameIndex positions as the keys and the attributes as the values:
tempDict={}
for each in range(len(nameIndex)):
    tempDict[nameIndex[each]]=newAttribute[each]
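The index-based loop above could likely be written more compactly by pairing the two lists with zip. A small sketch, with example values (taken from the sample data) standing in for nameIndex and newAttribute:

```python
# Example values standing in for the lists built earlier.
nameIndex = [0, 3, 8]
newAttribute = ['DA B', 'BC CB DB DA', 'BC DB CB RB VC']

# zip pairs each name position with the attribute string collected for it.
tempDict = dict(zip(nameIndex, newAttribute))
```

This relies on both lists being in the same order, which they are here because both were built in a single pass over the data.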
Then I cycle through the list once, putting the attributes on the name line:
for counter,row in enumerate(thedata):
    if counter in tempDict:
        thedata[counter]['TA']=tempDict[counter]
and then I go through it again, checking whether the key 'TA' exists and using that to set the PERSON_ATTRIBUTES key:
for each in thedata:
    if 'TA' in each:
        each['PERSON_ATTRIBUTES']=each['TA']
        holdAttribute=each['TA']
    else:
        each['PERSON_ATTRIBUTES']=holdAttribute
There has got to be a cleaner way to think about this, so I was wondering if anyone could point me toward some functions I could read about that would let me clean up this code. I know I still have to drop the 'TA' key, but I figured I have taken up enough space.
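To make the goal concrete, here is a rough sketch of the shape I imagine a cleaner version might take: collect the rows into one list per person first, then stamp the name and the joined attributes onto every row of each group. It assumes the data starts with a name row, and `annotate` is a hypothetical name. Note that it joins the attributes without the trailing space shown in my desired output above.

```python
def annotate(rows):
    """Add PERSON_NAME and PERSON_ATTRIBUTES to every row, in place."""
    groups = []  # each entry is the list of rows belonging to one person
    for row in rows:
        if row['LTYPE'] == 'N':
            groups.append([])   # a name row opens a new group
        groups[-1].append(row)  # assumption: the data starts with a name row

    for group in groups:
        name = group[0]['NAME']
        # Every row after the first one in a group is an attribute row.
        attributes = ' '.join(r['NAME'] for r in group[1:])
        for row in group:
            row['PERSON_NAME'] = name
            row['PERSON_ATTRIBUTES'] = attributes
    return rows

rows = [
    {'LTYPE': 'N', 'RID': '1', 'NAME': 'Jason Smith'},
    {'LTYPE': 'A', 'RID': '2', 'NAME': 'DA'},
    {'LTYPE': 'A', 'RID': '3', 'NAME': 'B'},
    {'LTYPE': 'N', 'RID': '4', 'NAME': 'John Smith'},
    {'LTYPE': 'A', 'RID': '5', 'NAME': 'BC'},
]
annotate(rows)
```

Because the last group is already in the list when the first loop ends, there is no extra append after the loop and no temporary 'TA' key to drop.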