I have data in a CSV file. One of the columns lists a person's name, and the rows that follow in that column provide descriptive attributes about that person until the next person's name shows up. I can tell whether a row holds a name or an attribute by the LTYPE column: N indicates that the NAME value in that row is actually a name, and A indicates that the value in the NAME column is an attribute. The attributes are coded, and I have 600K lines of data. Here is a sample. The data is grouped, and the beginning of each grouping is indicated by RID resetting to 1.

{'LTYPE': 'N', 'RID': '1', 'NAME': 'Jason Smith'}
{'LTYPE': 'A', 'RID': '2', 'NAME': 'DA'}
{'LTYPE': 'A', 'RID': '3', 'NAME': 'B'}
{'LTYPE': 'N', 'RID': '4', 'NAME': 'John Smith'}
{'LTYPE': 'A', 'RID': '5', 'NAME': 'BC'}
{'LTYPE': 'A', 'RID': '6', 'NAME': 'CB'}
{'LTYPE': 'A', 'RID': '7', 'NAME': 'DB'}
{'LTYPE': 'A', 'RID': '8', 'NAME': 'DA'}
{'LTYPE': 'N', 'RID': '9', 'NAME': 'Robert Smith'}
{'LTYPE': 'A', 'RID': '10', 'NAME': 'BC'}
{'LTYPE': 'A', 'RID': '11', 'NAME': 'DB'}
{'LTYPE': 'A', 'RID': '12', 'NAME': 'CB'}
{'LTYPE': 'A', 'RID': '13', 'NAME': 'RB'}
{'LTYPE': 'A', 'RID': '14', 'NAME': 'VC'}
{'LTYPE': 'N', 'RID': '15', 'NAME': 'Harvey Smith'}
{'LTYPE': 'A', 'RID': '16', 'NAME': 'SA'}
{'LTYPE': 'A', 'RID': '17', 'NAME': 'AS'}
{'LTYPE': 'N', 'RID': '18', 'NAME': 'Lukas Smith'}
{'LTYPE': 'A', 'RID': '19', 'NAME': 'BC'}
{'LTYPE': 'A', 'RID': '20', 'NAME': 'AS'}

I want to create the following:

{'PERSON_ATTRIBUTES': 'DA B ', 'LTYPE': 'N', 'RID': '1', 'PERSON_NAME': 'Jason Smith', 'NAME': 'Jason Smith'}
{'PERSON_ATTRIBUTES': 'DA B ', 'LTYPE': 'A', 'RID': '2', 'PERSON_NAME': 'Jason Smith', 'NAME': 'DA'}
{'PERSON_ATTRIBUTES': 'DA B ', 'LTYPE': 'A', 'RID': '3', 'PERSON_NAME': 'Jason Smith', 'NAME': 'B'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'N', 'RID': '4', 'PERSON_NAME': 'John Smith', 'NAME': 'John Smith'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'A', 'RID': '5', 'PERSON_NAME': 'John Smith', 'NAME': 'BC'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'A', 'RID': '6', 'PERSON_NAME': 'John Smith', 'NAME': 'CB'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'A', 'RID': '7', 'PERSON_NAME': 'John Smith', 'NAME': 'DB'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'A', 'RID': '8', 'PERSON_NAME': 'John Smith', 'NAME': 'DA'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'N', 'RID': '9', 'PERSON_NAME': 'Robert Smith', 'NAME': 'Robert Smith'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '10', 'PERSON_NAME': 'Robert Smith', 'NAME': 'BC'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '11', 'PERSON_NAME': 'Robert Smith', 'NAME': 'DB'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '12', 'PERSON_NAME': 'Robert Smith', 'NAME': 'CB'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '13', 'PERSON_NAME': 'Robert Smith', 'NAME': 'RB'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '14', 'PERSON_NAME': 'Robert Smith', 'NAME': 'VC'}
{'PERSON_ATTRIBUTES': 'SA AS ', 'LTYPE': 'N', 'RID': '15', 'PERSON_NAME': 'Harvey Smith', 'NAME': 'Harvey Smith'}
{'PERSON_ATTRIBUTES': 'SA AS ', 'LTYPE': 'A', 'RID': '16', 'PERSON_NAME': 'Harvey Smith', 'NAME': 'SA'}
{'PERSON_ATTRIBUTES': 'SA AS ', 'LTYPE': 'A', 'RID': '17', 'PERSON_NAME': 'Harvey Smith', 'NAME': 'AS'}
{'PERSON_ATTRIBUTES': 'BC AS ', 'LTYPE': 'N', 'RID': '18', 'PERSON_NAME': 'Lukas Smith', 'NAME': 'Lukas Smith'}
{'PERSON_ATTRIBUTES': 'BC AS ', 'LTYPE': 'A', 'RID': '19', 'PERSON_NAME': 'Lukas Smith', 'NAME': 'BC'}
{'PERSON_ATTRIBUTES': 'BC AS ', 'LTYPE': 'A', 'RID': '20', 'PERSON_NAME': 'Lukas Smith', 'NAME': 'AS'}

I started off by getting the index positions for each LTYPE:

nameIndex=[]
attributeIndex=[]
for line in thedata:
    if line['LTYPE']=='N':
        nameIndex.append(int(line["RID"])-1)
    if line['LTYPE']=='A':
        attributeIndex.append(int(line["RID"])-1)

So I have the list index of each row classified as a name in one list, and the list index of each row classified as an attribute in another. It is then easy to attach the name to each observation as follows:

for counter, row in enumerate(thedata):
    if counter in nameIndex:
        row['PERSON_NAME']=row['NAME']
        person_NAME=row['NAME']
    if counter not in nameIndex:
        row['PERSON_NAME']=person_NAME

I am struggling to determine and assign the list of attributes to each person.

First I need to combine the attributes that belong together so I did this:

newAttribute=[]
for counter, row in enumerate(thedata):
    if counter in attributeIndex:
        tempAttribute=tempAttribute+' '+row['NAME']

    if counter not in attributeIndex:
        if counter==0:
            tempAttribute=""
        if counter!=0:
            newAttribute.append(tempAttribute.lstrip())
            tempAttribute=""

One problem with my approach is that I still have to add the last group to the newAttribute list, since the loop finishes before it is added. So to get the full list of grouped attributes I have to run

newAttribute.append(tempAttribute.lstrip())

But even then I can't seem to find a clean way to add the attributes; I have to do it in two steps. First, I create a dictionary with the nameIndex positions as the keys and the attributes as the values:

tempDict={}
for each in range(len(nameIndex)):
    tempDict[nameIndex[each]]=newAttribute[each]

I cycle through the list once, putting the attributes on the name line:

for counter,row in enumerate(thedata):
    if counter in tempDict:
        thedata[counter]['TA']=tempDict[counter]

and then I go through it again, checking whether the key 'TA' exists and using that to set the PERSON_ATTRIBUTES key:

for each in thedata:
    if 'TA' in each:
        each['PERSON_ATTRIBUTES']=each['TA']
        holdAttribute=each['TA']
    else:
        each['PERSON_ATTRIBUTES']=holdAttribute

There has got to be a cleaner way to think about this, so I was wondering if anyone could point me toward some functions I could read about that would let me clean up this code. I know I still have to drop the 'TA' key, but I figured that I have taken enough space.

+1  A: 

I would split this into two tasks.

First, divide thedata into groups, each consisting of an LTYPE=N row and the LTYPE=A rows that follow it.

def group_name_and_attributes(thedata):
    group = []
    for row in thedata:
        if row['LTYPE'] == 'N':
            if group:
                yield group
            group = [row]
        else:
            group.append(row)
    if group:
        yield group

Next, take each group in isolation and collect its combined attributes; it's then easy to add the joined attributes to each row as desired.

def join_person_attributes(thedata):
    for group in group_name_and_attributes(thedata):
        attributes = ' '.join(row['NAME'] for row in group if row['LTYPE'] == 'A')
        for row in group:
            new_row = row.copy()
            new_row['PERSON_ATTRIBUTES'] = attributes
            yield new_row

new_data = list(join_person_attributes(thedata))

Of course you could make this modify the rows in-place, or only return one row per group, or ...
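
As a rough sketch of the in-place variant mentioned above: the function below mutates each row instead of copying it, and also records the person's name on every row. The name annotate_in_place and the PERSON_NAME handling are illustrative assumptions, not part of the code above, and the sketch assumes the data starts with an LTYPE=N row.

```python
def annotate_in_place(thedata):
    """Add PERSON_NAME and PERSON_ATTRIBUTES to every row, mutating the rows.

    Hypothetical helper; assumes the first row has LTYPE == 'N'.
    """
    group = []

    def flush(rows):
        # rows[0] is the name row; the rest are its attribute rows.
        if not rows:
            return
        name = rows[0]['NAME']
        attributes = ' '.join(r['NAME'] for r in rows[1:])
        for r in rows:
            r['PERSON_NAME'] = name
            r['PERSON_ATTRIBUTES'] = attributes

    for row in thedata:
        if row['LTYPE'] == 'N':
            flush(group)      # a new name row closes the previous group
            group = [row]
        else:
            group.append(row)
    flush(group)              # the last group is not followed by another N row


sample = [
    {'LTYPE': 'N', 'RID': '1', 'NAME': 'Jason Smith'},
    {'LTYPE': 'A', 'RID': '2', 'NAME': 'DA'},
    {'LTYPE': 'A', 'RID': '3', 'NAME': 'B'},
    {'LTYPE': 'N', 'RID': '4', 'NAME': 'John Smith'},
    {'LTYPE': 'A', 'RID': '5', 'NAME': 'BC'},
]
annotate_in_place(sample)
```

Because the rows are modified where they sit, nothing is returned and no second list is built, which may matter with 600K lines. Note that ' '.join leaves no trailing space, unlike the strings in the question's desired output.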

ephemient
I appreciate your help a lot, and I learned quite a bit from playing with the code you provided. I voted your answer up, but I marked Alex's as the accepted one because I had to add two lines to yours to get what I was looking for. I added pname = ' '.join(row['NAME'] for row in group if row['LTYPE'] == 'N') after the attributes = assignment in the join_person_attributes function, and new_row['PERSON_NAME'] = pname after the new_row assignment statement. I really do appreciate your answer. Thanks
PyNEwbie
+2  A: 

I suggest a different, index-free approach based on itertools.groupby:

import itertools, operator

data = [
{'LTYPE': 'N', 'RID': '1', 'NAME': 'Jason Smith'},
{'LTYPE': 'A', 'RID': '2', 'NAME': 'DA'},
{'LTYPE': 'A', 'RID': '3', 'NAME': 'B'},
{'LTYPE': 'N', 'RID': '4', 'NAME': 'John Smith'},
{'LTYPE': 'A', 'RID': '5', 'NAME': 'BC'},
{'LTYPE': 'A', 'RID': '6', 'NAME': 'CB'},
{'LTYPE': 'A', 'RID': '7', 'NAME': 'DB'},
{'LTYPE': 'A', 'RID': '8', 'NAME': 'DA'},
]

for k, g in itertools.groupby(data, operator.itemgetter('LTYPE')):
  if k=='N':
    person_name_record = next(g)
  else:
    attribute_records = list(g)
    person_attributes = ' '.join(r['NAME'] for r in attribute_records)
    addfields = dict(PERSON_ATTRIBUTES=person_attributes,
                     PERSON_NAME=person_name_record['NAME'])
    person_name_record.update(addfields)
    for r in attribute_records: r.update(addfields)

for r in data: print(r)

This prints your desired results for the first couple people (and each person is treated separately, so it should work just the same for a few hundred thousand people;-).
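
One caveat worth keeping in mind: groupby merges consecutive rows that share a key, so a name row with no attribute rows after it ends up in the same group as the next name row, and a loop that takes only the first row of an N group would skip the second name. The sketch below is one possible way to cover that case, assuming such rows can occur in the data and that the data starts with an LTYPE=N row; the function name annotate is made up for illustration.

```python
import itertools
import operator


def annotate(data):
    """Add PERSON_NAME / PERSON_ATTRIBUTES to every row in place.

    Hypothetical variant that also handles name rows with no attribute rows.
    Assumes the data starts with an LTYPE == 'N' row.
    """
    pending = None  # most recent name row, still waiting for its attributes
    for key, group in itertools.groupby(data, operator.itemgetter('LTYPE')):
        group = list(group)
        if key == 'N':
            # Start every name row with empty attributes; only the last one
            # in the run can be followed by attribute rows.
            for r in group:
                r.update(PERSON_NAME=r['NAME'], PERSON_ATTRIBUTES='')
            pending = group[-1]
        else:
            fields = dict(PERSON_NAME=pending['NAME'],
                          PERSON_ATTRIBUTES=' '.join(r['NAME'] for r in group))
            pending.update(fields)
            for r in group:
                r.update(fields)


sample = [
    {'LTYPE': 'N', 'RID': '1', 'NAME': 'Jason Smith'},   # no attributes
    {'LTYPE': 'N', 'RID': '2', 'NAME': 'John Smith'},
    {'LTYPE': 'A', 'RID': '3', 'NAME': 'BC'},
    {'LTYPE': 'A', 'RID': '4', 'NAME': 'CB'},
    {'LTYPE': 'N', 'RID': '5', 'NAME': 'Robert Smith'},  # trailing name, no attributes
]
annotate(sample)
```

If every name in the real file is guaranteed to have at least one attribute, the extra handling is unnecessary and the shorter loop is enough.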

Alex Martelli
Thanks. I have been playing with your answer and learning a lot about how itertools works. I also learned from the other answer; I marked yours as accepted because I had to make a slight modification to the other answer to get what I needed.
PyNEwbie