I have been struggling with managing some data. I have turned it into a list of lists, where each sublist has a structure like the following:

<1x>begins
<2x>value-1
<3x>value-2
<4x>value-3
 some indeterminate number of other values
<1y>next observation begins
<2y>value-1
<3y>value-2
<4y>value-3
 some indeterminate number of other values

This continues an indeterminate number of times in each sublist.

EDIT: I need to get all the occurrences of <2, <3 and <4 separated out and grouped together. I am creating a new list of lists: [[<2x>value-1, <3x>value-2, <4x>value-3], [<2y>value-1, <3y>value-2, <4y>value-3]]

EDIT: All of the lines that follow <4x> and <4y> (and, for that matter, <4anyalpha>) have the same type of coding, and I don't know a priori how high the numbers can go. Just think of these as SGML tags that are not closed. I used numbers because my fingers were hurting from all the coding I have been doing today.

The solution I finally came up with is not very pretty:

listINeed = []
# Start with empty values so a stray <4 line cannot hit an undefined name.
var2 = var3 = var4 = ''
for sublist in biglist:
    for line in sublist:
        if '<2' in line:
            var2 = line
        if '<3' in line:
            var3 = line
        if '<4' in line:
            var4 = line
            # <4 is the last line wanted, so store the group and reset.
            listINeed.append([var2, var3, var4])
            var4 = var2 = var3 = ''

I have looked at ways to clean this up but have not been successful. This works fine; I just saw it as another opportunity to learn more about Python, because I would think this should be processable by a one-line function.

+1  A: 

You're off to a good start by noticing that your original solution may work but lacks elegance.

You should parse the string in a loop, starting a new list whenever an observation begins and appending each tagged line to it. Here's some sample code:

import re

s = """<1x>begins
<2x>value-1
<3x>value-2
<4x>value-3
 some indeterminate number of other values
<1y>next observation begins
<2y>value-1
<3y>value-2
<4y>value-3"""
# Any <1 + letter tag starts a new observation (<1x, <1y, and so on).
firstMatch = re.compile(r'^<1[a-zA-Z]')
numMatch = re.compile(r'^<(\d+)')
listIneed = []
templist = None
# splitlines() keeps each line whole; split() would also break on spaces.
for line in s.splitlines():
    if firstMatch.match(line):
        if templist is not None:
            listIneed.append(templist)
        templist = [line]
    elif numMatch.match(line) and templist is not None:
        # The tag number is available as numMatch.match(line).group(1)
        templist.append(line)
if templist is not None:
    listIneed.append(templist)

print(listIneed)
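
The same loop can be parameterised so that the set of tag numbers to keep is an argument rather than hard-coded. This is only a sketch, assuming every tag looks like `<` number, letters, `>` at the start of a line; `collect_tagged`, `wanted` and `start` are hypothetical names chosen for illustration, not part of the code above.

```python
import re

# Assumed tag shape: "<", digits, one or more letters, ">" at line start.
TAG = re.compile(r'^<(\d+)[a-zA-Z]+>')

def collect_tagged(lines, wanted=(2, 3, 4), start=1):
    """Group the wanted tag lines of each observation, in input order."""
    result = []
    current = None  # stays None until the first start tag is seen
    for line in lines:
        m = TAG.match(line)
        if m is None:
            continue  # untagged line, ignore it
        number = int(m.group(1))
        if number == start:
            # A start tag closes the previous observation, if any.
            if current is not None:
                result.append(current)
            current = []
        elif number in wanted and current is not None:
            current.append(line)
    if current is not None:
        result.append(current)
    return result

sample = [
    "<1x>begins", "<2x>value-1", "<3x>value-2", "<4x>value-3",
    " some indeterminate number of other values",
    "<1y>next observation begins", "<2y>value-1", "<3y>value-2", "<4y>value-3",
]
print(collect_tagged(sample))
```

Because membership in `wanted` is tested per line, the wanted lines do not have to arrive in numeric order, and asking for a fifth tag only means passing `wanted=(2, 3, 4, 5)`.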
RossFabricant
I appreciate your creativity, but I think my solution is cheaper to implement, though I am not absolutely sure. It took less than two seconds to run against about 750K lines.
PyNEwbie
If by "cheaper to implement" you mean your approach runs faster, then you are probably right. Your approach will also break if there are 5 variables instead of 4. If my solution doesn't need to work, I can make it run as fast as you want.
RossFabricant
Well, one problem with my solution is that if the variables are ever out of order, it won't work. But my solution will work fine with five or n variables; I just have to define them. I guess I was wrong: there is not a one-liner. I learned a lot from your code.
PyNEwbie
+1  A: 

If you want to pick out the second, third, and fourth elements of each sublist, this should work:

listINeed = [sublist[1:4] for sublist in biglist]
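
A quick check of what that slice returns, as a sketch on made-up data. It assumes each sublist holds exactly one observation with the `<1` line first; the values in `biglist` here are hypothetical.

```python
# Hypothetical data: one observation per sublist, <1 line always first.
biglist = [
    ["<1x>begins", "<2x>value-1", "<3x>value-2", "<4x>value-3", "<5x>other"],
    ["<1y>begins", "<2y>value-1", "<3y>value-2", "<4y>value-3"],
]

# Indexes 1, 2 and 3 are the second, third and fourth items.
listINeed = [sublist[1:4] for sublist in biglist]
print(listINeed)
```

If a sublist contains several observations one after another, as the question describes, the slice only picks up the first one, so position-based slicing fits only when the layout is fixed.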
David Zaslavsky
Well, I can't be sure which ones they are, and the entire thing I posted up there is the sublist, so I might need one or ten units that are only indicated by their names.
PyNEwbie
Then you need to be more specific in your question... I really can't understand what exactly it is you're trying to do.
David Zaslavsky
+1  A: 

itertools.groupby() can get you by.

import itertools
import operator

itertools.groupby(biglist, operator.itemgetter(2))
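
One way groupby could be applied to a single flat sublist is to label every line with a running count of the `<1` start tags seen so far and group on that label. This is a sketch under assumptions, not necessarily what the call above intends: it needs Python 3 for `itertools.accumulate`, it expects the sublist to be a list (it is iterated twice), and `group_observations`, `START` and `WANTED` are hypothetical names.

```python
import re
from itertools import accumulate, groupby

START = re.compile(r'<1[a-zA-Z]>')       # assumed start-of-observation tag
WANTED = re.compile(r'<[2-4][a-zA-Z]>')  # the lines to keep

def group_observations(sublist):
    # Running count of start tags: each line gets its observation's index.
    ids = accumulate(1 if START.match(line) else 0 for line in sublist)
    pairs = zip(ids, sublist)
    # groupby collects consecutive lines that share the same index.
    return [
        [line for _, line in group if WANTED.match(line)]
        for _, group in groupby(pairs, key=lambda pair: pair[0])
    ]

sublist = [
    "<1x>begins", "<2x>value-1", "<3x>value-2", "<4x>value-3",
    " some indeterminate number of other values",
    "<1y>next observation begins", "<2y>value-1", "<3y>value-2", "<4y>value-3",
]
print(group_observations(sublist))
```

Any lines that appear before the first `<1` tag end up in a group of their own with index 0, which may need filtering out.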
Ignacio Vazquez-Abrams
A: 

If I've understood your question correctly:

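import re

def getlines(ori):
    # Group 1 is the whole tagged line; group 2 is the tag number (1-4).
    matches = re.finditer(r'(<([1-4])[a-zA-Z]>.*)', ori)
    mainlist = []
    sublist = []
    for sr in matches:
        if int(sr.group(2)) == 1:
            # A <1 tag starts a new observation: store the previous one.
            if sublist:
                mainlist.append(sublist)
            sublist = []
        else:
            sublist.append(sr.group(1))
    # Store the final observation, which no later <1 tag closes.
    if sublist:
        mainlist.append(sublist)
    return mainlist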

...would do the job for you, if you felt like using regular expressions.

The version below breaks all of the data down into sublists (not just the <2, <3 and <4 lines of each grouping), which might be more useful depending on what else you need to do with the data. Because the <1 line itself is not stored, the <2, <3 and <4 lines are the first three items of each sublist, so a slice in the style of David's comprehension, [sublist[:3] for sublist in getlines(ori)], picks them out for the specific task above.

import re

def getlines(ori):
    # \d+ accepts any tag number; group 2 holds it, group 1 the whole line.
    matches = re.finditer(r'(<(\d+)[a-zA-Z]>.*)', ori)
    mainlist = []
    sublist = []
    for sr in matches:
        if int(sr.group(2)) == 1:
            print("1 found!")  # debug output
            if sublist:
                mainlist.append(sublist)
            sublist = []
        else:
            sublist.append(sr.group(1))
    # Store the final observation, which no later <1 tag closes.
    if sublist:
        mainlist.append(sublist)
    return mainlist
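
For something closer to the one-liner the question asks about, the text can be split on the `<1` tags and the wanted lines pulled out of each piece with findall. This is a sketch that assumes the data is one string with one tag per line; `observations`, `START` and `WANTED` are hypothetical names.

```python
import re

# MULTILINE makes ^ match at the start of every line, not just the string.
START = re.compile(r'^<1[a-zA-Z]>', re.MULTILINE)
WANTED = re.compile(r'^<[2-4][a-zA-Z]>.*', re.MULTILINE)

def observations(text):
    # Everything before the first <1 tag is dropped by the [1:] slice.
    chunks = START.split(text)[1:]
    return [WANTED.findall(chunk) for chunk in chunks]

text = """<1x>begins
<2x>value-1
<3x>value-2
<4x>value-3
 some indeterminate number of other values
<1y>next observation begins
<2y>value-1
<3y>value-2
<4y>value-3"""
print(observations(text))
```

The trade-off is that the whole input has to be in memory as one string, whereas the loop versions can process lines as they arrive.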
mavnn