I have some data that I have stored in a list and if I print out the list I see the following:

.
.
.
007 A000000 Y
007 B000000  5
007 C010100  1
007 C020100 ACORN FUND
007 C030100 N
007 C010200  2
007 C020200 ACORN INTERNATIONAL
007 C030200 N
007 C010300  3
007 C020300 ACORN USA
007 C030300 N
007 C010400  4
.
.
.

The dots before and after the sequence indicate that there is other, similarly structured data that may or may not be part of this seventh item (007). If the first value in the seventh item is '007 A000000 Y', then I want to create a dictionary listing some of the data items. I can do this, and have done so, by running through all of the items in my list and comparing their values to some test values for the variables. For instance, a line of code like:

if dataLine.find('007 B')==0:
    numberOfSeries=int(dataLine.split()[2])

What I want to do, though, is:

if dataLine.find('007 A000000 Y')==0:
    READ THE NEXT LINE RIGHT HERE

Right now I have to iterate through the entire list on each cycle.

I want to shorten the processing because I have about 60K files with between 500 and 5,000 lines each.

I have thought about creating another reference to the list and counting the data lines until dataLine.find('007 A000000 Y')==0, but that does not seem like the most elegant solution.

+2  A: 

You could read the data into a dictionary. Assuming you are reading from a file-like object infile:

from collections import defaultdict
data = defaultdict(list)
for line in infile:
    elements = line.strip().split()
    data[elements[0]].append(tuple(elements[1:]))

Now if you want to read the line after '007 A000000 Y', you can do so as:

# find the index of ('A000000', 'Y')
idx = data['007'].index(('A000000', 'Y'))
# get the next line
print(data['007'][idx+1])
John Fouhy
+2  A: 

The only difficulty with using all the data in a dictionary is that a really big dictionary can become troublesome. (It's what we used to call the "Big Ole Matrix" approach.)

A solution to this is to build an index in the dictionary instead: a mapping of key->offset, using the file's tell method to get each offset. You can then return to a line later by seeking to its offset with the seek method.
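
A rough sketch of that idea, in case it helps. The helper names build_offset_index and line_after are made up for illustration, and io.BytesIO stands in for a real data file opened in 'rb' mode. It reads with readline() rather than a for loop so that tell() always reports the start of the line about to be read:

```python
import io

def build_offset_index(f):
    """Map each line's content to the byte offset where it starts.

    Only the first occurrence of a given line is recorded.
    """
    index = {}
    while True:
        offset = f.tell()          # position of the line we are about to read
        line = f.readline()
        if not line:               # empty bytes means end of file
            break
        index.setdefault(line.rstrip(b'\r\n'), offset)
    return index

def line_after(f, index, key):
    """Seek to the indexed line and return the line that follows it."""
    f.seek(index[key])
    f.readline()                   # skip the key line itself
    return f.readline().rstrip(b'\r\n')

# Stand-in for open(path, 'rb'); the contents are sample data only.
sample = io.BytesIO(
    b"006 A000000 N\n"
    b"007 A000000 Y\n"
    b"007 B000000  5\n"
    b"007 C010100  1\n"
)
index = build_offset_index(sample)
next_line = line_after(sample, index, b"007 A000000 Y")
```

The index holds only one integer per distinct line, so it stays much smaller than a dictionary holding all of the parsed data.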

Charlie Martin
+3  A: 

You can use itertools.groupby() to segment your sequence into multiple sub-sequences.

import itertools

for key, subseq in itertools.groupby(tempans, lambda s: s.partition(' ')[0]):
    if key == '007':
        for dataLine in subseq:
            if dataLine.startswith('007 B'):
                numberOfSeries = int(dataLine.split()[2])


itertools.dropwhile() would also work if you really just want to seek up to that line:

>>> list(itertools.dropwhile(lambda s: s != '007 A000000 Y', tempans))
['007 A000000 Y',
 '007 B000000  5',
 '007 C010100  1',
 '007 C020100 ACORN FUND',
 '007 C030100 N',
 '007 C010200  2',
 '007 C020200 ACORN INTERNATIONAL',
 '007 C030200 N',
 '007 C010300  3',
 '007 C020300 ACORN USA',
 '007 C030300 N',
 '007 C010400  4',
 '.',
 '.',
 '.',
 '']
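
If you also want to stop at the end of the 007 block, one option is to feed the dropwhile() result into itertools.takewhile(). This is only a sketch: the sample tempans list below is made up, and it assumes every line of the block starts with '007 '.

```python
import itertools

# Sample data standing in for the real list.
tempans = [
    '006 A000000 N',
    '007 A000000 Y',
    '007 B000000  5',
    '007 C010100  1',
    '008 A000000 N',
]

# Skip everything before the marker line...
from_marker = itertools.dropwhile(lambda s: s != '007 A000000 Y', tempans)
# ...then keep lines only while they still belong to item 007.
block = list(itertools.takewhile(lambda s: s.startswith('007 '), from_marker))
```

Both steps are lazy, so the list is walked once and the walk stops at the first line that is not part of item 007.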
groner
A: 

Okay, while I was Googling to make sure I had covered my bases, I came across a solution:

I find that I forget to think in lists and dictionaries even though I use them. Python has some powerful tools for working with these types that make them faster to manipulate.
I need a slice, and the slice boundaries are easily obtained with:

beginPosit = tempans.index('007 A000000 Y')
endPosit = min([i for i, item in enumerate(tempans) if '008 ' in item])

where tempans is the data list. Now I can write:

for line in tempans[beginPosit:endPosit]:
    pass  # process each line here

I think I answered my own question. I learned a lot from the other answers and appreciate them, but I think this is what I needed.

Okay, I am going to edit my answer further. I have learned a lot here, but some of this is still over my head, and I want to get some code written while I learn more about this fantastic tool.

from itertools import takewhile
beginPosit = tempans.index('007 A000000 Y')
new=takewhile(lambda x: '007 ' in x, tempans[beginPosit:])

This is based on an earlier answer to a similar question and Steven Huwig's answer

PyNEwbie
Yeah, but now you're reading through the entire list twice just to find your slice indices
kurosch
A: 

You said you wanted to do this:

if dataLine.find('007 A000000 Y')==0:
    READ THE NEXT LINE RIGHT HERE

Presumably this is within a "for dataLine in data" loop.

Alternatively, you could use an iterator directly instead of in a for loop:

>>> i = iter(data)
>>> while next(i) != '007 A000000 Y': pass  # find your starting line
>>> next(i)  # read the next line
'007 B000000  5'
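
If you would rather keep the for loop, the same trick works inside it, as long as the loop runs over an explicit iterator rather than the list itself. A minimal sketch, with a made-up sample list standing in for your data:

```python
# Sample data standing in for the real list.
data = [
    '006 A000000 N',
    '007 A000000 Y',
    '007 B000000  5',
    '007 C010100  1',
]

numberOfSeries = None
lines = iter(data)                 # one shared iterator for loop and next()
for dataLine in lines:
    if dataLine.startswith('007 A000000 Y'):
        # Pull the following line from the same iterator. The default of
        # None covers the case where the marker is the last line.
        nextLine = next(lines, None)
        if nextLine is not None and nextLine.startswith('007 B'):
            numberOfSeries = int(nextLine.split()[2])
        break                      # stop scanning once the marker is handled
```

Because the for loop and next() share one iterator, any line consumed by next() is not seen again by the loop.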

You also mention having 60K files to process. Are they all formatted similarly? Do they need to be processed differently? If they can all be processed the same way, you could consider chaining them together in a single flow:

import fnmatch
import os

def gfind( directory, pattern="*" ):
    for name in fnmatch.filter( os.listdir( directory ), pattern ):
        yield os.path.join( directory, name )

def gopen( names ):
    for name in names:
        yield open(name, 'rb')

def gcat( files ):
    for file in files:
        for line in file:
            yield line

data = gcat( gopen( gfind( r'C:\datafiles', '*.dat' ) ) )

This lets you lazily process all your files in a single iterator. Not sure if that helps your current situation but I thought it worth mentioning.

kurosch