I am using Python to parse incoming comma-separated strings, and I want to do some calculations on the data afterwards. Each string is about 800 characters long with 120 comma-separated fields. There are 1.2 million such strings to process.

for v in item.values():
    l.extend(get_fields(v.split(',')))
# process l

get_fields uses operator.itemgetter() to extract around 20 fields out of 120.
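For readers unfamiliar with operator.itemgetter, here is a minimal sketch of what such a get_fields could look like. The column positions in WANTED are hypothetical, since the question does not say which fields are kept, and the sample line is shortened to 8 fields.

```python
from operator import itemgetter

# Hypothetical column positions; the question does not list the real ones.
WANTED = (0, 3, 7)

# itemgetter with several indices returns a callable that yields a tuple.
get_fields = itemgetter(*WANTED)

line = "a,b,c,d,e,f,g,h"  # shortened stand-in for a 120-field record
fields = get_fields(line.split(','))
print(fields)
```

Note that the result is a tuple, which l.extend accepts without any conversion.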

This entire operation takes about 4-5 minutes, excluding the time to load the data. Later in the program I insert these lines into an in-memory sqlite table for further use. But 4-5 minutes just to parse the strings and build a list is too slow for my project.

I run this processing in around 6-8 threads.

Would switching to C/C++ help?

+1  A: 

Your program might be slowing down as it tries to allocate enough memory for 1.2M strings. In other words, the speed problem might lie not in the string parsing/manipulation but in the l.extend. To test this hypothesis, you could put a print statement in the loop:

for v in item.values():
    print('got here')
    l.extend(get_fields(v.split(',')))  

If the print statements get slower and slower, you can probably conclude l.extend is the culprit. In this case, you may see significant speed improvement if you can move the processing of each line into the loop.
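As a rough sketch of what moving the processing into the loop could look like: the process function, the field positions, and the sample data below are all hypothetical stand-ins, since the question does not show the actual calculation. The point is only that nothing accumulates in a large list.

```python
from operator import itemgetter

get_fields = itemgetter(0, 2)  # hypothetical field positions

def process(fields, totals):
    # Hypothetical per-line calculation: running sum of the second extracted field.
    totals['sum'] += float(fields[1])
    totals['count'] += 1

# Stand-in for the real dict of 1.2 million strings.
item = {'k1': 'x,9,1.5', 'k2': 'y,9,2.5'}
totals = {'sum': 0.0, 'count': 0}

for v in item.values():
    # Handle each record immediately instead of extending a list.
    process(get_fields(v.split(',')), totals)

print(totals)
```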

PS: You probably should be using the csv module to take care of the parsing for you in a more high-level manner, but I don't think that will affect the speed very much.
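A small sketch of the csv approach, assuming Python 3; the io.StringIO object stands in for an open file and the field positions are hypothetical. Unlike a plain split(','), csv.reader copes with quoted fields that contain commas.

```python
import csv
import io
from operator import itemgetter

get_fields = itemgetter(0, 2)  # hypothetical field positions

# Stand-in for an open data file; the second field of the first row has an embedded comma.
data = io.StringIO('a,"b,with comma",c\nd,e,f\n')

# csv.reader yields each record as a list of strings, already split.
rows = [get_fields(row) for row in csv.reader(data)]
print(rows)
```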

unutbu
The timeit module (http://docs.python.org/library/timeit.html) may be of help in determining how long things take.
GreenMatt
I suggested a quick and dirty method because if you can't *see* a noticeable slow-down, then memory allocation is not the issue.
unutbu
A: 

Are you loading a dict with your file records? Probably better to process the data directly:

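datafile = open("file_with_1point2million_records.dat")
# uncomment next to skip over a header record
# next(datafile)

# the generator expression needs its own parentheses because sum() takes a second argument;
# list() is needed if get_fields returns a tuple (as itemgetter does), which cannot be added to []
l = sum((list(get_fields(v.split(','))) for v in datafile), [])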

This avoids creating any intermediate data structures and only accumulates the desired values as returned by get_fields.
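One caveat: sum(..., []) builds a new list at every step, so its cost grows roughly quadratically with the number of records. A possible alternative, not part of the original answer, is itertools.chain.from_iterable, which flattens the per-line results in a single pass. The sample lines and field positions below are hypothetical.

```python
from itertools import chain
from operator import itemgetter

get_fields = itemgetter(0, 2)  # hypothetical field positions

# Stand-in for iterating over the open data file, which yields lines ending in '\n'.
lines = ['a,b,c\n', 'd,e,f\n']

# Flatten the tuples returned for each line into one list, without repeated copying.
l = list(chain.from_iterable(
    get_fields(v.rstrip('\n').split(',')) for v in lines
))
print(l)
```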

Paul McGuire