I have two files:

  1. metadata.csv: contains an ID, followed by a vendor name, a filename, etc.
  2. hashes.csv: contains an ID, followed by a hash

The ID is essentially a foreign key of sorts, relating file metadata to its hash.

I wrote this script to quickly extract all hashes associated with a particular vendor. It craps out before it finishes processing hashes.csv:

import csv

stored_ids = []

# this file is about 1 MB
entries = csv.reader(open(options.entries, "rb"))

for row in entries:
  # row[2] is the vendor
  if row[2] == options.vendor:
    # row[0] is the ID
    stored_ids.append(row[0])

# this file is 1 GB
hashes = open(options.hashes, "rb")

# I iteratively read the file here,
# just in case the csv module doesn't do this.
for line in hashes:

  # not sure if stored_ids contains strings or ints here...
  # this probably isn't the problem though
  if line.split(",")[0] in stored_ids:

    # if it's one of the IDs we're looking for, print the file and hash to STDOUT
    print "%s,%s" % (line.split(",")[2], line.split(",")[4])

hashes.close()

This script gets about 2000 entries through hashes.csv before it halts. What am I doing wrong? I thought I was processing it line by line.

PS: the CSV files are in the popular HashKeeper format, and the files I am parsing are the NSRL hash sets. http://www.nsrl.nist.gov/Downloads.htm#converter

UPDATE: working solution below. Thanks everyone who commented!

# build a lookup of the vendor's IDs; dict membership tests are O(1)
entries = csv.reader(open(options.entries, "rb"))
stored_ids = dict((row[0], 1) for row in entries if row[2] == options.vendor)

# collect (filename, hash) pairs for rows whose ID matches
hashes = csv.reader(open(options.hashes, "rb"))
matches = dict((row[2], row[4]) for row in hashes if row[0] in stored_ids)

for k, v in matches.iteritems():
    print "%s,%s" % (k, v)
+2  A: 

"Craps out" is not a particularly good description. What does it do? Does it swap? Fill all memory? Or just eats CPU without appearing to do anything?

However, just for a start, use a dictionary rather than a list for stored_ids. Searching a dictionary is usually O(1), while searching a list is O(n).

Edit: here is a trivial micro-benchmark:

$ python -m timeit -s "l=range(1000000)" "1000001 in l"
10 loops, best of 3: 71.1 msec per loop
$ python -m timeit -s "s=set(range(1000000))" "1000001 in s"
10000000 loops, best of 3: 0.174 usec per loop

As you can see, a set (which has the same lookup characteristics as a dict) does searches among one million integers roughly 400,000 times faster than the equivalent list (0.174 microseconds vs. 71.1 milliseconds per lookup). Consider that such a lookup happens for each line of your 1 GB file and you can see how big the issue is.
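
Concretely, the only change you need is to build stored_ids as a set instead of a list. A minimal sketch against your original loop:

stored_ids = set()

for row in entries:
  # row[2] is the vendor
  if row[2] == options.vendor:
    # row[0] is the ID; set membership tests are O(1) on average
    stored_ids.add(row[0])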

Antoine P.
I'll play around with it, but I don't think this is the problem. I should cast the items in stored_ids to ints so it can search more efficiently at least...
Dan
This is completely wrong. It is much more efficient to switch from an O(n) container to an O(1) container than to try to micro-optimize the comparison operation a bit. I'll add a benchmark in the answer above.
Antoine P.
Awesome, thanks!
Dan
A: 

This code would die on any line that does not have at least 4 commas; for example, it would die on an empty line. If you are sure you don't want to use the csv reader, then at least catch the IndexError on line.split(',')[4].
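
Something along these lines, for example (a sketch that keeps your split-based parsing):

for line in hashes:
  try:
    fields = line.split(",")
    if fields[0] in stored_ids:
      print "%s,%s" % (fields[2], fields[4])
  except IndexError:
    # fewer than 5 comma-separated fields on this line; skip it
    pass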

kibitzer
The process halts at almost exactly the same line of output across 4 different data sets, leading me to believe this isn't the case. I'll add the appropriate error catching and report back though.
Dan
A: 

Please explain what you mean by "halt": does it hang or quit? Is there any error traceback?

a) It will fail on any line that doesn't have enough commas:

>>> 'hmmm'.split(",")[2]
Traceback (most recent call last):
  File "<string>", line 1, in <string>
IndexError: list index out of range

b) Why are you splitting the line multiple times? Instead, do this:

tokens = line.split(",")

if len(tokens) >=5 and tokens[0] in stored_ids:
    print "%s,%s" % (tokens[2], tokens[4])

c) Create a dict of stored_ids, so tokens[0] in stored_ids will be fast.

d) Wrap your inner code in try/except and see if there are any errors (see the sketch after this list).

e) Where are you running it: on the command line or in some IDE?
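
Putting b), c) and d) together, the inner loop might look something like this (a sketch; tune the error handling to your needs):

# a dict makes the membership test O(1)
stored_ids = dict((row[0], 1) for row in entries if row[2] == options.vendor)

for line in hashes:
    try:
        tokens = line.split(",")   # split once per line
        if len(tokens) >= 5 and tokens[0] in stored_ids:
            print "%s,%s" % (tokens[2], tokens[4])
    except Exception, e:
        print "failed on line %r: %s" % (line, e)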

Anurag Uniyal
a) not a problem. b) because I'm lazy; I'll change it. c) will do. d) there aren't, but again, sure, I'll try it. e) Python 2.6.4 on Win32
Dan
A: 

Searching a list takes O(n), so use a dict instead:

stored_ids = dict((row[0],1) for row in entries if row[2] == options.vendor)

Or use a set:

a=set(row[0] for row in entries if row[2] == options.vendor)
b=set(line.split(",")[0] for line in hashes)
c=a.intersection(b)

c will contain only the IDs that were found in both files, hashes and the metadata csv.
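
Then a second pass over the big file prints the matches (a sketch, assuming the same column positions as in your script):

hashes = csv.reader(open(options.hashes, "rb"))
for row in hashes:
    if row[0] in c:
        print "%s,%s" % (row[2], row[4])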

S.Mark
hah! I've never used sets in Python. I like that. I'll try it when I'm in front of my code again tomorrow.
Dan