I have two files:

  1. metadata.csv: contains an ID, followed by a vendor name, a filename, etc.
  2. hashes.csv: contains an ID, followed by a hash

The ID is essentially a foreign key of sorts, relating file metadata to its hash.

I wrote this script to quickly extract all hashes associated with a particular vendor. It craps out before it finishes processing hashes.csv:

import csv

stored_ids = []

# this file is about 1 MB
entries = csv.reader(open(options.entries, "rb"))

for row in entries:
  # row[2] is the vendor
  if row[2] == options.vendor:
    # row[0] is the ID
    stored_ids.append(row[0])

# this file is 1 GB
hashes = open(options.hashes, "rb")

# I iteratively read the file here,
# just in case the csv module doesn't do this.
for line in hashes:

  # not sure if stored_ids contains strings or ints here...
  # this probably isn't the problem though
  if line.split(",")[0] in stored_ids:

    # if it's one of the IDs we're looking for, print the file and hash to STDOUT
    print "%s,%s" % (line.split(",")[2], line.split(",")[4])

hashes.close()

This script gets about 2000 entries through hashes.csv before it halts. What am I doing wrong? I thought I was processing it line by line.

PS: the CSV files are in the popular HashKeeper format, and the files I am parsing are the NSRL hash sets. http://www.nsrl.nist.gov/Downloads.htm#converter

UPDATE: working solution below. Thanks everyone who commented!

# build a lookup of the vendor's IDs; dict membership tests are O(1)
entries = csv.reader(open(options.entries, "rb"))
stored_ids = dict((row[0], 1) for row in entries if row[2] == options.vendor)

# collect (filename, hash) pairs for rows whose ID matches
hashes = csv.reader(open(options.hashes, "rb"))
matches = dict((row[2], row[4]) for row in hashes if row[0] in stored_ids)

for k, v in matches.iteritems():
    print "%s,%s" % (k, v)
+2  A: 

"Craps out" is not a particularly good description. What does it do? Does it swap? Fill all memory? Or just eats CPU without appearing to do anything?

However, just for a start, use a dictionary rather than a list for stored_ids. Searching a dictionary is usually O(1), while searching a list is O(n).

Edit: here is a trivial micro-benchmark:

$ python -m timeit -s "l=range(1000000)" "1000001 in l"
10 loops, best of 3: 71.1 msec per loop
$ python -m timeit -s "s=set(range(1000000))" "1000001 in s"
10000000 loops, best of 3: 0.174 usec per loop

As you can see, a set (which has the same lookup characteristics as a dict) does searches among one million integers roughly 400,000 times faster than the equivalent list (0.174 microseconds vs. 71.1 milliseconds per lookup). Consider that such a lookup happens for each line of your 1 GB file and you can see how big the issue is.
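
Concretely, the only change you need is to build stored_ids as a set instead of a list. A minimal sketch against your original loop:

stored_ids = set()

for row in entries:
  # row[2] is the vendor
  if row[2] == options.vendor:
    # row[0] is the ID; set membership tests are O(1) on average
    stored_ids.add(row[0])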

Antoine P.
I'll play around with it, but I don't think this is the problem. I should cast the items in stored_ids to ints so it can search more efficiently at least...
Dan
This is completely wrong. It is much more efficient to switch from an O(n) container to an O(1) container than to try to micro-optimize the comparison operation a bit. I'll add a benchmark in the answer above.
Antoine P.
Awesome, thanks!
Dan
A: 

This code would die on any line that does not have at least 4 commas; for example, it would die on an empty line. If you are sure you don't want to use the csv reader, then at least catch the IndexError on line.split(',')[4].
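
Something along these lines, for example (a sketch that keeps your split-based parsing):

for line in hashes:
  try:
    fields = line.split(",")
    if fields[0] in stored_ids:
      print "%s,%s" % (fields[2], fields[4])
  except IndexError:
    # fewer than 5 comma-separated fields on this line; skip it
    pass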

kibitzer
The process halts at almost exactly the same line of output across 4 different data sets, leading me to believe this isn't the case. I'll add the appropriate error catching and report back though.
Dan
A: 

Please explain what you mean by "halt": does it hang or quit? Is there any error traceback?

a) It will fail on any line that doesn't have enough commas:

>>> 'hmmm'.split(",")[2]
Traceback (most recent call last):
  File "<string>", line 1, in <string>
IndexError: list index out of range

b) Why are you splitting the line multiple times? Instead, do this:

tokens = line.split(",")

if len(tokens) >=5 and tokens[0] in stored_ids:
    print "%s,%s" % (tokens[2], tokens[4])

c) Create a dict of stored_ids, so tokens[0] in stored_ids will be fast.

d) Wrap your inner code in try/except and see if there are any errors (see the sketch after this list).

e) Where are you running it: on the command line or in some IDE?
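
Putting b), c) and d) together, the inner loop might look something like this (a sketch; tune the error handling to your needs):

# a dict makes the membership test O(1)
stored_ids = dict((row[0], 1) for row in entries if row[2] == options.vendor)

for line in hashes:
    try:
        tokens = line.split(",")   # split once per line
        if len(tokens) >= 5 and tokens[0] in stored_ids:
            print "%s,%s" % (tokens[2], tokens[4])
    except Exception, e:
        print "failed on line %r: %s" % (line, e)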

Anurag Uniyal
a) not a problem. b) because I'm lazy; I'll change it. c) will do. d) there aren't, but again, sure, I'll try it. e) Python 2.6.4 on Win32
Dan
A: 

Searching a list takes O(n), so use a dict instead:

stored_ids = dict((row[0],1) for row in entries if row[2] == options.vendor)

Or use a set:

a=set(row[0] for row in entries if row[2] == options.vendor)
b=set(line.split(",")[0] for line in hashes)
c=a.intersection(b)

c will contain only the IDs that were found in both files, hashes and the metadata csv.
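
Then a second pass over the big file prints the matches (a sketch, assuming the same column positions as in your script):

hashes = csv.reader(open(options.hashes, "rb"))
for row in hashes:
    if row[0] in c:
        print "%s,%s" % (row[2], row[4])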

S.Mark
hah! I've never used sets in Python. I like that. I'll try it when I'm in front of my code again tomorrow.
Dan