views: 161
answers: 2

Hi, I have a very large CSV file containing only two fields (id, url). I want to do some indexing on the url field with Python. I know there are tools like Whoosh or PyLucene, but I can't get the examples to work. Can someone help me with this?

A: 

file.csv contents:

a,b
d,f
g,h

Python script that loads it all into one giant dictionary:

# Python 3.1
# Build one in-memory dict mapping id -> url (each line is assumed to be "id,url")
giant_dict = {key.strip(): url.strip()
              for key, url in (line.split(',') for line in open('file.csv', 'r'))}

print(giant_dict)
{'a': 'b', 'd': 'f', 'g': 'h'}
Hamish Grubijan
Dear lord, why are you parsing it yourself instead of using the CSV module??
moshez
The problem is that this file will be more than 5 GB, so I cannot load it into memory at once!
Hossein
What exactly are you trying to do? You can read the file line by line with this: for line in open('file.csv'). Also, why not just get 9 GB of RAM installed?
Hamish Grubijan
The URLs in this large file need to be compared with those in another large file, and for faster access I need to do some indexing on them.
Hossein
I still do not understand what you are trying to do. What if there is a match? What if there is no match? Describe the whole thing please. Indexing is no silver bullet.
Hamish Grubijan
@Hossein: Please add new facts to the question -- do not add new facts as comments to an answer.
S.Lott
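
A minimal sketch of the line-by-line reading suggested in the comments above, using the csv module instead of splitting by hand (process() is a placeholder for whatever per-row work is needed):

# Read the 5 GB file one row at a time; only the current row is held in memory.
import csv

def process(row_id, url):
    pass  # placeholder: compare, index, store, ...

with open('file.csv', newline='') as f:
    for row_id, url in csv.reader(f):  # each row is assumed to be exactly (id, url)
        process(row_id.strip(), url.strip())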
+1  A: 

PyLucene is very easy to work with, but as you haven't posted your example, I am not sure what problem you are facing.

Alternatively, when you have only key:value data, a better fit than PyLucene might be a DB like Berkeley DB (Python bindings: pybsddb). It works like a Python dictionary on disk and should be as fast as or faster than Lucene; you can try that.
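
For illustration, a sketch of that dict-like on-disk approach using the standard-library dbm module, which offers the same kind of interface (bsddb3/pybsddb would be used the same way for an actual Berkeley DB file); the file names here are placeholders:

# Build an on-disk id -> url index once, then reuse it for fast lookups.
import csv
import dbm

index = dbm.open('url_index.db', 'c')  # 'c': create the database file if it does not exist
with open('file.csv', newline='') as f:
    for row_id, url in csv.reader(f):  # each row is assumed to be exactly (id, url)
        index[row_id.strip()] = url.strip()  # written to disk, not kept in RAM
index.close()

# Later lookups reopen the index without rereading the whole CSV:
index = dbm.open('url_index.db', 'r')
print(index['a'])  # values come back as bytes, e.g. b'b'
index.close()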

Anurag Uniyal