views: 677
answers: 5

I'm reading a 6 million entry .csv file with Python, and I want to be able to search through this file for a particular entry.

Are there any tricks to search the entire file? Should you read the whole thing into a dictionary, or should you perform a search every time? I tried loading it into a dictionary, but that took ages, so I'm currently searching through the whole file every time, which seems wasteful.

Could I take advantage of the fact that the list is alphabetically ordered? (e.g. if the search word starts with "b", only search from the line containing the first word beginning with "b" to the line containing the last word beginning with "b")

I'm using import csv.

(A side question: is it possible to make csv go to a specific line in the file? I want to make the program start at a random line.)

Edit: I already have a copy of the list as an .sql file as well, how could I implement that into Python?

+6  A: 

If the csv file isn't changing, load it into a database, where searching is fast and easy. If you're not familiar with SQL, you'll need to brush up on that, though.

Here is a rough example of inserting from a csv into a sqlite table. The example csv is ';'-delimited and has 2 columns.

import csv
import sqlite3

con = sqlite3.connect('newdb.sqlite')
cur = con.cursor()
cur.execute('CREATE TABLE "stuff" ("one" varchar(12), "two" varchar(12));')

f = open('stuff.csv')
csv_reader = csv.reader(f, delimiter=';')

# csv.reader yields one row (a list of columns) per line, so it can be
# passed straight to executemany
cur.executemany('INSERT INTO stuff VALUES (?, ?)', csv_reader)
con.commit()
cur.close()
con.close()
f.close()
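
For completeness, a rough sketch of what a lookup against that table could then look like ('bacon' is just a placeholder search term):

import sqlite3

con = sqlite3.connect('newdb.sqlite')
cur = con.cursor()
# Parameterized query; sqlite does the searching, nothing is read into Python first
cur.execute('SELECT one, two FROM stuff WHERE one = ?', ('bacon',))
print(cur.fetchall())
con.close()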
JimB
I had kind of hoped I wouldn't have to use SQL to do this; Python is supposedly almost as quick as Perl at dealing with strings? Is SQL really any faster? (I'm using Linux, so please try to suggest cross-platform software.)
Baldur
com.close() should be con.close()
pwdyson
@Baldur - This isn't a matter of perl vs python; your problem is that you're repeatedly reading a large file. Perl and python would do it the same way. A database just gives you a better interface for indexing and searching.
JimB
I'd rather not use a database since I'm not familiar with SQL. Is there no way to implement this well without SQL, and if there isn't, what database management system should I use? Is MySQL good? And don't you load the entire file into a database in your example, creating 6 million tuples? Doesn't that take a lot of time every time the program starts?
Baldur
@Baldur It's a pretty short and sweet example. Why don't you try it and see how long creating the DB takes and try a couple searches? Who knows, maybe it's the perfect solution for you
prestomation
@Baldur: I would use sqlite like in my example. It's fast, easy, and the database is a single file. There is no initial load time, because the data structures are on disk, so you're not waiting to load it into memory. You can probably speed it up even more by creating an index for the columns you're searching on (rough sketch below). Plus, knowing a little SQL is a great skill to have.
JimB
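
A minimal sketch of the index suggested in that last comment, assuming the newdb.sqlite file and stuff table from the example above (the index name is arbitrary):

import sqlite3

con = sqlite3.connect('newdb.sqlite')
# One-time index on the column being searched
con.execute('CREATE INDEX idx_stuff_one ON stuff (one)')
con.commit()
con.close()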
+1  A: 

Well, if the word list isn't too big (meaning it will fit in memory), then here is a simple way to do this (I'm assuming they are all words).

from bisect import bisect_left

f = open('myfile.csv')

# Binary search only works on sorted data; this relies on the file already
# being in alphabetical order
words = []
for line in f:
    words.extend(line.strip().split(','))
f.close()

wordtofind = 'bacon'
ind = bisect_left(words, wordtofind)
if ind < len(words) and words[ind] == wordtofind:
    print '%s was found!' % wordtofind

It might take a minute to load in all of the values from the file. This uses binary search to find your words. In this case I was looking for bacon (who wouldn't look for bacon?). If there are repeated values, you might also want to use bisect_right to find the index one past the rightmost element equal to the value you are searching for. You can still use this if you have key:value pairs; just make each object in your words list a list of [key, value].
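
A minimal sketch of that key:value variant, assuming a two-column csv sorted by its first column (the file name and search key are made up):

from bisect import bisect_left

rows = []
f = open('myfile.csv')
for line in f:
    key, value = line.strip().split(',', 1)
    rows.append([key, value])
f.close()

keytofind = 'bacon'
# Lists compare element by element, so [keytofind] sorts just before any
# [keytofind, value]; bisect_left therefore lands on the first matching row
ind = bisect_left(rows, [keytofind])
if ind < len(rows) and rows[ind][0] == keytofind:
    print(rows[ind][1])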

Side Note

I don't think that you can really go from line to line in a csv file very easily. You see, these files are basically just long strings with \n characters that indicate new lines.

Justin Peel
+2  A: 

You can use memory mapping for really big files:

import mmap, os, re

reportFile = open("big_file", "rb")  # binary mode, since we're mapping raw bytes
length = os.fstat(reportFile.fileno()).st_size
try:
    # Unix: read-only, copy-on-write mapping
    mapping = mmap.mmap(reportFile.fileno(), length, mmap.MAP_PRIVATE, mmap.PROT_READ)
except AttributeError:
    # Windows: MAP_PRIVATE/PROT_READ don't exist there, use ACCESS_READ instead
    mapping = mmap.mmap(reportFile.fileno(), 0, None, mmap.ACCESS_READ)
data = mapping.read(length)
pat = re.compile(r"^b.*", re.M)  # e.g. every line starting with "b"; compile your pattern here
print pat.findall(data)
mapping.close()
reportFile.close()
ghostdog74
+1  A: 

You can't go directly to a specific line in the file because lines are variable-length, so the only way to know where line #n starts is to search for the first n newlines. And it's not enough to just look for '\n' characters, because CSV allows newlines in table cells, so you really do have to parse the file anyway.
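
If the real goal is just to start at a random record, one possible workaround (only safe if your particular file happens to have no newlines inside cells, which, per the caveat above, the csv format does not guarantee) is to record the byte offset of every line in one pass and seek to one later. A rough sketch, with a made-up file name:

import csv, random

offsets = []
f = open('myfile.csv', 'rb')
pos = f.tell()
line = f.readline()
while line:
    offsets.append(pos)   # byte offset where this line starts
    pos = f.tell()
    line = f.readline()

f.seek(random.choice(offsets))          # jump to a random line...
row = next(csv.reader([f.readline()]))  # ...and parse just that one line
print(row)
f.close()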

dan04
A: 

My idea is to use the Python ZODB module to store the dictionary-type data, then create a new csv file using that data structure and do all your operations on it.
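
A rough sketch of that idea, assuming the ZODB and BTrees packages are installed (file names and keys here are made up):

import csv
import transaction
import ZODB, ZODB.FileStorage
from BTrees.OOBTree import OOBTree

# Build a persistent, disk-backed mapping from the csv file once
storage = ZODB.FileStorage.FileStorage('words.fs')
db = ZODB.DB(storage)
conn = db.open()
root = conn.root()

root['words'] = OOBTree()
for row in csv.reader(open('myfile.csv')):
    root['words'][row[0]] = row
transaction.commit()

# Later runs reopen words.fs and look entries up without re-reading the csv
print(root['words'].get('bacon'))
db.close()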

vivek sharma