ansaurus

Question

Answer 1

+3 A:

I would use memory mapping: http://docs.python.org/library/mmap.html.
This way you can use the file as if it were stored in memory, while the OS decides which pages actually get read from the file.
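
A minimal sketch of the memory-mapping idea, assuming each line holds a whitespace-separated key and value. The function name `find_values` and that line format are illustrative assumptions; the thread does not show the actual file. Note that this still scans every line; the gain is only that the OS pages the data in on demand.

```python
import mmap

def find_values(path, target_keys):
    """Return {key: [values]} for lines whose first field is in target_keys.

    Keys and values are bytes, because the mapped file is read as raw bytes.
    The "key value" line layout is an assumption for illustration.
    """
    wanted = set(target_keys)
    found = {}
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        # (Mapping an empty file raises ValueError.)
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # mmap objects provide readline(), which returns b"" at end of file.
            for line in iter(mm.readline, b""):
                fields = line.split()
                if len(fields) == 2 and fields[0] in wanted:
                    found.setdefault(fields[0], []).append(fields[1])
        finally:
            mm.close()
    return found
```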

Bastien Léonard 2009-04-13 15:32:40

Answer 2

+4 A:

It's not clear what "list[pointer]" is all about. Consider this, however.

from collections import defaultdict
keyValues = defaultdict(list)
targetKeys = set()  # fill in with the integer keys to look up
# fin is the already-open input file
for line in fin:
    key, value = map(int, line.split())
    if key in targetKeys:  # set membership is O(1) per line
        keyValues[key].append(value)

S.Lott 2009-04-13 15:33:48

This is slower than the method I included in the question. :( PS: I added a couple of comments to my code snippet to explain it a bit better.

marcog 2009-04-13 15:56:49

Answer 3

+5 A:

If you only need 200 of 50 million lines, then reading all of it into memory is a waste. I would sort the list of search keys and then apply binary search to the file using seek() or something similar. This way you would not read the entire file into memory, which I think should speed things up.

kigurai 2009-04-13 15:37:33

This method, combined with Will's idea of fixed-width entries, sounds good. Let me try this out quickly.

marcog 2009-04-13 15:59:31

Great, this is blazing fast! :D

marcog 2009-04-13 16:33:48

Check my answer for actual working code to do this.

Joe Koberg 2009-04-13 16:37:54

Glad to have helped :)

kigurai 2009-04-13 18:48:07

Answer 4

A:

One possible optimization is to do a bit of buffering using the sizehint option in file.readlines(..). This allows you to load multiple lines into memory at once, totaling approximately sizehint bytes.
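
A small sketch of that buffering idea, assuming `f` is an already-open file object. The helper name `iter_line_batches` and the 1 MiB default are illustrative choices, not taken from the question.

```python
def iter_line_batches(f, sizehint=1 << 20):
    """Yield lists of lines, each list totaling roughly sizehint bytes."""
    while True:
        # readlines(hint) stops reading once the lines read so far
        # exceed the hint, so each batch stays bounded in size.
        batch = f.readlines(sizehint)
        if not batch:
            break  # end of file
        yield batch
```

Each batch can then be processed with an inner `for line in batch:` loop.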

Il-Bhima 2009-04-13 15:40:51

Answer 5

+7 A:

Slight optimization of S.Lott's answer:

from collections import defaultdict
keyValues = defaultdict(list)
targetKeys = set()  # fill in with the keys to look up, as strings
# fin is the already-open input file
for line in fin:
    key, value = line.split()
    if key in targetKeys:  # set membership is O(1) per line
        keyValues[key].append(value)

Since we're using a dictionary rather than a list, the keys don't have to be numbers. This saves the map() operation and a string-to-integer conversion for each line. If you want the keys to be numbers, do the conversion at the end, when you only have to do it once for each key, rather than for each of 50 million lines.
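
A sketch of the "convert at the end" step, assuming each line is a whitespace-separated key and value. The function name `collect` is hypothetical and used only for illustration.

```python
from collections import defaultdict

def collect(lines, target_keys):
    """Group values by key using string comparisons, then convert once."""
    wanted = set(target_keys)  # keys kept as strings while scanning
    by_key = defaultdict(list)
    for line in lines:
        key, value = line.split()
        if key in wanted:
            by_key[key].append(value)
    # int() runs only on the matching keys and values,
    # not on every line of the input.
    return {int(k): [int(v) for v in vs] for k, vs in by_key.items()}
```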

Chris Upchurch 2009-04-13 15:46:13

+1: Good point -- no math means no conversion necessary.

S.Lott 2009-04-13 15:55:44

Answer 6

+1 A:

If you have any control over the format of the file, the "sort and binary search" responses are correct. The detail is that this only works with records of a fixed size and offset (well, I should say it only works easily with fixed length records).

With fixed length records, you can easily seek() around the sorted file to find your keys.
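
A minimal sketch of such a search, assuming the file is opened in binary mode, is sorted by integer key, and every line is exactly `record_len` bytes including the newline. The function name and parameters are illustrative assumptions.

```python
import os

def find_fixed_width(f, key, record_len):
    """Binary-search a sorted file of fixed-length "key value" records."""
    f.seek(0, os.SEEK_END)
    lo, hi = 0, f.tell() // record_len  # record indices, hi is exclusive
    while lo < hi:
        mid = (lo + hi) // 2
        # With fixed-length records, record i starts at byte i * record_len.
        f.seek(mid * record_len)
        k, v = f.read(record_len).split()
        k = int(k)
        if k == key:
            return int(v)
        if k < key:
            lo = mid + 1
        else:
            hi = mid
    raise KeyError(key)
```

Because every seek lands exactly on a record boundary, there is no need to discard a partial line after seeking.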

Will Hartung 2009-04-13 15:53:55

Answer 7

A:

You need to implement binary search using seek().

vartec 2009-04-13 15:56:12

Answer 8

+1 A:

Here is a recursive binary search on the text file (which must be sorted by key):

import os, stat

class IntegerKeyTextFile(object):
    def __init__(self, filename):
        self.filename = filename
        # binary mode, so that seek() offsets are plain byte positions
        self.f = open(self.filename, 'rb')
        self.getStatinfo()

    def getStatinfo(self):
        self.statinfo = os.stat(self.filename)
        self.size = self.statinfo[stat.ST_SIZE]

    def parse(self, line):
        key, value = line.split()
        k = int(key)
        v = int(value)
        return (k,v)

    def __getitem__(self, key):
        return self.findKey(key)

    def findKey(self, keyToFind, startpoint=0, endpoint=None):
        "Recursively binary-search a text file sorted by key"

        if endpoint is None:
            endpoint = self.size

        currentpoint = (startpoint + endpoint) // 2

        self.f.seek(currentpoint)
        if currentpoint != 0:
            # may not start at a line break! Discard the partial line.
            self.f.readline()

        linestart = self.f.tell()
        keyatpoint = self.f.readline()

        if keyatpoint.strip():
            k,v = self.parse(keyatpoint)
            if k == keyToFind:
                print('key found at  %d  with value  %d' % (linestart, v))
                return v
        else:
            # read past the last line - the key, if present, lies earlier
            k = None

        if endpoint - startpoint <= 1:
            # search range exhausted
            raise KeyError('key %d not found'%(keyToFind,))

        if k is None or k > keyToFind:
            return self.findKey(keyToFind, startpoint, currentpoint)
        else:
            return self.findKey(keyToFind, currentpoint, endpoint)

A sample text file created in jEdit seems to work:

>>> i = integertext.IntegerKeyTextFile('c:\\sampledata.txt')
>>> i[1]
key found at  0  with value  345
345

It could definitely be improved by caching found keys and using the cache to determine future starting seek points.
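
A rough sketch of that caching idea, under stated assumptions: the class `OffsetCache` and its method names are hypothetical, and it only tracks where already-found keys start so that a later search can begin from a narrower byte range.

```python
import bisect

class OffsetCache(object):
    """Remember line-start offsets of found keys to narrow later searches."""

    def __init__(self, file_size):
        self.file_size = file_size
        self.keys = []     # found keys, kept sorted
        self.offsets = []  # offsets[i] is the line-start offset of keys[i]

    def remember(self, key, offset):
        i = bisect.bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            self.keys.insert(i, key)
            self.offsets.insert(i, offset)

    def bounds(self, key):
        """Return (lo, hi): the byte range in which key's line must start."""
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            # already located: the line starts exactly here
            return self.offsets[i], self.offsets[i]
        lo = self.offsets[i - 1] if i > 0 else 0
        hi = self.offsets[i] if i < len(self.keys) else self.file_size
        return lo, hi
```

For a key that is not cached yet, the returned pair could serve as the startpoint and endpoint arguments of a search like findKey above, since the file is sorted by key.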

Joe Koberg 2009-04-13 16:35:18

Nice trick to make sure it starts at a line break, but since I have full control over the input file it's faster to just format it to have a fixed width per line. See my implementation at the end of the original question.

marcog 2009-04-13 17:03:41

Reading Huge File in Python
