views: 4232

answers: 6
I have a very big file (4 GB), and when I try to read it my computer hangs. So I want to read it piece by piece, and after processing each piece, store the processed piece into another file and read the next piece.

Is there any method to yield these pieces?

I would love to have a lazy method.

+20  A: 

To write a lazy function, just use yield:

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


f = open('really_big_file.dat')
for piece in read_in_chunks(f):
    process_data(piece)
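
To cover the question's step of storing each processed piece into another file, a minimal sketch (not part of the original answer) could look like this, assuming process_data returns the transformed piece and that 'processed_file.dat' is just a placeholder name:

src = open('really_big_file.dat')
dst = open('processed_file.dat', 'w')
for piece in read_in_chunks(src):
    # write out the result of processing before reading the next piece
    dst.write(process_data(piece))
src.close()
dst.close()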


Another option would be to use iter and a helper function:

f = open('really_big_file.dat')
def read1k():
    return f.read(1024)

for piece in iter(read1k, ''):
    process_data(piece)
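
The same sentinel trick can also be written without the nested helper by using functools.partial; this variant is an aside, not from the original answer:

import functools

f = open('really_big_file.dat')
# iter(callable, sentinel) calls f.read(1024) until it returns ''
for piece in iter(functools.partial(f.read, 1024), ''):
    process_data(piece)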


If the file is line-based, the file object is already a lazy generator of lines:

for line in open('really_big_file.dat'):
    process_data(line)
nosklo
+1: for `iter()`. A warning that a line in a file may be arbitrarily large (sometimes the whole file) might be helpful.
J.F. Sebastian
+4  A: 

Take a look at this post on Neopythonic: "Sorting a million 32-bit integers in 2MB of RAM using Python"
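
The core idea in that post is external sorting: sort fixed-size batches into temporary files, then merge the sorted runs lazily. A rough sketch of that technique (an illustration, not the post's actual code), assuming one integer per line of input:

import heapq
import itertools
import tempfile

def external_sort(input_path, batch_size=100000):
    """Sort a huge file of integers (one per line) using little memory."""
    runs = []
    with open(input_path) as f:
        while True:
            # read and sort one batch at a time
            batch = [int(line) for line in itertools.islice(f, batch_size)]
            if not batch:
                break
            batch.sort()
            run = tempfile.TemporaryFile(mode='w+')
            run.writelines('%d\n' % n for n in batch)
            run.seek(0)
            runs.append(run)
    # merge the sorted runs lazily, comparing numeric values
    return heapq.merge(*[(int(line) for line in run) for run in runs])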

Paolo Tedesco
See also http://effbot.org/zone/wide-finder.htm for a combination of huge file processing techniques.
Constantin
+9  A: 

You can use the mmap module to map the contents of the file into memory and access it with indices and slices. Here is an example from the documentation:

import mmap
with open("hello.txt", "r+") as f:
    # memory-map the file, size 0 means whole file
    map = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print map.readline()  # prints "Hello Python!"
    # read content via slice notation
    print map[:5]  # prints "Hello"
    # update content using slice notation;
    # note that new content must have same size
    map[6:] = " world!\n"
    # ... and read again using standard file methods
    map.seek(0)
    print map.readline()  # prints "Hello  world!"
    # close the map
    map.close()
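
For a file as large as the one in the question, the same module can be opened read-only and sliced in fixed-size pieces; this is a sketch added for illustration (not from the original answer), and note that mapping a 4 GB file requires a 64-bit Python build:

import mmap

f = open('really_big_file.dat', 'rb')
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
chunk_size = 1024 * 1024  # 1 MB per slice
for offset in xrange(0, mm.size(), chunk_size):
    # process_data is the question's placeholder for per-piece work
    process_data(mm[offset:offset + chunk_size])
mm.close()
f.close()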
unbeknown
A: 

I'm in a somewhat similar situation. It's not clear whether you know the chunk size in bytes; I usually don't, but the number of records (lines) required is known:

def get_line():
    with open('4gb_file') as file:
        for i in file:
            yield i

lines_required = 100
gen = get_line()
chunk = [i for i, j in zip(gen, range(lines_required))]

Update: thanks, nosklo. Here's what I meant. It almost works, except that it loses a line 'between' chunks.

chunk = [next(gen) for i in range(lines_required)]

does the trick without losing any lines, but it doesn't look very nice.
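
For reference, the same chunking can be done without losing lines and without the explicit next() calls by using itertools.islice; this is a sketch added for comparison, not part of the answer above:

from itertools import islice

def read_in_line_chunks(file_object, lines_required=100):
    """Yield successive lists of at most `lines_required` lines."""
    while True:
        chunk = list(islice(file_object, lines_required))
        if not chunk:
            return
        yield chunk

for chunk in read_in_line_chunks(open('4gb_file')):
    process_data(chunk)  # process_data is the question's placeholder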

SilentGhost
Is this pseudocode? It won't work. It is also needlessly confusing; you should make the number of lines an optional parameter of the get_line function.
nosklo
A: 

I am not allowed to comment due to my low reputation, but SilentGhost's solution would be much easier with file.readlines([sizehint]).

python file methods

edit: SilentGhost is right, but something like this should work better:

s = ""
for i in xrange(100):
    s += file.next()
sinzi
sizehint is in bytes
SilentGhost
OK, sorry, you are absolutely right. But maybe this solution will make you happier ;): `s = ""` followed by `for i in xrange(100): s += file.next()`
sinzi
-1: Terrible solution; this means creating a new string in memory for each line and copying all the data read so far into it. The worst performance and memory use.
nosklo
Why would it copy the entire file data into a new string? From the Python documentation: "In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer."
sinzi
@sinzi: `s +=`, i.e. concatenating strings, makes a new copy of the string each time; since strings are immutable, you are creating a new string on every iteration.
nosklo
@nosklo: these are implementation details; a list comprehension can be used in its place.
SilentGhost
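
As a side note, the repeated concatenation nosklo objects to can be avoided with a single join, along the lines of SilentGhost's list-comprehension suggestion; this snippet is a sketch added here, not from the thread:

f = open('bigfilename')
s = "".join([f.next() for i in xrange(100)])  # one final copy instead of one per line
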
+2  A: 

file.readlines() takes an optional sizehint argument: instead of reading the whole file, it reads approximately that many bytes, rounded up so that only complete lines are returned.

BUF_SIZE = 65536  # read roughly 64 kB worth of lines at a time
bigfile = open('bigfilename', 'r')
tmp_lines = bigfile.readlines(BUF_SIZE)
while tmp_lines:
    process(tmp_lines)  # tmp_lines is already a list of lines
    tmp_lines = bigfile.readlines(BUF_SIZE)
Anshul