views: 4232

answers: 6
I have a very big file (4 GB), and when I try to read it my computer hangs. So I want to read it piece by piece, and after processing each piece, store the processed piece into another file and read the next piece.

Is there any method to yield these pieces?

I would love to have a lazy method.

+20  A: 

To write a lazy function, just use yield:

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


f = open('really_big_file.dat')
for piece in read_in_chunks(f):
    process_data(piece)
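
To cover the question's step of storing each processed piece into another file, a minimal sketch (not part of the original answer) could look like this, assuming process_data returns the transformed piece and that 'processed_file.dat' is just a placeholder name:

src = open('really_big_file.dat')
dst = open('processed_file.dat', 'w')
for piece in read_in_chunks(src):
    # write out the result of processing before reading the next piece
    dst.write(process_data(piece))
src.close()
dst.close()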


Another option would be to use iter and a helper function:

f = open('really_big_file.dat')
def read1k():
    return f.read(1024)

for piece in iter(read1k, ''):
    process_data(piece)
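
The same sentinel trick can also be written without the nested helper by using functools.partial; this variant is an aside, not from the original answer:

import functools

f = open('really_big_file.dat')
# iter(callable, sentinel) calls f.read(1024) until it returns ''
for piece in iter(functools.partial(f.read, 1024), ''):
    process_data(piece)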


If the file is line-based, the file object is already a lazy generator of lines:

for line in open('really_big_file.dat'):
    process_data(line)
nosklo
+1: for `iter()`. A warning that a line in a file may be arbitrarily large (sometimes the whole file) might be helpful.
J.F. Sebastian
+4  A: 

Take a look at this post on Neopythonic: "Sorting a million 32-bit integers in 2MB of RAM using Python"
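
The core idea in that post is external sorting: sort fixed-size batches into temporary files, then merge the sorted runs lazily. A rough sketch of that technique (an illustration, not the post's actual code), assuming one integer per line of input:

import heapq
import itertools
import tempfile

def external_sort(input_path, batch_size=100000):
    """Sort a huge file of integers (one per line) using little memory."""
    runs = []
    with open(input_path) as f:
        while True:
            # read and sort one batch at a time
            batch = [int(line) for line in itertools.islice(f, batch_size)]
            if not batch:
                break
            batch.sort()
            run = tempfile.TemporaryFile(mode='w+')
            run.writelines('%d\n' % n for n in batch)
            run.seek(0)
            runs.append(run)
    # merge the sorted runs lazily, comparing numeric values
    return heapq.merge(*[(int(line) for line in run) for run in runs])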

Paolo Tedesco
See also http://effbot.org/zone/wide-finder.htm for a combination of huge file processing techniques.
Constantin
+9  A: 

You can use the mmap module to map the contents of the file into memory and access it with indices and slices. Here is an example from the documentation:

import mmap
with open("hello.txt", "r+") as f:
    # memory-map the file, size 0 means whole file
    map = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print map.readline()  # prints "Hello Python!"
    # read content via slice notation
    print map[:5]  # prints "Hello"
    # update content using slice notation;
    # note that new content must have same size
    map[6:] = " world!\n"
    # ... and read again using standard file methods
    map.seek(0)
    print map.readline()  # prints "Hello  world!"
    # close the map
    map.close()
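
For a file as large as the one in the question, the same module can be opened read-only and sliced in fixed-size pieces; this is a sketch added for illustration (not from the original answer), and note that mapping a 4 GB file requires a 64-bit Python build:

import mmap

f = open('really_big_file.dat', 'rb')
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
chunk_size = 1024 * 1024  # 1 MB per slice
for offset in xrange(0, mm.size(), chunk_size):
    # process_data is the question's placeholder for per-piece work
    process_data(mm[offset:offset + chunk_size])
mm.close()
f.close()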
unbeknown
A: 

I'm in a somewhat similar situation. It's not clear whether you know the chunk size in bytes; I usually don't, but the number of records (lines) required is known:

def get_line():
    with open('4gb_file') as file:
        for i in file:
            yield i

lines_required = 100
gen = get_line()
chunk = [i for i, j in zip(gen, range(lines_required))]

Update: thanks, nosklo. Here's what I meant. It almost works, except that it loses a line 'between' chunks.

chunk = [next(gen) for i in range(lines_required)]

does the trick without losing any lines, but it doesn't look very nice.
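
For reference, the same chunking can be done without losing lines and without the explicit next() calls by using itertools.islice; this is a sketch added for comparison, not part of the answer above:

from itertools import islice

def read_in_line_chunks(file_object, lines_required=100):
    """Yield successive lists of at most `lines_required` lines."""
    while True:
        chunk = list(islice(file_object, lines_required))
        if not chunk:
            return
        yield chunk

for chunk in read_in_line_chunks(open('4gb_file')):
    process_data(chunk)  # process_data is the question's placeholder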

SilentGhost
Is this pseudocode? It won't work. It is also needlessly confusing; you should make the number of lines an optional parameter of the get_line function.
nosklo
A: 

I am not allowed to comment due to my low reputation, but SilentGhost's solution would be much easier with file.readlines([sizehint]).

python file methods

edit: SilentGhost is right, but something like this should work better:

s = ""
for i in xrange(100):
    s += file.next()
sinzi
sizehint is in bytes
SilentGhost
OK, sorry, you are absolutely right. But maybe this solution will make you happier ;): `s = ""` followed by `for i in xrange(100): s += file.next()`
sinzi
-1: Terrible solution; this means creating a new string in memory for each line and copying all the data read so far into it. The worst performance and memory use.
nosklo
Why would it copy the entire file data into a new string? From the Python documentation: "In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer."
sinzi
@sinzi: `s +=`, i.e. concatenating strings, makes a new copy of the string each time; since strings are immutable, you are creating a new string on every iteration.
nosklo
@nosklo: these are implementation details; a list comprehension can be used in its place.
SilentGhost
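
As a side note, the repeated concatenation nosklo objects to can be avoided with a single join, along the lines of SilentGhost's list-comprehension suggestion; this snippet is a sketch added here, not from the thread:

f = open('bigfilename')
s = "".join([f.next() for i in xrange(100)])  # one final copy instead of one per line
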
+2  A: 

file.readlines() takes an optional sizehint argument: instead of reading the whole file, it reads approximately that many bytes, rounded up so that only complete lines are returned.

BUF_SIZE = 65536  # read roughly 64 kB worth of lines at a time
bigfile = open('bigfilename', 'r')
tmp_lines = bigfile.readlines(BUF_SIZE)
while tmp_lines:
    process(tmp_lines)  # tmp_lines is already a list of lines
    tmp_lines = bigfile.readlines(BUF_SIZE)
Anshul