I'm using Python 2.6 on a Mac Mini with 1GB RAM. I want to read in a huge text file:

$ ls -l links.csv; file links.csv; tail links.csv 
-rw-r--r--  1 user  user  469904280 30 Nov 22:42 links.csv
links.csv: ASCII text, with CRLF line terminators
4757187,59883
4757187,99822
4757187,66546
4757187,638452
4757187,4627959
4757187,312826
4757187,6143
4757187,6141
4757187,3081726
4757187,58197

So each line in the file consists of a tuple of two comma-separated integer values. I want to read in the whole file and sort it according to the second column. I know that I could do the sorting without reading the whole file into memory. But I thought that for a 500MB file I should still be able to do it in memory, since I have 1GB available.

However, when I try to read in the file, Python seems to allocate a lot more memory than the file needs on disk. So even with 1GB of RAM I'm not able to read the 500MB file into memory. My Python code for reading the file and printing some information about the memory consumption is:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

infile = open("links.csv", "r")

edges = []
count = 0
# count the total number of lines in the file
for line in infile:
    count = count + 1

total = count
print "Total number of lines: ", total

infile.seek(0)
count = 0
for line in infile:
    edge = tuple(map(int, line.strip().split(",")))
    edges.append(edge)
    count = count + 1
    # for every million lines print memory consumption
    if count % 1000000 == 0:
        print "Position: ", edge
        print "Read ", float(count) / float(total) * 100, "%."
        mem = sys.getsizeof(edges)
        for edge in edges:
            mem = mem + sys.getsizeof(edge)
            for node in edge:
                mem = mem + sys.getsizeof(node)

        print "Memory (Bytes): ", mem

The output I got was:

Total number of lines:  30609720
Position:  (9745, 2994)
Read  3.26693612356 %.
Memory (Bytes):  64348736
Position:  (38857, 103574)
Read  6.53387224712 %.
Memory (Bytes):  128816320
Position:  (83609, 63498)
Read  9.80080837067 %.
Memory (Bytes):  192553000
Position:  (139692, 1078610)
Read  13.0677444942 %.
Memory (Bytes):  257873392
Position:  (205067, 153705)
Read  16.3346806178 %.
Memory (Bytes):  320107588
Position:  (283371, 253064)
Read  19.6016167413 %.
Memory (Bytes):  385448716
Position:  (354601, 377328)
Read  22.8685528649 %.
Memory (Bytes):  448629828
Position:  (441109, 3024112)
Read  26.1354889885 %.
Memory (Bytes):  512208580

Already after reading only 25% of the 500MB file, Python consumes 500MB. So it seems that storing the content of the file as a list of tuples of ints is not very memory efficient. Is there a better way to do it, so that I can read my 500MB file into my 1GB of memory?

+7  A: 

There is a recipe for sorting files larger than RAM on this page, though you'd have to adapt it for your case involving CSV-format data. There are also links to additional resources there.

Edit: True, the file on disk is not "larger than RAM", but the in-memory representation can easily become much larger than available RAM. For one thing, your own program doesn't get the entire 1GB (OS overhead etc). For another, even if you stored this in the most compact form for pure Python (two lists of integers, assuming 32-bit machine etc), you'd be using 934MB for those 30M pairs of integers.
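As a back-of-the-envelope check of that 934MB figure (the 12-byte int and 4-byte pointer sizes are assumptions about 32-bit CPython 2.x, matching sys.getsizeof there):

pairs = 30609720                   # lines in links.csv
bytes_per_pair = 2 * (12 + 4)      # two int objects plus two list pointers
print "%.0f MiB" % (pairs * bytes_per_pair / 1024.0 / 1024.0)   # ~934 MiB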

Using numpy you can also do the job, with only about 250MB. Loading this way isn't particularly fast, since you have to count the lines and pre-allocate the array, but given that it's all in memory it may be the fastest actual sort:

import time
import numpy as np
import csv

start = time.time()
def elapsed():
    return time.time() - start

# count data rows, to preallocate array
f = open('links.csv', 'rb')
def count(f):
    while 1:
        block = f.read(65536)
        if not block:
             break
        yield block.count(',')

linecount = sum(count(f))
print '\n%.3fs: file has %s rows' % (elapsed(), linecount)

# pre-allocate array and load data into array
m = np.zeros(linecount, dtype=[('a', np.uint32), ('b', np.uint32)])
f.seek(0)
f = csv.reader(open('links.csv', 'rb'))
for i, row in enumerate(f):
    m[i] = int(row[0]), int(row[1])

print '%.3fs: loaded' % elapsed()
# sort in-place
m.sort(order='b')

print '%.3fs: sorted' % elapsed()

Output on my machine with a sample file similar to what you showed:

6.139s: file has 33253213 lines
238.130s: read into memory
517.669s: sorted

The default sort in numpy is quicksort. The ndarray.sort() routine (which sorts in-place) can also take the keyword argument kind="mergesort" or kind="heapsort", but it appears neither of these is capable of sorting a record array. Incidentally, I used a record array because it was the only way I could see to sort the two columns together, whereas the default would sort them independently (totally messing up your data).
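To see the difference on a toy example (the numbers here are made up purely for illustration, not taken from the question's data):

import numpy as np

plain = np.array([[3, 20], [1, 30], [2, 10]])
print np.sort(plain, axis=0)   # each column is sorted on its own -- the pairs get scrambled

rec = np.array([(3, 20), (1, 30), (2, 10)],
               dtype=[('a', np.uint32), ('b', np.uint32)])
rec.sort(order='b')
print rec                      # rows stay together, ordered by the 'b' field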

Peter Hansen
But my problem is about sorting, in memory, a file that is smaller than the available RAM.
asmaier
@asmaier, see edited answer with clarification of memory usage, and solution using numpy that may work for you.
Peter Hansen
Two questions about your solution: Why does one need to preallocate the array? Couldn't one simply use numpy.fromfile() to generate the array?
asmaier
If you don't preallocate, whatever method you use to load the data has to "grow" the array incrementally. That is always much less memory-efficient than allocating everything you need in a single step, which never has to copy the data from one array into another, larger one as it loads. As for fromfile(), if you can see a way for it to work, go ahead. It seems to take a single "sep" argument, so with `sep=","` you'll get only two integers loaded, and with `sep="\n"` you'll get only one... and it would still not know how much space to allocate in advance.
Peter Hansen
So I finally tried your solution and it works fine (Output: 21.736s: file has 30609720 rows | 213.738s: loaded | 507.188s: sorted) Thank you very much! But I still have a few questions: Why are you using `f.seek(0)` instead of `f.close()`? Why are you doing the line count using a generator? I removed the generator function and instead used `for line in f: linecount=linecount+1` and it was only slightly (1%) slower than your solution.
asmaier
If I did f.close(), I'd just need to reopen the file again. f.seek(0) was simply to avoid that (not that it's a problem either way). I did the generator approach thinking that reading blocks and using .count() would be faster (premature optimization! bad me...) but obviously that wasn't required. Your approach is perfectly fine too. Glad to be able to help.
Peter Hansen
+4  A: 

Since these are all just numbers, loading them into an Nx2 array would remove some overhead. Use NumPy for multidimensional arrays. Alternatively, you could use two normal python arrays to represent each column.
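For the Nx2 NumPy variant, a minimal sketch (numpy.loadtxt is the convenient route, though it is slower and hungrier while loading than the preallocated approach shown above):

import numpy as np

# Nx2 array of unsigned 32-bit ints, loaded straight from the CSV
m = np.loadtxt('links.csv', dtype=np.uint32, delimiter=',')
# sort by the second column while keeping each row intact
m = m[m[:, 1].argsort()]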

kwatford
+4  A: 

The cheapest way to store the input lines in memory is as array.array('i') elements -- assuming each number will fit in a signed 32-bit integer. The memory cost will be 8N bytes, where N is the number of lines.

Here is how to do the sort and write the output file in sorted order:

from array import array
import csv
a = array('i')
b = array('i')
for anum, bnum in csv.reader(open('input.csv', 'rb')):
    a.append(int(anum))
    b.append(int(bnum))
wtr = csv.writer(open('output.csv', 'wb'))
for i in sorted(xrange(len(a)), key=lambda x: b[x]):
    wtr.writerow([a[i], b[i]])

Unfortunately sorted() returns a list, not an iterator, and this list will be rather large: 4N bytes for pointers and 12N bytes for int objects, i.e. 16N bytes for the sorted() output. Note: this is based on CPython 2.X on a 32-bit machine; it gets worse for each of 3.X and 64-bit machines. All up that's 24N bytes. You have 31 million lines, so you need 31 * 24 = 744 MB ... looks like it should work; note that this calculation doesn't allow for any memory allocated by the sort, but you have a reasonable safety margin.

Aside: What is the cost of an extra GB or 3 of memory expressed in hours at your salary rate?

John Machin
+3  A: 

All Python objects have a memory overhead on top of the data they are actually storing. According to getsizeof on my 32-bit Ubuntu system, a tuple has an overhead of 32 bytes and an int takes 12 bytes, so each row in your file takes 56 bytes plus a 4-byte pointer in the list - I presume it will be a lot more on a 64-bit system. This is in line with the figures you gave and means your 30 million rows will take 1.8 GB.
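You can check those per-object figures yourself; the exact numbers below assume 32-bit CPython 2.x and will be larger on a 64-bit build:

import sys

row = (4757187, 59883)
print sys.getsizeof(row)      # the 2-tuple itself: ~32 bytes on 32-bit
print sys.getsizeof(row[0])   # one int object: ~12 bytes on 32-bit
# 32 + 2*12 = 56 bytes per row, plus a 4-byte pointer in the list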

I suggest that instead of using Python you use the unix sort utility. I am not a Mac-head, but I presume the OS X sort options are the same as the Linux version, so this should work:

sort -n -t, -k2 links.csv

-n means sort numerically

-t, means use a comma as the field separator

-k2 means sort on the second field

This will sort the file and write the result to stdout. You could redirect it to another file, or pipe it to your Python program for further processing.

edit: If you do not want to sort the file before you run your python script, you could use the subprocess module to create a pipe to the shell sort utility, then read the sorted results from the output of the pipe.
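A rough sketch of that subprocess approach, reusing the sort flags above (the body of the loop is just a placeholder for whatever processing you need):

import subprocess

proc = subprocess.Popen(['sort', '-n', '-t,', '-k2', 'links.csv'],
                        stdout=subprocess.PIPE)
for line in proc.stdout:
    a, b = map(int, line.split(','))
    # ... process the sorted pairs here ...
proc.wait()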

Dave Kirby
And for the benefit of Windows users: you can get a compatible sort.exe from the GnuWin32 project at http://gnuwin32.sourceforge.net/
John Machin
Just for sorting, your solution is definitely the fastest. In my case `sort` needed 450 seconds to sort and output my data to a file, whereas the python solution needed 1750s (and spent most of the time just writing the file). However `sort` used 440MB of RAM, whereas the python solution proposed by Peter Hansen only needed 240MB. And both solutions only used one core of my dual-core machine, so there is still a lot of room for improvement...
asmaier
A: 

You might want to look at mmap:

http://docs.python.org/library/mmap.html

It'll let you treat the file like a big array/string and will get the OS to handle shuffling data into and out of memory to let it fit.

So you could read in the csv file one line at a time, write out the results to a mmap'd file (in a suitable binary format), and then work on the mmap'd file. As the mmap'd file is only temporary, you could of course just create a tmp file for this purpose.

Here's some code that demos using mmap with a tempfile to read in csv data and store it as pairs of integers:


import sys
import mmap
import array
from tempfile import TemporaryFile

def write_int(buffer, i):
    # convert i to 4 bytes and write into buffer
    buffer.write(array.array('i', [i]).tostring())

def read_int(buffer, pos):
    # get the 4 bytes at pos and convert to integer
    offset = 4*pos
    return array.array('i', buffer[offset:offset+4])[0]

def get_edge(edges, lineno):
    pos = lineno*2
    i, j = read_int(edges, pos), read_int(edges, pos+1)
    return i, j

infile=open("links.csv", "r")

count=0
#count the total number of lines in the file
for line in infile:
    count=count+1

total=count
print "Total number of lines: ",total

infile.seek(0)

# make mmap'd file that's long enough to contain all data
# assuming two integers (4 bytes) per line
tmp = TemporaryFile()
file_len = 2*4*count
# increase tmp file size
tmp.seek(file_len-1)
tmp.write(' ')
tmp.seek(0)
edges = mmap.mmap(tmp.fileno(), file_len)

for line in infile:
    i, j=tuple(map(int,line.strip().split(",")))
    write_int(edges, i)
    write_int(edges, j)

# now confirm we can read the ints back out ok
for i in xrange(count):
    print get_edge(edges, i)

It's a bit rough though. Really you'd probably want to wrap all of that up in a nice class, so that your edges could be accessed in a way that makes them behave like a list (with indexing, len, etc.). Hopefully though it'll give you a starting point.
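For example, a minimal sketch of such a wrapper (EdgeList is a made-up name; it just reuses the get_edge helper and the mmap'd buffer from the code above):

class EdgeList(object):
    def __init__(self, buf, count):
        self.buf = buf        # the mmap'd buffer holding pairs of 4-byte ints
        self.count = count

    def __len__(self):
        return self.count

    def __getitem__(self, lineno):
        return get_edge(self.buf, lineno)

# usage: edge_list = EdgeList(edges, total); print len(edge_list), edge_list[0]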

John Montgomery
(1) Where's the bit where it does a sort? (2) Consider using struct.pack and struct.unpack instead of the array.array methods -- much less overhead (they handle 2 values in one function call, for a start). (3) No need for tuple(). (4) Should strip both parts AFTER the split.
John Machin
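For reference, a rough sketch of the struct-based variant from the comment above (write_edge/read_edge are made-up names, meant as drop-in replacements for the write_int/read_int helpers):

import struct

def write_edge(buffer, i, j):
    # pack both ints in one call instead of building an array.array per value
    buffer.write(struct.pack('ii', i, j))

def read_edge(buffer, lineno):
    offset = 8 * lineno
    return struct.unpack('ii', buffer[offset:offset + 8])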