There is much that is confusing here, which makes it more difficult to answer this question:
- The IPython requirement. Why do you need to process such large data files from within IPython instead of a stand-alone script?
- The tmpfs RAM disk. I read your question as implying that you read all of your input data into memory at once in Python. If that is the case, then Python allocates its own buffers to hold all the data anyway, and the tmpfs filesystem only buys you a performance gain if you reload the data from the RAM disk many, many times.
- Mentioning IncPy. If your performance issues are something you could solve with memoization, why can't you just manually implement memoization for the functions where it would help most? (See the sketch right after this list.)
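For example, a hand-rolled memoization decorator is only a few lines. This is a minimal sketch; the names memoize and expensive_step are illustrative, not anything from your code:

def memoize(func):
    cache = {}
    def wrapper(*args):
        # Compute each distinct argument tuple only once, then reuse the result.
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

@memoize
def expensive_step(n):
    # Stand-in for whatever computation dominates your runtime.
    return sum(i * i for i in range(n))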
So. If you actually need all the data in memory at once -- if your algorithm reprocesses the entire dataset multiple times, for example -- I would suggest looking at the mmap module. That will provide the data in raw bytes instead of unicode objects, which might entail a little more work in your algorithm (operating on the encoded data, for example), but will use a reasonable amount of memory. Reading the data into Python unicode objects all at once will require roughly 2x or 4x as much RAM as it occupies on disk (assuming mostly-ASCII UTF-8 data), because each character is stored as 2 or 4 bytes depending on whether your Python build uses UCS-2 or UCS-4 internally.
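A minimal sketch of that approach, assuming your matching can run directly over raw bytes (inpath is your input path, and the b'needle' pattern is just a stand-in for whatever you are actually searching for):

import mmap

with open(inpath, 'rb') as f:
    # Map the whole file read-only; the OS pages it in on demand, so the
    # data is never copied into Python-owned buffers all at once.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # mmap objects support bytes-style operations such as find() and
        # slicing, so you can scan for encoded patterns directly.
        count, pos = 0, mm.find(b'needle')
        while pos != -1:
            count += 1
            pos = mm.find(b'needle', pos + 1)
    finally:
        mm.close()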
If your algorithm simply does a single linear pass over the data (as does the Aho-Corasick algorithm you mention), then you'd be far better off just reading in a reasonably sized chunk at a time:
import codecs

# Stream the file in fixed-size chunks instead of loading it all at once.
with codecs.open(inpath, encoding='utf-8') as f:
    data = f.read(8192)
    while data:
        process(data)
        data = f.read(8192)
I hope this at least gets you closer.