If I have a list (or array, dictionary, ...) in Python that could exceed the available memory address space (32-bit Python), what are the options and their relative speeds (other than not making a list that large)? Let me emphasize "could". The list could exceed the memory, but I have no way of knowing beforehand. If it started to exceed 75%, I would like to no longer keep the list in memory (or the new items, anyway); is there a way to convert to a file-based approach mid-stream?

What are the best file storage options in terms of read and write speed?

I just need to store a simple list of numbers. No need for random Nth-element access, just append- and pop-type operations.

+6  A: 

There are probably dozens of ways to store your list data in a file instead of in memory. How you choose to do it will depend entirely on what sort of operations you need to perform on the data. Do you need random access to the Nth element? Do you need to iterate over all elements? Will you be searching for elements that match certain criteria? What form do the list elements take? Will you only be inserting at the end of the list, or also in the middle? Is there metadata you can keep in memory with the bulk of the items on disk? And so on and so on.

One possibility is to structure your data relationally, and store it in a SQLite database.
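
For the append/pop pattern mentioned in the comments, a minimal SQLite sketch might look like this (the file and table names are illustrative, and error handling is omitted):

import sqlite3

# A disk-backed stack of numbers; one row per value.
conn = sqlite3.connect('numbers.db')
conn.execute('CREATE TABLE IF NOT EXISTS nums (id INTEGER PRIMARY KEY, n INTEGER)')

def push(n):
    conn.execute('INSERT INTO nums (n) VALUES (?)', (n,))

def pop():
    # remove and return the most recently inserted number
    row = conn.execute('SELECT id, n FROM nums ORDER BY id DESC LIMIT 1').fetchone()
    if row is None:
        raise IndexError('pop from empty list')
    conn.execute('DELETE FROM nums WHERE id = ?', (row[0],))
    return row[1]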

Ned Batchelder
Just need to store a simple list of numbers. No need for random Nth-element access, just append- and pop-type operations.
Vincent
A: 

Modern operating systems will handle this for you without you having to worry about it. It's called virtual memory.

apphacker
This is true, but it's just really slow. Plus even virtual memory is limited (if he had an incredibly big data set).
TM
list(combinations(range(1140), 17)) fills the available memory address space (Python error); I'm not sure what is happening in the VM. As far as I can tell it never uses it. I have more than 4 GB of physical memory, but Python never fills it, since 32-bit has a 4 GB address-space limit. I'm not sure what happens when the limit is reached, other than that I get an error.
Vincent
One of the things you lose (if you ever had it in the first place) with garbage-collected languages is the ability to let the virtual memory system "just handle this for you". 1/ The locality of the cons cells in your structure is not guaranteed. 2/ The GC will access them even if you don't. Virtual memory never worked great for implementing sparsely accessed data structures unless you were extremely lucky, but for an arbitrary data structure in Python, even luck will not suffice.
Pascal Cuoq
Oh, and I forgot, the question clearly mentions exhaustion of the **address space**. Virtual memory only helps with exhaustion of **physical memory** (assuming that the address space is larger than the physical memory).
Pascal Cuoq
Two comments. (0) Can you use 64-bit Python to get a truly large address space, and rely on virtual memory? (It's hard to believe you could beat virtual memory by manually paging your data to files.) (1) Do you *have* to call `list()` on `combinations()`? If you can use lazy evaluation of your iterators, can you avoid filling RAM?
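For example, a minimal sketch of the lazy approach (with `process` standing in for whatever you do with each combination):

from itertools import combinations

for combo in combinations(range(1140), 17):
    process(combo)  # handle each item as it is generated; nothing is accumulated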
steveha
@steveha I am looking into 64-bit Python. I am running OS X 10.6, and I am not sure how to get 64-bit Python 3.1 installed. I guess I would need to build it on my system, and I would have to stumble through that.
Vincent
Thanks for clearing that up! I am glad I made an incorrect comment, I learned something from it.
apphacker
+5  A: 

The answer is very much "it depends".

What are you storing in the lists? Strings? Integers? Objects?

How often is the list written to compared with being read? Are items only appended on the end, or can entries be modified or inserted in the middle?

If you are only appending to the end then writing to a flat file may be the simplest thing that could possibly work.

If you are storing objects of variable size such as strings then maybe keep an in-memory index of the start of each string, so you can read it quickly.

If you want dictionary behaviour then look at the db modules - dbm, gdbm, bsddb, etc.

If you want random-access writes, then a SQL database may be better.

Whatever you do, going to disk is going to be orders of magnitude slower than in-memory, but without knowing how the data is going to be used it is impossible to be more specific.

edit: From your updated requirements I would go with a flat file and keep an in-memory buffer of the last N elements.
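
A minimal sketch of that flat-file approach, assuming fixed-size integers packed with the struct module (the format, names, and buffer size are illustrative):

import os
import struct

FMT = '<q'                       # one signed 64-bit integer per record
RECSIZE = struct.calcsize(FMT)

class FileBackedStack(object):
    def __init__(self, path, bufsize=1000):
        self.f = open(path, 'w+b')
        self.buf = []            # in-memory tail: the newest elements
        self.bufsize = bufsize

    def append(self, n):
        self.buf.append(n)
        if len(self.buf) > self.bufsize:
            # spill everything but the newest item to the end of the file
            self.f.seek(0, os.SEEK_END)
            spill, self.buf = self.buf[:-1], self.buf[-1:]
            self.f.write(b''.join(struct.pack(FMT, m) for m in spill))

    def pop(self):
        if self.buf:
            return self.buf.pop()
        self.f.seek(0, os.SEEK_END)
        end = self.f.tell()
        if end == 0:
            raise IndexError('pop from empty stack')
        # read back and discard the last record on disk
        self.f.seek(end - RECSIZE)
        (n,) = struct.unpack(FMT, self.f.read(RECSIZE))
        self.f.truncate(end - RECSIZE)
        return n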

Dave Kirby
+1  A: 

You might want to consider a different kind of structure: not a list, but figuring out how to do (your task) with a generator or a custom iterator.
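
For instance, if the numbers can be computed on demand, a generator never materializes the whole sequence (a sketch, with a placeholder computation):

def numbers():
    for i in range(1140):
        yield i * i          # produce each value on demand

total = sum(numbers())       # consumes the stream one item at a time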

RyanWilcox
I am trying, but I am asking the question for more general application.
Vincent
+2  A: 

Did you check the shelve Python module, which is based on pickle?

http://docs.python.org/library/shelve.html
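
A minimal usage sketch (the filename and keys are illustrative):

import shelve

db = shelve.open('numbers.shelf')
db['chunk0'] = [1, 2, 3]     # values are pickled to disk transparently
print(db['chunk0'])          # -> [1, 2, 3]
db.close()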

ralu
Thanks, that might work nicely.
Vincent
+4  A: 

If your "numbers" are simple-enough ones (signed or unsigned integers of up to 4 bytes each, or floats of 4 or 8 bytes each), I recommend the standard library array module as the best way to keep a few millions of them in memory (the "tip" of your "virtual array") with a binary file (open for binary R/W) backing the rest of the structure on disk. array.array has very fast fromfile and tofile methods to facilitate the moving of data back and forth.

I.e., basically, assuming for example unsigned-long numbers, something like:

import array
import os

# no more than 100 million items in memory at a time
MAXINMEM = int(1e8)

class bigarray(object):
    def __init__(self):
        # binary mode is required for array.tofile/fromfile
        self.f = open('afile.dat', 'w+b')
        self.a = array.array('L')
    def append(self, n):
        self.a.append(n)
        if len(self.a) >= MAXINMEM:
            # spill a full chunk of exactly MAXINMEM items to disk
            self.a.tofile(self.f)
            del self.a[:]
    def pop(self):
        if not len(self.a):
            # reload the most recent chunk from the end of the file
            try:
                self.f.seek(-self.a.itemsize * MAXINMEM, os.SEEK_END)
            except IOError:
                return self.a.pop()  # file empty: raise the normal IndexError &c
            self.a.fromfile(self.f, MAXINMEM)
            # drop the chunk we just reloaded from the file
            self.f.seek(-self.a.itemsize * MAXINMEM, os.SEEK_END)
            self.f.truncate()
        return self.a.pop()

Of course you can add other methods as necessary (e.g. keep track of the overall length, add extend, whatever), but if pop and append are indeed all you need, this should serve.
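
A quick usage sketch:

big = bigarray()
for n in range(10):
    big.append(n)
print(big.pop())   # -> 9 (popped straight from the in-memory tip)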

Alex Martelli
This is very nice.
Vincent
@Vincent, thanks, glad you like it!
Alex Martelli
A: 

What about a document-oriented database?
There are several alternatives; I think the best-known one currently is CouchDB, but you can also go for Tokyo Cabinet or MongoDB. The last has the advantage of Python bindings directly from the main project, without requiring any additional module.
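
For example, a minimal MongoDB sketch using the pymongo driver against a local server (database and collection names are illustrative):

from pymongo import MongoClient

coll = MongoClient().testdb.numbers          # assumes a local mongod is running
coll.insert_one({'n': 42})
doc = coll.find_one_and_delete({}, sort=[('_id', -1)])  # pop-like: newest first
print(doc['n'])                              # -> 42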

Roberto Liffredo
+2  A: 

Well, if you are looking for speed and your data is numerical in nature, you could consider using numpy and PyTables or h5py. From what I remember, the interface is not as nice as simple lists, but the scalability is fantastic!
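
A minimal h5py sketch of an appendable on-disk array (file and dataset names are illustrative):

import numpy as np
import h5py

f = h5py.File('numbers.h5', 'w')
# a resizable 1-D dataset that lives on disk and grows on demand
dset = f.create_dataset('nums', shape=(0,), maxshape=(None,),
                        dtype='i8', chunks=True)

def append_block(values):
    n = dset.shape[0]
    dset.resize((n + len(values),))
    dset[n:] = values

append_block(np.arange(1000))
print(dset[-1])              # -> 999
f.close()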

Shane Holloway