I need to maintain a large list of pickleable Python objects. The list is too large to fit entirely in RAM, so some database/paging mechanism is required. The mechanism needs to support fast access to close (nearby) areas in the list.

The list should implement all the Python list features, but most of the time I will work sequentially: scan some range in the list and, while scanning, decide whether to insert or pop some nodes at the scanning point.

The list can be very large (2-3 GB) and should not be held in RAM all at once. The nodes are small (100-200 bytes) but can contain various types of data.

A good solution for this could be a BTree, where only the most recently accessed buckets are loaded in RAM.

Using SQL tables is not a good fit, since I would need to implement a complex index-key mechanism. My data is not a table; it's a simple Python list, with the ability to add elements at specific indexes and pop elements from specific positions.

I tried ZODB and zc.blist, which implement a BTree-based list that can be stored in a ZODB database file, but I don't know how to configure them so the above features run in reasonable time. I don't need the multi-threading/transaction features; nothing will touch the database file except my single-threaded program.

Can anyone explain how to configure ZODB/zc.blist so the above features run fast, or suggest a different large-list implementation?

Some quick-and-dirty code that I tried:

import time
import random

NODE_JUMP = 50000    # nodes appended per iteration
NODE_ACCESS = 10000  # nodes touched per iteration

print 'STARTING'

random_bytes = open('/dev/urandom', 'rb')

my_list = list()

nodes_no = 0

while True:
    nodes_no += NODE_JUMP
    start = time.time()
    my_list.extend(random_bytes.read(100) for i in xrange(NODE_JUMP))
    print 'extending to %s nodes took %.2f seconds' % (nodes_no, time.time() - start)

    # pick a random section and modify every node in it
    section_start = random.randint(0, nodes_no - NODE_ACCESS - 1)
    start = time.time()
    for index in xrange(section_start, section_start + NODE_ACCESS):
        # rotate the string
        my_list[index] = my_list[index][1:] + my_list[index][0]

    print 'access to %s nodes took %.2f seconds' % (NODE_ACCESS, time.time() - start)

The output ended with:

extending to 5000000 nodes took 3.49 seconds
access to 10000 nodes took 0.02 seconds
extending to 5050000 nodes took 3.98 seconds
access to 10000 nodes took 0.01 seconds
extending to 5100000 nodes took 2.54 seconds
access to 10000 nodes took 0.01 seconds
extending to 5150000 nodes took 2.19 seconds
access to 10000 nodes took 0.11 seconds
extending to 5200000 nodes took 2.49 seconds
access to 10000 nodes took 0.01 seconds
extending to 5250000 nodes took 3.13 seconds
access to 10000 nodes took 0.05 seconds
Killed (not by me)
A: 

I think ZODB is the tool to use. It will store lots of arbitrary items, and it handles the memory issues for you.

Here is a working example. In this case the objects reference each other, as well as being stored in a BTree keyed by number.

import random
from collections import deque

import ZODB
from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
import transaction
import persistent
import BTrees.OOBTree

def random_string(n=100):
    return ''.join([chr(random.randint(0,95)+32) for i in xrange(n)]) 


class Node(persistent.Persistent):
    def __init__(self, value=None):
        # keep the given value, or generate a random string if none was given
        self.value = value if value is not None else random_string()

    def setNeighbors(self, refs):
        self.p1 = refs[0]
        self.p2 = refs[1]
        self.p3 = refs[2]
        self.p4 = refs[3]


def getTree():
    storage = FileStorage('c:\\test.zdb')
    db = DB(storage)
    connection = db.open()
    root = connection.root()
    if root.has_key('tree'):
        tree = root['tree']
    else:
        tree = BTrees.OOBTree.OOBTree()
        root['tree'] = tree
        transaction.commit()
    return tree


def fillDB():
    tree = getTree()

    # start with some initial objects.
    nodes = deque([Node(), Node(), Node(), Node()])
    node = Node()

    for n in xrange(20000):
        tree[n] = node           # store the node based on a numeric ID
        node.setNeighbors(nodes) # Make the node refer to more nodes.
        node = nodes.popleft()   # maintain our list of 4 upcoming nodes.
        nodes.append(Node())
        if n % 1000 == 0:
            transaction.commit() # Must commit for data to make it to disk.
            print n
    transaction.commit()
    return tree

At this point the tree variable basically works like a dictionary and can be accessed by key. You can also get the keys in a range using tree.keys(min, max), as described in the ZODB BTrees API documentation.
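
For example, here is a minimal sketch of a range scan over the tree that fillDB builds (keys(min, max) and items(min, max) are standard BTrees methods; the key range 1000-1099 is just an example):

tree = getTree()

# iterate over the keys that fall in [1000, 1099] (both ends inclusive)
for key in tree.keys(1000, 1099):
    print key, tree[key].value[:10]

# items(min, max) yields (key, node) pairs for the same range
for key, node in tree.items(1000, 1099):
    node.value = node.value[1:] + node.value[0]  # rotate, as in the question

transaction.commit()  # persist the modified nodes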

You can store each of your lists under a different key in the root object returned by the ZODB. The root object serves as the "gateway" to the ZODB object storage.
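
For instance, a rough sketch of keeping several independent trees under different root keys (the key names 'list_a' and 'list_b' are placeholders, and connection is an open ZODB connection as in getTree above):

root = connection.root()

# each top-level structure lives under its own key in the root mapping
if 'list_a' not in root:
    root['list_a'] = BTrees.OOBTree.OOBTree()
if 'list_b' not in root:
    root['list_b'] = BTrees.OOBTree.OOBTree()

root['list_a'][0] = Node()
root['list_b'][0] = Node()
transaction.commit()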

Thanks to the ZODB, you can also use inter-object references as well as the BTree index. For example:

tree = getTree()

node1 = tree[1]
print node1.p1.p1.p1.p1.p1.p1.p1.p1.p1.value
Joe Koberg
I admit it's a very low-level description. I'll fix the question to make it clearer.
Oren
I don't have any sample code here. I'll try to write something now.
Oren
That's not quite what I meant. The nodes don't have 4 pointers each; there are 4 "accessing-nodes" outside the list (4 in total, not 4 for every node). The new explanation of what I need is better. Is there a way to skip the "commits"?
Oren
* I meant 4 "accessing-pointers".. blegh
Oren
Do you know whether the ZODB tree index nodes (any tree node except the buckets at the bottom of the tree) are always stored in RAM? If they are, then I can use the "deque" trick above to ensure fast sequential access. For example, when I want nodes[314] I can add nodes[315:(315+100)/100*100] to the deque. Then all accesses to nodes[315], nodes[316], nodes[317]... will be very fast.
Oren
But then when you insert a node between 315 and 316, what will the index be? 315.5? (OK, maybe, but you'll hit the floating-point precision limit with enough items.) It's easy to have a doubly-linked list of nodes that just reference each other, and to traverse it and insert/remove nodes - but then some of the list operations become expensive, like len() (unless you keep a count attribute updated) or random access by index.
Joe Koberg
But yes, the ZODB keeps an object cache, and frequently used interior nodes should stay in it. I'm not aware of the cache eviction policy, so I can't answer for sure.
Joe Koberg
You must commit... otherwise how would the system know when to start writing? Without commits, every attribute modification would have to generate an I/O.
Joe Koberg
Oh, that's another thing about ZODB FileStorage... it's append-only storage, so even deletions cause it to grow until you "pack" the file. But you get atomicity and history features "for free".
Joe Koberg
A: 

How about Tokyo Cabinet? It's very fast and simple, like lists, but built for what you want. http://1978th.net/tokyocabinet/

John
Is there a Python version?
Oren
There might be: http://stackoverflow.com/questions/601865/python-table-engine-binding-for-tokyo-cabinet
Jon-Eric
I think he will have the same "index-key" problem mentioned with SQL.
Joe Koberg
+2  A: 

Using zc.blist can bring good results after all. Setting the "cache_size" option when creating the DB controls the size of the data that will remain in RAM, although the RAM actually used can grow larger if you don't call "transaction.commit" often enough. With a large cache_size and frequent transaction.commit calls, the most recently accessed buckets of the BList stay in RAM, giving you fast access to them, while the amount of RAM used doesn't grow too much.

Packing is very expensive, though; with a large hard disk you don't have to do it that often anyway.
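
As a side note, pack() can also keep some recent history. A minimal sketch, assuming db is the ZODB DB instance created in the code below (the seven-day window is just an example):

# discard old object revisions but keep the last 7 days of history;
# db.pack() with no arguments packs away everything but the current data
db.pack(days=7)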

Here is some code to try yourself. Run "top" in the background and change cache_size to see how it affects the amount of RAM used.

import time
import os
import glob
from ZODB import DB
from ZODB.FileStorage import FileStorage
import transaction
from zc.blist import BList

print('STARTING')

random = open('/dev/urandom', 'rb')  # raw random bytes used as element payloads


def test_list(my_list, loops = 1000, element_size = 100):
    print('testing list')
    start = time.time()
    for loop in xrange(loops):
        my_list.append(random.read(element_size))
    print('appending %s elements took %.4f seconds' % (loops, time.time() - start))

    start = time.time()
    length = len(my_list)
    print('length calculated in %.4f seconds' % (time.time() - start,))

    start = time.time()
    for loop in xrange(loops):
        my_list.insert(length / 2, random.read(element_size))
    print('inserting %s elements took %.4f seconds' % (loops, time.time() - start))

    start = time.time()
    for loop in xrange(loops):
        my_list[loop] = my_list[loop][1:] + my_list[loop][0]
    print('modifying %s elements took %.4f seconds' % (loops, time.time() - start))

    start = time.time()
    for loop in xrange(loops):
        del my_list[0]
    print('removing %s elements took %.4f seconds' % (loops, time.time() - start))

    start = time.time()
    transaction.commit()
    print('committing all above took %.4f seconds' % (time.time() - start,))

    del my_list[:loops]
    transaction.commit()

    start = time.time()
    pack()
    print('packing after removing %s elements took %.4f seconds' % (loops, time.time() - start))

# remove leftover database files from previous runs
for filename in glob.glob('database.db*'):
    try:
        os.unlink(filename)
    except OSError:
        pass

# cache_size is the number of objects ZODB keeps in the per-connection cache
db = DB(FileStorage('database.db'), cache_size=2000)

def pack():
    db.pack()

root = db.open().root()

root['my_list'] = BList()

print('inserting initial data to blist')

for loop in xrange(10):
    root['my_list'].extend(random.read(100) for x in xrange(100000))
    transaction.commit()

transaction.commit()

test_list(root['my_list'])
Oren
+1 for posting the working code after finding a solution!
Joe Koberg