I need to maintain a large list of picklable Python objects. The list is too large to fit in RAM, so some database/paging mechanism is required. The mechanism must support fast access to nearby regions of the list.
The list should implement the full Python-list interface, but most of the time I will work sequentially: scan some range of the list and, while scanning, decide whether to insert or pop some nodes at the scanning point.
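For illustration, the access pattern I have in mind looks roughly like this on a plain in-memory list (the `should_remove` and `make_node` predicates are hypothetical placeholders, not part of any library):

```python
def scan_and_edit(items, start, end, should_remove, make_node):
    """Scan items[start:end]; pop or insert nodes at the scan point.

    should_remove(node) -> True to drop the node under the cursor.
    make_node(node) -> a new node to insert after the cursor, or None.
    Both are hypothetical callbacks used only for this sketch.
    """
    index = start
    while index < end:
        if should_remove(items[index]):
            items.pop(index)   # remove the node under the cursor
            end -= 1           # the scanned range shrinks by one
        else:
            new = make_node(items[index])
            if new is not None:
                items.insert(index + 1, new)  # insert right after the cursor
                index += 1                    # skip the node we just added
                end += 1
            index += 1
    return items
```

On a plain list every such `pop`/`insert` is O(n); the whole point of the question is finding a disk-backed structure where these stay cheap.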
The list can be very large (2-3 GB) and should not be held in RAM all at once. The nodes are small (100-200 bytes) but can contain various types of data.
A good solution for this could be a BTree, where only the most recently accessed buckets are kept in RAM.
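To make that idea concrete, here is a minimal toy sketch of the caching scheme: the list is split into fixed-size chunks, only the most recently used chunks stay unpickled in RAM, and evicted chunks are stored as pickled blobs (a plain dict stands in for the disk file). The class name `ChunkedList` and all parameters are my own invention for illustration, not an existing library:

```python
import pickle

class ChunkedList(object):
    """Toy paged list: fixed-size chunks, LRU cache of live chunks.
    Evicted chunks are pickled into self.disk (a stand-in for a file)."""

    def __init__(self, chunk_size=1000, cached_chunks=4):
        self.chunk_size = chunk_size
        self.cached_chunks = cached_chunks
        self.disk = {}    # chunk_id -> pickled blob ("on disk")
        self.cache = {}   # chunk_id -> live list of nodes ("in RAM")
        self.lru = []     # chunk ids, least recently used first
        self.length = 0

    def _chunk(self, chunk_id):
        if chunk_id in self.cache:
            self.lru.remove(chunk_id)          # refresh LRU position
        else:
            blob = self.disk.get(chunk_id)
            self.cache[chunk_id] = pickle.loads(blob) if blob else []
        self.lru.append(chunk_id)
        while len(self.cache) > self.cached_chunks:
            old_id = self.lru.pop(0)           # evict least recently used
            self.disk[old_id] = pickle.dumps(self.cache.pop(old_id))
        return self.cache[chunk_id]

    def append(self, node):
        self._chunk(self.length // self.chunk_size).append(node)
        self.length += 1

    def __getitem__(self, index):
        return self._chunk(index // self.chunk_size)[index % self.chunk_size]

    def __setitem__(self, index, node):
        self._chunk(index // self.chunk_size)[index % self.chunk_size] = node

    def __len__(self):
        return self.length
```

Note that mid-list `insert`/`pop` are deliberately omitted: with fixed-size chunks they would force shifting data across every later chunk, which is exactly the problem a BTree-based list solves by letting buckets split and merge.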
Using SQL tables is not a good fit, since I would need to implement a complex index-key mechanism. My data is not a table; it's a simple Python list, with the added requirement of inserting elements at specific indexes and popping elements from specific positions.
I tried ZODB and zc.blist, which implement a BTree-based list that can be stored in a ZODB database file, but I don't know how to configure them so that the operations above run in reasonable time. I don't need the multithreading/transaction features; nothing will touch the database file except my single-threaded program.
Can anyone explain how to configure ZODB/zc.blist so the above operations run fast, or point me to a different large-list implementation?
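For reference, the kind of configuration I am experimenting with looks like this (the `cache_size` value and the commit batch size are guesses I am trying to tune, not recommendations):

```python
# Sketch of a ZODB + zc.blist setup for a single-threaded scan.
# cache_size below is the number of objects kept in ZODB's pickle
# cache per connection; the right value for this workload is exactly
# what I don't know how to choose.
import transaction
from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
from zc.blist import BList

storage = FileStorage('nodes.fs')
db = DB(storage, cache_size=100000)
connection = db.open()
root = connection.root()

if 'nodes' not in root:
    root['nodes'] = BList()
    transaction.commit()

nodes = root['nodes']

# Work in batches: commit periodically so the uncommitted delta stays
# bounded, and garbage-collect the connection cache between batches.
for i in range(50000):
    nodes.append(b'\x00' * 100)  # placeholder 100-byte node
transaction.commit()
connection.cacheGC()
```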
Some quick-and-dirty code that I tried:
import time
import random

NODE_JUMP = 50000    # nodes appended per iteration
NODE_ACCESS = 10000  # nodes rewritten per iteration

print 'STARTING'
random_bytes = open('/dev/urandom', 'rb')
my_list = list()
nodes_no = 0

while True:
    # extend the list with NODE_JUMP random 100-byte nodes
    nodes_no += NODE_JUMP
    start = time.time()
    my_list.extend(random_bytes.read(100) for i in xrange(NODE_JUMP))
    print 'extending to %s nodes took %.2f seconds' % (nodes_no, time.time() - start)

    # rewrite a random contiguous section of NODE_ACCESS nodes
    section_start = random.randint(0, nodes_no - NODE_ACCESS - 1)
    start = time.time()
    for index in xrange(section_start, section_start + NODE_ACCESS):
        # rotate the string
        my_list[index] = my_list[index][1:] + my_list[index][0]
    print 'access to %s nodes took %.2f seconds' % (NODE_ACCESS, time.time() - start)
The output ended with:
extending to 5000000 nodes took 3.49 seconds
access to 10000 nodes took 0.02 seconds
extending to 5050000 nodes took 3.98 seconds
access to 10000 nodes took 0.01 seconds
extending to 5100000 nodes took 2.54 seconds
access to 10000 nodes took 0.01 seconds
extending to 5150000 nodes took 2.19 seconds
access to 10000 nodes took 0.11 seconds
extending to 5200000 nodes took 2.49 seconds
access to 10000 nodes took 0.01 seconds
extending to 5250000 nodes took 3.13 seconds
access to 10000 nodes took 0.05 seconds
Killed (not by me)