Using Google App Engine, is it possible to initialize a globally accessible singleton on app startup? I have a large static tree structure that I need on every request and want to initialize beforehand. The tree structure is too large (20+ MB) to be put into memcache, and I am trying to figure out what other alternatives I have.

EDIT: To add some clarity based on the answers I've received so far: I'm loading a dictionary of words into a trie/prefix tree structure. The trie is immutable, since the dictionary of words is fixed. I'm generating anagrams from an input string of characters, so a single request may touch a fair amount of the trie, possibly more than 1 MB, though I'm not certain.

Here is the Python structure that I'm loading the dictionary of words into:

class Node(object):
    """One node of the trie: a letter, an end-of-word flag, and children."""

    def __init__(self, letter='', final=False):
        self.letter = letter
        self.final = final        # True if a word ends at this node
        self.children = {}        # maps letter -> child Node

    def add(self, letters):
        """Insert a word, creating nodes as needed."""
        node = self
        for index, letter in enumerate(letters):
            if letter not in node.children:
                node.children[letter] = Node(letter)
            node = node.children[letter]
            if index == len(letters) - 1:
                # Mark the end of the word even if the node already existed
                # (e.g. adding 'car' after 'cart').
                node.final = True
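
For illustration, here is a rough sketch of how a trie like this might be built and then walked to generate anagrams (build_trie, anagrams, and the tiny word list are placeholders for my real data):

def build_trie(words):
    root = Node()
    for word in words:
        root.add(word)
    return root

def anagrams(node, letters, prefix=''):
    # Yield every dictionary word spellable from (a subset of) the given
    # letters, consuming one letter per step down the trie.
    if node.final:
        yield prefix
    seen = set()
    for index, letter in enumerate(letters):
        if letter in seen:          # skip duplicate letters at this level
            continue
        seen.add(letter)
        child = node.children.get(letter)
        if child is not None:
            rest = letters[:index] + letters[index+1:]
            for word in anagrams(child, rest, prefix + letter):
                yield word

root = build_trie(['cat', 'act', 'tea', 'eat', 'ate'])
print sorted(anagrams(root, 'tac'))   # ['act', 'cat']
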
+3  A: 

Each request might be served from a completely different process, on a different server, possibly even in a different datacenter (maybe on another continent). Nothing is guaranteed to be "globally accessible" to the handlers of different requests to the same app except the datastore (even memcache's entries might disappear at any time if things get too busy: it's a cache, after all!-).

Perhaps you can keep your "static tree structure" in a data file that you upload together with the application's code, and access it from disk in lieu of "initializing".
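
For example, something along these lines (a rough sketch; the file name words.dat and the parse_tree deserializer are placeholders for whatever format you choose):

import os

_TREE = None   # one copy per runtime instance, loaded lazily

def get_tree():
    # Read the data file shipped with the app on the first call in this
    # instance, then reuse the parsed structure for later requests.
    global _TREE
    if _TREE is None:
        path = os.path.join(os.path.dirname(__file__), 'words.dat')
        data = open(path, 'rb').read()
        _TREE = parse_tree(data)   # parse_tree: your own deserializer
    return _TREE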

Edit: as requested, here's a rough-and-ready example of the "lightweight class mapping a tree into an array" approach I mentioned in a comment -- neither tuned nor thoroughly tested. As the example I'm taking a binary search tree with integer payloads, and assuming that for some reason it's important to keep exactly the same structure in the "light" tree as in the "heavy" tree it represents. Even with these simplifications it's still a lot of code, but here it comes:

import array
import random

def _doinsert(tree, payload):
  if tree is None: return HeavyTree(payload)
  tree.insert(payload)
  return tree

# An ordinary pointer-based binary search tree (the "heavy" representation).
class HeavyTree(object):
  def __init__(self, payload):
    self.payload = payload
    self.left = self.right = None
  def insert(self, other):
    if other <= self.payload:
      self.left = _doinsert(self.left, other)
    else:
      self.right = _doinsert(self.right, other)
  def walk(self):
    if self.left:
      for x in self.left.walk(): yield x
    yield self.payload
    if self.right:
      for x in self.right.walk(): yield x
  def walknodes(self):
    yield self
    if self.left:
      for x in self.left.walknodes(): yield x
    if self.right:
      for x in self.right.walknodes(): yield x

data = [random.randint(0, 99) for _ in range(9)]
print 'data: ',
for x in data: print x,
print
theiter = iter(data)
thetree = HeavyTree(next(theiter))
for x in theiter: thetree.insert(x)

print
print 'Heavy tree:'
print 'nodes:',
for x in thetree.walknodes(): print x.payload,
print
print 'inord:',
for x in thetree.walk(): print x,
print

# LightTree keeps the same logical structure as HeavyTree but stores it in a
# flat array: each node occupies three consecutive cells -- payload, offset
# of the left child, offset of the right child (-1 meaning "no child") --
# so the walk/walknodes methods inherited from HeavyTree work unchanged.
class LightTree(HeavyTree):
  def __init__(self, base, offset):
    self.base = base
    self.offset = offset
  @property
  def payload(self):
    return self.base[self.offset]
  @property
  def left(self):
    return self._astree(self.offset+1)
  @property
  def right(self):
    return self._astree(self.offset+2)
  def _astree(self, i):
    offset = self.base[i]
    if offset < 0: return None
    return LightTree(self.base, offset)

def heavy_to_light(heavy):
  # Assign each heavy node a 3-cell slot, then fill in payloads and child
  # offsets (-1 where a child is missing) to build the flat representation.
  for i, node in enumerate(heavy.walknodes()):
    node.id = i * 3
  base = array.array('l', (i+1) * 3 * [-1])
  for node in heavy.walknodes():
    base[node.id] = node.payload
    if node.left: base[node.id+1] = node.left.id
    if node.right: base[node.id+2] = node.right.id
  return LightTree(base, 0)

print
print 'Light tree:'
light = heavy_to_light(thetree)
print 'nodes:',
for x in light.walknodes(): print x.payload,
print
print 'base :',
for x in light.base: print x,
print
print 'inord:',
for x in light.walk(): print x,
print

A typical run would show:

data:  27 79 90 60 82 80 3 94 76

Heavy tree:
nodes: 27 3 79 60 76 90 82 80 94
inord: 3 27 60 76 79 80 82 90 94

Light tree:
nodes: 27 3 79 60 76 90 82 80 94
base : 27 3 6 3 -1 -1 79 9 15 60 -1 12 76 -1 -1 90 18 24 82 21 -1 80 -1 -1 94 -1 -1
inord: 3 27 60 76 79 80 82 90 94

The details will of course vary on every run, since the data are generated randomly.

Maybe this sort of thing is just too cumbersome for anybody who didn't start with good old Fortran (and thus inevitably learned how to represent logical pointers as indices into an array), as I did back in EE school many decades ago;-). But loading such arrays from a file straight into memory is blazingly fast (compared to unpickling and the like)...!-)
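
To illustrate that last point, the flat base array of a LightTree can be dumped and reloaded with array's own binary I/O -- a sketch (the dump would happen offline, and only the reload would run on the server, since App Engine can't write local files):

import os

# Offline: dump the flat representation to a binary file.
f = open('tree.bin', 'wb')
light.base.tofile(f)
f.close()

# On the server: reload it in one shot -- no unpickling involved.
n = os.path.getsize('tree.bin') // array.array('l').itemsize
base = array.array('l')
f = open('tree.bin', 'rb')
base.fromfile(f, n)
f.close()
light2 = LightTree(base, 0)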

Alex Martelli
Thanks, Alex! I tried putting the static tree structure in a data file by pickling it, but unpickling turns out to be as slow as just building the tree in the first place. Any other suggestions?
jamesaharvey
@james, depends on what the tree must contain -- e.g. if it was all ints, or all strings, it could be mapped into an `array.array` (using ints or sequences of bytes as offsets/indices) wrapped into a very lightweight class; if things are a bit more complicated, the class won't be as lightweight and it may need a few files/arrays as underlying data, but this general approach can be stretched quite far. Point is, loading an `array.array` from a file is "as fast as it gets" -- basically the same time as reading the file's bytes.
Alex Martelli
Alex, do you have a code example using array.array with strings in a lightweight class, as you described above? I'm still fairly new to Python. Thanks!
jamesaharvey
@james, OK, just edited my answer to add an example (for a simple case, and yet it's a lot of code of course).
Alex Martelli
Thank you very much for the code sample, Alex. I'll give it a try.
jamesaharvey
+2  A: 

I think Google gives you 300 MB of local memory for each instance running your project, so all you have to do is store the tree structure in a module-level variable.

Whenever Google spins up a new process for your app, it will run the code that builds your tree once, and then every future request handled by that process can access it. Just make sure that building the tree takes less than 30 seconds, because it has to happen within the deadline of whichever request causes Google to spin up the new process.
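
For example, something like this in a module of its own (a sketch; the Node import, words.txt, and the module name are assumptions):

# trie_data.py -- App Engine runs this module-level code once per process,
# the first time the module is imported; later requests served by the same
# process just reuse TRIE.
from anagram import Node   # wherever the Node class from the question lives

TRIE = Node()
for _line in open('words.txt'):   # word list uploaded with the app
    TRIE.add(_line.strip())

Request handlers then simply do "from trie_data import TRIE".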

Liron Shapira
+3  A: 

How much of this tree do you need to access on a single request? In what manner do you query it? Does it ever change?

If it's immutable, you don't really need a 'singleton' - which implies mutability - just a way to access the data on each instance. Depending on how you need to access it, you could store it as a data file, a blob, or as data in the datastore.

Nick Johnson
My overall problem is that the data that I need to access on each instance is over 1MB in size. Any other suggestions? I'm currently populating a variable in a module that I can access globally.
jamesaharvey
As others have pointed out, you don't need to deserialize your data for each request - just for the first request to a new instance. Why not do that, and store it as a module member? Also, you may want to convert your trie into a DAWG to save space.
Nick Johnson
I am using a module member at this point, but I can't fully initialize it due to GAE memory constraints, so I load up chunks of the trie as needed which works ok. I've done away with deserialization as well. I'll also attempt converting my trie to a DAWG. Thanks for your help, Nick!
jamesaharvey
A: 

Memcache and the datastore are your best bets for truly global access across all your instances. However, a global variable can still be used to cache your data structure within each instance. Wouldn't a trie be fairly easy to break into chunks small enough that each fits in memcache? Once the trie is in memcache, every time you fetch a chunk of it you can also store that chunk in a global variable on the instance that fetched it. Over the course of a few requests you will build up a full copy of the trie on each instance you have running. This is a bit complicated, but could ultimately give you the best performance.
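
A sketch of that idea, assuming the trie is split into one chunk per first letter (the chunk keys and load_chunk_from_datastore are placeholders):

from google.appengine.api import memcache

_chunks = {}   # per-instance cache; survives across requests in this process

def get_chunk(letter):
    # Check the instance-local cache first, then memcache, then the slow path.
    chunk = _chunks.get(letter)
    if chunk is None:
        chunk = memcache.get('trie-' + letter)
        if chunk is None:
            chunk = load_chunk_from_datastore(letter)   # placeholder loader
            memcache.set('trie-' + letter, chunk)
        _chunks[letter] = chunk
    return chunk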

Peter Recore