The App Engine datastore, of course, has downtime. However, I'd like to have a "fail-safe" put which is more robust in the face of datastore errors (see motivation below). The task queue seems like an obvious place to defer writes when the datastore is unavailable, but I don't know of any other solutions (other than shipping the data off to a third party via urlfetch).

Motivation: I have an entity which really needs to be put in the datastore - simply showing an error message to the user won't do. For example, some side effect may have taken place which can't easily be undone (say, an interaction with a third-party site).

I've come up with a simple wrapper which (I think) provides a reasonable "fail-safe" put (see below). Do you see any problems with this, or have an idea for a more robust implementation? (Note: Thanks to suggestions posted in the answers by Nick Johnson and Saxon Druce, this post has been edited with some improvements to the code.)

import logging
from google.appengine.api.labs.taskqueue import taskqueue
from google.appengine.ext import db
from google.appengine.runtime.apiproxy_errors import CapabilityDisabledError

def put_failsafe(e, db_put_deadline=20, retry_countdown=60, queue_name='default'):
    """Tries to e.put().  On success, 1 is returned.  If this raises a db.Error
    or CapabilityDisabledError, then a task will be enqueued to try to put the
    entity (the task will execute after retry_countdown seconds) and 2 will be
    returned.  If the task cannot be enqueued, then 0 will be returned.  Thus a
    falsey value is only returned on complete failure.

    Note that since the taskqueue payloads are limited to 10kB, if the protobuf
    representing e is larger than 10kB then the put will be unable to be
    deferred to the taskqueue.

    If a put is deferred to the taskqueue, then it won't necessarily be
    completed as soon as the datastore is back up.  Thus it is possible that
    a deferred e.put() will occur *after* other, later puts (ones for which 1
    was returned).

    Ensure e's model is imported in the code which defines the task which tries
    to re-put e (so that e can be deserialized).
    """
    try:
        e.put(rpc=db.create_rpc(deadline=db_put_deadline))
        return 1
    except (db.Error, CapabilityDisabledError), ex1:
        try:
            taskqueue.add(queue_name=queue_name,
                          countdown=retry_countdown,
                          url='/task/retry_put',
                          payload=db.model_to_protobuf(e).Encode())
            logging.info('failed to put to db now, but deferred put to the taskqueue e=%s ex=%s' % (e, ex1))
            return 2
        except (taskqueue.Error, CapabilityDisabledError), ex2:
            logging.error('failed to put to db AND failed to defer the put to the taskqueue e=%s ex1=%s ex2=%s' % (e, ex1, ex2))
            return 0
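
For context, here's a minimal sketch of how a request handler might call this wrapper (the Order model and PlaceOrder handler are hypothetical names, and put_failsafe is assumed to be importable from wherever it is defined):

from google.appengine.ext import db, webapp

class Order(db.Model):  # hypothetical model
    status = db.StringProperty()

class PlaceOrder(webapp.RequestHandler):  # hypothetical handler
    def post(self):
        # ... interact with the third-party site, then record the outcome ...
        order = Order(status='confirmed')
        result = put_failsafe(order)
        if result == 1:
            self.response.out.write('Saved.')
        elif result == 2:
            self.response.out.write('Saved; it may take a little while to show up.')
        else:  # 0 => complete failure
            self.response.out.write('Could not save - please try again.')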

Request handler for the task:

from google.appengine.datastore import entity_pb
from google.appengine.ext import db, webapp

# IMPORTANT: This task deserializes entity protobufs.  To ensure that this is
#            successful, you must import any db.Model that may need to be
#            deserialized here (otherwise this task may raise a KindError).

class RetryPut(webapp.RequestHandler):
    def post(self):
        e = db.model_from_protobuf(entity_pb.EntityProto(self.request.body))
        e.put()  # failure will raise an exception => the task will be retried
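
For completeness, here's a rough sketch of how RetryPut might be wired up so that the /task/retry_put URL passed to taskqueue.add reaches it (the module layout is an assumption):

from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app

# The URL here must match the one passed to taskqueue.add in put_failsafe.
application = webapp.WSGIApplication([('/task/retry_put', RetryPut)])

def main():
    run_wsgi_app(application)

if __name__ == '__main__':
    main()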

I don't expect to use this for every put - most of the time, showing an error message is just fine. It is tempting to use it for every put, but I think sometimes it might be more confusing for the user if I tell them that their changes will appear later (and continue to show them the old data until the datastore is back up and the deferred puts execute).

+1  A: 

One potential issue is that tasks are limited to 10kb of data, so this won't work if you have an entity which is larger than that once pickled.
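
For example, one way to check whether an entity will fit before trying to defer it (a rough sketch using the protobuf encoding from the edited code above; the helper name and hard-coded 10kb figure are just illustrative):

from google.appengine.ext import db

def fits_in_task_payload(e, limit_bytes=10 * 1024):
    # True if the entity's encoded protobuf is small enough for a task payload.
    return len(db.model_to_protobuf(e).Encode()) <= limit_bytes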

Saxon Druce
Good point; thankfully I don't have to worry about this for the entities I'm using this for. But I'll update the code's docstring to reflect this limit.
David Underhill
+2  A: 

Your approach is reasonable, but has several caveats:

  • By default, a put operation will retry until it runs out of time. Since you have a backup strategy, you may want to give up sooner - in which case you should supply an rpc parameter to the put method call, specifying a custom deadline (a brief sketch follows this list).
  • There's no need to set an explicit countdown - the task queue will retry failing operations for you at increasing intervals.
  • You don't need to use pickle - Protocol Buffers have a natural string encoding which is much more efficient. See this post for a demonstration of how to use it.
  • As Saxon points out, task queue payloads are limited to 10 kilobytes, so you may have trouble with large entities.
  • Most importantly, this changes the datastore consistency model from 'strongly consistent' to 'eventually consistent'. That is, the put that you enqueued to the task queue could be applied at any time in the future, overwriting any changes that were made in the interim. Any number of race conditions are possible, essentially rendering transactions useless if there are puts pending on the task queue.
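
A minimal sketch of the custom-deadline suggestion from the first bullet, using the same db API the question already relies on (the 5-second value is just an example):

from google.appengine.ext import db

def put_with_deadline(entity, deadline_secs=5):
    # Give up on the synchronous put after deadline_secs rather than retrying
    # until the request runs out of time, so a fallback path can run sooner.
    rpc = db.create_rpc(deadline=deadline_secs)
    entity.put(rpc=rpc)
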
Nick Johnson
Thanks for the detailed feedback; I'll definitely incorporate these thoughts. The only reason I set a countdown is that I figured it would keep the task queue from immediately trying to re-put the entity - since it just failed, it should perhaps be given just a little time (maybe the default of 60s is too much) in case the problem is transient, such as a tablet splitting.
David Underhill