I've built something on Google App Engine that acts as a backend for an iPhone app. In the app, there are interactions that are pushed out to social networks via their APIs. So the typical workflow is like this:

  1. User uses the iPhone app to do "something"
  2. App Engine app is alerted via HTTP
  3. App Engine alerts social network that user did "something." If the user were to check their profile on that network, their activity via the app would be displayed. So, as far as the user is concerned, what they did probably worked.
  4. App Engine needs to do some persistence of its own, but when it tries, a DatastoreTimeoutException is thrown. And now the data is in a funky state.

So what's a good way to handle this? By the nature of the problem I'd love to wrap it in a "transaction", but there's no way to roll back what got sent to the social network. So I'm thinking more along the lines of: how do you handle a DatastoreTimeoutException? Should I just wrap it in a try block and give it another go? Is it a better idea to show the user an error, and then when they try again, "skip" the social network interaction so that it isn't pushed out twice? Is there another idea that I'm not thinking of here?

A: 

http://code.google.com/appengine/docs/java/javadoc/com/google/appengine/api/datastore/DatastoreTimeoutException.html

"This can happen when you attempt to put, get, or delete too many entities or an entity with too many properties, or if the datastore is overloaded or having trouble."

If you're seeing the exception frequently, I expect it's because the datastore operation is too big, so retrying isn't really going to help. If you're just coding defensively against the risk that the exception might be thrown, then you could try again, perhaps by queueing a task to do so. But if you can't reach the datastore, who's to say you can queue a task?
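
For the defensive case, a bounded retry with a growing pause is one option. The sketch below is only an illustration: `TimeoutStandIn` is a made-up stand-in for `DatastoreTimeoutException` so that it compiles without the App Engine SDK, and `BoundedRetry`, `maxAttempts` and `initialDelayMillis` are hypothetical names. The attempt limit matters because, if the operation really is too big, retrying would otherwise churn forever.

```java
import java.util.function.Supplier;

// Stand-in for the datastore timeout exception (an assumption for this sketch).
class TimeoutStandIn extends RuntimeException {
    TimeoutStandIn(String message) { super(message); }
}

class BoundedRetry {
    // Calls op up to maxAttempts times, doubling the pause after each timeout.
    // Rethrows the last timeout once the attempts are used up.
    static <T> T run(Supplier<T> op, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        TimeoutStandIn last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (TimeoutStandIn e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // out of attempts
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break; // stop retrying if the thread is interrupted
                }
                delay *= 2;
            }
        }
        throw last;
    }
}
```

The total pause has to stay well inside the request deadline; anything that needs longer would have to be handed to a task instead.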

If you want to be bulletproof robust, and you can ensure that the operation you perform on the social network is idempotent (can be repeated), then:

  • Make a note to yourself that you need to perform the social network operation.
  • If the note failed to store, abort and return failure.
  • Otherwise, attempt the social network operation.
  • If successful, remove the note.
  • Have some kind of task or loop to retry any remaining notes in future.
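
The steps above could be sketched roughly as follows. This is not App Engine code: the `Map` stands in for a datastore kind that holds the notes, and `SocialClient`, `PendingNotes`, `submit` and `retryPending` are hypothetical names invented for the illustration. It assumes, as stated above, that the social network operation is idempotent, because a note whose removal fails will cause the operation to be repeated.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical wrapper around the social network call; assumed idempotent.
interface SocialClient {
    boolean post(String noteId, String message); // true on confirmed success
}

class PendingNotes {
    // In-memory stand-in for the persistent store of notes.
    private final Map<String, String> notes = new LinkedHashMap<>();
    private final SocialClient client;

    PendingNotes(SocialClient client) {
        this.client = client;
    }

    // Stores the note first, then attempts the operation.
    // Returns true if posted now, false if left for a later retry.
    // In a real app, a failure to store the note would abort here.
    boolean submit(String noteId, String message) {
        notes.put(noteId, message);
        return attempt(noteId, message);
    }

    // Tries the operation once and removes the note only on success.
    private boolean attempt(String noteId, String message) {
        boolean ok;
        try {
            ok = client.post(noteId, message);
        } catch (RuntimeException e) {
            ok = false; // treat any failure as "try again later"
        }
        if (ok) {
            notes.remove(noteId);
        }
        return ok;
    }

    // Retries every outstanding note; returns how many are still pending.
    int retryPending() {
        // Iterate over a copy so that removals do not disturb the loop.
        for (Map.Entry<String, String> e : new LinkedHashMap<>(notes).entrySet()) {
            attempt(e.getKey(), e.getValue());
        }
        return notes.size();
    }

    int pendingCount() {
        return notes.size();
    }
}
```

Something like a scheduled task would call `retryPending()` periodically to cover the last step.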

Of course you have to be a bit cautious about the response code you give back to the iPhone client, since success can take a long time - longer than the duration of the request made by the iPhone app. So you want your app engine request to be idempotent too, and you probably want some kind of cancel.
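
One way to make the App Engine request itself repeatable is to have the client send an ID it generates for each action, and to record the outcome under that ID. The sketch below only shows the shape of the idea: `RequestDeduper` and `handle` are hypothetical names, the `Map` stands in for persistent storage, and a real version would have to cope with the work succeeding while the record fails to store.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class RequestDeduper {
    // Stand-in for a persistent table mapping request ID to recorded outcome.
    private final Map<String, String> outcomes = new HashMap<>();

    // Runs work only the first time a request ID is seen. A repeat of the
    // same ID returns the recorded outcome without running work again.
    synchronized String handle(String requestId, Supplier<String> work) {
        if (outcomes.containsKey(requestId)) {
            return outcomes.get(requestId);
        }
        String result = work.get();
        outcomes.put(requestId, result);
        return result;
    }
}
```

With this in place, the iPhone app can resend the same request after a timeout without the action being pushed out twice.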

If all you get from the social network is success or failure, and a successful operation must not be repeated, then you're in trouble. That's a rubbish API to offer on the web: a server sending a successful response doesn't mean the caller received it, so the caller sometimes has no way to know that it succeeded, even though success creates responsibilities. But it happens.

Steve Jessop
In practice, retrying is often successful; you'll periodically get datastore timeouts even for small operations.
Wooble
That's what I mean - if it failed because the datastore glitched, then retrying is great. If it failed because your entities got too big, then retrying is just going to churn forever. Assuming it is a glitch, the duration presumably determines whether you can retry in a simple loop (before your request times out too), or if you need a task (because it will take longer than a single request to resolve the "funky state"). But I don't have experience of this on app engine.
Steve Jessop
The object has a few properties - a string, a long, a couple of dates, and a couple of integers. Could that be seen as "too big" or is it safe to bet that it was a glitch (this is the only time I've seen it).
bpapa
I doubt that's too big (although you might check that the string isn't absurdly long). "Big" for this purpose has something to do with the number of indexes that the entity has to be inserted into, which is why GAE applies an artificial limit to guard against the "exploding index problem" for multi-valued properties. I've used app engine a little, and I think the only time I've ever hit the resource limits was when deliberately timing out requests in a simple test of how many datastore ops I could do. So personally I can't give you any real numbers, just "bigger than anything I've done".
Steve Jessop
A: 

I find this statement worrying: In practice, retrying is often successful; you'll periodically get datastore timeouts even for small operations. – Wooble Jan 23 at 14:59

How can GAE be taken seriously if it has reliability issues? Generally, do you find the datastore to be slow? What's your estimate of the frequency of these exceptions?

kasuku
This isn't an answer to the question
bpapa
A: 

This is a fundamental problem with any distributed system. In general, there's no easy "bulletproof" solution. The best option, if possible, is to make sure that one or both of your operations are idempotent - that is, executing them multiple times has the same effect as executing them once. For the datastore, this is fairly easy: if you specify a key name, multiple puts will simply overwrite each other. If it's possible, you should make use of idempotence in your social API too, so you can safely re-execute in case of failure.
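
To illustrate the difference a key name makes, here is a small in-memory sketch. `KeyedStore` is a hypothetical stand-in, not the datastore API; with the real low-level API, I believe the equivalent is constructing the entity with an explicit key name rather than letting the datastore assign an ID.

```java
import java.util.HashMap;
import java.util.Map;

class KeyedStore {
    // In-memory stand-in for a datastore kind, keyed by entity key.
    private final Map<String, String> entities = new HashMap<>();
    private long nextId = 1;

    // Put under a caller-chosen key name: a repeated put overwrites the
    // same entity, so retrying after a timeout is harmless.
    void putWithKeyName(String keyName, String value) {
        entities.put(keyName, value);
    }

    // Put under an auto-assigned ID: every retry creates another entity.
    String putWithAutoId(String value) {
        String id = "auto-" + nextId++;
        entities.put(id, value);
        return id;
    }

    int size() {
        return entities.size();
    }
}
```

If the key name is derived from something the client already knows, such as a user ID plus an action ID, a retried put lands on the same entity no matter which attempt actually succeeded.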

Nick Johnson