I'm looking for resources to help migrate my design skills from a traditional RDBMS data store over to the AppEngine DataStore (i.e. 'Soft Schema' style). I've seen several presentations, and all touch on the overarching themes and some specific techniques.

I'm wondering if there's a place we could pool knowledge from experience ("from the trenches") on real-world approaches to rethinking how data is structured, especially porting existing applications. We're heavily Hibernate based and have probably travelled a bit down the wrong path with our data model already, generating some gnarly queries which our DB is struggling with.

Please respond if:

  1. You have ported a non-trivial application over to AppEngine
  2. You've created a common type of application from scratch in AppEngine
  3. You've done neither 1 nor 2, but are considering it and want to share your own findings so far.
+1  A: 

I played around with Google App Engine for Java and found that it had many shortcomings:

This is not general-purpose Java application hosting. In particular, you do not have access to a full JRE (e.g. you cannot create threads). Given this, you pretty much have to build your application from the ground up with the Google App Engine JRE in mind. Porting any non-trivial application would be impossible.

More pertinent to your datastore questions...

The datastore performance is abysmal. I was trying to write 5000 weather observations per hour -- nothing too massive -- but I could not do it because I kept running into timeout exceptions, both with the datastore and with the HTTP request. Using the "low-level" datastore API helped somewhat, but not enough.
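For illustration, here is a minimal sketch of the kind of batching discussed in the comments below: grouping entities into chunks so that each chunk costs one round trip instead of one round trip per entity. It is plain Python, not App Engine code; `save_in_batches` and the `put_batch` callable are hypothetical names standing in for whatever batch-put call your datastore API offers, and the batch size of 100 is an arbitrary assumption.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, chunk.
        yield batch


def save_in_batches(
    items: Iterable[T],
    put_batch: Callable[[List[T]], None],  # hypothetical: one call = one round trip
    batch_size: int = 100,
) -> int:
    """Write `items` chunk by chunk and return the number of round trips made."""
    calls = 0
    for batch in chunked(items, batch_size):
        put_batch(batch)
        calls += 1
    return calls
```

With 250 observations and a batch size of 100, this makes three calls rather than 250. Whether that is enough to stay under the request deadline depends on the entity size and indexes involved.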

I wanted to delete those weather observations after 24 hours so as not to fill up my quota. Again, I could not do it because the delete operation took too long. This in turn led to my datastore quota filling up. Insanely, you cannot easily delete large swaths of data in the GAE datastore.
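One common way to work around this kind of limit is to delete in small batches inside a time budget and let a cron job or task re-run the cleanup until nothing is left. The sketch below shows only that control flow in plain Python; `delete_old`, `fetch_keys` and `delete_keys` are hypothetical names, and the 20-second budget and batch size of 200 are assumptions, not App Engine values.

```python
import time
from typing import Callable, List


def delete_old(
    fetch_keys: Callable[[int], List[object]],   # hypothetical: returns up to n stale keys
    delete_keys: Callable[[List[object]], None],  # hypothetical: deletes one batch of keys
    budget_seconds: float = 20.0,
    batch_size: int = 200,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Delete stale entities in batches until done or the time budget runs out.

    Returns True once a fetch comes back empty, and False if the budget was
    spent first, in which case the caller should schedule another run.
    """
    deadline = clock() + budget_seconds
    while clock() < deadline:
        keys = fetch_keys(batch_size)
        if not keys:
            return True
        delete_keys(keys)
    return False
```

Each run does a bounded amount of work, so no single request has to finish the whole cleanup.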

There are some features that I did like. Eclipse integration is snazzy. The appspot application server UI is a million times better than working with Tomcat (e.g. nice views of logs). But the minuses far outweighed those benefits for me.

In sum, I constantly found myself having to shave the yak, in order to do something that would have been pretty trivial in any normal Java / application hosting environment.

Julien Chastang
You classify datastore performance as 'abysmal' - can you give more details as to what you were doing? A substantial component of datastore interaction is the round-trip time; batching operations makes an enormous difference here.
Nick Johnson
@Nick Unfortunately, you cannot batch. See http://is.gd/YWPj "To save multiple objects...", although they say they are working on the problem. Again, working with the low-level API achieves batch-like performance, but it is still too slow. I was trying to persist weather observations (temperature, pressure, etc.), which arrive at roughly 5000 per hour. Nothing dramatic.
Julien Chastang
I don't believe that's correct. Batch gets and puts are possible through JPA or JDO in Java. I've seen this in some code examples already. http://googleappengine.blogspot.com/2009/06/10-things-you-probably-didnt-know-about.html (See item #5 on that page)
Mark Renouf
@Mark. That Python example, I believe, does not point to the high-level PersistenceManager API, but to the low-level datastore API (again, see is.gd/YWPj is.gd/Z4Pv). As I mentioned, this is what I did to improve performance, but it was still too slow. I ended up "filling" a data structure and gradually persisting it in smaller chunks via a cron. This solution was annoying because it added a lot of complexity. When it was time to delete old data to avoid quota problems, I ran into timeout problems again. At that point I decided the GAE persistence framework was too immature and non-scalable.
Julien Chastang
@Julien: You can batch in Python, and this question is tagged with both. You still haven't said what you were doing exactly, so I can't comment on whether there was a better way, but scalability is the reason that you sometimes have to jump through hoops in the first place - your code will work as well with 1qps as with 1kqps.
Nick Johnson
+1  A: 

The timeouts are tight and performance was ok but not great, so I found myself using extra space to save time; for example I had a many-to-many relationship between trading cards and players, so I duplicated the information of who owns what: Card objects have a list of Players and Player objects have a list of Cards.

Normally storing all your information twice would have been silly (and prone to get out of sync) but it worked really well.
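As a rough sketch of that double bookkeeping, the plain-Python example below keeps both lists and routes every ownership change through one helper so the two copies are updated together. The class layout and the `give_card` and `take_card` helpers are illustrative assumptions, not the actual models from the application, and a real datastore version would also need to save both entities together.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Player:
    name: str
    card_names: List[str] = field(default_factory=list)   # copy 1: what this player owns


@dataclass
class Card:
    name: str
    owner_names: List[str] = field(default_factory=list)  # copy 2: who owns this card


def give_card(player: Player, card: Card) -> None:
    """Record ownership on both sides so either lookup needs no join."""
    if card.name not in player.card_names:
        player.card_names.append(card.name)
    if player.name not in card.owner_names:
        card.owner_names.append(player.name)


def take_card(player: Player, card: Card) -> None:
    """Remove ownership from both sides; a no-op if it was never recorded."""
    if card.name in player.card_names:
        player.card_names.remove(card.name)
    if player.name in card.owner_names:
        card.owner_names.remove(player.name)
```

Funnelling all changes through these two helpers is what keeps the duplicated lists from drifting apart.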

In Python they recently released a remote API that gives you an interactive shell to the datastore, so you can play with your data without any timeouts or limits (for example, you can delete large swaths of data, or refactor your models). This is fantastically useful, since otherwise, as Julien mentioned, it was very difficult to do any bulk operations.

Kiv
+1  A: 

Non-relational database design essentially involves denormalization wherever possible.

Example: since BigTable doesn't provide enough aggregation features, the sum(cash) query you would use in the RDBMS world is not available. Instead, the sum has to be stored on the model, and the model's save method must be overridden to compute the denormalized field.
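A minimal sketch of that pattern, in plain Python with a dict standing in for the datastore: the `Account` class, its `cash_total` field and the `store` argument are hypothetical names chosen for illustration, and amounts are kept as integer cents to avoid rounding issues.

```python
from typing import Dict, List


class Account:
    """Toy model that stores its own total instead of relying on SUM()."""

    def __init__(self, key: str) -> None:
        self.key = key
        self.entries: List[int] = []  # individual cash amounts, in cents
        self.cash_total = 0           # denormalized: recomputed on every save

    def add_entry(self, amount_cents: int) -> None:
        self.entries.append(amount_cents)

    def save(self, store: Dict[str, dict]) -> None:
        """Recompute the denormalized total, then write the record."""
        self.cash_total = sum(self.entries)
        store[self.key] = {
            "entries": list(self.entries),
            "cash_total": self.cash_total,
        }
```

Reading the total is then a single fetch of the stored record, at the cost of a little extra work on each write.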

The basic design that comes to mind is that each template has its own model, in which all the fields the template needs are present in denormalized form; the models then carry a whole layer of signal-driven update logic to keep those copies current.

Lakshman Prasad
+2  A: 

I'm wondering if there's a place we could pool knowledge from experience

Various Google Groups are good for that, though I don't know if any are directly applicable to Java-GAE yet -- my GAE experience so far is all-Python (I'm kind of proud to say that Guido van Rossum, inventor of Python and now working at Google on App Engine, told me I had taught him a few things about how his brainchild worked -- his recommendation mentioning that is now the one I'm proudest of among all those on my LinkedIn profile;-). [I work at Google but my impact on App Engine was very peripheral -- I worked on "building the cloud", cluster and network management SW, and App Engine is about making that infrastructure useful for third-party developers].

There are indeed many essays & presentations on how best to denormalize and shard your data for optimal GAE scaling and performance -- they're of varying quality, though. The books that are out so far are so-so; many more are coming in the next few months, hopefully better ones (I had a project to write one of those, with two very skilled friends, but we're all so busy that we ended up dropping it). In general, I'd recommend the Google I/O videos and the essays that Google blessed in its app engine site and blogs, PLUS every bit of content from appenginefan's blog -- what Guido commended me for teaching him about GAE, I in turn mostly learned from appenginefan (partly through the wonderful app engine meetup in Palo Alto, but his blog is great too;-).

Alex Martelli