Denormalization in Google App Engine?

Background::::

I'm working with google app engine (GAE) for Java. I'm struggling to design a data model that plays to big table's strengths and weaknesses, these are two previous related posts:

I've tentatively decided on a fully normalized backbone with denormalized properties added into entities so that most client requests can be serviced with only one query.

I reason that a fully normalized backbone will:

Help maintain data integrity if I code a mistake in the denormalization
Enable writes in one operation from a client's perspective
Allow for any type of unanticipated query on the data (provided one is willing to wait)

While the denormalized data will:

Enable most client requests to be serviced very fast

Basic denormalization technique:::

I watched an app engine video describing a technique referred to as "fan-out." The idea is to make quick writes to normalized data and then use the task queue to finish up the denormalization behind the scenes without the client having to wait. I've included the video here for reference, but its an hour long and theres no need to watch it in order to understand this question: http://code.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html

If I use this "fan-out" technique, every time the client modifies some data, the application would update the normalized model in one quick write and then fire off the denormalization instructions to the task queue so the client does not have to wait for them to complete as well.

Problem:::

The problem with using the task queue to update the denormalized version of the data is that the client could make a read request on data that they just modified before the task queue has completed the denormalization on that data. This would provide the client with stale data that is incongruent with their recent request confusing the client and making the application appear buggy.

As a remedy, I propose fanning out denormalization operations in parallel via asynchronous calls to other URLS in the application via URLFetch: http://code.google.com/appengine/docs/java/urlfetch/ The application would wait until all of the asynchronous calls had been completed before responding to the client request.

For example, if I have an "Appointment" entity and a "Customer" entity. Each appointment would include a denormalized copy of the customer information for who its scheduled for. If a customer changed their first name, the application would make 30 asynchronous calls; one to each affected appointment resource in order to change the copy of the customer's first name in each one.

In theory, this could all be done in parallel. All of this information could be updated in roughly the time it takes to make 1 or 2 writes to the datastore. A timely response could be made to the client after the denormalization was completed eliminating the possibility of the client being exposed to incongruent data.

The biggest potential problem I see with this is that the application can not have more than 10 asynchronous request calls going at any one time (documented here): http://code.google.com/appengine/docs/java/urlfetch/overview.html).

Proposed denormalization technique (recursive asynchronous fan-out):::

My proposed remedy is to send denormalization instructions to another resource that recursively splits the instructions into equal-sized smaller chunks, calling itself with the smaller chunks as parameters until the number of instructions in each chunk is small enough to be executed outright. For example, if a customer with 30 associated appointments changed the spelling of their first name. I'd call the denormalization resource with instructions to update all 30 appointments. It would then split those instructions up into 10 sets of 3 instructions and make 10 asynchronous requests to its own URL with each set of 3 instructions. Once the instruction set was less than 10, the resource would then make asynchronous requests outright as per each instruction.

My concerns with this approach are:

It could be interpreted as an attempt to circumvent app engine's rules, which would cause problems. (its not even allowed for a URL to call itself, so I'd in fact have to have two URL resources that handle the recursion that would call each other)
It is complex with multiple points of potential failure.

I'd really appreciate some input on this approach.

This sounds awfully complicated, and the more complicated the design the more difficult it is to code and maintain.

Assuming you need to denormalize your data, I'd suggest just using the basic denormalization technique, but keep track of which objects are being updated. If a client requests an object which is being updated, you know you need to query the database to get the updated data; if not, you can rely on the denormalized data. Once the task queue finishes, it can remove the object from the "being updated" list, and everything can rely on the denormalized data.

A sophisticated version could even track when each object was edited, so a given object would know if it had already been updated by the task queue.

Thanks for your response. My counter-argument is that your suggestion could be the more complicated option since there would have to be checking to see what state the data is in as well as multiple types of queries depending on if the data is normalized or not. The fan-out algorithm that I suggest could be the overall less complicated option since after it is implemented correctly, it allows the rest of the system to be simpler.

DutrowLLC 2010-07-06 20:48:16

ansaurus

tags:

views:

answers: