ansaurus

Question

Answer 1

+3 A:

As you've noticed, this design doesn't scale. It requires 4 (!!!) DB queries to render the page. That's 3 too many :)

The prevailing notion of working with the App Engine Datastore is that you want to do as much work as you possibly can when something is written, so that almost nothing needs to be done when something is retrieved and rendered. You presumably write the data very few times, compared to how many times it's rendered.

Normalization is similarly something that you seem to be striving for. The Datastore doesn't place any value in normalization -- it may mean less data incongruity, but it also means reading data is muuuuuch slower (4 reads?!!). Since your data is read much more often than it's written, optimize for reads, even if that means your data will occasionally be duplicated or out of sync for a short amount of time.

Instead of thinking about how the data looks when it's stored, think about how you want the data to look when it's displayed to the user. Store as close to that format as you can, even if that means literally storing pre-rendered HTML in the datastore. Reads will be lightning-fast, and that's a good thing.

Since you should optimize for reads, your writes will often grow to gigantic proportions, so gigantic that they can't fit in the 30 second time limit for requests. Well, that's what the task queue is for. Store what you consider the "bare necessities" of your model in the datastore, then fire off a task to pull it back out, generate the HTML to be rendered, and put it back in the datastore in the background. This might mean your model isn't fully ready to display until the task has finished with it, so you'll need graceful degradation in this case, even if that means rendering it "the slow way" until the data is fully populated. Any further reads will be lightning-quick.
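To make the write-then-render-in-the-background idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the `datastore` dict and `task_queue` deque are in-memory stand-ins rather than the real App Engine services, and the field names (`customer`, `date`, `html`) are made up.

```python
from collections import deque

# In-memory stand-ins for the datastore and the task queue (hypothetical,
# for illustration only; a real app would use the App Engine services).
datastore = {}
task_queue = deque()


def render_slow(record):
    """Build the display HTML from the raw fields (the expensive path)."""
    return "<li>%s: %s</li>" % (record["customer"], record["date"])


def save_appointment(key, customer, date):
    """Write only the bare necessities, then queue the expensive rendering."""
    datastore[key] = {"customer": customer, "date": date, "html": None}
    task_queue.append(key)


def run_pending_tasks():
    """Background worker: pre-render the HTML and store it with the data."""
    while task_queue:
        key = task_queue.popleft()
        record = datastore[key]
        record["html"] = render_slow(record)


def get_html(key):
    """Serve pre-rendered HTML if present, else fall back to the slow path."""
    record = datastore[key]
    if record["html"] is not None:
        return record["html"]
    return render_slow(record)
```

A read that arrives before the task has run still works through `render_slow`; once the task has finished, `get_html` is just a lookup.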

In summary, I don't have any specific advice directly related to your database -- that's dependent on what you want the data to look like when the user sees it.

What I can give you are some links to some super helpful videos about the datastore:

Brett Slatkin's 2008 and 2009 talks on building scalable, complex apps on App Engine, and a great one from this year about data pipelines (which isn't directly applicable I think, but really useful in general)
App Engine Under the Covers: How App Engine does what it does, behind the scenes
AppStats: a great way to see how many datastore reads you're performing, and some tips on reducing that number

Jason Hall 2010-06-25 18:32:13

Wow, this answer got long in a hurry :) Sorry about that, hopefully it's not entirely useless!

Jason Hall 2010-06-25 18:32:46

Well, a long answer to a long question - and besides, you had plenty of helpful things to point out. :)

David Underhill 2010-06-25 18:56:29

Thanks for your response, it was very helpful. It wasn't too long, especially considering the length of the question. I added some more information on use cases to communicate a better feel for Read-Write ratios.

DutrowLLC 2010-06-25 18:56:38

With the datastore failing a certain percentage of the time as well as going into read-only mode sporadically, how do you ensure data integrity of your denormalized data?

DutrowLLC 2010-06-25 19:37:01

You'll have to figure that out based on your own needs. If the datastore is in read-only mode you can detect that and just not allow writes during that time. I don't know if transactions can be interrupted by read-only mode, but if not, then that might be a way to maintain consistency while the datastore degrades.

Jason Hall 2010-06-25 20:16:40

Answer 2

+2 A:

You specified two specific "views" your website needs to provide:

Scheduling an appointment. Your current scheme should work just fine for this - you'll just need to do the first query you mentioned.
Overall view of operations. I'm not really sure what this entails, but if you need to do the string of four queries you mentioned above to get this, then your design could use some improvement. Details below.

Four datastore queries in and of themselves aren't necessarily overboard. The problem in your case is that two of the queries are expensive and probably even impossible. I'll go through each query:

1. Getting a list of appointments - no problem. This query will be able to scan an index to efficiently retrieve the appointments in the date range you specify.
2. Get all line items for each appointment from #1 - this is a problem. This query requires that you do an IN query. IN queries are transformed into N sub-queries behind the scenes - so you'll end up with one query per appointment key from #1! These will be executed in parallel so that isn't so bad. The main problem is that IN queries are limited to only a small list of values (up to just 30 values). If you have more than 30 appointment keys returned by #1 then this query will fail to execute!
3. Get all invoices referenced by line items - no problem. You are correct that this query is cheap because you can simply fetch all of the relevant invoices directly by key. (Note: this query is still synchronous - I don't think asynchronous was the word you were looking for).
4. Get all payments for all invoices returned by #3 - this is a problem. Like #2, this query will be an IN query and will fail if #3 returns even a moderate number of invoices which you need to fetch payments for.
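If you did want to keep the IN queries despite the limit, one workaround is to split the key list into batches of at most 30 and run one query per batch. The sketch below is plain Python under stated assumptions: `run_in_query` is a hypothetical callable standing in for whatever actually executes the IN query, not a real App Engine function. Each batch still fans out into one sub-query per key, so this only avoids the failure; it doesn't make the approach cheap.

```python
MAX_IN_VALUES = 30  # the IN-filter limit described above


def chunked(values, size=MAX_IN_VALUES):
    """Split a list into consecutive batches of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def fetch_line_items(appointment_keys, run_in_query):
    """Run one IN query per batch and concatenate the results.

    `run_in_query` is a hypothetical callable that takes a list of at most
    30 appointment keys and returns the matching line items.
    """
    results = []
    for batch in chunked(appointment_keys):
        results.extend(run_in_query(batch))
    return results
```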

If the number of items returned by #1 and #3 is small enough, then GAE will almost certainly be able to do this within the allowed limits. And that should be good enough for your personal needs - it sounds like you mostly need it to work, and don't need it to scale to huge numbers of users (it won't).

Suggestions for improvement:

Denormalization! Try storing the keys for Line_Item, Invoice, and Payment entities relevant to a given appointment in lists on the appointment itself. Then you can eliminate your IN queries. Make sure these new ListProperty fields are not indexed, to avoid problems with exploding indices.
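Here is a rough sketch of what reading looks like once the keys live on the appointment. It uses a plain dict as a stand-in for the datastore, and the key strings and field names (`line_item_keys`, `invoice_keys`, `payment_keys`) are assumptions for illustration. In the Python `db` API I'd expect this to map roughly onto unindexed `ListProperty` fields holding keys plus a batch get by key, but treat that mapping as an assumption and check the docs.

```python
# In-memory stand-in for the datastore, keyed by entity key (hypothetical).
datastore = {
    "appt1": {
        "line_item_keys": ["li1", "li2"],
        "invoice_keys": ["inv1"],
        "payment_keys": ["pay1"],
    },
    "li1": {"description": "Oil change", "price": 40},
    "li2": {"description": "Tire rotation", "price": 25},
    "inv1": {"total": 65},
    "pay1": {"amount": 65},
}


def get_by_keys(keys):
    """Batch fetch by key: direct lookups, no query and no IN filter."""
    return [datastore[k] for k in keys]


def load_appointment_details(appointment_key):
    """Load an appointment's related entities from its stored key lists."""
    appt = datastore[appointment_key]
    return {
        "line_items": get_by_keys(appt["line_item_keys"]),
        "invoices": get_by_keys(appt["invoice_keys"]),
        "payments": get_by_keys(appt["payment_keys"]),
    }
```

The point is that nothing here depends on how many keys there are: fetching by key doesn't hit the 30-value limit.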

Other less specific ideas for improvement:

Depending on what your "overall view of operations" is going to show, you might be able to split up the retrieval of all this information. For example, perhaps you start by showing a list of appointments, and then when the manager wants more information about a particular appointment you go ahead and fetch the information relevant to that appointment. You could even do this via AJAX if you want this interaction to take place on a single page.
Memcache is your friend - use it to cache the results of datastore queries (or even higher level results) so that you don't have to recompute it from scratch on every access.
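The usual caching pattern looks something like the sketch below. It's plain Python with a dict standing in for Memcache, and `compute` is a hypothetical callable for whatever runs the datastore queries; the real memcache client has its own get/set/delete calls, which this only imitates.

```python
cache = {}  # in-memory stand-in for Memcache (hypothetical)


def overview_cache_key(date_range):
    """Build a cache key from a (start, end) pair of date strings."""
    return "overview:%s:%s" % date_range


def get_overview(date_range, compute):
    """Return the cached overview, computing and storing it on a miss.

    `compute` is a hypothetical callable that runs the expensive queries.
    """
    key = overview_cache_key(date_range)
    value = cache.get(key)
    if value is None:
        value = compute(date_range)
        cache[key] = value
    return value


def invalidate_overview(date_range):
    """Drop the cached overview after a write that changes its contents."""
    cache.pop(overview_cache_key(date_range), None)
```

Remember to invalidate (or overwrite) the cached entry whenever an appointment in that range changes, otherwise readers will see stale data until the entry expires.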

David Underhill 2010-06-25 18:48:14

+1 for Memcache, absolutely indispensable.

Jason Hall 2010-06-25 18:54:02

Thanks for your response. I wasn't aware of the 30-value limit on IN queries. I suppose I could shard the query, but that would be nasty. Looks like I'll probably just put denormalized fields into the "Appointment" entities. I don't have any experience with maintaining denormalized data, are there any references that you would suggest?

DutrowLLC 2010-06-25 19:08:46

Denormalization isn't as scary as it sounds. Whenever you create or delete a `Line_Item`, `Invoice`, or `Payment` just update the corresponding `Appointment` too. I wouldn't worry too much about doing this transactionally either - just create your `Line_Item` (etc.) and then update your `Appointment` (if you create multiple line items in a single request, then just update the relevant `Appointment` entity once). And do the reverse when deleting a `Line_Item`. If the second query experiences a transient failure, just push it off onto the Task Queue and it will eventually be applied.
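As a rough illustration of that write path, here is a plain-Python sketch. The `datastore` dict and `retry_queue` list are in-memory stand-ins (assumptions, not the real datastore or Task Queue), and the `update` parameter exists only so a transient failure can be simulated.

```python
# In-memory stand-ins (hypothetical): a datastore and a retry task queue.
datastore = {"appt1": {"line_item_keys": []}}
retry_queue = []


def add_key_to_appointment(appointment_key, line_item_key):
    """Record the line item on its appointment; safe to run more than once."""
    keys = datastore[appointment_key]["line_item_keys"]
    if line_item_key not in keys:
        keys.append(line_item_key)


def create_line_item(line_item_key, appointment_key, data,
                     update=add_key_to_appointment):
    """Store the line item, then update the denormalized list on the appointment."""
    datastore[line_item_key] = dict(data, appointment=appointment_key)
    try:
        update(appointment_key, line_item_key)
    except Exception:
        # Transient failure: defer the appointment update instead of failing.
        retry_queue.append((appointment_key, line_item_key))


def run_retries():
    """Background worker: apply the deferred appointment updates."""
    while retry_queue:
        appointment_key, line_item_key = retry_queue.pop(0)
        add_key_to_appointment(appointment_key, line_item_key)
```

Making the update idempotent matters here, since a retried task may run after the original update has already gone through.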

David Underhill 2010-06-25 19:22:07

Rather than storing `LineItem` in a list on `Invoice`, store a list of `LineItem` keys on `Invoice`. You can retrieve entities by key without N queries or the 30 item limit.

spankalee 2010-06-26 17:16:29

I agree - store the keys for the related entities in the lists (that's what I was trying to say in my post).

David Underhill 2010-06-26 17:44:26

Answer 3

+2 A:

Here are a few app-engine specific factors that I think you'll have to contend with:

When querying using an inequality, you can only use an inequality on one property. For example, if you are filtering on an appt date being between July 1st and July 4th, you couldn't also filter by price > 200.
Transactions on app engine are a bit tricky compared to the SQL database you are probably used to. You can only do transactions on entities that are in the same "entity group".
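One common way around the single-inequality restriction is to let the datastore apply the date range and then apply the second inequality in your own code. The sketch below is plain Python over an in-memory list; the sample data, field names, and `query_by_date` helper are assumptions standing in for a real datastore query.

```python
from datetime import date

# Sample data standing in for Appointment entities (hypothetical).
appointments = [
    {"id": "a1", "date": date(2010, 7, 2), "price": 150},
    {"id": "a2", "date": date(2010, 7, 3), "price": 250},
    {"id": "a3", "date": date(2010, 7, 10), "price": 300},
]


def query_by_date(start, end):
    """Stand-in for the one inequality filter the datastore applies."""
    return [a for a in appointments if start <= a["date"] <= end]


def appointments_over_price(start, end, min_price):
    """Apply the second inequality (price) in memory, after the query."""
    return [a for a in query_by_date(start, end) if a["price"] > min_price]
```

This works well when the first filter already narrows the results to a manageable number; if it doesn't, you're paying to read entities you then throw away.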

Peter Recore 2010-06-25 18:55:47


Database design - google app engine

Proposed structure:

Usage clarification
