I need to design a data model for an Amazon S3-like application. Let's simplify the problem to three key concepts: users, buckets, and objects. There are many ways to design this model; I'll describe two.

  1. Three Kinds - User, Bucket and Object. Each Object has a Bucket as its parent. Each Bucket has a User as its parent. User is the root.

  2. Dynamic Kinds - Users are stored in the User kind and buckets in the Bucket kind, same as #1. However, objects within a bucket are stored in a dynamic kind named "<BucketID>_Object". There is no longer a parent/child relationship between bucket and object entities; the relationship is established by the name of the object kind.
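Model #2 can be sketched with App Engine's Python datastore API by minting a model class per bucket. The `object_kind_for` factory name below is my own, and the try/except stub exists only so the sketch runs outside the SDK; overriding the `kind()` classmethod is the actual mechanism that maps a model class to a datastore kind:

```python
try:
    from google.appengine.ext import db
except ImportError:
    # Minimal stand-in so the sketch is importable outside the App Engine SDK.
    class db(object):
        class Expando(object):
            pass

def object_kind_for(bucket_id):
    """Return an Expando subclass whose datastore kind is '<bucket_id>_Object'.

    Hypothetical factory illustrating model #2: one dynamic kind per bucket,
    with no parent/child link between Bucket and Object entities.
    """
    class DynamicObject(db.Expando):
        @classmethod
        def kind(cls):
            return '%s_Object' % bucket_id
    return DynamicObject
```

Each class returned by the factory would read and write its own kind, so the bucket-object relationship lives entirely in the kind name.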

#1 is, of course, the more intuitive and traditional model. Some would call #2 radical; others might say ridiculous.

Why am I thinking about #2? In my application, the properties defined on objects can vary from bucket to bucket; they are specified by the user at bucket creation time. Also, all object properties need to be queryable. A dynamic object kind per bucket supports both requirements. Moreover, because the object kind is now a root kind, I no longer need ancestor filters, which means I get an automatic index on each object property for free. In model #1 I am forced to apply ancestor filters, which means I need a custom index for every property I want to query against.

I apologize for the convoluted explanation; I'll try to explain better if anything is unclear.

My questions are: is #2 a totally outrageous model? With #2 my kinds can potentially run into the tens of thousands. Is that OK? I understand there's a limit on the number of custom indexes, but I am not creating custom indexes on my dynamic kinds; I'm relying only on the automatic indexes.

Thanks, Keyur

Answer (+3):

There are issues with both. #1 is basically fine, except use reference properties instead of ancestors, and make your Object kind an Expando.

The problem with having buckets descend from users and objects descend from buckets is that this forces every bucket and object a user creates to live in the same entity group. This constrains performance and scalability, as all of an individual user's data has to be stored on the same datastore node. Entity groups are useful when you need to manipulate multiple entities in the same transaction. If you just need to model ownership, use a ReferenceProperty.
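A minimal sketch of this suggestion, assuming App Engine's Python `db` API (the `BucketObject` class name and `collection_name` values are my own, and the try/except stub only lets the sketch run outside the SDK):

```python
try:
    from google.appengine.ext import db
except ImportError:
    # Minimal stand-in so the sketch runs outside the App Engine SDK.
    class db(object):
        class Model(object):
            pass
        class Expando(object):
            pass
        @staticmethod
        def ReferenceProperty(*args, **kwargs):
            return None

class User(db.Model):
    pass

class Bucket(db.Model):
    # Ownership modeled as a reference, not ancestry: each Bucket is its
    # own entity group, so one user's data isn't pinned to a single node.
    owner = db.ReferenceProperty(User, collection_name='buckets')

class BucketObject(db.Expando):
    # Expando: per-bucket, user-defined properties, indexed automatically.
    bucket = db.ReferenceProperty(Bucket, collection_name='objects')
```

Every entity is now its own entity group, while ownership is still traversable in both directions via the reference properties.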

> In my application, properties defined on objects can vary from bucket to bucket. These properties are specified by the user at bucket creation time. Also, all properties on objects need to be queryable.

An Expando gives you both of these. Your properties can be defined on the fly, and they're indexed automatically.
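For illustration, assuming a hypothetical `BucketObject` Expando kind (the try/except stub only lets the snippet run outside the App Engine SDK):

```python
try:
    from google.appengine.ext import db
except ImportError:
    # Minimal stand-in so the sketch runs outside the App Engine SDK.
    class db(object):
        class Expando(object):
            def __init__(self, **kwargs):
                self.__dict__.update(kwargs)
            def put(self):
                pass

class BucketObject(db.Expando):
    """Hypothetical Expando kind holding a bucket's objects."""

# Properties are defined on the fly, per entity, with no schema declared:
obj = BucketObject()
obj.size = 2048                  # user-defined at runtime
obj.content_type = 'image/png'   # another ad-hoc property
obj.put()
```

Against the real datastore, a query like `BucketObject.all().filter('size >', 1000)` would then be served by the automatic single-property index, with no schema or custom index declared in advance.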

Nothing requires two entities of the same kind to have the same set of properties. Kinds are just names; they don't define or enforce any kind of schema. Creating a bunch of them on the fly just doesn't buy you anything.

Drew Sears
+1. Exactly what I was going to suggest.
Nick Johnson
Thanks Drew and Nick. That was very helpful. Follow up question: With reference properties, my query for, say, objects of size > 1000 and in bucket 1234 will look like "where bucket = key('Bucket', 1234) and size > 1000". Will this require a custom index or will the automatic indexes be enough to satisfy such queries?
Keyur
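For reference, the query described in the follow-up could be written in GQL roughly as below (a sketch; the `BucketObject` kind name is assumed, and the try/except stub only lets the snippet run outside the App Engine SDK):

```python
try:
    from google.appengine.ext import db
except ImportError:
    # Minimal stand-in so the sketch runs outside the App Engine SDK.
    class db(object):
        @staticmethod
        def GqlQuery(gql, *bindings):
            return gql
        class Key(object):
            @staticmethod
            def from_path(kind, id_or_name):
                return (kind, id_or_name)

# The follow-up's filter, against a hypothetical BucketObject kind
# with a 'bucket' ReferenceProperty:
GQL = "SELECT * FROM BucketObject WHERE bucket = :1 AND size > :2"
query = db.GqlQuery(GQL, db.Key.from_path('Bucket', 1234), 1000)
```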