I am wondering if anyone might provide some conceptual advice on an efficient way to build a data model for the simple system described below. I am somewhat new to thinking in a non-relational manner and want to avoid any obvious pitfalls. It's my understanding that a basic principle is "storage is cheap, don't worry about data duplication", in contrast to the normalization you would aim for in an RDBMS.

What I'd like to model is:

A blog article can be given 0-n tags, and many blog articles can share the same tag. When retrieving data, I would like to allow retrieval of all articles matching a tag. In many ways this is very similar to the approach taken here at Stack Overflow.

My normal mindset would be to create a many-to-many relationship between tags and blog articles. However, I'm thinking that in the context of GAE this would be expensive, although I have seen examples of it being done.

Perhaps using a ListProperty containing each tag as part of the article entities, and a second data model to track tags as they're added and deleted? This way there is no need for any relationships, and the ListProperty still allows queries where any matching list element will return results.

Any suggestions on the most efficient way to approach this on GAE?

+1  A: 

Many-to-many sounds reasonable. Perhaps you should try it first to see if it is actually expensive.

A good thing about GAE is that it will tell you when you are using too many cycles. Profiling for free!

Ali A
I was thinking many-to-many too, but even the documentation at Google warns against this in all but the most necessary situations. Good advice though about profiling; I think I'll try running some tests using different approaches and report the results back here.
Matty
+1  A: 

One possible way is with Expando, where you'd add a tag like:

setattr(entity, 'tag_'+tag_name, True)

Then you could query all the entities with a tag like:

def get_all_with_tag(model_class, tag):
    return model_class.all().filter('tag_%s =' % tag, True)

Of course you have to clean up your tags to be proper Python identifiers. I haven't tried this, so I'm not sure if it's really a good solution.
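
One way the tag clean-up could look, as a rough sketch only: hex-encode the UTF-8 bytes of the tag so that any tag, English or not, becomes an identifier-safe property name. The helper names `tag_to_property_name` and `property_name_to_tag` are made up for illustration, the lower-casing is an assumed normalization rule, and this is plain Python that does not touch any App Engine API.

```python
import binascii

PREFIX = 'tag_'


def tag_to_property_name(tag):
    """Map an arbitrary tag to an identifier-safe property name.

    Hypothetical helper: the tag is normalized (trimmed, lower-cased),
    then its UTF-8 bytes are hex-encoded, so the result contains only
    the characters 0-9 and a-f after the 'tag_' prefix.
    """
    normalized = tag.strip().lower()
    if not normalized:
        raise ValueError('tag must not be empty')
    hex_form = binascii.hexlify(normalized.encode('utf-8')).decode('ascii')
    return PREFIX + hex_form


def property_name_to_tag(name):
    """Reverse the mapping, returning the normalized tag."""
    if not name.startswith(PREFIX):
        raise ValueError('not a tag property: %r' % name)
    return binascii.unhexlify(name[len(PREFIX):]).decode('utf-8')
```

The encoded names are not human-readable, which is the price of accepting arbitrary tag text; the result could then be passed to `setattr` or used to build the filter string in place of the raw tag.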

ianb
What if tag names don't have to be English?
Eran Kampf
+4  A: 

Thanks to both of you for your suggestions. I've implemented (first iteration) as follows. Not sure if it's the best approach, but it's working.

Class A = Articles. Has a StringListProperty which can be queried on its list elements.

Class B = Tags. One entity per tag, also keeps a running count of the total number of articles using each tag.

Data modifications to A are accompanied by maintenance work on B. My thinking is that pre-computing the counts is a good approach in a read-heavy environment.
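
The bookkeeping described above can be sketched in plain Python. This is an in-memory stand-in, not datastore code: the class `TagIndex` and its method names are hypothetical, a dict of sets plays the role of the StringListProperty on each article, and a second dict plays the role of the per-tag count entities.

```python
from collections import defaultdict


class TagIndex(object):
    """In-memory stand-in for the two kinds: articles (A) and tags (B)."""

    def __init__(self):
        self.article_tags = {}              # article id -> set of tags
        self.tag_counts = defaultdict(int)  # tag -> number of articles

    def save_article(self, article_id, tags):
        """Create or update an article and adjust the tag counts."""
        new_tags = set(tags)
        old_tags = self.article_tags.get(article_id, set())
        for tag in new_tags - old_tags:     # tags that were added
            self.tag_counts[tag] += 1
        for tag in old_tags - new_tags:     # tags that were removed
            self._decrement(tag)
        self.article_tags[article_id] = new_tags

    def delete_article(self, article_id):
        """Remove an article and release the tags it was using."""
        for tag in self.article_tags.pop(article_id, set()):
            self._decrement(tag)

    def _decrement(self, tag):
        self.tag_counts[tag] -= 1
        if self.tag_counts[tag] <= 0:       # drop tags nobody uses
            del self.tag_counts[tag]

    def articles_with_tag(self, tag):
        """Equivalent of querying the list property for one element."""
        return sorted(a for a, tags in self.article_tags.items()
                      if tag in tags)

    def count(self, tag):
        """Read the pre-computed count without scanning the articles."""
        return self.tag_counts.get(tag, 0)
```

Only the difference between the old and new tag sets touches the counts, so re-saving an article with unchanged tags costs no extra writes on the tag side. In the datastore the article write and the count updates would be separate operations, so the counts should be treated as approximate unless they are updated transactionally.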

Matty
Just the approach I was going to suggest, except I didn't find time. :)
Nick Johnson
+2  A: 

Pre-computing the counts is not only practical but also necessary, because the count() function returns a maximum of 1000. If write contention might be an issue, make sure to check out the sharded counter example.

http://code.google.com/appengine/articles/sharding_counters.html
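
The idea behind a sharded counter can be shown with a small in-memory sketch. This is not the code from the linked article: the class name `ShardedCounter` and the default of 20 shards are assumptions for illustration, and a plain list stands in for what would be separate shard entities in the datastore.

```python
import random


class ShardedCounter(object):
    """Spread increments over several shards; sum them when reading."""

    def __init__(self, num_shards=20, rng=None):
        if num_shards < 1:
            raise ValueError('num_shards must be at least 1')
        self.shards = [0] * num_shards      # one slot per shard
        self.rng = rng or random.Random()   # injectable for testing

    def increment(self, amount=1):
        # Pick a shard at random so that concurrent writers rarely
        # collide on the same one.
        index = self.rng.randrange(len(self.shards))
        self.shards[index] += amount

    def total(self):
        # A read has to visit every shard, which is the trade-off.
        return sum(self.shards)
```

More shards mean less write contention but a more expensive read, so the shard count is something to tune per tag rather than a fixed rule.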

mainsocial