Hi. I am in the early stages of developing a sports statistics website (ultimate frisbee) and would like your opinions on whether Google App Engine is right for me.

I am writing it in Python using Django and have been comfortable with standard RDBMSes for years, but this site is a long-term project and I am expecting very large amounts of data, so I would like the "infinite" scaling that the GAE datastore offers. The vast majority of queries to the database will return very standard results that make the datastore seem like a logical choice. However, I would also like to be able to make extremely complex queries in the future to come up with new statistical metrics or simply to find interesting results. I plan on doing a lot of this, but won't know what these queries are until the data is already collected.

For instance, you often see baseball stats analysts come up with ridiculous stats like "This is the first time in the past 50 years that two left-handed pitchers whose last names start with 'Z' have thrown one-hit shutouts on back-to-back days". I would like the flexibility to make any query whatsoever in the future. :)

However, I am under the impression that a non-relational database like Bigtable requires you to design models containing redundant data beforehand, so that all of the work takes place on the inserts rather than the fetches. I've already built Django models that contain virtually all the data I would ever need to query on, but I have no idea what denormalized models I'll want a year or two from now. Thus, I feel like making complex queries in the future would be extremely difficult on the GAE datastore and would require me to pull a ton of information off the server before processing it in Python.
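To make the "work happens on insert" idea concrete, here is the kind of sketch I have in mind (PlayerGameLine and PlayerSeasonTotals are made-up examples for illustration, not my real models):

from google.appengine.ext import db

class PlayerGameLine(db.Model):
    # Raw per-game stat line -- the "normalized" source of truth.
    player_name = db.StringProperty(required=True)
    game_date = db.DateProperty(required=True)
    goals = db.IntegerProperty(default=0)
    assists = db.IntegerProperty(default=0)

class PlayerSeasonTotals(db.Model):
    # Denormalized running totals, maintained at write time so the
    # common reads become single cheap gets.
    season = db.IntegerProperty(required=True)
    goals = db.IntegerProperty(default=0)
    assists = db.IntegerProperty(default=0)
    games = db.IntegerProperty(default=0)

def record_game_line(line, season):
    line.put()
    totals = PlayerSeasonTotals.get_or_insert(
        '%s:%d' % (line.player_name, season), season=season)
    totals.goals += line.goals
    totals.assists += line.assists
    totals.games += 1
    totals.put()  # in a real app this read-modify-write belongs in a transaction

The catch, as I say, is that I would need to know every PlayerSeasonTotals-style rollup I'll ever want before the data arrives.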

Is the Google App Engine datastore simply wrong for what I want to do, or am I just missing something? Thanks so much in advance!

Update: Thanks for the responses so far. I should also mention that many of these complex queries are ones I would like users to be able to run, which makes an offline database not really an option. For instance, users should be able to see various statistics on how well any two particular players perform when they are on the field at the same time, for specific games or seasons. While these queries aren't nearly as frequent as standard aggregate stats, they will still happen with regularity.
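For what it's worth, the one datastore-friendly shape I can see for that pairwise query is keeping a list of player keys on each point and filtering on it twice (Point is a made-up model here; my understanding is that two equality filters on the same list property act as an intersection):

from google.appengine.ext import db

class Point(db.Model):
    # One point of a game; 'players' holds the keys of everyone on the field.
    game = db.ReferenceProperty()
    players = db.ListProperty(db.Key)
    scored = db.BooleanProperty()

def points_with_both(player_a_key, player_b_key):
    # Two equality filters on the same list property match only entities
    # whose list contains both values, i.e. both players were on the field.
    q = Point.all()
    q.filter('players =', player_a_key)
    q.filter('players =', player_b_key)
    return q.fetch(1000)  # 1000 is the per-fetch cap

But that only covers queries I anticipated when I designed the model, which is exactly my worry.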

Having a relational database alongside the GAE datastore would be great, but Django doesn't support multiple databases out of the box yet, and patching a solution together sounds difficult and messy. Eric Florenzano has a nice solution for two databases that both use Django models, but if I were to use the GAE datastore I would have to use App Engine's own db.Model instead, and coming up with a solution like his for this complex problem is a bit beyond my skill level at this point.

Right now my two favorite options are using the GAE Task Queue to run the difficult queries, or going with a more standard web host like WebFaction and denormalizing my tables later, once my data grows and I need to increase performance.

+3  A: 

I would say that Bigtable-type storage is less suitable for statistical applications, for the very reasons you mention. But this is a classic trade-off you have to make. I've seldom found myself using the flexibility of really complex queries, but I have many times been forced to come up with specialized solutions for stuff that shouldn't have been in the DB in the first place.

If you stick with an RDBMS, you can do logical partitioning and denormalization fairly easily, for instance through Hibernate's persistence strategies and Hibernate Shards. If you can live with somewhat slower processing, you can also run SQL-like queries over Bigtable-type storage (see, for instance, Hadoop's Pig Latin).

disown
Thanks for the advice! I can probably live with the slower processing for certain things and would really like to use Bigtable if at all possible.
Spike
A: 

The GAE datastore is a completely different animal from an RDBMS. In a relational DB it is easy to write something like:

SELECT STDEV(player_score)
FROM Table
WHERE player_id = 1234
  AND game_date BETWEEN '2007-01-01' AND '2009-11-10'
  AND city <> 'London'

GAE queries have lots of restrictions -- see here -- so this does not translate easily. For aggregate functions (SUM, STDEV, etc.) you have to pull all the data into the application layer and calculate there, or maintain aggregate entities that you update on every insert/update.
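For illustration, here is roughly what the app-layer version of that STDEV query looks like (Score is a hypothetical model mirroring the table above; note also that the datastore allows inequality filters on only one property per query, so the city <> 'London' test has to move into Python):

import math
from google.appengine.ext import db

class Score(db.Model):
    # Hypothetical entity mirroring the SQL table above.
    player_id = db.IntegerProperty()
    game_date = db.DateProperty()
    city = db.StringProperty()
    player_score = db.FloatProperty()

def player_score_stdev(player_id, start, end):
    q = db.GqlQuery(
        "SELECT * FROM Score WHERE player_id = :1 "
        "AND game_date >= :2 AND game_date <= :3",
        player_id, start, end)
    # 'city' would be a second inequality filter, which the datastore
    # disallows, so apply that condition here instead.
    scores = [s.player_score for s in q if s.city != 'London']
    if len(scores) < 2:
        return None
    mean = sum(scores) / len(scores)
    var = sum((x - mean) ** 2 for x in scores) / (len(scores) - 1)
    return math.sqrt(var)

This fetches every matching entity into memory, which is exactly the cost you pay for ad-hoc aggregates on the datastore.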

Update
You may consider using GAE for the UI and business logic, but keeping a separate relational DB somewhere else in the cloud -- SQL Server, DB2 on Amazon, MySQL elsewhere -- and then using the GAE datastore for pre-calculated aggregations and statistics. Stats are still calculated in the RDBMS, but you store the results (partial, pre-calculated stats) in GAE storage, similar to dimensional storage in analytic cubes.
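A minimal sketch of the GAE side of that hybrid (PrecalcStat and the key-name scheme are just illustrative):

from google.appengine.ext import db

class PrecalcStat(db.Model):
    # One pre-computed number pushed up from the external RDBMS.
    # key_name identifies the stat, e.g. 'stdev:player:1234:2007-2009'.
    value = db.FloatProperty()
    updated = db.DateTimeProperty(auto_now=True)

def store_stat(stat_id, value):
    # Called from a request handler that the RDBMS-side batch job POSTs to.
    PrecalcStat(key_name=stat_id, value=value).put()

def get_stat(stat_id):
    stat = PrecalcStat.get_by_key_name(stat_id)
    if stat:
        return stat.value
    return None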

Damir Sudarevic
I really appreciate your input. I am fairly set on using Django for the UI instead of App Engine's framework, though. Mixing two different database types does sound like a potentially great approach; it just sounds extremely difficult to set up...
Spike
+4  A: 

What you're describing is essentially OLAP - Online Analytical Processing. OLAP is one thing 'traditional' RDBMSes are very good at, thanks in part to the flexibility and power of SQL - and non-relational databases such as the App Engine datastore are not. It sounds like your OLAP-type queries will be relatively infrequent compared to normal access, though, so I'd suggest one of two approaches:

  • Mirror all your data from your App Engine datastore to a relational database at intervals, and perform the analytical queries on the relational database. User-facing traffic is still served by the datastore, so you get all the advantages of that, but you have an offline copy you can do complex queries against.
  • Use App Engine's Task Queue support to execute queries that examine large datasets. You can write your query in Python or Java, then use the Task Queue to execute it across a very large dataset, and pick up the results asynchronously when they're done. Obviously there's a bit of infrastructure work required to make this easy (though keep an eye on my blog for a future project involving this ;). A rough sketch of the batch-chaining idea follows.
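Here's that sketch, using key-order paging so each task picks up where the last one stopped (Score, AggregateResult, and the worker URL are placeholders; the running sums ride along in the task params, and an extra sum-of-squares accumulator would get you standard deviation as well as the mean):

from google.appengine.api.labs import taskqueue  # 'labs' path in current SDKs
from google.appengine.ext import db, webapp

BATCH = 500

class Score(db.Model):               # placeholder entity
    player_score = db.FloatProperty()

class AggregateResult(db.Model):     # where the final answer lands
    value = db.FloatProperty()

class AggregateWorker(webapp.RequestHandler):
    # Each task execution processes one batch, then enqueues the next.
    def post(self):
        n = int(self.request.get('n', '0'))
        total = float(self.request.get('total', '0'))
        last_key = self.request.get('last_key')

        q = Score.all().order('__key__')
        if last_key:
            q.filter('__key__ >', db.Key(last_key))
        batch = q.fetch(BATCH)

        for entity in batch:
            n += 1
            total += entity.player_score

        if len(batch) == BATCH:
            # Probably more entities left: chain a task for the next batch.
            taskqueue.add(url='/worker/aggregate', params={
                'n': str(n), 'total': str(total),
                'last_key': str(batch[-1].key())})
        elif n:
            AggregateResult(key_name='mean_score', value=total / n).put()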
Nick Johnson
I was hoping you would answer :). Thanks for the two great suggestions. I've updated my question to address a potential problem with your first suggestion. As for your second, I didn't even realize the Task Queue existed! It will take some looking into, but I wonder if it can solve all my problems.
Spike