ansaurus

Question

Answer 1

+3 A:

I would say that Bigtable-type storage is less suitable for statistical applications, for the very reasons you mention. But this is a classic trade-off that you have to make. I've seldom found myself needing the flexibility of really complex queries, but I have many times been forced to come up with more specialized solutions for stuff that shouldn't have been in the DB in the first place.

If you stick to an RDBMS, you can do logical partitioning and denormalization fairly easily, for instance through Hibernate's persistence strategies and Hibernate Shards. If you can live with somewhat slower processing, you can also run SQL-like queries on Bigtable-type storage (see, for instance, Pig Latin on Hadoop).

disown 2009-11-10 20:16:41

Thanks for the advice! I probably can live with the slower processing for certain things and would really like to use bigtable if at all possible.

Spike 2009-11-11 06:11:57

Answer 2

A:

The GAE datastore is a completely different animal from an RDBMS. In a relational DB it is easy to write something like:

SELECT STDEV(player_score)
FROM Table
WHERE player_id = 1234
  AND game_date BETWEEN '2007-01-01' AND '2009-11-10'
  AND city <> 'London'

GAE queries have many restrictions (see here), so this is not easy to translate. For aggregate functions (sum, stdev, etc.) you have to pull all the data into the application layer and calculate there, or maintain aggregate entities that are updated on each data insert/update.
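A minimal sketch of the second option (an aggregate that is updated on each insert), in plain Python rather than actual datastore code. The `ScoreAggregate` class and its field names are hypothetical, and Welford's online algorithm is my choice here for keeping a running standard deviation; it is not something the datastore provides.

```python
import math
from dataclasses import dataclass


@dataclass
class ScoreAggregate:
    """Hypothetical aggregate entity: one per player, updated on every insert."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def add(self, score: float) -> None:
        # Welford's online update: earlier scores never need to be re-read.
        self.count += 1
        delta = score - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (score - self.mean)

    def stdev(self) -> float:
        # Sample standard deviation; undefined for fewer than two scores.
        if self.count < 2:
            return float("nan")
        return math.sqrt(self.m2 / (self.count - 1))


agg = ScoreAggregate()
for score in [10.0, 12.0, 23.0, 23.0, 16.0]:
    agg.add(score)
```

Note that one aggregate per player cannot answer arbitrary filters such as the date range or `city <> 'London'` in the SQL above. You would need one aggregate per bucket you intend to query (for example per player and month), which is exactly the flexibility you give up.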

Update
You may consider using GAE for the UI and business logic, with a separate relational DB somewhere else in the cloud (Microsoft SQL, DB2 on Amazon, MySQL elsewhere), and then using the GAE datastore for pre-calculated aggregations and statistics. The stats are still calculated in the RDBMS, but you store the results (partial, pre-calculated stats) in GAE storage, similar to dimensional storage in analytic cubes.
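Roughly, the split could look like the sketch below. It is only an illustration: an in-memory SQLite database stands in for the external relational DB, the `scores` table and its sample rows are made up, and the `precomputed_stats` dict is a hypothetical stand-in for the entities you would write to the datastore.

```python
import math
import sqlite3

# In-memory SQLite stands in for the external relational database (assumption).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE scores (player_id INTEGER, game_date TEXT, city TEXT, player_score REAL)"
)
db.executemany(
    "INSERT INTO scores VALUES (?, ?, ?, ?)",
    [
        (1234, "2008-03-01", "Paris", 10.0),
        (1234, "2008-06-15", "London", 99.0),
        (1234, "2009-01-20", "Berlin", 14.0),
        (5678, "2008-09-09", "Paris", 7.0),
    ],
)

# SQLite has no built-in STDEV, so fetch the sums and finish in Python.
rows = db.execute(
    """
    SELECT player_id, COUNT(*), SUM(player_score), SUM(player_score * player_score)
    FROM scores
    WHERE game_date BETWEEN '2007-01-01' AND '2009-11-10'
      AND city <> 'London'
    GROUP BY player_id
    """
).fetchall()

precomputed_stats = {}  # hypothetical stand-in for datastore entities keyed by player
for player_id, n, total, total_sq in rows:
    mean = total / n
    stdev = None  # sample stdev is undefined for a single score
    if n > 1:
        # Clamp at zero to guard against tiny negative values from rounding.
        variance = max(total_sq - n * mean * mean, 0.0) / (n - 1)
        stdev = math.sqrt(variance)
    precomputed_stats[player_id] = {"count": n, "mean": mean, "stdev": stdev}
```

The user-facing side then only ever reads `precomputed_stats`-style entities by key, which is the kind of access the datastore handles well.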

Damir Sudarevic 2009-11-10 20:32:44

I really appreciate your input. I am fairly set on using Django for the UI instead of App Engine, though. Mixing two different database types does sound like a potentially great approach; it just sounds extremely difficult to set up...

Spike 2009-11-11 06:14:56

Answer 3

+4 A:

What you're describing is essentially OLAP - Online Analytical Processing. OLAP is one thing that 'traditional' RDBMSes are very good at, in part due to the flexibility and power of SQL - and non-relational databases such as the App Engine datastore aren't. It sounds like your OLAP-type queries will be relatively infrequent compared to normal access, though, so I'd suggest one of two approaches:

1. Mirror all your data from your App Engine datastore to a relational database at intervals, and perform the analytical queries on the relational database. User-facing traffic is still served by the datastore, so you get all the advantages of that, but you have an offline copy you can do complex queries against.
2. Use App Engine's Task Queue support to execute queries that examine large datasets. You can write your query in Python or Java, then use the Task Queue to execute it across a very large dataset, and pick up the results asynchronously, when they're done. Obviously there's a bit of infrastructure work required to make this easy (though keep an eye on my blog for a future project involving this ;).
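To give a feel for the second approach, here is a toy sketch of the pattern in plain Python. It does not use the real Task Queue API: the `deque` is a hypothetical stand-in for the queue, `DATA` stands in for the stored entities, and `process_batch` and `run_worker` are names invented for the example. The idea is that each task handles one batch and enqueues a follow-up task carrying the partial result.

```python
from collections import deque

DATA = list(range(1, 1001))  # stand-in for a large set of stored scores
BATCH_SIZE = 100
task_queue = deque()         # hypothetical stand-in for the real task queue
results = {}                 # where the finished answer is picked up later


def process_batch(offset, count, total):
    """One task: fold a single batch into the running totals."""
    batch = DATA[offset:offset + BATCH_SIZE]
    count += len(batch)
    total += sum(batch)
    if offset + BATCH_SIZE < len(DATA):
        # More data left: enqueue a follow-up task carrying the partial state.
        task_queue.append((process_batch, (offset + BATCH_SIZE, count, total)))
    else:
        # Last batch: publish the result (None if there was no data at all).
        results["mean"] = total / count if count else None


def run_worker():
    """Drain the queue, as a background worker would."""
    while task_queue:
        func, args = task_queue.popleft()
        func(*args)


task_queue.append((process_batch, (0, 0, 0)))
run_worker()
```

Because each task only touches one batch, no single request has to scan the whole dataset; the caller checks `results` later instead of waiting for the scan to finish.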

Nick Johnson 2009-11-10 22:39:03

I was hoping you would answer :). Thanks for the two great suggestions. I have updated my question to address a potential problem in your first suggestion. As for your second, I didn't even realize Task Queue existed! It will take some looking into, but I wonder if that will be able to solve all my problems.

Spike 2009-11-11 06:09:19


Complex Queries using GAE datastore
