views:

63

answers:

1

I'm kinda new to modeling my data model for efficient querying with GAE, but have extensive knowledge with RDBMS.

Here's the problem: I got roughly a million terms (strings) and need to query and compare associated numerical values as a time series with weekly data points. Think of it as a graph with time on the X axis and a linear Y axis showing the numerical measures.

So far I got the discrete data points per term and day in the datastore, and I'm looking for a way to aggregate the data by week and store the data so that I can query the datastore efficiently. I was thinking of precalculating a number of time series of different length (4 weeks, 5 weeks, 6 weeks etc.) per term and store each entry as {term, start_week, [time series]}

With RDBMS I could easily group by week and create the data series programatically, either as stored procedure or in the application back end. Due to GAE constraints and the nature of BigTable as a highly distributed system, this is not an option.

Any ideas are highly appreciated!

+1  A: 

The approach you're heading towards seems reasonable, but it all depends on the sort of queries you need to execute. Assuming you need to look up time series by name (string) and week, and you generally want to fetch between, say, 1 and 100 consecutive weeks worth of data, I would suggest the following:

  • Have one entity for each week's worth of data for each term, as you suggest
  • Instead of storing the data 'loose' and aggregating it periodically, store new points directly in this form. Whenever you receive a new data point, if it's the first point of the week, create a new entity. If it's not, retrieve the existing entity for that week and append your data point to it.
  • When you want to plot data, query for the term and time period you need, and fetch the results in time order.
Nick Johnson
The problem is: I have to query not for a time series for 1 term, but hundreds of terms. The query then takes too long to finish.
poezn
Can you elaborate on your use case? If you know the term and the time period, with correct use of key names you can do a single bulk datastore get for all the records you need, which is substantially more efficient than many queries.
Nick Johnson