views:

397

answers:

4

I am using the following class to store some data:

class NewsArticle(db.Model):
    score = db.FloatProperty(default=0.0)
    date_scored = db.DateTimeProperty()
    ...

What I need to do is to get those NewsArticle entities that have the top score in some time frame (e.g. get the top scored data entities of today or of last week).

I tried the following:

query = db.GqlQuery('SELECT * FROM NewsArticle WHERE date_created > DATETIME(:year, :month, :day, 0, 0, 0) ORDER BY score DESC', year=date.selected_year, month=date.selected_month, day=date.selected_day)

But that doesn't work since the datastore requires that the

first ordering property must be the same as inequality filter property

I thought about getting all NewsArticle entities for a specific timeframe and then doing the score-sorting in my application, but I am expecting a really large number of results so in-memory-sorting won't be efficient.

What other solutions are possible for my problem?

A: 

Try this:

import datetime
...
query = Data.gql('SELECT * FROM NewsArticle WHERE date_created >= :today ORDER BY score DESC', today=datetime.date.today())
Srirangan
"To get all results that match an inequality filter, a query scans the index table for the first matching row, then returns all consecutive results until it finds a row that doesn't match. For the consecutive rows to represent the complete result set, the rows must be ordered by the inequality filter before other sort orders." http://code.google.com/appengine/docs/java/datastore/queriesandindexes.html#Restrictions_on_Queries
jbochi
+1  A: 

You could either:

  1. filter only by the timeframe and sort by score in memory or,

  2. if you can limit the timeframes to whole days and weeks, include additional properties in your model to save the the week as a integer and the day as a DateProperty and do a simple equality check on that.

EDIT: To learn more, have a look at Restrictions on Queries

jbochi
That's what I did: For the last 24 hours I use solution #1 and for everything else I included additional fields for day, month, year and week.
Benedikt Eger
Glad to know I could help!
jbochi
A: 

I think ordering by multiple properties might work. Simply use the first inequality property as the first ordering property.

query = db.GqlQuery('SELECT * FROM NewsArticle WHERE date_created > DATETIME(:year, :month, :day, 0, 0, 0) ORDER BY date_created, score DESC', year=date.selected_year, month=date.selected_month, day=date.selected_day)
z33m
I believe this is a valid query, but you would not get the top scores.
jbochi
+1  A: 

I am expecting a really large number of results so in-memory-sorting won't be efficient.

You can get at most 1000 results from a query, so sorting them every which way will be extremely efficient -- e.g., consider, on my Macbook Air (first-day, slowest model):

$ python -mtimeit -s'import random; x=range(1000); random.shuffle(x)' 'y=sorted(x)'
1000 loops, best of 3: 714 usec per loop

AppEngine's CPUs are substantially faster than the Air, so 700 microseconds to sort 1000 results would be a very pessimistic estimate; this compares with tens of msec for fetching the data -- so, don't worry at all about sorting: as long as you can fetch the results you want, you'll be just fine.

BTW, to assess likely App Engine performance on a task, see Guido van Rossum's presentation here -- he claims 10-50 msec for a "typical db.get()" (50-100 for put, etc).

If you expect more than 1000 results from a query, that typically means you need to denormalize the table to trim the query's results to under 1000. For example, in your case, say that you expect about 500-700 entries a day -- in which case getting all results for today would be no problem, but a week is definitely a problem: you need to trim the query down to 20% or less of what it would "normally" be.

For example, say your scores are in the range 0-100, about uniformly distributed. In this case, you might add to the entity a boolean field "topcandidate": when you save the entity, you set that field to True iff the score is in the range 85-100 (as you see, this means the table gets denormalized, since the field represents logically redundant information).

When you fetch weekly top results, you add the equality condition to have topcandidate True. Instead of 3500-4900 results, this should get you down to 500-900 -- the top-scoring 15% or so, after which you can sort them in memory and pick, say, the top 100.

Of course the exact numbers depend on the distribution of your score field (more likely to be a bell curve than a flat uniform distribution) and how many "top score candidates" you need, but this is one generally useful approach to work around the 1000-results limit.

Alex Martelli