views:

447

answers:

3

I've enjoyed building out a couple simple applications on the GAE, but now I'm stumped about how to architect a music collection organizer on the app engine. In brief, I can't figure out how to filter on multiple properties while sorting on another.

Let's assume the core model is an Album that contains several properties, including:

  • Title
  • Artist
  • Label
  • Publication Year
  • Genre
  • Length
  • List of track names
  • List of moods
  • Datetime of insertion into database

Let's also assume that I would like to filter the entire collection using those properties, and then sorting the results by one of:

  • Publication year
  • Length of album
  • Artist name
  • When the info was added into the database

I don't know how to do this without running into the exploding index conundrum. Specifically, I'd love to do something like:

Albums.all().filter('publication_year <', 1980).order('artist_name')

I know that's not possible, but what's the workaround?

This seems like a fairly general type of application. The music albums could be restaurants, bottles of wine, or hotels. I have a collection of items with descriptive properties that I'd like to filter and sort.

Is there a best practice data model design that I'm overlooking? Any advice?

A: 

As you say, you can't have an inequality condition on one field and an order by another (or inequalities on two fields, etc, etc). The workaround is simply to use the "best" inequality condition to get data in memory (where "best" means the one that's expected to yield the least data) and then further refine it and order it by Python code in your application.

Python's list comprehensions (and other forms of loops &c), list's sort method and the sorted built-in function, the itertools module in the standard library, and so on, all help a lot to make these kinds of tasks quite simple to perform in Python itself.

Alex Martelli
Thanks for replying. It was my impression that I should make every effort to avoid computation within a view to avoid exhausting the GAE resources. On the other hand, sorting a few hundred (at most 1000) items isn't a hard task.Have you done these type of sorts on the app engine when returning results? Does it stress the resource limits?
Greg
I've done them and it's far from touching the 30-seconds deadline. On my laptop, sorting a randomly-shuffled 1000-items list takes about 500 microseconds, and I suspect GAE's servers are faster than my laptop;-)
Alex Martelli
@Alex: You're leaving out the cost of fetching and decoding 1000 entities, though, which is significant.
Nick Johnson
@Nick, sure, but if you fetched them already sorted that would not reduce the cost of fetching and decoding them! Point is, the cost of the sorting in memory is absolutely negligible.
Alex Martelli
@Alex: Thank you!
Greg
@Alex: But if he wants only, say, the top 20 results, and has to fetch and sort 1000 to get them, that's a lot of extra overhead that you have to include into your benchmark.
Nick Johnson
+1  A: 

There's a couple of options here: You can filter as best as possible, then sort the results in memory, as Alex suggests, or you can rework your data structures for equality filters instead of inequality filters.

For example, assuming you only want to filter by decade, you can add a field encoding the decade in which the song was recorded. To find everything before or after a decade, do an IN query for the decades you want to span. This will require one underlying query per decade included, but if the number of records is large, this can still be cheaper than fetching all the results and sorting them in memory.

Nick Johnson
This (reworking for equalisty filters) is actually what Brett has suggested in his GoogleIO talk: http://code.google.com/events/io/sessions/BuildingScalableComplexApps.html. I would suggest watching it - really insightful.
Bartosz Radaczyński
I was there at the time, so I don't need to watch it. ;)
Nick Johnson
+1  A: 

Since storage is cheap, you could create your own ListProperty based indexfiles with key_names that reflect the sort criteria.

class album_pubyear_List(db.Model):
    words = db.StringListProperty()

class album_length_List(db.Model):
    words = db.StringListProperty()

class album_artist_List(db.Model):
    words = db.StringListProperty()

class Album(db.Model):
    blah...

    def save(self):
        super(Album, self).save()

        # you could do this at save time or batch it and do
        # it with a cronjob or taskqueue

        words = []

        for field in ["title", "artist", "label", "genre", ...]:
            words.append("%s:%s" %(field, getattr(self, field)))

        word_records = []
        now = repr(time.time())
        word_records.append(album_pubyear_List(parent=self, key_name="%s_%s" %(self.pubyear, now)), words=words)
        word_records.append(album_length_List(parent=self, key_name="%s_%s" %(self.album_length, now)), words=words)
        word_records.append(album_artist_List(parent=self, key_name="%s_%s" %(self.artist_name, now)), words=words)
        db.put(word_records)

Now when it's time to search you create an appropriate WHERE clause and call the appropriate model

where = "WHERE words = " + "%s:%s" %(field-a, value-a) + " AND " + "%s:%s" %(field-b, value-b) etc.
aModel = "album_pubyear_List" # or anyone of the other key_name sorted wordlist models

indexes = db.GqlQuery("""SELECT __key__ from %s %s""" %(aModel, where))
keys = [k.parent() for k in indexes[offset:numresults+1]] # +1 for pagination
object_list = db.get(keys) # returns a sorted by key_name list of Albums
molicule