This is a two-part question. I've searched for and tried several methods of getting a list of unique values for a model class, and I haven't been happy with any of them in practice.
Does anyone have a simple example of getting the unique values, for instance the unique links for the model below? Here is my very slow attempt.

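import time
from google.appengine.ext import db

class LinkRating2(db.Model):
    user = db.StringProperty()
    link = db.StringProperty()
    rating2 = db.FloatProperty()

def uniqueLinkGet(tabl):
    start = time.time()
    dic = {}
    query = tabl.all()
    # Iterates over every entity of the model, so the cost grows with the table size.
    for obj in query:
        dic[obj.link] = 1  # the dict keys act as the set of unique links
    end = time.time()
    print end - start
    return dic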

My second question: is using the query as an iterator slower than calling fetch()? Is there a faster way to write the code below, especially if the number of entities returned is larger than 1000?

query = LinkRating2.all()
link1 = 'some random string'
a = query.filter('link =', link1)
adic = {}
for itema in a:
    adic[itema.user] = itema.rating2
+3  A: 

1) One trick to make this query fast is to denormalize your data. Specifically, create another model that stores each link as its key name. You can then get the list of unique links by reading every entity of that model. Assuming you have many LinkRating2 entities per link, this will save you a lot of time. Example:

class Link(db.Model):
    pass  # the only data in this model will be stored in its key

# Whenever a link is added, you can try to add it to the datastore.  If it already
# exists, then this is functionally a no-op - it will just overwrite the old copy of
# the same link.  Using link as the key_name ensures there will be no duplicates.
Link(key_name=link).put()

# Get all the unique links by retrieving the Link entities and extracting the
# key name of each one.  You'll need to use cursors if you have >1,000 entities.
unique_links = [x.key().name() for x in Link.all().fetch(1000)]

Another idea: If you need to do this query frequently, then keep a copy of the results in memcache so you don't have to read all of this data from the datastore all the time. A single memcache entry can only store 1MB of data, so you may have to split your links data into chunks to store it in memcache.
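A rough sketch of the chunking idea, in case it helps. The key names ('unique_links:count', 'unique_links:0', ...) and the default chunk_size are made up for illustration; chunk_size would need tuning so that each pickled chunk stays under the 1MB limit. The functions take a cache object and only call set, get, set_multi and get_multi on it, so passing the google.appengine.api.memcache module should work, but treat this as a starting point rather than tested production code.

```python
def store_links(cache, links, chunk_size=1000):
    """Split links into chunks and store each chunk under its own key."""
    links = list(links)
    chunks = [links[i:i + chunk_size] for i in range(0, len(links), chunk_size)]
    # Hypothetical key scheme: one key per chunk plus one key for the chunk count.
    mapping = dict(('unique_links:%d' % i, chunk) for i, chunk in enumerate(chunks))
    mapping['unique_links:count'] = len(chunks)
    cache.set_multi(mapping)

def load_links(cache):
    """Return the cached list of links, or None if anything is missing."""
    count = cache.get('unique_links:count')
    if count is None:
        return None
    keys = ['unique_links:%d' % i for i in range(count)]
    found = cache.get_multi(keys)
    if len(found) != count:
        return None  # a chunk was evicted, so rebuild from the datastore
    links = []
    for key in keys:
        links.extend(found[key])
    return links
```

When load_links returns None, read the Link entities from the datastore as above and call store_links again. Memcache can evict entries at any time, which is why a missing chunk is treated as a full cache miss.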

2) It is faster to use fetch() than the iterator. The iterator fetches entities in small batches, and each batch costs a round-trip to the datastore. With fetch(), you get all the requested results at once in a single round-trip. In short, use fetch() if you know you are going to need lots of results.
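Applied to the second snippet in the question, that could look something like the sketch below. The helper name ratings_for_link and the default limit of 1000 are my own choices for illustration; the query argument is expected to be something like LinkRating2.all().

```python
def ratings_for_link(query, link, limit=1000):
    """Return {user: rating2} for one link using a single fetch() call."""
    # fetch() returns a list of entities instead of iterating in small batches.
    results = query.filter('link =', link).fetch(limit)
    return dict((r.user, r.rating2) for r in results)

# Usage, assuming the LinkRating2 model from the question:
#   adic = ratings_for_link(LinkRating2.all(), 'some random string')
```

If a link can have more ratings than the limit, you would need to page through the results with cursors, as mentioned in the first part.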

David Underhill