ansaurus

Question

Loading datasets from datastore and merge into single dictionary. Resource problem.

Answer 1

+2 A:

A few simple ideas:

1) Since you need all the results, instead of doing a for loop like you have, call fetch() explicitly to just go ahead and get all the results at once. Otherwise, the for loop may result in multiple queries to the datastore as it only gets so many items at once. For example, perhaps you could try:

return ProductParts.gql('WHERE __key__ IN :key', key = self.defaultParts).fetch(1000)

2) Maybe only load part of the data in the initial request. Then use AJAX techniques to load additional data as needed. For example, start by returning the product information, and then make additional AJAX requests to get the parts.

3) Like Will pointed out, IN queries perform one query PER argument.

Problem: An IN query does one equals query for each argument you give it. So key IN self.defaultParts actually does len(self.defaultParts) queries.
Possible Improvement: Try denormalizing your data more. Specifically, store a list of products each part is used in on each part. You could structure your Parts model like this:

    class ProductParts(db.Model):
        ...
        products = db.ListProperty(db.Key)  # product keys
        ...

Then you can do ONE query to per product instead of N queries per product. For example, you could do this:

parts = ProductParts.all().filter("products =", product).fetch(1000)

The trade-off? You have to store more data in each ProductParts entity. Also, when you write a ProductParts entity, it will be a little slower because it will cause 1 row to be written in the index for each element in your list property. However, you stated that you only have 100 products so even if a part was used in every product the list still wouldn't be too big (Nick Johnson mentions here that you won't get in trouble until you try to index a list property with ~5,000 items).

Less critical improvement idea:

4) You can create the GqlQuery object ONCE and then reuse it. This isn't your main performance problem by any stretch, but it will help a little. Example:

GQL_PROD_PART_BY_KEYS = ProductParts.gql('WHERE __key__ IN :1')
@property
def defaultPartsData(self):
    return GQL_PROD_PART_BY_KEYS.bind(self.defaultParts)

You should also use AppStats so you can see exactly why your request is taking so long. You might even consider posting a screenshot of appstats info about your request along with your post.

Here is what the code might look like if you re-wrote it fetch the data with fewer round-trips to the datastore (these changes are based on ideas #1, #3, and #4 above).

GQL_PARTS_BY_PRODUCT = ProductParts.gql('WHERE products = :1')
for product in productData:
    productDict = {}
    typeDict = {}
    productDict['productName'] = product.name

    cache_key = 'productparts_%s' % (slugify(product.key()))
    partData = memcache.get(cache_key)

    if not partData:
        for type in typeData:
            typeDict[type.typeId] = { 'default' : '', 'optional' : [] }

        # here's a new approach that does just ONE datastore query (for each product)
        GQL_PARTS_BY_PRODUCT.bind(product)
        parts = GQL_PARTS_BY_PRODUCT.fetch(1000)
        for part in parts:
            if part.key() in self.defaultParts:
                part_type = 'default'
            else:
                part_type = 'optional'

            for label in labelsForLangCode:
                if label.key() in defaultPart.partLabelList:
                    typeDict[defaultPart.type.typeId][part_type] = label.partLangLabel
        # (end new code)
        memcache.add(cache_key, typeDict, 500)
        partData = memcache.get(cache_key)

    productDict['parts'] = partData    
    productList.append(productDict)

David Underhill 2010-05-10 22:35:00

Both are very good suggestions. Iterating over a query object will result in one datastore query per iteration, and you can see how that would add up in nested loops.

Will McCutchen 2010-05-10 22:40:12

@dound and @will: What you both point out is the problem lies in the queries using the IN operator. That seams likely since it make alot of subqueries. How would you structure the database models? I don't really know how the connections between parts and products should look in any other way. Since you can't store a list of db.ReferanceProperty but only a db.ListProperty(db.key). Also I'll tried adding the AppStats to see what it says.

fredrik 2010-05-11 05:56:27

I added a mixture of what you did and little more. I will update my question.

fredrik 2010-05-11 11:22:18

There's some more fixed and small problems. But the solution I posted above with some fixes with the ReferenceProperty got me down to about 1000ms and 1500ms when a new product is added. Thanks alot for all help! You did an great job on the one!

fredrik 2010-05-11 16:09:57

Great, I'm glad you've got it working well now!

David Underhill 2010-05-11 17:53:22

Answer 2

+1 A:

One important thing to be aware of is the fact that IN queries (along with != queries) result in multiple subqueries being spawned behind the scenes, and there's a limit of 30 subqueries.

So your ProductParts.gql('WHERE __key__ IN :key', key = self.defaultParts) query will actually spawn len(self.defaultParts) subqueries behind the scenes, and it will fail if len(self.defaultParts) is greater than 30.

Here's the relevant section from the GQL Reference:

Note: The IN and != operators use multiple queries behind the scenes. For example, the IN operator executes a separate underlying datastore query for every item in the list. The entities returned are a result of the cross-product of all the underlying datastore queries and are de-duplicated. A maximum of 30 datastore queries are allowed for any single GQL query.

You might try installing AppStats for your app to see where else it might be slowing down.

Will McCutchen 2010-05-10 22:51:03

Very good point. I was just about to add that :p. He might be able to avoid doing the IN query if he adds a list to each part which contains the keys of the products it is used in. Then each product could get all its parts with a single query. And he only has 100 products so the list wouldn't be too big.

David Underhill 2010-05-10 22:54:43

@dound: "He might be able to avoid doing the IN query if he adds a list to each part which contains the keys of the products it is used in." I'm a bit confused over this. If I add a product's key to a db.ListProperty(db.key) on each part. How would that make a difference? I seams to me that it's the same way I do it now, but just the other way around?

fredrik 2010-05-11 05:59:25

If you put the product's key in a list property on the part, then you can do one query per product to get all its parts. On the other hand, right now you do len(self.defaultParts)+len(self.optionalParts) queries per product. I try to explain a little more in my (edited) answer.

David Underhill 2010-05-11 07:15:02

Answer 3

A:

I think the problem is one of design: wanting to construct a relational join table in memcache when the framework specifically abhors that.

GAE will toss your job out because it takes too long, but you shouldn't be doing it in the first place. I'm a GAE tyro myself, so I cannot specify how it should be done, unfortunately.

msw 2010-05-10 22:53:40

ansaurus

tags:

views:

answers:

Loading datasets from datastore and merge into single dictionary. Resource problem.

related questions