views:

108

answers:

3

Hi,

I have a productdatabase that contains products, parts and labels for each part based on langcodes.

The problem I'm having and haven't got around is a huge amount of resource used to get the different datasets and merging them into a dict to suit my needs.

The products in the database are based on a number of parts that is of a certain type (ie. color, size). And each part has a label for each language. I created 4 different models for this. Products, ProductParts, ProductPartTypes and ProductPartLabels.

I've narrowed it down to about 10 lines of code that seams to generate the problem. As of currently I have 3 Products, 3 Types, 3 parts for each type, and 2 languages. And the request takes a wooping 5500ms to generate.

for product in productData:
        productDict = {}
        typeDict = {}
        productDict['productName'] = product.name

        cache_key = 'productparts_%s' % (slugify(product.key()))
        partData = memcache.get(cache_key)

        if not partData:
            for type in typeData:
                typeDict[type.typeId] = { 'default' : '', 'optional' : [] }
            ## Start of problem lines ##
            for defaultPart in product.defaultPartsData:
                for label in labelsForLangCode:
                    if label.key() in defaultPart.partLabelList:
                        typeDict[defaultPart.type.typeId]['default'] = label.partLangLabel

            for optionalPart in product.optionalPartsData:
                for label in labelsForLangCode:
                    if label.key() in optionalPart.partLabelList:
                        typeDict[optionalPart.type.typeId]['optional'].append(label.partLangLabel)
            ## end problem lines ##
            memcache.add(cache_key, typeDict, 500)
            partData = memcache.get(cache_key)

        productDict['parts'] = partData    
        productList.append(productDict)

I guess the problem lies in the number of for loops is too many and have to iterate over the same data over and over again. labelForLangCode get all labels from ProductPartLabels that match the current langCode.

All parts for a product is stored in a db.ListProperty(db.key). The same goes for all labels for a part.

The reason I need the some what complex dict is that I want to display all data for a product with it's default parts and show a selector for the optional one.

The defaultPartsData and optionaPartsData are properties in the Product Model that looks like this:

@property
def defaultPartsData(self):
    return ProductParts.gql('WHERE __key__ IN :key', key = self.defaultParts)

@property
def optionalPartsData(self):
    return ProductParts.gql('WHERE __key__ IN :key', key = self.optionalParts)

When the completed dict is in the memcache it works smoothly, but isn't the memcache reset if the application goes in to hibernation? Also I would like to show the page for first time user(memcache empty) with out the enormous delay.

Also as I said above, this is only a small amount of parts/product. What will the result be when it's 30 products with 100 parts.

Is one solution to create a scheduled task to cache it in the memcache every hour? It this efficient?

I know this is alot to take in, but I'm stuck. I've been at this for about 12 hours straight. And can't figure out a solution.

..fredrik

EDIT:

A AppStats screenshoot here.

From what I can read the queries seams fine in AppStats. only taking about 200-400 ms. How can the difference be that big?

EDIT 2:

I implemented dound's solution and added abit. Now it looks like this:

langCode = 'en'
    typeData = Products.ProductPartTypes.all()
    productData = Products.Product.all()
    labelsForLangCode = Products.ProductPartLabels.gql('WHERE partLangCode = :langCode', langCode = langCode)
    productList = []

    label_cache_key = 'productpartslabels_%s' % (slugify(langCode))
    labelData = memcache.get(label_cache_key)

    if labelData is None:
        langDict = {}
        for langLabel in labelsForLangCode:
            langDict[str(langLabel.key())] = langLabel.partLangLabel

        memcache.add(label_cache_key, langDict, 500)
        labelData = memcache.get(label_cache_key)

    GQL_PARTS_BY_PRODUCT = Products.ProductParts.gql('WHERE products = :1')
    for product in productData:
        productDict = {}
        typeDict = {}
        productDict['productName'] = product.name

        cache_key = 'productparts_%s' % (slugify(product.key()))
        partData = memcache.get(cache_key)

        if partData is None:
            for type in typeData:
                typeDict[type.typeId] = { 'default' : '', 'optional' : [] }

            GQL_PARTS_BY_PRODUCT.bind(product)
            parts = GQL_PARTS_BY_PRODUCT.fetch(1000)
            for part in parts:
                for lb in part.partLabelList:
                    if str(lb) in labelData:
                        label = labelData[str(lb)]
                        break

                if part.key() in product.defaultParts:
                    typeDict[part.type.typeId]['default'] = label
                elif part.key() in product.optionalParts:
                    typeDict[part.type.typeId]['optional'].append(label)

            memcache.add(cache_key, typeDict, 500)
            partData = memcache.get(cache_key)

        productDict['parts'] = partData    
        productList.append(productDict) 

The result is much better. I now have about 3000ms with out memcache and about 700ms with.

I'm still abit worried about the 3000ms, and on the local app_dev server the memcache gets filled up for each reload. Shouldn't put everything in there and then read from it?

Last but not least, does anyone know why the request take about 10x as long on the production server the the app_dev?

EDIT 3: I noticed that non of the db.Model are indexed, could this make a differance?

EDIT 4: After consulting AppStats (And understanding it, took some time. It seams that the big problems lies within part.type.typeId where part.type is a db.ReferenceProperty. Should have seen it before. And maybe explained it better :) I'll rethink that part. And get back to you.

..fredrik

+2  A: 

A few simple ideas:

1) Since you need all the results, instead of doing a for loop like you have, call fetch() explicitly to just go ahead and get all the results at once. Otherwise, the for loop may result in multiple queries to the datastore as it only gets so many items at once. For example, perhaps you could try:

return ProductParts.gql('WHERE __key__ IN :key', key = self.defaultParts).fetch(1000)

2) Maybe only load part of the data in the initial request. Then use AJAX techniques to load additional data as needed. For example, start by returning the product information, and then make additional AJAX requests to get the parts.

3) Like Will pointed out, IN queries perform one query PER argument.

  • Problem: An IN query does one equals query for each argument you give it. So key IN self.defaultParts actually does len(self.defaultParts) queries.
  • Possible Improvement: Try denormalizing your data more. Specifically, store a list of products each part is used in on each part. You could structure your Parts model like this:
    class ProductParts(db.Model):
        ...
        products = db.ListProperty(db.Key)  # product keys
        ...
  • Then you can do ONE query to per product instead of N queries per product. For example, you could do this:

parts = ProductParts.all().filter("products =", product).fetch(1000)

  • The trade-off? You have to store more data in each ProductParts entity. Also, when you write a ProductParts entity, it will be a little slower because it will cause 1 row to be written in the index for each element in your list property. However, you stated that you only have 100 products so even if a part was used in every product the list still wouldn't be too big (Nick Johnson mentions here that you won't get in trouble until you try to index a list property with ~5,000 items).

Less critical improvement idea:

4) You can create the GqlQuery object ONCE and then reuse it. This isn't your main performance problem by any stretch, but it will help a little. Example:

GQL_PROD_PART_BY_KEYS = ProductParts.gql('WHERE __key__ IN :1')
@property
def defaultPartsData(self):
    return GQL_PROD_PART_BY_KEYS.bind(self.defaultParts)

You should also use AppStats so you can see exactly why your request is taking so long. You might even consider posting a screenshot of appstats info about your request along with your post.


Here is what the code might look like if you re-wrote it fetch the data with fewer round-trips to the datastore (these changes are based on ideas #1, #3, and #4 above).

GQL_PARTS_BY_PRODUCT = ProductParts.gql('WHERE products = :1')
for product in productData:
    productDict = {}
    typeDict = {}
    productDict['productName'] = product.name

    cache_key = 'productparts_%s' % (slugify(product.key()))
    partData = memcache.get(cache_key)

    if not partData:
        for type in typeData:
            typeDict[type.typeId] = { 'default' : '', 'optional' : [] }

        # here's a new approach that does just ONE datastore query (for each product)
        GQL_PARTS_BY_PRODUCT.bind(product)
        parts = GQL_PARTS_BY_PRODUCT.fetch(1000)
        for part in parts:
            if part.key() in self.defaultParts:
                part_type = 'default'
            else:
                part_type = 'optional'

            for label in labelsForLangCode:
                if label.key() in defaultPart.partLabelList:
                    typeDict[defaultPart.type.typeId][part_type] = label.partLangLabel
        # (end new code)
        memcache.add(cache_key, typeDict, 500)
        partData = memcache.get(cache_key)

    productDict['parts'] = partData    
    productList.append(productDict)
David Underhill
Both are very good suggestions. Iterating over a query object will result in one datastore query per iteration, and you can see how that would add up in nested loops.
Will McCutchen
@dound and @will: What you both point out is the problem lies in the queries using the IN operator. That seams likely since it make alot of subqueries. How would you structure the database models? I don't really know how the connections between parts and products should look in any other way. Since you can't store a list of db.ReferanceProperty but only a db.ListProperty(db.key). Also I'll tried adding the AppStats to see what it says.
fredrik
I added a mixture of what you did and little more. I will update my question.
fredrik
There's some more fixed and small problems. But the solution I posted above with some fixes with the ReferenceProperty got me down to about 1000ms and 1500ms when a new product is added. Thanks alot for all help! You did an great job on the one!
fredrik
Great, I'm glad you've got it working well now!
David Underhill
+1  A: 

One important thing to be aware of is the fact that IN queries (along with != queries) result in multiple subqueries being spawned behind the scenes, and there's a limit of 30 subqueries.

So your ProductParts.gql('WHERE __key__ IN :key', key = self.defaultParts) query will actually spawn len(self.defaultParts) subqueries behind the scenes, and it will fail if len(self.defaultParts) is greater than 30.

Here's the relevant section from the GQL Reference:

Note: The IN and != operators use multiple queries behind the scenes. For example, the IN operator executes a separate underlying datastore query for every item in the list. The entities returned are a result of the cross-product of all the underlying datastore queries and are de-duplicated. A maximum of 30 datastore queries are allowed for any single GQL query.

You might try installing AppStats for your app to see where else it might be slowing down.

Will McCutchen
Very good point. I was just about to add that :p. He might be able to avoid doing the IN query if he adds a list to each part which contains the keys of the products it is used in. Then each product could get all its parts with a single query. And he only has 100 products so the list wouldn't be too big.
David Underhill
@dound: "He might be able to avoid doing the IN query if he adds a list to each part which contains the keys of the products it is used in." I'm a bit confused over this. If I add a product's key to a db.ListProperty(db.key) on each part. How would that make a difference? I seams to me that it's the same way I do it now, but just the other way around?
fredrik
If you put the product's key in a list property on the part, then you can do one query per product to get all its parts. On the other hand, right now you do len(self.defaultParts)+len(self.optionalParts) queries per product. I try to explain a little more in my (edited) answer.
David Underhill
A: 

I think the problem is one of design: wanting to construct a relational join table in memcache when the framework specifically abhors that.

GAE will toss your job out because it takes too long, but you shouldn't be doing it in the first place. I'm a GAE tyro myself, so I cannot specify how it should be done, unfortunately.

msw