Hi,

I have three models: Product, Category and Place. Product has a ManyToMany relation with both Category and Place. I need to get a list of categories that have at least one product matching a specific place. For example, I might need all the categories that have at least one product from Boston.

I have 100 categories, 500 places and 100,000 products.

In SQLite with 10K products the query takes about a second. In production I'll use PostgreSQL.

I'm using:

categories = Category.objects.distinct().filter(product__place__name="Boston")

Is this query going to be expensive? Is there a better way to do this?

This is the result of connection.queries

{'time': '0.929', 'sql': u'SELECT DISTINCT "catalog_category"."id", "catalog_category"."name" FROM "catalog_category" INNER JOIN "catalog_product_categories" ON ("catalog_category"."id" = "catalog_product_categories"."category_id") INNER JOIN "catalog_product" ON ("catalog_product_categories"."product_id" = "catalog_product"."id") INNER JOIN "catalog_product_places" ON ("catalog_product"."id" = "catalog_product_places"."product_id") INNER JOIN "catalog_place" ON ("catalog_product_places"."place_id" = "catalog_place"."id") WHERE "catalog_place"."name" = Boston  ORDER BY "catalog_category"."name" ASC'}

Thanks

A: 

This is not just a Django issue; DISTINCT is slow on most SQL implementations because it's a relatively hard operation. Here is a good discussion of why it's slow in Postgres specifically.

One way to handle this would be to use Django's excellent caching mechanism on this query, assuming that the results don't change often and minor staleness isn't a problem. Another approach would be to keep a separate list of just the distinct categories, perhaps in another table.

Chase Seibert
A: 

Although Chase is right that DISTINCT is generally a slow operation, in this case it is also completely pointless. As you can see from the generated SQL, the DISTINCT is being done on the combination of ID and name - which will never be duplicated anyway. So there is no need for the distinct() call in this query.

Generally, Django does not return duplicate results from a simple filter. The main time when distinct() is useful is when you are accessing a related queryset via a ManyToMany or ForeignKey relationship, where multiple items might be related to the same instance, and distinct will remove the duplicates.

Daniel Roseman
I do get duplicate results in this case if I don't use distinct(), and I do have a ManyToMany relationship. Am I missing something or doing something wrong?
pablo
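
The duplicates mentioned in the comment above can be reproduced outside Django. The following is a minimal sketch using Python's built-in sqlite3 module, not the poster's actual schema: the table names, column names and sample rows are all made up for illustration. It runs the same kind of join as the generated SQL with and without DISTINCT, and then tries an EXISTS subquery as one possible alternative that never produces the duplicate rows in the first place. Whether the EXISTS form is actually faster on PostgreSQL with 100,000 products is something to measure rather than assume.

```python
import sqlite3

# Hypothetical, simplified schema: two many-to-many join tables around product.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE place (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product_categories (product_id INTEGER, category_id INTEGER);
CREATE TABLE product_places (product_id INTEGER, place_id INTEGER);

INSERT INTO category VALUES (1, 'Books'), (2, 'Toys');
INSERT INTO place VALUES (1, 'Boston'), (2, 'Denver');
INSERT INTO product VALUES (1, 'Atlas'), (2, 'Novel'), (3, 'Robot');
-- Two Books products are sold in Boston; the only Toys product is in Denver.
INSERT INTO product_categories VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO product_places VALUES (1, 1), (2, 1), (3, 2);
""")

# Join from category through the join tables to place.
# Each matching product contributes one row for its category.
JOIN_BODY = """
FROM category c
JOIN product_categories pc ON pc.category_id = c.id
JOIN product_places pp ON pp.product_id = pc.product_id
JOIN place pl ON pl.id = pp.place_id
WHERE pl.name = ?
ORDER BY c.name
"""

plain_rows = conn.execute(
    "SELECT c.id, c.name " + JOIN_BODY, ("Boston",)
).fetchall()

distinct_rows = conn.execute(
    "SELECT DISTINCT c.id, c.name " + JOIN_BODY, ("Boston",)
).fetchall()

# Alternative: ask only whether at least one matching product exists,
# so each category appears at most once without needing DISTINCT.
exists_rows = conn.execute(
    """
    SELECT c.id, c.name
    FROM category c
    WHERE EXISTS (
        SELECT 1
        FROM product_categories pc
        JOIN product_places pp ON pp.product_id = pc.product_id
        JOIN place pl ON pl.id = pp.place_id
        WHERE pc.category_id = c.id AND pl.name = ?
    )
    ORDER BY c.name
    """,
    ("Boston",),
).fetchall()

print("join without DISTINCT:", plain_rows)
print("join with DISTINCT:   ", distinct_rows)
print("EXISTS subquery:      ", exists_rows)
```

With this sample data the plain join returns the Books category once per matching product, while the DISTINCT and EXISTS versions each return it once. In Django, a similar EXISTS query can likely be built with the Exists and OuterRef expressions from django.db.models, though the exact queryset depends on the real models.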