views:

314

answers:

1

Am building up a service that needs to maintain something along the lines of a case tracking system. Here's our model:

class Incident(models.Model):    
    title = models.CharField(max_length=128)
    category = models.ForeignKey(Category)
    status = models.ForeignKey(Status)    
    severity = models.ForeignKey(Severity)
    owned_by = models.ForeignKey(User, related_name="owned_by", null=True, blank=True)   
    next_action = models.ForeignKey(IncidentAction)    
    created_date  = models.DateTimeField()
    created_by = models.ForeignKey(User, related_name="opened_by")    
    last_edit_date = models.DateTimeField(null=True, blank=True)
    last_edit_by = models.ForeignKey(User, related_name="last_edit_by", null=True, blank=True)        
    closed_date  = models.DateTimeField(null=True, blank=True)
    closed_by = models.ForeignKey(User, related_name="Closed by", null=True, blank=True)

Because there are a lot of foreign keys being pulled into this model, it makes for interesting sql queries. We've been using as a trial the djblets data grid and the django debug toolbar, and are alarmed at the sheer number of queries being hit every time we add a new column for view that uses a foreign key, it does basically this type of query workflow:

#prepare the grid
select * from incident_table;
#render each row
for each row in incident table
    for each column that is a foreign key select row from foreign table with id

It does an additional select query per row for each column that tries to pull a property for a foreign key.

I'm thinking that this is a universal problem with django and its ORM with regard to pulling in properties from foreign key models for display. As a test, I dropped the data grid and just did a simple list of properties for a queryset, and saw the queries balloon up in a similar way.

We're wanting to scale this up with tons of users hitting the model. As comparison, I made a similar view off the User model, and its full display is just done with one query because if you only pull fields from the given model, it doesn't do an additional db hit per additional column.

Some optimizations we tried were:

  • django-orm-cache: Doesn't seem to work with django 1.0.4
  • django-caching: This works well for caching the oft queried models
  • view level caching with memcached
  • Edit: using select_related() may speed up template rendering by not having a round trip back to the database, but it seems like it follows the foreign keys just ahead of time on the original queryset using the single query per foreign key. Just seems to move the multi database query hit ahead of time.

But there are some deeper questions we're soliciting the wisdom of the crowd on:

  • For models with tons of foreign keys, what is the best way to make it efficiently query to get properties from foreign keys?
    • Is caching the dependent models the only way to go using the above ORM caching systems?
    • Or is this a standard case of outgrowing the ORM and needing to roll our own custom sql query with joins to get the desired datagrid output in as efficiently as possible?

Related questions that raised concerns on caching and foreign keys:

DB / performance: layout of django model that rarely refers to its parent more than once, Django ORM: caching and manipulating ForeignKey objects:

+3  A: 

select_related() is the correct solution; you're wrong about how it's supposed to work. I think you're not using select_related correctly if you're still getting multiple queries across the specified FK. A quick log of a Python session (Studio has a FK to django.auth.user here):

>>> from django.db import connection
>>> studios = Studio.objects.all().select_related('user')
>>> for studio in studios:
>>>     print studio.user.email
>>>        
[email protected]
[email protected]
>>> len(connection.queries) 
1

So, I got a list of Studio objects (2 in my test DB) and got the User for each one in one SQL query. Without the select_related() call, it takes three queries.

Note that select_related doesn't handle many-to-many relationships -- though I think you can manually query the intermediate table of the m2m to follow those FKs either direction without requiring the extra query, as long as you're OK starting your queryset from the intermediate object. Maybe that's what's catching you? You only specified FK relationships, not m2ms, so I gave a simple illustration of that.

Yoni Samlan
Ah, yes. I stand very corrected - looks like the datagrid is actually ignoring any optimizations provided by the supplied queryset and does its own lookups. Independent testing of my model with your example made the benefits of select_related() as clear as day. Thanks!
dmyung