Django patterns, part 2: efficient reverse lookups

One of the main sources of unnecessary database queries in Django applications is reverse relations.

By default, Django doesn't do anything to follow relations across models. This means that unless you're careful, any relationship can lead to extra hits on the database. For instance, assuming MyModel has a ForeignKey to MyRelatedModel, this:

myobj = MyModel.objects.get(pk=1)
print(myobj.myrelatedmodel.name)

hits the database two separate times - once to get the MyModel object, and once to get the related MyRelatedModel object. Luckily, it's easy to get Django to optimise this into a single call:

myobj = MyModel.objects.select_related().get(pk=1)

This way Django does a JOIN in the database call, and caches the related object in a hidden attribute of myobj. Printing myobj.__dict__ will show this:

{'_myrelatedmodel_cache': <MyRelatedModel: obj>,
 'name': 'My name'}

Now, whenever you access myobj.myrelatedmodel, Django automatically uses the version in _myrelatedmodel_cache rather than going back to the database to get it. Note that this is exactly what happens once the related object has been accessed in the first snippet above - Django caches it in the same way for future use. All select_related() does is pre-cache it before the first access.
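To make the caching behaviour concrete, here's a minimal pure-Python sketch of the idea. It is not Django's actual implementation: FakeDatabase, LazyRelated and Thing are hypothetical stand-ins invented for illustration, and only the _myrelatedmodel_cache name is borrowed from the text above.

```python
class FakeDatabase:
    """Hypothetical stand-in for a database; counts each simulated query."""

    def __init__(self, rows):
        self.rows = rows
        self.query_count = 0

    def fetch(self, pk):
        self.query_count += 1
        return self.rows[pk]


class LazyRelated:
    """Descriptor that loads a related object on first access, then caches it."""

    def __init__(self, name, db):
        self.id_attr = name + "_id"
        self.cache_attr = "_" + name + "_cache"
        self.db = db

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        try:
            # Reuse the object if an earlier access (or a pre-fetch) cached it.
            return instance.__dict__[self.cache_attr]
        except KeyError:
            related = self.db.fetch(getattr(instance, self.id_attr))
            instance.__dict__[self.cache_attr] = related
            return related


db = FakeDatabase({7: "related object 7"})


class Thing:
    myrelatedmodel = LazyRelated("myrelatedmodel", db)

    def __init__(self, myrelatedmodel_id):
        self.myrelatedmodel_id = myrelatedmodel_id


lazy = Thing(7)
lazy.myrelatedmodel  # first access: one simulated query
lazy.myrelatedmodel  # second access: served from the cache
lazy_queries = db.query_count

eager = Thing(7)
# Filling the cache up front plays the role select_related() has in the text.
eager.__dict__["_myrelatedmodel_cache"] = "related object 7"
eager.myrelatedmodel  # no query needed
eager_queries = db.query_count - lazy_queries
```

The lazy object costs one lookup no matter how often the attribute is read, and the pre-cached one costs none, which is the whole difference between the two snippets above.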

None of this is new - it's quite well explained in the Django documentation. However, what's not obvious is how to do the same for reverse relationships. In other words, this:

myrelatedobj = MyRelatedModel.objects.get(pk=1)
print(myrelatedobj.mymodel_set.all())

Here you'll always get two separate db calls, and adding select_related() anywhere won't help at all. Now one extra db call isn't that significant, but consider this in a template:

<ul>
{% for myobj in myobjects %}
    <li>{{ myobj.name }}</li>
    <ul>
         {% for relobj in myobj.backwardsrelationship_set.all %}
         <li>{{ relobj.name }}</li>
         {% endfor %}
    </ul>
{% endfor %}
</ul>

Not an unreasonable thing to want to do - iterate through a bunch of objects, then for each one display all the objects in its backwards relationship. However, this will always cost n+1 queries, where n is the number of objects in the myobjects queryset. And what's worse, Django will go back and get the items from the database each time they're accessed, even if we've already got them for the same object in the same view or template. The queries quickly mount up. So how can we optimise this?
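The arithmetic is easy to check with a toy model. This is only a simulation under stated assumptions: CountingStore is a hypothetical stand-in for the database that counts calls, not anything Django provides.

```python
class CountingStore:
    """Hypothetical stand-in for the database; counts each simulated query."""

    def __init__(self, parents, children):
        self.parents = parents      # list of parent ids
        self.children = children    # list of (child_name, parent_id) pairs
        self.query_count = 0

    def all_parents(self):
        self.query_count += 1
        return list(self.parents)

    def children_of(self, parent_id):
        self.query_count += 1
        return [name for name, pid in self.children if pid == parent_id]


store = CountingStore(
    parents=[1, 2, 3],
    children=[("a", 1), ("b", 1), ("c", 3)],
)

# Mirrors the template loop: one query for the outer list,
# then one more for each object's reverse relationship.
rendered = []
for parent_id in store.all_parents():
    rendered.append(store.children_of(parent_id))
naive_queries = store.query_count  # n + 1 with n = 3

# Nothing was cached, so touching the same relationships again costs n more.
for parent_id in [1, 2, 3]:
    store.children_of(parent_id)
repeat_queries = store.query_count - naive_queries
```

With three parent objects the first pass costs four queries, and a second pass over the same relationships costs another three.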

The answer is to get all the related objects at once, for the entire queryset, then cache each object's related objects in a hidden attribute. We can do this by grouping the related objects, once we've fetched them, into a dict keyed by the id of their parent object:

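qs = MyRelatedObject.objects.all()
# Index the parent objects by primary key (this evaluates qs once).
obj_dict = dict([(obj.id, obj) for obj in qs])
# One further query fetches every child of every parent in qs.
objects = MyObject.objects.filter(myrelatedobj__in=qs)
# Group the children by the id of the parent they point to.
relation_dict = {}
for obj in objects:
    relation_dict.setdefault(obj.myrelatedobj_id, []).append(obj)
# Attach each group to its parent; parents with no children get an empty list.
for parent_id, parent in obj_dict.items():
    parent._related_items = relation_dict.get(parent_id, [])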

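Now each MyRelatedObject instance in qs has a _related_items attribute, containing all the MyObject items in its reverse relationship. Obviously, since Django doesn't know about this, the only way to get the items is to explicitly iterate through _related_items rather than myobject_set.all. One catch: Django's template language refuses to look up attributes whose names start with an underscore, so if you want to loop over the list in a template, give the attribute a name without the leading underscore (related_items, say). And if you need extra filtering, you need to do it in the view where you first get the objects, since the resulting attribute is a plain list rather than a queryset and can't be filtered.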
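If you need this in more than one view, the grouping step can be pulled out into a small helper. What follows is a sketch rather than a definitive implementation: attach_related and its parameters are names I've made up, and SimpleNamespace objects stand in for model instances so the example runs without a database. It assumes each parent has an id attribute and each child carries the parent's id in the attribute named by fk_attr.

```python
from types import SimpleNamespace


def attach_related(parents, children, fk_attr, cache_attr="_related_items"):
    """Group children by parent id and hang each group on its parent."""
    groups = {}
    for child in children:
        groups.setdefault(getattr(child, fk_attr), []).append(child)
    for parent in parents:
        # Parents without children get an empty list, so the attribute always exists.
        setattr(parent, cache_attr, groups.get(parent.id, []))
    return parents


# Stand-ins for MyRelatedObject and MyObject instances (illustrative data only).
parents = [SimpleNamespace(id=1, name="first"), SimpleNamespace(id=2, name="second")]
children = [
    SimpleNamespace(name="a", myrelatedobj_id=1),
    SimpleNamespace(name="b", myrelatedobj_id=1),
]

attach_related(parents, children, fk_attr="myrelatedobj_id")
names_for_first = [child.name for child in parents[0]._related_items]
names_for_second = [child.name for child in parents[1]._related_items]

# Extra filtering is ordinary list filtering, done before the template sees it.
only_b = [child for child in parents[0]._related_items if child.name == "b"]
```

With real models the call would presumably take the evaluated qs as parents and the MyObject queryset as children. Passing a cache_attr without a leading underscore makes the list reachable from a template.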

There's quite a bit of looping etc in this snippet, so you should probably profile carefully to ensure this isn't actually more expensive than just going back to the database. But I've found that this is fairly efficient, and saves a lot of database access.
