Getting the related item in an aggregate

There's a question I see quite a lot at StackOverflow and the Django Users group regarding aggregation in Django. It goes like this: I know how to annotate a max/min value for a related item on each item in a queryset. But how do I get the actual related item itself?

I wish this was easier than it actually is. The problem is that in the underlying SQL, annotating the value is a simple aggregation query on the related item, whereas getting the entire object means moving to a complicated dependent subquery.

To illustrate, take these models:

class Blog(models.Model):
    name = models.CharField(max_length=64)

class Entry(models.Model):
    blog = models.ForeignKey(Blog)
    added = models.DateTimeField(auto_now_add=True)
    text = models.TextField()

Getting the date of the latest Entry for each Blog is simple:

blogs = Blog.objects.annotate(Max('entry__added'))

and the underlying SQL is just as simple:

SELECT blog.id, blog.name, MAX(entry.added)
FROM blog_blog blog
JOIN blog_entry entry on entry.blog_id = blog.id
GROUP BY blog.id

But that doesn't work if you want the whole Entry object. You need to do something much more complicated:

SELECT blog.id, blog.name, entry.id, entry.added, entry.text
FROM blog_blog blog, blog_entry entry
WHERE entry.id = (
    SELECT e2.id FROM blog_entry e2
    WHERE e2.blog_id = blog.id
    ORDER BY e2.added 
    LIMIT 1
);

and currently there's no support for this in the Django ORM.

Now, you could just pass the above query to the .raw queryset method in Django 1.2: Blog.objects.raw('SELECT...'), and perhaps surprisingly, this will work, in that the extra fields from the Entry model will be appended to each Blog instance. If you needed the actual Entry instance - say if you had some extra methods on the Entry model that you needed to run with each one - you would have to iterate through the queryset and instantiate new Entry objects with the fields from each Blog.

Also note there's another gotcha with raw querysets, which is that they are re-executed every time you slice them or access one of their members - so it's probably best to cast them to a plain list first.

There is another approach which gets you the items related in the normal Django way, so that you can do entry_instance.blog. It does this in two queries, with a bit of Python processing in the meantime.

from django.db.models import Max
blogs = Blog.objects.annotate(Max('entry__added'))
values = tuple([(blog.id, blog.max__entry__added) for blog in blogs])

entries = Entry.objects.extra(where=['(blog_id, added) IN %s' % (values,)])

blog_dict = dict([(b.id, b) for b in blogs])
entries = list(entries) 
for entry in entries:
    entry._blog_cache = blog_dict[entry.blog_id]

Here we do a standard annotate query to get the added values for each relevant Entry. Then we can do an extra query to get the actual Entries associated with each (blog_id, max_entry) tuple (note we can't use the params argument for the values list, unfortunately, as it will get double-quoted). Finally, we can re-associate each Entry with its Blog - I've done it that way round to fit in with the standard ForeignKey and its automatic mapping of entry._blog_cache to entry.blog, and since we're only interested in one entry per blog it shouldn't matter whether we have to iterate through blogs or entries.

Again, it's a shame we have to drop to raw SQL for the middle step here. The query depends on matching multiple values for each row, and although it would be possible to do this by iterating through and adding Q objects for each row, it would be an absolutely horrible query. At least we're using extra here, which is arguably better than the raw we used in the first attempt above.

Comments