Jul 25

Europython 2010 talk: Advanced Django ORM Techniques

Here are the slides from my talk at Europython 2010, Advanced Django ORM Techniques.

The talk is mainly a summary of the query optimisation tricks I've previously talked about on this blog, although I did begin by explaining briefly how models, fields and relationships work behind the scenes - I'll write some of that up here at some point.

I'll also be posting a longer review of Europython here, hopefully in the next few days.

Jun 02

Class-based views and thread safety

Just because it's a class, part 2

I previously wrote about a thread-safety bug I encountered in writing a middleware object. I recently found a very similar situation in a class-based view I was modifying, and thought it was worth writing up what the problem was and how I solved it. Interestingly, there's a discussion going on in Django-developers at the moment on class-based views which touches on many of the same issues.

By 'class-based view', I mean that rather than just being a function which is called from urls.py when the URL matches its pattern, this was a class with a __call__ method. In theory, that should work in exactly the same way as a function view - to Python, functions and callable objects are pretty much the same thing.

One of the reasons that the controller was written as a class was that it had multiple utility methods that were called during the rendering process. To make it easier to pass data between the methods, various things were set as instance variables - self.item_id, self.key, and so on.

However, the urlpattern for this view was defined like this:

(r'^(?P<key>)/((?P<page>[^/]+)/)?', FormController()),

which meant that it was instantiated just once per process: when the urls are first imported. Unfortunately, when running under a production server like Apache/mod_wsgi, a process does not equal a request. Processes are persistent - managed as part of a pool by the server - and a process can serve many requests before eventually being killed and restarted.

This means that the FormController object declared above will probably be responsible for serving multiple requests - and any instance attributes, eg self.item_id, will be visible to all of them. Clearly, this has the potential to be a massive bug, if information starts leaking between requests - one user might start seeing information meant for someone else, or a form could validate itself based on information from a completely different user.
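
To make the failure mode concrete, here is a minimal sketch in plain Python, with no Django involved. LeakyController and its arguments are hypothetical stand-ins rather than the real view; the point is only that an attribute set on a shared instance during one call is still there on the next.

```python
class LeakyController(object):
    """Hypothetical stand-in for a view object that keeps request state on self."""

    def __call__(self, request, item_id=None):
        if item_id is not None:
            self.item_id = item_id
        # A later call that passes no item_id still sees the old value.
        return getattr(self, 'item_id', None)


# Instantiated once, like FormController() in the urlpattern above.
shared_view = LeakyController()
first = shared_view('request from user A', item_id=42)
second = shared_view('request from user B')  # no item_id supplied

# Instantiated per call, so nothing carries over between requests.
fresh = LeakyController()('request from user B')
```

Here second ends up with user A's item_id, while fresh does not, which is the difference between one instance per process and one instance per request.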

There are various potential ways to solve this. The main idea is to instantiate the object once per request, rather than once per process - this ensures that any instance data is specific to that request. The way I've done it for now is to redefine the url like this:

(r'^(?P<key>)/((?P<page>[^/]+)/)?', FormController),

ie make the handler a class, rather than an instance. That means that Django will instantiate a FormController object when the url matches, so the code that was previously in __call__ goes in __init__ instead. Unfortunately, in Python you can't return anything from an __init__ method - the returned value from an instantiation is automatically the object itself. So what I've done (based on an idea from Mike Malone) is to make the view class a subclass of HttpResponse - then, rather than returning render_to_response, you can just render the template to a string and pass this to the superclass's __init__ method:

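from django.http import HttpResponse
from django.template import RequestContext, loader

class FormController(HttpResponse):
  def __init__(self, request, key, page=None):
      # ... lots of code ...
      content = loader.render_to_string(
          template,
          form_context,
          context_instance=RequestContext(request)
      )
      # HttpResponse.__init__ takes the rendered content as its first argument
      super(FormController, self).__init__(content)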

The value returned from the instantiation is therefore an HttpResponse object containing the rendered template, as Django expects.

One small side-effect of doing this is that you can't return anything else - for example an HttpResponseRedirect (or an HttpResponseNotFound). You need to do the redirect manually:

      if self.redirect_to_next_page:
          self.status_code = 302
          super(FormController, self).__init__()
          self['Location'] = next_page
          return

May 10

Django aggregation and a simple GROUP BY

Using 'values' to sum over distinct values within a QuerySet

I love Django's aggregation framework. It very successfully abstracts the common aggregation tasks into a Pythonic syntax that sits extremely well with the rest of the ORM, and the documentation explains it all without a single reference to SQL.

But sometimes that very abstraction gets in the way of working out what you want to do. One example of this happened to me today when I needed to do a sum of values grouped by a single value on a model - in SQL terms, a simple GROUP BY query.

The documentation is very clear about how to do aggregations across an entire QuerySet, and annotations across a relationship. So you can, for example, easily do a sum of all the 'value' fields in a model, or a sum of all the 'value' fields on a related model for each instance of the parent model. But the implication is that these are the only things you can do. So I was left wondering if I had to create a dummy related model just to contain the unique values of the field I wanted to group on.

In fact, what I wanted to do was clearly documented, but because of the (probably correct) desire not to express things in terms of SQL, it's not that easy to find. So here's how to do it: you just need to use values.

For instance, my model is a set of transactions for a financial accounting system. Each transaction is associated with an order, which is just an integer ID referring to records in a completely different system. I wanted to get the total of transactions for each unique order ID. It's as simple as this:

Transaction.objects.values('order_id').annotate(total=Sum('value'))

Which gives you a ValuesQuerySet along the lines of:

[{'order_id': 12345L, 'total': Decimal('1.23')}, 
 {'order_id': 54321L, 'total': Decimal('2.34')}, 
 {'order_id': 56789L, 'total': Decimal('3.45')}]

One thing to be aware of: as the docs do note, the order of values and annotate is significant here. This way round, it groups by the fields listed in values and then annotates. But if you put annotate first, it does the calculation for each individual record, without grouping, then uses values to simply restrict the fields it outputs.
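
For anyone who does think in SQL, the difference between the two orderings can be sketched with the standard library's sqlite3 module. The txn table and its rows are made up for illustration, and the SQL Django actually generates will differ in detail; this only shows the shape of the two groupings.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE txn (order_id INTEGER, value REAL)')
conn.executemany(
    'INSERT INTO txn (order_id, value) VALUES (?, ?)',
    [(12345, 1.0), (12345, 0.25), (54321, 2.5)],
)

# values('order_id') then annotate(total=Sum('value')):
# one row per distinct order_id.
grouped = conn.execute(
    'SELECT order_id, SUM(value) AS total FROM txn '
    'GROUP BY order_id ORDER BY order_id'
).fetchall()

# annotate first, then values: the grouping is per record (rowid stands
# in for the primary key here), so nothing is collapsed.
ungrouped = conn.execute(
    'SELECT order_id, SUM(value) AS total FROM txn '
    'GROUP BY rowid ORDER BY rowid'
).fetchall()
```

The first query returns two rows, one per order; the second returns all three records, each with a "total" that is just its own value.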

Apr 21

Vim autocomplete, Django and virtualenv

A lot of detective work to enable one of vim's useful features

One of the features I quite missed when I first moved from Komodo to Textmate as my main editor was autocompletion. Although I didn't use it very much, it was occasionally useful to be reminded of the methods available on a class, without having to look up the documentation or open the source module.

Now I've moved to vim, which has its own version of autocomplete: omnicompletion. This is activated by pressing Ctrl-X Ctrl-O in insert mode after typing the name of a class or instance, and displays a nice drop-down menu with all the available members of that object. The Python plugins which come with vim allow this function to not only complete items from the standard library, but also to parse your own files which are open in the editor, reading the modules you import and adding those elements to the completion dictionaries.

However, this wasn't working for me in my Django projects. After a lot of investigation, I found that this was down to three issues.

The first two of these related to the fact that I was working within a virtualenv. Despite the fact that I was starting MacVim after activating the virtual environment, the specific site-packages directory was not being added to the path and it was defaulting to the system-wide one, which on my system doesn't contain any of the packages I was using. The solution to this is to use the activate_this.py script which is provided with virtualenv to activate it within another process - it's intended for use with mod_wsgi, but works just as well here. You can run it from the vim command line, using python to tell vim that the following commands are in Python. (Note that the Python interpreter is actually persistent, so you can import modules or define variables in one command and they are still available in subsequent ones.)

:python activate_this = '/path/to/virtualenv/bin/activate_this.py'
:python execfile(activate_this, dict(__file__=activate_this))

This sets up the paths properly, but it was still not working. After a lot of investigation, I finally realised that this was because vim uses its own built-in Python interpreter, which is version 2.5, while my Snow Leopard machine was using 2.6. The confusion arose because doing this:

:python import sys; print sys.executable

does return the path to my virtualenv's version of Python. To make it even worse, this:

:python print sys.version

returned 2.5.2, when the executable printed in the previous command was actually Python 2.6.1! I can't explain that bit of weirdness, and would be interested to hear in the comments if anyone else can, but the fact that vim was clearly using a 2.5 Python did at least explain why it wasn't picking up the local packages which were installed in the virtualenv's lib/python2.6/site-packages directory.

The version of Python included is decided at compile time, and the pre-built versions of MacVim are actually compiled against Python 2.5, so to make it work with the 2.6 directories I had to build my own binaries. Luckily this is very easy.

The third and final piece of the puzzle was Django-specific. Anything that uses a Django model, or indeed imports anything that references django.db, needs a settings module. Normally when running from manage.py this is set automatically, but obviously that doesn't work within vim. So you just need to set the DJANGO_SETTINGS_MODULE environment variable, which again can be done from the vim command line:

:let $DJANGO_SETTINGS_MODULE='mysite.settings'

With all that in place, plus some further Python path manipulation to ensure it found all my project's code, I was now able to complete code within Django projects.

There's some work to be done to automate this. At the moment, I've put all the above commands into a .vimrc file at the base of the virtual env, and added some code to my main ~/.vimrc to load it based on the value of the VIRTUAL_ENV environment variable (which is set by bin/activate):

if filereadable($VIRTUAL_ENV . '/.vimrc')
    source $VIRTUAL_ENV/.vimrc
endif

This probably isn't ideal, as it involves remembering to create that file with all its specific hard-coded paths each time I set up a new virtualenv. On the other hand, trying to do something more automatic will be difficult, as my settings files are not always in a predictable location - eg for work they are often under projectname.configs.development.settings - so maybe this is the best I can do.

Apr 21

Bye-bye django.comments, hello Disqus

All right, I admit defeat

So, I've learned my lesson. When I first set up this blog I said at great length that I preferred Django's built-in commenting system to the third-party Disqus service that Mingus uses by default.

Well, after a week of trying to stop a flood of spam comments, first by deleting them as they came in and then by disabling comments altogether, I've taken the plunge and reverted my changes, so I'm now incorporating Disqus. All the (13!) real existing comments have been ported over to Disqus.

For those who are interested, copying the comments over was surprisingly simple. I thought I'd have to create some XML to import them into Disqus, but it turns out there is a nice API. It further turns out that the django-disqus app includes a simple management command, disqus-export, which copies the comments straight over. One caveat - the fork of this app included in Mingus has some problems with this command on recent Django versions, but you just need to remove the verbosity option in the script to make it work.

Apr 13

Temporary models in Django

Occasionally I need to create a temporary model within a Django application.

The most recent occasion for this was a one-off management command I was writing to import some data from a legacy system. The old database, for some reason, eschewed foreign keys in favour of char fields in a linking table which referred to the relevant rows. In converting this to a Django app, and wanting to use sensible database structure, I planned to replace this with normal ForeignKey fields. But I needed to temporarily hold onto the old references during the import process, so that I could set the new FK properly.

I didn't want to add a field to my model, create a migration for the new field, do the import, then add another migration to drop the field again, so a quick answer was to create a temporary table to hold the linking data during the import. And I wanted to define it within the management command itself, again so as not to pollute the real models with temporary code.

Surprisingly, this turned out to be quite easy. Here's the code:

from django.contrib.contenttypes.management import update_contenttypes
from django.core.management import call_command
from django.core.management.base import NoArgsCommand
from django.db import connection, models

class TempCustomerAddress(models.Model):
    address = models.ForeignKey('accounts.Address')
    legacy_id = models.CharField(max_length=12, unique=True)

    class Meta:
        app_label = 'utils'

class Command(NoArgsCommand):

    def handle_noargs(self, **options):
        models.register_models('utils', TempCustomerAddress)
        models.signals.post_syncdb.disconnect(update_contenttypes)
        call_command('syncdb')

        # ... do importing and stuff referring to TempCustomerAddress ...

        cursor = connection.cursor()
        cursor.execute('DROP TABLE `utils_tempcustomeraddress`')

Firstly I define the model, giving it an explicit app_label referring to an existing application within my project - not, incidentally, the one containing the actual command.

Then within the body of the command, it was pretty much just a matter of registering the model and running syncdb. It turns out that there is a very simple, although undocumented, function, register_models, to do this - you just need to pass it the name of the application to register the model into, and the model class itself. Again, I'm using the 'utils' app to register the model against - mainly because in our project that isn't managed by South, so syncdb will work. One thing I did have to do though was disconnect the post_syncdb signal which creates content types, as this seemed not to like the temporary model.

The final task, after the import had run, was to drop the temporary table. Since I'm not using South here I have to do that manually by running some SQL.

Mar 09

Easy create or update

An efficient way to write one of Django's missing shortcuts

One common database operation that isn't supported out of the box by Django's ORM is create_or_update - in other words, given a set of parameters, either update an existing object or create a new one if there isn't one already.

The naive implementation is to do a get() on the model, catching the DoesNotExist exception if there's no match and instantiating a new object, then updating the attributes and saving. (You wouldn't want to use get_or_create here, as that doesn't allow you to update the instance if it already exists, so you'd have some duplication of code and db queries).

try:
    obj = MyModel.objects.get(field1=value1)
except MyModel.DoesNotExist:
    obj = MyModel()
    obj.field1 = value1
obj.field2 = value2
obj.save()

The only problem with this is that it creates multiple queries: one to get the existing row, and then two to save it - Django checks to see if it should do an insert or an update when you save, which costs another query. Most of the time, this doesn't massively matter: creating and updating is usually done outside of the standard page rendering flow, so it's not a huge problem if it's a tiny bit slower.

But there are times when you do want to optimise this. One, which we recently ran into at work, is when you want to log items to the database in the course of normal page rendering. We do this to let users of our CMS know when they've put items on a page that aren't rendering how they should be, usually because they don't have the right selection of image assets. (There are good operational reasons as to why we can't stop them from entering them in the first place: I won't go into that here.) A further wrinkle for us is that we want to ensure each error only gets one entry in the log table, but should always record the most recent time that particular error scenario was encountered. So, an ideal case for create_or_update, if only it existed.

Of course I can't stand to see unnecessary db queries, so here's an implementation that uses QuerySet.update to do the initial getting and updating if a match exists. The trick is to realise that update returns the number of rows affected by the query - which has been true more or less ever since queryset-refactor landed nearly two years ago, but which was wrongly and explicitly denied in the documentation until recently (and still is denied in the 1.1 docs, even though it's true). We can use this number to tell if a matching row existed - and if it doesn't, we can then simply call create with the same arguments. Simple.

attrs = {'field1': 'value1', 'field2': 'value2'}
filter_attrs = {'filter_field': 'filtervalue'}
rows = MyModel.objects.filter(**filter_attrs).update(**attrs)
if not rows:
    attrs.update(filter_attrs)
    obj = MyModel.objects.create(**attrs)

The attrs dictionary contains the field names/values to use to update the object, and filter_attrs is the filter names/values to find the object to update. If we're creating a new object, it will of course need to set both the attrs values and the filter_attrs, so we update one dictionary from the other.
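
The same update-first pattern can be sketched outside the ORM, using sqlite3 from the standard library so that it runs on its own. The error_log table, its columns and the create_or_update helper are all hypothetical, loosely modelled on the logging use case above; cursor.rowcount plays the role of the number returned by QuerySet.update.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE error_log (error_key TEXT PRIMARY KEY, last_seen TEXT)'
)


def create_or_update(conn, error_key, last_seen):
    """Try an UPDATE first and INSERT only if no row matched.

    Returns True if a new row was created, False if one was updated.
    """
    cur = conn.execute(
        'UPDATE error_log SET last_seen = ? WHERE error_key = ?',
        (last_seen, error_key),
    )
    if cur.rowcount:
        return False
    conn.execute(
        'INSERT INTO error_log (error_key, last_seen) VALUES (?, ?)',
        (error_key, last_seen),
    )
    return True


created_first = create_or_update(conn, 'missing-image', '2010-03-08')
created_second = create_or_update(conn, 'missing-image', '2010-03-09')
rows = conn.execute('SELECT error_key, last_seen FROM error_log').fetchall()
```

One thing to bear in mind with this shape: two concurrent callers can both see zero rows updated and both try the insert, so a unique constraint on the filter columns is worth having to turn that into an error rather than a duplicate row.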

Now, note that this will always call a db UPDATE, and if no match exists, it will additionally call an INSERT. Compare this with the original version, which always calls a SELECT, plus another SELECT and an UPDATE if the match exists, but just an INSERT if there's no match. So whether this is more efficient will depend on the use case - if you expect more updates than creates, this version should be better (a single UPDATE versus SELECT+UPDATE), but if the reverse is true the original implementation will probably be better.

Feb 22

Django patterns, part 4: forwards generic relations

Simulating select_related() on a GenericForeignKey

My last post talked about how to follow reverse generic relations efficiently. However, there's a further potential inefficiency in using generic relations, and that's the forward relationship.

If once again we take the example of an Asset model with a GenericForeignKey used to point at Articles and Galleries, we can get from each individual Asset to its related item by doing asset.content_object. However, if we have a whole queryset of Assets, doing this:

{% for asset in assets %}
   {{ asset.content_object }}
{% endfor %}

will result in as many queries as there are assets - in fact it's n+m, where n is the number of assets and m is the number of different content types, as you'll have one extra query per type to get the ContentType object. (Although it might be slightly less than that if you've used ContentTypes elsewhere, as the model manager caches lookups on the assumption that they never change once they've been set.)

However, luckily we can make this much more efficient as well, again using a variation of the dictionary technique.

generics = {}
for item in queryset:
    generics.setdefault(item.content_type_id, set()).add(item.object_id)

content_types = ContentType.objects.in_bulk(generics.keys())

relations = {}
for ct, fk_list in generics.items():
    ct_model = content_types[ct].model_class()
    relations[ct] = ct_model.objects.in_bulk(list(fk_list))

for item in queryset:
    # pre-populate the cache attribute that content_object would otherwise fill
    setattr(item, '_content_object_cache',
            relations[item.content_type_id][item.object_id])

Here we get all the different content types used by the relationships in the queryset, and the set of distinct object IDs for each one, then use the built-in in_bulk manager method to get all the content types at once in a nice ready-to-use dictionary keyed by ID. Then, we do one query per content type, again using in_bulk, to get all the actual objects.

Finally, we simply set the relevant object to the _content_object_cache field of the source item. The reason we do this is that this is the attribute that Django would check, and populate if necessary, if you called x.content_object directly. By pre-populating it, we're ensuring that Django will never need to call the individual lookup - in effect what we're doing is implementing a kind of select_related() for generic relations.
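
To see why this cuts the query count, here is a rough simulation in plain Python. TABLES and the in_bulk function below are hypothetical stand-ins for the database and for the manager method of the same name; each call to the stand-in counts as one query.

```python
from collections import namedtuple

Item = namedtuple('Item', 'content_type_id object_id')

# Hypothetical stand-in for the database: content type id -> {pk: object}.
TABLES = {
    1: {10: 'article 10', 11: 'article 11'},
    2: {10: 'gallery 10'},
}
query_count = 0


def in_bulk(content_type_id, ids):
    """Stand-in for Manager.in_bulk: one 'query' returning {pk: object}."""
    global query_count
    query_count += 1
    table = TABLES[content_type_id]
    return dict((pk, table[pk]) for pk in ids)


items = [Item(1, 10), Item(2, 10), Item(1, 11), Item(1, 10)]

# Collect the distinct object ids needed for each content type.
generics = {}
for item in items:
    generics.setdefault(item.content_type_id, set()).add(item.object_id)

# One bulk fetch per content type, rather than one per item.
relations = dict((ct, in_bulk(ct, ids)) for ct, ids in generics.items())

resolved = [relations[i.content_type_id][i.object_id] for i in items]
```

Four items spread over two content types are resolved with two fetches; looking each one up individually would have taken four.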

Feb 15

Django patterns part 3: efficient generic relations

Extending the dictionary technique to cover generic lookups

I've previously talked about how to make reverse lookups more efficient using a simple dictionary trick. Today I want to write about how this can be extended to generic relations.

At its heart, a generic relationship is defined by two elements: a foreign key to the ContentType table, to determine the type of the related object, and an ID field, to identify the specific object to link to. Django uses these two elements to provide a content_object pseudo-field which, to the user, works similarly to a real ForeignKey field. And, again just like a ForeignKey, Django can helpfully provide a reverse relationship from the linked model back to the generic one, although you do need to explicitly define this using generic.GenericRelation to make Django aware of it.

As usual, though, the real inefficiency arises when you are accessing reverse relationships for a whole lot of items - say, each item in a QuerySet. As with reverse foreign keys, Django will attempt to resolve this relationship individually for each item, resulting in a whole lot of queries. The solution is a little different, though, to take into account the added complexity of generic relations.

Assuming the list of items is all of one type, the first step is to get the content type ID for this model. From that, we can get the object IDs, and then do the query in one go. From there, we can use the dictionary trick described last time to associate each item with its particular related items. In this example, we have an Asset model that is the generic model, holding assets for other models such as Article and Gallery.

articles = Article.objects.all()
article_dict = dict([(article.id, article) for article in articles])

article_ct = ContentType.objects.get_for_model(Article)
assets = Asset.objects.filter(
    content_type=article_ct,
    object_id__in=[a.id for a in articles]
)
asset_dict = {}
for asset in assets:
    asset_dict.setdefault(asset.object_id, []).append(asset)
for id, related_items in asset_dict.items():
    article_dict[id]._assets = related_items

This is good as far as it goes, but what about when we have a heterogeneous list of items? That, after all, is the point of generic relations. So what if our starting point is a collection of both Galleries and Articles, and we still want to get all the related Assets in one go? As it turns out, the solution is not massively different: we just need to change the way we key the items in the intermediate dictionary, to record the content type as well as the object ID.

article_ct = ContentType.objects.get_for_model(Article)
gallery_ct = ContentType.objects.get_for_model(Gallery)
assets = Asset.objects.filter(
    Q(content_type=article_ct,
      object_id__in=[a.id for a in articles]) |
    Q(content_type=gallery_ct,
      object_id__in=[g.id for g in galleries])
)

asset_dict = {}
for asset in assets:
    key = "%s_%s" % (asset.content_type_id, asset.object_id)
    asset_dict.setdefault(key, []).append(asset)

for article in articles:
    article._assets = asset_dict.get("%s_%s" % (article_ct.id, article.id), None)

for gallery in galleries:
    gallery._assets = asset_dict.get("%s_%s" % (gallery_ct.id, gallery.id), None)

Here we first of all use Q objects to get all the assets of type Article with IDs in the list of articles, plus all those of type Gallery with IDs in the list of galleries. Then we use the fact that each asset knows its own content type ID to create the dictionary keys in the form <content_type_id>_<object_id>. Finally, we loop through the articles and the galleries separately to get the relevant assets for each item.
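
The keying step is the part that matters, and it can be tried out without a database. This sketch is a variation rather than the code above: it uses (content_type_id, object_id) tuples as keys instead of formatted strings, and the Asset tuple and the content type IDs are invented for the example.

```python
from collections import defaultdict, namedtuple

Asset = namedtuple('Asset', 'name content_type_id object_id')

ARTICLE_CT, GALLERY_CT = 1, 2  # hypothetical content type ids

assets = [
    Asset('photo-a', ARTICLE_CT, 7),
    Asset('photo-b', GALLERY_CT, 7),
    Asset('photo-c', ARTICLE_CT, 7),
]

# Key on the (content type, object id) pair so that article 7 and
# gallery 7 do not collide.
asset_dict = defaultdict(list)
for asset in assets:
    asset_dict[(asset.content_type_id, asset.object_id)].append(asset.name)

article_assets = asset_dict.get((ARTICLE_CT, 7), [])
gallery_assets = asset_dict.get((GALLERY_CT, 7), [])
missing_assets = asset_dict.get((GALLERY_CT, 8), [])
```

Keying on object ID alone would have lumped all three assets together under 7; with the content type in the key, the article and the gallery each get their own list.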

Feb 01

Middleware post-processing in Django: a gotcha

Just because it's a class, it doesn't mean you should store state in it

One of the requirements for the new Heart website we've just launched was to allow users to personalise their location to one of 33 radio stations across the country. For various reasons, this meant rewriting all the links on the page, dynamically, depending on the user's location setting.

The easiest place to do this sort of post-processing in Django is in response middleware. So I wrote a quick class that used regexes to grab all the href and action attributes (for a and form elements respectively - images didn't need localising) and add the relevant locations. Because it was dynamic, I used the ability of re.sub to call a function to determine the replacement value; and to save on multiple database queries, I saved various things in the instance. So it looked a bit like this:

href = re.compile(r'(href|action)=["\'](.+?)["\']')

class LocalisationMiddleware(object):
    def process_response(self, request, response):
        self.current_station = get_station(request)
        self.stations = Station.objects.values_list('slug', flat=True)

        content = href.sub(self.re_replace, response.content.decode('utf8'))
        response.content = unicode(content)
        return response

    def re_replace(self, matchobj):
        current_station = self.current_station
        url = "/%s%s" % (current_station.slug, matchobj.group(2))
        return '%s="%s"' % (matchobj.group(1), url)

But then, during testing, we started getting some rather odd bug reports. Someone would be happily browsing the London pages, and would suddenly get a link pointing at Essex - which is supposed to be impossible.

We eventually realised what the problem was. Django middleware is instantiated once per process: so several requests were being serviced by the same instance, and the values of the local instance attributes - in particular self.current_station - were being leaked across requests.

The solution is to use a separate object to contain the current station and the re_replace method, and instantiate it explicitly in process_response:

class LocalisationMiddleware(object):

    def process_response(self, request, response):
        url_replacement = UrlReplacement(request)
        content = href.sub(url_replacement,
                           response.content.decode('utf8'))
        # etc

class UrlReplacement(object):
    def __init__(self, request):
        self.current_station = get_station(request)
        self.stations = Station.objects.values_list('slug', flat=True)

    def __call__(self, matchobj):
        # do replacements
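
For completeness, here is a small runnable sketch of the callable-object approach on its own, without Django. The station_slug argument is a hypothetical stand-in for whatever get_station(request) would return, and the replacement logic is a guess at the general shape rather than the production code.

```python
import re

href = re.compile(r'(href|action)=["\'](.+?)["\']')


class UrlReplacement(object):
    """Holds per-request state; a new instance is made for every response."""

    def __init__(self, station_slug):
        # station_slug stands in for get_station(request).slug
        self.station_slug = station_slug

    def __call__(self, matchobj):
        # re.sub calls this once per match and uses the return value
        # as the replacement text
        url = '/%s%s' % (self.station_slug, matchobj.group(2))
        return '%s="%s"' % (matchobj.group(1), url)


html = '<a href="/news/">News</a> <form action="/search/">'
london = href.sub(UrlReplacement('london'), html)
essex = href.sub(UrlReplacement('essex'), html)
```

Because each call to href.sub gets its own UrlReplacement, the London and Essex results cannot contaminate each other, however many requests the middleware instance itself ends up serving.
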
Jan 11

Django patterns, part 2: efficient reverse lookups

Avoiding extra database calls on backwards ForeignKey queries

One of the main sources of unnecessary database queries in Django applications is reverse relations.

By default, Django doesn't do anything to follow relations across models. This means that unless you're careful, any relationship can lead to extra hits on the database. For instance, assuming MyModel has a ForeignKey to MyRelatedModel, this:

myobj = MyModel.objects.get(pk=1)
print myobj.myrelatedmodel.name

hits the database two separate times - once to get the MyModel object, and once to get the related MyRelatedModel object. Luckily, it's easy to get Django to optimise this into a single call:

myobj = MyModel.objects.select_related().get(pk=1)

This way Django does a JOIN in the database call, and caches the related object in a hidden attribute of myobj. Printing myobj.__dict__ will show this:

{'_myrelatedmodel_cache': <MyRelatedModel: obj>,
 'name': 'My name'}

Now, whenever you call myobj.myrelatedmodel, Django automatically uses the version in _myrelatedmodel_cache rather than going back to the database to get it. Note that this is exactly the same as what happens once the related object has been accessed in the first snippet above - Django caches it in the same way for future use. All select_related() does is pre-cache it before the first access.

None of this is new - it's quite well explained in the Django documentation. However, what's not obvious is how to do the same for reverse relationships. In other words, this:

myrelatedobj = MyRelatedModel.objects.get(pk=1)
print myrelatedobj.mymodel_set.all()

Here you'll always get two separate db calls, and adding select_related() anywhere won't help at all. Now one extra db call isn't that significant, but consider this in a template:

<ul>
{% for myobj in myobjects %}
    <li>{{ myobj.name }}</li>
    <ul>
         {% for relobj in myobj.backwardsrelationship_set.all %}
         <li>{{ relobj.name }}</li>
         {% endfor %}
    </ul>
{% endfor %}
</ul>

Not an unreasonable thing to want to do - iterate through a bunch of objects, then for each one display all the objects in its backwards relationship. However, this will always cost n+1 queries, where n is the number of objects in the myobjects queryset. And what's worse, Django will go back and get the items from the database each time they're accessed, even if we've already got them for the same object in the same view or template. The queries quickly mount up. So how can we optimise this?

The answer is to get all the related objects at once, for the entire queryset, then cache each object's related objects in a hidden attribute. We can do this by sorting the objects once we've got them into a dict, keyed by the id of their parent object:

qs = MyRelatedObject.objects.all()
obj_dict = dict([(obj.id, obj) for obj in qs])
objects = MyObject.objects.filter(myrelatedobj__in=qs)
relation_dict = {}
for obj in objects:
    relation_dict.setdefault(obj.myrelatedobj_id, []).append(obj)
for id, related_items in relation_dict.items():
    obj_dict[id]._related_items = related_items

Now each MyRelatedObject instance in qs has a _related_items attribute, containing all the MyObject items in its reverse relationship. Obviously, since Django doesn't know about this, the only way to get the items is to explicitly iterate through _related_items rather than myobject_set.all in the template. And if you need extra filtering, you need to do it in the view where you first get the objects, since the resulting attribute isn't a queryset and can't be filtered.

There's quite a bit of looping etc in this snippet, so you should probably profile carefully to ensure this isn't actually more expensive than just going back to the database. But I've found that this is fairly efficient, and saves a lot of database access.
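
The grouping step itself is easy to experiment with in isolation. In this sketch Parent and Child are hypothetical namedtuples standing in for the two models, and the two lists stand in for the two queries.

```python
from collections import namedtuple

Parent = namedtuple('Parent', 'id name')
Child = namedtuple('Child', 'id parent_id name')

parents = [Parent(1, 'first'), Parent(2, 'second'), Parent(3, 'childless')]
# Stand-in for the single query that fetches every child of those parents.
children = [Child(10, 1, 'a'), Child(11, 2, 'b'), Child(12, 1, 'c')]

# Group the children by the id of their parent.
relation_dict = {}
for child in children:
    relation_dict.setdefault(child.parent_id, []).append(child.name)

# Look each parent up, defaulting to an empty list for parents
# that have no children at all.
related = dict((p.name, relation_dict.get(p.id, [])) for p in parents)
```

Note that in the snippet above, an object with no related rows never gets a _related_items attribute at all; defaulting to an empty list, as here, avoids having to special-case that.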

Jan 07

SSH and Mac OSX Terminal

Resetting terminal tab titles after SSH has messed with them.

I like the Mac as a development environment most of the time, but occasionally some things annoy me.

One of these niggles is the way that the tab title in Terminal changes when you SSH to an external server, but doesn't change back when you close the connection. So you end up with tabs that claim to be connected to a server, but aren't.

The culprit seems to be SSH itself. Here's my solution: a shell script that runs SSH and then sets the tab title back to the default "Terminal".

#!/bin/sh
# Run ssh with the original arguments, then reset the tab title.
ssh "$@"
printf '\033]0;Terminal\007'

I've saved this to ~/bin/sshp, and made it executable, so now I just type sshp myserver instead of ssh. A further step would be to alias it back to ssh in .bash_profile with alias ssh=sshp

Dec 26

Vim taglist and Django

Inspired by the graphical cheat sheet here, I've recently moved over to Vim as my main development environment.

After installing a whole range of plugins, I found that one of them, taglist, no longer worked with my Django code. The reason was that something was changing the filetype of Django modules to 'python.django', and taglist - unlike most other plugins - was trying to match against the whole filetype, rather than just a part of it.

My solution is to hack taglist so that it does a partial match on the filetype. In the Tlist_Get_Buffer_Filetype function (line 984), change

let buf_ft = getbufvar(a:bnum, '&filetype')

to

let buf_ft = split(getbufvar(a:bnum, '&filetype'), '\.')[0]

Dec 26

Showing queries in Haystack

A Django debug toolbar panel for Haystack

At work we've been using Haystack to manage our site search, with a Solr backend.

As usual, we're customising things quite a lot - using faceted queries and weighted indexes, and bypassing the built-in search forms - so I wanted to be sure, in line with my general obsession with query efficiency, that we weren't generating multiple Solr queries for every search.

Haystack does log queries for every request internally, but as far as I can tell there's no way of getting to that information without writing some custom code to import and expose the relevant variable. So I've written a (very basic) panel for the Django debug toolbar which does just that.

Just put this somewhere on your pythonpath or in your project, and add it to the DEBUG_TOOLBAR_PANELS list in settings.py.

Dec 20

Django patterns: memoizing

How to cache expensive operations to prevent repeated database calls

One of the things I wanted to do with this blog was to cover some of the design patterns I've discovered/come across/stolen over the years I've been working with Django. So this is the first in what I hope will be a long-running series on Django patterns.

Memoizing is the process by which a complicated or expensive function is replaced by a simpler one that returns the previously calculated value. This is a very useful thing to do in a complicated model, especially in cases where methods like get_absolute_url are calculated via a series of lookups on related models. Frequently I've found myself calling one of these methods on the same object several times within a view or template, leading to a huge number of unnecessary database calls.

It's very easy to do this manually - the method simply needs to check whether the cached value already exists, if not calculate it and store it somewhere, then return the cached value:

def get_expensive_calculation(self):
    if not hasattr(self, '_expensive_calculation'):
        self._expensive_calculation = do_expensive_calculation()
    return self._expensive_calculation

Here the cache lives within the instance itself. For the way I use it, this is useful: instances are created and destroyed within a single request/response cycle, so the cache dies with the object at the end of that process, and I don't need to worry about invalidating the cache if the value subsequently changes. Naturally, you could use Django's cache framework here - you'd need to create a unique key somehow, perhaps using the model name and pk as a prefix - but otherwise it would work pretty much the same way.

However, it's a bit of a pain having to write this same boilerplate each time you want to memoize something, so I wanted to write a decorator that would do it, which I could simply apply to a model method to get it to automatically cache the result. There are various memoizing decorators out there, but they mostly suffer from two problems: either they only work on plain functions, rather than methods, or they create a global cache, which would lead to a memory leak as the value would be kept even though the instance had gone out of scope.

So here's my version:

def memoize_method(func):
    key = "__%s" % func.__name__
    def inner(self, *args, **kwargs):
        if not hasattr(self, key):
            setattr(self, key, func(self, *args, **kwargs))
        return getattr(self, key)
    return inner

This is pretty simple in the end. The decorator uses the name of the function it's decorating to create a key, and when it's called it is passed 'self', so it checks if that key exists on that object and either creates or returns it.
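
As a quick check of the behaviour, here is the decorator applied to a throwaway class. Article and its call counter are hypothetical, standing in for a real model with an expensive method; the functools.wraps line is an addition of mine that just preserves the wrapped method's name and docstring:

```python
import functools


def memoize_method(func):
    key = "__%s" % func.__name__

    @functools.wraps(func)  # keep the wrapped method's name and docstring
    def inner(self, *args, **kwargs):
        if not hasattr(self, key):
            setattr(self, key, func(self, *args, **kwargs))
        return getattr(self, key)
    return inner


class Article(object):
    """Hypothetical class standing in for a Django model."""
    calls = 0

    @memoize_method
    def get_absolute_url(self):
        Article.calls += 1  # stands in for an expensive lookup
        return "/articles/1/"


article = Article()
first = article.get_absolute_url()
second = article.get_absolute_url()
```

The second call returns the value stored on the instance, so the method body runs only once per object. A different Article instance has its own cache and will run it again.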

One potential problem with this is that it doesn't take any account of the method's arguments: after the first call, it will always return the same value even if called again with completely different arguments. Most of the time, this won't be a problem: since the cache only persists for a single request, you're most likely to be calling it with the same arguments each time. But it's fairly simple to extend the caching mechanism to use parameters within the key:

def memoize_method_with_params(params):
    def wrap(func):
        key = "__%s__%s" % (func.__name__, '__'.join(['%s:%%(%s)s' % (a, a) for
                                                      a in params]))
        def inner(self, *args, **kwargs):
            actual_key = key % kwargs
            if not hasattr(self, actual_key):
                setattr(self, actual_key, func(self, *args, **kwargs))
            return getattr(self, actual_key)
        return inner
    return wrap

This time, since the decorator itself takes arguments, you need to use the double-wrap method: the outer function is called on definition, and it returns the decorator function, which itself contains the inner wrapped function. The algorithm to calculate the key looks complex, but is actually just creating a string in the form __funcname__key1:%(key1)s__key2:%(key2)s, which will use the dictionary string interpolation method to include the actual values when the function is called. (One issue, left for the reader to correct: params must be a list or tuple; if passed a string, it will be iterated character by character and the key lookup will generally fail.)
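
Here is a small usage sketch, with a hypothetical Product class and get_price method standing in for a real model. Note that because the key is built from kwargs, the parameters named in params have to be passed by keyword; passing them positionally raises a KeyError:

```python
def memoize_method_with_params(params):
    def wrap(func):
        key = "__%s__%s" % (func.__name__,
                            "__".join("%s:%%(%s)s" % (a, a) for a in params))

        def inner(self, *args, **kwargs):
            # KeyError here if a listed param was not passed by keyword.
            actual_key = key % kwargs
            if not hasattr(self, actual_key):
                setattr(self, actual_key, func(self, *args, **kwargs))
            return getattr(self, actual_key)
        return inner
    return wrap


class Product(object):
    """Hypothetical class standing in for a Django model."""
    def __init__(self):
        self.calls = 0

    @memoize_method_with_params(["currency"])
    def get_price(self, currency="GBP"):
        self.calls += 1  # stands in for an expensive lookup
        return "10.00 %s" % currency


product = Product()
gbp_first = product.get_price(currency="GBP")
gbp_again = product.get_price(currency="GBP")
eur = product.get_price(currency="EUR")
```

Each distinct value of currency gets its own attribute on the instance, so the method body runs once for GBP and once for EUR.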

Although this is pretty nice, I can't help feeling that I should be using descriptors to do this. Inspired by a posting by Marty Alchin and one by Ian Bicking, I attempted to make this work, but I unfortunately drew a blank - the problem is that only the __get__ method has access to the instance, where the cache needs to be stored, but that needs to be available in __call__ somehow. One possible solution would be to have __get__ return another descriptor itself, but that seems like overkill for this.
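
For what it's worth, one shape that does seem to work is for __get__ to return a plain closure rather than another descriptor: the closure already knows the instance, so __call__ isn't needed at all. This is only a sketch under that assumption, not what either of the linked posts does, and Article is again a hypothetical stand-in:

```python
import functools


class memoize_method(object):
    """Non-data descriptor that caches a method's first result on the instance."""

    def __init__(self, func):
        self.func = func
        self.key = "__%s" % func.__name__

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # accessed on the class, not on an instance

        @functools.wraps(self.func)
        def bound(*args, **kwargs):
            if not hasattr(instance, self.key):
                setattr(instance, self.key,
                        self.func(instance, *args, **kwargs))
            return getattr(instance, self.key)
        return bound


class Article(object):
    """Hypothetical class standing in for a Django model."""
    calls = 0

    @memoize_method
    def get_absolute_url(self):
        Article.calls += 1  # stands in for an expensive lookup
        return "/articles/1/"


article = Article()
url = article.get_absolute_url()
url_again = article.get_absolute_url()
```

The cost is a new closure on every attribute access, which is probably why the plain function decorator is still the simpler choice.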

Dec 08

South migrations with MPTT

We've been using django-MPTT at work for quite a while. It's a great way to manage hierarchical data in a read-efficient way, and we use it heavily in our CMS application. I'll definitely be talking about it further in future posts.

Recently we moved our database migrations from our defunct dmigrations project to Andrew Godwin's wonderful South application. One of South's best features is the ability to 'freeze' the ORM within each migration, so that you can manipulate the db via the familiar Django syntax rather than having to deal with raw SQL.

However, we ran into a problem when trying to use this to add new instances to a model that uses MPTT. We're actually using Ben Firshman's fork of django-mptt, which he wrote while he was working for us this summer. This has a base model class that defines all the MPTT fields and methods, rather than monkey-patching them in as the original version does.

The issue was that the frozen ORM only includes the basic fields that are defined on the actual model. This led to trouble when inserting a new object, especially when it's in the middle of an existing tree. MPTT includes values which identify an item's place in its tree, and when a new object is inserted most of the elements in the tree have to be updated to reflect the new positioning. django-mptt normally deals with all the SQL changes necessary, but this wasn't happening within a migration, because the dynamically-created model wasn't inheriting the correct models and fields.

The answer turned out to be simple, although it is undocumented. The frozen ORM definitions are stored in each migration as a nested dictionary. Each model is a key in the top level dictionary, whose value is a dictionary containing the field name/definitions as keys/values. However, in the sub-dictionaries, along with the field definitions, you can also store Meta definitions, including a South-specific extension: _bases, which defines the model base to inherit from. For example:

{
    'categories.category': {
        'Meta': {'unique_together': "(['slug', 'parent'],)", '_bases': ('mptt.models.Model',)},
        'id': ('django.db.models.fields.AutoField', [], {'primary_key': 'True'}),
        'name': ('django.db.models.fields.CharField', [], {'max_length': '50'}),
        'parent': ('django.db.models.fields.related.ForeignKey', [], {'blank': 'True', 'related_name': "'children'", 'null': 'True', 'to': "orm['categories.Category']"}),
        'slug': ('django.db.models.fields.CharField', [], {'max_length': '50'}),
    }
}

This ensures that the frozen category model inherits from mptt.models.Model, and gains all the special MPTT magic.

Dec 05

Customising Mingus, part 2

This is intended to be primarily a technical blog, so I was keen to get the presentation of code snippets correct. I'm a - shall we say - fairly frequent answerer on StackOverflow, and I've got used to their Markdown-enabled edit box. Luckily, the Mingus basic-blog application allows a choice of markup for body text, and even defaults to Markdown. But as always there were quite a few things to improve.

Firstly, I do like StackOverflow's dynamic WYSIWYG preview of the marked-up copy. Although Markdown syntax is quite simple, it's easy to get it wrong - using a three-space indent rather than four for code, for example. An instant preview just underneath the text entry field in the admin form is very useful. SO does it using the showdown.js library, which is part of their port of the 'what you see is what you mean' markdown editor, WMD.

It was as easy to integrate the whole of WMD as just the preview, by adding a mingus/admin.py like this:

from django import forms
from django.conf import settings
from django.contrib import admin
from django.utils.safestring import mark_safe
from basic.blog.models import Post
from basic.blog.admin import PostAdmin

class WMDEditor(forms.Textarea):

    def __init__(self, *args, **kwargs):
        attrs = kwargs.setdefault('attrs', {'class':'vLargeTextField'})
        super(WMDEditor, self).__init__(*args, **kwargs)

    def render(self, name, value, attrs=None):
        rendered = super(WMDEditor, self).render(name, value, attrs)
        return rendered + mark_safe(u'''
            <div id='wmd-container'>
            <div id='wmd-button-bar'></div>
            <div id='wmd-preview'></div>
            <script type="text/javascript">
            wmd_options = {
                output: "Markdown",
                buttons: "bold italic | link blockquote code image | ol ul"
            };
            </script>
            <script type="text/javascript" src="%sstatic/js/wmd.js"></script>
            </div>''' % settings.MEDIA_URL)

class PostForm(forms.ModelForm):
    body = forms.CharField(widget=WMDEditor)
    class Meta:
        model = Post

class WMDPostAdmin(PostAdmin):
    form = PostForm

    class Media:
        css = {
            "all": ("static/css/wmd.css",)
        }
        js = ("static/js/showdown.js",)

admin.site.unregister(Post)
admin.site.register(Post, WMDPostAdmin)

Because Mingus already does some Javascript on the Post admin to add the 'body inlines' section under the main textbox, I've made the WMD button bar appear underneath that, on top of the preview, instead of on top of the actual textarea. A bit weird, but it does work - it's not as if I use it all the time, anyway. This no doubt breaks if you use another markup language, but I always use Markdown, so no problem there.

So, from markup to syntax highlighting. Mingus is, unfortunately, a bit confusing here. Partly this is a result of Kevin's desire to integrate as many standalone applications as possible, and only write the minimum of glue code. However, this means that there are several applications that potentially supply markup functionality, and it confused me for quite a while. These include the django-extensions app, which includes the syntax_color templatetag; and django-sugar, which includes the pygment_tags library.

However, the basic django-blog app actually deals with markup and highlighting itself already. On saving a post, the markup is translated into HTML and saved in a body_markup field, thanks to the django-markup app. What I didn't realise is that django-markup already runs the formatted text through pygments to add the highlighting. The reason I didn't realise this is that pygments turns out not to be very clever in guessing the code language. If you don't tell it explicitly, it doesn't do anything. In the absence of a hard-coded hint, its attempt to guess the language is limited to looking at the first line of the code, where it hopes to see a pseudo-shebang line:

...
#! python

Once I started doing that, highlighting worked as expected (although there were some minor CSS issues - on some browsers the font used for pre was far too big). This also meant I could remove the call to the django-sugar pygmentize filter that mingus has for some reason added to all the blog templates.

I can't help feeling the proliferation of markup/highlighting code within mingus is a bit silly. I only realised in writing this that there is actually yet another place where highlighting could take place, as the Markdown library itself has an extension to call pygments (although presumably django-markup prefers to do this explicitly because other markup libraries don't have this extension).

There's one issue that remains unresolved. As well as the now-removed pygmentize filter, mingus also runs blog content through render_inlines, which allows insertion of arbitrary Django model content within a blog post. However, for some reason this removes all the indentation from code blocks - obviously not very useful when posting Python. I'm not using the inlines at the moment anyway, so I've removed them from the template until I can work out what's going on.

Other than that, everything works and the blog is now ready to use.

Oct 31

Cambridge Stack Overflow dev day

I don't go to a lot of tech conferences - family life tends to make getting away for any length of time fairly difficult. So originally I ignored the banners advertising the Stack Overflow DevDays, thinking I wouldn't be able to make it anyway. But when my employer arbitrarily changed the rules over how much holiday I'm allowed to carry forward into next year, I ended up with a couple of days in hand - and a conversation with a co-worker convinced me to go at the last minute. After a comedy of errors regarding the last available ticket for the London event, I finally managed to snap up a ticket for the Cambridge day.

Since this was a Stack Overflow conference, it wasn't surprising that the keynote was by Joel Spolsky. It was preceded by a mildly amusing short film where he satirised his 'treat developers right' reputation by pretending to be a cross between an autocratic boss and a sadistic PE teacher, which was funny enough but slightly pointless. The talk itself was good: it was about the tension between the 'simplicity is everything' attitude of firms like 37 Signals, versus the undeniable fact that people want features, as evidenced by the way FogBugz' sales went up every time they added more features.

Spolsky is an entertaining speaker and I enjoyed the talk, even if there wasn't a particularly coherent take-home message: he was trying to say that you should only give people options for things that are actually important, but the whole point is that what's not important to one user is vital for another, which is why software like Microsoft Word ends up with so many hundreds of options.

Next up was Christian Heilmann talking about Yahoo! Developer Tools. Now this was really interesting - something I haven't had a chance to play with at all, but definitely will in the future. Yahoo has put together a very nice way of querying any of their APIs via REST with a simple SQL-like language, YQL. What's more, it's possible to submit your own data sources which can be linked up via an XML translation table and made available for everyone to query via YQL. Carrying that forward, you can write mini-applications in Javascript that use any of these APIs and soon you'll be able to offer these to be installed on users' Yahoo home pages in much the same way as Facebook apps. I must admit my heart did sink a bit when Christian mentioned the customised markup language, after too much time wrestling with FBML, but it's an exciting possibility.

After a short break, next up was Cambridge University's Frank Stajano. This talk was ostensibly about computer security, and specifically what we can learn from fraudsters to make our systems more secure. But he's a fan of the BBC3 programme The Real Hustle, a hidden-camera show where members of the public are conned in various ways, and he's done various bits of research analysing the cons from the programme and relating them to systems security. So the format of the lecture was to show us various clips from the show, then a couple of slides which were supposed to tell us how this type of con was used in computer terms and how we could avoid it. However, it didn't really achieve that - the links to computer security were not well explained, and although the talk was quite fun I didn't feel I learned much.

Next was Joel again, talking about FogBugz. Now I know you have to expect this sort of thing at conferences (especially at Carsonified ones, or so I'm told), but I actually object to paying to sit through an hour of sales pitch, however entertainingly delivered it is. FogBugz looks like a perfectly competent product, but I didn't see anything that made it shine over a product like Jira, or even particularly over the open-source Redmine that we use these days at work. Plus the demo included a couple of screens that clearly violated the principle Joel had pushed earlier of only giving options where they made a difference.

Lunch, followed by Steven Sanderson on ASP.NET-MVC. I actually found this fairly good - despite my complete lack of interest in any Microsoft technology, I'm not actually hostile, so I paid enough attention to find out what they were doing in this area. As the speaker freely admitted, .NET MVC is quite obviously ripped off from Ruby on Rails. It does offer some nice ways of doing things, but is missing a lot of the things that Django and Rails do - no ORM, for example, because it relies on LINQ; and no real templating system, because you just use standard ASP files. So nothing amazingly revolutionary, except if you're a Microsoft fanboy who's totally unaware of what the wider world is doing, but still good to see that Microsoft is learning things and giving its developers some alternatives. Best part: it's "open source", which in Microsoft language means "we're not going to accept your patches or anything, but you're free to fork it if you want". Great.

Next: Remy Sharp on jQuery. A deeply disappointing talk. Ryan Carson introduced it by asking how many of the audience had used jQuery (about half) and how many considered themselves experts (a handful), telling the latter that they may as well get a cup of coffee. In fact, that whole half of the audience should have done so: this was a very basic introduction, covering only the fundamentals. Remy is not a particularly fluent talker and this was not very well presented.

After another break, we had Michael Foord on Python. This was another fairly basic introduction - I had suspected I wasn't going to learn anything, but got my hopes up when Michael started off by talking about IronPython (he's the co-author of IronPython In Action). Unfortunately this was only a short digression, although it did look very cool (instantiating a Windows dialog from the IronPython console...) and the rest of the talk was a run-through of a clever little spellchecker in 40-odd lines of Python. This was all well and good, but the code wasn't anything particularly special to Python - you could have done it in any of a dozen other languages in about the same number of lines - and it didn't cover any of Python's cooler features. If I'd never dabbled in Python, I don't think this would have been enough to whet my appetite.

Finally, Jeff Atwood talking about Stack Overflow. This was only a short talk, where Jeff spoke about the reasons he and Joel had set up the site, what he hoped and hopes to achieve, and the achievement he gets from it.

So, that was it for the talks. Free beer was offered in a bar in town, but unfortunately those family obligations raised their heads again and I had to drive home.

Overall, a good day. I had about a 50% hit rate on interesting talks, which I suppose is fairly good going, and I did get a chance to meet some new people. It was a shame that most of the talks slightly overran, leaving almost no time for questions.

One surprising thing was that the day wasn't very well integrated with Stack Overflow. I had at least expected us to get preprinted badges showing our SO username and reputation scores, but no such luck. And when Carson asked the audience at one point who thought they had the highest rep, I didn't put my hand up, assuming my 9,000 points would be average in this crowd. But when he tried to work it out, starting by asking who had 1,000 points, who had 1,500, etc, I soon found I did indeed have by far the highest rep - the next highest put his hand down at about 2,500. Made me feel slightly sad (which I am, of course). A shame that I missed the chance to parlay my brief moment of fame into something more long-lasting by skipping the drinks.

On the whole, I'm glad I went, and if nothing else it's convinced me I need to try to go to more of this sort of thing.

Oct 04

Customising Django-Mingus

This blog is built using Kevin Fricovsky's excellent django-mingus project, which is mainly a set of standard pre-existing reusable apps with some templates and a bit of glue to hold it together.

Although it's quite usable out of the box, I found - inveterate hacker that I am - that there were several things that I didn't quite like in the project as it was. So I changed them (isn't open source great, laydees-n-genelmen). At some point I'll fork the project on github and upload the changes, but for now here's what I've done.

Firstly, mingus forsakes Django's built-in comments framework for the external Disqus project. I didn't really fancy signing up for another service - especially as I'm not expecting vast numbers of comments on this blog. It's quite a simple matter to reinstate the comments - the relevant template code is included in the post_detail.html template included with the basic-blog app which mingus extends, so I just needed to copy and paste it into the mingus version. Then add (r'comments/', include('django.contrib.comments.urls')), to urls.py and django.contrib.comments to INSTALLED_APPS in settings.py, run a syncdb and it's all done.

There are however a couple of missing pieces here. basic-blog doesn't include templates for the comment preview and post confirmation, so you just get an unstyled white page. Simple to fix: add a comments directory with a base.html template as follows:

{% extends "base.html" %}
{% block content %}{% endblock %}

By default the post-confirmation page doesn't include a link back to the original object, leaving the user nowhere. So an overwritten posted.html in the same directory fixes that:

{% extends "comments/base.html" %}
{% load i18n %}
{% block title %}{% trans "Thanks for commenting" %}.{% endblock %}
{% block content %}
  <h2>{% trans "Thank you for your comment" %}.</h2>
  <p><a href="{{ comment.get_content_object_url }}">Return to blog</a></p>
{% endblock %}

The last issue with comments was that there was no indication on the index page of how many comments each post had. This is a standard feature of blogs, and a bit surprising it wasn't there - perhaps it's a consequence of using Disqus. Anyway, the solution was to add the following to templates/proxy/includes/post_item.html:

{% if object.content_object.allow_comments %}
{% get_comment_count for object.content_object as comment_count %}
<div class="comment_count"><a href="{{ object.content_object.get_absolute_url }}#comments">{{ comment_count }} comment{{ comment_count|pluralize }}</a></div>
{% endif %}

I also added a style rule for the .comment_count class in base.css.

So much for comments. Now, layout. I couldn't help thinking that the default layout had the main area too narrow and the right-hand column too wide. Luckily the templates are based on the 960 Grid System css, so it was easy to change the central column to use the grid_11 suffix_1 classes, for a width of 11/16 and a gutter of 1/16, and the right-hand column to use grid_4.

The final issue was to do with markup - that was a bit more complicated, so I'll leave it to part 2.

Oct 03

The one where my friend the sysadmin kills me

Using git hooks to automatically deploy changes to the server

Warning: this entry is very much a matter of 'This isn't the right way to do it, but it works for me'.

For small projects that are in active development, I frequently have to deploy code changes to the live server. To make this as simple as possible for me, so I can concentrate on the coding, I tend to like running on a live checkout of the code directly from the repo.

I never really got this automated properly with svn, although no doubt it's a simple matter of setting up the right post-commit hooks. However, now I'm working mainly in git, and I thought it would be good if I could push straight from my local repo to the remote one, and automatically see the production code update.

It's fairly easy to set up a remote repository to push to - I followed the instructions here, which worked a treat. However, this wasn't helping with getting this code to auto-checkout and deploy itself. So I began experimenting, and what I came up with was this.

Firstly, instead of setting up a bare repo as recommended in those instructions, use a standard git init for your remote. If you now try and push to this, git will complain with a long message explaining that "Updating the currently checked out branch may cause confusion". It gives some tips about how to turn off that message, but we can avoid it altogether by using branches.

On the server, simply create and check out a live branch:

   git branch live
   git checkout live

Now, we just need a hook that pulls from master to live every time we push to master. The hook we need is called post-receive, and like all hooks it lives in .git/hooks. Here's mine:

#!/bin/sh
read params
cd .. 
echo "ASSET_VERSION = '`echo $params|cut -d " " -f2`'" > local_settings.py
env -i ~/bin/git reset --hard
env -i ~/bin/git pull
exec ~/webapps/mysite/apache2/bin/restart

The two git commands simply ensure that the live branch has no local changes, then pull all changes direct from master - which in turn of course has been updated directly from my development machine.

The rest is me trying to be even cleverer. I wanted an automatic cache-busting mechanism to stop my javascript being cached while in development. So I have a simple local_settings.py file which defines a value which is appended to the querystring of all my asset urls. The hook updates this automatically - it is passed the hash of the current commit, so it reads the parameters (which is far more difficult in bash than it needs to be, by the way), extracts the hash, and writes it to local_settings.py.
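
For illustration, here is the same field extraction sketched in Python rather than shell. Git passes post-receive one line per updated ref on standard input, in the form 'old-hash new-hash ref-name', which is why the hook above takes the second field. The function names are hypothetical, and this isn't the hook I actually run:

```python
def parse_post_receive_line(line):
    """Split one 'old new ref' line from post-receive into its three fields."""
    old_rev, new_rev, ref_name = line.split()
    return old_rev, new_rev, ref_name


def asset_version_setting(line):
    """Build the line that the hook writes to local_settings.py."""
    _, new_rev, _ = parse_post_receive_line(line)
    return "ASSET_VERSION = '%s'\n" % new_rev
```

In a Python version of the hook, asset_version_setting(sys.stdin.readline()) would give the text to write to local_settings.py.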

The final step is to restart Apache, and we're laughing.

Now, no doubt there are much better ways of doing this. But like I say, it works for me.