rosemanblog

What's new in Django 1.7

2014-10-06T00:00:00+01:00

A few weeks ago I gave a talk at AirConf 2014, a virtual conference organised by my friends at AirPair, about what's new in Django 1.7. Here's the video:

Update 2017: AirPair's videos seem to have vanished.

I kept the slides simple, on purpose, as most of the interesting stuff was in the code demos. You can find them here anyway.

A quick aside about speaking at this sort of virtual event. There's a very noticeable lack of feedback, since you're talking to yourself rather than a room full of presumably-interested people. It makes it very difficult to judge whether the talk is going well, or even if anyone is actually listening - compare to a normal conference, where you know that if everyone is immersed in their laptops rather than looking at you, the talk isn't catching their imaginations.

Still, I think the talk was well received, and I enjoyed giving it.

A better vim 'go to declaration' with ASTs.

2013-12-17T11:15:00+00:00

There are loads of little commands in vim for navigating around your files. One of the ones I like the most is gd, for "go to local declaration". What that does - or is supposed to do - is find the place at which the identifier under the cursor is declared or defined, and go to it.

However, as the documentation states, this is "not guaranteed to work" - and in fact rarely does. The issue is that vim does not parse the code and find the actual definition: what it does is find the beginning of the current function (which rarely works in Python anyway) or the top of the file, and search forward from there. Even if that did work reliably, it still wouldn't cope with definitions at class or module level, and doesn't distinguish between variables defined at different scopes.

No doubt the proper fix here is to use something like ctags, that processes your file and outputs a tags list that can be used by vim's tag matching. But that still wouldn't really understand scope. Anyway, I wanted to see if this could be done dynamically, by using Python itself to try and extract some meaning from the code, and determine where the variable was actually defined using its understanding of the scope. So, in my insanity, I turned to ASTs.

Abstract Syntax Trees are a way of representing code as a series of objects, arranged in a tree, which represent the language's grammar in an abstract way. Given this simple code:

def foo(bar):
  baz = bar * 2
  return baz

and calling ast.parse() on it (as a string), you get this series of objects (reformatted for clarity):

Module(body=[FunctionDef(
    name='foo',
    args=arguments(
        args=[Name(id='bar', ctx=Param())],
        vararg=None, kwarg=None, defaults=[]),
    body=[
        Assign(
            targets=[Name(id='baz', ctx=Store())],
            value=BinOp(
                left=Name(id='bar', ctx=Load()),
                op=Mult(),
                right=Num(n=2))
        ),
        Return(value=Name(id='baz', ctx=Load()))
    ], decorator_list=[]
)])

That's not particularly easy to read, but you can at least see that it's represented the various identifiers - foo, bar and baz - as objects with attributes defining things like their names, and that the objects form a tree with the Module at the top, a FunctionDef inside that, and then the Assign and Return statements inside the body attribute.
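As a rough sketch of how you might poke at that tree yourself, the snippet below walks it and lists every Name node with its context and position. It assumes a current Python 3, where function parameters show up as arg nodes rather than Name nodes, so the dump differs a little from the one above; describe_names is just a name I've made up for illustration.

```python
import ast

SOURCE = "def foo(bar):\n  baz = bar * 2\n  return baz\n"

def describe_names(source):
    """Return (id, context, line, column) for every Name node, in file order."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            found.append((node.id, type(node.ctx).__name__,
                          node.lineno, node.col_offset))
    return sorted(found, key=lambda item: (item[2], item[3]))

names = describe_names(SOURCE)
for name in names:
    print(name)
```

The line and column on each node are what make it possible to match a node against the cursor position later on.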

So, it looks like we can use this tree to find the definition of the variable we're on. The process will look like this: first, we need to scan through the tree to find our current file position and locate the AST representation of our current identifier. Then, we need to progressively widen our scope until we find the object that first assigns that identifier. Finally, we can move vim's cursor to that object's position.

The code for doing this is borrowed mainly from the excellent pyflakes project, which uses AST parsing to detect syntax or style errors. (The vim interface code is inspired by the much-lamented pyflakes.vim - I've never got on with Syntastic in quite the same way I did with that). Pyflakes already does a great job of implementing a Visitor class to step through the tree and call relevant methods for each node, so all we need to do is override the relevant methods to search for our identifier, rather than check for style.

Pyflakes also very helpfully keeps track of an object's scope. It uses this to check for errors like identifiers being referenced before they have been defined, or style issues like objects that are defined but never referenced. So, every time the visitor encounters a new definition, it stores it in a dictionary representing the current scope (module, function or class), and whenever it encounters a further use of that identifier, it looks the object up in the scope stack and annotates it with the scope where it was found.

Clearly, that is exactly what we need to implement our 'go to definition' functionality. We just need to override the Visitor methods such that we compare each node with our identifier, and if they match we take that node, find its scope (which has already been recorded by the visitor) and then extract the definition's line/column number from there. Note that when matching, we compare both name and line number - clearly, names can be defined multiple times in a file, and we want the correct node for our target position.
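To make the scope-widening idea concrete without pulling in pyflakes, here is a much-simplified sketch using only the standard ast module. It is not the code from the plugin: find_definition, bindings and local_nodes are hypothetical helpers, it assumes Python 3.8+ (for end_lineno), and it ignores imports, global/nonlocal and comprehensions.

```python
import ast

SCOPE_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)

def local_nodes(scope):
    """Yield nodes that belong to `scope`, without entering nested scopes."""
    stack = list(ast.iter_child_nodes(scope))
    while stack:
        node = stack.pop()
        yield node
        if not isinstance(node, SCOPE_TYPES):
            stack.extend(ast.iter_child_nodes(node))

def bindings(scope, name):
    """Return sorted (line, col) positions where `name` is bound in `scope`."""
    found = []
    for node in local_nodes(scope):
        if (isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
                and node.id == name):
            found.append((node.lineno, node.col_offset))
        elif isinstance(node, ast.arg) and node.arg == name:
            found.append((node.lineno, node.col_offset))
        elif isinstance(node, SCOPE_TYPES) and node.name == name:
            found.append((node.lineno, node.col_offset))
    return sorted(found)

def find_definition(source, name, lineno):
    """Widen from the innermost scope containing `lineno` out to the module."""
    tree = ast.parse(source)
    enclosing = [node for node in ast.walk(tree)
                 if isinstance(node, SCOPE_TYPES)
                 and node.lineno <= lineno <= node.end_lineno]
    enclosing.sort(key=lambda node: node.lineno)   # outermost first
    chain = [tree] + enclosing
    for depth, scope in enumerate(reversed(chain)):
        # Class bodies are not visible from functions nested inside them.
        if depth > 0 and isinstance(scope, ast.ClassDef):
            continue
        found = bindings(scope, name)
        if found:
            return found[0]
    return None

SOURCE = "total = 0\ndef foo(bar):\n  baz = bar * 2\n  return baz + total\n"
print(find_definition(SOURCE, 'baz', 4))
print(find_definition(SOURCE, 'total', 4))
```

The real thing gets all of this for free from the scope stack that pyflakes maintains, which is rather more thorough than the handful of cases handled here.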

Things are complicated slightly by the fact that our target identifier might not be a Node at all, but an attribute of another node. This happens with dot notation: for example, the attribute reference foo.quux is represented in an AST like this:

Attribute(value=Name(id='foo', ctx=Load()), attr='quux', ctx=Load())

If you were searching for the original definition of the quux attribute, you would need to start from the Attribute node and check for matches to the attr attribute.
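A small sketch of that kind of match, again with plain ast rather than pyflakes, might look like this. find_attributes is a made-up helper, and the one-line source string is only there to give the parser something to chew on.

```python
import ast

def find_attributes(source, attr_name):
    """Return the Attribute nodes in `source` whose attr equals `attr_name`."""
    return [node for node in ast.walk(ast.parse(source))
            if isinstance(node, ast.Attribute) and node.attr == attr_name]

matches = find_attributes("x = foo.quux\n", "quux")
owner = matches[0].value.id          # the Name node on the left of the dot
position = (matches[0].lineno, matches[0].col_offset)
print(owner, position)
```

Note that the position reported is where the whole foo.quux expression starts, not where quux itself sits.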

We deal with this by overriding both the handleChildren and NAME methods from pyflakes' Checker class. The first is called for attributes, and the second for Name nodes, ie everything else. In each one we check the line number, as well as either the id or the attr attribute.

def handleChildren(self, tree):
    # Attribute nodes: match on the line number and on the attr name.
    if (hasattr(tree, 'lineno') and tree.lineno == self.lineno
            and hasattr(tree, 'attr') and tree.attr == self.name):
        self.target = tree
        self.targetScope = self.getScope(tree.attr)

    for node in checker.iter_child_nodes(tree):
        self.handleNode(node, tree)

def NAME(self, node):
    super(MyChecker, self).NAME(node)
    # Name nodes: match on the line number and on the id.
    if node.lineno == self.lineno and node.id == self.name:
        self.target = node
        self.targetScope = self.getScope(node.id)

The getScope method is a simple check upwards through the relevant scope objects:

def getScope(self, target):
    if target in self.scope:
        return self.scope[target]
    for scope in self.scopeStack[-2::-1]:
        if target in scope:
            return scope[target]
    print 'Sorry, no definition found.'

All that's left is the wrapper to grab the word under the vim cursor using vim.eval('expand("<cword>")'), pass it to the Checker, extract the target definition, and go to the relevant line/column. One last tiny tweak is that not all nodes have an actual child node for the definition itself - eg the FunctionDef above starts on col 0, but the name itself is on col 4 - so we use a simple Python str.find to move to the actual column position of the name.
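The column tweak is tiny, but for completeness here is roughly what it amounts to. name_column is a hypothetical helper, not the function in the plugin, and falling back to the node's own column when the name isn't found is my assumption.

```python
def name_column(line, name, node_col=0):
    """Return the column where `name` appears in `line`, at or after node_col."""
    col = line.find(name, node_col)
    # Fall back to the node's own column if the name isn't on this line.
    return col if col != -1 else node_col

print(name_column("def foo(bar):", "foo"))
```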

Now, of course this is still very far from perfect. It won't resolve names imported via from module import * - but of course no-one would ever use that anyway. Much more seriously though, it can't find names that are defined outside the current scope tree. For example, if you call a method that defines a series of attributes on an object, then return it, and then pass it to another method, inside that second method the scope chain has no record of where the new attributes were originally defined. Similarly, it won't find methods that are defined on an object's superclass. I haven't yet come up with any good way of fixing that, although in practice it doesn't seem to be a huge problem.

You can find the code on GitHub.

Stepping through code in IPython

2013-09-19T00:00:00+01:00

I spend a lot of time in the Python shell - specifically, IPython. Like many Python programmers, I find it invaluable for delving into the structure of objects, exploring their members, running their methods, and so on. It's really the dynamic language's answer to the really good IDE support you'd get in a more static language like Java.

One of the things it's really useful to be able to do in the shell is to import a module and then step through the code in the debugger. Now, you can do this simply by importing pdb, but then you don't get the nice IPython-enhanced version that you get when you're dropped into the IPython debugger via the %debug magic.

Obviously, there must be a way of getting that IPython debugger manually within IPython itself - and there is:

from IPython.core.debugger import Pdb
ipdb = Pdb()

Now you can do with the ipdb object anything you would previously have done with pdb, except more snazzily.

The other thing I always forget is how to actually call an imported function and start in debug mode. Most of the time, when debugging running code, I just do import pdb;pdb.set_trace() and leave it at that (I do this so frequently I have a snippet in vim for it). But to call a function from the shell and start debugging straight away, you need another function: runcall().

This takes the function and its parameters, as separate arguments. So to start the function foo(bar, baz, quux) in the debugger you would do ipdb.runcall(foo, bar, baz, quux).

Now I understand why this is: otherwise you would be calling the debugger on the result of the function, which isn't what you want. But it's still a bit annoying to have to remember to do that. So, I decided to write an IPython magic script that translates the function call syntax into the separate-argument version - while still accepting the latter. I haven't written any magics before, so this is an experiment with the syntax.

Drop this into ~/.ipython/extensions as step.py and do %load_ext step from IPython. Now you can do %step foo(bar, baz, quux) and step through your code with joy.

import re

from IPython.core.magic import Magics, magics_class, line_magic
from IPython.core.debugger import Pdb

ipdb = Pdb()

@magics_class
class StepMagic(Magics):

  @line_magic
  def step(self, params):
    # params might either be
    # foo, bar, baz
    # or
    # foo(bar, baz)
    # we need to determine if there is a comma before the opening paren.
    comma_pos = params.find(',')
    paren_pos = params.find('(')
    if comma_pos > -1 and (comma_pos < paren_pos or paren_pos == -1):
      param_list = params.split(',')
    else:
      # Match everything up to first open paren, and everything inside parens.
      # We use lazy repetition on the first group to ensure that it copes when
      # the expressions within the parens are themselves calls.
      match = re.match(r'(.*?)\((.*)\)', params)
      if match:
        func, args = match.groups()
        # TODO: this doesn't work when args consists of lists/tuples.
        param_list = [func] + args.split(',')
      else:
        # Assume it's a single expression
        param_list = [params]

    evaluated_params = [self.shell.ev(p) for p in param_list]
    ipdb.runcall(*evaluated_params)


_loaded = False

def load_ipython_extension(ip):
    """Load the extension in IPython."""
    global _loaded
    if not _loaded:
        plugin = StepMagic(shell=ip)
        ip.register_magics(plugin)
        _loaded = True

As I note in the code, there's a bug: if your parameters are themselves lists or tuples, this will fail, as it will split the elements of the list into separate arguments. There are probably various ways of dealing with that, from better regexes all the way to messing with ASTs, but it's good enough as far as it goes.
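For what it's worth, here is a sketch of what the "messing with ASTs" route could look like. It lets the parser decide where one argument ends and the next begins, so lists and nested calls survive intact. split_call is a hypothetical helper, it assumes Python 3.8+ for ast.get_source_segment, and it simply ignores keyword arguments.

```python
import ast

def split_call(params):
    """Split 'foo(a, [1, 2])' or 'foo, a, [1, 2]' into a list of source strings."""
    source = params.strip()
    node = ast.parse(source, mode='eval').body
    if isinstance(node, ast.Call):
        # foo(bar, baz): the function first, then each positional argument.
        parts = [node.func] + node.args
    elif isinstance(node, ast.Tuple):
        # foo, bar, baz: already in separate-argument form.
        parts = node.elts
    else:
        # A single expression.
        parts = [node]
    return [ast.get_source_segment(source, part) for part in parts]

print(split_call('foo(bar, [1, 2], baz(3, 4))'))
```

Each string in the result could then be handed to self.shell.ev exactly as param_list is in the magic above.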

Querysets aren't as lazy as you think

2012-05-30T20:30:00+01:00

It should be reasonably well-known by now that querysets are lazy. That is, simply instantiating a queryset via a manager doesn't actually hit the database: that doesn't happen until the queryset is sliced or iterated. That's why the first field definition in the form below is safe, but not the second:

class MyForm(forms.Form):
    my_field = forms.ModelChoiceField(queryset=MyModel.objects.all())
    my_datefield = forms.DateTimeField(initial=datetime.datetime.now())

Even though both elements are calling methods on definition, the first is safe because the queryset is not evaluated at that time, whereas the second is not safe because it is evaluated at that time, and therefore remains the same for the duration of the current process (which can be many days). For the record, you should always pass the callable: initial=datetime.datetime.now, without the calling brackets.
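The difference between passing a value and passing a callable is easy to see outside Django. The sketch below uses a fake clock and a made-up Field class as a stand-in - it is not how Django's form fields are implemented, just an illustration of when each expression gets evaluated.

```python
import itertools

# Fake clock so the example is deterministic: each call returns the next tick.
_ticks = itertools.count(1)

def now():
    return next(_ticks)

class Field:
    """Hypothetical stand-in for a form field; not Django's implementation."""
    def __init__(self, initial=None):
        self.initial = initial

    def get_initial(self):
        # Call the initial value at use time if it is callable.
        return self.initial() if callable(self.initial) else self.initial

frozen = Field(initial=now())   # evaluated once, here, at definition time
fresh = Field(initial=now)      # evaluated each time the value is needed

frozen_values = [frozen.get_initial(), frozen.get_initial()]
fresh_values = [fresh.get_initial(), fresh.get_initial()]
print(frozen_values, fresh_values)
```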

Now, there are a couple of gotchas here. It is perfectly possible to define manager methods that are not safe to use in places like the queryset argument to the first field above. Here's an example:

class PublishedManager(models.Manager):
    def get_query_set(self):
        return super(PublishedManager, self).get_query_set().filter(
                published_date__lte=datetime.datetime.now())

Clearly, this is an attempt to create a manager that automatically filters items which have been published. In the normal course of calling this in a view, it will work exactly as expected. But if you passed it into the queryset parameter of the form field, the same thing will happen as with the date field: the cut-off point will always be set when the form is first imported, and will persist for the life of the process.

This is because there's nothing magical about manager methods that makes them lazy. The laziness comes further down, inside the QuerySet class itself. The get_query_set method, which is called automatically by the all() method, will be evaluated when it is called - ie when the form is first defined. At that point, the right-hand-side of the query expression will also be evaluated, and passed as a fixed value into the filter on the queryset returned by the main Manager get_query_set method. So no matter when you instantiate your form after this, during the lifetime of the process you will never see any objects whose published_date is later than the moment the form was first defined.

But note that if you change the published_date of an existing object to before that time - or even create a new object with that date - you will see it. The queryset is still lazy, and the database will be queried each time the form is instantiated: but that published_date parameter is fixed.
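Here is a toy model of that behaviour in plain Python. LazyQuery is a hypothetical stand-in for a queryset, with a list of publish times playing the part of the database table: the rows are re-read on every iteration, but the cut-off was turned into a plain value when the object was created.

```python
class LazyQuery:
    """Hypothetical stand-in for a lazy queryset over a list of publish times."""
    def __init__(self, table, cutoff):
        self.table = table      # the "database", read only on iteration
        self.cutoff = cutoff    # already a plain value by the time we get it

    def __iter__(self):
        return iter([t for t in self.table if t <= self.cutoff])

table = [1, 2]
query = LazyQuery(table, cutoff=5)   # "now" is 5 when the form is defined

table.append(3)    # published before the frozen cut-off
table.append(9)    # published after it
visible = list(query)
print(visible)
```

The row added later with a time of 3 shows up, because the table really is consulted afresh, but the row at 9 never will.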

Moving to Pelican

2012-02-20T12:45:00+00:00

I'm starting to blog again more regularly, but of course half the fun of having your own blog is mucking about with the technology that powers it. Encouraged by others who've recently made the switch, I've moved this blog over to the static page generation system Pelican.

I liked Mingus, but it really was overkill for what I need: regenerating a bunch of basically static pages for every request didn't make much sense, even if there weren't that many requests in the first place. I wasn't using any of the apps that came along with the basic blogging stuff: oembed, quotes, that sort of thing. And it seemed to me that if I was just typing in markdown and saving it, I might as well use a system that can take that markdown directly and just create a blog from it.

Well, that's exactly what Pelican does. No database involved, just a bunch of markdown files on disk and a script to convert them into formatted HTML pages via some Jinja2 templates. I write an entry in vim, save it into a specific directory, then run the pelican script which outputs the formatted pages. I've set up dropbox to sync that folder to my server, and that's all that needs to happen to publish a new blog entry.

As usual, I've customised things a bit. I didn't want to change the old URLs at all, so I've used the ARTICLE_PERMALINK_STRUCTURE setting to output files into directories by year/month/day (in hindsight, that was another bit of overkill from Mingus - one directory per month would probably have been enough). I've also set Apache to rewrite requests from slug.html to slug/, again so that the old URLs continue to work. Pelican has a CLEAN_URLS setting which means that files are created as slug/index.html, with the idea that you set Apache to serve the index.html automatically and other pages link to just slug/, but I didn't like the idea of yet another layer of subdirectories. So I've hacked it to split the setting into two: CREATE_CLEAN_URLS to control how the files are created, and LINK_CLEAN_URLS to determine how other pages link to them. Works nicely.

Improving the vim terminal development environment

2012-02-04T14:21:27+00:00

Since I started working at Google, I've been doing almost all my development using vim through the terminal. Since my last post on the subject (over a year ago - bad blogger), I've discovered some significant improvements to what I recommended there, and I thought I'd share what I've learned.

The biggest improvement is to switch to iTerm2 as my terminal - thanks to Ross for the recommendation. Not only does it remove the need for the TerminalColors hack, it also supports mouse events natively - so the MouseTerm hack is no longer needed. Plus, it includes support for control codes that change the cursor shape, so it is now simple to get vim to show whether it's in normal or insert mode. Just add these lines to .vimrc:

let &t_SI = "\<Esc>]50;CursorShape=1\x7"
let &t_EI = "\<Esc>]50;CursorShape=0\x7"

The latest version of iTerm2 has another trick up its sleeve. It integrates seamlessly with a custom version of tmux, the terminal multiplexer that's an alternative to GNU Screen. Using any multiplexer allows you to connect to a remote server with a single connection, and open as many tabs/windows to that connection as you like - and, what's more, you can detach your session when you log out and re-attach later, bringing your terminals back exactly how you left them. Useful enough, but what iTerm2 adds is native support for tmux tabs and windows - so that they open as tabs/windows on the Mac desktop, and you can switch between them using normal Mac shortcuts rather than tmux-specific ones. Very useful: it involves compiling a custom version of tmux and some associated libraries on the server, but it's well worth it.

One missing element was the ability to cut-and-paste directly from vim into apps running on the Mac. This was tricky to get right.

The basic principle is for the X clipboard to be shared with the OS X pasteboard, and then to get vim to yank/delete to that clipboard. To get this working, I needed to run an X server on my Mac and enable clipboard sharing - it needs to be running for this, even if you're not using any X applications. There is an X11 app supplied with OS X, but (on Snow Leopard at least) it didn't seem to support the correct settings. Although it might be possible to get it working with some use of defaults write, the easier solution is to use the open-source XQuartz project - it's the same as the supplied X11.app, but updated more frequently. The version that Google had, luckily, already installed on my machine (2.7.0) includes a Pasteboard tab on the Preferences dialog which controls syncing between Mac Pasteboard and X clipboard - you need to tick "Enable syncing", "Update Pasteboard when CLIPBOARD changes", and "Update CLIPBOARD when Pasteboard changes".

The next step is to ensure that the SSH connection to the development machine uses X forwarding, which simply means using ssh -X when connecting. If the remote machine/VM is headless, it may not have any of the X stuff installed: installing the xauth package is probably enough to bring in the necessary dependencies, but I'm definitely not an expert here.

Now, in vim, you can yank to the X clipboard by using the + register, using "+y - I now have this mapped to Ctrl-C, which is close enough to the Mac's default of Cmd-C that I find it easy to remember. In bleeding-edge versions of vim later than 7.3.74, there is actually an additional option that you can use so that all yanks go directly to the X clipboard: set clipboard=unnamedplus. But I found that this ignores the difference between line-wise and character-wise selection, treating everything as character-wise, so cutting and pasting full lines inside vim becomes unnecessarily annoying. Better to learn the extra shortcut for those times when I explicitly need to copy out of vim.

One final update to my last post: I talked there about switching between tabs, but I've now stopped using tabs altogether. Seriously, folks, if you're using vim, learn about buffers: they are great. Especially with the latest version of Command-T which allows you to use the same fuzzy-matching search to switch between open buffers using <leader>b. Another advantage of buffers over tabs is that it allows you to have multiple windows open simultaneously in the same session: Ctrl-w v splits the main window vertically, giving side-by-side windows, and Ctrl-w w switches between open windows (I've actually remapped Tab in normal mode to this). With a large monitor coupled with Google's extremely strict 80-character limit, you can easily have two or three files next to each other. A real productivity boost.

From MacVim to terminal vim

2010-12-15T06:18:09+00:00

I'm a very happy user of MacVim, which very nicely integrates vim into a native Mac app. But occasionally I need to edit code via a terminal, which means dropping back to plain old vim. Recently I found myself working on a project that was distributed on a self-contained virtual machine, and after several days of mucking around with mounting the VM's filesystem via sshfs and suffering continued networking drop-outs, I decided to bite the bullet and move to working entirely within the terminal.

Mostly, the transition was fairly simple. I was able to copy over my .vimrc and the contents of the .vim directory to my home directory on the VM, and almost everything 'just worked'. There were a few exceptions.

The first thing that I missed was the mouse. Although like any good vim user I do stay mainly on the keyboard, it's nice to have the mouse available occasionally as an alternative for things like rapid scrolling with the wheel, tab/window switching, and text selection. I initially thought I would just have to do without, but it turns out that it is quite possible to have the mouse working within the terminal.

The main issue is that although there is a well-defined way for an xterm terminal to send mouse events, the OSX Terminal app doesn't support it. However, there is a nice easy hack that does work: MouseTerm. This is a SIMBL plugin which patches the terminal so it sends mouse events. I already had SIMBL installed, as I use Ciarán Walsh's indispensable TerminalColors to make the terminal colours sensible, so it was just a matter of clicking the MouseTerm .dmg to get it working. Then I just added set mouse=a to my .vimrc, and hey presto: mouse in terminal vim.

Secondly, tabs. I use tabs all the time in vim. MacVim overrides vim's built-in tabs with proper native ones, so that you can open them in the standard way - eg opening documents in new tabs via :tabe - but then switch between them with the standard OSX shortcut keys, cmd-[ and cmd-]. In the terminal, the first of these continues to work, but since the tabs are now vim's own ones, switching needs to be done by the vim shortcuts of gt and gT. Not too much to learn - I might try and force myself to use those within MacVim, for the sake of consistency.

Thirdly, extensions. As I said above, most of these just worked by copying over my .vim directory. But one of the ones I use most of all - the excellent Command-T file navigator - did not. This was because the version of vim installed on Ubuntu by default via apt-get install vim does not include Ruby support, even though it does include most of the other options. Removing that and installing the vim-nox package instead rectified that.

The final issue is cursors. MacVim nicely distinguishes between insert and normal mode by switching between bar and block cursors. In the terminal, this doesn't happen. It appears to be possible to send custom escape sequences when switching modes: the termcap-cursor-shape help topic explains how. But I couldn't get this to work even for the examples in that topic, which switch colour rather than shape; I don't even know how to begin finding the right escape codes to change the cursor shape. In any case, I suspect something in the interface between Terminal.app and the Ubuntu bash shell is preventing colour codes from working - for example, printf "\e[0;32m" works in the OSX shell to change the text colour to green, but the same command fails to work when I'm ssh-ed in to the VM. Any hints would be gratefully received.

Getting the related item in an aggregate

2010-08-14T16:18:46+01:00

There's a question I see quite a lot at StackOverflow and the Django Users group regarding aggregation in Django. It goes like this: I know how to annotate a max/min value for a related item on each item in a queryset. But how do I get the actual related item itself?

I wish this was easier than it actually is. The problem is that in the underlying SQL, annotating the value is a simple aggregation query on the related item, whereas getting the entire object means moving to a complicated dependent subquery.

To illustrate, take these models:

class Blog(models.Model):
    name = models.CharField(max_length=64)

class Entry(models.Model):
    blog = models.ForeignKey(Blog)
    added = models.DateTimeField(auto_now_add=True)
    text = models.TextField()

Getting the date of the latest Entry for each Blog is simple:

blogs = Blog.objects.annotate(Max('entry__added'))

and the underlying SQL is just as simple:

SELECT blog.id, blog.name, MAX(entry.added)
FROM blog_blog blog
JOIN blog_entry entry on entry.blog_id = blog.id
GROUP BY blog.id

But that doesn't work if you want the whole Entry object. You need to do something much more complicated:

SELECT blog.id, blog.name, entry.id, entry.added, entry.text
FROM blog_blog blog, blog_entry entry
WHERE entry.id = (
    SELECT e2.id FROM blog_entry e2
    WHERE e2.blog_id = blog.id
    ORDER BY e2.added DESC
    LIMIT 1
);

and currently there's no support for this in the Django ORM.
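If you want to convince yourself of the difference between the two queries, they are easy to try against an in-memory SQLite database. This is only a sketch: the table layout mirrors the models above, the sample rows are invented, dates are stored as plain text for simplicity, and the subquery orders newest-first so that LIMIT 1 picks the latest entry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blog_blog (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE blog_entry (
        id INTEGER PRIMARY KEY, blog_id INTEGER, added TEXT, text TEXT);
    INSERT INTO blog_blog VALUES (1, 'first'), (2, 'second');
    INSERT INTO blog_entry VALUES
        (1, 1, '2010-01-01', 'old'),
        (2, 1, '2010-06-01', 'new'),
        (3, 2, '2010-03-01', 'only');
""")

# The simple aggregate: just the latest date per blog.
latest_dates = conn.execute("""
    SELECT blog.id, blog.name, MAX(entry.added)
    FROM blog_blog blog
    JOIN blog_entry entry ON entry.blog_id = blog.id
    GROUP BY blog.id
    ORDER BY blog.id
""").fetchall()

# The dependent subquery: the whole latest entry per blog.
latest_entries = conn.execute("""
    SELECT blog.id, blog.name, entry.id, entry.added, entry.text
    FROM blog_blog blog, blog_entry entry
    WHERE entry.id = (
        SELECT e2.id FROM blog_entry e2
        WHERE e2.blog_id = blog.id
        ORDER BY e2.added DESC
        LIMIT 1
    )
    ORDER BY blog.id
""").fetchall()

print(latest_dates)
print(latest_entries)
```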

Now, you could just pass the above query to the .raw queryset method in Django 1.2: Blog.objects.raw('SELECT...'), and perhaps surprisingly, this will work, in that the extra fields from the Entry model will be appended to each Blog instance. If you needed the actual Entry instance - say if you had some extra methods on the Entry model that you needed to run with each one - you would have to iterate through the queryset and instantiate new Entry objects with the fields from each Blog.

Also note there's another gotcha with raw querysets, which is that they are re-executed every time you slice them or access one of their members - so it's probably best to cast them to a plain list first.

There is another approach which gets you the items related in the normal Django way, so that you can do entry_instance.blog. It does this in two queries, with a bit of Python processing in the meantime.

from django.db.models import Max
blogs = Blog.objects.annotate(Max('entry__added'))
values = tuple([(blog.id, blog.entry__added__max) for blog in blogs])

entries = Entry.objects.extra(where=['(blog_id, added) IN %s' % (values,)])

blog_dict = dict([(b.id, b) for b in blogs])
entries = list(entries) 
for entry in entries:
    entry._blog_cache = blog_dict[entry.blog_id]

Here we do a standard annotate query to get the added values for each relevant Entry. Then we can do an extra query to get the actual Entries associated with each (blog_id, max_entry) tuple (note we can't use the params argument for the values list, unfortunately, as it will get double-quoted). Finally, we can re-associate each Entry with its Blog - I've done it that way round to fit in with the standard ForeignKey and its automatic mapping of entry._blog_cache to entry.blog, and since we're only interested in one entry per blog it shouldn't matter whether we have to iterate through blogs or entries.

Again, it's a shame we have to drop to raw SQL for the middle step here. The query depends on matching multiple values for each row, and although it would be possible to do this by iterating through and adding Q objects for each row, it would be an absolutely horrible query. At least we're using extra here, which is arguably better than the raw we used in the first attempt above.

Europython 2010 talk: Advanced Django ORM Techniques

2010-07-25T16:15:53+01:00

Here are the slides from my talk at Europython 2010, Advanced Django ORM Techniques.

Advanced Django ORM techniques from Daniel Roseman

The talk is mainly a summary of the query optimisation tricks I've previously talked about on this blog, although I did begin by explaining briefly how models, fields and relationships work behind the scenes - I'll write some of that up here at some point.

I'll also be posting a longer review of Europython here, hopefully in the next few days.

Update: here's the (unfortunately fairly poor-quality) video from the talk:

Class-based views and thread safety

2010-06-02T06:20:16+01:00

I previously wrote about a thread-safety bug I encountered in writing a middleware object. I recently found a very similar situation in a class-based view I was modifying, and thought it was worth writing up what the problem was and how I solved it. Interestingly, there's a discussion going on in Django-developers at the moment on class-based views which touches on many of the same issues.

By 'class-based view', I mean that rather than just being a function which is called from urls.py when the URL matches its pattern, this was a class with a __call__ method. In theory, that should work in exactly the same way as a function view - to Python, functions and callable objects are pretty much the same thing.

One of the reasons that the controller was written as a class was that it had multiple utility methods that were called during the rendering process. To make it easier to pass data between the methods, various things were set as instance variables - self.item_id, self.key, and so on.

However, the urlpattern for this view was defined like this:

(r'^(?P<key>)/((?P<page>[^/]+)/)?', FormController()),

which meant that it was instantiated just once per process: when the urls are first imported. Unfortunately, when running under a production server like Apache/mod_wsgi, a process does not equal a request. Processes are persistent - managed as part of a pool by the server - and a process can serve many requests before eventually being killed and restarted.

This means that the FormController object declared above will probably be responsible for serving multiple requests - and any instance attributes, eg self.item_id, will be visible to all of them. Clearly, this has the potential to be a massive bug, if information starts leaking between requests - one user might start seeing information meant for someone else, or a form could validate itself based on information from a completely different user.
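The leak is easy to reproduce without Django at all. In this sketch SharedController is a made-up callable that stashes per-request data on self, and plain dicts stand in for requests; the real FormController was of course rather more involved.

```python
class SharedController:
    """Hypothetical callable view that keeps per-request data on self."""
    def __call__(self, request):
        if 'item_id' in request:
            self.item_id = request['item_id']
        # Any later request that doesn't set item_id sees the previous value.
        return getattr(self, 'item_id', None)

view = SharedController()          # one instance for the life of the process
first = view({'item_id': 42})      # request from user A
second = view({})                  # request from user B sees A's data

per_request = SharedController()({})   # a fresh instance has nothing to leak
print(first, second, per_request)
```

With genuinely concurrent threads it gets worse, since two requests can be reading and writing the same attributes at the same moment.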

There are various potential ways to solve this. The main idea is to instantiate the object once per request, rather than once per process - this ensures that any instance data is specific to that request. The way I've done it for now is to redefine the url like this:

(r'^(?P<key>)/((?P<page>[^/]+)/)?', FormController),

ie make the handler a class, rather than an instance. That means that Django will instantiate a FormController object when the url matches, so the code that was previously in __call__ goes in __init__ instead. Unfortunately, in Python you can't return anything from an __init__ method - the returned value from an instantiation is automatically the object itself. So what I've done (based on an idea from Mike Malone) is to make the view class a subclass of HttpResponse - then, rather than returning render_to_response, you can just render the template to a string and pass this to the super class's __init__ method:

class FormController(HttpResponse):
    def __init__(self, request, key, page=None):
        ... lots of code ...
        content = loader.render_to_string(
            template,
            form_context,
            context_instance=RequestContext(request)
        )
        super(FormController, self).__init__(content)

The value returned from the instantiation is therefore an HttpResponse object containing the rendered template, as Django expects.

One small side-effect of doing this is that you can't return anything else - for example an HttpResponseRedirect (or an HttpResponseNotFound). You need to do the redirect manually:

        if self.redirect_to_next_page:
            self.status_code = 302
            super(FormController, self).__init__()
            self['Location'] = next_page
            return
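
The overall shape can be tried out without Django. In this sketch FakeResponse is a hypothetical stand-in for HttpResponse (it only keeps content, a status code and a dict of headers), and PageView and its arguments are invented for illustration. It shows that instantiating the class yields something response-like, and that a redirect has to be built by hand.

```python
class FakeResponse(object):
    """Hypothetical stand-in for HttpResponse: content, status, headers."""
    status_code = 200

    def __init__(self, content=''):
        self.content = content
        self.headers = {}

    def __setitem__(self, header, value):
        self.headers[header] = value

    def __getitem__(self, header):
        return self.headers[header]


class PageView(FakeResponse):
    """Instantiated once per request; the instance itself is the response."""

    def __init__(self, request, next_page=None):
        if next_page is not None:
            # Manual redirect: empty body, 302 status, Location header.
            super(PageView, self).__init__()
            self.status_code = 302
            self['Location'] = next_page
            return
        super(PageView, self).__init__('<p>Hello %s</p>' % request['user'])


page = PageView({'user': 'alice'})
redirect = PageView({'user': 'bob'}, next_page='/step-2/')
```

Both page and redirect are instances of the response class, so whatever called the view gets a response back either way.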

Django aggregation and a simple GROUP BY

2010-05-10T11:50:44+01:00

I love Django's aggregation framework. It very successfully abstracts the common aggregation tasks into a Pythonic syntax that sits extremely well with the rest of the ORM, and the documentation explains it all without a single reference to SQL.

But sometimes that very abstraction gets in the way of working out what you want to do. One example of this happened to me today when I needed to do a sum of values grouped by a single value on a model - in SQL terms, a simple GROUP BY query.

The documentation is very clear about how to do aggregations across an entire QuerySet, and annotations across a relationship. So you can, for example, easily do a sum of all the 'value' fields in a model, or a sum of all the 'value' fields on a related model for each instance of the parent model. But the implication is that these are the only things you can do. So I was left wondering if I had to create a dummy related model just to contain the unique values of the field I wanted to group on.

In fact, what I wanted to do was clearly documented, but because of the (probably correct) desire not to express things in terms of SQL, it's not that easy to find. So here's how to do it: you just need to use values.

For instance, my model is a set of transactions for a financial accounting system. Each transaction is associated with an order, which is just an integer ID referring to records in a completely different system. I wanted to get the total of transactions for each unique order ID. It's as simple as this:

Transaction.objects.values('order_id').annotate(total=Sum('value'))

Which gives you a ValuesQuerySet along the lines of:

[{'order_id': 12345L, 'total': Decimal('1.23')}, 
 {'order_id': 54321L, 'total': Decimal('2.34')}, 
 {'order_id': 56789L, 'total': Decimal('3.45')}]

One thing to be aware of: as the docs do note, the order of values and annotate is significant here. This way round, it groups by the fields listed in values and then annotates. But if you put annotate first, it does the calculation for each individual record, without grouping, then uses values to simply restrict the fields it outputs.
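
If it helps to see what the two orderings amount to, here's a rough plain-Python equivalent. The rows are made up, and this only mimics the shape of the results described above; it isn't how the ORM does it.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical transactions, as (order_id, value) pairs.
transactions = [
    (12345, Decimal('1.00')),
    (12345, Decimal('0.23')),
    (54321, Decimal('2.34')),
]

# values() then annotate(): group by order_id, one total per group.
totals = defaultdict(Decimal)
for order_id, value in transactions:
    totals[order_id] += value
grouped = [{'order_id': k, 'total': v} for k, v in sorted(totals.items())]

# annotate() then values(): no grouping, one result per record.
ungrouped = [{'order_id': o, 'total': v} for o, v in transactions]
```

The grouped list has one entry per distinct order ID, while ungrouped still has one entry per transaction.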

Vim autocomplete, Django and virtualenv

2010-04-21T17:41:49+01:00

One of the features I quite missed when I first moved from Komodo to Textmate as my main editor was autocompletion. Although I didn't use it very much, it was occasionally useful to be reminded of the methods available on a class, without having to look up the documentation or open the source module.

Now I've moved to vim, which has its own version of autocomplete: omnicompletion. This is activated by pressing Ctrl-X Ctrl-O in insert mode after typing the name of a class or instance, and displays a nice drop-down menu with all the available members of that object. The Python plugins which come with vim allow this function to not only complete items from the standard library, but also to parse your own files which are open in the editor, reading the modules you import and adding those elements to the completion dictionaries.

However, this wasn't working for me in my Django projects. After a lot of investigation, I found that this was down to three issues.

The first two of these related to the fact that I was working within a virtualenv. Despite the fact that I was starting MacVim after activating the virtual environment, the specific site-packages directory was not being added to the path and it was defaulting to the system-wide one, which on my system doesn't contain any of the packages I was using. The solution to this is to use the activate_this.py script which is provided with virtualenv to activate it within another process - it's intended for use with mod_wsgi, but works just as well here. You can run it from the vim command line, using python to tell vim that the following commands are in Python. (Note that the Python interpreter is actually persistent, so you can import modules or define variables in one command and they are still available in subsequent ones.)

:python activate_this = '/path/to/virtualenv/bin/activate_this.py'
:python execfile(activate_this, dict(__file__=activate_this))

This sets up the paths properly, but it was still not working. After a lot of investigation, I finally realised that this was because vim uses its own built-in Python interpreter, which is version 2.5, while my Snow Leopard machine was using 2.6. The confusion arose because doing this:

:python import sys; print sys.executable

does return the path to my virtualenv's version of Python. To make it even worse, this:

:python print sys.version

returned 2.5.2, when the executable printed in the previous command was actually Python 2.6.1! I can't explain that bit of weirdness, and would be interested to hear in the comments if anyone else can, but the fact that vim was clearly using a 2.5 Python did at least explain why it wasn't picking up the local packages which were installed in the virtualenv's lib/python2.6/site-packages directory.

The version of Python included is decided at compile time, and the pre-built versions of MacVim are actually compiled against Python 2.5, so to make it work with the 2.6 directories I had to build my own binaries. Luckily this is very easy.
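
The mismatch is easier to see when the lookup path is spelled out. This is a small sketch, assuming the usual Unix virtualenv layout of lib/pythonX.Y/site-packages; the site_packages helper is hypothetical, not something vim or virtualenv provides.

```python
import posixpath


def site_packages(venv, version_info):
    """Where an interpreter of this version looks inside a virtualenv."""
    version = 'python%d.%d' % (version_info[0], version_info[1])
    return posixpath.join(venv, 'lib', version, 'site-packages')


venv = '/path/to/virtualenv'
vim_path = site_packages(venv, (2, 5, 2))    # what vim's 2.5 would look for
real_path = site_packages(venv, (2, 6, 1))   # where the packages really are
```

The two paths differ only in the version directory, which is enough for the embedded interpreter to find nothing.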

The third and final piece of the puzzle was Django-specific. Anything that uses a Django model, or indeed imports anything that references django.db, needs a settings module. Normally when running from manage.py this is set automatically, but obviously that doesn't work within vim. So you just need to set the DJANGO_SETTINGS_MODULE environment variable, which again can be done from the vim command line:

:let $DJANGO_SETTINGS_MODULE='mysite.settings'

With all that in place, plus some further Python path manipulation to ensure it found all my project's code, I was now able to complete code within Django projects.

There's some work to be done to automate this. At the moment, I've put all the above commands into a .vimrc file at the base of the virtual env, and added some code to my main ~/.vimrc to load it based on the value of the VIRTUAL_ENV environment variable (which is set by bin/activate):

if filereadable($VIRTUAL_ENV . '/.vimrc')
    source $VIRTUAL_ENV/.vimrc
endif

This probably isn't ideal, as it involves remembering to create that file with all its specific hard-coded paths each time I set up a new virtualenv. On the other hand, trying to do something more automatic will be difficult, as my settings files are not always in a predictable location - eg for work they are often under projectname.configs.development.settings - so maybe this is the best I can do.

Bye-bye django.comments, hello Disqus

2010-04-21T04:29:31+01:00

So, I've learned my lesson. When I first set up this blog I said at great length that I'd rather use Django's built-in commenting system than the third-party Disqus service that Mingus uses by default.

Well, after a week of trying to stop a flood of spam comments, first by deleting them as they came in and then by disabling comments altogether, I've taken the plunge and reverted my changes, so I'm now incorporating Disqus. All the (13!) real existing comments have been ported over to Disqus.

For those who are interested, copying the comments over was surprisingly simple. I thought I'd have to create some XML to import them into Disqus, but it turns out there is a nice API. It further turns out that the django-disqus app includes a simple management command, disqus-export, which copies the comments straight over. One caveat - the fork of this app included in Mingus has some problems with this command on recent Django versions, but you just need to remove the verbosity option in the script to make it work.

Temporary models in Django

2010-04-13T11:40:05+01:00

Occasionally I need to create a temporary model within a Django application.

The most recent occasion for this was a one-off management command I was writing to import some data from a legacy system. The old database, for some reason, eschewed foreign keys in favour of char fields in a linking table which referred to the relevant rows. In converting this to a Django app, and wanting to use sensible database structure, I planned to replace this with normal ForeignKey fields. But I needed to temporarily hold onto the old references during the import process, so that I could set the new FK properly.

I didn't want to add a field to my model, create a migration for the new field, do the import, then add another migration to drop the field again, so a quick answer was to create a temporary table to hold the linking data during the import. And I wanted to define it within the management command itself, again so as not to pollute the real models with temporary code.

Surprisingly, this turned out to be quite easy. Here's the code:

from django.db import models, connection
from django.contrib.contenttypes.management import update_contenttypes
from django.core.management import call_command
from django.core.management.base import NoArgsCommand

class TempCustomerAddress(models.Model):
    address = models.ForeignKey('accounts.Address')
    legacy_id = models.CharField(max_length=12, unique=True)

    class Meta:
        app_label = 'utils'


class Command(NoArgsCommand):

    def handle_noargs(self, **options):
        models.register_models('utils', TempCustomerAddress)
        models.signals.post_syncdb.disconnect(update_contenttypes)
        call_command('syncdb')

        # ... do importing and stuff referring to TempCustomerAddress ...

        cursor = connection.cursor()
        cursor.execute('DROP TABLE `utils_tempcustomeraddress`')

Firstly I define the model, giving it an explicit app_label referring to an existing application within my project - not, incidentally, the one containing the actual command.

Then within the body of the command, it was pretty much just a matter of registering the model and running syncdb. It turns out that there is a very simple, although undocumented, function, register_models, to do this - you just need to pass it the name of the application to register the model into, and the model class itself. Again, I'm using the 'utils' app to register the model against - mainly because in our project that isn't managed by South, so syncdb will work. One thing I did have to do though was disconnect the post_syncdb signal which creates content types, as this seemed not to like the temporary model.

The final task, after the import had run, was to drop the temporary table. Since I'm not using South here I have to do that manually by running some SQL.
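
The same create, use, drop lifecycle can be sketched outside Django with the standard library's sqlite3 module. Everything here is an assumption for illustration: the table and column names, the sample rows, and the use of an in-memory SQLite database rather than the real one.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# The real tables, plus legacy rows that link by a char reference.
cur.execute('CREATE TABLE address (id INTEGER PRIMARY KEY, street TEXT)')
cur.execute('CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, '
            'address_id INTEGER REFERENCES address(id))')
legacy_addresses = [('A-001', '1 High Street'), ('A-002', '2 Low Road')]
legacy_customers = [('Ann', 'A-002'), ('Bob', 'A-001')]

# Temporary table mapping legacy references to new primary keys.
cur.execute('CREATE TABLE temp_customer_address ('
            'address_id INTEGER, legacy_id TEXT UNIQUE)')

for legacy_id, street in legacy_addresses:
    cur.execute('INSERT INTO address (street) VALUES (?)', (street,))
    cur.execute('INSERT INTO temp_customer_address VALUES (?, ?)',
                (cur.lastrowid, legacy_id))

# Resolve each legacy reference to a proper foreign key.
for name, legacy_id in legacy_customers:
    cur.execute('SELECT address_id FROM temp_customer_address '
                'WHERE legacy_id = ?', (legacy_id,))
    (address_id,) = cur.fetchone()
    cur.execute('INSERT INTO customer (name, address_id) VALUES (?, ?)',
                (name, address_id))

# Import finished: drop the scaffolding.
cur.execute('DROP TABLE temp_customer_address')
conn.commit()

rows = cur.execute('SELECT c.name, a.street FROM customer c '
                   'JOIN address a ON a.id = c.address_id '
                   'ORDER BY c.name').fetchall()
leftover = cur.execute("SELECT name FROM sqlite_master "
                       "WHERE name = 'temp_customer_address'").fetchall()
```

After the run, the customers point at real address rows and no trace of the linking table remains.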

Easy create or update

2010-03-09T05:38:44+00:00

One common database operation that isn't supported out of the box by Django's ORM is create_or_update - in other words, given a set of parameters, either update an existing object or create a new one if there isn't one already.

The naive implementation is to do a get() on the model, catching the DoesNotExist exception if there's no match and instantiating a new object, then updating the attributes and saving. (You wouldn't want to use get_or_create here, as that doesn't allow you to update the instance if it already exists, so you'd have some duplication of code and db queries).

try:
    obj = MyModel.objects.get(field1=value1)
except MyModel.DoesNotExist:
    obj = MyModel()
    obj.field1 = value1
obj.field2 = value2
obj.save()

The only problem with this is that it creates multiple queries: one to get the existing row, and then two to save it - Django checks to see if it should do an insert or an update when you save, which costs another query. Most of the time, this doesn't massively matter: creating and updating is usually done outside of the standard page rendering flow, so it's not a huge problem if it's a tiny bit slower.

But there are times when you do want to optimise this. One, which we recently ran into at work, is when you want to log items to the database in the course of normal page rendering. We do this to let users of our CMS know when they've put items on a page that aren't rendering how they should be, usually because they don't have the right selection of image assets. (There are good operational reasons as to why we can't stop them from entering them in the first place: I won't go into that here.) A further wrinkle for us is that we want to ensure each error only gets one entry in the log table, but should always record the most recent time that particular error scenario was encountered. So, an ideal case for create_or_update, if only it existed.

Of course I can't stand to see unnecessary db queries, so here's an implementation that uses QuerySet.update to do the initial getting and updating if a match exists. The trick is to realise that update returns the number of rows affected by the query - which has been true more or less ever since queryset-refactor landed nearly two years ago, but which was wrongly and explicitly denied in the documentation until recently (and still is denied in the 1.1 docs, even though it's true). We can use this number to tell if a matching row existed - and if it doesn't, we can then simply call create with the same arguments. Simple.

attrs = {'field1': 'value1', 'field2': 'value2'}
filter_attrs = {'filter_field': 'filtervalue'}
rows = MyModel.objects.filter(**filter_attrs).update(**attrs)
if not rows:
    attrs.update(filter_attrs)
    obj = MyModel.objects.create(**attrs)

The attrs dictionary contains the field names/values to use to update the object, and filter_attrs is the filter names/values to find the object to update. If we're creating a new object, it will of course need to set both the attrs values and the filter_attrs, so we update one dictionary from the other.

Now, note that this will always call a db UPDATE, and if no match exists, it will additionally call an INSERT. Compare this with the original version, which always calls a SELECT, plus another SELECT and an UPDATE if the match exists, but just an INSERT if there's no match. So whether this is more efficient will depend on the use case - if you expect more updates than create, this version should be better (a single UPDATE versus SELECT+UPDATE), but if the reverse is true the original implementation will probably be better.
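
The update-first approach isn't specific to the ORM. Here's a sketch of the same idea using the standard library's sqlite3 module, where cursor.rowcount plays the role of the number returned by update. The error_log table, its columns and the create_or_update helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE error_log (scenario TEXT UNIQUE, last_seen TEXT)')


def create_or_update(conn, scenario, last_seen):
    """Try an UPDATE first; fall back to an INSERT if no row matched."""
    cur = conn.execute(
        'UPDATE error_log SET last_seen = ? WHERE scenario = ?',
        (last_seen, scenario))
    if cur.rowcount:
        return 'updated'
    conn.execute('INSERT INTO error_log VALUES (?, ?)', (scenario, last_seen))
    return 'created'


first = create_or_update(conn, 'missing-image', '2010-03-08')
second = create_or_update(conn, 'missing-image', '2010-03-09')
rows = conn.execute('SELECT * FROM error_log').fetchall()
```

The first call costs an UPDATE plus an INSERT, the second just a single UPDATE, and there is never a SELECT.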

Django patterns, part 4: forwards generic relations

2010-02-22T06:50:55+00:00

My last post talked about how to follow reverse generic relations efficiently. However, there's a further potential inefficiency in using generic relations, and that's the forward relationship.

If once again we take the example of an Asset model with a GenericForeignKey used to point at Articles and Galleries, we can get from each individual Asset to its related item by doing asset.content_object. However, if we have a whole queryset of Assets, doing this:

{% for asset in assets %}
   {{ asset.content_object }}
{% endfor %}

will result in as many queries as there are assets - in fact it's n+m, where n is the number of assets and m is the number of different content types, as you'll have one extra query per type to get the ContentType object. (Although it might be slightly less than that if you've used ContentTypes elsewhere, as the model manager caches lookups on the assumption that they never change once they've been set.)

However, luckily we can make this much more efficient as well, again using a variation of the dictionary technique.

generics = {}
for item in queryset:
    generics.setdefault(item.content_type_id, set()).add(item.object_id)

content_types = ContentType.objects.in_bulk(generics.keys())

relations = {}
for ct, fk_list in generics.items():
    ct_model = content_types[ct].model_class()
    relations[ct] = ct_model.objects.in_bulk(list(fk_list))

for item in queryset:
    setattr(item, '_content_object_cache',
            relations[item.content_type_id][item.object_id])

Here we get all the different content types used by the relationships in the queryset, and the set of distinct object IDs for each one, then use the built-in in_bulk manager method to get all the content types at once in a nice ready-to-use dictionary keyed by ID. Then, we do one query per content type, again using in_bulk, to get all the actual objects.

Finally, we simply set the relevant object to the _content_object_cache field of the source item. The reason we do this is that this is the attribute that Django would check, and populate if necessary, if you called x.content_object directly. By pre-populating it, we're ensuring that Django will never need to call the individual lookup - in effect what we're doing is implementing a kind of select_related() for generic relations.
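
To check the query count without a database, here's a stand-in version in plain Python. The Asset class, the TABLES dict and the in_bulk function are all hypothetical substitutes for the real models and manager method; the sketch just records each bulk lookup so the total can be inspected.

```python
class Asset(object):
    """Hypothetical generic item: points at (content_type_id, object_id)."""
    def __init__(self, content_type_id, object_id):
        self.content_type_id = content_type_id
        self.object_id = object_id


# Stand-in "tables", keyed by content type ID and then primary key.
TABLES = {
    1: {10: 'Article 10', 11: 'Article 11'},
    2: {10: 'Gallery 10'},
}
queries = []


def in_bulk(content_type_id, ids):
    """Stand-in for a bulk lookup: one 'query', returns {pk: object}."""
    queries.append((content_type_id, sorted(ids)))
    table = TABLES[content_type_id]
    return dict((pk, table[pk]) for pk in ids)


assets = [Asset(1, 10), Asset(2, 10), Asset(1, 11), Asset(1, 10)]

# Collect the distinct object IDs wanted for each content type.
generics = {}
for asset in assets:
    generics.setdefault(asset.content_type_id, set()).add(asset.object_id)

# One bulk lookup per content type, however many assets there are.
relations = dict((ct, in_bulk(ct, ids)) for ct, ids in generics.items())

# Attach each target to its asset, as the cache attribute would hold it.
for asset in assets:
    asset.target = relations[asset.content_type_id][asset.object_id]
```

Four assets across two content types cost two lookups, and every asset ends up with its target attached.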

Django patterns part 3: efficient generic relations

2010-02-15T15:04:55+00:00

I've previously talked about how to make reverse lookups more efficient using a simple dictionary trick. Today I want to write about how this can be extended to generic relations.

At its heart, a generic relationship is defined by two elements: a foreign key to the ContentType table, to determine the type of the related object, and an ID field, to identify the specific object to link to. Django uses these two elements to provide a content_object pseudo-field which, to the user, works similarly to a real ForeignKey field. And, again just like a ForeignKey, Django can helpfully provide a reverse relationship from the linked model back to the generic one, although you do need to explicitly define this using generic.GenericRelation to make Django aware of it.

As usual, though, the real inefficiency arises when you are accessing reverse relationships for a whole lot of items - say, each item in a QuerySet. As with reverse foreign keys, Django will attempt to resolve this relationship individually for each item, resulting in a whole lot of queries. The solution is a little different, though, to take into account the added complexity of generic relations.

Assuming the list of items is all of one type, the first step is to get the content type ID for this model. From that, we can get the object IDs, and then do the query in one go. From there, we can use the dictionary trick described last time to associate each item with its particular related items. In this example, we have an Asset model that is the generic model, holding assets for other models such as Article and Gallery.

articles = Article.objects.all()
article_dict = dict([(article.id, article) for article in articles])

article_ct = ContentType.objects.get_for_model(Article)
assets = Asset.objects.filter(
    content_type=article_ct,
    object_id__in=[a.id for a in articles]
)
asset_dict = {}
for asset in assets:
    asset_dict.setdefault(asset.object_id, []).append(asset)
for id, related_items in asset_dict.items():
    article_dict[id]._assets = related_items

This is good as far as it goes, but what about when we have a heterogeneous list of items? That, after all, is the point of generic relations. So what if our starting point is a collection of both Galleries and Articles, and we still want to get all the related Assets in one go? As it turns out, the solution is not massively different: we just need to change the way we key the items in the intermediate dictionary, to record the content type as well as the object ID.

article_ct = ContentType.objects.get_for_model(Article)
gallery_ct = ContentType.objects.get_for_model(Gallery)
assets = Asset.objects.filter(
    Q(content_type=article_ct,
      object_id__in=[a.id for a in articles]) |
    Q(content_type=gallery_ct,
      object_id__in=[g.id for g in galleries])
)

asset_dict = {}
for asset in assets:
    key = "%s_%s" % (asset.content_type_id, asset.object_id)
    asset_dict.setdefault(key, []).append(asset)

for article in articles:
    article._assets = asset_dict.get("%s_%s" % (article_ct.id, article.id), None)

for gallery in galleries:
    gallery._assets = asset_dict.get("%s_%s" % (gallery_ct.id, gallery.id), None)

Here we first of all use Q objects to get all the assets of type Article with IDs in the list of articles, plus all those of type Gallery with IDs in the list of galleries. Then we use the fact that each asset knows its own content type ID to create the dictionary keys in the form <content_type_id>_<object_id>. Finally, we loop through the articles and the galleries separately to get the relevant assets for each item.
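
The reason for the compound key is that an Article and a Gallery can share the same ID. Here's a small plain-Python sketch of just the keying step; the namedtuples, the content type IDs and the file names are invented, and the assets list stands in for whatever the single query returns.

```python
from collections import namedtuple

# Hypothetical stand-ins; the content type IDs are invented.
Item = namedtuple('Item', 'id title')
Asset = namedtuple('Asset', 'content_type_id object_id name')
ARTICLE_CT, GALLERY_CT = 1, 2

articles = [Item(1, 'First article'), Item(2, 'Second article')]
galleries = [Item(1, 'First gallery')]
# Stand-in for the result of the single Q-object query.
assets = [
    Asset(ARTICLE_CT, 1, 'a.jpg'),
    Asset(GALLERY_CT, 1, 'b.jpg'),
    Asset(ARTICLE_CT, 1, 'c.jpg'),
]

# Key on "<content_type_id>_<object_id>" so equal IDs do not collide.
asset_dict = {}
for asset in assets:
    key = '%s_%s' % (asset.content_type_id, asset.object_id)
    asset_dict.setdefault(key, []).append(asset.name)


def assets_for(ct_id, item):
    """Look up the assets for one item of the given content type."""
    return asset_dict.get('%s_%s' % (ct_id, item.id), [])


article_assets = [assets_for(ARTICLE_CT, a) for a in articles]
gallery_assets = [assets_for(GALLERY_CT, g) for g in galleries]
```

The article with ID 1 and the gallery with ID 1 each get only their own assets. Note that this sketch defaults to an empty list rather than None for items with no assets, which is a small departure from the code above.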

Middleware post-processing in Django: a gotcha

2010-02-01T17:17:31+00:00

One of the requirements for the new Heart website we've just launched was to allow users to personalise their location to one of 33 radio stations across the country. For various reasons, this meant rewriting all the links on the page, dynamically, depending on the user's location setting.

The easiest place to do this sort of post-processing in Django is in response middleware. So I wrote a quick class that used regexes to grab all the href and action attributes (for a and form elements respectively - images didn't need localising) and add the relevant locations. Because it was dynamic, I used the ability of re.sub to call a function to determine the replacement value; and to save on multiple database queries, I saved various things in the instance. So it looked a bit like this:

href = re.compile(r'(href|action)=["\'](.+?)["\']')

class LocalisationMiddleware(object):
    def process_response(self, request, response):
        self.current_station = get_station(request)
        self.stations = Station.objects.values_list('slug', flat=True)

        content = href.sub(self.re_replace, response.content.decode('utf8'))
        response.content = unicode(content)
        return response

    def re_replace(self, matchobj):
        current_station = self.current_station
        url = "/%s%s" % (current_station.slug, matchobj.group(2))
        return "%s=%s" % (matchobj.group(1), url)

But then, during testing, we started getting some rather odd bug reports. Someone would be happily browsing the London pages, and would suddenly get a link pointing at Essex - which is supposed to be impossible.

We eventually realised what the problem was. Django middleware is instantiated once per process: so several requests were being serviced by the same instance, and the values of the local instance attributes - in particular self.current_station - were being leaked across requests.

The solution is to use a separate object to contain the current station and the re_replace method, and instantiate it explicitly in process_response:

class LocalisationMiddleware(object):

    def process_response(self, request, response):
        url_replacement = UrlReplacement(request)
        content = href.sub(url_replacement,
                           response.content.decode('utf8'))
        # etc

class UrlReplacement(object):
    def __init__(self, request):
        self.current_station = get_station(request)
        self.stations = Station.objects.values_list('slug', flat=True)

    def __call__(self, matchobj):
        # do replacements
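
Here's a runnable sketch of the same pattern with the Django parts stripped out. It relies only on the fact that re.sub accepts any callable as the replacement. Passing the station slug straight to the constructor is an assumption made to keep the example self-contained, and unlike the earlier snippet this version puts the quotes back around the rewritten URL.

```python
import re

href = re.compile(r'(href|action)=["\'](.+?)["\']')


class UrlReplacement(object):
    """Holds per-request state; instances are callable, for use by re.sub."""

    def __init__(self, station_slug):
        # In the middleware this would come from get_station(request).
        self.station_slug = station_slug

    def __call__(self, matchobj):
        url = '/%s%s' % (self.station_slug, matchobj.group(2))
        return '%s="%s"' % (matchobj.group(1), url)


html = '<a href="/news/">News</a> <form action=\'/search/\'>'

# A new object for each "request", so nothing is shared between them.
london = href.sub(UrlReplacement('london'), html)
essex = href.sub(UrlReplacement('essex'), html)
```

Because each call builds its own UrlReplacement, the London output can never pick up the Essex slug.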

Django patterns, part 2: efficient reverse lookups

2010-01-11T09:07:33+00:00

One of the main sources of unnecessary database queries in Django applications is reverse relations.

By default, Django doesn't do anything to follow relations across models. This means that unless you're careful, any relationship can lead to extra hits on the database. For instance, assuming MyModel has a ForeignKey to MyRelatedModel, this:

myobj = MyModel.objects.get(pk=1)
print myobj.myrelatedmodel.name

hits the database two separate times - once to get the MyModel object, and once to get the related MyRelatedModel object. Luckily, it's easy to get Django to optimise this into a single call:

myobj = MyModel.objects.select_related().get(pk=1)

This way Django does a JOIN in the database call, and caches the related object in a hidden attribute of myobj. Printing myobj.__dict__ will show this:

{'_myrelatedmodel_cache': <MyRelatedModel: obj>,
 'name': 'My name'}

Now, whenever you call myobj.myrelatedmodel, Django automatically uses the version in _myrelatedmodel_cache rather than going back to the database to get it. Note that this is exactly the same as what happens once the related object has been accessed in the first snippet above - Django caches it in the same way for future use. All select_related() does is pre-cache it before the first access.

None of this is new - it's quite well explained in the Django documentation. However, what's not obvious is how to do the same for reverse relationships. In other words, this:

myrelatedobj = MyRelatedObject.objects.get(pk=1)
print myrelatedobj.mymodel_set.all()

Here you'll always get two separate db calls, and adding select_related() anywhere won't help at all. Now one extra db call isn't that significant, but consider this in a template:

<ul>
{% for myobj in myobjects %}
    <li>{{ myobj.name }}</li>
    <ul>
         {% for relobj in myobj.backwardsrelationship_set.all %}
         <li>{{ relobj.name }}</li>
         {% endfor %}
    </ul>
{% endfor %}
</ul>

Not an unreasonable thing to want to do - iterate through a bunch of objects, then for each one display all the objects in its backwards relationship. However, this will always cost n+1 queries, where n is the number of objects in the myobjects queryset. And what's worse, Django will go back and get the items from the database each time they're accessed, even if we've already got them for the same object in the same view or template. The queries quickly mount up. So how can we optimise this?

The answer is to get all the related objects at once, for the entire queryset, then cache each object's related objects in a hidden attribute. We can do this by sorting the objects once we've got them into a dict, keyed by the id of their parent object:

qs = MyRelatedObject.objects.all()
obj_dict = dict([(obj.id, obj) for obj in qs])
objects = MyObject.objects.filter(myrelatedobj__in=qs)
relation_dict = {}
for obj in objects:
    relation_dict.setdefault(obj.myrelatedobj_id, []).append(obj)
for id, related_items in relation_dict.items():
    obj_dict[id]._related_items = related_items

Now each MyRelatedObject instance in qs has a _related_items attribute, containing all the MyObject items in its reverse relationship. Obviously, since Django doesn't know about this, the only way to get the items is to explicitly iterate through _related_items rather than myobject_set.all in the template. And if you need extra filtering, you need to do it in the view where you first get the objects, since the resulting attribute isn't a queryset and can't be filtered.

There's quite a bit of looping etc in this snippet, so you should probably profile carefully to ensure this isn't actually more expensive than just going back to the database. But I've found that this is fairly efficient, and saves a lot of database access.
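
To get a feel for the saving, here's a toy comparison in plain Python. FakeDB is a hypothetical stand-in that counts calls instead of running SQL, and the rows are made up; it only counts the lookups for the related objects, not the initial query for the parents.

```python
class FakeDB(object):
    """Hypothetical stand-in that counts 'queries' instead of running SQL."""

    def __init__(self, children):
        self.children = children      # list of (child_name, parent_id)
        self.queries = 0

    def children_of(self, parent_id):
        self.queries += 1
        return [name for name, pid in self.children if pid == parent_id]

    def children_of_many(self, parent_ids):
        self.queries += 1
        return [(name, pid) for name, pid in self.children
                if pid in parent_ids]


parent_ids = [1, 2, 3]
rows = [('a', 1), ('b', 1), ('c', 3)]

# Naive: one lookup per parent.
naive_db = FakeDB(rows)
naive = dict((pid, naive_db.children_of(pid)) for pid in parent_ids)

# Dictionary trick: one lookup, then group in Python by parent ID.
bulk_db = FakeDB(rows)
grouped = dict((pid, []) for pid in parent_ids)
for name, pid in bulk_db.children_of_many(parent_ids):
    grouped[pid].append(name)
```

Both approaches produce the same mapping, but the naive one makes a lookup per parent while the grouped one makes a single lookup.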

SSH and Mac OSX Terminal

2010-01-07T11:47:14+00:00

I like the Mac as a development environment most of the time, but occasionally some things annoy me.

One of these niggles is the way that the tab title in Terminal changes when you SSH to an external server, but doesn't change back when you close the connection. So you end up with tabs that claim to be connected to a server, but aren't.

The culprit seems to be SSH itself. Here's my solution: a shell script that runs SSH and then sets the tab title back to the default "Terminal".

#!/bin/sh
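# "$@" keeps arguments containing spaces intact
ssh "$@"
# printf interprets the escape sequences consistently across shells
printf '\033]0;Terminal\007'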

I've saved this to ~/bin/sshp, and made it executable, so now I just type sshp myserver instead of ssh. A further step would be to alias it back to ssh in .bash_profile with alias ssh=sshp.

Vim taglist and Django

2009-12-26T12:33:41+00:00

Inspired by the graphical cheat sheet here, I've recently moved over to Vim as my main development environment.

After installing a whole range of plugins, I found that one of them, taglist, no longer worked with my Django code. The reason was that something was changing the filetype of Django modules to 'python.django', and taglist - unlike most other plugins - was trying to match against the whole filetype, rather than just a part of it.

My solution is to hack taglist so that it does a partial match on the filetype. In the Tlist_Get_Buffer_Filetype function (line 984), change

let buf_ft = getbufvar(a:bnum, '&filetype')

to

let buf_ft = split(getbufvar(a:bnum, '&filetype'), '\.')[0]

Showing queries in Haystack

2009-12-26T10:20:33+00:00

At work we've been using Haystack to manage our site search, with a Solr backend.

As usual, we're customising things quite a lot - using faceted queries and weighted indexes, and bypassing the built-in search forms - so I wanted to be sure, in line with my general obsession with query efficiency, that we weren't generating multiple Solr queries for every search.

Haystack does log queries for every request internally, but as far as I can tell there's no way of getting to that information without writing some custom code to import and expose the relevant variable. So I've written a (very basic) panel for the Django debug toolbar which does just that.

Just put this somewhere on your pythonpath or in your project, and add it to the DEBUG_TOOLBAR_PANELS list in settings.py.

Django patterns: memoizing

2009-12-20T17:34:27+00:00

One of the things I wanted to do with this blog was to cover some of the design patterns I've discovered/come across/stolen over the years I've been working with Django. So this is the first in what I hope will be a long-running series on Django patterns.

Memoizing is the process by which a complicated or expensive function is replaced by a simpler one that returns the previously calculated value. This is a very useful thing to do in a complicated model, especially in cases where methods like get_absolute_url are calculated via a series of lookups on related models. Frequently I've found myself calling one of these methods on the same object several times within a view or template, leading to a huge number of unnecessary database calls.

It's very easy to do this manually - the method simply needs to check whether the cached value already exists, if not calculate it and store it somewhere, then return the cached value:

def get_expensive_calculation(self):
    if not hasattr(self, '_expensive_calculation'):
        self._expensive_calculation = do_expensive_calculation()
    return self._expensive_calculation

Here the cache lives within the instance itself. For the way I use it, this is useful: instances are created and destroyed within a single request/response cycle, so the cache dies with the object at the end of that process, and I don't need to worry about invalidating the cache if the value subsequently changes. Naturally, you could use Django's cache framework here - you'd need to create a unique key somehow, perhaps using the model name and pk as a prefix - but otherwise it would work pretty much the same way.

However, it's a bit of a pain having to write this same boilerplate each time you want to memoize something, so I wanted to write a decorator that would do it, which I could simply apply to a model method to get it to automatically cache the result. There are various memoizing decorators out there, but they mostly suffer from two problems: either they only work on plain functions, rather than methods, or they create a global cache, which would lead to a memory leak as the value would be kept even though the instance had gone out of scope.

So here's my version:

def memoize_method(func):
    key = "__%s" % func.__name__
    def inner(self, *args, **kwargs):
        if not hasattr(self, key):
            setattr(self, key, func(self, *args, **kwargs))
        return getattr(self, key)
    return inner

This is pretty simple in the end. The decorator uses the name of the function it's decorating to create a key, and when it's called it is passed 'self', so it checks if that key exists on that object and either creates or returns it.
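To see it in action, here's the decorator again applied to a made-up Page class (the class, its slug attribute and the lookups counter are purely for illustration; in a real model the method body would be hitting the database):

```python
def memoize_method(func):
    key = "__%s" % func.__name__
    def inner(self, *args, **kwargs):
        if not hasattr(self, key):
            setattr(self, key, func(self, *args, **kwargs))
        return getattr(self, key)
    return inner


class Page(object):
    def __init__(self, slug):
        self.slug = slug
        self.lookups = 0  # counts how often the real work is done

    @memoize_method
    def get_absolute_url(self):
        self.lookups += 1
        return "/pages/%s/" % self.slug


page = Page("about")
first = page.get_absolute_url()   # does the work and stores the result
second = page.get_absolute_url()  # served from the instance attribute
```

After both calls page.lookups is still 1, and a second Page instance gets its own separate cached value.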

One potential problem with this is that it doesn't take any account of the method's arguments: after the first call, it will always return the same value even if called again with completely different arguments. Most of the time, this won't be a problem: since the cache only persists for a single request, you're most likely to be calling it with the same arguments each time. But it's fairly simple to extend the caching mechanism to use parameters within the key:

def memoize_method_with_params(params):
    def wrap(func):
        key = "__%s__%s" % (func.__name__, '__'.join(['%s:%%(%s)s' % (a, a) for
                                                      a in params]))
        def inner(self, *args, **kwargs):
            actual_key = key % kwargs
            if not hasattr(self, actual_key):
                setattr(self, actual_key, func(self, *args, **kwargs))
            return getattr(self, actual_key)
        return inner
    return wrap

This time, since the decorator itself takes arguments, you need to use the double-wrap method: the outer function is called on definition, and it returns the decorator function, which itself contains the inner wrapped function. The algorithm to calculate the key looks complex, but is actually just creating a string in the form __funcname__key1:%(key1)s__key2:%(key2)s, which will use the dictionary string interpolation method to include the actual values when the function is called. Note that because the key is interpolated from kwargs, the named parameters must be passed as keyword arguments. (One issue, left for the reader to correct: params must be a list or tuple; if passed a string it will fail.)
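Here's a usage sketch, along with one possible way of dealing with the string case: wrap a bare string in a list before building the key. The Archive class and its numbers are invented for the example, and the isinstance check is my suggestion rather than the only fix.

```python
def memoize_method_with_params(params):
    # One possible fix: treat a bare string as a single parameter name.
    if isinstance(params, str):
        params = [params]

    def wrap(func):
        key = "__%s__%s" % (func.__name__,
                            '__'.join('%s:%%(%s)s' % (a, a) for a in params))

        def inner(self, *args, **kwargs):
            # Raises KeyError if a named parameter isn't passed as a keyword.
            actual_key = key % kwargs
            if not hasattr(self, actual_key):
                setattr(self, actual_key, func(self, *args, **kwargs))
            return getattr(self, actual_key)
        return inner
    return wrap


class Archive(object):
    """Hypothetical object with a per-year lookup worth caching."""

    def __init__(self):
        self.queries = 0

    @memoize_method_with_params('year')
    def post_count(self, year=None):
        self.queries += 1
        return {2008: 12, 2009: 30}.get(year, 0)


archive = Archive()
archive.post_count(year=2009)  # computed, stored as "__post_count__year:2009"
archive.post_count(year=2009)  # cached
archive.post_count(year=2008)  # different key, so computed again
```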

Although this is pretty nice, I can't help feeling that I should be using descriptors to do this. Inspired by a posting by Marty Alchin and one by Ian Bicking, I attempted to make this work, but I unfortunately drew a blank - the problem is that only the __get__ method has access to the instance, where the cache needs to be stored, but that needs to be available in __call__ somehow. One possible solution would be to have __get__ return another descriptor itself, but that seems like overkill for this.
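For what it's worth, one way the descriptor idea might be made to work is to have __get__ hand back a callable with the instance already bound, using functools.partial. This is only a sketch under that assumption, not the approach from either of the posts mentioned, and the names memoized_method and Thing are made up.

```python
import functools


class memoized_method(object):
    def __init__(self, func):
        self.func = func
        self.key = "__%s" % func.__name__

    def __get__(self, instance, owner=None):
        if instance is None:
            # Accessed on the class rather than an instance.
            return self
        # Bind the instance now, so the eventual call knows where the cache lives.
        return functools.partial(self._call, instance)

    def _call(self, instance, *args, **kwargs):
        if not hasattr(instance, self.key):
            setattr(instance, self.key, self.func(instance, *args, **kwargs))
        return getattr(instance, self.key)


class Thing(object):
    def __init__(self):
        self.calls = 0

    @memoized_method
    def value(self):
        self.calls += 1
        return 42


thing = Thing()
thing.value()
thing.value()  # second call returns the stored result
```

Whether this is any clearer than the plain closure version is debatable: it does the same job with more machinery.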

South migrations with MPTT

2009-12-08T06:37:35+00:00

We've been using django-MPTT at work for quite a while. It's a great way to manage hierarchical data in a read-efficient way, and we use it heavily in our CMS application. I'll definitely be talking about it further in future posts.

Recently we moved our database migrations from our defunct dmigrations project to Andrew Godwin's wonderful South application. One of South's best features is the ability to 'freeze' the ORM within each migration, so that you can manipulate the db via the familiar Django syntax rather than having to deal with raw SQL.

However, we ran into a problem when trying to use this to add new instances to a model that uses MPTT. We're actually using Ben Frishman's fork of django-mptt, which he wrote while he was working for us this summer. This has a base model class that defines all the MPTT fields and methods, rather than monkey-patching them in as the original version does.

The issue was that the frozen ORM only includes the basic fields that are defined on the actual model. This led to trouble when inserting a new object, especially when it's in the middle of an existing tree. MPTT includes values which identify an item's place in its tree, and when a new object is inserted most of the elements in the tree have to be updated to reflect the new positioning. django-mptt normally deals with all the SQL changes necessary, but this wasn't happening within a migration, because the dynamically-created model wasn't inheriting the correct models and fields.
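To give an idea of why so many rows need touching, here's a much-simplified sketch of the left/right renumbering behind this kind of tree. It is not django-mptt's actual code (which works in SQL and also tracks things like tree id and level); the function name and the dictionary layout are just for illustration.

```python
def insert_last_child(nodes, parent_name, new_name):
    """nodes maps a name to [left, right]; add new_name as parent's last child."""
    parent_right = nodes[parent_name][1]
    # Everything at or to the right of the insertion point shifts along by 2
    # to open up a gap for the new node's own left/right pair.
    for bounds in nodes.values():
        if bounds[0] > parent_right:
            bounds[0] += 2
        if bounds[1] >= parent_right:
            bounds[1] += 2
    nodes[new_name] = [parent_right, parent_right + 1]
    return nodes


tree = {
    'root': [1, 6],
    'a': [2, 3],
    'b': [4, 5],
}
insert_last_child(tree, 'a', 'a1')
# tree is now: root [1, 8], a [2, 5], a1 [3, 4], b [6, 7]
```

Adding a single leaf under 'a' changes the values on root, a and b as well, which is exactly the bookkeeping that gets skipped if the frozen model doesn't know about MPTT.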

The answer turned out to be simple, although it is undocumented. The frozen ORM definitions are stored in each migration as a nested dictionary. Each model is a key in the top level dictionary, whose value is a dictionary containing the field name/definitions as keys/values. However, in the sub-dictionaries, along with the field definitions, you can also store Meta definitions, including a South-specific extension: _bases, which defines the model base to inherit from. For example:

{
    'categories.category': {
        'Meta': {'unique_together': "(['slug', 'parent'],)", '_bases': ('mptt.models.Model',)},
        'id': ('django.db.models.fields.AutoField', [], {'primary_key': 'True'}),
        'name': ('django.db.models.fields.CharField', [], {'max_length': '50'}),
        'parent': ('django.db.models.fields.related.ForeignKey', [], {'blank': 'True', 'related_name': "'children'", 'null': 'True', 'to': "orm['categories.Category']"}),
        'slug': ('django.db.models.fields.CharField', [], {'max_length': '50'}),
    }
}

This ensures that the frozen category model inherits from mptt.models.Model, and gains all the special MPTT magic.

Customising Mingus, part 2

2009-12-05T23:00:04+00:00

This is intended to be primarily a technical blog, so I was keen to get the presentation of code snippets correct. I'm a - shall we say - fairly frequent answerer on StackOverflow, and I've got used to their Markdown-enabled edit box. Luckily, the Mingus basic-blog application allows a choice of markup for body text, and even defaults to Markdown. But as always there were quite a few things to improve.

Firstly, I do like StackOverflow's dynamic WYSIWYG preview of the marked-up copy. Although Markdown syntax is quite simple, it's easy to get it wrong - using a three-space indent rather than four for code, for example. An instant preview just underneath the text entry field in the admin form is very useful. SO does it using the showdown.js library, which is part of their port of the 'what you see is what you mean' markdown editor, WMD.

It was as easy to integrate the whole of WMD as just the preview, by adding a mingus\admin.py like this:

from django import forms
from django.conf import settings
from django.contrib import admin
from django.utils.safestring import mark_safe
from basic.blog.models import Post
from basic.blog.admin import PostAdmin

class WMDEditor(forms.Textarea):

    def __init__(self, *args, **kwargs):
        attrs = kwargs.setdefault('attrs', {'class':'vLargeTextField'})
        super(WMDEditor, self).__init__(*args, **kwargs)

    def render(self, name, value, attrs=None):
        rendered = super(WMDEditor, self).render(name, value, attrs)
        return rendered + mark_safe(u'''
            <div id='wmd-container'>
            <div id='wmd-button-bar'></div>
            <div id='wmd-preview'></div>
            <script type="text/javascript">
            wmd_options = {
                output: "Markdown",
                buttons: "bold italic | link blockquote code image | ol ul"
            };
            </script>
            <script type="text/javascript" src="%sstatic/js/wmd.js"></script>
            </div>''' % settings.MEDIA_URL)

class PostForm(forms.ModelForm):
    body = forms.CharField(widget=WMDEditor)
    class Meta:
        model = Post

class WMDPostAdmin(PostAdmin):
    form = PostForm

    class Media:
        css = {
            "all": ("static/css/wmd.css",)
        }
        js = ("static/js/showdown.js",)

admin.site.unregister(Post)
admin.site.register(Post, WMDPostAdmin)

Because Mingus already does some Javascript on the Post admin to add the 'body inlines' section under the main textbox, I've made the WMD button bar appear underneath that, on top of the preview, instead of on top of the actual textarea. A bit weird, but it does work - it's not as if I use it all the time, anyway. This no doubt breaks if you use another markup language, but I always use Markdown, so no problem there.

So, from markup to syntax highlighting. Mingus is, unfortunately, a bit confusing here. Partly this is a result of Kevin's desire to integrate as many standalone applications as possible, and only write the minimum of glue code. However, this means that there are several applications that potentially supply markup functionality, and it confused me for quite a while. These include the django-extensions app, which includes the syntax_color templatetag; and django-sugar, which includes the pygment_tags library.

However, the basic django-blog app actually deals with markup and highlighting itself already. On saving a post, the markup is translated into HTML and saved in a body_markup field, thanks to the django-markup app. What I didn't realise is that django-markup already runs the formatted text through pygments to add the highlighting. The reason I didn't realise this is that pygments turns out not to be very clever in guessing the code language. If you don't tell it explicitly, it doesn't do anything. In the absence of a hard-coded hint, its attempt to guess the language is limited to looking at the first line of the code, where it hopes to see a pseudo-shebang line:

...
#! python
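As a rough illustration of that sort of first-line check (this is my own sketch, not the code pygments or django-markup actually use, and split_language_hint is a made-up name):

```python
import re

# Matches a first line such as "#! python" or "#!python"
_SHEBANG = re.compile(r'^#!\s*(\w+)\s*$')


def split_language_hint(code):
    """Return (language, remaining_code); language is None without a hint."""
    lines = code.split('\n')
    match = _SHEBANG.match(lines[0])
    if match is None:
        return None, code
    return match.group(1), '\n'.join(lines[1:])


language, body = split_language_hint("#! python\nprint('hello')")
```

With no such line at the top there is nothing to go on, which matches what I was seeing: no hint, no highlighting.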

Once I started doing that, highlighting worked as expected (although there were some minor CSS issues - on some browsers the font used for pre was far too big). This also meant I could remove the call to the django-sugar pygmentize filter that mingus has for some reason added to all the blog templates.

I can't help feeling the proliferation of markup/highlighting code within mingus is a bit silly. I only realised in writing this that there is actually yet another place where highlighting could take place, as the Markdown library itself has an extension to call pygments (although presumably django-markup prefers to do this explicitly because other markup libraries don't have this extension).

There's one issue that remains unresolved. As well as the now-removed pygmentize filter, mingus also runs blog content through render_inlines, which allows insertion of arbitrary Django model content within a blog post. However, for some reason this removes all the indentation from code blocks - obviously not very useful when posting Python. I'm not using the inlines at the moment anyway, so I've removed them from the template until I can work out what's going on.

Other than that, everything works and the blog is now ready to use.

Cambridge Stack Overflow dev day

2009-10-31T06:28:45+00:00

I don't go to a lot of tech conferences - family life tends to make getting away for any length of time fairly difficult. So originally I ignored the banners advertising the Stack Overflow DevDays, thinking I wouldn't be able to make it anyway. But when my employer arbitrarily changed the rules over how much holiday I'm allowed to carry forward into next year, I ended up with a couple of days in hand - and a conversation with a co-worker convinced me to go at the last minute. After a comedy of errors regarding the last available ticket for the London event, I finally managed to snap up a ticket for the Cambridge day.

Since this was a Stack Overflow conference, it wasn't surprising that the keynote was by Joel Spolsky. It was preceded by a mildly amusing short film where he satirised his 'treat developers right' reputation by pretending to be a cross between an autocratic boss and a sadistic PE teacher, which was funny enough but slightly pointless. The talk itself was good: it was about the tension between the 'simplicity is everything' attitude of firms like 37 Signals, versus the undeniable fact that people want features, as evidenced by the way FogBugz' sales went up every time they added more features.

Spolsky is an entertaining speaker and I enjoyed the talk, even if there wasn't a particularly coherent take-home message: he was trying to say that you should only give people options for things that are actually important, but the whole point is that what's not important to one user is vital for another, which is why software like Microsoft Word ends up with so many hundreds of options.

Next up was Christian Heilmann talking about Yahoo! Developer Tools. Now this was really interesting - something I haven't had a chance to play with at all, but definitely will in the future. Yahoo has put together a very nice way of querying any of their APIs via REST with a simple SQL-like language, YQL. What's more, it's possible to submit your own data sources which can be linked up via an XML translation table and made available for everyone to query via YQL. Carrying that forward, you can write mini-applications in Javascript that use any of these APIs and soon you'll be able to offer these to be installed on users' Yahoo home pages in much the same way as Facebook apps. I must admit my heart did sink a bit when Christian mentioned the customised markup language, after too much time wrestling with FBML, but it's an exciting possibility.

After a short break, next up was Cambridge University's Frank Stajano. This talk was ostensibly about computer security, and specifically what we can learn from fraudsters to make our systems more secure. But he's a fan of the BBC3 programme The Real Hustle, a hidden-camera show where members of the public are conned in various ways, and he's done various bits of research analysing the cons from the programme and relating them to systems security. So the format of the lecture was to show us various clips from the show, then a couple of slides which were supposed to tell us how this type of con was used in computer terms and how we could avoid it. However, it didn't really achieve that - the links to computer security were not well explained, and although the talk was quite fun I didn't feel I learned much.

Next was Joel again, talking about FogBugz. Now I know you have to expect this sort of thing at conferences (especially at Carsonified ones, or so I'm told), but I actually object to paying to sit through an hour of sales pitch, however entertainingly delivered it is. FogBugz looks like a perfectly competent product, but I didn't see anything that made it shine over a product like Jira, or even particularly over the open-source Redmine that we use these days at work. Plus the demo included a couple of screens that clearly violated the principle Joel had pushed earlier of only giving options where they made a difference.

Lunch, followed by Steven Sanderson on ASP.NET-MVC. I actually found this fairly good - despite my complete lack of interest in any Microsoft technology, I'm not actually hostile, so I paid enough attention to find out what they were doing in this area. As the speaker freely admitted, .NET MVC is quite obviously ripped off from Ruby on Rails. It does offer some nice ways of doing things, but is missing a lot of the things that Django and Rails do - no ORM, for example, because it relies on LINQ; and no real templating system, because you just use standard ASP files. So nothing amazingly revolutionary, except if you're a Microsoft fanboy who's totally unaware of what the wider world is doing, but still good to see that Microsoft is learning things and giving its developers some alternatives. Best part: it's "open source", which in Microsoft language means "we're not going to accept your patches or anything, but you're free to fork it if you want". Great.

Next: Remy Sharp on jQuery. A deeply disappointing talk. Ryan Carson introduced it by asking how many of the audience had used jQuery (about half) and how many considered themselves experts (a handful), telling the latter that they may as well get a cup of coffee. In fact, that whole half of the audience should have done so: this was a very basic introduction, covering only the fundamentals. Remy is not a particularly fluent talker and this was not very well presented.

After another break, we had Michael Foord on Python. This was another fairly basic introduction - I had suspected I wasn't going to learn anything, but got my hopes up when Michael started off by talking about IronPython (he's the co-author of IronPython In Action). Unfortunately this was only a short digression, although it did look very cool (instantiating a Windows dialog from the IronPython console...) and the rest of the talk was a run-through of a clever little spellchecker in 40-odd lines of Python. This was all well and good, but the code wasn't anything particularly special to Python - you could have done it in any of a dozen other languages in about the same number of lines - and it didn't cover any of Python's cooler features. If I'd never dabbled in Python, I don't think this would have been enough to whet my appetite.

Finally, Jeff Atwood talking about Stack Overflow. This was only a short talk, where Jeff spoke about the reasons he and Joel had set up the site, what he hoped and hopes to achieve, and the sense of achievement he gets from it.

So, that was it for the talks. Free beer was offered in a bar in town, but unfortunately those family obligations raised their heads again and I had to drive home.

Overall, a good day. I had about a 50% hit rate on interesting talks, which I suppose is fairly good going, and I did get a chance to meet some new people. It was a shame that most of the talks slightly overran, leaving almost no time for questions.

One surprising thing was that the day wasn't very well integrated with Stack Overflow. I had at least expected us to get preprinted badges showing our SO username and reputation scores, but no such luck. And when Carson asked the audience at one point who thought they had the highest rep, I didn't put my hand up, assuming my 9,000 points would be average in this crowd. But when he tried to work it out, starting by asking who had 1,000 points, who had 1,500, etc, I soon found I did indeed have by far the highest rep - the next highest put his hand down at about 2,500. Made me feel slightly sad (which I am, of course). A shame that I missed the chance to parlay my brief moment of fame into something more long-lasting by skipping the drinks.

On the whole, I'm glad I went, and if nothing else it's convinced me I need to try to go to more of this sort of thing.

Customising Django-Mingus

2009-10-04T07:30:28+01:00

This blog is built using Kevin Fricovsky's excellent django-mingus project, which is mainly a set of standard pre-existing reusable apps with some templates and a bit of glue to hold it together.

Although it's quite usable out of the box, I found - inveterate hacker that I am - that there were several things that I didn't quite like in the project as it was. So I changed them (isn't open source great, laydees-n-genelmen). At some point I'll fork the project on github and upload the changes, but for now here's what I've done.

Firstly, mingus forsakes Django's built-in comments framework for the external Disqus project. I didn't really fancy signing up for another service - especially as I'm not expecting vast numbers of comments on this blog. It's quite a simple matter to reinstate the comments - the relevant template code is included in the post_detail.html template included with the basic-blog app which mingus extends, so I just needed to copy and paste it into the mingus version. Then add (r'^comments/', include('django.contrib.comments.urls')), to urls.py, add django.contrib.comments to INSTALLED_APPS in settings.py, run a syncdb and it's all done.

There are however a couple of missing pieces here. basic-blog doesn't include templates for the comment preview and post confirmation, so you just get an unstyled white page. Simple to fix: add a comments directory with a base.html template as follows:

{% extends "base.html" %}
{% block content %}{% endblock %}

By default the post-confirmation page doesn't include a link back to the original object, leaving the user nowhere. So an overwritten posted.html in the same directory fixes that:

{% extends "comments/base.html" %}
{% load i18n %}    
{% block title %}{% trans "Thanks for commenting" %}.{% endblock %}    
{% block content %}
  <h2>{% trans "Thank you for your comment" %}.</h2>    
  <p><a href="{{ comment.get_content_object_url }}">Return to blog</a></p>
{% endblock %}

The last issue with comments was that there was no indication on the index page of how many comments each post had. This is a standard feature of blogs, and a bit surprising it wasn't there - perhaps it's a consequence of using Disqus. Anyway, the solution was to add the following to templates/proxy/includes/post_item.html:

{% load comments %}
{% if object.content_object.allow_comments %}
{% get_comment_count for object.content_object as comment_count %}
<div class="comment_count"><a href="{{ object.content_object.get_absolute_url }}#comments">{{ comment_count }} comment{{ comment_count|pluralize }}</a></div>
{% endif %}

I also added a style rule for the .comment_count class in base.css.

So much for comments. Now, layout. I couldn't help thinking that the default layout had the main area too narrow and the right-hand column too wide. Luckily the templates are based on the 960 Grid System CSS, so it was easy to change the central column to use the grid_11 suffix_1 classes, for a width of 11/16 and a gutter of 1/16, and the right-hand column to use grid_4.

The final issue was to do with markup - that was a bit more complicated, so I'll leave it to part 2.

The one where my friend the sysadmin kills me

2009-10-03T08:14:29+01:00

Warning: this entry is very much a matter of 'This isn't the right way to do it, but it works for me'.

For small projects that are in active development, I frequently have to deploy code changes to the live server. To make this as simple as possible for me, so I can concentrate on the coding, I tend to like running on a live checkout of the code directly from the repo.

I never really got this automated properly with svn, although no doubt it's a simple matter of setting up the right post-commit hooks. However, now I'm working mainly in git, and I thought it would be good if I could push straight from my local repo to the remote one, and automatically see the production code update.

It's fairly easy to set up a remote repository to push to - I followed the instructions here, which worked a treat. However, this wasn't helping with getting this code to auto-checkout and deploy itself. So I began experimenting, and what I came up with was this.

Firstly, instead of setting up a bare repo as recommended in those instructions, use a standard git init for your remote. If you now try and push to this, git will complain with a long message explaining that "Updating the currently checked out branch may cause confusion". It gives some tips about how to turn off that message, but we can avoid it altogether by using branches.

On the server, simply create and check out a live branch:

   git branch live
   git checkout live

Now, we just need a hook that pulls from master to live every time we push to master. The hook we need is called post-receive, and like all hooks it lives in .git/hooks. Here's mine:

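#!/bin/sh
# git writes "<old-sha> <new-sha> <refname>" to stdin, one line per updated ref;
# only the first line is read here
read params
# hooks run inside .git, so step up to the working tree
cd ..
# field 2 is the new commit hash; store it as the cache-busting version
echo "ASSET_VERSION = '`echo $params|cut -d " " -f2`'" > local_settings.py
# env -i clears GIT_DIR and friends so git acts on the checkout itself
env -i ~/bin/git reset --hard
env -i ~/bin/git pull
exec ~/webapps/mysite/apache2/bin/restart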

The two git commands simply ensure that the live branch has no local changes, and pull all changes direct from master - which in turn of course has been updated directly from my development machine.

The rest is me trying to be even cleverer. I wanted an automatic cache-busting mechanism to stop my javascript being cached while in development. So I have a simple local_settings.py file which defines a value which is appended to the querystring of all my asset urls. The hook updates this automatically - it is passed the hash of the current commit, so it reads the parameters (which is far more difficult in bash than it needs to be, by the way), extracts the hash, and writes it to local_settings.py.
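For comparison, here's roughly what the same parsing and the cache-busting might look like in Python. It's a sketch only: the function names and the v querystring parameter are made up for the example, and the sample hashes are dummies.

```python
def parse_post_receive_line(line):
    """Split one 'old-sha new-sha refname' line as fed to a post-receive hook."""
    old_rev, new_rev, ref_name = line.split()
    return old_rev, new_rev, ref_name


def settings_line(new_rev):
    # The text the hook writes into local_settings.py
    return "ASSET_VERSION = '%s'\n" % new_rev


def versioned_url(url, version):
    # Append the version to the querystring, respecting any existing one.
    separator = '&' if '?' in url else '?'
    return "%s%sv=%s" % (url, separator, version)


sample = "1111111 2222222 refs/heads/master"
old_rev, new_rev, ref_name = parse_post_receive_line(sample)
line_for_settings = settings_line(new_rev)
script_url = versioned_url("/media/js/site.js", new_rev)
```

Because the version changes with every push, browsers see a new URL for each deployment and fetch the assets afresh.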

The final step is to restart Apache, and we're laughing.

Now, no doubt there are much better ways of doing this. But like I say, it works for me.