MOTD

Message Of The Day

Fri, 22 May 2009

14:24 [zork(~)] cat introducing-yardbird.txt

Yardbird

Ladies and Gentlemen a One Mister Charles Parker, Jr.

An IRC channel I consider my "home" channel is coming up on its tenth anniversary in a few months, and its founders have begun to reflect on how far we've really come in a decade. By far the biggest disappointment is that our beloved and snarky bot still runs a rickety hacked-up version of Kevin Lenzo's 1990s classic, Infobot.

We've looked into replacements in the past, but they always seemed like mere incremental improvements. They'd provide some degree of reliability and a slightly saner codebase, but they generally add little that appeals to us. Often they're written in Perl, like Infobot, which is something we're trying to move away from (for no reason beyond the fact that nobody in the channel is comfortable maintaining Perl code any more).

So when I finally found a bot that offered real gains in fun over Infobot, I leapt at the chance to put it through its paces, despite its being written in Perl. For various other reasons it turned out not to be fit for purpose, but looking at the code led me to an inspiration.

How Not To Write A Bot

The bot in question was written with POE, which appears to be Perl's answer to Twisted Python. POE gives you lots of library functions and objects that allow you to do things asynchronously by firing off events and registering callbacks for when something finishes. This is somewhat important, so that you don't say "Bot, do this five-minute thing" and have it fall off the network because it hadn't returned to the protocol code for five minutes.

But this bot was written in a very hasty JFDI sort of scripting style. I'm told that the real marvel was how quickly it was brought up and running, and I don't think it's fair to judge the authors based solely on this work. However, there was something of a common antipattern throughout the command recognition code:

if ($msg =~ GREATBIGHOARYREGEX) {
        ...
} elsif ($msg =~ ANOTHERCRAZYREGEX) {
        ...

Dispatchers

While tracing through a printout of this thing in an attempt to even figure out what its features were, I thought "Gosh, wouldn't it be great if this had a dispatch mechanism where you could associate regexes with functions in some kind of data structure, along with some kind of application data for context?" And of course that immediately reminded me of...

http://docs.djangoproject.com/en/dev/topics/http/urls/
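
As a rough sketch of that idea (not code from the bot in question, nor from Yardbird), a dispatch table can be a list of (regex, function, context) tuples that gets walked in order. The handler names, patterns, and context dictionaries below are all made up for illustration:

```python
import re

# Hypothetical handlers; each takes the regex match and some context data.
def greet(match, context):
    return "%s, %s!" % (context["greeting"], match.group("who"))

def remember(match, context):
    context["facts"][match.group(1)] = match.group(2)
    return "OK, noted."

FACTS = {}

# The dispatch table: (compiled regex, handler, application data).
DISPATCH = [
    (re.compile(r"^hello (?P<who>\w+)$"), greet, {"greeting": "Hello"}),
    (re.compile(r"^(\w+) is (.+)$"), remember, {"facts": FACTS}),
]

def dispatch(msg):
    """Run the first handler whose regex matches msg, or return None."""
    for pattern, handler, context in DISPATCH:
        match = pattern.search(msg)
        if match:
            return handler(match, context)
    return None
```

Adding a command then means adding one row to the table rather than another branch to a growing if/elsif chain.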

I filed this little bit of inspiration away, and started worrying about how I'd go about cloning this system. I'd use Twisted Python, of course, but the asynchronous database library is on the same level as POE's, and I prefer a good ORM. Man, wouldn't it be great if my bot could use the Django ORM?

Asynchronicity

Of course, the Django ORM isn't built with a callback-based API, so while your code runs a query, that's all your program can do. This prompted me to wonder why that doesn't become a problem for Django Web apps. Surely some of them receive hundreds or thousands of requests per second, yet concurrency never becomes a problem that exposes itself to the app programmer.

The answer is that while Django has no support for asynchronous programming, the very model in which it operates assumes that it's being called from a Web server such as Apache. Apache has its own forking or threading model for handling lots of requests simultaneously, and CGI and WSGI give it a well-defined interface for passing connection information into a program and getting a response back out.

An IRC Apache

One morning on the Underground, I began to reason that what I needed was a sort of "IRC Bot Apache" that would handle incoming IRC events of various sorts (PRIVMSG, ACTION, etc.), and then pass them along with some connection information to my Django code. I'd dispatch these messages through the regex-based patterns() system, then call view code that uses ORM objects to perform queries.

The obvious choice for implementing something like this is Twisted Python, which shows its age but remains the de facto Python library for coding state machines. I'm only passingly familiar with the system (and the documentation is filled with distracting Java-esque Software Engineering babble for some bizarre reason), but I was able to localize the actual Twisted-using code to one function, which at the very least makes it simple to hand off to experts to tell me if I'm doing anything stupid.

Yardbird

Digging through the Twisted documentation, I found their example LogBot and based mine loosely on that pattern, subclassing irc.IRCClient and replacing the privmsg method along the following lines:

from twisted.words.protocols import irc
from twisted.internet import defer, threads
from django.core import urlresolvers
from django.conf import settings

class DjangoBot(irc.IRCClient):
    @defer.inlineCallbacks
    def privmsg(self, user, channel, msg):
        resolver = urlresolvers.get_resolver('.'.join(
                                (settings.ROOT_MSGCONF, 'privmsg')))
        request = dict(user=user, channel=channel, msg=msg,
                                settings=settings)
        callback, args, kwargs = resolver.resolve('/' + request['msg'])
        response = yield threads.deferToThread(callback, request,
                                *args, **kwargs)
        defer.returnValue(self.notice(response['recipient'],
                                response['data'].encode('UTF-8')))

Let me go through that line by line:

@defer.inlineCallbacks
def privmsg(self, user, channel, msg):

The inlineCallbacks decorator essentially catches any yield of a Twisted deferred object and schedules the next delve into the generator using standard Twisted deferred-execution mechanisms. So now yield really behaves like a scheduler yield, and you can let some more critical IRC-parsing code run between your own calls.
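
To get a feel for what the decorator is doing, here's a toy trampoline in plain Python. It is emphatically not Twisted's implementation: it uses concurrent.futures futures in place of deferreds, it blocks on each one rather than returning to an event loop, and it doesn't pass exceptions back into the generator. The names run_inline, slow_double, and compute are invented for this sketch:

```python
from concurrent.futures import ThreadPoolExecutor

def run_inline(generator_function):
    """Toy stand-in for inlineCallbacks: drive a generator that yields futures."""
    def wrapper(*args, **kwargs):
        gen = generator_function(*args, **kwargs)
        result = None
        try:
            while True:
                # Resume the generator, handing back the previous result,
                # and collect the next future it yields.
                future = gen.send(result)
                # A real scheduler would return to its event loop here
                # instead of blocking on the future.
                result = future.result()
        except StopIteration as stop:
            # The generator's return value travels in StopIteration.
            return stop.value
    return wrapper

def slow_double(x):
    return x * 2

pool = ThreadPoolExecutor(max_workers=1)

@run_inline
def compute(x):
    doubled = yield pool.submit(slow_double, x)
    quadrupled = yield pool.submit(slow_double, doubled)
    return quadrupled
```

The point is the shape of compute: it reads like straight-line code, but every yield is a spot where something else gets to decide when it resumes.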

Next we build a Django urlresolver object so we can dispatch regexes to handler functions, using some Django settings info to determine path info:

from django.core import urlresolvers
from django.conf import settings

resolver = urlresolvers.get_resolver('.'.join(
                        (settings.ROOT_MSGCONF, 'privmsg')))

Then we build a request dictionary. In normal Django this would be an HttpRequest object, containing all sorts of information about the web server and the remote client and the HTTP request itself. Since this is a quick-and-dirty example, I've reduced this to a dict for simplicity. I also passed in the settings namespace just to be lazy (so I can keep things like nickname in there):

request = dict(user=user, channel=channel, msg=msg, settings=settings)

Now we use our resolver to test the incoming message against all the patterns in privmsg.py. It returns the appropriate function, along with all of the anonymous and named matches generated by the winning regular expression. Note that we have to prepend a / to our message to appease the URL-centric resolver:

callback, args, kwargs = resolver.resolve('/' + request['msg'])
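
To make the shape of that return value concrete, here's a standard-library imitation of what the resolver hands back. This is not Django's code: the resolve function, the PATTERNS list, and the learn and roll handlers are hypothetical. I'm assuming the Django behaviour that named groups become keyword arguments, and that anonymous groups become positional arguments only when a pattern has no named groups:

```python
import re

def learn(request, key, value):
    return "%s told me %s is %s" % (request["user"], key, value)

def roll(request, sides):
    return "rolling a d%s" % sides

# Hypothetical pattern list, in the spirit of a Django URLconf.
PATTERNS = [
    (re.compile(r"^(?P<key>\w+) is (?P<value>.+)$"), learn),
    (re.compile(r"^roll d(\d+)$"), roll),
]

def resolve(path):
    """Return (callback, args, kwargs) for the first matching pattern."""
    if not path.startswith("/"):
        raise LookupError("path must start with '/'")
    rest = path[1:]  # the root of the resolver consumes the leading slash
    for pattern, callback in PATTERNS:
        match = pattern.search(rest)
        if match:
            kwargs = match.groupdict()
            # Named groups win; anonymous groups are only used without them.
            args = () if kwargs else match.groups()
            return callback, args, kwargs
    raise LookupError("no pattern matches %r" % rest)
```

With this in place, resolve('/' + 'roll d20') hands back the roll function, the positional arguments ('20',), and an empty kwargs dict, ready to be called as callback(request, *args, **kwargs).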

Finally we get to the deferred execution magic! We have a function, a request object, and some arguments made from textual analysis of the message. We use the threads.deferToThread function to generate a deferred object whose work runs in a separate thread, and yield it up to our inlineCallbacks decorator to be scheduled:

response = yield threads.deferToThread(callback, request, *args,
                                       **kwargs)

Our view function then runs in the background, taking as long as it likes while our bot concerns itself with answering PINGs and dispatching further events to the resolver.
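
The same trick exists outside Twisted. As a rough analogue (a swapped-in technique, not what Yardbird does), here's the standard library's asyncio pushing a blocking call onto a worker thread with run_in_executor while a heartbeat coroutine keeps ticking. The names slow_view and heartbeat, the channel name, and the timings are all made up for the demonstration:

```python
import asyncio
import time

def slow_view(request):
    """A blocking 'view': stands in for a slow ORM query."""
    time.sleep(0.2)
    return {"recipient": request["channel"], "data": "done"}

async def heartbeat(ticks):
    """Stands in for the protocol code that must keep running."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)

async def main():
    ticks = []
    beat = asyncio.ensure_future(heartbeat(ticks))
    loop = asyncio.get_running_loop()
    # Run the blocking call on the default thread pool; awaiting it
    # lets the heartbeat keep running in the meantime.
    response = await loop.run_in_executor(None, slow_view, {"channel": "#jazz"})
    beat.cancel()
    return response, len(ticks)

response, tick_count = asyncio.run(main())
```

If slow_view were called directly inside main, the heartbeat would get no chance to tick until the call returned, which is exactly the falling-off-the-network problem.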

We're confident that the Django code is reasonably thread-safe, as it has to handle concurrency under a variety of Web server models (such as Apache's Worker MPM or a traditional Prefork model). Once the function returns a value, the worker thread goes back to the pool and execution comes back to this method again, chucking the returned value into our response variable.

We're almost done, but we still need to actually do something with this information! In ordinary HTTP Django this would be an HttpResponse object, containing all sorts of information on what template to render and what dictionary to pass in as an extra context namespace. This is a bit overkill for this example, so I've simplified it to another dict:

defer.returnValue(self.notice(response['recipient'],
                        response['data'].encode('UTF-8')))

The various RFCs for IRC all state rather loudly that automated bots are meant to speak using NOTICE, but never auto-reply to NOTICEs from other sources. This is meant to prevent feedback loops flooding a channel. Also note that since this was the final statement of my inlineCallbacks function, I called defer.returnValue to spit back the result of the notice call. I'm not convinced that it was at all necessary, but I believe it's harmless boilerplate in the worst case.

Whoa, is that all?

My current version is obviously not exactly like this. I've been doing some rather wild, thrashing testing and debugging, so the code has picked up some mess and some refactorings. The above is intended only as a demonstration of the technology while I do my explorations in the yardbird bazaar tree.

For a start, my current version applies an errback function to log exceptions in the view function thread, and I've refactored the above code into a separate function that both the ACTION and PRIVMSG functions can use. The principle is still the same, though.

What's With The Name?

Because Django is named after Django Reinhardt, and was split from a CMS project named Ellington (as in, Duke), it's become traditional to name Django projects after jazz greats. For example, there's a popular Django e-commerce system named Satchmo. I dug around and couldn't find any named after Charlie Parker, or his nickname 'Bird'. I just decided to play it safe and use the rarer long form "Yardbird".

Where do I get Your Version?

My spazzy tree, complete with README.txt files for the Apache indexing and a hackish attempt at implementing some of the Infobot functionality, is up at http://zork.net/~nick/yardbird/. It's also a bazaar tree, so you can just:

bzr branch http://zork.net/~nick/yardbird/

Right now I'm trying to figure out what I should really do for the IrcRequest and IrcResponse objects, and how to package all this up like a proper professional project.

Your code sucks!

Soz.

