MOTD

Message Of The Day

Fri, 22 May 2009

14:24 [zork(~)] cat introducing-yardbird.txt

Yardbird

Ladies and Gentlemen a One Mister Charles Parker, Jr.

An IRC channel I consider my "home" channel is coming up on its tenth anniversary in a few months, and its founders have begun to reflect on how far we've really come in a decade. By far the biggest disappointment is that our beloved and snarky bot still runs a rickety hacked-up version of Kevin Lenzo's 1990s classic, Infobot.

We've looked into replacements in the past, but they always seemed like mere incremental improvements. They'd provide some degree of reliability and a slightly saner codebase, but they generally add little that appeals to us. Often they're written in Perl, like Infobot, which is something we're trying to move away from (for no reason beyond the fact that nobody in the channel is comfortable maintaining Perl code any more).

So when I finally found a bot that managed to have reasonable gains in terms of fun over Infobot, I leapt at the chance to put it through its paces despite being written in Perl. For various other reasons, it was not fit for purpose, although looking at the code led me to an inspiration.

How Not To Write A Bot

The bot in question was written with POE, which appears to be Perl's answer to Twisted Python. POE gives you lots of library functions and objects that allow you to do things asynchronously by firing off events and registering callbacks for when something finishes. This is somewhat important, so that you don't say "Bot, do this five-minute thing" and have it fall off the network because it hadn't returned to the protocol code for five minutes.

But the way this bot was written was done in a very hasty JFDI sort of scripting style. I'm told that the real marvel was how quickly it was brought up and running, and I don't think it's fair to judge the authors based solely on this work. However, there was something of a common antipattern throughout the command recognition code:

if ($msg =~ GREATBIGHOARYREGEX) {
        ...
} else if ($msg =~ ANOTHERCRAZYREGEX) {
        ...

Dispatchers

While tracing through a printout of this thing in an attempt to even figure out what its features were, I thought "Gosh, wouldn't it be great if this had a dispatch mechanism where you could associate regexes with functions in some kind of data structure, along with some kind of application data for context?" And of course that immediately reminded me of...

http://docs.djangoproject.com/en/dev/topics/http/urls/

I filed this little bit of inspiration away, and started worrying about how I'd go about cloning this system. I'd use Twisted Python, of course, but the asynchronous database library is on the same level as POE's, and I prefer a good ORM. Man, wouldn't it be great if my bot could use the Django ORM?

Asynchronicity

Of course, the Django ORM isn't built with a callback-based API, so while your code does a query that's all your program can do. This prompted me to wonder why that doesn't become a problem for Django Web apps. Surely they receive hundreds or thousands of requests per second, but concurrency never becomes a problem that exposes itself to the app programmer.

The answer is that while Django does not have any support for asynchronous programming, the very model in which it operates assumes that it's being called from a Web server such as apache. Apache has its own forking or threading model that it uses to handle lots of requests simultaneously, and the CGI or WSGI interfaces use a well-defined interface for passing connection information into a program and getting a response back out.

An IRC Apache

One morning on the Underground, I began to reason that what I needed was a sort of "IRC Bot Apache" that would handle incoming IRC events of various sorts (PRIVMSG, ACTION, etc.), and then pass them along with some connection information to my django code. I'd dispatch these messages through the regex-based patterns() system, then call view code that uses ORM objects to perform queries.

The obvious choice for implementing something like this is Twisted Python, which shows its age but remains the de facto Python library for coding state machines. I'm only occasionally familiar with the system (and the documentation is filled with distracting Java-esque Software Engineering babble for some bizarre reason), but I was able to localize the actual Twisted-using code to one function, which at the very least makes it simple to hand off to experts to tell me if I'm doing anything stupid.

Yardbird

Digging through twisted documentation I found their example LogBot and based it loosely on that pattern, subclassing irc.IRCClient and replacing the privmsg method along the following lines:

from twisted.words.protocols import irc
from twisted.internet import defer, threads
from django.core import urlresolvers
from django.conf import settings

class DjangoBot(irc.IRCClient):
    @defer.inlineCallbacks
    def privmsg(self, user, channel, msg):
        resolver = urlresolvers.get_resolver('.'.join(
                                (settings.ROOT_MSGCONF, 'privmsg')))
        request = dict(user=user, channel=channel, msg=msg,
                                settings=settings)
        callback, args, kwargs = resolver.resolve('/' + request['msg'])
        response = yield threads.deferToThread(callback, request,
                                *args, **kwargs)
        defer.returnValue(self.notice(response.recipient,
                                response.data.encode('UTF-8')))

Let me go through that line by line:

@defer.inlineCallbacks
def privmsg(self, user, channel, msg):

The inlineCallbacks decorator essentially catches any yield of a Twisted deferred object and schedules the next delve into the generator using standard Twisted deferred-execution mechanisms. So now yield really behaves like a scheduler yield, and you can let some more critical IRC-parsing code run between your own calls.

Next we build a Django urlresolver object so we can dispatch regexes to handler functions, using some Django settings info to determine path info:

from django.core import urlresolvers
from django.conf import settings

resolver = urlresolvers.get_resolver('.'.join(
                        (settings.ROOT_MSGCONF, 'privmsg')))

Then we build a request dictionary. In normal Django this would be an HttpRequest object, containing all sorts of information about the web server and the remote client and the HTTP request itself. Since this is a quick-and-dirty example, I've reduced this to a dict for simplicity. I also passed in the settings namespace just to be lazy (so I can keep things like nickname in there):

request = dict(user=user, channel=channel, msg=msg, settings=settings)

Now we actually use our resolver to test the incoming message against all our patterns in privmsg.py and return to us the appropriate function, along with all of the anonymous and named matches that were generated by the winning regular expression. Note that we have to prepend a / to our message to appease the URL-centric resolver:

callback, args, kwargs = resolver.resolve('/' + request['msg'])

Finally we get to the deferred execution magic! We have a function, a request object, and some arguments made from textual analysis of the message. We use the threads.deferToThread method to generate a deferred object that runs in a completely separate thread, and yield it up to our inlineCallbacks decorator to be scheduled:

response = yield threads.deferToThread(callback, request, *args,
                                       **kwargs)

Our view function then runs in the background, taking as long as it likes while our bot concerns itself with answering PING replies and dispatching further events to the resolver.

We're confident that the Django code is reasonably thread-safe, as it has to handle concurrency under a variety of Web server models (such as apache's Worker MPM or a traditional Prefork model). Once the function returns a value, the thread closes and execution comes back to this method again, chucking the returned value into our response object.

We're almost done, but we still need to actually do something with this information! In ordinary HTTP Django this would be an HttpResponse object, containing all sorts of information on what template to render and what dictionary to pass in as an extra context namespace. This is a bit overkill for this example, so I've simplified it to another dict:

defer.returnValue(self.notice(response['recipient'],
                        response['data'].encode('UTF-8')))

The various RFCs for IRC all state rather loudly that automated bots are meant to speak using NOTICE but always ignore NOTICEs from other sources. This is meant to prevent feedback loops flooding a channel. Also note that since this was the final statement of my inlineCallbacks function, I called the defer.returnValue to spit back the result of the notice call. I'm not convinced that it was at all necessary, but I believe it's harmless boilerplate in the worst case.

Whoa is that all?

My current version is obviously not exactly like this. I've been doing some rather wild and thrashing testing and debugging, and there's some mess and refactorings. The above is intended as a demonstration of the technology only while I do my explorations in the yardbird bazaar tree.

For a start, my current version applies an errback function to log exceptions in the view function thread, and I've refactored the above code into a separate function that both the ACTION and PRIVMSG functions can use. The principle is still the same, though.

What's With The Name?

Because Django is named after Django Reinhardt, and was split from a CMS project named Ellington (as in, Duke), it's become traditional to name Django projects after Jazz greats. For example, there's a popular Django e-commerce system named Satchmo. I dug around and couldn't find any named after Charlie Parker, or his nickname 'Bird'. I just decided to play it safe and use the rarer long form "Yardbird".

Where do I get Your Version?

My spazzy tree, complete with README.txt files for the apache indexing and a hackish approach at implementing some of the Infobot functionality is up at http://zork.net/~nick/yardbird/. It's also a bazaar tree, so you can just:

bzr branch http://zork.net/~nick/yardbird/

Right now I'm trying to figure out what I should really do for the IrcRequest and IrcResponse objects, and how to properly package all this up like a proper professional project.

Your code sucks!

Soz.

Sat, 14 Mar 2009

15:20 [zork(~)] cat your-favorite-orm-sucks.txt

Your favorite ORM sucks

Over the past year or so I have been learning and enjoying Django, which is a large set of python modules that are impressively nice for creating dynamic Web sites. It's the latest generation in a set of tools that have been written to solve this very problem, going back all the way to the original CGI libraries of the mid-1990s.

You can get more details on the Django Web site, but the basic pieces you interact with are these:

Object-Relational Mapper:
 This saves you from having to type in COBOL-esque SQL queries and manually connect the results to the data structures in your code.
Template Language:
 This saves you from having to type in the punctuation-heavy HTML or XHTML to render your pages, and lets you focus more on the content.
URL Dispatcher:This saves you from having to build trees of scripts or put argument-based flow control spaghetti in your code, and lets you restructure the layout of your URLs in a nice clean manner.
Form Handler:This ties the previous three together to safely simplify one of the more dangerous aspects of programming: input validation.

There are more subsystems (such as the libraries of HTTP handling functions available to you in your “view” code), but those four seem like the legs of the table to me.

Feelings of Inadequacy

As I've been working with this, I've run into a number of conversations on IRC that seem to go something like this:

me:I'm really enjoying Django!
nerd:Django is a tinker toy! I prefer TurboGears or web.py!
me:Well Django sure takes a whole lot of busywork out of Web development!
nerd:Bah! It's a giant integrated ball of goo that you can't break out of!
me:Really? I mean their motto is "Loosely Coupled, Tightly Integrated". And sure the pieces all know a lot about one another, but you can swap out one or more trivially.
nerd:Bah! Huge all-singing all-dancing frameworks only help you in the beginning, and then they get in your way!
me:Right, but like I said you can easily swap them out or work in the raw for the more critical pieces of your app.
nerd:Bah! Django brings nothing to the table that hand-assembled collections of individual tools can't!
me:Really? I find it wonderful to have all that coordination among the pieces.
nerd:Django's ORM sucks and is too Django-specific! SQLAlchemy forever! Behold The Power!

Okay, so what can you do? After enough of these conversations, I had this sense that I was kind of using the My First Sony of Web development systems. And the Loosely Coupled, Tightly Integrated motto has been pointed out to have the effect that while you can swap a new ORM into your Django app, few people ever use the Django ORM in anything else.

Eventually it came time to do some work on a non-Web application and I needed a database. With all these conversations with Riot Nrrds like the above, I figured I'd save myself a lot of trouble by going straight to SQLAlchemy.

The Reason We Have ORMs

There's a reason we have these “Object-Relational Mapper” things, and it's because of a problem known as the Object-Relational Impedance Mismatch. Basically, the formal mathematical model for databases used to ensure that they stay intact follows a system of tables with rows and columns and references to other tables, while data structures in most programming languages we use today manipulate data in nested tree-like structures. It's rather like the difference between a spreadsheet and an XML document, or between a ledger and a family tree diagram.

For a long time people just suffered through it: building SQL queries in strings and shoving them over the wire to databases, then manually parsing the results and populating the local data structures, all the while hoping nothing got mistranslated in the process. But in the argument over why we keep relational databases around in this object-oriented day and age anyway, people noticed that there was a formal and automatic mapping you could perform between the two.

So fundamentally the ORM is there to allow you to manipulate all your data in one single format, save you the trouble of re-inventing the necessary mapping techniques between the two, and cut down the complexity of your code thereby reducing your exposure to bugs. It's a fantastic thing, since relational databases are still important for their speed and reliability, and every reasonably-expressive programming language since LISP (which is from the 1950s, after all!) has had an implicit bias toward hierarchical representation of data.

Behold The Power!

Having fiddled with Django's ORM, and looked into Ruby's ActiveRecord system and read a bit about the history of Java's Hibernate ORM, I decided to see what was so hot about all the other Python ORMs out there.

I quickly learned that my sample of three had been somewhat biased toward a particular state of mind: namely those who want the impedance mismatch to actually be solved for them. The majority of ORMs used by the Riot Nrrd set instead seem to be written by people who enjoy the power of SQL and demand lots of advanced SQL features. Nothing to actually make your job as a programmer easier ranks anywhere on their priority list. One of the high-end python ORMS even brags in its Features list about how it can't do schema generation!

The result of all this talking out of school was that I had a project using SQLAlchemy and it was a horrible bureaucratic mess. I had session object setup and tear-down all over, manual coupling between table objects and actual useful objects for some reason, and it barely worked at all. I was forking processes off that needed to do a bit of database work, and my transactions and sessions kept stepping on each other. Hell, even the simple single-process jobs were hard to get right!

I'm sure this is where the Riot Nrrd contingent would step in and scream about how incompetent I must be. Why, a truly skilled SQL craftsman would be able to foresee all the intricate concurrency and relational integrity performance-critical session collision transaction issues that would arise, and work around them!

Let me stop you right there.

Back to the Tinker-Toys

Frustrated and eyeing a looming deadline warily, I decided to see if maybe I could just stick to what I knew and come up with a better solution. Hey, maybe I'd bite the bullet and set up the whole thing as a Web app inside the much-derided all-singing all-dancing framework (a term that to me typically refers to an unfinished project, rather than a useful set of libraries).

So in my searches, I discovered a rant from 2006 about how horrible it is that you can't just use pieces of Django outside of django itself. As I contemplated the problem, I scrolled down to an intriguing response by Django's James Bennett.

James Bennett is one of those hackers you wish there were more of on the Internet. He's knowledgeable and proficient while remaining reasonable and courteous. He reminds me of an old Jesuit brother I once knew who never seemed to open his mouth unless he was saying something that would help someone out.

Anyway, Bennett wrote an entry of his own on how to do standalone django scripts, and that was in 2007! In 2007, Django was still in version 0.96, and there was a lot of isolating of components to go before the 1.0 release. And damn if the settings.configure() trick isn't simple and straightforward!

So my script now begins with:

from django.conf import settings
settings.configure(DATABASE_ENGINE='postgresql_psycopg2',
                   DATABASE_NAME='dbname',
                   DATABASE_USER='username',
                   DATABASE_PASSWORD='password',
                   INSTALLED_APPS=('myapp',))

…and then my database code lives in myapp/models.py and is something like ⅓ the length of the corresponding SQLAlchemy horror. It also gains a little input validation magic from some application-level data types (such as “IP Address”) that assert constraints not present in the database itself.

The Return of Session Management

Ah, but there comes a wrinkle! You just knew there'd be a wrinkle, didn't you? Remember how I was forking off asynchronous handler processes? So here's where that sinking feeling of inadequacy returned. I mean, Django's implicit session handling was causing me a headache, which sounds like precisely the sort of thing the Riot Nrrds were kvetching about!

Django's DB access model is very conservative: it won't chat with the database basically until the query you're building is at a stage where it needs information from the DB itself to proceed any further. It's a very lazy approach and I love it to bits. The session handling is all done behind the scenes, and any DB-bound “model” object will try to use a pre-existing connection if possible, but will open a new session if needed.

So I rolled up my sleeves and prepared to throw manual and bureaucratic session-management code all over my app, and ended up with the following:

from django.db import close_connection
close_connection()
pid = os.fork()

That's it! Two lines to just drop the current database connection before forking, and one of them is an import! Then the parent and the child each get a new session the next time they perform a query or update operation on any Django model object. I'm still picking my jaw up off the floor over how simple this was.

Django For The Win

So I got to do a massive bit of coding with the delete key, and I can now be confident about the correctness of all my database code. Furthermore, I did it without dragging in any baggage from the rest of the Django set (unless you count the brief use of the django.conf.settings module, which is a fair point).

So where does all this hatred for Django come from in the Riot Nrrd set? I suspect it is partly a quick look at early (pre-0.96) versions that were still in a “we're working on separating our framework from the Ellington CMS” state, and largely a reaction against the horrible “type a bit of python code into this browser input box” model of early Zope releases. I think that Zope in the late 90s left a lot of programmers with a bad taste in their mouths.

But in the end none of the complaints about Django held up, and SQLAlchemy simply wasn't fit for purpose. I ended up feeling kind of cheated, like I'd lost a week because I listened to some uninformed Real Programmer posturing that did nothing but make trouble. It could have been that I just hit a bunch of DBAs who use Python occasionally instead of people like me who are Python programmers who occasionally need a database, but I know that's not really the case.

Okay Maybe Only 99.9%

Right now my only complaint about Django is that the #django channel on Freenode contains an op by the nickname of Magus who is the absolute opposite of James Bennett. Every time Magus answers someone's question, he does so in the most condescending way possible. Someone noticed this and set up a twitter feed to log each time he uses “obvious” or “of course”, but in February he changed his nick to avoid it. I'm sure this guy is a useful contributor to Django, but answering even naïve questions like that is simply not helpful.


[zork(~)] cal
[zork(~)] tree
[zork(~)] syndicate.py
[zork(~)] cat README