Notes on Python Parser Generators

So, I’ve been reading through the OMeta documentation (by which I mean the PhD thesis linked from the OMeta web-page, titled “Experimenting with Programming Languages”) and trying to figure out how best to fit these ideas into Python. It’s not as natural and easy a fit as I’d like, let’s see if I can organise my ideas more effectively by writing them down.

Funky things in OMeta¶

So far I’ve practised by writing an ordinary PEG parser in Python (based on the Wikipedia article about PEG parsers) and that was easy enough, so I’ll try to concentrate on things where OMeta goes beyond basic PEG parsing.

Matching against multiple, arbitrary data streams¶

This is going to make memoization for packrat-parsing difficult. Also, it means rules can’t just take “position” as an input parameter; they need to know which sequence they’re in.

OMeta really wants to help parse source files¶

As a meta-circular language, OMeta really wants the world to be based on an OMeta parser. Ideally, the host language would have something like Scheme macros, where the host language allows arbitrary code to hook into the parser, the OMeta engine can read OMeta code and compile it into the host-language’s AST, then the host language’s parser can take that and continue. Python doesn’t really work like that, and doesn’t have any hooks to let you emulate it.

Python’s basic concession to meta-programming is the “metaclass”, but that just lets you tinker with the class’ dictionary before the class is defined; you don’t get the class’ AST or any information about line-numbers, etc. In particular, if your metaclass adds new methods to the class, the code in those methods is probably going to see global variables from the metaclass’ module, not from the module that defined the class.

Alternatively, you might be able to peek up the stack to find the filename being executed, then write your own Python parser in OMeta and use that to parse the file, then dig out the definitions you need… but no, that’s terrible.

Perhaps you could supply a custom importOMeta() function that takes the filename of a file containing only OMeta code, parses that into a Python module containing a single class definition, and then hands you the module object. So you could say:

parserlib = importOMeta("parserlib.ometa")
p = parserlib.IntParser("0xff")

…but that would make it difficult to have OMeta code access names from the Python environment (for the semantic-predicate and semantic-action operators, as well as the foreign rule that calls other grammars).

I’m thinking perhaps the least offensive API would be to say something like this:

class MyParser(BaseParser):
    __grammar = """
        digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
        integer = digit+:x -> int(x)
        """

…then the functions created from the grammar can have their filename set to something like “<MyParser grammar> and even if your IDE can’t automatically jump to exactly the correct spot, you should be able to figure things out.

Rule Values, Semantic Actions and Predicates¶

This seems simple enough - just parse a Python expression from the string. But Python’s standard library doesn’t include a function for parsing an expression from an offset within a string, just for parsing an entire string as an expression. We could use the Magic of OMeta to add support for the full Python expression syntax, but that would have to be kept up-to-date as the Python language changes, and that’s a drag.

It occurs to me that if we just teach OMeta about Python string syntax, we can probably get away with a cheap hack where we just match brackets, except for brackets that occur inside string literals (but string literals’ quotes need to be balanced too). That’s not quite as much help for Rule Values (since they don’t have any enclosing brackets or parentheses), but I’m sure there’s some way to figure out where they end (next newline character?).

I think the parameters to parameterized rules are supposed to be host-language expressions, too.

Matching Objects¶

In OMeta’s original host language, which was Scheme-like, giving special precedence to lists is perfectly reasonable, because every data-structure in Scheme is built from lists. In Python, where you have lists and dicts and arbitrarily-complicated objects, maybe this isn’t such a useful idea.

Perhaps we could say “an OMeta list rule matches if we can call iter() on the next object in the stream”, but in order for backtracking to work, we need random access to the sequence.

Perhaps we could say “an OMeta list rule matches the next object in the stream (O) if isinstance(O, list) is true”. That would suit the OMeta grammar nicely, but make it pretty useless for parsing most Python data structures. Better yet, isinstance(O, tuple), so that we can store the sequence, position, and rule-name in a dictionary for our packrat implementation.

As a (very) specific use-case, I’m thinking about parsing Python AST nodes, which are objects with an unordered (or at least, very implicitly ordered) list of properties. You might be able to parse AST nodes with OMeta if you transformed the AST into a Scheme-style tree of lists, but the decorate/undecorate boilerplate would make this approach pretty ugly.

Pattern Matching on Rule Arguments¶

Oww. Obviously Python has nothing approaching “method dispatch on number and type of arguments”, so implementing this along-side “rules are native methods” would be hella-difficult. The OMeta spec says that for a rule with multiple definitions, each definition is tried in order until one of the matches, but I don’t really know how such a thing might be achieved.

If you call a rule with the wrong number of arguments, I guess that’ll just be a TypeError like in regular Python.

Stateful Pattern Matching¶

We don’t need any fancy initialisation rule names in Python, since grammars are classes and Python already provides __init__().

Proposed Parser API¶

# Exceptions

class ParseError(Exception):
    """
    Parent of all possible parsing problem.s
    """
    def __init__(self, sequence, pos, description):
        self.sequence = sequence
        self.pos = pos
        self.description = description

class UnrecoverableParseError(ParseError):
    """
    A parsing problem that can't be solved by back-tracking.
    """

class BacktrackException(ParseError):
    """
    A parsing problem that might be solved by back-tracking.
    """

# Support classes

class Grammar(type)
    """
    MetaClass that compiles the __grammar attribute to parser methods.
    """
    def __new__(mcls, name, bases, contents):
        """
        This will compile the rules in the __grammar attribute to methods.
        """

    def match(cls, sequence, ruleName):
        """
        Basic entry point for parsing.
        """
        parser = cls(sequence)
        rule = getattr(parser, "rule_%s" % (ruleName,))
        result, size = rule()
        return result

class BaseParser(object):
    """
    All parser classes should inherit from this.
    """
    # Make sure we invoke the meta-class to define any rules, etc.
    __metaclass__ == Grammar

    def __init__(self, sequence, pos=0, packratCache=None):
        """
        Initialise the parser state.

        sequence should be any hashable, immutable sequence (such as
        a string or tuple). It will be parsed using the grammar rules of
        this class.

        pos should be an integer, the offset into sequence at which parsing
        will begin.

        packratCache should be a dict object. If not supplied, a fresh,
        blank dict will be created.

        If your grammar needs to store more state, you can override this
        but make sure you call the super class!
        """
        self._sequence = sequence
        self._pos = pos

        if packratCache is None:
            self._packratCache = {}
        else:
            self._packratCache = packratCache

    # Basic rules supplied by the implementation.

    # Note that rule methods must begin with "_rule_" and accept at least
    # these basic parameters (if they accept more, they should be called with
    # OMeta's parameterized-rule syntax).
    def _rule_anything(self):
        # The 'anything' rule accepts anything, but obviously it can't
        # accept anything beyond the end of the sequence!
        if self._pos < len(self._sequence):
            # Every rule that matches must return a tuple of its result,
            # and the number of items it has consumed from the sequence.
            return (self._sequence[self._pos], 1)

        # If we're at the end of the sequence, there's nothing more we can
        # do here, so back-track and try something else.
        raise BacktrackException(self._sequence, self._pos, "End of file")

    # The token(x) rule can be invoked explicitly, but is also invoked
    # by a string-literal with double-quotes. The default implementation
    # skips over any whitespace characters in the sequence, then matches
    # the characters from the literal.
    def _rule_token(self, token):
        pass # Not implemented yet.


    # The apply(x) rule applies the rule with the given name.
    def _rule_apply(self, ruleName):
        rule = getattr(self, "rule_%s" % (ruleName,))
        rule()

    # The foreign(x, y) rule applies the rule in the given grammar with the
    # given name.
    def _rule_foreign(self, grammar, ruleName):
        foreignGrammar = grammar(self._sequence, self._pos, self._packratCache)
        foreignRule = getattr(foreignGrammar, "rule_%s" % (ruleName,))
        foreignRule()

    # You may have noticed that although rules are defined with names
    # beginning "_rule_", code that looks up rules uses the prefix "rule_".
    # This is because every rule needs to be wrapped in boiler-plate code
    # that handles the packrat memoization.
    def _call_rule(self, ruleName):
        key = (self._sequence, self._pos, self.__class__, ruleName)
        if key in self._packratCache:
            value = self._packratCache[key]
            if isinstance(value, Exception):
                raise value
            return value

        rule = getattr(self, "_rule_%s" % (ruleName,))
        try:
            value = rule()
            self._packratCache[key] = value
            return value
        except ParseError, e:
            self._packratCache[key] = e
            raise

    def __getattr__(self, name):
        if name.startswith("rule_"):
            return lambda: self._call_rule(name)

        super(BaseParser, self).__getattr__(name)

Screwtape's Notepad

Introduction¶