Screwtape's Notepad

What is PyMeta2?

From the homepage:

PyMeta 2 is a Python implementation of OMeta which is an object-oriented language for pattern matching, based on a variant of Parsing Expression Grammars (PEGs). It’s a port of the old PyMeta implementation to the simplified OMeta 2 syntax.

In more general terms, it’s a Python module that helps you write code to parse structured text. Python comes with some built-in parsers like the email module for parsing email messages, the csv module for parsing CSV data, and the entire xml module hierarchy for parsing XML. PyMeta2 lets you go beyond those basic formats and allows you to easily create parsers for your own structured text formats.

Some history

OMeta was originally invented by Alessandro Warth, with an implementation for Smalltalk. Several implementations for other languages appeared, including PyMeta, an implementation for Python. Later, OMeta’s basic syntax was streamlined a little to produce OMeta2. PyMeta2 is a fork of PyMeta that updates it to support OMeta2 syntax.

Tutorial

Installation

First, download the PyMeta2 source archive from the homepage and extract it. Install it the same way you’d install any other Python module:

cd pymeta
python setup.py install

Check it’s been installed correctly by running:

python -c "import pymeta; print 'OK'"

If that command displays “OK”, then PyMeta2 is installed correctly. Otherwise, there’ll be an error message you can investigate.

First steps

At its most basic, OMeta2 is a language designed to describe structured text. For example, here is an OMeta2 grammar that recognises dates in ISO8601 format (YYYY-MM-DD):

grammar = digit digit digit digit "-" digit digit "-" digit digit

The format is defined to be four digits, followed by a hyphen, then two digits, another hyphen and two more digits. There’s a few things to notice here:

Multiple Rules

Here’s a more complicated example: let’s say we want to identify strings that represent unsigned integers, either in octal, decimal, or hexadecimal.

In English, the rules might look like this:

In OMeta2, this could be written:

hexDigit = (digit|"A"|"B"|"C"|"D"|"E"|"F")
hexInt = "0x" hexDigit+

octDigit = ("0"|"1"|"2"|"3"|"4"|"5"|"6"|"7")
octInt = "0" octDigit+

decInt = digit+

grammar = (hexInt | octInt | decInt)

More things to notice:

Parsing with Python

Writing a grammar in OMeta2 is an interesting exercise, but not very useful until Python can use it to parse with. Here’s how to use the PyMeta2 API to recognise a particular grammar:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
# Import the PyMeta2 APIs we need.
from pymeta.grammar import OMeta
from pymeta.runtime import ParseError

# This the OMeta2 description of the grammar we
# want to parse.
definition = """
hexDigit = (digit|"A"|"B"|"C"|"D"|"E"|"F")
hexInt = "0x" hexDigit+

octDigit = ("0"|"1"|"2"|"3"|"4"|"5"|"6"|"7")
octInt = "0" octDigit+

decInt = digit+

grammar = (hexInt | octInt | decInt)
"""

# Make a Grammar class that knows how to parse 
# strings according to the rules we've defined.
# The empty dictionary parameter is required, but
# can be ignored for now.
Grammar = OMeta.makeGrammar(definition, {})

while True:
    text = raw_input("Enter some text to parse: ")
    try:
        Grammar.parse(text)
        print("OK")
    except ParseError, e:
        print e

Things to notice:

A wrinkle: trailing characters

If you run the code above, you can play with the grammar, entering various things to see if they will be recognised as strings or not. Here’s some sample output:

Enter some text to parse: 0
OK
Enter some text to parse: 1
OK
Enter some text to parse: 1234
OK
Enter some text to parse: 01234
OK
Enter some text to parse: 0x1234
OK
Enter some text to parse: sasquatch

sasquatch
^
Parse error at line 1, column 0:

So far, that’s exactly what we expect! But here’s another test: digits 8 and 9 aren’t allowed in octal numbers, so what happens if we try to use one?

Enter some text to parse: 02468
OK

That’s not what we wanted at all!

It turns out that PyMeta2 is recognising 0246 as a valid octal integer, and then stopping. To instruct it that after parsing an integer we should be at the end of the string, there’s a special pattern named end. In the Python source, change the grammar rule so it looks like this:

grammar = (hexInt | octInt | decInt) end

…and try running the code again. This time we get a different response:

Enter some text to parse: 02468

02468
    ^
Parse error at line 1, column 4: expected one of string '7',
string '2', string '6', string '1', string '5', string '0',
string '4', or string '3'

Now we get a parse error just like we expected, with a helpful error message as well!

Labels and values

Just recognising that a string contains an integer is one thing, but it would be helpful to know which integer it is. OMeta2 helps us here, too, with the Label (“:”) and Rule Value (“->”) operators.

Inside a rule, you can put : after a particular pattern, followed by a name, to label the characters that pattern matches with the given name. Then, you can put -> at the end of the rule followed by a Python expression. Whenever the rule is matched, the expression is evaluated, and whatever it returns becomes the “value” of the rule — hence the name “Rule Value”.

As an example, let’s modify the hexInt rule in the previous example:

hexInt = "0x" hexDigit+:d -> int("".join(d), 16)

Here, we’ve added “:d” after the pattern that matches the digits. Because that pattern matches a variable number of digits, d will contain a Python list of the digits that pattern matched, in order.

We’ve also added the Rule Value operator ->, and a Python expression. Because d contains a list, we use the .join() method to join all the characters into a single string, then pass that string to the Python int() function to create an integer (in base 16, hexadecimal).

We can modify the rules for octInt and decInt in the same way (although we need to specify base 8 for octal, and don’t need to specify a base for decimal). We also need to modify the top-level grammar rule, so that it returns the value from the integer-parsing rules it calls, so the final result looks like this:

hexDigit = (digit|"A"|"B"|"C"|"D"|"E"|"F")
hexInt = "0x" hexDigit+:d -> int("".join(d), 16)

octDigit = ("0"|"1"|"2"|"3"|"4"|"5"|"6"|"7")
octInt = "0" octDigit+:d -> int("".join(d), 8)

decInt = digit+:d -> int("".join(d))

grammar = (octInt | hexInt | decInt):i end -> i

In order to see the result, we’ll have to amend our Python code to print the value that .parse() returns, like this:

The entire program now looks like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# Import the PyMeta2 APIs we need.
from pymeta.grammar import OMeta
from pymeta.runtime import ParseError

# This the OMeta2 description of the grammar we
# want to parse.
definition = """
hexDigit = (digit|"A"|"B"|"C"|"D"|"E"|"F")
hexInt = "0x" hexDigit+:d -> int("".join(d), 16)

octDigit = ("0"|"1"|"2"|"3"|"4"|"5"|"6"|"7")
octInt = "0" octDigit+:d -> int("".join(d), 8)

decInt = digit+:d -> int("".join(d))

grammar = (octInt | hexInt | decInt):i end -> i
"""

# Make a Grammar class that knows how to parse 
# strings according to the rules we've defined.
# The empty dictionary parameter is required, but
# can be ignored for now.
Grammar = OMeta.makeGrammar(definition, {})

while True:
    text = raw_input("Enter some text to parse: ")
    try:
        print Grammar.parse(text)
    except ParseError, e:
        print e

And here it is in action:

Enter some text to parse: 0
0
Enter some text to parse: 1
1
Enter some text to parse: 1234
1234
Enter some text to parse: 01234
668
Enter some text to parse: 0x1234
4660

Other Stuff

Semantic predicates and actions
The not and lookahead operators (~, ~~)
List patterns?
Grammar inheritance?
Grammar globals?
Characters versus strings?
The OMeta2 grammar for OMeta2
Why does OMeta2 have number syntax?

Reference

Literals

Strings and characters

Quoting, backslash escapes

Numbers

Signed decimal, octal and hex numbers

Operators

Alternate (|)

This pattern matches the input if either the left-hand side or the right-hand side matches the input.

Example usage:

'a' | 'b'

This pattern matches the input “a” and also the input “b”.

Label (:)

Used after a pattern, this gives a name to the pattern that can be used in the expression after the Rule Value operator.

Example usage:

number:real '+' number:imag 'i' -> complex(real, imag)

Assuming number is a rule that matches a number and returns a Python numeric value, this pattern matches expressions like “1+3i” or “7+4i” and labels the first number real and the second number imag. Those labels are then available to the Python expression after the Rule Value operator.

Lookahead (~~)

Used before a pattern, this matches the input if the following pattern would match, but doesn’t actually consume the matched bytes.

Example usage:

~~sign numeric

Assuming there is a sign rule that matches + or -, and a numeric rule that has an optional sign, this pattern requires the sign to be present but the actual interpretation of the sign character is done in the numeric rule.

Many (*)

Used after a pattern, this matches the input if the preceding pattern occurs zero or more times.

Example usage:

'a'* 'b'

This pattern matches the inputs “b”, “ab”, “aab”, “aaab”, and so forth.

Not (~)

Used before a pattern, this matches the input if the following pattern would not match, but doesn’t actually consume the bytes that don’t match.

Example usage:

~"0x" integer

Assuming there is an integer rule that matches hexadecimal literals (that start with “0x”) as well as decimal literals, this pattern will re-use the logic in the integer rule while only accepting decimals.

Note: A common idiom in parsing is a pattern that ends at the first occurrence of some particular byte - for example, a comment that extends until the end of a line, or string literals that start and end with a quote character. If you’re familiar with regular expressions, it’s tempting to write a pattern like this:

'"' (~'"')* '"'

…but this won’t work, because the ~ operator doesn’t consume the bytes it examines. The way to do this in OMeta2 is to add the anything rule like this:

'"' (~'"' anything)* '"'

The anything rule will match any single character, but because we’ve already checked that it’s not a double quote, the overall effect of the pattern inside the brackets is “any character that’s not a quote”.

One or More (+)

Used after a pattern, this matches the input if the preceding pattern occurs one or more times.

Example usage:

'a'+ 'b'

This pattern matches the inputs “ab”, “aab”, “aaab” and so forth, but not “b”.

Optional (?)

Used after a pattern, this matches the input if the preceding pattern occurs zero or one time.

Example usage:

'a'? 'b'

This pattern matches the inputs “b” and “ab”, but not “aab”, “aaab”, etc.

Rule Value (->)

Used after a pattern, it is followed by a Python expression that is evaluated when the pattern is matched. The result of the expression is used as the value of the pattern, which can be used by the Label and Rule Value operators in rules that call this one. The value of the top-level grammar rule is the value returned by the Grammar object’s .parse() method.

When evaluated, the expression can refer to all the sub-patterns with labels, as well as the contents of the dictionary that was passed to the .makeGrammar() method. when the grammar was compiled.

For a usage example, see the Label operator above.

Semantic Action (!())

Semantic Predicate (?())

Built-in Rules

There’s a bunch of built-in rules you can use that address common parsing needs.

anything

Matches any single character.

spaces

Matches zero or more whitespace characters, as defined by Python’s str.isspace() method.

end

Matches if the parser is at the end of the string being parsed.

token(c)

Identical to:

spaces c

Note: c must be a Python string literal, with either single or double quotes. It is matched character-wise against the input stream.

letter

Matches any single letter, as defined by Python’s str.isalpha() method.

letterOrDigit

Matches any single alphanumeric character, as defined by Python’s str.isalnum() method.

digit

Matches any single decimal digit, as defined by Python’s str.isdigit() method.