From the homepage:
PyMeta 2 is a Python implementation of OMeta which is an object-oriented language for pattern matching, based on a variant of Parsing Expression Grammars (PEGs). It’s a port of the old PyMeta implementation to the simplified OMeta 2 syntax.
In more general terms, it’s a Python module that helps you write code to parse structured text. Python comes with some built-in parsers like the email
module for parsing email messages, the csv
module for parsing CSV data, and the entire xml
module hierarchy for parsing XML. PyMeta2 lets you go beyond those basic formats and allows you to easily create parsers for your own structured text formats.
OMeta was originally invented by Alessandro Warth, with an implementation for Smalltalk. Several implementations for other languages appeared, including PyMeta, an implementation for Python. Later, OMeta’s basic syntax was streamlined a little to produce OMeta2. PyMeta2 is a fork of PyMeta that updates it to support OMeta2 syntax.
First, download the PyMeta2 source archive from the homepage and extract it. Install it the same way you’d install any other Python module:
cd pymeta
python setup.py install
Check it’s been installed correctly by running:
python -c "import pymeta; print 'OK'"
If that command displays “OK”, then PyMeta2 is installed correctly. Otherwise, there’ll be an error message you can investigate.
At its most basic, OMeta2 is a language designed to describe structured text. For example, here is an OMeta2 grammar that recognises dates in ISO8601 format (YYYY-MM-DD):
grammar = digit digit digit digit "-" digit digit "-" digit digit
The format is defined to be four digits, followed by a hyphen, then two digits, another hyphen and two more digits. There’s a few things to notice here:
Here’s a more complicated example: let’s say we want to identify strings that represent unsigned integers, either in octal, decimal, or hexadecimal.
In English, the rules might look like this:
In OMeta2, this could be written:
hexDigit = (digit|"A"|"B"|"C"|"D"|"E"|"F")
hexInt = "0x" hexDigit+
octDigit = ("0"|"1"|"2"|"3"|"4"|"5"|"6"|"7")
octInt = "0" octDigit+
decInt = digit+
grammar = (hexInt | octInt | decInt)
More things to notice:
octInt
comes before decInt
. Otherwise, the grammar would never recognise an octal integer because any octal integer also matches the decInt
rule.Writing a grammar in OMeta2 is an interesting exercise, but not very useful until Python can use it to parse with. Here’s how to use the PyMeta2 API to recognise a particular grammar:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
|
Things to notice:
OMeta
class is the main interface to PyMeta2..makeGrammar()
class method parses the definition you give it, and constructs a new subclass of OMeta
that includes the grammar rules from your definition..makeGrammar()
have a .parse()
class method that take a string to parse, and try to parse it using the rule named grammar
.ParseError
exception will be raised, which prints a pretty message to indicate where the problem occurred.If you run the code above, you can play with the grammar, entering various things to see if they will be recognised as strings or not. Here’s some sample output:
Enter some text to parse: 0
OK
Enter some text to parse: 1
OK
Enter some text to parse: 1234
OK
Enter some text to parse: 01234
OK
Enter some text to parse: 0x1234
OK
Enter some text to parse: sasquatch
sasquatch
^
Parse error at line 1, column 0:
So far, that’s exactly what we expect! But here’s another test: digits 8 and 9 aren’t allowed in octal numbers, so what happens if we try to use one?
Enter some text to parse: 02468
OK
That’s not what we wanted at all!
It turns out that PyMeta2 is recognising 0246
as a valid octal integer, and then stopping. To instruct it that after parsing an integer we should be at the end of the string, there’s a special pattern named end
. In the Python source, change the grammar
rule so it looks like this:
grammar = (hexInt | octInt | decInt) end
…and try running the code again. This time we get a different response:
Enter some text to parse: 02468
02468
^
Parse error at line 1, column 4: expected one of string '7',
string '2', string '6', string '1', string '5', string '0',
string '4', or string '3'
Now we get a parse error just like we expected, with a helpful error message as well!
Just recognising that a string contains an integer is one thing, but it would be helpful to know which integer it is. OMeta2 helps us here, too, with the Label (“:”) and Rule Value (“->”) operators.
Inside a rule, you can put :
after a particular pattern, followed by a name, to label the characters that pattern matches with the given name. Then, you can put ->
at the end of the rule followed by a Python expression. Whenever the rule is matched, the expression is evaluated, and whatever it returns becomes the “value” of the rule — hence the name “Rule Value”.
As an example, let’s modify the hexInt
rule in the previous example:
hexInt = "0x" hexDigit+:d -> int("".join(d), 16)
Here, we’ve added “:d” after the pattern that matches the digits. Because that pattern matches a variable number of digits, d
will contain a Python list of the digits that pattern matched, in order.
We’ve also added the Rule Value operator ->
, and a Python expression. Because d
contains a list, we use the .join()
method to join all the characters into a single string, then pass that string to the Python int()
function to create an integer (in base 16, hexadecimal).
We can modify the rules for octInt
and decInt
in the same way (although we need to specify base 8 for octal, and don’t need to specify a base for decimal). We also need to modify the top-level grammar
rule, so that it returns the value from the integer-parsing rules it calls, so the final result looks like this:
hexDigit = (digit|"A"|"B"|"C"|"D"|"E"|"F")
hexInt = "0x" hexDigit+:d -> int("".join(d), 16)
octDigit = ("0"|"1"|"2"|"3"|"4"|"5"|"6"|"7")
octInt = "0" octDigit+:d -> int("".join(d), 8)
decInt = digit+:d -> int("".join(d))
grammar = (octInt | hexInt | decInt):i end -> i
In order to see the result, we’ll have to amend our Python code to print the value that .parse()
returns, like this:
The entire program now looks like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
|
And here it is in action:
Enter some text to parse: 0
0
Enter some text to parse: 1
1
Enter some text to parse: 1234
1234
Enter some text to parse: 01234
668
Enter some text to parse: 0x1234
4660
Semantic predicates and actions
The not and lookahead operators (~
, ~~
)
List patterns?
Grammar inheritance?
Grammar globals?
Characters versus strings?
The OMeta2 grammar for OMeta2
Why does OMeta2 have number syntax?
Quoting, backslash escapes
Signed decimal, octal and hex numbers
This pattern matches the input if either the left-hand side or the right-hand side matches the input.
Example usage:
'a' | 'b'
This pattern matches the input “a” and also the input “b”.
Used after a pattern, this gives a name to the pattern that can be used in the expression after the Rule Value operator.
Example usage:
number:real '+' number:imag 'i' -> complex(real, imag)
Assuming number
is a rule that matches a number and returns a Python numeric value, this pattern matches expressions like “1+3i” or “7+4i” and labels the first number real
and the second number imag
. Those labels are then available to the Python expression after the Rule Value operator.
Used before a pattern, this matches the input if the following pattern would match, but doesn’t actually consume the matched bytes.
Example usage:
~~sign numeric
Assuming there is a sign
rule that matches +
or -
, and a numeric
rule that has an optional sign, this pattern requires the sign to be present but the actual interpretation of the sign character is done in the numeric
rule.
Used after a pattern, this matches the input if the preceding pattern occurs zero or more times.
Example usage:
'a'* 'b'
This pattern matches the inputs “b”, “ab”, “aab”, “aaab”, and so forth.
Used before a pattern, this matches the input if the following pattern would not match, but doesn’t actually consume the bytes that don’t match.
Example usage:
~"0x" integer
Assuming there is an integer
rule that matches hexadecimal literals (that start with “0x”) as well as decimal literals, this pattern will re-use the logic in the integer
rule while only accepting decimals.
Note: A common idiom in parsing is a pattern that ends at the first occurrence of some particular byte - for example, a comment that extends until the end of a line, or string literals that start and end with a quote character. If you’re familiar with regular expressions, it’s tempting to write a pattern like this:
'"' (~'"')* '"'
…but this won’t work, because the ~
operator doesn’t consume the bytes it examines. The way to do this in OMeta2 is to add the anything
rule like this:
'"' (~'"' anything)* '"'
The anything
rule will match any single character, but because we’ve already checked that it’s not a double quote, the overall effect of the pattern inside the brackets is “any character that’s not a quote”.
Used after a pattern, this matches the input if the preceding pattern occurs one or more times.
Example usage:
'a'+ 'b'
This pattern matches the inputs “ab”, “aab”, “aaab” and so forth, but not “b”.
Used after a pattern, this matches the input if the preceding pattern occurs zero or one time.
Example usage:
'a'? 'b'
This pattern matches the inputs “b” and “ab”, but not “aab”, “aaab”, etc.
Used after a pattern, it is followed by a Python expression that is evaluated when the pattern is matched. The result of the expression is used as the value of the pattern, which can be used by the Label and Rule Value operators in rules that call this one. The value of the top-level grammar
rule is the value returned by the Grammar object’s .parse()
method.
When evaluated, the expression can refer to all the sub-patterns with labels, as well as the contents of the dictionary that was passed to the .makeGrammar()
method. when the grammar was compiled.
For a usage example, see the Label operator above.
There’s a bunch of built-in rules you can use that address common parsing needs.
Matches any single character.
Matches zero or more whitespace characters, as defined by Python’s str.isspace()
method.
Matches if the parser is at the end of the string being parsed.
Identical to:
spaces c
Note: c
must be a Python string literal, with either single or double quotes. It is matched character-wise against the input stream.
Matches any single letter, as defined by Python’s str.isalpha()
method.
Matches any single alphanumeric character, as defined by Python’s str.isalnum()
method.
Matches any single decimal digit, as defined by Python’s str.isdigit()
method.