Wednesday, July 8, 2009

Parsing Expression Grammars

Parsing Expression Grammars (PEGs) are a rather new class of grammars for formal languages. The foundations date back to 1973 and come from "Parsing algorithms with backtrack" by Alexander Birman and Jeffrey D. Ullman. But in those days the approach was considered impractical because of limited memory resources. In 2002 Bryan Ford revived the topic in his master's thesis Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking by turning the theoretical model into an equivalent but practical one. Since then the topic has been gaining more and more attention and additional publications have followed. Bryan collects the publications he considers important on a PEG homepage.

prioritized choice

The main difference from context-free grammars (CFGs) is the prioritized choice operator, which never backtracks to the second option once the first option has matched. As a result, PEGs are never ambiguous.

If a parse failure occurs while parsing an option of a choice, the input stream is reset to the position it had before that option was tried, and the next option is parsed. If both options share common sub-rules, this backtracking can cause the same work to be done again. This is why parsing with backtracking potentially takes exponential time.
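As a minimal sketch of these semantics in Python (the position-or-None convention and all helper names are my own illustration, not a fixed API):

def literal(s):
    # A parse function takes the input text and a start position and
    # returns the new position on success, or None on a parse failure.
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def choice(*options):
    def parse(text, pos):
        for option in options:
            # Every option restarts at the same position, so the input
            # is effectively reset after a failed option.
            result = option(text, pos)
            if result is not None:
                return result  # first match wins; later options are dead
        return None
    return parse

keyword = choice(literal('if'), literal('in'))
print(keyword('in x', 0))  # 2: 'if' failed, 'in' matched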

Because of the semantics of the prioritized choice operator, though, parsing a string according to a given rule is a deterministic function which returns either a parse failure or the number of consumed characters. This enables a dynamic-programming-based linear-time parser with a run time and memory complexity of O(length of the string × number of rules in the grammar). A common variant is a parser which does lazy memoization of parsing results and thus has a lower cost in practice.
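Building on the sketch above, a packrat parser might add one memo table per rule, keyed by input position (the table layout is again an assumption for illustration):

def memoize(rule):
    # Each rule is parsed at most once per position. With m rules and an
    # input of length n the tables hold at most n entries each, giving
    # the O(n × m) time and memory bound. Assumes one input string per
    # parser instance.
    table = {}
    def parse(text, pos):
        if pos not in table:
            table[pos] = rule(text, pos)  # lazy: computed on first use
        return table[pos]
    return parse

keyword = memoize(choice(literal('if'), literal('in')))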

Languages like C need a stateful parser, because typedef introduces new types which must be recognized as such by the parser. For PEGs this means that a new state invalidates the memoized parse results. Thus states have a negative impact on the complexity of the parser.

infinite look-ahead

Because the parse function can take arbitrarily many subsequent characters into consideration, PEGs support infinite look-ahead. In contrast, common parsers for context-free languages support a very limited amount of look-ahead. That's why they need a lexer which changes the granularity of the input from characters to tokens. Because PEGs support infinite look-ahead, the character granularity is fine and no lexer is needed. Instead the lexer part, which forms the very basic non-terminals from characters, is seamlessly integrated into the grammar. That's why PEGs look like a mixture of regular expressions and context-free grammars.
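For example, a small arithmetic fragment (my own illustration, not taken from any particular language) defines its tokens directly on the character level, whitespace handling included:

Sum     <- Number (Plus Number)*
Number  <- [0-9]+ Spacing
Plus    <- '+' Spacing
Spacing <- ' '*

Here Number, Plus and Spacing play the role of the lexer, while Sum is an ordinary grammar rule; both levels live in the same grammar.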

The infinite look-ahead additionally removes a major source of ambiguity during parsing. With CFGs you often have to rewrite the grammar to help the parser get along. With PEGs you are not subject to such technical restrictions. Instead you can state the grammar the way it is.

A last benefit of the infinite look-ahead is the extensibility of PEGs. When adding new syntax to the language you almost don't have to think about the parser. The only thing you have to get right is that the options of a choice have to be ordered so that no string which is parsed by an earlier option is a prefix of a string parsed by a later one; otherwise the later option is never chosen. For example, if there is an operator = and an operator == in your language, the choice should look like operator <- '==' / '=' and not the other way round.
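In the Python sketch from above this ordering reads as follows; with the options swapped, literal('=') would already match the first character of '==' and the second option would never be reached:

operator = choice(literal('=='), literal('='))
print(operator('== b', 0))  # 2: '==' is tried first and matches
print(operator('= b', 0))   # 1: '==' fails, then '=' matches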

The Fortress project uses PEGs to keep the syntax flexible, so that new requirements can be integrated seamlessly into the language (see Growing a Syntax). For the same reasons PEGs are a very good choice for designing DSLs.

more power

PEGs are not only easier to use than CFGs, they are also more powerful. The syntactic predicate operator !e matches if the expression e does not match, without consuming any input characters. Double application of this negative look-ahead results in a positive look-ahead, denoted as &e.
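In the position-or-None convention from the sketches above, both predicates fit into a few lines (again my own illustration, not a standard API):

def not_(rule):
    def parse(text, pos):
        # !e: succeed at the same position iff e fails, consuming nothing
        return pos if rule(text, pos) is None else None
    return parse

def and_(rule):
    # &e as the double negation !(!e): a positive look-ahead
    return not_(not_(rule))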

As far as I know it is unclear where exactly the borders between PEGs, CFGs and context-sensitive grammars (CSGs) are. In practice all CFG languages are considered to be PEG languages as well, although the translation of the grammar is not always trivial. Pathological CFG languages which lack syntactic markers can pose problems to PEGs because the parser cannot try all possibilities but sticks to the first one which looks promising. An example is the CFG production S -> xSx | x.

On the other hand there are languages which are not in the CFG class but in the PEG class. For example {a^n b^n c^n | n > 0} is in the CSG class and has the following PEG grammar:

S <- &(A c) a+ B !(a/b/c)
A <- a A? b
B <- b B? c

(note that in Parsing Expression Grammars: A Recognition-Based Syntactic Foundation the grammar for the above language is wrong because it matches the string "aabc". My bug report was confirmed on the PEG mailing list.)

The expression &(A c) first asserts that there are as many a as b characters, followed by a c character, without consuming anything. The expression a+ then consumes all a characters without counting them. The expression B then consumes as many b characters as c characters. As the number of a characters equals the number of b characters due to the expression A, and the number of b characters equals the number of c characters due to the expression B, the overall property a^n b^n c^n for some n > 0 holds. In the end, the expression !(a/b/c) asserts that no further garbage characters follow.
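Translated by hand into the position-or-None style of the earlier sketches (the function names are my own), the grammar becomes three small recursive functions. Note how &(A c) only looks ahead without advancing the position and how !(a/b/c) acts as the end-of-input check:

def parse_A(text, pos):
    # A <- a A? b
    if not text.startswith('a', pos):
        return None
    pos += 1
    inner = parse_A(text, pos)  # A? commits if A matches, as in a PEG
    if inner is not None:
        pos = inner
    if not text.startswith('b', pos):
        return None
    return pos + 1

def parse_B(text, pos):
    # B <- b B? c
    if not text.startswith('b', pos):
        return None
    pos += 1
    inner = parse_B(text, pos)
    if inner is not None:
        pos = inner
    if not text.startswith('c', pos):
        return None
    return pos + 1

def parse_S(text, pos=0):
    # S <- &(A c) a+ B !(a/b/c)
    look = parse_A(text, pos)  # &(A c): look ahead only,
    if look is None or not text.startswith('c', look):
        return None            # pos is never advanced here
    start = pos
    while text.startswith('a', pos):  # a+
        pos += 1
    if pos == start:
        return None
    result = parse_B(text, pos)
    if result is None:
        return None
    pos = result
    if pos < len(text) and text[pos] in 'abc':  # !(a/b/c)
        return None
    return pos

print(parse_S('aabbcc') is not None)  # True
print(parse_S('aabc') is not None)    # False: rejected by &(A c)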

turing machine

A PEG defines how a string is read, in contrast to CFGs, which define how a string is written. A major consequence of this is that an expression matches a string if it succeeds in consuming a prefix of that string. Thus a PEG parser does not necessarily have to consume the whole string in order to succeed. If you don't want that behaviour, you have to add !. to the end of your starting rule, where . is syntactic sugar for a choice between all possible input characters.

Furthermore a PEG is said to handle a string if its start expression either matches or fails. If a PEG does not handle a string, then the parser does not terminate. This can happen due to left recursion, such as A <- A x, which means that in order to parse an A, an A has to be parsed first. Such loops can also be indirect, built by multiple circularly depending rules. Applying the repetition operator e* to an expression e which does not consume any character also leads to an infinite loop. If a grammar handles all possible input strings - over a given alphabet - it is said to be complete.
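As a sketch of the left recursion problem, the naive recursive-descent translation of A <- A x calls itself without consuming any input; in Python the loop surfaces as a RecursionError instead of a hang:

def parse_A(text, pos):
    # A <- A x: to parse an A, an A has to be parsed first, at the very
    # same position, so the recursion never bottoms out.
    result = parse_A(text, pos)  # immediate infinite recursion
    if result is not None and text.startswith('x', result):
        return result + 1
    return None

try:
    parse_A('xxx', 0)
except RecursionError:
    print('the parser does not terminate on left recursion')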

The question whether a given grammar is complete is the question whether there is an input for which the parser does not stop. Thus it is related to the halting problem, and indeed it can be proven that it is undecidable

  • whether a grammar is complete,
  • whether the set of strings for which an expression matches is empty,
  • whether two arbitrary PEGs are equivalent.

conclusion

PEGs are a very promising new class of formal grammars. They are rather easy to write and read because the lexer part is integrated and almost no technical restrictions encumber the author or the reader. Furthermore PEGs are extensible, which allows iterative development of languages. To sum it up, I think PEGs are the ideal tool for domain specific languages. I am really looking forward to the first meta programming system which uses PEGs.
