copton's web log: Rats!

Rats! is an "easily extensible parser generator for C-like languages" written by Robert Grimm. The main idea of Rats! is to exploit the extensibility of PEGs by providing support for parametrizable grammar modules which can use, instantiate, extend and modify each other. Grimm describes version 1.8.2 of Rats! in Better Extensibility through Modular Syntax. More documentation can be found in the Rats! intro. It is part of the the extensive Javadoc of the xtc project, which is the framework project of Rats!. Here comes what I consider the main design principles, features and properties of Rats!.

First of all, "while Rats! currently targets only Java, all language-specific aspects have been carefully isolated to ease future ports to other languages." As with OMeta, though, a given grammar is pinned to a given backend language because the grammar's semantics are determined by code snippets written in that language. So, whenever I speak of Java in the following, I mean the given host language which could be any other language as well.

In Rats! a production is of the form

Attributes Type Nonterminal = expression ;

Attributes is a space separated list of attributes which modify the behavior of the production. Type is the Java type of the semantic value of the production and Nonterminal is an arbitrary name for the production. expression is the parsing expression of the production and for the most part is conforming to standard PEG expressions. Major differences are that sequences can be explicitly labeled with a sequence name, and semantic predicates like in OMeta are supported. Besides the pure grammar structure Rats! supports the computation of semantic values in order to build custom abstract syntax trees (AST). For this purpose v:e performs a binding of the semantic values of the expressions e to the variable v. Semantic actions then can operate on those variables in order to construct the semantic value. The semantic value of the whole production is determined by the value of the special variable yyValue, which is set either through binding or within a semantic action. The name of the variable has been chosen in deference to Yacc. In general, identifiers starting with yy are reserved for Rats! and should not be used.

Matching against strings is considered a common task in Rats!. That's why there is syntactic sugar for that, i.e. the so called text matching expressions "text":e is a shortcut for

fresh-id:e &{ "text".equals(fresh-id); }

which means the semantic value of e is bound to a new variable fresh-id and then checked by the semantic predicate, which is Java code and compares the value of the variable with the desired string literal. A last notable feature of string matching is that Rats! supports Unicode.

implicit semantic values

In many cases the semantic value of a production is implicitly clear which means explicit, redundant semantic actions can be left out. The trivial case for this is a sequence which consist of exactly one nonterminal. Then the semantic value can simply be passed through. Although trivial this case can be found in practice, for example as an alternative of a choice.

If a sequence besides the "main" nonterminal consists of some syntactic clutter like braces, the semantic values of those other nonterminals can be explicitly voided by binding it to the special variable void. If there is exactly one non-voided nonterminal in a sequence, again the semantic value is implicitly determined. Additionally so called void productions can have the type of void, which means their semantic value is forbidden to be bound to a variable and is implicitly voided.

"A text-only production is a production that declares the type of its semantic value to be a String, does not contain any bindings to yyValue nor actions that reference yyValue and references only other text-only productions (if any). The semantic value of a text-only production is the text matched in the input, represented as a String. This is comfortable and much faster than string concatenations by user-defined semantic actions.

Rats! has syntactic sugar for the common case when the semantic value of a production is an AST node. Instead of binding the semantic values of sub-expressions and manually constructing the node from this, the type of such productions can be declared to be generic. All non-voided semantic values of sub-expressions then implicitly become children of the production's AST node. So, for example

generic FullProduction =
   ProductionAttributes
   Name UnqualifiedNonTerminal void:"=":Symbol Choice void:";":Symbol 
   ;

is syntactic sugar for

Node FullProduction =
   v1:ProductionAttributes
   v2:Name v3:UnqualifiedNonTerminal "=":Symbol v4:Choice ";":Symbol
     { yyValue = GNode.create("Production", v1, v2, v3, v4); }
   ;

Another common case are productions for repetitions. If the production's type is Pair<Node>, which is the container class for lists with a head and a tail member, it's clear how the semantic value looks like. For example

Pair<Node> ParameterList =
    ParameterDeclaration (void:",":Symbol ParameterDeclaration)* ;

is syntactic sugar for

Pair<Node> ParameterList =
   head:ParamterDeclaration tail:(void:",":Symbol ParameterDeclaration)*
   { yyValue = new Pair<Node>(head, tail); } ;

left-recursion

Standard PEG parsers cannot support left-recursion. Thus in general direct and indirect left-recursion is detected and rejected. But as left-recursion occurs very naturally at language design, Rats! has some support to relieve this limitation. In Rats! void, text-only and generic productions may contain direct left-recursion. In this case the grammar is transparently rewritten to use repetition. For generic productions semantic actions are additionally generated which rewrite the semantic value in the end in order to get the desired left associative data structure.

The semantic values of generic productions are AST nodes whose names are deduced from the production's name. In case of direct left-recursion with a choice, though, all alternatives would end up with equally named nodes. That makes it hard to decide which alternative was taken. For this case, Rats! supports node markers with which the grammar author can explicitly set the nodes' name for each alternative. Example:

generic PostfixExpression =
    <Subscript>         PostfixExpression void:"[":Symbol Expression void:"]":Symbol 
                                          @SubscriptExpression
  / <DirectSelection>   PostfixExpression void:".":Symbol Identifier
                                          @DirectComponentSelection
  / <IndirectSelection> PostfixExpression void:"->":Symbol Identifier
                                          @IndirectComponentSelection
  /* .... */
  / <Primary>           PrimaryExpression ;

I wonder though, why the sequence's names can not be used instead of - as far as I can see - redundant node markers.

modules

Rats! modules group related productions and control their visibility. Every module has a hierarchical name and can be imported by that name. Additionally a module may have parameters which turns it into a module template. By providing values for the parameters other modules can instantiate the template. Parameter values are always other modules and the top level module puts everything together.

A module can modify at most one other module. A special syntax for partial productions enables the grammar author to add, remove or overwrite alternatives of a production's main choice. In order to address the intended position of the new alternative in the surrounding choice the name of corresponding sequence must be used. That's why grammar authors are encouraged to always label sequences. Finally it is also possible to alter the attributes of a production or to replace it completely with a new definition.

The visibility of a production is determined by the attributes public, protected and private. By default a production is protected which makes it visible from within the defining module itself and all modules which import or modify it. In contrast, private productions are only visible from withing the defining module itself and modifying ones. Last, public productions are like protected ones and additionally turn into public member functions of the Java parser and thus are part of the parser's API. In case of name collisions local productions have precedence and others can be referenced by their full qualified name.

Cyclic dependencies are not a problem for Rats! as resolving is done breadth-first. "As a result, the order of import, modify, and instantiate declarations in a module header does not matter, and mutually dependent modules can be instantiated within the same module".

states

As explained previously PEG based parsers need to memorize intermediate results in order to guarantee linear runtime and thus have problems with stateful parsing. Rats! solution to this is simple, as "all state modifications are performed within possibly nested, lightweight transactions". For example:

GNode ExternalDeclaration =                  {yyState.start(); }
          <Declaration> Declaration          {yyState.commit();}
        / <Function>    FunctionDefinition   {yyState.commit();}
        /                                    {yyState.abort();} &{ false } ;

The type of the yyState object is user defined and chosen appropriate to the language's needs. First, start() is meant to open a new transaction. Then, no matter which alternative matches, commit() is meant to commit the transaction. If both alternatives fail, abort() aborts the transaction and the whole production forcibly fails due to the semantic predicate. As this solution introduces some boilerplate code Rats! provides syntactic sugar for it through the stateful attribute:

stateful GNode ExternalDeclaration =
         <Declaration> Declaration
       / <Function>    FunctionDefinition ;

It is the author's responsibility to ensure that "if an ordered choice's alternatives may reference the same nonterminals for the same input positions, notably by having a common prefix, the state must be modified in the same way across all alternatives through the common nonterminals."

Transactions are one reason why Rats! is suitable for "C-like languages", because they only work if state information flows forward through the input document. For example, a typedef introduces a new type alias which can be used from then on and not before.

debugging

Rats! promises to have a good error handling and error reporting. If this is true or not can only be said after experiencing Rats! in real projects. From what is said, though, it looks quite promising.

First, the production's type becomes a Java type, thus when compiling the parser the Java compiler detects type errors. Second, parse error messages are deduced from productions' names in order to make them human readable. Furthermore file names and row and column number are tracked and are reported in case of an error. The generated AST nodes are additionally annotated with this information in order to enable later tools to locate the source of semantic errors.

A very handy feature is that Rats! can dump the grammar as seen in different stages of processing as a HTML file. The different stages are "after loading and instantiating all modules", "after applying all module modifications", or "after applying all optimizations". This gives transparency into what Rats! does and helps the grammar author to track bugs in the grammar.

finally

Not everything can be parsed with PEGs. For example length encoded data types require the parser to first interpret the length and then consume as many items as given by the length. PEGs can not provide this feature. Therefore Rats! supports parser actions which are low-level parse functions written in Java and which can be called from Rats! productions.

Rats! support a long list of grammar and production attributes for optimization, adaption and convenience. Concerning optimization, Rats!' authors payed much attention to keep things lean and fast. The paper has some performance measurements which show that this pays off.

The Fortress programming language uses Rats! because they want the syntax of their language to be growable [Growing a Syntax]. As far as I can see this is the major proof that Rats! is feature complete and mature enough to be used productively.

Besides some minor issues Rats! looks very promising to me. I think I will give it a try.

copton's web log

Wednesday, September 16, 2009

Rats!