Friday, January 22, 2010

xtc, meta-models and static typing

For my current work I want to implement a C* to C compiler, where * stands for a yet unknown set of extensions and restrictions.

In my opinion, parsing expression grammars are the tool of choice for growing a grammars. Among the existing PEG tools, the Rats! parser generator of the eXTensible C framework offers the most. Besides many handy features there is a complete C grammar and a pretty printer. Ready to take off!

Because my compiler does more than simple search-and-replace transformations, I won't get along with xform, which is a domain specific language for AST transformations of the xtc framework. Instead, I think I will need full turing-completeness. So I just want to use the AST from the xtc parser and plug my transformation engine there. For the implemenation of the latter I have chosen Scala, because it integrates well with Java and offers many features like functional programming and pattern matching, which should serve well for compiler tasks.

Parsers generated by Rats! produce ASTs which are composed of GNodes. A GNode basically is a tuple of a name and a list of children, which often are GNodes again and sometimes are Strings or other leafe objects. A compiler which works on such an AST must repeatedly compare strings and address child nodes by their index in the children's list. This makes the compiler's code hard to write, read and maintain. Furthermore the toolchain is unable to assist me as it has no information about what I am doing.

In principle, though, Scala could help me as it is a strongly typed programming language. In order to take full advantage of this I need to transform the GNode-based AST into Scala objects. Every name of a GNode corresponds to a case class with the types and the names of the members set according to the children's list of that node. In other words, I specify the meta-model of the C AST in terms of Scala types. Then I can use Scala's pattern matching to find nodes of the AST, use functional programming to do the transformation and use the Scala AST classes to construct the result. And for all those tasks the Scala compiler will assit me with type checking. Sound's good.

But here is the the problem: There are approx 140 different names of GNodes in the C AST of xtc. Writing all corresponding Scala classes plus the transformation code for both directions is a cumbersome and time-consuming way. Writing tests for all possible cases is even worse. And maintaining all this code for later modifications of the source language's grammar is hell.

Instead, I would like to generate that code. What I need therefore is a machine-readable form of the meta-model, which basically is specified by the Rats! grammar of the language. But there are a number of Rats! rules such as lifting and desugaring, which make the meta-model of the resulting AST slightly different from the grammar.

But fortunately xtc includes Typical, "a language and compiler for specification and generation of type checkers". Among other things Rats! support to dump the meta-model of any grammar in Typical notation, and there is a Rats! grammar for the Typical format. So, I could use this chain of tools and plug in a code generator at the end, which produces the meta-model in Scala type system notation.

At a closer look, though, some information is missing in the Typical format, which in particular is the name of child nodes if they are not GNodes. Without that information the generated Scala classes would end up with names such as string1, string2, etc, instead of structureTag or alike. This is not very nice.

That's why I though about introducing a backend next to Typical which is capable of dumping the meta-model as a UML class diagram in the XMI format. Unfortunatelly xtc seems to be not prepared for such extensions. The code which does the printing in Typical format is part of the AST class, which is the father of the JavaAST class. An instance of the latter is created by the Rats class and passed to a TreeTyper instance, which invokes the printing. So, it would need some hacking to integrate an alternative dump format into this code.

But the real show stopper is something else. Rats! parser have no problem with producing two GNodes with the same name but with different types of children.

$ cat example.rats
module example;

public generic Foo = Bar \ Baz;
generic Bar = "bar";
generic Baz = "baz";

$ java xtc.parser.Rats -ast -variant example.rats
mltype foo = Foo of [
    | `SomeBar of bar
    | `SomeBaz of baz
  ] ;
mltype bar = Bar of string ;
mltype baz = Baz of string ;

I don't fully understand the Typical format, but as far as I can see this means, that the child of a Foo is either a Bar or a Baz, but there is no common base class of them. Looking into the generated parser code confirms this:

$ java xtc.parser.Rats example.rats
$ cat
  public Result pFoo(final int yyStart) throws IOException {
    // Alternative 1.
    yyResult = pBar(yyStart);
      Node v$g$1 = yyResult.semanticValue();
      yyValue = GNode.create("Foo", v$g$1);
    // Alternative 2.
    yyResult = pBaz(yyStart);
      Node v$g$2 = yyResult.semanticValue();
      yyValue = GNode.create("Foo", v$g$2);

So it seems to me, that Rats! and static type systems do not fit together in general. But there might be a way out.

$ cat example2.rats
module example2;

public generic Foo = Bar @Bar / Baz @Baz;
generic Bar = "bar";
generic Baz = "baz";

$ java xtc.parser.Rats -ast -variant example2.rats
mltype foo =
  | Bar of foo
  | Baz of foo ;

By using node markers we can force the names of the generated GNodes to a specific value. For the meta-model this means that Foo is an abstract class and Bar and Baz inherit from it.

So I will have to find out, how I can rewrite the C grammar in a way that the language is not changed but the resulting meta-model can be statically typed. And then I will have to hack an XMI backend and finally write the code generator for the Scala classes. So much for unexpected problems...

No comments:

Post a Comment