Thursday, July 9, 2009

Stratego/XT

Stratego/XT is the first meta-programming system I had a closer look at. The project describes itself as "a language and tool set for program transformation". By program transformation they mean "programming tasks using some form of automatic program generation or transformation, such as code generation from a domain-specific language, aspect weaving, optimization, or specialization of a generic program to a particular context." So Stratego/XT follows a more generalized approach than what MDE needs. Here is an extract of the major ideas, features and insights I got from the project homepage, the manual, the article on Stratego/XT 0.17 in the Elsevier journal Science of Computer Programming, and the technical report "Stratego/XT 0.17. A Language and Toolset for Program Transformation".

ATerm

The Stratego tools work on document trees in the Annotated Term (ATerm) format. ATerms are nested constructor applications with integers, strings, lists and tuples as parameters. Additionally, an ATerm can be annotated to provide extra semantic information. ATerms implement maximal sharing, which means that common subtrees are stored once and referred to multiple times. There are text-based and binary ATerm representations, tools for converting, e.g. from and to XML, and libraries to work with ATerms.
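Maximal sharing can be sketched in a few lines of Python via hash-consing; this is my own illustration of the idea, not the actual ATerm library API:

```python
# A minimal sketch (not the real ATerm library) of maximal sharing:
# every distinct term is built exactly once and reused afterwards.

class ATerm:
    _pool = {}  # hash-consing table: structure -> unique instance

    def __new__(cls, constructor, *args):
        key = (constructor, args)
        if key not in cls._pool:
            term = super().__new__(cls)
            term.constructor, term.args = constructor, args
            cls._pool[key] = term
        return cls._pool[key]

    def __repr__(self):
        if not self.args:
            return self.constructor
        return f"{self.constructor}({', '.join(map(repr, self.args))})"

# Plus(Int("1"), Int("1")): both Int("1") subterms are the same object.
a = ATerm("Plus", ATerm("Int", "1"), ATerm("Int", "1"))
print(a.args[0] is a.args[1])  # True: common subtrees are shared
```

Because equal terms are identical objects, structural equality degenerates to a cheap pointer comparison, which is what makes maximal sharing pay off.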

SDF

The syntax definition formalism (SDF) is a language for defining syntax and is implemented by scannerless generalized-LR parsing (SGLR). Semantic actions in SDF are limited to the specification of one ATerm constructor per production. There is no support for performing any action on the parse tree while parsing. This is appropriate for meta-programming systems because the same parser must be able to feed its output into different transformers, which is where the semantics are defined. As far as I can see, SDF has the following major advantages:
  • modules: SDF supports modules, so different aspects of the grammar can be specified separately and such parts can be reused by other languages.
  • integration of lexical and context-free syntax: as opposed to lex and yacc, SDF does not distinguish much between lexical and context-free syntax. The only difference is that layout is allowed between the symbols of a context-free production. Here layout means whitespace and comments and can be defined arbitrarily.
  • disambiguation: SGLR grammars can be ambiguous, so SDF provides the following means to narrow down the set of valid parse trees:
    • specification of associativity of productions
    • specification of priorities between productions
    • lexical restrictions, which basically are negative look-aheads on terminals
  • reject productions: single productions can be rejected explicitly. For example, this is used to prevent keywords from being parsed as identifiers and to specify floating-point literals in an "all but" fashion.
For me major disadvantages are:
  • as far as I can see, SDF has no Unicode support
  • verbosity:
    • specifying the priority between productions requires retyping of all involved productions.
    • forward declarations of all sorts (non-terminals) are needed to avoid warnings.
  • SDF does not do longest-match scanning because "in some languages longest match scanning actually doesn't work". As a consequence, many lexical restrictions are needed to get things right. For example, to force whitespace after keywords you have to enumerate all of them in a lexical restriction.
  • I find SDF quite complex. For example, nested comment blocks can be written in PEG (section 2.2) as
    BlockComment <- '/*' (!'*/'. | BlockComment)* '*/'
    In SDF you have to write the following:
lexical syntax
  "/*" CommentPart* "*/" -> BlockComment
  ~[\/\*]      -> CommentPart
  Asterisk     -> CommentPart
  Slash        -> CommentPart
  BlockComment -> CommentPart
  [\/] -> Slash
  [\*] -> Asterisk

lexical restrictions
  Asterisk -/- [\/]
  Slash    -/- [\*]
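For comparison, the recursion that makes the PEG rule so compact can be sketched as a small hand-written recognizer; this is a hypothetical Python illustration of the same idea, not part of Stratego/XT:

```python
def match_block_comment(s, i=0):
    """Recognize a nested /* ... */ comment starting at index i.
    Returns the index just past the closing */ or None on failure.
    Mirrors: BlockComment <- '/*' (!'*/' . | BlockComment)* '*/'
    """
    if not s.startswith("/*", i):
        return None
    i += 2
    while i < len(s):
        if s.startswith("*/", i):          # closing delimiter
            return i + 2
        if s.startswith("/*", i):          # nested comment: recurse
            j = match_block_comment(s, i)
            if j is None:
                return None
            i = j
        else:                              # any other character
            i += 1
    return None                            # unterminated comment

print(match_block_comment("/* a /* nested */ b */"))  # 22 (the full length)
```

The SDF version needs the auxiliary Asterisk and Slash sorts plus lexical restrictions to express the negative lookahead !'*/' that the PEG and the sketch above get for free.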

Parse-unit

Parse-unit is a unit testing tool for SDF. A single parse-unit test specifies an input string and the expected resulting ATerm. A test suite is a collection of such tests and can be executed with the parse-unit command line tool. The output is along the lines of all the *Unit tools out there.

The Stratego language

Stratego is the transformation language of the Stratego/XT system. A Stratego program operates on an ATerm and applies user-defined rewrite rules to it according to a user-defined strategy. A rewrite rule consists of a pattern and a substitution term with common variables. If the pattern matches the so-called subject term, the variables are bound to the matching values. The substitution term is then expanded and used as a replacement for the matched subject term. If the pattern does not match, the rewrite rule fails. Stratego supports the embedding of the object language, which makes writing rewrite rules quite comfortable: instead of specifying pattern and substitution as ATerm trees, you can state the corresponding code from the object language. In order to make this work you have to define the meta-variables that you use in the patterns.

Single rewrite rules can be combined into more complex transformations by strategies. The sequential composition operator s1 ; s2 first applies strategy s1 and, in case of success, then s2. If either s1 or s2 fails, the sequence fails as well. The deterministic choice operator s1 <+ s2 first tries to apply s1, and if that succeeds the choice succeeds. However, if the application of s1 fails, s2 is applied. The guarded choice operator s1 < s2 + s3 first applies s1. If that succeeds then s2 is applied, else s3. If s2 fails, the complete expression fails and no backtracking to s3 takes place. The strategy id always succeeds and the strategy fail always fails, both without altering the subject term.

The two basic building blocks of the Stratego language are building terms and matching terms. The build !p replaces the subject term with the instantiation of the pattern p, using the bindings from the environment for the variables occurring in p. Matching comes in three different flavors. First, ?t with a literal term t matches the subject term against t. Second, ?p with a pattern p compares the subject term to the pattern and adds bindings for the variables in p to the environment. Third, ?x with a variable x binds x to the subject term in the environment. The (explicit) scope {x1, x2 : s} provides fresh instances of the variables x1 and x2 during the execution of the strategy s. On top of these building blocks, a range of syntactic sugar is provided.
  • anonymous rewrite rule: (p1 -> p2) is sugar for ?p1 ; !p2
  • where:
    • where(s) is sugar for {x : ?x; s; !x}: s is applied without changing the subject term. Other side effects, such as changing variable bindings, are still possible, though.
    • p1 -> p2 where s is sugar for ?p1; where(s); !p2 providing additional conditions for the rule to apply.
  • lambda rules: \ p1 -> p2 where s \ is sugar for {x1, x2 : (p1 -> p2 where s)} with x1 and x2 being variables of p1. This allows you to provide rules on the fly.
  • applying a strategy:
    • <s> p is sugar for !p; s
    • s => p is sugar for s; ?p
    • <s> p1 => p2 is sugar for !p1; s; ?p2
  • if-then-else:
    • if s1 then s2 else s3 end is sugar for where(s1) < s2 + s3
    • if s1 then s2 end is sugar for where(s1) < s2 + id
  • switch: a C-style switch whose desugaring is rather complex; have a look at the manual
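To make these semantics concrete, here is my own minimal Python model in which a strategy is a function from a term to a rewritten term, or None on failure. All names are my invention (id is spelled ident to avoid Python's builtin); this is not how Stratego is implemented:

```python
# A strategy takes a term and returns the rewritten term, or None on failure.
def ident(t):              # id: always succeeds, term unchanged
    return t

def fail(t):               # fail: always fails
    return None

def seq(s1, s2):           # s1 ; s2: s2 runs only if s1 succeeded
    return lambda t: (lambda r: s2(r) if r is not None else None)(s1(t))

def choice(s1, s2):        # s1 <+ s2: try s1, fall back to s2 on failure
    return lambda t: (lambda r: r if r is not None else s2(t))(s1(t))

def rule(pattern, build):  # (p1 -> p2): match the pattern, then build
    def s(t):
        bindings = pattern(t)
        return build(bindings) if bindings is not None else None
    return s

# Example rule over tuple terms: Plus(x, Int(0)) -> x
def match_plus_zero(t):
    if isinstance(t, tuple) and t[0] == "Plus" and t[2] == ("Int", 0):
        return {"x": t[1]}
    return None

simplify = choice(rule(match_plus_zero, lambda b: b["x"]), ident)
print(simplify(("Plus", ("Int", 7), ("Int", 0))))  # ('Int', 7)
print(simplify(("Int", 3)))                        # ('Int', 3)
```

The point of the model is that failure is an ordinary value, so the combinators compose without exceptions or backtracking machinery, which mirrors how Stratego's ; and <+ behave.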
Dynamic rules are templates for rules which are instantiated by other rules with information obtained during parsing. This adds the additional power of context-sensitive rules, which is necessary because sometimes you only know what a rule must look like once you have already read parts of the document. There is a dynamic rule intersection operator s1 /R\ s2 "which applies the strategies s1 and s2 sequentially to the current term, each with a copy of the dynamic rule set for R. After applying both strategies, the resulting rulesets for R are intersected." To be honest, I didn't really understand what this means. A much clearer feature are overlays, which are parametrized abbreviations for ATerms and can save you a lot of typing.

Stratego provides a set of traversal strategies like topdown, bottomup, innermost and all, with which you can control how the ATerm tree is traversed in order to apply the rewrite strategies. Additionally, Stratego provides a standard library with arithmetic operations, lists, strings, hash tables, sets, I/O functions and support for command line options - all in the form of strategies, which I find weird. Finally, it is possible to make native C calls through the application of strategies.
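The generic traversals can be modeled in the same function-based style, where a strategy maps a term to a rewritten term or None on failure. Again this is my own sketch over plain tuples, not how Stratego implements them:

```python
def all_sub(s):
    """all(s): apply strategy s to every direct subterm; fail if any fails.
    Terms are tuples ("Constructor", child1, ...); anything else is a leaf."""
    def walk(t):
        if not isinstance(t, tuple):
            return t
        kids = []
        for c in t[1:]:
            r = s(c)
            if r is None:
                return None
            kids.append(r)
        return (t[0], *kids)
    return walk

def topdown(s):            # topdown(s) = s; all(topdown(s))
    return lambda t: (lambda r: all_sub(topdown(s))(r) if r is not None else None)(s(t))

def bottomup(s):           # bottomup(s) = all(bottomup(s)); s
    return lambda t: (lambda r: s(r) if r is not None else None)(all_sub(bottomup(s))(t))

def attempt(s):            # try s, else leave the term unchanged (Stratego's try)
    return lambda t: (lambda r: r if r is not None else t)(s(t))

# Rewrite Plus(x, Int(0)) -> x everywhere in the tree.
def plus_zero(t):
    if isinstance(t, tuple) and t[0] == "Plus" and t[2] == ("Int", 0):
        return t[1]
    return None

term = ("Mul", ("Plus", ("Var", "a"), ("Int", 0)), ("Int", 2))
print(bottomup(attempt(plus_zero))(term))  # ('Mul', ('Var', 'a'), ('Int', 2))
```

Note how topdown and bottomup are just one-line recursions over the generic all; this separation of "what to rewrite" from "how to traverse" is the core idea of strategic programming.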

Format checking

In order to detect bugs in the transformers, it is a good idea to do format checking on the output ATerm. For this purpose XT provides tools for deriving a so-called regular tree grammar from a syntax definition. A format checker then checks a given ATerm against this grammar.
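The idea can be illustrated with a drastically simplified, hypothetical signature check in Python; the real XT tools derive the tree grammar from the SDF definition, and it is more expressive than this table:

```python
# A toy "regular tree grammar": each constructor declares the sorts of its
# children and its result sort. (Hypothetical example signature.)
SIGNATURE = {
    # constructor: (child sorts, result sort)
    "Int":  (["<string>"], "Exp"),
    "Var":  (["<string>"], "Exp"),
    "Plus": (["Exp", "Exp"], "Exp"),
}

def check(term, expected_sort):
    """Return True iff term conforms to SIGNATURE at expected_sort."""
    if isinstance(term, str):
        return expected_sort == "<string>"
    constructor, *children = term
    child_sorts, result = SIGNATURE.get(constructor, ([], None))
    return (result == expected_sort
            and len(children) == len(child_sorts)
            and all(check(c, s) for c, s in zip(children, child_sorts)))

print(check(("Plus", ("Int", "1"), ("Var", "x")), "Exp"))  # True
print(check(("Plus", ("Int", "1")), "Exp"))                # False: arity wrong
```

Running such a check between pipeline stages catches a transformer that emits a malformed term long before the pretty printer chokes on it.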

Pretty printing

After transformation the ATerm must be turned into text again. Stratego/XT calls this step pretty printing, and it is done by the Generic Pretty Printer (GPP) package in two steps. First, an abstract syntax tree in AsFix format is converted into a declarative layout in the Box language. This step is configured by a pretty-printing table which is derived from the grammar specification. The second step is to format the layout description into text, HTML, LaTeX or other display formats. The official GPP homepage has a graphical overview of the GPP process. Pretty-printing is not trivial. For example, the printer must decide which parentheses are important for the semantics and which ones are optional and should be omitted to make the result readable for humans. A second problem is the preservation of whitespace and comments, as abstract syntax trees normally do not include them. Furthermore, it is an unsolved problem how to automatically generate text which humans consider nice. That's why GPP supports additional pretty-printing tables which override the generated behavior, so that manual configuration of the printed result is possible.
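The parentheses problem can be illustrated with a toy precedence-driven printer; this is my own sketch, unrelated to GPP's Box machinery, and it ignores associativity, which a real pretty printer must also handle:

```python
# Print parentheses only where the tree structure could not be recovered
# from operator precedence alone.
PRECEDENCE = {"Plus": 1, "Mul": 2}

def pp(term, outer=0):
    """Render a tuple term; 'outer' is the precedence of the enclosing operator."""
    if term[0] in ("Int", "Var"):
        return str(term[1])
    op = "+" if term[0] == "Plus" else "*"
    prec = PRECEDENCE[term[0]]
    text = f"{pp(term[1], prec)} {op} {pp(term[2], prec)}"
    return f"({text})" if prec < outer else text

# Mul(Plus(a, b), c): only the Plus needs parentheses.
print(pp(("Mul", ("Plus", ("Var", "a"), ("Var", "b")), ("Var", "c"))))  # (a + b) * c
print(pp(("Plus", ("Mul", ("Var", "a"), ("Var", "b")), ("Var", "c"))))  # a * b + c
```

Even in this toy, the decision depends on information (operator priorities) that lives in the grammar, which is why GPP derives its pretty-printing tables from the syntax definition.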

Transformation Tool Composition (XTC)

XTC implements the XT component model and provides support for composing XT components. It is also used to register components in an XTC repository, which makes it possible to find the installation location of a tool without knowing all the installation paths. XTC is not a build system, though. Instead, make, autoconf and automake are used, supported by a tool called Auto XT.

Stratego Shell

The Stratego project provides an interactive shell which supports the complete Stratego language. This is very handy for development because you can experiment with rules and strategies on arbitrary ATerms. Furthermore, the shell is claimed to be usable as a debugger, but I don't know how that works. Finally, there is XTC integration, so you can import and invoke XT components from the shell.

Stratego/XT

The project comes with very extensive and good documentation. The manual, though, is in an unstable state and partially incomplete, so some of the links to the manual may break soon. We'll see. Stratego/XT can be considered mature because it is used in its own implementation. For example, the Stratego compiler, which is a Stratego-to-C transformer, is implemented in Stratego. Implementations of many other Stratego/XT components make heavy use of SDF and Stratego as well. Stratego/XT uses the pipes and filters pattern, which is a good idea in my opinion. Among other things this means that the components are exchangeable: for example, you can feed a Stratego program with an ATerm obtained from a PEG parser. So using SDF is optional, but nevertheless highly recommended. Stratego/XT is grammar-centric. Many of the components of a concrete pipes and filters setup are either themselves generated or use configuration which is generated from a corresponding SDF grammar. If you replace SDF, you have to provide new generators for that purpose, too. A disadvantage of the pipes and filters pattern is the inefficiency of constantly dumping, copying and reading the data at the pipes. To overcome this, the currently preferred approach of Stratego/XT is that every component forms a shared library which can be linked into a single application.

Conclusion

Investigating Stratego has been very illuminating for me. You can see which components a meta-programming system consists of, which problems must be solved and how they can be solved. But in my opinion, the overall Stratego/XT system in general and most of its components in particular are too complicated. For example, have a look at the tool reference, which lists 43 different tools operating on various formats and languages. The Stratego language itself is also very complex and hard to read. Maybe this is because I am new to it, but I constantly have to run a Stratego interpreter in my brain in order to understand what a program does. And often enough I don't get it. I am not saying that Stratego/XT could be done leaner. But I think that the crucial requirement for MDE is that specifying a new DSL and implementing proper transformers must be easy. The whole advantage of the approach is gone if introducing DSLs is as cumbersome as writing the to-be-generated code manually. I think the approach of Stratego/XT to program transformation is too generic to be a good tool for MDE. The Stratego project itself has a different opinion: at this year's Code Generation conference there was a hands-on tutorial entitled Creating Domain-Specific Languages with Stratego/XT. So maybe I am wrong. I will see whether other meta-programming systems are easier.
