LARA

Introduction

This week you will work on the second stage of the Amy compiler, the parser. The task of the parser is to take a sequence of tokens produced by the lexer and transform them into an Abstract Syntax Tree (AST) of Amy.

The strategy we will implement is the following: in the first stage, you will write a grammar for Amy in a specifically designed Domain Specific Language (DSL). This grammar will be used by an automated tool (the same you can access online at http://grammar.epfl.ch) to parse the sequence of tokens into parse trees. These parse trees do not correspond to Amy ASTs: they are a generic tree-like structure which the grammar tool understands. A tree of this sort basically tells us which grammar rules were invoked in each case and what terminal and nonterminal symbols they used. In fact, they represent the concrete syntax of Amy: e.g. all commas, parentheses, and brackets are there. The next stage of parsing is then to traverse this parse tree and transform it into an Amy AST, which really keeps only the information we need to represent.

As you have seen in the lectures, there are multiple algorithms to parse syntax trees corresponding to context free grammars. Any context free grammar (after some normalization) can be parsed with the CYK algorithm. However, this algorithm is too slow: its complexity is in O(n^3 * g) where n the size of the program and g the size of the grammar. This will not scale beyond a few tens of lines of code. On the other hand, a more restricted LL(1) grammar can be parsed in linear time. Thus it is desirable that we develop an LL(1) version of the Amy grammar.

The grammar DSL

We have developed a tool to analyse and generate parsers for grammars, which is given to you in the form of a JAR file. Among other things, this tool provides a DSL to define context-free grammars. Look at Parser.scala. It contains a field called amyGrammar which is a complete grammar of Amy, very similar to the one presented in the Amy specification page. The major difference is that we are not using EBNF.

A grammar is defined as a starting nonterminal symbol and a list (as in Scala List) of productions.

val amyGrammar = Grammar('Program, List[Rules[Token]](
  ...
))

Each production is a nonterminal, followed by the operator ::=, followed by a list of production rules separated with the operator |.

'ElseOpt ::= ELSE() ~ 'Statement | epsilon()

In turn, each production rule is a sequence of terminal and nonterminal symbols, separated with the operator ~.

ELSE() ~ 'Statement

Nonterminal symbols of the grammar are represented with Scala Symbol literals. You don't have to know what exactly that is, except that you can write them with a single quote ' , followed by a Scala identifier. E.g. 'Program. The grammar is polymorphic in the type of terminal symbols which, in our case, are tokens generated by the lexer. epsilon() stands for the empty production.

You will notice three strange tokens in the grammar called IDSENT, INTLITSENT and STRINGLITSENT, which we refer to as sentinel tokens. Unlike other tokens, the sentinel tokens represent token classes. Recall that the tokens ID, INTLIT and STRINGLIT store some content along with their kind, which is the name of the identifier or the value of the literal. Also notice that the rules of the grammar have to be agnostic to the content of these tokens. This is why we use the sentinel tokens in the grammar. The sentinel token IDSENT matches every token ID(name) regardless of the name. INTLITSENT matches any integer literal, and similarly STRINGLITSENT matches any string literal.

First task: LL(1) grammar

amyGrammar is a very simplistic grammar, and as it turns out it is too imprecise to be useful for parsing. Firstly, it is ambiguous. That is, it allows multiple ways to parse an expression. E.g. x + y * z could be parsed as either (x + y) * z or as x + (y * z). In other words, the grammar doesn't enforce either operator precedence or associativity correctly. Additionally, restrictions mentioned in the Amy specification are not followed. In the source files that we give you, below amyGrammar you will find amyGrammarLL1, which is for now blank. Your first task is to modify the given grammar and transform it into LL(1). Before starting, make sure you have read the Syntax section of the Amy specification carefully. Unfortunately, one other thing you cannot capture with LL(1) grammars is left associativity (why?). In other words, given an expression: x - y - z, your LL(1) grammar, however you modify it, will necessarily parse it as x - (y - z). So you need to handle left associativity at the time of AST construction. (We will explain this phase shortly.)

The grammar tool will read your grammar, examine if it is in LL(1), and parse input programs with it. It will use the LL(1) parser if your grammar is LL(1), otherwise it will use the CYK parser. In the latter case you get a warning. All this is implemented in Parser.run.

Transforming parse trees to ASTs

The grammar tool will parse a sequence of tokens according to the grammar you provide, however it does not know how to build Amy ASTs. Instead, it returns generic parse trees. Your second task is to traverse these trees and produce Amy programs out of them. This is the work of ASTConstructor.

The grammar tool returns a parse tree, which is of a type called NodeOrLeaf, and as the name suggests it has two constructors (sub-classes), Node and Leaf.

A Leaf of the parse tree contains just a terminal symbol, i.e. a Token.

A Node has two fields: a Rule, which is the production rule used to parse this Node, and a list of children, which is the list of terminal and nonterminal symbols parsed by the right-hand side of the rule. E.g. the rule

'AbstractClassDef ::= ABSTRACT() ~ CLASS() ~ 'Id,

will produce a Node which will contain this very rule and a list of 3 children, 2 of which will be Leaves and 1 of which will be a Node. The Node contains another Rule, and its own sub-Nodes and Leaves etc.

You can pattern match against a Node or a Leaf as shown below:

case Node('AbstractClassDef ::= _, List(Leaf(abs), _, name)) =>

Here you can ignore the right-hand side of the rule ('AbstractClassDef only has one rule, so we know what it is going to be) as well as most terminal symbols (stored inside leaves) on the children list, because you wouldn't need them to construct the AbstractClassDef AST. (Here we maintain the first Leaf only to read the Position of its terminal.) However, you can not ignore children that are leaves if they contain useful information that are needed by the ASTs. For example,

case Node('Id ::= _, List(Leaf(id@ID(name)))) => // we need the value of id

Second task: Adapt AST extraction for the LL(1) grammar

In the ASTConstructor file you are given a complete implementation of the conversion to AST's of Amy corresponding to the grammar you are given (amyGrammar). Each different tree is parsed by its own individual function. Another class, ASTConstructorLL1 extends the former one; it is empty for now. Your second task is to complete ASTConstructorLL1 by overriding methods as needed, so as to adapt the parse tree traversal to your new grammar. Feel free to add more methods as needed. Make sure to implement left associativity correctly: Since it cannot be imposed by the LL(1) grammar itself, use the helper methods provided both in ASTConstructorLL1 and the base class. Remember that as an exception, you can skip left associativity for semicolon ;.

Hint: Notice that the parse trees you get are valid parse trees belonging to the grammar you wrote. So you can exploit your knowledge of the grammar to simplify your code. E.g. when you are traversing a node corresponding to a nonterminal Expr you know that its children should correspond to one of the right-hand-sides of Expr.

Notes

Explanation of the ASTs

If you check the TreeModule file containing the ASTs, you will notice it is structured in an unusual way: There is a TreeModule class extended by two different TreeModules, Nominal and Symbolic. The reason for this design is that we need two very similar ASTs, but with different types representing names in each case: Just after parsing (this assignment), all names are just Strings and qualified names are essentially pairs of Strings. We call ASTs that only use such String-based names Nominal – the variant we will be using in this lab. Later, when implementing name analysis, we will resolve these names to unique identifiers, e.g. two variables that refer to different definitions will be distinct, even if they have the same name. We will come back to this and discuss the benefits of such distinctions in the next lab. For now you can just look at the TreeModule and substitute the types that are not defined there (Name and QualifiedName) with their definitions inside NominalTreeModule.

Positions

As you will notice in the code we provide, all generated ASTs have their position set. The position of each node of the AST is defined as its starting position. It is important that you set the positions in all the trees that you create for better error reporting later. Although the tests do not check for presence of positions, we will check it manually.

Printer

Along with the stubs, we provide a printer for Amy ASTs. It will print parentheses around all expressions so you can clearly see how your parser interprets precedence and associativity. You can use it to test your parser, and it will also be used during our testing to compare the output of your parser with the reference parser.

Skeleton

As usual, you can find the skeleton on Courseware. This lab builds on your previous work, so assuming you already implemented the lexer you will only unpack several files from the skeleton.

The structure of your project src directory should be as following:

amyc
 ├── Main.scala                (updated)
 │
 ├── ast                         
 │    ├── Identifier.scala
 │    ├── Printer.scala
 │    └── TreeModule.scala
 │
 ├── parsing
 │    ├── ASTConstructor.scala    (new)
 │    ├── ASTConstructorLL1.scala (new)
 │    ├── Parser.scala            (new)
 │    ├── Lexer.scala
 │    └── Tokens.scala
 │
 └── utils
      ├── AmycFatalError.scala
      ├── Context.scala
      ├── Document.scala
      ├── Pipeline.scala
      ├── Position.scala
      ├── Reporter.scala
      └── UniqueCounter.scala

Deliverables

You have 3 weeks to complete this assignment.

Deadline: Tuesday 30 October, 23:59

Please use the Courseware interface to submit your solution and get some preliminary feedback from our automated tests.