Labs 03: Parser
Introduction
This week you will work on the second stage of the tool compiler, the parser. The task of the parser is to take a sequence of tokens produced by the lexer and transform them into an Abstract Syntax Tree (AST) of Tool.
The strategy we will implement is the following: in the first stage, you will write a grammar for Tool in a specifically designed Domain Specific Language (DSL). This grammar will be used by an automated tool provided by us to parse the sequence of tokens into parse trees. These parse trees do not correspond to Tool ASTs: they are a generic tree-like structure which corresponds to grammar productions and terminal symbols. Also, they correspond to the concrete syntax of Tool as opposed to the abstract one: They contain the full syntax of Tool, which we don't need to explicitly represent in an AST: E.g. all commas, parentheses, and brackets are there. The next stage of parsing is then to traverse this parse tree and transform it into a Tool AST.
As you have seen in the lectures, there are multiple algorithms to parse syntax trees corresponding to context free grammars. Any context free grammar (after some normalization) can be parsed with the CYK algorithm. However, this algorithm is too slow: its complexity is in O(n^3 * g) where n the size of the program and g the size of the grammar. This will not scale beyond a few tens of lines of code. On the other hand, a more restricted LL(1) grammar can be parsed in linear time. Thus it is desirable that we develop an LL(1) version of the Tool grammar.
LL(1) grammar for Tool
Unfortunately, transforming a grammar to LL(1) is hard and indeed sometimes impossible unless we accept some restrictions to our language.
This is the case for Tool. The problem is nested if-statements without an else-clause, e.g. expressions of the form if (c1) if (c2) s1
. There is no way to parse such nested if-statements with an LL(1) parser, no matter what we do. This is why we introduce a restriction in Tool: It is not allowed to have an if
without an else
within the “then” case of another if
statement, or within a while
. This is not too restrictive because we can always use braces.
Examples:
if (c1) // parse error if (c2) s1
while (c1) // parse error if (c2) s1
if (c1) // ok, else always belongs to the inner if if (c2) s1 else s2
if (c1) // ok, if without else is within the else case s1 else if (c2) s2
while (c1) // ok, inner if has an else if (c2) s1 else s2
if (c1) { // ok, braces are used if (c2) s1 }
while (c1) { // ok, braces are used if (c2) s1 }
The assignment
This week, you are given a complete grammar of Tool that is not in LL(1), along with a traverser that transforms parse trees generated by that grammar to Tool ASTs. Your task is to modify the grammar into an LL(1) grammar and update the traverser to match the parse trees generated by the new grammar.
The grammar DSL
We have developed a tool to analyse and generate parsers for grammars, which will be given to you in the form of a JAR file. Among other things, this tool provides a DSL to define context-free grammars. Look at Parser.scala
. It contains a field called toolGrammar
which is a complete grammar of Tool, very similar to the one presented in the Tool specification page. The major differences is that we are not using EBNF, and we have incorporated the restriction to if
-statements described before in the definition of statements.
A grammar is defined as a starting nonterminal symbol and a list (as in Scala List) of productions.
val toolGrammar = Grammar('Program, List[Rules[Token]]( ... ))
Each production is a nonterminal, followed by the operator ::=
, followed by a list of production rules separated with the operator |
.
'ElseOpt ::= ELSE() ~ 'Statement | epsilon()
In turn, each production rule is a sequence of terminal and nonterminal symbols, separated with the operator ~
.
ELSE() ~ 'Statement
Nonterminal symbols of the grammar are represented with Scala Symbol literals. You don't have to know what exactly that is, except that you can write then with a single quote '
, followed by a Scala identifier. E.g. 'Program
. Terminal symbols for this language will of course be the Token
s generated by the lexer. epsilon()
stands for the empty production.
You will notice three strange tokens in the grammar called IDSENT
, INTLITSENT
and STRINGLITSENT
, which we refer to as sentinel tokens.
Unlike other tokens, the sentinel tokens represent token classes.
Recall that the tokens ID
, INTLIT
and STRINGLIT
store some content along with their kind, which is the name of the identifier or the value of the literal. Also notice that the rules of the grammar have to be agnostic to the content of these tokens. This is why we use the sentinel tokens in the grammar.
The sentinel token IDSENT matches every token ID(name)
regardless of the name.INTLITSENT
matches any integer literal, and similarly STRINGLITSENT
matches any string literal.
toolGrammar
is a very simplistic grammar, and as it turns out it is too imprecise to be useful for parsing.
Firstly, it is ambiguous. That is, it allows multiple ways to parse an expression. E.g. x + y * z could be parsed as either (x + y) * z or as
x + (y * z). In other words, the grammar doesn't enforce either operator precedence or associativity correctly. In the source files that we give you,
below the toolGrammar
you will find ll1Grammar
, which is for now identical to toolGrammar
. Your first task is to modify this grammar and transform it into LL(1). Make sure you implement Tool operator precedence correctly.
Unfortunately, one other thing you cannot capture with LL(1) grammars is left associativity (why?). In other words, given an expression:
x - y - z, your LL(1) grammar, howsoever you modify it, will necessarily parse it as x - (y - z). So you need to handle left associativity at the time of AST construction. (We will explain this phase shortly.)
The grammar tool will read your grammar, examine if it is in LL(1), and parse input programs with it. It will use the LL(1) parser if your grammar is LL(1), otherwise it will use the CYK parser. All this is implemented in Parser.run
.
Transforming to Tool programs
The grammar tool will parse a sequence of tokens according to the grammar you provide, however it does not know how to build Tool ASTs. Instead, it returns generic parse trees. Your second task is to traverse these trees and produce Tool programs out of them.
A parse tree is of a type called NodeOrLeaf, and as the name suggests it has two constructors (sub-classes) Node and Leaf.
A Leaf of the parse tree contains just a terminal symbol, i.e. a Token.
A Node has two fields: a Rule, which is the production rule used to parse this Node, and a list of children, which is the list of terminal and nonterminal symbols parsed by the right-hand side of the rule. E.g. the rule
'MainObject ::= PROGRAM() ~ 'Identifier ~ LBRACE() ~ 'Stmts ~ RBRACE()
will produce a Node which will contain this very rule and a list of 5 children, 3 of which will be Leaves and 2 of which will be Nodes. The Nodes will contain more Rules, Nodes and Leaves etc.
You can pattern match against a Node or a Leaf as shown below:
case Node('MainObject ::= _, List(_, objid, _, stmts, _)) =>
Here you can ignore the right-hand side of the rule ('MainObject
only has one rule, so we know what it is going to be) as well as the terminal symbols (stored inside leaves) on the children list, because you wouldn't need them to construct the MainObject
AST. However, you can not ignore children that are leaves if they contain useful information that are needed by the ASTs. For example,
case Node('Expression ::= List(INTLITSENT), List(Leaf(INTLIT(i)))) => // we need the value of i
In the ASTConstructor
file you are given a complete implementation of the conversion to AST's of Tool corresponding to the grammar you are given (toolGrammar
). Each different tree is parsed by its own individual function. Another class, ASTConstructorLL1
extends the former one; it is empty for now. Your second task is to complete ASTConstructorLL1
by overriding methods as needed, so as to adapt the parse tree traversal to your new grammar. Feel free to add more methods as needed. Make sure to implement left associativity correctly, as it is cannot be imposed by an LL(1) grammar.
Hint: Notice that the parse trees you get are valid parse trees belonging to the grammar you wrote. So you can exploit your knowledge of the grammar to simplify your code. E.g. when you are traversing a node corresponding to a nonterminal Statement
you know that its children should correspond to one of the right-hand-sides of Statement
.
Positions
As you will notice in the code we provide, all generated ASTs have their position set. The position of each node of the AST is defined as its starting position. It is important that you set the positions in all the trees that you create for better error reporting later. Although the tests do not check for presence of positions, we will check that manually.
Printer
Along with the stubs, we provide a printer for Tool ASTs. It will print parentheses around all expressions so you can clearly see how your parser interprets precedence and associativity. You can use it to test your parser, and it will also be used during our testing to compare the output of your parser with the reference parser.
Stubs
Merge the stubs of the parser into your main branch by typing
git fetch
git merge origin/Lab03
(assuming that origin
is the name of your remote repository).
The structure of your project src
directory should be as following:
toolc ├── Main.scala (updated this week) │ ├── lexer │ ├── Lexer.scala │ └── Tokens.scala (updated this week) │ ├── eval │ └── Evaluator.scala │ ├── ast │ ├── ASTConstructor.scala (new) │ ├── ASTConstructorLL1.scala (new) │ ├── Parser.scala (new) │ ├── Printer.scala (new) │ └── Trees.scala │ └── utils ├── Context.scala ├── Positioned.scala ├── Reporter.scala └── Pipeline.scala
Also, a new JAR file named grammarcomparison_2.11-0.1.jar
has been added to the lib/
directory. Although sbt will see this file automatically, you may have to explicitly add it to your dependencies if you are using an IDE.
Deliverable
You are given 2 weeks for this assignment. Please choose a commit from your git repository as a deliverable on our server before Tuesday, 25 October, 23:59.