Lab 02: Lexer for Amy
This assignment is the first stage of the Amy compiler.
Code Stubs
To merge the stubs for the Lexer into your current branch (which should contain a working Interpreter by now), use
git fetch origin
git merge origin/Lab02
Notes
- We are no longer going to use the parser + trees given in `lib/amyc-frontend_2.12-1.0.jar`. This commit removes this file.
- Because `Frontend` is no longer available, the `InterpreterTests` do not make sense, so they are removed. This can create conflicts with your assignment if you changed them. If so, run
  git rm test/scala/amyc/test/InterpreterTests.scala
  git rm test/resources/interpreter
  git commit
The structure of your `src` directory should be as follows:
amyc
├── Main.scala (updated)
│
├── analyzer (new, for later)
│   └── SymbolTable.scala
│
├── ast (new, for later)
│   ├── Identifier.scala
│   ├── Printer.scala
│   └── TreeModule.scala
│
├── interpreter
│   └── Interpreter.scala
│
├── parsing (new)
│   ├── Lexer.scala
│   └── Tokens.scala
│
└── utils (new)
    ├── AmycFatalError.scala
    ├── Context.scala
    ├── Document.scala
    ├── Pipeline.scala
    ├── Position.scala
    ├── Reporter.scala
    └── UniqueCounter.scala
The `ast` and `analyzer` directories will be used later, but are included as dependencies of the interpreter.
This lab will focus on the following two files:
- `parsing/Tokens.scala`: list of tokens and token classes.
- `parsing/Lexer.scala`: stub for the `Lexer` phase.
A Lexer for Amy
The role of a lexer is to read the input text and convert it to a list of tokens. Tokens are the smallest useful units in a source file: a name referring to a variable, a bracket, a keyword, etc. The role of the lexer is to group together those useful units (e.g. return the keyword `else` as one unit, as opposed to the individual characters `e`, `l`, `s`, `e`) and to abstract away all useless information (i.e. whitespace and comments).
Code structure
You can find the lexer in the `Lexer.scala` file. Its implementation is based on Streams, which are lazy sequences of elements: their elements are not all produced at the creation of the collection, but on demand. Unlike iterators, Streams are immutable, i.e. elements are not consumed when they are read, but remain in the Stream. You can find the Scala Stream API here.
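As a small, self-contained illustration (not part of the stubs), the two properties that matter for the lexer are that Stream elements are only computed when demanded, and that reading them does not consume them:

```scala
object StreamDemo extends App {
  // An infinite, lazily evaluated stream of integers starting at n.
  def from(n: Int): Stream[Int] = n #:: from(n + 1)

  val nats = from(0)
  println(nats.head)           // 0: only the head is ever evaluated
  println(nats.take(5).toList) // List(0, 1, 2, 3, 4)
  println(nats.head)           // still 0: nothing was consumed by the previous reads
}
```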
The Lexer has the following components:
- The public method is `run`. It just calls `lexFile` for every input file and concatenates the results.
- `lexFile` lexes one input file. It takes as input a Stream of `(Char, Position)` pairs (which we generate for you from the input via the `Source` Scala library). `lexFile` calls `tokenStream`, which in turn calls `nextToken` until the input is empty.
- `nextToken` is essentially a wrapper method around `readToken`. It removes whitespace and comments and then calls `readToken`. This recursive method has to be tail-recursive so it does not stack overflow on long inputs (this is enforced with `@scala.annotation.tailrec`).
- `readToken` is the main function of the lexer. It looks at the current character and decides what kind of token it has to emit, possibly reading more characters along the way. As said, your input is an immutable Stream, so you don't have to worry about consuming characters; you can peek at them at will. Look at the utilities at the beginning of `readToken` and `nextToken`.
- Finally, the `keywords` method returns the keyword token corresponding to a string, or `None` if there is no corresponding keyword (see the sketch after this list).
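To make that last description concrete, a keyword lookup might look roughly like the following sketch. Apart from `ELSE()`, the token class names used here (`IF()`, `VAL()`) are assumptions about `Tokens.scala`; adapt them to the actual stubs.

```scala
// Hypothetical sketch of a keyword lookup, not the stub's actual code.
def keywords(word: String): Option[Token] = word match {
  case "else" => Some(ELSE())
  case "if"   => Some(IF())
  case "val"  => Some(VAL())
  // ... one case per remaining Amy keyword ...
  case _      => None // not a keyword: the caller emits an identifier token instead
}
```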
It is your task to complete `nextToken` and `readToken`.
Notes
Here are some details you should pay attention to:
- Make sure you recognize keywords as their own token type. `else`, for instance, should be lexed as a token `ELSE()`, not as an identifier with the content “else”.
- Make sure you correctly register the position of all tokens. Note the `currentPos` on the head of the input stream and the utility methods `useOne` and `useTwo`.
- In general, it is good to output as many errors as possible (this helps whoever uses your compiler). This means your lexer should not give up after the first error, but rather skip the erroneous token, emit an error message, and then continue lexing. Use the `BAD` token to mark erroneous tokens. However, make sure to also call the `error` method inside `ctx.reporter`: this will register the error and eventually fail your program.
- The Lexer does not immediately read and return all tokens; it returns a `Stream[Token]` that will be used by future phases to read tokens on demand.
- Comments should not produce tokens.
- Returned tokens should be fresh instances of the appropriate Token subclass. Value tokens (tokens that carry a value, such as identifiers) need to be constructed with the appropriate value.
- You should apply the longest-matching rule: `elsefoo` should become `Identifier(elsefoo)` and not `ELSE; Identifier(foo)` (see the sketch after this list).
- Make sure to correctly implement the Amy lexing rules for literals and identifiers.
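The longest-matching rule can be illustrated with a small self-contained sketch (the names below are illustrative, not the stub's API): read the longest run of identifier characters first, and only then check whether the resulting word is a keyword.

```scala
object LongestMatchDemo extends App {
  val keywordSet = Set("else", "if", "val") // an assumed subset of Amy's keywords

  // Take the longest run of letters, digits and underscores from the input.
  def lexWord(input: Stream[Char]): (String, Stream[Char]) = {
    val (wordChars, rest) = input.span(ch => ch.isLetterOrDigit || ch == '_')
    (wordChars.mkString, rest)
  }

  val (word, _) = lexWord("elsefoo()".toStream)
  // "elsefoo" is not in the keyword set, so it must be lexed as an identifier.
  if (keywordSet(word)) println(s"keyword: $word")
  else println(s"identifier: $word") // prints "identifier: elsefoo"
}
```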
Handling Errors
Your compiler needs to exit with a non-0 code in case it encounters an error. Bad tokens resulting from invalid characters count as such errors.
The `Reporter`, available in the `ctx` parameter of the `run` method, provides an `error` method which records in the Reporter the fact that an error has occurred. `Reporter` also provides the method `terminateIfErrors`, which terminates the compiler if an error has occurred before. This method is already called between each phase of the compiler. To make sure that the compiler also terminates with a non-0 code in case the *last* phase emits an error, it is also called at the end of `Main.main`.
(Notice that, since your lexer phase returns its tokens lazily, errors might only be emitted by your lexer while the tokens are consumed during the DisplayToken phase.)
In short, make sure you call `error` whenever you encounter an erroneous character, so that your compiler will eventually exit with an error code. For the lexing phase, avoid `fatal`, as it terminates the compiler immediately (and would not allow you to report multiple errors).
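To make the error-recovery idea concrete, here is a small self-contained sketch; the names (`reportError`, `Bad`, ...) are toy stand-ins, not the stub's API. The point is that every invalid character is reported, lexing continues, and the program exits with a non-zero code only at the very end.

```scala
object ErrorRecoveryDemo extends App {
  sealed trait Tok
  case class Ident(name: String) extends Tok
  case object Bad extends Tok

  var errorCount = 0
  def reportError(msg: String): Unit = { errorCount += 1; Console.err.println(msg) }

  def lex(input: List[Char]): List[Tok] = input match {
    case Nil => Nil
    case c :: _ if c.isLetter =>
      val (word, remaining) = input.span(_.isLetterOrDigit)
      Ident(word.mkString) :: lex(remaining)
    case c :: rest if c.isWhitespace =>
      lex(rest)
    case c :: rest =>
      reportError(s"Invalid character: '$c'") // do not give up: record the error...
      Bad :: lex(rest)                        // ...emit a placeholder token and keep lexing
  }

  println(lex("foo # bar".toList))            // List(Ident(foo), Bad, Ident(bar))
  if (errorCount > 0) sys.exit(1)             // non-zero exit code if any error occurred
}
```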
Example Output
For reference, here is a possible output for the example under `examples/Hello.scala`. You can always get reference output for the lexer from the reference compiler by typing
java -jar amyc_2.12-1.X.jar --printTokens <files>
OBJECT()(1:1)
ID(Hello)(1:8)
EXTENDS()(1:14)
APP()(1:22)
LBRACE()(1:26)
ID(Std)(2:3)
DOT()(2:6)
ID(printString)(2:7)
LPAREN()(2:18)
STRINGLIT(Hello )(2:19)
CONCAT()(2:28)
STRINGLIT(world!)(2:31)
RPAREN()(2:39)
RBRACE()(3:1)
EOF()(3:2)
Deliverable
You are given 2 weeks for this assignment.
Deadline: Tuesday, Oct. 10, 23:59.
You must use the interface on https://larasrv13.epfl.ch/clp17 as follows:
- Click on the “Deliverables” tab
- Click “Deliver this version” on the commit corresponding to the version of your code you want to submit