Lab 02: Lexer
This assignment covers the first stage of the Amy compiler: the lexer.
Code Skeleton
In this lab you will start your own compiler from scratch, meaning that you will no longer rely on the compiler frontend which was previously provided to you as a jar file. In this lab you will build the lexical analysis phase (“lexer”). Since none of the compiler's code will be shared with the previous lab, you can extract the skeleton zip into a new directory (and ideally set up version control).
NOTE: If for some reason you'd like to keep the existing directory (for instance because you're maintaining each lab on a separate branch in the same repository), you should remove the following old files and subdirectories:
- lib/amyc-frontend-1.X.jar
- src/amyc/interpreter/
- test/scala/amyc/test/InterpreterTests.scala
- test/resources/interpreter/
In any case, before you start working, the structure of your src
directory should be as follows:
```
amyc
 ├── Main.scala                        (updated)
 │
 ├── lib
 │    ├── amy-frontend_2.12-1.6.jar   (removed)
 │    └── scallion_2.12-0.3.jar       (new)
 │
 ├── parsing                           (new)
 │    ├── Lexer.scala
 │    └── Tokens.scala
 │
 └── utils                             (new)
      ├── AmycFatalError.scala
      ├── Context.scala
      ├── Document.scala
      ├── Env.scala
      ├── Pipeline.scala
      ├── Position.scala
      ├── Reporter.scala
      └── UniqueCounter.scala
```
This lab will focus on the following two files:

- src/amyc/parsing/Tokens.scala: the list of tokens and token kinds.
- src/amyc/parsing/Lexer.scala: the skeleton for the Lexer phase.
A Lexer for Amy
The role of a lexer is to read the input text and convert it to a list of tokens. Tokens are the smallest useful units in a source file: a name referring to a variable, a bracket, a keyword, etc. The lexer groups those useful units together (e.g. it returns the keyword else as a single unit, as opposed to the individual characters e, l, s, e) and abstracts away all useless information (i.e. whitespace, comments).
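For instance, using the token names from the Example Output section at the end of this document, the single Amy statement Std.printString("Good morning!") is lexed into the token sequence:

```
IdentifierToken(Std) DelimiterToken(.) IdentifierToken(printString)
DelimiterToken(() StringLitToken(Good morning!) DelimiterToken())
```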
Code structure
You can find the lexer in the Lexer.scala file. Its implementation is based on Scallion, a Scala library for implementing parsing pipelines. Part of Scallion allows you to transform an input character stream (such as the contents of an Amy source file) into a sequence of Tokens. You can find more information on Scallion here, but we have also included a short reference of its tokenization API in Lexer.scala.
The Lexer has the following components:
- The public method is run. It simply calls lexer.spawn(source) for every input file and concatenates the results.
- lexer is the Scallion-based definition of the tokenization rules. Each rule corresponds to a regular expression matching a prefix of the remaining program input. Scallion composes all of these rules into one finite-state machine and applies the maximum-munch rule you have seen in class.
- Whenever a rule matches a (maximal) prefix of the remaining input, Scallion calls the transformation function attached to that rule with the |> operator. This function receives the matched input characters (cs) along with positional information (range) and should produce an instance of Token. You can find the definition of Token in Tokens.scala, which lists all the different kinds of tokens that your Amy compiler should process. For instance, KeywordToken("if") represents an occurrence of the reserved word if in a program. A sketch of what such rules might look like is shown below.
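To make the rule shape concrete, here is a minimal sketch of a keyword rule and an identifier rule. The combinator names used below (word, elem, many) and the onError hook are assumptions based on Scallion's tokenization API; the short reference at the top of Lexer.scala is authoritative.

```scala
// Sketch only: word, elem, many and onError are assumed combinator names;
// consult the API reference at the top of Lexer.scala for the exact ones.
val lexer = Lexer(
  // Keywords: the matched characters cs become a KeywordToken, and
  // range._1 records where the token starts in the source.
  word("if") | word("else") | word("match")
    |> { (cs, range) => KeywordToken(cs.mkString).setPos(range._1) },

  // Identifiers: a letter followed by letters, digits or underscores.
  elem(_.isLetter) ~ many(elem(c => c.isLetterOrDigit || c == '_'))
    |> { (cs, range) => IdentifierToken(cs.mkString).setPos(range._1) }
) onError {
  // Input that matches no rule becomes an ErrorToken, so the lexer can
  // report the error and continue instead of giving up.
  (cs, range) => ErrorToken(cs.mkString).setPos(range._1)
}
```

Maximum munch then guarantees that, for example, the input iffy is lexed as one identifier rather than as the keyword if followed by fy. (Ties between rules matching prefixes of equal length are typically broken by rule order, which is why keywords are listed before identifiers.)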
For more details on how to write new rules, read the short introduction to Scallion's API at the top of Lexer.scala, or consult the Scallion website.
Your task is to complete the rules in Lexer.scala
and implement the filtering of irrelevant tokens.
Notes
Here are some details you should pay attention to:
- Make sure you recognize keywords as their own token kind. if, for instance, should be lexed as the token KeywordToken("if"), not as an identifier with the content "if".
- Make sure you correctly register the position of all tokens. Note the range parameter of the transformer functions. Once you have created a token, use setPos(range._1) to associate it with its position in the program source.
- In general, it is good to output as many errors as possible (this will be helpful to whoever uses your compiler). Your lexer should therefore not give up after the first error, but rather skip the erroneous token, emit an error message, and continue lexing. Scallion takes care of this for you for the most part. However, there are certain inputs that you might explicitly want to map to ErrorToken, such as unclosed multi-line comments.
- The lexer does not immediately read and return all tokens; it returns an Iterator[Token] that later phases will use to read tokens on demand.
- Comments and whitespace should not produce tokens. (The most convenient way of doing this in Scallion is to first produce dedicated tokens and then filter them out later; see the related TODO in Lexer.scala and the sketch after this list.)
- Returned tokens should be fresh instances of the appropriate Token subclass. Value tokens (tokens that carry a value, such as identifiers) need to be constructed with the appropriate value.
- Make sure to correctly implement the Amy lexing rules for literals and identifiers.
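As a sketch of the comment and whitespace filtering mentioned above: assuming Tokens.scala defines dedicated token classes for them (CommentToken and SpaceToken below are hypothetical names; check the actual definitions), the filtering can be a simple predicate over the token iterator:

```scala
// CommentToken and SpaceToken are hypothetical names; see Tokens.scala
// for the actual token classes defined by the skeleton.
def filterIrrelevant(tokens: Iterator[Token]): Iterator[Token] =
  tokens.filter {
    case CommentToken(_) => false // drop comments
    case SpaceToken()    => false // drop whitespace
    case _               => true  // keep all other tokens
  }
```

Since filter on an Iterator is itself lazy, this preserves the on-demand behavior described above.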
Example Output
For reference, here is a possible output for the example under examples/Hello.scala
. You can always get reference output for the lexer from the reference compiler by typing
```
java -jar amyc_2.12-1.X.jar --printTokens <files>
```
```
KeywordToken(object)(1:1)
IdentifierToken(Hello)(1:8)
DelimiterToken({)(1:14)
IdentifierToken(Std)(2:3)
DelimiterToken(.)(2:6)
IdentifierToken(printString)(2:7)
DelimiterToken(()(2:18)
StringLitToken(Good morning!)(2:19)
DelimiterToken())(2:34)
DelimiterToken(})(3:1)
EOFToken()(4:1)
```
Deliverable
You are given 2 weeks for this assignment.
Deadline: Wednesday, Oct. 9, at 23:00.
Please use the Courseware interface to submit your solution and get some preliminary feedback from our automated tests.