Lab 02: Lexer
This assignment covers the first stage of the Amy compiler: the lexer.
Code Skeleton
In this lab you will start your own compiler from scratch, meaning that you will no longer rely on the compiler frontend which was previously provided to you as a jar file. In this lab you will build the lexical analysis phase (“lexer”). Since none of the compiler's code will be shared with the previous lab, you can extract the skeleton zip into a new directory (and ideally set up version control).
NOTE: If for some reason you'd like to keep the existing directory (for instance because you're maintaining each lab on a separate branch in the same repository), you should remove the following old files and subdirectories:
- lib/amyc-frontend-1.X.jar
- src/amyc/interpreter/
- test/scala/amyc/test/InterpreterTests.scala
- test/resources/interpreter/
In any case, before you start working, the structure of your src
directory should be as follows:
```
amyc
 ├── Main.scala                        (updated)
 │
 ├── lib
 │    ├── amy-frontend_2.12-1.6.jar   (removed)
 │    └── scallion_2.12-0.3.jar       (new)
 │
 ├── parsing                           (new)
 │    ├── Lexer.scala
 │    └── Tokens.scala
 │
 └── utils                             (new)
      ├── AmycFatalError.scala
      ├── Context.scala
      ├── Document.scala
      ├── Env.scala
      ├── Pipeline.scala
      ├── Position.scala
      ├── Reporter.scala
      └── UniqueCounter.scala
```
This lab will focus on the following two files:

- src/amyc/parsing/Tokens.scala: the list of tokens and token kinds.
- src/amyc/parsing/Lexer.scala: the skeleton for the Lexer phase.
A Lexer for Amy
The role of a lexer is to read the input text and convert it to a list of tokens. Tokens are the smallest useful units in a source file: a name referring to a variable, a bracket, a keyword, etc. The lexer groups those useful units together (e.g. it returns the keyword else as a single unit, as opposed to the individual characters e, l, s, e) and abstracts away all useless information (i.e. whitespace, comments).
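For instance, using the token names from the Example Output section at the end of this document, the single Amy statement Std.printString("Good morning!") is lexed into the token sequence:

```
IdentifierToken(Std) DelimiterToken(.) IdentifierToken(printString)
DelimiterToken(() StringLitToken(Good morning!) DelimiterToken())
```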
Code structure
You can find the lexer in the Lexer.scala file. Its implementation is based on Scallion, a Scala library for implementing parsing pipelines. Part of Scallion allows you to transform an input character stream (such as the contents of an Amy source file) into a sequence of Tokens. You can find more information on Scallion here, but we have also included a short reference of its tokenization API in Lexer.scala.
The Lexer has the following components:
- The public method is run. It simply calls lexer.spawn(source) for every input file and concatenates the results.
- lexer is the Scallion-based definition of the tokenization rules. Each rule corresponds to a regular expression matching a prefix of the remaining program input. Scallion composes all of these rules into one finite-state machine and applies the maximum-munch rule you have seen in class.
- Whenever a rule matches a (maximal) prefix of the remaining input, Scallion calls the transformation function attached to that rule with the |> operator. This function receives the matched input characters (cs) along with positional information (range) and should produce an instance of Token. You can find the definition of Token in Tokens.scala, which lists all the different kinds of tokens that your Amy compiler should process. For instance, KeywordToken("if") represents an occurrence of the reserved word if in a program. A sketch of what such rules might look like is shown below.
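To make the rule shape concrete, here is a minimal sketch of a keyword rule and an identifier rule. The combinator names used below (word, elem, many) and the onError hook are assumptions based on Scallion's tokenization API; the short reference at the top of Lexer.scala is authoritative.

```scala
// Sketch only: word, elem, many and onError are assumed combinator names;
// consult the API reference at the top of Lexer.scala for the exact ones.
val lexer = Lexer(
  // Keywords: the matched characters cs become a KeywordToken, and
  // range._1 records where the token starts in the source.
  word("if") | word("else") | word("match")
    |> { (cs, range) => KeywordToken(cs.mkString).setPos(range._1) },

  // Identifiers: a letter followed by letters, digits or underscores.
  elem(_.isLetter) ~ many(elem(c => c.isLetterOrDigit || c == '_'))
    |> { (cs, range) => IdentifierToken(cs.mkString).setPos(range._1) }
) onError {
  // Input that matches no rule becomes an ErrorToken, so the lexer can
  // report the error and continue instead of giving up.
  (cs, range) => ErrorToken(cs.mkString).setPos(range._1)
}
```

Maximum munch then guarantees that, for example, the input iffy is lexed as one identifier rather than as the keyword if followed by fy. (Ties between rules matching prefixes of equal length are typically broken by rule order, which is why keywords are listed before identifiers.)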
For more details on how to write new rules, read the short introduction to Scallion's API at the top of Lexer.scala, or consult the Scallion website.
Your task is to complete the rules in Lexer.scala
and implement the filtering of irrelevant tokens.
Notes
Here are some details you should pay attention to:
- Make sure you recognize keywords as their own token kind. if, for instance, should be lexed as the token KeywordToken("if"), not as an identifier with the content "if".
- Make sure you correctly register the position of all tokens. Note the range parameter of the transformer functions. Once you have created a token, use setPos(range._1) to associate it with its position in the program source.
- In general, it is good to output as many errors as possible (this will be helpful to whoever uses your compiler). Your lexer should therefore not give up after the first error, but rather skip the erroneous token, emit an error message, and continue lexing. Scallion takes care of this for you for the most part. However, there are certain inputs that you might explicitly want to map to ErrorToken, such as unclosed multi-line comments.
- The lexer does not immediately read and return all tokens; it returns an Iterator[Token] that later phases will use to read tokens on demand.
- Comments and whitespace should not produce tokens. (The most convenient way of doing this in Scallion is to first produce dedicated tokens and then filter them out later; see the related TODO in Lexer.scala and the sketch after this list.)
- Returned tokens should be fresh instances of the appropriate Token subclass. Value tokens (tokens that carry a value, such as identifiers) need to be constructed with the appropriate value.
- Make sure to correctly implement the Amy lexing rules for literals and identifiers.
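As a sketch of the comment and whitespace filtering mentioned above: assuming Tokens.scala defines dedicated token classes for them (CommentToken and SpaceToken below are hypothetical names; check the actual definitions), the filtering can be a simple predicate over the token iterator:

```scala
// CommentToken and SpaceToken are hypothetical names; see Tokens.scala
// for the actual token classes defined by the skeleton.
def filterIrrelevant(tokens: Iterator[Token]): Iterator[Token] =
  tokens.filter {
    case CommentToken(_) => false // drop comments
    case SpaceToken()    => false // drop whitespace
    case _               => true  // keep all other tokens
  }
```

Since filter on an Iterator is itself lazy, this preserves the on-demand behavior described above.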
Example Output
For reference, here is a possible output for the example under examples/Hello.scala
. You can always get reference output for the lexer from the reference compiler by typing
```
java -jar amyc_2.12-1.X.jar --printTokens <files>
```
```
KeywordToken(object)(1:1)
IdentifierToken(Hello)(1:8)
DelimiterToken({)(1:14)
IdentifierToken(Std)(2:3)
DelimiterToken(.)(2:6)
IdentifierToken(printString)(2:7)
DelimiterToken(()(2:18)
StringLitToken(Good morning!)(2:19)
DelimiterToken())(2:34)
DelimiterToken(})(3:1)
EOFToken()(4:1)
```
Deliverable
You are given 2 weeks for this assignment.
Deadline: Wednesday, Oct. 9, at 23:00.
Please use the Courseware interface to submit your solution and get some preliminary feedback from our automated tests.