Labs 02: Lexer for Amy

This assignment is the first stage of the Amy compiler.

Code Skeleton

In this lab you will start your own compiler from scratch, meaning that you will no longer rely on the compiler frontend which was previously provided to you as a jar file. In the first lab you will build the lexical analysis phase (“lexer”). Since none of the compiler's code will be shared with the previous lab, you can extract the skeleton zip into a new directory (and ideally set up version control).

NOTE: If for some reason you'd like to keep the existing directory (for instance because you're maintaining each lab on a separate branch in the same repository), you should remove the following old files and subdirectories:

lib/amyc-frontend_2.12-1.0.jar
src/amyc/interpreter/
test/scala/amyc/test/InterpreterTests.scala
test/resources/interpreter/

In any case, before you start working, the structure of your src directory should be as follows:

amyc
 ├── Main.scala                (updated)
 │
 ├── parsing                   (new)
 │    ├── Lexer.scala
 │    └── Tokens.scala
 │
 └── utils                     (new)
      ├── AmycFatalError.scala
      ├── Context.scala
      ├── Document.scala
      ├── Env.scala
      ├── Pipeline.scala
      ├── Position.scala
      ├── Reporter.scala
      └── UniqueCounter.scala

This lab will focus on the following two files:

src/amyc/parsing/Tokens.scala: list of tokens and token classes.
src/amyc/parsing/Lexer.scala: skeleton for the Lexer phase.

A Lexer for Amy

The role of a lexer is to read the input text and convert it to a list of tokens. Tokens are the smallest useful units in a source file: a name referring to a variable, a bracket, a keyword etc. The role of the lexer is to group together those useful units (e.g. return the keyword else as a unit, as opposed to individual characters e, l, s, e) and to abstract away all useless information (i.e. whitespace, comments).

Code structure

You can find the lexer in the Lexer.scala file. Its implementation is based on Streams, which are lazy sequences of elements, i.e. their elements are not produced at the creation time of the stream, but on demand. Unlike iterators, they are immutable, i.e. elements are not consumed when they are created, but remain in the Stream. You can find the Scala Stream API here.

The Lexer has the following components:

The public method is run. It just calls lexFile for every input file and concatenates the results.
lexFile will lex one input file. It takes as input a Stream of pairs of (Char, Position) (which we generate for you from the input via the Source Scala library). lexFile calls tokenStream, which in turn calls nextToken until the input is empty.
nextToken is essentially a wrapper method around readToken. It removes whitespace and comments and then calls readToken. This recursive method has to be tail-recursive so it does not stack overflow on long inputs (this is imposed with @scala.annotation.tailrec).
readToken is the main function of the lexer. It will look at the current character and decide what kind of token it has to emit, possibly reading more characters along the way. Your input is an immutable Stream so you don't have to worry about consuming characters, you can peek at them at will. Look at the helper functions at the beginning of readToken and nextToken.
Finally, the keywords method returns a keyword token corresponding to a string, or None if there is no corresponding keyword.

It is your task to complete nextToken and readToken.

Notes

Here are some details you should pay attention to:

Make sure you recognize keywords as their own token type. else, for instance, should be lexed as a token ELSE(), not as an identifier with the content “else”.
Make sure you correctly register the position of all tokens. Note the currentPos on the head of the input stream and the utility useOne and useTwo methods.
In general, it is good to output as many errors as possible (this will be helpful to whomever uses your compiler). Your lexer should therefore not give up after the first error, but rather skip the erroneous token, emit an error message, and then continue lexing. Use the BAD token to mark erroneous tokens. However, make sure to also call the error method inside ctx.reporter: this will register the error and eventually fail your program.
The Lexer does not immediately read and return all tokens, it returns an Stream[Token] that will be used by future phases to read tokens on demand.
Comments should not produce tokens.
Returned tokens should be fresh instances of the the appropriate Token subclass. Value tokens (tokens that carry a value, such as identifiers), need to be constructed with the appropriate value.
You should apply the longest matching rule: elsefoo should become Identifier(elsefoo) and not ELSE; Identifier(foo)
Make sure to correctly implement the Amy lexing rules for literals and identifiers.

Handling Errors

Your compiler needs to exit with a non-zero code in case it encounters an error. Bad tokens consisting of invalid characters are considered as such errors.

The Reporter available in the ctx parameter of the run method, provides a method error which will record that an error has occurred. Reporter also provides the method terminateIfErrors which will terminate the compiler if an error has occurred before. This method is already called between each phase of the compiler. To make sure that the compiler also terminates with non-zero code in case the *last* phase emits an error, it is also called in the end of Main.main.

(Notice that, since your lexer phase returns an Iterator, errors might be emitted by your lexer while the iterator is consumed in the DisplayToken phase.)

In short, make sure you call error whenever you encounter an erroneous character, so your compiler will eventually exit with an error code. For the lexing phase, avoid fatal as this will terminate the compiler immediately (and not allow you to report multiple errors).

Example Output

For reference, here is a possible output for the example under examples/Hello.scala. You can always get reference output for the lexer from the reference compiler by typing

java -cp amyc_2.12-1.X.jar amyc.Main --printTokens <files>

OBJECT()(1:1)
ID(Hello)(1:8)
LBRACE()(1:14)
ID(Std)(2:3)
DOT()(2:6)
ID(printString)(2:7)
LPAREN()(2:18)
STRINGLIT(Hello )(2:19)
CONCAT()(2:28)
STRINGLIT(world!)(2:31)
RPAREN()(2:39)
RBRACE()(3:1)
EOF()(3:2)

Deliverable

You are given 3 weeks for this assignment.

Deadline: Tuesday, Oct. 16, 23.59pm.

Please use the Courseware interface to submit your solution and get some preliminary feedback from our automated tests.