Labs 02: Lexer for Tool

This assignment is the first real part of the Tool compiler project. Make sure you have read the general project overview page first.

Code Stubs

Caution: We are no longer going to use the parser+trees given in lib/toolc-frontend_2.11-3.0.jar. This commit removes this file.

To merge the stubs for the Lexer into your current branch (which should contain a working Evaluator by now), use

git merge origin/Lab02

The structure of your project directory should be as follows:

toolc
 ├── Main.scala                (updated)
 │
 ├── ast                     
 │    └── Trees.scala          (new, for later)
 │
 ├── eval                     
 │    └── Evaluator.scala   
 │
 ├── lexer                     
 │    ├── Lexer.scala          (new)
 │    └── Tokens.scala         (new)
 │
 └── utils
      ├── Context.scala        (new)
      ├── Positioned.scala     (new)
      ├── Reporter.scala       (new)
      └── Pipeline.scala       (new)

This lab will focus on the following two files:

lexer/Tokens.scala: list of token kinds and tokens.
lexer/Lexer.scala: stub for the Lexer phase.

A Lexer for Tool

The role of a lexer is to read the input text and convert it to a list of tokens. Tokens are the smallest useful units in a source file: a name referring to a variable, a bracket, a keyword etc. The role of the lexer is to group together those useful units (e.g. return the keyword while as a unit, as opposed to individual characters w, h, i, l, e) and to abstract away all useless information (i.e. whitespace, comments).

Here are some details you should pay attention to:

Make sure you recognize keywords as their own token type. while, for instance, should be lexed as a token WHILE(), not as an identifier representing an object called “while”.
Make sure you correctly register the position of all tokens. Note the currentPos and the Positioned.setPos methods.
In general, it is good to output as many errors as possible (this helps whoever uses your compiler). This means your lexer should not give up after the first error, but rather skip the erroneous token, emit an error message, and then continue lexing. Use the BAD token to mark erroneous tokens.
The Lexer does not immediately read and return all tokens, it returns an Iterator[Token] that will be used by future phases to read tokens on demand.
Comments should not produce tokens.
Returned tokens should be fresh instances of the the appropriate Token subclass. Value tokens (tokens that carry a value, such as identifiers), need to be constructed with the appropriate value.
The position of a token should be the position of its first character.
You should apply the longest matching rule: whilefoo should become Identifier(whilefoo) and not WHILE; Identifier(foo)
Take special care at what is considered a valid identifier, integer literal, and string literal.

Handling Errors

Your compiler needs to exit with a non-0 code in case it encounters an error. Bad tokens of invalid characters are considered as such errors.

The Reporter available in the ctx parameter of the run method, provides a method error which will store in the Reporter the information that an error has occurred. Reporter also provides the method terminateIfErrors which will terminate the compiler if an error has occurred before. This method is already called between each phase of the compiler. To make sure that the compiler also terminates with non-0 code in case the *last* phase emits an error, it is also called in the end of Main.main.

(Notice that, since your lexer phase returns an Iterator, errors might be emitted by your lexer while consuming the iterator during the DisplayToken phase)

In short, make sure you call error whenever you encounter an erroneous character, to make sure that your compiler will eventually exit with an error code. For the lexing phase, avoid fatal as this will terminate the compiler immediately (and not allow you to report multiple errors).

Code structure

You can find the lexer in the Lexer.scala file. Its code is imperative and based on iterators, both for the input file and the output tokens.

The Lexer has the following components:

The public method is run. It returns an iterator of Tokens by repeatedly calling nextToken() until the EOF() token is found.
nextToken() is essentially a wrapper method around readToken(). It removes whitespace and comments and then calls readToken(). This method can be recursive if you want, but in this case make sure that it is tail-recursive so it does not stack overflow on long inputs (this is imposed with @scala.annotation.tailrec).
readToken() is the main function of the lexer. It will look at the current character and decide what kind of token it has to emit, possibly reading more characters along the way.
The input file is read by a class called SourceReader. To make it easier for you, SourceReader will cache two characters at a time so you can look ahead if needed. You have access to the two characters with currentChar and nextChar. consume will read a new character from the input file (there is no way to go back to previous characters, so structure your code carefully). Finally, currentPos returns the position of currentChar. You can use the methods of the reader without qualification due to import reader._
Finally, the keywords method should return a keyword token corresponding to a string, or None if there is not corresponding keyword.

It is your task to complete nextToken, readToken and keywords.

Example Output

For reference, here is a possible output for the factorial program. You can use the reference compiler with the flag --tokens

java -jar toolc-reference-3.0.jar --tokens <file.tool>

to show the tokens of any Tool source file.

  PROGRAM()(1:1)
  ID(Factorial)(1:9)
  LBRACE()(1:19)
  PRINTLN()(2:5)
  LPAREN()(2:12)
  STRINGLIT(10! = )(2:13)
  PLUS()(2:22)
  NEW()(2:24)
  ID(Fact)(2:28)
  LPAREN()(2:32)
  RPAREN()(2:33)
  DOT()(2:34)
  ID(computeFactorial)(2:35)
  LPAREN()(2:51)
  INTLIT(10)(2:52)
  RPAREN()(2:54)
  RPAREN()(2:55)
  SEMICOLON()(2:56)
  RBRACE()(3:1)
  CLASS()(5:1)
  ID(Fact)(5:7)
  LBRACE()(5:12)
  DEF()(6:5)
  ID(computeFactorial)(6:9)
  LPAREN()(6:25)
  ID(num)(6:26)
  COLON()(6:30)
  INT()(6:32)
  RPAREN()(6:35)
  COLON()(6:37)
  INT()(6:39)
  EQSIGN()(6:43)
  LBRACE()(6:45)
  VAR()(7:9)
  ID(num_aux)(7:13)
  COLON()(7:21)
  INT()(7:23)
  SEMICOLON()(7:26)
  IF()(8:9)
  LPAREN()(8:12)
  ID(num)(8:13)
  LESSTHAN()(8:17)
  INTLIT(1)(8:19)
  RPAREN()(8:20)
  ID(num_aux)(9:13)
  EQSIGN()(9:21)
  INTLIT(1)(9:23)
  SEMICOLON()(9:24)
  ELSE()(10:9)
  ID(num_aux)(11:13)
  EQSIGN()(11:21)
  ID(num)(11:23)
  TIMES()(11:27)
  LPAREN()(11:29)
  THIS()(11:30)
  DOT()(11:34)
  ID(computeFactorial)(11:35)
  LPAREN()(11:51)
  ID(num)(11:52)
  MINUS()(11:56)
  INTLIT(1)(11:58)
  RPAREN()(11:59)
  RPAREN()(11:60)
  SEMICOLON()(11:61)
  RETURN()(12:9)
  ID(num_aux)(12:16)
  SEMICOLON()(12:23)
  RBRACE()(13:5)
  RBRACE()(14:1)
  EOF()(14:2)

Note that the lexer emits EOF and only then will the Iterator.hasNext method return false.

Deliverable

You are given 2 weeks for this assignment.

Deadline: Tuesday, Oct. 11, 23.59pm.

You must use the interface on http://larasrv09.epfl.ch:9000/clp16/repository as follows:

Click on the “Deliverables” tab
Click “Deliver this version” on the commit corresponding to the version of your code you want to submit