Labs 02: Lexer for Tool

This assignment is the first real part of the Tool compiler project. Make sure you have read the general project overview page first.

Code Stubs

You can download the additional stubs from our main repository by following the instructions.

Caution: We are no longer going to use the parser+trees given in toolc-parser_2.11-2.0.jar. Make sure it is removed from lib/.

The structure of your project directory should be as follows:

toolc
 ├── Main.scala                (updated)
 │
 ├── ast                     
 │    └── Trees.scala          (new, for later)
 │
 ├── eval                     
 │    └── Evaluator.scala   
 │
 ├── lexer                     
 │    ├── Lexer.scala          (new)
 │    ├── PrintTokens.scala    (new)
 │    └── Tokens.scala         (new)
 │
 └── utils
      ├── Context.scala        (new)
      ├── Positioned.scala     (new)
      ├── Reporter.scala       (new)
      └── Pipeline.scala       (new)

This lab will focus on the following two files:

lexer/Tokens.scala: list of token kinds and tokens.
lexer/Lexer.scala: stub for the Lexer phase.

A Lexer for Tool

The role of a lexer is to read the input text and convert it to a list of Tokens.

Here are some details you should pay attention to:

Make sure you recognize keywords as their own token type. while, for instance, should be lexed as the token kind WHILE, not as an identifier representing an object called “while”.
Make sure you correctly register the position of all tokens. Note the currentPos() and the Positioned.setPos methods.
In general, it is good to output as many errors as possible (this helps whoever uses your compiler). For instance, if your lexer encounters an invalid character, it can output an error message, skip it, and keep on lexing the rest of the input. After lexing, the compiler still won't proceed to the next phase, but this helps the user correct more than one error per compilation. Use the special BAD token type to mark errors and keep lexing as long as it is possible.
The Lexer does not immediately read and return all tokens, it returns an Iterator[Token] that will be used by future phases to read tokens on demand.
Comments should not produce tokens.
Returned tokens should be fresh instances of the Token class, with the proper token kind. Value tokens (tokens that carry a value, such as identifiers), have their own subclass of Token.
The position of a token should be the position of its first character.
You should apply the longest matching rule: whilefoo should become Identifier(whilefoo) and not WHILE; Identifier(foo)
Take special care at what is considered a valid identifier, and a valid integer literal.

Example Output

For reference, here is a possible output for the factorial program:

  OBJECT(1:1) 
  ID(Factorial)(1:8) 
  LBRACE(1:18) 
  DEF(2:5) 
  MAIN(2:9) 
  LPAREN(2:13) 
  RPAREN(2:14) 
  COLON(2:16) 
  UNIT(2:18) 
  EQSIGN(2:23) 
  LBRACE(2:25) 
  PRINTLN(3:9) 
  LPAREN(3:16) 
  NEW(3:17) 
  ID(Fact)(3:21) 
  LPAREN(3:25) 
  RPAREN(3:26) 
  DOT(3:27) 
  ID(computeFactorial)(3:28) 
  LPAREN(3:44) 
  INT(10)(3:45) 
  RPAREN(3:47) 
  RPAREN(3:48) 
  SEMICOLON(3:49) 
  RBRACE(4:5) 
  RBRACE(5:1) 
  CLASS(7:1) 
  ID(Fact)(7:7) 
  LBRACE(7:12) 
  DEF(8:5) 
  ID(computeFactorial)(8:9) 
  LPAREN(8:25) 
  ID(num)(8:26) 
  COLON(8:30) 
  INT(8:32) 
  RPAREN(8:35) 
  COLON(8:37) 
  INT(8:39) 
  EQSIGN(8:43) 
  LBRACE(8:45) 
  VAR(9:9) 
  ID(num_aux)(9:13) 
  COLON(9:21) 
  INT(9:23) 
  SEMICOLON(9:26) 
  IF(10:9) 
  LPAREN(10:12) 
  ID(num)(10:13) 
  LESSTHAN(10:17) 
  INT(1)(10:19) 
  RPAREN(10:20) 
  ID(num_aux)(11:13) 
  EQSIGN(11:21) 
  INT(1)(11:23) 
  SEMICOLON(11:24) 
  ELSE(12:9) 
  ID(num_aux)(13:13) 
  EQSIGN(13:21) 
  ID(num)(13:23) 
  TIMES(13:27) 
  LPAREN(13:29) 
  THIS(13:30) 
  DOT(13:34) 
  ID(computeFactorial)(13:35) 
  LPAREN(13:51) 
  ID(num)(13:52) 
  MINUS(13:56) 
  INT(1)(13:58) 
  RPAREN(13:59) 
  RPAREN(13:60) 
  SEMICOLON(13:61) 
  RETURN(14:9) 
  ID(num_aux)(14:16) 
  SEMICOLON(14:23) 
  RBRACE(15:5) 
  RBRACE(16:1) 
  EOF(16:2)

Note that the lexer emits EOF and only then the Iterator.hasNext method will return false.

You can also use the reference compiler with the flag --tokens

java -jar toolc-reference-2.4.jar --tokens <file.tool>

to show the tokens of any Tool source file.

Handling Errors

Your compiler needs to exit with a non-0 code in case it encounters an error. Bad tokens of invalid characters are considered as such errors.

Your compile will already check between each phase if an error has been issued, in which case it will exit right away. Since your lexer phase returns an Iterator, errors might be emitted by your lexer while consuming the iterator (during the DisplayToken phase).

You should thus make sure that you also terminate with non-0 code in case the *last* phase emits an error.

An easy way to do that is to place

ctx.reporter.terminateIfErrors

at the end of the main() method. This should have been part of the stubs, sorry for that.

Deliverable

You are given 2 weeks for this assignment.

Deadline: Tuesday, Oct. 14, 23.59pm.

You must use the interface on http://larasrv09.epfl.ch:9000/clp14/repository as follows:

Click on the “Deliverables” tab
Click “Deliver this version” on the commit corresponding to the version of your code you want to submit