Hand-Written Lexical Analyzer for While

Lexical Analyzer

Also called: lexer, scanner, or tokenizer


sequence of characters  -->  [lexer]  -->  sequence of tokens

Interfaces for Lexer

In practice, we read characters and generate tokens on demand

Work with streams instead of sequences, with procedures like

  • current - return current element in stream
  • next - advance the current element

Lexer uses: input stream of characters

Lexer provides: output stream of tokens

Identifiers and Keywords

Regular expression for identifiers: letter (letter|digit)*

if (isLetter(ch.current)) {
  val b = new StringBuffer
  while (isLetter(ch.current) || isDigit(ch.current)) {
    b.append(ch.current); ch.next }
  token = ID(b.toString) }

Keywords look like identifiers, but are designated as keywords.

One implementation: a constant Map from strings to keyword token numbers

  • if a string is not in the map, it is an ordinary identifier
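One way to sketch this in Scala (the Keyword/Id token names and the map contents are illustrative, not taken from Lexer.scala):

```scala
// Hypothetical token representation for illustration
sealed trait Token
case class Keyword(name: String) extends Token
case class Id(name: String) extends Token

// constant Map from keyword strings to keyword tokens
val keywords: Map[String, Token] =
  Map("if" -> Keyword("if"), "else" -> Keyword("else"), "while" -> Keyword("while"))

// after scanning letter (letter|digit)*, look the word up:
// if it is in the map it is a keyword, otherwise an ordinary identifier
def classify(word: String): Token = keywords.getOrElse(word, Id(word))
```

For example, classify("while") yields Keyword("while"), while classify("whilst") is not in the map and falls through to Id("whilst").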

Integer Constants

Regular expression for numbers: digit digit*

if (isDigit(ch.current)) {
  var k = 0
  while (isDigit(ch.current)) {
    k = 10 * k + digitToNumber(ch.current)
    ch.next }
  token = IntConst(k) }

For digitToNumber and similar helpers, see the ASCII character codes
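A sketch of those helpers, relying on the fact that the codes of '0' through '9' are consecutive in ASCII:

```scala
// '0'..'9' occupy consecutive character codes (48..57 in ASCII),
// so subtracting '0' gives the numeric value of a digit character
def digitToNumber(c: Char): Int = c - '0'

def isDigit(c: Char): Boolean = '0' <= c && c <= '9'
```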

Skipping Comments

if (ch.current == '/') {
  ch.next                          // consume the first '/'
  if (ch.current == '/') {         // second '/': a line comment
    while (!isEOL && !isEOF) { ch.next } } }

How about nested comments?

  • need a counter
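A counter-based sketch for nested /* ... */ comments; for simplicity it works over a plain string with an index instead of the character stream, and the name skipNested is made up for this example:

```scala
// Skip a nested comment starting at position i0 (just after the opening "/*").
// Returns the position just after the matching "*/", or -1 if unterminated.
def skipNested(s: String, i0: Int): Int = {
  var depth = 1        // we have already seen one unclosed "/*"
  var i = i0
  while (depth > 0 && i < s.length - 1) {
    if (s(i) == '/' && s(i + 1) == '*') { depth += 1; i += 2 }       // nested open
    else if (s(i) == '*' && s(i + 1) == '/') { depth -= 1; i += 2 }  // close
    else i += 1
  }
  if (depth == 0) i else -1
}
```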

Longest Match Rule

There are multiple ways to break character stream into tokens.

Consider a language with identifiers and the operators <=, <, =

Consider this character stream:

interpreters <= compilers

These are some ways to analyze it into tokens:

  1. ID(interpreters) LEQ ID(compilers)
  2. ID(inter) ID(preters) LESS AssignEQ ID(com) ID(pilers)
  3. ID(i) ID(nte) ID(rpre) ID(ter) LESS AssignEQ ID(co) ID(mpi) ID(lers)

This ambiguity is resolved by the longest match rule:

  • if multiple tokens could follow, take the longest token possible

Note: consider a language with only the three operators '<', '<=', '=>'

Then the sequence '<=>' fails to tokenize: longest match consumes '<=' first,
and the remaining '>' is not a token

  • in practice this is not a problem: we can always insert a space
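The rule, and the failure case, can be sketched as follows (the operator set is the one from the note above; longestOp is an illustrative name):

```scala
// Greedily match the longest operator of the language at position i.
// Returns Some(op) for the longest match, or None if no operator applies.
def longestOp(s: String, i: Int): Option[String] = {
  val ops = List("<", "<=", "=>")     // note: no standalone "=" or ">"
  ops.sortBy(op => -op.length)        // try longer operators first
     .find(op => s.startsWith(op, i))
}
```

On "<=>", the first call matches "<=", but at the remaining ">" no operator matches, so tokenization fails, as described above.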

Token Priority

What if token classes intersect?

Longest match rule does not help

Example: a keyword is usually also an identifier

Priority: order all token classes; when in doubt, take the one with higher priority

  • if a word looks both like a keyword and like an identifier, then it is a keyword

Combining Lexers for All Tokens

How do we know when we are supposed to analyze a string, when an integer constant, etc.?

For manual construction: use lookahead (next symbol in stream) to decide on token class

  • compute first(e) - the set of symbols with which e can start
  • check which first(e) contains the current symbol
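A lookahead dispatch sketch (the token-class names are illustrative):

```scala
// Use one character of lookahead to decide which token class to scan.
// first(identifier) contains the letters, first(intConst) the digits, etc.
def tokenClass(c: Char): String =
  if (c.isLetter) "identOrKeyword"    // keyword vs identifier decided afterwards
  else if (c.isDigit) "intConst"
  else if (c == '<' || c == '=') "operator"
  else if (c == '/') "divOrComment"   // lookahead alone is not enough here
  else "unknown"
```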

This is not always enough: sometimes must remember where we were and go back

Example: a comment begins with a division sign

A partial lexer: Lexer.scala and LexerTest.scala

Coming Next

More principled solution

Basis: finite state machines

We will see how to

  • build code to recognize any regular expression
  • combine token classes, even if they start with common symbols
  • enforce longest match rule and token priorities
  • build a compiler from regular expressions to lexers
  • such a compiler is called a 'lexer generator' and is part of
    • text manipulation software, and
    • compiler-compilers