LARA

Hand-Written Lexical Analyzer for While

Lexical Analyzer

Also called: lexer, scanner, or tokenizer

Transforms a sequence of characters into a sequence of tokens
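
To make the output concrete, here is one possible token representation in Scala; the names are illustrative, not the ones used in the course's Lexer.scala:

sealed trait Token
case class Identifier(name: String) extends Token   // e.g. x, counter
case class IntConst(value: Int)     extends Token   // e.g. 10
case object WhileKw    extends Token                // keyword "while"
case object Less       extends Token                // operator "<"
case object OpenParen  extends Token
case object CloseParen extends Token
case object EOF        extends Token                // end of input

// Lexing the characters of "while (x < 10)" yields the token sequence:
//   WhileKw, OpenParen, Identifier("x"), Less, IntConst(10), CloseParen, EOF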

Interfaces for Lexer

In practice, we read characters and generate tokens on demand

Work with streams instead of sequences, with procedures like

  • current - return current element in stream
  • next - advance to the next element

Lexer uses input stream of characters

Lexer provides output stream of tokens
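
A minimal sketch of these two stream interfaces in Scala, reusing the Token type sketched earlier (the trait names are assumptions, not the course code):

// Input to the lexer: a stream of characters with one-element lookahead.
trait CharStream {
  def current: Char    // current element of the stream
  def next(): Unit     // advance past the current element
}

// Output of the lexer: the same interface, but over tokens.
trait TokenStream {
  def current: Token
  def next(): Unit
}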

Identifiers and Keywords

Regular expression for identifiers: letter (letter|digit)*

if (isLetter(ch.current)) {
  val b = new StringBuilder()            // collects the characters of the word
  while (isLetter(ch.current) || isDigit(ch.current)) {
    b.append(ch.current)
    ch.next()
  }
  // b.toString is the identifier or keyword just read
}

Keywords look like identifiers, but are reserved and treated as separate tokens.

One implementation: a constant Map from strings to keyword token numbers

  • if not in map, then it is ordinary identifier
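
A sketch of this lookup in Scala, assuming integer token numbers (the keyword set and the particular numbers are illustrative):

val IDENT = 0                              // token number for ordinary identifiers
val keywords: Map[String, Int] = Map(      // keyword token numbers (illustrative)
  "if" -> 1, "then" -> 2, "else" -> 3, "while" -> 4, "do" -> 5
)

// Called on the word scanned by the loop above; keywords take priority.
def classifyWord(word: String): Int =
  keywords.getOrElse(word, IDENT)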

Integer Constants

Regular expression for numbers: digit digit*

if (isDigit(ch.current)) {
  var k = 0                              // value accumulated so far
  while (isDigit(ch.current)) {
    k = 10 * k + digitToNumber(ch.current)
    ch.next()
  }
  // k is the value of the integer constant
}

For digitToNumber and similar character tests, see the ASCII character codes
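
Since digit and letter codes are contiguous in ASCII, these helpers reduce to character arithmetic; a sketch of the helpers assumed in the fragments above:

def digitToNumber(c: Char): Int = c - '0'   // '0'..'9' have consecutive codes

def isDigit(c: Char): Boolean  = c >= '0' && c <= '9'
def isLetter(c: Char): Boolean =
  (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')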

Skipping Comments

if (ch.current == '/') {
  ch.next()
  if (ch.current == '/') {               // second '/': a line comment
    while (!isEOL(ch.current) && !isEOF(ch.current)) {
      ch.next()                          // skip until end of line or end of file
    }
  }
}

How about nested comments?

  • need a counter
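
A sketch for nested block comments of the form /* ... */, assuming the stream stands just after the opening /* (helper names as above):

var depth = 1                                         // number of unclosed /* so far
while (depth > 0 && !isEOF(ch.current)) {
  if (ch.current == '/') {
    ch.next()
    if (ch.current == '*') { depth += 1; ch.next() }  // another /* opens
  } else if (ch.current == '*') {
    ch.next()
    if (ch.current == '/') { depth -= 1; ch.next() }  // a */ closes
  } else {
    ch.next()
  }
}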

Longest Match Rule

There are multiple ways to break character stream into tokens.

Consider language with identifiers, <=, <, =

Consider this character stream:

interpreters <= compilers

These are some ways to analyze it into tokens:

  1. ID(interpreters) LEQ ID(compilers)
  2. ID(inter) ID(preters) LESS AssignEQ ID(com) ID(pilers)
  3. ID(i) ID(nte) ID(rpre) ID(ter) LESS AssignEQ ID(co) ID(mpi) ID(lers)

This is resolved by longest match rule:

  • if multiple tokens could follow, take the longest token possible

Note: consider a language with only the three operators '<', '<=', '=>'

Then the sequence '<=>' will fail to tokenize: longest match takes '<=' first and is then stuck on '>', even though '<' followed by '=>' would be a valid split

  • in practice this is not a problem: one can always insert a space
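
By hand, the rule is implemented with lookahead: after reading '<', peek at the next character before deciding. A sketch, assuming LESS and LEQ are token number constants and token is the variable receiving the recognized token:

if (ch.current == '<') {
  ch.next()
  if (ch.current == '=') { ch.next(); token = LEQ }   // longest match: "<="
  else token = LESS                                   // only "<" matched
}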

Token Priority

What if token classes intersect?

Longest match rule does not help

Example: a keyword is usually also an identifier

Priority: order all token classes; if several match, take the one with higher priority

  • if a word looks both like a keyword and like an identifier, then it is a keyword

Combining Lexers for All Tokens

How do we know when we are supposed to analyze a string constant, when an integer constant, when an identifier, etc.?

For manual construction: use lookahead (next symbol in stream) to decide on token class

  • compute first(e) - symbols with which e can start
  • check in which first(e) the current symbol is

This is not always enough: sometimes we must remember where we were and go back

Example: a comment begins with the division sign '/'
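
A sketch of such a dispatch on the lookahead character; the scan* and skipWhitespace helper names are assumptions, standing for fragments like the ones above:

def nextToken(): Unit = {
  skipWhitespace()
  if      (isLetter(ch.current)) scanIdentifierOrKeyword()  // first: letters
  else if (isDigit(ch.current))  scanIntConstant()          // first: digits
  else if (ch.current == '/')    scanDivOrComment()         // '/' starts both division and comments
  else                           scanOperatorOrFail()
}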

A partial lexer: Lexer.scala and LexerTest.scala

Coming Next

More principled solution

Basis: finite state machines

We will see how to

  • build code to recognize any regular expression
  • combine token classes, even if they start with common symbols
  • enforce longest match rule and token priorities
  • build a compiler from regular expressions to lexers
  • such a compiler is called a 'lexer generator' and is part of
    • text manipulation software, and
    • compiler-compilers