Hand-Written Lexical Analyzer for While

Lexical Analyzer

Also called: lexer, scanner, or tokenizer


sequence of characters  -->  [lexer]  -->  sequence of tokens

Interfaces for Lexer

In practice, we read characters and generate tokens on demand

Work with streams instead of sequences, with procedures like

  • current - return current element in stream
  • next - advance the current element

Lexer uses: input stream of characters

Lexer provides: output stream of tokens

Identifiers and Keywords

Regular expression for identifiers: letter (letter|digit)*

if (isLetter(ch.current)) {
  val b = new StringBuffer
  while (isLetter(ch.current) || isDigit(ch.current)) {
    b.append(ch.current); ch.next }
  token = ID(b.toString) }

Keywords look like identifiers, but are designated as keywords.

One implementation: a constant Map from strings to keyword token numbers

  • if a string is not in the map, it is an ordinary identifier
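One way to sketch this in Scala (the Keyword/Id token names and the map contents are illustrative, not taken from Lexer.scala):

```scala
// Hypothetical token representation for illustration
sealed trait Token
case class Keyword(name: String) extends Token
case class Id(name: String) extends Token

// constant Map from keyword strings to keyword tokens
val keywords: Map[String, Token] =
  Map("if" -> Keyword("if"), "else" -> Keyword("else"), "while" -> Keyword("while"))

// after scanning letter (letter|digit)*, look the word up:
// if it is in the map it is a keyword, otherwise an ordinary identifier
def classify(word: String): Token = keywords.getOrElse(word, Id(word))
```

For example, classify("while") yields Keyword("while"), while classify("whilst") is not in the map and falls through to Id("whilst").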

Integer Constants

Regular expression for numbers: digit digit*

if (isDigit(ch.current)) {
  var k = 0
  while (isDigit(ch.current)) {
    k = 10 * k + digitToNumber(ch.current)
    ch.next }
  token = IntConst(k) }

For digitToNumber and similar helpers, see the ASCII character codes
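A sketch of those helpers, relying on the fact that the codes of '0' through '9' are consecutive in ASCII:

```scala
// '0'..'9' occupy consecutive character codes (48..57 in ASCII),
// so subtracting '0' gives the numeric value of a digit character
def digitToNumber(c: Char): Int = c - '0'

def isDigit(c: Char): Boolean = '0' <= c && c <= '9'
```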

Skipping Comments

if (ch.current == '/') {
  ch.next                          // consume the first '/'
  if (ch.current == '/') {         // second '/': a line comment
    while (!isEOL && !isEOF) { ch.next } } }

How about nested comments?

  • need a counter
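A counter-based sketch for nested /* ... */ comments; for simplicity it works over a plain string with an index instead of the character stream, and the name skipNested is made up for this example:

```scala
// Skip a nested comment starting at position i0 (just after the opening "/*").
// Returns the position just after the matching "*/", or -1 if unterminated.
def skipNested(s: String, i0: Int): Int = {
  var depth = 1        // we have already seen one unclosed "/*"
  var i = i0
  while (depth > 0 && i < s.length - 1) {
    if (s(i) == '/' && s(i + 1) == '*') { depth += 1; i += 2 }       // nested open
    else if (s(i) == '*' && s(i + 1) == '/') { depth -= 1; i += 2 }  // close
    else i += 1
  }
  if (depth == 0) i else -1
}
```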

Longest Match Rule

There are multiple ways to break character stream into tokens.

Consider a language with identifiers and the operators <=, <, =

Consider this character stream:

interpreters <= compilers

These are some ways to analyze it into tokens:

  1. ID(interpreters) LEQ ID(compilers)
  2. ID(inter) ID(preters) LESS AssignEQ ID(com) ID(pilers)
  3. ID(i) ID(nte) ID(rpre) ID(ter) LESS AssignEQ ID(co) ID(mpi) ID(lers)

This ambiguity is resolved by the longest match rule:

  • if multiple tokens could follow, take the longest token possible

Note: consider a language with only the three operators '<', '<=', '=>'

Then the sequence '<=>' fails to tokenize: longest match consumes '<=' first,
and the remaining '>' is not a token

  • in practice this is not a problem: we can always insert a space
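The rule, and the failure case, can be sketched as follows (the operator set is the one from the note above; longestOp is an illustrative name):

```scala
// Greedily match the longest operator of the language at position i.
// Returns Some(op) for the longest match, or None if no operator applies.
def longestOp(s: String, i: Int): Option[String] = {
  val ops = List("<", "<=", "=>")     // note: no standalone "=" or ">"
  ops.sortBy(op => -op.length)        // try longer operators first
     .find(op => s.startsWith(op, i))
}
```

On "<=>", the first call matches "<=", but at the remaining ">" no operator matches, so tokenization fails, as described above.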

Token Priority

What if token classes intersect?

Longest match rule does not help

Example: a keyword is usually also an identifier

Priority: order all token classes; when in doubt, take the one with higher priority

  • if a word looks both like a keyword and like an identifier, then it is a keyword

Combining Lexers for All Tokens

How do we know when we are supposed to analyze a string, when an integer constant, etc.?

For manual construction: use lookahead (next symbol in stream) to decide on token class

  • compute first(e) - the set of symbols with which e can start
  • check which first(e) contains the current symbol
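A lookahead dispatch sketch (the token-class names are illustrative):

```scala
// Use one character of lookahead to decide which token class to scan.
// first(identifier) contains the letters, first(intConst) the digits, etc.
def tokenClass(c: Char): String =
  if (c.isLetter) "identOrKeyword"    // keyword vs identifier decided afterwards
  else if (c.isDigit) "intConst"
  else if (c == '<' || c == '=') "operator"
  else if (c == '/') "divOrComment"   // lookahead alone is not enough here
  else "unknown"
```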

This is not always enough: sometimes must remember where we were and go back

Example: a comment begins with a division sign

A partial lexer: Lexer.scala and LexerTest.scala

Coming Next

More principled solution

Basis: finite state machines

We will see how to

  • build code to recognize any regular expression
  • combine token classes, even if they start with common symbols
  • enforce longest match rule and token priorities
  • build a compiler from regular expressions to lexers
  • such a compiler is called a 'lexer generator' and is part of
    • text manipulation software, and
    • compiler-compilers