Hand-Written Lexical Analyzer for While
Lexical Analyzer
Interfaces for Lexer
In practice, we read characters and generate tokens on demand
Work with streams instead of sequences, with procedures like
- current - return current element in stream
- next - advance the current element
Lexer uses input stream of characters
- example character stream class: CharStream.scala
Lexer provides:
- types defining tokens, like these Tokens.scala
- lexer interface, like this one: LexerInterface.scala
Identifiers and Keywords
Regular expression for identifiers: letter (letter|digit)*
    if (isLetter(ch.current)) {
      val b = new StringBuffer
      while (isLetter(ch.current) || isDigit(ch.current)) {
        b.append(ch.current)
        ch.next
      }
      // b.toString is the identifier (or keyword) just read
    }
Keywords match the same regular expression as identifiers, but are reserved words of the language.
One implementation: a constant Map from strings to keyword token numbers
- if not in map, then it is ordinary identifier
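The map-based classification can be sketched as follows; the token types and the particular keyword set shown here are illustrative stand-ins for those in Tokens.scala:

```scala
sealed trait Token
case class ID(name: String) extends Token
case class Keyword(name: String) extends Token

// constant Map from strings to keyword tokens
// (the keyword set of While shown here is only an example)
val keywords: Map[String, Token] =
  Map("if" -> Keyword("if"), "else" -> Keyword("else"), "while" -> Keyword("while"))

// after scanning letter (letter|digit)*, look the word up:
// if it is not in the map, it is an ordinary identifier
def classify(word: String): Token =
  keywords.getOrElse(word, ID(word))
```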
Integer Constants
Regular expression for numbers: digit digit*
    if (isDigit(ch.current)) {
      var k = 0
      while (isDigit(ch.current)) {
        k = 10 * k + digitToNumber(ch.current)
        ch.next
      }
      // k is the value of the integer constant just read
    }
For digitToNumber etc. see the ASCII table
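The helper predicates rely on the fact that in ASCII the digit characters '0' through '9' have the consecutive codes 48 through 57, so a digit's numeric value is its code minus the code of '0'. A minimal sketch:

```scala
// in ASCII, '0'..'9' are consecutive codes 48..57
def isDigit(c: Char): Boolean = c >= '0' && c <= '9'

// subtracting the code of '0' yields the digit's value
def digitToNumber(c: Char): Int = c - '0'
```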
Skipping Comments
    if (ch.current == '/') {
      ch.next
      if (ch.current == '/') {
        // single-line comment: skip to end of line (or end of file)
        while (!isEOL && !isEOF) { ch.next }
      }
    }
How about nested comments?
- need a counter
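For nested comments delimited by /* and */, the counter tracks the nesting depth. The following sketch uses an index over a string in place of the stream API of CharStream.scala; the name skipNested is illustrative:

```scala
// skip a nested comment; `start` is the position just after the opening "/*"
// returns the position just after the matching "*/"
def skipNested(s: String, start: Int): Int = {
  var i = start
  var depth = 1  // counter of currently open comments
  while (depth > 0 && i < s.length) {
    if (i + 1 < s.length && s(i) == '/' && s(i + 1) == '*') {
      depth += 1; i += 2   // another comment opens
    } else if (i + 1 < s.length && s(i) == '*' && s(i + 1) == '/') {
      depth -= 1; i += 2   // one level closes
    } else {
      i += 1
    }
  }
  i
}
```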
Longest Match Rule
There are multiple ways to break a character stream into tokens.
Consider a language with identifiers, <=, <, =
Consider this character stream:
interpreters <= compilers
These are some ways to analyze it into tokens:
- ID(interpreters) LEQ ID(compilers)
- ID(inter) ID(preters) LESS AssignEQ ID(com) ID(pilers)
- ID(i) ID(nte) ID(rpre) ID(ter) LESS AssignEQ ID(co) ID(mpi) ID(lers)
This is resolved by longest match rule:
- if multiple tokens could follow, take the longest token possible
Note: consider a language with only the three operators '<', '<=', '=>'
Then the sequence '<=>' will fail to tokenize:
- longest match consumes '<=', leaving '>', which starts no token
- in practice this is not a problem: one can always insert a space
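For the language with identifiers, '<', '<=', '=', the longest match rule amounts to one character of lookahead after reading '<'. A sketch, with token names in the style of the example above:

```scala
sealed trait Tok
case object LESS extends Tok      // '<'
case object LEQ extends Tok       // '<='
case object AssignEQ extends Tok  // '='

// decide the operator token starting at position i;
// return the token and the position just after it
def operator(s: String, i: Int): (Tok, Int) =
  if (s(i) == '<') {
    if (i + 1 < s.length && s(i + 1) == '=') (LEQ, i + 2)  // longest match: prefer '<='
    else (LESS, i + 1)
  } else {
    (AssignEQ, i + 1)
  }
```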
Token Priority
What if token classes intersect?
Longest match rule does not help
Example: a keyword is usually also an identifier
Priority: order all token classes; if in doubt, take the one with higher priority
- if it looks both like keyword and like an identifier, then it is a keyword
Combining Lexers for All Tokens
How do we know when we should lex an identifier, when an integer constant, etc.?
For manual construction: use lookahead (next symbol in stream) to decide on token class
- compute first(e) - symbols with which e can start
- check in which first(e) the current symbol lies
This is not always enough: sometimes must remember where we were and go back
Example: a comment begins with the division sign '/'
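That case can be sketched as follows: after reading '/' we look at the next character to decide whether a comment or a division operator follows. The names afterSlash, DIV, and Comment are illustrative, not part of Lexer.scala:

```scala
sealed trait T
case object DIV extends T
case class Comment(text: String) extends T

// position i is just after a '/' already read from the input
def afterSlash(s: String, i: Int): (T, Int) =
  if (i < s.length && s(i) == '/') {
    // second '/': a single-line comment, running to end of line (or input)
    val end = s.indexOf('\n', i) match { case -1 => s.length; case e => e }
    (Comment(s.substring(i + 1, end)), end)
  } else {
    (DIV, i)  // not a comment: the '/' was the division operator
  }
```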
A partial lexer: Lexer.scala and LexerTest.scala
Coming Next
More principled solution
Basis: finite state machines
We will see how to
- build code to recognize any regular expression
- combine token classes, even if they start with common symbols
- enforce longest match rule and token priorities
- build a compiler from regular expressions to lexers
- such a compiler is called a 'lexer generator' and is part of
- text manipulation software, and
- compiler-compilers