Lecture 02: Lexical Analysis
Key insights:
- a lexical analyzer maps a stream of characters into a stream of tokens
- while doing that, it typically needs only finite memory
- we can specify tokens for a lexical analyzer using regular expressions
- it is not difficult to construct a lexical analyzer manually; we give an example in the hand-written scanner section below
- in such a case, we often use the first character to decide on the token class; formally, first(L) = { a | aw in L for some word w }. For example, if L is the language of identifiers letter (letter | digit)*, then first(L) is the set of letters
- it is also possible to automate the construction of lexical analyzers; the starting point of this construction is the conversion of regular expressions into deterministic finite automata (see the DFA sketch after this list)
- we follow the maximal munch rule: the lexical analyzer should always eagerly accept the longest token that it can recognize from the current position
- tools that automate this construction are part of compiler-compilers such as JavaCC
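The sketch below illustrates both the regex-to-DFA point and the maximal munch rule. The DFA is hand-translated from the regular expression letter (letter | digit)* for identifiers; the class and method names (DfaSketch, maximalMunch) are illustrative, not taken from JavaCC or any other tool.

```java
// Minimal sketch: running a DFA (hand-translated from the regular
// expression letter (letter | digit)* for identifiers) with maximal munch.
// States: 0 = start, 1 = accepting (seen at least one letter), -1 = dead.
final class DfaSketch {
    static int step(int state, char c) {
        boolean letter = Character.isLetter(c);
        boolean digit  = Character.isDigit(c);
        switch (state) {
            case 0:  return letter ? 1 : -1;             // first char must be a letter
            case 1:  return (letter || digit) ? 1 : -1;  // then letters or digits
            default: return -1;                          // dead state stays dead
        }
    }

    // Returns the length of the longest identifier starting at 'from',
    // or 0 if none: run the DFA and remember the last accepting position.
    static int maximalMunch(String input, int from) {
        int state = 0, lastAccept = 0;
        for (int i = from; i < input.length() && state != -1; i++) {
            state = step(state, input.charAt(i));
            if (state == 1) lastAccept = i - from + 1;   // state 1 is accepting
        }
        return lastAccept;
    }

    public static void main(String[] args) {
        System.out.println(maximalMunch("x42+y", 0));    // prints 3 ("x42")
    }
}
```

A generated scanner works the same way, except that the transition function is a table produced automatically by the regular-expression-to-DFA construction.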
Hand-Written Scanner for the While Language
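A minimal sketch of such a scanner is shown below. It assumes a While language with keywords if/then/else/while/do/skip, identifiers, integer literals, and a few operators (:=, <=, <, +, ;); the exact token set in the lecture may differ. The scanner dispatches on the first character of each token and applies maximal munch, e.g. preferring <= over <.

```java
import java.util.*;

// Minimal hand-written scanner sketch for a While-style language.
// The token set (keywords, operators) is assumed for illustration and
// may differ from the lecture's exact While language definition.
final class WhileScanner {
    private final String in;
    private int pos = 0;
    private static final Set<String> KEYWORDS =
        Set.of("if", "then", "else", "while", "do", "skip");

    WhileScanner(String in) { this.in = in; }

    private char peek() { return pos < in.length() ? in.charAt(pos) : '\0'; }

    // Dispatch on the first character (the first(L) idea), then
    // eagerly consume the longest possible token (maximal munch).
    String nextToken() {
        while (Character.isWhitespace(peek())) pos++;      // skip whitespace
        if (pos >= in.length()) return null;               // end of input
        char c = peek();
        if (Character.isLetter(c)) {                       // identifier or keyword
            int start = pos;
            while (Character.isLetterOrDigit(peek())) pos++;
            String word = in.substring(start, pos);
            return (KEYWORDS.contains(word) ? "KEYWORD(" : "ID(") + word + ")";
        }
        if (Character.isDigit(c)) {                        // integer literal
            int start = pos;
            while (Character.isDigit(peek())) pos++;
            return "INT(" + in.substring(start, pos) + ")";
        }
        switch (c) {
            case ':':                                      // ':=' or error
                pos++;
                if (peek() == '=') { pos++; return "ASSIGN"; }
                throw new IllegalStateException("expected '=' after ':'");
            case '<':                                      // maximal munch: '<=' before '<'
                pos++;
                if (peek() == '=') { pos++; return "LEQ"; }
                return "LT";
            case '+': pos++; return "PLUS";
            case ';': pos++; return "SEMI";
            default:
                throw new IllegalStateException("unexpected character: " + c);
        }
    }

    public static void main(String[] args) {
        WhileScanner s = new WhileScanner("while x <= 10 do x := x + 1");
        for (String t = s.nextToken(); t != null; t = s.nextToken())
            System.out.println(t);
    }
}
```

Note how each branch needs only the current character and a bounded amount of state, matching the finite-memory observation above.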
Background on regular languages and automata:
- Regular Languages and Finite Automata, lecture notes by Andrew M. Pitts
References
- Tiger book (A. W. Appel, Modern Compiler Implementation), Chapters 1-2
- Slides from previous years
- Compiler Construction by Niklaus Wirth, Chapters 1-3