Towards a More Systematic Approach
We have seen how to make a lexer for a simple language
What if the language has complex operators, each of which can be part of another, e.g.
> >= => ===> ==
how can we make sure to produce the correct sequence of tokens?
We need a more principled solution
Basis: finite state machines
We will see how to
- build code to recognize any regular expression
- combine token classes, even if they start with common symbols
- enforce longest match rule and token priorities
- build a compiler from regular expressions to lexers
  - such a compiler is called a 'lexer generator' and is part of
    - text manipulation software, and
    - compiler-compilers
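
As a rough illustration of the longest match rule on the operators listed earlier, here is a minimal Python sketch. It is not the construction developed in this course: it assumes every token is a fixed string, so the finite state machine is just a trie, and the names build_trie and tokenize are hypothetical. The lexer follows transitions as far as it can, remembers the last accepting state it passed through, and backs up to that point when it gets stuck. Token priorities do not come up here, because two different fixed strings never match the same text.

```python
OPERATORS = [">", ">=", "=>", "===>", "=="]  # the example operators from above


def build_trie(tokens):
    """Build a trie-shaped state machine: one dict of transitions per state."""
    transitions = [{}]   # state 0 is the start state
    accepting = {}       # accepting state -> token recognized in that state
    for tok in tokens:
        state = 0
        for ch in tok:
            if ch not in transitions[state]:
                transitions.append({})
                transitions[state][ch] = len(transitions) - 1
            state = transitions[state][ch]
        accepting[state] = tok
    return transitions, accepting


def tokenize(text, tokens=OPERATORS):
    """Split text into tokens, always taking the longest match (assumed helper)."""
    transitions, accepting = build_trie(tokens)
    result = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():      # skip whitespace between tokens
            pos += 1
            continue
        state, i = 0, pos
        last_end, last_tok = None, None
        # Run the machine as far as possible, remembering the last accepting state.
        while i < len(text) and text[i] in transitions[state]:
            state = transitions[state][text[i]]
            i += 1
            if state in accepting:
                last_end, last_tok = i, accepting[state]
        if last_tok is None:
            raise ValueError(f"no token matches at position {pos}")
        result.append(last_tok)
        pos = last_end               # back up to the end of the longest match
    return result


print(tokenize("> >= => ===> =="))
print(tokenize("==>"))
```

On the input ==> the machine reads both = characters, finds no transition on > from that state, and falls back to the token == before starting again at >. On === alone it emits == and then reports an error, since a single = is not a token in this example set.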