Towards a More Systematic Approach

We have seen how to make a lexer for a simple language

What if the language has multi-character operators, some of which are parts of others, e.g.

>
>=
=>
===>
==

how can we make sure to produce a correct sequence of tokens?
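
To make the difficulty concrete, here is a minimal Python sketch (not the approach developed later, just an illustration). The names `OPERATORS`, `tokenize`, `first_listed` and `longest` are hypothetical. It compares two ways of choosing among the operators that match at the current position: take the first one in the list, or take the longest one.

```python
# The five operators from the example above, in the order listed.
OPERATORS = [">", ">=", "=>", "===>", "=="]


def tokenize(text, pick):
    """Split text into operators; pick chooses among the matching candidates."""
    tokens = []
    pos = 0
    while pos < len(text):
        # All operators that occur in text starting exactly at pos.
        candidates = [op for op in OPERATORS if text.startswith(op, pos)]
        if not candidates:
            raise ValueError(f"no operator matches at position {pos}")
        tok = pick(candidates)
        tokens.append(tok)
        pos += len(tok)
    return tokens


def first_listed(candidates):
    """Naive choice: whichever matching operator comes first in OPERATORS."""
    return candidates[0]


def longest(candidates):
    """Longest match choice: the matching operator with the most characters."""
    return max(candidates, key=len)
```

With `first_listed`, the input `>=` is split as `>` followed by a stray `=`, and `tokenize` raises `ValueError`. With `longest`, the same input gives `['>=']`, `===>` gives `['===>']`, and `==>` gives `['==', '>']`.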

We need a more principled solution

Basis: finite state machines

We will see how to

build code to recognize any regular expression
combine token classes, even if they start with common symbols
enforce the longest match rule and token priorities
build a compiler from regular expressions to lexers
such a compiler is called a 'lexer generator' and is part of
- text manipulation software, and
- compiler-compilers
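
As a preview of the longest match rule and token priorities, here is a rough Python sketch. It is an assumption-laden illustration: it relies on Python's built-in `re` module (a backtracking matcher) instead of the finite state machines this section builds towards, and the token classes and the names `TOKEN_CLASSES` and `lex` are made up for the example.

```python
import re

# Token classes in priority order: earlier entries win ties.
# Python's re picks the first alternative that matches, not the longest,
# so the OP alternatives are written longest first.
TOKEN_CLASSES = [
    ("KEYWORD", re.compile(r"if|else|while")),
    ("ID", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("OP", re.compile(r"===>|>=|=>|==|>")),
    ("SKIP", re.compile(r"\s+")),
]


def lex(text):
    """Return (class, lexeme) pairs using longest match, then priority."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in TOKEN_CLASSES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier (higher priority) class on ties.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no token class matches at position {pos}")
        if best_name != "SKIP":  # whitespace is consumed but not emitted
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

For example, `lex("if iffy >= 42")` returns `[('KEYWORD', 'if'), ('ID', 'iffy'), ('OP', '>='), ('NUMBER', '42')]`. The lexeme `if` matches both KEYWORD and ID with the same length, so priority decides. The lexeme `iffy` is an ID because the ID match is longer than the keyword prefix `if`.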