LARA

Regular expressions

Regular expressions are expressions denoting certain languages. They are precisely those languages that can be built from the singleton languages $\{\epsilon\}$, and $\{a\}$ for each $a \in \Sigma$, using operations union ($\cup$), concatenation (${\cdot}$), and iteration ($*$) on languages.

In regular expressions, $\cup$ is typically denoted $|$ (sometimes $+$), and the curly braces around singleton sets are omitted, so the language $\{a\}$ for $a \in \Sigma$ is denoted simply $a$ and the language $\{\epsilon\}$ by $\epsilon$. The concatenation operator is simply omitted. We next make this more precise. We write $L(r)$ to denote the set that denotes the meaning of a regular expression.

A regular expression over alphabet $\Sigma$ is given by:

  • $\emptyset$ is a regular expression, $L(\emptyset) = \emptyset$
  • $\epsilon$ is a regular expression $L(\epsilon) = \{\epsilon\}$
  • if $a \in \Sigma$, then $a$ is a regular expression $L(a) = \{a\}$
  • if $r_1$ and $r_2$ are regular expressions, so are $r_1 r_2$, $r_1 + r_2$, and $r_1*$; we define $L(r_1 r_2) = L(r_1) \cdot L(r_2)$, $L(r_1 | r_2) = L(r_1) \cup L(r_2)$, $L(r_1*) = L(r_1)^*$
  • every regular expression is obtained by a finite number of applications of the rules above.

Example: let $\Sigma = \{a,b,r\}$. The regular expression

(bar)*(a+b)*

denotes the language $\{ (bar)^n w \mid n \geq 0, w \in \{a,b\}^* \}$.

Example: Integer constants (such as 3123, 9123, 1389123) that do not begin with digit zero are given by (using '|' notation instead of '+'):

(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*

Example: let $\Sigma = \{ b, e, r, c, o, k \}$

(be*r + coke)*

denotes the set of strings that include:

br, ber, beer, beeer, ... , coke, brcoke, ... , berbrcokecokebeeer

Notes:

  • we can introduce shorthand r+ to stand for rr*
  • we can allow non-recursive shorthands, for example
posDigit = 1|2|3|4|5|6|7|8|9
digit = 0|posDigit
intConst = posDigit digit*