LARA

This is an old revision of the document!


Regular expressions

Regular expressions are expressions denoting certain languages. They are precisely those languages that can be built from the singleton languages $\{\epsilon\}$, and $\{a\}$ for each $a \in \Sigma$, using operations union ($\cup$), concatenation (${\cdot}$), and iteration ($*$) on languages.

In regular expressions, $\cup$ is typically denoted $+$ (sometimes $|$), and the curly braces around singleton sets are omitted, so the language $\{a\}$ for $a \in \Sigma$ is denoted simply $a$ and the language $\{\epsilon\}$ by $\epsilon$. The concatenation operator is simply omitted. We next make this more precise. We write $L(r)$ to denote the set that denotes the meaning of a regular expression.

A regular expression over alphabet $\Sigma$ is given by:

  • $\emptyset$ is a regular expression, $L(\emptyset) = \emptyset$
  • $\epsilon$ is a regular expression $L(\epsilon) = \{\epsilon\}$
  • if $a \in \Sigma$, then $a$ is a regular expression $L(a) = \{a\}$
  • if $r_1$ and $r_2$ are regular expressions, so are $r_1 r_2$, $r_1 + r_2$, and $r_1*$; we define $L(r_1 r_2) = L(r_1) \cdot L(r_2)$, $L(r_1 + r_2) = L(r_1) \cup L(r_2)$, $L(r_1*) = L(r_1)^*$
  • every regular expression is obtained by a finite number of applications of the rules above.

Example: let $\Sigma = \{a,b,r\}$. The regular expression

(bar)*(a+b)*

denotes the language $\{ (bar)^n w \mid n \geq 0, w \in \{a,b\}^* \}$.

Example: Integer constants (such as 3123, 9123, 1389123) that do not begin with digit zero are given by (using '|' notation instead of '+'):

(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*

Example: let $\Sigma = \{ b, e, r, c, o, k \}$

(be*r + coke)*

denotes the set of strings that include:

br, ber, beer, beeer, ... , coke, brcoke, ... , berbrcokecokebeeer

Notes:

  • we can introduce shorthand r+ to stand for rr*
  • we can allow non-recursive shorthands, for example
posDigit = 1|2|3|4|5|6|7|8|9
digit = 0|posDigit
intConst = posDigit digit*