Forward Symbolic Execution

Can be used for

verification condition generation
executing programs
testing, bug finding
proving programs correct (if extended with techniques for loop invariants)

Intuition for Symbolic Execution

Symbolic execution engine is like an interpreter, but can keep values of variables symbolic, and keep track of constraints on these symbolic values. This is why it is called symbolic execution. In the special case when the input values are known (and there is no I/O in the program), it reduces to execution (interpretation).

State of Symbolic Execution Engine

Assume for simplicity:

the set of variables is $V = \{x,y\}$
there are no loops

Symbolic execution maintains a formula $R(x_0,y_0,x,y)$ describing the relation between the initial state and the current state.

The engine ensures that $R(x_0,y_0,x,y)$ has the following normal form:

$\begin{equation*} \begin{array}{l} \exists z_1.\ldots \exists z_n.\\ \qquad x = e_x \land y = e_y \land P_C \end{array} \end{equation*}$

where

$e_x$ is a term describing symbolic state of $x$
$e_y$ describes symbolic state of $y$
$P_C$ is a path condition
variables $x,y$ do not appear in $e_x,e_y,P_C$ (they only appear on LHS of $x=e_x$ , $y=e_y$ .
$z_1,\ldots,z_n$ are fresh variables that do not appear in the program (we often omit existential quantification over these variables)

Symbolic Execution Rules

We derive these rules from the relational semantics. We assume $R$ has the form as above, with $e_x,e_y,P_C$ .

Assignment

Given assignment

$\begin{equation*} x{=}\ f \end{equation*}$

where $f$ is an expression, change $e_x$ into $f$ interpreted in the current state, obtaining

$\begin{equation*} x = f[x:=e_x,y:=e_y]\ \land y = e_y \land P_C \end{equation*}$

How does this follow from relation composition?

$\begin{equation*} \{((x0,y0),(x,y)) \mid (x = e_x \land y = e_y \land P)\} \circ \{ ((x,y),(x',y') \mid x' = f(x,y) \land y' = y \} \end{equation*}$

$\begin{equation*} \{((x0,y0),(x',y')) \mid \exists x,y. (x = e_x \land y = e_y \land P \land x' = f(x,y) \land y' = y \} \end{equation*}$

$\begin{equation*} \{((x0,y0),(x',y')) \mid (x' = f(e_x,e_y) \land y' = e_y \land P \} \end{equation*}$

$\begin{equation*} \{((x0,y0),(x,y)) \mid (x = f(e_x,e_y) \land y = e_y \land P \} \end{equation*}$

Assume

Given assume statement

$\begin{equation*} assume(F) \end{equation*}$

add $F$ to path condition, obtaining

$\begin{equation*} x = e_x \land y = e_y \land\ (P_C \land F[x:=e_x, y:=e_y]) \end{equation*}$

Havoc

Given havoc statement:

$\begin{equation*} havoc(x) \end{equation*}$

change $e_x$ into a fresh variable:

$\begin{equation*} x = z_k \land y = e_y \land P_C \end{equation*}$

where $z_k$ is a fresh variable (implicitly $\exists$ -quantified).

Assert

Given

$\begin{equation*} assert(F) \end{equation*}$

prove the condition:

$\begin{equation*} \forall x_0,y_0,x,y,z_1,\ldots,z_n.\ ((x = e_x \land y = e_y \land P_C) \rightarrow F) \end{equation*}$

(or save this condition as a conjunct in the final verification condition).

Justification: note that the postcondition at current program point is

$\begin{equation*} \exists x_0,y_0,z_1,\ldots,z_n. R(x_0,y_0,x,y) \end{equation*}$

so the above statement expresses the condition that the assertion will hold starting from all initial states.

Weakening Optionally, after assert(F), symbolic execution can execute

havoc(x)
havoc(y)
assume(F)

which essentially has the effect of obtaining

$\begin{equation*} x = x_k \land y = y_k \land F[x:=x_k, y:=y_k] \end{equation*}$

where $x_k$ , $y_k$ are fresh variables and where $F[x:=x_k, y:=y_k]$ becomes the new path condition. This means that only information about the assertion is propagated further, other information about the past is forgotten.

Sequential composition

This is handled implicitly: execute statements one by one.

Non-deterministic choice

Execute both branches and collect proof obligations from both branches.

Exponential explosion remedy:

work on control-flow graph
merge symbolic states when edges merge

Merging: Given two symbolic states:

$\begin{equation*} \begin{array}{l} x = e_{x1}\ \land\ y = e_{y1}\land\ P_{C1} \\ x = e_{x2}\ \land\ y = e_{y2}\land\ P_{C2} \end{equation*}$

generate a merged symbolic state

$\begin{equation*} x = x_k\ \land\ y = y_k\ \land\ P_C \end{equation*}$

where $x_k,y_k$ are fresh variables, and where $P_C$ is

$\begin{equation*} \begin{array}{l} ((x_k = e_{x1} \land y_k = e_{y1} \land P_{C1}) \ \lor \\ (x_k = e_{x2} \land y_k = e_{y2} \land P_{C2})) \end{array} \end{equation*}$

Then continue symbolic execution from this state.

Important Optimizations

Expression Simplification

Simplification: if we can simplify symbolic expressions $e_x$ , $e_y$ or path condition $P_C$ , this can be very useful, especially if the size of the expression decreases.

An important special case is constant folding. Example: if $e_x$ is $y_0+3+8$ , we replace it by $y_0+11$ .

Much more sophisticated simplifications are possible (e.g. keep polynomials in normal form, apply terminating rewrite rules); see the references at the end of page.

Path Pruning

If the path condition formula $P_C$ becomes unsatisfiable, the relation is empty and symbolic execution can stop (path is unreachable).

Special case: if the initial condition specifies values of variables, then we know which branch to enter, and symbolic execution reduces to concrete execution (interpreter for the programming language).

Approximation in Symbolic Execution

Instead of computing exact relation, we can compute approximate relation for the program. Two main approaches:

larger relation: overapproximation (safe approximation)
smaller relation: underaproximation (used for bug finding, sometimes it's not guaranteed to be smaller or bigger)

Why we approximate:

ensure termination even for programs that have loops
can still obtain useful information about the program

We can do approximation by modifying the current symbolic execution formula to be

stronger (for under-approximation)
weaker (for over-approximation)

We will study this topic in the rest of the course under the framework of abstract interpretation.

In terms of symbolic execution, these can often be described by imposing a stronger normal form on the formula $R$ of symbolic execution.

Over-Approximation

Creates weaker formulas, which are guaranteed to be of stronger form. Here are some examples of how normal form for $R$ looks in that case. Key question is how to ensure that the formulas look like this–it means that we need to modify propagation rules to generate as strong as possible formulas in this normal form.

Sign Analysis

$\begin{equation*} x=x_k \land y=y_k \land (P(x) \land P(y)) \end{equation*}$

where $P(x)$ is one of the formulas $x<0$ , $x=0$ , $x>0$ (and similarly for $P(y)$ )

Interval Analysis

$\begin{equation*} x=x_k \land y=y_k \land (P(x) \land P(y)) \end{equation*}$

where $P(x)$ is a formula of the forms $a \le x \le b$ , $x \le b$ , $a \le x$ , or $true$ (analogously for $y$ )

Linear or Polynomial Constraint Analysis

$\begin{equation*} x=x_k \land y=y_k \land\ (\bigwedge_{i=1}^m p_i(x,y)=0) \end{equation*}$

where $p_i(x,y)$ are polynomials of degree $n$ . For $n=1$ we obtain systems of linear constraints.

Ensuring termination of the analysis in the presence of loops:

we check whether we have seen a weaker symbolic state before, if we have, then we stop
we design normal forms of $R$ so that the number of iterations through loops is always finite

We look at these topics when we study abstract interpretation and data-flow analysis.

Over-Approximation Using Supplied Asserts

Approach two: interpreting assert(P) as:

assert(P)
havoc(x,y)
assume(P)

which is an over-approximation. To see this, we need to give a relational meaning to assert, which we do next. The advantage is that, if asserts express correct loop invariants, this over-approximation will terminate and prove the desired assertions. We can ensure termination by providing a correct assert in each cycle of the control-flow graph.

Under-Approximations

Creates stronger formulas than formulas representing program meaning.

can often be expressed as adding certain 'assume' statements

If the first statements are

assume (x=0)
assume (x=1)

Then symbolic execution knows the exact values and can work as an interpreter.

simplification optimization computes the exact values of variables

There are techniques that do under-approximation by selecting values of certain variables.

find a satisfying assignment for $R$ , take value $v$ of $x$ in it and replace $e_x$ with $v$
CUTE Tool

Other techniques for underapproximation:

Bounding depth: stop executing for paths longer than some value
Bounding state: small integer width, small number of objects in heap
Loop unrolling: replace loop( c ) with c[]c;c[]c;c;c for some number of times

References

Symbolic execution and program testing by James C. King (pdf) (1976)
Lori Clarke and Debra Richardson: Symbolic Evaluation Methods for Program Analysis, in Program Flow Analysis: Theory and Applications, Prentice-Hall 1981.
Verification Condition Generation via Theorem Proving