Designing Correct Data-Flow Analyses

We describe guidelines to design a correct analysis for a set of properties of interest

Define Properties of Interest

Define precisely what property is needed (depends on the application), e.g.

absence of errors
improving code performance without changing program meaning

Recall Applications of Data-Flow Analysis

Example: every variable assigned before being read

Define Analysis (Semi)Lattice to Express Properties

Find a lattice that

encodes properties of interest
enables terminating analysis

Specify lattice top, bottom, join

Precisely Specify Meaning of Lattice Elements

Specify concretization function, mapping lattice elements to sets of states

$\begin{equation*} \gamma : L \to {\cal P}(States) \end{equation*}$

In SignAnalysis.scala, we have for $m : Var \to Sign$ for $Var = \{v_1,\ldots,v_n\}$

$\begin{equation*} \gamma(m) = \gamma_0(m(v_1)) \times \ldots \times \gamma_0(m(v_n)) \end{equation*}$

where, assuming infinite integers we have

$\begin{equation*} \begin{array}{l} \gamma_0(\bot) = \emptyset \\ \gamma_0(Positive) = \{ 1, 2, \ldots, \} \\ \gamma_0(Zero) = \{ 0 \} \\ \gamma_0(Negative) = \{ \ldots, -2, -1 \} \\ \gamma_0(\top) = \{ \ldots, -2, -1, 0, 1, 2, \ldots \} \\ \end{array} \end{equation*}$

Monotonicity requirement for $\gamma$ :

$x \sqsubseteq y$ implies $\gamma(x) \subseteq \gamma(y)$ , for all $x,y$

Note that this implies

$\begin{equation*} \gamma(x) \cup \gamma(y) \subseteq \gamma(x \sqcup y) \end{equation*}$

Example

$Var = \{ x, y \}$

Lattice element $m : Var \to Sign$ given by $m(x)=Negative$ , $m(y) = Positive$

Then

$\begin{equation*} \begin{array}{r@{}l} \gamma(m) = & \{ \ldots, -2, -1 \} \times \{ 1, 2, \ldots, \} \\ = & \{ (-1, 1), (-1, 2), (-1, 3), \ldots, (-2, 1), (-2, -2), \ldots \} \end{array} \end{equation*}$

Here we wrote $(a,b)$ to mean the function $\{(x,a), (y,b)\}$ .

Compute a Sound Initial Lattice Element

Initially, the set of reachable states is empty at all points, except at CFG entry, where we have set of initial states of program $S_0$

In the initial state of data-flow analysis, we will assign $\bot$ to each program location except entry

At program entry we assign a lattice element $a_0$ such that

$\begin{equation*} \gamma(a_0) \supseteq S_0 \end{equation*}$

Putting $a_0$ to $\top$ is always safe (this corresponds to assuming any possible program states at program initialization time). If we know something more about the initial program states, we can give better initial element. For sign analysis, if we assume all variables are initialized to 0, then we can set $a_0 = \mbox{Zero}$ .

Derive Sound Transfer Functions

For each statement $st$ , define $transFun(st,a)$

Needs to correctly approximate the effect of statement $st$ on states given by $a$ .

Formally, for all statements $st$ and lattice elements $a$ the Correctness Condition is:

$\begin{equation*} \gamma(transFun(st,a)) \supseteq transFun_C(st,\gamma(a)) \end{equation*}$

Here $transFun_C$ is the transfer function from Concrete Execution as Data-Flow Analysis

takes set of states and runs statement $st$ on each state

Recall SignAnalysis.scala transfer functions

Running the Algorithm

As described in DataFlowAnalysis.scala, we execute this iteration

$\begin{equation*} facts(v) = facts(v) \sqcup \bigsqcup_{e.v2=v} transFun(e.lab, e.v1) \ \ \ (F) \end{equation*}$

The algorithm terminates because the elements get larger and larger.

When the algorithm terminates we know that the above assignment is in fact a true as an equation (not assignment).

Sketch of Correctness Proof

Here is why the computed result, given by $fact(v)$ for each vertex $v$ is correct.

Consider any execution of the program represented as control-flow graph (CFG). The execution goes through states, which we can describe by the vertex $u_i$ and the values of variables $w_i$

The execution is a path in CFG, starting at 'entry' and finishing at 'exit'. Consider the resulting alternating sequence of vertices and CFG statements:

$\begin{equation*} (u_1,w_1), s_1, (u_2,w_2), s_2, \ldots, s_n, (u_{n+1},w_{n+1}) \end{equation*}$

where $u_1$ = 'entry', $u_{n+1}$ ='exit'.

We prove by induction on $i$ claim that, for each $i$ ,

$\begin{equation*} w_i \in \gamma(fact(u_i)) \end{equation*}$

In other words, the values $w_i$ of variables at each program point $u_i$ are correctly described by the lattice value $fact(u_i)$ at that program point.

Proof:

Base case: this is how we chose $a_0$ .

Inductive step: follows from

Monotonicity of $\gamma$
Correctness Condition of $\gamma$
The condition that $fact$ map has stabilized

Indeed:

suppose $w_i \in \gamma(fact(u_i)$ by inductive hypothesis
then by definition, $w_{i+1} \in transFun_C(s_i,\gamma(fact(u_i)))$ because it is a next step in program execution
by Correctness Condition, $w_{i+1} \in \gamma(transFun(s_i,fact(u_i))$
because $fact$ stabilized, $transFun(s_i,fact(u_i)) \sqsubseteq fact(u_{i+1})$ ; indeed, $fact(u_{i+1})$ is an upper bound of values on right-hand side of equation $(F)$
by Monotonicity of $\gamma$ , we obtain $w_{i+1} \in \gamma(fact(u_{i+1}))$ . QED

Note: for correctness, it does not matter how we obtained fact, as long as we have checked that it stabilized (we could have guessed those values or found not-necessarily least solution).