Lecture 20: Interprocedural Analysis - Introduction

intraprocedural analysis = analysis for language without procedures

interprocedural analysis = analysis for language with procedures

Given intraprocedural analysis, how do we create interprocedural analysis?

Assumptions

No concurrency
Only global variables
No parameters

The last two assumptions are made for simplicity, since parameter passing can be simulated using global variables and assignments. For example, a function called p with a parameter x, p(x), can be represented as a function without parameters, p'(), by introducing a global variable x.

Example

proc main() {
  p1();
  p2();
}
proc p1() {
  open();
  readContent();
}
proc p2() {
  startLog();
  readContent();
  stopLog();
  close();
}
proc readContent() {
  while (...) {
    read();
    print();
  }
}

Desired properties

These are stated as regular expressions so that it is easier to check whether the program satisfies these constraints.

must follow protocol: (open (read+write)* close)*
must follow protocol: (startLog stopLog)*

Semantics for language with procedure calls

Big step

In this approach, we treat a procedure call as one step. This can be thought of like a step-over move in a debugger.

Let $S$ be the set of program states and $V$ be the set of vertices in the control flow graph. Then we can describe a point in the program as:

$(c,s)$

where $c \in V, s \in S$

Consider the following example:

$p()$ is a procedure which can described by the relation $r$ : $\{(x,y), (x',y') | x'=100 \wedge y'=y\}$

Then we can formally define the semantics of a big step ( $\mapsto$ ) as:

$(c_{1},s_{1}) \mapsto (c_{2}, s_{2})$

where $(s_{1},s_{2}) \in r$ iff $(c_{0},s_{1}) \rightarrow (c_{3},s_{2})$

Small step

Using small step semantics, we treat the procedure call as a sequence of small steps corresponding to each statement in the procedure. This is like the step-into move in a debugger. To accomplish this, we need a stack to keep track of the return point after we come back from a procedure. Therefore a program point can now be described as:

$(s, c, t)$

$s \in S, c \in V, t \in V^{*}$

Here $V^{*}$ is the set of sequences of vertices in the control flow graph.

Using the preceding example, the small steps ( $\hookrightarrow$ ) can be described as:

$(s,c,t_{1}) \hookrightarrow (s_{1}, c_{0}, c_{2} : t_{1}) \hookrightarrow ... \hookrightarrow (s_{2}, c_{3}, c_{2} t_{1}) \hookrightarrow (s_{2}, c_{2}, t_{1})$

After the first step at the point where the procedure is entered, $c_{2}$ is pushed on the stack so that we know where to return to after the procedure is done. The last step is the point where we return from the procedure, so we pop $c_{2}$ from the stack and that is the point in the control flow graph where we return to.

Here we have assumed the procedure has no local variables, although to support them, we'd just need to add another set to the program description:

$(s, l, c, t)$

Here $l \in L$ , where $L$ is the set of mappings of local variables of a procedure to their values. Furthermore, on each procedure call, the program pushes $l$ on the stack along with the program point. Everything would be popped off from $l$ when the procedure returns control to the callee.

One instance where small step semantics would be necessary is in concurrency, so that we can take into account interleaving.

Simple approaches to interprocedural analysis

Treat procedure calls as havoc statements

Assume the program is in an arbitrary state and model the procedure call as a havoc statement.

In previous lectures, we have seen how to compute

$sp\#(a,c) = a_{1}$

where $A$ is a lattice and $a,a_{1} \in A$ .

By treating a procedure call as a havoc, we would have

$sp\#(a,p()) = T$

which would mean that anything can happen, and we would lose all precision.

However, we could still recover some properties if we have an intermediate step where the procedure call would be a havoc statement only on those variables declared in the modifies clause of the procedure. The problem with this approach is that it may be difficult to precisely define the modifies clause.

Inlining

When a procedure call is encountered, it is replaced by the statements of the procedure (so the procedure code is inlined with the code of the callee). A problem arises when the procedure being inlined is recursive. One approach to dealing with this situation is to inline the recursive procedure a few times, and then treat the recursive call as a havoc statement (as described above). However, even if this is done, inlining increases the size of the code exponentially.

Treat procedure calls as goto statements

Procedure calls are approximated with a goto statement, and return from a procedure is a non-deterministic goto statement (non-deterministic because the procedure could have been called from different places in the code, so there is no way to know where to return to). Consider the example program above, and the second property we would like it to satisfy:

(startLog stopLog)*

At the program point just before the first statement of the readContent() function, the value of the boolean variable stopLog is unknown, because we don't know whether the call was made by p1() or p2(). Therefore in order to check such properties, we would need a finite state automaton (FSA) to keep track of the value of the boolean variables.

Modular analysis with supplied contracts

During the verification of a program with procedures, when a procedure call is encountered, the procedure's contract is evaluated. We assume that the user has supplied the procedure contracts, although it may also be possible to infer contracts (especially the ensures clause).

In our example, to satisfy the second property of the program, we introduce a boolean variable

boolean loggingOn := false;

The procedure contract for p2() then looks like this:

requires: loggingOn = false
modifies: loggingOn
ensures:  loggingOn = false

Now consider the first property:

(open (read+write)* close)*

In order to define procedure contracts for readContent() and p1() to ensure that this property is satisfied, we introduce an object called FileState which has a typestate. In typestate systems, the type of an object is allowed to change during its lifetime in the computation. In the paper Generalized typestate checking for data structure consistency, it says that “unlike standard type systems, typestate systems can enforce safety properties that depend on changing object states”.

Therefore FileState has type file and two states, open and closed.

Using this notion, we can write the procedure contracts for the two methods as follows:

readContent()
requires: FileState = open

p1()
requires: FileState = closed
modifies: FileState
ensures:  FileState = open

Analysis of regular properties of finite-state programs with stack (push down systems- PDS)

Assume we have a program with a finite number of program states. Recall that we can write a push down automata (PDA), $P$ , which will accept a language, $L(P)$ , that corresponds to all sequences of operations in the program (SAV07 Lecture 19). Then we can use the PDA to check whether the properties of the program (stated as regular expressions) hold. We do this by constructing a finite state automaton (FSA), $A$ , which accepts the language, $L(A)$ , corresponding to the regular expression for the property (recall the equivalence of finite state machine and regular expression languages). Then we verify whether the property holds by checking if the following is true:

$L(P) \subseteq L(A)$

PDAs and context-free grammars (CFG) are equivalent. Therefore, as an example, we construct the CFG corresponding to the example program above:

$\Sigma$ = {start, stop, open, read, close}

$\Gamma$ is the set of symbols on the stack

$Q$ is the set of program states

readContent ::= epsilon | read readContent
p2          ::= start readContent stop close
p1          ::= open readContent
main        ::= p1 p2

Now to check whether property 2 holds, we have to verify whether $L(CFG) \subseteq L(A)$ .

Checking whether one language is a subset of another

Recall that:

$X \subseteq Y$ iff $X \cap \overline{Y} = \emptyset$

Therefore showing that $L(P) \subseteq L(A)$ is true is equivalent to showing:

$L(P) \cap \overline{L(A)} = \emptyset$

$L(P) \cap L(\overline{A}) = \emptyset$

Let $Q_{1}$ be the set of states of $P$ , and $Q_{2}$ be the set of states of A. Then $L(P) \cap L(\overline{A})$ is the language accepted by a new PDA, $P_{1}$ , whose set of states is $Q_{1} \times Q_{2}$ . Let $CFG_{1}$ be the CFG equivalent to $P_{1}$ . Therefore the problem is now to check whether $L(P_{1}) = \emptyset$ or equivalently whether $L(CFG_{1}) = \emptyset$

One way to do this is to systematically check the productions of $CFG_{1}$ bottom up (starting from a terminal and following the non-terminals which are used to derive it). If the non-terminal representing the start symbol is reachable in at least one of the derivations, then $L(CFG_{1})$ is non-empty, therefore $L(P) \nsubseteq L(A)$ and the property that $A$ represents does not hold.

A formal analysis if this “reachability” problem in pushdown automata is discussed in the paper Reachability Analysis of Pushdown Automata: Application to Model Checking. A summary of the sections of the paper we covered in class are in the next section.

Convergence of automata for reachable configurations

One of the main results of this paper is an algorithm which, given a set of states $S$ , computes the set of all predecessors of $S$ , denoted by $pre^{*}(S)$ . The states in $pre^{*}(S)$ are regular, i.e. they can be represented using a finite state automaton.

In the paper, a pushdown system is defined as a triplet $\mathcal{P} = (P, \Gamma, \Delta)$ where

$P$ is a finite set of control locations

$\Gamma$ is a finite stack alphabet

$\Delta \subseteq (P \times \Gamma) \times (P \times \Gamma^{*})$ is a set of transition rules

If $((q, \gamma),(q',w)) \in \Delta$ then we write $(q, \gamma) \hookrightarrow (q', w)$

Note: an input alphabet is not defined, since here we are only interested in the behavior of the PDS, not the specific language it accepts.

Multi-automata

Let $\mathcal{P} = (P, \Gamma, \Delta)$ be a PDS with $P=\{p^{1},...,p^{m}\}$ . A multi-automaton (MA for short) for $\mathcal{P}$ is a tuple $\mathcal{A} = (\Gamma, Q, \delta, I, F)$ where

$Q$ is a finite set of states $\delta \subseteq Q \times \Gamma \times Q$ is a set of transitions $I \subseteq Q$ is a set of initial states $F \subseteq Q$ is a set of final states

The transition relation $\longrightarrow \subseteq Q \times \Gamma^{*} \times Q$ is defined as the smallest relation which satisfies the following properties

if $(q, \gamma, q')$ then $q \stackrel{\gamma}{\longrightarrow} q'$
$q \stackrel{\varepsilon}{\longrightarrow} q'$ for every $q \in Q$
if $q \stackrel{\w}{\longrightarrow} q''$ and $q'' \stackrel{\gamma}{\longrightarrow} q'$ then $q \stackrel{w\gamma}{\longrightarrow} q'$

$\mathcal{A}$ accepts or recognizes a configuration $\langle p^{i},w \rangle$ if $s^{i} \stackrel{w}{\longrightarrow} q$ for some $q \in F$ . The set of configurations recognized by $\mathcal{A}$ is denoted $Conf(\mathcal{A})$ . As stated before, a set of configurations is regular if it recognized by some MA.

Section 2.2 of the paper explains how, given a regular set of configurations $C$ of a PDS $\mathcal{P}$ recognized by an MA $\mathcal{A}$ , we can construct another MA $\mathcal{A}_{pre*}$ recognizing $pre^{*}(C)$ (refer to paper for details).

So the idea in the end is to make $C$ include those undesirable (error) states, and then check whether $pre^{*}(C)$ includes the starting state configurations (in other words, whether any of the error states are reachable from the start state).

References

Winskel is (almost) right
M. Sharir, and A. Pnueli. Two Approaches to Inter-Procedural Data-Flow Analysis. In Jones and Muchnik, editors, Program Flow Analysis: Theory and Applications. Prentice-Hall, 1981.
F. Nielson, H. R. Nielson, C. Hankin: Principles of program analysis, 2005. Chapter 2.5.
Reachability Analysis of Pushdown Automata: Application to Model Checking

Continued in SAV07 Lecture 21.