Shape Analysis

Example illustrating the idea of shape analysis. Consider the following statements:

x=y.f.g

x==y

We want to know whether the second statement is true. This depends on the shape of the structure on the heap. If it is a tree, then x and 'y may not point to the same object, so the statment does not not hold. If it is a structure that allows cycles, then the statement could be true.

So you perform shape analysis to “determine 'shape invariants' for programs that perform destructive updating on dynamically allocated storage” (ref: Parametric shape analysis via 3-valued logic).

Analyses with graph-based models of memory

This summary was compiled from the lecture notes and from reference 1 below (where the graph, its construction and other details are explained much more precisely).

The idea here is to represent the state of the memory after some statement in the program by building a graph. A node in the graph represents either

a simple variable (then called a variable node) or
a set of heap nodes $new_i$ (recall from preivous section that $new_i$ is the set of all objects allocated on the heap at program point $i$ )

Variable nodes and fields of heap nodes model locations in run-time data structures where pointers may be stored.

Edges in the graph represent pointer values. Edges originate either from variable nodes or from fields within heap nodes and are directed towards heap nodes. (Note: variable nodes do not contain fields, therefore edges cannot be directed towards variable nodes).

The graph contains a finite number of nodes if we build one abstract location $new_i$ per allocation site $i$ . This allocation site model was the one presented in class, although there could be other methods for merging heap nodes in the graph.

Loops represent the unbounded size of a data structure.

Assignments

We consider assignments of the form y = x.f or x.f = y. These change the graph in the following way:

y = x.f

Add an edge directed from the variable node for y to the set of nodes reached by leaving fields in x (or the set of nodes that x belongs to). This is the green edge in the diagram below. Remove the old outgoing edge of y (the red edge in the diagram below). This is called a strong update because we are replacing the old edge with a new edge. If we added the new edge without removing the old one, then we would be performing a weak update.

x.f = y

Add an edge from fields in the set of x to the set of nodes that y points to. This is the green edge in the diagram below. This is a weak update since we do not remove old edges leaving the fields of the nodes in x (or the set of nodes, denoted $new_B$ , which includes the object that x points to). If we were sure that $new_B$ contains a single heap node (and thus that the variable node x points to a single object), then we could remove the old edge, and have a strong update.

We want to do strong updates as much as possible because they are the ones that give us more information about what is going on in the memory (who is pointing to who). Therefore we could split abtract locations into smaller sets if they enable us to do a strong update, and then can merge them later if possible.

References

Parameteric analyses

Most of the information below is summarized from reference 1 and the lecture notes I took.

In the shape analysis algorithms we have considered so far, the set of possible memory states (“stores”) that can arise at a given program point are represented by shape graphs, where heap cells are the nodes and sets of indistinguishable heap cells are represented using a single node (often referred to as the summary node) [ref 1]. We have looked at defining indistinguishable heap cells using allocation sites. Reference 1 describes another approach using predicates. They describe a parametric framework for shape analysis. The framework can be instantiated in different ways to create shape analysis algorithms of varying precision.

A parametric framework must address the following issues:

What is the language for specifying
- the properties of stores that are to be tracked, and
- how such properties are affected by the execution of the different kinds of statements in the programming language?

How is a shape-analysis algorithm generated from such a specification?

In the paper, to address issue 1, they make use of 2-valued and 3-valued logic:

2-valued structures are used to represent concrete stores
3-valued structures are used to represent abstract stores

In other words, unary and binary predicates are used to represent the content of variables and pointer-valued structure fields.

First-order formulas with transitive closure are used to specify properties such as

sharing
cyclicity
reachability, etc.

Formulas are also used to specify how the store is affected by the execution of the different kinds of statements in the programming language.

The predicates are parameters to the framework. The specified set of predicates determines the set of data-structure properties that can be tracked, and consequently what properties of stores can be “discovered” to hold at the different points in the program.

To address issue 2, the authors of the paper use an abstract interpretation for the analysis algorithm. The algorithm finds the least fixed point of a set of equations that are generated from the analysis specification.

Applicability

The set of problems that this analysis can be applied to is not predefined like it is with some other shape analysis algorithms (which are only applicable to programs that manipulate structures such as lists, trees, etc).

Characterizing a data structure

Section 2.1 from ref 1 explains:

Each data structure can be characterized by a set of properties such as:

anchor pointer variables, that is, information about which pointer variables point into the data structure;
the types of the data-structure elements, and in particular, which fields hold pointers;
connectivity properties, such as
- whether all elements of the data structure are reachable from a root pointer variable,
- whether any data-structure elements are shared,
- whether there are cycles in the data structure, and
- whether an element v pointed to by a “forward” pointer of another element v
other properties, for instance, whether an element of an ordered list is in the correct position.

1 and 2 are called core properties because most analyses track them using a pair functions often called the environment and the store.

Connectivity and other properties, such as those mentioned in 3 and 4, are usually not explicitly part of the semantics of pointers in a language, but instead are properties derived from this core semantics. They are essential ingredients in program verification, however, as well as in our approach to shape analysis of programs. Noncore properties are called instrumentation properties.

Representing stores using 2-valued and 3-valued logical structures

A store is represented as a logical structure. A logical structure has a number of individuals and a set of predicates. In a 2-valued logical structure, the predicated map each individual to the value 0 (false) or 1 (true). In a 3-valued logical structure, the predicated map each individual to the value 0 (false), 1/2 (unknown) or 1 (true).

2-valued logical strucures are used to encode concrete stores as follows:

Individuals represent memory locations in the heap
pointers from the stack into the heap are represented by unary “pointed-to-by-variable-q” predicates
and pointer-valued fields of data structures are represented by binary predicates

3-valued logical structures are used to encode abstract stores (like the summary nodes of other algorithms). A 3-valued logical structure corresponds to a set of 2-valued logical structures. The way you abstract a 2-valued logical structure to a 3-valued one is the following: there is some predicates called abstraction predicates. Equivalence classes are formed by individuals in a 2-valued logical structure which have the same values for their abstraction predicates. All members of such equivalence classes are mapped to the same individual in the 3-valued structure. Then, for each nonabstraction predicate in the 2-valued structure, the following value is given to the corresponding predicate in the 3-valued strcture:

if the predicate has value 0 for all of the individuals in the equivalence class, then its value in the 3-valued structure is 0 (property is false)
if the predicate has value 1 for all of the individuals in the equivalence class, then its value in the 3-valued structure is 1 (property is true)
if the predicate has value 0 for some of the individuals and 1 for others in the equivalence class, then its value in the 3-valued structure is 1/2 (don't know whether property holds or not)

This concludes the summary of the first 10 pages of reference 1 below, and should help you understand what was covered in class. The paper explains everything in more detail and continues for another ~70 pages, so there is plenty more details if you are interested.

Reasoning about predicates using decision procedures

Reference 3 below introduces Bohne, an algorithm for inferring loop invariants of programs that manipulate heap-allocated data structures. It also takes properties to be verified as parameters. What makes it unique is the use of a precise abstraction domain that can express detailed properties of a program's infinite memory, and a range of techniques for exploring this analysis domain using decision procedures. The algorithm also uses predicates to partition sets of objects, which is useful for analyzing shape properties (e.g. transitive closure) and also for combining shape properties with sorting properties and properties expressable using linear arithmetic and first-order logic.

The symbolic shape analysis algorithm relies on decision procedures for expressive logics to perform synthesis of loop invariants. The system then verifies that the synthesized invariants are sufficient to prove the absence of errors and to prove the postcondition. During the invariant synthesis, the analysis primarily uses the MONA decision procedure with field constraint analysis to reason about expressive invariants involving reachability in tree-like linked structures.

Bohne has been deployed in the verification systems Hob and Jahob.

References

Parametric shape analysis via 3-valued logic
Pointer Assertion Logic Enginer - verification condition generation, MSOL over strings - no loop invariant inference
Verifying Complex Properties using Symbolic Shape Analysis - loop invariant inference, wider class of structures