数据流分析：修订间差异

删除的内容添加的内容

行内

2008年12月17日 (三) 08:14的版本

数据流分析 是一种用于收集计算机程序在不同点计算的值的信息的技术。一个程序的控制流图(control flow graph, CFG)被用来确定对变量的一次赋值可能传播到程序中的哪些部分。这些信息通常被编译器用来优化程序。数据流分析的一个典型的例子就是可到达定义的计算。

进行数据流分析的最简单的一种形式就是对控制流图的某个节点建立数据流方程，然后通过迭代计算，反复求解，直到到达不动点。这种一般的方法是由Gary Kildall在Naval Postgraduate School讲课时发明的。^[1]

基本原理

数据流分析试图获得程序中每一点的特定信息。通常，在基本块(basic blocks)的界限内就可以获得这些信息，因为很容易计算基本块中的信息。在前向流分析(forward flow analysis)中，一个块的结束状态是这个块起始状态的一个函数。函数由块内的语句的影响信息组成。一个块的开始状态是它的前驱的结束状态的函数。这就产生了一系列的数据流方程：

对于每一个块b:

out_{b}=trans_{b}(in_{b})

in_{b}=join_{p\in pred_{b}}(out_{p})

在这里， $trans_{b}$ 是块 $b$ 的 转移函数。它作用于入口状态 $in_{b}$ ，并产生出口状态 $out_{b}$ 。连接运算符 $join$ 将块 $b$ 的前驱节点 $p\in pred_{b}$ 的出口状态联合起来，产生入口状态 $b$ 。

After solving this set of equations, the entry and / or exit states of the blocks can be used to derive properties of the program at the block boundaries. The transfer function of each statement separately can be applied to get information at a point inside a basic block.

Each particular type of data flow analysis has its own specific transfer function and join operation. Some data flow problems require backward flow analysis. This follows the same plan, except that the transfer function is applied to the exit state yielding the entry state, and the join operation works on the entry states of the successors to yield the exit state.

The entry point (in forward flow) plays an important role: Since it has no predecessors, its entry state is well-defined at the start of the analysis. For instance, the set of local variables with known values is empty. If the control flow graph does not contain cycles (there were no explicit or implicit loops in the procedure) solving the equations is straightforward. The control flow graph can then be topologically sorted; running in the order of this sort, the entry states can be computed at the start of each block, since all predecessors of that block have already been processed, so their exit states are available. If the control flow graph does contain cycles, a more advanced algorithm is required.

An iterative algorithm

The most common way of solving the data flow equations is by using an iterative algorithm. It starts with an approximation of the in-state of each block. The out-states are then computed by applying the transfer functions on the in-states. From these, the in-states are updated by applying the join operations. The latter two steps are repeated until we reach the so-called fixpoint: the situation in which the in-states (and the out-states in consequence) don't change.

A basic algorithm for solving data flow equations is the round-robin iterative algorithm:

for i ← 1 to N

initialize node i

while (sets are still changing)

for i ← 1 to N

recompute sets at node i

Convergence

To be usable, the iterative approach should actually reach a fixpoint. This can be guaranteed by imposing constraints on the combination of the value domain of the states, the transfer functions and the join operation.

The value domain should be a partial order with finite height (i.e., there are no infinite ascending chains $x_{1}$ < $x_{2}$ < ...). The combination of the transfer function and the join operation should be monotonic with respect to this partial order. Monotonicity ensures that on each iteration the value will either stay the same or will grow larger, while finite height ensures that it cannot grow indefinitely. Thus we will ultimately reach a situation where T(x) = x for all x, which is the fixpoint.

The work list approach

It is easy to improve on the algorithm above by noticing that the in-state of a block will not change if the out-states of its predecessors don't change. Therefore, we introduce a work list: a list of blocks that still needs to be processed. Whenever the out-state of a block changes, we add its successors to the work list. In each iteration, a block is removed from the work list. Its out-state is computed. If the out-state changed, the block's successors are added to the work list. For efficiency, a block should not be in the work list more than once.

The algorithm is started by putting the entry point in the work list. It terminates when the work list is empty.

The order matters

The efficiency of iteratively solving data-flow equations is influenced by the order at which local nodes are visited. Furthermore, it depends, whether the data-flow equations are used for forward or backward data-flow analysis over the CFG. Intuitively, in a forward flow problem, it would be fastest if all predecessors of a block have been processed before the block itself, since then the iteration will use the latest information. In the absence of loops it is possible to order the blocks in such a way that the correct out-states are computed by processing each block only once.

In the following, a few iteration orders for solving data-flow equations are discussed (a related concept to iteration order of a CFG is tree traversal of a tree).

random order This iteration order is not aware whether the data-flow equations solve a forward or backward data-flow problem. Therefore, the performance is relatively poor compared to specialized iteration orders.

postorder This is a typical iteration order for backward data-flow problems. In postorder iteration a node is visited after all its successor nodes have been visited. Typically, the postorder iteration is implemented with the depth-first strategy.

reverse postorder This is a typical iteration order for forward data-flow problems. In reverse-postorder iteration a node is visited before all its successor nodes have been visited, except when the successor is reached by a back edge. (Note that this is not the same as preorder.)

Initialization

The initial value of the in-states is important to obtain correct and accurate results. If the results are used for compiler optimizations, they should provide conservative information, i.e. when applying the information, the program should not change semantics. The iteration of the fixpoint algorithm will take the values in the direction of the maximum element. Initializing all blocks with the maximum element is therefore not useful. At least one block starts in a state with a value less than the maximum. The details depend on the data flow problem. If the minimum element represents totally conservative information, the results can be used safely even during the data flow iteration. If it represents the most accurate information, fixpoint should be reached before the results can be applied.

Examples

The following are examples of properties of computer programs that can be calculated by data-flow analysis. Note that the properties calculated by data-flow analysis are typically only approximations of the real properties. This is because data-flow analysis operates on the syntactical structure of the CFG without simulating the exact control flow of the program. However, to be still useful in practice, a data-flow analysis algorithm is typically designed to calculate an upper respectively lower approximation of the real program properties.

Forward Analysis

reaching definitions

The reaching definition analysis calculates for each program point the set of definitions that may potentially reach this program point.

1: if b==4 then
2:    a = 5;
3: else 
4:    a = 3;
5: endif
6:
7: if  a < 4 then
8:    ...

The reaching definition of variable "a" at line 7 is the set of assignments a=5 at line 2 and a=3 at line 4.

Backward Analysis

live variables

The live variable analysis calculates for each program point the variables that may be potentially read afterwards before their next write update. The result is typically used by dead code elimination to remove statements that assign to a variable whose value is not used afterwards.

The in-state of a block is the set of variable that is live at the end of the block. Its out-state is the set of variable that is live at the start of it. The in-state is the union of the out-states of the blocks successors. The transfer function of a statement is applied by making the variables that are written dead, then making the variables that are read live.

  // out: {}
b1: a = 3; 
    b = 5;
    d = 4;
    if a > b then
  // in: {a,b,d}

  // out: {a,b}
b2:    c = a + b;
       d = 2;
  // in: {b,d}

  // out: {b,d}
b3: endif
    c = 4;
    return b * d + c;
  // in:{}

The out-state of b3 only contains b and d, since c has been written. The in-state of b1 is the union of the out-states of b2 and b3. The definition of c in b2 can be removed, since c is not live immediately after the statement.

Solving the data flow equations starts with initializing all in-states and out-states to the empty set. The work list is initialized by inserting the exit point (b3) in the work list (typical for backward flow). Its computed out-state differs from the previous one, so its predecessors b1 and b2 are inserted and the process continues. The progress is summarized in the table below.

processing	in-state	old out-state	new out-state	work list
b3	{}	{}	{b,d}	(b1,b2)
b1	{b,d}	{}	{}	(b2)
b2	{b,d}	{}	{a,b}	(b1)
b1	{a,b,d}	{}	{}	()

Note that b1 was entered in the list before b2, which forced processing b1 twice (b1 was re-entered as predecessor of b2). Inserting b2 before b1 would have allowed earlier completion.

Initializing with the empty set is an optimistic initialization: all variables start out as dead. Note that the out-states cannot shrink from one iteration to the next, although the out-state can be smaller that the in-state. This can be seen from the fact that after the first iteration the out-state can only change by a change of the in-state. Since the in-state starts as the empty set, it can only grow in further iterations.

Other approaches

In 2002, Mohnen described a new method of data-flow analysis that does not require the explicit construction of a data-flow graph,^[2] instead relying on abstract interpretation of the program and keeping a working set of program counters. At each conditional branch, both targets are added to the working set. Each path is followed for as many instructions as possible (until end of program or until it has looped with no changes), and then removed from the set and the next program counter retrieved.

Bit vector problems

The examples above are problems in which the data flow value is a set, e.g. the set of reaching definitions (Using a bit for a definition position in the program), or the set of live variables. These sets can be represented efficiently as bit vectors, in which each bit represents set membership of one particular element. Using this representation, the join and transfer functions can be implemented as bitwise logical operations. The join operation is typically union or intersection, implemented by bitwise logical or and logical and. The transfer function for each block can be decomposed in so-called gen and kill sets.

As an example, in live-variable analysis, the join operation is union. The kill set is the set of variables that are written in a block, whereas the gen set is the set of variables that are read without being written first. The data flow equations become

$in_{b}=\bigcup _{s\in succ_{b}}out_{s}$

$out_{b}=(in_{b}-kill_{b})\cup gen_{b}$

In logical operations, this reads as

in(b) = 0

for s in succ(b)

in(b) = in(b) or out(s)

out(b) = (in(b) and not kill(b)) or gen(b)

Sensitivities

Data-flow analysis is inherently flow-sensitive. Data-flow analysis is typically path-insensitive, though it is possible to define data-flow equations that yield a path-sensitive analysis.

These terms aren't specific to data-flow analysis. The following definitions should be moved somewhere else and fleshed out with more verbose examples. Wikipedia's inter-document structure for static analysis could use some rework.

A flow-sensitive analysis takes into account the order of statements in a program. For example, a flow-insensitive pointer alias analysis may determine "variables x and y may refer to the same location", while a flow-sensitive analysis may determine "after statement 20, variables x and y may refer to the same location".
A path-sensitive analysis computes different pieces of analysis information dependent on the predicates at conditional branch instructions. For instance, if a branch contains a condition x>0, then on the fall-through path, the analysis would assume that x<=0 and on the target of the branch it would assume that indeed x>0 holds.
A context-sensitive analysis is an interprocedural analysis that takes the calling context into account when analyzing the target of a function call. In particular, using context information one can jump back to the original call site, whereas without that information, the analysis information has to be propagated back to all possible call sites, potentially losing precision.

List of data-flow analyses

Notes

^ Kildall, Gary. A Unified Approach to Global Program Optimization. Proceedings of the 1st Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages. 1973 [2006-11-20].
^ Mohnen, Markus. A Graph-Free Approach to Data-Flow Analysis. Lecture Notes in Computer Science. 2002, 2304: 185–213 [2008-12-02].