Category Archives: [UCSD] Advanced Compilers

[Compilers] Interprocedural Analysis

Cost of procedure calls

Without interprocedural analysis, make the flow function for call nodes return TOP. –> loss of precision (“lost-precision” cost) / indirect cost
Calls incur cost of return, argument & result passing, stack frame maintenance. –> “direct runtime” cost / direct cost

Ways to address

Get rid of calls using inlining and other techniques.
– Pros: eliminate overhead of call/return sequence, eliminate overhead of passing args & returning results, can optimize callee in context of caller and vice versa
– Cons: can increase compiled code space requirements, can slow down compilation, recursion?
– Let this be done by heuristics (machine learning can be used these days…)

Interprocedural analysis
– Pros: extend intraprocedural analysis to work across calls, doesn’t increase code size
– Cons: doesn’t eliminate direct runtime costs of calls, may not be as effective as inlining at cutting the precision cost of procedure calls.

Interprocedural Analysis

Simple approach: create a single CFG

Given call graph and CFGs of procedures, create a single CFG.
– connecting call sites to entry nodes of callees.
– connecting return nodes of callees back to calls.

Cons: speed, separate compilation, imprecision due to unrealizable paths.

Another approach: summaries

Compute summary info for each procedure
– Callee summary: summarizes effect/results of callee procedures for callers
– Caller summary: summarizes context of all callers for callee procedure.

Compute summaries using the worklist algorithm 🙂
– output (callee): summarizes effect/results (output) of callee procedures for callers
– input (caller): summarizes context (input) of all callers for callee procedure


main
fgh
inputoutputinputoutputinputoutputinputoutput

u
mainu
gl
mainl
fl
hu
fu
mainu
uuluullu

main
fg
inputoutputinputoutputinputoutput
u
mainu
gl
mainl
fu, llu, l
gu, l
maineu, l

Problem: need context!

Approach #1 to context-sensitivity


main
fg
inputoutputinputoutputinputoutput
L0: u
mainL0: uL1: u
gL0: uL1: uL1: l
mainL0: uL2: lL1: uL1: l
fL0: uL2: l
L1: u
L3: l
L1: l

gL0: uL2: l
L1: u
L3: l
L1: l
L3: u
fL0: u

L2: lL2: uL1: u
L3: l
L1: l
L3: u

Problem: we might need more than one level of stack… –> Shivers’ k-CFA

Approach #2 to context-sensitivity

Use transfer functions: this looks like functions from input information to output information.


main
fg
inputoutputinputoutputinputoutput
u: u
main
u: u
gu: l
mainl: l
f
u: u
l: l
u: l

g
u: l
l: u
f
l: u
mainu: u

For data-based context sensitivity, we can also run the analysis bottom-up.
There are ways to handle the problem using Graph Theory (Reps Horowitz and Sagiv 95 (RHS))

Procedure specialization

Interprocedural analysis is great for callers, but for the callee, information is still merged.
Even if you know everything about the context, it might not help! So just duplicate the callee procedures and it might help.

Summary

[Compilers] Alias Analysis

Alias Analysis

Also called points-to analysis (alias: two expressions that denote the same memory location)
Can be used to improve the precision of analysis that require knowing what is modified and referenced (e.g. constant prop, common subexpression elimination)

Intraprocedural analysis

Could build a simple algorithm that just creates list where each node denotes slice in memory… NO
– Lattice infinitely tall
– This is just like running the program, and no guarantee of termination

Introduce summary nodes where each node S denotes locations of all slices in memory allocated by one statement… YES?

Interprocedural analysis

Intraprocedural pointer analysis is not enough
– Sharing data structures across multiple procedures is one of the big benefits of pointers. So, without interprocedural analysis, there needs to be conservative assumptions making the whole analysis too imprecise.
Main difficulty in performing interprocedural pointer analysis is scaling (space and time)

Default… NO

Make it flow-insensitive (default was flow-sensitive)… NO
– Loss of too much information
– Still have to iterate

Do not kill information… Andersen’s algorithm
– Leads to loss of some information
– Uses less memory and runs faster than default

Lets make one successor per node… Steensgaard’s algorithm
– Leads to loss of more information
– But very very fast!!!

Details

Referencing (a = new b): a –> S#
Aliasing (a = b): a –> pts(b)
Dereferencing read (a = *b): a –> pts(pts(b))
Dereferencing write (*a = b): pts(a) –> pts(b)

Andersen’s Points-to Analysis

– Asympotic performance is O(n^3)
– More precise than Steensgaard’s Analysis
– Subset-based (Inclusion-based): just propagates information to the other set (more precise in some cases)

Steensgaard’s Points-to Analysis

– Fast: O(n α (n,n)) over variables in program
– Sacrifices precision
– Equality-based: merges information

[Compilers] Program Representations

Program Representations

Goals
– Make analysis easy and effective
– Make transformation easy to perform
– General across input languages and target machines

– Compact in memory
– Easy to translate and form
– Tracks information from source through to binary, from source-level debugging, profiling, typed binaries
– Extensible and displayable

Components
– Control dependencies: sequencing of operations; evaluation of if & then, side-effects of statements occur in right order (JTS comes before JMP in SRP)
– Data dependencies: fl0ow of definitions from defs to uses; operands computed before operations
– Only ones that are pertinent should be represented

High-level vs Low-level

High-level syntax based

Analysis can exploit high-level knowledge of constructs
Easy to map to source code (debugging, profiling)

e.g. Abstract Syntax Tree (AST)

Low-level syntax based

Can do low-level, machine specific reasoning
Can be language-independent

e.g. Assembly code, Virtual machine code, LLVM IR, three-address code, RTL

Components

Data dependencies

Def/Use Chains… MAYBE?
– Ignores control flow: misses some optimization opportunities since conservatively considers all paths, not executable by itself and is not appropriate for code motion transformations
– Must update after each transformation
– Space consuming

Static Single Assignment (SSA)… HELL YEAH!
– To SSA : Create a new variable for each def, and insert PHI pseudo-assignments at merge points
– From SSA: Insert assignments at the end of pertinent predecessors, which can be removed later if register allocator assigns them to the same registers
– Invariant since each use of a variable has only one def…
– Captures more chances for optimization…
– Pointers complicate things
. don’t use SSA for pointed to variables
. adapt SSA to account for pointers (LLVM)
. define src language so that variables cannot be pointed to (Java)

Control dependencies

A node (basic block) Y is control-dependent on another x if X determines whether Y executes
– X is not post-dominated by Y (there exists a path from X to Y such that every node in the path order than X and Y is post-dominated by Y)

Control Dependence Graph… EMM
– Y becomes descendent of X if Y is control dependent on X
– Cannot trash CFG after making control dependence graph, because data dependence information is not present in control dependence graph

Program Dependence Graph… EMM
– Super-impose dataflow graph (in SSA form or not) on top of the control dependence graph

Control Dependence Graph & Program Dependence Graph
– More flexible way of representing control dependencies than CFG (less constraining), even makes code motion a local transformation
– However, much harder to convert back to an executable form, so not used much

Examples

Common subexpression Elimination

Iterative Dataflow Analysis
Static Single Assignment: we find opportunities to optimize z1 = i1 * 4, m1 = a1 + b1, w1 = 4 * m1

Loop-invariant Code Motion (LICM)

Step 1: detecting loop invariants

Use/Def Chains: p = w + y is not loop-invariant w is not “single reaching definition”
Static Single Asssignment: Nothing can be hoisted since PHI functions are not pure
Static Single Assignment with Preheader: Everything that was possible in Use/Def Chains and p1 = w3 + y3 becomes loop-invariant

Step 2: move them outside loop (hoisting / sinking)

Need to preserve relative order of invariant computations to preserve dataflow among move statements
– To move statement S to loop pre-header, S must dominate all loop exits
– Domination restrictions can be circumvented through loop normalization

x = a / b do not dominate loop exit before, but by performing “while-do to if-do-while”, we can make x = a / b dominate loop exit

In Static Single Assignment, such restriction is unnecessary… HELL YEAH!

[Compilers] Lattices

Relations

Reflexive: a R a, equal to itself…
Transitive: a R b ^ b R c –> a R c
Symmetric: a R b –> b R a
Anti-symmetric: a R b –> ㄱ(b R a) (causes to be not reflexive)
a R b ^ b R a –> a = b

Equivalence class: reflexive & transitive & symmetric
Partial order: reflexive & transitive & anti-symmetric, ex) <=

Least upper bound (lub): a <= c, b <= c, and all d in S.
a <= d ^ b <= d –> c <= d
Greatest lower bound (glb): c <= a, c <= b, and all d in S.
d <= a ^ d <= b –> d <= c

Worklist Algorithm using Lattices

Abstract Interpretation Ordering uses BOTTOM, Dataflow Analysis uses TOP

Try to define a global flow function F, which takes a map m as a parameter and returns a new map m’ in which individual local flow functions have been applied!

Always safe to use TOP, but hard to go down the lattice and gives less precise solution which is not favorable.
If F is monotonic and the lattice has a finite hight, it guarantees termination

[Compilers] Dataflow Analysis

A common framework for expressing algorithms that compute information about a program.

Control Flow Graph

Each statement becomes a node
Edges between nodes represent control flow

Dataflow Analysis

Worklist Algorithm

  • Initialize all d to empty set…
  • add nodes to worklist…
  • while worklist is not empty
    • remove a node n
    • info_in will be map of sets of incoming information
    • info_out will be map of sets of outgoing information computed through F
    • for each outgoing edge
      • if info_out is different from what has been previously calculated, update with info_out and add all destinations of the outgoing edges…
      • let new_info be superset of previously calculated and info_out
      • if new_info is different from what has been previously calculated, update with new_info and add all destinations of the outgoing edges…

Issue: (1) Ordering? Random order will take long due to redundancy, so use reverse depth-first post-order (2) Does this algorithm terminate? May not terminate oscillating in the original algorithm, so make supersets to make guarantees…

May vs Must

Forward vs Backward

Although constraints are not directional, flow functions are

Forward
– the constraints are of the form out = F(in)

Backward
– In some cases, the constraints are of the form in = F(out)
– e.g. live variables

Can formalize backward analysis in two ways
– Option 1: reverse flow graph, and then run forward problem
– Option 2: re-develop the theory, but in the backward direction

Precision

Pretty much any problem in compiler optimization is undecidable.
In order to find solutions for such problem, we approximate! Therefore, we may lose precision.

There are several factors that may cause imprecision:
– Unreacheable code
– Path sensitivity
. Branch correlations: some paths are infeasible
. Path merging: can lead to loss of precision

MOP: meet over all paths

We find all paths that lead to certain program point, do analysis on each path, and merge all information.

MOP is the “best” possible answer given a fixed set of flow functions
However, there are too many paths, so it is not computable.
MOP != dataflow

However, by imposing some restrictions on the flow functions, we could guarantee that dataflow is the same as MOP.
If F(a U b) == F(a) U F(b) (distributivity holds), then dataflow is the same as MOP.

Reaching Definitions

For each program point in the CFG, compute the set of definitions (statements) that may reach that point.

D = 2^A
Top = All, Bottom = Empty (May)
Join = U, Meet = ^

Constant Propagation

D = 2^A
Top = Empty, Bottom = All (MUST)
Join = ^, Meet = U

We could improve the lattice by reducing the infinite height lattice to simpler one
– D = {Top, Bottom} U Z
– Two possibilities
. Option 1: tuple of lattices
. Option 2: map from variables to single lattice

Common Subexpression Elimination

Want to compute when an expression is available in a var

S = {X–>E | X \in Var, E \in Expr}
D = 2^S
Top = Empty, Bottom = All (MUST)
Join = ^, Meet = U

Live Analysis

A variable is live at a program point if it will be used before being redefined
A variable is dead at a program point if it is redefined before being used

D = 2^{Vars}
Top = All, Bottom = Empty (MAY)
Join = U, Meet = ^

[Compilers] Introduction and Common Optimizations

Compiler

Parser: Builds AST from string
Optimizer: Transform the program to make it better
CodeGen: Transforms from compiler representation to assembly or machine code

Optimizer

Compute information about a program + Use that information to perform program transformations

Representations

  • Abstract Syntax Tree
  • Control Flow Graph
  • Dataflow Graph
  • Static Single Assignment
  • Control Dependence Graph
  • Program Dependence Graph
  • Call Graph

Analysis/Transformation Algorithms

  • Dataflow Analysis
  • Interprocedural Analysis
  • Pointer Analysis

Applications

  • Scalar Optimizations
  • Loop Optimizations
  • Object Oriented Optimizations
  • Program Verification
  • Bug Finding

Common Optimizations

Constant Propagation: Replacing uses of constant variables with the constant value itself
Constant Folding: Precomputing constant values with simple operators
Strength Reduction: Replacing MULT with SHL

Common Subexpression Elimination: Replace use of common expressions with LHS of first appearance of the expression –> need guarantee that values of LHS and RHS are not modified between the two identical expressions
Partial Redundancy Elimination: Sometimes redundant expression could be eliminated…
Copy Propagation: If LHS of a copy operation is used later, it can be replaced –> need guarantee that values of LHS and RHS are not modified

Dead Assignment Elimination: Unused assignment can be removed
Dead Code Elimination: Unreachable code can be eliminated

Branch Folding, Range Analysis, Pointer/Alias Analysis, Loop Invariant Code Motion, Inlining…