SSA-based Compiler Design
Fabrice Rastello • Florent Bouchez Tichadou
Editors
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword
Static Single Assignment (SSA) form is now the dominant framework upon which
program analysis and transformation tools are currently built. It is possible to pro-
duce a state-of-the-art optimizing compiler that transforms the incoming program into SSA form after the initial syntactic and semantic analysis, maintains that representation throughout the entire optimization process, and only leaves SSA form when the final machine code is emitted. This book describes the state-of-the-art
representations of SSA form and the algorithms that use it.
The landscape of techniques for program analysis and transformation was very
different when the two of us started the initial development of SSA form. Many
of the compilers being built at the time were very simple, similar to what a student
might generate in a first compiler class. They produced code based on a template for
each production in the grammar of the language. The initial optimizing compilers
transformed the code based on the branch-free context of the surrounding code,
perhaps removing computation if the results were not to be used, perhaps computing
a result in a simpler way by using a previous result.
The first attempts at more ambitious optimization occurred at IBM, in a few
small companies in the Boston area and in a small number of academic institutions.
These people demonstrated that large gains in performance were available if the
scope of the optimizations was increased beyond the local level. In general, an
optimizing compiler can make transformations to an intermediate representation of a
program provided that the preconditions exist for the transformation to preserve the
semantics of the program. Before SSA form, the dominant approach was to construct
a summary of what must be true about the state of the program before and after every
statement in the program. The proof constructed would proceed by understanding
what parts of the state were modified by the statement. This entire process was
known as dataflow analysis. The results of that analysis were then used to make
transformations to the intermediate code.
at the time: the transformation did not adversely affect the correctness of the
representation.
The first work was incomplete. It didn’t cover the semantics of many common
storage classes, and the construction algorithm was not well thought through. But
the surprising lesson was that by constructing a uniform representation in which
both transformations and analysis could be represented, we could simplify and
improve the efficiency of both.
Over the next few years, we teamed up with others within IBM to develop not
only an efficient method to enter SSA form but also a suite of techniques that each
used SSA form and left the representation intact when finished. These techniques
included dead code elimination, value numbering, and invariant code motion.
The optimizations that we developed are in fact quite simple, but that simplicity
is a result of pushing much of the complexity into the SSA representation. The
dataflow counterparts to these optimizations are complex, and this difference in
complexity was not lost on the community of researchers outside of IBM. Being able
to perform optimizations on a mostly functional representation is much easier than
performing them on a representation that had all of the warts of a real programming
language.
By today’s standards, our original work was not really that useful: there were
only a handful of techniques and the SSA form only worked for unaliased variables.
But that original work defined the framework for how programming language
transformation was to be performed. This book is an expression of how far SSA
form has come and how far it needs to go to fully displace the older techniques with
more efficient and powerful replacements. We are grateful and humbled that our
work has led to such a powerful set of directions by others.
This book could have been given several names: “SSA-based compiler design,”
or “Engineering a compiler using the SSA form,” or even “The Static Single
Assignment form in practice.” But now, if anyone mentions “The SSA book,” then
they are certainly referring to this book.1
Twelve years were necessary to give birth to a book composed of 24 chapters and written by 31 authors. Twelve years: We are not very proud of this counter-performance, but here is how it came about: one day, a young researcher (myself) was told by a smart researcher (Jens Palsberg): "we should write a book about all this!" "We" included Florent Bouchez, Philip Brisk, Sebastian Hack, Fernando Pereira, Jens, and myself. "All this" referred to all the cool stuff related to
register allocation of SSA variables. Indeed, the discovery made independently by
the Californian (UCLA), German (U. of Karlsruhe), and French (Inria) research
groups (Alain Darte being at the origin of the discovery in the French group) was intriguing, as it questioned the relevance of all the register allocation papers published over the previous thirty years. The way those three research groups came to know each other is worth mentioning in itself, as without this anecdote the present world would probably be different (maybe not a lot, but most certainly with
no SSA book. . . ): A missing pdf file of a CGO article [240] on my web page made
Sebastian and Fernando independently contact me. “Just out of curiosity, why are
you interested in this paper?" led to a long, friendly, and fruitful collaboration. . . The
lesson to learn from that story is: do not spend too much time on your web page; it
may pay off in unexpected ways!
But why restrict a book on SSA to just register allocation? SSA was starting
to be widely adopted in mainstream compilers, such as LLVM, Hotspot, and GCC,
and was motivating many developers and research groups to revisit the different
1 Put differently, this book fulfils the criterion of referential transparency, which seems to be a
minimum requirement for a book about Static Single Assignment form. . . If this bad joke does not
make sense to you, then you definitely need to read the book. If it does make sense, I am pretty
sure you can still perfect your knowledge by reading it.
compiler analyses and optimizations for this new form. I was myself lucky to take my first steps in compiler development from 2000 to 2002 in LAO [102], an SSA-based assembly-level compiler developed at STMicroelectronics. Thanks to SSA, it took me only four days (including the two days necessary to read the related work) to develop a partial redundancy elimination pass very similar to the GVN-PRE of Thomas VanDrunen [295] for predicated code! Given my very low expertise,
implementing the SSAPRE of Fred Chow (which is not SSA-based—see Chap. 11)
would probably have taken me several months of development and debugging, for
an algorithm less effective in removing redundancies. In contrast, my pass contained
only one bug: I was removing the redundancies of the stack pointer because I did
not know what it was used for. . . There is a lesson to learn from that story: If you do not know what a stack pointer is, do not give up; with substantial effort you can still expect a position in a compiler group one day.
I realized later that many questions were not addressed by any of the existing
compiler books: If register allocation can take advantage of the structural properties
of SSA, what about other low-level optimizations such as post-pass scheduling or
instruction selection? Are the extensions for supporting aliasing or predication as
powerful as the original form? What are the advantages and disadvantages of SSA
form? I believed that writing a book that would address those broader questions
should involve the experts of the domain.
With the help of Sebastian Hack, I decided to organize the SSA seminar. It was held in Autrans near Grenoble in France (yes, for those who are old enough, this is where the 1968 Winter Olympics were held) and brought together 55
people (from left to right in the picture—speakers reported in italic): Philip Brisk,
Christoph Mallon, Sebastian Hack, Benoit Dupont de Dinechin, David Monniaux,
Christopher Gautier, Alan Mycroft, Alex Turjan, Dmitri Cheresiz, Michael Beck,
Paul Biggar, Daniel Grund, Vivek Sarkar, Verena Beckham, Jose Nelson Amaral,
Donald Nguyen, Kenneth Zadeck, James Stanier, Andreas Krall, Dibyendu Das,
Ramakrishna Upadrasta, Jens Palsberg, Ondrej Lhotak, Hervé Knochel, Anton
required. Indeed, almost every chapter contained at least one important technical
flaw; people have different definitions (with slight but still important differences)
for the same notion; some spent pages developing a new formalism that is already
covered by existing well-established ones. But a deep review of a single chapter roughly takes a week. . . Then starts a long walk in a tunnel where you context-switch from one chapter review to another, try to orchestrate the cross-reviews, ask for the unfinished chapters to be completed, pray that the missing chapters finally change their status to "work-in-progress," and harass colleagues with metre-long back-and-forth email threads trying to finally clarify this one 8-word sentence that may be
interpreted in the wrong way. You then realize no one will be able to implement
the agreed changes in time because they got themselves swamped by other duties
(the ones from their real job), and when these are available that is when you get
yourself swamped by other duties (the ones from your real job), and your 2-month
window is already flying away, leaving yourself realizing that the next one will be
next year. And time flies because, for everybody including yourself, there is always
something else more urgent than the book itself, and messages in this era of near-
instantaneous communication behave as if they were using strange paths around the
solar system, experiencing the joys of time dilation with 6-month return trips around
Saturn's rings. More prosaically, I will lay the blame on the incompatibility between the tight constraints imposed and the absence of a strong common deadline for everybody. And so, the final lesson to learn from that experience is that if a mountain climber proposes that you join him for a "short" hike, go for it: forget about the book, it will be less painful!
A few years after we initiated this project, as I was depressed by the time all this
was taking, I met Charles Glaser from Springer who told me: “You have plenty of
time: I have a book that took 15 years to be completed.” At the time I am writing
those lines, I still have a long list of to-dos, but I do not want to be the person Charles
mentions as “Do not worry, you cannot do worse than this book they started writing
15 years ago and which is still under construction. . . ” Sure, you might still find
some typos or inconsistencies, but there should not be many: the book brings together the knowledge of many talented compiler experts, including substantial unpublished
materials, and I am proud to release it today.
Part II Analysis
7 Introduction . . . 91
Markus Schordan and Fabrice Rastello
8 Propagating Information Using SSA . . . 95
Florian Brandner and Diego Novillo
9 Liveness . . . 107
Benoît Boissinot and Fabrice Rastello
10 Loop Tree and Induction Variables . . . 123
Sebastian Pop and Albert Cohen
11 Redundancy Elimination . . . 135
Fred Chow
References . . . 359
Index . . . 375
About the Editors
Fabrice Rastello is an Inria research director and the leader of the CORSE (Compiler Optimization and Runtime SystEms) Inria team. His expertise includes automatic parallelization (PhD thesis on tiling as a loop transformation) and compiler back-end optimizations (as an engineer in STMicroelectronics's compiler group and a researcher at Inria). Among other topics, he has advised several PhD theses aimed at fully revisiting register allocation for JIT compilation in the light of Static Single Assignment (SSA) properties. He likes mixing theory (mostly graph theory, algorithmics, and algebra) and practice (industrial transfer). His current research topics include: (1) combining runtime techniques with static compilation, hybrid compilation being an example of such an approach that he is trying to promote; (2) performance debugging through static and dynamic (binary instrumentation) analysis; and (3) revisiting compiler infrastructure for pattern-specific programs.
Florent Bouchez Tichadou received his PhD in computer science in 2009 from the ENS Lyon in France, working on program compilation. He was then a postdoctoral fellow at the Indian Institute of Science (IISc) in Bangalore, India, and worked for 3 years at Kalray, a startup company in the Grenoble area in France. Since 2013, he has been an assistant professor at the Université Grenoble Alpes (UGA).
Part I
Vanilla SSA
Chapter 1
Introduction
Jeremy Singer
In computer programming, as in real life, names are useful handles for concrete
entities. The key message of this book is that having unique names for distinct
entities reduces uncertainty and imprecision.
For example, consider overhearing a conversation about “Homer.” Without any
more contextual clues, you cannot disambiguate between Homer Simpson and
Homer the classical Greek poet; or indeed, any other people called Homer that you
may know. As soon as the conversation mentions Springfield (rather than Smyrna),
you are fairly sure that the Simpsons television series (rather than Greek poetry) is
the subject. On the other hand, if everyone had a unique name, then there would
be no possibility of confusing twentieth century American cartoon characters with
ancient Greek literary figures.
This book is about the Static Single Assignment form (SSA), which is a naming
convention for storage locations (variables) in low-level representations of computer
programs. The term static indicates that SSA relates to properties and analysis of
program text (code). The term single refers to the uniqueness property of variable
names that SSA imposes. As illustrated above, this enables a greater degree of
precision. The term assignment means variable definitions. For instance, in the code
x = y + 1;
the variable x is being assigned the value of expression (y + 1). This is a definition,
or assignment statement, for x. A compiler engineer would interpret the above
assignment statement to mean that the lvalue of x (i.e., the memory location labeled
as x) should be modified to store the value (y + 1).
The simplest, least constrained, definition of SSA can be given using the following
informal prose:
A program is defined to be in SSA form if each variable is a target of exactly one assignment
statement in the program text.
However, there are various, more specialized, varieties of SSA, which impose
further constraints on programs. Such constraints may relate to graph-theoretic
properties of variable definitions and uses, or the encapsulation of specific control-
flow or data-flow information. Each distinct SSA variety has specific characteristics.
Basic varieties of SSA are discussed in Chap. 2. Part III of this book presents more
complex extensions.
One important property that holds for all varieties of SSA, including the simplest
definition above, is referential transparency: i.e., since there is only a single
definition for each variable in the program text, a variable’s value is independent of
its position in the program. We may refine our knowledge about a particular variable
based on branching conditions, e.g., we know the value of x in the conditionally
executed block following an if statement that begins with
if(x == 0) .
However, the underlying value of x does not change at this if statement. Programs
written in pure functional languages are referentially transparent. Such referentially
transparent programs are more amenable to formal methods and mathematical
reasoning, since the meaning of an expression depends only on the meaning of
its subexpressions and not on the order of evaluation or side effects of other
expressions. For a referentially opaque program, consider the following code
fragment.
x = 1;
y = x + 1;
x = 2;
z = x + 1;
A naive (and incorrect) analysis may assume that the values of y and z are equal,
since they have identical definitions of (x + 1). However, the value of variable x
depends on whether the current code position is before or after the second definition
of x, i.e., variable values depend on their context. When a compiler transforms this
program fragment to SSA code, it becomes referentially transparent. The translation
process involves renaming to eliminate multiple assignment statements for the same
variable. Now it is apparent that y and z are equal if and only if x1 and x2 are equal.
x1 = 1;
y = x1 + 1;
x2 = 2;
z = x2 + 1;
1 Kenneth Zadeck reports that φ-functions were originally known as phoney-functions, during the
development of SSA at IBM Research. Although this was an in-house joke, it did serve as the basis
for the eventual name.
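The loop example that the following paragraph discusses was lost in this excerpt; here is a minimal sketch of such a loop in SSA form, consistent with the text below (the names x3 and y3 appear in the paragraph; the other names and the loop condition are assumed for illustration):

    x1 = 0;
    y1 = 0;
    loop:
      x2 = φ(x1, x3);   // first iteration takes x1, later iterations take x3
      y2 = φ(y1, y3);
      if (x2 >= 10) goto exit;
      y3 = y2 + x2;
      x3 = x2 + 1;
      goto loop;
    exit: ...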
The SSA code features two φ-functions in the loop header; these merge incoming
definitions from before the loop for the first iteration and from the loop body for
subsequent iterations.
It is important to outline that SSA should not be confused with (dynamic) single
assignment (DSA or simply SA) form used in automatic parallelization. Static
Single Assignment does not prevent multiple assignments to a variable during
program execution. For instance, in the SSA code fragment above, variables y3
and x3 in the loop body are redefined dynamically with fresh values at each loop
iteration.
Full details of the SSA construction algorithm are given in Chap. 3. For now, it
is sufficient to see that:
1. A φ-function has been inserted at the appropriate control-flow merge point where
multiple reaching definitions of the same variable converged in the original
program.
2. Integer subscripts have been used to rename variables x and y from the original
program.
As we will discover further in Chaps. 13 and 8, one of the major advantages of SSA
form concerns data-flow analysis. Data-flow analysis collects information about
programs at compile time in order to make optimizing code transformations. During
actual program execution, information flows between variables. Static analysis
captures this behaviour by propagating abstract information, or data-flow facts,
using an operational representation of the program such as the control-flow graph
(CFG). This is the approach used in classical data-flow analysis.
Often, data-flow information can be propagated more efficiently using a func-
tional, or sparse, representation of the program such as SSA. When a program is
translated into SSA form, variables are renamed at definition points. For certain
data-flow problems (e.g., constant propagation) this is exactly the set of program
points where data-flow facts may change. Thus it is possible to associate data-flow
facts directly with variable names, rather than maintaining a vector of data-flow facts
indexed over all variables, at each program point.
Figure 1.1 illustrates this point through an example of non-zero value analysis.
For each variable in a program, the aim is to determine statically whether that
variable can contain a zero integer value (i.e., null) at runtime. Here 0 represents the fact that the variable is null, 0̸ the fact that it is non-null, and ⊤ the fact that it is maybe-null. With classical dense data-flow analysis on the CFG in Fig. 1.1a, we
would compute information about variables x and y for each of the entry and exit
points of the six basic blocks in the CFG, using suitable data-flow equations. Using
sparse SSA-based data-flow analysis on Fig. 1.1b, we compute information about
each variable based on a simple analysis of its definition statement. This gives us
seven data-flow facts, one for each SSA version of variables x and y.
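To make the contrast concrete, the following Python sketch (all helper and field names are assumptions for illustration, not from the book) attaches one non-zero fact to each SSA name, instead of keeping a vector of facts per program point:

    TOP = "maybe-null"  # lattice: "null", "non-null", "maybe-null"

    def fact_for(stmt, facts):
        # Compute the data-flow fact of the SSA variable defined by stmt.
        if stmt.op == "const":
            return "non-null" if stmt.value != 0 else "null"
        if stmt.op == "phi":
            incoming = {facts.get(a, TOP) for a in stmt.args}
            return incoming.pop() if len(incoming) == 1 else TOP  # join
        return TOP  # conservative default for unmodelled operations

    def sparse_nonzero_analysis(stmts):
        # stmts: SSA statements in an order where definitions precede uses
        # (loops would require a worklist; omitted to keep the sketch short).
        facts = {}  # one fact per SSA name, not per (variable, point) pair
        for stmt in stmts:
            facts[stmt.target] = fact_for(stmt, facts)
        return facts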
For other data-flow problems, properties may change at points that are not
variable definitions. These problems can be accommodated in a sparse analysis
framework by inserting additional pseudo-definition functions at appropriate points
to induce additional variable renaming. See Chap. 13 for one such instance.
However, this example illustrates some key advantages of the SSA-based analysis.
1. Data-flow information propagates directly from definition statements to uses, via
the def-use links implicit in the SSA naming scheme. In contrast, the classical data-flow framework propagates information throughout the program, including at points where that information does not change or is not relevant.
Fig. 1.1 Example control-flow graph for non-zero value analysis, only showing relevant definition
statements for variables x and y
Static Single Assignment form was one such IR, which was developed at IBM
Research and announced publicly in several research papers in the late 1980s [6, 89,
249]. SSA rapidly acquired popularity due to its intuitive nature and straightforward
construction algorithm. The SSA property gives a standardized shape for variable
def-use chains, which simplifies data-flow analysis techniques.
Current Usage The majority of current commercial and open source compilers,
including GCC, LLVM, the HotSpot Java virtual machine, and the V8
JavaScript engine, use SSA as a key intermediate representation for program
analysis. As optimizations in SSA are fast and powerful, SSA is increasingly used
in just-in-time (JIT) compilers that operate on a high-level target-independent
program representation such as Java byte-code, CLI byte-code (.NET MSIL), or
LLVM bitcode.
Initially created to facilitate the development of high-level program transforma-
tions, SSA form has gained much interest due to its favourable properties that often
enable the simplification of algorithms and reduction of computational complexity.
Today, SSA form is even adopted for the final code generation phase (see Part IV),
i.e., the back-end. Several industrial and academic compilers, static or just-in-time,
use SSA in their back-ends, e.g., LLVM, HotSpot, LAO, libFirm, Mono. Many
compilers that use SSA form perform SSA elimination before register allocation,
including GCC, HotSpot, and LLVM. Recent research on register allocation (see
Chap. 22) even allows the retention of SSA form until the very end of the code
generation process.
SSA for High-Level Languages So far, we have presented SSA as a useful feature
for compiler-based analysis of low-level programs. It is interesting to note that some
high-level languages enforce the SSA property. The SISAL language is defined in
such a way that programs automatically have referential transparency, since multiple
assignments are not permitted to variables. Other languages allow the SSA property
to be applied on a per-variable basis, using special annotations like final in Java,
or const and readonly in C#.
The main motivation for allowing the programmer to enforce SSA in an explicit
manner in high-level programs is that immutability simplifies concurrent program-
ming. Read-only data can be shared freely between multiple threads, without any
data dependence problems. This is becoming an increasingly important issue, with
the shift to multi- and many-core processors.
High-level functional languages claim referential transparency as one of the cor-
nerstones of their programming paradigm. Thus functional programming supports
the SSA property implicitly. Chapter 6 explains the dualities between SSA and
functional programming.
In this chapter, we have introduced the notion of SSA. The rest of this book presents
various aspects of SSA, from the pragmatic perspective of compiler engineers and
code analysts. The ultimate goals of this book are:
1. To demonstrate clearly the benefits of SSA-based analysis
2. To dispel the fallacies that prevent people from using SSA
This section gives pointers to later parts of the book that deal with specific topics.
Myth: SSA greatly increases the number of variables.
Reference: Chapter 2 reviews the main varieties of SSA, some of which introduce far fewer variables than the original SSA formulation.

Myth: The SSA property is difficult to maintain.
Reference: Chapters 3 and 5 discuss simple techniques for the repair of SSA invariants that have been broken by optimization rewrites.

Myth: SSA destruction generates many copy operations.
Reference: Chapters 3 and 21 present efficient and effective SSA destruction algorithms.
Chapter 2
Properties and Flavours
Philip Brisk and Fabrice Rastello
Recall from the previous chapter that a procedure is in SSA form if every variable is
defined only once, and every use of a variable refers to exactly one definition. Many
variations, or flavours, of SSA form that satisfy these criteria can be defined, each
offering its own considerations. For example, different flavours vary in terms of the
number of φ-functions, which affects the size of the intermediate representation;
some variations are more difficult to construct, maintain, and destruct than others.
This chapter explores these SSA flavours and provides insights into their relative
merits in certain contexts.
Under SSA form, each variable is defined once. Def-use chains are data structures
that provide, for the single definition of a variable, the set of all its uses. In turn,
a use-def chain, which under SSA consists of a single name, uniquely specifies
the definition that reaches the use. As we will illustrate further in the book (see
Chap. 8), def-use chains are useful for forward data-flow analysis as they provide
direct connections that shorten the propagation distance between nodes that generate
and use data-flow information.
Because of its single definition per variable property, SSA form simplifies def-use and use-def chains in several ways. First, SSA form simplifies def-use chains by reducing the number of def-use edges: every use has exactly one reaching definition.
Fig. 2.1 Def-use chains (dashed) for a non-SSA program and its corresponding SSA form program
2.2 Minimality
Fig. 2.2 A non-strict code and its corresponding strict SSA form. The presence of ⊥ indicates a
use of an undefined value
φ-functions need to be inserted somewhere in J(D ∪ {r}), as in our example of Fig. 2.2b where
⊥ represents the undefined pseudo-definition. The so-called minimal SSA form is a
variant of SSA form that satisfies both the minimality and dominance properties. As
shall be seen in Chap. 3, minimal SSA form is obtained by placing the φ-functions of variable v at J(Dv ∪ {r}) using the formalism of dominance frontiers. If the original
procedure is non-strict, conversion to minimal SSA will create a strict SSA-based
representation. Here, strictness refers solely to the SSA representation; if the input
program is non-strict, conversion to and from strict SSA form cannot address errors
due to uninitialized variables. Finally, the use of an implicit pseudo-definition
in the CFG entry node to enforce strictness does not change the semantics of the
program by any means.
SSA with dominance property is useful for many reasons that directly originate
from the structural properties of the variable live ranges. The immediate dominator
or “idom” of a node N is the unique node that strictly dominates N but does not
strictly dominate any other node that strictly dominates N . All nodes but the entry
node have immediate dominators. A dominator tree is a tree where the children
of each node are those nodes it immediately dominates. Because the immediate
dominator is unique, it is a tree with the entry node as root. For each variable,
its live range, i.e., the set of program points where it is live, is a sub-tree of the
dominator tree. Among other consequences of this property, we can cite the ability
to design a fast and efficient method to query whether a variable is live at point q or
an iteration-free algorithm to compute liveness sets (see Chap. 9). This property also
allows efficient algorithms to test whether two variables interfere (see Chap. 21).
Usually, we suppose that two variables interfere if their live ranges intersect (see
Sect. 2.6 for further discussions about this hypothesis). Note that in the general case,
a variable is considered to be live at a program point if there exists a definition of that
variable that can reach this point (reaching-definition analysis), and if there exists a
definition-free path to a use (upward-exposed use analysis). For strict programs, any
program point from which you can reach a use without going through a definition is
necessarily reachable from a definition.
Another elegant consequence is that the intersection graph of live ranges belongs
to a special class of graphs called chordal graphs. Chordal graphs are significant
because several problems that are NP-complete on general graphs have efficient
linear-time solutions on chordal graphs, including graph colouring. Graph colouring
plays an important role in register allocation, as the register assignment problem can
be expressed as a colouring problem of the interference graph. In this graph, two
variables are linked with an edge if they interfere, meaning they cannot be assigned
the same physical location (usually, a machine register, or “colour”). The underlying
chordal property highly simplifies the assignment problem otherwise considered
NP-complete. In particular, a traversal of the dominator tree, i.e., a “tree scan,” can
colour all of the variables in the program, without requiring the explicit construction
of an interference graph. The tree scan algorithm can be used for register allocation,
which is discussed in greater detail in Chap. 22.
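As a rough illustration of the tree scan idea (a sketch only, not the algorithm of Chap. 22; all data structure names are assumed), variables can be coloured greedily during one walk over the dominator tree:

    def tree_scan(block, children, defs_in, live_at_def, colour):
        # children[b]: dominator tree children of block b
        # defs_in[b]: variables defined in b, in definition order
        # live_at_def[v]: variables live at the definition point of v
        for v in defs_in[block]:
            taken = {colour[u] for u in live_at_def[v] if u in colour}
            c = 0
            while c in taken:  # lowest colour unused by interfering variables
                c += 1
            colour[v] = c
        for child in children[block]:
            tree_scan(child, children, defs_in, live_at_def, colour)

Since definitions are visited in dominance order, every variable live at the definition of v is already coloured when v is processed; chordality is what guarantees that this greedy traversal needs no more colours than an optimal colouring of the interference graph.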
As we have already mentioned, most φ-function placement algorithms are based
on the notion of dominance frontier (see Chaps. 3 and 4) and consequently do
provide the dominance property. As we will see in Chap. 3, this property can be
broken by copy propagation: In our example of Fig. 2.2b, the argument a1 of the
copy represented by a2 = φ(a1 , ⊥) can be propagated and every occurrence of
a2 can be safely replaced by a1; the now-identity φ-function can then be removed, recovering the initial code, which is still SSA but no longer strict. Making a non-
strict SSA code strict is about the same complexity as SSA construction (actually
we need a pruned version as described below). Still, the “strictification” usually
concerns only a few variables and a restricted region of the CFG: The incremental
update described in Chap. 5 will do the work with less effort.
One drawback of minimal SSA form is that it may place φ-functions for a variable
at a point in the control-flow graph where the variable was not actually live prior to
SSA. Many program analyses and optimizations, including register allocation, are
only concerned with the region of a program where a given variable is live. The
primary advantage of eliminating those dead φ-functions is that the resulting form has far fewer φ-functions than minimal SSA in most cases. It is possible to construct such a form while otherwise still maintaining the minimality and dominance properties.
The new constraint is that every use point for a given variable must be reached by
exactly one definition, as opposed to all program points. Pruned SSA form satisfies
these properties.
Under minimal SSA, φ-functions for variable v are placed at the entry points
of basic blocks belonging to the set J(S ∪ {r}). Under pruned SSA, we suppress the
instantiation of a φ-function at the beginning of a basic block if v is not live at
the entry point of that block. One possible way to do this is to perform liveness
analysis prior to SSA construction, and then use the liveness information to suppress
the placement of φ-functions as described above; another approach is to construct
minimal SSA and then remove the dead φ-functions using dead-code elimination;
details can be found in Chap. 3.
Figure 2.3a shows an example of minimal non-pruned SSA. The corresponding
pruned SSA form would remove the dead φ-function that defines Y3 since Y1 and
Y2 are only used in their respective definition blocks.
In many non-SSA and graph colouring based register allocation schemes, register
assignment is done at the granularity of webs. In this context, a web is the maximal union of def-use chains that have either a use or a def in common. As an example,
the code in Fig. 2.4a leads to two separate webs for variable a. The conversion to
Fig. 2.3 Non-pruned SSA form allows value numbering to determine that Y3 and Z3 have the
same value
Fig. 2.4 (a) Non-SSA register webs of variable a; (b) corresponding conventional SSA form with φ-webs {a1} and {a2, a3, a4}; (c) corresponding transformed SSA form after copy propagation of a1
minimal SSA form replaces each web of a variable v in the pre-SSA program with
some variable names vi . In pruned SSA, these variable names partition the live
range of the web: At every point in the procedure where the web is live, exactly
one variable vi is also live; and none of the vi is live at any point where the web
is not.
Based on this observation, we can partition the variables in a program that has
been converted to SSA form into φ-equivalence classes that we will refer to as φ-webs. We say that x and y are φ-related to one another if they are referenced by the same φ-function, i.e., if x and y are either parameters or defined by the φ-
function. The transitive closure of this relation defines an equivalence relation that
partitions the variables defined locally in the procedure into equivalence classes,
the φ-webs. Intuitively, the φ-equivalence class of a resource represents a set of
resources “connected” via φ-functions. For any freshly constructed SSA code, the
φ-webs exactly correspond to the register web of the original non-SSA code.
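The transitive closure described above is conveniently computed with a union-find structure; a minimal Python sketch (the input format is assumed) follows:

    def phi_webs(phi_functions):
        # phi_functions: iterable of (target, args) pairs, one per φ-function.
        # Variables not referenced by any φ-function form singleton webs and
        # are not returned by this sketch.
        parent = {}
        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x
        def union(x, y):
            parent[find(x)] = find(y)
        for target, args in phi_functions:
            for a in args:
                union(target, a)  # the target and all arguments are φ-related
        webs = {}
        for x in list(parent):
            webs.setdefault(find(x), set()).add(x)
        return list(webs.values())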
Conventional SSA form (C-SSA) is defined as SSA form for which each φ-
web is interference-free. Many program optimizations such as copy propagation
may transform a procedure from conventional to a non-conventional (T-SSA for
Transformed SSA) form, in which some variables belonging to the same φ-web
interfere with one another. Figure 2.4c shows the corresponding transformed SSA
form of our previous example: Here variable a1 interferes with variables a2 , a3 , and
a4 , since it is defined at the top and used last.
Bringing back the conventional property of a T-SSA code is as "difficult" as translating out of SSA (also known as SSA "destruction," see Chap. 3). Indeed,
the destruction of conventional SSA form is straightforward: Each φ-web can be
replaced with a single variable; all definitions and uses are renamed to use the new
variable, and all φ-functions involving this equivalence class are removed. SSA
destruction starting from non-conventional SSA form can be performed through
a conversion to conventional SSA form as an intermediate step. This conversion
is achieved by inserting copy operations that dissociate interfering variables from
the connecting φ-functions. As those copy instructions will have to be inserted at
some points to get rid of φ-functions, for machine-level transformations such as
register allocation or scheduling, T-SSA provides an inaccurate view of the resource
usage. Another motivation for sticking to C-SSA is that the names used in the
original program might help capture some properties otherwise difficult to discover.
Lexical partial redundancy elimination (PRE) as described in Chap. 11 illustrates
this point.
Apart from those specific examples, most current compilers choose not to maintain the conventional property. Still, we should outline that, as later described
in Chap. 21, checking if a given φ-web is (and if necessary turning it back to)
interference-free can be done in linear time (instead of the naive quadratic time
algorithm) in the size of the φ-web.
Throughout this chapter, two variables have been said to interfere if their live
ranges intersect. Intuitively, two variables with overlapping lifetimes will require
two distinct storage locations; otherwise, a write to one variable will overwrite
the value of the other. In particular, this definition has applied to the discussion
of interference graphs and the definition of conventional SSA form, as described
above.
Although it suffices for correctness, this is a fairly restrictive definition of
interference, based on static considerations. Consider for instance the case where two simultaneously live variables in fact contain the same value: then it would not be a problem to put both of them in the same register. The ultimate notion of
interference, which is obviously undecidable because of a reduction to the halting
problem, should decide for two distinct variables whether there exists an execution
for which they simultaneously hold two different values. Several “static” extensions
to our simple definition are still possible, in which, under very specific conditions,
variables whose live ranges overlap one another may not interfere. We present two
examples.
Firstly, consider the double-diamond graph of Fig. 2.2a again, which, although
non-strict, is correct as soon as the two if conditions are the same. Even if a and
b are unique variables with overlapping live ranges, the paths along which a and b
are respectively used and defined are mutually exclusive with one another. In this
case, the program will either pass through the definition of a and the use of a, or
the definition of b and the use of b, since all statements involved are controlled by
the same condition, albeit at different conditional statements in the program. Since
only one of the two paths will ever execute, it suffices to allocate a single storage
location that can be used for a or b. Thus, a and b do not actually interfere with
one another. A simple way to refine the interference test is to check if one of the
variables is live at the definition point of the other. This relaxed but correct notion
of interference would not make a and b in Fig. 2.2a interfere while variables a1
and b1 of Fig. 2.2b would still interfere. This example illustrates the fact that live
range splitting required here to make the code fulfil the dominance property may
lead to less accurate analysis results. As far as interference is concerned, for an SSA code with the dominance property, the two notions are strictly equivalent: Two live
ranges intersect iff one contains the definition of the other.
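Under the dominance property, this relaxed test is easy to implement; a Python sketch (the helpers def_of, dominates, and live_at are assumed to be available) could read:

    def interfere(a, b, def_of, dominates, live_at):
        # Strict SSA: live ranges can only intersect if the definition of one
        # variable dominates the definition of the other.
        if dominates(def_of[a], def_of[b]):
            return a in live_at[def_of[b]]  # is a live at the definition of b?
        if dominates(def_of[b], def_of[a]):
            return b in live_at[def_of[a]]
        return False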
Secondly, consider two variables u and v, whose live ranges overlap. If we can
prove that u and v will always hold the same value at every place where both
are live, then they do not actually interfere with one another. Since they always
have the same value, a single storage location can be allocated for both variables,
because there is only one unique value between them. Of course, this new criterion
is in general undecidable. Still, a technique such as global value numbering that is
straightforward to implement under SSA (see Sect. 11.5.1) can do a fairly good
job, especially in the presence of a code with many variable-to-variable copies, such
as one obtained after a naive SSA destruction pass (see Chap. 3). In that case (see
Chap. 21), the difference between the refined notion of interference and the non-
value-based one is significant.
This refined notion of interference has significant implications if applied to SSA
form. In particular, the interference graph of a procedure is no longer chordal, as
any edge between two variables whose lifetimes overlap could be eliminated by this
property.
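Combined with value information, the test can be refined further; here is a sketch layered on the previous one, where value_number is an assumed global-value-numbering oracle and intersect is any live range intersection test:

    def interfere_refined(a, b, value_number, intersect):
        # Variables proven to always hold the same value never interfere,
        # even if their live ranges intersect.
        if value_number(a) == value_number(b):
            return False
        return intersect(a, b)  # fall back to live range intersection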
The advantages of def-use and use-def chains provided almost for free under SSA
are well illustrated in Chaps. 8 and 13.
The notion of minimal SSA and a corresponding efficient algorithm to compute
it were introduced by Cytron et al. [90]. For this purpose they extensively develop
the notion of dominance frontier of a node n, DF(n) = J({n} ∪ {r}). The fact that J+(S) = J(S) was actually discovered later, with a simple proof by Wolfe [307]. More details about the theory on (iterated) dominance frontiers can be
found in Chaps. 3 and 4. The post-dominance frontier, which is its symmetric notion,
also known as the control dependence graph, finds many applications. Further
discussions on control dependence graph can be found in Chap. 14.
Most SSA papers implicitly consider the SSA form to fulfil the dominance
property. The first technique that really exploits the structural properties of the
strictness is the fast SSA destruction algorithm developed by Budimlić et al. [53]
and revisited in Chap. 21.
The notion of pruned SSA has been introduced by Choi, Cytron and
Ferrante [67]. The example of Fig. 2.3 to illustrate the difference between pruned
and non-pruned SSA has been borrowed from Cytron et al. [90]. The notions of
conventional and transformed SSA were introduced by Sreedhar et al. in their
seminal paper [267] for destructing SSA form. The description of the existing
techniques to turn a general SSA into either a minimal, a pruned, a conventional, or
a strict SSA is provided in Chap. 3.
The ultimate notion of interference was first discussed by Chaitin in his seminal
paper [60] that presents the graph colouring approach for register allocation. His
interference test is similar to the refined test presented in this chapter. In the
context of SSA destruction, Chap. 21 addresses the issue of taking advantage of
the dominance property with this refined notion of interference.
Chapter 3
Standard Construction and Destruction Algorithms
Jeremy Singer and Fabrice Rastello
This chapter describes the standard algorithms for construction and destruction of
SSA form. SSA construction refers to the process of translating a non-SSA program
into one that satisfies the SSA constraints. In general, this transformation occurs as
one of the earliest phases in the middle-end of an optimizing compiler, when the
program has been converted to three-address intermediate code. SSA destruction
is sometimes called out-of-SSA translation. This step generally takes place in
an optimizing compiler after all SSA optimizations have been performed, and
prior to code generation. Note however that there are specialized code generation
techniques that can work directly on SSA-based intermediate representations such
as instruction selection (see Chap. 19), if-conversion (see Chap. 20), and register
allocation (see Chap. 22).
The algorithms presented in this chapter are based on material from the seminal
research papers on SSA. These original algorithms are straightforward to implement
and have acceptable efficiency. Therefore such algorithms are widely implemented
in current compilers. Note that more efficient, albeit more complex, alternative
algorithms have been devised. These are described further in Chaps. 4 and 21.
Figure 3.1 shows the control-flow graph (CFG) of an example program. The
set of nodes is {r, A, B, C, D, E}, and the variables used are {x, y, tmp}. Note that
the program shows the complete control-flow structure, denoted by directed edges
between the nodes. However, the program only shows statements that define relevant
variables, together with the unique return statement at the exit point of the CFG.
All of the program variables are undefined on entry. On certain control-flow paths,
some variables may be used without being defined, e.g., x on the path r → A →
C. We discuss this issue later in the chapter. We intend to use this program as a
running example throughout the chapter, to demonstrate various aspects of SSA
construction.
3.1 Construction
The original construction algorithm for SSA form consists of two distinct phases.
1. φ-function insertion performs live range splitting to ensure that any use of a
given variable v is reached¹ by exactly one definition of v. The resulting live
ranges exhibit the property of having a single definition, which occurs at the
beginning of each live range.
2. Variable renaming assigns a unique variable name to each live range. This
second phase rewrites variable names in program statements such that the
program text contains only one definition of each variable, and every use refers
to its corresponding unique reaching definition.
As already outlined in Chap. 2, there are different flavours of SSA with distinct
properties. In this chapter, we focus on the minimal SSA form.
1 A program point p is said to be reachable by a definition of v if there exists a path in the CFG
from that definition to p that does not contain any other definition of v.
In order to explain how φ-function insertions occur, it will be helpful to review the
related concepts of join sets and dominance frontiers.
For a given set of nodes S in a CFG, the join set J (S) is the set of join nodes of
S, i.e., nodes in the CFG that can be reached by two (or more) distinct elements of
S using disjoint paths. Join sets were introduced in Chap. 2, Sect. 2.2.
Let us consider some join set examples from the program in Fig. 3.1.
1. J({B, C}) = {D}, since it is possible to get from B to D and from C to D along different, non-overlapping paths.
2. Similarly, J({r, A, B, C, D, E}) = {A, D, E} (where r is the entry), since the nodes A, D, and E are the only nodes with multiple direct predecessors in the program.
The dominance frontier of a node n, DF(n), is the border of the CFG region that
is dominated by n. More formally,
• Node x strictly dominates node y if x dominates y and x ≠ y.
• The set of nodes DF(n) contains all nodes x such that n dominates a direct
predecessor of x but n does not strictly dominate x.
For instance, in our Fig. 3.1, the dominance frontier of the y defined in block B
is the first operation of D, while the DF of the y defined in block C would be the
first operations of D and E.
Note that DF is defined over individual nodes, but for simplicity of presentation,
we overload it to operate over sets of nodes too, i.e., DF(S) = ⋃s∈S DF(s). The iterated dominance frontier DF+(S) is obtained by iterating the computation of DF until reaching a fixed point, i.e., it is the limit DFi→∞(S) of the sequence:

DF1(S) = DF(S)
DFi+1(S) = DF(S ∪ DFi(S))

A variable v requires φ-functions at the iterated dominance frontier DF+(Defs(v)), where Defs(v) is the set
of nodes containing definitions of v. This leads to the construction of SSA form that
has the dominance property, i.e., where each definition of each renamed variable
dominates its entire live range.
Consider again our running example from Fig. 3.1. The set of nodes containing
definitions of variable x is {B, C, D}. The iterated dominance frontier of this set is
{A, D, E} (it is also the DF, no iteration needed here). Hence we need to insert φ-
functions for x at the beginning of nodes A, D, and E. Figure 3.2 shows the example
CFG program with φ-functions for x inserted.
As far as the actual algorithm for φ-functions insertion is concerned, we will
assume that the dominance frontier of each CFG node is pre-computed and that the
iterated dominance frontier is computed on the fly, as the algorithm proceeds. The
algorithm works by inserting φ-functions iteratively using a worklist of definition
points, and flags (to avoid multiple insertions). The corresponding pseudo-code for
φ-function insertion is given in Algorithm 3.1. The worklist of nodes W is used
to record definition points that the algorithm has not yet processed, i.e., it has not
yet inserted φ-functions at their dominance frontiers. Because a φ-function is itself a
definition, it may require further φ-functions to be inserted. This is the cause of node
insertions into the worklist W during iterations of the inner loop in Algorithm 3.1.
Effectively, we compute the iterated dominance frontier on the fly. The set F is used
to avoid repeated insertion of φ-functions on a single block. Dominance frontiers
of distinct nodes may intersect, e.g., in the example CFG in Fig. 3.1, DF(B) and
DF(C) both contain D, but once a φ-function for a particular variable has been
inserted at a node, there is no need to insert another, since a single φ-function per
variable handles all incoming definitions of that variable to that node.
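The pseudo-code of Algorithm 3.1 is not reproduced in this excerpt; the following Python sketch mirrors the worklist structure just described (W as the worklist, F as the flag set; the data structure names are assumed):

    def insert_phi_functions(variables, defs, DF, insert_phi):
        # defs[v]: set of CFG nodes that contain a definition of v
        # DF[n]:   dominance frontier of node n (pre-computed)
        # insert_phi(v, n): place "v = φ(...)" at the entry of node n
        for v in variables:
            F = set()          # blocks where a φ-function for v was inserted
            W = list(defs[v])  # definition points not yet processed
            while W:
                X = W.pop()
                for Y in DF[X]:
                    if Y not in F:
                        insert_phi(v, Y)
                        F.add(Y)
                        # the φ-function is itself a new definition of v, so
                        # process it too: this computes the iterated dominance
                        # frontier on the fly
                        if Y not in defs[v]:
                            W.append(Y)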
We give a walkthrough example of Algorithm 3.1 in Table 3.1. It shows the stages
of execution for a single iteration of the outermost for loop, inserting φ-functions
for variable x. Each row represents a single iteration of the while loop that iterates
over the worklist W . The table shows the values of X, F , and W at the start of each
while loop iteration. At the beginning, the CFG looks like Fig. 3.1. At the end, when
all the φ-functions for x have been placed, then the CFG looks like Fig. 3.2.
Provided that the dominator tree is given, the computation of the dominance
frontier is quite straightforward. As illustrated by Fig. 3.3, this can be understood
using the DJ-graph notation. The skeleton of the DJ-graph is the dominator tree
of the CFG that makes the D-edges (dominance edges). This is augmented with
J-edges (join edges) that correspond to all edges of the CFG whose source does
not strictly dominate its destination. A DF-edge (dominance frontier edge) is an
edge whose destination is in the dominance frontier of its source. By definition,
there is a DF-edge (a, b) between every pair of CFG nodes a, b such that a dominates
a direct predecessor of b, but does not strictly dominate b. In other words, for
each J -edge (a, b), all ancestors of a (including a) that do not strictly dominate
b have b in their dominance frontier. For example, in Fig. 3.3, (F, G) is a J-edge,
so {(F, G), (E, G), (B, G)} are DF-edges. This leads to the pseudo-code given in
Fig. 3.3 An example CFG and its corresponding DJ-graph (D-edges are top-down), DF-graph,
and DF+ -graph
Algorithm 3.2, where for every edge (a, b) we visit all ancestors of a to add b to
their dominance frontier.
Since the iterated dominance frontier is simply the transitive closure of the
dominance frontier, we can define the DF+ -graph as the transitive closure of the
DF-graph. In our example, as {(C, E), (E, G)} are DF-edges, (C, G) is a DF+ -
edge. Hence, a definition of x in C will lead to inserting φ-functions in E and G.
We can compute the iterated dominance frontier for each variable independently, as
outlined in this chapter, or “cache” it to avoid repeated computation of the iterated
dominance frontier of the same node. This leads to more sophisticated algorithms
detailed in Chap. 4.
Algorithm 3.2: Computing the dominance frontier of each CFG node

for each CFG edge (a, b) do
    x ← a
    while x does not strictly dominate b do
        DF(x) ← DF(x) ∪ {b}
        x ← immediate dominator of x
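The same computation in executable form (a sketch; idom and the dominance test are assumed to be precomputed, and the CFG entry is assumed to have no incoming edges):

    def dominance_frontiers(edges, idom, strictly_dominates):
        df = {}
        for a, b in edges:
            x = a
            # walk up the dominator tree from a; every node visited has b in
            # its dominance frontier, by the DF-edge characterization above
            while not strictly_dominates(x, b):
                df.setdefault(x, set()).add(b)
                x = idom[x]
        return df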
Once φ-functions have been inserted using this algorithm, the program usually
still contains several definitions per variable; however, now there is a single
definition statement in the CFG that reaches each use. For each variable use in
a φ-function, it is conventional to treat them as if the use actually occurs on the
corresponding incoming edge or at the end of the corresponding direct predecessor
node. If we follow this convention, then def-use chains are aligned with the CFG
dominator tree. In other words, the single definition that reaches each use dominates
that use.
The variable renaming algorithm translates our running example from Fig. 3.1
into the SSA form of Fig. 3.4a. The table in Fig. 3.4b gives a walkthrough example
of Algorithm 3.3, only considering variable x. The labels li mark instructions
in the program that mention x, shown in Fig. 3.4a. The table records (1) when
is encountered (we read from the reachingDef field at this point). Multiple stack
values may be popped when moving to a different node in the dominator tree (we
always check whether we need to update the reachingDef field before we read from
it). While the slot-based algorithm requires more memory, it can take advantage of
an existing working field for a variable and be more efficient in practice.
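As the body of Algorithm 3.3 is not reproduced here, the classical stack-based formulation of renaming can be sketched as follows (a simplification assuming one definition per statement, a strict program, and the field names shown):

    def rename(block, stacks, counters, dom_children, succs):
        # stacks[v]: stack of currently visible SSA names of v (pre-initialized)
        pushed = []
        for stmt in block.stmts:
            if stmt.op != "phi":  # φ uses are renamed from the predecessors below
                stmt.uses = [stacks[u][-1] for u in stmt.uses]
            if stmt.target is not None:
                v = stmt.target
                counters[v] += 1
                stacks[v].append(f"{v}{counters[v]}")  # new SSA version of v
                pushed.append(v)
                stmt.target = stacks[v][-1]
        for s in succs[block]:
            for phi in s.phis:
                # the φ argument along this incoming edge sees the current version
                phi.set_arg(block, stacks[phi.var][-1])
        for child in dom_children[block]:
            rename(child, stacks, counters, dom_children, succs)
        for v in pushed:
            stacks[v].pop()  # leaving this dominator subtree: restore visibility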
3.1.4 Summary
Now let us review the flavour of SSA form that this simple construction algorithm
produces. We refer back to several SSA properties that were introduced in Chap. 2.
• It is minimal (see Sect. 2.2). After the φ-function insertion phase, but before
variable renaming, the CFG contains the minimal number of inserted φ-functions
to achieve the property that exactly one definition of each variable v reaches every
point in the graph.
• It is not pruned (see Sect. 2.4). Some of the inserted φ-functions may be dead,
i.e., there is not always an explicit use of the variable subsequent to the φ-function
(e.g., y5 in Fig. 3.4a).
• It is conventional (see Sect. 2.5). The transformation that renames all φ-related
variables into a unique representative name and then removes all φ-functions is
a correct SSA destruction algorithm.
• Finally, it has the dominance property (that is, it is strict—see Sect. 2.3). Each
variable use is dominated by its unique definition. This is due to the use of
iterated dominance frontiers during the φ-placement phase, rather than join sets.
Whenever the iterated dominance frontier of the set of definition points of a
variable differs from its join set, then at least one program point can be reached both from r (the entry of the CFG) and from one of the definition points. In other words,
as in Fig. 3.1, one of the uses of the φ-function inserted in block A for x does
not have any actual reaching definition that dominates it. This corresponds to
the ⊥ value used to initialize each reachingDef slot in Algorithm 3.3. Actual
implementation code can use a NULL value, create a fake undefined variable at
the entry of the CFG, or create undefined pseudo-operations on the fly just before
the particular use.
3.2 Destruction
While freshly constructed SSA code is conventional, this may not be the case
after performing some optimizations such as copy propagation. Going back to
conventional SSA form requires the insertion of copies. The simplest (although not
the most efficient) way to destroy non-conventional SSA form is to split all critical
edges, and then replace φ-functions by copies at the end of direct predecessor
basic blocks. A critical edge is an edge from a node with several direct successors
to a node with several direct predecessors. The process of splitting an edge, say
(b1 , b2 ), involves replacing edge (b1 , b2 ) by (1) an edge from b1 to a freshly
created basic block and by (2) another edge from this fresh basic block to b2 . As
φ-functions have parallel semantics, i.e., they have to be executed simultaneously, not
sequentially, the same holds for the corresponding copies inserted at the end of
direct predecessor basic blocks. To this end, a pseudo instruction called a parallel
copy is created to represent a set of copies that have to be executed in parallel.
The replacement of parallel copies by sequences of simple copies is handled
later on. Algorithm 3.5 presents the corresponding pseudo-code that makes non-
conventional SSA conventional. As already mentioned, SSA destruction of such
form is straightforward. However, Algorithm 3.5 can be slightly modified to directly
destruct SSA by deleting line 13, replacing ai by a0 in the following lines, and
adding “remove the φ-function” after them.
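The overall recipe (split critical edges, then turn the φ-functions of each block into one parallel copy per direct predecessor) can be sketched in Python as follows. The CFG encoding (succs/preds dictionaries) and all names are our assumptions, not the book's Algorithm 3.5.

    def split_critical_edges(succs, preds):
        # split every edge (b1, b2) where b1 has several direct successors
        # and b2 has several direct predecessors
        fresh = 0
        for b1 in list(succs):
            for i, b2 in enumerate(list(succs[b1])):
                if len(succs[b1]) > 1 and len(preds[b2]) > 1:
                    n = ("split", fresh); fresh += 1
                    succs[b1][i] = n                  # b1 -> n
                    succs[n], preds[n] = [b2], [b1]   # n  -> b2
                    preds[b2][preds[b2].index(b1)] = n

    def phis_to_parallel_copies(phis, preds):
        # phis[b]: list of (dest, args), one argument per direct
        # predecessor of b, in the same order as preds[b]
        pcopy = {}
        for b, entries in phis.items():
            for k, p in enumerate(preds[b]):
                moves = [(dest, args[k]) for dest, args in entries]
                pcopy.setdefault(p, []).extend(moves)
        return pcopy    # one parallel copy (a list of moves) per predecessor

All moves gathered for one predecessor form a single parallel copy; turning it into a sequence of simple copies is deferred, as discussed below.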
We stress that the above destruction technique has several drawbacks: first,
because of specific architectural constraints, region boundaries, or exception
handling code, the compiler might not permit the splitting of a given edge; second, the
resulting code contains many temporary-to-temporary copy operations. In theory,
reducing the frequency of these copies is the role of the coalescing during the
register allocation phase. A few memory- and time-consuming coalescing heuristics
mentioned in Chap. 22 can handle the removal of these copies effectively. Coalesc-
ing can also, with less effort, be performed prior to the register allocation phase.
As opposed to the (so-called conservative) coalescing performed during register
allocation, this aggressive coalescing does not have to preserve the colourability
of the interference graph.
Further, the process of copy insertion itself might take a substantial amount of time
and might not be suitable for dynamic compilation. The goal of Chap. 21 is to cope
with non-splittable edges and with the difficulties of SSA destruction at the
machine-code level, as well as with aggressive coalescing in the context of
resource-constrained compilation.
Once φ-functions have been replaced by parallel copies, we need to sequentialize
the parallel copies, i.e., replace them by a sequence of simple copies. This phase
can be performed immediately after SSA destruction or later on, perhaps even after
register allocation (see Chap. 22). It might be useful to postpone the copy sequential-
ization since it introduces arbitrary interference between variables. As an example,
a1 ← a2 ∥ b1 ← b2 (where inst1 ∥ inst2 represents two instructions inst1 and inst2
to be executed simultaneously) can be sequentialized into a1 ← a2 ; b1 ← b2 , which
would make b2 interfere with a1 , while the other way round, b1 ← b2 ; a1 ← a2 ,
would make a2 interfere with b1 instead.
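A parallel copy can be sequentialized with a single temporary: emit a copy as soon as its destination is no longer needed as a source, and break the remaining cycles explicitly. The Python sketch below is one standard way to do this; the function and variable names are illustrative.

    def sequentialize(moves, tmp="tmp"):
        """Turn a parallel copy (a list of (dest, src) pairs with distinct
        destinations) into an equivalent sequence of simple copies."""
        seq = []
        pending = {d: s for d, s in moves if d != s}   # drop no-op moves
        while pending:
            free = [d for d in pending if d not in pending.values()]
            if free:                      # nobody still reads d: emit it
                d = free[0]
                seq.append((d, pending.pop(d)))
            else:                         # only cycles remain: break one
                d = next(iter(pending))
                seq.append((tmp, d))      # save the value about to die
                pending = {x: (tmp if y == d else y)
                           for x, y in pending.items()}
        return seq

For the swap a1 ← a2 ∥ a2 ← a1 this produces tmp ← a1 ; a1 ← a2 ; a2 ← tmp.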
The construction algorithm described above does not build pruned SSA form.
If available, liveness information can be used to filter out the insertion of φ-
functions wherever the variable is not live: The resulting SSA form is pruned.
Alternatively, pruning SSA form is equivalent to a dead code elimination pass after
SSA construction. As use-def chains are implicitly provided by SSA form, dead-
φ-function elimination simply relies on marking actual uses (non-φ-function ones)
as useful and propagating usefulness backwards through φ-functions. Algorithm 3.7
presents the relevant pseudo-code for this operation. Here, stack is used to store
useful and unprocessed variables defined by φ-functions.
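The propagation just described fits in a few lines of Python. In this sketch, phi_args maps each φ-defined variable to the variables read by its φ-function, and used_outside_phis is the set of variables with at least one non-φ use; both names are ours.

    def dead_phis(phi_args, used_outside_phis):
        useful = set(used_outside_phis)
        stack = [v for v in useful if v in phi_args]   # useful, unprocessed
        while stack:
            v = stack.pop()
            for a in phi_args[v]:          # arguments of v's phi are useful
                if a not in useful:
                    useful.add(a)
                    if a in phi_args:      # a is itself phi-defined
                        stack.append(a)
        # phi-functions whose result is never useful are dead
        return {v for v in phi_args if v not in useful}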
To construct pruned SSA form via dead code elimination, it is generally much
faster to first build semi-pruned SSA form, rather than minimal SSA form, and then
apply dead code elimination. Semi-pruned SSA form is based on the observation
that many variables are local, i.e., have a small live range that is within a single
basic block. Consequently, pruned SSA would not instantiate any φ-functions for
these variables. Such variables can be identified by a linear traversal over each basic
block of the CFG. All of these variables can be filtered out: minimal SSA form
restricted to the remaining variables gives rise to the so-called semi-pruned SSA
form.
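The linear traversal that identifies the non-local variables can be sketched as follows; blocks maps each basic block to its instruction list, every instruction being summarized as a (defs, uses) pair. These names are illustrative.

    def non_local_names(blocks):
        non_locals = set()
        for instrs in blocks.values():
            killed = set()                 # defined earlier in this block
            for defs, uses in instrs:
                # a use not preceded by a local definition crosses a
                # block boundary, so the variable is not block-local
                non_locals |= set(uses) - killed
                killed |= set(defs)
        return non_locals

Minimal SSA restricted to non_local_names then yields semi-pruned SSA form.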
2 A CFG is reducible if there are no jumps into the middle of loops from the outside, so the only
entry to a loop is through its header. Section 3.4 gives a fuller discussion of reducibility, with
pointers to further reading.
Fig. 3.5 T1 and T2 rewrite rules for SSA-graph reduction, applied to def-use relations between
SSA variables
This approach can be implemented using a worklist, which stores the candidate
nodes for simplification. Using the graph made up of def-use chains (see Chap. 14),
the worklist can be initialized with φ-functions that are direct successors of non-
φ-functions. However, for simplicity, we may initialize it with all φ-functions. Of
course, if loop nesting forest information is available, the worklist can be avoided
by traversing the CFG in a single pass from inner to outer loops, and in a topological
order within each loop (header excluded). But since we believe the main motivation
for this approach to be its simplicity, the pseudo-code shown in Algorithm 3.8 uses
a work queue.
This algorithm is guaranteed to terminate in a fixed number of steps. At
every iteration of the while loop, it removes a φ-function from the work queue
W . Whenever it adds new φ-functions to W , it removes a φ-function from the
program. The number of φ-functions in the program is bounded so the number of
insertions to W is bounded. The queue could be replaced by a worklist, and the
insertions/removals done at random. The algorithm would be less efficient, but the
end result would be the same.
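The following Python sketch shows one way to implement this simplification on the def-use graph: a φ-function whose arguments, after substitution, are all either itself or one single other value is redundant. Here args maps φ-defined variables to their argument lists and uses maps each variable to the φ-variables reading it; both names are assumptions of the sketch.

    from collections import deque

    def simplify_phis(args, uses):
        replacement = {}
        def resolve(v):                    # follow replacement chains
            while v in replacement:
                v = replacement[v]
            return v
        W = deque(args)                    # all phi-defined variables
        while W:
            p = W.popleft()
            if p in replacement:
                continue
            ops = {resolve(a) for a in args[p]} - {p}
            if len(ops) == 1:              # trivial phi: remove it
                replacement[p] = ops.pop()
                for q in uses.get(p, []):  # its users may simplify now
                    if q != p:
                        W.append(q)
        return replacement                 # phi variable -> replacing value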
The early literature on SSA form [89, 90] introduces the two phases of the
construction algorithm we have outlined in this chapter and discusses algorithmic
complexity on common and worst-case inputs. These initial presentations trace the
ancestry of SSA form back to early work on data-flow representations by Shapiro
and Saint [258].
Briggs et al. [50] discuss pragmatic refinements to the original algorithms for
SSA construction and destruction, with the aim of reducing execution time. They
introduce the notion of semi-pruned form, show how to improve the efficiency of
the stack-based renaming algorithm, and describe how copy propagation must be
constrained to preserve correct code during SSA destruction.
There are numerous published descriptions of alternative algorithms for SSA
construction, in particular for the φ-function insertion phase. The pessimistic
approach that first inserts φ-functions at all control-flow merge points and then
removes unnecessary ones using simple T1/T2 rewrite rules was proposed by
Aycock and Horspool [14]. Brandis and Mössenböck [44] describe a simple,
syntax-directed approach to SSA construction from well structured high-level
source code. Throughout this textbook, we consider the more general case of SSA
construction from arbitrary CFGs.
A reducible CFG is one that will collapse to a single node when it is transformed
using repeated application of T1/T2 rewrite rules. Aho et al. [2] describe the concept
of reducibility and trace its history in early compilers literature.
Sreedhar and Gao [263] pioneer linear-time complexity φ-function insertion
algorithms based on DJ-graphs. These approaches have been refined by other
researchers. Chapter 4 explores these alternative construction algorithms in depth.
Blech et al. [30] formalize the semantics of SSA, in order to verify the
correctness of SSA destruction algorithms. Boissinot et al. [35] review the history of
SSA destruction approaches and highlight misunderstandings that led to incorrect
destruction algorithms. Chapter 21 presents more details on alternative approaches
to SSA destruction.
There are instructive dualisms between concepts in SSA form and functional
programs, including construction, dominance, and copy propagation. Chapter 6
explores these issues in more detail.
Chapter 4
Advanced Construction Algorithms for SSA
D. Das ()
Intel Corporation, Bangalore, India
e-mail: [Link]@[Link]
U. Ramakrishna
IIT Hyderabad, Kandi, India
e-mail: ramakrishna@[Link]
V. C. Sreedhar
IBM, Yorktown Heights, NY, USA
e-mail: vugranam@[Link]
We start by recalling the basic algorithm already described in Chap. 3. The original
algorithm for inserting φ-functions is based on computing the dominance frontier (DF) set
for the given control-flow graph. The dominance frontier DF(x) of a node x is the
set of all nodes z such that x dominates a direct predecessor of z, without strictly
dominating z. For example, DF(8) = {6, 8} in Fig. 4.1. The basic algorithm for
the insertion of φ-functions consists in computing the iterated dominance frontier
(DF+ ) for a set of all definition points (or nodes where variables are defined). Let
Defs(v) be the set of nodes where variable v is defined. Given that the dominance
frontier for a set of nodes is just the union of the DF set of each node, we can
compute DF+ (Defs(v)) as a limit of the following recurrence equation (where S is
initially Defs(v)):

    DF+_1 = DF(S)
    DF+_{i+1} = DF(S ∪ DF+_i)

A φ-function is then inserted at each join node in the DF+ (Defs(v)) set.
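Read operationally, the recurrence gives the following fixed-point loop, sketched here in Python under the assumption that the DF sets have already been computed (df maps each node to its DF set; the names are ours):

    def df_plus(df, S):
        result = set()
        changed = True
        while changed:            # iterate DF(S ∪ result) to a fixed point
            changed = False
            for n in set(S) | result:
                for z in df[n]:
                    if z not in result:
                        result.add(z)
                        changed = True
        return result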
We now present a linear time algorithm for computing the DF+ (S) set of a given
set of nodes S without the need for explicitly pre-computing the full DF set. The
algorithm uses the DJ-graph (see Chap. 3, Sect. 3.1.2 and Fig. 3.3b) representation
of a CFG. The DJ-graph for our example CFG is also shown in Fig. 4.1b. Rather
than explicitly computing the DF set, this algorithm uses a DJ-graph to compute the
DF+ (Defs(v)) on the fly.
Now let us try to understand how to compute the DF set for a single node using the
DJ-graph. Consider the DJ-graph shown in Fig. 4.1b where the depth of a node is
the distance from the root in the dominator tree. The first key observation is that
a DF-edge never goes down to a greater depth. To give a rough intuition of why this
property holds, suppose there was a DF-edge from 8 to 7, then there would be a path
from 3 to 7 through 8 without flowing through 6, which contradicts the dominance
of 7 by 6.
As a consequence, to compute DF(8) we can simply walk down the dominator
(D) tree from node 8 and, from each visited node y, identify all join (J) edges
y →J z such that z.depth ≤ 8.depth. For our example, the J-edges that satisfy this
condition are 10 →J 8 and 9 →J 6. Therefore DF(8) = {6, 8}. To generalize the
example, we can compute the DF of a node x using the following formula (see
Fig. 4.2a for an illustration):

    DF(x) = {z | ∃ y ∈ dominated(x) such that y →J z ∈ J-edges and z.depth ≤ x.depth}

where

    dominated(x) = {y | x dom y}
Now we can extend the above idea to compute the DF+ for a set of nodes, and
hence the insertion of φ-functions. This algorithm does not precompute DF; given
a set of initial nodes S = Defs(v) for which we want to compute the relevant set of
φ-functions, a key observation can be made. Let w be an ancestor node of a node x
on the dominator tree. If DF(x) has already been computed before the computation
of DF(w), the traversal of dominated(x) can be avoided and DF(x) directly used for
the computation of DF(w). This is because nodes reachable from dominated(x) are
already in DF(x). However, the converse may not be true, and therefore the order of
the computation of DF is crucial.
To illustrate the key observation consider the example DJ-graph in Fig. 4.1b, and
let us compute DF+ ({3, 8}). It is clear from the recursive definition of DF+ that we
have to compute DF(3) and DF(8) as a first step. Now, suppose we start with node
3 and compute DF(3). The resulting DF set is DF(3) = {2}. Now, suppose we next
compute the DF set for node 8, and the resulting set is DF(8) = {6, 8}. Notice here
that we have already visited node 8 and its subtree when visiting node 3. We can
avoid such duplicate visits by ordering the computation of DF set so that we first
compute DF(8) and then during the computation of DF(3) we avoid visiting the
subtree of node 8 and use the result DF(8) that was previously computed.
Thus, to compute DF(w), where w is an ancestor of x in the DJ-graph, we do
not need to compute it from scratch as we can re-use the information computed as
part of DF(x) as shown. For this, we need to compute the DF of deeper (based on
depth) nodes (here, x), before computing the DF of a shallower node (here, w). The
formula is as follows, with Fig. 4.2b illustrating the positions of nodes z and z′:

    DF(w) = {z | ∃ y ∈ dominated(w) \ dominated(x) such that y →J z and z.depth ≤ w.depth}
            ∪ {z′ ∈ DF(x) | z′.depth ≤ w.depth}
In this section we present the algorithm for computing DF+ . For a node x, let x.depth
be its depth from the root node r, with r.depth = 0. To ensure that the nodes
are processed according to the above observation, we use a simple array of sets,
OrderedBucket, and two functions defined over this array of sets: (1) InsertNode(n),
which inserts the node n in the set OrderedBucket[n.depth], and (2) GetDeepestNode(),
which returns a node from the OrderedBucket with the deepest depth number.
In Algorithm 4.1, at first we insert all nodes belonging to S in the OrderedBucket.
Then the nodes are processed in a bottom-up fashion over the DJ-graph from deepest
node depth to least node depth by calling Visit(x). The procedure Visit(x) essentially
walks top-down in the DJ-graph avoiding already visited nodes. During this traversal
it also peeks at destination nodes of J-edges. Whenever it notices that the depth
number of the destination node of a J-edge is less than or equal to the depth number
of current_x, the destination node is added to the DF+ set (Line 4) if it is not
present in DF+ already. Notice that at Line 5 the destination node is also inserted in
the OrderedBucket if it was never inserted before. Finally, at Line 9 we continue to
process the nodes in the subtree by visiting over the D-edges. When the algorithm
terminates, the set DF+ contains the iterated dominance frontier for the initial set S.

Procedure Visit(y)
1 foreach J-edge y →J z do
2   if z.depth ≤ current_x.depth then
3     if z ∉ DF+ then
4       DF+ ← DF+ ∪ {z}
5       if z ∉ S then InsertNode(z)
6 foreach D-edge y →D y′ do
7   if y′.visited = false then
8     y′.visited ← true
        /* if y′.boundary = false — see the section on Further Reading for details */
9     Visit(y′)
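The Python sketch below mirrors Algorithm 4.1 on a toy DJ-graph in which every node object carries a depth field, its dominator-tree children (dchildren), and the targets of its J-edges (jsucc); these field names are our assumptions, not part of the original algorithm.

    def df_plus_dj(S):
        S = set(S)
        max_depth = max(n.depth for n in S)
        buckets = [set() for _ in range(max_depth + 1)]  # OrderedBucket
        inserted, visited, dfplus = set(), set(), set()

        def insert(n):                                   # InsertNode(n)
            if n not in inserted:
                inserted.add(n)
                buckets[n.depth].add(n)

        def visit(y, cx):                                # Visit(y)
            for z in y.jsucc:                            # peek at J-edges
                if z.depth <= cx.depth and z not in dfplus:
                    dfplus.add(z)
                    if z not in S:
                        insert(z)
            for c in y.dchildren:                        # descend D-edges
                if c not in visited:
                    visited.add(c)
                    visit(c, cx)

        for n in S:
            insert(n)
        d = max_depth
        while d >= 0:                                    # GetDeepestNode()
            if buckets[d]:
                x = buckets[d].pop()
                visited.add(x)
                visit(x, x)
            else:
                d -= 1
        return dfplus

Processing strictly deepest-first is safe here because a newly discovered join node is never deeper than the node currently being processed.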
In Fig. 4.3, some of the phases of the algorithm are depicted for clarity. The
OrderedBucket is populated with the nodes 1, 3, 4, and 7 corresponding to S =
Defs(v) = {1, 3, 4, 7}. The nodes are inserted in the buckets corresponding to the
depths at which they appear. Hence, node 1 which appears at depth 0 is in the 0-th
bucket, node 3 is in bucket 2 and so on. Since the nodes are processed bottom-up,
the first node that is visited is node 7. The J-edge 7 →J 2 is considered and, the
DF+ set being empty, it is updated to hold node 2 according to Line 4 of the Visit
procedure. In addition, InsertNode(2) is invoked and node 2 is inserted in bucket 2.
The next node visited is node 4. The J-edge 4 →J 5 is considered, which results in the
new DF+ = {2, 5}. The final DF+ set converges to {2, 5, 6} when node 5 is visited.
Subsequent visits of other nodes do not add anything to the DF+ set. An interesting
case arises when node 3 is visited. Node 3 finally causes nodes 8, 9, and 10 also to be
visited (Line 9 during the down traversal of the D-graph). However, when node 10
is visited, the destination of its J-edge 10 →J 8 lies at a greater depth than
current_x = 3, so nothing further is added to DF+ .
In this section we describe a method to iteratively compute the DF+ relation using
a data-flow formulation. As already mentioned, the DF+ relation can be computed
using a transitive closure of the DF-graph, which in turn can be computed from
the DJ-graph. In the algorithm proposed here, explicit DF-graph construction or
the transitive closure formation are not necessary. Instead, the same result can be
achieved by formulating the DF+ relation as a data-flow problem and solving it
iteratively. For several applications, this approach has been found to be a fast and
effective method to construct DF+ (x) for each node x and the corresponding φ-
function insertion using the DF+ (x) sets.
Data-flow equation:
Consider a J-edge y →J z. Then, for all nodes x such that x dominates y and
x.depth ≥ z.depth:

    DF+ (x) = DF+ (x) ∪ DF+ (z) ∪ {z}
The set of data-flow equations for each node n in the DJ-graph can be solved
iteratively using a top-down pass over the DJ-graph. To check whether multiple
passes are required over the DJ-graph before a fixed point is reached for the data-
flow equations, we devise an “inconsistency condition” stated as follows:
Inconsistency Condition:
For a J-edge y′ →J x, if y′ does not satisfy DF+ (y′) ⊇ DF+ (x), then the node y′
is said to be inconsistent.
The algorithm described in the next section is directly based on the method of
building up the DF+ (x) sets of the nodes as each J-edge is encountered in an iterative
fashion by traversing the DJ-graph top-down. If no node is found to be inconsistent
after a single top-down pass, all the nodes are assumed to have reached fixed-point
solutions. If any node is found to be inconsistent, multiple passes are required until
a fixed-point solution is reached.
Function TDMSC-Main(DJ-graph)
Input: A DJ-graph representation of a program.
Output: The DF+ sets for the nodes.
1 foreach node x ∈ DJ-graph do
2 DF+ (x) ← {}
3 repeat TDMSC-I(DJ-graph) until it returns false

Function TDMSC-I(DJ-graph)
1 RequireAnotherPass ← false
2 foreach edge e do e.visited ← false
3 while z ← next node in B(readth) F(irst) S(earch) order of DJ-graph do
4   foreach incoming J-edge e = y →J z do
5     if not e.visited then
6       e.visited ← true
7       x ← y
8       while (x.depth ≥ z.depth) do
9         DF+ (x) ← DF+ (x) ∪ DF+ (z) ∪ {z}
10        lx ← x
11        x ← parent(x)                    /* dominator tree parent */
12      foreach incoming J-edge e′ = y′ →J lx do
13        if e′.visited then
14          if DF+ (y′) ⊉ DF+ (lx) then    /* check inconsistency */
15            RequireAnotherPass ← true
16 return RequireAnotherPass
The first and direct variant of the approach laid out above is poetically
termed TDMSC-I. This variant works by scanning the DJ-graph in a top-down
fashion as shown in Line 3 of Function TDMSC-I. All DF+ (x) sets are set to the
empty set before the initial pass of TDMSC-I. The DF+ (x) sets computed in a
previous pass are carried over if a subsequent pass is required.
Fig. 4.4 (a) Snapshot during execution of the TDMSC-I algorithm and (b) recall of running
example
The DJ-graph is visited depth by depth. During this process, for each node z
encountered, if there is an incoming J-edge y →J z as in Line 4, then a separate
bottom-up pass starts at Line 8 (see Fig. 4.4a for a snapshot of the variables during
algorithm execution).
This bottom-up pass traverses all nodes x such that x dominates y and x.depth ≥
z.depth, updating the DF+ (x) values using the aforementioned data-flow equation.
Line 12 is used for the inconsistency check. RequireAnotherPass is set to true only
if a fixed point is not reached and the inconsistency check succeeds for any node.
There are some subtleties in the algorithm that should be noted. Line 12 visits
the incoming edges of lx only when lx is at the same depth as z, the current depth
of inspection; the incoming edges of nodes below lx lie at a depth greater than
that of node z and have not been visited yet.
Here, we will briefly walk through TDMSC-I using the DJ-graph of Fig. 4.1b
(reprinted here as Fig. 4.4b). Moving top-down over the graph, the first J-edge
encountered is when z = 2, i.e., 7 →J 2. As a result, a bottom-up climbing of the
nodes happens, starting at node 7 and ending at node 2, and the DF+ sets of these
nodes are updated so that DF+ (7) = DF+ (6) = DF+ (3) = DF+ (2) = {2}.
The next J-edge to be visited can be any of 5 →J 6, 9 →J 6, 6 →J 5, 4 →J 5,
or 10 →J 8, depending on the breadth-first order. Suppose 5 →J 6 is processed
before the bottom-up climb for 9 →J 6 reaches node 6; at that point, 5 →J 6
is another J-edge that is already visited and is an incoming edge of node 6. Checking
for DF+ (5) ⊇ DF+ (6) fails, implying that DF+ (5) needs to be computed again.
This will be done in a succeeding pass as suggested by the RequireAnotherPass
value of true. In a second iterative pass, the J-edges are visited in the same order.
This time, when the bottom-up climb happens for the edge 6 →J 5, DF+ (6) is also
set to {2, 5, 6}. The inconsistency no longer appears, and the algorithm proceeds to
handle the edges 4 →J 5, 9 →J 6, and 10 →J 8, which have also been visited in the
earlier pass. TDMSC-I is repeatedly invoked by a different function which calls it
in a loop till RequireAnotherPass is returned as false, as shown in Function
TDMSC-Main.
Once the iterated dominance frontier relation is computed for the entire CFG,
inserting the φ-functions is a straightforward application of the DF+ (x) values for
a given Defs(x), as shown in Algorithm 4.2.
This section illustrates the use of loop nesting forests to construct the iterated
dominance frontier (DF+ ) of a set of vertices in a CFG. This method works with
reducible as well as irreducible loops.
A loop nesting forest is a data structure that represents the loops in a CFG and the
containment relation between them. In the example shown in Fig. 4.5a the loops with
back edges 11 → 9 and 12 → 2 are both reducible loops. The corresponding loop
nesting forest is shown in Fig. 4.5b and consists of two loops whose header nodes
are 2 and 9. The loop with header node 2 contains the loop with header node 9.
The idea is to use the forward CFG, an acyclic version of the control-flow graph (i.e.,
without back edges), and construct the DF+ for a variable in this context: whenever
two distinct definitions reach a join point, it belongs to the DF+ . Then, we take into
account the back edges using the loop nesting forest: if a loop contains a definition,
its header also belongs to the DF+ .

Fig. 4.5 A sample CFG with four defs and one use of variable v and its loop nesting forest
A definition node d “reaches” another node u if there is a non-empty path in the
graph from d to u which does not contain any redefinition. If at least two definitions
reach a node u, then u belongs to DF+ (S) where S = Defs(x) consists of these
definition nodes. This suggests Algorithm 4.3 which works for acyclic graphs. For
a given S, we can compute DF+ (S) as follows (see the sketch after this list):
• Initialize DF+ to the empty set.
• Using a topological order, compute which definitions (including the already-
inserted join nodes of DF+ ) reach each node, using forward data-flow analysis.
• Add a node to DF+ if it is reached by multiple definitions.
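A Python rendering of this forward pass on the acyclic graph might look as follows; topo is a topological ordering of the nodes, preds gives direct predecessors in Gfwd, and r is the root. All names are illustrative.

    def df_plus_acyclic(topo, preds, S, r):
        dfplus = set()
        unique = {r: r}                 # UniqueReachingDef, r by default
        for u in topo:
            if u is r:
                continue
            reaching = set()
            for p in preds[u]:
                if p in S or p in dfplus:
                    # a definition (or an inserted phi) in p itself
                    reaching.add(p)
                else:                   # otherwise p forwards its own
                    reaching.add(unique[p])
            if len(reaching) == 1:
                unique[u] = reaching.pop()
            else:                       # u merges several definitions
                dfplus.add(u)
                unique[u] = u
        return dfplus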
For Fig. 4.5, the forward CFG of the graph G, termed Gfwd , is formed by
dropping the back edges 11 → 9 and 12 → 2. Also, r is a specially designated
node that is the root of the CFG. For the definitions of x in nodes 4, 5, 7, and 12 in
Fig. 4.5, the subsequent nodes (forward) reached by multiple definitions are 6 and
8: node 6 can be reached by any one of the two definitions in nodes 4 or 5, and node
8 by either the definition from node 7 or one of 4 or 5. Note that the back edges do
not exist in the forward CFG and hence node 2 is not part of the DF+ set yet. We
will see later how the DF+ set for the entire graph is computed by considering the
contribution of the back edges.
10 if |ReachingDefs| = 1 then
11 UniqueReachingDef (u) ← ReachingDefs;
12 else
13 DF+ ← DF+ ∪ {u};
14 return DF+
Let us walk through this algorithm computing DF+ for variable v, i.e., S =
{4, 5, 7, 12}. The nodes in Fig. 4.5 are already numbered in topological order.
Nodes 1 to 5 have only one direct predecessor, none of them being in S, so their
UniqueReachingDef stays r, and DF+ is still empty. For node 6, its two direct
predecessors belong to S, hence ReachingDefs = {4, 5}, and 6 is added to DF+ .
Nothing changes for 7, then for 8 its direct predecessors 6 and 7 are, respectively, in
DF+ and S: they are added to ReachingDefs, and 8 is then added to DF+ . Finally,
for nodes 9 to 12, their UniqueReachingDef will be updated to node 8, but this will
no longer change DF+ , which will end up being {6, 8}.
DF+ on Reducible Graphs
A reducible graph can be decomposed into an acyclic graph and a set of back edges.
The contribution of back edges to the iterated dominance frontier can be identified
by using the loop nesting forest. If a vertex v is contained in a loop, then DF+ (v)
will contain the loop header, i.e., the unique entry of the reducible loop. For any
vertex v, let HLC(v) denote the set of loop headers of the loops containing v. Given
a set of vertices S, it turns out that

    DF+ (S) = HLC(S) ∪ DF+fwd (S ∪ HLC(S))

where HLC(S) = ∪v∈S HLC(v), and where DF+fwd denotes the DF+ restricted to
the forward CFG Gfwd .
Reverting back to Fig. 4.5, we see that in order to find the DF+ for the nodes
defining x, we need to evaluate DF+fwd ({4, 5, 7, 12} ∪ HLC({4, 5, 7, 12})). As all
these nodes are contained in a single loop with header 2, HLC({4, 5, 7, 12}) = {2}.
Computing DF+fwd ({4, 5, 7, 12}) gives us {6, 8}, and finally, DF+ ({4, 5, 7, 12}) =
{2, 6, 8}.
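Reusing df_plus_acyclic from the sketch above, the formula for reducible graphs translates directly; headers is assumed to map each node to HLC(node), the headers of the loops containing it.

    def df_plus_reducible(topo, fwd_preds, S, r, headers):
        hlc = set()
        for v in S:
            hlc |= headers.get(v, set())       # HLC(S)
        # DF+(S) = HLC(S) ∪ DF+fwd(S ∪ HLC(S))
        return hlc | df_plus_acyclic(topo, fwd_preds, set(S) | hlc, r)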
Fig. 4.6 (a) An irreducible graph. (b) The loop nesting forest. (c) The acyclic subgraph. (d)
Transformed graph
Concluding Remarks
Although all these algorithms claim to be better than the original algorithm by
Cytron et al., they are difficult to compare due to the unavailability of these
algorithms in a common compiler framework.
In particular, while constructing the whole DF+ set seems very costly in the
classical construction algorithm, its cost is actually amortized as it will serve to
insert φ-functions for many variables. It is, however, interesting not to pay this cost
whenever we only have a few variables to consider, for instance, when repairing
SSA as in the next chapter.
Note also that people have observed in production compilers that, during SSA
construction, what seems to be the most expensive part is the renaming of the
variables and not the insertion of φ-functions.
Further Reading
Cytron’s approach for φ-function insertion involves a fully eager approach of con-
structing the entire DF-graph [90]. Sreedhar and Gao proposed the first algorithm
for computing DF+ sets without the need for explicitly computing the full DF set
[263], producing the linear algorithm that uses DJ-graphs.
The lazy algorithm presented in this chapter that uses DJ-graph was introduced
by Sreedhar and Gao and constructs DF on the fly only when a query is encountered.
Pingali and Bilardi [29] suggested a middle-ground by combining both approaches.
They proposed a new representation called ADT (Augmented Dominator Tree). The
ADT representation can be thought of as a DJ-graph, where the DF sets are pre-
computed for certain nodes called “boundary nodes” using an eager approach. For
the rest of the nodes, termed “interior nodes,” the DF needs to be computed on the
fly as in the Sreedhar-Gao algorithm. The nodes which act as “boundary nodes” are
detected in a separate pass. A factor β is used to determine the partitioning of the
nodes of a CFG into boundary or interior nodes by dividing the CFG into zones. β
is a number that represents the space/query-time tradeoff: β ≪ 1 denotes a fully
eager approach, where the storage requirement for DF is maximal but queries are
fastest, while β ≫ 1 denotes a fully lazy approach, where the storage requirement
is zero but queries are slower.
Given the ADT of a control-flow graph, it is straightforward to modify Sreedhar
and Gao’s algorithm for computing φ-functions in linear time. The only modifica-
tion that is needed is to ensure that we need not visit all the nodes of a subtree rooted
at a node y when y is a boundary node whose DF set is already known. This change
is reflected in Line 8 of Algorithm 4.1, where a subtree rooted at y is visited or not
visited based on whether it is a boundary node or not.
The algorithm computing DF+ without the explicit DF-graph is from Das and
Ramakrishna [94]. For iterative DF+ set computation, they also exhibit TDMSC-
II, an improvement to algorithm TDMSC-I. This improvement is fueled by the
observation that for an inconsistent node u, the DF+ sets of all nodes w such that
w dominates u and w.depth ≥ u.depth can be locally corrected for some special
cases. This heuristic works very well for certain classes of problems—especially for
CFGs with DF-graphs having cycles consisting of a few edges. This eliminates extra
passes as an inconsistent node is made consistent immediately on being detected.
Finally, the part on computing DF+ sets using loop nesting forests is based on
Ramalingam’s work on loops, dominators, and dominance frontiers [236].
Chapter 5
SSA Reconstruction
Sebastian Hack
S. Hack ()
Saarland University, Saarbrücken, Germany
e-mail: hack@[Link]
In this chapter, we will discuss two algorithms. The first is an adaptation of the
classical dominance-frontier-based algorithm. The second performs a search from
the uses of the variables to the definition and places φ-functions on demand at
appropriate places. In contrast to the first, the second algorithm might not construct
minimal SSA form in general; however, it does not need to update its internal data
structures when the CFG is modified.
We consider the following scenario: The program is represented as a control-
flow graph (CFG) and is in SSA form with dominance property. For the sake of
simplicity, we assume that each instruction in the program only writes to a single
variable. An optimization or transformation violates SSA by inserting additional
definitions for an existing SSA variable, like in the examples above. The original
variable and the additional definitions can be seen as a single non-SSA variable that
has multiple definitions and uses. In the following, let v be such a non-SSA variable.
When reconstructing SSA for v, we will first create fresh variables for every
definition of v to establish the single-assignment property. What remains is asso-
ciating every use of v with a suitable definition. In the algorithms, [Link] denotes
the set of all instructions that define v. A use of a variable is a pair consisting of a
program point (an instruction) and an integer denoting the index of the operand at
this instruction.
Both algorithms presented in this chapter share the same driver routine described
in Algorithm 5.1. First, we scan all definitions of v so that for every basic block b we
have the list [Link] that contains all instructions in the block which define one of the
variables in [Link]. It is best to sort this list according to the schedule of the instructions
in the block from back to front, making the latest definition the first in the list.
Then, all uses of the variable v are traversed to associate them with the proper
definition. This can be done by using precomputed use-def chains if available or
scanning all instructions in the dominance subtree of v’s original SSA definition. For
each use, we have to differentiate whether the use is in a φ-function or not. If so, the
use occurs at the end of the direct predecessor block that corresponds to the position
of the variable in the φ’s argument list. In that case, we start looking for the reaching
definition from the end of that block. Otherwise, we scan the instructions of the
block backwards until we reach the first definition that is before the use (Line 14). If
there is no such definition, we have to find one that reaches this block from outside.
We use two functions, FindDefFromTop and FindDefFromBottom, that search for
the reaching definition from the beginning or from the end of a block, respectively.
FindDefFromBottom actually just returns the last definition in the block, or calls
FindDefFromTop if there is none.
The two presented approaches to SSA repairing differ in the implementation
of the function FindDefFromTop. The differences are described in the next two
sections.
19 v′ ← version of v defined by d
20 rewrite use of v by v′ in inst
Procedure FindDefFromBottom(v, b)
1 if [Link] ≠ ∅ then
2   return latest instruction in [Link]
3 else
4   return FindDefFromTop(v, b)
SSA Reconstruction Based on Dominance Frontiers

This algorithm follows the same principles as the classical SSA construction
algorithm by Cytron et al. as described in Chap. 3. We first compute the iterated
dominance frontier (DF+ ) of v. This set is a sound approximation of the set where
φ-functions must be placed—it might contain blocks where a φ-function would be
dead. Then, we search for each use u the corresponding reaching definition. This
search starts at the block of u. If that block b is in the DF+ of v, a φ-function needs to
be placed at its entrance. This φ-function becomes a new definition of v and has to be
inserted in [Link] and in [Link]. The operands of the newly created φ-function
will query their reaching definitions by recursive calls to FindDefFromBottom on direct
predecessors of b. Because we inserted the φ-function into [Link] before searching
for the arguments, no infinite recursion can occur (otherwise, it could happen, for
instance, with a loop back edge).
If the block is not in the DF+ , the search continues in the immediate dominator of
the block. This is because in SSA, every use of a variable must be dominated by its
definition.1 Therefore, the reaching definition is the same for all direct predecessors
of the block, and hence for the immediate dominator of this block.
Procedure FindDefFromTop(v, b)
1 if b ∈ DF+ ([Link]) then
2   v′ ← fresh variable
3   d ← new φ-function in b: v′ ← φ(. . . )
4   append d to [Link]
5   foreach p ∈ [Link] do
6     o ← FindDefFromBottom(v, p)
7     v″ ← version of v defined by o
8     set corresponding operand of d to v″
9 else
10   d ← FindDefFromBottom(v, [Link])    /* search in immediate dominator */
11 return d
1 The definition of an operand of a φ-function has to dominate the corresponding direct predecessor
block.
Search-based SSA Reconstruction

Procedure FindDefFromTop(b)
Input: b, a basic block
1 if [Link] ≠ ⊥ then
2   return [Link]
3 pending_φ ← new φ-function in b
4 vφ ← result of pending_φ
5 [Link] ← vφ
6 reaching_defs ← []
7 foreach p ∈ [Link] do
8   reaching_defs ← reaching_defs ∪ FindDefFromBottom(v, p)
9 vdef ← Phi-Necessary(vφ , reaching_defs)
10 if vdef = vφ then
11   set arguments of pending_φ to reaching_defs
12 else
13   rewire all uses of pending_φ to vdef
14   remove pending_φ
15   [Link] ← vdef
16 return vdef
8 assert (other ≠ ⊥)    /* this assertion would be violated only if reaching_defs
                           contained nothing but pending_φ, which can never happen */
9 return other
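Only the tail of Phi-Necessary survives above; the following Python sketch reconstructs the whole local criterion it implements (with ⊥ rendered as None). It is our reconstruction, not the original pseudo-code.

    def phi_necessary(v_phi, reaching_defs):
        """Return v_phi if the pending phi is really needed, otherwise
        the unique value that actually reaches the block."""
        other = None
        for d in reaching_defs:
            if d == v_phi:
                continue               # self-reference through a back edge
            if other is None:
                other = d
            elif d != other:
                return v_phi           # two distinct values: phi necessary
        assert other is not None       # reaching_defs never holds only v_phi
        return other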
Some placed φ-functions may turn out to be unnecessary because other φ-functions
are optimized away. Consider the example in Fig. 5.2. We look for a definition of x
from block E. If FindDefFromTop considers the blocks in an unfavourable order,
e.g., E, D, C, B, A, D, C,
some unnecessary φ-functions cannot be removed by Phi-Necessary, as shown in
Fig. 5.2b. While the φ-function in block C can be eliminated by the local criterion
applied by Phi-Necessary, the φ-function in block B remains. This is because the
depth-first search carried out by FindDefFromTop will not visit block B a second
time. To remove the remaining φ-functions, the local criterion can be iteratively
applied to all placed φ-functions until a fixpoint is reached. For reducible control
flow, this then produces the minimal number of placed φ-functions. The classical
(∗)-graph in Fig. 5.3 illustrates that this does not hold for the irreducible case. This
is similar to the rules discussed in Sect. 3.3.1.
The algorithms presented in this chapter are independent of the transformation that
violated SSA and can be used as a black box: For every variable for which SSA was
violated, a routine is called that restores SSA. Both algorithms rely on computed
def-use chains because they traverse all uses of an SSA variable to find suitable
definitions; however, they differ in their prerequisites and their runtime behaviour:
The first algorithm (Choi et al. [69]) is based on the iterated dominance frontiers
like the classical SSA construction algorithm by Cytron et al. [90]. Hence, it is
less suited for optimizations that also change the flow of control since that would
require recomputing the iterated dominance frontiers. On the other hand, by using
the iterated dominance frontiers, the algorithm can find the reaching definitions
quickly by scanning the dominance tree upwards. Furthermore, one could also
envision applying incremental algorithms to construct the dominance tree [238, 264]
and the dominance frontier [265] to account for changes in the control flow. This has
not yet been done and no practical experiences have been reported so far.
The second algorithm is based on the algorithm by Braun et al. [47] which is
an extension of the construction algorithm that Click describes in his thesis [75] to
construct SSA from an abstract syntax tree. It does not depend on additional analysis
information such as iterated dominance frontiers or the dominance tree. Thus, it is
well suited for transformations that change the CFG because no information needs to
be recomputed. On the other hand, it might be slower to find the reaching definitions
because they are searched by a depth-first search in the CFG.
Both approaches construct pruned SSA, i.e., they do not add dead φ-functions.
The first approach produces minimal SSA by the very same arguments Cytron
et al. [90] give. The second only guarantees minimal SSA for reducible CFGs.
This follows from the iterative application of the function Phi-Necessary which
implements the two simplification rules presented by Aycock and Horspool [14],
who showed that their iterative application yields minimal SSA on reducible graphs.
These two local rules can be extended to a non-local one which has to find strongly
connected φ-components that all refer to the same exterior variable. Such a non-
local check also eliminates unnecessary φ-functions in the irreducible case [47].
Chapter 6
Functional Representations of SSA
Lennart Beringer
This chapter discusses alternative representations of SSA using the terminology and
structuring mechanisms of functional programming languages. The reading of SSA
as a discipline of functional programming arises from a correspondence between
dominance and syntactic scope that subsequently extends to numerous aspects of
control and data-flow structure.
The development of functional representations of SSA is motivated by the
following considerations:
1. Relating the core ideas of SSA to concepts from other areas of compiler
and programming language research provides conceptual insights into the SSA
discipline and thus contributes to a better understanding of the practical appeal
of SSA to compiler writers.
2. Reformulating SSA as a functional program makes explicit some of the syntactic
conditions and semantic invariants that are implicit in the definition and use of
SSA. Indeed, the introduction of SSA itself was motivated by a similar goal:
to represent aspects of program structure—namely the def-use relationships—
explicitly in syntax, by enforcing a particular naming discipline. In a similar way,
functional representations directly enforce invariants such as “all φ-functions in a
block must be of the same arity,” “the variables assigned to by these φ-functions
must be distinct,” “φ-functions are only allowed to occur at the beginning of
a basic block,” or “each use of a variable should be dominated by its (unique)
definition.” Constraints such as these would typically have to be validated or
(re-)established after each optimization phase of an SSA-based compiler, but
are typically enforced by construction if a functional representation is chosen.
L. Beringer ()
Princeton University, Princeton, NJ, USA
e-mail: eberinge@[Link]
let x = e1 in e2 end
The effect of this expression is to evaluate e1 and bind the resulting value to variable
x for the duration of the evaluation of e2 . The code affected by this binding, e2 , is
called the static scope of x and is easily syntactically identifiable. In the following,
we occasionally indicate scopes by code-enclosing boxes and list the variables that
are in scope using subscripts.
For example, the scope associated with the top-most binding of v to 3 in code
let v = 3 in
let y = (let v = 2 × v in 4 × v end)
v (6.2)
in y × v end
v,y
v
end
spans both inner let-bindings, the scopes of which are themselves not nested inside
one another, as the inner binding of v occurs in the e1 position of the let-binding for y.
In contrast to an assignment in an imperative language, a let-binding for variable
x hides any previous value bound to x for the duration of evaluating e2 but does not
permanently overwrite it. Bindings are treated in a stack-like fashion, resulting in
a tree-shaped nesting structure of boxes in our code excerpts. For example, in the
above code, the inner binding of v to value 2 × 3 = 6 shadows the outer binding
of v to value 3 precisely for the duration of the evaluation of the expression 4 × v.
Once this evaluation has terminated (resulting in the binding of y to 24), the binding
of v to 3 becomes visible again, yielding the overall result of 72.
The concepts of binding and static scope ensure that functional programs enjoy
the characteristic feature of SSA, namely the fact that each use of a variable is
uniquely associated with a point of definition. Indeed, the point of definition for a
use of x is given by the nearest enclosing binding of x. Occurrences of variables in
an expression that are not enclosed by a binding are called free. A well-formed
procedure declaration contains all free variables of its body among its formal
parameters. Thus, the notion of scope makes explicit the invariant that each use
of a variable should be dominated by its (unique) definition.
In contrast to SSA, functional languages achieve the association of definitions
to uses without imposing the global uniqueness of variables, as witnessed by the
duplicate binding occurrences for v in the above code. As a consequence of this
decoupling, functional languages enjoy a strong notion of referential transparency:
The choice of x as the variable holding the result of e1 depends only on the free
variables of e2 . For example, we may rename the inner v in code (6.2) to z without
altering the meaning of the code:
let v = 3 in
  let y = (let z = 2 × v in 4 × z end)    ⟨v, z in scope⟩
  in y × v end
end                                                        (6.3)
Note that this conversion formally makes the outer v visible for the expression 4×z,
as indicated by the index v, z decorating its surrounding box.
In order to avoid altering the meaning of the program, the choice of the newly
introduced variable has to be such that confusion with other variables is avoided.
Formally, this means that a renaming of let x = e1 in e2 end into
let y = e1 in e2 [y ↔ x] end can only be carried out if y is not a free variable
of e2 . Moreover, in the event
that e2 already contains some preexisting bindings to y, the substitution of x by
y in e2 (denoted by e2 [y ↔ x] above) first renames these preexisting bindings in
a suitable manner. Also note that the renaming only affects e2 —any occurrences
of x or y in e1 refer to conceptually different but identically named variables, but
the static scoping discipline ensures these will never be confused with the variables
involved in the renaming. In general, the semantics-preserving renaming of bound
variables is called α-renaming. Typically, program analyses for functional languages
are compatible with α-renaming in that they behave equivalently for fragments
that differ only in their choice of bound variables, and program transformations
α-rename bound variables whenever necessary.
A consequence of referential transparency, and thus a property typically enjoyed
by functional languages, is compositional equational reasoning: the meaning of
a piece of code e is only dependent on its free variables and can be calculated
from the meaning of its subexpressions. For example, the meaning of a phrase
let x = e1 in e2 end only depends on the free variables of e1 and on the free
variables of e2 other than x. Hence, languages with referential transparency allow
one to replace a subexpression by some semantically equivalent phrase without
altering the meaning of the surrounding code. Since semantic preservation is a core
requirement of program transformations, the suitability of SSA for formulating and
implementing such transformations can be explained by the proximity of SSA to
functional languages.
Continuations make the consumer of a result explicit: code (6.2) can be rewritten
so that its result is passed on to a continuation k:
let v = 3 in
let y = (let v = 2 × v in 4 × v end)
(6.4)
in k(y × v) end
end
In effect, k represents any function that may be applied to the result of expres-
sion (6.2).
Surrounding code may specify the concrete continuation by binding k to a
suitable expression. It is common practice to write these continuation-defining
expressions in λ-notation, i.e., in the form λ x.e where x typically occurs free in
e. The effect of the expression is to act as the (unnamed) function that sends x to
e(x), i.e., formal parameter x represents the place-holder for the argument to which
the continuation is applied. Note that x is α-renameable, as λ acts as a binder. For
example, a client of the above code fragment wishing to multiply the result by 2
may insert code (6.4) in the e2 position of a let-binding for k that contains λ x. 2 × x
in its e1 -position, as in the following code:
let k = λ x. 2 × x
in let v = 3 in
let y = (let z = 2 × v in 4 × z end)
(6.5)
in k(y × v) end
end
end
Alternatively, the client may wrap fragment (6.4) in a function definition with
formal argument k and construct the continuation in the calling code, where he
would be free to choose a different name for the continuation-representing variable:
function f (k) =
let v = 3 in
let y = (let z = 2 × v in 4 × z end)
in k(y × v) end (6.6)
end
in let k = λ x. 2 × x in f (k) end
end
where f is invoked with a newly constructed continuation k that applies the addition
of 7 to its formal argument x (which at runtime will hold the result of f ) before
passing the resulting value on as an argument to the outer continuation k. In a similar
way, the function
function h(y, k) =
  let x = 4 in
    let k′ = λ z. k(z × x)
    in if y > 0
       then let z = y × 2 in k′ (z) end
       else let z = 3 in k′ (z) end
    end
  end                                                      (6.8)
Fig. 6.1 Control-flow graph for code (6.8) (a), and SSA representation (b)
The SSA form of this CFG is shown in Fig. 6.1b. If we apply similar renamings
of z to z1 and z2 in the two branches of (6.8), we obtain the following fragment:
function h(y, k) =
  let x = 4 in
    let k′ = λ z. k(z × x)
    in if y > 0
       then let z1 = y × 2 in k′ (z1 ) end
       else let z2 = 3 in k′ (z2 ) end
    end
  end                                                      (6.9)
We observe that the role of the formal parameter z of continuation k′ is exactly that
of a φ-function: to unify the arguments stemming from various calls sites by binding
them to a common name for the duration of the ensuing code fragment—in this case
just the return expression. As expected from the above understanding of scope and
dominance, the scopes of the bindings for z1 and z2 coincide with the dominance
regions of the identically named imperative variables: both terminate at the point of
function invocation/jump to the control-flow merge point.
The fact that transforming (6.8) into (6.9) only involves the referentially transpar-
ent process of α-renaming indicates that program (6.8) already contains the essential
structural properties that SSA distills from an imperative program.
function h(y) =
  let x = 4 in
    function h′ (z) = z × x
    in if y > 0
       then let z = y × 2 in h′ (z) end
       else let z = 3 in h′ (z) end
    end
  end                                                      (6.10)
where the local function h′ plays a similar role to the continuation k′ and is jointly
called from both branches. In contrast to the CPS representation, however, the body
of h′ returns its result directly rather than by passing it on as an argument to some
continuation. Also note that neither the declaration of h nor that of h′ contains
additional continuation parameters. Thus, rather than handing its result directly over
to some caller-specified receiver (as communicated by the continuation argument k),
h′ simply returns control back to the caller, who is then responsible for any further
execution. Roughly speaking, the effect is similar to the imperative compilation
discipline of always setting the return address of a procedure call to the instruction
pointer immediately following the call instruction.
function h(y) =
  let x = 4 in
    function h′ (z) = z × x
    in if y > 0
       then function h1 () = let z1 = y × 2 in h′ (z1 ) end
            in h1 () end
       else function h2 () = let z2 = 3 in h′ (z2 ) end
            in h2 () end
    end
  end                                                      (6.12)
Again, the role of the formal parameter z of the control-flow merge point function
h′ is identical to that of a φ-function. In accordance with the fact that the basic
blocks representing the arms of the conditional do not contain φ-functions, the local
functions h1 and h2 have empty parameter lists—the free occurrence of y in the
body of h1 is bound at the top level by the formal argument of h.
For both direct style and CPS the correspondence to SSA is most pronounced for
code in let-normal form: Each intermediate result must be explicitly named by a
variable, and function arguments must be names or constants. Syntactically, let-
normal form isolates basic instructions in a separate category of primitive terms a
and then requires let-bindings to be of the form let x = a in e end. In particular,
neither jumps (conditional or unconditional) nor let-bindings are primitive. Let-
normalized form is obtained by repeatedly rewriting code as follows:
    let x = (let y = e in e′ end)
    in e″                                  is rewritten into
    end

    let y = e
    in let x = e′ in e″ end
    end

subject to the side condition that y is not free in e″. For example, let-normalizing
code (6.3) pulls the let-binding for z to the outside of the binding for y, yielding
let v = 3 in
  let z = 2 × v in
    let y = 4 × z in y × v end
  end
end                                                        (6.13)
Such a chain of nested let-bindings is commonly abbreviated using a single
multi-binding let:

let v = 3,
    z = 2 × v,
    y = 4 × z
in y × v end                                               (6.14)
Summarizing our discussion up to this point, Table 6.1 collects some correspon-
dences between functional and imperative/SSA concepts.
Table 6.1 Correspondence pairs between functional form and SSA (part I)
Functional concept · · ·· Imperative/SSA concept
Variable binding in let · · ·· Assignment (point of definition)
α-renaming · · ·· Variable renaming
Unique association of binding occurrences to uses · · ·· Unique association of defs to uses
Formal parameter of continuation/local function · · ·· φ-function (point of definition)
Lexical scope of bound variable · · ·· Dominance region
Table 6.2 Correspondence pairs between functional form and SSA: program structure
Functional concept · · ·· Imperative/SSA concept
Immediate subterm relationship · · ·· Direct control-flow successor relationship
Arity of function fi · · ·· Number of φ-functions at beginning of bi
Distinctness of formal param. of fi · · ·· Distinctness of LHS-variables in the φ-block of bi
Number of call sites of function fi · · ·· Arity of φ-functions in block bi
Parameter lifting/dropping · · ·· Addition/removal of φ-function
Block floating/sinking · · ·· Reordering according to dominator tree structure
Potential nesting structure · · ·· Dominator tree
The relationship between SSA and functional languages is extended by the corre-
spondences shown in Table 6.2. We discuss some of these aspects by considering
the translation into SSA, using the program in Fig. 6.2 as a running example.
Each basic block bi is translated into a separate function fi , with one let-binding
for each assignment, and jumps are converted into calls of these functions. In
order to determine the formal parameters of these functions we perform a liveness
analysis. For each basic block bi , we choose an arbitrary enumeration of its live-
in variables. We then use this enumeration as the list of formal parameters in the
declaration of the function fi , and also as the list of actual arguments in calls to fi .
We collect all function definitions in a block of mutually tail-recursive functions at
the top level:
function f1 () = let v = 1, z = 8, y = 4
in f2 (v, z, y) end
and f2 (v, z, y) = let x = 5 + y, y = x × z, x = x − 1
(6.15)
in if x = 0 then f3 (y, v) else f2 (v, z, y) end
and f3 (y, v) = let w = y + v in w end
in f1 () end
Making all bound variables distinct (the functional analogue of SSA variable
renaming) yields the following code:

function f1 () = let v1 = 1, z1 = 8, y1 = 4
in f2 (v1 , z1 , y1 ) end
and f2 (v2 , z2 , y2 ) = let x1 = 5 + y2 , y3 = x1 × z2 , x2 = x1 − 1
(6.16)
in if x2 = 0 then f3 (y3 , v2 ) else f2 (v2 , z2 , y3 ) end
and f3 (y4 , v3 ) = let w1 = y4 + v3 in w1 end
in f1 () end
1 Apart from the function identifiers fi , which can always be chosen distinct from the variables.
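The liveness analysis that determines the formal parameters is a standard backward fixed point; the Python sketch below assumes each block is summarized by its upward-exposed uses and its definitions, and succs gives the control-flow successors. All names are ours.

    def live_in_sets(blocks, succs):
        """blocks[b] = (uses, defs): variables used before any local
        definition, and variables defined in b."""
        live_in = {b: set() for b in blocks}
        changed = True
        while changed:
            changed = False
            for b, (uses, defs) in blocks.items():
                live_out = set().union(*[live_in[s] for s in succs[b]])
                new = set(uses) | (live_out - set(defs))
                if new != live_in[b]:
                    live_in[b] = new
                    changed = True
        return live_in

For the running example, the live-in set of b2 is {v, z, y}, which is exactly the parameter list of f2 in codes (6.15) and (6.16).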
Fig. 6.3 Control-flow graph (left) and dominator tree (right) of the running example, over blocks b1 , b2 , b3
6.2.2 λ-Dropping
λ-dropping may be performed before or after variable names are made distinct, but
for our purpose, the former option is more instructive. The transformation consists
of two phases, block sinking and parameter dropping.
Block sinking analyses the static call structure to identify which function definitions
may be moved inside each other. For example, whenever our set of function
declarations contains definitions f (x1 , . . . , xn ) = ef and g(y1 , . . . , ym ) = eg
where f ≠ g and such that all calls to f occur in ef or eg , we can move the
declaration for f into that of g—note the similarity to the notion of dominance. If
applied aggressively, block sinking indeed amounts to making the entire dominance
tree structure explicit in the program representation. In particular, algorithms for
computing the dominator tree from a CFG discussed elsewhere in this book can be
applied to identify block sinking opportunities, where the CFG is given by the call
graph of functions.
In our example (6.15), f3 is only invoked from within f2 , and f2 is only called
in the bodies of f2 and f1 (see the dominator tree in Fig. 6.3 (right)). We may thus
move the definition of f3 into that of f2 , and the latter one into f1 .
Several options exist as to where f should be placed in its host function. The first
option is to place f at the beginning of g, by rewriting to

    function g(y1 , . . . , ym ) =
      function f (x1 , . . . , xn ) = ef
      in eg end

This transformation does not alter the semantics of the code, as the declaration of f
is closed: moving f into the scope of the formal parameters y1 , . . . , ym (and also
into the scope of g itself) does not alter the bindings to which variable uses inside
ef refer.
Applying this transformation to example (6.15) yields the following code:
function f1 () =
function f2 (v, z, y) =
function f3 (y, v) = let w = y + v in w end
in let x = 5 + y, y = x × z, x = x − 1
in if x = 0 then f3 (y, v) else f2 (v, z, y) end (6.17)
end
in let v = 1, z = 8, y = 4 in f2 (v, z, y) end
end
in f1 () end
An alternative strategy is to insert f near the end of its host function g, in the
vicinity of the calls to f. This brings the declaration of f additionally into the
scope of all let-bindings in eg . Again, referential transparency and preservation of
semantics are respected as the declaration of f is closed. In our case, the alternative
strategy yields the following code:
function f1 () =
let v = 1, z = 8, y = 4
in function f2 (v, z, y) =
let x = 5 + y, y = x × z, x = x − 1
in if x = 0
then function f3 (y, v) = let w = y + v in w end
(6.18)
in f3 (y, v) end
else f2 (v, z, y)
end
in f2 (v, z, y) end
end
in f1 () end
In general, one would insert f directly prior to its call if g contains only a single
call site for f. In the event that g contains multiple call sites for f, these are (due
to their tail-recursive positioning) in different arms of a conditional, and we would
insert f directly prior to this conditional.
Both outlined placement strategies result in code whose nesting structure reflects
the dominance relationship of the imperative code. In our example, code (6.17)
and (6.18) both nest f3 inside f2 inside f1 , in accordance with the dominator tree
of the imperative program shown in Fig. 6.3.
Parameter dropping removes superfluous function parameters based on scope. The
scope of v that applies at the declaration of f3 and also at its call site is the one
rooted at the formal parameter v of f2 . In
case of y, the common scope is the one rooted at the let-binding for y in the body
of f2 . We thus obtain the following code:
function f1 () =
let v = 1, z = 8, y = 4
in function f2 (v, z, y) =
let x = 5 + y, y = x × z, x = x − 1
in if x = 0
then function f3 () = let w = y + v in w end
(6.19)
in f3 () end
else f2 (v, z, y)
end
in f2 (v, z, y) end
end
in f1 () end
Considering the recursive function f2 next we observe that the recursive call is in
the scope of the let-binding for y in the body of f2 , preventing us from removing y.
In contrast, neither v nor z has binding occurrences in the body of f2 . The scopes
applicable at the external call site to f2 coincide with those applicable at its site of
declaration and are given by the scopes rooted in the let-bindings for v and z. Thus,
parameters v and z may be removed from f2 :
function f1 () =
let v = 1, z = 8, y = 4
in function f2 (y) =
let x = 5 + y, y = x × z, x = x − 1
in if x = 0
then function f3 () = let w = y + v in w end
(6.20)
in f3 () end
else f2 (y)
end
in f2 (y) end
end
in f1 () end
Interpreting the uniquely renamed variant of (6.20) back in SSA yields the desired
code with a single φ-function for variable y at the beginning of block b2 (see
Fig. 6.4). The reason that this φ-function cannot be eliminated (the redefinition of
y in the loop) is precisely the reason why y survives parameter dropping.
Given this understanding of parameter dropping we can also see why inserting
functions near the end of their hosts during block sinking (as in code (6.18))
is in general preferable to inserting them at the beginning of their hosts (as in
code (6.17)): The placement of function declarations in the vicinity of their calls
potentially enables the dropping of more parameters, namely those that are let-
bound in the body of the host function.
function f1 () =
let v = 1, z = 8, y = 4
in function f2 (v, z, y) =
function f3 (y, v) = let w = y + v in w end
in let x = 5 + y, y = x × z, x = x − 1
(6.21)
in if x = 0 then f3 (y, v) else f2 (v, z, y) end
end
in f2 (v, z, y) end
end
in f1 () end
We may now drop v (but not y) from the parameter list of f3 , and v and z from f2 ,
to obtain code (6.22).
function f1 () =
let v = 1, z = 8, y = 4
in function f2 (y) =
function f3 (y) = let w = y + v in w end
in let x = 5 + y, y = x × z, x = x − 1
(6.22)
in if x = 0 then f3 (y) else f2 (y) end
end
in f2 (y) end
end
in f1 () end
The SSA form corresponding to (6.22) contains the desired loop-closing φ-node for
y at the beginning of b3 , as shown in Fig. 6.5a. The nesting structure of both (6.21)
and (6.22) coincides with the dominance structure of the original imperative code
and its loop-closed SSA form.
Fig. 6.5 Loop-closed (a) and loop-unrolled (b) forms of running example program, corresponding
to codes (6.22) and (6.23), respectively
function f1 () =
let v = 1, z = 8, y = 4
in function f2 (y) =
function f3 (y) = let w = y + v in w end
in let x = 5 + y, y = x × z, x = x − 1
in if x = 0 then f3 (y)
else function f2 (y) =
(6.23)
let x = 5 + y, y = x × z, x = x − 1
in if x = 0 then f3 (y) else f2 (y) end
in f2 (y) end
end
in f2 (y) end
end
in f1 () end
Both calls to f3 are in the scope of the declaration of f3 and contain the appropriate
loop-closing arguments. In the SSA reading of this code—shown in Fig. 6.5b—
the first instruction in b3 has turned into a non-trivial φ-node. As expected, the
parameters of this φ-node correspond to the two control-flow arcs leading into b3 ,
one for each call site to f3 in code (6.23). Moreover, the call and nesting structure
of (6.23) is indeed in agreement with the control flow and dominance structure of
the loop-unrolled SSA representation.
The above example code excerpts where variables are not made distinct exhibit
a further pattern: The argument list of any call coincides with the list of formal
parameters of the invoked function. This discipline is not enjoyed by functional
programs in general, and is often destroyed by optimizing program transformations.
However, programs that do obey this discipline can be immediately converted to
imperative non-SSA form. Thus, the task of SSA destruction amounts to converting
a functional program with arbitrary argument lists into one where argument lists
and formal parameter lists coincide for each function. This can be achieved by
introducing additional let-bindings of the form let x = y in e end. For example,
a call f (v, z, y) where f is declared as function f (x, y, z) = e may be converted
to
let x = v, a = z, z = y, y = a in f (x, y, z) end
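Converting argument lists in this way amounts to sequentializing a parallel copy. The following sketch (Python, hypothetical names; not the destruction algorithm of Chap. 21) performs this conversion, breaking copy cycles such as the swap of y and z above with a fresh temporary; it may break a cycle at a different point than the hand-written code:

    def sequentialize_parallel_copy(copies, fresh_temp):
        """Turn the simultaneous copies {dest: src} into an ordered list
        of (dest, src) moves that can safely be executed one by one."""
        pending = [(d, s) for d, s in copies.items() if d != s]
        sequence = []
        while pending:
            sources = {s for _, s in pending}
            ready = next(((d, s) for d, s in pending if d not in sources), None)
            if ready is not None:              # destination read by no one
                pending.remove(ready)
                sequence.append(ready)
            else:                              # only cycles remain: break one
                d, _ = pending[0]
                t = fresh_temp()
                sequence.append((t, d))        # save the value held in d
                pending = [(d2, t if s2 == d else s2) for d2, s2 in pending]
        return sequence

    temps = iter(["a"])
    moves = sequentialize_parallel_copy({"x": "v", "y": "z", "z": "y"},
                                        lambda: next(temps))
    assert moves == [("x", "v"), ("a", "y"), ("y", "z"), ("z", "a")]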
Fig. 6.6 Example placement for reducible flow graphs. (a) Control-flow graph; (b) Overlay of
CFG (dashed edges) arcs between dominated blocks onto dominance graph (solid edges)
The code respects the dominance relationship in much the same way as the naive
placement, but additionally makes f1 inaccessible from within e5 , and makes f3
inaccessible from within f1 or f5 . As the reordering does not move function
declarations inside each other (in particular: no function declaration is brought
into or moved out of the scope of the formal parameters of any other function),
the reordering does not affect the potential to subsequently perform parameter
dropping.
Declaring functions using λ-abstraction brings further improvements. This
enables us not only to syntactically distinguish between loops and non-recursive
control-flow structures using the distinction between let and letrec present in many
functional languages, but also to further restrict the visibility of function names.
Indeed, while b3 is immediately dominated by b in the above example, its only
control-flow predecessors are b2 /g and b4 . We would hence like to make the
declaration of f3 local to the tuple (f2 , f4 ), i.e., invisible to f. This can be achieved
by combining let/letrec bindings with pattern matching, if we insert the shared
declaration of f3 between the declaration of the names f2 and f4 and the λ-bindings
of their formal parameters pi :
Fig. 6.8 Illustration of Steensgaard’s construction of loop nesting forests: (a) CFG-enriched
dominance tree; (b) resulting loop nesting forest
function entry(. . .) =
let . . . < body of entry > . . .
in letrec (u, v) = define outer loop L0 , with headers u, v
letrec (w, x) = define inner loop L1 , with headers w, x
let exit = λpexit . . . . < body of exit >
in ( λpw . . . . < body of w, with calls to u, x, and exit > . . .
(6.27)
, λpx . . . . < body of x, with calls to w, v, and exit > . . . )
end end of inner loop
in ( λpu . . . . < body of u, with call to w > . . .
, λpv . . . . < body of v, with call to x > . . . )
end end of outer loop
in . . . < calls from entry to u and v > . . .
By placing L1 inside L0 according to the scheme from code (6.26) and making
exit private to L1 , we obtain the representation (6.27), which captures all the
essential information of Steensgaard’s construction. Effectively, the functional
reading of the loop nesting forest extends the earlier correspondence between
the nesting of individual functions and the dominance relationship to groups of
functions and basic blocks: loop L0 dominates L1 in the sense that any path from
entry to a node in L1 passes through L0 ; more specifically, any path from entry to
a header of L1 passes through a header of L0 .
In general, each step of Steensgaard’s construction may identify several loops,
as a CFG may contain several maximal SCCs. As the bodies of these SCCs are
necessarily non-overlapping, the construction yields a forest comprised of trees
shaped like the loop nesting forest in Fig. 6.8b. As the relationship between the
trees is necessarily acyclic, the declarations of the function declaration tuples
corresponding to the trees can be placed according to the loop-extended notion of
dominance.
Further Reading
Shortly after the introduction of SSA, O’Donnell [213] and Kelsey [160] noted
the correspondence between let-bindings and points of variable declaration and its
extension to other aspects of program structure using continuation-passing style.
Appel [10, 11] popularized the correspondence using a direct-style representation,
building on his earlier experience with continuation-based compilation [9].
Continuations and low-level functional languages have been an object of inten-
sive study since their inception about four decades ago [174, 294]. For retrospective
accounts of the historical development, see Reynolds [246] and Wadsworth [299].
Early studies of CPS and direct style include work by Reynolds and Plotkin [229,
244, 245]. Two prominent examples of CPS-based compilers are those by Sussman
et al. [279] and Appel [9]. An active area of research concerns the relative merit
based sparse conditional constant propagation algorithm [303]. Beringer et al. [26]
consider data-flow equations for liveness and read-once variables, and formally
translate their solutions to properties of corresponding typing derivations. Laud
et al. [177] present a formal correspondence between data-flow analyses and type
systems but consider a simple imperative language rather than SSA.
At present, intermediate languages in functional compilers do not provide
syntactic support for expressing loop nesting forests directly. Indeed, most functional
compilers do not perform advanced analyses of nested loops. As an exception to this
rule, the MLton compiler ([Link]) implements Steensgaard’s algorithm for
detecting loop nesting forests, leading to subsequent loop unrolling and loop
unswitching transformations [278].
Concluding Remarks
In addition to low-level functional languages, alternative representations for SSA
have been proposed, but their discussion is beyond the scope of this chapter.
Glesner [129] employs an encoding in terms of abstract state machines [134] that
disregards the sequential control flow inside basic blocks but retains control flow
between basic blocks to prove the correctness of a code generation transformation.
Later work by the same author uses a more direct representation of SSA in the
theorem prover Isabelle/HOL for studying further SSA-based analyses.
Matsuno and Ohori [195] present a formalism that captures core aspects of SSA
in a notion of (non-standard) types while leaving the program text in unstructured
non-SSA form. Instead of classifying values or computations, types represent the
definition points of variables. Contexts associate program variables at each program
point with types in the standard fashion, but the non-standard notion of types
means that this association models the reaching-definitions relationship rather than
characterizing the values held in the variables at runtime. Noting a correspondence
between the types associated with a variable and the sets of def-use paths, the
authors admit types to be formulated over type variables whose introduction and
use corresponds to the introduction of φ-nodes in SSA.
Finally, Pop et al.’s model [233] dispenses with control flow entirely and instead
views programs as sets of equations that model the assignment of values to variables
in a style reminiscent of partial recursive functions. This model is discussed in more
detail in Chap. 10.
Part II
Analysis
Chapter 7
Introduction
M. Schordan
Lawrence Livermore National Laboratory, Livermore, CA, USA
e-mail: schordan1@[Link]
F. Rastello ()
Inria, Grenoble, France
e-mail: [Link]@[Link]
and its variant, speculative PRE, which perform partial redundancy elimination
(PRE), possibly speculatively. It also discusses a direct and very useful application
of the presented technique, register promotion via PRE. Unfortunately, the SSAPRE
algorithm is not capable of recognizing redundant computations among lexically
different expressions that yield the same value. Therefore, redundancy elimination
based on value analysis (GVN) is also discussed, although a description of the value-
based partial redundancy elimination (GVN-PRE) algorithm of VanDrunen [295],
which subsumes both PRE and GVN, is missing.
Alias Analysis
The book is lacking an extremely important compiler analysis, namely alias
analysis. Disambiguating pointers improves the precision and performance of
many other analyses and optimizations. To be effective, flow-sensitivity and inter-
procedurality are required but, with standard iterative data-flow analysis, lead to
serious scalability issues. The main difficulty with making alias analysis sparse is
that it is a chicken-and-egg problem. For each pair of variables that may alias,
the data-flow information of one interferes with that of the other. In the extreme scenario where any
pair of variables is a pair of aliases, one necessarily needs to assume that the
modification of one potentially modifies the other. In other words, the def-use chains
are completely dense and there is no benefit in using any extended SSA form such as
HSSA form (see Chap. 16). The idea is thus to decompose the analysis into phases
where sophistication increases with sparsity. This is precisely what the staged flow-
sensitive analysis of Hardekopf and Lin [138] achieves with two “stages.” The first
stage, performing a not-so-precise auxiliary pointer analysis, creates def-use chains
used to enable sparsity in the second stage. The second stage, the primary analysis,
is then a flow-sensitive pointer analysis. An important aspect is that as long as
the auxiliary pointer analysis in the first stage is sound, the primary flow-sensitive
analysis in the second stage will also be sound and will be at least as precise as a
traditional “non-sparse” flow-sensitive analysis. The sparsity improves the runtime
of the analysis, but it does not reduce precision.
Another way to deal with pointers in SSA form is to use a variant of SSA,
called partial SSA, which requires variables to be divided into two classes: one
class that contains only variables that are never referenced by pointers, and another
class containing all those variables that can be referenced by pointers (address-taken
variables). To avoid complications involving pointers in SSA form, only variables
in the first class are put into SSA form. This technique is used in modern compilers
such as GCC and LLVM.
Chapter 8
Propagating Information Using SSA
A central task of compilers is to optimize a given input program such that the
resulting code is more efficient in terms of execution time, code size, or some other
metric of interest. However, in order to perform these optimizations, typically some
form of program analysis is required to determine if a given program transformation
is applicable, to estimate its profitability, and to guarantee its correctness.
Data-flow analysis is a simple yet powerful approach to program analysis that is
utilized by many compiler frameworks and program analysis tools today. We will
introduce the basic concepts of traditional data-flow analysis in this chapter and
will show how the Static Single Assignment form (SSA) facilitates the design and
implementation of equivalent analyses. We will also show how the SSA property
allows us to reduce the compilation time and memory consumption of the data-flow
analyses that this program representation supports.
Traditionally, data-flow analysis is performed on a control-flow graph representa-
tion (CFG) of the input program. Nodes in the graph represent operations, and edges
represent the potential flow of program execution. Information on certain program
properties is propagated among the nodes along the control-flow edges until the
computed information stabilizes, i.e., no new information can be inferred from the
program.
The propagation engine presented in the following sections is an extension of
the well-known approach by Wegman and Zadeck for sparse conditional constant
propagation (also known as SSA-CCP). Instead of using the CFG, they represent
the input program as an SSA graph as defined in Chap. 14: operations are again
F. Brandner ()
Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France
e-mail: [Link]@[Link]
D. Novillo
Google, Toronto, ON, Canada
e-mail: dnovillo@[Link]
represented as nodes in this graph; however, the edges represent data dependencies
instead of control flow. This representation allows selective propagation of program
properties among data-dependent graph nodes only. As before, the processing
stops when the information associated with the graph nodes stabilizes. The basic
algorithm is not limited to constant propagation and can also be applied to solve
a large class of other data-flow problems efficiently. However, not all data-flow
analyses can be modelled. Chapter 13 addresses some of the limitations of the SSA-
based approach.
The remainder of this chapter is organized as follows. First, the basic concepts
of (traditional) data-flow analysis are presented in Sect. 8.1. This will provide
the theoretical foundation and background for the discussion of the SSA-based
propagation engine in Sect. 8.2. We then provide an example of a data-flow analysis
that can be performed efficiently by the aforementioned engine, namely copy
propagation in Sect. 8.3.
8.1 Preliminaries
A partially ordered set (L, ⊑) consists of a set L together with a reflexive, transitive,
and anti-symmetric relation ⊑. Using the relation, upper and lower bounds, as well as least
upper and greatest lower bounds, can be defined for subsets of L.
A particularly interesting class of partially ordered sets are complete lattices,
where all subsets have a least upper bound as well as a greatest lower bound. Those
bounds are unique and are denoted by ⊔ and ⊓, respectively. In the context of
program analysis, the former is often referred to as the join operator, while the latter
is termed the meet operator. Complete lattices have two distinguished elements, the
least element and the greatest element, often denoted by ⊥ and ⊤, respectively.
An ascending chain is a totally ordered subset {l1 , . . . , ln } of a complete lattice.
A chain is said to stabilize if there exists an index m, where ∀i > m : li = lm . An
analogous definition can be given for descending chains.
Program Representation The functions of the input program are represented as
control-flow graphs, where the nodes represent operations, or instructions, and edges
denote the potential flow of execution at runtime. Data-flow information is then
propagated from one node to another adjacent node along the respective graph edge
using in and out sets associated with every node. If there exists only one edge
connecting two nodes, data can be simply copied from one set to the other. However,
if a node has multiple incoming edges, the information from those edges has to be
combined using the meet or join operator.
Sometimes, it is helpful to reverse the flow graph to propagate information, i.e.,
reverse the direction of the edges in the control-flow graph. Such analyses are termed
backward analyses, while those using the regular flow graph are forward analyses.
Transfer Functions Aside from the control flow, the operations of the program
need to be accounted for during analysis. Usually, these operations change the
way data is propagated from one control-flow node to the other. Every operation
is thus mapped to a transfer function, which transforms the information available
from the in set of the flow graph node of the operation and stores the result in the
corresponding out set.
Putting all those elements together—a complete lattice, a flow graph, and a set of
transfer functions—yields an instance of a monotone framework. This framework
describes a set of data-flow equations whose solution will ultimately converge to
the solution of the data-flow analysis. A very popular and intuitive way to solve
these equations is to compute the maximal (minimal) fixed point (MFP) using an
iterative work list algorithm. The work list contains edges of the flow graph that
have to be revisited. Visiting an edge consists of first combining the information
from the out set of the source node with the in set of the target node, using
the meet or join operator, and then applying the transfer function of the target
node. The obtained information is then propagated to all direct successors of the
target node by appending the corresponding edges to the work list. The algorithm
terminates when the data-flow information stabilizes, as the work list then becomes
empty.
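As an illustration, a minimal work list solver along these lines might look as follows (a sketch in Python; the lattice `join`, the per-node `transfer` functions, and the initial value `init` are assumed to be supplied by the analysis):

    from collections import deque

    def solve_mfp(nodes, edges, init, join, transfer):
        in_set = {n: init for n in nodes}
        out_set = {n: transfer(n, init) for n in nodes}
        succs = {n: [d for s, d in edges if s == n] for n in nodes}
        worklist = deque(edges)
        while worklist:
            src, dst = worklist.popleft()
            new_in = join(in_set[dst], out_set[src])   # combine along the edge
            if new_in != in_set[dst]:
                in_set[dst] = new_in
                out_set[dst] = transfer(dst, new_in)   # apply transfer function
                # revisit the direct successors of the target node
                worklist.extend((dst, s) for s in succs[dst])
        return in_set, out_set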
A single flow edge can be appended several times to the work list in the course
of the analysis. It may even happen that an infinite feedback loop prevents the
algorithm from terminating. We are thus interested in bounding the number of
times a flow edge is processed. Recalling the definition of chains from before (see
Sect. 8.1), the height of a lattice is defined by the length of its longest chain. We
can ensure termination for lattices fulfilling the ascending chain condition, which
ensures that the lattice has finite height. Given a lattice with finite height h and a
flow graph G = (V , E), it is easy to see that the MFP solution can be computed
in O(|E| · h) time, where |E| represents the number of edges. Since the number of
edges is bounded by the square of the number of graph nodes |V | (more precisely,
|E| ≤ |V |²), this gives an O(|V |² · h) general algorithm to solve data-flow analyses. Note that
the height of the lattice often depends on properties of the input program, which
might ultimately yield bounds worse than cubic in the number of graph nodes. For
instance, the lattice for copy propagation consists of the cross product of many
smaller lattices, each representing the potential values of a variable occurring in
the program. The total height of the lattice thus directly depends on the number of
variables in the program.
In terms of memory consumption, we have to propagate data-flow information to all relevant
program points. Nodes are required to hold information even when it is not directly
related to the node; hence, each node must store complete in and out sets.
SSA form allows us to solve a large class of data-flow problems more efficiently
than the iterative work list algorithm presented previously. The basic idea is to
directly propagate information computed at the unique definition of a variable to
all its uses. In this way, intermediate program points that neither define nor use the
variable of interest do not have to be taken into consideration, thus reducing memory
consumption and compilation time.
by passing through that join. The SSA form properties ensure that a φ-function is
placed at the join node and that any use of the variable that the φ-function defines
has been properly updated to refer to the correct name.
Consider, for example, the code excerpt shown in Fig. 8.1, along with its
corresponding SSA graph and CFG. Assume we are interested in propagating
information from the assignment of variable y1 , at the beginning of the code,
down to its unique use at the end. The traditional CFG representation causes the
propagation to pass through several intermediate program points. These program
points are concerned only with computations of the variables x1 , x2 , and x3
and are thus irrelevant for y1 . The SSA graph representation, on the other hand,
propagates the desired information directly from definition to use sites, without any
intermediate step. At the same time, we also find that the control-flow join following
the conditional is properly represented by the φ-function defining the variable x3 in
the SSA graph.
Even though the SSA graph captures data dependencies and the relevant join
nodes in the CFG, it lacks information on other control dependencies. However,
analysis results can often be improved significantly by considering the additional
information that is available from the control dependencies in the CFG. As an
example, consider once more the code in Fig. 8.1, and assume that the condition
associated with the if-statement is known to be false for all possible program
executions. Consequently, the φ-function will select the value of x2 in all cases,
which is known to be of constant value 5. However, due to the shortcomings of the
SSA graph, this information cannot be derived. It is thus important to use both the
control-flow graph and the SSA graph during data-flow analysis in order to obtain
the best possible results.
operations are processed by applying the relevant transfer function and possibly
propagating the updated information to all uses by appending the respective SSA
graph edges to the SSAWorkList.
As an example, consider the program shown in Fig. 8.1 and the constant
propagation problem. First, assume that the condition of the if-statement cannot
be statically evaluated, we thus have to assume that all its successors in the CFG
are reachable. Consequently, all control-flow edges in the program will eventually
be marked executable. This will trigger the evaluation of the constant assignments
to the variables x1 , x2 , and y1 . The transfer functions immediately yield that the
variables are all constant, holding the values 4, 5, and 6, respectively. This new
information will trigger the reevaluation of the φ-function of variable x3 . As both of
its incoming control-flow edges are marked executable, the combined information
yields 4 ⊓ 5 = ⊥, i.e., the value is known not to be a particular constant value.
Finally, also the assignment to variable z1 is reevaluated, but the analysis shows that
its value is not a constant as depicted in Fig. 8.2a. If, however, the if-condition
is known to be false for all possible program executions, a more precise result
can be computed, as shown in Fig. 8.2b. Neither the control-flow edge leading to
the assignment of variable x1 nor its outgoing edge leading to the φ-function of
variable x3 is marked executable. Consequently, the reevaluation of the φ-function
considers the data-flow information of its second operand x2 only, which is known
to be constant. This enables the analysis to show that the assignment to variable z1
is, in fact, constant as well.
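A sketch of the lattice operations at work in this example (Python; TOP stands for "not yet known" and BOT for ⊥, with per-edge executable flags as assumed above):

    TOP, BOT = object(), object()

    def meet(a, b):
        if a is TOP:
            return b
        if b is TOP:
            return a
        return a if a == b else BOT        # two distinct constants give ⊥

    def eval_phi(args, executable):
        """Combine φ operands, skipping non-executable incoming edges."""
        value = TOP
        for v, ok in zip(args, executable):
            if ok:
                value = meet(value, v)
        return value

    # Both edges executable: 4 ⊓ 5 = ⊥ (Fig. 8.2a); first edge dead:
    # only x2 = 5 is considered (Fig. 8.2b).
    assert eval_phi([4, 5], [True, True]) is BOT
    assert eval_phi([4, 5], [False, True]) == 5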
8.2.3 Discussion
During the course of the propagation algorithm, every edge of the SSA graph is
processed at least once, whenever the operation corresponding to its definition
is found to be executable. Afterwards, an edge can be revisited several times
depending on the height h of the lattice representing the property space of the
analysis. On the other hand, edges of the control-flow graph are processed at most
once. This leads to an upper bound in execution time of O(|ESSA | · h + |ECFG |),
where ESSA and ECFG represent the edges of the SSA graph and the control-
flow graph, respectively. The size of the SSA graph increases with respect to
the original non-SSA program. Measurements indicate that this growth is linear,
yielding a bound that is comparable to the bound of traditional data-flow analysis.
However, in practice, the SSA-based propagation engine outperforms the traditional
approach. This is due to the direct propagation from the definition of a variable
to its uses, without the costly intermediate steps that have to be performed on the
CFG. The overhead is also reduced in terms of memory consumption: instead of
storing the in and out sets capturing the complete property space on every program
point, it is sufficient to associate every node in the SSA graph with the data-flow
information of the corresponding variable only, leading to considerable savings in
practice.
Even though data-flow analysis based on SSA graphs has its limitations, it is still
a useful and effective solution for interesting problems, as we will show in the
following example. Copy propagation under SSA form is, in principle, very simple.
Given the assignment x ← y, all we need to do is to traverse the immediate
uses of x and replace them with y, thereby effectively eliminating the original
copy operation. However, such an approach will not be able to propagate copies
past φ-functions, particularly those in loops. A more powerful approach is to split
copy propagation into two phases: First, a data-flow analysis is performed to find
copy-related variables throughout the program. Second, a rewrite phase eliminates
spurious copies and renames variables.
The analysis for copy propagation can be described as the problem of propagating
the copy-of value of variables. Given a sequence of copies as shown in Fig. 8.3a,
we say that y1 is a copy of x1 and z1 is a copy of y1 . The problem with this
representation is that there is no apparent link from z1 to x1 . In order to handle
transitive copy relations, all transfer functions operate on copy-of values instead of
the direct source of the copy. If a variable is not found to be a copy of anything
else, its copy-of value is the variable itself. For the above example, this yields that
both y1 and z1 are copies of x1 , which in turn is a copy of itself. The lattice of
this data-flow problem is thus similar to the lattice used previously for constant
propagation. The lattice elements correspond to variables of the program instead of
integer numbers. The least element of the lattice represents the fact that a variable
is a copy of itself.
Similarly, we would like to obtain the result that x3 is a copy of y1 for the example
of Fig. 8.3b. This is accomplished by choosing the join operator such that a copy
relation is propagated whenever the copy-of values of all the operands of the φ-
function matches. When visiting the φ-function for x3 , the analysis finds that x1 and
x2 are both copies of y1 , and consequently propagates that x3 is also a copy of y1 .
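A compact sketch of these two ingredients (Python; for brevity it ignores the ⊤ initialization and the executable-edge filtering of the full engine):

    def copy_of(v, env):
        """Copy-of value of v: env records, for each copy destination,
        the copy-of value of its source, so chains collapse transitively."""
        return env.get(v, v)               # a non-copy is a copy of itself

    def join_phi(target, operands, env):
        """Propagate a copy relation through a φ-function only when all
        operands agree on the same copy-of value."""
        values = {copy_of(op, env) for op in operands}
        return values.pop() if len(values) == 1 else target

    env = {}
    env["y1"] = copy_of("x1", env)         # y1 <- x1
    env["z1"] = copy_of("y1", env)         # z1 <- y1 collapses to x1
    assert env["z1"] == "x1"
    assert join_phi("x3", ["x1", "x2"], {"x1": "y1", "x2": "y1"}) == "y1"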
The next example shows a more complex situation where copy relations are
obfuscated by loops—see Fig. 8.4. Note that the actual visiting order depends on
the shape of the CFG and immediate uses; in other words, the ordering used here
is meant for illustration only. Processing starts at the operation labeled 1, with both
work lists empty and the data-flow information ⊤ associated with all variables:
1. Assuming that the value assigned to variable x1 is not a copy, the data-flow
information for this variable is lowered to ⊥, the SSA edges leading to operations
2 and 3 are appended to the SSAWorkList, and the control-flow graph edge e1 is
appended to the CFGWorkList.
2. Processing the control-flow edge e1 from the work list causes the edge to be
marked executable and the operations labeled 2 and 3 to be visited. Since edge e5
is not yet known to be executable, the processing of the φ-function yields a copy
relation between x2 and x1 . This information is utilized in order to determine
which outgoing control-flow graph edges are executable for the conditional
branch. Examining the condition shows that only edge e3 is executable and thus
needs to be added to the work list.
3. Control-flow edge e3 is processed next and marked executable for the first time.
Furthermore, the φ-function labeled 5 is visited. Due to the fact that edge e4 is
not known to be executable, this yields a copy relation between x4 and x1 (via
x2 ). The condition of the branch labeled 6 cannot be analysed and thus causes its
outgoing control-flow edges e5 and e6 to be added to the work list.
4. Now, control-flow edge e5 is processed and marked executable. Since the target
operations are already known to be executable, only the φ-function is revisited.
However, variables x1 and x4 have the same copy-of value x1 , which is identical
to the previous result computed in Step 2. Thus, neither of the two work lists is
modified.
5. Assuming that the control-flow edge e6 leads to the exit node of the control-flow
graph, the algorithm stops after processing the edge without modifications to the
data-flow information computed so far.
The straightforward implementation of copy propagation would have needed
multiple passes to discover that x4 is a copy of x1 . The iterative nature of the
propagation, along with the ability to discover non-executable code, allows one
to handle even obfuscated copy relations. Moreover, this kind of propagation will
only reevaluate the subset of operations affected by newly computed data-flow
information instead of the complete control-flow graph once the set of executable
operations has been discovered.
offer similar properties. Static Single Information form (SSI), first introduced by
Ananian [8] and then later fixed by Singer [261], has been designed with the
objective of allowing both backward and forward problems. It uses σ -functions,
which are placed at program points where data-flow information for backward
problems needs to be merged [260]. But as illustrated by dead-code elimination
(backward sparse data-flow analysis on SSA), the idea that σ -functions are necessary
for backward propagation problems (or φ-functions for forward ones) is a misconception.
This subtle point is further explained in Sect. 13.2. Bodík also uses an extended
SSA form, e-SSA, his goal being to eliminate array bounds checks [32]. e-SSA is
a simple extension that allows one to also propagate information from conditionals.
Chapter 13 revisits the concepts of static single information and proposes a
generalization that subsumes all those extensions to SSA form. Ruf [250] introduces
the value dependence graph, which captures both control and data dependencies. He
derives a sparse representation of the input program, which is suited for data-flow
analysis, using a set of transformations and simplifications.
The sparse evaluation graph by Choi et al. [67] is based on the same basic
idea as the approach presented in this chapter: intermediate steps are eliminated
by bypassing irrelevant CFG nodes and merging the data-flow information only
when necessary. Their approach is closely related to the placement of φ-functions
and similarly relies on the dominance frontier during construction. A similar
approach, presented by Johnson and Pingali [153], is based on single-entry/single-
exit regions. The resulting graph is usually less sparse but is also less complex
to compute. Ramalingam [237] further extends those ideas and introduces the
compact evaluation graph, which is constructed from the initial CFG using two
basic transformations. The approach is superior to the sparse representations by
Choi et al. as well as the approach presented by Johnson and Pingali.
The previous approaches derive a sparse graph suited for data-flow analysis
using graph transformations applied to the CFG. Duesterwald et al. instead exam-
ine the data-flow equations, eliminate redundancies, and apply simplifications to
them [107].
Chapter 9
Liveness
This chapter illustrates the use of strict SSA properties to simplify and accelerate
liveness analysis, which determines, for all variables, the set of program points
where these variables are live, i.e., their values are potentially used by subsequent
operations. Liveness information is essential to solve storage assignment problems,
eliminate redundancies, and perform code motion. For instance, optimizations like
software pipelining, trace scheduling, register-sensitive redundancy elimination (see
Chap. 11), if-conversion (see Chap. 20), as well as register allocation (see Chap. 22)
heavily rely on liveness information.
Traditionally, liveness information is obtained by data-flow analysis: liveness sets
are computed for all basic blocks and variables simultaneously by solving a set of
data-flow equations. These equations are usually solved by an iterative algorithm,
propagating information backwards through the control-flow graph (CFG) until a
fixed point is reached and the liveness sets stabilize. The number of iterations
depends on the control-flow structure of the considered program, and more precisely
on the structure of its loops.
In this chapter, we show that, for strict SSA form programs, the live range of a
variable has valuable properties that can be expressed in terms of the loop nesting
forest of the CFG and its corresponding directed acyclic graph, the forward-CFG.
B. Boissinot
ENS Lyon, Lyon, France
e-mail: [Link]@[Link]
F. Rastello ()
Inria, Grenoble, France
e-mail: [Link]@[Link]
9.1 Definitions
Liveness is a property that relates program points to sets of variables which are
considered to be live at these program points. Intuitively, a variable is considered
live at a given program point when its value will be used in the future of any dynamic
execution. Statically, liveness can be approximated by following paths backwards on
the control-flow graph, connecting the uses of a given variable to its definitions—or,
in the case of SSA forms, to its unique definition. The variable is said to be live at
all program points along these paths. For a CFG node q, representing an instruction
or a basic block, a variable v is live-in at q if there is a path, not containing the
definition of v, from q to a node where v is used (including q itself). It is live-out
at q if it is live-in at some direct successor of q.
The computation of live-in and live-out sets at the entry and the exit of basic
blocks is usually termed liveness analysis. It is indeed sufficient to consider only
these sets at basic block boundaries, since liveness within a basic block is trivial to
recompute from its live-out set with a backward traversal of the block (whenever
the definition of a variable is encountered, it is pruned from the live-out set). Live-
ranges are closely related to liveness. Instead of associating program points with
sets of live variables, the live range of a variable specifies the set of program points
where that variable is live. Live ranges of programs under strict SSA form exhibit
certain useful properties (see Chap. 2), some of which can be exploited for register
allocation (see Chap. 22).
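The per-block recomputation just mentioned can be sketched as follows (Python; the defs/uses fields of each instruction are hypothetical):

    from dataclasses import dataclass, field

    @dataclass
    class Instr:
        defs: set = field(default_factory=set)
        uses: set = field(default_factory=set)

    def liveness_in_block(instructions, live_out):
        """Variables live just before each instruction, obtained by a
        backward traversal of the block from its live-out set."""
        live = set(live_out)
        live_before = [None] * len(instructions)
        for i in reversed(range(len(instructions))):
            live -= instructions[i].defs   # a definition kills the variable
            live |= instructions[i].uses   # a use makes it live
            live_before[i] = set(live)
        return live_before

    # x <- y + 1; z <- x, with {z} live-out: y is live before the first
    # instruction and x before the second.
    block = [Instr(defs={"x"}, uses={"y"}), Instr(defs={"z"}, uses={"x"})]
    assert liveness_in_block(block, {"z"}) == [{"y"}, {"x"}]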
The special behaviour of φ-functions often causes confusion about where exactly
its operands are actually used and defined. For a regular operation, variables are
used and defined where the operation takes place. However, the semantics of φ-
functions (and, in particular, the actual place of φ-uses) should be defined carefully,
especially when dealing with SSA destruction. In algorithms for SSA destruction
(see Chap. 21), a use in a φ-function is considered live somewhere inside the
corresponding direct predecessor block, but, depending on the algorithm and, in
particular, the way copies are inserted, it may or may not be considered as live-
out for that predecessor block. Similarly, the definition of a φ-function is always
considered to be at the beginning of the block, but, depending on the algorithm,
it may or may not be marked as live-in for the block. To make the description
of algorithms easier, we follow the same definition as the one used in Chap. 21,
Sect. 21.2:
Definition 9.1 (Liveness for φ-Function Operands—Multiplexing Mode) For a
φ-function a0 = φ(a1 , . . . , an ) in block B0 , where ai comes from block Bi :
• Its definition-operand is considered to be at the entry of B0 , in other words
variable a0 is live-in of B0 .
• Its use operands are at the exit of the corresponding direct predecessor basic
blocks, in other words, variable ai is live-out of basic block Bi .
A well-known and frequently used approach to compute the live-in and live-out sets
of basic blocks is backward data-flow analysis (see Chap. 8, Sect. 8.1). The liveness
sets are given by a set of equations that relate upward-exposed uses and definitions
to live-in and live-out sets. We say a use is upward-exposed in a block when there is
no local definition preceding it, i.e., the live range “escapes” the block at the top. The
sets of upward-exposed uses and definitions do not change during liveness analysis
and can thus, for each block B, be pre-computed. With these pre-computed sets, the
data-flow equations can be written as:
LiveIn(B) = PhiDefs(B) ∪ UpwardExposed(B) ∪ (LiveOut(B) \ Defs(B))
LiveOut(B) = PhiUses(B) ∪ ⋃ S∈succs(B) (LiveIn(S) \ PhiDefs(S))
Informally, the live-in of block B are the variables defined in the φ-functions
of B, those used in B (and not defined in B), and those which are just “passing
through.” On the other hand, the live-out are those that must be live for a direct
successor S, i.e., either live-in of S (but not defined in a φ-function of S) or used in
a φ-function of S.
Instead of computing a fixed point, we show that liveness information can be derived
in two passes over the control-flow graph. The first version of the algorithm requires
the CFG to be reducible. We then show that arbitrary control-flow graphs can be
handled elegantly and with no additional cost, except for a cheap pre-processing
step on the loop nesting forest.
The key properties of live ranges under strict SSA form on a reducible CFG that
we exploit for this purpose can be outlined as follows:
1. Let q be a CFG node that does not contain the definition d of a variable, and h be
the header of the maximal loop containing q but not d. If such a maximal loop
does not exist, then let h = q.
The variable is live-in at q if and only if there exists a forward path from h to
a use of the variable without going through the definition d.
2. If a variable is live-in at the header of a loop then it is live at all nodes inside the
loop.
As an example, consider the code of Fig. 9.1. For q = 6, the header of the largest
loop containing 6 but not the definition d in 3 is h = 5. As a forward path (down
edges) exists from 3 to 5, variable v is live-in at 5. It is thus also live in all nodes
inside the loop, in particular, in node 6. On the other hand, for q = 7, the largest
“loop” containing 7 but not 3 is 7 itself. As there is no forward path from 7 to any
use (node 5), v is not live-in of 7 (note that v is not live-in of 2 either).
Those two properties pave the way for describing the two steps that make up our
liveness set algorithm:
1. A backward pass propagates partial liveness information upwards using a post-
order traversal of the forward-CFG;
2. The partial liveness sets are then refined by traversing the loop nesting forest,
propagating liveness from loop-headers down to all basic blocks within loops.
Algorithm 9.1 shows the necessary initialization and the high-level structure to
compute liveness in two passes.
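The two passes can be condensed into the following sketch (Python; the per-block PhiUses/PhiDefs/UpwardExposed/Defs sets, the post-order of the forward CFG, and the list of loops are assumed to be given, and all live_in/live_out entries start out empty):

    def partial_liveness(postorder, fwd_succs, sets, live_in, live_out):
        """Pass 1: one backward sweep over the forward CFG; a post-order
        traversal processes every block after its forward successors."""
        for b in postorder:
            out = set(sets[b]["PhiUses"])
            for s in fwd_succs[b]:                 # back edges are ignored
                out |= live_in[s] - sets[s]["PhiDefs"]
            live_out[b] = out
            live_in[b] = (sets[b]["PhiDefs"] | sets[b]["UpwardExposed"]
                          | (out - sets[b]["Defs"]))

    def propagate_loops(loops, sets, live_in, live_out):
        """Pass 2: push each loop-header's live-in set down to every
        block of its loop, outer loops before inner ones."""
        for header, members in loops:
            live_loop = live_in[header] - sets[header]["PhiDefs"]
            for b in members:
                live_in[b] |= live_loop
                live_out[b] |= live_loop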
The next phase, which traverses the loop nesting forest, is shown in Algo-
rithm 9.3. The live-in and live-out sets of all basic blocks within a loop are unified
with the liveness sets of its loop-header.
Example 9.1 The CFG of Fig. 9.2a is a pathological case for iterative data-flow
analysis. The pre-computation phase does not mark variable a as live throughout
the two loops. An iteration is required for every loop nesting level until the final
solution is computed. In our algorithm, after the CFG traversal, the traversal of the
loop nesting forest (Fig. 9.2b) propagates the missing liveness information from the
loop-header of loop L2 down to all blocks within the loop’s body and all inner loops,
i.e., blocks 3 and 4 of L3 .
[Link] Correctness
The first pass propagates the liveness sets using a post-order traversal of the forward-
CFG, Gfwd , obtained by removing all back edges from the CFG G. The first two
lemmas show that this pass correctly propagates liveness information to the loop-
headers of the original CFG.
Lemma 9.1 Let G be a reducible CFG, v an SSA variable, and d its definition. If L
is a maximal loop not containing d, then v is live-in at the loop-header h of L iff
there is a path in Gfwd (i.e., back edge free) from h to a use of v that does not go
through d.
Lemma 9.2 Let G be a reducible CFG, v an SSA variable, and d its definition.
Let p be a node of G such that all loops containing p also contain d. Then v is live-
in at p iff there is a path in Gfwd , from p to a use of v that does not go through d.
Pointers to formal proofs are provided in the last section of this chapter. The
important property used in the proof is the dominance property that requires the full
live range of a variable to be dominated by its definition d. As a consequence, any
back edge that is part of the live range is dominated by d, and the associated loop cannot
contain d.
Algorithm 9.2, which propagates liveness information along the DAG Gfwd , can
only mark variables as live-in if they are indeed live-in. Furthermore, if, after this
propagation, a variable v is missing in the live-in set of a CFG node p, Lemma 9.2
shows that p belongs to a loop that does not contain the definition of v. Let L be
such a maximal loop. According to Lemma 9.1, v is correctly marked as live-in
at the header of L. The next lemma shows that the second pass of the algorithm
(Algorithm 9.3) correctly adds variables to the live-in and live-out sets where they
are missing.
Lemma 9.3 Let G be a reducible CFG, L a loop, and v an SSA variable. If v is
live-in at the loop-header of L, it is live-in and live-out at every CFG node in L.
The intuition is straightforward: a loop is a strongly connected component, and
because v is live-in of L, d cannot be part of L.
The algorithms based on loops described above are only valid for reducible graphs.
We can also derive an algorithm that works for irreducible graphs, as follows:
transform the irreducible graph into a reducible graph, such that the liveness in both
graphs is equivalent. First of all we would like to stress two points:
1. We do not require the transformed graph to be semantically equivalent to the
original one, only isomorphism of liveness is required.
2. We do not actually modify the graph in practice, but Algorithm 9.2 can be
changed to simulate the modification of some edges on the fly.
There are loop nesting forest representations with possibly multiple headers per
irreducible loop. For the sake of clarity (and simplicity of implementation), we
consider a representation where each loop has a unique entry node as header. In
this case, the transformation simply relies on redirecting any edge s → t arriving in
the middle of a loop to the header of the outermost loop (if it exists) that contains t
but not s. The example of Fig. 9.3 illustrates this transformation, with the modified
edge highlighted. Considering the associated loop nesting forest (with nodes 2, 5,
and 8 as loop-headers), edge 9 → 6 is redirected to node 5.
Obviously the transformed code does not have the same semantics as the
original one. But, because a loop is a strongly connected component, the dominance
relationship is unchanged. As an example, the immediate dominator of node 5 is 3,
in both the original and the transformed CFG. For this reason, any variable live-in
of loop L5 —thus live everywhere in the loop—will be live on any path from 3 to
the loop. Redirecting an incoming edge to another node of the loop—in particular,
the header—does not change this behaviour.
To avoid building this transformed graph explicitly, an elegant alternative is to
modify the CFG traversal in Algorithm 9.2. Whenever an entry-edge s → t is
encountered during the traversal, instead of visiting t, we visit the header of the
largest loop containing t and not s. This header node is nothing else than the highest
ancestor of t in the loop nesting forest that is not an ancestor of s. We represent
Fig. 9.3 A reducible CFG derived from an irreducible CFG, using the loop nesting forest. The
transformation redirects edges arriving inside a loop to the loop-header (here 9 → 6 into 9 → 5)
Our approach potentially involves many outermost excluding loop queries, espe-
cially for the liveness check algorithm, as developed further. An efficient implemen-
tation of OLE is required. The technique proposed here and shown in Algorithm 9.5
is to pre-compute the set of ancestors from the loop-tree for every node. A simple set
operation can then find the node we are looking for: the ancestors of the definition
node are removed from the ancestors of the query point. From the remaining
ancestors, we pick the shallowest. Using bitsets for encoding the set of ancestors
of a given node, indexed with a topological order of the loop-tree, these operations
are easily implemented. The removal is a bit inversion followed by a bitwise “and”
operation, and the shallowest node is found by searching for the first set bit in the
bitset. Since the number of loops (and thus the number of loop-headers) is rather small,
the bitsets are themselves small as well and this optimization does not result in much
wasted space.
Using this information, finding the outermost excluding loop can be done by
simple bitset operations as in Algorithm 9.5.
Example 9.2 Consider the example of Fig. 9.3c again and suppose the loops L2 ,
L8 , and L5 are, respectively, indexed 0, 1, and 2. Using big-endian notations for
bitsets, Algorithm 9.4 would give binary labels 110 to node 9 and 101 to node 6. The
outermost loop containing 6 but not 9 is given by the leading bit of 101 ∧ ¬110 =
001, i.e., L5 .
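In Python, with the ancestor sets encoded as integers whose bit i stands for the loop indexed i (so the chapter's big-endian labels 110 and 101 become the little-endian integers 0b011 and 0b101), the query might be sketched as:

    def outermost_excluding_loop(anc_query, anc_def):
        """Index of the outermost loop containing the query node but not
        the definition node, or None if no such loop exists. Indices are
        assumed to follow a topological order of the loop-tree, so the
        lowest set bit is the shallowest remaining ancestor."""
        remaining = anc_query & ~anc_def       # bit inversion + bitwise and
        if remaining == 0:
            return None
        return (remaining & -remaining).bit_length() - 1

    anc_6 = 0b101   # node 6: ancestors L2 (bit 0) and L5 (bit 2)
    anc_9 = 0b011   # node 9: ancestors L2 (bit 0) and L8 (bit 1)
    assert outermost_excluding_loop(anc_6, anc_9) == 2   # loop L5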
In contrast to liveness sets, the liveness check does not provide the set of variables
live at a block, but provides a query system to answer questions such as “is
variable v live at location q?” Such a framework is well suited to tree scan based
register allocation (see Chap. 22), SSA destruction (see Chap. 21), or Hyperblock
scheduling (see Chap. 18). Most register-pressure aware algorithms such as code-
motion are not designed to take advantage of a liveness check query system and
still require sets. This query system can obviously be built on top of pre-computed
liveness sets. Queries in O(1) are possible, at least for basic block boundaries,
provided sparsesets or bitsets are used to allow for efficient element-wise queries.
If sets are only stored at basic block boundaries, a query system at instruction
granularity can still be obtained by using the list of uses of variables or backward
scans. Constant-time worst-case complexity is lost in this scenario, but liveness
sets that have to be incrementally updated at each (even minor) code transformation
can be avoided and replaced by less memory-consuming data structures that only
depend on the CFG.
In the following, we consider the live-in query of variable a at node q. To
avoid notational overhead, let a be defined in the CFG node d = def (a) and let
u ∈ uses(a) be a node where a is used. Suppose that q is strictly dominated by
d (otherwise a cannot be live at q). Lemmas 9.1–9.3 given in Sect. [Link] can be
rephrased as follows:
1. Let h be the header of the maximal loop containing q but not d. Let h be q if
such a maximal loop does not exist. Then a is live-in at h if and only if there exists
a forward path that goes from h to u.
2. If a is live-in at the header of a loop, then it is live at any node inside the loop.
In other words, a is live-in at q if and only if there exists a forward path from h to
u, where h, if it exists, is the header of the maximal loop containing q but not d, and
q itself otherwise. Given the forward control-flow graph and the loop nesting forest,
finding out if a variable is live at some program point can be done in two steps.
First, if there exists a loop containing the program point q and not the definition,
pick the header of the biggest such loop instead as the query point. Then check for
reachability from q to any use of the variable in the forward CFG. As explained
in Sect. 9.2.2, for irreducible CFG, the modified-forward CFG that redirects any
edge s → t to the loop-header of the outermost loop containing t but excluding s
([Link] (s)), has to be used instead. Correctness is proved from the theorems used
for liveness sets.
Algorithm 9.6 puts a little bit more effort into providing a query system at
instruction granularity. If q is in the same basic block as d (lines 8–13), then a
is live at q if and only if there is a use outside the basic block, or inside but after
q. If h is a loop-header, then a is live at q if and only if a use is forward reachable
from h (lines 19–20). Otherwise, if the use is in the same basic block as q it must be
after q to bring the variable live at q (lines 17–18). In this pseudo-code, upper case
is used for basic blocks while lower case is used for program points at instruction
granularity. “def(a)” is an operand. “uses(a)” is a set of operands. “basicBlock(u)”
returns the basic block containing the operand u. Given the semantics of the φ-
function instruction, the basic block returned by this function for a φ-function
operand can be different from the block where the instruction textually occurs. Also,
“[Link]” provides the corresponding (increasing) ordering in the basic block. For
a φ-function operand, the ordering number might be greater than the maximum
ordering of the basic block if the semantics of the φ-function places the uses
on outgoing edges of the direct predecessor block. [Link] (D) corresponds to
Algorithm 9.5 given in Sect. 9.2.3. forwardReachable(H, U ), which tells whether
U is reachable in the modified-forward CFG, will be described later.
12 return false
13 H ← [Link] (D)
14 foreach u in uses(a) do
15 U ← basicBlock(u)
16 if (not isLoopHeader(H )) and U = Q and order(u) < order(q) then
17 continue
18 if forwardReachable(H, U ) then
19 return true
20 return false
The live-out check algorithm, given by Algorithm 9.7, only differs from live-in
check in lines 5, 11, and 17, which involve ordering comparisons. In line 5, if q is
equal to d it cannot be live-in while it might be live-out; in lines 11 and 17 if q is at
a use point it makes it live-in but not necessarily live-out.
12 return false
13 H ← [Link] (D)
14 foreach u in uses(a) do
15 U ← basicBlock(u)
16 if (not isLoopHeader(H )) and U = Q and order(u) ≤ order(q) then
17 continue
18 if forwardReachable(H, U ) then
19 return true
20 return false
The liveness check query system relies on pre-computations for efficient OLE and
forwardReachable queries. The outermost excluding loop is identical to the one
used for liveness sets. We explain how to compute modified-forward reachability
here (i.e., forward reachability on transformed CFG to handle irreducibility). In
practice we do not explicitly build the modified-forward graph. To efficiently
compute modified-forward reachability we simply need to traverse the modified-
forward graph in reverse topological order. A post-order initiated by a call to the
recursive function DFS_Compute_forwardReachable(r) (Algorithm 9.8) will do the
job. Bitsets can be used to efficiently implement sets of basic blocks. Once forward
reachability has been pre-computed this way, forwardReachable(H, U ) returns true
if and only if U ∈ [Link].
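A sketch of this pre-computation (Python; blocks are numbered, `succs` describes the modified-forward CFG, which is assumed acyclic, and bitsets are again plain integers):

    def dfs_forward_reachable(b, succs, reach, visited):
        """Post-order DFS: the reachable set of a block is the union of
        its successors' sets, computed after each successor is finished."""
        visited.add(b)
        reach[b] = 1 << b                      # every block reaches itself
        for s in succs[b]:
            if s not in visited:
                dfs_forward_reachable(s, succs, reach, visited)
            reach[b] |= reach[s]

    def forward_reachable(reach, h, u):
        return bool(reach[h] & (1 << u))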
Another, maybe more intuitive, way of calculating liveness sets is closely related to
the definition of the live range of a given variable. As recalled earlier, a variable is
live at a program point p, if p belongs to a path of the CFG leading from a definition
of that variable to one of its uses without passing through the definition. Therefore,
the live range of a variable can be computed using a backward traversal starting at
its uses and stopping when its (unique) definition is reached.
Actual implementation of this idea could be done in several ways. In particular,
the order along which use operands are processed, in addition to the way liveness
sets are represented, can substantially impact the performance. The one we choose
to develop here allows the use of a simple stack-like set representation which avoids
any expensive set-insertion operations and set-membership tests. The idea is to
process use operands variable by variable. In other words, the processing of different
variables is not intermixed, i.e., the processing of one variable is completed before
the processing of another variable begins.
Depending on the particular compiler framework, a pre-processing step that
performs a full traversal of the program (i.e., the instructions) might be required
in order to derive the def-use chains for all variables, i.e., a list of all uses for each
SSA variable. The traversal of the variable list and processing of its uses thanks to
def-use chains is depicted in Algorithm 9.9.
Note that, in strict SSA form, in a given block, no use can appear before a
definition. Thus, if v is live-out or used in a block B, it is live-in iff it is not
defined in B. This leads to the code of Algorithm 9.10 for path exploration. Here,
the liveness sets are implemented using a stack-like data structure.
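The combination of the two algorithms can be sketched as follows (Python; def-use information and the predecessor map are assumed given, φ uses are taken to start the walk in the corresponding predecessor block, and recursion depth is ignored for clarity):

    def compute_live_sets(variables, preds, live_in, live_out):
        """variables: (name, defining block, blocks with uses) triples;
        live_in/live_out are per-block lists used as stacks."""
        for var, def_block, use_blocks in variables:
            for b in use_blocks:
                up_and_mark(var, def_block, b, preds, live_in, live_out)

    def up_and_mark(var, def_block, b, preds, live_in, live_out):
        if b == def_block:                 # the walk stops at the definition
            return
        if live_in[b] and live_in[b][-1] == var:
            return                         # already visited for this variable
        live_in[b].append(var)             # cheap, test-free insertion
        for p in preds[b]:
            if not (live_out[p] and live_out[p][-1] == var):
                live_out[p].append(var)
            up_and_mark(var, def_block, p, preds, live_in, live_out)

Because the uses of one variable are processed to completion before the next variable begins, checking only the last pushed element of each list suffices as a membership test, which is exactly the stack-like trick mentioned above.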
a variable. These λ-operators and the other uses of variables are chained together
and liveness is efficiently computed on this graph representation. The technique
of Gerlek et al. can be considered as a precursor of the live variable analysis
based on the Static Single Information (SSI) form conjectured by Singer [261] and
revisited by Boissinot et al. [37]. In both cases, the insertion of pseudo-instructions
guarantees that any definition is post-dominated by a use.
Another approach to computing liveness was proposed by Appel [10]. Instead
of computing the liveness information for all variables at the same time, variables
are handled individually by exploring paths in the CFG starting from variable uses.
Using logic programming, McAllester [197] presented an equivalent approach to
show that liveness analysis can be performed in time proportional to the number
of instructions and variables. However, his theoretical analysis is limited to a
restricted input language with simple conditional branches and instructions. A more
generalized analysis is given in Chapter 2 of the habilitation thesis of Rastello [239],
in terms of both theoretical complexity and practical evaluation (Sect. 9.4 describes
a path-exploration technique restricted to SSA programs).
The loop nesting forest considered in this chapter corresponds to the one
obtained using Havlak’s algorithm [142]. A more generalized definition exists and
corresponds to the minimal loop nesting forest as defined by Ramalingam [236]. The
handling of any minimal loop nesting forest is also detailed in Chapter 2 of [239].
Handling of irreducible CFG can be done through CFG transformations such as
node splitting [2, 149]. Such a transformation can lead to an exponential growth in
the number of nodes. Ramalingam [236] proposed a transformation (different from
the one presented here but also without any exponential growth) that only maintains
the dominance property (not the full semantics).
Finding the maximal loop not containing a node s but containing a node t (OLE)
is a problem similar to finding the least common ancestor (LCA) of the two nodes s
and t in the rooted loop nesting forest: the loop in question is the only direct child
of LCA(s, t) that is an ancestor of t. As described in [23], an LCA query can be reduced to a
Range Minimum Query (RMQ) problem that can itself be answered in O(1), with a
pre-computation of O(n). The adaptation of LCA to provide an efficient algorithm
for OLE queries is detailed in Chapter 2 of [239].
This chapter is a short version of Chapter 2 of [239] which, among other details,
contains formal proofs and handling of different φ-function semantics. Sparse sets
are described by Cooper and Torczon [83].
Chapter 10
Loop Tree and Induction Variables

Sebastian Pop (Amazon Web Services, Austin, TX, USA) and Albert Cohen (Google, Paris, France)
This chapter presents an extension of SSA whereby the extraction of the reducible
loop tree can be done only on the SSA graph itself. This extension also captures
reducible loops in the CFG. This chapter first illustrates this property and then shows
its usefulness through the problem of induction variable recognition.
This first section shows that the classical SSA representation is not sufficient to
represent the semantics of the original program. We will see the minimal amount
of information that has to be added to the classical SSA representation in order to
represent the loop information: similar to the φexit -function used in the gated SSA
presented in Chap. 14, the loop-closed SSA form adds an extra variable at the end
of a loop for each variable defined in a loop and used after the loop.
In the classical definition of SSA, the CFG provides the skeleton of the program:
basic blocks contain assignment statements defining SSA variable names, and the
basic blocks with multiple direct predecessors contain φ-nodes. Let us look at what
happens when, starting from a classical SSA representation, we remove the CFG.
In order to remove the CFG, imagine a pretty printer function that dumps only
the arithmetic instructions of each basic block and skips the control instructions
of an imperative program by traversing the CFG structure in any order. Does
the representation obtained from this pretty printer contain enough information to
enable us to compute the same thing as the original program? (To simplify the
discussion, we consider the original program to be free of side-effect instructions.)
Let us see what happens with an example in its CFG-based SSA representation:
After removing the CFG structure, listing the definitions in an arbitrary order, we
could obtain this:
return c
b ← ... some computation independent of a
c ← a + b
a ← ... some computation independent of b
And this SSA code is sufficient, in the absence of side effects, to recover an order of
computation that leads to the same result as in the original program. For example,
the evaluation of the following sequence of statements would produce the same result:
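a ← ... some computation independent of b
b ← ... some computation independent of a
c ← a + b
return c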
We will now see how to represent the natural loops in the SSA form by systemati-
cally adding extra φ-nodes at the end of loops, together with extra information about
the loop-exit predicate. Supposing that the original program contains a loop:
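x ← 3
i ← x
while i < N do
    i ← i + 1
k ← i
return k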
By pretty printing with a random order traversal, we could obtain this SSA code:
x ← 3
return k
i ← φ(x, j)
k ← φ(i)
j ← i + 1
Note that some information is lost in this pretty printing: the exit condition of
the loop has been lost. We will have to record this information in the extension of
the SSA representation. However, the loop structure still appears through the cyclic
definition of the induction variable i. To expose it, we can rewrite this SSA code
using simple substitutions, such as:
i ← φ(3, i + 1)
k ← φ(i)
return k
Thus, we have the definition of the SSA name i defined as a function of itself. This
pattern is characteristic of the existence of a loop. We remark that there are two
kinds of φ-nodes used in this example:
• Loop-φ nodes “i = φ(x, j )” (also denoted i = φentry (x, j ) as in Chap. 14) have
an argument that contains a self-reference j and an invariant argument x: here the
defining expression “j = i+1” contains a reference to the same loop-φ definition
i, while x (here 3) is not part of the circuit of dependencies that involves i and j .
Note that it is possible to define a canonical SSA form by limiting the number of
arguments of loop-φ nodes to two.
• Close-φ nodes “k = φexit (i)” (using the notation of Chap. 14) capture
the last value of a name defined in a loop. Names defined in a loop can only be
used within that loop or in the arguments of a close-φ node (which is “closing”
the set of uses of the names defined in that loop). In a canonical SSA form, it is
possible to limit the number of arguments of close-φ nodes to one.
As we have seen in the example above, the exit condition of the loop disappeared
during the basic pretty printing of the SSA. To capture the semantics of the
computation of the loop, we have to specify this condition in the close-φ-node when
we exit the loop, so as to be able to derive which value will be available at the end
of the loop. With our extension, which adds the loop-exit condition to the syntax of
the close-φ, the SSA pretty printing of the above example would be:
x ← 3
i ← φentry (x, j)
j ← i + 1
k ← φexit (i ≥ N, i)
return k
The first phase of the induction variable analysis is the detection of the strongly
connected components of the SSA. This can be performed by traversing the use-def
SSA chains and detecting that some definitions are visited twice. For a self-referring
use-def chain, it is possible to derive the step of the corresponding induction variable
as the overall effect of one iteration of the loop on the value of the loop-φ node.
When the step of an induction variable depends on another cyclic definition, one
has to further analyse the inner cycle. The analysis of the induction variable ends
when all the inner cyclic definitions used for the computation of the step have been
analysed. Note that it is possible to construct SSA graphs with strongly connected
Fig. 10.1 Detection of the cyclic definition using a depth-first search traversal of the use-def chains
components that are impossible to characterize with the chains of recurrences. This
is precisely the case of the following example, which shows two inter-dependent
circuits, the first involving a and b with step c + 2, and the second involving c and
d with step a + 3. This leads to an endless loop, which must be detected.
a ← φentry (0, b)
c ← φentry (1, d)
b ← c + 2
d ← a + 3
Let us now look at an example, presented in Fig. 10.1, to see how the stride
detection works. The arguments of a φ-node are analysed to determine whether
they contain self-references or are pointing towards the initial value of the induction
variable. In this example, (a) represents the use-def edge that points towards the
invariant definition. When the argument to be analysed points towards a longer use-
def chain, the full chain is traversed, as shown in (b), until a φ-node is reached.
In this example, the φ-node that is reached in (b) is different from the φ-node from
which the analysis started, and so in (c) a search starts on the uses that have not
yet been analysed. When the original φ-node is found, as in (c), the cyclic def-use
chain provides the step of the induction variable: the step is “+e” in this example.
Knowing the symbolic expression for the step of the induction variable may not
be enough, as we will see next; one has to instantiate all the symbols (“e” in the
current example) defined in the varying loop to precisely characterize the induction
variable.
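To make the traversal concrete, here is a small Python sketch of this cyclic-definition search, under an invented toy encoding of the SSA graph: names map to defining expressions, a loop-φ is ('phi', init, arg), and an addition is ('add', lhs, rhs). It returns the symbolic step of a simple additive circuit.

def find_step(defs, phi_name):
    # returns the list of terms whose sum is the step of the induction
    # variable defined by phi_name, or None if no simple cycle is found
    _, _init, arg = defs[phi_name]

    def walk(name, acc):
        if name == phi_name:           # cycle closed: acc is the step
            return acc
        expr = defs.get(name)
        if expr is None or expr[0] != 'add':
            return None                # not a simple additive chain
        _, lhs, rhs = expr
        if isinstance(lhs, str) and lhs in defs:
            return walk(lhs, acc + [rhs])   # follow the use-def chain
        if isinstance(rhs, str) and rhs in defs:
            return walk(rhs, acc + [lhs])
        return None

    return walk(arg, [])

# shape of Fig. 10.1: c = phi(a, f), f = c + e  =>  step is "+e"
defs = {'c': ('phi', 'a', 'f'), 'f': ('add', 'c', 'e')}
assert find_step(defs, 'c') == ['e']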
Once the def-use circuit and its corresponding overall loop update expression
have been identified, it is possible to translate the sequence of values of the
induction variable to a chain of recurrences. The syntax of a polynomial chain of
recurrences is {base, +, step}x , where base and step may be arbitrary expressions
or constants, and x is the loop with which the sequence is associated. As a chain
of recurrences represents the sequence of values taken by a variable during the
execution of a loop, the associated expression of a chain of recurrences is given
by $\{base, +, step\}_x(\ell_x) = base + step \times \ell_x$, which is a function of $\ell_x$, the number
of times the body of loop x has been executed.
When base or step translates to sequences varying in outer loops, the resulting
sequence is represented by a multivariate chain of recurrences. For example,
{{0, +, 1}x , +, 2}y defines a multivariate chain of recurrences with a step of 1 in
loop x and a step of 2 in loop y, where loop y is enclosed in loop x. When
step translates into a sequence varying in the same loop, the chain of recurrences
represents a polynomial of a higher degree. For example, {3, +, {8, +, 5}x }x
represents a polynomial evolution of degree 2 in loop x. In this case, the chain of
recurrences is also written omitting the extra braces: {3, +, 8, +, 5}x. The
semantics of a chain of recurrences is defined using the binomial coefficient
$\binom{n}{p} = \frac{n!}{p!\,(n-p)!}$, by the equation

$$\{c_0, +, c_1, +, c_2, +, \ldots, +, c_n\}_x(\vec{\ell}\,) = \sum_{p=0}^{n} \binom{\ell_x}{p}\, c_p,$$

with $\vec{\ell}$ the iteration domain vector (the iteration loop counters of all the loops in
which the chain of recurrences variates), and $\ell_x$ the iteration counter of loop x.
This semantics is very useful in the analysis of induction variables, as it makes it
possible to split the analysis into two phases, with a symbolic representation as a
partial intermediate result:
1. First, the analysis leads to an expression where the step part “s” is left in a
symbolic form, i.e., {c0 , +, s}x .
2. Then, by instantiating the step, i.e., s = {c1 , +, c2 }x , the chain of recur-
rences is that of a higher-degree polynomial, i.e., {c0 , +, {c1 , +, c2 }x }x =
{c0 , +, c1 , +, c2 }x .
The last phase of the induction variable analysis consists in the instantiation (or
further analysis) of symbolic expressions left from the previous phase. This includes
the analysis of induction variables in outer loops, computing the last value of the
counter of a preceding loop, and the propagation of closed-form expressions for
loop invariants defined earlier. In some cases, it becomes necessary to leave in a
symbolic form every definition outside a given region, and these symbols are then
called parameters of the region.
Let us look again at the example of Fig. 10.1 to see how the sequence of
values of the induction variable c is characterized with the notation of the chains
of recurrences. The first step, after the cyclic definition is detected, is the translation
of this information into a chain of recurrences: in this example, the initial value (or
base of the induction variable) is a and the step is e, and so c is represented by a
chain of recurrences {a, +, e}1 that is varying in loop number 1. The symbols are
then instantiated: a is trivially replaced by its definition, leading to {3, +, e}1 . The
analysis of e leads to the chain of recurrences {8, +, 5}1 , which is then used in the
chain of recurrences of c, {3, +, {8, +, 5}1 }1 , and is equivalent to {3, +, 8, +, 5}1 , a
polynomial of degree two:
$$F(\ell) = 3\binom{\ell}{0} + 8\binom{\ell}{1} + 5\binom{\ell}{2} = \frac{5}{2}\,\ell^2 + \frac{11}{2}\,\ell + 3.$$
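As a quick sanity check of this semantics, a chain of recurrences can be evaluated directly with Python's math.comb (a sketch for illustration only):

from math import comb

def cr_eval(coeffs, l):
    # evaluates {c0, +, c1, +, ..., +, cn} at iteration l using the
    # binomial-coefficient semantics given above
    return sum(c * comb(l, p) for p, c in enumerate(coeffs))

# {3, +, 8, +, 5}: F(l) = 3*C(l,0) + 8*C(l,1) + 5*C(l,2)
#                       = (5/2) l^2 + (11/2) l + 3
assert [cr_eval([3, 8, 5], l) for l in range(4)] == [3, 11, 24, 42]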
One of the important static analyses for loops is to evaluate their trip count, i.e.,
the number of times the loop body is executed before the exit condition becomes
true. In common cases, the loop-exit condition is a comparison of an induction
variable against some constant, parameter, or other induction variables. The number
of iterations is then computed as the minimum solution of a polynomial inequality
with integer solutions, also called a Diophantine inequality. When one or more
coefficients of the Diophantine inequality are parameters, the solution is left in
parametric form. The number of iterations can also be an expression varying in
an outer loop, in which case it can be characterized using a chain of recurrences.
Consider a scalar variable varying in an outer loop with strides dependent on
the value computed in an inner loop. The expression representing the number of
iterations in the inner loop can then be used to express the evolution function of the
scalar variable varying in the outer loop.
For example, the following code:
x ← 0
for i = 0; i < N; i++ do     loop1
    for j = 0; j < M; j++ do     loop2
        x ← x + 1
x0 ← 0
i ← φentry¹ (0, i + 1)
x1 ← φentry¹ (x0 , x2 )
x4 ← φexit¹ (i < N, x1 )
j ← φentry² (0, j + 1)
x3 ← φentry² (x1 , x3 + 1)
x2 ← φexit² (j < M, x3 )
x4 represents the value of variable x at the end of the original imperative program.
The analysis of scalar evolutions for variable x4 would trigger the analysis of
scalar evolutions for all the other variables defined in the loop-closed SSA form,
as follows:
• First, the analysis of variable x4 would trigger the analysis of i, N, and x1 .
– The analysis of i leads to i = {0, +, 1}1 , i.e., the canonical loop counter ℓ1 of loop1 .
– N is a parameter and is left in its symbolic form.
– The analysis of x1 triggers the analysis of x0 and x2 :
The analysis of x0 leads to x0 = 0.
Analysing x2 triggers the analysis of j , M, and x3 :
· j = {0, +, 1}2 , i.e., the canonical loop counter ℓ2 of loop2 .
· M is a parameter.
· x3 = φentry² (x1 , x3 + 1) = {x1 , +, 1}2 .
x2 = φexit² (j < M, x3 ) is then computed as the last value of x3 after loop2 ,
i.e., it is the chain of recurrences of x3 applied to the first iteration of loop2
that does not satisfy j < M or, equivalently, ℓ2 < M. The corresponding
Diophantine inequality ℓ2 ≥ M has minimum solution ℓ2 = M. So, to
finish the computation of the scalar evolution of x2 , we apply M to the
scalar evolution of x3 , leading to x2 = {x1 , +, 1}2 (M) = x1 + M.
– The scalar evolution analysis of x1 then leads to x1 = φentry¹ (x0 , x2 ) =
φentry¹ (x0 , x1 + M) = {x0 , +, M}1 = {0, +, M}1 .
Induction variable detection has been studied extensively in the past because of
its central role in loop optimizations. Wolfe [306] designed the first SSA-based
induction variable recognition technique. It abstracts the SSA graph and classifies
inductions according to a wide spectrum of patterns.
When operating on a low-level intermediate representation with arbitrary gotos,
detecting the natural loops is the first step in the analysis of induction variables.
In general, and when operating on low-level code in particular, it is preferable
to use analyses that do not resort to an early classification into predefined patterns
and are therefore more robust to complex control flow. Chains of recurrences [15, 167, 318]
have been proposed to characterize the sequence of values taken by a variable
during the execution of a loop [293]; this approach has proven to be more robust to the
presence of complex, unstructured control flow, to the characterization of induction
variables over modulo arithmetic such as unsigned wrap-around types in C, and to
implementation in a production compiler [232].
The formalism and presentation of this chapter are derived from the thesis work
of Sebastian Pop. The manuscript [231] contains pseudo-code and links to the
implementation of scalar evolutions in GCC since version 4.0. The same approach
has also influenced the design of LLVM’s scalar evolution, but the implementation
is different.
Induction variable analysis is used in dependence tests for scheduling and
parallelization [308] and, more recently, in the extraction of short-vector parallelism
for SIMD instructions [212] (note, however, that the computation of closed-form
expressions is not required for dependence testing). The Omega test [235] and parametric integer linear program-
ming [115] have typically been used to reason about systems of parametric affine
Diophantine inequalities. But in many cases, simplifications and approximations
can lead to polynomial decision procedures [16]. Modern parallelizing compilers
tend to implement both kinds, depending on the context and aggressiveness of the
optimization.
Substituting an induction variable with a closed-form expression is also useful
for the removal of the cyclic dependencies associated with the computation of
the induction variable itself [127]. Other applications include enhancements to
strength reduction and loop-invariant code motion [127], and induction variable
canonicalization (reducing induction variables to a single one in a given loop) [185].
The number of iterations of loops can also be computed based on the charac-
terization of induction variables. This information is essential to advanced loop
analyses, such as value-range propagation [210], and enhanced dependence tests
for scheduling and parallelization [16, 235]. It also enables more opportunities for
scalar optimization when the induction variable is used after its defining loop. Loop
transformations also benefit from the replacement of the end-of-loop value as this
removes scalar dependencies between consecutive loops. Another interesting use of
the end-of-loop value is the estimation of the worst-case execution time (WCET)
where an attempt is made to obtain an upper bound approximation of the time
necessary for a program to terminate.
Chapter 11
Redundancy Elimination
Fred Chow (Huawei, Fremont, CA, USA)
1 All values referred to in this chapter are static values viewed with respect to the program code. A
static value can map to different dynamic values during program execution.
Figure 11.1 shows the two most basic forms of partial redundancy. In Fig. 11.1a,
a + b is redundant when the right path is taken. In Fig. 11.1b, a + b is redundant
whenever the back edge (see Sect. 4.4.1) of the loop is taken. Both are examples
of strictly partial redundancies, in which insertions are required to eliminate the
redundancies. In contrast, a full redundancy can be deleted without requiring any
insertion. Partial redundancy elimination (PRE) is powerful because it subsumes
global common subexpressions and loop-invariant code motion.
We can visualize the impact on redundancies of a single computation, as shown
in Fig. 11.2. In the region of the control-flow graph dominated by the occurrence
of a + b, any further occurrence of a + b is fully redundant, assuming a and b
are not modified. Following the program flow, once we are past the dominance
frontiers, any further occurrence of a + b is partially redundant. In constructing SSA
form, dominance frontiers are where φs are inserted. Since partial redundancies start
at dominance frontiers, partial redundancy elimination should borrow techniques
2 The opposite of maximal expression tree form is the triplet form, in which each arithmetic operation stores its result immediately in a temporary.
Fig. 11.2 Dominance frontiers (dashed) are boundaries between fully (highlighted basic blocks)
and partially (normal basic blocks) redundant regions
from SSA φ-insertion. In fact, the same sparse approach to modelling the use-def
relationships among the occurrences of a program variable can be used to model the
redundancy relationships among the different occurrences of a + b.
The algorithm that we present, named SSAPRE, performs PRE efficiently by
taking advantage of the use-def information inherent in its input Conventional SSA
Form. If an occurrence aj + bj is redundant with respect to ai + bi , SSAPRE
builds a redundancy edge that connects ai + bi to aj + bj . To expose potential
partial redundancies, Φ-functions are introduced at the dominance frontiers of the
occurrences, yielding the factored redundancy graph (FRG).3
3 Adhering to the SSAPRE convention, we use lower case φs in the SSA form of variables and upper case Φs in the factored redundancy graph of expressions.
There are three kinds of occurrences of the expression in the program: (real) the
occurrences in the original program, which we call real occurrences; (Φ-def) the
inserted Φs; and (Φ-use) the use operands of the Φs, which are regarded as
occurring at the ends of the direct predecessor blocks of their corresponding edges.
During the visitation in Renaming, a Φ is always given a new version. For a non-
Φ, i.e., cases (real) and (Φ-use), we check the current version of every variable in
the expression (the version on the top of each variable’s renaming stack) against the
version of the corresponding variable in the occurrence on the top of the expression’s
renaming stack. If all the variable versions match, we assign it the same version as
the top of the expression’s renaming stack. If one of the variable versions does not
match, for case (real), we assign it a new version, as in the example of Fig. 11.4a;
for case (Φ-use), we assign the special class ⊥ to the Φ-use to denote that the value
of the expression is unavailable at that point, as in the example of Fig. 11.4b. If a
new version is assigned, we push the version on the expression stack.
The FRG captures all the redundancies of a + b in the program. In fact, it contains
just the right amount of information for determining the optimal code placement.
Because strictly partial redundancies can only occur at the Φ-nodes, insertions for
PRE only need to be considered at the Φs.
The objective of SSAPRE is to find a placement that satisfies the following four
criteria, in this order:
– Correctness: X is fully available at all the original computation points.
– Safety: There is no insertion of X on any path that did not originally contain X.
– Computational optimality: No other safe and correct placement can result in
fewer computations of X on any path from entry to exit in the program.
– Lifetime optimality: Subject to computational optimality, the live range of the
temporary introduced to store X is minimized.
Each occurrence of X at its original computation point can be qualified with
exactly one of the following attributes: (1) fully redundant; (2) strictly partially
redundant; (3) non-redundant.
As a code placement problem, SSAPRE follows the same two-step process used
in all PRE algorithms. The first step determines the best set of insertion points that
render fully redundant as many strictly partially redundant occurrences as possible.
The second step deletes fully redundant computations, taking into account the
effects of the inserted computations. As we consider this second step to be well
understood, the challenge lies in the first step for coming up with the best set of
insertion points. The first step will tackle the safety, computational optimality, and
lifetime optimality criteria, while the correctness criterion is delegated to the second
step. For the rest of this section, we only focus on the first step for finding the best
insertion points, which is driven by the strictly partially redundant occurrences.
We assume that all critical edges in the control-flow graph have been removed
by inserting empty basic blocks at such edges (see Algorithm 3.5). In the SSAPRE
approach, insertions are only performed at Φ-uses. When we say a Φ is a candidate
for insertion, it means we will consider insertions at its use operands to render X
available at the entry to the basic block containing that Φ. An insertion at a Φ-use
means inserting X at the incoming edge corresponding to that operand. In reality,
the actual insertion is done at the end of the direct predecessor block.
As we have pointed out at the end of Sect. 11.1, insertions only need to be considered
at the Φs. The safety criterion implies that we should only insert at Φs where X
is downsafe (fully anticipated). Thus, we perform data-flow analysis on the FRG
to determine the downsafe attribute for Φs. Data-flow analysis can be performed
with linear complexity on SSA graphs, which we illustrate with the Downsafety
computation.
A Φ is not downsafe if there is a control-flow path from that Φ along which
the expression is not computed before program exit or before being altered by the
redefinition of one of its variables. Except for loops with no exit, this can only
happen in one of the following cases: (dead) there is a path to exit or an alteration
of the expression along which the result version is not used; or (transitive) the
result version appears as the operand of another Φ that is not downsafe. Case
(dead) represents the initialization for our backward propagation of ¬downsafe;
all other Φs are initially marked downsafe. The Downsafety propagation is based
on case (transitive). Since a real occurrence of the expression blocks the case
(transitive) propagation, we define a has_real_use flag attached to each Φ operand
and set this flag to true when the operand is defined by another Φ and the path
from its defining Φ to its appearance as a Φ operand crosses a real occurrence.
The propagation of ¬downsafe is blocked whenever the has_real_use flag is true.
Algorithm 11.1 gives the Downsafety propagation algorithm. The initialization of the
has_real_use flags is performed in the earlier Renaming phase.
8 Function Reset_downsafe(X)
9 if def(X) is not a Φ then return
10 f ← def(X)
11 if not downsafe(f ) then return
12 downsafe(f ) ← false
13 foreach operand ω of f do
14 if not has_real_use (ω) then Reset_downsafe(ω)
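A self-contained Python rendering of this propagation, under assumed data structures (each Φ carries a downsafe flag and its operand list; each operand carries a has_real_use flag and a reference to its defining Φ, if any):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Operand:
    has_real_use: bool = False
    def_phi: Optional["Phi"] = None    # the Phi defining this operand, if any

@dataclass
class Phi:
    downsafe: bool = True              # case (dead) seeds this to False
    operands: List[Operand] = field(default_factory=list)

def reset_downsafe(w):
    f = w.def_phi
    if f is None or not f.downsafe:
        return
    f.downsafe = False                 # case (transitive)
    for w2 in f.operands:
        if not w2.has_real_use:        # a real occurrence blocks propagation
            reset_downsafe(w2)

def propagate_downsafety(phis):
    for f in phis:
        if not f.downsafe:             # start from the ¬downsafe seeds
            for w in f.operands:
                if not w.has_real_use:
                    reset_downsafe(w)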
At this point, we have eliminated the unsafe Φs based on the safety criterion.
Next, we want to identify all the Φs that are possible candidates for insertion, by
disqualifying Φs that cannot be insertion candidates in any computationally optimal
placement. An unsafe Φ can still be an insertion candidate if the expression is fully
available there, though the inserted computation will itself be fully redundant. We
define the can_be_avail attribute for the current step, whose purpose is to identify
the region where, after appropriate insertions, the computation can become fully
available. A Φ is ¬can_be_avail if and only if inserting there violates computational
optimality. The can_be_avail attribute can be viewed as the disjunction can_be_avail(Φ) = downsafe(Φ) ∨ avail(Φ), where avail denotes full availability.
We could compute the avail attribute separately using the full availability
analysis, which involves propagation in the forward direction with respect to the
control-flow graph. But this would have performed some useless computation
because we do not need to know its values within the region where the Φs are
downsafe. Thus, we choose to compute can_be_avail directly by initializing a Φ
to be ¬can_be_avail if the Φ is not downsafe and one of its operands is ⊥. In the
propagation phase, we propagate ¬can_be_avail forward when a ¬downsafe Φ has
an operand that is defined by a ¬can_be_avail Φ and that operand is not marked
has_real_use.
After can_be_avail has been computed, computational optimality could be
fulfilled simply by performing insertions at all the can_be_avail Φs. In this case, full
redundancies would be created among the insertions themselves, but the subsequent
full redundancy elimination step would remove any fully redundant inserted or non-
inserted computation. This would leave the earliest computations as the optimal
code placement.
We illustrate our discussion in this section with the example of Fig. 11.5, where
the program exhibits partial redundancy that cannot be removed by safe code
motion. The two Φs with their computed data-flow attributes are as shown. If
insertions were based on can_be_avail, a + b would have been inserted at the
exits of blocks 4 and 5 due to the Φ in block 6, which would have resulted in
unnecessary code motion; the later attribute is computed to suppress such insertions.
Fig. 11.5 Example to show the need for the later attribute
If we ignore the safety requirement of PRE discussed in Sect. 11.2, the resulting
code motion will involve speculation. Speculative code motion suppresses redun-
dancy in some paths at the expense of another path where the computation is added
but its result is unused. As long as the paths that are burdened with more computations
are executed less frequently than the paths where the redundant computations are
avoided, a net gain in program performance can be achieved. Thus, speculative code
motion should only be performed when there are clues about the relative execution
frequencies of the paths involved.
Without profile data, speculative PRE can be conservatively performed by
restricting it to loop-invariant computations. Figure 11.6 shows a loop-invariant
computation a + b that occurs in a branch inside the loop. This loop-invariant
code motion is speculative because, depending on the branch condition inside the
loop, it may be executed zero times, while moving it to the loop header causes it
to execute once. This speculative loop-invariant code motion is profitable unless the
path inside the loop containing the expression is never taken, which is usually not
the case. When performing SSAPRE, marking Φs located at the start of loop bodies
as downsafe will effect speculative loop-invariant code motion.
Computations such as indirect loads and divides are called dangerous computa-
tions because they may generate a fault. Dangerous computations in general should
not be speculated. As an example, if we replace the expression a + b in Fig. 11.6 by
a/b and the speculative code motion is performed, it may cause a runtime divide-
by-zero fault after the speculation because b can be 0 at the loop header, while it is
never 0 in the branch that contains a/b inside the loop body.
Dangerous computations are sometimes protected by tests (or guards) placed
in the code by the programmers or automatically generated by language compilers
such as those for Java. When such a test occurs in the program, we say the dangerous
computation is safety-dependent on the control-flow point that establishes its safety.
At the points in the program where its safety dependence is satisfied, the dangerous
instruction is fault-safe and can still be speculated.
We can represent safety dependencies as value dependencies in the form of
abstract τ variables. Each successful runtime test defines a τ variable on its fall-
through path. During SSAPRE, we attach these τ variables as additional operands
to the dangerous computations related to the test. The τ variables are also put
into SSA form, so their definitions can be found by following the use-def chains.
The definitions of the τ variables have abstract right-hand-side values that are not
allowed to be involved in any optimization. Because they are abstract, they are also
omitted in the generated code after the SSAPRE phase. A dangerous computation
can be defined to have more than one τ operand, depending on its semantics.
When all its τ operands have definitions, it means the computation is fault-safe;
otherwise, it is unsafe to speculate. By taking the τ operands into consideration,
speculative PRE automatically honors the fault-safety of dangerous computations
when it performs speculative code motion.
In Fig. 11.7, the program contains a non-zero test for b. We define an additional
τ operand for the divide operation in a/b in SSAPRE to provide the information
about whether a non-zero test for b is available. At the start of the region guarded
by the non-zero test for b, the compiler inserts the definition of τ1 with the abstract
right-hand-side value τ-edge. Any appearance of a/b in the region guarded by the
non-zero test for b will have τ1 as its τ operand. Having a defined τ operand allows
a/b to be freely speculated in the region guarded by the non-zero test, while the
definition of τ1 prevents any hoisting of a/b past the non-zero test.
Variables and most data in programs normally start out residing in memory. It is the
compiler’s job to promote those memory contents to registers as much as possible
to speed up program execution. Load and store instructions have to be generated to
transfer contents between memory locations and registers. The compiler also has to
deal with the limited number of physical registers and find an allocation that makes
the best use of them. Instead of solving these problems all at once, we can tackle
them as two smaller problems separately:
1. Register promotion—We assume there is an unlimited number of registers, called
pseudo-registers (also called symbolic registers, virtual registers, or tempo-
raries). Register promotion will allocate variables to pseudo-registers whenever
possible and optimize the placement of the loads and stores that transfer their
values between memory and registers.
2. Register allocation (see Chap. 22)—This phase will fit the unlimited number of
pseudo-registers to the limited number of real or physical registers.
In this chapter, we only address the register promotion problem because it can be
cast as a redundancy elimination problem.
Variables with no aliases are trivial register promotion candidates. They include the
temporaries generated during PRE to hold the values of redundant computations.
Variables in the program can also be determined via compiler analysis or by
language rules to be alias-free. For these trivial candidates, one can rename them
to unique pseudo-registers, and no load or store needs to be generated.
Our register promotion is mainly concerned with scalar variables that have
aliases, with indirectly accessed memory locations, and with constants. A scalar variable can
have aliases whenever its address is taken, or if it is a global variable, since it can
then be accessed by function calls. A constant value is a register promotion candidate
whenever some operations using it have to refer to it through register operands.
Since the goal of register promotion is to obtain the most efficient placement for
loads and stores, register promotion can be modelled as two separate problems: PRE
of loads, followed by PRE of stores. In the case of constant values, our use of the
term load will extend to referring to the operation performed to put the constant
value in a register. The PRE of stores does not apply to constants.
From the point of view of redundancy, loads behave like expressions: the later
occurrences are the ones to be deleted. For stores, the reverse is true: as illustrated
in the examples of Fig. 11.8, the earlier stores are the ones to be deleted. The PRE
of stores, also called partial dead code elimination, can thus be treated as the dual
of the PRE of loads. Thus, performing PRE of stores has the effect of moving
stores forward while inserting them as early as possible. Combining the effects of
the PRE of loads and stores results in optimal placements of loads and stores while
minimizing the live ranges of the pseudo-registers, by virtue of the computational
and lifetime optimality of our PRE algorithm.
PRE applies to any computation, including loads from memory locations or creation
of constants. In program representations, loads can be either indirect through a
pointer or direct. Indirect loads are automatically covered by the PRE of expressions.
Direct loads correspond to scalar variables in the program, and since our input
program representation is in HSSA form, the aliasing that affects the scalar variables
is completely modelled by the χ and μ functions. In our representation, both direct
loads and constants are leaves of the expression trees. When we apply SSAPRE to
direct loads, since the hypothetical temporary h can be regarded as the candidate
variable itself, the FRG corresponds somewhat to the variable’s SSA graph, so the
Φ-insertion step and the Rename step can be streamlined.
When working on the PRE of memory loads, it is important to also take into
account the stores, which we call l-value occurrences. A store of the form X ←
< expr > can be regarded as being made up of the sequence:
r ← < expr >
X←r
Fig. 11.10 Register promotion via load PRE followed by store PRE
In the example of Fig. 11.10, hoisting the load of A to the loop header does not involve speculation. The occurrence of A ← . . . causes r
to be updated by splitting the store into the two statements r ← . . . ; A ← r. In the
PRE of stores (SPRE), speculation is needed to sink A ← . . . to outside the loop
because the store occurs in a branch inside the loop. Without performing LPRE first,
the load of A inside the loop would have blocked the sinking of A ← . . . .
As mentioned earlier, SPRE is the dual of LPRE. Code motion in SPRE will have the
effect of moving stores forward with respect to the control-flow graph. Any presence
of (aliased) loads has the effect of blocking the movement of stores or rendering the
earlier stores non-redundant.
To apply the dual of the SSAPRE algorithm, it is necessary to compute a program
representation that is the dual of the SSA form, the static single use (SSU) form
(see Chap. 13—SSU is a special case of SSI). In SSU, use-def edges are factored
at divergence points in the control-flow graph using σ -functions (see Sect. 13.1.4).
Each use of a variable establishes a new version (we say the load uses the version),
and every store reaches exactly one load.
We call our store PRE algorithm SSUPRE, which is made up of the corre-
sponding steps in SSAPRE. The insertion of σ -functions and renaming phases
constructs the SSU form for the variable whose store is being optimized. The
data-flow analyses consist of UpSafety to compute the upsafe (fully available)
attribute, together with the duals of the can_be_avail and later computations of SSAPRE.
Fig. 11.11 Example of program in SSU form and the result of applying SSUPRE
The PRE algorithm we have described so far is not capable of recognizing redundant
computations among lexically different expressions that yield the same value. In this
section, we discuss redundancy elimination based on value analysis.
The term value number originates from a hash-based method for recognizing when
two expressions evaluate to the same value within a basic block. The value number
of an expression tree can be regarded as the index of its hashed entry in the hash
table. An expression tree is hashed recursively bottom-up, starting with the leaf
nodes. Each internal node is hashed based on its operator and the value numbers
of its operands. The local algorithm for value numbering will conduct a scan down
the instructions in a basic block, assigning value numbers to the expressions. At
an assignment, the assigned variable will be assigned the value number of the right-
hand side expression. The assignment will also cause any value number that refers to
that variable to be killed. For example, the program code in Fig. 11.12a will result
in the value numbers v1 , v2 , and v3 shown in Fig. 11.12b. Note that variable c is
involved with both value numbers v2 and v3 because it has been redefined.
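To make the bookkeeping concrete, here is a toy Python sketch of local hash-based value numbering over three-address instructions; the tuple encoding and names are invented for illustration:

def local_value_number(block):
    # block: list of (dest, op, src1, src2), processed in order;
    # sources are variable names or constants
    table = {}                    # hashed expression -> value number
    var_vn = {}                   # variable -> current value number
    counter = 0

    def vn_of(x):
        nonlocal counter
        if x in var_vn:           # variable with a known value number
            return var_vn[x]
        key = ('leaf', x)         # constant or incoming variable
        if key not in table:
            counter += 1
            table[key] = counter
        return table[key]

    for dest, op, s1, s2 in block:
        key = (op, vn_of(s1), vn_of(s2))
        if key not in table:      # first time this value is computed
            counter += 1
            table[key] = counter
        # rebinding dest plays the role of the kill in this sketch:
        # expressions built from dest's old value can no longer be
        # reached through var_vn, so stale numbers are never reused
        var_vn[dest] = table[key]
    return var_vn

vn = local_value_number([('a', '+', 'x', 'y'), ('b', '+', 'x', 'y')])
assert vn['a'] == vn['b']         # a and b share one value number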
SSA form enables value numbering to be easily extended to the global scope,
called global value numbering (GVN), because each SSA version of a variable
corresponds to at most one static value for the variable. In the example of Fig. 11.13,
a traversal along any topological ordering of the SSA graph can be used to assign
value numbers to variables. One subtlety is regarding the φ-functions. When we
value number a φ-function, we would like the value numbers for its use operands
to have been determined already. One strategy is to perform the global value
numbering by visiting the nodes in the control-flow graph in a reverse post-order
traversal of the dominator tree. This traversal strategy can minimize the instances
when a φ-use has an unknown value number, which arises only in the case of back
edges from loops. When this arises, we have no choice but to assign a new value
number to the variable defined by the φ-function. For example, in the following
loop:
1 i1 ← 0
2 j1 ← 0
3 while <cond> do
4 i2 ← φ(i3 , i1 )
5 j2 ← φ(j3 , j1 )
6 i3 ← i2 + 4
7 j3 ← j2 + 4
When we try to hash a value number for either of the two φs, the value numbers for
i3 and j3 are not yet determined. As a result, we create different value numbers for
i2 and j2 . This makes the above algorithm unable to recognize that i2 and j2 can be
given the same value number, or that i3 and j3 can be given the same value number.
The above hash-based value numbering algorithm can be regarded as pessimistic,
because it will not assign the same value number to two different expressions unless
it can prove they compute the same value. There exists a different approach (see
Sect. 11.6 for references) to performing value numbering that is not hash-based
and is optimistic. It does not depend on any traversal over the program’s flow
of control and so is not affected by the presence of back edges. The algorithm
partitions all the expressions in the program into congruence classes. Expressions
in the same congruence class are considered equivalent because they evaluate to the
same static value. The algorithm is optimistic because when it starts, it assumes all
expressions that have the same operator to be in the same congruence class. Given
two expressions within the same congruence class, if their operands at the same
operand position belong to different congruence classes, the two expressions may
compute to different values and thus should not be in the same congruence class.
This is the subdivision criterion. As the algorithm iterates, the congruence classes
are subdivided into smaller ones, while the total number of congruence classes
increases. The algorithm terminates when no more subdivisions can occur. At this
point, the set of congruence classes in this final partition will represent all the values
in the program that we care about, and each congruence class is assigned a unique
value number.
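The partitioning itself can be sketched in a few lines of Python, assuming a toy encoding where each name maps to (operator, operand names) and each leaf carries its own identity as operator, so that distinct constants start in distinct classes:

def partition_value_numbers(defs):
    # optimistic start: one congruence class per operator
    classes = {}
    class_of = {}
    for name, (op, *args) in defs.items():
        classes.setdefault(op, set()).add(name)
        class_of[name] = op

    changed = True
    while changed:
        changed = False
        for cid in list(classes):
            members = classes[cid]
            if len(members) <= 1:
                continue
            groups = {}
            for n in members:
                op, *args = defs[n]
                sig = tuple(class_of[a] for a in args)
                groups.setdefault(sig, set()).add(n)
            if len(groups) > 1:          # subdivision criterion
                changed = True
                del classes[cid]
                for i, g in enumerate(groups.values()):
                    nid = (cid, i)
                    classes[nid] = g
                    for n in g:
                        class_of[n] = nid
    return class_of

# the loop example above: i and j end up congruent despite the back edge
defs = {
    'i1': (('const', 0),), 'j1': (('const', 0),),
    'c4': (('const', 4),),
    'i2': ('phi', 'i3', 'i1'), 'j2': ('phi', 'j3', 'j1'),
    'i3': ('add', 'i2', 'c4'), 'j3': ('add', 'j2', 'c4'),
}
cls = partition_value_numbers(defs)
assert cls['i2'] == cls['j2'] and cls['i3'] == cls['j3']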
While such a partition-based algorithm is not obstructed by the presence of back
edges, it does have its own deficiencies. Because it has to consider one operand
position at a time, it is not able to apply commutativity to detect more equivalences.
Since it is not applied bottom-up with respect to the expression tree, it is not able
to apply algebraic simplifications while value numbering. To get the best of both
the hash-based and the partition-based algorithms, it is possible to apply the two
algorithms independently and then combine their results together to shrink the final
set of value numbers.
So far, we have discussed finding computations that compute to the same values but
have not addressed eliminating the redundancies among them. Two computations
that compute to the same value exhibit redundancy only if there is a control-flow
path that leads from one to the other.
An obvious approach is to consider PRE for each value number separately. This
can be done by introducing, for each value number, a temporary that stores the
redundant computations. But value-number-based PRE has to deal with the issue of
how to generate an insertion. Because the same value can come from different forms
of expressions at different points in the program, it is necessary to determine which
152 F. Chow
form to use at each insertion point. If the insertion point is outside the live range
of any variable version that can compute that value, then the insertion point has
to be disqualified. Due to this complexity, and the expectation that strictly partial
redundancy is rare among computations that yield the same value, it seems to be
sufficient to perform only full redundancy elimination among computations that
have the same value number.
However, it is possible to broaden the scope and consider PRE among lexically
identical expressions and value numbers at the same time. In this hybrid approach,
it is best to relax our restriction on the style of program representation described
earlier in this chapter. By not requiring Conventional SSA Form, we can more effectively
represent the flow of values among the program variables. By considering the live
range of each SSA version to extend from its definition to program exit, we allow its
value to be used whenever convenient. The program representation can even be in
the form of triplets, in which the result of every operation is immediately stored in a
temporary. Each assignment then simply transfers the value number of its right-hand
side to the left-hand-side variable. This hybrid approach (GVN-PRE—see below) can be implemented
based on an adaptation of the SSAPRE framework. Since each φ-function in the
input can be viewed as merging different value numbers from the direct predecessor
blocks to form a new value number, the Φ-function insertion step will be driven by
the presence of φs for the program variables. Several FRGs can be formed, each
being regarded as a representation of the flow and merging of computed values.
Using each individual FRG, PRE can be performed by applying the remaining steps
of the SSAPRE algorithm.
The concept of partial redundancy was first introduced by Morel and Renvoise.
In their seminal work [201], Morel and Renvoise showed that global common
subexpressions and loop-invariant computations are special cases of partial redun-
dancy, and they formulated PRE as a code placement problem. The PRE algorithm
developed by Morel and Renvoise involves bidirectional data-flow analysis, which
incurs more overhead than unidirectional data-flow analysis. In addition, their
algorithm does not yield optimal results in certain situations. A better placement
strategy, called lazy code motion (LCM), was later developed by Knoop et
al. [170, 172]. It improved on Morel and Renvoise’s results by avoiding unnecessary
code movements, by removing the bidirectional nature of the original PRE data-
flow analysis and by proving the optimality of their algorithm. Since lazy code
motion was introduced, there have been alternative formulations of PRE algorithms
that achieve the same optimal results but differ in the formulation approach and
implementation details [98, 106, 217, 313].
The above approaches to PRE are all based on encoding program properties in
bit-vector forms and the iterative solution of data-flow equations. Since the bit-
vector representation uses basic blocks as its granularity, a separate algorithm is
11 Redundancy Elimination 153
needed to detect and suppress local common subexpressions. Chow et al. [73, 164]
came up with the first SSA-based approach to perform PRE. Their SSAPRE
algorithm is an adaptation of LCM that takes advantage of the use-def information
inherent in SSA. It avoids having to encode data-flow information in bit-vector
form and eliminates the need for a separate algorithm to suppress local common
subexpressions. Their algorithm was the first to make use of SSA to solve data-
flow problems for expressions in the program, taking advantage of SSA’s sparse
representation so that fewer steps are needed to propagate data-flow information.
The SSAPRE algorithm thus brings the many desirable characteristics of SSA-based
solution techniques to PRE.
In the area of speculative PRE, Murphy et al. [206] introduced the concept of
fault-safety and used it in the SSAPRE framework for the speculation of dangerous
computations. When execution profile data are available, it is possible to tailor the
use of speculation to maximize runtime performance for the execution that matches
the profile. Xue and Cai [312] presented a computationally and lifetime optimal
algorithm for speculative PRE based on profile data. Their algorithm uses bit-vector-based
data-flow analysis and applies minimum cut to flow networks formed
out of the control-flow graph to find the optimal code placement. Zhou et al. [317]
applied the minimum cut approach to flow networks formed out of the FRG in the
SSAPRE framework to achieve the same computational and lifetime optimal code
motion. They showed that their sparse approach based on SSA results in smaller flow
networks, enabling the optimal code placements to be computed more efficiently.
Lo et al. [187] showed that register promotion can be achieved by load placement
optimization followed by store placement optimization. Other optimizations can
potentially be implemented using the SSAPRE framework, for instance code
hoisting, register shrink-wrapping [70], and live range shrinking. Moreover, PRE
has traditionally provided the context for integrating additional optimizations into
its framework. They include operator strength reduction [171] and linear function
test replacement [163].
Hash-based value numbering originated from Cocke and Schwartz [77], and
Rosen et al. [249] extended it to global value numbering based on SSA. The
partition-based algorithm was developed by Alpern et al. [6]. Briggs et al. [49] pre-
sented refinements to both the hash-based and partition-based algorithms, including
applying the hash-based method in a post-order traversal of the dominator tree.
VanDrunen and Hosking proposed A-SSAPRE (anticipation-based SSAPRE)
which removes the requirement of Conventional SSA Form and is best for pro-
gram representations in the form of triplets [296]. Their algorithm determines
optimization candidates and constructs FRGs via a depth-first, pre-order traversal
over the basic blocks of the program. Within each FRG, non-lexically identical
expressions are allowed, as long as there are potential redundancies among them.
VanDrunen and Hosking [297] subsequently presented GVN-PRE (Value-based
Partial Redundancy Elimination), which is claimed to subsume both PRE and GVN.
Part III
Extensions
Chapter 12
Introduction

Vivek Sarkar (Georgia Institute of Technology, Atlanta, GA, USA) and Fabrice Rastello (Inria, Grenoble, France)
So far, we have introduced the foundations of SSA form and its use in different
program analyses. We now explain the need for extensions to SSA form to enable
a larger class of program analyses. The extensions arise from the fact that many
analyses need to make finer-grained distinctions between program points and data
accesses than what can be achieved by vanilla SSA form. However, these richer
flavours of extended SSA-based analyses still retain many of the benefits of SSA
form (e.g., sparse data-flow propagation) which distinguish them from classical
data-flow frameworks for the same analysis problems.
The sparseness in vanilla SSA form arises from the observation that information for
an unaliased scalar variable can be safely propagated from its (unique) definition
to all its reaching uses without examining any intervening program points. As
an example, SSA-based constant propagation aims to compute, for each single-assignment
variable, the (usually over-approximated) set of possible values carried
by the definition of that variable. For instance, consider an instruction that defines a
variable a and uses two variables, b and c. An example is a = b+c. In an SSA-form
program, constant propagation will determine if a is a constant by looking directly
at the definition point of b and at the definition point of c. We say that information
is propagated sparsely: it flows directly along the def-use chains rather than through
every program point of the CFG.
So as to expose parallelism and locality, one needs to get rid of the CFG at
some point. For loop transformations and software pipelining, there is a need to
manipulate a higher degree of abstraction to represent the iteration space of nested
loops and to extend data-flow information to this abstraction. One can expose even
more parallelism (at the level of instructions) by replacing control flow by control
dependencies: the goal is either to express a predicate expression under which a
given basic block is to be executed or to select afterwards (using similar predicate
expressions) the correct value among a set of eagerly computed ones.
1. Technically, we say that SSA provides data flow (data dependencies). The goal
is to enrich it with control dependencies. The program dependence graph (PDG)
constitutes the basis of such IR extensions. Gated single assignment (gated SSA,
GSA) mentioned below provides an interpretable (data- or demand-driven) IR
that uses this concept. Psi-SSA (ψ-SSA) also mentioned below is a very similar
IR but more appropriate to code generation for architectures with predication.
2. Note that such extensions sometimes face difficulties handling loops correctly
(need to avoid deadlock between the loop predicate and the computation of the
loop body, replicate the behaviour of infinite loops, etc.). However, we believe
that, as we will illustrate further, loop carried control dependencies complicate
the recognition of possible loop transformations: it is usually better to represent
loops and their corresponding iteration space using a dedicated abstraction.
As already mentioned, one of the strengths of SSA form is its associated data-flow
graph (DFG), the SSA graph that is used to propagate directly the information along
the def-use chains. This is what makes data-flow analysis sparse. By combining
the SSA graph with the control-flow graph, static analysis can be made context-
sensitive. This can be done in a more natural and powerful way by incorporating in
a unified representation both the data-flow and control-flow information.
The program dependence graph (PDG) adds to the data dependence edges (SSA
graph as the data dependence graph—DDG) the control dependence edges (control
dependence graph—CDG). As already mentioned, one of the main motivations for
the development of the PDG was to aid automatic parallelization of instructions
across multiple basic blocks. However, in practice, it also exposes the relationship
between the control predicates and their related control-dependent instructions, thus
allowing us to propagate the associated information. A natural way to represent this
relationship is through the use of gating functions that are used in some extensions
such as the gated SSA (GSA) or the value state dependence graph (VSDG) .
Gating functions are directly interpretable versions of φ-nodes. As an example,
φif (P , v1 , v2 ) can be interpreted as a function that selects the value v1 if predicate
P evaluates to true and the value v2 otherwise. PDG, GSA, and VSDG are described
in Chap. 14.
ψ-SSA form (Chap. 15) addresses the need for modelling Static Single Assignment
form in predicated operations. A predicated operation is an alternate representation
of a fine-grained control-flow structure, often obtained by using the well-known
if-conversion transformation (see Chap. 20). A key advantage of using predicated
operations in a compiler’s intermediate representation is that it can often enable
more optimizations by creating larger basic blocks compared to approaches in
which predicated operations are modelled as explicit control-flow graphs. From
an SSA-form perspective, the challenge is that a predicated operation may or may
not update its definition operand, depending on the value of the predicate guarding
that assignment. This challenge is addressed in ψ-SSA form by introducing ψ-
functions that perform merge functions for predicated operations, analogous to the
merge performed by φ-functions at join points in vanilla SSA form.
In general, a ψ-function has the form a0 = ψ (p1 ?a1 , ..., pi ?ai , ..., pn ?an ),
where each input argument ai is associated with a predicate pi , as in a nested if-
then-else expression for which the value returned is the rightmost argument whose
predicate is true. Observe that the semantics of a ψ-function requires the logical dis-
junction of its predicates to evaluate to true. A number of algebraic transformations
can be performed on ψ-functions, including ψ-inlining, ψ-reduction, ψ-projection,
ψ-permutation, and ψ-promotion. Chapter 15 also includes an algorithm for
transforming a program out of Psi-SSA form that extends the standard algorithm
for destruction of vanilla SSA form.
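As a minimal illustration, assuming a hypothetical predicated instruction syntax
p? op (the exact notation of Chap. 15 may differ), the conditional assignment

    if p then a ← x + 1 else a ← y

becomes, after if-conversion and ψ-SSA construction,

    p?  a1 ← x + 1
    ¬p? a2 ← y
    a3 ← ψ(p?a1, ¬p?a2)

Since p ∨ ¬p always evaluates to true, the disjunction requirement is met, and a3
plays exactly the role that a φ-function would play at the join point of the original
control-flow structure.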
The motivation for SSI form arose from the need to perform additional renaming
to distinguish among different uses of an SSA variable. Hashed SSA (HSSA) form,
introduced in Chap. 16, addresses another important requirement, viz., the need to
model aliasing among variables. For example, a static use or definition of indirect
memory access ∗p in the C language could represent the use or definition of multiple
local variables whose addresses can be taken and may potentially be assigned
to p along different control-flow paths. To represent aliasing of local variables,
HSSA form extends vanilla SSA form with MayUse (μ) and MayDef (χ ) functions
to capture the fact that a single static use or definition could potentially impact
multiple variables. Note that MayDef functions can result in the creation of new
names (versions) of variables, compared to vanilla SSA form. HSSA form does
not take a position on the accuracy of alias analysis that it represents. It is capable
of representing the output of any alias analysis performed as a pre-pass to HSSA
construction. As summarized above, a major concern with HSSA form is that its size
could be quadratic in the size of the vanilla SSA form, since each use or definition
can now be augmented by a set of MayUse’s and MayDef’s, respectively. A heuristic
approach to dealing with this problem is to group together all variable versions that
have no “real” occurrence in the program, i.e., do not appear in a real instruction
outside of a φ, μ, or χ function. These versions are grouped together into a single
version called the zero version of the variable.
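As a small sketch, assume an alias analysis has reported that p may point to the
local variable x (the variable names here are illustrative):

    x1 ← 2
    ∗p ← 3        x2 ← χ(x1)      (the indirect store may define x)
    ⋯ ← ∗p        μ(x2)           (the indirect load may use x)
    y1 ← x2

The χ function creates the new version x2 because the store through p may
overwrite x, and the subsequent direct use of x refers to x2. If x2 appeared only
inside φ, μ, or χ functions, it would be grouped into the zero version of x.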
In addition to aliasing of locals, it is important to handle the possibility of
aliasing among heap-allocated variables. For example, ∗p and ∗q may refer to the
same location in the heap, even if no aliasing occurs among local variables. HSSA
form addresses this possibility by introducing a virtual variable for each address
expression used in an indirect memory operation and renaming virtual variables
with φ-functions as in SSA form. Further, the alias analysis pre-pass is expected to
provide information on which virtual variables may potentially be aliased, thereby
leading to the insertion of μ or χ functions for virtual variables as well. Global value
numbering is used to increase the effectiveness of the virtual variable approach,
since all indirect memory accesses with the same address expression can be merged
into a single virtual variable (with SSA renaming as usual). In fact, the Hashed
SSA name in HSSA form comes from the use of hashing in most value-numbering
algorithms.
In contrast to HSSA form, Array SSA form (Chap. 17) takes an alternate approach
to modelling aliasing of indirect memory operations by focusing on aliasing in
arrays as its foundation. The aliasing problem for arrays is manifest in the fact that
accesses to elements A[i] and A[j ] of array A refer to the same location when
i = j . This aliasing can occur with just local array variables, even in the absence
of pointers and heap-allocated data structures. Consider a program with a definition
of A[i] followed by a definition of A[j ]. The vanilla SSA approach can be used to
rename these two definitions to (say) A1 [i] and A2 [j ]. The challenge with arrays
arises when there is a subsequent use of A[k]. For scalar variables, the reaching
definition for this use can be uniquely identified in vanilla SSA form. However,
for array variables, the reaching definition depends on the subscript values. In this
example, the reaching definition for A[k] will be A2 or A1 if k == j or k == i,
respectively (or a prior definition A0 if k ≠ j and k ≠ i). To provide A[k] with a single
reaching definition, Array SSA form introduces a definition-φ (dφ) operator that
represents the merge of A2[j] with the prevailing value of array A prior to A2.
The result of this dφ operator is given a new name, A3 (say), which serves as the
single definition that reaches the use A[k] (which can then be renamed to A3[k]). This
extension enables sparse data-flow propagation algorithms developed for vanilla
SSA form to be applied to array variables, as illustrated by the algorithm for sparse
constant propagation of array elements presented in this chapter. The accuracy of
analyses for Array SSA form depends on the accuracy with which pairs of array
subscripts can be recognized as being definitely same (DS ) or definitely different
(DD).
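The following sketch summarizes this construction on the example just discussed:

    A1[i] ← ⋯
    A2[j] ← ⋯
    A3 ← dφ(A2, A1)
    ⋯ ← A3[k]

The dφ operator merges the newly defined element A2[j] with the prevailing array
value A1, so the use A3[k] has a single reaching definition. If the subscript analysis
proves DD(k, j), the value of A3[k] must come from A1; if it proves DS(k, j), it must
come from A2[j].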
To model heap-allocated objects, Array SSA form builds on the observation that
all indirect memory operations can be modelled as accesses to elements of abstract
arrays that represent disjoint subsets of the heap. For modern object-oriented
languages such as Java, type information can be used to obtain a partitioning of the
heap into disjoint subsets, e.g., instances of field x are guaranteed to be disjoint from
instances of field y. In such cases, the set of instances of field x can be modelled as
a logical array (map) H^x that is indexed by the object reference (key). The problem
of resolving aliases among field accesses p.x and q.x then becomes equivalent to
the problem of resolving aliases among array accesses H^x[p] and H^x[q], thereby
enabling Array SSA form to be used for analysis of objects as in the algorithm
for redundant load elimination among object fields presented in this chapter. For
weakly typed languages such as C, the entire heap can be modelled as a single heap
array. As in HSSA form, an alias analysis pre-pass can be performed to improve the
accuracy of definitely same and definitely different information. In particular, global
value numbering is used to establish definitely same relationships, analogous to its
use in establishing equality of address expressions in HSSA form. In contrast to
HSSA form, the size of Array SSA form is guaranteed to be linearly proportional
to the size of a comparable scalar SSA form for the original program (if all array
variables were treated as scalars for the purpose of the size comparison).
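As a sketch of this modelling, with H^x_0 denoting the (hypothetical) prevailing
value of the logical array of field x before the store, a store and a load would be
rewritten as follows:

    p.x ← v    becomes    H^x_1[p] ← v ;  H^x_2 ← dφ(H^x_1, H^x_0)
    w ← q.x    becomes    w1 ← H^x_2[q]

If global value numbering establishes that p and q are definitely same, then
H^x_2[q] is known to hold the stored value v, and the load can be replaced by a
reuse of v.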
Without the use of φ-nodes, the number of def-use chains required to link the
assignments to their uses would be quadratic (4 here). Hence the usefulness of
generalizing SSA and its φ-nodes for scalars to handle memory accesses in sparse
analyses. HSSA (see Chap. 16) and Array SSA (see Chap. 17) are two different
implementations of this idea. One has to admit that while this early combination is
well suited for analysis or interpretation, the introduction of a φ-function might add a
control dependence to an instruction that would not exist otherwise. In other words,
only simple loop carried dependencies can be expressed this way. Let us illustrate
this point using a simple example:
for i do
A[i] ← f (A[i − 2])
Under such a memory SSA form, a φ-node placed at the loop header merges the
value of A flowing from the loop entry with the one produced by the previous
iteration, and information such as A2 ≥ 0 can easily be propagated along it. On the
other hand, by adding this φ-node, it becomes difficult to devise that iterations i and
i + 1 can be executed in parallel: the φ-node adds a loop carried dependence. If
you are interested in performing loop transformations that are more sophisticated
than just exposing fully parallel loops (such as loop interchange, loop tiling, or
multidimensional software pipelining), then (dynamic) single assignment forms
should be your friend. Many formalisms implement this idea, including Kahn
process networks (KPN) and Fuzzy Data-flow Analysis (FADA), but each comes
with its own restrictions. This is part of the huge research area of automatic
parallelization, which is outside the scope of this book. For further details, we refer
the reader to the corresponding entries of the Encyclopedia of Parallel Computing [216].
Chapter 13
Static Single Information Form
The objective of a data-flow analysis is to discover facts that are true about a
program. We call such facts information. Using the notation introduced in Chap. 8,
information is an element in the data-flow lattice. For example, the information
that concerns liveness analysis is the set of variables alive at a certain program
point. Similarly to liveness analysis, many other classical data-flow approaches
bind information to pairs formed by a variable and a program point. However, if
an invariant occurs for a variable v at any program point where v is alive, then we
can associate this invariant directly with v. If the intermediate representation of a
program guarantees this correspondence between information and variable for every
variable, then we say that the program representation provides the Static Single
Information (SSI) property.
In Chap. 8 we have shown how the SSA form allows us to solve sparse
forward data-flow problems such as constant propagation. In the particular case of
constant propagation, the SSA form lets us assign to each variable the invariant—or
information—of being constant or not. The SSA intermediate representation gives
us this invariant because it splits the live ranges of variables in such a way that each
variable name is defined only once. Now we will show that live range splitting can
also provide the SSI property not only to forward but also to backward data-flow
analyses.
Different data-flow analyses might extract information from different program
facts. Therefore, a program representation may provide the SSI property to some
data-flow analyses but not to all of them. For instance, the SSA form naturally
provides the SSI property to the reaching definition analysis. Indeed, the SSA form
provides the static single information property to any data-flow analysis that obtains
information at the definition sites of variables. These analyses and transformations
include copy and constant propagation, as illustrated in Chap. 8. However, for a
data-flow analysis that derives information from the use sites of variables, such as
the class inference analysis that we will describe in Sect. 13.1.6, the information
associated with a variable might not be unique along its entire live range even under
SSA: In that case the SSA form does not provide the SSI property.
There are extensions of the SSA form that provide the SSI property to more data-
flow analyses than the original SSA does. Two classic examples—detailed later—
are the Extended-SSA (e-SSA) form and the Static Single Information (SSI) form.
The e-SSA form provides the SSI property to analyses that take information from
the definition site of variables, and also from conditional tests where these variables
are used. The SSI form provides the static single information property to data-flow
analyses that extract information from the definition sites of variables and from the
last use sites (which we define later). These different intermediate representations
rely on a common strategy to achieve the SSI property: live range splitting. In this
chapter we show how to use live range splitting to build program representations
that provide the static single information property to different types of data-flow
analyses.
The goal of this section is to define the notion of Static Single Information, and
to explain how it supports the sparse data-flow analyses discussed in Chap. 8.
With this aim in mind, we revisit the concept of sparse analysis in Sect. 13.1.1.
There is a special class of data-flow problems, which we call Partitioned Lattice
per Variable (PLV), that fits into the sparse data-flow framework of this chapter
very well. We will look more carefully into these problems in Sect. 13.1.2. The
intermediate program representations discussed in this chapter provide the static
single information property—formalized in Sect. 13.1.3—to any PLV problem.
In Sect. 13.1.5 we give algorithms to solve sparsely any data-flow problem that
fulfils the SSI property. This sparse framework is very broad: Many well-known
data-flow problems are partitioned lattice per variable, as we will see in the examples
in Sect. 13.1.6.
Fig. 13.1 An example of a dense data-flow analysis that finds the range of possible values
associated with each variable at each program point: (a) original code; (b) CFG and program
points; (c) result of a dense implementation of range analysis
If the information associated with a variable is invariant along its entire live range,
a sparse implementation can instead replace all the constraint variables [v]^p by a
single constraint variable [v], for each variable v and every p ∈ live(v).
Although not every data-flow problem can be easily solved sparsely, many of
them can as they fit into the family of PLV problems described in the next section.
The non-relational data-flow analysis problems we are interested in are the ones
that bind information to pairs of program variables and program points. We refer
to this class of problems as Partitioned Lattice per Variable problems and formally
describe them as follows.
Definition 13.1 (PLV) Let V = {v1, . . . , vn} be the set of program variables. Let
us consider, without loss of generality, a forward data-flow analysis that searches
for a maximum. This data-flow analysis can be written as an equation system that
associates with each program point p an element x^p of a lattice L (the product of
one lattice per variable, L = L_{v1} × · · · × L_{vn}), given by the following equation:

    x^p = ⋀_{s ∈ directpreds(p)} F^{s→p}(x^s)

where x^p denotes the abstract state associated with program point p, and F^{s→p} is
the transfer function from direct predecessor s to p. The analysis can alternatively
be written as a constraint system that binds to each program point p and each s ∈
directpreds(p) the equation x^p = x^p ∧ F^{s→p}(x^s) or, equivalently, the inequation
x^p ⊑ F^{s→p}(x^s).
Note that not all data-flow analyses are PLV; for instance, problems dealing with
relational information, such as “i < j ?”, need to hold information on pairs
of variables.
If the information associated with a variable is invariant along its entire live range,
then we can bind this information to the variable itself. In other words, we can
replace all the constraint variables [v]^p by a single constraint variable [v], for each
variable v and every p ∈ live(v). Consider the problem of range analysis again.
There are two types of control-flow points associated with non-identity transfer
functions: definitions and conditionals. (1) At the definition point of variable v, F_v
simplifies to a function that depends only on some [u], where each u is an argument
of the instruction defining v; (2) at the conditional tests that use a variable v,
F_v can be simplified to a function that uses [v] and possibly other variables that
appear in the test. The other program points are associated with an identity transfer
function and can thus be ignored: [v]^p = [v]^p ∧ F_v^{s→p}([v1]^s, . . . , [vn]^s) simplifies
to [v]^p = [v]^p ∧ [v]^p, i.e., [v]^p = [v]^p. This gives the intuition of why a propagation
engine along the def-use chains of an SSA-form program can be used to solve the
constant propagation problem in an equivalent, yet “sparser,” manner.
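For instance, on the following SSA snippet, only the definition points carry non-
identity transfer functions, so sparse constant propagation manipulates one constraint
variable per SSA name rather than one per pair of variable and program point:

    x1 ← 3
    y1 ← x1 + 1
    z1 ← x1 + y1

The sparse system is simply [x1] = 3, [y1] = [x1] + 1, and [z1] = [x1] + [y1];
propagating along the def-use chains yields [x1] = 3, [y1] = 4, and [z1] = 7 without
ever materializing information at the intermediate program points.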
A program representation that fulfils the Static Single Information (SSI) property
allows us to attach the information to variables, instead of program points, and
needs to fulfil the following four properties: Split forces the information related to
a variable to be invariant along its entire live range; Info forces this information to
be irrelevant outside the live range of the variable; Link forces the def-use chains
to reach the points where information is available for a transfer function to be
evaluated; finally, Version provides a one-to-one mapping between variable names
and live ranges.
We now give a formal definition of the SSI and the four properties.
Property 1 (SSI) STATIC SINGLE INFORMATION: Consider a forward (resp. back-
ward) monotone PLV problem E_dense stated as a set of constraints

    [v]^p ⊑ F_v^{s→p}([v1]^s, . . . , [vn]^s)

for every variable v, each program point p, and each s ∈ directpreds(p) (resp.
s ∈ directsuccs(p)). A program representation fulfils the Static Single Information
property if and only if it fulfils the following four properties:

Split Let s be the unique direct predecessor (resp. direct successor) of a program
point where a variable v is live and such that F_v^{s→p} ≠ λx.⊥ is non-trivial, i.e., is
not the simple projection on L_v; then s should contain a definition (resp. last use)
of v. For (v, p) ∈ variables × progPoints, let (Y_v^p) be a maximum solution to
E_dense. Each program point p that has several direct predecessors (resp. direct
successors), and for which F_v^{s→p}(Y_{v1}^s, . . . , Y_{vn}^s) has different values on its
incoming edges (s → p) (resp. outgoing edges (p → s)), should have a φ-function
at the entry of p (resp. a σ-function at the exit of p) for v, as defined in the next
section.

Info Each program point p such that v ∉ live-out(p) (resp. v ∉ live-in(p))
should be bound to an undefined transfer function, i.e., F_v^p = λx.⊥.

Link Each instruction inst for which F_v^inst depends on some [u]^s should contain
a use (resp. definition) of u that is live-in (resp. live-out) at inst.

Version For each variable v, its live range must be a single connected component,
so that there is a one-to-one mapping between variable names and live ranges.
We perform live range splitting via special instructions: the σ -functions and parallel
copies that, together with φ-functions, create new definitions of variables. These
notations are important elements of the propagation engine described in the section
that follows. In short, a σ -function (for a branch point) is the dual of a φ-function
(for a join point), and a parallel copy is a copy that must be done in parallel with
another instruction. Each of these special instructions, φ-function, σ -functions, and
parallel copies, split live ranges at different kinds of program points: interior nodes,
branches, and joins.
Interior nodes are program points that have a unique direct predecessor and a
unique direct successor. At these points we perform live range splitting via copies.
If the program point already contains another instruction, then this copy must be
done in parallel with the existing instruction. The notation

    inst ∥ v′1 = v1 ∥ . . . ∥ v′m = vm

denotes m copies v′i = vi performed in parallel with the instruction inst. In a
forward analysis such as constant propagation, two distinct definitions of the same
variable v might be associated with two different constants, hence providing two
different pieces of information about v. To avoid these definitions reaching the same
use of v, we merge them at the earliest program point where they meet. We do it via
our well-known φ-functions.
In backward analyses the information that emerges from different uses of a
variable may reach the same branch point, which is a program point with a unique
direct predecessor and multiple direct successors. To ensure Property 1, the use
that reaches the definition of a variable must be unique, in the same way that in
an SSA form program the definition that reaches a use is unique. We ensure this
property via special instructions called σ -functions. The σ -functions are the dual
of φ-functions, performing a parallel assignment depending on the execution path
taken. The assignment

    (l^1: v_1^1, . . . , l^q: v_1^q) = σ(v1)  ∥ . . . ∥  (l^1: v_m^1, . . . , l^q: v_m^q) = σ(vm)

represents m σ-functions that assign to each variable v_i^j the value in vi if control
flows into block l^j. As with φ-functions, these assignments happen in parallel, i.e.,
the m σ-functions encapsulate m parallel copies. Also, note that variables live in
different branch targets are given different names by the σ-function.
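For instance, assuming a range analysis and a conditional branch at point l with
targets l^1 and l^2 (an illustrative scenario, not one of the figures of this chapter),
the live range of v would be split as follows:

    l : if (v < 10) goto l^1 else goto l^2
        (l^1: v1, l^2: v2) = σ(v)

Past the branch, v1 is live only on the path through l^1 and v2 only on the path
through l^2, so each name can carry its own information: here, that v1 < 10 holds
while v2 ≥ 10.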
Let us consider a unidirectional forward (resp. backward) PLV problem E_dense^ssi
stated as a set of constraints [v]^p ⊑ F_v^{s→p}([v1]^s, . . . , [vn]^s) (or, equivalently,
equations [v]^p = [v]^p ∧ F_v^{s→p}([v1]^s, . . . , [vn]^s)) for every variable v, each
program point p, and each s ∈ directpreds(p) (resp. s ∈ directsuccs(p)). To
simplify the discussion, any φ-function (resp. σ-function) is seen as a set of copies,
one per direct predecessor (resp. direct successor), which leads to many constraints.
In other words, a φ-function such as p : a = φ(a1 : l^1, . . . , am : l^m) gives us m
constraints such as

    [a]^p ⊑ F_a^{l^j→p}([a1]^{l^j}, . . . , [an]^{l^j})

which usually simplifies into [a]^p ⊑ [aj]^{l^j}. This last can be written equivalently
as the classical meet

    [a]^p ⊑ ⋀_{l^j ∈ directpreds(p)} [aj]^{l^j}

Similarly, a σ-function (l^1 : a1, . . . , l^m : am) = σ(a) at the exit of program
point p gives us m constraints, each of which usually simplifies into [aj]^{l^j} ⊑ [a]^p.
Given a program that fulfils the SSI property for E_dense^ssi and the set of transfer
functions F_v^s, we show here how to build an equivalent sparse constraint system.
Definition 13.2 (SSI Constrained System) Consider a program in SSI form that
gives us a constraint system associating with each variable v the constraints
[v]^p ⊑ F_v^{s→p}([v1]^s, . . . , [vn]^s). We define a system of sparse constraints E_sparse^ssi as
follows:
• For each instruction i that defines (resp. uses) a variable v, let a . . . z be the set
of used (resp. defined) variables. Because of the Link property, F_v^{s→p} (which we
will denote F_v^i from now on) depends only on some [a]^s . . . [z]^s. Thus, there exists
a function G_v^i defined on L_a × · · · × L_z such that F_v^i([v1], . . . , [vn]) =
G_v^i([a], . . . , [z]); the sparse system associates with i the constraint
[v] ⊑ G_v^i([a], . . . , [z]).
A pure equation system can be derived for Algorithm 13.1, while it cannot for
Algorithm 13.2. This stems from the asymmetry
of our SSI form that ensures (for practical purposes only, as we will explain soon)
the Static Single Assignment property but not the Static Single Use (SSU) property.
If we have several uses of the same variable, then the sparse backward constraint
system will have several inequations—one per variable use—with the same left-
hand side. Technically this is the reason why we manipulate a constraint system
(system with inequations) and not an equation system as in Chap. 8. Both systems
can be solved1 using a scheme known as chaotic iteration, such as the worklist
algorithm we provide here. The slight but important difference for a constraint
system, as opposed to an equation system, is that one needs to meet G_v^i(. . . ) with the
old value of [v] to ensure the monotonicity of the consecutive values taken by [v]. It
would still be possible to enforce the SSU property, in addition to the SSA property,
of our intermediate representation, at the expense of adding more φ-functions and
σ -functions. However, this guarantee is not necessary to every sparse analysis. The
dead-code elimination problem illustrates this point well: For a program under SSA
form, replacing G_v^i in Algorithm 13.1 by the property “i is a useful instruction
or one of the variables it defines is marked as useful” leads to the standard SSA-
based dead-code elimination algorithm. The sparse constraint system does have
several equations (one per variable use) for the same left-hand side (one for each
variable). It is not necessary to enforce the SSU property in this instance of dead-
code elimination, and doing so would lead to a less efficient solution in terms of
compilation time and memory consumption. In other words, a code under SSA form
fulfils the SSI property for dead-code elimination.
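The following pseudocode sketches such a chaotic-iteration worklist solver for a
sparse constraint system. It is only an illustration of the scheme discussed above
(the functions G_v^i are those of Definition 13.2); it is not a reproduction of
Algorithm 13.1 or 13.2:

    Function SparseSolve(sparse system with constraints of the form [v] ⊑ G_v^i(. . . ))
        foreach variable v do [v] ← ⊤
        worklist ← all constraints
        while worklist ≠ ∅ do
            extract a constraint [v] ⊑ G_v^i([a], . . . , [z]) from worklist
            new ← [v] ∧ G_v^i([a], . . . , [z])    (meeting with the old value ensures monotonicity)
            if new ≠ [v] then
                [v] ← new
                add to worklist every constraint whose right-hand side uses [v]

Termination follows from the finite height of the lattice: each [v] can only decrease,
and a constraint re-enters the worklist only when one of its arguments has changed.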
Fig. 13.2 Live range splitting on Fig. 13.1 and a solution to this instance of the range analysis
problem: (a) live range splitting used by a sparse implementation of range analysis; (b) sparse
constraint system and its solution
Fig. 13.3 Class inference analysis as an example of backward data-flow analysis that takes
information from the uses of variables
Since class inference is a backward analysis that extracts information from use sites, we split live ranges
using parallel copies at these program points and rely on σ -functions to merge
them back, as shown in Fig. 13.3c. The use-def chains that we derive from the
program representation lead naturally to a constraint system, shown in Fig. 13.3d,
where [vj ] denotes the set of methods associated with variable vj . A fixed point
to this constraint system is a solution to our data-flow problem. This instance of
class inference is a Partitioned Variable Problem (PVP),2 because the data-flow
information associated with a variable v can be computed independently from the
other variables.
Null-Pointer Analysis
The objective of null-pointer analysis is to determine which references may hold
null values. This analysis allows compilers to remove redundant null-exception
Fig. 13.4 Null-pointer analysis as an example of forward data-flow analysis that takes information
from the definitions and uses of variables (0 represents the fact that the pointer is possibly null,
0̸ if it cannot be): (a) object-oriented program that might invoke methods of null objects; (b) live
range splitting strategy used by a sparse implementation of null-pointer analysis; (c) constraints
that determine a solution for the sparse version of null-pointer analysis
tests and helps developers find null-pointer dereferences. Figure 13.4 illustrates this
analysis. Because information is produced not only at definition but also at use sites,
we split live ranges after each variable is used, as shown in Fig. 13.4b. For instance,
we know that v2 cannot be null, otherwise an exception would have been thrown
during the invocation v1 .m(); hence the call v2 .m() cannot result in a null-pointer
dereference exception. On the other hand, we notice in Fig. 13.4a that the state of
v4 is the meet of the state of v3 , definitely not-null, and the state of v1 , possibly null,
and we must conservatively assume that v4 may be null.
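Using illustrative labels in the spirit of Fig. 13.4 (the exact program of the figure is
not reproduced here), the essential constraints could be sketched as:

    l1 : v1.m() ∥ v2 = v1        [v2] = not-null    (v1 was dereferenced without raising an exception)
    l2 : v2.m() ∥ v3 = v2        [v3] = not-null
    l3 : v4 = φ(v3, v1)          [v4] = [v3] ∧ [v1] = possibly-null

The parallel copies materialize the knowledge gained at each use site, and the
φ-function conservatively meets the states flowing in from the two paths.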
In the previous section we have seen how the static single information property
gives the compiler the opportunity to solve a data-flow problem sparsely. However,
we have not yet seen how to convert a program to a format that provides the SSI
property. This is a task that we address in this section, via the three-step algorithm
from Sect. 13.2.2.
Fig. 13.5 Live range splitting strategies for different data-flow analyses. Defs (resp. Uses) denotes
the set of instructions that define (resp. use) the variable; Conds denotes the set of instructions that
apply a conditional test on a variable; Out(Conds) denotes the exits of the corresponding basic
blocks; LastUses denotes the set of instructions where a variable is used, and after which it is no
longer live
A live range splitting strategy for a variable v consists of a set P_v = I↓ ∪ I↑ of
program points: we let I↓ denote a set of points i with forward direction and,
similarly, I↑ denote a set of points i with backward direction. The live range
of v must be split at least at every point in P_v. Going back to the examples from
Sect. 13.1.6, we have the live range splitting strategies enumerated below. The list
in Fig. 13.5 gives further examples of live range splitting strategies. Corresponding
references are given in the last section of this chapter.
• Range analysis is a forward analysis that takes information from points where
variables are defined and conditional tests that use these variables. For instance,
in Fig. 13.1, we have Pi = {l1 , Out(l3 ), l4 }↓ where Out(li ) is the exit of li (i.e.,
the program point immediately after li ), and Ps = {l2 , l5 }↓ .
• Class inference is a backward analysis that takes information from the uses of
variables; thus, for each variable, the live range splitting strategy is characterized
by the set Uses↑ where Uses is the set of use points. For instance, in Fig. 13.3,
we have Pv = {l4 , l6 , l7 }↑ .
• Null-pointer analysis takes information from definitions and uses and propagates
this information forwards. For instance, in Fig. 13.4, we have Pv =
{l1 , l2 , l3 , l4 }↓ .
The algorithm SSIfy in Fig. 13.6 implements a live range splitting strategy
in three steps. Firstly, it splits live ranges, inserting new definitions of variables
into the program code. Secondly, it renames these newly created definitions, hence
ensuring that the live ranges of two different re-definitions of the same variable
do not overlap. Finally, it removes dead and non-initialized definitions from the
program code. We describe each of these phases in the rest of this section.
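In pseudocode, the driver implementing these three steps could be as simple as the
following sketch, where split, rename, and clean name the three phases just
described (this sketch only mirrors the textual description; the actual algorithm is
given in Fig. 13.6):

    Function SSIfy(v, P_v)
        split(v, P_v)      (insert σ-functions, φ-functions, and parallel copies)
        rename(v)          (give each new definition of v its own name)
        clean(v)           (remove dead and non-initialized definitions)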
Fig. 13.7 Live range splitting. In(l) denotes a program point immediately before l, and Out(l) a
program point immediately after l
The split algorithm in Fig. 13.7 first computes, for each variable v, the set of
program points where its live range must be split, and then actually inserts the new
definitions of v. These new definitions might be created by
σ-functions (due to Pv or to the splitting in lines 2–7); by φ-functions (due to Pv
or to the splitting in lines 8–13); or by parallel copies.
The rename algorithm in Fig. 13.8 builds def-use and use-def chains for a program
after live range splitting. This algorithm is similar to the classical algorithm used to
rename variables during the SSA construction that we saw in Chap. 3. To rename a
variable v we traverse the program’s dominance tree, from top to bottom, stacking
each new definition of v that we find. The definition currently on the top of the
stack is used to replace all the uses of v that we find during the traversal. If
the stack is empty, this means that the variable is not defined at this point. The
renaming process replaces the uses of undefined variables by ⊥ (see comment
of function stack.set_use). We have two methods, stack.set_use and
stack.set_def, that build the chains of relations between variables. Note that
sometimes we must rename a single use inside a φ-function, as in lines 16–17 of the
algorithm. For simplicity we consider this single use as a simple assignment when
calling stack.set_use, as can be seen in line 17. Similarly, if we must rename
a single definition inside a σ -function, then we treat it as a simple assignment, like
we do in lines 12–14 of the algorithm.
Fig. 13.9 Dead and undefined code elimination. Original instructions not inserted by split are
called actual instructions. defs(inst) denotes the set of variables defined by inst, and uses(inst)
denotes the set of variables used by inst
Just like Algorithm 3.7, the algorithm in Fig. 13.9 eliminates φ-functions and
parallel copies that define variables not actually used in the code. By way of
symmetry, it also eliminates σ -functions and parallel copies that use variables not
actually defined in the code. We mean by “actual” instructions those that already
existed in the program before we transformed it with split. In line 2, “web” is
fixed to the set of versions of v, so as to restrict the cleaning process to variable v,
as we see in the first two loops. The “active” set is initialized to actual instructions,
line 4. Then, during the first loop in lines 5–8, we augment it with φ-functions,
σ -functions, and copies that can reach actual definitions through use-def chains.
The corresponding version of v is hence marked as defined (line 8). The next loop,
lines 11–14, performs a similar process, this time to add to the active set instructions
that can reach actual uses through def-use chains. The corresponding version of v is
then marked as used (line 14). Each non-live variable, i.e., either undefined or dead
(non-used), hence not in the “live” set (line 15) is replaced by ⊥ in all φ, σ , or copy
functions where it appears by the loop, lines 15–18. Finally, all useless φ, σ , or copy
functions are removed by lines 19–20.
Fig. 13.10 (a) Implementing σ -functions via single arity φ-functions; (b) getting rid of copies and
σ -functions
Implementing σ -Functions
The most straightforward way to implement σ -functions, in a compiler that already
supports the SSA form, is to represent them by φ-functions. In this case, the
σ -functions can be implemented as single arity φ-functions. As an example,
Fig. 13.10a shows how we would represent the σ -functions of Fig. 13.3d. If l
is a branch point with n direct successors that would contain a σ -function (l 1 :
v1 , . . . , l n : vn ) ← σ (v), then, for each direct successor l j of l, we insert at the
beginning of l j an instruction vj ← φ(l j : v). Note that l j may already contain
a φ-function for v. This happens when the control-flow edge l → l j is critical: A
critical edge links a basic block with several direct successors to a basic block with
several direct predecessors. If l j already contains a φ-function v ← φ(. . . , vj , . . .),
then we rename vj to v.
SSI Destruction
Traditional instruction sets do not provide φ-functions or σ -functions. Thus, before
producing an executable program, the compiler must implement these instructions.
We have already seen in Chap. 3 how to replace φ-functions with actual assembly
instructions; however, now we must also replace σ -functions and parallel copies.
A simple way to eliminate all the σ -functions and parallel copies is via copy-
propagation. In this case, we copy-propagate the variables that these special
instructions define. As an example, Fig. 13.10b shows the result of copy folding
applied on Fig. 13.10a.
Further Reading

The monotone data-flow framework is an old ally of compiler writers. Since the
work of pioneers like Prosser [234], Allen [3, 4], Kildall [166], Kam [158], and
Hecht [143], data-flow analyses such as reaching definitions, available expressions,
and liveness analysis have made their way into the implementation of virtually
every important compiler. Many compiler textbooks describe the theoretical basis
of the notions of lattice, monotone data-flow framework, and fixed points. For a
comprehensive overview of these concepts, including algorithms and formal proofs,
we refer the interested reader to Nielson et al.’s book [208] on static program
analysis.
The original description of the intermediate program representation known as
Static Single Information form was given by Ananian in his Master’s thesis [8]. The
notation for σ -functions that we use in this chapter was borrowed from Ananian’s
work. The SSI program representation was subsequently revisited by Jeremy Singer
in his PhD thesis [261]. Singer proposed new algorithms to convert programs to
SSI form, and also showed how this program representation could be used to handle
truly bidirectional data-flow analyses. We have not discussed bidirectional data-flow
problems, but the interested reader can find examples of such analyses in Khedker
et al.’s work [165]. Working on top of Ananian’s and Singer’s work, Boissinot et
al. [37] have proposed a new algorithm to convert a program to SSI form. Boissinot
et al. have also separated the SSI program representation into two flavours, which
they call weak and strong. Tavares et al. [282] have extended the literature on SSI
representations, defining building algorithms and giving formal proofs that these
algorithms are correct. The presentation that we use in this chapter is mostly based
on Tavares et al.’s work.
There exist other intermediate program representations that, like the SSI form,
make it possible to solve some data-flow problems sparsely. Well-known among
these representations is the Extended Static Single Assignment form, introduced by
Bodik et al. to provide a fast algorithm to eliminate array bound checks in the
context of a JIT compiler [32]. Another important representation, which supports
data-flow analyses that acquire information at use sites, is the Static Single Use form
(SSU). As uses and definitions are not fully symmetric (the live range can “traverse”
a use while it cannot traverse a definition), there are different variants of SSU [125,
187, 228]. For instance, the “strict” SSU form enforces that each definition reaches
a single use, whereas SSI and other variations of SSU allow two consecutive uses
of a variable on the same path. All these program representations are very effective,
having seen use in a number of implementations of flow analyses; however, they
only fit specific data-flow problems.
The notion of Partitioned Variable Problem (PVP) was introduced by Zadeck,
in his PhD dissertation [316]. Zadeck proposed fast ways to build data structures
that allow one to solve these problems efficiently. He also discussed a number
of data-flow analyses that are partitioned variable problems. There are data-flow
analyses that do not meet the Partitioned Lattice per Variable property; notable
examples are analyses of relational information, such as the “i < j ?” problem
mentioned in Sect. 13.1.2, which bind information to pairs of variables.

Chapter 14
Graphs and Gating Functions
Many compilers represent the input program as some form of graph in order to
support analysis and transformation. Over time, a cornucopia of program graphs has
been presented in the literature and subsequently implemented in real compilers.
Many of these graphs use SSA concepts as the core principle of their representation,
ranging from literal translations of SSA into graph form to more abstract graphs
which are implicitly in SSA form. We aim to introduce a selection of program graphs
that use these SSA concepts, and examine how they may be useful to a compiler
writer.
A well-known graph representation is the control-flow graph (CFG) which we
encountered at the beginning of the book while being introduced to the core concept
of SSA. The CFG models control flow in a program, but the graphs that we
will study instead model data flow. This is useful as a large number of compiler
optimizations are based on data-flow analysis. In fact, all of the graphs that we
consider in this chapter are data-flow graphs.
In this chapter, we will look at a number of SSA-based graph representations.
An introduction to each graph will be given, along with diagrams to show how
sample programs look when translated into that particular graph. Additionally, we
will describe the techniques that each graph was created to solve, with references to
the literature for further research.
For this chapter, we assume that the reader already has familiarity with SSA (see
Chap. 1) and the applications that it is used for.
Since all of the graphs in this chapter are data-flow graphs, let us define them.
A data-flow graph (DFG) is a directed graph G = (V , E) where the edges E
represent the flow of data from the result of one instruction to the input of another.
An instruction executes once all of its input data values have been computed. When
an instruction executes, it produces a new data value which is propagated to other
connected instructions.
Whereas the CFG imposes a total ordering on instructions—the same ordering
that the programmer wrote them in—the DFG has no such concept of ordering;
it just models the flow of data. This means that it typically needs a companion
representation such as the CFG to ensure that optimized programs are still correct.
However, with access to both the CFG and DFG, optimizations such as dead
code elimination, constant folding, and common subexpression elimination can be
performed effectively. But this comes at a price: keeping both graphs updated during
optimization can be costly and complicated.
We begin our exploration with a graph that is a literal representation of SSA: the
SSA graph. The SSA graph can be constructed from a program in SSA form by
explicitly adding use-def chains. To demonstrate what the graph looks like, we
present some sample code in Fig. 14.1 which is then translated into an SSA graph.
An SSA graph consists of vertices that represent instructions (such as + and
print) or φ-functions, and directed edges that connect uses to definitions of values.
The outgoing edges of a vertex represent the arguments required for that instruction,
and the ingoing edge(s) to a vertex represent the propagation of the instruction’s
result(s) after they have been computed. We call these types of graphs demand-
based representations. This is because in order to compute an instruction, we must
first demand the results of the operands.
Although the textual representation of SSA is much easier for a human to read,
the primary benefit of representing the input program in graph form is that the
compiler writer is able to apply a wide array of graph-based optimizations by using
standard graph traversal and transformation techniques.
In the literature, the SSA graph has been used to detect induction variables in
loops, for performing instruction selection (see Chap. 19), operator strength reduc-
tion, rematerialization, and has been combined with an extended SSA language to
support compilation in a parallelizing compiler. The reader should note that the
exact specification of what constitutes an SSA graph changes from paper to paper.
The essence of the intermediate representation (IR) has been presented here, as each
author tends to make small modifications for their particular implementation.
Fig. 14.1 Some SSA code translated into an SSA graph. Note how edges demand the input values
for a node
We illustrate the usefulness of the SSA graph through a basic induction variable (IV)
recognition technique. A more sophisticated technique is developed in Chap. 10.
Given that a program is represented as an SSA graph, the task of finding induction
variables is simplified. A basic linear induction variable i is a variable that appears
only in the form:
i = 10
while <cond> do
    . . .
    i = i + k
    . . .
In the SSA graph, such a variable appears as a cycle (a strongly connected component, SCC) that can be
easily discovered in linear time using any depth-first search traversal. Each such
SCC must conform to the following constraints:
• The SCC contains only one φ-function at the header of the loop.
• Every component is either i = φ(i0 , ..., in ) or i = ik ⊕ n, where ⊕ is addition or
subtraction, and n is loop-invariant.
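For instance, once the loop above is in SSA form, the variable i gives rise to the
small cycle sketched below; its SCC {i2, i3} contains a single φ-function at the loop
header and a single addition of a loop-invariant value, so i is recognized as a basic
linear induction variable:

    i1 = 10
    i2 = φ(i1, i3)      (at the loop header)
    i3 = i2 + k         (k loop-invariant)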
This technique can be expanded to detect a variety of other classes of induction
variables, such as wraparound variables, non-linear induction variables, and nested
induction variables. Scans and reductions also show a similar SSA graph pattern
and can be detected using the same approach.
The program dependence graph (PDG) represents both control and data depen-
dencies together in one graph. The PDG was developed to support optimizations
requiring reordering of instructions and graph rewriting for parallelism, as the
strict ordering of the CFG is relaxed and complemented by the presence of data
dependence information. The PDG is a directed graph G = (V , E) where nodes V
are statements, predicate expressions, or region nodes, and edges E represent either
control or data dependencies. Thus, the set of all edges E has two distinct subsets:
the control dependence subgraph EC and the data dependence subgraph ED .
Statement nodes represent instructions in the program. Predicate nodes test a
conditional statement and have true and false edges to represent the choice taken on
evaluation of the predicate. Region nodes group control dependencies with identical
source and label together. If the control dependence for a region node is satisfied,
then it follows that all of its children can be executed. Thus, if a region node has
three different control-independent statements as immediate children, then those
statements could potentially be executed in parallel. Diagrammatically, rectangular
nodes represent statements, diamond nodes predicates, and circular nodes are region
nodes. Dashed edges represent control dependence, and solid edges represent data
dependence. Loops in the PDG are represented by back edges in the control
dependence subgraph. We show example code translated into a PDG in Fig. 14.2.
Building a PDG is a multi-stage process involving:
• Construction of the post-dominator tree
• Use of the post-dominator tree to generate the control dependence subgraph
• Insertion of region nodes
• Construction of DAGs for each basic block which are then joined to create the
data dependence subgraph
Let us explore this construction process in more detail.
An ENTRY node is added with one edge labeled true pointing to the CFG entry
node, and another labeled false going to the CFG exit node.
Fig. 14.2 Example code (left) translated into a PDG (right). Dashed/solid edges, respectively,
represent control/data dependencies
Before constructing the rest of the control dependence subgraph EC , let us define
control dependence. A node w is said to be control dependent on edge (u, v) if
w post-dominates v and w does not strictly post-dominate u. Control dependence
between nodes is equivalent to the post-dominance frontier on the reversed CFG. To
compute the control dependence subgraph, the post-dominator tree is constructed
for the CFG. Then, the control dependence edges from u to w are labeled with the
boolean value taken by the predicate computed in u when branching on edge (u, v).
Then, let S consist of the set of all edges (A, B) in the CFG such that B is not
an ancestor of A in the post-dominator tree. Each of these edges has an associated
label true or false. Then, each edge in S is considered in turn. Given (A, B), the post-
dominator tree is traversed backwards from B until we reach A’s parent, marking
all nodes visited (including B) as control dependent on A with the label of S.
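In pseudocode, and writing pdom-parent(w) for the parent of w in the post-
dominator tree (an assumed helper, not notation used elsewhere in this chapter),
this marking step can be sketched as follows:

    foreach edge (A, B) ∈ S with label l do
        w ← B
        while w ≠ pdom-parent(A) do
            mark w as control dependent on A with label l
            w ← pdom-parent(w)

Each node is marked once per edge of S that reaches it, so the total work is
proportional to the size of the resulting control dependence subgraph.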
Next, region nodes are added to the PDG. Each region node summarizes a
set of control conditions and “groups” all nodes with the same set of control
conditions together. Region nodes are also inserted so that predicate nodes will
only have two direct successors. To begin with, an unpruned PDG is created by
checking, for each node of the CFG, which control region it depends on. This
is done by traversing the post-dominator tree in post-order and mapping sets of
control dependencies to region nodes. For each node N visited in the post-dominator
tree, the map is checked for an existing region node with the same set CD of
control dependencies. If none exists, a new region node R is created with these
control dependencies and entered into the map. R is made to be the only control
dependence direct predecessor of N. Next, the intersection INT of CD is computed
for each immediate child of N in the post-dominator tree. If INT = CD, then the
corresponding dependencies are removed from the child and replaced with a single
dependence on the child’s control direct predecessor. Then, a pass over the graph is
made to make sure that each predicate node has a unique direct successor for each
boolean value. If more than one exists, the corresponding edges are replaced by a
single edge to a freshly created region node that itself points to the direct successor
nodes.
Finally, the data dependence subgraph is generated. This begins with the
construction of DAGs for each basic block where each upwards reaching leaf is
called a merge node. Data-flow analysis is used to compute reaching definitions.
All individual DAGs are then connected together: edges are added from definition
nodes to the corresponding merge nodes that may be reached. The resulting graph
is the data dependence subgraph, and PDG construction is complete.
The PDG has been used for generating code for parallel architectures and has
also been used in order to perform accurate program slicing and testing.
In SSA form, φ-functions are used to identify points where variable definitions
converge. However, they cannot be directly interpreted, as they do not specify the
condition that determines which of the variable definitions to choose. By this logic,
we cannot directly interpret the SSA graph. Being able to interpret our IR is a
useful property as it gives the compiler writer more information when implementing
optimizations and also reduces the complexity of performing code generation. Gated
single assignment form (GSA—sometimes called gated SSA) is an extension of
SSA with gating functions. These gating functions are directly interpretable versions
of φ-nodes and replace φ-nodes in the representation. We usually distinguish the
three following forms of gating functions:
• The φif function explicitly represents the condition that determines which φ
value to select. A φif function of the form φif (p, v1 , v2 ) has p as a predicate,
and v1 and v2 as the values to be selected if the predicate evaluates to true or
false, respectively. This can be read simply as if-then-else.
• The φentry function is inserted at loop headers to select the initial and loop carried
values. A φentry function of the form φentry (vinit , viter ) has vinit as the initial input
value for the loop, and viter as the iterative input. We replace φ-functions at loop
headers with φentry functions.
• The φexit function determines the value of a variable when a loop terminates.
A φexit function of the form φexit (p, vexit ) has p as its predicate and vexit as the
definition reaching beyond the loop.
It is easiest to understand these gating functions by means of an example.
Figure 14.3 shows how our earlier code in Fig. 14.2 translates into GSA form. Here,
Fig. 14.3 A graph representation of our sample code (left) in (demand-based) GSA form (right)
Fig. 14.4 (a) A structured code; (b) the PDG (with region nodes omitted); (c) the DAG
representation of the nested gated φif
we can see the use of both φentry and φexit gating functions. At the header of our
sample loop, the φ-function has been replaced by a φentry function which determines
between the initial and iterative value of i. After the loop has finished executing, the
nested φexit function selects the correct live-out version of a.
This example shows several interesting points. First, the semantics of both the
φexit and φif are strict in their gate: here a1 or φexit (q, a2 ) is not evaluated before
p is known.1 Similarly, a φif function that results from the nested if-then-else code
of Fig. 14.4 would be itself nested as a = φif (p, φif (q, a2 , a3 ), a1 ). Second, this
representation of the program does not allow for an interpreter to decide whether
an instruction with a side effect (such as A[i1 ] = a2 in our running example) has
to be executed or not. Finally, computing the values of gates is highly related to the
simplification of path expressions: in our running example, a2 should be selected
when the path ¬p followed by q (denoted ¬p.q) is taken, while a1 should be
selected when the path p is taken; for our nested if-then-else example, a1 should
be selected either when the path ¬p.r is taken or when the path ¬p.¬r is taken,
which simplifies to ¬p. Diverse approaches can be used to generate the correct
nested φif or φexit gating functions.
The most natural way uses a data-flow analysis that computes, for each program
point and each variable, its unique reaching definition and the associated set of
reaching paths. This set of paths is abstracted using a path expression. If the code
is not already under SSA and if, at a merge point of the CFG, the direct predecessor
basic blocks are reached by different variables, a φ-function is inserted. The gate of
each operand is set to the path expression of its corresponding incoming edge. If a
unique variable reaches all the direct predecessor basic blocks, the corresponding
1 As opposed to the ψ-function described in Chap. 15, which would use a syntax such as a3 = ψ((p ∧
¬q)?a1 , (¬p ∧ q)?a2 ) instead.
path expressions are merged. Of course, a classical path compression technique can
be used to minimize the number of visited edges. One can observe the similarities
with the φ-function placement algorithm described in Sect. 4.4.
There also exists a relationship between the control dependencies and the gates:
from a code already under strict and conventional SSA form, one can derive
the gates of a φif function from the control dependencies of its operands. This
relationship is illustrated by Fig. 14.4 in the simple case of a structured code.
These gating functions are important as the concept will form components of
the value state dependence graph later. GSA has seen a number of uses in the
literature including analysis and transformations based on data flow. With the
diversity of applications (see Chaps. 10 and 23), many variants of GSA have been
proposed. Those variations concern the correct handling of loops in addition to the
computation and representation of gates.
By using gating functions, it becomes possible to construct IRs based solely on
data dependencies. These IRs are sparse in nature compared to the CFG, making
them good for analysis and transformation. This is also a more attractive proposition
than generating and maintaining both a CFG and DFG, which can be complex and
prone to human error. One approach has been to combine both of these into one
representation, as is done in the PDG. Alternatively, we can utilize gating functions
along with a data-flow graph for an effective way of representing whole program
information using data-flow information.
1 JMAX ← EXPR
2 if p then
3 J ← JMAX − 1
4 else
5 J ← JMAX
6 assert (J ≤ JMAX)
If forward substitutions were to be used in order to determine whether the
assertion is correct, then the symbolic value of J must be discovered, starting at
the top of the program in the statement at line 1. Forward propagation through this
program results in the statement at line 6 becoming

    assert ((if p then EXPR − 1 else EXPR) ≤ EXPR)

and thus the assert () statement evaluates to true. In real, non-trivial programs, these
expressions can get unnecessarily long and complicated.
Using GSA instead allows for backwards, demand-driven substitutions. The
program above has the following GSA form:
1 JMAX 1 ← EXPR
2 if p then
3 J1 ← JMAX 1 − 1
4 else
5 J2 ← JMAX 1
6 J3 ← φif (p, J1 , J2 )
7 assert (J3 ≤ JMAX 1 )
Starting from the assertion, backward demand-driven substitution expands J3 as

    J3 = φif (p, J1 , J2 )
       = φif (p, JMAX1 − 1, JMAX1 )
The backward substitution then stops because enough information has been
found, avoiding the redundant substitution of JMAX 1 by EXPR. In non-trivial
programs, this can greatly reduce the number of redundant substitutions, making
symbolic analysis significantly cheaper.
The gating functions defined in the previous section were used in the development of
a sparse data-flow graph IR called the value state dependence graph (VSDG). The
VSDG is a directed graph consisting of operation, loop, and merge nodes,
together with value and state dependency edges. Cycles are permitted but must
satisfy various restrictions. A VSDG represents a single procedure: this matches
the classical CFG. An example VSDG is shown in Fig. 14.5.
Fig. 14.5 A recursive factorial function (left), whose VSDG (right) illustrates the key graph
components—value dependency edges (solid lines), state dependency edges (dashed lines), a
const node, a call node, a γ -node, a conditional node, and the function entry and exit nodes
The VSDG corresponds to a reducible program, i.e., there are no cycles in the
VSDG except those mediated by θ -nodes (loop nodes).
Value dependency (EV ) indicates the flow of values between nodes. State depen-
dency (ES ) represents two things; the first is essentially a sequential dependency
required by the original program, e.g., a given load instruction may be required
to follow a given store instruction without being re-ordered, and a return
node in general must wait for an earlier loop to terminate even though there might
be no value dependency between the loop and the return node. The second
purpose is that state dependency edges can be added incrementally until the VSDG
corresponds to a unique CFG. Such state dependency edges are called serializing
edges.
The VSDG is implicitly represented in SSA form: a given operator node, n, will
have zero or more EV -consumers using its value. Note that, in implementation
terms, a single register can hold the produced value for consumption at all
consumers; it is therefore useful to talk about the idea of an output port for n being
allocated a specific register, r, to abbreviate the idea of r being used for each edge
(n1 , n2 ), where n2 ∈ directsucc(n1 ).
There are four main classes of VSDG nodes: value nodes (representing pure
arithmetic), γ -nodes (conditionals), θ -nodes (loops), and state nodes (side effects).
The majority of nodes in a VSDG generate a value based on some computation
(add, subtract, etc.) applied to their dependent values (constant nodes, which have
no dependent nodes, are a special case).
γ -Nodes
The γ -node is similar to the φif gating function in being dependent on a control
predicate, rather than the control-independent nature of SSA φ-functions. A γ -node
γ (C : p, T : vtrue , F : vfalse ) evaluates the condition dependency p and returns
the value of vtrue if p is true, otherwise vfalse . We generally treat γ -nodes as single-
valued nodes (contrast θ -nodes, which are treated as tuples), with the effect that two
separate γ -nodes with the same condition can be later combined into a tuple using
a single test. Figure 14.6 illustrates two γ -nodes that can be combined in this way.
Here, we use a pair of values (2-tuple) for ports T and F . We also see how two
syntactically different programs can map to the same structure in the VSDG.
θ -Nodes
The θ -node models the iterative behaviour of loops, modelling loop state with the
notion of an internal value which may be updated on each iteration of the loop. It has
five specific elements that represent dependencies at various stages of computation.
Fig. 14.6 Two different code schemes map to the same γ -node structure
Fig. 14.7 An example showing a for loop. Evaluating X triggers it to evaluate the I value
(outputting the value L). While C evaluates to true, it evaluates the R value (which in this case
also uses the θ-node’s L value). When C is false, it returns the final internal value through X. As i
is not used after the loop, there is no dependency on i at X
The θ-node corresponds to a merge of the φentry and φexit nodes in gated SSA. A θ-node θ(C : p, I : vinit, R : vreturn, L : viter, X : vexit) sets its internal value to the initial value vinit and then, while the condition value p holds true, sets viter to the current internal value and updates the internal value with the repeat value vreturn. When p evaluates to false, computation ceases, and the last internal value is returned through vexit.
A loop that updates k variables will have a single condition p, initial values v_init^1, . . . , v_init^k, loop iterations v_iter^1, . . . , v_iter^k, loop returns v_return^1, . . . , v_return^k, and loop exits v_exit^1, . . . , v_exit^k. The example in Fig. 14.7 also shows a pair (2-tuple) of
values being used on ports I, R, L, X, one for each loop-variant value.
The θ -node directly implements pre-test loops (while, for); post-test loops
(do...while, repeat...until) are synthesized from a pre-test loop pre-
ceded by a duplicate of the loop body. At first, this may seem to cause unnecessary
duplication of code, but it has two important benefits: (1) it exposes the first loop
body iteration to optimization in post-test loops (cf. loop-peeling) and (2) it normal-
izes all loops to one loop structure, which both reduces the cost of optimization and
increases the likelihood of two schematically dissimilar loops being isomorphic in
the VSDG.
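A minimal illustration of this synthesis, assuming hypothetical C helpers body() and cond():

void body(void);
int  cond(void);

/* Post-test source loop. */
void post_test(void) {
    do { body(); } while (cond());
}

/* Form used by the VSDG: one peeled copy of the body, followed by an
 * ordinary pre-test loop, i.e., a single θ-node. */
void synthesized(void) {
    body();
    while (cond())
        body();
}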
State Nodes
Loads and stores compute a value and state. The call node takes both the name of
the function to call and a list of arguments and returns a list of results; it is treated
as a state node as the function body may read or update state.
We maintain the simplicity of the VSDG by imposing the restriction that all
functions have one return node (the exit node N∞ ), which returns at least one result
(which will be a state value in the case of void functions). To ensure that function
calls and definitions are able to be allocated registers easily, we suppose that the
number of arguments to, and results from, a function is smaller than the number of
physical registers—further arguments can be passed via a stack as usual.
Note also that the VSDG does not force loop-invariant code into or out of loop bodies, but rather allows later phases to determine the placement of loop-invariant nodes by adding serializing edges.
Function WalkAndMark(n, G)
    if n is marked then return
    mark n
    foreach node m ∈ N such that (n, m) ∈ (EV ∪ ES) do
        WalkAndMark(m, G)

Function DeleteMarked(G)
    foreach node n ∈ N do
        if n is unmarked then delete(n)
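For concreteness, a small C sketch of this mark-and-sweep over dependency edges (the node layout is hypothetical; an actual VSDG implementation would start the walk from the exit node N∞):

#include <stdbool.h>
#include <stddef.h>

struct Node {
    struct Node **deps;   /* outgoing E_V and E_S dependency edges */
    size_t        ndeps;
    bool          marked;
};

/* Mark every node transitively reachable through dependency edges. */
static void walk_and_mark(struct Node *n) {
    if (n == NULL || n->marked)
        return;
    n->marked = true;
    for (size_t i = 0; i < n->ndeps; i++)
        walk_and_mark(n->deps[i]);
}

/* Compact the node array, dropping every unmarked (dead) node. */
static size_t delete_unmarked(struct Node **nodes, size_t count) {
    size_t kept = 0;
    for (size_t i = 0; i < count; i++)
        if (nodes[i]->marked)
            nodes[kept++] = nodes[i];
    return kept;   /* the new node count */
}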
Chapter 15
Psi-SSA Form
François de Ferrière
In the SSA representation, each definition of a variable is given a unique name, and
new pseudo-definitions are introduced on φ-functions to merge values coming from
different control-flow paths. An example is given in Fig. 15.1b. Each definition is an
unconditional definition, and the value of a variable is the value of the expression
on the unique assignment to this variable. This essential property of the SSA
representation no longer holds when definitions may be conditionally executed.
When a variable is defined by a predicated operation, the value of the variable will
or will not be modified depending on the value of a guard register. As a result,
the value of the variable after the predicated operation is either the value of the
expression on the assignment if the predicate is true, or the value the variable had
before this operation if the predicate is false. This is represented in Fig. 15.1c, where we use the notation p ? a = op to indicate that an operation a = op is executed only if predicate p is true, and is ignored otherwise. We will also use the notation p̄ to refer to the complement of predicate p. The goal of the ψ-SSA form advocated
in this chapter is to express these conditional definitions while keeping the Static
Single Assignment property.
F. de Ferrière
STMicroelectronics, Grenoble, France
Later on, the compiler may also generate predicated operations through if-conversion
optimizations as described in Chap. 20.
In Fig. 15.1c, the use of a on the last instruction refers to the variable a1 if
p is false, or to the variable a2 if p is true. These multiple reaching definitions
on the use of a cannot be represented by the standard SSA representation. One
possible representation would be to use the gated SSA form, presented in Chap. 14.
In such a representation, the φ-function would be augmented with the predicate p
to tell which value between a1 and a2 is to be considered. However, gated SSA is a
completely different intermediate representation where the control flow is no longer
represented. This representation is better suited to program interpretation than to
optimizations at code-generation level as addressed in this chapter. Another possible
representation would be to add a reference to a1 on the definition of a2 . In this case,
p ? a2 = op2 | a1 would have the following semantics: a2 takes the value computed
by op2 if p is true, or holds the value of a1 if p is false. The use of a on the last
instruction of Fig. 15.1c would now refer to the variable a2 , which holds the correct
value. The drawback of this representation is that it adds dependencies between
operations (here a flow dependence from op1 to op2), which would prevent code
reordering for scheduling.
Our solution is presented in Fig. 15.1d. The φ-function of the SSA code with
control flow is “replaced” by a ψ-function on the corresponding predicated code,
with information on the predicate associated with each argument. This representa-
tion is adapted to code optimization and code generation on a low-level intermediate
representation. A ψ-function a0 = ψ(p1 ?a1 , . . . , pi ?ai , . . . , pn ?an ) defines one
variable, a0 , and takes a variable number of arguments ai ; each argument ai is
associated with a predicate pi . In the notation, the predicate pi will be omitted
if pi ≡ true.
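For instance, in the notation introduced above (the variables and predicate are hypothetical):

a1 = op1
p ? a2 = op2
a3 = ψ(a1, p?a2)

Here a3 holds the value computed by op2 when p is true and the value of a1 otherwise; the predicate of the first argument is omitted because it is true.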
A ψ-function has the following properties:
• It is an operation: A ψ-function is a regular operation. It can occur at any location
in a basic block where a regular operation is valid. Each argument ai , and each
predicate pi , must be dominated by its definition.
• It is predicated: A ψ-function is a predicated operation, under the predicate ⋃_{k=1}^{n} pk, although this predicate is not explicit in the representation.
predicate p1 (possibly true); after renaming x into a freshly created version, say
x2 , a ψ-function of the form x = ψ(p1 ?x1 , p2 ?x) is inserted right after op. Then
the renaming of this new operation proceeds. The first argument of the ψ-function
is already renamed and thus is not modified. The second argument is renamed
into the current version of x, that is, x2 . On the definition of the ψ-function, the
variable x is given a new name, x3 , which becomes the current version for further
references to the x variable. This insertion and renaming of a ψ-function is shown
in Fig. 15.3.
ψ-functions can also be introduced into an SSA representation by applying
an if-conversion transformation, such as the one described in Chap. 20. Local
transformations of control-flow patterns can also require the replacement of φ-
functions by ψ-functions.
Fig. 15.5 ψ-Reduction. The first argument a1 of the ψ-function can safely be removed
(pi \ ⋃_{k=i}^{n} pk) ∩ ⋃_{k=1}^{i−1} pk = ∅,    (15.1)

where pi \ ⋃_{k=i}^{n} pk corresponds to the possible increase of the predicate of the ψ-function, ⋃_{k=1}^{n} pk. This promotion must also satisfy the properties of ψ-functions,
and, in particular, that the predicate associated with a variable in a ψ-function must
be included in or equal to the predicate on the definition of that variable (which itself
can be a ψ-function). A simple ψ-promotion is illustrated in Fig. 15.8c.
The ψ-SSA representation can be used on a partially predicated architecture,
where only a subset of the instructions supports a predicate operand. Figure 15.8
shows an example where some code with control-flow edges was transformed into
a linear sequence of instructions. Taking the example of an architecture where the
ADD operation cannot be predicated, the ADD operation must be speculated under
the true predicate. On an architecture where the ADD operation can be predicated,
it may also be profitable to perform speculation in order to reduce the number of
predicates on predicated code and to reduce the number of operations to compute
these predicates. Once speculation has been performed on the definition of a variable
used in a ψ-function, the predicate associated with this argument can be promoted,
provided that the semantics of the ψ-function is maintained (Eq. 15.1).
Usually, the first argument of a ψ-function can be promoted under the true pred-
icate. Also, when disjoint conditions are computed, one of them can be promoted
to include the other conditions, usually reducing the number of predicates. A side
effect of this transformation is that it may increase the number of copy instructions
to be generated during the ψ-SSA destruction phase, as will be explained in the
following section.
The SSA destruction phase reverts an SSA representation into a non-SSA repre-
sentation. This phase must be adapted to the ψ-SSA representation. This algorithm
uses ψ-φ-webs to create a conventional ψ-SSA representation. The notion of φ-
Fig. 15.9 Non-conventional ψ-SSA (ψ-T-SSA) form, ψ-C-SSA forms, and non-SSA form after
destruction
of the variable b, after the definition of b and p; in the ψ-function, variable b is then
replaced by variable c. The renaming of variables a, c, and x into a single variable
will now follow the correct behaviour.
In Fig. 15.9g, the renaming of the variables a, b, c, x, and y into a single variable
will not give the correct semantics. In fact, the value of a used in the second ψ-
function would be overridden by the definition of b before the definition of the
variable c. Such code will occur after copy folding has been applied on a ψ-SSA
representation. We see that the value of a has to be preserved before the definition
of b. This is done through the definition of a new variable (d here), resulting in
the code given in Fig. 15.9h. Now, the variables a, b, and x can be renamed into a
single variable, and the variables d, c, and y will be renamed into another variable,
resulting in a program in a non-SSA form with the correct behaviour.
We will now present an algorithm that will transform a program from a ψ-SSA
form into its ψ-C-SSA form. This algorithm comprises three parts:
• Psi-normalize: This phase puts all ψ-functions in what we call a normalized
form.
• Psi-web: This phase grows ψ-webs from ψ-functions and introduces repair code
where needed such that each ψ-web is interference-free.
• Phi-web: This phase is the standard SSA destruction algorithm (e.g., see
Chap. 21) with the additional constraint that all variables in a ψ-web must be
coalesced together. This can be done using the pinning mechanism presented in
Chap. 21.
We now detail the implementation of each of the first two parts.
15.4.1 Psi-Normalize
15.4.2 Psi-web
The role of the psi-web phase is to repair the ψ-functions that are part of a
non-interference-free ψ-web. This case corresponds to the example presented in
Fig. 15.9g. In the same way as there is a specific point of use for arguments on φ-
functions for liveness analysis (e.g., see Sect. 21.2), we give a definition of the actual
point of use of arguments on normalized ψ-functions for liveness analysis. With this
definition, liveness analysis is computed accurately, and an interference graph can
be built. The cases where repair code is needed can be easily and accurately detected
by observing that variables in a ψ-function interfere.
1 When ai is defined by a ψ-function, its definition may appear after the definition for ai−1 ,
although the non-ψ definition for ai appears before the definition for ai−1 .
the definition for ai+1, or just above the ψ-function in the case of the last argument. The current argument ai in the ψ-function is replaced by the new variable ai′. The interference graph is updated. This can be done by considering the set of variables, say U, that ai interferes with. For each u ∈ U: if u is in the merged ψ-web, it should not interfere with ai′; if the definition of u dominates the definition of ai′, it is live through the definition of ai′, thus it should be made interfering with ai′; last, if the definition of ai′ dominates the definition of u, it should be made interfering only if this definition is within the live range of ai′ (see Chap. 9).
Consider the code in Fig. 15.11 to see how this algorithm works. The liveness on
the ψ-function creates a live range for variable a that extends down to the definition
of b, but not further down. Thus, the variable a does not interfere with the variables
b, c, or x. The live range for variable b extends down to its use in the definition of
variable d. This live range interferes with the variables c and x. The live range for
variable c extends down to its use in the ψ-function that defines the variable x. At
the beginning of the processing on the ψ-function x = ψ(p?a, q?b, r?c), ψ-webs
are singletons {a}, {b}, {c}, {x}, {d}. The argument list is processed from right to
left, i.e., starting with variable c. {c} does not interfere with {x}, and they can be
merged together, resulting in psiWeb = {x, c}. Next, variable b is processed. Since
it interferes with both x and c, repair code is needed. A variable b’ is created and is
initialized just below the definition for b, as a predicated copy of b. The interference
graph is updated conservatively, with no changes. psiWeb now becomes {x, b , c}.
Then variable a is processed, and as no interference is encountered, {a} is merged
to psiWeb. The final code after SSA destruction is shown in Fig. 15.11c.
In this chapter, we have mainly described the ψ-SSA representation and have
detailed specific transformations that can be performed thanks to this representation.
More details on the implementation of the ψ-SSA algorithms, and figures on the
benefits of this representation, can be found in [275] and [104].
We mentioned in this chapter that a number of classical SSA-based algorithms
can be easily adapted to the ψ-SSA representation, usually by just adapting the rules
on the φ-functions to the ψ-functions. Among these algorithms, we can mention the
constant propagation algorithm described in [303], dead code elimination [202],
global value numbering [76], partial redundancy elimination [73], and induction
variable analysis [306], which have already been implemented into a ψ-SSA
framework.
There are also other SSA representations that can handle predicated instructions, among them the Predicated SSA representation [58]. This representation is targeted
at very low-level optimization to improve operation scheduling in the presence of
predicated instructions. Another representation is the gated SSA form, presented in
Chap. 14.
The ψ-SSA destruction algorithm presented in this chapter is inspired by the
SSA destruction algorithm of Sreedhar et al. [267], which introduces repair code
when needed as it grows φ-webs from φ-functions. The phi-web phase mentioned
in this chapter to complete the ψ-SSA destruction algorithm can use exactly the
same approach by simply initializing ψ-φ-webs by ψ-webs.
Chapter 16
Hashed SSA Form: HSSA
Hashed SSA (or in short, HSSA) is an SSA extension that can effectively represent
how aliasing relations affect a program in SSA form. It works equally well
for aliasing among scalar variables and, more generally, for indirect load and
store operations on arbitrary memory locations. Thus, all common SSA-based optimizations can be applied uniformly to any storage area, no matter how it is represented in the program.
It should be noted that only the representation of aliasing is discussed here. HSSA
relies on a separate alias analysis pass that runs before its creation. Depending on
the actual alias analysis algorithm used, the HSSA representation will reflect the
accuracy produced by the alias analysis pass.
This chapter starts by defining notations to model the effects of aliasing for scalar
variables in SSA form. We then introduce a technique that can reduce the overhead
linked to the SSA representation by avoiding an explosion in the number of SSA
versions for aliased variables. Next, we introduce the concept of virtual variables
to model indirect memory operations as if they were scalar variables, effectively
allowing indirect memory operations to be put into SSA form together with scalar
variables. Finally, we apply global value numbering (GVN) to the program to derive
Hashed SSA form1 as the effective SSA representation of all storage entities in the
program.
1 The name Hashed SSA comes from the use of hashing in value numbering.
M. Mantione
WorkWave, Gorgonzola, Milan, Italy
F. Chow
Huawei, Fremont, CA, USA
Aliasing occurs in a program when a storage location (that contains a value) referred
to in the program code can potentially be accessed through a different means; this
can occur under one of the following four conditions:
• Two or more storage locations partially overlap. For example, in the C union
construct, a storage location can be accessed via different field names.
• A variable is pointed to by a pointer. In this case, the variable can be accessed in
two ways: directly, through the variable name, and indirectly, through the pointer
that holds its address.
• The address of a variable is passed in a procedure call. This enables the called
procedure to access the variable indirectly.
• The variable is declared in the global scope. This allows the variable to
potentially be accessed in any function call.
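As a concrete illustration, the four conditions might look as follows in C (all declarations hypothetical):

union overlap { int i; float f; };   /* 1. u.i and u.f share storage      */
int global_x;                        /* 4. reachable from any call        */

void callee(int *q);                 /* may dereference or store q        */

void example(void) {
    union overlap u;
    int v;
    int *p = &v;     /* 2. v is accessible both as v and as *p            */
    callee(&v);      /* 3. the callee can access v through its address    */
    u.i = 0;
    *p = u.i + global_x;
}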
We seek to model the effects of aliasing on a program in SSA form based on the
results of the alias analysis performed. To characterize the effects of aliasing, we
distinguish between two types of definitions of a variable: MustDef and MayDef. A
MustDef must redefine the variable and thus blocks the references to its previous
definitions from that point on. A MayDef only potentially redefines the variable and
so does not prevent previous definitions of the same variable from being referenced
later in the program.2 We represent MayDef through the use of χ -functions. On
the use side, in addition to real uses of the variable, which are MustUses, there are
MayUses that arise in places in the program where there are potential references to
the variable. We represent MayUse through the use of μ-functions. The semantics of
μ and χ operators can be illustrated through the C-like example in Fig. 16.1, where
∗p represents an indirect access through pointer p. The argument of the μ-operator
is the potentially used variable. The argument of the χ -operator is the potentially
assigned variable itself, to express the fact that the variable’s original value will flow
through if the MayDef does not modify the variable.
The use of μ- and χ -functions does not alter the complexity of transforming a
program into SSA form. All that is necessary is a pre-pass that inserts them in the
program. Ideally, μ- and χ -functions should be placed parallel to the instruction
that led to their insertion. Parallel instructions are represented in Fig. 16.1 using
the notation introduced in Sect. 13.1.4. Nonetheless, practical implementations may
choose to insert μ- and χ -functions before or after the instructions that involve
aliasing. In particular, μ-functions can be inserted immediately before the involved
statement or expression and χ -operators immediately after the statement. This
distinction allows us to model call effects correctly: the called function appears to
potentially use the values of variables before the call, and the potentially modified
values appear after the call.
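For example, a call that may read and modify a variable i would be annotated as follows (a sketch in the style of Fig. 16.1; version numbers are hypothetical):

   μ(i1)          inserted immediately before the call: f may read i
   call f()
   i2 = χ(i1)     inserted immediately after the call: f may write i;
                  i2 carries i1's value through if it does not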
2 MustDefs are often referred to as Killing Defs and MayDefs as Preserving or Non-killing Defs
in the literature.
Fig. 16.1 Program example where ∗p might alias i, and function f might indirectly use i but not
alter it
While it is true that μ and χ insertion does not alter the complexity of SSA
construction, applying it to a production compiler as described in the previous
section may make working with code in SSA form inefficient in programs that
exhibit a lot of aliasing. Each χ -function introduces a new version, and it may in
turn cause new φ-functions to be inserted. This can make the number of distinct
variable versions needlessly large.
The major issue is that the SSA versions introduced by χ -operators are useless
for most optimizations that deal with variable values. χ definitions add uncertainty
to the analysis of variable values: the actual value of a variable after a χ definition
could be its original value, or it could be the one indirectly assigned by the χ .
Our solution to this problem is to factor all variable versions that are considered
useless together, so that SSA versions are not wasted. We assign number 0 to this
special variable version and call it zero version.
Our notion of useless versions relies on the concept of real occurrence of a
variable, which is an actual definition or use of a variable in the original program.
From this point of view, in the SSA form, variable occurrences in μ-, χ -, and φ-
functions are not regarded as real occurrences. In our example in Fig. 16.1, i2 has
no real occurrence, while i1 , i3 , and i4 do have. The idea is that variable versions
that have no real occurrence do not play important roles in the optimization of
the program. Once the program is converted back from SSA form, these variables
will no longer show up. Since they do not directly appear in the code and their
values are usually unknown, we can dispense with the cost of distinguishing among
them.
For these reasons, we assign the zero version to versions of variables that have
no real occurrence and whose values are derived from at least one χ -function
through zero or more intervening φ-functions. An equivalent recursive definition
is as follows:
• The result of a χ has zero version if it has no real occurrence.
• If the operand of a φ is zero version, then the result of the φ is zero version if it
has no real occurrence.
For a program in full SSA form, Algorithm 16.1 determines which variable ver-
sions are to be made zero version under the above definition. The algorithm assumes
that only use-def edges (and not def-use edges) are available. A HasRealOcc flag is
associated with each original SSA version, being set to true whenever it has a real
occurrence in the program. This can be done during the initial SSA construction. A
list NonZeroPhiList, initially empty, is also associated with each original program
variable.
10 changes ← true
11 while changes do
12     changes ← false
13     foreach vi ∈ v.NonZeroPhiList (for each original program variable v) do
14         let V = the versions of the operands of the φ that defines vi
15         if ∀vj ∈ V, vj.HasRealOcc then
16             vi.HasRealOcc ← true
17             remove vi from v.NonZeroPhiList
18             changes ← true
19         else if ∃vj ∈ V, vj.version = 0 then
20             vi.version ← 0
21             remove vi from v.NonZeroPhiList
22             changes ← true
The loop from lines 2 to 12 of Algorithm 16.1 can be regarded as the initialization
pass, processing each variable version once. The while loop from lines 14 to 25
is the propagation pass, with time bound by the length of the longest chain of
contiguous φ assignments. This bound can easily be reduced to the deepest loop
nesting depth of the program by traversing the versions based on a topological
order of the forward control flow graph. All in all, zero-version detection in the
presence of μ- and χ -functions does not significantly change the complexity of SSA
construction, while the corresponding reduction in the number of variable versions
can reduce the overhead in the subsequent SSA-based optimization phases.
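A compact sketch of the propagation pass in C (field and type names are hypothetical, not the book's implementation):

#include <stdbool.h>
#include <stddef.h>

struct Version {
    bool has_real_occ;             /* the HasRealOcc flag            */
    int version;                   /* 0 encodes "zero version"       */
    struct Version **ops;          /* operands of the defining φ     */
    size_t nops;
};

/* Iterate to a fixed point over the φ results collected in the
 * variable's NonZeroPhiList (entries are set to NULL once removed). */
static void propagate_zero_versions(struct Version **list, size_t n) {
    bool changes = true;
    while (changes) {
        changes = false;
        for (size_t i = 0; i < n; i++) {
            struct Version *vi = list[i];
            if (vi == NULL) continue;
            bool all_real = true, any_zero = false;
            for (size_t k = 0; k < vi->nops; k++) {
                all_real = all_real && vi->ops[k]->has_real_occ;
                any_zero = any_zero || vi->ops[k]->version == 0;
            }
            if (all_real)      { vi->has_real_occ = true; list[i] = NULL; changes = true; }
            else if (any_zero) { vi->version = 0;         list[i] = NULL; changes = true; }
        }
    }
}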
Because zero versions can have multiple static assignments, they do not have
fixed or known values. Thus, two zero versions of the same variable cannot be
assumed to have identical values. The occurrence of a zero version breaks use-def
chains. But since the results of χ -functions have unknown values, zero versioning
does not affect the performance of optimizations that propagate known values,
such as constant propagation, because they cannot be propagated across points
of MayDefs in any case. Optimizations that operate on real occurrences, such
as equivalencing and redundancy detection, are also unaffected. In performing
dead store elimination, zero versions have to be assumed live. Since zero versions
can only have uses represented by μ-functions, the chances of the deletion of
stores associated with χ -functions with zero versions are small. But if some later
optimizations delete the code that contains a μ-function, zero versioning could
prevent its defining χ -function from being recognized as dead. Overall, zero
versioning should only cause a small loss of effectiveness in the subsequent SSA-
based optimization phases.
The techniques described in the previous sections only apply to scalar variables
in the program and not to arbitrary memory locations accessed indirectly. As an
example, in Fig. 16.1, μ-, χ -, and φ-functions have been introduced to keep track
of i’s use-defs, but ∗p is not considered as an SSA variable. Thus, even though we
can apply SSA-based optimizations to scalar variables when they are affected by
aliasing, indirect memory access operations are not targeted in our optimizations.
This situation is far from ideal, because code written in current mainstream
imperative languages (like C++, Java, or C#) typically contains many operations on
data stored in unfixed memory locations. For instance, in C, a two-dimensional vec-
tor could be represented as a struct declared as “typedef struct {double
x; double y;} point;.” Then, we can have a piece of code that computes
the modulus of the vector p written as “m = (p->x * p->x) + (p->y *
p->y);.” As x is accessed twice with both accesses yielding the same value, the
second access could be optimized away, and the same goes for y. The problem is that
x and y are not scalar variables: p is a pointer variable, while “p->x” and “p->y”
are indirect memory operations. Representing this code snippet in SSA form tells
us that the value of p never changes, but it reveals nothing about the values stored
in the locations “p->x” and “p->y.” It is worth noting that operations on array
elements suffer from the same problem.
The purpose of HSSA is to put indirect memory operations in SSA form just like
scalar variables, so we can apply all SSA-based optimizations to them uniformly.
In our discussion, we use the C dereferencing operator * to denote indirection from a pointer. This operator can be placed on either the left- or the right-
hand side of the assignment operator. Appearances on the right-hand side represent
indirect loads, while those on the left-hand side represent indirect stores. Examples
include:
• *p: read memory at address p
• *(p+4): read memory at address p + 4 (as in reading the field of an object at
offset 4)
• **p: double indirection
• *p=: indirect store
As noted above, indirect memory operations cannot be handled by the regular
SSA construction algorithm because they are operations, while SSA construction
works on variables only. In HSSA, we represent the target locations of indirect
memory operations using virtual variables. A virtual variable is an abstraction of
a memory area and appears under HSSA thanks to the insertion of μ-, χ -, and φ-
functions. Indeed, like any other variable, they can also have aliases. For the same
initial C code shown in Fig. 16.2a, we show two different scenarios after μ-function
and χ-function insertion. In Fig. 16.2b, the two virtual variables introduced, v* and w*, are associated with the memory locations pointed to by p and q, respectively. As a result, v* and w* both alias with all the indirect memory operations (of lines 3, 5, 7, and 8). In Fig. 16.2c, x* is associated with the memory location pointed to by b, and y* is associated with the memory location pointed to by b + 1. Assuming alias analysis determines that there is no alias between b and b + 1, only x* aliases with the indirect memory operations of line 3, and only y* aliases with the indirect memory operations of lines 5, 7, and 8.
Fig. 16.2 Some virtual variables and their insertion depending on how they alias with operands
It can be seen that the behaviour of a virtual variable in annotating the SSA
representation is dictated by its definition. The only discipline imposed by HSSA
is that each indirect memory operand must be associated with at least one virtual
variable. At one extreme, there could be one virtual variable for each indirect
memory operation. On the other hand, it may be desirable to cut down the number
of virtual variables by making each virtual variable represent more forms of indirect
memory operations. Called assignment factoring, this has the effect of replacing
multiple use-def chains belonging to different virtual variables with one use-def
chain that encompasses more nodes and thus more versions. At the other extreme,
the most factored HSSA form would define only one single virtual variable that
represents all indirect memory operations in the program.
In practice, it is best not to use assignment factoring among memory operations
that do not alias among themselves. This provides high accuracy in the SSA
representation without incurring additional representation overhead, because the
total number of SSA versions is unaffected. On the other hand, among memory
operations that alias among themselves, using multiple virtual variables would result
in greater representation overhead. Zero versioning can also be applied to virtual
variables to help reduce the number of versions. Appearances of virtual variables in
the μ- and χ -functions next to their defined memory operations are regarded as real
occurrences in the algorithm that determines zero versions for them.
Virtual variables can be instrumented into the program during μ-function and χ -
function insertion, in the same pass as for scalar variables. During SSA construction,
virtual variables are handled just like scalar variables. In the resulting SSA form, the
use-def relationships of the virtual variables will represent the use-def relationships
among the memory access operations in the program. At this point, we are ready to
complete the construction of the HSSA form by applying global value numbering
(GVN).
In the previous sections, we have laid the foundations for dealing with aliasing
and indirect memory operations in SSA form: we introduced μ- and χ -operators
to model aliases, applied zero versioning to keep the number of SSA versions
acceptable, and defined virtual variables as a way to apply SSA to storage locations
accessed indirectly. However, HSSA is incomplete unless global value numbering
is applied to handle scalar variables and indirect storage locations uniformly (see
Chap. 11).
Value numbering works by assigning a unique number to every expression in the
program with the idea that expressions identified by the same number are guaranteed
to compute to the same value. The value number is obtained using a hash function
applied to each node in an expression tree. For an internal node in the tree, the value
number is a hash function of the operator and the value numbers of all its immediate
operands. The SSA form enables value numbering to be applied on the global scope,
taking advantage of the property that the same SSA version of a variable must store
the same value regardless of where it appears.
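A minimal sketch of such hashing in C (the node layout and hash constants are illustrative assumptions; a production implementation would also canonicalize commutative operands):

#include <stdint.h>

enum Kind { K_CONST, K_VAR, K_BINOP };

struct Expr {
    enum Kind kind;
    int op;                      /* operator tag for K_BINOP           */
    int64_t value;               /* constant value or variable id      */
    uint32_t vn_left, vn_right;  /* value numbers of the operands      */
};

/* FNV-1a over (kind, op, value, operand value numbers): expressions
 * that agree on all of these hash to the same bucket, where an exact
 * comparison assigns them the same value number. */
static uint32_t hash_expr(const struct Expr *e) {
    uint32_t h = 2166136261u;
    h = (h ^ (uint32_t)e->kind) * 16777619u;
    h = (h ^ (uint32_t)e->op) * 16777619u;
    h = (h ^ (uint32_t)(uint64_t)e->value) * 16777619u;
    h = (h ^ e->vn_left) * 16777619u;
    h = (h ^ e->vn_right) * 16777619u;
    return h;
}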
GVN enables us to easily determine when two address expressions compute the
same address when building HSSA. We can then construct the SSA form among
indirect memory operations whose address expressions have the same value number.
Because a memory location accessed indirectly may be assigned different values
at different points in the program, having the same value number for address
expressions is not a sufficient condition for the read of the memory location to
have the same value number. This is where we make use of virtual variables. In
HSSA, for two reads of indirect memory locations to have the same value number,
apart from their address expressions having identical value numbers, an additional
condition is that they must have the same SSA version for their virtual variable.
If they are associated with different versions of their virtual variable, they will be
assigned different value numbers, because their reads may return different values.
This enables the GVN in HSSA to maintain the consistency whereby nodes with the
same value number must yield the same value. Thus, in HSSA, indirect memory
operations can be handled on the same footing as scalar variables, and they can benefit transparently from any SSA-based optimization applied to the program. For
instance, in the vector modulus computation described above, every occurrence of
the expression “p->x” will always have the same GVN and is therefore guaranteed
to return the same value, allowing the compiler to emit code that avoids redundant
memory reads (the same holds for “p->y”).
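These two conditions can be summarized by a small predicate (a hypothetical helper, not part of the book's algorithms):

#include <stdbool.h>
#include <stdint.h>

/* Two indirect loads may share a value number only if both their
 * address expressions' value numbers and their virtual-variable
 * versions agree; a differing version signals an intervening MayDef. */
static bool same_ivar_value(uint32_t addr_vn_a, uint32_t addr_vn_b,
                            int vvar_ver_a, int vvar_ver_b) {
    return addr_vn_a == addr_vn_b && vvar_ver_a == vvar_ver_b;
}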
Using the same code example from Fig. 16.2, we can see in Fig. 16.3 that p2 and
q2 have the same value number h7 , while p1 ’s value number is h1 and is different.
The loads at lines 5 and 8 cannot be considered redundant with each other because the versions for v (v*1 then v*2) are different. On the other hand, the load at line 8 can be safely avoided by reusing the value computed in line 7, as both their versions for v (v*2) and their address expressions' value numbers are identical.
As a last example, if all the associated virtual variable versions for an indirect
memory store (defined in parallel by χ -functions) are found to be dead, then it can
be safely eliminated.3 In other words, HSSA transparently extends SSA’s dead store
elimination algorithm to indirect memory stores.
Another advantage of HSSA in this context is that it enables uniform treatment of
indirect memory operations regardless of the levels of indirections (as in the “**p”
expression in C which represents a double indirection). This happens naturally
because each node of the expression is identified by its value number, and the fact
that it is used as an address in another expression does not cause any additional
complications.
Having explained why HSSA uses GVN, we are ready to explain how the
HSSA intermediate representation is structured. A program in HSSA keeps its CFG
3 Note that any virtual variable that aliases with a memory region live out of the compiled procedure
is considered to alias with the return instruction of the procedure and as a consequence will lead to
a live μ-function.
Fig. 16.3 Some code after variable versioning, and its corresponding HSSA form along with its hash table entries. q1 + 1, which simplifies into +(h0, h6), will be hashed to h7, and ivar(∗q2, v*2), which simplifies into ivar(∗h7, v*2), will be hashed to h14
accesses to the addresses specified as their operand. On the other hand, the virtual
variables have no real counterpart in the program’s code generation. In this sense,
virtual variables can be regarded as a tool to associate aliasing effects with indirect
memory operations and construct use-def relationships in the program that involves
the indirect memory operations. Using virtual variables, indirect memory operations
are no longer relegated to second-class citizens in SSA-based optimization phases.
As a supplementary note, in HSSA, because values, and variables in particular, can be uniquely identified using value numbers, references to SSA versions become redundant. In fact, to further ensure uniform treatment of scalar and indirect variables in the implementation, SSA version numbers can be omitted from the representation.
We now put together all the topics discussed in this chapter by detailing the steps
to build HSSA starting from a non-SSA representation of the program. The
first task is to construct the SSA form for the scalar and virtual variables, and this is
displayed in Algorithm 16.2.
Algorithm 16.2: SSA form construction
1. Perform alias analysis and assign a virtual variable to all indirect memory operations.
2. Insert μ-functions and χ-functions for scalar and virtual variables.
3. Insert φ-functions as in standard SSA construction, including the χ’s as additional
assignment statements.
4. Perform SSA renaming on all scalar and virtual variables as in standard SSA construction.
At the end of Algorithm 16.2, all scalar and virtual variables have SSA
information, but the indirect memory operations are only “annotated” with virtual
variables and cannot be regarded as being in SSA form.
Next, we perform a round of dead store elimination based on the constructed SSA
form and then run our zero-version detection algorithm to detect zero versions in the
scalar and virtual variables. These correspond to Steps 5 and 6 in Algorithm 16.3,
respectively.
Algorithm 16.3: Detecting zero versions
5. Perform dead store elimination on the scalar and virtual variables based on their SSA form.
6. Initialize HasRealOcc and NonZeroPhiList as for Algorithm 16.1, then run Algorithm 16.1
(Zero-version detection).
At this point, however, the number of unique SSA versions has diminished
because of the application of zero versioning. The final task is to build the
HSSA representation of the program by applying GVN. The steps are displayed
in Algorithm 16.4.
At the end of Algorithm 16.4, HSSA form is complete, and every value in the
program code is represented by a reference to a node in the hash table.
The preorder traversal order used in Algorithm 16.4 is not strictly required for the correctness of HSSA, because we already have SSA version information for the scalar and virtual variables. However, because this order ensures that we always visit definitions before their uses, it streamlines the implementation and also makes it easier to perform additional optimizations, such as copy propagation, during the program traversal. It is also possible to go up the
use-def chains for virtual variables and analyse the address expressions of their
associated indirect memory operations to determine more accurate alias relations
among ivar nodes that share the same virtual variable.
In HSSA form, expression trees are converted to directed acyclic graphs (DAGs), which makes the representation more memory-efficient than an ordinary program representation because of node sharing. Many optimizations can run faster on HSSA because they only need
to be applied once on the shared use nodes. Optimization implementations can also
take advantage of the fact that it is trivial to check if two expressions compute the
same value in HSSA.
The earliest attempts at accommodating may-alias information into SSA form are
represented by the work of Cytron and Gershbein [88], where they defined may-
alias sets of the form MayAlias(p, S), which gives the variable names aliased with
∗p at statement S in the program. Calls to an IsAlias(p, v) function are then inserted
into the program at points in the program where modifications due to aliasing
occur. The IsAlias(p, v) function contains code that models runtime logic and
returns appropriate values based on the pointer values. To address the high cost
of this representation, Cytron and Gershbein proposed an incremental approach for
including may-alias information into SSA form in their work.
The use of assignment factoring to represent MayDefs was first proposed by
Choi et al. [68]. The same factoring was referred to as location factored SSA form
by Steensgaard [272]. He also used assignment factoring to reduce the number of
SSA variables in his SSA form.
The idea of value numbering can be traced to Cocke and Schwartz [77]. While the
application of global value numbering (GVN) in this chapter is more for the purpose
of representation than optimization, GVN has mostly been discussed in the context
of redundancy elimination [76, 249]. Chapter 11 covers redundancy elimination in
depth and also contains a discussion of GVN.
HSSA was first proposed by Chow et al. [72] and first implemented in the
Open64 compiler [7, 64, 65]. All the SSA-based optimizations in the Open64
compiler were implemented based on HSSA. Their copy propagation and dead
store elimination work uniformly on both direct and indirect memory operations.
Their redundancy elimination covers indirect memory references as well as other
expressions. Other SSA-based optimizations in the Open64 compiler include loop
induction variable canonicalization [185], strength reduction [163], and register
promotion [187]. Induction variable recognition is discussed in Chap. 10. Strength
reduction and register promotion are also discussed in Chap. 11.
The most effective way to optimize indirect memory operations is to promote
them to scalars when possible. This optimization is called indirection removal.
In the Open64 compiler, it depends on the ability of copy propagation to convert
the address expressions of indirect memory operations to address constants. The
indirect memory operations can then be folded to direct loads and stores. Subsequent
register promotion will promote the scalars to registers. If the scalars are free of
aliasing, they will not be allocated storage in memory.
Lapkowski and Hendren proposed the use of Extended SSA Numbering to
capture the semantics of aliases and pointer accesses [175]. Their SSA number
idea borrows from SSA’s version numbering, in the sense that a new number is
used to annotate the variable whenever it could assume a new value. φ-nodes are
not represented, and not all SSA numbers need to have explicit definitions. SSA
numbers for pointer references are “extended” to two numbers, the primary one
for the pointer variable and the secondary one for the pointed-to memory location.
But because it is not SSA form, it does not exhibit many of the nice properties of
SSA, like single definitions and built-in use-defs. It cannot benefit from the many
SSA-based optimization algorithms either.
The GCC compiler originally used different representation schemes for unaliased scalar variables and for aliased memory-residing objects. To avoid compro-
mising the compiler’s optimization when dealing with memory operations, Novillo
proposed a unified approach for representing both scalars and arbitrary memory
expressions in their SSA form [211]. They do not use Hashed SSA, but their
approach to representing aliasing in its SSA form is very similar to ours. They
define the virtual operators VDEF and VUSE, which correspond to our χ - and
μ-functions. They started out by creating symbol names for any memory-residing
program entities. This resulted in an explosion in the number of VDEFs and VUSEs
inserted. They then used assignment factoring to cut down the number of these
virtual operators, in which memory symbols are partitioned into groups. The virtual
operators were then inserted on a per-group basis, thus reducing compilation time.
Since the reduced representation precision has a negative effect on optimizations,
they provided different partitioning schemes to reduce the impact on optimizations.
One class of indirectly accessed memory objects is the array, in which each element
is addressed via an index or subscript expression. HSSA distinguishes among
different elements of an array only to the extent of determining if the address
expressions compute to the same value or not. When two address expressions have
the same value, the two indirect memory references are definitely the same. But
when the two address expressions cannot be determined to be the same, they may
still be the same. Thus, HSSA cannot provide definitely different information. In
contrast, the array SSA form can enable more accurate program analysis among
accesses to array elements by incorporating the indices into the SSA representation.
The array SSA form can capture element-level use-defs, whereas HSSA cannot. In
addition, the heap memory storage area can be modelled using abstract arrays that
represent disjoint subsets of the heap, with pointers to the heap being treated like
indices. Array SSA is covered in detail in Chap. 17. When array references are affine
expressions of loop indices, the appropriate representation for dependencies is the
so-called dependence relations in the polyhedral model [116].
Chapter 17
Array SSA Form
In this chapter, we introduce an Array SSA form that captures element-level data-
flow information for array variables and coincides with standard SSA form when
applied to scalar variables. Any program with arbitrary control-flow structures and
arbitrary array subscript expressions can be automatically converted to this Array
SSA form, thereby making it applicable to structures, heap objects, and any other
data structure that can be modelled as a logical array. A key extension over standard
SSA form is the introduction of a definition-φ function (denoted φdef) that is capable of merging values from distinct array definitions on an element-by-element
basis. There are several potential applications of Array SSA form in compiler
analysis and optimization of sequential and parallel programs. In this chapter, we
focus on sequential programs and use constant propagation as an exemplar of a
program analysis that can be extended to array variables using Array SSA form and
redundant load elimination as an exemplar of a program optimization that can be
extended to heap objects using Array SSA form.
The rest of the chapter is organized as follows. Section 17.1 introduces full Array
SSA form for runtime evaluation and partial Array SSA form for static analysis.
Section 17.2 extends the scalar SSA constant propagation algorithm to enable
constant propagation through array elements. This includes an extension to the
constant propagation lattice to efficiently record information about array elements
and an extension to the worklist algorithm to support definition-φ functions
V. Sarkar
Georgia Institute of Technology, Atlanta, GA, USA
K. Knobe
Rice University, Houston, TX, USA
S. Fink
Facebook, Yorktown Heights, NY, USA
To introduce full Array SSA form with runtime evaluation of φ-functions, we use
the concept of an iteration vector to differentiate among multiple dynamic instances
of a static definition, Sk , that occur in the same dynamic instance of Sk ’s enclosing
procedure, f (). Let n be the number of loops that enclose Sk in procedure f ().
These loops could be for-loops, while-loops, or even loops constructed out of goto
statements. For convenience, we treat the outermost region of acyclic control flow
in a procedure as a dummy outermost loop with a single iteration, thereby ensuring
that n ≥ 1.
A single point in the iteration space is specified by the iteration vector i =
(i1 , . . . , in ), which is an n-tuple of iteration numbers, one for each enclosing loop.
For convenience, this definition of iteration vectors assumes that all loops are single-
entry, or equivalently, that the control-flow graph is reducible. (This assumption is
not necessary for partial Array SSA form.) For single-entry loops, we know that
each def executes at most once in a given iteration of its surrounding loops, and
hence the iteration vector serves the purpose of a “timestamp.” The key extensions
in Array SSA form relative to standard SSA form are as follows:
1. Renamed array variables: All array variables are renamed so as to satisfy the
Static Single Assignment property. Analogous to standard SSA form, control φ-functions are introduced to generate new names for merging two or more
prior definitions at control-flow join points and to ensure that each use refers
to precisely one definition.
2. Array-valued @ variables: For each static definition Aj , we introduce an @
variable (pronounced “at variable”) @Aj that identifies the most recent iteration
vector at which definition Aj was executed. We assume that all @ variables
are initialized to the empty vector, ( ), at the start of program execution. Each
update of a single array element, Aj [k] ← . . ., is followed by the statement,
@Aj [k] ← i, where i is the iteration vector for the loops surrounding the
definition of Aj .
3. Definition φ's: A definition operator φdef is introduced in Array SSA form to deal with preserving ("non-killing") definitions of arrays. Consider A0 and A1, two renamed arrays that originated from the same array variable in the source program, such that A1[k] ← . . . is an update of a single array element and A0 is the prevailing definition at the program point just prior to the definition of A1. A definition φ, A2 ← φdef(A1, @A1, A0, @A0), is inserted immediately after the definitions for A1 and @A1. Since definition A1 only updates one element of A0,
The key extension over the scalar case is that the conditional expression specifies
an element-level merge of arrays A1 and A0 .
Figure 17.1 shows an example program with an array variable and the conversion
of the program to full Array SSA form as defined above.
We now introduce a partial Array SSA form for static analysis, which serves as
an approximation of full Array SSA form. Consider a (control or definition) φ-function, A2 ← φ(A1, @A1, A0, @A0). A static analysis will need to approximate the computation of this φ-function by some data-flow transfer function, Lφ. The inputs and output of Lφ will be lattice elements for scalar/array variables that are compile-time approximations of their runtime values. We use the notation L(V) to denote the lattice element for a scalar or array variable V. Therefore, the statement,
Fig. 17.1 Example program with array variables and its conversion to full array SSA form
In this section, we describe the lattice representation used to model array values for
constant propagation. Let U^A_ind and U^A_elem be the universal set of index values and the universal set of array element values, respectively, for an array variable A. For an array variable, the set denoted by lattice element L(A) is a subset of index–element pairs in U^A_ind × U^A_elem. There are three kinds of lattice elements for array variables
that are of interest in our framework:
1. L(A) = ⊤ ⇒ SET(L(A)) = { }
This "top" case indicates that the set of possible index–element pairs that have been identified thus far for A is the empty set, { }.
2. L(A) = ⟨(i1, e1), (i2, e2), . . .⟩
⇒ SET(L(A)) = {(i1, e1), (i2, e2), . . .} ∪ ((U^A_ind − {i1, i2, . . .}) × U^A_elem)
The lattice element for this "constant" case is represented by a finite list of constant index–element pairs, ⟨(i1, e1), (i2, e2), . . .⟩. The constant indices, i1, i2, . . ., must represent distinct (non-equal) index values. The meaning of this lattice element is that the current stage of analysis has identified some finite number of constant index–element pairs for array variable A, such that A[i1] = e1, A[i2] = e2, etc. All other elements of A are assumed to be non-constant. (Extensions to handle non-constant indices are described in Sect. 17.2.2.)
3. L(A) = ⊥ ⇒ SET(L(A)) = U^A_ind × U^A_elem
This "bottom" case indicates that, according to the approximation in the current stage of analysis, array A may take on any value from the universal set of index–element pairs. Note that L(A) = ⊥ is equivalent to an empty list, L(A) = ⟨⟩, in case (2) above; they both denote the universal set of index–element pairs.
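As a worked example (hypothetical assignments), suppose the analysis has so far seen only A[1] ← 3 and A[2] ← 4. Then L(A) = ⟨(1, 3), (2, 4)⟩, and SET(L(A)) = {(1, 3), (2, 4)} ∪ ((U^A_ind − {1, 2}) × U^A_elem): indices 1 and 2 hold known constants, while every other index may hold any value.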
Fig. 17.2 Lattice computation for L(A1[k]) = L[](L(A1), L(k)), where A1[k] is an array element read operator
Fig. 17.3 Lattice computation for L(A1) = Ld[](L(k), L(i)), where A1[k] ← i is an array element write operator
We now describe how array lattice elements are computed for the various operations that appear in Array SSA form. We start with the simplest operation, viz., a read access to an array element. Figure 17.2 shows how L(A1[k]), the lattice element for array reference A1[k], is computed as a function of L(A1) and L(k), the lattice elements for A1 and k. We denote this function by L[], i.e., L(A1[k]) = L[](L(A1), L(k)). The interesting case in Fig. 17.2 occurs in the middle cell, when neither L(A1) nor L(k) is ⊤ or ⊥. The notation DS in the middle cell in Fig. 17.2 represents a "definitely same" binary relation, i.e., DS(a, b) = true if and only if a and b are known to have exactly the same value.
Next, consider a write access of an array element, which in general has the form A1[k] ← i. Figure 17.3 shows how L(A1), the lattice element for the array being written into, is computed as a function of L(k) and L(i), the lattice elements for k and i. We denote this function by Ld[], i.e., L(A1) = Ld[](L(k), L(i)). As before, the interesting case in Fig. 17.3 occurs in the middle cell, when both L(k) and L(i) are constant. For this case, the value returned for L(A1) is simply the singleton list, ⟨(L(k), L(i))⟩, which contains exactly one constant index–element pair.
Now, we turn our attention to the φ-functions. Consider a definition-φ-function of the form A2 ← φdef(A1, A0). The lattice computation for L(A2) = Lφdef(L(A1), L(A0)) is shown in Fig. 17.4. Since A1 corresponds to a definition of a single array element, the list for L(A1) can contain at most one pair (see Fig. 17.3). Therefore, the three cases considered for L(A1) in Fig. 17.4 are L(A1) = ⊤, L(A1) = ⟨(i′, e′)⟩, and L(A1) = ⊥.
Fig. 17.4 Lattice computation for L(A2) = Lφdef(L(A1), L(A0)), where A2 ← φdef(A1, A0) is a definition-φ operation
Fig. 17.5 Lattice computation for L(A2) = Lφ(L(A1), L(A0)) = L(A1) ⊓ L(A0), where A2 ← φ(A1, A0) is a control-φ operation
The notation UPDATE((i′, e′), ⟨(i1, e1), . . .⟩) used in the middle cell in Fig. 17.4 denotes a special update of the list L(A0) = ⟨(i1, e1), . . .⟩ with respect to the constant index–element pair (i′, e′). UPDATE involves four steps:
1. Compute the list T = { (ij, ej) | (ij, ej) ∈ L(A0) and DD(i′, ij) = true }. Analogous to DS, DD denotes a "definitely different" binary relation, i.e., DD(a, b) = true if and only if a and b are known to have distinct (non-equal) values.
2. Insert the pair (i′, e′) into T to obtain a new list, I.
3. (Optional) If there is a desire to bound the height of the lattice due to compile-time considerations and the size of list I exceeds a threshold size Z, then one of the pairs in I can be dropped from the output list so as to satisfy the size constraint.
4. Return I as the value of UPDATE((i′, e′), ⟨(i1, e1), . . .⟩).
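As a worked instance (hypothetical values), let L(A0) = ⟨(1, 10), (2, 20)⟩ and let the new pair be (i′, e′) = (1, 30). Step 1 keeps only (2, 20), since DD(1, 2) = true while DD(1, 1) = false; step 2 inserts (1, 30). Hence UPDATE((1, 30), ⟨(1, 10), (2, 20)⟩) = ⟨(1, 30), (2, 20)⟩: the stale pair (1, 10) is killed by the new store.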
Finally, consider a control φ-function that merges two array values, A2 ← φ(A1, A0). The join operator (⊓) is used to compute L(A2), the lattice element for A2, as a function of L(A1) and L(A0), the lattice elements for A1 and A0, i.e., L(A2) = Lφ(L(A1), L(A0)) = L(A1) ⊓ L(A0). The rules for computing this join operator are shown in Fig. 17.5, depending on the different cases for L(A1) and L(A0). The notation L(A1) ∩ L(A0) used in the middle cell in Fig. 17.5 denotes a simple intersection of the lists L(A1) and L(A0): the result is a list of the pairs that appear in both L(A1) and L(A0).
We conclude this section by discussing the example program in Fig. 17.6a. The
partial Array SSA form for this example is shown in Fig. 17.6b, and the data-flow
equations for this example are shown in Fig. 17.7a. Each assignment statement in
the partial Array SSA form (in Fig. 17.6b) results in one data-flow equation (in
Fig. 17.7a); the numbering S1 through S8 indicates the correspondence. Any solver
Fig. 17.6 Sparse constant propagation example (a) and its Array SSA form (b)
Fig. 17.7 (a) Data-flow equations for the sparse constant propagation example of Fig. 17.6 and
their solutions assuming (b) I is unknown or (c) known to be equal to 3
can be used for these data-flow equations, including the standard worklist-based
algorithm for constant propagation using scalar SSA form. The fixpoint solution is
shown in Fig. 17.7b. This solution was obtained assuming L (I ) = ⊥. If, instead,
variable I is known to equal 3, i.e., L (I ) = 3, then the lattice variables that would
be obtained after the fixpoint iteration step has completed are shown in Fig. 17.7c.
In either case (L (I ) = ⊥ or L (I ) = 3), the resulting array element constants
revealed by the algorithm can be used in whatever analyses or transformations the
compiler considers to be profitable to perform.
Consider now a program in which an array element a[i] is written with a constant
value and later read using the same index, so that the read returns a constant even
though the index/subscript value i is not a constant. We would like to extend
the framework from Sect. 17.2.1 to be able to recognize the read of a[i] as constant
in such programs. Two key extensions need to be considered for non-
constant (symbolic) subscript values:
• For constants C1 and C2, DS(C1, C2) = ¬DD(C1, C2). However, for two
symbols S1 and S2, it is possible that both DS(S1, S2) and DD(S1, S2) are
false, that is, we do not know if they are the same or different.
• For constants C1 and C2, the values for DS(C1, C2) and DD(C1, C2) can be
computed by inspection. For symbolic indices, however, some program analysis
is necessary to compute the DS and DD relations.
We now discuss the compile-time computation of DS and DD for symbolic
indices. Observe that, given index values I1 and I2 , only one of the following three
cases is possible:
Case 1: DS (I1 , I2 ) = false; DD(I1 , I2 ) = false
Case 2: DS (I1 , I2 ) = true; DD(I1 , I2 ) = false
Case 3: DS (I1 , I2 ) = false; DD(I1 , I2 ) = true
The first case is the most conservative solution. In the absence of any other
knowledge, it is always correct to state that DS (I1 , I2 ) = false and DD(I1 , I2 ) =
false.
The problem of determining if two symbolic index values are the same is
equivalent to the classical problem of global value numbering. If two indices i
and j have the same value number, then DS(i, j) must be true. The problem of
computing DD is more complex. Note that DD, unlike DS, is not an equivalence
relation because DD is not transitive: if DD(A, B) = true and DD(B, C) = true,
it does not follow that DD(A, C) = true. However, we can leverage past work on
array dependence analysis to identify cases for which DD can be evaluated to true.
For example, it is clear that DD(i, i + 1) = true, and that DD(i, 0) = true if i is a
loop index variable that is known to be ≥ 1.
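A plausible sketch of both relations: DS answered through value numbers, and DD through a simple "base plus constant offset" view of an index (the helper representation is an illustrative assumption, not the chapter's algorithm):

    def ds(i1, i2, value_number):
        # Same value number implies definitely the same value.
        return value_number(i1) == value_number(i2)

    def dd(i1, i2, as_affine):
        # as_affine maps an index to (base_value_number, constant_offset) when
        # it has the form base + c, and to None otherwise.  Indices i and i + 1
        # share a base but differ in offset, hence are definitely different.
        a1, a2 = as_affine(i1), as_affine(i2)
        if a1 is not None and a2 is not None and a1[0] == a2[0]:
            return a1[1] != a2[1]
        return False   # conservative default: not known to be different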
Let us consider how the DS and DD relations for symbolic index values are
used by our constant propagation algorithms. Note that the specification of how DS
and DD are used is a separate issue from the precision of the DS and DD values.
We now describe how the lattice and the lattice operations presented in Sect. 17.2.1
can be extended to deal with non-constant subscripts.
First, consider the lattice itself. The ⊤ and ⊥ lattice elements retain the same
meaning as in Sect. 17.2.1, viz., SET(⊤) = { } and SET(⊥) = U^A_ind × U^A_elem. Each
element in the lattice is a list of index–value pairs where the value is still required
to be constant but the index may be symbolic—the index is represented by its value
number (VN(i) for SSA variable i).
Fig. 17.9 Lattice computation for L(A1[k]) = L[](L(A1), L(k)), where A1[k] is an array
element read operator. If L(k) = VN(k), the lattice value of index k is a value number that
represents a constant or a symbolic value
Fig. 17.10 Lattice computation for L(A1) = Ld[](L(k), L(i)), where A1[k] ← i is an array
element write operator. If L(k) = VN(k), the lattice value of index k is a value number that
represents a constant or a symbolic value
We now revisit the processing of an array element read of A1[k] and the
processing of an array element write of A1[k]. These operations were presented in
Sect. 17.2.1 (Figs. 17.2 and 17.3) for constant indices. The versions for non-constant
indices appear in Figs. 17.9 and 17.10. For the read operation in Fig. 17.9, if there
exists a pair (ij, ej) such that DS(ij, k) = true (i.e., ij and VN(k) have the same
value number), then the result is ej. Otherwise, the result is ⊤ or ⊥ as specified in
Fig. 17.9. For the write operation in Fig. 17.10, if the value of the right-hand side, i,
is a constant, the result is the singleton list ⟨(VN(k), L(i))⟩. Otherwise, the result is
⊤ or ⊥ as specified in Fig. 17.10.
Let us now consider the propagation of lattice values through φdef operators.
The only extension required relative to Fig. 17.4 is that the DD relation used in
performing the UPDATE operation should be able to determine when DD(i′, ij) =
true if i′ and ij are symbolic value numbers rather than constants. (If no symbolic
information is available for i′ and ij, then it is always safe to return DD(i′, ij) =
false.)
We first show how the definitely same and definitely different analyses can be
extended to heap array indices (Sect. 17.3.2), followed by a scalar promotion
transformation that uses the analysis results to perform load elimination
(Sect. 17.3.3).
In this section, we show how global value numbering and allocation-site information
can be used to efficiently compute definitely same (DS) and definitely different
(DD) information for heap array indices, thereby reducing the need for pointer
analysis queries.
The main program analysis needed to enable redundant load elimination is index
propagation, which identifies the set of indices that are available at a specific def/use
Ai of heap array A. Index propagation is a data-flow problem, the goal of which is to
compute a lattice value L(H) for each renamed heap variable H in the Array SSA
form such that a load of H[i] is available if VN(i) ∈ L(H). Note that the lattice
element does not include the value of H[i] (as in constant propagation), just the
fact that it is available. Figures 17.12, 17.13, and 17.14 give the lattice computations
Fig. 17.12 Lattice computation for L (A2 ) = Lφdef (L (A1 ), L (A0 )) where A2 ← φdef (A1 , A0 )
is a definition φ operation
Fig. 17.13 Lattice computation for L (A2 ) = Lφuse (L (A1 ), L (A0 )) where A2 ← φuse (A1 , A0 )
is a use φ operation
Fig. 17.14 Lattice computation for L(A2) = Lφ(L(A1), L(A0)) = L(A1) ⊔ L(A0), where
A2 ← φ(A1, A0) is a control φ operation
Fig. 17.15 Trace of index propagation and load elimination transformation for program in
Fig. 17.11a
which define the index propagation solution. The notation UPDATE(i′, ⟨i1, . . .⟩) used
in the middle cell in Fig. 17.12 denotes a special update of the list L(A0) = ⟨i1, . . .⟩
with respect to index i′. UPDATE involves four steps (a sketch follows this list):
1. Compute the list T = { ij | ij ∈ L(A0) and DD(i′, ij) = true }. List T
contains only those indices from L(A0) that are definitely different from i′.
2. Insert i′ into T to obtain a new list, I.
3. (Optional) As before, if there is a desire to bound the height of the lattice due
to compile-time considerations and the size of list I exceeds a threshold size Z,
then any one of the indices in I can be dropped from the output list.
4. Return I as the value of UPDATE(i′, ⟨i1, . . .⟩).
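This mirrors the UPDATE of Sect. 17.2.1, except that the lattice now tracks only availability, so plain index value numbers replace index–element pairs (a sketch, with dd passed in as the definitely-different test):

    def update_indices(i_new, available, dd, max_size=None):
        # UPDATE(i', <i1, ...>) from Fig. 17.12: keep the available indices
        # that are definitely different from the newly written index, then
        # add the new index itself.
        result = [i for i in available if dd(i_new, i)]
        result.append(i_new)
        if max_size is not None and len(result) > max_size:
            result.pop(0)   # optional bound on the lattice height
        return result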
After index propagation, the algorithm selects a load Aj[x] for scalar promotion
if and only if index propagation determines that an index with value number
VN(x) is available at the def of Aj. Figure 17.15 illustrates a trace of this load
elimination algorithm for the example program in Fig. 17.11a. Figure 17.15a shows
the extended Array SSA form computed for this example program. The results
of index propagation are shown in Fig. 17.15b. These results depend on definitely
different analysis establishing that VN(p) ≠ VN(q) and definitely same analysis
establishing that VN(p) = VN(r). Figure 17.15c shows the transformed code after
performing the scalar promotion actions. The load of p.x has thus been eliminated
in the transformed code and replaced by a use of the scalar temporary, T1.
In addition to reading the other chapters in this book for related topics, the interested
reader can consult [168] for details on full Array SSA form, [169] for details on
constant propagation using Array SSA form, [120] for details on load and store
elimination for pointer-based objects using Array SSA form, [253] for efficient
dependence analysis of pointer-based array objects, and [18] for extensions of this
load elimination algorithm to parallel programs.
There are many possible directions for future research based on this work. The
definitely same and definitely different analyses outlined in this chapter are sound
but conservative. In restricted cases, they can be made more precise using array
subscript analysis from polyhedral compilation frameworks. Achieving a robust
integration of Array SSA and polyhedral approaches is an interesting goal for future
research. Past work on Fuzzy Array Data-flow Analysis (FADA) [19] may provide
a useful foundation for exploring such an integration. Another interesting direction
is to extend the value numbering and definitely different analyses mentioned in
Sect. 17.2.2 so that they can be combined with constant propagation rather than
performed as a pre-pass. An ultimate goal is to combine conditional constant
propagation, type propagation, value numbering, partial redundancy elimination,
and scalar promotion analyses
within a single framework that can be used to analyse scalar variables, array
variables, and pointer objects with a unified approach.
Part IV
Machine Code Generation and Optimization
Chapter 18
SSA Form and Code Generation
B. Dupont de Dinechin
In high-end compilers such as Open64 [64, 65], GCC [268], or LLVM [176],
code generation techniques have significantly evolved, as they are mainly respon-
sible for exploiting the performance-oriented features of architectures and micro-
architectures. In those compilers, code generator optimizations include:
• If-conversion using select, conditional move, or predicated instructions
(Chap. 20)
• Use of specialized addressing modes such as auto-modified addressing [180] and
modulo addressing [66]
• Exploitation of hardware looping [183] or static branch prediction hints
• Matching fixed-point arithmetic and SIMD idioms to special instructions
• Optimization with regard to memory hierarchy, including cache prefetching and
register preloading [99]
• VLIW instruction bundling, where parallel instruction groups constructed by
postpass instruction scheduling are encoded into instruction bundles [159]
This sophistication of modern compiler code generation is one of the reasons
behind the introduction of the SSA form for machine code, in order to simplify
certain analyses and optimizations. In particular, liveness analysis (Chap. 9), if-
conversion (Chap. 20), unrolling-based loop optimizations (Chap. 10), and exploita-
tion of special instructions or addressing modes benefit significantly from the SSA
form. Chapter 19 presents an advanced technique for instruction selection on the
SSA form by solving a specialized quadratic assignment problem (PBQP). Although
there is a debate as to whether or not SSA form should be used in a register allocator,
Chap. 22 makes a convincing case for it. The challenge of correct and efficient SSA
form destruction under the constraints of machine code is addressed in Chap. 21.
Finally, Chap. 23 illustrates how the SSA form has been successfully applied to
hardware compilation.
Drawing on our experience with a family of production code generators
and linear assembly optimizers for the ST120 DSP core [102, 103, 240, 275] and
the Lx/ST200 VLIW family [34–36, 100, 101, 113], this chapter reviews some of
the issues of using the SSA form in a code generator. Section 18.1 presents the
challenges of maintaining the SSA form on a program representation based on
machine instructions. Section 18.2 discusses two code generator optimizations that
seem at odds with the SSA form, yet must occur before register allocation. One is
if-conversion, whose modern formulations require an extension of the SSA form.
The other is prepass instruction scheduling, for which the benefit of using the SSA
form has not been assessed by any implementation yet. Using the SSA form at
machine-code level requires the ability to construct and destruct SSA form at that
level. Section 18.3 characterizes various SSA form destruction algorithms in terms
of satisfying the constraints of machine code.
• Associate a semantic combinator [300], that is, a tree of IR-like operators, to each
target operand of a machine instruction. This alternative has been implemented
in the SML/NJ [182] compiler and the LAO compiler [102].
An issue related to the representation of instruction semantics is how to encode
it. Most information can be statically tabulated by the instruction operator, yet
properties such as safety for control speculation, or equivalence to a simple IR
instruction, can be refined by the context where the instruction appears. For instance,
range propagation may ensure that an addition cannot overflow, that a division
by zero is impossible, or that a memory access is safe for control speculation.
Context-dependent semantics, which needs to be associated with specific machine
instructions in the code generator’s internal representation, can be provided as
annotations that override the statically tabulated information.
Finally, code generation for some instruction set architectures requires that
pseudo-instructions with standard semantics be available, as well as variants of φ-
functions and parallel copy operations.
• Machine instructions that operate on register pairs, such as the long multiplies on
the ARM, or more generally on register tuples, must be handled. In such cases
there is a need for pseudo-instructions to compose wide operands in register
tuples, and to independently extract register allocatable operands from wide
operands.
• Embedded processor architectures such as the Tensilica Xtensa [130] provide
zero-overhead loops (hardware loops), where an implicit conditional branch
back to the loop header is taken whenever the program counter matches some
addresses. The implied loop-back branch is also conveniently materialized by a
pseudo-instruction.
• Register allocation for predicated architectures requires that the live ranges
of temporary variables with predicated definitions be contained by pseudo-
instructions [128] that provide backward kill points for liveness analysis.
The SSA form requires variable definitions to be killing definitions (see e.g. [208]
or kill points in data-flow analysis). This is not the case for target operands such
as a status register, which contains several independent bit-fields. Moreover, some
instruction effects on bit-fields may be sticky, that is, with an implied disjunction
with the previous value. Typical sticky bits include exception flags of the IEEE 754
arithmetic, or the integer overflow flag on DSPs with fixed-point arithmetic.
When mapping a status register to an SSA variable, any operation that partially
reads or modifies the register bit-fields should appear as reading and writing the
corresponding variable.
Predicated execution and conditional execution are other sources of definitions
that do not kill their target register. The execution of predicated instructions is
guarded by the evaluation of a single bit operand. The execution of conditional
instructions is guarded by the evaluation of a condition on a multi-bit operand. We
extend the ISA classification of [193] to include four classes:
Live-in and live-out sets at basic block boundaries are also candidates for
program representation invariants. However, when using and updating liveness
information under the SSA form, it appears convenient to distinguish the φ-function
contributions from the results of data-flow fixed-point computation. In particular,
Sreedhar et al. [267] introduced the φ-function semantics that became later known
as multiplexing mode (see Chap. 21), where a φ-function B0 : a0 = φ(B1 :
a1 , . . . , Bn : an ) makes a0 live-in of basic block B0 , and a1 , . . . an live-out of basic
blocks B1 , . . . Bn . The classical basic block invariants LiveIn(B) and LiveOut(B)
are then complemented with PhiDefs(B) and PhiUses(B).
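Under multiplexing mode, these φ contributions can be collected in a single pass over the φ-functions, independently of the data-flow fixed point. A minimal sketch, with an illustrative encoding of φ-functions:

    def phi_liveness_sets(phis):
        # phis: iterable of (b0, a0, [(b1, a1), ..., (bn, an)]) tuples encoding
        # B0: a0 = phi(B1: a1, ..., Bn: an) under multiplexing-mode semantics.
        phi_defs, phi_uses = {}, {}
        for b0, a0, sources in phis:
            phi_defs.setdefault(b0, set()).add(a0)      # a0 is live-in of B0
            for bi, ai in sources:
                phi_uses.setdefault(bi, set()).add(ai)  # ai is live-out of Bi
        return phi_defs, phi_uses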
Finally, some compilers adopt the invariant that the SSA form be conventional
across the code generation phases. This approach is motivated by the fact that
classical optimizations such as SSAPRE [164] (see Chap. 11) require that “the live
ranges of different versions of the same original program variable do not overlap,”
implying that the SSA form should be conventional. Other compilers that use
SSA numbers and omit the φ-functions from the program representation [175] are
similarly constrained. Work by Sreedhar et al. [267] and by Boissinot et al. [35] has
clarified how to convert the transformed SSA form into conventional form wherever
required, so there is now no reason for this property to be an invariant.
18.2.1 If-Conversion
K function by or-ing multiple predicates. The RK algorithm is simplified for the
case of single-entry single-exit regions and adapted to the proposed architectural
support.
The other contribution is the generation of PHI-ops, whose insertion points are
computed like the SSA form placement of φ-functions. The φ-functions’ source
operands are replaced by phi-lists, where each operand is associated with the
predicate of its source basic block. The phi-lists are processed in topological
order of the predicates to generate the PHI-ops.
If-Conversion Under SSA Form
The ability to perform if-conversion on the SSA form of a program representation
requires the handling of operations that do not kill the target operand because of
predicated or conditional execution. The following papers address this issue:
Stoutchinin and Ferrière [275] Introduction of ψ-functions in order to represent
fully predicated code under the SSA form, which is then called the ψ-SSA
form. The ψ-functions’ arguments are paired with predicates and are ordered in
dominance order in the ψ-function argument list. This ordering is a correctness
condition re-discovered by Chuang et al. [74] for their PHI-ops. The ψ-SSA form
is presented in Chap. 15.
Stoutchinin and Gao [276] Proposition of an if-conversion technique based on
the predication of Fang [112] and the replacement of φ-functions by ψ-
functions. The authors prove that the conversion is correct provided that the SSA
form is conventional. The technique is implemented in Open64 for the IA-64
architecture.
Bruel [52] The technique targets VLIW architectures with select and dismis-
sible load instructions. The proposed framework reduces acyclic control-flow
constructs from innermost to outermost. A benefit criterion is used to stop the
reduction process. The core technique performs control speculation in addition
to tail duplication, and reduces the height of predicate computations. It can
also generate ψ-functions instead of select operations. A generalization of
this framework, which also accepts ψ-SSA form as input, is described in
Chap. 20.
Ferrière [104] Extension of the ψ-SSA form algorithms of [275] to architectures
with partial predicated execution support, by formulating simple correctness
conditions for the predicate promotion of operations that do not have side-effects.
This work also details how to transform the ψ-SSA form to conventional ψ-
SSA form by generating cmov operations. A self-contained explanation of these
techniques appears in Chap. 15.
Thanks to these contributions, virtually all if-conversion techniques formulated
without the SSA form can be adapted to the ψ-SSA form, with the added benefit
that already predicated code may be part of the input. In practice, these contributions
follow the generic steps of if-conversion proposed by Fang [112]:
• If-conversion region selection
• Code hoisting and sinking of common sub-expressions
• Assignment of predicates to the basic blocks
Further down the code generator, the last major phase before register allocation is
prepass instruction scheduling. Innermost loops with a single basic block, super-
block, or hyper-block body are candidates for software pipelining techniques
such as modulo scheduling [241]. For innermost loops that are not software
pipelined, and for other program regions, acyclic instruction scheduling techniques
apply: basic block scheduling [131]; super-block scheduling [146]; hyper-block
scheduling [191]; tree region scheduling [140]; or trace scheduling [189].
By definition, prepass instruction scheduling operates before register allocation.
At this stage, instruction operands are mostly virtual registers, except for instruc-
tions with ISA or ABI constraints that bind them to specific architectural registers.
The destruction of the SSA form in a code generator is required at some point.
A weaker form of SSA destruction is the conversion of transformed SSA form
to conventional SSA form, which is required by a few classical SSA form
optimizations such as SSAPRE (see Chap. 11). For all such cases, the main objective
is the lowering to machine-code representation (getting rid of pseudo-instructions
and satisfying naming constraints) by inserting the necessary copy/spill instructions.
Cytron et al. [90] First technique for translating out of SSA, by “naive replace-
ment preceded by dead code elimination and followed by colouring.” The authors
replace each φ-function B0 : a0 = φ(B1 : a1 , . . . , Bn : an ) by n copies a0 = ai ,
one per basic block Bi , before applying Chaitin-style coalescing.
Briggs et al. [50] The correctness issues of the out-of-(transformed-)SSA
translation of Cytron et al. [90] are identified and illustrated by the lost-copy
problem and the swap problem. These problems appear in relation to critical edges and
when the parallel assignment semantics of a sequence of φ-functions at the start
of a basic block is not accounted for [35]. Two SSA form destruction algorithms
are proposed, depending on the presence of critical edges in the control-flow
graph. However, the need for parallel copy operations to represent code after
φ-function removal is not recognized.
Sreedhar et al. [267] This work is based on the definition of φ-congruence
classes as the sets of SSA variables that are transitively connected by a φ-
function. When none of the φ-congruence classes has members that interfere, the
SSA form is called conventional and its destruction is trivial: replace all the SSA
variables of a φ-congruence class by a temporary variable, and remove the φ-
functions. In general, the SSA form is transformed after program optimizations,
that is, some φ-congruence classes contain interferences. In Method I, the SSA
form is made conventional by isolating φ-functions using copies both at the end
of direct predecessor blocks and at the start of the current block. The latter
is the key for not depending on critical edge splitting [35]. The code is then
improved by running a new SSA variable coalescer that grows the φ-congruence
classes with copy-related variables, while keeping the SSA form conventional. In
Method II and Method III, the φ-congruence classes are initialized as singletons,
then merged as the φ-functions are processed. In Method II, two variables of the
current φ-function that interfere directly or through their φ-congruence classes
are isolated by inserting copy operations for both. This ensures that the φ-
congruence class that is grown from the classes of the variables related by the
current φ-function is interference-free. In Method III, if possible only one copy
operation is inserted to remove the interference, and more involved choices about
which variables to isolate from the φ-function congruence class are resolved by a
maximum independent set heuristic. Both methods are correct except for a detail
about the live-out sets to consider when testing for interferences [35].
Leung and George [182] This work is the first to address the problem of satisfy-
ing the same resource and the dedicated register operand naming constraints of
the SSA form on machine code. They identify that Chaitin-style coalescing after
SSA form destruction is not sufficient, and that adapting the SSA optimizations
to enforce operand naming constraints is not practical. They work in three steps:
collect the renaming constraints; mark the renaming conflicts; and reconstruct
code, which adapts the SSA destruction of Briggs et al. [50]. This work is also
the first to make explicit use of parallel copy operations. A few correctness issues
were later identified and corrected by Rastello et al. [240].
Budimlić et al. [53] Contribution of a lightweight SSA form destruction moti-
vated by JIT compilation. It uses the (strict) SSA form property of dominance of
variable definitions over uses to avoid the maintenance of an explicit interference
graph. Unlike previous approaches to SSA form destruction that coalesce
increasingly larger sets of non-interfering φ-related (and copy-related) variables,
they first construct SSA webs with early pruning of obviously interfering vari-
ables, then de-coalesce the SSA webs into non-interfering classes. They propose
the dominance forest explicit data structure to speed up these interference tests.
This SSA form destruction technique does not handle the operand naming
constraints, and also requires critical edge splitting.
Rastello et al. [240] The problem of satisfying the same resource and dedicated
register operand naming constraints of the SSA form on machine code is
revisited, motivated by erroneous code produced by the technique of Leung and
George [182]. Inspired by the work of Sreedhar et al. [267], they include the
φ-related variables as candidates in the coalescing that optimizes the operand
naming constraints. This work avoids the patent of Sreedhar et al. (US patent
6182284).
Boissinot et al. [35] Formulation of a generic approach to SSA form destruction
that is proved correct, handles operand naming constraints, and can be optimized
for speed (see Chap. 21 for details of this generic approach). The foundation
of this approach is to transform the program to conventional SSA form by
isolating the φ-functions like in Method I of Sreedhar et al. [267]. However,
the copy operations inserted are parallel, so a parallel copy sequentialization
algorithm is provided. The task of improving the conventional SSA form is
then seen as a classical aggressive variable coalescing problem, but thanks to
the SSA form the interference relation between SSA variables is made precise
and frugal to compute. Interference is obtained by combining the intersection of
SSA live ranges, and the equality of values, which is easily tracked under the
SSA form across copy operations. Moreover, the use of the dominance forest
data structure of Budimlić et al. [53] to speed up interference tests between
congruence classes is obviated by a linear traversal of these classes in pre-order of
the dominance tree. Finally, the same resource operand constraints are managed
by pre-coalescing, and the dedicated register operand constraints are represented
by pre-colouring the congruence classes. Congruence classes with a different
pre-colouring always interfere.
Chapter 19
Instruction Code Selection
D. Ebner, A. Krall, and B. Scholz
Fig. 19.1 Scenario: an instruction code selector translates a compiler’s IR to a low-level machine-
dependent representation
Fig. 19.2 Example of a data-flow tree (b) and a rule fragment with associated costs (a)
has an associated semantic action that is used to emit the corresponding machine
instructions, either by constructing a new intermediate representation or by rewriting
the DFT bottom-up.
An example of a DFT along with a set of rules representing valid ARM
instructions is shown in Fig. 19.2. Each rule consists of non-terminals (shown in
lower case) and terminal symbols (shown in upper case). Non-terminals are used to
chain individual rules together. Non-terminal s denotes a distinguished start symbol
for the root node of the DFT. Terminal symbols match the corresponding labels of
nodes in the data-flow trees. The terminals of the grammar are VAR, CST, SHL,
ADD, and LD. Rules that translate from one non-terminal to another are called chain
rules, e.g., reg ← imm, which translates an immediate value to a register. Note
that there are multiple possibilities to obtain a cover of the data-flow tree for the
example shown in Fig. 19.2b. Each rule has associated costs. The cost of a tree
cover is the sum of the costs of the selected rules. For example, the DFT could be
covered by rules R3 , R4 , and R10 , which would give a total cost for the cover of
one cost unit. Alternatively, the DFT could be covered by rules R2 , R3 , R5 , R7 ,
and R8 yielding four cost units for the cover for issuing four assembly instructions.
A dynamic programming algorithm selects a cost-optimal cover for the DFT.
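To make the dynamic programming concrete, the following sketch labels a DFT bottom-up with the cheapest cost per non-terminal. The rules and costs are illustrative stand-ins for R1–R10 of Fig. 19.2, not the exact grammar:

    # Base rules: (nonterminal, operator, child_nonterminals, cost).
    BASE = [
        ("imm", "CST", (), 0),
        ("reg", "VAR", (), 0),
        ("reg", "SHL", ("reg", "imm"), 1),
        ("reg", "ADD", ("reg", "reg"), 1),
        ("reg", "LD",  ("reg",), 1),
    ]
    CHAIN = [("reg", "imm", 1)]   # chain rules: (dst, src, cost), reg <- imm

    def label(node):
        # node = (operator, child_nodes); returns {nonterminal: best_cost}.
        op, children = node
        child_costs = [label(c) for c in children]
        costs = {}
        for nt, rop, kids, c in BASE:
            if rop != op or len(kids) != len(children):
                continue
            total = c
            for want, have in zip(kids, child_costs):
                if want not in have:
                    break
                total += have[want]
            else:
                costs[nt] = min(costs.get(nt, float("inf")), total)
        # Apply chain rules until a fixed point; cheapest derivations win.
        changed = True
        while changed:
            changed = False
            for dst, src, cc in CHAIN:
                if src in costs and costs[src] + cc < costs.get(dst, float("inf")):
                    costs[dst] = costs[src] + cc
                    changed = True
        return costs

    tree = ("SHL", (("VAR", ()), ("CST", ())))
    print(label(tree)["reg"])   # cheapest cover deriving "reg": cost 1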
Tree-pattern matching on a DFT is limited to the scope of tree structures.
To overcome this limitation, we can extend the scope of the matching algorithm
to the computational flow of a whole procedure. The use of the SSA form as
Fig. 19.3 Instruction code selection SSA Graph for a vector dot-product in fixed-point arithmetic.
“fp_” stands for unsigned short fixed-point type and “*” for pointer manipulations (like in C). The
colours of the nodes indicate which basic block the operations belong to
block, performing the intermediate calculations in the extended format. Note that a
tree-pattern matcher generates code at statement level, and hence the information of
having values as double-precision cannot be hoisted across basic block boundaries.
However, an instruction code selector that is operating on the SSA graph is able to
propagate non-terminal fp2 across the φ node prior to the return and emits the code
for the shift to the right in the return block.
In the following, we will explain how to perform instruction code selection
on SSA graphs by means of a specialized quadratic assignment problem (PBQP).
First, we discuss the instruction code selection problem by employing a discrete
optimization problem called a partitioned boolean quadratic problem. An extension
of patterns to arbitrary acyclic graph structures, which we refer to as DAG
grammars, is discussed in Sect. 19.2.1.
The matching problem for SSA graphs is reduced to a discrete optimization problem
called a Partitioned Boolean Quadratic Problem (PBQP). First, we will introduce
the PBQP problem and then we will describe the mapping of the instruction code
selection problem to PBQP.
The PBQP problem seeks an assignment of variables xi with minimum total costs.
In the following we represent both the local cost function and the related
cost function in matrix form, i.e., the related cost function C(xi, xj, di, dj) is
decomposed for each pair (xi, xj). The costs for the pair are represented as a
|Di|-by-|Dj| matrix/table Cij. A matrix element corresponds to an assignment
(di, dj). Similarly, the local cost function c(xi, di) is represented by a cost vector c⃗i
enumerating the costs of the elements. A PBQP problem has an underlying graph
structure, expressed as graph G = (V, E, C, c), which we refer to as a PBQP graph.
In this section, we describe the modelling of instruction code selection for SSA
graphs as a PBQP problem. In the basic modelling, SSA and PBQP graphs coincide.
The variables xi of the PBQP are decision variables reflecting the choices of
applicable rules (represented by Di ) for the corresponding node of xi . The local
costs reflect the costs of the rules, and the related costs reflect the costs of chain
rules making rules compatible with each other. This means that the number of
decision vectors and the number of cost matrices in the PBQP are determined by
the number of nodes and edges in the SSA graph, respectively. The sizes of Di
depend on the number of rules in the grammar. A solution for the PBQP instance
induces a complete cost-minimal cover of the SSA graph.
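For intuition, here is a brute-force solver over this encoding (exponential, for illustration only; production solvers use the reduction-based heuristics discussed at the end of the chapter):

    import itertools

    def solve_pbqp(domains, local_costs, edge_costs):
        # domains[i]       : list of alternatives for variable x_i
        # local_costs[i][d]: cost vector entry c(x_i, d)
        # edge_costs[(i,j)][di][dj]: cost matrix entry C(x_i, x_j, d_i, d_j)
        best, best_cost = None, float("inf")
        for assign in itertools.product(*[range(len(d)) for d in domains]):
            cost = sum(local_costs[i][di] for i, di in enumerate(assign))
            for (i, j), m in edge_costs.items():
                cost += m[assign[i]][assign[j]]
            if cost < best_cost:
                best, best_cost = assign, cost
        return best, best_cost

    # Two variables with two alternatives each; ∞ forbids mixed selections.
    INF = float("inf")
    domains = [["a0", "a1"], ["b0", "b1"]]
    local = [[0, 1], [2, 0]]
    edges = {(0, 1): [[0, INF], [INF, 0]]}
    print(solve_pbqp(domains, local, edges))   # -> ((1, 1), 1)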
As in traditional tree-pattern matching, an ambiguous graph grammar consisting
of tree patterns with associated costs and semantic actions is used. Input grammars
have to be normalized. This means that each rule is either a so-called base rule or a
chain rule. A base rule is a production of the form nt0 ← OP(nt1, . . . , ntk),
where the nti are non-terminals and OP is a terminal symbol, i.e., an operation
represented by a node in the SSA graph. A chain rule is a production of the form
nt0 ← nt1, where nt0 and nt1 are non-terminals. A production rule nt ←
OP1(α, OP2(β), γ) can be normalized by rewriting the rule into two production
rules nt ← OP1(α, nt′, γ) and nt′ ← OP2(β), where nt′ is a new non-terminal
symbol and α, β, and γ denote arbitrary pattern fragments. This transformation
can be iteratively applied until all production rules are either chain rules or base
rules. To illustrate this transformation, consider the grammar in Fig. 19.4, which is
Fig. 19.4 PBQP instance derived from the example shown in Fig. 19.2. The grammar has been
normalized by introducing additional non-terminals. Highlighted elements show a cost-minimal
solution
a normalized version of the tree grammar introduced in Fig. 19.2a. Temporary non-
terminal symbols t1, t2, and t3 are used to decompose larger tree patterns into
simple base rules. Each base rule spans across a single node in the SSA graph.
The instruction code selection problem for SSA graphs is modelled in PBQP
as follows. For each node u in the SSA graph, a PBQP variable xu is introduced.
The domain of variable xu is determined by the subset of base rules whose terminal
symbol matches the operation of the SSA node, e.g., there are three rules (R4 , R5 ,
R6 ) that can be used to cover the shift operation SHL in our example. The last rule is
the result of automatic normalization of a more complex tree pattern. The cost vector
c⃗u = wu · ⟨c(R1), . . . , c(Rku)⟩ of variable xu encodes the local costs for a particular
assignment, where c(Ri) denotes the associated cost of base rule Ri. Weight wu is
used as a parameter to optimize for various objectives including speed (e.g., wu is
the expected execution frequency of the operation at node u) and space (e.g., wu is
set to one). In our example, both R4 and R5 have associated costs of one. Rule R6
contributes no local costs, as we account for the full costs of a complex tree pattern
at the root node. All nodes have the same weight of one, thus the cost vector for the
SHL node is ⟨1, 1, 0⟩.
An edge in the SSA graph represents data transfer between the result of an
operation u, which is the source of the edge, and the operand v, which is the tail
of the edge. To ensure consistency among base rules and to account for the costs of
chain rules, we impose costs that are dependent on the selection of variable xu and
variable xv in the form of a cost matrix Cuv . An element in the matrix corresponds
to the costs of selecting a specific base rule ru ∈ Ru of the result and a specific
base rule rv ∈ Rv of the operand node. Assume that ru is nt ← OP(. . .) and rv
is · · · ← OP′(α, nt′, β), where nt′ is the non-terminal of operand v whose value is
obtained from the result of node u. There are three possible cases:
1. If the non-terminals nt and nt′ are identical, the corresponding element in
matrix Cuv is zero, since the result of u is compatible with the operand of node v.
2. If the non-terminals nt and nt′ differ and there exists a rule r : nt′ ← nt in
the transitive closure of all chain rules, the corresponding element in Cuv has the
costs of the chain rule, i.e., wv · c(r).
3. Otherwise, the corresponding element in Cuv has infinite costs, prohibiting the
selection of incompatible base rules.
As an example, consider the edge from CST:2 to node SHL in Fig. 19.4. There is
a single base rule R1 with local cost 0 and result non-terminal imm for the constant.
Base rules R4, R5, and R6 are applicable for the shift, the first of which expects
non-terminal reg as its second argument, while rules R5 and R6 both expect imm.
Consequently, the corresponding cost matrix accounts for the cost of converting
from imm to reg at index (1, 1) and is zero otherwise.
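The three cases can be computed once per grammar by closing the chain rules, then instantiated per edge. A sketch, where the tiny reg/imm grammar mirrors the example above:

    def chain_closure(chain_rules, nonterminals):
        # Cheapest chain-rule cost between every pair of nonterminals
        # (Floyd-Warshall over the chain-rule graph).
        INF = float("inf")
        cost = {(a, b): (0 if a == b else INF)
                for a in nonterminals for b in nonterminals}
        for dst, src, c in chain_rules:
            cost[(src, dst)] = min(cost[(src, dst)], c)
        for k in nonterminals:
            for a in nonterminals:
                for b in nonterminals:
                    via = cost[(a, k)] + cost[(k, b)]
                    if via < cost[(a, b)]:
                        cost[(a, b)] = via
        return cost

    def edge_matrix(result_nts, operand_nts, closure, weight=1):
        # C_uv: zero if the nonterminals agree, chain-rule cost if one
        # exists in the closure, infinite otherwise.
        return [[weight * closure[(r, o)] for o in operand_nts]
                for r in result_nts]

    closure = chain_closure([("reg", "imm", 1)], ["reg", "imm"])
    print(edge_matrix(["imm"], ["reg", "imm"], closure))   # -> [[1, 0]]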
Highlighted elements in Fig. 19.4 show a cost-minimal solution of the PBQP
with cost one. A solution of the PBQP directly induces a selection of base and chain
rules for the SSA graph. The execution of the semantic action rules inside a basic
block follows the order of basic blocks. Special care is necessary for chain rules
that correspond to data flow across basic blocks. Such chain rules may be placed
inefficiently, and a placement algorithm is required for some grammars.
In the previous section we have introduced an approach based on code patterns that
resemble simple tree fragments. This restriction often complicates code generators
for modern CPUs with specialized instructions and SIMD extensions, e.g., there is
no support for machine instructions with multiple results.
Consider the introductory example shown in Fig. 19.3. Many architectures have
some form of auto-increment addressing modes. On such a machine, the load and
the increment of both p and q can be done in a single instruction benefiting both
code size and performance. However, post-increment loads cannot be modelled
using a single tree-shaped pattern. Instead, it produces multiple results and spans
across two non-adjacent nodes in the SSA graph, with the only restriction that their
arguments have to be the same.
Similar examples can be found in most architectures, e.g., the DIVU instruction
in the Motorola 68K architecture performs the division and the modulo operation for
the same pair of inputs. Other examples are the RMS (read-modify-store) instructions
on the IA32/AMD64 architecture, autoincrement- and decrement-addressing modes
of several embedded systems architectures, the IRC instruction of the HPPA
architecture, or fsincos instructions of various math libraries. Compiler writers
are forced to pre- or post-process these patterns heuristically, often missing much
of the optimization potential. These architecture-specific tweaks also complicate
re-targeting, especially in situations where patterns are automatically derived from
generic architecture descriptions.
We will now outline, through the example in Fig. 19.5, a possible problem
formulation for these generalized patterns in the PBQP framework discussed so
far. The code fragment contains three feasible instances of a post-increment store
pattern. Assuming that p, q, and r point to mutually distinct memory locations,
there are no further dependencies apart from the edges shown in the SSA graph. If
we select all three instances of the post-increment store pattern concurrently, the
cover induced by SSA edges becomes cyclic and the code cannot be emitted. To
overcome this difficulty, the idea is to express in the PBQP model a numbering of
the chosen nodes that reflects the existence of a topological order in the cover,
avoiding cycles. PBQP has no constraints as such, but they can be simulated by
imposing arbitrarily high costs, denoted by ∞, for combinations that do not satisfy
the topological constraint.
Modelling
The first step is to explicitly enumerate instances of complex patterns, i.e., concrete
tuples of nodes that match the terminal symbols specified in a particular production.
There are three instances of the post-increment store pattern (surrounded by
boxes) in the example shown in Fig. 19.5. As for tree patterns, DAG patterns are
decomposed into simple base rules for the purpose of modelling, e.g., the post-
increment store pattern
P1 : stmt ← ST(x: reg, reg), reg ← INC(x) : 3
is decomposed into the individual pattern fragments
P1,1 : stmt ← ST(reg, reg)
P1,2 : reg ← INC(reg)
For our modelling, new variables are created for each enumerated instance of a
complex production. They encode whether a particular instance is chosen or not,
i.e., the domain basically consists of the elements on and off. The local costs are
set to the combined costs for the particular pattern for the on state and to 0 for
the off state. Furthermore, the domain of existing nodes is augmented with the
base rule fragments obtained from the decomposition of complex patterns. We can
safely squash all identical base rules obtained from this process into a single state.
Thus, each of these new states can be seen as a proxy for the whole set of instances
of (possibly different) complex productions including the node. The local costs for
these proxy states are set to 0.
Continuing our example, the PBQP for the SSA graph introduced in Fig. 19.5 is
shown in Fig. 19.6. In addition to the post-increment store pattern with costs three,
we assume regular tree patterns for the store and the increment nodes with costs
two denoted by P2 and P3 , respectively. Rules for the VAR nodes are omitted for
simplicity.
Nodes 1–6 correspond to the nodes in the SSA graph. Their domain is defined
by the simple base rule with costs two and the proxy state obtained from the
decomposition of the post-increment store pattern. Nodes 7, 8, and 9 correspond to
the three instances identified for the post-increment store pattern. As noted before,
we have to guarantee the existence of a topological order in the cover among the
chosen nodes. To this end, we refine the state on such that it reflects a particular
index in a concrete topological order. Matrices among these nodes account for data
dependencies, e.g., consider the matrix established among nodes 7 and 8. Assuming
instance 7 is on at index 2 (i.e., mapped to on2 ), the only remaining choices for
instance 8 are not to use the pattern (i.e., mapped to off) or to enable it at index 3
(i.e., mapped to on3 ), as node 7 has to precede node 8.
Additional cost matrices are required to ensure that the corresponding proxy state
is selected on all the variables forming a particular pattern instance (which can be
modelled with combined costs of 0 or ∞, respectively). However, this formulation
allows for the trivial solution where all of the related variables encoding the selection
Fig. 19.6 PBQP Graph for the example shown in Fig. 19.5. M is a large integer value. We use k
as a shorthand for the term 3 − 2M
of a complex pattern are set to off (accounting for 0 costs) even though the artificial
proxy state has been selected. We can overcome this problem by adding a large
integer value M to the costs for all proxy states. In exchange, we subtract these
costs from the cost vector of instances. Thus, the penalties for the proxy states are
effectively eliminated unless an invalid solution is selected.
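The arithmetic behind this encoding is easy to check; the following worked example uses the k = 3 − 2M shorthand of Fig. 19.6, with M an arbitrary large value:

    M = 10**6          # large penalty constant
    pattern_cost = 3   # cost of the post-increment store pattern
    proxies = 2        # proxy states the instance must activate

    # Instance on and both proxy states selected: the penalties cancel out.
    valid = (pattern_cost - proxies * M) + proxies * M      # -> 3
    # Proxy states selected but the instance left off: the penalty survives.
    invalid = 0 + proxies * M                               # -> 2000000
    print(valid, invalid)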
Cost matrices among nodes 1–6 do not differ from the basic approach discussed
before and reflect the costs of converting the non-terminal symbols involved. It
should be noted that for general grammars and irreducible graphs, the heuristic
solver of PBQP cannot guarantee delivering a solution that satisfies all constraints
modelled in terms of ∞ costs. This would be an NP-complete problem. One way
to work around this limitation is to include a small set of rules that cover each node
individually and that can be used as a fallback rule in situations where no feasible
solution has been obtained, which is similar to macro substitution techniques and
ensures a correct but possibly non-optimal matching. These limitations do not
apply to exact PBQP solvers such as the branch-&-bound algorithm. It is also
straightforward to extend the heuristic algorithm with a backtracking scheme on
RN reductions, which would of course also be exponential in the worst case.
Aggressive optimizations for the instruction code selection problem are enabled
by the use of SSA graphs. The whole flow of a function is taken into account
for instruction selection rather than a local scope of statements. The move from
basic tree-pattern matching [1] to SSA-based DAG matching is a relatively small
step as long as a PBQP library and some basic infrastructure (graph grammar
translator, etc.) are provided. The complexity of the approach is hidden in the
discrete optimization problem called PBQP. Free PBQP libraries are available from
the web-pages of the authors and a library is implemented as part of the LLVM [186]
framework.
Many aspects of the PBQP formulation presented in this chapter could not be
covered in detail. The interested reader is referred to the relevant literature [108, 109]
for an in-depth discussion.
As we move from acyclic linear code regions to whole functions, it becomes less
clear in which basic block the selected machine instructions should be emitted. For
chain rules, the obvious choices are often non-optimal. In [254], a polynomial-time
algorithm based on generic network flows is introduced that allows a more efficient
placement of chain rules across basic block boundaries. This technique is orthogonal
to the generalization to complex patterns.
Chapter 20
If-Conversion
Christian Bruel
outcome, and assuming a very optimistic one-cycle branch penalty. The main benefit
here is that it can be executed without branch disruption.
From this introductory example, we can observe that:
• The two possible execution paths have been merged into a single execution path,
implying a better exploitation of the available resources.
• The schedule height has been reduced, because instructions can be control-
speculated before the branch.
• The variables have been renamed, and a merge pseudo-instruction has been
introduced.
Thanks to SSA, the merging point is already materialized in the original control
flow as a φ pseudo-instruction, and register renaming has been performed by SSA
construction. Given this, the transformation to generate if-converted code seems
natural locally. Exploiting these properties on larger-scale control-flow regions
requires a framework that we will develop further.
Note that the select instruction is an architecture instruction that does not
need to be replaced during the SSA destruction phase. If the target architecture
does not provide such a gating instruction, it can be emulated using two conditional
moves. This translation can be done afterwards, and the select instruction can
still be used as an intermediate form. It allows the program to stay in full SSA form
where all the data dependencies are made explicit, and can thus be fed to all SSA
optimizers.
This chapter is organized as follows. We begin by describing the SSA techniques
to convert a CFG region into SSA form to produce an if-converted SSA repre-
sentation using speculation. We then describe how this framework is extended to
use predicated instructions, using the ψ-SSA form presented in Chap. 15. Finally,
we outline a global framework to pull together these techniques, incrementally
enlarging the scope of the if-converted region to its maximum beneficial size.
Unlike global approaches that identify a control-flow region and if-convert it in one
shot, the technique described in this chapter is based on incremental reductions. To
this end, we consider basic SSA transformations whose goal is to isolate a simple
diamond-DAG structure (informally an if-then-else-end) that can be easily
if-converted. The complete framework, which identifies and incrementally performs
the transformation, is described in Sect. 20.3.
The basic transformation that actually if-converts the code is the φ removal, which
takes a simple diamond-DAG as an input, i.e., a single-entry node/single-exit node
(SESE) DAG with only two distinct forward paths from its entry node to its
exit node. The φ removal consists in (1) speculating the code of both branches
in the entry basic block (denoted head); then (2) replacing the φ-function by a
select; and finally (3) simplifying the control flow to a single basic block. This
transformation is illustrated in Fig. 20.4.
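A toy rendering of these three steps on a string-based IR (the block encoding is illustrative; SSA renaming is what makes wholesale speculation safe here, since the two branches define distinct variables):

    def phi_removal(head, then_blk, else_blk, exit_blk):
        # (1) speculate both branches into head, (2) turn each phi of the
        # exit block into a select on the branch predicate, (3) collapse
        # the region into a single block.
        code = head["insts"] + then_blk["insts"] + else_blk["insts"]
        for dst, then_val, else_val in exit_blk["phis"]:
            code.append(f"{dst} = select {head['pred']}, {then_val}, {else_val}")
        code += exit_blk["insts"]
        return {"insts": code, "phis": [], "pred": None}

    head = {"insts": ["p = cmp_gt a, 0"], "pred": "p"}
    then_blk = {"insts": ["x1 = add a, 1"]}
    else_blk = {"insts": ["x2 = sub a, 1"]}
    exit_blk = {"phis": [("x3", "x1", "x2")], "insts": ["ret x3"]}
    print(phi_removal(head, then_blk, else_blk, exit_blk)["insts"])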
The goal of the φ reduction transformation is to isolate a diamond-DAG from a
structure that resembles a diamond-DAG but has side entries to its exit block. This
diamond-DAG can then be reduced using the φ removal transformation. Nested
if-then-else-end in the original code can create such a control-flow region.
Of note is the similarity with the nested arity-two φif -functions used for gated SSA
(see Chap. 14). In the most general case, the joint node of the considered region
has n direct predecessors with φ-functions of the form B0 : r = φ(B1 : r1 , B2 :
r2 , . . . , Bn : rn ) and is such that removing edges from B3 , . . . , Bn would give a
diamond-DAG. After the transformation, B1 and B2 point to a freshly created basic
block, say B12 , that itself points to B0 ; a new variable B12 : r12 = φ(B1 : r1 , B2 :
r2 ) is created in this new basic block; the φ-function in B0 is replaced by B0 : r =
φ(B12 : r12 , . . . , Bn : rn ). This is illustrated in Fig. 20.5.
The goal of path duplication is to isolate a diamond-DAG from a structure that
resembles a diamond-DAG but has side exit edges. Through path duplication, all
edges that point to a node different from the exit node or to the willing entry node
are “redirected” to the exit node. φ reduction can then be applied to the region
obtained. More formally, consider two distinguished nodes, the first named head and
the second a single-exit node of the region named exit, such that there are exactly
two different control-flow paths from head to exit; consider (if it exists) the first node
sidei on one of the forward paths head → side0 → · · · → sidep → exit, which has at
least two direct predecessors. The transformation duplicates the path P = sidei →
· · · → sidep → exit into P′ = side′i → · · · → side′p → exit and redirects sidei−1
(or head if i = 0) to side′i. All the φ-functions that are along P and P′ for which the
number of direct predecessors has changed have to be updated accordingly. Hence,
a r = φ(sidep : r1, B2 : r2, . . . , Bn : rn) in exit will be updated into r = φ(sidep :
r1, B2 : r2, . . . , Bn : rn, side′p : r1); a r = φ(sidei−1 : r0, r1, . . . , rm) originally
in sidei will be updated into r = φ(r1, . . . , rm) in sidei and into r = φ(r0), i.e.,
r = r0, in side′i. Variable renaming (see Chap. 5) along with copy folding can then
be performed on P and P′. All steps are illustrated in Fig. 20.6.
The last transformation, namely the conjunctive predicate merge, concerns the
if-conversion of a control-flow pattern that sometimes appears on codes to represent
logical and or or conditional operations. As illustrated in Fig. 20.7, the goal is to
isolate a diamond-DAG from a structure that resembles a diamond-DAG but has side
exit edges. As opposed to path duplication, the transformation is actually restricted
to a very simple pattern, highlighted in Fig. 20.7, made up of three distinct basic
blocks: head, which branches on predicate p either to side or to exit; side, which is
empty and itself branches on predicate q either to another basic block outside of the
region or to exit; and exit. Conceptually, the transformation can be understood as
first isolating the outgoing path p → q and then if-converting the obtained
diamond-DAG.
Implementing the same framework on a non-SSA-form program would require
more effort: The φ reduction would do the renaming, involving either a global
data-flow analysis or the insertion of copies at the exit node of the diamond-DAG;
inferring the minimum amount of select operations would also require the upkeep of
liveness information. SSA form solves the renaming issue without additional effort,
and, as illustrated in Fig. 20.8, the minimality and pruned nature of the SSA form
avoid inserting useless select operations.
Fig. 20.9 Speculation removes the dependency with the predicate but adds anti-dependencies
between concurrent computations
one of the diamond-DAG branches are actually speculated. This partial speculation
leads to the manipulation of predicated code.
Speculating code is the easiest part, as it could be done prior to the actual if-
conversion by simply hoisting the code above the conditional branch. Still, it is
worth pointing out that since ψ-functions are part of the intermediate representation,
they can be considered for inclusion in a candidate region for if-conversion, and in
particular for speculation. However, the strength of ψ-SSA allows ψ-functions to
be treated just like any other operation. Consider the code in Fig. 20.10a containing
a subregion that has already been processed. To speculate the operation d1 = f(x),
the operation defining x, i.e., the ψ-function, also has to be speculated. Similarly, all
the operations defining the operands x1 and x2 should also be speculated. If one of
them can produce hazardous execution, then the ψ-function cannot be speculated,
which in turn forbids the operation d1 = f(x) from being speculated. Marking
operations that cannot be speculated can be done easily using a forward propagation
along def-use chains.
All operations that cannot be speculated, possibly including some ψ-functions,
should be predicated. Suppose we are considering a non-speculated operation that
we aim to if-convert and that is part of the then branch on predicate q. Just as
for x2 = c in Fig. 20.10a, this operation might already be predicated (on p here)
prior to the if-conversion. In that case, a projection on q is performed, meaning that
instead of predicating x2 = c by p it gets predicated by q ∧p. A ψ-function can also
be projected on a predicate q, as described in Chap. 15: All gates of each operand
are individually projected on q. As an example, originally non-gating operand x1
gets gated by q, while the p-gated operand x2 gets gated by s = q ∧ p. Note that
as opposed to speculating it, predicating a ψ-function does not impose predicating
the operations that defined its operands. The only subtlety related to projection is the
generation of the new predicate as the logical conjunction of the original guard (e.g.,
p) and the current branch predicate (e.g., q). Here, s needs to be computed at some
point. The heuristic consists in first listing the set of all necessary predicates and
then emitting the corresponding code at the earliest place. Here, the used predicates
are q, ¬q, and q ∧ p; q and ¬q are already available, and the earliest place where
q ∧ p can be computed is just after calculating p.
Once operations have been speculated or projected (on q for the then branch,
on ¬q for the else), each φ-function at the merge point is replaced by a ψ-function:
operands of speculated operations are placed first and guarded by true; operands of
projected operations follow, guarded by the predicate of the corresponding branch.
pressure; the number of predicate registers will determine the depth of the if-
conversion so that the number of conditions does not exceed the number of
available predicates; and the number of processing units will determine the number
of instructions that can be executed simultaneously. The inner–outer incremental
process advocated in this chapter serves to evaluate precisely the profitability of
if-conversion.
The algorithm takes as input a CFG in SSA form and applies incremental reductions
using the list of candidate-conditional basic blocks sorted in post-order. Each basic
block in the list designates the head of a sub-graph that can be if-converted using
the transformations described in Sect. 20.2. Post-order traversal serves to process
each region from the inner to the outer. When the if-converted region cannot grow
anymore because of resources, or because a basic block cannot be if-converted,
then the next sub-graph candidate is considered until the entire CFG is explored.
Note that as the reduction proceeds, maintaining SSA can be done using the general
technique described in Chap. 5. Basic local ad hoc updates can also be implemented
instead.
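A minimal driver in this spirit might look as follows: a toy sketch over a dictionary-based CFG of our own, in which only simple diamonds are reduced and all legality and resource checks are abstracted behind a convertible predicate:

# Hedged sketch: inner-to-outer if-conversion. Candidate heads are
# visited in post-order, so inner regions are reduced first; a diamond
# rooted at b is absorbed into b when both arms are convertible.
def postorder(cfg, entry):
    seen, order = set(), []
    def dfs(b):
        if b in seen:
            return
        seen.add(b)
        for s in cfg[b]["succs"]:
            dfs(s)
        order.append(b)
    dfs(entry)
    return order

def incremental_if_conversion(cfg, entry, convertible):
    for b in postorder(cfg, entry):
        succs = cfg[b]["succs"]
        if len(succs) != 2:
            continue                              # not a conditional head
        left, right = succs
        merges = set(cfg[left]["succs"]) & set(cfg[right]["succs"])
        if len(merges) != 1:
            continue                              # not a simple diamond
        if not (convertible(left) and convertible(right)):
            continue                              # e.g., contains a call
        (merge,) = merges
        # phi reduction: promote both arms into b; b now jumps to the
        # merge block (the emptied arms become unreachable).
        cfg[b]["ops"] += cfg[left]["ops"] + cfg[right]["ops"]
        cfg[b]["succs"] = [merge]
    return cfg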
Consider for example the CFG reported in Fig. 20.11a. The exit node B6 and
basic block B3 (which contains a function call) cannot be if-converted. The post-
order list of conditional blocks (represented in bold) is [B9 , B14 , B13 , B11 , B8 , B7 ,
B5 , B2 ]. (1) The first candidate region is composed of {B9 , B2 , B10 }; φ reduction
can be applied, promoting the instructions of B10 in B9 ; B2 becomes the single
direct successor of B9 . (2) The region headed by B14 is then considered; B15 cannot
yet be promoted because of the side entries coming from both B12 and B13 ; B15
is duplicated into a B15′ with B2 as direct successor; B15′ can then be promoted
into B14 , which now has a single direct successor B2 . (3) The region headed by
B13 , which has B14 and B15 as direct successors, is now considered; B15 is again
duplicated into B15′ , so as to promote B14 and B15 into B13 through φ reduction;
B15 already contains predicated operations from the previous transformation, so a
new merging predicate is computed and inserted. After the completion of φ removal,
B13 has a unique direct successor, B2 . (4) B11 is the head of the new candidate
region; here, B12 and B13 can be promoted. Again, since B13 contains predicated
and predicate-setting operations, a fresh predicate must be created to hold the
merged conditions. (5) B8 is then considered; B11 needs to be duplicated to B11′ .
The process finishes with the region head B7 .
As shown in Fig. 20.11, some basic blocks (such as B3 ) may have to be excluded
from the region to if-convert. Tail duplication can be used for this purpose.

Fig. 20.11 If-conversion of wc (word count program). Basic blocks in the highlighted region
cannot be if-converted. Tail duplication can be used to exclude B3 from the to-be-if-converted
region

Similar
to path duplication described in Sect. 20.1, the goal of tail duplication is to get rid of
the incoming edges of a region to if-convert. This is usually done in the context of
hyperblock formation, a technique that, unlike the inner-outer incremental technique
described in this chapter, consists in if-converting a region in “one shot.” Consider
again the example of Fig. 20.11a, and suppose that the set of selected basic blocks
defining the region to if-convert consists of all basic blocks from B2 to B15 excluding
B3 and B6 . Getting rid of the incoming edge from B3 to B5 is possible by duplicating
all basic blocks of the region reachable from B5 , as shown in Fig. 20.11c.
Consider a region R made up of a set of basic blocks, a distinguished one entry
and the others denoted (Bi )2≤i≤n , such that any Bi is reachable from entry in R.
Suppose a basic block Bs has some direct predecessors out1 , . . . , outm that are
not in R. Tail duplication involves the following steps: (1) for all Bj (including Bs )
reachable from Bs in R, create a basic block Bj′ as a copy of Bj ; (2) any branch
from Bj′ that points to a basic block Bk of the region is rerouted to its duplicate Bk′ ;
(3) any branch from a basic block outk to Bs is rerouted to Bs′ . In our example, we
would have entry = B2 , Bs = B6 , and out = B4 .
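These three steps can be sketched over a dictionary-based CFG (the representation and the block names in the usage example are made up):

# Hedged sketch of tail duplication: copy every block of region R
# reachable from Bs, then reroute the external predecessors of Bs to
# the duplicate Bs'. cfg maps a block name to its successor list.
def tail_duplicate(cfg, region, bs, outside_preds):
    # Step 1: copy every region block reachable from bs (including bs).
    reach, stack = set(), [bs]
    while stack:
        b = stack.pop()
        if b in reach:
            continue
        reach.add(b)
        stack.extend(s for s in cfg[b] if s in region)
    dup = {b: b + "'" for b in reach}
    # Step 2: branches of the duplicates that stay in the region are
    # rerouted to the corresponding duplicates.
    for b in reach:
        cfg[dup[b]] = [dup.get(s, s) for s in cfg[b]]
    # Step 3: external predecessors of bs now branch to its duplicate.
    for p in outside_preds:
        cfg[p] = [dup[bs] if s == bs else s for s in cfg[p]]
    return cfg

cfg = {"entry": ["s", "x"], "x": ["bs"], "s": ["bs"],
       "bs": ["t"], "t": []}
tail_duplicate(cfg, {"entry", "s", "bs", "t"}, "bs", ["x"])
print(cfg["x"], cfg["bs'"])    # ["bs'"] ["t'"]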
A global approach would follow the steps in Fig. 20.11c: First, select the region;
second, get rid of the incoming edges using tail duplication; and finally, perform
if-conversion of the whole region in one shot. It is worth pointing out that there is
no phasing issue with tail duplication. To illustrate this point, consider the example
of Fig. 20.12 where B2 cannot be if-converted. The selected region is made up of
all other basic blocks. Using a global approach as in standard hyperblock formation,
tail duplication would be performed prior to any if-conversion. This would result in
the CFG on the left part of the figure. Note that a new node, B7 , has been added
here after the tail duplication by a process called branch coalescing. Applying if-
conversion on the two disjoint regions, whose heads are, respectively, B4 and B4′ ,
would result in the final code shown at the bottom of the figure. Our incremental
scheme would first perform if-conversion of the region headed by B4 , resulting in
the code depicted in the CFG on the right. Applying tail duplication to get rid of the
side entry from B2 would result in exactly the same final code at the bottom.
20.3.3 Profitability
Fusing execution paths can overcommit the architectural ability to execute multiple
instructions in parallel: Data dependencies and register renaming introduce new
register constraints. Moving operations earlier in the instruction stream increases
live ranges. Aggressive if-conversion can easily exceed processor resources, leading
to excessive register pressure or moving infrequently used long latency instructions
into the critical path. The prevalent idea is that a region can be if-converted if the cost
of the resulting if-converted basic block is smaller than the cost of each individual
basic block of the original region weighted by their respective execution probability.
To evaluate these costs, we consider all possible paths impacted by the if-conversion.
For all transformations except the conjunctive predicate merge, there are two
such paths starting at the basic block head. For the code in Fig. 20.13, we would
have path_p = [head, B1 , exit[ and path_p̄ = [head, B1 , B2 , exit[ of respective
probabilities prob(p) and prob(p̄). For a path P_q = [B0 , B1 , . . . , Bn [ of probability
prob(q), its cost is given by

cost(P_q) = prob(q) × Σ_{i=0..n−1} cost([Bi , Bi+1 [),

where cost([Bi , Bi+1 [) represents the cost of basic block Bi , estimated using its
schedule height, plus the branch latency br_lat if the edge (Bi , Bi+1 ) corresponds to a conditional branch, 0
otherwise. Note that if Bi branches to S_q on predicate q and falls through to S_q̄ , we
have

cost(Bi ) = prob(q) × cost([Bi , S_q [) + prob(q̄) × cost([Bi , S_q̄ [) = cost([Bi ]) + prob(q) × br_lat.

Writing B0 , . . . , Bn for the blocks of path_p after head and B̄0 , . . . , B̄m for those of path_p̄, the
overall cost of the control-flow version is

cost_control = cost(path_p ) + cost(path_p̄ )
             = cost([head]) + prob(p) × br_lat + prob(p) × Σ_{i=0..n} cost([Bi ]) + prob(p̄) × Σ_{i=0..m} cost([B̄i ]),

which should be compared to cost_predicated = cost([head ◦ B0 ◦ · · · ◦ Bn ◦ B̄0 ◦ · · · ◦ B̄m ]),
where ◦ is the composition function that merges basic blocks together, removes
associated branches, and creates the predicate operations.
The profitability of the logical conjunctive merge in Fig. 20.14 can be
evaluated similarly. There are three paths impacted by the transformation:
path_p∧q = [head, side, B1 [, path_p∧q̄ = [head, side, exit[, and path_p̄ = [head, exit[
of respective probabilities prob(p ∧ q), prob(p ∧ q̄), and prob(p̄). The overall cost
before the transformation (if branches are on p and q), cost(path_p∧q ) + cost(path_p∧q̄ ) + cost(path_p̄ ),
simplifies to

cost_control = cost(head) + cost(side) = cost([head]) + prob(p) × cost([side]) + prob(p) × (1 + prob(q)) × br_lat,

which should be compared to (if the branch on the new head block is on p ∧ q)

cost_predicated = cost(head ◦ side) = cost([head ◦ side]) + prob(p) × prob(q) × br_lat.

Note that if prob(p) ≪ 1, emitting a conjunctive merge might not be beneficial.
In that case, another strategy such as path duplication from the exit block will
be evaluated. Profitability for any predicate merge (disjunctive or
conjunctive; convergent or not) is evaluated similarly.
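Numerically, the comparison can be sketched as follows (block costs and probabilities are made-up inputs; the schedule heights written [X] above become plain parameters):

# Hedged sketch: decide whether a conjunctive predicate merge pays off.
def conjunctive_merge_profitable(cost_head, cost_side, cost_merged,
                                 prob_p, prob_q, br_lat):
    # Before: head always runs, side runs with probability p, and the
    # two (weighted) taken branches cost prob(p) * (1 + prob(q)) * br_lat.
    cost_control = cost_head + prob_p * cost_side \
                   + prob_p * (1 + prob_q) * br_lat
    # After: one merged block, single branch taken on p and q.
    cost_predicated = cost_merged + prob_p * prob_q * br_lat
    return cost_predicated < cost_control

# With a rarely executed side block (prob_p << 1), merging does not pay,
# since every execution now pays for side's schedule height:
print(conjunctive_merge_profitable(3, 4, 6, prob_p=0.05, prob_q=0.5,
                                   br_lat=2))    # False
print(conjunctive_merge_profitable(3, 4, 6, prob_p=0.9, prob_q=0.5,
                                   br_lat=2))    # True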
A speed-oriented objective function needs the target machine description to
derive the instruction latencies, resource usage, and scheduling constraints. The
local dependencies computed between instructions are used to compute the depen-
dence height. The branch probability is obtained from static branch prediction
heuristics, profile information, or user-inserted directives. Naturally, this heuristic
can be either pessimistic, because it does not take into account new optimization
opportunities introduced by the branch removal or explicit new dependencies, or
optimistic because of inaccurate register pressure estimation leading to register
spilling on the critical path, or uncertainty in the branch prediction. But since
the SSA incremental if-conversion framework reduces the scope for the decision
function to a localized part of the CFG, the size and complexity of the inner
20 If-Conversion 283
Fabrice Rastello
Chapter 3 provides a basic algorithm for destructing SSA that suffers from several
limitations and drawbacks: first, it works under implicit assumptions that are not
necessarily fulfilled at machine level; second, it must rely on subsequent phases to
remove the numerous copy operations it inserts; finally, it substantially increases
the size of the intermediate representation, thus making it unsuitable for just-in-time
compilation.
Correctness
SSA at machine level complicates the process of destruction that can potentially
lead to bugs if not performed carefully. The algorithm described in Sect. 3.2 involves
the splitting of every critical edge. Unfortunately, because of specific architectural
constraints, region boundaries, or exception handling code, edge splitting is not
always possible. As we will see further on, this obstacle could easily be overcome
by appending copy operations at the very beginning and very end of basic blocks.
Unfortunately, appending a copy operation at the very end of a basic block might not
be possible either (it has to be before the jump operation). Also, care must be taken
with duplicated edges, i.e., when the same basic block appears twice in the list of
direct predecessors. This can occur after control-flow graph structural optimizations
such as dead code elimination or empty block elimination.
SSA imposes a strict discipline on variable naming: every “name” must be
associated with only one definition which, most of the time, is obviously not
compatible with the instruction set of the target architecture. As an example, a
two-address mode instruction such as auto-increment (x = x + 1) would force
its definition to use the same resource as one of its arguments (defined elsewhere),
thus imposing two different definitions for the same temporary variable. This is
why some compiler designers prefer using, for SSA construction, the notion of
versioning in place of renaming. Implicitly, two versions of the same original
variable should not interfere, while two names can. Such a flavour corresponds to
the C-SSA form described in Chap. 2. The former simplifies the SSA destruction
phase, while the latter simplifies and allows more transformations to be performed
under SSA (updating C-SSA is very difficult). Apart from dedicated registers for
which optimizations are usually very careful in managing their live range, register
constraints related to calling conventions or instruction set architecture might be
handled by the register allocation phase. However, as we will see, enforcement
of register constraints impacts the register pressure as well as the number of copy
operations. For those reasons we may want those constraints to be expressed earlier
(such as for the pre-pass scheduler), in which case the SSA destruction phase might
have to cope with them.
Code Quality
The natural way of lowering φ-functions and expressing register constraints is
through the insertion of copies (when edge splitting is not mandatory as discussed
above). If done carelessly, the resulting code will contain many temporary-to-
temporary copy operations. In theory, reducing the number of these copies is the
role of the coalescing during the register allocation phase. Existing coalescing
heuristics, mentioned in Chap. 22, are quite effective in practice, although memory-
and time-consuming. The difficulty comes both from the size of the interference graph
(the information of colourability is spread out) and from the presence of many
overlapping live ranges that carry the same value (so are non-interfering). With
less effort, coalescing can also be performed prior to the register allocation phase.
As opposed to a (so-called conservative) coalescing during register allocation, this
aggressive coalescing does not have to preserve the colourability of the interference graph. As
we will see, strict SSA form is really helpful for both computing and representing
equivalent variables. This makes the SSA destruction phase the right candidate for
eliminating (or not inserting) those copies.
Speed and Memory Footprint
The cleanest and simplest way to perform SSA destruction with good code quality
is to first insert copy instructions to make the SSA form conventional, then
take advantage of the SSA form to efficiently run aggressive coalescing (without
breaking the conventional property), before eventually renaming φ-webs and getting
rid of φ-functions. Unfortunately, in a transitional stage this approach will lead to
an intermediate representation with a substantial number of variables: The liveness
sets and the interference graph classically used to perform coalescing become
prohibitively large for dynamic compilation. To overcome this difficulty liveness
and interference can be computed on demand, which, as we already mentioned, is
made simpler by the use of SSA form (see Chap. 9). There remains the process
of copy insertion itself that might still take a substantial amount of time. To fulfil
memory and time constraints imposed by just-in-time compilation, one idea is to
virtually insert those copies, and only effectively insert the non-coalesced ones.
21.1 Correctness
point of ai and lead to an incorrect code after renaming a0 and ai with the same
name. φ-node isolation can be used to solve most of the issues that may be faced at
machine level. However, the subtleties listed below remain.
Limitations
A tricky case is where the basic block contains variables defined after the point
of copy insertion. This, for example, is the case of the PowerPC bclr branch
instructions with a behaviour similar to a hardware loop. In addition to the condition,
a counter u is decremented by the instruction itself. If u is used in a φ-function in a
direct successor block, no copy insertion can split its live range. It must then be given
the same name as the variable defined by the φ-function. If both variables interfere,
this is simply impossible! For example, suppose that for the code in Fig. 21.2a,
the instruction selection chooses a branch with decrement (denoted br_dec) for
Block B1 (Fig. 21.2b). Then, the φ-function of Block B2 , which uses u, cannot
be translated out of SSA by standard copy insertion because u interferes with t1
and its live range cannot be split. To destruct SSA, one could add t1 ← u − 1
in Block B1 to anticipate the branch. Or one could split the critical edge between
B1 and B2 as in Fig. 21.2c. In other words, simple copy insertions are not enough in
this case. We see several alternatives to solve the problem: (1) The SSA optimization
could be designed with more care; (2) the counter variable must not be promoted to
SSA; (3) some instructions must be changed; (4) the control-flow edge must be split
somehow.
Fig. 21.2 Copy insertion may not be sufficient. br_dec u, B1 decrements u, then branches to B1
if u ≠ 0
Fig. 21.3 Copy folding followed by empty block elimination can lead to SSA code for which
destruction is not possible through simple copy insertion
Another tricky case is when a basic block has the same direct predecessor
block twice. This can result from consecutively applying copy folding and control-
flow graph structural optimizations such as dead code elimination or empty block
elimination. This is the case for the example of Fig. 21.3 where copy folding would
remove the copy a2 ← b in Block B2 . If B2 is eliminated, there is no way to
implement the control dependence of the value to be assigned to a3 other than
through predicated code (see Chaps. 15 and 14) or through the reinsertion of a basic
block between B1 and B0 by splitting one of the edges.
The last difficulty SSA destruction faces when performed at machine level
is related to register constraints such as instruction set architecture (ISA) or
application binary interface (ABI) constraints. For the sake of the discussion
we differentiate two kinds of resource constraints that we will refer to as operand
pinning and live range pinning. The live range pinning of a variable v to resource
R will be represented by Rv , just as if v were a version of temporary R. An
operand pinning to a resource R will be represented using the exponent ↑R on
the corresponding operand. Live range pinning expresses the fact that the entire
live range of a variable must reside in a given resource (usually a dedicated
register). Examples of live range pinning are versions of the stack-pointer temporary
that must be assigned back to register SP. On the other hand, the pinning of an
operation’s operand to a given resource does not impose anything on the live
range of the corresponding variable. The scope of the constraint is restricted to
the operation. Examples of operand pinning are operand constraints such as two-
address mode where two operands of one instruction must use the same resource,
or where an operand must use a given register. This last case encapsulates ABI
constraints.
Note that looser constraints where the live range or the operand can reside in
more than one resource are not handled here. We assume that the handling of this
latter constraint is the responsibility of the register allocation. We first simplify the
problem by transforming any operand pinning into a live range pinning, as sketched
in Fig. 21.4: Parallel copies with new variables pinned to the corresponding resource
are inserted just before (for use operand pinning) and just after (for definition-
operand pinning) the operation.
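This rewriting can be sketched over toy instruction records (the record layout and the naming scheme for the new pinned variables are ours):

# Hedged sketch: turn operand pinnings into live range pinnings by
# inserting a parallel copy of pinned uses just before the operation
# and of pinned definitions just after it.
def expand_operand_pinning(op, use_pins, def_pins):
    """op: dict with 'uses' and 'defs' lists of variable names;
    use_pins/def_pins map a variable to a resource, e.g. {'a': 'R0'}."""
    pre, post = [], []
    for i, u in enumerate(op["uses"]):
        if u in use_pins:
            pinned = use_pins[u] + "_" + u    # new variable pinned to R
            pre.append((pinned, u))           # parallel copy before op
            op["uses"][i] = pinned
    for i, d in enumerate(op["defs"]):
        if d in def_pins:
            pinned = def_pins[d] + "_" + d
            post.append((d, pinned))          # parallel copy after op
            op["defs"][i] = pinned
    return pre, op, post

# A call whose argument must be in R0 and whose result arrives in R0:
print(expand_operand_pinning({"uses": ["a"], "defs": ["r"]},
                             {"a": "R0"}, {"r": "R0"}))
# ([('R0_a', 'a')], {'uses': ['R0_a'], 'defs': ['R0_r']}, [('r', 'R0_r')])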
Detection of Strong Interferences
The scheme we propose in this section to perform SSA destruction that deals with
machine level constraints does not address compilation cost (in terms of speed and
memory footprint). It is designed to be simple. It first inserts parallel copies to isolate
φ-functions and operand pinning. Then it checks for interferences that may persist.
We will denote such interferences as strong, as they cannot be tackled through the
simple insertion of temporary-to-temporary copies in the code. We consider that
fixing strong interferences should be done on a case-by-case basis and restrict the
discussion here to their detection.
As far as correctness is concerned, Algorithm 21.1 splits the data flow between
variables and φ-nodes through the insertion of copies. For a given φ-function a0 ←
φ(a1 , . . . , an ), this transformation is correct as long as the copies can be inserted
close enough to the φ-function. This might not be the case if the insertion point
(for a use operand) of copy ai′ ← ai is not dominated by the definition point of ai
(such as for argument u of the φ-function t1 ← φ(u, t2 ) for the code in Fig. 21.2b);
symmetrically, it will not be correct if the insertion point (for the definition-operand)
of copy a0 ← a0′ does not dominate all the uses of a0 . More precisely, this leads to
the insertion of the following tests in Algorithm 21.1:
• Line 9: “if the definition of ai does not dominate PCi then continue.”
• Line 16: “if one use of a0 is not dominated by PC0 then continue.”
For the discussion, we will denote as split operands the newly created local variables
to differentiate them from the ones concerned by the two previous cases (designated
as non-split operands). We suppose that a similar process has been performed for
operand pinning to express them in terms of live range pinning with (when possible)
very short live ranges around the concerned operations.
At this point, the code is still under SSA, and the goal of the next step is to check
that it is conventional: This will obviously be the case only if all the variables of a
21 SSA Destruction for Machine Code 291
φ-web can be coalesced together. But this is not the only constraint: The set of all
variables pinned to a common resource must also be interference free. We say that x
and y are pinned-φ-related to one another if they are φ-related or if they are pinned
to a common resource. The transitive closure of this relation defines an equivalence
relation that partitions the variables defined locally in the procedure into equivalence
classes, the pinned-φ-webs. Intuitively, the pinned-φ-equivalence class of a resource
represents a set of resources “connected” via φ-functions and resource pinning. The
computation of φ-webs given by Algorithm 3.4 can be generalized easily to compute
pinned-φ-webs. The resulting pseudo-code is given by Algorithm 21.2.
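Since Algorithm 21.2 is only referenced here, the following union-find sketch conveys the idea (our own minimal version, with variables and resources as plain strings):

# Hedged sketch: pinned-phi-webs as the transitive closure of
# "phi-related or pinned to a common resource", using union-find.
class UnionFind:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def pinned_phi_webs(phis, pinnings):
    """phis: list of (dest, [args]); pinnings: variable -> resource."""
    uf = UnionFind()
    for dest, args in phis:               # phi-related variables
        for a in args:
            uf.union(dest, a)
    for v, resource in pinnings.items():  # pinned to a common resource
        uf.union(v, resource)
    webs = {}
    for v in list(uf.parent):
        webs.setdefault(uf.find(v), set()).add(v)
    return list(webs.values())

# a0 <- phi(a1, a2) with a2 and b both pinned to SP: one single web.
print(pinned_phi_webs([("a0", ["a1", "a2"])], {"a2": "SP", "b": "SP"}))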
Now we need to check that each web is interference free. A web contains
variables and resources. The notion of interferences between two variables is the
one discussed in Sect. 2.6 for which we will propose an efficient implementation
later in this chapter. A variable and a physical resource do not interfere while two
distinct physical resources interfere with one another.
If any interference has been discovered, it has to be fixed on a case-by-case basis.
Note that some interferences such as the one depicted in Fig. 21.3 can be detected
and handled initially (through edge splitting if possible) during the copy insertion
phase.
Once the code is in conventional SSA, the correctness problem is solved: Destruc-
ting it is by definition straightforward, as it consists in renaming all variables in
each φ-web into a unique representative name and then removing all φ-functions.
To improve the code, however, it is important to remove as many copies as possible.
Aggressive Coalescing
Aggressive coalescing can be treated with standard non-SSA coalescing techniques.
Indeed, conventional SSA allows us to coalesce the set of all variables in each φ-
web together. Coalesced variables are no longer SSA variables, but φ-functions can
be removed. Liveness and interferences can then be defined as for a regular code
(with parallel copies). An interference graph (as depicted in Fig. 21.5e) can be used.
A solid edge between two nodes (e.g., between x2 and x3 ) materializes the presence
of an interference between the two corresponding variables, i.e., expressing the fact
that they cannot be coalesced and share the same resource. A dashed edge between
two nodes materializes an affinity between the two corresponding variables, i.e.,
the presence of a copy (e.g., between x2 and x2′ ) that could be removed by their
coalescing.
This process is illustrated by Fig. 21.5: the isolation of the φ-function leads to
the insertion of the three copies that respectively define x1′ , define x3′ , and use x2′ ;
the corresponding φ-web {x1′ , x2′ , x3′ } is coalesced into a representative variable x;
according to the interference graph in Fig. 21.5e, x1 and x3 can then be coalesced
with x, leading to the code in Fig. 21.5c.
Liveness Under SSA
If the goal is not to destruct SSA completely but remove as many copies as possible
while maintaining the conventional property, liveness of φ-function operands should
reproduce the behaviour of the corresponding non-SSA code as if the variables
of the φ-web were coalesced all together. The semantics of the φ-operator in the
multiplexing mode, described in Chap. 2, fulfils this requirement.

Interferences
Implementing the technique of the previous section may be considered too costly.
First, it inserts many instructions before realizing that most are useless. Also, copy
insertion is already in itself time-consuming. It introduces many new variables,
too: The size of the variable universe has an impact on the liveness analysis and
the interference graph construction. Finally, if a general coalescing algorithm is
used, a graph representation with adjacency lists (in addition to the bit matrix) and
a working graph to explicitly merge nodes when coalescing variables would be
required.
As explained in Paragraph 21.2, this can be done linearly (without requiring a hash
map-table) in a single traversal of the program if under strict SSA form. We also
suppose that a liveness check is available, meaning that for a given variable a and
program point p, one can answer through a simple boolean query whether a is live
at this point. This can directly be used, under strict SSA form, to check whether two
variables’ live ranges intersect:
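A sketch of the check alluded to here, assuming a dominance oracle and the liveness query just mentioned (all three callbacks are assumptions of ours): under strict SSA, tree-shaped live ranges can only overlap when one definition dominates the other, so a single liveness query decides the matter.

# Hedged sketch: live range intersection under strict SSA, using only
# dominance and liveness queries (both assumed available).
def intersect(a, b, dominates, is_live_at, def_point):
    """a, b: variables; dominates(p, q): program-point dominance;
    is_live_at(v, p): liveness check; def_point(v): definition point."""
    # Only the variable defined higher in the dominance order can be
    # live at the other's definition point.
    if dominates(def_point(b), def_point(a)):
        a, b = b, a
    elif not dominates(def_point(a), def_point(b)):
        return False     # neither dominates: the ranges cannot overlap
    return is_live_at(a, def_point(b))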
from the merged set. As we will see, thanks to the dominance property, this can be
done linearly using a single traversal of the set.
In reference to register allocation, and graph colouring, we will associate the
notion of colours with merged sets: All the variables of the same set are assigned
the same colour, and different sets are assigned different colours. The process of
de-coalescing a variable is to extract it from its set; it is not put in another set,
just isolated; we say it is uncoloured. Actually, variables pinned together have
to stay together. We denote the (interference free) set of variables pinned to a
common resource that contains variable v, atomic-merged-set(v). So the process
of uncolouring a variable might have the effect of uncolouring some others. In other
words, a coloured variable is to be coalesced with variables of the same colour, and
any uncoloured variable v is to be coalesced only with the variables it is pinned
with, i.e., atomic-merged-set(v).
We suppose that variables have already been coloured, and the goal is to uncolour
some of them (preferably not all of them) so that each merged set becomes
interference free. We suppose that if two variables are pinned together they have
been assigned the same colour, and that a merged set cannot contain variables pinned
to different physical resources. Here we focus on a single merged set and the goal
is to make it interference free within a single traversal. The idea exploits the tree
shape of variables’ live ranges under strict SSA. To this end, variables are identified
by their definition point and ordered accordingly using dominance.
Algorithm 21.3 performs a traversal of this set along the dominance order,
enforcing at each step the subset of already considered variables to be interference
free. From now on, we will abusively designate as the dominators of a variable v the set
of variables of colour identical to v whose definition dominates the definition of v.
Variables defined at the same program point are arbitrarily ordered, so as to use the
standard definition of immediate dominator (denoted v.idom, set to ⊥ if it does not
exist, updated lines 6–8). To illustrate the role of v.eanc in Algorithm 21.3, let us
consider the example of Fig. 21.6 where all variables are assumed to be originally in
the same merged set: v.eanc (updated line 16) represents the immediate intersecting
dominator with the same value as v; so we have b.eanc = ⊥ and d.eanc = a. When
line 14 is reached, cur_anc (if not ⊥) represents a coloured dominating variable
intersecting v: when v is set to c (c.idom = b), as b
does not intersect c and as b.eanc = ⊥, cur_anc = ⊥, which allows us to conclude
that there is no dominating variable that interferes with c; when v is set to e, d does
not intersect e but as a intersects and has the same value as d (otherwise a or d
would have been uncoloured), we have d.eanc = a and thus cur_anc = a. This
allows us to detect on line 18 the interference of e with a.
Virtualizing φ-Related Copies
The last step towards a memory-friendly and fast SSA destruction algorithm consists
in emulating the initial introduction of copies and only actually inserting them on
the fly when they appear to be required. We use exactly the same algorithms as for
the solution without virtualization, and use a special location in the code, identified
as a “virtual” parallel copy, where the real copies, if any, will be placed.
5   Function DeCoalesce(v, u)
6       while (u ≠ ⊥) ∧ (¬(u dominates v) ∨ uncolored(u)) do u ← u.idom
8       v.idom ← u
9       v.eanc ← ⊥
10      cur_anc ← v.idom
11      while cur_anc ≠ ⊥ do
12          while cur_anc ≠ ⊥ ∧ ¬(colored(cur_anc) ∧ intersect(cur_anc, v)) do
13              cur_anc ← cur_anc.eanc
14          if cur_anc ≠ ⊥ then
15              if V (cur_anc) = V (v) then
16                  v.eanc ← cur_anc
17                  break
18              else    /* cur_anc and v interfere */
19                  if preferable to uncolor v then
20                      uncolor atomic-merged-set(v)
21                      break
22                  else
23                      uncolor atomic-merged-set(cur_anc)
24                      cur_anc ← cur_anc.eanc

The enclosing loop of Algorithm 21.3 processes the operations at each program point l in dominance order:

14      else
15          foreach operation OP at l (including φ-functions) do
16              foreach variable v defined by OP do
17                  if ¬colored(v) then continue
18                  else c ← color(v)
20                  DeCoalesce(v, c.cur_idom)
21                  if colored(v) then c.cur_idom ← v
destination of all copies to be treated. Copies are first treated by considering leaves
(while loop on the list ready). Then, the to_do list is considered, ignoring copies
that have already been treated, possibly breaking a circuit with no duplication,
thanks to an extra copy into the fresh variable n.
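The sequentialization referred to here can be sketched as follows (a worklist formulation of ours that first fills destinations nobody reads and breaks a remaining cycle with one extra copy through the fresh variable n):

# Hedged sketch: sequentialize a parallel copy given as a list of
# (destination, source) pairs, emitting the minimum number of copies.
def sequentialize(parallel_copy, fresh="n"):
    seq, loc, pred, to_do, ready = [], {}, {}, [], []
    for dest, src in parallel_copy:
        loc[src] = src          # where each needed value currently lives
        pred[dest] = src        # the source of each destination
        to_do.append(dest)
    for dest, _ in parallel_copy:
        if dest not in loc:     # dest is never read: a tree leaf
            ready.append(dest)
    while to_do:
        while ready:
            dest = ready.pop()
            src = pred[dest]
            cur = loc[src]
            seq.append((dest, cur))    # emit copy dest <- cur
            loc[src] = dest
            if src in pred and cur == src:
                ready.append(src)      # src's own slot is now free
        dest = to_do.pop()
        if dest != loc[pred[dest]]:    # value still needed: a cycle
            seq.append((fresh, dest))  # break it with the fresh variable
            loc[dest] = fresh
            ready.append(dest)
    return seq

# A swap needs the extra copy; a simple chain does not.
print(sequentialize([("b", "a"), ("a", "b")]))
# [('n', 'a'), ('a', 'b'), ('b', 'n')]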
SSA destruction was first addressed by Cytron et al. [90], who proposed to simply
replace each φ-function by copies in the direct predecessor basic blocks. Although
this naive translation seems, at first sight, correct, Briggs et al. [50] pointed out
subtle errors due to parallel copies and/or critical edges in the control-flow graph.
Two typical situations are identified, namely the “lost-copy problem” and the “swap
problem.” The first solution, both simple and correct, was proposed by Sreedhar
et al. [267]. They address the associated problem of coalescing and describe three
solutions. The first one consists of three steps: (a) translate SSA into CSSA, by
isolating φ-functions; (b) eliminate redundant copies; (c) eliminate φ-functions and
leave CSSA. The third solution, which turns out to be nothing else than the first
solution except that it virtualizes the isolation of φ-functions, has the advantage
of introducing fewer copies. The reason for that, identified by Boissinot et al., is
the fact that in the presence of many copies the code contains many intersecting
variables that do not actually interfere. Boissinot et al. [35] revisited Sreedhar et
al.’s approach in the light of this remark and proposed the value-based interference
described in this chapter.
The ultimate notion of interference was discussed by Chaitin et al. [60] in
the context of register allocation. They proposed a simple conservative test: Two
variables interfere if one is live at a definition point of the other and this definition is
not a copy between the two variables. This interference notion is the most commonly
used, see for example how the interference graph is computed in [10]. Still, they
noticed that, with this conservative interference definition, after coalescing some
variables the interference graph has to be updated or rebuilt. A counting mechanism
to update the interference graph was proposed, but it was considered to be too space-
consuming. Recomputing it from time to time was preferred [59, 60].
The value-based technique described here can also obviously be used in the
context of register allocation even if the code is not under SSA form. The notion
of value may be approximated using data-flow analysis on specific lattices [6], and
under SSA form simple global value numbering [249] can be used.
Leung and George [182] addressed SSA destruction for machine code. Register
renaming constraints, such as calling conventions or dedicated registers, are treated
with pinned variables. A simple data-flow analysis scheme is used to place repairing
copies. By revisiting this approach to address the coalescing of copies, Rastello et
al. [240] pointed out and fixed a few errors present in the original algorithm. While
being very efficient in minimizing the introduced copies, this algorithm is quite
complicated to implement and not suited to just-in-time compilation.
The first technique to address speed and memory footprint was proposed by
Budimlić et al. [53]. It proposes the de-coalescing technique, revisited in this
chapter, that exploits the underlying tree structure of the dominance relation between
variables of the same merged set.
Last, this chapter describes a fast sequentialization algorithm that requires the
minimum number of copies. A similar algorithm has already been proposed by
C. May [196].
Chapter 22
Register Allocation
F. Bouchez Tichadou and F. Rastello
22.1 Introduction
Let us first review the basics of register allocation, to help us understand the choices
made by graph-based and linear scan style allocations.
Register allocation is usually performed per procedure. In each procedure, a
liveness analysis (see Chap. 9) determines for each variable the program points
where the variable is live. The set of all program points where a variable is live
is called the live range of the variable, and all along this live range, storage needs
to be allocated for that variable, ideally a register. When two variables “exist” at the
same time, they are conflicting for resources, i.e., they cannot reside in the same
location.
This resource conflict of two variables is called interference and is usually
defined via liveness: two variables interfere if (and only if) there exists a program
point where they are simultaneously live, i.e., their live ranges intersect.1 It
represents the fact that those two variables cannot share the same register. For
instance, in Fig. 22.1, variables a and b interfere as a is live at the definition of b.
There are multiple questions that arise at that point that a register allocator has to
answer:
• Are there enough registers for all my variables? (spill test)
• If yes, how do I choose which register to assign to which variable? (assignment)
• If no, how do I choose which variables to store in memory? (spilling)
Without going into the details, let us see how linear scan and graph-based
allocators handle these questions. Figure 22.1 will be used in the next paragraphs to
illustrate how these allocators work.
Linear Scan
The linear scan principle is to consider that a procedure is a long basic block
and, hence, live ranges are approximated as intervals. For instance, the procedure
shown in Fig. 22.1b is viewed as the straight-line code of Fig. 22.1a. The algorithm
then proceeds in scanning the block from top to bottom. When encountering
the definition of a variable (i.e., the beginning of a live range), we check if
some registers are free (spill test). If yes, we pick one to assign the variable to
(assignment). If no, we choose from the set of currently live variables the one that
1 This definition of interference by liveness is an over-approximation (see Sect. 2.6 of Chap. 2), and
there are refined definitions that create less interferences (see Chap. 21). However, in this chapter,
we will restrict ourselves to this definition and assume that two variables whose live ranges intersect
cannot be assigned the same register.
Fig. 22.1 Linear scan makes an over-approximation of live ranges as intervals, while a graph-based
allocator creates an interference graph capturing the exact interferences. Linear scan requires 5
registers in this case, while colouring the interference graph can be done with 4 registers
has the farthest use and spill it (spilling). When we encounter the end of its live
range (e.g., a last use), we free the register it was assigned to. On our example,
we would thus greedily colour along the following order: p, a, b, x, and finally y.
When the scan encounters the first definition of y in the second basic block, four
other variables are live and a fifth colour is required to avoid spilling.
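A minimal sketch of such an allocator (intervals are (start, end, name) triples; the encoding and the furthest-end eviction are simplifications of ours):

# Hedged sketch of linear scan with a furthest-end spill heuristic.
def linear_scan(intervals, R):
    intervals = sorted(intervals)              # by start point
    active, assignment, spilled = [], {}, []
    free = ["r%d" % i for i in range(R)]
    for start, end, v in intervals:
        for item in [it for it in active if it[0] <= start]:
            active.remove(item)                # interval expired:
            free.append(assignment[item[1]])   # its register is free
        if free:
            assignment[v] = free.pop()         # spill test passed
            active.append((end, v))
        else:
            far = max(active)                  # furthest-ending interval
            if far[0] > end:
                assignment[v] = assignment.pop(far[1])
                active.remove(far)
                active.append((end, v))
                spilled.append(far[1])
            else:
                spilled.append(v)
    return assignment, spilled

# On straight-line intervals, R registers suffice iff Maxlive <= R.
print(linear_scan([(0, 9, "p"), (1, 6, "a"), (2, 8, "b"), (3, 9, "x")], 3))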
Graph-Based
Graph-based allocators, such as the “Iterated Register Coalescing” allocator (IRC),
represent interferences of variables as an undirected interference graph: the nodes
are the variables of the program, and two nodes are connected by an edge if they
interfere, i.e., if their live ranges intersect. For instance, Fig. 22.1c shows the
interference graph of the code example presented in Fig. 22.1b. In this model,
two neighbouring nodes must be assigned different registers, so the assignment
of variables to registers amounts to colouring the graph—two neighbouring nodes
must have a different colour—using at most R colours, the number of registers.2
Here, the allocator will try to colour the graph. If it succeeds (spill test), then the
colouration represents a valid assignment of registers to variables (assignment). If
not, the allocator will choose some nodes (usually the ones with the highest number
of neighbours) and remove them from the graph by storing the corresponding
variables in memory (spilling).
On our example, one would need to use four different colours for a, b, p, and x,
but y could use the same colour as a or b.
2 Hence, the terms “register” and “colour” will be used interchangeably in this chapter.
Comparison
Linear scan is a very fast allocator where a procedure is modelled as a straight-line
code. In this model, the colouring scheme is considered to be optimal. However, the
model itself is very imprecise: Procedures generally are not just straight-line codes
but involve complex flow structures such as if-conditions and loops. The live ranges
are artificially longer and produce more interferences than there actually are. If we
look again at the example of Fig. 22.1, we have a simple code with an if-condition.
Linear scan would decide that, because four variables are live at the definition y ←
b, it needs five registers (spill test). But one can observe that a is actually not live
at that program point: Modelling the procedure as a straight-line code artificially
increases the live ranges of variables.
On the other hand, a graph-based allocator has a much more precise notion
of interference. Unfortunately, graph k-colouring is known to be an NP-complete
problem. Control-flow structures create cycles in the interference graph that can get
arbitrarily complex. The allocator uses a heuristic for colouring the graph and will
base its spill decisions on this heuristic.
In our example, IRC would create the graph depicted in Fig. 22.1c, which
includes a 4-clique (i.e., a complete sub-graph of size 4, here with variables a, b,
p, and x), and, hence, would require at least 4 colours. This simple graph would
actually be easily 4-colourable with a heuristic; hence, the spill test would succeed
with four registers.
Still, one could observe that at each point of the procedure, no more than three
variables are simultaneously live. However, since x interferes with b on the left
branch and with a on the right branch, with the model used by IRC, it is indeed
impossible to use only three registers.
The question we raise here is: can’t we do better? The answer is yes, as depicted
in Fig. 22.2a. If it was possible for x to temporarily use the same register as b in the
right branch as long as a lives (short live range denoted x′ in the figure), then x could
use the same colour as a (freed after a’s last use). In this context, three registers are
enough. In fact, one should expect that, for a code where only three variables are
live at any program point, it should be possible to register allocate without spilling
with only three registers and proper reshuffling of variables in registers from time
to time.
So what is wrong with the graph colouring based scheme we just described?
We will develop below that its limitation is mostly due to the fact that it arbitrarily
enforces all variables to be assigned to only one register for their entire live range.
In conclusion, linear scan allocators are faster, and graph colouring ones have
better results in practice, but both approaches have an inexact spill test: linear
scan has artificial interferences, and graph colouring uses a colouring heuristic.
Moreover, both require variables to be assigned to exactly one register for their entire
live range. This means both allocators will potentially spill more variables than
strictly necessary. We will see how SSA can help with this problem.
Fig. 22.2 Splitting variable x in the previous example breaks the interference between x and a.
By using copies between x/x′ and y/a in parallel, now only 3 registers are required on the left.
SSA introduces splitting that guarantees Maxlive registers are enough
The number of simultaneously live variables at a program point is called the register
pressure at that program point.3 The maximum register pressure over all program
points in a procedure is called the register pressure of that procedure, or “Maxlive.”
One can observe that Maxlive expresses the minimum number of required registers
for a spill-free register assignment, i.e., an allocation that does not require memory.
For instance, in Fig. 22.1b, Maxlive = 3, so a minimum number of 3 registers is
required. If a procedure is restricted to a single basic block (straight-line code), then
Maxlive also constitutes a sufficient number of registers for a spill-free assignment.
But in general, a procedure is made of several basic blocks, and under the standard
model described so far, an allocation might require more than Maxlive registers to
be spill-free.
This situation changes if we permit live range splitting. This means inserting a
variable-to-variable copy instruction at a program point that creates a new version
of a variable. Thus, the value of a variable is allowed to reside in different registers
at different times. For instance, in Fig. 22.1b, we can split x in the right branch
by changing the definition to x and using a copy x ← x at the end of the block,
producing the code shown in Fig. 22.2a. It means the node x in the graph is split into
two nodes, x and x . Those nodes do not interfere, and also interfere differently with
the other variables: Now, x can use the same register as a because only x interferes
with a. Conversely, x can use the same register as b, because only x interferes with
b. In this version, we only need Maxlive = 3 registers, which means that, if the
number of registers was tight (i.e., R = 3), we have traded a spill (here one store
and one load) for one copy, which is an excellent bargain.
This interplay between live range splitting and colourability is one of the key
issues in register allocation, and we will see in the remainder of this chapter how
SSA, which creates live range splitting at particular locations, can play a role in
register allocation.
As said above, a register allocation scheme needs to address the following three
sub-problems: spill test, assignment, and spilling. We focus here on the spill test,
that is, verify whether there are enough registers for all variables without having to
store any of them in memory.
As already mentioned in Sect. 2.3 of Chap. 2, the live ranges in an SSA-
form program with dominance property have interesting structural properties:
In that flavour, SSA requires that all uses of a variable are dominated by its
definition. Hence, the whole live range is dominated by the definition of the variable.
Dominance, moreover, induces a tree on the control-flow graph (see for instance the
dominance edges of Fig. 4.1 in Chap. 4). Thus, the live ranges of SSA variables
are all tree-shaped. They can branch downward on the dominance tree but have a
single root: the program point where the variable is defined. Hence, a situation like
in Fig. 22.1 can no longer occur: x and y had two “roots” because they were defined
twice. Under SSA form, the live ranges of those variables are split by φ-functions,
which creates the code shown in Fig. 22.2b, where we can see that live ranges form
a “tree.” The argument and result variables of the φ-functions constitute new live
ranges, giving more freedom to the register allocator since they can be assigned to
different registers.
This structural property is interesting as we can now perform exact polynomial
colouring schemes that work both for graph-based and linear-style allocators.
Graph-Based
Graph-based allocators such as the IRC mentioned above use a simplification
scheme that works quite well in practice but is a heuristic for colouring general
graphs. We will explain it in more detail in Sect. 22.3.1, but the general idea is to
remove from the interference graph nodes that have strictly less than R neighbours,
as there will always be a colour available for them. If the whole graph can be
simplified, nodes are given a colour in the reverse order of their simplification. We
also call this method the greedy colouring scheme. On our running example, the
interference graph of Fig. 22.1c, candidates for simplification with R = 4 would
be nodes with strictly less than 4 neighbours, that is, a, b, or y. As soon as one of
them is simplified (removed from the graph), p and x now have only 3 neighbours
and can also be simplified. So a possible order would be to simplify a, then y, p,
b, and finally x. Colours can be greedily assigned during the reinsertion of nodes in
the graph in reverse order, starting from x, and ending with a: When we colour a,
all its 3 neighbours already have a colour and we assign it to the fourth colour.
Interestingly, under SSA, the interference graph becomes a chordal graph.4 The
nice property of chordal graphs is that they can be coloured minimally
in polynomial time. Even more interesting is the fact that the simplification scheme
used in IRC is optimal for such graphs and will always manage to colour it with
Maxlive colours! Thus the same colouring algorithm can be used, without any
modification, and now becomes an exact spill test: Spilling is required if and only if
the simplification scheme fails to completely simplify (hence, colour) the graph.
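The simplification scheme can be sketched compactly (our own minimal version; the example graph mirrors the interferences of Fig. 22.1c as described in the text):

# Hedged sketch of the greedy colouring scheme: repeatedly remove a
# node with fewer than R neighbours, then colour in reverse order.
def greedy_colouring(adj, R):
    """adj: node -> set of neighbours; returns a colouring or None."""
    g = {v: set(ns) for v, ns in adj.items()}
    stack = []
    while g:
        v = next((v for v in g if len(g[v]) < R), None)
        if v is None:
            return None                    # spill test fails
        stack.append(v)
        for n in g.pop(v):
            g[n].discard(v)
    colours = {}
    while stack:                           # reinsert in reverse order
        v = stack.pop()
        taken = {colours[n] for n in adj[v] if n in colours}
        colours[v] = min(c for c in range(R) if c not in taken)
    return colours

adj = {"a": {"b", "p", "x"}, "b": {"a", "p", "x"},
       "p": {"a", "b", "x", "y"}, "x": {"a", "b", "p", "y"},
       "y": {"p", "x"}}
print(greedy_colouring(adj, 4))    # a valid 4-colouring
print(greedy_colouring(adj, 3))    # None: the 4-clique needs 4 colours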
Tree Scan

Under SSA, the live ranges are intervals that can “branch” but never “join.” This
allows for a simple generalization of the linear scan mentioned above that we call the
tree scan. As we will see, the tree scan always succeeds in colouring the tree-shaped
live ranges with Maxlive colours. This greedy assignment scans the dominance tree,
colouring the variables from the root to the leaves in a top-down order. This means
the variables are simply coloured in the order of appearance of their respective
definition. On our example (Fig. 22.2b), tree scan would colour the variables in
the following order: p1 , a1 , b1 , x1 , y1 , x2 , y2 , x3 , y3 . This works because branches
of the tree are independent, so colouring one will not add constraints on other parts
of the tree, contrary to the general non-SSA case that may expose cycles.
The pseudo-code of the tree scan is shown in Algorithm 22.1. In this pseudo-
code, the program points p are processed using a depth-first search traversal of the
dominance tree T . For a colour c, available[c] is a boolean that expresses if c is
4 In a chordal graph, also called a triangulated graph, every cycle of length 4 or more has (at least)
one chord (i.e., an edge joining two non-consecutive vertices of the cycle).
available for the currently processed point. Intuitively, when the scanning arrives at
the definition of a variable, the only already coloured variables are “above” it, and
since at most Maxlive − 1 other variables are live at this program point, there is
always a free colour. As an example, when colouring x1 , the variables live at
its definition point are p1 and b1 and are already coloured. So a third colour, different
from the ones assigned to p1 and b1 , can be given to x1 .
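A sketch of the tree scan over a toy dominance tree (the encoding of a block as a list of (defs, last-uses) pairs is ours; it assumes Maxlive ≤ R, so a free colour always exists at a definition):

# Hedged sketch: depth-first walk of the dominance tree, freeing a
# colour at a variable's last use and taking one at each definition.
def tree_scan(dom_children, block_ops, entry, R):
    colour = {}
    def scan(block, free):
        for defs, dying in block_ops[block]:
            for v in dying:                 # last use: colour is freed
                free.append(colour[v])
            for v in defs:                  # definition: take a colour
                colour[v] = free.pop()
        for child in dom_children.get(block, []):
            scan(child, list(free))         # branches are independent
    scan(entry, ["r%d" % i for i in range(R)])
    return colour

# p defined in B1; x reuses p's colour in B2 once p dies there.
children = {"B1": ["B2", "B3"]}
ops = {"B1": [(["p"], [])],
       "B2": [(["x"], ["p"])],
       "B3": [(["y"], [])]}
print(tree_scan(children, ops, "B1", R=2))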
Conclusion
The direct consequence is that, as opposed to general form programs, and whether
we consider graph-based or scan-based allocators, the only case where spilling is
required is when Maxlive > R, i.e., when the maximum number of simultaneously
live variables is greater than the number of available registers. In other words, there
is no need to try to colour the interference graph to check if spilling is necessary or
not: the spill test for SSA-form programs simplifies to simply computing Maxlive
and then comparing it to R.
This makes it possible to design a register allocation scheme where spilling is decoupled
from colouring: First, lower the register pressure to at most R everywhere in the
program. Then, colour the interference graph with R colours in polynomial time.
22.2 Spilling
We have seen previously that, under SSA, it is easy to decide in polynomial time
whether there are enough registers or not, simply by checking if Maxlive ≤ R,
the number of registers. The goal of this section is to present algorithms that will
lower the register pressure when it is too high, i.e., when Maxlive > R, by spilling
(assigning) some variables to memory.
Spilling is handled differently depending on the allocator used. For a scan-based
allocator, the spilling decision happens when we are at a particular program point.
Although it is actually a bit more complex, the idea when spilling a variable v is
to insert a store at that point, and a load just before its next use. This process leads
to spilling only a part of the live range. On the other hand, a graph-based allocator
has no notion of program points since the interferences have been combined in an
abstract structure: the interference graph. In the graph colouring setting, spilling
means removing a node of the interference graph and thus the entire live range of
a variable. This is called a spill-everywhere strategy, which implies inserting load
instructions in front of every use and store instructions after each definition of the
variables. These loads and stores require temporary variables that were not present
in the initial graph. These temporary variables also need to be assigned to registers.
This implies that whenever the spilling/colouring is done, the interference graph
has to be rebuilt and a new pass of allocation is triggered, until no variable is spilled
anymore: this is where the “Iterated” comes from in the IRC name.
In this section, we will consider the two approaches: the graph-based approach
with a spill-everywhere scheme, and the scan-based approach that allows partial live
range spilling. In both cases, we will assume that the program was in SSA before
spilling. It is important to notice that there are pros and cons of assuming so.
In particular, the inability to coalesce or move the shuffle code associated with φ-
functions can lead to spurious load and store instructions on CFG edges. Luckily,
these can be handled by a post-pass of partial redundancy elimination (PRE, see
Chap. 11), and we will consider here the spilling phase as a full-fledged SSA
program transformation.
Suppose we have R registers; the objective is to establish Maxlive ≤ R (Maxlive
lowering) by inserting loads and stores into the program. Indeed, as stated above,
lowering Maxlive to R ensures that a register allocation with R registers can be
found in polynomial time for SSA programs. Thus, spilling should take place before
registers are assigned and yield a program in SSA form. In such a decoupled register
allocation scheme, the spilling phase is an optimization problem for which we define
the following constraints and objective function:
• The constraints that describe the universe of possible solutions express that the
resulting code should be R-colourable.
• The objective function expresses the fact that the (weighted) amount of inserted
loads and stores should be minimized.
The constraints directly reflect the “spill test” that expresses whether more
spilling is necessary or not. The objective is expressed with the profitability test:
among all variables, which one is more profitable to spill? The main implication of
spilling in SSA programs is that the spill test—which amounts to checking whether
Maxlive has been lowered to R or not—becomes precise.
The other related implication of the use of SSA form follows from this
observation: consider a variable such that for any program point in its entire live
range the register pressure is at most R; then spilling this variable is useless
with regard to the colourability of the code. In other words, spilling such a variable
will never be profitable. We will call this yes-or-no criterion, enabled by the use of
SSA form, the “usefulness test.”
We will see now how to choose, among all “useful” variables (with regard to
the colourability), the ones that seem most profitable. In this regard, we present
in the next section how SSA allows us to better account for the program structure
in the spilling decision even in a graph-based allocator, thanks to the
capability to decouple spilling (allocation) from colouring (assignment). However,
register allocation under SSA shines the most in a scan-based setting, and we present
guidelines to help the spill decisions in such a scheme in Sect. 22.2.2.
of a variable then degenerates into small intervals: one from the definition to
the store, and one from each load to its subsequent use. However, even in this
simplistic setting, it is NP-complete to find the minimum number of nodes to remove to
establish Maxlive ≤ R. In practice, heuristics such as the one in IRC spill variables
(graph nodes) greedily using a weight that takes into account their node degree (the
number of interfering uncoloured variables) and an estimated spill cost (estimated
execution frequency of inserted loads and stores). Good candidates are high-degree
nodes of low spill cost, as this means they will lessen colouring constraints on many
nodes—their neighbours—while inserting few spill codes.
The node degree represents the profitability of spilling the node in terms of
colourability. It is not very precise as it is only a graph property, independent of the
control-flow graph. We can improve this criterion by using SSA to add a usefulness
tag. We will now show how to build this criterion and how to update it.
We attach to each variable v a “useful” tag, an integer representing the number
of program points that would benefit from spilling v, i.e., v.useful expresses the
number of program points that belong to the live range of v and for which the register
pressure is strictly greater than R.
Building the “Useful” Criteria
We will now explain how the “useful” tag can be built at the same time as the
interference graph. Under SSA, the interference graph can be built through a simple
bottom-up traversal of the CFG. When encountering the last use of a variable p,
p is added to the set of currently live variables (live set) and a corresponding node
(that we also call p) is added in the graph. Arriving at the definition of p, it is
removed from the current live set, and edges (interferences) are added to the graph:
for all variables v in the live set, there is an edge (v → p). Note that, as opposed to
standard interference graphs, we consider directed edges here, where the direction
represents the dominance. At that point, we also associate the following fields to
node p and its interferences:
• p.pressure, which corresponds to the number of variables live at the definition point
of p, that is, |live-set| + 1
• (v → p).high, a boolean set to true if and only if p.pressure > R, meaning this
interference belongs to a clique of size more than R

We then create the “usefulness” field of p. If p.pressure ≤ R, then p.useful is
set to 0. Otherwise, we do the following:

• p.useful is set to 1.
• For all (v → p), v.useful gets incremented by 1.

At the end of the build process, v.useful expresses the number of program points
that belong to the live range of v and for which the register pressure is greater
than R.
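The build can be sketched as follows (toy operation records of ours; last_use marks, per operation, the variables whose last use it is):

# Hedged sketch: bottom-up scan of a block that builds the directed
# interference edges and the "useful" counters at the same time.
def build_with_useful(block, last_use, R):
    live, edges, useful = set(), [], {}
    for op in reversed(block):             # live = live-out of op
        for d in op["defs"]:
            live.discard(d)
            useful.setdefault(d, 0)
            pressure = len(live) + 1       # live at d's definition point
            for v in live:
                edges.append((v, d, pressure > R))    # the "high" flag
            if pressure > R:               # this point is under pressure
                useful[d] += 1
                for v in live:
                    useful[v] = useful.get(v, 0) + 1
        for u in op["uses"]:
            if (u, op["id"]) in last_use:  # last use: u becomes live
                live.add(u)
                useful.setdefault(u, 0)
    return edges, useful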
In the context of a basic block, a simple algorithm that gives good results is the
“furthest first” algorithm that is presented in Algorithm 22.2. The idea is to scan
the block from top to bottom: whenever the register pressure is too high, we will
spill the variable whose next use is the furthest away, and it is spilled only up to this
next use. In the evict function of Algorithm 22.2, this corresponds to maximizing
distance_to_next_use_after(p). Spilling this variable frees a register for the longest
time, hence diminishing the chances of having to spill other variables later. This
algorithm is not optimal because it does not take into account the fact that the
first time we spill a variable is more costly than subsequent spills of the same
variable (the first time, a store and a load are added, but only a load must be
added afterwards). However, the general problem is NP-complete, and this heuristic,
although it may produce more stores than necessary, gives good results on
“straight-line code,” i.e., basic blocks.
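For illustration, here is a minimal Python sketch of this furthest-first eviction on one block; it is a simplification of Algorithm 22.2 in which next_use_after is an assumed helper returning the distance to a variable's next use (infinity if none), and spills are merely recorded rather than materialized.

def furthest_first(block, live_in, R, next_use_after):
    in_regs = set(live_in)
    spills = []
    for point, inst in enumerate(block):
        for v in inst.uses:          # reload variables evicted earlier
            in_regs.add(v)
        for d in inst.defs:
            in_regs.add(d)
        while len(in_regs) > R:      # pressure too high: evict furthest next use
            # (a full implementation would exclude the current operands)
            victim = max(in_regs, key=lambda v: next_use_after(point, v))
            in_regs.remove(victim)   # spilled only up to its next use
            spills.append((point, victim))
    return spills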
Profitability to Spill
To illustrate the generalization of the furthest-first priority strategy to a CFG, let us
consider the example of Fig. 22.3. In this figure, the +n sign denotes regions with a
(high) register pressure of R + n. At program point p0, the register pressure is too
high by one variable (suppose there are other hidden variables that we do not want
to spill). We have two candidates for spilling: x and y, and the classical furthest-first
criterion would differ depending on the chosen branch:
• If the left branch is taken, consider an execution trace (A B¹⁰⁰ C): In this
branch, the next use of y appears in a loop, while the next use of x appears much
further away, after the loop has fully executed. It is more profitable to evict
variable x (at distance 101).
• If the right branch is taken, the execution trace is (A D). In that case, it is
variable y that has the furthest next use (at distance 2), so we would evict
variable y.
We have two opposing viewpoints, but looking at the example as a whole, we see
that the left branch is not under pressure, so spilling x would only help at program
point p0, and one would need to spill another variable in block D (x is used at the
beginning of D); hence, it would be preferable to evict variable y.
On the other hand, if we modify the example slightly by assuming a high register
pressure within the loop at program point p1 (by introducing other variables), then
evicting variable x would be preferred, in order to avoid a load and a store inside
the loop!
This dictates the following remarks:
1. Program points with low register pressure can be ignored.
2. Program points within loops, or more generally with higher execution frequency,
should weigh more in the computation of the “distance” than program points with
lower execution frequency.
We will then replace the notion of “distance” in the furthest first algorithm with
a notion of “profitability,” that is, a measure of the number of program points
(weighted by frequency) that would benefit from the spilling of a variable v.
Definition 22.1 (Spill Profitability from p) Let p be a program point and v a
variable live at p. Let v.hp(p) (high pressure) be the set of all program points q
such that:
1. The register pressure at q is strictly greater than R.
2. v is live at q.
3. There exists a path from p to q that does not contain any use or definition of v.
Then,
v.spill_profitability(p) = Σ_{q ∈ v.hp(p)} frequency(q).
In the first scenario, we would evict y with a profitability of 1.5. In the second,
we would evict x with a profitability of 51 (plus another variable later, when arriving
at p3), which is the behaviour we wanted in the first place.
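A possible way to evaluate this definition is a forward walk of the CFG starting at p, stopping at uses or definitions of v; the sketch below assumes illustrative point attributes (succs, pressure, live, frequency, uses_or_defs) rather than any API from the book.

def spill_profitability(v, p, R):
    seen, work, profit = {p}, [p], 0.0
    while work:
        q = work.pop()
        if q.pressure > R and v in q.live:   # q is a high-pressure point of v
            profit += q.frequency
        for s in q.succs():
            # paths must not cross a use or definition of v
            if s not in seen and not s.uses_or_defs(v):
                seen.add(s)
                work.append(s)
    return profit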
Initial Register Filling at the Beginning of a Basic Block
When visiting basic block B, the set of variables that must reside in a register is
stored in B.in_regs. For each basic block, the initial value of this set has to be
computed before we start processing it. The heuristic for computing this set is
different for a “regular” basic block and for a loop entry. For a regular basic block,
as we assume a topological-order traversal of the CFG, all its direct predecessors
will have been processed. Live-in variables fall into three sets:
1. The ones that are available at the exit of all direct predecessor basic blocks:
allpreds_in_regs = ⋂_{P ∈ directpred(B)} P.in_regs
2. The ones that are available in some of the direct predecessor basic blocks:
somepreds_in_regs = ⋃_{P ∈ directpred(B)} P.in_regs
3. The ones that are available in none of the direct predecessor basic blocks.
For a basic block at the entry of a loop, as illustrated by the example of Fig. 22.4,
one does not want to account for the allocation of the direct predecessor basic block
but to start from scratch instead. Here, we assume the first basic block has already
been processed, and one wants to compute B.in_regs:
1. Example (a): Even if, at the end of the direct predecessor basic block, at p0, x is
not available in a register, one wants to insert a reload of x at p1, that is, include
x in B.in_regs. Not doing so would involve a reload at every iteration of the loop
at p2.
2. Example (b): Even if, at the entry of the loop, x is available in a register, one
wants to spill it at p1 and restore it at p4 so as to lower the register pressure that
is too high within the loop. Not doing so would involve a store in the loop at p2
and a reload on the back edge of the loop at p3. This means excluding x from
B.in_regs.
This leads to Algorithm 22.4, where B.live_in represents the set of live-in
variables of B and L.maxlive is the maximal register pressure in the whole loop
L. Init_inregs first fills B.in_regs with the live-in variables that are used within
the loop L. Then, we add live-through variables to B.in_regs, but only those
that can survive the loop: If L.maxlive > R, then L.maxlive − R variables
will have to be spilled (hopefully some live-through variables), so no more than
|live_through| − (L.maxlive − R) of them are allocated to a register at the entry of B.
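A minimal sketch of this loop-entry heuristic, assuming precomputed live-in sets, a per-loop maximal pressure L.maxlive, and a used_in predicate; all of these names are illustrative.

def init_inregs_loop_entry(B, L, R, used_in):
    # Live-in variables used inside the loop get priority for a register.
    in_regs = {v for v in B.live_in if used_in(v, L)}
    live_through = [v for v in B.live_in if not used_in(v, L)]
    # maxlive(L) - R variables must be spilled somewhere in the loop, so
    # only the remaining live-through variables may enter in a register.
    budget = max(0, len(live_through) - max(0, L.maxlive - R))
    in_regs.update(live_through[:budget])
    return in_regs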
We advocate here a decoupled register allocation: First, lower the register pressure
so that Maxlive ≤ R; second, assign variables to registers. Live range splitting
ensures that, after the first phase is done, no more spilling will be required, as R
registers will be sufficient, possibly at the cost of inserting register-to-register copies.
We already mentioned in Sect. 22.1.3 that the well-known “Iterated Register
Coalescing” (IRC) allocation scheme, which uses a simplification scheme, can take
advantage of the SSA-form property. We will show here that, indeed, the underlying
structural property makes a graph colouring simplification scheme (recalled below)
an “optimal” scheme. This is especially important because, besides minimizing the
amount of spill code, the second objective of register allocation is to perform a
“good coalescing,” that is, try to minimize the amount of register-to-register copies:
a decoupled approach is practically viable if the coalescing phase is effective in
merging most of the live ranges, introduced by the splitting from SSA, by assigning
live ranges linked by copies to the same register.
In this section, we will first present the traditional graph colouring heuristic,
based on a simplification scheme, and show how it successfully colours programs
under SSA form. We will then explain in greater detail the purpose of coalescing
and how it translates when performed on SSA-form programs. Finally, we will show
how to extend the graph-based (from IRC) and the scan-based (of Algorithm 22.1)
greedy colouring schemes to perform efficient coalescing.
13 if V ≠ ∅ then
14     Failure “The graph is not simplifiable”
15 return stack
The simplification process can get stuck when every remaining node has a degree
of at least R. In that case, we do not know whether the graph is R-colourable or
not. In traditional register allocation, this is the trigger for spilling some variables
so as to unblock the simplification process. However, under SSA form, if spilling
has already been done so that the maximum register pressure is at most R, the
greedy colouring scheme can never get stuck! We will not formally prove this fact
here but will nevertheless try to give some insight as to why this is true.
The key to understanding this property is to picture the dominance tree, with live
ranges as sub-trees of this tree, such as the one in Fig. 22.2b. At the end of each
dangling branch, there is a “leaf” variable: the one that is defined last in this branch.
These are the variables y1, y2, y3, and x3 in Fig. 22.2b. We can visually see that such
a variable will not have many intersecting variables: those are the variables alive at
its definition point, i.e., no more than Maxlive − 1 of them, hence no more than
R − 1. In Fig. 22.2b, with Maxlive = 3, we see that each of them has no more than
two neighbours.
Considering again the greedy scheme, this means each of them is a candidate
for simplification. Once removed, another variable will become the new leaf of that
particular branch (e.g., x1 if y1 is simplified). This means simplification can always
happen at the end of the branches of the dominance tree, and the simplification
process can progress upward until the whole tree is simplified.
In terms of graph theory, the general problem is knowing whether a graph is k-
colourable or not. Here, we can define a new class of graphs, containing those
graphs that can be coloured with this simplification scheme.
Definition 22.2 A graph is greedy-k-colourable if it can be simplified using the
Simplify function of Algorithm 22.5.
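For illustration, the simplification test behind this definition fits in a few lines of Python; here the graph is a plain dict mapping each node to its set of neighbours, and the sketch is ours, not Algorithm 22.5 itself.

def is_greedy_k_colourable(graph, k):
    degree = {u: len(ns) for u, ns in graph.items()}
    stack = [u for u in graph if degree[u] < k]   # initially simplifiable nodes
    removed = set()
    while stack:
        u = stack.pop()
        if u in removed:
            continue
        removed.add(u)                            # simplify u
        for w in graph[u]:
            if w not in removed:
                degree[w] -= 1
                if degree[w] == k - 1:            # w just became simplifiable
                    stack.append(w)
    return len(removed) == len(graph)             # empty graph: success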
One can prove the following theorem:
Theorem 1 Setting k = Maxlive, the interference graph of a code under SSA form
is always greedy-k-colourable.
This tells us that, if we are under SSA form and the spilling has already been
done so that Maxlive ≤ R, the classical greedy colouring scheme is guaranteed to
perform register allocation with R colours without any additional spilling, as the
graph is greedy-R-colourable.5
5 Observe that, with a program originally under SSA form, a practical implementation may still
choose to interleave the processes of spilling and coalescing/colouring. The result will be
unchanged, but the compilation speed might be impacted.
Coalescing, when done during the register allocation phase, is used to minimize
the amount of register-to-register copies in the final code. While there may not be
that many such copies at the high level (e.g., instructions “a ← b”), especially after
a phase of copy propagation under SSA (see Chap. 8), many such instructions are
added in different compiler phases by the time compilation reaches the register
allocation phase. For instance, adding copies is a common way to deal with register
constraints (see the practical discussion in Sect. 22.4).
An even more obvious and unavoidable reason in our case is the presence of φ-
functions due to the SSA form: The semantics of a φ-function corresponds to parallel
copies on the incoming edges of basic blocks, and destructing SSA, that is, getting
rid of φ-functions that are not machine instructions, is done through the insertion of
copy instructions. It is thus better to assign variables linked by a φ-function to the
same register, so as to “remove” the associated copies between subscripts of the
same variable. As already formalized (see Sect. 21.2 of Chap. 21) for the aggressive
coalescing scheme, we define a notion of affinity, acting as the converse of the
relation of interference and expressing how much two variables “want” to share the
same register. Adding a metric to this notion measures the benefit one could get
if the two variables were assigned to the same register: the weight represents
how many instructions coalescing would save at execution.
Coalescing comes with several flavours, which can be either aggressive or con-
servative. Aggressively coalescing an interference graph means coalescing non-
interfering nodes (that is, constraining the colouring) regardless of the chromatic
number of the resulting graph. An aggressive coalescing scheme is presented in
Chap. 21. Conservatively coalescing an interference graph means coalescing non-
interfering nodes without increasing the chromatic number of the graph. In both
cases, the objective function is the maximization of satisfied affinities, that is, the
maximization of the number of (weighted) affinities between nodes that have been
coalesced together. In the current context, we will focus on the conservative scheme,
as we do not want more spilling.
Obviously, because of the reducibility to graph-k-colouring, both coalescing
problems are NP-complete. However, graph colouring heuristics such as the Iter-
ated Register Coalescing use incremental coalescing schemes where affinities are
considered one after another. Incrementally, for two nodes linked by an affinity, the
heuristic will try to determine whether coalescing those two nodes will, with regard
to the colouring heuristic, increase the chromatic number of the graph or not. If not,
then the two corresponding nodes are (conservatively) coalesced. The IRC considers
two conservative coalescing rules that we recall here. Nodes with degree strictly less
than R are called low-degree nodes (those are simplifiable), while others are called
high-degree nodes.
Briggs rule merges u and v if the resulting node has fewer than R neighbours of
high degree. Such a node can always be simplified after its low-degree neighbours
are simplified; thus the graph remains greedy-R-colourable.
George rule merges u and v if all neighbours of u with high degree are also
neighbours of v. After coalescing, and once all low-degree neighbours are
simplified, one gets a sub-graph of the original graph, which is thus greedy-R-
colourable too.
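Both tests translate directly into code. The sketch below works on an undirected view of the graph where neighbours(u) returns the current neighbour set of u; for simplicity, degrees are taken in the current graph rather than in the merged one.

def briggs_ok(u, v, neighbours, R):
    # Neighbour set of the merged node u+v.
    merged = (neighbours(u) | neighbours(v)) - {u, v}
    high = [w for w in merged if len(neighbours(w)) >= R]
    return len(high) < R        # fewer than R high-degree neighbours

def george_ok(u, v, neighbours, R):
    # Every high-degree neighbour of u must already interfere with v.
    return all(w in neighbours(v)
               for w in neighbours(u)
               if w != v and len(neighbours(w)) >= R)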
The Iterated Register Coalescing algorithm normally also performs spilling and
includes many phases that are interleaved with colouring and coalescing, called in
the literature “freezing,” “potential spills,” “select,” and “actual spill” phases.
A pruned version of the coalescing algorithm used in the IRC can be obtained
by removing the freezing mechanism (explained below) and the spilling part. It is
presented in Algorithm 22.7. In this code, both the processes of coalescing and
simplification are combined. It works as follows:
1. Low-degree nodes that are not copy-related (no affinities) are simplified as much
as possible.
2. When no more nodes can be simplified this way, an affinity is chosen. If one of
the two rules (Briggs or George) succeeds, the corresponding nodes are merged.
If not, the affinity is erased.
3. The process iterates (from stage 1) until the graph is empty.
Originally, those rules were used for any graph, not necessarily greedy-R-
colourable, and with an additional clique of pre-coloured nodes—the physical
machine registers. With such general graphs, some restrictions on the applicability
of those two rules had to be applied when one of the two nodes was a pre-
coloured one. But in the context of greedy-R-colourable graphs, we do not need
such restrictions.
However, in practice, those two rules give insufficient results for coalescing the
many copies introduced, for example, by a basic SSA destruction. The main
reason is that the decision is too local: it depends on the degree of the immediate
neighbours only. But these neighbours may have a high degree just because their
own neighbours are not simplified yet, that is, the coalescing test may be applied
too early in the simplify phase.
This is the reason why the IRC actually iterates: instead of giving up coalescing
when the test fails, the affinity is “frozen,” that is, placed in a sleeping list and
“awakened” when the degree of one of the nodes implied in the rule changes. Thus,
affinities are in general tested several times, and copy-related nodes—nodes linked
by affinities with other nodes—should not be simplified too early to ensure the
affinities get tested.
The advocated scheme corresponds to the pseudo-code of Algorithm 22.7 and
is depicted in Fig. 22.5. It tries the coalescing of a given affinity only once and
thus does not require any complex freezing mechanism as done in the original IRC.
This is made possible thanks to the following enhancement of the conservative
coalescing rule: Recall that the objective of the Briggs and George rules is to test
whether the graph remains greedy-R-colourable after the merge. The “Brute” rule
tests this property directly: it merges the two nodes and simply runs the
simplification scheme on the resulting graph to check that it is still greedy-R-
colourable.
Fig. 22.5 Combined coalescing and colouring simplification scheme including Brute rule
The most natural way to perform coalescing using a scan-based (linear scan or tree
scan) approach would be to simply do biased colouring. Consider again the “tree
scan” depicted in Algorithm 22.1. Let us suppose the current program point p is a
copy instruction from variable v to v′. The colour of v that is freed at line 3 can be
reused for v′ at line 5. In practice, this extremely local strategy does not work that
well for φ-functions. Consider as an example the variables x1, x2, and x3 in the
program of Fig. 22.2b. As there is no specific reason for the greedy allocation to
assign the same register to both x1 and x2, when it comes to assigning one to x3,
the allocator will often be able to satisfy only one of its affinities.
To overcome this limitation, the idea is to use an aggressive pre-coalescing as
a pre-pass of our colouring phase. We can use one of the algorithms presented
in Chap. 21, but the results of aggressive coalescing should only be kept in a
separate structure and not applied to the program. The goal of this pre-pass is to
put copy-related variables into equivalence classes. In a classical graph colouring
allocator, the live ranges of the variables in a class are fused. We do not do so but
use the classes to bias the colouring. Each equivalence class has a colour, which is
initially unset, and is set as soon as one variable of the class is assigned to a register.
When assigning a colour to a variable, the tree scan checks if the colour of the class
is available, and picks it if it is. If not, it chooses a different colour (based on the
other heuristics presented here) and updates the colour of the class.
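A sketch of this biased colour choice, where each equivalence class carries an initially unset colour field; the names here are illustrative, not the book's.

def pick_colour(cls, forbidden, all_colours):
    # Prefer the class colour so copy-related variables share a register.
    if cls.colour is not None and cls.colour not in forbidden:
        return cls.colour
    colour = next(c for c in all_colours if c not in forbidden)
    cls.colour = colour     # bias future members of the class to this colour
    return colour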
There exists an elegant relationship between tree scan colouring and graph
colouring: Back in 1974, Gavril [122] showed that the intersection graphs of sub-
trees are exactly the chordal graphs. By providing an elimination scheme that
exposes the underlying tree structure, this relationship made it possible to prove
that chordal graphs can be optimally coloured in linear time with respect to the
number of edges in the graph. It is the rediscovery of this relationship, in the
context of live ranges of variables in SSA-form programs, that motivated different
research groups [39, 51, 136, 221] to revisit register allocation in the light of this
interesting property. Indeed, at that time, most register allocation schemes were
incomplete, assuming that the assignment part was hard by referring to the NP-
completeness reduction of Chaitin et al. to graph colouring. However, the
observation that SSA-based live range splitting allows the allocation and assignment
phases to be decoupled was not new [87, 111]. Back in the nineties, the LaTTe [314]
just-in-time compiler already implemented the ancestor of our tree scan allocator.
The most aggressive live range splitting, proposed by Appel and George [10],
highlighted the actual challenge that past approaches were facing when splitting
live ranges to help colouring: coalescing [40]. The PhD theses of Hack [135],
Bouchez [38], and Colombet [79] address the difficult challenge of making a neat
idea applicable to real life without trading away the elegant simplicity of the
original approach. For more exhaustive related-work references, we refer the reader
to the bibliographies of those documents.
Looking Under the Carpet
As done in (too) many register allocation papers, the heuristics described in this
chapter assume a simple, unrealistic architecture where all variables and registers
are equivalent and where the instruction set architecture does not impose any
specific constraint on register usage. Reality is different, including: 1. register
constraints such as two-address instructions that require two of the three operands
to use the same register, or instructions that impose the use of specific registers;
2. registers of various sizes (vector registers usually leading to register aliasing),
historically known as the register pairing problem; 3. instruction operands that
cannot reside in memory. Finally, SSA-based allocation, like any scheme that relies
on live range splitting, must deal with critical edges possibly considered abnormal
(i.e., that cannot be split) by the compiler.
In the context of graph colouring, register constraints are usually handled by
adding an artificial clique of pre-coloured nodes to the graph and splitting live
ranges around instructions with pre-coloured operands. This approach has several
disadvantages. First, it substantially increases the number of variables. Second, it
makes coalescing much harder. This motivated Colombet et al. [80] to introduce
the notion of antipathies (affinities with negative weight) and extend the coalescing
rules accordingly. The general idea is, instead of enforcing architectural constraints,
to simply express the cost (through affinities and antipathies) of the shuffle code
inserted by a post-pass repairing phase. In a scan-based context, the handling of
register constraints is usually done locally [203, 252]. The biased colouring strategy
used in this chapter was proposed by Braun et al. and Colombet et al. [46, 80] and
reduces the need for shuffle code.
This chapter describes the use of SSA-based high-level program representations for
the realization of the corresponding computations using hardware digital circuits.
We begin by highlighting the benefits of using a compiler SSA-based intermediate
representation in this hardware mapping process, using an illustrative example. The
subsequent sections describe hardware translation schemes for discrete hardware
logic structures or datapaths of hardware circuits and outline several compiler
transformations that benefit from SSA. We conclude with a brief survey of various
hardware compilation efforts from both academia and industry that have adopted
SSA-based internal representations.
A hardware design description will typically include discrete elements such as registers (flip-flops) that
capture the state of the computation at specific events, such as clock edges, and
combinatorial elements that transform the values carried by wires. The composition
of these elements allows programmers to build finite-state machines (FSMs) that
orchestrate the flow and the processing of data stored in internal registers or RAM
structures. These architectures support an execution model with operations akin to
assembly instructions found in common processors that support the execution of
high-level programs.
A common vehicle for the realization of hardware designs is an FPGA or field-
programmable gate array. These devices include a large number of configurable
logic blocks (or CLBs), each of which can be individually programmed to realize
an arbitrary combinatorial function of k inputs whose outputs can be latched in
flip-flops and connected via an internal interconnection network to any subset of
the CLBs in the device. Given their design regularity and simple structure, these
devices, popularized in the 1980s as fast hardware prototyping vehicles, have taken
advantage of Moore’s law to grow to large sizes with which programmers can define
custom architectures capable of TFlops/Watt performance, thus making them the
vehicle of choice for very power-efficient custom computing machines.
While developers were initially forced to design hardware circuits exclusively
using schematic capture tools, over the years, high-level behavioural synthesis
allowed them to leverage a wealth of hardware mapping and design exploration
techniques to realize substantial productivity gains.
As an example, Fig. 23.1 illustrates these concepts of hardware mapping for the
computation expressed as x ← (a×b)−(c×d)+f . Figure 23.1b depicts a graphical
representation of a circuit that directly implements this computation. Here, there is
a direct mapping between hardware operators such as adders and multipliers and the
operations in the computation. Input values are stored in the registers at the top of
the diagram, and the entire computation is carried out during a single (albeit long)
clock cycle, at the end of which the results propagated through the various hardware
operators are captured (or latched) in the registers at the bottom of the diagram. In
all, this direct implementation uses two multipliers, two adders/subtractors, and six
registers: five registers to hold the computation’s input values and one to capture
the computation’s output result. This hardware implementation requires a simple
control scheme, as it just needs to record the input values, and wait for a single
clock cycle at the end of which it stores the outputs of the operations in the
output register. Figure 23.1c depicts a different implementation variant of the same
computation, this time using nine registers and the same number of adders and
subtractors.1 The increased number of registers allows the circuit to be clocked
at a higher frequency as well as to be executed in a pipelined fashion. Lastly,
Fig. 23.1d depicts yet another possible implementation of the same computation,
1 It may be apparent that the original computation lacks any temporal specification in terms of the
relative order in which data-independent operations can be carried out. Implementation variants
exploit this property.
but using a single multiplier operator. This last version allows for the reuse in time
of the multiplier operator and requires thirteen registers as well as multiplexers
to route the inputs to the multiplier in two distinct control steps. As is apparent,
the reduction in the number of operators, in this particular case the multipliers,
carries a penalty in the form of an increased number of registers and multiplexers2
and an increased complexity of the control scheme.
This example illustrates the many degrees of freedom in high-level behavioural
hardware synthesis. Synthesis techniques perform the classical tasks of allocation,
binding, and scheduling of the various operations in a computation given specific
target hardware resources. For instance, a designer can use a behavioural synthesis
tool (e.g., Xilinx’s Vivado) to automatically derive a hardware implementation for a
computation, as expressed in the example in Fig. 23.1a. This high-level description
suggests a simple, direct hardware implementation using a single adder and two
multipliers, as depicted in Fig. 23.1b. Alternative hardware implementations taking
advantage of pipelined execution and the sharing of a multiplier to reduce hardware
resources are respectively illustrated in Fig. 23.1c and Fig. 23.1d. The tool then
derives the control scheme required to route the data from registers to the selected
units so as to meet the designers’ goals.
Despite the introduction of high-level behavioural synthesis techniques into
commercially available tools, hardware synthesis, and thus hardware compilation,
has never enjoyed the same level of success as traditional software compilation.
Sequential programming paradigms popularized by programming languages such
as C/C++ and, more recently, by Java, allow programmers to easily reason about
program behaviour as a sequence of program memory state transitions. The
underlying processors and the corresponding system-level implementations present
a number of simple unified abstractions, such as a unified memory model, a stack,
and a heap, that do not exist (and often do not make sense) in customized hardware
designs.
Hardware compilation, in contrast, has faced numerous obstacles that have
hampered its wide adoption. When developing hardware solutions, designers
must understand the concept of spatial concurrency that hardware circuits offer.
Precise timing and synchronization between distinct hardware components are key
abstractions in hardware. Solid and robust hardware design implies a detailed
understanding of the precise timing of specific operations, including I/O, that simply
cannot be expressed in languages such as C, C++, or Java. Alternatives such
as SystemC have emerged in recent years, giving the programmer considerably
more control over these issues. The inherent complexity of hardware designs
has hampered the development of robust synthesis tools that can offer high-level
programming abstractions enjoyed by tools that target traditional architecture and
software systems, thus substantially raising the barrier of entry for hardware designers.
2 A 2 × 1 multiplexer is a combinatorial circuit with two data inputs, a single output, and a control
input, where the control input selects which of the two data inputs is transmitted to the output. It
can be viewed as a hardware implementation of the C programming language selection operator:
out = q ? in1 : in2.
In this mapping, a multiplexer acts as a gated transfer of values that parallels the
actions of an if-then-else construct in software. Note also that, in the case of
backward control flow (e.g., associated with a back edge of a loop), the possible
undefinedness of one of the φ-function’s inputs is transparently handled: in a correct
execution, the predicate associated with the taken control-flow path selects the value
associated with a defined input of the SSA representation.
Equally important in this mapping is the notion that the computation in hardware
can now take a spatial dimension. In the hardware circuit in Fig. 23.2c, the computa-
tion derived from the statement in both branches of the if-then-else construct
can be evaluated concurrently by distinct logic circuits. After the evaluation of both
circuits, the multiplexer will define which set of values is used based on the value
of its control input, in this case of the value of the computation associated with p.
In a sequential software execution environment, the predicate p would be
evaluated first, and then one of the branches of the if-then-else construct would
be evaluated, based on the value of p; as long as the register allocator is able to
assign t1, t2, and t3 to the same register, the φ-function is executed implicitly;
if not, it is executed as a register-to-register copy.
There have been some efforts to automatically convert the sequential program
above into a semi-spatial representation that could obtain some speedup if executed
on a VLIW (very long instruction word) type of processor. For example, if-
conversion (see Chap. 20) would convert the control dependency into a data
dependency: statements from the if and else blocks could be interleaved, as
long as they do not overwrite one another’s values, and the proper result (the φ-
function) could be selected using a conditional move instruction. In the worst case,
however, this approach would effectively require the computation of both branch
sides, rather than one, so it could increase the computation time. In contrast, in a
spatial representation, the correct result can be output as soon as two of the three
inputs to the multiplexer are known (p, and one of t1 or t2 , depending on the value
of p).
While the Electronic Design Automation (EDA) community has for decades
now exploited similar information regarding data and control dependencies for the
generation of hardware circuits from increasingly higher-level representations (e.g.,
behavioural HDL), SSA-based representations make these dependencies explicit
in the intermediate representation itself. Similarly, the more classical compiler
representation, using three-address instructions augmented with def-use chains,
already exposes the same data-flow information as the SSA-based representation.
However, as we will explore in the next section, the latter facilitates the mapping
and selection of hardware resources.
3 As a first approach, these registers are virtual; after synthesis, some of them are mapped to physical registers.
Temporal Mapping
4 Speculation is also possible in the temporal mode by activating the inputs and execution of
multiple hardware blocks and is only limited by the available storage bandwidth to restore the
input context in each block, which, in the spatial approach, is trivial.
Several classical transformations can be applied to reduce the pressure on resource
requirements and thus lead to feasible hardware implementation designs. As these
optimizations are not specific to the SSA representation, we will not discuss them
further here.
In the spatial mapping approach, the SSA form plays an important role in the
minimization of multiplexers and thus in the simplification of the corresponding
data-path logic and execution control.
Consider the illustrative example in Fig. 23.4a. Here, basic block BB0 defines
a value for the variables x and y. One of the two subsequent basic blocks, BB1,
redefines the value of x, whereas the other basic block, BB2, only reads x.
A naive implementation based exclusively on liveness analysis (see Chap. 9)
would use multiplexers for both variables x and y to merge their values as inputs
to the hardware circuit implementing basic block BB3, as depicted in Fig. 23.4b.
As can be observed, however, the SSA form representation captures the fact that
such a multiplexer is only required for variable x. The value for variable y can be
propagated either from the output value in the hardware circuit for basic block BB0
(as shown in Fig. 23.4c) or from any other register that has a valid copy of the
y variable. The direct flow of the single definition point to all its uses, across the
hardware circuits corresponding to the various basic blocks in the SSA form, thus
allows a compiler to use the minimal number of multiplexers strictly required.5
An important aspect of the implementation of a multiplexer associated with a
φ-function is the definition and evaluation of the predicate associated with each
multiplexer’s control (or selection) input signal. In the basic SSA representation, the
selection predicates are not explicitly defined, as the execution of each φ-function is
implicit when the control flow reaches it. When mapping a computation to hardware,
however, a φ-function clearly elicits the need to define a predicate to be included as
part of the hardware logic circuit that defines the value of the multiplexer circuit’s
selection input signal. To this effect, hardware mapping must rely on a variant of
SSA, named gated SSA (see Chap. 14), which explicitly captures the symbolic
predicate information in the representation.6 The generation of the hardware circuit
then simply uses the register that holds the value of the corresponding version of
the predicate variable. Figure 23.5 illustrates an example of a mapping using the
information provided by the gated SSA form.
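As an illustration, a gated φ-function can be lowered into a chain of 2 × 1 multiplexers whose select signals are the registers holding the gate predicates. The Python sketch below emits such a chain as text; the inputs attribute, an ordered list of (predicate, value) pairs whose last entry acts as the default, is an assumption of ours, not the book's representation.

def phi_to_mux_chain(gated_phi):
    pairs = list(gated_phi.inputs)    # [(pred_reg, value_reg), ...]
    result = pairs[-1][1]             # last input acts as the default
    for pred, value in reversed(pairs[:-1]):
        result = f"MUX(sel={pred}, a={value}, b={result})"
    return result

# A two-input gated phi (p ? t1 : t2) becomes: MUX(sel=p, a=t1, b=t2)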
When combining multiple predicates in the gated SSA form, it is often desirable
to leverage the control-flow representation in the form of the program dependence
graph (PDG) described in Chap. 14. In the PDG representation, basic blocks that
5 Under the scenarios of a spatial mapping and with the common disclaimers about static control-
flow analysis.
6 As with any SSA representation, variable names fulfil referential transparency.
Fig. 23.4 Mapping of variable values across hardware circuit using spatial mapping
share common execution predicates (i.e., both execute under the same predicate
conditions) are linked to the same region nodes. Nested execution conditions are
easily recognized as the corresponding nodes are hierarchically organized in the
PDG representation. As such, when generating code for a given basic block, an
algorithm will examine the various region nodes associated with a given basic
block and compose (using AND operators) the outputs of the logic circuits that
implement the predicates associated with these nodes. If a hardware circuit already
exists that evaluates a given predicate that corresponds to a given region, the
implementation can simply reuse its output signal. This lazy code generation and
predicate composition achieves the goal of hardware circuit sharing, as illustrated
by the example in Fig. 23.6 where some of the details were omitted for simplicity.
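A sketch of this lazy generation and predicate composition: region predicates are synthesized at most once, cached, and AND-ed for every basic block belonging to the region. Here synthesize_predicate stands for an assumed routine that emits the predicate's logic circuit and returns its output signal name.

def block_select_predicate(block, synthesize_predicate, cache):
    signals = []
    for region in block.region_nodes:        # enclosing regions, nested
        if region not in cache:              # emit the circuit only once
            cache[region] = synthesize_predicate(region)
        signals.append(cache[region])
    # Nested region predicates are composed with AND operators.
    return " AND ".join(signals) if signals else "true"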
When using the PDG representation, however, care must be taken regarding the
Fig. 23.6 Use of the predicates in region nodes of the PDG for mapping into the multiplexers
associated with each φ-function
A k-input integer addition circuit is more efficient than adding k integers two at a time.
Moreover, hardware multipliers naturally contain multi-operand adders as building
blocks: a partial product generator (a layer of AND gates) is followed by a multi-
operand adder called a partial product reduction tree. For these reasons, there have
been several efforts in recent years to apply high-level algebraic transformations
to source code with the goal of merging multiple addition operations with partial
product reduction trees of multiplication operations. The basic flavour of these
transformations is to push the addition operators towards the outputs of a data-flow
graph, so that they can be merged at the bottom. Examples of these transformations
that use multiplexers are depicted in Fig. 23.7a and b. In the case of Fig. 23.7a,
the transformation means that an addition is always executed, unlike in the
original hardware design. This can lead to more predictable timing or more
uniform power-draw signatures.7 Figure 23.7c depicts a similar transformation that
[Fig. 23.7: three data-flow graphs, (a), (b), and (c), each showing an original
circuit built from adders (+) and multiplexers (MUX) over inputs a, b, c, d, and
the constant “0,” next to its transformed version in which the addition is pushed
towards the output y]
merges two multiplexers sharing a common input, while exploiting the commutativity
of the addition operator. The SSA-based representation facilitates these
transformations, as it explicitly indicates (by tracing backward in the representation)
which values are involved in the computation of the corresponding results. For the
example in Fig. 23.7b, a compiler could quickly detect that the variable a is common
to the two expressions associated with the φ-function.
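The flavour of these rewrites can be captured on a tiny expression AST. The sketch below, our own illustration rather than the book's code, turns y = MUX(p, a + b, c) into y = MUX(p, a, 0) + MUX(p, b, c), pushing the addition to the output so that it always executes.

from dataclasses import dataclass

@dataclass
class Mux:
    p: object   # select signal
    t: object   # value when p is true
    f: object   # value when p is false

@dataclass
class Add:
    l: object
    r: object

def push_add_out(node):
    # MUX(p, l + r, f)  ==>  MUX(p, l, 0) + MUX(p, r, f)
    # If p: l + r; if not p: 0 + f = f, so the semantics are preserved.
    if isinstance(node, Mux) and isinstance(node.t, Add):
        return Add(Mux(node.p, node.t.l, 0), Mux(node.p, node.t.r, node.f))
    return node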
For spatially oriented hardware circuits, moving a φ-function from one basic block
to another can alter the length of the wires that are required to transmit data from
the hardware circuits corresponding to the various basic blocks. As the boundaries
of basic blocks are natural synchronization points, where values are captured in
hardware registers, the length of wires dictates the maximum allowed hardware
clock rate for synchronous designs. We illustrate this effect via an example as
depicted in Fig. 23.8. In this figure, each basic block is mapped to a distinct hardware
unit, whose spatial implementation is approximated by a rectangle. A floor planning
algorithm must place each of the units in a two-dimensional plane while ensuring
that no two units overlap. As can be seen in Fig. 23.8a, placing block 4 on the
right-hand side of the plane will result in several mid-range and one long-range
wire connections. However, placing block 4 at the centre of the design would
virtually eliminate all mid-range connections as all connections corresponding to the
transmission of the values of variable x become nearest-neighbour connections.
The same result could be obtained by moving the φ-function to block 5, which
is illustrated in Fig. 23.8b. But moving a multiplexer from one hardware unit to
another can significantly change the dimensions of the resulting unit, which is not
under the control of the compiler. Changing the dimensions of the hardware units
fundamentally changes the placement of modules, so it is very difficult to predict
whether moving a φ-function will actually be beneficial. For this reason, compiler
optimizations that attempt to improve the physical layout must be performed using
a feedback loop so that the results of the lower-level CAD tools that produce the
layout can be reported back to the compiler.
8 A typical k-input LUT will include an arbitrary combinatorial functional block of those k inputs.
Fig. 23.8 Example of the impact of φ-function movement in reducing hardware wire length
The CASH compiler [54] uses an augmented predicated SSA representation with
tokens to explicitly express synchronization and handle may-dependencies, thus
supporting speculative execution. This fine-grain synchronization mechanism is also
used to serialize the execution of consecutive hyper-blocks, thus greatly simplifying
the code generation. Other efforts also exploit instruction-level optimizations or
algebraic properties of the operators for minimization of expensive hardware
resources such as multipliers, adders, and in some cases even multiplexers [198,
298]. For a comprehensive description of a wide variety of hardware-oriented high-
level program transformations, the reader is referred to [57].
Further Reading
Despite their promise in terms of high performance and high computational
efficiency, hardware devices such as FPGAs have long been beyond the reach of
the “average” software programmer. To program them effectively using hardware-
oriented programming languages such as VHDL [12], Verilog [283], OpenCL [173],
or SystemC [218], developers must assume the roles of both software and hardware
designers.
To address the semantic gap between a hardware-oriented programming model
and high-level software programming models, various research projects, first in
academia and later in industry, developed prototype tools that could bridge this gap
and make the promising technology of configurable logic approachable among a
wider audience of programmers. In these efforts, loosely labeled C-to-Gates, com-
pilers performed the traditional phases of program data- and control-dependence
analysis to uncover opportunities for concurrent and/or pipelined execution and
directly translated the underlying data flow to Verilog/VHDL description alongside
the corresponding control logic. In practice, these compilers, of which Vivado
HLS [310] and LegUp [56] are notable examples, focus on loop constructs
with significant execution time weights (the so-called hot spots), for which they
automatically derive pipelined hardware implementations (often guided by user-
provided compilation directives) that execute them efficiently. When the target
architecture is a “raw” FPGA (rather than an overlay architecture), this approach
invariably leads to long compilation and synthesis times.
The inherent difficulties and limitations in extracting enough instruction-level
parallelism in these approaches, coupled with the increase in the devices’ capacities
(e.g., Intel’s Arria [147] and Xilinx’s Virtex UltraScale+ [311]), have prompted
a search for programming models with a more natural concurrency that would
facilitate the mapping of high-level computations to hardware. One such example
is MapReduce [95], originally introduced by Google to distribute concurrent jobs
across clusters of servers, which has been an effective programming model for
FPGAs [315]. Similarly, high-level languages based on parallel models of
computation such as synchronous data flow [179] or functional single-assignment
languages have also been shown to be good choices for hardware compilation [33,
137, 145].
Chapter 24
Building SSA in a Compiler for PHP
Dynamic scripting languages such as PHP, Python, and JavaScript are among the
most widely used programming languages.
Dynamic scripting languages provide flexible high-level features, a fast modify-
compile-test environment for rapid prototyping, strong integration with popular
strongly typed programming languages, and an extensive standard library. Dynamic
scripting languages are widely used in the implementation of many of the best-
known web applications of the last decade such as Facebook, Wikipedia, and Gmail.
Most web browsers support scripting languages such as JavaScript for client-side
applications that run within the browser. Languages such as PHP and Ruby are
popular for server-side web pages that generate content dynamically and provide
close integration with back-end databases.
One of the most widely used dynamic scripting languages is PHP, a general-
purpose language that was originally designed for creating dynamic web pages.
PHP has many of the features that are typical of dynamic scripting languages. These
include a simple syntax, dynamic typing, and the ability to dynamically generate new
source code at runtime and execute it. These simple, flexible features
facilitate rapid prototyping, exploratory programming, and in the case of many non-
professional websites, a copy-paste-and-modify approach to scripting by non-expert
programmers.
Constructing SSA form for languages such as C/C++ and Java is a well-studied
problem. Techniques exist to handle the most common features of static languages,
and these solutions have been tried and tested in production-level compilers over
many years. In these static languages, it is not difficult to identify a set of scalar
variables that can be safely renamed. Better analysis may lead to more such variables
being identified, but significant numbers of such variables can be found with very
simple analysis.
In our study of optimizing dynamic scripting languages, specifically PHP, we
find this is not the case. The information required to build SSA form—that is, a
conservatively complete set of unaliased scalars, and the locations of their uses and
definitions—is not available directly from the program source and cannot be derived
from a simple analysis. Instead, we find a litany of features whose presence must be
ruled out, or heavily analysed, in order to obtain a non-pessimistic estimate.
Dynamic scripting languages commonly feature runtime generation of source
code, which is then executed, built-in functions, and variable-variables, all of which
may alter arbitrary unnamed variables. Less common—but still possible—features
include the existence of object handlers which have the ability to alter any part of
the program state, most dangerously a function’s local symbol table. The original
implementations of dynamic scripting languages were all interpreted, and in many
cases, this led to their creators including very dynamic language features that are
easy to implement in an interpreter but make program analysis and compilation
very difficult.
Ruling out the presence of these features requires precise, inter-procedural,
whole-program analysis. We discuss the futility of the pessimistic solution, the
analyses required to provide a precise SSA form, and how the presence of variables
of unknown types affects the precision of SSA.
Note that dynamic scripting languages share similar features of other kinds of
dynamic programming languages, such as Lisp and Smalltalk. The main distin-
guishing feature of dynamic languages is that they resolve at runtime behaviours
that many other languages perform at compile time. Thus, languages such as Lisp
support dynamic typing and runtime generation of source code which can then
be executed. The main difference between scripting languages and other dynamic
languages is their intended use, rather than the features of the language. Scripting
languages were originally designed for writing short scripts, often to invoke other
programs or services. It was only later that developers started to write large complex
programs using scripting languages.
The rest of this chapter describes our experiences with building SSA in PHC, an
open source compiler for PHP. We identify the features of PHP that make building
SSA difficult, outline the solutions we found to some of these challenges, and
draw some lessons about the use of SSA in analysis frameworks for PHP.
Figure 24.1 shows a simple PHP function whose local variables may all be converted
into SSA form; it is trivial to convert each variable. For each statement, the list of
variables that
are used and defined is immediately obvious from a simple syntactic check.
By contrast, Fig. 24.2 contains variables that cannot be trivially converted into
SSA form. On Line 3, the variable x has its address taken. As a result, to convert x
into SSA form, we must know that Line 4 modifies x and introduce a new version
of x accordingly. Chapter 16 describes HSSA form, a powerful way to represent
indirect modifications to scalars in SSA form.
To discover the modification of x on Line 4 requires an alias analysis. Alias
analyses detect when multiple program names (aliases) represent the same memory
location. In Fig. 24.2, x and ∗y alias each other.
There are many variants of alias analysis, of varying complexity. The most
complex ones analyse the entire program text, taking into account the control flow,
the function calls, and multiple levels of pointers. However, it is not difficult to
perform a very simple alias analysis. Address-taken alias analysis identifies all
variables whose addresses are taken by a referencing assignment. All variables
whose address has been taken anywhere in the program are considered to alias each
other (that is, all address-taken variables are in the same alias set).
When building SSA form with address-taken alias analysis, variables in the alias
set are not renamed into SSA form. All other variables are converted. Variables not
in SSA form do not possess any SSA properties, and pessimistic assumptions must
be made. As a result of address-taken alias analysis, it is straightforward to convert
any C program into SSA form, without sophisticated analysis. In fact, this allows
more complex alias analyses to be performed on the SSA form.
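The address-taken analysis is simple enough to sketch in a few lines; the program.instructions iterator and the address_taken_operands helper are illustrative assumptions, not an existing API.

def address_taken_analysis(program):
    alias_set = set()
    for inst in program.instructions:
        # variables appearing under '&' (referencing assignments)
        alias_set.update(inst.address_taken_operands())
    # everything else is a safe candidate for SSA renaming
    ssa_candidates = set(program.variables) - alias_set
    return alias_set, ssa_candidates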
1: $x = 5;
2: $y =& $x;
3: $y = 7;
Fig. 24.4 Similar (a) PHP and (b) C functions with parameters
In Fig. 24.4a, the two formal parameters may alias each other or some third
variable. It is important to note, however, that the possibility of such aliasing is
not apparent from the function prototype or any type declarations. Instead, whether
the formal parameters x and/or y are references depends on the types of the actual
parameters that are passed when the function is invoked. If a reference is passed as a
parameter to a function in PHP, the corresponding formal parameter in the function
also becomes a reference.
The addition operation in Line 2 of Fig. 24.4a may therefore change the value of
y, if x is a reference to y or vice versa. In addition, recall that dynamic typing in PHP
means that whether or not a variable contains a reference can depend on control flow
leading to different assignments. Therefore, on some executions of a function, the
passed parameters may be references, whereas on other executions they may not.
In the PHP version, there are no syntactic clues that variables may alias.
Furthermore, as we show in Sect. 24.4, there are additional features of PHP that
can cause the values of variables to be changed without simple syntactic clues. In
order to be sure that no such features can affect a given variable, an analysis is
needed to detect such features. As a result, a simple conservative aliasing estimate
that does not take account of PHP’s references and other difficult features—similar
to C’s address-taken alias analysis—would need to place all variables in the alias
set. This would leave no variables available for conversion to SSA form. Instead, an
inter-procedural analysis is needed to track references between functions.
PHP’s dynamic typing means that program analysis cannot be performed a function
at a time. As function signatures do not indicate whether parameters are references,
this information must be determined by inter-procedural analysis. Furthermore, each
function must be analysed with full knowledge of its calling context. This requires
a whole-program analysis. We present an overview of the analysis below. A full
description is beyond the scope of this chapter.
The analysis begins at the first statement in the program, which is placed in a worklist.
The worklist is then processed a statement at a time.
For each analysed statement, the results of the analysis are stored. If the analysis
results change, the statement’s direct successors (in the control-flow graph) are
added to the worklist. This is similar to the treatment of CFG edges (through
CFGWorkList) in the SCCP algorithm. There is no parallel to the treatment of SSA
edges (SSAWorkList), since the analysis is not performed on the SSA form. Instead,
loops must be fully analysed if their headers change.
This analysis is therefore less efficient than the SCCP algorithm, in terms of time.
It is also less efficient in terms of space. SSA form allows results to be compactly
stored in a single array, using the SSA index as an array index. This is very space
efficient. In our analysis, we must instead store a table of variable results at all points
in the program.
Upon reaching a function or method call, the analysis begins analysing the callee
function, pausing the caller’s analysis. A new worklist is created and initialized with
the first statement in the callee function. The worklist is then run until it is exhausted.
If another function call is analysed, the process recurses.
Upon entering the callee function, the analysis results are copied into the scope of
the callee. Once its worklist is exhausted, the analysis results for the exit node of
the function are copied back to the calling statement’s results.
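The overall structure of the analysis can be sketched as follows; transfer, copy_into_callee, and copy_back are assumed helpers standing for the abstract evaluation and the scope-copying steps just described.

def analyse(entry, results, transfer, copy_into_callee, copy_back):
    work = [entry]
    while work:
        stmt = work.pop(0)
        if stmt.is_call():
            copy_into_callee(stmt, results)   # caller facts enter callee scope
            analyse(stmt.callee_entry(), results,
                    transfer, copy_into_callee, copy_back)
            out = copy_back(stmt, results)    # exit-node facts return to caller
        else:
            out = transfer(stmt, results)     # abstract evaluation of stmt
        if out != results.get(stmt):
            results[stmt] = out               # change: re-queue CFG successors
            work.extend(stmt.cfg_succs())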
Our analysis computes and stores three different kinds of results. Each kind of result
is stored at each point in the program.
The first models the alias relationships in the program in a points-to graph.
The graph contains variable names as nodes, and the edges between them indicate
aliasing relationships. An aliasing relationship indicates that two variables either
must-alias or may-alias. Two unconnected nodes cannot alias. A points-to graph is
stored for each point in the program. Graphs are merged at CFG join nodes.
Secondly, our analysis also computes a conservative estimate of the types of
variables in the program. Since PHP is an object-oriented language, polymorphic
method calls are possible, and they must be analysed. As such, the set of possible
types of each variable is stored at each point in the program. This portion of the
analysis closely resembles using SCCP for type propagation.
Finally, like the SCCP algorithm, constants are identified and propagated through
the analysis of the program. Where possible, the algorithm resolves branches
statically using propagated constant values. This is particularly valuable because
our PHC ahead-of-time compiler for PHP creates many branches in the interme-
diate representation during early stages of compilation. Resolving these branches
statically eliminates unreachable paths, leading to significantly more precise results
from the analysis algorithm.
To build SSA form, we need to be able to identify the set of points in a program
where a given variable is defined or used. Since we cannot easily identify these sets
due to potential aliasing, we build them as part of our program analysis. Using our
alias analysis, any variables that may be written to or read from during a statement’s
execution are added to a set of defs and uses for that statement. These are then used
during construction of the SSA form.
For an assignment by copy, $x = $y:
1. $x’s value is defined.
2. $x’s reference is used (by the assignment to $x).
3. For each alias $x_alias of $x, $x_alias’s value is defined. If the alias is possible,
$x_alias’s value is may-defined instead of defined. In addition, $x_alias’s
reference is used.
4. $y’s value is used.
5. $y’s reference is used.
For an assignment by reference, $x =& $y:
1. $x’s value is defined.
2. $x’s reference is defined (it is not used—$x does not maintain its previous
reference relationships).
3. $y’s value is used.
4. $y’s reference is used.
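These rules translate directly into def/use computation. In the sketch below, each variable object is assumed to carry value and ref handles plus an aliases list whose entries have a must_alias flag; all names are our own illustration, not PHC's data structures.

def copy_assignment_effects(x, y):            # $x = $y
    defs = {x.value}
    may_defs = set()
    uses = {x.ref, y.value, y.ref}
    for a in x.aliases:
        # definite aliases are defined; possible aliases are may-defined
        (defs if a.must_alias else may_defs).add(a.value)
        uses.add(a.ref)                       # each alias's reference is used
    return defs, may_defs, uses

def reference_assignment_effects(x, y):       # $x =& $y
    # x's previous reference relationships are dropped, not used
    defs = {x.value, x.ref}
    uses = {y.value, y.ref}
    return defs, set(), uses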
24.3.4 HSSA
Once the set of locations where each variable is defined or used has been identified,
we have the information needed to construct SSA. However, it is important to
note that due to potential aliasing and potential side effects of some difficult-to-
analyse PHP features (see Sect. 24.4), many of the definitions we compute are
may-definitions. Whereas a normal definition of a variable means that the variable
will definitely be defined, a may-definition means that the variable may be defined
at that point.2
In order to accommodate these may-definitions in our SSA form for PHP, we
adapt a number of features from Hashed SSA form.
2 Or more precisely, a may-definition means that there exists at least one possible execution of
the program where the variable is defined at that point. Our algorithm computes a conservative
approximation of may-definition information. Therefore, our algorithm reports a may-definition
in any case where the algorithm cannot prove that no such definition can exist on any possible
execution.
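The essential adaptation is that a may-definition is renamed through a
χ-function, which both defines a fresh SSA version and uses the old one, since
the variable may or may not actually be overwritten. A sketch of the renaming
step, with emitChi as a hypothetical IR hook:

    #include <string>

    using Var = std::string;
    struct SSAName { Var var; int version; };

    // Hypothetical hook: emit "def = chi(use)" at the current statement.
    void emitChi(const SSAName& def, const SSAName& use);

    // A may-definition creates a fresh version linked to the old one;
    // a must-definition would create the fresh version without the chi operand.
    SSAName renameMayDef(const Var& v, int& counter, const SSAName& reaching) {
      SSAName fresh{v, ++counter};
      emitChi(fresh, reaching);     // fresh = chi(reaching)
      return fresh;                 // fresh is now the reaching version
    }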
For simplicity, the description so far of our algorithm has only considered the
problems arising from aliasing due to PHP reference variables. However, in the
process of constructing SSA in our PHP compiler, several other PHP language
features make it difficult to identify all the points in a program where a variable
may be defined. In this section, we briefly describe these language features and how
they may be dealt with in order to conservatively identify all may-definitions of all
variables in the program.
Figure 24.5 shows a program which accesses a variable indirectly. On line 2, a string
value is read from the user and stored in the variable var_name. On line 3, some
variable—whose name is the value in var_name—is set to 5. That is, any variable
can be updated, and the updated variable is chosen by the user at runtime. It is
not possible to know whether the user has provided the value “x”, and so to know
whether x has the value 5 or 6.
This feature is known as variable-variables. They are possible because a
function’s symbol table in PHP is available at runtime. Variables in PHP are not
the same as variables in C. A C local variable is a name which represents a
memory location on the stack. A PHP local variable is a key in a map from
strings to values, and the same runtime value may be the target of multiple
string keys (references, discussed above). Variable-variables allow this symbol
table to be accessed dynamically at runtime, allowing arbitrary values to be
read and written.
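Concretely, this runtime model can be pictured as a per-function map; the helper
names in the comments below are hypothetical:

    #include <map>
    #include <string>

    struct Value;                                       // boxed runtime value
    using SymbolTable = std::map<std::string, Value*>;  // one per PHP function

    // The assignment $$var_name = 5 writes through a computed key, roughly
    //     table[currentStringValueOf("var_name")] = boxedInt(5);
    // so, absent further information, every entry may have been overwritten.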
Upon seeing a variable-variable, the analysis must assume that any variable may
be written to. In HSSA form, this creates a χ-function for every variable in the
program. In order to reduce the set of variables that might be updated by
assigning to a variable-variable, the contents of the string stored in var_name
may be modelled using the string analysis of Wassermann and Su [302].
String analysis is a static program analysis that models the structure of strings.
For example, string analysis may be able to tell us that the name stored in the
variable-variable (that is the name of the variable that will be written to) begins with
the letter “x.” In this case, all variables that do not begin with “x” do not require a
χ -function, leading to a more precise SSA form.
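A sketch of how such a result can be consumed, assuming a hypothetical
prefix-only summary (the actual analysis of Wassermann and Su is considerably
richer):

    #include <string>

    using Var = std::string;

    // Over-approximation of the string stored in the variable-variable;
    // an empty prefix means nothing is known.
    struct StringSummary { std::string knownPrefix; };

    // Only variables whose names the summary admits need a chi-function at
    // an assignment through the variable-variable.
    bool needsChi(const Var& candidate, const StringSummary& s) {
      return candidate.compare(0, s.knownPrefix.size(), s.knownPrefix) == 0;
    }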
PHP provides a feature that allows an arbitrary string to be executed as code. This
eval statement simply executes the contents of any string as if the contents of the
string appeared inline in the location of the eval statement. The resulting code can
modify local or global variables in arbitrary ways. The string may be computed by
the program or read in from the user or from a web form.
The PHP language also allows the contents of a file to be imported into the
program text using the include statement. The name of the file is an arbitrary
string which may be computed or read in at runtime. The result is that any file
can be included in the text of the program, even one that has just been created by
the program. Thus, the include statement is potentially just as flexible as the eval
statement for arbitrarily modifying program state.
Both of these may be modelled using the same string analysis techniques as
discussed in Sect. 24.4.1. The eval statement may also be handled using profiling,
which restricts the set of possible eval statements to those actually used in
practice.
In this chapter, we have described our experience of building SSA in PHC, an ahead-
of-time compiler for PHP. PHP is quite different to static languages such as C/C++
and Java. In addition to dynamic typing, it has other dynamic features such as very
flexible and powerful reference types, variable-variables, and features that allow
almost arbitrary code to be executed.
The main result of our experience is to show that SSA cannot be used as an end-
to-end intermediate representation (IR) for a PHP compiler. The main reason is that
in order to build SSA, significant analysis of the PHP program is needed to deal
with aliasing and to rule out potential arbitrary updates of variables. We have found
that in real PHP programs these features are seldom used in ways that make analysis
really difficult [28]. But analysis is nonetheless necessary to show the absence
of bad uses of these features.
In principle, our analysis could perform only the alias analysis prior to building
SSA and perform type analysis and constant propagation in SSA. But our experience
is that combining all three analyses greatly improves the precision of alias analysis.
References
1. Aho, A. V., & Johnson, S. C. (1976). Optimal code generation for expression trees. Journal
of the ACM, 23(3), 488–501.
2. Aho, A. V., Sethi, R., & Ullman, J. D. (1986). Compilers: Principles, techniques, and tools.
Addison-Wesley series in computer science/World student series edition.
3. Allen, F. E. (1970). Control flow analysis. Sigplan Notices, 5(7), 1–19.
4. Allen, F. E., & Cocke, J. (1976). A program data flow analysis procedure. Communications
of the ACM, 19(3), 137–146.
5. Allen, J. R., et al. (1983). Conversion of control dependence to data dependence. In Proceed-
ings of the Symposium on Principles of Programming Languages. POPL ’83 (pp. 177–189).
6. Alpern, B., Wegman, M. N., & Kenneth Zadeck, F. (1988). Detecting equality of variables
in programs. In Proceedings of the Symposium on Principles of Programming Languages.
POPL ’88, pp. 1–11.
7. Amaral, J. N., et al. (2001). Using the SGI Pro64 open source compiler infra-structure
for teaching and research. In Symposium on Computer Architecture and High Performance
Computing (pp. 206–213).
8. Ananian, S. (1999). The Static Single Information Form. Master’s Thesis. MIT.
9. Appel, A. W. (1992). Compiling with continuations. Cambridge: Cambridge University Press.
10. Appel, A. W. (1998). Modern compiler implementation in {C,Java,ML}. Cambridge: Cam-
bridge University Press.
11. Appel, A. W. (1998). SSA is functional programming. Sigplan Notices, 33(4), 17–20.
12. Ashenden, P. J. (2001). The designer’s guide to VHDL. Burlington: Morgan Kaufmann
Publishers Inc.
13. August, D. I., et al. (1998). Integrated predicated and speculative execution in the IMPACT
EPIC architecture. In Proceedings of the International Symposium on Computer Architecture.
ISCA ’98 (pp. 227–237).
14. Aycock, J., & Horspool, N. (2000). Simple generation of static single assignment form. In
International Conference on Compiler Construction. CC ’00 (pp. 110–125).
15. Bachmann, O., Wang, P. S., & Zima, E. V. (1994). Chains of recurrences—A method
to expedite the evaluation of closed-form functions. In Proceedings of the International
Symposium on Symbolic and Algebraic Computation (pp. 242–249).
16. Banerjee, U. (1988). Dependence analysis for supercomputing. Alphen aan den Rijn: Kluwer
Academic Publishers.
17. Barik, R. (2010). Efficient Optimization of Memory Accesses in Parallel Programs. PhD
Thesis. Rice University.
18. Barik, R., & Sarkar, V. (2009). Interprocedural load elimination for dynamic optimization of
parallel programs. In Proceedings of the International Conference on Parallel Architectures
and Compilation Techniques. PACT ’09 (pp. 41–52).
19. Barthou, D., Collard, J.-F., & Feautrier, P. (1997). Fuzzy array dataflow analysis. Journal of
Parallel and Distributed Computing, 40(2), 210–226.
20. Bates, S., & Horwitz, S. (1993). Incremental program testing using program dependence
graphs. In Proceedings of the Symposium on Principles of Programming Languages. POPL
’93 (pp. 384–396).
21. Baxter, W. T., & Bauer III, H. R. (1989). The program dependence graph and vectorization.
Proceedings of the Symposium on Principles of Programming Languages. POPL ’89 (pp. 1–
11).
22. Belady, L. A. (1966). A study of replacement algorithms for a virtual storage computer.
IBM Systems Journal, 5, 78–101.
23. Bender, M., & Farach-Colton, M. (2000). The LCA problem revisited. In Latin 2000:
Theoretical informatics. Lecture notes in computer science (pp. 88–94).
24. Benton, N., Kennedy, A., & Russell, G. (1998). Compiling standard ML to Java bytecodes.
In Proceedings of the International Conference on Functional Programming. ICFP ’98.
SIGPLAN Notices, 34(1), 129–140.
25. Beringer, L. (2007). Functional elimination of phi-instructions. Electronic Notes in
Theoretical Computer Science, 176(3), 3–20.
26. Beringer, L., MacKenzie, K., & Stark, I. (2003). Grail: A functional form for imperative
mobile code. Electronic Notes in Theoretical Computer Science, 85(1), 3–23.
27. Berson, D. A., Gupta, R., & Soffa, M. L. (1999). Integrated instruction scheduling and
register allocation techniques. In Proceedings of the International Workshop on Languages
and Compilers for Parallel Computing. LCPC ’98 (pp. 247–262).
28. Biggar, P. (2010). Design and Implementation of an Ahead-of-Time Compiler for PHP. PhD
Thesis. Trinity College Dublin.
29. Bilardi, G., & Pingali, K. (1996). A framework for generalized control dependence. Sigplan
Notices, 31(5), 291–300.
30. Blech, J. O., et al. (2005). Optimizing code generation from SSA form: A comparison between
two formal correctness proofs in Isabelle/HOL. Electronic Notes in Theoretical Computer
Science, 141(2), 33–51.
31. Blickstein, D. S., et al. (1992). The GEM optimizing compiler system. Digital Technical
Journal, 4(4), 121–136.
32. Bodík, R., Gupta, R., & Sarkar, V. (2000). ABCD: Eliminating array bounds checks on
demand. In International Conference on Programming Languages Design and Implemen-
tation. PLDI ’00 (pp. 321–333).
33. Böhm, W., et al. (2002). Mapping a single assignment programming language to reconfig-
urable systems. The Journal of Supercomputing, 21(2), 117–130.
34. Boissinot, B., et al. (2008). Fast liveness checking for SSA-form programs. In Proceedings of
the International Symposium on Code Generation and Optimization. CGO ’08 (pp. 35–44).
35. Boissinot, B., et al. (2009). Revisiting out-of-SSA translation for correctness, code quality
and efficiency. In Proceedings of the International Symposium on Code Generation and
Optimization. CGO ’09 (pp. 114–125).
36. Boissinot, B., et al. (2011). A non-iterative data-flow algorithm for computing liveness sets in
strict SSA programs. In Asian Symposium on Programming Languages and Systems. APLAS
’11 (pp. 137–154).
37. Boissinot, B., et al. (2012). SSI properties revisited. ACM Transactions on Embedded
Computing Systems. Special Issue on Software and Compilers for Embedded Systems.
38. Bouchez, F. (2009). A Study of Spilling and Coalescing in Register Allocation as Two Separate
Phases. PhD Thesis. École normale supérieure de Lyon, France.
39. Bouchez, F., et al. (2006). Register allocation: What does the NP-completeness proof of
Chaitin et al. really prove? Or revisiting register allocation: Why and how. In Proceedings
of the International Workshop on Languages and Compilers for Parallel Computing. LCPC
’06 (pp. 283–298).
40. Bouchez, F., Darte, A., & Rastello, F. (2007). On the complexity of register coalescing. In
Proceedings of the International Symposium on Code Generation and Optimization. CGO
’07 (pp. 102–114).
41. Bouchez, F., Darte, A., & Rastello, F. (2007). On the complexity of spill everywhere under
SSA form. In Proceedings of the International Conference on Languages, Compilers, and
Tools for Embedded Systems. LCTES ’07 (pp. 103–112).
42. Bouchez, F., Darte, A., & Rastello, F. (2008). Advanced conservative and optimistic register
coalescing. In Proceedings of the International Conference on Compilers, Architecture, and
Synthesis for Embedded Systems. CASES ’08 (pp. 147–156).
43. Bouchez, F., et al. (2010). Parallel copy motion. In Proceedings of the International Workshop
on Software & Compilers for Embedded Systems. SCOPES ’10 (pp. 1:1–1:10).
44. Brandis, M. M., & Mössenböck, H. (1994). Single-pass generation of static single assignment
form for structured languages. ACM Transactions on Programming Language and Systems,
16(6), 1684–1698.
45. Braun, M., & Hack, S. (2009). Register spilling and live-range splitting for SSA-form
programs. In International Conference on Compiler Construction. CC ’09 (pp. 174–189).
46. Braun, M., Mallon, C., & Hack, S. (2010). Preference-guided register assignment. In
International Conference on Compiler Construction. CC ’10 (pp. 205–223). New York:
Springer.
47. Braun, M., et al. (2013). Simple and efficient construction of static single assignment form.
In International Conference on Compiler Construction. CC ’13 (pp. 102–122).
48. Briggs, P., Cooper, K. D., & Torczon, L. (1992). Rematerialization. In International Confer-
ence on Programming Languages Design and Implementation. PLDI ’92 (pp. 311–321).
49. Briggs, P., Cooper, K. D., & Taylor Simpson, L. (1997). Value numbering. Software—
Practice and Experience, 27(6), 701–724.
50. Briggs, P., et al. (1998). Practical improvements to the construction and destruction of static
single assignment form. Software—Practice and Experience, 28(8), 859–881.
51. Brisk, P., et al. (2005). Polynomial-time graph coloring register allocation. In International
Workshop on Logic and Synthesis. IWLS ’05.
52. Bruel, C. (2006). If-conversion SSA framework for partially predicated VLIW architectures.
In Odes 4. SIGPLAN (pp. 5–13).
53. Budimlić, Z., et al. (2002). Fast copy coalescing and live range identification. In International
Conference on Programming Languages Design and Implementation. PLDI ’02 (pp. 25–32).
54. Budiu, M., & Goldstein, S. C. (2002). Compiling application-specific hardware. In Interna-
tional Conference on Field Programmable Logic and Applications. FPL ’02 (pp. 853–863).
55. Callahan, T. J., Hauser, J. R., & Wawrzynek, J. (2000). The Garp architecture and C compiler.
Computer, 33(4), 62–69.
56. Canis, A., et al. (2011). High-level synthesis for FPGA-based processor/accelerator systems.
In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable
Gate Arrays. FPGA ’11 (pp. 33–36).
57. Cardoso, J. M. P., & Diniz, P. C. (2008). Compilation techniques for reconfigurable
architectures. New York: Springer.
58. Carter, L., et al. (1999). Predicated static single assignment. In Proceedings of the Interna-
tional Conference on Parallel Architectures and Compilation Techniques. PACT ’99 (p. 245).
59. Chaitin, G. J. (1982). Register allocation & spilling via graph coloring. In Proceedings of the
Symposium on Compiler Construction. SIGPLAN ’82 (pp. 98–105).
60. Chaitin, G. J., et al. (1981). Register allocation via coloring. In Computer Languages, 6, 47–
57.
61. Chaitin, G. J., et al. (1981). Register allocation via graph coloring. Journal of Computer
Languages, 6, 45–57.
62. Chakravarty, M. M. T., Keller, G., & Zadarnowski, P. (2003). A functional perspective on
SSA optimisation algorithms. Electronic Notes in Theoretical Computer Science, 82(2), 15.
63. Chambers, C., & Ungar, D. (1989). Customization: Optimizing compiler technology for
SELF, a dynamically-typed object-oriented programming language. Sigplan Notices, 24(7),
146–160.
64. Chan, S., et al. (2008). Open64 compiler infrastructure for emerging multicore/manycore
architecture. Tutorial at the International Symposium on Parallel and Distributed Processing.
SPDP ’08.
65. Chapman, B., Eachempati, D., & Hernandez, O. (2013). Experiences developing the OpenUH
compiler and runtime infrastructure. International Journal of Parallel Programming, 41(6),
825–854.
66. Chen, C.-H. (1988). Signal processing handbook (Vol. 51). Boca Raton: CRC Press.
67. Choi, J.-D., Cytron, R. K., & Ferrante, J. (1991). Automatic construction of sparse data
flow evaluation graphs. In Proceedings of the Symposium on Principles of Programming
Languages. POPL ’91 (pp. 55–66).
68. Choi, J.-D., Cytron, R. K., & Ferrante, J. (1994). On the efficient engineering of ambitious
program analysis. IEEE Transactions on Software Engineering, 20, 105–114.
69. Choi, J.-D., Sarkar, V., & Schonberg, E. (1996). Incremental computation of static single
assignment form. In International Conference on Compiler Construction. CC ’96 (pp. 223–
237).
70. Chow, F. (1988). Minimizing register usage penalty at procedure calls. In International
Conference on Programming Languages Design and Implementation. PLDI ’88 (pp. 85–94).
71. Chow, F., & Hennessy, J. L. (1990). The priority-based coloring approach to register
allocation. ACM Transactions on Programming Language and Systems, 12(4), 501–536.
72. Chow, F., et al. (1996). Effective representation of aliases and indirect memory operations in
SSA form. In International Conference on Compiler Construction. CC ’96 (pp. 253–267).
73. Chow, F., et al. (1997). A new algorithm for partial redundancy elimination based on SSA
form. In International Conference on Programming Languages Design and Implementation.
PLDI ’97 (pp. 273–286).
74. Chuang, W., Calder, B., & Ferrante, J. (2003). Phi-predication for light-weight if-conversion.
In Proceedings of the International Symposium on Code Generation and Optimization. CGO
’03 (pp. 179–190).
75. Click, C. (1995). Combining Analyses, Combining Optimizations. PhD Thesis. Rice Univer-
sity.
76. Click, C. (1995). Global code motion/global value numbering. In International Conference
on Programming Languages Design and Implementation. PLDI ’95 (pp. 246–257).
77. Cocke, J. W., & Schwartz, J. T. (1970). Programming languages and their compilers. New
York: New York University.
78. Codina, J. M., Sánchez, J., & González, A. (2001). A unified modulo scheduling and register
allocation technique for clustered processors. In Proceedings of the International Conference
on Parallel Architectures and Compilation Techniques. PACT ’01 (pp. 175–184).
79. Colombet, Q. (2012). Decoupled (SSA-Based) Register Allocators: From Theory to Practice,
Coping with Just-in-Time Compilation and Embedded Processors Constraints. PhD Thesis.
École normale supérieure de Lyon, France, 2012.
80. Colombet, Q., et al. (2011). Graph-coloring and register allocation using repairing. In
Proceedings of the International Conference on Compilers, Architecture, and Synthesis for
Embedded Systems. CASES ’04 (pp. 45–54).
81. Colwell, R. P., et al. (1987). A VLIW architecture for a trace scheduling compiler. In
Proceedings of the International Conference on Architectual Support for Programming
Languages and Operating Systems. ASPLOS-II (pp. 180–192).
82. Cooper, K. D., & Taylor Simpson, L. (1998). Live range splitting in a graph coloring register
allocator. In International Conference on Compiler Construction. CC ’98 (pp. 174–187). New
York: Springer.
83. Cooper, K. D., & Torczon, L. (2004). Engineering a compiler. Burlington: Morgan Kauf-
mann.
84. Cooper, K. D., Taylor Simpson, L., & Vick, C. A. (2001). Operator strength reduction. ACM
Transactions on Programming Language and Systems, 23(5), 603–625.
85. Cooper, K. D., Harvey, T. J., & Kennedy, K. W. (2006). An empirical study of iterative data-
flow analysis. In International Conference on Computing. ICC ’06 (pp. 266–276).
86. Cousot, P., & Halbwachs, N. (1978). Automatic discovery of linear restraints among variables
of a program. In Proceedings of the Symposium on Principles of Programming Languages.
POPL ’78 (pp. 84–96).
87. Cytron, R. K., & Ferrante, J. (1987). What’s in a name? or the value of renaming for
parallelism detection and storage allocation. In Proceedings of the 1987 International
Conference on Parallel Processing (pp. 19–27).
88. Cytron, R. K., & Gershbein, R. (1993). Efficient accommodation of may-alias information in
SSA form. In International Conference on Programming Languages Design and Implemen-
tation. PLDI ’93 (pp. 36–45).
89. Cytron, R. K., et al. (1989). An efficient method of computing static single assignment
form. In Proceedings of the Symposium on Principles of Programming Languages. POPL
’89 (pp. 25–35).
90. Cytron, R. K., et al. (1991). Efficiently computing static single assignment form and the
control dependence graph. ACM Transactions on Programming Language and Systems, 13(4),
451–490.
91. Danvy, O., & Schultz, U. P. (2000). Lambda-dropping: Transforming recursive equations into
programs with block structure. Theoretical Computer Science, 248(1–2), 243–287.
92. Danvy, O., Millikin, K., & Nielsen, L. R. (2007). On one-pass CPS transformations. Journal
of Functional Programming, 17(6), 793–812.
93. Darte, A., Robert, Y., & Vivien, F. (2000). Scheduling and Automatic Parallelization, 1st ed.
Boston: Birkhauser.
94. Das, D., & Ramakrishna, U. (2005). A practical and fast iterative algorithm for φ-function
computation using DJ graphs. ACM Transactions on Programming Language and Systems,
27, 426–440.
95. Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters.
Communications of the ACM, 51(1), 107–113.
96. Dennis, J. B. (1974). First version of a data flow procedure language. In Programming
Symposium, Proceedings Colloque sur la Programmation (pp. 362–376).
97. Dennis, J. B. (1980). Data flow supercomputers. Computer, 13(11), 48–56.
98. Dhamdhere, D. M. (2002). E-path_PRE: Partial redundancy made easy. Sigplan Notices,
37(8), 53–65.
99. de Dinechin, B. D. (1999). Extending modulo scheduling with memory reference merging. In
International Conference on Compiler Construction. CC ’99 (pp. 274–287).
100. de Dinechin, B. D. (2007). Time-indexed formulations and a large neighborhood search
for the resource-constrained modulo scheduling problem. In Multidisciplinary International
Scheduling Conference: Theory and Applications. MISTA.
101. de Dinechin, B. D. (2008). Inter-block scoreboard scheduling in a JIT compiler for VLIW
processors. In European Conference on Parallel Processing (pp. 370–381).
102. de Dinechin, B. D., et al. (2000). Code generator optimizations for the ST120 DSP-MCU core.
In Proceedings of the International Conference on Compilers, Architecture, and Synthesis for
Embedded Systems. CASES ’00 (pp. 93–102).
103. de Dinechin, B. D., et al. (2000). DSP-MCU processor optimization for portable applications.
Microelectronic Engineering, 54(1–2), 123–132.
104. de Ferrière, F. (2007). Improvements to the Psi-SSA representation. In Proceedings of the
International Workshop on Software & Compilers for Embedded Systems. SCOPES ’07
(pp. 111–121).
105. Diniz, P. C., et al. (2005). Automatic mapping of C to FPGAs with the DEFACTO compilation
and synthesis system. Microprocessors and Microsystems, 29(2–3), 51–62
106. Drechsler, K.-H., & Stadel, M. P. (1993). A variation of Knoop, Rüthing and Steffen’s lazy code
motion. Sigplan Notices, 28(5), 29–38.
107. Duesterwald, E., Gupta, R., & Soffa, M. L. (1994). Reducing the cost of data flow analysis
by congruence partitioning. In International Conference on Compiler Construction. CC ’94
(pp. 357–373).
108. Ebner, D., et al. (2008). Generalized instruction selection using SSA-graphs. Proceedings of
the International Conference on Languages, Compilers, and Tools for Embedded Systems.
LCTES ’08 (pp. 31–40).
109. Eckstein, E., König, O., & Scholz, B. (2003). Code instruction selection based on SSA-graphs.
In Proceedings of the International Workshop on Software & Compilers for Embedded
Systems. SCOPES ’03 (pp. 49–65).
110. Emami, M., Ghiya, R., & Hendren, L. J. (1994). Context-sensitive interprocedural points-to
analysis in the presence of function pointers. In International Conference on Programming
Languages Design and Implementation. PLDI ’94 (pp. 242–256).
111. Fabri, J. (1979). Automatic storage optimization. In SIGPLAN ’79.
112. Fang, J. Z. (1997). Compiler algorithms on if-conversion, speculative predicates assignment
and predicated code optimizations. In Lcpc ’97 (pp. 135–153).
113. Faraboschi, P., et al. (2000). Lx: A technology platform for customizable VLIW embedded
processing. SIGARCH Computer Architecture News, 28(2), 203–213.
114. Farach-Colton, M., & Liberatore, V. (1998). On local register allocation. In Proceedings of
the Symposium on Discrete Algorithms. SODA ’98 (pp. 564–573).
115. Feautrier, P. (1988). Parametric integer programming. Rairo recherche opérationnelle, 22,
243–268.
116. Feautrier, P., & Lengauer, C. (2011). Polyhedron model. In Encyclopedia of Parallel
Computing (pp. 1581–1592).
117. Ferrante, J., & Mace, M. (1985). On linearizing parallel code. In Proceedings of the
Symposium on Principles of Programming Languages. POPL ’85 (pp. 179–190).
118. Ferrante, J., Ottenstein, K. J., & Warren, J. D. (1987). The program dependence graph and
its use in optimization. ACM Transactions on Programming Language and Systems, 9(3),
319–349.
119. Ferrante, J., Mace, M., & Simons, B. (1988). Generating sequential code from parallel code.
In Proceedings of the International Conference on Supercomputing. ICS ’88 (pp. 582–592).
120. Fink, S., Knobe, K., & Sarkar, V. (2000). Unified analysis of array and object references in
strongly typed languages. In International Static Analysis Symposium. SAS ’00 (pp. 155–174).
121. Flanagan, C., et al. (1993). The essence of compiling with continuations. In International
Conference on Programming Languages Design and Implementation. PLDI ’93 (pp. 237–
247).
122. Gavril, F. (1974). The intersection graphs of subtrees in trees are exactly the chordal graphs.
Journal of Combinatorial Theory, Series B, 16(1), 47–56.
123. Gawlitza, T., et al. (2009). Polynomial precise interval analysis revisited. Efficient Algorithms,
1, 422–437.
124. George, L., & Appel, A. W. (1996). Iterated register coalescing. ACM Transactions on
Programming Language and Systems, 18, 300–324.
125. George, L., & Matthias, B. (2003). Taming the IXP network processor. In International
Conference on Programming Languages Design and Implementation. PLDI ’03 (pp. 26–37).
126. Gerlek, M. P., Wolfe, M., & Stoltz, E. (1994). A reference chain approach for live variables.
Technical Report CSE 94-029. Oregon Graduate Institute of Science & Technology.
127. Gerlek, M. P., Stoltz, E., & Wolfe, M. (1995). Beyond induction variables: detecting
and classifying sequences using a demand-driven SSA form. ACM Transactions on
Programming Language and Systems, 17(1), 85–122.
128. Gillies, D. M., et al. (1996). Global predicate analysis and its application to register allocation.
In Micro 29 (pp. 114–125).
129. Glesner, S. (2004). An ASM semantics for SSA intermediate representations. In Abstract
State Machines 2004. Advances in Theory and Practice, 11th International Workshop (ASM
2004), Proceedings. Lecture Notes in Computer Science (pp. 144–160).
130. Gonzalez, R. E. (2000). Xtensa: a configurable and extensible processor. IEEE Micro, 20(2),
60–70.
131. Goodman, J. R., & Hsu, W. (1988). Code scheduling and register allocation in large
basic blocks. In Proceedings of the International Conference on Supercomputing. ICS ’88
(pp. 442–452).
132. Grund, D., & Hack, S. (2007). A fast cutting-plane algorithm for optimal coalescing. In
International Conference on Compiler Construction. CC ’07 (pp. 111–125).
133. Guo, Z., et al. (2008). A compiler intermediate representation for reconfigurable fabrics.
International Journal of Parallel Programming, 36(5), 493–520.
134. Gurevich, Y. (2000). Sequential abstract-state machines capture sequential algorithms.
ACM Transactions on Computational Logic (TOCL), 1(1), 77–111.
135. Hack, S. (2007). Register Allocation for Programs in SSA Form. PhD Thesis. Universität
Karlsruhe.
136. Hack, S., Grund, D., & Goos, G. (2005). Towards Register Allocation for Programs in SSA
Form. Technical Report 2005–27. University of Karlsruhe.
137. Hagiescu, A., et al. (2009). A computing origami: Folding streams in FPGAs. In Proceedings
of the Design Automation Conference. DAC ’09 (pp. 282–287).
138. Hardekopf, B., & Lin, C. (2011). Flow-sensitive pointer analysis for millions of lines of code.
In Proceedings of the International Symposium on Code Generation and Optimization. CGO
’11. New York: IEEE Computer Society (pp. 289–298).
139. Hasti, R., & Horwitz, S. (1998). Using static single assignment form to improve flow-
insensitive pointer analysis. In International Conference on Programming Languages Design
and Implementation. PLDI ’98 (pp. 97–105).
140. Havanki, W. A., Banerjia, S., & Conte, T. M. (1998). Treegion scheduling for wide issue
processors. In International Symposium on High-Performance Computer Architecture (p. 266).
141. Havlak, P. (1993). Construction of thinned gated single-assignment form. In Proceedings of
the International Workshop on Languages and Compilers for Parallel Computing. LCPC ’93
(pp. 477–499).
142. Havlak, P. (1997). Nesting of reducible and irreducible loops. ACM Transactions on Program-
ming Language and Systems, 19, 557–567.
143. Hecht, M. S. (1977). Flow Analysis of Computer Programs. New York: Elsevier Science Inc.
144. Hecht, M. S., & Ullman, J. D. (1973). Analysis of a simple algorithm for global data flow
problems. In Proceedings of the Symposium on Principles of Programming Languages. POPL
’73 (pp. 207–217).
145. Hormati, A., et al. (2008). Optimus: Efficient realization of streaming applications on FPGAs.
In Proceedings of the International Conference on Compilers, Architecture, and Synthesis for
Embedded Systems. CASES ’08, pp. 41–50.
146. Hwu, W. -M. W., et al. (1993). The superblock: An effective technique for VLIW and super-
scalar compilation. The Journal of Supercomputing, 7(1–2), 229–248.
147. Intel Corp. (2017). Arria 10 device overview. [Link] (visited on 03/15/2017).
148. Jacome, M. F., De Veciana, G., & Pillai, S. (2001). Clustered VLIW architectures with
predicated switching. In Proceedings of the Design Automation Conference. DAC ’01
(pp. 696–701).
149. Janssen, J., & Corporaal, H. (1997). Making graphs reducible with controlled node splitting.
ACM Transactions on Programming Language and Systems, 19, 1031–1052.
150. Jensen, S. H., Møller, A., & Thiemann, P. (2009). Type analysis for JavaScript. In Interna-
tional Static Analysis Symposium. SAS ’09 (pp. 238–255).
151. Johnson, N. E. (2004). Code Size Optimization for Embedded Processors. Technical Report
UCAM-CL-TR-607. University of Cambridge, Computer Laboratory.
152. Johnson, N. E., & Mycroft, A. (2003). Combined code motion and register allocation using
the value state dependence graph. In International Conference on Compiler Construction.
CC ’03 (pp. 1–16).
153. Johnson, R., & Pingali, K. (1993). Dependence-based program analysis. In International
Conference on Programming Languages Design and Implementation. PLDI ’93 (pp. 78–89).
154. Johnson, R., Pearson, D., & Pingali, K. (1994). The program tree structure. In International
Conference on Programming Languages Design and Implementation. PLDI ’94 (pp. 171–
185).
155. Johnsson, T. (1985). Lambda lifting: Transforming programs to recursive equations. In
Conference on Functional Programming Languages and Computer Architecture. Lecture
Notes in Computer Science (pp. 190–203).
156. Jovanovic, N., Kruegel, C., & Kirda, E. (2006). Pixy: A static analysis tool for detecting web
application vulnerabilities (short paper). In Symposium on Security and Privacy (pp. 258–
263).
157. Kam, J. B., & Ullman, J. D. (1976). Global data flow analysis and iterative algorithms. Journal
of the ACM, 23(1), 158–171.
158. Kam, J. B., & Ullman, J. D. (1977). Monotone data flow analysis frameworks. Acta
Informatica, 7(3), 305–317.
159. Kästner, D., & Winkel, S. (2001). ILP-based instruction scheduling for IA-64. In Proceedings
of the International Conference on Languages, Compilers, and Tools for Embedded Systems.
LCTES ’01 (pp. 145–154).
160. Kelsey, R. (1995). A correspondence between continuation passing style and static single
assignment form. In Intermediate Representations Workshop (pp. 13–23).
161. Kennedy, A. (2007). Compiling with continuations, continued. In Proceedings of the Interna-
tional Conference on Functional Programming. ICFP ’07 (pp. 177–190).
162. Kennedy, K. W. (1975). Node listings applied to data flow analysis. In Proceedings of the
Symposium on Principles of Programming Languages. POPL ’75 (pp. 10–21).
163. Kennedy, R., et al. (1998). Strength reduction via SSAPRE. In International Conference on
Compiler Construction. CC ’98.
164. Kennedy, R., et al. (1999). Partial redundancy elimination in SSA form. ACM Transactions
on Programming Language and Systems, 21(3), 627–676.
165. Khedker, U. P., & Dhamdhere, D. M. (1999). Bidirectional data flow analysis: Myths and
reality. SIGPLAN Notices, 34(6), 47–57.
166. Kildall, G. A. (1973). A unified approach to global program optimization. In Proceedings of
the Symposium on Principles of Programming Languages. POPL ’73 (pp. 194–206).
167. Kislenkov, V., Mitrofanov, V., & Zima, E. V. (1998). Multidimensional chains of recurrences.
In Proceedings of the International Symposium on Symbolic and Algebraic Computation
(pp. 199–206).
168. Knobe, K., & Sarkar, V. (1998). Array SSA form and its use in parallelization. In Proceedings
of the Symposium on Principles of Programming Languages. POPL ’98.
169. Knobe, K., & Sarkar, V. (1998). Conditional constant propagation of scalar and array
references using array SSA form. In International Static Analysis Symposium. Lecture Notes
in Computer Science (pp. 33–56).
170. Knoop, J., Rüthing, O., & Steffen, B. (1992). Lazy code motion. In International Conference
on Programming Languages Design and Implementation. PLDI ’92 (pp. 224–234).
171. Knoop, J., Rüthing, O., & Steffen, B. (1993). Lazy strength reduction. Journal of Program-
ming Languages, 1(1), 71–91.
172. Knoop, J., Rüthing, O., & Steffen, B. (1994). Optimal code motion: Theory and practice.
ACM Transactions on Programming Language and Systems, 16(4), 1117–1155.
173. Khronos Group (2018). OpenCL overview. [Link]
174. Landin, P. (1965). A Generalization of Jumps and Labels. Technical Report. Reprinted in
Higher Order and Symbolic Computation, 11(2), 125–143, 1998, with a foreword by Hayo
Thielecke. UNIVAC Systems Programming Research, 1965.
175. Lapkowski, C., & Hendren, L. J. (1996). Extended SSA numbering: Introducing SSA
properties to languages with multi-level pointers. In Proceedings of the Conference of the
Centre for Advanced Studies on Collaborative Research. CASCON ’96 (pp. 23–34).
176. Lattner, C., & Adve, V. S. (2004). LLVM: a compilation framework for lifelong program anal-
ysis & transformation. In Proceedings of the International Symposium on Code Generation
and Optimization. CGO ’04 (pp. 75–88).
177. Laud, P., Uustalu, T., & Vene, V. (2006). Type systems equivalent to dataflow analyses for
imperative languages. Theoretical Computer Science, 364(3), 292–310.
178. Lawrence, A. C. (2007). Optimizing Compilation with the Value State Dependence Graph.
Technical Report UCAM-CL-TR-705. University of Cambridge, Computer Laboratory.
179. Lee, E. A., & Messerschmitt, D. G. (1987). Synchronous data flow. Proceedings of the IEEE,
75(9), 1235–1245.
180. Lee, J.-Y., & Park, I.-C. (2003). Address code generation for DSP instruction-set architec-
tures. ACM Transactions on Design Automation of Electronic Systems, 8(3), 384–395.
181. Lenart, A., Sadler, C., & Gupta, S. K. S. (2000). SSA-based flow-sensitive type analysis:
Combining constant and type propagation. In Proceedings of the Symposium on Applied
Computing. SAC ’00 (pp. 813–817).
182. Leung, A., & George, L. (1999). Static single assignment form for machine code. In
International Conference on Programming Languages Design and Implementation. PLDI ’99
(pp. 204–214).
183. Leupers, R. (1997). Retargetable code generation for digital signal processors. Amsterdam:
Kluwer Academic Publishers.
184. Leupers, R. (1999). Exploiting conditional instructions in code generation for embedded
VLIW processors. In Proceedings of the Conference on Design, Automation and Test in
Europe. DATE ’99.
185. Liu, S.-M., Lo, R., & Chow, F. (1996). Loop induction variable canonicalization in paralleliz-
ing compilers. In PACT ’96 (p. 228).
186. LLVM website. [Link]
187. Lo, R., et al. (1998). Register promotion by sparse partial redundancy elimination of loads and
stores. In International Conference on Programming Languages Design and Implementation.
PLDI ’98 (pp. 26–37).
188. Logozzo, F., & Fähndric, M. (2010). Pentagons: A weakly relational abstract domain for the
efficient validation of array accesses. Science of Computer Programming, 75, 796–807.
189. Lowney, P. G., et al. (1993). The multiflow trace scheduling compiler. The Journal of
Supercomputing, 7(1–2), 51–142.
190. Lueh, G.-Y., Gross, T., & Adl-Tabatabai, A.-R. (2000). Fusion-based register allocation. ACM
Transactions on Programming Language and Systems, 22(3), 431–470.
191. Mahlke, S. A., et al. (1992). Effective compiler support for predicated execution using the
hyperblock. In Proceedings of the International Symposium on Microarchitecture. MICRO
25 (pp. 45–54).
192. Mahlke, S. A., et al. (1992). Sentinel scheduling for VLIW and superscalar processors.
In Proceedings of the International Conference on Architectural Support for Programming
Languages and Operating Systems. ASPLOS-V (pp. 238–247).
193. Mahlke, S. A., et al. (1995). A comparison of full and partial predicated execution support for
ILP processors. In Proceedings of the International Symposium on Computer Architecture.
ISCA ’95 (pp. 138–150)
194. Mahlke, S. A., et al. (2001). Bitwidth cognizant architecture synthesis of custom hardware
accelerators. IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, 20(11), 1355–1371.
195. Matsuno, Y., & Ohori, A. (2006). A type system equivalent to static single assignment.
In Proceedings of the International Conference on Principles and Practice of Declarative
Programming. PPDP ’06 (pp. 249–260).
196. May, C. (1989). The parallel assignment problem redefined. IEEE Transactions on
Software Engineering, 15(6), 821–824.
197. McAllester, D. (2002). On the complexity analysis of static analyses. Journal of the ACM, 49,
512–537.
198. Metzgen, P., & Nancekievill, D. (2005). Multiplexer restructuring for FPGA implementation
cost reduction. In Proceedings of the Design Automation Conference. DAC ’05 (pp. 421–426).
199. Miné, A. (2006). The octagon abstract domain. Higher-Order and Symbolic Computation, 19,
31–100.
200. Moggi, E. (1991). Notions of computation and monads. Information and Computation, 93(1),
55–92.
201. Morel, E., & Renvoise, C. (1979). Global optimization by suppression of partial redundancies.
Communications of the ACM, 22(2), 96–103.
202. Morgan, R. (1998). Building an optimizing compiler. Oxford: Butterworth-Heinemann.
203. Mössenböck, H., & Pfeiffer, M. (2002). Linear scan register allocation in the context of SSA
form and register constraints. In International Conference on Compiler Construction. CC ’02
(pp. 229–246).
204. Motwani, R., et al. (1995). Combining Register Allocation and Instruction Scheduling.
Technical Report. Stanford University.
205. Muchnick, S. S. (1997). Advanced compiler design and implementation. Burlington: Morgan
Kaufmann.
206. Murphy, B. R., et al. (2008). Fault-safe code motion for type-safe languages. In Proceedings
of the International Symposium on Code Generation and Optimization. CGO ’08 (pp. 144–
154).
207. Nanda, M. G., & Sinha, S. (2009). Accurate interprocedural null-dereference analysis for
Java. In International Conference on Software Engineering (pp. 133–143).
208. Nielson, F., Nielson, H. R., & Hankin, C. (2005). Principles of program analysis. New York:
Springer.
209. Norris, C., & Pollock, L. L. (1993). A scheduler-sensitive global register allocator. In
Proceedings of the International Conference on Supercomputing. ICS ’93 (pp. 804–813). New
York: IEEE.
210. Novillo, D. (2005). A propagation engine for GCC. In Proceedings of the GCC Developers
Summit (pp. 175–184).
211. Novillo, D. (2007). Memory SSA — a unified approach for sparsely representing memory
operations. In Proceedings of the GCC Developers Summit.
212. Nuzman, D., & Henderson, R. (2006). Multi-platform auto-vectorization. In Proceedings of
the International Symposium on Code Generation and Optimization. CGO ’06.
213. O’Donnell, C. (1994). High Level Compiling for Low Level Machines. PhD Thesis. Ecole
Nationale Superieure des Telecommunications.
214. Ottenstein, K. J., & Ottenstein, L. M. (1984). The program dependence graph in a software
development environment. ACM SIGSOFT Software Engineering Notes, 9(3), 177–184.
215. Ottenstein, K. J., Ballance, R. A., & MacCabe, A. B. (1990). The program dependence web:
A representation supporting control-, data-, and demand-driven interpretation of imperative
languages. In International Conference on Programming Languages Design and Implemen-
tation. PLDI ’90 (pp. 257–271).
216. D. A. Padua (Ed.) (2011). Encyclopedia of parallel computing. New York: Springer.
217. Paleri, V. K., Srikant, Y. N., & Shankar, P. (2003). Partial redundancy elimination: a simple,
pragmatic and provably correct algorithm. Science of Computer Programming, 48(1), 1–
20.
218. Panda, P. R. (2001). SystemC: A modeling platform supporting multiple design abstractions.
In Proceedings of International Symposium on Systems Synthesis. ISSS ’01 (pp. 75–80).
219. Park, J., & Moon, S.-M. (2004). Optimistic register coalescing. ACM Transactions on
Programming Language and Systems, 26(4), 735–765.
220. Park, J. C. H., & Schlansker, M. S. (1991). On Predicated Execution. Technical Report HPL-
91-58. Hewlett Packard Laboratories.
221. Pereira, F. M. Q., & Palsberg, J. (2005). Register allocation via coloring of chordal graphs. In
Asian Symposium on Programming Languages and Systems. APLAS ’05 (pp. 315–329).
222. Pereira, F. M. Q., & Palsberg, J. (2008). Register allocation by puzzle solving. In International
Conference on Programming Languages Design and Implementation. PLDI ’08 (pp. 216–
226).
223. Peyton Jones, S., et al. (1998). Bridging the gulf: A common intermediate language for ML
and Haskell. In Proceedings of the Symposium on Principles of Programming Languages.
POPL ’98 (pp. 49–61).
224. Pingali, K., & Bilardi, G. (1995). APT: A data structure for optimal control dependence
computation. In International Conference on Programming Languages Design and Imple-
mentation. PLDI ’95 (pp. 211–222).
225. Pingali, K., & Bilardi, G. (1997). Optimal control dependence computation and the Roman
chariots problem. ACM Transactions on Programming Language and Systems, 19(3), 462–491.
226. Pinter, S. S. (1993). Register allocation with instruction scheduling. In International Confer-
ence on Programming Languages Design and Implementation. PLDI ’93 (pp. 248–257).
227. Pioli, A., & Hind, M. (1999). Combining Interprocedural Pointer Analysis and Conditional
Constant Propagation. Technical Report. IBM T. J. Watson Research Center.
228. Plevyak, J. B. (1996). Optimization of Object-Oriented and Concurrent Programs. PhD
Thesis. University of Illinois at Urbana-Champaign.
229. Plotkin, G. D. (1975). Call-by-name, call-by-value and the lambda-calculus. Theoretical
Computer Science, 1(2), 125–159.
230. Poletto, M., & Sarkar, V. (1999). Linear scan register allocation. ACM Transactions on
Programming Language and Systems, 21(5), 895–913.
231. Pop, S. (2006). The SSA Representation Framework: Semantics, Analyses and GCC Imple-
mentation. PhD Thesis. Research center for computer science (CRI) of the Ecole des mines
de Paris.
232. Pop, S., Cohen, A., & Silber, G.-A. (2005). Induction variable analysis with delayed
abstractions. In Proceedings of the First International Conference on High Performance
Embedded Architectures and Compilers. HiPEAC’05 (pp. 218–232).
233. Pop, S., Jouvelot, P., & Silber, G.-A. (2007). In and Out of SSA: A Denotational Specification.
Technical Report. Research center for computer science (CRI) of the Ecole des mines de Paris.
234. Prosser, R. T. (1959). Applications of boolean matrices to the analysis of flow diagrams. In
Eastern Joint IRE-AIEE-ACM Computer Conference (pp. 133–138).
235. Pugh, W. (1991). Uniform techniques for loop optimization. In Proceedings of the Interna-
tional Conference on Supercomputing. ICS ’91 (pp. 341–352).
236. Ramalingam, G. (2002). On loops, dominators, and dominance frontiers. ACM Transactions
on Programming Language and Systems, 24(5), 455–490.
237. Ramalingam, G. (2002). On sparse evaluation representations. Theoretical Computer Science,
277(1–2), 119–147.
238. Ramalingam, G., & Reps, T. (1994). An incremental algorithm for maintaining the dominator
tree of a reducible flowgraph. In Proceedings of the Symposium on Principles of Programming
Languages. POPL ’94 (pp. 287–296).
239. Rastello, F. (2012). On sparse intermediate representations: some structural properties and
applications to just in time compilation. Habilitation à diriger des recherches. École normale
supérieure de Lyon, France.
240. Rastello, F., de Ferrière, F., & Guillon, C. (2004). Optimizing translation out of SSA using
renaming constraints. In Proceedings of the International Symposium on Code Generation
and Optimization. CGO ’04 (pp. 265–278).
241. Rau, B. R. (1996). Iterative modulo scheduling. International Journal of Parallel Program-
ming, 24(1), 3–65.
242. Rawat, P. S., et al. (2018). Register optimizations for stencils on GPUs. In Proceedings of the
Symposium on Principles and Practice of Parallel Programming. PPoPP ’18 (pp. 168–182).
243. Reppy, J. H. (2002). Optimizing nested loops using local CPS conversion. Higher-Order and
Symbolic Computation, 15(2–3), 161–180.
265. Sreedhar, V. C., Gao, G. R., & Lee, Y.-F. (1996). A new framework for exhaustive and
incremental data flow analysis using DJ graphs. In International Conference on Programming
Languages Design and Implementation. PLDI ’96 (pp. 278–290).
266. Sreedhar, V. C., Gao, G. R., & Lee, Y.-F. (1996). Identifying loops using DJ graphs. ACM
Transactions on Programming Language and Systems, 18(6), 649–658.
267. Sreedhar, V. C., et al. (1999). Translating out of static single assignment form. In International
Static Analysis Symposium. SAS ’99 (pp. 194–210).
268. Stallman, R. M., & GCC Dev. Community (2017). GCC 7.0 GNU compiler collection
internals. Samurai Media Limited.
269. Stanier, J. (2011). Removing and Restoring Control Flow with the Value State Dependence
Graph. PhD Thesis. University of Sussex, School of Informatics.
270. Stanier, J., & Watson, D. (2013). Intermediate representations in imperative compilers: A
survey. ACM Computing Surveys, 45(3), 26:1–26:27.
271. Steensgaard, B. (1993). Sequentializing Program Dependence Graphs for Irreducible Pro-
grams. Technical Report MSR-TR-93-14. Microsoft Research, Redmond, WA.
272. Steensgaard, B. (1995). Sparse functional stores for imperative programs. In Workshop on
Intermediate Representations.
273. Stephenson, M., Babb, J., & Amarasinghe, S. (2000). Bitwidth analysis with application to
silicon compilation. In International Conference on Programming Languages Design and
Implementation. PLDI ’00 (pp. 108–120).
274. Stoltz, E., Gerlek, M. P., & Wolfe, M. (1994). Extended SSA with factored use-def chains to
support optimization and parallelism. In Proceedings of the Hawaii International Conference
on System Sciences (pp. 43–52).
275. Stoutchinin, A., & de Ferrière, F. (2001). Efficient static single assignment form for
predication. In Proceedings of the International Symposium on Microarchitecture. MICRO
34 (pp. 172–181).
276. Stoutchinin, A., & Gao, G. R. (2004). If-conversion in SSA form. In European Conference
on Parallel Processing. Lecture Notes in Computer Science (pp. 336–345).
277. Su, Z., & Wagner, D. (2005). A class of polynomially solvable range constraints for interval
analysis without widenings. Theoretical Computeter Science, 345(1), 122–138.
278. Surawski, M. J. (2016). Loop Optimizations for MLton. Master’s Thesis. Department of
Computer Science, Rochester Institute of Technology.
279. Sussman, G. J., & Steele, G. L. Jr. (1998). Scheme: An Interpreter for Extended Lambda
Calculus. Technical Report AI Lab Memo AIM-349. Reprinted in Higher-Order and
Symbolic Computation, 11(4), 405–439, 1998. MIT AI Lab, 1975.
280. Tarditi, D., et al. (1996). TIL: A type-directed optimizing compiler for ML. In International
Conference on Programming Languages Design and Implementation. PLDI ’96 (pp. 181–
192).
281. Tavares, A. L. C., et al. (2011). Decoupled graph-coloring register allocation with hierarchical
aliasing. In Proceedings of the International Workshop on Software & Compilers for
Embedded Systems. SCOPES ’11 (pp. 1–10).
282. Tavares, A. L. C., et al. (2014). Parameterized construction of program representations for
sparse dataflow analyses. In International Conference on Compiler Construction. CC ’14
(pp. 18–39).
283. Thomas, D., & Moorby, P. (1998). The verilog hardware description language. Alphen aan
den Rijn: Kluwer Academic Publishers.
284. Tobin-Hochstadt, S., & Felleisen, M. (2008). The design and implementation of typed
scheme. In POPL ’08 (pp. 395–406).
285. Tolmach, A. P., & Oliva, D. (1998). From ML to Ada: Strongly-typed language inter-
operability via source translation. Journal of Functional Programming, 8(4), 367–412.
286. Touati, S., & Eisenbeis, C. (2004). Early periodic register allocation on ILP processors.
Parallel Processing Letters, 14(2), 287–313.
287. Traub, O., Holloway, G. H., & Smith, M. D. (1998). Quality and speed in linear-scan
register allocation. In International Conference on Programming Languages Design and
Implementation. PLDI ’98 (pp. 142–151).
288. Tripp, J. L., Jackson, P. A., & Hutchings, B. L. (2002). Sea cucumber: A synthesizing
compiler for FPGAs. In International Conference on Field Programmable Logic and
Applications. FPL ’02 (pp. 875–885).
289. Tu, P., & Padua, D. (1995). Efficient building and placing of gating functions. In International
Conference on Programming Languages Design and Implementation. PLDI ’95 (pp. 47–55).
290. Tu, P., & Padua, D. (1995). Gated SSA-based demand-driven symbolic analysis for paralleliz-
ing compilers. In Proceedings of the International Conference on Supercomputing. ICS ’95
(pp. 414–423).
291. Upton, E. (2003). Optimal sequentialization of gated data dependence graphs is NP-complete.
In Proceedings of the International Conference on Parallel and Distributed Processing
Techniques and Applications (pp. 1767–1770).
292. Upton, E. (2006). Compiling with Data Dependence Graphs. PhD Thesis. University of
Cambridge, Computer Laboratory.
293. van Engelen, R. A. (2001). Efficient symbolic analysis for optimizing compilers. In Interna-
tional Conference on Compiler Construction. CC ’01 (pp. 118–132).
294. van Wijngaarden, A. (1966). Recursive definition of syntax and semantics. In Formal
Language Description Languages for Computer Programming (pp. 13–24).
295. VanDrunen, T. J. (2004). Partial Redundancy Elimination for Global Value Numbering. PhD
Thesis. Purdue University.
296. Vandrunen, T., & Hosking, A. L. (2004). Anticipation-based partial redundancy elimination
for static single assignment form. Software—Practice and Experience, 34(15), 1413–1439.
297. Vandrunen, T., & Hosking, A. L. (2004). Value-based partial redundancy elimination. In
International Conference on Compiler Construction. CC ’04 (pp. 167–184).
298. Verma, A. K., Brisk, P., & Ienne, P. (2008). Dataflow transformations to maximize the use
of carry-save representation in arithmetic circuits. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 27(10), 1761–1774.
299. Wadsworth, C. P. (2000). Continuations revisited. Higher-Order and Symbolic Computation,
13(1/2), 131–133.
300. Wand, M. (1983). Loops in combinator-based compilers. Information and Control, 57(2/3),
148–164.
301. Wang, J., et al. (1994). Software pipelining with register allocation and spilling. In Proceed-
ings of the International Symposium on Microarchitecture. MICRO 27 (pp. 95–99).
302. Wassermann, G., & Su, Z. (2007). Sound and precise analysis of web applications for
injection vulnerabilities. In International Conference on Programming Languages Design and
Implementation. PLDI ’07 (pp. 32–41).
303. Wegman, M. N., & Kenneth Zadeck, F. (1991). Constant propagation with conditional
branches. ACM Transactions on Programming Language and Systems, 13(2), 181–210.
304. Weise, D., et al. (1994). Value dependence graphs: Representation without taxation. In
Proceedings of the Symposium on Principles of Programming Languages. POPL ’94 (pp.
297–310).
305. Wimmer, C., & Mössenböck, H. (2005). Optimized interval splitting in a linear scan register
allocator. In Proceedings of the 1st ACM/USENIX International Conference on Virtual
Execution Environments (pp. 132–141). Chicago: ACM.
306. Wolfe, M. (1992). Beyond induction variables. In International Conference on Programming
Languages Design and Implementation. PLDI ’92 (pp. 162–174).
307. Wolfe, M. (1994). “J+ = J”. Sigplan Notices, 29(7), 51–53.
308. Wolfe, M. (1996). High performance compilers for parallel computing. Reading, MA:
Addison-Wesley.
309. Wu, P., Cohen, A., & Padua, D. (2001). Induction variable analysis without idiom recognition:
Beyond monotonicity. In Proceedings of the International Workshop on Languages and
Compilers for Parallel Computing. LCPC ’01 (p. 2624).
310. Xilinx Inc. (2014). Vivado high-level synthesis. [Link] (visited on 06/2014).
311. Xilinx Inc. (2016). Virtex UltraScale+ FPGA devices. [Link]
312. Xue, J., & Cai, Q. (2006). A lifetime optimal algorithm for speculative PRE. ACM Transac-
tions on Architecture and Code Optimization, 3(2), 115–155.
313. Xue, J., & Knoop, J. (2006). A fresh look at PRE as a maximum flow problem. In
International Conference on Compiler Construction. CC ’06 (pp. 139–154).
314. Yang, B.-S., et al. (1999). LaTTe: A Java VM just-in-time compiler with fast and efficient
register allocation. In Proceedings of the International Conference on Parallel Architectures
and Compilation Techniques. PACT ’99 (pp. 128–138). New York: IEEE.
315. Yeung, J. H. C., et al. (2008). Map-reduce as a programming model for custom computing
machines. In Proceedings of the International Symposium on Field-Programmable Custom
Computing Machines. FCCM ’08 (pp. 149–159).
316. Zadeck, F. K. (1984). Incremental Data Flow Analysis in a Structured Program Editor. PhD
Thesis. Rice University.
317. Zhou, H., Chen, W., & Chow, F. (2011). An SSA-based algorithm for optimal speculative code
motion under an execution profile. In International Conference on Programming Languages
Design and Implementation. PLDI ’11.
318. Zima, E. V. (2001). On computational properties of chains of recurrences. In Proceedings of
the International Symposium on Symbolic and Algebraic Computation (pp. 345–352).
Index
Page numbers are underlined in the index when they represent the definition or the
main source of information about whatever is being indexed. A page number is
given in italics when that page contains an instructive example, use, or discussion
of the concept in question.
transfer function, 96, 97, 230, 232
transformed ψ-SSA, ψ-T-SSA, 207, 208
transformed SSA, T-SSA, 20, 32, 249, 255
transient information, 248
translation out of SSA, see destruction of SSA
tree region scheduling, 253
tree scan, 17, 117
two-address mode, 244, 245, 247, 253, 285, 289
type analysis, see analysis
type inference analysis, see analysis
typed
    statically-, 348
    strongly-, 228, 235
ultimate interference, 293
undefined code elimination, 181
undefined variable, 31
unreachable code, 102
unrolling, 81
upward-exposed use, 17, 108, 110
use-def chain, 13, 35, 127, 186
uses
    of φ-function, 109
    of ψ-function, 209, 210
value based interference, 294, 295
value node, see value state dependence graph
value numbering, 150, 161, 237
    global, 93, 150, 204, 219, 224, 234, 294, 354
    optimistic, 151
value state dependence graph, VSDG, 159, 194
    γ-nodes, 7, 196
    θ-nodes, 196
    serializing edge, 195
    state node, 196, 197
    value node, 196
value-driven redundancy elimination, 136
variable
    free, 65
    name, 5, 20, 24, 29, 138, 215, 286
    undefined, 31
    variable-, 355
    version, see variable name
Verilog, 329
very long instruction word, VLIW, 244, 269, 334
VHDL, 329
virtual isolation, 255, 256, 296
virtual register, see pseudo register
Vivado, 332
VLIW, see very long instruction word
VSDG, see value state dependence graph
WCET, see worst case execution time
web
    phi, see φ-web
    psi, see ψ-web
worst case execution time, WCET, 251
Xilinx LUT, 343
zero version, 160, 215, 354