Customized Load-Store Architectures
A dissertation submitted for the degree of
Doctor of Philosophy
Guanglin Xu
Pittsburgh, PA
May 2023
©Guanglin Xu, 2023
This work would not have been possible without the advice and support from my
great advisors Prof. Franz Franchetti and Prof. James Hoe. I am very thankful
for their time and effort along the unusually long journey of my PhD study. They
have not only provided intelligent advice for research, but also generously taught
me precious principles and lessons for life success. I owe all my success to them.
Special thanks to my thesis committee for providing important feedback that
significantly improved this thesis. The committee is composed of Prof. Franz
Franchetti (Co-Chair), Prof. James Hoe (Co-Chair), Prof. Tze Meng Low
(CMU - ECE), and Prof. Peter Milder (Stony Brook - ECE). In particular, Prof. Milder
offered me internal access to the Spiral DFT IP core generator and provided
tutorials so that I could understand the internal structure of state-of-the-art designs.
I would like to thank several people who have helped me a lot during my PhD
study. In my early years, Prof. Tze Meng Low taught me to question everything
by steadily asking me “simple” questions like “what” and “why”.
Dr. Doru Thom Popovici, before becoming a doctor himself, generously shared
with me tons of hands-on experience with the Spiral system. Dr. Jing Huang
taught me how to design and debug in RTL. Special thanks to Prof. David Padua
at UIUC, who hosted my half-year visit there. There are more people to whom
I am grateful. They include Richard Veras, Yu Wang, Jiyuan Zhang, Zhipeng
Zhao, Marie Nguyen, Joe Melber, Fazle Sadi, Qi Guo, Daniele Spampinato, Maia
Blanco, Shashank Obla, Chengyue Wang, and many more.
This work was partially sponsored by the Defense Advanced Research Projects
Agency (DARPA) PERFECT program under agreement HR0011-13-2-0007 and the
BRASS program under agreement FA8750-16-2-003. I am thankful for their generous
support of this research.
This thesis was proposed at the beginning of the pandemic and was
delayed partially due to the pandemic. I am grateful to my wife Dr. Xiao Wang for
her continuous love and support during these hard times. I am grateful to my daughter
Stephanie and my son Echo for bringing me endless joy.
Guanglin Xu
May 2023
Abstract
register-transfer level designs in another hardware-extended DSL where local optimizations are employed.
I implement the approach by extending the open-source Spiral system. I
demonstrate the flexibility of the system by generating designs for signal transforms
including WHT and DFT, and the sorting operation. Experimental results show the
benefit of hardware-oriented optimizations. In particular, the FFT IP cores generated
with my approach are comparable to state-of-the-art designs. Although further
parallelization and hardware compilation efforts remain to be pursued, this dissertation has
paved the way for generating competitive hardware designs with Spiral in a flexible
manner.
Contents
Acknowledgments
Abstract
List of Figures
Chapter 1 Introduction
1.1 Motivation
1.3 My Approach
1.4 Contributions
1.5 Limitations
2.2.2 Program Transformations
2.3.2 Σ-OL
2.6 Summary
3.5 Summary
Chapter 4 Extending Spiral for Generating Specialized Load-store Architectures
5.2.2 Latency
Bibliography
List of Figures
1.4 The Spiral code generation flows involving multi-level domain specific languages.
1.6 The three steps in mapping the 8-point DFT computation to hardware designs with the proposed approach.
3.4 Possible partial hardening solutions.
4.4 Allocating buffers when output/intermediate buffers are the same size.
4.5 Allocating buffers when only intermediate buffers are the same size.
4.12 Classic algorithmic candidates in data flow graphs for FFT (N=8).
4.14 Decomposing a size-m*n table into a size-n table and a size-m table.
4.15 A general solution that loads twiddle factors from main memory.
5.2 The latency comparison between two dependency management strategies.
List of Tables
2.8 Translating Σ-OL constructs to code; x denotes the input and y the output.
4.1 The indices for accessing a size-2 vector from a size-8 buffer with various strides.
4.4 A schedule for small write stride and large read stride.
4.5 A schedule for large write stride and small read stride.
4.7 The icode extensions of types, commands and location descriptors.
4.11 Synthesize the coordinating FSMs in icode for loop nest controllers.
Chapter 1
Introduction
ators to handle the large trade-off space between performance and resource utiliza-
tion. The Spiral framework focuses on automating the designs for digital signal
for algorithms with uniform blocks in the data flow graph representation. In this
by processor designs, into the Spiral framework. The updated Spiral framework
first stage sets up and solves a constraint program for producing algorithms match-
ing the required architectural features. The second stage represents the generated
algorithms in imperfect loop nest programs and uses pattern-based loop transforma-
tions that are enabled by the domain-specific language (DSL) in Spiral, to optimize
the programs. The final stage interprets the optimized programs into hardware de-
for a compute pattern that processes high-dimensional data cubes on a load-store ar-
conforming to the pattern are optimized for execution latency, RAM utilization,
using the updated framework. This approach has been applied to Walsh-Hadamard
transform, discrete Fourier transform and the bitonic sorter algorithm. In a case
study of FFT accelerators, I show that by allowing for slight non-uniformity in the
choice of algorithms, the execution latency in cycle counts is reduced and the SRAM
1.1 Motivation
signs has centered around customized parallel architectures for particular algo-
rithms. The popular examples include streaming architectures (Figure 1.1a) and
systolic arrays (Figure 1.1b). These architectures can achieve much higher compu-
ever, these high-throughput architectures only fit an important yet limited set of
scenarios. Hence, a natural question is: can we trade off some throughput
(a) A streaming architecture composed of different functional units connected by
streaming links. (b) A systolic array architecture composed of homogeneous
processing elements connected by neighboring communication links.
ducted in an ALU whose I/O ports are connected to a register file. The register
file exchanges data with the memory system via load and store operations. A pro-
served for flexibility while both primitive operations in the datapath and compute
sequences in the controller are highly customized to specific algorithms. The peak
and the number of memory ports connected to the datapath. Figure 1.3 shows two
mature parallel paradigms known in processor designs that can be applied to cus-
style design.
Figure 1.2: From processors to customized load-store architectures.
sential task is to explore the large tradeoff space between performance and resource
utilization. This is, however, challenging due to the mutual restriction between a
wide range of algorithms and a large design space of architectures. For general pro-
cessors, designing an efficient architecture and programming for the architecture are
the best choice in one domain depends on which choices are made in the other.
Lacking a tool to reason about both sets of options at once, the cost of explor-
ing the tradeoff space between performance and resource utilization is prohibitively
high.
tures because it offers a unique method to reason about algorithms and architectures
simultaneously. In this way, the architecture is by construction optimized for the
algorithm considered for hardware acceleration. The control of the customized ar-
solver, which automatically derives algorithms fitting the specified architecture, and
a multi-level rewriting system for program optimizations and code generation. The
a unified formal system called the operator language (OL). A constraint problem
algorithms, the base cases that the hardware can efficiently process, and a set of
problem is solved by applying the rule set recursively to the specification until
termination, producing efficient algorithms for the architecture. Next, the derived
OL language for loop optimizations. Finally, the optimized loop is translated to the
icode representation for code generation. The code generation process using multi-
level domain specific languages (DSLs) is shown by the left-most flow of Figure 1.4.
In the past, Spiral has been successfully applied to generating high-performance
library code for novel architectures that are difficult for human programmers to target.
Though Spiral was initially designed for program generation for off-the-shelf processors, the
Figure 1.4: The Spiral code generation flows involving multi-level domain specific
languages.
co-synthesis. In addition, the multi-level rewriting system can be extended for code
generation at the register-transfer level (RTL). In the past, a Spiral approach for
solutions in the tradeoff space between performance and resource utilization. In that
work, the specification is decomposed to SPL1 formulas describing the generated al-
gorithm with streaming hardware parameters. Then, the SPL formula is translated
flow in Figure 1.4. Traversing the tradeoff space of the streaming architecture for an
algorithm boils down to the vertical and horizontal foldings of the datapath mapped
from the data flow graph representation, as illustrated by an example in Figure 1.5,
thus it requires uniform blocks in the data flow graph that are executed stage by
1. A subset of OL for linear operators.
(a) Data flow graph of 8-point Pease FFT algorithm where the data flows from right to left.
It is composed of an initial bit reversal permutation stage, followed by three stages with
uniform geometry containing parallel computational blocks and a stride permutation.
(b) Vertically folded streaming datapath. The leftmost block filled with diagonal stripes
represents streamed bit reversal permutation datapath. The solid diamond grid represents
the computational datapath. The normal grid pattern represents the streamed stride per-
mutation datapath.
Figure 1.5: Mapping a uniform data flow graph to streamed datapath. There exists
a degree of freedom when folding datapath in both dimensions.
stage. As a result, this method works best in domains where the algorithms
Despite the earlier success, applying Spiral for customized load-store ar-
sors, and 2) the extra considerations necessary in interpreting imperfect loop nest
troller designs, besides the instruction stream controller used in general purpose
or FSM-based controller. For memory sub-system designs, the options range from a
single-level fast SRAM for IP core designs to a multi-level hierarchy involving main
for decomposing operations to base cases that can be directly handled in hardware.
imperfect loop nest programs in Σ-OL and requires optimization for features not
easily modeled in the constraint solving stage. In Σ-OL, the memory access indices
general compiler optimizers. In software code generation, Spiral has employed loop
fusion to reduce data transfers between on-chip and off-chip memory and software
pipelining to hide the latency. When interpreting loop programs to hardware im-
the efficient mapping between the arrays used as intermediate buffers between loops
1.3 My Approach
The proposed approach extends the DSLs in Spiral for hardware generation.
It includes OL extensions for addressing the design space using the constraint solver,
Σ-OL extensions for loop optimizations for hardware interpretation purpose, and
icode extensions for modeling the interconnected RTL modules. The hardware icode
is finally unparsed to Chisel RTL code. The resulting hardware generation flow is
shown in the middle of Figure 1.4. Note that the hardware generation flow does not
of computation, this work has focused on a compute pattern that processes high-
This pattern covers the popular algorithms in computing the Walsh-Hadamard trans-
architecture with a flat on-chip memory that allows reading and writing one scalar
data word in steady state. The parallelization can be achieved with moderate effort
current Spiral.
Figure 1.6: The three steps in mapping the 8-point DFT computation to hardware
designs with the proposed approach.
By focusing on a constrained compute pattern and architecture, this dis-
sertation takes the first step toward opening up the full power of Spiral for hardware
tomized load-store architecture design. In the first step, the extended constraint
solver derives an algorithm for the architecture. Compared to the derived algorithm
for streaming architecture in Figure 1.5, the algorithm for load-store architecture
handles data permutations in the form of memory indices. As a result, it allows dif-
ferent access patterns at each stage and merges the initial permutation to the first
stage. Next, the data flow graph representation in OL formula is lowered to the
memory buffers. Finally, the loop program is interpreted to the icode representa-
tion that captures the interconnected RTL modules of datapath and controller unit.
The datapath is synthesized from the basic block specification of the loop program.
The controller is synthesized from the loop nest structure of the program.
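The mapping from a loop nest to a controller can be modeled in software. The following Python generator is an illustrative sketch of chained counters (the thesis itself emits Chisel RTL; the function name and interface here are hypothetical, not from the thesis):

```python
def loop_nest_controller(bounds):
    """Model of a loop nest controller as chained counters (outermost first).

    Yields one index tuple per 'cycle', as an FSM would drive the
    gather/scatter address generators of the datapath.
    """
    indices = [0] * len(bounds)
    while True:
        yield tuple(indices)
        # Ripple-carry increment: the innermost counter advances first.
        for level in reversed(range(len(bounds))):
            indices[level] += 1
            if indices[level] < bounds[level]:
                break
            indices[level] = 0
        else:
            return  # the outermost counter wrapped: the loop nest is done

cycles = list(loop_nest_controller([2, 3]))
```

Each yielded tuple corresponds to one iteration of the imperfect loop nest; a hardware controller would produce the same sequence, one vector of loop indices per cycle.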
1.4 Contributions
iteration space, the memory indices, and the kernel operations correspond-
sions are developed for this pattern.
extending the Spiral approach. The Σ-OL language is extended for pattern-
based loop optimizations. The icode language is extended for modeling the
interconnected RTL modules internally and for producing Chisel RTL code natively.
1.5 Limitations
It is worth noting that this work marks the first fundamental step in extending
Spiral for generating customized load-store architectures while more efforts are
required to uncover the full power of load-store architectures. First, this work
has focused on algorithms with limited irregularity. Other algorithms may benefit
more from the flexible nature of load-store architectures. Second, this work
limits the peak processing throughput to one data word per cycle. To scale the
throughput, parallel techniques like SIMD and multicore can be employed which
1.6 Thesis Outline
Chapter 2 presents the background of Spiral. The complete code generation flow
is introduced which emphasizes the multi-level DSLs of OL, Σ-OL and icode. The
chitectures from imperfectly nested loop programs. The chapter discusses the major
flexibility is investigated via the cost of adding new algorithms to the framework.
studied. Third, an FFT core generated with the proposed approach is compared
against those produced by the existing Spiral hardware backend on a Xilinx FPGA
Chapter 2
of this chapter explains the principles of Spiral, and its code generation flow targeting
store architectures. This chapter ends by briefly reviewing the previous hardware generation
is challenging due to the complicated architecture features used for scaling per-
formance from the initial stored-program computers, including the deep memory
hierarchy, the parallel computing paradigms including SIMD, multi-core, many-
core, and distributed memory. The Spiral code generation approach is based on
ble while the computer platforms change frequently and span a wide spectrum of
expert-tuned designs.
unified representation called the operator language (OL). Then the problem is cast
following three sections explain the Spiral approach, centering around the three
DSLs of the multistage rewriting flow. The three DSLs, from top to bottom, capture
the data flow graphs (DFGs), abstract loops, and intermediate code, as shown
in Figure 1.4. Along the generation flow, efficient implementations are obtained by
rithm space. The first step of Spiral is deriving the “right” algorithms for a given
computing platform. For doing that, OL is used to capture the algorithms, the
set up by specifying the algorithm breakdown rules, architecture-specific breakdown
rules and the base cases. Then, the solutions are obtained by recursively apply-
ing breakdown rules to the functional specification until fully expanded. Finally,
Specifications
ators with unambiguous input/output behaviors that map vectors to vectors. The
operators support taking multiple input vectors and producing multiple output vec-
tors, utilizing multiple base types for the vectors, including fields (R, C, GF (k)),
rings and semi-rings. In this work, the focus is on the kernels that map from one
WHT_n : R^n → R^n : x ↦ (1/2^(n/2)) [(−1)^(Σ_j k_j l_j)]_{0≤k,l<n} x,    (2.1)

DFT_n : C^n → C^n : x ↦ [e^(2πikl/n)]_{0≤k,l<n} x,    (2.2)
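As a quick sanity check of definition (2.2), the DFT matrix can be built directly from its entries; the sign of the exponent is a convention, and either sign satisfies DFT_n · conj(DFT_n) = n·I_n. This sketch (not from the thesis) verifies that property:

```python
import cmath

def dft_matrix(n):
    # The n-point DFT matrix [e^(2*pi*i*k*l/n)], as in definition (2.2).
    return [[cmath.exp(2j * cmath.pi * k * l / n) for l in range(n)]
            for k in range(n)]

def matmul(a, b):
    # Plain list-of-lists matrix product.
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

n = 4
F = dft_matrix(n)
Fc = [[F[i][j].conjugate() for j in range(n)] for i in range(n)]
P = matmul(F, Fc)   # expect n * I_n, up to floating-point error
```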
Non-linear operators are also supported, an example being the sorting network in ascending order,
Besides the above linear and non-linear kernels that are used in this thesis,
Spiral has modeled many more kernels as OL operators. Examples include the dis-
crete cosine transform [3], the wavelet transforms [4], the polar formatting synthetic
aperture radar [5], the matrix-matrix multiplication [6], and the Viterbi decoder [7].
Operator Language
Spiral employs data flow graphs (DFGs) to represent algorithms. The DFGs in
Spiral are modeled with the operator language (OL) for convenient manipulation.
Specifically, OL models the basic DFG fragments as operators and the meaningful
The linear transform origin of Spiral has brought linear operators that can
be represented as matrices. The specifications (2.1) and (2.2) are linear operators.
I_n = [1 0 ··· 0; 0 1 ··· 0; ⋮ ⋱ ⋮; 0 0 ··· 1],    (2.5)
the stride permutation matrix L^n_k, which reads the input at stride n/k and stores it at unit stride,

L^n_k : i(n/k) + j ↦ jk + i,   0 ≤ i < k, 0 ≤ j < n/k,    (2.6)
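Definition (2.6) can be read as a small reindexing routine. The following sketch (not from the thesis) applies L^n_k to a list:

```python
def stride_permute(x, k):
    """Apply the stride permutation L^n_k of (2.6) to a list x of length n."""
    n = len(x)
    m = n // k
    y = [None] * n
    for i in range(k):
        for j in range(m):
            y[j * k + i] = x[i * m + j]   # input index i*(n/k)+j -> output j*k+i
    return y

even_odd = stride_permute(list(range(8)), 4)    # [0, 2, 4, 6, 1, 3, 5, 7]
interleave = stride_permute(list(range(8)), 2)  # [0, 4, 1, 5, 2, 6, 3, 7]
```

For k = n/2 the permutation separates even- and odd-indexed elements; for k = 2 it interleaves the two halves (the perfect shuffle).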
D_n = diag(d_0, ..., d_{n−1}) = [d_0 0 ··· 0; 0 d_1 ··· 0; ⋮ ⋱ ⋮; 0 0 ··· d_{n−1}].    (2.7)
cannot be represented as matrices but they are also clearly defined as shown in the
Higher-order functions capture the shapes of DFG fragments that are essen-
tial to reason about efficient implementations. The direct sum ⊕ of operators A and B,

(A ⊕ B) · x = [A 0; 0 B] x,    (2.8)

partitions the input vector into two sub-vectors to feed A and B separately and
concatenates the resulting sub-vectors to form the result vector. The Kronecker
product ⊗ [8] of an identity matrix In and an operator Am
(I_n ⊗ A_m) · x = diag(A_m, A_m, ..., A_m) x    (2.9)
also performs repetitive operations except that the size-m sub-vectors are obtained
from the input vector with stride of n. In other words, the same A_m operator is
applied to neighboring data items and thus can exploit data parallelism through vector-
does not apply, the Kronecker product is defined formally in OL. For example, the
◦ of A and B
represents consecutive computational steps where the input vector is first
the ◦ operator is omitted for simplicity while multiple operators can be visually
Besides the existing operators in current OL, the language is arbitrarily ex-
tensible as long as the extensions are well-defined and mathematically legal. In this
work, we will slightly modify the sorting operator to add the sorting direction as a
Breakdown Rules
compute stages are encoded as breakdown rules. Note the difference between the
algorithms modeled as breakdown rules and the fully specified algorithms obtained
specification) and replaces the matched operator with the right-hand side of the
rule. Spiral has defined more than 200 breakdown rules, part of which relevant to
which encodes that the nonterminal WHT2k1 +k2 is translated into a right-hand side
that involves new nonterminals WHT2k1 and WHT2k2 . In this rule, the right-hand
side nonterminals are of smaller power-of-2 sizes. Hence, the recursive application
of the rule will terminate when a terminal rule specifies how the minimal-size WHT_2 is
translated into an atomic OL operator called a “butterfly”,
WHT_2 → F_2   with   F_2 = [1 1; 1 −1].    (2.14)
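The recursive breakdown can be exercised directly on matrices. Rule (2.13) itself is not shown in this excerpt; the sketch below assumes its standard form, WHT_{2^(k1+k2)} = (WHT_{2^k1} ⊗ I_{2^k2})(I_{2^k1} ⊗ WHT_{2^k2}), with the butterfly (2.14) as base case, and checks it against the k-fold Kronecker power of F_2:

```python
F2 = [[1, 1], [1, -1]]

def kron(a, b):
    # Kronecker product of two list-of-lists matrices.
    return [[a[i][j] * b[p][q]
             for j in range(len(a[0])) for q in range(len(b[0]))]
            for i in range(len(a)) for p in range(len(b))]

def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def wht(k):
    """WHT_{2^k} (unnormalized) via the assumed breakdown rule."""
    if k == 1:
        return F2                      # terminal rule (2.14)
    k1, k2 = 1, k - 1                  # one split choice; any split is valid
    return matmul(kron(wht(k1), identity(2 ** k2)),
                  kron(identity(2 ** k1), wht(k2)))

# WHT_{2^3} should equal F2 (x) F2 (x) F2.
W = kron(kron(F2, F2), F2)
```

The space of rule trees (which split k1 + k2 to take at each level) is exactly the algorithm space Spiral searches.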
DFT. Similarly, the nonterminal DFT_{mk} can be factorized with the general rule

DFT_n → (DFT_k ⊗ I_m) T^n_m (I_k ⊗ DFT_m) L^n_k,   n = km,    (2.15)
that translates the input to new nonterminals DFT_m and DFT_k. This rule produces
the twiddle factor matrix T^n_m, defined as

T^n_m = diag(d_0, ..., d_{n−1}),   where d_i = ω_n^(⌊i/m⌋ · (i mod m)).    (2.16)
DFT_2 → F_2   with   F_2 = [1 1; 1 −1].    (2.17)
χ_n → M_{n,χ} (χ_{n/2} ⊕ Θ_{n/2}),    (2.18)
that produces nonterminals of a half size ascending sorter χn/2 , a half size descending
sorter Θn/2 , and a bitonic merger of ascending order Mn,χ . The bitonic merger can
be factorized with a rule
M_{n,χ} → (I_2 ⊗ M_{n/2,χ}) (χ_2 ⊗ I_{n/2}),    (2.19)

that involves χ_2 and a half-size merger. The descending sorter is factorized with a
rule
Θ_n → M_{n,Θ} (χ_{n/2} ⊕ Θ_{n/2}),    (2.20)
similar to (2.18) except that a bitonic merger of descending order Mn,Θ is produced.
The reversed-order bitonic merger is then factorized by a rule similar to (2.19),

M_{n,Θ} → (I_2 ⊗ M_{n/2,Θ}) (Θ_2 ⊗ I_{n/2}).    (2.21)
Since sorting is non-linear, the terminal rules define the base-case sorters by specifying
their behaviors explicitly, as the min-max and max-min operations respectively:

χ_2 → S_2   with   S_2 : [x_0; x_1] ↦ [min(x_0, x_1); max(x_0, x_1)],    (2.22)

Θ_2 → Ŝ_2   with   Ŝ_2 : [x_0; x_1] ↦ [max(x_0, x_1); min(x_0, x_1)].    (2.23)
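Rules (2.18)-(2.23) translate directly into a recursive list program: sort the two halves in opposite directions to form a bitonic sequence, then merge with a comparator stage (χ_2 ⊗ I_{n/2} or Θ_2 ⊗ I_{n/2}) followed by two half-size merges. A sketch for power-of-2 input lengths:

```python
def merge(v, ascending):
    """Bitonic merger M_{n,chi} / M_{n,Theta} per rules (2.19)/(2.21)."""
    n = len(v)
    if n == 1:
        return list(v)
    h = n // 2
    a = list(v)
    # Comparator stage: stride-n/2 pairs, per chi_2/Theta_2 of (2.22)/(2.23).
    for i in range(h):
        lo, hi = min(a[i], a[i + h]), max(a[i], a[i + h])
        a[i], a[i + h] = (lo, hi) if ascending else (hi, lo)
    # I_2 (x) M_{n/2}: merge each half recursively.
    return merge(a[:h], ascending) + merge(a[h:], ascending)

def bitonic_sort(v, ascending=True):
    """Sorter chi_n / Theta_n per rules (2.18)/(2.20)."""
    n = len(v)
    if n == 1:
        return list(v)
    h = n // 2
    # chi_{n/2} (+) Theta_{n/2}: sort the halves in opposite directions.
    bitonic = bitonic_sort(v[:h], True) + bitonic_sort(v[h:], False)
    return merge(bitonic, ascending)
```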
is computed through smaller problem sizes via breakdown rules. Earlier research
like [9] has shown that the computations captured now in OL, including the stride
equivalence with different computational structures to address various computer
I_{mn} → I_m ⊗ I_n    (2.24)
which can be used to represent loop tiling. The following example describes using
computations of A:
Rule (2.25) specifies that the block parallel computations of Am can be equivalently
and L^{mn}_m afterwards. Rule (2.26) specifies the reversed conversion with different
tions can be challenging in performance when permuting a large data set. In fact,
they can be performed in a way that exploits block granularity.
L^{kmn}_{km} → (L^{kn}_k ⊗ I_m)(I_k ⊗ L^{mn}_m),    (2.28)
The rules (2.27) and (2.28) involve Kronecker products of stride permutations.
I_k ⊗ L^{mn}_n encodes a permutation within blocks of size mn, while L^{kn}_k ⊗ I_m
encodes a permutation of blocks of size m. These behaviors are more efficient in communication with
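The factorization (2.28), as reconstructed here, can be checked numerically: applying the in-block permutation I_k ⊗ L^{mn}_m first and then the block permutation L^{kn}_k ⊗ I_m reproduces L^{kmn}_{km}. A sketch (not from the thesis):

```python
def stride_permute(x, k):
    # L^n_k per (2.6): element at input index i*(n/k)+j moves to output j*k+i.
    n = len(x)
    m = n // k
    y = [None] * n
    for i in range(k):
        for j in range(m):
            y[j * k + i] = x[i * m + j]
    return y

def fine(x, block, k):
    # I_p (x) L^{block}_k: apply the stride permutation within each block.
    return [v for s in range(0, len(x), block)
            for v in stride_permute(x[s:s + block], k)]

def coarse(x, nblocks, k):
    # L^{nblocks}_k (x) I_m: stride-permute whole blocks of size m.
    m = len(x) // nblocks
    blocks = [x[s:s + m] for s in range(0, len(x), m)]
    return [v for b in stride_permute(blocks, k) for v in b]

k, m, n = 2, 2, 2
x = list(range(k * m * n))
lhs = stride_permute(x, k * m)             # L^{kmn}_{km}
rhs = coarse(fine(x, m * n, m), k * n, k)  # (L^{kn}_k (x) I_m)(I_k (x) L^{mn}_m)
```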
The architectural features that require data flow optimizations are captured in
the formal framework. These features include memory organizations and parallel
paradigms.
rules via hardware tags. A breakdown rule can be associated with tags on the
left-side nonterminal, encoding a “right” way to expand the nonterminal such that
OL, Spiral has provided general tagged breakdown rules to handle common OL
Rule (2.29) uses the equivalence of (2.24). The tag on the left side vec(v)
denotes v-way SIMD vectorization. On the right side, the underlying tag is removed
and it means that the expanded OL formula has been restructured for SIMD ex-
ecution. This is achieved by tagging a Kronecker product operator with the same
tag vec(v), which encodes the decision that this operator should be implemented as
v-way SIMD operations in the code generation backend. The OL shape Am ⊗vec(v) Iv
is a base case for SIMD hardware platform. The entire set of SIMD base cases can
be found in [10].
ing on general OL patterns does not produce the optimal results. Sometimes, the
niously picking the right rules, from a large identified set of program transformation
rules explained in Section 2.2.2, and applying them in a right way. For example,
though there are independent SIMD tagged rules for I_m ⊗ A^{n×n} and L^{mn}_m, the SIMD
factorization rule (2.15) can embrace OL stage reduction through careful program
(I_m ⊗ A^{n×n}) L^{mn}_m → (I_{m/ν} ⊗ (L^{nν}_ν (A^{n×n} ⊗_{vec(ν)} I_ν))) (L^{mn/ν}_{m/ν} ⊗_{vec(ν)} I_ν),   ν | m    (2.30)
By comparing Rule (2.30) with the results achieved by applying Rules (2.28)
and (2.25) separately, the number of stages is reduced. This will make across-stage
2.2.4 Algorithm Generation and Autotuning
With a unified formal framework for algorithms, hardware architectures, and pro-
algorithmic breakdown rules, the hardware space as base cases, and the transforma-
man designers. To solve the constrained optimization problem, breakdown rules are
applied recursively until all nonterminals are translated to terminals. The order of
rule applications on nonterminals is called the rule tree, and by applying rules to the
input OL specification with respect to the rule tree, we can obtain fully expanded
algorithms. To pick the optimal algorithms from the large candidate set, Spiral
employs auto-tuning techniques to search within the design space. In the past, the
approach of how to find the most efficient algorithm of DFT8 for a 2-way vectorized
processor. Hardware rules (left, red), algorithm rules (right, blue) and program
transformation rules (center, grey) together span a search space (multi-colored oval).
The given problem specification and hardware target give rise to the solution space
Figure 2.1: Algorithm generation as a constraint problem [1].
(black line) that is a subspace of the overall search space. Each point in the solution
space (black dot) represents a DFG given as an OL formula that is optimized for the
the quality of each algorithm in implementations and finds the optimal solution out
of the space.
The algorithms generated with the constraint solver are flat DFGs containing mul-
tiple compute or data reorganization stages, each of which is optimized for the
All commodity processor architectures share the same property: data resides
in main memory, and the processing core fetches data from memory and writes back
results to memory in an iterative manner. Moreover, the arithmetic/logical processing
speed is usually much faster than data movements [11]. Consequently, efficient
and data reorganization on the same data set as much as possible. In addition,
in case the data movement bandwidth falls behind the computational throughput,
This section focuses on stage fusion for the flat DFGs. Since each stage
problem. Spiral introduces the Σ-OL language to make loops explicit and memory
access patterns symbolic, such that the difficult loop merging problem
an explicit memory abstraction, extract the kernel operation, and specifies how
data is addressed in loading from memory to the kernel and storing from the kernel
to memory. Since in Spiral a kernel always processes a vector, the load and store
operations are gather and scatter operations [13] that address data in memory
data flow patterns to loop programs are shown in Figure 2.2. The left-hand side
shows the DFG. The right-hand side shows the corresponding loop program repre-
sentation that uses stacked squares for memory, places the kernel operation at the
center, and gathers and scatters data as prescribed by the arrows. Note that although two
(a) I_2 ⊗ F_2   (b) Σ_{i=0}^{1} S_{h_{2i,1}} F_2 G_{h_{2i,1}}   (c) F_2 ⊗ I_2   (d) Σ_{i=0}^{1} S_{h_{i,2}} F_2 G_{h_{i,2}}
memory arrays are visualized, they can be mapped to the same physical memory
in actual implementations. The two examples have two iterations for the loop, the
line arrows. The corresponding OL and Σ-OL (to be explained soon) formulas
are provided as subgraph captions. Figure 2.2a shows the DFG of the OL formula
I2 ⊗F2 . Figure 2.2b shows the translated loop representations where the F2 kernel
is extracted and the unit stride pattern is used for gather and scatter. In Figure
2.2c and 2.2d, the translation is shown for the OL formula F2 ⊗ I2 , which results in
a stride-2 access pattern. From this example, we see that the different data flow
patterns with the same kernel can be normalized to the same loop representation
2.3.2 Σ-OL
Iterative sum. Σ-OL is a superset of OL that introduces a few new constructs for
the loop abstraction. The core operator is the iterative sum operator that captures
a loop,
Σ_{j=0}^{n−1} A_j,    (2.31)
where j is an induction variable with range n. For linear operators, the iterative sum
and scatter operators parameterized by index mapping functions for indirect access.
interval is denoted by I_n = {0, ..., n − 1}. An index mapping function has the form

f : I_n → I_N ; i ↦ f(i).

We use the short-hand notation f^{n→N} to refer to an index mapping function of the
form f : I_n → I_N.
the access pattern of the algorithms. A particularly important index mapping function
used in this thesis is the h function, parameterized by a base b and a stride s for
strided indexing,

h_{b,s} : I_n → I_N ; i ↦ b + is.
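The h function is small enough to state as a one-line sketch; it generates the address sequences the gather and scatter operators below consume:

```python
def h(b, s, n):
    # Strided index mapping h_{b,s}: i -> b + i*s, for i in {0, ..., n-1}.
    return [b + i * s for i in range(n)]

pairs_unit = [h(2 * i, 1, 2) for i in range(2)]   # h_{2i,1}: [[0, 1], [2, 3]]
pairs_stride = [h(i, 2, 2) for i in range(2)]     # h_{i,2}:  [[0, 2], [1, 3]]
```

For example, h_{2i,1} yields the unit-stride pair (2i, 2i+1) of Figure 2.2b, while h_{i,2} yields the stride-2 pair (i, i+2) of Figure 2.2d.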
erates a matrix that produces its output vector by multiplying with the input vector.
size-n subvector from the size-N input vector. In this work, a Gather operator is
canonical basis vector with entry 1 in position k and entry 0 elsewhere. An index
tion),
G_{h^{n→N}_{b,s}} := [e^N_{h(0)} | e^N_{h(1)} | ··· | e^N_{h(n−1)}]^T,    (2.33)

which implies that for two vectors x = (x_0, ..., x_{N−1})^T and y = (y_0, ..., y_{n−1})^T,

y = G_{h^{n→N}} x ⟺ y_i = x_{h(i)}.
The result is used for generating code from the Gather matrices.
Scatter operator. A Scatter and a Gather with the same index mapping
function h^{n→N} are the transpose of each other. It represents transferring data

S_{h^{n→N}_{b,s}} := [e^N_{h(0)} | e^N_{h(1)} | ··· | e^N_{h(n−1)}].    (2.34)

The definition (2.34) implies that for two vectors x = (x_0, ..., x_{n−1})^T and
y = (y_0, ..., y_{N−1})^T,

y = S_{h^{n→N}} x ⟺ y_j = x_i if j = h(i), and y_j = 0 otherwise.
The result is used for generating code from the Scatter matrices.
In this example, the loop containing two iterations is captured by the iterative sum
over i. The unit stride access pattern is captured by the gather operator G_{h^{2→4}_{2i,1}} and
the scatter operator S_{h^{2→4}_{2i,1}}. The index mapping function h_{2i,1} produces a size-2
vector with unit stride for each input i. The matrix operator instance for the first
iteration, i.e., i = 0, is

S_{h^{2→4}_{0,1}} F_2 G_{h^{2→4}_{0,1}} = [1 0; 0 1; 0 0; 0 0] [1 1; 1 −1] [1 0 0 0; 0 1 0 0],
which extracts the first two data entries of the size-4 input vector to feed the F2
butterfly operator, and finally places the size-2 sub-vector result on the first two
entries of the output vector while setting 0s for the untouched elements.
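The i = 0 instance above behaves exactly like a gather, a butterfly, and a scatter in sequence. A small sketch (not from the thesis) making that concrete:

```python
def gather(x, idx):
    # G_h: y_i = x_{h(i)}, per (2.33)
    return [x[i] for i in idx]

def scatter(v, idx, N):
    # S_h: place v_j at position h(j), zeros elsewhere, per (2.34)
    y = [0] * N
    for j, i in enumerate(idx):
        y[i] = v[j]
    return y

def f2(v):
    # the butterfly F2
    return [v[0] + v[1], v[0] - v[1]]

idx = [0, 1]                              # h_{0,1} on I_2
x = [3, 5, 7, 9]
y = scatter(f2(gather(x, idx)), idx, 4)   # [8, -2, 0, 0]
```

The butterfly is applied to (x_0, x_1) = (3, 5) and the remaining output entries stay zero, matching the matrix product above.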
constraint that the index mapping function must be bijective. Though any linear
permutation is permitted, in this work we only deal with the stride permutation.
perm(p^{n→n}) := [e^n_{p(0)} | e^n_{p(1)} | ··· | e^n_{p(n−1)}]^T.
matrices.
C,

diag(f^{n→C}) := diag(f(0), ..., f(n − 1)).
tors, the aforementioned two special geometries (2.9) and (2.10) of the data flow graph
captured by the Kronecker product can be lowered to Σ-OL expressions with the
I_m ⊗ A_n → Σ_{j=0}^{m−1} S_{h^{n→nm}_{nj,1}} A_n G_{h^{n→nm}_{nj,1}}    (2.35)

A_m ⊗ I_n → Σ_{j=0}^{n−1} S_{h^{m→nm}_{j,n}} A_m G_{h^{m→nm}_{j,n}}    (2.36)
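Rules (2.35) and (2.36) are directly executable as loops of scatter(A(gather(...))) with the strided index function h_{b,s}. A sketch (not from the thesis), using F_2 as the kernel:

```python
def h(b, s, n):
    # strided index mapping h_{b,s}: i -> b + i*s
    return [b + i * s for i in range(n)]

def gather(x, idx):
    return [x[i] for i in idx]

def f2(v):
    return [v[0] + v[1], v[0] - v[1]]

def i_m_kron_a(x, m, n, A):
    # (2.35): I_m (x) A_n as a loop over contiguous (h_{nj,1}) sub-vectors
    y = [0] * (m * n)
    for j in range(m):
        idx = h(n * j, 1, n)
        for pos, val in zip(idx, A(gather(x, idx))):
            y[pos] = val
    return y

def a_kron_i_n(x, m, n, A):
    # (2.36): A_m (x) I_n as a loop over strided (h_{j,n}) sub-vectors
    y = [0] * (m * n)
    for j in range(n):
        idx = h(j, n, m)
        for pos, val in zip(idx, A(gather(x, idx))):
            y[pos] = val
    return y

x = [1, 2, 3, 4]
ya = i_m_kron_a(x, 2, 2, f2)   # I_2 (x) F_2 -> [3, -1, 7, -1]
yb = a_kron_i_n(x, 2, 2, f2)   # F_2 (x) I_2 -> [4, 6, -2, -2]
```

Both data flow patterns normalize to the same loop shape; only the index mapping function differs, which is exactly what makes loop merging tractable symbolically.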
other operators that also process data iteratively. By merging as much computation
and data reorganization as possible into a single iterative sum operator, the data roundtrip
from memory to memory can be minimized. The loop merging is achieved in Spiral
through applying rewrite rules for Σ-OL expressions provided by human designers.
This is possible because the access patterns are made symbolic as index mapping
functions. Further, the index mapping functions after loop merging can be simplified
For instance, in FFT algorithms, the diagonal and permutation operators are
\[
\left(\sum_{i=0}^{1} S_{h^{2\to 4}_{i,2}} F_2\, G_{h^{2\to 4}_{i,2}}\right) \mathrm{diag}(f^{4\to\mathbb{C}}) \left(\sum_{i=0}^{1} S_{h^{2\to 4}_{2i,1}} F_2\, G_{h^{2\to 4}_{2i,1}}\right) \mathrm{perm}(p^{4\to 4}), \tag{2.37}
\]
which contains four consecutive computational stages over the input vector, each of
which can be implemented as a loop. The rewrite rules for merging Σ-OL expressions
Operators other than the iterative sums can be incorporated into the neighboring iterative sum operator on the left side or on the right side. Rules
\[
\left(\sum_{j=0}^{m-1} A_j\right) M \;\to\; \sum_{j=0}^{m-1} A_j M, \tag{2.38}
\]

\[
M \left(\sum_{j=0}^{m-1} A_j\right) \;\to\; \sum_{j=0}^{m-1} M A_j \tag{2.39}
\]
implement the distributivity law for moving operators inside iterative sums. After
applying these rules to (2.37), permutations and diagonals can be paired with the
The permutation can be merged into the gather on the left so that another data pass for the permutation can be saved. Rule

\[
G_{w^{n\to N}}\, \mathrm{perm}(p^{N\to N}) \;\to\; G_{p \circ w} \tag{2.40}
\]

merges permutations into gathers by composing the index mapping functions of the permutation and the gather.
The diagonal matrices can be swapped with the right-side scatter operation
such that the diagonal operator can be performed in sub-vector granularity together
\[
\mathrm{diag}(f^{N\to\mathbb{C}})\, S_{w^{n\to N}} \;\to\; S_w\, \mathrm{diag}(f \circ w). \tag{2.41}
\]
swaps the diagonal with the scatter by composing the index mapping functions of the diagonal and the scatter.
By applying the above rewrite rules to (2.37), we can obtain the loop-merged Σ-OL expression. However, the merging may create complicated index mapping
functions via composition. A particular set of rewrite rules in [2] can be applied
to simplify the index mapping functions. Finally, the merged and simplified Σ-OL expression is
\[
\left(\sum_{i1=0}^{1} S_{h^{2\to 4}_{i1,2}} F_2\, G_{h^{2\to 4}_{i1,2}}\right) \left(\sum_{i2=0}^{1} S_{h^{2\to 4}_{2\,i2,1}}\, \mathrm{diag}(f^{4\to\mathbb{C}} \circ h^{2\to 4}_{2\,i2,1})\, F_2\, G_{p^{4\to 4} \circ\, h^{2\to 4}_{2\,i2,1}}\right). \tag{2.42}
\]
2.4 Abstract Programs in icode
To enable loop code portability between various programming languages and appli-
the syntax of C language or C-derived dialects like OpenCL. This also enables op-
values and types, 2) arithmetic and logic operations, 3) constants, arrays and scalar
constructs are listed in Table 2.1. The integer types can be mapped to C types in a one-to-one manner. The real type is mapped to float or real depending on the
Arrays are supported with a type specifying the size and the primitive scalar
type. To address the scalar elements of an array, an nth object is added to extract the n-th element from a variable of array type. The array type support is listed
in Table 2.2.
A variable object can be created by specifying the name and type as presented
Table 2.2: icode constructs supporting C arrays
objects that can autonomously determine the type of results according to the types of the input. As a result, one operator object can model the same operation over various scalar types, array types, and user-defined structures. The operations of arithmetic, relational, logical, bitwise and ternary types are listed in Table 2.3. For certain arithmetic operations such as addition and subtraction, an arbitrary number of operands is supported. The notation v1, ..., vn represents n operands from v1 to vn separated by commas.
Table 2.4 lists several important commands: the assignment, the compound state-
the constructs beyond the C language specification can always be created. User-
defined data types and the corresponding operations can be modeled in icode. Table
2.5 presents the complex arithmetic extension where a complex data type and three
arithmetic operators are addressed. The complex data type complex_t is a structure
the final C code. The arithmetic operations add, sub, and mul on the complex
type are implemented as user-defined C functions. Besides, the types and functions
provided by the C standard library [14] or other APIs can be addressed in icode as well. Table 2.6 lists certain math functions of the C standard library supported
Table 2.3: icode constructs modeling the operations of C language
in current icode. Furthermore, icode has been extended in the past for capturing the instruction set extensions in SIMD vectorized architectures such as Intel's
Table 2.5: Extending icode for complex arithmetic
icode operator to ease pattern matching. Table 2.7 lists the max and min operations
that are realized with cond, geq, and leq icode operations.
The Σ-OL expressions are converted to icode expressions with a set of rules. Each operator is mapped to a specific icode object. A sample of the rules is listed in
Table 2.8. Figure 2.3 shows the icode expressions for FFT(4) translated from the
Σ-OL expression in (2.42). The generated icode can be finally unparsed (pretty-
Table 2.8: Translating Σ-OL constructs to code; x denotes the input and y the output vector. [2]
While the loop optimizations have been performed at the OL and Σ-OL levels, the basic blocks can be optimized in icode. The main basic block optimizations include loop unrolling, array scalarization, constant folding, copy propagation, and common subexpression elimination.
The formal framework for algorithm generation introduced in Section 2.2.1 can in-
architecture has been targeted in Spiral for generating high-throughput RTL im-
plementations for linear transform kernels, including the discrete Fourier transforms
(DFT), multi-dimensional DFT, real DFT and discrete sine/cosine transforms [15].
A streaming architecture continuously takes in a data stream from the input ports and produces a result stream on the output ports. It is typically composed of mul-
chain(
  loop(i2, 2, chain(
    loop(i4, 2, assign(nth(T1, i4))),
    chain(assign(t1, nth(T1, 0)),
      assign(t2, nth(T1, 1)),
      assign(nth(T2, 0), add(t1, t2)),
      assign(nth(T2, 1), sub(t1, t2))
    ),
    loop(i3, 2, assign(nth(T3, i3), nth(T2, i3)))
  )),
  loop(i1, 2, chain(
    loop(i4, 2, assign(nth(T1, i4))),
    chain(assign(t1, nth(T1, 0)),
      assign(t2, nth(T1, 1)),
      assign(nth(T2, 0), add(t1, t2)),
      assign(nth(T2, 1), sub(t1, t2))
    ),
    loop(i3, 2, assign(nth(T3, i3), nth(T2, i3)))
  ))
)
sented at the right-most side of Figure 1.4. First, the architecture-aware algorithm
generation process produces SPL formulas for streaming processing. Then, an ex-
as explicit streamed operations that require internal storage buffers. The automated
fixed-size permutations in [16] and for power-of-2 size permutations as bit permutations in [17].

¹SPL is the predecessor of the OL language. Similar to OL, SPL captures algorithms in flat data flow graphs but is limited to linear operators.
in sharing the precious on-chip storage elements. For instance, when integrating
cation of local memory is required though buffers have been implemented in the
load-store architecture can expose its local memory directly to the off-chip memory
controller. Further, distinct kernels that impose challenges on functional unit sharing
ating a large tradeoff space between throughput and area usage for streaming ar-
as is shown in Figure 1.5. This requires the regularity in DFGs such that the folded
However, the folding method also limits the choice of algorithms for efficient
implementations. For instance, a fully folded datapath design for FFT, implementing only one streamed butterfly kernel and one streamed permutation kernel, is
locked to the iterative Pease algorithm, not to mention that the initial bit-reversal
load-store architecture studied in this work targets the low to medium throughput
tectures has provided precious experience in extending Spiral for hardware generation. First, the stream tags introduced to the constraint solver show how a new
capture the fully pipelined datapath for RTL implementations. Each relevant icode
manifested the extensibility of the Spiral approach, which encourages this work for
2.6 Summary
This chapter introduces the background of the Spiral approach for code generation.
algorithms. The generated algorithms are then translated to the abstract loop rep-
resentation in Σ-OL for loop merging and index simplification. Finally, optimized
Σ-OL expressions are translated to the internal representation icode where basic
The final section presents an overview of the previous Spiral hardware gen-
work experiences and lessons for developing a new hardware generation backend of
Spiral.
After introducing the background of the Spiral approach, the next chapter
will explain the high-level idea of flexible hardware generation targeting customized
Chapter 3
explains the high-level ideas and introduces several challenges to overcome. We will
begin with the mapping from a simple loop program to a basic load-store architec-
ture. Then, we discuss a particular form of imperfect loop nest programs amenable
constructs that capture the repetition of other statements based on the structured
programming paradigm [20]. The enclosed statements could also be loops or other
1 int i;
2 for (i = 0; i < N; i++) {
3     B[i] = A[i] + 1;
4 }
in memory and store the results in memory for later use. In Figure 3.1a, the pseudo
each entry of a size-N array A and placing results in array B. Line 2 describes an
of the loop body enclosed by a pair of curly braces. The execution of the loop body
is guarded by the condition i<N, and includes one assignment at Line 3, which computes one element of B. The nested structure of loops that are used in numerous
hardware organization that allows arbitrary access from the units of computations
read port and one write port, a pipelined datapath connected to the memory, and
a controller that manages the behavior of the datapath. In later sections, we will elaborate on the more complicated designs that cover a wide range of computational
A simple loop program like Figure 3.1a can be translated to a basic load-store
architecture design. Since the N iterations of additions are independent of each other,
Figure 3.1b. The pipelined datapath starts by loading a data element of array A from
memory. It is followed by an addition to the input data and ended by storing the
summation back to memory. The loading and storing operations require an address
parameter provided by their dedicated address calculation functions with the current
value of loop variables provided by the controller. The controller traverses the size-N
Despite the simplicity of the above example, it clearly shows that even though
the load-store behavior resembles how a processor computes, the actual components
of the architecture can be fully customized and simplified for hardware efficiency.
This thesis focuses on a particular form of imperfectly nested loops that is amenable
them are suitable for hardware acceleration. In fact, only a subset of loop programs
are handled in the most advanced hardware compilation methods [21][22]. Compared
to the general hardware compilers, our approach relies more on the “right” structures
where loop statements are nested imperfectly. In a perfect way of nesting, each
loop encloses another loop as the loop body except that the innermost loop contains
a loop body with non-loop statements for computations. Since the perfect loop
nest has only one basic block, it is usually regarded as a special case in hardware
1  for (i2 = 0; i2 <= 3; i2++) {
2      s13 = X[2*i2];
3      s14 = X[2*i2 + 1];
4      T1[2*i2] = s13 + s14;
5      T1[2*i2 + 1] = s13 - s14;
6  }
7  for (i1 = 0; i1 <= 1; i1++) {
8      for (i4 = 0; i4 <= 1; i4++) {
9          s21 = T1[i1 + 4*i4];
10         s22 = T1[i1 + 4*i4 + 2];
11         T2[2*i4] = s21 + s22;
12         T2[2*i4 + 1] = s21 - s22;
13     }
14     for (int i3 = 0; i3 <= 1; i3++) {
15         s29 = T2[i3];
16         s30 = T2[i3 + 2];
17         Y[i1 + 2*i3] = s29 + s30;
18         Y[i1 + 2*i3 + 4] = s29 - s30;
19     }
20 }
Figure 3.2: An imperfect loop nest program computing the 8-point Walsh-Hadamard transform.
synthesis [23]. In this thesis, we focus on an imperfect loop nest structure inspired
by Spiral, where each loop level allows either a single loop or multiple loops with
nest programs with the desired structure. The top level contains loop-i2 and loop-i1, and loop-i1 contains the second level composition of loop-i4 and loop-i3 using intermediate
buffer T2. As we can see, multiple basic blocks are allowed in imperfect loop nest
programs.
The imperfect loop nest programs studied in this thesis include certain im-

Independent iterations. The loop iterations at each loop level are independent of each other. Given the arbitrary depth of nesting and the imperfect
nesting, this allows parallelism at various granularities. For instance, the repetitive execution of the basic block of each innermost loop can be computed through
Regular basic block shape. The statements at each basic block share an
identical pattern. Each basic block loads a vector from a read buffer with calcu-
lated indices. Then the input vector is processed through an arbitrary operator to
generate an output vector. Finally, the output vector is written to a write buffer
with calculated indices. This pattern allows the concurrency of data access and
Static loop bounds. The numbers of repetitions in all loops are statically determined. This results in data-independent control flow that allows
architecture designs.
Even though the listed properties of imperfect loop nests appear to be restricted, loop programs satisfying those properties have been observed in high perfor-
dia processing to machine learning. For example, the classic recursive Cooley-Tukey
FFT algorithm views the input data as a two-dimensional tensor and computes
Figure 3.3: A basic load-store architecture.
cialized load-store architecture must eliminate the high interpretation cost through
plement imperfect loop nest programs. Moreover, the parallel organizations in pro-
cessors can be applied to specialized load-store architectures for scaling the computa-
be decomposed into cooperating hardware operators which form a directed acyclic graph (DAG). The operators include data access, arithmetic, and logic operators.
latencies. When the latencies of all operators are statically determined, raw wires can
be used to connect I/O ports between nodes of hardware operators in the DAG.
When multiple flows of data exist in the DAG, additional flip-flop buffers could be
necessary to guarantee consistent arrival time of data signals. The overall latency
data, latency-insensitive protocols [27] such as the elastic circuit [28] can be used to
Since there are multiple basic blocks in an imperfect loop nest program,
each basic block could be expensive in hardware resources when basic blocks are
The dedicated datapath design for basic blocks in loops naturally generates
and simplifies the control. In the meantime, it is likely to result in a deep pipeline
efficient hardware utilization. This means that a software program with sufficient
iterations for an ALU-based pipeline may not provide enough parallelism to the
3.3.2 Flexibility-driven Controller
iteration of basic blocks executed in the dedicated pipelined datapath, and manages
finite state machines (FSMs) with simple datapath. Figure 3.1b shows a design for
controlling a single loop. Controllers for more complicated loops can be built with
control signals work at a low level and are required in every clock period for the entire
provides control signals at a higher level for each iteration of basic block execution.
building a hardware design that can handle various problem sizes by reconfiguring control signals. This is possible because we decouple the controller and the datapath in the load-store architecture. This results in partially hardened designs. The emerging
heterogeneous platforms with general purpose cores and specialized cores decoupled
closely [32][33] offer an opportunity for such an implementation. In this use case,
the overall design can be partitioned into the regular parallelism component and the
irregular and frequently changed component for hardware and software implementation, respectively. The repetitive heavy arithmetic computations and bit-level
cores without slowing down the hardware components. The software components
can additionally be reconfigured flexibly with much lower cost than the specialized
to load-store architectures are shown in three scenarios. First, not every component is executed at the same speed in the design. As shown in Figure ??, the load/store unit and the kernel datapath determine the throughput upper bound of the design.
Because in our paradigm the kernel processes vectors, it only requires a new result
from the space traverser for every few cycles, depending on the vector length to be
processed. Hence, it does not harm the performance if these spaces are traversed in slower software with a large enough vector size. Second, implementing the space
design. For example, an accelerator that solves arbitrary sizes of the same problem
can be natively built with a soft iteration space traverser and the indices calculator.
Third, softening the space traverser and providing the memory interface from the platform can minimize the states in the customized hardware, which facilitates the
framework. We show two examples in Figure 3.4. In the figures, we use dotted
hardened components.

(a) Implementing the traverser in software.

Figure 3.4a shows a design that implements the space traverser in software, with the rest in hardware. Figure 3.4b describes a softer design that
Existing ideas of parallel architectures for general processors can be applied to our
architecture in Figure 3.3 resembles a vector processor like Cray-1 [36] because it can
process one data item per cycle. For higher throughput, the baseline architecture can
equip a vectorized load/store unit along with a SIMD vectorized kernel datapath
to achieve multiples of the baseline throughput with mostly the same control, as
shown in Figure 3.5a. The vectorized accelerator core must connect to multiple
memory banks. The data shuffle circuit may be added to the vectorized kernel to
enable local communications between the vector lanes. Another form of throughput
Figure 3.5b. In this form, we duplicate the baseline accelerator core with local memory, and coordinate the multiple cores using a task scheduler. This form can
exploit the enormous local memories for a large aggregated throughput. The vector and SMP forms can be combined for better performance efficiency in hardware. Though
not presented graphically, the distributed memory architecture can also be applied
potentially to accommodate high communication cost between cores when the cores
simple parallel organizations such that high throughput can be achieved for complicated computations.

(a) A SIMD vectorized architecture. (b) A multicore architecture.
tecture implementations, one must have a systematic way to handle the complexity
in the algorithm space and the hardware space. Moreover, the interpretation of
loop nest programs shall go beyond the software programming syntax and support
embedding extra information for more efficient hardware generation. This section
explains three major challenges in hardware generation and how they can be resolved
The creation of imperfect loop nest programs with desired properties for load-store
complexity, parallelism, memory access pattern, memory utilization, regularity, etc.
search field that requires profound domain knowledge, exceeding the scope of this
past, Spiral has addressed linear transforms, numerical linear algebra operations
like the matrix-matrix multiply [6], polynomial evaluation, infinity norm, geofenc-
ing for unmanned aerial vehicles [37], the Viterbi decoder [7], polar formatting
synthetic aperture radar [5], Euler integration, statistical z-test, wavelet transforms
and JPEG2000 image compression [38], among many others. Spiral captures algo-
rithms in data flow graphs using the OL formalism, which are then automatically
(explained in Section 2.3.3) which produces imperfect loop nest programs with de-
sired properties listed in Section 3.2. As an example, Figure 3.6 shows an 8-point
nario of deep pipelines that demand more parallelism has been discussed in Section
forming algorithms for fixed parallel organizations. This means that hardware adap-
(a) A Spiral-generated FFT(8) algorithm represented in a data flow graph with the corresponding OL formula.
(b) Iterative computing FFT(8) on memory and the corresponding Σ-OL expression.
ity. The imperfect loop nests generated from Spiral are represented in Σ-OL intro-
duced in Section 2.3.2. In this section, we discuss three optimizations for hardware
Efficient Buffer Allocation
In the Σ-OL representation, the read/write buffers of basic blocks are implicit. In
the software generation flow of Spiral, the buffers are allocated when translating
Σ-OL expressions to icode. The current buffer allocation scheme is designed based
on a common assumption that main memory is cheap and the data transfer between
is reasonable that the scheme does not minimize buffer utilization. However, in
hardware generation, the local memories are scarce resources, and the data transfer
between on-chip and off-chip is performed explicitly. When processing on-chip data,
The basic idea of the buffer allocation scheme in current Spiral is allocating
separate intermediate buffers between the compute stages. The overall program is specified with an input buffer and an output buffer. When the computation is composed
of stages, intermediate buffers are allocated. The intermediate buffer serves as the
write buffer of the leading stage and the read buffer of the trailing stage. In this
way, the number of intermediate buffers is the number of stages minus one. Spiral
exploits in-place compute stages for reducing the intermediate buffers because in
this case the read buffer can be safely reused as the write buffer. In an extreme case
when all stages are in-place, the output buffer can replace the intermediate buffers
Figure 3.7. In the pseudo code, the compute stages are listed vertically with respect
to the execution order. Each stage is described with two lines: the first line identifies
the stage with a name; the second line is indented and describes the read buffer and
write buffer of the current stage, divided by a right arrow. An arbitrary compute
stage is named Compute <Id> where <Id> is a natural number. The in-place
stage must be a loop, and is named Loop inplace <Id>. The input buffer to the
sub-program is named X and the output buffer is named Y. The newly allocated
On the left, Figure 3.7a demonstrates the buffer allocation result for four
as the write buffer for this stage and the read buffer for the next stage. Additional
intermediate buffers are allocated at each loop except for Loop inplace 2 whose read
buffer can be safely reused as the write buffer. The final stage Compute 3 writes
On the right, Figure 3.7b shows an unusual situation with all in-place stages
such that buffer Y can serve as the write buffer for every stage, thus avoiding
(a) Introducing intermediate buffers for non-in-place stages. (b) Reusing the output buffer for intensive in-place stages.
ing buffers for nested compositions is automated by using the read (write) buffer of
the parent stage as the input (output) buffer. Figure 3.8 presents a pseudo code
example. In the pseudo code, the indentation depth denotes the depth of code
blocks in the imperfectly nested loop program. The compute stage of compositions
Compose 0
    Compute 2
        X -> T2
    Compute 3
        T2 -> T1
Compose 1
    Compute 4
        T1 -> T3
    Compute 5
        T3 -> Y
The reduction of buffers can be addressed in two aspects and the benefit may
First, because the deep level buffers of different stages at the outer compo-
sition level do not overlap in execution, as is the T2 and T3 shown in Figure 3.8,
they can be mapped to the same physical memory. This aspect does not require
particular properties of the program, but the savings of buffers depend on the distribution of intermediate buffers at different stages. Those programs with balanced
distribution of intermediate buffers along the compute stages benefit the most from
this technique. In another extreme, where the intermediate buffer used in one stage vastly outweighs the intermediate buffers of other stages, the benefit is negligible.
Second, the series of allocated intermediate buffers and the input and output buffers at the same composition level, as shown in Figure 3.7a, can possibly be replaced by swapping two buffers, because the stages in the series are executed sequentially. However, the swapping strategy does not always help. A counter-example is that one extremely large intermediate buffer will call for the duplication of such a large buffer, which could possibly be larger than the aggregated size of the rest of the buffers.
Hence, an effective buffer allocation scheme should consider the properties of the
program.
required for the scheme can be addressed conveniently in Σ-OL. In addition, the
language can be extended to capture the program structure of interest in the buffer
There are some types of indices that could be computed with cheaper operations
inductively [39]. Table 3.1 collects three examples from real applications, where
computations are inserted between the nested loops, which can destroy the perfect
Table 3.1: Indices that can be simplified with inductive calculation.
these computations into loops so that the perfect sub-nest structure is preserved to
Continuous Pipelining
When the data dependencies between adjacent perfect sub-nests are unknown, the
shared datapath pipeline has to be drained before starting the execution of the trail-
ing sub-nest to avoid data hazards. The dependency can be typically resolved either
dynamically in execution time with expensive circuits [40] or statically through pro-
gram analysis. The Σ-OL representation provides a powerful rule-based static anal-
ysis scheme that allows analyzing the symbolic index mapping functions to figure
out the dependencies between sub-nests so that the trailing sub-nest can start exe-
cution earlier. The speedup upper bound of this optimization depends on the ratio
of the pipeline latency over the iteration counts of the perfect sub-nest, arriving at
To natively describe the spatial hardware designs without entering into the RTL
abstraction level, we adopt icode to model the RTL modules and the connections
RTL modules. The RTL modules can be hardware operators, finite state
machines (FSMs), memory blocks, etc. A hardware operator manipulates the input
data to produce the output data in the form of digital signals. FSMs and other RTL
modules can be defined with arbitrary interfaces for flexible hardware interpretation.
The icode representation of each RTL module serves as the functional speci-
fication to an RTL code generator. The hardware operators and other RTL module
RTL modules other than hardware operators are modeled as icode types, each specified by its name and possibly some parameters, and mapped to an RTL template by the generator. The RTL module types can be instantiated with a special
operators, specified by a name and one or multiple input ports with type information. Any hardware operator always has one output, whose type is derived from the
input types. By extending the types in icode, a hardware operator can manipulate
digital signals representing distinct data types from integer to floating point, real to
identical bits exist in all inputs. The timing specification could improve the circuit
for efficient resource binding especially in FPGAs, where the macro DSP blocks
and memories are necessary for higher hardware efficiency. The mature hardware
compilation methods such as the constrained scheduling [41] and resource binding
¹A decoupled type encapsulates the data payload with valid and ready signals, which is a common protocol to build latency-insensitive hardware.
Figure 3.9: Simplifying the multiplexor logic with identical bits.
between RTL modules. The existing assign command in icode connects the output
hardware variable that models an RTL wire. Hence, the operator graph representing
erator modules in RTL. Other RTL modules are first instantiated with an explicit command and then assigned to a hardware variable, with which the connection can
3.5 Summary
This chapter introduces a high-level idea of translating imperfect loop nest programs
The key challenge is how to achieve hardware efficiency through component cus-
generation, optimization and hardware manipulation, which can be solved by ex-
Given the high level idea of flexible hardware generation, the next chapter
will explain how to systematically extend Spiral for synthesizing scalar load-store
architectures.
Chapter 4
The previous chapter has presented a large design space of algorithms and hardware
Spiral for load-store architecture generation, this chapter will focus on a more
restricted compute pattern of imperfect loop nest programs and the scalar load-
store architecture.
store Architecture
and classic algorithms of the Spiral framework. The imperfect loop nest programs
conforming to this paradigm are mapped to an elementary fully-hardened load-store architecture that loads and stores at most one data entry from the memory interface.
quirements for the general computational paradigm of imperfect loop nest programs
in the iteration space, the index patterns, and the kernel operations:
2. In the basic blocks, the memory gather and scatter operations are directed by
are identical.
ulations on a high-dimensional data cube. In the next section, we will show how to
tional problems.
load-store architecture that is fully hardened and is only able to load/store at most
one memory data word in the steady state. We refer to this design as the scalar load-store architecture.
The mapping from the multi-linear paradigm to the scalar load-store archi-
memory, perform the kernel operation, and store the results back to memory. The
kernel operations of different basic blocks are the same, as required by the multi-
linear paradigm. The parameters for each basic block are provided by the loop
paradigm, for each iteration, a data vector is gathered from the memory with multiple scalar reads, processed with the kernel implementation, then finally scattered back
paradigm of imperfect loop nest programs to the load-store architecture. The op-
timizations exploit the properties of the multi-linear paradigm for reduced resource
sis are discussed in the following subsections, with increasing requirements for the
programs.
4.2.1 Efficient Buffer Allocation
In an imperfectly nested loop, the computation is divided into several stages, thus
memory. Thus, reducing the total buffer entries can reduce memory utilization
in hardware.
In the multi-linear paradigm, the gather and scatter operations of every basic block have a static number of entries to load and store on a fixed-size data set. As
a result, the buffers required for the computations can be analyzed statically.
In the multi-linear paradigm, the buffer sizes allocated at the deep levels of each stage of the shallow level can be determined statically. Since the different stages of the shallow level do not overlap in execution, the intermediate buffers used in deep levels can be shared by other stages of the shallow level. Figure 4.2 compares the allocation
schemes in multi-level compositions between the Spiral software generator and the
proposed hardware generator. On the left side, Figure 4.2a replicates Figure 3.7b
for the current Spiral. On the right side, Figure 4.2b shows the allocation scheme
for hardware generation, where the buffer T2 used in the first level-2 composition is
reused in the second level-2 composition, removing the additional T3 in Figure 4.2a.
Practically, T2 and T3 may not be the same size, in which case the larger size determines the required buffer size in the deep levels. This strategy stays valid when more than
Compute 0                          Compute 0
  Compute 2:  X  -> T2               Compute 2:  X  -> T2
  Compute 3:  T2 -> T1               Compute 3:  T2 -> T1
Compute 1                          Compute 1
  Compute 4:  T1 -> T3               Compute 4:  T1 -> T2
  Compute 5:  T3 -> Y                Compute 5:  T2 -> Y

Figure 4.2: Buffer allocation in multi-level compositions. (a) Always allocating new buffers at deep levels (left). (b) Reusing deep level intermediate buffers (right).
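The effect of this reuse can be illustrated with a small allocation sketch (hypothetical code, not part of the generator): without reuse, every deep-level stage allocates fresh buffers; with reuse, the deep levels of all shallow-level stages share one region per buffer slot, sized by the largest request.

```python
def total_buffer_size(stage_requests, reuse):
    """stage_requests: list of lists; each inner list holds the deep-level
    buffer sizes requested by one shallow-level stage."""
    if reuse:
        # Shallow-level stages never overlap in execution, so their
        # deep-level buffers can share one region per buffer slot.
        slots = max(len(r) for r in stage_requests)
        return sum(max(r[i] for r in stage_requests if len(r) > i)
                   for i in range(slots))
    # Without reuse, every deep-level buffer is allocated separately.
    return sum(sum(r) for r in stage_requests)

# Two shallow-level stages as in Figure 4.2: each requests one deep-level
# intermediate (T2 and T3). With reuse, T3 folds into T2.
print(total_buffer_size([[8], [8]], reuse=False))  # 16
print(total_buffer_size([[8], [8]], reuse=True))   # 8
```

When the requested sizes differ, the shared region takes the larger size, matching the rule stated above.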
For allocating buffers within a composition level with reduced buffer utilization, we exploit particular program properties: uniform buffer sizes among the intermediate, output, and input buffers. For programs with these properties, swapping can ensure that fewer intermediate buffers are needed.

When the input, intermediate, and output buffers are of the same size, the swapping can be performed between the input buffer and the output buffer, without additional intermediate buffers. This means that the initial data in the input buffer will be overwritten, which is a problem in case the integrity of the original input data is required. In this case, the in-place trick used in software Spiral can help avoid both swapping and input-buffer overwriting when all stages except the first one are in-place. Nevertheless, in deeply nested loops there may be a buffer from the previous composition level that can be safely overwritten. The two situations are illustrated in Figure 4.3.
Compute 0:  X -> Y           Compute 0:        X -> Y
Compute 1:  Y -> X           Loop in place 1:  Y -> Y
Compute 2:  X -> Y           Loop in place 2:  Y -> Y

Figure 4.3: (a) Swapping between the input and output buffer (left). (b) Avoiding swapping and input buffer overwriting with in-place loops (right).
When only the intermediate and the output buffers are of the same size, swapping can be performed between one intermediate buffer and the output buffer, except in the case that the output buffer cannot participate in the swapping, where two intermediate buffers are required. Figure 4.4 lists the resulting schemes.
(a) All in-place intermediate stages:    (b) Even number of stages:
Compute 0:        X -> Y                 Compute 0:  X -> T
Loop in place 1:  Y -> Y                 Compute 1:  T -> Y
Loop in place 2:  Y -> Y                 Compute 2:  Y -> T
                                         Compute 3:  T -> Y

(c) Odd number of stages with            (d) Odd number of stages without
    in-place stage(s):                       in-place stage(s):
Compute 0:        X -> T                 Compute 0:  X -> T1
Loop in place 1:  T -> T                 Compute 1:  T1 -> T2
Compute 2:        T -> Y                 Compute 2:  T2 -> Y
Figure 4.4: Allocating buffers when output/intermediate buffers are the same size.
When only the intermediate buffers are of the same size, a single intermediate buffer is sufficient when there are only two stages of computations (shown in Figure 4.5a), or when all intermediate stages are in-place (shown in Figure 4.5b). Otherwise, the swapping is performed between two intermediate buffers.
(a) Only two stages:     (b) All in-place inter-      (c) Swapping between two
                             mediate stages:              intermediate buffers:
Compute 0:  X -> T       Compute 0:        X -> T     Compute 0:        X -> T1
Compute 1:  T -> Y       Loop in place 1:  T -> T     Compute 1:        T1 -> T2
                         Loop in place 2:  T -> T     Loop in place 2:  T2 -> T1
                         Compute 3:        T -> Y     Compute 3:        T1 -> Y
Figure 4.5: Allocating buffers when only intermediate buffers are the same size.
Since there is only one memory address space in the load-store architecture, the allocation of buffers must be realized with address offsets for each gather and scatter. These offsets depend on the position in the loop nest and thus will be captured explicitly with an extension. The next section will introduce the OL extensions that make the address offsets explicit.

A multi-linear expression is used in indexing memory entries for loading and storing, and its computation can be expensive in hardware. A common optimization replaces the multiplications with inductive accumulation, but one drawback of this optimization is breaking the loop nest structure. A multi-linear index expression has the form
$$f(j_0, j_1, \ldots, j_{n-1}) = c_0 j_0 + c_1 j_1 + \cdots + c_{n-1} j_{n-1},$$
where $c_k$, $k \in [0, n-1]$, are natural numbers and $j_k$, $k \in [0, n-1]$, are all loop variables.
Evaluating the expression directly requires n multiplications and n − 1 additions. The computational cost can be reduced with two techniques: inductive accumulation and bit manipulation. Accumulation converts the multiplications by loop variables into additions, given that the loop variable values increase by one at each iteration. Bit manipulation requires certain properties of the constant factors: each ck must be a power-of-2 number. If all ck are power-of-2 numbers and each operand of the summation does not overlap in the bit fields of the final sum, the whole computation can be replaced by bit shifts and bit-wise ORs.
Figure 4.6 shows the three approaches to calculating two size-2 multi-linear expressions, whose arithmetic costs decrease from left to right. The bit manipulation approach requires zero arithmetic operations in the loop body. In the accumulation approach, each index variable is initialized before loop execution and its value is increased by a constant value at each iteration. The bit manipulation example shows an ideal situation where all constant factors are power-of-2 numbers; thus the whole multi-linear expressions are replaced by bit shifts and bit-wise-OR operations. Otherwise, the bit manipulation technique can be only partially applied.
// Direct multiplication:
for (i=0; i<4; i++)
  for (j=0; j<2; j++)
    idx1 = 2*i + j;
    ...
  for (k=0; k<2; k++)
    idx2 = i + 4*k;
    ...

// Inductive accumulation:
idx1_tmp = 0;
idx2_tmp = 0;
for (i=0; i<4; i++)
  idx1 = idx1_tmp;
  for (j=0; j<2; j++)
    ...
    idx1 += 1;
  idx2 = idx2_tmp;
  for (k=0; k<2; k++)
    ...
    idx2 += 4;
  idx1_tmp += 2;
  idx2_tmp += 1;

// Bit manipulation:
for (i=0; i<4; i++)
  for (j=0; j<2; j++)
    idx = i<<1 | j;
    ...
  for (k=0; k<2; k++)
    idx = i | k<<2;
    ...

Figure 4.6: Three approaches to calculating the multi-linear index expressions idx1 = 2i + j and idx2 = i + 4k: direct multiplication (left), inductive accumulation (middle), bit manipulation (right).
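The equivalence of the three index-calculation approaches in Figure 4.6 can be checked with a short script (illustrative only; the function names are ours):

```python
def direct():
    """Recompute idx1 = 2i + j and idx2 = i + 4k with multiplications."""
    out = []
    for i in range(4):
        for j in range(2):
            out.append(("idx1", 2 * i + j))
        for k in range(2):
            out.append(("idx2", i + 4 * k))
    return out

def accumulated():
    """Inductive accumulation: additions only, no multiplications."""
    out = []
    idx1_tmp = idx2_tmp = 0
    for i in range(4):
        idx1 = idx1_tmp
        for j in range(2):
            out.append(("idx1", idx1))
            idx1 += 1
        idx2 = idx2_tmp
        for k in range(2):
            out.append(("idx2", idx2))
            idx2 += 4
        idx1_tmp += 2
        idx2_tmp += 1
    return out

def bit_manipulated():
    """Shift-and-OR: valid because all constant factors are powers of 2."""
    out = []
    for i in range(4):
        for j in range(2):
            out.append(("idx1", i << 1 | j))
        for k in range(2):
            out.append(("idx2", i | k << 2))
    return out

assert direct() == accumulated() == bit_manipulated()
```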
One drawback is that the inductive calculation expressions inserted between the loop nests make the program more imperfect, reversing the effort of the multi-linear paradigm in internalizing the computations into the innermost loops for easier pipelined implementations.
The next section will explain the Σ-OL extension that enables computation embed-
ding in loops.
The multi-linear paradigm imposes identical kernel operations between basic blocks, which enables a static dependency analysis between the perfect sub-nests. The result can enable the overlapped execution across the boundary of perfect sub-nests for lower execution latency, provided the hardware pipeline is implemented properly.
Dependency Analysis
We analyze the dependencies between two neighboring perfect sub-nests that access the data entries linearly in constant-size vectors with different fixed strides for reads and writes. Table 4.1 shows such access patterns of stride 1, 2, and 4 for a size-8 buffer
and size-2 kernel. Each row represents an iteration. At each iteration, two data
entries are accessed. For each stride, the left columns list the values of the relevant
loop variables. The right columns list the values of the two strided indices with the stride annotated. As we can see, for each stride, every entry of the size-8 buffer is accessed exactly once. Stride-4 accesses the entry pairs serially with the largest possible stride. Stride-2 partitions all data entries into two groups evenly and within each group accesses each pair with a stride of 2 serially, as shown by the dashed line in the table. Stride-1 partitions the entries into four groups evenly and within each group accesses each pair with unit stride. The access pattern of unit stride mirrors that of stride-2 at a finer grouping granularity.
Table 4.1: The indices for accessing a size-2 vector from a size-8 buffer with various
strides.
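The access patterns of Table 4.1 can be regenerated programmatically (an illustrative sketch; `accesses` is our own helper, not part of Spiral), confirming that every buffer entry is touched exactly once for each stride:

```python
n, k = 8, 2  # buffer size and kernel (vector) size

def accesses(stride):
    """Yield the k indices touched at each iteration for a given stride.
    The n entries split into n // (k * stride) even groups; within a
    group, consecutive iterations start at consecutive base offsets."""
    groups = n // (k * stride)
    for g in range(groups):
        for it in range(stride):
            base = g * k * stride + it
            yield [base + e * stride for e in range(k)]

for s in (1, 2, 4):
    flat = [i for vec in accesses(s) for i in vec]
    assert sorted(flat) == list(range(n))  # each entry exactly once
    print(s, list(accesses(s)))
```

For stride 4, this produces the pairs (0,4), (1,5), (2,6), (3,7); for stride 2, the pairs (0,2), (1,3), (4,6), (5,7); for unit stride, (0,1), (2,3), (4,5), (6,7), matching the table.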
To analyze the dependencies using the access pattern, we assume a general buffer size n and kernel size k. The maximal stride is n/k and the minimal stride is 1. For any stride s, the n data entries are partitioned into n/(ks) even sub-groups. Given the write stride sw and the read stride sr, analyzing the dependencies on the leading group of k · max(sw, sr) entries is sufficient because the trailing groups repeat the same pattern at shifted offsets.
We set sl and ss as the larger and smaller stride between sw and sr, respectively. The indices for the leading sub-group of the larger stride are shown in Table 4.2. Within the same iteration, the values increase by sl for each next location. At the same location of the indices, the values increase by 1 for each next iteration.
Table 4.2: The indices for larger stride.
The indices of the smaller stride covering the sub-group of the larger stride are shown in Table 4.3. It requires multiple sub-groups of the smaller stride to cover a sub-group of the larger stride, as described by the dashed lines in the table. The parameter d in the table specifies the d-th sub-group, ranging from 0 to n/(ss k) − 1. Within the same iteration, the values increase by ss for each next location. At the same location of the indices, the values increase by 1 for each next iteration.
Table 4.3: The indices for smaller stride.

Assume an execution pipeline that can overlap the iterations between the two neighboring perfect sub-nests. Let T be the write iteration, between 0 and sl − 1, whose completion can enable the read iterations to be executed in the same pipeline without data hazard. One has to guarantee that all the write indices at and after iteration T are larger than the corresponding read indices.
Table 4.4: A schedule for small write stride and large read stride.
When the write stride is smaller than the read stride, i.e., sw = ss and sr = sl, a schedule is shown in Table 4.4, where each row represents a scheduling step. On the completion of the write iteration d·ss that fills the beginning indices of a sub-group, the read iterations start to execute. As long as all write indices at write iteration d·ss are larger than all read indices at read iteration 0, the same property can be
guaranteed for the trailing iterations because the index values of the smaller stride increase at the same speed or faster than the index values of the larger stride. The condition can be captured by

$$T = d\, s_s, \qquad k\, d\, s_s > (k-1)\, s_l, \tag{4.1}$$

which is equivalent to

$$T > \frac{k-1}{k}\, s_l. \tag{4.2}$$
Table 4.5: A schedule for large write stride and small read stride.
When the write stride is larger than the read stride, i.e., sw = sl and sr = ss, the read iterations start T steps early such that all write indices of the final write iteration are larger than all read indices of the corresponding read iteration sl − 1 − T. This guarantees the same property for the previous corresponding iterations because the read indices decrease at the same speed or faster when reverse-counting the iterations. The condition can be captured by a formula

$$T > \frac{k-1}{k}\, s_l. \tag{4.4}$$
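The bound T > ((k − 1)/k) sl can be cross-checked by brute force for small configurations (a verification sketch under our own modeling of the iteration order, not the generator's analysis code):

```python
import math

def write_order(n, k, s):
    """Iteration -> set of indices accessed, for stride-s size-k vectors."""
    groups = n // (k * s)
    seq = []
    for g in range(groups):
        for it in range(s):
            base = g * k * s + it
            seq.append({base + e * s for e in range(k)})
    return seq

def safe(n, k, sw, sr, T):
    """True if starting read iteration r after write iteration T + r
    never reads an entry that has not been written yet."""
    writes, reads = write_order(n, k, sw), write_order(n, k, sr)
    done = set()
    w = 0
    for r, rd in enumerate(reads):
        while w <= min(T + r, len(writes) - 1):
            done |= writes[w]
            w += 1
        if not rd <= done:
            return False
    return True

n, k = 16, 2
for sw, sr in [(1, 8), (8, 1), (2, 4), (4, 2)]:
    sl = max(sw, sr)
    T = math.floor((k - 1) * sl / k) + 1  # smallest integer satisfying (4.2)
    assert safe(n, k, sw, sr, T)
```

The simulation confirms that an offset of T iterations between the producing and consuming sub-nests avoids read-before-write hazards for these stride pairs.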
In either case, once the required write iteration completes, it is safe to start the execution of the next perfect sub-nest in the pipeline.
Hardware Requirements
The overlapped execution across perfect sub-nests involves the switch of basic blocks in the hardware pipeline. Though the multi-linear paradigm requires identical kernel operations between basic blocks, the hardware implementations of the gather, kernel, and scatter operations for each basic block still vary when they are parameterized differently. The common implementation schemes for resource sharing between basic blocks employ multiplexors that only allow one active configuration at a time, and thus prevent overlapping. In the load-store architecture, the multiplexors can be migrated from the execution pipeline to the loop controller, such that the execution pipeline does not require reconfiguration and allows sub-nests to be overlapped.
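The migration of the multiplexors can be sketched abstractly as follows (a behavioral illustration with hypothetical parameter names, not the generated hardware): the controller selects the per-block parameters and ships them with the control token, while a single parameterized pipeline serves every basic block.

```python
# Hypothetical parameter tables for two basic blocks (gather stride,
# scatter stride); the real generator extracts these from the formulas.
params = {0: {"g_stride": 1, "s_stride": 4},
          1: {"g_stride": 4, "s_stride": 1}}

def controller(block_id, iteration):
    """Loop controller: the multiplexing happens here, not in the pipeline."""
    return {"iter": iteration, **params[block_id]}

def pipeline(token, data):
    """One parameterized datapath; it never switches implementations."""
    g = data[token["iter"] * token["g_stride"] % len(data)]
    return (token["iter"] * token["s_stride"] % len(data), g)

data = list(range(8))
# The same pipeline serves both basic blocks back to back.
out0 = [pipeline(controller(0, i), data) for i in range(4)]
out1 = [pipeline(controller(1, i), data) for i in range(4)]
print(out0, out1)
```

Because the pipeline is parameter-driven, iterations of consecutive sub-nests can be interleaved without waiting for a structural reconfiguration.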
Language Support
To encode the aforementioned static analysis results in the program, the next section
will introduce a Σ-OL language extension for capturing the perfect sub-nests with
explicit index functions and the iteration sequence number for synchronization.
To enable the optimizations introduced in the previous section for the multi-linear paradigm of imperfect loop nest programs, the Σ-OL formalism is extended. First, two new Σ-OL constructs are introduced to capture the properties essential to the optimizations. Second, several rewrite rules and compiler passes are added for transforming the programs.
We introduce two new formula constructs to Σ-OL, with the first modeling loops with embedded computations and the second capturing the perfect sub-nests with their parameters.
Loops with embedded computations offloaded from basic blocks can preserve the loop nest structure in the inductive calculation of multi-linear expressions. We write

$$\sum_{\substack{i=0 \\ \{v=v_0;\; v=f(v)\}}}^{N} A$$
for an arbitrary operator A iterating through a loop of size N using the iterator variable i, with an embedded variable v initialized to v0 and updated by v = f(v) at the end of each iteration.

The second construct captures a perfect sub-nest as a schedulable block. All its iterations are computed independently of each other. A perfect sub-nest produces data for its trailing perfect sub-nest to consume, and thus needs to synchronize with the trailing sub-nest to avoid data hazards. To accommodate the single address space of the memory system, the memory offsets for gather and scatter need to be provided for each perfect sub-nest. We capture the perfect sub-nests with a PerfNest wrapper, which denotes by fg (fs) the gather (scatter) index functions, by δ the iteration sequence number for synchronizing the trailing sub-nest, and by og (os) the gather (scatter) buffer offsets for the basic block operations. At an early stage where the offsets are unknown, they are initialized to ø.
We develop several rewrite rules and several syntax tree visitors to improve the latency and resource optimization of the input program when translating to hardware designs. For each rewrite rule, Spiral's rewrite system matches the left-hand side of a rule against a given formula and replaces the matched expression by the right-hand side of the rule. A syntax tree visitor can traverse the loop nest program more flexibly.
Identifying Perfect Sub-nests
The perfect sub-nest of the input program can be detected with two rewrite rules.
Rule (4.5) annotates an innermost loop with the perfect sub-nest wrapper. The index functions of the gather G and scatter S are copied to the parameter fields. The iteration sequence number is set to N, meaning that the final iteration of the size-N loop must complete before the trailing sub-nest starts computation. The memory offsets are not determined yet, and thus are initialized to ø.
$$\sum_{i=0}^{N} S_{f_s}\, A_i\, G_{f_g} \;\to\; \operatorname{PerfNest}^{f_s,\,f_g}_{N,\,\varnothing,\,\varnothing}\left(\sum_{\substack{i=0\\\{\}}}^{N} S_{f_s}\, A_i\, G_{f_g}\right) \tag{4.5}$$
Rule (4.6) moves the outer loop into the perfect sub-nest wrapper. Besides moving the outer loop inside the wrapper, the iteration sequence number for synchronization is updated by multiplying it by the domain of the outer loop so that the new parameter still points to the final iteration of the updated sub-nest. The other parameters remain unchanged.
$$\sum_{\substack{i_k=0\\\{\}}}^{N_k} \operatorname{PerfNest}^{f_s,\,f_g}_{\delta,\,o_s,\,o_g}\left(\sum_{\substack{i_j=0\\\{\}}}^{N_j} A_{i_j}\right) \;\to\; \operatorname{PerfNest}^{f_s,\,f_g}_{N_k\delta,\,o_s,\,o_g}\left(\sum_{\substack{i_k=0\\\{\}}}^{N_k} \sum_{\substack{i_j=0\\\{\}}}^{N_j} A_{i_j}\right) \tag{4.6}$$
After all perfect sub-nests have been completely detected, the residual canon-
ical loops are converted to loops supporting embedded computations using Rule
(4.7)
$$\sum_{i=0}^{N} A_i \;\to\; \sum_{\substack{i=0\\\{\}}}^{N} A_i \tag{4.7}$$
The results of the static dependency analysis in Section 4.2.3 can be added to the PerfNest parameters to allow the early start of iterations from the trailing perfect sub-nest in the pipeline when possible. Rule (4.8) first captures the required conditions in the program through the h functions of the producer scatter and the consumer gather, then adds the iteration sequence number for synchronization to the PerfNest parameters.
With the loop symbol with embedded computations, the multi-linear expressions in the basic blocks can be computed inductively without breaking the loop nest structure. The program transformations are performed in two steps: the first step offloads the multi-linear functions from the basic blocks to the innermost loop; the second step propagates the inductive calculation steps to each relevant loop level.

The first step of basic block computation offloading is specified in Rule (4.9). The multi-linear expression is moved into the embedded computation of the innermost loop, after which the basic block references the result via the embedded variable vk. The variable is initialized to the multi-linear expression without the term containing the innermost loop variable, and is incremented by the constant factor of the removed term after executing each iteration. The rule is registered with Spiral's rewrite system so that a simple pattern match can invoke it.
$$\sum_{\substack{i_k=0\\\{\}}}^{N} A_{(c_k i_k + \cdots)} \;\to\; \sum_{\substack{i_k=0\\\{v_k=(\cdots);\; v_k \mathrel{+}= c_k\}}}^{N} A_{v_k} \tag{4.9}$$
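The semantics of Rule (4.9) can be validated directly: initializing vk to the expression without the ik term and adding ck after every iteration reproduces ck·ik + (···) at each iteration (an illustrative sketch):

```python
def lhs(N, ck, rest):
    """Left-hand side: recompute the multi-linear expression each iteration."""
    return [ck * ik + rest for ik in range(N)]

def rhs(N, ck, rest):
    """Right-hand side: embedded computation v_k = (...); v_k += c_k."""
    out, vk = [], rest
    for _ in range(N):
        out.append(vk)
        vk += ck
    return out

assert lhs(8, 3, 5) == rhs(8, 3, 5)
```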
The second step propagates the initialization of the embedded variables to the outer loops. The principle is similar to Rule (4.9). Rule (4.10) applies to the perfect sub-nest inside the PerfNest wrapper. Rules (4.11) and (4.12) handle the imperfect loop nest composed of multiple perfect sub-nests and imperfect sub-nests; the propagation terminates when the residual initialization expression is reduced to the value zero.
$$\sum_{\substack{i_j=0\\\{\}}}^{N_j} \sum_{\substack{i_k=0\\\{v_k=(c_j i_j + \cdots);\; v_k \mathrel{+}= c_k\}}}^{N_k} A_{v_k} \;\to\; \sum_{\substack{i_j=0\\\{v_j=(\cdots);\; v_j \mathrel{+}= c_j\}}}^{N_j} \sum_{\substack{i_k=0\\\{v_k=v_j;\; v_k \mathrel{+}= c_k\}}}^{N_k} A_{v_k} \tag{4.10}$$
$$\sum_{\substack{i_j=0\\\{\}}}^{N_j} \operatorname{PerfNest}^{f_s,\,f_g}_{\delta,\,o_s,\,o_g}\left(\cdots \sum_{\substack{i_k=0\\\{v_k=(c_j i_j + \cdots);\; v_k \mathrel{+}= c_k\}}}^{N_k} A_{v_k} \cdots\right) \;\to\; \sum_{\substack{i_j=0\\\{v_j=(\cdots);\; v_j \mathrel{+}= c_j\}}}^{N_j} \operatorname{PerfNest}^{f_s,\,f_g}_{\delta,\,o_s,\,o_g}\left(\cdots \sum_{\substack{i_k=0\\\{v_k=v_j;\; v_k \mathrel{+}= c_k\}}}^{N_k} A_{v_k} \cdots\right) \tag{4.11}$$
$$\sum_{\substack{i_j=0\\\{\}}}^{N_j} \cdots \sum_{\substack{i_k=0\\\{v_k=(c_j i_j + \cdots);\; v_k \mathrel{+}= c_k\}}}^{N_k} A_{v_k} \cdots \;\to\; \sum_{\substack{i_j=0\\\{v_j=(\cdots);\; v_j \mathrel{+}= c_j\}}}^{N_j} \cdots \sum_{\substack{i_k=0\\\{v_k=v_j;\; v_k \mathrel{+}= c_k\}}}^{N_k} A_{v_k} \cdots \tag{4.12}$$
The applications of Rules (4.9-4.12) can remove all multiplicative operations. The result can be further optimized when the constant factors of the multi-linear expressions are power-of-2 numbers. In this case, the mathematical identity a ∗ 2^q = a<<q allows us to calculate the terms or the entire multi-linear expression with cheap binary bit mapping operations. This is also performed in two steps: the first step converts the accumulations to bit shift operations in the outer-most loop; the second step propagates the bit operations into the inner loops.
$$\sum_{\substack{i=0\\\{v=0;\; v \mathrel{+}= 2^q\}}}^{N} A_{i,v} \;\to\; \sum_{\substack{i=0\\\{v=i \ll q\}}}^{N} A_{i,v} \tag{4.13}$$
Rules (4.14-4.16) propagate the bit operations inward to the perfect and imperfect sub-nests. The conversion is valid when the bit shift of the inner loop does not overlap with the bit fields manipulated by the outer loop, i.e., [p, p + log2 Ni) ∩ [q, q + log2 Nj) = ø. In this case, the embedded calculation of the outer loop is removed and merged into the shift-and-OR expression of the inner loop.
$$\sum_{\substack{i=0\\\{v_i=i \ll p\}}}^{N_i} \sum_{\substack{j=0\\\{v_j=v_i;\; v_j \mathrel{+}= 2^q\}}}^{N_j} A_i \;\to\; \sum_{\substack{i=0\\\{\}}}^{N_i} \sum_{\substack{j=0\\\{v_j=i \ll p \,\vert\, j \ll q\}}}^{N_j} A_i, \tag{4.14}$$
$$\sum_{\substack{i=0\\\{v_i=i \ll p\}}}^{N_i} \operatorname{PerfNest}^{f_s,\,f_g}_{\delta,\,o_s,\,o_g}\left(\cdots \sum_{\substack{j=0\\\{v_j=v_i;\; v_j \mathrel{+}= 2^q\}}}^{N_j} A_i \cdots\right) \;\to\; \sum_{\substack{i=0\\\{\}}}^{N_i} \operatorname{PerfNest}^{f_s,\,f_g}_{\delta,\,o_s,\,o_g}\left(\cdots \sum_{\substack{j=0\\\{v_j=i \ll p \,\vert\, j \ll q\}}}^{N_j} A_i \cdots\right), \tag{4.15}$$
$$\sum_{\substack{i=0\\\{v_i=i \ll p\}}}^{N_i} \cdots \sum_{\substack{j=0\\\{v_j=v_i;\; v_j \mathrel{+}= 2^q\}}}^{N_j} A_i \cdots \;\to\; \sum_{\substack{i=0\\\{\}}}^{N_i} \cdots \sum_{\substack{j=0\\\{v_j=i \ll p \,\vert\, j \ll q\}}}^{N_j} A_i \cdots, \tag{4.16}$$
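The validity condition of Rules (4.14-4.16) can be checked exhaustively for small loop domains (an illustrative sketch): when the bit fields [p, p + log2 Ni) and [q, q + log2 Nj) are disjoint, shift-and-OR equals the multi-linear sum.

```python
from math import log2

def fields_disjoint(p, Ni, q, Nj):
    """Check [p, p + log2 Ni) and [q, q + log2 Nj) for overlap."""
    fi = set(range(p, p + int(log2(Ni))))
    fj = set(range(q, q + int(log2(Nj))))
    return not (fi & fj)

Ni, Nj, p, q = 4, 8, 3, 0
assert fields_disjoint(p, Ni, q, Nj)
for i in range(Ni):
    for j in range(Nj):
        # Disjoint bit fields: OR is carry-free, so it equals the sum.
        assert (i << p | j << q) == i * 2**p + j * 2**q
```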
The three strategies for reduced buffer allocation are implemented in a Σ-OL syntax tree visitor for assigning gather and scatter address offsets to all perfect sub-nests. The first perfect sub-nest gathers from offset zero. The subsequent offsets depend on the buffer sizes and on whether each basic block is an in-place computation or not. The visitor maintains the most recent offset and the buffer size so that it can assign a correct offset to the other perfect sub-nests in the execution order. The final buffer offset and the total memory size are recorded for allocating the memory space.
Collecting Basic Block Parameters
As all basic blocks are required to perform the same gather, scatter and kernel operations, a syntax tree visitor collects these parameters so that the hardware generator can select the correct parameters for each basic block. This process requires the pre-registration of each basic block operator and the PerfNest construct: a pattern-match shape and a function to extract parameters.
It traverses the loop nest program in execution order. When a PerfNest construct is visited, the read and write offsets are extracted. The parameters extracted from each basic block operator depend on the individually registered function for parameter extraction. The parameters of each operator across all basic blocks form a two-dimensional array.
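The collection pass can be sketched as follows (names and data structures are illustrative, not the actual Spiral implementation): each operator kind registers an extractor, and visiting the sub-nests in execution order yields a two-dimensional parameter array with one row per basic block and one column per operator.

```python
# Registered extractors, keyed by operator kind (hypothetical names).
extractors = {
    "gather":  lambda op: op["index_fn"],
    "kernel":  lambda op: op["size"],
    "scatter": lambda op: op["index_fn"],
}

def collect(perf_nests):
    """One row of parameters per basic block, one column per operator."""
    table = []
    for nest in perf_nests:  # visited in execution order
        row = [extractors[op["kind"]](op) for op in nest["ops"]]
        table.append(row)
    return table

nests = [
    {"ops": [{"kind": "gather", "index_fn": "h(i,4)"},
             {"kind": "kernel", "size": 2},
             {"kind": "scatter", "index_fn": "h(i,4)"}]},
    {"ops": [{"kind": "gather", "index_fn": "h(2i,1)"},
             {"kind": "kernel", "size": 2},
             {"kind": "scatter", "index_fn": "h(2i,1)"}]},
]
print(collect(nests))  # 2 x 3 parameter array
```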
The idea of our hardware generation backend is to decompose the hardware construction problem into selecting building blocks and connecting them. We have created the initial set of hardware building blocks in the icode library. Our backend can currently synthesize the hardware loop nest controllers and the feed-forward pipelined datapath in the Chisel RTL language from icode. As shown in Figure 4.7 for an 8-point WHT design, the loop nest controller is constructed as coordinating finite state machines of single loop controllers and loop composers. The datapath is constructed by connecting streaming hardware operators. The framework is built for extensibility; further extensions of the building blocks and backends are possible.
Figure 4.7: Implementing an 8-point WHT in hardware.
Different interface types connect the building blocks, as shown in Figure 4.8. In the loop nest controller, FSMs are connected with raw signals (Figure 4.8a). In the datapath, the hardware operators are connected either with a raw interface (Figure 4.8b) or with a decoupled interface that associates the ready / valid signals to the raw signals (Figure 4.8c). The decoupled interface decouples the timing of producers and consumers. We support the raw interface and the decoupled interface simultaneously in our backend to enable the coexistence of both styles.
Figure 4.8: The interface types: (a) the raw interface in the loop nest controller; (b) the raw interface in the datapath; (c) the decoupled interface in the datapath.
In our extensions to the icode library, except for a limited set of extensions dedicated to loop nest controllers, the other extensions are mainly used in datapath design.
Controller Extensions
To describe the coordinating FSM design of the loop nest controllers, we have added three FSM types, two commands and a special operator, as listed in Table 4.6. The three FSM types model the single loop, the loop composer, and the perfect sub-nest wrapper, respectively, whose I/O ports are shown in Figure 4.9. The
I/O ports of the instantiated FSM modules are connected using the new commands
loop ctrl connect and loop io connect. The former command connects the ch start /
ch complete signals of an FSM to the start /complete signals of its children FSMs.
The latter command connects the loop variables and companion variables between two FSMs for the inductive calculation expressions. A special operator trigger bb models an FSM that takes in the start signals of all perfect sub-nests and produces a basic block activation signal.
Table 4.6: The icode extensions dedicated for loop nest controllers.
command   loop ctrl connect(src mod, dst mod)   Connect the I/O ports of control signals between loop controller FSMs.
Figure 4.9: The I/O ports of the FSM types: (a) the loop FSM; (b) the loop composer FSM; (c) the perfect sub-nest FSM.
Datapath Extensions
To describe the uni-rate streaming datapath, we have extended the icode library. Our extension includes types, commands, operators and location descriptors. The new operators are the key constructs of our extension, while the new types, commands and location descriptors, as shown in Table 4.7, support the new operators.
Three new types are introduced to capture the arbitrary precision type, the decoupled token type, and an RTL module generation type. The TUIntAP(nbits,
max) type specifies an unsigned integer type with a given bit width and the maximal
possible value. The TDecoupled(t) encapsulates a normal icode type and associates
the ready and valid flags to the type payload. Operators processing data of these
new types can support bit-precise input/output and the decoupled interface natively. The TRTLModGen type constructs an RTL module with its input, output and internal design described in icode, which is more general than the fixed set of operators.

Two new commands of our extension support the hardware semantics. The instantiate command instantiates any type of RTL module, including the loop controller FSM types and the TRTLModGen type. The define command is in charge of the definition of types and operators constructed on demand.
The new location descriptors specify where variables reside in the hardware environments.
93
Table 4.7: The icode extensions of types, commands and location descriptors.
type   TRTLModGen(name, input, output, code)   Construct an RTL module with icode.
Compared to the current icode operators for software generation, the hardware operators must capture timing. In hardware designs with icode operators, the delay in clock cycles between the output data and the input data of each operator must be specified so that buffers can be inserted in proper locations to assure the functional correctness and full throughput. Furthermore, when the decoupled interface is employed in operators, the token scaling ratio, a fractional ratio of the number of tokens produced over each token consumed, must be specified. The current backend supports static delays and token scaling ratios. More dynamism can be supported in the future.
The common integer arithmetic, logical and relational operators already in-
side the icode library are used as hardware operators after setting their delay to
zero and the token scaling ratio to one. We further extend the icode operators in
three groups: token regulation operators, memory operators and other customized
operators.
The token regulation operators listed in Table 4.8 assure the consistent token production and consumption. The tk pack and tk unpack operators bridge the gap between the scalar memory words and the vector data processed in Σ-OL kernels. The tk range operator can drive iterative computations with its output tokens in the absence of control flow commands. The tk buf operator works like a register between non-decoupled interface operators. The tk rpt operator duplicates the input token multiple times to a single consumer. The tk fork operator duplicates the input token to multiple consumers.
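The token regulation semantics can be modeled behaviorally on plain lists (a sketch only; the real operators are streaming hardware):

```python
def tk_pack(stream, n):
    """Group n scalar memory words into one vector token."""
    assert len(stream) % n == 0
    return [tuple(stream[i:i + n]) for i in range(0, len(stream), n)]

def tk_unpack(tokens):
    """Split vector tokens back into scalar words."""
    return [w for t in tokens for w in t]

def tk_rpt(token, times):
    """Duplicate one input token several times to a single consumer."""
    return [token] * times

words = [0, 1, 2, 3, 4, 5, 6, 7]
assert tk_unpack(tk_pack(words, 2)) == words  # pack/unpack are inverses
assert tk_rpt("twiddle", 3) == ["twiddle"] * 3
```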
Table 4.8: The icode extensions of token regulation operators.
The memory operators listed in Table 4.9 need to bind with a specific platform that provides a memory interface. It is worth noting that the mem read and mem write operators are typed operators such that the operators can adapt to the bit width change of address and data. In contrast to a software write operator, our mem write for hardware reports the completion of memory stores for a target iteration. The lktable operator supports reading constants from an on-chip ROM.
Table 4.9: The icode extensions of memory operators.
lktable(entries, loc)   delay 1   ratio 1   Read an entry from the ROM with index loc.
Other customized operators listed in Table 4.10 are created to facilitate the construction of hardware designs. The mux operator models the common multiplexors in digital circuits. The accum operator exploits the token flexibility to perform accumulations without control flow commands. The bit maps operator can replace some expensive arithmetic operations for power-of-2 numbers. The add sub operator performs a combined addition and subtraction, as needed by the F2 butterfly, and is typically not used manually. Note that the delay of any arithmetic operator depends on the platform and the type of data to be processed by the operator. The comb blk operator can encapsulate an arbitrary pipelined raw-data-typed datapath behind a decoupled interface.
Table 4.10: The icode extensions of other customized operators.
mux(vsel, [v1, .., vn])   delay 0   ratio 1   Select an output from the input variables.
bit maps(vars, bfields)   delay 0   ratio 1   Map variable fields to the bit field locations.
This work provides two independent synthesis flows for the controller and the datapath.
Controller Synthesis
We obtain the spatial design of loop nest controllers in two steps. First, we translate the Σ-OL loop nest to a coordinating FSM design in icode using the rules described in Table 4.11. Second, the icode design is unparsed to RTL code.

The translation between the loop nest of Σ-OL and that of icode is performed with a hierarchical visitor of the syntax tree. The visitor maintains a list of generated icode. When a node is visited, the icode for the corresponding node is added to the list. Then the children nodes are visited. Afterwards, the control connections and data connections between the parent node and the children nodes are generated.
Table 4.11: Synthesize the coordinating FSMs in icode for loop nest controllers.
Σ-OL icode
The controller design is not completed until the accessory components are generated. First, to activate the right basic block in execution, we use the trigger bb operator to monitor the start signal of each perfect sub-nest FSM, whose output indicates which basic block is being executed. Second, all variables captured in the basic block unifying process as parameters will be multiplexed to become the payload component of the control token, selected by the basic block activation signal. Finally, simple gate logic and wire connections assure that the control token can be constructed correctly.
Datapath Synthesis
The datapath design in icode is synthesized from Σ-OL expressions in two steps. First, each Σ-OL construct is mapped to the corresponding icode expressions. Second, several compiler passes are employed to transform the icode into a form with complete hardware semantics.
Table 4.12 shows the current rewrite rules we have implemented, where the gather and scatter operators, the diagonal operator, and the size-2 DFT operations are handled. For arbitrary OL kernels, an ol2icode function translates the OL specification to icode that processes data vectors such that a comb blk operator can be constructed. Although we have not implemented further specialized operators, one can use the extensible operations to achieve the same effect.
Σ-OL   icode

F2     assign(in, tk pack(x,2))
       assign(y, addsub(nth(in,0), nth(in,1), tk range(in,2)=0))
Several transformations are performed before the icode is unparsed to the actual RTL designs. First, as illustrated in Figure 4.10, the icode expressions using the decoupled interface require the variables referenced multiple times to be duplicated with the tk fork operator. Second, as illustrated in Figure 4.11, the icode needs to be inspected for the case that operators process tokens at different rates, in which the tk rpt operator needs to be inserted. Finally, the icode expressions with determined total delays require inserting elastic buffers.
(a) Before transformation.
(b) After transformation.
The buffer-insertion transformation assures the functional correctness and the full throughput of the pipeline.
The algorithms for performing these transformations work similarly. The icode is traversed in data flow order. First, the input variables are assigned the initial metric of interest. We then inspect each assignment in data flow order, examining the metric of its input variables and determining whether the helper operators need to be inserted or not. The algorithm completes when the last assignment of the data flow has been processed.
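The fork-insertion case can be sketched on a toy assignment list (illustrative; the real pass runs on icode and respects definition order): variables read more than once get a fork, and later reads are renamed to fork outputs.

```python
from collections import Counter

def insert_forks(assigns):
    """assigns: list of (dst, srcs) tuples. Returns a new list where every
    variable read more than once is forked and its reads are renamed."""
    uses = Counter(s for _, srcs in assigns for s in srcs)
    taken = Counter()
    out = []
    for var, n in uses.items():
        if n > 1:  # one fork output per consumer
            out.append((tuple(f"{var}_f{i}" for i in range(n)),
                        (f"tk_fork({var})",)))
    for dst, srcs in assigns:
        renamed = []
        for s in srcs:
            if uses[s] > 1:
                renamed.append(f"{s}_f{taken[s]}")
                taken[s] += 1
            else:
                renamed.append(s)
        out.append((dst, tuple(renamed)))
    return out

prog = [("y", ("x",)), ("z", ("x", "w"))]  # x is consumed twice
print(insert_forks(prog))
```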
To translate our hardware designs in icode to Chisel RTL code, we have created
a Chisel RTL code library that provides the mapping targets of each icode type,
operator, command and location descriptor. The process is straightforward with a
recursive-descent translator.
In this work, a few algorithms supported in the Spiral framework are selected for evaluation. We choose the WHT, DFT and bitonic sorting problems because they inherently match or can be slightly modified to match the multi-linear paradigm. This section discusses the decision of algorithms for scalar load-store architectures. For each algorithm, the breakdown rules expand the specification into OL formulas, which are translated to Σ-OL expressions using Spiral, and then the loop fusion optimization [2] is performed.
There are numerous algorithms for calculating DFTs, with various data flow geometries that suit different architectures. For a scalar load-store architecture with a single-level dual-ported memory, a reasonable goal is to maximize the utilization rate of the scalar pipeline. As a result, an algorithm that exposes the most parallelism across stages is preferred.
Analysis
Here, we discuss the considerations among three famous FFT algorithms, i.e., the recursive Cooley-Tukey FFT, the iterative Cooley-Tukey FFT and the Pease FFT. Though the illustrative data flow graphs in Figure 4.12 are for a size-8 FFT, the pattern applies to general radix-r sizes. The three algorithms can be compared by their data flow graphs.
(a) Recursive Cooley-Tukey FFT. (b) Iterative Cooley-Tukey FFT.
Figure 4.12: Classic algorithmic candidates in data flow graphs for FFT (N=8).
All of them rely on F2 blocks for calculating the basic butterfly operations. A shared property of all three data flow graphs is the identical set of butterfly operations; however, the order of computation is different, as emphasized by the shading. It can be seen that, in the recursive algorithm, the butterfly operations whose inputs are ready are not computed as early as possible. As a result, the recursive algorithm is not the best for scalar load-store architectures.
In contrast, both the iterative Cooley-Tukey FFT shown in Figure 4.12b and the Pease FFT shown in Figure 4.12c decompose the computation into several data-dependent stages, thus exposing more parallelism than the recursive algorithm. The Pease FFT is popular in hardware designs because it provides an identical geometry in each stage with parallel butterfly operations and stride permutations. However, the automated generation of load-store architectures in this work prefers the iterative Cooley-Tukey FFT because it offers more opportunities for across-stage execution overlapping (see Section 4.2.3) by not enforcing the same stride permutation in every stage.
The iterative Cooley-Tukey FFT is defined in Formula 4.17. After applying the algorithm to the eight-point DFT, and given the definition of the two-point DFT in Formula 4.18, the obtained algorithm is shown in Formula 4.19.
$$\mathrm{DFT}_{r^\ell} \to \left(\prod_{i=0}^{\ell-1} D_i\, \left(I_{r^i} \otimes \mathrm{DFT}_r \otimes I_{r^{\ell-i-1}}\right)\right) R^{r^\ell}_{r} \tag{4.17}$$

$$\mathrm{DFT}_2 = F_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \tag{4.18}$$

$$\mathrm{DFT}_8 = \left(\sum_{i_1=0}^{4} S_{h_{i_1,4}}\, D_{f_3}\, F_2\, G_{h_{i_1,4}}\right) \left(\sum_{i_2=0}^{2} \sum_{i_3=0}^{2} S_{h_{i_3+4i_2,2}}\, D_{f_2}\, F_2\, G_{h_{i_3+4i_2,2}}\right) \left(\sum_{i_4=0}^{2} \sum_{i_5=0}^{2} S_{h_{2i_5+4i_4,1}}\, D_{f_1}\, F_2\, G_{h_{2i_5+i_4,4}}\right) \tag{4.19}$$
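The stage structure of Formula 4.17 can be sanity-checked numerically against the DFT definition with a generic radix-2 iterative Cooley-Tukey reference model (our own sketch, not the generated hardware): bit-reverse the input (the R operator), then apply log2 N butterfly stages with twiddle factors (the D operators).

```python
import cmath

def naive_dft(x):
    """DFT by definition, for reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * i * j / n)
                for j in range(n)) for i in range(n)]

def iterative_fft(x):
    n = len(x)
    bits = n.bit_length() - 1
    # Bit-reversal permutation (the R operator in Formula 4.17).
    y = [x[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)]
    # log2(n) stages of size-2 butterflies with twiddle factors.
    size = 2
    while size <= n:
        half = size // 2
        w = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            for e in range(half):
                t = (w ** e) * y[start + e + half]
                y[start + e], y[start + e + half] = y[start + e] + t, y[start + e] - t
        size *= 2
    return y

x = [complex(v) for v in range(8)]
err = max(abs(a - b) for a, b in zip(iterative_fft(x), naive_dft(x)))
assert err < 1e-9
```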
Though not for performance, the classic recursive Cooley-Tukey FFT is also implemented in this work to study the effect of pattern-based loop optimizations. The classic recursive Cooley-Tukey FFT algorithm is defined in Formula 4.20. After recursively applying the algorithm to the eight-point DFT, the obtained algorithm does not conform to the multi-linear paradigm because the left-most basic block is missing a diagonal operator. As a result, a special diagonal operator D0, equal to the identity operator in that it multiplies all input data by 1, is inserted into the left-most basic block; its lookup function points to the first element of the twiddle factor lookup table. The resulting algorithm is shown in Formula 4.21.
$$\mathrm{DFT}_{mk} \to \left(\mathrm{DFT}_k \otimes I_m\right) T^{mk}_{m} \left(I_k \otimes \mathrm{DFT}_m\right) L^{mk}_{k} \tag{4.20}$$

$$\mathrm{DFT}_8 = \left(\sum_{i=0}^{2} \left(\sum_{k=0}^{2} S_{h_{i+2k,4}}\, D_0\, F_2\, G_{h_{k,2}}\right) \left(\sum_{l=0}^{2} S_{h_{2l,1}}\, D_{f_2}\, F_2\, G_{h_{i+2l,4}}\right)\right) \left(\sum_{j=0}^{4} S_{h_{2j,1}}\, D_{f_1}\, F_2\, G_{h_{j,4}}\right) \tag{4.21}$$
By default, the diagonal operator that multiplies the twiddle factors into the input data will be implemented as a hardware block using an internal lookup table. However, this will limit the feasible problem sizes: in hardware designs where a main memory is connected to the load-store architecture, the large capacity of the main memory should store the large twiddle factor tables instead. On the other hand, using on-chip ROMs for twiddle factors can save the precious memory bandwidth. Hence, both methods are supported in this work.
The on-chip ROM-based method can fit larger problem sizes by compressing the twiddle factors, leveraging domain knowledge. The twiddle factors by definition are the N-th roots of unity. The left side of Figure 4.13 shows the definition of the k-th entry of the twiddle factors for an n-point FFT as ω_n^k. It is a complex number generated by sinusoids. The right side of Figure 4.13 shows a root of unity in the complex plane. One well-known trick is that, by leveraging the symmetry of sinusoids, only an eighth of the original table needs to be stored in the lookup table.
In this way, the other part of the table can be re-directed to the stored portion of
tors. Figure 4.14 shows the derivation process of twiddle table decomposition from
between entries of the two sub-tables. In this way, the accuracy of the final twid-
dle factors is tunable by specifying the precision of the multiplier. When applying
this technique, much larger problem sizes can be supported with limited capacity of
on-chip ROMs.
Figure 4.14: Decomposing size-m*n table to size-n table and size-m table.
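The decomposition in Figure 4.14 follows from splitting the table index k as k = a*n + b, so that the twiddle factor factors into a coarse size-m entry and a fine size-n entry. A small Python check, with hypothetical table sizes m = 16 and n = 64, verifies this:

```python
import cmath

def w(num, den):
    # twiddle factor: exp(-2*pi*i * num/den)
    return cmath.exp(-2j * cmath.pi * num / den)

m, n = 16, 64  # hypothetical sub-table sizes
full   = [w(k, m * n) for k in range(m * n)]   # original size-m*n table
coarse = [w(a, m)     for a in range(m)]       # size-m table: m-th roots
fine   = [w(b, m * n) for b in range(n)]       # size-n table: fine steps

for k in range(m * n):
    a, b = divmod(k, n)                        # k = a*n + b
    # one complex multiply reconstructs the original entry
    assert abs(coarse[a] * fine[b] - full[k]) < 1e-12
```

The single complex multiplier in the reconstruction is the tunable-precision multiplier mentioned above.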
If the twiddle factor lookup table size still exceeds the ROM capacity, a general solution that loads the table entries from the main memory is available, as shown in Figure 4.15. In this design, the diagonal operator shares the memory read interface with the load unit through an arbiter module. The arbiter grants the right of access for each request and is expected to be configured to efficiently utilize the memory bandwidth. However, such a design will throttle the speed of data loading and in turn limit the speed of data storing, even though the store unit has a dedicated memory write interface.

Figure 4.15: A general solution that loads twiddle factors from main memory.
For WHT and sorting, I temporarily implemented the recursive algorithms, without further effort to identify, among numerous candidates, the proper algorithms that best saturate the architecture; such candidates can be found in [42].

After recursively applying the breakdown rule to the eight-point WHT, given the definition of the two-point WHT in Formula 4.23, one obtained algorithm is shown in Formula 4.24. The multi-linear pattern of load and store is represented by the multi-linear functions of the base of each h function of gather and scatter. The kernel operations are all F2.
$$\mathrm{WHT}_2 = F_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \qquad (4.23)$$

$$\mathrm{WHT}_8 = \left( \sum_{i=0}^{1} \left( \sum_{k=0}^{1} \mathrm{Sh}_{i+2k,4}\, F_2\, \mathrm{Gh}_{k,2} \right) \left( \sum_{l=0}^{1} \mathrm{Sh}_{2l,1}\, F_2\, \mathrm{Gh}_{i+4l,2} \right) \right) \left( \sum_{j=0}^{3} \mathrm{Sh}_{2j,1}\, F_2\, \mathrm{Gh}_{2j,1} \right) \qquad (4.24)$$
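Because every kernel in (4.24) is the same F2 and no twiddle diagonals appear, the WHT reduces to log2(N) stages of add/subtract butterflies. The following Python sketch (illustrative, not the generated design) verifies the staged evaluation against the Kronecker-power definition of the transform:

```python
F2 = [[1, 1], [1, -1]]

def kron(A, B):
    # Kronecker product of two matrices given as lists of rows
    return [[a * b for a in arow for b in brow] for arow in A for brow in B]

def fast_wht(x):
    # log2(n) in-place stages of the F_2 kernel (cf. the Sh/Gh sums in (4.24))
    x = list(x)
    n, span = len(x), 1
    while span < n:
        for base in range(0, n, 2 * span):
            for j in range(base, base + span):
                a, b = x[j], x[j + span]
                x[j], x[j + span] = a + b, a - b   # F_2 butterfly
        span *= 2
    return x

# reference: WHT_8 as the threefold Kronecker power of F_2
H = F2
for _ in range(2):
    H = kron(F2, H)

x = list(range(8))
reference = [sum(H[r][c] * x[c] for c in range(8)) for r in range(8)]
assert fast_wht(x) == reference
```

Unlike the FFT, no bit-reversal permutation is required, since the tensor factors of the WHT commute.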
Bitonic sorter. The final algorithm studied in this work solves the problem of obtaining a sorted list of data: the bitonic sorter. In contrast to its name, a bitonic sort algorithm divides the data list into two sub-lists, sorts them in opposite directions, then merges the two sorted sub-lists. The algorithm defined in the Spiral software flow hardcodes the direction of the sub-problems and utilizes a direct sum operator to combine the two sub-problems, which does not conform to the multi-linear paradigm. Hence, we define a new OL construct χΘn,d of sorting that adds the sorting direction as a parameter d, with ascending sorting denoted by the value true and descending sorting by the value false. Then the same algorithm is defined with the new constructs in (4.25) - (4.28).
$$\chi\Theta_{n,d} \to M_{n,d} \left( I_2 \otimes_{l} \chi\Theta_{n/2,\, l=0} \right) \qquad (4.25)$$

$$M_{n,d} = \left( I_2 \otimes M_{n/2,d} \right) \left( \chi\Theta_{2,d} \otimes I_{n/2} \right) \qquad (4.26)$$

$$M_{1,d} = 1 \qquad (4.27)$$

$$\chi\Theta_{2,d} = K_d : \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} \mapsto \begin{bmatrix} \min(x_0, x_1) \text{ if } d \text{ else } \max(x_0, x_1) \\ \max(x_0, x_1) \text{ if } d \text{ else } \min(x_0, x_1) \end{bmatrix} \qquad (4.28)$$
After applying the breakdown rules to an eight-element data list and providing the building blocks of the sorter in Formula 4.28, an ascending eight-point sorter is obtained in Formula 4.29.
$$\chi\Theta_{8,\mathrm{true}} = \left( \sum_{i_3=0}^{1} \sum_{i_5=0}^{1} \sum_{i_6=0}^{1} \mathrm{Sh}_{4i_3+2i_5,1}\, K_{\mathrm{true}}\, \mathrm{Sh}_{i_6,2}\, K_{\mathrm{true}}\, \mathrm{Gh}_{i_6+4i_3,2} \right) \left( \sum_{i_4=0}^{3} \mathrm{Sh}_{i_4,1}\, K_{\mathrm{true}}\, \mathrm{Gh}_{i_4,1} \right) \left( \sum_{i_1=0}^{1} \sum_{i_7=0}^{1} \sum_{i_8=0}^{1} \sum_{i_2=0}^{1} \mathrm{Sh}_{2i_7+4i_1,1}\, K_{i_1=0}\, \mathrm{Gh}_{2i_7,1}\, \mathrm{Sh}_{i_8,2}\, K_{i_1=0}\, \mathrm{Gh}_{i_8,2}\, \mathrm{Sh}_{2i_2,1}\, K_{i_2=0}\, \mathrm{Gh}_{4i_1+2i_2,1} \right) \qquad (4.29)$$
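The recursion (4.25) - (4.28) can be mirrored directly in Python: sort the two halves in opposite directions (the l-indexed tensor in (4.25)), then merge with the K_d compare-exchange blocks of (4.28). This sketch is for illustration only and does not reflect the generated hardware:

```python
def compare_exchange(x0, x1, asc):
    # K_d building block of (4.28): (min, max) when ascending, else (max, min)
    return (min(x0, x1), max(x0, x1)) if asc else (max(x0, x1), min(x0, x1))

def bitonic_merge(data, asc):
    # M_{n,d} = (I_2 (x) M_{n/2,d}) (K_d (x) I_{n/2}), cf. (4.26)
    n = len(data)
    if n == 1:
        return data
    half = n // 2
    for i in range(half):
        data[i], data[i + half] = compare_exchange(data[i], data[i + half], asc)
    return bitonic_merge(data[:half], asc) + bitonic_merge(data[half:], asc)

def bitonic_sort(data, asc=True):
    # chiTheta_{n,d} = M_{n,d} (I_2 (x)_l chiTheta_{n/2, l==0}), cf. (4.25)
    n = len(data)
    if n == 1:
        return data
    half = n // 2
    first = bitonic_sort(data[:half], True)    # l = 0: ascending sub-sort
    second = bitonic_sort(data[half:], False)  # l = 1: descending sub-sort
    return bitonic_merge(first + second, asc)

assert bitonic_sort([5, 1, 7, 3, 2, 8, 6, 4]) == [1, 2, 3, 4, 5, 6, 7, 8]
```

The direction parameter of each recursive call is exactly the role played by d in the χΘ construct.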
As explained above, we have formally introduced four algorithms conforming
to the multi-linear paradigm. The WHT algorithm and the iterative FFT algorithm
inherently conform to the paradigm, while the recursive FFT algorithm and the
bitonic sorting algorithm require slight modification to meet the paradigm require-
ment.
4.6 Summary
This chapter explains the extensions to the Spiral framework for addressing the multi-linear paradigm of imperfect loop nest programs for hardware generation targeting the customized scalar load-store architecture. The extensions cover the full generation flow. The DSL extensions at the three abstraction levels are made to capture the paradigm. In OL, the sorting operation with the direction parameter enables the unified kernel operation for the paradigm. In Σ-OL, the perfect sub-nest construct allows the pattern matching in the form of term rewriting rules. Another Σ-OL extension is the loop with embedded computations, which captures the optimized program with loop-level optimizations applied. In icode, new integer types with explicit precision and the decoupled interface make the hardware features explicit; new operators and commands allow the modeling of hardware behaviors. The resulting representation still provides instantiation flexibility in the iteration space, the multi-linear gather/scatter pattern, and the kernel specification, such that many different algorithms can be accommodated.
Chapter 5
Evaluation
This chapter presents experiments that evaluate the effectiveness of the proposed approach. They include the effectiveness of pattern-based loop optimizations for hardware, and the quality of the generated FFT cores compared to a state-of-the-art generator. The experiments follow the evaluation flow shown in Figure 5.1, where the Spiral-generated designs are compiled to Verilog RTL code and then simulated cycle-accurately, synthesized, placed and routed. The details are presented in the following sections.
Figure 5.1: Obtaining experimental data from the Spiral-generated designs.
5.1.1 Methodology
We generated algorithms for WHT, DFT and the bitonic sorter and implemented the load-store architecture with our approach. The baseline implementations exclude all hardware optimizations at the Σ-OL level. The optimized implementations are optimized in Σ-OL for latency, buffer resources, and the arithmetic units for calculating the memory indices, and we compare the results from the baseline to the deepest optimization.
The latency evaluation uses the iterative FFT, the algorithm in this thesis that provides enough parallelism. The base implementation forces a perfect sub-nest to complete all memory write operations of all iterations before the next perfect sub-nest can start. The optimized implementation exploits the multi-linear access pattern for static analysis and identifies the iteration for synchronization, whose completion unlocks the start of the next perfect sub-nest.
To obtain the execution latency, we have crafted an accelerator wrapper in Chisel RTL [43] and a testbench using the Chisel iotesters 2.10 [44] API. The iotesters framework compiles the Chisel RTL code to Verilog RTL code, then invokes the Verilator 3.904 simulator [45] to compile the Verilog RTL code to C++ code, which is further compiled and executed for cycle-accurate simulation. This setup initializes the input data in memory through the input data ports and retrieves the result data from memory via the output data ports. We report the elapsed clock cycles in the testbench between the time all input data has been loaded to the memory and the time when the resulting data is ready to be retrieved from the memory.
To measure memory utilization, we instrument the buffer allocation process of Spiral to report the aggregated number of data entries in the memory. The buffer allocation processes are implemented as four separate passes with different levels of optimization. Level-0 reserves the dedicated input and output buffers and allocates intermediate buffers for every composition of loops; it serves as the baseline. Level-1 can exploit in-place calculations in loop compositions to reuse the read buffer as the write buffer. Level-2 simulates the dynamic temporary buffer allocation scheme by recycling the buffer entries used in deeper loop nests that have completed computation. Level-3 is the enhanced version of level-2 that also allows the input buffer to be overwritten.

To measure the index computation cost, we count the total bits of the adders and multipliers in the Σ-OL expressions. The baseline implements the multi-linear expressions directly with the full set of adders and multipliers. We divide the optimizations into two, where the first utilizes inductive calculations to avoid multipliers and the second further exploits bit mappings.
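For intuition, the inductive calculation replaces the per-iteration multiplications in a multi-linear index expression such as 4*i1 + 2*i2 with running additions, which is how a hardware counter chain produces the same sequence without multipliers. A minimal sketch (the strides 4 and 2 are illustrative):

```python
def indices_multiplicative(n1, n2):
    # direct evaluation: one multiply per index term each iteration
    return [4 * i1 + 2 * i2 for i1 in range(n1) for i2 in range(n2)]

def indices_inductive(n1, n2):
    # inductive (strength-reduced) evaluation: additions only,
    # mirroring a counter chain in hardware
    out = []
    base1 = 0
    for _ in range(n1):
        idx = base1
        for _ in range(n2):
            out.append(idx)
            idx += 2        # inner stride added inductively
        base1 += 4          # outer stride added inductively
    return out

assert indices_multiplicative(2, 2) == indices_inductive(2, 2) == [0, 2, 4, 6]
```

When the strides are powers of two, as in the radix-2 algorithms, even the additions can degenerate into fixed bit mappings.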
5.1.2 Latency
Figure 5.2 shows the latency (in clock cycles) of the generated radix-2 FFT cores ranging from size 8 to size 2048, comparing the conservative and the precise dependency management schemes. The secondary axis presents the speedup of the precise scheme over the conservative scheme, because the latency grows fast as the problem size doubles and makes it difficult to show the latency of small FFT sizes on the primary axis. The average speedup is 1.16x. The speedup increases at the beginning as the problem sizes grow from size 8, reaches its peak of 1.41x at size 64, and decreases when the problem sizes grow larger. The speedup increase in the smaller size range is attributed to the growing parallelism given by the increasing problem sizes. Because the datapath pipeline has a latency of 43 cycles, for very small sizes such as 4 the latency is dominated by the pipeline itself. At size 64, the number of independent iterations in each perfect sub-nest is slightly larger than the pipeline latency, so that most of the pipeline bubbles can be avoided as long as the iteration for synchronization does not reside at the very end of the perfect sub-nest. The speedup falls when the size grows even larger because, in these situations, the cost of pipeline draining is relatively small compared to the total execution time.
Figure 5.2: The latency comparison between two dependency management strategies
for iterative FFTs.
Though the theoretical speedup upper bound cannot exceed 2x, as analyzed in Chapter 3, and is approached only for problem sizes at the same scale as the fixed pipeline latency, this optimization is still beneficial for two reasons. First, the analysis is performed statically and does not incur additional hardware implementation cost. Second, because future throughput scaling of the architecture will reduce the number of independent iterations for the same problem sizes, the problem sizes that obtain the most benefit from this optimization will also increase.
The effect of the proposed buffer allocation schemes in Section 4.2.1 is evaluated by comparison against the baseline buffer allocation scheme of current Spiral for WHTs, FFTs and bitonic sorters. In the proposed scheme, three techniques are applied. To isolate the effect of each technique, three implementations have been created that employ the three techniques accumulatively. The descriptions of the implementations are listed in Table 5.1.
Table 5.1: The implementations of four buffer allocation schemes
The different buffer allocation schemes can lead to different memory utilization. In all experiments, the result of each implemented scheme is normalized to the baseline scheme.

Figure 5.3 presents the allocation scheme comparison for iterative FFTs across sizes from 8 to 1024. The iterative FFT algorithm is composed of several loop stages accessing the full data set. Furthermore, in-place calculations exist in all but the first stage, thus the intermediate buffers can be replaced by the output buffer in all implementations. This result cannot be improved further by the three techniques proposed in this work. Consequently, the four implementations achieve the same memory utilization.
Figure 5.3: The comparison of buffer utilization between different allocation strate-
gies for iterative FFTs.
Figure 5.4 presents the comparison for the recursive Cooley-Tukey algorithms of WHT and FFT from size-8 to size-1024. Numerous recursive decompositions exist, all of which produce deeply nested loop programs; the decomposition that tends to create balanced rule trees is selected. In this plot, the hierarchical and swapping implementations already achieve moderate memory utilization because, at each level, the computation is factorized into two stages of much smaller sizes, which yields small intermediate buffer sizes. The swapping technique fails to reduce the memory utilization further because, in a factorization into two stages with a smaller intermediate buffer, swapping does not apply. The overwritten implementation reduces the memory utilization by almost 1/3 by allowing the data of the input buffer to be overwritten. This removes the largest intermediate buffer, the same size as the problem, in the top-level decomposition while preserving correctness.
Figure 5.4: The comparison of buffer utilization between different allocation strate-
gies for recursive WHTs/FFTs.
Figure 5.5 shows the comparison for the bitonic sorter algorithm from size-8 to size-1024. This is a recursive algorithm where each breakdown step factorizes the computation into three stages, with the middle one being in-place. The plot shows overlapped scatter dots for hierarchical and swapping, both of which reduce the memory utilization considerably. The benefit of the hierarchical scheme comes from the deep nesting levels that can reuse intermediate buffers. The swapping scheme fails to reduce the utilization further because the input buffers at each level are all the input buffer of the program, which prohibits swapping. When overwriting the input buffer is allowed in the overwritten scheme, the memory utilization continues to decrease.
Figure 5.5: The comparison of buffer utilization between different allocation strate-
gies for bitonic sorters.
Tables 5.2, 5.3, 5.4 and 5.5 show the accumulated bits of the multiplication and addition operations used for calculating the multi-linear expressions for a range of problem sizes of WHT, DFT and bitonic sorters, implemented with three different optimizations. Tables 5.2, 5.3 and 5.4 show similar results for the radix-2 algorithms of WHT, DFT and the bitonic sorter. They all successfully remove all multipliers and adders, because the constant factors of the multi-linear expressions of radix-2 algorithms are all power-of-2 numbers, which can be implemented through bit mappings.

Table 5.5 shows the result for the radix-3 FFT, where most constant factors are not power-of-2 numbers, except for the constant factor 1. In this situation, the bit mapping scheme optimizes the inductive calculation implementation only slightly.
Table 5.2: The comparison of total arithmetic bits between different implementa-
tions of multi-linear expressions for recursive radix-2 WHTs.
Table 5.3: The comparison of total arithmetic bits between different implementa-
tions of multi-linear expressions for iterative radix-2 FFTs.
Table 5.4: The comparison of total arithmetic bits between different implementa-
tions of multi-linear expressions for Bitonic sorters.
Table 5.5: The comparison of total arithmetic bits between different implementa-
tions of multi-linear expressions for iterative radix-3 FFTs.
This section evaluates the quality of the generated designs on an FPGA device using the proposed approach in a case study of FFT. The cycle counts, peak frequency and resource utilization are compared against designs from the existing hardware backend; in addition, the controller design of the proposed approach is investigated to explain the peak frequency results.
5.2.1 Methodology
The iterative radix-2 FFT algorithm is chosen to produce the FFT core designs for evaluation. Note that this work has focused on the scalar load-store architecture, which provides a data rate of one data word per cycle. The generated design matches the operation rate to the data rate. The generated Chisel RTL designs initialize the input data in memory through the input data ports and retrieve the result data from memory via the output data ports. The overall Chisel RTL code is translated to Verilog RTL code.
The baseline designs are generated by the Discrete Fourier Transform Verilog IP Generator [46] (hereafter referred to as DFTgen) based on the current Spiral hardware backend technology. Since DFTgen scales the architecture with the streaming width, and the minimal streaming width in the standard interface of DFTgen is two, the design using a streaming width of 1 is obtained by using the backend interface of DFTgen with help from the author. The obtained design is depicted in Figure 5.6. The design is composed of three main components: the twiddle factor module, the half-rate F(2) module, and the permutation module, configured to provide a streaming rate of one word per cycle using reasonable resources. In this scenario, the permutation module is merely a RAM that is written and read in the appropriate orders.

We compare the designs from the two approaches. The single-precision floating-point data format is used for both approaches. The floating-point arithmetic operators in the proposed approach are mapped to the same RTL modules instantiated by DFTgen designs. Both designs map the memory and the twiddle factor lookup table to the block RAM resources of the FPGA.
The target device is a Xilinx Virtex-7 FPGA. All FPGA synthesis is performed using Xilinx Vivado 2016.2, and the area and timing data shown in the results are extracted after the final place and route are complete.
Latency. The latency of the designs produced with the proposed method is obtained from the cycle-accurate simulation setup described in Section 5.1.1. The latency of DFTgen designs is reported by the generator and is defined as the elapsed clock cycles between the time the first input data sample is streamed into the FFT core and the time when the first output data sample is streamed out of the FFT core.
Resources. We report the utilization of four types of resources, namely the lookup tables (LUTs), the flip-flops (FFs), the DSPs, and the block RAMs (BRAMs). The LUTs are the general reconfigurable resources on the FPGA, which can implement arbitrary gate logic through configuring the bits of the lookup tables. The FFs are the register resources provided by the FPGA device. The Xilinx Virtex-7 FPGAs contain dedicated arithmetic units called DSP slices. DSP slices contain hard multipliers, accumulators, registers, and interconnect. The multipliers in these slices are used in floating-point multiplication; each single-precision floating-point multiplier uses two DSP slices. The floating-point adder is mapped to LUTs and FFs. Xilinx FPGAs provide two types of memories: BRAM and distributed RAM. BRAMs are 36 Kb dedicated hard memories built into the FPGA; the Virtex-7 contains 545 BRAMs. Memory structures can also be constructed with the FPGA's LUTs, which are normally used as logic. In our experiments, all the RAMs and ROMs are mapped to BRAM resources.
5.2.2 Latency
Figure 5.7 shows the latency comparison between the FFT cores of our approach and of DFTgen. For the FFT size range from 16 to 2048, this work is on average 15% faster than DFTgen. The key reason is in the algorithm: while the initial bit reversal permutation is separated from the log N FFT stages in DFTgen, it is merged into the FFT stages in our approach. The secondary reason is across-stage execution overlapping. Note that the latency in clock cycles is not the ultimate metric to compare the execution speed, because the final performance also depends on the clock frequency.
Figures 5.8, 5.9 and 5.10 show the resource utilization of LUTs, FFs and BRAMs on the Xilinx FPGA for FFT sizes ranging from 128 to 2048. The primary axis presents the number of resources for this work and DFTgen. The secondary axis presents the ratio of this work over DFTgen. This work uses on average 14% more LUTs than DFTgen. The ratio increases slowly as the FFT sizes grow larger. Since the datapath stays mostly the same across sizes for both designs, this is possibly due to the implementation cost of the controller used in our approach. This work uses on average 5% more FFs than DFTgen, with a trend similar to the LUT utilization with respect to the problem sizes, possibly for the same reason. In contrast, the proposed approach utilizes fewer BRAMs, on average 70% of DFTgen. The BRAMs are used exclusively for the memory modules and the twiddle factor lookup tables in both approaches. The twiddle factor lookup tables are of the same size in both approaches. In the proposed approach, the required memory entries are twice the problem size. In DFTgen, the number is fourfold the problem size, because the initial permutation module does not share its memory with the stride permutation module in the FFT compute stage.
Prior work has optimized the permutation cores [47] and their use in streaming FFTs [48]. In particular, [48] reports half of the RAM bank utilization compared to the DFTgen baseline used in this study, by sharing the memory block between the bit-reversal permutation and the stride permutation. This matches the number in our load-store architecture design. However, the memory sharing technique for streaming designs requires special rewiring of the streaming blocks with extra control. Besides, the optimization in [48] does not change the cycle counts studied in the previous sub-section, because the bit reversal permutation remains a separate stage.
Figure 5.8: This work vs. DFTgen: lookup tables comparison.
Figure 5.10: This work vs. DFTgen: block RAMs comparison.
The DFTgen designs present decent scalability, maintaining the peak frequency near 455 MHz for FFT sizes ranging from 128 to 2048. Figure 5.11 shows the peak frequency of the proposed approach, which degrades from 286 MHz to 250 MHz across the size range from 128 to 2048. The analysis of the critical path shows that the longest path connects the hardware operators and the root compose FSM of the controller. The path travels from the perfect sub-nest FSM up to the root compose FSM because the design requires a fast response of the complete signal to start the next iteration. Thus, the peak frequency can be improved in two ways. First, the proper use of elastic buffers can truly decouple the critical path between hardware operators and reduce the length of the critical path for each size. Second, the controller design can be improved by reducing the number of FSM levels.
Figure 5.11: The peak frequency on FPGAs for FFTs.
The peak frequency evaluation has exposed a shortcoming of the controller design. Figure 5.12 further presents the resource utilization of the controller. The Y-axis is the percentage of resource utilization of the controller over the overall design. Because the controller does not utilize BRAMs or DSPs, only the LUT and FF utilizations are shown. The scatter plots present a steady increase of utilization of the two resource types as the FFT sizes increase. At the FFT size of 2048, the controller accounts for a noticeable share of the overall LUTs and FFs.
Figure 5.12: The resource cost of controllers for FFTs.
5.3 Summary
This chapter evaluates the effectiveness of the proposed approach for load-store architectures, including the pattern-based loop optimizations performed at the Σ-OL level, by comparing the performance and resource utilization of the FPGA implementations of the FFT cores generated by the proposed method and by the existing hardware backend. The comparable results suggest that the flexible load-store architecture can be optimized well with enough domain-specific effort.
Chapter 6
Concluding Remarks
algorithm through deeply customizing the control logic and datapath for hardware
design for other algorithms. While the load-store architecture is inherently flexible to
flexibility is carefully realized by having a higher level view of hardware designs, and
This work showed us that the flexible load-store architecture can be practically used.
In this way, the datapath can be tailored for efficient use by the given algorithm, and the control logic can be replaced by simpler and more specific mechanisms. Since this requires representing designs at multiple levels of abstraction, multi-level DSLs are essential to drive this approach. Spiral provides an extensible framework to implement the proposed approach, and the extensions we have made mark a significant first step in opening up the full power of Spiral for automated hardware generation. The breakdown rules provided by Spiral possess a recursive or iterative nature and thus can be expanded for various algorithms. In the software generation flow of Spiral, the target hardware architecture is fixed.
In the proposed approach, the abundant architectural paradigms, including but not limited to those studied here, offer more implementation freedom than that provided by commodity processors, and thus may trigger new ideas in algorithm design, as well as Pareto-optimal designs across a large tradeoff space between performance and resource utilization.
We saw that, by extending the Σ-OL language to capture the desired properties of the compute pattern of imperfect loop nest programs, the latency, memory utilization and index computation cost of the resulting hardware can be optimized. Among the exploited properties are:
• Fixed loop bounds
We saw that the icode extensions model the interconnected RTL modules; interfaces carrying raw digital signals or using the ready/valid protocol are both supported. Each icode operator modeling an RTL module can be treated as a code generator and can customize itself with respect to the input characteristics. This also allows external hardware modules to be integrated. Furthermore, the compute pattern of imperfectly nested loop programs can be parallelized in the form of vector processing, SMP parallelism, and SIMD parallelism.
The compute pattern studied in this thesis possesses certain properties that make automated parallel hardware generation possible with Spiral. First, the perfect sub-nest of an imperfect loop nest program contains only independent operations that can be computed in parallel. Second, the multi-linear access pattern to memory can be transformed for various types of parallelism. Moreover, the hardware generation flow introduced in Chapter 4 is compatible with the prior parallelization work in Spiral for general purpose processors. Thus, high performance parallel hardware can be generated.
Vector processors achieve the simultaneous movement and computation of many data operations [49]. This behavior is not attainable in typical pipelined instruction set processors, because the data movement operations and computations are encoded in the same instruction stream such that data movements never coincide with computations. The vector processors, represented by the Cray processor [36], achieve simultaneous data movement and computation by exploiting the fact that the same operation is applied across many data elements. The customized scalar load-store architecture introduced in Section 4.4 inherently achieves the same effect as vector processors by architecting for the multi-linear compute pattern. The scalar algorithm implemented on the architecture can already provide a compute throughput of one word per clock period when connected to a dual-ported fast on-chip memory. When interfacing with more complicated memory systems, the necessary algorithm transformations are discussed later in this chapter.

The customized scalar load-store architecture can inherently achieve a processing rate of one word per clock period, as typical vector processors do. Both architectures
exploit the homogeneity of computations, though with varying degrees of specialization compared to general vector processors.
A vector load/store instruction moves the elements of a vector between a vector register file and memory, while vector arithmetic instructions operate on all elements in vector registers. In hardware design, both vector movement units and functional units are fully pipelined and allow one element to be processed per clock period. The chaining technique allows a vector instruction to proceed as soon as the elements of its source vector become available. In this way, the vector registers act essentially like FIFOs and allow continuous streaming through the functional units.

The customized load-store architecture achieves this throughput more easily thanks to the compute pattern it is designed for. In the compute pattern of imperfect loop nest programs, the basic block operation always has the form that data must be loaded from memory, processed through an arbitrary kernel operation, then stored back to memory. Hence, chaining the load, compute, and store units is a natural decision. The pipeline is driven by an FSM generating the base and stride of the load/store indices and other necessary control signals. With the control signals from the FSM, the pipeline is capable of processing one element per clock period, matching the throughput of vector processors.
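As a toy illustration of this chaining (a behavioral model, not the generated RTL), the following Python sketch advances one element through the load, compute, and store units per simulated cycle, with precomputed index sequences standing in for the FSM outputs:

```python
def run_pipeline(memory, gather_idx, scatter_idx, kernel):
    """Toy cycle model of the chained load -> compute -> store pipeline.

    Each unit hands one element per cycle to the next; in hardware an FSM
    would supply gather_idx/scatter_idx as base-plus-stride sequences.
    """
    n = len(gather_idx)
    load_reg = compute_reg = None
    out = list(memory)
    cycles = 0
    for cycle in range(n + 2):      # n elements + 2 cycles of pipeline drain
        # store stage: writes the element computed last cycle
        if compute_reg is not None:
            pos, val = compute_reg
            out[scatter_idx[pos]] = val
        # compute stage: applies the kernel to the element loaded last cycle
        compute_reg = (load_reg[0], kernel(load_reg[1])) if load_reg else None
        # load stage: fetches the next element via the gather index
        load_reg = (cycle, memory[gather_idx[cycle]]) if cycle < n else None
        cycles += 1
    return out, cycles

# stride-2 gather, unit-stride scatter, doubling kernel (all illustrative)
mem = [0, 1, 2, 3, 4, 5, 6, 7]
result, cycles = run_pipeline(mem, [0, 2, 4, 6, 1, 3, 5, 7],
                              list(range(8)), lambda v: 2 * v)
assert result == [0, 4, 8, 12, 2, 6, 10, 14]
assert cycles == 10   # 8 elements + 2-cycle drain: one element per cycle
```

The n + 2 cycle count makes the one-element-per-cycle steady state explicit: only the fixed fill/drain overhead separates it from the ideal rate.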
However, specialization enables more complexity than generic vector processing. Vector processors are constrained by the limited number of vector registers and vector functional units, while the customized architecture streams data directly between the functional units and memory. Moreover, the vector functional units always perform identical operations on all vector elements, whereas a customized functional unit can enable an arbitrary operation for each vector element and can buffer a few elements to compute results from multiple vector elements.
While the customized scalar load-store architecture provides inherent vector processing throughput, the actually attainable performance relies heavily on the memory system it connects with. Modern main memory, such as DRAM, incurs long access latency that can be partially hidden by accessing a continuous data block. Deep memory hierarchies are common in modern computers and require data locality in programs for memory performance improvements. The Spiral framework handles these scenarios and provides useful rules in the rewrite system for program transformation toward blocking access.
By introducing the vector tensor operator for block data operations, and introducing a vectorization tag 'vec', the necessary rules for transforming fundamental OL operators and formulas for vectorization are listed in (6.1) - (6.3).
$$\underbrace{A_n \otimes I_m}_{\mathrm{vec}} \to A_n \,\vec{\otimes}\, I_m \qquad (6.1)$$

$$\underbrace{L^{mn}_{m}}_{\mathrm{vec}} \to \left( L^{mn/\nu}_{m} \otimes I_\nu \right) \left( I_{mn/\nu^2} \,\vec{\otimes}\, L^{\nu^2}_{\nu} \right) \left( \left( I_{n/\nu} \otimes L^{m}_{m/\nu} \right) \vec{\otimes}\, I_\nu \right), \quad \nu \mid m, n \qquad (6.3)$$
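A stride permutation factorization of this kind can be checked numerically by representing each permutation as an index list and composing the factors; the helper names below are illustrative, not part of Spiral:

```python
def stride_perm(n, s):
    """Index list for L^n_s: output position j*r+i takes input i*s+j (n = r*s)."""
    r = n // s
    perm = [0] * n
    for i in range(r):
        for j in range(s):
            perm[j * r + i] = i * s + j
    return perm

def tensor_I_left(k, perm):
    """I_k (x) P: apply P block-wise to k consecutive blocks."""
    n = len(perm)
    return [b * n + p for b in range(k) for p in perm]

def tensor_I_right(perm, k):
    """P (x) I_k: permute blocks of k elements according to P."""
    return [p * k + j for p in perm for j in range(k)]

def compose(outer, inner):
    """Permutation equal to applying inner first, then outer."""
    return [inner[p] for p in outer]

v = 2
for m, n in [(2, 4), (4, 2), (4, 4)]:
    lhs = stride_perm(m * n, m)
    f1 = tensor_I_right(stride_perm(m * n // v, m), v)
    f2 = tensor_I_left(m * n // v**2, stride_perm(v * v, v))
    f3 = tensor_I_right(tensor_I_left(n // v, stride_perm(m, m // v)), v)
    # matrix product f1 f2 f3 means f3 is applied to the data first
    assert lhs == compose(f1, compose(f2, f3))
```

The only factor touching data within a vector word is the small L over ν² elements, which is exactly the in-register shuffle that the vectorization isolates.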
In (6.1), the tagged OL formula with the exact shape is converted to the identical form by assigning the vector tensor operator and dropping the tag. In (6.2), the block stride permutation is converted to the identical form by assigning the vector tensor operator and dropping the tag. In (6.3), the stride permutation is decomposed into factors involving block stride permutations of the vector length ν.
In an SMP architecture, multiple cores execute concurrently and share a larger memory in between. In general SMP processors, cache memory is private to the cores. Ideally, the compute task is evenly distributed among the cores, accessing data resident in the local cache. When inter-core communication is necessary, the data is exchanged through the shared memory. This maps well to the abundant independent iterations in the perfect sub-nest. For obtaining high performance from an SMP architecture, existing rewrite rules in Spiral can handle the code transformation; for implementing the architecture, certain hardware blocks are required, but the complicated ones can be obtained from existing generators.
The compute pattern studied in this thesis possesses rich independent iterations in the perfect sub-nest. As seen in classic loop parallelization tricks, the perfect sub-nest can be re-organized to have an outermost loop with a loop count equal to the prescribed number of cores. Then each iteration of the outermost loop is mapped to a unique execution core. At each core, the data required by the corresponding outermost loop iteration is loaded from the shared memory, computed locally, then stored back.

However, the simple code transformation at the loop level does not necessarily bring higher performance. Though already load-balanced, the different cores can possibly access different data stored in the same cache line. This causes the false-sharing problem and can trigger substantial cache coherence traffic, thus incurring a serious performance penalty. The next subsection explains the existing rewrite rules that address both issues.
The existing SMP rewrite rules in Spiral handle the load-balancing problem and avoid false sharing when parallelizing OL operators. The SMP platform is captured in two parameters: p, the number of cores, and µ, the cache line length. An smp(p, µ) tag marks formulas for SMP parallelization. A load-balanced parallel formula is denoted by the ⊗k operator, and the data permutation avoiding false sharing is denoted by the ⊗̄ operator. The relevant rules are shown in (6.4) - (6.7).
$$\underbrace{I_p \otimes A_n}_{smp(p,\mu)} \to I_p \otimes_k A_n \qquad (6.4)$$

$$\underbrace{P \otimes I_\mu}_{smp(p,\mu)} \to P \,\bar{\otimes}\, I_\mu \qquad (6.5)$$

$$\underbrace{A_m \otimes I_n}_{smp(p,\mu)} \to \bigl( (L^{mp}_{m} \otimes I_{n/p\mu}) \,\bar{\otimes}\, I_\mu \bigr) \bigl( I_p \otimes_k (A_m \otimes I_{n/p}) \bigr) \bigl( (L^{mp}_{p} \otimes I_{n/p\mu}) \,\bar{\otimes}\, I_\mu \bigr), \quad \mu \mid n/p \qquad (6.6)$$

$$\underbrace{L^{mk}_{k}}_{smp(p,\mu)} \to \bigl( I_p \otimes_k L^{mk/p}_{k/p} \bigr) \bigl( (L^{pm}_{p} \otimes I_{k/p\mu}) \,\bar{\otimes}\, I_\mu \bigr) \qquad (6.7)$$
In (6.4), the tagged OL formula with the exact shape is converted to the identical form by assigning the ⊗k operator and dropping the tag. In (6.5), the block stride permutation is converted to the identical form by assigning the ⊗̄ operator and dropping the tag. In (6.6), the vectorizable OL formula is converted to a parallelizable shape with additional stride permutations. In (6.7), the stride permutation is converted to block stride permutations of cache line length µ and a parallelized form.
The SMP architecture requires cache memory and an interconnect network between the cores. These are complicated hardware building blocks that require high development cost if built by hand. The Rocket Chip generator [33] includes components for generating cores, caches and on-chip networks. Configuration options include the number of tiles, the coherence policy, the presence of a shared L2 cache, the number of memory channels, the number of cache banks per memory channel, and the implementation of the underlying physical networks. These generators are all based around TileLink, a protocol for cache-coherent on-chip communication. Such building blocks can complement the load-store architecture when multiple banks of fast memory such as SRAMs are available.
In a short vector architecture, the vector core, with a vector load/store unit and a vector functional unit, replaces the scalar core to which the memory supplies one scalar data word at a time. The vector lanes parameter specifies the number of data items that can be transferred and processed simultaneously. Similarly, the vector functional unit includes a vector arithmetic/logic unit and a permutation unit. The permutation unit handles data movements between vector lanes and can reorder elements within a vector.
SIMD Algorithm Generation
The existing SIMD rewrite rules in Spiral handle the SIMD vectorization problem. The SIMD platform is captured in the parameter ν for the vector length. A simd(ν)-tagged formula is vectorized by assigning the vector tensor operator to formulas and by rewriting the data permutations into vectorizable forms. The relevant rules are shown in (6.8) - (6.10).
$$\underbrace{A_n \otimes I_\nu}_{vec(\nu)} \to A_n \,\vec{\otimes}\, I_\nu \qquad (6.8)$$

$$\underbrace{I_\nu \otimes A_n}_{vec(\nu)} \to \left( L^{n}_{\nu} \otimes I_\nu \right) \left( I_{n/\nu} \,\vec{\otimes}\, L^{\nu^2}_{\nu} \right) \left( A_n \,\vec{\otimes}\, I_\nu \right) \left( I_{n/\nu} \,\vec{\otimes}\, L^{\nu^2}_{\nu} \right) \left( L^{n}_{n/\nu} \otimes I_\nu \right) \qquad (6.9)$$

$$\underbrace{L^{mn}_{m}}_{vec(\nu)} \to \left( L^{mn/\nu}_{m} \otimes I_\nu \right) \left( I_{mn/\nu^2} \,\vec{\otimes}\, L^{\nu^2}_{\nu} \right) \left( \left( I_{n/\nu} \otimes L^{m}_{m/\nu} \right) \vec{\otimes}\, I_\nu \right), \quad \nu \mid m, n \qquad (6.10)$$
In (6.8), the tagged OL formula with the exact shape is converted to the identical form by assigning the $\tilde{\otimes}$ operator and dropping the tag. In (6.9) and (6.10), the formulas are converted into vectorized form with additional block stride permutations of vector length $\nu$. Note that (6.9) and (6.10) produce the irreducible permutation $L^{\nu^2}_{\nu}$, which must be handled as a base case.
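Once the tags are dropped, the factorization in (6.10) is an ordinary matrix identity and can be checked numerically. The following Python sketch (illustrative only; helper names are ours) verifies it for several sizes with ν | m, n:

```python
def L(N, s):
    # gather indices of the stride permutation L^N_s
    m = N // s
    return [(q % m) * s + q // m for q in range(N)]

def PI(g, v):
    # P (x) I_v : permute v-element blocks according to g
    return [g[q // v] * v + q % v for q in range(len(g) * v)]

def IP(b, g):
    # I_b (x) P : apply g inside each of b contiguous chunks
    n = len(g)
    return [(q // n) * n + g[q % n] for q in range(b * n)]

def mul(*gs):
    # gather indices of the product of factors; the rightmost applies first
    out = gs[0]
    for g in gs[1:]:
        out = [g[i] for i in out]
    return out

def rhs(m, n, v):
    # (L^{mn/v}_m (x) I_v) (I_{mn/v^2} (x) L^{v^2}_v) ((I_{n/v} (x) L^m_{m/v}) (x) I_v)
    f1 = PI(L(m * n // v, m), v)
    f2 = IP(m * n // (v * v), L(v * v, v))
    f3 = PI(IP(n // v, L(m, m // v)), v)
    return mul(f1, f2, f3)

for (m, n, v) in [(4, 2, 2), (4, 4, 2), (8, 4, 4), (8, 8, 2)]:
    assert rhs(m, n, v) == L(m * n, m)
```

The middle factor is exactly the irreducible L^{ν²}_ν base case applied within ν²-element chunks; the outer factors only move whole ν-element vectors.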
Hardware Building Blocks
The base-case permutations operate at the granularity of the vector length. When generating software implementations for SIMD microprocessors, they are implemented as in-register shuffle instructions supported by the target instruction set.
The permutation operations for data sizes at multiples of the vector length can be implemented as streamed permutations. In this scheme, the input is streamed from memory in size-ν chunks through vector loads, flows through the streamed permutation unit, and is finally streamed back to memory through vector stores. Since data streamed in earlier may not be allowed to stream out immediately, a memory buffer is required inside the streamed permutation unit. Such a permutation unit can be generated with the bit matrix method described in [50] when the data size is a power of two, and with the satisfiability approach in [16] for arbitrary data sizes.
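As a rough software model of this behavior (hypothetical; not the generated RTL), the following sketch streams width-ν chunks through a buffered permutation unit and shows why output chunks must stall until all of their source elements have arrived:

```python
# Streamed permutation with an internal buffer: data arrives as width-v chunks,
# is held in a buffer, and leaves as width-v chunks in permuted order.
def stream_permute(chunks, perm, v):
    buf = {}
    out = []
    n = len(perm)
    for t, chunk in enumerate(chunks):           # one input chunk per "cycle"
        for lane, val in enumerate(chunk):
            buf[t * v + lane] = val              # vector load into the buffer
        # drain: emit an output chunk once all of its source elements arrived
        while len(out) * v < n:
            base = len(out) * v
            src = perm[base:base + v]
            if all(s in buf for s in src):
                out.append([buf[s] for s in src])   # vector store
            else:
                break                            # stall until more data arrives
    return out

data = list(range(8))
chunks = [data[i:i + 2] for i in range(0, 8, 2)]
perm = [0, 4, 1, 5, 2, 6, 3, 7]                  # the stride permutation L^8_4
flat = [x for c in stream_permute(chunks, perm, 2) for x in c]
assert flat == [data[p] for p in perm]
```

For this permutation the first output chunk needs elements 0 and 4, so nothing can leave the unit until the third input chunk has been buffered; this is exactly the buffering requirement described above.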
This chapter discussed parallel extensions of the customized load-store architecture for the multi-linear compute pattern. The properties of the load-store architecture carry over naturally to the SMP architecture, while the SIMD parallel architecture requires substantial effort for assuring SIMD-vectorizable algorithms. The existing parallelization rules are compatible with the hardware generation flow and thus can be applied for parallel algorithm generation. As shown in this chapter, Spiral rewrite rules can assure high performance in implementations by shaping the algorithm at a high level. New hardware building blocks are required for SMP parallelism and SIMD parallelism, while the complicated ones can be reused from existing hardware generators.
This thesis concludes with a brief discussion of some possible future extensions of this work. First, by embedding a new specialized architecture into the Spiral framework, extra extensions could be required to realize the full framework capability. In addition, the compute pattern discussed in this thesis could be extended to support more applications. Finally, the combination of domain knowledge and hardware design may allow some further specialization.
To navigate the cost/performance tradeoff, Spiral must obtain quantitative metrics of the hardware implementation cost and the execution performance. However, the time required for synthesis and place-and-route makes exact metrics expensive to obtain. To reduce the metric retrieval time, estimations through modeling have been employed in the literature. Cost estimation turns out to be more difficult than performance estimation, because in highly optimized designs resource usage is sensitive to low-level tool decisions. Within Spiral, several solutions addressing the long synthesis times that obstruct design space exploration have been proposed. However, porting them to the more complicated
load-store architecture designs is non-trivial. In [51], an exact model for the DFT streaming cores predicts the resource utilization of the generated designs, including the slice and hard macro utilizations. Compared to the streaming core, the design configurations for load-store architectures involve more dimensions. The approach in [52] aims at capturing high-level features of the design, allowing the statistical models to capture the particular patterns of the target application. Though this work is more general than the previous one, it is an open question how to capture the high-level features of load-store architecture designs.
The compute pattern supported in this work is limited to static loop bounds, multi-linear access patterns, and identical kernel operations. Though we have shown several application examples that fit the pattern, supporting new applications may require extending the pattern itself. The algorithm generation process needs to be extended as well so that Spiral's constraint solver produces algorithms fitting the provided hardware parameters. Extending Spiral for new algorithms turns out to be a difficult task. Spiral has provided extensions of the constraint solvers for non-power-of-two DFTs, some linear algebra kernels, and computations in other domains.
In the code generation stage, depending on the exact change, the modifications may reach into the hardware interpretation backend, which must handle resource sharing between blocks. Blocks can be shared by multiplexing the primitive arithmetic units in DFGs; the techniques have been explained in the literature [29][30]. Other access patterns can be natively supported in this framework by adding a new index generation module, extending the index calculation to other operations such as power functions and modulo functions. Optimizations for other access patterns are to be explored. Data-dependent control flow requires extra communication between the datapath and the controller. If that communication overwhelms the decoupled design, the separation between the controller and the datapath breaks down.
This work has used a small set of hardware implementation options that have been developed in the computer architecture and digital design community. Various options can be incorporated into this framework and are compatible with the constraint solver of Spiral. One example is the implementation of controllers. ROM-based controllers store the control sequence in a read-only memory; they are preferred when storage resources are cheaper than logic resources. Instruction-based controllers produce the control sequence by fetching, decoding, and executing instructions, and can perform arbitrary control. The costs and benefits of these options have not been quantified; they could be exposed as choices in the algorithm generation process and evaluated in the final implementations.
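The tradeoff between the two controller styles can be illustrated with a small software model (the names and control-word encoding below are ours, not the framework's actual interfaces):

```python
# ROM-based controller: the whole control sequence is precomputed into a
# read-only table, and a counter simply steps through it one word per cycle.
def rom_controller(rom):
    for pc in range(len(rom)):
        yield rom[pc]

# Instruction-based controller: fetch/decode/execute a tiny program; loop
# instructions keep the storage small at the cost of decode logic.
def program_controller(program):
    for op, arg in program:
        if op == "emit":
            yield arg
        elif op == "repeat":            # arg = (count, word)
            count, word = arg
            for _ in range(count):
                yield word

rom = [("load", 0), ("load", 1), ("compute", None), ("store", 0)]
prog = [("emit", ("load", 0)), ("emit", ("load", 1)),
        ("repeat", (1, ("compute", None))), ("emit", ("store", 0))]

# Both controllers drive the datapath with the same sequence of control words.
assert list(rom_controller(rom)) == list(program_controller(prog))
```

For long, regular control sequences the ROM table grows linearly while the program stays small; for short sequences the ROM avoids the decode logic entirely.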
This work handles only a single level of memory, which limits either the problem sizes, when interfacing with on-chip memory, or the compute throughput, when interfacing with off-chip memory, because many computations exhibit data locality that can benefit from fast buffers. In the past, the Spiral framework has been extended for the scratchpad memory of the Cell platform [53], which implements a multi-level memory hierarchy. On FPGAs, memory infrastructures such as CoRAM [54] and Fluid [55] could play a similar role. The memory system impacts algorithm selection, so a multi-level memory model would need to be incorporated into Spiral.
Another direction is partial reconfiguration of the FPGA fabric. This technique has not been widely used because of the runtime overhead of reconfiguration: a design must run for a long enough time at each of the multiple stages to amortize the partial reconfiguration cost. The structure exposed by this work may provide useful information for fusing static and dynamic hardware compilation methods in a single design.
Since the inception of the hardware generation work of Spiral in 2008, the hardware compilation community has developed many techniques; the most influential are the so-called static methods and the dynamic methods. The static methods take in the control-data flow graph (CDFG) representation and schedule it onto a shared datapath under the user-specified timing and resource constraints. The most recent scheduling formulations have been successfully used in commercial hardware compilers. However, the static method requires precise latencies of operations in the CDFG, and falls short in scenarios with data-dependent latencies and control flows. In contrast, the dynamic compilation method [22] maps the CDFG spatially into an elastic circuit [28] that uses data tokens to enable distributed control. This method fits data-dependent control but incurs higher resource utilization. The fusion of these two methods remains an open problem.
A simpler hardware compilation context, as set up in this work, can ease the fusion of the two compilation methods. First, since the designers are supposed to specify the specification breakdown rule, the base-case operators that do not overlap in execution can be tagged with the desired compilation method. Second, the load-store architecture has decoupled most control flows and memory accesses from computations, so the selection of compilation methods can be decided purely based on the computation itself, without worrying about the complicated loop-nest control flows. The dynamic method can then focus on the irregularity in basic blocks to enable execution overlapping for the sake of latency improvement. Such a single unified pattern is unusual when more complicated computations
are to be accelerated. A general pattern-based synthesis flow has been developed in static hardware compilation for reducing multiplexer usage [29]. The flow takes in a set of DFGs; patterns in the DFGs are then searched, selected, scheduled, and bound to shared resources. This work raises the abstraction above individual operations in basic blocks. The high-level abstractions can allow high-value patterns to be identified directly, so the synthesis flow should simultaneously support manual and automatic pattern recognition. Execution overlapping also implies concurrent DFG fragments within a basic block, breaking the assumption in [29] that each fragment is mapped to a basic block and never overlaps in execution. Hence, the input as a flat set of DFGs is no longer applicable: the set must record which basic block a fragment belongs to. When scheduling patterns, only one of the multiple instances belonging to the same basic block can be considered, to allow concurrent execution within the same basic block, as required by the proposed flow. Across the implementations, various hardware design paradigms and design synthesis methods remain to be explored.
Bibliography
[1] F. Franchetti et al., “Spiral: automating high quality software production.”
[2] F. Franchetti, Y. Voronenko, and M. Püschel, “Formal loop merging for signal
[3] M. Püschel and J. M. F. Moura, “The algebraic approach to the discrete cosine
and sine transforms and their fast algorithms,” SIAM Journal of Computing,
on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V–69, 2004.
2.2.1
tures,” in Algorithms for Synthetic Aperture Radar Imagery XVI, vol. 7337,
[6] F. Franchetti, F. de Mesmay, D. McFarlin, and M. Püschel, “Operator language:
ence on Domain Specific Languages (DSL WC), vol. 5658 of Lecture Notes in
and applied mathematics, vol. 123, no. 1-2, pp. 85–100, 2000. 2.2.1
tor SIMD code generation for DSP algorithms,” in High Performance Extreme
Gaussian elimination,” SIAM Journal on Scientific and Statistical Computing,
August-2021]. 2.4.1
Test in Europe Conference & Exhibition, pp. 1118–1123, IEEE, 2009. 2.5, 6.2.3
Design Automation of Electronic Systems (TODAES), vol. 17, no. 2, pp. 1–33,
2012. 2.5
[18] B. Akın, F. Franchetti, and J. C. Hoe, “FFTs with near-optimal memory access through block data layouts,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3898–3902, IEEE, 2014. 2.5
[21] Z. Zhang and B. Liu, “SDC-based modulo scheduling for pipeline synthesis,”
[23] G. Weisz and J. C. Hoe, “C-to-CoRAM: Compiling perfect loop nests to the
[24] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of
complex Fourier series,” Math. of Computation, vol. 19, pp. 297–301, 1965. 3.2,
3.4.1
[25] G. K. Wallace, “The JPEG still picture compression standard,” IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992. 3.2
matical Software (TOMS), vol. 43, no. 2, pp. 1–18, 2016. 3.2, 3.4.1
design,” IEEE Micro, vol. 22, no. 5, pp. 24–35, 2002. 3.3.1
[29] J. Cong and W. Jiang, “Pattern-based behavior synthesis for FPGA resource
[31] G. De Micheli, Synthesis and Optimization of Digital Circuits. McGraw-Hill, 1994.
7000 EPP: An extensible processing platform family,” in 2011 IEEE Hot Chips
“Efficient SpMV operation for large and highly sparse matrices using scalable
3.3.2
[37] T. M. Low and F. Franchetti, “High assurance code generation for cyber-
tering and wavelet algorithms. PhD thesis, Carnegie Mellon University, 2004.
3.4.1
[40] L. Josipovic, P. Brisk, and P. Ienne, “An out-of-order load-store queue for spa-
[41] J. Cong and Z. Zhang, “An efficient and versatile scheduling algorithm based
[44] Lawson et al., “Chisel testers.” [Link]
06. 5.2.1
[47] F. Serre, T. Holenstein, and M. Püschel, “Optimal circuits for streamed lin-
5.2.3
5.2.3
rams,” Journal of the ACM (JACM), vol. 56, no. 2, pp. 1–34, 2009. 6.2.3
[51] P. A. Milder, M. Ahmad, J. C. Hoe, and M. Püschel, “Fast and accurate re-
[52] M. Zuluaga, A. Krause, P. Milder, and M. Püschel, “‘Smart’ design space
Fourier transforms for the Cell Broadband Engine,” in Proceedings of the 23rd
tional symposium on Field programmable gate arrays, pp. 97–106, 2011. 6.3
[55] J. Melber and J. C. Hoe, “A service-oriented memory architecture for fpga com-
Spiral's approach ensures memory efficiency during hardware generation of loop-based programs by employing a buffer allocation strategy that minimizes intermediate buffers and reuses in-place buffers where possible. By structuring computations in stages and applying pattern-based optimizations, stages that can reuse resources help reduce the memory footprint. Additionally, swapping strategies for input, output, and intermediate buffers further reduce unnecessary allocation, lowering the hardware requirements without compromising performance. The framework's capability to manage dual-ported memory access patterns also contributes to optimal resource utilization.

Synthesis of loop programs to customized load-store architectures presents challenges such as handling imperfect loop nest programs and achieving flexibility in hardware design. The proposed approach addresses these challenges by introducing a pattern-based optimization framework that focuses on static loop bounds and multi-linear access patterns. This includes extending Spiral's DSLs to support pattern-based loop optimizations and modeling interconnected RTL modules, which facilitates effective hardware resource management and high-throughput designs.
The proposed approach extends Spiral by optimizing buffer allocation through pattern-based loop optimizations to reduce buffer allocation sizes. Intermediate buffers are allocated between compute stages, but the approach supports in-place computing to reuse read buffers as write buffers, reducing the number of intermediate buffers needed. Moreover, techniques like buffer swapping are used when the intermediate, input, and output buffers are of uniform size, minimizing resource utilization. This optimization scheme suits the constraints of hardware environments, where memory resources are limited compared to the software assumption of cheap main memory.
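A minimal sketch of the buffer-swapping idea, assuming uniform buffer sizes (illustrative Python, not the generator's actual allocator): two physical buffers suffice for any number of stages, because the read and write buffers are exchanged by pointer swap after each stage.

```python
# Ping-pong buffer reuse across pipeline stages: allocate two buffers once,
# then swap them after every stage instead of allocating a fresh intermediate.
def run_stages(stages, x):
    src, dst = list(x), [0] * len(x)       # two buffers total, regardless of
    for stage in stages:                   # how many stages run
        for i in range(len(src)):
            dst[i] = stage(src, i)         # each stage reads src, writes dst
        src, dst = dst, src                # swap instead of allocating anew
    return src

double = lambda buf, i: 2 * buf[i]
reverse = lambda buf, i: buf[len(buf) - 1 - i]
assert run_stages([double, reverse], [1, 2, 3]) == [6, 4, 2]
```

In hardware terms, the swap is a multiplexer on the memory ports rather than a data copy, which is why uniform buffer sizes matter.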
Σ-OL extends loop optimization capabilities by introducing new constructs for capturing loop nest structures and embedded computations, which allows effective transformation and optimization of imperfect loop nest programs. It facilitates loop merging and index function simplification, enabling more efficient data handling and fewer roundtrips to memory. These features help achieve optimized hardware synthesis by allowing rewrite rules to lower OL formulas to Σ-OL expressions, further enabling sophisticated optimizations such as overlapping execution of different stages in the pipeline.

Rewrite rules play a pivotal role in optimizing multi-linear expressions within load-store architectures by transforming loop programs into a form that minimizes resource usage and maximizes data throughput. These rules allow the integration of permutations into gather operations, and of diagonals into scatter operations, aligning with the computational patterns defined in the Spiral framework. Through these transformations, repetitive patterns can be efficiently grouped, significantly improving execution efficiency and reducing latency. Furthermore, they enable the overlapping of computations, crucial for the high-performance requirements of modern hardware designs.

In load-store architectures, the Spiral framework treats data permutations in terms of memory indices, allowing distinct access patterns at each computational stage and combining initial permutations with subsequent stages. This contrasts with streaming architectures, where permutations are part of the ongoing data stream and often require explicit operations or sequences. The load-store approach facilitates different permutations and optimizations by focusing on memory indexing, which is particularly advantageous for optimizing iterative memory operations.

The proposed approach advances load-store architecture designs by introducing a pattern-based method for optimizing imperfect loop nest programs, extending Spiral's DSLs for hardware generation, and implementing a framework that efficiently models dual-ported memory. Key contributions include extending the Σ-OL language to facilitate loop optimizations and creating a hardware-oriented code generation process that directly produces Chisel RTL code. The approach has been applied successfully to significant algorithms, including Walsh-Hadamard transforms, discrete Fourier transforms, and bitonic sorters, confirming its capability to maintain high throughput and resource optimization.

The dual-ported memory approach in Spiral's load-store architecture offers the benefit of enabling simultaneous read and write operations, facilitating the execution of complex compute patterns such as the Walsh-Hadamard transform and the discrete Fourier transform. However, this approach restricts peak processing throughput to one data word per cycle, limiting scalability. To overcome this limitation, additional parallel techniques like SIMD and multicore processing could be employed; these are compatible with the Spiral framework, allowing increased performance and resource utilization.

A major limitation of the research is its focus on algorithms with limited irregularity in the load-store architecture framework, which may restrict its applicability to the broader range of algorithms that could benefit from its flexible design properties. This focus limits the exploration of load-store architectures' full potential, especially for highly irregular algorithms that might require more advanced optimization techniques and architectures. Future developments could address this by exploring extensions that cater to irregular computing patterns or by implementing more flexible memory designs to accommodate diverse algorithmic structures.
Loop merging significantly impacts both software and hardware optimizations by minimizing memory data roundtrips and reducing latency in task completion. In Spiral, merging is achieved using rewrite rules that combine computations into a single iterative loop while simplifying index mapping functions, which leads to more efficient buffer utilization and improved performance. This optimization is critical for complex algorithms like FFTs, where permutation and diagonal operators are frequently reused, minimizing redundant operations and enhancing overall execution speed.
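The effect of loop merging can be illustrated with a toy example (hypothetical Python, not Spiral's generated code): composing the permutation's index function into the compute loop removes the intermediate buffer and one memory roundtrip.

```python
# Before merging: two loops, with an intermediate buffer t between them.
def two_pass(x, perm, f):
    t = [x[perm[i]] for i in range(len(x))]    # pass 1: permute into a temp
    return [f(t[i]) for i in range(len(x))]    # pass 2: pointwise compute

# After merging: one loop; the gather index perm[i] is folded into the
# compute loop, so no intermediate buffer is ever materialized.
def merged(x, perm, f):
    return [f(x[perm[i]]) for i in range(len(x))]

x = [3, 1, 4, 1]
perm = [2, 0, 3, 1]
f = lambda v: v + 10
assert merged(x, perm, f) == two_pass(x, perm, f)
```

In hardware, the merged form corresponds to driving the read address of the data buffer directly with the simplified index function, rather than spending a full pass (and a buffer) on the permutation alone.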