Introduction to Systolic Design
Wayne Luk wl@[Link] Imperial College March 2002
wl 3/2002 1
Overview
history and motivation for systolic arrays
systolic array features
systolic design techniques
- composing regular components
- retiming
- slowdown
- clustering
- bit-level design
- systolic state machines
topics not covered
further reading
History and motivation
motivations:
- improve performance of special-purpose systems, e.g. maximise processing per memory access
- reduce their design and implementation costs, e.g. exploit latest technology: FPGAs
Systolic arrays
introduced by Kung and Leiserson, 1978
designs for matrix computations
illustrated by snapshots of operation
Systolic: rhythmical contraction; describes the contraction of the heart forcing blood onward and keeping up the circulation.
Array: multiple PEs to maximise processing per memory access.
[Figure: Memory (M) and Processing Element (PE): a single PE per memory, and a chain M, PE, PE, PE]
Systolic array features
multiple use of each input data item
extensive concurrency; usually by pipelining
a few types of simple cells
simple and regular data and control flow
these result in:
- simple and reduced costs
- high performance
- modular and expandable
Field-Programmable Gate Arrays (FPGAs)
combine software flexibility and hardware performance
off-the-shelf parts, factory-tested, many varieties
matrix of cells, each has
- programmable function unit
- programmable connections: nearest neighbour / local / global routing
technology: 10 million-gate FPGA, GHz clock speed
good platform for implementing systolic designs
- array structure
- increased flexibility, adaptable at run time
- reduced design/implementation time and cost
Applications
signal, image, video, multimedia, numerical processing
- add, multiply, divide, square root... in various number systems
- recursive and non-recursive, linear and non-linear filtering
- DFT, FFT, FHT, DCT, DWT, FNT
- matrix and graph algorithms, algebraic path problem
- neural nets, motion estimation, shading, texture mapping
non-numerical processing
- sorting, searching, matching, priority queue, LRU
- dynamic programming
- data compression and encryption
- discrete event simulation
- database operations
Array shapes: linear and rectangular
[Figure: linear array (a chain of cells) and rectangular array of R cells]
Hexagonal array
[Figures, slides 9 and 10: hexagonal arrays of R cells]
Triangular-shaped arrays
[Figure: triangular-shaped arrays of R cells]
Example: matrix vector multiplier
Ax = y, y_i = a_{i0} x_0 + a_{i1} x_1 + a_{i2} x_2 + a_{i3} x_3
D = delay = register
constant multiplier: [Figure: cell a_{ij} with input x_i and output a_{ij} x_i]
[Figure: 4 x 4 array of constant multipliers a_{00} ... a_{33}; inputs x_0 ... x_3 pass through registers D, the sums start at 0 and leave as y_0 ... y_3]
does it work? are there alternatives?
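The arithmetic of the matrix vector multiplier can be checked with a small sketch. It only models the sums (each column j adds a_{ij} x_j to a partial result that starts at 0) and ignores the registers D; the function name and the sample values are illustrative assumptions, not part of the slides.

```python
def matvec_by_columns(a, x):
    """Accumulate y = A x one column at a time.

    Column j adds a[i][j] * x[j] to the partial sum for y[i],
    which starts at 0, mirroring the 0 fed into the array.
    """
    n_rows = len(a)
    partial = [0] * n_rows
    for j, xj in enumerate(x):
        partial = [partial[i] + a[i][j] * xj for i in range(n_rows)]
    return partial


# Hypothetical 4 x 4 example data.
a = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
x = [1, 0, 2, 1]
y = matvec_by_columns(a, x)
print(y)
```

Whether the registered array produces the same y values, and when, is the question the slide asks; this sketch only gives the reference result to compare against.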
Example: bit-level convolver
[Figure: bit-level convolver: coefficients w0 ... w3 and input/output pairs x0 y0, x1 y1 pass through a grid of CbCellc cells (with a CbCellb cell) and registers D; CbCellc: fadd and]
Systolic design techniques
systematic design of systolic arrays
- transform obvious design to efficient but less obvious designs
- circuit-oriented block diagram approach
- simple ideas behind design automation algorithms
techniques
- composing regular components: focus on desired behaviour
- retiming: relocate latches in a circuit
- slowdown: replicate latches
- clustering: arrange hierarchy for pipelining
- bit-level design: useful for hardware libraries
- systolic state machines: pipeline state transition functions
illustrated by simple examples
- convolution, matrix vector multiplication, sorting
Convolver: composing regular components
Cu0: functional description
y_t = \sum_{0 \le i < N} x_{t-i} w_i = x_t w_0 + x_{t-1} w_1 + ...
[Figure: Cu0 with N = 4: x_t, x_{t-1}, x_{t-2}, x_{t-3} obtained through registers D, coefficients w0 ... w3, a row of four mac cells starting from 0 and producing y]
obvious but inefficient?
Pipelining
clock speed depends on longest combinational path
[Figure: data passing through a circuit to result]
Pipelining
insert latches between circuits to increase throughput
[Figure: data and clock driving latched circuits that produce result]
but may also increase
- area, power consumption, latency
retiming: graphical method, introduce/relocate latches to improve performance/regularity and preserve behaviour
may apply this method several times; avoid overkill
Retime a chain
idea: introduce anti-latch (D^-1) which cancels effect of a latch (D)
OK to have anti-latch at inputs or outputs
graphical contours linking introduction of latch/anti-latch
pre-condition: given [Figure: condition on a single cell R], then [Figure: chain R R R with anti-latches D^-1]
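One way to read the retiming pre-condition (stated later in the slides as R = D^-1 ; R ; D) is that a latch can move from the input of a cell to its output without changing the stream produced. The sketch below checks this for one case. It assumes R is purely combinational and that the relocated latch is initialised to R applied to the old initial value; `delay`, `apply_cell` and `double_plus_one` are hypothetical names chosen for illustration.

```python
def delay(stream, init):
    """Latch D: output the previous input; `init` is the initial register content."""
    return [init] + list(stream[:-1])


def apply_cell(f, stream):
    """Combinational cell R applied to every value of a stream."""
    return [f(v) for v in stream]


def double_plus_one(v):
    # Arbitrary stateless function standing in for R (an assumption).
    return 2 * v + 1


xs = [3, 1, 4, 1, 5]
# Latch on the input of R.
latch_before = apply_cell(double_plus_one, delay(xs, init=0))
# Latch moved to the output of R, with its initial value adjusted.
latch_after = delay(apply_cell(double_plus_one, xs), init=double_plus_one(0))
print(latch_before, latch_after)
```

If R had internal state, or the initial value were not adjusted, the two streams could differ in their first elements, which is why the pre-condition matters.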
Retime a row
pre-condition: given [Figure: condition on a single cell R with D and D^-1], then [Figure: row R R R with D and D^-1 on its connections]
Remove triangular-shaped array of registers
[Figure: Cu0 with anti-latches D^-1 introduced against the triangular array of registers D on the x_t inputs of the four mac cells with coefficients w0 ... w3]
Cu0: functional description
y_t = \sum_{0 \le i < N} x_{t-i} w_i = x_t w_0 + x_{t-1} w_1 + ...
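The functional description of Cu0 can be sketched directly as a reference model against which the retimed designs can be compared. The sketch assumes x_t = 0 for t < 0, which the slides do not state; the function name and sample data are illustrative.

```python
def convolve_cu0(xs, ws):
    """y_t = sum over 0 <= i < N of x_{t-i} * w_i.

    Assumption: samples before t = 0 are treated as 0.
    """
    n = len(ws)
    ys = []
    for t in range(len(xs)):
        acc = 0  # the 0 fed into the first mac
        for i in range(n):
            if t - i >= 0:
                acc += xs[t - i] * ws[i]  # one mac per coefficient
        ys.append(acc)
    return ys


ys = convolve_cu0([1, 2, 3, 4, 5], [1, 10, 100, 1000])
print(ys)
```

Each output passes through all N multiply-accumulate steps in one go, which is the long combinational path that the following transformations aim to shorten.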
Retime top part of convolver
[Figure: anti-latches D^-1 introduced above the row of four mac cells (coefficients w0 ... w3) against the registers D on the x_t inputs]
Uni-directional flow convolver
[Figure: design Cu1: a chain of CuCell1 cells, each with a coefficient w_i, a register D and a mac; inputs x and 0, output y]
semi-systolic
regular connection
one type of cell: CuCell1
speed? impact of array size? concurrency?
Improving speed: retime the macs
[Figure: registers D and anti-latches D^-1 introduced around the mac cells with coefficients w3, w2, w1, w0]
Retime the macs: pipelined bottom
[Figure: design Cu2 built from CuCell2 cells: mac cells with coefficients w3 ... w0 and registers D; inputs x and 0, output y]
Retime both top and bottom
[Figure: design Cu3 built from CuCell3 cells with coefficients w0 ... w3, registers D and anti-latches D^-1]
Uni-directional flow systolic convolver
[Figure: design Cu3 built from CuCell3 cells with coefficients w0 ... w3, registers D and anti-latches D^-1]
Pipeline the multiplier and adder
[Figure: design Cu4 built from CuCell4 cells with coefficients w0 ... w3, input x, registers D and anti-latches D^-1]
Design tree
relate designs by transformation
- root: obvious but inefficient design
- leaves: efficient but not obvious designs
convolver example
[Figure: design tree rooted at Cu0 with branches labelled uni-directional flow data (Cu1; pipeline between mac: Cu2, Cu3; pipeline within mac: Cu4), counter-flow data, and reverse coefficients]
advantages and disadvantages?
Characterise designs
express features of a composite design in terms of the number and features of its components
number of cells and registers: impact on size and power consumption and latency
latency: determined by the path from input to output which has the maximum number of registers
critical path: determined by the path from input to output which has the largest combinational delay
e.g. Cu1: N latches, N-1 cycles of latency, Tmult + N Tadd critical path
assumptions: negligible effect of wires and word-length growth
Pipelining a grid
[Figures, slides 30 to 34: a rectangular grid of R cells; registers D and anti-latches D^-1 are introduced around the cells in successive retiming steps]
Combinational matrix vector multiplier
Ax = y, y_i = a_{i0} x_0 + a_{i1} x_1 + a_{i2} x_2 + a_{i3} x_3
unpipelined constant multiplier: [Figure: cell a_{ij} with input x_i and output a_{ij} x_i]
[Figure: 4 x 4 array of constant multipliers a_{00} ... a_{33} with inputs x_0 ... x_3, initial sums 0 0 0 0 and outputs y_0 ... y_3]
Matrix vector multiplier: contours
Ax = y, y_i = a_{i0} x_0 + a_{i1} x_1 + a_{i2} x_2 + a_{i3} x_3
[Figure: the same array with retiming contours drawn]
Matrix vector multiplier: adding registers
Ax = y, D = delay = register
[Figure: the array with registers D added on the x inputs and between the constant multipliers]
Pipelined matrix vector multiplier
Ax = y, D = delay = register
[Figure: pipelined array of constant multipliers a_{00} ... a_{33} with registers D; inputs x_0 ... x_3 and 0 0 0 0, outputs y_0 ... y_3]
Uni-directional flow systolic convolver
[Figure: design Cu3 built from CuCell3 cells with coefficients w0 ... w3, registers D and anti-latches D^-1]
Convolver with counter-flowing data
[Figure: design Cb1: coefficients w0 ... w3, registers D, x and y passing through a row of mac" cells]
Convolver with counter-flowing data
[Figure: registers D and anti-latches D^-1 introduced on the connections of the mac" cells with coefficients w0 ... w3; x and y as before]
Convolver with counter-flowing data
[Figure: design Cb2: row of mac" cells with coefficients w0 ... w3]
still not fully pipelined!
Derive fully-pipelined convolver: slowdown
[Figure: row of mac" cells with additional registers D]
Slowdown
n-slow: can replace each latch by n latches in series, provided that (n-1) extra values are inserted between successive inputs; similarly for outputs
graphically: introduce additional D or D^-1 by replacing each D (or D^-1) by n copies in series
interpretation:
- interleave n data streams/computations concurrently
- sample output every n cycles to get result of each computation
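The interleaving interpretation of n-slow can be illustrated on a small state machine rather than the convolver itself. The sketch below uses a running-sum circuit with one state latch (an assumed example, not taken from the slides), replaces that latch by n in series, and feeds it n interleaved streams.

```python
def run_accumulator(inputs, n_slow=1):
    """Running-sum machine whose single state latch is replaced
    by n_slow latches in series (all initialised to 0)."""
    regs = [0] * n_slow
    outputs = []
    for x in inputs:
        y = regs[-1] + x          # adder reads the last latch in the chain
        outputs.append(y)
        regs = [y] + regs[:-1]    # clock edge: shift the new sum into the chain
    return outputs


def interleave(*streams):
    """Merge streams element by element: a0, b0, a1, b1, ..."""
    return [v for group in zip(*streams) for v in group]


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
original_a = run_accumulator(a)
original_b = run_accumulator(b)
slowed = run_accumulator(interleave(a, b), n_slow=2)
print(slowed[0::2], slowed[1::2])
```

Sampling every second output of the 2-slow circuit recovers the result of each of the two independent computations, as the interpretation above describes.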
Retime after slowdown
[Figure: registers D and anti-latches D^-1 introduced around the row of mac" cells after slowdown]
Fully-pipelined convolver
[Figure: design Cb3 built from CbCell3 cells: mac" cells with registers D, anti-latches D^-1 remaining at the boundary]
Pipelining may become less effective
[Figure: throughput (MHz) against degree of pipelining (1/K), from non-pipelined (K = N) to fully pipelined (K = 1); the theoretical curve runs from 1 / (N Tcell) to 1 / Tcell, the actual curve from 1 / (N Tcell + Tlatch) to 1 / (Tcell + Tlatch)]
input-output speed limit
clock skew
clock rise and fall times significant
control degree of pipelining
Controlled pipelining: clustering
cluster elements into groups, and retime the groups
[Figure: R^4 = (R^2 ; D)^2 ; D^-2]
vary size of groups to control degree of pipelining
- size of each group K, degree of pipelining 1/K
reason about regular patterns of pipelining
R^{KN} = (R^K ; D)^N ; D^{-N} (given R = D^-1 ; R ; D)
KN = M
- fully-pipelined: K = 1, N = M
- non-pipelined: K = M, N = 1
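The trade-off behind clustering can be explored numerically. The sketch below enumerates the group sizes K that divide M cells into N groups, and estimates throughput as 1 / (K Tcell + Tlatch). That formula is an assumption interpolated from the two end points shown on the throughput plot (K = 1 and the non-pipelined case); the delay values are hypothetical.

```python
def throughput(k, t_cell, t_latch):
    """Assumed clock rate when K cells sit between latches: 1 / (K*Tcell + Tlatch)."""
    return 1.0 / (k * t_cell + t_latch)


def cluster_sizes(m):
    """All (K, N) with K * N = M: N groups of K cells each."""
    return [(k, m // k) for k in range(1, m + 1) if m % k == 0]


M = 8
T_CELL, T_LATCH = 2.0, 0.5  # hypothetical delays in arbitrary time units

options = [(k, n, throughput(k, T_CELL, T_LATCH)) for k, n in cluster_sizes(M)]
for k, n, rate in options:
    print(f"K={k} N={n} throughput={rate:.3f}")
```

With these numbers the fully-pipelined choice K = 1 gives the highest estimated rate, but it also uses the most latches; the effects listed on the slide (clock skew, rise and fall times, input-output limits) are not modelled here.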
Convolver with counter-flowing data
[Figure: row of mac" cells with coefficients w0 ... w3 and registers D; inputs x and 0, output y]
Partially-pipelined designs
[Figure: designs labelled Cb4, Cb5 and Cb1: rows of mac" cells with registers D]
boundary conditions not shown
Clustering rectangular array
[Figure: 4 x 4 array of R cells grouped into clusters]
Retime around the contours
[Figure: the 4 x 4 array of R cells with retiming contours]
Result
[Figures, slides 53 and 54: pairs of R cells with registers D and anti-latches D^-1]
Retime through the contours
[Figure: the 4 x 4 array of R cells with retiming contours]
Result
[Figures, slides 56 and 57: pairs of R cells with registers D and anti-latches D^-1]
Lead
[Figure: 4 x 4 array with Q cells on one diagonal: Q R R R / R Q R R / R R Q R / R R R Q]
Trail
[Figure: 4 x 4 array of R and Q cells]
Pipelined rectangular array
[Figure: array of alternating Q and R cells, with Q drawn in terms of R and registers D]
Bit-level convolver: 1-bit w and x
[Figure: word-level cell with w, x, y, + and register D becomes a bit-level row of A and H cells]
Bit-level convolver: 1-bit w
[Figure: word-level cell with w, x, y, + and register D becomes a row of A and F cells with registers D and input 0]
Bit-level convolver: increase regularity
[Figure: cell CbCell (w, x, y, + and wx) used in place of separate A and F cells]
Refinement to bit level
implement word level cell (assume single bit w)
[Figure: 4 x 4 grid of CbCellc cells with inputs xs, w and 0, output ys, and registers D]
where CbCellc: fadd and
Pipelining strategy 1
(1) slowdown by 2 (double all latches)
(2) retime to get fully-pipelined circuit
[Figure: grid of CbCellc cells (CbCellc: fadd and) with inputs xs, w and 0, output ys, and pairs of registers D D]
Design Cbb1
CbCellc: fadd and
y_t = \sum_{0 \le i \le n} w_i x_{t,i}
[Figure: design Cbb1: coefficients w0 ... w3, inputs x0 ... x3, outputs y0 ... y3, grid of CbCellc cells separated by pairs of registers D D]
Pipelining strategy 2
pipeline clusters of K by K cells (K > 1)
e.g. K = 2
[Figure: grid of CbCellc cells with inputs xs, w and 0, output ys, and registers D between clusters]
Design Cbb2
CbCellc: fadd and
[Figure: design Cbb2: coefficients w0 ... w3, inputs x0, x1, outputs y0, y1, CbCellc and CbCellb cells with registers D between clusters]
Summary: optimising digital designs
useful rules:
- retiming: can add a latch (D) at all inputs, provided an anti-latch (D^-1) is added at all outputs
- use clustering to control degree of pipelining
- n-slow: can replace each latch by n latches in series, provided that (n-1) additional values are inserted between successive inputs; similarly for outputs
Performance and resource usage
----------------------------------------------------------------------------
Design   min clock period   latency   number of latches
----------------------------------------------------------------------------
Cu2      Tm + Ta            N-1       N(N+1) / 2
Cu3      Tm + Ta            2N-1      N(N+5) / 2
Cu4      max(Tm, Ta)        2N        N(N+7) / 2
----------------------------------------------------------------------------
Note that the minimum clock period for Cu2 should be (N-1) Tp + Tm + Ta, where Tp is the delay across the wiring cell
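The table entries can be evaluated for a concrete array size to compare the designs. The sketch below simply codes the three rows as written; the values of N, Tm and Ta are hypothetical, and the wiring-cell term (N-1) Tp mentioned in the note for Cu2 is left out.

```python
def performance(n, t_m, t_a):
    """Evaluate the table's formulas for an N-tap convolver.

    Returns {design: (min clock period, latency in cycles, number of latches)}.
    The (N-1)*Tp wiring term noted for Cu2 is not included.
    """
    return {
        "Cu2": (t_m + t_a, n - 1, n * (n + 1) // 2),
        "Cu3": (t_m + t_a, 2 * n - 1, n * (n + 5) // 2),
        "Cu4": (max(t_m, t_a), 2 * n, n * (n + 7) // 2),
    }


# Hypothetical values: 4 taps, multiplier delay 5.0, adder delay 2.0.
table = performance(n=4, t_m=5.0, t_a=2.0)
for design, (period, latency, latches) in table.items():
    print(design, period, latency, latches)
```

For these values Cu4 has the shortest clock period but the largest latency and latch count, which is the trade-off the table summarises.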
State machines
state-transition function R usually includes an output part y and a next state part s
[Figure: R with input x and output y; state s fed back through register D]
counter
sorter
priority queue
LRU processor
Simple state machines
[Figures, slides 72 to 75: a counter drawn as inc with register D0 and input 1, and as a row of hadd cells each with a register D0]
Decomposing state machines
[Figure: R drawn as R0 and R1]
Simple state machines
[Figures, slides 77 and 78: rows of hadd cells with D0 registers, then with additional D0 registers between the cells]
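The counter figures suggest an incrementer decomposed into a row of half adders, one per state bit. The sketch below models that reading: it assumes D0 denotes a register initialised to 0 and that the constant 1 is the carry into the first hadd. It does not model the extra pipeline registers of the later slides.

```python
def hadd(a, b):
    """Half adder: returns (sum, carry)."""
    return a ^ b, a & b


def step(bits):
    """One clock cycle: a carry-in of 1 ripples through one hadd per bit (LSB first)."""
    carry = 1
    next_bits = []
    for bit in bits:
        s, carry = hadd(bit, carry)
        next_bits.append(s)
    return next_bits


def value(bits):
    """Interpret the register contents as an unsigned number, LSB first."""
    return sum(bit << i for i, bit in enumerate(bits))


bits = [0, 0, 0, 0]  # four state registers, assumed initialised to 0
seen = []
for _ in range(17):
    seen.append(value(bits))
    bits = step(bits)
print(seen)
```

The carry still ripples through every hadd within one cycle, so the combinational delay grows with the number of bits; that is the motivation for pipelining the collection of small state machines.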
Example: inserter
insert an element into an ordered list to form an ordered list:
insert <3, <1, 2, 5, 6>> = <<1, 2, 3, 5>, 6>
[Figure: chain of S2 cells, each with min and max outputs, holding 1 2 5 6 with input 3]
Insertion sort
state registers initialised with +∞
load n values to be sorted cycle by cycle
load n -∞ values to extract the sorted result
[Figures, slides 80 and 81: snapshots of a chain of S2 cells with state registers Dmax while values are loaded]
Insertion sort (cont)
to extract the sorted sequence cycle by cycle, input -∞
[Figures, slides 82 and 83: snapshots of the chain of S2 cells with state registers Dmax while -∞ values are input]
Decomposing the sorter
Q1: can we avoid reloading the -∞s (and +∞s?)
Q2: can we reduce the combinational delay through S2s?
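The inserter and the insertion sort procedure can be simulated in a few lines. The sketch assumes each S2 cell keeps the smaller of its input and its state and passes the larger one to the next cell, which is consistent with the inserter example; the function names are illustrative.

```python
import math


def s2(x, s):
    """Cell S2 (assumed behaviour): keep the smaller value, pass the larger one on."""
    return min(x, s), max(x, s)


def insert(x, state):
    """Push x through the chain of S2 cells.

    Returns (new state, value leaving the end of the chain).
    """
    new_state = []
    for s in state:
        kept, x = s2(x, s)
        new_state.append(kept)
    return new_state, x


def systolic_sort(values):
    state = [math.inf] * len(values)      # state registers initialised with +infinity
    for v in values:                      # load n values, one per cycle
        state, _ = insert(v, state)
    extracted = []
    for _ in values:                      # load n -infinity values to extract
        state, out = insert(-math.inf, state)
        extracted.append(out)
    return extracted


print(insert(3, [1, 2, 5, 6]))
print(systolic_sort([5, 1, 2, 3]))
```

Under this reading the values leave the chain largest first, and every insertion passes through all the S2 cells combinationally, which is what question Q2 asks about.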
Decomposing the sorter
[Figures, slides 85 to 87: chain of S2 cells, first with one Dmax register per cell, then with two Dmax registers per cell]
Summary: systolic state machines
start with state-transition function
include loop and state registers to ensure computing the desired function
make sure that registers are initialised appropriately
decompose a large state machine into a collection of small state machines
pipeline the collection of state machines as required
Topics not covered
composite and hybrid systolic systems: ensure boundary conditions match
multi-dimensional arrays: nearest-neighbour connections in 3D
reconfigurable designs: pipeline morphing and virtual pipelines
space optimisation techniques: digit serial systolic array
implementations and platforms: iWarp, Splash, Pilchard, RC1000, and Sonic
languages and tools for systolic design and verification: Ruby, Pebble, Lava, CSP, CirCal, Alpha
Further reading
H. T. Kung. Why Systolic Architectures?, IEEE Computer, 15(1):37-46, January 1982.
- excellent introduction to systolic architectures
S.Y. Kung, VLSI Array Processors, Prentice Hall, 1988.
- comprehensive reference textbook
Proceedings of IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP). This conference began as the International Workshop on Systolic Arrays, 1986.
- research on theory and practice of systolic design
Proceedings of IEEE International Conference on Field-Programmable Custom Computing Machines.
- research on systolic systems implemented by FPGAs