PIPELINING: 5-STAGE PIPELINE
CS/ECE 6810: Computer Architecture
Single-cycle RISC Architecture
Example: simple MIPS architecture
🞑 Critical path includes all of the processing steps
[Figure: single-cycle datapath with a Controller, PC, Inst. Memory, Register File, ALU, and Data Memory; the processing steps are Inst. Fetch, Inst. Decode, Execute, Memory, and Write Back.]
Single-cycle RISC Architecture
Example program
    AND R1,R2,R3
    XOR R4,R2,R3
    SUB R5,R1,R4
    ADD R6,R1,R4
    MUL
🞑 CT=6ns; CPU Time = 5 x 1 x 6ns = 30ns
CPU Time = IC x CPI x CT
    (IC: instruction count; CPI: cycles per instruction; CT: cycle time)
How to improve?
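The CPU time formula above can be sketched in a few lines. The function name `cpu_time_ns` is an illustrative choice, not something from the slides.

```python
def cpu_time_ns(instruction_count, cycles_per_instruction, cycle_time_ns):
    """CPU Time = IC x CPI x CT, with the cycle time given in nanoseconds."""
    return instruction_count * cycles_per_instruction * cycle_time_ns


# Single-cycle design from the slide: 5 instructions, CPI = 1, CT = 6 ns.
single_cycle_ns = cpu_time_ns(5, 1, 6)
print(single_cycle_ns)  # 30
```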
Reusing Idle Resources
Each processing step finishes in a fraction of a cycle
🞑 Idle resources can be reused for processing the next instructions
[Figure: the single-cycle datapath (PC, Inst. Memory, Register File, ALU, Data Memory) divided into Inst. Fetch, Inst. Decode, Execute, Memory, and Write Back steps.]
Pipelined Architecture
Five stage pipeline
🞑 Critical path determines the cycle time
[Figure: five-stage pipelined datapath annotated with stage delays: Inst. Fetch 1.5ns, Inst. Decode 1.05ns, Execute 1.25ns, Memory 1.5ns, Write Back 0.7ns.]
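A minimal sketch of how the cycle time follows from the stage delays, assuming the figure's delays map to the stages as Fetch 1.5ns, Decode 1.05ns, Execute 1.25ns, Memory 1.5ns, Write Back 0.7ns, and ignoring any pipeline register overhead. Delays are stored in picoseconds so the arithmetic stays in integers; the names are illustrative.

```python
# Stage delays in picoseconds, as read off the figure (assumed mapping).
STAGE_DELAY_PS = {
    "fetch": 1500,
    "decode": 1050,
    "execute": 1250,
    "memory": 1500,
    "write_back": 700,
}


def single_cycle_ct_ps(delays):
    """Single-cycle design: the critical path covers every processing step."""
    return sum(delays.values())


def pipelined_ct_ps(delays):
    """Pipelined design: the slowest stage sets the cycle time."""
    return max(delays.values())


print(single_cycle_ct_ps(STAGE_DELAY_PS))  # 6000 ps = 6 ns
print(pipelined_ct_ps(STAGE_DELAY_PS))     # 1500 ps = 1.5 ns
```

Under this mapping the five delays add up to the 6ns single-cycle CT used in the earlier example.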
Pipelined Architecture
Example program
    AND R1,R2,R3
    XOR R4,R2,R3
    SUB R5,R1,R4
    ADD R6,R1,R4
    MUL
🞑 CT=1.5ns; CPU Time = 5 x 5 x 1.5ns = 37.5ns > 30ns
    (CPI = 5 if each instruction passes through all five stages before the next one starts)
    WORSE!!
CPU Time = IC x CPI x CT
Pipelined Architecture
Example program
    AND R1,R2,R3
    XOR R4,R2,R3
    SUB R5,R1,R4
    ADD R6,R1,R4
    MUL
🞑 CT=1.5ns; CPU Time = 9 x 1 x 1.5ns = 13.5ns
    (9 cycles = 5 cycles for the first instruction + 1 cycle for each of the other 4)
CPU Time = IC x CPI x CT
What is the cost of pipelining?
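The 9-cycle figure can be reproduced with a short sketch, assuming an ideal pipeline with no stalls: the first instruction needs one cycle per stage, and every later instruction finishes one cycle after its predecessor. `pipelined_cycles` is an illustrative name.

```python
def pipelined_cycles(num_instructions, num_stages):
    """Cycles to finish a program on an ideal (stall-free) pipeline."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


CT_NS = 1.5  # pipelined cycle time from the slide

cycles = pipelined_cycles(5, 5)
print(cycles)              # 9
print(cycles * CT_NS)      # 13.5 (overlapped execution)
print(5 * 5 * CT_NS)       # 37.5 (no overlap: CPI = 5)
```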
Pipelining Technique
Improving throughput at the expense of latency
🞑 Delay: D = T + nδ
🞑 Throughput: IPS = n/(T + nδ)
    (T: total combinational logic delay; n: number of pipeline stages; δ: delay added per stage; in the example T = 30 and δ = 1)
Stages (n)   Logic critical path delay per stage   D    IPS
1            30                                    31   1/31
2            15                                    32   2/32
3            10                                    33   3/33
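The three cases above can be checked with a small sketch of the two formulas, taking T = 30 and δ = 1 as implied by the example values. `Fraction` keeps the throughput exact, so 2/32 and 3/33 print in reduced form.

```python
from fractions import Fraction


def delay(total_logic_delay, stages, overhead):
    """D = T + n * delta."""
    return total_logic_delay + stages * overhead


def throughput(total_logic_delay, stages, overhead):
    """IPS = n / (T + n * delta), as an exact fraction."""
    return Fraction(stages, delay(total_logic_delay, stages, overhead))


for n in (1, 2, 3):
    print(n, delay(30, n, 1), throughput(30, n, 1))
# 1 31 1/31
# 2 32 1/16
# 3 33 1/11
```

Each added stage raises the delay by δ while the throughput keeps growing, which is the latency-for-throughput trade named in the slide title.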
Pipelining Latency vs. Throughput
Theoretical delay and throughput models for perfect pipelining
[Figure: Relative Performance (y-axis) versus Number of Pipeline Stages (x-axis, 0 to 200), with one curve for Delay (D) and one for Throughput (IPS).]
Five Stage MIPS Pipeline
Simple Five Stage Pipeline
A pipelined load-store architecture that processes up to one instruction per cycle
[Figure: five-stage datapath (PC, Inst. Memory, Register File, ALU, Data Memory) with stages Inst. Fetch, Inst. Decode, Execute, Memory, and Write Back.]
Instruction Fetch
Read an instruction from memory (I-Memory)
🞑 Use the program counter (PC) to index into the I-Memory
🞑 Compute NPC by incrementing current PC
    What about branches?
Update pipeline registers
🞑 Write the instruction into the pipeline registers
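The fetch steps above can be sketched as a single function. This is a simplified model, not the lecture's implementation: the I-Memory is assumed to be a Python list of 32-bit words, the PC is a byte address incremented by 4 (one 32-bit instruction), and the `branch_taken` / `branch_target` inputs are hypothetical names for the branch path.

```python
def fetch(pc, instruction_memory, branch_taken=False, branch_target=0):
    """Model one fetch cycle; returns (pipeline_register, next_pc)."""
    # Index the I-Memory with the PC (byte address -> word index).
    instruction = instruction_memory[pc // 4]
    # Compute NPC by incrementing the current PC.
    npc = pc + 4
    # A taken branch overrides the sequential NPC.
    next_pc = branch_target if branch_taken else npc
    # Write the instruction (and NPC) into the pipeline register.
    pipeline_register = {"instruction": instruction, "npc": npc}
    return pipeline_register, next_pc


imem = [0x00221820, 0x2022FFFC]  # two example instruction words
reg, next_pc = fetch(0, imem)
print(hex(reg["instruction"]), reg["npc"], next_pc)  # 0x221820 4 4
```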
Instruction Fetch
[Figure: fetch stage datapath. Labels: clock, PC, Instruction Memory, + 4 adder, NPC, Branch Target, Pipeline Register; three paths are marked P1, P2, and P3.]
NPC = PC + 4
    Why increment by 4?
Critical Path = Max{P1, P2, P3}
Instruction Decode
Generate control signals from the opcode bits
Read source operands from the register file (RF)
🞑 Use the specifiers for indexing RF
    How many read ports are required?
Update pipeline registers
🞑 Send the operand and immediate values to the next stage
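As a sketch of the decode steps, the fields of a 32-bit MIPS instruction can be extracted with shifts and masks, and the two source specifiers (rs, rt) used to index the register file. The register file is modeled as a plain list and the function names are illustrative.

```python
def sign_extend_16(value):
    """Interpret a 16-bit field as a signed two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def decode_fields(instruction):
    """Split a 32-bit MIPS instruction word into its fields."""
    return {
        "opcode": (instruction >> 26) & 0x3F,
        "rs": (instruction >> 21) & 0x1F,
        "rt": (instruction >> 16) & 0x1F,
        "rd": (instruction >> 11) & 0x1F,
        "funct": instruction & 0x3F,
        "imm": sign_extend_16(instruction & 0xFFFF),
    }


def read_operands(fields, register_file):
    """Read both source operands in the same cycle (two reads)."""
    return register_file[fields["rs"]], register_file[fields["rt"]]


fields = decode_fields(0x00221820)  # add $3, $1, $2
print(fields["opcode"], fields["rs"], fields["rt"], fields["rd"])  # 0 1 2 3
```

Because an instruction can name two source registers, this model performs two register reads per decode.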
[Figure: Instruction Decode datapath between two pipeline registers. Labels: Instruction, decode, ctrl, Register File, reg, NPC, target.]
Execute Stage
Perform ALU operation
🞑 Compute the result of ALU
    Operation type: control signals
    First operand: contents of a register
    Second operand: either a register or the immediate value
🞑 Compute branch target
    Target = NPC + immediate
Update pipeline registers
🞑 Control signals, branch target, ALU results, and destination
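A minimal sketch of the execute stage as described above. The dictionary keys (`alu_op`, `use_imm`, `reg_a`, `reg_b`, and so on) are hypothetical names for the values carried in the pipeline register, and the immediate is assumed to be already scaled to a byte offset so that Target = NPC + immediate holds as written.

```python
MASK32 = 0xFFFFFFFF  # keep results within 32 bits

# Operation type is selected by a control signal (hypothetical encoding).
ALU_OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "and": lambda a, b: a & b,
    "xor": lambda a, b: a ^ b,
}


def execute(id_ex):
    """Compute the ALU result and branch target for one instruction."""
    # Second operand: either a register or the immediate value.
    second = id_ex["imm"] if id_ex["use_imm"] else id_ex["reg_b"]
    alu_result = ALU_OPS[id_ex["alu_op"]](id_ex["reg_a"], second) & MASK32
    # Branch target: Target = NPC + immediate.
    target = (id_ex["npc"] + id_ex["imm"]) & MASK32
    # Values written into the next pipeline register.
    return {
        "alu_op": id_ex["alu_op"],
        "target": target,
        "alu_result": alu_result,
        "dest": id_ex["dest"],
    }


out = execute({"alu_op": "sub", "use_imm": False, "reg_a": 7, "reg_b": 5,
               "imm": 16, "npc": 8, "dest": 5})
print(out["alu_result"], out["target"])  # 2 24
```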
[Figure: Execute Stage datapath between two pipeline registers. Labels: NPC, reg, ctrl, ALU, Res, Target.]
Memory Access
Access data memory
🞑 Load/store address: ALU outcome
🞑 Control signals determine read or write access
Update pipeline registers
🞑 ALU results from execute
🞑 Loaded data from D-Memory
🞑 Destination register
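The memory access stage can be sketched the same way. The D-Memory is modeled as a dictionary from address to word, and the `mem_read`, `mem_write`, and `store_data` keys are hypothetical names for the control signals and store value.

```python
def memory_access(ex_mem, data_memory):
    """Perform the load or store selected by the control signals."""
    address = ex_mem["alu_result"]  # load/store address is the ALU outcome
    loaded_data = None
    if ex_mem["mem_read"]:
        loaded_data = data_memory.get(address, 0)
    elif ex_mem["mem_write"]:
        data_memory[address] = ex_mem["store_data"]
    # Values written into the next pipeline register.
    return {
        "alu_result": ex_mem["alu_result"],
        "loaded_data": loaded_data,
        "dest": ex_mem["dest"],
    }


dmem = {100: 42}
load = memory_access({"alu_result": 100, "mem_read": True, "mem_write": False,
                      "store_data": 0, "dest": 1}, dmem)
print(load["loaded_data"])  # 42
```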
[Figure: Memory Access datapath between two pipeline registers. Labels: Target, Res, addr, data, Data Memory, reg, ctrl.]
Register Write Back
Update register file
🞑 Control signals determine if a register write is needed
🞑 Only one write port is required
    Write the ALU result to the destination register, or
    Write the loaded data into the register file
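A short sketch of the write-back choice described above. The `reg_write` and `mem_to_reg` keys are hypothetical control-signal names, and the register file is a plain list.

```python
def write_back(mem_wb, register_file):
    """Write at most one value into the register file (one write port)."""
    if not mem_wb["reg_write"]:
        return  # e.g. a store: no register write is needed
    if mem_wb["mem_to_reg"]:
        value = mem_wb["loaded_data"]   # load: write the loaded data
    else:
        value = mem_wb["alu_result"]    # ALU instruction: write the ALU result
    register_file[mem_wb["dest"]] = value


registers = [0] * 8
write_back({"reg_write": True, "mem_to_reg": False, "alu_result": 2,
            "loaded_data": None, "dest": 5}, registers)
print(registers[5])  # 2
```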
Five Stage Pipeline
Ideal pipeline: IPC=1
🞑 Are there enough resources to keep the pipeline stages busy all the time?
[Figure: five-stage pipeline (Inst. Fetch, Decode, Execute, Memory, Writeback) showing the resources used by each stage: PC, + 4 adder and Mem; Reg. File; ALU; Mem; Reg. File.]
Pipeline Hazards
Structural hazards: multiple instructions compete for the same resource
Data hazards: a dependent instruction cannot proceed because it needs a value that hasn’t been produced
Control hazards: the next instruction cannot be fetched because the outcome of an earlier branch is unknown
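To make the data hazard definition concrete, here is a small sketch that lists which instructions of the earlier example program read a register written by a previous instruction. It only finds the dependences; whether each one actually stalls the pipeline depends on details not covered here. The tuple layout and function name are illustrative.

```python
# (opcode, destination, sources) for the example program used earlier.
PROGRAM = [
    ("AND", "R1", ("R2", "R3")),
    ("XOR", "R4", ("R2", "R3")),
    ("SUB", "R5", ("R1", "R4")),
    ("ADD", "R6", ("R1", "R4")),
]


def raw_dependences(program):
    """Return (producer_index, consumer_index, register) triples."""
    dependences = []
    last_writer = {}  # register name -> index of the latest instruction writing it
    for index, (_, dest, sources) in enumerate(program):
        for src in sources:
            if src in last_writer:
                dependences.append((last_writer[src], index, src))
        last_writer[dest] = index
    return dependences


print(raw_dependences(PROGRAM))
# [(0, 2, 'R1'), (1, 2, 'R4'), (0, 3, 'R1'), (1, 3, 'R4')]
```

SUB and ADD both need R1 from AND and R4 from XOR, so they are the dependent instructions in this program.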
Structural Hazards
1. Unified memory for instruction and data
2. Register file with shared read/write access ports
    R1 ← Mem[R2]
    R3 ← Mem[R20]
    R6 ← R4-R5
    R7 ← R1+R0
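Both structural hazards can be located in this four-instruction program with a simple timing model. The sketch assumes one instruction enters the pipeline per cycle with no stalls, so instruction i (0-based) is in stage s during cycle i + 1 + position(s); it also assumes every instruction reads registers in ID and writes one in WB. All names are illustrative.

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")

# (instruction text, accesses data memory?)
PROGRAM = [
    ("R1 <- Mem[R2]", True),
    ("R3 <- Mem[R20]", True),
    ("R6 <- R4-R5", False),
    ("R7 <- R1+R0", False),
]


def stage_cycle(index, stage):
    """1-based cycle in which instruction `index` occupies `stage` (no stalls)."""
    return index + 1 + STAGES.index(stage)


def overlaps(program, stage_a, stage_b, uses_a=lambda instr: True):
    """(cycle, i, j) where instruction i is in stage_a while j is in stage_b."""
    found = []
    for i, instr in enumerate(program):
        if not uses_a(instr):
            continue
        cycle = stage_cycle(i, stage_a)
        for j in range(len(program)):
            if stage_cycle(j, stage_b) == cycle:
                found.append((cycle, i, j))
    return found


# 1. Unified memory: a load in MEM competes with a fetch in IF.
memory_conflicts = overlaps(PROGRAM, "MEM", "IF", uses_a=lambda instr: instr[1])
# 2. Shared register file port: a write in WB competes with a read in ID.
register_conflicts = overlaps(PROGRAM, "WB", "ID")

print(memory_conflicts)    # [(4, 0, 3)]
print(register_conflicts)  # [(5, 0, 3)]
```

In this model the first load's data access collides with the fetch of the fourth instruction in cycle 4, and its register write collides with the fourth instruction's register read in cycle 5.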