Bobcat
AMD's New Low Power x86 Core Architecture
Brad Burgess, AMD Fellow
Chief Architect / Bobcat Core
August 24, 2010
1 | Bobcat | Hot Chips 2010

Two x86 Cores Tuned for Target Markets
Bulldozer: Performance & Scalability
Mainstream Client and Server Markets
Bobcat: Flexible, Low Power & Small
Low Power Markets
Small Die Area
Cloud Clients Optimized

Bobcat Design Goals
A small, efficient, low power x86 core
Excellent performance
Synthesizable with small number of custom arrays
Easily portable across process technologies

Feature Set
64-bit AMD64 x86 ISA
SIMD extensions: SSE1, SSE2, SSE3, SSSE3, SSE4A
Virtualization
Support for misaligned 128-bit data types
Instruction Based Sampling (for dynamic optimization)
C6 (with integrated power gating)

Micro-architecture Overview
Dual x86 instruction decode
Out-of-Order instruction execution
Dual COP retirement
Complex microOPs
State of the art branch prediction
Aggressive OOO load/store engine w/ hazard prediction
Advanced Virtualization w/ nested page tables, ASIDs and world switch acceleration
Low power C6 state w/ core level power gating and state save acceleration

Bobcat Micro-Architecture
[Block diagram. Labels: Branch Predictor (Branch Locator, Condition Predictor, Dynamic Target, Return Stack), 32KB ICACHE, ITLB, Fetch Queue, Dual x86 Decoder, uCode, Instr Queue, FP Decode, Int Rename, FP Rename, ROB, Scheduler (x2), FP Sched, Int PRF, FP PRF, ALU (x2), LAGU, SAGU, Mul, Table Walker, DTLB, 32KB DCACHE, LdSt Unit, Prefetch, 512KB L2CACHE, BU, MMX Alu (x2), IntMul, St Conv, FP Logical (x2), FPAdd, FPMul, To/from Northbridge]

Bobcat Micro-Architecture
Icache:
32 Kbyte
2-way set associative
64-byte line
Parity Protected
512/8 entry ITLB (4k/2m)
Fetch up to 32 bytes/cycle
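
The Icache parameters above (32 Kbyte, 2-way, 64-byte line) imply a set count and an address split. The following is a minimal sketch of that arithmetic; the function name `cache_geometry` is made up for illustration and nothing here describes how the hardware actually indexes its arrays.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes size, ways and line size are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 of the line size
    index_bits = sets.bit_length() - 1         # log2 of the set count
    return sets, offset_bits, index_bits


# Icache figures from the slide: 32 Kbyte, 2-way, 64-byte line.
icache = cache_geometry(32 * 1024, ways=2, line_bytes=64)
print(icache)  # (256, 6, 8)
```

The same function applied to the Data Cache (32 Kbyte, 8-way) and L2 (512 Kbyte, 16-way) figures in this deck gives 64 and 512 sets respectively.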

Bobcat Micro-Architecture
Branch Predictor:
Predicts up to two branches per cycle
Remembers branch instruction locations
Indirect Dynamic Address Predictor
Only necessary structures are clocked
Return Stack Address Predictor
State of the Art condition Predictor
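
As an illustration of the return stack idea named above, here is a minimal generic sketch: push the fall-through address on a call, pop it to predict a return. The class name, the `depth` parameter and its default are hypothetical; the slides do not give the depth or organization of Bobcat's return stack.

```python
from collections import deque


class ReturnStack:
    """Generic return-address predictor sketch (not Bobcat's actual design)."""

    def __init__(self, depth=8):  # depth is an assumed value
        # A bounded deque silently drops the oldest entry on overflow.
        self._stack = deque(maxlen=depth)

    def on_call(self, return_address):
        """Record the address the matching return should go to."""
        self._stack.append(return_address)

    def predict_return(self):
        """Return the predicted target, or None if the stack is empty."""
        return self._stack.pop() if self._stack else None


ras = ReturnStack(depth=2)
ras.on_call(0x100)
ras.on_call(0x200)
ras.on_call(0x300)            # overflows: 0x100 is dropped
first = ras.predict_return()   # 0x300
second = ras.predict_return()  # 0x200
third = ras.predict_return()   # None, the oldest entry was lost
```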

Bobcat Micro-Architecture
Dual x86 Decoder:
Scans up to 22 bytes
Decodes up to two x86 instructions per cycle
The decoder can directly map 89% of x86 instructions to a single microOp, an additional 10% to a pair of microOps, and more complicated x86 instructions (<1%) are microcoded. (Dynamic Instruction Counts)
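
The decode mix above can be turned into a rough microOps-per-instruction estimate. This is a sketch under stated assumptions: the "<1%" microcoded share is rounded up to 1%, and `ucode_len` (microOps per microcoded instruction) is a hypothetical parameter, since the slide gives no such figure.

```python
def avg_uops_per_instruction(single=0.89, double=0.10,
                             microcoded=0.01, ucode_len=4):
    """Weighted average of microOps per x86 instruction.

    single, double: dynamic fractions from the slide.
    microcoded: upper bound taken from the slide's "<1%".
    ucode_len: assumed microOps per microcoded instruction (not from the slide).
    """
    return single * 1 + double * 2 + microcoded * ucode_len


print(round(avg_uops_per_instruction(), 2))              # 1.13
print(round(avg_uops_per_instruction(ucode_len=10), 2))  # 1.19
```

Even with a pessimistic microcode length, the estimate stays close to 1.1 to 1.2 microOps per instruction under these assumptions.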

Bobcat Micro-Architecture
Integer Execution:
A dual port integer scheduler feeds two ALUs
A dual port address scheduler feeds a load address unit and a store address unit
Physical Register File uses maps and pointers to reduce power by minimizing data copying/movement
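
The point about maps and pointers can be sketched with a toy rename map: a result is written once into a physical register, and later steps only update pointers. This is a generic illustration, not Bobcat's implementation; the class `RenameMap`, its methods and the register counts are all hypothetical.

```python
class RenameMap:
    """Toy physical-register-file renamer (generic sketch, not Bobcat's design)."""

    def __init__(self, arch_regs, num_phys):
        # Each architectural register starts mapped to its own physical register.
        self.map = {reg: i for i, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_phys))
        self.prf = [0] * num_phys  # the only place values are stored

    def rename_dest(self, reg):
        """Point `reg` at a fresh physical register; return (new, old).

        Raises IndexError when no register is free (real hardware would stall).
        """
        new = self.free.pop(0)
        old = self.map[reg]
        self.map[reg] = new
        return new, old

    def read(self, reg):
        """Follow the pointer to the current value of `reg`."""
        return self.prf[self.map[reg]]

    def retire(self, old):
        """Recycle the previous mapping; no value is copied at retirement."""
        self.free.append(old)


rm = RenameMap(["rax", "rbx"], num_phys=4)
new, old = rm.rename_dest("rax")
rm.prf[new] = 42   # the result is written exactly once
rm.retire(old)     # retirement only returns a pointer to the free list
```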

Bobcat Micro-Architecture
Floating Point Unit:
A centralized FP scheduler feeds two 64-bit FP execution stacks
MMX and Logical units are replicated in both stacks
A physical register file is used to reduce power
The FP Mul Unit can perform two SP multiplies per cycle
The FP Add Unit can perform two SP additions per cycle
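
The per-cycle figures above give a simple peak single-precision throughput bound. A minimal sketch, assuming the multiply and add units can both be kept busy in the same cycle; the clock frequency is a caller-supplied value because the slides do not state one.

```python
def peak_sp_flops_per_cycle(mul_per_cycle=2, add_per_cycle=2):
    """Peak single-precision operations per cycle from the slide's figures."""
    return mul_per_cycle + add_per_cycle


def peak_sp_gflops(clock_ghz, **kwargs):
    """Peak SP GFLOPS at an assumed clock (clock_ghz is not from the slides)."""
    return clock_ghz * peak_sp_flops_per_cycle(**kwargs)


print(peak_sp_flops_per_cycle())  # 4
print(peak_sp_gflops(1.5))        # 6.0, for a hypothetical 1.5 GHz clock
```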

Bobcat Micro-Architecture
Data Cache:
32 Kbyte
8-way set associative
64-byte line
Parity Protected
Copyback
Advanced 8-stream prefetcher
40/8 entry L1DTLB (4k/2m)
512/64 entry L2DTLB (4k/2m)
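
The TLB sizes above translate into the amount of memory each level can map without a table walk. The sketch below reads "40/8 entry (4k/2m)" as 40 entries for 4 KB pages and 8 entries for 2 MB pages, which is an interpretation of the slide's shorthand; `tlb_reach` is an illustrative helper name.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries, page_bytes):
    """Bytes of address space covered when every entry maps a distinct page."""
    return entries * page_bytes


l1_4k = tlb_reach(40, 4 * KIB)    # L1 DTLB, 4 KB pages
l1_2m = tlb_reach(8, 2 * MIB)     # L1 DTLB, 2 MB pages
l2_4k = tlb_reach(512, 4 * KIB)   # L2 DTLB, 4 KB pages
l2_2m = tlb_reach(64, 2 * MIB)    # L2 DTLB, 2 MB pages

print(l1_4k // KIB, "KiB")  # 160 KiB
print(l1_2m // MIB, "MiB")  # 16 MiB
print(l2_4k // MIB, "MiB")  # 2 MiB
print(l2_2m // MIB, "MiB")  # 128 MiB
```

With 4 KB pages the L2 DTLB reach (2 MiB) is four times the 512 Kbyte L2 cache listed elsewhere in this deck.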

Bobcat Micro-Architecture
Out-of-Order Load Store Unit:
Loads bypassing loads
Loads bypassing stores
Stores bypassing loads
Bypass tracking and dependency correction
Fast store forwarding
Hazard predictor
Fast critical word fill forwarding

Bobcat Micro-Architecture
L2 Cache:
512 Kbyte
16-way set associative
64-byte lines
ECC Protected
Half speed clocking for power reduction

Bobcat Micro-Architecture
Bus Unit:
8 outstanding data accesses
2 outstanding fetch accesses
Eviction Buffers
Fill Buffers
Write combining buffers
Coherency management
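
The limit of 8 outstanding data accesses bounds sustainable fill bandwidth through Little's law (bytes in flight divided by latency). A rough sketch, assuming each outstanding access is one 64-byte line fill; the memory latency is a hypothetical input, since the slides give none.

```python
def max_fill_bandwidth_gb_s(latency_ns, outstanding=8, line_bytes=64):
    """Upper bound on fill bandwidth in GB/s (bytes per nanosecond).

    outstanding: from the slide (8 data accesses in flight).
    line_bytes: assumes each access fills one 64-byte line.
    latency_ns: assumed round-trip latency, not a figure from the slides.
    """
    return outstanding * line_bytes / latency_ns


# With a hypothetical 100 ns memory latency:
print(max_fill_bandwidth_gb_s(100.0))
```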

Bobcat Pipeline
[Pipeline diagram. Stage labels: Fetch0 to Fetch5, Dec0 to Dec2, Pack, Dispatch, Schedule, RegRead, EXE, ALU, AGU, DC1, DC2, Writeback, Transit, FpDec, RegRen, uCode ROM, MDec, FDec, L2Tag, L2Data]
Branch Mispredict Latency: 13 cycles
Load Use Latency, L1 hit: 3 cycles
Load Use Latency, L2 hit: 17 cycles
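
The three latencies on the pipeline slide can be plugged into two back-of-envelope formulas. This is a sketch under simplifying assumptions: every L1 miss is assumed to hit in the L2, and the hit rate, branch fraction and mispredict rate used in the example are made-up inputs, not measurements from the slides.

```python
L1_LOAD_USE = 3          # cycles, from the slide
L2_LOAD_USE = 17         # cycles, from the slide
MISPREDICT_PENALTY = 13  # cycles, from the slide


def avg_load_use_cycles(l1_hit_rate):
    """Average load-use latency, assuming all L1 misses hit in the L2."""
    return l1_hit_rate * L1_LOAD_USE + (1.0 - l1_hit_rate) * L2_LOAD_USE


def mispredict_cpi(branch_fraction, mispredict_rate):
    """Cycles per instruction lost to branch mispredicts."""
    return branch_fraction * mispredict_rate * MISPREDICT_PENALTY


# Hypothetical workload: 95% L1 hit rate, 20% branches, 5% mispredicted.
print(round(avg_load_use_cycles(0.95), 2))   # 3.7
print(round(mispredict_cpi(0.20, 0.05), 2))  # 0.13
```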

Core Floor Plan
[Die floor plan. Labeled regions: Floating Point Unit, Test/Debug, Data L2 TLB, X86 Decode, Bus Unit, Instruction Cache, L2 Sub Array, Inst TLB/Tag, L2 TAG, Branch Predict, Ucode ROM, ROB, Data Cache, Integer Unit, Data Tag/TLB, Load Store Unit]

Power Reduction
Use of physical register files
Extensive use of non-shifting queues with pointers
Fine grain clock gating
Integrated Core Power Gating
Only needed arrays are clocked
i.e. Dtag hit before Dcache read
Predicting the type of branch then clocking the appropriate predictor(s)
Elimination of instruction marker bits in the Icache
Finding the knee of the curve (scrutinize performance gains against power costs)
Polishing speed paths to raise the Vt mix and reduce leakage
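
The "Dtag hit before Dcache read" item can be illustrated by counting data-array way reads in a simple model. The sketch compares a tag-first access, which reads one data way and only on a hit, with a generic parallel lookup that reads every way on each access. The model and the name `data_way_reads` are illustrative assumptions; the slides do not describe the actual array organization.

```python
def data_way_reads(hits, misses, ways=8, tag_first=True):
    """Count data-array way reads for a set-associative cache.

    tag_first=True:  tags are checked first, so only the matching way is
                     read, and nothing is read on a miss.
    tag_first=False: all ways are read in parallel with the tag check.
    ways defaults to 8, the Data Cache associativity given in the deck.
    """
    if tag_first:
        return hits
    return (hits + misses) * ways


# Hypothetical trace: 900 hits and 100 misses.
serial = data_way_reads(900, 100)                     # 900
parallel = data_way_reads(900, 100, tag_first=False)  # 8000
```

In this model the tag-first scheme performs far fewer data-array reads; in general, serializing tag and data access trades some latency for that saving.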

Bobcat Core Overview
Advanced Micro-architecture:
Dual x86 Decode
Advanced Branch Predictor
Full OOO instruction execution
Full OOO load/store engine
High Performance Floating Point
AMD64 64-bit ISA
SSE1,2,3, SSSE3 ISA
Secure Virtualization
32KB L1s, 512KB L2
Low Power Design:
Power Optimized Execution
Micro-architecture that minimizes data movement and unnecessary reads
Clock gating, Power gating
System Low Power States
Small Core:
Area efficient balance of high performance and low power

Summary
[Block diagram: Bobcat Low Power Core. Labels: ICACHE, Fetch, Decode, Integer Scheduler, I Pipe (x2), Address Scheduler, Load Pipe, Store Pipe, FP Scheduler, A Pipe, M Pipe, DCACHE, L2, BU]
Estimated 90% of the performance of today's mainstream notebook CPU in half the area*
Sub-one watt capable
Highly portable across designs and manufacturing technologies
*Based on internal AMD modeling using benchmark simulations