SoC Design
ICE of silicon
[Figure: computational efficiency [MOPS/W] (10^0 to 10^6, log scale) versus feature size [µm] (2, 1, 0.5, 0.25, 0.13, 0.07), after Roza. The curve shows the intrinsic computational efficiency of silicon. Labeled processor points: i386SX, i486DX, 68040, P5, P6, microsparc, Supersparc, Turbosparc, Ultrasparc, 601, 604, 604e, 21164a, 21364, 7400. Labeled applications: 3DTV, query by humming.]
Designing Embedded Systems on Silicon-1
J. van Meerbergen 2/7/13
Hardware Efficiency
[Figure: efficiency (low, medium, high) versus flexibility (low, medium, high) for ASIC, ASIP, DSP, GP proc, and FPGA]
ASIC Style
[Figure: a Finite Impulse Response (FIR) filter]
• Highly efficient for fixed algorithms
• OK only for large market volumes (100Ms for 32 nm)
• No changes after processing at all (no field upgrades, tuning to specific context, bug fixes, new standards)
• Irregular code leads to highly irregular floorplan with large wiring impact (Edyn) and large leakage (Estat)
• Difficult to efficiently include time multiplexing for irregular code
ASIC + microcontroller style
[Figure: block diagram with CPU, MEM, and ASIC]
• Highly efficient for fixed algorithms that use the µ-controller very seldom
• OK only for large market volumes (100Ms for 32 nm)
• Limited changes after processing
• Changes only very locally in non-critical code (ok for some field upgrades, tuning to specific context, bug fixes, new standards)
• Irregular code leads to highly irregular floorplan with large wiring impact (Edyn) and large leakage (Estat)
• Difficult to efficiently include time multiplexing for irregular code
General-purpose microprocessors
• Highly flexible: easy field upgrades, tuning to specific context, bug fixes, new standards
• Easy to use and compiler friendly
• Large market due to combination of smaller markets
• Large A+E overhead: data cache hierarchy, multi-port register file, instruction hierarchy, very flexible data-path units (wide multiplier, ALU with many instructions)
GP CPUs + custom accelerators
[Figure: GP CPU with an accelerator block (Accel)]
• Highly flexible: easy field upgrades, tuning to specific context, bug fixes, new standards. But flexibility degrades when accelerators have to be used too much
• Easy to use and compiler friendly
• Large market due to combination of smaller markets, but not when accelerators are used more
• Large A+E overhead: data cache hierarchy, multi-port register file, instruction hierarchy, very flexible data-path units (wide multiplier, ALU with many instructions). Partly mitigated when accelerators are used sufficiently
• Large overhead in communication between microprocessor and accelerators, except for large code segments (not flexible!)
SoC Design
• Synthesis
• DFT Insertion
• Floorplanning
• Power Planning
• Clock tree insertion
• Place and Route
• RC extraction
• Timing check
Design Tools
• System Architecture
– C/C++
– SystemC
– Matlab
• RTL
– Verilog-XL
– NC-Verilog
– NC-VHDL
– Debussy
• Synthesis
– RC Compiler
– Design Compiler
• Physical Design
– SoC Encounter
– Magma (Synopsys)
– Mentor
Simplified Flow
[Figure: simplified design flow. Inputs: .lib, LEF, timing constraints. Front end: RTL, logic simulation, logic synthesis, formal verification, test (ATPG), static timing analysis. Back end: floor planning, clock tree synthesis, place & route, RC extraction, DRC/LVS, static timing analysis. Outputs: netlist, GDSII, SPEF, SDF.]
TSMC’s Design Flow
Flow with Multi-Vendor Tools
Design Abstraction Levels
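SYSTEM
MODULE
GATE
CIRCUIT
DEVICE
[Figure: illustrations for each level; the device level shows a transistor with gate (G), source (S), drain (D), and n+ regions]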
[Figure: impact of a design decision and complexity across the design levels: conceptual level, high level, RT level, gate level, transistor level]
Design Flow: Summary
Level             Time concept                                  Data type            Code lines
Concept           communicating processes with distinct rates   tokens               1K
High level        frame, signal rate                            arrays, lists        10K
RT level          clock                                         scalars, int, float  100K
Gate level        set-up and hold times                         bits                 1M
Transistor level  analog                                        Volt, mA             10M
At higher levels the impact of a design decision is
larger.
Vendors concentrate on lower levels (more general
solutions).
Logic Synthesis
Synthesis is the process by which an abstract description (known as RTL) of the circuit behaviour (generally in VHDL) is mapped to a set of primitive standard cells in a library for a particular process technology.
• Translation of RTL description into an intermediate format
• Optimization of logic
• Mapping of the optimized netlist to the gates of target library.
• Synthesis tool requires
– RTL code
– Target ASIC cell library
– User Constraints
• Timing and Area
• Environmental
• Power, Load etc.
• Output of the synthesis is a gate level netlist in the target technology
[Figure: refinement from idea to functional description, behavioral HDL, RTL, and gate-level netlist]
[Figure: netlist synthesis steps: logic synthesis, DFT architecture]
RTL Coding
• RTL stands for Register Transfer Level
• RTL description of a design describes the design in terms of registers and the logic that resides between them
• This captures the timing constraints of the design efficiently
• Verilog and VHDL are the two most popular hardware description languages that are commonly used to write RTL description
• RTL description captures the change in data at each clock cycle
• All the registers are updated at the same time in a clock cycle
• RTL captures the data flow
• Logic synthesis tools translate an RTL model more efficiently than a behavioral model
Sample RTL code
    if IR(3) = '0' then
      PC := PC + 1;
    else
      DBUF := MEM(PC);
      MEM(SP) := PC + 1;
      SP := SP - 1;
      PC := DBUF;
    end if;
Logic Synthesis
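[Figure: RTL, user constraints, and the ASIC cell library are inputs to the logic synthesis tool, which produces a gate level netlist]
RTL input shown in the figure:
    process (CLK, RST)
    begin
      if RST = '1' then
        Q <= '0';
      elsif rising_edge(CLK) then
        Q <= A and B and not (C and D);
      end if;
    end process;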
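The register described by the RTL process on this slide can be modelled in a few lines of Python. This is only a behavioural sketch for checking the logic by hand, not what a synthesis tool produces; the function name `next_q` and the `clk_rising` flag are assumptions made for the illustration.

```python
def next_q(q, rst, clk_rising, a, b, c, d):
    """Return the next value of Q for one evaluation of the process.

    Hypothetical helper: models a flip-flop with asynchronous reset.
    """
    if rst:
        # Asynchronous reset has priority over the clock.
        return False
    if clk_rising:
        # On a rising clock edge Q takes A and B and not (C and D).
        return a and b and not (c and d)
    # No reset and no clock edge: the register holds its value.
    return q
```

Calling `next_q(False, False, True, True, True, False, False)` returns `True`, while asserting `rst` forces the result to `False` regardless of the other inputs.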
Logic Synthesis: Technology Mapping
Z = (not S and A) or (S and B)
[Figure: the same function drawn with generic gates (inputs A, S, B; output Z) and after mapping to standard cells I-002 and ANDOR-001]
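Technology mapping must preserve the logic function. The sketch below checks this exhaustively for the multiplexer equation on this slide. It assumes that I-002 is an inverter and ANDOR-001 is a two-wide AND-OR cell; the slide only gives the cell names, so the functions `inv` and `and_or` are hypothetical stand-ins.

```python
from itertools import product


def mux_generic(s, a, b):
    # Generic-gate form from the slide: Z = (not S and A) or (S and B)
    return ((not s) and a) or (s and b)


def inv(x):
    # Assumed behaviour of cell I-002: a plain inverter.
    return not x


def and_or(a1, a2, b1, b2):
    # Assumed behaviour of cell ANDOR-001: (a1 AND a2) OR (b1 AND b2).
    return (a1 and a2) or (b1 and b2)


def mux_mapped(s, a, b):
    # Mapped netlist: one inverter feeding one AND-OR cell.
    return and_or(inv(s), a, s, b)


def equivalent():
    # Compare both implementations over all 8 input combinations.
    return all(
        bool(mux_generic(s, a, b)) == bool(mux_mapped(s, a, b))
        for s, a, b in product([False, True], repeat=3)
    )
```

With three inputs, exhaustive comparison is cheap; real equivalence checkers use formal methods because the input space grows exponentially.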
DfT Insertion
• Testable Flip-Flops
• Scan chain generation
• Chain propagation from core to output pin
[Figure: DfT flow: DfT insertion and synthesis, DfT analysis, test generation (ATPG / expansion), test validation, handoff deliverables]
Backend Design
• Technology Information and Physical Libraries
– [Link]
– [Link]
– [Link]
• Timing libraries
– Corelib_slow.lib
– Corelib_fast.lib
– Corelib_typ.lib
– IOlib_slow.lib
– RAM timing libraries
• Timing constraints (user defined)
• Design Netlist
– Add IO pads, power pads
– Verilog design netlist
• IO pad location file
[Figure: back-end flow. Chip Physical Architecture: I/O planning, power grid design, chip assembly, hierarchical STA analysis, hierarchical floorplan & implementation. Physical Synthesis: placement, DFT, clock tree synthesis, post placement optimisation. Routing and Final Optimisation: signal routing, crosstalk fixing, post route fix, antennas, editing, decap, fillers.]
Floorplanning
• Floor planning is the task of deciding how the chip area is to be utilized by the leaf modules, taking care of wiring considerations
• Two methods of floorplanning:
– Top Down: Here the chip is partitioned up during the development of the RTL level modelling. Area is assigned on the basis of estimated block areas and shapes, and blocks are placed relative to each other depending on connectivity.
– Bottom up: Here the design is first synthesised and then the resultant gates are clustered together into blocks on the basis of connectivity.
• Most designs use a combination of both of the above techniques, but the emphasis is increasingly on the first.
[Figure: floorplan showing standard cells, an IP block, and pads]
Floorplanning
• Calculating core size, width and height
• When calculating core size of standard cells, the core utilization must be decided first. Usually the core utilization is higher than 85%
• The core size is calculated as follows
    Core Size of Standard Cells = (standard cell area) / (core utilization)
• The recommended core shape is a square, i.e. Core Aspect Ratio = 1.
• Width = Height = (Core Size of Standard Cells)^0.5
Example
• Standard cell area = 2,000,000 µm²
• Core utilization demanded = 85%
• No macros
• Core Size of Standard Cells = 2,000,000 / 0.85 = 2,352,941 µm²
• Width = Height = (2,352,941)^0.5 ≈ 1534 µm
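The core size example can be reproduced with a short Python sketch. The function name `core_dimensions` is hypothetical, and the aspect ratio is assumed to mean width divided by height, which the slide does not define explicitly.

```python
import math


def core_dimensions(std_cell_area_um2, utilization, aspect_ratio=1.0):
    """Return (core area, width, height) for a macro-free core.

    aspect_ratio is assumed to be width / height; 1.0 gives a square core.
    """
    core_area = std_cell_area_um2 / utilization
    height = math.sqrt(core_area / aspect_ratio)
    width = aspect_ratio * height
    return core_area, width, height


# Values from the example: 2,000,000 um^2 of cells at 85% utilization.
area, width, height = core_dimensions(2_000_000, 0.85)
```

Here `round(area)` is 2352941 and `round(width)` is 1534, matching the worked example.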
Floorplanning
• Core Margins
– Space for power and ground routing
• Core limited / Pad limited designs
– When pad width > (core width + core margin), die size is decided by pads. This is called a pad limited design
– When pad width < (core width + core margin), die size is decided by core. This is called a core limited design
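The pad limited / core limited rule can be written as a small check. This is a sketch: the function name is hypothetical, "pad width" is taken to be the total width the pads need along one die edge, and the equality case (labelled "balanced" here) is not covered by the slide.

```python
def limiting_factor(pad_width_um, core_width_um, core_margin_um):
    """Classify a design as pad limited or core limited (hypothetical helper)."""
    core_extent = core_width_um + core_margin_um
    if pad_width_um > core_extent:
        return "pad limited"   # die size is decided by the pads
    if pad_width_um < core_extent:
        return "core limited"  # die size is decided by the core
    return "balanced"          # assumption: equality is not addressed on the slide
```

For example, a 1534 µm core with a 100 µm margin and 2000 µm of pads along an edge would be pad limited.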
Power Planning
• Metal migration (also known as electromigration)
– Under high currents, electron collisions with metal grains cause the metal to move. The metal wire may become an open circuit or a short circuit.
– Prevention: sizing power supply lines to ensure that the chip does not fail
– Experience: make current density of power ring < 1 mA/µm
• IR drop
– IR drop is the voltage drop on the power and ground nets due to high current flowing through the power-ground resistive network
– When there are excessive voltage drops in the power network or voltage rises in the ground network, the device will run at slower speed
– IR drop can cause the chip to fail due to
• Performance (circuit running slower than specification)
• Functionality problem (setup or hold violations)
• Unreliable operation (less noise margin)
• Power consumption (leakage power)
• Latch up
– Prevention: adding stripes to avoid IR drop on cell’s power line
Power Planning: IR Drop
• Number of counts inversely proportional to DSP clock frequency
• FC = 10, 20 and 25 MHz
• Ring oscillator (ringo) frequency ≈ 115 MHz @ VDD = 1.8V
• DSP induced PSN is clearly detected
Average PSN = 6 counts × 2.4 mV/count = 14.4 mV
[Figure: measurement circuit with a counter and enable signal, supply waveform v(t), and counting window TC = 1/FC]
[Figure: C2 counts vs. DSP activity (Fc = 20 MHz, Tambient = 27ºC); x axis: tester ck-cycles, 0 to 250; y axis: C2 counts, 691 to 699; Δ counts = 6]
Source: J. Rius, UPC
Voltage Drop Verification
[Figure: voltage drop verification flow with VoltageStorm (Cadence). Labels: SoC Encounter virtual prototype (IP block, Partition 1, Partition 2); block-level analysis (Encounter Power Analysis, block power consumption, VoltageStorm, block power-grid view); top-level analysis, flat implementation (Encounter Power Analysis, instance power consumption, power grid view library, VoltageStorm); CreateChip; top-level and block-level PG analysis sign-off; hierarchy results displayed in the SoC Encounter interface.]
Power Grid Design
[Figure: power grid design flow. Labels: power plan, power grid creation, power routing, multiple power/ground connect, power grid design & refinement, power propagation, power grid extraction & analysis, parasitics extraction, hierarchical power grid analysis.]
Power Ring Width
Experience
• Gate count = 70 k
• 4000 Flip-Flops
• 80% FF with dynamic gated clock
• Current needed = 0.2 mA/MHz
– Note: multiply this value by 1.8~2 for a design without gated clocks
Example:
• Gate count = 200 k
• No gated clock
• Clock frequency = 20 MHz
• Current needed = (200/70) * 0.2 * 20 * 2 = 22.86 mA
• Current density < 1 mA/µm
• The width of the P/G ring > 22.86 µm
• In order to avoid the slot rule of wide metal, the largest width is 20 µm (process dependent)
• Use two sets of P/G ring for this case
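The ring width rule of thumb can be sketched in Python as follows. The function and parameter names are hypothetical; the defaults (70 k gates, 0.2 mA/MHz, factor 2 without clock gating, 1 mA/µm, 20 µm maximum width) are the experience values quoted on this slide and are process dependent.

```python
import math


def ring_current_ma(gate_count_k, freq_mhz, gated_clock,
                    ref_gate_count_k=70.0, ref_ma_per_mhz=0.2,
                    ungated_factor=2.0):
    """Scale the reference design's current to a new gate count and clock."""
    current = (gate_count_k / ref_gate_count_k) * ref_ma_per_mhz * freq_mhz
    if not gated_clock:
        # The slide suggests a factor of 1.8 to 2 without clock gating.
        current *= ungated_factor
    return current


def ring_plan(current_ma, max_density_ma_per_um=1.0, max_width_um=20.0):
    """Return (minimum total ring width in um, number of P/G ring sets)."""
    min_width = current_ma / max_density_ma_per_um
    sets = math.ceil(min_width / max_width_um)
    return min_width, sets


current = ring_current_ma(200, 20, gated_clock=False)
min_width, sets = ring_plan(current)
```

For the example, `current` rounds to 22.86 mA, the minimum width is about 22.86 µm, and `sets` is 2 because a single ring is capped at 20 µm.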
Power Stripe Calculation
Experience
• Add one strap set per 100 µm
Example
• Core width = height = 1600 µm
• Stripe set added = 15
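One way to arrive at 15 stripe sets for a 1600 µm core is to count the 100 µm positions that fall strictly inside the core, assuming the core edges are already served by the power ring. That reading is an interpretation of the example, not something the slide states; the function name is hypothetical.

```python
import math


def stripe_sets(core_width_um, pitch_um=100.0):
    """Count stripe positions at multiples of the pitch strictly inside the core.

    Assumption: no stripe is placed on the core edges, where the power ring runs.
    """
    return max(math.ceil(core_width_um / pitch_um) - 1, 0)
```

`stripe_sets(1600)` gives 15, in line with the example.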
Core/IO power pad selection
• Core power pad connection
– One set core power pad (PVDDC along with PVSSC) can provide 40~50mA current
• IO power pad
– One set IO power pad (PVDDR along with PVSSR) can provide the power for
• 3~4 output pads, or
• 6~8 input pads
[Figure: core with power ring, stripes, and core power pads]
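The pad selection guidelines translate into simple ceiling divisions. The sketch below uses the conservative end of each range (40 mA per core pad set, 3 outputs or 6 inputs per IO pad set). Adding the output and input demand as fractions of a pad set is an assumption; the slide only gives the two limits separately. All names are hypothetical.

```python
import math


def core_power_pad_sets(core_current_ma, ma_per_set=40.0):
    """Number of PVDDC/PVSSC sets needed for the given core current."""
    return math.ceil(core_current_ma / ma_per_set)


def io_power_pad_sets(n_outputs, n_inputs, outputs_per_set=3, inputs_per_set=6):
    """Number of PVDDR/PVSSR sets, summing fractional demand (assumption)."""
    demand = n_outputs / outputs_per_set + n_inputs / inputs_per_set
    return math.ceil(demand)
```

For a core drawing about 22.86 mA, one core power pad set is enough under these numbers; 12 output pads and 24 input pads would call for 8 IO power pad sets.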
Placement
• Placement decides the positions of components within allocated blocks
• One cannot route until the components have been placed.
• The quality of placement is decided solely on the basis of the quality of routing it allows.
• Placement is performed using simple estimates of final routing.
• Timing driven P&R is the state of the art
• Gates, flip-flops/latches are the common placement objects.
– Smaller elements like logic gates are placed in single row.
– Larger blocks are placed in multiple-rows.
[Figure: core area with standard cells at low utilization]
Placement
Source: Magma
Clock Tree Synthesis
• Clock signal is used as a timing reference in a synchronous digital system for the movement of data within that system.
• The Clock Tree or clock distribution network distributes the clock signal(s) from a common point to all the elements that need it
• Properties of clock signals
– They are loaded with the greatest fanout,
– travel over the greatest distances
– operate at the highest speeds
• The goal of clock tree synthesis includes
– Creating clock tree spec file
– Building a buffer distribution network
• In automatic CTS mode, Encounter will do the following things
– Build the clock buffer tree according to the clock tree specification file
– Balance the clock phase delay with appropriately sized, inserted clock buffers
Clock Tree Synthesis
Routing
• Routing is the process of building the
physical connections between blocks
as defined by the logical connections.
• Routing takes place in more than one
layer, the exact number available
depending on the process and design
conventions.
• Layers are connected together using
vias
• Global Routing
– Assigns wires to channels
defined during the floor
planning phase
• Detailed Routing
– Assigns nets to individual
tracks in the channel
[Figure: Routing and Final Optimisation: signal routing, crosstalk fixing, post route fix, antennas, editing, decap, fillers]
Routing: Signal Integrity Cross-talk
• Parallel repeater insertion does not reduce the cross-talk peak noise
• For a 10mm communication bus, the delay noise is lowered by about 77%
• Staggered repeaters reduce delay noise by about 88%
[Figure: plots of peak noise and propagation delay versus wire length for a 20mm wire]
[Figure: test structure with three parallel lines between shield wires: aggressor (T1IN to T1OUT), victim (T2IN to T2OUT), aggressor (T3IN to T3OUT), each with pico pad, driver, receiver, and buffers (bfx4, bfx3, bfx50ohm); power supply 2]
Source: M. Meijer and A. Katoch, Philips
Routing: SI Prevention
[Figure: verification signoff: timing & crosstalk analysis, power distribution analysis, parasitic extraction]
Static Timing Analysis
• This involves three main steps:
– Design is broken down into sets of timing paths
– The delay of each path is calculated
– All path delays are checked to see if timing constraints have been met
[Figure: circuit with input A, a flip-flop (D, Q, CLK), and output Z, broken down into Path 1, Path 2, and Path 3]
Path delay calculations
[Figure: path through cells D1 and U33 annotated with delays 1.0, 0.54, 0.32, 0.66, 0.23, 0.43, 0.25]
path_delay = (1.0 + 0.54 + 0.32 + 0.66 + 0.23 + 0.43 + 0.25) = 3.43 ns
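The path delay sum and the constraint check of the third step can be sketched in Python. The delay values are the ones on this slide; the required times passed to `slack` are made-up numbers for illustration, and the function names are hypothetical.

```python
def path_delay(stage_delays_ns):
    """Total delay of one timing path: the sum of its cell and net delays."""
    return sum(stage_delays_ns)


def slack(required_ns, stage_delays_ns):
    """Positive slack means the path meets its timing constraint."""
    return required_ns - path_delay(stage_delays_ns)


# Stage delays from the slide, in ns.
delays = [1.0, 0.54, 0.32, 0.66, 0.23, 0.43, 0.25]
total = path_delay(delays)
```

`total` comes out at 3.43 ns up to floating-point rounding. With a hypothetical required time of 4.0 ns the path has positive slack and passes; with 3.0 ns it would be reported as a violation.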
Physical Verification
• DRC
– Design Rule Checking
• LVS
– Layout vs. Schematic verifications
Chip Finishing
• Seal-ring & Artefact Generation
– The seal ring helps to make the circuit moisture resistant and prevents the generation of cracks in the die during sawing of the wafer
– Sometimes this step is simply called ‘Design Chip Finishing’
– Artefacts: critical dimension structures, mask ids, fuse markers, etc.
• Tiling - dummy fill/pattern fill
– Fabs' stringent min and max rules on layer densities on active, poly and metal must be met by all designs
– Currently a back-end operation
• Each step is followed by a Physical Verification step
[Figure: die with seal ring and fill tiles]
Package Fitting
• Selection of appropriate package
• Route pads to pins
– Wire length is important
– Rule checking
• GDS2 minimum required information is the nitride or pad opening layer or the pad boundary layer
[Figure: package options]
Packaging