SoC Design

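ICE of silicon
[Figure: intrinsic computational efficiency (ICE) of silicon, after [Roza]. Vertical axis: computational efficiency in MOPS/W, log scale from 10^0 to 10^6. Horizontal axis: feature size in µm (2, 1, 0.5, 0.25, 0.13, 0.07). Plotted processors include i386SX, i486DX, 68040, P5, P6, 601, 604, 604e, 21164a, 21364, 7400 and several SPARC variants (microsparc, Super sparc, Turbosparc, Ultra sparc). The labels 3DTV and Query by humming are also marked.]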
http://bwrc.eecs.berkeley.edu/cic
Designing Embedded Systems on Silicon-1
J. van Meerbergen 2/7/13
Hardware Efficiency
[Figure: efficiency (low, medium, high) plotted against flexibility (low, medium, high). Labels in order of appearance from high to low efficiency: ASIC, ASIP, DSP, GP proc, FPGA.]

ASIC Style

[Figure: a Finite Impulse Response (FIR) filter]

•  Highly efficient for fixed algorithms
•  OK only for large market volumes (100Ms for 32 nm)
•  No changes at all after processing (no field upgrades, tuning to specific context, bug fixes, new standards)
•  Irregular code leads to a highly irregular floorplan with large wiring impact (Edyn) and large leakage (Estat)
•  Difficult to efficiently include time multiplexing for irregular code

ASIC + microcontroller style

[Figure: CPU, MEM and ASIC blocks]

•  Highly efficient for fixed algorithms that use the µ-controller very seldom
•  OK only for large market volumes (100Ms for 32 nm)
•  Limited changes after processing
•  Changes only very locally in non-critical code (OK for some field upgrades, tuning to specific context, bug fixes, new standards)
•  Irregular code leads to a highly irregular floorplan with large wiring impact (Edyn) and large leakage (Estat)
•  Difficult to efficiently include time multiplexing for irregular code

General-purpose microprocessors

•  No picture

•  Highly flexible: easy field upgrades, tuning to specific context, bug fixes, new standards
•  Easy to use and compiler friendly
•  Large market due to combination of smaller markets
•  Large A+E overhead: data cache hierarchy, multi-port register file, instr. hierarchy, very flexible data-path units (wide multiplier, ALU with many instr.)

GP CPUs + custom accelerators

[Figure: GP CPU with an accelerator (Accel) block]

•  Highly flexible: easy field upgrades, tuning to specific context, bug fixes, new standards. But flexibility degrades when accelerators have to be used too much
•  Easy to use and compiler friendly
•  Large market due to combination of smaller markets, but less so when accelerators are used more
•  Large A+E overhead: data cache hierarchy, multi-port register file, instr. hierarchy, very flexible data-path units (wide multiplier, ALU with many instr.). Partly mitigated when accelerators are used sufficiently
•  Large overhead in communication between microprocessor and accelerators, except when large code segments are moved to the accelerator (not flexible!)

SoC Design

•  Synthesis
•  DFT Insertion
•  Floorplanning
•  Power Planning
•  Clock tree insertion
•  Place and Route
•  RC extraction
•  Timing check

Design Tools

•  System Architecture
–  C/C++
–  SystemC
–  Matlab
•  RTL
–  Verilog-XL
–  NC-Verilog
–  NC-VHDL
–  Debussy
•  Synthesis
–  RC Compiler
–  Design Compiler
•  Physical Design
–  SoC Encounter
–  Magma (Synopsys)
–  Mentor

Simplified Flow
[Figure: simplified design flow. Inputs: RTL, timing constraints, .lib and LEF. Boxes under Front End and Back End: Logic Synthesis, Logic Simulation, Formal Verification, Test (ATPG), Static Timing Analysis, Floor planning, Clock Tree Synthesis, Place & Route, RC Extraction, Static Timing Analysis, DRC/LVS. Outputs: Netlist, GDSII, SPEF, SDF.]

TSMC’s Design Flow

Flow with Multi-Vendor Tools

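Design Abstraction Levels
[Figure: design abstraction levels from top to bottom: SYSTEM, MODULE, GATE, CIRCUIT, DEVICE. The device level shows a transistor cross-section with gate G, source S, drain D and n+ regions.]
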
[Figure: impact of a design decision versus complexity. Levels shown: conceptual level, high level, RT level, gate level, transistor level.]

Design Flow: Summary

Level             Time concept                         Data type            Code lines
Concept           comm. processes with distinct rates  Tokens               1K
High level        frame, signal rate                   arrays, lists        10K
RT level          clock                                scalars, int, float  100K
Gate level        set-up and hold times                bits                 1M
Transistor level  Analog                               Volt, mA             10M

At higher levels the impact of a design decision is larger.

Vendors concentrate on lower levels (more general solutions).
Logic Synthesis

Synthesis is the process by which an abstract description (known as RTL) of the circuit behaviour (generally in VHDL) is mapped to a set of primitive standard cells in a library for a particular process technology.

•  Translation of RTL description into an intermediate format
•  Optimization of logic
•  Mapping of the optimized netlist to the gates of the target library
•  Synthesis tool requires
–  RTL code
–  Target ASIC cell library
–  User Constraints
•  Timing and Area
•  Environmental
•  Power, Load etc.
•  Output of the synthesis is a gate level netlist in the target technology

[Figure: Netlist Synthesis (Logic Synthesis, DFT Architecture). Refinement steps shown: Idea, Functional Description, Behavioral HDL, RTL, Gate-Level Netlist.]

RTL Coding
•  RTL stands for Register Transfer Level
•  An RTL description describes the design in terms of registers and the logic that resides between them
•  This captures the timing constraints of the design efficiently
•  Verilog and VHDL are the two most popular hardware description languages commonly used to write RTL descriptions
•  An RTL description captures the change in data at each clock cycle
•  All the registers are updated at the same time in a clock cycle
•  RTL captures the data flow
•  Logic synthesis tools translate an RTL model more efficiently than a behavioral model

Sample RTL code

if IR(3) = '0' then
    PC := PC + 1;
else
    DBUF := MEM(PC);
    MEM(SP) := PC + 1;
    SP := SP - 1;
    PC := DBUF;
end if;

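The statement that all registers are updated at the same time in a clock cycle can be illustrated with a small Python model. This is only a sketch: the registers `a` and `b` and the helper `clock_tick` are hypothetical names chosen for illustration and do not come from the slides.

```python
def clock_tick(regs):
    """Return the register values after one clock edge.

    Every next-state value is computed from the *current* values first,
    and only then committed, which mimics registers that all update
    at the same time on the clock edge.
    """
    next_regs = {
        "a": regs["b"],              # a samples the old value of b
        "b": regs["a"] + regs["b"],  # b samples the old values of a and b
    }
    return next_regs


state = {"a": 1, "b": 2}
state = clock_tick(state)
print(state)  # {'a': 2, 'b': 3}
```

Note that `b` becomes 3 and not 4: it uses the value `a` held before the edge, not the value `a` receives on the same edge.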
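Logic Synthesis

[Figure: the logic synthesis tool takes RTL, user constraints and an ASIC cell library as inputs and produces a gate level netlist.]

RTL shown in the figure:

process (CLK, RST)
begin
    if (RST = '1') then
        Q <= '0';
    elsif rising_edge(CLK) then
        Q <= A and B and not (C and D);
    end if;
end process;
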
Logic Synthesis: Technology Mapping
Z = (not S and A) or (S and B)
[Figure: the same function with inputs A, S, B and output Z drawn twice: once with generic gates, and once with standard cells labelled I-002 and ANDOR-001.]

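A quick way to see that technology mapping preserves the function is to compare truth tables. The sketch below assumes that the cell labelled I-002 behaves as an inverter and ANDOR-001 as a two-wide AND-OR gate. Both Python models (`inverter`, `and_or`) are hypothetical stand-ins, not real library cells.

```python
from itertools import product


def z_generic(a, s, b):
    """Z = (not S and A) or (S and B), written with generic NOT/AND/OR."""
    return ((not s) and a) or (s and b)


def inverter(x):
    """Assumed behaviour of an inverter cell."""
    return not x


def and_or(a1, a2, b1, b2):
    """Assumed behaviour of an AND-OR cell: (a1 AND a2) OR (b1 AND b2)."""
    return (a1 and a2) or (b1 and b2)


def z_mapped(a, s, b):
    """The same function built only from the two assumed cells."""
    return and_or(inverter(s), a, s, b)


def equivalent():
    """Exhaustively compare both versions over all 8 input combinations."""
    return all(
        bool(z_generic(a, s, b)) == bool(z_mapped(a, s, b))
        for a, s, b in product([False, True], repeat=3)
    )


print(equivalent())  # True
```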
DfT Insertion

•  Testable Flip-Flops
•  Scan chain generation
•  Chain propagation from core to output pin

[Figure: DfT Insertion flow. Steps shown: DfT Insertion and Synthesis, DfT Analysis, Test generation, ATPG / Expansion, test validation, Handoff deliverables.]

Backend Design
•  Technology Information and Physical Libraries
–  Corelib.lef
–  IOlib.lef
–  Rams.vclef
•  Timing libraries
–  Corelib_slow.lib
–  Corelib_fast.lib
–  Corelib_typ.lib
–  IOlib_slow.lib
–  RAM timing libraries
•  Timing constraints (user defined)
•  Design Netlist
–  Add IO pads, power pads
–  Verilog design netlist
•  IO pad location file

[Figure: backend flow in three stages. Chip Physical Architecture: boxes for I/O planning, power grid design, chip assembly, hierarchical floorplan, analysis, STA and implementation. Physical Synthesis: Placement, DFT, Clock Tree Synthesis, Post Placement Optimisation. Routing and Final Optimisation: Signal Routing, Crosstalk Fixing, Post Route Fix, Antennas, Editing, Decap, Fillers.]

Floorplanning
•  Floorplanning is the task of deciding how the chip area is to be utilized by the leaf modules, taking care of wiring considerations
•  Two methods of floorplanning:
–  Top down: the chip is partitioned during the development of the RTL level modelling. Area is assigned on the basis of estimated block areas and shapes, and blocks are placed relative to each other depending on connectivity.
–  Bottom up: the design is first synthesised and then the resultant gates are clustered together into blocks on the basis of connectivity.
•  Most designs use a combination of both techniques, but the emphasis is increasingly on the first.

[Figure: floorplan showing Std. Cells, an IP Block and Pads.]

Floorplanning
•  Calculating core size, width and height
•  When calculating the core size of standard cells, the core utilization must be decided first. Usually the core utilization is higher than 85%
•  The core size is calculated as follows

Core Size of Standard Cells = (standard cell area) / (core utilization)

•  The recommended core shape is a square, i.e. Core Aspect Ratio = 1.
•  Width = Height = (Core Size of Standard Cells)^0.5

Example
•  Standard cell area = 2,000,000 µm²
•  Core utilization demanded = 85%
•  No macros
•  Core Size of Standard Cells = 2,000,000 / 0.85 = 2,352,941 µm²
•  Width = Height = (2,352,941)^0.5 = 1534 µm

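The core size calculation above can be sketched in a few lines of Python. The function name `core_dimensions` is hypothetical, and the aspect ratio is assumed here to mean width divided by height (tools may define it the other way round). For a square core the distinction does not matter.

```python
import math


def core_dimensions(std_cell_area_um2, core_utilization, aspect_ratio=1.0):
    """Return (core_area_um2, width_um, height_um) for a macro-free core.

    core_area = standard cell area / core utilization
    aspect_ratio is assumed to be width / height (1.0 gives a square core).
    """
    if not 0.0 < core_utilization <= 1.0:
        raise ValueError("core utilization must be in (0, 1]")
    core_area = std_cell_area_um2 / core_utilization
    height = math.sqrt(core_area / aspect_ratio)
    width = core_area / height
    return core_area, width, height


# Numbers from the example: 2,000,000 um^2 of cells at 85% utilization.
area, width, height = core_dimensions(2_000_000, 0.85)
print(round(area), round(width), round(height))  # 2352941 1534 1534
```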
Floorplanning
•  Core Margins
–  Space for power and ground routing
•  Core limited / Pad limited designs
–  When pad width > (core width + core margin), the die size is decided by the pads. This is called a pad limited design
–  When pad width < (core width + core margin), the die size is decided by the core. This is called a core limited design

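The pad limited / core limited test is a single comparison. A minimal sketch follows. The function name `die_limit` is hypothetical, "pad width" is taken to be the total width needed by the pads along one die edge, and the case of exact equality (not covered by the slide) is arbitrarily reported as core limited.

```python
def die_limit(pad_width_um, core_width_um, core_margin_um):
    """Classify a design as 'pad limited' or 'core limited'.

    pad_width_um: assumed to be the total pad width along one die edge.
    Equality is treated as core limited (an assumption, the slide
    only defines the strict inequalities).
    """
    if pad_width_um > core_width_um + core_margin_um:
        return "pad limited"
    return "core limited"


print(die_limit(2000, 1534, 100))  # pad limited
print(die_limit(1500, 1534, 100))  # core limited
```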
Power Planning
•  Metal migration (also known as electromigration)
–  Under high currents, electron collisions with metal grains cause the metal to move. The metal wire may become an open circuit or a short circuit.
–  Prevention: size the power supply lines to ensure that the chip does not fail
–  Experience: keep the current density of the power ring < 1 mA/µm
•  IR drop
–  IR drop is the voltage drop on the power and ground lines caused by high current flowing through the resistive power-ground network
–  When there are excessive voltage drops in the power network or voltage rises in the ground network, the device will run at slower speed
–  IR drop can cause the chip to fail due to
•  Performance (circuit running slower than specification)
•  Functionality problem (setup or hold violations)
•  Unreliable operation (less noise margin)
•  Power consumption (leakage power)
•  Latch up
–  Prevention: add stripes to avoid IR drop on the cells’ power lines

Power Planning: IR Drop
•  Number of counts inversely proportional to DSP clock frequency
•  FC = 10, 20 and 25 MHz
•  Ringo frequency ≈ 115 MHz @ VDD = 1.8V
•  DSP induced PSN is clearly detected
Average PSN = 6 counts × 2.4 mV/count = 14.4 mV

[Figure: measurement schematic with a counter, an enable signal, the supply waveform v(t) and the counting window TC = 1/FC. Plot: C2 counts versus DSP activity (Fc = 20 MHz, Tambient = 27ºC). Horizontal axis: tester ck-cycles, 0 to 250. Vertical axis: C2 counts, 691 to 699. Δ counts = 6.]
Source: J. Rius, UPC

Voltage Drop Verification
VoltageStorm (Cadence)
[Figure: block-level and top-level (flat implementation) analysis flows. Recoverable labels: SoC Encounter virtual prototype with an IP Block, Partition 1 and Partition 2; Encounter Power Analysis; block power consumption; instance power consumption; powergrid view and power grid view library; Voltage Storm; CreateChip hierarchy; block-level and top-level PG analysis; sign-off. Results are displayed in the SoC Encounter interface.]

Power Grid Design

[Figure: power grid design flow. Recoverable labels: Power Plan, Power Grid Creation, Power Grid Connect, Multiple Power & Ground Routing, Power Propagation, Power Grid Design & Refinement Analysis, Extraction & Analysis, Power Grid Parasitics Extraction, Power Grid Analysis, Hierarchical Power Grid Propagation Analysis.]

Power Ring Width

Experience
•  Gate count = 70 k
•  4000 Flip-Flops
•  80% of the FFs with dynamic gated clock
•  Current needed = 0.2 mA/MHz
–  Note: multiply this value by 1.8~2 for a design without gated clocks
Example:
•  Gate count = 200 k
•  No gated clock
•  Clock frequency = 20 MHz
•  Current needed = (200/70) * 0.2 * 20 * 2 = 22.86 mA
•  Current density < 1 mA/µm
•  The width of the P/G ring > 22.86 µm
•  To avoid the slot rule for wide metal, the largest width is 20 µm (process dependent)
•  Use two sets of P/G rings for this case

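The ring width example can be reproduced with a short script. This is a sketch of the rule of thumb only, scaling linearly from the 70 k gate reference design. The function names are hypothetical, and the current density limit is taken as 1 mA per µm of ring width, which is what the step from 22.86 mA to a 22.86 µm ring implies.

```python
import math

REF_GATES_K = 70              # reference design: 70 k gates
REF_CURRENT_MA_PER_MHZ = 0.2  # reference current with gated clocks


def ring_current_ma(gates_k, freq_mhz, gated_clock=True, ungated_factor=2.0):
    """Estimate supply current by scaling the reference design linearly."""
    current = (gates_k / REF_GATES_K) * REF_CURRENT_MA_PER_MHZ * freq_mhz
    if not gated_clock:
        current *= ungated_factor  # the slides suggest a factor of 1.8 to 2
    return current


def ring_plan(current_ma, max_density_ma_per_um=1.0, max_width_um=20.0):
    """Return (minimum total ring width in um, number of ring sets)."""
    min_width = current_ma / max_density_ma_per_um
    n_sets = math.ceil(min_width / max_width_um)
    return min_width, n_sets


current = ring_current_ma(200, 20, gated_clock=False)
width, sets = ring_plan(current)
print(f"{current:.2f} mA, width > {width:.2f} um, {sets} ring sets")
# 22.86 mA, width > 22.86 um, 2 ring sets
```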
Power Stripe Calculation

Experience
•  Add one strap set per 100 µm
Example
•  Core width = height = 1600 µm
•  Stripe set added = 15

Core/IO power pad selection
•  Core power pad connection
–  One set of core power pads (PVDDC along with PVSSC) can provide 40~50 mA current
•  IO power pad
–  One set of IO power pads (PVDDR along with PVSSR) can provide the power for
•  3~4 output pads, or
•  6~8 input pads

[Figure: core power pads, stripes and power ring.]

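The stripe and power pad rules of thumb can be turned into a small calculator. This is a sketch under stated assumptions. The stripe count assumes stripes sit at every 100 µm position strictly inside the core (the edges being served by the power ring), which reproduces the 15 sets of the example. The pad functions use the conservative ends of the quoted ranges (40 mA, 3 outputs, 6 inputs per set), and the way input and output loads are added is an assumption. All function names are hypothetical.

```python
import math


def stripe_sets(core_width_um, pitch_um=100.0):
    """Stripe sets at every pitch position strictly inside the core."""
    return max(int(core_width_um // pitch_um) - 1, 0)


def core_power_pad_sets(current_ma, ma_per_set=40.0):
    """Core power pad sets (PVDDC + PVSSC) needed for a given current."""
    return math.ceil(current_ma / ma_per_set)


def io_power_pad_sets(n_outputs, n_inputs, outputs_per_set=3, inputs_per_set=6):
    """IO power pad sets (PVDDR + PVSSR), adding fractional loads (assumed)."""
    return math.ceil(n_outputs / outputs_per_set + n_inputs / inputs_per_set)


print(stripe_sets(1600))           # 15
print(core_power_pad_sets(22.86))  # 1
print(io_power_pad_sets(12, 24))   # 8
```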
Placement
•  Placement decides the positions of components within allocated blocks
•  One cannot route until the components have been placed.
•  The quality of placement is decided solely on the basis of the quality of routing it allows.
•  Placement is performed using simple estimates of final routing.
•  Timing driven P&R is the state of the art
•  Gates, flip-flops/latches are the common placement objects.
–  Smaller elements like logic gates are placed in single row.
–  Larger blocks are placed in multiple-rows.
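The slides do not say which routing estimate a placer uses. One widely used estimate is the half-perimeter wirelength (HPWL) of each net's bounding box, and the sketch below uses it purely as an illustrative assumption. The cell names, coordinates and nets are made up.

```python
def hpwl(pins):
    """Half-perimeter wirelength of one net, given its (x, y) pin positions."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_hpwl(placement, nets):
    """Sum of HPWL over all nets; lower suggests an easier routing job."""
    return sum(hpwl([placement[cell] for cell in net]) for net in nets)


# Two candidate placements of the same three cells (hypothetical data).
nets = [("u1", "u2"), ("u2", "u3"), ("u1", "u3")]
spread_out = {"u1": (0, 0), "u2": (10, 0), "u3": (10, 10)}
compact = {"u1": (0, 0), "u2": (2, 0), "u3": (2, 2)}

print(total_hpwl(spread_out, nets), total_hpwl(compact, nets))  # 40 8
```

A timing driven flow would weight nets on critical paths more heavily. This sketch treats all nets equally.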

[Figure: placement of std cells in a low utilization core.]

Placement

[Figure: placement example. Source: Magma]

Clock Tree Synthesis
•  The clock signal is used as a timing reference in a synchronous digital system for the movement of data within that system.
•  The clock tree or clock distribution network distributes the clock signal(s) from a common point to all the elements that need it
•  Properties of clock signals
–  They are loaded with the greatest fanout
–  They travel over the greatest distances
–  They operate at the highest speeds
•  The goals of clock tree synthesis include
–  Creating a clock tree spec file
–  Building a buffer distribution network
•  In automatic CTS mode, Encounter will do the following things
–  Build the clock buffer tree according to the clock tree specification file
–  Balance the clock phase delay with appropriately sized, inserted clock buffers

Clock Tree Synthesis

Routing
•  Routing is the process of building the
physical connections between blocks
as defined by the logical connections.
•  Routing takes place in more than one
layer, the exact number available
depending on the process and design
conventions.
•  Layers are connected together using
vias
•  Global Routing
–  Assigns wires to channels
defined during the floor
planning phase
•  Detailed Routing
–  Assigns nets to individual
tracks in the channel

[Figure: Routing and Final Optimisation stage. Boxes shown: Signal Routing, Crosstalk Fixing, Post Route Fix, Antennas, Editing, Decap, Fillers.]

Routing: Signal Integrity Cross-talk
•  Parallel repeater insertion does not reduce the cross-talk peak noise
•  For a 10mm communication bus, the delay noise is lowered by about 77%
•  Staggered repeaters reduce delay noise by about 88%

[Figure: plots of peak noise and propagation delay versus wire length for a 20mm wire. Test structure: three parallel lines (T1IN to T1OUT aggressor, T2IN to T2OUT victim, T3IN to T3OUT aggressor), each with a driver (bfx4, bfx3, bfx50ohm), a receiver and a bfx4 output buffer, placed between shield wires. Further labels: pico pad, Power supply 2.]
Source: M. Meijer and A. Katoch, Philips

Routing: SI Prevention

[Figure: Verification Signoff. Boxes shown: Timing & Crosstalk Analysis, Power Distribution Analysis, Parasitic Extraction.]

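Static Timing Analysis
•  This involves three main steps:
–  Design is broken down into sets of timing paths
–  The delay of each path is calculated
–  All path delays are checked to see if timing constraints have been met

[Figure: circuit with input A, a flip-flop (D, Q, CLK) and output Z, with timing paths Path 1, Path 2 and Path 3 marked.]

Path delay calculations

[Figure: a timing path with labels D1 and U33, annotated with delays 1.0, 0.54, 0.32, 0.66, 0.23, 0.43 and 0.25.]
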
path_delay = (1.0 + 0.54 + 0.32 + 0.66 + 0.23 + 0.43 + 0.25) = 3.43 ns
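The second and third steps (compute each path delay, then check it against the constraint) can be sketched as follows. This is a simplified illustration: it ignores setup time, clock skew and hold checks, and the path name and the 4.0 ns clock period are hypothetical values that do not appear in the slides.

```python
def path_delay(stage_delays_ns):
    """Delay of one timing path: the sum of its cell and wire delays."""
    return sum(stage_delays_ns)


def check_paths(paths, clock_period_ns):
    """Return {path name: slack in ns}. Negative slack means a violation."""
    return {
        name: clock_period_ns - path_delay(delays)
        for name, delays in paths.items()
    }


# Stage delays taken from the path delay calculation on the slide.
paths = {"example_path": [1.0, 0.54, 0.32, 0.66, 0.23, 0.43, 0.25]}

print(round(path_delay(paths["example_path"]), 2))  # 3.43

# Hypothetical 4.0 ns clock period, chosen only for illustration.
slack = check_paths(paths, clock_period_ns=4.0)
violations = [name for name, s in slack.items() if s < 0]
print(violations)  # []
```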


Physical Verification

•  DRC
–  Design Rule Checking
•  LVS
–  Layout vs. Schematic verifications

Chip Finishing

•  Seal-ring & Artefact Generation
–  The seal ring helps to make the circuit moisture resistant and prevents cracks from forming in the die when the wafer is sawn
–  Sometimes this step is simply called ‘Design Chip Finishing’
–  Artefacts: critical dimension structures, mask ids, fuse markers, etc.
•  Tiling - dummy fill/pattern fill
–  The fab's stringent min and max rules on layer densities on active, poly and metal must be met by all designs
–  Currently a back-end operation
•  Each step is followed by a Physical Verification step

[Figure: chip with tiles and seal ring.]

Package Fitting

•  Selection of appropriate package
•  Route pads to pins
–  Wire length is important
–  Rule checking
•  The minimum GDS2 information required is the nitride or pad opening layer, or the pad boundary layer

[Figure: package options.]

Packaging
