Clock Issues in Deep Submicron Design
1999. 10
Jun Dong Cho
Sungkyunkwan Univ. Dept. ECE
Mail : [email protected]
Homepage : vada.skku.ac.kr
Agenda
The Issues of Clock Tree Synthesis
Global Distribution of Clocks and Power
Consideration of Timing/Density/Clock/Power
Clock Management Scheme
Clock Power Reduction
Clocking Techniques in Deep Submicron Design
Multiple Clock Design
Need of PLL
Global Distribution of Clocks and Power (1)
When ASICs are built on a deep submicron process with tens of millions of
gates and a clock frequency below 1 GHz, the designer must consider many
details in the clock and power circuitry. Normally these circuit elements are
not given much thought: power is taken from the rails drawn across the top
and bottom of the schematic page, and the clock is assumed to have ideal
characteristics, a square wave running at the specified frequency.
Global Distribution of Clocks and Power (2)
The interaction between clocks and power consumption may require the
ability to generate clock signals which can be stopped in the inactive
sections to minimize power consumption. The power and ground system
will take up about half of the available package pins to be able to handle
the tens of amperes of average current consumed by the IC.
Consideration of Timing/Density/Clock/Power (1)
This further exacerbates the power consumption problems by making the
big clock buffers stay in the high current linear portions of their transfer
curves for greater amounts of each clock cycle. In addition, the clock
network has many signals switching simultaneously, adding large power
surges and a very large potential for crosstalk and interference to the clock
and power distribution sub-systems.
The most basic problems facing designers are managing the skew in the
clock system (which entails getting the clock everywhere at the same time),
supplying sufficient clock drive to operate all of the clocked elements in the
system, and getting operating power to all active circuit elements. For
single-frequency clock systems, the tradeoffs are speed, power
consumption, and area. The problems with skew and the process of
balancing the delays across the chip occur in parallel with the increases in
density and complexity.
Consideration of Timing/Density/Clock/Power (2)
Tom Katsioulas, marketing director of the IC design group of Cadence
Design Systems (San Jose, CA), notes that the timing, density, clock, and
power are intricately related in the following ways:
Clock Management Scheme (1)
Some of the clocking problems of complex, high speed circuits are
associated with the physics of the devices and interconnections. At 250
MHz, the clock period is only 4ns. The amount of time available after
accounting for clock skew and set-up and hold times leaves very little time
for buffer and propagation delays.
The large clock buffers lead to high power consumption, often as large as
30 to 50 percent of the total chip power consumption, as well as noise
problems due to the large current spikes generated when the buffers
switch. An alternative approach is to distribute the buffers into the clock
tree. This reduces the power consumption by requiring buffers of smaller
size and also helps the reliability aspects by reducing the size of the current
spike.
Clock Management Scheme (2)
The clock system and the wide word datapaths all switching at the same
time increase the possibility for glitches and higher peak switching currents.
They put additional loads on the power delivery systems. The resulting
datapath skews will require close scrutiny of the datapath localization and
grouping, as well as careful analysis of pipeline lengths. The careful
analysis of the signal paths relative to the clocks is critical to making a
working integrated circuit.
Clock Management Scheme (3)
Clocking schemes and power distribution are going to be affected by the
system requirements. The areas for compromise are power, area, and
performance. If one of the areas is defined as much more critical than the
others, it will drive the design. For example, if performance is the key
parameter, a single point clock with sufficient buffers to drive all the
circuitry would be the best choice. The tradeoff would be in a clock system
which draws up to half of the total system current. An intermediate solution
might be a multiply driven clock spine.
If all of the circuitry did not need to run at the same speed, derived multiple
clocks could be generated from the master reference clock. The sections
will get clocks appropriate for their functions. Why have a 250 MHz clock
for a serial I/O channel controller? This could save some more power since
the frequency term in the power equation has now been reduced for much
of the on-chip circuitry.
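The saving from derived clocks can be estimated with the usual dynamic power relation P = f * C * Vdd^2, applied per clock domain. The Python sketch below is only an illustration: the block names, capacitances, frequencies, and supply voltage are hypothetical values, not figures from any real design.

```python
# Rough dynamic clock-power estimate: P = f * C * Vdd^2 per clock domain.
# All block names and numbers below are hypothetical, for illustration only.

VDD = 2.5  # supply voltage in volts (assumed)

# (name, switched clock capacitance in farads, required clock in Hz)
BLOCKS = [
    ("datapath", 200e-12, 250e6),
    ("memory_ctrl", 100e-12, 125e6),
    ("serial_io", 50e-12, 25e6),
]

def clock_power(cap_farads, freq_hz, vdd=VDD):
    """Dynamic power of one clock domain in watts."""
    return freq_hz * cap_farads * vdd ** 2

def total_power(blocks, single_freq_hz=None):
    """Sum power over blocks; if single_freq_hz is given, every block
    runs at that one frequency instead of its own derived clock."""
    return sum(
        clock_power(cap, single_freq_hz if single_freq_hz else freq)
        for _, cap, freq in blocks
    )

single = total_power(BLOCKS, single_freq_hz=250e6)  # one 250 MHz clock everywhere
derived = total_power(BLOCKS)                       # per-block derived clocks
print(f"single clock  : {single * 1e3:.1f} mW")
print(f"derived clocks: {derived * 1e3:.1f} mW")
```

In this made-up example the slow blocks dominate the saving, because their frequency term drops while their capacitance stays the same.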
The designer can also gate the clock signals to unused sections of the
chip, which keeps the clocks from toggling the inputs of sections with no
data changes. The tradeoff is that the gate delay will exacerbate the clock
skew and clock edge uncertainty for those sections. If the gate is
used in place of a buffer in the clock tree section, the clock tree does not
require an additional level of buffers to match the delays due to the extra
gate levels.
Clock Power Reduction
If power consumption and/or management is the most important concern,
then the complicated scheme described in the introduction should be
considered. This could be multiple clocks, with multiple frequencies so only
those circuits requiring extremely high performance would get the highest-
speed clocks. Other areas would have lower-speed clocks and gated clocks
and power-down circuitry to minimize the capacitive charging currents.
Analyzing the intricacies of multiple clock interactions requires more detail
and different techniques than are available in the standard ASIC flow.
Clocking Techniques in Deep Submicron Design (1)
Physical implementation of a clock network requires novel approaches to
balance the tradeoffs among minimal skew, low latency, and low
power usage. One innovative approach is a clock network driven from
multiple clock driver pads, also known as a multiply-driven clock spine
network. Its benefit is that it can reduce both skew and latency.
One reason this technique produces low skew is that the clock signal
is driven from multiple points on the chip, thereby reducing the effective
distance between drivers and clock signal receivers (otherwise known as
flip-flops). Additionally, the clock signal arrival time difference between the
first flip-flop and the last flip-flop is much smaller, minimizing the skew. In
multiply-driven clock networks, latency is reduced because fewer layers of
buffer trees are needed to drive the clock net from multiple ends.
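The trend described above can be seen in a very crude model. The Python sketch below assumes that each flip-flop on a one-dimensional spine is effectively served by its nearest driver and that wire delay grows with the square of distance (a distributed RC estimate). Both assumptions, and all numeric values, are hypothetical. A real multiply-driven net needs circuit-level analysis.

```python
# Crude comparison of clock skew on a spine driven from one point
# versus several points. Assumption: each flip-flop is served by its
# nearest driver, and wire delay is 0.5 * r * c * d^2 (distributed RC).
# This only shows the trend; it is not an accurate multi-driver model.

R_PER_MM = 50.0      # wire resistance, ohms per mm (assumed)
C_PER_MM = 0.2e-12   # wire capacitance, farads per mm (assumed)

def wire_delay(distance_mm):
    """Distributed-RC delay estimate in seconds."""
    return 0.5 * R_PER_MM * C_PER_MM * distance_mm ** 2

def skew_and_latency(sinks_mm, drivers_mm):
    """Return (skew, latency) for sinks placed along a 1-D spine."""
    delays = [
        wire_delay(min(abs(s - d) for d in drivers_mm))
        for s in sinks_mm
    ]
    return max(delays) - min(delays), max(delays)

SPINE_MM = 16
sinks = [i * 0.5 for i in range(2 * SPINE_MM + 1)]  # a flip-flop every 0.5 mm

one_driver = skew_and_latency(sinks, [0.0])
four_drivers = skew_and_latency(sinks, [2.0, 6.0, 10.0, 14.0])

print("one driver  : skew %.1f ps, latency %.1f ps"
      % (one_driver[0] * 1e12, one_driver[1] * 1e12))
print("four drivers: skew %.1f ps, latency %.1f ps"
      % (four_drivers[0] * 1e12, four_drivers[1] * 1e12))
```

With four evenly spaced drivers the longest driver-to-sink distance shrinks from 16 mm to 2 mm in this example, so both the worst-case delay and the spread of delays drop sharply.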
Clocking Techniques in Deep Submicron Design (2)
Herman Lam, physical design manager at Fujitsu (San Jose, CA), says that
they are encouraging place and route of the clock system first, then the rest
of the signals. For high performance functions, a large clock buffer driving
a minimum size clock tree is the best way to accomplish the clocking. They
place virtual flip-flops at the ends of the clock lines for loads, then let the
software move the virtual flip-flops to optimal locations based on the actual
logic use. When people try to get the logic interconnections first, then try to
balance the clock trees for matched delays, the resulting circuit has a much
larger clock tree and its associated parasitics which increase power
consumption.
Clock networks for deep submicron designs are typically inserted during
physical layout. This may be done with a clock tree place and route tool or
manually inserted in physical layout of the design. After place and route of
the design the RC values for the clock network are extracted and measured.
Multiple Clock Design
Multiply-driven clock spine network delays are very difficult to model
because analytical RC algorithms only work for a net with a single driver.
Circuit (Spice) simulation has been used as an alternative to analyze
multiple driven clock nets, but the Spice results must be manually analyzed
and backannotated to timing analysis tools. One alternative is a manual
solution that breaks the multiply driven net into multiple subnets and
extracts the subnet segments for RC analysis. This method totally breaks
down for more than a few drivers which drive a single clock net. For
accurate skew and latency analysis, special EDA tools are needed to model
multiply-driven clock networks automatically and the extracted data needs
to be back-annotated to timing analysis tools.
Multiply-driven clock networks can be designed with very small skew and
latency, but special tools beyond RC extraction and analysis are required to
ensure that such networks meet the requirements of high-performance
deep submicron designs.
Need of PLL
A phase-locked loop (PLL) is useful to resynchronize clocks and to generate
multiples of the base system clock. The PLL can develop a clock with zero or
even negative effective skew by adjusting the phase comparator response. One
caveat is that one must monitor the phase jitter and noise associated with the
PLL and clock regeneration circuitry. The jitter and synchronization can create
repeatable phase relationships within the clock network for continuous signals.
However, PLLs consume a lot of power, making them less attractive for
low-power applications.
According to John Harrington, manager of ASIC products at AT&T
Microelectronics (Reading, PA), "PLLs are useful for clock doublers and triplers
[and other multiples]. This can help by reducing external clock frequencies and
allow lower cost crystals which can normally go up to 40 MHz. Three-fourths of
their designs have a PLL to synchronize and/or align clock edges. The designer
needs to be careful of PLL latency and lock times for those situations where the
clock is not continuous."
Jim Smith, ASIC product manager at Hitachi America (Brisbane, CA), agrees,
noting, "We try to add PLLs to compensate and resync the clocks where
possible. For multiple clocks, the problem is the latency and lock times for the
clocks as well as the added jitter errors. The jitter errors add to the total clock
skew."
The Basic Consideration of Clock Tree Synthesis
Basic Feature of Clock Skew
Circuit operation speed is increasingly limited by clock skew, which is the
maximum difference in arrival times of the clocking signal at the logic gates.
The figure shows the definition of clock skew. This is seen from the
inequalities below, which govern the clock period of a clock signal net:
TGATE(min) + TRC(min) - THOLD(max) > TSKEW
TGATE(max) + TRC(max) + TSETUP(max) + TSKEW < t
where:
TGATE = signal propagation delay from clock input to data output of a logic gate
TRC = signal propagation delay because of metal-interconnect RC effects between logic gates
THOLD = data-valid hold time requirement for a logic gate
TSETUP = data-valid setup time requirement for a logic gate
TSKEW = maximum amount of skew between clock signals, and
t = time for one period of the clock
The clock, with period t, in most VLSI ASIC designs is getting faster, and the
tolerance of THOLD and TSETUP is getting smaller. In deep submicron and
submicron technologies, the effect of TRC becomes important. The goal of
balanced clock tree distribution is to make the clock skew, TSKEW, as small as
possible.
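As a quick numeric illustration of the two inequalities, the Python sketch below computes the largest skew a single register-to-register path could tolerate. The function names and all delay values are hypothetical and were chosen only to make the arithmetic easy to follow.

```python
# Check the two clock-skew inequalities for one register-to-register path:
#   TGATE(min) + TRC(min) - THOLD(max) > TSKEW
#   TGATE(max) + TRC(max) + TSETUP(max) + TSKEW < t
# All delay values are hypothetical, in nanoseconds.

def max_skew_allowed(t_gate_min, t_rc_min, t_hold_max,
                     t_gate_max, t_rc_max, t_setup_max, period):
    """Upper limit on TSKEW from both inequalities (negative = no solution)."""
    hold_margin = t_gate_min + t_rc_min - t_hold_max
    setup_margin = period - (t_gate_max + t_rc_max + t_setup_max)
    return min(hold_margin, setup_margin)

def timing_ok(t_skew, **path):
    """True if t_skew is strictly below both limits."""
    return t_skew < max_skew_allowed(**path)

path = dict(t_gate_min=0.4, t_rc_min=0.1, t_hold_max=0.2,
            t_gate_max=2.4, t_rc_max=0.8, t_setup_max=0.3,
            period=4.0)  # 4 ns period, i.e. 250 MHz

limit = max_skew_allowed(**path)
print(f"skew must stay below {limit:.2f} ns")
print("0.2 ns skew ok:", timing_ok(0.2, **path))
print("0.4 ns skew ok:", timing_ok(0.4, **path))
```

In this example the hold-side inequality is the tighter one, so shortening the clock period further would not be the first thing to break.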
Consideration of Synchronous Design
Assume that signals A and B arrive at the two identical D flip-flops
simultaneously and that the clock signal reaches both D flip-flops within
t seconds. This circuit will then produce the correct output, Y3, if it is
built on non-submicron technology, because in non-submicron technology the
main source of delay and skew is the propagation delay of the logic gates.
The figure illustrates that unequal wire lengths from the clock source to the
D flip-flops do not contribute much unbalanced wire delay in non-submicron
technology. The wire delay can be neglected compared to the logic gate delay.
However, in submicron and deep submicron technologies, logic gate delay is no
longer the sole cause of delay. The wire load delay also contributes a large
proportion of delay. The wire distance between logic gates can cause substantial
delay. Since the distance from the clock source to the clock input of the D flip-flop
D1 is longer than the distance from the clock source to the clock input of D flip-
flop D2, clock skew will occur. Y3 may generate incorrect results due to the clock
skew.
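The effect of unequal clock wire lengths can be sketched with a simple Elmore-style estimate, delay = R_wire * (C_wire / 2 + C_load). In the Python sketch below, the wire lengths, per-millimeter resistance and capacitance, load capacitance, and the 1 ns reference gate delay are all assumed values for two hypothetical wire technologies.

```python
# Elmore-style delay of a uniform wire driving a flip-flop clock pin:
#   delay = R_wire * (C_wire / 2 + C_load)
# The per-unit values below are assumed, not taken from any real process.

def wire_delay(length_mm, r_per_mm, c_per_mm, c_load):
    r_wire = r_per_mm * length_mm
    c_wire = c_per_mm * length_mm
    return r_wire * (c_wire / 2 + c_load)

def clock_skew(len_d1_mm, len_d2_mm, r_per_mm, c_per_mm, c_load=20e-15):
    """Clock arrival-time difference between flip-flops D1 and D2."""
    return (wire_delay(len_d1_mm, r_per_mm, c_per_mm, c_load)
            - wire_delay(len_d2_mm, r_per_mm, c_per_mm, c_load))

GATE_DELAY = 1e-9  # assumed logic gate delay for comparison: 1 ns

# Same geometry (8 mm to D1, 2 mm to D2), two hypothetical wire technologies.
wide_wire = clock_skew(8, 2, r_per_mm=5.0, c_per_mm=0.2e-12)      # low-resistance wire
narrow_wire = clock_skew(8, 2, r_per_mm=150.0, c_per_mm=0.2e-12)  # thin, resistive wire

for name, skew in (("wide wire", wide_wire), ("narrow wire", narrow_wire)):
    print(f"{name}: skew {skew * 1e12:.1f} ps "
          f"({100 * skew / GATE_DELAY:.0f}% of the gate delay)")
```

With the low-resistance wire the skew is a few percent of the assumed gate delay and can be ignored. With the resistive wire the same length mismatch produces a skew comparable to the gate delay itself.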
Design Style Specific Problems
Full Custom : The clock routing problem in full custom style depends on the
availability of a routing layer for clocks.
If a dedicated, obstacle-free layer is available for routing, the clock
routing problem in full custom design is exactly the same as the CRP (Clock
Routing Problem): minimizing total delay and minimizing skew between buffers.
But if obstacles are present, we refer to the problem as the BBCRP (Building
Block Clock Routing Problem): minimizing both total delay and skew under the
constraint that wires do not intersect any rectangles.
Standard Cell : The clock routing problem in standard cell designs is
somewhat easier than in full custom in some aspects,
since clock lines have to be routed in channels and feedthroughs.
Conventional methods do not work in standard cell design since terminals are
neither uniformly distributed (as in full custom) nor symmetric in
nature (as in gate array).
Gate Array : Gate arrays are symmetrically arranged in a plane and allow the
clock to be routed in a symmetric manner as well. The algorithms for clock
routing in such symmetric structures have been well studied and well
analyzed.
Balanced Clock Tree
A balanced clock tree distribution is the fundamental requirement for
synchronous systems. It can minimize the clock skew and ensure that the
clock signals arriving at any logic gates are within the clock skew
specification. A typical balanced clock tree is like a binary tree where all
child nodes at the same level are at the same distance from the root
(parent) node.
If the time for passing signals down a level is identical for all child
nodes, then all child nodes will receive the signal from the root (parent)
node at the same instant.
Load Balancing
Load balancing equalizes the clock delay according to the number of
clocked components on each branch.
If one side of a node drives more clocked components than the other,
that side is given a shorter clock feed line, and the side with fewer
clocked components is given a longer clock feed line.
Load Balancing using Elmore Delay
Load Balanced Clustering
Balanced Tree + Mesh
Single vs. Distributed
Clock Tree Distribution Algorithms
An optimal balance clock tree distribution is to connect all logic gates
directly to the clock source. Assuming that there is no buffer between any
logic gate and the clock source, and the wire width is constant, the
furthest logic gate will experience the largest delay. The delay time can be
equalized for all logic gates by adding logic gate delay and interconnect
delay to the faster signal paths. Then all signal paths will experience the
same delay. This approach not only has a near zero clock skew, but also
has the fastest speed. However, this approach is not feasible because the
drive strength of the clock source is limited, and there is not enough room
to route wires around the clock source.
Logic gates are usually being placed by cell placement program at the
early stage of layout. The positions of the buffers and the clock source;
however, are determined by the clock tree distribution algorithm. Two
general clock tree distribution algorithms are discussed here. It should be
noted that a few major assumptions are made for the following discussion:
the wire resistance and wire capacitance have linear relationship with the
clock signal delay; all buffers are identical and they contribute the same
delay.
Comparison
Other clock tree distribution algorithms have been proposed,
such as the buffer distribution algorithm [3], general zero-skew
clock net construction [4], and process-variation-tolerant
zero-skew clock routing [5]. Each algorithm has its own distinct
characteristics. It is difficult, if not impossible, to determine
which algorithm is the best. If logic gates are evenly
distributed, the clock trees generated by these algorithms
may look similar. If the placement pattern of the logic gates
is unique, clock trees built by different algorithms may have
noticeable differences in clock skew, clock signal speed,
wire length, and design flexibility.
Buffer Pre-Placement
Iso-Radius Buffer Insertion
Width Control
Width Tapering
Buffer/Load Distribution
H-Tree
The H-Tree is a special case of the CRP, where all the clock
terminals are arranged in a symmetric fashion, as is
the case in gate arrays.
The H-Tree algorithm connects two terminals in a
particular order. Then it connects the two middle
points of the vertical segments. The connected middle
points are called the tapping points.
The H-Tree gives all terminals the same wire length
from the source, hence the skew among the terminals is zero.
X-Tree : If the routing is not restricted to being rectilinear, the H shape
can be replaced with an X shape. But this is undesirable, since it may cause
crosstalk due to the close proximity of wires.
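The equal-length property of the H-Tree can be checked with a short recursive construction. The Python sketch below is a simplified illustration in which each level places four sub-centers around the current center and halves the arm lengths. The function name and the dimensions are arbitrary choices, and only the wire length from the root to each terminal is tracked.

```python
# Recursive H-tree construction. At each level the clock is routed from
# the current center horizontally by half_w and then vertically by half_h
# to four sub-centers, so every terminal sees the same total wire length.
# Coordinates and sizes are arbitrary illustration values.

def h_tree(x, y, half_w, half_h, levels, dist=0.0, leaves=None):
    """Collect (x, y, wire_length_from_root) for every leaf terminal."""
    if leaves is None:
        leaves = []
    if levels == 0:
        leaves.append((x, y, dist))
        return leaves
    for dx in (-half_w, half_w):          # horizontal bar of the H
        for dy in (-half_h, half_h):      # vertical bars of the H
            h_tree(x + dx, y + dy, half_w / 2, half_h / 2,
                   levels - 1, dist + half_w + half_h, leaves)
    return leaves

leaves = h_tree(0.0, 0.0, 4.0, 4.0, levels=3)
lengths = {round(d, 9) for _, _, d in leaves}
print(len(leaves), "terminals, distinct path lengths:", lengths)
```

Because every terminal reports the same path length, the wire-length contribution to skew is zero in this idealized layout. Unequal loads or process variation are not modeled here.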
Hierarchical Matching Tree: MMM & GMA
MMM (Method of Means and Medians) : The MMM algorithm recursively
partitions a circuit into two equal parts, and then connects the center of
mass of the whole circuit to the centers of mass of the two sub-circuits.
GMA (Geometric Matching Algorithm) : Unlike the MMM algorithm, which is a
top-down algorithm, GMA works bottom-up.
[Figure: MMM (Cut 1, Cut 2, Cut 3) and GMA (matching order Black -> Blue -> Red -> Green)]
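The top-down partitioning of MMM can be sketched as follows. This Python sketch builds only the tree topology: it alternates median cuts in x and y and records the center of mass at each node. The actual wire routing between centers of mass is omitted, and the sink coordinates are made-up sample data.

```python
# Method of Means and Medians (MMM), top-down sketch.
# Sinks are (x, y) points; cuts alternate between x and y at the median.
# Only the tree topology is built; routing between centers is left out.

def center_of_mass(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def mmm(points, cut_x=True):
    """Return a nested dict: node position plus its two subtrees."""
    node = {"pos": center_of_mass(points), "children": []}
    if len(points) == 1:
        return node
    axis = 0 if cut_x else 1
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2                     # median split
    for half in (ordered[:mid], ordered[mid:]):
        node["children"].append(mmm(half, not cut_x))
    return node

def depth_range(node, depth=0):
    """(min, max) depth of the leaves, to see how balanced the tree is."""
    if not node["children"]:
        return depth, depth
    ranges = [depth_range(c, depth + 1) for c in node["children"]]
    return min(r[0] for r in ranges), max(r[1] for r in ranges)

sinks = [(1, 1), (2, 7), (3, 4), (5, 2), (6, 8), (7, 5), (8, 1), (9, 6)]
tree = mmm(sinks)
print("root at", tree["pos"], "leaf depths", depth_range(tree))
```

A bottom-up method such as GMA would instead start from the individual sinks and repeatedly pair nearby subtrees. That variant is not shown here.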
Zero Skew Algorithm
The Zero Skew Algorithm is recursive and bottom-up in nature.
This algorithm:
Assumes that pairing of points has been done
Concerns itself with finding the tapping point very accurately, based on
capacitive loading of the clock terminals as well as the delay in the sub-
trees
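$$r_1\left(\frac{c_1}{2} + C_1\right) + t_1 = r_2\left(\frac{c_2}{2} + C_2\right) + t_2$$
[Figure: tapping point x between sub-trees T1 and T2; the wire to T1 is modeled with capacitance c1/2 at each end, and T1 has delay t1 and load capacitance C1]
If we can’t achieve the zero skew in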
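The tapping point can be computed in closed form once the wire is modeled as a uniform RC segment. The Python sketch below assumes the usual balance condition r1(c1/2 + C1) + t1 = r2(c2/2 + C2) + t2, where r1, c1 and r2, c2 are the resistance and capacitance of the two wire pieces on either side of the tapping point, and C1, t1 and C2, t2 are the load capacitance and internal delay of sub-trees T1 and T2. The function names, parameter names, and numeric values are hypothetical.

```python
# Zero-skew tapping point on a wire of given length joining two sub-trees.
# With total wire resistance r and capacitance c, a tap at fraction x from
# sub-tree 1 gives r1 = x*r, c1 = x*c, r2 = (1-x)*r, c2 = (1-x)*c.
# Solving r1*(c1/2 + C1) + t1 = r2*(c2/2 + C2) + t2 for x yields the
# expression in tapping_point(). All numbers are illustrative.

def tapping_point(t1, c_sub1, t2, c_sub2, length, r_unit, c_unit):
    """Fraction of the wire length, measured from sub-tree 1."""
    r = r_unit * length
    c = c_unit * length
    return (t2 - t1 + r * (c_sub2 + c / 2)) / (r * (c + c_sub1 + c_sub2))

def branch_delay(frac, t_sub, c_sub, length, r_unit, c_unit):
    """Delay from the tapping point through one wire piece into a sub-tree."""
    r = r_unit * length * frac
    c = c_unit * length * frac
    return r * (c / 2 + c_sub) + t_sub

# Hypothetical sub-trees: T1 is slower and heavier than T2.
params = dict(length=1000.0, r_unit=0.1, c_unit=0.2e-15)  # um, ohm/um, F/um
x = tapping_point(t1=40e-12, c_sub1=300e-15, t2=30e-12, c_sub2=100e-15, **params)
d1 = branch_delay(x, 40e-12, 300e-15, **params)
d2 = branch_delay(1 - x, 30e-12, 100e-15, **params)

print(f"tap at {x:.3f} of the wire from T1")
print(f"delay to T1 {d1 * 1e12:.2f} ps, delay to T2 {d2 * 1e12:.2f} ps")
```

The tap lands closer to the slower, heavier sub-tree, which matches the load balancing idea of giving the heavier side the shorter feed line. A result outside the range 0 to 1 would mean that no point on this wire balances the two sides.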
A Worst Case Tree
RHMT
Interconnect Topology
Resistance ratio = driver resistance / unit wire resistance
When the resistance ratio is small, interconnect topology
optimization is important.
Important metrics: total wire length, radius (longest
source-sink path length), diameter (for multi-source nets)
Optimal tree construction algorithms
BRBC (Bounded-Radius Bounded-Cost) algorithm
A-tree algorithm: start with a forest of n single-node A-trees, then
repeatedly combine two A-trees into a new one.
Recent Approaches in Clock Tree Synthesis
Research in Clock Tree Synthesis Algorithm
Wire-sizing & Parallel Algorithm for zero skew
Reducing Clock Power using Multiple Voltage
Clock Tree Scheduling with Storage Retiming
Wire-sizing & Parallel Algorithm for zero skew (1)
An iterative approach is used. One wire segment is selected and an alternate
wire size is tried. To make the skew of the tree zero, we have to re-merge the
sub-tree rooted at the current wire with its sibling.
Assumption : The sibling wire uses the same wire size.
This propagation continues until all the wire segments on the path from the
current wire to the root wire are re-merged. When the size of a wire is
locally optimized, the effect of the wire size change is propagated by
zero-skew merging to the root of the clock tree.
The lengths of all the wires along the propagation path and their siblings
may change, but their wire sizes remain unchanged.
[Figure: propagation path from the resized wire W to the root]
Wire-sizing & Parallel Algorithm for zero skew (2)
Sub-tree Partition : Assume there are two processors. The sub-tree
assignment will not occur on nodes of depth 1.
[Figure: clock tree with sub-trees assigned to processors p0 and p1]
The tree is partitioned into the top part and the bottom part.
Only the nodes in the bottom part are distributed among the processors.
The nodes in the top part are shared among the processors.
Iteration Method :
First, each processor does the wire-sizing for the top part (except the root).
Then each processor does the wire-sizing for the wires in the bottom part of
the tree in a distributed manner, and the results are synchronized.
Reducing Clock Power using Multiple Voltage
$$P = f\,C_L V_{dd} V_s < f\,C_L V_{dd}^2$$
HL Converter : converts the
incoming clock signal to the chip
from a high voltage swing to a
lower voltage swing.
LL Converter : regenerates the
signal and maintains a sharp slew
rate as the signal passes through
the network.
LH Converter : converts the signal
back to the higher voltage swing used
by the logic network at the sink FF.
Instead of using multiple voltages,
use only a reduced-swing clock
scheme.
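The benefit of a reduced clock swing can be estimated from the relation P = f * C_L * Vdd * Vs, which reduces to f * C_L * Vdd^2 when the swing Vs equals the supply Vdd. The Python sketch below uses hypothetical values for frequency, load capacitance, and voltages, and it ignores the power drawn by the converters themselves.

```python
# Clock network power with a reduced voltage swing Vs:
#   P = f * C_L * Vdd * Vs   (full swing: Vs = Vdd, so P = f * C_L * Vdd^2)
# Converter overhead is ignored; all values are assumed for illustration.

def clock_power(freq_hz, c_load, vdd, v_swing=None):
    """Dynamic clock power in watts; v_swing defaults to full swing."""
    if v_swing is None:
        v_swing = vdd
    return freq_hz * c_load * vdd * v_swing

full = clock_power(250e6, 500e-12, vdd=2.5)
reduced = clock_power(250e6, 500e-12, vdd=2.5, v_swing=1.25)
saving = 1 - reduced / full

print(f"full swing   : {full * 1e3:.1f} mW")
print(f"reduced swing: {reduced * 1e3:.1f} mW")
print(f"saving       : {saving:.0%}")
```

Under this relation the power scales linearly with the swing, so halving the swing halves the clock network power in this example.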
Clock Tree Scheduling with Storage Retiming
Retiming improves the speed of a digital circuit by relocating its storage
elements while preserving the functionality of the original design.
Clock scheduling achieves the same effect as retiming by introducing skew
between the clock signals that control the timing of the storage elements
within a circuit.
When the clock skew is zero, the minimum clock period is the longest
delay of all the combinational paths in the circuit. So the goal is to balance
the longest delay of all the data paths by relocating the registers.
When nonzero clock skew is introduced, the circuit can successfully
operate at a clock period which equals the largest difference in the delays
of the slowest path and the fastest path between any pair of registers.
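A small numeric example makes the effect of intentional skew concrete. The Python sketch below models a single pair of registers, ignores register setup and hold times, and uses made-up path delays. The names `period_ok` and `min_period` are illustrative helpers, not part of any tool.

```python
# Feasibility of a clock period for given per-register clock arrival
# times (intentional skew). Register setup/hold times are ignored and
# the path delays are hypothetical, in nanoseconds.

def period_ok(period, arrival, paths):
    """paths: list of (src, dst, d_min, d_max) combinational paths."""
    for src, dst, d_min, d_max in paths:
        # long-path constraint: data must arrive before the next clock edge
        if arrival[src] + d_max > arrival[dst] + period:
            return False
        # short-path constraint: data must not race through the same edge
        if arrival[src] + d_min < arrival[dst]:
            return False
    return True

def min_period(arrival, paths):
    """Smallest period meeting the long-path constraints, or None if a
    short-path constraint is violated (no period can fix that)."""
    if any(arrival[s] + d_min < arrival[d] for s, d, d_min, _ in paths):
        return None
    return max(arrival[s] + d_max - arrival[d] for s, d, _, d_max in paths)

paths = [("R1", "R2", 4.0, 10.0)]        # fastest path 4 ns, slowest 10 ns

zero_skew = {"R1": 0.0, "R2": 0.0}
scheduled = {"R1": 0.0, "R2": 4.0}       # R2's clock arrives 4 ns later

print("zero skew :", min_period(zero_skew, paths), "ns")
print("scheduled :", min_period(scheduled, paths), "ns")
```

With zero skew the period is set by the slowest path. Delaying the clock at R2 by the fastest-path delay brings the period down to the difference between the slowest and fastest path delays, and any larger skew would violate the short-path constraint.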
GALS Clock Scheme (1)
Power consumption in the clock tree is now about
50 percent of total power consumption.
From the system design point of view, we must reduce the
power consumption of the clock.
Clock power consumption in large high-performance
VLSIs can be reduced by adopting the GALS (Globally
Asynchronous, Locally Synchronous) design style.
Conclusion - Low Power Issues (1)
Power optimization allows logic optimization to simultaneously optimize for
timing, area and power. So all the inputs to optimization are the same with the
addition of two new power constraints: max dynamic power and max leakage
power. A power-optimizing logic optimization system takes as input a gate-
level netlist or database, technology library, optional constraints for timing and
area, and parasitic information (initially in the form of estimated wireloads, but
if backannotation has been done that information will be used). All that's
needed in addition for power optimization is to set a power constraint and
supply switching activity - the same switching activity used with power
analysis.
What you get out of power optimization is a gate-level netlist, optimized to
meet all of your constraints. A natural question to ask is: "If optimization at the
RTL and Behavioral levels can have a great impact on final power dissipation,
why offer a gate-level power optimization capability first?" More than a decade of
experience in synthesis and optimization has shown that results at the RTL level
depend on the quality of optimization at the gate level. The first commercially successful
synthesis products were gate-level timing optimizers and these paved the way
for RTL and Behavioral synthesis systems. In a similar way gate-level power
optimization will pave the way for RTL and Behavioral synthesis for low power.
Conclusion - Low Power Issues (2)
Earlier we made the point that analysis precedes optimization. Here we
make another general point: We might say that just as analysis precedes
optimization, optimization precedes synthesis. Or to put it another way:
Successful synthesis at higher levels requires successful optimization at
lower levels.
The Key Terms in Clock Tree Synthesis
Clock buffer: circuit element to isolate and amplify incoming clock signal.
Clock tree: design technique to achieve balanced delays and loads in the clock
buffers.
Gated clock: clock line that can control clock transmission to the operating circuits.
Ground bounce: the change in ground (vss) reference levels due to current in the
ground line.
Ground loop: the noise caused in the ground line(s) due to unbalanced IR drops in
the ground line.
Insertion delay: the time from clock pad to individual flip-flops.
IR drop: the voltage drop caused by the current I through the resistor R.
Jitter: the change in period to period timing in a clock signal.
Latency: the time for a clock to become available in the circuit.
Multiphase clock: clocking system with more than one phase; the phases may be overlapping or
non-overlapping. Examples: biphase (clock and its complement) and quadrature (clocks separated by
a phase angle of 90 degrees).
PLL: Phase-Locked Loop, a variable frequency generator locked to a source signal.
Skew: the maximum difference in clock arrival time between any two flip-flops.
Slew rate: the rate at which a signal goes from one level to the other level; often
characterized by the rise time or fall time.
References & Suggested Readings(1)
[1] B. Schweber. Delivering The High-Speed Clock: Not Easy To Be On Time. EDN, July 6, 1995.
[2] H. B. Bakoglu. Circuits, Interconnections, and Packaging for VLSI. Addison-Wesley Publishing
Company, New York, 1990.
[3] J. D. Cho and M. Sarrafzadeh. A Buffer Distribution Algorithm for High-Performance Clock Net
Optimization. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 3,
No. 1, March 1995.
[4] N. C. Chou and C. K. Cheng. On General Zero-Skew Clock Net Construction. IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 3, No. 1, March 1995.
[5] S. Lin and C. K. Wong. Process-Variation-Tolerant Zero Skew Clock Routing. In Proc. IEEE
Custom Integrated Circuits Conference, 1993.
[6] B. Wiederhold. Deep Submicron ASIC Design Requires Design Planning. EDN, February
16, 1995.
[7] Menezes, A. Balivada, S. Pullela and L. T. Pillage. Skew Reduction in Clock Trees Using Wire
Width Optimization. In Proc. IEEE Custom Integrated Circuits Conference, 1993.
[8] R. Hansen and R. Deming. ASIC Design Techniques Synchronize Dual Clocks In High-Speed
Designs. EDN, July 1993.
[9] W. Khan and N. Sherwani. Zero Skew Clock Routing Algorithm For High Performance ASIC
Systems.
[10] K. D. Boese and A. B. Kahng. Zero-Skew Clock Routing Trees With Minimum Wirelength. In
Proc. IEEE Custom Integrated Circuits Conference, 1992.
[11] A. Hemani, T. Meinchke, S. Kumar, A. Postula, T. Olsson, P. Nisson, J. Oberg, P. Ellervee, D.
Lundqvist. Lowering Power Consumption in Clock by Using Globally Asynchronous Locally
Synchronous Design Style. In Proc. DAC '99, 1999.
References & Suggested Readings(2)
[12] J. Rubinstein, P. Penfield, and M. A. Horowitz. Signal Delay in RC Tree Networks. IEEE
Transactions on Computer-Aided Design, Vol. CAD-2, No. 3, July 1983.
[13] X. Liu, M. C. Papaefthymiou, E. G. Friedman. Maximizing Performance by Retiming and Clock
Skew Scheduling. In Proc. DAC '99, 1999.
[14] J. Pangjun, S. S. Sapatnekar. Clock Distribution Using Multiple Voltages. In Proc. ISLPED '99, 1999.
[15] Z. Xing, P. Banerjee. A Parallel Algorithm for Zero Skew Clock Tree Routing. In Proc.
International Symposium on Physical Design, 1998.
[16] J. S. Yim, S. O. Bae, C. M. Kyung. A Floorplan-based Planning Methodology for Power and
Clock Distribution in ASICs. In Proc. DAC '99, 1999.