Clock Issues in Deep Submicron Design
Agenda
Consideration of Timing/Density/Clock/Power
Need of PLL
Global Distribution of
Clocks and Power (1)
When ASICs are built on a deep submicron process with tens of millions of
gates and a clock frequency below 1 GHz, the designer must consider many
details in the clock and power circuitry. Normally these circuit elements are not
given much thought: power comes from the rails drawn across the top and
bottom of the page, and the clock has ideal characteristics, a square wave
running at the specified frequency.
In reality, many other effects need to be evaluated in the clocking and power
distribution areas of the design when the total chip power consumption will be in
the range of tens to over 100 W and the clock power can be as much as half of
the total power consumption. The clocking scheme cannot be assumed to be a
clean, uniform signal network. It might be a complicated distribution structure
with architectures ranging from a large distributed clock buffer for the high-performance chips to a complex system with multiple derived sub-clocks to help
manage power consumption.
Global Distribution of
Clocks and Power (2)
The interaction between clocks and power consumption may require the ability to
generate clock signals which can be stopped in the inactive sections to minimize
power consumption. The power and ground system will take up about half of the
available package pins to be able to handle the tens of amperes of average
current consumed by the IC.
Many side effects of the basic IC process will have to be addressed to make the
chip meet all the requirements of speed, power, and silicon area. If the supply
voltage is reduced to take advantage of the power savings available at a lower
supply voltage, noise margins and leakage currents may become significant
problems. The various secondary effects within the system, like voltage drops on
the supply lines, ground bounce, crosstalk, and glitches, may exacerbate the
problems by adding enough noise to the system to decrease the clock slew rates
and lengthen the clock rise and fall times.
Consideration of
Timing/Density/Clock/Power (1)
This further exacerbates the power consumption problems by making the big
clock buffers stay in the high current linear portions of their transfer curves for
greater amounts of each clock cycle. In addition, the clock network has many
signals switching simultaneously, adding large power surges and a very large
potential for crosstalk and interference to the clock and power distribution subsystems.
The most basic problems facing designers are managing the skew in the clock
system (which entails getting the clock everywhere at the same time), supplying
sufficient clock drive to operate all of the clocked elements in the system, and
getting operating power to all active circuit elements. For single-frequency clock
systems, the tradeoffs are speed, power consumption, and area. The problems
with skew and the process of balancing the delays across the chip occur in
parallel with the increases in density and complexity.
Consideration of
Timing/Density/Clock/Power (2)
Large load count yields high clock delay and affects timing.
Some of the clocking problems of complex, high speed circuits are associated
with the physics of the devices and interconnections. At 250 MHz, the clock
period is only 4 ns. After clock skew and set-up and hold times are accounted
for, very little time remains for buffer and propagation delays.
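The budget arithmetic can be sketched in a few lines. Only the 250 MHz clock comes from the text; the skew, set-up, and clock-to-output figures below are illustrative assumptions, and the function name is hypothetical.

```python
def logic_budget_ns(freq_mhz, skew_ns, setup_ns, clk_to_q_ns):
    """Time left in one cycle for buffers, logic, and wires, in ns."""
    period_ns = 1000.0 / freq_mhz  # 1000 / MHz gives the period in ns
    return period_ns - skew_ns - setup_ns - clk_to_q_ns


# Assumed overheads (not from the source): 0.5 ns skew, 0.3 ns set-up,
# 0.4 ns flip-flop clock-to-output delay.
period_ns = 1000.0 / 250.0
budget_ns = logic_budget_ns(250.0, skew_ns=0.5, setup_ns=0.3, clk_to_q_ns=0.4)
print(f"period = {period_ns:.2f} ns, remaining budget = {budget_ns:.2f} ns")
```

With these assumed overheads, more than a quarter of the 4 ns period is gone before any logic or wire delay is counted.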
The large clock buffers lead to high power consumption, often as large as 30 to
50 percent of the total chip power consumption, as well as noise problems due
to the large current spikes generated when the buffers switch. An alternative
approach is to distribute the buffers into the clock tree. This reduces the power
consumption by requiring buffers of smaller size and also helps the reliability
aspects by reducing the size of the current spike.
The clock system and the wide word datapaths all switching at the same time
increase the possibility for glitches and higher peak switching currents. They put
additional loads on the power delivery systems. The resulting datapath skews will
require close scrutiny of the datapath localization and grouping, as well as
careful analysis of pipeline lengths. The careful analysis of the signal paths
relative to the clocks is critical to making a working integrated circuit.
"In synthesized circuits," says Charlie Xiaoli Huang, a senior architect at Epic
Design Technology (Santa Clara, CA), "the software tries to make all paths the
same length. This makes all data paths complete at the same time, which
generates glitches and power surges at the end of each clock cycle. This
effect gets worse at higher speeds."
Clocking schemes and power distribution are going to be affected by the system
requirements. The areas for compromise are power, area, and performance. If
one of the areas is defined as much more critical than the others, it will drive the
design. For example, if performance is the key parameter, a single point clock
with sufficient buffers to drive all the circuitry would be the best choice. The
tradeoff would be in a clock system which draws up to half of the total system
current. An intermediate solution might be a multiply-driven clock spine.
If not all of the circuitry needs to run at the same speed, multiple derived clocks
could be generated from the master reference clock. The sections will get clocks
appropriate for their functions. Why have a 250 MHz clock for a serial I/O
channel controller? This could save some more power since the frequency term
in the power equation has now been reduced for much of the on-chip circuitry.
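The saving can be illustrated with the standard CMOS switching-power model, P = a * C * V^2 * f. The text refers only to "the power equation", so the choice of this model is an assumption, and the capacitance, voltage, and slower clock rate below are hypothetical values.

```python
def dynamic_power_w(activity, cap_farads, vdd_volts, freq_hz):
    """Switching power in watts: activity * C * Vdd^2 * f."""
    return activity * cap_farads * vdd_volts ** 2 * freq_hz


# Hypothetical serial I/O block: 2 nF of switched capacitance at 1.8 V.
# The clock net itself toggles every cycle, so activity is taken as 1.0.
fast_w = dynamic_power_w(1.0, 2e-9, 1.8, 250e6)  # clocked at the master rate
slow_w = dynamic_power_w(1.0, 2e-9, 1.8, 25e6)   # clocked at a derived 25 MHz
print(f"250 MHz: {fast_w:.3f} W, 25 MHz: {slow_w:.3f} W")
```

Because power is linear in frequency in this model, a block moved to a clock ten times slower draws one tenth of the switching power, all else being equal.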
The designer can also gate the clock signals to unused sections of the chip,
with the understanding that the gate delay will exacerbate the clock skew and
clock edge uncertainty for those sections. Gating keeps the clocks from toggling the
inputs of sections with no data changes. If the gate is used in place of a buffer
in the clock tree section, the clock tree does not require an additional level of
buffers to match the delays due to the extra gate levels.
Clocking Techniques
in Deep Submicron Design (1)
One reason the multiply-driven technique produces low skew is that the clock signal is
driven from multiple points on the chip, thereby reducing the effective distance
between drivers and clock signal receivers (otherwise known as flip-flops).
Additionally, the clock signal arrival time difference between the first flip-flop and
the last flip-flop is much smaller, minimizing the skew. In multiply-driven clock
networks, latency is reduced because fewer layers of buffer trees are needed to
drive the clock net from multiple ends.
Clocking Techniques
in Deep Submicron Design (2)
Physical design manager Herman Lam of Fujitsu (San Jose, CA), says that they
are encouraging place and route of the clock system first, then the rest of the
signals. For high performance functions, a large clock buffer driving a minimum
size clock tree is the best way to accomplish the clocking. They place virtual flip-flops at the ends of the clock lines for loads, then let the software move the
virtual flip-flops to optimal locations based on the actual logic use. When people
try to get the logic interconnections first, then try to balance the clock trees for
matched delays, the resulting circuit has a much larger clock tree and its
associated parasitics which increase power consumption.
Clock networks for deep submicron designs are typically inserted during physical
layout, either with a clock tree place and route tool or manually. After place
and route of the design, the RC values for the clock network are extracted and
measured.
Multiply-driven clock spine network delays are very difficult to model because
analytical RC algorithms only work for a net with a single driver. Circuit (Spice)
simulation has been used as an alternative to analyze multiply-driven clock nets,
but the Spice results must be manually analyzed and backannotated to timing
analysis tools. One alternative is a manual solution that breaks the multiply
driven net into multiple subnets and extracts the subnet segments for RC
analysis. This method breaks down when more than a few drivers drive
a single clock net. For accurate skew and latency analysis, special EDA tools
are needed to model multiply-driven clock networks automatically and the
extracted data needs to be back-annotated to timing analysis tools.
Multiply-driven clock networks can be designed with very small skew and latency,
but special tools beyond RC extraction and analysis are required to ensure that
such networks meet the requirements of high-performance deep submicron
designs.
Need of PLL
TRC = signal propagation delay caused by metal-interconnect RC effects between logic gates
Consideration of
Synchronous Design
However, in submicron and deep submicron technologies, logic gate delay is no longer
the sole cause of delay. The wire load delay also contributes a large proportion of delay.
The wire distance between logic gates can cause substantial delay. Since the distance
from the clock source to the clock input of D flip-flop D1 is longer than the distance
from the clock source to the clock input of D flip-flop D2, clock skew will occur. Y3 may
generate incorrect results due to the clock skew.
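A rough sketch of how unequal wire lengths turn into skew, using the Elmore approximation for a uniform distributed RC wire (delay of about 0.5 * r * c * L^2). The per-micron resistance and capacitance and the two wire lengths are hypothetical, and the labels D1 and D2 simply follow the text.

```python
def wire_delay_ps(length_um, r_ohm_per_um, c_ff_per_um):
    """Elmore delay of a uniform RC wire in ps (ohm * fF = 1e-3 ps)."""
    return 0.5 * r_ohm_per_um * c_ff_per_um * length_um ** 2 * 1e-3


# Assumed wire parameters: 0.1 ohm/um and 0.2 fF/um (illustrative only).
delay_d1_ps = wire_delay_ps(4000.0, 0.1, 0.2)  # longer route to D1
delay_d2_ps = wire_delay_ps(1000.0, 0.1, 0.2)  # shorter route to D2
skew_ps = delay_d1_ps - delay_d2_ps
print(f"D1: {delay_d1_ps:.1f} ps, D2: {delay_d2_ps:.1f} ps, skew: {skew_ps:.1f} ps")
```

Because the delay grows with the square of the length in this model, a wire four times longer is sixteen times slower, which is why unbuffered long clock routes dominate skew.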
Full Custom : The clock routing problem in full custom style depends on the
availability of a routing layer for clocks.
If a dedicated, obstacle-free layer is available for routing, the clock routing
problem in full custom design is exactly the CRP (Clock Routing Problem):
minimizing total delay and minimizing skew between buffers.
If obstacles are present, the problem is referred to as the BBCRP (Building Block
Clock Routing Problem): minimizing both total delay and skew under the constraint
that wires do not intersect any rectangles.
Standard Cell : The clock routing problem in standard cell designs is somewhat
easier than in full custom designs in some respects.
Gate Array : Gate arrays are symmetrically arranged in a plane and allow the
clock to be routed in a symmetric manner as well. The algorithms for clock
routing in such symmetric structures have been well studied and well analyzed.
Load Balancing
An optimally balanced clock tree distribution would connect all logic gates directly to
the clock source. Assuming that there is no buffer between any logic gate and
the clock source, and the wire width is constant, the furthest logic gate will
experience the largest delay. The delay time can be equalized for all logic gates
by adding logic gate delay and interconnect delay to the faster signal paths.
Then all signal paths will experience the same delay. This approach not only has
a near zero clock skew, but also has the fastest speed. However, this approach
is not feasible because the drive strength of the clock source is limited, and
there is not enough room to route wires around the clock source.
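The equalization step described above amounts to padding every faster path up to the slowest one. A minimal sketch, with hypothetical sink names and delays:

```python
def padding_for_zero_skew(path_delays_ps):
    """Extra delay each sink needs so that all paths match the slowest one."""
    slowest = max(path_delays_ps.values())
    return {sink: slowest - delay for sink, delay in path_delays_ps.items()}


# Hypothetical source-to-sink delays in ps for three clocked elements.
delays = {"ff_a": 120.0, "ff_b": 310.0, "ff_c": 205.0}
padding = padding_for_zero_skew(delays)
balanced = {sink: delays[sink] + padding[sink] for sink in delays}
print(padding)
print(balanced)
```

The slowest path receives no padding, so the balanced network is exactly as slow as its furthest sink.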
Logic gates are usually placed by a cell placement program at an early
stage of layout. The positions of the buffers and the clock source, however, are
determined by the clock tree distribution algorithm. Two general clock tree
distribution algorithms are discussed here. It should be noted that a few major
assumptions are made for the following discussion: the wire resistance and wire
capacitance have a linear relationship with the clock signal delay; all buffers are
identical and contribute the same delay.
Buffer Pre-Placement
Width Control
Width Tapering
Buffer/Load Distribution
H-Tree
X-Tree : If the routing is not restricted to being rectilinear, the H-Tree shape can
be replaced with an X shape. However, X-Trees are undesirable since they may cause
crosstalk due to the close proximity of wires.
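The appeal of the H-Tree is that every leaf sits at the same wire distance from the root. A small recursive sketch shows this; the geometry rule used here (alternate horizontal and vertical branches, halving the branch length after each vertical step) and the function name are assumptions made for illustration.

```python
def h_tree_leaves(cx, cy, half_len, levels, path_len=0.0, horizontal=True):
    """Return (x, y, wire_length_from_root) for every leaf of an H-tree."""
    if levels == 0:
        return [(cx, cy, path_len)]
    # Keep the length for the vertical step, then halve it for the next H.
    next_half = half_len if horizontal else half_len / 2.0
    leaves = []
    for sign in (-1.0, 1.0):
        nx = cx + sign * half_len if horizontal else cx
        ny = cy if horizontal else cy + sign * half_len
        leaves += h_tree_leaves(nx, ny, next_half, levels - 1,
                                path_len + half_len, not horizontal)
    return leaves


leaves = h_tree_leaves(0.0, 0.0, 4.0, levels=4)
lengths = {round(length, 9) for (_, _, length) in leaves}
print(len(leaves), "leaves, distinct root-to-leaf wire lengths:", lengths)
```

Equal wire length gives equal delay only if the loads at the leaves are also equal, which is why the H-Tree suits regular, symmetric structures.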
[Figure: MMM and GMA; partition cuts labeled Cut 1, Cut 2, Cut 3]
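This algorithm concerns itself with finding the tapping point very accurately, based on
capacitive loading of the clock terminals as well as the delay in the sub-trees:
$$ r_1\left(\frac{c_1}{2} + C_1\right) + t_1 = r_2\left(\frac{c_2}{2} + C_2\right) + t_2 $$
[Figure: tapping point on the wire joining sub-trees T1 and T2. The two wire segments cover fractions x and 1 - x of the wire. Segment i has resistance r_i and capacitance c_i, drawn as c_i/2 at each end, and drives sub-tree load capacitance C_i with sub-tree delay t_i.]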
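The balance condition r1(c1/2 + C1) + t1 = r2(c2/2 + C2) + t2 can be solved for the tapping point in closed form. The sketch below assumes a wire of total length `length` with uniform resistance `alpha` and capacitance `beta` per unit length, split into fractions x and 1 - x; those parameter names and the numeric values are illustrative, not taken from the slides.

```python
def tapping_point(t1, t2, C1, C2, alpha, beta, length):
    """Fraction x of the wire on the T1 side that equalizes both delays."""
    num = (t2 - t1) + alpha * length * (C2 + beta * length / 2.0)
    den = alpha * length * (beta * length + C1 + C2)
    return num / den


def branch_delay(frac, t, C, alpha, beta, length):
    """Delay r * (c / 2 + C) + t through one wire segment and its sub-tree."""
    r = alpha * frac * length
    c = beta * frac * length
    return r * (c / 2.0 + C) + t


# Hypothetical sub-trees in consistent (arbitrary) units.
x = tapping_point(t1=10.0, t2=20.0, C1=5.0, C2=8.0,
                  alpha=0.1, beta=0.2, length=100.0)
d1 = branch_delay(x, 10.0, 5.0, 0.1, 0.2, 100.0)
d2 = branch_delay(1.0 - x, 20.0, 8.0, 0.1, 0.2, 100.0)
print(f"x = {x:.4f}, delay to T1 = {d1:.4f}, delay to T2 = {d2:.4f}")
```

If the computed x falls outside the range 0 to 1, no point on the direct wire balances the two sub-trees, and the wire to the faster sub-tree would have to be lengthened instead.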
RHMT
Interconnect Topology
Recent Approaches
in Clock Tree Synthesis
The approach is iterative. One wire segment is selected and an alternate wire size
is tried. To keep the skew of the tree at zero, the sub-tree rooted at the current
wire has to be re-merged with its sibling.
This propagation continues until all the wire segments on the path from the current
wire to the root wire are re-merged.
[Figure: propagation path from the current wire to the root]
[Figure: clock tree nodes assigned to processors p0 and p1]
The tree is partitioned into the top part and the bottom part.
Only the nodes in the bottom part are distributed among the processors.
The nodes in the top part are shared among the processors.
Iteration method:
First, each processor does the wire sizing for the top part (except the root).
Then each processor does the wire sizing for the wires in the bottom part of the
tree in a distributed manner, and the results are synchronized.
Retiming improves the speed of a digital circuit by relocating its storage
elements while preserving the functionality of the original design.
Clock scheduling achieves the same effect as retiming by introducing skew
between the clock signals that control the timing of the storage elements within
a circuit.
When the clock skew is zero, the minimum clock period is the longest delay of
all the combinational paths in the circuit. So the goal is to balance the longest
delay of all the data paths by relocating the registers.
When nonzero clock skew is introduced, the circuit can successfully operate at a
clock period equal to the largest difference between the delays of the slowest
path and the fastest path between any pair of registers.
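The two period figures stated above can be compared on a small example. This is a sketch only: the register names and path delays are hypothetical, set-up, hold, and clock-to-output times are ignored, and the second function evaluates the bound exactly as the text states it rather than solving for an actual skew schedule.

```python
def period_zero_skew(paths):
    """Minimum period with zero skew: the longest combinational delay."""
    return max(d_max for (_, d_max) in paths.values())


def period_with_skew(paths):
    """Period stated for skew scheduling: largest (slowest - fastest) gap."""
    return max(d_max - d_min for (d_min, d_max) in paths.values())


# (source register, destination register) -> (fastest, slowest) delay in ns.
paths = {
    ("r1", "r2"): (2.0, 9.0),
    ("r2", "r3"): (1.0, 5.0),
}
t_zero = period_zero_skew(paths)
t_skew = period_with_skew(paths)
print(f"zero skew: {t_zero} ns, with scheduled skew: {t_skew} ns")
```

In this example the long r1-to-r2 stage borrows time from the shorter r2-to-r3 stage, so the period drops from 9 ns to 7 ns without moving any register.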
Earlier we made the point that analysis precedes optimization. Here we make
another general point: We might say that just as analysis precedes optimization,
optimization precedes synthesis. Or to put it another way: Successful synthesis
at higher levels requires successful optimization at lower levels.
Clock buffer: circuit element to isolate and amplify incoming clock signal.
Clock tree: design technique to achieve balanced delays and loads in the clock buffers.
Gated clock: clock line that can control clock transmission to the operating circuits.
Ground bounce: the change in ground (VSS) reference levels due to current in the ground
line.
Ground loop: the noise caused in the ground line(s) due to unbalanced IR drops in the
ground line.
Insertion delay: the time from clock pad to individual flip-flops.
IR drop: the voltage drop caused by the current I through the resistor R.
Jitter: the change in period to period timing in a clock signal.
Latency: the time for a clock to become available in the circuit.
Multiphase clock: clocking system with more than one phase; the phases may be overlapping or
nonoverlapping. Biphase: a clock and its complement. Quadrature: clocks separated by a phase
angle of 90 degrees.
PLL: Phase-Locked Loop, a variable frequency generator locked to a source signal.
Skew: the maximum difference in clock arrival time between any two flip-flops.
Slew rate: the rate at which a signal goes from one level to the other level; usually
characterized by the rise time or fall time.