High-Performance and Low-Power Clock Network Synthesis in The Presence of Variation
High-Performance and Low-Power Clock Network Synthesis in The Presence of Variation
High-Performance and Low-Power Clock Network Synthesis in The Presence of Variation
by
Dong Jin Lee
Doctoral Committee:
Associate Professor Igor L. Markov, Chair
Professor David T. Blaauw
Professor Dennis M. Sylvester
Associate Professor Marina A. Epelman
Dong Jin Lee 2011
c
All Rights Reserved
TABLE OF CONTENTS
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Problem analysis and a strategy for solutions . . . . . . . . . . . . . . 30
3.2.1 Optimization objectives and timing analysis techniques . . . . 31
3.2.2 Nominal skew optimization . . . . . . . . . . . . . . . . . . . 32
3.2.3 Clock latency range (CLR) optimization . . . . . . . . . . . . 35
3.2.4 Coordinating multiple optimizations . . . . . . . . . . . . . . 35
3.3 Proposed SoC clock-synthesis methodology . . . . . . . . . . . . . . 36
3.3.1 Obstacle-avoiding clock trees . . . . . . . . . . . . . . . . . . 36
3.3.2 Composite inverter/buffer analysis . . . . . . . . . . . . . . . 39
3.3.3 Initial buffer insertion with sizing . . . . . . . . . . . . . . . 40
ii
3.3.4 Buffer sliding and interleaving . . . . . . . . . . . . . . . . . 42
3.3.5 Iterative buffer sizing . . . . . . . . . . . . . . . . . . . . . . 43
3.3.6 Iterative top-down wiresizing . . . . . . . . . . . . . . . . . . 44
3.3.7 Iterative top-down wiresnaking . . . . . . . . . . . . . . . . . 45
3.3.8 Bottom-level fine-tuning & limits to further optimization . . . 47
3.4 Empirical validation: Contango 1.0 . . . . . . . . . . . . . . . . . . . 48
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.2 Limitations of existing techniques . . . . . . . . . . . . . . . . . . . 89
5.3 Optimization objective . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.4 Proposed techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4.1 Obstacle-aware virtual clock trees . . . . . . . . . . . . . . . 93
5.4.2 Arboreal clock-net contraction force . . . . . . . . . . . . . . 95
5.4.3 Obstacle-avoidance force . . . . . . . . . . . . . . . . . . . . 98
iii
5.5 Proposed methodology . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.5.1 The Lopper flow . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.5.2 Trade-offs and additional features . . . . . . . . . . . . . . . 102
5.6 Empirical validation: Lopper . . . . . . . . . . . . . . . . . . . . . . 105
5.6.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . 106
5.6.2 Empirical results . . . . . . . . . . . . . . . . . . . . . . . . 109
5.6.3 SPICE-driven validation . . . . . . . . . . . . . . . . . . . . 111
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Chapter VI. Multilevel Tree Fusion for Robust Clock Networks . . . . . . 117
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
iv
LIST OF FIGURES
Figure
2.1 Eligible clock sink pairs. (a) There is combinational logic between two
sinks, which make the skew between these two sinks affects the useful
portion of clock cycle time. (b) This sink pair is not eligible because the
sinks are not logically dependent. . . . . . . . . . . . . . . . . . . . . 12
2.2 The locus of all midpoints between two sinks s1 and s2 is a Manhat-
tan arc in the Manhattan geometry. On the other hand, the midpoint is
unique in Euclidean geometry. (a) Sinks s1 and s2 are not horizontally
aligned. Therefore, the Manhattan arc has non-zero length. (b) Sinks s1
and s2 are horizontally (left) and vertically (right) aligned. Therefore,
the Manhattan arc for both cases has zero length. Source: [43]. . . . . . 18
2.3 (a) Sinks s1 and s2 form a Manhattan arc. (b) An example of a tilted
rectangular region (TRR) for the Manhattan arc of s1 and s2 with radius
of two units. Source: [43]. . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6 Finding the location of child node v given the location of its parent node
par. Source: [43]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
v
2.7 Embedding the clock tree during the DME top-down phase. Gray lines
indicate merging segments, dotted lines show connections between merg-
ing segments, and black lines indicate routing segments. (a) Connecting
the clock source to the root merging segment. (b) Connecting the root
merging segment to its children merging segments. (c) Connecting those
merging segments to its children. (d) Connecting those merging seg-
ments to the sinks. Source: [43]. . . . . . . . . . . . . . . . . . . . . . 22
2.8 Min-wirelength trees with zero and bounded skew (Elmore delay). Only
fragments of actual clock trees are shown. . . . . . . . . . . . . . . . . 23
3.1 Key steps of the Contango methodology. Blue boxes represent skew re-
duction techniques, red octilinear shapes show CLR reductions, and the
green box with thick border reduces both objectives. An Improvement-
& Violation-Checking (IVC) step follows each Clock-Network Evalua-
tion (CNE) using circuit simulation tools, e.g., SPICE. “Fail” indicates
no improvement or having slew violations, leading to a transition to the
next optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 The clock tree produced by Contango on ispd09f nb1. Sinks are indi-
cated by crosses, buffers are indicated by blue rectangles. L-shapes are
drawn as “diagonal wires” to reduce clutter. Wires are colored by a red-
green gradient to reflect slow-down slacks, as described in Section 3.2.2.
The impact of wiresnaking is too small to be visible. . . . . . . . . . . . 50
4.1 Local-skew slack for sinks and edges when Ω∆ = 5 ps. (a) Sink pairs
within distance ∆ are enclosed by dashed lines. ω∆ = 12 ps based on
sink latencies and ∆. (b) Local skew-slack for sinks are computed by
Algorithm 2. The algorithm for edge-slack computation is described
in [51, Section 3]. (c) ω∆ is reduced to 5 ps after optimizations, which
satisfies the local skew constraints. . . . . . . . . . . . . . . . . . . . . 60
4.2 The impact of variations on local skew. Sinks are indicated by crosses,
the clock source is indicated by a solid triangle. Nominal skew of 3 ps is
shown in (a). Full skew of 11 ps is shown in (b), where some tree edges
are delayed (thick red) and some are sped up (dotted green) by random
variations. Only sink A is within the local skew distance from sinks B,
C and D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
vi
4.3 Comparison of different wire snaking strategies to satisfy Ω∆ = 10 ps.
(a) Unoptimized sink latencies are shown. 20 ps of additional delay is
required for the left sink. (b) Wire snaking at non-buffer output nodes
results in undesired delay at the right sink. (c) The snaked wire is iso-
lated from the right sink by the left buffer, therefore only the left sink is
delayed and ω∆ satisfies local skew constraints. . . . . . . . . . . . . . 70
4.4 Delay buffer insertion and subsequent wire snaking when Ω∆ = 10 ps,
the delay of the buffer d(B)= 10 ps. (a) Unoptimized sink latencies are
shown. (b) Delay buffer insertion for skew reduction and isolation of the
target node. (c) The snaked wire is isolated from the right sink by the
delay buffer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.7 Our clock tree for ispd10cns07. Sinks are indicated by crosses, buffers
are indicated by blue rectangles. ∆ = 600µm is shown near the left-
bottom corner. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.8 Probability density functions for worst local skew of our clock trees (blue
line) and meshes produced by CNSrouter (gray dashed line) for the eight
ISPD 2010 benchmarks, calculated using 500 independent SPICE runs
for each benchmark. The x-axis shows skew in picoseconds. Local skew
limits (Ω∆ ) are shown with red solid lines, and the 95%-ile of local skew
(ω∆,ν,0.95 ) are shown by dotted green lines (our work) and dashed gray
vertical lines (CNSrouter). . . . . . . . . . . . . . . . . . . . . . . . . 80
4.10 Fidelity of Chop-SPICE estimates (γ = β(Ψ)) for CNS 02, CNS 07, and
CNS 04 ISPD 2010 benchmarks. . . . . . . . . . . . . . . . . . . . . . 82
5.1 Two examples of Manhattan rings proposed in [64]. (a) Zero-skew Man-
hattan rings driven by an H-tree. (b) Manhattan rings on the design with
obstacles. Obstacles are indicated by darker boxes, two sink groups (A,
B) are represented as ellipses. . . . . . . . . . . . . . . . . . . . . . . 90
vii
5.2 Bounding boxes of two partial ZST-DME clock trees. (a) HPWL of the
bounding box is (15+12)=27. The total wirelength of the inside clock
tree is 32. (b) HPWL is (10+10)=20 and the total wirelength of the clock
tree is 35. The clock-net wirelength of (b) is greater than (a) although
the bounding-box HPWL of (b) is notably smaller than (a) while the
source-to-sink wirelength is 15 for all sinks. . . . . . . . . . . . . . . . 91
5.4 Two types of forces for clock-net optimization. Registers are indicated
by crosses. (a) For each edge, the corresponding downstream registers
are given force vectors. Right arrows are the force vectors for reduc-
ing e1 , and up arrows are the force vectors for reducing e2 . (b) Virtual
nodes are inserted (squares), and forces are created between each pair of
connected nodes (dotted lines). . . . . . . . . . . . . . . . . . . . . . 95
5.5 Comparison between our arboreal clock-net contraction force and MLAF
of [103]. (a) Arboreal clock-net contraction forces are generated. (b)
The modified register and virtual clock-node locations when forces in
(a) are utilized. (c) The forces created by the MLAF algorithm. (d) The
modified register and virtual clock-node locations when forces in (c) are
utilized. We can observe that the edges between parents and children
nodes are poorly handled for the force creation in (c), and our method
is more efficient on non H-tree structures (which is common in modern
designs). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.7 Key steps of Lopper integrated into the SimPL placer, as indicated with
darker rounded boxes and a lozenge. Plain boxes represent the SimPL
steps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
viii
5.9 An example of routing dead space that can be found in the ISPD‘05
benchmarks. (a) Routing dead space is created by enclosing macro
blocks. (b) One macro block is modified to open the space. . . . . . . . 107
5.10 Clock trees for clkad1, based on a SimPL register placement (top) and
produced by proposed techniques (bottom). The respective clock-tree
wirelengths based on SimPL and our method are 209.13 mm and 152.27 mm.
The total switching power of SimPL and our method are 279.9 mW and
263.0 mW respectively. . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.1 Simple clock networks with source node s, two sink nodes a and b. All
paths are considered buffered. (a) a tree, (b) redundant paths. (c) n
multilevel paths for each sink. Each i-th (2 ≤ i) new root-to-sink path
consists of a shared pwi section and a pai or pbi section that is not shared. 119
6.2 Empirical distributions of signed and absolute skew of two example sink
pairs. The data are collected from Monte-Carlo simulations with varia-
tions. (a) Sink pair with nominal skew 0.3 ps. (b) Sink pair with nominal
skew 1.2 ps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.4 The impact of redundant paths for a pair of critical sinks (Figure 6.1c) on
clock-network parameters, based on Formulas VI.25, VI.27 and VI.28.
The skew constraint and ρ are set to 10 ps and 0.1 respectively. (a)
Standard deviation. (b) Yield. (c) Relative total capacitance of each
clock network compared to the total capacitance of the clock tree without
redundant paths (n = 1). . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.5 (a) A critical sink pair is indicated by a red oval and the LCA of two
sinks is shown. (b) Corresponding subtree for the sink cluster in (a). . . 126
6.6 Illustration of multilevel tree fusion on ispd10cns02. (a) Initial tree con-
struction. (b) Critical sink pairs are connected by red lines. (c) Auxiliary
trees are fused in to enhance robustness. . . . . . . . . . . . . . . . . . 128
6.7 Key steps of multilevel tree fusion. Proposed techniques are indicated
with darker rounded boxes and a lozenge. Plain boxes represent tech-
niques adapted from earlier publications. . . . . . . . . . . . . . . . . 129
6.8 (a) Multiple paths from clock source to sinks a and b. (b) Splinter sinks
are generated to utilize tree optimization algorithms. . . . . . . . . . . 130
ix
6.9 Impact of variations on a buffered path. The path is 2 mm long with
30 f F load capacitance at the end and buffered by 4 inverters. (a) The
45 nm technology, variation model from the ISPD 2010 benchmarks
and a buffer type used in our work are utilized. (b) The 45 nm low-
power technology, buffer library and variation model from STMicro are
utilized. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.10 SPICE waveforms for a reconvergent sink (Sink 680) with largest tempo-
ral displacement of split sinks in a fused clock network with skew limit
4.5 ps on ispd10cns08. Among the four splinter sinks, the maximum
rising-delay displacement before merging is 5.31 ps. The maximum
rising-delay is 423.58 ps and the minimum rising-delay is 418.27 ps.
The delay of the sink after merging is measure as 423.22 ps. The gray
dashed lines represent the waveforms at splinter sinks before merging.
The blue solid lines represent the waveforms at the sink after merging.
(a) rising edge. (b) falling edge. . . . . . . . . . . . . . . . . . . . . . 136
x
LIST OF TABLES
Table
1.1 Clock networks in industry CPUs [6, Chapter 43] and ISPD 2010 bench-
marks from Intel and IBM (Table 2.1). . . . . . . . . . . . . . . . . . . 4
3.2 Inverted sinks in ISPD 2009 benchmarks (after buffer insertion) vs. polarity-
correcting inverters. . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4 Results on the ISPD 2009 Contest benchmark suite. CLR is reported in
ps, capacitance in % of the limit specified in benchmarks, and CPU time
in s. Best results from the ISPD 2009 contest and best results overall
are shown in bold. Runtime is dominated by SPICE runs. It was not
used for scoring at the ISPD 2009 contest and can be improved by using
FastSPICE, Arnoldi approximation. . . . . . . . . . . . . . . . . . . . 49
3.5 Results from ASPDAC’10 clock routing papers on the ISPD 2009 Con-
test benchmark suite [61, 63, 87]. Runtimes may be from different work-
stations. CLR and skew are reported in ps and CPU time in s. Only
average skew was published for HKPU [63]. . . . . . . . . . . . . . . 50
xi
3.6 The ‘Full flow’ column shows skew change at each step in the Contango
flow, and the final skew in ps. Acronyms are decoded in the caption
of Table 3.3. Subsequent columns show the impact of removing one
optimization. These results illustrate the range of each optimization and
its impact on final results. . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1 Results on the ISPD 2010 Contest benchmark suite. Skew numbers are
reported in ps, capacitance in pF and CPU time in s. ‘95%’ represents
ω∆,ν,95 . The numbers in parentheses of the capacitance column refer to
the fraction of capacitance of the snaked wires in %. Skew constraint
violations are shown in strikethrough font. Otherwise, skew results are
not comparable because skew can be traded for capacitance, which was
the primary objective of the contest. All networks produced by these
tools satisfy slew constraints imposed at the ISPD 2010 contest. Due to
limited page space, we do not include results for the other teams, but
significantly outperform them in solution quality. . . . . . . . . . . . . 77
4.2 Our clock trees for the ISPD 2010 benchmarks, buffered by our slew-
constrained algorithm, versus existing state-of-the-art clock networks
[13, 67]. Skew numbers are reported in ps, capacitance in pF . All net-
works produced by these tools satisfy slew constraints from the ISPD
2010 contest. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3 Averages delay (D-err.) and slew error (S-err.) (in picoseconds) and
runtimes (in seconds) with varying the granularity on the ISPD 2010
CNS contest benchmark suite. The clock networks are synthesized by
Contango2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.4 Maximum delay (D-err.) and slew errors (S-err.) in ps, on the ISPD
2010 CNS contest benchmark suite, as sub-circuit granularity varies. . . 84
xii
5.2 Results on the CLKISPD‘05 benchmark suite. ClkWL represents total
wirelength of a clock network synthesized by the initial phase of Con-
tango 2.0 [53]. HPWL is total HPWL of signal nets. Pwr is total net-
switching power. SimPL+Lopper is 4.16× faster than mPL6 and 1.51×,
1.81× slower than FastPlace3, SimPL respectively. . . . . . . . . . . . 108
5.3 The results on clkad1 with various clock power ratios β. The specifi-
cations of the reference placement produced by SimPL are in the row
Orig. αavg is calculated based on β and reference placement produced
by SimPL. Total wire-switching power values of the reference place-
ment with the corresponding β are represented in the column Orig. P.
The relative power ratios are indicated with Rel. . . . . . . . . . . . . 110
5.5 Results of the MLAF technique integrated into SimPL with comparison
to our technique. Average results are compared to the results for SimPL
in Table 5.2. The numbers in parentheses represent the amount of re-
duction(ClkWL, Pwr) [increase(HPWL)] assuming 100% reduction [in-
crease] for our technique. For example, [209.1(SimPL) - 182.4(MLAF)]
/ [209.1(SimPL) - 152.3(Lopper)] = 46.9%. . . . . . . . . . . . . . . . 112
xiii
6.2 Comparison of buffer types. ispd10b1 and ispd10b2 are two buffer types
in ISPD 2010 CNS benchmarks. The large buffer utilized in this work
has Gaussian variation and parallel buffering is not allowed. The buffer
type in this work is intended to represent a composite buffer made from
8 ispd10b2 buffers, but in a way that would prevent modeling constituent
buffers as experiencing independent PVT variation. . . . . . . . . . . . 133
6.4 Results on ispd10cns08 with different local skew limits. The statistics
of nominal skew, total skew and variational skew are reported based on
Monte-Carlo simulations. For each tree, we report its mean, standard
deviation (σ), as well as yield for a given skew limit. the worst local
skew when yield is 95%. All the results satisfy slew constraints. . . . . 135
6.5 Results on ispd10cns08 with the buffer type ispd10b1 in Table 6.2 with-
out parallel buffering. The statistics of nominal skew, total skew and
variational skew are reported based on Monte-Carlo simulations. Mean,
standard deviation (σ) and yield for given local skew limit are reported
for each tree. ‘95%’ column represents the worst local skew when yield
is 95%. All the results satisfy slew constraints. . . . . . . . . . . . . . 136
xiv
ABSTRACT
by
Dong Jin Lee
ical design of integrated circuits. Among the major design steps, clock-network synthesis
has been greatly affected by technology scaling, rendering existing methodologies inade-
quate. Clock routing was previously sufficient for smaller ICs, but design difficulty and
structural complexity have greatly increased as interconnect delay and clock frequency in-
creased in the 1990s. Since a clock network directly influences IC performance and often
consumes a substantial portion of total power, both academia and industry developed syn-
thesis methodologies to achieve low skew, low power and robustness from PVT variations.
Nevertheless, clock network synthesis under tight constraints is currently the least auto-
mated step in physical design and requires significant manual intervention, undermining
turn-around-time. The need for multi-objective optimization over a large parameter space
and the increasing impact of process variation make clock network synthesis particularly
challenging.
xv
Our work identifies new objectives, constraints and concerns in the clock-network syn-
develop new modeling techniques and algorithms for clock power optimization subject
low 5 ps, satisfy slew constraints and trade-off skew, insertion delay and power, while
6.8%-11.6% for large IC designs with macro blocks by integrating clock network synthe-
sis within global placement. We also present a novel non-tree topology that is 2.3× more
power-efficient than mesh structures. We fuse several clock trees to create large-scale re-
dundancy in a clock network to bridge the gap between tree-like and mesh-like topologies.
Integrated optimization techniques for high-quality clock networks described in this dis-
in the presence of process variation. Our software implementations were recognized with
the first-place awards at the ISPD 2009 and ISPD 2010 Clock-Network Synthesis Contests
xvi
PART I
CHAPTER I
national logic connecting registers [76]. While the functional requirements of a digital
system are satisfied by the register transfer level (RTL) and logic synthesis, the overall
performance and timing constraints require insertion of pipeline registers to ensure that
the latencies of critical paths between registers satisfy timing constraints [76]. Clock sig-
nals are delivered from a clock generator to sequential elements by a clock distribution
network, which must optimize important parameters such as clock skew, slew rate, inser-
tion delay, power dissipation, area and sensitivity to variations [6, 43]. In a modern EDA
flow, the number, type and netlist of combinational logic and sequential elements are de-
fined after RTL and logic synthesis [8, 9]. The physical locations of sequential elements
1
become known after the placement stage, which consists of global placement, legaliza-
tion and detail placement [6, 43]. For ASIC and SoC designs, clock-network synthesis is
A clock distribution network is typically the largest net in the circuit netlist and op-
erates at the highest speed of any signal within the entire synchronous system, hence
the clock network often takes a significant fraction of the power consumed by a chip
[26, 31, 66, 96]. Clock waveforms must be sharp and noise-free since all the data signals
are referenced by the clock signals. Technology scaling has made long global interconnect
wires significantly more resistive as wires become thinner [34]. Clock signals are particu-
larly affected by this increased wire resistance, and precise control of clock-signal arrival
times has grown in importance since they severely limit the maximum performance of the
entire system.
clock distribution networks are crucial; therefore, clock network synthesis is excluded
from other signal-net routing and processed by specialized algorithms and techniques prior
Processor-based systems fueled the development of electronics since the 1960s. PCs
were the main driver of growth in electronics in the 1990s, and in the 2000s mobile phones
and other battery-powered consumer devices became a significant market segment, fol-
CPUs and ASIC chips, whose clock frequency has steadily increased for many years.
2
However, semiconductor scaling in the 1990s made clock optimization more challenging.
While transistors continued scaling, interconnect lagged in performance [34]. The max-
imal length of a wire that can be driven by an inverter started a steady decrease. This
phenomenon boosted demands for repeaters in clock networks, raised their power pro-
file, and complicated their synthesis. Research in delay-driven buffering of single signal
nets — arguably an easier problem and on a smaller scale — has blossomed well into the
late 2000s, leaving clock-tree synthesis a difficult, high-value target. As the accuracy of
compact delay models for transistors and wires deteriorated, clock-network design in the
A variety of clock network topologies and deskewing techniques were developed for
microprocessors previously. Table 1.1 shows key parameters of clock networks in the mi-
croprocessors designed by IBM and Intel from the late 1990s to early 2000s [6, Chapter
43]. All those clock networks are regular, and only minimally adapt to sink locations.
IBM S/390 used two-level balanced H-like trees. The clock network of the IBM Power4
processor consists of tuned H-trees driving a single full-chip grid. Active deskewing and
wire-width tuning were employed to reduce skew. Alpha 21264 utilized hierarchical struc-
tures consisting of a global grid, six major grids and local clocks. The Intel Pentium series
used spine (tall tree) structures driven by balanced binary trees. Adaptive deskewing tech-
nique based on a delay-locked loop (DLL) reduced skew from 100 ps to 15 ps in Pentium
III. Deskewing by a 5-bit domain deskew register (DDR) was employed in Pentium 4. The
clock network of the Intel Itanium microprocessor series features three levels of global dis-
tribution by two identical and balanced H-trees, regional clock distribution by the regional
3
clock driver (RCD) and regional clock grid and local clock distribution by local clock
buffers (LCBs) and local clock routings. A fuse-based deskewing technique was devel-
oped for Intel Itanium 2, reducing skew from 71 ps to 24 ps. In recent high-performance
microprocessors, clock signals are distributed using two-level hybrid networks consisting
of a global grid and local buffered gated trees connecting to the grid [81, 82].
Table 1.1: Clock networks in industry CPUs [6, Chapter 43] and ISPD 2010 benchmarks
from Intel and IBM (Table 2.1).
In the early 2000s, the emphasis in CPU design has shifted from high performance to
power-performance-cost trade-offs, including the advent of multicore CPUs and the grow-
ing popularity of low-power ARM CPUs. In the netbook market, the low-power 1.6GHz
Atom CPU from Intel is currently competing with ARM’s multicore 2GHz Cortex-A9
CPUs and the 1GHz Cortex-A8, but 98% of world’s mobile phones rely on ARM-based
CPUs [49] which offer better power-performance-cost trade-offs than Intel CPUs [90].
ARM cores often drive system-on-chip (SoC) designs, laid out using low-power ASIC
works and use active deskewing to lower clock skew and susceptibility to process varia-
4
tions [82]. Clock trees are more susceptible to variations than meshes (common in CPUs),
but are 2-4 times more power-efficient. This is significant because clock networks and cor-
responding sequential elements consume up to 50% of CPU power and can affect power-
performance comparisons between CPUs [83]. Unused parts of the clock network can be
temporarily turned off (clock gating), but this does not always reduce peak power.
more important problem as it significantly impacts the performance, area and power dis-
sipation of the design. The trend of increasing system complexity in conjunction with
architectural-level pipelining increases the number of clocked elements [24, 108]. Semi-
conductor scaling facilitates smaller cycle times, but this trend assumes increasingly re-
liable clock distribution. The design of clock networks directly influences the maximum
operating clock frequency because it determines clock skew, slew rate and insertion de-
lay of the clocked elements [10]. Decreasing power consumption has become one of the
main objectives in IC design today. The benefits of voltage reduction and device size scal-
ing are often overwhelmed by the increase in the number of gates and clock frequency.
The high costs of system cooling have also increased the importance of low-power de-
sign. Clock networks consume a significant fraction of the total system power due to its
very high capacitive load and frequent switching. Being responsible for 30-50% of chip
power [27, 60], clock networks require careful optimization. With shrinking cycle times,
the impact of process, voltage and temperature (PVT) variation is becoming more serious
and complicates the design of reliable clock networks [82]. The time it takes to design
5
and synthesize clock networks is becoming significant, because laborious accurate timing
analysis is often required to satisfy the tight constraints for clock signals.
Nominal clock skew is usually improved first during clock routing since in GHz-range
systems, performance can be seriously affected by skew in tens of picoseconds. For skew
optimization, highly accurate timing analysis tools (e.g., SPICE) are required, but they
are slow and dominate the runtime of clock network synthesis. Therefore, choosing ap-
propriate timing analysis tools and how to utilize them is also important in clock network
synthesis. Skew is affected by PVT variations. Hence, skew optimization based on only
nominal parameters (no variation, single corner) does not guarantee a reliable clock net-
work. We distinguish two approaches to the design of reliable clock networks. First,
one can use strong devices or thick wires that are less affected by variations. Second,
one can build a redundant clock network with multiple paths from the clock source to
each clock sink, or only some clock sinks. The impact of variation on one path can be
compensated for by the clock signals from the other less-affected paths. In modern clock
network design, this is mostly done by using mesh/grid type structures. However, neither
method can avoid increase in total capacitance, which results in an increase in total power
consumption. In general, making a more robust clock network requires a significant in-
crease in power consumption. Since reducing power dissipation is another primary goal
of clock network synthesis, careful analysis of the optimal point between reliability and
when a tree structure is insufficient to ensure a robust clock network, even after best pos-
6
tree structures. Although some publications propose adding cross-links to harden the tree
structure [36, 37, 50], recent studies suggest that these proposals are unworkable. There-
fore, designing effective clock tree structures that combine the reliability of a mesh with
Clock network synthesis for commercial designs is verified with respect to multiple
process corners (or scenarios). Each corner represents a different operation environment of
the chip and commercial clock network synthesis tools try to optimize the clock network
intradie-process variations and decreases the accuracy of skew analysis as the impact of
variations increases. One can utilize Monte-Carlo simulations for accurate estimation of
the impact of variations, but this method is too time-consuming and remains impractical
within clock network synthesis. Statistical timing analysis can model the impact of timing
variations more efficiently, but remains relatively unexplored in the context of state-of-the-
In a physical design flow, clock routing is performed after cell placement, which de-
termines the physical locations of registers [6, 43]. Most academic/commercial placement
tools do not distinguish clocked elements from combinational logic cells [20, 64, 103].
Hence, even though it is possible to improve the quality of a clock network (especially
in terms of power) by modifying the locations of registers, clock network synthesis tech-
niques are often prevented from altering the locations of registers. Some researchers pro-
posed techniques like leaf-level register clustering [16, 75], but finding optimal register
7
Producing high-quality clock networks is becoming more difficult, and related chal-
lenges may soon overwhelm those at other stages of a physical design flow because of
conflicting objectives. Novel multi-objective methods are needed that can generate clock
networks satisfying tight design constraints. Our solutions to these challenges are ad-
SPICE-accurate SoC clock network synthesis. Most existing algorithms and tech-
niques establish fundamental methodologies for clock network synthesis, but perform
large-scale optimization using analytical models that lose accuracy at recent technology
nodes, and are not always validated by realistic SPICE simulations on large industry de-
the gaps between existing point-optimizations. We develop an EDA methodology for in-
called Contango. We then extend our implementation to large industrial clock networks.
Optimization of clock trees for microprocessors. Clock networks account for a sig-
nificant fraction of system power dissipation while limiting CPU performance. Therefore,
CPU clock design. On the other hand, the increasing impact of process variation makes
clock network synthesis particularly challenging. Mesh structures are often utilized to im-
8
able. In Chapter IV we propose a tree-based solution for CPU clock routing that improves
power consumption under tight skew constraints in the presence of variations. We intro-
duce the notion of local-skew slack for clock trees, modeling and optimization of varia-
for clock-tree tuning and accurate optimizations that satisfy budgets. Our strong empirical
results suggest that clock trees constructed using accurate variational skew modeling and
meshes.
Clock network optimization during placement. Most of the existing literature for
clock network synthesis assumes that register locations are given and cannot be changed.
While clock networks can be improved by finding better register locations during place-
ment, most publications do not propose such optimization, hence the quality of resulting
we propose to optimize the locations of registers at the placement stage for power-efficient
Closing the gap between tree and mesh structures. Common clock-network topolo-
gies can be categorized into two major types: trees and meshes. While older chips re-
lied on trees, mesh structures were utilized to satisfy tight variation-related constraints in
9
In Chapter VI, we propose a novel flexible structure that maintains many advantages of
tree structures, but is more robust to variations. Through in-depth structural analysis of
a given clock tree, we quantitatively diagnose where and why it fails to satisfy variation-
related constraints. We then go on to enhance the tree structure to attain required power-
performance-robustness trade-offs.
• Part II lays the foundation for our research. Chapter III describes our method for
• Part III proposes new techniques that broaden the scope of optimization for clock
reduce clock-network size and total power consumption. In Chapter VI, we propose
algorithms and techniques for a novel non-tree clock network structure that bridges
10
CHAPTER II
As clock networks distribute clock signals to numerous clocked elements all over the
chip, they consume a sizable portion of routing resources. Their high switching activity
implies significant power consumption. Hence clock networks must be carefully designed
to optimize the performance of the chip, routing resource usage, and power.
This chapter covers basic terminology, core algorithms, prior work and other prereq-
uisite topics in clock network design. Additional background information relevant to our
contributions appears in further chapters. Section 2.1 discusses key parameters of a clock
network and reviews the ISPD clock network synthesis contests which were held in 2009
and 2010. In Section 2.2, general types of clock networks are presented. Section 2.3
covers algorithms for clock-tree generation. In Section 2.4, existing techniques for clock
Clock skew between two clock sinks connected to the same clock source is the abso-
lute value of the difference in transition arrival times. The clock skew of an entire clock
network is the maximum pairwise clock skew between any two sinks (more details and fur-
11
ther definitions are given in Section 4.2.1). Clock jitter refers to the time variation of the
clock period at a given clock sink on the chip. The term slew characterizes how quickly
10%-90% slew can be measured by the time taken to change the value from 0.1V to 0.9V.
Clock skew, jitter and slew are major issues in digital circuits, and can fundamentally limit
the performance of a digital system. Therefore, clock-network synthesis must limit skew,
jitter and slew. When a clock network is designed to have zero nominal skew, permanent
(static) skew can occur as a result of manufacturing device and interconnect variations
(i.e., process variations). Temperature gradients across a chip also contribute to skew. On
the other hand, power-supply variations are the major source of jitter in clock distribution
networks.1 In this dissertation, we evaluate our clock networks using Monte-Carlo simu-
lations with PVT variations to effectively measure clock skew affected by jitter. In other
words, when we improve robustness of clock networks, we reduce not only permanent
skew induced by process variations, but also temporal skew and jitter induced by voltage
FF Logic FF FF FF
(a) (b)
Figure 2.1: Eligible clock sink pairs. (a) There is combinational logic between two sinks,
which make the skew between these two sinks affects the useful portion of
clock cycle time. (b) This sink pair is not eligible because the sinks are not
logically dependent.
1
Another major contributor to jitter is a clock-signal generator, but optimization of such a generator is
beyond the scope of this dissertation.
12
Sink pairs eligible for local-skew calculation. In a large clock network, skew between
adjacent and connected sinks is a more meaningful optimization objective than global
skew [32, 81]. When two clock sinks are connected by combinational logic (Figure 2.1a)
the clock skew between two sinks directly affects the useful portion of clock cycle time for
the combinational logic. Otherwise, where there is no combinational logic between two
sinks (Figure 2.1b), the skew between them is not a source of performance degradation,
therefore we do not need to optimize the clock network to reduce the skew between those
sink pairs. Eligible sink pairs for skew can be defined based on the netlist after Register-
Transfer Level (RTL) synthesis so that only sink pairs that are connected by combinational
logic are considered for skew calculation. In the ISPD 2010 Clock Network Synthesis
(CNS) Contest, local skew distance limit was introduced to define the eligible sink pairs
and local skew [95]. If the Manhattan distance between two sinks is less than the local
skew distance limit, it is assumed that there is combinational logic between the two sinks
and otherwise, there is no logic dependency. We use the same notion of local skew in
our work, but do not rely on the metric definition, and all our techniques apply in a more
realistic context where eligible pairs of sinks are derived directly from the netlist.
The ISPD 2009 clock-network synthesis contest organized by IBM Austin Research
Lab was based on a 45 nm technology [94]. Sink latencies were evaluated by SPICE. The
main objective was the difference between the least sink latency at 1.2V and the greatest
sink latency at 1V. This Clock Latency Range (CLR) metric was intended to capture the
impact of multiple power modes with different supply voltages [65], but nominal skew
was also recorded. Total power was limited and the 10%-90% slew rate of 100 ps was
13
enforced. The CLR objective attracted significant criticism, which we share. Therefore
we also evaluate our techniques in terms of nominal skew. The benchmarks were derived
from industry SoC designs and include dozens to hundreds fixed rectilinear obstacles.
The ISPD 2010 high-performance clock-network synthesis contest used several 2 GHz
CPU benchmarks from IBM and Intel to compare tools submitted by 10 teams across the
world (down-selected from 20 initial registrants). To evaluate the quality of the clock
networks, difficult slew and skew constraints were checked against 45 nm Monte-Carlo
SPICE simulations that modeled PVT variations. The 100 ps slew constraints were un-
changed from the ISPD 2009 contest. Clock networks that cleared all constraints were
compared by their total capacitance — a proxy for dynamic power. Table 2.1 shows the
The choice between a tree and non-tree topology is a central question in modern clock-
14
robustness to late design changes and process variations, but at a great cost in terms of ca-
pacitance. Tree topologies offer many advantages, including simplicity, symmetry, faster
timing analysis and amenability to incremental tuning. We start by surveying general types
of clock networks, and will describe details of relevant algorithms in Section 2.3.
Clock trees have been widely supported by academic and commercial EDA tools. Sim-
ple methods including H-tree [11], the method of means and medians (MMM) [40], the
geometric matching algorithm (GMA) [22] and path length balancing method (PLB) [42]
were commonly utilized before the deferred merge embedding (DME) algorithm [12, 28]
was introduced. Recently several methodologies for SoC clock-tree tuning have been
accurate skew optimization with tolerance to voltage variations was proposed in [51].
The Dynamic Nearest-Neighbor Algorithm (DNNA) to generate tree topology and the
Walk-Segment Breadth First Search (WSBFS) for routing and buffering were proposed
ing algorithm with monotonic buffer insertion is proposed in [61]. A Dual-MST (DMST)
geometric matching approach is proposed in [63] for topology construction and recursive
buffer insertion. Modeling techniques and algorithms for microprocessor clock power
optimization subject to local skew constraints in the presence of variations are proposed
in [53].
Meshes. From the mid 1990s when the impact of PVT variation became significant, clock
networks were more affected by PVT variations than random logic, due to their struc-
ture and more stringent timing constraints. In a tree network, such unexpected changes
15
are likely to propagate to the sinks. Mesh (or grid) structures have emerged to address
the structural drawbacks of trees. In meshes, there are multiple paths from the clock
source to individual clock sink; thus, the impact of variations on one path can be averaged
out by multiple redundant paths [107]. However, meshes require significant overhead in
terms of on-chip resources and power. Published examples suggest that mesh-type clock
networks suffer much greater power consumption. Nevertheless, mesh structures were
designs where performance is more emphasized than power consumption [6, 43]. Some
methods to analyze the characteristics of mesh structures are proposed in [19, 106] and
avoiding clock mesh synthesis method which applies a two-stage approach of mesh con-
Trees with cross-links. The dichotomy between meshes and trees is striking, and several
researchers attempted to find intermediate topologies that would retain the advantages of
meshes but reduce capacitance overhead. A key idea in the literature is to insert cross-links
into clock trees, creating redundant paths to sinks that contribute to nominal or variational
clock skew [77, 78]. These methods are later extended to handle buffered clock trees
in [79, 99]. Most publications discuss cross-links that directly connect pairs of sinks. Sur-
prisingly, none of these techniques were useful at the ISPD 2009-2010 clock-network con-
tests [94, 95] despite diligent attempts, as improved tree-tuning methods were sufficient.
Careful experiments and analytical estimates [67] have shown that direct cross-links are
16
only effective in poorly tuned clock trees and/or at relatively short distances. However, in
high-quality clock trees it is rare to find a critical pair of sinks at a short distance. A recent
proposal [67] suggests adding cross-links higher in the tree to connect entire branches.
As several other publications with strong empirical results, [67] uses unrealistically large
composite buffers, and arranges them in a unique two-layer configuration (10+40 small in-
verters). Given that the ISPD 2010 contest infrastructure does not adequately model such
The first geometric algorithms for clock routing evaluated skew in terms of wirelength
from the source to sinks and produced minimum-wirelength trees for a given sink cluster-
ing using the deferred merging and embedding (DME) principle [12].
DME algorithms [43]. The deferred-merge embedding (DME) algorithm defers the
choice of merging (tapping) points for subtrees of the clock tree. DME optimally embeds
any given topology over the sink set S: the embedding has minimum possible source-
sink linear delay, and minimum possible total tree cost. The algorithm was independently
proposed by several groups - Boese and Kahng [12], Chao et al. [17], and Edahiro [28].
In the Manhattan geometry, two sinks in general position will have an infinite number
of midpoints, creating a tilted line segment, or Manhattan arc (Figure 2.2 [43]); each of
these midpoints affords the same minimum wirelength and exact zero skew. Ideally, the
selection of embedding points for internal nodes will be delayed for as long as possible.
The DME algorithm embeds internal nodes of the given topology G via a two-phase
process. The first phase of DME is bottom-up, and determines all possible locations of
17
Euclidean midpoint Euclidean midpoint
s2 s1
s1 s2
s1 s2
Locus of all
Manhattan midpoints
(a) (b)
Figure 2.2: The locus of all midpoints between two sinks s1 and s2 is a Manhattan arc in
the Manhattan geometry. On the other hand, the midpoint is unique in Eu-
clidean geometry. (a) Sinks s1 and s2 are not horizontally aligned. Therefore,
the Manhattan arc has non-zero length. (b) Sinks s1 and s2 are horizontally
(left) and vertically (right) aligned. Therefore, the Manhattan arc for both
cases has zero length. Source: [43].
internal nodes of G that are consistent with a minimum-cost ZST T . The output of the
first phase is a tree of line segments, with each line segment being the locus of possible
placements of an internal node of T . The second phase of DME is top-down, and chooses
the exact locations of all internal nodes in T . The output of the second phase is a fully
Tilted Rectangular
Region (TRR)
s1 s1
Core
Radius
s2 s2
Figure 2.3: (a) Sinks s1 and s2 form a Manhattan arc. (b) An example of a tilted rectangu-
lar region (TRR) for the Manhattan arc of s1 and s2 with radius of two units.
Source: [43].
Manhattan arc (Figure 2.3 [43]). The core of a TRR is the subset of its points at maximum
18
distance from its boundary, and its radius is the distance between its core and boundary.
The merging segment of a node v in the topology, denoted by ms(v), is the locus of
feasible locations for v, consistent with exact zero skew and minimum wirelength (Figure
2.4 [43]). The following presents the sub-algorithms used for DME.
u3
s1 s3 trr(u2)
u1 u2 s4
trr(u1) |eu 2|
|eu1 | s2
s1 s2 s3 s4
Figure 2.4: A bottom-up construction of the merging segment ms(u3 ) for node u3 , the
parent of nodes u1 and u2 , given the topology on the left. The sinks s1 and s2
form the merging segment ms(u1 ), and the sinks s3 and s4 form the merging
segment ms(u2 ). The two segments ms(u1 ) and ms(u2 ) together form the
merging segment ms(u3 ). Source: [43].
The bottom-up phase of DME (building a tree of segments) starts with all sink loca-
tions S given. Each sink location is viewed as a (zero-length) Manhattan arc. If two sinks
have the same parent node u, then the locus of possible placements of u is a merging seg-
ment (Manhattan arc) ms(u). In general, given the Manhattan arcs that are the merging
segments of two nodes a and b, the merging segment of their parent node is uniquely de-
termined due to the minimum-cost property, and is itself another Manhattan arc (Figure
2.4 [43]). The edge lengths |ea | and |eb | are uniquely determined by the minimum-length
and zero-skew requirements. As a result, the entire tree of merging segments can be con-
19
s1 s1
s8 s8
s7 s7
s2 s2
s6 s6
s3 s5 s3 s5
s4 s0 s4 s0
(a) (b)
s1 s1
s8 s8
s7 s7
s2 s2
s6 s6
s3 s5 s3
s5
s4 s0 s4 s0
(c) (d)
Figure 2.5: Construction of a tree of merging segments (DME bottom-up phase). Solid
lines are merging segments, dotted rectangles are the tilted rectangular regions
(TRR), and dashed lines are edges between merging segments; s0 is the clock
source, and s1 − s8 are sinks. (a) The eight sinks and the clock source. (b)
Construct merging segments for the eight sinks. (c) Construct merging seg-
ments for the segments generated in (a). (d) Construct the root segment, the
merge segment that connects to the clock source. Source: [43].
In the DME top-down phase (finding exact locations), exact locations of internal nodes
in G are determined, starting with the root. Any point on the root merging segment from
the bottom-up phase is consistent with a minimum-cost ZST. Given that the location of a
parent node par has already been chosen in the top-down processing, the location of its
child node v is determined from two known quantities: (1) |ev |, the edge length from v to
its parent par, and (2) ms(v), the locus of placements for v consistent with a minimum-
20
cost ZST. The location of v, i.e., pl(v), can be determined as illustrated in Figure 2.6 [43].
Thus, the embeddings of all internal nodes of the topology can be determined top-down in
Possible locations of v
|epar|
trr(par) pl(par)
ms(v)
Figure 2.6: Finding the location of child node v given the location of its parent node par.
Source: [43].
The Deferred Merge Embedding (DME) algorithm was extended to the bounded-skew
tree (BST) problem. BST/DME algorithms [23,38] generalize merging segments to merg-
ing regions. When BST/DME algorithms were introduced in the early 1990s, many chip
designs included one large central buffer to drive clock signals through the entire chip. To-
day traditional clock trees cannot satisfy slew constraints in large ICs because the maximal
Furthermore, the Elmore delay model used by published clock-tree optimizations lost ac-
BSTs allow one to trade off a small increase in skew for reduced total wirelength.
Figure 2.3 shows that BSTs are shorter than ZSTs. However, BSTs are less balanced than
ZSTs and Elmore delay used in BST generation is inaccurate, thus the capacitance saved
on wires can be lost when compensating for skew with accurate timing analysis. After
initial buffer insertion, slow sinks and fast sinks are more clustered in ZSTs. Since our
21
s1 s1
s8 s8
s7 s7
s2 s2
s6 s6
s3 s3
s5 s5
s4 s0 s4 s0
(a) (b)
s1 s1
s8 s8
s7 s7
s2 s2
s6 s6
s3 s3
s5 s5
s4 s0 s4 s0
(c) (d)
Figure 2.7: Embedding the clock tree during the DME top-down phase. Gray lines indicate
merging segments, dotted lines show connections between merging segments,
and black lines indicate routing segments. (a) Connecting the clock source
to the root merging segment. (b) Connecting the root merging segment to
its children merging segments. (c) Connecting those merging segments to its
children. (d) Connecting those merging segments to the sinks. Source: [43].
skew optimization techniques exploit these clusters, BSTs need greater resources to reach
near zero-skew than ZSTs. Table 2.2 shows the impact of BST skew bounds on final
results (CLR is defined in Section 2.1). The skew bounds during BST construction are
based on Elmore delay, and the final results are based on SPICE simulations. Based on
overwhelming empirical evidence against BSTs, Contango does not use them.
Obstacle-avoiding clock trees. The concept of merging regions in BST/DME was ex-
22
(a) ZST (b) 3 ps BST (c) 9 ps BST
Figure 2.8: Min-wirelength trees with zero and bounded skew (Elmore delay). Only frag-
ments of actual clock trees are shown.
tended to obstacle-avoiding trees in [44], where (i) obstacles were assumed rectangular,
(ii) no routing over obstacles was allowed, and (iii) buffering was not considered. The
authors noted that obstacle processing slowed down their BST/DME algorithm and hinted
at more advanced geometric data structures. Unlike in [44], the ISPD 2009 contest allowed
routing but not buffering over obstacles, with SoCs in mind. ISPD 2009 benchmarks in-
Fast buffer insertion. L. van Ginneken introduced an algorithm for buffering RC-trees
[30], which minimizes Elmore delay and runs in O(n2 ) time, given n possible buffer
locations and buffer specification. While not intended for clock trees, it minimizes worst
delay rather than skew. The O(n log n)-time variant of van Ginneken’s algorithm proposed
in [84] is more appropriate for large trees. A key insight into van Ginneken’s algorithm and
its faster variant makes them applicable to our work — while trying to minimize source
23
to sink latencies, these algorithms insert almost same number of buffers on every path and
therefore result in low skew if the initial tree was already balanced.
Other buffering techniques have been proposed as well, e.g., a linear-time algorithm
from [7] that minimizes the number of buffers while bounding capacitive load and slew
rate, but does not minimize delay or skew. A dynamic program from [3] inserts a limited
the ISPD 2009 contest, slew constraints were checked by SPICE, but capacitance limits
were relatively generous. Our competitors predominantly used greedy bottom-up buffer-
insertion algorithms that added each buffer as high in the tree as possible, while satisfying
slew constraints. Such technique seek to minimize capacitance as the top priority. How-
ever, we chose the (faster variant of) van Ginneken’s algorithm, which seeks to minimize
worst sink latency. Our rationale was that process variations can be moderated by lower-
ing sink latency and that it is relatively easy to slow down paths that are too fast, but it
is harder to speed up slow paths. It is difficult to make a rigorous comparison with slew-
based buffering. In particular, some of our competitors at the ISPD 2009 contest relied on
it and produced relatively poor results, but others did better. In any case, our overall results
compare favorably to the best published results, especially in terms of nominal skew, and
Several methodologies for clock-tree tuning have recently been developed for the ISPD
2009 clock-network synthesis contest which focused on ASIC and SoC designs. A clock-
24
to generate tree topology and Walk-Segment Breadth First Search for routing and buffering
posed in [63] for topology construction, along with recursive buffer insertion and a way to
handle blockages. SoC methodologies often spend significant effort dealing with hundreds
of layout obstacles, while CPU layouts include very few obstacles. However, skew con-
straints are more difficult in CPU clock synthesis. Because of these differences and due
to the incorporation of process variation into the ISPD 2010 contest, most of the above
Power consumption is one of the primary optimization objectives for modern IC de-
signs [76]. It includes three basic components: short-circuit power, leakage power and
net-switching power [62]. Net-switching power is usually the largest contributor, and
clock networks are often responsible for over 30% of total power consumption due to
their high capacitance and frequent switching [26, 31, 66, 96]. The quality of clock net-
and most commercial EDA tools have largely overlooked this fact by focusing on wire-
length of signal nets [48], routability [102] and circuit timing [35]. As far as we know,
isting techniques. To this end, most appropriate changes to cell locations that reduce the
clock network may depend on the current structure of the clock network, which is not
25
accounted for in existing placement tools. However, over-emphasizing the placement of
clock sinks may harm the overall design performance by making signal nets longer.
To address the apparent conflict between clock-net optimization and traditional place-
ment objectives, some researchers proposed techniques and algorithms for better regis-
gravity constraints for registers, pseudo-pins and register-cluster contraction. Cheon [20]
proposed power-aware placement that performs both activity-based register clustering and
activity-based net weighting to simultaneously reduce the clock and signal net-switching
power. In order to reduce the clock network size, Wang [103] proposed dynamic clock-
tree building (DCTB), multi level bounding box (MLBB) and multi level attractive force
(MLAF), and integrated them into a force-directed placement (FDP) framework [101].
flip-flops [16, 75] to share inverters (inside flip-flops) and shorten the clock tree. This
clustering does not adversely affect signal nets, but is rather limited by the locations of
26
PART II
CHAPTER III
On-chip clock networks are remarkable in their impact on the performance and power
nology scaling, as well as in their strong potential for improvement through better CAD
algorithms and tools. Existing literature is rich in ideas and techniques, but performs large-
scale optimizations using analytical models that lost accuracy at recent technology nodes,
and have rarely been validated by realistic SPICE simulations on large industry designs.
coordinated to satisfy slew constraints and achieve best trade-offs between skew, insertion
27
is evaluated on 45 nm benchmarks from IBM Research and Texas Instruments with up
to 50K sinks. It outperforms all published results in terms of skew and shows superior
3.1 Introduction
Clock networks were among the first circuits to suffer the impact of process, voltage
and temperature variations. Systematic variations can affect paths to different sinks in dif-
ferent ways, making effective skew higher than nominal skew. Intra-die variations may be
stronger on some paths than on others, which would further increase effective skew. These
challenges have motivated research at the device, circuit and algorithm levels [45]. In gen-
eral, smaller sink latencies and shorter tree paths decrease exposure to variations. Some
researchers tried to increase the tolerance of buffers to CD changes and temperature varia-
tion [39], some proposed to tune wires or buffers based on post-silicon measurements [92],
and some developed methodologies for inserting cross-links into the trees [36, 37, 50], ar-
guing that such links can decrease the impact of variation on skew. Existing literature
tends to (1) rely on closed-form delay models during large-scale optimization, (2) fre-
quently focus on a single optimization technique in analysis and evaluation, (3) neglect
the difficulties in modifying highly optimized clock trees. Our work seeks to address these
tion, rather than just elegant algorithms with provable abstract properties. With process
variation in mind, microprocessor designers combine regular meshes with local or global
trees [82]. However, meshes have much higher capacitance and use more power.
28
This chapter focuses on clock-network synthesis for ASICs and SoCs, where clock
frequencies are not as aggressive as in high-performance CPUs, but power is limited, es-
pecially for portable applications. In this context, tree topologies remain the most popular
choice, but may require accurate tuning and further enhancements. The SoC context in-
troduces another twist — layout obstacles. SoCs include numerous pre-designed blocks
(CPUs, RAMs, DSPs, etc) and datapaths. While it may be possible to route wires over
such obstacles, buffer insertion is typically not allowed. One can fathom the difficulty
Steiner trees currently remain an active area of research [59]. Our contributions include
• A simple and robust technique for obstacle avoidance in clock trees subject to slew
constraints
• A methodology for clock-tree optimizations that outperforms the best results at the
ISPD 2009 contest on every benchmark by 2.15-3.99 times, while reducing skew to
2.2-4.6 ps. On newer Texas Instruments benchmarks with up to 50K sinks, skew
29
Selecting best parameters for each benchmark can further improve results, at the cost of
increased runtime. But global skew < 20 ps is considered very small for ASICs and SoCs.
In the remainder of this chapter, Section 3.2 describes our analysis of the clock-
network synthesis problem and introduces slow-down & speed-up slacks. Major opti-
mization steps are described in Section 3.3, and Section 3.4 presents empirical results.
Section 3.5 summarizes this chapter and raises several intriguing research questions.
The design of a clock network offers a large amount of freedom in topology selection,
spacing and sizing of inverters, as well as the sizing of individual wires. Traditionally, net-
work topology is decided first. Trees offer unparalleled flexibility in optimization because
latency from the root to each sink can be tuned individually, while large groups of sinks
Composite buffers can be built by stacking up inverters in parallel and/or in series. Par-
allel composition decreases driver resistance, but increases input pin capacitance, while
leaving the intrinsic delay intact. The spacing of buffers is largely responsible for pre-
venting slew violations and also affects clock skew. It is sensitive to driver resistances,
the maximal capacitance (wire and input pins) that can be driven by a given composite
buffer, as well as branches in the buffer’s fanout, which determine the number of input
pins driven. A single wire segment can be split into smaller segments, and each can be
sized independently.
30
3.2.1 Optimization objectives and timing analysis techniques
Accurate clock network design is complicated by the fact that the optimization objec-
tives are not available in closed form and take significant CPU resources to evaluate. Skew
optimization requires much higher accuracy than popular Elmore-like delay models. For
example, a 5 ps error represents only 1% of 500 ps sink latency, but 50% of 10 ps skew.
Closed-form models do not capture resistive shielding in long wires, do not propagate slew
with sufficient accuracy, and do not account for slew’s impact on delay well. Newer, more
sophisticated models are laborious to implement and only available in modern commercial
tools. Our strategy is to use simple analytical models at the first steps of the proposed flow
— (1) to construct zero-skew clock trees and (2) to perform initial fast buffer insertion,
— but drive further optimizations by SPICE runs, Arnoldi approximation, or any other
techniques. Runtime can be significantly reduced using localization and batch-mode eval-
uation. During localization, one prunes large portions of the clock tree that do not affect
latencies to the sinks impacted by the changes in question [36]. This does not reduce the
number of SPICE calls, but rather decreases the complexity of each run. On the other
hand, a batch of changes can be evaluated by a single SPICE run, as long as multiple
properties of circuit delay, such as monotonicity, convexity, and linearity with respect
to some parameters. Monotonicity and convexity support binary search, where an optimal
31
value is sought on a certain interval. At each step of the search, the middle point of the
interval is evaluated by SPICE (e.g., a wire can be sized half-way) and the result deter-
mines whether to recur to the left or right half-interval. Linearity enables extrapolation of
An initial buffered clock tree is constructed early in the design flow. Assuming no
slew violations, the latency of each sink s (Ts ) is known from SPICE simulations (or
faster techniques, such as Arnoldi-based delay calculations), at which point minimal and
maximal latencies (Tmax and Tmin ) can be found.1 Since absolute sink latencies are not
as important as skew (Tmax − Tmin ), skew can be improved by either decreasing Tmax
(speeding up the slowest sinks) or increasing Tmin (slowing down the fastest sinks).
Definition III.1 Consider a clock tree and its sink s. The slow-down slack Slacksslow
(speed-up slack SlacksF ast ) of s is the amount in ps by which the sink latency can be unilat-
erally increased (decreased) without increasing clock skew. In other words, Slacksslow =
Slow sinks often cluster together, and so do fast sinks. Hence, clock skew can be
improved by modifying a few nodes or edges high in the tree. To find desired delay
1
Separately for rising and falling transitions, for each PVT corner.
32
Definition III.2 Consider a clock tree and its edge e. The slow-down slack Slackeslow
(speed-up slack SlackeF ast ) of e is the amount in ps by which the edge delay can be unilat-
Given slacks on n sinks, all edge slacks can be computed in O(n) time.
Lemma 2 For any edge e and its parent in the tree, Slackeslow ≥ Slackparent(e)
slow
and
F ast
SlackeF ast ≥ Slackparent(e) .
The flexibility of a tree edge is limited by each downstream sink. Therefore, for edges
close to the root we often have Slackeslow = SlackeF ast = 0. It is important to note that the
validity of slacks-related calculations does not depend on the use of specific delay models
or SPICE simulations. When visualizing clock trees, we color their edges with a red-green
gradient, indicating low slack with red and high slack with green, as shown in Figure 3.4.
Lemma 2 suggests that, instead of changing the delay of an edge, one can change the
delay of its downstream edges by an equal amount, as long as only one delay change is
applied on each root-to-sink path. When choosing between tree edges on the same path,
we prefer (at early stages of optimization) to tune edges as high in the tree as possible, so
as to minimize (i) the amount of change, (ii) the risk of introducing slew violations and
(iii) power overhead. However, in a highly optimized tree, we tune bottom-level edges
33
where we can better predict the impact on skew. The preference for high-level tree edges
f ast
Naturally ∆fe ast = Slackef ast − Slackparent(e) , and a mirror statement holds. For a tree
If optimizations are not coordinated well, some edges may be sped up and some slowed
down, while the overall skew is unchanged. To avoid such conflicts, one can perform
slack update. In practice, it is easier to slow down an edge than to speed it up. Thus, any
possible speed-up, e.g., by using stronger buffers, is performed first. Rounds of speed-up
and slow-down are more conveniently performed top-down, so that when an edge cannot
be tuned by the desired amount, the remainder is passed to its downstream edges.
We found that after nominal skew is sufficiently optimized, both rising and falling
transitions can individually limit speed-up and slow-down slacks. We handle the two
transitions separately and define edge slacks as the smaller of rise-slack and fall-slack.
Furthermore, speed-up and slow-down slacks can be computed for each process corner
given (two in the ISPD 2009 contest). In order to improve the multicorner CLR objective,
a tree edge can be sped up conservatively by the minimum of its speed-up slacks, and can
34
3.2.3 Clock latency range (CLR) optimization
Our methodology pursues two objective functions —- nominal skew and the ISPD
2009 CNS contest metric, CLR, introduced in Section 2.1. Due to significant correla-
tion between CLR and nominal skew, some of the optimizations in our flow target skew
optimization, some target CLR, and some address both (see Table 3.3). In practice this
approach achieves a good trade-off between the two optimization objectives, and is repre-
the CLR calculation is based on the sink latencies at two different supply voltage settings.
There are mainly two strategies to reduce CLR. First, reducing skew directly contributes
to reducing CLR until skew becomes very small (e.g. less than 5 ps). Let sink L be the
sink with the least sink latency @1.2V (TL1.2V ) and sink G be the sink with the greatest
sink latency @1.0V (TG1.0V ). Then CLR = TG1.0V - TL1.2V . When we consider the latency of
sink G @1.2V (TG1.2V ), then CLR = (TG1.0V -TG1.2V ) +(TG1.2V -TL1.2V ). We call (TG1.0V -TG1.2V )
the variational part of CLR and (TG1.2V -TL1.2V ) the skew part of CLR. The skew part of CLR
can be reduced by skew optimization techniques. Since the corner sinks of skew are not al-
ways same to the corner sinks of CLR (sink L and G), CLR needs to be measured after any
skew optimization to check CLR improvement. The second strategy for CLR optimization
targets the variational component of CLR. The detailed descriptions of optimizations for
the skew and variational part of CLR are discussed in Section 3.3.
different accuracy (see Tables 3.3 and 3.6). For example, buffers can be inserted to in-
35
crease delay with the purpose to decrease nominal skew. This optimization offers a great
range (significant strength) because a buffer’s intrinsic delay can be significant. However
its accuracy is low because buffer delay cannot be accurately controlled. Our strategy in
coordinating clock-tree optimizations is to start with optimizations that offer the great-
est range, and then transition to optimizations with greater accuracy. Each step should
decrease the main optimization objective sufficiently to be within the range of the next
optimization. For example, in the ISPD 2009 contest, top-down wiresizing can decrease
nominal skew from hundreds of ps to 10 - 20 ps. This is sufficient for top-down wires-
naking to take over and reduce nominal skew to 5 ps where more accurate techniques can
be used. Here we observed that wiresnaking also exhibits a significant range of optimiza-
tion, but we sequenced it after wiresizing because it offer a greater accuracy and because
Our proposed clock-network synthesis methodology and its major algorithmic steps
are shown in Figure 3.1. Contango first builds an initial tree using a ZST/DME algo-
rithm [23] and alters it to avoid obstacles. It then uses an O(n log n)-time variant of van
Ginneken’s buffer insertion algorithm [84] to ensure small insertion delay and to satisfy
As we pointed out in Section 2.3, obstacle-avoiding clock trees can be built by repair-
ing obstacle violations in ZSTs. This approach is attractive when large obstacles abut the
36
Figure 3.1: Key steps of the Contango methodology. Blue boxes represent skew reduction
techniques, red octilinear shapes show CLR reductions, and the green box with
thick border reduces both objectives. An Improvement- & Violation-Checking
(IVC) step follows each Clock-Network Evaluation (CNE) using circuit sim-
ulation tools, e.g., SPICE. “Fail” indicates no improvement or having slew
violations, leading to a transition to the next optimization.
chip’s periphery because ZSTs naturally avoid areas without clock sinks. This approach is
also attractive when obstacles are small or thin enough that a buffer inserted immediately
before the obstacle can drive the wire over the obstacle, so that no rerouting is necessary.
A third convenient case occurs when a wire can be rerouted around the obstacle without an
increase in length. Most obstacles are rectangular in shape, but such rectangles may abut,
creating rectilinear-shaped obstacles. When two obstacles abut, we cannot place a buffer
between them, and therefore handle them as one compound obstacle. Contango detours
37
wires using the following algorithm, illustrated in Figure 3.2 for a composite obstacles.
Step 1. Identify all wires that intersect obstacles. For each point-to-point connection, per-
form shortest-path maze routing around the obstacles. For subtrees that cross an obstacle,
find L-shaped segments that link points inside and outside the obstacle. For each L-shape,
choose one of the two possible configurations that minimizes overlap with the obstacle.
Step 2. When a wire crosses an obstacle, Contango captures an entire subtree enclosed by
the obstacle (see Figure 3.2). The total capacitance of the subtree is then measured and
compared to the capacitance that can be driven by the driving buffer without risking slew
violations. Sub-trees that can be driven by the driving buffer do not require detours.
Step 3. For obstacles crossed by a subtree that cannot be safely driven by the driving
buffer, Contango establishes a detour along the contour of the obstacle as follows. First,
the entire contour is considered a detour. Then, to ensure that the clock network remains a
tree, one segment is removed between tree sinks adjacent along the contour. If we were to
minimize total capacitance, we would remove the longest segment of the contour between
two adjacent tree sinks. However, we minimize the longest detoured source-to-sink path,
and therefore remove the segment furthest from the tree source (counting distances along
the contour). In other words, we first find the sink most distant from the source along the
contour, and include in the detour the entire shortest path to the source. The other segment
incident to the sink is removed, but the shortest path from its other end to the source is
Modern SoC layouts are littered with obstacles, which upset regular structures such
as meshes and H-trees. In the ISPD 2009 contest, such layouts required numerous de-
38
Figure 3.2: An illustration of our detouring algorithm. Small solid circle indicates the
source of detour, larger circles indicate sinks. The detour is shown with red
dotted lines.
tours. Detouring may significantly increase skew, but the subsequent skew optimization
Most technology libraries support dedicated clock buffers or inverters that are larger
and more reliable than those for signal nets. Industry designs usually offer at least six
different sizes. Parallel composition of buffers increases driver strength, helping with slew
constraints and improving robustness to variations. Yet, buffer sizes must be moderated to
satisfy total power limits. For a given buffer library, we consider many possible composite
that can be further evaluated during buffer insertion. Algorithmic details are omitted here
because the ISPD 2009 contest used only two inverter types — large and small. Table 3.1
shows that eight parallel small inverters exhibit smaller output resistance than one large
inverter, and smaller input/output capacitance. Hence Contango used 8× small inverters
39
I NVERTER I NPUT O UTPUT
TYPE Cap., f F Cap., f F Res., Ω
1X Large 35 80 61.2
1X Small 4.2 6.1 440
2X Small 8.4 12.2 220
4X Small 16.8 24.4 110
8X Small 33.6 48.8 55
optimization, along with buffer sizing, plays an important role in our methodology.
Given a clock tree with buffers, it is easy to increase the latency of a given sink, but
it is difficult to speed up a sink. Therefore, our strategy is to first make sinks as fast as
possible, and then reduce skew with wiresnaking and wiresizing. When buffers are
inserted into an Elmore-balanced tree, source-to-sink paths contain practically the same
We adapted the O(n log n)-time variant of van Ginneken’s algorithm from [84]. Due
to its speed, it can be launched with different inverter configurations, effectively perform-
ing simultaneous optimization across multiple parameters. Our experiments indicate that
Therefore, to reduce the variational part of CLR, Contango performs fast buffer insertion
with different composite buffers until it finds the best-performing solution with strongest
composite buffers within 90% of the power limit. Slew-constraint violations are not a
concern at this point since minimizing delay involves avoiding high slew-rate (recall that
there is positive correlation between delay and slew-rate). The experiments on various
40
clock trees with initial buffer insertion suggest that even the worst slew-rate is well under
60% of the slew limit. We reserve γ = 10% of power budget to facilitate more accurate
optimizations.
The O(n log n) variant of van Ginneken’s algorithm [84] used in our work assumes
that all available clock buffers preserve polarity. However, when polarity-changing invert-
ers are used, as in the ISPD 2009 contest, it will typically produce trees with incorrect sink
polarity (inverted sinks). While the algorithm can be extended to account for sink polarity,
we found this unnecessary. Even a simple patch — placing additional inverters at each of
n× inverted sinks — works reasonably well, because the skew introduced by new invert-
ers can be fixed by downstream optimizations. This technique inserts inverters at half the
sinks (n/2) on average. To reduce the added capacitance in cases when n× > n/2, Con-
tango inserts one inverter at the top of the tree, leaving only n♯ = (n − n× ) < n/2 sinks
with wrong polarity. The average number of inserted inverters would now be (n + 2)/4.
Instead, Contango traverses the tree bottom-up and marks each node (i) whose all sinks
have equal polarity, but (ii) whose parent does not satisfy (i). An inverter is inserted at
each marked node with downstream sinks of incorrect polarity. As a result, the number of
added inverters is significantly reduced, as shown in Table 3.2. The skew induced by new
inverters is not significant and fixed by the skew optimization algorithms later.
Proposition 2 The above algorithm runs in O(n) time, fixes all inverted sinks and mini-
mizes the number of added inverters, subject to ≤ 1 inverter on every root-to-sink path.
41
f11 f12 f21 f22 f31 f32 fnb1
Inverted sinks 77 71 46 57 140 47 153
Added inverters 9 7 8 9 16 13 2
Table 3.2: Inverted sinks in ISPD 2009 benchmarks (after buffer insertion) vs. polarity-
correcting inverters.
mance. The iterative buffer sizing introduced in Section 3.3.5 is primarily used to reduce
the variational component of CLR, while buffer sliding and interleaving are applied as
preliminary steps. Extensive experiments suggest that the impact of variations on skew is
best reduced by (i) decreasing sink latency (insertion delay), and (ii) using the strongest
possible buffers. Since our initial buffer insertion algorithm focuses on the former metric
with the latter metric as a secondary objective, it is possible to further improve the varia-
tional component of CLR by emphasizing the latter metric. Therefore, based on the results
Sizing up a single inverter increases its input pin capacitance and can lead to slew
violations. To prevent such violations, it is often possible to slide the inverter up the tree to
reduce upstream wire capacitance and interleave an inverter when two inverters move too
far apart after sliding. The increase in downstream wire capacitance is balanced with the
increase in the inverter’s driving strength. Sizing a single inverter may increase the skew
and require further correction. Therefore, we focused on the top-most levels of the tree,
whose impact on skew is relatively small. Given a clock source at the chip boundary, DME
algorithms generate a long wire leading to the center of the chip, and the tree branches out
from the center. This long wire — the tree trunk — is later populated with a chain of
42
inverters, which can be up- or down-sized without significant impact on skew because this
equally affects all sinks. However, since roughly 1/3 to 1/2 of sink latency is due to the
The trunk’s variational impact is different for voltage and process variations, and this
must be accounted for during optimizations. Stronger buffers in the trunk reduce the sen-
sitivity of latency to supply voltage (e.g., in the case of different power modes), and help
optimizing the CLR objective from the ISPD 2009 contest. However, process variations in
the trunk do not affect skew. In the ISPD 2010 contest, process variations were included
in the skew constraint, while the primary objective was to minimize total capacitance.
Therefore, one of successful strategies to weaken the buffers in the tree trunk and avail the
After sliding and interleaving top-level buffers, we invoke iterative buffer sizing. First,
this algorithm sizes up buffers in the tree trunk. At the i-th iteration of buffer sizing,
Contango sizes up the composite inverters by at most pi = 100/(i + 3)%. The itera-
tions continue until results improve without slew violation. Buffer sizing in tree branches
However, sizing up buffers after the trunk often makes the tree unbalanced in terms of
skew and results in greater load for the skew optimization algorithms. For better perfor-
mance of skew optimizations, typically 4 or 5 levels after the first branch are sized up by
43
3.3.6 Iterative top-down wiresizing
Before skew optimization, Contango computes slow-down slacks at every edge as de-
given tree edge can be slowed down before skew would be negatively affected. Since fast
sinks often cluster together, skew can be lowered by slowing down either many bottom-
level wires or few wires higher in the tree. Our top-down algorithm pursues the latter,
We build an ad hoc linear model based on the impact of downsizing a unit-length (lws )
wire segment. Contango chooses several independent wire segments with same length
(lws ) in the middle of the tree and downsizes them to observe the impact on latencies of
downstream sinks, ensuring that every sink is affected by only one downsized wire. This
requires a single SPICE run and produces a single parameter Tws — maximal latency
increase by downsizing a unit-length (lws ) wire segment. When downsizing a wire, the
scaling factor k is calculated based on Slacke divided by Tws and k × lws of the wire is
down-sized. When k is small, the latency increases almost linearly since the down-sized
length is much smaller than the length of the wire. Therefore we can estimate that the
maximum latency increase is equal to or less than k × Tws . To utilize this linearity, we
limit k by kmax . kmax is experimentally determined by observing the threshold at which the
linearity breaks significantly. Also, the scaling factor k can be limited by slew constraints.
Wiresizing typically increases slew rate because of increase in resistance. Even though
k < kmax holds, Contango does not allow any downsizing on a wire whose downstream
44
Algorithm 1 IterativeWireSizing
Tws = TwsEstimation();
repeat
SaveSolution(); ComputeWireSlacks();
Q = {root}; RSlack = {0}; i = 0;
while i < size(Q) do
if (Slack[Qi ] − RSlacki > Tws ) then
k = (Slack[Qi ] − RSlacki )/Tws ;
DownSize(W ire[Qi ],k); RSlacki + = kTws ;
end if
for j = 1 to Size(Child[Qi ]) do
Q.push(Child[Qi ][j]); RSlack.push(RSlacki);
end for
+ + i;
end while
SpiceSimulation();
until (no improvement || slew violation)
Since we selected Tws as the maximal latency increase from the SPICE simulation,
the actual increase (calculated by SPICE) is smaller — our modifications are intentionally
conservative to avoid excessive increase of latency, which increases the maximal latency of
the tree and consequently causes increase of slack for the entire tree. After running SPICE,
collecting sink latencies and recomputing slow-down slacks, Contango repeats top-down
wiresizing to reduce skew based on current data. This process is performed iteratively
until the objective function (CLR or nominal skew) stops improving. Iterative wiresizing
is detailed in Algorithm 1.
Wiresizing can reduce large skew by applying small changes, which is appropriate
that a small amount of wire-snaking is often used to improve clock skew, as long as added
capacitance does not significantly affect power. Wiresnaking alters a given route so as to
45
increase its length and can be applied on fast paths.
down wiresizing. This step uses the same slow-down slack computation we described
earlier. A SPICE simulation is performed (other accurate delay model can be used) to
measure Twn , the worst-case delay of wiresnaking with unit length lwn . lwn affects the
accuracy of the wiresnaking algorithm; smaller lwn offers greater accuracy but typically
leads to more SPICE runs since skew reduction in each round of top-down wiresnaking
is smaller. lwn was set based on empirical analysis of the 45 nm technology used at
the ISPD contest before contest benchmarks became available. The applicability
of wiresnaking depends on the VLSI context. If the clock tree is competing for routing
resources with signal nets, then every effort should be taken to reduce the utilization of
(also, clock trees should avoid such areas to minimize crosstalk noise). On the other hand,
some ICs include abundant routing resources. This is the case for pad-limited designs and
designs whose area is determined by large IP blocks. The number of available metal layers
also plays a major role in the design of clock trees, and can vary dramatically between
designs, clock networks are given a dedicated metal layer, which makes wiresnaking much
more attractive.
One of the top-three teams at the ISPD 2009 clock-tree routing contest (NTU [87])
used dangling wires instead of wiresnaking. Rather than elongate a route, this strategy
adds a dead-end branch. The goal is to increase wire capacitance, and therefore increase
46
ISPD 09 F 11 ISPD 09 F 12 ISPD 09 F 21 ISPD 09 F 22 ISPD 09 F 31 ISPD 09 F 32 ISPD 09 FNB 1
CLR Skew CLR Skew CLR Skew CLR Skew CLR Skew CLR Skew CLR Skew
I NITIAL 56.2 30.6 75.8 49.0 89.3 59.2 52.0 31.6 152 117 122 88.2 31.9 21.2
TBS Z 55.6 46.8 80.0 66.2 89.5 76.3 43.2 33.7 140 129 111 98.3 31.5 21.1
TWS Z 23.4 15.1 19.7 8.13 26.0 12.3 16.4 6.93 43.1 32.2 27.2 14.8 30.8 20.4
TWS N 13.8 2.93 16.2 3.38 17.6 2.83 12.6 1.99 12.8 3.91 17.9 4.59 13.9 3.15
BWS N 13.4 2.87 15.3 2.61 17.4 2.74 12.4 2.23 12.8 3.91 17.9 4.59 13.4 3.5
Table 3.3: Progress achieved by individual steps of Contango on ISPD 2009 benchmarks:
the first letter in each acronym indicates top-down (T) or bottom-level (B) op-
timization, second letter differentiates wires (W) from buffers (B), while “Sz”
stands for “sizing” and “Sn” stands for “snaking”. Gray highlights indicate
whether skew or CLR was the primary optimization objective.
the delay. In comparing dangling wires to wire-snaking, we note that the former does not
alter the resistance that affects propagation delay. Therefore, to achieve a particular slow-
down, a much longer wire-branch is needed. On the positive side, the dependence of delay
increase on branch length is linear, and this may allow for more accurate tuning. In other
words, this technique offers a potentially greater accuracy, but smaller range because the
wires are found useful, they should be used at a later stage in the optimization flow.
After two top-down skew reduction phases, skew becomes small enough to perform
bottom level optimizations. Bottom-level wiresnaking optimize the wires directly con-
nected to sinks. This technique is more accurate than the top-down optimizations since
naking until the results stop improving. Typically the gain of bottom-level tuning is under
2 ps, but can be a significant fraction of remaining skew. We found that with skew < 5 ps,
the corner sinks of rising transition and falling transition are often different. This rise-fall
divergence makes further improvements to the clock tree very difficult. Indeed, reducing
47
rising skew by slowing down a fast sink for rising transition may increase falling skew
due to excessive slowdown of a slow sink for falling transition. In the Contango flow, the
average skew after bottom-level tuning is 3.21 ps on ISPD 2009 CNS contest benchmarks.
Table 3.3 shows the improvement of CLR and skew by each optimization algorithm.
Note that after iterative buffer sizing (TBSz), skew is increased but CLR does not change
much. This implies that TBSz reduced the variational part of CLR significantly. The
To validate our proposed techniques, we first present results on ISPD 2009 benchmarks
according to the contest protocol, then discuss the significance of specific optimizations
used by Contango, and then evaluate the scalability of our C++ implementation on larger
benchmarks from our industry colleagues. We measured runtimes on a 2.4 GHz Intel
QuadCore CPU running Linux, similar to CPUs used at the ISPD contest.
with up to 330 selected clock sinks [94]. Table 3.4 compares results of our software
Contango to the top three teams of the ISPD 2009 clock-network synthesis contest. On
average, Contango reduces CLR by 2.15×, 3.99× and 2.35× versus contest results by
NTU, NCTU and U. of Michigan respectively, excluding failures of NTU and NCTU
on benchmarks with many obstacles. All results are within the capacitance limits, but
Contango nearly exhausts the limits as a part of its strategy. On ISPD 2009 benchmarks,
maximum sink latency averages 1120 ps, while the average number of composite-buffer
48
C ONTANGO ( THIS WORK ) NTU NCTU U. OF M ICHIGAN
Benchmark 9/10/2009 3/30/2009 3/30/2009 3/30/2009
CLR Cap. CLR Cap. CLR Cap. CLR Cap.
ispd09f11 13.36 99.61 6488 26.71 85.53 14764 22.31 89.90 23358 32.29 73.86 3892
ispd09f12 15.27 99.99 6564 25.73 84.72 13934 22.18 87.86 14992 32.17 73.45 3944
ispd09f21 17.40 96.74 6673 30.54 80.79 14978 19.61 86.65 26420 34.31 74.30 4587
ispd09f22 12.36 97.43 3618 24.51 81.82 7189 16.38 85.01 9432 30.45 70.01 2005
ispd09f31 12.81 98.29 21379 45.07 73.49 40088 212.0 92.38 1.29 51.34 81.53 17333
ispd09f32 17.92 99.24 12895 36.90 80.14 3566 fail - - 40.32 77.39 10599
ispd09fnb1 13.40 78.38 778 fail - - fail - - 19.84 63.10 477
Average 14.65 95.66 8342 31.57 81.08 15753 58.49 88.36 14841 34.39 73.38 6120
Relative 1.0 1.0 1.0 2.15 0.85 1.89 3.99 0.92 1.78 2.35 0.77 0.73
Table 3.4: Results on the ISPD 2009 Contest benchmark suite. CLR is reported in ps,
capacitance in % of the limit specified in benchmarks, and CPU time in s. Best
results from the ISPD 2009 contest and best results overall are shown in bold.
Runtime is dominated by SPICE runs. It was not used for scoring at the ISPD
2009 contest and can be improved by using FastSPICE, Arnoldi approximation.
More recent results for ISPD 2009 benchmarks from ASPDAC‘10 [61, 63, 87] are
summarized in Table 3.5. The results in Table 3.5 show that Contango outperforms NTU
and NCTU by skew and CLR. HKPU [63] claims a 20% advantage in CLR, but more than
doubles nominal skew. Another interesting aspect of the HKPU work is that they rely on
SPICE very little in their optimizations and instead use the Elmore delay model, which
explains their low runtimes. The algorithms in [63] focus entirely on the optimization of
nominal skew, which does not explain the results — high nominal skew and low CLR. As
the authors of [63] have kindly provided their clock trees on our request, we observed that
those trees use very large buffers at the top levels of the tree (including but not limited to
the trunk) and small buffers toward the sinks. This strategy minimizes the impact of supply
voltage variations, but makes it more difficult to optimize nominal skew given a limited
capacitance budget.
were superseded by more powerful techniques. For example, skew reduction by buffer
insertion was unnecessary and undermined the robustness to variations. However, it can
49
Figure 3.3: The clock tree produced by Contango on ispd09f nb1. Sinks are indicated by
crosses, buffers are indicated by blue rectangles. L-shapes are drawn as “di-
agonal wires” to reduce clutter. Wires are colored by a red-green gradient to
reflect slow-down slacks, as described in Section 3.2.2. The impact of wires-
naking is too small to be visible.
Table 3.5: Results from ASPDAC’10 clock routing papers on the ISPD 2009 Contest
benchmark suite [61, 63, 87]. Runtimes may be from different workstations.
CLR and skew are reported in ps and CPU time in s. Only average skew was
published for HKPU [63].
be used as a last resort when detours around obstacles introduce extremely high skew.
Our wiresizing can be refined, but probably not beyond the accuracy of subsequent wires-
wires (unless clock wiring is given a dedicated metal layer). Dangling wires, used by NTU
50
ispd09f12 Full flow w/o TWSz w/o TWSn w/o BWSn
TWSz -58.11 - -58.11 -58.11
TWSn -4.740 -33.51 - -4.740
BWSn -0.773 0 -2.494 -
Skew 2.611 14.92 5.633 3.384
Table 3.6: The ‘Full flow’ column shows skew change at each step in the Contango flow,
and the final skew in ps. Acronyms are decoded in the caption of Table 3.3.
Subsequent columns show the impact of removing one optimization. These
results illustrate the range of each optimization and its impact on final results.
Table 3.6 the impact of removing each skew optimization step from the flow. It can be
seen that each step is necessary to achieve competitive results. Removing top-down wire-
sizing effects the greatest impact because this optimization offers the greatest range, and
Scalability studies. The ISPD 2009 contest was limited to unrealistically small numbers
of sinks due to limitations of the open-source ngSPICE software [71] it relied upon. To
HSPICE software [93].2 Working with a recent Texas Instruments chip sized 4.2mm ×
3.0mm, we identified locations of 135K sinks and randomly sampled them to create a
family of benchmarks. For this experiment, our algorithm used groups of large inverters
instead of groups of 8 parallel small inverters, improving runtime eightfold at the cost of
highly-optimized clock trees with up to 50K sinks. Table 3.7 shows that total capacitance
scales linearly with the number of sinks, and skew remains in single ps. The number of
HSPICE runs grows very slowly, but HSPICE remains the bottleneck.
2
The numbers produced by ngSPICE and HSPICE were fairly close, with the main difference being
runtime and scalability.
51
# sinks CLR, ps Skew, ps Latency, ps Cap., pF , min
200 13.47 2.124 506.8 52.21 2.2 (21)
500 14.84 2.174 528.0 99.53 6.28 (20)
1K 17.53 3.138 543.1 162.3 12.5 (20)
2K 16.56 3.136 543.9 276.1 19.3 (15)
5K 23.20 3.853 538.5 591.1 99.6 (22)
10K 25.54 5.562 538.0 1130 352.8 (23)
20K 32.47 10.46 546.8 2243 1867 (35)
50K 31.52 8.774 545.1 5243 16027 (45)
Table 3.7: Scalability on Texas Instruments benchmarks. The “Latency” column repre-
sents maximum 1.2V latencies. SPICE runs are counted in parenthesis.
3.5 Summary
Existing literature on clock networks offers several elegant algorithms, but does not
describe end-to-end solutions to clock-network synthesis that can handle modern inter-
connect. Our work makes several contributions to this end. F irst, we develop special-
ized optimization algorithms necessary to bridge the gaps between well-known point-
specific tuning and are amenable to embedding into design flows. Second, we develop an
a robust software implementation, called Contango, that outperforms the best results from
the ISPD 2009 contest [94] by a factor of two.3 F ourth, we scale our implementation to
large industrial clock networks. Based on their strong empirical results, our techniques
may improve timing and power of future ASICs and SoCs [33]. In CPU designs, our trees
can be integrated with meshes [82]. Here, better trees may facilitate smaller meshes and
reduce power consumption, which can be traded off for higher performance or longer bat-
tery life in portable applications. Optimization techniques presented in this chapter serve
52
CHAPTER IV
Clock networks contribute a significant fraction of dynamic power and can be a lim-
iting factor in high-performance CPUs and SoCs. The need for multi-objective optimiza-
tion over a large parameter space and the increasing impact of process variation make
clock network synthesis particularly challenging. In this chapter, we develop new mod-
eling techniques and algorithms, as well as a methodology, for clock power optimization
subject to tight skew constraints in the presence of process variations. Key contributions
include a new time-budgeting step for clock-tree tuning, accurate optimizations that satisfy
2.0, outperforms the winners of the ISPD 2010 clock-network synthesis contest on 45
nm benchmarks from Intel and IBM. To support emerging SPICE-accurate circuit opti-
simulations to large, buffered RC trees, that trades off precision for speed.
Recent developments in embedded CPU design stress the need for low-power clock
trees, yet also impose stringent skew limits, especially in the presence of process, voltage
53
and temperature (PVT) variation for sub-45 nm CMOS technology. Previous clock-tree
methodologies rely on symmetric and regular tree topologies, such as H-trees and fish-
bones [6, chapter 43], which do not require sophisticated design algorithms (see Section
1.1). However, these topologies experience difficulties with layout obstacles, non-uniform
sink distributions, and varied sink capacitances. Fully-automated clock-tree synthesis sup-
ported by commercial EDA tools offers clear advantages in terms of capacitance, but may
not be able to ensure sufficiently low skew for use in a 2 GHz CPU. For example, the
authors of [65] report clock trees generated by Cadence tools with skew that is orders of
• How can one minimize the impact of PVT variations in a clock tree?
• Given a single-picosecond skew requirement, how competitive are clock trees with
clock meshes?
Our approach to answering these questions is inspired by the ISPD 2010 clock-network
synthesis contest, which used several 2 GHz CPU benchmarks from IBM and Intel to
compare tools submitted by 10 teams across the world (downselected from 20 initial reg-
istrants). To evaluate the quality of the clock networks, difficult slew and skew constraints
were checked against 45 nm Monte-Carlo SPICE simulations that modeled PVT varia-
tions. Clock networks that cleared all constraints were compared by their total capacitance
54
— a proxy for dynamic power. In this context, we developed a suite of algorithms for the
design and thorough optimization of clock trees. The results of the ISPD 2010 contest
offer a rare opportunity to compare multiple strategies for clock-network synthesis — the
third-place team used symmetric trees [85], the second-place team used clock meshes, and
our team won the contest by optimizing clock trees built by the DME algorithm [12, 28].
specific innovations.
• A tabular technique to estimate the impact of variations on skew between two sinks.
individual edges of the tree so as to improve skew with minimal power resources.
This algorithm can be used in the context of PVT variations and is not specific to
our methodology.
• Fine tuning of optimized clock trees by gentle wire snaking, sufficiently accurate to
Our empirical results are compared to those of the winners of the ISPD 2010 clock-
network contest, where each team violated prescribed skew constraints (7.5 ps in most
cases) on at least some benchmarks in the presence of variations. However, results re-
ported in this chapter satisfy skew constraints on every benchmark. Our clock trees have
55
4.2× smaller capacitance than clock meshes produced by CNSrouter [95], while exhibit-
The SPICE circuit simulator has been routinely used to evaluate electronic circuits
since its initial release in 1973 [68]. SPICE represents circuits by nonlinear differen-
tial equations and then solves them numerically. While highly accurate, this technique is
prohibitively expensive for use in many circuit optimizations. Instead, the Elmore delay
model [29] is commonly used, but suffers several known shortcomings, such as its neglect
for the resistive shielding effect. More advanced compact delay models, such as fitted El-
more [1] and models based on higher-order moments — S2M [2], D2M [4], LnD [5], —
and more accurate models [72] [74] require heavier computation, but are still much faster
than SPICE. While such models have proven accurate and useful in certain well-defined
circumstances, the transition to the 45 nm technology node exposed a widening gap be-
tween the accuracy of SPICE simulation and available surrogates. In particular, many
closed-form models do not capture the impact of slew rate and the direction of signal tran-
sition. The impact of process variation is another major challenge to using non-SPICE
models. Additionally, accurate modeling requires a large number of device and intercon-
nect parameters included in SPICE inputs that are not accounted for by simpler models.
The need for fast SPICE-accurate circuit analysis was recently underlined by the ISPD
simulation proved difficult, partly due to the presence of numerous buffers in relevant
56
clock networks and due to the impact of process variation. We also note that low-skew
error in a 500 ps path delay is only 1%, but two paths with similar delay from the clock
source to sinks may result in 10 ps skew, where a 5 ps error is 50%. On the other hand,
the 12-hour runtime limit could not be met with conventional SPICE-based RC clock tree
generation algorithm [51] because simulation must be invoked many times per benchmark
curate) simulator for use by our implementation Contango 2.0. Our divide-and-conquer
approach is chosen among several other candidates for its flexible trade-off between run-
time and solution quality. Chop-SPICE not only enables Contango 2.0 to meet runtime
The remainder of this chapter is organized as follows. Section 4.2 describes optimiza-
tion objectives and variation modeling. Section 4.3 explains initial tree construction with
buffering. Section 4.4 details our techniques for robustness improvements. Section 4.5
outlines our skew optimization techniques. Section 4.6 describes our Chop-SPICE tech-
nique. Section 4.7 reports our empirical results. Summary is given in Section 4.8.
view key optimization objectives (global and local skew), define the notion of local-skew
57
4.2.1 Global and local skew
Definition IV.3 Given a clock tree Ψ, let λ(si ) be the clock latency (insertion delay) at
Nominal values of skewΨ (si , sj ) and ω Ψ are computed neglecting the impact of variations.
Global skew can be improved by decreasing maxi∈Ψ λ(si ) (speeding up the slowest sinks)
or increasing mini∈Ψ λ(si ) (delaying the fastest sinks). Previous publications on clock
network synthesis were focused on reducing global skew with or without the presence
of variations [12, 18, 38, 44, 51, 61, 63, 87, 97]. However, in a large clock network, skew
between adjacent and connected sinks is a more meaningful optimization objective [32,
81]. Local skew is defined by restricting eligible sink pairs to be within distance ∆ > 0,
Definition IV.4 Given a clock tree Ψ and a local skew distance bound ∆ > 0, let dist(si , sj )
be the Manhattan distance between sinks si and sj ∈ Ψ. Then the worst local skew [95]
is defined as
Ψ
ω∆ = max skew Ψ (si , sj ) (IV.3)
dist(si ,sj )<∆
58
Reducing skew down to single picoseconds in the presence of variations may require a
significant increase in power consumption. Since more than 30% of total power in modern
certain amount of clock skew, and power can be reduced provided that the clock network
Definition IV.5 Consider a clock tree Ψ, a local skew distance bound ∆ > 0, variation
model ν and target yield 0 < y ≤ 1. Let Ψν be the clock tree Ψ with variation ν and f (t)
Ψν
be the cumulative distribution function of ω∆ . Then the worst local skew with variation is
defined as
Ψ
ω∆,ν,y = f −1 (y) (IV.4)
Viewing the local skew limit Ω∆ as a design constraint (see Table 2.1), we pursue the
following goals.
1. Building variation-tolerant clock networks with ω∆,ν,y < Ω∆ , subject to slew con-
straints.
Given a clock tree with known sink latencies, one can optimize it using delay bud-
gets derived from the sink- and edge-slack calculation [51, Section 3], followed by global
skew optimization to reduce global skew below Ω∆ . This strategy is sound because local
59
skew ω∆ cannot exceed global skew. However, global skew optimizations attempt to re-
duce skew between sinks more distant than ∆, which may require unnecessary increase
in power. To tune the clock tree on a tight power budget, we propose the concept of
local-skew slack.
Definition IV.6 Given a clock tree Ψ and local-skew constraints Ω∆ , the local-skew slack
σ(s) for a sink s ∈ Ψ is the minimum amount of additional delay in picoseconds for s, so
Ψ
that the tree satisfies ω∆ < Ω∆ .
Algorithm 2 to calculate σ(s) for every sink. This algorithm uses varEst(si , sj ) = 0 in the
Once local-skew slacks σ(s) are computed for all sinks, we define local-skew slack of
tree edge e as the smallest slack of a downstream sink. Edge slacks in the entire tree can
be computed by one recursive tree traversal in linear time, giving the optimal amount of
tuning to improve worst local skew [51, Section 3]. Figure 4.1 illustrates the computation
7ps
6ps 5ps
A B C D E A B C D E A B C D E
664ps 675ps 687ps 671ps 681ps 13ps 7ps 0ps 5ps 0ps 677ps 682ps 687ps 676ps 681ps
(a) Sink latencies (b) Local-skew slack on sinks and edges (c) Sink latencies after optimization
Figure 4.1: Local-skew slack for sinks and edges when Ω∆ = 5 ps. (a) Sink pairs within
distance ∆ are enclosed by dashed lines. ω∆ = 12 ps based on sink latencies
and ∆. (b) Local skew-slack for sinks are computed by Algorithm 2. The
algorithm for edge-slack computation is described in [51, Section 3]. (c) ω∆ is
reduced to 5 ps after optimizations, which satisfies the local skew constraints.
60
Algorithm 2 Computing local-skew slack for sinks
σ = 0; //Set of minimum local slack
SinkQ = ∅; //Sinks to be optimized
for each sink si do
for each sj in N (si ) do
if ( λ(si ) < λ(sj ) and
skew(si , sj ) + varEst(si , sj ) > Ω∆ ) then
SinkQ.enqueue( si );
end if
end for
end for
while size(SinkQ) 6= 0 do
si = SinkQ.dequeue(); MaxSlack=0;
for each sj in N (si ) do
if (MaxSlack < skew(si , sj ) + varEst(si , sj ) − Ω∆ ) then
MaxSlack = skew(si , sj ) + varEst(si , sj ) − Ω∆ ;
end if
end for
σsi = MaxSlack;
for each sj in Ni do
if ( sj ∈
/ SinkQ and λ(sj ) + σsj < λ(si ) + σsi and |(λ(sj ) + σsj - (λ(si ) + σsi )| +
varEst(si , sj ) > Ω∆ ) then
SinkQ.enqueue( sj );
end if
end for
end while
age and temperature variations often results in significant skew in each chip. However,
variation-aware optimization has not been explored until recently and requires reliable es-
timation techniques. Monte-Carlo simulations are slow and not suitable to clock network
Our key insight is that the impact of variations on skew between two sinks is closely
61
correlated with tree path length and how the tree path is buffered. When two sinks can
be connected by a short path in the tree, variation of skew between them is small. On the
other hand, variational skew between sinks that are geometrically close can be significant
if the unique tree-path between them is long. This is illustrated in Figure 4.2. For a given
technology node, buffer library, wires and variation model, we propose to build a look-
up table with comprehensive information regarding the worst-case variation on skew for
B(766ps)
B(766ps)
Figure 4.2: The impact of variations on local skew. Sinks are indicated by crosses, the
clock source is indicated by a solid triangle. Nominal skew of 3 ps is shown
in (a). Full skew of 11 ps is shown in (b), where some tree edges are delayed
(thick red) and some are sped up (dotted green) by random variations. Only
sink A is within the local skew distance from sinks B, C and D.
Definition IV.7 Given a technology node T , buffer and wire library B, variation model ν
and desired yield 0 < y ≤ 1, let ΞT ,B,ν,y [w, b, t] be the variation-estimation table which
returns the worst-case increase in skew (with probability y) between two sinks connected
by a tree path of length w with b buffers and the buffer type t. When multiple buffer types
are used in the tree path, t is the smallest type in the tree path, so as to avoid under-
estimation of variation.
62
To build the table, we generated a large number of test trees on public CNS benchmarks
Section 4.3 with various buffer types is utilized for the test trees. The number of Monte-
Carlo SPICE simulations is determined based on the given variation model ν. Variational
skew between any two sinks during the simulations is recorded in the table with classi-
function for each (w, b, t) entry in order to look up with yield y. Building the variation-
estimation table requires extensive simulations, but once the table is built, it can be used
for many clock trees. To determine the impact of variation on skew between sinks in
a clock tree, a function varEst(si , sj ) is defined as follows. Given a clock tree Ψ and
a variation table ΞT ,B,ν , let L(si , sj ) be the total length of wires, bn (si , sj ) be the total
number of buffers and bt (si , sj ) be the largest buffer type in the tree path between two
To estimate the impact of variations when optimizing clock trees we utilize varEst()
when computing local-skew slack for each sink (Algorithm 2). Without considering varia-
tions, it is sufficient to satisfy skew(si , sj ) < Ω∆ for all pairs of sinks within ∆. However,
Ψ
Theorem IV.1 ω∆,ν,y < Ω∆ only if
63
4.3 Initial tree construction and buffer insertion
We invoke the unmodified ZST-DME algorithm [23, 38] and perform initial buffer in-
sertion to minimize source-to-sink Elmore delay, rather than skew or capacitance [30, 84].
Elmore delay is too inaccurate for skew optimization, but our approach creates signifi-
cant room for tuning the clock tree by delaying fast paths [51]. In the presence of layout
ISPD 2010 benchmarks include obstacles over which wire-routing is possible but buffer
insertion is not allowed. We adapted a simple and robust technique for obstacle avoid-
ance in clock trees from [51] which repairs obstacle violations in the trees obtained by the
ZST-DME algorithm.
When multiple wire types are available, the choice of wires affects both total power
designs, thicker wires (on a given metal layer) are preferable because they limit the impact
of variations and still allow for future power-performance trade-offs by wire sizing. In less
aggressive ASIC and SoC designs, power optimization may motivate thinner wires. But
upsizing wires in a reasonably tuned clock tree may be of limited use because it increases
Selecting buffer types for initial buffer insertion is also important. Given an initial tree
without buffers Ψ0 , let t(si , sj ) be the type of a buffer required for the tree path between
two sinks si and sj ∈ Ψ0 to satisfy varEst(si , sj ) < Ω∆ . t(si , sj ) can be found from
the variation-estimation table ΞT ,B,ν with L(si , sj ). Since bn (si , sj ) is not available at this
step, it is difficult to find the exact required t(si , sj ). However, because bn (si , sj ) and
64
L(si , sj ) are highly correlated with each other, bn (si , sj ) can be estimated by modeling
it with the average number of buffers corresponding to L(si , sj ). Once bn (si , sj ) is esti-
mated, t(si , sj ) can be computed as described in Section 4.4. The initial buffer type (t0 )
Once t0 is determined, we adopt the fast variant of van Ginneken’s algorithm from [84]
for initial buffer insertion. The main objective of van Ginneken’s algorithm is to mini-
mize insertion delay of a given RC tree. When clock-power reduction is more important
than insertion delay, the buffering algorithm can be modified to place buffers with worse
insertion delay but significantly smaller total capacitance. We modified the fast variant
three phases. First, parameters of input instances for the buffering algorithm are adjusted
to reduce buffer numbers based on given a slew constraint and a variation model. The fast
variant of van Ginneken’s algorithm [84] is utilized in the second phase. In the third phase,
sizes of inserted buffers are adjusted to further reduce total capacitance while satisfying
slew constraints.
After initial buffer insertion, bn (si , sj ) ∀si , sj ∈ Ψ is determined, and more accurate
t(si , sj ) can be obtained. For sink pairs that do not satisfy varEst(si , sj ) < Ω∆ , we use
the robustness-improvement algorithm from Section 4.4 to ensure that the tree eventually
Ψ
satisfies ω∆ < Ω∆ .
65
4.4 Improving robustness to variations
The initial buffer insertion algorithm cannot accurately estimate buffer types required
for local-skew constraints for a given initial tree. Therefore robustness-improvement must
Ψ
follow after initial buffer insertion so that ω∆,ν,y < Ω∆ holds after all the skew optimization
In an ideal situation in which we can reduce all the skew down to 0, varEst(si , sj ) <
Ψ
Ω∆ ∀si , sj ∈ Ψ is sufficient to satisfy ω∆,ν,y < Ω∆ . In practice we must estimate nominal
experience.
Ψ
ω∆,ν,y < Ω∆ (IV.7)
The target buffer type for the tree-path between sink si and sj , t(si , sj ) can be com-
From the above method, the minimum size of buffer type which satisfies varEst(si , sj ) <
Ω∆ - skew Ψ
est is selected to reduce capacitance. Once t(si , sj ) is determined, the buffers in
the tree path between sink si and sj are substituted with type t(si , sj ) buffers. This step is
66
4.5 Skew optimizations
In this section, several local skew optimization techniques are described. Each tech-
nique is designed to reduce skew under different circumstances, but the primary objective
is to optimize the skew of given tree to below the local skew limit in the presence of varia-
tions. The target tuning amount for each edge of the tree can be determined by local-skew
Wire sizing and wire snaking are popular techniques for skew optimization and are
often able to reduce global or local skew down to the practical skew limit. In this context,
however, we exclude wire sizing because narrowing down a wire in the middle of a clock
tree is risky due to the impact of variations. We extend the wire snaking technique from
[51] to improve its speed and accuracy, while limiting its use of routing resources.
The optimal tuning amount for each edge can be obtained by the top-down slack com-
putation explained in Section 4.2.2. Let Ttarget (e) be the amount of time in ps by which
the edge e must be delayed to achieve legal ω∆ under local skew constraints. Lsn (e) de-
notes the length of the wire determined by the wire snaking algorithm to delay the edge
e by Ttarget (e). Let Tactual (e) be the amount of time in ps which the edge e is actually
delayed by Lsn (e) of a wire. Ideally, the wire snaking algorithm can estimate Lsn (e) so
that Ttarget (e) = Tactual (e). Lideal (e) is the length which satisfies Ttarget (e) = Tactual (e).
X
T otalCapsn = κ(Lsn (ei )) (IV.9)
ei ∈E
67
where κ(w) denotes the capacitance of a wire w, and the ideal total additional capacitance
T otalCapideal is
X
T otalCapideal = κ(Lideal (ei )) (IV.10)
ei ∈E
Practically, Tactual (e) 6= Ttarget (e) unless extensive SPICE simulations are performed for
finding Lsn (e), which is unrealistic in terms of runtime for a clock network synthesis flow.
When Tactual (e) < Ttarget (e), another round of wire snaking is required to bring Tactual (e)
closer to Ttarget (e). Lisn (e) denotes the length of the wire determined at ith iteration of the
i
wire snaking algorithm to delay the edge e. Tactual (e) is the amount of time in ps by which
N
X N
X
i
Tactual (e) = Tactual (e), Lsn (e) = Lisn (e) (IV.11)
i=1 i=1
Theorem IV.4 If Tactual (e) ≤ Ttarget (e) for every e in the clock tree, then
T otalCapsn ≤ T otalCapideal
If the wire snaking algorithm over-estimates Lisn (e) and results in Tactual
i i
(e) > Ttarget (e)
for any edge e in any i-th iteration, then T otalCapsn exceeds T otalCapideal after all the
iterations of the wire snaking algorithm because it means there exists excessive delay of
a wire which results in excessive delay of some sinks and possibly increases local skew
68
around the sinks. Therefore the wire snaking algorithm must produce Lisn (e) which satis-
i i
fies Tactual (e) ≤ Ttarget (e) for optimized power consumption by the clock tree. However,
i i
if the gap between Tactual (e) and Ttarget (e) is too big, more iterations will be needed for
Tactual (e) to approach Ttarget (e). We improve the accuracy of wire snaking in two ways.
i i
Delay model for wire snaking. To keep Tactual (e) ≤ Ttarget (e) with optimal quality, we
define α where,
i
Tactual (e)
α≤ i
≤ 1.0 (IV.12)
Ttarget (e)
i
Wire snaking algorithm aims for Tactual (e) to satisfy the above inequality with the highest
log (ε) m
l
N= (IV.13)
log (1 − α)
i
Closed-form delay models like Elmore delay are not accurate enough to keep Tactual (e) ≤
i
Ttarget (e) and α high. To enhance the quality of estimation by the wire snaking algorithm,
look-up tables for Lisn (e) are built by performing a set of SPICE simulations for each tech-
nology environment which includes technology model, types of buffers and wires, varia-
i
tion specification. In the simulations, Tactual (e) is tested with different snaking lengths
on various locations of nodes in various types of clock trees. The results of simulations
are stored in a look-up table, used by wire snaking during local skew optimization. We
achieved α values between 60% and 70%, therefore 4 ≤ N ≤ 6. Only one technology en-
vironment was used at the ISPD 2010 CNS contest, requiring a single set of simulations.
Optimal node selection for wire snaking. Figure 4.3 compares two different styles of
wire snaking. Figure 4.3(b) illustrates undesired delay of sinks after wire snaking on non-
69
buffer-output nodes. The increased capacitance and resistance by wire snaking affects the
driving buffer which results in additional delay of slow sinks. Wire snaking at buffer out-
put nodes, as in Figure 4.3(c), is much more accurate than wire snaking at any branch.
Limiting wire snaking to buffer output nodes reduces the number of SPICE calls required
for clock-tree tuning. This also reduces the number of simulations for building the look-up
table by limiting the number of target nodes to be tested. Wire snaking usually increases
slew rate of input nodes of downstream buffers. To prevent slew violation, slew rate num-
bers of downstream buffers are checked and if the worst slew rate is more than 70% of the
given slew limit, the target node is excluded from wire snaking.
The local skew of a sink cluster driven by the same final buffer is often negligible.
However, highly unbalanced sink capacitances or layout obstacles in those clusters can
result in significant local skew. An alternative technique is needed because wire snaking
in Section 4.5.1 is inapplicable. In this case, inserting a buffer at the target node is very
efficient for two reasons. First, skew can be reduced by the delay of the inserted buffer.
70
Second, further precise wire snaking is possible because the inserted buffer isolates the
Let W(B) be the set of sinks driven by a final buffer B and d(B) be the delay of the
buffer B. Delay buffer insertion is required if there exists si , sj ∈ W(b) where skew(si , sj )
For each path from the buffer to the sinks, inserting at most one buffer is sufficient
since the wire snaking algorithm in Section 4.5.1 can be invoked again at the output node
of inserted buffers. Figure 4.4 illustrates delay buffer insertion algorithm followed by wire
snaking. When a delay buffer is inserted, it is placed at the node so that the input capaci-
tance of a delay buffer is comparable to the sum of downstream sink and wire capacitance
of the target node, thus sink latency in the other path changes very little. See Figure 4.4b.
develop a technique to reduce SPICE runtime at the modest cost in terms of accuracy.
71
4.6.1 Runtime versus accuracy
portant to identify key problems with available solutions. SPICE produces highly accurate
results and is uncontested on very small (sub)circuits. However, its runtime scales worse
than linear as circuits grow [41]. Compact and algorithmic delay models scale much bet-
ter, but their accuracy lags behind SPICE results even for fairly small (sub)circuits. Sev-
eral factors contribute to loss of accuracy: (i) the combination of buffers and inverters in
clock trees makes it difficult to precisely calibrate technology parameters, (ii) closed-form
models do not account for sufficiently many technology parameters, (iii) many closed-
form models do not adequately model slew propagation and its impact on delay, and (iv)
averaging or corrections.
Given our initial emphasis on runtime scaling, rather than actual runtimes, using SPICE
appears the best option on small (sub)circuits. To avoid non-linear runtime scaling, we par-
tition the original RC tree into sub-circuits, invoke SPICE on them, and assemble results
(sub)circuits are sufficiently independent. This step can be performed by splitting the
original tree at buffers, which effectively shields downstream capacitance from upstream
drivers. On the other hand, the tree topology prevents complicated min-cut configurations,
and sub-circuit selection can be based largely on the desired granularity as a means to
72
4.6.2 The Chop-SPICE algorithm
to implement and inherits the accuracy and reliability of the SPICE simulator.
Using any tree traversal technique, such as BFS, Chop-SPICE divides the circuit based
on target granularity γ (number of probing points). Once γ is met, Chop-SPICE (i) stops
the tree traversal, (ii) performs the SPICE simulation on the sub-circuit, (iii) retrieves the
relevant simulation results, and (iv) propagates the delay and slew information to input
nodes of subsequent sub-circuits. Such iterations continue until the original RC tree is
Formally, given an RC tree Ψ, let ω Ψ be the set of probing points in Ψ, which are
defined as input nodes of buffers or sink nodes (Figure 4.5). Let λ(si ) be the number of
fanouts at node si ∈ Ψ to probing points in ω Ψ . Then, the maximum granularity α(Ψ) and
X
α(Ψ) = λ(si ) (IV.14)
si ∈Ψ
pi ∈ ω Ψ is an input node of buffer(s), all fanout buffers are also explicitly included in
the current sub-circuit (Figure 4.6). Hence, sub-circuits are always delimited by buffers
and/or sinks, and a buffer at the boundary of a sub-circuit may also appear in another sub-
circuit. This is convenient for delay modeling because buffers effectively shield most of
73
downstream capacitance (in terms of first-order analysis) from upstream circuit elements,
Delay and slew propagation. Our implementation of Chop-SPICE relies on the stan-
dard Π modeling of interconnect [76]. After the probing points’ delay and slew are cal-
culated by SPICE, they can be propagated in order to accurately capture delay and slew
for probing points in subsequent sub-circuits. Note that probing points in ω Ψ can become
input nodes si of sub-circuits at the subsequent stage, so long as they are input nodes of
buffers. Therefore, measured delays can be directly propagated to the following stages.
Delay from the root node s0 to another node sj is calculated in separate steps. First, we
find the sub-circuit containing sj . Second, we identify the shortest tree path from s0 to sj ,
and the earliest node si in the sub-circuit that lies on this tree path. We now assume that
signal delay at si was computed at previous stages of Algorithm 3, i.e., recursively. The
Slew at a given node can be expressed as a function of input slews of an upstream sub-
circuit. Therefore, slew measured at the previous stage (up to the root node in a given
sub-circuit) should be accounted for when stimuli for the current sub-circuit are gener-
ated. Slew rate for stimuli should be the same as slew rate at the input node si of the
74
Algorithm 3 Chop-SPICE
Circuit instance Ψ
Target granularity γ, where β(Ψ) ≤ γ ≤ α(Ψ)
Root node of RC tree=s0
Tree traversal queue Q=∅
1: Q.enqueue(s0)
2: while !Q.empty() do
3: si = Q.dequeue()
4; if si is sink then continue
5: Empty sub-circuit instance Ψ0
6: Probing point queue P =∅
7: Fanout node queue F =∅
8: F .push(fanout nodes of si )
9: while P .size() < γ do
10: Subsequent input node queue E={}
11: while !F .empty() do
12: fo = F .dequeue()
13: Copy the delay of si to the delay of fo
14: if fo is ∈ ω Ψ then
15: P .enqueue(fo)
16: E.enqueue(fanout nodes of fo )
17: write the specification of the buffer / sink to Ψ0
18: else if fo is a wire segment then
19: write the specification of the wire segment to Ψ0
20: F .enqueue(fanout nodes of fo )
21: end while
22: F .enqueue(every e ∈ E)
23: end while
24: Generate stimuli based on slew of si
25: Invoke SPICE simulation on Ψ0
26: Retrieve simulation results
27: Update delay and slew information on P
28: Q.enqueue(every e ∈ E)
29: end while
4.7.4, source-to-sink delay errors reported in Table 4.4 cannot be ignored during skew
optimizations because typical target nominal skews are below 5 ps. Therefore we propose
a technique that estimates delay errors and compensates them for higher accuracy. Due to
high fidelity of Chop-SPICE, actual delays measured by full-scale SPICE can be estimated
75
Figure 4.6: Sub-circuits are delimited by boundary buffers.
before SPICE-driven skew optimizations, we run full-scale SPICE and Chop-SPICE and
compare sink delays measured by both methods. Then we calculate scaling factors for
each sink for delay compensation. For example, a delay from a source to a sink s may be
measured as 1010 ps and 1000 ps by full-scale SPICE and Chop-SPICE respectively. Then
we set the scaling factor for sink s to 1.01. Second, scaling factors are utilized to estimate
actual sink delays based on delays measured by Chop-SPICE. In the previous example, it
may turn out that the delay of sink s must increase at least to 1020 ps to satisfy a given
skew constraint. After iterations of skew optimizations, the delay of sink s is measured
as 1010 ps by Chop-SPICE. Based on the scaling factor 1.01, we estimate that the actual
delay of sink s is 1020.1 ps, which satisfies the target delay. We achieve nominal skews
under 5 ps based on the above method while runtime improvement is more than 2.5×.
Our implementation, Contango 2.0, is written in C++ and is based on our software
Contango 1.0 [51] that shared the first place at the ISPD 2009 clock-network synthesis
76
CNS ROUTER NTU CLOCK C ONTANGO 2.0
3/16/2010 3/16/2010 4/14/2010
Skew+Var Tot. Skew+Var Tot. Nom. Skew+Var Total Cap.
Mean 95% Cap. Mean 95% Cap. skew Mean 95% Yield (wire snaking)
cns01 5.27 7.32 841 20 6.71 8.66 294 15 2.51 5.16 7.01 97.4% 198(2.6) 12015
cns02 6.27 8.33 1454 81 8.27 10.7 833 176 2.99 5.58 7.34 95.8% 376(1.3) 25006
cns03 2.07 2.70 163 8 6.82 8.63 167 6 1.5 3.03 4.18 99.6% 55.9(1.0) 3840
cns04 2.83 3.98 277 32 7.45 9.55 325 58 2.07 3.26 4.46 99.9% 71.84(1.46) 6075
cns05 2.88 4.38 176 7 5.30 6.98 130 11 1.50 3.01 4.41 99.9% 37.7(5.43) 2406
cns06 12.4 14.0 133 11 407 417 1577 10 4.29 5.03 6.05 99.9% 47.8(3.0) 2660
cns07 12.4 15.7 309 45 6.17 8.12 276 66 2.22 3.41 4.58 99.9% 72.7(1.1) 2351
cns08 6.15 7.33 223 19 5.94 7.64 166 7 3.42 4.15 5.15 99.9% 52.5(1.9) 1987
Rel. 4.22 4.13 1.0
Table 4.1: Results on the ISPD 2010 Contest benchmark suite. Skew numbers are re-
ported in ps, capacitance in pF and CPU time in s. ‘95%’ represents ω∆,ν,95 .
The numbers in parentheses of the capacitance column refer to the fraction of
capacitance of the snaked wires in %. Skew constraint violations are shown in
strikethrough font. Otherwise, skew results are not comparable because skew
can be traded for capacitance, which was the primary objective of the contest.
All networks produced by these tools satisfy slew constraints imposed at the
ISPD 2010 contest. Due to limited page space, we do not include results for the
other teams, but significantly outperform them in solution quality.
contest. Contango 2.0 was the sole winner of the ISPD 2010 contest, but we now report
Table 2.1 lists the statistics of all benchmarks from the ISPD 2010 contest. The
insertion-delay driven buffer insertion algorithm [84] is utilized to produce our clock trees
in this experiment. The contest limited slew to 100 ps, and all reported clock networks sat-
isfy this constraint. Slews in Contango 2.0 trees do not exceed 81 ps. Table 4.1 compares
Contango 2.0 with CNSrouter and NTUclock. Clock networks produced by our software
have smaller capacitance than CNSrouter and NTUclock on average by 4.22× and 4.13×
respectively. The contest imposed local skew constraints with yield y = 95%. Our clock
trees always yield > 95%, while CNSrouter violates yield constraints on three bench-
marks and NTUclock on all benchmarks except one. All three teams satisfied the 12-hour
77
runtime limit for all benchmarks. Our data suggest that wire snaking usually increases
wire length by 1-3% (5.43% in one case), which is small enough to neglect the negative
effects of wire snaking. Figure 4.8 compares probability density functions (pdf) produced
by Monte-Carlo SPICE simulations of our clock trees to those of clock meshes produced
by CNSrouter. One such clock tree is illustrated in Figure 4.7. Despite the dramatic dif-
ferences in network topology and total capacitance between trees and meshes, some of the
plots in Figure 4.8 bear striking resemblance (cns01, cns02, cns04, cns05). To explain this
phenomenon, we recall that meshes cannot be buffered directly and are therefore driven
by a buffered clock tree. Such a clock tree can be constructed by the same DME algo-
rithm that we use, which is why the pdf profiles in Figure 4.8 reflect the pointset of sink
locations. Apparently, the mesh does not significantly change this profile.
Table 4.2 compares our clock trees with the slew-constrained buffering algorithm in
Section 4.3 to the existing state-of-the-art clock networks [13,67] on the ISPD 2010 bench-
marks. Compared to our clock trees in Table 2.1, we reduce total capacitance by 32.8% on
average. Clock networks produced by our software exhibit smaller capacitance than the
clock networks in [13] and [67] on average by 23.0%, 1.2% respectively. All the results in
Figure 4.9 describes experiments on benchmark ispd10cns08 with different local skew
constraints. When tight local skew constraints are given, large buffers are required to
ensure robustness to variations, increasing the capacitance of the clock tree. On the other
78
[13] [67] C ONTANGO 2.0
ISPD‘10 3/27/2011 3/27/2011 9/18/2011
Bench. Skew+Var Total Skew+Var Total Skew+Var Total
Mean ω∆,ν,95 Cap. Mean ω∆,ν,95 Cap. Mean ω∆,ν,95 Cap.
cns01 4.01 5.79 177.5 5.06 7.32 142.6 5.20 7.25 141.5
cns02 4.98 6.69 329.9 5.81 7.42 265.2 5.79 7.42 264.4
cns03 2.44 3.46 50.81 2.74 4.49 36.61 3.39 4.63 35.81
cns04 2.84 3.79 57.44 3.96 6.70 51.07 3.91 5.26 47.90
cns05 2.72 3.68 28.93 2.16 4.78 25.13 4.31 6.57 25.45
cns06 3.03 4.01 36.12 4.73 6.41 32.68 4.56 6.37 35.02
cns07 3.81 5.65 57.93 4.04 5.86 48.32 4.48 5.99 47.09
cns08 2.89 4.24 40.43 3.41 5.07 32.70 4.02 5.71 31.18
Relative 1.230 1.012 1.000
Table 4.2: Our clock trees for the ISPD 2010 benchmarks, buffered by our slew-
constrained algorithm, versus existing state-of-the-art clock networks [13, 67].
Skew numbers are reported in ps, capacitance in pF . All networks produced by
these tools satisfy slew constraints from the ISPD 2010 contest.
Figure 4.7: Our clock tree for ispd10cns07. Sinks are indicated by crosses, buffers are in-
dicated by blue rectangles. ∆ = 600µm is shown near the left-bottom corner.
hand, a large portion of capacitance can be saved when local skew constraints are loose.
To clarify the impact of variation, we plot variational skew (y-axis), defined as ω∆,ν,y - ω∆
79
0.25 0.2 0.5 0.35
0.3
0.2 0.4
0.15
0.25
0.15 0.3 0.2
0.1
0.1 0.2 0.15
0.1
0.05
0.05 0.1
0.05
0 0 0 0
0 2 4 6 8 10 12 0 2 4 6 8 10 12 0 2 4 6 8 0 2 4 6 8 10
0 0 0 0
0 2 4 6 8 10 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 18 20 0 2 4 6 8 10
Figure 4.8: Probability density functions for worst local skew of our clock trees (blue line)
and meshes produced by CNSrouter (gray dashed line) for the eight ISPD 2010
benchmarks, calculated using 500 independent SPICE runs for each bench-
mark. The x-axis shows skew in picoseconds. Local skew limits (Ω∆ ) are
shown with red solid lines, and the 95%-ile of local skew (ω∆,ν,0.95 ) are shown
by dotted green lines (our work) and dashed gray vertical lines (CNSrouter).
6
Variational skew (ps)
1
42 44 46 48 50 52 54
Capacitance (pF)
Figure 4.9: Trade-off between capacitance and robustness on ispd10cns08. The x-axis
represents total capacitance of a tree and y-axis represents the maximal varia-
tional skew at 95% yield.
80
4.7.4 Empirical validation of Chop-SPICE
Runtimes and errors. Table 4.3 and Table 4.4 show Chop-SPICE runtimes and errors
as target granularity γ varies from α(Ψ) (full-scale SPICE simulation) to β(Ψ) on each
circuit. Delay and slew errors at each sink are measured by absolute difference with delay
and slew, respectively, from the full-scale SPICE simulation of each circuit. With decreas-
ing granularity and probing points in sub-circuits, delay and slew errors increase while
runtimes decrease. We observed runtime improvements of 1.2 - 4.6×, delay errors of 4.99
- 10.73 ps, and slew errors of 2.33 - 4.21 ps per benchmark for the smallest sub-circuits
of full-scale SPICE during optimization. To this end, we analyze intermediate clock trees
produced by Contango2. During each iteration, we use Chop-SPICE and full-scale SPICE
to measure sink delays for clock trees before and after optimization, and compare the de-
lay differences. Let Ψk = clock tree at the k th iteration of Contango2 optimization. Then,
considering clock trees Ψk and Ψk+1 , we calculate the delay difference for each sink be-
tween Ψk and Ψk+1 , as measured by Chop-SPICE and full-scale SPICE. Let delayk (si ) be
the delay of node si on RC clock tree Ψk , then ∆k (si ) = delayk+1 (si ) – delayk (si ) where
Empirical results are depicted in Figure 4.10. For each datapoint, the x-axis represents
the delay difference measured by Chop-SPICE while the y-axis represents the delay dif-
81
were used. Most ∆k (si ) values are positive since the Contango2 optimizations increase
For datapoints that fall on the straight line, Chop-SPICE and full-scale SPICE agreed on
delay differences; datapoints above the straight line indicate that Chop-SPICE underes-
timates the delay difference while datapoints below the line indicate that Chop-SPICE
overestimates the delay difference. As shown in the figure, the majority of datapoints fall
slightly above or below the line (a difference of a few picoseconds). Moreover, we have
shown in Section 4.7.4 that the delay and slew errors of using Chop-SPICE are within 10
20
15
10
0 5 10 15 20
∆k measured by Chop-SPICE (ps)
Figure 4.10: Fidelity of Chop-SPICE estimates (γ = β(Ψ)) for CNS 02, CNS 07, and
CNS 04 ISPD 2010 benchmarks.
Table 4.5 compares our software to a commercial clock-tree synthesis tool, Cadence
technology with a 5 ns target clock period. When skew requirement is set to 2 ps, En-
counter produces a clock tree with 97.0 ps worst insertion delay and 8.5 ps nominal skew.
82
1400 10
Runtimes (seconds)
1000
6
800
4
600
2
400 Runtime
0
200
0 20 40 60 80 100
Target number of probing points (%)
Table 4.3: Averages delay (D-err.) and slew error (S-err.) (in picoseconds) and runtimes
(in seconds) with varying the granularity on the ISPD 2010 CNS contest bench-
mark suite. The clock networks are synthesized by Contango2.
with chip size 84 µm by 84 µm and the target clock period 2 ns. Our software produces a
clock network with 62.6 ps worst insertion delay and 1.6 ps nominal skew. To compensate
for the differences in technology parameters, note that skew improvement considerably
outpaces insertion-delay improvement. The result shows that our nominal-skew optimiza-
tion techniques are more precise and reliable than Encounter when a given skew constraint
is difficult to satisfy.
83
Bench. β(Ψ)+0.5·θ(Ψ) β(Ψ)+0.25·θ(Ψ) β(Ψ)
D-err. S-err. D-err. S-err. D-err. S-err.
CNS 01 1.861 2.152 1.882 2.710 4.993 3.737
CNS 02 6.775 3.847 6.757 3.839 10.745 4.214
CNS 03 1.955 2.551 1.656 2.559 6.764 2.684
CNS 04 1.462 2.423 7.623 1.885 7.623 2.609
CNS 05 4.880 1.853 4.880 1.853 7.067 2.328
CNS 06 7.063 2.860 7.063 3.107 7.063 2.908
CNS 07 1.625 2.862 7.046 2.229 7.449 2.740
CNS 08 7.107 3.132 7.079 3.637 7.074 3.645
Average 4.091 2.710 5.498 2.727 7.347 3.108
Table 4.4: Maximum delay (D-err.) and slew errors (S-err.) in ps, on the ISPD 2010 CNS
contest benchmark suite, as sub-circuit granularity varies.
Table 4.5: Comparison of our software on a design with 309 registers to a commercial
clock-tree synthesis tool, Cadence First Encounter. Skew limit 2.0 ps is used
to produce each clock network.
4.8 Summary
CPU clock designs. Mesh structures often sacrifice power to improve robustness to varia-
tions. We propose a tree solution for CPU clock routing that improves power consumption
under tight skew constraints in the presence of variations. To this end, we introduce the
notion of local-skew slack for clock trees, a model for variational skew, a path-based tech-
nique to enhance robustness, a new time-budgeting algorithm for clock-tree tuning and
accurate optimizations that satisfy budgets. We have shown that clock trees can be tuned
to have nominal skew below 5 ps and total skew in single picoseconds in the presence of
variations. Our optimizations not only satisfy given skew constraints and target yield but
84
also lead to 4.22× capacitance improvement on average over mesh structures proposed in
the ISPD 2010 contest. Furthermore, our clock trees had a higher yield than the meshes
because meshes are not as easy to tune for nominal skew. The comparison with a com-
mercial clock-tree synthesis tool shows that our nominal-skew optimization techniques are
more precise and reliable than an industrial tool when a given skew constraint is difficult
to satisfy. Our analysis does not consider gated clocks, inductive effects and short-circuit
power in meshes, but these factors generally favor trees over meshes. Our strong empir-
ical results suggest that clock trees constructed using accurate variational skew modeling
and optimizations have distinct advantage in power consumption and similar robustness
as meshes. Hence, our techniques may improve power of future CPUs without sacrificing
85
PART III
CHAPTER V
Traditional IC design flows optimize clock networks after placement and thus limit the
quality of clock networks by register locations. Existing publications also reflect this bias
and focus mostly on clock routing. The few known techniques for register placement ex-
hibit significant limitations and do not account for recent progress in large-scale placement
In this chapter, we integrate clock network synthesis with global placement by opti-
mizing register locations. We propose (1) obstacle-aware virtual clock-tree synthesis; (2)
arboreal clock-net contraction force with virtual-node insertion, which can handle multiple
clock domains and gated clocks; (3) an obstacle-avoidance force. Our work is validated
86
on large benchmarks with numerous macro blocks. Experimental results indicate that our
software implementation, called Lopper, prunes clock-tree branches to reduce their length
5.1 Introduction
Our analysis of prior work reveals serious limitations in published techniques. Some
methods coerce the placer into shortening the clock tree by capturing portions of the clock
tree with the half-perimeter wirelength (HPWL) objective, which is usually applied only
to signal nets [20, 103]. This idea overlooks the fact that low-skew clock trees exhibit
much greater wirelength than signal nets with the same bounding box. To make mat-
ters worse, the HPWL estimate does not offer much fidelity for clock-tree lengths, as we
show in Figure 5.2. Furthermore, a handful of existing publications that optimize clock
networks during placement (reviewed in Section 5.2) do not reflect recent progress in
large-scale placement and clock-network synthesis, and do not compare their results with
best-of-breed software. In most cases, they are evaluated on small benchmarks without
routing/buffering obstacles rather than on modern ASIC or SoC designs with many macro
blocks. Our research addresses these gaps in the literature by developing a set of new
techniques for clock-net optimization during placement and evaluating these techniques
against leading academic software. We extended the ISPD 2005 benchmark suite toward
clock-network synthesis, with the largest benchmark including 2.1M standard cells and
327K registers. The benchmarks include macros, which we interpret as routing obstacles.
87
To optimize the trade-off between clock network minimization and traditional place-
clock-net force using virtual nodes. A key challenge addressed in our work is preserv-
ing the quality of global placement when adding clock-net optimizations. We also ac-
commodate multiple clock domains and gated clocks. Our algorithms are integrated into
the SimPL placer [47], which currently produces lowest-wirelength placements on the
ISPD‘05 benchmarks. The quality of register placement is evaluated by Contango 2.0 [53]
– the winner of the ISPD 2010 contest. Experimental results show that our method can
Modern CPU designs demand low-power clock networks, yet also impose stringent
skew limits, especially in the presence of process, voltage and temperature (PVT) varia-
tion for sub-45 nm CMOS technologies. Clock networks that are robust to PVT varia-
tions are usually not power efficient. To this end, our proposed methodology integrates
of global placement, and therefore produces more robust clock trees without increasing
wirelength-driven optimizations.
The remainder of this chapter is organized as follows. Section 5.2 covers limitations
of existing techniques. Section 5.3 reviews the optimization objective for clock-net opti-
mization in placement subject to dynamic-power reduction. Section 5.4 describes our tech-
88
niques for high-quality register placement. Section 5.5 describes our methodology for in-
tegrating proposed techniques into a state-of-the-art placer used in industry and academia.
Our empirical results are described in Section 5.6. Summary is given in Section 5.7.
Clock-net optimization during placement seeks better register locations but should not
harm total wirelength of signal nets. A naive method is to increase the weight of the clock
net and pull all registers together. Unfortunately, this method increases routing congestion
and hot spots, and also leads to poor signal-net wirelength when dealing with more than
several hundred registers [20,103]. To resolve the conflict between clock-net minimization
ring guidance methods commit registers to certain guidance locations and try to pull the
registers close to the nearest such locations during placement [64]. However, such methods
do poorly in the presence of numerous obstacles, e.g., macro-blocks, or when register lo-
cations found by the global placer are not uniformly distributed. In other words, guidance
rings cannot accurately predict ideal locations for register clusters. Figure 5.1 illustrates
how Manhattan-ring methods fail. In Figure 5.1(b), the sink group A is attracted by the
closest Manhattan ring. The sinks in A are erroneously guided toward the obstacle. The
sink group B and the related standard cells have heavy connections to the bottom macro
block. However, the two bottom Manhattan rings encourage the sinks in B to move away
from the center of B, which will likely increase signal-net wirelength significantly.
The second family of approaches performs clock-network synthesis using register lo-
89
A
Manhattan rings
(a) (b)
Figure 5.1: Two examples of Manhattan rings proposed in [64]. (a) Zero-skew Manhattan
rings driven by an H-tree. (b) Manhattan rings on the design with obstacles.
Obstacles are indicated by darker boxes, two sink groups (A, B) are repre-
sented as ellipses.
cations from intermediate placement results. Specific techniques [20, 103] often simplify
the structure of the clock network and bias the placement process to optimize such sim-
plified networks. However, clock trees generated by those techniques are not realistic and
very different from those generated by leading software. In the DCTB algorithm [103],
the essential parameters of clock network synthesis, such as sink capacitance and wire
capacitance/resistance, are ignored, and the cost function is derived by only considering
Manhattan length between sinks or nodes. The quick CTS algorithm in [20] relies on sim-
ple heuristic clustering for topology generation and is more simple-minded than standard
DME algorithms, which minimize wirelength with zero or bounded skew based on Elmore
delay. Furthermore, all previous work ignores the presence of routing obstacles, common
in modern IC designs, and this ignorance can undermine end results (Sections 5.4 and 5.6).
Previous publications that simplify clock-tree synthesis during placement [20, 103]
cluster clock trees and represent these clusters with bounding boxes to model clock net-
work reduction by placement objectives. Typically, registers are clustered at one or mul-
90
tiple levels based on the structure of the reference (simplified) clock tree, and bounding
boxes are created for each cluster. The experimental results of [20, 103] show that bound-
ing boxes are helpful for clock-net size reduction. However, we argue below that this
Bounding boxes are represented by fake nets during placement and are optimized to
reduce HPWL [47, 91]. The HPWL objective is relevant to placement because it estimates
the lengths of signal routes reasonably well. However, clock routing is very different from
signal-net routing and requires longer routes to ensure low skew. Therefore, HPWL does
not offer accurate estimates of clock-tree lengths. Figure 5.2 shows that reducing HPWL
of the clock net may increase the total length of the clock tree, demonstrating that the
The authors of [103] adapted MLAF to compensate for the drawback of MLBB. How-
ever, we show in Section 5.4.2 that MLAF offers only a partial solution to this problem.
(a) (b)
Figure 5.2: Bounding boxes of two partial ZST-DME clock trees. (a) HPWL of the bound-
ing box is (15+12)=27. The total wirelength of the inside clock tree is 32.
(b) HPWL is (10+10)=20 and the total wirelength of the clock tree is 35.
The clock-net wirelength of (b) is greater than (a) although the bounding-box
HPWL of (b) is notably smaller than (a) while the source-to-sink wirelength is
15 for all sinks.
91
5.3 Optimization objective
Let N be the set of signal nets, and let E be the set of clock-net edges. To optimize
clock networks in placement, we minimize the total switching power Psw , defined as the
Psw = PN + PE (V.1)
Given activity factors of signal nets and clock-net edges, the total signal-net and clock-net
Here, αni and αei are the respective signal-net and clock-edge activity factors, Cn and Ce
are the respective unit capacitance for signal and clock wires, V is the supply voltage, f is
the clock frequency, HP W Lni is the HPWL of net ni , and Lei is the Manhattan length of
edge ei . Activity factors of clock-net edges are required when multiple clock domains or
gated clocks are utilized for given designs, otherwise αei = 1 as clock edges switch every
clock cycle. The handling of gated clocks is discussed in Section 5.5 in more detail. If the
activity factors of signal nets are not available, the computation of total switching power
relies on clock-power ratio β, i.e., clock-net switching power divided by total switching
power. In this case, the average activity factor of signal-net αavg can be derived as
P
(1 − β) ei ∈E Lei Ce
αavg = P (V.4)
β ni ∈N HP W Lni Cn
αavg is utilized for the activity factors of all the signal nets.
92
Compared to the work in [103], where the main objective does not capture the power
of signal nets, our primary objective function (Formula V.1) captures both clock-net and
signal-net switching power. The objective function in [20] considers clock-net and signal-
net power, but their estimation of clock-net switching power relies on bounding boxes
that cannot accurately represent clock-net wirelength. However, our analysis of clock-net
power in Section 5.4.2 allows us to explicitly represent clock-net power in the primary ob-
jective. Thus, the optimization of our primary objective effectively decreases total switch-
prior work and reliably optimize large IC designs with numerous layout obstacles. Our
approach consists of two major phases: (i) virtual clock-tree synthesis, (ii) arboreal clock-
Our virtual clock-tree synthesis handles macro blocks as wiring obstacles and produces
obstacle-avoiding clock trees. The importance of utilizing obstacle-aware clock trees is il-
lustrated in Figure 5.3 (the contraction forces are described in Section 5.4.2). Clock-net
optimizations without obstacle handling pull clock sinks inside obstacles, which under-
Experimental results in [53] show that the difference in total capacitance between ini-
tial zero-skew DME trees (based on Elmore delay) and the final SPICE-optimized trees
93
(a) (b)
(c) (d)
Figure 5.3: An example of clock-net optimization with an obstacle. (a) The virtual clock
tree and corresponding contraction forces are created without considering the
obstacle. (b) The result of a placement iteration with the forces in (a). (c) The
obstacle is accounted during virtual clock-tree generation and when establish-
ing additional forces. (d) The result of (c).
is only 2.2% on average. Hence, initial trees produced by leading clock-network synthe-
sis tools offer reasonably accurate capacitance estimates. To quickly construct a virtual
clock-tree during placement, our methodology first performs traditional DME-based zero-
skew clock-tree synthesis with Elmore delay model, subject to obstacle avoidance. Several
techniques are known for this problem, including direct obstacle-avoiding clock-tree con-
struction [46] and incremental repair of obstacle-unaware trees [51]. Each approach can
be used in our methodology, but we found that incremental-repair techniques are simpler
and yet produce high-quality trees.1 Our clock trees target the 45 nm technology used at
94
5.4.2 Arboreal clock-net contraction force
If the virtual clock network connecting to current register locations faithfully repre-
sents a realistic clock network, then optimizing it directly should improve the final clock
network produced by a specialized CTS tool after placement is complete. To this end,
we extend force-directed placement with new, structurally-defined forces that seek to re-
duce individual edges of the virtual clock network. This technique communicates current
clock-tree structure to the placement algorithm, and also allows the structure to change
with placement.
Figure 5.4(a) illustrates a sample virtual clock tree. To reduce the length of e1 directly,
all sinks downstream from e1 can be moved in the direction of reducing the length of e1 .
For each downstream sink of e1 , a force vector needs to be assigned. The force vectors
e1
e2
(a) (b)
Figure 5.4: Two types of forces for clock-net optimization. Registers are indicated by
crosses. (a) For each edge, the corresponding downstream registers are given
force vectors. Right arrows are the force vectors for reducing e1 , and up arrows
are the force vectors for reducing e2 . (b) Virtual nodes are inserted (squares),
and forces are created between each pair of connected nodes (dotted lines).
95
The sum of magnitudes of force vectors induced by e1 (Fesum
i
) needs to be carefully
activity factors of clock edges differ (e.g., in gated clocks). Figure 5.4(a) illustrates force
vectors. The force from e1 is weaker than the force from e2 , Fe1 < Fe2 since the sum of
The main problem with this method is that the relative locations of branching nodes
from sinks are assumed to be same when the force vectors are created. However, opti-
mal relative locations of the branching nodes change during the optimization. Therefore,
placement iterations with fixed force vectors for sinks do not produce optimal locations.
To shorten clock wires, we propose an arboreal clock-net contraction force with virtual-
node insertion. Our approach creates forces between clock-tree nodes and structurally
transfer the forces down to registers. Virtual nodes represent branching nodes in the clock
tree and split the clock tree into individual edges, seen as different nets by the placement
algorithm. The virtual nodes have zero area and do not create overlap with real cells, so
they do not affect the spreading process in force-directed placers. Zero-area nodes may or
may not be allowed to overlap with obstacles (if such a node is placed over an obstacle,
its overlap has zero area). In our case, virtual nodes should not be placed over obstacles to
Compared to the fixed force vectors applied exclusively to sinks, our technique creates
forces between flexible nodes and each force seeks to reduce the length of the correspond-
ing clock edge. Unlike in the bounding-box based method, each force is integrated into
96
To reduce dynamic power consumption of the IC, contraction forces are calculated
based on the activity factors of the signal nets. When activity factors of signal nets are
αni
w ni = (V.6)
αavg
When activity factors of signal nets are not available, Equation V.4 is utilized to compute
αavg and wni = 1 for all signal nets. A two-pin net representing clock-net contraction
Ce αei
wei = (V.7)
Cn αavg
and the HPWL of a two-pin net from ei is equal to the Manhattan length of ei ,
Note that our primary objective function is a sum of signal-net and clock-net switching
power (Formula V.1). By combining Formulas V.2, V.3 and V.8, the total switching power
is expressed as
X X
αni HP W Lni Cn + αei HP W Lei Ce V 2 f (V.9)
ni ∈N ei ∈E
By substituting αni and αei in terms of wni and wei (Equations V.5, V.7), Equation V.9 can
be rewritten as
X X
αavg wni HP W Lni Cn + wei HP W Lei Cn V 2 f (V.10)
ni ∈N ei ∈E
97
Let K be αavg Cn V 2 f , M = N ∪ E. Then the total switching power of signal nets and
X
Psw = PN + PE = K wmi HP W Lmi (V.11)
mi ∈M
In other words, our techniques capture the switching-power minimization problem, which
can be solved by any high-quality wirelength-driven placer capable of net weighting. Fig-
ure 5.5 compares our technique and MLAF from [103]. MLAF is ineffective in shortening
clock nets that significantly differ from H-trees. Additionally, MLAF does not establish
exclusive forces that represent the edges between parents and children nodes. Instead,
bounding boxes (MLBB) are used with MLAF in [103]. We show in Section 5.2 that
bounding boxes cannot offer accurate estimates of clock-tree lengths. Also, the authors
of [103] did not explain how they assigned weights to MLAF but only hinted that the
magnitude of MLAF was similar to the magnitude of forces for signal nets. Detailed
comparison between our technique and MLAF is discussed in Section 5.6.2 (Table 5.5).
do not necessarily improve every tree edge, as shown in Figure 6. In Figure 5.6(a), five
edges are derived from a virtual obstacle-aware tree built as in Section 5.4.1. If we create
forces for all the edges, subsequent optimization will produce the tree in Figure 5.6(b).
The force f4 associated with edge e4 is rendered ineffective by the obstacle. Our force-
modification algorithm for obstacle avoidance detects these obstacle-detouring edges and
98
Lopper Lopper
(a) (b)
MLAF MLAF
(c) (d)
Figure 5.5: Comparison between our arboreal clock-net contraction force and MLAF of
[103]. (a) Arboreal clock-net contraction forces are generated. (b) The mod-
ified register and virtual clock-node locations when forces in (a) are utilized.
(c) The forces created by the MLAF algorithm. (d) The modified register and
virtual clock-node locations when forces in (c) are utilized. We can observe
that the edges between parents and children nodes are poorly handled for the
force creation in (c), and our method is more efficient on non H-tree structures
(which is common in modern designs).
e1
e2
e3
e4 e5
Figure 5.6: Obstacle-avoidance force. (a) Five edges of an obstacle-aware virtual clock
tree. (b) The result when all the edges are utilized for contraction forces. (c)
The result when e4 and e5 are excluded from force construction.
99
eliminates the contraction forces for them.2 In this example, e4 and e5 are excluded from
We integrate our techniques into SimPL, a flat, force-directed quadratic placer [47].
Recall that analytic placers first minimize a function of interconnect length, neglecting
overlaps between standard cells and macros. This initial step places many cells in densely
populated regions. Clock-net contraction forces are ineffective at this step for two reasons:
(i) the current virtual clock network may differ greatly from the final clock network, (ii)
the contraction forces may restrict the spreading of the registers at the center of the design
due to their high net weight. Therefore, our techniques are invoked between signal-net
At each iteration of Lopper, a new virtual clock tree is generated based on current
register locations. We discard the previous virtual clock tree based on the following obser-
vation. The topology of a clock tree and the embedding of its wires minimize (i) skew as
the primary objective, (ii) total wirelength as the secondary objective. When an iteration
of Lopper is performed, the locations of the registers are modified in order to reduce the
total wirelength of the given virtual clock tree. Since registers are displaced by different
amounts (due to different connectivities), keeping the previous clock-tree structure would
2
Consider a clock-tree edge that does not cross a given obstacle. The edge detours the obstacle if the
straight line connecting the ends of the edge crosses the obstacle.
100
Virtual
Initial Look-ahead
Clock Tree
Placement Legalization
Synthesis
Linear Hierarchical
Global Obstacle-avoidance
System
Placement Clock-net force
(Jacobi + CG)
No
Building
Legalization Objective
Virtual Nodes
Detail Place Converged?
Yes & Nets
Figure 5.7: Key steps of Lopper integrated into the SimPL placer, as indicated with darker
rounded boxes and a lozenge. Plain boxes represent the SimPL steps.
risk a large increase in skew. Therefore we regenerate the virtual clock tree for each iter-
ation to obtain an optimal virtual clock tree with the current register locations. The tree
topology typically undergoes only moderate changes, while branching nodes relocate to
reduce skew.
Early placement iterations may greatly displace the registers, suggesting that effective
clock-net wirelength reduction requires moving registers over the obstacles. In this case,
obstacle-aware virtual CTS and obstacle-avoidance force may undermine the potential
becomes small.
Global placement typically continues while HPWL continues improving, but clock-
tree reduction in Lopper requires a different convergence criterion. After each iteration,
total switching power is calculated and compared to previous values. Lopper is invoked
101
Legalization and detailed placement are applied after Lopper is complete. It is im-
portant to preserve the virtual nodes and two-pin nets that represent the clock-net con-
traction forces during detailed placement because detailed placement algorithms usually
optimize wirelength and would not have preserved clock-optimized register locations if
Quality control. Our techniques reduce the size of clock networks, but are likely to
increase signal-net wirelength. The activity factor of each signal-net αni or clock-power
ratio β are required for Lopper to reduce total switching power. However, even clock-
power ratio β is hard to estimate before the design is completed and can vary with various
clock-net and signal-net switching power can be easily controlled with a single parameter
β. This simple quality control allows an IC designer to achieve intended total switching
power of a chip without changing the algorithm or its internal parameters. Relevant trade-
Gated clocks and multiple clock domains are well-known and often the most effec-
tive techniques to reduce clock network power dissipation [73]. To extend our techniques
to gated clocks and multiple clock domains, each register si is given an activity factor αsi
and the activity factors are propagated through the tree. The activity factor of an edge
is the highest activity factor of its child edge or register (see Figure 5.8). Without clock
gating, all registers are given activity factors 1.0, which are propagated to all tree edges.
Once activity factors are propagated to tree edges in each clock tree, they are used to
102
calculate net weights that represent clock-net contraction forces in Equation V.7. Registers
that switch less frequently due to clock gating will be more affected by signal nets than
normal registers without clock gating. Our technique does not track the locations of gaters
assuming that the final clock tree and the gaters are constructed after register placement.
While we have not experimented with gater placement, we do not believe that it will affect
Clock gating
1.0
1.0 0.6
1.0 0.6
1.0 0.6
Figure 5.8: Activity-factor propagation for gated clocks. Registers are indicated with
crosses. Tree edges and registers are labeled with activity factors.
Flexible integration. Through the Lopper flow, forces for clock-net optimization are
represented in placement instances by virtual nodes and nets. No support for clock-net
any fast obstacle-aware clock-tree synthesis technique into any iterative high-performance
locations to satisfy timing constraints, while minimizing interconnect [43, Chapter 8].
prevent timing violations. Lopper can be integrated into timing-driven placers thanks to
103
placement increases weights of critical nets.3 To this end, Formula V.6 can be extended to
include the criticality cni of net ni which represents how important this net is to satisfying
timing constraints.
αni cni
w ni = (V.12)
αavg
If weights of critical nets exceed clock-net weights (i.e., wni > wei ), the wirelength of
critical nets is affected less when reducing clock-net wirelength. During the Lopper flow,
only the wirelengths of less critical nets increase to reduce clock-net switching power.
low nominal skew but also robustness to PVT variations are essential for building high-
quality clock networks. When making a clock network more robust, one uses large buffers
and/or redundant wiring, which increases total capacitance and dynamic power. Because
clock networks consume a large portion of total power, it is important to limit the max-
imum clock-network power. When clock-network power is limited by design, one of the
placement. Since our virtual clock-tree construction algorithm is variation sensitive4 and
integrated into our primary optimization objective during placement, our register place-
ment is sensitive to PVT variations. Extensive SPICE simulations confirm that the Lopper
3
This approach is used in our work to illustrate the compatibility of Lopper with timing-driven placement.
Many other approaches exist [43, Chapter 8].
4
When clock-network power is limited, the DME-based initial-tree construction algorithm in [53] is
variation sensitive in the sense that it generate low-skew trees with optimizing wirelength. When initial
wirelength of a clock tree is small, more optimizations for enhancing robustness are possible within given
power limit. Our virtual clock-tree construction algorithm is adapted from this variation-sensitive initial-tree
construction algorithm.
104
flow significantly improves robustness of clock trees when clock-network power is limited
ment exhibit the following problems: (1) Empirical validation of each existing publication
relies on one benchmark suite which is not utilized by any other work. Most of the bench-
marks are inaccessible to public, therefore comparisons to new techniques are impossible.
(2) The benchmark designs are based on unrealistically small placement instances. None
of the prior publications provide results on a design with more than 1M standard cells,
which is common in modern modern ASIC designs. (3) Macro blocks became essential
components, and many IC designs include more than hundreds of macros with fixed loca-
tions after floorplanning [6,43]. However, prior publications used the benchmarks without
macro blocks or ignored macro blocks present in the benchmarks [103]. (4) Reference
placement tools used for comparison are often outdated [64] or self-implemented [103].
In this section, we propose a new benchmark set that addresses the above pitfalls.
Our experimental results offer full comparisons with leading academic wirelength-driven
placers and a known technique for register placement (MLAF). The quality of register
105
5.6.1 Experimental setup
The ISPD 2005 placement contest benchmark suite is being used extensively in place-
ment research, and the academic community consistently advanced physical design tech-
niques using the ISPD‘05 benchmarks. These benchmarks are directly derived from indus-
trial ASIC designs, with circuit sizes ranging from 210K to 2.1M placeable objects. We
adapted eight designs from the ISPD‘05 benchmarks and created register lists in which
15% of standard cells are selected to be registers. We selected the number 15% based
on the industrial designs introduced in [20], where the average 14.65% of cells are regis-
ters. The largest benchmark has 327K registers. Fixed macro blocks are viewed as routing
and placement blockages during clock-network synthesis.5 Some macro blocks that create
routing dead space are slightly resized and/or repositioned to eliminate dead space (see
Figure 5.9). This modification is so small that the impact on density and timing is negligi-
ble. The benchmarks are mapped to the Nangate 45 nm open cell library [70] to facilitate
clock-network synthesis with parameters from ISPD 2010 CNS contest. The standard-cell
height (or row height) is set to 1.4 µm according to the 45 nm library.6 Clock-power ratio
β is set to 0.3 for clock network optimization during placement based on the industrial
circuits from [20], where clock power is responsible for 31.9% of total power on average.
For each circuit, the average activity factor of signal nets is calculated based on the signal-
net and clock-net wirelength of the placement produced by SimPL [47] using Formula V.4.
The unit-wire capacitances for signal-net and clock-net (Cn , Ce ) are set to 0.1 f F/µm, 0.2
5
When macro blocks act as placement blockages but routing is allowed above them, the load-capacitance-
aware obstacle-avoidance algorithm in [51] can be utilized to detour the clock-tree wires that (i) cross macro
blocks, (ii) but cannot be driven by the buffers outside macro blocks.
6
Unit length in the ISPD‘05 benchmark corresponds to approximately 117 nm in our benchmark set.
106
Name Cells Regs Macros CoreX CoreY Area
(mm) (mm) (mm2 )
clkad1 210K 32K 56 1.247 1.246 1.554
clkad2 255K 38K 177 1.640 1.638 2.686
clkad3 451K 68K 721 2.706 2.722 7.363
clkad4 494K 74K 1329 2.706 2.722 7.363
clkbb1 278K 42K 30 1.247 1.246 1.554
clkbb2 535K 84K 923 2.181 2.192 4.781
clkbb3 1095K 165K 666 3.231 3.242 10.47
clkbb4 2169K 327K 639 3.756 3.772 14.16
f F/µm respectively based on the 45 nm technology model from the ISPD‘10 contest [95]
and the Nangate open-cell library [70]. Supply voltage and clock frequency are set to 1.0V
and 2GHz. The coordinate of clock source is set to the bottom left corner of core area
except when it is blocked by macros. When the desired location is blocked, we move the
clock source to the closest unblocked coordinate. Since many academic placers handle the
(a) (b)
Figure 5.9: An example of routing dead space that can be found in the ISPD‘05 bench-
marks. (a) Routing dead space is created by enclosing macro blocks. (b) One
macro block is modified to open the space.
107
FAST P LACE 3 M PL6 S IM PL 101 S IM PL+L OPPER
Bench αavg ClkWL HPWL Pwr ClkWL HPWL Pwr ClkWL HPWL Pwr ClkWL HPWL Pwr
(mm) (m) (mW ) (mm) (m) (mW ) (mm) (m) (mW ) (mm) (m) (mW ) (min)
clkad1 0.11 214.7 9.12 285.5 248.2 9.09 298.3 209.1 8.97 279.9 152.3 9.23 263.0 3.6
clkad2 0.10 236.2 10.9 310.1 267.0 10.7 318.9 223.1 10.5 297.6 161.0 10.8 278.4 5.5
clkad3 0.09 469.3 25.0 640.8 467.6 25.0 640.8 468.5 24.0 624.7 326.9 24.9 583.0 11
clkad4 0.11 540.9 23.1 732.9 615.6 22.6 751.6 519.4 21.7 692.6 354.4 22.3 640.4 12
clkbb1 0.10 250.5 11.2 323.6 245.1 11.3 322.5 238.2 11.2 317.6 166.3 11.5 295.7 5.2
clkbb2 0.15 539.2 18.1 752.6 514.1 17.8 733.6 533.2 16.8 710.9 371.2 17.3 661.4 13
clkbb3 0.10 892.6 42.7 1236 1032 40.2 1240 866.3 39.2 1155 602.2 41.0 1085 30
clkbb4 0.09 1907 97.3 2575 2119 96.8 2650 1855 93.0 2473 1266 95.2 2279 87
Avg 1.03× 1.05× 1.04× 1.11× 1.03× 1.06× 1.00× 1.00× 1.00× 0.70× 1.03× 0.93× 1.8×
Table 5.2: Results on the CLKISPD‘05 benchmark suite. ClkWL represents total wire-
length of a clock network synthesized by the initial phase of Contango 2.0
[53]. HPWL is total HPWL of signal nets. Pwr is total net-switching power.
SimPL+Lopper is 4.16× faster than mPL6 and 1.51×, 1.81× slower than Fast-
Place3, SimPL respectively.
The quality of clock networks based on the final register locations of each placer is
evaluated by Contango 2.0 [53]. Contango 2.0 is the winner of the ISPD 2009 and 2010
Clock Network Synthesis (CNS) contests and produces clock trees with less than 7.5 ps
skew in the presence of variation on the ISPD‘10 CNS benchmarks. During our experi-
ments in Section 5.6.2 , we exclude SPICE-accurate tuning in Contango 2.0 for two rea-
sons: (1) the designs from the ISPD‘05 benchmarks are too large to run SPICE simu-
lations, (2) the average added capacitance during the SPICE-driven optimization on the
ISPD‘10 CNS benchmarks is 2.2% of total clock-net capacitance (including sink, wire
and buffer capacitance), suggesting that the initial trees optimized for Elmore delay pro-
results for clock trees with SPICE-driven optimizations on the modified benchmark set.
Insertion delay and skew from SPICE simulations are reported as well as total capaci-
tance of optimized clock trees with driving buffers. We also present robustness analysis
of different register placements when total capacitance is limited for clock networks in the
presence of variations.
108
5.6.2 Empirical results
Table 5.2 compares results of our methodology to the leading academic placers on the
CLKISPD‘05 benchmarks. The results of SimPL [47] are used as reference for compar-
ison. αavg is computed for each benchmark based on the given β = 0.3 as described in
Section 5.6.1, and total wire-switching power is calculated based on αavg . Power con-
sumed inside macro blocks is ignored since it cannot be optimized during placement and
is not available in original ISPD benchmark data. On average, the combination of SimPL
and Lopper reduces total clock-tree length by 30.0%, total wire-switching power by 6.8%
while the total HPWL of the signal nets only increases by 3.1% compared to SimPL. Com-
pared to FastPlace3 [100] and mPL6 [15], our methodology reduces the total clock-net
while the total signal-net HPWL is smaller than that produced by FastPlace3 by 1.4% and
very similar to that produced by mPL6. Our methodology shows consistent improvement
for the benchmarks considered, with various configurations of macro blocks. Figure 5.10
compares two clock trees based on register placements from SimPL and our method.
To further study the relative significance of clock-power ratio β, we show in Table 5.3
the impact of varying β on the benchmark clkad1. The average activity factor of signal nets
αavg is computed based on the reference layout and utilized for computing the total wire-
switching power. The performance of Lopper is improved when clock networks consume
a greater portion of total power. Table 5.3 also shows that reducing clock networks does
not necessarily reduce the total switching power. For example, the result for β = 0.6
consumes 109.6 mW for total wire-switching power, but if the same circuit is used for the
109
β αavg Orig. P ClkWL HPWL Pwr
(mW ) (mm) (m) (mW ) (Rel)
Orig - - 209.1 8.968 - -
0.1 0.420 837.0 184.2 9.073 835.8 0.999
0.15 0.264 557.2 173.5 9.128 551.3 0.990
0.2 0.187 419.1 165.7 9.188 409.9 0.978
0.25 0.140 334.8 158.0 9.225 321.5 0.960
0.3 0.109 279.9 152.3 9.233 262.2 0.939
0.35 0.087 239.7 151.0 9.280 221.9 0.925
0.4 0.070 209.2 144.8 9.305 188.2 0.900
0.45 0.057 185.9 144.5 9.316 164.0 0.882
0.5 0.047 168.0 139.5 9.342 143.6 0.854
0.55 0.038 151.8 135.7 9.343 125.3 0.826
0.6 0.031 139.3 128.0 9.425 109.6 0.787
Table 5.3: The results on clkad1 with various clock power ratios β. The specifications
of the reference placement produced by SimPL are in the row Orig. αavg is
calculated based on β and reference placement produced by SimPL. Total wire-
switching power values of the reference placement with the corresponding β
are represented in the column Orig. P. The relative power ratios are indicated
with Rel.
applications with β = 0.1, the total wire-switching power computed by Equations V.1 -
V.3 is 842.9 mW , which is greater than the switching power of the reference placement
837.0 mW . This implies that clock-net optimization must utilize activity factors of signal
Table 5.4 shows the impact of obstacle-aware virtual clock trees (OAVCT) and obstacle
avoidance forces (OAF). When OAVCT is excluded, DME trees without obstacle handling
are utilized for the remaining flow. The results indicate that 9.5% of clock-net wirelength
is reduced on benchmarks with very few obstacles such as clkbb1 where a few obstacles
exist at the top left corner of the chip. Obstacle-avoidance forces reduce clock-net length
Table 5.5 compares results of our technique to the technique called MLAF on MLBB
[103]. We re-implemented their MLAF algorithm and integrated it into the SimPL placer
110
[47] instead of the FDP framework [101] they utilized. Since their DCTB algorithm can-
not process obstacles, our obstacle-aware virtual clock-tree generation algorithm in Sec-
tion 5.4.1 is utilized for the MLAF algorithm. In terms of clock-net wirelength and net-
switching power, the average gain from the MLAF technique is limited by 43.5%, 30.6%
of the improvement of our technique respectively, which means that our arboreal clock-
net contraction force is 3.3× more effective for switching-power reduction than MLAF.
Our comparison to MLAF concludes that explicit and structural representation of clock-
net force and an accurate weighting function are important to achieve competitive register
placement.
Orig. Flow w/o OAVCT w/o OAF
Bench ClkWL Pwr ClkWL Pwr ClkWL Pwr
(mm) (mW ) (mm) (mW ) (mm) (mW )
clkad1 152.3 263.0 165.7 267.8 158.5 265.3
clkad2 161.0 278.4 170.9 285.5 163.7 278.7
clkad3 326.9 583.0 362.1 595.1 340.8 587.4
clkad4 354.4 640.4 403.1 657.2 379.8 649.4
clkbb1 166.3 295.7 172.6 297.4 169.1 296.4
clkbb2 371.2 661.4 411.2 673.8 389.9 666.7
clkbb3 602.2 1085 663.1 1104 627.2 1093
clkbb4 1266 2279 1412 2331 1328 2102
Avg 1.0× 1.0× +9.5% +1.8% +4.1% +0.7%
Table 5.4: Impact of excluding obstacle-aware virtual clock trees (OAVCT), obstacle
avoidance forces (OAF). OAVCT and OAF are excluded in the columns under
“w/o OAVCT”. Only OAF is removed in “w/o OAF”
Insertion delay and skew are important metrics when evaluating the quality of clock
trees. Accurate analysis of these metrics requires SPICE simulations. The CLKISPD‘05
benchmark set has up to 327K registers in one benchmark and it is impractical to per-
accurate clock trees, we decrease the number of registers in the CLKISPD‘05 benchmarks
111
S IM PL+MLAF
Bench ClkWL HPWL Pwr
(mm) (m) (mW )
clkad1 182.4 (46.9%) 9.194 (85.3%) 274.2 (33.7%)
clkad2 200.9 (35.8%) 10.76 (76.2%) 293.0 (24.0%)
clkad3 402.5 (46.6%) 24.71 (76.9%) 609.8 (35.7%)
clkad4 449.5 (42.4%) 22.24 (86.9%) 676.6 (30.7%)
clkbb1 203.8 (47.9%) 11.48 (84.9%) 309.7 (36.1%)
clkbb2 473.8 (36.7%) 17.16 (80.0%) 699.3 (23.4%)
clkbb3 743.5 (46.5%) 40.81 (91.0%) 1139 (22.9%)
clkbb4 1587 (45.5%) 94.77 (80.2%) 2399 (38.1%)
Avg 0.87× (43.5%) 1.03× (82.7%) 0.98× (30.6%)
Table 5.5: Results of the MLAF technique integrated into SimPL with comparison to our
technique. Average results are compared to the results for SimPL in Table 5.2.
The numbers in parentheses represent the amount of reduction(ClkWL, Pwr)
[increase(HPWL)] assuming 100% reduction [increase] for our technique. For
example, [209.1(SimPL) - 182.4(MLAF)] / [209.1(SimPL) - 152.3(Lopper)] =
46.9%.
FAST P LACE 3 M PL6 S IM PL 101 S IM PL+L OPPER
Bench Regs Ins. D. Skew Cap. Ins. D. Skew Cap Ins. D. Skew Cap. Ins. D. Skew Cap.
(ps) (ps) (pF ) (ps) (ps) (pF ) (ps) (ps) (pF ) (ps) (ps) (pF )
clkad1 s 2114 386.3 2.702 29.90 388.2 2.629 34.09 381.7 3.730 26.61 369.0 0.962 16.70
clkad2 s 2550 406.8 3.329 35.95 414.2 2.471 36.24 405.8 3.152 31.52 402.9 2.163 20.53
clkad3 s 4516 453.7 3.921 68.08 468.0 3.642 77.36 453.4 3.770 65.95 452.3 2.208 45.00
clkad4 s 4960 455.4 3.502 77.84 470.0 2.065 86.28 454.8 3.568 71.66 446.6 2.216 44.44
clkbb1 s 2781 385.0 2.711 29.86 387.0 2.139 35.61 386.1 5.166 29.10 385.9 1.754 18.35
clkbb2 s 5578 445.1 8.153 78.34 444.4 5.984 83.77 444.4 1.876 75.18 431.4 2.537 47.18
clkbb3 s 10968 489.8 3.118 125.7 513.0 7.140 150.2 497.6 2.796 120.7 494.2 2.286 77.75
clkbb4 s 21773 523.9 2.318 268.6 522.7 2.956 284.0 523.5 3.511 250.8 510.9 2.852 156.0
Avg 1.02× 1.83× 1.68× 1.03× 1.77× 1.87× 1.02× 1.85× 1.57× 1.00× 1.00× 1.00×
and invoke Contango 2.0 on various register placements. The experimental results with
We adapted wire and buffer libraries from the ISPD 2010 CNS benchmarks. The con-
cept of local skew defined in [53] is utilized to calculate exact skew with the local skew
distance limit 600µm. Unlike Table 5.2 that reports dynamic power based on only wire
capacitance, the total capacitance including driving buffers is reported in Table 5.6. Same
112
buffering scheme is utilized for all clock trees in this table. The Lopper flow produces
high-quality register placement and it is already shown in Table 5.2 that the clock-network
wirelength is significantly smaller than other methods. Due to compact clock-net wire-
length, fewer driving buffers are required for clock-tree synthesis on our register place-
ment, hence the total clock-network capacitance including buffers is 57%∼87% less than
Table 5.6 also shows that insertion delay based on our method is 1.6%∼3.3% smaller
than other methods. Since insertion delay depends on the path length from the clock
source to sinks, the size of layout area is closely related to insertion delay. As shown
in Figure 5.10, the wirelength of clock trees is reduced mainly near leaves, hence the
clock trees spend a lot of power near the leaves. The results show that Contango 2.0 can
reduce nominal skew down to 10 ps for any register placement. The quality of nominal
skew is not highly related to register placement but it depends on optimization performed
during CNS. However, compact clock trees are easier to tune therefore average nominal
Table 5.7 shows how the quality of register placement affects robustness of clock net-
works in the presence of variations. Since robustness analysis requires extensive SPICE
simulations, we rebuilt the CLKISPD‘05 benchmarks with fewer registers, so that we can
running Contango 2.0, and clock trees are optimized to be as robust as possible within this
7
When the clock source is at a corner of chip area, often 30%∼50% of insertion delay is due to the tree
trunk (the wire that connects the clock source and the root node of the clock tree [51]) and the length of the
trunk is largely unaffected by register placement in practice.
113
S IM PL 101 S IM PL+L OPPER
Bench Regs Cap. Nom. Mean Yield Nom. Mean Yield
(pF ) (ps) (ps) (%) (ps) (ps) (%)
clkad1 v 1057 23 1.855 6.634 72.4 2.177 4.298 99.2
clkad2 v 1275 28 1.113 5.771 90.4 3.265 5.291 89.8
clkad3 v 1354 46 1.132 7.281 60.4 2.673 6.615 72.4
clkad4 v 1488 50 2.010 8.007 43.2 1.610 6.094 82.0
clkbb1 v 1390 26 1.014 6.342 78.8 0.994 5.104 94.6
clkbb2 v 1115 39 2.208 7.674 49.0 1.338 5.912 86.6
clkbb3 v 1096 49 1.532 7.246 58.4 1.379 6.071 82.4
clkbb4 v 1088 67 5.226 9.607 15.4 4.207 7.342 57.8
Avg 2.011 7.320 58.5 2.205 5.841 83.1
Table 5.7: Results of SPICE simulations in the presence of variations. Regs represents the
number of registers in each benchmark. Cap. represents the capacitance limit
for clock networks. Nom. represents nominal skew without variation and Mean
is average skew with variation. Yield represents the percentage of acceptable
results with given skew limit 7.5 ps.
limit. After building clock trees, we first measure the nominal skew of clock trees without
variations. Then we run extensive Monte-Carlo simulations with variations to estimate the
impact of variations on clock-tree circuits. The variation model from the ISPD 2010 CNS
Our register placement leads to more compact clock trees than other methods, and
robustness is further enhanced by Contango 2.0. The results show that the clock trees
based on our techniques offer 24.6% greater yield than the clock trees based on simPL
5.7 Summary
placement algorithms only optimize signal-net switching power and ignore clock-network
switching responsible for over 30% of total power. We propose new techniques and a
methodology to optimize total dynamic power during placement for large IC designs with
114
macro blocks. To this end, we advocate obstacle-aware virtual clock-tree synthesis, an ar-
boreal clock-net contraction force with virtual nodes that can handle gated clocks, and an
obstacle-avoidance force for clock edges. Our methodology is integrated into the SimPL
placer [47], and the total switching power is measured by utilizing Contango 2.0 [53] —
both programs are leading academic software. A new set of 45 nm benchmarks is pro-
posed to better represent modern IC designs. Experimental results show that our method
lowers the overall dynamic power by significantly reducing clock-net switching power.
Other benefits of our optimizations include smaller insertion delay in clock trees, dimin-
115
Figure 5.10: Clock trees for clkad1, based on a SimPL register placement (top) and pro-
duced by proposed techniques (bottom). The respective clock-tree wire-
lengths based on SimPL and our method are 209.13 mm and 152.27 mm.
The total switching power of SimPL and our method are 279.9 mW and
263.0 mW respectively.
116
CHAPTER VI
competition between the two. Trees require much smaller capacitance, but meshes are nat-
urally robust against process variation and can accommodate late design changes. Cross-
link insertion has been advocated to make trees more robust, but is limited in practice
to short distances. In this chapter we develop a novel non-tree topology that fuses sev-
eral clock trees to create large-scale redundancy in a clock network. Empirical validation
shows that this clock-network structure incrementally enhances robustness to satisfy given
variation constraints. Our implementation called Contango 3.0 produces robust clock net-
works even for challenging skew limits, without parallel buffering used by other imple-
mentations. It also offers a fine trade-off between power and robustness, increasing the
capacitance of the initial tree by less than 60%, which results in 2.3× greater power effi-
In this section, we develop statistical models for delay and skew in RC-buffered clock
117
6.1.1 Impact of variation on delay
In the presence of PVT variations, the delay of a buffered path p can be treated as
a random variable Dp whose mean dp is the nominal delay. Given that tree-like clock
networks entail long paths without significant reconvergence, path delay can be modeled
by Gaussian variables:1
Given n parallel paths from a to b, we tune nominal path delays using existing methods
[51, 53] to bring their difference under 10 ps. We also size the drivers, that jointly drive
the sink, to have similar strength. Under these circumstances, the random variable of path
n
X n
X
Dp(a,b) = Dpi /n, dp(a,b) = dpi /n (VI.4)
i=1 i=1
This averaging is illustrated in Figure 6.10. Detailed analysis is given in Section 6.4.2 and
Table 6.6.
n
X n−1 X
X n
2
σp(a,b) = σp2i /n2 +2 σpi σpj ρ(pi ,pj ) /n2 (VI.5)
i=1 i=1 j=i+1
118
s
pw2
s s pa1 pb1
pw3
pa2 pb2
pa pb pa1 pw2 pb1 pa3 pb3
Figure 6.1: Simple clock networks with source node s, two sink nodes a and b. All paths
are considered buffered. (a) a tree, (b) redundant paths. (c) n multilevel paths
for each sink. Each i-th (2 ≤ i) new root-to-sink path consists of a shared pwi
section and a pai or pbi section that is not shared.
Example III.1 Consider the case n = 2, σp21 = σp22 = 10 and ρ(p1 ,p2) =0.1. Then σp(a,b)
2
=
5.5, which reduces standard deviation by about 26% compared to a single path. Thus,
having multiple paths reduces the impact of PVT variation compared to a single path.
Let s be a source node and a, b be two sink nodes. Nominal skew (without variation)
is defined as
We define total signed skew (with variation), mean signed skew, variational signed skew,
signed skew variance, total absolute skew and variational absolute skew as follows.
∗ 2 ∗ 2
S(a,b) = S(a,b) − S̄(a,b) , σs(a,b) = E[S(a,b) ] (VI.8)
119
Since S(a,b) has a normal distribution, skew (a,b) has a folded normal distribution.2 Figure
6.2 shows empirical distributions for signed and absolute skew of two example sink pairs
from the experimental results in Section 6.4.2. The mean and variance of skew (a,b) can be
p
2/πexp(−µ2 /2σ + µ 1 − 2Φ(−µ/σ)
E[skew (a,b) ] = σ (VI.10)
p
var[skew (a,b) ] = µ2 + σ 2 − {σ 2/πexp(−µ2 /2σ 2 )
+ µ 1 − 2Φ(−µ/σ) }2
(VI.11)
where Φ(·) denotes the cumulative distribution function of a standard normal distribution.
p
Expected skew : E[skew (a,b) ] = σs(a,b) 2/π (VI.12)
2
Skew variance : var[skew(a,b) ] = σs(a,b) (1 − 2/π) (VI.13)
Note that mean absolute skew can be positive with zero nominal skew.
This suggests that we can use signed skew as a proxy for the analysis of absolute skew. In
other words, we can obtain the yield of skew (P[skew (a,b) < x]) by examining the yield of
signed skew (P[−x < S(a,b) < x]). In this section, we analyze the impact of variation on
signed skew because of mathematically simpler analysis. Figure 6.1a illustrates a simple
2
Given a normally distributed random variable X = N (µ, σ), the variable |X| follows the folded normal
distribution [58].
120
(a) (b)
Figure 6.2: Empirical distributions of signed and absolute skew of two example sink pairs.
The data are collected from Monte-Carlo simulations with variations. (a) Sink
pair with nominal skew 0.3 ps. (b) Sink pair with nominal skew 1.2 ps.
clock tree with one path per sink. In this case, the skew variance is
2
σs(a,b) = var(S(a,b) ) = σp2a + σp2b − 2σpa σpb ρ(pa ,pb ) (VI.15)
We extend this analysis to clock networks with multiple paths for each sink node, as
illustrated in Figure 6.1b. pw2 is the shared path and pa2 , pb2 connect the shared path to the
sinks a and b. From the multiple-path delay variation model from Section 6.1.1, we obtain
Dp(s,a) = Dpa1 + (Dpw2 + Dpa2 ) /2 (VI.16)
Dp(s,b) = Dpb1 + (Dpw2 + Dpb2 ) /2 (VI.17)
S(a,b) = (Dpa1 + Dpa2 ) − (Dpb1 + Dpb2 ) /2 (VI.18)
2
σs(a,b) = (σp2a1 + σp2a2 + σp2b1 + σp2b2 )/4
121
Example III.2 Consider the clock tree in Figure 6.1a with σp2a = σp2b = 50 and ρ = 0.
2
Then from Formula VI.15, σs(a,b) = 50 + 50 = 100. We assume the variation constraint to
be 15 ps with yield 95% (i.e., P[−15 < S(a,b) < 15] > 95%, Figure 6.3). However, with
The current tree structure does not satisfy the given variation constraint. In this case, we
95%
-15ps 15ps
can insert a new subtree and fuse it to the original tree to enhance robustness.
Example III.3 Consider adding a subtree with three paths (pw2 , pa2 , pb2 ) to Figure 6.1a
and build a fusion topology as in Figure 6.1b with σp2w 2 = σp2a2 = σp2b2 = 25. From
2
Formula VI.19, σs(a,b) reduces down to 37.5. Now the probability becomes
We generalize the above analysis to clock networks with n > 1 redundant paths per
122
16 100 7
Capapcitance (Rel.)
12 90
5
10
Yield (%)
80 4
8
6 70 3
4 2
60
2 1
0 50 0
0 2 4 6 8 10 12 0 2 4 6 8 10 12 0 2 4 6 8 10 12
# of multiple paths # of multiple paths # of multiple paths
Figure 6.4: The impact of redundant paths for a pair of critical sinks (Figure 6.1c) on
clock-network parameters, based on Formulas VI.25, VI.27 and VI.28. The
skew constraint and ρ are set to 10 ps and 0.1 respectively. (a) Standard devia-
tion. (b) Yield. (c) Relative total capacitance of each clock network compared
to the total capacitance of the clock tree without redundant paths (n = 1).
n
X
Dp(s,b) = Dpb1 + (Dpwi + Dpbi ) /n (VI.23)
i=2
n
X
S(a,b) = (Dpai − Dpbi ) /n (VI.24)
i=1
n
X
2
σs(a,b) = (σp2ai + σp2bi )/n2
i=1
n−1 X
X n
+2 (σpai σpaj ρ(pai ,paj ) + σpbi σpbj ρ(pbi ,pbj ) )/n2
i=1 j=i+1
n X
X n
−2 (σpai σpbj ρ(pai ,pbj ) )/n2 (VI.25)
i=1 j=1
In the case when σpai = σpbi = σ and all ρ values are equal,
2
σs(a,b) = 2σ 2 (1 − ρ)/n (VI.26)
Just as in Formula VI.15, highly correlated path delays lead to small skew variance.
Example III.4 Figure 6.4 illustrates how n redundant paths for each sink (as in Figure
123
6.1c) reduce σs(a,b) and increase yield (based on Formula VI.25). Here we assume
In practice, we select eligible sinks a and b (see Section 2.1) that maximize initial
2
σs(a,b) . Thus ρ(pa1 ,pb1 ) will be small, but, for additional redundant paths, ρ(pai ,pbi ) will be
greater, especially when a and b are located close to each other. These paths are added so
that ρ(pai ,paj ) and ρ(pbi ,pbj ) remain small. The same statistical analysis applies to process,
Given a clock network Ψ and a skew limit x, let E be a set of eligible sink pairs of Ψ.
We define the following parameters (skew, nominal skew, mean skew and yield) for the
124
Let σ be the standard deviation of the most critical sink pair in Ψ (i.e., σ = max(a,b)∈E σs(a,b) ).
If skewΨ ≫ σ, then skewΨ and yieldΨ are significantly affected by skew Ψ (nominal skew
of a clock network). However, when skew Ψ ≪ σ, the clock-network’s yield is closely re-
lated to the yields of critical sink pairs (see Section 6.2.1). Our methodology invokes nom-
inal skew optimizations to satisfy skewΨ ≪ σ (see Section 6.3). Therefore our proposed
methods in Sections 6.2 and 6.3 for enhancing robustness of critical sink pairs effectively
Analysis in Section 6.1 suggests that one can reduce the impact of variation on clock
skew by driving critical sinks through multiple redundant paths. To generalize, we pro-
pose a novel family of clock-network structures, called fused multilevel trees, which main-
of variation on skew between eligible sink pairs. Using models from Section 6.1.2, we
can determine the variance and standard deviation of skew between each sink pair and
detect critical sink pairs that are not robust enough with respect to given timing constraints.
Eligible sink pairs are often geometrically close (or placed within the local skew distance
limit in ISPD10 CNS benchmarks). However, they can be distant in the tree, i.e., the
shortest tree-path connecting them can traverse many tree edges. These sinks are included
125
in the set of critical sink pairs after variational analysis because the impact of variations
Once we find all critical sink pairs, we cluster them based on their least common
ancestors (LCA) in the tree. The pairs that share LCA are clustered, and a set of sinks is
formed as the union of the sink pairs in the cluster. The LCA plays the role of the clock
source for a new auxiliary tree that connects to the sinks in a given set. Here we use the
The nominal delays of multiple redundant paths from the clock source to each critical
sink must be carefully synchronized in order to reduce nominal skew in the fused topology.
This process is discussed in detail in Section 6.3.2. Figure 6.5 illustrates detection of
critical sink pairs and the addition of auxiliary trees to enhance robustness.
least common
ancestor
a b a b
(a) (b)
Figure 6.5: (a) A critical sink pair is indicated by a red oval and the LCA of two sinks is
shown. (b) Corresponding subtree for the sink cluster in (a).
After auxiliary trees are constructed and fused, we analyze the impact of variation on
skew of eligible sink pairs again. Since there are multiple paths to some sinks, we uti-
lize variation modeling from Section 6.1.3. If some critical sinks remain, we construct
another round of auxiliary trees and fuse them into the main network to enhance robust-
126
ness. This robustness evaluation and tree construction/fusion process is repeated until we
cannot find a critical sink pair anymore. The success of our iterative fusion process criti-
tuning. If implemented correctly, every fusion iteration significantly reduces the number
of critical sinks (Tables 6.4 and 6.5), but if path synchronization fails, this improvement
is not guaranteed. Figure 6.6 illustrates the proposed methodology including initial tree
The new clock-network structure is a joint of several trees that provides multiple re-
dundant paths, helping to improve network robustness and satisfy skew constraints. Such
a clock network exhibits the redundancy and robustness of a mesh but is easier to analyze
and optimize. Our results in Section 6.4.2 shows that fusion topologies can be essentially
Multilevel tree fusion topology is technically not a tree structure because of intercon-
nect loops. However, those loops always close at the sink nodes, which makes it easy to
reduce not only the complexity of variational analysis but also nominal skew by various
tree-based skew optimization techniques. Section 6.3.2 outlines the use of tree optimiza-
127
Figure 6.6: Illustration of multilevel tree fusion on ispd10cns02. (a) Initial tree construc-
tion. (b) Critical sink pairs are connected by red lines. (c) Auxiliary trees are
fused in to enhance robustness.
128
Initial-tree Variational
Unsat. Aux-trees
Construction
Construction Analysis and Fusion
Satisfied
Figure 6.7: Key steps of multilevel tree fusion. Proposed techniques are indicated with
darker rounded boxes and a lozenge. Plain boxes represent techniques adapted
from earlier publications.
After initial-tree construction [51, 53], we perform variational analysis based on the
methods in Section 6.1.2 and build fusion topology to enhance robustness. For pre-
cise variational analysis, it is important to estimate Gaussian random variables for each
buffered path. For accurate estimation of random variables, we build various test trees for
given technology node, buffer and wire library and variation environment. Then we per-
form Monte-Carlo simulations with variation and record the variance of each buffered path
in a look-up table. It is not necessary to record the mean of each random variable because
our experimental results show that E[X] is nearly zero for all cases. The table is accessed
by wirelength w and buffer count b to estimate the impact of variation on a buffered path
with wirelength w and b buffers. Finally, the table is used to produce a least-squares fit F.
129
With a Gaussian estimate of path delay, we analyze the impact of variation on eligible sink
Since the initial and auxiliary trees are built using Elmore delay, they need to be tuned
except for loops that close at critical sinks. To leverage the efficiency of existing tree-
optimization techniques, we propose to split (clone) each critical sink and distribute its
input capacitance among the resulting splinter sinks, as illustrated in Figure 6.8. Once
splinter sinks are generated, there is no metal loop and our clock network becomes a tree,
nominal delays of multiple paths ending at the same sink, and then equalize them using
tree-tuning techniques.
a b a1 a2 b1 b2
(a) (b)
Figure 6.8: (a) Multiple paths from clock source to sinks a and b. (b) Splinter sinks are
generated to utilize tree optimization algorithms.
reduce nominal skew measured by SPICE simulations. During SPICE-driven skew opti-
130
After nominal skew optimization, in the context of splinter sinks, the average nominal
skew drops below 4 ps on the ISPD 2010 CNS benchmarks. We merge splinter sinks
to recover the fusion topology structure, at which point sink latencies may change and
nominal skew may worsen. However, our experiments show that this deterioration can be
limited to 2 ps in the worst case.3 The average nominal skew of fusion topologies on the
Our empirical evaluation of multilevel tree fusion focuses on total capacitance and
robustness to variations. We use ISPD 2010 CNS benchmarks but enhance their buffer
ISPD 2010 CNS benchmarks are based on microprocessor designs from IBM and Intel
and use a 45 nm technology library. Each benchmark is given a local-skew limit and local
skew distance bound. Result are evaluated by 500 Monte-Carlo simulations with a given
variation model, with respect to a given yield constraint. ISPD 2010 benchmarks suf-
fer from a recognized deficiency in the modeling of numerous parallel buffers (that may
or may not appear in the clock network), which underestimates electrical parasitics and
power overhead. Process variations are not spatially correlated, making parallel buffers
completely independent and underestimating the impact of process variations. These de-
ficiencies encourage unrealistic clock-network configurations. To this end, the best pub-
3
It is important to note that the number of splinter sinks for a given sink may increase by at most one
during each fusion iteration. This significantly simplifies delay synchronization for redundant paths.
131
lished results for the ISPD 2010 benchmarks [67] seem to require the stacking of numerous
inverters in a unique 10+40 configuration. The authors attribute the quality of results to a
new cross-link insertion technique, but do not report results without cross-link insertion to
substantiate this claim. Results in [53] report even smaller skews but greater capacitance,
but the authors also stack numerous (32) small inverters in parallel.
Table 6.1: Results of clock trees on ispd10cns05 with parallel buffering. Local skew limit
is 7.5 ps as in the ISPD 2010 benchmarks. The statistics of nominal skew, total
skew are reported based on Monte-Carlo simulations. For each tree, we report
its mean, standard deviation (σ), as well as yield for a given skew limit. ‘95%’
column represents the worst local skew for 95% yield.
Table 6.1 illustrates how one can reduce the impact of process variation by only using
excessive parallel buffers without any structural modification. It shows that competitive
results on the ISPD 2010 benchmarks can be easily achieved by stacking only 16 small
inverters in parallel.
of the ISPD 2010 benchmarks. First, instead of the ISPD 2010 buffer library that exhibits
uniformly-distributed variation, we use a buffer type with Gaussian variation. Table 6.2
compares buffers used in the ISPD 2010 benchmarks and in this work.
By essentially clustering a reasonable number of small ISPD buffers into one large
ous publications, we limit our empirical validation to a single wire type to illustrate that
132
buffer in out out distribut’n parallel
type cap cap res of proc. σ buffers
(f F ) (f F ) Ω variation (V ) allowed
ispd10b1 35 80 61.2 uniform 0.043 yes
ispd10b2 4.2 6.1 440 uniform 0.043 yes
our work 33.6 48.8 55 Gaussian 0.015 no
Table 6.2: Comparison of buffer types. ispd10b1 and ispd10b2 are two buffer types in
ISPD 2010 CNS benchmarks. The large buffer utilized in this work has Gaus-
sian variation and parallel buffering is not allowed. The buffer type in this work
is intended to represent a composite buffer made from 8 ispd10b2 buffers, but
in a way that would prevent modeling constituent buffers as experiencing inde-
pendent PVT variation.
proposed multilevel tree fusion can still produce high-quality clock networks. We also
note that spatially-correlated variation is only responsible for a fraction of total variation,
whereas random variation also makes a significant contribution. Thus, our experimental
setup is pessimistic and serves to show that our proposed technique can achieve strong
results even in adverse circumstances. Using one buffer type for clock-network synthesis
also restricts the flexibility to allocate driver strength throughout the clock network. We
use this limitation as a handicap in our experiments to highlight the strength of multilevel
tree fusion.
Figure 6.9a shows the impact of variations on a buffered path with our buffer type.
The path is buffered by four inverters. Monte-Carlo simulations based on the variation
model from the ISPD 2010 benchmarks are performed. The experiment validates our
variation modeling in Section 6.1, in which the delay of a buffered path is modeled by
technology (Figure 6.9b). The 45 nm low-power process from STMicro is utilized for the
experiment. The result shows that the delay of a buffered path on an industrial technology
133
(a) (b)
Figure 6.9: Impact of variations on a buffered path. The path is 2 mm long with 30 f F
load capacitance at the end and buffered by 4 inverters. (a) The 45 nm technol-
ogy, variation model from the ISPD 2010 benchmarks and a buffer type used
in our work are utilized. (b) The 45 nm low-power technology, buffer library
and variation model from STMicro are utilized.
Table 6.4 shows empirical results on the ispd10cns08 benchmark. We vary the local
skew limit for the benchmarks to evaluate the flexibility of our novel clock-network struc-
ture. Once again, we use only one Gaussian buffer type without parallel stacking. When
there is no local skew limit, the initial clock tree is left unchanged. To satisfy increasingly
difficult skew constraints, additional auxiliary trees are generated and fused to enhance ro-
bustness of clock networks. Total clock-network capacitance increases as local skew limit
decreases because the tree must become more robust. The statistics of variational skew are
also shown in the table. Since nominal skew varies for each fusion topology, variational
skew more correctly represents the impact of variations on skew. As shown in the table,
The results show that the multilevel fusion topology exhibits sufficient flexibility to in-
crementally improve robustness based on variational analysis with given local skew limit.
134
to satisfy the difficult skew constraints with 4.5 ps skew limit. The probability density
functions for skew of each clock network in Table 6.4 are illustrated in Figure 6.11.
total skew
method mean 95% cap. cap.
(ps) (ps) (f F ) Ratio
CNSRouter [105] 3.58 5.37 97421 2.30
[86] - 5.47 228243 5.38
our work 2.17 5.06 42414 1.00
Table 6.3: Comparison of results on ispd10cns08 to published data for meshes. Local
skew limit 6.0 ps is used to produce a clock network with better robustness than
meshes. Our clock network is more robust than meshes but also 2.30× greater
power efficient than CNSRouter [105].
Table 6.3 compares our clock network with those produced by CNSRouter [105] and
by techniques in [86]. Our clock network is more robust than meshes, and exhibits signif-
Table 6.4: Results on ispd10cns08 with different local skew limits. The statistics of nom-
inal skew, total skew and variational skew are reported based on Monte-Carlo
simulations. For each tree, we report its mean, standard deviation (σ), as well
as yield for a given skew limit. the worst local skew when yield is 95%. All the
results satisfy slew constraints.
The reported nominal skews in Table 6.4 confirm that our strategy of utilizing tree-
of fused clock networks. To more explicitly validate our splinter-sink technique, Fig-
ure 6.10 illustrates SPICE waveforms at a reconvergent sink and its splinter sinks. The
worst-case reconvergent sink (Sink 680) in the clock network with skew limit 4.5 ps on
135
skew nominal total skew variational skew
limit skew mean σ yield 95% mean σ 95% cap.
(ps) (ps) (ps) (ps) (%) (ps) (ps) (ps) (ps) (f F ) (s)
- 0.980 16.47 2.619 - 22.086 16.46 2.595 21.88 37704.8 293.1
22 2.333 15.32 3.137 98.0 21.08 15.21 3.035 20.83 44093.6 339.6
20 4.081 14.39 2.545 97.4 19.03 14.14 2.619 19.09 46373.9 328.6
18 1.845 12.07 2.446 98.8 16.89 11.90 2.569 17.01 48153.0 469.9
16 3.317 10.84 2.068 99.2 14.07 10.62 2.093 14.15 49918.3 509.3
14 2.412 9.359 2.068 98.4 12.72 9.103 2.201 12.82 56374.6 746.6
Table 6.5: Results on ispd10cns08 with the buffer type ispd10b1 in Table 6.2 without par-
allel buffering. The statistics of nominal skew, total skew and variational skew
are reported based on Monte-Carlo simulations. Mean, standard deviation (σ)
and yield for given local skew limit are reported for each tree. ‘95%’ column
represents the worst local skew when yield is 95%. All the results satisfy slew
constraints.
1.0 1.0
0.9 0.9
0.8 0.8
0.7 0.7
0.6 0.6
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0.0 0.0
3.5E-10 4.0E-10 4.5E-10 5.0E-10 5.5E-10 3.5E-10 4.0E-10 4.5E-10 5.0E-10 5.5E-10
(a) (b)
Figure 6.10: SPICE waveforms for a reconvergent sink (Sink 680) with largest temporal
displacement of split sinks in a fused clock network with skew limit 4.5 ps on
ispd10cns08. Among the four splinter sinks, the maximum rising-delay dis-
placement before merging is 5.31 ps. The maximum rising-delay is 423.58
ps and the minimum rising-delay is 418.27 ps. The delay of the sink after
merging is measure as 423.22 ps. The gray dashed lines represent the wave-
forms at splinter sinks before merging. The blue solid lines represent the
waveforms at the sink after merging. (a) rising edge. (b) falling edge.
ispd10cns08 is shown. Since there are four different paths from the clock source to this
sink, four splinter sinks are generated before optimizing nominal skew. After nominal-
ence between splinter sinks is measured as 5.31 ps. However, after merging splinter sinks,
the waveform of the sink is close to the average of the waveforms of splinter sinks before
merging. This reinforces our modeling of delay and variation based on Formula VI.4. Ta-
136
rising edge falling edge
skew max avg max avg max avg max avg
limit dSS dSS err. err. dSS dSS err. err.
(ps) (ps) (ps) (ps) (ps) (ps) (ps) (ps) (ps)
7.5 1.50 0.42 0.89 0.31 1.65 0.52 0.98 0.34
7.0 0.90 0.22 1.15 0.40 1.73 0.52 1.08 0.45
6.5 1.02 0.25 0.93 0.30 1.64 0.47 1.77 0.45
6.0 0.89 0.36 0.95 0.34 1.04 0.24 0.86 0.30
5.5 1.99 0.21 1.16 0.29 1.53 0.53 0.97 0.26
5.0 2.40 0.33 0.91 0.29 2.17 0.67 0.85 0.26
4.5 5.31 1.07 3.60 0.31 6.75 0.38 4.30 0.31
Table 6.6: Delay analysis of splinter sinks before/after merging on ispd10cns08. dSS rep-
resents displacement of splinter-sink delay before merging. err. represents dif-
ference between average splinter-sink delay (before merging) and actual delay
(after merging).
ble 6.6 shows the overall delay analysis of splinter sinks before and after merging. Since
the splinter-sink condition, the average displacement in delay of splinter sinks is small.
Also the error between the actual delay after merging and the average delay of splinter
This analysis shows that nominal skew optimization with the splinter-sink technique is
In Table 6.5, we present our experimental results on ispd10cns08 with more pessimistic
modeling of process variations. In this experiment, the buffer type ispd10b1 in Table 6.2
is utilized without parallel stacking. The purpose of this experiment is to verify how
robust fusion topology is when the impact of variation is more significant than normal
condition. Given that buffer delays are particularly affected by variation, the skew induced
by variation is significant in the tree structure. However the results show that we can
decidedly reduce the impact of variation by constructing additional auxiliary trees and
137
fusing them into the main network.
Figure 6.11: Skew distributions in our clock networks for ispd10cns08, calculated using
500 independent SPICE runs with variations (Table 6.4). The x-axis shows
skew in ps, skew limits are shown with red solid lines, and the 95%-ile of
skew are shown by dotted green lines.
6.5 Summary
Clock network topologies described in the literature fall into several categories: (i)
trees, (ii) meshes, (iii) trees with incrementally added cross-links, (iv) combinations of
trees and meshes. The gap between tree-like and mesh-like topologies remains signifi-
cant, and cross-links have not been convincingly shown to improve upon pure trees, due
gies derived from trees by adding auxiliary trees and iteratively fusing them into the main
network. Each fusion iteration balances a large subset of skew-critical clock sinks, but as
auxiliary trees are much smaller than the initial tree, the added capacitance is also small.
138
The accuracy of fusion iterations rests on the variational skew analysis techniques we pro-
posed. The final clock-network topology averages out source-to-sink delay and cancels
out some of the correlations induced by process, voltage and temperature (PVT) variations.
Empirical evaluation shows strong results even with exceptionally pessimistic modeling of
process variations, a single wire width and a single allowed buffer configuration without
parallel stacking.
139
CHAPTER VII
semiconductor scaling, the impact of process, voltage and temperature (PVT) variation
tions with difficult constraints are often required to produce reliable and power-efficient
clock networks. We observe that published clock-network synthesis techniques often ig-
nore intradie-process variation and overlook possible synergies with global placement.
Given a clear dichotomy between trees and meshes in clock-network design, clock-network
structures that can incrementally improve robustness based on given constraints are miss-
ing in prior work. Such a limited view of clock-network optimization leaves room for
dress them. We propose optimization algorithms for SoC and microprocessor clock trees,
register placement and novel fused multilevel trees. Below we summarize our contribu-
140
7.1 Summary of contributions
We have found that a significant source of suboptimality in both academic and industry
clock-network synthesis tools today is the fact that they are limited by inaccurate timing
analysis, traditional physical-design flows and clock-network structures optimized for ob-
jectives that are not fully relevant. In this dissertation, we make several contributions that
advance the strength and capabilities of clock-network synthesis tools for large-scale 45
nm designs, with the ultimate goal to improve the quality of leading-edge semiconductor
clock network synthesis, but perform large-scale optimization using analytical models
that lose accuracy at recent technology nodes, and are not always validated by realistic
and achieve best trade-offs between skew, insertion delay, power, as well as tolerance to
marks from IBM Research and Texas Instruments with up to 50K sinks. Contango shared
the first place at the ISPD 2009 Clock-Network Synthesis Contest with two other teams.
Improved experimental results after the contest show that our methodology outperforms
141
Optimization of clock trees for microprocessors.
Clock networks account for a significant fraction of system power dissipation while
a major issue in modern high-performance CPU clock design. On the other hand, the in-
creasing impact of process variation makes clock network synthesis particularly challeng-
ing. Mesh structures are often utilized to improve robustness to variations, but significant
for CPU clock routing that improves power consumption under tight skew constraints in
the presence of variations [53]. Our key contributions include a new time-budgeting step
for clock-tree tuning, accurate optimizations that satisfy budgets, modeling and optimiza-
tion of variational skew. Our software implementation, Contango 2.0, won the first place
at the ISPD 2010 Clock-Network Synthesis Contest. We have shown that clock trees can
be tuned to have nominal skew below 5 ps and total skew in single picoseconds in the
presence of variations. Our optimizations not only satisfy given skew constraints and tar-
get yield but also lead to 4.22× capacitance improvement on average over mesh structures
proposed at the ISPD 2010 contest, with better yield. The comparison with a commercial
clock-tree synthesis tool shows that our nominal-skew optimization techniques are more
precise and reliable than an industrial tool when a given skew constraint is difficult to
satisfy. Our strong empirical results suggest that clock trees constructed using accurate
variational skew modeling and optimizations have distinct advantage in power consump-
142
Clock network optimization during placement.
Most of the existing literature on clock network synthesis assumes that register lo-
cations are given and cannot be changed. While clock networks can be improved by
finding better register locations during placement, few publications develop such opti-
mizations, hence the quality of resulting clock networks is limited by un-optimized loca-
tions of the clocked elements. In Chapter V, we propose new techniques and a method-
ology to optimize total dynamic power during placement for large IC designs with macro
blocks [55, 56]. To this end, we advocate obstacle-aware virtual clock-tree synthesis, an
arboreal clock-net contraction force with virtual nodes that can handle gated clocks, and
an obstacle-avoidance force for clock edges. Our methodology is integrated into the state-
of-the-art SimPL placer [47], and the total switching power is measured by utilizing Con-
tango 2.0 [53]. A new set of 45 nm benchmarks is proposed to better represent modern IC
designs. Experimental results indicate that our software implementation, Lopper, prunes
clock-tree branches to reduce their length by 30.0%∼36.6% and average total dynamic
SPICE-driven simulations show that our methods improve robustness of clock trees.
Commonly used structures for clock networks can be categorized into two major types:
trees and meshes. While tree structures were popular for clock network synthesis in older
chips, mesh structures were utilized to satisfy tight variation-related constraints in high-
143
amount of total wire/buffer capacitance, which leads to a significant increase in total power
dissipation of the design. In Chapter VI, we propose, develop and empirically evaluate
auxiliary trees and iteratively fusing them into the main network [54]. Specific innova-
tions include: (i) Statistical models for delay and skew in buffered clock networks. (ii) A
technique to identify critical sink pairs based on robustness analysis. (iii) A novel clock-
network structure (fused multilevel trees) based on auxiliary-tree construction and fusing
monolithic wires that remedies known deficiencies in ISPD 2010 benchmarks. Each fu-
sion iteration balances a large subset of skew-critical clock sinks, but as auxiliary trees
are much smaller than the initial tree, the added capacitance is also small. The accuracy
of fusion iterations rests on the variational skew analysis techniques we proposed. The
final clock-network topology averages out source-to-sink delay and cancels out some of
the correlations induced by process, voltage and temperature (PVT) variations. Our imple-
mentation called Contango3.0 produces robust clock networks even for challenging skew
limits, without parallel buffering used by other implementations. It also offers a fine trade-
off between power and robustness, increasing the capacitance of the initial tree by less
than 60%, which results in 2.3× greater power efficiency than mesh structures.
The results developed during the course of our research suggest several directions for
future exploration. In Chapter VI, we propose a novel non-tree topology that fuses several
144
auxiliary clock trees at sink nodes to create redundant paths from a clock source to sinks.
In this context, recall that high-quality register placement proposed in Chapter V tends
to locally cluster registers to reduce leaf-level clock-net wirelength. When clock sinks
are locally clustered, tree fusion at internal nodes may produce better clock networks in
terms of power-performance trade-off. However, several difficult problems that are not
explored in the existing literature need to be resolved to make this technique practical.
First, for tree fusion at sink nodes, target latencies for leaf nodes of an auxiliary tree
are uniform, therefore traditional clock-tree generation algorithms can be directly utilized
without significant modifications. However, target latencies of internal nodes vary even
for the nodes at same tree-level, hence auxiliary-tree generation algorithms must consider
the uneven target delays of leaf nodes. DME-based tree-generation algorithms can be
extended to address this problem but prior work has not addressed this problem with proper
analysis tools. Second, the novel non-tree topology has metal loops only at sink nodes.
In Chapter VI, we present the splinter-sink technique that duplicates reconvergent sinks
and disconnect metal loops. This technique temporarily transforms a non-tree topology
adapted to reduce skew of a non-tree structure. However, since the splinter-sink technique
node is significantly more difficult compared to the case in which we can use the splinter-
sink technique. Resolving these problems in future work is not going to be easy, but could
Large-scale chip design may soon have hundreds of millions of standard cells including
145
millions of registers. Hierarchical clock-network synthesis is suggested for these large-
that our clock-network synthesis flows in Chapter III, IV and VI can be extended for
levels. Especially when different parts of a design have different constraints (i.e., some
parts of a chip are designed for high performance, and other parts are designed for low
to generate optimal clock networks for this kind of designs. Consider a two-level clock
network, with the top level driving two disjoint clusters of the bottom level. One cluster
of the bottom level may consists of registers for high-performance logics and the other
cluster has registers for low-power logics. We can utilize the tree-fusion flow with skew-
emphasized constraints for the top-level and high-performance clock networks to generate
more redundant paths, and power-emphasized constraints for the low-power clock network
Design methodologies for TSV-based 3D-ICs are currently pursued by many develop-
ers of EDA software tools. Techniques proposed in this dissertation lend themselves nat-
urally to such extensions. In particular, skew optimization techniques in Chapter III and
IV are not limited by TSVs if clock networks of 3D-ICs maintain tree structures. Arboreal
clock-net contraction force in Chapter V can be utilized for clock networks of 3D-ICs if
a TSV-aware virtual clock-tree generation algorithm is adapted. In this case, TSVs will
be represented by virtual nodes during placement and the optimal locations of TSVs for
146
minimizing switching power can be hinted by placement results. The non-tree topology in
Chapter VI can be utilized for 3D-ICs to improve robustness if the characteristics and cost
147
BIBLIOGRAPHY
148
BIBLIOGRAPHY
[1] A. I. Abou-Seio, B. Nowak and C. Chu, “Fitted Elmore Delay: A Simple and Ac-
curate Interconnect Delay Model,” IEEE Transactions on VLSI, vol 12, pp. 691-696,
2004.
[2] K. Agarwal, D. Sylvester and D. Blaauw, “Simple Metrics for Slew Rate of RC Cir-
cuits Based on Two Circuit Moments,” ACM/IEEE Design Automation Conference,
2003, pp. 950-953.
[4] C. J. Alpert, A. Devgan and C. Kashyap, “RC delay metric for performance optimiza-
tion,”Simple Metrics for Slew Rate of RC Circuits Based on Two Circuit Moments,”
IEEE Transactions on CAD, vol. 20, pp. 571-582, 2004.
[5] C. J. Alpert, F. Liu, C. Kashyap and A. Devgan, “Closed-Form Delay and Slew
Metrics Made Easy,” IEEE Transactions on CAD, vol. 23, pp. 1661-1669, 2004.
[9] C. J. Alpert, C. Chu and P. G. Villarrubia, “The Coming of Age of Physical Syn-
thesis,” IEEE/ACM International Conference on Computer-Aided Design, 2007, pp.
246-249.
149
[11] H. Bakoglu, J. Walker and J. Meindl, “A symmetric clock distribution tree and opti-
mized high-speed interconnects for reduced clock skew in ULSI and WSI circuits,”
IEEE International Conference on Computer Design, 1986, pp. 118-122.
[12] K. D. Boese and A. B. Kahng, “Zero-Skew Clock Routing Trees with Mini-
mum Wirelength,” ASIC Conference and Exhibit, 1992, pp.111-115.
[13] S. Bujimalla and C.-K. Koh, “Synthesis of Low Power Clock Trees for Handling
Power-supply Variations,” ACM/IEEE International Symposium on Physical Design
2011, pp. 37-44.
[14] P. Chan and K. Karplus, “Computing Signal Delay in General RC Tree/Link Parti-
tioning,” IEEE Transactions on CAD, vol. 9, pp. 898-902, 1990.
[15] T. F. Chan, J. Cong, J. R. Shinnerl, K. Sze and M. Xie “mPL6: Enhanced Multilevel
Mixed- Size Placement,” ACM/IEEE International Symposium on Physical Design,
2006, pp. 212-214.
[16] Y.-T. Chang, C.-C Hsu, M. P.-H. Lin, Y.-W. Tsai and S.-F. Chen, “Post-Placement
Power Optimization with Multi-Bit Flip-Flops,” IEEE/ACM International Confer-
ence on Computer-Aided Design, 2010, pp. 218-223.
[17] T.-H. Chao, Y.-C. Hsu and J.-M. Ho, “Zero Skew Clock Net Routing,” ACM/IEEE
Design Automation Conference, 1992, pp. 518-523.
[18] T.-H. Chao, Y.-C. Hsu, J.-M. Ho and A. B. Kahng, “Zero Skew Clock Routing with
Minimum Wirelength,” IEEE Transactions on Circuits and Systems, vol. 39(11), pp.
799-814, 1992.
[20] Y. Cheon, P.-H. Ho, A. B. Kahng, S. Reda and Q. Wang, “Power-aware placement,”
ACM/IEEE Design Automation Conference, 2005, pp. 795-800.
[21] M. Cho, D. Z. Pan and R. Puri, “Novel Binary Linear Programming for High Perfor-
mance Clock Mesh Synthesis,” IEEE/ACM International Conference on Computer-
Aided Design, 2010, pp. 438-443.
[23] J. Cong, A. B. Kahng, C.-K. Koh and C.-W. A. Tsao, “Bounded-Skew Clock and
Steiner Routing,” ACM Transactions on Design Automation of Electronic Systems,
1998, pp. 341-388.
150
[24] L. Conway and C. Mead, “Introduction to VLSI Systems. Reading,” Addison-Wesley,
1980
[29] W. C. Elmore, “The transient response of damped linear network with particular
regard to wideband amplifiers,” Journal of Applied Physics, vol. 19, pp. 66-63, 1948.
[32] D. Harris, M. Horowitz and D. Liu, “Timing Analysis Including Clock Skew,” IEEE
Transactions on CAD, vol. 18(11), pp. 1608-1618, 1999.
[33] P.-H. Ho, “Industrial Clock Design,” ACM/IEEE International Symposium on Physi-
cal Design, 2009, pp. 139-140.
[34] R. Ho, K. Mai and M. Horowitz, “The Future of Wires,” Proceedings of IEEE, vol.
89(4), pp. 490-504, 2001.
[36] J. Hu, A. B. Kahng, B. Liu, G. Venkataraman and X. Xu, “A Global Minimum Clock
Distribution Network Augmentation Algorithm for Guaranteed Clock Skew Yield,”
IEEE/ACM Asia and South Pacific Design Automation Conference, 2007, pp. 24-31.
[37] S. Hu, Q. Li, J. Hu and P. Li,“Utilizing Redundancy for Timing Critical Intercon-
nect,” IEEE Transactions on VLSI, vol. 15(10), pp. 1067-1080, 2007.
[38] J.-H. Huang, A. B. Kahng and C.-W. Tsao, “On Bounded-Skew Routing Tree Prob-
lem,” ACM/IEEE Design Automation Conference, 1995, pp. 508-513.
151
[39] F. Huebbers, A. Dasdan and Y. Ismail, “Multi-Layer Interconnect Performance Cor-
ners for Variation-Aware Timing Analysis,” IEEE/ACM International Conference on
Computer-Aided Design , 2007, pp. 713-718.
[40] M. A. B. Jackson, A. Srinivasan and E. S. Kuh, “Clock Routing for High-
performance ICs,” ACM/IEEE Design Automation Conference, 1990, pp. 573-579.
[41] X. D. Jia, R. M. M. Chen and A. M. Layfield, “Circuit Partitioning for Multiprocessor
SPICE,” IEEE TENCON, 1993.
[42] A. Kahng, J. Cong and G. Robins, “High-performance clock routing based on re-
cursive geometric matching,” ACM/IEEE Design Automation Conference, 1991, pp.
322-327.
[43] A. B. Kahng, J. Leinig, I. L. Markov and J. Hu, “VLSI Physical Design: From Graph
Partitioning to Timing Closure”, Springer, 2011.
[44] A. B. Kahng and C.-W. Tsao, “Practical Bounded-Skew Clock Routing,” Journal of
VLSI Signal Processing, vol. 16, pp.199-215, 1997.
[45] A. B. Kahng, S. Muddu, E. Sarto and R. Sharma, “Interconnect Tuning Strategies for
High-Performance ICs,” IEEE/ACM Design, Automation, and Test in Europe, 1998,
pp. 471-478.
[46] A. B. Kahng and C.-W. Tsao, “Practical Bounded-Skew Clock Routing,” Journal of
VLSI Signal Processing, vol. 16, pp.199-215, 1997.
[47] M.-C. Kim, D.-J. Lee and I. L. Markov, “SimPL: An Effective Placement Algorithm,”
IEEE/ACM International Conference on Computer-Aided Design, 2010, pp. 649-
656.
[48] J. M. Kleinhans, G. Sigl, F. M. Johannes and K. J. Antreich, “GORDIAN: VLSI
placement by quadratic programming and slicing optimization,” IEEE Transactions
on CAD, vol. 10(3), pp. 356-365, 1991.
[49] T. Krazit, “ARMed for the Living Room,” CNET news, Web. 3 Apr. 2006.
[50] W.-C. D. Lam, J. Jam, C.-K. Koh, V. Balakrishnan and Y. Chen, “Statistical Based
Link Insertion for Robust Clock Network Design,” IEEE/ACM International Confer-
ence on Computer-Aided Design , 2005, pp. 588-591.
[51] D.-J. Lee and I. L. Markov, “Contango: Integrated Optimization of SoC Clock Net-
works,” IEEE/ACM Design, Automation, and Test in Europe, 2010, pp. 1468-1473.
[52] D.-J. Lee and I. L. Markov, “Contango: Integrated Optimizations for SoC Clock
Networks”, VLSI Design, vol. 2011, no. 407507, 12 pp., 2011.
[53] D.-J. Lee, M.-C. Kim and I. L. Markov, “Low-Power Clock Trees for CPUs,”
IEEE/ACM International Conference on Computer-Aided Design, 2010, pp. 444-
451.
152
[54] D.-J. Lee and I. L. Markov, “Multilevel Tree Fusion for Robust Clock Networks,”
IEEE/ACM International Conference on Computer-Aided Design, 2011.
[55] D.-J. Lee and I. L.Markov, “Obstacle-aware Clock-tree Shaping during Placement,”
ACM/IEEE International Symposium on Physical Design, 2011, pp. 123-130.
[56] D.-J. Lee and I. L.Markov, “Obstacle-aware Clock-tree Shaping during Placement,”
IEEE Transactions on CAD, 2011.
[57] D.-J. Lee and I. L. Markov, The CLKISPD‘05 Benchmark Suite, 2011. Download-
able from http://vlsicad.eecs.umich.edu/BK/CLKISPD05bench
[59] J. Long, H. Zhou and S. O. Memik, “An O(nlogn) Edge-Based Algorithm for
Obstacle-Avoiding Rectilinear Steiner Tree Construction,” ACM/IEEE International
Symposium on Physical Design, 2008, pp. 126-133.
[60] D. Liu and C. Svensson, “Power consumption estimation in CMOS VLSI circuits,”
IEEE Journal of Solid-State Circuits, vol. 29, pp. 663-670, 1994.
[61] W.-H Liu, Y.-L Li and H.-C. Chen, “Minimizing Clock Latency Range in Robust
Clock Tree Synthesis,” IEEE/ACM Asia and South Pacific Design Automation Con-
ference, 2010, pp. 389-394.
[62] Y. Liu, X. Hong, Y. Cai and W. Wu, “CEP: A clock-driven ECO placement algorithm
for standard-cell layout,” IEEE International Conference on ASIC, 2001, pp. 118-
121.
[63] J. Lu, W.-K. Chow, C.-W. Sham and E. F. Y. Young, “A Dual-MST Approach for
Clock Network Synthesis,” IEEE/ACM Asia and South Pacific Design Automation
Conference, 2010, pp. 467-473.
[64] Y. Lu, C. N. Sze, X. Hong, Q. Zhou, Y. Cai, L. Huang and J. Hu, “Navigating Regis-
ters in Placement for Clock Network Minimization,” ACM/IEEE Design Automation
Conference, 2005, pp. 176-181.
[65] C.-L. Lung, Z.-Y. Zeng, C.-H. Chou and S.-C. Chang “Clock Skew Optimization
Considering Compilicated Power Modes,” IEEE/ACM Design, Automation, and Test
in Europe, 2010, pp. 1474-1479.
[67] T. Mittal and C.-K. Koh, “Cross Link Insertion for Improving Tolerance to Variations
in Clock Network Synthesis,” ACM/IEEE International Symposium on Physical De-
sign, 2011, pp. 29-36.
153
[68] L. W. Nagel and D. O. Pederson, “SPICE (Simulation Program with Integrated Cir-
cuit Emphasis),” Memorandum No. ERL-M382, University of California, Berkeley,
1973.
[70] Nangate Inc. Open Cell Library v2009 07, 2009. Downloadable from
http://www.nangate.com/openlibrary
[73] J. Oh and M. Pedram, “Gated Clock Routing for Low-Power Microprocessor De-
sign”, IEEE Transactions on CAD, Vol. 20(6), pp. 715-722, 2001.
[74] L. T. Pillage and R. A. Rohrer, “Asymptotic Waveform Evaluation for Timing Anal-
ysis,” IEEE Transactions on CAD, vol. 9, pp. 352-366, 1990.
[75] R.P. Pokala, R.A. Feretich and R.W. McGuffin, “Physical Synthesis for Performance
Optimization”, ASIC Conference and Exhibit, 1992, pp. 34-37.
[77] A. Rajaram, J. Hu and R. Mahapatra, “Reducing Clock Skew Variability via Cross
Links,” ACM/IEEE Design Automation Conference, 2004, pp. 18-23.
[78] A. Rajaram, D.Z. Pan and J. Hu, “Improved Algorithms for Link-Based Non-Tree
Clock Networks for Skew Variability,” ACM/IEEE International Symposium on
Physical Design, 2005, pp. 55-62.
[79] A. Rajaram and D. Z. Pan, “Variation Tolerant Buffered Clock Network Synthesis
with Cross Links,” ACM/IEEE International Symposium on Physical Design, 2006,
pp. 157-164.
154
[82] R. S. Shelar, “An Algorithm for Routing with Capacitance/Distance Constraints for
Clock Distribution in Microprocessors,” ACM/IEEE International Symposium on
Physical Design D, 2009, pp. 141-148.
[83] R. S. Shelar and M. Patyra, “Impact of Local Interconnects on Timing and Power
in a High Performance Microprocessor,” ACM/IEEE International Symposium on
Physical Design, 2010, pp. 145-152.
[84] W. Shi and Z. Li, “A Fast Algorithm for Optimal Buffer Insertion,” IEEE Transac-
tions on CAD, vol. 24(6), pp. 879-891, 2005.
[85] X.-W. Shih and Y.-W. Chang, “Fast Timing-Model Independent Buffered Clock-Tree
Synthesis,” ACM/IEEE Design Automation Conference, 2010, pp. 80-85.
[86] X.-W. Shih, H.-C. Lee, K.-H. Ho and Y.-W. Chang, “High Variation-Tolerant
Obstacle-Avoiding Clock Mesh Synthesis with Symmetrical Driving Trees,”
IEEE/ACM International Conference on Computer-Aided Design, 2010, pp. 452-
457.
[87] X.-W. Shih, C.-C. Cheng, Y.-K. Ho and Y.-W. Chang, “Blockage-Avoiding
Buffered Clock-Tree Synthesis for Clock Latency-Range and Skew Minimization,”
IEEE/ACM Asia and South Pacific Design Automation Conference, 2010, pp. 395-
400.
[88] X.-W. Shih and Y.-W. Chang, “Fast Timing-Model Independent Buffered Clock-Tree
Synthesis,” ACM/IEEE Design Automation Conference, 2011, pp. 80-85.
[90] V. Smith, “The Coming War: ARM versus x86,” Bright Side of News, Web. 7 Apr.
2010.
[93] Synopsys, “HSPICE: Simulation and Analysis User Guide,” Synopsys, Inc., Moun-
tain View, CA, 2003. Release U-2003.03-PA.
[94] C. Sze, P. Restle, G.-J. Nam and C. J. Alpert, “ISPD 2009 Clock-network
Synthesis Contest,” ISPD‘09, pp. 149-150. http://www.sigda.org/ispd/
contests/ispd09cts.html.
155
[95] C. N. Sze, “ISPD 2010 High-Performance Clock Network Synthesis Contest: Bench-
mark Suite and Results,” ACM/IEEE International Symposium on Physical Design,
2010, pp. 143-143.
[96] V. Tiwari, D. Singh, S. Rajgopal, G. Mehta, R. Patel and F. Baez, “Reducing Power
in High-Performance Microprocessors,” ACM/IEEE Design Automation Conference,
1998, pp. 732-737.
[97] R. S. Tsay, “An Exact Zero-Skew Clock Routing Algorithm,” IEEE Transactions on
CAD, vol. 12(2), pp.242-249, 1993.
[100] N. Viswanathan, M.Pan and C.Chu, “FastPlace 3.0: A Fast Multilevel Quadratic
Placement Algorithm with Placement Congestion Control,” IEEE/ACM Asia and
South Pacific Design Automation Conference, 2007, pp. 135-140.
[103] Y. Wang, Q. Zhou, X. Hong and Y. Cai, “Clock-Tree Aware Placement Based on
Dynamic Clock-Tree Building,” IEEE International Symposium on Circuits and Sys-
tems, 2007, pp. 2040-2043.
[104] J. L. Wyatt. Jr., “Signal delay in RC mesh networks,” IEEE Transactions on Circuits
and Systems, vol. 32(5), pp. 507-510, 1985.
[105] L. Xiao, Z. Xiao, Z. Qian, Y. Jiang, T. Huang, H. Tian and E. F. Y. Young, “Lo-
cal Clock Skew Minimization Using Blockage-aware Mixed Tree-Mesh Clock Net-
work,” IEEE/ACM International Conference on Computer-Aided Design, 2010, pp.
458-462.
[106] X. Ye, P. Li, M. Zhao, R. Panda and J. Hu, “Analysis of large clock meshes via
harmonic-weighted model order reduction and port sliding,” IEEE/ACM Interna-
tional Conference on Computer-Aided Design, 2007.
156
[107] C. Yeh, G. Wilke, H. Chen, S. Reddy, H. Nguyen, T. Miyoshi, W. Walker and R.
Murgai, “Clock Distribution Architectures: A Comparative Study,” IEEE Interna-
tional Symposium on Quality Electronic Design, 2006, pp. 85-91.
[108] J. Yuan and C. Svensson, “High-speed CMOS circuit technique,” IEEE Journal of
Solid-State Circuits, vol. 24, pp. 62-70, 1989.
[109] W. Zhao and Y. Cao, “New Generation of Predictive Technology Model for Sub-45
nm Early Design Exploration,” IEEE Transactions on Electron Devices, vol. 53(11),
pp. 2816-2823, 2006.
157