An Efficient Clock Tree Synthesis Method in Physical Design: December 2009



Conference Paper · December 2009


DOI: 10.1109/EDSSC.2009.5394159



An Efficient Clock Tree Synthesis
Method in Physical Design
Guirong Wu, Song Jia*, Yuan Wang, Ganggang Zhang

Guirong Wu, Song Jia*, Yuan Wang and Ganggang Zhang are with the Key Laboratory of Microelectronics Devices and Circuits, Institute of Microelectronics, Peking University, Beijing, P. R. China.
E-mail: [email protected], [email protected], [email protected], [email protected]

Abstract: This paper proposes a method for achieving low clock skew that is applicable to the mainstream industry clock tree synthesis (CTS) design flow. The original clock root is partitioned into several pseudo clock sources at the gate level. The automatic place and route (APR) tool can then synthesize the clock tree with lower clock skew because each pseudo clock source drives a smaller fan-out. The proposed method is applied to a chip-level clock tree network and achieves good results.

Keywords: Physical Design, Clock Tree Synthesis, Low Clock Skew

I. INTRODUCTION

Clock network design is a key aspect of the design process and directly impacts the performance of the chip. The following equation [1] relates the clock period $P$, the clock skew $s$, the worst-case data path delay $d_{\max}$, and an offset constant $P_o$ for proper timing:

$$P = s + d_{\max} + P_o \qquad (1)$$

The clock skew is the maximum difference among the clock latencies from the clock source to the flip-flops. Skew can be calculated at the edge of the clock root in three fashions: rise skew, fall skew, and trigger-edge skew [2]. In this paper, we calculate the skew in the trigger-edge fashion. $P_o$ is a constant that includes the data setup time, hold time, latch active time, and other possible offset factors such as safety margins. It is clear from the equation that reducing the cycle time $P$ requires minimizing the skew $s$, in addition to minimizing the worst-case data delay $d_{\max}$ of the combinational logic.

As interconnect delay becomes more dominant in deep submicron (DSM) silicon technologies, clock skew becomes more significant for circuit performance. Minimizing skew is therefore always an important topic in the design of synchronous sequential circuits [3].

With the growing complexity of system designs, clock networks are becoming increasingly complex. Clock nets have larger fan-out and have to be distributed over more function modules. It has been observed that the performance of existing APR tools deteriorates because of their limited computation capability. To address this issue, the logical solution is to break the clock nets into smaller parts. In [4], this is accomplished by partitioning the chip into several pseudo-partitions at the layout level, based on the cell placement. However, that methodology requires developing a new Visual Basic based routing tool built on the Exact Zero Skew (EZS) routing algorithm [1], which is time-consuming. Furthermore, other issues discussed in the conclusion section of [4] also restrict the use of that method. Partitioning at the RTL level involves changes to the chip architecture and would increase the complexity of the solution. Apart from these two methods, partitioning at the gate level has not been reported to date, and it is the subject of this study.

In this paper, we partition the original clock root into several pseudo clock sources at the gate level, which requires no changes to the chip architecture and no additional routing tool. The method is applicable to the mainstream industry CTS design flow, which ensures quality and efficiency. The rest of this article is organized as follows. We first review the common CTS modes supported by mainstream industry APR tools, using Cadence First Encounter (v03.30) as the platform. Next, we conduct a series of experiments and derive clock tree partition guidance from their results. Then we apply the method to the chip-level clock tree synthesis of an embedded processor and compare the experimental results of the proposed method and the conventional method. Finally, we discuss the results and draw conclusions.

II. REVIEW OF COMMON CTS MODES

There are two modes for running CTS in the Cadence First Encounter APR tool: manual and automatic [2].

Manual CTS mode allows you to control the number of levels and the number of buffers, and to specify the types of buffers at each level. The following is an example of clock tree specification file syntax; Fig. 1 shows a graphic representation of that syntax:

ClockNetName MCLK_GE
LevelNumber 2
LevelSpec 1 2 CLKBUFX20
LevelSpec 2 10 CLKBUFX16
PostOpt YES
End
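As a minimal sketch, and not part of the flow described in this paper, the manual-mode specification above could be produced by a short script instead of being written by hand. The function name `render_manual_cts_spec` is hypothetical, and reading the `LevelSpec` fields as level index, buffer count, and buffer cell (with level 1 nearest the clock root) is an assumption based on the example; only the keywords that appear in the example are emitted.

```python
def render_manual_cts_spec(net_name, levels, post_opt="YES"):
    """Render a manual-mode clock tree specification as plain text.

    levels is a list of (buffer_count, buffer_cell) pairs, one per tree
    level. The field order in each LevelSpec line is an assumption taken
    from the example in the text, not from the tool documentation.
    """
    lines = [f"ClockNetName {net_name}", f"LevelNumber {len(levels)}"]
    for level_index, (buffer_count, buffer_cell) in enumerate(levels, start=1):
        lines.append(f"LevelSpec {level_index} {buffer_count} {buffer_cell}")
    lines.append(f"PostOpt {post_opt}")
    lines.append("End")
    return "\n".join(lines)


# Rebuild the two-level example from the text.
spec = render_manual_cts_spec("MCLK_GE", [(2, "CLKBUFX20"), (10, "CLKBUFX16")])
print(spec)
```

Keeping the levels in a list makes it easy to try a different buffer count or cell per level without editing the specification text directly.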

978-1-4244-4298-0/09/$25.00 ©2009 IEEE


Fig. 1. Graphic representation of manual CTS mode

For automatic CTS on a net, CTS builds the clock buffer tree according to the constraints in the clock tree specification file, such as the maximum delay, maximum transition, and maximum skew; generates the clock tree topology; and balances the clock phase delay with appropriately sized, inserted clock buffers. The following is an example of clock tree specification file syntax for automatic CTS on a net; Fig. 2 shows a graphic representation of the syntax:

MacroModel pin alu_core/clk 20ps 18ps 20ps 18ps 30ff
AutoCTSRootPin clk_div/U3/Q
NoGating rising
Buffer CLKINV CLKBUF DLY
MaxDelay 5ns
MinDelay 0ns
SinkMaxTran 80ps
BufMaxTran 80ps
MaxSkew 50ps
End

Fig. 2. Graphic representation of automatic CTS mode

Note that the skew among nodes A, B, and C may meet the maximum skew specified in the clock tree specification file.

At the end of this section, we introduce the Clock Grouping technique, which is available in automatic CTS mode. All clock root pins entered into a clock group will have their sinks meet the maximum skew specified in the clock tree specification file. Clock grouping balances the clocks and attempts to meet the clock skew for all clocks as if they were one tree. The following is an example of clock group syntax; Fig. 3 shows its graphical representation:

ClkGroup
+ SH1/I3/Z1
+ SH2/I4/Z2

Fig. 3. Graphic representation of clock grouping syntax

III. OUR PROPOSED CTS METHOD

Before introducing the proposed method, we conduct a series of CTS experiments on one function module targeted at the SMIC 0.18 μm process technology, using Cadence First Encounter (v03.30). The module consists of 10906 gates (each gate being equivalent to one two-input NAND gate with minimum driving capability in the targeted process technology) plus 1190 D flip-flops (DFFs), which are triggered at the falling edge and are all synchronized by a single clock root named MCLK. The original clock root is partitioned into several pseudo clock sources at the gate level. The clock sources are called "pseudo" because they are not part of the real design intent. After this partition stage, each pseudo clock source drives a smaller fan-out. In this experiment, we break the initial clock root into 19 new pseudo clock sources, named MCLK_0, MCLK_1, MCLK_2, ..., MCLK_17, MCLK_18. MCLK_0 through MCLK_17 each drive 64 DFFs, and the last clock source, MCLK_18, drives the remaining 38 DFFs. The graphic representation is shown in Fig. 4.

Fig. 4. Graphic representation of clock root partition

In the first CTS experiment, we specify the original clock root as the only AutoCTSRootPin in the clock tree specification file. We use First Encounter to synthesize the clock tree in the automatic CTS mode discussed in Section II. The measured trigger-edge skew is 39.2 ps.

In the second CTS experiment, we specify the 19 new pseudo clock sources, obtained with the partition scheme described in the first paragraph of this section, as root pins in the clock tree specification file. We again use First Encounter to synthesize the 19 clock trees in automatic CTS mode. The trigger-edge skews in each pseudo clock domain are listed in Table I, in picoseconds.

TABLE I
STATISTICS OF SKEW

Source     Skew (ps)    Source     Skew (ps)
MCLK_0     13.1         MCLK_10    11.0
MCLK_1     15.3         MCLK_11    7.1
MCLK_2     11.3         MCLK_12    11.2
MCLK_3     11.5         MCLK_13    7.4
MCLK_4     6.6          MCLK_14    12.2
MCLK_5     11.5         MCLK_15    9.9
MCLK_6     14.7         MCLK_16    13.0
MCLK_7     14.1         MCLK_17    15.0
MCLK_8     15.4         MCLK_18    7.6
MCLK_9     14.3

From the statistics in Table I, the skews are much lower than the 39.2 ps measured in the first experiment. This is because the APR tool performs better, as expected, on a small clock net. The tool's capability limitation slows the design process and degrades the performance as the clock net fan-out grows larger [5]. Based on this observation, we make the following assumption: breaking up the clock root into several new pseudo clock sources at the gate level and then synthesizing the clock tree for each may achieve low clock skew. To balance the skew among the new pseudo clock trees, we can use the Clock Grouping technique described in Section II.

We conduct the third CTS experiment to validate our assumption. Five clock root partition schemes on the same function module are listed below.

A: The clock tree is synthesized under one single clock root, which drives 1190 DFFs;
B: The original clock root MCLK is partitioned into 2 clock sources, MCLK_0 driving 573 DFFs and MCLK_1 driving 617 DFFs;
C: The original clock root is partitioned into 5 sources, of which 4 sources, MCLK_0, MCLK_1, MCLK_2, and MCLK_3, drive 256 DFFs each and the last one, MCLK_4, drives 166 DFFs;
D: The original clock root is partitioned into 9 sources, of which 8 sources, MCLK_0 ~ MCLK_7, drive 128 DFFs each and the last one, MCLK_8, drives 166 DFFs;
E: The original clock root is partitioned into 19 sources, of which 18 sources, MCLK_0 ~ MCLK_17, drive 64 DFFs each and the last one, MCLK_18, drives 38 DFFs.

We use First Encounter to synthesize the clock trees in automatic CTS mode for each partition scheme, with the same configuration in the clock tree specification file. The trigger-edge skew of each case is shown in Table II.

TABLE II
STATISTICS OF THE CTS RESULT

Case    Skew (ps)    Clock Tree Area (μm²)    Total Area (μm²)    Time (normalized)
A       39.20        7048                     154739              1
B       34.40        8522                     157569              1
C       26.40        10003                    159442              1
D       23.6         11113                    161344              1
E       17.3         15062                    161220              1.8

The results confirm our assumption. They clearly show that the proposed CTS method is effective in reducing clock skew: the smaller the fan-out driven by each pseudo clock source, the lower the achievable skew. However, lower skew comes at the cost of larger chip area and longer run time. For instance, when the skew improves by 56% in case E compared with case A, the chip area increases by 4.2% and the run time increases by 80%. Therefore, the appropriate partition scheme should be determined from the trade-off among skew, area, and time cost. From the statistics in Table II, case C or D appears appropriate because of its clear improvement in skew with only a small increase in chip area and almost no increase in run time.

IV. METHOD APPLICATION

In this section, we apply the proposed method to the chip-level clock tree synthesis of a 32-bit RISC-based embedded processor (27690 gates, 66 MHz, SMIC 0.18 μm process technology) designed by the R&D team of the Key Laboratory of Microelectronics Devices and Circuits, Institute of Microelectronics, Peking University.

Firstly, we analyze the clock tree structure to determine the partition scheme of the pseudo clock sources. In the target chip, there is only one original clock root, named MCLK, which synchronizes all 1673 DFFs. A brief graphic representation of the clock tree structure is shown in Fig. 5. The number of DFFs in each function module is listed in brackets.
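The uniform partition schemes of Section III (cases C and E) can be illustrated with a short sketch. The Python below is illustrative only and is not the authors' implementation: the sink names, the helper names, and the use of the pseudo clock source names as ClkGroup entries are assumptions, and a real flow would read the sinks from the gate-level netlist.

```python
def partition_sinks(sinks, max_fanout):
    """Split clock sinks into consecutive groups of at most max_fanout each."""
    if max_fanout <= 0:
        raise ValueError("max_fanout must be positive")
    return [sinks[i:i + max_fanout] for i in range(0, len(sinks), max_fanout)]


def pseudo_source_names(root, count):
    """Name the pseudo clock sources MCLK_0, MCLK_1, ... as in Section III."""
    return [f"{root}_{i}" for i in range(count)]


def clock_group_lines(root_pins):
    """Render a ClkGroup entry in the style of the Section II example."""
    return ["ClkGroup"] + [f"+ {pin}" for pin in root_pins]


# Hypothetical sink names standing in for the 1190 DFFs of the test module.
dffs = [f"dff_{i}" for i in range(1190)]

# A fan-out limit of 256 gives the group sizes of case C.
groups_c = partition_sinks(dffs, 256)
sizes_c = [len(group) for group in groups_c]
names_c = pseudo_source_names("MCLK", len(groups_c))

# A fan-out limit of 64 gives the 19 sources of case E.
groups_e = partition_sinks(dffs, 64)
sizes_e = [len(group) for group in groups_e]

print(dict(zip(names_c, sizes_c)))
print(len(groups_e), sizes_e[-1])
print("\n".join(clock_group_lines(names_c)))
```

Cases B and D do not follow this uniform rule (their last source is larger than the others), which is consistent with the remark that finding an appropriate scheme may take repeated trials.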
Fig. 5. Graphic representation of the clock tree structure

Secondly, we determine the partition scheme according to the function and the fan-out count. This process may require repeated trials to obtain an appropriate scheme. One partition scheme is shown in Fig. 6. We break up the original clock root into 10 new pseudo clock sources. The partition scheme in module register_bank is the same as case C described in Section III. A few Perl scripts were developed to partition the original clock root and generate the new gate-level netlist containing the pseudo clock sources.

Fig. 6. Graphic representation of the partition scheme

Finally, we use First Encounter to synthesize the pseudo clock trees in automatic CTS mode; this case is labeled "new" in the Case column of Table III. For comparison between the proposed method and the conventional method, we also conduct a CTS experiment for the original clock root, labeled "original" in the Case column. The comparison is summarized in Table III. It shows that the method proposed in this article improves the trigger-edge skew by 66.3%, while increasing the chip area by 5.88% and the run time by 40% compared with the conventional method.

TABLE III
SUMMARY OF THE COMPARISON

Case        Skew (ps)    Clock Tree Area (μm²)    Total Area (μm²)    Time (normalized)
original    88.4         16695                    452783              1
new         29.8         22473                    479367              1.4

V. CONCLUSION

The method proposed in this article improves the clock skew significantly at the cost of area and throughput time. In contrast, in the results shown in [4] the maximum skews show some improvement, though not a significant one, while there is a significant improvement in time. No area is reported in [4], so we cannot make that comparison. The significant difference in throughput time is attributed to the ability in [4] to run clock tree generation (CTG) for multiple smaller pseudo clock sources on multiple workstations in parallel, whereas our tool runs on one CPU core. This explains our result in throughput time. In theory, there should be a significant improvement in time if we run the tool on multiple workstations concurrently.

Furthermore, because our method is applicable to the mainstream industry CTS design flow, it overcomes many shortcomings of [4], such as too much manual analysis, accurate delay estimations, and other realistic layout concerns discussed in its conclusion section. From this point of view, our approach is superior to the one in [4].

ACKNOWLEDGMENT

I owe many thanks to all of the people who contributed to this paper. First and foremost, I would like to thank my adviser, Dr. Song Jia, from the bottom of my heart, for his guidance. He has been a great source of ideas and has provided me with invaluable feedback. Second, I would like to thank Dr. Yuan Wang and Ganggang Zhang, who gave me their constant support and suggestions during the project and the writing of the paper.

REFERENCES

[1] Tsay, R.-S., "Exact zero skew," Computer-Aided Design, 1991. ICCAD-91. Digest of Technical Papers, 1991 IEEE International Conference on, pp. 336-339, 1991.
[2] Encounter User Guide, Product Version 5.2.1, February 2006.
[3] Chia-Ming Chang, Shih-Hsu Huang, Yuan-Kai Ho, Jia-Zong Lin, Hsin-Po Wang, Yu-Sheng Lu, "Type-matching clock tree for zero skew clock gating," Design Automation Conference, 2008. DAC 2008. 45th ACM/IEEE, pp. 714-719, 2008.
[4] Reaz, M.B.I., Amin, N., Ibrahimy, M.I., Mohd-Yasin, F., Mohammad, A., "Zero skew clock routing for fast clock tree generation," Electrical and Computer Engineering, 2008. CCECE 2008. Canadian Conference on, pp. 4-7, May 2008.
[5] Y. P. Chen, D. F. Wong, "An Algorithm for Zero Skew Clock Tree Routing with B," in Proceedings of the 42nd Annual Conference on Design Automation, USA, pp. 783-788, 2005.
