VLSI Final Notes Unit 4
Analog Simulation uses a simulator such as SPICE. These simulators require models of the I-V
characteristics of the devices. The analysis is done using numerical integration methods. Worst-case
delay analysis requires finding worst-case input patterns. These methods are very time-consuming, and
thus timing analysis using these methods is impractical, except for special problems.
Timing Simulation uses models of circuit and interconnect delays. As an example, consider the Crystal
timing simulator. In this case simple models of gates and interconnections are used, and thus timing
analysis can be done efficiently on large circuits. However, these methods pay a price in terms of
accuracy.
In static logic circuits, delay is measured between the 50% points of the waveforms, as shown in Figure 110.
This choice assumes that the switching points of all gates are approximately at VDD/2. The time from the
10% point of a waveform to the 90% point is called the rise time tr. The time from the 90% to the 10% point
is referred to as the fall time tf.
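As a small illustration of these definitions, the following sketch estimates the 10%-90% rise time of a sampled waveform by linear interpolation between samples; the waveform data and helper names are hypothetical.

# Estimate the 10%-90% rise time of a sampled rising waveform by linear
# interpolation; the waveform below is an ideal 0-to-3.3 V ramp over 1 ns.

def crossing_time(times, volts, level):
    """First time the rising waveform crosses `level` (linear interpolation)."""
    for i in range(1, len(times)):
        if volts[i - 1] < level <= volts[i]:
            frac = (level - volts[i - 1]) / (volts[i] - volts[i - 1])
            return times[i - 1] + frac * (times[i] - times[i - 1])
    raise ValueError("level not crossed")

def rise_time(times, volts, v_dd):
    """t_r: time between the 10% and 90% points of the waveform."""
    return crossing_time(times, volts, 0.9 * v_dd) - crossing_time(times, volts, 0.1 * v_dd)

t = [0.0, 0.25e-9, 0.5e-9, 0.75e-9, 1.0e-9]   # seconds
v = [0.0, 0.825, 1.65, 2.475, 3.3]            # volts
print(rise_time(t, v, 3.3))                   # 0.8 ns for this ideal 1 ns ramp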
Output Loading: As capacitive loads caused by the input capacitance of the fanout gates increase, gate
delay increases.
Wire Loading: As capacitive loads caused by the wiring network driven by the gate increase, gate delay
increases.
Input Waveform Shape: Slow input signals (signals with large rise times) cause long gate delays.
A more detailed model of gate delay is the on-resistance model shown in Figure 111. This model consists of
the following components: the input gate loading or gate capacitance Cg, the output gate loading capacitance
Cd, the intrinsic gate delay TD, and the gate output on-resistance Ron. The function of a gate can be described
as follows. Suppose the input patterns of the gate change, causing a change in the output signal, and
suppose the latest input change occurs at some time t. Then the gate output begins changing at time t + TD.
If the gate is driving a load capacitance CL, the rise time constant of the output waveform is
τ = Ron (Cd + CL). Given some assumptions about the shape of the waveforms (exponential or ramp), these
figures can then be used to compute gate delays.
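A minimal sketch of how the on-resistance model can be used to estimate delay, assuming an exponential output waveform so that the 50% crossing occurs roughly ln(2) ≈ 0.69 time constants after the output starts to move; the component values below are hypothetical.

import math

def on_resistance_delay(t_d, r_on, c_d, c_load):
    """Gate delay to the 50% point under the on-resistance model.

    The output starts moving after the intrinsic delay T_D, then follows an
    (assumed) exponential with time constant tau = R_on * (C_d + C_L).
    """
    tau = r_on * (c_d + c_load)
    return t_d + math.log(2) * tau

# Hypothetical inverter: 40 ps intrinsic delay, 2 kOhm on-resistance,
# 5 fF self-loading, driving a 20 fF load (wire + fanout gate capacitance).
print(on_resistance_delay(40e-12, 2e3, 5e-15, 20e-15))   # about 74.7 ps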
Although inaccurate (the accuracy of this model is around 20%), the on-resistance model can describe many
properties of physical designs. For example, the effects of increased capacitive loading are clear. Also, as
shown in following sections, this model is suitable for efficient delay modeling of lossy interconnections.
Figure 111. (a) Schematic for a simple CMOS inverter driving a distributed RC wiring load
and a single gate load. (b) On-resistance buffer and interconnection delay model,
the distributed RC wiring is modeled with lumped components.
Interconnect Capacitance Cw, which arises from three components: the area component, also referred
to as parallel plate capacitance; the fringing field component that accounts for the finite dimensions of
wires; and the wire-to-wire capacitance between neighboring wires. In modern IC technologies the latter
two are becoming comparable to the parallel plate capacitance. Given a wire of width w, length l, dielectric
thickness tox, a dielectric constant εox, and a fringing field factor fr, the wire capacitance is

Cw = (εox · w · l / tox) · fr
Interconnect Resistance Rw, which arises since conductors in modern ICs are not ideal. Given the
resistivity ρ of the wiring material and the thickness h of the wire, the wire resistance is

Rw = ρ · l / (w · h)

(A small numeric sketch of both formulas appears after this list.)
Other effects include the interconnect inductance Lw and the shunt conductance Gw. These effects have been
negligible to date in IC technologies; however, they may become more important with decreasing
dimensions.
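The following is a small numeric sketch of the two formulas above; the geometry, material constants, and fringing factor chosen here are illustrative assumptions, not values from the text.

# Parasitics of a single wire from the formulas above; all numbers illustrative.

EPS_OX = 3.9 * 8.854e-12   # permittivity of SiO2 (F/m), eps_r * eps_0
RHO_AL = 2.7e-8            # resistivity of aluminum (ohm*m)

def wire_capacitance(w, l, t_ox, f_r, eps_ox=EPS_OX):
    """C_w = (eps_ox * w * l / t_ox) * f_r, with f_r the fringing-field factor."""
    return eps_ox * w * l / t_ox * f_r

def wire_resistance(w, l, h, rho=RHO_AL):
    """R_w = rho * l / (w * h)."""
    return rho * l / (w * h)

# A 1 mm long, 1 um wide, 0.5 um thick wire over 0.8 um of oxide.
w, l, h, t_ox = 1e-6, 1e-3, 0.5e-6, 0.8e-6
print(wire_capacitance(w, l, t_ox, f_r=1.5))   # about 65 fF
print(wire_resistance(w, l, h))                # about 54 ohms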
Figure 112. (a) Three-dimensional view of an IC wiring structure. The dimensions of the wire are its
length l, width w, thickness h, and its distance from a ground plane tox.
(b) Electrical model of the parasitic effects of the wire.
This section considers various interconnect models that handle parasitic capacitance and resistance. The
most accurate model of interconnect at high speed is the transmission line model, where wires have
capacitance, resistance, and inductance per unit length. These models are difficult to use and are
unnecessarily complicated for present-day IC technologies. Transmission line analysis is only called for
when the time of flight is longer than the signal rise times. This effect only occurs in the fastest of modern
IC technologies such as gallium arsenide ICs. For this reason, this section focuses on two practical
interconnect models, the distributed RC network model and the lumped RC network models.
3. Delay in RC Trees
The general time-domain solution of any RC network is a well-known problem in numerical analysis. An
RC network (as shown in Figure 113) produces a linear system of differential equations that can be solved
for an exact response. In general, given a linear network with n storage elements, the Laplace
transform (or frequency-domain response) of the network, H(s) = L(h(t)), can be expressed in the form
H(s) = (1 + a1·s + … + am·s^m) / (s · (1 + b1·s + … + bn·s^n))
     = k1/(s - p1) + … + ki/(s - pi) + … + kn/(s - pn)

h(t) = k1·e^(p1·t) + … + ki·e^(pi·t) + … + kn·e^(pn·t)
where pi and ki are the poles and residues, respectively, of the frequency-domain expression. Unfortunately,
computing the exact poles and residues from the frequency-domain solution is computationally expensive.
Fortunately, it has been observed in practice that the response of
practical RC networks is dominated by a few poles and residues.
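Because the response is dominated by a few poles, a common first-order estimate is the first-moment (Elmore-style) delay of the RC tree, which can be computed in linear time. The sketch below uses that approximation on a hypothetical tree; it is not the exact pole/residue computation discussed above.

# First-moment (Elmore-style) delay estimate for an RC tree; values illustrative.

def elmore_delays(parent, R, C):
    """parent[i]: parent node of i (None for the root/source),
    R[i]: resistance of the edge from parent[i] to i,
    C[i]: capacitance at node i.
    Returns the delay estimate from the root to every node."""
    n = len(parent)
    children = [[] for _ in range(n)]
    root = None
    for i, p in enumerate(parent):
        if p is None:
            root = i
        else:
            children[p].append(i)

    # Total capacitance in the subtree rooted at each node (children before parents).
    subcap = list(C)
    order = []
    stack = [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    for v in reversed(order):
        for c in children[v]:
            subcap[v] += subcap[c]

    # delay(i) = delay(parent) + R_edge * (capacitance downstream of the edge)
    delay = [0.0] * n
    for v in order:
        if parent[v] is not None:
            delay[v] = delay[parent[v]] + R[v] * subcap[v]
    return delay

# Driver (node 0) with a fanout of 2, as in Figure 113; numbers are illustrative.
parent = [None, 0, 1, 1]
R = [0.0, 100.0, 50.0, 80.0]       # ohms
C = [0.0, 50e-15, 20e-15, 30e-15]  # farads
print(elmore_delays(parent, R, C))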
Figure 113. Example of RC Tree for a CMOS VLSI Circuit with a fanout of 2.
The required clock period is T = S + dmax + T0,
where S, dmax, and T0 represent the clock skew, the maximum delay, and the adjustment factor, respectively.
Placement is a very important step in designing timing-efficient circuits. If a placement is bad in terms of
timing, even a very powerful router cannot help. There are basically three approaches to the timing-driven
placement problem: the zero-slack algorithm, the weight-based placement, and the linear programming
approach.
slack = tR - tA    ..(1)

where tR and tA denote the required time and the arrival time at a pin, respectively.
Note that the arrival time at the output of a gate is known only when the arrival time at every input of that
gate has been computed. Thus, a topological sorting is performed on the gates (vertices) to find the arrival
times. (A topological sort is the process of visiting the gates one by one. A gate can be visited if all its inputs
are PIs or when all the gates providing signal to it have already been visited.)
The relationship between the arrival time tA_j at an output pin j and the arrival times tA_1, tA_2, …, tA_m at
the outputs of the fanin cells is given in Equation 2, where dn_1, dn_2, …, dn_m are net
delays, and D is the delay of the cell containing pin j. The relationship is also shown in Figure 114.

tA_j = max_{i=1..m} (tA_i + dn_i) + D    ..(2)
Deciding the required times is done in a similar manner, except that the signal direction (i.e., the edges of the
DAG) is reversed when topological sorting is applied. The relationship between the required time tR_j at an
output pin and the required times tR_1, tR_2, tR_3, …, tR_m at its fanout cells is
shown in Figure 115 and expressed in Equation 3. The terms D1, D2, D3, …, Dm are cell delays, and dn is the
delay of the net that connects the cell to its fanout cells.

tR_j = min_{i=1..m} (tR_i - Di) - dn    ..(3)
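Equations 1-3 can be transcribed directly as small helper functions, assuming the per-pin arrival times, required times, net delays, and cell delays are already known; the function names and numbers below are illustrative.

from typing import List

def arrival_time(fanin_arrivals: List[float], net_delays: List[float], cell_delay: float) -> float:
    """Equation 2: t_A(j) = max_i (t_A(i) + d_n(i)) + D over the m fanin pins."""
    return max(t + d for t, d in zip(fanin_arrivals, net_delays)) + cell_delay

def required_time(fanout_required: List[float], cell_delays: List[float], net_delay: float) -> float:
    """Equation 3: t_R(j) = min_i (t_R(i) - D_i) - d_n over the m fanout cells."""
    return min(t - delay for t, delay in zip(fanout_required, cell_delays)) - net_delay

t_a = arrival_time([2.0, 5.0], net_delays=[1.0, 1.0], cell_delay=3.0)      # 9.0
t_r = required_time([12.0, 10.0], cell_delays=[3.0, 2.0], net_delay=1.0)   # 7.0
print(t_a, t_r, t_r - t_a)   # slack = t_R - t_A (Equation 1) -> -2.0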
The zero-slack algorithm associates a maximum allowable delay to each net in an iterative manner. In the
first iteration, the delay of each net is assumed to be zero. The purpose of the zero-slack algorithm is to
distribute slacks to each net. It transforms timing constraints to net length constraints. Consequently, primary
inputs, primary outputs, and the outputs of each cell will have zero slacks.
An example is demonstrated in Figures 116–122. A triplet tA/slack/tR is shown at each (input and output)
pin. Figure 116 shows the input circuit with the delay of each net initially set to zero. The arrival times of
primary inputs and the required times of primary outputs are given. Slacks are calculated according to
Equations 1–3. The minimum positive slack is four; the corresponding path segment is found, and if there is more than one such segment, one is arbitrarily
selected. In Figure 117, the four units of slack have been distributed along the path segment of Figure 116. It
indicates that each of the segments on the path can have length proportional to the assigned length (in this
case, 1) in the final layout. A unit-delay model is assumed; that is, the propagation delay along a wire of
length l is O(l). After slack distribution, slacks on the path segment become zero. Then the new slacks are
calculated. The new path segment is primary input I2 and the net connected to it. Figures 118–122
demonstrate subsequent steps. Figure 122 shows the final result, where the output pin slack of each gate is
zero. The algorithm just described is summarized as follows:
procedure zero-slack
begin-1
    repeat until no positive slack
        compute all slacks;
        find the minimum positive slack;
        find a path with all the slacks equal to the minimum slack;
        distribute the slacks along the path segment;
end-1.
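Below is a much-simplified sketch of the slack-distribution idea for a single source-to-sink path (a chain of nets) under the unit-delay model; the even split of the minimum slack over the segment is an illustrative assumption, whereas the actual algorithm iterates over path segments of the whole DAG as in the procedure above.

# Simplified zero-slack loop for one path: distribute its slack over its nets.

def zero_slack_on_path(arrival, required, net_delays):
    """slack = t_R - t_A, with t_A accumulated over the current net delays."""
    while True:
        t_a = arrival + sum(net_delays)        # arrival time at the path's endpoint
        slack = required - t_a                 # Equation 1
        if slack <= 0:
            break
        share = slack / len(net_delays)        # distribute slack along the segment
        net_delays = [d + share for d in net_delays]
    return net_delays                          # per-net delay budgets (=> length bounds)

print(zero_slack_on_path(arrival=0.0, required=8.0, net_delays=[0.0, 0.0, 0.0, 0.0]))
# -> each net gets a budget of 2.0 under the unit-delay model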
1. Delay Minimization
Delay minimization can be used to obtain an effective timing-driven global router. The procedure can also
be used as a detailed router.
Let the radius R of a signal net be the length of a shortest path from the source s to the furthest sink, that is,
R = max_{x∈N} dist(s, x). The radius of a routing tree T, denoted by r(T), is the length of a shortest path in T
from the source s to the furthest sink. Clearly, r(T) ≥ R for any routing tree T. According to the linear RC
delay model, the interconnection delay of a net is minimized by minimizing the radius of the routing tree,
that is, the maximum interconnection delay between the source and the sinks. Furthermore, the goal is also a
routing tree with small total cost. Here, the total length of the tree is used as the cost. Other cost measures
might be a measure of the crowdedness of the channels or the number of bends, proportional to the number
of vias, in the net.
In order to consider both the radius and the total length in the routing tree construction, bounded radius
minimum routing trees (BRMRT) are used. The reason for minimizing the length is that it has been observed
that global routers aiming to minimize wire length produce small chips. Given a parameter ε ≥ 0 and a signal
net with radius R, a BRMRT is a minimum-length routing tree T with radius r(T) ≤ (1 + ε)·R.
The parameter ε controls the trade-off between the radius and the length of the tree. When ε = 0, the radius
of the routing tree is minimized and thus an SPT (shortest path tree) for the signal net is obtained. On the
other hand, when ε = ∞, the total length of the tree is minimized and an MST (minimum spanning tree) is
obtained. In general, as ε grows, there is less restriction on the radius, allowing further reduction in tree cost.
Figure 123 shows an example where three distinct spanning trees are obtained using different values of ε.
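The sketch below only illustrates the definitions involved (the net radius R, the tree radius r(T), and the (1 + ε) bound) by building an MST with Prim's algorithm over Manhattan distances and checking whether it already satisfies the radius condition; it is not a BRMRT construction algorithm, and the point set and ε are hypothetical.

# Check the BRMRT radius condition r(T) <= (1 + eps) * R for an MST routing tree.

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def prim_mst(points, source=0):
    """Return the MST as a parent list over the complete Manhattan graph."""
    n = len(points)
    in_tree = [False] * n
    best = [float("inf")] * n
    parent = [None] * n
    best[source] = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        for v in range(n):
            d = manhattan(points[u], points[v])
            if not in_tree[v] and d < best[v]:
                best[v], parent[v] = d, u
    return parent

def tree_radius(points, parent, source=0):
    """r(T): longest source-to-sink path length in the tree."""
    def path_len(v):
        return 0.0 if v == source else manhattan(points[v], points[parent[v]]) + path_len(parent[v])
    return max(path_len(v) for v in range(len(points)))

points = [(0, 0), (4, 1), (5, 5), (1, 6), (8, 2)]    # source is points[0]
R = max(manhattan(points[0], p) for p in points)     # net radius
parent = prim_mst(points)
rT = tree_radius(points, parent)
eps = 0.5
print(R, rT, rT <= (1 + eps) * R)   # does this MST already meet the radius bound?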
Figure 124. Transforming a clock tree with large skew to one with small skew.
The minimization of clock skew has been studied in the past few years. First, researchers proposed to use the
H-tree construction in systolic arrays. The H-tree construction can be used when all components have equal
sizes and are regularly placed. For the case when all blocks are organized hierarchically, a clock distribution technique
has also been proposed; it was assumed that the number of blocks at each level of the hierarchy was small, since an
exhaustive search algorithm was employed to enumerate all the possible routes.
More recently, two recursive algorithms were proposed. They are both based on the linear delay model;
recall that the delay along a path of length l in a given routing tree is O(l). First, Jackson, Srinivasan, and
Kuh proposed a top-down approach to the problem. Their algorithm (the JSK-algorithm) recursively
partitions the problem into two equal-sized (i.e., with equal number of points) subproblems. At each step, the
center of mass of the whole problem is directly connected to the center of mass of the two subproblems, as
shown in Figure 125. The center of mass of a point set {P1, …, Pn}, where Pi has coordinates xi and yi, is
defined as the point with x-coordinate x = (x1 + … + xn)/n and y-coordinate y = (y1 + … + yn)/n. It is shown
that the clock skew (dmax - dmin) is bounded. The second algorithm works bottom-up: the clock points are
matched in pairs, and the centers of the matched pairs, {P^1_1, …, P^1_{n/2}}, are
found (Figure 126(b)). Here, the center is defined as the point that minimizes the length from it to the
furthest node minus the length from it to the closest point. That is, let c be the selected center, lmax the length
of the longest path from c to a sink, and lmin the length of the shortest path from c to a sink. The center c is
selected to minimize lmax - lmin. At this stage the generated center points are also matched as shown in Figure
126(b). This procedure is repeated until all the points are matched. The final tree is obtained as shown in
Figure 126(c). It is shown that the total length of the final tree for a set of points uniformly distributed in the unit
square is O(√n).
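A small sketch of the center-selection rule just described: among candidate locations, pick the one minimizing lmax - lmin. The use of Manhattan distance and a candidate grid of locations are illustrative assumptions.

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def pick_center(candidates, points):
    def spread(c):
        dists = [manhattan(c, p) for p in points]
        return max(dists) - min(dists)     # l_max - l_min
    return min(candidates, key=spread)

# Two matched clock points and a small grid of candidate center locations.
pair = [(0, 0), (6, 2)]
grid = [(x, y) for x in range(7) for y in range(3)]
print(pick_center(grid, pair))   # a location (nearly) equidistant from both points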
The clock design objectives include: the clock skew, defined as the maximum difference of the delays from
the clock source to the clock pins; the clock phase delay (or latency), defined as the maximum delay from the
clock source to any clock pin; the clock rise time (or slew rate) of the signals at the clock pins, defined as the
time it takes the waveform to change from a VLO to a VHI value; and the sensitivity of the clock skew, clock
phase delay, and rise time to parametric variations. The most important sensitivity is that of the clock skew.
Additionally, since in modern microprocessors the clock networks can consume a large portion of the total
microprocessor's power (up to 50%), the clock design objectives must be attained while minimizing the use
of system resources such as power and area. In high performance systems, these constraints can be quite
severe. Consequently, the layout of a good clock distribution network is difficult and time-consuming.
A very basic concern in the construction of clock trees is the integrity of the clock signal. Since the chip
interconnect is lossy, the clock signal degrades as it is transmitted through the network. The longer the path
from the source to the sink, the longer the rise time of the clock signal. If the rise time of the clock signal
exceeds the clock period, then the clock signal is unable to reach suitable logic high and low values. This
problem can be remedied by several different clock network design strategies.
Other work on buffered clock trees has focused on minimizing the phase delay of the clock tree on the
assumption that reducing clock tree delays also reduces skew and skew sensitivity to process variation. In
these works, the delay model usually consists of a fixed on-resistance and delay for each buffer. More
recent work considers the minimization of skew and delay in the presence of process variations in
buffered clock trees.
Various wiring-oriented approaches exist to resolve these problems. First, it is possible to taper the wire
widths of the clock tree such that the rise times at the terminals are decreased and the rise-time constraints perhaps satisfied (Figure
127(a)). Another alternative is to increase the capacitance and decrease the resistance on the source-to-sink
paths by using wide bus techniques (Figure 127(b)). This approach can be taken to extremes by constructing
a fat clock bus from which other clock pin connections are obtained (as shown in Figure 127(c)). This last
type of a clock tree has to be driven by a very large buffer, as in Figure 128(a); thus large amounts of power
are consumed by the clock driver and the clock interconnect. However, since the delays of the clock tree are
dominated by the capacitance of the wide bus, the bus-to-clock pin interconnection can be done simply and
efficiently.
Figure 127. (a) Clock tree with tapered wire widths. (b) Clock tree using a wide bus approach.
(c) Fat bus clock tree construction
There are also buffer-oriented approaches to clock network constructions. Buffer-oriented approaches
partition the clock tree into sections using buffers. Consider a buffered clock tree or clock power-up tree,
which contains buffers in its source-to-sink paths. Several examples of buffered clock trees are shown in
Figure 128. In this case, to maintain the integrity of the clock signal, it is sufficient to restrict all the rise
times in the network to values smaller than the clock period. The addition of buffers in source-to-sink paths
reduces the resistance in these paths, and therefore reduces the signals' rise times. As a result, the target clock
period imposes a lower bound on the number of buffers needed by the clock tree. Algorithms have been
proposed to compute the minimum number of buffers required to satisfy a given clock period. Buffer-oriented and wiring-oriented clock designs exhibit several trade-offs.
Wiring-oriented solutions are not very sensitive to fabrication variations, while buffer-oriented solutions
may be more so.
The construction of wiring-oriented networks requires careful analysis and modeling of the interconnect
networks, while buffer-oriented trees must also take into account detailed buffer delay modeling.
Buffer-oriented solutions potentially consume less power than wiring-oriented solutions.
Embedding the buffers into the placement may not be as easy as constructing a wiring-oriented solution.
Consider buffered clock tree constructions along with process variations and realistic buffer delay models.
To obtain zero skews while reducing sensitivity to process variations, buffered clock trees usually have
equal numbers of buffers in all source-to-sink paths. Furthermore, at each buffered level the clock tree
buffers are required to have equal dimensions (Figure 128(c)) to avoid buffer parameter variation and buffer
delay model accuracy problems. Using the buffer delay model described earlier, two additional steps are
required to obtain zero-skew clock trees: the capacitive loads of the buffers in one level must be the same
(this is called load matching), and the rise times at the inputs of the buffers at one level must also match.
Note that wiring-oriented techniques may be applied at each buffer stage to produce load and rise time
matching.
Figure 128. (a) A buffer chain with tapered sizes driving a clock tree. (b) Clock power-up tree with buffers
of a single size. (c) Clock power-up tree with buffers of tapered sizes. (Note that all buffers on the same
level have the same size.)
The clock network construction problem presents trade-offs between wire length and skew, and between
variation of parameters under manufacturing conditions (effectively affecting chip yield) and power
consumption. These trade-offs present a challenge to the designer and to those seeking to automate the clock
design process.
VIA MINIMIZATION
Vias (or contact holes) between different layers of interconnection on dense integrated circuits can reduce
production yield, degrade circuit performance (by increasing propagation delay), and occupy a large amount
of chip area (as the physical size of a via is larger than the wire's width). Also, vias on multilayer printed
circuit boards raise the manufacturing cost and reduce product reliability. Thus, much research effort has
been focused on minimizing the number of vias.
There are basically two categories of via minimization problems, constrained via minimization (CVM), and
unconstrained via minimization (UVM). In constrained via minimization, given a routing region, a set of
modules and terminals, and the layout of the nets (the detailed routing), the question is how to assign the
wire segments to the layers such that no two wires of different nets intersect in the same layer and the
number of vias is minimum. In an unconstrained via minimization problem, the layout of the nets is not
given, and the question is how to find a topological (or global) routing such that the minimum number of
vias is needed.
Figure 129. CVM. (a) Before via minimization. (b) After via minimization.
Consider an instance of CVM shown in Figure 130(a). A crossing graph is one where the vertices represent
the intersection of nets and an edge is present between two adjacent intersections. The crossing graph of the
layout shown in Figure 130(a) is shown in Figure 130(b). The faces of the graph are labeled a, b, …, g,
where face g is the infinite (or boundary) face. From the result of CVM, represented by a plane graph G, the
dual graph G' is obtained. A face with an odd number of edges on its boundary in G is an odd-cycle face. An
even-cycle face is similarly defined.
Figure 131. UVM. (a) Topology of the net with one via; (b) geometrical mapping.
Figure 132. Topological routing graphs. (a) A circle graph; (b) net intersection graph; (c) a max-cut.
Consider a routing region and a set of terminals. The routing region is characterized as a circle and the nets
are drawn as chords. The intersection graph of the chords is called a circle graph, as shown in Figure 132(a).
The vertices of the net intersection graph (see Figure 132(b)) represent the nets, and the edges represent
intersections between the corresponding nets. That is, an edge (i, j) exists if nets i and j intersect in the circle
graph. The goal is to delete a minimum number of vertices from the intersection graph, such that the
resulting graph is bipartite; see Figure 132(c). This creates two sets of vertices that can be placed into two
layers without any vias. In the figure, L 1 represents a set of nets assigned to layer 1, and L 2 represents a set
of nets assigned to layer 2. The sets L 1 and L2 define the maximum sets of wires that can be routed on two
layers without vias. The deleted vertices are referred to as the residual nets. The problem now reduces to one
of routing the residual wires.
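The sketch below illustrates this formulation: terminals are numbered around the routing-region boundary, each two-terminal net is a chord, and two nets conflict when their chords intersect. A simple greedy two-layer assignment is shown, with nets that fit in neither layer becoming residual nets; this greedy pass is only an illustration, not the exact optimization referenced above.

# Circle-graph conflicts and a greedy two-layer assignment of nets.

def chords_cross(a, b):
    """Chords a=(a1,a2), b=(b1,b2) on a circle cross iff exactly one endpoint
    of b lies strictly between a's endpoints going around the circle."""
    a1, a2 = sorted(a)
    inside = [a1 < x < a2 for x in b]
    return inside[0] != inside[1]

def two_layer_assignment(nets):
    layer1, layer2, residual = [], [], []
    for net in nets:
        if all(not chords_cross(net, other) for other in layer1):
            layer1.append(net)
        elif all(not chords_cross(net, other) for other in layer2):
            layer2.append(net)
        else:
            residual.append(net)   # residual net: must be rerouted / needs vias
    return layer1, layer2, residual

# Terminals 0..11 around the boundary; each net connects two of them.
nets = [(0, 6), (1, 7), (2, 9), (3, 5), (8, 11)]
print(two_layer_assignment(nets))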
POWER MINIMIZATION
Power consumption in CMOS circuits is due to the dynamic power consumption of charging and
discharging capacitive loads during output transitions at gates, the short circuit current that flows during
output transitions at gates, and the leakage current. The last two factors can be made sufficiently small with
proper device and circuit design techniques; thus, research in design automation for low power use has
focused on the minimization of the first factor, the dynamic power consumption.
The average dynamic power consumed by a CMOS gate is given below, where Cl is the load capacitance at the
output of the node, Vdd is the supply voltage, Tcycle is the global clock period, N is the number of transitions
of the gate output per clock cycle, Cg is the load capacitance due to the input capacitance of the fanout gates, and Cw is
the load capacitance due to the interconnection tree formed between the driver and its fanout gates.

Pav = 0.5 (Vdd^2 / Tcycle) Cl N = 0.5 (Vdd^2 / Tcycle) (Cg + Cw) N
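A worked example of the formula, with illustrative numbers (not taken from the text):

def average_dynamic_power(c_g, c_w, v_dd, t_cycle, n_transitions):
    """P_av = 0.5 * (V_dd^2 / T_cycle) * (C_g + C_w) * N"""
    c_l = c_g + c_w                       # total load capacitance
    return 0.5 * (v_dd ** 2 / t_cycle) * c_l * n_transitions

# e.g. 20 fF gate load + 30 fF wire load, 3.3 V supply, 100 MHz clock,
# and an average of 0.4 output transitions per cycle.
p = average_dynamic_power(20e-15, 30e-15, 3.3, 1 / 100e6, 0.4)
print(f"{p * 1e6:.3f} microwatts")   # about 10.9 microwatts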
To minimize the power, the following techniques are used:
1. Reducing chip and package capacitance: this can be achieved through process development with
partially and fully depleted wells, CMOS scaling to submicron device sizes, and advanced interconnect
substrates such as Multi-Chip Modules (MCM). This approach can be very effective but is very
expensive.
2. Scaling the supply voltage: This approach can be very effective in reducing the power dissipation, but
requires new IC fabrication processing. Supply voltage scaling also requires support circuitry for low
voltage operation, including level converters and DC/DC converters, as well as detailed consideration of
issues such as signal to noise margins.
3. Employing better design techniques: In this approach the investment to reduce power by design is
relatively small in comparison to other techniques.
4. Using power management strategies: The power savings that can be achieved by various static and
dynamic power management techniques are application-dependent but can be significant.
Timing-driven placement, timing-driven routing, the via minimization problems, and the power
minimization problem are the problems given the most attention in the literature. Other problems have been
briefly studied from a timing point of view. A summary of these results follows.
Timing-driven partitioning considers both timing and capacity constraints in the partitioning phase. The main
application is to assign functional blocks to slots on multichip modules during high-level design. It
provides fast feedback on the impact of high-level design decisions. In the first phase, clustering is done to
ensure the timing constraints are satisfied. Then a packing, followed by a traditional partitioning (e.g., a
Kernighan-Lin technique), is done to satisfy capacity constraints. Compared to a classical partitioning
scheme, experiments show that the new technique has no timing violation if the capacity is increased by a
small percentage (about 5 to 6% on the average).
Some factors that determine the suitability of a decomposition are the geometry of the modules, the
connectivity among modules, and timing constraints. A hierarchical-clustering-based algorithm can be used to generate
a small number of superior candidate hierarchies. An approach to perform simultaneous placement and
routing of hierarchical IC layouts is the method based on the concept of slicing, which introduces a specific
regularity to the problem with certain constraints on the chip floorplan. The placement part consists of a
number of interrelated linear ordering problems, corresponding to the ordering of slices. The routing part
outlines a hierarchical pattern router, applicable to slicing structures. The IC layout construction progresses
top down in small design increments, corresponding to the processing of the individual slices. The
placement and routing data are acquired at intervening time slots and influence each other at each level of
hierarchy.
Timing-driven floorplanning is a mathematical programming (i.e., a constrained nonlinear programming)
approach that incorporates both timing and geometric information in the formulation. The three steps
employed are timing minimization with module overlap, module separation, and timing minimization
without module overlap. To obtain a fast algorithm, techniques for removing redundant paths are used; thus,
the number of constraints is reduced. The information obtained on the initial placement is fed into a
floorplan package to find the final shapes, sizes, and pin locations.
A new three-layer over-the-cell channel routing algorithm minimizes the channel height by
using over-the-cell areas while achieving the nets' timing requirements; each net has a maximum allowable
upper bound on its length. In this work, 45° wire segments are used to route the nets over the cells to further
reduce net length. Experimental results show that the same area as used by previous over-the-cell routers
can be obtained while satisfying the length constraints.
Other performance issues in physical design include regularity, layout synthesis systems, and automated
generation. Regularity is an important issue in VLSI design. Ishikawa and Yoshimura presented a new
module generator for generating functional modules with structural regularity. Unlike most module
generators, which are only able to place cells, the proposed generator can generate functional modules with
structured routing for regular-structure layouts such as multipliers and RAMs. A synthesis system for the
automatic layout of NMOS gate cells is described. The cells are based on multigrid cell models and are
intended for use as part of a chip synthesis system. An outline of the basic concepts of a CAD procedure for
the layout synthesis of such cells is given. The main objective is to generate correct and compact cells with a
controlled growth in area when they are subjected to modified speed requirements. The key is to determine
the most profitable level of abstraction for the designer, which is accomplished by the introduction of macro
abstraction, interface inheritance, delay binding, and the complete decoupling of procedural and graphical
design information. These abstraction mechanisms are implemented in the regular structure generator, an
operational layout generator with significant advantages over first-generation layout tools.
Recently, a new class of placement strategies has been proposed that incorporates knowledge of the underlying
circuit structure, targeted at high-performance circuits. The regularity of a net is measured using the standard deviation of the
locations of its terminals. Experimental results showed that exploiting circuit structural information led to
substantially more regular nets, compared to placement without structural information. Benchmark circuits
showed 1.9 to 26 times the number of nets with zero standard deviation. Furthermore, the length of the
longest net in a chip was reduced by 5 to 22%. The placement strategy is crucial for high-performance
circuits since regular nets have relatively fewer vias and reducing the longest net length improves signal
delay.
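A small sketch of such a regularity measure: the standard deviation of a net's terminal coordinates along one axis (the choice of axis is an assumption here); a net whose terminals are perfectly aligned has zero deviation.

from statistics import pstdev

def net_regularity(terminal_xs):
    """terminal_xs: coordinates of one net's terminals along one axis.
    Returns the (population) standard deviation of the locations."""
    return pstdev(terminal_xs)

print(net_regularity([7.0, 7.0, 7.0]))   # terminals aligned in one column -> 0.0
print(net_regularity([3.0, 7.0, 8.0]))   # scattered terminals -> > 0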
Some other performance issues are minimizing the number of power pads, electrical performance, critical area
minimization, and bend minimization.