05 Timing

University of Manchester
School of Computer Science
Clocking
The synchronous model is an enormous simplification in designing working Finite State Machines. In order for this to work there are certain assumptions:
• data setup time (tsu)
  ◦ data stability before the clock
• data hold time (thold)
  ◦ data stability after the clock
• propagation delay (tpd)
  ◦ output lag after the clock
[Figure: timing diagram of clk, D and Q]
In the synchronous model it is assumed that one flip-flop can feed another directly, assuming the clock reaches both (all) devices simultaneously, i.e. with minimal clock skew.
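The three parameters above combine into two checks for a path from one flip-flop to the next. The following is a minimal sketch of those checks, not a tool-accurate model: the `FlipFlop` class, the function names and all numeric values are illustrative assumptions, and skew is taken as the capture clock arrival minus the launch clock arrival.

```python
from dataclasses import dataclass


@dataclass
class FlipFlop:
    """Timing parameters of a flip-flop in nanoseconds (illustrative values only)."""
    t_su: float    # data set-up time before the clock edge
    t_hold: float  # data hold time after the clock edge
    t_pd: float    # clock-to-output propagation delay


def setup_slack(ff, period, logic_delay=0.0, skew=0.0):
    """Time to spare before the capturing edge; negative means a set-up violation."""
    return (period + skew) - (ff.t_pd + logic_delay + ff.t_su)


def hold_slack(ff, logic_delay=0.0, skew=0.0):
    """Time to spare after the capturing edge; negative means a hold violation."""
    return (ff.t_pd + logic_delay) - (ff.t_hold + skew)


ff = FlipFlop(t_su=0.2, t_hold=0.1, t_pd=0.3)

# Direct flip-flop to flip-flop connection with no skew: both checks pass.
print(setup_slack(ff, period=2.0), hold_slack(ff))

# A capture clock that arrives late eats into the hold margin.
print(hold_slack(ff, skew=0.5))
```

With these assumed numbers the direct connection works only while the skew stays below tpd - thold, which is why the simultaneous-clock assumption matters.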
COMP32212 Implementing System-on-Chip Designs
Section 5
Slide 1
Clock distribution
In practice clock arrival times have to be 'close enough'.
• The difference in arrival times is referred to as clock skew.
• In almost all practical cases, it must be minimised.
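As a rough illustration of the definition above, skew can be computed as the spread of clock arrival times at the flip-flops. This is a sketch under stated assumptions: the tree shape, the buffer and wire delays and both helper names are hypothetical.

```python
def clock_skew(arrival_times_ns):
    """Skew is the spread between the earliest and latest clock arrival."""
    return max(arrival_times_ns) - min(arrival_times_ns)


def arrival_time(path):
    """Sum the (buffer delay, wire delay) pairs along one root-to-leaf path."""
    return sum(buffer_ns + wire_ns for buffer_ns, wire_ns in path)


# Hypothetical three-level tree: each leaf is reached through three buffers.
balanced = [[(0.10, 0.05)] * 3 for _ in range(4)]
# Same tree, but the last leaf has a longer final wire.
unbalanced = balanced[:3] + [[(0.10, 0.05), (0.10, 0.05), (0.10, 0.20)]]

print(clock_skew([arrival_time(p) for p in balanced]))
print(clock_skew([arrival_time(p) for p in unbalanced]))
```

Note that the total latency of each path does not appear in the result; only the difference between paths does.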
Observations on clock distribution

• Clocks have high fan-out
  ◦ a clock may fan out to thousands of flip-flops
• Clock edges should be fast
  ◦ Slow edge speeds increase the uncertainty in exactly when the transition happens
• Clock signals are repetitive
  ◦ The latency of a clock reaching a flip-flop doesn't matter
  ◦ They always arrive at regular intervals.
These properties mean that clock signals require buffers but that's okay providing the skew is still kept small.
The classic 'H-tree' structure is one method of trying to distribute a clock across a chip with minimal skew. Note, all paths are the same length.
Reset skew
Reset activation is not normally a timing problem because reset will be present for some time. The removal of reset can be a problem though. Imagine reset being removed in one part of a state machine but not quite making it to another. This could cause the machine to enter an illegal state. It is sometimes necessary to synchronise an otherwise asynchronous reset to prevent this.
The clock tree needs to be well balanced, taking into consideration:
• Fan-out at each point
• Buffer strength
• Wire lengths
Fortunately, tools will help with this.
FPGA clock distribution

Modern FPGAs have a dedicated set of clock distribution networks built onto the chip which deliver clock signals to all flip-flops with minimal skew. There are typically a small number (e.g. four) of these networks so that a number of different clocks may be used. These networks can only be used for clocking flip-flops.
Timing Closure
Simulation
• Okay for a 'rough-cut'
• May be difficult to simulate critical path
Static timing analysis
• Take blocks of logic between synchronously clocked elements
• Time all possible switching paths in block
• Find the longest
Advantage
  ◦ Quick to perform
Disadvantage
  ◦ Can be too pessimistic
Timing closure
Timing closure is, basically, making the logic fit within the desired clock period.
How fast does it go?
This can be difficult to determine exactly. It is set by the critical path. From an HDL source this requires at least technology mapping into gates. Accuracy requires knowledge of gate strengths, wire load and layout detail. However it is cost effective to estimate timing early to check that the implementation strategy is feasible. Even pre-layout the tools usually give an estimate of the wire loads to yield a more realistic result.
Simulation
Simulation can indicate whether a particular sequence will fail at a particular clock speed. This is a reasonable guide but is not reliable unless either the critical path is known (and exercised) or the simulation is exhaustive. Example: a ripple carry adder. Simulation with random inputs is unlikely to find the slowest case, when a carry propagates across the whole width.
Static Timing Analysis (STA)
In this case 'static' means independent of input state. The delays through each combinatorial path can be summed and compared with the design objective. This reveals the critical path or the slack in all logic paths. In the latter case negative slack will reveal where the logic is too slow. The great advantage of static analysis is its low computational complexity. The disadvantage is that the critical path may be a false path, i.e. one whose switching sequence cannot occur in reality. In general STA will identify anything which is significantly bad at low cost.
Does it go fast enough?
No, of course it doesn't; where would the fun be in that? Seriously, what is 'fast enough'? In many applications there will be a real-time constraint to be met but exceeding it brings no additional benefit. In other applications (e.g. microprocessors) any performance increase is to be seized.
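The static timing analysis described above (summing delays along every combinatorial path, then reporting the critical path and the slack) can be sketched in a few lines. This is a toy model under stated assumptions: the netlist, its gate names and delays are invented, and flip-flop set-up and propagation times are ignored for brevity.

```python
from graphlib import TopologicalSorter

# Hypothetical block of combinatorial logic between two registers.
# Each node maps to (gate delay in ns, list of driving nodes).
NETLIST = {
    "a":    (0.0, []),            # register outputs (launch points)
    "b":    (0.0, []),
    "and1": (0.4, ["a", "b"]),
    "xor1": (0.6, ["a", "and1"]),
    "or1":  (0.5, ["xor1", "b"]),
    "out":  (0.0, ["or1"]),       # register input (capture point)
}


def arrival_times(netlist):
    """Latest possible arrival time at every node, independent of input values."""
    graph = {node: deps for node, (_, deps) in netlist.items()}
    arrival = {}
    for node in TopologicalSorter(graph).static_order():
        delay, deps = netlist[node]
        arrival[node] = delay + max((arrival[d] for d in deps), default=0.0)
    return arrival


def critical_path(netlist, end):
    """Walk back from the end point along the latest-arriving input each time."""
    arrival = arrival_times(netlist)
    path = [end]
    while netlist[path[-1]][1]:
        path.append(max(netlist[path[-1]][1], key=arrival.__getitem__))
    return list(reversed(path))


def slack(netlist, end, period_ns):
    """Positive slack is time to spare; negative slack means the logic is too slow."""
    return period_ns - arrival_times(netlist)[end]


print(critical_path(NETLIST, "out"))
print(slack(NETLIST, "out", period_ns=2.0), slack(NETLIST, "out", period_ns=1.0))
```

Each node is visited once, which reflects the low computational cost. Nothing in the sketch asks whether the reported path can actually be exercised, which is how a false path would slip through.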
How can the speed be improved?

How close are you to your target? If you're miles off you need to restructure your architecture to increase parallelism. This may be done by:
• deeper pipelining: faster clock
• increase logic parallelism: do more within clock cycle
• evaluating several things at once: do more with slower clock
• multi-cycle operations: sometimes allow more than one period
If close to target you might be able to identify and recode critical modules. Tools can also be instructed to optimise for certain criteria, such as speed, power, area, ... Normally gains in one category are paid for in others.
Technology
Many cells come in families with various drive strengths. Increasing the drive will speed up an output (and slow the input, and probably cost power). It may be possible to use different cell families to improve performance. E.g.
• High-speed: low threshold transistors switch faster but leak more
• Standard: a compromise design
• Low-leakage: high threshold transistors save power but switch slowly
Post-layout
The process may need repeating. After the wiring is factored in things (probably) have slowed down. Buffers may be added which increase the latency but speed up edges. Hopefully this process converges on something acceptable.
Optimise early to avoid wasting effort on hopeless designs. Layout and extraction all take time and more accurate modelling also takes longer.
Clock Domains
Synchronous design is a Good Thing
• Simplifies RTL design
  ◦ May be easier to think about state diagrams
• Simplifies debugging: can take a global view of state
• Tool chains optimised for such
However it is not always possible to have one clock across an SoC
• Synchronous clock distribution increasingly difficult
• Blocks may work optimally at different frequencies
  ◦ May be IP from different vendors
• Some I/O may require specific frequencies
Frequency and phase

It's only meaningful to talk about frequency with respect to repetitive signals. The frequency of a clock is the reciprocal of its period, T:
$f = \frac{1}{T}$
With two or more signals there may be a phase relationship.
[Figure: clock waveforms. Same frequency, different phase.]
• Harmonic frequencies (phase relationship is fixed)
• Non-harmonic frequencies: phase relationship drifts
It is increasingly common to have blocks running at different frequencies, or possibly the same frequency but uncertain phase.
• Sometimes just reduce (divide) master clock
• Sometimes have separate clocks.
Oscillators
Crystal oscillators
The normal clock source for digital logic is a crystal-controlled oscillator. These use vibrations in a carefully machined (piezo electric, usually quartz) crystal to stabilise an electrical oscillator circuit. Without any special care a frequency within about 50 ppm¹ is usual. If it matters, much greater stability is, of course, achievable as is demonstrated by quartz clocks.
• No two independent oscillators will run at exactly the same frequency.
• If a constant phase relationship is required a single oscillator must be used.
Frequency examples
• Digital logic is operated at frequencies of several GHz
  ◦ For ASIC design, typically think 100s MHz
• Humans tend to prefer simple numbers such as 20 MHz
• A serial line (old fashioned now) has standard baud rates of 9600, 19200, 38400, 115200, ... Hence multiples of such frequencies are not uncommon.
  ◦ Example: 18.432 MHz = 30 × 16 × 38400 = 10 × 16 × 115200
• USB uses bit rates of 12 Mb/s (USB 1), 480 Mb/s (USB 2)
• In I/O applications there is commonly some tolerance.
  ◦ E.g. RS232: a few percent
  ◦ USB: 480.00 Mbit/s ±500 ppm, 12.000 Mbit/s ±2500 ppm
Clocking and power
The clock network is a significant source of power dissipation. The power used is (effectively) proportional to clock frequency. Thus it makes no sense to clock a circuit faster than is necessary. Clock gating may be introduced to stop clocks when a block is unused but this should be done with caution!

1. Equivalent to about 4 s error per day.
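The arithmetic behind the 50 ppm footnote and the 18.432 MHz example can be checked directly. In this sketch the function names are made up for illustration, and reading the factor of 16 in the example as a 16× oversampling clock is an assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def drift_seconds_per_day(tolerance_ppm):
    """Worst-case timekeeping error accumulated over one day."""
    return tolerance_ppm * 1e-6 * SECONDS_PER_DAY


def divider_for(clock_hz, baud, oversample=16):
    """Integer divider giving exactly oversample x baud, or None if there is none."""
    quotient, remainder = divmod(clock_hz, baud * oversample)
    return quotient if remainder == 0 else None


print(drift_seconds_per_day(50))  # roughly 4.3 seconds

# Compare the 'serial friendly' 18.432 MHz with a 'simple' 20 MHz clock.
for baud in (9600, 19200, 38400, 115200):
    print(baud, divider_for(18_432_000, baud), divider_for(20_000_000, baud))
```

The 18.432 MHz clock divides exactly for every listed baud rate, whereas 20 MHz does not divide exactly for any of them under the same 16× assumption.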
Crossing clock domains

There are various possibilities for relationships between clocks.
• Synchronous circuits avoid this difficulty
• Isochronous circuits have a known, constant, phase relationship
  ◦ Maybe with blocks with harmonic frequencies
  ◦ This may be exploited (with care!) in inter-block communication
• Asynchronous clock sources cause problems!
  ◦ Sending signals between asynchronous domains is impossible with 100% reliability.
  ◦ At some stage a flip-flop set-up/hold constraint will be violated.
  ◦ We can make the probability of failure very small.
There is also the need for arbitration: which receiver cycle did the data arrive in?
Metastability
A model flip-flop
[Figure: model of a flip-flop with stable states 0 and 1 and a metastable point (?) between them]
Operation is simple.
• The flip-flop has three equilibrium positions: the stable states 0 and 1, and a metastable position half-way between.
• Violating set-up/hold conditions can result in the flip-flop entering the metastable state.
• In principle the flip-flop can stay metastable indefinitely
  ◦ But if it starts to resolve one way, positive feedback pushes it further in that direction
• The probability of remaining metastable decreases exponentially with time.
The dangers in a metastable state lie in that it can be interpreted as different values by different inputs, or at different times.
Synchronisers
A typical synchroniser looks like this:
[Figure: two D-type flip-flops in series, the Q output of the first feeding the D input of the second]
If the first flip-flop latches a valid level the second one copies this one clock period later. Otherwise the first flip-flop may go metastable but has a whole clock period to resolve to a digital state. As the violation is caused by an input data transition, the chosen state will determine whether the data is treated as having changed before or after the clock. If it is treated as having changed after the clock then the change will be picked up on the next clock edge.
The first flip-flop probably doesn't remain metastable for a whole clock period. The probability depends on the properties of the flip-flop and the length of the clock period.
If the flip-flop doesn't resolve in time it will be forced to a digital state on the next clock edge but the second flip-flop may go metastable.
Paranoid designers may add more flip-flops. Each multiplies the probability of remaining metastable by the same small number, thus if (say) 1 in 10^6 is too high, go for 1 in 10^12, 1 in 10^18, etc. Each flip-flop (delay) also increases the latency, of course. There is no certain guarantee that this will always work. However the probability of failure can be made very small. [Remember that 3 GHz translates to 3×10^9 clocks/second or about 10^19/century.]
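The numbers in the paragraph above can be put side by side. This is a back-of-envelope sketch: the per-stage figure of 1 in 10^6 is the text's own example, while the function names and the pessimistic assumption that every clock edge samples a changing input are illustrative choices.

```python
SECONDS_PER_CENTURY = 100 * 365.25 * 24 * 60 * 60


def clocks_per_century(clock_hz):
    """Number of clock edges in a century at the given frequency."""
    return clock_hz * SECONDS_PER_CENTURY


def failure_probability(p_per_stage, stages):
    """Each extra flip-flop multiplies the probability by the same small number."""
    return p_per_stage ** stages


def expected_failures_per_century(p_per_stage, stages, clock_hz):
    """Pessimistic estimate: assumes every single clock edge is at risk."""
    return failure_probability(p_per_stage, stages) * clocks_per_century(clock_hz)


print(f"{clocks_per_century(3e9):.2e} clocks per century at 3 GHz")
for stages in (1, 2, 3):
    failures = expected_failures_per_century(1e-6, stages, 3e9)
    print(stages, f"{failures:.2e}")
```

Under this deliberately pessimistic assumption even 1 in 10^18 leaves a handful of expected failures per century at 3 GHz, which is the point of the bracketed reminder. In practice only a fraction of clock edges coincide with an input transition.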
Synchroniser flip-flops
Some cell libraries provide flip-flops specifically to address this problem. They can still go metastable but they have a 'steeper hill' so they tend to resolve more quickly. They have worse properties in other respects.
Crossing clock domains

There is no need to synchronise every signal crossing a boundary explicitly.
If the request is synchronised, accompanying data will have had plenty of time to arrive.
When crossing a clock boundary, there is always:
• some latency
• a chance of failure due to persistent metastability
  ◦ small: may be reduced by adding extra flip-flops
  ◦ special flip-flops which resolve faster may be available (though not from logic synthesis!)
Crossing timing domains in the lab.

The system we are constructing has been kept as synchronous as possible. Thus the master frequency is set by the pixel clock and the drawing engine is run at the same rate. There is, however, an asynchronous input in terms of the processor bus, which is governed by a completely separate clock.
The bus arriving from the ARM is an asynchronous bus. In this context this means there is no clock signal within the bus. Timing is provided by pulses on control signals, the length of which is governed by the bus master (i.e. the processor). This type of bus is a typical (arguably old fashioned) interface used by many memory and I/O devices. The various parameter registers are therefore built as transparent latches, enabled by the strobe pulses.
[Figure: parameter register built as a transparent latch, with signals uP_nwr, uP_addr and uP_data]
Most of the time, writing to the interface has no effect on the clocked part of the circuit. Parameters are set up but not yet read. This happens on software timescales where it is easy to be confident the values will be stable long before they are used. Synchronisation is therefore unnecessary.
When a command is issued a signal must cross into the clocked domain. In this case the synchroniser shown here is used.

    // Set 'go' at the end of a write pulse to the command register;
    // clear it asynchronously when the command is acknowledged.
    always @ (posedge uP_nwr, posedge cmd_ack)
      if (cmd_ack) go <= 0;
      else if (!uP_ncs && (uP_address == 6'h08)) go <= 1;

    // Two flip-flop synchroniser bringing 'go' into the clk domain.
    always @ (posedge clk, posedge cmd_ack)
      if (cmd_ack)
        begin
          go_1    <= 0;
          cmd_req <= 0;
        end
      else
        begin
          go_1    <= go;
          cmd_req <= go_1;
        end

[Figure: synchroniser circuit and timing diagram showing uP_nwr, go, clk, go_1, cmd_req and cmd_ack]
The operation is triggered by the end of the write pulse which allows time for data to be propagated through transparent latches in the same cycle. cmd_ack is a one clock long pulse in response to an accepted cmd_req from the synchronous side. There is an assumption that a second write will not occur too soon. This can be prevented by, for example, checking the go signal in software as a status bit.
A status bit could be cleared at an arbitrary time unless it is resynchronised for the processor. This is difficult without access to the processor's clock. However if read into a processor register the bit is likely to pass through several flip-flops and thus have settled into some digital state by the time it is read and tested. If sampled in a polling loop it can be deduced that, if a bit is read just as it changes, it doesn't matter how it's interpreted.
The other possible output for such a bit is as an interrupt signal. Interrupts are routinely regarded as asynchronous and fed through synchronisers on entry to a processor. Any additional latency is small compared with the software run time.
Asynchronous arbitration
It is possible to enter an asynchronous domain with 100% reliability using an arbiter or mutual exclusion element. This is a cell which determines which of its (usually two) inputs arrived first. It achieves reliability by detecting metastability and delaying its decision until this is resolved. Unfortunately the time taken to make a decision is unbounded so this process could always take more than a clock period, however long that is.
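The ordering of events in the lab's command synchroniser (go, go_1, cmd_req, cmd_ack) can be followed with a small behavioural model. This is only a sketch of the ideal digital behaviour: the class and method names are hypothetical, and metastability itself is not modelled.

```python
class CommandSynchroniser:
    """Ideal-logic model of the go -> go_1 -> cmd_req chain."""

    def __init__(self):
        self.go = self.go_1 = self.cmd_req = 0

    def write_strobe_end(self):
        """Rising edge of uP_nwr with the command register selected."""
        self.go = 1

    def clock(self):
        """Rising edge of clk: both flip-flops update from the old values."""
        self.go_1, self.cmd_req = self.go, self.go_1

    def acknowledge(self):
        """cmd_ack asserted: all three flip-flops are cleared asynchronously."""
        self.go = self.go_1 = self.cmd_req = 0


sync = CommandSynchroniser()
sync.write_strobe_end()
sync.clock()
print(sync.cmd_req)   # request has only reached go_1 so far
sync.clock()
print(sync.cmd_req)   # request is now visible to the clocked domain
sync.acknowledge()
print(sync.go, sync.go_1, sync.cmd_req)
```

The model shows the latency cost of the crossing: the request becomes visible on the second clock edge after the write completes, and nothing further can be issued until the acknowledge has cleared go.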
Changing frequency
Reducing frequency by an integer factor is easy.
• Note that dividing by an odd number will result in an uneven duty-cycle
  ◦ This may or may not matter to you
[Figure: ÷N divider; an input clock of period T produces an output of period 3T]
• The output clock will have a fixed (unknown) phase relationship with the input
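The uneven duty-cycle for odd division ratios follows from the output only being able to change on input clock edges. The sketch below assumes a simple counter-based divider that uses one edge of the input clock and a ratio of at least 2; the function names are illustrative.

```python
def divide_clock(n, input_cycles):
    """Output level during each input clock cycle of a counter-based divide-by-n.

    The output is high while the counter is in the lower part of its range,
    so it can only change on input clock edges.
    """
    high_cycles = n // 2
    return [1 if (cycle % n) < high_cycles else 0 for cycle in range(input_cycles)]


def duty_cycle(n):
    """Fraction of one output period for which the divided clock is high."""
    one_period = divide_clock(n, n)
    return sum(one_period) / n


print(divide_clock(3, 9))
for n in (2, 3, 4, 5):
    print(n, duty_cycle(n))
```

Even ratios give an equal split, while odd ratios are always one input cycle away from it.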
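Increasing frequency is more difficult: use a Phase-Locked Loop (PLL)
[Figure: PLL block diagram. The reference frequency and the VCO output divided by N feed a phase comparator, whose 'faster' and 'slower' outputs pass through a low-pass filter (LPF) to control the VCO.]
These include some mixed-signal (analogue) components but can usually be bought-in from a specialist designer.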
Changing frequency
Why bother?
• It is difficult to carry UHF signals across a PCB; great care with tracking is required.
• Switching signals dissipates power. The more you switch, the more it costs.
• Switching signals transmits Radio Frequency Interference (RFI). (Where do you think the power goes?) This is a Bad Thing, and may be illegal.
• Generating stable clocks at UHF directly is impractical.
A typical clock source will be a crystal controlled oscillator. These are cheap and quite precise. Frequencies of the order 1-100 MHz ±50 ppm are readily available. However modern computers are typically clocked much faster. So, the usual ploy is to supply a stable frequency (say 20 MHz) to the chip and then multiply this on board to the desired clock rate.
A bonus from this strategy is that the clock multiplier is digital and can be controlled (e.g. in software) allowing a tradeoff between performance and power consumption. Another strategy is to reduce the clock rate to reduce power dissipation if the chip is becoming too hot.
PLLs
A Phase Locked Loop is a machine capable of matching the frequency of an input signal.
Everyday example
Consider a television set. It must display images at the same rate as they are broadcast. Thus it needs synchronisation information so that it can adjust its internal timing to match the transmitter. Of course, in modern sets at these slow speeds this can be done digitally by varying the number of local clock cycles in each line, frame, etc. slightly.
Clock multiplication
The slide shows a clock multiplier which works by matching a division of an output clock to an input reference.
$\frac{f_{out}}{N} = f_{in}$, so $f_{out} = N f_{in}$
LPF: Low Pass Filter
A typical phase comparator produces pulses on its outputs which indicate which input edge came first. These need integrating (smoothing) to produce a voltage which is (approximately) stable over many clock periods.
VCO: Voltage Controlled Oscillator
An oscillator which runs naturally in a certain range of frequencies which is tuned by an analogue input voltage.
Because a PLL circuit is controlled by feedback its output frequency will vary slightly around the nominal frequency. This contributes to clock jitter: the perceived variation in clock frequency. Jitter is a Bad Thing because the logic must always evaluate within the shortest clock period (not the average) and the more variation there is the shorter this minimum time will be.
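The multiplication relationship and the cost of jitter can be illustrated numerically. In this sketch the function names, the multiplier of 50 and the jitter figure are assumptions, and jitter is modelled crudely as a peak reduction of one clock period.

```python
def pll_output_hz(reference_hz, n):
    """In lock, f_out / N = f_in, so f_out = N * f_in."""
    return n * reference_hz


def shortest_period_ns(nominal_hz, peak_jitter_ns):
    """Logic must settle within the shortest period, not the average one."""
    return 1e9 / nominal_hz - peak_jitter_ns


f_out = pll_output_hz(20e6, 50)  # 20 MHz crystal multiplied by an assumed N = 50
print(f_out)
print(1e9 / f_out, shortest_period_ns(f_out, peak_jitter_ns=0.05))
```

Changing N under software control is what gives the performance versus power trade-off mentioned earlier; the time available to the logic is the nominal period less the jitter allowance.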
Clock Gating
If part of a device is not in use, its clock may be stopped (gated). This is more power-economic than simply disabling the registers. However there are several concomitant hazards and it's easy to introduce unpleasant clock skew or even glitches if care is not exercised. Don't do this by hand until you've lots of experience! This is an option best left to the tools (if available). Note that adding gating may compromise peak performance so is not always desirable.
Miscellany
A collection of other timing-related issues.
Timing checking tools
A number of tools exist in order to assist with timing closure. Many of these are only appropriate when a physical realisation of the chip is available.
• Static Timing Analysis (STA)
• Edge speed analysis
• Hold time checking
• Clock skew analysis
• ...
Multi-cycle paths
It is sometimes expedient (and convenient) to allow logic more than one clock period to settle. This may be sensible but you need to tell the tools.
Tools
A Static Timing Analyser (introduced earlier) will give an estimate of the critical path in a system by searching all paths between clocked registers and finding the slowest. This then sets the standard for other logic speeds; there is (usually) no point in optimising any logic paths already faster than the critical path. The delay of the critical path will depend on the number of serial logic gates, their type, the fanout and other factors affecting the electrical load (particularly wire lengths) and their output impedance or drive strength.
All these factors go into the mix when attempting to optimise the circuit. Typically, synthesis tools will have options which allow the engineer to put more importance on speed, size, power etc. It may be that a circuit can be optimised for speed but this may result in it being larger or more power hungry.
Edge speeds are the time it takes a digital circuit to switch between states. They depend on the outputting gate's drive and the (capacitive) load it needs to switch. Edges which are too slow may introduce problems such as:
• induced noise near the threshold may be received and amplified
• different target gates may see the input switch at (significantly) different times
• increased time spent near the half way level may result in an extra power drain
Tools are available to identify any slow¹ edges, possibly for further attention.
With challenging speed targets a flip-flop may be designed with a data hold time longer than its propagation delay. With such it would be dangerous to connect one flip-flop output directly to another's input. Any logic in-between will naturally act as an additional delay and help meet the true constraints. Hold-time checking will identify any remaining risks here and allow extra buffer insertion.
Note that problems with a too-long critical path may be accommodated by reducing the clock frequency. Hold-time problems are a property of the circuit and there is no cure if they appear in the chip!

1. The user can define what 'slow' means.
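The contrast between the two kinds of problem described above can be made concrete: a long critical path raises the minimum clock period, whereas a hold problem has no dependence on the period at all. This is a minimal sketch with made-up delays in picoseconds and hypothetical function names.

```python
def min_period_ps(t_pd, t_su, path_delay):
    """Shortest workable clock period: slowing the clock always fixes set-up."""
    return t_pd + path_delay + t_su


def hold_buffers_needed(t_pd, t_hold, path_delay=0, skew=0, buffer_delay=40):
    """Buffers to insert so that t_pd plus the path delay covers t_hold plus skew.

    The clock period does not appear anywhere in this calculation.
    """
    shortfall = (t_hold + skew) - (t_pd + path_delay)
    if shortfall <= 0:
        return 0
    return -(-shortfall // buffer_delay)  # ceiling division on integers


# A flip-flop whose hold time (300 ps) exceeds its propagation delay (200 ps).
print(hold_buffers_needed(t_pd=200, t_hold=300))                  # direct connection
print(hold_buffers_needed(t_pd=200, t_hold=300, path_delay=150))  # logic in between
print(min_period_ps(t_pd=200, t_su=100, path_delay=1500))
```

With these assumed numbers a direct flip-flop to flip-flop connection needs extra buffers, while a path that already contains enough logic needs none.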
Delay lines
It is possible and sometimes necessary to build delays onto ASICs. An approximate delay can be produced with a chain of inverters or buffers; the actual delay on a given design and process may vary by a factor of two or more depending on the manufacturing and operation conditions of the chip. Precise delays need to be calibrated against a reliable reference frequency. These are typically chains of gates (as above) whose length can be altered (e.g. by multiplexing output taps) to give the nearest available approximation to the required delay. Periodic recalibration may be needed due to thermal drift. An example would be a Delay-Locked Loop (DLL). For instance Xilinx FPGAs contain a small number of DLLs which allow the insertion of a known delay. A typical application is to delay a clock signal so that edges at the leaves of the distribution tree are in phase (via a total delay of a number of clock cycles) with an incoming reference. This effectively removes the clock buffer delays.
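Selecting an output tap on a calibrated delay chain might look like the sketch below. The calibration method shown (counting how many stages span one period of the reference) is an assumption for illustration rather than a description of any particular DLL, and all names and numbers are hypothetical.

```python
def calibrate_stage_delay_ns(reference_hz, stages_per_reference_period):
    """Per-stage delay inferred from how many stages span one reference period."""
    return (1e9 / reference_hz) / stages_per_reference_period


def nearest_tap(required_delay_ns, stage_delay_ns, n_taps):
    """Number of stages whose total delay is closest to the required delay."""
    return min(
        range(1, n_taps + 1),
        key=lambda stages: abs(stages * stage_delay_ns - required_delay_ns),
    )


# Cool chip: 100 stages span one period of a 100 MHz reference.
cool = calibrate_stage_delay_ns(100e6, 100)
# Hot chip: each stage is slower, so only 80 stages span the same period.
hot = calibrate_stage_delay_ns(100e6, 80)

print(nearest_tap(1.23, cool, n_taps=64))
print(nearest_tap(1.23, hot, n_taps=64))
```

The selected tap moves as the per-stage delay drifts, which is why periodic recalibration may be needed.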
Chip variation
Gate speed depends on various manufacturing and operation conditions, normally referred to as 'PVT' for Process, Voltage, Temperature.
• Process: variation in manufacturing such as transistor doping density.
• Voltage: the supply voltage at a gate will be less than that at the chip's pins (Ohm's Law); this varies across the chip and may fluctuate due to other power demands elsewhere.
• Temperature: hotter is slower.
