A Scalable Dual-Clock FIFO For Data Transfers Between Arbitrary and Haltable Clock Domains
I. INTRODUCTION
Ever-shrinking transistor sizes have enabled the integration of a greater number of components onto a single
chip, thus making systems-on-a-chip (SoCs) with many complex modules a common design solution. Unfortunately, global
interconnect scaling has not been able to maintain the same
performance increases [1], causing the timing of high speed
global clock signals to become a major concern in system
design. This has resulted in clock distribution circuits requiring
increasing circuit resources and design time.
Nearly all existing digital systems utilize synchronous design
techniques which normally require an accurate and highly synchronized global clock reference to be supplied to all areas of
the circuit. One solution for coping with the clock distribution
problem is to utilize self-timed or asynchronous circuits, which
do not have a global timing reference signal. However, the lack
of mature design tools and the reluctance of industry to incur the
cost and risk of moving away from successful synchronous design flows have limited the acceptance of these design styles [2].
An alternative approach is to create systems that mix asynchronous and synchronous design techniques using a globally
asynchronous locally synchronous (GALS) [3] design approach. In this paradigm, blocks are built using traditional
Manuscript received September 2, 2006; revised March 9, 2007. This work
was supported in part by Intel Corporation, by University of California (UC)
Micro, by the National Science Foundation under Grant 0430090, by MOSIS,
and by a University of California at Davis (UCD) Faculty Research Grant.
R. W. Apperson is with Boston Scientific CRM Division, Redmond, WA
98052 USA.
Z. Yu, T. Mohsenin, and B. M. Baas are with the Electrical and Computer Engineering Department, University of California, Davis, CA 95616 USA (e-mail:
zhyyu@ece.ucdavis.edu; bbaas@ucdavis.edu).
M. J. Meeuwsen is with the Digital Enterprise Group, Intel, Hillsboro, OR
97124 USA.
Digital Object Identifier 10.1109/TVLSI.2007.903938
TABLE I
COMPARISON OF VARIOUS DUAL-CLOCK FIFO DESIGNS
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 10, OCTOBER 2007
The FIFO designed by Chakraborty et al. requires time to develop a frequency difference estimate before transferring data, as
well as usage of different circuits depending on which clock domain has the higher rate [9]. Siezovic [10] presents a linear FIFO
architecture for data synchronization, which has the limitations
presented in Section II-A. An alternative FIFO architecture for
use in some dual-clock applications is presented by Chelcea and
Nowick [11]. The design uses independent registers as storage
elements, and each register has its own [...] and [...] signals.
This scheme reduces the latency when the FIFO size is small,
but is less suitable when the FIFO size is large.
This work uses a dual-port SRAM as the storage element,
which increases memory density and improves FIFO size scalability [13]. Compared with the most similar previous work
[12], this design includes configurable logic to make it suitable
for many environments, and also enables complete oscillator
halting during idle times to achieve high energy efficiency. The
proposed FIFO design has been fabricated in what we believe is
the first VLSI implementation of a GALS array processor [14].
B. Paper Outline
Section II introduces key structures and parameters for all
styles of FIFO buffers by analyzing the single-clock case.
Section III discusses synchronization and metastability issues.
Section IV describes the proposed design of an efficient and robust dual-clock FIFO architecture. Finally, Section V describes
a specific hardware implementation of the dual-clock FIFO
architecture.
II. SINGLE-CLOCK FIFOS
To best address dual-clock FIFO issues, we first consider the
case of a single-clock synchronous FIFO. This section covers
these fundamental FIFO principles.
A. Linear FIFOs
The simplest FIFO structure consists of a linear chain of
latches or flip-flops connected serially as a shift register. Data is
shifted into one end of the chain and propagates through every
memory element until it reaches the end as shown in Fig. 1.
This FIFO is synchronous since all movement of data requires
a common clock.
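The shift-register behavior described above can be sketched in a few lines of Python. This is only a behavioral illustration, not the circuit of Fig. 1; the class name `LinearFifo` and the use of `None` to mark an empty stage are assumptions made for the example.

```python
class LinearFifo:
    """Behavioral model of a synchronous linear FIFO (a shift register)."""

    def __init__(self, depth):
        # None marks a stage that currently holds no valid datum.
        self.stages = [None] * depth

    def clock(self, data_in=None):
        """Advance one clock edge: every stage shifts by one position.

        Returns the value that leaves the last stage on this edge.
        """
        data_out = self.stages[-1]
        self.stages = [data_in] + self.stages[:-1]
        return data_out


fifo = LinearFifo(depth=4)
# Three data items enter, followed by idle cycles that flush them out.
outputs = [fifo.clock(x) for x in ["a", "b", "c", None, None, None, None]]
```

In this model every datum needs as many clock edges as there are stages before it appears at the output, which illustrates why latency grows with the length of a linear FIFO.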
Alternatively, a linear elastic FIFO uses control signal handshakes to propagate data from location to location. Unlike the
synchronous case, a datum can propagate through the FIFO
without any new items entering. This results in the FIFO being
at various degrees of fullness, hence, the name elastic. FIFOs
of this nature work well with asynchronous designs and many
examples of these can be found in the literature [15], [16]. A
simple example of this type of FIFO is shown in Fig. 2.
Fig. 4. Typical WRITE and READ address pointer scheme for a circular FIFO
and its full definition when rd_ptr [...] wr_ptr.
III. SYNCHRONIZATION
A fundamental problem in systems lacking a single global
timing reference is synchronization. In general, the timing relationship between a signal and a clock can be cast into one
of five categories [4], [18]: 1) synchronous; 2) mesochronous,
where the signal is the same frequency as the clock, but has a
constant phase difference; 3) plesiochronous where the signal
is at a frequency close, but not identical to the clock frequency,
which implies a varying phase difference; 4) periodic, where the
signal has an unknown relationship to the clock, but is periodic
in nature; and 5) asynchronous, where the signal is completely
unrelated to the clock and signal transitions are arbitrary.
A. Metastability
Metastability is a fundamental problem present when interfacing asynchronous blocks [4] and is caused by registers not
receiving a stable input signal near the active edge of the clock
signal. Synchronization methods are used to avoid or reduce the
probability of metastability. An approximation for modeling the
mean time between failures (MTBF) is shown in (2) [19], where
$f_{clk}$ is the clock frequency, $f_{data}$ is the input data event frequency,
$t_r$ is the allowed settling time before sampling, $\tau$ is the exponential
time constant of the metastability decay rate, and $T_0$ is
the asymptotic width of the time aperture in which the device
can enter the metastable state, normalized to a response time of
zero [20]

$$\mathrm{MTBF} = \frac{e^{t_r/\tau}}{T_0 \, f_{clk} \, f_{data}} \tag{2}$$
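To make the exponential dependence on settling time concrete, the following sketch evaluates the commonly used approximation MTBF = e^(t_r/tau) / (T_0 * f_clk * f_data). The function name, the parameter names, and all numeric values are hypothetical and chosen only for illustration; they are not the simulated device parameters reported in this paper.

```python
import math


def mtbf_seconds(t_r, tau, t_0, f_clk, f_data):
    """Approximate mean time between synchronization failures, in seconds.

    t_r    : allowed settling time before sampling (s)
    tau    : time constant of the metastability decay (s)
    t_0    : width of the metastability aperture (s)
    f_clk  : sampling clock frequency (Hz)
    f_data : input data event frequency (Hz)
    """
    return math.exp(t_r / tau) / (t_0 * f_clk * f_data)


# Hypothetical parameters, for illustration only.
params = dict(tau=50e-12, t_0=100e-12, f_clk=500e6, f_data=50e6)

# One extra clock period of settling time (2 ns at 500 MHz), as an
# additional synchronizer register would provide.
short_settle = mtbf_seconds(t_r=2e-9, **params)
long_settle = mtbf_seconds(t_r=4e-9, **params)
improvement = long_settle / short_settle  # equals exp(2e-9 / 50e-12)
```

Because the settling time appears in the exponent, each added synchronizer stage multiplies the MTBF by a large constant factor rather than adding to it.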
Fig. 6. Example usage of the proposed dual-clock FIFO for transferring data between producer and consumer blocks that each contain haltable oscillators. Note
that the architecture supports halting the clock oscillators, and not just stopping the local FIFO clock. The two variables represent delays from the producer to
the consumer, with [...] equal to the delays for data_in and wr_valid, and [...] equal to the delay for clk_wr.
Fig. 7. High-level diagram of the blocks within the proposed dual-clock FIFO.
[...] and [...], generated by [...] logic
and equivalence blocks, respectively, are used for the clock stop
and wake-up logic that is further described in Section IV-E.
READ and WRITE address pointers are used to indicate the beginning and end of the valid data. To prevent multibit word failures while crossing the clock domain boundaries, the address is
transformed to a Gray code representation. Sync blocks are used
to synchronize the information which is passed across the clock
boundary. A configurable multiple register synchronization circuit is used to alleviate metastability issues.
The FIFO calculates whether or not it is empty on the READ
side. If the FIFO is not empty, the consumer asserts a request
signal indicating it would like data. The FIFO indicates whether
or not it is full on the WRITE side. The producer should only
send data when the FIFO is not full, and it indicates valid data
by asserting a wr_valid signal. Ideally, the producer should
stop writing data immediately when the FIFO is full. But in
order to stop writing to the FIFO, the producer needs to receive
the FIFO full signal through some WRITE logic, and only then does it stop
sending data. These steps may cost several clock cycles through
the [...] and [...] delays. A configurable reserve space (described
in Section IV-C) is added to guarantee correct functionality.
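The need for a reserve space can be illustrated with a small worst-case model: the producer writes one word per cycle, nothing is read, and the producer only observes the full indication some cycles after it is raised. The function below is a sketch under these assumptions; the names `peak_occupancy`, `reserve`, and `lag`, and the policy of raising full when the occupancy reaches `depth - reserve`, are hypothetical and do not describe the detector circuit of this design.

```python
def peak_occupancy(depth, reserve, lag, cycles=200):
    """Worst-case fill level when the producer reacts to 'full' late.

    depth   : number of words in the FIFO memory
    reserve : words held back; 'full' is raised at depth - reserve
    lag     : cycles before the producer observes a change in 'full'
    """
    occupancy = 0
    full_history = []
    for t in range(cycles):
        # The FIFO raises 'full' early, leaving 'reserve' words free.
        full_history.append(occupancy >= depth - reserve)
        # The producer sees the value of 'full' from 'lag' cycles ago.
        full_seen = full_history[t - lag] if t >= lag else False
        if not full_seen:
            occupancy += 1  # producer writes; no reads in the worst case
    return occupancy


safe = peak_occupancy(depth=32, reserve=3, lag=3)
unsafe = peak_occupancy(depth=32, reserve=0, lag=3)
```

In this model the FIFO never overflows as long as the reserve is at least as large as the number of words that can still arrive after full is raised, which is why the reserve is made configurable to match the delays of a given system.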
B. Address Pointer and Gray Coding
The proposed architecture utilizes READ and WRITE address
pointers to track occupancy of the FIFO. The pointers are increased
to $\log_2 N + 1$ bits to allow straightforward use of all $N$
memory words. Because many applications do not allow local
clock pausing, the technique of increasing the metastability resolution time is used to pass pointers across clock domains.
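The benefit of widening the pointers by one bit can be seen in a small behavioral sketch: with the extra bit, equal pointers mean empty, while pointers that differ by exactly the memory depth mean full, so every memory word is usable. The class and method names below are assumptions for illustration and do not correspond to signal names in the design.

```python
class PointerPair:
    """READ/WRITE pointers that are one bit wider than the memory address."""

    def __init__(self, depth):
        # The depth must be a power of two for the masking below to work.
        assert depth > 0 and depth & (depth - 1) == 0
        self.depth = depth
        self.mask = 2 * depth - 1  # pointers count modulo 2 * depth
        self.wr_ptr = 0
        self.rd_ptr = 0

    def occupancy(self):
        return (self.wr_ptr - self.rd_ptr) & self.mask

    def empty(self):
        return self.wr_ptr == self.rd_ptr

    def full(self):
        return self.occupancy() == self.depth

    def address(self, pointer):
        # The memory address drops the extra most significant bit.
        return pointer & (self.depth - 1)

    def write(self):
        assert not self.full()
        self.wr_ptr = (self.wr_ptr + 1) & self.mask

    def read(self):
        assert not self.empty()
        self.rd_ptr = (self.rd_ptr + 1) & self.mask


pointers = PointerPair(depth=4)
for _ in range(4):
    pointers.write()
# Both pointers now address word 0, yet the extra bit marks the FIFO as full.
same_address = pointers.address(pointers.wr_ptr) == pointers.address(pointers.rd_ptr)
```

Without the extra bit, the full and empty states would both appear as equal pointers, and one memory word would have to be left unused to tell them apart.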
Since the address pointers are susceptible to multibit word
failures, they are transformed to a Gray code representation
before being passed across the clock boundary. Addresses are
then converted back to binary format in the other domain since
arithmetic is most naturally performed on binary numbers. As
described in Section III-C, this approach guarantees correct
pointer transfer regardless of relative clock frequencies or
pausing. In the case where the old pointer value is received,
the pointer will merely be interpreted as having remained at its
old location (i.e., no READs or WRITEs have occurred). While
this potentially adds latency to the system, it will never cause
incorrect FIFO operation.
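The following sketch illustrates why a Gray coded pointer is safe to sample at an arbitrary instant. It enumerates every word a receiver could capture if each bit were independently seen at either its old or its new value, which is a deliberately simplified model of sampling during a transition; the helper names are hypothetical.

```python
from itertools import product


def to_gray(value):
    """Standard binary-to-Gray mapping on non-negative integers."""
    return value ^ (value >> 1)


def possible_samples(old, new, width):
    """Words a receiver might capture while 'old' changes to 'new'.

    Each bit is independently taken from either the old or the new word.
    """
    results = set()
    for picks in product((0, 1), repeat=width):
        word = 0
        for bit, pick in enumerate(picks):
            source = new if pick else old
            word |= source & (1 << bit)
        results.add(word)
    return results


# Binary pointer stepping from 3 (011) to 4 (100): all three bits change.
binary_samples = possible_samples(3, 4, width=3)
# The same step in Gray code: 010 to 110, only one bit changes.
gray_samples = possible_samples(to_gray(3), to_gray(4), width=3)
```

In the binary case any of the eight 3-bit words may be captured, whereas in the Gray coded case only the old or the new pointer value can appear, matching the observation that a stale pointer adds latency but never corrupts FIFO operation.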
Special circuits are required to convert pointers between
binary and Gray code formats. Given an $n$-bit binary vector
$B = b_{n-1} b_{n-2} \cdots b_0$, the equations in (3) can be used to
convert to an $n$-bit Gray coded vector $G = g_{n-1} g_{n-2} \cdots g_0$,
where $\oplus$ indicates the sum ignoring the carry. This can be
accomplished using the XOR function. The worst case gate
delay for this calculation is one XOR gate; a total of $n-1$ XOR
gates are required

$$g_{n-1} = b_{n-1}, \qquad g_i = b_{i+1} \oplus b_i, \quad 0 \le i \le n-2 \tag{3}$$
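Both conversion directions can be sketched bit by bit using the standard reflected Gray code relations: g[n-1] = b[n-1] and g[i] = b[i+1] XOR b[i] for binary to Gray, and b[n-1] = g[n-1] and b[i] = b[i+1] XOR g[i] for Gray to binary. The function names are hypothetical, and the code is a software illustration rather than the converter circuits of this design.

```python
def binary_to_gray(b, n):
    """Convert an n-bit binary value to Gray code using n - 1 XORs."""
    g = b & (1 << (n - 1))  # the most significant bit passes through
    for i in range(n - 1):
        bit = ((b >> (i + 1)) ^ (b >> i)) & 1  # g_i = b_(i+1) XOR b_i
        g |= bit << i
    return g


def gray_to_binary(g, n):
    """Convert an n-bit Gray code value back to binary.

    Each binary bit depends on the binary bit above it, so the XORs
    form a chain from the most significant bit downward.
    """
    b = 0
    running = 0
    for i in range(n - 1, -1, -1):
        running ^= (g >> i) & 1  # b_i = b_(i+1) XOR g_i
        b |= running << i
    return b


# Six-bit pointers, as a 32-word FIFO with one extra pointer bit would use.
WIDTH = 6
codes = [binary_to_gray(value, WIDTH) for value in range(2 ** WIDTH)]
```

Every binary-to-Gray output bit depends on only two input bits, so that direction has a depth of one XOR gate. In a straightforward ripple implementation the reverse direction chains up to n - 1 XOR gates.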
When exchanging address pointers, it is crucial to take into
account memory core READ and WRITE latencies to avoid data
corruption and loss. Delays that compensate for these latencies
consist of registers placed immediately before the data synchronization circuits as indicated in Fig. 7.
C. Reserve Space and [...] Detector
TABLE II
METASTABILITY PARAMETERS FROM HSPICE NUMERICAL SIMULATOR
(test conditions: 100 °C and V_DD = 1.62 V)
TABLE III
ESTIMATED MEAN TIME BETWEEN FAILURES USING SYNCHRONIZER FROM FIG. 8 AND WORST CASE SIMULATED DEVICE PARAMETERS
(T[...] = 720 ps, [...] = 30.5 ps, t[...] = 300 ps, t[...] = 50 ps, AND t[...] = 700 ps)
Fig. 9. WRITE clock halting due to the full FIFO and restarting due to non-full
FIFO.
Fig. 11. Simplified clock halting and restarting circuit.
desired signal directly into the clock port of a flip-flop. The circuit is simple, but it is sensitive to noise and is not safe for many
physical design flows. Another method, shown in Fig. 10(b),
is to use two registers to check changing data. This is a safe
method, but requires the availability of the clock signal, which
is not guaranteed in our situation.
In the proposed design, simple and safe combinational logic
is used to control the clock halt and restart functions. This structure
slightly changes the scheme shown in Fig. 9: the WRITE clock is
stopped when both [...] and [...] are high, and is
restarted when either of them goes low. This logic can be simply
implemented using an AND gate. The same logic exists in the
FIFO empty situation: the READ clock is stopped when both [...] and [...]
are high, and it is restarted when either of them
goes low.
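The halt and restart rule described above amounts to a two-input AND function. The sketch below models it in Python; the argument names `flag_a` and `flag_b` are placeholders for the two control signals of either the WRITE or the READ side and are not names taken from the design.

```python
def clock_halted(flag_a, flag_b):
    """AND gate: the clock is halted only while both inputs are high."""
    return flag_a and flag_b


def running_trace(inputs):
    """Return, for each time step, whether the clock oscillator runs."""
    return [not clock_halted(a, b) for a, b in inputs]


trace = running_trace([
    (False, False),  # neither condition holds: the clock runs
    (True, False),   # only one condition holds: the clock still runs
    (True, True),    # both inputs high: the clock halts
    (False, True),   # one input drops: the clock restarts
])
```

Because the decision is purely combinational, it needs no clock of its own, which matters when the oscillator being controlled is the one that has stopped.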
The simplified clock halt and restart circuit block diagram is
shown in Fig. 11. The [...] and [...] signals
are discussed in Section IV-E3.
3) The Consistent Signals: Simply using the combination of
[...] and [...] to stop the clock results in wasted power
dissipation in some cases. One example is shown in Fig. 12.
In this example, the producer is stopped due to its own
FIFO being empty, and its clock [...] is, therefore, off. When
the consumer FIFO becomes empty, it is supposed to stop its
clock too. However, [...] in the consumer is controlled
by the producer's clock, [...], which is halted. As a result,
[...] stays low and the consumer's clock [...] keeps
running and wastes power.
Fig. 12. Example showing clock stopping with a stuck-on consumer's clock
before using the consistent signal.
Fig. 13. Example showing clock stopping with a properly stalled consumer's
clock oscillator using the consistent signal.
TABLE IV
AREA BREAKDOWN FOR THE FIFO HARDWARE MODULE
A. Full-Custom Implementation
A 32-word implementation of the proposed dual-clock architecture is designed in a 0.18-µm CMOS technology utilizing a
full-custom design approach and scalable design rules.
HSPICE simulation estimates from extracted layout predict
the critical path on the write side consists of the Gray-to-binary converter, adder, and a flip-flop. The delays under typical conditions are 535, 515, and 250 ps, respectively. The total
minimum cycle time delay is approximately 1.3 ns. The read
side's critical path delay is also 1.3 ns. The resulting maximum
clock frequency for the entire FIFO is, therefore, approximately
770 MHz.
The final design supports a throughput of one datum per clock
cycle up to its maximum clock frequency. This occurs when the
consumption and production rates are similar and is not limited by the clock frequency. The minimum latency is an imprecise number since it depends on the phases and frequencies
of the read and write clocks and the synchronizers' configurations. With two synchronization registers, latency is bounded to
no more than three WRITE clock cycles plus three READ clock
cycles, which corresponds to 7.8 ns with both clocks at their
maximum frequency in this design.
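The timing figures quoted above follow from simple arithmetic, which the short sketch below reproduces from the stated stage delays and the stated cycle bound. The variable names are chosen for this example only.

```python
# Write-side critical path stage delays under typical conditions, in ps.
stage_delays_ps = {
    "gray_to_binary_converter": 535,
    "adder": 515,
    "flip_flop": 250,
}

# Minimum cycle time is the sum of the stage delays on the critical path.
cycle_time_ps = sum(stage_delays_ps.values())

# Maximum clock frequency in MHz: 1 / (cycle time in ps) scaled by 1e6.
max_frequency_mhz = 1e6 / cycle_time_ps

# Latency bound with two synchronization registers: three WRITE cycles
# plus three READ cycles, with both clocks at the minimum cycle time.
latency_bound_ns = (3 + 3) * cycle_time_ps / 1000
```

The sum gives a 1300 ps cycle, a maximum frequency of about 769 MHz (quoted as approximately 770 MHz), and a latency bound of 7.8 ns.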
The layout of the dual-clock FIFO module without global
wiring is shown in Fig. 14. It occupies approximately
44 761 µm² of active area with a minimum rectangle area
of 66 500 µm². This area is larger than it otherwise would be
for three major reasons: 1) transistor sizes are large and circuits
are optimized for high speed, 2) layout is done using relaxed
scalable design rules that allow easy portability across many
vendors and technology generations, but also significantly
increase area, and 3) the layout is a first-generation design and
has not been significantly optimized for reduced area. Table IV
shows the active areas for the individual components of the
FIFO.
Fig. 15. Chip micrograph of two processors in a GALS chip, where each processor contains two of the proposed dual-clock FIFOs.
VI. CONCLUSION
The proposed dual-clock FIFO architecture is well suited for
many dual-clock applications and achieves high energy efficiency, good scalability and area utilization, high clock rates,
and arbitrarily high robustness. This architecture can be utilized
as a drop-in module to many applications.
The FIFO is implemented using 0.18-µm standard cell technology and embedded in a GALS array processor. The FIFO occupies 25,000 µm² and operates over 580 MHz at 1.8 V, with simultaneous FIFO READs and FIFO WRITEs consuming 10.3 mW
under those conditions.
ACKNOWLEDGMENT
The authors would like to thank R. Krishnamurthy, M. Anders, S. Mathew, E. Work, other members of the VCL Laboratory, and Artisan for their support and assistance.
Ryan W. Apperson received the B.S. degree in electrical engineering (magna cum laude) from the University of Washington, Seattle, and the M.S. degree
in electrical and computer engineering from the University of California, Davis.
He is currently an IC Design Engineer with
Boston Scientific CRM Division, Redmond, WA.
His research interests include multiclock domain
systems and SRAM design.
REFERENCES
[1] R. Ho, K. W. Mai, and M. A. Horowitz, "The future of wires," Proc.
IEEE, vol. 89, no. 4, pp. 490–504, Apr. 2001.
[2] G. Semeraro and G. Magklis et al., "Energy-efficient processor design using multiple clock domains with dynamic voltage and frequency
scaling," in Proc. Int. Symp. High-Perform. Comput. Arch., 2002, pp.
29–40.
[3] D. M. Chapiro, "Globally-asynchronous locally-synchronous systems," Ph.D. dissertation, Dept. Comput. Sci., Stanford Univ.,
Stanford, CA, 1984.
[4] W. J. Dally and J. W. Poulton, Digital Systems Engineering. Cambridge, U.K.: Cambridge Univ. Press, 1998.
[5] M. Balch, Complete Digital Design, 1st ed. New York: McGraw-Hill,
2003.
[6] J. Ebergen, "Squaring the FIFO in GasP," in Proc. Int. Symp. Asynch.
Circuits Syst., 2001, pp. 194–205.
[7] C. E. Molnar, I. W. Jones, W. S. Coates, and J. K. Lexau, "A FIFO ring
performance experiment," in Proc. Int. Symp. Asynch. Circuits Syst.,
1997, pp. 279–289.
[8] M. R. Greenstreet, "Implementing a STARI chip," in Proc. IEEE Int.
Conf. Comput. Des., 1995, pp. 38–43.
Zhiyi Yu received the B.S. and M.S. degrees in electrical engineering (with honors) from Fudan University, Shanghai, China. He is currently pursuing the
Ph.D. degree in electrical and computer engineering
from the University of California, Davis.
He was a key contributor and designer of the
36-processor programmable GALS Asynchronous
Array of simple Processors (AsAP) chip. His
research interests include high-performance and
energy-efficient digital VLSI design with an emphasis on many-core GALS clocking and efficient
processor interconnects.
Tinoosh Mohsenin received the B.S. degree in electrical engineering from Sharif University, Tehran,
Iran, and the M.S. degree in electrical and computer
engineering from Rice University, Houston, TX. She
is currently pursuing the Ph.D. degree in electrical
and computer engineering from the University of
California, Davis.
She is the designer of the Split-Row and MultiSplit-Row Low Density Parity Check (LDPC)
decoding algorithms. Her research interests include
energy efficient and high performance signal processing and error correction architectures including multi-gigabit full-parallel
LDPC decoders and many-core processor architectural design.