Clocked and Asynchronous FIFO Characterization and Comparison
Fig. 4. Block diagram of a 4-deep Linear FIFO. L1–L4 are the controller
of Fig. 1, D1–D4 are latch banks.
Fig. 8. Block diagram of the S-4-4 square FIFO. ⟨1⟩ shows the path of the
first datum, ⟨2⟩ of the second.

The block diagram of a square FIFO is shown in Fig. 8.
This design consists of a row of steering cells on the top and
bottom, with parallel legs in between. Data flows across the
top row to each leg in order, down, and then out at the bottom
right. The path of the first two tokens is shown in the figure
with arcs labeled ⟨1⟩ and ⟨2⟩. The controllers in the bottom
row first steer the datum from the top to the output, then take
one or more tokens from the left based on the location in the
rectangle. The degenerate case of a single leg equals a linear
structure. The square toggle and merge templates, like the
parallel modules, are not pipelined. Fig. 9 shows a pipelined
linear controller connected to a square toggle module. Like
the unpipelined parallel design, these templates also reduce
the frequency of the design approximately in half.

IV. CHARACTERIZATION
First order equations were developed to represent the structure,
capacity, and forward latency for each design. The
equations are based on the number of parallel legs $N_l$, the
capacity of each leg $C_l$, the capacity of the FIFO $C_f$, the
clock period $P_c$, and the latency of the asynchronous linear
controller $L_c$. The top two classes in the following table are
clocked, the latter four are asynchronous.

Design      Structure           Capacity                 Fwd Latency
Clk Linear  -                   $C_l$                    $(C_f / 2) P_c$
Clk H/T     -                   $N_l$                    $3 P_c$
A. Linear   -                   $C_l$                    $C_f L_c$
Parallel    P-$N_l$-$C_l$       $2 + N_l C_l$            $(C_l + 4) L_c$
Tree        T-$N_l$-$C_l$       $2(N_l - 1) + N_l C_l$   $(2 \log_2 N_l + C_l) L_c$
Square      S-$N_l$-$(C_l+2)$   $N_l (C_l + 2)$          $(2 N_l + C_l) L_c$

The structures are specified by the class, number of parallel
legs, and capacity of each leg. Thus the parallel FIFO of Fig. 5
is a P-4-2 FIFO and the tree design of Fig. 7 is a T-4-1
configuration.

A. Latency and Throughput
Forward latency is defined as the delay from placing a token
into the head of an empty FIFO until it has been read from the
tail. The backward latency is its dual: the delay from removing
a token from the tail of a full FIFO until one can place a new
token in the head.

Latency has a major effect on FIFO throughput under certain
occupancy ranges. Throughput is limited by the latency in all
designs when the occupancy is near empty or full [7]. The
throughput of a FIFO is therefore dependent on its occupancy
and cycle time.
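As a minimal sketch, the first-order equations of the table in Section IV can be evaluated directly. The Python below is illustrative only: the function names are hypothetical, the time values passed in are placeholders rather than measured data, and two readings of the table are assumptions, namely that the first three rows carry no structure label and that the tree capacity is 2(N_l - 1) + N_l C_l.

```python
import math

# Each function returns (capacity in words, forward latency in time units).
# n_l: number of parallel legs, c_l: capacity of each leg,
# p_c: clock period, l_c: latency of the asynchronous linear controller.

def clocked_linear(c_l, p_c):
    c_f = c_l                      # a linear FIFO is a single leg
    return c_f, (c_f / 2) * p_c

def clocked_head_tail(n_l, p_c):
    return n_l, 3 * p_c            # latency independent of capacity

def async_linear(c_l, l_c):
    c_f = c_l
    return c_f, c_f * l_c

def async_parallel(n_l, c_l, l_c):
    return 2 + n_l * c_l, (c_l + 4) * l_c

def async_tree(n_l, c_l, l_c):
    # Assumed reading of the garbled table entry: 2(N_l - 1) + N_l * C_l
    capacity = 2 * (n_l - 1) + n_l * c_l
    return capacity, (2 * math.log2(n_l) + c_l) * l_c

def async_square(n_l, c_l, l_c):
    # Named S-N_l-(C_l + 2), so S-4-4 corresponds to n_l=4, c_l=2
    return n_l * (c_l + 2), (2 * n_l + c_l) * l_c
```

Under these assumptions P-4-2 and P-2-4 both evaluate to a capacity of 10, matching Fig. 11, while their forward latencies differ (6 versus 8 controller latencies), which is the kind of trade-off between configurations of equal capacity that the equations are meant to expose.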
Throughput vs. occupancy is measured by keeping the number
of data items in the FIFO constant at all times. The FIFO is
first initialized with a particular occupancy. Thereafter, when a
datum is removed from the tail of the FIFO, a new datum
is simultaneously added to the head of the FIFO. When a FIFO is
near empty, throughput is reduced due to the forward latency
of tokens. More data could be added to the FIFO, but this
must be delayed until data is valid at the output to maintain
a constant occupancy. When the FIFO is nearly full, the dual
applies. Data cannot be removed from the FIFO until a
bubble, or empty position, propagates backward to the input.
Throughput reaches a condition where it is limited by the cycle time.

$$O_t = C_f - \frac{L_f}{t_c} - \frac{L_b}{t_c} \qquad (1)$$

Equation 1 models the effect of latency on throughput. $L_f$
and $L_b$ are the forward and backward latencies, $C_f$ is the total
capacity, $t_c$ is the cycle time, and $O_t$ is the range of number
of tokens across which the optimal throughput is reached.
The smaller the forward and backward latency, the sooner the
maximum throughput of the FIFO is reached.

Fig. 9. Linear controller with a Tsquare 3 template.
Fig. 11. A P-4-2 and P-2-4 parallel design w/ capacity of 10
Fig. 13. Power and Energy of 14 deep Parallel FIFOs

V. RESULTS
Fig. 14 shows a comparison of the latencies of the six
FIFO designs across capacities ranging from three to 50 words.
One of the graphs highlights small capacity FIFOs. Forward
latency is a measure of how quickly maximum throughput is
reached, as well as the time to propagate a value through the
FIFO. The asynchronous linear structure has the best forward
latency for very small capacities of four or less. Between four
and 16 the parallel and tree structures have the best latency.
Beyond 16 elements the tree architecture is best. Backward
latency produces a result similar to forward latency. However,
backward latency in the asynchronous designs is degraded.
Thus the elastic half buffer is best up to about seven data items,
after which the head / tail pointer is the best. This implies that
under a stalled condition, the clocked designs will recover to
full throughput more quickly than these asynchronous designs.

Fig. 14. Forward and Backward Latencies of 64-bit FIFOs. Labels in
decreasing order at maximum depth.

The cycle time and maximum throughput of the designs are
compared across many capacities. Fig. 15 shows the results for
small capacity designs. The design with the highest throughput
is the clocked linear structure. However, this architecture has
the largest latency by far of any design and only achieves
maximum throughput under a single token occupancy value.
The next highest throughput is the asynchronous linear design,
but it suffers from problems similar to those of the clocked
elastic buffer design. For some designs, these could be the
optimal choice if throughput is the primary metric and the FIFOs
could be maintained in the small optimal range. However, for FIFOs
that require high throughput and low latency the asynchronous
tree FIFO and the clocked head / tail pointer FIFO are the best
up to a capacity of around 16, beyond which the tree FIFO
has the highest throughput.

Fig. 15. Cycle time and maximum throughput

Fig. 16 compares the throughput versus occupancy for
designs with a capacity of 10 and 50 words. The asynchronous
tree and parallel designs reach maximum throughput sooner
than all other designs with a broad maximum throughput
range, and the tree reaches a significantly higher maximum.
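The shape of a throughput versus occupancy curve can be approximated with a simple bound. The sketch below is an assumption, not the measurement method used for Fig. 16: it takes throughput at a constant occupancy of k tokens to be limited by forward latency near empty (k / L_f), by backward latency near full ((C_f - k) / L_b), and by cycle time in between (1 / t_c). It also reads Equation 1 as O_t = C_f - L_f/t_c - L_b/t_c, where the operators are reconstructed. Function names and the numeric values in the comments are hypothetical.

```python
def throughput(k, c_f, l_f, l_b, t_c):
    """Upper bound on tokens per time unit with k tokens held in the FIFO.

    Assumed first-order model: the smallest of the forward-latency limit,
    the cycle-time limit, and the backward-latency (bubble) limit.
    """
    forward_limit = k / l_f            # tokens must traverse head to tail
    cycle_limit = 1.0 / t_c            # best case: one token per cycle
    backward_limit = (c_f - k) / l_b   # bubbles must traverse tail to head
    return min(forward_limit, cycle_limit, backward_limit)

def optimal_range(c_f, l_f, l_b, t_c):
    """Equation 1 (as reconstructed): width of the occupancy range over
    which the cycle-time limit, i.e. maximum throughput, is reached."""
    return c_f - l_f / t_c - l_b / t_c

def plateau(c_f, l_f, l_b, t_c):
    """Integer occupancies at which the model reaches maximum throughput."""
    peak = 1.0 / t_c
    return [k for k in range(c_f + 1)
            if throughput(k, c_f, l_f, l_b, t_c) == peak]

# Hypothetical example: a 10-word FIFO with L_f = 2, L_b = 3, t_c = 1.
# The model is cycle-time limited for occupancies 2 through 7, a span
# of 5 tokens, which agrees with optimal_range(10, 2.0, 3.0, 1.0).
```

In this model a design with smaller forward and backward latencies has a wider plateau, which is consistent with the observation that the tree and parallel designs reach maximum throughput sooner and hold it over a broader occupancy range.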
Fig. 17. Energy comparison of small capacity FIFOs
Fig. 18. Leakage power of small capacity FIFOs

For an application where dynamic buffering is required with
good throughput and low latency from the empty state, the
asynchronous tree design appears to be the best option.

Fig. 17 compares the average energy for a datum to pass
through the various structures for small capacity designs. The
asynchronous tree design is the most energy efficient. The
clocked head / tail pointer expends substantially more energy
than all other designs. This relationship and the relative slopes
hold for the full design space investigated, including designs
ranging from a 6.25% to 50% data activity factor, and for data
widths ranging from zero to 64 bits.

Leakage power is compared in Fig. 18. These values also
correspond well to the layout area of the designs. The clocked
linear design gives lowest leakage, but this does not account
for leakage in the clock distribution network and clock gating
circuitry.

VI. SUMMARY
This paper reports on several common FIFO structures
that can be used for flow control. They are compared on
maximum throughput, throughput versus occupancy, energy
efficiency, area, leakage energy, and latencies. First order
equations are derived to model the capacity, latency, and
maximum throughput based on occupancy of the designs. Most
asynchronous FIFO classes have multiple configurations that
result in similar capacity but different power and performance
values. The optimal configurations for a given capacity were
determined. A huge design space is investigated through
generating modular designs and synthesizing hundreds of instances
of six FIFO classes. Results are for physical layout in a
65 nm process with parasitic extraction, and include varying the
capacity, data width, configurations, and data activity factors.

Clocked and asynchronous designs are compared and contrasted
to determine the best structure for a specific need.
The square FIFO, while academically interesting, is shown
to be an impractical design as it is never the best choice
for any parameter. In general an asynchronous FIFO is the
best choice across nearly every capacity and for most metrics.
Latency, energy, and throughput will usually be the primary
factors used to select the best design. In such cases, one of
three asynchronous FIFO structures is usually the best. The
clocked linear FIFO does attain the highest throughput of any
design. However, this performance is only reached for a single
occupancy value.

This work facilitates generating CAD that will weigh the
priorities, utilize the first order equations to select a structure,
and synthesize the correct design for the application.

VII. ACKNOWLEDGMENTS
This material is based upon work supported by Semiconductor
Research Corporation task 1817.001 and the National Science
Foundation under Grant No. 0702539. The full physical
Artisan library was provided under a grant from Arm.

REFERENCES
[1] E. Brunvand, "Low Latency Self-Timed Flow Through FIFOs," in 16th
Conference on Advanced Research in VLSI, UC Santa Cruz, March 1995,
pp. 76–90.
[2] R. W. Apperson, Z. Yu, M. J. Meeuwsen, T. Mohsenin, and B. M. Baas,
"A Scalable Dual-Clock FIFO for Data Transfers Between Arbitrary
and Haltable Clock Domains," IEEE Transactions on Very Large Scale
Integration, vol. 15, no. 10, pp. 1125–1134, Oct 2007.
[3] Altera Corporation, "Single & Dual-Clock FIFO Megafunctions
User Guide," June 2003. [Online]. Available:
http://www.altera.com/literature/ug/ug fifo.pdf
[4] J. Ebergen, "Squaring the FIFO in GasP," in 7th International
Symposium on Asynchronous Circuits and Systems, March 2001, pp. 194–205.
[5] I. M. Panades and A. Greiner, "Bi-Synchronous FIFO for Synchronous
Circuit Communication Well Suited for Network-on-Chip in GALS
Architectures," in 2nd International Symposium on Networks-on-Chip.
ACM/IEEE, April 2008, pp. 139–148.
[6] T. Chelcea and S. M. Nowick, "Low-latency asynchronous FIFOs using
token rings," in 6th International Symposium on Advanced Research in
Asynchronous Circuits and Systems (ASYNC 2000). IEEE, Apr. 2000,
pp. 210–220.
[7] C. E. Molnar, I. W. Jones, W. S. Coates, J. K. Lexau, S. M. Fairbanks,
and I. E. Sutherland, "Two FIFO Ring Performance Experiments,"
Proceedings of the IEEE, vol. 87, no. 2, pp. 297–307, Feb 1999.
[8] J. T. Yantchev, C. G. Huang, M. B. Josephs, and I. M. Nedelchev,
"Low-latency asynchronous FIFO buffers," in 2nd Working Conference
on Asynchronous Design Methodologies, May 1995, pp. 24–31.
[9] K. S. Stevens, Y. Xu, and V. Vij, "Characterization of Asynchronous
Templates for Integration into Clocked CAD Flows," in 15th International
Symposium on Asynchronous Circuits and Systems. IEEE, May
2009, pp. 151–161.
[10] J. Cortadella, M. Kishinevsky, and B. Grundmann, "Synthesis of
synchronous elastic architectures," in Proceedings of the Design Automation
Conference (DAC'06). IEEE, July 2006, pp. 657–662.
[11] J. You, Y. Xu, H. Han, and K. S. Stevens, "Performance Evaluation
of Elastic GALS Interfaces and Network Fabric," Electronic Notes in
Theoretical Computer Science, vol. 200, no. 1, pp. 17–32, February
2008, Elsevier.
[12] L. P. Carloni, K. L. McMillan, and A. L. Sangiovanni-Vincentelli,
"Theory of latency-insensitive design," IEEE Transactions on
Computer-Aided Design, vol. 20, no. 9, pp. 1059–1076, Sep 2001.
[13] H. M. Jacobson, P. N. Kudva, P. Bose, P. W. Cook, S. E. Schuster,
E. G. Mercer, and C. J. Myers, "Synchronous interlocked pipelines,"
in 8th International Symposium on Asynchronous Circuits and Systems,
Apr. 2002, pp. 3–12.