Clocked and Asynchronous FIFO Characterization and Comparison

Download as pdf or txt
Download as pdf or txt
You are on page 1of 7

Clocked and Asynchronous FIFO

Characterization and Comparison


HoSuk Han Kenneth S. Stevens
Electrical and Computer Engineering
University of Utah

AbstractHeterogeneous blocks, IP reuse, network-on-chip


interconnect, and multi-frequency design are becoming more
prevalent in integrated circuit design. Communication amongst
these blocks typically employs first-in-first-out (FIFO) buffering
for flow control. This paper characterizes and evaluates several
common designs in order to determine which structure is best
for various specific applications. Two clocked and four clockless
asynchronous FIFO designs are compared varying capacity,
bit width, and structural configurations. The FIFO layouts
are characterized in the IBM 65nm 10sf process for latency,
throughput, area, and power. First order models are created to Fig. 1. Schematic of the linear pipeline controller
aid in CAD for FIFO synthesis, modeling, and optimization. Com-
parative results show that the asynchronous designs uniformly
out perform the clocked designs in nearly every aspect.
The end goal is to develop a tool to algorithmically evaluate
the merits of various FIFO structures and generate a parame-
I. I NTRODUCTION terized synthesis system that will create the most appropriate
design. The choice can have significant impact. For instance,
FIFOs are an increasingly important component as design the best 8 word asynchronous FIFO expends half the energy
has become more modular. The choice of which structured with a forward latency that is almost a third that of the
memory to employ can have significant impact on the power, best clocked FIFO while achieving nearly identical maximum
performance, and cost of a design. The choices are broad and throughput. In general, the asynchronous designs are shown
range from clocked to asynchronous designs [1][6]. Some to significantly out perform clocked designs for nearly every
excellent work on the properties of FIFOs has been published metric and across nearly the full range of design parameters.
[7], [8]. Yet a clear understanding of the comparative cost of
different designs in terms of power, throughput, latency, and II. D ESIGN AND C HARACTERIZATION
area and of the key differences between specific structures The results reported in this paper are derived from the
is not generally available. This paper reports on a study layout of designs that have been automatically synthesized and
performed for a two fold purpose: to help designers choose the characterized. The designs have been physically placed and
best FIFO for their target design, and to develop the foundation routed in the IBM 10sf 65nm process technology using the
for an automatic CAD tool for selecting and synthesizing the Artisan RVt 12T library. Simulation results use full parasitic
best structure. extraction from layout.
The most common clocked and asynchronous designs are The implementations are designed to achieve the goal of
compared across a broad range of design metrics. The designs making them as comparable as possible. This is accomplished
are characterized for buffering capacity, energy per data word, by employing the same universal subcomponents to construct
leakage energy, width of the data path, forward and backward both the asynchronous and clocked FIFO designs. For in-
latency, throughput for a given occupancy, and area. Many of stance, the same clocked one-hot shift register is used for
the asynchronous designs have various structural choices, and address selection in the clocked head / tail pointer design
these structures are compared and optimized. First-order equa- as well as the asynchronous parallel and rectangular FIFOs.
tions are that allow designs to be compared across arbitrary Likewise, the same pipeline controller, shown in Fig. 1, is used
parameterized ranges are also developed. in all the asynchronous designs.
This work evaluates the most common FIFO structures. Two The design and characterization flow proceed as follows.
clocked and four asynchronous FIFO classes are investigated. First, a small set of shared circuit templates were designed.
The clocked designs are assumed to operate entirely in a Each are implemented as a behavioral or structural Verilog
single clock domain. Synchronization costs are ignored in module. The structural modules are mapped to the Artisan
this analysis if the asynchronous design is placed in a clock 65nm static library. There are 10 separate modules, three of
domain. No status information beyond full at the write port which are asynchronous state machines, the rest consist of
and empty at the read port is assumed. clocked components. The data registers are composed of
banks of flip-flops for the clocked head / tail pointer design,
and data latches for all other designs.
The asynchronous modules consist of either pipelined stages
which contain a latch bank to store data, or unpipelined
stages that steer the control and data bits. The asynchronous
unpipelined modules consist entirely of clocked elements,
such as the one-hot shift register. The pipelined modules have
an asynchronous state machine that controls the clock signal to
the latch bank that stores the data, and performs flow control
between buffer stages. The linear pipeline controller circuit
of Fig. 1 is a typical example of an asynchronous finite state
machine (AFSM) controller. The other asynchronous AFSM
designs include the binary toggle and merge components that Fig. 2. Phased Elastic Half Buffer (pEHB) design.
buffer and steer data between three modules.
The three asynchronous finite state machine controllers
were synthesized or hand designed, and implemented as the design fails, then backing off the timing by 15%. The
structural Verilog modules mapped to the static Artisan cell frequency of the asynchronous FIFO designs are a result of
library. These circuits used a characterization flow that permits the system logic delays and margins designed in the race paths.
their use in clocked ASIC CAD flows [9]. This character- Multiple simulations were performed on each design to
ization includes formal verification to prove behavioral and evaluate latencies, worst case throughput, and energy under
timing correctness. Constraints are employed to ensure the various activity factors. A simulation algorithm to ensure a fair
structure of the cells are not modified by the CAD tools, energy comparison was developed by generating patterns that
but that the drive strengths can be correctly optimized for guaranteed identical activity factors on all transfers between
power and delay by using the set_size_only com- storage elements. Data activity factors ranging from 6.25% to
mands for these modules. Specific timing in the asynch- 50% on all data paths were simulated. The designs were also
ronous modules not understood by the clocked CAD are simulated across all valid occupancies to generate throughput
defined with set_max_delay, set_min_delay, and versus occupancy graphs. Due to the volume of the tests,
set_data_check commands. These sdc commands are simulation and characterization scripts were developed to
used by the logic synthesis and place and route tools. They are automate FIFO characterization.
employed in a way that ensures that the asynchronous designs
Hundreds of different designs were synthesized and
are power and timing optimized just as the clocked modules.
characterized. Each FIFO class, and at times many structural
The clocked templates include a one-hot shift register for
variants, were designed with capacities ranging from two
address generation, a counter, unpipelined n-way toggle and
through 50 words and data path widths ranging from from
merge modules, and an elastic half buffer pipeline controller
zero up to 64 bits. The scripts developed to generate the
[10], [11]. Each are implemented as a behavioral or structural
Verilog FIFO designs is therefore highly parameterized. The
Verilog module depending on the design.
parameters include selections to identify (i) one of the six
These components are assembled together to build the six
FIFO classes, (ii) the structural configuration specific to that
FIFO classes reported in this paper. A custom parameterized
class (e.g. different widths and depths described in Sec. IV),
script was written to synthesize the completed FIFO architec-
(iii) the capacity of the FIFO, and (iv) the width of the data
tures as Verilog designs. The result of the script is a Verilog
path. Behavioral test interfaces are attached to the head and
register transfer level (RTL) design. The RTL typically consists
tail of the FIFOs to aid in the characterization.
of both structural and behavioral modules.
The simulation results are evaluated and first order models
The Verilog designs are synthesized using Design Compiler
developed for capacity, throughput, energy scaling, and perfor-
and then physically placed and routed using SoC Encounter.
mance values. These models can be implemented in a CAD
The design area is measured from SoC Encounter and then
tool to automatically select and synthesize the best FIFO for
simulated for power and performance using Modelsim. Results
specific design parameters and quality metrics.
from Modelsim import delays calculated from parasitic ex-
traction from the physical layout in SoC Encounter using the
III. FIFO C LASSES
standard delay format (sdf). One simulation in Modelsim is
designed to measure the power. This simulation generates a Two clocked structures are used: linear flow-through FIFO
value change dump (vcd) file that logs the switching activity and head / tail pointer. The linear clocked FIFO is imple-
on every node in the design. The vcd file is imported into SoC mented as phased elastic half buffer cells (Fig. 2) that employ
which generates the power results based on the parasitics and latches for data storage [10], [11]. This logic implements a
actual activity factors of the nodes from simulation. latency insensitive protocol [12], [13]. Elastic half buffers
The frequency of the clocked FIFOs is determined by must be connected where the control and data latches alternate
simulating the design, increasing the clock frequency until between transparent high and low cells in adjacent stages.
Fig. 3. Design of the Clocked Head/Tail Pointer FIFO. Read and write pointers are one-hot shift registers.

Fig. 4. Block diagram of a 4-deep Linear FIFO. L1L4 are the controller
of Fig. 1, D1D4 are latch banks.

The implementation of the clocked pointer FIFO is shown


Fig. 5. Block diagram of the parallel FIFO. Each L blocks is the same as a
in Fig. 3 [3]. The read and write pointers are implemented with pipelined controller / latch pair in Fig. 4.
a circular clocked one-hot shift register for address selection.
Four asynchronous structures are implemented: linear, par-
allel, binary tree, and square [1], [4], [8]. A linear structure
with capacity of four is shown in Fig. 4. All asynchronous
FIFOs and the clocked linear FIFO use latches for data storage.
The structure of Fig. 4 is used in each of the asynchronous
designs, where the combination of the linear controller and
latch bank form a pipeline stage, just like a Di Li pair in
Fig. 4. A normally closed latching protocol is used for
energy efficiency where the latches are only briefly transparent
to store the data.
A block diagram of the parallel FIFO is shown in Fig. 5.
The T and M blocks are unpipelined modules that steer data Fig. 6. Schematic of the unpipelined Tparallel module
to and from a set of parallel legs. The capacity of the legs
is configurable; each leg in Fig. 5 has a capacity of two. The
parallel design is the same as a linear structure in the degen- two when placed between two linear controllers because it is
erate case with only a single leg. Likewise the configuration not pipelined. However, it is highly modular and scales well.
where each leg contains a single buffer is an asynchronous The parallel merge module has a design similar to the parallel
implementation of the clocked H/T pointer FIFO. toggle, but requires a mux to select and steer data.
The design of the toggle template for the parallel FIFO is The block diagram of the binary tree FIFO is shown in
shown in Fig. 6. The one-hot shift register is the same design Fig. 7. In this design the data is steered through a pipelined
as the circular shift register used in the read and write pointers logarithmic fanout tree to a number of parallel legs. The
of the clocked head / tail pointer FIFO (Fig. 3. This template legs are then merged through a logarithmic binary tree to
implements an unpipelined protocol that steers the handshake the output. The parallel legs can have a capacity >= 0. The
signals without storing the data. The data is broadcast to binary toggle and merge modules are pipeline asynchronous
all downstream modules. This module reduces the frequency templates with a protocol and implementation similar to the
of the parallel FIFO to the first order by about a factor of linear controller in Fig. 1.
Fig. 7. Block diagram of Tree FIFO. The T and M blocks are pipelined.
Fig. 10. Throughput vs. occupancy for five 14-deep parallel FIFO configu-
rations. Labels are shown in inverse order of maximum throughput.

IV. C HARACTERIZATION
First order equations were developed to represent the struc-
ture, capacity, and forward latency for each design. The
equations are based on the number of parallel legs Nl , the
capacity of each leg Cl , the capacity of the FIFO Cf , the
clock period Pc and the latency of the asynchronous linear
controller Lc . The top two classes in the following table are
clocked, the latter four are asynchronous.
Design Structure Capacity Fwd Latency
Clk Linear Cl (Cf /2)Pc
Clk H/T Nl 3Pc
A. Linear Cl Cf Lc
Fig. 8. Block diagram of the S-4-4 square FIFO. h1i shows the path of the Parallel P-Nl -Cl 2 + Nl Cl (Cl + 4)Lc
first datum, h2i of the second. Tree T-Nl -Cl 2(Nl 1) + Nl Cl (2log2 Nl + Cl )Lc
Square S-Nl -(Cl + 2) Nl (Cl + 2) (2Nl + Cl)Lc
The structures are specified by the class, number of parallel
The block diagram of a square FIFO is shown in Fig. 8. legs, and capacity of each leg. Thus the parallel FIFO of Fig. 5
This design consists of a row of steering cells on the top and is a P-4-2 FIFO and the tree design of Fig. 7 is a T-4-1
bottom, with parallel legs in between. Data flows across the configuration.
top row to each leg in order, down, and then out at the bottom
right. The path of the first two tokens is shown in the figure A. Latency and Throughput
with arcs labeled h1i and h2i. The controllers in the bottom Forward latency is defined as the delay from placing a token
row first steer the datum from the top to the output, then take into the head of an empty FIFO until it has been read from the
one or more tokens from the left based on the location in the tail. The backward latency is its dual: the delay from removing
rectangle. The degenerate case of a single leg equals a linear a token from the tail of a full FIFO until one can place a new
structure. The square toggle and merge templates, like the token in the head.
parallel modules, are not pipelined. Fig. 9 shows a pipelined Latency has a major effect on FIFO throughput under certain
linear controller connected to a square toggle module. Like occupancy ranges. Throughput is limited by the latency in all
the unpipelined parallel design, these templates also reduce designs when the occupancy is near empty or full [7]. The
the frequency of the design approximately in half. throughput of a FIFO is therefore dependent on its occupancy
and cycle time.
Throughput vs. occupancy is measured keeping the number
of data items in the FIFO constant at all times. When a
datum is removed from the tail of the FIFO, a new datum
is simultaneously added to the head of the FIFO. The FIFO is
first initialized with a particular occupancy. When a FIFO is
near empty, throughput is reduced due to the forward latency
of tokens. More data could be added to the FIFO, but this
must be delayed until data is valid at the output to maintain
a constant occupancy. When the FIFO is nearly full, the dual
applies. New data can not be removed from the FIFO until a
bubble, or empty position, propagates backward to the input.
Fig. 9. Linear controller with a Tsquare 3 template. Throughput reaches a condition where it is limited by the cycle
Fig. 11. A P-4-2 and P-2-4 parallel design w/ capacity of 10
Fig. 13. Power and Energy of 14 deep Parallel FIFOs

B. Optimal Asynchronous Structures


All asynchronous FIFOs except the linear structure have
multiple configurations that achieve similar capacity. Fig. 11
shows two configurations of a parallel structure each with a
capacity of 10. The first part of our characterization determines
the best configuration for the asynchronous designs. This
is illustrated with a parallel FIFO having a capacity of 14
words. Five structures are synthesized and characterized: the
P-2-6, P-3-4, P-4-3, P-6-2, and P-12-1. As the number of
Fig. 12. Latency and cycle time of five 14-deep FIFOs parallel legs increases, the cycle time increases but the forward
latency decreases. This creates the tradeoff shown in Fig. 12.
Fig. 13 shows the energy per token tradeoff for each of the
time of the design if forward and backward latencies are small. configurations, showing P-6-2 as clearly the best. Throughput
This results in the graphs in Fig. 10 where throughput triangles versus occupancy is shown in Fig. 10, displaying the tradeoff
saturate due to the maximum cycle time of the design. between maximum throughput and the range over which the
Designs with small forward and backward latencies saturate maximum throughput is obtained. The graph for area is not
their throughput quickly based on the maximum frequency shown, but it correlates rather well to the cycle time in Fig. 12.
of the design. However, designs with large latencies restrict The area increases from 20,636 to 22,093 to 26,767 m2
throughput across a large range of occupancies. The clocked for the P-4-3, P-6-2, and P-12-1 configurations. The parallel
linear FIFO has a very high latency. For this design the FIFOs with capacity of two in each leg achieves the best cycle
throughput is reduced due to latency in all cases except for time area energy throughput versus occupancy. Hence
FIFOs of even length that are exactly half full. these configurations are used in this paper.
The forward and backward latencies for clocked designs are The same tradeoffs occur in the tree structures with a
equivalent. However, The forward and backward latencies of similar optimum. The results can be explained as follows: (i)
asynchronous structures can have different values which are Latency, energy, and the breadth of the maximum throughput
dependent on the handshake protocol. In this study we selected versus occupancy improve by reducing the number of linear
a protocol with a faster forward latency than backward latency. stages a datum must pass through. (ii) The steering logic is
This results in different slopes as can be seen in Fig. 10. Since significantly more expensive than the linear pipeline logic in
the forward latency is less than the backward latency for this terms of latency, energy, and area. (iii) The difference between
design, the maximum throughput is reached when the FIFO a capacity of one and two tokens per leg doubles the cost of
contains two or three tokens. However, maximum throughput the steering logic, but only increases the number of tokens
is not reached until up to 5 bubbles exist in the designs when a datum flows through by a small percentage (depending on
the FIFO is nearly full. total FIFO capacity).

V. R ESULTS
Lf Lb
Ot Cf (1) Fig. 14 shows a comparison of the latencies of the six
tc tc
FIFO designs across capacities ranging from three to 50 words.
Equation 1 models the effect of latency on throughput. Lf One of the graphs highlights small capacity FIFOs. Forward
and Lb are the forward and backward latencies, Cf is the total latency is a measure of how quickly maximum throughput is
capacity, tc is the cycle time, and Ot is the range of number reached, as well as the time to propagate a value through the
of tokens across which the optimal throughput is reached. FIFO. The asynchronous linear structure has the best forward
The smaller the forward and backward latency, the sooner the latency for very small capacities of four or less. Between four
maximum throughput of the FIFO is reached. and 16 the parallel and tree structures have the best latency.
Fig. 15. Cycle time and maximum throughput

Fig. 16. Throughput versus occupancy comparison for FIFO capacities of


10 and 50 tokens

Fig. 14. Forward and Backward Latencies of 64-bit FIFOs. Labels in maximum throughput under a single token occupancy value.
decreasing order at maximum depth.
The next highest throughput is the asynchronous linear design,
but it suffers similar problems with the clocked elastic buffer
Beyond 16 elements the tree architecture is best. Backward design. For some designs, these could be the optimal choice
latency produces a result similar to forward latency. However, if throughput is the primary metric and the FIFOs could be
backward latency in the asynchronous designs is degraded. maintained in the small optimal range. However, for FIFOs
Thus the elastic half buffer is best up to about seven data items, that require high throughput and low latency the asynchronous
after which the head / tail pointer is the best. This implies that tree FIFO and the clocked head / tail pointer FIFO are the best
under a stalled condition, the clocked designs will recover to up to a capacity of around 16, beyond which the tree FIFO
full throughput quicker than these asynchronous designs. has the highest throughput.
The cycle time and maximum throughput of the designs are Fig. 16 compares the throughput versus occupancy for
compared across many capacities. Fig. 15 shows the results for designs with a capacity of 10 and 50 words. The asynchronous
small capacity designs. The design with the highest throughput tree and parallel designs reach maximum throughput sooner
is the clocked linear structure. However, this architecture has than all other designs with a broad maximum throughput
the largest latency by far of any design and only achieves range, and the tree reaches a significantly higher maximum.
Clocked and asynchronous designs are compared and con-
trasted to determine the best structure for a specific need.
The square FIFO, while academically interesting, is shown
to be an impractical design as it is never the best choice
for any parameter. In general an asynchronous FIFO is the
best choice across nearly every capacity and for most metrics.
Latency, energy, and throughput will usually be the primary
factors used to select the best design. In such cases, one of
three asynchronous FIFO structures are usually the best. The
clocked linear FIFO does attain the highest throughput of any
design. However, this performance is only reached for a single
Fig. 17. Energy comparison of small capacity FIFOs occupancy value.
This work facilitates generating CAD that will weigh the
priorities, utilize the first order equations to select a structure,
and synthesize the correct design for the application.
VII. ACKNOWLEDGMENTS
This material is based upon work supported by Semiconduc-
tor Research Corporation task 1817.001 and the National Sci-
ence Foundation under Grant No. 0702539. The full physical
Artisan library was provided under a grant from Arm.
Fig. 18. Leakage power of small capacity FIFOs
R EFERENCES
[1] E. Brunvand, Low Latency Self-Timed Flow Through FIFOs, in 16th
For an application where dynamic buffering is required with Conference on Advanced Research in VLSI, UC Santa Cruz, March 1995,
good throughput and low latency from the empty state, the pp. 7690.
[2] R. W. Apperson, Z. Yu, M. J. Meeuwsen, T. Mohsenin, and B. M. Bass,
asynchronous tree design appears to be the best option. A Scalable Dual-Clock FIFO for Data Transfers Between Arbitrary
Fig. 17 compares the average energy for a datum to pass and Haltable Clock Domains, IEEE Transactions on Very Large Scale
Integration, vol. 15, no. 10, pp. 11251134, Oct 2007.
through the various structures for small capacity designs. The [3] Altera Corporation, Single & Dual-Clock FIFO Mega-
asynchronous tree design is the most energy efficient. The functions User Guide, June 2003. [Online]. Available:
clocked head / tail pointer expends substantially more energy http://www.altera.com/literature/ug/ug fifo.pdf
[4] J. Ebergen, Squaring the FIFO in GasP, in 7th International Sympo-
than all other designs. This relationship and the relative slopes sium on Asynchronous Circuits and Systems, March 2001, pp. 194205.
hold for the full design space investigated, including designs [5] I. M. Panades and A. Greiner, Bi-Synchronous FIFO for Synchronous
ranging from a 6.25% to 50% data activity factor, and for data Circuit Communication Well Suited for Network-on-Chip in GALS
Architectures, in 2nd International Symposium on Networks-on-Chip.
widths ranging from zero to 64 bits. ACM/IEEE, April 2008, pp. 139148.
Leakage power is compared in Fig. 18. These values also [6] T. Chelcea and S. M. Nowick, Low-latency asynchronous FIFOs using
correspond well to the layout area of the designs. The clocked token rings, in 6th International Symposium on Advanced Research in
Asynchronous Circuits and Systems (ASYNC 2000). IEEE, apr 2000,
linear design gives lowest leakage, but this does not account pp. 210220.
for leakage in the clock distribution network and clock gating [7] C. E. Molnar, I. W. Jones, W. S. Coates, J. K. Lexau, S. M. Fairbanks,
circuitry. and I. E. Sutherland, Two FIFO Ring Performance Experiments,
Proceedings of the IEEE, vol. 87, no. 2, pp. 297307, Feb 1999.
[8] J. T. Yantchev, C. G. Huang, M. B. Josephs, and I. M. Nedelchev,
VI. S UMMARY Low-latency asynchronous FIFO buffers, in 2nd Working Conference
on Asynchronous Design Methodologies, May 1995, pp. 2431.
This paper reports on several common FIFO structures [9] K. S. Stevens, Y. Xu, and V. Vij, Characterization of Asynchronous
that can be used for flow control. They are compared with Templates for Integration into Clocked CAD Flows, in 15th Interna-
tional Symposium on Asynchronous Circuits and Systems. IEEE, May
maximum throughput, throughput versus occupancy, energy 2009, pp. 151161.
efficiency, area, leakage energy, and latencies. First order [10] J. Cortadella, M. Kishinevsky, and B. Grundmann, Synthesis of syn-
equations are derived to model the capacity, latency, and chronous elastic architectures, in Proceedings of the Digital Automation
Conference (DAC06). IEEE, July 2006, pp. 657662.
maximum throughput based on occupancy of the designs. Most [11] J. You, Y. Xu, H. Han, and K. S. Stevens, Performance Evaluation
asynchronous FIFO classes have multiple configurations that of Elastic GALS Interfaces and Network Fabric, Electronic Notes in
result in similar capacity but different power and performance Theoretical Computer Science, vol. 200, no. 1, pp. 1732, February
2008, elsevier.
values. The optimal configurations for a given capacity were [12] L. P. Carloni, K. L. McMillan, and A. L. Sangiovanni-Vincentelli,
determined. A huge design space is investigated through gener- Theory of latency-insensitive design, IEEE Transactions on Computer-
ating modular designs and synthesizing hundreds of instances Aided Design, vol. 20, no. 9, pp. 10591076, Sep 2001.
[13] H. M. Jacobson, P. N. Kudva, P. Bose, P. W. Cook, S. E. Schuster,
of six FIFO classes. Results are for physical layout in a E. G. Mercer, and C. J. Myers, Synchronous interlocked pipelines,
65nm process with parasitic extraction, and include varying the in 8th International Symposium on Asynchronous Circuits and Systems,
capacity, data width, configurations, and data activity factors. Apr. 2002, pp. 312.

You might also like