Microelectronics Journal: Seyed Mohamad Taghi Adl, Siamak Mohammadi
Microelectronics Journal
journal homepage: www.elsevier.com/locate/mejo
A high performance dual clock elastic FIFO network interface for GALS NoC
Seyed Mohamad Taghi Adl a, Siamak Mohammadi a, b, *
a Dependable System Design Lab, School of ECE, University of Tehran, Tehran, Iran
b School of Computer Science, Institute of Fundamental Sciences (IPM), Tehran, Iran
ARTICLE INFO

Keywords:
Dual clock FIFO
GALS NoC interface
Performance evaluation
Process variation

ABSTRACT

A dual clock, register-based, elastic First-In First-Out (FIFO) architecture is presented for Globally Asynchronous Locally Synchronous (GALS) Network on Chip (NoC) interfaces. The FIFO is designed using synchronous elastic methods, which facilitates its synthesis with commercial CAD tools. It supports arbitrary phase and frequency for read and write operations and provides safe data transmission between different clock domains. The presented structure can easily be used as an interface between synchronous or asynchronous GALS modules. The FIFO is simulated and analyzed with the 32 nm PTM library in HSPICE. Metastability, process variation, throughput, power, area, delay, and maximum frequency are analyzed. Results show that the power-delay product (PDP) of the elastic FIFO is 23% lower than that of similar synchronous FIFOs. The proposed elastic FIFO has double the capacity while its area is almost the same. The elastic FIFO tolerates high variability better and, in the presence of variation, preserves its functionality in 5% more cases on average than the DSPIN synchronous FIFO.
* Corresponding author. Dependable System Design Lab, School of ECE, University of Tehran, Tehran, Iran.
E-mail addresses: [email protected] (S.M.T. Adl), [email protected] (S. Mohammadi).
https://doi.org/10.1016/j.mejo.2018.04.014
Received 4 December 2017; Received in revised form 25 April 2018; Accepted 25 April 2018
faster writing than reading and the other is when the reading is faster.

Our dual clock elastic FIFO can easily be connected to synchronous, asynchronous, and elastic circuit interfaces. The presented FIFO is compatible with general CAD tools, and no custom cell design is needed. The results show that the presented dual clock elastic FIFO consumes less power while having double the data capacity of similar structures. The FIFO can operate at higher frequencies and shows better resilience under high variation.

From the structural point of view, the main contributions of this paper are as follows:
- The elastic FIFO is simple and has double the data capacity in less area than the most similar synchronous dual clock FIFO designs.
- The presented structure benefits from a fast, accurate, and efficient full/empty detection mechanism.
- Our FIFO does not need multi-bit synchronization, which is complicated and unreliable; instead, it uses a simple single-bit brute-force synchronizer.
- The presented dual clock elastic FIFO can easily connect to synchronous, asynchronous, and elastic interfaces.

In the following section, some related works are explained. In Section 3 we present the new FIFO structure. In Section 4, the evaluation results are presented, and finally the paper is concluded in Section 5.

2. Related work

In this section, some basic information and definitions necessary for the following sections are given. Elasticity concepts are described first. Afterward, GALS synchronization challenges are explored and some main dual clock FIFO structures are explained.

2.1. Elastic circuits

According to Cortadella's studies, elasticity describes a range of circuit designs that can tolerate variable-latency inputs and still preserve functionality [16]. In this range, synchronous systems are the least elastic designs and delay-insensitive asynchronous circuits are the most elastic structures. The newer approach, called synchronous elastic design, lies somewhere in the middle of this range. Many properties of synchronous elastic circuits have been established for computational circuits [17]. In this study, we evaluate synchronous elastic circuits for GALS NoC interfaces.

Elastic circuits preserve functionality while their inputs have arbitrary delays. The storage structure in elastic circuits, called the Elastic Buffer (EB), can, unlike a flip flop in synchronous design, hold two distinct data items in its two adjacent latches. Since elastic circuits operate with a clock, they do not need a special interface to communicate with the synchronous modules of a GALS NoC.

Different structures have been presented for the elastic concept [18,19]. In this paper, we use the synchronous elastic structure presented by Cortadella [15]. Valid and stop signals, along with the clock, form an elastic channel that controls the flow of data using the SELF protocol. The valid control signal is sent in the same direction as the data and indicates its validity. The stop control signal is transmitted in the opposite direction of the data flow. An inactive stop signal means the consumer can accept new data.
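As a concrete illustration of the channel semantics, the following minimal Python sketch classifies one clock cycle of a SELF channel from its valid (V) and stop (S) bits, using the three modes the protocol defines: Transfer (V ∧ ¬S), Idle (¬V), and Retry (V ∧ S). The enum and function names are hypothetical; this is a behavioral illustration, not a hardware description.

```python
from enum import Enum


class SelfMode(Enum):
    """Channel modes of the SELF handshake (hypothetical naming)."""
    TRANSFER = "Transfer"  # V and not S: data is forwarded this cycle
    IDLE = "Idle"          # not V: nothing valid on the channel
    RETRY = "Retry"        # V and S: producer must hold its data


def channel_mode(valid: bool, stop: bool) -> SelfMode:
    """Classify one clock cycle of a SELF channel from its V and S bits."""
    if not valid:
        return SelfMode.IDLE  # S is irrelevant without valid data
    return SelfMode.RETRY if stop else SelfMode.TRANSFER


if __name__ == "__main__":
    for v in (False, True):
        for s in (False, True):
            print(f"V={int(v)} S={int(s)} -> {channel_mode(v, s).value}")
```

Because ¬V covers both values of S, the three modes partition all four (V, S) combinations.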
Fig. 2. Elastic buffer [16]. (a) Structure. (b) Controller Structure. (c) Three possible states of buffer.
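The protocol-level behaviour of the elastic buffer in Fig. 2 can be sketched at cycle level in Python. This is a minimal model under stated assumptions: occupancy 0, 1, or 2 stands for the Empty, Half-full, and Full states; V2 is high whenever the buffer holds data; S1 is raised backward when it is full; both handshakes are evaluated on the state before the clock edge. The class and method names are hypothetical, and the latch-level L/H controller of Fig. 2b is not modelled.

```python
from collections import deque
from typing import Any, Deque, Optional


class ElasticBuffer:
    """Cycle-level behavioral sketch of a two-slot elastic buffer (EB)."""

    CAPACITY = 2  # two adjacent latches -> two data items

    def __init__(self) -> None:
        self.slots: Deque[Any] = deque()

    @property
    def state(self) -> str:
        return ("Empty", "Half-full", "Full")[len(self.slots)]

    @property
    def valid_out(self) -> bool:
        """V2: high whenever the buffer holds data for the consumer."""
        return len(self.slots) > 0

    @property
    def stop_in(self) -> bool:
        """S1: propagated backward to the producer when the buffer is full."""
        return len(self.slots) == self.CAPACITY

    def head(self) -> Optional[Any]:
        return self.slots[0] if self.slots else None

    def clock(self, valid_in: bool, data_in: Any, stop_out: bool) -> Optional[Any]:
        """Apply one clock edge and return the item sent to the consumer, if any.

        valid_in is V1 from the producer; stop_out is S2 from the consumer.
        """
        send = self.valid_out and not stop_out  # Transfer on the output channel
        accept = valid_in and not self.stop_in  # Transfer on the input channel
        out = self.slots.popleft() if send else None
        if accept:
            self.slots.append(data_in)
        return out


if __name__ == "__main__":
    eb = ElasticBuffer()
    # The consumer stalls (S2 = 1) for three cycles, then releases the stop.
    for v1, item, s2 in [(True, "a", True), (True, "b", True), (True, "c", True),
                         (True, "c", False), (True, "c", False)]:
        out = eb.clock(v1, item, s2)
        print(f"sent={out!r:5} state={eb.state}")
```

In this sketch a Full buffer keeps S1 asserted for the edge on which the consumer drains it, so the producer retries once; a real EB controller may release the stop earlier.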
S.M.T. Adl, S. Mohammadi Microelectronics Journal 76 (2018) 69–80
Fig. 3. Token ring. (a) Structure. (b) Output with one bit encoding.
According to the SELF handshaking protocol, depending on the valid (V) and stop (S) signals, the channel can be in one of three modes: Transfer, Idle, or Retry. The condition for each mode is defined as follows:

Transfer: V ∧ ¬S
Idle: ¬V
Retry: V ∧ S

These modes are depicted in Fig. 1. In Transfer mode, data is forwarded as in ordinary flip flops. In Retry mode, valid data waits for the stop condition to disappear; that is, until the consumer is ready to receive new data, the producer has to hold its data.

Much research has been done in the synchronous elastic realm. Different asynchronous modules such as mutex, fork, and join have special elastic designs [20,21]. Moreover, performance improvement techniques used in synchronous circuits have also been implemented and evaluated with the elastic concept [22]. Methods such as retiming, recycling, and speculation have been mapped to elastic design [23–25] for use in design automation [26].

Many structures have been introduced for EBs [27]. We use the EB design presented in Ref. [17], as shown in Fig. 2. In Fig. 2a, the two latches, L and H, play roles similar to the master and slave latches of an ordinary flip flop (FF). These latches are enabled by the control circuit depicted in Fig. 2b, based on the SELF protocol.

According to the valid and stop signals, the EB can be in one of three states: Full, Half-full, and Empty. The finite state machine expressing the different states of the EB is depicted in Fig. 2c. When there is no valid data in the buffer (V1 = 0, V2 = 0), the state is Empty. When valid data is being sent with no restriction (S2 = 0), the buffer is Half-full and operates like a flip flop. But when the signal S2 is asserted, the buffer holds its current data and can save another new data item, thus storing two different data items. The EB then becomes Full and propagates the stop (S1) signal backward to prevent data from being overwritten.

At first glance, an EB seems more complex than an FF, imposing power and area overhead. Further, as the EB gates the clock signal, it needs timing engineering, and thus the circuit may become more subject to process variation. However, we will show that in large circuit designs the power consumption and area of elastic circuits are not much more than those of synchronous circuits, because elastic circuits do not need the extra control modules necessary for synchronous circuits. Besides, elastic circuits naturally tolerate variability better than synchronous designs.

2.2. GALS synchronization

Each data transmission between two different clock domains may face the metastability problem, as the consumer does not know when the data is stable enough to be captured on the clock edge. Two main classes of solutions have been presented to deal with metastability, called value-safe and time-safe [28]. Value-safe methods wait for metastability to resolve, such as pausible clock techniques [29] or clock stretching [28]. Time-safe techniques allocate a fixed period of time for metastability to resolve. In a GALS structure, since every transaction between adjacent modules must pass through these synchronization interfaces, the whole network performance strongly depends on synchronization performance [30].

Two simple solutions are available for the time-safe method: using a series of flip flops, called a synchronizer, or using FIFOs. Using synchronizers degrades the system's throughput significantly [31]. Moreover, using a synchronizer for multi-bit signals is unreliable, and the synchronized output may not be valid. A synchronizer can be used for multi-bit signals only when consecutive values differ in a single bit [32].

The second time-safe synchronization solution is using FIFOs. Although FIFOs have a more complicated design and some area overhead, they do not impact the network throughput very much and are suitable for multi-bit synchronization. The FIFO's delay is comparable with that of a synchronizer. In practice, a FIFO can be considered as pipelined synchronizers. Thus, a FIFO seems to be a better choice for GALS NoC communication. Synchronous and asynchronous FIFOs are the two available options for GALS NoC module interfaces.

2.3. FIFOs in GALS

In the GALS research area, different FIFOs have been presented. Asynchronous FIFOs [33,34] provide a high degree of robustness and resilience against environmental variation [5]. However, their data transfer rate is limited by handshaking delay [35].

Some improvements have been suggested to decrease delay by using parallel structures or tree-design FIFOs [36,37]. However, it is not easy to exploit them as an interface between synchronous modules, since they need extra wrapper interfaces to communicate reliably between synchronous and asynchronous circuits [8]. Moreover, there is no mature and complete Computer Aided Design (CAD) tool to design and test asynchronous and synchronous circuits together [7]. Thus, designers tend to deploy dual clock synchronous FIFOs as GALS NoC network interfaces, where read and write operations are synchronous with clocks that have different phases and frequencies.

Dual clock FIFO structures mainly differ in their data storage mechanism. Data management and storage structures for FIFOs can be RAM-based or register-based. RAM-based structures are more scalable and suitable for deep FIFOs, while register-based FIFOs are usually small, and their cells are accessed based on the linear or circular mechanism used in the FIFO structure. In linear FIFOs, data enters from one side and exits from the other, operating like a pipeline and imposing a high delay.

Circular dual clock FIFOs and RAM-based structures use pointers to indicate the read and write positions. Pointers in RAM-based structures are usually generated by counters [32,38], whereas in register-based structures, pointers are generated by token rings [39,40]. A token ring is a sequence of cascaded flip flops circulating one or more tokens with a clock, so that different registers take turns for reading/writing. A simple token ring structure and its output are depicted in Fig. 3.

Based on research on GALS NoC implementations [3,41], very deep FIFO interfaces are not necessary, and usually a FIFO with a depth of less than 10 words can provide full-throughput communication [42]. Therefore, we choose a circular register-based dual clock FIFO structure for the GALS NoC interface.

2.3.1. Dual clock FIFO structures

Some of the principal designs are summarized in Table 1. These preeminent designs are introduced in the following.

Table 1
Different FIFO structures overview.
FIFO | Storage structure | Synchronizer used | Access storage cells by | Full/empty detector based on | Design method | CAD tool compatibility
Cumming FIFO [32] | RAM based | Multi bit | Binary counter | Gray code | Synchronous | Yes
Chelcea and Nowick FIFO [39] | Register based | Single bit | Combinational | Status register | Sync/Async | No (needs custom cells)
Apperson FIFO [38] | RAM based | Multi bit | Binary counter | Gray code | Synchronous | Yes
DSPIN FIFO [40] | Register based | Multi bit | Token ring (two hot encoding) | Bubble encoding | Synchronous | Yes
Our presented FIFO | Register based | Single bit | Token ring (one hot encoding) | Elastic control signals | Elastic | Yes

In Ref. [32], a dual clock RAM-based FIFO has been presented with binary-code addressing. Binary addresses are converted to a special Gray code to be synchronized with the corresponding clock domain and used for full/empty detection. This structure is composed of 5 modules: memory, read pointer generator and empty detector, write pointer generator and full detector, read pointer synchronizers, and write pointer synchronizers.

The Gray codes are generated from binary addresses with one extra bit. The most significant bit indicates the wrap-around of the code: as the Gray code is a reflected (mirror) code, the less significant bits repeat and show the address, while the most significant bit shows whether the pointer is on the next round of access. This structure has some drawbacks. First, the design complexity and the code conversion cause power and area overhead. Second, due to the Gray code structure, this FIFO can only have a capacity that is a power of two, which is considered a limitation. Moreover, recent studies have claimed that it is impossible to use and synchronize Gray codes between different clock domains when multiple queues are implemented. To avoid this synchronization problem, the authors of [43] have proposed a loosely-coupled solution, which decouples shared buffering and synchronization. This method imposes additional data latency, which leads to a larger minimum buffer capacity for full-throughput operation. A multiple queue scheme has been presented in Ref. [44] to fulfil dynamic virtual channel needs.

A RAM-based dual clock FIFO using the pausible clock technique has been presented in Ref. [38]. Its structure is similar to that of [32]. The RAM is accessed through a binary code, and the full/empty detector operates based on Gray code because of multi-bit synchronization needs. The authors considered the timing path between the FIFO and the producer/consumer; therefore, an adaptive circuit is designed to assert full/empty some clock cycles earlier based on this timing path.

Chelcea and Nowick presented a register-based dual clock FIFO for a GALS structure [39]. This FIFO has 5 main modules: data cells, full detector, empty detector, put controller, and get controller. An SR flip flop is used to indicate each data cell's status, i.e., whether its data is valid or not. The full detector investigates whether any adjacent cells are full, in which case the full signal is asserted. Similarly, the empty detector checks all adjacent cells to see whether they are empty.

This FIFO is the basis of many studies but has some limitations:
- The full/empty detector needs custom cell designs and is not compatible with standard design tools, although it has an optimal area.
- If the read clock runs three times faster than the write clock, the FIFO's functionality fails, since the SR latches work asynchronously, assert their output before the data is written, and deassert the empty signal simultaneously. If the read operation is then performed at high speed, the FIFO tries to read data that is not yet written. Conversely, if the producer is fast, this situation repeats and the FIFO cell is overwritten.
- This FIFO is exposed to glitch failure, as a glitch can change the state of a register.

An improvement proposed in Ref. [45] uses standard cells instead of custom ones.

Apperson in Ref. [38] states that if the FIFO is used in a real situation, some clock cycles may be needed for signals to reach the producer/consumer in large GALS NoCs, so this distance should be considered when generating the full/empty signals. Other designs do not consider this restriction and assume that the full/empty signals immediately reach the producer/consumer.

Another dual clock register-based FIFO is presented in Ref. [40] and used as the interface in the DSPIN NoC [41]. The FIFO contains 5 main modules: read/write token rings, full/empty detectors, and data buffers. The FIFO uses the read/write token rings to access FIFO cells, and the full/empty conditions are detected based on the read and write tokens. Since these tokens are generated in different clock domains, multi-bit synchronization is needed. For correct multi-bit synchronization, the token rings use bubble encoding, and some extra modules are needed to control synchronizer output validity. With bubble encoding, two consecutive tokens (ones) circulate. For the write token ring, the first one indicates the write position, and for the read token ring, the first place after the second one represents the read position. To detect a full FIFO, the detector checks whether the read and write pointers have reached the same position. Because of the synchronizer delay, the full condition is reported two cells before the real full situation. This structure guarantees that if the FIFO is exposed to a burst write, no data is overwritten. However, FIFO utilization in usual read/write operations is degraded, as the nominal FIFO capacity is not fully used. The designers of this FIFO believe that detecting an empty FIFO is more important than detecting a full one. Hence, they have designed a more complex empty detector: if the full signal is asserted too early, only FIFO utilization is degraded with no impact on functionality, whereas an incorrect empty signal destroys functionality.

Based on [40], a dual clock FIFO was designed in [42] and used for interfacing different clock domains. The main claims of its authors are smaller area and less delay in mesochronous usage. A reconfigurable dual clock FIFO structure based on [40] is presented in Ref. [46] for adaptive voltage/frequency domains.
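The Gray-coded pointer scheme attributed above to the Cumming FIFO [32] can be sketched in a few lines of Python. This is a minimal illustration assuming the widely used formulation with one extra pointer bit (addr_bits + 1 bits per pointer) and synchronized Gray-coded pointers; the function names are hypothetical, the synchronizer flip flops are omitted, and the comparison rules are the textbook ones rather than a transcription of [32]. It also hints at the power-of-two restriction: the single-bit-change property holds across the wrap only when the counter wraps at a power of two.

```python
def bin_to_gray(b: int) -> int:
    """Reflected binary Gray code: consecutive values differ in one bit."""
    return b ^ (b >> 1)


def gray_to_bin(g: int) -> int:
    """Inverse conversion: XOR of all right shifts of g."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b


def is_empty(rd_gray: int, wr_gray_sync: int) -> bool:
    """Read side: empty when the read pointer equals the synchronized write pointer."""
    return rd_gray == wr_gray_sync


def is_full(wr_gray: int, rd_gray_sync: int, addr_bits: int) -> bool:
    """Write side: full when the pointers are exactly one lap apart.

    In reflected Gray code, a one-lap difference appears as the two MSBs
    inverted and all lower bits equal.
    """
    mask = 0b11 << (addr_bits - 1)
    return wr_gray == (rd_gray_sync ^ mask)


if __name__ == "__main__":
    # Depth 4 (addr_bits = 2) uses 3-bit pointers.
    for ptr in range(8):
        print(ptr, format(bin_to_gray(ptr), "03b"))
```

For a depth-4 FIFO, a write pointer of 5 and a synchronized read pointer of 1 are one lap apart, so `is_full(bin_to_gray(5), bin_to_gray(1), addr_bits=2)` is true.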
write and read grants with one hot encoding. The storage structure of this FIFO is made of elastic buffers. The control circuit of an elastic buffer imposes power and area overhead; however, it has been shown that as the bandwidth increases, this overhead fades away.

Despite the complexity of the storage structure in the new FIFO, the read/write controllers are simpler and the full/empty detectors perform accurately. With regard to the synchronization requirements of GALS structures, this FIFO can operate with arbitrary phase and frequency. This elastic FIFO has N elastic buffers; it can store 2N data items and accesses them through N-bit read/write token rings. For better FIFO performance, the elastic buffers can operate with the faster clock. This comes in two flavours: a FIFO with a faster write clock, where the buffers run with the write clock, or a FIFO with a faster read clock, where the buffers run with the read clock.

3.1. Elastic FIFO structure

The FIFO is formed by six main modules: read/write token rings, data storage cells, full/empty detectors, and the modification module.

3.1.1. Token rings

Token rings that use one-hot encoding provide the read/write positions for the FIFO. The write token is sent as the valid signal of the corresponding elastic buffer so that it stores new data. The read token is the inverted stop signal: when the read token is asserted, the stop signal is low and data can be read. Whenever the read token is zero, the stop signal is asserted; the data is held in the buffer, and another data item can also be written into that buffer.

When a token ring is disabled, it holds its state. This means that a persistent enable is applied to the current FIFO cell, and the read or write operation is continuously performed on the same cell. As this could lead to a malfunction, the token ring's output is gated by the token ring's enable signal.

3.1.2. Data storage and modification module

Elastic buffers are used as data storage cells. Each elastic buffer has two input control signals, as depicted in Fig. 2. One is V1 from the producer side, which shows the validity of written data. The other is S2, which is sent by the consumer and is used for reading from the FIFO. V1 is in the write clock domain, is generated by the write token ring, and is asserted for one write clock period for each data item; similarly, S2 is in the read clock domain. The data storage cells operate with either the read or the write clock. If either V1 or S2 stays active longer than one clock period of the data storage cells, a malfunction may occur and data may be written or read incorrectly. For this reason, it is necessary to shorten the signal activation time to one clock period using the modification module depicted in Fig. 4.

To achieve better performance, it is recommended that the data storage cells run with the faster of the read and write clocks. Therefore, two FIFO structures have been presented. These structures are similar and differ only in the data storage cells' clock and the location of the modification module.

When the producer is faster than the consumer, the elastic buffers should work with the write clock. We call this design the fast_wr elastic FIFO; here the modification circuit is added to the S2 path to shorten S2 activation to one write clock period. When the consumer runs faster, the storage cells should operate with the read clock, and V1 should be modified to shrink its assertion to one read clock period; we call this design the fast_rd elastic FIFO.
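The exact modification circuit is the one shown in Fig. 4. As a stand-in, here is a minimal Python sketch of one common way to shrink a multi-cycle control level to a single fast-clock pulse: a rising-edge detector that keeps the previous sample in a register. This is an assumption, not necessarily the circuit of Fig. 4. It also ignores the synchronizer that would first bring the level into the fast clock domain, and the function name is hypothetical. For the S2 path of a fast_wr FIFO, the detector would be applied to the active-high read token and the result inverted to form S2.

```python
from typing import List


def shorten_to_one_cycle(level: List[int]) -> List[int]:
    """Turn each assertion of a slow-domain control level into a one-cycle pulse.

    `level` holds the signal (e.g. V1 in a fast_rd FIFO) sampled on successive
    edges of the fast clock; the output is high only where the level rises.
    """
    pulses: List[int] = []
    prev = 0  # previous-sample register, reset to 0
    for sample in level:
        pulses.append(1 if sample and not prev else 0)
        prev = sample
    return pulses


if __name__ == "__main__":
    # V1 held for one write period (three read-clock cycles), twice.
    v1 = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0]
    print(shorten_to_one_cycle(v1))  # [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
```

Edge detection works per buffer here because each buffer's V1 comes from its own token bit, which drops again once the write token moves on.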
Fig. 5. Dual clock elastic FIFO structure. (a) fast_wr structure. (b) fast_rd structure.
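To tie the pieces of Section 3.1 together, here is a functional Python sketch of the organization described above: N elastic buffers addressed by one-hot write and read token rings and holding up to 2N items. It assumes a single abstract clock (no clock domains, synchronizers, or modification module). As an illustrative assumption rather than the paper's exact detector, it derives full from S1 of the buffer under the write token and empty from V2 of the buffer under the read token. All class and method names are hypothetical.

```python
from collections import deque
from typing import Any, Deque, List, Optional


class TokenRing:
    """One-hot token ring: exactly one register holds the token."""

    def __init__(self, n: int) -> None:
        self.bits: List[int] = [1] + [0] * (n - 1)

    @property
    def position(self) -> int:
        return self.bits.index(1)

    def advance(self) -> None:
        # Each enabled clock passes the token to the next register (circular).
        self.bits = [self.bits[-1]] + self.bits[:-1]


class ElasticFifoModel:
    """Functional sketch of an N-buffer elastic FIFO holding up to 2N items."""

    EB_CAPACITY = 2  # two latches per elastic buffer

    def __init__(self, n_buffers: int) -> None:
        self.buffers: List[Deque[Any]] = [deque() for _ in range(n_buffers)]
        self.wr = TokenRing(n_buffers)
        self.rd = TokenRing(n_buffers)

    @property
    def full(self) -> bool:
        # Assumed detector: S1 of the EB selected by the write token.
        return len(self.buffers[self.wr.position]) == self.EB_CAPACITY

    @property
    def empty(self) -> bool:
        # Assumed detector: V2 of the EB selected by the read token is low.
        return len(self.buffers[self.rd.position]) == 0

    def write(self, data: Any) -> bool:
        if self.full:
            return False  # producer must retry
        # The write token acts as V1 of the selected elastic buffer.
        self.buffers[self.wr.position].append(data)
        self.wr.advance()
        return True

    def read(self) -> Optional[Any]:
        if self.empty:
            return None
        # The read token is the inverted stop: only the selected EB sees S2 = 0.
        data = self.buffers[self.rd.position].popleft()
        self.rd.advance()
        return data


if __name__ == "__main__":
    fifo = ElasticFifoModel(4)
    accepted = [fifo.write(i) for i in range(9)]
    print(accepted, [fifo.read() for _ in range(8)])
```

Because items are written and read round-robin, the buffer under the write token is full only when all 2N slots are occupied, and the buffer under the read token is empty only when the whole FIFO is empty, so in this model the two local checks are exact.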
Table 2
Area comparison of various dual clock FIFO designs.
(Columns: technology; area (μm²) and gate count for each of the configurations 4/16, 4/32, 8/16, and 8/32.)
Table 3
Transistor count comparison of elastic buffer and flip flop.
(Columns: Total; Latches; Control circuit.)

Table 4
Minimum required FIFO depth for 100% and 50% throughput.
(Columns: Minimum FIFO depth for 50% throughput; Minimum FIFO depth for 100% throughput.)
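The conversion used in Section 4.3 from required storage depth to elastic buffers is a ceiling division by the two-item EB capacity. The tiny Python sketch here reproduces the reported figures (5 slots to 3 EBs for 50% throughput, 7 slots to 4 EBs for 100%); the function name is hypothetical, and the sketch only restates that arithmetic rather than modelling throughput.

```python
EB_CAPACITY = 2  # data items per elastic buffer (two adjacent latches)


def elastic_buffers_needed(storage_slots: int) -> int:
    """Smallest number of two-entry elastic buffers covering `storage_slots`."""
    return -(-storage_slots // EB_CAPACITY)  # integer ceiling division


if __name__ == "__main__":
    for target, slots in (("50%", 5), ("100%", 7)):
        print(f"{target} throughput: {slots} slots -> {elastic_buffers_needed(slots)} EBs")
```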
Table 5
Elastic FIFO
64 ¼ 2048/32
Minimum read and write clock period (ps) for Elastic FIFO (fast_Wr) and DSPIN
8.04E06
13834822
FIFO for different control gate size (W/L).
3001.6
2680
1.12
1.309
3251
4255
W/L 4 ¼ 128/ 8 ¼ 256/ 16 ¼ 512/ 32 ¼ 1024/
32 32 32 32
DSPIN FIFO
1200/1500
DSPIN Write 850 450 300 250
1.33E07
FIFO period
4119.0
1.267
3251
Read 1050 600 350 300
32 ¼ 1024/32
period
13211251
Elastic Write 1150 650 350 300
Elastic FIFO
FIFO period
3251
4063
1.25
Read 1450 700 400 350
5.74E06
2576.8
1.155
period
2231
4.3. Throughput analysis
16 ¼ 512/32
DSPIN FIFO
1100/1200
13390924
7.30E06
3162.3
1.369
1.267
2310
The throughput of dual clock FIFOs can be reported as a function of
3251
4119
FIFO depth. As the dual clock FIFOs impose a delay to the data path, the
throughput of the network could be possibly degraded. Deeper FIFOs do
Elastic FIFO
not decrease the throughput, as buffering data could cover the latency,
4.62E06
2394.4
and data can be read from the FIFO in every cycle [48]. Simulations show
8 ¼ 256/32
1931
13824253
1.24
the minimum buffer requirement of our presented design for having 50%
1.308
3251
4252
and 100% throughput is 5 and 7 buffers, respectively.
DSPIN FIFO
Our presented FIFO uses elastic buffers as storage elements. Each
900/1000
6.44E06
elastic buffer can store double data, therefore, our design only requires 3
3219.6
1.609
2001
and 4 elastic buffers to have 50% and 100% throughput, respectively.
DSPIN FIFO
4 ¼ 128/32
Table 4 compares buffer requirements of different designs. The minimum
15388465
depth of our proposed structure for 100% throughput is less than the
Elastic FIFO
1.456
3251
4733
other structures, and for 50% throughput the minimum depth is less than
3.40E06
2087.6
DSPIN FIFO based structures [40,42,46], and similar to Cumming [32]
1631
1.28
and Ono [45] structures. Note that for having 100% throughput dual
clock FIFO, it is necessary to have the same clock frequency for read and
DSPIN FIFO
64 ¼ 2048/32
write clock.
700/800
3.12E06
17178667
2404.2
1.848
1301
Presented Elastic FIFO and DSPIN FIFO Delay, Power, PDP and EDP comparison with 4-deep, 34 bit-width.
2.516
2613
Elastic FIFO
We have implemented our presented structure using HSPICE to study
2.95E06
1999.4
1.351
power consumption as well as variation. Since, the proposed structure
1480
32 ¼ 1024/32
(a) Where the write frequency is 833 MHz and the read frequency is 666 MHz for different W/L ratio
open up a new design space, other presented ideas for custom designs
10672333
such as [42,46] can be based on our elastic dual clock FIFO. Moreover, to
4067.2
DSPIN FIFO
2624
600/700
2.84E06
2471.1
we have compared our design with the most similar synchronous design
2.147
1151
2.58E06
3001
(b) With control circuit W/L ¼ 512/32 ratio for different frequencies
Expanding the bandwidth can change the FIFO's delay, as the control
circuits should drive more buffers. Thus, we have modified the control
DSPIN FIFO
3.09E06
2941.7
L ratio for the control circuit, changes the maximum operating frequency
2.799
8 ¼ 256/32
1051
8122331
of FIFOs. The minimum range for read/write clock period with 50ps
1.025
2815
2885
approximation error for fast_wr elastic FIFO and DSPIN FIFO has been
Elastic FIFO
reported in Table 5.
2.20E06
For the same W/L size and 34-bit data width, DSPIN FIFO can run
1865.5
1.581
1180
with nearly 19% higher frequency than the elastic FIFO on average. The
Elastic FIFO
4 ¼ 128/32
reason is the prominent role of the control circuit in the elastic FIFO.
11197617
However, for smaller bandwidth, simulated for 5 bit width, the elastic
DSPIN FIFO
1.172
3091
3622
400/500
Power (mW)
way that FIFOs serve the same data count. Table 6 a reports delay, power
Delay (ps)
Delay (ps)
W/L
PDP
EDP
PDP
EDP
(EDP) for DSPIN and elastic FIFOs for 20 ns simulation time, where the
write clock runs with 1200 ps clock period and 1500 ps read clock period
76
S.M.T. Adl, S. Mohammadi Microelectronics Journal 76 (2018) 69–80
Fig. 7. Comparison of synchronous DSPIN FIFO and elastic FIFO with fixed read clock period (1500 ps ~666 MHz) and write clock period (1200 ps ~ 833 MHz) but
different control circuit size. (a) Power consumption comparison. (b) PDP comparison.
Fig. 8. Elastic FIFO and DSPIN FIFO comparison based on frequency degradation. (a) Delay trend. (b) Power trend. (c) PDP comparison. (d) EDP comparison.
for 4-deep 34-bit width FIFOs. with fixed frequency and control circuit size in terms of delay, power,
The DSPIN synchronous FIFO delay does not depend on the control PDP, and EDP. Results show decreasing the read/write frequency, in-
circuit size; whereas the elastic FIFO delay decreases with control gates creases the FIFO delay, while it decreases the power consumption. Fig. 8
enlargement. However, increasing the gate size causes more power depict different parameters trends.
consumption as depicted in Fig. 7a. The delays are almost the same, however for high frequency opera-
We have calculated the PDPs, and compared them in Fig. 7b. Based on tions, DSPIN FIFO imposes less delay. While, for frequencies under 1 GHz
PDP results, the elastic FIFO, W ¼ 256 nm and W ¼ 512 nm shows better the elastic FIFO operates faster and exhibits less delay. The elastic FIFO
performances, whereas DSPIN FIFO has its best performance with consumes less power. Power reduction trends based on frequency
W ¼ 512 nm and W ¼ 1024 nm, therefore we have chosen w ¼ 512 nm degradation has been compared in Fig. 8 c. Synchronous DSPIN FIFO
for both designs in our later simulations. consumes 28% more power in average than the elastic FIFO. For a better
To determine the frequency impact on delay and power consumption, comparison, we have considered PDP parameter, where that of DSPIN
we have compared both FIFOs with a fixed size and in fair conditions. We FIFO is 23% larger than elastic structure's. These results demonstrate
have assumed L ¼ 32 nm, and W ¼ 512 nm for control circuits, and have elastic circuit is a better choice for GALS NoC interface structure.
simulated FIFOs for 20 ns. Results for different write and read frequencies EDP has been calculated as well and depicted in Fig. 8 d. Results
have been presented in Table 6 b. DSPIN and elastic FIFOs are compared confirm the elastic FIFO usage instead of DSPIN FIFO for GALS NoC
Table 7
Simulation parameters for variation exploration.
(a) circuit level parameters
77
S.M.T. Adl, S. Mohammadi Microelectronics Journal 76 (2018) 69–80
Fig. 9. Comparison of elastic and DSPIN FIFOs performance under Vth variation. (a) Power, Delay, and PDP comparison in presence of 30% Vth variation based on
different frequencies (10%, 20%, 33%). (b) Distribution of PDP comparison based on different variation situation (20%, 30%, 40%) where FIFOs run 10% slower
maximum frequency.
interface. All along these discussions, the important point has been that parameters are stated in Table 7 a.
the elastic FIFO capacity is twice DSPIN FIFO. In Network on chip per- We have chosen Wn ¼ 512 nm for control circuit's gates size, so that
formance research area, the FIFO capacity plays an undeniable role for the maximum frequency for DSPIN FIFO will be 300 ps for the write clock
the congestion control [50]. period, and 350 ps for the read clock. The elastic FIFO can operate at least
As a result of technology shrinkage unpredictable design changes may with 350 ps write clock period, and 400 ps read clock period, as stated in
occur due to process variability. The process variation modifies delay Table 5.
characteristics of circuits after manufacturing. Thereby, variation has FIFOs performance parameters have been studied, when the clocks
become one of the most important challenges to be considered in circuit run 10%, 20%, and 33% slower than the maximum nominal frequency in
design as feature size scale down from 65 nm to 16 nm [51]. presence of Vth variation. Table 7 b expresses different read and write
We have explored Die-to-Die process variation impact on the FIFOs clock periods in test scenarios.
performance and functionality. According to studies conducted in Refs. We have simulated both FIFOs in nine different scenarios where Vth as
[51,52], the most effective variation parameter in deep submicron variation parameter, fluctuates 20%, 30%, and 40% from the nominal
technology is the threshold voltage (Vth), which has a direct impact on amount in three different operating frequencies. Monte Carlo simulations
performance and functionality. Other parameters such as transistors' are performed with 256 iterations for different FIFOs. We have analyzed
width and length, or gate oxide thickness come second in process power consumptions, FIFO delays, and PDP changes.
variation. Results show the elastic FIFO can preserve functionality 5% more
Variation of threshold voltage can be as large as 80 mV in 20 nm than DSPIN synchronous FIFO in average in presence of variation. The
technology [53]. Although, we have simulated our design in 32 nm FIFO's functionality is verified based on the input scenario. The output
technology, we have explored Vth variations up to 40% of the nominal data is checked to see whether the output values are in the intended
amount for today technology needs [54]. order. For example, if 2(010), 3(011), 4(100) enter the FIFO in this order,
For this purpose, we have simulated and analyzed two 34-bit 4-deep the output order should be the same and the signal transitions must be
FIFOs near nominal maximum read/write frequencies. Simulation checked for correct output data. For instance when Vth variation is 30%
78
S.M.T. Adl, S. Mohammadi Microelectronics Journal 76 (2018) 69–80
the DSPIN FIFO preserves its functionality in 81.5% of the cases, while the elastic FIFO works correctly in 87.5%. The results show that increased Vth variation degrades the functionality of the synchronous DSPIN FIFO more than that of the elastic FIFO. When the circuit suffers less variation, however, the DSPIN FIFO's functionality is slightly better.

To study the effect of Vth variation on FIFO performance and to compare the designs, the coefficient of variation (Cv) is used. Cv expresses the spread of a parameter relative to its mean value under variation [55]. It is calculated as follows:

$$\mu(x) = \frac{\sum_{i=1}^{n} x_i}{n} \tag{9}$$

$$\operatorname{var}(x) = \frac{\sum_{i=1}^{n} \left( x_i - \mu(x) \right)^2}{n-1} \tag{10}$$

$$\sigma(x) = \sqrt{\operatorname{var}(x)} \tag{11}$$

$$C_v = \frac{\sigma(x)}{\mu(x)} \tag{12}$$

where x_1, ..., x_n are the n samples of the parameter under study.

We have obtained extensive results from evaluating the delay, power, and PDP of the FIFOs in the presence of variation. The FIFOs' performance is compared only over the cases in which both FIFOs function correctly. Under variation, the mean (μ) power of the elastic FIFO is generally lower than that of the DSPIN synchronous FIFO. The mean delay of the DSPIN FIFO, on the other hand, is lower than that of the presented elastic FIFO. These differences arise mainly from the difference in operating frequency, since the elastic FIFO is examined at a lower frequency.

We have studied the impact of Vth variation on performance thoroughly; only a subset of the results is reported here. When the threshold voltage varies by 30% of its nominal value, the performance variation of the elastic FIFO is smaller than that of the DSPIN FIFO. As the frequency decreases, power variation increases while delay variation decreases. Since PDP depends on both power and delay, and power grows more than delay decreases, PDP increases as the frequency decreases. Fig. 9a shows the trends of the delay, power, and PDP variations at different frequencies. For 40% Vth variation, as in the 30% case, the performance variation of the elastic FIFO is smaller than that of the DSPIN FIFO. For 20% Vth variation, however, the DSPIN FIFO shows almost the same variability as the elastic FIFO.

From another point of view, we can observe how higher Vth variation affects the FIFOs' performance when both FIFOs run 10% slower than their maximum frequency. Fig. 9b shows the PDP distribution, and these results confirm that the elastic FIFO performs better under higher variation.

5. Conclusion

A dual clock elastic FIFO has been presented whose read and write operations can run with arbitrary phase and frequency. This FIFO is suitable for use as a GALS NoC interface. It connects easily to any synchronous or asynchronous interface and can be designed with commercial CAD tools.

The presented elastic FIFO can store twice as much data with less area overhead than other FIFOs. It has a simple structure and needs neither multi-bit synchronization nor special coding. Full/empty detection in this FIFO is more accurate than in other FIFOs because of its simple combinational logic, which has a small delay overhead. The latency and throughput of the design have been explored. The minimum buffer requirement of the proposed design for 100% throughput is lower than that of the other structures.

Two structures of the presented elastic FIFO are simulated using HSPICE, and their delay, power consumption, PDP, and EDP have been studied. Moreover, the impact of threshold voltage variation on functionality and performance has been analyzed for the 34-bit, 4-deep FIFO structure. The results are compared with the most similar synchronous design, the DSPIN FIFO. They confirm that the presented FIFO is a better choice for GALS structures. The elastic FIFO consumes less power, while the delays are nearly the same. The elastic design has more delay at high frequencies (1 GHz–2 GHz). At lower frequencies, however, the delay of the synchronous design increases dramatically, while the elastic design keeps its smooth increasing trend. The PDP results confirm the efficiency of the elastic FIFO.

Acknowledgment

This research was supported in part by a grant from the Institute for Research in Fundamental Sciences (IPM) (No. 3-1396-9).

References

[1] L. Benini, G. De Micheli, Networks on chips: a new SoC paradigm, Computer 35 (1) (2002) 70–78.
[2] D.M. Chapiro, Globally-asynchronous Locally-synchronous Systems, Stanford University, 1984.
[3] E. Beigne, P. Vivet, Design of on-chip and off-chip interfaces for a GALS NoC architecture, in: Proceedings - International Symposium on Asynchronous Circuits and Systems, 2006, pp. 172–181.
[4] A. Strano, D. Ludovici, D. Bertozzi, A library of dual-clock FIFOs for cost-effective and flexible MPSoC design, in: 2010 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, 2010, pp. 20–27.
[5] M. Jhamb, R.K. Sharma, A.K. Gupta, A novel FIFO design for data transfer in mixed timing systems, Int. J. Electr. Comput. Energy Electron. Commun. Eng. 8 (3) (2014) 609–614.
[6] A. Chakraborty, M.R. Greenstreet, Efficient self-timed interfaces for crossing clock domains, in: Ninth International Symposium on Asynchronous Circuits and Systems, 2003. Proceedings, 2003, pp. 78–88.
[7] A. Yakovlev, P. Vivet, M. Renaudin, Advances in asynchronous logic: from principles to GALS & NoC, recent industry applications, and commercial CAD tools, in: Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013, pp. 1715–1724.
[8] H. Han, K.S. Stevens, Clocked and asynchronous FIFO characterization and comparison, in: 2009 17th IFIP International Conference on Very Large Scale Integration (VLSI-SoC), 2009, pp. 101–108.
[9] J. Cortadella, M. Kishinevsky, B. Grundmann, Synthesis of synchronous elastic architectures, in: 2006 43rd ACM/IEEE Design Automation Conference, 2006, pp. 657–662.
[10] J. Cortadella, L. Lavagno, D. Amiri, J. Casanova, C. Macian, F. Martorell, J.A. Moya, L. Necchi, D. Sokolov, E. Tuncer, Narrowing the margins with elastic clocks, in: 2010 IEEE International Conference on Integrated Circuit Design and Technology, 2010, pp. 146–150.
[11] L.P. Carloni, K.L. McMillan, A. Saldanha, A.L. Sangiovanni-Vincentelli, A methodology for correct-by-construction latency insensitive design, in: IEEE/ACM International Conference on Computer-aided Design. Digest of Technical Papers (Cat. No.99CH37051), 1999, pp. 309–315.
[12] S. Krstic, J. Cortadella, M. Kishinevsky, J. O'Leary, Synchronous elastic networks, in: 2006 Formal Methods in Computer Aided Design, vol. 2, 2006, pp. 19–30.
[13] V.S. Vij, R.P. Gudla, K.S. Stevens, Interfacing synchronous and asynchronous domains for open core protocol, in: 2014 27th International Conference on VLSI Design and 2014 13th International Conference on Embedded Systems, 2014, pp. 282–287.
[14] K. Swaminathan, G. Lakshminarayanan, F. Lang, M. Fahmi, S.-B. Ko, Design of a low power network interface for Network on chip, in: 2013 26th IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), vol. 2, 2013, pp. 1–4 no. December 2014.
[15] J. Cortadella, M. Kishinevsky, B. Grundmann, SELF: specification and design of a synchronous elastic architecture for DSM systems, Int. Work. Timing Issues Specif. Synth. Digit. Syst. (2006).
[16] J. Cortadella, M. Galceran-Oms, M. Kishinevsky, Elastic systems, in: Eighth ACM/IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE 2010), 2010, pp. 149–158.
[17] J. Carmona, J. Cortadella, M. Kishinevsky, A. Taubin, Elastic circuits, IEEE Trans. Comput. Des. Integr. Circuits Syst. 28 (10) (Oct. 2009) 1437–1455.
[18] M.R. Casu, L. Macchiarulo, Adaptive latency-insensitive protocols, IEEE Des. Test Comput. 24 (5) (Sep. 2007) 442–452.
[19] L.P. Carloni, A.L. Sangiovanni-Vincentelli, Coping with latency in SOC design, IEEE Micro 22 (5) (Sep. 2002) 24–35.
[20] M. Galceran-Oms, J. Cortadella, D. Bufistov, M. Kishinevsky, Automatic microarchitectural pipelining, in: 2010 Des. Autom. Test Eur. Conf. Exhib. (DATE 2010), Mar. 2010, pp. 961–964.
[21] T. Kam, M. Kishinevsky, J. Cortadella, M. Galceran-Oms, Correct-by-construction microarchitectural pipelining, in: 2008 IEEE/ACM Int. Conf. Comput. Des, vol. 3, Nov. 2008, pp. 434–441.
[22] J. You, Y. Xu, H. Han, K.S. Stevens, Performance evaluation of elastic GALS interfaces and network fabric, Electron. Notes Theor. Comput. Sci. 200 (1) (Feb. 2008) 17–32.
[23] M. Galceran-Oms, J. Cortadella, M. Kishinevsky, Speculation in elastic systems, in: Proceedings of the 46th Annual Design Automation Conference (DAC '09), vol. 1, 2009, p. 292.
[24] D.E. Bufistov, J. Cortadella, M. Galceran-Oms, J. Júlvez, M. Kishinevsky, Retiming and recycling for elastic systems with early evaluation, in: Design Automation Conference, 2009. DAC '09. 46th ACM/IEEE, vol. 1, 2009, pp. 288–291.
[25] M. Galceran-Oms, A. Gotmanov, J. Cortadella, M. Kishinevsky, Microarchitectural transformations using elasticity, ACM J. Emerg. Technol. Comput. Syst. 7 (4) (Dec. 2011) 1–24.
[26] M. Galceran-Oms, Automatic Pipelining of Elastic Systems, Universitat Politecnica de Catalunya, 2011.
[27] G. Dimitrakopoulos, A. Psarras, I. Seitanidis, Microarchitecture of Network-on-chip Routers, Springer New York, New York, NY, 2015.
[28] R. Mullins, S. Moore, Demystifying data-driven and pausible clocking schemes, in: 13th IEEE Int. Symp. Asynchronous Circuits Syst, Mar. 2007, pp. 175–185.
[29] K.Y. Yun, R.P. Donohue, Pausible clocking: a first step toward heterogeneous systems, in: Proceedings International Conference on Computer Design. VLSI in Computers and Processors, 1996, pp. 118–123.
[30] R. Ginosar, Fourteen ways to fool your synchronizer, in: Proc. - Int. Symp. Asynchronous Circuits Syst, 2003, pp. 89–96.
[31] P. Teehan, M. Greenstreet, G. Lemieux, A survey and taxonomy of GALS design styles, IEEE Des. Test Comput. 24 (5) (Sep. 2007) 418–428.
[32] P. Alfke, C.E. Cummings, Simulation and synthesis techniques for asynchronous FIFO design with asynchronous pointer comparisons, in: SNUG-2002, 2002, pp. 1–17.
[33] A.V. Yakovlev, A.M. Koelmans, L. Lavagno, High-level modeling and design of asynchronous interface logic, IEEE Des. Test Comput. 12 (1) (Jan. 1995) 32–40.
[34] E. Brunvand, Low latency self-timed flow-through FIFOs, in: Advanced Research in VLSI, 1995. Proceedings., Sixteenth Conference on, 1995, pp. 76–90.
[35] A. Chakraborty, M.R. Greenstreet, Efficient self-timed interfaces for crossing clock domains, in: Ninth International Symposium on Asynchronous Circuits and Systems, 2003. Proceedings, 2003, pp. 78–88.
[36] J. Ebergen, Squaring the FIFO in GasP, in: Proc. - Int. Symp. Asynchronous Circuits Syst, 2001, pp. 194–199 no. 2.
[37] J.T. Yantchev, C.G. Huang, M.B. Josephs, I.M. Nedelchev, Low-latency asynchronous FIFO buffers, in: Proceedings Second Working Conference on Asynchronous Design Methodologies, vol. 19, IEEE Comput. Soc. Press, 2013, pp. 24–31 no. 1.
[38] R.W. Apperson, Z. Yu, M.J. Meeuwsen, T. Mohsenin, B.M. Baas, A scalable dual-clock FIFO for data transfers between arbitrary and Haltable clock domains, IEEE Trans. Very Large Scale Integr. Syst. 15 (10) (Oct. 2007) 1125–1134.
[39] T. Chelcea, S.M. Nowick, Robust interfaces for mixed-timing systems, IEEE Trans. Very Large Scale Integr. Syst. 12 (8) (Aug. 2004) 857–873.
[40] I. Miro Panades, A. Greiner, Bi-synchronous FIFO for synchronous circuit communication well suited for network-on-chip in GALS architectures, in: First International Symposium on Networks-on-chip (NOCS'07), 2007, pp. 83–94.
[41] I. Miro-Panades, F. Clermidy, P. Vivet, A. Greiner, Physical implementation of the DSPIN network-on-chip in the FAUST architecture, in: Second ACM/IEEE International Symposium on Networks-on-chip (NOCS 2008), 2008, pp. 139–148.
[42] T.-T. Nguyen, X.-T. Tran, A novel asynchronous first-in-first-out adapting to multi-synchronous network-on-chips, in: 2014 International Conference on Advanced Technologies for Communications (ATC 2014), 2014, pp. 365–370.
[43] M. Paschou, A. Psarras, C. Nicopoulos, G. Dimitrakopoulos, CrossOver: clock domain crossing under virtual-channel flow control, in: Design, Automation & Test in Europe Conference & Exhibition (DATE), 2016.
[44] A. Psarras, M. Paschou, C. Nicopoulos, G. Dimitrakopoulos, A dual-clock multiple-queue shared buffer, IEEE Trans. Comput. 66 (10) (Oct. 2017) 1809–1815.
[45] T. Ono, M. Greenstreet, A modular synchronizing FIFO for NoCs, in: 2009 3rd ACM/IEEE International Symposium on Networks-on-chip, 2009, pp. 224–233.
[46] A. Rahmani, P. Liljeberg, J. Plosila, H. Tenhunen, Design and implementation of reconfigurable FIFOs for voltage/frequency island-based, Microprocess. Microsyst. 37 (4–5) (2013) 432–445.
[47] M.M. Mano, COMPUTER SYSTEM Computer System, Prentice Hall, 1992.
[48] I. Miro Panades, Design and Implementation a Micro-network on Chip with Service Guarantee, Pierre-and-Marie-Curie University, 2008.
[49] Predictive Technology Model, 2016 [Online]. Available: http://ptm.asu.edu/.
[50] Q. Liu, R.D. Russell, RGBCC: a new congestion control mechanism for InfiniBand, in: 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-based Processing (PDP), 2016, pp. 91–100.
[51] C. Hernandez, A. Roca, F. Silla, J. Flich, J. Duato, Improving the performance of GALS-based NoCs in the presence of process variation, in: 2010 Fourth ACM/IEEE International Symposium on Networks-on-chip, 2010, pp. 35–42.
[52] M. Mirzaei, M. Mosaffa, S. Mohammadi, Variation-aware approaches with power improvement in digital circuits, Integrat. VLSI J. 48 (1) (Jan. 2015) 83–100.
[53] D. Moon, J. Song, O. Kim, Effect of source/drain doping gradient on threshold voltage variation in double-gate fin field effect transistors as determined by discrete random doping, Jpn. J. Appl. Phys. 49 (104301) (2010).
[54] A.V. Kauppila, Analysis of Parameter Variation Impact on the Single Event Response in Sub-100nm CMOS Storage Cells, Vanderbilt, 2012.
[55] M. Alioto, G. Palumbo, M. Pennisi, Understanding the effect of process variations on the delay of static and domino logic, IEEE Trans. Very Large Scale Integr. Syst. 18 (5) (2010) 697–710.

Seyed Mohamad Taghi Adl received his B.Sc. and M.Sc. degrees in computer engineering from Shahid Beheshti University and Isfahan University of Technology in 2009 and 2011, respectively. He is currently a Ph.D. student at the University of Tehran. His research interests include low-power and asynchronous system design, on-chip interconnection in GALS NoCs, process variation, and elastic design.

Siamak Mohammadi received his B.Sc., M.Sc., and Ph.D. degrees from the University of Paris Sud Orsay, France, in 1990, 1992, and 1996, respectively, all in electrical engineering. During his Ph.D., he was supported by a grant from the Ministry of Education of France. From 1997 to 1999, he was a Research Associate with the Department of Computer Science, University of Manchester, England. In 1999, he moved to Canada and worked at Cogency Semiconductor Inc. in Toronto until 2003 and then at ATI Technologies Inc. until 2005. He is currently an Assistant Professor in the School of Electrical and Computer Engineering at the University of Tehran, Iran. He has over 15 years of experience in VLSI digital design, ASIC design and verification, asynchronous design, and on-chip interconnects in GALS NoCs. He has contributed to the design of the first asynchronous ARM microprocessor, as well as several PowerLine networking and RFID chips.
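A minimal Python sketch of the coefficient of variation defined in Eqs. (9)–(12) follows. It assumes that the Monte Carlo samples of one metric (delay, power, or PDP) are available as a list of numbers. The function name `coefficient_of_variation` and the sample values are hypothetical and are not data from this work.

```python
import math


def coefficient_of_variation(samples):
    """Return (mu, sigma, cv) for a list of samples, following Eqs. (9)-(12).

    Hypothetical helper: the mean uses n, the variance uses the (n - 1)
    denominator of Eq. (10), and Cv is sigma divided by mu.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are needed for an (n - 1) variance")
    mu = sum(samples) / n                                # Eq. (9): sample mean
    var = sum((x - mu) ** 2 for x in samples) / (n - 1)  # Eq. (10): sample variance
    sigma = math.sqrt(var)                               # Eq. (11): standard deviation
    if mu == 0:
        raise ValueError("Cv is undefined for a zero mean")
    return mu, sigma, sigma / mu                         # Eq. (12): Cv = sigma / mu


# Hypothetical PDP samples in arbitrary units (illustrative only).
mu, sigma, cv = coefficient_of_variation([9.0, 10.0, 11.0])
print(mu, sigma, cv)  # 10.0 1.0 0.1
```

The (n − 1) denominator matches Eq. (10). This is the unbiased sample variance, which is also what Python's `statistics.stdev` is based on. When two designs are compared under the same Vth variation, the one with the smaller Cv for a given metric is the less sensitive to that variation.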