Asghari Csi
Asghari Csi
Asghari Csi
Communication Mechanism
Seyyed Amir Asghari
1
, Hossein Pedram
1
and Mohammad Khademi
2
1
Amirkabir University of Technology
2
Shahid Beheshti University
[email protected], [email protected], [email protected]
Abstract
One of the important issues in power and
performance trade off analysis in Network on Chip
designs is communication. Communication portion in
the power consumption of System on Chip can be up to
50% of the whole power of consumption of the chip.
This deems to be more important for Network on Chips
which center around an intercommunication networks.
Many Networks on Chip routers have been designed;
however most of them have not been implemented until
now. In this paper, design and implementation of a
synchronous Network on Chip router based on
asynchronous communication mechanism are
presented. We designed a router with scalability
feature which is synthesized in both FPGA and ASIC
infrastructures. In addition, the proposed router uses
low resource utilization percentage of FPGA and
ASIC.
1. Introduction
Power and performance are two essential features
which in corresponded with each other, produce main
concerns in design and implementation. Nowadays,
very large integrated digital systems [1-4] (Systems on
Chip) may contain different components such as
processor, input-output units and different types of
memories. Likewise, each of these components may
include different specifications such as variable
bandwidth, buses and different communicative
protocols. Generally, bus is utilized for interconnecting
the processing elements of System on Chip (SoC).
However by increasing the number of processing
elements, the bus itself is transmuted into a bottleneck.
To obviate this difficulty, the idea of Network on Chip
(NoC) has been introduced [5].
This network can be modeled as a graph wherein
nodes, processing elements and edges are the
connective links of the processing elements. In this
article, design and implementation of an NoC router
are presented. In the second section of this article, the
utilized routing algorithm is briefly analyzed. In
implementation, XY routing algorithm is utilized [6-7].
In the third section, the wormhole switching which is
used in implementation is reviewed [8, 9]. In the forth
of this article, the utilized traffic pattern is briefly
explained. In the fifth section which considers being
the main body of this article, handshaking
communication mechanism is introduced and analyzed.
In this section, the structure of information packets,
router function and different states of the router are
analyzed. Furthermore, the practical results of
implementation and synthesis of this routing are
presented in the final section of this article. In this
routing, handshaking communication protocol is
utilized to interconnect different processing elements.
2. The Utilized Routing Algorithm
The utilized topology for implementation is an nn
regular two dimensional mesh. A sample of this
topology is shown in Figure 1 shows a NoC.
Figure 1. A regular 33 mesh topology
978-1-4244-4262-1/09/$25.00 2009 IEEE
Proceedings of the 14th International CSI Computer Conference (CSICC'09)
225
The features which are shown in rectangles represent
NoC routers and those which are shown in circles
represent the processing elements of this network. By
the use of communication links and routers, these
processing elements which are connected to each other
transfer information. Routers are named based on their
position in coordinate system. Router ports are also
named based on their geographical direction.
However, as it is shown in Figure 1, the number of
the ports connected to each other is different due to its
position in topology. For example, the router which is
placed in the northeast of topology in 22 coordinates
([2,2]), possesses 3 ports and the router in the center of
the topology in 11 coordinate ([1,1]), has 5 ports.
For n-dimensional mesh topologies in NoCs,
dimension order routing produces deadlock-free
routing algorithms. These algorithms are very popular,
like XY routing (for 2-D mesh). The routing algorithm
which is used in this design is a version of XY
algorithm. This algorithm is deterministic algorithm in
which packet takes routing in one dimension and it
continues till this packet attains the desired coordinate
in that dimension. Then routing is fulfilled in the same
way. This method warrants no deadlock to occur [8, 9].
In this algorithm, according to the coordinates of each
router and destination address, routing takes place first
in X direction and then in Y direction. This algorithm
warrants dead-lock free routing. However, it is
sensitive to link destruction and may not be able to
adopt a substituting router. It is due to the fact that
these types of algorithms adopt routing only based on
the source-destination address of packets. Therefore,
two packets with the same source and destination
address necessarily cross the same route and do not
consider the momentary traffic in the route.
3. The Utilized Switching
The need to buffer complete packet within a router
can make it difficult to construct low area, compact and
fast routers. In implementation, wormhole switching is
used which is utilized in almost all of NoCs [8].
In wormhole switching message packets are also
pipelined through the network. A message packet is
broken up into flits that the flit is the unit of message
flow control. Therefore, input and output buffers at a
router are typically large enough to store a few flits [9].
As we said, in this switching, message packets are
divided into equal smaller sections named as flit. Flits
are concurrently transferred in the network. Therefore
if 16-bit flits are ready to be transferred, 32 signals
between two routers are considered to transfer the flits.
16 signal for sending and 16 signal for receiving. In
this way, flits are transferred in parallel. Other
switching techniques are not commonplace in NoCs
usages. For instance, circuit switching technique due to
its low performance contradicts with power and
performance parameters, similarly packet switching as
a result of its big buffers requirement shows the same
contradiction.
4. The Utilized Traffic Pattern
The traffic model is one of the important parameters
in evaluating the latency time of interconnection
networks. These models are produced according to the
application programs which are run on the machine. In
different applications, different models are used.
Traffic models are defined according to three
parameters [9]: a)The entrance time to networks
b)Message Length and c) Address distribution type.
4.1 The uniform traffic model
The uniform traffic model is the simplest traffic
model which used in most of evaluations. In this
model, each node sends message to the other nodes in
network with equal probability. For example in a 6 6
mesh topology, each nodes sends message to the other
nodes with the probability of %2.85. All source or
destination nodes are selected with equal probability.
The selection of source and destination node for each
message will be independent from other messages [9].
5. Asynchronous Communication
Mechanism
For making interaction between routers, handshaking
communication protocol is utilized in case the data is
put on the line; the existence of the data is informed to
the next router. Next router takes the data from the line
and transmits its confirmation to the sender router. So
in addition to the flits sending and receiving channels,
TX, ACK-TX, RX and ACK-RX signals are required.
TX base is the output and whenever the data is ready in
the output port, this base equals to one and waits for
ACK-TX to be equaled to one. Likewise each input
port after finding the RX input base to be one, reads the
data on this port and equals the ACK-RX output base
to one. The link of two ports from two neighbor
routers is shown in Figure2.
Proceedings of the 14th International CSI Computer Conference (CSICC'09)
226
Figure 2. Communication signals in handshake protocol between
two NoC routers
5.1 The structure of information packets
In each communication standards, the communication
payload contains a series of control fields. These fields
can be put in the main frame as the redundant fields in
order to increase the controllability, fault tolerance,
security and some other issues like these. In our
intercommunication protocol, flits are used to
structuralize. A flit structure is considered in the way
that the first bit shows the flit to be the header-trailer or
the data. When the first bit equals one, this flit is a
header or trailer. In this case, the 2
nd
bit determines
which one is the header and which one is the trailer.
This representation is shown in Table I.
TABLE I. THE DEFINED PROTOCOL THAT CHARACTERIZE THE
FLIT TYPE
Information
type
Second bit Information
type
First bit
Data * Data 0
Trailer 0
Header 1
Header/Trailer 1
Header flit contain source and destination address
and if it is needed, the length of packet based on the
number of flits. The structure of packed based on flits
is shown in Figure 3. The first flit is the header flit
containing the destination address. The next part is the
data flit and final part is the trailer flit.
Figure 3. Transfer packet structure
5.2 Routing function
Each router by receiving the header flit from input,
accomplish routing and updates routing tables
according to its source and destination addresses based
on XY algorithm.
Henceforth, all of the flits take routing based on the
tables till receiving the final flit (trailer). Routing tables
conclude two tables: routing table and output table. The
first table represents the out port for each input and the
second represents the state of each out port (busy or
free). In Fugure4, you can see an NoC central router in
mesh topology. The central router has 5 in/out port.
The local port is utilized to connect the correspondent
circle to the processing element (IP block) and other
ports are for connecting to other routers.
Figure 4. Central NoC router in mesh topology with its ports
The main point here is that the correspondent circle
with this routing should have the same interface to be
able to use this routing. The internal sight of the
features of a routing with their details is not shown and
only the main features are considered (in Figure 5).
Figure 5. A NoC router with its main components
Routing function feature takes the charge of routing
based on routing algorithm and selection function
feature under takes the responsibility of choosing out
Proceedings of the 14th International CSI Computer Conference (CSICC'09)
227
port in competition circumstances based on the defined
priority mechanism. In our designing, mechanisms is
implemented by the software in the manner that it gives
priority to input port and whatever an input port has a
higher priority. It selects its desired output port faster.
However, we should consider that competition
circumstance only take place when in one moment,
there is a request from two input port for one output
port.
Our fulfilled designing is implemented by the use of
VHDL hardware describing language. In order to
router implementation, one entity is designed for whole
routing. In code segment of Figure 6, size and type of
input/output port are shown.
Figure 6. Size and type of input-outputs
Types of arrayPortsRegisters and
PortsRegisters signals are defined in one packet.
In order to implement, we defined a machine of
definite state for input which you can see in Figure 7.
Receive
Header_bit ='1'
Trailer_bit='1'
Data
Received
Trailer
Received
Header
Received
Transmit
Header_bit='0'
Header_bit='1'
Trailer_bit='0'
ACK_Tx='1'
Figure 7. Finite State Machine for flit and router status analyze
5.2.1 Received state
In this state, the routing await for its Rx base to be
one. In case this happens, firstly the data in Data-in is
need and then the correctness of this data is examined.
In case of being correct, ACK-RX equal one. Then the
next state is defined according to the header/trailer bit.
5.2.2 Header received state
In this state, the appropriate output port is defined
based on the source and destination addresses and
outport table. Then routing table and outport table are
updated. Finally we alter routing state to transmit state.
5.2.3 Trailer received state
In this state, after the destination port is determined
by the routing table, this table of outport table is
updated. In order to do this, the home correspondent
with the input is equaled to NO PORT and also the
output port state in outport table is equaled to Free.
5.2.4 Data received state
In this state, after finding the output port by routing
table, the received flit is put in the output port.
5.2.5 Transmit state
In this state, after placing the flit in the output port
and equaling the desired output port TX base to one,
we await for receiving ACK-TX and after it's receiving
we equal TX to zero and turn back to the received
state.
6. Experimental results
A testbench is considered to test the routing which
alternating enters some packets from input port to
routing and saves the output port packets. This
simulation is fulfilled by the use of the Modelsim
software and uniform traffic model as utilized for
packet injection to network [9]. In the best state,
receive time and Header-Received, Trailer-Received
and Data-Received states are one clock cycle and
Transmit time is two clock cycles which is a good
scale. After simulation, the beginning routing synthesis
operation are fulfilled by the use of Leonardo spectrum
software and its synthesis results are carried out on
ASIC and FPGA that you can see in the following
tables.
Total reports of router synthesis on FPGA
(SPARTAN3E) have been showed in Table II. For
instance, the sum of the all needed gates is 544. The
operating frequency and data arrival time of this router
in FPGA have been estimated 153.0 MHz and 5.81 ns.
Entity router is
Port(
Clock: in std_lolgic;
Reset: in std_logic;
Data_in: in arrayPortsRegisters;
Rx: in PortsRegisters;
Ack_rx: out PortsRegisters;
Data_out: out arrayPortsRegisters;
Tx: out PortsRegisters;
Ack_tx: in PortsRegisters);
End router;
Proceedings of the 14th International CSI Computer Conference (CSICC'09)
228
TABLE II. TOTAL ACCUMULATED AREA IN SPARTAN3E
102 Number of ports
881 Number of nets
830 Number of instances
0 Number of references to this view
1 Number of BUFGP
176 Number of Dffs or Latches
548 Number of Function Generators
51 Number of IBUFG
4 Number of MUXF5
50 Number of OBUF
544 Number of gates
830 Number of accumulated instances
Table III shows the synthesis results (resource
utilization percentage) of the NoC router in FPGA. In
this table, we see the available and utilized resource of
SPARTAN 3E platform.
TABLE III. 8 BIT NOC ROUTER RESOURCE UTILIZATION
PERCENTAGE OF SPARTAN 3E
Utilization Avail Used Resource
52.06 194 101 IOs
4.17 24 1 Global Buffers
2.52 21712 548 Function Generators
3.16 8672 274 CLB Slices
0.80 22100 176 Dffs or Latches
0.00 28 0 Block RAMs
0.00 28 0 Block Multipliers
0.00 2016 0 Block Multiplier Dffs
Table IV shows the total statistical information of 16-
bit router synthesis on FPGA.
TABLE IV. TOTAL ACCUMULATED AREA IN SPARTAN3E 16BIT
182 Number of ports
1356 Number of nets
1264 Number of instances
0 Number of references to this view
1 Number of BUFGP
256 Number of Dffs or Latches
824 Number of Function Generators
91 Number of IBUF
2 Number of MUXF5
90 Number of OBUF
816 Number of gates
1264 Number of accumulated instances
Table V shows the percentage of resource utilization of
16-bit router from FPGA (SPARTAN3E model). The
operating frequency and data arrival time of this router
in FPGA have been estimated 132.2 MHz and 6.84 ns.
TABLE V. PERCENTAGE OF RESOURCE UTILIZATION OF
SPARTAN 3E16 BIT
Utilization Avail Used Resource
93.30 194 181 IOs
4.17 24 1 Global Buffers
3.80 21712 824 Function Generators
4.75 8672 412 CLB Slices
1.16 22100 256 Dffs or Latches
0.00 28 0 Block RAMs
0.00 28 0 Block Multipliers
0.00 2016 0 Block Multiplier Dffs
Figure 8 shows total accumulated area comparison
between 8 bit and 16 bit flit sizes (bandwidth) in
SPARTAN 3E.
0
200
400
600
800
1000
1200
1400
1600
N
u
m
b
e
r
o
f
p
o
r
t
s
N
u
m
b
e
r
o
f
n
e
t
s
N
u
m
b
e
r
o
f
i
n
s
t
a
n
c
e
s
N
u
m
b
e
r
o
f
r
e
f
e
r
e
n
c
e
s
t
o
t
h
i
s
v
i
e
w
N
u
m
b
e
r
o
f
B
U
F
G
P
N
u
m
b
e
r
o
f
D
f
f
s
o
r
L
a
t
c
h
e
s
N
u
m
b
e
r
o
f
F
u
n
c
t
i
o
n
G
e
n
e
r
a
t
o
r
s
N
u
m
b
e
r
o
f
I
B
U
F
G
N
u
m
b
e
r
o
f
M
U
X
F
5
N
u
m
b
e
r
o
f
O
B
U
F
N
u
m
b
e
r
o
f
g
a
t
e
s
N
u
m
b
e
r
o
f
a
c
c
u
m
u
l
a
t
e
d
i
n
s
t
a
n
c
e
s
8 bit Flit Size
16 bit Flit Size
Figure 8. Comparison of Totla Accumulated Area in SPARTAN
3E between a 8 bit flit size router and 16 bit
Table VI shows the total statistical information of our 8
bit router synthesis on ASIC (TSMC 13U). For
instance, the sum of the all needed gates is 10936. The
operating frequency and data arrival time of this router
in FPGA have been estimated 303.2 MHz and 3.01 ns.
TABLE VI. TOTAL ACCUMULATED AREA IN 8BIT TSMC13U
(ASIC)
102 Number of ports
847 Number of nets
716 Number of instances
0 Number of references to this view
10936 Number of gates
716 Number of accumulated instances
Proceedings of the 14th International CSI Computer Conference (CSICC'09)
229
Table VII shows the total statistical information of our
16 bit router synthesis on ASIC. For instance, the sum
of the all needed gates is 14885. The operating
frequency and data arrival time of this router in FPGA
have been estimated 299.2 MHz and 3.03 ns.
TABLE VII. TOTAL ACCUMULATED AREA IN 16 BIT TSMC13U
(ASIC)
182 Number of ports
1082 Number of nets
879 Number of instances
0 Number of references to this view
14885 Number of gates
879 Number of accumulated instances
Figure 9 shows total accmulated area comparison
between 8 bit and 16 bit flit sizes (bandwidth) in
TSMC 13u.
0
2000
4000
6000
8000
10000
12000
14000
16000
N
u
m
b
e
r
o
f
p
o
r
t
s
N
u
m
b
e
r
o
f
n
e
t
s
N
u
m
b
e
r
o
f
in
s
t
a
n
c
e
s
N
u
m
b
e
r
o
f
r
e
f
e
r
e
n
c
e
s
t
o
t
h
is
v
ie
w
N
u
m
b
e
r
o
f
g
a
t
e
s
N
u
m
b
e
r
o
f
a
c
c
u
m
u
l
a
t
e
d
in
s
t
a
n
c
e
s
8 bit Flit Size
16 bit Flit Size
Figure 9. Comparison of Totla Accumulated Area in TSMC13u
between a 8 bit flit size router and 16 bit
The presented figures and tables, show the bandwidth
effect in router elements area. So the bandwidth
increase factor has not linear impact in area.
6. Conclusion
In the presented article, designing and implementing
an NoC router were analyzed. In today's very integrated
digital systems, power and performance correspond
closely with each other. In engineering designing,
always need to analyze the loss for each option for each
optimum solution. One of the features, which directly
influence on power, is the communication issue in
NoCs. In this article not only we used an asynchronous
communication mechanism based on handshaking to
transfer information (which implies low power
consumption and developing capability) but also by
using statistical data, we showed that this designed
router occupies very little space.
Scalable design of this router leads to easy and efficient
addition of new capabilities like 8-bit and 16-bit
bandwidth. The resource utilization of this router is
more efficient than similar implementation on FPGA
and ASIC platforms.
7. References
[1] L. Benini and D. Bertozzi, Network-on-chip
architectures and design methods, IEE Computers and Digital
Techniques, Volume 152, Issue 2, p. 261-272, March 2005
[2] X. Chen and L. Shiuan Peh, Leakage Power Modeling
and Optimization in Interconnection Networks, ISLPED03,
August2527,2003,Seoul,Korea.
[3] P. Pratim Pende, C. Grecu, M. Jones, A. Ivanov and
R. Saleh, Performance Evaluation and Design Trade-Offs for
Network-on-Chip Interconnect Architectures, IEEE
TRANSACTIONS ON COMPUTERS, VOL. 54, NO. 8,
pp:1025:1040, AUGUST 2005
[4] N. Eisley and L. Shiuan Peh, High-Level Power
Analysis for On-Chip Networks, CASES04September 22
25,2004,Washington,DC, USA
[5] G.M. Chiu, "The odd-even turn model for adaptive
routing, IEEE Transactions on Parallel and Distributed
Systems, vol. 11, pp. 729-38, 2000.
[6] R. Holsmark, M. Palasi and S. Kumar, Deadlock Free
Routing Algorithms for Mesh Topology NoC Systems with
Regions, Digital System Design: Architectures, Methods and
Tools, 2006. DSD 2006. 9th EUROMICRO Conference on
Volume , Issue , 0-0 0 Page(s):696 703
[7] Zh. Xiaohu and C. Yang, W. Liwei, A Novel Routing
Algorithm for Network-on-Chip, Wireless Communications,
Networking and Mobile Computing, 2007. WiCom 2007.
International Conference on
Volume , Issue , 21-25 Sept. 2007 Page(s):1877 1879
[8] J. Duato, A new theory of deadlock-free adaptive
routing in wormhole networks, IEEE Trans. On Parallel and
Distributed Systems, vol. 4 No. 12, pp. 1320-1331, December
1993
[9] W. Hsh, Performance issues in wire-limited
hierarchical networks, PhD Thesis, University of Illinois-
Urbana Champaign, 1992.
Proceedings of the 14th International CSI Computer Conference (CSICC'09)
230