
Aggregated CDMA crossbar for Network-on-Chip

2016, 2016 28th International Conference on Microelectronics (ICM)

Code Division Multiple Access (CDMA) has been proposed as the physical-layer enabler of Network-on-Chip (NoC) interconnects for its prominent features such as fixed latency, guaranteed service, and reduced system complexity. The NoC community adopted CDMA interconnects from wireless communications, where each bit in a CDMA-encoded data word is transmitted on a separate channel to avoid interference. However, the wireless interference problem can be efficiently mitigated in on-chip interconnects, eliminating the need for replicating the CDMA channel. Moreover, wireless channels are sequential by nature, which is not the case in on-chip interconnects where parallel buses are the default communication means. When CDMA was adopted by the NoC community, the wireless CDMA scheme was maintained: each data bit is encoded in a separate CDMA channel and the encoding/decoding logic is replicated for data packets. In this work, we present a novel CDMA encoding/decoding scheme called Aggregated CDMA (ACDMA) for NoC interconnects in which all packet bits are encoded in a single CDMA channel, consequently eliminating the area and energy overheads resulting from replicating the channel encoding/decoding logic. The ACDMA NoC crossbar is synthesized on a 45-nm standard-cell process. Compared to the conventional CDMA NoC crossbars, the presented method achieves 60.5% less area, 55% less power consumption, and 124% more throughput per area.

Khaled E. Ahmed, Mohamed R. Rizk, Mohammed M. Farag
Electrical Engineering Department, Faculty of Engineering, Alexandria University, Alexandria, Egypt

Index Terms: NoC, On-Chip Interconnect, CDMA Crossbar.

I. INTRODUCTION

Modern Systems-on-Chips (SoCs) are becoming massively parallel with many harmoniously interconnected Processing Elements (PEs).
Interconnecting the PEs is commonly achieved through buses and Networks-on-Chips (NoCs) [1]. In NoCs, exchanged data is bundled into packets that traverse several network layers, including the physical layer, which defines how packets are actually transmitted between NoC units. The physical layer of a NoC is implemented by routers employing crossbar switches. Code Division Multiple Access (CDMA) is a medium sharing technique that leverages orthogonal codes to enable simultaneous packet routing. Unlike time-shared channels, CDMA leverages the code space to enable channel sharing. CDMA has been proposed as an on-chip interconnect technique for both bus and NoC interconnect architectures [2]. The advantages of using CDMA for on-chip interconnects include reduced power consumption, fixed communication latency, and reduced system complexity [3]. The use of CDMA in NoC interconnects is adopted from the wireless communications literature, where the data is spread by orthogonal codes at the transmitters, the spread data are added on the wireless channel, and the received sum is decoded at the receivers. Classical CDMA systems rely on the Walsh orthogonal code family to enable medium sharing. Many research groups have investigated several aspects of CDMA in NoCs, including our group, which presented the Overloaded CDMA for on-chip Interconnects (OCI) [4] [5] [6]. A 14-node CDMA-based network has been developed in [7]. The network utilizes 7 Walsh codes, and the assignment of the Walsh codes to the network nodes is dynamic, based on the request from each node. Two architectures have been introduced in [7]: a serial CDMA network where each data chip in the spreading code is sent in one clock cycle, and a parallel CDMA network where all data chips are sent in the same cycle. The serial and parallel CDMA-based networks have been compared to a conventional CDMA network, a mesh-based NoC, and a Time Division Multiple Access (TDMA) bus.
For the same network area, the throughput of the parallel CDMA network is higher than that of the mesh-based NoC and the TDMA bus due to the simultaneous medium access nature of CDMA. Standard-basis codes are proposed as a replacement for Walsh CDMA codes in [8]. Standard-basis codes resemble TDMA signaling because each code consists of a single chip of one while the remaining chips are zeros. The orthogonality of the TDMA codes enables them to replace the Walsh codes as spreading and despreading CDMA codes, which reduces the complexity of the channel adder and decoder as the sum of TDMA codes is limited to zero or one per clock cycle. The conventional CDMA crossbar employed in the literature is depicted in Figure 1. The crossbar interconnects N transmit ports to N receive ports using N-chip Walsh spreading codes. The binary data from each transmit port is encoded using an XOR encoder; the data bit is XORed with a unique N-chip spreading code assigned to the transmit-receive pair and transmitted in N clock cycles. The spread data from all encoders are added by the CDMA channel adder and sent to all receive ports. The decoder at each receive port extracts the data from the channel sum by correlating the channel sum with the assigned spreading code. The correlation operation is implemented using an accumulator and a multiplexer since the despreading code chips are unipolar ("0" or "1"). In all of the related work on CDMA interconnects, each data bit in a data word is encoded and transmitted in a separate CDMA channel, and the encoding/decoding logic is replicated W times for data packets of width W, which is a direct application of the wireless CDMA principles to NoC interconnects. However, wireless communication channels are sequential by nature due to the interference problem. Multiple access and MIMO techniques can enable concurrent data transmission on the same wireless channel at the expense of increasing the transmitter/receiver complexity.
In on-chip interconnects, on the other hand, a single channel can be efficiently utilized to enable parallel data transmission on a single CDMA channel, as noise and interference effects can be efficiently mitigated [9]. In this work, we present a single-channel, multi-bit CDMA crossbar, namely the Aggregated CDMA (ACDMA) NoC crossbar. The remainder of the paper is organized as follows: the mathematical foundation of the aggregated CDMA scheme is developed in Section II. Architectural details and a complexity analysis of the ACDMA NoC crossbar are presented in Section III. Implementation results are presented in Section IV, and the conclusions are drawn in Section V.

Figure 1. Conventional CDMA crossbar [6].

Table I. DEFINITION OF NOTATIONS

| Symbol | Description |
|---|---|
| $N$ | Number of CDMA code chips |
| $W$ | Port width (the number of bits in a flit) |
| $d_j$ | Data from port $j$ |
| $C_j^i$ | The $i$th chip of the CDMA code from transmitter $j$ |
| $U_j^i$ | The unipolar representation of $C_j^i$ |
| $E_j^i$ | CDMA encoded data from transmitter $j$ at the $i$th clock cycle |
| $S^i$ | The channel sum of all CDMA encoded data at the $i$th clock cycle |
| $X_k^l$ | The output of the $k$th orthogonal decoder at the $l$th clock cycle |

II. ACDMA MATHEMATICAL FOUNDATIONS

Symbols presented in this work are defined in Table I. In the ACDMA crossbar, the encoder multiplies a W-bit data flit from the transmitting port by a CDMA code. The crossbar transaction takes N clock cycles. At the ith clock cycle, the data flit from the transmitting port j is multiplied by the ith chip from the CDMA code of the port's encoder:

$$E_j^i = d_j C_j^i. \tag{1}$$

Since the CDMA code chip value is ±1, the multiplication can be efficiently computed by negating $d_j$ when $C_j^i = -1$, which is mathematically reduced to:

$$E_j^i = (d_j \oplus U_j^i) + U_j^i, \tag{2}$$

where $U_j^i = (1 - C_j^i)/2$. The encoded data from all encoders are added up; the channel sum at the ith clock cycle can be expressed as:

$$S^i = \sum_{j=0}^{N-1} E_j^i = \sum_{j=0}^{N-1} d_j C_j^i. \tag{3}$$

Random effects such as noise and fading are neglected in the above equation, which is justified and experimentally validated for digital on-chip interconnects [9]. At the receiving port side, the decoder cross-correlates the channel sum with the despreading code by multiplying the channel sum by the CDMA code chips and accumulating over the duration of the N cycles:

$$X_k^l = \sum_{i=0}^{l} S^i C_k^i = \sum_{i=0}^{l} \Big( \sum_{j=0}^{N-1} d_j C_j^i \Big) C_k^i. \tag{4}$$

After N decoding cycles, $l = N - 1$ and the decoder equation can be evaluated by applying the distributive and associative properties of the addition operation:

$$X_k^{N-1} = \sum_{j=0}^{N-1} d_j \Big( \sum_{i=0}^{N-1} C_j^i C_k^i \Big). \tag{5}$$

Since the CDMA code set is orthogonal, the sum $\sum_{i=0}^{N-1} C_j^i C_k^i$ is equal to $N$ when $j = k$ and is equal to zero otherwise. Thus, the output of the decoder after N decoding cycles is:

$$X_k^{N-1} = N d_k. \tag{6}$$

Since the despreading code chips are ±1, the multiply-accumulate process of the decoder can be implemented simply by an up/down accumulator. The data $d_k$ can be extracted from the decoder output $X_k^{N-1}$ by shifting the result $\log_2 N$ bits to the right, which amounts to simple rewiring.

III. ACDMA NOC CROSSBAR ARCHITECTURE

The ACDMA crossbar implements the physical layer of the NoC by interconnecting N transmit (TX) ports to N receive (RX) ports, where the data width of each port is W and $W = \log_2 \max(d_j)$. The high-level architecture of the ACDMA crossbar illustrated in Figure 2(a) is composed of three main parts: encoders, channel adder, and decoders. The encoders spread data from each TX port using W XOR gates as shown in Figure 2(b).
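As a numerical check of (1) through (6), the following minimal Python sketch simulates one N-cycle ACDMA transaction in software. It is an illustration under stated assumptions, not the synthesized design: the helper names (`walsh_codes`, `encode`, `acdma_transfer`) are hypothetical, the spreading codes are taken to be the rows of an N x N Hadamard matrix with ±1 chips, and the XOR in (2) is interpreted as applying the unipolar chip to every bit of the flit, so that XOR followed by adding the chip is a two's-complement negation.

```python
def walsh_codes(n):
    """Rows of an n x n Hadamard matrix with +1/-1 chips (n must be a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    codes = [[1]]
    while len(codes) < n:
        # Sylvester construction: [[H, H], [H, -H]]
        codes = [row + row for row in codes] + \
                [row + [-c for c in row] for row in codes]
    return codes


def encode(d, chip):
    """Eq. (2): E = (d XOR U) + U with U = (1 - C) / 2.

    Assumption: U is applied to every bit of d. With Python integers,
    -u is 0 when u == 0 and all ones (-1) when u == 1, so the result is
    d for chip +1 and the two's-complement negation -d for chip -1.
    """
    u = (1 - chip) // 2
    return (d ^ -u) + u


def acdma_transfer(data, codes):
    """Simulate one N-cycle transaction and return the decoded flits."""
    n = len(codes)
    acc = [0] * n                      # one up/down accumulator per RX port
    for i in range(n):                 # clock cycle i
        # Eq. (3): channel sum of all encoded flits in cycle i
        s = sum(encode(d, codes[j][i]) for j, d in enumerate(data))
        for k in range(n):
            # Eq. (4): add the sum for chip +1, subtract it for chip -1
            acc[k] += s if codes[k][i] == 1 else -s
    shift = n.bit_length() - 1         # log2(N)
    # Eq. (6): acc[k] == N * d_k, so a right shift recovers d_k
    return [x >> shift for x in acc]


if __name__ == "__main__":
    flits = [3, 0, 15, 7, 1, 9, 12, 5]   # eight 4-bit flits, one per TX port
    print(acdma_transfer(flits, walsh_codes(8)))
```

In this sketch TX port j and RX port k share code row j = k, so each receiver recovers the flit of the transmitter with the same index; a real crossbar would assign codes per transmit-receive pair.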
Instead of adding the spreading chips of the Walsh orthogonal code to the result in the encoder block as suggested by (2), this operation is postponed to the channel adder block in order to merge the channel adder with the spreading code adders. The output of each encoder is, therefore, limited to W bits. The encoder outputs are then added together to form the sum $S^i$ of (3). To minimize the critical path of the channel adder, the addition is done using a tree adder architecture as depicted in Figure 2(c), where the leaves of the tree are the encoders of each TX port and the root of the tree is the channel sum output. Because there are N leaves, the height of the tree is $\log_2(N)$. The width of the output wires from each adder in the tree is equal to the width of the input wires plus one to prevent overflows. Since the input to the first level of adders is (W+1)-bit wide and the height of the adder tree is $\log_2(N)$, the width of the output wires at the root adder is $W + 1 + \log_2(N)$. Pipeline registers are inserted after each stage in the tree to minimize the critical path of the channel. The sum $S^i$ is then sent to all N decoders, one decoder per RX port. The decoders implement the cross-correlation of (4) in a cost-efficient manner; the decoder consists of only an adder/subtracter and a register configured as an up/down accumulator as shown in Figure 2(d). Since the despreading code $C_k$ consists of ±1 chips, cross-correlation is reduced to simple addition and subtraction operations on consecutive sums $S^i$. Therefore, the decoder is implemented as an up/down accumulator; the adder/subtracter adds or subtracts the sum $S^i$ from the result saved in the register according to the value of the despreading chip $C_k^i$. In particular, when the despreading chip is "1", the adder adds $S^i$ to the contents of the register, and when the despreading chip is "-1", it subtracts $S^i$ from the contents of the register. At the end of the decoding cycle, the accumulator register holds $N d_k$ according to (6), and because $N = 2^n$ where n is an integer, the data $d_k$ is decoded by shifting the accumulator content by $\log_2(N)$ bits.

Figure 2. (a) ACDMA crossbar high-level architecture, (b) ACDMA encoder, (c) ACDMA channel adder, (d) ACDMA decoder.

In Table II, an analysis of the complexity of the ACDMA crossbar versus the conventional CDMA crossbar is presented. The number of two-input XOR gates is the same for both circuits. The improvement of the ACDMA crossbar over the conventional CDMA crossbar is evident in the number of channel adder wires. In the conventional CDMA crossbar, the number of adder wires for the single-bit channel is increased by one in each stage due to the additional carry bit. Therefore, the number of adder wires in stage i is equal to $1 + \log_2(N) - i$. For a W-bit word, the number of adder wires is increased to $W + W(\log_2(N) - i)$, and since there are $2^i$ adders at each stage, the total number of wires is equal to $\sum_{i=0}^{\log_2 N - 1} 2^i \,(W + W(\log_2 N - i))$. In the ACDMA crossbar, conversely, the number of adder wires for a W-bit word is $W + \log_2(N) - i$, which makes the total number of wires equal to $\sum_{i=0}^{\log_2 N - 1} 2^i \,(W + \log_2 N - i)$, in which the carry-bit term is a factor of W smaller than in the conventional CDMA crossbar. The reduced number of carry bits of the ACDMA crossbar is the prime reason for its superiority. The number of wires for the decoder accumulator and the number of flip-flops in the decoder registers are proportional to the number of channel wires, that is, the width of the last stage of the adder. It follows that the complexity of the ACDMA crossbar is on the order of W less than that of the conventional CDMA crossbar.

Table II. COMPLEXITY ANALYSIS OF THE CONVENTIONAL AND ACDMA CROSSBARS.

| Component | Aggregated CDMA | Conventional CDMA |
|---|---|---|
| Encoders | $NW$ XOR gates | $NW$ XOR gates |
| Channel adder wires | $\sum_{i=0}^{\log_2 N - 1} 2^i \,(W + \log_2 N - i)$ | $\sum_{i=0}^{\log_2 N - 1} 2^i \,(W + W(\log_2 N - i))$ |
| Decoder accumulators | $N(W + \log_2 N) = NW + N \log_2 N$ | $NW(1 + \log_2 N) = NW + NW \log_2 N$ |

IV. IMPLEMENTATION RESULTS

In this section, the implementation results of the ACDMA crossbar are presented. The crossbar is synthesized using a 45-nm standard-cell process; the synthesis results are compared to the conventional Walsh-based (WB) CDMA crossbar and the standard-basis (SB) NoC crossbar presented in [8]. To neutralize any discrepancy between the implementation of the ACDMA crossbar and the implementations presented in [8] that may arise due to different synthesis strategies, the WB and SB crossbar implementation results are reproduced for the same technology. The results in [8] are derived from a single-bit interconnect where the data flit of width W is serialized by a parallel-to-serial converter and transmitted bit by bit using the single-bit CDMA interconnect. Therefore, to transmit W bits in parallel, the single-bit interconnect of [8] is replicated W times in the reproduced implementation compared herein with the ACDMA crossbar.
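The closed-form expressions of Table II can be evaluated directly for a given configuration, for example N = 8 or N = 16 with W = 4. The short Python sketch below does only that; the function names are hypothetical, N is assumed to be a power of two, and the formulas are transcribed from Table II as printed rather than derived from a netlist.

```python
def channel_adder_wires(n, w, aggregated):
    """Total channel-adder wire count from Table II.

    Sums 2**i * (wires per adder in stage i) for i = 0 .. log2(N) - 1,
    using the per-stage widths given in Table II.
    """
    stages = n.bit_length() - 1              # log2(N), N a power of two
    total = 0
    for i in range(stages):
        carry_bits = stages - i              # the (log2 N - i) term
        if aggregated:
            per_adder = w + carry_bits       # ACDMA: W + (log2 N - i)
        else:
            per_adder = w + w * carry_bits   # conventional: W + W (log2 N - i)
        total += (2 ** i) * per_adder
    return total


def decoder_accumulator_bits(n, w, aggregated):
    """Decoder accumulator size from Table II."""
    log_n = n.bit_length() - 1
    if aggregated:
        return n * (w + log_n)               # N (W + log2 N)
    return n * w * (1 + log_n)               # N W (1 + log2 N)


if __name__ == "__main__":
    for n in (8, 16):
        print(n,
              channel_adder_wires(n, 4, aggregated=True),
              channel_adder_wires(n, 4, aggregated=False),
              decoder_accumulator_bits(n, 4, aggregated=True),
              decoder_accumulator_bits(n, 4, aggregated=False))
```

For N = 8 and W = 4 these expressions give 39 channel-adder wires for the aggregated scheme against 72 for the conventional one, and 56 against 128 accumulator bits. These are counts from the analytical model only and are not expected to match the synthesized area figures one to one.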
The area is estimated using Synopsys Design Compiler; activity factors are estimated by ModelSim to help Design Compiler estimate the power accurately. Table III compares the area A in µm² and the total (leakage and dynamic) power dissipation of the ACDMA crossbar against the WB and SB crossbars. The IoWB% and IoSB% columns indicate the percentage improvement of the ACDMA crossbar over the WB and SB crossbars, respectively.

Table III. AREA IN µm² AND POWER DISSIPATION IN mW OF THE WB, SB, AND ACDMA CROSSBARS FOR N = 8 NODES (CODE CHIPS) AND W = 4 BITS.

| Nodes | Module | Area WB [8] (µm²) | Area SB [8] (µm²) | Area ACDMA (µm²) | Area IoWB% | Area IoSB% | Power WB [8] (mW) | Power SB [8] (mW) | Power ACDMA (mW) | Power IoWB% | Power IoSB% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | Encoder and adder | 1806.2 | 1511.9 | 1082.1 | 40.1 | 28.4 | 0.422 | 0.29 | 0.227 | 46.2 | 21.7 |
| 8 | Decoder | 3756.3 | 1696 | 1522.8 | 59.5 | 10.2 | 0.764 | 0.286 | 0.304 | 60.2 | -6.3 |
| 8 | Total | 5562.5 | 3207.9 | 2604.9 | 53.2 | 18.8 | 1.186 | 0.576 | 0.531 | 55.2 | 7.8 |
| 16 | Encoder and adder | 7294.3 | 6290.3 | 2849.9 | 60.9 | 54.7 | 1.367 | 0.801 | 0.597 | 56.3 | 25.5 |
| 16 | Decoder | 12448.2 | 6703.2 | 4939.3 | 60.3 | 26.3 | 2.275 | 0.985 | 1.042 | 54.2 | -5.8 |
| 16 | Total | 19742.5 | 12993.5 | 7789.2 | 60.5 | 40.1 | 3.642 | 1.786 | 1.639 | 55 | 8.2 |

The clock frequency of the architectures presented in Table III is fixed at 2.5 GHz to facilitate the comparison with the results in [8]. To arrive at this clock frequency, the number of bits W is set to four, because as W increases, the carry chain in the adders reduces the clock frequency, which does not occur in conventional CDMA. As illustrated by Table III, the savings in area compared to the WB and SB crossbars are up to 60.5% and 40.1%, respectively. Furthermore, the ACDMA crossbar achieves up to 55.2% and 8.2% reduction in power dissipation compared to the WB and SB crossbars, respectively. The latency L in ns of the ACDMA crossbar is compared to that of the SB and WB crossbars in Figure 3(a). Because the carry chain length in the channel adder grows with the data width W, the latency of the channel adder, and consequently of the ACDMA crossbar, increases. Conversely, the carry chains of the WB and SB crossbars are invariant with the data width W since the adders are replicated W times instead of increasing the adder input width as in the ACDMA crossbar. However, due to the substantial reduction in total area as illustrated in Table III, the Throughput per Area (TPA), calculated through (7), of the ACDMA crossbar is still higher than that of the SB and WB crossbar counterparts.

$$\mathrm{TPA} = \frac{W}{L\,N\,A} \tag{7}$$

The TPA of the WB, SB, and ACDMA crossbars in Mbps per µm² are juxtaposed in Figure 3(b); the increase in TPA of the ACDMA crossbar over that of the WB and SB crossbars is up to 124% and 39.5%, respectively.

Figure 3. Latency (a) in ns and Throughput per Area (b) in Mbps/µm² for the WB, SB, and ACDMA crossbars. Both panels are plotted for N = 8 against the data width W (4, 8, 16, and 32 bits).

V. CONCLUSION

In this work, we presented the ACDMA NoC crossbar to enable parallel transmission of multi-bit data packets on a single CDMA channel. The overhead of channel replication is mitigated, which results in up to 60.5% area and 55% power savings with a 124% improvement in throughput per area compared to the conventional CDMA crossbar. As future work, we plan to build and evaluate a full ACDMA-based NoC under different workloads and routing protocols.

REFERENCES

[1] L. Wang, J. Hao, and F. Wang. Bus-based and NoC infrastructure performance emulation and comparison. In Information Technology: New Generations, 2009. ITNG '09. Sixth International Conference on, pages 855–858, April 2009.
[2] R. H. Bell, Chang Yong Kang, L. John, and E. E. Swartzlander. CDMA as a multiprocessor interconnect strategy. In Signals, Systems and Computers, 2001. Conference Record of the Thirty-Fifth Asilomar Conference on, volume 2, pages 1246–1250, Nov 2001.
[3] B. C. C. Lai, P. Schaumont, and I. Verbauwhede. CT-bus: a heterogeneous CDMA/TDMA bus for future SOC. In Signals, Systems and Computers, 2004. Conference Record of the Thirty-Eighth Asilomar Conference on, volume 2, pages 1868–1872, Nov 2004.
[4] K. E. Ahmed and M. M. Farag. Overloaded CDMA bus topology for MPSoC interconnect. In 2014 International Conference on ReConFigurable Computing and FPGAs (ReConFig14), pages 1–7, Dec 2014.
[5] K. E. Ahmed and M. M. Farag. Enhanced overloaded CDMA interconnect (OCI) bus architecture for on-chip communication. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, pages 78–87, Aug 2015.
[6] K. E. Ahmed and M. M. Farag. Parallel overloaded CDMA interconnect (OCI) bus architecture for on-chip communications. In 2015 IEEE International Conference on Electronics, Circuits, and Systems (ICECS), pages 621–624, Dec 2015.
[7] Basel Halak, Teng Ma, and Ximeng Wei. A dynamic CDMA network for multicore systems. Microelectronics Journal, 45(4):424–434, 2014.
[8] J. Wang, Z. Lu, and Y. Li. A new CDMA encoding/decoding method for on-chip communication network. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 24(4):1607–1611, April 2016.
[9] Jacob Postman and Patrick Chiang. A survey addressing on-chip interconnect: energy and reliability considerations. ISRN Electronics, 2012, 2012.