A Novel 32-Bit Scalable Multiplier Architecture
Yeshwant Kolla
Sun Microsystems, Inc., Burlington, MA 01803, USA
Yong-Bin Kim
Dept. of ECE, Northeastern University, Boston, MA 02115, USA
John Carter
School of Computing, University of Utah, Salt Lake City, UT, USA
ABSTRACT
In this paper, we present a novel hybrid multiplier architecture that combines the regularity of linear array multipliers with the performance of tree multipliers and scales well to higher-order multiplication. This multiplier topology is well suited to an electronic design automation (EDA) tool based implementation. A 32-bit version of this multiplier has been implemented in a 0.25µm technology using a standard ASIC design methodology and one variation of the standard design methodology. This 32-bit multiplier has a latency of 3.56 ns.
the speed-up. However, conventional tree structures are less regular than linear array topologies, and their interconnect wires tend to be long and complex. In addition, the varying lengths of the interconnects cause delay-balancing problems. This leads to significant glitching at the outputs of the logic gates, resulting in higher power dissipation. The longer interconnects that result from complex interconnections require larger transistors to drive the additional load. This increase in transistor sizes has a domino effect, rippling all the way back to the source driver and resulting in an overall area increase. We propose a multiplier architecture with the performance advantage of trees and the regularity of linear array multipliers.
General Terms
Multiplier, Design, Performance
Keywords
Multiplier, Architecture, CMOS VLSI
1. INTRODUCTION
Hardware multipliers are among the most widely used datapath elements. This basic arithmetic unit is one of the most critical elements in high-speed microprocessors, digital signal processors, graphics processors, and many other application-specific processors. In addition, multiplication is a high-latency operation, requiring between 2 and 8 cycles to complete [1]. Consequently, the performance of a multiplier has a direct effect on the throughput and cycle time of a system. Conventional linear array multipliers have very good regularity, but the latency through these multipliers tends to be high: usually the number of levels of addition equals the number of bits of the multiplier. To improve on the latency of linear array multipliers, tree topologies [2, 3, 4] are used. Tree topologies exploit parallelism in performing the different steps of multiplication in order to obtain
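The gap between the two topologies can be illustrated with a rough carry-save-adder (CSA) depth model. The sketch below is not code from the paper; it uses the standard 3:2-reduction argument behind Wallace/Dadda trees, where each level shrinks k operands to k minus floor(k/3).

```python
def linear_csa_depth(n: int) -> int:
    """CSA levels for a linear array that absorbs one of n operands per level."""
    return max(n - 2, 0)

def tree_csa_depth(n: int) -> int:
    """CSA levels when every level applies as many 3:2 compressors as possible."""
    levels = 0
    while n > 2:
        n -= n // 3   # one 3:2 stage: k operands become k - floor(k/3)
        levels += 1
    return levels
```

For 32 operands the linear array needs 30 CSA levels while the tree needs only 8, which is the latency advantage the paper trades against the tree's irregular routing.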
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GLSVLSI'03, April 28-29, 2003, Washington, DC, USA. Copyright 2003 ACM 1-58113-677-3/03/0004 ...$5.00.
Figure 1: 2-bit Modified Booth Partial Product Generator.

simultaneously using a structure called the 8-to-2 compressor. The following sub-section discusses the design of the 8-to-2 compressor.
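The recoding behind a 2-bit modified Booth partial product generator can be sketched as follows. This is a generic radix-4 Booth model, not the paper's circuit, and the function names are illustrative: each overlapping triplet of multiplier bits selects one partial product from {0, ±M, ±2M}, halving the number of partial products.

```python
def booth2_digits(y: int, bits: int):
    """Radix-4 Booth digits of a two's-complement 'bits'-wide multiplier y."""
    digits, prev = [], 0              # prev holds y_{2i-1}; implicit y_{-1} = 0
    for i in range(0, bits, 2):
        b_mid = (y >> i) & 1          # y_{2i}
        b_hi = (y >> (i + 1)) & 1     # y_{2i+1}
        digits.append(-2 * b_hi + b_mid + prev)  # digit in {-2,-1,0,1,2}
        prev = b_hi
    return digits                     # digit i has weight 4**i

def booth2_multiply(m: int, y: int, bits: int) -> int:
    """Sum the selected partial products 0, ±M, ±2M, each weighted by 4**i."""
    return sum(d * m * 4**i for i, d in enumerate(booth2_digits(y, bits)))
```

An n-bit multiplier thus contributes only about n/2 partial products to the summation network.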
Table 1: 8-to-2 Compressor Truth Table. Note: In the table, Nin(n) represents the total number of inputs with weight n equal to 1, and Nout(n+1) represents the total number of outputs with weight n+1 equal to 1.

Nin(n)      0  1  2  3  4  5  6  7  8  9  10 11 12 13
Nout(n+1)   0  0  1  1  2  2  3  3  4  4  5  5  6  6
Nout(n)     0  1  0  1  0  1  0  1  0  1  0  1  0  1
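Every row of Table 1 follows from value preservation: Nin ones of weight n must reappear as floor(Nin/2) ones of weight n+1 plus (Nin mod 2) ones of weight n. A quick sketch regenerating the table from this rule:

```python
# Regenerate Table 1 from value preservation:
#   Nin * 2^n == Nout(n+1) * 2^(n+1) + Nout(n) * 2^n
rows = [(nin, nin // 2, nin % 2) for nin in range(14)]
for nin, nout_hi, nout_lo in rows:
    assert nin == 2 * nout_hi + nout_lo   # the weights balance for every row
```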
Figure 2: A weight-n 8-to-2 Compressor Block Diagram.

A block diagram of an 8-to-2 compressor with a weight of n is shown in Figure 2. This structure takes eight primary inputs (PIn) and produces two primary outputs (POut). The eight primary inputs and one of the primary outputs have a weight of n, while the other primary output has a weight of n+1. This compressor also has five stage-propagate inputs (SPIn) of weight n and five stage-propagate outputs (SPOut) of weight n+1. The truth table for this structure is shown in Table 1. The 8-to-2 compressor can be implemented in a number of different ways. One obvious approach is a direct logic implementation of the truth table shown in Table 1, using logical optimizations to reduce the number of levels of logic along the critical path. Though this approach
may provide the smallest number of gate delays, interconnect delays tend to be excessive because of the lack of emphasis on regularity in the design. A more regular and structured approach is to use carry-save adders as the basic building block and compose the structure in either a tree topology or a linear topology. A tree-like implementation of the 8-to-2 compressor is shown in Figure 3. This structure is more regular and more structured than the direct logic implementation but still has some amount of interconnect and layout complexity.
Figure 3: Tree implementation of the 8-to-2 compressor, composed of 3-to-2 adders.
The implementation requires a stage of 9-to-2 compressors in addition to a stage of 8-to-2 compressors. The 9-to-2 compressor stage is needed to account for the additional partial product arising from performing modified-Booth recoding. Creating a 9-to-2 compressor from a linear 8-to-2 compressor is fairly simple. Though this multiplication topology using linear 8-to-2 compressors has an additional delay of 2 CSAs compared to an ideal tree topology, the performance impact should be offset by the longer wire delays that tree topologies incur due to complex routing and lower regularity. In addition, though the total number of CSAs required by the 8-to-2 tree is almost on par with that required by conventional tree topologies, the complexity of routing in the latter should result in a higher routing-track requirement, and hence larger area. This tree topology ensures that the depth of the tree structure at the top level is no more than two stages for a 32-bit multiplier; beyond three stages the routing becomes significantly complicated. Depending on the performance requirements of the multiplier, each of the sub-arrays can be implemented in either a linear or a tree approach. Also, depending on the bit-width of the multiplier, the summation tree is divided into multiple levels of sub-arrays. This reduces the length and complexity of global routes while absorbing some of the complexity locally within the sub-arrays. Finally, the proposed tree structure scales very well to higher-order multipliers, while the routing complexity of conventional tree multipliers increases exponentially with the order of the multiplier. Ease of scaling and lower routing complexity become very important factors when using design automation tools.
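The "fairly simple" extension from a linear 8-to-2 to a 9-to-2 compressor amounts to one more carry-save adder on the chain. The sketch below (illustrative names, with each operand's bit-vector packed into a Python int) shows that the same linear reduction works for any operand count:

```python
def csa(a: int, b: int, c: int):
    """3-to-2 carry-save adder over packed bit-vectors: a + b + c == s + t."""
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1

def compress_linear(operands):
    """Reduce len(operands) values to a carry-save pair using len(operands)-2
    chained CSAs; a 9-to-2 compressor is an 8-to-2 chain plus one extra CSA."""
    s, t = csa(operands[0], operands[1], operands[2])
    for op in operands[3:]:
        s, t = csa(s, t, op)
    return s, t
```

The 9-operand chain uses seven CSAs, one more than the eight-operand chain, which is why the 9-to-2 stage fits naturally alongside the 8-to-2 stages.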
4. RESULTS
The performance of the multiplier implemented using a standard ASIC design methodology and a variation, namely the base-cell methodology, are compared for the typical-case library and operating conditions. In the base-cell flow, carry-save adders are used as the basic building blocks for the multiplier. The results presented are obtained using a 0.25µm technology. The performance analysis is presented using the two common design metrics: area and timing. The results obtained after just the synthesis stage and after the complete physical design stage are presented. The synthesis and place & route results are presented for the standard methodology and the base-cell variation in a bottom-up fashion, i.e., starting at the base-cell level. The Compress82 block generated using the standard methodology creates a very complex interconnect-based design, as compared to the very regular structure created by the base-cell design methodology. Due to the way timing-critical designs are handled by automated tools, the standard flow generates a logically optimal design by minimizing the number of levels of logic along the critical path. This makes the designs slightly faster, but in order to reduce logic along the critical path the design usually ends up with complicated wiring. As a result of the increased wire loads due to the complicated wiring, the drivers along these paths need to be upsized to meet path timing requirements. This upsizing of the drivers, compounded by the rippling effect of upsizing downstream drivers along these paths, results in a huge area penalty. This effect can be seen in Tables 2 and 3. In the case of Compress82 in Table 2, a speed-up of 16% is obtained at the expense of a 290% area penalty, and in the case of Compress92 in Table 3, a speed-up of 30% is obtained
Figure 4: Linear 8-to-2 Compressor Block.

Figure 4 shows a linear implementation of the 8-to-2 compressor. This is a highly regular structure, and the routing is fairly simple. Though the linear 8-to-2 compressor has six carry-save adders along its critical path, compared to four in the tree implementation, the linear structure is easy to implement and highly scalable. The linear 8-to-2 compressor is also very conducive to a bit-slice layout, helping decrease layout time.
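A minimal model of Figure 4's linear chain can make the six-CSA critical path concrete. This is a behavioral sketch, not the paper's netlist; each operand's bit-vector is packed into a Python int.

```python
def csa(a: int, b: int, c: int):
    """One 3-to-2 carry-save adder: bitwise sums and left-shifted carries."""
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1

def compress_8to2(x):
    """Linear 8-to-2 compressor: six chained CSAs reduce eight operands to a
    carry-save pair (s, t) with s + t == sum(x)."""
    assert len(x) == 8
    s, t = csa(x[0], x[1], x[2])      # CSA 1
    for op in x[3:]:                  # CSAs 2..6 along the critical path
        s, t = csa(s, t, op)
    return s, t
```

Because each CSA preserves the arithmetic sum of its inputs, the pair (s, t) always adds up to the sum of the eight operands, which is the carry-save invariant the whole summation network relies on.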
3. 8-TO-2 TREE
The top-level tree structure that adds the results from the 8-bit sub-arrays to produce the final carry-save product is shown in Figure 5.
Figure 5: A Section of a 32-bit 8-to-2 Tree Structure.

at the expense of a 254% area penalty. These tables include both the after-synthesis and after-place-&-route numbers to show the advantages of timing-driven place & route, which provides better timing than synthesis but can incur a larger area penalty due to upsizing some drivers based on wire lengths.

Table 2: Performance comparison of the Compress82 base cell for the standard methodology and the variation, after synthesis and after place & route.

                  After Synthesis          After P&R
                  Standard    Variation    Standard    Variation
Area (in µm²)     3119.82     820.20       3678.42     943.71
Timing (in ns)    3.84        4.29         2.51        2.92

the multiplier, and the physical design tool requires a certain amount of keep-out area around sub-blocks. This area around each base cell results in a huge penalty in both timing and area. One way to improve timing would be to add these base cells to the library so that the physical tool can use them as leaf cells not requiring the keep-out area. From Table 4, the multiplier design generated using the standard flow has, after synthesis, a speed-up of 14% with an area penalty of 66% over the base-cell variation, whereas after place & route the standard flow has a speed-up of 33% with an area penalty of just 46%.

Table 4: Performance comparison of the 32-bit multiplier for the standard methodology and the variation, after synthesis and after place & route.

                  After Synthesis            After P&R
                  Standard     Variation     Standard     Variation
Area (in µm²)     234986.95    141663.92     463080.25    317419.56
Timing (in ns)    4.93         5.60          3.56         4.72
Table 3: Performance comparison of the Compress92 base cell for the standard methodology and the variation, after synthesis and after place & route.

                  After Synthesis          After P&R
                  Standard    Variation    Standard    Variation
Area (in µm²)     3254.02     953.91       4602.94     1298.88
Timing (in ns)    4.00        5.00         2.62        3.39
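The quoted percentages can be reproduced from the after-P&R columns of Tables 2 and 3, under one plausible reading of the text: speed-up is measured relative to the standard flow's (faster) timing, and the area penalty relative to the base-cell variation's (smaller) area.

```python
def speedup_pct(t_variation: float, t_standard: float) -> float:
    # How much slower the variation is, relative to the standard flow's timing.
    return 100.0 * (t_variation - t_standard) / t_standard

def area_penalty_pct(a_standard: float, a_variation: float) -> float:
    # Extra area of the standard flow, relative to the base-cell variation.
    return 100.0 * (a_standard - a_variation) / a_variation

s82 = speedup_pct(2.92, 2.51)             # ~16%  (Compress82, Table 2)
p82 = area_penalty_pct(3678.42, 943.71)   # ~290%
s92 = speedup_pct(3.39, 2.62)             # ~29-30% (Compress92, Table 3)
p92 = area_penalty_pct(4602.94, 1298.88)  # ~254%
```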
6. REFERENCES
[1] H. Al-Twaijry and M. Flynn. Multipliers and datapaths. Technical Report, Stanford University, December 1994.
[2] L. Dadda. Some schemes for parallel multipliers. Alta Frequenza, 34(5):349-356, March 1965.
[3] M. Santoro. Design and Clocking of VLSI Multipliers. Ph.D. thesis, Stanford University, October 1989.
[4] C. Wallace. A suggestion for a fast multiplier. IEEE Transactions on Electronic Computers, EC-13(1):14-17, February 1964.
5.
In the case of the multiplier, the synthesis results provide a more accurate picture of the improvements of the base-cell variation over the standard flow because of some tool shortcomings. The base cells are treated as sub-blocks within