
Design of floating-point multiplier for logic and power optimization: A Review


Sampath Kumar V, Rashika Anurag, Khushi Sangal, Sanyam Jain, Shristi Bharti
Department of ECE, JSSATEN, Noida, India

Abstract—Floating-point representation of fractional numbers offers a wider dynamic range than fixed-point representation, which makes it more flexible and scalable. Since fractional numbers are frequently used in computation, such as in astronomical calculations, graphics processing, and signal processing, floating-point representation becomes the ideal representation for them. Floating-point multipliers perform differently depending on the multiplier design. The purpose of this paper is to review various studies of floating-point multipliers built on the Modified Booth, Array, Dadda, Wallace Tree, and Vedic multipliers. A floating-point multiplier's performance is evaluated on a number of attributes, including speed, latency, area, and power consumption. Verilog HDL is used in the design phase and Xilinx ISim in the simulation phase. The RTL blocks created with Xilinx ISE 14.7 were implemented on FPGA devices.

Keywords—Floating-point numbers, Single-precision (32-bit), Double-precision (64-bit), IEEE-754, Array, Modified Booth, Wallace tree, Dadda tree, Vedic, Verilog HDL, Xilinx ISE, FPGA.

I. INTRODUCTION

Microprocessors carry out a variety of arithmetic operations, including addition, subtraction, multiplication, division, and logical operations, using their ALU (Arithmetic-Logic Unit) block. Among these operations, binary multiplication is noteworthy: it receives constant attention from researchers and scientists because it is resource-intensive and time-consuming. According to Moore's law, chip transistor count doubles every 18 to 24 months, although this pace is much slower than that of computing power growth. Additionally, floating-point calculations consume approximately 75% of the core power and 45% of the total power of high-performance computing applications. Floating-point operations have therefore been rendered less efficient by their large resource and delay overhead. With the increasing demand for computational power in scientific applications such as computational physics and computational geometry [2], which require high-precision calculation, it is critical to have faster and more accurate floating-point units, especially the multiplier. The two main types of multipliers are serial multipliers and parallel multipliers. A parallel multiplier generates all the bits of a partial product in parallel, whereas a serial multiplier works through the bits of the multiplier one after another to form the partial products. As a result, serial multipliers are slower than parallel multipliers [3].

Furthermore, floating-point multipliers have long been popular in a variety of other disciplines, including image processing, graphics processing, and digital signal processing (DSP). Compared with fixed-point numbers, floating-point numbers have a much wider dynamic range, but this range comes at the cost of increased structural complexity. As accuracy increases, they also require more area, which tends to grow with complexity [2]. It has been observed that single-precision floating-point representation can handle the required range of numbers [5]. Multiplying 32-bit floating-point values involves three steps: sign bit determination, exponent addition, and significand multiplication [1]. An examination of various single-precision (32-bit) floating-point multipliers and their implementation issues found that, of these three components, the significand multiplication unit is the slowest. It also uses the most chip area, so it contributes the most to power dissipation [5]. If significand multiplication could be made more efficient in speed, area, or power, the architecture would improve and overall throughput would rise markedly. Optimizing the power of the multiplier is therefore a significant step in optimizing floating-point operations.

II. IEEE-754 FLOATING POINT REPRESENTATION

Computers have used a variety of floating-point representations, but IEEE-754 is one of the most extensively
used standards in the industry. It is the most widely applied standard for floating-point calculations and is published by the Institute of Electrical and Electronics Engineers (IEEE) [6]. The single-precision format of this standard has three fundamental parts: a sign bit, an 8-bit exponent, and a 23-bit mantissa [7].

TABLE I. FLOATING-POINT SINGLE AND DOUBLE PRECISION FORMAT

To multiply two 32-bit floating-point numbers, perform the following steps [8]:

Step 1. The exponents of the two numbers are added, and the extra bias is subtracted from the resulting sum.

Step 2. The mantissas of the two numbers are multiplied.

Step 3. The sign bits of the two numbers are XORed to get the sign bit of the final product.

Step 4. The result is normalized so that its MSB is logic-1.

Fig. 1. Floating-Point Multiplier Block Diagram

III. METHODOLOGIES

Many research studies in recent years have aimed at improving multiplier performance through efficient design of the floating-point arithmetic unit. Some works reduced the power and area consumed by the floating-point units, while others attained high speed and improved accuracy [9]. A variety of multipliers have been designed and successfully implemented for significand multiplication in floating-point multiplication operations. The following paragraphs discuss a few of them.

A. Array Multiplier

This multiplier multiplies two binary numbers using an array of half and full adders. This add-and-shift algorithm is popular because of its regular, predictable structure. Each bit of the multiplier is multiplied by the bits of the multiplicand to create the partial products, which are shifted according to their bit order and then added using a carry propagate adder to form the final product. N-1 stages are needed to get the final result, where N is the number of multiplier bits [7].

Fig. 2. 4x4 Array Multiplier

B. Modified Booth Multiplier

This multiplier reduces the number of multiplicand multiples by using a higher representation radix, which also lowers the number of partial products. A given range of numbers requires fewer digits as the representation radix increases: a k-bit binary number can be represented as a radix-4 number with k/2 digits, a radix-8 number with k/3 digits, and so on. High-radix multiplication can therefore handle several multiplier bits at once. The partial product tree can then be reduced with very few adder blocks, which gives a shorter signal path and faster partial product reduction. Partial product reduction is carried out using tree reduction algorithms.

Fig. 3. Modified Booth Multiplier Block Diagram
The following table shows how radix-4 recoding can be used to reduce partial products by half [3].

TABLE II. RADIX-4 ENCODING TABLE
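
As a minimal sketch of radix-4 recoding, the Python snippet below recodes a signed multiplier into digits from the set {-2, -1, 0, 1, 2} using the commonly used modified Booth rule, digit = y[2i-1] + y[2i] - 2*y[2i+1] with y[-1] = 0. That this rule matches the encoding in Table II is an assumption, and the function names `booth_radix4_digits` and `booth_multiply` are illustrative rather than taken from the reviewed designs.

```python
def booth_radix4_digits(multiplier: int, width: int) -> list[int]:
    """Recode a signed `width`-bit multiplier into radix-4 Booth digits, LSB first."""
    if width % 2:
        width += 1  # sign-extend to an even number of bits
    digits = []
    prev = 0  # implicit bit y[-1] to the right of the LSB
    for i in range(0, width, 2):
        low = (multiplier >> i) & 1         # y[i]
        high = (multiplier >> (i + 1)) & 1  # y[i+1]
        digits.append(prev + low - 2 * high)
        prev = high
    return digits


def booth_multiply(multiplicand: int, multiplier: int, width: int) -> int:
    """Sum one shifted multiple of the multiplicand per radix-4 digit."""
    product = 0
    for i, digit in enumerate(booth_radix4_digits(multiplier, width)):
        product += (digit * multiplicand) << (2 * i)
    return product


if __name__ == "__main__":
    print(booth_radix4_digits(7, 4))
    print(booth_multiply(-5, 7, 4))
```

For a 4-bit multiplier the recoding yields two digits, so only two partial products are summed instead of four.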

C. Wallace Tree Multiplier

The height of a Wallace tree is logarithmic in the word size rather than linear, which makes it faster than a regular array multiplier. The carry-save method is also used to speed up the addition of partial products. It works in three stages: partial product generation using AND operations; reduction of the rows of partial products to just two rows using half and full adders; and addition of the remaining two rows with an appropriate carry propagate adder to obtain the final product [4]-[7]. The carry-save addition algorithm reduces the latency.

This method compresses groups of three bits and two bits in each column, using as many full adders and half adders as possible within each group of three rows. The grouping and adding are repeated until only two rows remain [11].

Fig. 4. 4x4 Wallace Tree Multiplier

Fig. 5. 4x4 Wallace Tree Multiplier reduction process

D. Dadda Multiplier

The Dadda multiplier works like the Wallace algorithm but is slightly faster because it requires fewer gates. The Wallace multiplier reduces the partial products as much as possible on each layer, whereas the Dadda multiplier performs as few reductions as possible. As a result, the number of half adders and full adders used at each level is reduced significantly. Using the Wallace Reduction Table, it reduces the partial products in every column to the maximum height given by the previous level of the table. This makes the reduction part of the multiplier less expensive [9].

TABLE III. WALLACE REDUCTION TABLE
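
The growth in reduction stages can be sketched numerically. The snippet below assumes the standard Dadda height sequence (2, 3, 4, 6, 9, 13, ..., each term the floor of 1.5 times the previous one) as the column-height targets of the reduction; whether Table III lists exactly these values is an assumption, and the function name `dadda_heights` is illustrative. The array multiplier stage count of N-1 is the figure quoted in the Array Multiplier subsection.

```python
def dadda_heights(n: int) -> list[int]:
    """Target column heights below n, smallest first: 2, 3, 4, 6, 9, ..."""
    heights = []
    height = 2
    while height < n:
        heights.append(height)
        height = height * 3 // 2  # floor(1.5 * height)
    return heights


if __name__ == "__main__":
    for n in (4, 8, 16, 24, 32):
        targets = dadda_heights(n)
        # One reduction stage per target height, applied from the largest down to 2.
        print(f"N={n}: tree stages={len(targets)}, "
              f"array stages={n - 1}, targets={targets[::-1]}")
```

For a 24-bit significand this gives 7 reduction stages against 23 stages for the array structure, which is consistent with the logarithmic height described for tree multipliers.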


Fig. 6. 4x4 Dadda Multiplier reduction process

E. Vedic Multiplier

Vedic multipliers are based on Vedic mathematics. Of the sixteen sutras of Vedic mathematics, the "Urdhva Tiryakbhyam" sutra is considered the most effective for multiplication in terms of speed [12]. In 1965, Shri Bharathi Krishna Tirthaji proposed the Urdhva Tiryakbhyam multiplication algorithm in a book entitled "Vedic Mathematics" [4]. The algorithm expresses the product of two numbers as sums of products of their individual digits. Vertical and crosswise multiplication is used to multiply bits in the multiplicand and multiplier at various bit positions. This multiplier has a lower propagation delay than a complex traditional multiplier [4]-[7].

Fig. 7. The Urdhva Tiryakbhyam Sutra's multiplication technique

Fig. 8. 3x3 Vedic multiplier

IV. RESULTS AND ANALYSIS

According to [3], Xilinx ISE 14.7 was used to simulate 8-bit and 16-bit array and radix-4 Booth multipliers. The comparative analysis of the two multipliers on performance parameters such as power, area, and speed yields the following results.

Fig. 9. An analysis of 8-bit multipliers

Fig. 10. An analysis of 16-bit multipliers

The array multiplier occupies less area but is slower, whereas the radix-4 Booth multiplier occupies more area but is faster.

In [4], array, Vedic, and Wallace tree multiplier designs were simulated using Xilinx ISE 14.7. Their simulation yields the following results.

TABLE IV. COMPARATIVE RESULTS OF MULTIPLIER DESIGNS

The array multiplier takes up the least space in terms of LUTs and slices when compared with the other multipliers. Vedic multipliers perform worse than Wallace tree multipliers in terms of both area consumption and delay.

Fig. 11. Comparison of the multipliers' area and delay

This figure shows that Wallace multipliers offer the lowest delay and are suitable for high-performance applications. Because of its smaller area requirement, the array multiplier is the best choice for applications that place a high priority on area usage.

Fig. 12. A comparison of multipliers' latency

As the above figure shows, the array multiplier has a higher latency than the Vedic and Wallace tree multipliers. Vedic multipliers rely on optimization of the adder for improved efficiency [4]. Therefore, Wallace multipliers perform best when the adders are not optimized.

In [11], multipliers were designed and implemented using a Xilinx Artix-7 FPGA device and TSMC 180nm technology to compare performance in terms of area and delay. The implementation yields the following results.

TABLE V. AREA COMPARISON OF MULTIPLIER DESIGNS BASED ON 180NM TSMC TECHNOLOGY

TABLE VI. AREA COMPARISON OF MULTIPLIER DESIGNS BASED ON XILINX ARTIX-7 FPGA

From both tables, it can be observed that Wallace and Dadda multipliers occupy less area than the array multiplier, especially when combined with radix-4 Booth encoding. Because Dadda multipliers require fewer adder blocks than Wallace tree multipliers, they consume less area.

Fig. 13. Analyzing the delay of multipliers implemented with TSMC 180nm technology
Fig. 14. Analyzing the delay of multipliers implemented with Xilinx Artix-7 FPGA

The delay of array multipliers increases linearly with the number of bits. The delay of Wallace and Dadda multipliers does not follow this linear trend and instead increases logarithmically as data width increases. Dadda multipliers are faster than Wallace multipliers for a given data width, and combining Booth and Dadda multipliers gives the lowest delay.

V. CONCLUSION

This paper provides a comparative study of different floating-point multipliers based on their performance characteristics. When data width is low, array multipliers consume the fewest LUTs, but they have the highest latency. Compared with regular array multipliers, modified Booth multipliers consume a larger area but are faster. Vedic and Wallace tree multipliers are faster than array multipliers, but their power consumption is also higher. The Dadda multiplier is even faster than the Wallace multiplier because it uses fewer half and full adders at every level. We conclude that the combination of the modified Booth multiplier and the Dadda multiplier yields the shortest delay among the designs analyzed.

REFERENCES

[1] Na Bai, Hang Li, Jiming Lv, Shuai Yang, Yaohua Xu, "Logic Design and Power Optimization of Floating-Point Multipliers," Computational Intelligence and Neuroscience, vol. 2022, Article ID 6949846, 10 pages, 2022. https://doi.org/10.1155/2022/6949846.
[2] S. Arish and R. K. Sharma, "Run-time reconfigurable multi-precision floating point multiplier design for high speed, low-power applications," 2015 2nd International Conference on Signal Processing and Integrated Networks (SPIN), 2015, pp. 902-907, doi: 10.1109/SPIN.2015.7095315.
[3] Rahul Shakya and Poonam Jindal, "Comparative analysis of 8-bit and 16-bit Array multiplier, modified Booth Multiplier-A Study," Proceedings of the 3rd International Conference on Contents, Computing & Communication (ICCCC-2022), February 25, 2022. Available at SSRN: https://ssrn.com/abstract=4043967 or http://dx.doi.org/10.2139/ssrn.4043967.
[4] V. K. R, A. R. S and N. D. R, "A comparative study on the performance of FPGA implementations of high-speed single-precision binary floating-point multipliers," 2019 International Conference on Smart Systems and Inventive Technology (ICSSIT), 2019, pp. 1041-1045, doi: 10.1109/ICSSIT46314.2019.8987800.
[5] A. Sharma and T. K. Rawat, "Truncated Wallace Based Single Precision Floating Point Multiplier," 2018 7th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), 2018, pp. 407-411, doi: 10.1109/ICRITO.2018.8748843.
[6] K. V. Gowreesrinivas and P. Samundiswary, "Comparative study on performance of single precision floating point multiplier using vedic multiplier and different types of adders," 2016 International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), 2016, pp. 466-471, doi: 10.1109/ICCICCT.2016.7987995.
[7] K. V. Gowreesrinivas and P. Samundiswary, "Comparative performance analysis of multiplexer based single precision floating point multipliers," 2017 International conference of Electronics, Communication and Aerospace Technology (ICECA), 2017, pp. 430-435, doi: 10.1109/ICECA.2017.8212851.
[8] B. Jeevan, S. Narender, C. V. K. Reddy and K. Sivani, "A high speed binary floating point multiplier using Dadda algorithm," 2013 International Mutli-Conference on Automation, Computing, Communication, Control and Compressed Sensing (iMac4s), 2013, pp. 455-460, doi: 10.1109/iMac4s.2013.6526454.
[9] V. Buddhe, P. Palsodkar and P. Palsodakar, "Design and verification of Dadda algorithm based Binary Floating Point Multiplier," 2014 International Conference on Communication and Signal Processing, 2014, pp. 1073-1077, doi: 10.1109/ICCSP.2014.6950012.
[10] D. Kalaiyarasi and M. Saraswathi, "Design of an Efficient High Speed Radix-4 Booth Multiplier for both Signed and Unsigned Numbers," 2018 Fourth International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB), 2018, pp. 1-6, doi: 10.1109/AEEICB.2018.8480959.
[11] Chuah Ching Fun and Nandha Kumar Thulasiraman, "Synthesizable Verilog Code Generator for Variable-Width Tree Multipliers," Journal of Physics: Conference Series, vol. 1962, 2021, n. pag.
[12] S. S. Sinthura, A. Begum, B. Amala, A. Vimala and V. Vidhya Aparna, "Implemenation and Analysis of Different 32-Bit Multipliers on Aspects of Power, Speed and Area," 2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI), 2018, pp. 312-317, doi: 10.1109/ICOEI.2018.8553859.
