A new truncation algorithm of low hardware cost multiplier

Khalid Humood

2021, Periodicals of Engineering and Natural Sciences (PEN)


The multiplier is one of the most indispensable arithmetic circuits in digital signal design. Multipliers dissipate high power and occupy a significant amount of die area. In this paper, a low-error architecture for a pre-truncated parallel multiplier is presented. The coefficient word length is truncated to reduce the multiplier size. This truncation scales down the gate count and shortens the critical paths of the partial product array. The statistical errors of the designed multiplier are calculated for different pre-truncation values and compared. The multiplier is implemented on a Stratix III FPGA device. The post-fitting report presented in this paper shows a saving of 36.9% in resource usage and a reduction of 17% in propagation delay.

ISSN 2303-4521
Periodicals of Engineering and Natural Sciences, Original Research
Vol. 10, No. 1, January 2022, pp. 188-194

Qahtan Khalaf Omran1, Khalid Awaad Humood2, Tahreer Mahmood*3
1,2,3 Department of Electronic Engineering, College of Engineering, University of Diyala, Iraq

Keywords: Multiplier; Truncation error; DDFS; FPGA; carry save adder (CSA)

Corresponding Author: Qahtan Khalaf Omran, Department of Electronic Engineering, College of Engineering, University of Diyala, Diyala, Iraq. Email: [email protected]

1. Introduction

The researchers in [1][2] introduce a new ROM reduction technique that allows the memory cells to be accessed twice in one clock cycle using time sharing. As shown in Fig. 1(a), the MUX and its coefficients are the only source of the segment initial coefficients Ci. The key feature of that method is to use these coefficients to derive the slope coefficients: two succeeding coefficients at a time are manipulated in such a way as to obtain the targeted slope coefficients. The approach offers a good way to eliminate the bulky ROM-based LUT.
On the other hand, the computational cost, paid in terms of the extra logic gates used, has risen. This seemed unavoidable because of the costly multiplier that the design incorporates. In this work we therefore develop a new pre-truncation of the coefficient word length; the aim is to minimize the size of the existing multiplier without sacrificing design performance. The concept is to modify the multiplication process rather than to develop a new multiplier algorithm. The heart of the multiplier is the multi-operand adder. The parallel structure is adopted for the target multiplier because, with this structure, trimming the input word length significantly reduces the carry save adder (CSA) array. It is worth noting that the technique presented in this paper has been designed to fit the work in [1], but with slight modification the same design procedure can be applied to any other number of segments with similar results.

The aim of the proposed design is to understand the statistical errors of the designed multiplier. The design is implemented on a Stratix III FPGA device, and the statistical errors are calculated and discussed for different pre-truncation values. The results generated in this paper, with high data rate, can provide meaningful foundations for both electronic and telecommunication systems such as 4G LTE and 5G massive MIMO [3-8]. Additionally, they can help to determine the properties and factors affecting most types of channels and antennas [9,10].

The outline of this paper is as follows. The current introduction gives a general overview of the main objectives; an introduction to the theory of the truncation algorithm for the hardware multiplier is offered in section one.
Sections two and three present the methodology and theoretical model of the proposed design and how it works. In section four, the results of the simulated design are discussed together with the theoretical calculations. Finally, section five presents the conclusion of the work as well as suggestions.

© The Author 2022. This work is licensed under a Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/) that allows others to share and adapt the material for any purpose (even commercially), in any medium with an acknowledgement of the work's authorship and initial publication in this journal.

2. Proposed approach and mathematical model

For an n×m multiplier, the n×m partial products can easily be generated in parallel using n×m AND gates. The hard part of designing a fast multiplier is to minimize the logic utilization and the time required to add these partial products. As mentioned in section 1, the main concept is to adopt the parallel structure for the proposed multiplier, so that the carry save adder (CSA) array can easily be reduced by truncating the operand word length. As shown in Fig. 1(b), this technique benefits from the fact that the slope coefficient Mi has already been calculated and stored in register 3 during the interval Δx; its word length can therefore be truncated to reduce the multiplier size. Truncating the multiplier's input yields a smaller partial product array, but it produces an arithmetic error. By choosing the value of the truncated part carefully, we can significantly alleviate the impact of this error. Further error correction can be done by rounding the most significant bit of the truncated part.
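The relation between operand truncation and the size of the partial product array can be illustrated with a short software model. This is a minimal sketch, not the authors' hardware description: the function names `partial_products` and `multiply` are hypothetical, and the sizes n = 10, m = 8 and a truncation of 3 bits are simply the figures the paper arrives at later in Section 3.

```python
def partial_products(a, x, n, m, j=0):
    """AND-gate outputs of an n-by-m array multiplier.

    Rows that belong to the j least significant bits of `a` are omitted,
    which models pre-truncating that operand by j bits.
    Each entry is a pair (bit weight, AND output).
    """
    return [(i + k, (a >> i) & (x >> k) & 1)
            for i in range(j, n)
            for k in range(m)]


def multiply(a, x, n, m, j=0):
    """Add up the weighted partial products (the job of the CSA array)."""
    return sum(bit << weight for weight, bit in partial_products(a, x, n, m, j))


n, m, j = 10, 8, 3          # assumed sizes, taken from Section 3 of the paper
a, x = 805, 179             # arbitrary 10-bit and 8-bit test operands

full = multiply(a, x, n, m)
truncated = multiply(a, x, n, m, j)

print("partial products, full width:", len(partial_products(a, x, n, m)))
print("partial products, truncated :", len(partial_products(a, x, n, m, j)))
print("arithmetic error of truncation:", full - truncated)
```

The truncated model drops j × m AND terms, and the error it introduces equals the discarded low bits of the truncated operand multiplied by the other operand, which is why the choice of the truncated part matters.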
The slope coefficient $M_i$ can be calculated from the sine function as in [1], and the approximated segment lines can be written as

$$Y_i(x) = \frac{\sin(i\,\Delta x) - \sin((i-1)\,\Delta x)}{\Delta x}\, x + C_i \tag{1}$$

where $C_i$ is the initial amplitude of the $i$-th segment and $\Delta x$ is the length of a segment. To avoid the division in (1), the formula can be modified to

$$\Delta x \times Y_i(x) = \left(\sin(i\,\Delta x) - \sin((i-1)\,\Delta x)\right) x + C_i \times \Delta x \tag{2}$$

Let us define $S_D = \sin(i\,\Delta x) - \sin((i-1)\,\Delta x)$ as the difference of any two consecutive sine points; equation (2) then becomes

$$\Delta x \times Y_i(x) = S_D\, x + C_i \times \Delta x \tag{3}$$

Since $\Delta x$ is constant, the product $(C_i \times \Delta x)$ can be realized simply by costless hardwired pre-shifting, while a variable multiplier is essential to perform the product $(S_D \cdot x)$, which is the core focus of this study. The dynamic range of the quantized $S_{DQ}$ is

$$\lfloor 2^{d} S_{D\min} \rfloor \le S_{DQ} \le \lfloor 2^{d} S_{D\max} \rfloor \tag{4}$$

where $\lfloor\cdot\rfloor$ denotes the floor function and $d = L-1$ represents the segment initial amplitude resolution. The minimum and maximum word lengths of $S_D$ are

$$P_{\min} = \lceil \log_2 S_{DQ\min} \rceil, \qquad P_{\max} = \lceil \log_2 S_{DQ\max} \rceil$$

where $\lceil\cdot\rceil$ denotes the ceiling function. The multiplication to be performed is therefore $2^{P_{\max}} \times 2^{B}$, i.e. a $(P_{\max} \times B)$-bit multiplier is needed. Let $\Delta x = U \times V$ with $\Delta x = 2^{B}$. Equation (3) becomes

$$U \times V \times Y_i(x) = S_{DQ}\, x + C_i \times U \times V \tag{5}$$

Dividing both sides by $V$, we have

$$U \times Y_i(x) = \frac{S_{DQ}\, x}{V} + C_i \times U \tag{6}$$

If $V = 2^{J}$, where $1 \le J < B$, then $U = 2^{B-J}$ and (6) becomes

$$2^{B-J} \times Y_i(x) = \frac{S_{DQ}\, x}{2^{J}} + C_i \times 2^{B-J} \tag{7}$$

where the range of the quantized $C_{qi}$ is $0 \le C_{qi} \le \lfloor 2^{L-1} \sin\frac{\pi(S-1)}{2S} \rfloor$. The first term of (7), $\frac{S_{DQ}\, x}{V}$, needs a $(P_{\max} \times B)$-bit multiplier. It is therefore desirable to minimize $\frac{S_{DQ}\, x}{V}$ to obtain the minimum multiplier size, where $\frac{S_{DQ}\, x}{V}$ lies somewhere in the range from $\frac{2^{P_{\min}} \times 2^{B}}{2^{J}}$ to $\frac{2^{P_{\max}} \times 2^{B}}{2^{J}}$. The only way to minimize the multiplier size is to maximize $J$, which can lie anywhere between 1 and $B-1$. To avoid results of the division that are less than one, $J$ must satisfy the condition $\lfloor 2^{P_{\min}} / 2^{J} \rfloor > 2^{0}$, or $2^{J} < 2^{P_{\min}}$, i.e.

$$2^{J} < S_{DQ\min} \tag{8}$$

$$S_{DQ\min} = \left\lfloor 2^{L-1} \times \left\{ \sin\frac{\pi(S-1)}{2S} - \sin\frac{\pi(S-2)}{2S} \right\} \right\rfloor \tag{9}$$

So,

$$J \le \left\lfloor \log_2 \left\{ 2^{L-1} \times \left[ \sin\frac{\pi(S-1)}{2S} - \sin\frac{\pi(S-2)}{2S} \right] \right\} \right\rfloor \tag{10}$$

Equation (10) represents the realizable range of $J$. The final multiplier becomes $\left(\frac{2^{P_{\max}}}{2^{J}}\right) \times 2^{B}$, i.e. $(P_{\max} - J) \times B$ instead of $P_{\max} \times B$, where $P_{\max} = \lceil \log_2 S_{DQ\max} \rceil$ and

$$S_{DQ\max} = \left\lfloor 2^{L-1} \times \sin\frac{\pi}{2S} \right\rfloor \tag{11}$$

By applying the division before storing the coefficients in the memory cells, i.e. storing $\frac{C_i}{V}$ instead of $C_i$, we can introduce another improvement: the cell word length can be reduced by $J$ bits. Consequently, there is an additional ROM reduction ratio of $\frac{J}{L-J-1}$. It is therefore useful to rewrite (7) as follows.

$$2^{B-J} \times Y_i(x) = \frac{S_{DQ}\, x}{2^{J}} + \frac{C_i}{2^{J}} \times 2^{B-J} \times 2^{J} = \left(\frac{S_{DQ}}{2^{J}}\right) x + 2^{B} \left(\frac{C_i}{2^{J}}\right) \tag{12}$$

In Fig. 1(b) it can be seen that a MUX and its coefficients provide the segment initial amplitudes $\frac{C_i}{2^{J}}$, represented with $L-J-1$ bits, and the $\frac{S_{DQ}}{2^{J}}$ coefficients with $P-J$ bits. The error produced by this truncation can be estimated as follows:

$$\varepsilon_i = \frac{\left| C_i - \left\lfloor \frac{C_i}{2^{J}} + 0.5 \right\rfloor \times 2^{J} \right|}{2^{L-1}} \tag{13}$$

The worst-case error occurs when the $J$ LSBs of $C_i$ are equal to $(2^{J}-1)$ (i.e. all $J$ LSBs of $C_i$ are logic 1); the error then becomes

$$\varepsilon_{\max} = 2^{J-1} \tag{14}$$

The fractional part of the real number $\frac{C_i}{V}$, denoted by $\left\{\frac{C_i}{V}\right\}$, is defined by the formula

$$\left\{\frac{C_i}{V}\right\} = \frac{C_i}{V} - \left\lfloor \frac{C_i}{V} \right\rfloor \quad \text{for all } \frac{C_i}{V}, \qquad 0 \le \left\{\frac{C_i}{V}\right\} < 1 \tag{15}$$

To reduce the amount of error, the most significant bit (MSB) of the fractional part $\left\{\frac{C_i}{V}\right\}$ should be rounded into the integer part $\left\lfloor \frac{C_i}{V} \right\rfloor$:

$$\operatorname{round}\left(\frac{C_i}{V}\right) = \left\lfloor \frac{C_i}{V} + 0.5 \right\rfloor \tag{16}$$

This rounding alleviates the error by forcing the $J$ LSBs of all $C_i$ to be less than $(2^{J}-1)$. This can be done by modifying the sine points before they are stored in ROM, to ensure that the truncation of the $C_i$ coefficients does not produce a binary result whose $J$ LSBs equal $(2^{J}-1)$.
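The bounds (9) to (11) and the error expression (13) lend themselves to a quick numerical check. The following is a minimal sketch under stated assumptions: the function names are hypothetical, L = 15 and S = 32 are the values implied by the $2^{14}$ scaling and the 32 segments used in Section 3, and the rounding of (16) is implemented in integer arithmetic as an add-and-shift. Evaluated with a strict floor, the sketch gives 803 and 59 for the two slope bounds where Section 3 quotes 804 and 60; the derived $P_{\max} = 10$ and $J \le 5$ agree.

```python
import math


def slope_word_lengths(L, S):
    """Evaluate (11), (9), P_max and the bound (10) for resolution L and S segments."""
    d = L - 1
    sdq_max = math.floor(2**d * math.sin(math.pi / (2 * S)))            # (11)
    sdq_min = math.floor(2**d * (math.sin(math.pi * (S - 1) / (2 * S))
                                 - math.sin(math.pi * (S - 2) / (2 * S))))  # (9)
    p_max = math.ceil(math.log2(sdq_max))
    j_max = math.floor(math.log2(sdq_min))                              # (10)
    return sdq_max, sdq_min, p_max, j_max


def rounding_error(c, j):
    """Numerator of (13): |C - floor(C / 2^J + 0.5) * 2^J|, valid for J >= 1."""
    rounded = (c + (1 << (j - 1))) >> j      # integer form of floor(c / 2**j + 0.5)
    return abs(c - (rounded << j))


sdq_max, sdq_min, p_max, j_max = slope_word_lengths(L=15, S=32)
print("P_max =", p_max, " realizable J: 1 ..", j_max)

# Worst case of the rounding error over every possible 15-bit amplitude.
for j in range(1, j_max + 1):
    worst = max(rounding_error(c, j) for c in range(1 << 15))
    print("J =", j, " worst-case error =", worst)
```

In this sketch the worst case equals $2^{J-1}$, the value given by (14), and it is reached when the discarded $J$ bits of the amplitude equal $2^{J-1}$.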
This minor modification will not contribute a noticeable error from the system's perspective. Because the number of coefficients is small (S coefficients), the rounding can be applied by modifying the coefficients $C_i$ before they are stored in the ROM, which eliminates the need for additional rounding hardware. Fig. 2(a) and (b) show the array structures of the $(P_{\max} \times B)$ and $(P_{\max} - J) \times B$ parallel multipliers. The product $\left(\frac{S_{DQ}}{2^{J}}\right) x$ can be expressed as follows:

$$\left(\frac{S_{DQ}}{2^{J}}\right) x = \sum_{i=J}^{P_{\max}-1} S_{D,i}\, 2^{i} \cdot \sum_{k=0}^{B-1} x_k\, 2^{k} = \sum_{i=J}^{P_{\max}-1} \sum_{k=0}^{B-1} S_{D,i}\, x_k\, 2^{i+k} \tag{17}$$

It can be seen that the new multiplier has $(P_{\max} - J) \times B$ partial product terms $(S_{D,i} \cdot x_k)$, i.e. only $(P_{\max} - J) \times B$ full adders (FA) are needed instead of $P_{\max} \times B$ FA, and the time delay becomes $(P_{\max} - J + B) \times T_{PD,FA}$ instead of $(P_{\max} + B) \times T_{PD,FA}$, where $T_{PD,FA}$ represents the propagation delay of one FA. Hence the proposed multiplier offers an improvement of $(J \times T_{PD,FA})$ in time delay and has $(J \times B)$ fewer FA components.

3. Calculation of multiplier size

Based on the design requirements derived in the previous section, we start the design of the multiplier by computing $S_{DQ\max}$ as follows:

$$S_{DQ\max} = \left\lfloor 2^{14} \times \sin\frac{\pi}{64} \right\rfloor = 804$$

This value is used to obtain the maximum word length of $S_D$:

$$P_{\max} = \lceil \log_2 804 \rceil = 10$$

This means that a 10×8-bit multiplier is needed to perform the $S_D \cdot x$ product. This multiplier is clearly large, so in the following step we attempt to reduce its size so as to reduce the gate count with an acceptable error. To find the realizable range of $J$ we have to calculate $S_{DQ\min}$ using (9):

$$S_{DQ\min} = \left\lfloor 2^{14} \times \left\{ \sin\frac{31\pi}{64} - \sin\frac{30\pi}{64} \right\} \right\rfloor = 60$$

Then by using (10) we have $J \le \lfloor \log_2 60 \rfloor = 5$, i.e. $1 \le J \le 5$. Using formula (13), we can calculate the average absolute error for J = 1, 2, 3, 4, 5; the calculated results are reported in Table 1.

Table 1. The truncation error (×2^(L-1))

| Truncation (bits) | Maximum error | Average error | Variance |
|---|---|---|---|
| J = 5 | 16 | 8.25 | 24.437499 |
| J = 4 | 8 | 3.5625 | 4.80859 |
| J = 3 | 4 | 2.0625 | 1.24609 |
| J = 2 | 2 | 1.1875 | 0.5898437 |
| J = 1 | 1 | 0.375 | 0.2343749 |

[Figure 1. Proposed architecture]

[Figure 2. Array structure of parallel multiplier: (a) Pmax×B bit multiplier, (b) (Pmax-J)×B bit multiplier]

It can be seen that the average error for J = 3 is acceptable, so J = 3 is taken as a reasonable compromise between the reduction of the multiplier size and a tolerable error. The $\frac{C_i}{V}$ coefficients calculated with this value of J are included in Table 2.

Table 2. Design coefficients, 32 segments, J = 3

Consecutive sine points difference $\frac{S_D}{V}$:
[100, 101, 100, 99, 98, 97, 95, 94, 92, 89, 88, 85, 82, 79, 76, 73, 69, 66, 62, 58, 54, 49, 45, 40, 36, 32, 27, 22, 17, 12, 8, 2]

Initial amplitude coefficient $\frac{C_i}{V}$ (quantized with 11 bits):
[0, 100, 201, 301, 400, 498, 595, 690, 784, 876, 965, 1053, 1138, 1220, 1299, 1375, 1448, 1517, 1583, 1645, 1703, 1757, 1806, 1851, 1892, 1928, 1960, 1987, 2009, 2026, 2038, 2046]

Finally, a truncated 7×8 multiplier has been placed in the targeted DDFS architecture instead of the 10×8 multiplier. The proposed multiplier offers an improvement of $(3 \times T_{PD,FA})$ in time delay and uses $(3 \times 8)$ fewer FA.

4. Design implementation and verification

The proposed design is written in VHDL using the Quartus II 11.0 sp1 software. A Stratix III EP3SE50F484C2 FPGA device is used to implement both the truncated and the conventional full-width multiplier. Table 3 shows the post-fitting report of the implementations. The project is then verified for the desired output using the ModelSim-Altera 6.6d simulation software. Figure 3 shows the RTL simulation result for the sample data $\frac{S_D}{V}$ (49, 45, 40, 36) multiplied by all possible combinations of the phase sample inputs.
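When checking simulation waveforms of this kind, a small software reference can be convenient. The sketch below is an illustration only and is not taken from the paper's VHDL: the function names are hypothetical, the amplitude coefficients are assumed to be $2^{L-1}\sin\frac{i\pi}{2S}$ divided by $2^{J}$ and rounded as in (16), and the values L = 15, S = 32, B = 8, J = 3 are read off Sections 2 and 3. Its first output can be compared with the initial-amplitude column of Table 2, and its second with the products for the sample slopes named above.

```python
import math

L, S, B, J = 15, 32, 8, 3   # assumed design parameters, read off Sections 2 and 3


def amplitude_coefficients(L, S, J):
    """Rounded C_i / 2^J for i = 0 .. S-1, with C_i = 2^(L-1) * sin(i*pi / (2S))."""
    scale = 2 ** (L - 1) / 2 ** J
    return [math.floor(scale * math.sin(i * math.pi / (2 * S)) + 0.5)
            for i in range(S)]


def reference_products(slopes, B):
    """Product of each truncated slope with every possible B-bit phase input."""
    return {s: [s * x for x in range(2 ** B)] for s in slopes}


coeffs = amplitude_coefficients(L, S, J)
products = reference_products([49, 45, 40, 36], B)
widest = max(max(values) for values in products.values())

print("first and last amplitude coefficients:", coeffs[:4], coeffs[-1])
print("bits needed for the largest product:", widest.bit_length())
```

Every generated coefficient fits in the L - J - 1 = 11 bits stated for the ROM word, and every product fits in the 7 + 8 output bits of the truncated multiplier.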
The figure shows the truncation and rounding process and the hard-wired shifting of the truncated value in decimal and binary radix. It is worth noting that, because of the output register, the truncated multiplier result appears after one clock cycle, which is highlighted in the figure.

Table 3. Post-fitting report of the multiplier

| | Truncated multiplier (7×8) | Full-width multiplier (10×8) |
|---|---|---|
| Combinational ALUTs | 123 | 195 |
| Dedicated logic registers | 30 | 36 |
| F clock | 211.37 MHz | 175.8 MHz |

[Figure 3. Simulation result of truncated 7×8 multiplier]

5. Conclusion

A pre-truncation of the initial amplitude value has been employed to reduce the gate count and time delay of the multiplier with an acceptable error. An improvement of $(3 \times T_{PD,FA})$ in time delay and a reduction of 24 full adders were achieved. The developed multiplier has been placed in the DDFS system and tested. Compared with the conventional multiplier, the proposed multiplier uses 36.9% fewer logic resources and has a 17% shorter propagation delay.

References

[1] Q. K. Omran, M. T. Islam, and N. Misran, "A new approach to the design of low-complexity direct digital frequency synthesizer," Przegląd Elektrotechniczny (Electrical Review), vol. 89, no. 5, pp. 157-160, 2013.
[2] Q. K. Omran and M. T. Islam, "An efficient ROM compression technique for linear-interpolated direct digital frequency synthesizer," IEEE Conf. Semicond. Electron., vol. 48, pp. 2409-2418, 2014.
[3] T. Mahmood, O. A. Mahmood, and K. A. Humood, "An efficient technique to PAPR reduction for LTE uplink using Lonzo's resampling technique in both SC-LFDMA and SC-DFDMA systems," Applied Nanoscience, 2021.
[4] H. K. AL-Qaysi, T. Mahmood, and K. A. Humood, "Evaluation of different quantization resolution levels on the BER performance of massive MIMO systems under different operating scenarios," Indonesian Journal of Electrical Engineering and Computer Science, vol. 23, no. 3, pp. 1493-1500, 2021.
[5] A. H. M. Alaidi, A. S. Abdalrada, and F. T. Abed, "Analysis the Efficient Energy Prediction for 5G Wireless Communication Technologies," International Journal of Emerging Technologies in Learning (iJET), vol. 14, no. 08, pp. 23-37, 2019.
[6] H. T. S. Al-Rikabi, Enhancement of the MIMO-OFDM Technologies. California State University, Fullerton, 2013.
[7] A. Al-Dawoodi, H. Maraha, S. Alshwani, A. Ghazi, A. M. Fakhrudeen, S. Aljunid, S. Z. S. Idrus, A. A. Majeed, and K. A. Ameen, "Investigation of 8 x 5 Gb/s mode division multiplexing FSO system under different weather condition," Journal of Engineering Science Technology, vol. 14, no. 2, pp. 674-681, 2019.
[8] A. Ghazi, S. Aljunid, S. Z. S. Idrus, R. Endut, C. Rashidi, N. Ali, A. Al-dawoodi, A. M. Fakhrudeen, A. Fareed, and T. Sharma, "Hybrid WDM and Optical-CDMA over Multi-Mode Fiber Transmission System based on Optical Vortex," Journal of Physics: Conference Series, vol. 1755, no. 1, p. 012001, 2021.
[9] T. Mahmood, H. AL-Qaysi, and A. Hameed, "The Effect of Antenna Height on the Performance of the Okumura/Hata Model Under Different Environments Propagation," International Conference on Intelligent Technologies (CONIT), pp. 1-4, IEEE, 2021.
[10] T. Mahmood, W. Q. Mohamed, and O. A. Imran, "Factors Influencing the Shadow Path Loss Model with Different Antenna Gains Over Large-Scale Fading Channel," International Conference on Artificial Intelligence and Mechatronics Systems (AIMS), pp. 1-5, 2021.
