A Low-Error Energy-Efficient Fixed-Width Booth Multiplier With Sign-Digit-Based Conditional Probability Estimation

Download as pdf or txt
Download as pdf or txt
You are on page 1of 5

236 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 65, NO.

2, FEBRUARY 2018

A Low-Error Energy-Efficient Fixed-Width Booth


Multiplier With Sign-Digit-Based Conditional
Probability Estimation
Ziji Zhang and Yajuan He, Member, IEEE

Abstract—Fixed-width multipliers are intensively used in many A partitioning method was proposed in [6] to partition the
DSP applications whose accuracy and energy efficiency affect truncated part in Booth multiplication into two: one major and
the whole digital system to a large extent. To improve the com- one minor depending upon the influence on the induced trun-
putation accuracy, a Booth-encoded sign-digit-based conditional cation error. This method is widely adopted in the subsequent
probability estimation approach is proposed. A symmetric error designs. Later on, a simple compensation circuit is proposed
distribution is obtained by taking the sign bit of the Booth- in [7] to compensate the truncation error with limited accu-
encoded multiplier into consideration when applying the con-
ditional probability. In addition, a more generalized mux-based
racy. To lower the error, a simulation-based method in [8] is
estimation method is formulated for the circuit implementation, formulated by taking the information provided by the Booth
which reduces the delay time and power dissipation. Simulation encoding. However, the exhaustive simulation is time consum-
results show that the proposed multiplier exhibits the best compu- ing. A probabilistic estimation bias method is thus proposed
tation accuracy with the least energy per operation. It performs in [9] to reduce the simulation time with acceptable accu-
even better for those operand lengths that are not multiples of racy. The error performance is further improved in [10] by
4. The maximum reduction on energy-delay-error product can applying a complex multi-level conditional probability model,
reach 14.8% compared with all its contenders among various which gains higher accuracy at the cost of increased area.
operand lengths. A better accuracy-area trade-off solution is presented in [11],
Index Terms—Fixed-width multiplier, Booth encoding, approx- which applies both the conditional probability estimation and
imate computation, mean square error, energy-efficient. the computer simulation with a dynamic error-compensation
method.
In this brief, a Booth-encoded sign-digit-based conditional
probability (BSCP) method is proposed in order to achieve
I. I NTRODUCTION a lower mean square error since it is an important indicator of
IXED-WIDTH multipliers are widely used in modern
F energy-efficient multimedia and digital signal process-
ing systems, which are desirable to utilize a fixed operand
computation accuracy [1]. The compensation value is derived
by applying the sign bit of Booth encoded multiplier to the
conditional and expected probability. A simple mux-based esti-
length [1]–[5]. In order to obtain the same bit width as input, mation circuit is formulated from the proposed method, which
the least significant bits of the output are normally trun- yields a simplified compensation logic design.
cated. Among various fixed-width multipliers, post-truncated The rest of this brief is organized as follows. Section II
multiplier (PTM) rounds off the product after a complete introduces the basics of fixed-width Booth multiplier. The pro-
calculation, which makes its accuracy the best while the posed compensation algorithm and the corresponding circuit
hardware expenses the most. Contrarily, direct-truncated mul- design are illustrated in Section III. Section IV evaluates the
tiplier (DTM) cuts off the least significant half partial products performance of proposed fixed-width multiplier and compares
directly without any compensation, which results in the least it with its rivals. Finally, Section V concludes this brief.
computation accuracy and the hardware complexity as well.
For a low error energy-efficient multiplication, the truncation II. P RELIMINARIES
error has to be well compensated with minimum overhead in
order to maintain the required precision while keep the energy Suppose A(aN−1 aN−2 . . . . . . a1 a0 ) and B(bN−1 bN−2 . . . . . .
dissipation as low as possible. b1 b0 ) are two N-bit two’s complement numbers, which rep-
Recently, modified Booth algorithm has been increas- resent multiplicand and multiplier, respectively. Thus, their
ingly employed in the fixed-width multiplier design so as to product P (p2N−1 p2N−2 . . . . . . p1 p0 ) can be obtained as
contribute further to the speedup and logic reduction [6]–[11]. follows:
 

N−2
Manuscript received March 26, 2017; accepted April 30, 2017. Date of pub- P = A × B = −aN−1 · 2N−1 + ai · 2i
lication May 29, 2017; date of current version January 29, 2018. This work
i=0
was supported by the National Natural Science Foundation of China under ⎛ ⎞
Grant 61604036 and Grant 61534002. This brief was recommended by 
N−2
Associate Editor H.-T. Zhang. (Corresponding author: Yajuan He.) × ⎝−bN−1 · 2N−1 + bj · 2j ⎠. (1)
The authors are with the School of Microelectronics and Solid State
Electronics, University of Electronic Science and Technology of China, j=0
Chengdu 610054, China (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available In fast multiplication design, modified Booth encoding is an
online at http://ieeexplore.ieee.org. efficient way to reduce the number of partial products by half.
Digital Object Identifier 10.1109/TCSII.2017.2709801 As shown in Table I, three consecutive bits (b2j+1 b2j b2j−1 )
1549-7747 c 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
ZHANG AND HE: LOW-ERROR ENERGY-EFFICIENT FIXED-WIDTH BOOTH MULTIPLIER 237

TABLE I
M ODIFIED B OOTH E NCODING

Fig. 1. Partial product matrix of an 8×8-bit Booth multiplier.

from a multiplier are grouped to form a signed digit (dj ). cj


represents a correction bit due to Booth encoding in two’s
complement representations. It is ‘1’ if dj is negative, other-
wise it is ‘0’. The encoded multiplier D from its binary variant
B is shown in (2).

N/2−1 Fig. 2. Partial product matrix of fixed-width Booth multiplier with
D= dj · 22j (2) compensation algorithm.
j=0

Therefore, P in (1) can be rewritten as


 ⎛N/2−1 ⎞ A. Sign-Digit-Based Conditional Probability

N−2  As mean square error (MSE) is a key indicator of accu-
P = A × D = −aN−1 · 2N−1 + ai · 2i ⎝ dj · 22j ⎠. racy, it is important to keep the MSE as low as possible. Let
i=0 j=0 εmse indicate the MSE of εA,D , and we use TPA,D replacing
(3) TPminor · 2−N in (6) for convenience, so we have
Fig. 1 illustrates a typical partial product matrix of an 8×8-
1 −1
2N−1 −1
2N−1
bit Booth multiplier. ppj,i indicates the partial product located εmse = 2N εA,D
2
at the i-th column and j-th row, which is generated from the 2
D=−2N−1 A=−2N−1
i-th bit of the multiplicand ai and the j-th digit of the Booth
encoded multiplier dj . A simplified sign extension method 1 −1
2N−1 −1
2N−1

2
introduced in [12] is utilized, in which sj denotes the extended = TPA,D − f (D)/2 . (7)
sign bit of the j-th partial product. 22N
D=−2N−1 A=−2N−1
For a fixed-width multiplier, the partial product matrix
shown in Fig. 1 is typically partitioned into two parts: a main For a given D, the MSE of εA,D can be expressed as εmse,D .
part (MP) for an accurate accumulation, and a truncation part The compensation function f (D) can always be formulated
(TP) for an approximate estimation. Thus, the product of the according to a specific D, which leads to the following:
fixed-width multiplier, PFWM , is given by
1 −1
2N−1
1 −1
2N−1

2
PFWM = MP + TP · 2−N . (4) εmse,D = N εA,D
2
= N TPA,D − f (D)/2 .
2 2
For an N-bit PTM, both MP and TP are calculated and the A=−2N−1 A=−2N−1
result is then rounded to N bits, which gives the maximum (8)
error εPTM,max of 1/2. On the other hand, DTM truncates TP
directly, which gives the maximum truncation error εDTM,max Thus, (7) can be rewritten as
of N/2.
1 −1
2N−1
εmse = N εmse,D . (9)
III. P ROPOSED F IXED -W IDTH M ULTIPLIER D ESIGN 2
D=−2N−1
Fig. 2 illustrates the composition of the proposed compen-
sation algorithm. As indicated, TPmajor and TPminor are the Let σ 2 (εA,D ) and μ(εA,D ) be the variance and mean of εA,D ,
sum of the most significant column of TP and the rest part respectively, according to the definition of MSE, we have:
of TP, respectively. The former is reserved for an accurate

accumulation, while the latter is used for estimation. Suppose εmse,D = σ 2 εA,D + μ2 εA,D
f (D) is the compensation function, which is only related to

the Booth-encoded multiplier D. As shown in Fig. 2, f (D) = σ 2 TPA,D − f (D)/2 + μ2 TPA,D − f (D)/2 . (10)
represents the estimation result from TPminor . Thus, the N-bit
Since f (D) is only related to D, the variance of εA,D should
product PA,D can be expressed as

be a constant and independent of f (D). Therefore, the Min
PA,D = MP + TPmajor + f (D) /2. (5) value of εmse,D can be obtained as shown in (11), with the
condition being satisfied in (12).
With (4) and (5), the truncation error after compensation
εA,D can be calculated as

εmse,Dmin = σ 2 TPA,D − f (D)/2 (11)


εA,D = PFWM − PA,D = TPminor · 2−N − f (D)/2. (6) μ TPA,D − f (D)/2 = 0. (12)
238 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 65, NO. 2, FEBRUARY 2018

TABLE II
PARTIAL P RODUCT B ITS AND THE C ORRESPONDING E XPECTATIONS
BASED ON A LL P OSSIBLE B OOTH -E NCODED D IGITS

Fig. 3. A merge sorting circuit used for a fixed-width Booth multiplier.

With (12) and (18), f (D) can be derived as


Therefore, the Min value of εmse can be derived as


N/2−1

−1
2N−1 −1
2N−1

f (D) = 2μ TPA,D = β/2 + 2 −dj · 22j−1−N . (19)
εmse min = εmse,Dmin = σ 2 TPA,D − f (D)/2 . j=0
D=−2N−1 D=−2N−1
(13) As the final product is supposed to be an integer, the com-
pensation value should be rounded into an integer as well.
N/2−1
As the mean of TPA,D could be obtained from the expecta- In addition, the amplitude of j=0 (−dj ) · 22j−1−N is less
tions of each partial product together with the corresponding than a quarter with its sign opposite to the multiplier, i.e.,
correction bit. Thus, the mean value of the j-th partial product when bN −1 = 0, this term is negative, otherwise, it is pos-
in the matrix, μj (TPA,D ), can be calculated as itive. Considering the parity of β, the compensation integer
f (D)int can be derived as:


N−2j−2


μj TPA,D = E ppj,i · 22j+i−N + cj · 22j−N (14) β/2 + 1 bN−1 = 1
f (D)int = (20)
i=0 (β − 1)/2 + 1 bN−1 = 0.
where E(·) indicates the expectation. Noted that cj here uses
the actual bit generated from the encoded multiplier instead B. Compensation Circuits Design
of a traditional estimation.
Suppose ωk−1 ωk−2 . . . . . . ω1 ω0 is a k-bit compensation
Let nj be an indication whether the j-th Booth encoded digit
binary that equals f(D)int , k is determined by
(dj ) is a non-zero digit or not. Based on all possible values of
dj , nj , cj , each partial product bit ppj ,i and its corresponding  
k = 1 + log2 (max(β)/2 + 1)
expectation are summarized in Table II. ai is the i-th bit of the  
multiplicand (a−1 =0) with a uniform input distribution. Thus, = 1 + log2 (N/4 + 1) . (21)
the probability of ai being one is a half.
Therefore, the compensation value can be calculated based
Observed from Table II, E(ppj,i ) can be expressed as
on β for the operand length N according to (22).
a function of dj and cj according to nj and the integer i as
To obtain ‘β’ and ‘β-1’ at the same time, odd-even


0 nj = 0 merge sorting algorithm is a more efficient way than tra-
E ppj,i = 1/2 nj = 0, i = 0 (15) ditional addition. In this algorithm, the non-zero bit array
1 − dj /2 − cj nj = 0, i = 0 (nN /2−1 nN /2−2 . . . nj . . . n1 n0 ) is sorted in such an ascend-
ing order, (γ N /2−1 γ N /2−2 . . . γ j . . . γ 1 γ 0 ), so that there
From (15), E(ppj,i ) can be further derived as a function of must be a demarcation point, γ β , where γ β =0 and γ β −1 = 1.
dj , cj and nj Therefore, we have


nj /2 i = 0
E ppj,i = (16) 1 (0 ≤ j < β)
nj − dj /2 − cj i = 0. γj = (22)
0 (β ≤ j < N/2).
Therefore, μj (TPA,D ) in (14) can be calculated as

A simple odd-even merge sorting circuit is exemplified in
μj TPA,D Fig. 3 with N = 8 and k = 2. (γ 3 γ 2 γ 1 γ 0 ) is the order gen-
erated by the merge sorting circuits from the original array

N−2j−2

(n3 n2 n1 n0 ).
= E ppj,i · 22j+i−N + cj · 22j−N Once the order is generated, a 2-bit compensation value
i=0

f (D)int (ω1 ω0 ) can be derived as
= nj /4 + −dj · 22j−1−N (j = 0, 1, . . . , N/2 − 1). (17)
ω1 = γ2 b7 + γ1 b7 = γ2 + γ1 b7
Let β be the total number of non-zero digits in D, we have
ω0 = γ2 b7 + γ3 b7 + γ1 b7 = ω1 + γ3 b7 . (23)


N/2−1


N/2−1

μ TPA,D = μj TPA,D = nj /4 + −dj · 22j−1−N As noticed, the merge sorting circuit requires a bal-
j=0 j=0
anced structure. Therefore, extra hardware resources would
be incurred for those operand lengths which are not multiples

N/2−1

of 4. In order to deal with this problem, we propose a sim-
= β/4 + −dj · 22j−1−N (0 ≤ β ≤ N/2). ple method that firstly round down the operand length to be
j=0 a multiple of 4, which excludes nN /2−1 from the original array.
(18) Then, bN −1 and nN /2−1 are used as two control signals to
ZHANG AND HE: LOW-ERROR ENERGY-EFFICIENT FIXED-WIDTH BOOTH MULTIPLIER 239

Fig. 4. Estimation circuit for producing one-bit compensation value ωm .


Fig. 5. Circuit structure of an 8×8-bit BSCP multiplier.

select the correct compensation value. As an example, when TABLE III


N = 10, the two-bit binary f (D)int can be derived in this way as E RROR P ERFORMANCE OF D IFFERENT F IXED -W IDTH M ULTIPLIERS

ω1 = (γ2 n4 + γ1 n4 )b9 + (γ1 n4 + γ0 n4 )b9


ω0 = ((γ1 + γ3 )n4 + γ2 n4 )b9
+ ((γ0 + γ2 )n4 + (γ1 + γ3 )n4 )b9 . (24)
A more generalized derivation on the m-th bit compensa-
tion value, ωm , can be derived based on the number of operand
length N whether it is a multiple of 4 or not. Therefore, an
estimation circuit used to generate the compensation value
is proposed with one value generation block and one selec-
tion logic. The former is responsible to generate all possible
compensation values, ωm,r . Fig. 4(a) shows its gate-level
implementation, where r indicates the corresponding bit in
. Fig. 4(b) and 4(c) illustrates two cases for a slice of the
estimation circuit with the operand length a multiple of 4 and
not a multiple of 4, respectively. Once ωm,r is obtained from
each corresponding value generation block, a correct compen-
sation value ωm will be chosen by the selection logics with
the control signal bN −1 for case (b) and bN −1 together with
nN /2−1 for case (c).
Fig. 5 illustrates the circuit structure of the reduction proce-
dure for an 8×8-bit BSCP multiplier. As shown in the figure,
black dots indicate accurate bits corresponding to the par-
tial products in MP and TPmajor according to the partitioning
method in Fig. 2. A gate-level circuit surrounded by the dashed maximum absolute error (MAE) and mean square error (MSE)
line is used to generate the compensation value derived from for each multiplier.
f (D)int . This compensation value is a two-bit number, ω1 ω0 , As illustrated in Table III, except PTM, the proposed BSCP
represented with white dots in Fig. 5. HA and FA represent multiplier exhibits the best accuracy with the minimum ME,
a half adder and a full adder, respectively. A carry propagation MAE as well as MSE. Compared with the most competitive
adder is used to generate the final result. competitor SCG [8], although BSCP performs the same in
terms of ME and MAE, it outperforms in MSE. This is due
IV. P ERFORMANCE E VALUATIONS to the way which the compensation value is derived. With
the actual sign information instead of a traditional estima-
This section evaluates the performance of our proposed
tion, the proposed method comes with a more symmetric error
fixed-width Booth multiplier and compares it with several
distribution which results in a better MSE.
competitive multipliers, including DTM, PTM, the probability
and computer simulation (PACS) based multiplier [11], and B. Circuit Performance
the one with simple circuit generator (SCG) [8].
To evaluate the circuit performance, each multiplier men-
For a fair and legitimate comparison, all fixed-width Booth
tioned in Table III is described in Verilog HDL, and synthe-
multipliers are implemented with the same Booth encoder and
sized using the Synopsys Design Compiler. All experiments
the compression tree structure.
are carried out in a 32-nm CMOS technology. Table IV
summarizes the worst-case delay, power dissipation, normal-
A. Accuracy ized energy per operation and area of the fixed-width Booth
The accuracy analysis of fixed-width Booth multiplier is multipliers. The average power dissipations are simulated by
performed for operand lengths of 8, 10, 12, 14 and 16 bits. The Synopsys Power Compiler with back annotated switching
results are shown in Table III, which lists the mean error (ME), activity files generated from one million random input vectors
240 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 65, NO. 2, FEBRUARY 2018

TABLE IV
C IRCUIT P ERFORMANCE W ITH D IFFERENT F IXED -W IDTH M ULTIPLIER

Fig. 6. EDEmse with different fixed-width Booth multiplier.

one with the highest accuracy. It is more efficient for those


multipliers with the operand lengths that are not multiples of
4. It can reach maximum 14.8% reduction compared with all
its contenders among various operand lengths.

V. C ONCLUSION
In this brief, a Booth encoded sign-digit-based conditional
probability method for a fixed-width multiplier design is
presented. A mux-based estimation circuit is proposed to sim-
plify the compensation logics, and shorten the critical paths.
The proposed approach can be easily adapted for an accuracy-
adaptive design by putting variable columns of TPmajor into
at a 200MHz data rate with 50% switching activity to each considerations. Simulation results show that our proposed mul-
design. Both area and energy per operation for each operand tiplier is the most energy efficient design with the highest
length are normalized so that DTM with the least area and accuracy. It can reach maximum 14.8% reduction on the
energy consumption is one. energy-delay-error product compared with all its contenders
As indicated in Table IV, the proposed BSCP multiplier among various operand lengths.
has the minimum delay time except DTM. This is because
R EFERENCES
the compensation function of this brief uses a mux-based
estimation circuits instead of modifying the partial products [1] N. Petra, D. De Caro, V. Garofalo, E. Napoli, and A. G. M. Strollo,
before compression as in [8] and [11]. This results in shorter “Design of fixed-width multipliers with linear compensation function,”
IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 5, pp. 947–960,
critical paths with a slight overhead on area. As for power May 2011.
dissipation, the proposed BSCP multiplier dissipates the least [2] C.-H. Chang and R. K. Satzoda, “A low error and high performance
for those bit widths which are not multiples of 4, although multiplexer-based truncated multiplier,” IEEE Trans. Very Large Scale
their corresponding area are not the smallest. That is due to Integr. (VLSI) Syst., vol. 18, no. 12, pp. 1767–1771, Dec. 2010.
our logic simplification on traditional merge sorting algorithm [3] S.-N. Tang, J.-W. Tsai, and T.-Y. Chang, “A 2.4-Gs/s FFT processor for
OFDM-based WPAN applications,” IEEE Trans. Circuits Syst. II, Exp.
discussed in Section III-B, which decreases the chances of Briefs, vol. 57, no. 6, pp. 451–455, Jun. 2010.
switching activity for the internal signals. When the operand [4] S. W. Yen et al., “A 5.79-Gb/s energy-efficient multirate LDPC codec
length increases, the power savings becomes more obvious chip for IEEE 802.15.3c applications,” IEEE J. Solid-State Circuits,
for all operand lengths. Benefit from the improved speed, vol. 47, no. 9, pp. 2246–2257, Sep. 2012.
and power dissipation, this brief also exhibits the least energy [5] H.-J. Ko and S.-F. Hsiao, “Design and application of faithfully rounded
and truncated multipliers with combined deletion, reduction, truncation,
consumption among all competitors. and rounding,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 5,
pp. 304–308, May 2011.
C. Energy-Delay-Accuracy Analysis [6] K.-J. Cho, K.-C. Lee, J.-G. Chung, and K. K. Parhi, “Design of low-
For benchmarking the energy efficiency of a circuit, the error fixed-width modified Booth multiplier,” IEEE Trans. Very Large
energy-delay product (EDP) is used to be a better metric Scale Integr. (VLSI) Syst., vol. 12, no. 5, pp. 522–531, May 2004.
[7] H.-A. Huang, Y.-C. Liao, and H.-C. Chang, “A self-compensation fixed-
than energy per operation. For lossy applications, to maintain width Booth multiplier and its 128-point FFT applications,” in Proc.
a required accuracy, error performance should be taken into IEEE Int. Symp. Circuits Syst., May 2006, pp. 3538–3541.
account as well as the energy efficiency. Thus, a new design [8] J.-P. Wang, S.-R. Kuang, and S.-C. Liang, “High-accuracy fixed-width
metric energy-delay-error-product (EDEmse ) is adopted to bet- modified Booth multipliers for lossy applications,” IEEE Trans. Very
ter evaluate the design efficiency of the proposed approach, Large Scale Integr. (VLSI) Syst., vol. 19, no. 1, pp. 52–60, Jan. 2011.
[9] C.-Y. Li, Y.-H. Chen, T.-Y. Chang, and J.-N. Chen, “A probabilistic
which is given by estimation bias circuit for fixed-width Booth multiplier and its DCT
applications,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 4,
EDEmse = Energy(E) × Delay(D) pp. 215–219, Apr. 2011.
× MSE(Emse ). (25) [10] Y.-H. Chen, “An accuracy-adjustment fixed-width Booth multiplier
based on multilevel conditional probability,” IEEE Trans. Very Large
As there is no compensation algorithm involved in either Scale Integr. (VLSI) Syst., vol. 23, no. 1, pp. 203–207, Jan. 2015.
DTM or PTM, these two multipliers are excluded from this [11] W.-Q. He, Y.-H. Chen, and S.-J. Jou, “High-accuracy fixed-width Booth
comparison. The results are illustrated in Fig. 6. multipliers based on probability and simulation,” IEEE Trans. Circuits
Syst. I, Reg. Papers, vol. 62, no. 8, pp. 2052–2061, Aug. 2015.
As expected, the proposed design exhibits the least EDEmse [12] G. W. Bewick, “Fast multiplication: Algorithm and implementation,”
for all operand lengths. That means BSCP approach in the Ph.D. dissertation, Dept. Elect. Eng., Stanford Univ., Stanford, CA,
design of a fixed-width multiplier is the most energy efficient USA, 1994.

You might also like