A Low-Power, High-Performance Approximate Multiplier With Configurable Partial Error Recovery

Download as pdf or txt
Download as pdf or txt
You are on page 1of 4

A Low-Power, High-Performance Approximate

Multiplier with Configurable Partial Error Recovery

Cong Liu Jie Han Fabrizio Lombardi


Department of Electrical and Department of Electrical and Department of Electrical and
Computer Engineering Computer Engineering Computer Engineering
University of Alberta University of Alberta Northeastern University
Edmonton, Alberta, Canada T6G 2V4 Edmonton, Alberta, Canada T6G 2V4 Boston, MA 02115
Email: [email protected] Email: [email protected] Email: [email protected]

Abstract—Approximate circuits have been considered for for the final stage addition in a multiplier. The error tolerant
error-tolerant applications that can tolerate some loss of accuracy multiplier (ETM) of [7] is based on the truncation of a
with improved performance and energy efficiency. Multipliers are multiplier into an accurate multiplication part for MSBs and a
key arithmetic circuits in many such applications such as digital non-multiplication part for LSBs.
signal processing (DSP). In this paper, a novel approximate multi-
plier with a lower power consumption and a shorter critical path In this paper, a novel approximate multiplier design is
than traditional multipliers is proposed for high-performance proposed using a simple, yet fast approximate adder. This
DSP applications. This multiplier leverages a newly-designed newly designed adder can process data in parallel by cutting
approximate adder that limits its carry propagation to the the carry propagation chain (and thus, introducing an error). It
nearest neighbors for fast partial product accumulation. Different has a critical path delay that is even shorter than a conventional
levels of accuracy can be achieved through a configurable error
recovery by using different numbers of most significant bits
one-bit full adder. Albeit having a high error rate, this adder
(MSBs) for error reduction. The approximate multiplier has simultaneously computes the sum and generates an error
a low mean error distance, i.e., most of the errors are not signal; this feature is employed to reduce the error in the
significant in magnitude. Compared to the Wallace multiplier, final result of the multiplier. In the proposed approximate
a 16-bit approximate multiplier implemented in a 28nm CMOS multiplier, a simple tree of the approximate adders is used for
process shows a reduction in delay and power of 20% and up partial product accumulation and the error signals are used to
to 69%, respectively. It is shown that by utilizing an appropriate compensate the error for obtaining a better accuracy. Compared
error recovery, the proposed approximate multiplier achieves to the traditional (exact) Wallace and Dadda trees, the proposed
similar processing accuracy as traditional exact multipliers but multiplier has a significantly shorter critical path as well as a
with significant improvements in power and performance. reduced circuit complexity.
I. I NTRODUCTION
II. P ROPOSED A PPROXIMATE A DDER
Approximate computing has emerged as a potential so-
In this section, the design of a new approximate adder is
lution for the design of energy-efficient digital systems [1].
presented. This adder operates on a set of pre-processed inputs.
Applications such as multimedia, recognition and data mining
The input pre-processing (IPP) is based on the interchange-
are inherently error-tolerant and do not require a perfect
ability of bits with the same weights in different addends.
accuracy in computation. For these applications, approximate
For example, consider two sets of inputs to a 4-bit adder:
circuits may play an important role as a promising alternative
i) A = 1010, B = 0101 and ii) A = 1111, B = 0000.
for reducing area, power and delay in digital systems that
Clearly, the additions of i) and ii) produce the same result.
can tolerate some loss of accuracy, thereby achieving better
In this process, the two input bits Ai Bi = 01 are equivalent
performance in energy efficiency.
to Ai Bi = 10 (with i being the bit index), because of
As one of the key components in arithmetic circuits, adders the interchangeability of the corresponding bits in the two
have been extensively studied for approximate implementation operands.
(see [1] for a review). New methodologies to model, analyze
and evaluate the approximate adders have been discussed in The basic rule for the IPP is to switch Ai and Bi if Ai = 0
[2]–[4]. However, there has been relatively less effort in the and Bi = 1 (for any i), while keeping the other combinations
design of approximate multipliers. A multiplier usually con- (i.e., Ai Bi = 00, 10 and 11) unchanged. By doing so, more
sists of three stages: partial product generation, partial product 1’s are expected in A and more 0’s are expected in B. If Ȧi Ḃi
accumulation and a carry propagation adder (CPA) at the are the ith bits in the pre-processed inputs, the IPP functions
final stage. [5] considers using approximate adders to generate are given by:
the radix-8 Booth encoding 3x with error reduction. In [6], Ȧi = Ai + Bi , (1)
approximate partial products are computed using inaccurate Ḃi = Ai Bi . (2)
2 × 2 multiplier blocks, while accurate adders are used in
an adder tree to accumulate the approximate partial products. (1) and (2) compute the propagate and generate signals used
[2] briefly discusses the use of approximate speculative adders in a parallel adder such as the carry look-ahead (CLA). The

978-3-9815370-2-4/DATE14/ 2014
c EDAA
TABLE I. T RUTH TABLE OF AN APPROXIMATE ADDER CELL .
Ḃi Ḃi−1 00 01 10 11
Ȧi Ȧi Ȧi 1 1
Ci−1 /Ḃi−1 0 1 0 1
Si Ȧi 1 0 1
Ei 0 Ȧi 0 0

proposed adder can process data in parallel by cutting the carry


propagation chain. A carry propagation chain starts at the ith
bit when Ḃi = 1, Ȧi+1 = 1, Ḃi+1 = 0. In an accurate adder,
Si+1 is 0 and the carry propagates to the higher bit. However,
in the proposed approximate adder, Si+1 is set to be 1 and an Fig. 1. An approximate multiplier with OR-gate based partial error recovery
error signal is generated as Ei+1 = 1. This prevents the carry using 4 MSBs of the error vector.
signal from propagating to higher bits. By doing so, a carry
signal is produced only by the generate signal, i.e. Ci = 1
only when Ḃi = 1, and it only propagates to the next higher this problem by utilizing the error signal. The resulting design
bit, i.e. the (i+1)th position. Table I shows the truth table of has a critical path delay that is shorter than a conventional
one-bit full adder, because the new n-bit adder can process
the approximate adder, where Ȧi , Ḃi and Ḃi−1 are the inputs
data in parallel.
after IPP, Ci−1 is the carry signal, Si and Ei are the sum and
error bits, respectively. The error signal is utilized for error
compensation purposes as discussed in a later section. In this B. Error Reduction
case, the approximate adder is similar to a redundant number As (7) is applicable to the sum of every single approximate
system [8] and the logical functions of Table I are given by adder in the tree, an error reduction circuit is applied to
the final multiplication result rather than to the output of
Si = Ḃi−1 + Ḃi Ȧi , (3) each adder. Two steps are required to reduce errors: i) error
accumulation and ii) error recovery by the addition of the
Ei = Ḃi Ḃi−1 Ȧi . (4)
accumulated errors to the adder tree output using an accurate
Replacing Ȧ, Ḃ using (1) and (2), the logic functions with adder (Fig. 1).
respect to the original inputs are given by 1) Error Accumulation: The error signals can be summed
Si = (Ai ⊕ Bi ) + Ai−1 Bi−1 , (5) up using accurate adders and thus, the accumulated error can
fully compensate the inaccurate product; however to reduce
Ei = (Ai ⊕ Bi )Ai−1 Bi−1 . (6) complexity, an approximate error accumulation is introduced.
Consider the observation that the error vector of each approx-
Consider as an example a 6-bit adder with two inputs given
imate adder tends to have more 0’s than 1’s. Therefore, the
by A = 001111 and B = 000110. The correct (exact) sum S
probability that the error vectors have an error bit ’1’ at the
is0 010101; however, the approximate adder produces the sum
same position, is quite small. Hence, an OR gate is used to
S = 001101 and an error E = 001000. It is easy to show approximately compute the sum of the errors for a single bit.
that 0 If m error vectors (denoted by E1, E2, ..., Em) have to be
S = S + E. (7) accumulated, the sum of these vectors is obtained as
Note that in (7) ’+’ means the addition of two binary numbers Ei = E1i OR E2i OR ... OR Emi . (8)
rather than the ’OR’ function. The error E is always non-
negative and the approximate sum is always equal to or smaller
than the accurate sum. This is an important feature of this 2) Error Recovery: To reduce the error, an accumulated er-
adder, because an additional adder can be used to add the ror vector is added to the adder tree output using a conventional
error to the approximate sum as a compensation step. adder (e.g. a carry look-ahead adder). However, only several
(e.g. k) MSBs of the error signals are used to compensate
the outputs for further reducing the overall complexity. The
III. P ROPOSED A PPROXIMATE M ULTIPLIER number of MSBs is selected according to the extent that errors
In the proposed approximate multiplier, an adder tree is must be compensated. For example in an 8×8 adder tree, there
utilized for partial product accumulation; the error signals in are a total of 7 error vectors, generated by the 7 approximate
the tree are then used to compensate the error in the output to adders in the tree. However, not all the bits in the 7 vectors
obtain a product with a better accuracy. need to be added, because the MSBs of some vectors are less
significant than the least significant bits of the k MSBs. In the
A. Partial Product Accumulation example of Fig. 1, 4 MSBs (i.e. the 11-14th bits) are considered
for error recovery and as a result, 4 error vectors are considered
A significant feature of the proposed approximate multi- (i.e. the error vectors of adders A3, A4, A6 and A7). Note that
plier is the simplicity to use approximate adders in the partial the error vectors of the other three adders are less significant
product accumulation. [6] has shown that this may lead to poor than the 11th bit, so they are not considered. The accumulated
performance, because errors may accumulate and it is difficult error E is obtained using (8); then, the final result is found by
to correct errors using existing approximate adders. However, adding E to S using a fast accurate adder. The adder tree and
the use of the newly proposed approximate adder overcomes the error reduction scheme are shown in Fig. 1.
0.06 0.2 1
OR Gate Error Accumulation OR Gate Error Accumulation OR Gate Error Accumulation
Exact Error Accumulation 0.18 Exact Error Accumulation 0.9 Exact Error Accumulation
0.05
0.16 0.8

0.14 0.7
0.04
0.12 0.6
NMED

MRED

ER
0.03 0.1 0.5

0.08 0.4
0.02
0.06 0.3

0.04 0.2
0.01
0.02 0.1

0 0 0
0 5 10 15 0 5 10 15 0 5 10 15
Number of Bits Used for Error Reduction Number of Bits Used for Error Reduction Number of Bits Used for Error Reduction

(a) (b) (c)

Fig. 2. Accuracy comparison of the approximate multiplier using OR-gate and exact error accumulation: (a) NMED (b) MRED (c) ER vs. different number
of bits for error reduction.

TABLE II. A RITHMETIC ACCURACY COMPARISON BETWEEN THREE


IV. ACCURACY E VALUATION APPROXIMATE MULTIPLIERS .

In [4], the error distance (ED) and mean error distance Proposed multiplier ETM [7] 2 × 2 approximate multiplier [6]
NMED (%) 0.20 2.85 1.39
(MED) are proposed to evaluate the performance of approxi- MRED (%) 0.62 25.21 3.25
mate arithmetic circuits. For multipliers, ED is defined to be ER (%) 31.59 98.88 46.73
the arithmetic difference between0 the accurate product 0
(M )
and the approximate product (M ), i.e., ED = |M − M |.
MED is the average of EDs for a set of outputs (obtained
by applying a set of inputs). A metric for comparing mul-
tipliers of different sizes is the normalized MED (NMED):
M ED
N M ED = M max
, where Mmax is the maximum magnitude
of the output of an (accurate) multiplier, i.e. (2n − 1)2 for an
(a) (b)
n × n multiplier. 0The relative error distance (RED) is defined
as: RED = |M M−M | = ED M . Similarly, the mean relative
Fig. 3. (a) An exact full adder and (b) the approximate adder cell.
error distance (MRED) can be obtained. The error rate (ER)
is defined as the percentage of erroneous outputs among all
outputs. These three metrics (NMED, MRED and ER) are The proposed multiplier is compared with two other ap-
used to evaluate the proposed multiplier. A functional model proximate multipliers: the ETM in [7] and the 2 × 2 ap-
of the proposed multiplier is implemented using Matlab and an proximate multiplier in [6]. In this comparison, the ETM
exhaustive simulation is performed for an 8 × 8 approximate is divided equally into multiplication and non-multiplication
multiplier. sections, while the proposed multiplier uses 10 MSBs for error
reduction. As shown in the results in Table II, the proposed
Both the OR gate error accumulation and the exact error multiplier has the lowest NMED, MRED and ER among the
accumulation are considered for the proposed multiplier; Fig. three approximate multipliers. In particular, it has very low
2 shows the three metrics (NMED, MRED and ER) for using NMED and MRED compared to the other two designs.
different numbers of MSBs for error reduction. Let m denote
the number of MSBs used for error reduction. It can be V. D ELAY AND P OWER E VALUATION
seen that the NMED and MRED drop drastically as m is
increased from 0 to 6 and continue to drop as m increases, A. Delay Estimation
even though at a slower rate. The ER also decreases as m is Based on the linear model of [9], the delays of a full
increased. For the approximate multiplier, there is no error in adder (Fig. 3(a)) and the approximate adder cell (Fig. 3(b)) are
the most significant bit of the output, so the largest number derived to be approximately 3τg and 2τg , respectively, where
of MSBs used is 15. It is also shown that the OR gate error τg is an approximate “gate delay”. For an n-bit approximate
accumulation produces a good approximation to the exact error multiplier, there are dlog2 ne layers in the adder tree. Taking
accumulation. Therefore, m=6 or m=7 may be appropriate for into consideration the delay of the error accumulation using
a good trade-off in terms of the NMED and MRED. For m=7, OR gates, the delay of the proposed multiplier is given by
the NMED is below 0.3% and the MRED is approximately
1.8%. However, the error rate is reduced significantly as m DAp = (2 dlog2 ne + 1)τg . (9)
increases; it decreases to 20% when m=12 for OR gate error
accumulation. These three figures indicate that the proposed There are blog1.5 nc layers in the Wallace or Dadda tree and
approximate multiplier has a rather high error rate, but the their delays are given by [10]
error is usually very small compared to both the accurate and DW,D = 3 blog1.5 nc τg . (10)
the largest possible output of the approximate multiplier. For
example, for m=7, the error rate can be as high as 62%, but the Table III shows the delay of the partial product accumulation
MRED is only 1.8%, i.e., most of the errors are not significant. tree in both the proposed and Wallace/Dadda multipliers. For a
TABLE III. D ELAY OF PARTIAL PRODUCT ACCUMULATION TREE OF
THE PROPOSED AND CONVENTIONAL MULTIPLIERS OF DIFFERENT SIZES .

n 8 16 32 64 2k
DAp (τg ) 7 9 11 13 2k + 1
DW,D (τg ) 12 18 24 30 ≈ 5k

TABLE IV. P OWER CONSUMPTIONS OF FPGA IMPLEMENTATIONS OF


THE 16- BIT APPROXIMATE AND WALLACE MULTIPLIERS .
(a) (b)
Dynamic Quiescent Total
Wallace (W) 0.122 0.083 0.205
Approximate (W) 0.068 0.082 0.150 Fig. 5. Power vs. frequency for (a) 8-bit and (b) 16-bit approximate and
Wallace multipliers.

16-bit multiplier, the delay of an exact multiplier tree is twice Wallace multipliers of the same size have been implemented
as large as the delay of the proposed multiplier tree; as the size in STM 28nm CMOS process. The approximate adder cell
of the multiplier increases, this factor is approximately 2.5. in Fig. 3(b) is implemented using shared logic between two
Since the approximate adder cell is simpler than a full adder, neighboring approximate adder cells, as shown in Fig. 4,
the approximate multiplier has no additional area overhead to thereby saving additional area. In Fig. 4, the signal Ci is given
achieve the shorter delay. For the 2 × 2 approximate multiplier by Ci = Ai Bi and shared between two cells. The critical path
in [6], only the partial product generation layer is simplified delays of 16 × 16 approximate and Wallace multipliers are
and the height of the partial product tree is only decreased 0.48ns and 0.6ns, respectively, resulting in a delay reduction
by 1, so the delay reduction is quite limited. The ETM in [7] of 20%. The power consumption for image multiplication is
can reduce the n × n partial product tree to n2 × n2 . By (10), obtained by applying three frequencies (0.1 GHz, 0.25 GHz
the difference between the delays of n × n and n2 × n2 trees is and 1GHz) to all these multiplier circuits. As shown in Fig. 5,
approximately 3log1.5 2τg ≈ 5.13τg . In summary, the other two the 8 × 8 and 16 × 16 approximate multipliers achieve power
multipliers reduce the critical path delay by a limited value. In savings in the ranges of 37%-53% and 48%-69%, respectively,
contrast, the proposed multiplier can reduce the delay of the compared to the accurate Wallace multipliers.
partial product accumulation tree by nearly 60%, which scales
with the size of the multiplier. VI. C ONCLUSION

B. Experimental Results In this paper, a novel approximate multiplier design is


proposed using a newly designed approximate adder. On a
1) FPGA Implementation: 16 × 16 approximate and Wal- statistical basis the proposed multiplier has a very small error
lace multipliers are implemented in VHDL using the Xilinx distance and thus a high accuracy. Simulations have shown
Spantan3E XC3S500E FPGA. The critical path delays of the that the proposed design has a shorter critical path delay and a
proposed approximate multiplier and the exact Wallace multi- significantly lower power consumption compared to an exact
plier are 13.990ns and 21.999ns, respectively, thus achieving Wallace multiplier. It also uses a configurable error recovery
a reduction of 36.4%. The input data for simulating power that can produce more accurate results than other state-of-the-
consumption are given by the multiplication of two images. art approximate multipliers.
The node activity rates are extracted by performing post-
place and route simulation running at the maximum frequency R EFERENCES
of the Wallace multiplier. Based on the activity rates, the
[1] J. Han and M. Orshansky, “Approximate Computing: An Emerging
Xilinx XPower Analyzer is used to obtain the power con- Paradigm For Energy-Efficient Design,” in IEEE ETS, 2013.
sumption, as shown in Table IV. The quiescent power of the [2] J. Huang, J. Lach, and G. Robins, “A methodology for energy-quality
approximate multiplier is slightly smaller than the Wallace tradeoff using imprecise hardware,” in DAC 2012, pp. 504–509.
multiplier, however the approximate multiplier saves 44.3% [3] J. Miao, K. He, A. Gerstlauer, and M. Orshansky, “Modeling and
of the dynamic power compared to the Wallace multiplier. synthesis of quality-energy optimal approximate adders,” in ICCAD
Overall, the proposed multiplier achieves a reduction of 26.8% 2012, pp. 728–735.
in total power consumption. [4] J. Liang, J. Han, and F. Lombardi, “New metrics for the reliability of
approximate and probabilistic adders,” IEEE Transactions on Comput-
2) ASIC Implementation: ASIC designs for n × n (n = ers, vol. 62, no. 9, pp. 1760–1771, 2013.
8, 16) approximate multipliers with n-bit error reduction and [5] S.-L. Lu, “Speeding up processing with approximation circuits,” Com-
puter, vol. 37, no. 3, pp. 67–73, 2004.
[6] P. Kulkarni, P. Gupta, and M. Ercegovac, “Trading accuracy for power
with an underdesigned multiplier architecture,” in 24th IEEE Intl. Conf.
on VLSI Design, 2011, pp. 346–351.
[7] K. Y. Kyaw, W. L. Goh, and K. S. Yeo, “Low-power high-speed
multiplier for error-tolerant application,” in IEEE Intl. Conf. Electron
Devices and Solid-State Circuits (EDSSC), 2010, pp. 1–4.
[8] B. Parhami, Computer arithmetic. Oxford university press, 2000.
[9] N. H. Weste and H. David, CMOS VLSI Design-A Circuit and Systems
Perspective, 3rd ed. Pearson Addison Wesley, 2005.
[10] K. Bickerstaff, E. Swartzlander, and M. Schulte, “Analysis of column
compression multipliers,” in 15th IEEE Symp. on Computer Arithmetic,
Fig. 4. Two neighboring approximate adder cells for ASIC implementation. 2001, pp. 33–39.

You might also like