A Low-Power, High-Performance Approximate Multiplier With Configurable Partial Error Recovery
A Low-Power, High-Performance Approximate Multiplier With Configurable Partial Error Recovery
A Low-Power, High-Performance Approximate Multiplier With Configurable Partial Error Recovery
Abstract—Approximate circuits have been considered for for the final stage addition in a multiplier. The error tolerant
error-tolerant applications that can tolerate some loss of accuracy multiplier (ETM) of [7] is based on the truncation of a
with improved performance and energy efficiency. Multipliers are multiplier into an accurate multiplication part for MSBs and a
key arithmetic circuits in many such applications such as digital non-multiplication part for LSBs.
signal processing (DSP). In this paper, a novel approximate multi-
plier with a lower power consumption and a shorter critical path In this paper, a novel approximate multiplier design is
than traditional multipliers is proposed for high-performance proposed using a simple, yet fast approximate adder. This
DSP applications. This multiplier leverages a newly-designed newly designed adder can process data in parallel by cutting
approximate adder that limits its carry propagation to the the carry propagation chain (and thus, introducing an error). It
nearest neighbors for fast partial product accumulation. Different has a critical path delay that is even shorter than a conventional
levels of accuracy can be achieved through a configurable error
recovery by using different numbers of most significant bits
one-bit full adder. Albeit having a high error rate, this adder
(MSBs) for error reduction. The approximate multiplier has simultaneously computes the sum and generates an error
a low mean error distance, i.e., most of the errors are not signal; this feature is employed to reduce the error in the
significant in magnitude. Compared to the Wallace multiplier, final result of the multiplier. In the proposed approximate
a 16-bit approximate multiplier implemented in a 28nm CMOS multiplier, a simple tree of the approximate adders is used for
process shows a reduction in delay and power of 20% and up partial product accumulation and the error signals are used to
to 69%, respectively. It is shown that by utilizing an appropriate compensate the error for obtaining a better accuracy. Compared
error recovery, the proposed approximate multiplier achieves to the traditional (exact) Wallace and Dadda trees, the proposed
similar processing accuracy as traditional exact multipliers but multiplier has a significantly shorter critical path as well as a
with significant improvements in power and performance. reduced circuit complexity.
I. I NTRODUCTION
II. P ROPOSED A PPROXIMATE A DDER
Approximate computing has emerged as a potential so-
In this section, the design of a new approximate adder is
lution for the design of energy-efficient digital systems [1].
presented. This adder operates on a set of pre-processed inputs.
Applications such as multimedia, recognition and data mining
The input pre-processing (IPP) is based on the interchange-
are inherently error-tolerant and do not require a perfect
ability of bits with the same weights in different addends.
accuracy in computation. For these applications, approximate
For example, consider two sets of inputs to a 4-bit adder:
circuits may play an important role as a promising alternative
i) A = 1010, B = 0101 and ii) A = 1111, B = 0000.
for reducing area, power and delay in digital systems that
Clearly, the additions of i) and ii) produce the same result.
can tolerate some loss of accuracy, thereby achieving better
In this process, the two input bits Ai Bi = 01 are equivalent
performance in energy efficiency.
to Ai Bi = 10 (with i being the bit index), because of
As one of the key components in arithmetic circuits, adders the interchangeability of the corresponding bits in the two
have been extensively studied for approximate implementation operands.
(see [1] for a review). New methodologies to model, analyze
and evaluate the approximate adders have been discussed in The basic rule for the IPP is to switch Ai and Bi if Ai = 0
[2]–[4]. However, there has been relatively less effort in the and Bi = 1 (for any i), while keeping the other combinations
design of approximate multipliers. A multiplier usually con- (i.e., Ai Bi = 00, 10 and 11) unchanged. By doing so, more
sists of three stages: partial product generation, partial product 1’s are expected in A and more 0’s are expected in B. If Ȧi Ḃi
accumulation and a carry propagation adder (CPA) at the are the ith bits in the pre-processed inputs, the IPP functions
final stage. [5] considers using approximate adders to generate are given by:
the radix-8 Booth encoding 3x with error reduction. In [6], Ȧi = Ai + Bi , (1)
approximate partial products are computed using inaccurate Ḃi = Ai Bi . (2)
2 × 2 multiplier blocks, while accurate adders are used in
an adder tree to accumulate the approximate partial products. (1) and (2) compute the propagate and generate signals used
[2] briefly discusses the use of approximate speculative adders in a parallel adder such as the carry look-ahead (CLA). The
978-3-9815370-2-4/DATE14/
2014
c EDAA
TABLE I. T RUTH TABLE OF AN APPROXIMATE ADDER CELL .
Ḃi Ḃi−1 00 01 10 11
Ȧi Ȧi Ȧi 1 1
Ci−1 /Ḃi−1 0 1 0 1
Si Ȧi 1 0 1
Ei 0 Ȧi 0 0
0.14 0.7
0.04
0.12 0.6
NMED
MRED
ER
0.03 0.1 0.5
0.08 0.4
0.02
0.06 0.3
0.04 0.2
0.01
0.02 0.1
0 0 0
0 5 10 15 0 5 10 15 0 5 10 15
Number of Bits Used for Error Reduction Number of Bits Used for Error Reduction Number of Bits Used for Error Reduction
Fig. 2. Accuracy comparison of the approximate multiplier using OR-gate and exact error accumulation: (a) NMED (b) MRED (c) ER vs. different number
of bits for error reduction.
In [4], the error distance (ED) and mean error distance Proposed multiplier ETM [7] 2 × 2 approximate multiplier [6]
NMED (%) 0.20 2.85 1.39
(MED) are proposed to evaluate the performance of approxi- MRED (%) 0.62 25.21 3.25
mate arithmetic circuits. For multipliers, ED is defined to be ER (%) 31.59 98.88 46.73
the arithmetic difference between0 the accurate product 0
(M )
and the approximate product (M ), i.e., ED = |M − M |.
MED is the average of EDs for a set of outputs (obtained
by applying a set of inputs). A metric for comparing mul-
tipliers of different sizes is the normalized MED (NMED):
M ED
N M ED = M max
, where Mmax is the maximum magnitude
of the output of an (accurate) multiplier, i.e. (2n − 1)2 for an
(a) (b)
n × n multiplier. 0The relative error distance (RED) is defined
as: RED = |M M−M | = ED M . Similarly, the mean relative
Fig. 3. (a) An exact full adder and (b) the approximate adder cell.
error distance (MRED) can be obtained. The error rate (ER)
is defined as the percentage of erroneous outputs among all
outputs. These three metrics (NMED, MRED and ER) are The proposed multiplier is compared with two other ap-
used to evaluate the proposed multiplier. A functional model proximate multipliers: the ETM in [7] and the 2 × 2 ap-
of the proposed multiplier is implemented using Matlab and an proximate multiplier in [6]. In this comparison, the ETM
exhaustive simulation is performed for an 8 × 8 approximate is divided equally into multiplication and non-multiplication
multiplier. sections, while the proposed multiplier uses 10 MSBs for error
reduction. As shown in the results in Table II, the proposed
Both the OR gate error accumulation and the exact error multiplier has the lowest NMED, MRED and ER among the
accumulation are considered for the proposed multiplier; Fig. three approximate multipliers. In particular, it has very low
2 shows the three metrics (NMED, MRED and ER) for using NMED and MRED compared to the other two designs.
different numbers of MSBs for error reduction. Let m denote
the number of MSBs used for error reduction. It can be V. D ELAY AND P OWER E VALUATION
seen that the NMED and MRED drop drastically as m is
increased from 0 to 6 and continue to drop as m increases, A. Delay Estimation
even though at a slower rate. The ER also decreases as m is Based on the linear model of [9], the delays of a full
increased. For the approximate multiplier, there is no error in adder (Fig. 3(a)) and the approximate adder cell (Fig. 3(b)) are
the most significant bit of the output, so the largest number derived to be approximately 3τg and 2τg , respectively, where
of MSBs used is 15. It is also shown that the OR gate error τg is an approximate “gate delay”. For an n-bit approximate
accumulation produces a good approximation to the exact error multiplier, there are dlog2 ne layers in the adder tree. Taking
accumulation. Therefore, m=6 or m=7 may be appropriate for into consideration the delay of the error accumulation using
a good trade-off in terms of the NMED and MRED. For m=7, OR gates, the delay of the proposed multiplier is given by
the NMED is below 0.3% and the MRED is approximately
1.8%. However, the error rate is reduced significantly as m DAp = (2 dlog2 ne + 1)τg . (9)
increases; it decreases to 20% when m=12 for OR gate error
accumulation. These three figures indicate that the proposed There are blog1.5 nc layers in the Wallace or Dadda tree and
approximate multiplier has a rather high error rate, but the their delays are given by [10]
error is usually very small compared to both the accurate and DW,D = 3 blog1.5 nc τg . (10)
the largest possible output of the approximate multiplier. For
example, for m=7, the error rate can be as high as 62%, but the Table III shows the delay of the partial product accumulation
MRED is only 1.8%, i.e., most of the errors are not significant. tree in both the proposed and Wallace/Dadda multipliers. For a
TABLE III. D ELAY OF PARTIAL PRODUCT ACCUMULATION TREE OF
THE PROPOSED AND CONVENTIONAL MULTIPLIERS OF DIFFERENT SIZES .
n 8 16 32 64 2k
DAp (τg ) 7 9 11 13 2k + 1
DW,D (τg ) 12 18 24 30 ≈ 5k
16-bit multiplier, the delay of an exact multiplier tree is twice Wallace multipliers of the same size have been implemented
as large as the delay of the proposed multiplier tree; as the size in STM 28nm CMOS process. The approximate adder cell
of the multiplier increases, this factor is approximately 2.5. in Fig. 3(b) is implemented using shared logic between two
Since the approximate adder cell is simpler than a full adder, neighboring approximate adder cells, as shown in Fig. 4,
the approximate multiplier has no additional area overhead to thereby saving additional area. In Fig. 4, the signal Ci is given
achieve the shorter delay. For the 2 × 2 approximate multiplier by Ci = Ai Bi and shared between two cells. The critical path
in [6], only the partial product generation layer is simplified delays of 16 × 16 approximate and Wallace multipliers are
and the height of the partial product tree is only decreased 0.48ns and 0.6ns, respectively, resulting in a delay reduction
by 1, so the delay reduction is quite limited. The ETM in [7] of 20%. The power consumption for image multiplication is
can reduce the n × n partial product tree to n2 × n2 . By (10), obtained by applying three frequencies (0.1 GHz, 0.25 GHz
the difference between the delays of n × n and n2 × n2 trees is and 1GHz) to all these multiplier circuits. As shown in Fig. 5,
approximately 3log1.5 2τg ≈ 5.13τg . In summary, the other two the 8 × 8 and 16 × 16 approximate multipliers achieve power
multipliers reduce the critical path delay by a limited value. In savings in the ranges of 37%-53% and 48%-69%, respectively,
contrast, the proposed multiplier can reduce the delay of the compared to the accurate Wallace multipliers.
partial product accumulation tree by nearly 60%, which scales
with the size of the multiplier. VI. C ONCLUSION