Project Base Paper
Abstract—Multiplication is a key fundamental function for many error-tolerant applications. Approximate multiplication is considered to be an efficient technique for trading off energy against performance and accuracy. This paper proposes an accuracy-controllable multiplier whose final product is generated by a carry-maskable adder. The proposed scheme can dynamically select the length of the carry propagation to satisfy the accuracy requirements flexibly. The partial product tree of the multiplier is approximated by the proposed tree compressor. An 8 × 8 multiplier design is implemented by employing the carry-maskable adder and the compressor. Compared with a conventional Wallace tree multiplier, the proposed multiplier reduced power consumption by between 47.3% and 56.2% and critical path delay by between 29.9% and 60.5%, depending on the required accuracy. Its silicon area was also 44.6% smaller. In addition, results from an image processing application demonstrate that the quality of the processed images can be controlled by the proposed multiplier design.

I. INTRODUCTION

Many increasingly popular applications, such as image processing and recognition, are inherently tolerant of small inaccuracies. These applications are computationally demanding and multiplication is their fundamental arithmetic function, which creates an opportunity to trade off computational accuracy for reduced power consumption. Approximate computing is an efficient approach for error-tolerant applications because it can trade off accuracy for power, and it currently plays an important role in such application domains [1].

Different error-tolerant applications have different accuracy requirements, as do different program phases in an application. If multiplication accuracy is fixed, power will be wasted when high accuracy is not required. This means that approximate multipliers should be dynamically reconfigurable to match the different accuracy requirements of different program phases and applications.

This paper focuses on an approximate multiplier design that can control accuracy dynamically. A carry-maskable adder (CMA) is proposed that can be dynamically configured to function as a conventional carry propagation adder (CPA), a set of bit-parallel OR gates, or a combination of the two. This configurability is realized by masking carry propagation: the CPA in the last stage of the multiplier is replaced by the proposed CMA. An approximate tree compressor is utilized to reduce the accumulation layer depth of the partial product tree. Our approach introduces a term representing the power and accuracy requirements which simplifies the partial product reduction (PPR) component as needed. An approximate multiplier is designed using the proposed adder and compressor. This multiplier, together with a conventional multiplier and the previously studied approximate multipliers, was implemented in Verilog HDL using a 45-nm library to evaluate the power consumption, critical path delay, and design area. Compared with the conventional Wallace tree multiplier, the proposed approximate multiplier reduced power consumption by between 47.3% and 56.2% and the critical path delay by between 29.9% and 60.5%, depending on the required computational accuracy. In addition, its design area was 44.6% smaller. Comparisons with the established approximate multipliers, none of which have any dynamic reconfigurability, demonstrate that the proposed multiplier provided the best trade-off of power and delay against accuracy. All the multiplier designs are then evaluated in a real image processing application.

The remainder of this paper is organized as follows. Section II reviews previous works. Section III introduces the accuracy-controllable approximate multiplier after explaining the tree compressor and the CMA. Section IV evaluates the multipliers experimentally and then evaluates the proposed approximate multiplier using an image processing application. Section V presents our conclusions.

II. PREVIOUS WORK

The adder is a basic element of most multipliers. Mahdiani et al. [2] proposed the lower-part-OR adder, which utilizes OR gates for addition of the lower bits and precise adders for addition of the upper bits. It is similar to our proposed CMA in that it uses OR gates to generate the sum approximately, but our CMA is also dynamically reconfigurable.

Liu et al. [3] utilized an approximate adder to reduce carry propagation delay in partial product accumulation. They also proposed a recovery vector to improve accuracy. The bit width of the error recovery vector can be selected by the designer to satisfy accuracy requirements. Hashemi et al. [4] proposed a technique that reduces the size of the multiplier by detecting the leading one bit of the input operands and selecting the following k bits as abridged operands for both inputs, where k is a designer-defined value that specifies the bandwidth used in the core accurate multiplier. Both [3] and [4] allow a static trade-off between power consumption and accuracy. The bit lengths
7A-1
Fig. 3. Structure of an approximate tree compressor with eight inputs.
Fig. 4. (a) Carry-maskable half adder, (b) Carry-maskable full adder.
B. Carry-maskable Adder
A CMA is proposed to control the accuracy flexibly and dynamically. A k-bit CMA comprises (k - 1) carry-maskable full adders and one carry-maskable half adder, and its structure is similar to that of a k-bit CPA.

The structures of the proposed carry-maskable half and full adders are shown in Fig. 4. In the proposed half adder, when mask_x is 0, S is equal to x OR y and Cout is equal to 0. Otherwise, when mask_x is 1, S is equal to x XOR y and Cout is equal to x AND y. In other words, the operation of the proposed half adder can be controlled by the active-low signal mask_x. When mask_x is disabled (=1), it functions as an accurate half adder, and when mask_x is enabled (=0), Cout is masked to 0 and it functions as an OR gate with output S. The operation of the proposed full adder is similar to the half adder: when mask_x is disabled (=1), it functions as an accurate full adder, and when mask_x is enabled (=0), Cout is equal to Cin and S is the output of an OR gate.
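
The behavior of the carry-maskable half and full adders can be illustrated with a small bit-level software model. This is only a sketch in Python, not the Verilog used in the paper: the function names and the LSB-first `masks` list are hypothetical, and the masked full adder is assumed to OR its two operand bits x and y (the text states only that S is the output of an OR gate).

```python
def cm_half_adder(x, y, mask_x):
    """Carry-maskable half adder; mask_x is active-low (1 = accurate)."""
    if mask_x:
        return x ^ y, x & y      # accurate half adder: (S, Cout)
    return x | y, 0              # masked: S = x OR y, Cout forced to 0


def cm_full_adder(x, y, cin, mask_x):
    """Carry-maskable full adder; mask_x is active-low (1 = accurate)."""
    if mask_x:
        s = x ^ y ^ cin
        cout = (x & y) | (cin & (x ^ y))
        return s, cout           # accurate full adder
    # Assumption: the OR gate combines the operand bits x and y.
    return x | y, cin            # masked: Cout simply passes Cin through


def cma_add(a, b, masks):
    """k-bit CMA: one half adder at bit 0 and (k - 1) full adders above it.

    masks[i] is the mask_x signal of bit i (LSB first).
    Returns (k-bit sum, carry out of the top bit).
    """
    k = len(masks)
    total, carry = cm_half_adder(a & 1, b & 1, masks[0])
    for i in range(1, k):
        s, carry = cm_full_adder((a >> i) & 1, (b >> i) & 1, carry, masks[i])
        total |= s << i
    return total, carry
```

With every mask set to 1 the model behaves as a k-bit CPA, with every mask set to 0 it reduces to k parallel OR gates, and a mixed pattern such as `[0, 0, 1, 1]` gives OR gates on the lower bits and an accurate adder on the upper bits.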
the CPA is 11 [7]. In our proposed multiplier, the length of the CPA is 13. In Stage 4, the CPA is divided into three parts in order to reduce the length of the carry propagation. Since the lower bits are not significant for accuracy, bits 0 to 4 are defined as the truncated part and three OR gates are used to generate the values for bits 2, 3, and 4 of the final result. Because there is no carry out from the truncated part, the length of the CPA is reduced to 10. Since the upper bits are the most significant for accuracy, bits 12 to 14 are defined as the accurate part, and three accurate adders are used to generate the values for these bits of the final result.

The accuracy-controllable part lies between the truncated and accurate parts. This part is important for both critical path delay and accuracy. In Stage 4, bits 5 to 11 in the CPA are replaced by a 7-bit CMA. Note that every 1-bit CMA has a mask_x signal. Given a value for u, the u upper bits in the accuracy-controllable part are configured as a u-bit CPA and the lower bits are configured as (7 - u) 2-input OR gates by managing the seven mask_x signals appropriately. When u = 7, it functions as a 7-bit CPA, and when u = 0, it functions as seven 2-input OR gates. For each bit of S that is generated by a 2-input OR gate, power consumption is reduced because the switching activity is reduced in some of the logic gates. Furthermore, the maximum delay of the CMA is reduced.

IV. EXPERIMENTAL RESULTS

A. Experimental Setup

In this section, the proposed multiplier is evaluated in terms of power consumption, critical path delay, design area, and computational accuracy. To clarify the ability of the approximate multiplier to save power, shorten critical path delay, and control the accuracy, a conventional Wallace tree multiplier and the previously-proposed approximate multipliers [3] [8] were implemented for comparison. The approximate multiplier [3] can be configured at design time, and its accuracy is controlled by the length of the recovery vector. Four different approximate multipliers were implemented, using 10-bit, 8-bit, 6-bit, and 4-bit recovery vectors, and are referred to as AMER_10b, AMER_8b, AMER_6b, and AMER_4b, respectively. Note that the accuracy of AMER_XX is not dynamically controllable, unlike that of our proposed multiplier. The ACCI2 approximate multiplier [8] is one of the most accurate approximate multipliers and is referred to as ACCI_M2. The multipliers with eight different accuracy settings (values of u) are referred to as m_7b, m_6b, …, m_0b. Multiplier m_ub utilized an approximate adder for the final results from the PPR consisting of a (3 + u)-bit CPA and {(7 - u) + 3} 2-input OR gates. For example, the approximate adder for m_6b consisted of a 9-bit CPA and four 2-input OR gates.

All the approximate multipliers, as well as the conventional Wallace tree multiplier, were eight bits and coded using Verilog HDL. The Synopsys VCS was used to simulate the designs and generate value change dump (VCD) files to evaluate the power consumption precisely. The Synopsys Design Compiler was used to synthesize the multipliers with the NanGate 45 nm Open Cell Library [9]. The power consumption was evaluated at a frequency of 0.5 GHz. The operating conditions for synthesis were typical (a 1.00 process factor, 1.1 V power supply, and 25°C operating temperature). All designs were synthesized and optimized using the default compiler options. The Synopsys Power Compiler was used to estimate power consumption from switching activity interchange format files generated from the VCD files. The Synopsys VCS was used to evaluate the numerical outputs of all the multipliers. Because 8-bit multipliers were evaluated, the total number of test patterns was 65,536.

B. Accuracy Results

The error distance (ED) and mean ED (MED) measures have been proposed to evaluate the performance of approximate arithmetic circuits [10]. For multipliers, the ED is defined as the arithmetic difference between the accurate product ($P$) and the approximate product ($P'$): $ED = |P - P'|$. The MED is the average ED for a set of outputs. In [3], the mean relative ED (MRED) and normalized MED (NMED) are proposed to evaluate approximate multipliers. The relative ED (RED) is the ED divided by the accurate output: $RED = |P - P'| / P$, and the MRED is the average RED, which can be obtained similarly to the MED. The NMED is defined as $NMED = MED / P_{\max}$, where $P_{\max}$ is the maximum output magnitude of an accurate multiplier. The error rate (ER) is the percentage of inaccurate outputs among all outputs generated from all combinations of inputs. These three metrics (NMED, MRED, and ER) were used to evaluate the proposed multiplier.

Table II compares the accuracy results. It can be seen that the accuracy of the proposed multiplier changes widely according to its setting. While the NMED and MRED values of the most accurate configuration of the proposed multiplier are larger than those of the most accurate AMER configuration and ACCI_M2, its controllability is better than that of AMER. Remember that the proposed multiplier is dynamically controllable, unlike AMER.

TABLE II. ACCURACY COMPARISON.
Multiplier   NMED (%)   MRED (%)   ER (%)
m_7b         0.25       0.85       36.16
m_6b         0.26       0.99       43.46
m_5b         0.29       1.31       52.07
m_4b         0.35       1.93       61.05
m_3b         0.49       3.05       69.61
m_2b         0.71       4.57       74.93
m_1b         1.05       6.50       78.10
m_0b         1.64       9.02       80.02
AMER_10b     0.20       0.62       31.59
AMER_8b      0.24       1.16       55.44
AMER_6b      0.46       3.23       71.12
AMER_4b      1.20       7.53       79.54
ACCI_M2      0.04       0.62       72.29

C. Power, Critical Path Delay, and Design Area Results

Comparisons of the power consumption and critical path delay for the different multipliers relative to accuracy are shown in Fig. 6 and Fig. 7, respectively, where the x-axis indicates the MRED. The circles, triangles, asterisk, and square represent the proposed accuracy-controllable multiplier with different dynamic configurations (m_7b, m_6b, …, m_0b), the
Fig. 6. Power consumption results relative to the MRED.
Fig. 8. Design area results.
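
The accuracy metrics of Section IV-B (NMED, MRED, and ER), which also label the x-axis of Fig. 6, can be sketched as an exhaustive sweep over all 65,536 input pairs of an 8-bit multiplier. This is a minimal sketch, not the authors' evaluation flow: `accuracy_metrics` and `toy_approx_mul` are hypothetical names, the toy multiplier is only a stand-in for a real approximate design, and skipping zero-valued exact products when averaging the RED is an assumption, since the text does not say how division by zero is handled.

```python
def accuracy_metrics(approx_mul, bits=8):
    """Return (NMED, MRED, ER), all in percent, over every input pair."""
    n = 1 << bits
    p_max = (n - 1) ** 2            # maximum output of the accurate multiplier
    sum_ed = 0                      # accumulated error distance
    sum_red = 0.0                   # accumulated relative error distance
    red_count = 0                   # pairs that contribute to the MRED
    errors = 0                      # pairs with an inaccurate output
    for a in range(n):
        for b in range(n):
            exact = a * b
            ed = abs(exact - approx_mul(a, b))
            sum_ed += ed
            if ed:
                errors += 1
            if exact:               # assumption: RED is skipped when P = 0
                sum_red += ed / exact
                red_count += 1
    total = n * n
    med = sum_ed / total
    nmed = 100.0 * med / p_max
    mred = 100.0 * sum_red / red_count
    er = 100.0 * errors / total
    return nmed, mred, er


def toy_approx_mul(a, b):
    """Hypothetical stand-in: the exact product with bit 0 cleared."""
    return (a * b) & ~1
```

Calling `accuracy_metrics(toy_approx_mul)` returns the three percentages in the same units as Table II; passing an exact multiplier returns zero for all three.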
The processed image quality was measured using the peak signal-to-noise ratio (PSNR). This is usually used to measure the quality of reconstructive processes that involve information loss and is defined in terms of the mean squared error (MSE) [6]. The MSE and PSNR were defined in [6] as

$$MSE = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ I(i,j) - K(i,j) \right]^2, \qquad (2)$$

$$PSNR = 10 \log_{10} \left( \frac{MAX_I^2}{MSE} \right), \qquad (3)$$

where $I(i,j)$ and $K(i,j)$ are the correct and obtained values, respectively, of each pixel, $m$ and $n$ are the image dimensions, and $MAX_I$ represents the maximum value of each pixel (255 here, as the images are 8-bit).

Table IV shows the PSNR results of the approximate multipliers, in dB. Larger values represent better quality images. As can be seen, different PSNR values are found for the different images in each column of the table. This confirms that dynamic reconfigurability is necessary for situations where different qualities are required. In addition, the proposed accuracy-controllable multiplier produced a wide range of PSNR values, with its largest values being comparable to those of the other approximate multipliers.

V. CONCLUSION

An accuracy-controllable approximate multiplier has been proposed in this paper that consumes less power and has a shorter critical path delay than the conventional design. Its dynamic controllability is realized by the proposed CMA. The multiplier was evaluated at both the circuit and application levels. The experimental results demonstrate that the proposed multiplier was able to deliver significant power savings and speedups while maintaining a significantly smaller circuit area than that of the conventional Wallace tree multiplier. Furthermore, for the same accuracy, the proposed multiplier delivered greater improvements in both power consumption and critical path delay than other previously studied approximate multipliers. Finally, the ability of our proposed multiplier to control accuracy was confirmed by an application-level evaluation.

ACKNOWLEDGMENT

Thanks are due to Katsuhiko Wakasugi of Logic Research Co., Ltd. for assistance with the experiments. This work was supported by JSPS KAKENHI Grant Number JP17K00088 and by funds (No.175007 and No.177005) from the Central Research Institute of Fukuoka University. This work is supported by VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with Synopsys, Inc.

REFERENCES

[1] S. Venkataramani, V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan, “Quality programmable vector processors for approximate computing,” 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1-12, Dec. 2013.
[2] H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, and C. Lucas, “Bio-inspired imprecise computational blocks for efficient VLSI implementation of soft-computing applications,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 57, no. 4, pp. 850-862, Apr. 2010.
[3] C. Liu, J. Han, and F. Lombardi, “A low-power, high-performance approximate multiplier with configurable partial error recovery,” Design, Automation & Test in Europe Conference & Exhibition (DATE), Mar. 2014.
[4] S. Hashemi, R. I. Bahar, and S. Reda, “DRUM: A dynamic range unbiased multiplier for approximate applications,” IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 418-425, Nov. 2015.
[5] B. Moons and M. Verhelst, “DVAS: Dynamic voltage accuracy scaling for increased energy-efficiency in approximate computing,” IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), Jul. 2015.
[6] A. Momeni, J. Han, P. Montuschi, and F. Lombardi, “Design and analysis of approximate compressors for multiplication,” IEEE Transactions on Computers, vol. 64, no. 4, pp. 984-994, Apr. 2015.
[7] K. C. Bickerstaff, E. E. Swartzlander, and M. J. Schulte, “Analysis of column compression multipliers,” 15th IEEE Symposium on Computer Arithmetic, pp. 33-39, Jun. 2001.
[8] Z. Yang, J. Han, and F. Lombardi, “Approximate compressors for error-resilient multiplier design,” IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS), pp. 183-186, Oct. 2015.
[9] NanGate, Inc., NanGate FreePDK45 Open Cell Library, http://www.nangate.com/?page_id=2325, 2008.
[10] J. Liang, J. Han, and F. Lombardi, “New metrics for the reliability of approximate and probabilistic adders,” IEEE Transactions on Computers, vol. 62, no. 9, pp. 1760-1771, Sep. 2013.
[11] M. S. Lau, K. V. Ling, and Y. C. Chu, “Energy-aware probabilistic multiplier: Design and analysis,” 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pp. 281-290, Oct. 2009.
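
As a supplementary illustration of Eqs. (2) and (3), the MSE and PSNR used in the image-quality evaluation can be computed as sketched here. This is a minimal Python sketch under stated assumptions: `mse` and `psnr` are hypothetical helper names, images are plain nested lists of pixel values, and returning infinity for identical images is a convention chosen here because Eq. (3) is undefined when the MSE is zero.

```python
import math


def mse(correct, obtained):
    """Mean squared error of Eq. (2) for two m x n images (lists of rows)."""
    m = len(correct)
    n = len(correct[0])
    total = 0
    for i in range(m):
        for j in range(n):
            diff = correct[i][j] - obtained[i][j]
            total += diff * diff
    return total / (m * n)


def psnr(correct, obtained, max_i=255):
    """Peak signal-to-noise ratio of Eq. (3), in dB; max_i is 255 for 8-bit pixels."""
    err = mse(correct, obtained)
    if err == 0:
        return math.inf          # assumption: identical images map to infinity
    return 10 * math.log10(max_i ** 2 / err)
```

For example, two 8-bit images that differ by exactly one gray level at every pixel have an MSE of 1 and a PSNR of about 48.13 dB.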