MAC - Low Power and Area


2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)

An Approximate Multiply-Accumulate Unit with Low Power and Reduced Area

Tongxin Yang
Logic Research Co., Ltd., Fukuoka, Japan
[email protected]

Toshinori Sato
Department of Electronics Engineering and Computer Science, Fukuoka University, Fukuoka, Japan
orcid: 0000-0001-5272-7533

Tomoaki Ukezono
Department of Electronics Engineering and Computer Science, Fukuoka University, Fukuoka, Japan
[email protected]

978-1-7281-3391-1/19/$31.00 ©2019 IEEE
DOI 10.1109/ISVLSI.2019.00076

Abstract—Approximate computing benefits applications that are, to some extent, tolerant of errors, by trading accuracy for power savings. In this paper, a low-power approximate multiply-accumulate (MAC) unit with reduced area is proposed. In the proposed MAC unit, a multiplier and an adder are merged in an approximate multiplier by distributing all bits of the accumulated sum over the partial product rows. Experimental results demonstrate that, compared to a conventional MAC unit, the unsigned design of the proposed unit reduces power consumption and circuit area by 42.6% and 46.1%, respectively, without impacting accuracy significantly. While most approximate circuits do not handle signed numbers represented by two’s complement effectively, evaluations using handwritten digit recognition on LeNet-5 indicate that the signed design of the proposed unit is sufficient for practical use.

Keywords—approximate computing, multiply-accumulate unit, low-power circuit, accuracy scaling, deep neural network

I. INTRODUCTION

In the post-Dennard scaling era, some benefits, such as power reduction due to voltage scaling, are no longer provided; thus a new way to reduce power consumption is strongly required. Approximate computing [1, 2], which trades off computational accuracy against power consumption, is a promising candidate. Some modern applications, such as image processing, sensing, recognition, and neural networks, are inherently error-tolerant; thus approximate computing would be applicable to such applications [3]. Recently, deep neural networks have received considerable research attention. They perform large numbers of multiply-accumulate (MAC) operations and therefore consume significant power [4]. In deep learning, convolutions account for more than 90% of overall computation [5]. MAC operations are also extensively used in digital signal processing for data intensive applications. For example, dedicated MAC units are used to accelerate FIR and FFT computations. Considering the above, this paper proposes a low-power MAC unit. Compared to a conventional MAC unit comprising a Wallace tree multiplier and an adder, for unsigned values, the proposed unit improves on the conventional MAC unit by 42.6%, 46.1%, and 42.0% in power, area, and delay, respectively.

The contributions of this study are summarized as follows.

• A technique, which distributes every bit of an accumulated value (ACC) to partial product (PP) rows, is proposed to merge a multiplier and an adder in an atomic approximate circuit.
• An approximate sign converter is proposed to handle signed numbers represented by two’s complement.
• The proposed MAC unit is implemented using Verilog HDL and its power, area, and delay are evaluated at the gate level.
• The quality of approximate computation at the application level is evaluated using a handwritten digit recognition application, which is based on a deep neural network.

The remainder of this paper is organized as follows. Section II surveys related work. The proposed approximate MAC unit is described in Section III and the proposed unit is evaluated in Section IV. Conclusions are presented in Section V.

II. RELATED WORK

Several types of approximate multipliers have been proposed. This paper focuses on those that approximate the PP tree. Speculative multiplier [6], significance-driven logic compression [7], and approximate tree compressor [8] utilize an OR gate to approximate additions in the PP matrix. The proposed MAC unit utilizes the approximate PP compressor.

Esposito et al. [9] proposed an approximate MAC unit that combines a precise multiplier and an approximate adder. The final output of the multiplier is two rows reduced from the PP matrix. The output rows are merged with an ACC and reduced again into two rows by a precise carry save adder (CSA). The final addition of the two rows is processed by a segmentation-type approximate adder. Esposito et al. also proposed a MAC unit [10] that combines an approximate multiplier and a precise adder. In contrast, this paper proposes a MAC unit wherein a multiplier and an adder are merged into an atomic approximate circuit.

III. APPROXIMATE MULTIPLY-ADD UNIT

A conventional MAC unit consists of a multiplier, an adder, and an accumulator. The accumulator is simply a register that holds the outcome of the adder; therefore, this paper focuses on arithmetic units, i.e., multiplier and adder. Here, multiplication
consists of three steps: (1) PP generation, (2) PP reduction, and
(3) final addition. In the first step, AND gates form the PPs. The
next two subsections explain how the second step is processed
in the approximate multipliers [6, 7, 8]. In the third subsection,
merged accumulation for a MAC unit is proposed. The final
subsection describes how signed numbers are handled.
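
The first step above can be illustrated with a minimal Python sketch. It is only a software model under stated assumptions: the function names `partial_products` and `exact_product` are hypothetical, the operands are taken to be unsigned 8-bit values, and steps (2) and (3) are collapsed into one exact sum rather than a compression tree.

```python
def partial_products(a, b, width=8):
    """Return `width` partial-product rows for unsigned a * b.

    Row i is the multiplicand ANDed with multiplier bit i, shifted left
    by i so that it sits at its arithmetic weight.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    rows = []
    for i in range(width):
        bit = (b >> i) & 1
        # ANDing every multiplicand bit with one multiplier bit yields a or 0.
        rows.append((a if bit else 0) << i)
    return rows


def exact_product(a, b, width=8):
    # Steps (2) and (3) are collapsed into one exact sum of the rows.
    return sum(partial_products(a, b, width))
```

Summing the rows exactly reproduces `a * b`; the approximate designs discussed next differ only in how the rows are reduced.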
A. Approximate Partial-Product Compression
OR gates are utilized to reduce the number of rows in a PP
matrix [6, 7, 8]. Each of them works as an approximate half
adder. Addition of two PPs, $pp_i$ and $pp_j$, is expressed as follows
[6, 8]:

$$pp_i + pp_j = (pp_i \;\mathrm{OR}\; pp_j) + (pp_i \;\mathrm{AND}\; pp_j) = O_{i,j} + A_{i,j} \qquad (1)$$

$$O_{i,j} = pp_i \;\mathrm{OR}\; pp_j, \qquad A_{i,j} = pp_i \;\mathrm{AND}\; pp_j$$

Since the probability of $O_{i,j}$ is significantly higher than that of
$A_{i,j}$, an OR operation is utilized as approximate addition. If
approximate addition is performed for each group of two
columns, the number of rows in the PP matrix is reduced by one-
half. Then, the rows are further reduced by any means until their
number reaches two. To further reduce the rows, Esposito et al.
[10] utilized CSA. In contrast, Qiqieh et al. [7] and Yang et al.
[8] utilized OR gates repeatedly. The former method provides
better accuracy because, in the latter, accuracy is reduced as OR
operations accumulate. Nonetheless, the latter method is used in
this study because the primary goal is power reduction.
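
Equation (1) and the repeated OR reduction can be checked with a small Python sketch. The names `or_compress` and `or_reduce` are hypothetical, and rows are modeled as non-negative integers already shifted to their bit positions; this is a behavioral illustration, not the gate-level structure of [7] or [8].

```python
from functools import reduce


def or_compress(row_i, row_j):
    """Split row_i + row_j into the OR term and the AND term of Equation (1)."""
    approx = row_i | row_j   # O_{i,j}: kept as the approximate sum
    dropped = row_i & row_j  # A_{i,j}: the part the OR gate loses
    return approx, dropped


def or_reduce(rows):
    """Collapse all rows into one by repeated OR (no error recovery)."""
    return reduce(lambda x, y: x | y, rows, 0)
```

For rows 0b1010 and 0b0110 the OR term is 0b1110 (14) and the dropped term is 0b0010 (2), and 14 + 2 equals the exact sum 16. Because every OR drops a non-negative term, the fully OR-reduced result never exceeds the exact sum, which is why accuracy degrades as OR operations accumulate.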
B. Recovery Vector Generation
Yang et al. [8] utilized $A_{i,j}$ (Equation (1)) to recover the accuracy that is diminished by approximate additions. Accuracy is improved if as many $A_{i,j}$ terms as possible are added to the approximate sum of the PPs; however, power consumption is then not reduced. Thus, parts of the summations of the $A_{i,j}$ terms are also approximated by OR operations [8]. The degree of the approximation manages the trade-off between accuracy and power reduction. Note that a set of $A_{i,j}$ terms utilized for recovery is referred to as a recovery vector. The proposed MAC unit adopts this error recovery technique.

C. Merged Accumulation

Rather than combining a multiplier and an adder to construct a MAC unit, this paper proposes a MAC unit wherein a multiplier and an adder are merged into a single component. The structure of an $(8 \times 8 + 16)$-bit MAC unit that is based on the approximate multiplier proposed by Yang et al. [8] is shown in Fig. 1. The accumulation is merged with the PP summation, which will reduce the delay, area, and power consumption. However, accuracy may also be reduced. As mentioned previously (Section III.A), PP summation accuracy is reduced as the number of accumulated OR operations increases. If the 16-bit ACC is divided into two 8-bit rows that are added to the PP matrix, the number of rows increases and thus the number of OR operations required to reduce the number of rows to two also increases. This diminishes the accuracy.

Fig. 1. $(8 \times 8 + 16)$-bit Multiply-Add Unit

To address the problem, the ACC is distributed bit by bit, and each bit is concatenated with one PP row. In Fig. 1, the top part shows the PP matrix and the ACC. Here, a black circle represents a PP and each PP row is labeled (PP0…PP7). Red circles represent the higher 8 bits of the ACC (ACCH) and blue circles represent the lower 8 bits of the ACC (ACCL). In Fig. 1, the middle section explains the approximate PP compression. First, the ACC is distributed bit by bit, and each bit is concatenated with one PP row. Each bit of ACCH is concatenated with a single PP row as its most significant bit (MSB). In contrast, each ACCL bit is concatenated as the least significant bit (LSB). For example, ACC[9] and ACC[0] are concatenated with PP1 as its MSB and LSB, respectively. Note that the MSB of ACCL (ACC[7]) is the only exception. It replaces the LSB of PP7 because it cannot find its place. Thus, except for PP0, each PP row becomes (8+2) bits long and the MAC operation looks like 10-bit × 8-bit multiplication. This maintains the number of rows and thus avoids accuracy degradation.
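
The bit distribution can be sketched in Python as follows. This is a hedged illustration, not the circuit in Fig. 1: the function name `distribute_acc` is hypothetical, the general mapping (ACC[8+i] as the MSB of PPi, ACC[i-1] as the LSB of PPi for i ≥ 1) is inferred from the single ACC[9]/ACC[0]/PP1 example in the text, and the special handling of ACC[7] is not modeled. ACC[7] is simply returned as a separate term, and the rows are summed exactly instead of being compressed approximately.

```python
WIDTH = 8


def distribute_acc(a, b, acc):
    """Attach the bits of a 16-bit ACC to the eight PP rows of a * b.

    Returns (rows, acc7). acc7 is ACC[7] at its weight; the source treats
    it as a special case that this sketch leaves out of the rows.
    """
    a &= (1 << WIDTH) - 1
    b &= (1 << WIDTH) - 1
    acc &= (1 << (2 * WIDTH)) - 1
    rows = []
    for i in range(WIDTH):
        pp = (a if (b >> i) & 1 else 0) << i             # PP_i covers bits i..i+7
        msb = ((acc >> (WIDTH + i)) & 1) << (WIDTH + i)  # ACCH bit ACC[8+i]
        lsb = 0
        if i >= 1:
            lsb = ((acc >> (i - 1)) & 1) << (i - 1)      # ACCL bit ACC[i-1]
        # The three fields occupy disjoint bit positions, so OR acts as concatenation.
        rows.append(pp | msb | lsb)
    acc7 = acc & (1 << 7)
    return rows, acc7
```

Every ACC bit lands at the bit position that matches its weight, so the exact sum of the extended rows plus the ACC[7] term equals `a * b + acc`. The row count stays at eight, which is the property the merged design relies on.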

The following process is the same as for any approximate multiplier, i.e., row compression and final addition. Any previously proposed approximate multiplier scheme can be selected. Here, the approximate multiplier proposed by Yang et al. [8] was selected due to its good balance between power and accuracy. In addition, its accuracy configurability is also a desirable feature. The difference is the bit length of each intermediately processed row. Although this is an incremental study, it will be found that very small overheads on power, area, and delay over a multiplier enable a MAC unit.

In the second step, first, according to Equation (1), each two rows of PP0, …, PP6, and PP7 are compressed to an approximate sum, i.e., O1, O2, O3, and O4. In parallel, recovery vector V1 is generated by applying an OR operation to A1, A2, A3, and A4. The four rows of O1, O2, O3, and O4 are further compressed into rows O5 and O6. Similarly, recovery vector V2 is generated by OR-ing A5 and A6. Recovery vector A7 is generated by compressing O5 and O6 into O7. Now, the eight rows are reduced into four rows, i.e., O7, V1, V2, and A7. Next, bit positions between 3 and 11 of V1 and V2 are approximately summed up by OR gates to reduce the number of rows to three. Finally, the three rows are processed using CSA to obtain two rows.

In the third step, the two rows obtained in the second step are summed using accuracy-configurable addition. The most significant four bits are summed using a carry propagating adder (CPA). The carry out in the bit position 17 is discarded. The next seven bits are approximately/precisely summed using a carry maskable adder (CMA) [8]. Note that a control signal is applied to change the functionality of the CMA between precise addition and OR operations. The remaining bits are approximately summed using OR gates. Now, the result of the 16-bit ACC is obtained.

D. Handling of Signed Numbers

Most existing approximate circuits do not target signed numbers. The Baugh-Wooley algorithm [11] is used to handle signed numbers in conventional array multipliers. In a preliminary study, it was found that the algorithm does not work well. Negative numbers that are represented by two’s complement have “1”s in the leading bit positions. OR operations used in approximate additions retain them and thus a negative number never becomes positive, resulting in frequent sign bit errors. This paper proposes a naïve technique to address the problem. First, a negative multiplicand or multiplier is approximately converted to a positive value, and the ACC is also negated if necessary. Second, unsigned multiplication is performed. Finally, the output is negated if necessary. The approximate conversion utilizes one’s complement and an OR operation with “1” in the LSB.

IV. EXPERIMENTS

First, the proposed MAC unit is evaluated in terms of power, delay, and area. For comparison, the base multiplier [8], a conventional MAC unit comprising a Wallace tree multiplier and the CPA, and MAC units proposed by Esposito et al. [9, 10] are also evaluated. They are referred to as MULT [8], CONV, ESP1 [9], and ESP2 [10], respectively. They are implemented in Verilog HDL. Note that both unsigned and signed designs are evaluated. Signed CONV utilizes the Baugh-Wooley algorithm. The original ESP2 partially utilizes the Baugh-Wooley algorithm and only the multiplicand is a signed number. Thus, unsigned ESP2 that does not utilize the algorithm is also implemented. In signed ESP2, the negative multiplier is converted to an unsigned value by using the naïve method proposed in Section III.D. The proposed signed MAC unit utilizes the naïve method for both multiplicand and multiplier. The Synopsys Design Compiler and the NanGate 45nm Open Cell Library [12] are used for logic synthesis with the default compiler options. The value change dump files generated from the Synopsys VCS simulations are used by the Synopsys Power Compiler for dynamic power estimation. One million randomly generated inputs are used to obtain outputs. A set of input signals for the simulation has a period of 2 ns.

Second, the proposed MAC unit and ESP2 are evaluated in terms of accuracy. Mean relative error distance [13] (MRED) is used as the metric to evaluate accuracy. The error distance (ED) is defined as the difference between an accurate sum ($M$) and its approximate sum ($M'$), i.e., $ED = |M' - M|$. The relative ED (RED) is defined as the ED divided by $M$, i.e., $RED = ED / M = |M' - M| / M$. The mean RED (MRED) is the average of REDs. In addition, to assess the practicality of the proposed MAC unit, two image processing applications, i.e., image sharpening [14] and handwritten digit recognition [15], are also evaluated.

The image sharpening application consists of two steps. First, Gaussian smoothing on the input image is performed as follows:

$$R(i, j) = \frac{1}{273} \sum_{k=-2}^{2} \sum_{l=-2}^{2} G(k+2, l+2) \cdot I(i+k, j+l)$$

$$G = \begin{bmatrix} 1 & 4 & 7 & 4 & 1 \\ 4 & 16 & 26 & 16 & 4 \\ 7 & 26 & 41 & 26 & 7 \\ 4 & 16 & 26 & 16 & 4 \\ 1 & 4 & 7 & 4 & 1 \end{bmatrix}$$

where $I(i, j)$ and $R(i, j)$ represent each pixel of the input and the smoothed images, and $G$ is the Gaussian kernel. Second, the image sharpening application obtains the sharpened image $S$ by $S = 2I - R$. Only MAC operations in convolutions are approximated. The inputs are 512 × 512 grayscale bitmaps with 8-bit pixels of the well-known Lena, Baboon, Peppers, and Barbara images. Since all values processed in this application are positive, the unsigned MAC units are evaluated. Peak signal-to-noise ratio (PSNR) is used to evaluate the quality of the application output. Here, PSNR is defined as follows:

$$PSNR = 10 \log_{10}\left(\frac{MAX_I^2}{MSE}\right)$$

$$MSE = \frac{1}{x \cdot y} \sum_{i=0}^{x-1} \sum_{j=0}^{y-1} \left[P(i, j) - A(i, j)\right]^2$$

where $MAX_I$, $P(i, j)$, and $A(i, j)$ are the maximum, precise, and approximate values of each pixel, respectively, and $x$ and $y$ are the image dimensions. PSNR values greater than 40 dB are considered good [16].
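
The two accuracy metrics defined above can be written as a short Python sketch. The function names `mred` and `psnr` are hypothetical. Two choices are assumptions that the text does not state: samples whose accurate value is zero are skipped in the MRED (the RED is undefined there), and `max_value` defaults to 255 for 8-bit pixels.

```python
import math


def mred(exact, approx):
    """Mean relative error distance over paired accurate/approximate results."""
    # Assumption: samples with an accurate value of 0 are skipped.
    reds = [abs(m_approx - m) / m for m, m_approx in zip(exact, approx) if m != 0]
    return sum(reds) / len(reds)


def psnr(precise, approx, max_value=255):
    """PSNR in dB between two images given as equally sized lists of rows."""
    squared_error = 0
    pixel_count = 0
    for row_p, row_a in zip(precise, approx):
        for p, a in zip(row_p, row_a):
            squared_error += (p - a) ** 2
            pixel_count += 1
    mse = squared_error / pixel_count
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)
```

As a small check, accurate sums 100 and 200 with approximate sums 90 and 210 give REDs of 0.10 and 0.05, so the MRED is 0.075 (7.5%).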

The handwritten digit recognition utilizes a deep neural network. A slight variation of LeNet-5 [15] is implemented in Darknet [17]. It consists of two convolutional layers, each of which is followed by a max pooling layer, two fully-connected layers, and a softmax layer, as shown in Fig. 2. The activation function is tanh. The training set includes 60,000 images from the MNIST database [18]. Single-precision floating-point operations are used for training. Note that only the inference phase is used to evaluate the approximate circuits. In addition, only MAC operations in convolutional and fully-connected layers are performed using fixed-point operations and then approximated. 8- and 16-bit operations are used in multiply and accumulate operations, respectively. Since the deep neural network treats signed values, the signed MAC units are evaluated. LeNet-5 includes 341k MAC operations [4] and convolutions account for more than 90% of the overall computation [5]. The test set comprises 10,000 images from the MNIST database. Here, the recognition rate is used as a metric to evaluate accuracy.

Fig. 2. LeNet-5 Deep Neural Network

A. Power, Area, and Delay

Fig. 3 shows the dynamic power consumption of MULT, CONV, ESP1, ESP2, and the proposed MAC unit¹. For MULT, ESP1, and the proposed MAC unit, which are configurable, the most accurate configurations are selected. The impact of configuration on power will be discussed in Section IV.B. In Fig. 3, the leftmost five bars represent the unsigned MAC units. The sixth bar represents the original ESP2, and the rightmost three bars represent the signed MAC units.

¹ Note that there is a considerable difference between the power consumption of MULT shown in Fig. 3 and that reported in a previous study [8]. The difference is due to how test patterns are provided. In the previous study [8], input signals are provided in ascending order using a nested loop; thus, the toggle rate is low. In contrast, in this paper, randomly generated signals are provided; thus, the toggle rate is relatively high.

Fig. 3. Dynamic Power Consumption of MAC Units. (* Multiplier)

The unsigned MAC units are compared with CONV (gray bar, Fig. 3). ESP1 (light blue, Fig. 3) increases power consumption by 24.7%, because its segmented approximate adder has redundant circuits. The power reduction rate of ESP2 (dark blue, Fig. 3) is 13.2%. This confirms that the approximate compression works effectively. When its multiplier is extended to a “Signed × Unsigned” multiplier (S × U, Fig. 3), the reduction rate becomes somewhat smaller, i.e., 5.1%, because negative numbers represented by two’s complement cause frequent bit flips. Power consumed by the proposed MAC unit (red, Fig. 3) is 42.6% less than that of CONV. The proposed MAC unit utilizes the approximate compression more times than ESP2; therefore, further power reduction is achieved. It is only 20.1% greater than MULT (black, Fig. 3). Note that the proposed MAC unit is based on MULT. This percentage matches the difference between the PP row sizes of MULT and the proposed MAC unit. This means that the extension from MULT to the proposed MAC unit does not incur any additional power overhead.

Next, signed MAC units are discussed. Because the target of this paper is power reduction, ESP1 is removed from evaluation. Signed CONV consumes only 5.8% more power than unsigned CONV. This means that the Baugh-Wooley algorithm does not have significant power overhead. In contrast, ESP2 and the proposed MAC unit incur significant power overhead. Compared to their unsigned implementations, ESP2 and the proposed MAC unit increase power consumption by 31.6% and 54.1%, respectively, due to the sign conversion. Consequently, ESP2 consumes more power than CONV, and thus it loses its attractiveness in power reduction by approximation. In contrast, the proposed MAC unit achieves power reduction from CONV, i.e., a reduction rate of 18.0%.

Fig. 4. Area and Delay of MAC Units. (* Multiplier)

Fig. 4 summarizes the area and delay of the MAC units. For each group of two bars, the left bar (A) represents area and the right bar (D) represents delay. All values are normalized by those of unsigned CONV. Here, 46.1% and 41.7% area reduction is obtained for unsigned and signed MAC units, respectively. This is not particularly remarkable for the following reason. CONV, which includes a Wallace tree multiplier, utilizes a full adder to compress three 1-bit values to
two values. In contrast, the proposed MAC unit utilizes an OR gate to compress two 1-bit values to a single value. Even if the logic for error recovery is considered, the total number of logic gates is reduced significantly. The tendency in area is similar to that in power, except for signed ESP2. While power consumption of signed ESP2 is greater than that of signed CONV, the area of the former is smaller than that of the latter. Again, this shows that the sign converter incurs significant power overhead. Except for unsigned ESP1, the tendency in delay is also similar to that in power. Delay of ESP1 is less than that of unsigned CONV because the segmented adder is targeted to reduce delay. Among the approximate MAC units for both unsigned and signed implementations, the proposed MAC unit is the smallest for both area and delay. In addition, the overheads on area and delay over MULT are considerably small. Note that the unsigned design does not incur any delay overhead.

B. Accuracy

Table I summarizes the MRED of the proposed MAC unit (left half in the table) and ESP2 (right half). ESP2 is better in power and area than ESP1; thus, the proposed MAC unit is compared to only ESP2. In Table I, “U”, “S”, and “S/U” mean that the MAC unit utilizes unsigned, signed, and “signed × unsigned” multipliers, respectively. Note that “mask” indicates mask bits for the CMAs. If the mask bit is “1”, the CMA works as an OR gate. Otherwise, it works as a precise full adder. “design” indicates the four designs investigated in the literature [10], and design #1, #2, #3, and #4 truncate the least significant 0, 4, 8, and 10 bits, respectively. Both MAC units have scalability relative to accuracy. In addition, the signed and S/U implementations are worse in accuracy than the unsigned implementations. As mentioned previously, S/U ESP2 utilizes the Baugh-Wooley algorithm to handle a signed multiplicand represented by two’s complement. The approximate compression used in ESP2 is weak in handling negative numbers represented by two’s complement; thus, the MRED of S/U ESP2 becomes worse than that of the unsigned ESP2. Regarding the signed design of the proposed MAC unit, the approximate sign converter severely degrades the MRED, especially when the absolute value is small. Thus, the MRED of the signed design of the proposed MAC unit is larger than that of the unsigned one.

TABLE I. MRED (%)

          Proposed          ESP2
mask      U      S          design   U      S/U    S
0000000   0.75   5.24       #1       0.55   4.50   4.91
0000001   0.79   5.69       #2       0.66   5.38   5.80
0000011   0.88   6.72       #3       3.18   42.7   42.7
0000111   1.07   8.59       #4       8.71   132    132
0001111   1.42   11.6
0011111   2.07   16.5
0111111   3.13   23.3
1111111   4.77   32.4

Fig. 5 shows the impact of the configurations on power and accuracy relative to the MRED of unsigned MAC units. Here, red circles (from left to right) correspond to configurations from 0000000 to 1111111. Similarly, blue crosses (left to right) correspond to designs #1 to #4. As expected, smaller MRED requires larger power. ESP2 has larger scalability in power than the proposed MAC unit does. However, the former only has static configurability, whereas the latter has dynamic configurability. In other words, the configuration of ESP2 must be determined during design. In contrast, the configuration of the proposed MAC unit can change while it is working. In addition, when configurations with good MRED are compared, the proposed MAC unit always consumes less power than ESP2.

Fig. 5. Power vs. MRED (Unsigned)

TABLE II. PSNR (dB)

Proposed
mask      Bab    Bar    Len    Pep    avg
0000000   41.9   45.2   43.3   42.8   43.3
0000001   39.3   41.7   40.4   40.1   40.4
0000011   35.5   36.8   36.0   35.8   36.0
0000111   30.5   31.4   31.0   30.6   30.9
0001111   25.5   26.6   25.8   25.6   25.9
0011111   20.9   21.0   20.8   20.8   20.9
0111111   15.7   17.2   16.2   15.9   16.2
1111111   12.9   13.5   12.8   12.8   13.0

ESP2
design    Bab    Bar    Len    Pep    avg
#1        48.6   48.2   47.8   47.3   48.0
#2        39.6   39.5   39.3   39.1   39.4
#3        17.3   18.0   17.4   17.3   17.5
#4        10.8   11.2   10.7   10.8   10.9

Fig. 6. Power vs. PSNR

The results of the image sharpening application are shown in Table II, which presents the PSNR of the four images and their average for different configurations. The image sharpening application operates only on unsigned values; thus, only the unsigned implementation was evaluated. As can be seen, the two
most accurate configurations have good PSNR for both the proposed MAC unit and ESP2. Fig. 6 shows how configuration affects power and accuracy in PSNR. Here, red circles (right to left) correspond to configurations 0000000 to 1111111. Similarly, blue crosses (right to left) correspond to designs #1 to #4, respectively. The results are very similar to the MRED cases. Higher PSNR requires greater power, and relative to power and PSNR, ESP2 has greater scalability than the proposed MAC unit. When the two MAC units are compared under conditions where good PSNR is required, the proposed MAC unit consumes less power than ESP2.

Table III shows the recognition rate of the handwritten digit recognition application. The application operates on signed values and, since the signed ESP2 consumes more power than CONV, only the signed design of the proposed MAC unit was evaluated. The “configurable bit” [11:5] column presents the results. The recognition rate drops suddenly; thus, only the three most accurate configurations are shown. Note that when recognition is based on precise floating-point addition, the recognition rate is 97.37%. If the floating-point MAC operation is replaced by the fixed-point one, it becomes slightly smaller (96.64%). Compared with these numbers, a recognition rate of 91.23% is sufficient. To investigate scalability in recognition rate, the sensitivity to the configurable bits was evaluated. The results indicate that the recognition rate improves as the configurable bits move to lower positions.

TABLE III. RECOGNITION RATE (%)

          configurable bit
mask      [11:5]   [10:4]   [9:3]
0000000   91.23    92.53    93.22
0000001   79.58    91.23    92.53
0000011   11.48    79.58    91.23

V. CONCLUSIONS

This paper has proposed a low-power approximate MAC unit with reduced area. Compared to a conventional MAC unit for unsigned values, the proposed MAC unit reduces power and area by 42.6% and 46.1%, respectively, without serious reduction in accuracy for unsigned operations. This was primarily achieved by distributing ACC over the PP rows. The MAC unit could be implemented as an approximate multiplier with longer PP rows. Using an image sharpening application, it was found that the unsigned design of the proposed MAC unit demonstrated sufficient accuracy. The approximation based on the OR operation is weak in handling signed numbers represented by two’s complement; thus, the conventional Baugh-Wooley algorithm did not work effectively. Therefore, an approximate sign converter was proposed such that the signed design of the proposed MAC unit reduces power and area by 18.0% and 41.7%, respectively. The evaluations using handwritten digit recognition demonstrated that it functioned sufficiently for practical use.

ACKNOWLEDGMENT

This work was supported by JSPS KAKENHI Grant Number JP17K00088, by R&D FS Project of Fukuoka Industry, Science & Technology Foundation, and by funds (No.175007 and 177005) from the Central Research Institute of Fukuoka University. It was also supported by VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with Synopsys, Inc.

REFERENCES

[1] J. Han and M. Orshansky, “Approximate computing: an emerging paradigm for energy-efficient design,” 18th European Test Symposium, 2013.
[2] K. Roy and A. Raghunathan, “Approximate computing: an energy-efficient computing technique for error resilient applications,” IEEE Computer Society Annual Symposium on VLSI, 2015.
[3] V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan, “Analysis and characterization of inherent application resilience for approximate computing,” 50th Design Automation Conference, 2013.
[4] V. Sze, Y.H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: a tutorial and survey,” Proceedings of the IEEE, Vol. 105, No. 12, 2017.
[5] J. Emer, V. Sze, and Y.-H. Chen, “Hardware architectures for deep neural networks,” Tutorial at International Symposium on Computer Architecture, 2017. http://www.rle.mit.edu/eems/wp-content/uploads/2017/06/ISCA-2017-Hardware-Architectures-for-DNN-Tutorial.pdf
[6] A. Cilardo, D. De Caro, N. Petra, F. Caserta, N. Mazzocca, E. Napoli, and A. G. M. Strollo, “High speed speculative multipliers based on speculative carry-save tree,” IEEE Transactions on Circuits and Systems I, Vol. 61, No. 12, 2014.
[7] I. Qiqieh, R. Shafik, G. Tarawneh, D. Sokolov, and A. Yakovlev, “Energy-efficient approximate multiplier design using bit significance-driven logic compression,” 21st Design, Automation & Test in Europe Conference & Exhibition, 2017.
[8] T. Yang, T. Ukezono, and T. Sato, “A low-power high-speed accuracy-controllable approximate multiplier design,” 23rd Asia and South Pacific Design Automation Conference, 2018.
[9] D. Esposito, D. De Caro, E. Napoli, N. Petra, and A. G. M. Strollo, “On the use of approximate adders in carry-save multiplier-accumulators,” International Symposium on Circuits and Systems, 2017.
[10] D. Esposito, A. G. M. Strollo, and M. Alioto, “Low-power approximate MAC unit,” 13th Conference on Ph.D. Research in Microelectronics and Electronics, 2017.
[11] C. R. Baugh and B. A. Wooley, “A two’s complement parallel array multiplication algorithm,” IEEE Transactions on Computers, Vol. C-22, No. 12, 1973.
[12] Silvaco Inc., “PDK 45nm open cell library,” https://www.silvaco.com/products/nangate/FreePDK45_Open_Cell_Library/
[13] C. Liu, J. Han, and F. Lombardi, “A low-power, high-performance approximate multiplier with configurable partial error recovery,” 18th Design, Automation & Test in Europe Conference & Exhibition, 2014.
[14] M. S. Lau, K. V. Ling, and Y. C. Chu, “Energy-aware probabilistic multiplier: design and analysis,” International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, 2009.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, Vol. 86, No. 11, 1998.
[16] D. Bull, Communicating Pictures - A Course in Image and Video Coding, Academic Press, 2014.
[17] J. Redmon, “Darknet: open source neural networks in C,” https://pjreddie.com/darknet/
[18] Y. LeCun, C. Cortes, and C. J.C. Burges, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/
