MAC - Low Power and Area
Abstract—Approximate computing benefits applications that are, to some extent, error tolerant by trading accuracy for reduced power. In this paper, a low-power approximate multiply-accumulate (MAC) unit with reduced area is proposed. In the proposed MAC unit, a multiplier and an adder are merged in an approximate multiplier by distributing all accumulated sums to partial product rows. Experimental results demonstrate that, compared to a conventional MAC unit, the unsigned design of the proposed unit reduces power consumption and circuit area by 42.6% and 46.1%, respectively, without impacting accuracy significantly. While most approximate circuits do not handle signed numbers represented by two's complement effectively, evaluations using handwritten digit recognition on LeNet-5 indicate that the signed design of the proposed unit is sufficient for practical use.

Keywords—approximate computing, multiply-accumulate unit, low-power circuit, accuracy scaling, deep neural network

I. INTRODUCTION

In the post-Dennard scaling era, some benefits, such as power reduction due to voltage scaling, are no longer provided; thus, a new way to reduce power consumption is strongly required. Approximate computing [1, 2], which trades off computational accuracy against power consumption, is a promising candidate. Some modern applications, such as image processing, sensing, recognition, and neural networks, are inherently error-tolerant; thus, approximate computing would be applicable to such applications [3]. Recently, deep neural networks have received considerable research attention. They perform large numbers of multiply-accumulate (MAC) operations and therefore consume significant power [4]. In deep learning, convolutions account for more than 90% of overall computation [5]. MAC operations are also extensively used in digital signal processing for data intensive applications. For example, dedicated MAC units are used to accelerate FIR and FFT computations. Considering the above, this paper proposes a low-power MAC unit. Compared to a conventional MAC unit comprising a Wallace tree multiplier and an adder, for unsigned values, the proposed unit reduces power, area, and delay by 42.6%, 46.1%, and 42.0%, respectively.

The contributions of this study are summarized as follows.

• A technique, which distributes every bit of an accumulated value (ACC) to partial product (PP) rows, is proposed to merge a multiplier and an adder in an atomic approximate circuit.
• An approximate sign converter is proposed to handle signed numbers represented by two's complement.
• The proposed MAC unit is implemented using Verilog HDL and its power, area, and delay are evaluated at the gate level.
• The quality of approximate computation at the application level is evaluated using a handwritten digit recognition application, which is based on a deep neural network.

The remainder of this paper is organized as follows. Section II surveys related work. The proposed approximate MAC unit is described in Section III and the proposed unit is evaluated in Section IV. Conclusions are presented in Section V.

II. RELATED WORK

Several types of approximate multipliers have been proposed. This paper focuses on those that approximate the PP tree. The speculative multiplier [6], significance-driven logic compression [7], and the approximate tree compressor [8] utilize an OR gate to approximate additions in the PP matrix. The proposed MAC unit utilizes the approximate PP compressor.

Esposito et al. [9] proposed an approximate MAC unit that combines a precise multiplier and an approximate adder. The final output of the multiplier is two rows reduced from the PP matrix. The output rows are merged with an ACC and reduced again into two rows by a precise carry save adder (CSA). The final addition of the two rows is processed by a segmentation-type approximate adder. Esposito et al. also proposed a MAC unit [10] that combines an approximate multiplier and a precise adder. In contrast, this paper proposes a MAC unit wherein a multiplier and an adder are merged into an atomic approximate circuit.

III. APPROXIMATE MULTIPLY-ADD UNIT

A conventional MAC unit consists of a multiplier, an adder, and an accumulator. The accumulator is simply a register that holds the outcome of the adder; therefore, this paper focuses on arithmetic units, i.e., multiplier and adder. Here, multiplication
The following process is the same as for any approximate multiplier, i.e., row compression and final addition. Any previously proposed approximate multiplier scheme can be selected. Here, the approximate multiplier proposed by Yang et al. [8] was selected due to its good balance between power and accuracy. In addition, its accuracy configurability is also a desirable feature. The difference is the bit length of each intermediately processed row. Although this is an incremental study, it will be shown that a MAC unit is obtained with very small overheads in power, area, and delay over a multiplier.

In the second step, first, according to Equation (1), each two rows of PP0, …, PP6, and PP7 are compressed to an approximate sum, i.e., O1, O2, O3, and O4. In parallel, recovery vector V1 is generated by applying an OR operation to A1, A2, A3, and A4. The four rows of O1, O2, O3, and O4 are further compressed into rows O5 and O6. Similarly, recovery vector V2 is generated by OR-ing A5 and A6. Recovery vector A7 is generated by compressing O5 and O6 into O7. Now, the eight rows are reduced into four rows, i.e., O7, V1, V2, and A7. Next, bit positions between 3 and 11 of V1 and V2 are approximately summed by OR gates to reduce the number of rows to three. Finally, the three rows are processed using a CSA to obtain two rows.

In the third step, the two rows obtained in the second step are summed using accuracy-configurable addition. The most significant four bits are summed using a carry propagating adder (CPA). The carry out in the bit position 17 is discarded. The next seven bits are approximately/precisely summed using a carry maskable adder (CMA) [8]. Note that a control signal is applied to change the functionality of the CMA between precise addition and OR operations. The remaining bits are approximately summed using OR gates. Now, the result of the 16-bit ACC is obtained.

D. Handling of Signed Numbers

Most existing approximate circuits do not target signed numbers. The Baugh-Wooley algorithm [11] is used to handle signed numbers in conventional array multipliers. In a preliminary study, it was found that the algorithm does not work well. Negative numbers that are represented by two's complement have "1"s in the leading bit positions. OR operations used in approximate additions retain them; thus, a negative number can never become positive, resulting in frequent sign bit errors. This paper proposes a naïve technique to address the problem. First, a negative multiplicand and a negative multiplier are approximately converted to positive values, and the ACC is also negated if necessary. Second, unsigned multiplication is performed. Finally, the output is negated if necessary. The approximate conversion utilizes one's complement and an OR operation with "1" in the LSB.

IV. EXPERIMENTS

First, the proposed MAC unit is evaluated in terms of power, delay, and area. For comparison, the base multiplier [8], a conventional MAC unit comprising a Wallace tree multiplier and the CPA, and MAC units proposed by Esposito et al. [9, 10] are also evaluated. They are referred to as MULT [8], CONV, ESP1 [9], and ESP2 [10], respectively. They are implemented in Verilog HDL. Note that both unsigned and signed designs are evaluated. Signed CONV utilizes the Baugh-Wooley algorithm. The original ESP2 partially utilizes the Baugh-Wooley algorithm and only the multiplicand is a signed number. Thus, unsigned ESP2 that does not utilize the algorithm is also implemented. In signed ESP2, the negative multiplier is converted to an unsigned value by using the naïve method proposed in Section III.D. The proposed signed MAC unit utilizes the naïve method for both multiplicand and multiplier. The Synopsys Design Compiler and the NanGate 45nm Open Cell Library [12] are used for logic synthesis with the default compiler options. The value change dump files generated from the Synopsys VCS simulations are used by the Synopsys Power Compiler for dynamic power estimation. One million randomly generated inputs are used to obtain outputs. A set of input signals for the simulation has a period of 2 ns.

Second, the proposed MAC unit and ESP2 are evaluated in terms of accuracy. Mean relative error distance [13] (MRED) is used as the metric to evaluate accuracy. The error distance (ED) is defined as the difference between an accurate sum (M) and its approximate sum (M'), i.e., $ED = |M' - M|$. The relative ED (RED) is defined as the ED divided by M, i.e., $RED = ED / M = |M' - M| / M$. The mean RED (MRED) is the average of REDs. In addition, to assess the practicality of the proposed MAC unit, two image processing applications, i.e., image sharpening [14] and handwritten digit recognition [15], are also evaluated.

The image sharpening application consists of two steps. First, Gaussian smoothing on the input image is performed as follows:

$$R(i, j) = \frac{1}{273} \sum_{k=-2}^{2} \sum_{l=-2}^{2} G(k+2, l+2) \cdot I(i+k, j+l)$$

$$G = \begin{bmatrix} 1 & 4 & 7 & 4 & 1 \\ 4 & 16 & 26 & 16 & 4 \\ 7 & 26 & 41 & 26 & 7 \\ 4 & 16 & 26 & 16 & 4 \\ 1 & 4 & 7 & 4 & 1 \end{bmatrix}$$

where I(i, j) and R(i, j) represent each pixel of the input and the smoothed images, and G is the Gaussian kernel. Second, the image sharpening application obtains the sharpened image S by $S = 2I - R$. Only MAC operations in convolutions are approximated. The inputs are 512 × 512 grayscale bitmaps with 8-bit pixels of the well-known Lena, Baboon, Peppers, and Barbara images. Since all values processed in this application are positive, the unsigned MAC units are evaluated. Peak signal-to-noise ratio (PSNR) is used to evaluate the quality of the application output. Here, PSNR is defined as follows:

$$PSNR = 10 \log_{10} \left( \frac{MAX_I^2}{MSE} \right)$$

$$MSE = \frac{1}{x \cdot y} \sum_{i=0}^{x-1} \sum_{j=0}^{y-1} [P(i, j) - A(i, j)]^2$$

where $MAX_I$, P(i, j), and A(i, j) are the maximum, precise, and approximate values of each pixel, respectively, and x and y are the image dimensions. PSNR values greater than 40 dB are considered good [16].
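The image sharpening evaluation described above can be sketched in software as follows. This is only an illustrative model, not the authors' test bench: the border handling (clamping), the integer division by 273, the clipping of S to the 8-bit range, and the function names `exact_mac`, `smooth`, `sharpen`, and `psnr` are all assumptions introduced here. The `mac` parameter is a hypothetical hook where a model of an approximate MAC unit could be substituted for the precise one.

```python
import math

# 5x5 Gaussian kernel from the text; its coefficients sum to 273.
G = [
    [1, 4, 7, 4, 1],
    [4, 16, 26, 16, 4],
    [7, 26, 41, 26, 7],
    [4, 16, 26, 16, 4],
    [1, 4, 7, 4, 1],
]


def exact_mac(acc, a, b):
    """Precise multiply-accumulate (hypothetical hook for an approximate model)."""
    return acc + a * b


def smooth(image, mac=exact_mac):
    """R(i, j) = (1/273) * sum_k sum_l G(k+2, l+2) * I(i+k, j+l)."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0
            for k in range(-2, 3):
                for l in range(-2, 3):
                    # Assumption: coordinates are clamped at the image border.
                    ii = min(max(i + k, 0), h - 1)
                    jj = min(max(j + l, 0), w - 1)
                    acc = mac(acc, G[k + 2][l + 2], image[ii][jj])
            out[i][j] = acc // 273  # assumption: integer division
    return out


def sharpen(image, mac=exact_mac):
    """S = 2I - R, clipped to the 8-bit pixel range (clipping is an assumption)."""
    r = smooth(image, mac)
    return [[min(max(2 * p - q, 0), 255) for p, q in zip(row_i, row_r)]
            for row_i, row_r in zip(image, r)]


def psnr(precise, approx, max_i=255):
    """PSNR = 10 * log10(MAX_I^2 / MSE); returns infinity for identical images."""
    n = len(precise) * len(precise[0])
    mse = sum((p - a) ** 2
              for row_p, row_a in zip(precise, approx)
              for p, a in zip(row_p, row_a)) / n
    return math.inf if mse == 0 else 10 * math.log10(max_i ** 2 / mse)
```

To compare a precise and an approximate run, one would call `psnr(sharpen(img), sharpen(img, mac=approx_model))`, where `approx_model` is a user-supplied function with the same signature as `exact_mac`.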
Fig. 2. LeNet-5 Deep Neural Network

1 Note that there is a considerable difference between the power consumption of MULT shown in Fig. 2 and that reported in a previous study [8]. The difference is due to how test patterns are provided. In the previous study [8], input signals are provided in ascending order using a nested loop; thus, the toggle rate is low. In contrast, in this paper, randomly generated signals are provided; thus, the toggle rate is relatively high.
two values. In contrast, the proposed MAC unit utilizes an OR gate to compress two 1-bit values to a single value. Even if the logic for error recovery is considered, the total number of logic gates is reduced significantly. The tendency in area is similar to that in power, except for signed ESP2. While the power consumption of signed ESP2 is greater than that of signed CONV, the area of the former is smaller than that of the latter. Again, this shows that the sign converter incurs significant power overhead. Except for unsigned ESP1, the tendency in delay is also similar to that in power. The delay of ESP1 is less than that of unsigned CONV because the segmented adder is targeted to reduce delay. Among the approximate MAC units for both unsigned and signed implementations, the proposed MAC unit is the smallest for both area and delay. In addition, the overheads on area and delay over MULT are very small. Note that the unsigned design does not incur any delay overhead.

B. Accuracy

TABLE I. MRED (%)

Proposed                   ESP2
mask      U      S         design   U      S/U    S
0000000   0.75   5.24      #1       0.55   4.50   4.91
0000001   0.79   5.69      #2       0.66   5.38   5.80
0000011   0.88   6.72      #3       3.18   42.7   42.7
0000111   1.07   8.59      #4       8.71   132    132
0001111   1.42   11.6
0011111   2.07   16.5
0111111   3.13   23.3
1111111   4.77   32.4

TABLE II. PSNR (dB)

Proposed
mask      Bab    Bar    Len    Pep    avg
0000000   41.9   45.2   43.3   42.8   43.3
0000001   39.3   41.7   40.4   40.1   40.4
0000011   35.5   36.8   36.0   35.8   36.0
0000111   30.5   31.4   31.0   30.6   30.9
0001111   25.5   26.6   25.8   25.6   25.9
0011111   20.9   21.0   20.8   20.8   20.9
0111111   15.7   17.2   16.2   15.9   16.2
1111111   12.9   13.5   12.8   12.8   13.0

ESP2
design    Bab    Bar    Len    Pep    avg
#1        48.6   48.2   47.8   47.3   48.0
#2        39.6   39.5   39.3   39.1   39.4
#3        17.3   18.0   17.4   17.3   17.5
#4        10.8   11.2   10.7   10.8   10.9

Table I summarizes the MRED of the proposed MAC unit (left half in the table) and ESP2 (right half). ESP2 is better in power and area than ESP1; thus, the proposed MAC unit is compared to only ESP2. In Table I, "U", "S", and "S/U" mean that the MAC unit utilizes unsigned, signed, and "signed × unsigned" multipliers, respectively. Note that "mask" indicates mask bits for the CMAs. If the mask bit is "1", the CMA works as an OR gate. Otherwise, it works as a precise full adder. "design" indicates the four designs investigated in the literature [10], and design #1, #2, #3, and #4 truncate the least significant 0, 4, 8, and 10 bits, respectively. Both MAC units have scalability relative to accuracy. In addition, the signed and S/U implementations are worse in accuracy than the unsigned
implementations. As mentioned previously, S/U ESP2 utilizes
the Baugh-Wooley algorithm to handle a signed multiplicand
represented by two's complement. The approximate
compression used in ESP2 handles negative numbers
represented by two's complement poorly; thus, the MRED of S/U ESP2
becomes worse than that of the unsigned ESP2. Regarding the
signed design of the proposed MAC unit, the approximate sign
converter severely degrades the MRED, especially when the
absolute value is small. Thus, the MRED of the signed design of
the proposed MAC unit is larger than that of the unsigned one.
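The effect of the approximate sign converter on small magnitudes can be illustrated with a minimal software sketch of the conversion described in Section III.D (one's complement followed by an OR with "1" in the LSB). The 8-bit operand width, the pass-through of non-negative inputs, and the names `approx_magnitude` and `relative_error` are assumptions made for this illustration; the sketch models the converter in isolation, not the complete MAC unit.

```python
WIDTH = 8  # assumed operand width (the PP0..PP7 rows suggest 8-bit operands)


def approx_magnitude(x, width=WIDTH):
    """Approximate |x| for a two's-complement value x.

    Negative inputs are bit-inverted (one's complement) and the LSB is
    forced to 1, instead of the exact "invert and add one" negation.
    Non-negative inputs are passed through unchanged (an assumption).
    """
    mask = (1 << width) - 1
    bits = x & mask                  # two's-complement bit pattern of x
    if not (bits >> (width - 1)):    # sign bit clear: already non-negative
        return bits
    return (~bits & mask) | 1        # one's complement, then OR with 1 in the LSB


def relative_error(x, width=WIDTH):
    """RED of the converter alone: |approx - exact| / exact, for x != 0."""
    exact = abs(x)
    return abs(approx_magnitude(x, width) - exact) / exact


for x in (-1, -2, -4, -5, -100):
    print(x, approx_magnitude(x), relative_error(x))
```

In this model, odd negative values are converted exactly, while even negative values come out one too small. The absolute error is therefore at most 1, so the relative error is largest for small magnitudes (0.5 for -2, but 0.01 for -100), which is consistent with the observation above.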
Fig. 5. Power vs. MRED (Unsigned)

Fig. 5 shows the impact of the configurations on power and accuracy relative to the MRED of unsigned MAC units. Here, red circles (from left to right) correspond to configurations from 0000000 to 1111111. Similarly, blue crosses (left to right) correspond to designs #1 to #4. As expected, a smaller MRED requires more power. ESP2 has larger scalability in power than the proposed MAC unit does. However, the former has only static configurability, whereas the latter has dynamic configurability. In other words, the configuration of ESP2 must be determined at design time, whereas the configuration of the proposed MAC unit can be changed while it is operating. In addition, when configurations with good MRED are compared, the proposed MAC unit always consumes less power than ESP2.
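The dynamic configurability discussed above comes from the mask bits of the accuracy-configurable final addition. A bit-level behavioral sketch is given below under several assumptions: the CMA stage occupies bit positions 11 down to 5 (matching the "[11:5]" column label of Table III), bit n of the integer `mask` controls the CMA at bit position 5 + n, a masked CMA outputs the OR of its two inputs and produces no carry, and the function name `configurable_add` is hypothetical. It is not derived from the authors' Verilog.

```python
WIDTH = 16                 # width of the ACC result stated in the text
CMA_LOW, CMA_HIGH = 5, 11  # assumed CMA bit positions (seven bits)


def configurable_add(a, b, mask):
    """Add two 16-bit rows with run-time accuracy control.

    Bits below CMA_LOW are always OR-ed. Bits CMA_LOW..CMA_HIGH are OR-ed
    when the corresponding mask bit is 1, otherwise added precisely.
    Bits above CMA_HIGH are always added precisely (the CPA stage).
    """
    result = 0
    carry = 0
    for i in range(WIDTH):
        x = (a >> i) & 1
        y = (b >> i) & 1
        masked = CMA_LOW <= i <= CMA_HIGH and (mask >> (i - CMA_LOW)) & 1
        if i < CMA_LOW or masked:
            s, carry = x | y, 0          # approximate: OR gate, no carry out
        else:
            s = x ^ y ^ carry            # precise full adder
            carry = (x & y) | (carry & (x ^ y))
        result |= s << i
    return result                        # carry out of the top bit is discarded


# The same unit gives different accuracy depending on the mask chosen at run time.
print(hex(configurable_add(0x0020, 0x0020, int("0000000", 2))))
print(hex(configurable_add(0x0020, 0x0020, int("0000001", 2))))
```

Because `mask` is an ordinary argument in this sketch, accuracy can be traded for power on a per-operation basis, whereas a truncation-based design such as ESP2 fixes the number of dropped bits at design time.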
Fig. 6. Power vs. PSNR

The results of the image sharpening application are shown in Table II, which presents the PSNR of the four images and their average for different configurations. The image sharpening application only operates on unsigned values; thus, only the unsigned implementation was evaluated. As can be seen, the two
most accurate configurations have good PSNR for both the proposed MAC unit and ESP2. Fig. 6 shows how configuration affects power and accuracy in PSNR. Here, red circles (right to left) correspond to configurations 0000000 to 1111111. Similarly, blue crosses (right to left) correspond to designs #1 to #4, respectively. The results are very similar to the MRED cases. Higher PSNR requires greater power, and relative to power and PSNR, ESP2 has greater scalability than the proposed MAC unit. When the two MAC units are compared under conditions where good PSNR is required, the proposed MAC unit consumes less power than ESP2.

TABLE III. RECOGNITION RATE (%)

          configurable bit
mask      [11:5]   [10:4]   [9:3]
0000000   91.23    92.53    93.22
0000001   79.58    91.23    92.53
0000011   11.48    79.58    91.23

Table III shows the recognition rate of the handwritten digit recognition application. The application operates on signed values, and since the signed ESP2 consumes more power than CONV, only the signed design of the proposed MAC unit was evaluated. The "configurable bit" [11:5] column presents the results. The recognition rate drops suddenly; thus, only the three most accurate configurations are shown. Note that when recognition is based on precise floating-point addition, the recognition rate is 97.37%. If the floating-point MAC operation is replaced by the fixed-point one, it becomes slightly smaller (96.64%). Compared with these numbers, a recognition rate of 91.23% is sufficient. To investigate scalability in recognition rate, the sensitivity to the configurable bits was evaluated. The results indicate that the recognition rate improves as the configurable bits move to lower positions.

V. CONCLUSIONS

This paper has proposed a low-power approximate MAC unit with reduced area. Compared to a conventional MAC unit for unsigned values, the proposed MAC unit reduces power and area by 42.6% and 46.1%, respectively, without serious reduction in accuracy for unsigned operations. This was primarily achieved by distributing ACC over the PP rows. The MAC unit could be implemented as an approximate multiplier with longer PP rows. Using an image sharpening application, it was found that the unsigned design of the proposed MAC unit demonstrated sufficient accuracy. The approximation based on the OR operation is weak in handling signed numbers represented by two's complement; thus, the conventional Baugh-Wooley algorithm did not work effectively. Therefore, an approximate sign converter was proposed such that the signed design of the proposed MAC unit reduces power and area by 18.0% and 41.7%, respectively. The evaluations using handwritten digit recognition demonstrated that it functioned sufficiently for practical use.

ACKNOWLEDGMENT

This work was supported by JSPS KAKENHI Grant Number JP17K00088, by R&D FS Project of Fukuoka Industry, Science & Technology Foundation, and by funds (No.175007 and 177005) from the Central Research Institute of Fukuoka University. It was also supported by VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with Synopsys, Inc.

REFERENCES

[1] J. Han and M. Orshansky, "Approximate computing: an emerging paradigm for energy-efficient design," 18th European Test Symposium, 2013.
[2] K. Roy and A. Raghunathan, "Approximate computing: an energy-efficient computing technique for error resilient applications," IEEE Computer Society Annual Symposium on VLSI, 2015.
[3] V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan, "Analysis and characterization of inherent application resilience for approximate computing," 50th Design Automation Conference, 2013.
[4] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: a tutorial and survey," Proceedings of the IEEE, Vol. 105, No. 12, 2017.
[5] J. Emer, V. Sze, and Y.-H. Chen, "Hardware architectures for deep neural networks," Tutorial at International Symposium on Computer Architecture, 2017. http://www.rle.mit.edu/eems/wp-content/uploads/2017/06/ISCA-2017-Hardware-Architectures-for-DNN-Tutorial.pdf
[6] A. Cilardo, D. De Caro, N. Petra, F. Caserta, N. Mazzocca, E. Napoli, and A. G. M. Strollo, "High speed speculative multipliers based on speculative carry-save tree," IEEE Transactions on Circuits and Systems I, Vol. 61, No. 12, 2014.
[7] I. Qiqieh, R. Shafik, G. Tarawneh, D. Sokolov, and A. Yakovlev, "Energy-efficient approximate multiplier design using bit significance-driven logic compression," 21st Design, Automation & Test in Europe Conference & Exhibition, 2017.
[8] T. Yang, T. Ukezono, and T. Sato, "A low-power high-speed accuracy-controllable approximate multiplier design," 23rd Asia and South Pacific Design Automation Conference, 2018.
[9] D. Esposito, D. De Caro, E. Napoli, N. Petra, and A. G. M. Strollo, "On the use of approximate adders in carry-save multiplier-accumulators," International Symposium on Circuits and Systems, 2017.
[10] D. Esposito, A. G. M. Strollo, and M. Alioto, "Low-power approximate MAC unit," 13th Conference on Ph.D. Research in Microelectronics and Electronics, 2017.
[11] C. R. Baugh and B. A. Wooley, "A two's complement parallel array multiplication algorithm," IEEE Transactions on Computers, Vol. C-22, No. 12, 1973.
[12] Silvaco Inc., "PDK 45nm open cell library," https://www.silvaco.com/products/nangate/FreePDK45_Open_Cell_Library/
[13] C. Liu, J. Han, and F. Lombardi, "A low-power, high-performance approximate multiplier with configurable partial error recovery," 18th Design, Automation & Test in Europe Conference & Exhibition, 2014.
[14] M. S. Lau, K. V. Ling, and Y. C. Chu, "Energy-aware probabilistic multiplier: design and analysis," International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, 2009.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, Vol. 86, No. 11, 1998.
[16] D. Bull, Communicating Pictures - A Course in Image and Video Coding, Academic Press, 2014.
[17] J. Redmon, "Darknet: open source neural networks in C," https://pjreddie.com/darknet/
[18] Y. LeCun, C. Cortes, and C. J. C. Burges, "The MNIST database of handwritten digits," http://yann.lecun.com/exdb/mnist/