A 32-Bit Ripple-Ling Hybrid Carry Adder

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 71, NO.
6, JUNE 2024 2709

Ning Shang, Zhou Wang, Ruikang Liu, Yizhou Huang, Yin Zhang, Zhangqing He, and Meilin Wan
Abstract— The low-order bits of the Ling adder are not on the critical path, eliminating the need for a carry lookahead method to calculate their output sums. In this paper, we propose a hybrid carry adder that combines high-order Ling and low-order ripple techniques. The low 11 bits of the adder utilize a ripple-carry structure, while the high 21 bits employ a Ling-based parallel prefix structure. This approach simplifies the low-order sum circuit without compromising the critical path length of the adder. Furthermore, new intermediate variables are introduced to facilitate Shannon expansion and enable efficient implementation of the output sum. This ensures that the control signal of the output MUX maintains a delay consistent with its input signal. The output sum circuit is further custom designed using reusable logic circuits. The proposed adder is verified using the conventional 180 nm and 28 nm processes, as well as the advanced 14 nm FinFET process, with layout areas of 4557.5 µm², 193.2 µm², and 73.8 µm², respectively. Testing results show that the maximum delay is 0.83 ns, 0.312 ns, and 0.183 ns for the adder using the 180 nm, 28 nm, and 14 nm processes, respectively. The proposed adder provides an area optimization of approximately 10%∼30% and optimizations of 10% in power and speed compared to the conventional Ling adder.

Index Terms— Adder, ripple carry, Ling carry, low power, low cost.

Manuscript received 17 October 2023; revised 4 January 2024; accepted 4 January 2024. Date of publication 19 January 2024; date of current version 30 May 2024. This work was supported by the National Natural Science Foundation of China under Grant 62174050 and Grant 62271194. This article was recommended by Associate Editor J. Di. (Corresponding author: Meilin Wan.)
Ning Shang, Zhou Wang, Ruikang Liu, Yizhou Huang, and Meilin Wan are with the School of Microelectronics, Hubei University, Wuhan, Hubei 430064, China (e-mail: [email protected]).
Yin Zhang and Zhangqing He are with the School of Electrical and Electronic Engineering, Hubei University of Technology, Wuhan 430068, China.
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSI.2024.3352139.
Digital Object Identifier 10.1109/TCSI.2024.3352139
1549-8328 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

AS DEVICE sizes continue to shrink to the nanometer scale, the use of low power techniques has become more important than ever for the design of any complex VLSI chip like microprocessors and DSPs, which encompass various complex arithmetic operations such as subtraction, multiplication, division, and addition. They are typically implemented using one or multiple addition operations. Therefore, adders are the most used arithmetic units in complex VLSI chips [1]. Adders often reside on the critical path of digital circuits, directly affecting the overall speed of the system. Hence, optimizing the area, power consumption, and operation speed of adders is crucial for enhancing the performance of the entire system [2].

So far, significant research efforts have been devoted to shortening the critical path in multi-bit addition [3], which mainly focus on the optimization of sub-units in the 1-bit full adder and of the whole carry structure. In terms of customizing sub-units in the 1-bit full adder, Naseri and Timarchi use XOR/XNOR gates based on transmission gates (TG) to realize six new hybrid 1-bit full-adder circuits [4]. This design performs well when implementing 2- to 4-bit adders. However, the TG structure is unsuitable for cascading adders with a bitwidth exceeding 16, due to the substantial increase in delay. Bhattacharyya et al. [5] use XOR gates based on TG and MUX to generate the local sum and carry signal quickly. Due to the weak driving capability of the TG, intermediate buffers need to be added, resulting in additional hardware overhead and an increase in the critical path for the multi-bit adder. Similarly, the use of TG to realize a 1-bit full adder, as done in [6], achieves a low power delay product (PDP) but is only suitable for low operand cases. Although simple sub-units can reduce the area and delay for the 1-bit addition, they cannot be directly cascaded to realize multi-bit operations. Therefore, we should optimize the adder in both the sub-circuit and the whole carry structure to achieve excellent performance in multi-bit additions.

The common structures for generating carry signals in multi-bit adders include the Ripple Carry structure [7], Carry Select (CS) structure [8], and Look-Ahead Carry (LAC) structure. The LAC adder, which utilizes two intermediate signals (the carry propagation and carry generation signals) to simplify the logic for generating the carry and output sum, is the most commonly used high-speed adder [9]. In this paper, we will also optimize the 32-bit adder based on the LAC structure.

The adders that employ the LAC structure include the Kogge-Stone adder, Sklansky adder, Brent-Kung adder, Ladner-Fischer adder, and Han-Carlson adder [10]. They all build different carry trees based on the carry signal. Ling simplified the LAC structure by incorporating a pseudo carry, $H_i$ [11]. This design reduces the complexity of the logic compared to conventional C-based operations. Moreover, it can use other LAC carry tree structures with respect to $H_i$ to achieve high speed at minimal cost. Efstathiou et al. [12] and Mitra and Bakshi [13] construct Ling prefix carry trees by using Ladner-Fischer or Kogge-Stone operators to generate the carry propagation and generation signals. Moreover, the $H_i$ for odd and even-indexed bits are calculated independently, thus directly reducing the fan-out of the prefix tree and helping to decrease delay. However, they require the use of an additional 4-input CS adder to ensure the parallelism of high-order carry and sum operations when performing arithmetic on multi-bit operands of 32 bits or more, which increases hardware overhead and fails to reduce area while improving computational speed. Dimitrakopoulos and Nikolos [14] propose a two-level Ling structure, where the carry signals are divided into generation groups $gg_i^*$ and propagation groups $gt_j^*$. Computation of the carry is then performed based on the prefix network composed of $gg_i^*$ and $gt_j^*$. This design introduces more OR gates in the propagation path. With increasing operand bit
numbers, the additional cost cannot fully exploit the advantages of Ling's method, either. In [15], Quach and Flynn group the multi-bit operands, specifically three bits per group and three groups per block. Within each group, the local sum is calculated using conditional sum algorithms. The Ling method is then employed to generate the carries between blocks. This method reduces the number of serial transistors on the critical path, saves one gate delay in the addition operation, and is considered one of the high-performance adders. However, the division of the first summation block into three sum groups results in redundant logical operations for the low-order bits. Considering that the low-order summing computations involve few bits and simple logic, the use of a ripple carry structure will not influence the delay of the critical path for the entire adder. Therefore, there is no need to perform low-bit summation using the Ling LAC structure based on grouping [16].

In this paper, we propose an improved 32-bit adder that reduces area and power consumption while maintaining high speed by employing the ripple-Ling hybrid carry structure, based on the Ling adder presented in [15]. For the low order bits, we adopt the ripple carry structure to ensure a more compact generation of the output sum, while for the high order bits, we continue to use the Ling LAC structure to maintain a low critical path delay. Furthermore, the output sum circuits for all the bits are customized and optimized to reduce the delay on the critical path during circuit implementation. The contributions of this paper are:

(1) The adder adopts a structure that combines high-bit Ling carry and low-bit ripple carry: the low 11 bits use a ripple structure that does not affect the critical path, while the high 21 bits still use the Ling LAC structure. The use of the ripple carry structure for the low 11 bits simplifies the circuit and lowers the cost and power consumption.

(2) Mixed logic gates have been customized, such as simultaneously implementing NAND and NOR logic in XOR gates, to reduce the cost of basic logic gates. Moreover, the customized logic of $A_i B_i + C_{i-1}(A_i \oplus B_i)$ is used to realize the low-order 1-bit full adder and the high-order bits' output sum. This customized unit employs one TG and four transistors to transmit input to output or to pull the output up or down, thereby simplifying the circuit and reducing the hardware overhead.

(3) In the custom designed high-order sum generation circuit using the Ling carry structure, we have redefined multiple signals that serve as Shannon expansion variables. By increasing the delay of the Shannon expansion variable and reducing the delay of the signal it controls, the delay is more uniformly distributed between the control signal and the input signal for the output MUX, thereby realizing the delay balance for the output MUX and reducing the delay of the critical path.

The organization of the paper is as follows. Section II mainly elucidates the design principles of the conventional Ling adder. Section III explains the structure of the proposed ripple-Ling hybrid carry structure. Section IV describes the circuit implementation of the proposed adder. Section V presents the simulation and test results, along with an analysis and comparison with existing adders. Finally, the paper is concluded in Section VI.

II. BACKGROUND TECHNOLOGY

The Ling adder proposed in [15], as depicted in Fig. 1, consists of a 32-bit adder divided into four summing blocks. Each summing block is further divided into multiple summing groups. Specifically, Bit 8:0, Bit 17:9, and Bit 26:18 all have three summing groups, while Bit 31:27 are divided into a 3-bit summing group (Bit 29:27) and a 2-bit summing group (Bit 31:30). The carry signals, $H_2$, $H_8$, $H_{17}$, and $H_{26}$, between the blocks are generated by a global look-ahead carry block. Fig. 2 illustrates the structure of the Global Look-Ahead Carry module in Fig. 1. To generate the block carry signals, $H_2$, $H_8$, $H_{17}$, and $H_{26}$, the group carry generation signals $G_i^*$ and propagation signals $P_i^*$ are first generated. Then, $H_8$, $G_{bi}^*$ and $P_{bi}^*$ are generated based on $P_i^*$ and $G_i^*$. Finally, the carry signals $H_{17}$ and $H_{26}$ are generated based on $P_{bi}^*$ and $G_{bi}^*$.

Fig. 1. The structure of the Ling adder proposed in [15].

Let $A_{31\sim0}$ and $B_{31\sim0}$ represent the two 32-bit input binary numbers, and $S_{31\sim0}$ represent their output sums. The symbols + and ⊕ are used to represent the logical operations of OR and XOR, respectively [17]. In binary addition, $g_i$, $p_i$, and $s_i$ represent the generation signal, propagation signal, and the local sum of Bit i [16], respectively:

$$g_i = A_i \cdot B_i, \quad p_i = A_i + B_i, \quad s_i = A_i \oplus B_i \tag{1}$$

The output carry and the output sum of Bit i can be respectively expressed as

$$C_i = C_{i-1} p_i + g_i, \quad S_i = s_i \oplus C_{i-1} \tag{2}$$

By iteratively applying the expression of $C_i$ in (2) i times, $C_i$ can be expanded as follows

$$C_i = g_i + p_i g_{i-1} + p_i p_{i-1} g_{i-2} + \ldots + p_i p_{i-1} \ldots p_1 g_0 \tag{3}$$

In each group comprised of three consecutive input bits $A_{i\sim i-2}$ and $B_{i\sim i-2}$, the output carry signal $C_i$ provided by the current group to the next group can be expressed as

$$C_i = g_i + g_{i-1} p_i + g_{i-2} p_i p_{i-1} + p_{i-2} p_{i-1} p_i C_{i-3} \quad (i \ge 3) \tag{4}$$

By defining the group generation and propagation signals $G_i$ and $P_i$ as

$$\begin{aligned} G_i &= g_i + g_{i-1} p_i + g_{i-2} p_i p_{i-1} \\ P_i &= p_{i-2} p_{i-1} p_i \end{aligned} \tag{5}$$
(4) is rewritten as

$$C_i = G_i + P_i C_{i-3} \tag{6}$$

When using the Ling carry propagation structure [18], it has

$$C_i = p_i H_i \tag{7}$$

where the Ling carry signal $H_i$ is defined as $H_i = g_i + g_{i-1} + p_{i-1} g_{i-2} + \ldots + p_{i-1} \ldots p_1 g_0$ according to (3). From (7), it can be deduced that

$$C_{i-3} = p_{i-3} H_{i-3} \tag{8}$$

After substituting (4) and (8) into (7), the relationship between $H_i$ and $H_{i-3}$ can be obtained as

$$H_i = G_i^* + P_i^* H_{i-3} \quad (i > 3) \tag{9}$$

where $G_i^*$ and $P_i^*$ are defined as the Ling group generation and propagation signals respectively.

$$\begin{aligned} G_i^* &= g_i + g_{i-1} + g_{i-2} p_{i-1} \\ P_i^* &= p_{i-3} p_{i-2} p_{i-1} \end{aligned} \tag{10}$$

From (5), (6), (9), and (10), in the Ling carry structure, there is a reduction of one $p_i$ operation for $G_i^*$ and $H_i$ compared to $G_i$ and $C_i$. Moreover, upon comparing (5) and (10), it can be observed that the generation of $G_i^*$ is more compact, and the generation of the carry signal in the Ling structure exhibits a simpler design.

Fig. 2. The carry tree using Ling method in [15].

Next, the carry and the output sum for each bit will be determined by combining their respective carry signals based on the structure illustrated in Fig. 1. The complete expressions of $C_{30\sim0}$ and $S_{31\sim0}$ are shown in Fig. 3. Specifically, $C_{30\sim0}$ are obtained using the look-ahead method and established based on the carry tree depicted in Fig. 2. The output sums, $S_{31\sim0}$, are obtained by XOR-ing $C_{30\sim0}$ with the local sums $s_{31\sim0}$ and then optimized using Shannon expansion [15]. Fig. 3 also presents the expressions for the inter-block carry signals, $H_8$, $H_{17}$, $H_{26}$, as well as the intermediate variables used for Shannon expansion. This ensures that all expressions are ultimately derived based on $G_i^*$, $P_i^*$, $g_i$, $p_i$, and $s_i$.

For the Ling method used in [15], the use of group generation and propagation signals $G_i^*$, $P_i^*$, $G_{bi}^*$ and $P_{bi}^*$, and the block carry signals, $H_2$, $H_8$, $H_{17}$, and $H_{26}$, ensures that as the operands increase, the high-order operands can be calculated in parallel instead of waiting for the carry signals from the low-order operands. However, the generation of these signals necessitates multiple modules, which leads to an increase in hardware overhead. In particular, the delay in generating $S_{26}$ represents the critical path of the adder, while the delays in producing the output sum of the low-order bits are smaller. This implies that there is no need to employ an excessive amount of resources for generating the output of the low-order bits. Therefore, we can actually reduce the generation speed of the low bits' output sum while keeping the critical path of the entire adder unchanged, thereby further reducing area and power consumption.

Specifically, based on the expressions of $S_8 \sim S_0$ in Fig. 3, we can see the generation of $S_8$ is the slowest among $S_0$ to $S_8$. As shown in Fig. 4, the delay to get $S_8$, which is the delay of 3 compound logic gates ($T_{I1}$, $T_{I2}$, and $T_{I3}$) plus 1 MUX ($T_{I4}$), is smaller than the delay of $S_{26}$, which is about the propagation time of 3 compound logic gates ($T_{I1}$, $T_{I2}$, and $T_{I3}$) plus 2 MUXs ($T_{I4}$ and $T_{I5}$). In this situation, even if the logic for the summation of the low-order 9 bits is optimized using Ling's structure, there would be no improvement in the overall speed of the adder. Instead, a low-cost ripple carry approach can be used to implement the sum of the low-order 9 or even more bits. This approach does not affect the critical path of the entire adder, while reducing the cost of the low-order 9 or even more bits.

Moreover, the variables used for Shannon expansion can also be optimized for high-order bits. As shown in Fig. 4(b), the conventional method treats the block carry signal $H_{17}$ as a variable in the first Shannon expansion, and then uses $p_{b7}$ and $g_{b7}$ as the variables for the second Shannon expansion. The input signal of the final MUX experiences a delay of 3 compound logic gates ($T_{I1}$, $T_{I2}$, and $T_{I3}$), in addition to 2 MUXs ($T_{I4}$ and $T_{I5}$). In contrast, the delay of the final MUX's control signal $H_{17}$ consists of 3 compound logic gates' delay ($T_{C1}$, $T_{C2}$, and $T_{C3}$) combined with $T_{C4}$, which is the delay from the control signal to the output of the MUX. Since the delay from the input signal of the final MUX to the output $S_{26}$, represented by $T_{I5}$, is higher than that from the control signal of the final MUX to the output $S_{26}$, represented by $T_{C4}$, the arrivals of the input and control signals for the final MUX are not simultaneous. The delay from the input signal to $S_{26}$ reaches approximately 5 logic gates' delay, while the delay from the control signal to $S_{26}$ is approximately that of 3.5 logic gates, where the delay from the input and control signals to the output of one MUX is considered approximately 1 and 0.5 normal logic gate's delay respectively. We can hence appropriately increase the delay of the MUX's control signal and reduce the delay of its input signal, that is, reasonably realize the delay balance for generating $S_{26}$, to reduce the critical path of the adder.

III. STRUCTURE OF PROPOSED RIPPLE-LING HYBRID CARRY ADDER

In order to address the issue of the conventional Ling structure utilizing complex logic for the summation of low order bits, this paper presents a new structure that combines the Ling-carry structure for the high-order bits and the ripple-carry structure for the low-order bits. By utilizing the low delay of the Ling structure on the critical path and the low cost of the ripple structure on the non-critical path, the proposed adder ensures high-speed computation while reducing hardware consumption. The overall structure of the proposed 32-bit adder is shown in Fig. 5. It is still divided into four summation blocks, and the grouping in each block remains unchanged. The carry tree is the same as that shown in Fig. 2, but Bit 10∼0 utilize a ripple carry structure. Next, we will provide the specific logical expressions for the carry and output sum of Bit 10∼0, and explain why the ripple carry extends only up to Bit 10.

The output carry of Bit 0 is $C_0 = A_0 B_0$. In the ripple carry structure, based on the recursive equation $S_i = s_i \oplus C_{i-1}$,
we can obtain $S_1$ as:

$$S_1 = s_1 \oplus C_0 \tag{11}$$

Moreover, $C_1$ and $S_2$ are respectively

$$C_1 = C_0 (A_1 \oplus B_1) + A_1 B_1 \tag{12}$$

and

$$S_2 = s_2 \oplus C_1 \tag{13}$$

As shown in Fig. 2, since the computation of the block carry signal $H_8$ includes the term $G_2^*$ (which is also $H_2$), the carry signal $G_2^*$ is retained, and the carry signal $C_2$ can be quickly obtained using $G_2^*$:

$$C_2 = p_2 G_2^* \tag{14}$$

Fig. 3. The carry and output sum expressions of the Ling adder proposed in [15].

Fig. 4. The gate level circuit to get (a) $S_8$ and (b) $S_{26}$ in [15]'s Ling method.

$S_3$, $C_3$, $S_4$, $C_4$, and $S_5$ are calculated as follows when using the ripple carry structure:

$$\begin{aligned} S_3 &= s_3 \oplus C_2 \\ C_3 &= C_2 (A_3 \oplus B_3) + A_3 B_3 \\ S_4 &= s_4 \oplus C_3 \\ C_4 &= C_3 (A_4 \oplus B_4) + A_4 B_4 \\ S_5 &= s_5 \oplus C_4 \end{aligned} \tag{15}$$

The generation of the carry signals $C_3$ and $C_4$ is dependent on the prior generation of $C_2$, as illustrated in Fig. 6. Therefore, $S_5$ experiences a delay that is two stages greater than that of $S_3$. Next, we will compare the generation circuits of $S_5$ and $S_{26}$ to analyze the effectiveness of utilizing a ripple carry structure for the low-order bits. From Fig. 4 and Fig. 6, we can see that the generation of $S_{26}$ requires a delay of 3 compound gates and 2 MUXs, while $S_5$ requires a delay of 1 $G_i^*$-GEN compound logic gate, 1 three-input AND gate, 2 $C_i$-GEN compound logic gates which are used to realize $C_i = C_{i-1} s_i + g_i$, and 1 MUX from its control signal to the output [4]. If it is possible to design 2 $C_i$-GEN compound logic gates with a delay lower than 2 MUXs, the delay of $S_5$ can be lower than that of $S_{26}$, and the output sum of Bit 5∼0 can all be implemented using the ripple carry structure. As shown in Fig. 6, fewer AND gates and OR gates are needed for the ripple carry structure since it does not need to compute g and p.

The carry signal $C_5$ can be obtained as follows:

$$C_5 = C_4 (A_5 \oplus B_5) + A_5 B_5 \tag{16}$$

However, as shown in Fig. 6, waiting for the arrival of the $C_4$ signal nearly requires a delay of four compound logic gates. If we simply connect Bit 6 and Bit 7 after Bit 5 in a ripple structure, the path length of generating $C_5$ will exceed the delay of four compound logic gates, and $C_6$ will even have a delay of six compound logic gates, exceeding the critical path $S_{26}$. Fortunately, as illustrated in Fig. 2, the globally generated $G_2^*$, $G_5^*$, $P_5^*$ will be used to generate the block carry signal $H_8$, and we can also use the signals $G_2^*$, $G_5^*$, and $P_5^*$ to quickly obtain the carry signal $C_5$, thereby ensuring that the delay of Group 1 does not exceed the critical path of the whole adder. From (7) and (9) with $H_2 = G_2^*$,

$$C_5 = p_5 \left( G_5^* + P_5^* G_2^* \right) \tag{17}$$

Hence, we can calculate $S_6$ as:

$$S_6 = s_6 \oplus \left[ p_5 \left( G_5^* + P_5^* G_2^* \right) \right] \tag{18}$$

In this way, generating the output sum $S_6$ only requires a delay of 2 compound logic gates, where $G_5^*$, $P_5^*$, and $G_2^*$ have the delay of 1 logic gate, and calculating $C_5$ from $G_5^*$, $P_5^*$, and $G_2^*$ consumes the delay of 1 more logic gate. Then, if a MUX is employed to implement the final XOR gate and output $S_6$, the total delay is also maintained at approximately 2.5 logic gates' propagation time by utilizing $C_5$ as a control signal. The output carry of Bit 6 in the ripple structure is

$$C_6 = C_5 (A_6 \oplus B_6) + A_6 B_6 \tag{19}$$

The output sum and carry of Bit 7 can be directly obtained as:

$$\begin{aligned} S_7 &= s_7 \oplus C_6 \\ C_7 &= C_6 (A_7 \oplus B_7) + A_7 B_7 \end{aligned} \tag{20}$$
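The two look-ahead shortcuts used inside the ripple section, $C_2$ in (14) and $C_5$ in (17), can be checked exhaustively against ordinary binary addition. The Python sketch below is illustrative only and is not the authors' circuit. The function names (`bits`, `carry_out`, `ling_c2_c5`, `check_all`) are hypothetical, and $C_5$ is read here as $p_5 H_5$ with $H_5 = G_5^* + P_5^* G_2^*$, which is what (7), (9), and (10) give for i = 5.

```python
def bits(x, n):
    """Little-endian list of the n low bits of x (index 0 is Bit 0)."""
    return [(x >> i) & 1 for i in range(n)]


def carry_out(a, b, i):
    """Reference carry out of Bit i, taken from ordinary integer addition."""
    mask = (1 << (i + 1)) - 1
    return ((a & mask) + (b & mask)) >> (i + 1)


def ling_c2_c5(a, b):
    """C2 and C5 built from the Ling group signals of (10).

    Assumption: C5 = p5 * H5 with H5 = G5* + P5* G2*, following (7) and (9).
    """
    A, B = bits(a, 6), bits(b, 6)
    g = [x & y for x, y in zip(A, B)]   # generation signals, (1)
    p = [x | y for x, y in zip(A, B)]   # propagation signals, (1)
    G2 = g[2] | g[1] | (g[0] & p[1])    # G2*, which is also H2
    G5 = g[5] | g[4] | (g[3] & p[4])    # G5*
    P5 = p[2] & p[3] & p[4]             # P5*
    H5 = G5 | (P5 & G2)                 # (9) with i = 5
    return p[2] & G2, p[5] & H5         # (14) and C5 = p5 * H5


def check_all():
    """Compare the shortcut carries with real addition for all 6-bit operands."""
    for a in range(64):
        for b in range(64):
            if ling_c2_c5(a, b) != (carry_out(a, b, 2), carry_out(a, b, 5)):
                return False
    return True
```

Calling `check_all()` covers all 4096 operand pairs. The check uses no carry-in to Bit 0, matching $C_0 = A_0 B_0$ in the text.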
Fig. 5. The structure of the proposed ripple-Ling hybrid carry 32-bit adder.

Fig. 6. Circuits using Ling carry structure and ripple-carry structure for generating $S_0 \sim S_5$.

Similarly, the output sum of Bit 8 can be obtained as

$$S_8 = s_8 \oplus C_7 \tag{21}$$

Fig. 7 gives the circuit of Group 2 when using the ripple carry structure, and we can see that the delay of $C_7$ is the delay of 4 compound logic gates. Similarly to Bit 5, since $C_7$ and $S_8$ are used as the control and input signals for the output MUX respectively, if we can design the 1 $P_i^*$/$G_i^*$-GEN, 3 $C_i$-GEN compound logic gates, and 1 MUX from its control signal to the output to have a total delay lower than the delay of 3 compound gates plus 2 MUXs, the delay of $S_8$ will be lower than that of $S_{26}$. In this case, Bit 8∼6 can all be implemented using the ripple carry structure.

Fig. 7. Circuits using Ling carry structure and ripple-carry structure for generating $S_6 - S_8$.

The next summing group needs to wait for the block carry signal $H_8$, which introduces two additional logic operations compared to the summing group in Block 0. If the ripple-carry structure is extended to Block 1, as shown in Fig. 8, the delay of generating $S_{10}$ and $S_{11}$ becomes 4.5 logic gates' propagation delay ($T_{I1}$, $T_{I2}$, $T_{I3}$, $T_{I4}$, $T_{C1}$) and 5.5 logic gates' propagation delay ($T_{I1}$, $T_{I2}$, $T_{I3}$, $T_{I4}$, $T_{I5}$, $T_{C2}$) respectively. However, the total delay of the critical path $S_{26}$ for the entire adder is also about that of 5 logic gates. So the delay of $S_{10}$ is similar to that of $S_{26}$. If the ripple carry structure is further extended to Bit 11, it will reach 5.5 logic gates' delay. Consequently, it will exceed the critical path of the entire adder. Therefore, the ripple-carry structure is not suitable for the summation of Bit 11 and beyond and is only applied to the summation of Bit 10∼0.

Fig. 8. Circuits using Ling carry structure and ripple-carry structure for generating $S_9$ and $S_{10}$.

For Bit 9, its output sum is

$$S_9 = s_9 \oplus C_8 \tag{22}$$

Its output carry and the output sum of Bit 10 are

$$\begin{aligned} C_9 &= C_8 (A_9 \oplus B_9) + A_9 B_9 \\ S_{10} &= s_{10} \oplus C_9 \end{aligned} \tag{23}$$

The modification of the low 11-bit structure, although increasing the delay to get their output carries, does not increase the critical path of the entire adder. In the meantime, from Fig. 6, 7, and 8, it can be observed that the ripple carry structure has lower hardware cost. Using the ripple carry structure in low-order bits and the Ling carry structure in high-order bits reduces hardware resource consumption without impacting the calculation speed of the adder.

IV. CIRCUIT IMPLEMENTATION OF THE PROPOSED ADDER

In this section, we will present the detailed circuit implementation of the entire adder based on the proposed new ripple-Ling hybrid carry structure. Due to its strong noise immunity and low power consumption, this paper adopts the static circuit approach for circuit implementation. The circuit will be optimized in the following two aspects. First, custom designed compound logic gates, such as NAND, NOR, and XOR-mixed gates, as well as the custom designed $A_i B_i + X(A_i \oplus B_i)$ operator, are used to achieve area optimization while ensuring fast computation. Next, in order to reduce the delay of output sums with long paths, such as $S_{26}$, we optimize the Shannon expansion variables to align the delays of the control and input signals for the final MUX, further reducing overall latency.

A. XOR-NAND-NOR Mixed Logic Gate

Firstly, we propose a compact circuit that simultaneously realizes XOR, NAND, and NOR operations. The XOR logic is the most commonly used arithmetic unit in addition logic. The simple static circuit for XOR logic is shown in Fig. 9, utilizing 10 transistors. It can be observed that the NOR operation is incorporated in the XOR logic and can be achieved using the left four transistors. Furthermore, it also contains a parallel
PMOS pull-up circuit controlled by A and B. If a series connected NMOS pull-down circuit controlled by A and B can be added to point C without affecting the original function of the XOR gate, the NAND gate can also be integrated into the XOR gate. As illustrated in Fig. 9 (a), an NMOS pull-down path is added to point C. If either A or B is 0, or both of them are 0, point C remains pulled up to 1, and the newly added NMOS pull-down path remains inactive, thereby maintaining the original function of the XOR gate. When A and B are both 1, $Y_{NOR}$ is 1, MPX is off, and the logic value at point C does not affect the XOR gate output, either. Now, point C is pulled down to 0 by the added pull-down NMOS path, and it can be used as the output of the NAND logic. Therefore, integrating the NAND gate into XOR will not affect the original XOR functionality, and we design a logic gate circuit that simultaneously integrates XOR, NAND, and NOR functions, which is shown in Fig. 9 (a). This circuit can be further adapted to an XOR-NOR mixed logic gate or an XOR-NAND mixed logic gate, which are depicted in Fig. 9 (b) and (c) respectively. Integrating NOR and NAND in the XOR circuit will inevitably increase the load on the internal nodes of the XOR gate, leading to increased delay for the output of XOR. However, in this design, the integration of NOR and NAND is primarily used in the XOR which generates the local sum $s_i$, and $s_i$ is usually not in the critical path to generate one bit's output sum, thereby not affecting the delay of the output sum. For example, as shown in Fig. 7, the slight increase in the delay of $s_6 \sim s_8$ will not affect the overall delay of the output sums $S_6 \sim S_8$, which means utilizing a customized XOR-NAND-NOR mixed gate allows for cost reduction while not impacting the critical path.

Fig. 9. (a) XOR-NAND-NOR, (b) XOR-NOR, and (c) XOR-NAND mixed logic gates.

B. $C_i$-GEN

As shown in (12), (15), (16), (17), (20) and (23), the output carry $C_i$ using the ripple structure is commonly expressed as $A_i B_i + C_{i-1}(A_i \oplus B_i)$. If standard logic gates are used to generate this signal, as shown in Fig. 10, it would require two AND gates, one OR gate, one XOR gate, and three inverters. In order to achieve the expression $C_i = A_i B_i + C_{i-1}(A_i \oplus B_i)$ more concisely, as depicted in Fig. 10, a highly compact and customized circuit is used [15]. Only when both $A_i$ and $B_i$ are 0 or 1, the pull-up or pull-down module will work, making $Y = A_i B_i$. At this time, the TG is off, $Y = A_i B_i$ is selected as the output $C_i$, and the output delay is approximately that of a NAND gate, independent of $C_{i-1}$. When only one of $A_i$ and $B_i$ is 1, the TG is on, and $C_{i-1}$ is selected as the final output $C_i$ and determines the output delay of $C_i$. There is one TG gate on this path, with a delay lower than the MUX composed of two TGs.

Fig. 10. Custom design to realize $C_i$-GEN.

When both $A_i$ and $B_i$ are either 0 or 1, regardless of the number of cascaded circuits preceding it, the output delay of $C_i$ is only decided by $A_i B_i$ and is independent of the value of the input $C_{i-1}$, so the delay of $C_i$-GEN is about that of a NAND gate. However, when only one of $A_i$ and $B_i$ is 1, the output $C_i$ is determined by $C_{i-1}$. In this case, the circuit for computing $C_i$ can be seen as a TG gate, with a delay slightly smaller than that of a MUX composed of two TG gates. Therefore, as shown in Fig. 6, 7, and 8, the delay for generating $S_5$, $S_8$, and $S_{10}$ will be smaller than the delay for $S_{26}$, ensuring the effectiveness of using the ripple carry structure for the lower 11 bits as proposed in this paper.

The designed carry module can realize the logic of $A_i B_i + C_{i-1}(A_i \oplus B_i)$, and can be extended to realize $Y = A_i B_i + X(A_i \oplus B_i)$ for any X input. Moreover, this operator can also achieve the operation of $Y = g_i + p_i X$. When $A_i$ and $B_i$ are the same, $Y = g_i$. Otherwise, when $A_i$ and $B_i$ are not the same, $p_i = 1$ and $g_i = 0$, therefore $Y = X$, which has the same functionality as the $A_i B_i + X(A_i \oplus B_i)$ operator and can also be implemented using the custom designed $C_i$-GEN circuit. As Fig. 3 shows, after utilizing Shannon expansion to implement the output sum, it will involve numerous operators similar to $Y = g_i + p_i X$. By employing this circuit to realize such operators, the cost can be significantly reduced.

C. Optimization of Shannon Expansion

As discussed in Section II, the conventional method that treats the block carry signal $H_{17}$ as a variable for Shannon expansion cannot realize delay balance for the realization of $S_{26}$, and we can appropriately increase the delay of the control signal and decrease that of the input signal for the output MUX. In this paper, $k = p_{b7} H_{17} + \overline{H_{17}}\, g_{b7}$ is used as the Shannon expansion object, and $S_{26}$ can be rewritten as

$$S_{26} = \left[ s_{26} \oplus (g_{25} + p_{25} p_{24}) \right] k + \left[ s_{26} \oplus (g_{25} + p_{25} g_{24}) \right] \overline{k} \tag{24}$$

Fig. 11. Output sum generation circuit of Bit 26 when realizing the Shannon expansion balance.

Fig. 12. Output sum generation circuit of Bit 2∼0.
Fig. 13. 1-bit full adder using ripple carry structure.
Fig. 14. (a) $G_i^*$-GEN and (b) H8-GEN used to generate $G_i^*$ and H8 respectively.
Fig. 15. Output sum generation circuit of Bit 5∼3.
Fig. 16. Output sum generation circuit of Bit 8∼6.

The new structure to realize S26 according to (24) is shown in Fig. 11. The delay from the input of the final MUX to S26 is about 4 logic gates' propagation delay (TI1, TI2, TI3, TI4), while the delay from the control signal to S26 is also about that of 4 logic gates (TC1, TC2, TC3, TC4, TC5, where TC4 and TC5 represent the delay from the MUX's control signal to its output signal, both estimated to be about 0.5 logic gate's delay). As a result, the delays from both the input and the control signal of the MUX to S26 reach 4 logic gates' delay, and the overall delay to generate S26 is reduced from 5 to approximately 4 logic gates' delay, an optimization ratio of about 20%. For other bits whose output sums are time-consuming to generate, a similar Shannon expansion optimization can be applied. Although the proposed method does not theoretically shorten the critical path length of the circuit, it improves the calculation speed of S26 in the implementation of the circuits. By restructuring the object of Shannon expansion, the adder can achieve optimization in terms of delay.

D. Output Sum Generation Circuit for Each Bit
Next, we will generate the output sum for each bit using the optimization methods mentioned above. For the output sum generation circuit of Bit 2, according to Fig. 3, the local bit sum s2 needs to be XORed with the carry signal C1 to obtain the output sum S2. To further simplify the circuit, as shown in Fig. 12, we use the MUX composed of two TGs to implement the final XOR logic, in which s2 acts as the input signal and C1 is the control signal. An inverter is added to the output of the MUX to obtain the final S2 and ensure its output driving capability. Based on this scheme, the output sum generation circuit of S2 is designed as shown in Fig. 12. Furthermore, it utilizes the custom designed circuit described in the previous part, Ai Bi + X (Ai ⊕ Bi ), to achieve Ci.
Since the 3 bits in each summation group of Block 0 have a ripple carry structure, they will be connected in series using a group of 1-bit full adders with the same structure. Therefore, we encapsulate the circuit for computing the output S2 into a 1-bit full adder module, as shown in Fig. 13. For the 1-bit full adder that is required to output both p and g, we utilize the XOR-NOR-NAND gate shown in Fig. 9(a) to obtain p and g simultaneously, and this adder is named FA1. For units that only need to output g or p individually, only the NAND or the NOR function is integrated in the XOR, and these adders are named FA2 and FA3 respectively. Additionally, the last bit in a group does not need to provide Ci; it only needs to provide pi to the next group. Thus, the 1-bit full adder FA4, which uses the XOR-NOR mixed gate and does not include Ci-GEN, is employed.
For Block 0, as discussed in Section II, the input carries of Groups 1, 2, and 3, which are C2, C5, and C8 respectively, employ the Ling carry structure. So, it is necessary to use simplified circuits to implement $G_i^*$, $P_i^*$, $G_{bi}^*$, $P_{bi}^*$, and ultimately achieve C2, C5, and C8. For $P_i^*$, according to their expressions in Fig. 3, a three-input NAND gate is used, and this circuit is named $P_i^*$-GEN. As for $G_i^*$, literature [15] has optimized the circuit, where the longest path in both the pull-up and pull-down paths consists of only three transistors in series, which is shown in Fig. 14 (a). This circuit structure is named $G_i^*$-GEN. $G_2^*$, $G_5^*$, ..., $G_{29}^*$ and $P_5^*$, $P_8^*$, ..., $P_{29}^*$, as Fig. 2 shows, will all be obtained by using the above $G_i^*$-GEN and $P_i^*$-GEN circuits.
After obtaining $G_2^*$, according to Fig. 3, p2 and $G_2^*$ are used to generate the carry signal C2, which is then connected to the input carry of the Bit 3 full adder for ripple carry realization. The last full adder of summation Group 1 only needs to provide S5 and p5 for the input carry calculation of Bit 6. The complete circuit structure for computing the output sum of Bit 5∼3 is shown in Fig. 15, where FA4, FA3, and FA3 are used to realize the 1-bit full adders for Bit 5∼3 respectively.
The whole ripple-carry based adder for Group 2 is shown in Fig. 16. We first design a compound gate to generate the input carry signal C5 according to Fig. 3. Similarly, the 1-bit full adder of Bit 8 outputs a carry propagation signal p8 instead of C8 to the next summation group. As the last group of circuits in Block 0, it also needs to provide the block carry signal H8 to Block 1. According to the H8 expression in Fig. 3, we designed the H8-GEN circuit to generate the block carry signal H8, which is shown in Fig. 14 (b).
As shown in Fig. 17, Bit 9 and Bit 10 in Group 3 of Block 1 still use a ripple carry structure, while Bit 11 uses the Ling carry structure to generate its input carry. First, as Fig. 3 depicts, g9 and p9 need to be used to get S11 by using the Ling method. Therefore, in the output sum generation circuit of Bit 9, the XOR_NOR_NAND logic gate is used in FA1 to generate signals s9, g9, and p9 simultaneously. Next, in order to simplify the S11 generation circuit, we define an intermediate signal k0 = C8 = p8 H8 as a variable to perform Shannon expansion, then S11 as shown in Fig. 3 can be rewritten as

$$S_{11} = k_0\,\{s_{11} \oplus [p_{10}(g_{10} + p_9)]\} + \bar{k}_0\,\{s_{11} \oplus [p_{10}(g_{10} + g_9)]\} \qquad (25)$$
Fig. 17. Output sum generation circuit of Bit 11∼9.
Fig. 19. The output sum generation circuit for Bit 14∼12.
Fig. 18. (a) pb3 -GEN and (b) gb3 -GEN used to generate pb3 and gb3
respectively.
Fig. 20. (a) pb4-GEN and (b) gb4-GEN used to generate pb4 and gb4 respectively.
Fig. 21. The output sum generation circuit for Bit 17∼15.

From Fig. 3, the conventional structure based on H8 for Shannon expansion causes large delay and complexity in the remaining term s11 ⊕ [p10 (g10 + g9 + p9 p8 )]. In contrast, as (25) depicts, the method proposed in this paper is based on k0 for Shannon expansion and can achieve a similar delay between the object and the remaining term of the Shannon expansion, which are k0 = p8 H8 and s11 ⊕ [p10 (g10 + p9 )] respectively. Hence, when it is realized using a MUX, the control signal k0 = p8 H8 and the input signal s11 ⊕ [p10 (g10 + p9 )] of the output MUX will arrive essentially simultaneously. The delay of the MUX input signal, and subsequently the delay of the entire S11, will be reduced.
Moreover, as Fig. 10 shows, the customized operator Ci-GEN, mentioned earlier, is used to implement the sub-logic [p10 (g10 + p9 )] and [p10 (g10 + g9 )].
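The rewriting in (25) can be checked exhaustively against the direct form S11 = s11 ⊕ C10. The following is a minimal sketch, not the authors' verification flow. It assumes the usual Ling-style bit signals g_i = a_i AND b_i, p_i = a_i OR b_i, and s_i = a_i XOR b_i (the exact definitions are in Fig. 3, which is not reproduced here), and it treats k0 = C8 as a free input. The function names are hypothetical.

```python
from itertools import product


def s11_direct(a9, b9, a10, b10, a11, b11, c8):
    """Reference form: S11 = s11 XOR C10, with C10 rippled from C8."""
    c9 = (a9 & b9) | ((a9 | b9) & c8)
    c10 = (a10 & b10) | ((a10 | b10) & c9)
    return a11 ^ b11 ^ c10


def s11_shannon_k0(a9, b9, a10, b10, a11, b11, k0):
    """Form (25): a 2:1 MUX whose control is k0 = C8."""
    g9, p9 = a9 & b9, a9 | b9          # assumed: g = AND, p = OR
    g10, p10 = a10 & b10, a10 | b10
    s11 = a11 ^ b11
    when_k0_is_1 = s11 ^ (p10 & (g10 | p9))
    when_k0_is_0 = s11 ^ (p10 & (g10 | g9))
    return when_k0_is_1 if k0 else when_k0_is_0


def check_eq25():
    """Compare both forms over all 2**7 input combinations."""
    return all(
        s11_direct(*bits) == s11_shannon_k0(*bits)
        for bits in product((0, 1), repeat=7)
    )


if __name__ == "__main__":
    print(check_eq25())
```

Under these assumptions the two forms agree for every input, which relies on g_i implying p_i; the check would not pass if p_i were defined as a_i XOR b_i.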
Group 4 includes Bit 14∼12, which are also optimized using a new Shannon expansion variable. As shown in Fig. 18, the intermediate signals gb3 and pb3 are designed, where gb3 is obtained with a NAND gate, and pb3 is obtained with a compound logic gate circuit called pb3-GEN.
From the expression of S12 in Fig. 3, we have performed a re-optimization of the implementation of S12 by rewriting it as

$$S_{12} = s_{12} \oplus k_1 \qquad (26)$$

where $k_1 = p_{b3} H_8 + \bar{H}_8 g_{b3}$.
Similarly, for S13, we first apply Shannon expansion to gb3, pb3, and H8, and then perform recombination and optimization to finally achieve Shannon expansion for k1. The expression of S13 is simplified as

$$S_{13} = (s_{13} \oplus p_{12})\,k_1 + (s_{13} \oplus g_{12})\,\bar{k}_1 \qquad (27)$$

S14 is optimized as

$$S_{14} = [s_{14} \oplus (g_{13} + p_{13} p_{12})]\,k_1 + [s_{14} \oplus (g_{13} + p_{13} g_{12})]\,\bar{k}_1 \qquad (28)$$

When generating the logic terms (g13 + p13 g12 ) and (g13 + p13 p12 ), the customized circuit of Ci-GEN is utilized. The complete circuit of Group 4 is depicted in Fig. 19. Similarly, for the generation of S14, compared to the conventional implementation based on H8 for Shannon expansion which is shown in Fig. 3, the method proposed in this paper uses Shannon expansion based on k1 and achieves a similar delay between the control signal k1 and the input signal of the output MUX, which realizes the delay balance and achieves the purpose of overall delay reduction.
Group 5 includes Bit 17∼15. Similar to the previous summation group, based on the logical signals defined in Fig. 3, compound gates pb4-GEN and gb4-GEN, as shown in Fig. 20 (a) and (b), are designed to generate pb4 and gb4 respectively. Then, letting $k_2 = p_{b4} H_8 + \bar{H}_8 g_{b4}$, S15, S16, and S17 can be optimized as:

$$\begin{aligned}
S_{15} &= s_{15} \oplus k_2 \\
S_{16} &= (s_{16} \oplus p_{15})\,k_2 + (s_{16} \oplus g_{15})\,\bar{k}_2 \\
S_{17} &= [s_{17} \oplus (g_{16} + p_{16} p_{15})]\,k_2 + [s_{17} \oplus (g_{16} + p_{16} g_{15})]\,\bar{k}_2
\end{aligned} \qquad (29)$$

The complete output sum generation circuit of Group 5 is shown in Fig. 21. Adopting k2 for Shannon expansion achieves a similar delay between the control signal and the input signal of the output MUX, thus reducing the overall delay.
As the final summation group in Block 1, it is necessary to provide the carry signal H17 for the next block. Fig. 22 (a) shows the circuit for generating H17 based on its expression in Fig. 3. The circuits for generating the intermediate signals $P_{b1}^*$ and $G_{b1}^*$ are shown in Fig. 22 (b) and (c) respectively.
For Group 6, by letting k3 = p17 H17, S18, S19, and S20 are rewritten as

$$\begin{aligned}
S_{18} &= s_{18} \oplus k_3 \\
S_{19} &= (s_{19} \oplus p_{18})\,k_3 + (s_{19} \oplus g_{18})\,\bar{k}_3 \\
S_{20} &= [s_{20} \oplus (g_{19} + p_{19} p_{18})]\,k_3 + [s_{20} \oplus (g_{19} + p_{19} g_{18})]\,\bar{k}_3
\end{aligned} \qquad (30)$$
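The three-line pattern shared by the group expressions (one XOR with k, then two MUX forms selected by k) can be written once and checked against ordinary addition. The sketch below is a behavioral illustration only. It assumes g_i = a_i AND b_i, p_i = a_i OR b_i, s_i = a_i XOR b_i, and that k equals the carry into the lowest bit of the group. The names `group_sums`, `reference_sums`, and `check_group` are hypothetical.

```python
from itertools import product


def group_sums(a, b, k):
    """Sum bits of one 3-bit group, following the pattern of the group equations.

    a, b: tuples of three bits, least significant bit first.
    k:    carry into the group (the Shannon expansion variable).
    """
    g = [x & y for x, y in zip(a, b)]   # assumed generate signals
    p = [x | y for x, y in zip(a, b)]   # assumed propagate signals (OR form)
    s = [x ^ y for x, y in zip(a, b)]   # local bit sums
    s_lo = s[0] ^ k
    s_mid = (s[1] ^ p[0]) if k else (s[1] ^ g[0])
    if k:
        s_hi = s[2] ^ (g[1] | (p[1] & p[0]))
    else:
        s_hi = s[2] ^ (g[1] | (p[1] & g[0]))
    return (s_lo, s_mid, s_hi)


def reference_sums(a, b, k):
    """Low three bits of a + b + k computed with integer arithmetic."""
    total = k
    for i in range(3):
        total += (a[i] + b[i]) << i
    return tuple((total >> i) & 1 for i in range(3))


def check_group():
    """Exhaustively compare both functions over all 2**7 inputs."""
    for bits in product((0, 1), repeat=7):
        a, b, k = bits[0:3], bits[3:6], bits[6]
        if group_sums(a, b, k) != reference_sums(a, b, k):
            return False
    return True


if __name__ == "__main__":
    print(check_group())
```

For instance, with a = 3, b = 1 and k = 1 (written LSB first as (1, 1, 0) and (1, 0, 0)), both functions return (1, 0, 1), that is 5. The model says nothing about gate delay; it only confirms that selecting on k leaves the logic function unchanged.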
Fig. 22. Circuits used to generate (a) H17, (b) $P_{b1}^*$, and (c) $G_{b1}^*$.
Fig. 25. The output sum generation circuit for Bits 26∼24.
Fig. 26. The output sum generation circuit for Bit 29∼27.

Fig. 23 gives the complete circuit of Group 6. Similarly, the logic terms (g19 + p19 p18 ) and (g19 + p19 g18 ) are implemented using the custom designed operator shown in Fig. 10. Using k3 for Shannon expansion realizes almost the same delay for the control signal and the input signal of the output MUX, which can reduce the overall delay.
The summation circuit in Group 7 includes Bit 23∼21. The intermediate signals gb6 and pb6 defined in Fig. 3 have the same logic as the signals gb3 and pb3 respectively, and can be generated using compound gates gb3-GEN and pb3-GEN shown in Fig. 18. Moreover, we use $k_4 = p_{b6} H_{17} + \bar{H}_{17} g_{b6}$ and perform Shannon expansion based on k4 to reduce the overall delay, obtaining S21, S22, and S23 as

$$\begin{aligned}
S_{21} &= s_{21} \oplus k_4 \\
S_{22} &= (s_{22} \oplus p_{21})\,k_4 + (s_{22} \oplus g_{21})\,\bar{k}_4 \\
S_{23} &= [s_{23} \oplus (g_{22} + p_{22} p_{21})]\,k_4 + [s_{23} \oplus (g_{22} + p_{22} g_{21})]\,\bar{k}_4
\end{aligned} \qquad (31)$$

The complete circuit of Group 7 is shown in Fig. 24, where the logic terms (g22 + p22 p21 ) and (g22 + p22 g21 ) are implemented using the custom designed operator.
The summation circuit in Group 8 includes Bit 26∼24. Also, the intermediate signals gb7 and pb7 defined in Fig. 3 are realized using gb3-GEN and pb3-GEN respectively. As illustrated previously, let $k = p_{b7} H_{17} + \bar{H}_{17} g_{b7}$, and S24, S25, and S26 in Fig. 3 are rewritten as

$$\begin{aligned}
S_{24} &= s_{24} \oplus k \\
S_{25} &= (s_{25} \oplus p_{24})\,k + (s_{25} \oplus g_{24})\,\bar{k} \\
S_{26} &= [s_{26} \oplus (g_{25} + p_{25} p_{24})]\,k + [s_{26} \oplus (g_{25} + p_{25} g_{24})]\,\bar{k}
\end{aligned} \qquad (32)$$

Similarly, as shown in Fig. 25, the logic terms g25 + p25 p24 and g25 + p25 g24 are implemented using the custom designed operator shown in Fig. 10.
Since this group needs to provide the output block carry signal H26, by comparing the expressions of H8 and H26 in Fig. 3, we can use the H8 generation circuit shown in Fig. 14 (b) to get H26. The input variables to get H26 are the intermediate signals $G_{b1}^*$, $P_{b1}^*$, $G_{b2}^*$, and $P_{b2}^*$, where the signals $G_{b1}^*$ and $P_{b1}^*$ have already been implemented in summing Group 5. For $G_{b2}^*$ and $P_{b2}^*$, it can be seen from Fig. 3 that the logical circuit for $G_{b2}^*$ is the same as that for $G_{b1}^*$, which can be obtained using the $G_{b1}^*$-GEN circuit shown in Fig. 22 (c). Also, $P_{b2}^*$ can still be obtained using a three-input NAND gate.
The summation circuit in Group 9 includes Bit 29∼27. Let p26 H26 = k5, and S27, S28, and S29 are redefined as follows

$$\begin{aligned}
S_{27} &= s_{27} \oplus k_5 \\
S_{28} &= (s_{28} \oplus p_{27})\,k_5 + (s_{28} \oplus g_{27})\,\bar{k}_5 \\
S_{29} &= [s_{29} \oplus (g_{28} + p_{28} p_{27})]\,k_5 + [s_{29} \oplus (g_{28} + p_{28} g_{27})]\,\bar{k}_5
\end{aligned} \qquad (33)$$

As shown in Fig. 26, the logic terms (g28 + p28 p27 ) and (g28 + p28 g27 ) are implemented using the custom designed operator shown in Fig. 10, and Shannon expansion based on k5 can also reduce the overall delay.
For Group 10, which includes the sum of Bit 30 and Bit 31, according to Fig. 3, we can first generate gb9 and pb9 using the compound gates gb3-GEN and pb3-GEN shown in Fig. 18. Then let $k_6 = p_{b9} H_{26} + \bar{H}_{26} g_{b9}$, and S30 and S31 are obtained as

$$\begin{aligned}
S_{30} &= s_{30} \oplus k_6 \\
S_{31} &= (s_{31} \oplus p_{30})\,k_6 + (s_{31} \oplus g_{30})\,\bar{k}_6
\end{aligned} \qquad (34)$$

The complete circuit of Group 10, established based on the above expressions, is shown in Fig. 27.
The specific implementation of each bit in the 32-bit adder has been fully discussed. For the delay optimization of the critical path S26, the compound gate used for generating the $P_i^*$ and $G_i^*$ signals is the first delay stage, the compound gate used for generating the $P_{b1}^*$, $G_{b1}^*$, and H8 signals is the second delay stage, and that used for generating H17 (and further k) is the third delay stage. Finally, the output MUX to obtain S26
Fig. 28. The layout of the proposed adder using 28 nm process.
Fig. 29. (a) the die photo and (b) layout of the test chip, and layouts of (c) the proposed adder and (d) Ling adder in [15] using 180 nm; (e) the die photo and (f) layout of the security chip, and layouts of (g) the proposed adder and (h) Ling adder in [15] using 28 nm; (i) the die photo and (j) layout of the security chip, and layouts of (k) the proposed adder and (l) Ling adder in [15] using 14 nm.

introduces the fourth delay stage. Compared to the conventional Ling adder that uses H8, H17, and H26 as the objects of Shannon expansion, we have redefined k and k0 ∼ k6 as the Shannon expansion variables for the high 21-bit summation circuit, reducing the delay of the input signals of the MUX controlled by them. As a result, the delay of the conventional adder, which is about 5 logic gates, is reduced to about 4 logic gates, and the critical path delay of the proposed adder achieves nearly 20% optimization.
For the area optimization, after adopting the customized NOR-NAND-XOR mixed circuits, each summing group can save 6 transistors since it requires 2 pi and 1 gi. This results in a reduction of about 60 transistors for a 32-bit adder that consists of 10 groups. Additionally, by utilizing the customized Ci-GEN circuit, the lower 11-bit ripple carry adder can save approximately 176 transistors compared with the traditional circuit shown in Fig. 10, with each operator saving 16 transistors. By employing a combination of a low-order ripple carry structure and optimized operators, the proposed adder is composed of 1084 transistors, whereas the conventional Ling adder in [15] necessitates 1476 transistors, resulting in a reduction of 392 transistors.

V. EXPERIMENTAL RESULTS
In order to verify the effectiveness of the proposed ripple-Ling hybrid carry adder, this paper implements it using different processes, including the advanced 14 nm FinFET process, and the traditional 28 nm and 180 nm standard CMOS processes. The adders are integrated on digital security chips or a specific test chip, and the performance of the adder is verified through simulation and testing.
For the implementation of the entire adder, it is important to carefully design the size of each logic gate, especially for signals that need to be transmitted over long distances with heavy loads, such as the block carry signals H8, H17, H26, etc. The output inverters are generally designed with large sizes and the lines are usually wide to provide large driving capability for these signals. For local signals, the sizes of the logic gates driving them are usually chosen to be the minimum logic gate sizes that the standard cells use. For example, in the 14 nm process, the NMOS and PMOS in logic gates driving local signals have the same size of L=16 nm and fin=3. In the 28 nm process, the size of NMOS and PMOS is designed as 100 nm/30 nm and 105 nm/30 nm respectively. In the 180 nm process, the size is 600 nm/180 nm and 1000 nm/180 nm respectively. Additionally, TGs are not used for transmitting signals over long paths, and an inverter is added after the output MUX to enhance the driving ability.
In terms of the overall layout, as Fig. 28 depicts, we arrange it with the high-order bits on the left and the low-order bits on the right. The inputs A31∼0 and B31∼0 are placed at the top, while the outputs S31∼0 are placed at the bottom. The block carry signals are positioned nearly in the middle of each block.
The layout of the adder and the layout and photo of the test chips using 180 nm, 28 nm, and 14 nm processes are shown in Fig. 29 (a)∼(l). When implemented using the 180 nm, 28 nm, and 14 nm processes, the proposed Ling adder occupies an area of 4557.5 µm2, 193.2 µm2, and 73.8 µm2 respectively, while the conventional Ling adder occupies an area of 5107.7 µm2, 246.85 µm2, and 119.7 µm2 respectively.
When using the proposed optimization methods, the number of transistors used in the adder has been reduced from 1476 to 1084, which achieves an optimization ratio of about 26%. But the real area consumption in different processes shows that the optimization effects of the proposed adder are more significant in advanced processes. Due to the multiple metal layers and low thickness, the power supplying efficiency is low in advanced processes. Therefore, tap cells and power stripes should be added into the adder. Once the device count is reduced by about 26%, fewer tap cells and power stripes are needed, resulting in an additional area reduction. The proposed adder can achieve an overall area reduction of approximately 39% in the 14 nm process. For the adder implemented in the 28 nm process, there is no need to add power stripes internally to the adder, and the substrate's tap is directly implemented on the strip that horizontally connects the power and ground, without using a separate tap cell. Therefore, the actual area optimization ratio is equivalent to the optimization ratio of the number of devices, which is approximately 23%. As for the adder implemented in the 180 nm process, its area optimization ratio should also be about 26%. However, in our implementation process, we design and realize each sub-cell using a custom design method, but the whole adder is connected using a digital Placing & Routing process, so the spacing between cells is large, and the final area optimization ratio for the 180 nm process is only 10%.

A. Post-Layout Simulation Results
Under different processes, we used multiple sets of random 32-bit A and B as the inputs of the adder, and performance evaluation is performed through post-layout simulation. The variation frequency of the input operands is 1 GHz. For the
TABLE I. Max Delay and Power Consumption of the Proposed and the Conventional Ling Adders Under 1 GHz Operating Frequency
TABLE II. Simulated PDP of Adder Using 28 nm Operating at the Maximum Frequency
Fig. 30. Simulated eye diagrams of S26 and S10 of both the proposed adder
and the traditional Ling adder under different temperatures and process corners
when using the (a)∼(c) 180 nm, (d)∼(f) 28 nm, and (g)∼(i) 14 nm process.
adder using 180 nm, 28 nm, and 14 nm, VDD is 1.8 V, 0.9 V, and 0.8 V respectively, and operating temperatures are −40 °C, 25 °C, and 125 °C. Fig. 30 shows the eye diagrams of the critical path, S26, the longest delay using ripple carry, S10, of the proposed structure, as well as S26 for the conventional Ling adder, obtained by simulating 100,000 sets of random 32-bit A and B operands in TT, FF, and SS process corners. It can be seen that under the different conditions, the delay required to generate S26 in the proposed structure is shorter than that in the conventional Ling structure, and the critical path of the proposed adder is decreased by over 21%. Similarly, the delay required to generate S10 is much shorter than the critical path S26 of the adder, which effectively validates the use of a ripple carry structure for the low-order 11 bits. The delay and power performance of the adders under 100% VDD is shown in TABLE I. It can be observed that, for the 180 nm process, the maximum delay of the proposed and conventional Ling adder is 0.75 ns and 0.95 ns respectively. When using the 28 nm process, it is 0.132 ns and 0.198 ns respectively. For the adders using the 14 nm process, the proposed and the conventional adder demonstrate a maximum delay of 0.096 ns and 0.286 ns respectively. For traditional processes, the delay optimization is mainly obtained from Shannon expansion delay balance, which is approximately 20%. But for the 14 nm process, the proposed adder achieves more than 20% delay optimization. This is because the smaller area allows for a reduction of the metal wires' length. In advanced processes, the proportion of signal wire metal delay is higher than that of the transistor's parasitic capacitance, resulting in an additional 40% overall delay optimization when the area, and with it the metal wire length, is reduced by 39%. Combined with the 20% optimization obtained from the delay balance of Shannon expansion, the final optimization reaches 60%.
Regarding power consumption, the proposed adder consumes 5250 µW, 98.9 µW, and 122.8 µW when using the 180 nm, 28 nm, and 14 nm processes respectively, while the conventional Ling adder consumes 5720 µW, 107.2 µW, and 146.1 µW respectively. In terms of power consumption, the proposed adder achieves an optimization of about 10%, and that using the advanced process can achieve a higher optimization of about 15%.
After obtaining the maximum delay of the adders under different operating conditions, we then simulate the PDP of the adders operating at the maximum frequency under various process, voltage and temperature (PVT) conditions for the adders using the 28 nm and 14 nm processes. The simulated results are shown in TABLE II and TABLE III. It can be seen that the proposed solution can achieve 10% improvement in PDP.

B. Test Results
To test the performance of the proposed adder, we provide different test platforms to verify the operation of the adder for different processes. In the chips using the 28 nm and 14 nm FinFET processes, we implement the proposed adder as a full adder in the SHA-256 hash algorithm. By validating the fastest frequency of the hash operation and considering the delay distribution of each sub-operator in the hash compressor part, we approximately estimate the fastest operating frequency of the full 32-bit adder. As shown in Fig. 31, in each round of the compression operation, the longest path is the one producing A(T+1), which is indicated by the red path (L1 + L2 + L3 + L4). The proposed adder is used as the last 32-bit full adder that generates the result of the compressor A(T+1) [19]. Firstly, the estimated percentage of the operating delay using the 28 nm and 14 nm processes from simulation results is shown in TABLE V and TABLE IV respectively. It shows that when the maximum speed is achieved, the delay of the proposed full adder is about 25.4% and 32.2% of the whole delay of the compressor for the SHA-256 hash operator using 28 nm and 14 nm respectively. Then the function tests are performed
TABLE III. Simulated PDP of Adder Using 14 nm Operating at the Maximum Frequency
TABLE IV. Simulated Delay in SHA-256 Compressor Using 14 nm
TABLE V. Simulated Delay in SHA-256 Compressor Using 28 nm
TABLE VI. Test PDP of Adder Using 180 nm Under Different VDD and Temperature
Fig. 31. The compressor structure of SHA-256.
Fig. 32. Test platform used to test the adders using 180 nm process.
at different input frequencies. Under typical power conditions with VDD = 0.8 V and VDD = 0.9 V respectively, the tested maximum operating frequency of SHA-256 in the 14 nm and 28 nm verification chips is about 1.75 GHz and 0.81 GHz respectively. Hence, the length of the critical path in the entire compressor section is approximately 0.57 ns and 1.23 ns respectively, which corresponds to an estimated longest delay of approximately 0.183 ns and 0.312 ns respectively for the proposed 32-bit adder. These are approximately twice the post-layout simulation results. We believe this difference is attributed to the high overall power consumption of the chip, which is approximately 3 W, leading to a large IR drop. Consequently, this results in a lower actual power supply voltage delivered to each operator and, subsequently, to the adder, causing a significant difference in operating frequency compared to the simulated frequency.
Since the adder is not individually tested with a test circuit, the power consumption cannot be tested for the circuit realized in the 14 nm and 28 nm processes. Therefore, we continue to use a 180 nm test chip to test the speed and power consumption of the entire adder. Due to the high operating frequency of the adder, it is challenging to directly lead signals off-chip for testing. Therefore, as illustrated in Fig. 32, we specifically design an on-chip test circuit. It uses an adder realized in Low Voltage Threshold (LVT) transistors as a reference, to indicate whether the adder implemented using Standard Voltage Threshold (SVT) transistors works properly. The LVT_ADDER operating under a 3.3 V power supply can work at a much higher speed than the tested SVT adder operating under a 1.8 V power supply. The XORed result of the output values of the tested adder and the reference adder is used to determine whether their calculation results are the same, and hence whether the SVT adder works properly. The inputs A and B are provided by a self-incrementing counter with configurable initial values. The clock frequency of the self-incrementing counter is controlled by the Phase Lock Loop (PLL) module. At a specific frequency, if the final result at the output of the test circuit is 0, it indicates that the tested adder and the reference adder have the same result, and the tested adder can operate at that frequency. Otherwise, if the result at the output of the test circuit is 1, it indicates that the tested adder and the reference adder have some differing bits, the output result of the adder is erroneous, and it cannot operate properly at that frequency. Once the maximum operating frequency is obtained, the power consumption of this adder is individually tested at that frequency. Test results of the power and maximum operating frequency for 40 samples at typical environment (VDD = 1.8 V, temperature = 25 °C) are shown in Fig. 33, and the maximum delay is 0.83 ns. The distribution of PDP [26] is given in Fig. 34. It can be seen that the proposed adder can achieve an average PDP of about 4.82 mW·ns when operating under the maximum frequency, which is 11% lower than that of the conventional Ling adder. TABLE VI presents the tested average PDP under different temperature and power supply conditions. The PDP of the adder proposed in this paper ranges from approximately 80% to 90% of that of the Ling adder proposed in [15].
The performance comparison between the adder designed in this paper and adders proposed in other literature is shown in TABLE VII. For the proposed adder using the 180 nm process, the delay, power consumption, and PDP are based on average testing results of 40 chips. For the adders using the 28 nm and 14 nm processes, the delay is estimated based on testing, while the power consumption and PDP are obtained from simulation results at 1 GHz and the maximum operating frequency respectively. When comparing with works [5], [14], [23] using the 180 nm process, the adder proposed in this paper has the smallest area and delay. But the PDP of the proposed adder is larger than theirs because it is obtained based on the maximum frequency, while other adders work at much lower frequencies. For the adder using the 28 nm process, the proposed one has a delay approximately one-third of [24], and the area and PDP are also much smaller than [24]. For the adders using the 65 nm or 90 nm process [20], [21], [22], the proposed
TABLE VII. Comparison With Prior Arts
Fig. 33. Tested results of the adders using 180 nm process at typical environment.
Fig. 34. The distribution of PDP for the (a) proposed adder and (b) conventional adder using 180 nm process at typical environment.

adder even has comparable area and delay when using the 180 nm process, and we have reason to believe that our adder has better delay and area under the same processes.

C. Future Work
The adder proposed in this paper is designed using static logic, making it suitable for a wide range of scenarios, like encryption algorithms such as SHA, AES, and RSA, as well as fields like MCU basic addition instructions and DSP FFT operations. In the future, we will customize and standardize the adder by extracting its layout and timing information, thereby using it as a standard cell to be compatible with digital design processes, which can speed up circuit implementation.
Additionally, we will further optimize the size by designing different transistor sizes depending on the application, and use the proposed method in realizing adders with different bit widths. By identifying the point where the ripple carry delay is the same as the critical path of the whole adder, we can implement low-order ripple carry and high-order Ling carry designs. The custom designed operator and Shannon expansion optimization will also be used for adders with different bit widths. In addition, although theoretically the S10 with a ripple structure and the S26 with a Ling carry structure have the same delay, the actual simulation results show that the delay of the S10 with ripple carry is smaller than that of the S26. Hence, the ripple carry structure can be expanded beyond S10 to S11 and even higher to achieve a more compressed area. In the future, with the constraint of keeping the critical path unchanged, we intend to further optimize and extend the ripple structure to higher bits, thereby continuously reducing power consumption and area.

VI. CONCLUSION
A high-order Ling and low-order ripple hybrid carry adder is proposed in this paper. The low-order 11 bits use a ripple-carry structure instead of the conventional lookahead carry method, while the high-order 21 bits continue to use the Ling carry structure, thereby simplifying the low-order sum circuit while maintaining the critical path of the entire adder. Moreover, new intermediate variables are used as the object for Shannon expansion in the implementation of the output sum circuit, and simplified custom logic circuits are used to optimize the design of each bit's sum circuit. The proposed adder is implemented using the 180 nm and 28 nm standard CMOS processes, as well as the 14 nm FinFET process. The testing results indicate that compared to the conventional Ling adder, there have been 10% optimizations in speed, area, and power consumption.

REFERENCES
[1] Y. He and C.-H. Chang, “A power-delay efficient hybrid carry-lookahead/carry-select based redundant binary to two’s complement converter,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 1, pp. 336–346, Feb. 2008.
[2] A. M. Shams, T. K. Darwish, and M. A. Bayoumi, “Performance analysis of low-power 1-bit CMOS full adder cells,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 1, pp. 20–29, Feb. 2002.
[3] R. Zlatanovici and B. Nikolic, “Power-performance optimal 64-bit carry-lookahead adders,” in Proc. 29th Eur. Solid-State Circuits Conf., Estoril, Portugal, 2003, pp. 321–324.
[4] H. Naseri and S. Timarchi, “Low-power and fast full adder by exploring new XOR and XNOR gates,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 8, pp. 1481–1493, Aug. 2018.
[5] P. Bhattacharyya, B. Kundu, S. Ghosh, V. Kumar, and A. Dandapat, “Performance analysis of a low-power high-speed hybrid 1-bit full adder circuit,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 10, pp. 2001–2008, Oct. 2015.
[6] S. Goel, A. Kumar, and M. A. Bayoumi, “Design of robust, energy-efficient full adders for deep-submicrometer design using hybrid-CMOS logic style,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 12, pp. 1309–1321, Dec. 2006.
[7] K. Papachatzopoulos and V. Paliouras, “Static delay variation models for ripple-carry and borrow-save adders,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 7, pp. 2546–2559, Jul. 2019.
[8] N. Kaushik and S. Bodapati, “IMPLY-based high-speed conditional carry and carry select adders for in-memory computing,” IEEE Trans. Nanotechnol., vol. 22, pp. 280–290, 2023.
[9] B. R. Zeydel, D. Baran, and V. G. Oklobdzija, “Energy-efficient design methodologies: High-performance VLSI adders,” IEEE J. Solid-State Circuits, vol. 45, no. 6, pp. 1220–1233, Jun. 2010.
[10] D. Esposito, D. De Caro, E. Napoli, N. Petra, and A. G. M. Strollo, “Variable latency speculative Han–Carlson adder,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 62, no. 5, pp. 1353–1361, May 2015.
[11] H. Ling, “High-speed binary adders,” IBM J. R&D, vol. 25, nos. 2–3, pp. 156–166, May 1981.
[12] C. Efstathiou, Z. Owda, and Y. Tsiatouhas, “New high-speed multioutput carry look-ahead adders,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 60, no. 10, pp. 667–671, Oct. 2013.
[13] A. Mitra and A. Bakshi, “Design of a high speed adder,” Int. J. Sci. Eng. Res., vol. 6, no. 4, pp. 918–921, Apr. 2015.
[14] G. Dimitrakopoulos and D. Nikolos, “High-speed parallel-prefix VLSI Ling adders,” IEEE Trans. Comput., vol. 54, no. 2, pp. 225–231, Feb. 2005.
[15] N. T. Quach and M. J. Flynn, “High-speed addition in CMOS,” IEEE Trans. Comput., vol. 41, no. 12, pp. 1612–1615, Dec. 1992.
[16] Y. Wang, C. Pai, and X. Song, “The design of hybrid carry-lookahead/carry-select adders,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 49, no. 1, pp. 16–24, Jan. 2002.
[17] D. Esposito, D. De Caro, and A. G. M. Strollo, “Variable latency speculative parallel prefix adders for unsigned and signed operands,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 63, no. 8, pp. 1200–1209, Aug. 2016.
[18] Y. Choi and E. E. Swartzlander, “Speculative carry generation with prefix adder,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 3, pp. 321–326, Mar. 2008.
[19] J. Wang, G. Liu, Y. Chen, and S. Wang, “Construction and analysis of SHA-256 compression function based on chaos S-Box,” IEEE Access, vol. 9, pp. 61768–61777, 2021.
[20] B. K. Mohanty and S. K. Patel, “Area–delay–power efficient carry-select adder,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 61, no. 6, pp. 418–422, Jun. 2014.
[21] S. Purohit and M. Margala, “Investigating the impact of logic and circuit implementation on full adder performance,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 7, pp. 1327–1331, Jul. 2012.
[22] G. A. Ruiz, “Evaluation of three 32-bit CMOS adders in DCVS logic for self-timed circuits,” IEEE J. Solid-State Circuits, vol. 33, no. 4, pp. 604–613, Apr. 1998.
[23] A. Meaamar and M. Othman, “High-speed hybrid parallel-prefix carry-select adder using Ling’s algorithm,” in Proc. IEEE Int. Conf. Semiconductor Electron., Kuala Lumpur, Malaysia, Nov. 2006, pp. 598–602.
[24] A. K. Panda, R. Palisetty, and K. C. Ray, “High-speed area-efficient VLSI architecture of three-operand binary adder,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 11, pp. 3944–3953, Nov. 2020.
[25] G. Yang, S.-O. Jung, K.-H. Baek, S. H. Kim, S. Kim, and S.-M. Kang, “A 32-bit carry lookahead adder using dual-path all-N logic,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 8, pp. 992–996, Aug. 2005.
[26] V. Pudi and K. Sridharan, “New decomposition theorems on majority logic for low-delay adder designs in quantum dot cellular automata,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 59, no. 10, pp. 678–682, Oct. 2012.

Ning Shang was born in China in 1997. She received the B.E. degree from the Henan University of Science and Technology in 2019. She is currently pursuing the master’s degree with Hubei University. Her research interests include custom circuit design and digital IC design.

Zhou Wang was born in Hubei, China, in 1997. He received the B.E. degree in electronic science and technology from the Faculty of Physics and Electronic Sciences, Hubei University, Wuhan, China, in 2019, where he is currently pursuing the master’s degree. His research interests include hardware/hardware-assisted security and IC design.

Ruikang Liu received the B.E. degree in optoelectronics science and engineering from the Faculty of Physics and Electronic Sciences, Hubei University, Wuhan, China, in 2022. He is currently pursuing the master’s degree with the School of Microelectronics, Hubei University. His research interests include hardware/hardware-assisted security and IC design.

Yizhou Huang was born in Sichuan, China, in 2000. He received the bachelor’s degree from Hubei University, Wuhan, in 2022, where he is currently pursuing the master’s degree. His research interests include analog and custom integrated circuit design.

Yin Zhang was born in Hubei, China, in 1991. He received the Ph.D. degree from the Faculty of Physics and Electronic Sciences, Hubei University, Wuhan, China, in 2022. He is currently a Lecturer with the Hubei University of Technology. His research interests include hardware/hardware-assisted security design.

Zhangqing He was born in Hubei, China, in 1980. He received the M.Sc. and Ph.D. degrees in electronic engineering from the Huazhong University of Science and Technology, China, in 2008 and 2016, respectively. Currently, he is a Professor with the Hubei University of Technology. He has published more than 30 academic articles, two textbooks, and has eight national patents. His research interests include hardware/hardware-assisted security and IC design.

Meilin Wan was born in China in 1988. He received the B.Sc. degree from the Huazhong University of Science and Technology, Wuhan, China, in 2009, the M.Sc. degree from the China Academy of Telecom Technology in 2012, and the Ph.D. degree from the Huazhong University of Science and Technology. He is currently an Associate Professor with Hubei University, Wuhan. His current research interests include custom circuit design and hardware security IC design.

