A 32-Bit Ripple-Ling Hybrid Carry Adder
Abstract: The low-order bits of the Ling adder are not on the critical path, eliminating the need for a carry lookahead method to calculate their output sums. In this paper, we propose a hybrid carry adder that combines high-order Ling and low-order ripple techniques. The low 11 bits of the adder utilize a ripple-carry structure, while the high 21 bits employ a Ling-based parallel prefix structure. This approach simplifies the low-order sum circuit without compromising the critical path length of the adder. Furthermore, new intermediate variables are introduced to facilitate Shannon expansion and enable efficient implementation of the output sum. This ensures that the control signal of the output MUX maintains a delay consistent with its input signal. The output sum circuit is further custom designed using reusable logic circuits. The proposed adder is verified using the conventional 180 nm and 28 nm processes, as well as the advanced 14 nm FinFET process, with layout areas of 4557.5 µm², 193.2 µm², and 73.8 µm², respectively. Testing results show that the maximum delay is 0.83 ns, 0.312 ns, and 0.183 ns for the adder using the 180 nm, 28 nm, and 14 nm processes, respectively. The proposed adder provides an area optimization of approximately 10% to 30% and optimizations of 10% in power and speed compared to the conventional Ling adder.

Index Terms: Adder, ripple carry, Ling carry, low power, low cost.

Manuscript received 17 October 2023; revised 4 January 2024; accepted 4 January 2024. Date of publication 19 January 2024; date of current version 30 May 2024. This work was supported by the National Natural Science Foundation of China under Grant 62174050 and Grant 62271194. This article was recommended by Associate Editor J. Di. (Corresponding author: Meilin Wan.)
Ning Shang, Zhou Wang, Ruikang Liu, Yizhou Huang, and Meilin Wan are with the School of Microelectronics, Hubei University, Wuhan, Hubei 430064, China (e-mail: [email protected]).
Yin Zhang and Zhangqing He are with the School of Electrical and Electronic Engineering, Hubei University of Technology, Wuhan 430068, China.
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSI.2024.3352139.
Digital Object Identifier 10.1109/TCSI.2024.3352139

I. INTRODUCTION

As device sizes continue to shrink to the nanometer scale, the use of low power techniques has become more important than ever for the design of any complex VLSI chip like microprocessors and DSPs, which encompass various complex arithmetic operations such as subtraction, multiplication, division, and addition. They are typically implemented using one or multiple addition operations. Therefore, adders are the most used arithmetic units in complex VLSI chips [1]. Adders often reside on the critical path of digital circuits, directly affecting the overall speed of the system. Hence, optimizing the area, power consumption, and operation speed of adders is crucial for enhancing the performance of the entire system [2].

So far, significant research efforts have been devoted to shortening the critical path in multi-bit addition [3], which mainly focus on the optimization of sub-units in the 1-bit full adder and of the whole carry structure. In terms of customizing sub-units in the 1-bit full adder, Naseri and Timarchi use XOR/XNOR gates based on the transmission gate (TG) to realize six new hybrid 1-bit full-adder circuits [4]. This design performs well when implementing 2- to 4-bit adders. However, the TG structure is unsuitable for cascading adders with a bitwidth exceeding 16, due to the substantial increase in delay. Bhattacharyya et al. [5] use XOR gates based on TG and MUX to generate the local sum and carry signals quickly. Due to the weak driving capability of the TG, intermediate buffers need to be added, resulting in additional hardware overhead and an increase in the critical path for the multi-bit adder. Similarly, using TGs to realize a 1-bit full adder, as done in [6], achieves a low power delay product (PDP) but is only suitable for low-bit-width operands. Although simple sub-units can reduce the area and delay for the 1-bit addition, they cannot be directly cascaded to realize multi-bit operations. Therefore, we should optimize the adder in both the sub-circuit and the whole carry structure to achieve excellent performance in multi-bit additions.

The common structures for generating carry signals in multi-bit adders include the Ripple Carry structure [7], the Carry Select (CS) structure [8], and the Look-Ahead Carry (LAC) structure. The LAC adder, which utilizes two intermediate signals (the carry propagation and carry generation signals) to simplify the logic for generating the carry and output sum, is the most commonly used high-speed adder [9]. In this paper, we will also optimize the 32-bit adder based on the LAC structure.

The adders that employ the LAC structure include the Kogge-Stone adder, Sklansky adder, Brent-Kung adder, Ladner-Fischer adder, and Han-Carlson adder [10]. They all build different carry trees based on the carry signal $C_i$. Ling simplified the LAC structure by incorporating a pseudo carry, $H_i$ [11]. This design reduces the complexity of the logic compared to conventional $C$-based operations. Moreover, it can use other LAC carry tree structures with respect to $H_i$ to achieve high speed at minimal cost. Efstathiou et al. [12] and Mitra and Bakshi [13] construct Ling prefix carry trees by using Ladner-Fischer or Kogge-Stone operators to generate the carry propagation and generation signals. Moreover, the $H_i$ for odd and even-indexed bits are calculated independently, thus directly reducing the fan-out of the prefix tree and helping to decrease delay. However, they require an additional 4-input CS adder to ensure the parallelism of high-order carry and sum operations when performing arithmetic on multi-bit operands of 32 bits or more, which increases hardware overhead and fails to reduce area while improving computational speed. Dimitrakopoulos and Nikolos [14] propose a two-level Ling structure, where the carry signals are divided into generation groups $gg_i^*$ and propagation groups $gt_j^*$. Computation of the carry is then performed based on the prefix network composed of $gg_i^*$ and $gt_j^*$. This design introduces more OR gates in the propagation path. With increasing operand bit
1549-8328 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Rajeev Gandhi Memorial College of Eng and Tech. Downloaded on August 09,2024 at 09:52:21 UTC from IEEE Xplore. Restrictions apply.
2710 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 71, NO. 6, JUNE 2024
numbers, the additional cost cannot fully exploit the advantages of Ling's method, either. In [15], Quach and Flynn group the multi-bit operands, specifically three bits per group and three groups per block. Within each group, the local sum is calculated using conditional sum algorithms. The Ling method is then employed to generate the carries between blocks. This method reduces the number of serial transistors on the critical path, saves one gate delay in the addition operation, and is considered one of the high-performance adders. However, the division of the first summation block into three sum groups results in redundant logical operations for the low-order bits. Considering that the low-order summing computations involve few bits and simple logic, the use of a ripple carry structure will not influence the delay of the critical path for the entire adder. Therefore, there is no need to perform low-bit summation using the Ling LAC structure based on grouping [16].

In this paper, we propose an improved 32-bit adder that reduces area and power consumption while maintaining high speed by employing the ripple-Ling hybrid carry structure, based on the Ling adder presented in [15]. For the low-order bits, we adopt the ripple carry structure to ensure a more compact generation of the output sum, while for the high-order bits, we continue to use the Ling LAC structure to maintain a low critical path delay. Furthermore, the output sum circuits for all the bits are customized and optimized to reduce the delay on the critical path during circuit implementation. The contributions of this paper are:

(1) The adder adopts a structure that combines high-bit Ling carry and low-bit ripple carry: the low 11 bits use a ripple structure that does not affect the critical path, while the high 21 bits still use the Ling LAC structure. The use of the ripple carry structure for the low 11 bits simplifies the circuit and lowers the cost and power consumption.

(2) Mixed logic gates have been customized, such as simultaneously implementing NAND and NOR logic in XOR gates, to reduce the cost of basic logic gates. Moreover, the customized logic of $A_i B_i + C_{i-1}(A_i \oplus B_i)$ is used to realize the low-order 1-bit full adder and the high-order bits' output sum. This customized unit employs one TG and four transistors to transmit the input to the output or to pull the output up or down, thereby simplifying the circuit and reducing the hardware overhead.

(3) In the custom designed high-order sum generation circuit using the Ling carry structure, we have redefined multiple signals that serve as Shannon expansion variables. By increasing the delay of the Shannon expansion variable and reducing the delay of the signal it controls, the delay is more uniformly distributed between the control signal and the input signal of the output MUX, thereby realizing the delay balance for the output MUX and reducing the delay of the critical path.

The organization of the paper is as follows. Section II mainly elucidates the design principles of the conventional Ling adder. Section III explains the structure of the proposed ripple-Ling hybrid carry structure. Section IV describes the circuit implementation of the proposed adder. Section V presents the simulation and test results, along with an analysis and comparison with existing adders. Finally, the paper is concluded in Section VI.

II. BACKGROUND TECHNOLOGY

The Ling adder proposed in [15], as depicted in Fig. 1, consists of a 32-bit adder divided into four summing blocks. Each summing block is further divided into multiple summing groups. Specifically, Bit 8:0, Bit 17:9, and Bit 26:18 all have three summing groups, while Bit 31:27 are divided into a 3-bit summing group (Bit 29:27) and a 2-bit summing group (Bit 31:30). The carry signals, $H_2$, $H_8$, $H_{17}$, and $H_{26}$, between the blocks are generated by a global look-ahead carry block. Fig. 2 illustrates the structure of the Global Look-Ahead Carry module in Fig. 1. To generate the block carry signals, $H_2$, $H_8$, $H_{17}$, and $H_{26}$, the group carry generation signals $G_i^*$ and propagation signals $P_i^*$ are first generated. Then, $H_8$, $G_{bi}^*$ and $P_{bi}^*$ are generated based on $P_i^*$ and $G_i^*$. Finally, the carry signals $H_{17}$ and $H_{26}$ are generated based on $P_{bi}^*$ and $G_{bi}^*$.

Let $A_{31\sim0}$ and $B_{31\sim0}$ represent the two 32-bit input binary numbers, and $S_{31\sim0}$ represent their output sums. The symbols $+$ and $\oplus$ are used to represent the logical operations of OR and XOR, respectively [17]. In binary addition, $g_i$, $p_i$, and $s_i$ represent the generation signal, propagation signal, and the local sum of Bit $i$ [16], respectively:

$$g_i = A_i \cdot B_i, \quad p_i = A_i + B_i, \quad s_i = A_i \oplus B_i \tag{1}$$

The output carry and the output sum of Bit $i$ can be respectively expressed as

$$C_i = C_{i-1}\, p_i + g_i, \quad S_i = s_i \oplus C_{i-1} \tag{2}$$

By iteratively applying the expression of $C_i$ in (2) $i$ times, $C_i$ can be expanded as follows

$$C_i = g_i + p_i g_{i-1} + p_i p_{i-1} g_{i-2} + \dots + p_i p_{i-1} \cdots p_1 g_0 \tag{3}$$

In each group comprised of three consecutive input bits $A_{i\sim i-2}$ and $B_{i\sim i-2}$, the output carry signal $C_i$ provided by the current group to the next group can be expressed as

$$C_i = g_i + g_{i-1} p_i + g_{i-2} p_i p_{i-1} + p_{i-2} p_{i-1} p_i C_{i-3} \quad (i \ge 3) \tag{4}$$

By defining the group generation and propagation signals $G_i$ and $P_i$ as

$$\begin{aligned} G_i &= g_i + g_{i-1} p_i + g_{i-2} p_i p_{i-1} \\ P_i &= p_{i-2}\, p_{i-1}\, p_i \end{aligned} \tag{5}$$
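As an illustration of (1) to (5), the following is a minimal Python sketch that models the signals bit by bit and checks them against ordinary integer addition. The function names (`local_signals`, `ripple_carries`, `group_carry`, `add32`, `check`) are hypothetical and not taken from the paper; the sketch assumes no external carry-in and uses the relation $C_i = G_i + P_i C_{i-3}$ obtained by substituting (5) into (4).

```python
import random

def local_signals(a, b, i):
    """Eq. (1): generate g_i, propagate p_i and local sum s_i of bit i."""
    ai, bi = (a >> i) & 1, (b >> i) & 1
    return ai & bi, ai | bi, ai ^ bi

def ripple_carries(a, b, n=32):
    """Eq. (2): C_i = C_{i-1} p_i + g_i, assuming no carry into bit 0."""
    carries, c = [], 0
    for i in range(n):
        g, p, _ = local_signals(a, b, i)
        c = g | (p & c)
        carries.append(c)
    return carries

def group_carry(a, b, i, c_in):
    """Eqs. (4)/(5): carry out of the 3-bit group i..i-2, given C_{i-3}."""
    g0, p0, _ = local_signals(a, b, i - 2)
    g1, p1, _ = local_signals(a, b, i - 1)
    g2, p2, _ = local_signals(a, b, i)
    G = g2 | (g1 & p2) | (g0 & p2 & p1)   # group generation signal G_i
    P = p0 & p1 & p2                      # group propagation signal P_i
    return G | (P & c_in)

def add32(a, b):
    """32-bit sum built from S_i = s_i XOR C_{i-1} (eq. (2))."""
    carries = ripple_carries(a, b)
    total = 0
    for i in range(32):
        _, _, s = local_signals(a, b, i)
        c_prev = carries[i - 1] if i > 0 else 0
        total |= (s ^ c_prev) << i
    return total

def check(trials=200, seed=0):
    """Compare the bit-level model with integer addition on random operands."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.getrandbits(32), rng.getrandbits(32)
        if add32(a, b) != (a + b) & 0xFFFFFFFF:
            return False
        c = ripple_carries(a, b)
        if any(group_carry(a, b, i, c[i - 3]) != c[i] for i in range(3, 32)):
            return False
    return True
```

The group form gives the same carry as three ripple steps, which is why a 3-bit group can hand its carry to the next group in a single compound gate.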
SHANG et al.: 32-BIT RIPPLE-LING HYBRID CARRY ADDER 2711
Fig. 3. The carry and output sum expressions of the Ling adder proposed in [15].
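The expressions of Fig. 3 are not reproduced in this text, but the later sections rely on the Ling pseudo carry, for example through $k_0 = C_8 = p_8 H_8$. The sketch below checks the standard Ling relations $H_i = C_i + C_{i-1}$, $H_i = g_i + p_{i-1} H_{i-1}$, and $C_i = p_i H_i$ numerically. It is a generic illustration under the assumption of no external carry-in, not a transcription of Fig. 3, and the function names are hypothetical.

```python
import random

def ling_pseudo_carries(a, b, n=32):
    """H_i from the recurrence H_i = g_i + p_{i-1} H_{i-1}, with H_0 = g_0."""
    hs, h, p_prev = [], 0, 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g, p = ai & bi, ai | bi
        h = g | (p_prev & h)
        hs.append(h)
        p_prev = p
    return hs

def true_carries(a, b, n=32):
    """Carry out of bit i, taken directly from integer addition."""
    out = []
    for i in range(n):
        mask = (1 << (i + 1)) - 1
        out.append(((a & mask) + (b & mask)) >> (i + 1))
    return out

def check(trials=200, seed=0):
    """Verify C_i = p_i H_i and H_i = C_i + C_{i-1} on random operands."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.getrandbits(32), rng.getrandbits(32)
        hs, cs = ling_pseudo_carries(a, b), true_carries(a, b)
        for i in range(32):
            p = ((a | b) >> i) & 1
            c_prev = cs[i - 1] if i > 0 else 0
            if (p & hs[i]) != cs[i] or hs[i] != (cs[i] | c_prev):
                return False
    return True
```

Because $H_i$ drops the factor $p_i$ from every term of $C_i$, its first-level gate is simpler than that of the true carry; the missing $p_i$ is reintroduced when the sum is formed.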
Fig. 5. The structure of the proposed ripple-Ling hybrid carry 32-bit adder.
Fig. 6. Circuits using Ling carry structure and ripple-carry structure for generating $S_0 \sim S_5$.
Fig. 7. Circuits using Ling carry structure and ripple-carry structure for generating $S_6 \sim S_8$.
Fig. 8. Circuits using Ling carry structure and ripple-carry structure for generating $S_9$ and $S_{10}$.

Similarly, the output sum of Bit 8 can be obtained as

$$S_8 = s_8 \oplus C_7 \tag{21}$$

Fig. 7 gives the circuit of Group 2 when using the ripple carry structure. We can see that the delay of $C_7$ is the delay of 4 compound logic gates. Similarly to Bit 5, $C_7$ and $S_8$ are used as the control and input signals of the output MUX, respectively. If the one $P_i^*/G_i^*$-GEN gate, the three $C_i$-GEN compound logic gates, and the one MUX (from its control signal to its output) are designed to have a total delay lower than the delay of 3 compound gates plus 2 MUXs, the delay of $S_8$ will be lower than that of $S_{26}$. In this case, Bit 8∼6 can all be implemented using the ripple carry structure.

The next summing group needs to wait for the block carry signal $H_8$, which introduces two additional logic operations compared to the summing groups in Block 0. If the ripple-carry structure is extended to Block 1, as shown in Fig. 8, the delay of generating $S_{10}$ and $S_{11}$ becomes 4.5 logic gates' propagation delay ($T_{I1}$, $T_{I2}$, $T_{I3}$, $T_{I4}$, $T_{C1}$) and 5.5 logic gates' propagation delay ($T_{I1}$, $T_{I2}$, $T_{I3}$, $T_{I4}$, $T_{I5}$, $T_{C2}$), respectively. However, the total delay of the critical path $S_{26}$ for the entire adder is also about that of 5 logic gates. So the delay of $S_{10}$ is similar to that of $S_{26}$. If the ripple carry structure is further extended to Bit 11, it will reach 5.5 logic gates' delay. Consequently, it will exceed the critical path of the entire adder. Therefore, the ripple-carry structure is not suitable for the summation of Bit 11 and beyond and is only applied to the summation of Bit 10∼0.

For Bit 9, its output sum is

$$S_9 = s_9 \oplus C_8 \tag{22}$$

Its output carry and the output sum of Bit 10 are

$$\begin{aligned} C_9 &= C_8 (A_9 \oplus B_9) + A_9 B_9 \\ S_{10} &= s_{10} \oplus C_9 \end{aligned} \tag{23}$$

The modification of the low 11-bit structure, although increasing the delay to get their output carries, does not increase the critical path of the entire adder. In the meantime, from Figs. 6, 7, and 8, it can be observed that the ripple carry structure has lower hardware cost. Using the ripple carry structure in the low-order bits and the Ling carry structure in the high-order bits reduces hardware resource consumption without impacting the calculation speed of the adder.

IV. CIRCUIT IMPLEMENTATION OF THE PROPOSED ADDER

In this section, we will present the detailed circuit implementation of the entire adder based on the proposed new ripple-Ling hybrid carry structure. Due to its strong noise immunity and low power consumption, this paper adopts the static circuit approach for circuit implementation. The circuit will be optimized in the following two aspects. First, custom designed compound logic gates, such as NAND, NOR, and XOR-mixed gates, as well as the custom designed $A_i B_i + X(A_i \oplus B_i)$ operator, are used to achieve area optimization while ensuring fast computation. Next, in order to reduce the delay of output sums with long paths, such as $S_{26}$, we optimize the Shannon expansion variables to align the delays of the control and input signals of the final MUX, further reducing overall latency.

A. XOR-NAND-NOR Mixed Logic Gate

Firstly, we propose a compact circuit that simultaneously realizes XOR, NAND, and NOR operations. The XOR logic is the most commonly used arithmetic unit in addition logic. The simple static circuit for XOR logic is shown in Fig. 9, utilizing 10 transistors. It can be observed that the NOR operation is incorporated in the XOR logic and can be achieved using the left four transistors. Furthermore, it also contains a parallel
Fig. 11. Output sum generation circuit of Bit 26 when realizing the Shannon
expansion balance.
The new structure to realize $S_{26}$ according to (24) is shown in Fig. 11. The delay from the input of the final MUX to $S_{26}$ is about 4 logic gates' propagation delay ($T_{I1}$, $T_{I2}$, $T_{I3}$, $T_{I4}$), while the delay from the control signal to $S_{26}$ is also about that of 4 logic gates ($T_{C1}$, $T_{C2}$, $T_{C3}$, $T_{C4}$, $T_{C5}$, where $T_{C4}$ and $T_{C5}$ represent the delay from the MUX's control signal to its output signal, both estimated to be about 0.5 logic gate's delay). As a result, the delays from both the input and the control signals of the MUX to $S_{26}$ reach 4 logic gates' delay, and the overall delay to generate $S_{26}$ is reduced from 5 to approximately 4 logic gates' delay, an optimization ratio of about 20%. A similar Shannon expansion optimization can be applied to the other bits whose output sums are time consuming to generate. Although the proposed method does not theoretically shorten the critical path length of the circuit, it improves the calculation speed of $S_{26}$ in the implementation of the circuits. By restructuring the object of Shannon expansion, the adder can achieve optimization in terms of delay.

D. Output Sum Generation Circuit for Each Bit

Next, we will generate the output sum for each bit using the optimization methods mentioned above. For the output sum generation circuit of Bit 2, according to Fig. 3, the local bit sum $s_2$ needs to be XORed with the carry signal $C_1$ to obtain the output sum $S_2$. To further simplify the circuit, as shown in Fig. 12, we use a MUX composed of two TGs to implement the final XOR logic, in which $s_2$ acts as the input signal and $C_1$ is the control signal. An inverter is added to the output of the MUX to obtain the final $S_2$ and ensure its output driving capability. Based on this scheme, the output sum generation circuit of $S_2$ is designed as shown in Fig. 12. Furthermore, it utilizes the custom designed circuit described in the previous part, $A_i B_i + X(A_i \oplus B_i)$, to obtain $C_i$.

Since the 3 bits in each summation group of Block 0 have a ripple carry structure, they will be connected in series using a group of 1-bit full adders with the same structure. Therefore, we encapsulate the circuit for computing the output $S_2$ into a 1-bit full adder module, as shown in Fig. 13. For the 1-bit full adder that is required to output both $p$ and $g$, we utilize the XOR-NOR-NAND gate shown in Fig. 9(a) to obtain $p$ and $g$ simultaneously, and the adder is named FA1. For units that only need to output $g$ or $p$ individually, only the NAND or the NOR function is integrated in the XOR, and these adders are named FA2 and FA3 respectively. Additionally, the last bit in a group does not need to provide $C_i$; it only needs to provide $p_i$ to the next group. Thus, the 1-bit full adder FA4, which uses the XOR-NOR mixed gate and does not include $C_i$-GEN, is employed.

For Block 0, as discussed in Section II, the input carries of Groups 1, 2, and 3, which are $C_2$, $C_5$, and $C_8$ respectively, employ the Ling carry structure. So, it is necessary to use simplified circuits to implement $G_i^*$, $P_i^*$, $G_{bi}^*$, $P_{bi}^*$, and ultimately obtain $C_2$, $C_5$, and $C_8$. For $P_i^*$, according to their expressions in Fig. 3, a three-input NAND gate is used, and this circuit is named $P_i^*$-GEN. As for $G_i^*$, [15] has optimized the circuit so that the longest path in both the pull-up and pull-down paths consists of only three transistors in series, as shown in Fig. 14 (a). This circuit structure is named $G_i^*$-GEN. $G_2^*$, $G_5^*$, ..., $G_{29}^*$ and $P_5^*$, $P_8^*$, ..., $P_{29}^*$, as Fig. 2 shows, will all be obtained by using the above $G_i^*$-GEN and $P_i^*$-GEN circuits.

Fig. 14. (a) $G_i^*$-GEN and (b) $H_8$-GEN used to generate $G_i^*$ and $H_8$ respectively.

After obtaining $G_2^*$, according to Fig. 3, $p_2$ and $G_2^*$ are used to generate the carry signal $C_2$, which is then connected to the input carry of the Bit 3 full adder for ripple carry realization. The last full adder of summation Group 1 only needs to provide $S_5$ and $p_5$ for the input carry calculation of Bit 6. The complete circuit structure for computing the output sum of Bit 5∼3 is shown in Fig. 15, where FA4, FA3, and FA3 are used to realize the 1-bit full adders for Bit 5∼3 respectively.

Fig. 15. Output sum generation circuit of Bit 5∼3.

The whole ripple-carry based adder for Group 2 is shown in Fig. 16. We first design a compound gate to generate the input carry signal $C_5$ according to Fig. 3. Similarly, the 1-bit full adder of Bit 8 outputs a carry propagation signal $p_8$ instead of $C_8$ to the next summation group. As the last group of circuits in Block 0, it also needs to provide the block carry signal $H_8$ to Block 1. According to the $H_8$ expression in Fig. 3, we designed the $H_8$-GEN circuit to generate the block carry signal $H_8$, which is shown in Fig. 14 (b).

Fig. 16. Output sum generation circuit of Bit 8∼6.

As shown in Fig. 17, Bit 9 and Bit 10 in Group 3 of Block 1 still use a ripple carry structure, while Bit 11 uses the Ling carry structure to generate its input carry. First, as Fig. 3 depicts, $g_9$ and $p_9$ are needed to get $S_{11}$ by the Ling method. Therefore, in the output sum generation circuit of Bit 9, the XOR-NOR-NAND logic gate is used in FA1 to generate the signals $s_9$, $g_9$, and $p_9$ simultaneously. Next, in order to simplify the $S_{11}$ generation circuit, we define an intermediate signal $k_0 = C_8 = p_8 H_8$ as the variable for Shannon expansion. Then $S_{11}$ as shown in Fig. 3 can be rewritten as

$$S_{11} = k_0 \{ s_{11} \oplus [\, p_{10} (g_{10} + p_9) ] \} + \overline{k_0} \{ s_{11} \oplus [\, p_{10} (g_{10} + g_9) ] \} \tag{25}$$
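The Shannon expansion in (25) can be checked exhaustively. The sketch below is a hypothetical Python model, not the authors' circuit: it compares the two-branch form selected by $k_0$ with the plain ripple definition $S_{11} = s_{11} \oplus C_{10}$, treating $k_0$ as the carry $C_8$ into Bit 9 and reading the second $k_0$ in (25) as its complement.

```python
from itertools import product

def s11_reference(a9, b9, a10, b10, a11, b11, c8):
    """S11 = s11 XOR C10, with C9 and C10 rippled from the carry-in C8."""
    c9 = (a9 & b9) | ((a9 | b9) & c8)
    c10 = (a10 & b10) | ((a10 | b10) & c9)
    return (a11 ^ b11) ^ c10

def s11_shannon(a9, b9, a10, b10, a11, b11, k0):
    """Eq. (25): both candidate sums are precomputed, k0 only selects one."""
    g9, p9 = a9 & b9, a9 | b9
    g10, p10 = a10 & b10, a10 | b10
    s11 = a11 ^ b11
    when_k0_is_1 = s11 ^ (p10 & (g10 | p9))
    when_k0_is_0 = s11 ^ (p10 & (g10 | g9))
    return when_k0_is_1 if k0 else when_k0_is_0   # 2:1 MUX controlled by k0

def expansion_matches():
    """True if eq. (25) agrees with the ripple form for all 128 input cases."""
    return all(
        s11_reference(*bits) == s11_shannon(*bits)
        for bits in product((0, 1), repeat=7)
    )
```

Both MUX data inputs depend only on local $g$, $p$, and $s$ signals, so they can settle while $k_0$ is still being computed; this is the delay balance the text aims for.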
Fig. 19. The output sum generation circuit for Bit 14∼12.
Fig. 18. (a) pb3 -GEN and (b) gb3 -GEN used to generate pb3 and gb3
respectively.
Fig. 23. The output sum generation circuit for Bit 20∼18.
Fig. 25. The output sum generation circuit for Bits 26∼24.
Fig. 23 gives the complete circuit of Group 6. Similarly, the logic terms $(g_{19} + p_{19} p_{18})$ and $(g_{19} + p_{19} g_{18})$ are implemented using the custom designed operator shown in Fig. 10. Using $k_3$ for Shannon expansion yields almost the same delay for the control signal and the input signal of the output MUX, which reduces the overall delay.

The summation circuit in Group 7 includes Bit 23∼21. The intermediate signals $g_{b6}$ and $p_{b6}$ defined in Fig. 3 have the same logic as the signals $g_{b3}$ and $p_{b3}$ respectively, and can be generated using the compound gates $g_{b3}$-GEN and $p_{b3}$-GEN shown in Fig. 18. Moreover, we use $k_4 = p_{b6} H_{17} + \overline{H_{17}}\, g_{b6}$ and perform Shannon expansion based on $k_4$ to reduce the overall delay, obtaining $S_{21}$, $S_{22}$, and $S_{23}$ as

$$\begin{aligned} S_{21} &= s_{21} \oplus k_4 \\ S_{22} &= (s_{22} \oplus p_{21})\, k_4 + (s_{22} \oplus g_{21})\, \overline{k_4} \\ S_{23} &= [s_{23} \oplus (g_{22} + p_{22} p_{21})]\, k_4 + [s_{23} \oplus (g_{22} + p_{22} g_{21})]\, \overline{k_4} \end{aligned} \tag{31}$$

The complete circuit of Group 7 is shown in Fig. 24, where the logic terms $(g_{22} + p_{22} p_{21})$ and $(g_{22} + p_{22} g_{21})$ are implemented using the custom designed operator.

The summation circuit in Group 8 includes Bit 26∼24. Also, the intermediate signals $g_{b7}$ and $p_{b7}$ defined in Fig. 3 are realized using $g_{b3}$-GEN and $p_{b3}$-GEN respectively. As illustrated previously, let $k = p_{b7} H_{17} + \overline{H_{17}}\, g_{b7}$, and $S_{24}$, $S_{25}$, and $S_{26}$ in Fig. 3 are rewritten as

$$\begin{aligned} S_{24} &= s_{24} \oplus k \\ S_{25} &= (s_{25} \oplus p_{24})\, k + (s_{25} \oplus g_{24})\, \overline{k} \\ S_{26} &= [s_{26} \oplus (g_{25} + p_{25} p_{24})]\, k + [s_{26} \oplus (g_{25} + p_{25} g_{24})]\, \overline{k} \end{aligned} \tag{32}$$

Similarly, as shown in Fig. 25, the logic terms $g_{25} + p_{25} p_{24}$ and $g_{25} + p_{25} g_{24}$ are implemented using the custom designed operator shown in Fig. 10.

Since this group needs to provide the output block carry signal $H_{26}$, by comparing the expressions of $H_8$ and $H_{26}$ in Fig. 3, we can use the $H_8$ generation circuit shown in Fig. 14 (b) to get $H_{26}$. The input variables to get $H_{26}$ are the intermediate signals $G_{b1}^*$, $P_{b1}^*$, $G_{b2}^*$, and $P_{b2}^*$, where the signals $G_{b1}^*$ and $P_{b1}^*$ have already been implemented in summing Group 5. For $G_{b2}^*$ and $P_{b2}^*$, it can be seen from Fig. 3 that the logical circuit for $G_{b2}^*$ is the same as that for $G_{b1}^*$, so it can be obtained using the $G_{b1}^*$-GEN circuit shown in Fig. 22 (c). Also, $P_{b2}^*$ can still be obtained using a three-input NAND gate.

The summation circuit in Group 9 includes Bit 29∼27. Let $p_{26} H_{26} = k_5$, and $S_{27}$, $S_{28}$, and $S_{29}$ are redefined as follows

$$\begin{aligned} S_{27} &= s_{27} \oplus k_5 \\ S_{28} &= (s_{28} \oplus p_{27})\, k_5 + (s_{28} \oplus g_{27})\, \overline{k_5} \\ S_{29} &= [s_{29} \oplus (g_{28} + p_{28} p_{27})]\, k_5 + [s_{29} \oplus (g_{28} + p_{28} g_{27})]\, \overline{k_5} \end{aligned} \tag{33}$$

Fig. 26. The output sum generation circuit for Bit 29∼27.

As shown in Fig. 26, the logic terms $(g_{28} + p_{28} p_{27})$ and $(g_{28} + p_{28} g_{27})$ are implemented using the custom designed operator shown in Fig. 10, and Shannon expansion based on $k_5$ can also reduce the overall delay.

Group 10 covers the sums of Bit 30 and Bit 31. According to Fig. 3, we can first generate $g_{b9}$ and $p_{b9}$ using the compound gates $g_{b3}$-GEN and $p_{b3}$-GEN shown in Fig. 18. Then let $k_6 = p_{b9} H_{26} + \overline{H_{26}}\, g_{b9}$, and $S_{30}$ and $S_{31}$ are obtained as

$$\begin{aligned} S_{30} &= s_{30} \oplus k_6 \\ S_{31} &= (s_{31} \oplus p_{30})\, k_6 + (s_{31} \oplus g_{30})\, \overline{k_6} \end{aligned} \tag{34}$$

The complete circuit of Group 10, established based on the above expressions, is shown in Fig. 27.

The specific implementation of each bit in the 32-bit adder has been fully discussed. For the delay optimization of the critical path $S_{26}$, the compound gate used for generating the $P_i^*$ and $G_i^*$ signals is the first delay stage, the compound gate used for generating the $P_{b1}^*$, $G_{b1}^*$, and $H_8$ signals is the second delay stage, and that used for generating $H_{17}$ (and further $k$) is the third delay stage. Finally, the output MUX to obtain $S_{26}$
Fig. 27. The output sum generation circuit for Bit 31∼30.
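Equations (31) to (34) share one pattern: within a group, every output sum is a 2:1 selection, controlled by the group carry-in $k$, between two values computed only from local $g$, $p$, and $s$ signals. The sketch below is a minimal Python model of that pattern for a 3-bit group, checked against integer addition. The function names are hypothetical, the second $k$ in each printed equation is assumed to be the complement of $k$, and the generation of $k$ from the block-level signals in Fig. 3 is not modeled.

```python
from itertools import product

def group_sums_shannon(a, b, k):
    """Sums of a 3-bit group in the form of eq. (31); index 0 is the lowest bit."""
    g = [(a >> j) & (b >> j) & 1 for j in range(3)]
    p = [((a >> j) | (b >> j)) & 1 for j in range(3)]
    s = [((a >> j) ^ (b >> j)) & 1 for j in range(3)]

    def mux(when_k_is_1, when_k_is_0):
        # Output MUX: k is the control signal, both data inputs are precomputed.
        return when_k_is_1 if k else when_k_is_0

    s0 = s[0] ^ k
    s1 = mux(s[1] ^ p[0], s[1] ^ g[0])
    s2 = mux(s[2] ^ (g[1] | (p[1] & p[0])), s[2] ^ (g[1] | (p[1] & g[0])))
    return s0, s1, s2

def group_sums_reference(a, b, k):
    """The same three sum bits taken from the integer sum a + b + k."""
    total = a + b + k
    return total & 1, (total >> 1) & 1, (total >> 2) & 1

def expansion_matches():
    """True if both forms agree for all 3-bit operands and both values of k."""
    return all(
        group_sums_shannon(a, b, k) == group_sums_reference(a, b, k)
        for a, b, k in product(range(8), range(8), (0, 1))
    )
```

With Bit 21 mapped to index 0, the three returned values correspond to $S_{21}$, $S_{22}$, and $S_{23}$ in (31); the 2-bit Group 10 in (34) uses only the first two.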
TABLE I
MAX DELAY AND POWER CONSUMPTION OF THE PROPOSED AND THE CONVENTIONAL LING ADDERS UNDER 1 GHZ OPERATING FREQUENCY
TABLE II
SIMULATED PDP OF ADDER USING 28 NM OPERATING AT THE MAXIMUM FREQUENCY
Fig. 30. Simulated eye diagrams of S26 and S10 of both the proposed adder
and the traditional Ling adder under different temperatures and process corners
when using the (a)∼(c) 180 nm, (d)∼(f) 28 nm, and (g)∼(i) 14 nm process.
adder using 180 nm, 28 nm, and 14 nm, VDD is 1.8 V, 0.9 V, and 0.8 V respectively, and operating temperatures are −40 °C, 25 °C, and 125 °C. Fig. 30 shows the eye diagrams of the critical path, $S_{26}$, and of the longest delay using ripple carry, $S_{10}$, of the proposed structure, as well as $S_{26}$ for the conventional Ling adder, obtained by simulating 100,000 sets of random 32-bit A and B operands in the TT, FF, and SS process corners. It can be seen that under the different conditions, the delay required to generate $S_{26}$ in the proposed structure is shorter than that in the conventional Ling structure, and the critical path of the proposed adder is decreased by over 21%. Similarly, the delay required to generate $S_{10}$ is much shorter than the critical path $S_{26}$ of the adder, which effectively validates the use of a ripple carry structure for the low-order 11 bits. The delay and power performance of the adders under 100% VDD is shown in TABLE I. It can be observed that, for the 180 nm process, the maximum delay of the proposed and conventional Ling adders is 0.75 ns and 0.95 ns respectively. When using the 28 nm process, it is 0.132 ns and 0.198 ns respectively. For the adders using the 14 nm process, the proposed and the conventional adders demonstrate a maximum delay of 0.096 ns and 0.286 ns respectively. For traditional processes, the delay optimization is mainly obtained from the Shannon expansion delay balance, which is approximately 20%. But for the 14 nm process, the proposed adder achieves more than 20% delay optimization. This is because the smaller area allows for a reduction of the metal wires' length. In advanced processes, the proportion of delay due to signal wire metal is higher than that due to the transistors' parasitic capacitance, resulting in an additional 40% overall delay optimization when the area, and hence the metal wire length, is reduced by 39%. Combined with the 20% optimization obtained from the delay balance of Shannon expansion, the final optimization reaches 60%.

Regarding power consumption, the proposed adder consumes 5250 µW, 98.9 µW, and 122.8 µW when using the 180 nm, 28 nm, and 14 nm processes respectively, while the conventional Ling adder consumes 5720 µW, 107.2 µW, and 146.1 µW respectively. In terms of power consumption, the proposed adder achieves an optimization of about 10%, and the adder using the advanced process achieves a higher optimization of about 15%.

After obtaining the maximum delay of the adders under different operating conditions, we then simulate the PDP of the adders operating at the maximum frequency under various process, voltage and temperature (PVT) conditions for the adders using the 28 nm and 14 nm processes. The simulated results are shown in TABLE II and III. It can be seen that the proposed solution can achieve a 10% improvement in PDP.

B. Test Results

To test the performance of the proposed adder, we provide different test platforms to verify the operation of the adder for different processes. In the chips using the 28 nm and 14 nm FinFET processes, we implement the proposed adder as a full adder in the SHA-256 hash algorithm. By validating the fastest frequency of the hash operation and considering the delay distribution of each sub-operator in the hash compressor part, we approximately estimate the fastest operating frequency of the full 32-bit adder. As shown in Fig. 31, in each round of the compression operation, the longest path is the one producing A(T+1), which is indicated by the red path ($L_1 + L_2 + L_3 + L_4$). The proposed adder is used in the last 32-bit full adder that generates the result of the compressor, A(T+1) [19]. Firstly, the estimated percentage of the operating delay using the 28 nm and 14 nm processes from simulation results is shown in TABLE V and TABLE IV respectively. It shows that when the maximum speed is achieved, the delay of the proposed full adder is about 25.4% and 32.2% of the whole delay of the compressor for the SHA-256 hash operator using 28 nm and 14 nm respectively. Then the function tests are performed
TABLE IV
SIMULATED DELAY IN SHA-256 COMPRESSOR USING 14 NM
Fig. 32. Test platform used to test the adders using 180 nm process.
at different input frequencies. Under typical power conditions with VDD = 0.8 V and VDD = 0.9 V, respectively, the tested maximum operating frequency of SHA-256 in the 14 nm and 28 nm verification chips is about 1.75 GHz and 0.81 GHz, respectively. Hence, the length of the critical path in the entire compressor section is approximately 0.57 ns and 1.23 ns, respectively, which corresponds to an estimated longest delay of approximately 0.183 ns and 0.312 ns, respectively, for the proposed 32-bit adder. These values are approximately twice the post-simulation results. We attribute this difference to the high overall power consumption of the chip, approximately 3 W, which leads to a large IR drop. Consequently, the actual supply voltage delivered to each operator, and hence to the adder, is lower, causing a significant difference between the tested and simulated operating frequencies.
Since the adder is not individually tested with a test circuit, its power consumption cannot be measured for the circuits realized in the 14 nm and 28 nm processes. Therefore, we continue to use a 180 nm test chip to test the speed and power consumption of the entire adder. Due to the high operating frequency of the adder, it is challenging to route signals directly off-chip for testing. Therefore, as illustrated in Fig. 32, we specifically design an on-chip test circuit. It uses an adder realized in Low Voltage Threshold (LVT) transistors as a reference to indicate whether the adder implemented using Standard Voltage Threshold (SVT) transistors works properly. The LVT_ADDER, operating under a 3.3 V power supply, can work at a much higher speed than the tested SVT adder operating under a 1.8 V power supply. The XOR of the output values of the tested adder and the reference adder is used to determine whether their calculation results are the same, and hence whether the SVT adder works properly. The inputs A and B are provided by a self-incrementing counter with configurable initial values. The clock frequency of the self-incrementing counter is controlled by the Phase Lock Loop (PLL) module. At a specific frequency, if the final result at the output of the test circuit is 0, it indicates that the tested adder and the reference adder have the same result, and the tested adder can operate at that frequency. Otherwise, if the result at the output of the test circuit is 1, the tested adder and the reference adder differ in some bits, the output of the tested adder is erroneous, and it cannot operate properly at that frequency. Once the maximum operating frequency is obtained, the power consumption of the adder is tested individually at that frequency. Test results of the power and maximum operating frequency for 40 samples in a typical environment (VDD = 1.8 V, temperature = 25 °C) are shown in Fig. 33, and the maximum delay is 0.83 ns. The distribution of PDP [26] is given in Fig. 34. It can be seen that the proposed adder achieves an average PDP of about 4.82 mW·ns when operating at the maximum frequency, which is 11% lower than that of the conventional Ling adder. TABLE VI presents the tested average PDP under different temperature and power supply conditions. The PDP of the adder proposed in this paper ranges from approximately 80% to 90% of that of the Ling adder proposed in [15].
The performance comparison between the adder designed in this paper and adders proposed in other literature is shown in TABLE VII. For the proposed adder using the 180 nm process, the delay, power consumption, and PDP are based on the average testing results of 40 chips. For the adders using the 28 nm and 14 nm processes, the delay is estimated based on testing, while the power consumption and PDP are obtained from simulation results at 1 GHz and the maximum operating frequency, respectively. Compared with works [5], [14], and [23] using the 180 nm process, the adder proposed in this paper has the smallest area and delay. However, its PDP is larger than theirs because it is obtained at the maximum frequency, while the other adders work at much lower frequencies. For the adder using the 28 nm process, the proposed one has a delay approximately one-third of that in [24], and its area and PDP are also much smaller than those in [24]. For the adders using the 65 nm or 90 nm process [20], [21], [22], the proposed
SHANG et al.: 32-BIT RIPPLE-LING HYBRID CARRY ADDER 2721
TABLE VII
COMPARISON WITH PRIOR ARTS
adder even has comparable area and delay when using the 180 nm process, and we have reason to believe that our adder has better delay and area under the same processes.
Fig. 33. Tested results of the adders using 180 nm process at typical environment.
Fig. 34. The distribution of PDP for the (a) proposed adder and (b) conventional adder using 180 nm process at typical environment.
C. Future Work
The adder proposed in this paper is designed using static logic, making it suitable for a wide range of scenarios, such as encryption algorithms (SHA, AES, and RSA), basic MCU addition instructions, and DSP FFT operations. In the future, we will customize and standardize the adder by extracting its layout and timing information, so that it can be used as a standard cell compatible with digital design flows, which can speed up circuit implementation.
Additionally, we will further optimize the size by designing different transistor sizes depending on the application, and apply the proposed method to adders with different bit widths. By identifying the point where the ripple-carry delay equals the critical path of the whole adder, we can implement low-order ripple-carry and high-order Ling-carry designs. The custom-designed operator and Shannon expansion optimization will also be used for adders with different bit widths. In addition, although theoretically S10 with a ripple structure and S26 with a Ling carry structure have the same delay, the actual simulation results show that the delay of S10 with ripple carry is smaller than that of S26. Hence, the ripple-carry structure can be extended beyond S10 to S11 and even higher to further reduce the area. In the future, with the constraint of keeping the critical path unchanged, we intend to further optimize and extend the ripple structure to higher bits.
VI. CONCLUSION
A hybrid carry adder combining high-order Ling and low-order ripple techniques is proposed in this paper. The low-order 11 bits use a ripple-carry structure instead of the conventional carry-lookahead method, while the high-order 21 bits continue to use the Ling carry structure, thereby simplifying the low-order sum circuit while maintaining the critical path of the entire adder. Moreover, new intermediate variables are used as the object of Shannon expansion in the implementation of the output sum circuit, and simplified custom logic circuits are used to optimize the design of each bit's sum circuit. The proposed adder is implemented using the 180 nm and 28 nm standard CMOS processes, as well as the 14 nm FinFET process. The testing results indicate that, compared to the conventional Ling adder, the proposed adder achieves 10% optimizations in speed, area, and power consumption.
REFERENCES
[1] Y. He and C.-H. Chang, “A power-delay efficient hybrid carry-lookahead/carry-select based redundant binary to two’s complement converter,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 1, pp. 336–346, Feb. 2008.
[2] A. M. Shams, T. K. Darwish, and M. A. Bayoumi, “Performance analysis of low-power 1-bit CMOS full adder cells,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 1, pp. 20–29, Feb. 2002.
[3] R. Zlatanovici and B. Nikolic, “Power-performance optimal 64-bit carry-lookahead adders,” in Proc. 29th Eur. Solid-State Circuits Conf., Estoril, Portugal, 2003, pp. 321–324.
[4] H. Naseri and S. Timarchi, “Low-power and fast full adder by exploring new XOR and XNOR gates,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 8, pp. 1481–1493, Aug. 2018.
[5] P. Bhattacharyya, B. Kundu, S. Ghosh, V. Kumar, and A. Dandapat, “Performance analysis of a low-power high-speed hybrid 1-bit full adder circuit,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 10, pp. 2001–2008, Oct. 2015.
[6] S. Goel, A. Kumar, and M. A. Bayoumi, “Design of robust, energy-efficient full adders for deep-submicrometer design using hybrid-CMOS logic style,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 12, pp. 1309–1321, Dec. 2006.
[7] K. Papachatzopoulos and V. Paliouras, “Static delay variation models for ripple-carry and borrow-save adders,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 7, pp. 2546–2559, Jul. 2019.
[8] N. Kaushik and S. Bodapati, “IMPLY-based high-speed conditional carry and carry select adders for in-memory computing,” IEEE Trans. Nanotechnol., vol. 22, pp. 280–290, 2023.
[9] B. R. Zeydel, D. Baran, and V. G. Oklobdzija, “Energy-efficient design methodologies: High-performance VLSI adders,” IEEE J. Solid-State Circuits, vol. 45, no. 6, pp. 1220–1233, Jun. 2010.
[10] D. Esposito, D. De Caro, E. Napoli, N. Petra, and A. G. M. Strollo, “Variable latency speculative Han–Carlson adder,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 62, no. 5, pp. 1353–1361, May 2015.
[11] H. Ling, “High-speed binary adders,” IBM J. R&D, vol. 25, nos. 2–3, pp. 156–166, May 1981.
[12] C. Efstathiou, Z. Owda, and Y. Tsiatouhas, “New high-speed multioutput carry look-ahead adders,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 60, no. 10, pp. 667–671, Oct. 2013.
[13] A. Mitra and A. Bakshi, “Design of a high speed adder,” Int. J. Sci. Eng. Res., vol. 6, no. 4, pp. 918–921, Apr. 2015.
[14] G. Dimitrakopoulos and D. Nikolos, “High-speed parallel-prefix VLSI Ling adders,” IEEE Trans. Comput., vol. 54, no. 2, pp. 225–231, Feb. 2005.
[15] N. T. Quach and M. J. Flynn, “High-speed addition in CMOS,” IEEE Trans. Comput., vol. 41, no. 12, pp. 1612–1615, Dec. 1992.
[16] Y. Wang, C. Pai, and X. Song, “The design of hybrid carry-lookahead/carry-select adders,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 49, no. 1, pp. 16–24, Jan. 2002.
[17] D. Esposito, D. De Caro, and A. G. M. Strollo, “Variable latency speculative parallel prefix adders for unsigned and signed operands,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 63, no. 8, pp. 1200–1209, Aug. 2016.
[18] Y. Choi and E. E. Swartzlander, “Speculative carry generation with prefix adder,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 3, pp. 321–326, Mar. 2008.
[19] J. Wang, G. Liu, Y. Chen, and S. Wang, “Construction and analysis of SHA-256 compression function based on chaos S-Box,” IEEE Access, vol. 9, pp. 61768–61777, 2021.
[20] B. K. Mohanty and S. K. Patel, “Area–delay–power efficient carry-select adder,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 61, no. 6, pp. 418–422, Jun. 2014.
[21] S. Purohit and M. Margala, “Investigating the impact of logic and circuit implementation on full adder performance,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 7, pp. 1327–1331, Jul. 2012.
[22] G. A. Ruiz, “Evaluation of three 32-bit CMOS adders in DCVS logic for self-timed circuits,” IEEE J. Solid-State Circuits, vol. 33, no. 4, pp. 604–613, Apr. 1998.
[23] A. Meaamar and M. Othman, “High-speed hybrid parallel-prefix carry-select adder using Ling’s algorithm,” in Proc. IEEE Int. Conf. Semiconductor Electron., Kuala Lumpur, Malaysia, Nov. 2006, pp. 598–602.
[24] A. K. Panda, R. Palisetty, and K. C. Ray, “High-speed area-efficient VLSI architecture of three-operand binary adder,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 11, pp. 3944–3953, Nov. 2020.
[25] G. Yang, S.-O. Jung, K.-H. Baek, S. H. Kim, S. Kim, and S.-M. Kang, “A 32-bit carry lookahead adder using dual-path all-N logic,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 8, pp. 992–996, Aug. 2005.
[26] V. Pudi and K. Sridharan, “New decomposition theorems on majority logic for low-delay adder designs in quantum dot cellular automata,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 59, no. 10, pp. 678–682, Oct. 2012.
Ning Shang was born in China in 1997. She received the B.E. degree from the Henan University of Science and Technology in 2019. She is currently pursuing the master’s degree with Hubei University. Her research interests include custom circuit design and digital IC design.
Zhou Wang was born in Hubei, China, in 1997. He received the B.E. degree in electronic science and technology from the Faculty of Physics and Electronic Sciences, Hubei University, Wuhan, China, in 2019, where he is currently pursuing the master’s degree. His research interests include hardware/hardware-assisted security and IC design.
Ruikang Liu received the B.E. degree in optoelectronics science and engineering from the Faculty of Physics and Electronic Sciences, Hubei University, Wuhan, China, in 2022. He is currently pursuing the master’s degree with the School of Microelectronics, Hubei University. His research interests include hardware/hardware-assisted security and IC design.
Yizhou Huang was born in Sichuan, China, in 2000. He received the bachelor’s degree from Hubei University, Wuhan, in 2022, where he is currently pursuing the master’s degree. His research interests include analog and custom integrated circuit design.
Yin Zhang was born in Hubei, China, in 1991. He received the Ph.D. degree from the Faculty of Physics and Electronic Sciences, Hubei University, Wuhan, China, in 2022. He is currently a Lecturer with the Hubei University of Technology. His research interests include hardware/hardware-assisted security design.
Zhangqing He was born in Hubei, China, in 1980. He received the M.Sc. and Ph.D. degrees in electronic engineering from the Huazhong University of Science and Technology, China, in 2008 and 2016, respectively. Currently, he is a Professor with the Hubei University of Technology. He has published more than 30 academic articles, two textbooks, and has eight national patents. His research interests include hardware/hardware-assisted security and IC design.
Meilin Wan was born in China in 1988. He received the B.Sc. degree from the Huazhong University of Science and Technology, Wuhan, China, in 2009, the M.Sc. degree from the China Academy of Telecom Technology in 2012, and the Ph.D. degree from the Huazhong University of Science and Technology. He is currently an Associate Professor with Hubei University, Wuhan. His current research interests include custom circuit design and hardware security IC design.