Fast Modular Multiplication Using Parallel Prefix Adder: Pravin P. Zode Raghavendra B. Deshmukh

Fast Modular Multiplication using
Parallel Prefix Adder

Pravin P. Zode Raghavendra B. Deshmukh
Research Scholar, Center for VLSI and Nanotechnology, Professor, Center for VLSI and Nanotechnology,
Visvesvaraya National Institute of Technology, Visvesvaraya National Institute of Technology,
Nagpur, India Nagpur, India
Abstract- Public key cryptography applications involve use of by incorporating alternative regular adder structures. An
large integer arithmetic operations which are compute intensive Interleaved and Montgomery modular multiplication algorithm
in term of power, delay and area. Modular multiplication, which are considered for implementation as they are more Area-Time
is frequently used most resource hungry block. Generally, last
stage of modular multiplication is implemented by using carry efficient and widely used [9]. These are implemented on Xilinx
propagate adder whose long carry chain takes more time. In this Virtex -6 FPGA platform to test the performance.
paper, FPGA based Modulo multiplication architectures using The rest of the paper is organized as follows- Section-2
Carry Save and Kogge-Stone parallel prefix adder are presented describes reported Interleaved and Montgomery Multiplication
to reduce this problem. Proposed implementations are faster as Algorithms, Section-3 describes proposed architecture of
compared to conventional carry save adder and carry propagate Interleaved and Montgomery Multiplier using Kogge-Stone
adder implementations.
Parallel Prefix adder. The Experimental results are discussed in
Keywords: Interleaved Multiplication, Montgomery Section-4 and finally concluding remarks and future directions
Multiplication, Parallel Prefix Adder, Kogge-Stone Adder, Carry are presented in Section-5.
Save Adder
I. INTRODUCTION II. PREVIOUS WORK

Cryptographic applications are implemented on many There are basic four approaches for calculation of modular
mobile embedded devices like smart card and cell phones. multiplication namely- multiply and then divide[10], Brickell
These applications use large integer arithmetic in prime and method [11], multiply and reduction interleaved [10], and
binary field with typical size of operands ranging from 160 bit Montgomery method [12]. Of these, multiply and reduction
to 2048 bits. Trend is to move to large key sizes to increase interleaved and Montgomery methods have lesser
security. This is possible by speeding the computation and computational overhead [10].
reducing the energy required to perform the computation. To In multiply and reduce method, first n-bit numbers X and Y
meet the speed and power requirements of portable are multiplied to obtain 2n-bit product. Then, this product is
applications the hardware implementation is preferred to divided (reduced) by M to obtain n-bit product representing
reduce compute cost in terms of area, delay and power of the ((X*Y) mod M). In Multiply and interleaved method, shift and
public key crypto algorithms. Modular multiplication is the add multiplication algorithm is used to reduce partial product
most frequently used operation in these cryptographic at each step [10]. Brickell's method is based on use of a carry
applications which requires high area and power. Performance delayed integer [11]. This method is combination of sign
of cryptographic applications are decided by Area-Time estimation and Omura's method of correction [10].
complexity[1-3]. As a result, an efficient implementation Many design optimization schemes are reported in current
becomes critical. literature by using Residue Number System [13], systolic
Carry Save Adder (CSA) has been frequently reported[4-6]. Array[14], Lookup Tables [15] and pipeline architecture[16].
High-radix multipliers are also reported [7]. These We have considered Interleaved Multiplication[17][18]
implementations have longer critical paths and irregular (Algorithm-1) and Montgomery Algorithm for binary base
complex hardware topology resulting in high dynamic power (Algorithm-2)[19] for FPGA implementation.
consumption. The popular RSA algorithm is implemented
using Montgomery modulo multiplication [7-8] A. Interleaved Multiplication
Generally, hardware implementation of multiplier requires Idea of interleaved multiplication is keep results short. In
addition of partial products generated by array of AND gates Interleaved multiplication, multiplier is processed from most
and intermediate additions performed using carry save adders. significant bit position. Algorithm-I shows the carry save
Reducing delay in adder chain results in major performance representation of interleaved multiplication. Step 3 and 4 are
improvement. This problem is more severe in case of implemented by using CSA and step 5 is by using lookup table.
cryptographic implementations due to large operand length. Final result (Step-7) is computed using addition of carry and
The main objective of proposed work is to reduce the critical sum generally by RCA.
path delay of the modulo multiplication process. An attempt is
made to improve the hardware performance of this architecture
978-1-4799-6986-9/14/$31.00©2014 IEEE
Algorithm 1: Interleaved Multiplication Algorithm [10] depth and minimum fan out and lowest possible delay [27][28].
Inputs: X, Y, M with 0 < X ,Y < M
Next sub section discusses structure of Kogge-Stone adder.
Output :P = X*Y mod M n: number of bits in X Subsequently, Interleaved multiplication and Montgomery
1. S:=0; C:= 0; A=0; Multiplication implementation is presented.
2. for i in n-1 to 0 loop A. Kogge-Stone Adder
3. C,S := 2S + 2C + Xi*Y
Peter M. Kogge and Harold S. Stone [21] reported this
4. C,S := S + C + A;
parallel prefix form of carry look-ahead adder architecture.
5. A= (2*Sn+1+Sn+2*Cn+1+Cn)*2n mod M; Fig. 1 shows the architecture of Kogge-Stone adder (KSA).
6. end loop ; Three building blocks for this adder are bitwise Propagate-
7. P := C + S; Generate (PG) cell, Group-Propagate-Generate(GPG) cell, and
the sum XORs. Square black cells in architecture are carry
propagate (A+B) and generate structures(A.B). Black cells
B. Montgomery Multiplication
drives upper input of PG Cell. Gray cells are AND-OR
Montgomery multiplication was proposed by P.L. networks used for lower inputs for sum logic and Triangular
Montgomery [12] in 1985 without trial division for speeding blocks are buffers. Kogge-Stone architecture achieves log2N
up the modulus calculation i.e. (XY mod M) for positive stages and fanout of 2 at each stage.
integers. The algorithm implements division operation with
shifts which can be efficiently implemented in hardware. This
approach avoids time consuming trial division that is common
bottleneck in modular multiplication. It works from Least
Significant Bit position.
Algorithm 2: Montgomery Algorithm for Binary Base [19]

Inputs: X, Y, M with 0 < X ,Y < M
Output :P = X*Y R-1mod M n: number of bits in X
1. S:=0; C:= 0;
2. for i in 0 to n-1 loop
3. S, C := S +C + xi*Y Fig 1: Kogge-Stone Adder Architecture [27]
4. S,C := S + C +So * M;
5. S := S Div 2; Critical path delay of Ripple Carry Adder (RCA) and Kogge-
6. C := C Div 2; Stone Adder (KSA) are represented in equations 1 and 2
7. end loop; respectively[27]. Where, tpg is delay of 1-bit propagate and
8. P := S + C generate gate, tAO is delay of AND-OR gate in gray cell, txor is
9. if (P >= M) then delay of sum XOR. The delay is reduced to (N-1) multiple to
10. P:= P - M; (log2N) is significant improvement in performance over RCA.
11. else
t ripple = tpg + ( N-1) tAO + txor (1)
12. P:= P;
13. end if;
t ksa = tpg + ( log2N) tAO + txor (2)
The above Carry Save based Algorithms are implemented by B. Proposed Architecture
conventional carry propagate adder and Kogge-Stone parallel Application of above structure for Interleaved and
prefix adder. Montgomery Multiplier is presented in this sub-section.
III. PROPOSED ARCHITECTURE Datapath of 163 bit Interleaved Multiplication (Algorithm-1)
and Montgomery Multiplication (Algorithm-2) is shown in
The proposed work explores the performance improvement by fig.2 and fig.3 respectively. Steps 3 and 4 of algorithm- 1 and
improving the adder architecture. The Carry computation in 2 are implemented using CSA . The operation of steps 5 and 6
addition process is speeded up using parallel prefix form of are divide by 2 which are implemented by right shift
carry look-ahead adder. In literature, various parallel prefix operations. Performing additions with CSA significantly saves
adder structures reported are- Ling [20], Kogge-Stone [21], delay and area. The proposed work uses two carry save adders
Brent-Kung [22], Lander-Fischer [23], Han-Carlson [24], for calculation of intermediate results. Instead of carry
Knowles [25] and Sklansky[26]. In parallel prefix adders there propagate adder which is generally used. Final results in the
is always tradeoff between stages of logic, number of logic form of SUM and CARRY (P=S+C) of algorithm are added
gates, maximum fanout of each stage and amount of using Kogge-Stone adder. This reduces the critical path of
interconnection between stages [27]. Kogge-stone adder is operation and increases the speed of operation. FSMs were
widely used because of its regular structure with minimum designed to control the above datapaths.
presents comparison results of implementation of Interleaved
C=0 S=0 and Montgomery Multiplication using RCA and Kogge-Stone
X Y adder. In the proposed architectures dynamic power
consumption is increased by 5.31% and 1.97% , area is
increased by 8.53% and 5.51% due to increase in switching,
fanout and logic levels. However, this structure significantly
Right Shift Right Shift Xi*Y reduces the carry propagation delay. Reduction of delay is
FSM Controller
noticeable in multipliers with Carry Save adder and Kogge-
Carry Save Adder -1 Stone adder. This improvement is attributed because of use of
alternative adder structure. Overall Power-Delay product sees
2*A improvement of 72.83% and 27.42% over RCA. Area-Delay
product also shows improvement of 72.00% and 24.90%. The
Carry Save Adder -2 proposed architectures are best suitable for constraint based
implementation.
A=(A= (2*Sn+1+Sn+2*Cn+1+Cn)*2n mod M) Table 1: Implementation Comparison of Modulo Multipliers
`
Multiplier Interleaved Multiplier Montgomery Multiplier
Kogge-Stone Adder -3 Parameter RCA KSA RCA KSA
Dynamic
44.22 mW 46.57mW 57.98 mW 59.12 mW
Fig. 2 Proposed Datapath of Interleaved Multiplier Power
Delay 6.881ns 1.775ns 1.735ns 1.235ns
C=0 S=0 Xi Y M
Slice LUT 2661 2888 3595 3793
Power- Delay
3.04E-10 8.27E-11 1.006E-10 7.3E-11
Xi*Y Product
Area- Delay
1.83E-05 5.13E-06 6.237E-06 4.68E-06
Product
Carry Save Adder -1
V. CONCLUSIONS
Carry C Sum S S*M We have presented FPGA based architectures for modular
multipliers using parallel prefix Kogge-Stone adder.
FSM Controller
Architectures are implemented on Xilinx Virtex-6 FPGA. The

Carry Save Adder -2
proposed architectures are high speed as it reduces the delay by
74.20% and 28.82% in Interleaved and Montgomery modular
Right Shift Right Shift multiplication. Dynamic power consumption is slightly
increased and overall Power-Delay product and Area-Delay
product improved. Higher performance is achieved by using
alternative adder structures. Future work includes the
improvement in the performance of the multiplication process
especially by reducing the critical path and exploring other
Kogge-Stone Adder -3 parallel strategies.
REFERENCES
Fig.3 Proposed Datapath of Montgomery Multiplier [1] A.K. Lutz, J. Treichler, et al. ," 2Gbit/s Hardware Realizations of
RIJNDAEL and SERPENT: A Comparative Analysis," in Fourth
IV. IMPLEMENTATION AND EXPERIMENTAL RESULTS International Workshop on Cryptographic Hardware and Embedded
Systems– CHES2002, August 13-15, 2002
The proposed architectures are synthesized and implemented [2] Khoongming Khoo, Thomas Peyrin ,Axel Y. Poschmann ,Huihui Yap,
on Xilinx Virtex–6 Family Device (Virtex-6 VCX240T- "FOAM: Searching for Hardware-Optimal SPN Structures and
Components with a Fair Comparison" in Sixteenth International Workshop
1FF1156 FPGA). 163 bit Montgomery multiplier and on Cryptographic Hardware and Embedded Systems (CHES-2004), 23-16
Interleaved multiplier are coded using VHDL and synthesized September 2014, LNCS 8731, pp. 433–450
using Xilinx XST with default settings. 163 bit Ripple Carry [3] Eric Peeters, Michael Neve, Mathieu Ciet, "XTR Implementation on
Adder, Carry Save Adder and Kogge-Stone Adder structures Reconfigurable Hardware," in Third International Workshop on
Cryptographic Hardware and Embedded Systems (CHES-2004), LNC
were implemented and used as components for proposed 3156, pp. 386-399.
implementation. Direct performance comparison of hardware [4] Ming-Der Shieh, Jun-Hong Chen, Wen-Ching Lin, and Hao-Hsuan Wu "A
implementations against other reported results not possible New Algorithm for High-Speed Modular Multiplication Design," IEEE
because of different FPGA technologies are used for their Transactions on Circuits And Systems—I: Regular Papers, Vol. 56, No. 9,
September 2009
implementations and architectural differences. Table-1
[5] C. McIvor, M. McLoone, and J.V. McCanny, “Modified Montgomery
Modular Multiplication and RSA Exponentiation,” IEE Proceedings on
Computers and Digital Techniques, Vol. 151, pp. 402-408, 2004
[6] Alexandre F. Tenca, Georgi Todorov, and Cetin K. Koc¸ ,"High-Radix
Design of a Scalable Modular Multiplier," Third International Workshop
on Cryptographic Hardware and Embedded Systems (CHES-2001), LNCS
2162, pp 185-201
[7] T. Blum and C. Paar, “High-Radix Montgomery Modular Exponentiation
on Reconfigurable Hardware,” IEEE Transaction on Computers, Vol. 50,
pp. 759-764, 2001
[8] Eldridge, S.E., and Walter, C.D. "Hardware Implementation of
Montgomery’s Modular Multiplication Algorithm’, IEEE Transactions on
Computers, 1993, 42, pp. 693 –699
[9] Tim Guneysu and Christof Paar, "Modular Integer Arithmetic for Public-
Key Cryptography" in Secure Integrated Circuits and Systems, Springer
Publication. 2010.
[10] Francisco Rodriguez, N.A. Saquib et al, "Cryptographic Algorithms on
Reconfigurable Hardware", Springer, 2006. pp 105-108
[11] E.F. Brickell, "A Fast Modular Multiplication Algorithm with Application
to Two key Cryptography," in Advances in Cryptology, Proceedings of
Crypto 86, pages 51-60
[12] P.L. Montgomery, "Modular Multiplication Without Trial Division,"
Mathematics of Computation, Vol. 44, 1985, pp 519-521.
[13] K. Posch and R. Posch, “Modulo Reduction in Residue Number Systems,”
IEEE Trans. Parallel and Distributed Systems, Vol. 6, No. 5, pp. 449-454,
1995.
[14] C. McIvor, M. McLoone, and J. V. McCanny, “High-radix systolic
modular multiplication on reconfigurable hardware,” in Proc. IEEE
International Conference on Field-Programmable Technology 2005
(ICFPT’05), Dec. 2005, pp. 13–18.
[15] Elbert, A.J., and Paar, C. "Towards an FPGA Architecture Optimized for
Public-Key Algorithms’. Presented at the SPIE Symposium on Voice,
Video and Communications, Sept. 1999
[16] Miaoqing Huang, Kris Gaj, Tarek El-Ghazawi, "New Hardware
Architectures for Montgomery Modular Multiplication Algorithm," in
IEEE Transactions on Computers , Vol.60, Issue 7, July 2011.
[17] Miroslav Knezevic, Frederik Vercauteren, Ingrid Verbauwhede, "Faster
Interleaved Modular Multiplication Based on Barrett and Montgomery
Reduction Methods "IEEE Transactions On Computers, Vol. 59, No. 12,
December 2010.
[18] Buminov V. Schimmler, Manfred "Area and time efficient modular
multiplication of large integers," Proceedings of IEEE International
Conference on Application-Specific Systems, Architectures, and
Processors, 24-26 June 2003.
[19] V. Buminov M. Schmimmer, “Area-Time Optimal Modular Multiplication
,”Embedded Cryptographic Hardware: Methodologies and Architectures,
2004, ISBN 1594540128.
[20] H. Ling, "High speed binary adder," IBM Journal of Research &
Development ,25(3):156–166, May 1981.
[21] P.M. Kogge and H.S. Stone, "A Parallel Algorithm for the Efficient
Solution of a General Class of Recurrence Equations," IEEE Transactions
on Computers, Vol. 22, No. 8, pp. 786-792, August 1973.
[22] P.R. Brent and H.T. Kung, "A Regular Layout for Parallel Adders ," IEEE
Transactions on Computers, Vol. 31, No. 3, pp. 260-264 , March 1982
[23] R.E. Lander and M.J. Fischer "Parallel Prefix Computation," Journal of
the ACM, 27(4):831-838, October 1980
[24] T. Han and D. Carlson, "Fast Area Efficient VLSI Adders," Proceedings of
Symposium on Computer Arithmetic , pp. 49-56, May 1987
[25] S. Knowles, "A family of Adders," Proceedings of 14th symposium on
computer arithmetic, pp. 30-34, April 1999.
[26] J. Sklansky, "Conditional Sum Addition Logic," IEEE Transactions on
Electronic Computers, Vol. 9, No. 6, pp. 226-231, June 1960
[27] Neil H.E. Weste, David Money Harris," CMOS VLSI Design : A Circuits
and Systems Perspective, " Fourth Edition, Addison Wesley
[28] Giorgos Dimitrakopoulos, Dimitris Nikolos, "High Speed Parallel Prefix
Ling Adders," IEEE Transactions on Computers, Vol. 54, No.2, February
2005.

Fast Modular Multiplication Using Parallel Prefix Adder: Pravin P. Zode Raghavendra B. Deshmukh

Uploaded by

Copyright:

Available Formats

Fast Modular Multiplication Using Parallel Prefix Adder: Pravin P. Zode Raghavendra B. Deshmukh

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Fast Modular Multiplication Using Parallel Prefix Adder: Pravin P. Zode Raghavendra B. Deshmukh

Uploaded by

Copyright:

Available Formats

Fast Modular Multiplication using

Parallel Prefix Adder

I. INTRODUCTION II. PREVIOUS WORK

Algorithm 2: Montgomery Algorithm for Binary Base [19]

Architectures are implemented on Xilinx Virtex-6 FPGA. The

You might also like