Elliptic Curve Cryptography on Embedded Multicore Systems

Junfeng Fan, Kazuo Sakiyama and Ingrid Verbauwhede


Katholieke Universiteit Leuven, ESAT/SCD-COSIC
Kasteelpark Arenberg 10, B-3001 Leuven-Heverlee, Belgium
{jfan, ksakiyam, iverbauw}@esat.kuleuven.be

ABSTRACT
The increasing use of network-connected embedded devices and online transactions creates a growing demand for network security in embedded systems. The security requirements, such as authentication, confidentiality and integrity, always involve computationally intensive processes and can easily become the bottleneck of the related applications. In this paper we implement Elliptic Curve Cryptography (ECC) on an embedded multicore system and explore task scheduling methods at different levels. First, we propose an instruction scheduling method that utilizes all the cores to perform one modular operation in parallel. Second, we perform multiple modular operations with multiple cores in parallel. The performance of these two implementations is compared, and a scheduling method combining the two types of parallelism is proposed. We discuss the details of our proposed method by using an FPGA implementation of ECC over a prime field.

Categories and Subject Descriptors
E.3 [Data Encryption]: Public key cryptosystems; B.2.1 [Arithmetic and Logic Structures]: Design Styles - Parallel; C.3 [Special-purpose and Application-based Systems]: Smartcards

General Terms
Cryptographic system, parallel computing

Keywords
Multicore embedded system, elliptic curve cryptography, parallel computing

1. INTRODUCTION
While a growing number of embedded systems get connected to the Internet and enjoy the convenience created by the Internet, they also share almost all the security problems that the Internet has. Many applications such as online money transfer, e-mail, remote monitoring and so on need strong security. In order to secure these transactions, many security protocols such as SSL and IPsec were proposed; they can provide multiple levels of authentication, confidentiality and integrity for the communication. The cryptographic algorithms used in these protocols often involve intensive computations and create a big challenge for embedded devices.

Multicore processors for embedded systems, just like their counterparts in desktop computers, are now attracting increasing interest from the embedded system industry. Chip vendors like Intel, ARM, Fujitsu and Renesas have announced different multicore processors and SoCs [1, 2, 3, 4] for embedded systems. The advantages of using multicore processors in embedded systems can be summarized in two points. First, they give embedded developers more room to add new applications. Second, by using parallel computing we can reduce the clock frequency and voltage to achieve an energy-efficient design.

In this paper we investigate the implementation of ECC on a multicore system. ECC can be used to perform key exchange, data encryption and digital signature generation. It has been widely adopted in embedded systems since it uses much shorter operands than RSA. To the best of our knowledge, so far there is no software implementation of ECC reported for embedded multicore systems. In this paper, we consider the performance improvement of an ECC implementation by deploying multiple embedded cores. We propose a configurable multicore prototype processor and three task scheduling methods for Elliptic Curve (EC) scalar multiplication.

Figure 1: ECC implementation pyramid.


2. PREVIOUS WORK
The pyramid in Fig. 1 represents a typical ECC implementation on a general purpose core. On the top level is the application of ECC in security protocols such as SSL. The second level is the EC scalar multiplication, which consists of a sequence of Point Additions (PA) and Point Doublings (PD). PA and PD consist of a sequence of modular multiplications, additions and subtractions. These large-integer modular operations are then implemented as multiple-precision operations on a w-bit processor. A short introduction to the ECC algorithms is given below.
2.1 A Short Introduction to ECC
ECC relies on a group structure induced on an elliptic curve. Good references for the mathematical background are [5, 6, 7, 8]. As we consider a prime field, i.e. GF(p), an elliptic curve E over GF(p) is often expressed as the set of solutions of the Weierstrass equation,

    y^2 = x^3 + ax + b,                                          (1)

where a, b, x, y ∈ GF(p) with 4a^3 + 27b^2 ≠ 0 (mod p), together with the point at infinity O. The points on the curve and the point at infinity O form an abelian group. In this paper we use weighted projective coordinates, where an affine point (x, y) is converted to a projective point (X, Y, Z) by computing x = X/Z^2, y = Y/Z^3 [7]. The projective curve equation corresponding to the affine Eq. 1 is given by

    Y^2 = X^3 + aXZ^4 + bZ^6.                                    (2)
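As a quick sanity check (this derivation is ours and is only implicit in the text), Eq. 2 is obtained by substituting the coordinate change into Eq. 1 and clearing denominators:

    (Y/Z^3)^2 = (X/Z^2)^3 + a·(X/Z^2) + b   =>   Y^2/Z^6 = (X^3 + aXZ^4 + bZ^6)/Z^6   =>   Y^2 = X^3 + aXZ^4 + bZ^6,

which is exactly Eq. 2.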
Suppose that P1 = (X1, Y1, Z1) and P2 = (X2, Y2, Z2) are two points on the curve and P1 ≠ P2. When computing a PA, P2 = P2 + P1, the computation sequence on the projective curve is presented in Eq. 3 for the special case Z1 = 1:

    t3 = Z2·Z2,      t4 = X1·t3 + X2,   t2 = X1·t3 - X2,
    t3 = t3·Z2,      Y2 = t3·Y1 - Y2,   t3 = t3·Y1 + Y2,
    t5 = t2·t2,      t1 = t4·t5,        X2 = Y2·Y2 - t1,
    t4 = 2·X2,       t4 = t1 - t4,      t1 = t5·t2,             (3)
    t1 = t3·t1,      t3 = t4·Y2 - t1,   Y2 = t3/2,
    Z2 = t2·Z2.

For point doubling, P2 = 2·P2, we use

    t1 = X2·X2,      t1 = 3·t1,         t2 = Z2·Z2,
    t2 = t2·t2,      t2 = a·t2 + t1,    t1 = 2·Y2,
    Z2 = Z2·t1,      t3 = t1·t1,        t4 = X2·t3,             (4)
    t5 = t2·t2,      X2 = 2·t4,         X2 = t5 - X2,
    t1 = t1·t3,      t1 = t1·Y2,        t3 = t4 - X2,
    Y2 = t2·t3 - t1.
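The doubling sequence in Eq. 4 can be checked numerically against the familiar affine doubling formulas. The sketch below is purely illustrative: it uses a small, hypothetical curve y^2 = x^3 + 2x + 3 over GF(97) and a toy point of our own choosing, not the 192-bit parameters used later in the paper.

```python
# Hedged sketch: run the 16 operations of Eq. 4 on a projective point and
# compare the result with affine point doubling. Parameters are illustrative.
p, a = 97, 2                   # curve y^2 = x^3 + a*x + 3 over GF(p); b is unused when doubling
X2, Y2, Z2 = 3, 6, 1           # the affine point (3, 6), embedded with Z = 1

# Eq. 4, executed in program order (all operations mod p)
t1 = X2 * X2 % p;   t1 = 3 * t1 % p;          t2 = Z2 * Z2 % p
t2 = t2 * t2 % p;   t2 = (a * t2 + t1) % p;   t1 = 2 * Y2 % p
Z2 = Z2 * t1 % p;   t3 = t1 * t1 % p;         t4 = X2 * t3 % p
t5 = t2 * t2 % p;   X2n = 2 * t4 % p;         X2n = (t5 - X2n) % p
t1 = t1 * t3 % p;   t1 = t1 * Y2 % p;         t3 = (t4 - X2n) % p
Y2n = (t2 * t3 - t1) % p
X2, Y2 = X2n, Y2n

# Back to affine coordinates: x = X/Z^2, y = Y/Z^3
x_dbl = X2 * pow(Z2 * Z2, -1, p) % p
y_dbl = Y2 * pow(Z2 * Z2 * Z2, -1, p) % p

# Affine doubling of (3, 6) for comparison
x, y = 3, 6
lam = (3 * x * x + a) * pow(2 * y, -1, p) % p
x3 = (lam * lam - 2 * x) % p
y3 = (lam * (x - x3) - y) % p
assert (x_dbl, y_dbl) == (x3, y3)
```

The assert passes, confirming (for this toy example) that the sequence in Eq. 4 computes P2 = 2·P2 in weighted projective coordinates.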
Scalar multiplication is the main routine that ECC-based applications call. A scalar multiplication is the computation of kP, where k is a large integer and P is a point on the elliptic curve. It consists of a sequence of PAs and PDs as shown in Alg. 1. In our case both PA and PD require 16 modular operations over a prime field.

Algorithm 1 Left-to-right binary scalar multiplication.
Input: P = (x, y), k = (k_{l-1} k_{l-2} ... k_0)_2
Output: Q = (x', y') = kP
1: P2 = O
2: for i from l-1 downto 0 do
3:    P2 = 2·P2
4:    if k_i = 1 then
5:       P2 = P2 + P
6:    end if
7: end for
8: return P2
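To make the control flow of Alg. 1 concrete, the following sketch implements left-to-right double-and-add in plain affine coordinates over the same toy curve as above (y^2 = x^3 + 2x + 3 over GF(97)). The curve and the point are illustrative assumptions; the paper's implementation instead uses the projective formulas of Eq. 3 and Eq. 4 over a 192-bit prime field.

```python
# Hedged sketch of Alg. 1: left-to-right double-and-add, affine coordinates.
P_MOD, A = 97, 2                       # y^2 = x^3 + 2x + 3 over GF(97), illustrative
O = None                               # the point at infinity

def point_add(p1, p2):
    if p1 is None: return p2
    if p2 is None: return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return O                       # P + (-P) = O (also covers doubling with y = 0)
    if p1 == p2:                       # point doubling (PD)
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:                              # point addition (PA)
        lam = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def scalar_mul(k, point):
    q = O                              # step 1: P2 = O
    for bit in bin(k)[2:]:             # scan k from its most significant bit
        q = point_add(q, q)            # step 3: PD
        if bit == '1':
            q = point_add(q, point)    # step 5: PA
    return q

base = (3, 6)
assert scalar_mul(5, base) == point_add(scalar_mul(2, base), scalar_mul(3, base))
print(scalar_mul(13, base))
```

Because k is scanned bit by bit, the loop issues one PD per bit of k and one PA per set bit; it is this chain of PAs and PDs whose modular operations Sections 4 and 5 distribute over the cores.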
2.2 Modular Multiplication
As described before, modular multiplication is a fundamental operation in ECC. Montgomery proposed an efficient algorithm for modular multiplication [9]. The algorithm requires only multiplications and additions, and avoids the time-consuming trial divisions. Alg. 2 shows an optimized Montgomery algorithm called FIOS (Finely Integrated Operand Scanning) [10], which is suitable for a software implementation on a w-bit datapath.

Algorithm 2 Radix-2^w n-bit Montgomery modular multiplication (FIOS).
Input: integers P = (p_{s-1}, ..., p_0)_r, X = (x_{s-1}, ..., x_0)_r, Y = (y_{s-1}, ..., y_0)_r, where 0 ≤ X, Y < P, r = 2^w, s = ⌈n/w⌉, R = r^s with gcd(P, r) = 1, and P' = -P^(-1) mod r.
Output: Z = X·Y·R^(-1) mod P
1: Z = (z_{s-1}, ..., z_0)_r ← 0
2: for i = 0 to s-1 do
3:    T ← (z_0 + x_0·y_i)·P' mod r
4:    Z ← (Z + X·y_i + P·T)/r
5: end for
6: if Z ≥ P then
7:    Z ← Z - P
8: end if
9: return Z
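Since the scheduling discussion in Section 4.1 builds on the word-level structure of Alg. 2, a small executable model is helpful. The sketch below is our own Python rendering of FIOS (the function name and word layout are assumptions, not from the paper); Python integers silently absorb the one extra top word that a real w-bit implementation keeps for the running carry.

```python
# Hedged word-level sketch of Alg. 2 (FIOS Montgomery multiplication).
# Operands are lists of s w-bit words, least-significant word first.
def fios_montgomery(x_words, y_words, p_words, w):
    s = len(p_words)
    r, mask = 1 << w, (1 << w) - 1
    p = sum(pw << (w * j) for j, pw in enumerate(p_words))
    p_prime = (-pow(p, -1, r)) % r                   # P' = -P^(-1) mod r (P must be odd)
    z = [0] * (s + 1)                                # one extra word for the top carry
    for i in range(s):
        yi = y_words[i]
        t = ((z[0] + x_words[0] * yi) * p_prime) % r          # step 3
        carry = 0
        for j in range(s):                           # step 4, word by word
            acc = z[j] + x_words[j] * yi + p_words[j] * t + carry
            z[j] = acc & mask
            carry = acc >> w                         # this carry chain is the data
        z[s] += carry                                #   dependency discussed in Sec. 4.1
        z = z[1:] + [0]                              # division by r: drop word 0 (it is zero)
    result = sum(zw << (w * j) for j, zw in enumerate(z))
    return result - p if result >= p else result     # steps 6-8

# Illustrative check with made-up 8-bit words: the output equals X*Y*R^(-1) mod P.
w, p = 8, 0xF1FB                                     # small odd modulus, illustration only
to_words = lambda v, s: [(v >> (w * j)) & 0xFF for j in range(s)]
x, y, s = 0x1234, 0x5678, 2
z = fios_montgomery(to_words(x, s), to_words(y, s), to_words(p, s), w)
assert z == x * y * pow(1 << (w * s), -1, p) % p
```

The inner carry variable makes the dependency explicit: word j of Z cannot be finalized before the carry produced by word j-1 is available, which is why Section 4.1 keeps carries local to one core.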
Obviously, parallel computing can be applied at different abstraction levels. One can use all the cores to perform one Modular Multiple-precision Operation (MMO), or perform different MMOs with different cores in parallel. In [11] a superscalar processor was proposed to perform parallel computing: modular operations are executed in parallel if no data dependency exists between them. Hardware implementations [12, 13, 14] always use multiple Processing Elements (PE) to perform one MMO. Generally, the performance of these two types of parallel implementation varies with the type of the field (prime field or binary field), the platform architecture and the clock frequency. For multicore systems, software designers have to choose a parallel computing strategy that fits the platform architecture. In this paper we compare the performance of the two parallel computing methods and try to achieve an efficient combination of them.

For parallel computation, one major problem is data transfer between different cores. Taking a multicore SoC as an example, data is always transferred through the shared memory. Reducing the number of data transfers does not only speed up the computation, but also reduces the power consumption, since memory access instructions are quite energy-consuming. In this paper, we also investigate the data dependencies of the modular multiplication and propose an instruction scheduling method to reduce the number of data transfers.
3. IMPLEMENTATION PLATFORM
General multicore systems can have various architectures and corresponding memory configurations. For instance, they can be multiple embedded cores with a shared data cache, or multiple processors with a shared memory. In this paper we employ a VLIW-architecture processor as a prototype. The reason for building a prototype processor instead of using an existing multicore processor is that it offers us high configurability. The prototype processor can be configured to have 1, 2, 4, 8 or even more embedded cores. This allows us to explore instruction scheduling methods for different platforms.

As shown in Fig. 2, the processor consists of a main controller, a data memory, a program memory storing the instructions, and four embedded cores. Only the main controller can access the program memory and the data memory. The instructions are dispatched to the cores in parallel via the instruction bus. The results are stored in the register files. The data memory has only one read/write port; therefore, a single data memory access is allowed in each cycle.

We denote by w the operation size of a w-bit core. A 16-bit (w = 16) core is also shown in Fig. 2. It is a highly simplified load/store CPU. It has an instruction decoder and a register file with sixteen 16-bit general registers and one status register. The Arithmetic Logic Unit (ALU) includes one 16-bit multiplier and one 16-bit adder. It also has an output register to store the data that will be written to the data memory, and an input register to buffer the data coming from the data memory. Both of them are 16 bits wide. One 32-bit Write Back (WB) register is also used to store data from the ALU. The bit-lengths of both the datapath and the registers are doubled if the core is configured as a 32-bit (w = 32) core. The cores support a simple load/store Instruction Set Architecture (ISA). Instructions for each core are 16 bits long. All arithmetic operations are performed on data stored in the local register file. When data needs to be moved from one core to another, it is first stored to the data memory and then loaded by the destination core. Cores in this platform support 4-stage instruction pipelining.

Figure 2: Schematic block diagram of the VLIW prototype processor (main controller, instruction and data memories, instruction and data buses, and cores 1 to m; each core contains an instruction decoder, a register file, a 16-bit multiplier and adder, and input, output and write-back registers).

4. PARALLEL IMPLEMENTATION
As mentioned in Section 2, the sequence of operations in a scalar multiplication can be scheduled on this multicore system in different ways. To make this easy to express, we use the following two terms to denote parallelism at different abstraction levels.

1. Horizontal Parallelism: when performing the sequence of MMOs on a multicore system, each MMO is performed with all the cores in parallel.

2. Vertical Parallelism: when performing the sequence of MMOs on a multicore system, different MMOs that have no data dependencies are performed by different cores in parallel.

We first propose an instruction scheduling method to perform horizontal parallelism. The performance of horizontal parallelism is then compared with that of vertical parallelism. Finally, a combination of horizontal parallelism and vertical parallelism is deployed to improve the performance.

4.1 Horizontal Parallelism
The main data dependencies in the Montgomery algorithm using multiple cores are due to the carries generated by additions. Take Algorithm 2 as an example: step 4 can be expressed word by word as (z_j + (X·y_i)_j + (P·T)_j + ca_{j-1}), where ca_{j-1} is the carry generated when computing z_{j-1}. Obviously, x_j·y_i, for any 0 ≤ i, j ≤ s-1, depends only on the operands X and Y. We can also calculate p_j·T immediately after the generation of T. The products with the same weight as z_j and the carry from z_{j-1} are accumulated into z_j, generating a new z_j and a 2-bit carry. As a result, z_j can only be generated after the carry from z_{j-1} is ready. It is significantly inefficient to use carries generated by another core, especially when using a shared-memory-based multicore system.

Figure 3: Data dependency of the FIOS Montgomery algorithm.

Therefore, it is desirable to partition the Montgomery algorithm so that a carry is only used in the local core. Since carry propagation only exists within each iteration, one simple strategy is to use one core per iteration so that carry transfers are avoided. A software pipelining method can then be used to perform the modular multiplication in parallel: core-1 performs the first iteration, core-2 the second one, core-3 the third one and core-4 the fourth one. This scheduling method successfully avoids carry transfers. However, it causes s(s-1) data transfers when sending the intermediate data, Z, to neighboring cores.
Note that to compute T only z_0 must be ready at the end of each iteration, and the carries from z_1, z_2, ..., z_{s-1} do not need to be accumulated into the corresponding words of Z immediately. Based on this fact, another scheduling method can be considered, as shown in Fig. 4. In this method, each iteration of Alg. 2 is performed by multiple cores. Suppose we have n = 256, w = 32 and s = n/w = 8. During the whole loop, (z_1, z_0) is generated and stored in core-1, (z_3, z_2) in core-2, (z_5, z_4) in core-3 and (z_7, z_6) in core-4. A carry is only used in the local core. At the end of each iteration, z_2 in core-2 becomes the new z_1 and is sent to core-1. Similarly, z_4 is sent to core-2 and z_6 is sent to core-3. After eight iterations and a conditional subtraction, Z = X·Y·R^(-1) mod P is obtained, stored across the register files of the cores. In general, if Z has s words and the system has p cores, the number of data transfers is 2(p-1)s, which is much smaller than that of the previous scheduling method.

Figure 4: Parallelized Montgomery modular multiplication (field size n = 256, machine size w = 32 and s = 8).

When performing a 192-bit Montgomery modular multiplication on a single 32-bit core, 343 clock cycles are needed to finish one modular multiplication. When using two 32-bit cores, 223 clock cycles are needed. If four 32-bit cores are used, only 175 clock cycles are needed. In other words, by utilizing two or four cores to perform one 192-bit Montgomery modular multiplication, the performance can be improved by a factor of 1.53 or 1.96, respectively, compared to a single-core based implementation.

4.2 Vertical Parallelism
As can be seen in Eq. 3 and Eq. 4, we can compute several modular operations with different cores in parallel. By checking the Read-After-Write (RAW) dependencies we can find modular operations that can be executed in parallel. Take the point doubling sequence in Eq. 4 as an example: the first modular operation, t1 = X2·X2, and the third modular operation, t2 = Z2·Z2, have no data dependency on each other and can thus be performed in parallel on core-1 and core-2. The second modular operation, t1 = 3·t1, needs the result of the first one, so it can only be performed after t1 = X2·X2 is done. Repeating this procedure, we obtain the scheduling shown in Table 1 for PD. We can obtain a similar parallelized sequence for PA as well. As there are no data dependencies between modular operations that are performed in parallel, a p-way parallel computation is almost p times faster than a single-core based computation.

Table 1: Parallelized sequence for PD; up to three-way vertical parallelism can be found.

  Point Doubling
  core-1             core-2           core-3        core-4
  t1 = X2·X2         t2 = Z2·Z2       -             -
  t1 = 3·t1          t2 = t2·t2       -             -
  t2 = a·t2 + t1     t1 = 2·Y2        -             -
  Z2 = Z2·t1         t3 = t1·t1       t5 = t2·t2    -
  t4 = X2·t3         t1 = t1·t3       -             -
  X2 = 2·t4          t1 = t1·Y2       -             -
  X2 = t5 - X2       -                -             -
  t3 = t4 - X2       -                -             -
  Y2 = t2·t3 - t1    -                -             -
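The schedule in Table 1 can be reproduced mechanically. The sketch below is our own illustration (the operation list encodes Eq. 4, and the function and variable names are ours): a greedy pass fills one parallel time slot after another. Besides the RAW check described above, it also has to respect the write-after-read and write-after-write hazards that appear because Eq. 4 reuses variables such as t1 and X2.

```python
# Hedged sketch of the dependency analysis behind Table 1. Each entry is
# (destination, sources) for one modular operation of Eq. 4, in program order;
# constants such as 2, 3 and the curve parameter a are omitted from the sources.
PD_OPS = [
    ("t1", ["X2", "X2"]), ("t1", ["t1"]),       ("t2", ["Z2", "Z2"]),
    ("t2", ["t2", "t2"]), ("t2", ["t2", "t1"]), ("t1", ["Y2"]),
    ("Z2", ["Z2", "t1"]), ("t3", ["t1", "t1"]), ("t4", ["X2", "t3"]),
    ("t5", ["t2", "t2"]), ("X2", ["t4"]),       ("X2", ["t5", "X2"]),
    ("t1", ["t1", "t3"]), ("t1", ["t1", "Y2"]), ("t3", ["t4", "X2"]),
    ("Y2", ["t2", "t3", "t1"]),
]

def schedule(ops, cores=4):
    done, slots = set(), []
    while len(done) < len(ops):
        slot = []                                    # one row of Table 1
        for k, (dst, srcs) in enumerate(ops):
            if k in done or len(slot) == cores:
                continue
            pending = [j for j in range(k) if j not in done]
            raw = any(ops[j][0] in srcs for j in pending)   # a source is not ready yet
            waw = any(ops[j][0] == dst for j in pending)    # an earlier write of dst is pending
            war = any(dst in ops[j][1]                      # an earlier read of dst would see
                      for j in pending if j not in slot)    #   the new value too early
            if not (raw or waw or war):
                slot.append(k)
        done.update(slot)
        slots.append([ops[k] for k in slot])
    return slots

for row in schedule(PD_OPS):
    print(row)
```

Run on the 16 operations of Eq. 4 with four cores, this greedy pass produces nine time slots with at most three operations per slot, which matches the rows of Table 1.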

The vertical parallelism can be dynamically scheduled in software. Suppose the RAW dependencies are checked over a sequence of t operations. On one hand, a higher degree of parallelism may be found if a larger t is chosen. On the other hand, the number of conditions to check for RAW dependencies grows exponentially as t increases. In this paper we statically schedule the PD and PA sequences prior to executing the scalar multiplication, thus avoiding this possible bottleneck in the whole design.

Scalar multiplication can be computed by combining three sequences: (a) PA followed by PD (denoted as PA-PD), (b) PD-PD, and (c) PD-PA. As shown in Eq. 3 and Eq. 4, both PA and PD consist of 16 operations. As a result, each of the PA-PD, PD-PD and PD-PA combinations contains 32 operations. We exploit the vertical parallelism within these 32 operations and store the schedule in the program area in software. We can find up to three-way parallelism in ECC over a prime field.

4.3 Two-dimensional Parallelism
Apparently, using only vertical parallelism cannot keep all the cores busy during the whole scalar multiplication, since 4-way parallelism is not always achievable (in fact, we can only find up to three-way parallelism). In order to utilize all the cores efficiently, we can combine the ideas of horizontal parallelism and vertical parallelism, making a two-dimensional parallelism.
Figure 5: Various strategies for two-dimensional parallelism exploration in the case of using four cores.

We start from the schedule generated by the vertical parallelism method, and deploy the horizontal parallelism method to perform some of the modular operations so that all cores in the system can be utilized. Possible combinations of the vertical and horizontal parallelism are illustrated in Fig. 5. The figure explains how the modular operations are assigned to the cores in the system. In Fig. 5(a), three-way parallel modular operations can be performed with only the vertical parallelism, with the horizontal parallelism, or with a combination of both. Fig. 5(b) shows the possible methods of performing two-way parallel operations, and Fig. 5(c) shows the methods for a single MMO.

Table 2 summarizes the number of clock cycles required for one 192-bit modular multiplication. We obtain different results depending on the strategy used to exploit parallelism. The best results are 372, 241 and 175 cycles for three-, two- and one-way vertical parallelism, respectively.

Table 2: Required clock cycles for one modular multiplication for different strategies to exploit parallelism.

  Vertical       Horizontal Parallelism
  Parallelism    1        2        4
  3              372      416      525
  2              352      241      350
  1              343      -        175

The results indicate that the horizontal parallelism degrades the performance when three-way vertical parallelism is found. This is due to the fact that the horizontal parallel computation can speed up the computation by a factor of 1.96 when using 4 cores, while the vertical parallel computation can speed up the computation by a factor of 2.96 when three-way parallelism is available. On the other hand, the horizontal parallelism reduces the cycle count when no vertical parallelism can be found. With regard to two-way vertical parallelism, we obtain the best performance when using two cores for each modular operation.

5. RESULTS
In order to see the effectiveness of two-dimensional parallelism, we observe the performance of EC scalar multiplication as shown in Fig. 6. We consider four cases: (I) vertical parallelism only, (II) two-dimensional parallelism with up to three-way vertical parallelism, (III) two-dimensional parallelism with up to two-way vertical parallelism, and (IV) horizontal parallelism only.

As a result of an FPGA implementation, the best performance is obtained with case III: a 192-bit EC scalar multiplication is performed in 9.9 msec when running at 93 MHz. The processor uses 3,173 slices, sixteen 18-bit dedicated multipliers and six embedded BRAMs on a Xilinx Virtex-II Pro (XC2VP30-7).

The performance comparison of EC scalar multiplications is shown in Table 3. Compared to the ARM-based implementation proposed in [15], our implementation is about 7 times faster. This is due to the use of multiple cores and a higher clock frequency. The implementation from [16] is quite fast because of the use of the fixed-base comb method. Compared to the state-of-the-art hardware implementations [17, 18], software implementations are still slower. However, coprocessors add area and complexity to the whole system, and have far less flexibility than software implementations.
Table 3: Performance comparison of EC scalar multiplication over GF(p).

  Reference     Field Size   Target          Freq.    Perf.
  Design        [bit]        Platform        [MHz]    [msec]
  This Work     192          Virtex-II Pro   93       9.90
  [15]          192          ARM7TDMI        80       69.2
  [16]          192          Pentium II      400      2.14
  [17]          192          Virtex-E        40       3.00
  [18]          256          Virtex-II Pro   100      2.70

  Note: [16] uses a fixed modulus p = 2^192 - 2^64 - 1 and the fixed-base comb method.

Figure 6: Performance of 192-bit EC scalar multiplication for different strategies of exploiting parallelism (performance in msec, each bar split into the inversion and the PA/PD chain; bar values 14.5, 13.4, 10.2 and 9.9 msec).

6. CONCLUSIONS AND FUTURE WORK
We present a two-dimensional parallel computing architecture for ECC. The results show that the performance of EC scalar multiplication can be improved further by introducing two-dimensional parallelism. Our implementation results using two-dimensional parallelism improve the performance by 26% and 32%, respectively, compared to the cases of exploiting only the vertical or only the horizontal parallelism.

It is possible to apply this idea to any sequence of EC scalar multiplications and to other PKCs including RSA. Further improvements in performance can be obtained by using better projective coordinates and by recoding the scalar in Non-Adjacent Form (NAF) [7] representation. Our future work also includes using parallel computing to achieve energy-efficient implementations of ECC on multicore systems.

Acknowledgments
Junfeng Fan and Kazuo Sakiyama are funded by a research grant of the Katholieke Universiteit Leuven and FWO projects (G.0450.04, G.0475.05). This work was supported in part by the IAP Programme P6/26 BCRYPT of the Belgian State (Belgian Science Policy), by the EU IST FP6 projects (SESOC and ECRYPT), by the K.U. Leuven, and by the IBBT-QoE project of the IBBT.
7. REFERENCES
[1] http://www.embedded.com/products.
[2] http://www.arm.com/products/CPUs.
[3] http://www.fujitsu.com.
[4] http://www.renesas.com.
[5] N. Koblitz. Algebraic Aspects of Cryptography. Springer-Verlag, first edition, 1998.
[6] A. Menezes, Y.-H. Wu, and R. Zuccherato. An Elementary Introduction to Hyperelliptic Curves - Appendix, pages 155-178 of N. Koblitz, Algebraic Aspects of Cryptography. Springer-Verlag, 1998.
[7] I. Blake, G. Seroussi, and N.P. Smart. Elliptic Curves in Cryptography. London Mathematical Society Lecture Note Series. Cambridge University Press, 1999.
[8] D. Hankerson, A. Menezes, and S. Vanstone. Guide to Elliptic Curve Cryptography. Springer-Verlag, 2004.
[9] P. Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44(170):519-521, 1985.
[10] C.K. Koc, T. Acar, and B.S. Kaliski Jr. Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro, 16(3):26-33, June 1996.
[11] K. Sakiyama, L. Batina, B. Preneel, and I. Verbauwhede. Superscalar Coprocessor for High-speed Curve-based Cryptography. In Proceedings of the 8th International Workshop on Cryptographic Hardware and Embedded Systems (CHES), number 4249 in LNCS, pages 415-429. Springer-Verlag, 2006.
[12] S.E. Eldridge and C.D. Walter. Hardware implementation of Montgomery's modular multiplication algorithm. IEEE Transactions on Computers, 42:693-699, 1993.
[13] A.F. Tenca and C.K. Koc. A scalable architecture for modular multiplication based on Montgomery's algorithm. IEEE Transactions on Computers, 52(9):1215-1221, September 2003.
[14] N. Mentens, K. Sakiyama, B. Preneel, and I. Verbauwhede. Efficient pipelining for modular multiplication architectures in prime fields. In Proceedings of the 2007 Great Lakes Symposium on VLSI (GLSVLSI 2007), pages 534-539, 2007.
[15] M. Aydos, T. Yanik, and C.K. Koc. High-speed implementation of an ECC-based wireless authentication protocol on an ARM microprocessor. IEE Proceedings - Communications, 148(5):273-279, October 2001.
[16] M. Brown, D. Hankerson, J. Lopez, and A. Menezes. Software implementation of the NIST elliptic curves over prime fields. In Proceedings of the 2001 Conference on Topics in Cryptology: The Cryptographer's Track at RSA, number 2020 in LNCS, pages 250-265, 2001.
[17] G. Orlando and C. Paar. A scalable GF(p) elliptic curve processor architecture for programmable hardware. In Proceedings of the 3rd International Workshop on Cryptographic Hardware and Embedded Systems (CHES), number 2162 in LNCS, pages 356-371. Springer-Verlag, 2001.
[18] K. Sakiyama, N. Mentens, B. Preneel, and I. Verbauwhede. A parallel processing hardware architecture for elliptic curve cryptosystems. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2006), pages 904-907, 2006.
