Elliptic Curve Cryptography On Embedded Multicore Systems
1. INTRODUCTION
While a growing number of embedded systems are connected to the Internet and enjoy the convenience it creates, they also share almost all of the security problems that come with it. Many applications such as online money
Suppose that P1 = (X1, Y1, Z1) and P2 = (X2, Y2, Z2) are two points on the curve and P1 ≠ P2. When computing the PA P2 = P2 + P1, the computation sequence on the projective curve is presented in Eq. 3 for the special case Z1 = 1.

  t3 = Z2·Z2,   t4 = X1·t3 + X2,   t2 = X1·t3 - X2,
  t3 = t3·Z2,   Y2 = t3·Y1 - Y2,   t3 = t3·Y1 + Y2,
  t5 = t2·t2,   t1 = t4·t5,        X2 = Y2·Y2 - t1,
  t4 = 2·X2,    t4 = t1 - t4,      t1 = t5·t2,          (3)
  t1 = t3·t1,   t3 = t4·Y2 - t1,   Y2 = t3/2,
  Z2 = t2·Z2.
For point doubling, P2 = 2·P2, we use

  t1 = X2·X2,   t1 = 3·t1,         t2 = Z2·Z2,
  t2 = t2·t2,   t2 = a·t2 + t1,    t1 = 2·Y2,
  Z2 = Z2·t1,   t3 = t1·t1,        t4 = X2·t3,
  t5 = t2·t2,   X2 = 2·t4,         X2 = t5 - X2,        (4)
  t1 = t1·t3,   t1 = t1·Y2,        t3 = t4 - X2,
  Y2 = t2·t3 - t1.
Scalar multiplication is the main routine that ECC-based applications call. A scalar multiplication is the computation of kP, where k is a large integer and P is a point on the elliptic curve. It consists of a sequence of PAs and PDs as shown in Alg. 1. In our case both PA and PD require 16 modular operations over a prime field.
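Alg. 1 itself is not reproduced in this excerpt. As a rough illustration of the kind of PA/PD sequence it produces, the sketch below uses a plain left-to-right double-and-add loop; the point type, the helper names and the binary scanning of k are our assumptions, with point_add and point_double standing in for Eq. 3 and Eq. 4 and assumed to be provided elsewhere.

#include <stdint.h>

#define S 8                                            /* 256-bit values as 8 x 32-bit words */
typedef struct { uint32_t X[S], Y[S], Z[S]; } point;   /* projective coordinates             */

/* assumed helpers standing for Eq. 4 and Eq. 3 (defined elsewhere) */
void point_double(point *P2);                          /* PD: P2 = 2*P2                      */
void point_add(point *P2, const point *P1);            /* PA: P2 = P2 + P1, with Z1 = 1      */

/* Left-to-right double-and-add: R = k*P, one PD per scanned bit of k and one PA
 * for every bit that is 1.  A sketch of the sequence only, not the paper's Alg. 1. */
void scalar_mul(point *R, const uint32_t k[S], const point *P)
{
    int started = 0;
    for (int i = S * 32 - 1; i >= 0; i--) {
        int bit = (k[i / 32] >> (i % 32)) & 1;
        if (started)
            point_double(R);                           /* PD for every scanned bit           */
        if (bit) {
            if (!started) { *R = *P; started = 1; }    /* first 1 bit: initialise R = P      */
            else point_add(R, P);                      /* PA when the bit is 1               */
        }
    }
}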
The modular multiplications themselves are performed with word-level Montgomery multiplication [9, 10]. With r = 2^w the word radix, s = n/w the number of words, R = r^s and p' = -p^-1 mod r, one multiplication proceeds as follows:

  Output: X·Y·R^-1 mod p
  1: Z = (z_{s-1}, ..., z_0) ← 0
  2: for i = 0 to s-1 do
  3:   T ← (z_0 + x_0·y_i)·p' mod r
  4:   Z ← (Z + X·y_i + p·T)/r
  5: end for
  6: if Z ≥ p then
  7:   Z ← Z - p
  8: end if
  9: return Z
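To make the word-level data flow concrete, here is a minimal single-core C sketch of the loop above. It uses the CIOS arrangement described in [10]: step 4 is split into two carry-propagating passes, first accumulating X·y_i and then adding p·T and shifting one word out. The function and variable names, the choice w = 32 and s = 8, and the precomputed p_inv = -p^-1 mod 2^32 are illustrative assumptions, not code from the paper; X and Y are assumed to be less than p.

#include <stdint.h>

#define S 8   /* number of w-bit words: n = 256, w = 32, s = n/w */

/* Z = X*Y*R^-1 mod p with R = 2^(32*S); p_inv = -p^-1 mod 2^32, precomputed. */
void mont_mul(uint32_t Z[S], const uint32_t X[S], const uint32_t Y[S],
              const uint32_t p[S], uint32_t p_inv)
{
    uint32_t t[S + 2] = {0};                         /* running value of Z */

    for (int i = 0; i < S; i++) {
        /* pass 1: Z = Z + X*y_i, with local carry propagation */
        uint64_t c = 0;
        for (int j = 0; j < S; j++) {
            uint64_t m = (uint64_t)t[j] + (uint64_t)X[j] * Y[i] + c;
            t[j] = (uint32_t)m;
            c = m >> 32;
        }
        uint64_t acc = (uint64_t)t[S] + c;
        t[S] = (uint32_t)acc;
        t[S + 1] = (uint32_t)(acc >> 32);

        /* step 3: T = (z_0 + x_0*y_i)*p' mod r; x_0*y_i is already folded into t[0] */
        uint32_t T = t[0] * p_inv;

        /* pass 2: Z = (Z + p*T)/r; the lowest word cancels and is shifted out */
        c = ((uint64_t)t[0] + (uint64_t)p[0] * T) >> 32;
        for (int j = 1; j < S; j++) {
            acc = (uint64_t)t[j] + (uint64_t)p[j] * T + c;
            t[j - 1] = (uint32_t)acc;
            c = acc >> 32;
        }
        acc = (uint64_t)t[S] + c;
        t[S - 1] = (uint32_t)acc;
        t[S] = t[S + 1] + (uint32_t)(acc >> 32);
    }

    /* steps 6-8: conditional subtraction, Z = Z - p if Z >= p */
    uint32_t d[S];
    uint64_t borrow = 0;
    for (int j = 0; j < S; j++) {
        uint64_t diff = (uint64_t)t[j] - p[j] - borrow;
        d[j] = (uint32_t)diff;
        borrow = diff >> 63;
    }
    for (int j = 0; j < S; j++)
        Z[j] = (t[S] != 0 || borrow == 0) ? d[j] : t[j];
}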
Obviously, parallel computing can be applied at different abstraction levels. One can use all the cores to perform one Modular Multiple-precision Operation (MMO), or perform different MMOs on different cores in parallel. In [11] a superscalar processor was proposed to perform parallel computing; modular operations are executed in parallel if no data dependency exists between them. Hardware implementations [12, 13, 14] always use multiple Processing Elements (PEs) to perform one MMO. Generally, the performance of these two types of parallel implementation varies with the type of field (prime field or binary field), the platform architecture and the clock frequency. For multicore systems, software designers have to choose a parallel computing strategy that fits the platform architecture. In this paper we will compare the performance of the two parallel computing methods and try to achieve an efficient combination of them.

For parallel computation, one major problem is data transfer between different cores. Taking a multicore SoC as an example, data is always transferred through the shared memory. Reducing the number of data transfers does not only speed up the computation, but also reduces the power consumption.
Each core is equipped with sixteen 16-bit general registers and one status register. The Arithmetic Logic Unit (ALU) includes one 16-bit multiplier and one 16-bit adder. It also has an output register to store data that will be written to the data memory, and an input register to buffer data coming from the data memory; both of them are 16 bits wide. One 32-bit Write Back (WB) register is also used to store data from the ALU. The bit-lengths of both the data-path and the registers are doubled if a core is configured as a 32-bit (w = 32) core. The cores support a simple load/store Instruction Set Architecture (ISA), and instructions for each core are 16 bits long. All arithmetic operations are performed on data stored in the local register file. When data needs to be moved from one core to another, it is first stored to the data memory and then loaded by the destination core. Cores in this platform support 4-stage instruction pipelining.

[Figure: the multicore platform (main controller, instruction memory, data memory, instruction bus, data bus, and cores core-1 ... core-m) and the datapath of a single core (decoder, register file, 16-bit multiplier and adder, and the Ins/Rin/Rout registers).]
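The store-then-load transfer between cores can be pictured with a small sketch. The excerpt does not describe how cores synchronize, so the ready flag below is purely an assumption, as are all names; the shared data memory is modelled with ordinary globals.

#include <stdint.h>

volatile uint32_t shared_word;    /* a location in the shared data memory         */
volatile uint32_t shared_ready;   /* assumed hand-shake flag (not from the text)   */

/* runs on the source core: move a value from a local register to the data memory */
void send_word(uint32_t local_reg)
{
    shared_word  = local_reg;     /* store to the shared data memory               */
    shared_ready = 1;             /* signal the destination core                   */
}

/* runs on the destination core: load the value into its local register file       */
uint32_t receive_word(void)
{
    while (!shared_ready) { }     /* wait until the word has been written          */
    return shared_word;
}

int main(void)                    /* trivial single-threaded demonstration          */
{
    send_word(0x1234u);
    return receive_word() == 0x1234u ? 0 : 1;
}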
4. PARALLEL IMPLEMENTATION
As mentioned in Section 2, the sequence of a scalar multiplication can be scheduled onto this multicore system in different ways. To make this easier to express, we use the following two terms to denote parallelism at different abstraction levels: by horizontal parallelism we mean that several cores cooperate on a single MMO, and by vertical parallelism that independent MMOs are executed in parallel on different cores.
one, core-3 the third one and core-4 the fourth one. This scheduling method successfully avoids carry transfers. However, it causes s(s-1) data transfers when sending the intermediate data, Z, to neighboring cores.

Note that to compute T only z_0 must be ready at the end of each iteration, and the carries from z_1, z_2, ..., z_{s-1} do not need to be accumulated into the corresponding words of Z immediately. Based on this fact another scheduling method can be considered, as shown in Fig. 4. In this method, each iteration of Alg. 4 is performed by multiple cores. Suppose we have n = 256, w = 32 and s = n/w = 8. During the whole loop (z1, z0) is generated and stored in core-1, (z3, z2) in core-2, (z5, z4) in core-3 and (z7, z6) in core-4. Carries are only used in the local core. At the end of each iteration, z2 in core-2 becomes the new z1 and is sent to core-1. Also, z4 is sent to core-2 and z6 is sent to core-3. After eight iterations and a conditional subtraction, Z = X·Y·R^-1 mod p is obtained, stored separately in the register files of the cores. In general, if Z has s words and the system has p cores, the number of data transfers is 2(p-1)s, which is much smaller than that of the previous scheduling method.
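The transfer counts of the two schedules can be made concrete with the numbers used in the text (s = 8 words, p = 4 cores). The word-to-core mapping and the derivation of 2(p-1)s below (counting one store and one load per moved word) are one possible reading of the description, not taken verbatim from the paper.

#include <stdio.h>

#define S 8   /* words of Z (n = 256, w = 32) */
#define P 4   /* cores                         */

/* second schedule: word z_j is kept in core j/(S/P) + 1, i.e.
 * (z1,z0) in core-1, (z3,z2) in core-2, (z5,z4) in core-3, (z7,z6) in core-4 */
static int home_core(int j) { return j / (S / P) + 1; }

int main(void)
{
    for (int j = 0; j < S; j++)
        printf("z%d lives in core-%d\n", j, home_core(j));

    /* After each of the S iterations, the lowest word of every core except
     * core-1 moves one core down (z2 -> core-1, z4 -> core-2, z6 -> core-3):
     * (P-1) words per iteration, each needing a store and a load through the
     * shared data memory, which gives the 2(P-1)S transfers quoted above.   */
    printf("second schedule : %d transfers\n", 2 * (P - 1) * S);
    printf("first schedule  : %d transfers\n", S * (S - 1));
    return 0;
}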
When performing 192-bit Montgomery modular multiplications on a single 32-bit core, 343 clock cycles are needed to complete one multiplication.

The vertical parallelism can be dynamically scheduled in software. Suppose the RAW dependency is checked over a sequence of t operations. On the one hand, a higher degree of parallelism may be found if a larger t is chosen. On the other hand, the number of conditions to check for RAW dependencies grows exponentially as t increases. In this paper we statically schedule the PD and PA sequence prior to executing the scalar multiplication, thus avoiding this possible bottleneck in the whole design.

Scalar multiplication can be computed by combining three sequences, namely (a) PA followed by PD (denoted as PA-PD), (b) PD-PD, and (c) PD-PA. As shown in Eq. 3 and Eq. 4, both PA and PD consist of 16 operations. As a result, for each of the PA-PD, PD-PD and PD-PA combinations we have 32 operations. We exploit the vertical parallelism for these 32 operations and store the schedule in the program area in software. We can find parallelism of up to three-way in ECC over a prime field.
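As an illustration of how such a static schedule could be built, the sketch below greedily packs a straight-line operation sequence (such as the 32 operations of a PA-PD pair) into slots of at most three parallel operations. The operation encoding and the in-order packing strategy are our assumptions; the text only mentions RAW checks, but WAR and WAW conflicts are also rejected here because the sequences in Eq. 3 and Eq. 4 reuse temporary registers.

#include <stdio.h>

#define MAX_WAY 3                             /* up to three-way vertical parallelism */

typedef struct { int dst, src1, src2; } op;   /* register ids; src < 0 means constant */

static int reads(const op *o, int r)
{
    return r >= 0 && (o->src1 == r || o->src2 == r);
}

/* true if b cannot be issued in the same slot as the earlier operation a */
static int conflict(const op *a, const op *b)
{
    return reads(b, a->dst)                   /* RAW: b reads what a writes           */
        || b->dst == a->dst                   /* WAW: both write the same register    */
        || reads(a, b->dst);                  /* WAR: b overwrites something a reads  */
}

/* greedy in-order packing of n operations into parallel slots */
static void schedule(const op *seq, int n)
{
    int start = 0;
    while (start < n) {
        int width = 1;
        while (start + width < n && width < MAX_WAY) {
            int ok = 1;
            for (int k = 0; k < width; k++)
                if (conflict(&seq[start + k], &seq[start + width]))
                    ok = 0;
            if (!ok)
                break;
            width++;
        }
        printf("slot: operations %d..%d in parallel\n", start, start + width - 1);
        start += width;
    }
}

int main(void)
{
    /* toy sequence: op 2 reads the result of op 0, so it starts a new slot */
    op seq[] = { {4, 0, 0}, {5, 1, 1}, {6, 4, 2}, {7, 3, 3} };
    schedule(seq, 4);
    return 0;
}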
4.3 Two-dimensional Parallelism
Apparently, using only vertical parallelism cannot keep all the cores busy during the whole scalar multiplication, since 4-way parallelism is not always achievable (in fact, we can only find up to three-way parallelism). In order to utilize all the cores efficiently, we can combine the idea of horizontal parallelism and vertical parallelism, obtaining a two-dimensional parallelism.

We start from the schedule generated by the vertical parallelism method, and deploy the horizontal parallelism method to perform some of the modular operations so that all cores in the system can be utilized. Possible combinations of the vertical and horizontal parallelism are illustrated in Fig. 5. The figure explains how the modular operations are assigned to the cores in the system. In Fig. 5(a), three-way parallel modular operations can be performed with only the vertical parallelism, with the horizontal parallelism, or with a combination of both. Fig. 5(b) shows the possible methods of performing two-way parallel operations, and Fig. 5(c) shows the methods for a single MMO.

Figure 5: Various strategies for two-dimensional parallelism exploration in the case of using four cores.

Table 2 summarizes the number of clock cycles required for one 192-bit modular multiplication. We obtain different results depending on the strategy used to exploit parallelism. The best results are highlighted in the table.

Table 2: Required clock cycles for one modular multiplication for different strategies to exploit parallelism.

  Vertical      |  Horizontal Parallelism
  Parallelism   |    1       2       4
  --------------+-----------------------
       3        |  372*    416     525
       2        |  352     241*    350
       1        |  343      -      175*
  (* best result for the given degree of vertical parallelism)

The results indicate that the horizontal parallelism degrades the performance when three-way vertical parallelism is found. This is due to the fact that the horizontal parallel computation can speed up the computation by a factor of 1.96 when using 4 cores, while the vertical parallel computation can speed it up by a factor of 2.96 when three-way parallelism is available. On the other hand, the horizontal parallelism reduces the clock cycles when no vertical parallelism is obtained. With regard to two-way vertical parallelism, we obtain the best performance when using two cores for each modular operation.

5. RESULTS
In order to see the effectiveness of two-dimensional parallelism, we observe the performance of EC scalar multiplication as shown in Fig. 6. We consider four cases: (I) the vertical parallelism only, (II) two-dimensional parallelism with up to three-way vertical parallelism, (III) two-dimensional parallelism with up to two-way vertical parallelism, and (IV) the horizontal parallelism only.

[Figure 6: performance of EC scalar multiplication in msec for the four cases, broken down into the PA/PD chain and the inversion.]

As a result of an FPGA implementation, the best performance is obtained with case III: a 192-bit EC scalar multiplication is performed in 9.9 msec when running at 93 MHz. The processor uses 3,173 slices, sixteen 18-bit dedicated multipliers and six embedded BRAMs on a Xilinx Virtex-II Pro (XC2VP30-7).

The performance comparison of EC scalar multiplication is shown in Table 3. Compared to the ARM-based implementation proposed in [15], our implementation is about 7 times faster. This is due to the use of multiple cores and a higher clock frequency. The implementation from [16] is quite fast because of the use of the fixed-base comb method. Compared to the state-of-the-art hardware implementations [17, 18], software implementations are still slower. However, coprocessors add area and complexity to the whole system, and have far less flexibility than software implementations.

Table 3: Performance comparison of EC scalar multiplication over GF(p).

  Reference    Field Size   Target          Freq.    Perf.
  Design       [bit]        Platform        [MHz]    [msec]
  This Work    192          Virtex-II Pro    93       9.90
  [15]         192          ARM7TDMI         80      69.2
  [16]         192          Pentium II      400       2.14
  [17]         192          Virtex-E         40       3.00
  [18]         256          Virtex-II Pro   100       2.70

  * Fixed modulus p = 2^192 - 2^64 - 1.
  † Fixed-base comb method.

6. CONCLUSIONS AND FUTURE WORK
We present a two-dimensional parallel computing architecture for ECC. The results show that the performance of EC scalar multiplication can be improved further by introducing two-dimensional parallelism. Our implementation results using two-dimensional parallelism improve the performance by 26% and 32%, respectively, compared to the cases of exploiting only the vertical or only the horizontal parallelism.

It is possible to apply this idea to any sequence of EC scalar multiplication and to other PKCs, including RSA. Further improvements in performance can be obtained by using a better projective coordinate system and by recoding the scalar with the Non-Adjacent Form (NAF) [7] representation. Our future work also includes using parallel computing to achieve energy-efficient implementations of ECC on multicore systems.
REFERENCES
[5] N. Koblitz. Algebraic Aspects of Cryptography. Springer-Verlag, first edition, 1998.
[6] A. Menezes, Y.-H. Wu, and R. Zuccherato. An Elementary Introduction to Hyperelliptic Curves - Appendix. In N. Koblitz, Algebraic Aspects of Cryptography, pages 155-178. Springer-Verlag, 1998.
[7] I. Blake, G. Seroussi, and N.P. Smart. Elliptic Curves in Cryptography. London Mathematical Society Lecture Note Series. Cambridge University Press, 1999.
[8] D. Hankerson, A. Menezes, and S. Vanstone. Guide to Elliptic Curve Cryptography. Springer-Verlag, 2004.
[9] P. Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44(170):519-521, 1985.
[10] C.K. Koc, T. Acar, and B.S. Kaliski Jr. Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro, 16(3):26-33, June 1996.
[11] K. Sakiyama, L. Batina, B. Preneel, and I. Verbauwhede. Superscalar Coprocessor for High-speed Curve-based Cryptography. In Proceedings of the 8th International Workshop on Cryptographic Hardware and Embedded Systems (CHES), number 4249 in LNCS, pages 415-429. Springer-Verlag, 2006.
[12] S.E. Eldridge and C.D. Walter. Hardware implementation of Montgomery's modular multiplication algorithm. IEEE Transactions on Computers, 42:693-699, 1993.
[13] A.F. Tenca and C.K. Koc. A scalable architecture for modular multiplication based on Montgomery's