Side-Channel Power Analysis of a GPU AES Implementation
Chao Luo, Yunsi Fei, Pei Luo, Saoni Mukherjee, David Kaeli
Department of Electrical & Computer Engineering, Northeastern University
Boston, MA 02115, USA
Email: {luochao, yfei, silenceluo, saoni, kaeli}@ece.neu.edu
Abstract—Graphics Processing Units (GPUs) have been used
to run a range of cryptographic algorithms. The main reason
to choose a GPU is to accelerate the encryption/decryption
speed. Since GPUs are mainly used for graphics rendering, and
only recently have they become a fully-programmable parallel
computing device, there has been little attention paid to their
vulnerability to side-channel attacks.
In this paper we present a study of side-channel vulnerability
on a state-of-the-art graphics processor. To the best of our
knowledge, this is the first work that attempts to extract the secret
key of a block cipher implemented to run on a GPU. We present
a side-channel power analysis methodology to extract all of the
last round key bytes of a CUDA AES (Advanced Encryption
Standard) implementation run on an NVIDIA TESLA GPU. We
describe how we capture power traces and evaluate the power
consumption of a GPU. We then construct an appropriate power
model for the GPU. We propose effective methods to sample
and process the GPU power traces so that we can recover the
secret key of AES. Our results show that parallel computing
hardware systems such as a GPU are highly vulnerable targets
to power-based side-channel attacks, and need to be hardened
against side-channel threats.
I. INTRODUCTION
Graphics Processing Units (GPUs), originally designed for 3D graphics rendering, have evolved into high-performance general-purpose processors, called GPGPUs. A GPGPU can provide significant performance advantages over traditional multicore CPUs by executing workloads in parallel on hundreds to
thousands of cores. What has spurred on this development is
the delivery of programmable shader cores, and high-level programming languages [1]. GPUs have been used to accelerate
a wide range of applications [2], including: signal processing,
circuit simulation, and molecular modeling. Motivated by the
demand for efficient cryptographic computation, GPUs are
now being leveraged to accelerate a number of cryptographic
algorithms [3], [4], [5].
While cryptographic algorithms have been implemented to
run on GPUs for higher performance, the security of GPU-based cryptographic systems remains an open question. Previous work has analyzed the security of GPU systems [6], [7], [8], [9]. That prior work focused mainly on using software methods to exploit the vulnerabilities of the GPU programming
model. Side-channel vulnerabilities of GPUs have received
This work was supported in part by the National Science Foundation under
grants CNS-1314655 and CNS-1337854.
978-1-4673-7166-7/15/$31.00 © 2015 IEEE
limited attention in the research community. Meanwhile, cryptographic systems based on CPUs, application-specific integrated circuits (ASICs), and FPGA platforms have been shown
to be highly vulnerable to side-channel attacks. For example,
Moradi et al. showed that side-channel power leakage can
be utilized by attackers to compromise cryptographic systems
that use microcontrollers [10], smart cards [11], ASICs [12]
and FPGAs [13], [14].
Different attack methods can be used for analyzing side-channel power leakage, e.g., differential power analysis
(DPA) [15], correlation power analysis (CPA) [16] and mutual
information analysis (MIA) [17]. These attack methods pose
a large threat to both hardware-based and software-based
cryptographic implementations. Given all of this previous side-channel power analysis activity, it is surprising that GPU-based
cryptographic resilience has not been considered. In this paper,
for the first time, we apply CPA on an AES implementation
running on a GPU, and succeed in extracting the secret key
through analyzing the power consumption of the GPU.
Note that the inherent Single Instruction Multiple Thread
computing architecture of a GPU introduces a lot of noise into
the power side-channel, as each thread can be in a different
phase of execution, generating a degree of randomness. We
certainly see that GPU execution scheduling introduces some
timing uncertainties in the power traces. In addition, the
complexity of the GPU hardware system makes it rather
difficult to obtain clean and synchronized power traces. In
this paper, we propose an effective method to obtain clean
power traces, and build a suitable side-channel leakage model
for a GPU. We analyze AES on an NVIDIA TESLA C2070
GPU [18] and evaluate power traces obtained on a Keysight
oscilloscope. CPA analysis using the acquired traces shows
that AES-128 developed in CUDA on an NVIDIA C2070 GPU
is susceptible to power analysis attacks.
The rest of the paper is organized as follows. In Section II,
we provide a brief overview of the CUDA GPU architecture,
including both software and hardware models. We also describe our CUDA-based AES implementation. In Section III,
we describe the experimental setup used to collect power
traces, and discuss the difficulties we faced when designing
our power analysis attack on GPUs compared to other
platforms. In Section IV, we discuss the construction of our
power model, and present our attack results. Finally, we
conclude the paper in Section V.
Fig. 1. A sample CUDA execution of threads and blocks in a single grid [20].
(Fig. 3 here: AES round operations for one column: initial round, middle rounds with T-table lookups and round keys, and the last round.)
C. AES Implementation
In this paper, we implement AES-128 encryption in ECB mode in CUDA, based on the CUDA reference implementation by Margara [21]. Because the GPU's register width is 32 bits, the T-table version of the AES algorithm [22]
is adopted. Each thread is responsible for computing one
column of the 16-byte AES state, expressed as a 4 × 4 matrix,
with each element a state byte. Four threads are needed to
process a single block of data. Note that a GPU thread block is different from an AES data block; the latter is a 16-byte block iteratively updated in each round, transforming the plaintext input to ciphertext output. Due to the ShiftRows operation,
threads working on different columns share their computation
results, and thus shared memory is used to hold the round
state results, enabling efficient communication between
threads. Since threads for one data block will
be grouped into the same warp, there is no need to explicitly
synchronize the threads.
Fig. 3 shows the round operations for one column running
in a single thread. The initial round is simply an XOR of the
plaintext and the first round key. There are nine middle rounds
for the 128-bit AES encryption. Each thread takes one diagonal
of the state as its round input, and maps each byte into a 4-byte
word through a T-table look-up-table. These four 4-byte words
are XORed together with the corresponding 4-byte round key
bytes, and the results are stored in a column of the output
state. The last round has no MixColumns operation, and so
only one out of four bytes is kept after the T-table lookup.
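The last-round structure described above can be sketched concretely. The snippet below is an illustrative Python sketch (ours, not the authors' CUDA kernel): since the last round has no MixColumns, each output byte is just an S-box substitution XORed with a round-key byte, and keeping one byte of a T-table entry amounts to an S-box lookup. The S-box is built here from its definition (multiplicative inverse in GF(2^8) followed by the affine transform), so the snippet is self-contained.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) with the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
        b >>= 1
    return r

def gf_inv(a):
    """Multiplicative inverse in GF(2^8) by brute force (0 maps to 0)."""
    if a == 0:
        return 0
    for x in range(1, 256):
        if gf_mul(a, x) == 1:
            return x

def sbox(a):
    """AES S-box: GF(2^8) inverse followed by the affine transform."""
    x = gf_inv(a)
    res = 0
    for i in range(8):
        bit = ((x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8))
               ^ (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8))
               ^ (0x63 >> i)) & 1
        res |= bit << i
    return res

def last_round_byte(state_byte, key_byte):
    # Last round: SubBytes (+ ShiftRows, a pure permutation) then AddRoundKey.
    return sbox(state_byte) ^ key_byte
```

Inverting this relation, ciphertext byte c and a key-byte guess k yield the last-round input state, which is exactly what a last-round CPA hypothesis needs.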
The T-tables are stored in the GPU's constant memory,
which is cached to improve performance. The key scheduling
is performed on the CPU, and then the expanded key is
copied into GPU memory. According to the configuration
of the TESLA C2070, at most 8 blocks and 48 warps
can reside together on one SM. The number of threads
in a single block is set to 192 (48 × 32 / 8), which
fully utilizes the hardware. To perform an encryption, the grid
size is determined by the number of plaintext blocks to be
encrypted. In this work, 49,152 plaintext blocks are encrypted
at one time, so the grid size is 1024 (49,152 × 4 / 192) blocks,
in order to achieve good occupancy on the GPU.
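The launch-configuration arithmetic above can be sketched directly (a minimal Python sketch; the constant names are ours, the values come from the paper):

```python
# Occupancy limits of the TESLA C2070, per the paper.
MAX_WARPS_PER_SM = 48
MAX_BLOCKS_PER_SM = 8
WARP_SIZE = 32

# Threads per block chosen so that 8 resident blocks fill all 48 warps.
threads_per_block = MAX_WARPS_PER_SM * WARP_SIZE // MAX_BLOCKS_PER_SM  # 192

# One AES data block needs 4 threads (one per state column).
aes_blocks = 49152
threads_needed = aes_blocks * 4

# Grid size: total threads divided by threads per block.
grid_size = threads_needed // threads_per_block  # 1024
```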
(Figure here: measurement setup: the attacker sends plaintext to the server, the GPU performs the encryption, and a voltage probe on the power supply feeds power traces to an oscilloscope.)
(Figure here, panel (b): sample power trace of the GPU, voltage (V) versus time (ms).)
(Figure here: in the last round, register Rn is loaded from the T-table and only one byte is reserved into register Rm.)

Under the standard CPA leakage model, the measured power W is linear in the Hamming distance H_g predicted under key guess g:

W = a·H_g + b,   (1)

where b captures noise and model error. The correlation coefficient between W and H_g is then

ρ(W, H_g) = cov(W, H_g) / (σ_W · σ_{H_g}) = a·σ_{H_g} / sqrt(a²·σ²_{H_g} + σ²_b).   (2)
Fig. 7. Average power trends and comparison of the first and a reference trace.
Before encryption starts, the temperature of the GPU is low, making the fan run slowly, which draws
less power. Since the workload of the GPU is high during
encryption, the GPU temperature rises, causing the fan to draw
more power to cool the GPU. The fan power consumption is
a substantial part of the measured power and contributes to
variance (noise) in the power measurements. We need a way
to address this.
To understand the effects of the cooling fan, we simply
calculated the average power (presented in volts) of each
power trace and present these in Fig. 7(a). The average voltage
starts from -0.115V and drops to -0.13V after trace 1000, and
stays at that level for the rest of the traces. Fig.7(b) shows the
first measured voltage trace and a later reference trace. The
voltages in the rst trace are clearly higher than the reference
trace. Note here, we measure the voltage after the resistor, so
we get negative voltage values. The more negative the voltage
value, the higher the power consumption. Since traces before
trace 1000 are much more heavily influenced by the cooling
fan (i.e., they have much wider variations), this data will be
excluded when computing correlations in our attack approach.
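The trace-screening step described above can be sketched as follows (an illustrative Python sketch on synthetic data, not the authors' tooling; all names are ours):

```python
import numpy as np

def drop_warmup_traces(traces, n_warmup=1000):
    """traces: (N, T) array of measured voltages, one row per trace.
    Discard the early traces whose mean level is still drifting while
    the cooling fan ramps up, keeping only the thermally stable ones."""
    return traces[n_warmup:]

# Toy demonstration: 2000 traces of 100 samples, where the first 1000
# are offset to mimic the pre-warm-up level near -0.115 V.
rng = np.random.default_rng(0)
traces = rng.normal(-0.13, 0.005, size=(2000, 100))
traces[:1000] += 0.015
stable = drop_warmup_traces(traces, n_warmup=1000)
```

In practice the cutoff could also be chosen automatically, e.g., by detecting where the per-trace mean settles to a steady level as in Fig. 7(a).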
In a serial AES encryption implementation, plaintext blocks
are processed sequentially, and the timing of leakage points
in the power trace (under a specific power model) can be
precisely identified. However, the GPU accelerates the encryption by processing multiple plaintext blocks in parallel,
and given our lack of control over warp scheduling, leakage
points for each block occur at different times in the traces.
As shown in Fig. 4, an FPGA power trace exhibits round
operations clearly [23], and the leakage points can be easily
determined. However, for the GPU power trace, no encryption details can be discerned at this point.
Our assumption is that when one thread on the GPU is
executing the leaky instructions of the last round, the power
consumption caused by the thread at that time correlates
with the Hamming distances of the registers. However, the
measured power consumption contains a lot of noise, both
spatially and temporally, because a number of threads
are executing other instructions, and the threads
are not necessarily synchronized. In our experiments, we
measured N power traces, and each trace is sampled at T
time points. Assume that for each trace there are Q threads
performing AES encryption. We model the power of one
trace at time t as

W(t) = Σ_{i=1}^{Q} P_{thread_i}(t) = Σ_{i=1}^{Q} h(t - L_i)·H_i + B(t),   (3)

where h(·) is the leakage waveform of a single thread, L_i is the (unknown) leakage time of thread i, B(t) is noise, and H_i is thread i's Hamming distance. Summed over a trace, the signal component is therefore proportional to Σ_{i=1}^{Q} H_i; we denote this trace sum by S and the total Hamming distance by H.
Fig. 8. Correlation between power traces and Hamming distances for all of the key candidates.
Because covariance is linear, the correlation of S with the total Hamming distance decomposes over the per-thread sets:

cov(S, Σ_{k=1}^{Q} H_{set_k}) = Σ_{k=1}^{Q} cov(S, H_{set_k}),   with   H = Σ_{k=1}^{Q} H_{set_k}.
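Putting the model together, the per-guess correlation step can be sketched as below (an illustrative NumPy sketch on synthetic data; the function and variable names are ours, not the paper's): for each key-byte guess, the per-trace sums S are correlated with the summed Hamming-distance hypotheses, and the right guess should stand out as in Fig. 8.

```python
import numpy as np

def cpa_rank(S, H_guesses):
    """S: (N,) per-trace sums; H_guesses: (256, N) hypothesis matrix,
    one row of summed Hamming distances per key-byte guess.
    Returns the Pearson correlation for each of the 256 guesses."""
    S = S - S.mean()
    Hc = H_guesses - H_guesses.mean(axis=1, keepdims=True)
    num = Hc @ S
    den = np.sqrt((Hc ** 2).sum(axis=1) * (S ** 2).sum())
    return num / den

# Toy check: synthetic data where guess 0x3A is the true key byte.
rng = np.random.default_rng(1)
H = rng.integers(0, 9, size=(256, 5000)).astype(float)  # byte HD: 0..8
S = 0.7 * H[0x3A] + rng.normal(0.0, 1.0, 5000)          # leakage + noise
rho = cpa_rank(S, H)
```

With enough traces, the correlation for the correct guess rises above the noise floor of the wrong guesses, mirroring the separation visible in Fig. 8.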
(Figure here: correlation coefficient versus the 256 key-byte guesses for each of the 16 last-round key bytes, Byte0 through Byte15.)
[14] S. B. Örs, E. Oswald, and B. Preneel, "Power-analysis attacks on an FPGA: first experimental results," in Cryptographic Hardware & Embedded Systems, 2003, pp. 35-50.
[15] P. Kocher, J. Jaffe, and B. Jun, "Differential power analysis," in Advances in Cryptology, Dec. 1999, pp. 388-397.
[16] E. Brier, C. Clavier, and F. Olivier, "Correlation power analysis with a leakage model," in Cryptographic Hardware & Embedded Systems, 2004, vol. 3156, pp. 16-29.
[17] B. Gierlichs, L. Batina, P. Tuyls, and B. Preneel, "Mutual information analysis," in Cryptographic Hardware & Embedded Systems, 2008, pp. 426-442.
[18] NVIDIA, "Tesla C2050/C2070 GPU computing processor," 2010.
[19] NVIDIA, "CUDA C programming guide," NVIDIA Corporation, July 2012.
[20] N. Leischner, V. Osipov, and P. Sanders, "NVIDIA Fermi architecture white paper," 2009.
[21] P. Margara, "engine-cuda, a cryptographic engine for CUDA supported devices," 2015. [Online]. Available: https://code.google.com/p/engine-cuda/
[22] J. Daemen and V. Rijmen, "AES proposal: Rijndael," 1998.
[23] T. Swamy, N. Shah, P. Luo, Y. Fei, and D. Kaeli, "Scalable and efficient implementation of correlation power analysis using graphics processing units (GPUs)," in Workshop on Hardware & Architectural Support for Security and Privacy, 2014, p. 10.