Faculty of Science
Kerckhoffs Institute
by
Martijn Sprengers
[email protected]
Supervisors:
Dr. L. Batina (Radboud University Nijmegen)
Ir. S. Hegt (KPMG IT Advisory)
Ir. P. Ceelen (KPMG IT Advisory)
Abstract

Since users rely on passwords to authenticate themselves to computer systems, adversaries attempt to recover those passwords. To prevent such a recovery, various password hashing schemes can be used to store passwords securely. However, recent advances in graphics processing unit (GPU) hardware challenge the way we have to look at secure password storage. GPUs have proven to be suitable for cryptographic operations and provide a significant speedup in performance compared to traditional central processing units (CPUs).

This research focuses on the security requirements and properties of prevalent password hashing schemes. Moreover, we present a proof of concept that launches an exhaustive search attack on the MD5-crypt password hashing scheme using modern GPUs. We show that it is possible to achieve a performance of 880 000 hashes per second using different optimization techniques. Our implementation, executed on a typical GPU, is therefore more than 30 times faster than equally priced CPU hardware. With this performance increase, complex passwords with a length of 8 characters are becoming feasible to crack. In addition, we show that between 50% and 80% of the passwords in a leaked database could be recovered within 2 months of computation time on one Nvidia GeForce GTX 295.
Preface
Contents

1 Introduction
    1.1 Introduction
    1.2 Related work
    1.3 Scope and contributions
    1.4 Research methodology
    1.5 Relevance
    1.6 External validity
    1.7 Outline

3 Hash functions
    3.1 Introduction to hash functions and their cryptographic properties
    3.2 The MD5 hash function
        3.2.1 Merkle-Damgård hash functions
        3.2.2 Notation
        3.2.3 MD5 algorithm description
        3.2.4 The MD5 compression function

4 Password hashing schemes
    4.3 Attack strategies
    4.4.1 Key-stretching
    4.5 Attacker models for hash functions and password hashing schemes

8 Experimental evaluation
    8.1 Experiment setup
        8.1.1 Performance metric
        8.1.2 Available hardware
    8.2 Optimizations effects
        8.2.1 Algorithm optimizations
        8.2.2 Configuration optimizations
        8.2.3 Comparison of theoretical limitations versus practical limitations
    8.3 Comparison with other implementations and hardware
        8.3.1 Comparison against CPU implementations
        8.3.2 Comparison against other cryptographic implementations
    8.4 Consequences for practical use of password hashing schemes

9 Conclusions and future work

Bibliography

A Appendix A
    A.1 Specifications test machine
        A.1.1 Specifications Intel Core i7 920
        A.1.2 Specifications Nvidia GeForce GTX 295
    A.2 Code overview
        A.2.1 Password generation algorithm

List of Figures
List of Algorithms
List of Tables
Chapter 1
Introduction
1.1 Introduction

1.2 Related work
After Thompson et al. [70] had shown that GPUs could be used for general-purpose computing too, in contrast to specific graphics applications, Cook et al. [15] showed that GPUs could be used for cryptography. However, due to the lack of integer arithmetic and API support, no remarkable speedup was gained. After general programming APIs for GPUs became available, Yang and Goodman [77] and Harrison and Waldron [29] showed that contemporary GPUs can outperform high-performance central processing units (CPUs) on symmetric cryptographic computations, yielding speedups of at most 60 times for the DES symmetric key algorithm. Moreover, GPU implementations of the AES algorithm have been reviewed extensively [28, 41, 20, 12]. Szerwinski et al. [69] and Harrison et al. [30] showed how GPUs could be used for asymmetric cryptography, while Bernstein et al. [9] showed that it is possible to reach up to 481 million modular multiplications per second on an Nvidia GTX 295, in order to break the Certicom elliptic curve cryptosystem (ECC) challenge [10, 8]. In addition, cryptographic hash functions such as MD5 [40, 32], as well as ciphers such as Blowfish [48], have been implemented on graphics cards too, yielding significant speedups over CPUs.
1.3 Scope and contributions

- We publish a working proof of concept, which is the first and fastest GPU implementation to crack password hashes for MD5-crypt.
- We argue to what extent password hashing schemes should be improved in order to withstand exhaustive search attacks executed on near-future hardware.
1.4 Research methodology

1.5 Relevance

1.6 External validity
The password hashing scheme reviewed in this work, MD5-crypt, was designed in the early nineties and has not been updated since. Subsequent password hashing schemes, such as SHA-crypt [21], are based on the same design as MD5-crypt; they differ only in how the underlying hash functions are applied. Therefore, if our research proves to be successful, it could easily be adapted to other password hashing schemes, making our method generic. The same generalization applies to our efficient GPU implementation of the MD5 hash function, since its design is very similar to that of other commonly used hash functions.
1.7 Outline

Chapter 2 introduces the generic attacks on cryptographic systems and the feasibility of such attacks on current key sizes. Chapter 3 describes the properties and requirements of hash functions, and the MD5 hash function in particular. Chapter 4 then describes the properties and requirements of password hashing schemes based on those cryptographic hash functions. Furthermore, specific attacks and attacker models for password hashing schemes are defined. To describe our attack with GPUs on one specific password hashing scheme, Chapter 5 contains an introduction to GPU hardware and programming models. Chapter 6 covers our approach to launching an exhaustive search attack with GPUs on MD5-crypt. To maximize the performance of our attack, our optimization strategies and efficient GPU implementation are described in Chapter 7. Chapter 8 then contains the results of the experimental evaluation of our proof of concept. Finally, conclusions and future work are presented in Chapter 9.
Chapter 2
2.1 Generic attacks
This section describes attacks that apply to all cryptographic functions, including hash functions and password hashing schemes.

2.1.1 Exhaustive search attack
The easiest to perform and most powerful attack is the exhaustive search attack. However, most secure cryptographic systems have large key spaces, which make it impossible to find the key in a feasible amount of time. To decrease the time cryptanalytic attacks need to find a key, either the time to apply the cryptographic function should be decreased or the available computing power should be increased; combining the two is, of course, even more advantageous for such attacks. For example, consider a cryptographic hash function (e.g. MD5) with an output space of 128 bits. If all outcomes are equally likely, the search space would be 2^128, which is not feasible for modern hardware. The European Network of Excellence in Cryptology II (ECRYPT II) publishes an annual report on algorithms and key sizes for both symmetric and asymmetric cryptographic applications. For 2010, Table 2.1 considers the feasibility of launching an exhaustive search attack on different key sizes for specific cryptographic applications (after [64, 54]).
Duration    | Symmetric | RSA  | ECC
------------|-----------|------|-----
Days-hours  | 50        | 512  | 100
5 years     | 73        | 1024 | 146
10-20 years | 103       | 2048 | 206
30-50 years | 141       | 4096 | 282

Table 2.1: Duration of exhaustive search attacks on key sizes, in bit length, for specific cryptographic applications. (Assumptions: no quantum computers; no breakthroughs; limited budget.)
Known-plaintext With this type of attack, the adversary has access to both the plaintext and the corresponding ciphertext. His goal is to find the correlation between the two, e.g. by finding the key used to encrypt the plaintext.

Ciphertext-only With this type of attack, the adversary only has access to the ciphertext. His goal is to find the corresponding plaintext or key, e.g. a hashed password.

Since password hashing schemes are based on one-way hash functions, exhaustive search on those schemes is only possible in combination with ciphertext-only attacks. (Strictly speaking, the original UNIX crypt function uses DES to encrypt, with a block of zeros as the plaintext and the user password as the key.)
If one assumes that the probability of the exhaustive search attack succeeding on the first try is exactly equal to the probability that it succeeds on the 2nd, or n-th, attempt, the law of averages states that the best and unbiased estimate is the mid-point of the series. Here, best is defined as having the smallest sum of squared deviations of the difference between the successful attempt and the half-way point [26]. So, if an exhaustive search has a time complexity of 2^n, a successful try is expected after 2^(n-1) attempts.
2.1.2 Birthday attack

This type of attack is based on the birthday problem in probability theory, which concerns the probability that, in a set of randomly chosen people, some pair of them will have the same birthday. Against most people's expectation, the probability that two people out of a set of 23 have their birthday on the same date is close to 50%. With a birthday attack, an adversary randomly generates output of a given cryptographic function until two inputs map to the same output. Let n(p, N) be the smallest number of values we have to choose such that the probability of finding a collision is at least p (with a total of N possibilities). Then n(p, N) can be approximated by [60]:

    n(p, N) ≈ √(2N · ln(1/(1 − p))).    (2.1)

Now let Q(N) be the expected number of values an adversary has to choose before finding the first collision. This number can be approximated by [60, 6]:

    Q(N) ≈ √((π/2) · N).    (2.2)
Since cryptographic hash functions map an arbitrary input to a fixed-size output, collisions always occur (due to the pigeonhole principle). A good cryptographic hash function H has N output values that are all equally likely. This makes birthday attacks more efficient, since finding a collision only takes about 1.25√N evaluations of the hash function H (where N should be sufficiently large). For example, MD5 has an output of 128 bits, so N = 2^128. If MD5 were perfectly collision resistant, a collision would then be found after approximately 2^64 tries. However, the cryptographic strength of MD5 has been widely investigated. For example, Feng and Xie [73, 74] reported that collisions can be found in approximately 2^21 time.
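Instantiating equations (2.1) and (2.2) for an ideal 128-bit hash function gives a quick sanity check of these numbers:

    n(0.5, 2^128) ≈ √(2 · 2^128 · ln 2) ≈ 1.18 · 2^64,
    Q(2^128) ≈ √((π/2) · 2^128) ≈ 1.25 · 2^64,

so both the 50% collision bound and the expected first collision lie around 2^64 evaluations, matching the estimate above.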
2.1.3
2.2 Performance enhancements

The previously described attacks can be enhanced by other techniques and hardware, which are described in this section.
2.2.1 Bit slicing

While the origin of the bit slicing technique dates from the early seventies, it became popular again (albeit in another form) through Biham's 1997 paper [11], in which he describes how bit slicing can increase the performance of cryptographic applications like DES. Bit slicing, as described by Biham, views an n-bit processor as a computer with n one-bit processors. The algorithm is broken down into AND, OR, NOT and XOR gates, and these gates are implemented as machine instructions. For example, if we consider a 64-bit machine, this results in the (cryptographic) function being executed 64 times in parallel. The bit-sliced DES algorithm yields a significant speedup over the original implementation, which supports exhaustive search attacks.
More recently, Käsper and Schwabe [33] showed that a bit-sliced implementation of AES encryption is up to 25% faster than previous implementations, while simultaneously offering protection against timing attacks. This makes the technique valuable to consider when performing exhaustive search attacks on MD5 and MD5-crypt. Unfortunately, no literature on how to implement a bit-sliced version of MD5 or MD5-crypt could be found. The authors of John the Ripper [19] made a proof of concept for a bit-sliced implementation of MD5, but no notable performance increase was gained. The authors state that bit-sliced implementations of MD5 and SHA-1 are possible, but they are only more efficient than traditional ones if near-future CPUs have wider registers (over 128 bits), larger L1 caches and higher instruction issue rates. However, current general-purpose CPUs that satisfy these criteria happen to support parallel operations on 32-bit elements within the 128-bit vector registers (like the SSE instruction set), which is both more straightforward and more efficient than a pure bit-sliced version of these hash functions.
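As an illustration of the technique itself (a minimal sketch of ours, not taken from the literature above), the MD5 round function F(X, Y, Z) = (X ∧ Y) ⊕ (¬X ∧ Z) can be evaluated in bit-sliced form on a 64-bit machine, where bit i of every word belongs to the i-th of 64 independent inputs:

    #include <stdint.h>

    /* Bit-sliced MD5 round function F. One call performs 64 independent
     * one-bit evaluations of F in parallel, one per bit position. A full
     * bit-sliced MD5 would represent each 32-bit state word as 32 such
     * lane-words and would have to express the modular additions as
     * explicit carry propagation, which is where the approach loses its
     * advantage on current CPUs, as noted above. */
    static inline uint64_t bitsliced_F(uint64_t x, uint64_t y, uint64_t z)
    {
        return (x & y) ^ (~x & z);
    }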
2.2.2
Chapter 3
Hash functions
Since password hashing schemes rely on the cryptographic properties of the underlying hash functions, this chapter describes the design of hash functions and
one hash function in particular, the Message Digest Algorithm 5 (MD5).
3.1 Introduction to hash functions and their cryptographic properties

A hash function is generally defined as a mapping

    H : Z_2^m → Z_2^n,    (3.1)
where H is the hash function, Z_2 is equal to {0, 1}, m is the input size (in number of bits) and n is the output size (in number of bits). In most cases m > n holds, which is why hash functions are also called compression functions. Most hash functions are built for a particular purpose. To produce the hash, the bits of the input message are mixed by bitwise operations (rotations and shifts), modular additions and compression functions. These mixing techniques are then iterated in order to ensure high complexity and pseudo-randomness of the output.
The main properties of a good hash function are:

Uniformly distributed A perfect hash function would produce unique output for every unique input. However, because in most cases the domain is greater than the range, the pigeonhole principle (if n items are put into m pigeonholes with n > m, then at least one pigeonhole must contain more than one item) implies that some different inputs will map to the same output. These situations are called collisions. To minimize the likelihood of collisions, the output of a good hash function should be uniformly distributed, meaning that the probability of every output is the same: 1/N, where N is the size of the output space. If the output size is n bits, the probability of each of the 2^n outputs should be 1/2^n.

Deterministic For any given input, the hash function should produce the same hash value every time.

Low complexity It should be easy to compute a hash value for any given message. An efficient hash function should have a linear complexity of O(m).
The idea of hash functions was originally conceived in the late 50s [37], but it is still a very active research field. Hash functions can be used for many purposes, but the most prevalent are:

Fast table lookup. Elements of a table can be found quickly if the hash value of each element is stored together with the element as an index.

Message digest. A hash function is used to compare two large bit streams by calculating the hash value of both streams. If the hash values are different, the input streams must be different. If the hash values are the same, then one could say that the input streams are the same, with probability P. The hash function is considered good if P is high.

Encryption. Some encryption algorithms use hash functions to produce ciphertext that cannot be mapped back to the input. Hash functions that are used in this context are called mixing functions.

Digital signatures. Instead of signing a whole document, it is more efficient to sign only the hash of the document. Upon verification of the signature, the hash of the document is used to ensure document integrity.

Authentication. Hash-based message authentication is a specific construction for calculating a message authentication code (MAC), involving a cryptographic hash function in combination with a secret key. This way, end users who share the key can determine whether the message has been tampered with.

Password storage. Since hash functions deterministically map input to uniformly distributed output and are hard to reverse, they are considered appropriate functions for password storage.
Depending on the application, generally two kinds of hash functions can be distinguished: cryptographic and non-cryptographic hash functions. Given the purpose of this research, we will only review cryptographic hash functions.
Cryptographic hash functions are used in a variety of security applications, such as digital signatures, message authentication codes and other forms of authentication. If the security of such a function is broken, the parent security application may be broken too. Therefore, extra security properties are required to make general hash functions suitable for cryptographic use [43]:

Pre-image resistance This concept is related to that of a one-way function. Functions that lack this property are vulnerable to so-called pre-image attacks. To avoid those, two subtypes of pre-image resistance can be defined:

1. First pre-image resistance Given a hash function H and a hash h, it should be hard to find any message m such that h = H(m).

2. Second pre-image resistance Given a hash function H and a message m_1, it should be hard to find another message m_2 (where m_2 ≠ m_1) such that H(m_1) = H(m_2). This property is sometimes referred to as weak collision resistance.

Collision resistance Given a hash function H, it should be hard to find two different messages m_1 and m_2 such that H(m_1) = H(m_2). Such a pair is called a collision.
3.2 The MD5 hash function

3.2.1 Merkle-Damgård hash functions

3.2.2 Notation

3.2.3 MD5 algorithm description
To compute the MD5 message digest of an input M with length b (in bits), the following stages are completed (as described in [56]):

1. Padding To make sure that the length of the message is congruent to 448 modulo 512, the message is extended with a single 1 bit followed by a number of 0 bits. Then, to make the length of the message an exact multiple of 512, 64 bits representing b are added at the end. In the unlikely event that b is greater than 2^64, only the low-order 64 bits of b are used.
For example, let the message M be the single character 'a'; the representation of the message in bits is then 01100001 ('a' is number 97 in the ASCII table). M contains 8 bits, which means that 440 (448 − 8) bits will be added: the first added bit is a 1 and the 439 others are 0 bits. Because the length of the message is 8, the 64 bits representing the message length are appended as two 32-bit words, low-order word first. Altogether, the message is represented by the following 512 bits:

    01100001 1 00...0 (439 zero bits) 00000000000000000000000000001000 00000000000000000000000000000000    (3.2)

Padding is always performed, even if the length of the message is already congruent to 448 modulo 512.
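In code, this stage can be sketched as follows (our own simplified C; it assumes the caller provides an output buffer of at least len + 72 bytes and ignores MD5's streaming interface):

    #include <stdint.h>
    #include <string.h>

    /* Build the padded message: append 0x80 (a 1 bit followed by seven
     * 0 bits), then 0x00 bytes until the length is 56 mod 64 bytes
     * (448 mod 512 bits), then the 64-bit bit length b, low-order byte
     * first. Returns the padded length, a multiple of 64 bytes. */
    static size_t md5_pad(const uint8_t *msg, size_t len, uint8_t *out)
    {
        uint64_t b = (uint64_t)len * 8;       /* message length in bits */
        size_t   p = len;

        memcpy(out, msg, len);
        out[p++] = 0x80;
        while (p % 64 != 56)
            out[p++] = 0x00;
        for (int i = 0; i < 8; i++)
            out[p++] = (uint8_t)(b >> (8 * i));
        return p;
    }

For the one-character message 'a', this yields exactly the 64-byte block of equation (3.2).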
The padded message is then partitioned into N consecutive 512-bit blocks M_1, ..., M_N, and four 32-bit words are initialized with the following hexadecimal byte values (low-order byte first):

    A = 01 23 45 67
    B = 89 ab cd ef
    C = fe dc ba 98
    D = 76 54 32 10
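Since the bytes are listed low-order first, these values correspond to the following 32-bit word constants, shown here as a brief C sketch:

    #include <stdint.h>

    /* Initial MD5 state from RFC 1321. The byte sequence 01 23 45 67,
     * stored low-order byte first, is the little-endian word 0x67452301,
     * and analogously for B, C and D. */
    static const uint32_t md5_iv[4] = {
        0x67452301u,  /* A */
        0xefcdab89u,  /* B */
        0x98badcfeu,  /* C */
        0x10325476u   /* D */
    };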
4. Processing MD5 goes through the states IHV_i, for 0 ≤ i ≤ N, called the intermediate hash values. Each IHV_i consists of four 32-bit words A_i, B_i, C_i, D_i. For i = 0 these are initialized to the four words described above:

    IHV_0 = (A_0, B_0, C_0, D_0).

For i = 1, ..., N, the intermediate hash value IHV_i is computed using the MD5 compression function:

    IHV_i = MD5Compress(IHV_{i-1}, M_i).    (3.3)
3.2.4 The MD5 compression function

This section describes the MD5 compression function, based on RFC 1321 [56] and the work of Stevens [65]. The input of the compression function MD5Compress(IHV_{i-1}, M_i) is given by an intermediate hash value IHV_{i-1} = (A, B, C, D) and a 512-bit message block M_i. There are 64 steps (numbered 0 up to 63), split into four consecutive rounds of 16 steps each. Each step uses a modular addition, a left rotation and an application of a non-linear function. Depending on the step t, an addition constant AC_t and a rotation constant RC_t are defined as follows:

    AC_t = ⌊2^32 · |sin(t + 1)|⌋,    for 0 ≤ t < 64,

    (RC_t, RC_{t+1}, RC_{t+2}, RC_{t+3}) =
        (7, 12, 17, 22)    for t = 0, 4, 8, 12,
        (5, 9, 14, 20)     for t = 16, 20, 24, 28,
        (4, 11, 16, 23)    for t = 32, 36, 40, 44,
        (6, 10, 15, 21)    for t = 48, 52, 56, 60.

The non-linear function f_t depends on the round:

    f_t(X, Y, Z) =
        F(X, Y, Z) = (X ∧ Y) ⊕ (¬X ∧ Z)    for 0 ≤ t < 16,
        G(X, Y, Z) = (Z ∧ X) ⊕ (¬Z ∧ Y)    for 16 ≤ t < 32,
        H(X, Y, Z) = X ⊕ Y ⊕ Z             for 32 ≤ t < 48,
        I(X, Y, Z) = Y ⊕ (X ∨ ¬Z)          for 48 ≤ t < 64.
The message block M_i is partitioned into sixteen consecutive 32-bit words m_0, ..., m_15 and expanded to 64 words (W_t), t = 0, ..., 63, one for each step, using the following relations:

    W_t = m_t                    for 0 ≤ t < 16,
    W_t = m_((1+5t) mod 16)      for 16 ≤ t < 32,
    W_t = m_((5+3t) mod 16)      for 32 ≤ t < 48,
    W_t = m_((7t) mod 16)        for 48 ≤ t < 64.

For t = 0, ..., 63, the compression function algorithm maintains four 32-bit integers (Q_t, Q_{t-1}, Q_{t-2}, Q_{t-3}) to store the intermediate state. These are initialized as (Q_0, Q_{-1}, Q_{-2}, Q_{-3}) = (B, C, D, A) and updated as follows:

    Q_{t+1} = Q_t + ((Q_{t-3} + f_t(Q_t, Q_{t-1}, Q_{t-2}) + W_t + AC_t) <<< RC_t)    for 0 ≤ t < 64.

After the last step, the resulting state words are added to the intermediate hash value, so the MD5 compression function can finally be described as:

    MD5Compress(IHV_{i-1}, M_i) = (A + Q_61, B + Q_64, C + Q_63, D + Q_62).
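For illustration, one step of this recurrence can be written in C as follows (a minimal sketch with our own helper names; the caller supplies f_t, W_t, AC_t and RC_t for the current step):

    #include <stdint.h>

    /* rotl32 is the left rotation written <<< in the text. */
    static inline uint32_t rotl32(uint32_t x, unsigned n)
    {
        return (x << n) | (x >> (32 - n));
    }

    /* One MD5 step: given q_t = Q_t, q_t3 = Q_{t-3}, the round function
     * value f = f_t(Q_t, Q_{t-1}, Q_{t-2}), the message word w = W_t and
     * the constants ac = AC_t, rc = RC_t, returns Q_{t+1}. */
    static inline uint32_t md5_step(uint32_t q_t, uint32_t q_t3,
                                    uint32_t f, uint32_t w,
                                    uint32_t ac, unsigned rc)
    {
        return q_t + rotl32(q_t3 + f + w + ac, rc);
    }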
Chapter 4

Password hashing schemes
Passwords are the most common form of user authentication in computer systems. To protect against unauthorized disclosure of users' credentials, passwords are usually protected by hashing them, e.g. by using one of the hash functions described earlier. The hashed passwords are then stored in a password table. During the login process, the user-provided password is hashed again using the same hash function and compared with the stored hashed password to authorize user access. There are two main types of attack that can be mounted on such an authentication system:

1. On-line guessing attack With this attack, the only way for the adversary to verify whether a password is correct is by interacting with the login server. This kind of attack can be countered by techniques such as account locking and delayed responses. This research will not focus on this type of attack; an evaluation and more information about the countermeasures can be found in [52].

2. Off-line guessing attack With this attack, it is presumed that the adversary has access to the files or database where the hashed passwords are stored. He can mount all kinds of attacks against the hashed passwords, since they are now in his possession. This research will not focus on how to prevent an adversary from accessing password files; it focuses on the schemes that hash and store passwords. Those schemes try to provide security even if an adversary has access to the password files.

This chapter focuses on the design, security properties and attacker models of prevalent password hashing schemes.
4.1
Password hashing schemes are sets of algorithms, including one or more hash functions, that try to protect the password information of users on a given system. Before the plaintext password is hashed, it is usually transformed and complemented by a salt. Because a password hashing scheme has to be a deterministic one-way function, a salt is used: random data, which may be public, that is hashed together with the password in order to make it unlikely that identical passwords hash to the same ciphertext. Another advantage of a salt is that it makes dictionary attacks more difficult, as described in Section 4.4.
A password hashing scheme is not the same as a hash function. Typically, it is built on top of a cryptographic hash function to ensure that passwords are stored securely. Most password hashing schemes are defined as follows:

    PHS : Z_2^m × Z_2^s → Z_2^n,    (4.1)

where PHS is the password hashing scheme, Z_2 equals {0, 1}, m is the password size (in number of bits), s is the salt size (in number of bits) and n is the output size (in number of bits). Typically, m + s < n, because the length of the salt and the plaintext password together does not exceed the output size. Password hashing schemes can restrict the input size, because if m + s > n, the extra length does not add extra security. Moreover, users should be able to remember their passwords somehow and therefore pick a small m. In most schemes, the output of the function is not merely the hash of the password: it is a text string which also stores the salt and identifies the hash algorithm used. See Figure 4.1 for a graphical overview of a common password hashing scheme.

Figure 4.1: A black box overview of the UNIX password hashing scheme crypt().

This figure shows the black box overview of the UNIX password hashing scheme called crypt(). If a new user has to set his password, the system takes the plaintext password, hashes it together with the salt according to the specified hash function scheme, and stores it in the /etc/shadow file in the following way:

    $HASHFUNCTIONID$SALT$CIPHERTEXT
Identifier | Scheme    | Hash function | Salt (characters) | Salt (bits)
-----------|-----------|---------------|-------------------|------------
1          | MD5-crypt | MD5           | 8                 | 64
2a         | B-crypt   | Blowfish      | 8                 | 64
md5        | Sun MD5   | MD5           | 8                 | 64
5          | SHA-crypt | SHA-256       | 16                | 128
6          | SHA-crypt | SHA-512       | 16                | 128

Table 4.1: Schemes shown in this table are built around the hash function they are named after.
When a user has to authenticate himself to the system again, the password he types in is hashed together with the stored salt, in the same way as when the password was set. If the newly calculated ciphertext matches the one stored in the /etc/shadow file, authentication is successful.
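On systems that provide the POSIX crypt() function, this check can be sketched as follows (illustrative only; on glibc the declaration lives in <crypt.h> and the program must be linked with -lcrypt). Since crypt() only reads the "$id$salt$" prefix of its second argument, the stored entry itself can be passed as the salt:

    #include <crypt.h>    /* glibc; other systems declare crypt() in <unistd.h> */
    #include <string.h>

    /* Re-hash the typed password with the salt embedded in the stored
     * entry and compare the full result strings. Returns 1 on success. */
    int verify(const char *typed, const char *stored)
    {
        char *hashed = crypt(typed, stored);
        return hashed != NULL && strcmp(hashed, stored) == 0;
    }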
4.2
To understand the need for password hashing schemes, it is important to distinguish between two types of off-line attacks that adversaries can mount against the storage of passwords (it is assumed that the adversary already has access to the storage files):

1. Depth-search With this kind of attack the adversary wants to find the password of one specific user, often a user with high-level access, like a superuser or administrator. The adversary has to focus on this one hash. This kind of attack can be executed with the complete attack strategy [71], where the adversary is willing to take all the required computational effort to find the password.

2. Broad-search With this kind of attack the adversary wants to find just one plaintext password from a store of many hashed passwords, because one plaintext password can be sufficient to compromise the security of the system. This kind of attack can be executed with the incomplete attack strategy [71], where the adversary tries N guesses for each hashed password in the storage file, thereby finding a password only with a certain probability.
Now let p be the probability that a user has chosen a weak password and n be the number of users on the targeted system. Then it is easy to see that a broad-search has a higher probability of success (np) than a depth-search (only p), given the assumption that the password strengths are identical. Table 4.2 shows some characteristics of real (cracked) password datasets.
Dataset     | Number of passwords | Average password length | % special characters | % only lowercase ASCII | % in dictionary
------------|---------------------|--------------------------|----------------------|------------------------|----------------
phpbb.com   | 184 000             | 7.5                      | 2                    | 41                     | 29
rockyou.com | 14 000 000          | 8.7                      | 7                    | 25                     | 1

Table 4.2: Characteristics of two publicly leaked (and cracked) password datasets.
The datasets in this table were published by skullsecurity.org and contain the passwords that people use when they are not restricted by a password policy. The phpbb.com database was cracked in January 2009 and the rockyou.com database in December 2009. While the average password length for both datasets looks good (with a password length > 7), 29% of the passwords in the phpbb dataset could be found in an extended English dictionary. A fast CPU can iterate through such a dictionary in less than a second. The statistics also show that few people use special characters in their passwords, and on average more than 30% of the people only use lowercase ASCII characters from the set {a, ..., z}. Regarding the average length of 8 characters, this means that only 26^8 ≈ 2^37 combinations are used, which is feasible for an exhaustive search attack.
4.3 Attack strategies
So an exhaustive search attack will always find the password; however, the time to find it is limited by N. If the input size is very large, it is not feasible to iterate over all possibilities anymore. For example, if the maximum input length is 9 characters and all 94 printable ASCII characters are admitted, N will be:

    N = Σ_{k=0}^{9} 94^k.    (4.3)

If one has a computer that can review 1 billion passwords per second, it will still take more than 18 years to go through the whole search space, and the expected time to crack one password will be half that time. However, users tend to pick their passwords with less randomness than randomly chosen passwords would have.
2. Dictionary attack In order to remember them more easily, humans tend to pick common words as passwords [51, 75]. This enables adversaries to use common dictionaries to find passwords. A typical dictionary has fewer entries than the number of possibilities one has to iterate over with an exhaustive search attack, which makes dictionary attacks faster than exhaustive search attacks. However, under the assumption that users select uniformly distributed passwords, the probability that an adversary eventually finds the password is smaller than with an exhaustive search attack. Let E be the set of entries in a dictionary and |E| = K, where E ⊆ M. The probability P of a successful attack can be calculated by:

    P = Σ_{m∈E} p_m = K · (1/N) = K/N.    (4.4)

Typically K ≪ N, which means that the probability of a successful attack is small. However, in real-world situations the assumption of uniformly distributed passwords does not hold, and a dictionary attack therefore has a higher probability of success (see Table 4.2).
3. Fingerprinting attack This kind of attack exploits the fact that people have to remember their password somehow. When constructing an easy to remember yet still hard to crack password, people use various kinds of strategies, including techniques such as generating mnemonic phrases, obfuscation or the use of patterns. An example of a mnemonic phrase

The entropy H(X) of a random variable X with n possible outcomes is defined as:

    H(X) = − Σ_{i=1}^{n} p(x_i) log_b p(x_i),    (4.5)

where b is the base of the logarithm used and p(x_i) is the probability mass function of outcome x_i. For example, the entropy of a fair coin flip is exactly 1 bit, because p(heads) = p(tails) = 0.5; then H(X) = −p(heads) log_2 p(heads) − p(tails) log_2 p(tails) = 1 bit.
The entropy of an ideal password can now be calculated by:

    u = log_2(n),    (4.6)

where u is the entropy in bits and n the number of possibilities. For example, if the maximum size of the input space is 10 characters and all 94 printable ASCII characters are admitted, the entropy will be log_2(94^10) ≈ 66 bits. However, Shannon also conducted social experiments to determine the entropy of the English language [62]. It turned out that standard English has an entropy between 0.6 and 1.3 bits per letter. For an ideal password that is encoded with ASCII (six bits per character), one would like to have an entropy of six bits per character, which implies maximal randomness. Passwords do not always contain standard English words, but the United States National Institute of Standards and Technology (NIST) [13] calculated that a user-chosen password based on all 94 printable ASCII characters with a length of 10 characters has an entropy of only 21 bits. Using modern-day computing power, all 2^21 possibilities could be searched in limited time.
Zviran and Haga [78] showed as early as 1999, by means of an empirical study, that users are not eager to change their habits regarding the creation of strong passwords. One of the causes, as identified by Katz et al. [34], is that users are asked to generate, memorize and keep secret a growing number of passwords as they join new password-protected services over time. Since the mid nineties this trend has been recognized as both a nuisance and a security risk. They have shown that typical users can be trained to select and remember a single secure password, but multiplying this dozens or hundreds of times is sure to push the physiological limitations of human memory. Halderman et al. [27] proposed a solution that secures all passwords of a user with a master password, but this only solves the problem at the user side of the authentication mechanism. At the server side, the low-entropy passwords are still stored in their original form.
4.4
If all users just picked high-entropy passwords, there would be no need for password hashing schemes, since the application of a simple cryptographic hash function like MD5 would be enough to secure their passwords (under the assumption that MD5 is pre-image resistant). However, the studies by Shannon, Zviran, Katz and NIST show that it is hard for users to create passwords with high entropy, so the secure storage of hashed passwords has become very important. To review the quality of such a scheme, it is necessary to first define the properties of a good password hashing scheme. In 1979, Morris et al. [47] already proposed ways to protect stored passwords, and most of their principles are still valid today.

Correct use of salts As stated in Section 4.1, salts are necessary to produce a secure output of a password hashing scheme. Let H be a hash function, m_1 a plaintext password that user 1 wants to hash, and s the salt that is concatenated with the password before the hash function is applied. Now consider that user 2 has a password m_2 that is the same as user 1's, so m_1 = m_2. If we apply the hash function H without a salt, H(m_1) = H(m_2).
Avoid pipelined implementations As described by Provos et al. [55], a password hashing scheme should not lend itself to any kind of pipelined hardware implementation, and should therefore permit relatively little speed-up from any kind of precomputation. For example, calculating the hash of 1000 passwords with the same salt and calculating the hash of one password under 1000 salts should both cost approximately 1000 times more computation time than calculating the hash of a single password.

Enforcement of complex passwords This is not really a property of a good password hashing scheme, but rather a policy of the system that uses password authentication, which may be implemented by the password hashing scheme. To enforce the use of complex passwords, systems should enforce password policies. However, there has to be a trade-off between the strictness of the policy and the limitations of human memory or the will to remember difficult passwords. Yan et al. [75] state that an interesting and important challenge is finding compliance enforcement mechanisms that work well with mnemonic password choice. Proactive password checkers, which verify that a password is not part of a known weak subset of the password space, might be an effective tool. But as Yan [75] has shown, what engineers expect to work and what users actually make work are two different things. Rigorous experimental testing of interface usability is, in their view, a necessary ingredient for robust secure systems. If users are not content with new password policies, they just stop using the service, fall back to their simple and weak passwords, or write them down.
4.4.1 Key-stretching

To slow down exhaustive search attacks on password hashing schemes, two methods have been proposed in the literature.
In the first approach, proposed by Abadi et al. [3] and Manber [42], the password is concatenated with an additional salt (also known as a password supplement) before it is hashed. This supplement is kept secret from the user and is typically chosen in a range 1, ..., k, where k should be feasible for an exhaustive search. The system has to guess this supplement by repeatedly hashing the password, salt and a candidate supplement in the predefined range. However, this also holds for an attacker who is guessing the password by brute force: when checking a password guess, the attacker needs to perform a hash for each possible supplement until the space is searched. Thus, if the password supplement space is of size k, the attacker will need to perform k hashes before he can completely eliminate one guess.
4.5 Attacker models for hash functions and password hashing schemes

The term weak collision resistance is preferred over second pre-image resistance because it better states the difference between first and second pre-image resistance.
the attack still holds, since the collision already occurred at H^0 and the salt, password and result of H^{i-1} are always concatenated in the same way. Considering the previous, password hashing schemes that use key-stretching in the two ways described above do not provide additional security against strong collision attacks other than the application of the first hash (H^0), given that the salt can be chosen by the attacker.
However, password hashing schemes that concatenate the password p_1, the salt chosen by the adversary s_1 and the result of the last round H^{i-1} in a pseudo-random way can provide k times additional security. In this way, the complexity of a short chosen-prefix collision attack [66] on the password hashing scheme will be k times higher than that of such an attack on one application of the hash function, since the specially crafted salt should work for all k applications of the hash function.
An off-line exhaustive search attack on the input of a password hashing scheme is always possible, although such an attack cannot guarantee a successful search, because the size of the input set can be infinite (or infeasible to iterate over). However, as shown in Section 4.2, users tend to pick passwords with far less entropy than can be achieved with a given character set. This makes the password hashing scheme more vulnerable to dictionary and brute force attacks. To deal with these kinds of attack, most password hashing schemes use the key-stretching technique to slow down the hashing process: in their main loop they apply the hash function k times, every time with different input concatenated with the output of the last round (e.g. MD5-crypt, which will be described in Chapter 6). While the key-stretching technique slows down exhaustive search attacks for large k, users also have to apply the hash function k times, which leads to a paradox:

- Users expect k to be low: fast authentication.
- Users expect k to be high: less successful exhaustive search.

Moreover, this paradox also applies to the design of hash functions. On the one hand hash functions should be computationally fast in order to hash large amounts of data, while on the other hand hash functions should be computationally slow in order to slow down exhaustive search.
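To make the role of k concrete, the following C sketch implements such a key-stretching main loop on top of OpenSSL's MD5 (an assumption made for illustration; MD5-crypt's actual loop, described in Chapter 6, varies what is concatenated in each round):

    #include <openssl/md5.h>   /* assumes OpenSSL is available; link with -lcrypto */
    #include <string.h>

    /* Generic key stretching: one initial hash of password||salt, then
     * k-1 rounds of hashing the previous digest concatenated with the
     * password. Illustrative only; assumes pwlen + saltlen fits in buf. */
    void stretch(const unsigned char *pw, size_t pwlen,
                 const unsigned char *salt, size_t saltlen,
                 unsigned k, unsigned char out[MD5_DIGEST_LENGTH])
    {
        unsigned char buf[MD5_DIGEST_LENGTH + 256];

        memcpy(buf, pw, pwlen);                    /* round 0: H(password||salt) */
        memcpy(buf + pwlen, salt, saltlen);
        MD5(buf, pwlen + saltlen, out);

        for (unsigned i = 1; i < k; i++) {         /* rounds 1..k-1: H(digest||password) */
            memcpy(buf, out, MD5_DIGEST_LENGTH);
            memcpy(buf + MD5_DIGEST_LENGTH, pw, pwlen);
            MD5(buf, MD5_DIGEST_LENGTH + pwlen, out);
        }
    }

Verification costs the same k hash applications, which is exactly the paradox above: the defender pays the stretching factor on every login.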
4.5.1
Chapter 5
This chapter describes the internal working of a graphics processing unit (GPU) and explains why it is more suitable for parallel computing than a normal central processing unit (CPU). Since the late nineties, GPUs have been developed and improved continuously. They have proven to be very suitable for processing parallel tasks and calculating floating point operations. It is especially the parallel design of a GPU that makes it suitable for cryptographic functions. While the advantages of GPUs in other areas (like graphical design and the game industry) had already been recognized, the cryptographic community was not able to use them due to the lack of user-friendly programming APIs and the lack of support for integer arithmetic. However, GPU producers have dealt with those shortcomings. Goodman and Yang [77] have shown that contemporary GPUs can outperform high-performance CPUs on cryptographic computations, yielding speedups of at most 60 times for the AES and DES symmetric key algorithms. Szerwinski and Güneysu showed how GPUs could be used for asymmetric cryptography [69], while Bernstein et al. showed that it is possible to reach up to 481 million modular multiplications per second on an Nvidia GTX 295 [9], in order to break the Certicom elliptic curve cryptosystem (ECC) challenge with the use of GPUs [10]. This knowledge is very useful, since a lot of cryptographic functions were designed to be implemented on a traditional CPU. The shift of focus to GPUs provides new insights into how we should deal with (implementations of) cryptography in the future.
5.1 Hardware outline
Figure 5.1: Overview of a typical CPU and GPU. Due to its design, the GPU is specialized for intensive, highly parallel computation [1].

A typical GPU, as shown in Figure 5.1, consists of the same elements as a normal CPU:

Arithmetic logic unit (ALU) The digital circuit that performs arithmetic, logical and shift operations. In a GPU, this unit is also called a stream processor or thread processor.

Control unit Dispatches instructions and keeps track of memory locations. In a GPU, a thread dispatcher also resides on the control unit, which controls the creation and execution of threads.

DRAM Main memory, which can be refreshed dynamically.

Cache A smaller, faster memory which stores copies of the data from the most frequently used main memory locations.

In addition, a GPU has multiprocessors. Every multiprocessor contains a number of thread processors and a shared memory, which is accessible by all the thread processors in the multiprocessor.
Because a GPU is specialized for intensive, highly parallel computation (exactly what graphics rendering is about), it is designed in such a way that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 5.1. Moreover, a GPU is suited to address problems that can be expressed as data-parallel computations with high arithmetic intensity (the ratio of arithmetic operations to memory operations). This means that a GPU can execute the same program for each data element in parallel, which implies lower requirements for sophisticated flow control and higher arithmetic intensity (the memory access latency can be hidden with calculations by executing other threads, instead of relying on data caches). Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. But many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing as well; for example, applications of a cryptographic hash function on a large dataset can be done in parallel.
5.1.1
The increase in CPU frequency is limited by physical restrictions and high power consumption. CPU performance is therefore often raised by increasing the number of cores: some processors contain up to six cores. However, further growth is not expected to come soon [68]. As stated by Garland et al. [25], the emphasis is shifting from latency-oriented CPUs, which focus on minimizing the latency of a single sequential task, towards throughput-oriented GPUs, which focus on maximizing total throughput. Latency-oriented CPUs are designed for common applications, based on the Multiple-Instruction Multiple-Data (MIMD) architecture, as proposed by Flynn [23]. This means that each core works independently of the others, executing various instructions for multiple processes.
For throughput-oriented GPUs, the requirements of graphical computational problems led to a specific parallel architecture in which multiprocessors can execute hundreds of threads concurrently. This architecture is called Single-Instruction Multiple-Threads (SIMT). It is based on the Single-Instruction Multiple-Data (SIMD) model [23], which describes multiple processing elements that perform the same operation on multiple data elements simultaneously. Accordingly, such an architecture exploits data-level parallelism.
The SIMT architecture

The SIMT architecture describes a multiprocessor that can create, manage, schedule and execute threads in groups. A group of N parallel threads is called a warp. Every thread has its own instruction address counter and register state. Individual threads in the same warp start together at the same program address, but can execute and branch independently.
A warp executes one common instruction at a time, so full efficiency is realized when all N threads of a warp agree on their execution path. If a thread in a warp diverges via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path. When all paths are completed, the threads converge back to the original execution path. Branch divergence occurs only within a warp: different warps on different multiprocessors execute independently, regardless of whether they are executing common or disjoint code paths.
To maximize utilization, a GPU thus relies on thread-level parallelism. Utilization is therefore directly linked to the number of warps residing on a multiprocessor. At every instruction issue time, a warp scheduler selects a warp that is ready to execute its next instruction and issues the instruction to the active threads of the warp. The number of clock cycles it takes for a warp to be ready to execute its next instruction is called latency (which in practice depends on the number of clock cycles it takes to complete a memory request). Full utilization is achieved when the warp scheduler always has an instruction to issue for some warp at every clock cycle during that latency period; this is called latency hiding. The number of instructions required to hide a latency of L clock cycles depends on the throughput of these instructions. For example, if we assume that a multiprocessor issues one instruction per warp over 4 clock cycles and maximum throughput for all instructions is achieved, then the number of instructions needed to hide the latency is L/4.
The SIMT architecture is related to SIMD vector organizations in the sense that a single instruction controls multiple processing elements. The major difference is that SIMD vector organizations expose the vector width (and thus the number of elements that can be processed concurrently on one processor) to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread. This enables programmers to write thread-level parallel code for independent threads as well as data-parallel code for coordinated threads. The programmer can even ignore the SIMT behavior entirely; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge.
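The following CUDA sketch (ours, purely illustrative; it assumes the launch configuration covers the array exactly) shows a data-dependent branch that can diverge within a warp, and an arithmetic variant that keeps all threads of a warp on the same path:

    __global__ void divergent(int *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (data[i] % 2 == 0)            /* threads in one warp may disagree, */
            data[i] = data[i] * 3;       /* so both paths are executed        */
        else                             /* serially for the whole warp       */
            data[i] = data[i] + 1;
    }

    __global__ void uniform(int *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int even = (data[i] % 2 == 0);   /* 1 or 0 for every thread           */
        data[i] = even * (data[i] * 3) + (1 - even) * (data[i] + 1);
    }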
Threads and resources
Apart from the difference in architectural design, there exists also a major difference in the organization of threads. On a typical CPU only one or two threads
can run concurrently per core, whereas on a typical GPU up to 1024 threads can
run concurrently per multiprocessor (and most GPUs have several multiprocessors). For example, the Nvidia GTX 295 has 60 multiprocessors, which can yield
more than 30.000 concurrent or active threads. In comparison, an Intel i7 9xx
processor has only 4 cores, which can run 8 concurrent threads in total (if HyperThreading is enabled). In addition, switching from one thread to another takes
hundreds of clock cycles on a typical CPU (threads are then called heavyweight
threads), whereas a GPU can switch several threads per clock cycle (threads
are then called lightweight threads). This is due the fact that for a CPU the
operating system must swap threads on and off the CPU execution channels to
provide multithreading capability. Thread switching is therefore slow and expensive. On a GPU, thousands of threads are queued in warps of N threads each.
38
If the GPU must wait on one warp of threads, it simply begins executing work
on another warp. Because separate registers are allocated to all active threads,
no swapping of registers or state need to be done when switching between GPU
threads. Resources stay allocated to each thread until its execution is completed.
Memory access

The CPU and GPU both have their own RAM. On a CPU, the RAM is generally equally accessible to all code (provided the code accesses memory within the boundaries enforced by the operating system). On a GPU, the RAM is divided virtually and physically into different types, each of which has a special purpose and fulfills different needs. Because different GPUs have different types of memory, we will not go into detail here.
Computing capabilities

In this research we use GPU hardware from the Nvidia Corporation. The properties and performance differ per Nvidia card, but all hardware is based on the SIMT principle. Nvidia has classified their graphics cards in order to distinguish the properties and features of each device. This classification is called compute capability; it determines the set of instructions supported by the device, the maximum number of threads per block and the number of registers per multiprocessor. Higher compute capability versions are supersets of lower (i.e. earlier) versions, and are thus backward compatible.
5.1.2
has a clock frequency of 2.66 GHz. So this processor can handle 4 × 4 × 2.66 ≈ 42 GFLOPS. For GPUs it is a bit harder to calculate the actual GFLOPS value, because (Nvidia) graphics cards use three clocks: a memory, a processor (or shader) and a core (or graphics) clock. To measure the actual peak performance, the shader clock is the most important, because it defines the speed at which the program fragments (shaders) that perform calculations on individual elements (and produce information related to those elements) run. As an example, consider the Nvidia GeForce GTX 295 graphics card, which has 2 GPU units with 240 thread processors (cores) each, and whose shader clock is set to 1.242 GHz. In turn, every multiprocessor has 8 cores. The CUDA Programming Guide [1] states that it takes 4 clocks for one multiprocessor to handle one warp, which means that it takes 4 clocks for 8 cores to handle one instruction for all 32 threads in a warp. In other words, each core can handle one instruction per clock cycle. This means that the GeForce GTX 295 can handle 480 × 1.242 ≈ 596 GFLOPS. This value is much lower than stated in Figure 5.2, because a GPU can use an extended instruction set for floating point operations: multiply-and-add is a single instruction, and the GPU can also perform one extra multiply (the dual-issued single precision floating point multiplication) per clock cycle, setting the total to 3 × 596 ≈ 1800 GFLOPS. However, cryptographic functions like MD5 or AES do computations on 32-bit integers, not on floating point numbers. To compare the speed between CPUs and GPUs with regard to cryptography, it is necessary to define GIPS: giga instructions per second. Instructions that require floating points, such as multiply-and-add and the dual-issued single precision floating point multiplication, cannot be used on integers, and therefore the number of instructions per second (IPS) is smaller than the number of floating point operations per second (FLOPS).
Table 5.1: Throughput of natively supported arithmetic instructions (operations per clock cycle per multiprocessor) for Nvidia devices of various compute capabilities.
Table 5.1 shows the throughput of the arithmetic instructions (in operations per clock cycle per multiprocessor) that are natively supported in hardware for (Nvidia) devices of various compute capabilities. If we assume that 24-bit integer multiplication is sufficient for all calculations, we can state that one multiprocessor can handle 8 arithmetic instructions per clock cycle. This means that the GeForce GTX 295 (with compute capability 1.3 and 60 multiprocessors) can handle 60 × 8 × 1.242 ≈ 596 GIPS. For a typical CPU the number of instructions per second is more difficult to determine; in practice it depends on the algorithm, the instruction set used, the memory bandwidth and the bus size. For example, an Intel Core i7 980EE can handle about 45 instructions per clock cycle at 3.3 GHz, which leads to a total of 3.3 × 45 ≈ 148 GIPS.
5.2
Developers who want to produce code that exploits the power of GPUs for generic programming can use one of the available application programming interfaces (APIs). The most commonly used APIs for GPU programming are:

Compute Unified Device Architecture (CUDA) This parallel computing architecture, developed by Nvidia and released in early 2007, provides both a low-level and a high-level API. It only works on CUDA-enabled graphics cards from Nvidia and is basically an extension of the C programming language. The CUDA drivers are needed to run an executable compiled with CUDA. The main advantages of this architecture are:

- code can read from arbitrary memory addresses,
- fast shared memory, which acts as a user-managed cache, can be used by threads residing on the same multiprocessor,
5.3
The CUDA programming model enforces cooperation between hardware, software and different tools to compile, link and run CUDA code. This section describes the compiler model, the mapping from the hardware to the software (the execution model) and how the memory on the GPU can be used. Moreover, the CUDA programming model contains a collection of compilers and tools to optimize CUDA code.
5.3.1 Compiler model

The CUDA programming model supports two ways of writing programs. This research will only focus on the CUDA C variant, which is a minimal set of extensions to the C language (the other way to write CUDA programs is via the CUDA driver API). A source file can contain a mixture of host code (i.e. code that executes on the CPU) and device code (i.e. code that executes on the GPU). The nvcc compiler first separates the host and device code. Whereas the host code (plain C, for example) is then compiled and linked by the programmer's favorite compiler, the device code is compiled by nvcc to an assembly form (PTX code). This PTX code can either be converted to binary form, called a cubin object, or be loaded by the application at runtime and compiled via the just-in-time compilation mechanism.
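As a minimal illustration (our own sketch; error checking omitted), the following source file mixes both kinds of code; nvcc separates the __global__ function (device code) from main() (host code):

    /* example.cu -- compile with: nvcc example.cu */
    #include <cstdio>

    __global__ void square(int *v)            /* device code: runs on the GPU  */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        v[i] = v[i] * v[i];
    }

    int main()                                /* host code: runs on the CPU    */
    {
        const int n = 256;
        int h[n], *d;
        for (int i = 0; i < n; i++) h[i] = i;

        cudaMalloc(&d, n * sizeof(int));
        cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
        square<<<n / 64, 64>>>(d);            /* 4 thread blocks of 64 threads */
        cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(d);

        printf("%d\n", h[10]);                /* prints 100 */
        return 0;
    }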
5.3.2
Execution model
Because both the GPU hardware and the CUDA framework are designed by
Nvidia, these two interact very well with each other. The most important aspect
to understand how CUDA works is the mapping between the API and the hardware. The soft- and hardware entities in the programming model are shown in
Figure 5.3. Every entity in the figure will be described below:
Device The device is defined as the GPU hardware which runs the code.
Host The host is defined as the CPU hardware which controls the GPU.
Kernel A kernel is defined as a function that runs on the device. As mentioned earlier, the CUDA architecture is based on the SIMT principle. This
means that every thread processor executes the same function in parallel.
Only one kernel is executed at the time and therefore many threads can
execute that kernel simultaneously.
Thread and Tread Processor A thread runs a kernel which is executed by
the thread processor. Every thread has a unique id.
Thread Block and Multiprocessor Every thread block contains a predefined
number of threads. A thread block is then executed by one multiprocessor.
However, several concurrent thread blocks can reside on one multiprocessor,
limited by multiprocessor resources such as shared memory and the number of registers.
Grid and Device Every grid contains a predefined number of thread blocks,
so a kernel is launched as a grid of thread blocks.
The number of threads in a thread block and the number of thread blocks in
a grid can be determined at compile time or runtime. This allows programs to
transparently scale to different GPUs and hardware configurations. The hardware is free to schedule thread blocks on any multiprocessor, as long as enough resources are available. When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps that get scheduled by a warp scheduler for
execution. The way a block is partitioned into warps is always the same. Each
warp contains threads of consecutive, increasing thread ids with the first warp
containing thread 0. The configuration of a kernel (in number of grids, thread
blocks and threads) significantly influences the execution speed. It is important
to divide the threads evenly over every warp. We will discuss optimal kernel
configuration in Chapter 7.2.3.
Figure 5.4 gives an overview of how threads, thread blocks and grids can be defined. Every kernel is executed according to its own configuration, which is described by the number of threads per block, the number of blocks per grid, the amount of static memory needed and the algorithm specific parameters.
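As an illustration, such a configuration is expressed at kernel launch; the kernel name and sizes below are hypothetical, and inside the kernel every thread derives its unique id from the configuration:

    /* Hypothetical launch: 1200 thread blocks of 256 threads each; both
       values can be chosen at compile time or at runtime. */
    dim3 grid(1200);
    dim3 block(256);
    searchKernel<<<grid, block>>>(/* kernel arguments */);

    /* Inside the kernel: */
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   /* unique thread id */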
5.3.3 Memory Outline
Figure 5.5 gives an overview of the available memory on a typical Nvidia CUDA
enabled GPU.
The following different classes of physical memory can be distinguished:
Device Memory This is the largest memory, with sizes up to 1.5 GB, but
also the memory with the highest latency with regard to the individual
thread processors. It can be compared with the DRAM of a normal CPU.
The device memory is virtually divided into the following parts:
Global memory Global memory resides in device memory and is accessed via 32-, 64-, or 128-byte transactions. These memory transactions must be aligned: only the 32-, 64-, or 128-byte segments of global memory that are aligned to their size (i.e. whose first address is a multiple of their size) can be read or written by memory transactions. When a warp executes an instruction that accesses
global memory, it coalesces the memory accesses of the threads within
the warp into one or more of these memory transactions depending
on the size of the word accessed by each thread and the distribution
of the memory addresses across the threads. The more transactions
are necessary, the more unused words are transferred compared to the
words accessed by the individual threads, which reduces the instruction throughput. For example, if a thread wants to access 4 bytes, but
a 32-byte memory transaction is generated, the throughput is divided
by a factor 8.
Local memory The memory outline in Figure 5.5 shows that the local
memory is a distinctive unit close to the individual thread processor.
The local memory space resides in device memory however, so local
memory accesses have the same high latency and low bandwidth as
global memory accesses. In contrast to global memory, this type of
memory is organized in such a way that consecutive 32-bit words are accessed by consecutive thread ids; the access is therefore fully coalesced as long as the threads in a warp access the same relative address. Local memory cannot be allocated manually: variables are automatically placed in local memory by the compiler if they do not fit in the registers.
Memory     Location on/off chip   Cached   Access   Scope                  Lifetime
Register   On                     n/a      R/W      1 thread               Thread
Local      Off                    No       R/W      1 thread               Thread
Shared     On                     n/a      R/W      All threads in block   Block
Global     Off                    No       R/W      All threads + host     Host allocation
Constant   Off                    Yes      R        All threads + host     Host allocation
Texture    Off                    Yes      R        All threads + host     Host allocation

Table 5.2: Features of Nvidia's GPU device memory (with compute capability 1.3) [1].
Chapter 6
6.1 Considerations
Since this research focuses on one attack case only, some assumptions have to be
defined. Our implementation is based on the following considerations.
Attacker model We assume that the adversary has to find the plaintext password given only the ciphertext (a ciphertext-only attack) and therefore has to perform an exhaustive search over all possible inputs. Furthermore, the attacker does not have control over the salt, and because the salt is 48 bits in size, we consider a time-memory trade-off not worthwhile to exploit.
Hardware The exhaustive search is performed by Nvidia GPUs, since we have chosen CUDA as our programming interface. However, the optimizations described here could be applied to other architectures as well.
Password generation To make sure the performance of the implementation can be compared with real password recovery tools, unique passwords need to be generated. The performance is then measured in unique password hashes that can be checked per second. Since some of the optimizations depend on the password length, we assume the password length to be less than 16 bytes, which is about twice the average password length of real world datasets (see Table 4.2).
Law of averages We cannot use the law of averages for our exhaustive search, since we use a linear password generation algorithm (thread 0 checks aaaa, thread 1 checks baaa, etc.) and since we assume that the password input space is not uniformly distributed [26].
6.2 Design of MD5-crypt
The scheme that will be reviewed in this research is the MD5-crypt scheme. MD5-crypt is the standard password hashing scheme for FreeBSD operating systems, is supported by most Linux distributions (in the GNU C library) and is implemented in Cisco routers. It was designed by Poul-Henning Kamp in 1994.
Figure 6.1 shows the schematic overview of the MD5-crypt implementation.
The figure only shows the most relevant parts and abstracts from initializations.
Basically, MD5-crypt applies the MD5-compression function 1002 times:

- In the first application, the concatenation of the password, salt and the password again is hashed.
Kamp later commented on the security of the scheme in an article about open source software by Jeff Norris [49]. In this article Poul-Henning Kamp states the following:
Fortunately, I have no reason to believe that any problem exists with either the algorithm or the implementation, and given that MD5 is pretty strong, it's unlikely that any will ever be found.
He is correct when he argues that MD5 is still pretty strong for the purpose of password hashing, since MD5 only suffers from collision attacks and no feasible pre-image attacks have been found yet. There is a theoretical attack by Sasaki and Aoki [58] that generates a pseudo pre-image of MD5 with a complexity of 2^116.9 and a full pre-image of MD5 with a complexity of 2^123.4. These numbers are still not feasible for contemporary and near future hardware, and even if they were, the key stretching technique makes the attack another n times harder.
As we have seen in Chapter 4.5, collisions can theoretically influence password
hashing schemes. Wang et al. [59] and Leurent [39] have shown that for the APOP
protocol strong collision attacks can be used to recover the plaintext password.
The APOP protocol uses one application of MD5(P, C), where P is the password
and C a random nonce. If an adversary has access to C, he can launch a collision
attack. This attack assumed the following conditions:
1. There exists no message difference in the last part of messages.
2. Many collisions can be generated in practical time.
MD5-crypt uses the key-stretching technique in such a way that it becomes very hard for an adversary to exploit the break of the strong collision resistance property of MD5. The attack from [59] could work in theory if the salt and password were concatenated in the same way with every iteration of the hash function and the salt could be arbitrarily chosen by the adversary. Then an evil salt could be created such that two different inputs collide under the salt and its evil twin [18]. Since MD5-crypt concatenates the password, salt and result of the previous iteration in a pseudo-random way (which breaks the first condition) and since MD5-crypt uses 1000 iterations (which breaks the second condition), we do not believe that collision attacks affect the security of MD5-crypt.
6.3 Parallelization and algorithm optimizations

6.3.1 Parallelization
Each basic step of the MD5-compression function updates one word of a 4-word accumulated digest, using the entire intermediate digest as well as block data and constants (the Merkle-Damgard principle). So in general, each basic step depends on the output of the prior step, defeating simple parallelization of the steps. The same concept holds for MD5-crypt: in the main loop (where the MD5-compression function is applied 1000 times), the output of the last round serves as input for the next round, again defeating parallelization of the rounds.
While it is not possible to parallelize the MD5-crypt algorithm itself, an exhaustive search attack to find a matching password for a given hash can be parallelized easily. In fact, exhaustive searches are embarrassingly parallel [24], since the parallel processing units of the underlying hardware do not have to interact or cooperate with each other. Every processing unit tries a set of possibilities and compares them with a target. The only cooperation that is needed is the division of the search space and a general stop message that terminates the search when one processing unit has found a match. If we map this approach to MD5-crypt, every processing unit needs to calculate the MD5-crypt output hashes of all the candidate passwords in the input set it is given and match them with a target hash, as the sketch below illustrates.
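A minimal sketch of this mapping is shown below; generate_password(), md5crypt() and equals16() are hypothetical helpers standing in for the actual routines.

    /* Hypothetical device helpers, assumed to be defined elsewhere. */
    __device__ void generate_password(unsigned long long idx, unsigned char *out);
    __device__ void md5crypt(const unsigned char *pw, unsigned int *hash);
    __device__ int  equals16(const unsigned int *a, const unsigned int *b);

    __global__ void crack(const unsigned int *target, unsigned long long offset,
                          int *found) {
        /* Each thread derives exactly one candidate from its global id. */
        unsigned long long idx = offset + blockIdx.x * blockDim.x + threadIdx.x;
        unsigned char candidate[16];
        unsigned int  hash[4];
        generate_password(idx, candidate);
        md5crypt(candidate, hash);
        if (equals16(hash, target))
            *found = 1;   /* stop signal, polled by the host between launches */
    }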
6.3.2 Algorithm optimizations
This section describes the optimizations that can be done in the algorithm itself in order to speed up the execution of one MD5-crypt round. The main part of the MD5-crypt password hashing scheme consists of the application of the MD5-compression function a thousand times, so every cycle saved in the MD5-compression function is saved roughly a thousand times in the MD5-crypt function. Based on the assumption on the password length, two major optimizations of the MD5 hash function can be used.
The first optimization is based on the fact that passwords shorter than 16 bytes result in only one application of the MD5-compression function. As described in Chapter 3.2.3, the MD5 hash function operates on a b-bit message M which is padded to make the length of the message a multiple of 512. For every 512-bit block of the message M, the MD5-compression function is called. However, if M fits in one 512-bit block, the MD5-compression function is called only once. So if the input to the MD5 hash functions in MD5-crypt is at most 448 bits long, a linear speedup can be achieved. To calculate the maximum length of the password that still allows this optimization, we have to find the call to the MD5-compression function in the MD5-crypt algorithm that uses the largest input. Depending on the round in the algorithm, the password P, salt S and the result of the last application of the MD5-function R are concatenated in such a way that it uses at most 2|P| + |S| + |R| bytes. Because the salt and the result of the last round are always 8 and 16 bytes respectively, the following should hold for |P|:

2|P| + 24 < 448/8 = 56, hence |P| < 16.    (6.1)
The second optimization is based on the fact that the length of passwords is typically significantly shorter than 16 bytes. As described in Section 3.2.4, the MD5-compression function operates on a 512-bit message block Mi, which is partitioned into sixteen consecutive 32-bit words m0, ..., m15, and then expanded to 64 words (W_t) for t = 0, ..., 63, one for each of the 64 steps in the function. Because of the assumption on the password length, the length of the input for the MD5-compression function can be calculated in advance. This means that not all of the input words m0, ..., m15 are initialized to a value. Regarding Equation 6.1, the number of input words Nnull that are zero can then be calculated as:

Nnull = (56 - (2|P| + 24))/4 + 1 = (16 - |P|)/2 + 1.    (6.2)

The 4 in the denominator comes from the fact that 4 bytes make up 1 word, and the addition of 1 comes from the fact that the last word is not needed to represent the length of the input (the input length is smaller than 2^32).
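To illustrate how the zero words pay off, consider one step of the first MD5 round: when a message word is known to be zero at compile time, its addition can simply be dropped. This is a sketch of the idea, not our actual code.

    #define F(x, y, z)  (((x) & (y)) | (~(x) & (z)))
    #define ROTL(v, s)  (((v) << (s)) | ((v) >> (32 - (s))))

    /* Regular round-1 step. */
    #define STEP(a, b, c, d, m, k, s) \
        (a) = (b) + ROTL((a) + F(b, c, d) + (m) + (k), (s))

    /* Step for a message word that is known to be zero: one addition saved,
       a thousand times per MD5-crypt invocation. */
    #define STEP_Z(a, b, c, d, k, s) \
        (a) = (b) + ROTL((a) + F(b, c, d) + (k), (s))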
6.4 Performance estimation

6.4.1 Theoretical model
The round function f_t of the MD5-compression function is defined as:

f_t(X, Y, Z) =
    F(X, Y, Z) = (X ∧ Y) ∨ (¬X ∧ Z)   for 0 ≤ t < 16,
    G(X, Y, Z) = (Z ∧ X) ∨ (¬Z ∧ Y)   for 16 ≤ t < 32,
    H(X, Y, Z) = X ⊕ Y ⊕ Z            for 32 ≤ t < 48,
    I(X, Y, Z) = Y ⊕ (X ∨ ¬Z)         for 48 ≤ t < 64.
The number of logic operations per round function, and the resulting number of operations for one and for sixteen applications of the step function Q, are:

       Logic operations   One application of Q   16 applications of Q
F      4                  11                     176
G      4                  11                     176
H      2                  9                      144
I      3                  10                     160
Total                                            656
Architecture name   Average price in EUR   Peak performance in GIPS   Theoretical speed MD5 (hash/sec)   Theoretical speed MD5-crypt (hash/sec)
Nvidia GTX295       290                    596                        908M                               906K
Nvidia GTX580       450                    790                        1204M                              1202K
Nvidia S1070        6000                   1382                       2107M                              2100K
Intel i7 965EE      450                    76                         116M                               115K
Intel i7 980EE      900                    147                        224M                               223K
The Nvidia Tesla S1070 has an average price of 6000 EUR, but produces only a little over twice the performance of the GTX295, while its cost is 15 times higher (for an overview of the peak performance and specifications of Nvidia GPUs, see http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units). This is due to the design of the S1070: it is built out of four separate Tesla T10s and intended for memory intensive implementations. Therefore it has a total of 4 GB of global memory, which is expensive.
6.4.2 Practical model
Since the previous model only takes the number of operations into account and neglects the memory accesses, it will not be very accurate for a performance estimation of real implementations. In addition, the CUDA execution model is complicated and performance cannot be measured by simply calculating the number of clock cycles and memory accesses; it also depends on the compute capability of the device and the way the architecture maps the software implementation onto the hardware. A more accurate way to estimate the performance of real CUDA implementations was proposed by Kothapalli et al. [38]. Their model takes into account the various facets of the GPU architecture, like scheduling, memory hierarchy and pipelining. It is based on the arithmetic intensity and memory accesses per kernel, which is executed by each thread. Let the number of clock cycles required for arithmetic computation in a thread be denoted as Ncomp and let the number of clock cycles required for memory accesses (both shared and global memory) be denoted as Nmem. Now let CT(K) be the number of clock cycles required to execute kernel K on thread T. If the hardware is capable of hiding all the memory latency (which depends on the configuration and the nature of the problem one wants to solve), the MAX model can be used:
CT(K) = max(Ncomp, Nmem)    (6.3)

If the hardware cannot hide all the latencies, the worst case scenario is used to determine CT(K) (all memory accesses are serialized), which is called the SUM model:

CT(K) = Ncomp + Nmem    (6.4)

Depending on the model, the number of clock cycles required for the kernel K to execute, denoted as C(K), can be defined as:

C(K) = NB(K) · Nw(K) · Nt(K) · CT(K) · 1/(NC · D),    (6.5)

where NB(K) is the number of thread blocks per multiprocessor, Nw(K) the number of warps per block, Nt(K) the number of threads per warp, NC the number of cores per multiprocessor and D the depth of the pipeline. Given the clock rate R of the device, the time required to execute kernel K is then:

C(K)/R.    (6.6)
All the previously defined variables depend on the kind of problem the GPU has to solve. The MAX model can be applied for arithmetic intensive implementations, while the SUM model is best suited for memory intensive implementations.
Case study: MD5-crypt
To estimate the performance of MD5-crypt with the practical model, we have to determine the number of clock cycles C(K) a kernel K requires to execute one round of MD5-crypt, assuming that every unique thread T reviews one password candidate. Together with a predetermined number of password candidates N, it is then possible to estimate the number of hashes per second with this model. For the sake of simplicity, let N be a multiple of 32 (the number of threads in a warp), for example 320 000. If we define a block size of 256 threads and we have a device with 60 multiprocessors (like the Nvidia GTX295), the number of blocks per multiprocessor is NB(K) = 320000/(60 · 256) ≈ 20.8.
Now it is necessary to determine the number of clock cycles required for arithmetic computation Ncomp and the number of clock cycles required for memory accesses Nmem in one thread that executes one instance of MD5-crypt for one candidate password. Ncomp is the same as the amount in Section 6.4.1, which is 657 312. To determine Nmem, we simply count all the memory accesses in the default MD5-crypt implementation. This value depends on the password and salt length; if we assume that both have a length of 8 bytes, the total number of memory accesses will be 110 394 for one thread that executes one instance of MD5-crypt for one candidate password. Furthermore, on a Nvidia GTX 295 the number of cores per multiprocessor NC is 8, the number of threads in a warp Nt(K) is 32 and the depth of the pipeline D is 1. Finally, because the block size is 256, the number of warps that fit in each block Nw(K) is 8. With the previous information available, it is now possible to estimate the kernel execution time for 320 000 candidate passwords by combining Equations 6.5
and 6.6. Figure 6.2 shows the performance estimates for the SUM and MAX models, both for accessing shared memory (which takes on average 4 clock cycles per access [1]) and global memory (which takes on average 500 clock cycles per access [1]). Furthermore, the performance estimate for the theoretical model is shown. While the difference between the shared memory and the global memory approach seems large (900K versus 11K hashes per second respectively), in practice the warp scheduler will try to hide the latency by swapping threads that are waiting on a memory request, so the practical values will be higher. The degree of hiding depends on the configuration and the implementation itself. The difference between the values of the MAX model with shared memory and the theoretical model is caused by the configuration settings in the practical model: if the workload is not divided equally amongst all multiprocessors, the performance decreases.
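The following sketch reproduces these estimates numerically; all constants come from the text above, and the code itself is only an illustration of the model.

    #include <stdio.h>

    int main(void) {
        double N        = 320000.0;        /* candidate passwords               */
        double NB       = N / (60 * 256);  /* blocks per multiprocessor         */
        double Nw       = 8, Nt = 32;      /* warps per block, threads per warp */
        double NC       = 8, D  = 1;       /* cores per MP, pipeline depth      */
        double Ncomp    = 657312.0;        /* arithmetic cycles per thread      */
        double accesses = 110394.0;        /* memory accesses per thread        */
        double R        = 1.242e9;         /* clock rate in Hz                  */

        double Nmem_sh = accesses * 4.0;   /* ~4 cycles per shared access       */
        double Nmem_gl = accesses * 500.0; /* ~500 cycles per global access     */

        double CT_max = Ncomp > Nmem_sh ? Ncomp : Nmem_sh;   /* Equation 6.3 */
        double CT_sum = Ncomp + Nmem_gl;                     /* Equation 6.4 */

        double CK_max = NB * Nw * Nt * CT_max / (NC * D);    /* Equation 6.5 */
        double CK_sum = NB * Nw * Nt * CT_sum / (NC * D);

        printf("MAX model, shared memory: %.0f hashes/sec\n", N / (CK_max / R)); /* ~900K */
        printf("SUM model, global memory: %.0f hashes/sec\n", N / (CK_sum / R)); /* ~11K  */
        return 0;
    }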
Figure 6.2: Performance achieved for MD5-crypt on a Nvidia GTX295, calculated with the practical model.
Chapter 7

Optimization of our implementation on a CUDA enabled GPU
7.1 Implementation details
This section contains the actual implementation in pseudocode. Only the most relevant parts are reviewed here. For an overview of all code, we refer to http://www.martijnsprengers.eu/phKrack/. The implementation itself differs from the original implementation by Poul-Henning Kamp because all calculations are done on integers instead of bytes and no conversion to base 64 takes place, since it is enough to compare the 128-bit output of the final MD5-compression function with the 128-bit hash of the target password. Furthermore, the algorithm specific optimizations from Chapter 6.3.2 are incorporated. GPU and CUDA specific code is left out.
The algorithm can be described by the following steps:

1. Decode the base 64 target hash to four 32-bit words (CPU).
2. Calculate the size of the search space, based on the character set and maximum password length (CPU).
3. Initialize n GPU kernels with the salt, target hash, configuration parameters (such as grid size and number of threads per block) and the i-th (where i < n) part of the search space (CPU).
4. Depending on the virtual thread number, generate a unique password candidate. Since thread creation does not incur large overhead, every thread checks one password candidate (GPU).
5. Hash the candidate password with MD5-crypt, as described by Algorithms 1, 2 and 3 (GPU).
Algorithm 1 MD5-crypt pseudo code, Context 1.
1: unsigned integer buffer[16]
2: unsigned integer final[4]
3: set(buffer, password ∥ salt ∥ password)
4: set(buffer, 0x80)              ▷ Add the binary 1
5: set(buffer, msglen << 3)       ▷ Add the message bit length
6: MD5Compress(final, buffer)     ▷ Call MD5 and store result in final
Algorithm 1 shows how the first context is built. It sets up buffer, which contains the concatenation of the password, salt and password. Then buffer is structured as described by Chapter 3.2.3 and the MD5-compression function is called. The result is stored in the final array. This result is then fed into context two (shown in Algorithm 2), where a new buffer is built. This time it contains the concatenation of the password, the string $1$ and the salt. Furthermore, n bytes of the result of the first context are added, where n is the length of the password. Then, depending on n, a null byte or the first byte of the password is added. Finally, buffer is padded and partitioned before the MD5-compression function is called. Again, the result of this context is stored in the final array and fed to the third context (Algorithm 3). This context contains a loop with 1000 iterations. In every iteration, the result of the last round (final) and the password are stored in buffer. Then, depending on the iteration number, the password and salt are added, and finally the MD5-compression function is called, which stores the result in final again. At the end, the result of the last iteration contains the actual password hash, which is then matched against the target hash. If both match, a flag is set and the algorithm can stop searching. This concludes the description of the most important part of the MD5-crypt algorithm.
Algorithm 2 MD5-crypt pseudo code, Context 2.
1: set(buffer, password ∥ $1$ ∥ salt)
2: set(buffer, final[len(password)])   ▷ Add len(password) bytes of final
3: for i = passwordLength; i; i >>= 1 do
4:     if i & 1 then                   ▷ Case i is odd
5:         set(buffer, \0)             ▷ Add a null byte
6:     else                            ▷ Case i is even
7:         set(buffer, pw[0])          ▷ Add the first byte of the password
8:     end if
9: end for
10: set(buffer, 0x80)                  ▷ Add the binary 1
11: set(buffer, msglen << 3)           ▷ Add the message bit length
12: MD5Compress(final, buffer)         ▷ Call MD5 and store result in final
7.2 Optimizations
This section describes the types of optimizations that can be used within the
CUDA framework. Some types result in a significant performance increase while
others even decrease the overall performance. The optimizations described in
this section are applied to our implementation. In general, the programmer has
to determine if this specific type of optimization affects his own implementation.
The discussed optimizations are mentioned in [1, 2].
7.2.1 Maximizing parallelization
Application level
According to Amdahl's law [4], the maximum speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. It is thus desirable that most of the program can be executed in parallel. Amdahl's law states that if P is the part of a program that can be made parallel (i.e. benefit from parallelization), and (1 − P) is the part that cannot be parallelized (remains sequential), then the maximum speedup that can be achieved by using N processors is:

S = 1 / ((1 − P) + P/N)    (7.1)
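For example, if 99% of a program can be parallelized (P = 0.99), even N = 240 processors yield a maximum speedup of S = 1/(0.01 + 0.99/240) ≈ 71: the sequential 1% quickly dominates.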
The calculation of the actual hashes is still fully parallelized on the GPU. In addition, there are no intermediate communications between the GPU and the host CPU, which would be dependent on the bus speed and slow down the execution.
Device level
The compute capability of the device determines the number of kernels that can
reside concurrently on a device. This is an important aspect, because the more
kernels that reside on a device, the more the hardware is kept busy. For example,
if one kernel needs to access data on the global memory, the device can run
another kernel while waiting for the memory access to complete. However, on
devices with compute capability 1.x, only one kernel can reside on the device
concurrently. This implies that the kernel should be launched with sufficient
thread blocks, i.e. with at least as many thread blocks as there are multiprocessors
on the device. Chapter 7.2.3 will treat more configuration optimizations that can
be applied to our implementation.
Multiprocessor level
The application should use the thread-level parallelism principle at an even lower level: it is important to keep the individual multiprocessors as busy as possible. This can be achieved by latency hiding. Whenever the warp scheduler issues an instruction, it selects a warp which is ready to execute its next instruction and then issues that instruction to the active threads in the warp. The number of clock cycles it takes for a warp to be ready for the execution of its next instruction is called latency. Maximum utilization is therefore directly dependent on the number of warps that can reside on one multiprocessor. If all warp schedulers always have an instruction to issue for some warp at every clock cycle during that latency period, the latency can be completely hidden. As described in Chapter 5.1.1, the number of instructions needed to hide a latency of L clock cycles can be determined by taking L/4, since a multiprocessor in a device with compute capability 1.x issues one instruction per warp over 4 clock cycles [1].
A warp is not always ready to execute its next instruction; this is mainly caused by the input variables not being available yet. Since our final optimized implementation does not require calls to global memory, latency is caused by register dependencies, shared memory accesses or constant cache misses. If all input operands are registers, for example in the MD5-compression function, the latency is caused by previous instructions whose execution is not completed yet.
7.2.2 Memory Optimizations
While hash functions have high arithmetic intensity in their calculations, the password hashing schemes based on them frequently access the memory (for example to add the password and salt to a buffer); therefore it is mostly the memory that forms the bottleneck in the execution speed. This section describes all memory optimizations that can be done in order to improve performance. The result of each improvement will be presented in Chapter 8.
Global memory
Our implementation does not use large data structures and therefore we do not need to store data in global memory manually. However, if our data structures do not fit in shared memory and the compiler could not place them in the registers, those structures will be placed in local memory, which actually resides in global memory. Local memory is so named because its scope is local to the thread, but the physical location of this memory is off-chip and it has the same properties as global memory. Only the compiler decides which variables will be placed in local memory, based on how much space the data structures need and on how many registers are spilled (see Section 5.3.3). Because global memory latency is very high compared to shared memory or register latency, the most important performance consideration is to make sure global memory accesses are coalesced. With our implementation this is already the case, since variables in local memory are organized such that consecutive words are accessed by consecutive thread ids. Accesses are therefore fully coalesced if all threads in a warp access the same relative address. This is the case in our implementation because all threads use the same index in the arrays every time. In our first, non-optimized implementation, the following data structures are stored in per-thread local memory (and thus have the same properties as global memory):
unsigned char buffer[64];
unsigned int  final[16];
unsigned char password[N];
unsigned char salt[8];
unsigned char charset[M];
unsigned int  target[4];
buffer is used to store the 64-byte input for the MD5-compression function, final is used to store the output of the MD5-compression function, password is the N-byte password that has to be checked, salt is the 8-byte salt that is the same for all threads, charset is the M-byte character set which is used to build a candidate password based on the current thread id, and target is the representation of the target hash in 4 unsigned integers.
As a consequence, the kernel has to access the local memory every time it uses
one of the above variables. The latency of these accesses is about 400 to 600 clock
cycles (depending on the compute capability and the memory bus speed of the
device). The number of warps that are needed to keep the warp schedulers busy
during the latency period can be determined by the ratio between the number of
instructions with on-chip memory and the number of instructions with off-chip
memory. If this ratio is low, more warps resident per multiprocessor (and thus
more threads per block) are required to hide the latency.
Shared memory
Although the warp scheduler uses latency hiding to cover the global memory latency, the access time of shared memory is almost as fast as register access time; the use of shared memory is therefore preferred over global memory. Because it is on-chip, uncached shared memory latency is roughly 100x lower than global memory latency. However, all virtual threads that run on one multiprocessor have to share the available shared memory, and concurrent threads can access other threads' memory addresses. Therefore, the developer has to manage the memory accesses himself.
Individual threads can access their shared memory addresses concurrently
with the use of equally sized 32-bit wide memory modules, called banks. Because
banks can be accessed simultaneously, any memory load or store of n addresses
that spans n distinct memory banks can be serviced simultaneously, yielding an
effective bandwidth that is n times as high as the bandwidth of a single bank. On
a device with compute capability 1.x, the shared memory has 16 banks that are
organized in such a way that successive 32-bit words are assigned to successive
banks, i.e. interleaved. A memory request for a complete warp is split into two
memory requests, one for each half-warp, that are issued independently. Each
bank has a bandwidth of 32 bits per clock cycle and since there are 16 banks on
a device and the warp size is 32, accessing all the banks for the threads in a full
warp has a bandwidth of 32 bits per two clock cycles.
Bank conflicts arise when multiple threads from the same half-warp access the same bank. The warp is then serialized, and every thread that accesses that bank is executed sequentially. No bank conflicts can arise between threads that belong to the first and second half-warp respectively. The hardware splits a memory request that has bank conflicts into as many separate conflict-free requests as necessary, decreasing the effective bandwidth by a factor equal to the number of separate memory requests. There is one exception: when all threads in a half-warp address the same shared memory location, the value is broadcast. To minimize bank conflicts, it is important to understand how memory addresses map to memory banks and how memory requests should be optimally scheduled in a specific implementation. For example, consider the following listing, a sketch of the padding idea with illustrative constants:
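    #define THREADS_PER_BLOCK 224   /* illustrative value */

    /* With a row size of 16 words, thread t accesses address 16t + i, so all
       threads of a half-warp hit the same bank: a 16-way bank conflict. One
       extra word of padding makes the stride 17, mapping consecutive threads
       to distinct banks. */
    __shared__ unsigned int conflicting[THREADS_PER_BLOCK][16];   /* bank conflicts */
    __shared__ unsigned int padded[THREADS_PER_BLOCK][16 + 1];    /* conflict free  */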
To calculate the total amount of shared memory Sblock in bytes that is allocated for a block, the following can be used:

Sblock = ceil(Sk, GS),    (7.2)

where Sk is the amount of shared memory used by the kernel in bytes and GS is the shared memory allocation granularity, which is equal to 512 for devices of compute capability 1.x and 128 for devices of compute capability 2.x. Sk can be derived by taking the product of the shared memory needed to run one thread and the number of threads in a block.
If we want to decrease the local memory usage to zero and remove all the bank conflicts, our implementation needs the following data structures available in shared memory (assuming that the password length is n bytes):

__shared__ unsigned int  buffer[THREADS_PER_BLOCK][k+1];
__shared__ unsigned int  final[THREADS_PER_BLOCK][4+1];
__shared__ unsigned char passwords[THREADS_PER_BLOCK][n+1];    (7.3)
On a device with compute capability 1.x, the maximum shared memory per multiprocessor is 16384 bytes. For example, if we take 256 threads per block and a password length of 8 bytes, the total shared memory needed would become ceil(256 · (4 · 10 + 8 + 25), 512) = 18944 bytes, which is more than the 16384 bytes available, making the kernel refuse to launch. If we take 224 threads per block, the kernel uses ceil(224 · (4 · 10 + 8 + 25), 512) = 16384 bytes, which is exactly the maximum possible amount.
Table 7.1 shows the effect of the stride on the number of warp serializes in our implementation when the kernel is run once with 224 threads per block. The roughly 9 000 warp serializes that remain after the shared memory bank conflicts are solved have several causes: uncontrollable register bank conflicts, constant cache misses and read-after-write dependencies. However, compared to the 66 million warp serializes that exist with the bank conflicts, this number is not significant.
Registers
Theoretically, because registers have a zero clock cycle latency, it is of utmost
importance to keep all the program variables in the registers whenever possible.
Stride      Warp serializes
No stride   66 000 000
Stride 1    9 389

Table 7.1: Effect of the stride on the number of warp serializes, for one kernel run with 224 threads per block.

The number of warps per block Wblock can be calculated as:

Wblock = ceil(Tb / Wsize, 1),    (7.4)
where Tb is the number of threads per block, Wsize is the warp size, which is equal to 32 for devices with compute capability 1.x, and ceil(x, y) is equal to x rounded up to the nearest multiple of y. Now, the total number of registers Rblock allocated for a block can be calculated as follows [1]:

Rblock = ceil(ceil(Wblock, Gw) · Wsize · Rk, GT),    (7.5)
where Gw is the warp allocation granularity which is equal to 2 for devices with
compute capability 1.x, Rk is the number of registers used by the kernel and
GT is the thread allocation granularity (which is equal to 256 for devices with
compute capability 1.0 and 1.1, and 512 for devices with compute capability 1.2
and 1.3).
Register latency is caused by either read-after-write dependencies or register memory bank conflicts. This type of latency costs approximately 24 cycles, but can be completely hidden if the number of active threads per block is well chosen. A device with compute capability 1.x has 8 thread processors per multiprocessor and thus needs 8 · 24 = 192 active threads (which is equal to 6 warps) to hide this latency.
The programmer has no control over which variables are placed in registers and which are spilled to local memory. The only thing a programmer can do is restrict the compiler in the number of registers that can be used by a kernel.
The compiler and hardware thread scheduler will then schedule instructions as
optimally as possible to avoid register memory bank conflicts. They achieve the
best results when the number of threads per block is a multiple of 64. Other than
following this rule, an application has no direct control over these bank conflicts.
Constant memory
The constant memory space resides in the device (global) memory and is cached in the constant cache. Because the constant memory is cached on-chip, a read from the constant memory is as fast as a read from the registers on a cache hit; on a cache miss, a read from the constant memory is as slow as a read from the device memory. For devices with compute capability 1.x, a request for constant memory is split into two requests, one for each half-warp, which are issued independently. Each request is then split into as many separate requests as there are different memory addresses in the initial request. Accesses to different addresses are thus serialized, decreasing throughput by a factor equal to the number of separate memory addresses.
In our implementation, the following variables stay the same throughout the
execution of the whole kernel and can therefore be placed in the constant memory:
__constant__ unsigned char salt[8];
__constant__ unsigned char charset[M];
__constant__ unsigned int  target[4];
Of these three variables, the threads of a warp always access the same addresses for salt and target, and those requests are thus as fast as register requests. However, every thread generates a unique password based on its thread id, so multiple threads in a warp access different memory addresses of charset when generating the password. Since this is only done once per thread, it has no significant impact on the performance.
7.2.3 Execution configuration optimizations
One of the major keys to performance is to keep the hardware as busy as possible.
This implies that the workload should be equally shared between all the multiprocessors of a device. If the work is poorly balanced across the multiprocessors,
they will deliver suboptimal performance. It is therefore important to optimize
the execution configuration for a given kernel. The key concept that helps to
achieve optimal performance is occupancy.
Occupancy is the metric that determines how effectively the hardware is kept busy, by looking at the active warps on a multiprocessor. This metric originates from the fact that thread instructions are executed sequentially; the only way to hide latencies and keep the hardware busy when the current warp is paused or waiting for input is to execute other warps that are available on the multiprocessor. Occupancy is therefore defined as Wa/Wmax, where Wa is the number of active warps per multiprocessor and Wmax is the maximum number of possible active warps per multiprocessor (which is 24, 32 and 48 for devices with compute capability 1.2, 1.3 and 2.x respectively). Occupancy is kernel (and thus application) dependent, which means that some kernels achieve higher performance with lower occupancy. However, low occupancy always interferes with the ability to hide memory latency, which results in a decrease of performance. On the other hand, higher occupancy does not always imply higher performance, but it may help to cover the latencies and achieve a better distribution of the workload.
The main factor determining occupancy is thus the number of active warps per multiprocessor, which in its turn is determined by the number of thread blocks per multiprocessor and the number of threads per block. Recall from Chapter 5.3.2 that entire blocks are executed on one multiprocessor and that the warp size is 32 for all compute capabilities. The number of active warps per multiprocessor can then be calculated as Wa = Ta/32, where Ta is the number of active threads per multiprocessor. Now, to determine the optimal number of threads per block Tb, we have to take into account the following restrictions, which we will discuss below:
Physical configuration limits of the GPU.
Maximum number of available registers per multiprocessor.
Maximum number of available shared memory per multiprocessor.
Apart from these restrictions, we will also discuss some guidelines to determine
the optimal number of threads per block.
Physical configuration limits of the GPU
Every GPU has its own physical properties, described by the compute capability group the GPU belongs to. The most important properties are:
Threads per warp Wsize (warp size). The maximum allowed value of the number of threads that can be executed concurrently, which is equal to 32 for all compute capabilities.
Warps per multiprocessor Wmax . The maximum allowed value of the number of warps that can reside on one multiprocessor.
Threads per multiprocessor. The maximum allowed value of the number of
threads that can reside on one multiprocessor. This number is determined
by the product of the threads per warp and the warps per multiprocessor.
Thread blocks per multiprocessor T Bmax . The maximum allowed value of
the number of thread blocks per multiprocessor, which is equal to 8 for all
compute capabilities.
Based on the previous, the number of active threads per multiprocessor Ta can be calculated as:

Ta = Tb · min(TBmax, (Wmax · Wsize)/Tb)    (7.6)

Now maximum occupancy can be achieved if the number of active threads per multiprocessor is equal to the product of warps per multiprocessor and threads per warp:

Tb · min(TBmax, (Wmax · Wsize)/Tb) = Wsize · Wmax    (7.7)

For example, if we do not consider the register and shared memory usage for our implementation and we use a device with compute capability 1.3, we get the following formula for the calculation of the optimal number of threads per block Tb (and thus maximum occupancy):

Tb · min(8, (32 · 32)/Tb) = 1024    (7.8)
Figure 7.2 plots Equation 7.8 for several values of Tb and shows the influence of
several block sizes on the occupancy of our kernel implementation. As becomes
clear from the figure, block sizes of 128, 256, 512 and 1024 result in the maximum
occupancy. However, we did not yet take the effects of shared memory and
register size into consideration.
Figure 7.2: General influence of number of threads per block on the occupancy
of a device with compute capability 1.3.
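A small sketch makes Equation 7.6 concrete for compute capability 1.3 (TBmax = 8, Wsize = 32, Wmax = 32); note that it models only these physical limits, not the register and shared memory restrictions discussed next.

    /* Active threads per multiprocessor, limited by the physical limits only. */
    int active_threads(int Tb) {
        int by_warps = (32 * 32) / Tb;               /* blocks that fit by warp count */
        int blocks   = by_warps < 8 ? by_warps : 8;  /* capped by TBmax = 8 */
        return Tb * blocks;
    }
    /* active_threads(256) == 1024 (maximum occupancy); active_threads(96) == 768 */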
After inspecting the compiled code, it turned out that our implementation uses 17 registers per thread. With this information we can calculate the maximum number of threads per block and find the maximum occupancy our implementation can achieve, given that we use a device with compute capability 1.3, which has 16384 32-bit registers available per multiprocessor. The maximum number of threads per block Tb can be calculated as:
calculated as:
Tb =
16384
Rblock
Wsize =
32 = 960
Rk Wsize
17 32
16384
Because of the ceil(x, y) function we have to round down 1732
to 30 and the
final result then becomes 960 threads per block. With this number of threads per
block, the number of active threads per multiprocessor Wa is then 960/32 = 30
and the achieved occupancy is then 30/32 100% 94%.
Figure 7.3: Maximum achieved occupancy for our implementation ran on a device
with compute capability 1.3.
To hide the latency of register memory bank conflicts, the optimal number of threads should be a multiple of 64.

Since the latency of register read-after-write dependencies is approximately 24 cycles, the number of threads per block should be higher than 8 · 24 = 192 and 32 · 24 = 768 for compute capabilities 1.3 (8 thread processors per multiprocessor) and 2.x (32 thread processors per multiprocessor) respectively.

The performance increases when the grid size is a multiple of the number of available multiprocessors, since the thread blocks are then divided equally amongst the multiprocessors.
Together with the information extracted from Figure 7.2, the optimal number of threads per block should therefore be 256, 512 or 1024, but since our implementation is restricted by the available shared memory, our optimal value is 224 threads per block, which equals 22% occupancy.
7.2.4 Instruction Optimizations
GPUs have an instruction set that fits their purpose of graphics calculation. However, not all instructions that execute fast on a CPU will do so on a typical GPU. Therefore, awareness of how instructions are executed often permits low-level optimizations, especially in code that runs frequently (hot spots), such as the 1000-iteration for-loop in our implementation. When all high-level optimizations have been done, it may be advantageous to look at low-level instruction optimizations. The following low-level optimizations may affect our implementation:
Modular arithmetic This kind of arithmetic is expensive on GPUs. In the hot spot of our implementation the if-statements if(i % 3) and if(i % 7) are used. Unfortunately, we cannot rewrite these statements because 3 and 7 are not powers of 2. If they were, (i & (n-1)) (where n is a power of 2) could be used, which is less expensive. The same holds for division, but since we use division only once per thread (in the password generation phase), it is not significantly beneficial to rewrite this statement. And even if it were, we could still not use the cheap form since the denominator is not a power of 2 (otherwise (i / n) could be written as (i >> log2(n))), as the sketch below illustrates.
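As an illustration of the rewrite that would apply if the constants were powers of two (the values here are purely illustrative):

    unsigned int i = 42;          /* example value */
    unsigned int r1 = i % 8;      /* expensive modulo on the GPU */
    unsigned int r2 = i & 7;      /* equivalent and cheap        */
    unsigned int q1 = i / 8;      /* expensive division          */
    unsigned int q2 = i >> 3;     /* equivalent and cheap        */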
Type conversions This kind of arithmetic is expensive too, for example in functions operating on char whose operands need to be converted to int. In the original implementation two data structures were used to store the input to the MD5-compression function:

unsigned char buffer[64];
unsigned int  buffer[16];

The 64-char array was converted to a 16-int array, which then served as input for the MD5-compression function. However, since the elements of the array are stored sequentially in the GPU memory, it is not necessary to convert between char and int. In our current implementation only the int buffer[16] array is needed, because the addresses of the 16 int elements can be accessed byte-wise with this function:
void setThisRef(unsigned int *a, unsigned int b, unsigned char c) {
    unsigned char *ptr = (unsigned char *) a;
    *(ptr + b) = c;
}
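For instance, the padding byte can then be written directly into the integer buffer; len is a hypothetical byte offset holding the current message length:

    setThisRef(buffer, len, 0x80);   /* write the 0x80 padding byte at byte offset len */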
While this solution does not decrease the number of instructions (because a cast from int to char is still needed), it needs fewer calls to shared memory (because we only need one data structure to store the input), and such calls slow down the execution.
7.2.5 Control flow
Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput if threads of the same warp follow different execution paths, which is called warp divergence. If this is the case, the different execution paths must be serialized, increasing the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path. This only happens when the controlling condition of the branch depends on the thread id; a condition that evaluates identically for all threads of a warp causes no divergence.
Chapter 8
Experimental evaluation
We have described how GPUs can speed up the performance of MD5-crypt and differentiated the optimizations. To see whether our implementation is successful, this chapter contains the setup and outcomes of our experiments. The following three kinds of results are distinguished:

- We describe how the specific optimizations, as defined in Chapter 7, affect the performance on CUDA enabled GPUs.
- We compare the performance of known CPU implementations of MD5-crypt to ours. Moreover, we compare the performance increase against other achievements in the field (such as GPU implementations for AES and ECC).
- We measure the effects of our performance increase on contemporary password databases and password policies.
8.1 Experiment setup

8.1.1 Performance metric

8.1.2 Available hardware
Although our implementation can execute on multiple GPUs, we ran our tests only on one Nvidia GeForce 295 GTX. To compare the implementation with CPU hardware, we used an equally priced Intel i7 920 processor. The specifications of the hardware on which we performed the experiments can be found in Appendix A.1.
8.2 Optimizations effects

8.2.1 Algorithm optimizations
After defining a baseline (point 1), four optimizations are compared to this baseline (points 2 to 5). Then the optimizations are combined (points 6 and 7) and again compared to the baseline.
1. Baseline. This is the implementation which does not use any optimization
at all. Moreover, all the variables are stored in the local (i.e. global)
memory and the allocation of the memory is done automatically.
2. Constant memory. The variables that do not change during the execution,
such as target hash, salt and character set, are now stored in the constant
memory. The other variables are still stored in local memory.
3. Shared memory (bank conflicts). The changing variables, such as the input
buffer and the resulting hashes, are now stored in the fast shared memory.
However, as described in Chapter 7.2.2, bank conflicts and thus warp serializes occur, which slow down the execution. The non-changing variables
are still stored in the local memory.
4. Shared memory (no bank conflicts). The allocation of the shared memory is now done in such a way that no bank conflicts occur. With a strided access pattern every thread can access its own bank, such that warp serializes are kept to a minimum, which increases the performance (at the cost of a small increase in total shared memory used).
5. Optimized MD5. Based on the description in Chapter 6.3.2, the MD5-compression function is statically optimized. Depending on the password length, only the necessary calculations are performed.
6. Shared and constant memory (no bank conflicts). Now both the changing and non-changing variables are kept in shared and constant memory
respectively.
7. Optimized MD5 with shared and constant memory. In addition to point 6,
the optimized version of the MD5-compression function is used as well.
To make a fair comparison, the configuration parameters are kept equal. The number of thread blocks (grid size) is set to 1200 (since this is a multiple of 60, the number of multiprocessors on a Nvidia GTX 295) and the number of threads per block (block size) is set to 160 (since this is the maximum number for all optimizations to have enough shared memory available). In Figure 8.1 the performance increase per optimization is shown in black; combined optimizations are shown in gray. The figure shows us that the shared memory (without bank conflicts) and MD5-compression function optimizations achieve speedups of 3 and 2 times the baseline respectively. However, when we combine both optimizations, the speedup is only a little over 3 times the baseline. This can be explained by the fact that the MD5-compression function optimization reduces the number of memory calls. While this optimization achieves a significant performance increase when all variables are stored in local memory, only little performance
increase is gained when all variables are stored in shared memory (since this type of memory has low latency). Furthermore, the figure also shows that there is very little performance increase when the non-changing variables are stored in constant memory and the other variables in shared memory, compared to storing all the variables in shared memory (points 6 and 4 respectively). This can be explained by the fact that the compiler stores frequently used variables in the constant memory by itself and keeps a copy in local memory too (for when a constant cache miss arises). Finally, the figure shows that storing variables in the shared memory without solving the bank conflicts even degrades the performance compared to the baseline. Therefore, bank conflicts (and thus warp serializes) should be avoided.
8.2.2 Configuration optimizations
Chapter 7.2.3 showed that it is important to keep the hardware as busy as possible by maximizing the occupancy. To measure the influence of the number of threads per block (block size) on the performance of our most optimal implementation, we varied the number of threads per block while keeping the grid size constant. Figure 8.2 shows the influence of the block size on the performance while the grid size is set to 1200. With the most optimal implementation and the threads per block set to 224, we are able to launch an exhaustive search with our maximum performance of approximately 880 000 hashes per second. Block sizes up to 229 could be used as well, but did not achieve more performance than 224 threads per block. With block sizes higher than 229, the kernel refused to run since there was not enough shared memory available.

Figure 8.2: Influence of the number of threads per block on the performance of our most optimal implementation, executed on one Nvidia GeForce GTX 295.
8.2.3 Comparison with the performance models
In the theoretical estimate of the performance we did not incorporate the candidate password generation phase or the comparison with the target hash. Since our password generation algorithm (which can be found in Appendix A.2.1) uses modular arithmetic and integer division, this influences the final performance. However, every thread only generates one password, so the time this takes may be ignored compared to the execution time of the rest of the algorithm. Figure 8.3 shows our most optimal implementation, but without the optimization of the MD5 algorithm, compared against the theoretical and practical models (as described in Chapters 6.4.1 and 6.4.2). From the figure it becomes clear that the performance of our implementation lies between the performance estimates of the SUM and MAX models. This can be explained by the fact that the warp scheduler swaps threads while they are waiting for a (shared) memory call to complete, so higher performance can be achieved than estimated by the SUM model. Since MD5-crypt is memory intensive, memory calls cannot be ignored and therefore our implementation will never reach the same level of performance as the MAX or theoretical model.
8.3 Comparison with other implementations

8.3.1 CPU implementations
In order to compare our implementation against implementations on other hardware platforms, it is necessary to consider the costs of the hardware. We ran our implementation on two equally priced devices: the Nvidia GeForce 295 GTX and the Intel i7 920 (both with an average price of 290 euro). However, we neglected the fact that a GPU cannot run without a host CPU. In addition, a CPU is not able to address all its resources, since it has to run an operating system as well.
At the time of writing this thesis, no other MD5-crypt GPU implementations were available for testing. Therefore, we compared our implementation to the fastest MD5-crypt CPU implementation known to us, which is incorporated in the password cracker John the Ripper [19]. However, John the Ripper does not support parallel MD5-crypt yet: as of June 2010, only parallel CPU implementations of DES and bcrypt are supported (see http://openwall.info/wiki/john/parallelization). In order to fully use all four cores of the Intel i7 920, we used the OpenMP library to parallelize our CPU implementation. Figure 8.4 shows the performance of MD5-crypt implementations on GPU and CPU hardware. Our GPU implementation (about 880 000 hashes per second) achieves a speedup of 28 times over our CPU implementation (about 32 000 hashes per second) and a speedup of 104 times over the John the Ripper CPU implementation (about 8 500 hashes per second). Compared to the John the Ripper implementation, we achieve a speedup of two orders of magnitude.
8.3.2 Other cryptographic implementations
GPUs have proven to be suitable to speed up implementations of symmetric [29, 28] and asymmetric [30, 69] cryptographic functions, such as AES [12], RSA [30] and ECC [9, 8]. Table 8.1 shows the different speedups GPUs can achieve over CPUs. Note that not all experiments were carried out on equally priced hardware, which influences the final outcome. Moreover, variations in hardware, relative newness of GPU and CPU chips, CPU implementations, and CPU technologies (e.g. the use of the SSE instruction set) will all lead to changes in the GPU/CPU run-time ratio. However, the table does show that GPUs can be used efficiently to perform cryptographic operations, or at least can act as a cryptographic co-processor.
Work        Cryptographic type   Algorithm   Speedup GPU over CPU
[9, 8]      Asymmetric           ECC         4-5
[41]        Symmetric            AES         5-20
[29]        Symmetric            AES         4-10
[30]        Asymmetric           RSA         4
This work   Hashing              MD5-crypt   28

Table 8.1: Speedup of GPU over CPU for different cryptographic applications.
The major difference between the speedup for hashing and the other cryptographic types comes from the fact that hash functions are built solely from native arithmetic instructions (such as shifts, additions and logical operators), which have high throughput on a typical GPU. RSA and ECC, in contrast, are built from multiplications and modular arithmetic, which have lower throughput than native arithmetic instructions. In addition, password hashing allows for maximal parallelization, whereas with AES not all modes of encryption can be used for parallelization (e.g. Cipher-block Chaining or Cipher Feedback mode).
8.4 Consequences for real world passwords
We have shown that, in the field of exhaustive searches on the password hashing scheme MD5-crypt, GPUs can significantly speed up the cracking process. To determine if this speedup is significant enough to attack systems that use such password hashing schemes, we have to see how our implementation performs on real world datasets. As example databases, we took the ones described in Chapter 4.2. To see whether the passwords are crackable in a feasible amount of time, we defined four password classes:
1. Passwords consisting of only lowercase ASCII characters (26 in total)
2. Passwords consisting of lowercase and numeric ASCII characters (36 in total)
3. Passwords consisting of lowercase, numeric and uppercase ASCII characters (62 in total)
4. Passwords consisting of lowercase, numeric, uppercase and special ASCII characters (94 in total)
Table 8.2 shows how the four classes are represented in the two datasets.
Dataset       % in class 1   % in class 2   % in class 3   % in class 4
phpbb.com          41             47             10              2
rockyou.com        26             59              8              7

Table 8.2: Distribution of the four password classes in the real world datasets.
Figure 8.5: Crack times for real world datasets on one Nvidia GTX 295 with a performance of 880 000 hashes per second.
Figure 8.5 then shows the amount of time it takes one Nvidia GTX 295 to crack a percentage of each database, given that all the passwords were hashed with MD5-crypt. The figure shows that within two months of cracking time (the standard password expiration period in most corporate environments), 77% and 54% of the passwords in the phpbb and rockyou databases, respectively, could be cracked. In addition, the figure shows that 98% of the phpbb database could be cracked if no password hashing scheme is used and the passwords are hashed with only one application of the MD5 function. Considering the fact that one cracked password can influence the security of an entire system, we consider the identified percentages unacceptable. Moreover, given Moore's Law, even more passwords can be searched effectively in the near future. Table 8.3 shows the search times for the four password classes and some password lengths, given the performance of 880 000 hashes per second.
Length   26 characters   36 characters   62 characters    94 characters
4        0.5 seconds     2 seconds       16 seconds       2 minutes
5        13 seconds      1 minute        17 minutes       2 hours
6        5 minutes       41 minutes      18 hours         10 days
7        2 hours         1 day           46 days          3 years
8        2 days          37 days         8 years          264 years
9        71 days         4 years         488 years        20 647 years
10       5 years         132 years       30 243 years     2 480 775 years
Table 8.3: Exhaustive search times for some password lengths on one GTX 295 with a performance of 880 000 hashes per second.
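The entries in Table 8.3 follow directly from dividing the size of the keyspace by the hash rate. A minimal sketch (in C) of that calculation, assuming the 880 000 hashes per second measured above:

#include <stdio.h>
#include <math.h>

int main(void)
{
    const double rate = 880000.0;            /* hashes per second on one GTX 295   */
    const int charsets[] = {26, 36, 62, 94}; /* sizes of the four password classes */

    for (int c = 0; c < 4; c++) {
        for (int len = 4; len <= 10; len++) {
            /* worst case: the full keyspace charset^length must be searched */
            double seconds = pow((double)charsets[c], (double)len) / rate;
            printf("class %2d, length %2d: %.2e seconds (%.1f days)\n",
                   charsets[c], len, seconds, seconds / 86400.0);
        }
    }
    return 0;
}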
P = (S · A · E) / (c · k)    (8.2)

where S is the hash rate of the hardware platform (hashes per second), A the adversary's budget, E the password expiration period, c the cost of one hardware unit and k the size of the password keyspace.
Figure 8.6: Probability of a successful exhaustive search attack for some lengths
in the most complex password class.
With the password expiration period set to 2 months, well-funded adversaries (with A > $10,000,000) are able to crack all passwords with a length of less than 10 characters.
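A minimal sketch of Equation 8.2 as a reusable routine; the parameter names follow the equation, and any concrete values an organisation plugs in are its own assumptions:

/* Fraction of the keyspace an adversary with budget A can search
 * within the expiration period E (Equation 8.2), capped at 1. */
double success_probability(double S,  /* hashes per second per device  */
                           double A,  /* adversary budget (dollars)    */
                           double E,  /* expiration period (seconds)   */
                           double c,  /* cost per device (dollars)     */
                           double k)  /* size of the password keyspace */
{
    double p = (S * A * E) / (c * k);
    return p < 1.0 ? p : 1.0;
}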
This concludes the description of our experimental results. We have shown that GPUs can significantly speed up cryptographic applications such as password hashing, provided that the full power of a typical GPU can be addressed. Moreover, we have shown that either the design of password hashing schemes should be adapted to contemporary standards or users should increase the entropy of their passwords.
Chapter 9
9.1 Conclusions
In this research, we have identified both the security properties and the attacker models of prevalent password hashing schemes. In particular, we have shown that GPUs can be used to launch exhaustive search attacks on password hashing schemes that rely on the key strength of user-chosen passwords. Such attacks are easy to parallelize because of their embarrassingly parallel nature, which enabled us to optimize the implementation of one password hashing scheme, MD5-crypt. Our implementation approaches the theoretical speed limit of a CUDA enabled GPU, and it achieves a speedup of two orders of magnitude over the best existing CPU implementation known to us.
We described how a general programming framework for GPUs, CUDA, can be used in combination with the specific hardware properties of an (Nvidia) GPU, such as fast shared memory and intelligent thread schedulers. Our implementation achieves a performance of around 880 000 password hashes per second, whereas an equally priced CPU only achieves around 30 000 password hashes per second. With this performance increase, complex passwords with a length of 8 characters are becoming feasible to crack. Moreover, we showed that between 50% and 80% of the passwords in a leaked database could be recovered within 2 months of computation time on one Nvidia GeForce 295 GTX.
To help an organisation assess the risks related to its password policy, we proposed a model that, given the value of the protected assets, the performance of a specific hardware platform, the imposed password policy and the password expiration period, estimates the probability of a successful exhaustive search attack on a password hashing scheme. To decrease the probability of success for such an attack, the complexity of prevalent password hashing schemes should be increased by at least four orders of magnitude.
In addition to the existing attacks on password hashing schemes, we identified a theoretical attack that is based on a strong collision attack on the underlying hash function (see Chapter 4.5). Therefore, one extra property should be defined for password hashing schemes that use the key-stretching technique to strengthen the password: based on the collision attacks, those schemes should hash the salt, the password and the result of the last round in a pseudo-random way in every iteration of the key-stretching phase. While the collision resistance property of the MD5 hash function has been broken, password hashing schemes based on this function can, according to our findings, still be used securely. However, the use of new hash functions, such as SHA-3, should be considered in the near future.
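A minimal sketch of such a construction (using OpenSSL's MD5 purely for illustration); the state-dependent ordering below is one possible instantiation of the proposed property, not a vetted design:

#include <openssl/md5.h>
#include <string.h>

/* Every iteration hashes salt, password and the previous digest in an
 * order that depends on the current state, so a single fixed collision
 * cannot be carried unchanged through all rounds. */
void stretch(const char *pw, const char *salt, unsigned char digest[16], int rounds)
{
    MD5_CTX ctx;
    MD5((const unsigned char *)pw, strlen(pw), digest);  /* initial state */
    for (int i = 0; i < rounds; i++) {
        MD5_Init(&ctx);
        switch (digest[0] % 3) {                         /* state-dependent order */
        case 0:
            MD5_Update(&ctx, digest, 16);
            MD5_Update(&ctx, salt, strlen(salt));
            MD5_Update(&ctx, pw, strlen(pw));
            break;
        case 1:
            MD5_Update(&ctx, salt, strlen(salt));
            MD5_Update(&ctx, digest, 16);
            MD5_Update(&ctx, pw, strlen(pw));
            break;
        default:
            MD5_Update(&ctx, pw, strlen(pw));
            MD5_Update(&ctx, salt, strlen(salt));
            MD5_Update(&ctx, digest, 16);
            break;
        }
        MD5_Final(digest, &ctx);
    }
}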
9.2 Discussion
9.3 Future work
Bibliography
[1]
[2] CUDA Best Practices Guide. Technical report, Nvidia Corporation, August 2010. [cited at p. 63, 75]
[3]
[4]
[5]
[6] M. Bellare and T. Kohno. Hash function balance and its impact on birthday attacks. In Advances in Cryptology – EUROCRYPT 2004, pages 401–418. Springer, 2004. [cited at p. 9]
[7]
[8]
[9]
[11] E. Biham. A fast new DES implementation in software. In Fast Software Encryption: 4th International Workshop, FSE '97, Haifa, Israel, January 1997, Proceedings. Springer, 1997. [cited at p. 11]
[12] J. W. Bos, D. A. Osvik, and D. Stefan. Fast implementations of AES on various platforms. Cryptology ePrint Archive, Report 2009/501, November 2009. http://eprint.iacr.org. [cited at p. 3, 85]
[13] W. E. Burr, D. F. Dodson, and W. T. Polk. Electronic authentication guideline. NIST Special Publication 800-63, 2004. [cited at p. 27]
[14] L. Clair, L. Johansen, W. Enck, M. Pirretti, P. Traynor, P. McDaniel, and T. Jaeger. Password exhaustion: predicting the end of password usefulness. Information Systems Security, pages 37–55, 2006. [cited at p. 8]
[15] D. L. Cook, J. Ioannidis, A. D. Keromytis, and J. Luck. CryptoGraphics: Secret key cryptography using graphics cards. Topics in Cryptology – CT-RSA 2005, pages 334–350, 2005. [cited at p. 2]
[16] T. M. Cover and J. A. Thomas. Elements of Information Theory, volume 1. Wiley, 1991. [cited at p. 26]
[17] I. Damgård. A design principle for hash functions. Advances in Cryptology, CRYPTO '89, pages 416–427, 1990. [cited at p. 16]
[18] B. Den Boer and A. Bosselaers. Collisions for the compression function of MD5. In Advances in Cryptology – EUROCRYPT '93, pages 293–304. Springer, 1994. [cited at p. 52]
[19] S. Designer. John the Ripper password cracker, January 2011. [cited at p. 11, 84]
[20] A. Di Biagio, A. Barenghi, G. Agosta, and G. Pelosi. Design of a parallel AES for graphics hardware using the CUDA framework. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–8. IEEE, 2009. [cited at p. 3]
[21] U. Drepper. Unix crypt using SHA-256 and SHA-512. Technical report, Akkadia, 2008. [cited at p. 5]
[22] A. J. Elbirt, W. Yip, B. Chetwynd, and C. Paar. An FPGA implementation and performance evaluation of the AES block cipher candidate algorithm finalists. In The Third AES Candidate Conference, pages 13–27. National Institute of Standards and Technology, Gaithersburg, MD, 2000. [cited at p. 12]
[23] M. J. Flynn. Some computer organizations and their effectiveness. Computers, IEEE Transactions on, C-21(9):948–960, 1972. [cited at p. 37]
[24] I. Foster. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995. [cited at p. 53]
[25] M. Garland and D. B. Kirk. Understanding throughput-oriented architectures. Communications of the ACM, 53(11):58–66, 2010. [cited at p. 37]
[40] C. Li, H. Wu, S. Chen, X. Li, and D. Guo. Efficient implementation for MD5-RC4 encryption using GPU with CUDA. In Anti-counterfeiting, Security, and Identification in Communication, 2009. ASID 2009. 3rd International Conference on, pages 167–170. IEEE, 2009. [cited at p. 3]
[41] S. A. Manavski. CUDA compatible GPU as an efficient hardware accelerator for AES cryptography. In Signal Processing and Communications, 2007. ICSPC 2007. IEEE International Conference on, pages 65–68. IEEE, 2008. [cited at p. 3, 85]
[42] U. Manber. A simple scheme to make passwords based on one-way functions much harder to crack. Computers & Security, 15(2):171–176, 1996. [cited at p. 29]
[43] A. J. Menezes, P. C. Van Oorschot, and S. A. Vanstone. Handbook of Applied Cryptography. CRC Press, 1997. [cited at p. 8, 14, 15]
[44] N. Mentens, L. Batina, I. Verbauwhede, and B. Preneel. Time-memory trade-off attack on FPGA platforms: UNIX password cracking. Reconfigurable Computing: Architectures and Applications, pages 323–334, 2006. [cited at p. 10, 12, 28]
[45] R. Merkle. A certified digital signature. In Advances in Cryptology – CRYPTO '89 Proceedings, pages 218–238. Springer, 1990. [cited at p. 16]
[46] G. E. Moore et al. Cramming more components onto integrated circuits. Proceedings of the IEEE, 86(1):82–85, 1998. [cited at p. 93]
[47] R. Morris and K. Thompson. Password security: A case history. Communications of the ACM, 22(11):594–597, 1979. [cited at p. 27]
[48] R. Mukherjee, M. S. Rehman, K. Kothapalli, P. J. Narayanan, and K. Srinathan. Presenting new speed records and constant time encryption on the GPU. [cited at p. 3]
[49] J. S. Norris and P. H. Kamp. Mission-critical development with open source software: Lessons learned. IEEE Software, pages 42–49, 2004. [cited at p. 52]
[50] P. Oechslin. Making a faster cryptanalytic time-memory trade-off. Advances in Cryptology – CRYPTO 2003, pages 617–630, 2003. [cited at p. 10, 28]
[51] L. O'Gorman. Comparing passwords, tokens, and biometrics for user authentication. Proceedings of the IEEE, 91(12):2021–2040, 2003. [cited at p. 25]
[52] B. Pinkas and T. Sander. Securing passwords against dictionary attacks. Proceedings of the 9th ACM Conference on Computer and Communications Security, pages 161–170, 2002. [cited at p. 21]
[53] B. Preneel, editor. Progress in Cryptology – AFRICACRYPT 2009, Second International Conference on Cryptology in Africa, Gammarth, Tunisia, June 21–25, 2009. Proceedings, volume 5580 of Lecture Notes in Computer Science. Springer, 2009. [cited at p. 97]
[54] B. Preneel. Perspectives on lightweight cryptography. Inscrypt 2010, October 2010. [cited at p. 8]
[55] N. Provos and D. Mazieres. A future-adaptable password scheme. In Proceedings of the Annual USENIX Technical Conference. Citeseer, 1999. [cited at p. 29, 33]
[56] R. Rivest. RFC 1321: The MD5 message-digest algorithm. RFC Editor, United States, 1992. [cited at p. 16, 18, 19]
[57] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W. W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 73–82. ACM, 2008. [cited at p. 75]
[58] Y. Sasaki and K. Aoki. Finding preimages in full MD5 faster than exhaustive search. Advances in Cryptology – EUROCRYPT 2009, pages 134–152, 2009. [cited at p. 52]
[59] Y. Sasaki, L. Wang, K. Ohta, and N. Kunihiro. Security of MD5 challenge and response: Extension of APOP password recovery attack. In Proceedings of the 2008 Cryptographers' Track at the RSA Conference on Topics in Cryptology, pages 1–18. Springer-Verlag, 2008. [cited at p. 52]
[60] B. Schneier. Applied Cryptography: Protocols, Algorithms, and Source Code in C. John Wiley & Sons, 1996. [cited at p. 9]
[61] F. J. Seinstra, J. Maassen, R. V. van Nieuwpoort, N. Drost, T. van Kessel, B. van Werkhoven, J. Urbani, C. Jacobs, T. Kielmann, and H. E. Bal. Jungle computing: Distributed supercomputing beyond clusters, grids, and clouds. Grids, Clouds and Virtualization, pages 167–197, 2011. [cited at p. 94]
[62] C. E. Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50–64, 1951. [cited at p. 27]
[63] C. E. Shannon. A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1):3–55, 2001. [cited at p. 26]
[64] N. Smart. ECRYPT II yearly report on algorithms and keysizes (2009–2010). Technical report, ECRYPT, 2010. [cited at p. 8]
[65] M. Stevens. On collisions for MD5. Master's thesis, Eindhoven University of Technology, June 2007. [cited at p. 19]
[66] M. Stevens, A. Lenstra, and B. De Weger. Chosen-prefix collisions for MD5 and colliding X.509 certificates for different identities. Advances in Cryptology – EUROCRYPT 2007, pages 1–22, 2007. [cited at p. 12, 32]
[67] M. Stevens, A. Sotirov, J. Appelbaum, A. Lenstra, D. Molnar, D. Osvik, and B. De Weger. Short chosen-prefix collisions for MD5 and the creation of a rogue CA certificate. Advances in Cryptology – CRYPTO 2009, pages 55–69, 2009. [cited at p. 3, 12]
[68] H. Sutter. The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobb's Journal, 30(3):202–210, 2005. [cited at p. 37]
[69] R. Szerwinski and T. Güneysu. Exploiting the power of GPUs for asymmetric cryptography. Cryptographic Hardware and Embedded Systems – CHES 2008, pages 79–99, 2008. [cited at p. 3, 35, 85]
[70] C. J. Thompson, S. Hahn, and M. Oskin. Using modern graphics architectures for general-purpose computing: A framework and analysis. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, pages 306–317. IEEE Computer Society Press, 2002. [cited at p. 2]
[71] E. Verheul. Selecting secure passwords. Topics in Cryptology – CT-RSA 2007, pages 49–66, 2006. [cited at p. 23]
[72] X. Wang and H. Yu. How to break MD5 and other hash functions. Advances in Cryptology – EUROCRYPT 2005, pages 19–35, 2005. [cited at p. 3]
[73] T. Xie and D. Feng. How to find weak input differences for MD5 collision attacks. 2010. [cited at p. 10]
[74] T. Xie and D. Feng. Construct MD5 collisions using just a single block of message. Cryptology ePrint Archive, Report 2010/643, 2010. http://eprint.iacr.org/. [cited at p. 10]
[75] J. Yan, A. Blackwell, R. Anderson, and A. Grant. Password memorability and security: Empirical results. Security & Privacy, IEEE, 2(5):25–31, 2004. [cited at p. 25, 29]
[76] C. T. Yang, C. L. Huang, and C. F. Lin. Hybrid CUDA, OpenMP, and MPI parallel programming on multicore GPU clusters. Computer Physics Communications, 2010. [cited at p. 94]
[77] J. Yang and J. Goodman. Symmetric key cryptography on modern graphics hardware. Advances in Cryptology – ASIACRYPT 2007, pages 249–264, 2008. [cited at p. 2, 35]
Appendices
Appendix A
A.1 Hardware specifications

The test machine consisted of the following components: CPU, memory, motherboard, hard disk, GPU, power supply and operating system. The relevant specifications are listed below.
A.1.1 CPU
The Intel Core i7 920 processor we used for this research has the following specifications.

CPU Essentials:
Number of Cores              4
Number of Threads            8
Clock Speed (MHz)            2670
Max Turbo Frequency (MHz)    2930
Smart Cache (MB)             8
Instruction Set              64-bit
Instruction Set Extensions   SSE4.2
Memory Specs:
Max Memory Size (GB)             24
Memory Types                     DDR3-800/1066
Max Number of Memory Channels    3
Max Memory Bandwidth (GB/s)      25.6

A.1.2 GPU
The Nvidia GeForce GTX 295 cards we used for this research have the following specifications.

GPU Engine Specs:
CUDA Cores                        480 (240 x 2)
Graphics Clock (MHz)              576
Processor Clock (MHz)             1242
Texture Fill Rate (billion/sec)   92.2

Memory Specs:
Memory Clock (MHz)                999
Standard Memory Config            1792MB (896MB x 2) GDDR3
Memory Interface Width            896-bit (448-bit x 2)
Memory Bandwidth (GB/sec)         223.8
The Nvidia GeForce 9800 GT card we used for this research has the following specifications.

GPU Engine Specs:
CUDA Cores                        112
Graphics Clock (MHz)              600
Processor Clock (MHz)             1500
Texture Fill Rate (billion/sec)   33.6

Memory Specs:
Memory Clock (MHz)                900
Standard Memory Config            512MB GDDR3
Memory Interface Width            256-bit
Memory Bandwidth (GB/sec)         57.6
A.2 Code overview

A.2.1 Password generation
    input[passwordLength] = '\0';
    return input;
}
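The fragment above is the tail of a candidate-generation routine. A minimal sketch of how such a function might look in full; generateCandidate and its parameters are assumptions for illustration, not the original code:

char *generateCandidate(long index, const char *charset, int charsetLength,
                        int passwordLength, char *input)
{
    /* Interpret 'index' as a number in base 'charsetLength', one
     * character of the candidate password per digit. */
    for (int i = 0; i < passwordLength; i++) {
        input[i] = charset[index % charsetLength];
        index /= charsetLength;
    }
    input[passwordLength] = '\0';
    return input;
}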
List of Symbols and Abbreviations

Abbreviation   Description
AES            Advanced Encryption Standard
ALU            Arithmetic Logic Unit
BSD            Berkeley Software Distribution
CBC            Cipher Block Chaining
CFB            Cipher Feedback
CPU            Central Processing Unit
CUDA           Compute Unified Device Architecture
DES            Data Encryption Standard
ECC            Elliptic Curve Cryptography
FLOP           Floating Point Operation
FPGA           Field Programmable Gate Array
GPU            Graphics Processing Unit
IPS            Instructions Per Second
MD5            Message Digest 5
NIST           National Institute of Standards and Technology
OpenCL         Open Computing Language
PHS            Password Hashing Scheme
RAM            Random Access Memory
RFC            Request For Comments
RSA            Rivest, Shamir and Adleman
SHA            Secure Hash Algorithm
SIMD           Single Instruction, Multiple Data
SIMT           Single Instruction, Multiple Threads
List of Figures

8.5 Crack times for real world datasets on one Nvidia GTX 295 with a performance of 880 000 hashes per second.
8.6 Probability of a successful exhaustive search attack for some lengths in the most complex password class.
List of Algorithms
List of Tables

2.1 Schemes shown in this table are built around the hash function they are named after.
4.1 Characteristics of real world password datasets.
5.1 Throughput of native arithmetic instructions (operations per clock cycle per multiprocessor) [1].
5.2 Features of Nvidia's GPU devices (compute capability 1.3) [1].
8.1 Speedup of GPU over CPU for different cryptographic applications.
8.2 Distribution of the four password classes in the real world datasets.
8.3 Exhaustive search times for some password lengths on one GTX 295 with a performance of 880 000 hashes per second.