Document

CONTENTS
Name of Contents Page No.

LIST OF FIGURES I
LIST OF TABLES II
ABSTRACT 1
CHAPTER 1: INTRODUCTION
1.1 Motivation 4
1.2 Literature Review 6
1.2.1 High Speed AES Design 6
1.2.2 Architecture and Implementation of S-Box 7
1.3 Research Objective 8
1.4 Organization 8
CHAPTER 2: ADVANCED ENCRYPTION STANDARD (AES) ALGORITM

2.1 Definition and History of Cryptography 10
2.2 Galois Field 11
2.3 The Data Encryption Standard (DES) 11
2.4 The Advanced Encryption Standard (AES) 12
2.4.1 Subbytes/Inverse Subbytes Transformation 14
2.4.2 Shiftrows/Inverse Shiftrows Transformation 16
2.4.3 Mixcolumns/Inverse Mixcolumns Tranformation 16
2.4.4 Add roundkey Transformation and Key Expansion 17
2.4.4.1 Add roundkey Transformation 17
2.4.4.2 Key Expansion 17
2.5 Composite Field Arithmatic S-Box 18
2.5.1 Addition Operation In Gf (24) 20
2.5.2 Squaring Operation In Gf (24) 20
2.5.3 Multiplication With Constant, Λ 21
2.5.4 Galois Field Gf (24) Multiplication 22
2.5.4.1 Multiplication With Constant, Φ 22
2.5.4.2 Galois Field Gf (22) Multiplication 23
2.5.5 Multiplicative Inversion In Gf 24
CHAPTER 3: IMPLEMENTATION OF PROPOSED ARCHITECTURE FOR S-

BOX
3.1 Introduction 26
3.2 Proposed Architecture of S-Box 26
3.2.1 Introduced an Operator (Op) After Merging of Some 27
Blocks
3.2.2 Implementation of Multiplicative Inverse in Gf (24 ) 28
Using Multiple
3.2.3 Reduced The Critical Path of Multiplication In Gf 30
3.3 Implementation of Proposed Architecture Of S-Box 31
3.3.1 Asic Implementation of Mi In Gf (24) 31
3.3.2 Asic Implementation of Multiplication In Gf (22) 31
3.3.3 Asic And Fpga Implementation of Proposed S-Box 32
3.4 Conclusion 34
CHAPTER 4: HIGH SPEED AES ENCRYPTION

4.1 Introduction 36
4.2 Proposed Architecture for AES Encryption Algorithm 36
4.3 FPGA Implementation of Proposed Architecture of AES 37
4.4 Conclusion 39
CHAPTER 5: RESULTS AND DISCUSSION

5.1 Overview Of VLSI 41
5.2 History of Scale Integration 42
5.3 VLSI And Systems 43
5.4 Applications 44
5.5 Verilog HDL 45
5.6 Major Capabilities 46

5.7 Synthesis 47
CHAPTER 6: CONCLUSION AND FUTURE SCOPE

6.1 Conclusion 49
6.2 Future Scope 50
REFERENCES 51
LIST OF FIGURES
Figure No. Figure name Page No.

Figure 2-1 Basic step of Encryption in cryptography 10
Figure 2-2 Symmetric Key Cryptography 11
Figure 2-3 Rounds of AES Encryption algorithm 13
Figure 2-4 SubBytes Transformation 14
Figure 2-5 ShiftRows Transformation for AES Encryption 16
Figure 2-6 MixColumns transformation of AES Encryption 17
Figure 2-7 AddRoundKey transformation of AES Encryption 17
Figure 2-8 SubBytes and Inverse SubBytes transformation in 18
composite field
Figure 2-9 The conventional S-box architecture in composite field 19
Figure 2-10 Meaning of symbols used in Figure 2-9 19
Figure 2-11 Logical hardware diagram of squarer for GF (2 4) 21
Figure 2-12 Logical hardware diagrams for multiplication with 21
constant, λ
Figure 2-13 Logical hardware implementation of GF (24 ) multiplier 22
Figure 2-14 Hardware implementation of multiplication with φ 23
Figure 2-15 Hardware implementation of GF (22) multiplication 23
Figure 3-1 4:1 multiplexer for (a) LSB output and (b) MSB output for 31
2 bits output of multiplication in GF (22)
Figure 3-2 Proposed Multiplicative Inverse architecture 31
Figure 3-3 Hardware implementation Result of S-box obtained from 32
XC2VP30 device using ChipScope pro logic analyzer
Figure 4-1 the proposed architecture of AES encryption algorithm 37
Figure 4-2 Simulation Output of proposed AES encryption for 128 38
bits in Xilinx ISE 13.4

LIST OF TABLES
Table No. Table name Page No.

Table I: Round key size and number of rounds in three 12
versions of AES
Table II Subbytes Transformation Table 15
Table III Inverse SubBytes Transformation Table 15
Table IV Pre-computed results of the multiplicative inverse 24
operation in GF (24)
Table V MI in GF (24) 29
Table VI Comparison of MI in GF (24) in ASIC 31
Table VII Comparisons of multiplication in GF (22) in ASIC 32
Table VIII FPGA Implementation Results and Comparisons In 33
Xilinx Vertex-II Pro
Table IX FPGA Implementation Results and Comparisons in 33
Spartan6 (xc6slx16-3csg324)
Table X ASIC Implementation Results and Comparisons 34
ABSTRACT
In the present era of information processing through computers and access of private
information over the internet like bank account information even the transaction of money,
business deal through video conferencing, encryption of the messages in various forms has
become inevitable. There are mainly two types of encryption algorithms, a private key (also
called symmetric key having a single key for encryption and decryption) and public key
(separate key for encryption and decryption). In terms of computational complexity, private
key algorithm is less complex than a public key algorithm. The simple architecture of private
key algorithm attracts the VLSI implementation through the basic digital components like
basic gates and flip-flops. Moreover, the high throughput architecture can be realized for
encryption of very large amounts of data, e.g., images and videos, in real time. National
Institute of Standards and Technology (NIST) adopted Advanced Encryption Standard (AES)
as the standard for encryption and decryption of blocks of data. The draft is published under
the name as FIPS-197 (Federal Information Processing Standard number 197). AES is a
symmetric key block cipher. It encrypts data of block size 128 bits. The AES algorithm is
used in diverse application fields like WWW servers, automated teller machines (ATMs),
cellular phones and digital video recorders.
The AES is an iterative algorithm. It encrypts the data using four different transformations
namely SubBytes, ShiftRows, MixColumns and AddRoundkeys. SubBytes transformation
(also called substitution) is a non-linear operation in AES wherein each byte of a state is
mapped to a different value. The SubBytes transformation is done through S-box and it is the
most complex steps in terms of cost and implementation. Use of ROM table and composite
field arithmetic are two techniques to perform substitutions. The ROM based approach
requires high amount of memory and also it causes low latency and low throughput because
of ROM access time. Composite field arithmetic is more suitable for S-box implementation
of high speed AES encryption.
In the present work, hardware optimization for AES architecture has been done in different
stages. The critical path delay in architectural path has been reduced using different logic
components. For instance, there are a number of XOR gates and its combination in logic path
of substitution block which uses Galois field (GF) arithmetic. The equation in GF has been
restructured in order to have low delay components in the critical path. Moreover, the basic
1
components in GF arithmetic are also replaced with the digital components that can be
realized using multiplexers. For the AES implementation, merging technique has been used
wherein ShiftRows, MixColumns and AddRoundkeys transformations are performed in a
single VLSI module. Two different architectures of AES, namely iterative and concurrent,
have been implemented in Xilinx FPGA.
The hardware comparison results show that as AES architecture has critical path delay of
9.78 ns when conventional s-box is used, whereas it has a critical path delay of 8.17 ns using
proposed s-box architecture. The total clock cycles required to encrypt 128 bits of data using
a proposed AES architecture are 86 and therefore, throughput of the AES design in Spartan-6
of Xilinx FPGA is approximately 182.2 Mbits/s.
To achieve the very high speed, full custom design of the s - box in composite field has been
done for the proposed s-box architecture in Cadence Virtuoso. The novel XOR gate is
proposed for use in s-box design which is efficient in terms of delay and power along with
high noise margin. The implementation has been done in 180 nm UMC technology. Total
dynamic power in the proposed XOR gate is 0.63 µW as compared to 5.27 µW with the
existing design of XOR. The designed s-box using proposed XOR occupies a total area of
27348 µm2. The s-box chip consumes 22.6 µW dynamic power and has 8.2 ns delay after
post layout simulation has been performed.
2
CHAPTER 1
INTRODUCTION
3
In this chapter the motivation, research objectives and the organization of the thesis is
presented.
1.1 MOTIVATION
With worldwide communication of private and confidential data over the computer networks
or the Internet, there is always a possibility of threat to data confidentiality, data integrity and,
also data availability. Data encryption maintains data confidentiality, integrity and
authentication. Information has become of the most important assets in growing demand of
need to store every single importance of events in everyday life. Messages need to be secured
from unauthorized party. Encipherment is one of the security mechanisms to protect
information from public access. Encryption hides the original content of a message so as to
make it unreadable to anyone, except the person who has the special knowledge to read it.
In the past cryptography means only encryption and decryption using secret keys,
nowadays it is defined in different mechanisms like asymmetric-key encipherment (public-
key cryptography) and symmetric-key encipherment (called as privet-key cryptography). The
public key algorithm is complex and has very high computation time. Private Key algorithms
involve only one key, both for encryption as well as decryption whereas, public key
algorithms involve two keys, one for encryption and another for decryption. There were
many cryptographic algorithms proposed such as Data Encryption Standard (DES), 2-DES,
3-DES, Elliptic Curve Cryptography (ECC), the Advanced Encryption Standard (AES) and
other algorithms. Many investigators and hackers are always trying to break these algorithms
using brute force and side channel attacks. Some attacks were successful as it was the case
for the Data Encryption Standard (DES) in 1993[21].
The Advanced Encryption Standard (AES) is considered as one of the strongest
published cryptographic algorithms. National Institute of Standards and Technology (NIST)
adopted Advanced Encryption Standard (AES) as the standard for encryption and decryption
of blocks of data after the failing of the Data Encryption Standard (DES). The draft is
published under the name as FIPS-197 (Federal Information Processing Standard number
197)[5]. Moreover, it is used in many applications such as in ATM Machines, RFID cards,
cellphones and large servers. AES is widely used for encryption of audio/video data contents
in real time.
4
Due to the significance of the AES algorithm and the numerous real –time
applications, the main concern of this thesis will be presenting new efficient hardware
implementations for this algorithm.
AES algorithm is an iterative algorithm, which requires many computation cycles. A
software platform cannot provide the high speed encryption of data, specially used for real-
time applications. Audio/video content encryption is required in real-time in business deals
via video conferencing. Therefore, dedicated hardware implementation is inevitable in such
applications. Hardware implementation can be done through different architectures trading
throughput with area and power consumption. At any time, designing best architecture for a
particular design with low area and low latency is a challenge. Hardware implementations of
the AES algorithm vary according to the application. While some applications require very
high throughputs as in e-commerce servers, others require a medium throughput range as in
designs for cell phones. Some others require very low area and low power implementations to
be used application as RFID cards.
The AES is an iterative algorithm and uses four operations in different rounds, namely
SubBytes, ShiftRows, MixColumns and Key Additions transformations. SubBytes
transformation is done through S-box. S-box is the vital component in the AES architecture
that decides the speed/throughput of the AES[1]. The ROM based approach requires high
amount of memory and also it causes low latency because of ROM access time. Therefore,
composite field arithmetic is more suitable for S-box (substitution) implementation its
hardware optimization for VLSI implementation is very important to reduce the area and
power of the AES architecture.
We have designed a custom S-box for AES encryption. Firstly the logic verification
has been done by FPGA implementing using VHDL code of the S-box in Xilinx ISE and
ASIC using 0.18 µm standard cell technology library. The optimization of the design has
been done by proposing novel circuit for smaller components like XOR gate and other circuit
components like Galois Field (GF) multiplier. The XOR has been designed using minimum
number of transistors and it has high noise margin and low power consumption as compared
to existing XOR designs.
The design optimization has been done by replacing conventional modules in AES
architecture with a module which best suits for the area and latency reduction. Further, we
have synthesized two different design styles of AES, namely, iterative and concurrent
(pipeline) for implementation in Xilinx FPGA. Iterative architecture can be realized with low
5
area, but throughput is low as compared to concurrent architecture which has higher
throughput.
1.2 LITERATURE REVIEW
The Advanced Encryption Standard (AES) algorithm has published by NIST as a draft FIPS-
197 in 2001. There are numerous hardware implementations were suggested for it, among all
the implementation mostly they have targeted the AES with 128-bits key size [1-3]. This key
size is considered to be appropriate for most of the commercial applications, where using
higher key sizes is considered as excess of resources. It involves higher area implementations
with longer processing time and not easy to implement for small scale devices. Key sizes of
192-bit and 256 bits are used mostly in top secret military applications to confirm the
maximum level of security.
1.2.1 High Speed AES Design
AES algorithm is an iterative algorithm, which requires many computation cycles. A software
platform cannot provide the high speed encryption of data, specially used for real-time
applications. Audio/video content encryption is required in real-time in business deals via
video conferencing. Therefore, dedicated hardware implementation is inevitable in such
applications.
Hardware implementation can be done through different architectures trading
throughput with area and power consumption. The design optimization can be done by
replacing conventional modules in AES architecture with a module which best suits for the
area and latency reduction details in [13, 14].
Further, there are mainly two different design styles found and its implementation in
different devices namely, iterative and concurrent (pipeline) for implementation in Xilinx
FPGA [9, 10]. It has observed that concurrent implementation requires less time but the area
is large with high power consumption. The transformations used in different rounds are same
so, algorithm can be used repeatedly and area and power can save with the improvement in
speed [13, 17]. The iterative implementation could be efficient as per the requirement
published in [10].
6
1.2.2 Architecture and Implementation of S-Box
There are four transformations in the AES algorithm among all the transformation, SubBytes
is complex and non-linear. There are two techniques found to implement S-BOX, one using
RAM and another using composite field arithmetic architecture. The implementation of the
composite field S-BOX is accomplished using combinational logic circuits rather than using
pre-stored S-BOX values.
S-BOX substitution starts by finding the multiplicative inverse of the number in GF
(28), and then applying the affine transformation. Implementing a circuit to find the
multiplicative inverse in the finite field GF (28) is very complex and costly, therefore, [19]
has suggested using the finite field GF (24) to find the multiplicative inverse of elements in
the finite field GF (28). First detailed implementation of the composite field S-BOX was
published in [16].
The S-Box is at the major of any AES implementation and is measured a full
complexity design consuming the main portion of the power and energy inexpensive of the
AES hardware. The substitute way is to design the S-Box circuit using combinational logic
directly from its arithmetic operations. This method has a fine delay - path from S-Box
processing. The AES algorithm can be implemented on a varied range of platforms under
different constraints [2]. In transportable applications figuring resources are usually restricted
and dedicated hardware implementation of the safety purpose is essential.
AES Implementation using FPGA (Field Programmable Gate Array) is not
appropriate for such applications generally due to size and power limitations. A full-custom
chip is more suitable for compact small foot-print design in such a case. The Galois Field
arithmetic for S-Box, it is very clearly evident that the implementation of S-Box/InvS-Box
needs a large number of XOR operations [15].
The novel XOR has been designed using minimum number of transistors and it has
high noise margin and low power consumption as compared to existing XOR designs. The
new approach to minimize the silicon - area of S-Box design demonstrated by using a new 2-
input XOR gate for low-power composite field arithmetic to reduce the power dissipation and
delays for the complete circuit [15].
7
1.3 RESEARCH OBJECTIVE
Based on the above discussion, the main objectives for efficient AES algorithm designs are:
1. To propose the high speed S-BOX and its implementation in FPGA and ASIC.
2. In Composite Field Arithmetic, XOR is used in addition so, XOR has been designed
using minimum number of transistors and it has high noise margin and low power.
3. Full custom design of S-BOX for AES Encryption algorithm.
4. Implementation of high speed architecture of AES algorithm.
1.4 ORGANIZATION
This thesis is organized as follows:

Chapter 2 describes the AES algorithm in details. The four encryption steps are
presented: Byte Substitution, Shift Rows, Mix Column and finally Add Round Key. It
also describes the details of S-BOX implementation in Composite Field Arithmetic.
In Chapter 3, a proposed architecture of S-box with some modification in Conventional
architecture is presented. The implementation results of proposed architecture in FPGA
and ASIC with comparisons of previous work is also provided.
In Chapter 4, a high speed 128-bits pipelined and iterative AES encryption using new
efficient merging technique is presented. The simulated output and comparisons with
conventional works is also provided.
In Chapter 5, a proposed two inputs novel XOR gate has been presented with
comparison of conventional XOR. The schematic and layout of all the require blocks and
proposed S-BOX have been presented and comparison of post layout and before post
layout simulation results have been done.
Finally, the conclusion and future work are presented in Chapter 6.
8
CHAPTER 2
ADVANCED
ENCRYPTION
STANDERD (AES)
ALGORITHM
9
The brief introduction of Advanced Encryption Standard (AES) and its steps has discussed in
this chapter. Literature review of AES, S-BOX architecture and its implemented conventional
hardware design is also present.
2.1 DEFINITION AND HISTORY OF CRYPTOGRAPHY
Information need to be secured from unauthorized party. Cryptography is one of the security
mechanisms to protect information from public access. Cryptography is a Greek origin word
which means “secret writing” to make the information secure and immune to attacks. Classic
cryptography was used for top-secret communications between people. This kind of
cryptography is commonly applied by replacing the message letters with other letters using
definite formula. Nowadays it‟s changed into algorithm based cryptography according to
demand by the users. It has two processes encryption and decryption, in the first process
encryption; the Plain text (Original message) will be converted into secured text or Ciphertext
(Encrypted message) using a specific algorithm which is shown in Figure 2-1. The second
process is decryption which is the reverse process of encryption; here Ciphertext will be
converted into Plain text using all the inverse steps applied for encryption.
Cryptographic
Algorithm
Plaintext Ciphertext
(Original message) (Encrypted Message)
Figure 2-1 Basic step of Encryption in cryptography
There are mainly two types of encryption algorithms, a private key (also called symmetric
key) and public key. Private Key algorithms involve only one key, both for encryption as
well as decryption whereas; public key algorithms involve two keys, one for encryption and
another for decryption.
The private Key algorithm is less complex and easy to implement for high speed application.
Figure 2-2 illustrates the concept private key cryptography, an entity can send an encrypted
message through insecure channels to another entity but the key should be sent from a secure
channel to decrypt the message in Symmetric key cryptography. Advanced Encryption
Standard (AES) is a private key algorithm [1, 3].
10
Insecure channel
Encryption Cipher text
Decryption
Plain text Plain text
1010110… 11000011….. 1010110…
Key Key
Secure channel
1101000… 1101000…
Figure 2-2 Symmetric Key Cryptography.
2.2 GALOIS FIELD
The Galois field (GF) or Finite Field with a finite number of elements are extensively used in
cryptography. The total number of element present in GF is called the order of fields. A GF is
of the form pn, where n is a positive integer and p is a prime number also called the
characteristic of the Galois field.
There are many cryptographic algorithms using GF among them, the AES algorithm uses the
GF (28). The data byte can be represented using a polynomial representation of GF (28).
Equation 2.1 shows the polynomial representation of data bytes in Finite Fields.
( ) (2.1)
Arithmetic operation is totally different from normal arithmetic algebra; an addition can be
found using bit-wise XOR operation. In Galois field, the multiplication product of
polynomials will be modulo an irreducible polynomial so final answer can be within the used
finite field. The polynomial which cannot be factorized of two or more than two is called as
irreducible polynomial.
2.3 THE DATA ENCRYPTION STANDERED (DES)

National Institute of Standard and Technology (NIST) published a proposal from IBM in
1973 for symmetric key cryptosystem. DES was accepted and published in March 1975 as a
draft of Federal Information Processing Standard (FIPS). It was finally published in January
1977 as FIPS 46 in the Federal Register.
DES is 64-bit cryptosystem, here 64-bit plain text takes and creates a 64-bit cipher text for
the encryption process. Similarly, in decryption of a 64-bit cipher text taken to convert into
11
64-bit plain text. There is 56-bit same key has been used for both encryption and decryption.
Round-key generator generates the different round key for each round.
The linear cryptanalysis attack could break the DES algorithm and made it unconfident
algorithm. Several published brute force attacks started to fail DES algorithm. The NIST
started looking for replacement of DES algorithm because of its failure.
The NIST specifications required 128 bits block size and three different key sizes of 128, 192
and 256 bits, should be an open algorithm. The NIST declared that Rajndael cipher was
selected as Advanced Encryption Standard (AES).
2.4 THE ADVANCED ENCRYPTION STANDARD (AES)
The National Institute of Standards and Technology (NIST) announced that Rajndael
pronounced as “Rain Doll” planned by two Belgium researchers Joan Daemen and Vincent
Rijment was adopted as Advanced Encryption Standard (AES) for encryption and decryption
of blocks of data. The draft is published in December 2001, under the name as FIPS-197
(Federal Information Processing Standard number 197).
The criteria defined by selecting AES fall into three areas Security, Implementation and cost
of the algorithm. The main emphasis was the security of the algorithm to focus on resistance
of cryptanalysis attacks, implementation cost should be less so it can be used for small
devices like smart cards.
The AES algorithm is a private key block cipher. It encrypts data of block size 128 bits. It
uses three key sizes, 128 bits, 192 bits and 256 bits in three versions. AES uses three different
types of round operations. Table I shows the number of rounds in three versions of AES. But,
in each version final round key is 128 bits.
Table I: Round key size and number of rounds in three versions of AES
Cipher Key size No. Of Rounds (Nr) Round Key size
128 bits 10 128 bits
12
The initialization is done by adding first round key (128 bits) with 128 bits plain text. In
subsequent steps, the following transformations are done: SubBytes, ShiftRows,
MixColumns and AddRoundKey. The last round is different from the previous rounds as
there is no MixColumns transformation. Figure 1 shows the round in AES. The internal 128
bits data in AES are represented in the form of 4x4 square matrix containing elements of size
8 bits and named as state elements. The decryption process involves of the inverse steps,
decryption round contains of: Inverse S-BOX used for Byte Substitution, Inverse Shift Rows,
Add Round Key and Inverse Mix Columns. The round keys will be generated using a unit
called the key generation unit. This unit will be generating 176, 208 or 240 bytes of round
keys depending on the size of the used key.
Plain text
Round Key (0)

AddRoundKey
SubBytes
For i = 1
to Nr-1 ShiftRows
round
MixColumns
Round Key (i)

AddRoundKey
SubBytes
Final round
ShiftRows
Round Key (Nr)

AddRoundKey
Cipher text
Figure 2-3 Rounds of AES Encryption algorithm.
13
2.4.1 SUBBYTES/INVERSE SUBBYTES TRANSFORMATION
The first transformation, SubBytes, is used for encryption and inverse SubBytes used for
decryption. The SubBytes substitution is a nonlinear byte substitution that operates
independently on each byte of the State using a substitution table (S-box). Take the
multiplicative inverse in the finite field GF (28) and affine transform to do the SubBytes
transformation. Inverse affine transform have to find for inverse SubBytes transformation
then multiplicative inverse of that byte.
Figure 2-4 SubBytes Transformation [5].
Figure 2-4 indicates that how the transformation can be done. There are two hexadecimal
digits a and b in one state element, the left digit (a) defines the row and the right digit (b)
defines the column of the substitution table. The junction of these two digits is the new bytes.
Inverse SubBytes transformation is inverse of SubBytes transformation. It can find in the

similar way only table which is used for mapping the byte is different. The SubBytes
transformation is done through S-box. There are two techniques to perform substitutions, (i)
using S-BOX table, and (ii) using composite field arithmetic. There are separate tables for
SubBytes and its inverse; Table II is used for SubBytes transformation and Table III used for
its inverse. It can be found using S-box architecture in composite field arithmetic which is
discussed in the next chapter.
14
Table II: Subbytes Transformation Table [16]
Table III: Inverse SubBytes Transformation Table [16]
15
SubBytes table is also called as S-box and inverse SubBytes table is an Inverse S - box. There
are two parts of affine transformation and its inverse; a constant matrix will be multiplied
with the data in multiplication part, then the addition part, where a constant vector is added to
multiplication result.
2.4.2 SHIFTROWS/INVERSE SHIFTROWS TRANSFORMATION
The transformation is called ShiftRows performs in encryption, in which rows are cyclic
shifting to the left. The number of shifting depends upon the row number of the state matrix.
First row no shifting, second row one byte, third row two bytes and fourth row three byte
shifting left. In the decryption, InvShiftRows transformation performs the right cyclic shifting
operation inverse of ShiftRows; number of shifting depends on number of row number.
Figure 2-5 shows the Cyclic ShiftRows transformation for AES algorithm.
Figure 2-5 ShiftRows Transformation for AES Encryption.
2.4.3 MIXCOLUMNS/INVERSE MIXCOLUMNS TRANFORMATION

The MixColumns transformation functions after the ShiftRows on the State column-by-
column, considering each column as a four-term polynomial. Inverse MixColumns are the
inverse process of MixColumns which is used in the decryption of cipher text. The columns
are considered as polynomials over GF (28) and multiplied modulo x4 + 1 with a fixed
polynomial A (x), given in equation 2.2.
A(x) = {03} x3 + {01} x2 + {01} x + {02}. (2.2)
The algorithm for MixColumns and Inverse MixColumns involves multiplication and
addition in GF (28). The MixColumns multiplies the rows of the constant matrix by a column
in the state. Figure 2-6 describes the operation of this transformation; key addition is the next
transformation of the encryption.
16
Constant matrix
State State
Figure 2-6 MixColumns transformation of AES Encryption.
2.4.4 ADDROUNDKEY TRANSFORMATION KEY EXPANSION
2.4.4.1 AddRoundKey Transformation
The AddRoundKey adds the round key word with each column of state matrix. It is similar to
MixColumns; the AddRoundKey proceeds one column at a time. The most important in this
transformation, that it includes the cipher key. The state column will get XOR with key
which is generated by key generator and create another state as shown in Figure 2-7.
Figure 2-7 AddRoundKey transformation of AES Encryption.
2.4.4.2 Key Expansion

The key expansion term describes the operation of generating all Round Keys from the
original input key. The initial round key is original key in case of encryption and in case of
decryption the last group of the generated by key expansion keys will be original keys. As
mentioned earlier this initial round key will be added to the input firstly before starting the
encryption or decryption iterations. The 128 bits key size, 10 groups of round keys will be
generated with 16 bytes size. The round keys are generated word by word. There are some
similar encryption transformations used to generate the round key.
17
2.5 COMPOSITE FIELD ARITHMATIC S-BOX
The SubBytes transformation, done through S-BOX mapping is computationally inefficient

when implemented using a ROM. But, it is not efficient for applications requiring very high
throughput as ROM accessing involves one complete clock cycle for mapping one 8-bits state
element and consequently 16 clock cycles are required to transform the 128 bits data (16
bytes). To increase the throughput, parallel ROMs are required resulting in large size of chip
area.
Therefore, a more feasible solution is to implement an S - box is by using composite

field arithmetic which uses only logic elements in the implementation. The S-BOX
substitution starts by finding the multiplicative inverse of the data in GF (2 8), and then
applying the affine transformation. Figure 2-8 shows steps for the one byte forward and
inverse SubBytes transformation using composite field arithmetic.
Figure 2-8 SubBytes and Inverse SubBytes transformation in composite field [6].
To find the S-BOX transformation first multiplicative inverse of GF (2 8) then affine

transformation calculated. Similarly, for InvSubBytes first InvAffine transformation then
multiplicative inverse has to be calculated. There are one major operation involve here, which
is to find the multiplicative inverse in GF (28). This can be done by breaking the GF (28)
elements in GF (24) and some more logical blocks. I.e., Any arbitrary polynomial in GF (28)
can be represented as bx+c using an irreducible polynomial x2+Ax+B. Here, b is the most
significant nibble and c is the least significant nibble. The multiplicative inverse can be found
by using the following equation (2.3).
(bx  c ) 1
 b(b 2 B  bcA  c 2 ) 1 x  (c  bA)(b 2 B  bcA  c 2 ) 1 (2.3)
 b(b 2   c (b  c )) 1 x  (c  b)(b 2   c (b  c )) 1
18
Where, A=1, B=λ, as the irreducible polynomial used is x2+x+λ. Figure 2.9 shows the block
diagram to find the multiplicative inverse in GF (28) using GF (24). Figure 2.10 shows the
meanings the symbols used in Figure 2.9. The mapping structure in different fields along with
the irreducible polynomials is as follows.
GF (22) → GF (2) : x2 + x + 1
GF ((22)2) → GF (22) : x2 + x + φ
(2.4)
2 2 2 2 2 2
GF (((2 ) ) ) → GF ((2 ) ) : x + x + λ
4 4
x2 ×λ ×
8 4 -1
8 δ-1
δ x AT
GF(28) Sub Byte
Element in 4 out
4
× ×
Figure 2-9 The conventional S-box architecture in composite field [6].
δ Multiplication operation in GF (24)

Isomorphic mapping to Composite Fields
×
x2 Squarer in GF (24) δ-1 Inverse Isomorphic mapping to GF (24)
×λ Multiplication with constant, λ in GF (24)

Addition Operation in GF (24)
x-1 Multiplicative inversion in GF (24) AT Affine transformation
Figure 2-10 Meaning of symbols used in Figure 2-9.
Isomorphic mapping is the first step performed on the 8 bits sub byte input. The
output of the isomorphic mapping is given to the input of multiplicative inverse (MI) module.
Subsequently, inverse isomorphic mapping and affine transformations are the steps that
follow. The detailed discussion of each block has given below.
19
2.5.1 Addition operation in GF (24)
The addition operation in Galois Field can be interpreted to simple bitwise XOR operation
between the two elements.
2.5.2 Squaring operation in GF (24)
The squaring operation of 4 bits, i.e. x2 term can be modulo reduced using the irreducible
polynomial from (2.4), x2 + x + φ. It can reduce into lower order of Galois field, by setting x2
= x + φ and replacing it into x2. Doing above operation GF (24) is converted into GF (22),
here nibble is converted into 2 bit stream. It can be represented using equation 2.5.
( ) (2.5)
Where k {k3k2k1k0} is the four bits output of squarer and q {q3q2q1q0} is the input bit
steam, here qh, kh, ql, and kl are higher 2 bits of q and k and lower 2 bits of q and k
respectively. Now GF (22) can be converted into GF (2), the x2 term can be replaced x2 = x +
1. For the case of x3, it can be obtained by multiplying x2 by x, i.e. x3 = x(x) + x = x2 + x.
after Substituting for x2, x3 = x + 1 + x. The two x terms are presented which cancel out each
other, leaving only x3 = 1. Performing all the substitution output bit stream can be calculated
by input bit streams in GF (2) the final expression yields the following equation 2.6.
( )
Here we know that similar term in XORing operation will get cancelled, after that
polynomial substitution has to do which is discussed earlier. Equation 2.7 gives the logical
expression for all the output bits.
(2.7)
20
The equation (2.7) can be realized hardware logic diagram usingan XOR operationn
anddiagramsm drawn in figure 2.11.
q k
4 4
Figure 2-11 Logical hardware diagram of squarer for GF (24)
2.5.3 Multiplication with constant, λ
The multiplication with constant λ, which value is {1100} in GF (24) will give the
polynomials. Modulo reduction can be performed by substituting x2 = x + φ using the
irreducible polynomial in (2.4) to yield the logical expression. The final output bits k in the
form of input bits q can be calculated using irreducible polynomial, which represented in
equation 2.8. There are total three XOR gate is required to implement the multiplication with
λ. There are two XOR gates in critical path which will give the maximum delay. Figure 2-12
shows the logical hardware of equation 2.8.
(2.8)
q k
4 4
Figure 2-12 Logical hardware diagrams for multiplication with constant, λ
21
2.5.4 Galois field GF (24) multiplication
The GF (24) multiplier is a major component to find the multiplicative inverse using
composite field arithmetic operation. It requires more hardware to implement in
combinational logic. It is multiples of 4 bits with 4 bits and results also in 4 bits. Let k = qw,
where k in the 4 bits binary output and q and w are 4 bits inputs. It can be observed that
multiplication and addition operation in GF (22), multiplication in GF (22) is a major
component in it.
It can be converted into a lower form of Galois field using irreducible polynomial
present in equation 2.4. The final expression of logic implementation can be represented in
equation 2.9. Figure 2-13 shows the logical hardware implementation of GF (2 4)
multiplication. Here „+‟ represents the XOR operation.
( ) (2.9)
GF (22) multiplier.
Figure 2-13 Logical hardware implementation of GF (24) multiplier
2.5.4.1 Multiplication with constant, φ
The multiplication with constant φ, which has a constant value φ = {10}2 is an element of
GF (22). It has two bits value that can be also represented in the form of combinational logic
in equation 2.10. Figure 2-14 shows the hardware implementation of combinational logic.
Here k is a two bits output and q two bits input of the component.
22
(2.10)
Figure 2-14 Hardware implementation of multiplication with φ
2.5.4.2 Galois field GF (22) multiplication
The Galois field (22) multiplier is the major component in GF (24) multiplication, which exist
in the critical path. It can be represented in the input bit streams by using irreducible
polynomial presents in equation 2.4.
It also implemented using combinational logic which presents in equation 2.11.

Figure 2-15 shows its hardware implementation in composite field arithmetic. Here k is two
bits output and q and w are the two bits input of the component.
(2.11)
Figure 2-15 Hardware implementation of GF (22) multiplication
23
2.5.5 Multiplicative Inversion in GF (24)
The multiplicative inverse of q (where q is an element of GF (2 4)) is an intermediate

component of multiplicative inverse. It has derived a formula to compute the multiplicative
inverse of q, such that * +. The inverses of the individual bits can be
computed from the logical equation and pre-computed value can be stored in RAM. The pre-
computed value can be seen in table IV which will used to find the multiplicative inverse.
The method using logic equation is given in equation 2.12.
Table IV: Pre-computed results of the multiplicative inverse operation in GF (2 4) [16].
The table containing the results of the multiplicative inverse in hexadecimal is shown above.
The equation given below helped to hardware implementation in combinational logic where
„+‟ indicates the XOR operation.
q3 1  q3  q3 q2 q1  q3 q0  q2
q2 1  q3 q2 q1  q3 q2 q0  q3 q0  q2  q2 q1
q11  q3  q3 q2 q1  q3 q1q0  q2 q0  q2  q1
q0 1  q3 q2 q1  q3 q2 q0  q3 q1  q3 q1q0  q3 q0
 q2  q2 q1  q2 q1q0  q1  q0
(2.12)
24
CHAPTER 3
IMPLEMENTATION OF
PROPOSED
ARCHITECTURE FOR
S-BOX
25
3.1 INTRODUCTION
The SubBytes transformation is a non-linear operation in AES wherein each byte of a state is
mapped to a different value. The SubBytes transformation is done through S-box. There are
two techniques to perform substitutions, (i) using ROM table [10−12], and (ii) using
composite field arithmetic [13−15]. The SubBytes transformation, done through S-box
mapping is computationally inefficient when implemented using a ROM. But, it is not
efficient for applications requiring very high throughput as ROM accessing involves one
complete clock cycle for mapping one 8-bits state element and consequently 16 clock cycles
are required to transform the 128 bits data (16 bytes).
To increase the throughput, parallel ROMs are required resulting in large size of chip
area. Therefore, a more feasible solution is to implement an S - box is by using composite
field arithmetic which uses only logic elements in the implementation. Substitution is the
most complex steps in terms of cost and implementation [13]. Therefore, its hardware
optimization for VLSI implementation is very important to reduce the area and power of the
AES architecture. The ROM based approach requires high amount of memory and also it
causes low latency because of ROM access time. Therefore, composite field arithmetic is
more suitable for S-box (substitution) implementation.
The Speed improvement along with an area reduction has been the most challenging
research in VLSI implementation. We propose a high speed VLSI architecture for S-box. The
FPGA implementation of the architecture is done along with comparison with some existing
transformation techniques. The proposed architecture has delayed improvement and low
power consumption. The silicon validation of the architecture is done by programming
XC2VP30 device on Virtex-II Pro FPGA board. The proposed architecture is also
implemented in ASIC using 0.18 µm standard cell technology library.
3.2 PROPOSED ARCHITECTURE OF S-BOX.
The new architecture of S-BOX has proposed after 3 modifications in conventional

architecture of S-BOX.
I. Introduced an operator (op) after merging of some blocks.

II. Implementation of multiplicative inverse in GF (24) using multiplexor.
III. Reduced the critical path of multiplication in GF (22)
26
3.2.1 Introduced an operator (op) after merging of some blocks.
An operator (op) has introduced after merging of blocks like squarer, multiplication with
constant λ, a GF (24) multiplier and a four bits XOR. The equation of op has introduced using
Galois field irreducible conversion technique and in the form of an input bit stream. One
major operation involve here is finding the multiplicative inverse in GF (2 8).
This can be done by breaking the GF (28) elements in GF (24). I.e., Any arbitrary
polynomial in GF (28) can be represented as bx+c using an irreducible polynomial x2+Ax+B.
Here, b is the most significant nibble and c is the least significant nibble. The multiplicative
inverse can be found by using the following expression [16].
(bx  c)1
 b(b 2 B  bcA  c 2 )1 x  (c  bA)(b 2 B  bcA  c 2 ) 1 (1)
 b(b 2  c(b  c)) 1 x  (c  b)(b 2   c(b  c)) 1
Where, A=1, B=λ, as the irreducible polynomial used is x2+x+λ. Figure 3 shows the
block diagram to find the multiplicative inverse in GF (28) using GF (24) [16]. The mapping
structure in different fields along with the irreducible polynomials is as follows.
GF (22) → GF (2) : x2+ x+ 1
GF ((22)2) → GF (22) : x2+ x+ φ
GF (((22)2)2) → GF ((22)2) : x2+ x+ λ (2)
The expression b2λ+c (b+ c) in Eq. (1) Can be written as,
b2  bc  c 2  b(b  c)  c 2 (3)
Representing b, c and λ as,
b= bHx+ bL, c = cHx+ cL, λ = λHx+ λ L
Where bH and bL are the upper and lower 2-bits of b, similarly, cH and cL are the upper
and lower 2-bits of c and λH and λL are the upper and lower 2-bits of λ. The bλ, i.e.,
Multiplication with λ can be written as,
b  bH H x2  bL H x (4)
27
And therefore,
b  c  bH H x 2  bL H x  cH x  cL (5)
Now from Eq. (5),
b(b  c)
 (bH H x 2  bH H  cH x  cL )  (bH x  bL )
 (bH 2H   bH 2H  cH bH  cLbH  bL 2H  bL cH ) x (6)
(bL cL  bH 2 H   bH cH  )
Here, λ = (1100)2 and φ = (10)2. Performing operations in GF ((22)2), the following

value can be obtained in terms of upper (b) lower nibble (c) bits of inputs. We can reduce the
blocks in the proposed architecture from its conventional architecture of S-BOX.
op (0)
 b(0)c(0)  b(1)c(1)  b(2)c(3)  b(3)c(3)
b(3)c(2)  in(0)  in(3)  in(5)  in(7)
op (1)
 b(0)c(1)  b(1)c(0)  b(1)c(1)  b(2)c(2)
b(2)c(3)  b(3)c(2)  in(2)  in(3)  in(5)  in(6)
op (2) (7)
 b(0)c(2)  b(1)c(3)  b(2)c(0)  b(2)c(2)
b(3)c(1)  b(3)c(3)  in(1)  in(2)  in(4)  in(6)
op (3)
 b(3)(in(0)  in(3)  in(6))  c (3)(b(0)  b(1)  b(2))
b(2)c(1)  b(1)c(2)  in(1)  in(3)  in(4)
Where, in(0), in(1), in(2), in(3), in(4), in(5), in(6) are the input bits in the isomorphic
mapping module (δ).
3.2.2 Implementation of multiplicative inverse in GF (24) using multiplexor.
MI in GF (24) represented by the symbol x-1 and multiplication in GF (24) are the two main
components falls in the critical path of the design. MI in GF (2 4) consists of complex logic
given by [12].
28
q3 1  q3  q3 q2 q1  q3 q0  q2
q2 1  q3 q2 q1  q3 q2 q0  q3 q0  q2  q2 q1
q11  q3  q3 q2 q1  q3 q1q0  q2 q0  q2  q1
q0 1  q3 q2 q1  q3 q2 q0  q3 q1  q3 q1q0  q3 q0
 q2  q2 q1  q2 q1q0  q1  q0
(8)
Where, q31q2 1q11q0 1 is 4-bits MI of 4-bit value q3 q2 q1q0 and + sign indicates XOR
operation. It is evident that the realization of MI in GF (24) requires a number of exclusive-or

(XOR) gates. By eliminating the XOR gates, delay and area can be reduced. Table V shows
the input and output combination of MI in GF (2 4). The input combinations can be divided
into two equal halves.
In the first half, MSB will have value „0‟ and in the second half, MSB will be „1‟.
This can be realized by a multiplexer, wherein, for „0‟ MSB in input, the 4-bit output will be
given by a combination of three input bits (except MSB). Similarly, for MSB= „1‟, the 4-bit
output will have 3-bit input combination (except MSB). The combinational logic of the first
half and second half can be represented in terms of three input bits.
Table V: MI in GF (24) [16]
Input to MI Output
in GF (24) from MI in
GF (24)
q3q2q1q0 q3-1q2-1q1-1q0-1
0000 0000
0001 0001
0010 0011
First half
0011 0010
0100 1111
0101 1100
0110 1001
0111 1011
1000 1010
1001 0110
Second half
1010 1000
1011 0111
1100 0101
1101 1110
1110 1101
1111 0100
29
By using a multiplexer, one of the outputs, can be selected depending on the MSB of
the input. It is obvious that the combinational logic contains in equation (9a) and (9b) only
OR and AND gates instead of XOR gates.
     
q3 1  q2 q1 q0 or q2 q1q0 or q2 q1 q0 or  q2 q1q0 
q2 1   q q q  or  q q q 
2 1 0 2 1 0
q11   q q q  or  q q q  or  q q q  or  q q q 
2 1 0 2 1 0 2 1 0 2 1 0
q0 1   q q q  or  q q q  or  q q q  or  q q q 
2 1 0 2 1 0 2 1 0 2 1 0
or  q2 q1q0 
(9a)
      
q3 1  q2 q1 q0 or q2 q1 q0 or q2 q1q0 or q2 q1 q0 
q2 1   q q q  or  q q q  or  q q q  or  q q q 
2 1 0 2 1 0 2 1 0 2 1 0
or  q q q  or  q q q 
2 1 0 2 1 0
q11   q q q  or  q q q  or  q q q  or  q q q 
2 1 0 2 1 0 2 1 0 2 1 0
q0 1   q q q  or  q q q  or  q q q 
2 1 0 2 1 0 2 1 0
(9b)
3.2.3 Reduced the critical path of multiplication in GF (22)

The Figure 2-15 shows the conventional architecture of multiplication in GF (22). It is evident
that there are two XOR gates and one AND gate in the critical path of GF (2 2) multiplication.
The output equation can be written as in equation (10).
z (0)  x (1) y (1)  x (0) y (0)
z (1)  x (1) y (1)  x (0) y (1)  x (1) y (0)

(10)
The above equation can be implemented using two 4:1 parallel multiplexers as follows.
Suppose y is the select line. Then, for different values of y, multiplication result z, from Eq.
(10), will have values as given in TABLE III. Figure 3-1 shows the proposed architecture of
multiplier using two multiplexers.
One XOR gate has been eliminated from the critical path by using 4:1 multiplexer,
i.e., there is one XOR gate and one multiplexer only in critical path, as compared to two XOR
gates in conventional. Here, x and y are the 2 two bits inputs and z is the two bits output of
the multiplication in GF (22).
30
„0‟ „0‟
X(0) x (1)
x (1) z (0) x (0) xor x (1) z (1)

x (0) xor x (1) MUX_1 x (0) MUX_2
y y
Figure 3-1 4:1 multiplexer for (a) LSB output and (b) MSB output for 2 bits output of
multiplication in GF (22)
3.3 IMPLIMENTATION OF PROPOSED ARCHITECTURE OF S-BOX

Figure 3-2 shows the proposed architecture of S-box for AES has been implemented in Xilinx
FPGA and 180nm ASIC. The device taken for the implementation is XC2VP30 on Virtex-II
pro board. For the comparison, S-box architecture in [16] has also been implemented using
multiplicative inverse structure in the XC2VP30 device.
4
4 ×
4
MUX
8 δ-1
8 Op
δ
Multiplicative
GF(28) Inverse Out
4
Element 4
in
×
Figure 3-2 Proposed Multiplicative Inverse architecture
3.3.1 ASIC Implementation of MI in GF (24)

The proposed multiplicative inverse implementation has been done in ASIC design. Table VI
shows the ASIC implemented results and comparisons. It is evident that the proposed
methods are delay and area efficient. The total dynamic power is also less.
Table VI: Comparison of MI in GF (24) in ASIC
Technology Conventional Proposed
0.18 µm structure structure
Area (µm2) 352 279.41
Total Dynamic 97.58 62.93
Power (µW)
Delay (ns) 0.79 0.52
31
3.3.2 ASIC Implementation of multiplication in GF (22)
The proposed architecture of multiplication in GF (22) has implemented in ASIC. Table VII
shows the speed and power improvement as compare with conventional architecture.
Table VII: Comparisons of multiplication in GF (22) in ASIC
Technology Conventional Proposed

structure structure
0.18 µm
Area (µm2) 123.00 126.40

Power (µW)
Delay (ns) 0.41 0.23
3.3.3 ASIC and FPGA Implementation of proposed S-BOX
The conventional S-box as well as proposed S-box architectures has been implemented using
VHDL code. The device taken for the implementation is XC2VP30 on Virtex-II pro board.
The proposed architecture has also implemented in Spartan 6 for the comparison, S-box
architecture in [16] has also been implemented using multiplicative inverse structure in the
XC2VP30 device.
The design synthesis has been done in different Xilinx devices. Table VIII shows the
hardware utilization summary in terms of Slices and LUTs. Power consumption has been
measured by Xpower analyser tool in ISE 10.1. From Table VIII, it is evident that there is an
improvement in terms of delay and power consumption in the proposed structure as
compared to the structure in [16].
Figure 3-3 Hardware implementation Result of S-box obtained from XC2VP30 device
using ChipScope pro logic analyzer
32
It can be seen that there is considerable area improvement in terms of FPGA slices, and speed
improvement (the critical path delay) in our proposed method as compared to conventional
architectures of S-box. Table IX shows the comparison results using a Spartan 6. The large
improvement can be seen after implementing in SPARTAN6 (XC6SLX16-3CSG324). The
improvement in terms of slices, LUTS and delay is also optimized. As compared to Xilinx
vertex-II pro cost is less of this device so, it is useful for low utilization hardware
implementation.
Figure 3.3 shows the sample outputs obtained from FPGA through the ChipScope pro
logic analyzer after programming the XC2VP30 device. Table X shows the delay, power and
area comparison in ASIC implementation using 180nm standard cell technology library. The
proposed architecture has delay improvement of about 0.9 ns (=16 %) as compared to the
conventional s-box architecture in [16] with little area cost.
TABLE VIII: FPGA IMPLEMENTATION RESULTS AND COMPARISONS IN XILINX VERTEX-II

PRO
Proposed Structure in [16] Structure in

structure [14]
Device XC2VP30 XC2VP30 XC2V1000
# of Slices 36 48 153
# of 4-input 63 85 NA
LUTs
Max. Delay 15.0 15.6 10.82
(ns)
Total Power 7.27 9.74 NA
(W)
TABLE IX: FPGA IMPLEMENTATION RESULTS AND COMPARISONS IN

SPARTAN6 (XC6SLX16-3CSG324)
Proposed Structure Conventional Structure
# of Slices 24 30
# of 4-input 54 81
LUTs
Max. Delay 15.23 15.63
(ns)
Total Power 20 20
(mW)
33
TABLE X: ASIC IMPLEMENTATION RESULTS AND COMPARISONS
Technology Proposed Conventional

0.18 µm Structure Structure in [16]
Area (µm2) 3968 3715
Gate Counts 404.89 379.08
Delay 4.6 5.51
(Improvement =16 % from
conventional)
Power (mW)
Efficiency 438.25 390.57
(kbps/µm2)
(Throughput/Area)
3.4 CONCLUSION
An optimized architecture of S-box for AES encryption is proposed in this thesis. This novel
architecture is implemented both in ASIC as well as FPGA. The ASIC implementation
indicates speed improvement compared to conventional structure while maintaining area
constant. FPGA implementation shows improvement in delay and area while a significant
enhancement in terms of power compared to conventional architecture.
34
CHAPTER 4
HIGH SPEED AES

ENCRYPTION
35
4.1 INTRODUCTION
The data encryption maintains data confidentiality, integrity and authentication. AES
(Advanced Encryption Standard) is a standard for encryption and decryption of blocks of
data. It is adopted by the National Institute of Standards and Technology (NIST) and
published under the name as FIPS-197 (Federal Information Processing Standard number
197). AES is widely used for encryption of audio/video data contents, data on smart cards,
automated teller machines (ATMs), WWW servers, Network traffic, cellular phones, etc.
AES algorithm is an iterative algorithm, which requires many computation cycles. A

software platform cannot provide the high speed encryption of data, specially used for real-
time applications. Audio/video content encryption is required in real-time in business deals
via video conferencing. Therefore, dedicated hardware implementation is inevitable in such
applications. Hardware implementation can be done through different architectures trading
throughput with area and power consumption. At any time, designing best architecture for a
particular design with low area and low latency is a challenge.
We have designed the architecture of AES encryption which has low latency and low
power consumption. The design optimization has been done by replacing conventional
modules in AES architecture with a module which best suits for the area and latency
reduction. Further, we have synthesized two different design styles of AES, namely, iterative
and concurrent (pipeline) for implementation in Xilinx FPGA. Iterative architecture can be
realized with low area, but throughput is low as compared to concurrent architecture which
has higher throughput. Different implementation and hardware synthesis report have been
presented.
4.2 PROPOSED ARCHITECTURE FOR AES ENCRYPTION

ALGORITHM
The AES algorithm encrypts data in blocks of 128 bits. It uses three key sizes, 128 bits, 192
bits and 256 bits in three versions with three different Nr rounds 10, 12 and 14. But, in each
version final round key is 128 bits. 128 bits cipher key size is used in first version with 10
rounds of transformations. With initial Roundkey addition transformation, there are 9 rounds
of SubBytes, ShiftRows, MixColumns followed by AddRoundKeys transformations. The last
round (10th round) is different from the previous rounds as there is no MixColumns
transformation.
36
Figure 2-3 shows the rounds in AES of conventional architecture. The design
optimization has been done by replacing conventional modules of AES architecture with a
module which best suits for the area and high speed. Figure 4-1 shows our proposed
architecture, in which ShiftRows and AddRoundKeys are merged in MixColumns
transformation module. It means that these three transformations can be done using single
clock cycle.
The proposed architecture of S-BOX with all three modifications (which discussed in
the previous chapter) have used for SubBytes transformation in the proposed architecture of
AES encryption algorithm. Iterative architecture can be realized with low area and proposed
architecture helps to raise the speed.
Plain text
Round Key (0)

AddRoundKey
For i
=1 to SubBytes
Nr-1
round
MixColumns
Round Key (i)
Final round
SubBytes
AddRoundKeys
Round Key (Nr)
Ciphertext
Figure 4-1 the proposed architecture of AES encryption algorithm
4.3 FPGA IMPLEMENTATION OF PROPOSED ARCHITECTURE OF

AES
The conventional S-box as well as proposed S-box architectures has been implemented using
VHDL code. The design synthesis has been done in different Xilinx devices. Table XI shows
the device utilization summary of the proposed designs along with the conventional ones. It
can be seen that there is considerable area improvement in terms of FPGA slices, and speed
improvement (the critical path delay) in our proposed method as compared to conventional
architectures of S-box and AES encryption algorithm.
37
The implementation of pipeline method takes more hardware. Therefore, it is not
possible to implement the pipeline architecture in Spartan devices. The AES iterative design
in Spartan6 FPGA takes 1 clock cycle in SubBytes transformation, 6 clock cycles in combine
in AddRoundKeys, MixColumns and ShiftRows transformations for single round.
Therefore for 10 rounds, it takes 10x (6+1) +16 =86 clock cycles. The maximum
delay in proposed AES iterative design is 8.17 ns. So, it will take (86x8.17 = 702.62) ns to
complete the AES transformation of 128 bits data. Figure 4-2 shows the simulation result of
AES iterative architecture for 128 bits plaintext data.
Figure 4-2 Simulation Output of proposed AES encryption for 128 bits in Xilinx ISE
13.4
Table XI: Design Summary of FPGA Implementation of proposed AES algorithm

Virtex-II
Xilinx
Spartan6 (xc6slx16-3csg324) pro
FPGA Device
(xc2vp30)
AES
S-box AES Iterative
Pipeline
Design With With With

With With
conventional proposed proposed
conventional proposed
logic logic (% logic
logic logic
(% utilization) utilization) (% utilization)
1017 730 9942

No. Of
30 24
Slices
(44 %) (32 %) (72 %)
2741 1838 18597

No. Of
81 54
LUTs
(30 %) (20 %) (67 %)
952 788 5388

No. Of Slice
0 0
Registers
(5 %) (4 %) (19 %)
Delay 15.63 ns 15.23 ns 9.78 ns 8.17 ns 8.58 ns
Power 20 mW 20 mW 21 mW 21 mW 13.86 W
38
4.4 CONCLUSION
In this chapter, new hardware architectures for the Advanced Encryption Standard (AES)
algorithm were presented. FPGA Xilinx technology was used to synthesise the designs and
provide post placement results using Xilinx ISE 10.1 for AES pipeline architecture and
Xilinx ISE 13.4 has used for the AES iterative algorithm. The maximum throughput of the
design is 185.815 Mbits/s. Medium resolution video (640x480) of true colour depth (24 bits
per pixel) has a bit rate of 184.3 Mbits/s.
Therefore, the proposed architecture implementation in spartan6 FPGA has enough

throughputs to encrypt the video resolution mentioned above in real time. Because, the
proposed iterative design has low area, it is suitable for the implementation in small devices
like, smart cards, cellular phones, etc.
39
CHAPTER 5
INTRODUCTION OF VLSI
40
Very-large-scale integration (VLSI) is the process of creating integrated circuits by combining
thousands of transistor-based circuits into a single chip. VLSI began in the 1970s when
complex semiconductor and communication technologies were being developed. The
microprocessor is a VLSI device. The term is no longer as common as it once was, as chips
have increased in complexity into the hundreds of millions of transistors.
5.1 Overview of VLSI
The first semiconductor chips held one transistor each. Subsequent advances added
more and more transistors, and, as a consequence, more individual functions or systems were
integrated over time. The first integrated circuits held only a few devices, perhaps as many as
ten diodes, transistors, resistors and capacitors, making it possible to fabricate one or more
logic gates on a single device. Now known retrospectively as "small-scale integration" (SSI),
improvements in technique led to devices with hundreds of logic gates, known as large-scale
integration (LSI), i.e. systems with at least a thousand logic gates. Current technology has
moved far past this mark and today's microprocessors have many millions of gates and
hundreds of millions of individual transistors.
At one time, there was an effort to name and calibrate various levels of large-
scale integration above VLSI. Terms like Ultra-large-scale Integration (ULSI) were used. But
the huge number of gates and transistors available on common devices has rendered such fine
distinctions moot. Terms suggesting greater than VLSI levels of integration are no longer in
widespread use. Even VLSI is now somewhat quaint, given the common assumption that all
microprocessors are VLSI or better.
As of early 2008, billion-transistor processors are commercially available, an

example of which is Intel's Montecito Itanium chip. This is expected to become more
commonplace as semiconductor fabrication moves from the current generation of 65 nm
processes to the next 45 nm generations (while experiencing new challenges such as increased
variation across process corners). Another notable example is NVIDIA’s 280 series GPU.
This microprocessor is unique in the fact that its 1.4 Billion transistor count,
capable of a teraflop of performance, is almost entirely dedicated to logic (Itanium's transistor
count is largely due to the 24MB L3 cache). Current designs, as opposed to the earliest devices,
use extensive design automation and automated logic synthesis to lay out the transistors,
enabling higher levels of complexity in the resulting logic functionality. Certain high-
performance logic blocks like the SRAM cell, however, are still designed by hand to ensure
41
the highest efficiency (sometimes by bending or breaking established design rules to obtain the
last bit of performance by trading stability).
VLSI stands for "Very Large Scale Integration". This is the field which involves packing
more and more logic devices into smaller and smaller areas.
VLSI
 Simply we say Integrated circuit is many transistors on one chip.
 Design/manufacturing of extremely small, complex circuitry using modified

semiconductor material
 Integrated circuit (IC) may contain millions of transistors, each a few mm in size
 Applications wide ranging: most electronic logic devices
5.2 History of Scale Integration

 late 40s Transistor invented at Bell Labs
 late 50s First IC (JK-FF by Jack Kilby at TI)
 early 60s Small Scale Integration (SSI)
 10s of transistors on a chip
 late 60s Medium Scale Integration (MSI)
 100s of transistors on a chip
 early 70s Large Scale Integration (LSI)
 1000s of transistor on a chip
 early 80s VLSI 10,000s of transistors on a
 chip (later 100,000s & now 1,000,000s)
 Ultra LSI is sometimes used for 1,000,000s
 SSI - Small-Scale Integration (0-102)
 MSI - Medium-Scale Integration (102-103)
 LSI - Large-Scale Integration (103-105)
 VLSI - Very Large-Scale Integration (105-107)
 ULSI - Ultra Large-Scale Integration (>=107)
Advantages of ICs over discrete components
While we will concentrate on integrated circuits , the properties of

integrated circuits-what we can and cannot efficiently put in an integrated circuit-largely
42
determine the architecture of the entire system. Integrated circuits improve system
characteristics in several critical ways. ICs have three key advantages over digital circuits built
from discrete components:
Size. Integrated circuits are much smaller-both transistors and wires are shrunk to micrometer
sizes, compared to the millimeter or centimeter scales of discrete components. Small size leads
to advantages in speed and power consumption, since smaller components have smaller
parasitic resistances, capacitances, and inductances.
Speed. Signals can be switched between logic 0 and logic 1 much quicker within a chip than
they can between chips. Communication within a chip can occur hundreds of times faster than
communication between chips on a printed circuit board. The high speed of circuits on-chip is
due to their small size-smaller components and wires have smaller parasitic capacitances to
slow down the signal.
Power consumption. Logic operations within a chip also take much less power. Once again,
lower power consumption is largely due to the small size of circuits on the chip-smaller
parasitic capacitances and resistances require less power to drive them.
5.3 VLSI and systems

These advantages of integrated circuits translate into advantages at the system level:
Smaller physical size. Smallness is often an advantage in itself-consider portable televisions or

handheld cellular telephones.
Lower power consumption. Replacing a handful of standard parts with a single chip reduces
total power consumption. Reducing power consumption has a ripple effect on the rest of the
system: a smaller, cheaper power supply can be used; since less power consumption means less
heat, a fan may no longer be necessary; a simpler cabinet with less shielding for
electromagnetic shielding may be feasible, too.
Reduced cost. Reducing the number of components, the power supply requirements, cabinet
costs, and so on, will inevitably reduce system cost. The ripple effect of integration is such that
the cost of a system built from custom ICs can be less, even though the individual ICs cost
more than the standard parts they replace.
Understanding why integrated circuit technology has such profound influence on the design of
digital systems requires understanding both the technology of IC manufacturing and the
economics of ICs and digital systems.
43
5.4 Applications
 Electronic system in cars.
 Digital electronics control VCRs
 Transaction processing system, ATM
 Personal computers and Workstations
 Medical electronic systems.
 Etc….
Applications of VLSI
Electronic systems now perform a wide variety of tasks in daily life. Electronic
systems in some cases have replaced mechanisms that operated mechanically, hydraulically,
or by other means; electronics are usually smaller, more flexible, and easier to service. In other
cases electronic systems have created totally new applications. Electronic systems perform a
variety of tasks, some of them visible, some more hidden:
 Personal entertainment systems such as portable MP3 players and DVD players perform
sophisticated algorithms with remarkably little energy.
 Electronic systems in cars operate stereo systems and displays; they also control fuel
injection systems, adjust suspensions to varying terrain, and perform the control functions
required for anti-lock braking (ABS) systems.
 Digital electronics compress and decompress video, even at high-definition data rates,
on-the-fly in consumer electronics.Low-cost terminals for Web browsing still require
sophisticated electronics, despite their dedicated function.
 Personal computers and workstations provide word-processing, financial analysis, and

games. Computers include both central processing units (CPUs) and special-purpose
hardware for disk access, faster screen display, etc.
 Medical electronic systems measure bodily functions and perform complex processing
algorithms to warn about unusual conditions. The availability of these complex systems,
far from overwhelming consumers, only creates demand for even more complex systems.
The growing sophistication of applications continually pushes the design and manufacturing
of integrated circuits and electronic systems to new levels of complexity. And perhaps the most
amazing characteristic of this collection of systems is its variety-as systems become more
complex, we build not a few general-purpose computers but an ever wider range of special-
44
purpose systems. Our ability to do so is a testament to our growing mastery of both integrated
circuit manufacturing and design, but the increasing demands of customers continue to test the
limits of design and manufacturing
5.5 VERILOG HDL

Verilog HDL is a hardware description language that can be used to model a digital
system at many levels of abstraction ranging from the algorithmic-level to the gate-level to the
switch-level. The complexity of the digital system being modeled could vary from that of a
simple gate to a complete electronic digital system, or anything in between. The digital system
can be described hierarchically and timing can be explicitly modeled within the same
description.
The Verilog HDL language includes capabilities to describe the behavior-al nature of a
design, the dataflow nature of a design, a design's structural composition, delays and a
waveform generation mechanism including aspects of response monitoring and verification,
all modeled using one single language. In addition, the language provides a programming
language interface through which the internals of a design can be accessed during simulation
including the control of a simulation run.
The language not only defines the syntax but also defines very clear simulation
semantics for each language construct. Therefore, models written in this language can be
verified using a Verilog simulator. The language inherits many of its operator symbols and
constructs from the C programming language. Verilog HDL provides an extensive range of
modeling capabilities, some of which are quite difficult to comprehend initially. However, a
core subset of the language is quite easy to leam and use. This is sufficient to model most
applications.
History:
The verilog HDL language was first developed by Gateway Design Automation in 1983 as
hardware are modleling language for their simulator product, At that time ,twas a propnetary
language. Because of the popularity of the,simulator product, Verilog HDL gained acceptance
as a usable and practical language by a number of designers. In an effort to increase the
popularity of the language, the language was placed in the public domain in 1990. Open verilog
International (OVI) was formed to promote Verilog. In 1992 OVI decided to pursue
45
standardization of verilog HDL as an IEEE standard. This effort was succeful and the language
became an IEEE standard in 1995. The complete standard is described in the verilog hardware
description language reference manual. The standard is called std 1364-1995.
5.6 Major Capabilities:
Listed below are the majort capabilities of the verilog hardware description:
 Primitive logic gates, such as and, or and nand, are built-in into the language.
 Flexibility of creating a user-defined primitive (UDP). Such a primitive could either be
a combinational logic primitive or a sequential logic primitive.
 Switch-level modeling primitive gates, such as pmos and nmos, are also built-in into
the language.
 Explicit language constructs are provided for specifying pin-to-pin delays, path delays
and timing checks of a design.
 A design can be modeled in three different styles or in a mixed style. These styles are:
behavioral style - modeled using procedur-al constructs; dataflow style - modeled using
continuous assign-ments; and structural style - modeled using gate and module
instantiations.
 There are two data types in Verilog HDL; the net data type and the register data type.
The net type represents a physical connection between structural elements while a
register type represents an abstract data storage element.
 Figure. the mixed-level modeling capability of Verilog HDL, that is, in one design,
each module may be modeled at a different level.
 Verilog HDL also has built-in logic functions such as & (bitwise-and) and I (bitwise-
or).
 High-level programming language constructs such as condition- als, case statements,

and loops are available in the language.
 Notion of concurrency and time can be explicitly modeled.
 Powerful file read and write capabilities fare provided.
 The language is non-deterministic under certain situations, that is, a model may produce
different results on different simulators; for example, the ordering of events on an event
queue is not defined by the standard.
46
5.7 SYNTHESIS:
Synthesis is the process of constructing a gate level netlist from a register-transfer level
model of a circuit described in Verilog HDL. Figure.2-2 shows such a process. A synthesis
system may as an intermediate step, generate a netlist that is comprised of register-transfer
level blocks such as flip-flops, arithmetic-logic-units, and multiplexers, interconnected by
wires. In such a case, a second program called the RTL module builder is necessary. The
purpose of this builder is to build, or acquire from a library of predefined components, each of
the required RTL blocks in the user-specified target technology.
Having produced a gate level netlist, a logic optimizer reads in the netlist and optimizes
the circuit for the user-specified area and timing constraints. These area and timing constraints
may also be used by the module builder for appropriate selection or generation of RTL blocks.
In this book, we assume that the target netlist is at the gate level.. The module building and
logic optimization phases are not described in this book.
47
CHAPTER 6
CONCLUSION AND
FUTURE SCOPE
48
6.1 CONCLUSION
We have proposed optimized VLSI architecture of S-box for AES algorithm. The architecture
of s-box in composite field has been modified in order to have high speed and low areas. Using
the proposed s-box, AES architecture has been implemented using the merging technique in
FPGA. The proposed AES architecture has delayed improvement of approx. 1.6 ns along with
area improvement of 287 FPGA slices when implemented in the Spartan-6 FPGA of Xilinx.
The full custom design of the s-box has been done in 180 nm technology in Cadence using
novel XOR gate which has high speed and low power consumption as compared to existing
one. The designed s-box chip consumes 22.6 µW and has 8.2 ns delay after post layout
simulation.
49
6.2 FUTURE WORK
 Full custom design of AES.

 Tape out of full custom AES.
 Video encryption in real-time using proposed design implemented in FPGA
50
REFERENCES
[1] B.A. Forouzan and D. Mukhopadhyay, Cryptography and Network Security, 2nd
Ed.,Tata McGraw Hill, New Delhi, 2012.
[2] M. I. Soliman, G. Y. Abozaid, “FPGA implementation and performance

evaluation of a high throughput crypto coprocessor,” Journal of Parallel and
Distributed Computing, Vol. 71 (8), pp.1075-1084, Aug. 2011.
[3] V. K. Pachghare, Cryptography and information security, E. E. Ed., PHI

Learning, New Delhi, 2009.
[4] M. Mozaffari-Kermani and A. Reyhani-Masoleh, “A Lightweight High-

Performance Fault Detection Scheme for the Advanced Encryption Standard
Using Composite Fields,” IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, Vol. 19 (1), pp. 85-91, Jan. 2011.
[5] Federal Information Processing Standards Publication 197 (FIPS 197), available
online, http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf.
[6] X. Zhang, K. K. Parhi, “High-Speed VLSI Architectures for the AES

Algorithm,” IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, Vol. 12 (9), pp. 957-967, Sep. 2004.
[7] M. Jridi and A. AlFalou, “A VLSI implementation of a new simultaneous images

compression and encryption method,” 2010 IEEE International Conference on
Imaging Systems and Techniques (IST), pp.75-79, July 2010.
[8] Chih-Pin Su, Tsung-Fu Lin, Chih-Tsun Huang, and Cheng-Wen Wu, “A High-
Throughput Low-Cost AES Processor,” IEEE Communications Magazine,
Vol.41 (12), pp.86-91, Dec. 2003.
[9] L. Ali, I. Aris, F. S. Hossain and N. Roy, “Design of an ultra-high speed AES
processor for next generation IT security,” Computers and Electrical
Engineering, Vol.37 (6), pp.1160-1170, Nov. 2011.
51

Document

Uploaded by

Copyright:

Available Formats

Document

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Document

Uploaded by

Copyright:

Available Formats

CONTENTS

Name of Contents Page No.

CHAPTER 2: ADVANCED ENCRYPTION STANDARD (AES) ALGORITM

CHAPTER 3: IMPLEMENTATION OF PROPOSED ARCHITECTURE FOR S-

CHAPTER 4: HIGH SPEED AES ENCRYPTION

CHAPTER 5: RESULTS AND DISCUSSION

5.2 History of Scale Integration 42

5.3 VLSI And Systems 43

5.5 Verilog HDL 45

5.6 Major Capabilities 46

CHAPTER 6: CONCLUSION AND FUTURE SCOPE

Figure No. Figure name Page No.

bits in Xilinx ISE 13.4

Table No. Table name Page No.

1.2 LITERATURE REVIEW

1.2.1 High Speed AES Design

3. Full custom design of S-BOX for AES Encryption algorithm.

4. Implementation of high speed architecture of AES algorithm.

This thesis is organized as follows:

2.1 DEFINITION AND HISTORY OF CRYPTOGRAPHY

Figure 2-1 Basic step of Encryption in cryptography

Figure 2-2 Symmetric Key Cryptography.

2.2 GALOIS FIELD

2.3 THE DATA ENCRYPTION STANDERED (DES)

2.4 THE ADVANCED ENCRYPTION STANDARD (AES)

Cipher Key size No. Of Rounds (Nr) Round Key size

128 bits 10 128 bits

192 bits 12 128 bits

256 bits 14 128 bits

Round Key (0)

Round Key (i)

Round Key (Nr)

Figure 2-3 Rounds of AES Encryption algorithm.

Figure 2-4 SubBytes Transformation [5].

Inverse SubBytes transformation is inverse of SubBytes transformation. It can find in the

Table III: Inverse SubBytes Transformation Table [16]

2.4.2 SHIFTROWS/INVERSE SHIFTROWS TRANSFORMATION

Figure 2-5 ShiftRows Transformation for AES Encryption.

2.4.3 MIXCOLUMNS/INVERSE MIXCOLUMNS TRANFORMATION

A(x) = {03} x3 + {01} x2 + {01} x + {02}. (2.2)

2.4.4 ADDROUNDKEY TRANSFORMATION KEY EXPANSION

2.4.4.1 AddRoundKey Transformation

Figure 2-7 AddRoundKey transformation of AES Encryption.

2.4.4.2 Key Expansion

The SubBytes transformation, done through S-BOX mapping is computationally inefficient

Therefore, a more feasible solution is to implement an S - box is by using composite

To find the S-BOX transformation first multiplicative inverse of GF (2 8) then affine

δ Multiplication operation in GF (24)

×λ Multiplication with constant, λ in GF (24)

x-1 Multiplicative inversion in GF (24) AT Affine transformation

Figure 2-10 Meaning of symbols used in Figure 2-9.

2.5.2 Squaring operation in GF (24)

Figure 2-11 Logical hardware diagram of squarer for GF (24)

2.5.3 Multiplication with constant, λ

Figure 2-12 Logical hardware diagrams for multiplication with constant, λ

Figure 2-13 Logical hardware implementation of GF (24) multiplier

2.5.4.1 Multiplication with constant, φ

Figure 2-14 Hardware implementation of multiplication with φ