
Journal of Circuits, Systems, and Computers

Vol. 26, No. 9 (2017) 1750141 (19 pages)


© World Scientific Publishing Company
DOI: 10.1142/S0218126617501419

Hardware Implementation of AES Algorithm with Logic S-box*

Soufiane Oukili† and Seddik Bri‡


Materials and Instrumentation (MIN),
Department of Electrical Engineering,
High School of Technology, Moulay Ismail University,
Km5, Rue d'Agouray, P1, Meknes 50040, Morocco

J CIRCUIT SYST COMP Downloaded from www.worldscientific.com by FUDAN UNIVERSITY on 02/21/17. For personal use only.



[email protected]
[email protected]

Received 23 May 2016


Accepted 10 January 2017
Published 17 February 2017

Cryptography plays an important role in protecting data against known attacks and in
limiting the risk of information theft, especially given the rapid growth of communication
techniques. In recent years, the need to implement cryptographic algorithms in fast-growing
high-speed network applications has increased. In this paper, we present high-throughput,
efficient hardware implementations of the Advanced Encryption Standard (AES)
cryptographic algorithm. We adopt pipelining to increase the speed and the maximum
operating frequency, inserting registers at optimal placements. Furthermore, we propose a
5-stage pipelined S-box design using combinational logic to reach further speed. An efficient
key expansion architecture suited to the proposed design is also presented. To secure the
hardware implementation against side-channel attacks, a masked S-box is introduced. The
designs were implemented on a Virtex-6 (xc6vlx240t) Field-Programmable Gate Array (FPGA)
device using Xilinx ISE 14.7. The proposed unmasked and masked architectures achieve
throughputs of 93.73 Gbps and 58.57 Gbps, respectively. These results are competitive with
the implementations reported in the literature.

Keywords: AES; high throughput; efficient; pipeline; masked S-box; FPGA.

1. Introduction
The astounding growth of the internet and computer systems in the last century
has meant that the need for effective security and reliability of data communication,
processing and storage is greater than ever. In this context, cryptographic

*This paper was recommended by Regional Editor Emre Salman.


† Corresponding author.


development has been a high-priority and challenging research area in both mathematics
and engineering. The Advanced Encryption Standard (AES), also known as Rijndael, is a
symmetric-key cryptographic algorithm developed by Dr. Joan Daemen and Dr. Vincent
Rijmen [1]. In 2001, it was adopted as a Federal Information Processing Standard by the
National Institute of Standards and Technology [2]. The algorithm processes input and
output data blocks of 128 bits; the key can be 128, 192 or 256 bits long.
There are software and hardware approaches to implementing the AES algorithm.
Compared with software, a hardware implementation provides greater physical security
and higher speed [3]. Because of the increasing demand for high-speed, high-volume secure
communications combined with physical security, hardware implementation becomes
essential. Low power, high throughput and compactness have always been of interest in
hardware design and implementation. In this paper, the main goal of our hardware
implementation is a high-throughput design that uses as little hardware as possible. Besides
executing a cipher algorithm, a cryptographic device must also physically protect the secret
data it manipulates during execution. Traditionally, cryptanalysis has targeted algorithms
but, in the last few years, the hardware implementation has come to be considered a
fundamental part of the security evaluation of a cryptographic design. New challenges in
this field are the so-called side-channel attacks [4,5], which exploit information leaked by
the cryptographic device through physical phenomena such as power consumption,
electromagnetic radiation and execution time. These attacks monitor a physical quantity
and apply statistical analysis to extract confidential information from extremely noisy
signals. Numerous countermeasures have been proposed against such side-channel attacks.
Masking is one of the most widely used, with the advantages of low cost and easy
implementation [6–10].
AES implementations in the literature can be grouped into two categories. In the first,
loop unrolling and iterative techniques are used to increase the throughput, increase the
throughput-to-area ratio and decrease the area cost; see Refs. 11, 14, 16, 17 and 22 for more
details. The second category comprises designs that use pipelining and sub-pipelining to
increase the operating frequency and the throughput; see Refs. 12–16 and 18–28.
In this paper, we present an efficient high-throughput hardware design and
implementation of 128-bit key AES. Pipelining is introduced to increase the speed and the
maximum operating frequency. The pipelining strategy consists in overlapping data input
and output with processing: the algorithm is divided into stages separated by registers.
Increasing the number of stages shortens the critical path and therefore increases the
throughput. The S-box substitution is at the core of any AES implementation and is the
only complex step in each round of the encryption algorithm. It is implemented here using
combinational logic to avoid the unbreakable delay of LUTs and to allow further increases
in processing speed. Moreover, an efficient key expansion architecture suited to the
pipelined round units is presented. The proposed design is implemented on Xilinx Virtex-6
FPGA technology. FPGAs offer the advantage of hardware speed together with software-like
flexibility and programmability. To secure the hardware implementation, a masked S-box
based on masked XOR and AND gates is introduced.
This paper is structured as follows. Section 2 presents a brief background of the AES
algorithm. Section 3 reviews the relevant works reported in the literature. Our proposed
unmasked and masked AES implementations are presented in Secs. 4 and 5, respectively.
Section 6 provides results and a comparison between our implementations and other
reported implementations. Finally, the conclusion is given in Sec. 7.

2. AES Algorithm

The AES algorithm is a symmetric block cipher, in which the sender and the receiver
use the same key for encryption and decryption. The data block length is fixed at 128 bits
and the key length can be 128, 192 or 256 bits. AES is an iterative algorithm. Each iteration
is called a round, and the total number of rounds depends on the key length. The output of
each round serves as the input of the next round. Each round takes a 128-bit data input and
a 128-bit round key derived from the 128/192/256-bit cipher key.
The key length is represented by $N_k$ = 4, 6 or 8, the number of 32-bit words
(columns) in the cipher key. The number of rounds performed during the execution of the
algorithm depends on the key size and is represented by $N_r$, where $N_r = 10$ when
$N_k = 4$, $N_r = 12$ when $N_k = 6$ and $N_r = 14$ when $N_k = 8$. Because the same
secret key is used for both encryption and decryption, the design is kept simple. For this
work, a 128-bit key is chosen, which requires 10 rounds of encryption.
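The three cases listed above all satisfy $N_r = N_k + 6$. A minimal Python sketch of that relation follows; the function name `aes_rounds` is introduced here for illustration only and does not come from the paper.

```python
def aes_rounds(key_bits: int) -> int:
    """Return the number of AES rounds Nr for a 128-, 192- or 256-bit key."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES keys are 128, 192 or 256 bits long")
    nk = key_bits // 32   # Nk: number of 32-bit words in the cipher key
    return nk + 6         # Nr = 10, 12, 14 for Nk = 4, 6, 8


for bits in (128, 192, 256):
    print(bits, "bit key ->", aes_rounds(bits), "rounds")
```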
The 128-bit data block is arranged in a 4 × 4 array of bytes called the State, with
four rows and four columns, 16 bytes in total. Each round is composed of four
byte-oriented transformations: SubByte, ShiftRow, MixColumn and AddRoundKey, except
for the last round, in which MixColumn is not performed. In addition, an initial round at
the start consists only of the AddRoundKey transformation [1,2]. Figure 1 shows the
128-bit key AES algorithm.

Fig. 1. 128-bit key AES algorithm.


• SubByte: operates on each byte of the State independently. Each byte is substituted
  by the corresponding byte in the substitution box (S-box). The S-box is one of the basic
  components of any symmetric-key algorithm and provides the property of confusion,
  which makes it harder to find the key from known ciphertext. In general, an S-box takes
  M input bits and transforms them into N output bits. AES uses a fixed S-box, constructed
  by taking the multiplicative inverse over $GF(2^8)$ and combining it with an invertible
  affine transformation. The resulting nonlinearity makes the cipher resistant to
  cryptanalysis.
• ShiftRow: circularly shifts each row of the State matrix to the left by its row index.
• MixColumn: each column of the State is treated as a polynomial over $GF(2^8)$ and
  multiplied by a fixed polynomial.
• AddRoundKey: takes a unique round key from the key expansion component and
  performs a bit-by-bit XOR with the State matrix.
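The three linear transformations in the list above can be sketched in a few lines of Python. This is an illustrative software model, not the authors' VHDL: it assumes the State is stored as 16 bytes in column-major order (index 4·column + row) and uses the standard AES reduction polynomial $x^8 + x^4 + x^3 + x + 1$. The function names are ours.

```python
def xtime(a: int) -> int:
    """Multiply a byte by x (i.e. by {02}) in GF(2^8) mod x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a


def shift_rows(state: list[int]) -> list[int]:
    """Rotate row r of the column-major 4x4 State left by r positions."""
    return [state[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4)]


def mix_column(col: list[int]) -> list[int]:
    """Multiply one State column by the fixed polynomial {03}x^3 + {01}x^2 + {01}x + {02}."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


def add_round_key(state: list[int], round_key: list[int]) -> list[int]:
    """Bit-by-bit XOR of the State with the round key."""
    return [s ^ k for s, k in zip(state, round_key)]


print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

SubByte is the only nonlinear step, which is why the rest of the paper concentrates on the S-box.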

The decryption structure can be derived directly by inverting the encryption structure;
its rounds require four inverse operations: InvSubByte, InvShiftRow, InvMixColumn
and AddRoundKey.
The AES algorithm takes the original cipher key and performs a key expansion
routine to generate the round keys. For a 128-bit key, it generates a total of 11 round keys
(KeyRound) of 16 bytes each, to be used in the successive rounds of AES, the first KeyRound
being the initial key. Key expansion is also an iterative algorithm with the same number of
rounds as AES encryption, and the output of each round is the input of the next one. In
each round, the first four bytes of the input KeyRound constitute the word w0, the next
four bytes the word w1, and so on. The bytes of the final word are rotated left by one
position, and each byte then passes through the substitution transformation SubWord
(S-box). The result is XORed with a round constant RCon(i). Finally, the columns are added
together (XOR) to generate a new 128-bit round key. Figure 2 shows one round of the key
expansion module. The key expansion is designed to be resistant to known cryptanalytic
attacks. The inclusion

Fig. 2. Round i of key expansion module.


of a round-dependent round constant eliminates the symmetry, or similarity, between the
ways in which round keys are generated in different rounds.
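The key expansion round described above can be sketched in Python as follows. This is a software reference model, not the pipelined hardware module of the paper: the S-box is computed by brute-force inversion in $GF(2^8)$ followed by the affine transformation, and the helper names (`gf_mul`, `sbox`, `expand_key`) are ours. The example key is the AES-128 test key from FIPS-197, Appendix A.1.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def sbox(a: int) -> int:
    """Reference S-box: multiplicative inverse followed by the affine transformation."""
    inv = next(x for x in range(1, 256) if gf_mul(a, x) == 1) if a else 0
    out = 0
    for i in range(8):
        bit = ((inv >> i) ^ (inv >> ((i + 4) % 8)) ^ (inv >> ((i + 5) % 8))
               ^ (inv >> ((i + 6) % 8)) ^ (inv >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        out |= bit << i
    return out


SBOX = [sbox(a) for a in range(256)]


def expand_key(key: bytes) -> list[bytes]:
    """Return the 11 round keys of AES-128; round key 0 is the cipher key itself."""
    if len(key) != 16:
        raise ValueError("this sketch handles 128-bit keys only")
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        t = list(w[i - 1])
        if i % 4 == 0:
            t = t[1:] + t[:1]              # rotate the last word left by one byte
            t = [SBOX[b] for b in t]       # SubWord
            t[0] ^= rcon                   # XOR with the round constant RCon(i)
            rcon = gf_mul(rcon, 0x02)
        w.append([a ^ b for a, b in zip(w[i - 4], t)])
    return [bytes(b for word in w[4 * r:4 * r + 4] for b in word) for r in range(11)]


keys = expand_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(keys[1].hex())
```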

3. Related Works
Many hardware implementations of the AES algorithm have been presented in the
literature. They aim to improve throughput and efficiency while using as little area as
possible. In this section, we review a few of them. The authors of Ref. 11 addressed the
design, hardware implementation and performance testing of the AES algorithm. They
developed an optimized code for the Rijndael algorithm with 128-bit keys, carefully trading
off area and throughput to suit wireless military communication and mobile telephony,
where both speed and implementation area matter. In Ref. 12, reconfigurable hardware was
proposed to improve the performance and flexibility of Rijndael implementations; to
maximize the speed and efficiency of the cipher process, a pipelined architecture of the AES
module was proposed. A new methodology was presented in Ref. 13, which employed
dynamic and partial reconfiguration with parallelism and pipelining to implement AES;
VHDL was used for the AES elements and Handel-C for the communication system. The
authors of Ref. 14 proposed a high-throughput AES architecture based on a
Content-Addressable Memory (CAM)-based SubBytes scheme. With proper pipelining, the
CAM-based SubBytes needs a smaller gate delay than the ROM-based design. Two
implementations of the AES algorithm were introduced in Ref. 15: the first is based on the
basic architecture of AES and the second on a sub-pipelined architecture. In Ref. 16,
pipelining was adopted to increase the speed of the AES algorithm by processing multiple
rounds simultaneously, and software parallelization with the OpenMP standard was used to
speed up the algorithm relative to its sequential version. The AES implementation in
Ref. 18 mainly targeted low-cost embedded applications and introduced parallel operation
in a folded architecture to obtain better throughput. A new efficient architecture for a
high-speed AES encryptor was presented in Ref. 19. It implements byte substitution with
composite field arithmetic and achieves higher efficiency by merging and rearranging the
operations required in the encryption steps. In addition, a multistage sub-pipelined
architecture is used to improve throughput and area efficiency. In Ref. 20, the authors
proposed mapping the AES operations from the Galois field $GF(2^8)$ to $GF(2^4)$ to obtain
an area-efficient masked AES implementation. Composite field arithmetic in normal bases
was employed in Ref. 21 to propose a low-cost S-box for AES, together with an efficient key
expansion architecture suited to six sub-pipelined round units. The authors of Ref. 22
presented the design, implementation and comparison of highly efficient architectures for
AES on FPGAs: an iterative


architecture and a pipelined architecture. The first design was optimized for area and
the second for speed. A fully pipelined crypto processor was presented in Ref. 23, where
AES was integrated with a 32-bit general-purpose 5-stage pipelined MIPS processor. A
Parallel Sub-Pipelined (PSP) architecture was proposed in Ref. 24 to obtain high
throughput, and was compared with loop-unrolled, pipelined, sub-pipelined, parallel and
parallel pipelined architectures in terms of throughput. An extension of a general-purpose
processor with a crypto coprocessor for encryption and decryption of information was
described in Ref. 25. In Ref. 26, the authors presented a high-throughput digital design of
the 128-bit AES algorithm on FPGA based on the 2-slow retiming technique. The authors of
Ref. 27 presented an equivalent pipelined AES architecture working in CTR mode, which
provides high throughput by inserting registers at appropriate points so as to minimize the
delay while implementing the byte transformation in one clock period. In Ref. 28, three
high-throughput AES implementations in ECB mode and one ultra-high-throughput AES
implementation in CTR mode were presented. They used an area-delay-efficient multiplier
and multiplicative inverter over $GF(2^8)$. Moreover, loop unrolling, full pipelining and full
sub-pipelining were used, with the registers placed at optimal positions.

4. Proposed AES Architecture


The efficient AES architecture proposed in this paper aims to increase the throughput
while using as few hardware resources as possible. Pipelining is therefore adopted: the
algorithm is divided into stages separated by registers, which increases the throughput.
As mentioned before, AES is an iterative algorithm, so we insert registers after each
round, as shown in Fig. 3. Moreover, to shorten the critical path, registers are also
introduced between the operations within each round, as shown in Fig. 4. This further
increases the throughput of the algorithm, at the cost of increased latency. The number of
pipeline stages and the placement of the pipeline registers are the two main factors in
achieving an area-throughput-efficient design. Note that higher throughput requires more
area, since registers are needed to store intermediate results.
Designing a high-speed, low-area S-box is one of the most critical problems in the
design process: it is the only nonlinear transformation among the four AES
transformations, and it is the key to improving the throughput of AES and reducing the
resources used.

4.1. Pipelined S-box using combinational logic


The S-box substitution is at the core of any AES implementation. It is the only
complex step in each round of the encryption algorithm. Given the high delay and


Fig. 3. Pipelining general block of AES.

gate count of this transformation, several methods have been presented in the literature
to deal with the critical path of the S-box and the memory it occupies: an S-box using
composite field arithmetic, direct mapping from LUTs, or combinational logic only. An
S-box implemented with LUTs requires a large number of gates and suffers from an
inherent, unbreakable delay. This delay is longer than the total delay of the remaining
transformations in each round unit and prevents the round from being divided into more
than two sub-stages for further speedup. Composite field arithmetic can reduce the design
area, but it increases the critical path delay. An implementation using combinational logic
occupies a small area and can also be pipelined to achieve further speedup. Since our S-box
implementation aims to increase the throughput while using as few hardware resources as
possible, the combinational logic technique is adopted.
As stated above, the S-box transformation is computed by taking the multiplicative
inverse in $GF(2^8)$ followed by an affine transformation. The affine transformation can
be represented in matrix form as shown below [2], where the bits $a_7, a_6, \ldots, a_0$
represent the input byte.


Fig. 4. Pipelining round of AES.

\[
AT(a) =
\begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 1
\end{pmatrix}
\times
\begin{pmatrix} a_7 \\ a_6 \\ a_5 \\ a_4 \\ a_3 \\ a_2 \\ a_1 \\ a_0 \end{pmatrix}
\oplus
\begin{pmatrix} 0 \\ 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \end{pmatrix}.
\tag{1}
\]

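The affine transformation can be translated into logical XOR operations. The logical
form of the matrix is shown below:
\[
AT(a) =
\begin{pmatrix}
a_7 \oplus a_6 \oplus a_5 \oplus a_4 \oplus a_3 \oplus 0 \\
a_6 \oplus a_5 \oplus a_4 \oplus a_3 \oplus a_2 \oplus 1 \\
a_5 \oplus a_4 \oplus a_3 \oplus a_2 \oplus a_1 \oplus 1 \\
a_4 \oplus a_3 \oplus a_2 \oplus a_1 \oplus a_0 \oplus 0 \\
a_7 \oplus a_3 \oplus a_2 \oplus a_1 \oplus a_0 \oplus 0 \\
a_7 \oplus a_6 \oplus a_2 \oplus a_1 \oplus a_0 \oplus 0 \\
a_7 \oplus a_6 \oplus a_5 \oplus a_1 \oplus a_0 \oplus 1 \\
a_7 \oplus a_6 \oplus a_5 \oplus a_4 \oplus a_0 \oplus 1
\end{pmatrix}.
\tag{2}
\]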
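The XOR form of the affine transformation maps directly onto code. The Python sketch below evaluates the eight rows bit by bit, with $a_7$ as the most significant bit; the function name `affine` is ours. Applied to the multiplicative inverse of an input byte it yields the S-box output: for example, the inverse of {53} in $GF(2^8)$ is {CA}, and the S-box value of {53} is {ED}.

```python
def affine(a: int) -> int:
    """Affine transformation AT(a) in XOR form; a7 is the most significant bit."""
    a7, a6, a5, a4, a3, a2, a1, a0 = [(a >> i) & 1 for i in range(7, -1, -1)]
    rows = [
        a7 ^ a6 ^ a5 ^ a4 ^ a3 ^ 0,
        a6 ^ a5 ^ a4 ^ a3 ^ a2 ^ 1,
        a5 ^ a4 ^ a3 ^ a2 ^ a1 ^ 1,
        a4 ^ a3 ^ a2 ^ a1 ^ a0 ^ 0,
        a7 ^ a3 ^ a2 ^ a1 ^ a0 ^ 0,
        a7 ^ a6 ^ a2 ^ a1 ^ a0 ^ 0,
        a7 ^ a6 ^ a5 ^ a1 ^ a0 ^ 1,
        a7 ^ a6 ^ a5 ^ a4 ^ a0 ^ 1,
    ]
    out = 0
    for bit in rows:          # pack the rows back into a byte, MSB first
        out = (out << 1) | bit
    return out


print(hex(affine(0xCA)))
```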


The individual bits of a byte representing a $GF(2^8)$ element can be viewed as the
coefficients of the power terms of the $GF(2^8)$ polynomial. For instance, $\{10001011\}_2$
represents the polynomial $q^7 + q^3 + q + 1$ in $GF(2^8)$. From Ref. 29, any arbitrary
polynomial can be represented as $bx + c$, given an irreducible polynomial
$x^2 + Ax + B$. Thus, an element of $GF(2^8)$ may be represented as $bx + c$, where $b$ is
the most significant nibble and $c$ the least significant nibble. The multiplicative inverse
can then be computed using the equation below:
\[
(bx + c)^{-1} = b\,(b^2 B + bcA + c^2)^{-1}\,x + (c + bA)\,(b^2 B + bcA + c^2)^{-1}. \tag{3}
\]
In Ref. 30, the irreducible polynomial selected is $x^2 + x + \lambda$. Since $A = 1$ and
$B = \lambda$, and since $bc + c^2 = c(b + c)$, the equation simplifies to the form shown below:
\[
(bx + c)^{-1} = b\,(b^2 \lambda + c(b + c))^{-1}\,x + (c + b)\,(b^2 \lambda + c(b + c))^{-1}. \tag{4}
\]
This equation involves multiplication, addition, squaring and multiplicative inversion
in the Galois field. Each of these operators can be turned into an individual block when
constructing the circuit that computes the multiplicative inverse. From this simplified
equation, the S-box transformation can be built as shown in Fig. 5.
Computation of the multiplicative inverse in composite fields cannot be directly
applied to an element of $GF(2^8)$. The element first has to be mapped to its composite
field representation via an isomorphic function, $\delta$. Likewise, after the multiplicative
inversion, the result has to be mapped back from its composite field representation to its
equivalent in $GF(2^8)$ via the inverse
Fig. 5. Conventional S-box architecture using combinational logic.


isomorphic function, $\delta^{-1}$. Both $\delta$ and $\delta^{-1}$ can be represented as
8 × 8 matrices. Let $a$ be an element of $GF(2^8)$; then the isomorphic mapping and its
inverse can be written as $\delta \times a$ and $\delta^{-1} \times a$, which are matrix
multiplications as shown below, where $a_7$ is the most significant bit and $a_0$ the least
significant one [30].
\[
\delta \times a =
\begin{pmatrix}
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 1 & 1 & 1 & 1 & 0 \\
1 & 0 & 1 & 0 & 1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 1 & 1 & 0 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 0 \\
1 & 0 & 0 & 1 & 1 & 1 & 1 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 1 & 1
\end{pmatrix}
\times
\begin{pmatrix} a_7 \\ a_6 \\ a_5 \\ a_4 \\ a_3 \\ a_2 \\ a_1 \\ a_0 \end{pmatrix},
\tag{5}
\]
\[
\delta^{-1} \times a =
\begin{pmatrix}
1 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 1 & 1 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
1 & 0 & 0 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 0 & 1 & 0 & 1
\end{pmatrix}
\times
\begin{pmatrix} a_7 \\ a_6 \\ a_5 \\ a_4 \\ a_3 \\ a_2 \\ a_1 \\ a_0 \end{pmatrix}.
\tag{6}
\]

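The matrix multiplication can be translated into logical XOR operations. The logical
form of the matrices above is shown below:
\[
\delta \times a =
\begin{pmatrix}
a_7 \oplus a_5 \\
a_7 \oplus a_6 \oplus a_4 \oplus a_3 \oplus a_2 \oplus a_1 \\
a_7 \oplus a_5 \oplus a_3 \oplus a_2 \\
a_7 \oplus a_5 \oplus a_3 \oplus a_2 \oplus a_1 \\
a_7 \oplus a_6 \oplus a_2 \oplus a_1 \\
a_7 \oplus a_4 \oplus a_3 \oplus a_2 \oplus a_1 \\
a_6 \oplus a_4 \oplus a_1 \\
a_6 \oplus a_1 \oplus a_0
\end{pmatrix},
\tag{7}
\]
\[
\delta^{-1} \times a =
\begin{pmatrix}
a_7 \oplus a_6 \oplus a_5 \oplus a_1 \\
a_6 \oplus a_2 \\
a_6 \oplus a_5 \oplus a_1 \\
a_6 \oplus a_5 \oplus a_4 \oplus a_2 \oplus a_1 \\
a_5 \oplus a_4 \oplus a_3 \oplus a_2 \oplus a_1 \\
a_7 \oplus a_4 \oplus a_3 \oplus a_2 \oplus a_1 \\
a_5 \oplus a_4 \\
a_6 \oplus a_5 \oplus a_4 \oplus a_2 \oplus a_0
\end{pmatrix}.
\tag{8}
\]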
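As a quick consistency check, the two XOR networks above can be modelled in Python and composed. The sketch below is illustrative only; the names `delta` and `delta_inv` and the helpers `_bits` and `_pack` are ours. It confirms that mapping a byte into the composite field and back returns the original byte.

```python
def _bits(a: int) -> list[int]:
    """Split a byte into its bits [a7, a6, ..., a0]."""
    return [(a >> i) & 1 for i in range(7, -1, -1)]


def _pack(bits: list[int]) -> int:
    """Pack bits given MSB first into an integer."""
    out = 0
    for b in bits:
        out = (out << 1) | b
    return out


def delta(a: int) -> int:
    """Isomorphic mapping from GF(2^8) to the composite field (XOR form)."""
    a7, a6, a5, a4, a3, a2, a1, a0 = _bits(a)
    return _pack([
        a7 ^ a5,
        a7 ^ a6 ^ a4 ^ a3 ^ a2 ^ a1,
        a7 ^ a5 ^ a3 ^ a2,
        a7 ^ a5 ^ a3 ^ a2 ^ a1,
        a7 ^ a6 ^ a2 ^ a1,
        a7 ^ a4 ^ a3 ^ a2 ^ a1,
        a6 ^ a4 ^ a1,
        a6 ^ a1 ^ a0,
    ])


def delta_inv(a: int) -> int:
    """Inverse isomorphic mapping back to GF(2^8) (XOR form)."""
    a7, a6, a5, a4, a3, a2, a1, a0 = _bits(a)
    return _pack([
        a7 ^ a6 ^ a5 ^ a1,
        a6 ^ a2,
        a6 ^ a5 ^ a1,
        a6 ^ a5 ^ a4 ^ a2 ^ a1,
        a5 ^ a4 ^ a3 ^ a2 ^ a1,
        a7 ^ a4 ^ a3 ^ a2 ^ a1,
        a5 ^ a4,
        a6 ^ a5 ^ a4 ^ a2 ^ a0,
    ])


print(all(delta_inv(delta(a)) == a for a in range(256)))
```

Because both mappings are pure XOR networks, they add only a few gate levels before and after the inversion block.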


The multiplicative inverse is computed by decomposing the more complex $GF(2^8)$
into the lower-order fields $GF(2)$, $GF(2^2)$ and $GF((2^2)^2)$. To accomplish this, the
following irreducible polynomials are used, where $\varphi = \{10\}_2$ and
$\lambda = \{1100\}_2$ [30]:
\[
\begin{aligned}
GF(2^2) \rightarrow GF(2) &: \; x^2 + x + 1, \\
GF((2^2)^2) \rightarrow GF(2^2) &: \; x^2 + x + \varphi, \\
GF(((2^2)^2)^2) \rightarrow GF((2^2)^2) &: \; x^2 + x + \lambda.
\end{aligned}
\tag{9}
\]

Addition of two elements in a Galois field translates to a simple bitwise XOR of the
two elements. Based on Ref. 31, the logical equations for the squaring, multiplication and
multiplicative inversion blocks are as follows, where $a$ and $b$ represent the inputs and
$q$ the output.

• Squarer in $GF(2^4)$: the formula for computing the squaring operation is shown
  below:
\[
q = \begin{pmatrix} q_3 \\ q_2 \\ q_1 \\ q_0 \end{pmatrix}
  = \begin{pmatrix} a_3 \\ a_3 \oplus a_2 \\ a_2 \oplus a_1 \\ a_3 \oplus a_1 \oplus a_0 \end{pmatrix}.
\tag{10}
\]
• Multiplication with $\lambda$ in $GF(2^4)$: the formula for computing the multiplication
  with $\lambda = \{1100\}_2$ is shown below:
\[
q = \begin{pmatrix} q_3 \\ q_2 \\ q_1 \\ q_0 \end{pmatrix}
  = \begin{pmatrix} a_2 \oplus a_0 \\ a_3 \oplus a_2 \oplus a_1 \oplus a_0 \\ a_3 \\ a_2 \end{pmatrix}.
\tag{11}
\]
• Multiplication in $GF(2^4)$: it can be decomposed into multiplications in $GF(2^2)$ and
  multiplication with the constant $\varphi$, as shown in Fig. 5.
  – Multiplication in $GF(2^2)$: the formula for computing the multiplication in
    $GF(2^2)$ is as follows:
\[
q = \begin{pmatrix} q_1 \\ q_0 \end{pmatrix}
  = \begin{pmatrix} (a_1 \cdot b_1) \oplus (a_1 \cdot b_0) \oplus (a_0 \cdot b_1) \\ (a_1 \cdot b_1) \oplus (a_0 \cdot b_0) \end{pmatrix}.
\tag{12}
\]
  – Multiplication with $\varphi$ in $GF(2^2)$: the formula for computing the
    multiplication with $\varphi = \{10\}_2$ is shown below:
\[
q = \begin{pmatrix} q_1 \\ q_0 \end{pmatrix}
  = \begin{pmatrix} a_1 \oplus a_0 \\ a_1 \end{pmatrix}.
\tag{13}
\]


• Multiplicative inversion in $GF(2^4)$: the formula for computing the multiplicative
  inversion is shown below:
\[
q = \begin{pmatrix} q_3 \\ q_2 \\ q_1 \\ q_0 \end{pmatrix}
  = \begin{pmatrix}
(a_3 a_2 a_1) \oplus (a_3 a_0) \oplus a_3 \oplus a_2 \\
(a_3 a_2 a_1) \oplus (a_3 a_2 a_0) \oplus (a_3 a_0) \oplus (a_2 a_1) \oplus a_2 \\
(a_3 a_2 a_1) \oplus (a_3 a_1 a_0) \oplus (a_2 a_0) \oplus a_3 \oplus a_2 \oplus a_1 \\
(a_3 a_2 a_1) \oplus (a_3 a_1 a_0) \oplus (a_3 a_2 a_0) \oplus (a_2 a_1 a_0) \oplus (a_3 a_1) \oplus (a_3 a_0) \oplus (a_2 a_1) \oplus a_2 \oplus a_1 \oplus a_0
\end{pmatrix}.
\tag{14}
\]
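The operator blocks listed above can be exercised together in a short Python model. This is a sketch under stated assumptions, not the authors' circuit: `mul4` is our own decomposition of the $GF(2^4)$ product into $GF(2^2)$ products (derived from $x^2 = x + \varphi$, since the figure itself is not reproduced here), and all function names are ours. `inv8_composite` applies the simplified inversion formula to a byte already expressed in the composite field, with $b$ the high nibble and $c$ the low nibble.

```python
LAMBDA = 0b1100  # constant lambda in GF(2^4)


def mul2(a: int, b: int) -> int:
    """Multiplication in GF(2^2)."""
    a1, a0, b1, b0 = a >> 1, a & 1, b >> 1, b & 1
    q1 = (a1 & b1) ^ (a1 & b0) ^ (a0 & b1)
    q0 = (a1 & b1) ^ (a0 & b0)
    return (q1 << 1) | q0


def mul_phi(a: int) -> int:
    """Multiplication with the constant phi = {10} in GF(2^2)."""
    a1, a0 = a >> 1, a & 1
    return ((a1 ^ a0) << 1) | a1


def mul4(a: int, b: int) -> int:
    """Multiplication in GF(2^4) built from GF(2^2) products (our derivation)."""
    ah, al, bh, bl = a >> 2, a & 3, b >> 2, b & 3
    hh = mul2(ah, bh)
    high = hh ^ mul2(ah, bl) ^ mul2(al, bh)
    low = mul_phi(hh) ^ mul2(al, bl)
    return (high << 2) | low


def _nibble_bits(a: int):
    return (a >> 3) & 1, (a >> 2) & 1, (a >> 1) & 1, a & 1


def sq4(a: int) -> int:
    """Squarer in GF(2^4)."""
    a3, a2, a1, a0 = _nibble_bits(a)
    return (a3 << 3) | ((a3 ^ a2) << 2) | ((a2 ^ a1) << 1) | (a3 ^ a1 ^ a0)


def mul_lambda(a: int) -> int:
    """Multiplication with the constant lambda = {1100} in GF(2^4)."""
    a3, a2, a1, a0 = _nibble_bits(a)
    return ((a2 ^ a0) << 3) | ((a3 ^ a2 ^ a1 ^ a0) << 2) | (a3 << 1) | a2


def inv4(a: int) -> int:
    """Multiplicative inversion in GF(2^4); maps 0 to 0."""
    a3, a2, a1, a0 = _nibble_bits(a)
    q3 = (a3 & a2 & a1) ^ (a3 & a0) ^ a3 ^ a2
    q2 = (a3 & a2 & a1) ^ (a3 & a2 & a0) ^ (a3 & a0) ^ (a2 & a1) ^ a2
    q1 = (a3 & a2 & a1) ^ (a3 & a1 & a0) ^ (a2 & a0) ^ a3 ^ a2 ^ a1
    q0 = ((a3 & a2 & a1) ^ (a3 & a1 & a0) ^ (a3 & a2 & a0) ^ (a2 & a1 & a0)
          ^ (a3 & a1) ^ (a3 & a0) ^ (a2 & a1) ^ a2 ^ a1 ^ a0)
    return (q3 << 3) | (q2 << 2) | (q1 << 1) | q0


def inv8_composite(v: int) -> int:
    """Inverse of b*x + c in the composite field (b = high nibble, c = low nibble)."""
    b, c = v >> 4, v & 0xF
    d = inv4(mul_lambda(sq4(b)) ^ mul4(c, b ^ c))
    return (mul4(b, d) << 4) | mul4(b ^ c, d)


print(all(mul4(a, inv4(a)) == 1 for a in range(1, 16)))
```

Every block uses only XOR and AND, which is what makes the S-box amenable to fine-grained pipelining and to gate-level masking.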



As mentioned before, pipelining shortens the critical path and thereby raises the
achievable clock frequency. It consists in overlapping data input and output with
processing by inserting registers. To obtain the optimum number and placement of
pipeline registers in our implementation, we propose the strategy described step by step
in Algorithm 1. In this algorithm, pipeline registers are first placed at all possible
positions in the architecture. Then, in each iteration, the pipeline stage whose removal
causes the smallest reduction in frequency is removed, until the optimum number and
placement of pipeline registers are reached. Figure 6 shows the proposed 5-stage pipelined
design of the S-box using combinational logic.
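The register-placement strategy can be illustrated with a small greedy model in Python. This is a sketch under our own assumptions, not a transcription of Algorithm 1: the circuit is reduced to a chain of combinational blocks with hypothetical delays, the stopping criterion is a target number of stages supplied by the caller, and the names `critical_path` and `place_registers` are ours.

```python
def critical_path(delays: list[float], cuts: set[int]) -> float:
    """Longest register-to-register delay; a cut i is a register after block i."""
    longest, run = 0.0, 0.0
    for i, d in enumerate(delays):
        run += d
        if i in cuts or i == len(delays) - 1:
            longest = max(longest, run)
            run = 0.0
    return longest


def place_registers(delays: list[float], n_stages: int):
    """Start with a register after every block, then greedily drop the register
    whose removal lengthens the critical path the least (lowest frequency loss)."""
    cuts = set(range(len(delays) - 1))
    while len(cuts) + 1 > n_stages:
        best = min(cuts, key=lambda c: (critical_path(delays, cuts - {c}), c))
        cuts.remove(best)
    return sorted(cuts), critical_path(delays, cuts)


# Hypothetical block delays in ns, for illustration only.
cuts, period = place_registers([1.0, 1.0, 2.0, 1.0, 1.0], n_stages=3)
print(cuts, period, "ns ->", 1000.0 / period, "MHz")
```

In this toy chain the three remaining stages each take 2.0 ns, so no further register would raise the clock frequency; the real design would use post-synthesis timing figures instead of assumed delays.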


Fig. 6. 5-stage pipelined design of S-box using combinational logic.



Fig. 7. Proposed round i of key expansion module.

4.2. Pipelined key expansion module


Key expansion is one of the crucial steps in realizing an AES design and determines
the throughput of the cipher process. The round keys used in each encryption round unit
can be either generated beforehand and stored in memory or generated on the fly. To
pipeline round key generation while preserving the high throughput of the cipher process,
the key expansion architecture must be divided into no more stages than the encryption
round unit contains. Our proposed key expansion module is divided into seven sub-stages,
while the encryption round unit contains eight sub-stages. SubWord is based on the
pipelined combinational-logic S-box proposed above. Figure 7 presents the proposed key
expansion module.

5. Masked AES Architecture


Side-channel attacks are considered a serious threat to the AES algorithm. These
attacks are based on side-channel information, that is, information that can be retrieved
from the encryption device and that is neither the plaintext to be encrypted nor


the ciphertext resulting from the encryption process [4]. Power analysis attacks are
among the most powerful of them. They include simple power analysis (SPA), differential
power analysis (DPA), higher-order differential power analysis (HODPA) and glitch
attacks. An SPA attack directly interprets power consumption measurements taken during
cryptographic operations. A DPA attack is based on statistical analysis, in which the
attacker tests key guesses by comparing the differences between a sample power trace and
the power trace of the correct key. An HODPA attack is a powerful technique that exploits
the joint leakage of several intermediate values to "crack" the secret key. At gate level,
input signals propagate through the circuit with different arrival times, which makes
glitch attacks possible [5]. Many research works have studied side-channel attacks and
proposed countermeasures at different levels: algorithmic, system and logic. Algorithmic
countermeasures can be restricted to masking schemes [6,7]. Their basic idea is to
randomize the intermediate results produced during the computation of a cryptographic
algorithm. Masking can break the dependence between the power consumption and the
intermediate values of the cryptographic algorithm.
A traditional XOR operation is used as the masking countermeasure; however, the
mask is arithmetic on $GF(2^8)$ [7]. The operation is compatible with the linear parts of
the AES structure (ShiftRow, MixColumn and AddRoundKey) but not with SubByte, the
only nonlinear transformation, since it uses an inversion in the field. In other words, it is
easy to compute the mask correction for all transformations in a round apart from the
inversion step of the S-box. The problem of masked inversion can be reduced to computing
a binary AND on masked data bits without revealing the actual data bits [8]. To mask our
proposed 5-stage S-box design, which is built from XOR and AND gates, we adopted the
improved masked AND gate proposed in Refs. 9 and 10. The scheme carefully masks each
input of the AND gates by constructing a nonlinear function that efficiently thwarts
side-channel attacks. In addition, it has two more input parameters than existing methods.
Its main circuit needs only five XOR and four AND operations, which takes up a rather
small area when ported to FPGA and ASIC platforms [9,10]. We denote the masked input
bits as follows:

\[ x' = x \oplus r_x \oplus x'', \tag{15} \]
\[ y' = y \oplus r_y \oplus y'', \tag{16} \]
\[ x'' = x \cdot r_x, \tag{17} \]
\[ y'' = y \cdot r_y, \tag{18} \]

where $x'$, $y'$, $x''$ and $y''$ are masked data, and $r_x$ and $r_y$ the corresponding
masks. All operations over the binary extension field reduce to operations over $GF(2)$,
namely bitwise XOR and AND.


XOR is a linear operation; the masked XOR satisfies the following equation:
\[
x' \oplus y' = (x \oplus r_x \oplus x'') \oplus (y \oplus r_y \oplus y'')
             = (x \oplus y) \oplus (r_x \oplus r_y) \oplus (x'' \oplus y''). \tag{19}
\]

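AND is a nonlinear operation; the masked AND satisfies the following equation:
\[
\begin{aligned}
x' \cdot y' &= (x \oplus r_x \oplus x'') \cdot (y \oplus r_y \oplus y'') \\
            &= (x \cdot y) \oplus (x'' \oplus r_x) \cdot (y' \oplus y'') \oplus x' \cdot (r_y \oplus y'') \oplus r_y \cdot (x'' \oplus r_x).
\end{aligned}
\tag{20}
\]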
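Both masked-gate identities can be checked exhaustively over $GF(2)$. The Python sketch below is a verification aid only, not the gate-level circuit of Refs. 9 and 10: it takes $x' = x \oplus r_x \oplus x''$ and $y' = y \oplus r_y \oplus y''$ and treats $x''$ and $y''$ as free bits (an assumption made here because the identities hold whatever their values). The function name `check_masked_gates` is ours.

```python
from itertools import product


def check_masked_gates() -> int:
    """Verify the masked XOR and masked AND identities for every bit combination.

    Returns the number of combinations checked.
    """
    checked = 0
    for x, y, rx, ry, x2, y2 in product((0, 1), repeat=6):
        x1 = x ^ rx ^ x2          # x'
        y1 = y ^ ry ^ y2          # y'
        # Masked XOR: the masks separate out linearly.
        assert x1 ^ y1 == (x ^ y) ^ (rx ^ ry) ^ (x2 ^ y2)
        # Masked AND: expansion into the four-term sum.
        rhs = (x & y) ^ ((x2 ^ rx) & (y1 ^ y2)) ^ (x1 & (ry ^ y2)) ^ (ry & (x2 ^ rx))
        assert x1 & y1 == rhs
        checked += 1
    return checked


print(check_masked_gates(), "combinations verified")
```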
The masked S-box is able to defend against DPA and glitch attacks, thereby offering
a high security level. The implementation of the masked S-box thus increases the security
of the system and hence strengthens the implementation of the algorithm.

6. Implementation Summary and Comparison


The FPGA implementations of our proposed AES architectures were carried out on a
Virtex-6 device (xc6vlx240t-3ff1784) using Xilinx ISE Design Suite 14.7 as the synthesis
tool. The device utilization summary is given in Table 1. The designs were described
in VHDL. The first encrypted block is available after a latency of 80 clock cycles.
Thereafter, one encrypted block is produced at each clock cycle. The unmasked AES
design achieves a maximum clock frequency of 732.279 MHz (minimum period of 1.36 ns), a throughput
of 93.73 Gbps and an efficiency of 16.27 Mbps/slice. The masked one achieves a
maximum clock frequency of 457.582 MHz (minimum period of 2.18 ns), a throughput of 58.57 Gbps and
an efficiency of 6.14 Mbps/slice. Equations (21) and (22) are used to calculate the
throughput and the efficiency, respectively.
$$\text{Throughput} = \frac{\text{Number of output bits}}{\text{Delay of the critical path}}, \tag{21}$$
$$\text{Efficiency} = \frac{\text{Throughput}}{\text{Utilized slices}}. \tag{22}$$
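
To illustrate Eqs. (21) and (22), the reported figures can be recomputed from the maximum frequency and the slice count. The sketch below is a minimal Python illustration under two assumptions not stated explicitly in the equations: one 128-bit block is output per clock cycle once the pipeline is full, and the first block appears after exactly 80 cycles. The function names are hypothetical.

```python
AES_BLOCK_BITS = 128  # assumed: one 128-bit block per clock cycle in steady state


def throughput_gbps(f_max_mhz, bits_per_cycle=AES_BLOCK_BITS):
    """Eq. (21): output bits divided by the critical-path delay (bits * f_max)."""
    return bits_per_cycle * f_max_mhz / 1000.0  # bits * MHz = Mbps, then to Gbps


def efficiency_mbps_per_slice(tp_gbps, slices):
    """Eq. (22): throughput divided by the number of utilized slices."""
    return tp_gbps * 1000.0 / slices


def cycles_to_encrypt(n_blocks, latency_cycles=80):
    """Clock cycles until the n-th block is output, assuming the first block
    appears after `latency_cycles` and one more block follows every cycle."""
    return latency_cycles + (n_blocks - 1)


if __name__ == "__main__":
    designs = (("unmasked", 732.279, 5759), ("masked", 457.582, 9531))
    for name, f_mhz, slices in designs:
        tp = throughput_gbps(f_mhz)
        eff = efficiency_mbps_per_slice(tp, slices)
        print(name, round(tp, 2), "Gbps,", round(eff, 2), "Mbps/slice")
```

With the frequencies and slice counts of Table 1, this gives throughputs of about 93.73 Gbps and 58.57 Gbps, and efficiencies that agree with the reported 16.27 and 6.14 Mbps/slice to within rounding of the last digit.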

There are several implementations of the AES algorithm that aim to achieve the
most efficient architecture by improving throughput and area efficiency.
Table 2 shows the performance figures for some reported architectures, to the best of our
knowledge. It provides values of hardware utilization, maximum frequency, throughput
and efficiency.

Table 1. Device utilization summary (xc6vlx240t-3ff1784).

Resources                   Unmasked AES             Masked AES
Number of slices            5,759/37,680 (15%)       9,531/37,680 (25%)
Number of slice LUTs        15,330/150,720 (10%)     25,740/150,720 (17%)
Number of slice registers   17,680/301,440 (5%)      28,960/301,440 (9%)
Number of bonded IOBs       386/400 (96%)            386/400 (96%)
Minimum period              1.36 ns                  2.18 ns
Maximum frequency           732.279 MHz              457.582 MHz

Table 2. Comparison among diverse implementations.

Authors                     Device                Slices              Critical    Maximum          Throughput  Efficiency
                                                                      delay (ns)  frequency (MHz)  (Gbps)      (Mbps/slice)
Jyrwa and Paily [11]        Virtex-2 Pro XC2VP30  6211 + 1 BRAM       -           142.5            18.2        0.23
Gielata et al. [12]         Virtex-4 XC4VLX200    1209                -           165              21.2        -
Granado-Criado et al. [13]  Virtex-2 XC2V6000     3576 + 80 BRAMs     5.1         194.7            24.92       6.97
Fan and Hwang [14]          Virtex-2 XC2V3000     139357              4.5         222.2            28.40       0.20
Rizk and Morsy [15]         Virtex-4 4VLX60FF668  18,855 + 200 BRAMs  -           -                28.510      1.51
Banu et al. [16]            Virtex-5 XC5VLX110T   -                   4           250              31.25       -
Kaur et al. [17]            Virtex-2 XC2VP30      1127                4           247.365          31.66       -
Rahimunnisa et al. [18]     Virtex-6 XC6VLX75T    2056 + 48 BRAMs     1.9         505.5            37.1        15.56
Hammad et al. [19]          XC2V6000              10662               -           305.1            39.05       3.6
Wang and Ha [20]            Virtex-6 XC6VLX240T   9071 + 400 BRAMs    3.1         319.29           40.86       4.51
Samiee et al. [21]          Virtex-2 Pro XC2VP20  7,865               2.9         341.53           43.71       5.55
Iyer et al. [22]            Virtex-2 Pro XC2VP30  12,556 + 100 BRAMs  2.6         373              47.74       3.8
Anwar et al. [23]           Virtex-6 ML605        2,547 + 204 BRAMs   1.8         553              58          -
Rahimunnisa et al. [24]     Virtex-6 XC6VLX75T    2597                2.2         450.045          59.59       22.94
Soliman and Abozaid [25]    Virtex-5 XC5VLX50T    1,656               1.7         557              70          -
Farashahi et al. [26]       Virtex-4 XC4VLX200    3,425               1.73        576.037          73.73       21.53
Qu et al. [27]              Virtex-5 XC5VLX85     22,994              1.7         576.07           73.73       3.21
Soltani and Sharifian [28]  Virtex-6 XC6VLX240T   28,520              1.2         803.988          102.91      3.6
This work (unmasked AES)    Virtex-6 XC6VLX240T   5,759               1.3         732.279          93.73       16.27
This work (masked AES)      Virtex-6 XC6VLX240T   9,531               2.1         457.582          58.57       6.14

As can be observed from Table 2, our unmasked AES implementation improves data
throughput by 57.29%, 33.9%, 27.12% and 27.12% compared with the best previous
implementations in Refs. 24–27, respectively. The implementation in Ref. 28 is the
only one that gives a higher throughput than ours, by 9.79%. In terms of area,
however, our implementation uses 79.8% fewer slices than Ref. 28. Comparing
throughput per slice, ours is 4.52 times more efficient than Ref. 28.
Comparing our unmasked and masked AES implementations, we note that masking
decreases the throughput by 37%, increases the number of used slices by 65% and
decreases the efficiency by 62%. This is due to the masked gates, which need
additional operations.
The results show that our proposed implementations achieve a good balance
between hardware area and design performance.
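
The percentages quoted in this comparison follow directly from the entries of Table 2. The minimal Python sketch below recomputes them; the helper names `percent_increase` and `percent_decrease` and the dictionary layout are hypothetical, and all numbers are copied from Table 2.

```python
def percent_increase(new, old):
    """Relative increase of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


def percent_decrease(new, old):
    """Relative decrease of `new` with respect to `old`, in percent."""
    return (1.0 - new / old) * 100.0


# Values copied from Table 2 (throughput in Gbps, efficiency in Mbps/slice).
UNMASKED = {"throughput": 93.73, "slices": 5759, "efficiency": 16.27}
MASKED = {"throughput": 58.57, "slices": 9531, "efficiency": 6.14}
REF28 = {"throughput": 102.91, "slices": 28520, "efficiency": 3.6}
PREVIOUS_THROUGHPUT = {"Ref. 24": 59.59, "Ref. 25": 70.0, "Ref. 26": 73.73, "Ref. 27": 73.73}

if __name__ == "__main__":
    for ref, tp in PREVIOUS_THROUGHPUT.items():
        print(ref, "throughput gain (%):", percent_increase(UNMASKED["throughput"], tp))
    print("Ref. 28 throughput lead (%):",
          percent_increase(REF28["throughput"], UNMASKED["throughput"]))
    print("Area saving vs Ref. 28 (%):",
          percent_decrease(UNMASKED["slices"], REF28["slices"]))
    print("Efficiency ratio vs Ref. 28:",
          UNMASKED["efficiency"] / REF28["efficiency"])
    print("Masking: throughput loss (%):",
          percent_decrease(MASKED["throughput"], UNMASKED["throughput"]))
    print("Masking: extra slices (%):",
          percent_increase(MASKED["slices"], UNMASKED["slices"]))
    print("Masking: efficiency loss (%):",
          percent_decrease(MASKED["efficiency"], UNMASKED["efficiency"]))
```

The recomputed values agree with the figures in the text up to rounding or truncation of the last digit.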

7. Conclusion

In this paper, we presented high-throughput, efficient AES architectures. They use
pipelining to break the critical path delay and increase speed. Moreover,
unmasked and masked S-boxes were implemented using combinational logic, with
registers inserted at optimal placements. An input block can be loaded every clock cycle
and, after an initial delay of 80 clock cycles, the encrypted data appear
consecutively. The implementations were carried out on a Virtex-6 FPGA device. The results show
that our proposed AES architectures are competitive with previous implementations
in terms of throughput, area and efficiency.

Acknowledgment
This work is supported by the Moulay Ismail University, Meknes-Morocco.

References
1. J. Daemen and V. Rijmen, AES Proposal: Rijndael (1999), Available at
http://csrc.nist.gov/archive/aes/rijndael/Rijndael-ammended.pdf.
2. National Institute of Standards and Technology (NIST), Advanced Encryption
Standard, Federal Information Processing Standards Publication 197 (2001), Available at
http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf.
3. S. M. Yoo, D. Kotturi, D. W. Pan and J. Blizzard, An AES crypto chip using a high-speed
parallel pipelined architecture, Microprocess. Microsyst. 29 (2005) 317–326.
4. P. Kocher, J. Jaffe and B. Jun, Differential power analysis, Lecture Notes in Computer
Science, Advances in Cryptology-CRYPTO'99, Vol. 1666 (Springer-Verlag Berlin
Heidelberg, 1999), pp. 398–412.
5. S. Mangard, E. Oswald and T. Popp, Power Analysis Attacks: Revealing the Secrets of
Smart Cards (Springer-Verlag, US, 2007).
6. T. Popp, S. Mangard and E. Oswald, Power analysis attacks and countermeasures, IEEE
Design Test Comput. 24 (2007) 535–543.
7. M. L. Akkar and C. Giraud, An implementation of DES and AES, secure against some
attacks, Lecture Notes in Computer Science, Cryptographic Hardware and Embedded
Systems - CHES 2001, Vol. 2162 (Springer-Verlag Berlin Heidelberg, 2001), pp. 309–318.
8. E. Trichina, Combinational logic design for AES subbyte transformation on masked data,
Cryptology ePrint Archive, Report 2003/236 (2003), Available at http://eprint.iacr.org.


9. J. Zeng, Y. Wang, C. Xu and R. Li, Improvement on masked S-box hardware implementation,
Int. Conf. Innovations in Information Technology, Abu Dhabi, UAE (2012), pp. 113–116.
10. J. Zeng and C. Xu, An improved masked S-box for AES and hardware implementation, J.
Convergence Inf. Technol. 7 (2012) 338–344.
11. B. Jyrwa and R. Paily, An area-throughput efficient FPGA implementation of Block
Cipher AES algorithm, Int. Conf. Advances in Computing, Control and
Telecommunication Technologies, Trivandrum Kerala, India (2009), pp. 328–332.
12. A. Gielata, P. Russek and K. Wiatr, AES hardware implementation in FPGA for
algorithm acceleration purpose, Int. Conf. Signals and Electronic Systems, Krakow, Poland
(2008), pp. 137–140.
13. J. M. Granado-Criado, M. A. Vega-Rodríguez, J. M. Sánchez-Pérez and J. A.
Gómez-Pulido, A new methodology to implement the AES algorithm using partial and dynamic
reconfiguration, Integration 43 (2010) 72–80.
14. C.-P. Fan and J.-K. Hwang, FPGA implementations of high throughput sequential and
fully pipelined AES algorithm, Int. J. Electr. Eng. 15 (2008) 447–455.
15. M. R. M. Rizk and M. Morsy, Optimized area and optimized speed hardware
implementations of AES on FPGA, 2nd Int. Design Test Workshop, Cairo, Egypt (2007),
pp. 207–217.
16. J. S. Banu, M. Vanitha, J. Vaideeswaran and S. Subha, Loop parallelization and
pipelining implementation of AES algorithm using OpenMP and FPGA, Int. Conf. Emerging
Trends in Computing, Communication and Nanotechnology, Tirunelveli, India (2013),
pp. 481–485.
17. A. Kaur, P. Bhardwaj and N. Kumar, FPGA implementation of efficient hardware for the
advanced encryption standard, Int. J. Innovat. Technol. Exploring Eng. 2 (2013) 187–190.
18. K. Rahimunnisa, P. Karthigaikumar, S. Rasheed, J. Jayakumar and S. Suresh Kumar,
FPGA implementation of AES algorithm for high throughput using folded parallel
architecture, Security Commun. Netw. 7 (2014) 2225–2236.
19. I. Hammad, K. El-Sankary and E. El-Masry, High-speed AES encryptor with efficient
merging techniques, IEEE Embed. Syst. Lett. 2 (2010) 67–71.
20. Y. Wang and Y. Ha, FPGA-based 40.9-Gbits/s masked AES with area optimization for
storage area network, IEEE Trans. Circ. Syst. II: Express Briefs 60 (2013) 36–40.
21. H. Samiee, R. E. Atani and H. Amindavar, A novel area-throughput optimized
architecture for the AES algorithm, Int. Conf. Electronic Devices, Systems and Applications,
Kuala Lumpur, Malaysia (2011), pp. 29–32.
22. N. Iyer, P. Anandmohan, D. Poornaiah and V. Kulkarni, Efficient hardware architectures
for AES on FPGA, Computational Intelligence and Information Technology, eds. V. V.
Das and N. Thankachan (Springer-Verlag Berlin Heidelberg, 2011), pp. 249–257.
23. H. Anwar, M. Daneshtalab, M. Ebrahimi, J. Plosila and H. Tenhunen, FPGA
implementation of AES-based crypto processor, IEEE 20th Int. Conf. Electronics, Circuits,
and Systems, Abu Dhabi, United Arab Emirates (2013), pp. 369–372.
24. K. Rahimunnisa, P. Karthigaikumar, N. A. Christy, S. S. Kumar and J. Jayakumar, PSP:
Parallel sub-pipelined architecture for high throughput AES on FPGA and ASIC, Central
Eur. J. Comput. Sci. 3 (2013) 173–186.
25. M. I. Soliman and G. Y. Abozaid, FPGA implementation and performance evaluation of
a high throughput crypto coprocessor, J. Parallel Distrib. Comput. 71 (2011) 1075–1084.
26. R. R. Farashahi, B. Rashidi and S. M. Sayedi, FPGA based fast and high-throughput
2-slow retiming 128-bit AES encryption algorithm, Microelectron. J. 45 (2014) 1014–1025.
27. S. Qu, G. Shou, Y. Hu, Z. Guo and Z. Qian, High throughput, pipelined implementation
of AES on FPGA, Int. Symp. Information Engineering and Electronic Commerce,
Ternopil, Ukraine (2009), pp. 542–545.
28. A. Soltani and S. Sharifian, An ultra-high throughput and fully pipelined implementation
of AES algorithm on FPGA, Microprocess. Microsyst. 39 (2015) 480–493.
29. V. Rijmen, Efficient Implementation of the Rijndael S-Box, Katholieke Universiteit
Leuven, Dept. ESAT, Belgium (2000), Available at
http://luca-giuzzi.unibs.it/corsi/Support/papers-cryptography/rijndael-sbox.pdf.
30. A. Satoh, S. Morioka, K. Takano and S. Munetoh, A compact Rijndael hardware
architecture with S-Box optimization, Lecture Notes in Computer Science, Advances in
Cryptology – ASIACRYPT 2001, Vol. 2248 (Springer-Verlag Berlin Heidelberg, 2001),
pp. 239–254.
31. X. Zhang and K. K. Parhi, High-speed VLSI architectures for the AES algorithm, IEEE
Trans. Very Large Scale Integr. (VLSI) Syst. 12 (2004) 957–967.