AES Decryption Core For FPGA
Author
scheng
[email protected]
OpenCores
www.opencores.org
REVISION HISTORY
Rev.    Date          Author   Description
0.1     27 Jan 2014   scheng   First release
0.1.1   11 Feb 2014   scheng
0.1.2   7 May 2015    scheng
CONTENT
Introduction
Highlights
Benchmarks
Architecture
I/O Ports
Operations
INTRODUCTION
The AES Decryption Core for FPGA implements the decryption portion
of the AES (a.k.a. Rijndael) algorithm described in the FIPS-197
specification. Key lengths of 128, 192, and 256 bits are supported, each
with a separate instantiation wrapper. The core logic is carefully
designed to take advantage of 6-input lookup table (LUT6) based FPGA
architectures. As a result, it achieves a peak throughput of over
3 Gbps with a 256-bit key while occupying only about 2000 LUTs. The core
has been verified with random test vectors as well as selected test
vectors from the FIPS-197, SP 800-38A, and AESAVS specifications.
HIGHLIGHTS
TOP LEVEL SYMBOL
[Figure: top-level symbol of the AES Decryption Core, with ports
including clk, rst, kt[0:n-1], and ct[0:127].]
BENCHMARKS
Xilinx Kintex xc7k325tffg900-3
                            128-bit      192-bit      256-bit
LUT                         1865         2350         2033
FF                          310          443          448
BRAM                        0            0            0
Latency w/ key switching    22 clk       26 clk       30 clk
Latency w/o key switching   11 clk       13 clk       15 clk
Fmax                        369 MHz      361 MHz      365 MHz
Peak throughput             4.293 Gbps   3.554 Gbps   3.114 Gbps

Xilinx Kintex UltraScale xcku040ffva1156-2-e
                            128-bit      192-bit      256-bit
LUT                         1791         2269         1969
FF                          299          441          438
BRAM                        0            0            0
Latency w/ key switching    22 clk       26 clk       30 clk
Latency w/o key switching   11 clk       13 clk       15 clk
Fmax                        380 MHz      364 MHz      375 MHz
Peak throughput             4.421 Gbps   3.584 Gbps   3.2 Gbps
Test conditions
For the purpose of benchmarking, the core is wrapped in a
shift-register-like structure to reduce I/O pin count, then synthesized
and implemented with Xilinx Vivado 2015.1 using the Performance_Explore
implementation strategy with a period constraint. Target devices are
the Xilinx Kintex xc7k325tffg900-3 and the Kintex UltraScale
xcku040ffva1156-2-e.
The latency is the number of clock cycles from the arrival of the key
and ciphertext to the clock edge at which the plaintext is available at
the output. It is the sum of the key expansion latency and the
decryption engine latency. For subsequent ciphertext blocks that use the
same key as before, the key expansion latency is zero because the
previously computed key schedule is reused; only the decryption engine
latency counts.
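The relationship between the latency and throughput figures can be
checked with a few lines of Python. This is only a sketch: the formula
used here (128 bits per block, times Fmax, divided by the decryption
engine latency) is an assumption inferred from the benchmark numbers,
not something this specification states, and all names are hypothetical.

```python
BLOCK_BITS = 128  # AES block size in bits

# key length -> (latency with key switching, latency without key switching,
#                Fmax in MHz, quoted peak throughput in Gbps),
# copied from the benchmark table for the xc7k325tffg900-3.
KINTEX7 = {
    128: (22, 11, 369, 4.293),
    192: (26, 13, 361, 3.554),
    256: (30, 15, 365, 3.114),
}


def key_expansion_latency(with_key, without_key):
    """Key expansion latency = total latency minus decryption engine latency."""
    return with_key - without_key


def peak_throughput_gbps(fmax_mhz, engine_latency):
    """Assumed model: one 128-bit block completes every engine_latency cycles."""
    return BLOCK_BITS * fmax_mhz * 1e6 / engine_latency / 1e9


for klen, (with_key, without_key, fmax, quoted) in KINTEX7.items():
    kexp = key_expansion_latency(with_key, without_key)
    gbps = peak_throughput_gbps(fmax, without_key)
    print(f"{klen}-bit: key expansion {kexp} clk, "
          f"computed {gbps:.3f} Gbps, quoted {quoted} Gbps")
```

Under this assumption the computed values agree with the quoted
throughput figures to within 0.001 Gbps.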
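ARCHITECTURE
[Figure: block diagram of the core. The key expander (RoundKey Register,
Rcon, SubWord, RotWord) accepts kt (128 / 192 / 256 bits) through kt_vld
and kt_rdy. The key schedule buffer (128 bits wide) sits between the key
expander and the decryption engine. The decryption engine (InvSubBytes,
InvShiftRows, InvMixColumns, InvAddRoundKey) accepts the 128-bit ct
through ct_vld and ct_rdy and outputs the 128-bit pt with pt_vld. A
klen_sel signal also appears in the diagram.]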
The key schedule buffer sits between the key expander and the
decryption engine. It is a 16-deep by 128-bit-wide dual-port RAM with
associated read and write pointers and handshake logic to interface
with the key expander and the decryption engine. The key schedule
buffer is needed because the inverse cipher algorithm consumes round
keys in the reverse of the order in which the key expansion algorithm
generates them. As a result, the whole key schedule has to be stored in
a buffer as it exits the key expander so that the decryption engine can
read it in reverse order.
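The write-forward, read-backward behavior described above can be
illustrated with a small Python model. This is a sketch for intuition
only, not the RTL in KschBuffer.sv: the class, its method names, and the
rewind step are hypothetical, and the handshake logic is left out.

```python
class KeyScheduleBuffer:
    """Toy model: round keys are written in order and read back in reverse."""

    DEPTH = 16  # 16 entries of 128 bits, as stated in the text

    def __init__(self):
        self.mem = [None] * self.DEPTH
        self.wr_ptr = 0    # next entry the key expander writes
        self.rd_ptr = -1   # next entry the decryption engine reads

    def write(self, round_key):
        """Store one round key as it leaves the key expander."""
        if self.wr_ptr >= self.DEPTH:
            raise OverflowError("key schedule buffer is full")
        self.mem[self.wr_ptr] = round_key
        self.wr_ptr += 1
        self.rd_ptr = self.wr_ptr - 1  # reading starts at the newest key

    def rewind(self):
        """Point back at the last round key so the schedule can be reused."""
        self.rd_ptr = self.wr_ptr - 1

    def read(self):
        """Return the next round key in reverse order of generation."""
        if self.rd_ptr < 0:
            raise IndexError("key schedule exhausted; call rewind()")
        round_key = self.mem[self.rd_ptr]
        self.rd_ptr -= 1
        return round_key


# AES-128 uses 11 round keys (FIPS-197: Nr + 1 with Nr = 10).
buf = KeyScheduleBuffer()
for i in range(11):
    buf.write(f"rk{i}")  # placeholder strings stand in for 128-bit keys
first_pass = [buf.read() for _ in range(11)]
buf.rewind()             # same key, next ciphertext block
second_pass = [buf.read() for _ in range(11)]
print(first_pass[0], first_pass[-1])  # rk10 rk0
```

A 256-bit key needs 15 round keys (Nr = 14 in FIPS-197), which still
fits in the 16 entries.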
I/O PORTS
Port        Width         Direction  Description
clk         1             Input      Core clock. All logic is synchronous to the
                                     rising edge of clk.
rst         1             Input      Core reset. Active-high synchronous
                                     reset. This signal must be asserted for at
                                     least one clock cycle to reset the core.
kt[0:n-1]   128/192/256   Input      Key input. The width n equals the selected
                                     key length. A key must be loaded into the
                                     core before a decryption can start.
                                     Once loaded, the same key can be used
                                     for multiple ciphertext blocks.
kt_vld      1             Input      Key valid. Active high. This signal is
                                     driven high by the application to tell the
                                     core that a valid key is present on
                                     kt[0:n-1]. Key transfer occurs at the
                                     rising clock edge when both kt_vld and
                                     kt_rdy are high.
kt_rdy      1             Output     Kt interface ready. Active high. This signal
                                     is driven high by the core when it is
                                     ready to accept a new key.
ct[0:127]   128           Input      Ciphertext input.
ct_vld      1             Input      Ciphertext valid. Active high. This signal
                                     is driven high by the application to
                                     indicate the presence of a valid
                                     ciphertext on ct[0:127]. The ciphertext is
                                     transferred to the core at the rising
                                     clock edge when both ct_vld and ct_rdy
                                     are high.
ct_rdy      1             Output     Ct interface ready. Active high. This
                                     signal is driven high by the core to
                                     indicate that it is ready to accept a new
                                     ciphertext.
pt[0:127]   128           Output     Plaintext output.
pt_vld      1             Output     Plaintext valid. Active high. This signal
                                     is driven high by the core when a valid
                                     plaintext is available on pt[0:127].
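The valid/ready rule used by the kt and ct interfaces (a transfer occurs
at a rising clock edge when both signals are high) can be sketched in
Python as follows. The function name and the sample waveform are made up
for illustration; they are not taken from the testbench.

```python
def transfer_edges(vld, rdy):
    """Return the clock-edge indices at which a transfer occurs.

    vld and rdy are per-cycle samples (0 or 1) of a valid/ready pair such as
    kt_vld/kt_rdy or ct_vld/ct_rdy, as seen at each rising edge of clk.
    """
    return [i for i, (v, r) in enumerate(zip(vld, rdy)) if v and r]


# Hypothetical waveform: the application raises ct_vld at edge 1 and keeps it
# high; the core raises ct_rdy at edge 3, so the transfer happens at edge 3.
ct_vld = [0, 1, 1, 1, 0, 0]
ct_rdy = [0, 0, 0, 1, 0, 0]
print(transfer_edges(ct_vld, ct_rdy))  # [3]
```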
OPERATION
The basic decryption cycle involves 3 steps:
1. Load crypto key
2. Load ciphertext
3. Read plaintext
Once the plaintext is available at the pt interface, the next decryption
cycle can start by loading either the next key or the next ciphertext.
If the next ciphertext uses the same key as before, there is no need to
load the key again, since the previous key schedule is already stored in
the key schedule buffer.
[Figure: operation timeline showing Load Kt, Load Ct, and Read Pt phases
over time, including back-to-back ciphertext.]
DECRYPTION CYCLE
[Figure: timing diagram of a basic 128-bit decryption cycle, with events
marked 1 to 5.]
The timing diagram of a basic 128-bit decryption cycle is shown above.
1. The core asserts kt_rdy to high when it is ready to accept a new
key.
2. The application presents the key to kt and asserts kt_vld to high
to inform the core that a valid key is present.
3. The core asserts ct_rdy to high when it is ready to accept new
ciphertext.
4. The application presents the ciphertext to ct and asserts ct_vld
to high to inform the core that a valid ciphertext is present.
5. The core presents the plaintext to pt and asserts pt_vld to high
when the decryption process is finished.
The key expansion starts when kt_vld is high and finishes when kt_rdy
goes high again. This process takes 11 clock cycles for a 128-bit key.
The decryption engine starts when a valid key schedule is present and
both ct_vld and ct_rdy are high, and finishes when pt_vld is high. This
process also takes 11 clock cycles for a 128-bit key.
The decryption cycles for 192-bit and 256-bit keys are similar, except
that the latencies differ. Refer to the benchmark section for exact
latency numbers.
DECRYPTION
[Figure: timing diagram of the decryption cycle for back-to-back
ciphertext, with events marked 1 to 5.]
The timing diagram above shows the decryption cycle for back-to-back
ciphertext.
1. The core asserts ct_rdy to high when it is ready to accept new
ciphertext.
2. The application drives the ciphertext to ct and asserts ct_vld to
high to inform the core that a valid ciphertext is present.
3. The core asserts pt_vld to high when a valid plaintext is
available on pt. At the same time it also asserts ct_rdy to high
again to indicate it is ready to accept a new ciphertext.
4. The application drives the next ciphertext to ct and asserts
ct_vld to high to inform the core that a new ciphertext is
present.
5. The core asserts pt_vld to high when the second plaintext is
available on pt.
It can be seen that the availability of the plaintext (pt_vld at 3) and the
loading of next ciphertext (ct_vld at 4) can be overlapped. No dead
cycle is incurred.
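Because no dead cycle is incurred, the time to decrypt a run of blocks
under one key can be estimated from the latencies in the benchmark
section. The sketch below assumes the application always presents the
next ciphertext as soon as ct_rdy goes high. The 192-bit and 256-bit key
expansion latencies are derived here as the difference between the two
latency rows of the benchmark table, and the function name is
hypothetical.

```python
# key length -> (key expansion latency, decryption engine latency), in clocks
LATENCY = {128: (11, 11), 192: (13, 13), 256: (15, 15)}


def total_cycles(key_bits, n_blocks, new_key=True):
    """Estimated cycles to decrypt n_blocks back-to-back under one key."""
    key_expansion, engine = LATENCY[key_bits]
    return (key_expansion if new_key else 0) + n_blocks * engine


print(total_cycles(128, 1))                 # 22
print(total_cycles(128, 4))                 # 55
print(total_cycles(256, 4, new_key=False))  # 60
```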
SIMULATION
The core is verified against selected test vectors from FIPS-197,
AESAVS, and SP 800-38A. It is also tested against an AES behavioral
model with random test vectors. All the files necessary for simulation
are provided under the bench/ and sim/ directories so that the
verification results can be reproduced.
TESTBENCH
The testbenches are located under the bench/ directory. There is a
separate testbench for each supported key length, while the test set is
common to all key lengths. The tests performed are listed below.
1. FIPS-197 sample vector test
2. Back-to-back ciphertext test
3. ECB-AES128/192/256.Decrypt sample vector test. SP 800-38A
appendix F
4. GFSbox Known Answer Test. AESAVS appendix B
5. KeySbox Known Answer Test. AESAVS appendix C
6. VarTxt Known Answer Test. AESAVS appendix D
7. VarKey Known Answer Test. AESAVS appendix E
8. Random vector test
For the back-to-back ciphertext test and the random vector test, the
core is driven with random vectors and the output is verified against
the AES SystemVerilog Behavioral Model available from Opencores.org,
written by the designer of this core. The source code of the behavioral
model can be found under the sim/rtl_sim/src/ directory. Users are
encouraged to check out the latest version of the model from
Opencores.org.
The testbench compares the core output against either known good
results or the golden model output. In case of a mismatch, it prints an
error message and the simulation continues. Once all tests are finished,
either OK or Failed is printed to indicate whether all tests passed.
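To make the idea of a golden-model comparison concrete, here is a
compact Python rendering of the FIPS-197 inverse cipher, checked against
the FIPS-197 sample vector. It is a minimal sketch written for this
document, not the SystemVerilog behavioral model shipped under
sim/rtl_sim/src/, and the function names are hypothetical.

```python
def xtime(a):
    """Multiply by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a


def gmul(a, b):
    """Multiply two bytes in GF(2^8)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a, b = xtime(a), b >> 1
    return r


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def make_sbox():
    """Build the S-box: field inverse followed by the affine transform."""
    sbox = []
    for a in range(256):
        inv = next((b for b in range(1, 256) if gmul(a, b) == 1), 0)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox


SBOX = make_sbox()
INV_SBOX = [SBOX.index(i) for i in range(256)]


def expand_key(key):
    """Return the round keys (16-byte lists) in order of generation."""
    nk = len(key) // 4   # key length in 32-bit words: 4, 6, or 8
    nr = nk + 6          # number of rounds: 10, 12, or 14
    w = [list(key[4 * i:4 * i + 4]) for i in range(nk)]
    rcon = 1
    for i in range(nk, 4 * (nr + 1)):
        t = w[i - 1]
        if i % nk == 0:
            t = [SBOX[b] for b in t[1:] + t[:1]]  # RotWord, then SubWord
            t[0] ^= rcon
            rcon = xtime(rcon)
        elif nk > 6 and i % nk == 4:
            t = [SBOX[b] for b in t]              # extra SubWord for 256-bit keys
        w.append([x ^ y for x, y in zip(w[i - nk], t)])
    return [sum(w[4 * r:4 * r + 4], []) for r in range(nr + 1)]


def decrypt_block(ct, key):
    """Decrypt one 16-byte block; round keys are consumed in reverse order."""
    rk = expand_key(key)
    s = [c ^ k for c, k in zip(ct, rk[-1])]
    for rnd in range(len(rk) - 2, -1, -1):
        # InvShiftRows: byte 4*c + r comes from column (c - r) mod 4.
        s = [s[4 * ((c - r) % 4) + r] for c in range(4) for r in range(4)]
        # InvSubBytes followed by AddRoundKey.
        s = [INV_SBOX[b] ^ k for b, k in zip(s, rk[rnd])]
        if rnd:  # InvMixColumns is skipped in the final round
            s = [gmul(s[4 * c + i], 0x0E) ^ gmul(s[4 * c + (i + 1) % 4], 0x0B)
                 ^ gmul(s[4 * c + (i + 2) % 4], 0x0D)
                 ^ gmul(s[4 * c + (i + 3) % 4], 0x09)
                 for c in range(4) for i in range(4)]
    return bytes(s)


# FIPS-197 Appendix C.1 sample vector (128-bit key).
key = bytes.fromhex("000102030405060708090a0b0c0d0e0f")
ct = bytes.fromhex("69c4e0d86a7b0430d8cdb78070b4c55a")
print(decrypt_block(ct, key).hex())  # 00112233445566778899aabbccddeeff
```

The same function handles 192-bit and 256-bit keys, since the number of
rounds is derived from the key length.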
RUNNING MODELSIM
Shell scripts for simulation and Modelsim .do files are provided under
the sim/rtl_sim/bin/ directory. Simulation can be run either directly from
the shell or from the Modelsim GUI. As the simulation runs, messages
To run the simulation for other key lengths, replace sim128.* above with
sim192.* or sim256.*.
RETARGETING GUIDELINES
The core is designed with the objective of maximizing performance and
resource efficiency when implemented on modern LUT6-based FPGAs.
This is achieved through carefully written source code that limits
combinational logic to at most 6 input signals wherever possible, so
that it fits well into LUT6s. Other than that, the source code is
technology independent and portable to FPGA architectures from different
vendors. This section describes the modifications recommended when
retargeting the core, to bring out the full performance of the target
technology.
Inclusion of generic_muxfx.v
The source file generic_muxfx.v, located under the rtl/verilog/generic/
directory, defines technology-independent 2-to-1 multiplexors MUXF7
and MUXF8, which are used in the source file Sbox.sv. This file should
NOT be included when targeting Xilinx FPGAs, so that the synthesis tool
can use the MUXF7 and MUXF8 from the Xilinx library. When targeting
other FPGA technologies, either provide a technology-specific definition
of those multiplexors, or include generic/generic_muxfx.v if a
technology-specific version is not available.
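For readers unfamiliar with these primitives, the Python sketch below
shows one common way a 1-bit function of 8 inputs is built from four
LUT6s, two MUXF7s, and one MUXF8. Whether Sbox.sv is structured exactly
this way is an assumption; the functions and the test table are
hypothetical and only model the 2-to-1 multiplexor behavior that
generic_muxfx.v is described as providing.

```python
def mux2(i0, i1, sel):
    """Behavior of a 2-to-1 multiplexor such as MUXF7 or MUXF8."""
    return i1 if sel else i0


def lut6(table64, addr6):
    """A LUT6 modeled as a 64-entry, 1-bit lookup table."""
    return table64[addr6]


def lookup8(table256, addr8):
    """Evaluate a 1-bit function of 8 inputs with 4 LUT6s and 3 multiplexors."""
    low = addr8 & 0x3F  # the 6 inputs shared by all four LUT6s
    outs = [lut6(table256[64 * k:64 * k + 64], low) for k in range(4)]
    f7_a = mux2(outs[0], outs[1], (addr8 >> 6) & 1)  # MUXF7 selects on bit 6
    f7_b = mux2(outs[2], outs[3], (addr8 >> 6) & 1)  # MUXF7 selects on bit 6
    return mux2(f7_a, f7_b, (addr8 >> 7) & 1)        # MUXF8 selects on bit 7


# Arbitrary 256-entry, 1-bit table used only to exercise the structure.
table = [((i * i + 3 * i) >> 2) & 1 for i in range(256)]
print(all(lookup8(table, a) == table[a] for a in range(256)))  # True
```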
The table below shows the Vivado synthesis attributes that need to be
replaced when retargeting to other FPGA technologies or using a
different synthesis tool.
Vivado synthesis attribute            Used in                  Description
(* RAM_STYLE="distributed" *)         KschBuffer.sv
(* KEEP_HIERARCHY = "yes" *)          decrypt128_wrapper.sv
                                      decrypt192_wrapper.sv
                                      decrypt256_wrapper.sv
                                      decrypt.sv
                                      InvSbox.sv
                                      Sbox.sv
(* keep = "true", max_fanout = 1 *)   KeyExpand128.sv
                                      KeyExpand192.sv
                                      KeyExpand256.sv