
Dilithium FPGA Protocol

This document presents a hardware accelerator for the CRYSTALS-Dilithium post-quantum digital signature scheme. The accelerator achieves a 24% reduction in area over state-of-the-art by optimizing for resource sharing, pre-computation, and parallelism. It can fit on a small Zynq FPGA and outperforms software implementations as a hardware co-processor on various platforms.

Uploaded by

Phan Văn Đức

Lightweight Hardware Accelerator for Post-Quantum Digital Signature CRYSTALS-Dilithium
Naina Gupta, Arpan Jati, Anupam Chattopadhyay, and Gautam Jha

N. Gupta and A. Chattopadhyay, School of Computer Science and Engineering, Nanyang Technological University, Singapore.
A. Jati, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore.
G. Jha, Department of Electronics and Electrical Communication Engineering, IIT Kharagpur, India.

Abstract—The looming threat of an adversary with quantum computing capability has led to a worldwide research effort towards identifying and standardizing novel post-quantum cryptographic primitives. Post-standardization, all existing security protocols will need to support efficient implementations of these primitives. In this work, we contribute to these efforts by reporting the smallest implementation of CRYSTALS-Dilithium, a finalist candidate for post-quantum digital signature.
By invoking multiple optimizations to leverage parallelism, pre-computation and memory access sharing, we obtain an implementation that fits into one of the smallest Zynq FPGAs. On Zynq UltraScale+, our design achieves an improvement of about 36.7%/35.4%/42.3% in the Area×Time (LUTs×s) trade-off for KeyGen/Sign/Verify respectively over the state-of-the-art implementation. We also evaluate our design as a co-processor on three different hardware platforms and compare the results with a software implementation, thus presenting a detailed evaluation of CRYSTALS-Dilithium targeted at embedded applications. Further, on ASIC using TSMC 65nm technology, our design requires 0.227 mm² of area and can operate at a frequency of 1.176 GHz. As a result, it requires only 53.7µs/96.9µs/57.7µs for the KeyGen/Sign/Verify operations in the best-case scenario.

Index Terms—post-quantum, cryptography, PQC, CRYSTALS-Dilithium, FPGA, hardware, ASIC, hardware accelerator

I. INTRODUCTION

The threat of an adversary with quantum computing capability is getting increasingly realistic [1], [2], with rapid growth in the capacity of quantum computers as well as optimized implementations of the current cryptographic primitives using quantum circuit simulators [3], [4]. It is predicted that current public-key primitives could be broken within hours [5], thus necessitating the search for alternative cryptographic primitives in the era of quantum computing. This is systematically undertaken by NIST through its Post-Quantum Cryptography (PQC) contest [6], which plans to roll out the winners of this contest as a standard. Naturally, there is a pressing need to study efficient hardware implementations of the PQC candidates, which not only play an important role in the contest judgement process but also help in the rapid adoption of the standard.
The current finalists of the NIST PQC contest include three digital signature candidates, one of which has recently been attacked [7]. Several earlier works have also compromised the security of this scheme [8]–[10]. This leaves two other digital signature candidates, of which Dilithium is one. Naturally, it is of high importance to study efficient implementations of Dilithium [11]. Since digital signatures are an integral part of numerous security protocols, the use-cases and platform constraints range from high-speed servers to highly resource-constrained IoT platforms.
As a result, there have been several efforts towards optimizing Dilithium for different security parameters and different platforms such as FPGAs and ASICs, targeting pure-hardware implementations [12]–[16], HLS-based implementations [17]–[19] or software-hardware co-design [20]. Further, a few works focused on integration in the TLS protocol [21], on GPU acceleration [22] and on developing a quantum-secure blockchain [23].
From our study of the existing Dilithium implementations, we found that there is significant room for improvement in terms of area-efficiency, which sets the motivation for this work. The main contributions of this work are:
1) We designed a lightweight hardware accelerator for CRYSTALS-Dilithium. We used multiple optimization strategies such as resource and control logic sharing, fusion of modules, pre-computed LUTs, etc.
2) With about 24% fewer LUTs and FFs than the state-of-the-art implementation, this work presents the smallest hardware accelerator for Dilithium. As a result, it can now fit into one of the smallest Zynq FPGAs.
3) In our design, we have leveraged both pipelining and parallelism to achieve a good balance of performance and area. Thus, on Zynq UltraScale+, our design achieves an Area×Time improvement of more than 35% over the existing implementation.
4) To present a fair comparison with the existing implementations, we used two metrics: Area×Time and the number of operations that can be performed per second per LUT on a particular platform. On Zynq UltraScale+, our design outperforms the state-of-the-art, whereas on Artix-7 our design has better performance for the signing operation.
5) Using the TSMC 65nm library, we also implemented the design on an ASIC platform and report numbers for the major modules as well as the overall design. On ASIC, our design can run at 1.176 GHz with 0.227 mm² area
achieving a reduction of 1.4× in area and an improvement of more than 1.7× in execution times for KeyGen, Sign and Verify.
6) Further, we performed a hardware evaluation of the implemented design as an accelerator on three different hardware platforms and achieved a speedup of 6-15× for modern high-performance CPUs and about 105-261× for MicroBlaze compared to software implementations for different operations.

The paper is organized as follows. Section II starts with the notations and presents a brief background on Dilithium. Section III shows the overall system architecture along with the design decisions. It is followed by the experimental results and a performance comparison with the state-of-the-art implementations in Section IV. The performance evaluation as a hardware accelerator is presented in Section V, and Section VI concludes the work.

II. PRELIMINARIES

A. Notations

Throughout this work, we use the following notation. Lower-case letters are used to represent vectors (e.g. e) and polynomials in the NTT domain are represented using a hat over the symbol (ê). Matrices are represented using bold upper-case letters (e.g. A). $R = \mathbb{Z}[X]/(X^n + 1)$ denotes the ring of integer polynomials modulo $(X^n + 1)$, where n is a power of 2. $R_q$ is the polynomial ring with coefficients modulo q. For Dilithium, n = 256 and the modulus is $q = 2^{23} - 2^{13} + 1$.

B. Protocol Description

The digital signature scheme CRYSTALS-Dilithium is based on the hardness of the Module Learning with Errors (MLWE) and the Short Integer Solution (SIS) problems. In this section, we briefly discuss the hardness problems and the protocol. Interested readers are referred to [24] for more details. Let us consider a matrix A of dimension k×l and vectors s1 and s2 of dimensions l and k sampled uniformly. Then, the MLWE problem can be defined as: given (A, As1 + s2) and (A, b), where b ≈ As1 + s2 and is a uniformly sampled vector, the goal is to distinguish (A, As1 + s2) from (A, b). One can note that they are approximately equal, but the unknown error vector (s2) makes it quite difficult to distinguish them. The SIS problem can be defined as: given A, the goal is to find a vector x such that Ax = 0 and the norm of x is smaller than an integer value called the norm bound β. The Dilithium signature scheme consists of three algorithms, key generation (KeyGen), signature generation (Sign) and verification (Verify), shown in Fig. 1.
1) KeyGen(): The key generation algorithm generates a keypair consisting of a public verification key (pk) and a private signing key (sk). It utilizes two random seeds, a public seed (ρ) and a noise seed (ρ′), and expands them using a variant of SHAKE-128 to generate the matrix A and two vectors s1 and s2. It then computes t = As1 + s2 to generate the final keypair.
2) Sign(sk, M): The purpose of this algorithm is to take the message M and the signing key sk and generate the signature sig. For this, a masking vector y is first generated using SHAKE-256 and then used to compute w = Ay. The high-order bits of w (denoted as w1) are then hashed with the message M to generate the challenge c. This challenge is used to generate the signature as z = cs1 + y. Apart from generating z, the algorithm also generates some hints h for the verifier. If the signature passes all the correctness and security checks, then sig consisting of (c, z, h) is sent as the final signature to the verifier. Otherwise, the signature is re-computed as shown in the rejection loop.
3) Verify(pk, sig, M): The verifier computes Az − ct1 and stores its high-order bits in w1. It is then hashed with the message M to generate the challenge c. This challenge is compared with the one received in the signature. If the challenges match and the norm of z is valid, then the signature is accepted and the algorithm returns valid_sig as true; otherwise the signature is rejected.

[Fig. 1. Dilithium Protocol: data-flow diagram of KeyGen, Sign (including the rejection loop) and Verify.]

III. ARCHITECTURE AND DESIGN DECISIONS

Here, we present the overall system architecture and discuss the various design decisions which led to the overall optimized modules.

A. System Architecture

The architecture for the three algorithms KeyGen, Sign and Verify is combined and the resources are extensively shared to keep the memory footprint as small as possible. A global input enable signal is used to start the required operation. The modules having similar control logic are also combined to further reduce the resource utilization.
Such modules are grouped together in the overall design architecture shown in Fig. 2.
The hardware accelerator is designed using a dedicated FSM-based control unit. All the modules have separate enable and done signals and are connected to the memory controller. The control logic and sequencer unit is responsible for enabling different modules depending on the required functionality. The memory controller consists of a dual-port switch matrix and is responsible for connecting different modules with the RAMs depending on the input and the output of the operation.

[Fig. 2. High-level System Architecture of Dilithium. Blocks: Control Logic and Sequencer; Memory Controller; Keccak Wrapper; Rejection Sampler (matrix A, vectors s1, s2); SampleInBall (challenge c); GenRandSeed/ExpandMask/H; NTT/INTT; Pointwise Multiplication or Reduce; Pointwise Addition or Subtraction; Unified Reduction; ChkNorm; Verify; UseHint; MakeHint; Unpack and Pack (PK/SK/SIG); Power2Round/Decompose. Top-level signals: init_seed, global_enable, msg, msg_len, done, verify_fail.]

B. Memory Requirements

The upper bound on the memory storage is determined by the signature generation part. In the worst-case scenario for Level V parameters, one needs to store about 118 polynomials. This huge requirement is mainly because of the matrix A of size k×l×256. In the case of KeyGen and Verify, one can generate a partial matrix of size l×256, perform the computation and store the final result. One can execute these steps k times separately to save resources and generate the final output of the pointwise operation. But, in the case of the Sign operation, due to multiple rejections, there is a trade-off between pre-computing and storing the complete matrix A or re-computing partial matrix coefficients. The former approach requires more RAMs and fewer clock cycles, whereas the latter results in less storage but more clock cycles. In our design we chose to compute it fully (56 polynomials) only once, at the expense of RAM requirements. To balance this requirement, we analyzed the Sign algorithm to determine which variables are never accessed simultaneously (for instance, z and y) and utilized the same RAMs for both variables. Also, the internal bus width for each polynomial is variable and optimized based on the coefficient size. For instance, the zetas are only 23 bits wide, hence the BRAM data width is set to 23 bits instead of 32 bits. Similarly, most of the coefficients for Dilithium fit into 26 bits, resulting in a reduction of about 16.8% in BRAM requirements compared to an implementation with a fixed bus width of 32 bits.

C. Parallel Processing

In order to reduce clock cycles, data-independent operations can be executed in parallel. One can perform fine-grained parallel execution of operations such as in [13], but this leads to significant area overheads, mainly because of increased multiplexing, additional module instances, etc. We implemented parallel execution of modules wherever possible while ensuring low area.

D. SHAKE

Dilithium uses the SHAKE-128 and SHAKE-256 extendable-output functions (XOFs) of the SHA-3 [25] family, which is based around the Keccak permutation [26], for different functionalities with minor variations. For instance, SHAKE-256 and its variations are used to generate random seeds from an initial seed, for the hashing (H) and also to generate the challenge (c), whereas variants of SHAKE-128 are used to generate the matrix A, the vectors s1, s2 and the masking vector y1.

[Fig. 3. Unified SHAKE Wrapper. Blocks: Interface; Operation Control; Absorb Control Logic; Squeeze Control Logic; Read Keccak State; XOR Logic + Input Address Controller; Padding Logic; Keccak Unit (Round Constants Generator, Keccak-f Permutation, 1600-bit Keccak Internal State Register).]

As the XOFs are among the most expensive and time-consuming operations, it is important to design them carefully. There are multiple possible design choices. One can have multiple instances of Keccak as in [15], cascade two rounds of Keccak [13] to increase performance, or have a single shared instance to achieve all the functionalities, saving resources at the expense of some performance. In our design, we chose the last approach and created a unified wrapper around Keccak to support all the required variations. For Keccak, we used the implementation provided by the designers [27]. As Dilithium is a digital signature scheme, the architecture should support variable input message length. Also, the requested output length from the XOFs is different depending on the operation being performed. Such variations in input/output length, padding logic, nonce support, etc. make it quite challenging to have one module which is efficient in hardware and also serves all the required functionalities. As a result, the overall logic for this module is quite complex. In order to support variable input and output message lengths, the design has three configurable modes of operation: perform absorb followed by one squeeze (mode = 1), perform only squeeze (mode = 2) and read Keccak state (mode = 3).
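As a rough software illustration of these three modes (a sketch, not the RTL), the C fragment below counts how often an outer module would issue each mode to obtain a given number of output bytes. It assumes the standard SHAKE rates of 168 bytes (SHAKE-128) and 136 bytes (SHAKE-256) and one 64-bit state word per mode-3 read; the names `xof_schedule` and `plan_xof_request` are hypothetical and do not come from the design.

```c
#include <stddef.h>

/* Standard SHAKE rates in bytes: (1600 - 2 * security_bits) / 8. */
#define SHAKE128_RATE 168u
#define SHAKE256_RATE 136u

/* Hypothetical summary of one XOF request as seen by an outer module. */
typedef struct {
    size_t mode1_calls;  /* absorb followed by one squeeze          */
    size_t mode2_calls;  /* additional squeeze-only operations      */
    size_t mode3_reads;  /* 64-bit reads of the Keccak state        */
} xof_schedule;

/* Plan the mode sequence for out_len output bytes; rate must be > 0. */
xof_schedule plan_xof_request(size_t out_len, size_t rate)
{
    xof_schedule s = {1, 0, 0};   /* every request starts with mode 1 */
    size_t remaining = out_len;

    while (remaining > 0) {
        size_t chunk = remaining < rate ? remaining : rate;
        s.mode3_reads += (chunk + 7) / 8;  /* one 64-bit word per read */
        remaining -= chunk;
        if (remaining > 0)
            s.mode2_calls++;      /* state exhausted: squeeze again */
    }
    return s;
}
```

For example, `plan_xof_request(169, SHAKE128_RATE)` needs one extra squeeze because the 169th byte lies beyond the first 168-byte block. In rejection sampling the output length is not known in advance, so in practice the decision to issue mode 2 would be taken on demand rather than planned up front.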
The modes are extremely useful when the number of squeeze executions is unknown (for instance, in the case of rejection sampling). Also, mode = 3 allows the Keccak state to be accessed from outer modules; as a result, we do not need extra resources to store additional copies of the 1600-bit register elsewhere, in contrast to the work in [13]. The internal state remains valid after each squeeze operation unless reset from outside. Fig. 3 shows the architecture of the implemented SHAKE wrapper module. The interface is used to set the corresponding mode using the enable signal, to provide necessary information such as the number of bytes to absorb (input_length), the shake rate (shake_rate), etc., and to communicate with the different outer modules.
The control logic for each mode is responsible for starting the corresponding operation. For example, for a SHAKE-128 operation, the outer module first sets the enable signal to mode = 1 along with the input length and shake rate. The operation control decodes the mode and transfers control to the absorb control logic. This starts the absorb phase by reading in the domain separator and the input. The XORing of the input with the Keccak state and the generation of the corresponding input addresses are handled by the XOR Logic + Input Address Controller. In our design, data is absorbed in chunks of 64 bits per clock cycle. The absorb control logic is responsible for monitoring the absorbed input length and starts the permutation if required (when the input message length is greater than the shake rate). Further, when the input is completely absorbed, it enables the padding logic to finalize the absorbed block. The absorb control logic then starts a squeeze operation by setting the start_sqz signal. The squeeze block starts the Keccak permutation and waits for it to finish execution using a dedicated counter (25 clock cycles as Keccak has 25 rounds). It then sends the done_sqz signal to the absorb control logic, which in turn triggers the done_abs_sqz signal to the operation control. Once the outer module receives the done signal, it starts reading the Keccak state by setting mode = 3 and providing the corresponding state address. In a single clock cycle, 64 bits of data can be fetched from the wrapper. The outer module processes this data first and, if more data is required, simply increments the address; otherwise it sends a read done signal to reset the internal Keccak state. If the required output length is more than the shake rate, then after reading enough data (= shake rate), the outer module sets mode = 2 to start another squeeze operation and then reads the data using mode = 3. Such an approach allows us to call squeeze only when required, once all the available Keccak output has been exhausted. This saves clock cycles as well as area, as we do not need to pre-compute and store extra output bytes, for example in the case of rejection sampling.

E. Sampling Modules

Dilithium requires four different types of sampling logic to generate the matrix A, the vectors s1, s2, the challenge c and the masking vector y1. Even though all of them have different sampling logic, the coefficients are always sampled from the output of an XOF variant. Hence, the sampling modules in our design only contain the sampling logic and the control logic to start and fetch coefficients from the Keccak wrapper as discussed in III-D.

    t0 = buf[pos] & 0x0F;
    if (t0 < 15) {
        t0 = t0 - (205 * t0 >> 10) * 5;
        a[ctr++] = 2 - t0;
    }

Listing 1. Rejection Eta

Listing 1 shows a small code snippet from the software reference implementation for rejection sampling. It is used to generate the vectors s1 and s2 from the output of SHAKE-128. One can note that an accepted value of t0 can only be between 0 and 14. Thus, the whole computation (requiring multiple multiplication and subtraction operations) can be completely avoided, and the possible output values for the polynomial coefficient can be pre-computed and stored in a LUT. This saves a considerable amount of resources. The same logic is used for another value t1. Similar optimizations are used for the other sampling modules wherever possible, along with pipelining for good performance.

F. Combined Power2Round and Decompose

Fig. 4 shows the combined module for Power2Round and Decompose. As can be seen from the figure, the controller unit is shared between both modules, thus saving about 180 LUTs and 120 FFs. The controller unit is used to enable the operation depending on the mode and to generate address and write enable signals after receiving done from the respective module. For simplicity we have not shown the individual enable and done signals in the diagram.

[Fig. 4. Power2Round and Decompose module. Power2Round path: (1≪(D-1)) - 1, ≫D, ≪D, outputs a0 and a1. Decompose path: 127, ≫7, ≪10, 1≪21, ≫22, LUT [3:0], (Q-1)/2, ≫31, &Q. Shared controller: input address counter, output address counter and write enable logic (addr_a, addr_a0, addr_a1, we_a0, we_a1, mode, done).]

The software reference implementation of the decompose module for Dilithium-V requires two multiplication operations to obtain the high-order and low-order bits of the polynomial coefficient. One is (a1*1025) and the other is (a1*2*GAMMA2), where GAMMA2 = 261888, resulting in an expensive implementation for Decompose. In order to prevent DSP usage for the two cases, we used two different strategies. We realized the first multiplication using an addition, as (a1*1025) is equivalent to (a1*1024 + a1), and (a1*1024) is very efficient in hardware because it can be realized by shift logic. For the second multiplication, we used a small 4-bit lookup table (LUT), as the input a1 can only have 16 possible values. We pre-computed the output of (a1*2*GAMMA2) for all 16 possible values and used the LUT whenever required. The rest of the operations are mostly either shift operations or addition/subtraction with a constant value.
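The two pre-computation ideas of Sections III-E and III-F can be sketched in C as follows. This is an illustrative software model, not the authors' RTL: the names `build_luts`, `mul1025` and `decompose_lut` are hypothetical, and the decompose arithmetic is an assumption that mirrors the constants visible in Fig. 4 (127, ≫7, ≪10, 1≪21, ≫22, (Q-1)/2, ≫31, &Q).

```c
#include <stdint.h>

#define Q      8380417   /* Dilithium modulus, 2^23 - 2^13 + 1          */
#define GAMMA2 261888    /* (Q - 1) / 32, the value quoted in the text  */

int32_t rej_eta_lut[15]; /* 2 - (t0 mod 5) for the accepted t0 = 0..14  */
int32_t decomp_lut[16];  /* a1 * 2 * GAMMA2 for the 16 values of a1     */

/* Fill both tables once; in hardware these would be constant LUTs. */
void build_luts(void)
{
    for (int32_t t0 = 0; t0 < 15; t0++)
        rej_eta_lut[t0] = 2 - (t0 - (205 * t0 >> 10) * 5); /* Listing 1 */
    for (int32_t a1 = 0; a1 < 16; a1++)
        decomp_lut[a1] = a1 * 2 * GAMMA2;
}

/* a1 * 1025 without a multiplier: one shift and one addition. */
int32_t mul1025(int32_t a1)
{
    return (a1 << 10) + a1;
}

/* Split a (0 <= a < Q) into high bits a1 (returned) and low bits *a0.
 * Relies on arithmetic right shift of negative values, as the software
 * reference code does. Call build_luts() before first use. */
int32_t decompose_lut(int32_t *a0, int32_t a)
{
    int32_t a1 = (a + 127) >> 7;
    a1 = (mul1025(a1) + (1 << 21)) >> 22;
    a1 &= 15;                                /* 4-bit LUT index         */
    *a0 = a - decomp_lut[a1];                /* replaces a1*2*GAMMA2    */
    *a0 -= (((Q - 1) / 2 - *a0) >> 31) & Q;  /* centre the remainder    */
    return a1;
}
```

With the tables in place, neither the rejection sampler nor Decompose needs a multiplier: `rej_eta_lut[t0]` replaces the arithmetic of Listing 1, and `decompose_lut` uses only shifts, additions and one table lookup.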
Because of the optimizations employed, the resource utilization of decompose is significantly reduced, to only about 120 LUTs and 203 FFs. Power2Round is a very simple module requiring only one addition and one subtraction operation. We have performed such control logic unification with multiple modules and benefited in terms of area.

G. NTT/INTT

The NTT/INTT is one of the most computationally expensive operations in Dilithium. Fig. 5 shows the basic architecture utilized in this implementation. As the NTT operations in Dilithium allow implementations with two simultaneous butterfly operations, we have utilized two 64×256 dual-port memories attached to a dual butterfly unit. This allows two butterfly operations to be performed per clock cycle. The dual butterfly unit internally utilizes two multiplication-reduction units as described in Section III-H.

[Fig. 5. NTT/INTT Architecture. Blocks: Dual-port RAM A (64×256), Dual-port RAM B (64×256), Dual Butterfly Unit, Address Generation and Control.]

The NTT operation is performed in three stages. First, the data from external memory is copied over to RAM-A. Second, the data in RAM-A is processed (128 butterfly operations) and written to RAM-B, and vice versa, until all the layers have been processed. Finally, the result is written back to the external RAM. The entire operation requires multiple addresses, write enables and zetas to be generated and provided to various modules. This is performed by the address generation and control state machine. At the end of the INTT, the required scaling by 1/n is performed by the butterfly unit as well, so no additional resources or clock cycles are needed.

H. Modular Reductions

The work in [28], [29] proposed an efficient method for modular reduction in hardware for NewHope-NIST and CRYSTALS-Kyber. In this work, we followed a similar approach for efficient modular reduction in Dilithium. In order to compute a mod q for a 46-bit value a, we used the Dilithium modulus property $2^{23} \equiv 2^{13} - 1 \pmod{8380417}$ recursively. The corresponding equations are as shown below:

$$
\begin{aligned}
a &\equiv 2^{23} a[45:23] + a[22:0] \\
  &\equiv 2^{13} a[45:23] - a[45:23] + a[22:0] \\
  &\equiv 2^{23} a[45:33] + 2^{13} a[32:23] - a[45:23] + a[22:0] \\
  &\equiv 2^{13} a[45:33] - a[45:33] + 2^{13} a[32:23] - a[45:23] + a[22:0] \\
  &\equiv 2^{23} a[45:43] + 2^{13} a[42:33] + 2^{13} a[32:23] - (a[45:33] + a[45:23]) + a[22:0] \\
  &\equiv 2^{13} (a[45:43] + a[42:33] + a[32:23]) - (a[45:43] + a[45:33] + a[45:23]) + a[22:0] \\
  &\equiv 2^{13} c - e + a[22:0] \pmod{q}
\end{aligned}
$$

where $c = a[45:43] + a[42:33] + a[32:23]$ and $e = a[45:43] + a[45:33] + a[45:23]$. Using the same modulus property again, $2^{13} c$ can be further reduced as:

$$
\begin{aligned}
2^{13} c &\equiv 2^{23} c[11:10] + 2^{13} c[9:0] \\
         &\equiv 2^{13} (c[11:10] + c[9:0]) - c[11:10] \\
         &\equiv 2^{13} f - c[11:10] \pmod{q}
\end{aligned}
$$

where $f = c[11:10] + c[9:0]$. Further reduction is possible for f as shown below:

$$
\begin{aligned}
2^{13} c &\equiv 2^{23} f[10] + 2^{13} f[9:0] - c[11:10] \\
         &\equiv 2^{13} (f[10] + f[9:0]) - (f[10] + c[11:10]) \pmod{q}
\end{aligned}
$$

[Fig. 6. Efficient Modular Reduction Module. Inputs IN A and IN B (23 bits each) feed a multiplier producing a 46-bit product; bit slices of the product are shifted by 13 (≪13), combined, and corrected by q or 2q in the RED stage to produce the 23-bit output OUT.]

Fig. 6 shows the implemented reduction module. The extra multiplication operation required before modular reduction is also integrated in this module. In order to reduce the input to 23 bits before multiplication, we first perform an initial reduction using the same recursive property. This saved 2 DSP resources per module. The back-to-back flip-flops correspond to the pipeline delays required for an optimal DSP implementation. Two such modules are instantiated in hardware to speed up operations.

IV. RESULTS AND PERFORMANCE COMPARISON

In this section, we present resource utilization and performance results for both FPGA and ASIC implementations. We also compare our results with the state-of-the-art implementations for Dilithium-V. The implemented design is realized in Verilog and the results are presented after synthesis and place-and-route using Vivado 2020.1 targeting two platforms, Xilinx Artix-7 (XC7A200T-2) and Zynq UltraScale+ (XCZU9EG-2). The correctness of the implementation is verified using the KAT provided in the NIST package. From here on, Artix-7 is referred to as A7 and Zynq UltraScale+ as ZUS+. For ASIC, we synthesized the design for TSMC 65nm process technology.
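As a software cross-check of the reduction derived in Section III-H (a sketch, independent of the tool flow described in this section), the C fragment below folds a 46-bit product exactly as in the equations and uses it inside a generic radix-2 butterfly. The function names are hypothetical, the final correction is written as a simple loop rather than the q/2q selection suggested by Fig. 6, and the Cooley-Tukey form of the butterfly is an assumption, since the internals of the dual butterfly unit are not detailed here.

```c
#include <stdint.h>

#define Q 8380417  /* 2^23 - 2^13 + 1 */

/* Reduce a < 2^46 modulo Q using 2^23 = 2^13 - 1 (mod Q) recursively. */
uint32_t reduce46(uint64_t a)
{
    uint32_t a45_43 = (uint32_t)(a >> 43) & 0x7;
    uint32_t a42_33 = (uint32_t)(a >> 33) & 0x3FF;
    uint32_t a32_23 = (uint32_t)(a >> 23) & 0x3FF;
    uint32_t a45_33 = (uint32_t)(a >> 33) & 0x1FFF;
    uint32_t a45_23 = (uint32_t)(a >> 23) & 0x7FFFFF;
    uint32_t a22_0  = (uint32_t)a & 0x7FFFFF;

    uint32_t c = a45_43 + a42_33 + a32_23;   /* fits in 12 bits     */
    uint32_t e = a45_43 + a45_33 + a45_23;
    uint32_t f = (c >> 10) + (c & 0x3FF);    /* c[11:10] + c[9:0]   */
    uint32_t g = (f >> 10) + (f & 0x3FF);    /* f[10] + f[9:0]      */
    uint32_t h = (f >> 10) + (c >> 10);      /* f[10] + c[11:10]    */

    /* a = 2^13 * g - h - e + a[22:0] (mod Q) */
    int64_t r = ((int64_t)g << 13) - h - e + a22_0;
    while (r < 0)  r += Q;                   /* at most two steps   */
    while (r >= Q) r -= Q;
    return (uint32_t)r;
}

/* Butterfly on a, b in [0, Q): (a, b) -> (a + zeta*b, a - zeta*b) mod Q. */
void ct_butterfly(uint32_t *a, uint32_t *b, uint32_t zeta)
{
    uint32_t t  = reduce46((uint64_t)zeta * *b);
    uint32_t lo = *a + t;
    uint32_t hi = *a + Q - t;
    if (lo >= Q) lo -= Q;
    if (hi >= Q) hi -= Q;
    *a = lo;
    *b = hi;
}
```

Only shifts, masks and small additions are applied to the product, which is the property that lets the hardware module avoid a general-purpose divider.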
We used Synopsys Design Compiler version R-2020.09-SP5 for synthesis and Cadence Innovus version V19.10-P002 for place-and-route of the design. We used a CCS-based standard cell library for accurate results. Synopsys DesignWare library components were used wherever applicable.

A. Resource Utilization

Table I shows the resource utilization for the major components required in Dilithium. Through extensive resource sharing between the modules, we obtained a significant reduction in overall resource utilization. One can note that the resource utilization for Sampling is quite low compared to the other existing implementations. This is mainly because the different Sampling modules contain only the minimal arithmetic and control logic, with the XOFs shared between them. The slice register usage is somewhat higher in our case because of the deep pipelines needed for good performance on both FPGA and ASIC. The various modules were individually optimized for hardware using multiple strategies wherever applicable.

TABLE I
RESOURCE UTILIZATION FOR FPGA AND ASIC. THE ASIC RESOURCES ARE REPORTED IN GATE EQUIVALENTS (GEs).

                     ------------- FPGA -------------   ASIC
Sub-module           LUT        F/F        DSP   BRAM   GEs       Reference
MakeHint             2389       740        0     0      -         [15]
                     67         85         -     -      1423.7    This Work
UseHint              6453       2808       0     0      -         [15]
                     186        279        0     0      4740      This Work
Encode + Pack        1626       461        0     0      -         [15]
                     650        603        0     0      8884      This Work
Decode + Unpack      2189       239        0     0      -         [15]
                     694        568        0     0      9458.3    This Work
Decompose            1437       680        0     0      -         [15]
                     120        203        0     0      3028.7    This Work
NTT + PolyArith      4509 × 2   3146 × 2   16    0      -         [15]
                     5676       1218       41    1      -         [16]
                     2759       2037       4     7      40182     This Work
SampleA + SampleS    3548       1015       0     0      -         [15]
                     1479       189        -     -      -         [16]
                     360        355        0     0      4710.3    This Work
SampleY              2220       630        0     0      -         [15]
                     469        48         -     -      -         [16]
                     99         199        0     0      2353      This Work
SampleC              1856       868        0     0      -         [15]
                     384        662        -     -      -         [16]
                     289        244        0     0      3782.7    This Work
Keccak               5483 × 3   4451 × 3   0     0      -         [15]
                     3708       1623       -     -      -         [16]
                     4202       1800       0     0      42155.3   This Work

B. FPGA Implementation

Table II presents detailed results for our implementation as well as existing implementations for Dilithium-V. In our case, the combined architecture (for KeyGen, Sign and Verify) requires about 13.9k LUTs, 6.8k slice registers, 35 BRAMs and 4 DSPs. Due to a deeply pipelined architecture, our design can run at a maximum frequency of 163 MHz on Artix-7 and 391 MHz on Zynq UltraScale+. As a result, we require 387µs, 699µs and 416µs for the KeyGen, Sign and Verify operations on Artix-7. For Zynq UltraScale+, the respective operations can be finished in 161µs, 291µs and 173µs. We also report the number of operations that can be performed per second (OP/s) for all the implementations.
Currently, to the best of our knowledge, there are four known hardware implementations which report results for Dilithium Security Level V. Hence, we compare our results with only these implementations.
1) Comparison with [15]: The work by Beckwith et al. presents results for multiple platforms as well as for all the security levels. For a fair comparison, we compare our results on the same FPGA platform. The authors of [15] targeted a high-performance implementation. As a result, they can perform KeyGen, Sign and Verify in 121µs, 210µs and 126µs, which is about 3.2×, 3.33× and 3.30× faster than our implementation. Even though the performance achieved is better, their design requires 3.81×, 4.14× and 4× the LUTs, FFs and DSPs of our design. This is because their architecture utilizes multiple cores to perform expensive operations such as NTT and Keccak, whereas we use only one core for almost all the modules. Thus, the higher latency of our design is compensated by low area utilization as well as a higher achievable frequency. As a result, their Area×Time product is higher (worse) than ours by 18.8%, 14.3% and 15.3% for the KeyGen, Sign and Verify operations respectively.
2) Comparison with [16]: The authors of [16] aim to reduce the area in terms of LUTs and FFs by utilizing more DSP resources in their architecture. Their design requires 3.2×, 2.02× and 11.25× the LUTs, FFs and DSP resources of our design and about 1.13× fewer BRAMs. Even though our implementation has slightly higher latency for the three operations, the overall improvement in terms of efficiency (Area×Time) is quite high. Compared to ours, their design has a worse Area×Time trade-off by 200%, 129% and 188% for the KeyGen, Sign and Verify operations respectively.
3) Comparison with [13]: Zhao et al. proposed a compact and high-performance hardware design. They applied several optimizations, such as segmented pipeline processing, to achieve a high-performance architecture. As a result, they can perform KeyGen/Sign/Verify in 90µs/505µs/93µs, which is 4.3×/1.38×/4.47× lower than the execution time of our design. One can note that the improvement in execution time for Sign is not that significant. We believe this is because the Sign operation does not offer much scope for parallel execution of operations. Further, even though their design offers high performance, it comes at the expense of consuming large amounts of resources. Compared to our design, their implementation requires 114.7% and 51% more LUTs and FFs. As a result, the Area×Time trade-off is better in our case for the Sign operation, whereas it is better in [13] for KeyGen and Verify.
4) Comparison with [14]: The design by Aikata et al. presents a unified architecture for Dilithium and Saber. The authors of [14] target a low-area implementation.
Keccak core and the polynomial multiplier are shared between the two algorithms. However, many specialized modules are required just for Dilithium, and similarly for Saber. As a result, their implementation requires 31.7% and 36.2% more LUTs and FFs compared to our design, but the BRAM utilization is comparatively low in their implementation. One interesting thing to note is that our design can run at about twice the clock frequency of their design (391 MHz versus 200 MHz in Table II). As a result, we are able to achieve an improvement of 17%, 14.9% and 24.5% in the execution time of the KeyGen, Sign and Verify operations. Further, their Area×Time product is higher (worse) than ours by 57.9%, 54.8% and 73.3% for the respective operations.

Efficiency Metrics. In order to collate all the results, we compare our work with the existing implementations using two metrics: the Area×Time (LUTs×s) trade-off (shown in Fig. 7) and the number of operations (KeyGen/Sign/Verify) that can be performed per second per LUT (shown in Fig. 8). Both metrics are used to quantify the efficiency of an implementation. For the former, a lower value denotes a better design, whereas for the latter, a higher value is better.

[Fig. 8: bar chart of (OP/s)/LUT (×10^−3; higher is better) for KeyGen, Sign and Verify. Series: This Work (A7), Beckwith et al. (A7), Land et al. (A7), Zhao et al. (A7), Beckwith et al. (K7), Beckwith et al. (VUS+), Aikata et al. (ZUS+), This Work (ZUS+).]
Fig. 8. Operations/second/LUT: On Zynq Ultrascale+, we achieve an improvement of about 58.78%, 54.72% and 73.84% for the KeyGen, Sign and Verify operations over the best known state-of-the-art implementation [14]. On Artix-7, the improvement is about 54.5% for the Sign operation. The improvement on ZUS+ is quite high for all the operations because our design is better than the existing implementation in terms of both area and performance.

C. Post-layout ASIC Implementation

TABLE II
PERFORMANCE AND COMPARISON FOR DILITHIUM-V FOR BEST-CASE SCENARIO (SIGNATURE IS VALID AFTER THE FIRST ITERATION) ON FPGA AND ASIC. THE AREA FOR ASIC EXCLUDES ON-CHIP MEMORY AND IS REPORTED IN mm² AND GATE EQUIVALENTS (GEs)

FPGA Results
Reference | Family | Freq. (MHz) | LUT | FF | RAM | DSP | KeyGen Cycles (×10³) | KeyGen Time (µs) | KeyGen OP/s | Sign Cycles (×10³) | Sign Time (µs) | Sign OP/s | Verify Cycles (×10³) | Verify Time (µs) | Verify OP/s
This Work | Artix-7 | 163 | 13,975 | 6,845 | 35 | 4 | 63.2 | 387 | 2580 | 113.9 | 699 | 1430 | 67.9 | 416 | 2401
Beckwith et al. [15] | Artix-7 | 116 | 53,187 | 28,318 | 29 | 16 | 14.0 | 121 | 8263 | 24.4 | 210 | 4762 | 14.6 | 126 | 7922
Land et al. [16] | Artix-7 | 140 | 44,653 | 13,814 | 31 | 45 | 51.0 | 364 | 2746 | 70.4 | 503 | 1989 | 52.7 | 377 | 2656
Zhao et al. [13] | Artix-7 | 96.9 | 29,998 | 10,336 | 11 | 10 | 8.8 | 90 | 11055 | 49.0 | 505 | 1977 | 9.0 | 93 | 10720
Beckwith et al. [15] | Kintex-7 | 173 | 54,468 | 28,639 | 29 | 16 | 14.0 | 81 | 12324 | 24.4 | 141 | 7102 | 14.6 | 85 | 11815
This Work | Zynq Ultrascale+ | 391 | 13,975 | 6,845 | 35 | 4 | 63.2 | 161 | 6189 | 113.9 | 291 | 3431 | 67.9 | 173 | 5759
Aikata et al. [14] | Zynq Ultrascale+ | 200 | 18,406 | 9,323 | 24 | 4 | 38.8 | 194 | 5149 | 68.5 | 342 | 2920 | 45.8 | 229 | 4368
Beckwith et al. [15] | Virtex Ultrascale+ | 256 | 53,907 | 28,435 | 29 | 16 | 14.0 | 55 | 18238 | 24.4 | 95 | 10509 | 14.7 | 57 | 17483

ASIC Results
Reference | Family | Freq. (MHz) | Area (mm²) | Area (kGE) | KeyGen Cycles (×10³) | KeyGen Time (µs) | KeyGen OP/s | Sign Cycles (×10³) | Sign Time (µs) | Sign OP/s | Verify Cycles (×10³) | Verify Time (µs) | Verify OP/s
This Work | TSMC 65nm | 1176 | 0.227 | 157 | 63.2 | 53.7 | 18614 | 113.9 | 96.9 | 10320 | 67.9 | 57.7 | 17322
Aikata et al. [14] | UMC 65nm | 400 | 0.317 | 220 | 38.8 | 97.1 | 10298 | 68.5 | 171.3 | 5839 | 45.8 | 114.5 | 8735
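The Time and OP/s columns of Table II, and the two efficiency metrics defined under Efficiency Metrics, follow directly from the cycle counts, clock frequencies and LUT counts in the table. The Python sketch below is not part of the original work; it recomputes these quantities for the Artix-7 and ASIC rows. The class and method names are illustrative assumptions, and small deviations from the published numbers are expected because the cycle counts in Table II are rounded to 0.1×10³.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

OPERATIONS = ("KeyGen", "Sign", "Verify")


@dataclass(frozen=True)
class Design:
    """One row of Table II; cycle counts are given in thousands."""
    name: str
    freq_mhz: float
    kcycles: Tuple[float, float, float]  # (KeyGen, Sign, Verify)
    luts: Optional[int] = None           # only available for FPGA rows

    def time_us(self, op: str) -> float:
        # Latency = cycles / frequency; kilo-cycles * 1e3 / MHz yields µs.
        return self.kcycles[OPERATIONS.index(op)] * 1e3 / self.freq_mhz

    def ops_per_second(self, op: str) -> float:
        return 1e6 / self.time_us(op)

    def area_time(self, op: str) -> float:
        # Area x Time in LUTs x seconds (lower is better).
        if self.luts is None:
            raise ValueError(f"{self.name}: no LUT count for this row")
        return self.luts * self.time_us(op) * 1e-6

    def ops_per_second_per_lut(self, op: str) -> float:
        # Throughput normalised by area (higher is better).
        if self.luts is None:
            raise ValueError(f"{self.name}: no LUT count for this row")
        return self.ops_per_second(op) / self.luts


# Values copied from the Artix-7 rows of Table II.
ARTIX7 = [
    Design("This Work", 163.0, (63.2, 113.9, 67.9), 13975),
    Design("Beckwith et al. [15]", 116.0, (14.0, 24.4, 14.6), 53187),
    Design("Land et al. [16]", 140.0, (51.0, 70.4, 52.7), 44653),
    Design("Zhao et al. [13]", 96.9, (8.8, 49.0, 9.0), 29998),
]

# Values copied from the ASIC rows of Table II (no LUT count applies).
ASIC = [
    Design("This Work (TSMC 65nm)", 1176.0, (63.2, 113.9, 67.9)),
    Design("Aikata et al. [14] (UMC 65nm)", 400.0, (38.8, 68.5, 45.8)),
]

if __name__ == "__main__":
    for d in ARTIX7:
        for op in OPERATIONS:
            print(f"{d.name:22s} {op:6s} "
                  f"time={d.time_us(op):7.1f} us  "
                  f"AxT={d.area_time(op):6.2f} LUT*s  "
                  f"(OP/s)/LUT={d.ops_per_second_per_lut(op) * 1e3:6.1f}e-3")
    ours, other = ASIC
    for op in OPERATIONS:
        ratio = other.time_us(op) / ours.time_us(op)
        print(f"ASIC {op:6s} runtime improvement over [14]: {ratio:.2f}x")
```

Keeping the raw cycle counts and frequencies as the only inputs makes it easy to add further rows of Table II (for example the Zynq Ultrascale+ entries) and compare platforms with the same two metrics.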
Table II shows the results for the ASIC implementation.
Our design requires an area of about 0.227 mm² (≈ 157k GEs) after place-and-route, excluding on-chip memory. Because of the highly pipelined architecture, the design can run at 1.176 GHz. As a result, KeyGen/Sign/Verify take only 53.7µs/96.9µs/57.7µs. Compared to the state-of-the-art implementation [14], we achieve a reduction of 1.4× in area and improve the runtime of KeyGen/Sign/Verify by 1.81×/1.77×/1.98×. The physical layout of the implemented design after place-and-route is shown in Fig. 9. We have highlighted the regions for individual modules in Dilithium. One can see that the majority of the area in the design is consumed by the Keccak, Sampling, Reduction and NTT modules, with Keccak being the largest. In our design, we use 2 instantiations of the reduction module (marked as Reduction-1 and Reduction-2). The remainder of the space is utilized by the multiplexers connecting the various modules and the FSM-based control logic.

[Fig. 7: bar chart of Area × Time (LUTs × s; lower is better) for KeyGen, Sign and Verify. Series: This Work (A7), Zhao et al. (A7), Beckwith et al. (A7), Beckwith et al. (K7), Land et al. (A7), This Work (ZUS+), Aikata et al. (ZUS+), Beckwith et al. (VUS+).]
Fig. 7. Area×Time (LUTs×s) trade-off: Our design has a better Area×Time trade-off on Ultrascale+ than the existing implementations. On A7, the implementation by Zhao et al. [13] has a better Area×Time trade-off for KeyGen and Verify, whereas our design is better for the Sign operation. One interesting thing to note is that our result on Zynq Ultrascale+ is the best across all platforms.

V. HARDWARE EVALUATION AS AN ACCELERATOR
To demonstrate the effectiveness of the developed hardware
accelerator, we evaluated our design on three test platforms:
TABLE III
PERFORMANCE AS AN ACCELERATOR IN THE BEST-CASE SCENARIO (SIGNATURE IS VALID IN THE FIRST ITERATION) AND IN THE WORST-CASE SCENARIO (REJECTION LOOP IS EXECUTED 17 TIMES BEFORE A VALID SIGNATURE IS GENERATED). THE NUMBERS ARE CPU CLOCK CYCLES (×10³) ELAPSED WHEN THE OPERATION IS EXECUTED ON HARDWARE (HW) VERSUS ON SOFTWARE (SW).

FPGA | Processor | KeyGen HW | KeyGen SW | KeyGen Speedup | Sign (best-case) HW | Sign (best-case) SW | Sign (best-case) Speedup | Sign (worst-case) HW | Sign (worst-case) SW | Sign (worst-case) Speedup | Verify HW | Verify SW | Verify Speedup
XC7A75T-2 | Microblaze | 48.2 | 12613.3 | 261.7 | 85.9 | 16210.8 | 188.7 | 796.7 | 84363.9 | 105.9 | 51.4 | 12857.7 | 250.1
XC7Z010-1 | ARM Cortex-A9 | 337.5 | 5079.6 | 15.05 | 607.9 | 6907.6 | 11.36 | 5661.3 | 39156.1 | 6.92 | 362.2 | 5248.9 | 14.49
XCZU3EG-1 | ARM Cortex-A53 | 254.1 | 2134.5 | 8.4 | 456.3 | 3747.8 | 8.21 | 4245.9 | 27240.7 | 6.42 | 271.7 | 2347.7 | 8.64

Artix-7 (XC7A75T-2) with a Microblaze@100MHz processor, Zynq-7000 (XC7Z010-1) with an ARM Cortex-A9@667MHz and Zynq Ultrascale+ (XCZU3EG-1) with an ARM Cortex-A53@1200MHz. It is important to quantify the practical benefits of a hardware accelerator in a complete system, as in many situations the full potential of an accelerator might not be achieved. Fig. 10 shows the setup for the Zynq Ultrascale+ platform using the Ultra96 board.

Fig. 9. Physical layout of the Dilithium-V Core after place-and-route.

Fig. 10. Setup with Ultra96-v2 (XCZU3EG-1) as one of the hardware evaluation boards.

Fig. 11 shows the block diagram of the accelerator connected as an AXI peripheral, applicable to both Zynq platforms. A very similar block design is used for the Microblaze setup as well. We created an AXI4-Lite memory-mapped, register-based peripheral and implemented support for control and data logic to connect with the designed Dilithium core. The AXI packet handler is responsible for exchanging AXI commands with the CPU core and decoding them. Based on the received command, it sends the corresponding operation request to the state handler. It also communicates with the data handler for address and data read/write operations. Both the state and data handlers act as a bridge between the AXI packet handler and the Dilithium accelerator. Since the accelerator is designed to be configurable from outside, one can also use it to accelerate only individual operations such as KeyGen, Sign or Verify.

[Fig. 11: block diagram. Blocks: ARM Core, Central Interconnect, AXI4 DDR3 Memory Controller, Clock Generator, AXI4-Lite Interconnect, AXI Packet Handler, State Handler, Data Handler, Dilithium Hardware Accelerator; regions: PL and ZYNQ PS. Signals: operation mode, operation done, initial seed, public key, secret key, in msg len, signature, verify fail, message.]
Fig. 11. Dilithium as an Accelerator on Xilinx Zynq. The same AXI peripheral IP was used for the Microblaze processor as well. Because it connects through the AXI4-Lite bus, the accelerator can easily be connected to other processors as well, such as RISC-V.

In Table III, we report results for the hardware accelerator. For comparison, we chose a best-case scenario, where a valid signature is generated after the first iteration itself, and a worst-case scenario, where the valid signature is generated after the rejection loop is executed 17 times. It is clear from the table that the speedup obtained is inversely related to the performance of the associated CPU. For modern high-performance CPUs such as the ARM Cortex-A53 and Cortex-A9, we obtain speedups of about 6.42× to 15.05×, whereas for Microblaze the speedup is significantly higher at 105-261×.

VI. CONCLUSION AND FUTURE WORK
By utilizing multiple optimization strategies such as resource and control logic sharing, pre-computed LUTs and fusion of modules, we achieved a reduction of 24% in area (LUTs and FFs) compared to the best known implementation. As a result, our design requires only 13.9k LUTs, 6.8k FFs, 4 DSPs and 35 BRAMs. To the best of our knowledge, this work presents the smallest hardware accelerator for CRYSTALS-Dilithium, which can now fit into the smallest Zynq FPGA. We also present a detailed comparison with existing implementations and show that our design achieves more than 35% better efficiency in terms of the Area×Time product on Zynq UltraScale+ for all the operations.
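As a rough cross-check of these headline figures, the reductions can be recomputed from the Zynq Ultrascale+ rows of Table II (this work versus [14]). The worked arithmetic below is a sketch, not part of the original text: it uses the rounded LUT, FF and latency values from the table, and reading the "35%" claim as a reduction in the Area×Time product is an interpretation. The `align*` environment and `\text` assume the amsmath package.

```latex
\begin{align*}
\text{LUT reduction} &= 1 - \frac{13975}{18406} \approx 24.1\% \\
\text{FF reduction}  &= 1 - \frac{6845}{9323} \approx 26.6\% \\
\text{Area}\times\text{Time reduction (KeyGen)} &= 1 - \frac{13975 \times 161\,\mu\text{s}}{18406 \times 194\,\mu\text{s}} \approx 37.0\% \\
\text{Area}\times\text{Time reduction (Sign)}   &= 1 - \frac{13975 \times 291\,\mu\text{s}}{18406 \times 342\,\mu\text{s}} \approx 35.4\% \\
\text{Area}\times\text{Time reduction (Verify)} &= 1 - \frac{13975 \times 173\,\mu\text{s}}{18406 \times 229\,\mu\text{s}} \approx 42.6\%
\end{align*}
```

Under these assumptions, all three operations stay above the 35% mark, with Sign being the closest to it.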
We also implemented our design on ASIC and report numbers for the major components of Dilithium as well as the overall design. Compared to the state-of-the-art implementation, our design requires about 157 kGE with 0.227 mm² area, achieving a reduction of about 1.4×. In terms of performance, the implemented design can run at 1.176 GHz, achieving an improvement of about 2.95×.

Further, this work presents the first hardware evaluation of complete Dilithium-V as an accelerator on three different platforms and demonstrates that the achieved speedup is significantly high: about 105-261× for Microblaze and about 6.42-15.05× for ARM Cortex, compared to the software implementations on these platforms.

In the future, we plan to integrate low-cost side-channel countermeasures in the design.

REFERENCES
[1] J. Chow, O. Dial, and J. Gambetta, “IBM quantum breaks the 100-qubit processor barrier,” IBM Research Blog, available at https://research.ibm.com/blog/127-qubit-quantum-process-or-eagle, 2021.
[2] C. Wang, X. Li, H. Xu, Z. Li, J. Wang, Z. Yang, Z. Mi, X. Liang, T. Su, C. Yang et al., “Towards practical quantum computers: transmon qubit with a lifetime approaching 0.5 milliseconds,” npj Quantum Information, vol. 8, no. 1, pp. 1–6, 2022.
[3] Z. Wang, S. Wei, and G. Long, “A quantum circuit design of AES,” arXiv preprint arXiv:2109.12354, 2021.
[4] J. Zou, Z. Wei, S. Sun, Y. Luo, Q. Liu, and W. Wu, “Some efficient quantum circuit implementations of Camellia,” Quantum Information Processing, vol. 21, no. 4, pp. 1–27, 2022.
[5] C. Gidney and M. Ekerå, “How to factor 2048 bit RSA integers in 8 hours using 20 million noisy qubits,” Quantum, vol. 5, p. 433, 2021.
[6] NIST, “Submission requirements and evaluation criteria for the post-quantum cryptography standardization process,” http://csrc.nist.gov/groups/ST/post-quantum-crypto/documents/call-for-proposals-final-dec-2016.pdf.
[7] W. Beullens, “Breaking Rainbow takes a weekend on a laptop,” Cryptology ePrint Archive, 2022.
[8] W. Beullens, “Improved cryptanalysis of UOV and Rainbow,” in Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 2021, pp. 348–373.
[9] D. Pokorný, P. Socha, and M. Novotný, “Side-channel attack on Rainbow post-quantum signature,” in 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2021, pp. 565–568.
[10] C. Tao, A. Petzoldt, and J. Ding, “Improved key recovery of the HFEv- signature scheme,” Cryptology ePrint Archive, 2020.
[11] L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, P. Schwabe, G. Seiler, and D. Stehlé, “CRYSTALS-Dilithium: A lattice-based digital signature scheme,” IACR Transactions on Cryptographic Hardware and Embedded Systems, pp. 238–268, 2018.
[12] S. Ricci, L. Malina, P. Jedlicka, D. Smékal, J. Hajny, P. Cibik, P. Dzurenda, and P. Dobias, “Implementing CRYSTALS-Dilithium signature scheme on FPGAs,” in The 16th International Conference on Availability, Reliability and Security, 2021, pp. 1–11.
[13] C. Zhao, N. Zhang, H. Wang, B. Yang, W. Zhu, Z. Li, M. Zhu, S. Yin, S. Wei, and L. Liu, “A compact and high-performance hardware architecture for CRYSTALS-Dilithium,” IACR Transactions on Cryptographic Hardware and Embedded Systems, pp. 270–295, 2022.
[14] A. C. Mert, D. Jacquemin, A. Das, D. Matthews, S. Ghosh, S. S. Roy et al., “A unified cryptoprocessor for lattice-based signature and key-exchange,” Cryptology ePrint Archive, 2021.
[15] L. Beckwith, D. T. Nguyen, and K. Gaj, “High-performance hardware implementation of lattice-based digital signatures,” Cryptology ePrint Archive, 2022.
[16] G. Land, P. Sasdrich, and T. Güneysu, “A hard crystal-implementing Dilithium on reconfigurable hardware,” Cryptology ePrint Archive, 2021.
[17] K. Basu, D. Soni, M. Nabeel, and R. Karri, “NIST post-quantum cryptography-a hardware evaluation study,” Cryptology ePrint Archive, 2019.
[18] D. Soni, K. Basu, M. Nabeel, N. Aaraj, M. Manzano, and R. Karri, Hardware Architectures for Post-Quantum Digital Signature Schemes. Springer, 2021.
[19] D. Soni, K. Basu, M. Nabeel, and R. Karri, “A hardware evaluation study of NIST post-quantum cryptographic signature schemes,” in Second PQC Standardization Conference. NIST, 2019.
[20] Z. Zhou, D. He, Z. Liu, M. Luo, and K.-K. R. Choo, “A software/hardware co-design of CRYSTALS-Dilithium signature scheme,” ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 14, no. 2, pp. 1–21, 2021.
[21] A. F. De Abiega-L’Eglisse, K. A. Delgado-Vargas, F. Q. Valencia-Rodriguez, V. G. Gonzalez-Quiroga, G. Gallegos-Garcia, and M. Nakano-Miyatake, “Performance of New Hope and CRYSTALS-Dilithium postquantum schemes in the transport layer security protocol,” IEEE Access, vol. 8, pp. 213968–213980, 2020.
[22] J. Wright, M. Gowanlock, C. Philabaum, and B. Cambou, “A CRYSTALS-Dilithium response-based cryptography engine using GPGPU,” in Proceedings of the Future Technologies Conference. Springer, 2021, pp. 32–45.
[23] L. Sharma and A. Mishra, “Analysis of CRYSTALS-Dilithium for blockchain security,” in 2021 2nd International Conference on Secure Cyber Computing and Communications (ICSCCC). IEEE, 2021, pp. 160–165.
[24] S. Bai, L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, P. Schwabe, G. Seiler, and D. Stehlé, “CRYSTALS-Dilithium: Digital signatures from module lattices,” 2021, https://pq-crystals.org/dilithium/data/dilithium-specification-round3-20210208.pdf.
[25] M. J. Dworkin et al., “SHA-3 standard: Permutation-based hash and extendable-output functions,” 2015.
[26] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche, “Keccak sponge function family main document,” Submission to NIST (Round 2), vol. 3, no. 30, pp. 320–337, 2009.
[27] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche, “Keccak hardware implementation,” https://keccak.team/hardware.html.
[28] N. Zhang, B. Yang, C. Chen, S. Yin, S. Wei, and L. Liu, “Highly efficient architecture of NewHope-NIST on FPGA using low-complexity NTT/INTT,” IACR Transactions on Cryptographic Hardware and Embedded Systems, pp. 49–72, 2020.
[29] F. Yarman, A. C. Mert, E. Öztürk, and E. Savaş, “A hardware accelerator for polynomial multiplication operation of CRYSTALS-Kyber PQC scheme,” in 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2021, pp. 1020–1025.
