Dilithium FPGA Protocol
N. Gupta and A. Chattopadhyay, School of Computer Science and Engineering, Nanyang Technological University, Singapore.
A. Jati, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore.
G. Jha, Department of Electronics and Electrical Communication Engineering, IIT Kharagpur, India.

Abstract—The looming threat of an adversary with quantum computing capability has led to a worldwide research effort towards identifying and standardizing novel post-quantum cryptographic primitives. Post-standardization, all existing security protocols will need to support efficient implementations of these primitives. In this work, we contribute to these efforts by reporting the smallest implementation of CRYSTALS-Dilithium, a finalist candidate for post-quantum digital signatures. By invoking multiple optimizations to leverage parallelism, pre-computation and memory access sharing, we obtain an implementation that fits into one of the smallest Zynq FPGAs. On Zynq UltraScale+, our design achieves an improvement of about 36.7%/35.4%/42.3% in the Area×Time (LUTs×s) trade-off for KeyGen/Sign/Verify respectively over the state-of-the-art implementation. We also evaluate our design as a co-processor on three different hardware platforms and compare the results with the software implementation, thus presenting a detailed evaluation of CRYSTALS-Dilithium targeted at embedded applications. Further, on ASIC using TSMC 65nm technology, our design requires 0.227 mm² area and can operate at a frequency of 1.176 GHz. As a result, it requires only 53.7µs/96.9µs/57.7µs for the KeyGen/Sign/Verify operations in the best-case scenario.

Index Terms—post-quantum, cryptography, PQC, CRYSTALS-Dilithium, FPGA, hardware, ASIC, hardware accelerator

I. INTRODUCTION
The threat of an adversary with quantum computing capability is getting increasingly realistic [1], [2], with rapid growth in the capacity of quantum computers as well as optimized implementations of the current cryptographic primitives using quantum circuit simulators [3], [4]. It is predicted that current public-key primitives could be broken within hours [5], thus necessitating the search for alternative cryptographic primitives in the era of quantum computing. This is systematically undertaken by NIST through its Post-Quantum Cryptography (PQC) contest [6], which plans to roll out the winners of this contest as a standard. Naturally, there is a pressing need to study efficient hardware implementations of the PQC candidates, which not only plays an important role in the contest judgement process but also helps in the rapid adoption of the standard.
The current finalists of the NIST PQC contest include three digital signature candidates, one of which has recently been attacked [7]. Several previous works have also compromised the security of this scheme [8]–[10]. This leaves two other digital signature candidates, of which Dilithium is one. Naturally, it is of high importance to study efficient implementations of Dilithium [11]. Since digital signatures are an integral part of numerous security protocols, the use cases and the platform constraints range from high-speed servers to highly resource-constrained IoT platforms.
As a result, there have been several efforts towards optimizing Dilithium for different security parameters and different platforms such as FPGAs and ASICs, targeting pure-hardware implementations [12]–[16], HLS-based implementations [17]–[19] or software-hardware co-design [20]. Further, a few works have focused on integration in the TLS protocol [21], on a GPU accelerator [22] and on developing a quantum-secure blockchain [23].
From our study of the existing Dilithium implementations, we found that there is significant room for improvement in terms of area efficiency, which sets the motivation for this work. The main contributions of this work are:
1) We designed a lightweight hardware accelerator for CRYSTALS-Dilithium. We used multiple optimization strategies such as resource and control logic sharing, fusion of modules, pre-computed LUTs etc.
2) By achieving a reduction of about 24% in LUTs and FFs compared to the state-of-the-art implementation, this work presents the smallest hardware accelerator for Dilithium. As a result, it can now fit into one of the smallest Zynq FPGAs.
3) In our design, we have leveraged both pipelining and parallelism to achieve a good balance of performance and area. Thus, on Zynq UltraScale+, our design achieves an Area×Time improvement of more than 35% compared to the existing implementation.
4) To present a fair comparison with the existing implementations, we used two metrics: Area×Time and the number of operations that can be performed per second per LUT on a particular platform. On Zynq UltraScale+, our design outperforms the state-of-the-art. On Artix-7, our design has better performance for the signing operation.
5) Using the TSMC 65nm library, we also implemented the design on an ASIC platform and report numbers for the major modules as well as the overall design. On ASIC, our design can run at 1.176 GHz with 0.227 mm² area
w1 = Â ◦ ẑ, t1 = ĉ ◦ t̂1
Such modules are shown together in the overall design architecture in Fig. 2.
The hardware accelerator is designed using a dedicated FSM-based control unit. All the modules have separate enable and done signals and are connected to the memory controller. The control logic and sequencer unit is responsible for enabling different modules depending on the required functionality. The memory controller consists of a dual-port switch matrix and is responsible for connecting the different modules to the RAMs depending on the input and the output of the operation.

[Fig. 2: overall design architecture. Recoverable labels: Control Logic and Sequencer; signals init_seed, global_enable, msg, msg_len, done, verify_fail.]

[...] on the signature generation part. In the worst-case scenario for [...]
This huge requirement is mainly because of matrix A of size k×l×256. In case of KeyGen and Verify, one can generate [...]
[...] store the final result. One can execute these steps k times separately to save resources and generate the final output of the pointwise operation. But, in case of the Sign operation, due to multiple rejections, there is a trade-off between pre-computing [...]

C. Parallel Processing
In order to reduce clock cycles, data-independent operations can be executed in parallel. One can perform fine-grained parallel execution of operations as in [13], but this leads to significant area overheads, mainly because of increased multiplexing, additional module instances etc. We implemented parallel execution of modules wherever possible while ensuring low area.

D. SHAKE
Dilithium uses the SHAKE-128 and SHAKE-256 extendable-output functions (XOFs) of the SHA-3 [25] family, which is based around the Keccak permutation [26], for different functionalities with minor variations. For instance, SHAKE-256 and its variations are used to generate random seeds from [...]

[Fig. 3: architecture of the SHAKE wrapper module. Recoverable labels: Operation Control, Absorb Control Logic, Squeeze Control Logic, Keccak Unit, Constants Generator, Keccak-f Permutation, 1600-bit Keccak Internal State Register, XOR Logic + Input Address Controller, Padding Logic, Read Keccak State; signals shake_rate, state_addr, state_data, datain0, datain1, start_abs_sqz, start_sqz, start_perm, done_abs_sqz, done_sqz, done.]
[...] extremely useful when the number of executions for squeeze is unknown (for instance, in case of rejection sampling). Also, mode = 3 allows the Keccak state to be accessed from outer modules; as a result, we do not need extra resources to store additional copies of the 1600-bit register elsewhere, in contrast to the work in [13]. The internal state remains valid after each squeeze operation unless reset from outside. Fig. 3 shows the architecture of the implemented SHAKE wrapper module.
The interface is used to set the corresponding mode using the enable signal, to provide necessary information such as the number of bytes to absorb (input_length), the SHAKE rate (shake_rate), etc., and to communicate with the different outer modules.
The control logic for each mode is responsible for starting the corresponding operation. For example, for a SHAKE-128 operation, the outer module first sets the enable signal to mode = 1 along with the input length and SHAKE rate. The operation control decodes the mode and transfers control to the absorb control logic. This starts the absorb phase by reading in the domain separator and input. The XORing of the input with the Keccak state and the generation of the corresponding input addresses are handled by the XOR Logic + Input Address Controller. In our design, we absorb data in chunks of 64 bits per clock cycle. The absorb control logic is responsible for monitoring the absorbed input length and starts the permutation if required (when the input message length is greater than the SHAKE rate). Further, when the input is completely absorbed, it enables the padding logic to finalize the absorbed block. The absorb control logic then starts a squeeze operation by setting the start_sqz signal. The squeeze block starts the Keccak permutation and waits for it to finish execution using a dedicated counter (25 clock cycles for the 24-round Keccak-f[1600] permutation). It then sends the done_sqz signal to the absorb control logic, which then triggers the done_abs_sqz signal to the operation control. Once the outer module receives the done signal, it starts reading the Keccak state by setting mode = 3 and providing the corresponding state address. In a single clock cycle, 64-bit data can be fetched from the wrapper. The outer module processes this data first and if more data is required, it simply increments the address, otherwise it sends [...]

[...] logic to start and fetch coefficients from the Keccak wrapper as discussed in III-D.

    t0 = buf[pos] & 0x0F;
    if (t0 < 15) {
        t0 = t0 - (205 * t0 >> 10) * 5;
        a[ctr++] = 2 - t0;
    }

Listing 1. Rejection Eta

Listing 1 shows a small code snippet from the software reference implementation for rejection sampling. It is used to generate vectors s1 and s2 from the output of SHAKE-128. One can note that the value of t0 can only be between 0 and 14. Thus, the whole computation (requiring multiple multiplication and subtraction operations) can be completely avoided, and the possible output values of the polynomial coefficient can be pre-computed and stored in a LUT with 15 entries. This saves the multiplier and subtractor logic. The same logic is used for the other value, t1. Similar optimizations are used for the other sampling modules wherever possible, along with pipelining for good performance.

F. Combined Power2Round and Decompose
Fig. 4 shows the combined module for Power2Round and Decompose. As can be seen from the figure, the controller unit is shared between both modules, thus saving about 180 LUTs and 120 FFs. The controller unit is used to enable the operation depending on the mode and to generate address and write enable signals after receiving done from the respective module. For simplicity, we have not shown the individual enable and done signals in the diagram.

[Fig. 4: combined Power2Round and Decompose module. Recoverable labels: Power2Round datapath with (1≪(D−1)) − 1, ≫D, ≪D and outputs a1, a0; Decompose datapath with 127, ≫7, ≪10, 1≪21, ≫22, LUT, [3:0], (Q−1)/2, ≫31, &Q; Controller with input address counter (addr_a), output address counter and write enable logic (addr_a0, we_a0, addr_a1, we_a1), mode, done.]
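The two datapaths that share the controller can be sketched in software as follows. This is a minimal sketch, assuming the hardware follows the same arithmetic as the reference software for the Dilithium-V parameters (D = 13, γ2 = (q−1)/32); the constants 127, ≫7, ≪10, 1≪21, ≫22, [3:0], (Q−1)/2, ≫31 and &Q that survive in the labels of Fig. 4 are consistent with that assumption. The function names are illustrative and not taken from the paper, and the code relies on arithmetic right shift of negative values, as the reference software does.

```c
#include <stdint.h>

#define Q      8380417          /* Dilithium modulus */
#define D      13               /* low bits split off by Power2Round */
#define GAMMA2 ((Q - 1) / 32)   /* assumed Dilithium-V rounding range */

/* Power2Round: for 0 <= a < Q, write a = a1 * 2^D + a0
 * with -2^(D-1) < a0 <= 2^(D-1). Returns a1 and stores a0. */
int32_t power2round(int32_t *a0, int32_t a)
{
    int32_t a1 = (a + (1 << (D - 1)) - 1) >> D;
    *a0 = a - (a1 << D);
    return a1;
}

/* Decompose: for 0 <= a < Q, write a = a1 * 2*GAMMA2 + a0 (mod Q)
 * with a0 centred around zero. Returns a1 (4 bits) and stores a0. */
int32_t decompose(int32_t *a0, int32_t a)
{
    int32_t a1 = (a + 127) >> 7;
    a1 = (a1 * 1025 + (1 << 21)) >> 22;      /* a1 * 1025 == (a1 << 10) + a1 */
    a1 &= 15;                                /* keep bits [3:0] */
    *a0 = a - a1 * 2 * GAMMA2;
    *a0 -= (((Q - 1) / 2 - *a0) >> 31) & Q;  /* subtract Q if a0 > (Q-1)/2 */
    return a1;
}
```

Both functions take one coefficient in and produce a high part and a low part, which is why a single address counter and write enable generator can serve either of them depending on the mode.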
[...] = 2^{13} (f[10] + f[9:0]) − (f[10] + c[11:10])

[Figure labels: Dual Butterfly Unit (in0, in1, out0, out1, 32-bit); Address Generation and Control (addr0, addr1, wen); RAM ports datain0, datain1, dataout0, dataout1.]
Fig. 5. NTT/INTT Architecture

[Figure labels: IN A, IN B, ≪13, q, 2q, RED OUT.]
Fig. 6. Efficient Modular Reduction Module
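The identity behind the reduction module, 2^23 ≡ 2^13 − 1 (mod 8380417), can be illustrated with a short software sketch. This is only an illustration under stated assumptions: the function name `reduce_q` is hypothetical, and the loop stands in for the fixed, unrolled folding stages of the hardware, whose exact bit slicing is not reproduced here.

```c
#include <stdint.h>

#define Q 8380417u   /* Q = 2^23 - 2^13 + 1 */

/* Reduce a (for example a 46-bit product of two 23-bit values) modulo Q.
 * Each pass splits a = hi * 2^23 + lo and replaces it by
 * hi * (2^13 - 1) + lo, which is congruent to a modulo Q and
 * smaller than a by exactly hi * Q, so the loop terminates. */
uint32_t reduce_q(uint64_t a)
{
    while (a >> 23) {
        uint64_t hi = a >> 23;
        uint64_t lo = a & 0x7FFFFFu;
        a = (hi << 13) - hi + lo;
    }
    /* Now a < 2^23 < 2 * Q, so one conditional subtraction is enough. */
    if (a >= Q) {
        a -= Q;
    }
    return (uint32_t)a;
}
```

Only shifts, additions and subtractions are involved, which is what makes the approach attractive in hardware: no divider and no multiplier is needed for the reduction itself.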
The NTT operation is performed in three stages. First, the data from external memory is copied over to RAM-A. Second, the data in RAM-A is processed (128 butterfly operations) and written to RAM-B, and vice versa, until all the layers have been processed. Finally, the result is written back to the external RAM. The entire operation requires multiple addresses, write enables and zetas to be generated and provided to various modules. This is performed by the address generation and control state machine. At the end of the INTT, the required scaling by 1/n is performed by the butterfly unit as well, so no additional resources or clock cycles are used.

H. Modular Reductions
The work in [28], [29] proposed an efficient method for modular reduction in hardware for NewHope-NIST and CRYSTALS-Kyber. In this work, we followed a similar approach for efficient modular reduction in Dilithium. In order to compute a mod q, we used the property of the Dilithium modulus 2^23 ≡ 2^13 − 1 (mod 8380417) recursively. The corresponding equations are as shown below:
[...]
Fig. 6 shows the implemented reduction module. The extra multiplication operation required before the modular reduction is also integrated in this module. In order to reduce the input to 23 bits before multiplication, we first perform an initial reduction using the same recursive property. This saves 2 DSP resources per module. The back-to-back flip-flops correspond to the pipeline delays required for optimal DSP implementation. Two such modules are instantiated in hardware to speed up operations.

IV. RESULTS AND PERFORMANCE COMPARISON
In this section, we present resource utilization and performance results for both FPGA and ASIC implementations. We also compare our results with the state-of-the-art implementations for Dilithium-V. The implemented design is realized in Verilog and the results are presented after synthesis and place-and-route using Vivado 2020.1, targeting two platforms: Xilinx Artix-7 (XC7A200T-2) and Zynq UltraScale+ (XCZU9EG-2). The correctness of the implementation is verified using the Known Answer Tests (KAT) provided in the NIST package. From here on, Artix-7 is referred to as A7 and Zynq UltraScale+ as ZUS+. For ASIC, we synthesized the design for TSMC 65nm process technology. We used Synopsys Design Compiler version R-2020.09-SP5 for synthesis and Cadence Innovus
version V19.10-P002 for place-and-route of the design. We used a CCS-based standard cell library for accurate results. Synopsys DesignWare library components were used wherever applicable.

A. Resource Utilization
Table I shows resource utilization for the major components required in Dilithium. Through extensive resource sharing between the modules, we obtained a significant reduction in overall resource utilization. One can note that the resource utilization for Sampling is quite low compared to the other existing implementations. This is mainly because the different Sampling modules contain only the minimal arithmetic and control logic, with the XOFs shared between them. The slice register usage is somewhat higher in our case because of the deep pipelines needed for good performance on both FPGA and ASIC. The various modules were individually optimized for hardware using multiple strategies wherever applicable.

TABLE I
RESOURCE UTILIZATION FOR FPGA AND ASIC. THE ASIC RESOURCES ARE REPORTED IN GATE EQUIVALENTS (GEs).

                                   FPGA                          ASIC
Sub-module          LUT        F/F        DSP   BRAM   GEs        Reference
MakeHint            2389       740        0     0      -          [15]
                    67         85         -     -      1423.7     This Work
UseHint             6453       2808       0     0      -          [15]
                    186        279        0     0      4740       This Work
Encode + Pack       1626       461        0     0      -          [15]
                    650        603        0     0      8884       This Work
Decode + Unpack     2189       239        0     0      -          [15]
                    694        568        0     0      9458.3     This Work
Decompose           1437       680        0     0      -          [15]
                    120        203        0     0      3028.7     This Work
NTT + PolyArith     4509 × 2   3146 × 2   16    0      -          [15]
                    5676       1218       41    1      -          [16]
                    2759       2037       4     7      40182      This Work
SampleA + SampleS   3548       1015       0     0      -          [15]
                    1479       189        -     -      -          [16]
                    360        355        0     0      4710.3     This Work
SampleY             2220       630        0     0      -          [15]
                    469        48         -     -      -          [16]
                    99         199        0     0      2353       This Work
SampleC             1856       868        0     0      -          [15]
                    384        662        -     -      -          [16]
                    289        244        0     0      3782.7     This Work
Keccak              5483 × 3   4451 × 3   0     0      -          [15]
                    3708       1623       -     -      -          [16]
                    4202       1800       0     0      42155.3    This Work

B. FPGA Implementation
Table II presents detailed results for our implementation as well as existing implementations for Dilithium-V. In our case, the combined architecture (for KeyGen, Sign and Verify) requires about 13.9k LUTs, 6.8k slice registers, 35 BRAMs and 4 DSPs. Owing to a deeply pipelined architecture, our design can run at a maximum frequency of 163 MHz on Artix-7 and 391 MHz on Zynq UltraScale+. As a result, we require 387µs, 699µs and 416µs for the KeyGen, Sign and Verify operations on Artix-7. On Zynq UltraScale+, the respective operations finish in 161µs, 291µs and 173µs. We also report the number of operations that can be performed per second (OP/s) for all the implementations.
Currently, to the best of our knowledge, there are four known hardware implementations which report results for Dilithium Security Level V. Hence, we compare our results with only these implementations.
1) Comparison with [15]: The work by Beckwith et al. presents results for multiple platforms as well as for all the security levels. For a fair comparison, we compare our results on the same FPGA platform. The authors in [15] targeted a high-performance implementation. As a result, they can perform KeyGen, Sign and Verify in 121µs, 210µs and 126µs, achieving a reduction in latency of about 3.2×, 3.33× and 3.30× compared to our implementation. Even though the performance achieved is better, their design requires 3.81×, 4.14× and 4× more LUTs, FFs and DSPs than our design. This is because their architecture utilizes multiple cores to perform expensive operations such as NTT and Keccak, whereas we use only one core for almost all the modules. Thus, the high latency in our design is compensated by low area utilization as well as a higher achievable frequency. As a result, the Area×Time metric of their design is higher (worse) than ours by 18.8%, 14.3% and 15.3% for the KeyGen, Sign and Verify operations respectively.
2) Comparison with [16]: The authors in [16] aim to reduce the area in terms of LUTs and FFs by utilizing more DSP resources in their architecture. Their design requires 3.2×, 2.02× and 11.25× more LUTs, FFs and DSP resources and about 1.13× fewer BRAMs compared to our design. Even though our implementation has slightly more latency for the three operations, the overall improvement in terms of efficiency (Area×Time) is quite high. Compared to ours, their design has a lower Area×Time trade-off efficiency by 200%, 129% and 188% for the KeyGen, Sign and Verify operations respectively.
3) Comparison with [13]: Zhao et al. proposed a compact and high-performance hardware design. They applied several optimizations such as segmented pipeline processing to achieve a high-performance architecture. As a result, they can perform KeyGen/Sign/Verify in 90µs/505µs/93µs, which is 4.3×/1.38×/4.47× lower than the execution time of our design. One can note that the improvement in execution time for Sign is not that significant. We believe this is because the Sign operation does not offer much scope for parallel execution of operations. Further, even though their design offers high performance, it comes at the expense of consuming large amounts of resources. Compared to our design, their implementation requires 114.7% and 51% more LUTs and FFs. As a result, the Area×Time trade-off is better in our case for the Sign operation, whereas it is better in [13] for KeyGen and Verify.
4) Comparison with [14]: The design by Aikata et al. presents a unified architecture for Dilithium and Saber. The authors in [14] target a low-area implementation. Hence, the
Keccak core and the polynomial multiplier are shared between the two algorithms. But, there are many specialized modules [...]
[...] their implementation requires 31.7% and 36.2% more LUTs [...]
[...] thing to note is that our design can run at about twice the speed of their design. As a result, we are able to achieve an [...]
[...] because of the highly pipelined architecture, the design can run at 1.176 GHz. As a result, KeyGen/Sign/Verify takes only 53.7µs/96.9µs/57.7µs. Compared to the state-of-the-art implementation, we achieve a reduction of 1.4× in area and improve [...]

TABLE II
PERFORMANCE AND COMPARISON FOR DILITHIUM-V FOR BEST-CASE SCENARIO (SIGNATURE IS VALID AFTER THE FIRST ITERATION) ON FPGA AND ASIC. THE AREA FOR ASIC IS EXCLUDING ON-CHIP MEMORY AND IS REPORTED IN mm² AND GATE EQUIVALENTS (GEs)

[Bar chart: (OP/s)/LUT (×10^-3), higher is better.]
[Bar chart: Area × Time (LUTs × s), lower is better.]
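The two comparison metrics, Area×Time in LUTs×s and operations per second per LUT, follow directly from the LUT count and the latency. The sketch below shows the arithmetic; the function names are illustrative, and the example figures in the note after it are the rounded values quoted in Section IV-B, so the results may differ slightly from values computed from unrounded data.

```c
/* Area x Time in LUT*s: LUT count times latency (given in microseconds). */
double area_time(double luts, double latency_us)
{
    return luts * latency_us * 1e-6;
}

/* Operations per second per LUT: (1 / latency) divided by the LUT count. */
double ops_per_lut(double luts, double latency_us)
{
    return (1e6 / latency_us) / luts;
}

/* Clock cycles implied by a latency in microseconds at a clock in MHz. */
double cycles(double freq_mhz, double latency_us)
{
    return freq_mhz * latency_us;
}
```

For KeyGen on Zynq UltraScale+ (about 13.9k LUTs, 161µs), `area_time(13900, 161)` gives roughly 2.24 LUTs×s and `ops_per_lut(13900, 161)` roughly 0.447 (OP/s)/LUT. As a consistency check, `cycles(391, 161)`, `cycles(163, 387)` and `cycles(1176, 53.7)` all come out at about 63k cycles, which is what one would expect if the same design is simply clocked at different frequencies on the three targets.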
TABLE III
PERFORMANCE AS AN ACCELERATOR FOR BEST-CASE SCENARIO: SIGNATURE IS VALID IN THE FIRST ITERATION AND IN THE WORST-CASE SCENARIO: REJECTION LOOP IS EXECUTED 17 TIMES BEFORE A VALID SIGNATURE IS GENERATED. THE NUMBERS ARE CPU CLOCK CYCLES (×10^3) ELAPSED WHEN THE OPERATION IS EXECUTED ON HARDWARE VERSUS ON SOFTWARE.

Fig. 9. Physical layout of the Dilithium-V Core after place-and-route.

[...] and Zynq UltraScale+ (XCZU3EG-1) with ARM Cortex-A53@1200MHz. It is important to quantify the real practical benefits of a hardware accelerator in a complete system, as in many situations the full potential of an accelerator might not be achieved. Fig. 10 shows the setup for the Zynq UltraScale+ platform using the Ultra96 board.

Fig. 10. Setup with Ultra96-v2 (XCZU3EG-1) as one of the hardware evaluation boards.

Fig. 11 shows the block diagram of the accelerator connected as an AXI peripheral, applicable to both Zynq platforms. A very similar block design is used for the MicroBlaze setup as well. We created an AXI4-Lite memory-mapped register-based peripheral and implemented support for control and data logic to connect with the designed Dilithium core. The AXI packet handler is responsible for receiving and sending AXI commands from the CPU core and decoding them. Based on the received command, it sends the corresponding operation [...]

[Figure labels: ZYNQ PS with ARM Core; PL with AXI4-Lite Interconnect and Clock Generator; core signals operation_mode, operation_done, initial_seed, public_key, secret_key, message, in_msg_len, signature, verify_fail.]
Fig. 11. Dilithium as an Accelerator on Xilinx Zynq. The same AXI peripheral IP was used for the MicroBlaze processor as well. Due to connectivity using the AXI4-Lite bus, the accelerator can easily be connected to other processors such as RISC-V.

In Table III, we report results for the hardware accelerator. For comparison, we chose a best-case scenario where a valid signature is generated after the first iteration and a worst-case scenario where the valid signature is generated after the rejection loop is executed 17 times. It is clear from the table that the speedup obtained is inversely proportional to the performance of the associated CPU. For modern high-performance CPUs such as the ARM Cortex-A53 and Cortex-A9, we obtain speedups of about 6.42× to 15.05×, whereas for MicroBlaze the speedup is significantly higher at 105-261×.

VI. CONCLUSION AND FUTURE WORK
By utilizing multiple optimization strategies such as resource and control logic sharing, pre-computed LUTs, fusion of modules etc., we achieved a reduction of 24% in LUTs and FFs compared to the best known implementation. As a result, our design requires only 13.9k LUTs, 6.8k FFs, 4 DSPs and 35 BRAMs. To the best of our knowledge, this work presents the smallest hardware accelerator for CRYSTALS-Dilithium, which can now fit into the smallest Zynq FPGA. We also present a detailed comparison with existing implementations and show that our design achieves an improvement of more than 35% in the Area×Time product on Zynq UltraScale+ for all the operations.
We also implemented our design on ASIC and report numbers for the major components of Dilithium as well as the overall design. Compared to the state-of-the-art implementation, our design requires about 157 kGE with 0.227 mm² area, achieving a reduction of about 1.4×. In terms of performance, the implemented design can run at 1.176 GHz, achieving an improvement of about 2.95×.
Further, this work presents the first hardware evaluation of complete Dilithium-V as an accelerator on three different platforms and demonstrates that the achieved speedup is significantly high, about 105-261× for MicroBlaze and about 6.42-15.05× for ARM Cortex, compared to the software implementations on these platforms.
In the future, we plan to integrate low-cost side-channel countermeasures in the design.

REFERENCES
[1] J. Chow, O. Dial, and J. Gambetta, “Ibm quantum breaks the 100-qubit processor barrier,” IBM Research Blog, available in https://research.ibm.com/blog/127-qubit-quantum-process-or-eagle, 2021.
[2] C. Wang, X. Li, H. Xu, Z. Li, J. Wang, Z. Yang, Z. Mi, X. Liang, T. Su, C. Yang et al., “Towards practical quantum computers: transmon qubit with a lifetime approaching 0.5 milliseconds,” npj Quantum Information, vol. 8, no. 1, pp. 1–6, 2022.
[3] Z. Wang, S. Wei, and G. Long, “A quantum circuit design of aes,” arXiv preprint arXiv:2109.12354, 2021.
[4] J. Zou, Z. Wei, S. Sun, Y. Luo, Q. Liu, and W. Wu, “Some efficient quantum circuit implementations of camellia,” Quantum Information Processing, vol. 21, no. 4, pp. 1–27, 2022.
[5] C. Gidney and M. Ekerå, “How to factor 2048 bit rsa integers in 8 hours using 20 million noisy qubits,” Quantum, vol. 5, p. 433, 2021.
[6] NIST, “Submission requirements and evaluation criteria for the post-quantum cryptography standardization process,” http://csrc.nist.gov/groups/ST/post-quantum-crypto/documents/call-for-proposals-final-dec-2016.pdf.
[7] W. Beullens, “Breaking rainbow takes a weekend on a laptop,” Cryptology ePrint Archive, 2022.
[8] W. Beullens, “Improved cryptanalysis of uov and rainbow,” in Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 2021, pp. 348–373.
[9] D. Pokornỳ, P. Socha, and M. Novotnỳ, “Side-channel attack on rainbow post-quantum signature,” in 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2021, pp. 565–568.
[10] C. Tao, A. Petzoldt, and J. Ding, “Improved key recovery of the hfev-signature scheme,” Cryptology ePrint Archive, 2020.
[11] L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, P. Schwabe, G. Seiler, and D. Stehlé, “Crystals-dilithium: A lattice-based digital signature scheme,” IACR Transactions on Cryptographic Hardware and Embedded Systems, pp. 238–268, 2018.
[12] S. Ricci, L. Malina, P. Jedlicka, D. Smékal, J. Hajny, P. Cibik, P. Dzurenda, and P. Dobias, “Implementing crystals-dilithium signature scheme on fpgas,” in The 16th International Conference on Availability, Reliability and Security, 2021, pp. 1–11.
[13] C. Zhao, N. Zhang, H. Wang, B. Yang, W. Zhu, Z. Li, M. Zhu, S. Yin, S. Wei, and L. Liu, “A compact and high-performance hardware architecture for crystals-dilithium,” IACR Transactions on Cryptographic Hardware and Embedded Systems, pp. 270–295, 2022.
[14] A. C. Mert, D. Jacquemin, A. Das, D. Matthews, S. Ghosh, S. S. Roy et al., “A unified cryptoprocessor for lattice-based signature and key-exchange,” Cryptology ePrint Archive, 2021.
[15] L. Beckwith, D. T. Nguyen, and K. Gaj, “High-performance hardware implementation of lattice-based digital signatures,” Cryptology ePrint Archive, 2022.
[16] G. Land, P. Sasdrich, and T. Güneysu, “A hard crystal-implementing dilithium on reconfigurable hardware,” Cryptology ePrint Archive, 2021.
[17] K. Basu, D. Soni, M. Nabeel, and R. Karri, “Nist post-quantum cryptography-a hardware evaluation study,” Cryptology ePrint Archive, 2019.
[18] D. Soni, K. Basu, M. Nabeel, N. Aaraj, M. Manzano, and R. Karri, Hardware Architectures for Post-Quantum Digital Signature Schemes. Springer, 2021.
[19] D. Soni, K. Basu, M. Nabeel, and R. Karri, “A hardware evaluation study of nist post-quantum cryptographic signature schemes,” in Second PQC Standardization Conference. NIST, 2019.
[20] Z. Zhou, D. He, Z. Liu, M. Luo, and K.-K. R. Choo, “A software/hardware co-design of crystals-dilithium signature scheme,” ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 14, no. 2, pp. 1–21, 2021.
[21] A. F. De Abiega-L’Eglisse, K. A. Delgado-Vargas, F. Q. Valencia-Rodriguez, V. G. Gonzalez-Quiroga, G. Gallegos-Garcia, and M. Nakano-Miyatake, “Performance of new hope and crystals-dilithium postquantum schemes in the transport layer security protocol,” IEEE Access, vol. 8, pp. 213 968–213 980, 2020.
[22] J. Wright, M. Gowanlock, C. Philabaum, and B. Cambou, “A crystals-dilithium response-based cryptography engine using gpgpu,” in Proceedings of the Future Technologies Conference. Springer, 2021, pp. 32–45.
[23] L. Sharma and A. Mishra, “Analysis of crystals-dilithium for blockchain security,” in 2021 2nd International Conference on Secure Cyber Computing and Communications (ICSCCC). IEEE, 2021, pp. 160–165.
[24] S. Bai, L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, P. Schwabe, G. Seiler, and D. Stehlé, “Crystals-dilithium: Digital signatures from module lattices,” 2021, https://pq-crystals.org/dilithium/data/dilithium-specification-round3-20210208.pdf.
[25] M. J. Dworkin et al., “Sha-3 standard: Permutation-based hash and extendable-output functions,” 2015.
[26] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche, “Keccak sponge function family main document,” Submission to NIST (Round 2), vol. 3, no. 30, pp. 320–337, 2009.
[27] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche, “Keccak hardware implementation,” https://keccak.team/hardware.html.
[28] N. Zhang, B. Yang, C. Chen, S. Yin, S. Wei, and L. Liu, “Highly efficient architecture of newhope-nist on fpga using low-complexity ntt/intt,” IACR Transactions on Cryptographic Hardware and Embedded Systems, pp. 49–72, 2020.
[29] F. Yarman, A. C. Mert, E. Öztürk, and E. Savaş, “A hardware accelerator for polynomial multiplication operation of crystals-kyber pqc scheme,” in 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2021, pp. 1020–1025.