Main Manuscript

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2023.0322000
Efficient Low-Latency Hardware Architecture for

Module-Lattice-Based Digital Signature Standard
QUANG DANG TRUONG1 , (Graduate Student Member, IEEE), PHAP NGOC DUONG1,2 ,
(Member, IEEE), AND HANHO LEE1 , (Senior Member, IEEE)
1
Department of Electrical and Computer Engineering, Inha University, Incheon 22212, South Korea (e-mail: [email protected], [email protected],
[email protected]).
2
Faculty of Computer Engineering and Electronics, The University of Danang–Vietnam-Korea University of Information and Communication Technology,
Danang 50000, Vietnam.
Corresponding author: Hanho Lee (e-mail: [email protected]).
This work was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC support program (IITP-2021-0-02052)
supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), in part by the National Research
Foundation of Korea (NRF) grant funded by the Korea government(MSIT) (No. 2021R1A2C1011232) and in part by the IITP grant funded
by the Korea government (MSIT) (No.20210007790012003).
ABSTRACT The rapid advancement of powerful quantum computers poses a significant security risk to
current public-key cryptosystems, which heavily rely on the computational complexity of problems such
as discrete logarithms and integer factorization. As a result, CRYSTALS-Dilithium, a lattice-based digital
signature scheme with the potential to be an alternative algorithm that can withstand both quantum and
classical attacks, has been standardized as ML-DSA after NIST Post-Quantum Cryptography competition.
While prior studies have proposed hardware designs to accelerate this cryptosystem, there is room for
further optimization in the tradeoff between performance and hardware consumption. This paper addresses
these limitations by presenting an efficient low-latency hardware architecture for ML-DSA, leveraging
optimized timing schedules for its three main algorithms. The hardware implementation enables runtime
switching main operations in ML-DSA with various security levels. We design flexible arithmetic and
hash modules tailored for ML-DSA, the most time-consuming submodules and key determinants of the
scheme implementation. Combined with efficient operation scheduling to maximize the utilized time of
submodules, our design achieves the best latency among FPGA-based implementations, outperforming state-
of-the-art works by 1.27∼2.58× in terms of the area-time tradeoff metric. Therefore, the proposed hardware
architecture demonstrates its practical applicability for digital signature cryptosystems in post-quantum era.
INDEX TERMS Post-quantum cryptography (PQC), Module-Lattice-Based Digital Signature Standard

(ML-DSA), CRYSTALS-Dilithium, Lattice-based cryptography (LBC), Number theoretic transform (NTT).
I. INTRODUCTION lected CRYSTALS-Dilithium for digital signature standard-

ization to the next round of PQC standardization process [2].
I N recent years, the rise of quantum computing has posed a
significant threat to traditional public-key cryptographic
schemes. Shor’s algorithm [1], when executed on a suffi-
NIST has recently released the initial public drafts of
the Module-Lattice-Based Digital Signature Standard (ML-
ciently powerful quantum computer, has the capability to ef- DSA) [3], which is based on the Dilithium submission.
ficiently solve complex mathematical problems that form the Dilithium relies on the worst-case hardness of module lattice
basis of widely used cryptographic algorithms such as RSA problems [4]. It has the potential to resist both quantum and
and Elliptic curve cryptography. This realization prompted classical attacks and offers advantages such as fast arithmetic
the National Institute of Standards and Technology (NIST) to operations and compact keys, ciphertext and signature sizes.
launch the Post-Quantum Cryptography (PQC) standardiza- Given the potential of ML-DSA to become a standard and
tion process in 2016, with the aim of developing new public- replace other digital signature schemes specified in NIST
key standards that are believed to be secure even against Federal Information Processing Standards Publication (FIPS)
adversaries in possession of a large-scale quantum computer. and Special Publications (e.g., FIPS 186-5 [5]), which is
After three rounds of evaluation and review, NIST has se- ready to be used for obtaining assurances for digital signa-
VOLUME 11, 2023 1

Truong et al.: Efficient Low-Latency Hardware Architecture for Module-Lattice-Based Digital Signature Standard
ture applications. There is a growing interest in developing posed in [9] achieves the lowest latency. The difference be-
efficient implementations of this scheme in both hardware tween these two architectures lies in the number of arithmetic
and software. Most existing works on implementing and eval- and hash modules used. The architecture in [10] has one hash
uating Dilithium have used pure software methods [6], [7] and arithmetic modules, that means it is slower but more
or hardware-software co-design methods [8]. Developing and compact than the work in [9] which uses multiple modules.
benchmarking software implementations of cryptographic al- Consequently, we realized that designing efficient hash
gorithms like PQC is relatively straightforward. However, and arithmetic modules is key factor for achieving high-
implementing them in hardware platforms, particularly on performance architecture. The latency of hardware imple-
Field-Programmable Gate Arrays (FPGAs), requires signif- mentations for Dilithium is mainly dominated by hashing and
icant time and design effort to assess their efficiency in terms polynomial arithmetic tasks, which should be continuously
of speed, area utilization, power consumption and energy executed as much as possible. Besides, the interconnection
efficiency. By exploring FPGA implementations, researchers between individual submodules in the whole architecture
can realize various ideas and apply the tradeoff method to should be carefully considered to avoid critical paths, which
optimize the approach for hardware designs to achieve de- can reduce the working frequency of the hardware imple-
sired performance goals. FPGA implementations are a crucial mentation. The balance between on-chip resource utilization
and efficient tool for gaining valuable insights into the cost and performance must also be considered to maximize the
and performance characteristics of PQC algorithms when efficiency of combined architecture for all phases. To im-
deployed in hardware environments. plement a more compact architecture, [10] does not support
Several prior researches have focused on improving the packing/unpacking and the final results are stored in BRAMs.
hardware designs for Round 3 Dilithium on FPGA platforms. However, in public-key cryptosystems, packing is essential
In these designs, two main aspects were typically considered. for defining the data layout of the keys and signature before
The first is developing hardware architectures that could sup- transmission. This is necessary as these applications may be
port all security levels as illustrated in [9], [10] or specifying implemented on diverse platforms with varying architectures.
to a particular security level as proposed in [11]–[14]. We Additionally, although the proposed on-the-fly calculation for
recognize that adopting a unified architecture with a flexible matrix A in [10] aims at minimizing storage, it leads to a
mode signal to accommodate all security levels is a more longer critical path as well as restricts the design to operate
effective approach. This strategy can enhance algorithm per- at higher frequency. Conversely, [9] provided practical sup-
formance in hardware, providing flexibility and enabling the port with packing/unpacking modules featuring 64-bit ports.
architecture to address a broader spectrum of applications. However, their hash modules support multiple instances of
The second consideration is the goal of these implementa- SHA-3 and the complexity of iterative NTT structure leads to
tions. Some studies aimed at maximizing performance while a longer critical path. Additionally, utilizing two arithmetic
others focused on minimizing power or area consumption modules with only one used in Keygen and Verify algorithms
or achieving a balance in area-time tradeoff. As a result, of ML-DSA, might result in inefficient hardware utilization.
these implementations were broadly categorized as high- This study approaches a unified hardware design for ex-
performance [9], [10] or lightweight [12], [15] although the ecuting three main algorithms in NIST ML-DSA [3]. Our
distinction is not always clear-cut as some designs may aim design improves submodules and employs appropriate oper-
for specific tradeoff directions. For instance, a high-efficiency ation scheduling tailored to ML-DSA hardware implemen-
design [11] might primarily focus on achieving high perfor- tation. Additionally, this study aims at a high-performance
mance while keeping resource utilization in comparable level. implementation and achieving an optimal balance between
Lightweight implementations aim to minimize hardware execution time and on-chip resource utilization.
utilization, allowing them to fit on even the smallest FPGA In summary, we make the following main contributions:
platforms. However, current FPGA platforms support a large 1) We propose a unified hardware architecture that sup-
amount of hardware resources, as evidenced by the unified ports three algorithms of ML-DSA and can be config-
architecture for CRYSTALS-Kyber and Dilithium on the Ul- ured for NIST security levels in run-time. The com-
traScale+ ZCU102 [15]. This cryptoprocessor utilizes only a bined hardware design improves the submodules and
small portion of the platform’s hardware resources, utilizing optimizes on-chip resource utilization. Furthermore,
less than 10% of available resources. When implementing the proposed architecture supports packing/unpacking
Dilithium on hardware, one of the most important goals is modules that enables efficient data exchange with other
accelerating execution time to improve application perfor- platforms, facilitating accelerated multiple computa-
mance. Therefore, a high-performance approach is more suit- tional tasks in ML-DSA.
able for Dilithium hardware implementation, which is still 2) The hash and arithmetic modules are designed to match
affordable with the available hardware resources on current the specific needs of ML-DSA. Specifically, we use in-
FPGA platforms. Among the high-performance designs that dependent hash modules for SHAKE-128 or SHAKE-
fully support three security levels, the compact and efficient 256 and a fully pipelined NTT architecture in the arith-
hardware architecture presented in [10] achieves the best metic module, resulting in an improved critical path.
results in resource consumption, while the architecture pro- Optimizing these two modules, which have the most
2 VOLUME 11, 2023
substantial impact on the overall execution time of the TABLE 1. Parameters sets of ML-DSA.
architecture, leads to performance enhancement for the Parameters NIST security level
entire cryptosystem. 2 3 5
3) By using flexible submodules for entire architecture, q [modulus] 8380417 8380417 8380417
we propose an optimal timing diagram and proper d [dropped bits from t] 13 13 13
τ [# of ±1’s in c] 39 49 60
scheduling to achieve area-time balance when execut- γ1 [y coefficient range] 217 219 219
ing ML-DSA processes. Comparison results show that γ2 [low-order rounding range] (q-1)/88 (q-1)/32 (q-1)/32
our design achieves the lowest latency among FPGA- (k, l) [dimension of A] (4,4) (6,5) (8,7)
η [secret key range] 2 4 2
based implementations and outperforms state-of-the- β [τ · η] 78 196 120
art studies in terms of area-time product (ATP) metric. ω [max # of 1’s in hint] 80 55 75
The remainder of this paper is organized as follows. Sec- λ [collision strength of c̃] 128 192 256
tion II provides a brief background of ML-DSA and number
theoretic transform. Section III proposes our design decisions Algorithm 1 Key Generation KeyGen() [3]
and details our hardware architecture as well as the optimized Output: pk = (ρ, t1 ), sk = (ρ, K , tr, s1 , s2 , t0 )
modules. The implementation results and comparison with 1: ζ ← {0, 1}
256
the state-of-the-art works are given in Section IV. Finally, 2: (ρ, ρ′ , K ) ∈ {0, 1}
256 512 256
× {0, 1} × {0, 1} ← H(ζ)
Section V summarizes and concludes the paper. 3: A ∈ Rk×l := ExpandA(ρ)
q
4: (s1 , s2 ) ∈ Sηl × Sηk := ExpandS(ρ′ )
II. PRELIMINARIES 5: t:= As1 + s2
A. MODULE-LATTICE-BASED DIGITAL SIGNATURE
6: (t1 , t0 ) := Power2Roundq (t, d)
STANDARD OVERVIEW 512
7: tr ∈ {0, 1} := H(ρ ∥ t1 )
ML-DSA is a digital signature scheme based on CRYSTALS- 8: return (pk, sk)
Dilithium, a member of the Cryptographic Suite for Algebraic
Lattices (CRYSTALS) and a digital signature scheme based
on the "Fiat-Shamir with Aborts" approach [16]. Different
from other lattice-based algorithms recommended after round in Algorithm 2. The ‘hedged’ version is a new improvement
3: Kyber and Falcon, Dilithium uses uniform sampling rather aimed at facilitating countermeasures against side-channel
than discrete Gaussian distribution for secret randomness attacks and fault attacks on deterministic signatures [19].
generation. This approach greatly simplifies polynomial gen- Key generation (Keygen), Signature generation (Sign) and
eration and is easily implemented in constant time. The hard Verification (Verify), three core algorithms of ML-DSA, are
problems underlying the security of ML-DSA schemes are shown in Algorithms 1-3 [3].
Module Learning with Errors (M-LWE) and Module Shortest KeyGen(): Key generation algorithm generates a keypair
Integer Solution (M-SIS) formally proved in [4]. consisting of a private signing key (sk) and public verification
M-LWE problem needed to protect against key-recovery key (pk) used for signature generation and verification. In this
[17] and M-SIS problem for strong unforgeability [18]. The algorithm, two uniform seeds: public seed (ρ) and noise seed
M-LWE problem can be briefly described as follows: Let A (ρ′ ) are mapped to generate polynomial matrix A and vectors
∈ Rqk×l be uniformly chosen s1 ∈ Rlq , s2 ∈ Rkq . Then, the s1 , s2 . It then computes t = As1 +s2 to generate the final
standard M-LWE problem is the public key (A, t = As1 +s2 ) keypair. To keep size small, the public key includes the seed
is indistinguishable from (A, t) where t is chosen uniformly (ρ) instead of matrix A and the upper bit of t. The secret key
at random. The M-SIS problem in ML-DSA is expressed packs the lower d bits of t, seed ρ and two byte-arrays K , tr.
through: from a uniformly chosen matrix A ∈ Rk×l q , that there Sign(): This algorithm takes message M and secret key
exists z and u such that Az+u = 0, where ∥z∥∞ ≤ 2(γ1 − β), to generate the signature attached with the message before
∥u∥∞ ≤ 4γ2 + 2 (parameters can be chosen to match NIST sending to verifier. Signature generation begins by generating
security levels outlined in Table 1). In addition to 3 proposed masking vector y and matrix A and then compute w = Ay. It
NIST security levels, the security of ML-DSA can be adjusted is based on generating challenge c by hashing the message
by changing the parameters inside. The most straightforward with the high order bit of w, signature as z = y+cs1 then
way to enhance or reduce security is by modifying the values and also the hints h for the verifier. The potential signature
of (k, l) and then adjusting η, β and ω accordingly. An- σ consisting of (c, z, h) is checked by adapting the condi-
other approach is to lower the values of γ1 and/or γ2 , which tion of the polynomial max norms are within an acceptable
makes forging signatures more challenging since it relies on predefined range and the maximum number of 1’s in hint
the underlying SIS problem. Increasing (k, l) significantly that the signature can support which are defined by choosing
strengthens security, while adjusting γi is more suitable for parameter of the security level in Table 1. Otherwise, the
minor security tweaks. One of the most notable updates in signature must be rejected and a new attempt is made.
ML-DSA from Dilithium version 3.1 is that it contains two Verify(): The high-order bit of w1 is recovered from mes-
versions of the signature generation algorithm: "hedged" and sage, public key and signature. It is used as a part of hash
"deterministic" which are selected by the value rnd, as seen function to generate challenge c which will be compared to
VOLUME 11, 2023 3
Algorithm 2 Signature Generation Sign(sk, M ) [3] Algorithm 3 Verification Verify(pk, M , σ) [3]

∗ ∗
Input: sk = (ρ, K , tr, s1 , s2 , t0 ), M ∈ {0, 1} Input: pk = (ρ, t1 ), M ∈ {0, 1} , σ = (c̃, z, h)
Output: σ = (c̃, z, h) Output: Valid or Invalid
1: A ∈ Rk×l
q :=ExpandA(ρ) 1: A ∈ Rk×l q := ExpandA(ρ)
256 256
2: µ ← H(tr ∥ M ), rnd ← {0, 1} or {0} 2: µ ← H( H(ρ ∥ t1 ) ∥ M )
′ 256 2λ−256
3: ρ ← H(K ∥ rnd ∥ µ) 3: (c˜1 , c˜2 ) ∈ {0, 1} × {0, 1} ← c̃
4: κ := 0, (z, h) :=⊥ 4: c := SampleInBall(c˜1 )
′ d
5: while (z, h) :=⊥ do 5: w1 := UseHintq (h, Az P − ct1 ·2 , 2γ2 )
6: y ∈ S̃γl 1 := ExpandMask(ρ′ , κ) 6: if ∥z∥∞ < γ1 − β & hi ≤ ω & c̃ = H(µ ∥ w1′ ) then
7: w := Ay 7: return Valid
8: w1 := HighBitsq (w, 2γ2 ) 8: end if
2λ 9: return Invalid
9: c̃ ∈ {0, 1} ← H(µ ∥ w1 )
256 2λ−256
10: (c˜1 , c˜2 ) ∈ {0, 1} × {0, 1}
11: c := SampleInBall(c˜1 )
12: z := y + cs1 resolves the challenge posed by the Sign algorithm, which
13: r0 := LowBitsq (w - cs2 , 2γ2 ) requires more extensive processing compared to the others.
14: if ∥z∥∞ ≥ γ1 − β or ∥r0 ∥∞ ≥ γ2 − β then By incorporating these modules into an appropriate operation
15: (z, h) :=⊥ scheduling, as proposed in section III-A, we achieve improve-
16: else ments in latency within comparable hardware resources.
17: h := MakeHintq (−ct P 0 , w − cs2 + ct0 , 2γ2 )
18: if ∥ct0 ∥∞ ≥ γ2 or hi > ω then B. NUMBER THEORETIC TRANSFORM
19: (z, h) :=⊥ The Number Theoretic Transform (NTT) is a variant of the
20: end if Fast Fourier Transform (FFT) that operates on a finite field
21: end if and provides an efficient method for polynomial multiplica-
22: κ := κ + l tion. It reduces the complexity of polynomial multiplication
23: end while from O(n2 ) by using school-book method to O(nlogn) by
24: return σ = (c̃, z, h) using point-wise multiplication method. The NTT-based mul-
tiplication can be perform as: c = INTT(NTT(a)◦NTT(b)).
Where a, b, c could be polynomial matrix or vector ∈ Rq , ◦
the c̃ provided in the signature to confirm the authentication denotes point-wise multiplication of polynomial coefficients.
and integrity of the original message. ML-DSA, as a lattice-based cryptosystems, utilizes the
As a lattice-based cryptosystem, implementing ML-DSA NTT for matrix-vector and vector-vector multiplications. The
faces challenges in the sample generation and computation NTT is performed in the ring Rq = Zq [x]/(x n +1), where n is a
phases, which are the two most time-consuming progresses. power of 2 and q is a prime number. In ML-DSA, the modulus
To achieve high-performance hardware implementation, effi- q is chosen such that there exists a 2n-th root of unit (ϕ2n =
cient modules tailored to these phases are necessary. Based 1753). There are two common algorithms for implementing
on three algorithms, we observed that the Sign algorithm NTT, Cooley-Tukey (CT) decimal-in-time (DIT) algorithm
requires more extensive processing compared to the others. for NTT and Gentleman-Sande (GS) decimal-in-frequency
Previous works, as documented in citations [10]–[14], typi- (DIF) algorithm for inverse NTT (INTT). Applying the com-
cally followed a sequential approach, implementing the three bination of these algorithms can accelerate computations and
algorithms in the order outlined by Algorithms 1-3. In these eliminate the need of bit-reversal step [20].
works, various methods were proposed, with a predominant When implementing NTT on hardware, two popular vari-
focus on enhancing the efficiency of arithmetic module. To ants of the NTT architecture are used: memory-based and
speed up the execution time of these algorithms, the use pipelined architectures. The memory-based or iterative ar-
of multiple hash and arithmetic modules becomes essential. chitecture has the advantage of requiring fewer hardware
However, this approach increases hardware utilization, de- resources, but it is limited by memory port constraints, which
manding careful consideration of submodule architecture ef- restrict the number of butterfly units (BUs) that can be used
ficiency and its impact on the overall architecture. Therefore, to enhance NTT speed. This is evident in works of [9], [12],
in hashing tasks, we propose the use of two independent [15], which are limited to using a maximum of 4 BUs. Addi-
hash modules, including a SHAKE-256 module and a double tionally, the memory-based architecture requires a complex
Keccak-core Hash module for SHAKE-128 instance in ML- control module to calculate the NTT layer-by-layer, which
DSA. These modules enhance the latency and critical path increases the critical path. The pipelined architecture, on the
of the overall architecture, addressing concerns raised by a other hand, requires a large number of shift registers to store
unified Keccak module proposed in previous works. Addi- delay units, which depend on the chosen scheme [21]. This
tionally, our proposed flexible arithmetic module efficiently consumes more hardware resources, but it results in higher
4 VOLUME 11, 2023
Algorithm 4 Pipelined Radix-2 NTT Algorithm

Input: a = (A0 [0], A0 [1],..., A0 [n-1]) ∈ Rq , Ψ contains twid-
dle factor constants.
Output: A = NTT(a)
1: for(i = 1; i ≤ log2 (n); i++ ) do // unrolling
2: for (j = 0; j < n/2; j++) do // pipelining
3: W ← Ψi [j]
4: temp ← W × Ai−1 [j + n/(2i )] (mod q)
5: Ai [j + n/(2i )] ← Ai−1 [j] - temp (mod q)
6: Ai [j] ← Ai−1 [j] + temp (mod q)
7: end for
8: return A
performance. The pipelined NTT architecture is more suitable

for speeding up the execution time of ML-DSA, as it is used
for the most time-consuming phase of this scheme. How- FIGURE 1. High-level configurable architecture of ML-DSA.
ever, implementing this architecture in ML-DSA requires at

least eight BUs, which leads to the drawback of overusing
resources. This drawback will be addressed by our flexi- Control unit: Functioning as a finite-state machine (FSM),
ble arithmetic module, proposed in Section III. Algorithm 4 the control unit plays a vital role in overseeing various mod-
performs the pipelined radix-2 NTT computation, where the ules. It manages these operations based on required function-
variable i in the loop represents the layer NTT being executed alities and sequences the entire hardware module to execute
simultaneously (corresponding to position of BUi ), j is the all the primary operations of ML-DSA in their respective
order of coefficients in polynomials, which are sequentially orders. The control unit employs two control signals, namely
transformed via the NTT hardware architecture. ‘mode’ and ‘sec_lvl’, to configure which algorithm and secu-
rity level the architecture is set to operate. Depending on these
III. ARCHITECTURE AND DESIGN DECISIONS signals, the parameters are adjusted to work in accordance
As the design’s goal is efficient low-latency implementation with the values specified in Table 1. The value rnd is declared
for ML-DSA on the FPGA SoC platform, the proposed ar- as an initial value in the source code, where the default is a
chitecture is supported to perform all ML-DSA’s main al- 256-bit string consisting entirely of zeroes. Users have the
gorithms at three security levels as defined in its specifica- option to change it to an arbitrary value to enhance security.
tion. This design supports all processes including packing Polynomial BRAM: This module contains six groups of
and sending signature and keypair, which can cooperate with three dual-port 36Kb BRAMs used to store intermediate
other software or hardware platforms easily. This section values during the computation. In ML-DSA scheme, the
will introduce the overall high-level architecture and its op- modulus q is 8380417, allowing for coefficients to be rep-
timized modules used to improve latency. Accompany with resented within 23 bits. The Xilinx FPGA supports dual-
the efficient hardware implementation, the timing schedules port 36Kb BRAM memory on-chip [22]. Instead of storing
in Figs. 2-4 are proposed to optimize the work of entire each coefficient in a separate BRAM address, this module
architecture and get low-latency aspect which takes advantage is composed as a group of three BRAMs, each configured
of the most important and time-consuming submodules (poly- as a 1024×36 memory. This configuration allows for four
nomial arithmetic and hash module specify for ML-DSA). coefficients, with a bandwidth requirement of 4×23=92 bits,
to be stored in a single address of the group BRAMs, resulting
A. HIGH-LEVEL CONFIGURABLE ARCHITECTURE in a 25% reduction in BRAM utilization compared to the
1) Overall hardware architecture. straightforward approach. Totally, six group BRAMs are used
The high-level architecture of our hardware design is shown to save intermediate values during the computation process.
in Fig. 1. As a configurable architecture that implements three Polynomial arithmetic module: used for all arithmetic
main algorithms of ML-DSA, this figure depicts all compo- operations in ML-DSA, including NTT, INTT, point-wise
nents used. When executing one of those algorithms individu- multiplications, additions and subtractions between polyno-
ally, some submodules can be removed. To facilitate seamless mial matrices and vectors.
interoperability with other software/hardware platforms, our Hash module: includes SHAKE-128 and SHAKE-256
architecture supports packing/unpacking tasks with 64-bits modules, which function as hash functions or Extendable-
bus width. The block inside is working in 92-bits bus width to Output Functions (XOF) [23]. These modules generate hash
perform four coefficients at the same time. A brief description values of messages, random seeds and pseudorandom data,
of these modules and their works is explained below: which are then used for polynomial sampling in ML-DSA.
VOLUME 11, 2023 5
FIGURE 2. Timing diagram of Key generation phase in security level 5.
FIGURE 3. Timing diagram of Sign phase in security level 5.
FIGURE 4. Timing diagram of Verification phase in security level 5.
Sampling unit: This module consists of four submodules strings. The different vectors, along with their corresponding
responsible for generating different types of pseudorandom bit widths, are packed into fixed 8-byte strings. The decoder
samples used in ML-DSA. ExpandMatrix maps a uniform can then recover the original vectors from these byte strings.
seed to matrix A, UniformEta for generating the secret vec- Depending on the algorithm being performed, certain pro-
tors s1 , s2 , ExpandMask to make the masking vector y and cessing units can operate in parallel to reduce overall latency,
SampeInBall to create challenge c. provided there is no data flow conflict between them. From
Decomposer: This module breaks up a finite field element Algorithms 1-3, the polynomial generation and computation
r in Zq into their “high-order” bits and “low-order” bits, such phases in the ML-DSA scheme are the most time-consuming.
that r = r1 · 2γ + r0 where 0 ≤ r0 < 2γ. By prioritizing these tasks, the computation time of other
Hint: used to describe the work of both MakeHint and tasks can be masked behind these lengthy phases. This de-
UseHint functions. Given an arbitrary element r ∈ Zq and sign focuses on optimizing these tasks sequencing, enabling
a small element z ∈ Zq , "MakeHint" records a 1-bit hint h as continuous execution to enhance overall performance.
the "carry" that enables the computation of the higher-order
bits of r +z using only r and h. Conversely, "UseHint" utilizes 2) Operation scheduling.
the hint to recover the high-order bits of the sum. Based on algorithms and the proposed architecture, we sug-
Encoder/decoder: To interface with other hardware or gest an optimized timing schedule that maximizes the uti-
software platforms, which typically have a bandwidth of 2n , lization of submodules while maintaining constant progress
this implementation integrates encoder/decoder units. The en- on the critical path of the operations. In our architecture, the
coder is responsible for packing polynomial vectors into byte controller operates as a FSM, ensuring sequential processing
6 VOLUME 11, 2023
of all hardware modules. To enhance clarity, we present the

optimized schedule specifically for level 5 of Algorithms 1-3.
Key generation and Verification: The FSM schedule for
Key generation and Verification algorithms are described
straightforwardly stage by stage in Fig.s 2 and 4 corre-
sponding with Algorithms 1 and 3, respectively. The longest
path among them is the computation, hashing and sampling,
which occupies around 90% total time execution. The pack-
ing/unpacking phases are immediately executed in parallel
with the generation matrix, vectors and computations. Pack
t0 , Pack t1 perform the function Power2Round, the straight-
forward bit-wise way to break up t into 13-bits high-order t1
and 10-bits low-order t0 , in line 6 of Algorithm 1. The key
generation progress is finished after packing their ingredients. FIGURE 5. Sampling for matrix A.
In verification, the validity of the signature depends on the
result of the comparison described in line 6 of Algorithm 3,
which is implemented in stage 7 as depicted in Fig. 4.
Signature generation: This is the most complex main (k, l) of the vectors and matrices in the scheme, leading to
algorithm in ML-DSA. It consists of two phases: precom- a substantial demand for pseudorandom data, corresponding
putation and rejection loop, as shown in Fig. 3. During the to longer execution times in this phase and entire algorithm.
precomputation phase, the secret key and message are un- Especially, in the hardest security level, where matrix A ne-
packed and the necessary calculations are performed prior cessitates 56 polynomial samples, totaling 344 kbits, without
to the while loop in Algorithm 2. This phase is executed considering the rejection rate for each sampling. To address
only once. The rejection loop is responsible for generating this challenge, we employ double 96-bit datapath Keccak
the signature until it satisfies the requirements specified in cores in the SHAKE-128 module, while the SHAKE-256
the algorithm. The average number of loops can be found module employs a single 64-bit core, as illustrated in Fig.
in [3]. Inspired by the approach proposed by Beckwith et 5. This design aligns with the characteristics of ML-DSA,
al. [9], the polynomial arithmetic module is divided into two optimizing the execution time of the sampling phase while
independent parts, functioning as described in mode 2 of this minimizing resource utilization. Our modules are modified
module. The first part is used to create the vector w and from the high-performance Keccak implementation core de-
executes the functions from lines 6 to 8, while the remaining veloped by the Keccak team [24]. The Keccak control unit
functions within the loop are performed in the second module. manages the internal signals to control module operations and
facilitate transitions between the absorb and squeeze phases.
B. HASHING AND SAMPLING Taking the example of sampling matrix A, depicted in Fig.
ML-DSA scheme utilizes SHAKE-128 and SHAKE-256 as 5, the SHAKE-128 module needs 4 clock cycles (CCs) to
its hashing functions, both belonging to the XOF category absorb 256 bits of seed and 16-bit nonce before transitioning
within the Secure Hash Algorithm 3 (SHA-3) family. These to the squeeze phase, where sampling and hashing processes
functions are based on the same Keccak permutation with a operate continuously. With the SHAKE-128 rate r=1344 and
state size of 1600 bits [23]. In this context, matrix A ∈ Rk×l is 96-bits datapath of this core, the output of each squeezing
q
the output stream of SHAKE-128. The other sample vectors phase necessitates 14 cycles, which can occur simultaneously
in ML-DSA are generated by SHAKE-256, described in sec- within 24 cycles required for the Keccak permutation in the
tion II, including polynomial secret vectors s1 , s2 , masking subsequent squeezing phase. In summary, considering that
vector y, challenge c and other hashing tasks. Due to the the coefficients in matrix A fall within Zq (where q is the
shared Keccak-f[1600] core with a 24-round permutation, modulus), the acceptable rate is q/223 , which is close to 1.
the unified SHAKE128/256 approach has been employed Consequently, this module requires approximately 144 cycles
in prior works. However, in our work, we propose using to generate two polynomials, each with 256 coefficients.
two independent Keccak-based Hash modules, customized to The remaining hashing tasks and sampling vectors in ML-
suit ML-DSA’s unique requirements. This proposed method DSA are implemented in the SHAKE-256 module in a similar
allows for the simultaneous execution of multiple tasks, such manner. This module consists of a single 64-bits data-path
as generating matrix A and other vectors, resulting in re- core controlled by the Keccak control unit, specifically de-
duced estimated execution times for the scheme. Moreover, signed for SHAKE-256. It interfaces with the corresponding
implementing dedicated hash modules tailored to specific sampling units to generate the required vectors based on their
requirements can reduce the complexity of this module and rejection conditions and received data widths, as outlined in
the entire architecture, potentially leading to an increase in Table 1. The selection of a 64-bit datapath is well-suited for
the maximum achievable frequency of the entire system. the width of the output samples, making it compatible with
ML-DSA’s security primarily depends on the dimensions integration with other software/hardware platforms.
VOLUME 11, 2023 7
FIGURE 6. Flexible polynomial arithmetic module in mode 1. FIGURE 7. Flexible polynomial arithmetic module in mode 2.
C. FLEXIBLE POLYNOMIAL ARITHMETIC MODULE per cycle during polynomial arithmetic. This mode is used
As discussed in Section II, one of the critical considera- for Sign operation, one module to compute vector y and w,
tions is the extensive computational demands in the Sign the other for remaining computations as shown in Fig. 3.
algorithm, particularly during the rejection loop phase. To To optimize memory and efficiently store twiddle factors
improve performance in this phase, we propose an approach (TFs), this module uses six registers dedicated to storing
inspired by [9] to distribute the calculations across two poly- precomputed TFs for the two first layers of NTT/INTT. These
nomial arithmetic (PolyArith) modules. However, since our registers are directly connected to PE1 and PE4. Additionally,
architecture unifies all three main algorithms of ML-DSA, it utilizes two 36Kb BRAM configurations in a 72×512 mode
implementing two PolyArith modules when Verify and Key- to store TFs associated with the remaining six PEs. The
gen algorithms only require one is an unnecessary resource addresses in BRAM are carefully arranged in accordance with
allocation. Hence, our arithmetic module is designed based their order of use. Shift registers are used to implement a FIFO
on the radix-2 MDC FFT pipelined architecture [21] featuring unit for saving intermediate values during NTT/INTT.
eight Processing Elements (PE), which includes butterfly unit
and other sub-units, to handle all polynomial arithmetic op-
erations in ML-DSA. This module offers two modes that can 1) Butterfly unit.
be flexibly adapted to suit the specific characteristics of ML- The butterfly unit plays a crucial role in the arithmetic module
DSA, both of them are configurable to support NTT/INTT for performing various operations. In the proposed design, a
operations with the data flow pass through the black and faint configurable BU architecture is presented, supporting both
lines, respectively, in Figs. 6, 7. the CT BU and GS BU methods, as depicted in Fig. 8. The
Fully pipelined mode: In this mode, we use eight PEs proposed butterfly unit consists of one modular multiplier,
corresponding to the eight layers of ML-DSA for N=256, as one modular adder and one modular subtractor, without re-
outlined in Algorithm 4. BUi take the input from its previous quiring any additional modular multiplier, adder, or subtractor
BU consequently, the twiddle factors have been precomputed compared to a dedicated CT or GS butterfly configuration.
and saved in corresponding ROM address. The stall delay The proposed BU takes a, b and ω as input, the NTT/INTT
length from input to output is 153 CCs, after that each poly- control signal in the blue line acts as a selection signal for the
nomial requires 128 cycles for NTT/INTT. The BUs can be multiplexers, configuring the BU to operate as either a GS
reused for all polynomial arithmetic, processing 8 coefficients or CT butterfly configuration. the BU architecture enables it
in parallel. This mode is employed in the Keygen and Verify. to handle point-wise multiplication, addition and subtraction
Folding transform mode: In this mode, we divide our through the corresponding internal unit. A and B denote the
module into two independent modules, each equipped with output ports when performing point-wise multiplication or
four PEs. By applying the folding transformation method subtraction, the result is taken from port B, while port A is
[25], each PE is responsible for calculating two adjacent NTT utilized for addition using the CT butterfly configuration.
layers in a time-sliced fashion. This enables us to employ four In [26] Zhang et al. proposed a technique to eliminate the
PEs to implement radix-2 MDC NTT in ML-DSA, similar to multiplication of resulting coefficients with n−1 (mod q) after
the approach in [10]. In this mode, during NTT/INTT, the PEs the INTT operation. The output can be seen as (A, B) :=
handle the odd stages in odd cycles and the even stages in even (2-1 (a + b), (a - b)ω). For an odd prime q, x/2 (mod q) can
cycles. This mode has a delay of 281 CCs from input to output be performed as shown in Eqn. 1. In this work, we adopt
and requires 256 CCs to transfer a single polynomial during this technique by inserting divide 2 operation integrated into
NTT/INTT. It also supports the processing of 4 coefficients addition unit and pre-processed the twiddle factor for the
8 VOLUME 11, 2023
FIGURE 9. Architecture of modular multiplication unit.
our modular multiplication based on this equation, as shown

in Figure 9. In this figure, m = 50331648 and {} denotes
concatenated function.
FIGURE 8. Proposed butterfly unit.
IV. IMPLEMENTATION RESULTS AND COMPARISON
A. RESOURCES UTILIZATION AND PERFORMANCE
inverse NTT to incorporate the factor 2-1 . RESULTS
Table 2 provides an overview of the resource consumption for
x/2 (mod q) = (x ≫ 1) + x[0].(q + 1)/2 (1)
the entire design and its submodules. The complete design
utilizes 49753 LUTs, 23477 FFs, 27.5 BRAMs and 16 DSPs,
2) Modular multiplication.
working at the maximum frequency of 300 MHz. To achieve
Modular multiplication, addition and subtraction are the fun-
the low latency target, we selected the Virtex Ultrascale+
damental arithmetic components inside the butterfly unit.
platform and all the resource data was generated by Xilinx
The reference implementation on software platforms of ML-
Vivado 2022.2. Due to our data-path employing four parallel
DSA [3] uses Montgomery reduction after multiplication and
paths to execute four coefficients per cycle, the resource usage
Barrett reduction for addition and subtraction. Both of these
is higher compared to other Lightweight target works. Two
can apply to hardware and have been adopted in some related
hash modules account for approximately 29% and 42% of the
works but not seem to be the best choice. While Montgomery
LUT usage, making them the most resource-intensive compo-
needs to transfer from normal domain to Montgomery domain
nents. The design utilizes a total of 27.5 BRAMs, with 10.5
leading to additional multiplications, Barret reduction does
BRAMs allocated for storing matrix A and 15 BRAMs used
not exploit the specification of modulus q in ML-DSA. In
for intermediate values during computation. The polynomial
[13], Land proposes the reduction method by recursively
arithmetic module requires 2 BRAMs to store twiddle factors
exploit the relation 223 ≡ 213 − 1:
and 16 DSPs for eight PEs. As a combined architecture, some
units are only utilized in specific algorithms and are not called
s[45 : 0] ≡ 223 s[45 : 23] + s[22 : 0] when performing other tasks. For example, the Hint module
≡ 213 s[45 : 23] − s[45 : 23] + s[22 : 0] is not involved in the Keygen. However, the hash modules,
≡ 223 s[45 : 33] + 213 s[32 : 23] − s[45 : 23] + z ExpandA, PolyArith, encoder/decoder and polynomial RAM
can be fully shared among different algorithms, consuming
≡ 213 (s[45 : 43] + s[42 : 33] + s[32 : 23]) 55% LUTs, 66% FFs and 100% DSPs.
− (s[45 : 43] + s[45 : 33] + s[45 : 23]) + z In terms of latency, we have compiled statistics on the over-
(2) all execution time, including the data input, unpack and pack
Following careful consideration, we decided to incorporate signature and keypair processes. From the average statistics
this technique into our modular reduction unit. However, of 1000 executions, we obtained the following specific pa-
as subtraction is not as hardware-friendly as addition, we rameters. The Keygen operation requires an average of 3320,
employ the additive inverse with s̄ representing the negative 5330 and 8226 CCs for three security levels. The interval
of s, to transform the equation into: between each sample depends on the rejection rate of matrix
A and secret vectors s1 but remains relatively small due
s[45 : 0] ≡ s̄[45 : 23] + (213 s[32 : 23] + s̄[45 : 33]) to the high acceptance rate. The Verify operation consumes
+ (213 (s[42 : 33] + s̄[45 : 43]) + 213 s[45 : 43] around 4137, 5846 and 8062 CCs for three NIST levels. In
+ z + 50331648 this operation, only valid samples are calculated, because the
≡ s̄[45 : 23] + {s[42 : 33], 10′ b0, s̄[45 : 43]} invalid sample is processed much faster and does not show
total time execution of this operation. The interval time also
+ {s[32 : 23], s̄[45 : 33]} + 213 s[45 : 43] + z + m depends on the rejection condition in generating matrix A.
(3) The number of cycles required for the Sign algorithm varies
Since the result of the reduction at this point falls within the widely due to the rejection loop in signature generation. The
interval (-q, 3q), we can simplify the hardware architecture for best-case scenario occurs when the signature is accepted after
VOLUME 11, 2023 9
TABLE 2. Resource utilization of submodules in the architecture. architecture in [12] is the smallest hardware accelerator for
Dilithium, but it only supports level 5. Therefore, our archi-
Submodule Resource Utilization tecture complexity is nearly equivalent to [9], [10], but higher
LUT FF DSP RAM than other implementations dedicated to specific security
Polynomial RAM 0 0 0 25.5 levels. Additionally, with the same lattice-based structure,
Hint 10476 3776 0 0 Aikata et al. [15] propose a combined architecture for Kyber
Encoder 2183 464 0 0
Decoder 2683 251 0 0 and Dilithium schemes. For a comprehensive comparison of
Decomposer 1565 728 0 0 resource utilization and performance in Keygen, Sign and
PolyArith 5819 3527 16 2 Verify in our implementation and related works are listed
SHAKE-128 9278 6041 0 0 in Table 3. As discussed earlier, the execution time of the
SHAKE-256 5126 3941 0 0
Sampling 6942 2868 0 0
Sign algorithm varies depending on the rejection loop con-
Other 5681 1881 0 0 ditions. Therefore, in Table 3, the parameters for the Sign
column are split into the best-case scenario and average case,
Top ML-DSA Architecture 49753 23477 16 27.5
respectively. For an effective evaluation of the design, Figs.
10-13 present a fair comparison using the ATP and EqS
metrics for corresponding parameters of level 5-the highest
the first iteration, resulting in a time of 10804, 14859 and security level supported by Dilithium. These figures present
21199 CCs for three security levels. The processing time of a comparative analysis of the ATP_LUT, ATP_FF, ATP_DSP
input message is determined by the size of the message and and ATP_BRAM parameters among the four most advanced
operates at a rate of 8 bytes per cycle. Based on the parameters designs available.
in the ML-DSA specification, an average of 4.25 attempts is The work of Land [13] is one of the earliest designs for
required for level 2, 5.1 attempts for level 3 and 3.85 for level Round 3 Dilithium. This design has several non-optimized
5. The average time to execute one attempt is 5665, 7658 and aspects, such as the use of separated NTT/INTT blocks and
10340 CCs, leading to average times of signature generation multiplication blocks, which could be combined to achieve
are 29219, 46274 and 50848 CCs for three security levels. efficiency without increasing algorithm execution time.Thus,
based on the data in Table 3, even with the EqS metric, it is
B. COMPARISON WITH RELATED WORKS evident that our design achieves significantly better perfor-
ML-DSA was recently introduced with slight updates to the mance than this design, especially in security level 5.
earlier version of CRYSTALS-Dilithium, including minor In [12], Naina proposed a lightweight implementation for
parameter changes and the "hedged" versions in Algorithm level 5 Dilithium on the Zynq Ultrascale+ hardware plat-
2, as explained in the ML-DSA specification [3]. Despite form. With a focus on minimizing hardware utilization, this
some minor differences, to compare the efficacy of our im- design achieves the best results in terms of LUTs and FFs.
plementation, several state-of-the-art full hardware imple- However, regarding execution time, this design only provides
mentations for Round-3 Dilithium based on FPGA platforms data for the best-case scenario of the Sign algorithm without
with the best results are listed in Table 3. As mentioned specific information about the average execution time of this
above, these implementations vary in their approaches, such algorithm. Despite the differences in design approach, as our
as Lightweight, High-performance, or others. To compare the design supports various security levels, our implementation
effectiveness of these implementations, we employ the area- demonstrates better efficiency in almost all statistics when
time tradeoff metric. The ATP values are calculated by mul- evaluated using the ATP metric, as shown in Figs. 10-13.
tiplying the latency with the number of utilized LUTs, FFs, Wang et al. [11] used three independent cores for the
DSPs and BRAMs denoted as ATP_LUT, ATP_FF, ATP_DSP corresponding Dilithium’s security levels. This work is soft-
and ATP_BRAM, respectively. In this metric, our work serves ware/hardware codesign, where bit packing and unpacking
as the "center point" for comparison, with a value set to are not included in the hardware part. Therefore, when com-
1. Lower values indicate better performance, as both area paring by ATP metric in level 5, a direct comparison may not
and time consumption should be minimized. Additionally, be totally fair as our design, which supports 3 levels, is more
to ensure a fair comparison despite differences in hardware complex and resource-consuming. However, our implemen-
platforms, we use an equivalence conversion method. This tation showcases a slight improvement in terms of memory
method, previously used in [27], approximates the number of utilization and an significant performance enhancement.
LUTs, FFs, DSPs, and BRAMs to equivalent slices (EqS). Zhao et al. [10] presents a compact design for Dilithium
By this way when comparing results implemented on 7 series on the low-end Artix-7 platform. Although it also supports all
FPGAs platforms, such as Artix-7, they can be evaluated with three levels, it lacks the flexibility of our design. Unlike our
half of the equivalent results on UltraScale+ platforms. design, its results are stored only in its memory and it does not
The implementations in our work and in [9], [10] represent support the packing and sending of results to connect with ex-
full hardware designs for all three security levels of Dilithium. ternal hardware/software platforms. Dilithium is a public key
On the other hand, in [11], [13], the authors propose in- cryptosystem and the ability to pack and send public keys and
dependent designs for each security level. The Lightweight digital signatures to other platforms is fundamental. Without
10 VOLUME 11, 2023
TABLE 3. Comparison of FPGA-based implementations for Dilithium signature scheme.
Sec. Design Device Freq. Area Keygen Sign Verify

lvl (MHz) LUT FF DSP BRAM cycles µs cyclesd µsd cycles µs
Aikata [15]a,e ZUs+ 270 23277 9798 4 24 14594 54 31662/- 117/- 15423 57
Land [13]b,e Artix-7 163 27433 10681 45 15 18761 115 19423/76613 119/470 19687 121
II Wang [11]b Zynq-7k 159 18558 7342 10 17 7757 49 -/52038 -/327 7675 48
Zhao [10]c Artix-7 96.9 29998 10366 10 11 4172 43 8361/31600 86.3/326.1 4422 45.6
Beckwith [9]c,e VUs+ 256 53907 28435 16 29 4875 19 10945/29876 43/117 6582 26
This workc,e VUs+ 300 49753 23477 16 27.5 3320 11.1 10804/29219 36/97.4 4137 13.8
Aikata [15]a,e ZUs+ 270 23277 9798 4 24 23619 87 48446/- 171/- 26124 97
Land [13]b,e Artix-7 145 30900 11372 45 21 33102 229 26979/123218 186/852 32050 222
III Wang [11]b Zynq-7k 159 19614 8466 10 21 12982 82 -/89213 -/561 11232 71
Zhao [10]c Artix-7 96.9 29998 10366 10 11 5851 60.4 11399/49496 117.6/510.8 6181 63.8
Beckwith [9]c,e VUs+ 256 53907 28435 16 29 8291 32 16178/49437 63/193 9724 39
This workc,e VUs+ 300 49753 23477 16 27.5 5330 17.8 14859/46274 49.5/154.2 5846 19.5
Aikata [15]a,e ZUs+ 270 23277 9798 4 24 39737 147 70179/- 260/- 46671 173
Naina [12]b,e ZUs+ 391 13975 6845 4 35 63.2k 162 113.9k/- 291/- 67.9k 174
Land [13]b,e Artix-7 140 44653 13814 45 31 50892 364 36609/145912 261/1042 52712 377
V Wang [11]b Zynq-7k 159 20973 9677 10 28 20185 127 -/93708 -/589 15875 100
Zhao [10]c Artix-7 96.9 29998 10366 10 11 8765 90.5 15790/55321 163/570.9 9039 93.3
Beckwith [9]c,e VUs+ 256 53907 28435 16 29 14037 55 24358/55070 95/215 14642 57
This workc,e VUs+ 300 49753 23477 16 27.5 8226 27.4 21199/50848 70.7/169.5 8062 26.9
a : multiple schemes, b : support dependent level, c : support all levels,
d : best-case / average-case scenarios, e : include packing/unpacking.
FIGURE 10. Comparison of Area×Time metric (LUTs × s). FIGURE 11. Comparison of Area×Time metric (FFs × s).
this capability, the system loses much of its practicality. In and the potential for higher operating frequency, resulting in
terms of latency, their average execution time statistics do not enhanced time execution, while maintaining equivalent hard-
include the data input receiving and sending processes. The ware consumption in ATP_LUT and ATP_DSP parameters,
execution time for Verify algorithm is calculated only in cases using ATP and EqS metrics.
where the public key remains unchanged. In similar cases, our The work of [9] presented a design similar to ours, provides
work only requires 3179, 4819 and 7232 CCs for the three support for three security levels. However, it uses the Minerva
levels. This work also suffers from the complexity of the hash tool [28] to determine the maximum frequency, which repre-
module, leading to a low operating frequency. Therefore, even sents a different benchmark compared to other designs. By
with different platforms, our work offers improved latency enhancing the versatility of the arithmetic module and opti-
VOLUME 11, 2023 11
Verify and 1001µs for the best-case scenario of Sign. In the

same tasks, our hardware implementation outperforms their
software implementation, achieving a speed-up of 19.8×,
23.2× and 14.2×, respectively. Additionally, Becker et al.
[7] presented a software implementation on high-end CPUs,
specifically the ARMv8 Apple M1 Firestorm core at 3.2GHz.
Their reported execution times for the three main algorithms
of Dilithium-III are 48µs, 114µs and 33µs. In terms of time
execution, our hardware implementation achieves improve-
ments of 2.7×, 0.74× and 1.7× for Keygen, Sign and Verify,
respectively. The similarly significant results are showcased
when compared with one of the most notable Dilithium
implementations on the RISC-V CVA6 core on the Xilinx
FIGURE 12. Comparison of Area×Time metric (DSPs × s).
ZCU106 evaluation board [8], where the execution times
of highest security level Dilithium are 56.21ms for Keygen,
176.3ms and 59.37ms for Sign and Verify.
V. CONCLUSION
In this paper, we present an efficient and low-latency hard-
ware implementation of ML-DSA, which supports all main
algorithms at three security levels of NIST. Our implemen-
tation enables the complete execution of all ML-DSA tasks
independently or in interaction with other software/hardware
platforms to accelerate specific processes. When compared
with the state-of-the-art hardware architecture for all three
security levels of ML-DSA, we achieve the lowest latency
and efficiency in terms of area/performance tradeoffs, with
FIGURE 13. Comparison of Area×Time metric (BRAMs × s). a significant improvement of 1.27 to 2.58× as evaluated by
the ATP metric. Additionally, this work introduces optimized
arithmetic operations and hashing & sampling modules tai-
mizing the hash module, specifically tailored for Dilithium, lored to the characteristics of ML-DSA, enhancing the over-
our architecture has achieved a 2× improvement in all ATP all efficiency of the scheme. These modules can also work
parameters during Keygen and Verify algorithms. Our results independently as accelerators to support software platforms,
also showcase improved Sign algorithm execution times and handling the two most time-consuming parts of these algo-
more efficient resource utilization, detailed in Table 3. rithms. In summary, our proposed architecture exhibits poten-
tial adaptability to diverse applications of digital signatures,
Recently, a notable unified architecture for Kyber and
offering resilience against both quantum and classical attacks.
Dilithium was introduced by Aikata et al. [15]. When the
overall area consumption of this work is analyzed in Table
3, it utilizes 22184 LUTs, 8596 FFs, 4 DSPs, and 24 BRAMs REFERENCES
for the hardware part without the submodules solely support- [1] P. W. Shor, ‘‘Polynomial-time algorithms for prime factorization and dis-
crete logarithms on a quantum computer,’’ SIAM Journal on Computing,
ing Kyber. When evaluated using the ATP metric based on vol. 26, no. 5, pp. 1484–1509, Oct. 1997.
these statistics, our work exhibits substantial improvements [2] NIST, ‘‘Status report on the third round of the NIST-PQC standard-
in almost all parameters, as shown in Figs. 10-13. ization Process,’’. Available: https://csrc.nist.gov/projects/post-quantum-
cryptography.
In summary, our implementation exhibits the lowest la- [3] NIST, ‘‘FIPS 204 - Module-Lattice-Based Digital Signature Standard," url:
tency in both clock cycle and time execution compared to the https://csrc.nist.gov/pubs/fips/204/ipd
other designs. Furthermore, when compared to the combined [4] S. Bai, L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, P. Schwabe,
G. Seiler, and D. Stehl´e, ‘‘CRYSTALS-Dilithium Algorithm Specifica-
architecture supporting all security levels of Dilithium in [9], tions and Supporting Documentation,’’ Proposal to NIST PQC Standard-
our design demonstrates efficiency improvements ranging ization, Round3, 2021, https://pq-crystals.org/dilithium/data/dilithium-
from 1.27∼2.58×, as evaluated by the ATP metric. specification-round3.pdf.
[5] National Institute of Standards and Technology (2023) Digital Signature
In comparison to the software implementation, our hard- Standard (DSS).(Department of Commerce, Washington, D.C.), Federal
ware implementation in this work demonstrates significant Information Processing Standards Publication (FIPS) NIST FIPS 186-5.
performance improvements. Seo et al. [6] reported an opti- https://doi.org/10.6028/NIST.FIPS.186-5
mized NEON implementation on an 8-core ARM v8.2 64- [6] S. C. Seo, S. W. An, ‘‘Parallel implementation of CRYSTALS-Dilithium
for effective signing and verification in autonomous driving environment,’’,
bit CPU mounted on Jetson AGX Xavier. For Dilithium level ICT Express, vol. 9, issue 1, pp. 100–105, Feb. 2023.
5, their implementation takes 542µs for Keygen, 625µs for [7] H. Becker, V. Hwang, M. J. Kannwischer, B.-Y. Yang, and S.-Y. Yang,
12 VOLUME 11, 2023

‘‘Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple QUANG DANG TRUONG (Graduate Student
M1,’’ Cryptology ePrint Archive 2021/986, Jul. 2021. Member, IEEE) received his B.S. degree in Electri-
[8] P. Nannipieri, S. Di Matteo, L. Zulberti, F. Albicocchi, S. Saponara, cal and Electronic Engineering from Le Quy Don
and L. Fanucci, ‘‘A RISC-V Post Quantum Cryptography Instruction Set University of Technology, Vietnam, in 2019 and
Extension for Number Theoretic Transform to Speed-Up CRYSTALS M.s degree in Electrical and Computer Engineer-
Algorithms,’’ IEEE Access, vol. 9, pp. 150798-150808, 2021. ing from Inha University, in 2023. He is currently
[9] L. Beckwith, D. T. Nguyen, and K. Gaj, ‘‘High-performance hardware pursuing his Ph.D. in Electrical and Computer
implementation of CRYSTALS-Dilithium,’’ Crypto. ePrint Arch., Report
Engineering from Inha University. His research
2021/1451, 2021.
interests include algorithm and architecture design
[10] C. Zhao, N. Zhang, H. Wang, B. Yang, W. Zhu, Z. Li, M. Zhu, S. Yin, S.
Wei, and L. Liu, ‘‘A compact and high-performance hardware architecture for post-quantum cryptography and homomorphic
for CRYSTALS-Dilithium,’’ IACR Trans. Cryptogr. Hardw. Embed.Syst., encryption.
vol. 2022, no. 1, pp. 270–295, 2022.
[11] T. Wang, C. Zhang, P. Cao and D. Gu, ‘‘Efficient implementation of
Dilithium signature scheme on FPGA SoC Platform,’’ IEEE Transactions
on Very Large Integration (VLSI) systems, vol: 30, Sep. 2022.
[12] N. Gupta, A. Jati, A. Chattopadhyay, and G. Jha, ‘‘Lightweight Hardware
Accelerator for Post-Quantum Digital Signature CRYSTALS-Dilithium,’’
IEEE Transactions on Circuits and Systems I: Regular Papers (2023).
[13] G. Land, P. Sasdrich, and T. Guneysu, ‘‘Hard Crystal-Implementing
Dilithium on Reconfigurable Hardware,’’ Cryptology ePrint Archive
2021/355, Mar. 2021.
[14] S. Ricci, L. Malina, P. Jedlicka, D. Smekal, 1. Hajny, P. Cibik, and P.
Dobias, ‘‘Implementing CRYSTALSDilithium Signature Scheme on FP- PHAP NGOC DUONG (Member, IEEE) received
GAs,’’ Cryptology ePrint Archive 2021/108, Jan. 2021. the M.S. degree in Electronic Engineering from
[15] A. Aikata, C. Mert, M. Imran, S. Pagliarini and S. Roy, ‘‘KaLi: A crystal The University of Danang, Vietnam, in 2015 and
for post-quantum security using Kyber and Dilithium,’’ IEEE Transactions the Ph.D. degree in Electrical and Computer En-
on Circuits and Systems I, vol: 70, Feb. 2023. gineering from Inha University, South Korea, in
[16] Vadim Lyubashevsky, ‘‘Fiat-shamir with aborts: Applications to lattice and 2022. He is currently a post-doctoral fellow at the
factoring-based signatures,’’ in International Conference on the Theory Artificial Intelligence System-on-Chip (AI-SoC)
and Application of Cryptology and Information Security, Springer Berlin Research Center, Inha University, South Korea;
Heidelberg, pp. 598–616, 2009. and a lecturer of Vietnam-Korea University of In-
[17] Y. Wang and M.Wang, ‘‘Module-LWE versus Ring-LWE, Revisited,’’
formation and Communication Technology, The
Cryptology ePrint Archive 2019/930, Aug. 2019.
University of Danang, Vietnam. His research interests include algorithm and
[18] E. Kiltz, V. Lyubashevsky, and C. Schaffner,‘‘A concrete treatment
of Fiat–Shamir signatures in the quantum random-oracle model,’’ in architecture for digital signal processing, hardware acceleration architecture,
Proc.37th Annu. Int. Conf. Theory Appl. Cryptograph. Techn., pp. 552–586, post-quantum cryptography and homomorphic encryption.
Tel Aviv, Israel, Apr. 2018.
[19] Leon Groot Bruinderink and Peter Pessl, ‘‘Differential fault attacks on
deterministic lattice signatures,’’ IACR Transactions on Cryptographic
Hardware and Embedded Systems, pp. 21–43, 2018.
[20] P. Longa and M. Naehrig, ‘‘Speeding up the number theoretic transform for
faster ideal lattice-based cryptography,’’ in Proc. 15th Int. Conf. Cryptol.
Netw. Secur., pp. 124–139, Milan, Italy, Nov. 2016.
[21] S. He and M. Torkelson, ‘‘A new approach to pipeline FFT processor,’’ in
Proceedings of International Conference on Parallel Processing, pp. 766-
770, April 1996.
[22] ‘‘UltraScale Architecture Memory Resources Advance Specification User
Guide,’’ UG573 (v1.1) August 14, 2014. HANHO LEE (Senior Member, IEEE) received
[23] National Institute of Standards and Technology, in FIPS PUB 202: SHA- Ph.D. and M.Sc. degrees in Electrical and Com-
3 Standard: Permutation-Based Hash and Extendable-Output Functions. puter Engineering, from the University of Min-
Aug. 2015. nesota, Minneapolis in 2000 and 1996, respec-
[24] G. Bertoni, J. Daemen, S. Hoffert, M. Peeters, and G. V. Assche.(2020)., tively. He was a member of the technical staff
Keccak in VHDL. [Online]. Available:https://keccak.team/hardware.html at Lucent Technologies (Bell Labs Innovations),
[25] K. K. Parhi, C.-Y. Wang, and A. P. Brown, ‘‘Synthesis of control circuits in Allentown from April 2000 to August 2002 and
folded pipelined DSP architectures,’’ IEEE Journal of Solid-State Circuits, Assistant Professor in the Department of Elec-
vol. 27, no. 1, pp. 29– 43, Jan. 1992. trical and Computer Engineering, University of
[26] N. Zhang, B. Yang, C. Chen, S. Yin, S. Wei and L.Liu, ‘‘Highly Effi- Connecticut, USA from August 2002 to August
cient Architecture of NewHope-NIST on FPGA using Low-Complexity
2004. He has been with the Department of Information and Communication
NTT/INTT,’’ IACR Trans. on CHES, vol. 2020, no. 2, pp. 49–72, 2020.
[27] P. Duong-Ngoc, S. Kwon, D. Yoo, and H. Lee, ‘‘Area-efficient number the-
Engineering, Inha University since August 2004, where he is currently a
oretic transform architecture for homomorphic encryption,’’ IEEE Trans. Professor. He was visiting scholar at Bell Labs, Alcatel-Lucent, Murray Hill,
Circuits Syst. I, Reg. Papers, vol. 70, no. 3, pp. 1270–1283, Mar. 2023. New Jersey, USA from August 2010 to August 2011. His research interests
[28] F. Farahmand, A. Ferozpuri, W. Diehl, and K. Gaj, ‘‘Minerva: Automated include algorithm and architecture design for cryptographic, forward error
hardware optimization tool,’’ in 2017 International Conference on ReCon- correction coding, artificial intelligence and digital signal processing.
Figurable Computing and FPGAs, ReConFig 2017, pp. 1-8, Cancun: IEEE,
Dec. 2017.
VOLUME 11, 2023 13

Main Manuscript

Uploaded by

Copyright:

Available Formats

Main Manuscript

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Main Manuscript

Uploaded by

Copyright:

Available Formats

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.

Digital Object Identifier 10.1109/ACCESS.2023.0322000

Efficient Low-Latency Hardware Architecture for

INDEX TERMS Post-quantum cryptography (PQC), Module-Lattice-Based Digital Signature Standard

I. INTRODUCTION lected CRYSTALS-Dilithium for digital signature standard-

VOLUME 11, 2023 1

Algorithm 2 Signature Generation Sign(sk, M ) [3] Algorithm 3 Verification Verify(pk, M , σ) [3]

Algorithm 4 Pipelined Radix-2 NTT Algorithm

performance. The pipelined NTT architecture is more suitable

ever, implementing this architecture in ML-DSA requires at

FIGURE 2. Timing diagram of Key generation phase in security level 5.

FIGURE 3. Timing diagram of Sign phase in security level 5.

FIGURE 4. Timing diagram of Verification phase in security level 5.

of all hardware modules. To enhance clarity, we present the

FIGURE 9. Architecture of modular multiplication unit.

our modular multiplication based on this equation, as shown

TABLE 3. Comparison of FPGA-based implementations for Dilithium signature scheme.

Sec. Design Device Freq. Area Keygen Sign Verify

Verify and 1001µs for the best-case scenario of Sign. In the

12 VOLUME 11, 2023

VOLUME 11, 2023 13

You might also like