Main Manuscript
Main Manuscript
Main Manuscript
ABSTRACT The rapid advancement of powerful quantum computers poses a significant security risk to
current public-key cryptosystems, which heavily rely on the computational complexity of problems such
as discrete logarithms and integer factorization. As a result, CRYSTALS-Dilithium, a lattice-based digital
signature scheme with the potential to be an alternative algorithm that can withstand both quantum and
classical attacks, has been standardized as ML-DSA after NIST Post-Quantum Cryptography competition.
While prior studies have proposed hardware designs to accelerate this cryptosystem, there is room for
further optimization in the tradeoff between performance and hardware consumption. This paper addresses
these limitations by presenting an efficient low-latency hardware architecture for ML-DSA, leveraging
optimized timing schedules for its three main algorithms. The hardware implementation enables runtime
switching main operations in ML-DSA with various security levels. We design flexible arithmetic and
hash modules tailored for ML-DSA, the most time-consuming submodules and key determinants of the
scheme implementation. Combined with efficient operation scheduling to maximize the utilized time of
submodules, our design achieves the best latency among FPGA-based implementations, outperforming state-
of-the-art works by 1.27∼2.58× in terms of the area-time tradeoff metric. Therefore, the proposed hardware
architecture demonstrates its practical applicability for digital signature cryptosystems in post-quantum era.
ture applications. There is a growing interest in developing posed in [9] achieves the lowest latency. The difference be-
efficient implementations of this scheme in both hardware tween these two architectures lies in the number of arithmetic
and software. Most existing works on implementing and eval- and hash modules used. The architecture in [10] has one hash
uating Dilithium have used pure software methods [6], [7] and arithmetic modules, that means it is slower but more
or hardware-software co-design methods [8]. Developing and compact than the work in [9] which uses multiple modules.
benchmarking software implementations of cryptographic al- Consequently, we realized that designing efficient hash
gorithms like PQC is relatively straightforward. However, and arithmetic modules is key factor for achieving high-
implementing them in hardware platforms, particularly on performance architecture. The latency of hardware imple-
Field-Programmable Gate Arrays (FPGAs), requires signif- mentations for Dilithium is mainly dominated by hashing and
icant time and design effort to assess their efficiency in terms polynomial arithmetic tasks, which should be continuously
of speed, area utilization, power consumption and energy executed as much as possible. Besides, the interconnection
efficiency. By exploring FPGA implementations, researchers between individual submodules in the whole architecture
can realize various ideas and apply the tradeoff method to should be carefully considered to avoid critical paths, which
optimize the approach for hardware designs to achieve de- can reduce the working frequency of the hardware imple-
sired performance goals. FPGA implementations are a crucial mentation. The balance between on-chip resource utilization
and efficient tool for gaining valuable insights into the cost and performance must also be considered to maximize the
and performance characteristics of PQC algorithms when efficiency of combined architecture for all phases. To im-
deployed in hardware environments. plement a more compact architecture, [10] does not support
Several prior researches have focused on improving the packing/unpacking and the final results are stored in BRAMs.
hardware designs for Round 3 Dilithium on FPGA platforms. However, in public-key cryptosystems, packing is essential
In these designs, two main aspects were typically considered. for defining the data layout of the keys and signature before
The first is developing hardware architectures that could sup- transmission. This is necessary as these applications may be
port all security levels as illustrated in [9], [10] or specifying implemented on diverse platforms with varying architectures.
to a particular security level as proposed in [11]–[14]. We Additionally, although the proposed on-the-fly calculation for
recognize that adopting a unified architecture with a flexible matrix A in [10] aims at minimizing storage, it leads to a
mode signal to accommodate all security levels is a more longer critical path as well as restricts the design to operate
effective approach. This strategy can enhance algorithm per- at higher frequency. Conversely, [9] provided practical sup-
formance in hardware, providing flexibility and enabling the port with packing/unpacking modules featuring 64-bit ports.
architecture to address a broader spectrum of applications. However, their hash modules support multiple instances of
The second consideration is the goal of these implementa- SHA-3 and the complexity of iterative NTT structure leads to
tions. Some studies aimed at maximizing performance while a longer critical path. Additionally, utilizing two arithmetic
others focused on minimizing power or area consumption modules with only one used in Keygen and Verify algorithms
or achieving a balance in area-time tradeoff. As a result, of ML-DSA, might result in inefficient hardware utilization.
these implementations were broadly categorized as high- This study approaches a unified hardware design for ex-
performance [9], [10] or lightweight [12], [15] although the ecuting three main algorithms in NIST ML-DSA [3]. Our
distinction is not always clear-cut as some designs may aim design improves submodules and employs appropriate oper-
for specific tradeoff directions. For instance, a high-efficiency ation scheduling tailored to ML-DSA hardware implemen-
design [11] might primarily focus on achieving high perfor- tation. Additionally, this study aims at a high-performance
mance while keeping resource utilization in comparable level. implementation and achieving an optimal balance between
Lightweight implementations aim to minimize hardware execution time and on-chip resource utilization.
utilization, allowing them to fit on even the smallest FPGA In summary, we make the following main contributions:
platforms. However, current FPGA platforms support a large 1) We propose a unified hardware architecture that sup-
amount of hardware resources, as evidenced by the unified ports three algorithms of ML-DSA and can be config-
architecture for CRYSTALS-Kyber and Dilithium on the Ul- ured for NIST security levels in run-time. The com-
traScale+ ZCU102 [15]. This cryptoprocessor utilizes only a bined hardware design improves the submodules and
small portion of the platform’s hardware resources, utilizing optimizes on-chip resource utilization. Furthermore,
less than 10% of available resources. When implementing the proposed architecture supports packing/unpacking
Dilithium on hardware, one of the most important goals is modules that enables efficient data exchange with other
accelerating execution time to improve application perfor- platforms, facilitating accelerated multiple computa-
mance. Therefore, a high-performance approach is more suit- tional tasks in ML-DSA.
able for Dilithium hardware implementation, which is still 2) The hash and arithmetic modules are designed to match
affordable with the available hardware resources on current the specific needs of ML-DSA. Specifically, we use in-
FPGA platforms. Among the high-performance designs that dependent hash modules for SHAKE-128 or SHAKE-
fully support three security levels, the compact and efficient 256 and a fully pipelined NTT architecture in the arith-
hardware architecture presented in [10] achieves the best metic module, resulting in an improved critical path.
results in resource consumption, while the architecture pro- Optimizing these two modules, which have the most
2 VOLUME 11, 2023
Truong et al.: Efficient Low-Latency Hardware Architecture for Module-Lattice-Based Digital Signature Standard
substantial impact on the overall execution time of the TABLE 1. Parameters sets of ML-DSA.
architecture, leads to performance enhancement for the Parameters NIST security level
entire cryptosystem. 2 3 5
3) By using flexible submodules for entire architecture, q [modulus] 8380417 8380417 8380417
we propose an optimal timing diagram and proper d [dropped bits from t] 13 13 13
τ [# of ±1’s in c] 39 49 60
scheduling to achieve area-time balance when execut- γ1 [y coefficient range] 217 219 219
ing ML-DSA processes. Comparison results show that γ2 [low-order rounding range] (q-1)/88 (q-1)/32 (q-1)/32
our design achieves the lowest latency among FPGA- (k, l) [dimension of A] (4,4) (6,5) (8,7)
η [secret key range] 2 4 2
based implementations and outperforms state-of-the- β [τ · η] 78 196 120
art studies in terms of area-time product (ATP) metric. ω [max # of 1’s in hint] 80 55 75
The remainder of this paper is organized as follows. Sec- λ [collision strength of c̃] 128 192 256
tion II provides a brief background of ML-DSA and number
theoretic transform. Section III proposes our design decisions Algorithm 1 Key Generation KeyGen() [3]
and details our hardware architecture as well as the optimized Output: pk = (ρ, t1 ), sk = (ρ, K , tr, s1 , s2 , t0 )
modules. The implementation results and comparison with 1: ζ ← {0, 1}
256
the state-of-the-art works are given in Section IV. Finally, 2: (ρ, ρ′ , K ) ∈ {0, 1}
256 512 256
× {0, 1} × {0, 1} ← H(ζ)
Section V summarizes and concludes the paper. 3: A ∈ Rk×l := ExpandA(ρ)
q
4: (s1 , s2 ) ∈ Sηl × Sηk := ExpandS(ρ′ )
II. PRELIMINARIES 5: t:= As1 + s2
A. MODULE-LATTICE-BASED DIGITAL SIGNATURE
6: (t1 , t0 ) := Power2Roundq (t, d)
STANDARD OVERVIEW 512
7: tr ∈ {0, 1} := H(ρ ∥ t1 )
ML-DSA is a digital signature scheme based on CRYSTALS- 8: return (pk, sk)
Dilithium, a member of the Cryptographic Suite for Algebraic
Lattices (CRYSTALS) and a digital signature scheme based
on the "Fiat-Shamir with Aborts" approach [16]. Different
from other lattice-based algorithms recommended after round in Algorithm 2. The ‘hedged’ version is a new improvement
3: Kyber and Falcon, Dilithium uses uniform sampling rather aimed at facilitating countermeasures against side-channel
than discrete Gaussian distribution for secret randomness attacks and fault attacks on deterministic signatures [19].
generation. This approach greatly simplifies polynomial gen- Key generation (Keygen), Signature generation (Sign) and
eration and is easily implemented in constant time. The hard Verification (Verify), three core algorithms of ML-DSA, are
problems underlying the security of ML-DSA schemes are shown in Algorithms 1-3 [3].
Module Learning with Errors (M-LWE) and Module Shortest KeyGen(): Key generation algorithm generates a keypair
Integer Solution (M-SIS) formally proved in [4]. consisting of a private signing key (sk) and public verification
M-LWE problem needed to protect against key-recovery key (pk) used for signature generation and verification. In this
[17] and M-SIS problem for strong unforgeability [18]. The algorithm, two uniform seeds: public seed (ρ) and noise seed
M-LWE problem can be briefly described as follows: Let A (ρ′ ) are mapped to generate polynomial matrix A and vectors
∈ Rqk×l be uniformly chosen s1 ∈ Rlq , s2 ∈ Rkq . Then, the s1 , s2 . It then computes t = As1 +s2 to generate the final
standard M-LWE problem is the public key (A, t = As1 +s2 ) keypair. To keep size small, the public key includes the seed
is indistinguishable from (A, t) where t is chosen uniformly (ρ) instead of matrix A and the upper bit of t. The secret key
at random. The M-SIS problem in ML-DSA is expressed packs the lower d bits of t, seed ρ and two byte-arrays K , tr.
through: from a uniformly chosen matrix A ∈ Rk×l q , that there Sign(): This algorithm takes message M and secret key
exists z and u such that Az+u = 0, where ∥z∥∞ ≤ 2(γ1 − β), to generate the signature attached with the message before
∥u∥∞ ≤ 4γ2 + 2 (parameters can be chosen to match NIST sending to verifier. Signature generation begins by generating
security levels outlined in Table 1). In addition to 3 proposed masking vector y and matrix A and then compute w = Ay. It
NIST security levels, the security of ML-DSA can be adjusted is based on generating challenge c by hashing the message
by changing the parameters inside. The most straightforward with the high order bit of w, signature as z = y+cs1 then
way to enhance or reduce security is by modifying the values and also the hints h for the verifier. The potential signature
of (k, l) and then adjusting η, β and ω accordingly. An- σ consisting of (c, z, h) is checked by adapting the condi-
other approach is to lower the values of γ1 and/or γ2 , which tion of the polynomial max norms are within an acceptable
makes forging signatures more challenging since it relies on predefined range and the maximum number of 1’s in hint
the underlying SIS problem. Increasing (k, l) significantly that the signature can support which are defined by choosing
strengthens security, while adjusting γi is more suitable for parameter of the security level in Table 1. Otherwise, the
minor security tweaks. One of the most notable updates in signature must be rejected and a new attempt is made.
ML-DSA from Dilithium version 3.1 is that it contains two Verify(): The high-order bit of w1 is recovered from mes-
versions of the signature generation algorithm: "hedged" and sage, public key and signature. It is used as a part of hash
"deterministic" which are selected by the value rnd, as seen function to generate challenge c which will be compared to
VOLUME 11, 2023 3
Truong et al.: Efficient Low-Latency Hardware Architecture for Module-Lattice-Based Digital Signature Standard
Sampling unit: This module consists of four submodules strings. The different vectors, along with their corresponding
responsible for generating different types of pseudorandom bit widths, are packed into fixed 8-byte strings. The decoder
samples used in ML-DSA. ExpandMatrix maps a uniform can then recover the original vectors from these byte strings.
seed to matrix A, UniformEta for generating the secret vec- Depending on the algorithm being performed, certain pro-
tors s1 , s2 , ExpandMask to make the masking vector y and cessing units can operate in parallel to reduce overall latency,
SampeInBall to create challenge c. provided there is no data flow conflict between them. From
Decomposer: This module breaks up a finite field element Algorithms 1-3, the polynomial generation and computation
r in Zq into their “high-order” bits and “low-order” bits, such phases in the ML-DSA scheme are the most time-consuming.
that r = r1 · 2γ + r0 where 0 ≤ r0 < 2γ. By prioritizing these tasks, the computation time of other
Hint: used to describe the work of both MakeHint and tasks can be masked behind these lengthy phases. This de-
UseHint functions. Given an arbitrary element r ∈ Zq and sign focuses on optimizing these tasks sequencing, enabling
a small element z ∈ Zq , "MakeHint" records a 1-bit hint h as continuous execution to enhance overall performance.
the "carry" that enables the computation of the higher-order
bits of r +z using only r and h. Conversely, "UseHint" utilizes 2) Operation scheduling.
the hint to recover the high-order bits of the sum. Based on algorithms and the proposed architecture, we sug-
Encoder/decoder: To interface with other hardware or gest an optimized timing schedule that maximizes the uti-
software platforms, which typically have a bandwidth of 2n , lization of submodules while maintaining constant progress
this implementation integrates encoder/decoder units. The en- on the critical path of the operations. In our architecture, the
coder is responsible for packing polynomial vectors into byte controller operates as a FSM, ensuring sequential processing
6 VOLUME 11, 2023
Truong et al.: Efficient Low-Latency Hardware Architecture for Module-Lattice-Based Digital Signature Standard
FIGURE 6. Flexible polynomial arithmetic module in mode 1. FIGURE 7. Flexible polynomial arithmetic module in mode 2.
C. FLEXIBLE POLYNOMIAL ARITHMETIC MODULE per cycle during polynomial arithmetic. This mode is used
As discussed in Section II, one of the critical considera- for Sign operation, one module to compute vector y and w,
tions is the extensive computational demands in the Sign the other for remaining computations as shown in Fig. 3.
algorithm, particularly during the rejection loop phase. To To optimize memory and efficiently store twiddle factors
improve performance in this phase, we propose an approach (TFs), this module uses six registers dedicated to storing
inspired by [9] to distribute the calculations across two poly- precomputed TFs for the two first layers of NTT/INTT. These
nomial arithmetic (PolyArith) modules. However, since our registers are directly connected to PE1 and PE4. Additionally,
architecture unifies all three main algorithms of ML-DSA, it utilizes two 36Kb BRAM configurations in a 72×512 mode
implementing two PolyArith modules when Verify and Key- to store TFs associated with the remaining six PEs. The
gen algorithms only require one is an unnecessary resource addresses in BRAM are carefully arranged in accordance with
allocation. Hence, our arithmetic module is designed based their order of use. Shift registers are used to implement a FIFO
on the radix-2 MDC FFT pipelined architecture [21] featuring unit for saving intermediate values during NTT/INTT.
eight Processing Elements (PE), which includes butterfly unit
and other sub-units, to handle all polynomial arithmetic op-
erations in ML-DSA. This module offers two modes that can 1) Butterfly unit.
be flexibly adapted to suit the specific characteristics of ML- The butterfly unit plays a crucial role in the arithmetic module
DSA, both of them are configurable to support NTT/INTT for performing various operations. In the proposed design, a
operations with the data flow pass through the black and faint configurable BU architecture is presented, supporting both
lines, respectively, in Figs. 6, 7. the CT BU and GS BU methods, as depicted in Fig. 8. The
Fully pipelined mode: In this mode, we use eight PEs proposed butterfly unit consists of one modular multiplier,
corresponding to the eight layers of ML-DSA for N=256, as one modular adder and one modular subtractor, without re-
outlined in Algorithm 4. BUi take the input from its previous quiring any additional modular multiplier, adder, or subtractor
BU consequently, the twiddle factors have been precomputed compared to a dedicated CT or GS butterfly configuration.
and saved in corresponding ROM address. The stall delay The proposed BU takes a, b and ω as input, the NTT/INTT
length from input to output is 153 CCs, after that each poly- control signal in the blue line acts as a selection signal for the
nomial requires 128 cycles for NTT/INTT. The BUs can be multiplexers, configuring the BU to operate as either a GS
reused for all polynomial arithmetic, processing 8 coefficients or CT butterfly configuration. the BU architecture enables it
in parallel. This mode is employed in the Keygen and Verify. to handle point-wise multiplication, addition and subtraction
Folding transform mode: In this mode, we divide our through the corresponding internal unit. A and B denote the
module into two independent modules, each equipped with output ports when performing point-wise multiplication or
four PEs. By applying the folding transformation method subtraction, the result is taken from port B, while port A is
[25], each PE is responsible for calculating two adjacent NTT utilized for addition using the CT butterfly configuration.
layers in a time-sliced fashion. This enables us to employ four In [26] Zhang et al. proposed a technique to eliminate the
PEs to implement radix-2 MDC NTT in ML-DSA, similar to multiplication of resulting coefficients with n−1 (mod q) after
the approach in [10]. In this mode, during NTT/INTT, the PEs the INTT operation. The output can be seen as (A, B) :=
handle the odd stages in odd cycles and the even stages in even (2-1 (a + b), (a - b)ω). For an odd prime q, x/2 (mod q) can
cycles. This mode has a delay of 281 CCs from input to output be performed as shown in Eqn. 1. In this work, we adopt
and requires 256 CCs to transfer a single polynomial during this technique by inserting divide 2 operation integrated into
NTT/INTT. It also supports the processing of 4 coefficients addition unit and pre-processed the twiddle factor for the
8 VOLUME 11, 2023
Truong et al.: Efficient Low-Latency Hardware Architecture for Module-Lattice-Based Digital Signature Standard
TABLE 2. Resource utilization of submodules in the architecture. architecture in [12] is the smallest hardware accelerator for
Dilithium, but it only supports level 5. Therefore, our archi-
Submodule Resource Utilization tecture complexity is nearly equivalent to [9], [10], but higher
LUT FF DSP RAM than other implementations dedicated to specific security
Polynomial RAM 0 0 0 25.5 levels. Additionally, with the same lattice-based structure,
Hint 10476 3776 0 0 Aikata et al. [15] propose a combined architecture for Kyber
Encoder 2183 464 0 0
Decoder 2683 251 0 0 and Dilithium schemes. For a comprehensive comparison of
Decomposer 1565 728 0 0 resource utilization and performance in Keygen, Sign and
PolyArith 5819 3527 16 2 Verify in our implementation and related works are listed
SHAKE-128 9278 6041 0 0 in Table 3. As discussed earlier, the execution time of the
SHAKE-256 5126 3941 0 0
Sampling 6942 2868 0 0
Sign algorithm varies depending on the rejection loop con-
Other 5681 1881 0 0 ditions. Therefore, in Table 3, the parameters for the Sign
column are split into the best-case scenario and average case,
Top ML-DSA Architecture 49753 23477 16 27.5
respectively. For an effective evaluation of the design, Figs.
10-13 present a fair comparison using the ATP and EqS
metrics for corresponding parameters of level 5-the highest
the first iteration, resulting in a time of 10804, 14859 and security level supported by Dilithium. These figures present
21199 CCs for three security levels. The processing time of a comparative analysis of the ATP_LUT, ATP_FF, ATP_DSP
input message is determined by the size of the message and and ATP_BRAM parameters among the four most advanced
operates at a rate of 8 bytes per cycle. Based on the parameters designs available.
in the ML-DSA specification, an average of 4.25 attempts is The work of Land [13] is one of the earliest designs for
required for level 2, 5.1 attempts for level 3 and 3.85 for level Round 3 Dilithium. This design has several non-optimized
5. The average time to execute one attempt is 5665, 7658 and aspects, such as the use of separated NTT/INTT blocks and
10340 CCs, leading to average times of signature generation multiplication blocks, which could be combined to achieve
are 29219, 46274 and 50848 CCs for three security levels. efficiency without increasing algorithm execution time.Thus,
based on the data in Table 3, even with the EqS metric, it is
B. COMPARISON WITH RELATED WORKS evident that our design achieves significantly better perfor-
ML-DSA was recently introduced with slight updates to the mance than this design, especially in security level 5.
earlier version of CRYSTALS-Dilithium, including minor In [12], Naina proposed a lightweight implementation for
parameter changes and the "hedged" versions in Algorithm level 5 Dilithium on the Zynq Ultrascale+ hardware plat-
2, as explained in the ML-DSA specification [3]. Despite form. With a focus on minimizing hardware utilization, this
some minor differences, to compare the efficacy of our im- design achieves the best results in terms of LUTs and FFs.
plementation, several state-of-the-art full hardware imple- However, regarding execution time, this design only provides
mentations for Round-3 Dilithium based on FPGA platforms data for the best-case scenario of the Sign algorithm without
with the best results are listed in Table 3. As mentioned specific information about the average execution time of this
above, these implementations vary in their approaches, such algorithm. Despite the differences in design approach, as our
as Lightweight, High-performance, or others. To compare the design supports various security levels, our implementation
effectiveness of these implementations, we employ the area- demonstrates better efficiency in almost all statistics when
time tradeoff metric. The ATP values are calculated by mul- evaluated using the ATP metric, as shown in Figs. 10-13.
tiplying the latency with the number of utilized LUTs, FFs, Wang et al. [11] used three independent cores for the
DSPs and BRAMs denoted as ATP_LUT, ATP_FF, ATP_DSP corresponding Dilithium’s security levels. This work is soft-
and ATP_BRAM, respectively. In this metric, our work serves ware/hardware codesign, where bit packing and unpacking
as the "center point" for comparison, with a value set to are not included in the hardware part. Therefore, when com-
1. Lower values indicate better performance, as both area paring by ATP metric in level 5, a direct comparison may not
and time consumption should be minimized. Additionally, be totally fair as our design, which supports 3 levels, is more
to ensure a fair comparison despite differences in hardware complex and resource-consuming. However, our implemen-
platforms, we use an equivalence conversion method. This tation showcases a slight improvement in terms of memory
method, previously used in [27], approximates the number of utilization and an significant performance enhancement.
LUTs, FFs, DSPs, and BRAMs to equivalent slices (EqS). Zhao et al. [10] presents a compact design for Dilithium
By this way when comparing results implemented on 7 series on the low-end Artix-7 platform. Although it also supports all
FPGAs platforms, such as Artix-7, they can be evaluated with three levels, it lacks the flexibility of our design. Unlike our
half of the equivalent results on UltraScale+ platforms. design, its results are stored only in its memory and it does not
The implementations in our work and in [9], [10] represent support the packing and sending of results to connect with ex-
full hardware designs for all three security levels of Dilithium. ternal hardware/software platforms. Dilithium is a public key
On the other hand, in [11], [13], the authors propose in- cryptosystem and the ability to pack and send public keys and
dependent designs for each security level. The Lightweight digital signatures to other platforms is fundamental. Without
10 VOLUME 11, 2023
Truong et al.: Efficient Low-Latency Hardware Architecture for Module-Lattice-Based Digital Signature Standard
FIGURE 10. Comparison of Area×Time metric (LUTs × s). FIGURE 11. Comparison of Area×Time metric (FFs × s).
this capability, the system loses much of its practicality. In and the potential for higher operating frequency, resulting in
terms of latency, their average execution time statistics do not enhanced time execution, while maintaining equivalent hard-
include the data input receiving and sending processes. The ware consumption in ATP_LUT and ATP_DSP parameters,
execution time for Verify algorithm is calculated only in cases using ATP and EqS metrics.
where the public key remains unchanged. In similar cases, our The work of [9] presented a design similar to ours, provides
work only requires 3179, 4819 and 7232 CCs for the three support for three security levels. However, it uses the Minerva
levels. This work also suffers from the complexity of the hash tool [28] to determine the maximum frequency, which repre-
module, leading to a low operating frequency. Therefore, even sents a different benchmark compared to other designs. By
with different platforms, our work offers improved latency enhancing the versatility of the arithmetic module and opti-
VOLUME 11, 2023 11
Truong et al.: Efficient Low-Latency Hardware Architecture for Module-Lattice-Based Digital Signature Standard
V. CONCLUSION
In this paper, we present an efficient and low-latency hard-
ware implementation of ML-DSA, which supports all main
algorithms at three security levels of NIST. Our implemen-
tation enables the complete execution of all ML-DSA tasks
independently or in interaction with other software/hardware
platforms to accelerate specific processes. When compared
with the state-of-the-art hardware architecture for all three
security levels of ML-DSA, we achieve the lowest latency
and efficiency in terms of area/performance tradeoffs, with
FIGURE 13. Comparison of Area×Time metric (BRAMs × s). a significant improvement of 1.27 to 2.58× as evaluated by
the ATP metric. Additionally, this work introduces optimized
arithmetic operations and hashing & sampling modules tai-
mizing the hash module, specifically tailored for Dilithium, lored to the characteristics of ML-DSA, enhancing the over-
our architecture has achieved a 2× improvement in all ATP all efficiency of the scheme. These modules can also work
parameters during Keygen and Verify algorithms. Our results independently as accelerators to support software platforms,
also showcase improved Sign algorithm execution times and handling the two most time-consuming parts of these algo-
more efficient resource utilization, detailed in Table 3. rithms. In summary, our proposed architecture exhibits poten-
tial adaptability to diverse applications of digital signatures,
Recently, a notable unified architecture for Kyber and
offering resilience against both quantum and classical attacks.
Dilithium was introduced by Aikata et al. [15]. When the
overall area consumption of this work is analyzed in Table
3, it utilizes 22184 LUTs, 8596 FFs, 4 DSPs, and 24 BRAMs REFERENCES
for the hardware part without the submodules solely support- [1] P. W. Shor, ‘‘Polynomial-time algorithms for prime factorization and dis-
crete logarithms on a quantum computer,’’ SIAM Journal on Computing,
ing Kyber. When evaluated using the ATP metric based on vol. 26, no. 5, pp. 1484–1509, Oct. 1997.
these statistics, our work exhibits substantial improvements [2] NIST, ‘‘Status report on the third round of the NIST-PQC standard-
in almost all parameters, as shown in Figs. 10-13. ization Process,’’. Available: https://csrc.nist.gov/projects/post-quantum-
cryptography.
In summary, our implementation exhibits the lowest la- [3] NIST, ‘‘FIPS 204 - Module-Lattice-Based Digital Signature Standard," url:
tency in both clock cycle and time execution compared to the https://csrc.nist.gov/pubs/fips/204/ipd
other designs. Furthermore, when compared to the combined [4] S. Bai, L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, P. Schwabe,
G. Seiler, and D. Stehl´e, ‘‘CRYSTALS-Dilithium Algorithm Specifica-
architecture supporting all security levels of Dilithium in [9], tions and Supporting Documentation,’’ Proposal to NIST PQC Standard-
our design demonstrates efficiency improvements ranging ization, Round3, 2021, https://pq-crystals.org/dilithium/data/dilithium-
from 1.27∼2.58×, as evaluated by the ATP metric. specification-round3.pdf.
[5] National Institute of Standards and Technology (2023) Digital Signature
In comparison to the software implementation, our hard- Standard (DSS).(Department of Commerce, Washington, D.C.), Federal
ware implementation in this work demonstrates significant Information Processing Standards Publication (FIPS) NIST FIPS 186-5.
performance improvements. Seo et al. [6] reported an opti- https://doi.org/10.6028/NIST.FIPS.186-5
mized NEON implementation on an 8-core ARM v8.2 64- [6] S. C. Seo, S. W. An, ‘‘Parallel implementation of CRYSTALS-Dilithium
for effective signing and verification in autonomous driving environment,’’,
bit CPU mounted on Jetson AGX Xavier. For Dilithium level ICT Express, vol. 9, issue 1, pp. 100–105, Feb. 2023.
5, their implementation takes 542µs for Keygen, 625µs for [7] H. Becker, V. Hwang, M. J. Kannwischer, B.-Y. Yang, and S.-Y. Yang,
‘‘Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple QUANG DANG TRUONG (Graduate Student
M1,’’ Cryptology ePrint Archive 2021/986, Jul. 2021. Member, IEEE) received his B.S. degree in Electri-
[8] P. Nannipieri, S. Di Matteo, L. Zulberti, F. Albicocchi, S. Saponara, cal and Electronic Engineering from Le Quy Don
and L. Fanucci, ‘‘A RISC-V Post Quantum Cryptography Instruction Set University of Technology, Vietnam, in 2019 and
Extension for Number Theoretic Transform to Speed-Up CRYSTALS M.s degree in Electrical and Computer Engineer-
Algorithms,’’ IEEE Access, vol. 9, pp. 150798-150808, 2021. ing from Inha University, in 2023. He is currently
[9] L. Beckwith, D. T. Nguyen, and K. Gaj, ‘‘High-performance hardware pursuing his Ph.D. in Electrical and Computer
implementation of CRYSTALS-Dilithium,’’ Crypto. ePrint Arch., Report
Engineering from Inha University. His research
2021/1451, 2021.
interests include algorithm and architecture design
[10] C. Zhao, N. Zhang, H. Wang, B. Yang, W. Zhu, Z. Li, M. Zhu, S. Yin, S.
Wei, and L. Liu, ‘‘A compact and high-performance hardware architecture for post-quantum cryptography and homomorphic
for CRYSTALS-Dilithium,’’ IACR Trans. Cryptogr. Hardw. Embed.Syst., encryption.
vol. 2022, no. 1, pp. 270–295, 2022.
[11] T. Wang, C. Zhang, P. Cao and D. Gu, ‘‘Efficient implementation of
Dilithium signature scheme on FPGA SoC Platform,’’ IEEE Transactions
on Very Large Integration (VLSI) systems, vol: 30, Sep. 2022.
[12] N. Gupta, A. Jati, A. Chattopadhyay, and G. Jha, ‘‘Lightweight Hardware
Accelerator for Post-Quantum Digital Signature CRYSTALS-Dilithium,’’
IEEE Transactions on Circuits and Systems I: Regular Papers (2023).
[13] G. Land, P. Sasdrich, and T. Guneysu, ‘‘Hard Crystal-Implementing
Dilithium on Reconfigurable Hardware,’’ Cryptology ePrint Archive
2021/355, Mar. 2021.
[14] S. Ricci, L. Malina, P. Jedlicka, D. Smekal, 1. Hajny, P. Cibik, and P.
Dobias, ‘‘Implementing CRYSTALSDilithium Signature Scheme on FP- PHAP NGOC DUONG (Member, IEEE) received
GAs,’’ Cryptology ePrint Archive 2021/108, Jan. 2021. the M.S. degree in Electronic Engineering from
[15] A. Aikata, C. Mert, M. Imran, S. Pagliarini and S. Roy, ‘‘KaLi: A crystal The University of Danang, Vietnam, in 2015 and
for post-quantum security using Kyber and Dilithium,’’ IEEE Transactions the Ph.D. degree in Electrical and Computer En-
on Circuits and Systems I, vol: 70, Feb. 2023. gineering from Inha University, South Korea, in
[16] Vadim Lyubashevsky, ‘‘Fiat-shamir with aborts: Applications to lattice and 2022. He is currently a post-doctoral fellow at the
factoring-based signatures,’’ in International Conference on the Theory Artificial Intelligence System-on-Chip (AI-SoC)
and Application of Cryptology and Information Security, Springer Berlin Research Center, Inha University, South Korea;
Heidelberg, pp. 598–616, 2009. and a lecturer of Vietnam-Korea University of In-
[17] Y. Wang and M.Wang, ‘‘Module-LWE versus Ring-LWE, Revisited,’’
formation and Communication Technology, The
Cryptology ePrint Archive 2019/930, Aug. 2019.
University of Danang, Vietnam. His research interests include algorithm and
[18] E. Kiltz, V. Lyubashevsky, and C. Schaffner,‘‘A concrete treatment
of Fiat–Shamir signatures in the quantum random-oracle model,’’ in architecture for digital signal processing, hardware acceleration architecture,
Proc.37th Annu. Int. Conf. Theory Appl. Cryptograph. Techn., pp. 552–586, post-quantum cryptography and homomorphic encryption.
Tel Aviv, Israel, Apr. 2018.
[19] Leon Groot Bruinderink and Peter Pessl, ‘‘Differential fault attacks on
deterministic lattice signatures,’’ IACR Transactions on Cryptographic
Hardware and Embedded Systems, pp. 21–43, 2018.
[20] P. Longa and M. Naehrig, ‘‘Speeding up the number theoretic transform for
faster ideal lattice-based cryptography,’’ in Proc. 15th Int. Conf. Cryptol.
Netw. Secur., pp. 124–139, Milan, Italy, Nov. 2016.
[21] S. He and M. Torkelson, ‘‘A new approach to pipeline FFT processor,’’ in
Proceedings of International Conference on Parallel Processing, pp. 766-
770, April 1996.
[22] ‘‘UltraScale Architecture Memory Resources Advance Specification User
Guide,’’ UG573 (v1.1) August 14, 2014. HANHO LEE (Senior Member, IEEE) received
[23] National Institute of Standards and Technology, in FIPS PUB 202: SHA- Ph.D. and M.Sc. degrees in Electrical and Com-
3 Standard: Permutation-Based Hash and Extendable-Output Functions. puter Engineering, from the University of Min-
Aug. 2015. nesota, Minneapolis in 2000 and 1996, respec-
[24] G. Bertoni, J. Daemen, S. Hoffert, M. Peeters, and G. V. Assche.(2020)., tively. He was a member of the technical staff
Keccak in VHDL. [Online]. Available:https://keccak.team/hardware.html at Lucent Technologies (Bell Labs Innovations),
[25] K. K. Parhi, C.-Y. Wang, and A. P. Brown, ‘‘Synthesis of control circuits in Allentown from April 2000 to August 2002 and
folded pipelined DSP architectures,’’ IEEE Journal of Solid-State Circuits, Assistant Professor in the Department of Elec-
vol. 27, no. 1, pp. 29– 43, Jan. 1992. trical and Computer Engineering, University of
[26] N. Zhang, B. Yang, C. Chen, S. Yin, S. Wei and L.Liu, ‘‘Highly Effi- Connecticut, USA from August 2002 to August
cient Architecture of NewHope-NIST on FPGA using Low-Complexity
2004. He has been with the Department of Information and Communication
NTT/INTT,’’ IACR Trans. on CHES, vol. 2020, no. 2, pp. 49–72, 2020.
[27] P. Duong-Ngoc, S. Kwon, D. Yoo, and H. Lee, ‘‘Area-efficient number the-
Engineering, Inha University since August 2004, where he is currently a
oretic transform architecture for homomorphic encryption,’’ IEEE Trans. Professor. He was visiting scholar at Bell Labs, Alcatel-Lucent, Murray Hill,
Circuits Syst. I, Reg. Papers, vol. 70, no. 3, pp. 1270–1283, Mar. 2023. New Jersey, USA from August 2010 to August 2011. His research interests
[28] F. Farahmand, A. Ferozpuri, W. Diehl, and K. Gaj, ‘‘Minerva: Automated include algorithm and architecture design for cryptographic, forward error
hardware optimization tool,’’ in 2017 International Conference on ReCon- correction coding, artificial intelligence and digital signal processing.
Figurable Computing and FPGAs, ReConFig 2017, pp. 1-8, Cancun: IEEE,
Dec. 2017.