
CRYSTALS-Dilithium

Algorithm Specifications and Supporting Documentation

Shi Bai¹, Léo Ducas², Eike Kiltz³, Tancrède Lepoint⁴, Vadim Lyubashevsky⁵,
Peter Schwabe⁶, Gregor Seiler⁷ and Damien Stehlé⁸

¹ Florida Atlantic University
² CWI, The Netherlands
³ Ruhr Universität Bochum, Germany
⁴ Google, USA
⁵ IBM Research Europe, Switzerland
⁶ Max Planck Institute for Security and Privacy, Germany & Radboud University, The Netherlands
⁷ IBM Research Europe and ETH, Switzerland
⁸ ENS de Lyon, France

October 1, 2020

The most up-to-date versions of our specification document and code can be
accessed at:
https://pq-crystals.org/dilithium/
and
https://github.com/pq-crystals/dilithium/tree/round3

Changelog (Round 1 → Round 2)


We first describe the differences between the round 1 submission of Dilithium and the
current version. The only conceptual design change that we made is to allow the option
for the scheme to be either randomized or deterministic (whereas the round 1 version was
only deterministic). The difference between the two versions of the scheme is one line
in which a seed is chosen at random (in the randomized version) or as a hash of a key
together with the message (in the deterministic version). To allow for this small difference
between the deterministic and randomized version, the function ExpandMask now takes
a 48 byte seed as input instead of a key and the message. As a minor additional change
we have harmonized the nonces of the various expansion functions for the matrix A, the
masking vector y and the secret vectors s1 , s2 to all be 16-bit integers.
The other changes were in the implementation. We made various optimizations in the
signing algorithm. The most important optimization is how the rejection condition based
on the low part of the vector w and the hint vector is computed. Our AVX2 optimized
implementation now makes more use of vectorization and includes a simpler assembler
NTT implementation using macros.
We also introduced a version of the algorithm that uses AES, rather than SHAKE to
expand the public matrix A, the secret vectors s1 , s2 and the masking vector y from
a seed. This is mainly done to showcase the improved efficiency of our scheme once
hardware-support for SHAKE becomes widely available (as it is for AES).

Changelog (Round 2 → Round 3)


• Combined the first two parameter sets (which used 4 × 3 and 5 × 4 public key
matrices) into one NIST level 2 set which has a 4 × 4 public key matrix. We also
introduced a level 5 parameter set.
• Instead of always outputting a challenge polynomial with 60 non-zero coefficients
(and thus having 257-bit entropy), we now output ones with 39 and 49, which have
192 and 225 bits of entropy, for security levels 2 and 3, respectively. This entropy is
enough for the respective security levels (because only second-preimage resistance
is needed in the proof) and has the effect of somewhat decreasing the rejection
probability.
• Decreased the number of dropped bits from the public key t from 14 to 13, which
has the effect of increasing the SIS hardness of the scheme at the expense of slightly
increasing the public key size.
• The masking polynomial y is now sampled from a range where each coefficient has
a power-of-2 number of possibilities, making the sampling noticeably simpler to
implement. Because the previous range was already very close to a power-of-2, this
change has no effect on the security of the scheme.
• Changed the way the “challenge” c is sampled and transmitted. It is now sampled
in two stages. In the first, a hash outputting a 32-byte challenge seed c̃ is created by
the prover, and this is what is included in the signature. The polynomial challenge
used in the signature scheme is then created using SHAKE with c̃ as the seed. This
saves 8 bytes in the signature size.
• Updated the concrete security analysis, providing refined estimates in the gate cost
metric, based on recent progress on predicting the behavior and the costs of lattice
algorithms.
• Made some miscellaneous small changes to the NIST level 3 parameter set.

1 Introduction
We present the digital signature scheme Dilithium, whose security is based on the hardness
of finding short vectors in lattices. Our scheme was designed with the following criteria in
mind:

Simple to implement securely. The most compact lattice-based signature schemes


[DDLL13, DLP14] crucially require the generation of secret randomness from the dis-
crete Gaussian distribution. Generating such samples in a way that is secure against
side-channel attacks is highly non-trivial and can easily lead to insecure implementations,
as demonstrated in [BHLY16, EFGT17, PBY17]. While it may be possible that a very
careful implementation can prevent such attacks, it is unreasonable to assume that a
universally-deployed scheme containing many subtleties will always be expertly imple-
mented. Dilithium therefore only uses uniform sampling, as was originally proposed for
signatures in [Lyu09, GLP12]. Furthermore, all other operations (such as polynomial
multiplication and rounding) are easily implemented in constant time.

Be conservative with parameters. Since we are aiming for long-term security, we have
analyzed the applicability of lattice attacks from a very favorable, to the attacker, viewpoint.
In particular, we are considering (quantum) algorithms whose space requirements are on
the same order as the time ones. Such algorithms are currently unrealistic, and there seem
to be serious obstacles in removing the space requirement, but we are allowing for the
possibility that improvements may occur in the future.

Minimize the size of public key + signature. Since many applications require the
transmission of both the public key and the signature (e.g. certificate chains), we designed
our scheme to minimize the sum of these parameters. Under the restriction that we avoid
(discrete) Gaussian sampling, to the best of our knowledge, Dilithium has the smallest
combination of signature and public key sizes of any lattice-based scheme with the same
security levels.

Be modular – easy to vary security. The two operations that constitute nearly the
entirety of the signing and verification procedures are expansion of an XOF (we use
SHAKE-128 and SHAKE-256), and multiplication in the polynomial ring Zq [X ]/(X n + 1).
Highly efficient implementations of our algorithm will therefore need to optimize these
operations and make sure that they run in constant time. For all security levels, our
scheme uses the same ring with q = 2^23 − 2^13 + 1 and n = 256. Varying security simply
involves doing more/less operations over this ring and doing more/less expansion of the
XOF. In other words, once an optimized implementation is obtained for some security
level, it is easy to obtain an optimized implementation for a higher/lower level.

1.1 Overview of the Basic Approach


The design of the scheme is based on the “Fiat-Shamir with Aborts” approach [Lyu09,
Lyu12] and bears most resemblance to the schemes proposed in [GLP12, BG14]. For
readers who are unfamiliar with the general framework of such signature schemes, we
present a simplified (and less efficient) version of our scheme in Fig. 1. This version is a
slightly modified version of the scheme from [BG14]. We will now go through each of its
components to give the reader an idea of how such schemes work.

Gen
01 A ← Rq^{k×ℓ}
02 (s1, s2) ← Sη^ℓ × Sη^k
03 t := As1 + s2
04 return (pk = (A, t), sk = (A, t, s1, s2))

Sign(sk, M)
05 z := ⊥
06 while z = ⊥ do
07   y ← S_{γ1−1}^ℓ
08   w1 := HighBits(Ay, 2γ2)
09   c ∈ Bτ := H(M ‖ w1)
10   z := y + cs1
11   if ‖z‖∞ ≥ γ1 − β or ‖LowBits(Ay − cs2, 2γ2)‖∞ ≥ γ2 − β, then z := ⊥
12 return σ = (z, c)

Verify(pk, M, σ = (z, c))
13 w1′ := HighBits(Az − ct, 2γ2)
14 return ⟦‖z‖∞ < γ1 − β⟧ and ⟦c = H(M ‖ w1′)⟧

Figure 1: Template for our signature scheme without public key compression.

Key Generation. The key generation algorithm generates a k × ℓ matrix A each of whose
entries is a polynomial in the ring Rq = Zq[X]/(X^n + 1). As previously mentioned, we will
always have q = 2^23 − 2^13 + 1 and n = 256. Afterwards, the algorithm samples random
secret key vectors s1 and s2. Each entry of these vectors is a polynomial in Rq with
small coefficients of size at most η. Finally, the second part of the public key is computed
as t = As1 + s2. All algebraic operations in this scheme are assumed to be over the
polynomial ring Rq.

Signing Procedure. The signing algorithm generates a masking vector of polynomials y


with coefficients less than γ1 . The parameter γ1 is set strategically – it is large enough
that the eventual signature does not reveal the secret key (i.e. the signing algorithm is
zero-knowledge), yet small enough so that the signature is not easily forged. The signer
then computes Ay and sets w1 to be the “high-order” bits of the coefficients in this
vector. In particular, every coefficient w in Ay can be written in a canonical way as
w = w1 · 2γ2 + w0 where |w0 | ≤ γ2 ; w1 is then the vector comprising all the w1 ’s. The
challenge c is then created as the hash of the message and w1 . The output c is a polynomial
in Rq with exactly τ ±1’s and the rest 0’s. The reason for this distribution is that c has
small norm and comes from a domain of size 2^τ · (256 choose τ), i.e. of log2 (256 choose τ) + τ bits,
which we would like to be between 128 and 256. The potential signature is then computed as z = y + cs1.
If z were directly output at this point, then the signature scheme would be insecure
due to the fact that the secret key would be leaked. To avoid the dependency of z on the
secret key, we use rejection sampling. The parameter β is set to be the maximum possible
coefficient of csi . Since c has τ ±1’s and the maximum coefficient in si is η, it’s easy to see
that β ≤ τ · η. If any coefficient of z is larger than γ1 − β, then we reject and restart the
signing procedure. Also, if any coefficient of the low-order bits of Az − ct is greater than
γ2 − β, we restart. The first check is necessary for security, while the second is necessary
for both security and correctness. The while loop in the signing procedure keeps being
repeated until the preceding two conditions are satisfied. The parameters are set such that
the expected number of repetitions is not too high (e.g. around 4).
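For illustration, the challenge-entropy figures quoted above can be checked with a short stand-alone Python calculation (illustrative code, not part of the reference implementation; the values of τ are the ones used in this submission, see Table 2):

    from math import comb, log2

    # Challenge entropy log2(C(256, tau)) + tau for tau = 39, 49, 60; the floored
    # values 192, 225 and 257 are the figures quoted in the changelog and Table 2.
    for tau in (39, 49, 60):
        print(tau, int(log2(comb(256, tau)) + tau))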

Verification. The verifier first computes w1′ to be the high-order bits of Az − ct, and then
accepts if all the coefficients of z are less than γ1 − β and if c is the hash of the message and
w1′. Let us look at why verification works, in particular as to why HighBits(Az − ct, 2γ2) =
HighBits(Ay, 2γ2 ). The first thing to notice is that Az − ct = Ay − cs2 . So all we really
need to show is that

HighBits(Ay, 2γ2 ) = HighBits(Ay − cs2 , 2γ2 ). (1)

The reason for this is that a valid signature will have ‖LowBits(Ay − cs2, 2γ2)‖∞ < γ2 − β.
And since we know that the coefficients of cs2 are smaller than β, we know that adding
cs2 is not enough to cause any carries by increasing any low-order coefficient to have
magnitude at least γ2 . Thus Eq. (1) is true and the signature verifies correctly.

1.2 Dilithium
The basic template in Fig. 1 is rather inefficient, as is. The most glaring (but trivially
fixed) inefficiency is that the public key consists of a matrix of k · ` polynomials, which
could have a rather large representation. The fix is simply to have A generated from some
seed ρ using SHAKE-128, and this is a standard technique. The public key is therefore
(ρ, t) and its size is dominated by t. The novelty of Dilithium over the previous schemes
(e.g. [BG14] and qTESLA [ABB+ 19], which is a particular instantiation of the [BG14]
framework) is that we also shrink the bit-representation size of t by a factor slightly larger
than two at the expense of increasing the signature by less than 100 bytes.
The main observation for obtaining this very favorable trade-off is that when the verifier
computes w1′ in Line 13, the high-order bits of Az − ct do not depend too much on the
low order bits of t because t is being multiplied by a very low-weight polynomial c. In
our scheme, some low-order bits of t are not included in the public key, and so the verifier
cannot always correctly compute the high-order bits of Az − ct. To make up for this,
the signer includes some “hints” as part of the signature, which are essentially the carries
caused by adding in the product of c with the missing low-order bits of t. With this hint,
the verifier is able to correctly compute w1′.
Additionally, we give an option for making our scheme deterministic using the standard
technique of adding a seed to the secret key and using this seed together with the message
to produce the randomness y in Line 07. The recent result of Kiltz et al. [KLS18] showed
that the fewer different signatures the adversary sees for the same messages, the tighter the
reduction is in the quantum random oracle model between the signature scheme and the
underlying hardness assumptions. While it’s not clear as to whether there is an improved
quantum attack for randomized signatures, we suggest the deterministic version as the
default option except in scenarios where an adversary can mount side-channel attacks that
exploit determinism [SBB+ 18, PSS+ 18]. Our full scheme in Fig. 4 also makes use of basic
optimizations such as pre-hashing the message M so as to not rehash it with every signing
attempt.

Implementation Considerations. The main algebraic operation performed in the scheme


is a multiplication of a matrix A, whose elements are polynomials in Zq[X]/(X^256 + 1),
by a vector y of such polynomials. In our NIST Level 3 parameter setting, A is a 6 × 5
matrix and therefore consists of 30 polynomials. Thus the multiplication Ay involves 30
polynomial multiplications. As in many lattice-based schemes that are based on operations
over polynomial rings, we have chosen our ring so that the multiplication operation has
a very efficient implementation via the Number Theoretic Transform (NTT), which is
just a version of FFT that works over the finite field Zq rather than over the complex
numbers. To enable the “fully-splitting” NTT algorithm, we need to choose a prime q so
that the group Zq^* has an element of order 2n = 512; or equivalently q ≡ 1 (mod 512). If

r is such an element, then X^256 + 1 = (X − r)(X − r^3) · · · (X − r^511) and thus one can
equivalently represent any polynomial a ∈ Zq[X]/(X^256 + 1) in its CRT (Chinese Remainder
Theorem) form as (a(r), a(r^3), . . . , a(r^{2n−1})). The advantage of this representation is that
the product of two polynomials is coordinate-wise. Therefore the most expensive parts of
polynomial multiplication are the transformations a → â and the inverse â → a – these
are the NTT and inverse NTT operations.
The above-mentioned advantage of using the NTT representation is even more pro-
nounced due to the fact that when doing the multiplication Ay, one needs to compute
much fewer NTTs than the dimension of the matrix. Since the matrix is uniformly random,
one can generate (and store) it already in NTT form. Doing the multiplication of this
matrix by a vector y ∈ Rq^5 then only involves doing 5 NTT computations to convert y into
the NTT representation (the vectors, like y, which we multiply A with contain only small
coefficients and so cannot be generated directly in NTT form), then doing pointwise
multiplications to obtain Ay in NTT form, followed by 6 inverse NTT computations.
The other major time-consuming operation is the expansion of a seed ρ into the
polynomial matrix A. The matrix A is needed for both signing and verification, therefore
a good implementation of SHAKE-128 is important for the efficiency of the scheme.
For our AVX2 optimized implementation of the NTT we take the approach of using
integer, rather than floating point, arithmetic. Although we pack only 4 coefficients into
one vector register of 256 bits, which is the same density that is also used by floating
point implementations, we can improve on the multiplication speed by about a factor of
2. We achieved this speed-up by carefully scheduling the instructions and interleaving
the multiplications and reductions during the NTT so that parts of the multiplication
latencies are hidden.

Security. Following the basic approach from [Lyu12], the security of the signature scheme
in Figure 1 can be proved, in the Random Oracle model (ROM), based on the hardness of
two problems. The first is the standard LWE (over polynomial rings) problem which asks
to distinguish (A, t := As1 + s2 ) from (A, u), where u is uniformly random. The other
problem is, what was called
 the SelfTargetMSIS problem in [KLS18], which is the problem
z
of finding a vector  c  with small coefficients and a message (digest) µ satisfying
v
  
z
H µ k [ A | t | I] ·  c  = c,
v

where A and t are uniformly random and I is the identity matrix. In the ROM, one can
get a (non-tight) reduction using the forking lemma [PS00, BN06] from the standard MSIS
problem of finding a z′ with small coefficients satisfying Az′ = 0 to SelfTargetMSIS. One
can follow this exact approach to prove Dilithium secure in the ROM based on the hardness
of MLWE and SIS.
In the Quantum Random Oracle model (QROM), where the adversary can query H in
superposition, the situation is a little different. It was shown in [KLS18] that Dilithium is
still based on MLWE and SelfTargetMSIS in the QROM, even with a tight reduction when
the scheme is deterministic. But one can no longer directly use the forking lemma (since
it’s a type of rewinding) to give a quantum reduction from MSIS to SelfTargetMSIS. There
are still good reasons to believe that the SelfTargetMSIS problem, and therefore Dilithium,
is secure in the QROM. Firstly, there are no natural signature schemes constructed from
Σ-protocols using the Fiat-Shamir transform that are secure in the ROM but not in the
QROM. Also, it is possible to set parameters of Dilithium (while leaving the design of the
scheme the same) so that the SelfTargetMSIS problem becomes information-theoretically


hard, thus leaving this version of Dilithium secure in the QROM based on just MLWE. An
instantiation of such parameters in [KLS18] results in a scheme with signatures and public
keys that are 2X and 5X larger, respectively. While we do not deem this to be a good
trade-off, the existence of such a scheme gives us added confidence in the security of the
optimized Dilithium.
Very recently, two new works narrowed the gap even more between security in the
ROM and the QROM. The work of [DFMS19] showed that if the underlying Σ-protocol is
collapsing and has special soundness, then its Fiat-Shamir transform is a secure signature
in the QROM. Special soundness of the Dilithium Σ-protocol is directly implied by the
hardness of MSIS [Lyu12, DKL+ 18]. Furthermore, [DFMS19] conjecture that the Dilithium
Σ-protocol is collapsing. The work of [LZ19] further showed that the collapsing property
does have a reduction from MLWE. The reduction is rather non-tight, but it does give
even more affirmation that there is nothing fundamentally insecure about the construction
of Dilithium or any natural scheme built via the Fiat-Shamir framework whose security can
be proven in the ROM. In our opinion, evidence is certainly mounting that the distinction
between signatures secure in the ROM and QROM will soon be treated in the same
way as the distinction between schemes secure in the standard model and ROM – there
will be some theoretical differences, but security in practice will be the same.

Table 1: Output sizes, security, and performance of Dilithium. The formulas for the sizes
of the public key and signature are given in Section 5.4. The numbers in parentheses for
SIS security are for the strongly-unforgeable (i.e. hard to come up with a second distinct
signature for an already-seen message/signature pair) version of the signature scheme,
where the coefficient length comes from Eq. (8), rather than Eq. (7) for the unparenthesized
version. Due to the rejection sampling, there is a noticeable difference between the median
and average signing times, and so we measure both.

NIST Security Level                            2           3           5

Output Size
  public key size (bytes)                   1312        1952        2592
  signature size (bytes)                    2420        3293        4595
LWE Hardness (Core-SVP and refined)
  BKZ block-size b (GSA)                     423         624         863
  Classical Core-SVP                         123         182         252
  Quantum Core-SVP                           112         165         229
  BKZ block-size b (simulation)              433         638         883
  log2 Classical Gates (see App. C.5)        159         217         285
  log2 Classical Memory (see App. C.5)        98         139         187
SIS Hardness (Core-SVP)
  BKZ block-size b                     423 (417)   638 (602)   909 (868)
  Classical Core-SVP                   123 (121)   186 (176)   265 (253)
  Quantum Core-SVP                     112 (110)   169 (159)   241 (230)
Performance (Unoptimized Reference Code, Skylake)
  Gen median cycles                        300,751     544,232     819,475
  Sign median cycles                     1,081,174   1,713,783   2,383,399
  Sign average cycles                    1,355,434   2,348,703   2,856,803
  Verify median cycles                     327,362     522,267     871,609
Performance (AVX2, Skylake)
  Gen median cycles                        124,031     256,403     298,050
  Sign median cycles                       259,172     428,587     538,986
  Sign average cycles                      333,013     529,106     642,192
  Verify median cycles                     118,412     179,424     279,936
Performance (AVX2+AES, Skylake)
  Gen median cycles                         70,548     153,856     153,936
  Sign median cycles                       194,892     296,201     344,578
  Sign average cycles                      251,144     366,470     418,157
  Verify median cycles                      72,633     102,396     151,066

2 Basic Operations
2.1 Ring Operations
We let R and Rq respectively denote the rings Z[X]/(X^n + 1) and Zq[X]/(X^n + 1), for q
an integer. Throughout this document, the value of n will always be 256 and q will be the
prime 8380417 = 2^23 − 2^13 + 1. Regular font letters denote elements in R or Rq (which
includes elements in Z and Zq) and bold lower-case letters represent column vectors with
coefficients in R or Rq. By default, all vectors will be column vectors. Bold upper-case
letters are matrices. For a vector v, we denote by v^T its transpose. The boolean operator
⟦statement⟧ evaluates to 1 if statement is true, and to 0 otherwise.

Modular reductions. For an even (resp. odd) positive integer α, we define r′ = r mod± α
to be the unique element r′ in the range −α/2 < r′ ≤ α/2 (resp. −(α − 1)/2 ≤ r′ ≤ (α − 1)/2)
such that r′ ≡ r (mod α). We will sometimes refer to this as a centered reduction modulo α.
(We draw the reader's attention to the fact that for even α, the range includes α/2 but not
−α/2. This is a somewhat less standard choice, but defining things in this way makes some
parts of the scheme, in particular the bit-packing of the public key, more efficient.) For
any positive integer α, we define r′ = r mod+ α to be the unique element r′ in the range
0 ≤ r′ < α such that r′ ≡ r (mod α). When the exact representation is not important, we
simply write r mod α.
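As a small illustration of these two reductions, here is a Python sketch (mod_plus and mod_pm are illustrative names; the reference implementation computes such reductions branch-free, see Section 5.5):

    def mod_plus(r, alpha):
        """r mod+ alpha: representative in [0, alpha)."""
        return r % alpha

    def mod_pm(r, alpha):
        """r mod± alpha: centered representative, in (-alpha/2, alpha/2] for even
        alpha and in [-(alpha-1)/2, (alpha-1)/2] for odd alpha."""
        r0 = r % alpha
        return r0 - alpha if r0 > alpha // 2 else r0

    assert mod_pm(6, 4) == 2 and mod_pm(7, 4) == -1   # alpha/2 is included, -alpha/2 is not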

Sizes of elements. For an element w ∈ Zq, we write ‖w‖∞ to mean |w mod± q|. We
define the ℓ∞ and ℓ2 norms for w = w0 + w1·X + . . . + wn−1·X^{n−1} ∈ R:

    ‖w‖∞ = max_i ‖wi‖∞ ,    ‖w‖ = √(‖w0‖∞^2 + . . . + ‖wn−1‖∞^2).

Similarly, for w = (w1, . . . , wk) ∈ R^k, we define

    ‖w‖∞ = max_i ‖wi‖∞ ,    ‖w‖ = √(‖w1‖^2 + . . . + ‖wk‖^2).

We will write Sη to denote all elements w ∈ R such that ‖w‖∞ ≤ η. We will write S̃η to
denote the set {w mod± 2η : w ∈ R}. Note that Sη and S̃η are very similar except that S̃η
does not contain any polynomials with −η coefficients.

2.2 NTT domain representation


Our modulus q is chosen such that there exists a 512-th root of unity r modulo q.
Concretely, we always work with r = 1753. This implies that the cyclotomic polynomial
X^256 + 1 splits into linear factors X − r^i modulo q with i = 1, 3, 5, . . . , 511. By the Chinese
remainder theorem our cyclotomic ring Rq is thus isomorphic to the product of the rings
Zq[X]/(X − r^i) ≅ Zq. In this product of rings it is easy to multiply elements since the
multiplication is pointwise there. The isomorphism

    a ↦ (a(r), a(r^3), . . . , a(r^511)) : Rq → ∏_i Zq[X]/(X − r^i)

can be computed quickly with the help of the Fast Fourier Transform. Since X^256 + 1 =
X^256 − r^256 = (X^128 − r^128)(X^128 + r^128), one can first compute the map

    Zq[X]/(X^256 + 1) → Zq[X]/(X^128 − r^128) × Zq[X]/(X^128 + r^128)

and then continue separately with the two reduced polynomials of degree less than 128,
noting that X^128 + r^128 = X^128 − r^384. The Fast Fourier Transform is also called

Number Theoretic Transform (NTT) in this case where the ground field is a finite field.
Natural fast NTT implementations do not output vectors with coefficients in the order
a(r), a(r^3), . . . , a(r^511). Therefore we define the NTT domain representation â = NTT(a) ∈
Zq^256 of a polynomial a ∈ Rq to have coefficients in the order as output by our reference
NTT. Concretely,

    â = NTT(a) = (a(r0), a(−r0), . . . , a(r127), a(−r127))

where ri = r^{brv(128+i)} with brv(k) the bit-reversal of the 8-bit number k. With this notation,
and because of the isomorphism property, we have ab = NTT^{−1}(NTT(a) · NTT(b)). For vectors
y and matrices A, the representations ŷ = NTT(y) and Â = NTT(A) mean that every
polynomial yi and ai,j comprising y and A is in NTT domain representation. We give
further detail about our NTT implementations in Section 5.6.
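The following self-contained Python sketch illustrates the CRT isomorphism described above. It uses a schoolbook O(n²) evaluation map (crt_map and ring_mul are illustrative names and not the reference NTT; in particular the output order differs from the bit-reversed order of our reference NTT) and checks that ring multiplication becomes coordinate-wise in this representation:

    import random

    Q = 8380417        # 2^23 - 2^13 + 1
    N = 256
    R = 1753           # 512-th root of unity mod Q (Section 2.2)

    def crt_map(a):
        """Evaluate a at the odd powers r, r^3, ..., r^511 (O(n^2) version of the
        CRT map; a real NTT computes the same values in O(n log n))."""
        return [sum(c * pow(R, e * j, Q) for j, c in enumerate(a)) % Q
                for e in range(1, 2 * N, 2)]

    def ring_mul(a, b):
        """Schoolbook multiplication in Zq[X]/(X^256 + 1), for comparison."""
        res = [0] * N
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                k = i + j
                if k < N:
                    res[k] = (res[k] + ai * bj) % Q
                else:
                    res[k - N] = (res[k - N] - ai * bj) % Q
        return res

    a = [random.randrange(Q) for _ in range(N)]
    b = [random.randrange(Q) for _ in range(N)]
    # Multiplication is coordinate-wise in the CRT representation:
    assert crt_map(ring_mul(a, b)) == [(x * y) % Q for x, y in zip(crt_map(a), crt_map(b))]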

2.3 Hashing
Our scheme uses several different algorithms that hash strings in {0, 1}∗ onto domains of
various forms. Below we give the high level descriptions of these functions and defer the
details of how exactly they are used in the signature scheme to Section 5.3.

Hashing to a Ball. Let Bτ denote the set of elements of R that have τ coefficients that
are either −1 or 1 and the rest are 0. We have |Bτ| = 2^τ · (256 choose τ). For our signature scheme,
we will need a cryptographic hash function that hashes onto Bτ in two steps. The first
step consists of using a 2nd pre-image resistant cryptographic hash function to map {0, 1}*
onto the domain {0, 1}^256. The second stage consists of an XOF (e.g. SHAKE-256) that
maps the output of the first stage to an element of Bτ. The reason for this two-step
process is that the output of the first stage allows for a very simple, compact description
of an element in Bτ (under the assumption that the XOF is indistinguishable from a
purely-random function). The algorithm we will use to create a random element in Bτ is
sometimes referred to as an “inside-out” version of the Fisher-Yates shuffle [Knu97], and
its high-level description is in Fig. 2. (Normally, the algorithm would begin at i = 0, but
since there are 256 − τ 0's, the first 256 − τ − 1 iterations would just be setting components
of c to 0.) The XOF in the second stage is used to create the randomness needed for this
algorithm (in Steps 03 and 04) using the output of the first stage as the seed.

SampleInBall(ρ)
01 Initialize c = c0 c1 . . . c255 = 00 . . . 0
02 for i := 256 − τ to 255
03 j ← {0, 1, . . . , i}
04 s ← {0, 1}
05 ci := cj
06 cj := (−1)s
07 return c

Figure 2: Create a random 256-element array with τ ±1's and 256 − τ 0's using the input
seed ρ (and an XOF) to generate the randomness needed in Steps 03 and 04.

Expanding the Matrix A. The function ExpandA maps a uniform seed ρ ∈ {0, 1}^256 to
a matrix A ∈ Rq^{k×ℓ} in NTT domain representation. The matrix A is only needed for
multiplication. Hence, for the sake of faster implementations, the expansion function
ExpandA does not output A ∈ Rq^{k×ℓ} = (Zq[X]/(X^256 + 1))^{k×ℓ}. Instead it outputs Â ∈ (Zq^256)^{k×ℓ},
which is interpreted as the NTT domain representation of A. As A needs to be sampled


uniformly and the NTT is an isomorphism, ExpandA also needs to sample uniformly
in this representation. To be compatible to Dilithium, an implementation whose NTT
produces differently ordered vectors than our reference NTT needs to sample coefficients
in a non-consecutive order.

Sampling the vectors y. The function ExpandMask, used for deterministically generating
the randomness of the signature scheme, maps a seed ρ′ and a nonce κ to y ∈ S̃γ1^ℓ.

Collision resistant hash. The function CRH used in our signature scheme is a collision
resistant hash function mapping to {0, 1}^384.

2.4 High/Low Order Bits and Hints


To reduce the size of the public key, we will need some simple algorithms that extract
“higher-order” and “lower-order” bits of elements in Zq . The goal is that when given an
arbitrary element r ∈ Zq and another small element z ∈ Zq , we would like to be able to
recover the higher order bits of r + z without needing to store z. We therefore define
algorithms that take r, z and produce a 1-bit hint h that allows one to compute the higher
order bits of r + z just using r and h. This hint is essentially the “carry” caused by z in
the addition.
There are two different ways in which we will break up elements in Zq into their “high-
order” bits and “low-order” bits. The first algorithm, Power2Roundq , is the straightforward
bit-wise way to break up an element r = r1 · 2^d + r0 where r0 = r mod± 2^d and r1 =
(r − r0)/2^d.
Notice that if we choose the representatives of r1 to be non-negative integers between 0
and ⌊q/2^d⌋, then the distance (modulo q) between any two r1 · 2^d and r1′ · 2^d is usually
≥ 2^d, except for the border case. In particular, the distance modulo q between ⌊q/2^d⌋ · 2^d
and 0 could be very small. This is problematic in the case that we would like to produce a
1-bit hint, as adding a small number to r can actually cause the high-order bits of r to
change by more than 1.
We avoid having the high-order bits change by more than 1 with a simple tweak. We
select an α that is a divisor of q − 1 and write r = r1 · α + r0 in the same way as before.
For the sake of simplicity, we assume that α is even (which is possible, as q is odd). The
possible r1 · α’s are now {0, α, 2α, . . . , q − 1}. Note that the distance between q − 1 and
0 is 1, and so we remove q − 1 from the set of possible r1 · α’s, and simply round the
corresponding r’s to 0. Because q − 1 and 0 differ by 1, all this does is possibly increase
the magnitude of the remainder r0 by 1. This procedure is called Decomposeq . Using
this procedure as a sub-routine, we can define the MakeHintq and UseHintq routines that
produce a hint and, respectively, use the hint to recover the high-order bits of the sum.
For notational convenience, we also define HighBitsq and LowBitsq routines that simply
extract r1 and r0 , respectively, from the output of Decomposeq .
Power2Roundq(r, d)
08 r := r mod+ q
09 r0 := r mod± 2^d
10 return ((r − r0)/2^d, r0)

MakeHintq(z, r, α)
11 r1 := HighBitsq(r, α)
12 v1 := HighBitsq(r + z, α)
13 return ⟦r1 ≠ v1⟧

UseHintq(h, r, α)
14 m := (q − 1)/α
15 (r1, r0) := Decomposeq(r, α)
16 if h = 1 and r0 > 0 return (r1 + 1) mod+ m
17 if h = 1 and r0 ≤ 0 return (r1 − 1) mod+ m
18 return r1

Decomposeq(r, α)
19 r := r mod+ q
20 r0 := r mod± α
21 if r − r0 = q − 1
22   then r1 := 0; r0 := r0 − 1
23   else r1 := (r − r0)/α
24 return (r1, r0)

HighBitsq(r, α)
25 (r1, r0) := Decomposeq(r, α)
26 return r1

LowBitsq(r, α)
27 (r1, r0) := Decomposeq(r, α)
28 return r0

Figure 3: Supporting algorithms for Dilithium.

The lemmas below state the properties of these supporting algorithms that are necessary
for the correctness and security of our scheme. Their proofs can be found in Appendix A.

Lemma 1. Suppose that q and α are positive integers satisfying q > 2α, q ≡ 1 (mod α)
and α even. Let r and z be vectors of elements in Rq where ‖z‖∞ ≤ α/2, and let h, h′
be vectors of bits. Then the HighBitsq, MakeHintq, and UseHintq algorithms satisfy the
following properties:

1. UseHintq(MakeHintq(z, r, α), r, α) = HighBitsq(r + z, α).

2. Let v1 = UseHintq(h, r, α). Then ‖r − v1 · α‖∞ ≤ α + 1. Furthermore, if the number
of 1's in h is ω, then all except at most ω coefficients of r − v1 · α will have magnitude
at most α/2 after centered reduction modulo q.

3. For any h, h′, if UseHintq(h, r, α) = UseHintq(h′, r, α), then h = h′.

Lemma 2. If ‖s‖∞ ≤ β and ‖LowBitsq(r, α)‖∞ < α/2 − β, then

    HighBitsq(r, α) = HighBitsq(r + s, α).

Lemma 3. Let (r1, r0) = Decomposeq(r, α), (w1, w0) = Decomposeq(r + s, α), and
‖s‖∞ ≤ β. Then

    ‖s + r0‖∞ < α/2 − β ⟺ w1 = r1 ∧ ‖w0‖∞ < α/2 − β.
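For concreteness, here is a per-coefficient Python transcription of the routines of Figure 3 (a readable sketch only: it operates on single integers modulo q, whereas the scheme applies these functions coefficient-wise to vectors of polynomials, and the reference code implements them branch-free). The final loop spot-checks item 1 of Lemma 1 for the value α = 2γ2 used at NIST levels 3 and 5:

    import random

    Q = 8380417

    def mod_pm(r, alpha):
        """Centered reduction r mod± alpha (result in (-alpha/2, alpha/2] for even alpha)."""
        r0 = r % alpha
        return r0 - alpha if r0 > alpha // 2 else r0

    def power2round(r, d):
        r = r % Q
        r0 = mod_pm(r, 1 << d)
        return (r - r0) // (1 << d), r0

    def decompose(r, alpha):
        r = r % Q
        r0 = mod_pm(r, alpha)
        if r - r0 == Q - 1:            # border case of Decompose_q
            return 0, r0 - 1
        return (r - r0) // alpha, r0

    def highbits(r, alpha):
        return decompose(r, alpha)[0]

    def lowbits(r, alpha):
        return decompose(r, alpha)[1]

    def make_hint(z, r, alpha):
        return int(highbits(r, alpha) != highbits((r + z) % Q, alpha))

    def use_hint(h, r, alpha):
        m = (Q - 1) // alpha
        r1, r0 = decompose(r, alpha)
        if h == 1:
            return (r1 + 1) % m if r0 > 0 else (r1 - 1) % m
        return r1

    alpha = 2 * ((Q - 1) // 32)        # 2*gamma2 for NIST levels 3 and 5
    for _ in range(1000):
        r = random.randrange(Q)
        z = random.randrange(-alpha // 2, alpha // 2 + 1)
        h = make_hint(z, r, alpha)
        assert use_hint(h, r, alpha) == highbits((r + z) % Q, alpha)   # Lemma 1, item 1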

3 Signature
The Key Generation, Signing, and Verification algorithms for our signature scheme are
presented in Fig. 4. We present the deterministic version of the scheme in which the ran-
domness used in the signing procedure is generated (using SHAKE-256) as a deterministic
function of the message and a small secret key. Since our signing procedure may need to
be repeated several times until a signature is produced, we also append a counter in order
to make the SHAKE-256 output differ with each signing attempt of the same message.
Also due to the fact that each (possibly long) message may require several iterations to
sign, we compute an initial digest of the message using a collision-resistant hash function,
and use this digest in place of the message throughout the signing procedure.
As discussed in Section 1.2, the main design improvement of Dilithium over the scheme
in Fig. 1 is that the public key size is approximately halved at the expense of fewer than a
hundred extra bytes in the signature. To accomplish the size reduction, the key generation
algorithm outputs t1 := Power2Roundq (t, d) as the public key instead of t as in Fig. 1.
This means that instead of ⌈log q⌉ bits per coefficient, the public key requires ⌈log q⌉ − d
bits. In our instantiations, q ≈ 2^23 and d = 13, which means that instead of 23 bits in
each public key coefficient, there are instead 10.

Gen
01 ζ ← {0, 1}^256
02 (ρ, ς, K) ∈ {0, 1}^{256×3} := H(ζ)    ▷ H is instantiated as SHAKE-256 throughout
03 (s1, s2) ∈ Sη^ℓ × Sη^k := H(ς)
04 A ∈ Rq^{k×ℓ} := ExpandA(ρ)    ▷ A is generated and stored in NTT Representation as Â
05 t := As1 + s2    ▷ Compute As1 as NTT^{−1}(Â · NTT(s1))
06 (t1, t0) := Power2Roundq(t, d)
07 tr ∈ {0, 1}^384 := CRH(ρ ‖ t1)
08 return (pk = (ρ, t1), sk = (ρ, K, tr, s1, s2, t0))

Sign(sk, M)
09 A ∈ Rq^{k×ℓ} := ExpandA(ρ)    ▷ A is generated and stored in NTT Representation as Â
10 µ ∈ {0, 1}^384 := CRH(tr ‖ M)
11 κ := 0, (z, h) := ⊥
12 ρ′ ∈ {0, 1}^384 := CRH(K ‖ µ) (or ρ′ ← {0, 1}^384 for randomized signing)
13 while (z, h) = ⊥ do    ▷ Pre-compute ŝ1 := NTT(s1), ŝ2 := NTT(s2), and t̂0 := NTT(t0)
14   y ∈ S̃γ1^ℓ := ExpandMask(ρ′, κ)
15   w := Ay    ▷ w := NTT^{−1}(Â · NTT(y))
16   w1 := HighBitsq(w, 2γ2)
17   c̃ ∈ {0, 1}^256 := H(µ ‖ w1)
18   c ∈ Bτ := SampleInBall(c̃)    ▷ Store c in NTT representation as ĉ = NTT(c)
19   z := y + cs1    ▷ Compute cs1 as NTT^{−1}(ĉ · ŝ1)
20   r0 := LowBitsq(w − cs2, 2γ2)    ▷ Compute cs2 as NTT^{−1}(ĉ · ŝ2)
21   if ‖z‖∞ ≥ γ1 − β or ‖r0‖∞ ≥ γ2 − β, then (z, h) := ⊥
22   else
23     h := MakeHintq(−ct0, w − cs2 + ct0, 2γ2)    ▷ Compute ct0 as NTT^{−1}(ĉ · t̂0)
24     if ‖ct0‖∞ ≥ γ2 or the # of 1's in h is greater than ω, then (z, h) := ⊥
25   κ := κ + ℓ
26 return σ = (z, h, c̃)

Verify(pk, M, σ = (z, h, c̃))
27 A ∈ Rq^{k×ℓ} := ExpandA(ρ)    ▷ A is generated and stored in NTT Representation as Â
28 µ ∈ {0, 1}^384 := CRH(CRH(ρ ‖ t1) ‖ M)
29 c := SampleInBall(c̃)
30 w1′ := UseHintq(h, Az − ct1 · 2^d, 2γ2)    ▷ Compute as NTT^{−1}(Â · NTT(z) − NTT(c) · NTT(t1 · 2^d))
31 return ⟦‖z‖∞ < γ1 − β⟧ and ⟦c̃ = H(µ ‖ w1′)⟧ and ⟦# of 1's in h is ≤ ω⟧

Figure 4: The pseudo-code for deterministic and randomized versions of Dilithium. The
only difference between the two versions is in Line 12, where ρ′ is either a function of the
key and message, or is chosen completely at random.

The main complication due to not having the entire t in the public key is that the
verification algorithm is no longer able to exactly compute w1′ in Line 13 in Fig. 1. In
order to do this, the verification algorithm will need the high order bits of Az − ct, but it
can only compute Az − ct1 · 2^d = Az − ct + ct0. Even though the high-order bits of ct0
are 0, its presence in the sum creates “carries” which may have an effect on the higher bits.
The signer thus sends these carries as a hint to the verifier. Heuristically, based on our
parameter choices, there should not be more than ω positions in which a carry is caused.
The signer therefore simply sends the positions in which these carries occur (these are the
extra bytes in Dilithium compared to the signature in Fig. 1), which allows the verifier to
compute the high order bits of Az − ct.

3.1 Implementation Notes and Efficiency Trade-offs


To keep the size of the public (and secret) key small, both the Sign and Verify procedures
begin with extracting the matrix A (or more accurately, its NTT domain representation
Â) from the seed ρ. If storage space is not a factor, then Â can be pre-computed and be
part of the secret/public key. The signer can additionally pre-compute the NTT domain
representations of s1 , s2 , t0 to slightly speed up the signing operation. At the other extreme,
if the signer wants to store as small a secret key as possible, he only needs to store a
32-byte secret seed ζ which is used to generate the randomness to create ρ, K , and s1 , s2 in
the key generation algorithm. Furthermore, one can also keep the memory for intermediate
computations low by only keeping the parts of the NTT domain representation that one is
currently working with.

3.2 Deterministic vs. Randomized Signatures


Dilithium gives an option of producing either deterministic (i.e. the signature of a particular
message is always the same) or randomized signatures. The only implementational difference
between the two options occurs on Line 12 where the seed ρ′ is either derived from the
message and a key (in deterministic signing) or is chosen completely at random (for
randomized signing).
One may want to consider randomized signatures in situations where the side channel
attacks of [SBB+ 18, PSS+ 18] exploiting determinism are applicable. Another situation
where one may want to avoid determinism is when the signer does not wish to reveal the
message that is being signed. While there is no timing leakage of the secret key, there is
timing leakage of the message if the scheme is deterministic. Since the randomness of the
scheme is derived from the message, the number of aborts for a particular message will
always be the same.

3.3 Correctness
In this section, we prove the correctness of the signature scheme.
If ‖ct0‖∞ < γ2, then by Lemma 1 we know that

UseHintq (h, w − cs2 + ct0 , 2γ2 ) = HighBitsq (w − cs2 , 2γ2 ) .

Since w = Ay and t = As1 + s2 , we have that

w − cs2 = Ay − cs2 = A(z − cs1 ) − cs2 = Az − ct, (2)

and w − cs2 + ct0 = Az − ct1 · 2^d. Therefore the verifier computes

    UseHintq(h, Az − ct1 · 2^d, 2γ2) = HighBitsq(w − cs2, 2γ2).



Table 2: Parameters of Dilithium

NIST Security Level                                2            3            5

q [modulus]                                  8380417      8380417      8380417
d [dropped bits from t]                           13           13           13
τ [# of ±1's in c]                                39           49           60
challenge entropy [log2 (256 choose τ) + τ]      192          225          257
γ1 [y coefficient range]                        2^17         2^19         2^19
γ2 [low-order rounding range]               (q − 1)/88   (q − 1)/32   (q − 1)/32
(k, ℓ) [dimensions of A]                      (4, 4)       (6, 5)       (8, 7)
η [secret key range]                               2            4            2
β [τ · η]                                         78          196          120
ω [max. # of 1's in the hint h]                   80           55           75
Repetitions (from Eq. (5))                      4.25          5.1         3.85

Furthermore, because β is set such that ‖cs2‖∞ ≤ β and the signer checks in Line 21
that ‖LowBitsq(w − cs2, 2γ2)‖∞ < γ2 − β, Lemma 2 implies that

HighBitsq (w − cs2 , 2γ2 ) = HighBitsq (w − cs2 + cs2 , 2γ2 ) = HighBitsq (w, 2γ2 ) = w1 . (3)

Therefore, the w1′ computed by the verifier in the input to the hash function is the
same as the w1 of the signer. And thus the verification procedure will always accept.

3.4 Number of Repetitions


We now want to compute the probability that Step 21 will set (z, h) to ⊥. The probability
that ‖z‖∞ < γ1 − β can be computed by considering each coefficient separately. For
each coefficient σ of cs1 , the corresponding coefficient of z will be between −γ1 + β + 1
and γ1 − β − 1 (inclusively) whenever the corresponding coefficient of yi is between
−γ1 + β + 1 − σ and γ1 − β − 1 − σ. The size of this range is 2(γ1 − β) − 1, and the
coefficients of y have 2γ1 − 1 possibilities. Thus the probability that every coefficient of y
is in the good range is
    ( (2(γ1 − β) − 1) / (2γ1 − 1) )^{256·ℓ} = ( 1 − β/(γ1 − 1/2) )^{ℓn} ≈ e^{−256·βℓ/γ1},    (4)

where we used the fact that our values of γ1 are large compared to 1/2.
We now move to computing the probability that we have

    ‖r0‖∞ = ‖LowBitsq(w − cs2, 2γ2)‖∞ < γ2 − β.

If we (heuristically) assume that the low order bits are uniformly distributed modulo 2γ2 ,
then there is a
    ( (2(γ2 − β) − 1) / (2γ2) )^{256·k} ≈ e^{−256·βk/γ2}
probability that all the coefficients are in the good range (using the fact that our values
of β are large compared to 1/2). Therefore, the probability that Step 21 passes is

    ≈ e^{−256·β(ℓ/γ1 + k/γ2)}.    (5)



It is more difficult to formally compute the probability that Step 24 results in a restart.
The parameters were set such that heuristically (z, h) = ⊥ with probability between 1 and
2%. Therefore the vast majority of the loop repetitions will be caused by Step 21.
We would like to stress that the expected number of iterations is independent of the
secret key s1 , s2 and therefore no information about them can be gained by an attack that
counts the iterations.
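The expected number of repetitions caused by Step 21 can be obtained directly from Eq. (5). The following stand-alone Python calculation (illustrative code; parameter values taken from Table 2) reproduces, up to rounding, the "Repetitions" row of Table 2:

    from math import exp

    Q = 8380417
    # (k, l, beta, gamma1, gamma2) per NIST security level, from Table 2.
    params = {2: (4, 4,  78, 2**17, (Q - 1) // 88),
              3: (6, 5, 196, 2**19, (Q - 1) // 32),
              5: (8, 7, 120, 2**19, (Q - 1) // 32)}

    for level, (k, l, beta, g1, g2) in params.items():
        p_step21 = exp(-256 * beta * (l / g1 + k / g2))   # Eq. (5)
        print(level, f"{1 / p_step21:.2f}")               # approx. 4.25, 5.09, 3.85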

4 Dilithium Challenge and Extended Security Parameters

Table 3: Challenge and Extended Security Parameters for Dilithium. The numbers in
parentheses for SIS security are for the strongly-unforgeable (i.e. hard to come up with
a second distinct signature for an already-seen message/signature pair) version of the
signature scheme.

Challenge and Extended Sets              1--           1-           5+          5++

q                                    8380417      8380417      8380417      8380417
d                                         10           13           13           13
weight of c                               24           30           60           60
challenge entropy                        135          160          257          257
γ1                                      2^17         2^17         2^19         2^19
γ2                               (q − 1)/128  (q − 1)/128   (q − 1)/32   (q − 1)/32
(k, ℓ)                                (2, 2)       (3, 3)       (9, 8)      (10, 9)
η                                          6            3            2            2
β                                        144           90          120          120
ω                                         10           80           85           90
pk size (bytes)                          864          992         2912         3232
sig size (bytes)                        1196         1843         5246         5892
Exp. reps (from Eq. (5))                 5.2         4.87         4.59         5.48
BKZ block-size b to break SIS      190 (165)    305 (305)  1055 (1005)  1200 (1145)
Core-SVP Classical                   55 (49)      89 (89)    308 (293)    360 (334)
BKZ block-size b to break LWE            200          305         1020         1175
Core-SVP Classical                        58           89          298          343

In addition to the parameters presented in Table 1 and Table 2 which correspond to


various NIST security levels, we also give parameters that are below and above these
levels. The purpose of the two parameter sets below NIST level 1 (called 1-- and 1-) is to
encourage cryptanalysis and to be an early warning in case the SIS and LWE problems are
easier than expected. The 1-- level has less than 60 bits of Core-SVP security, and it is
therefore conceivable that a non-trivial computational effort could break these parameters
in the next several years. It can also be a good gauge of progress on the computational
problems underlying Dilithium. The 1- security level, on the other hand, has approximately
90 bits of Core-SVP security and BKZ block-size of around 300. Breaking SIS or LWE
at this level in the near future without a gargantuan computational effort would indicate
that a big cryptanalytical leap has been made, and one should therefore consider the NIST
level 2 proposed parameters to be in danger. In this case, an immediate switch to levels 3
or 5 would be strongly encouraged.
At the other end of the spectrum, we also propose parameters that are above NIST level
5, which we call 5+ and 5++. For these parameter sets, we do not increase the hardness

of any symmetric primitives (e.g. PRFs or collision-resistant hash functions) that are
used inside our protocol, but only increase the security of the underlying lattice problems.
The goal of these parameter sets is to show what the parameters would be should there
be a moderate improvement in lattice cryptanalysis. In terms of Core-SVP hardness,
the 5++ level has about twice the required security of NIST level 3, and is therefore a
good representative of what the parameters may look like if moderate improvements in
lattice-based cryptanalysis necessitate a change. Such a change would be recommended
in case some future cryptanalytic effort begins to threaten the security of the level 3
parameter set.

5 Implementation Details
5.1 Alternative way of decomposing and computing the hints
In the pseudo-code for signing in Figure 4, the decomposition function Decompose is called
3 times per iteration of the rejection sampling loop. We have presented the algorithm
in this way for the ease of exposition and for compatibility with the security proofs
in [KLS18, DKL+ 18]. In the implementation, we use an alternative and faster way of
computing the hint vector h and the rejection condition based on the low part w0 of w.
First note that Lemma 3 says that instead of computing (r1 , r0 ) = Decomposeq (w − cs2 , α)
and checking whether ‖r0‖∞ < γ2 − β and r1 = w1, it is equivalent to just check that
‖w0 − cs2‖∞ < γ2 − β, where w0 is the low part of w. If this check passes, w0 − cs2
is the low part of w − cs2 . Next, recall that by the definition of the MakeHint function,
a coefficient of a polynomial in h as computed in Figure 4 is non-zero precisely if the
high parts of the corresponding coefficients of w − cs2 and w − cs2 + ct0 differ. Now, we
have already computed the full decomposition w = αw1 + w0 of w, and we know that
αw1 + (w0 − cs2 ) is the correct decomposition of w − cs2 . But then, αw1 + (w0 − cs2 + ct0 )
is the correct decomposition of w − cs2 + ct0 (i.e. the high part is w1 ) if and only if each
coefficient of w0 − cs2 + ct0 lies in the interval (−γ2 , γ2 ], or, when some coefficient is −γ2
and the corresponding coefficient of w1 is zero. The last condition is due to the border case
in the Decompose function. On the other hand, if these conditions are not true, then the
high parts must differ and it then follows that for computing the hint vector h it suffices
to just check these conditions on the coefficients of w0 − cs2 + ct0 .
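A per-coefficient sketch of the condition just described (hint_from_decomposition is an illustrative name; the coefficients of w1 and of w0 − cs2 + ct0 are assumed to be given as plain centered integers, and α = 2γ2):

    def hint_from_decomposition(w1, low, gamma2):
        """Compute the hint bits from w1 and the coefficients of w0 - c*s2 + c*t0
        (here called `low`): the hint is 0 exactly when w1 is still the correct
        high part, i.e. when the low coefficient lies in (-gamma2, gamma2], or when
        it equals -gamma2 and the corresponding coefficient of w1 is zero."""
        return [0 if (-gamma2 < lo <= gamma2) or (lo == -gamma2 and hi == 0) else 1
                for hi, lo in zip(w1, low)]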

5.2 Bit-packing
We now describe how we encode vectors as byte strings. This is needed for absorbing
them into SHAKE and defining the data layout of the keys and signature. To reduce the
computation time spent on SHAKE and the sizes of keys and signatures, we use bit-packing.
The general rule that we will follow is that if the range that an element x is in consists
exclusively of non-negative integers, then we simply pack the integer x. If x is from a range
[−a, b] that may contain some negative integers, then we pack the positive integer b − x.
We begin with the vector w1 . It consists of k polynomials w1,1 , . . . , w1,k in Rq with
coefficients that are roundings of elements in Zq with respect to α = 2γ2 . It follows that
its coefficients lie in {0, . . . , 15} for NIST levels 3 and 5, and in {0, . . . , 43} for NIST level
2. These can be represented by 4 and, respectively, 6, bits each. This allows w1 to be
packed in a string of k · 256 · 4/8 = k · 128, respectively k · 192 bytes. For NIST security
levels 3 and 5, each byte encodes two consecutive coefficients of a polynomial w1,i in its
low 4 bits and high 4 bits, respectively. See Figure 5 for an explanation of the exact bit
packing. For NIST security level 2, the packing looks as in Figure 6.
Figure 5: Bit-packing w1 for NIST levels 3 and 5. The k polynomials comprising w1 are
w1,1, . . . , w1,k and we let c1, . . . , c256 be the coefficients of w1,1 (with the lower powers first),
c257, . . . , c512 be the coefficients of w1,2, etc.

Figure 6: Bit-packing w1 for NIST level 2.

Next we turn to the vector t1, which is the power-of-two rounding of t. Note that
q − 1 = 2^23 − 2^13 = (2^10 − 1) · 2^13, which shows that the coefficients of the k polynomials
of t1 lie in {0, . . . , 2^10 − 1} and can be represented by 10 bits each. These 10 bits per
coefficient, in little-endian byte-order, are bit-packed. In total t1 needs k · 256 · 10/8 = 320k
bytes. See Figure 7 for an explanation of the exact bit packing.

Figure 7: Bit-packing t1. The k polynomials comprising t1 are t1,1, . . . , t1,k and we let
c1, . . . , c256 be the coefficients of t1,1 (with the lower powers first), c257, . . . , c512 be the
coefficients of t1,2, etc.
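As an illustration of this kind of layout, here is a toy packer for t1 in Python (10 bits per coefficient accumulated in little-endian order; pack_t1 is an illustrative sketch of the rule described above, not the reference packing routine):

    def pack_t1(coeffs):
        """Pack integers in [0, 2^10) into a byte string, 10 bits per coefficient,
        little-endian bit order (toy version of the t1 bit-packing)."""
        acc, nbits, out = 0, 0, bytearray()
        for c in coeffs:
            acc |= c << nbits
            nbits += 10
            while nbits >= 8:
                out.append(acc & 0xFF)
                acc >>= 8
                nbits -= 8
        return bytes(out)

    # 256 coefficients * 10 bits = 320 bytes per polynomial, i.e. 320k bytes for t1
    assert len(pack_t1([0] * 256)) == 320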

The coefficients of the polynomials of t0 can be written in the form 2^12 − v with
v ∈ {0, . . . , 2^13 − 1}. These v in little endian byte-order are bit-packed. This results in
256 · 13/8 bytes per polynomial and k · 256 · 13/8 = 416k bytes for t0. See Figure 8 for an
explanation of the exact bit packing.
The polynomials in s1 and s2 have coefficients with infinity norm at most η. So every
coefficient of these polynomials is equivalent modulo q to η − c with some c ∈ {0, . . . , 2η}. In
the bit packing the values for c are stored so that each polynomial needs 256 · ⌈log(2η + 1)⌉/8
bytes. This amounts to 256 · 3/8 = 96 bytes for NIST levels 2 and 5, and 256 · 4/8 = 128
bytes for level 3. The bit-packing is done similarly to the case of w1, t1 and t0. See Figure
9 for an explanation of the exact bit packing.
Finally, z contains polynomials whose coefficients are equivalent modulo q to γ1 − c
with c ∈ {0, . . . , 2γ1 − 1} and these values c are bit packed. For NIST levels 3 and 5,
bit-packing z requires ℓ · 256 · 20/8 = 640ℓ bytes and blocks of 2 coefficients are stored in 5
consecutive bytes. See Figure 10 for an explanation of the exact bit packing. For NIST
level 2, the bit-packing of z requires ℓ · 256 · 18/8 = 576ℓ bytes.
Figure 8: Bit-packing t0. The k polynomials comprising t0 are t0,1, . . . , t0,k with ci defined
such that 2^12 − c1, . . . , 2^12 − c256 are the coefficients of t0,1 (with the lower powers first),
2^12 − c257, . . . , 2^12 − c512 are the coefficients of t0,2, etc.

Figure 9: Bit-packing si. The ℓ polynomials comprising s1 are s1,1, . . . , s1,ℓ and we let
η − c1, . . . , η − c256 be the coefficients of s1,1 (with the lower powers first), η − c257, . . . , η − c512
be the coefficients of s1,2, etc., where ci ∈ {0, . . . , 2η}. The k polynomials comprising s2
are s2,1, . . . , s2,k and we let η − c1, . . . , η − c256 be the coefficients of s2,1 (with the lower
powers first), η − c257, . . . , η − c512 be the coefficients of s2,2, etc., ci ∈ {0, . . . , 2η}. The above
packing is for parameter sets where ci requires three bits per coefficient. Packing four bits
per coefficient is done in the obvious manner and looks like the packing of w1 in Figure 5.

Figure 10: Bit-packing z. The ℓ polynomials comprising z are z1, . . . , zℓ and we let γ1 −
c1, . . . , γ1 − c256 be the coefficients of z1 (with the lower powers first), γ1 − c257, . . . , γ1 − c512
be the coefficients of z2, etc. Each coefficient of z uses 20 bits. The packing
for level 2, where each coefficient uses 18 bits, looks analogous.

5.3 Hashing
Hashing to a Ball. The function H in Figure 4 absorbs the concatenation of µ and w1
into SHAKE-256, and produces a 32-byte output c̃. We now precisely specify the operation
of the function SampleInBall : c̃ 7→ c ∈ Bτ described in Figure 2 as it is used in our
signature scheme. The SampleInBall routine absorbs the 32 bytes of c̃ into SHAKE-256.
Throughout its operations the function squeezes SHAKE-256 in order to obtain a stream
of random bytes of variable length. The first τ bits in the first 8 bytes of this random
stream are interpreted as τ random sign bits si ∈ {0, 1}, i = 0, . . . , τ − 1. The remaining
64 − τ bits are discarded. In each iteration of the for loop in the Algorithm in Figure 2,
we use rejection sampling on elements from {0, . . . , 255} until we obtain a j ∈ {0, . . . , i}.
An element in {0, . . . , 255} is obtained by interpreting the next byte of the random stream
from SHAKE-256 as a number in this set. For the sign s, the corresponding sign bit s_{i−(256−τ)} is used.
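A compact Python sketch of this procedure (using hashlib's SHAKE-256; c_tilde is the 32-byte seed c̃, and squeezing a fixed 1024-byte stream up front is a simplification of the incremental squeezing described above):

    import hashlib

    def sample_in_ball(c_tilde, tau=39):
        """Sketch of SampleInBall: take tau sign bits from the first 8 squeezed
        bytes, then run the 'inside-out' shuffle of Figure 2 with rejection
        sampling on single bytes (assumes tau <= 64)."""
        stream = hashlib.shake_256(c_tilde).digest(1024)   # ample for this sketch
        signs = int.from_bytes(stream[:8], 'little')       # first tau bits = signs
        pos = 8
        c = [0] * 256
        for i in range(256 - tau, 256):
            while True:
                j = stream[pos]; pos += 1
                if j <= i:                                 # rejection sampling on {0,...,i}
                    break
            c[i] = c[j]
            c[j] = 1 - 2 * (signs & 1)                     # +1 if sign bit is 0, -1 if 1
            signs >>= 1
        return c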

Expanding the Matrix A. The function ExpandA maps a uniform seed ρ ∈ {0, 1}^256 to a
matrix A ∈ Rq^{k×ℓ} in NTT domain representation. It computes each coefficient âi,j ∈ Rq of
Â separately. For the coefficient âi,j it absorbs the 32 bytes of ρ immediately followed by
two bytes representing 0 ≤ 256·i + j < 2^16 in little-endian byte order into SHAKE-128.
In the AES variant ρ is used as the key and 256·i + j is zero-padded to a 12-byte nonce
for AES in counter mode. The output stream of SHAKE-128 or AES256ctr is interpreted
as a sequence of integers between 0 and 2^23 − 1. This is done by setting the highest bit
of every third byte to zero and interpreting blocks of 3 consecutive bytes in little endian
byte order. So for example the three bytes b0, b1 and b2 are used to get the integer
0 ≤ b2′ · 2^16 + b1 · 2^8 + b0 ≤ 2^23 − 1 where b2′ is the logical AND of b2 and 2^7 − 1. Finally,
ExpandA performs rejection sampling on these 23-bit integers to sample the 256 coefficients
ai,j(r0), ai,j(−r0), . . . , ai,j(r127), ai,j(−r127) of âi,j uniformly from the set {0, . . . , q − 1} in
the order of our NTT domain representation.
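A sketch of this sampling for a single entry of Â (expand_a_entry is an illustrative name; it follows the SHAKE-128 variant described above and ignores the block-wise squeezing of the actual implementation):

    import hashlib

    Q = 8380417

    def expand_a_entry(rho, i, j):
        """Sample one NTT-domain polynomial of A: absorb rho and the 16-bit nonce
        256*i + j into SHAKE-128, read 3-byte blocks, clear the top bit, and
        rejection-sample 256 coefficients uniform in {0, ..., q-1}."""
        nonce = (256 * i + j).to_bytes(2, 'little')
        stream = hashlib.shake_128(rho + nonce).digest(2048)   # more than enough here
        coeffs, pos = [], 0
        while len(coeffs) < 256:
            b0, b1, b2 = stream[pos:pos + 3]
            pos += 3
            t = ((b2 & 0x7F) << 16) | (b1 << 8) | b0           # 23-bit candidate
            if t < Q:
                coeffs.append(t)
        return coeffs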

Sampling the vectors y. The function ExpandMask maps (ρ′, κ) to y ∈ S̃γ1^ℓ, where κ ≥ 0,
and works as follows. It computes each of the ℓ coefficients of y, which are polynomials
in S̃γ1, independently. For the i-th polynomial, 0 ≤ i < ℓ, it absorbs the 48 bytes of ρ′
concatenated with the 2 bytes representing κ + i in little endian byte order into SHAKE-256.
In the AES variant the first 32 bytes of ρ′ are used as the key and κ + i is zero-padded
to a 12 byte nonce for AES in counter mode. Then the output bytes are used to create
a positive number in the range {0, . . . , 2γ1 − 1}, and the integers comprising y are
obtained by subtracting (γ1 − 1) from this positive number. Because γ1 is a power of
2, no rejection sampling is required. Because γ1 is 2^17 or 2^19, depending on the security level,
generating the consecutive integers comprising y involves either taking 18 or 20 bits at a
time from the byte stream.

Collision resistant hash. The function CRH in Figure 4 is a collision resistant hash
function. For this purpose 384 bits of the output of SHAKE-256 are used. CRH is called
with 3 different inputs. First it is called with ρ ‖ t1. The function then absorbs the 32
bytes of ρ followed by the k · 256 · 10/8 bytes of the bit-packed representation of t1 into
SHAKE-256 and takes the first 48 bytes of the first output block of SHAKE-256 as the
output hash. The second input is tr ‖ M. Here the concatenation of the hash tr and the
message string M are absorbed into SHAKE-256 and the first 48 output bytes are used as
the resulting hash. The third input K ‖ µ is handled in the same way.

5.4 Data layout of keys and signature


Public key. The public key, containing ρ and t1 , is stored as the concatenation of the
bit-packed representations of ρ and t1 in this order. Therefore, it has a size of 32 + 320k
bytes.

Secret key. The secret key contains ρ, K , tr, s1 , s2 and t0 and is also stored as the
concatenation of the bit-packed representation of these quantities in the given order.
Consequently, a secret key requires 32 + 32 + 48 + 32·((k + ℓ) · ⌈log(2η + 1)⌉ + 13k) bytes.
As mentioned previously, the signer also has the option of simply storing the 32-byte value
ζ and then simply re-deriving all the other elements of the secret key.

Signature. The signature byte string is the concatenation of a bit packed representation
of z and encodings of h and c in this order. We describe the encoding of h, which needs
ω + k bytes. Together all the polynomials in the vector h have at most ω non-zero

coefficients. It is sufficient to store the locations of these non-zero coefficients. Each of the
first ω bytes of the byte string representing h is the index i of the next non-zero coefficient
in its polynomial, i.e. 0 ≤ i ≤ 255, or zero if there are no more non-zero coefficients. The
bytes numbers ω up to ω + k − 1 record the k positions j of the polynomial boundaries in
the string of ω coefficient indices, where 0 ≤ j ≤ ω. The challenge c̃ has 32 bytes.
Therefore, a signature requires 32l · (1 + log γ1 ) + ω + k + 32 bytes.
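The encoding of h just described can be transcribed directly into C. The following sketch uses illustrative names and the level 2 parameters from Table 2, and assumes that the total number of non-zero coefficients has already been checked to be at most ω.

    #include <stdint.h>

    #define N     256
    #define K     4     /* level 2 */
    #define OMEGA 80    /* level 2 */

    /* h[i][j] in {0,1} is coefficient j of the i-th polynomial of h. */
    void pack_hints(uint8_t sig[OMEGA + K], const uint8_t h[K][N]) {
      unsigned idx = 0;
      for (unsigned i = 0; i < K; ++i) {
        for (unsigned j = 0; j < N; ++j)
          if (h[i][j])                   /* store positions of non-zero coefficients */
            sig[idx++] = (uint8_t)j;
        sig[OMEGA + i] = (uint8_t)idx;   /* boundary: number of indices written so far */
      }
      while (idx < OMEGA)                /* pad the unused position bytes with zero */
        sig[idx++] = 0;
    }

As a consistency check of the size formula, the level 2 parameters (ℓ = k = 4, γ1 = 2^17, ω = 80) give 32 · 4 · 18 + 80 + 4 + 32 = 2420 bytes.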

5.5 Constant time implementation


Our reference implementation does not branch depending on secret data and does not access
memory locations that depend on secret data. For the modular reductions that are needed
for the arithmetic in Rq we never use the ’%’ operator of the C programming language.
Instead we use Montgomery reductions without the correction steps and special reduction
routines that are specific to our modulus q. For computing the rounding functions described
in Section 2.4, we have implemented branching-free algorithms. On the other hand, when
it is safe to reveal information, we have not tried to make the code constant-time. This
includes the computation of the challenges and the rejection conditions in the signing
algorithm. When performing rejection sampling, our code reveals which of the conditions
was the reason for the rejection, and in case of the norm checks, which coefficient violated
the bound. This is safe since the rejection probabilities for each coefficient are independent
of secret data. The challenges reveal information about CRH(µ k w1 ) also in the case of
rejected y, but this does not reveal any information about the secret key when CRH is
modelled as a random oracle and w1 has high min-entropy.

5.6 Reference implementation


Our reference NTT is a natural iterative implementation for 32 bit signed integers that
uses Cooley-Tukey butterflies in the forward transform and Gentleman-Sande butterflies in
the inverse transform. For modular reductions after multiplying with a precomputed root
of unity we use the Montgomery algorithm as was already done before in e.g. [ADPS16].
In order that the reduced values are correct representatives, the precomputed roots contain
the Montgomery factor 232 mod q. We also use Montgomery reductions after the pointwise
product of the polynomials in the NTT domain representations. Since we cannot get the
Montgomery factor in at this point, these products are in fact Hensel remainders. We
then make use of the fact that the NTT transform is linear and multiply by an additional
Montgomery factor after the inverse NTT when we divide out the factor 256.
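For concreteness, a sketch of such a signed Montgomery reduction for q = 2^23 − 2^13 + 1 is shown below. It follows the standard technique described above and is not meant as a verbatim copy of the reference routine.

    #include <stdint.h>

    #define Q    8380417     /* 2^23 - 2^13 + 1 */
    #define QINV 58728449    /* q^(-1) mod 2^32 */

    /* For |a| <= 2^31 * q, returns r = a * 2^(-32) mod q with -q < r < q
       (no final correction step, cf. Section 5.5). */
    static int32_t montgomery_reduce(int64_t a) {
      int32_t t = (int32_t)((uint64_t)a * QINV);      /* t = a * q^(-1) mod 2^32 */
      return (int32_t)((a - (int64_t)t * Q) >> 32);   /* exact division by 2^32  */
    }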
The implementations of the functions ExpandA and ExpandMask initially squeeze a
number of output blocks of SHAKE-256 and SHAKE-128 that give enough randomness
with high probability. In the case of ExpandA, which samples uniform polynomials and
hence needs at least 3 · 256 = 768 random bytes per polynomial, at least 5 blocks from
SHAKE-128 of 168 bytes each are needed for one polynomial. They suffice with probability
greater than 1 − 2^{−132}. ExpandMask initially retrieves 5 blocks from SHAKE-256 that have
136 bytes. This is the minimum number of blocks and suffices with probability greater
than 1 − 2^{−81}.
As mentioned in the introduction our reference implementation is protected against
timing attacks. For this reason the centralized remainders in the rounding functions given
in Figure 3 are not computed with branchings. Instead we use the following well-known
trick to compute the centralized remainder r′ = r mod± α where 0 ≤ r < q. Subtracting
α/2 + 1 from r yields a negative result if and only if r ≤ α/2. Therefore, shifting this
result arithmetically to the right by 31 bits gives −1, i.e. the integer with all bits equal to
1, if r ≤ α/2 and 0 otherwise. Then the logical AND of the shifted value and α is added
to this intermediate result and α/2 − 1 is subtracted. This results in r − α if r > α/2 and r
if r ≤ α/2, i.e. the centralized remainder.
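In C the trick reads as follows; this is a minimal sketch assuming the input has already been reduced to the range [0, α] and that α is even (as is the case for α = 2γ2), with illustrative names.

    #include <stdint.h>

    static int32_t mod_pm(int32_t r, int32_t alpha) {
      int32_t t = r - alpha/2 - 1;   /* negative exactly when r <= alpha/2            */
      t += (t >> 31) & alpha;        /* arithmetic shift: add alpha back in that case  */
      return t - (alpha/2 - 1);      /* r if r <= alpha/2, and r - alpha otherwise     */
    }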

We make heavy use of lazy reduction in our implementation. In the NTT we do not
reduce the results of additions and subtractions at all. For rounding and norm checking
it is important to map to standard representatives. This freezing of the coefficients is
achieved in constant-time by conditionally subtracting q with another instance of the
arithmetic right shift trick.
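A sketch of this constant-time conditional subtraction (the freezing step) for a coefficient known to lie in [0, 2q):

    #include <stdint.h>

    #define Q 8380417

    static int32_t csubq(int32_t a) {
      a -= Q;
      a += (a >> 31) & Q;   /* add q back when the subtraction went negative */
      return a;
    }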

5.7 AVX2 optimized implementation


We have written an optimized implementation of Dilithium for CPUs that support the
AVX2 instruction set. Since the two most time-consuming operations are polynomial
multiplication and the expansion of the matrix and vectors, the optimized implementation
speeds up these two operations.
For polynomial multiplication, we use a vectorized version of the NTT. This NTT
achieves a full multiplication of two polynomials including three NTTs and the pointwise
multiplication in less than 5000 Haswell cycles and is about a factor of 4.5 faster than the
reference C code compiled using gcc with full machine-specific optimizations turned on.
Contrary to some other implementations (e.g. [ADPS16]), we do not use floating point
instructions. When using floating point instructions, modular reductions are easily done by
multiplying with a floating point inverse of q and rounding to get the quotient from which
the remainder can be computed with another multiplication and a subtraction. Instead
of this approach we use integer instructions only and the same Montgomery reduction
methodology as in the reference C code. When compared to the floating point NTT from
[ADPS16] applied to the Dilithium prime q = 223 − 213 + 1, our integer NTT is about two
times faster.
At any time our AVX2 optimized NTT has 32 unsigned integer coefficients, of 32 bits
each, loaded into 8 AVX2 vector registers. Each of these vector registers then contains
4 extended 64 bit coefficients. So after three levels of NTT the reduced polynomials fit
completely into these 8 registers and we can transform them to linear factors without
further loads and stores. In the second to last and last level the polynomials have degree
less than 4. This means that every polynomial fits into one register but only half of the
coefficients need to be multiplied by roots. For this reason we shuffle the vectors in order
to group together coefficients that need to be multiplied. The instructions that we use
for this task are perm2i128, vpunpcklqdq and vpunpckhqdq. The vectors are shuffled
already in levels 4 and 5 which has the advantage of cheaper loads for the constant roots of
unity. The multiplications are performed using the vpmuldq instruction. This instruction
computes a full 64 bit product of two 32 bit integers. It has a latency of 5 cycles on both
Haswell and Skylake. In each level of the NTT half of the coefficients need to be multiplied.
Therefore we can do four vector multiplications and Montgomery reductions in parallel.
This hides some of the latency of the multiplication instructions.
For faster matrix and vector expansion, we use a vectorized SHAKE implementation
that operates on 4 parallel sponges and hence can absorb and squeeze blocks in and out of
these 4 sponges at the same time. For sampling this means that up to four coefficients can
be sampled simultaneously.

5.8 Computational Efficiency


We have performed timing experiments with our reference implementation on a Skylake
CPU. The results are presented in Table 1. They include the number of CPU cycles needed
by the three operations key generation, signing and signature verification. These numbers
are the medians of 1000 executions each. Signing was performed with a message size of 32
bytes. The computer we have used is a laptop with an Intel Core i7-6600U CPU that runs
at a base clock frequency of 2600 MHz. The code was compiled with gcc 7.3.0.

6 Security Reductions
The standard security notion for digital signatures is UF-CMA security, which is security
under chosen message attacks. In this security model, the adversary gets the public key and
has access to a signing oracle to sign messages of his choice. The adversary’s goal is to come
up with a valid signature of a new message. A slightly stronger security requirement that
is sometimes useful is SUF-CMA (Strong Unforgeability under Chosen Message Attacks),
which also allows the adversary to win by producing a different signature of a message
that he has already seen.
It can be shown that in the (classical) random oracle model, Dilithium is SUF-CMA
secure based on the hardness of the standard MLWE and MSIS lattice problems. The
reduction, however, is not tight. Furthermore, since we also care about quantum attackers,
we need to consider the security of the scheme when the adversary can query the hash
function on a superposition of inputs (i.e. security in the quantum random oracle model –
QROM). Since the classical security proof uses the “forking lemma” (which is essentially
rewinding), the reduction does not transfer over to the quantum setting.
There are no counter-examples of schemes whose security is actually affected by the
non-tightness of the reduction. For example, schemes like Schnorr signatures [Sch89], GQ
signatures [GQ88], etc. all set their parameters ignoring the non-tightness of the reduction.
Furthermore, the only known uses of the additional power of quantum algorithms against
schemes whose security is based on quantum-resistant problems under a classical reduction
involve “Grover-type” algorithms that improve exhaustive search (although it has been
shown that there cannot be a “black-box” proof that the Fiat-Shamir transform is secure
in the QROM [ARU14]).
The reason that there haven’t been any attacks taking advantage of the non-tightness
of the reduction is possibly due to the fact that there is an intermediate problem which is
tightly equivalent, even under quantum reductions, to the UF-CMA security of the signature
scheme. This problem is essentially a “convolution” of the underlying mathematical
problem (such as MSIS or discrete log) with a cryptographic hash function H. It would
appear that as long as there is no relationship between the structure of the math problem
and H, solving this intermediate problem is not easier than solving the mathematical
problem.4
Below, we will introduce the assumptions upon whose hardness the SUF-CMA security
of our scheme is based. The first two assumptions, MLWE and MSIS, are standard lattice
problems which are a generalization of LWE, Ring-LWE, SIS, and Ring-SIS. The third
problem, SelfTargetMSIS is the aforementioned problem that’s based on the combined
hardness of MSIS and the hash function H. In the classical ROM, there is a (non-tight)
reduction from MSIS to SelfTargetMSIS.

6.1 Assumptions
The MLWE Problem. For integers m, k, and a probability distribution D : Rq → [0, 1],
we say that the advantage of algorithm A in solving the decisional MLWEm,k,D problem
over the ring Rq is

    Adv^MLWE_{m,k,D}(A) := | Pr[b = 1 | A ← R_q^{m×k}; t ← R_q^m; b ← A(A, t)]
                             − Pr[b = 1 | A ← R_q^{m×k}; s_1 ← D^k; s_2 ← D^m; b ← A(A, A·s_1 + s_2)] |.

4 In the ROM, there is indeed a (non-tight) reduction using the forking lemma that states that solving

this problem is as hard as solving the underlying mathematical problem.



The MSIS Problem. To an algorithm A we associate the advantage function Adv^MSIS_{m,k,γ}(A)
to solve the (Hermite Normal Form) MSIS_{m,k,γ} problem over the ring R_q as

    Adv^MSIS_{m,k,γ}(A) := Pr[ 0 < ‖y‖_∞ ≤ γ ∧ [ I | A ] · y = 0 | A ← R_q^{m×k}; y ← A(A) ].

The SelfTargetMSIS Problem. Suppose that H : {0, 1}* → B_τ is a cryptographic hash
function. To an algorithm A we associate the advantage function

    Adv^SelfTargetMSIS_{H,m,k,γ}(A) :=
        Pr[ 0 ≤ ‖y‖_∞ ≤ γ ∧ H(µ ‖ [ I | A ] · y) = c | A ← R_q^{m×k}; (y := (r, c)^T, µ) ← A^{|H(·)⟩}(A) ],

where the vector y consists of a part r and the challenge c.

Classical Hardness of SelfTargetMSIS. We now sketch the classical reduction from MSIS
to SelfTargetMSIS. In this scenario, the attacker A is classical and only has classical access
to the function H modeled as a random oracle. This reduction is a standard application of
rewinding / forking lemma, and therefore we only give a sketch.
Suppose A makes queries H(µ_1 ‖ w_1), . . . , H(µ_k ‖ w_k) and receives randomly-chosen
responses c_1, . . . , c_k, and finally outputs (µ_i, y = (r, c_i)^T) satisfying H(µ_i ‖ [ I | A ] · y) = c_i
for some 1 ≤ i ≤ k. The reduction then rewinds A to the point where he made the query
H(µ_i ‖ w_i) and reprograms the response to another randomly-chosen c′_i. The forking
lemma then states that the successful A has a ≈ 1/k chance to create a forgery on the
same i. That is, he will output (µ_i, y′ = (r′, c′_i)^T) satisfying H(µ_i ‖ [ I | A ] · y′) = c′_i. We
therefore have that [ I | A ] · (y − y′) = 0 and y − y′ ≠ 0 because c_i ≠ c′_i. Furthermore,
all the coefficients of y − y′ are small, thus giving a solution to MSIS.

6.2 Signature Scheme Security


The concrete security of Dilithium was analyzed in [KLS18], where it was shown that if H is
a quantum random oracle (i.e., a quantum-accessible perfect hash function), the advantage
of an adversary A breaking the SUF-CMA security of the signature scheme is

    Adv^SUF-CMA_Dilithium(A) ≤ Adv^MLWE_{k,ℓ,D}(B) + Adv^SelfTargetMSIS_{H,k,ℓ+1,ζ}(C) + Adv^MSIS_{k,ℓ,ζ′}(D) + 2^{−254},⁵    (6)

for D a uniform distribution over S_η, and

    ζ = max{γ1 − β, 2γ2 + 1 + 2^{d−1} · τ},    (7)

    ζ′ = max{2(γ1 − β), 4γ2 + 2}.    (8)


Furthermore, if the running times and success probabilities (i.e. advantages) of A, B, C, D
are t_A, t_B, t_C, t_D and ε_A, ε_B, ε_C, ε_D, then the lower bound on t_A/ε_A is within a small multiplicative
factor of min t_i/ε_i for i ∈ {B, C, D}.
Intuitively, the MLWE assumption is needed to protect against key-recovery, the
SelfTargetMSIS is the assumption upon which new message forgery is based, and the MSIS
assumption is needed for strong unforgeability. We will now sketch some parts of the
security proof that are relevant to the concrete parameter setting.
5 To simplify the concrete security bound, we assume that ExpandA produces a uniform matrix A ∈ R_q^{k×ℓ},
ExpandMask(K, ·) is a perfect pseudo-random function, and CRH is a perfect collision-resistant hash
function.

6.2.1 UF-CMA Security Sketch


It was shown in [KLS18] that for zero-knowledge deterministic signature schemes, if an
adversary having quantum access to H and classical access to a signing oracle can produce a
forgery of a new message, then there is also an adversary who can produce a forgery without
access to the signing oracle (so he only gets the public key).6 The latter security model is
called UF-NMA – unforgeability under no-message attack. By the MLWE assumption, the
public key (A, t = As1 + s2 ) is indistinguishable from (A, t) where t is chosen uniformly
at random. The proof that our signature scheme is zero-knowledge is fairly standard and
follows the framework from [Lyu09, Lyu12, BG14]. It is formally proved in [KLS18]7 and
we sketch the proof in Appendix B.
If we thus assume that the MLWEk,`,D problem is hard, where D is the distribution that
samples a uniform integer in the range [−η, η], then to prove UF-NMA security, we only
need to analyze the hardness of the experiment where the adversary receives a random
(A, t) and then needs to output a valid message/signature pair M , (z, h, c) such that

• ‖z‖_∞ < γ1 − β

• H(µ ‖ UseHint_q(h, Az − c·t_1 · 2^d, 2γ2)) = c

• # of 1's in h is ≤ ω

Lemma 1 implies that one can rewrite

    2γ2 · UseHint_q(h, Az − c·t_1 · 2^d, 2γ2) = Az − c·t_1 · 2^d + u,    (9)

where ‖u‖_∞ ≤ 2γ2 + 1. Furthermore, only ω coefficients of u will have magnitude greater
than γ2. If we write t = t_1 · 2^d + t_0 where ‖t_0‖_∞ ≤ 2^{d−1}, then we can rewrite Eq. (9) as

    Az − c·t_1 · 2^d + u = Az − c(t − t_0) + u = Az − ct + (c·t_0 + u) = Az − ct + u′.    (10)

Note that the worst-case upper bound for u′ is

    ‖u′‖_∞ ≤ ‖c·t_0‖_∞ + ‖u‖_∞ ≤ ‖c‖_1 · ‖t_0‖_∞ + ‖u‖_∞ ≤ τ · 2^{d−1} + 2γ2 + 1.

Thus a (quantum) adversary who is successful at creating a forgery of a new message is
able to find z, c, u′, M such that ‖z‖_∞ < γ1 − β, ‖c‖_∞ = 1, ‖u′‖_∞ ≤ τ · 2^{d−1} + 2γ2 + 1,
and M ∈ {0, 1}* such that

    H( µ ‖ (1/(2γ2)) · [ A | t | I_k ] · (z, c, u′)^T ) = c.    (11)

For simplicity of notation, we can define the function H′ such that H(µ ‖ x) = H′(µ ‖
2γ2·x), and so Eq. (11) becomes

    H′( µ ‖ [ A | t | I_k ] · (z, c, u′)^T ) = c.⁸    (12)
6 It was also shown in [KLS18] that the “deterministic” part of the requirement can be relaxed. The

security proof simply loses a factor of the number of different signatures produced per message in its
tightness. Thus, for example, if one were to implement the signature scheme (with the same secret key) on
several devices with different random-number generators, the security of the scheme would not be affected
much.
7 In that paper, it is actually proved that the underlying zero-knowledge proof is zero-knowledge and

then the security of the signature scheme follows via black box transformations.
8 Notice that the difference between H and H′ is just a change in the format of the input.

Since (A, t) is completely random, this is exactly the definition of the SelfTargetMSIS
problem from above.
A standard forking lemma argument sketched in Section 6.1 can be used to show that
an adversary solving the above problem in the (standard) random oracle model can be
used to solve the MSIS problem. While giving a reduction using the forking lemma is a
good “sanity check”, it is not particularly useful for setting parameters due to its lack of
tightness. So how does one set parameters? The Fiat-Shamir transform has been used
for over 3 decades (and we have been aware of the non-tightness of the forking lemma for
two of them), yet the parameter settings for schemes employing it have ignored this loss
in tightness. Implicitly, therefore, these schemes rely on the exact hardness of analogues
(based on various assumptions such as discrete log [Sch89], one-wayness of RSA [GQ88],
etc.) of the problem in Eq. (12).
The intuition for the security of the problem in Eq. (12) (and its discrete log, RSA,
etc. analogues) is as follows: since H is a cryptographic hash function whose structure is
completely independent of the algebraic structure of its inputs, choosing some µ "strategically"
should not help – so the problem would be equally hard if the µ were fixed. Then,
again relying on the independence of H′ and the algebraic structure of its inputs, the only
approach for obtaining a solution appears to be picking some w, computing H′(µ ‖ w) = c,
and then finding z, u′ such that Az + u′ = w + ct.⁹ The hardness of finding such z, u′
with ℓ_∞-norms less than the bound ζ in Eq. (7) such that

    Az + u′ = t′    (13)

for some t′ is the problem whose concrete security we will be analyzing. Note that this
bound is somewhat conservative because in Eq. (12) ‖z‖_∞ < γ1 − β ≪ τ · 2^{d−1} + 2γ2 + 1.
Furthermore, only ω coefficients of u′ can be larger than τ · 2^{d−1} + γ2. Thus most of
the coefficients of the solution (z, u′) to Eq. (13) generated by a real forgery must have
coefficients noticeably smaller than the maximum in Eq. (7).

6.2.2 The Addition of the Strong Unforgeability Property


To handle the strong-unforgeability requirement, one needs to handle an additional case.
Intuitively, the reduction from UF-CMA to UF-NMA used the fact that a forgery of a
new message will necessarily require the use of a challenge c for which the adversary has
never seen a valid signature (i.e., (z, h, c) was never an output by the signing oracle). To
prove strong-unforgeability, we also have to consider the case where the adversary sees a
signature (z, h, c) for µ and then only changes (z, h). In other words, the adversary ends
up with two valid signatures such that

    UseHint_q(h, Az − c·t_1 · 2^d, 2γ2) = UseHint_q(h′, Az′ − c·t_1 · 2^d, 2γ2).

By Lemma 1, the above equality can be shown to imply that there exist z with ‖z‖_∞ ≤ 2(γ1 − β)
and u with ‖u‖_∞ ≤ 4γ2 + 2 such that Az + u = 0.

6.3 Concrete Security Analysis


In Appendix C, we describe the best known lattice attacks against the problems in Eq. (6)
upon which the security of our signature scheme is based. The best attacks involve
finding short vectors in some lattice. The main difference between the MLWE and MSIS
problems is that the MLWE problem involves finding a short vector in a lattice in which
an “unusually short” vector exists. The MSIS problem, on the other hand, involves just
finding a short vector in a random lattice. In knapsack terminology, the MLWE problem is
a low-density knapsack, while MSIS is a high-density knapsack instance. The analysis for
9 This is indeed the intuition in the proof sketch in the classical random oracle model in Section 6.1.

the two instances is slightly different and we analyze the MLWE problem in Appendix C.2
and the MSIS problem (as well as SelfTargetMSIS) in Appendix C.3.
We first follow the core-SVP hardness methodology from [ADPS16], based on the
asymptotic costs of sieving [BDGL16a, Laa15]. It is significantly more conservative than
prior ones used in lattice-based cryptography. We describe how we use it for MLWE and
MSIS in Sections C.2 and C.3. However, recent progress in practical sieving seems to
mandate more refined estimates; we sketch such refinements in Section C.5 (the full details
are provided in the Kyber Round 3 specification document). It expresses security in terms
of the number of gates required for the computation, in the RAM model; because the
algorithm of [ADPS16] makes random access to an exponentially large amount of memory,
it is quite likely that any realistic implementation will be significantly more expensive.
While the MLWE and MSIS problems are defined over polynomial rings, we do not
currently have any way of exploiting this ring structure, and therefore the best attacks are
mounted by simply viewing these problems as LWE and SIS problems. The LWE and SIS
problems are exactly as in the definitions of MLWE and MSIS in Section 6.1 with the ring
Rq being replaced by Zq .

6.4 Conservative Choices for the MSIS Problem.


The security of Dilithium is based on the hardness of LWE and SIS-type problems. Since lattice-
based encryption schemes are based solely on the LWE problem and its derivatives with
added structure (e.g. MLWE, NTRU, etc.), it's natural that LWE-type problems will receive
more attention in the future. While both problems are quite similar in that they require
more attention in the future. While both problems are quite similar in that they require
finding short vectors in a particular lattice and, furthermore, the best algorithms have
the same core component, we believe it makes sense to be a little more conservative with
the parameters on the SIS side. We have already been quite conservative in discounting
the fact that only very few (i.e. ω) coefficients of the solution can be at the maximum
`∞ -length and the fact that we lost an additive factor of τ · 2d−1 because we wanted to
obtain a reduction from the inhomogeneous SIS problem with the target t rather than t1 ,
even though we don’t know of any improved attack against the latter.
There is one more place in which we could have been even more conservative against
attacks that are not known to exist. The classical reduction from SelfTargetMSIS to MSIS
sketched in section 6.1 increases the `∞ -norm of the extracted solution by a factor of 2.
More specifically, the length of the vector yi − yi0 is twice the length of the ζ in Eq. (7).
There are no known algorithms that take advantage of this, and we did not take this
reduction loss into account when computing the hardness of the SIS problem. For the
sake of added conservatism, we now give the hardness of SIS assuming that there is
indeed some possible exploit of this factor loss. Instead of classical Core-SVP SIS security of
124, 186, and 265 for NIST levels 2, 3, and 5, one would have corresponding security of
115, 172, and 256. So even in this ultra-conservative case, there is still ample SIS security
in our parameter sets.

6.5 Changing Security Levels


The most straightforward way of raising/lowering the security of Dilithium is by changing
the values of (k, `) and then adapting the value of η (and then β and ω) accordingly as in
Table 2. Increasing (k, `) by 1 each results in the public key increasing by ≈ 300 bytes and
the signature by ≈ 700 bytes; and increases security by ≈ 30 bits.
A different manner in which to increase security would be by lowering the values of γ1
and/or γ2 . This would make forging signatures (whose hardness is based on the underlying
SIS problem) more difficult. Rather than increasing the size of the public key / signature,
the negative effect of lowering the γi is that signing would require more repetitions. One
could similarly increase the value of η in order to make the LWE problem harder at the
expense of more repetitions.10 Because the increase in running time is rather dramatic
(e.g. halving both γi would end up squaring the number of required repetitions as per
Eq. (5)), we recommend increasing (k, `) when needing to “substantially” increase security.
Changing the γi should be reserved only for slight “tweaks” in the security levels.

References
[ABB+ 19] Erdem Alkim, Paulo S. L. M. Barreto, Nina Bindel, Patrick Longa, and
Jefferson E. Ricardini. The lattice-based digital signature scheme qtesla.
Cryptology ePrint Archive, Report 2019/085, 2019. https://eprint.iacr.
org/2019/085. 5

[ADH+ 19] Martin R Albrecht, Léo Ducas, Gottfried Herold, Elena Kirshanova, Ea-
monn W Postlethwaite, and Marc Stevens. The general sieve kernel and
new records in lattice reduction. In Annual International Conference on
the Theory and Applications of Cryptographic Techniques, pages 717–746.
Springer, 2019. 37

[ADPS16] Erdem Alkim, Léo Ducas, Thomas Pöppelmann, and Peter Schwabe. Post-
quantum key exchange – a new hope. In Proceedings of the 25th USENIX
Security Symposium, pages 327–343. USENIX Association, 2016. http://
cryptojedi.org/papers/#newhope. 21, 22, 27, 35, 37, 38

[AG11] Sanjeev Arora and Rong Ge. New algorithms for learning in presence of
errors. In ICALP (1), pages 403–415, 2011. 36

[AGPS19] Martin R Albrecht, Vlad Gheorghiu, Eamonn W Postlethwaite, and John M


Schanck. Estimating quantum speedups for lattice sieves. Technical report,
Cryptology ePrint Archive, Report 2019/1161, 2019. 37, 38

[AGVW17] Martin R Albrecht, Florian Göpfert, Fernando Virdia, and Thomas Wunderer.
Revisiting the expected cost of solving uSVP and applications to LWE.
In Tsuyoshi Takagi and Thomas Peyrin, editors, Advances in Cryptology –
ASIACRYPT 2017, volume 10211 of LNCS, pages 65–102. Springer, 2017.
https://eprint.iacr.org/2017/815. 37

[APS15] Martin R. Albrecht, Rachel Player, and Sam Scott. On the concrete hardness
of learning with errors. J. Mathematical Cryptology, 9(3):169–203, 2015.
https://eprint.iacr.org/2015/046. 37

[ARU14] Andris Ambainis, Ansis Rosmanis, and Dominique Unruh. Quantum attacks
on classical proof systems: The hardness of quantum rewinding. In FOCS,
pages 474–483, 2014. 23

[BDGL16a] Anja Becker, Léo Ducas, Nicolas Gama, and Thijs Laarhoven. New directions
in nearest neighbor searching with applications to lattice sieving. In SODA ’16
Proceedings of the twenty-seventh annual ACM-SIAM symposium on Discrete
Algorithms, pages 10–24. SIAM, 2016. https://eprint.iacr.org/2015/
1128. 27, 37

[BDGL16b] Anja Becker, Léo Ducas, Nicolas Gama, and Thijs Laarhoven. New directions
in nearest neighbor searching with applications to lattice sieving. In SODA,
pages 10–24. SIAM, 2016. 33
10 The two changes in this paragraph may require lowering d so as to keep the value ‖c·t_0‖_∞ smaller
than γ2 with high probability. Lowering d will increase the size of the public key.

[BG14] Shi Bai and Steven D. Galbraith. An improved compression technique for
signatures based on learning with errors. In CT-RSA, pages 28–47, 2014. 3,
5, 25, 33

[BHLY16] Leon Groot Bruinderink, Andreas Hülsing, Tanja Lange, and Yuval Yarom.
Flush, gauss, and reload - A cache attack on the BLISS lattice-based signature
scheme. In CHES, pages 323–345, 2016. 3

[BKW03] Avrim Blum, Adam Kalai, and Hal Wasserman. Noise-tolerant learning, the
parity problem, and the statistical query model. J. ACM, 50(4):506–519,
2003. 36

[BN06] Mihir Bellare and Gregory Neven. Multi-signatures in the plain public-key
model and a general forking lemma. In ACM Conference on Computer and
Communications Security, pages 390–399, 2006. 6

[BS16] Jean-François Biasse and Fang Song. Efficient quantum algorithms for com-
puting class groups and solving the principal ideal problem in arbitrary degree
number fields. In SODA, pages 893–902. SIAM, 2016. 36

[CDPR16] Ronald Cramer, Léo Ducas, Chris Peikert, and Oded Regev. Recovering
short generators of principal ideals in cyclotomic rings. In EUROCRYPT (2),
volume 9666 of Lecture Notes in Computer Science, pages 559–585. Springer,
2016. 36

[CDW17] Ronald Cramer, Léo Ducas, and Benjamin Wesolowski. Short Stickelberger
class relations and application to ideal-SVP. In EUROCRYPT (1), volume
10210 of Lecture Notes in Computer Science, pages 324–348, 2017. 36

[CGS14] Peter Campbell, Michael Groves, and Dan Shepherd. Soliloquy: A cautionary
tale. In ETSI 2nd Quantum-Safe Crypto Workshop, pages 1–9, 2014. 36

[CN11a] Yuanmi Chen and Phong Q. Nguyen. BKZ 2.0: Better lattice security
estimates. In ASIACRYPT, volume 7073 of Lecture Notes in Computer
Science, pages 1–20. Springer, 2011. 33

[CN11b] Yuanmi Chen and Phong Q. Nguyen. BKZ 2.0: Better lattice security esti-
mates. In Dong Hoon Lee and Xiaoyun Wang, editors, Advances in Cryptology
– ASIACRYPT 2011, volume 7073 of LNCS, pages 1–20. Springer, 2011. http:
//www.iacr.org/archive/asiacrypt2011/70730001/70730001.pdf. 37

[DDLL13] Léo Ducas, Alain Durmus, Tancrède Lepoint, and Vadim Lyubashevsky.
Lattice signatures and bimodal gaussians. In CRYPTO (1), pages 40–56,
2013. 3

[DFMS19] Jelle Don, Serge Fehr, Christian Majenz, and Christian Schaffner. Security of
the fiat-shamir transformation in the quantum random-oracle model. Cryp-
tology ePrint Archive, Report 2019/190, 2019. https://eprint.iacr.org/
2019/190. 7

[DKL+ 18] Léo Ducas, Eike Kiltz, Tancrède Lepoint, Vadim Lyubashevsky, Peter
Schwabe, Gregor Seiler, and Damien Stehlé. Crystals-dilithium: A lattice-
based digital signature scheme. IACR Trans. Cryptogr. Hardw. Embed. Syst.,
2018(1):238–268, 2018. 7, 17

[DLP14] Léo Ducas, Vadim Lyubashevsky, and Thomas Prest. Efficient identity-based
encryption over NTRU lattices. In ASIACRYPT, pages 22–41, 2014. 3

[DSDGR20] Dana Dachman-Soled, Léo Ducas, Huijing Gong, and Mélissa Rossi. Lwe with
side information: Attacks and concrete security estimation. 2020. https:
//eprint.iacr.org/2020/292.pdf. 37, 38

[Duc18] Léo Ducas. Shortest vector from lattice sieving: a few dimensions for free. In
Annual International Conference on the Theory and Applications of Crypto-
graphic Techniques, pages 125–145. Springer, 2018. 37, 38

[EFGT17] Thomas Espitau, Pierre-Alain Fouque, Benoît Gérard, and Mehdi Tibouchi.
Side-channel attacks on BLISS lattice-based signatures: Exploiting branch
tracing against strongswan and electromagnetic emanations in microcon-
trollers. In CCS, pages 1857–1874, 2017. 3

[GLP12] Tim Güneysu, Vadim Lyubashevsky, and Thomas Pöppelmann. Practical


lattice-based cryptography: A signature scheme for embedded systems. In
CHES, pages 530–547, 2012. 3

[GQ88] Louis C. Guillou and Jean-Jacques Quisquater. A "paradoxical" indentity-


based signature scheme resulting from zero-knowledge. In CRYPTO, pages
216–231, 1988. 23, 26

[HPS11] Guillaume Hanrot, Xavier Pujol, and Damien Stehlé. Algorithms for the
shortest and closest lattice vector problems. In IWCC, volume 6639 of Lecture
Notes in Computer Science, pages 159–190. Springer, 2011. 33

[KF17] Paul Kirchner and Pierre-Alain Fouque. Revisiting lattice attacks on over-
stretched NTRU parameters. In EUROCRYPT (1), volume 10210 of Lecture
Notes in Computer Science, pages 3–26, 2017. 36

[KLS18] Eike Kiltz, Vadim Lyubashevsky, and Christian Schaffner. A concrete treat-
ment of fiat-shamir signatures in the quantum random-oracle model. In
EUROCRYPT, pages 552–586, 2018. 5, 6, 7, 17, 24, 25

[Knu97] Donald Knuth. The Art of Computer Programming, volume 2. Addison-Wesley,


3 edition, 1997. page 145. 10

[Laa15] Thijs Laarhoven. Search problems in cryptography. PhD thesis, Eindhoven


University of Technology, 2015. 27, 33

[Lyu09] Vadim Lyubashevsky. Fiat-Shamir with aborts: Applications to lattice and


factoring-based signatures. In ASIACRYPT, pages 598–616, 2009. 3, 25

[Lyu12] Vadim Lyubashevsky. Lattice signatures without trapdoors. In EUROCRYPT,


pages 738–755, 2012. 3, 6, 7, 25, 33

[LZ19] Qipeng Liu and Mark Zhandry. Revisiting post-quantum fiat-shamir. Cryp-
tology ePrint Archive, Report 2019/262, 2019. https://eprint.iacr.org/
2019/262. 7

[MLB17] Artur Mariano, Thijs Laarhoven, and Christian Bischof. A parallel variant
of LDSieve for the SVP on lattices. In 2017 25th Euromicro International
Conference on Parallel, Distributed and Network-based Processing (PDP),
pages 23–30. IEEE, 2017. 37

[PBY17] Peter Pessl, Leon Groot Bruinderink, and Yuval Yarom. To BLISS-B or not
to be: Attacking strongswan’s implementation of post-quantum signatures.
In CCS, pages 1843–1855, 2017. 3

[PS00] David Pointcheval and Jacques Stern. Security arguments for digital signatures
and blind signatures. J. Cryptology, 13(3):361–396, 2000. 6
[PSS+ 18] Damian Poddebniak, Juraj Somorovsky, Sebastian Schinzel, Manfred Lochter,
and Paul Rösler. Attacking deterministic signature schemes using fault attacks.
In EuroS&P, pages 338–352, 2018. 5, 14
[SBB+ 18] Niels Samwel, Lejla Batina, Guido Bertoni, Joan Daemen, and Ruggero
Susella. Breaking ed25519 in wolfssl. In CT-RSA, pages 1–20, 2018. 5, 14
[Sch89] Claus-Peter Schnorr. Efficient identification and signatures for smart cards.
In CRYPTO, pages 239–252, 1989. 23, 26
[SE94] Claus-Peter Schnorr and M. Euchner. Lattice basis reduction: Improved
practical algorithms and solving subset sum problems. Math. Program.,
66:181–199, 1994. 33
[YD17] Yang Yu and Léo Ducas. Second order statistical behavior of lll and bkz.
In International Conference on Selected Areas in Cryptography, pages 3–22.
Springer, 2017. 37

A Proofs for Rounding Algorithm Properties


The three lemmas below prove each of the three parts of Lemma 1.
Lemma 4. Let r, z ∈ Z_q with ‖z‖_∞ ≤ α/2. Then

    UseHint_q(MakeHint_q(z, r, α), r, α) = HighBits_q(r + z, α).

Proof. The output of Decompose_q is an integer r_1 such that 0 ≤ r_1 < (q − 1)/α and
another integer r_0 such that ‖r_0‖_∞ ≤ α/2. Because ‖z‖_∞ ≤ α/2, the integer v_1 :=
HighBits_q(r + z, α) either stays the same as r_1 or becomes r_1 ± 1 modulo m = (q − 1)/α.
More precisely, if r_0 > 0, then −α/2 < r_0 + z ≤ α. This implies that v_1 is either r_1 or
r_1 + 1 mod m. If r_0 ≤ 0, then we have −α ≤ r_0 + z ≤ α/2. In this case, we have v_1 = r_1
or r_1 − 1 mod m.
The MakeHint_q routine checks whether r_1 = v_1 and outputs 0 if this is so, and 1 if
r_1 ≠ v_1. The UseHint_q routine uses the "hint" h to either output r_1 (if h = 0) or, depending
on whether r_0 > 0 or not, output either r_1 + 1 mod^+ m or r_1 − 1 mod^+ m.
The lemma below shows that r is not too far away from the output of the UseHintq
algorithm. This will be necessary for the security of the scheme.
Lemma 5. Let (h, r) ∈ {0, 1} × Z_q and let v_1 = UseHint_q(h, r, α). If h = 0, then
‖r − v_1 · α‖_∞ ≤ α/2; else ‖r − v_1 · α‖_∞ ≤ α + 1.
Proof. Let (r1 , r0 ) := Decomposeq (r, α). We go through all three cases of the UseHintq
procedure.
Case 1 (h = 0): We have v1 = r1 and

r − v1 · α = r1 · α + r0 − r1 · α = r0 ,

which by definition has absolute value at most α/2.


Case 2 (h = 1 and r0 > 0): We have v1 = r1 + 1 − κ · (q − 1)/α for κ = 0 or 1. Thus

r − v1 · α = r1 · α + r0 − (r1 + 1 − κ · (q − 1)/α) · α
= −α + r0 + κ · (q − 1).

After centered reduction modulo q, the latter has magnitude ≤ α.


Case 3 (h = 1 and r0 ≤ 0): We have v1 = r1 − 1 + κ · (q − 1)/α for κ = 0 or 1. Thus

r − v1 · α = r1 · α + r0 − (r1 − 1 + κ · (q − 1)/α) · α
= α + r0 − κ · (q − 1).

After centered reduction modulo q, the latter quantity has magnitude ≤ α + 1.


The next lemma will play a role in proving the strong existential unforgeability of
our signature scheme. It states that two different hints h, h′ cannot lead to UseHint_q(h, r, α) =
UseHint_q(h′, r, α).
Lemma 6. Let r ∈ Z_q and h, h′ ∈ {0, 1}. If UseHint_q(h, r, α) = UseHint_q(h′, r, α), then
h = h′.
Proof. Note that UseHint_q(0, r, α) = r_1 and UseHint_q(1, r, α) is equal to (r_1 ± 1) mod^+ (q −
1)/α. Since (q − 1)/α ≥ 2, we have that r_1 ≠ (r_1 ± 1) mod^+ (q − 1)/α.
We now prove Lemma 2 and Lemma 3.
Proof. (Of Lemma 2) We prove the lemma for integers, rather than vectors of polynomials,
since the HighBits function works independently on each coefficient. If ‖LowBits_q(r, α)‖_∞ <
α/2 − β, then r = r_1 · α + r_0 where −α/2 + β < r_0 < α/2 − β. Then r + s = r_1 · α + (r_0 + s)
and −α/2 < r_0 + s ≤ α/2. Therefore (r + s) mod^± α = r_0 + s, and thus

    (r + s) − ((r + s) mod^± α) = r_1 · α = r − (r mod^± α),

and the claim in the Lemma follows.


Proof. (Of Lemma 3) We first prove the forward direction. By the same proof as in Lemma 2,
since ‖r_0 + s‖_∞ < α/2, we know that w_1 = r_1. We can therefore write (r + s) = r_1 · α + r_0 + s.
Since ‖r_0 + s‖_∞ < α, we know that

    w_0 = LowBits_q(r + s, α) = (r_1 · α + r_0 + s) mod^± α = r_0 + s.

To prove the other direction, write

r1 · α + r0 + s = r + s = w1 · α + w0 = r1 · α + w0 ,

and therefore r0 + s = w0 .

B Zero-Knowledge Proof
The security of our scheme does not rely on the part of the public key t0 being secret and
so we will be assuming that the public key is t rather than t1 .
We want to first compute the probability that some particular (z, c) is generated in
Step 19 taken over the randomness of y and the random oracle H which is modeled as a
random function. We have

Pr[z, c] = Pr[c] · Pr[y = z − cs1 | c].

Whenever z has all its coefficients less than γ1 − β then the above probability is
exactly the same for every such tuple (z, c). This is because ‖c·s_i‖_∞ ≤ β, and thus
‖z − c·s_1‖_∞ ≤ γ1 − 1, which is a valid value of y. Therefore, if we only output z when all
its coefficients have magnitudes less than γ1 − β, then the resulting distribution will be
uniformly random over S^ℓ_{γ1−β−1} × B_τ.
The simulation of the signature follows [Lyu12, BG14]. The simulator picks a uniformly
random (z, c) in S^ℓ_{γ1−β−1} × B_τ, after which it also makes sure that

    ‖r_0‖_∞ = ‖LowBits_q(w − c·s_2, 2γ2)‖_∞ < γ2 − β.

By Equation (2), we know that w − c·s_2 = Az − ct, and therefore the simulator can
perfectly simulate this as well.
If z does indeed satisfy ‖LowBits_q(w − c·s_2, 2γ2)‖_∞ < γ2 − β, then as long as ‖c·s_2‖_∞ ≤ β
(which is how we set β), we will have

    r_1 = HighBits_q(w − c·s_2, 2γ2) = HighBits_q(w, 2γ2) = w_1.

We can then program

    H(µ ‖ w_1) ← c.

Unless we have already set the value of H(µ ‖ w_1) to something else (which only happens
with negligible probability), the resulting pair (z, c) has the same distribution as in a
genuine signature of µ.
All the other steps (after Step 21) of the signing algorithm are performed using public
information and are therefore simulatable.

C Concrete Security
C.1 Lattice Reduction and Core-SVP Hardness
The best known algorithm for finding very short non-zero vectors in Euclidean lattices is
the Block–Korkine–Zolotarev algorithm (BKZ) [SE94], proposed by Schnorr and Euchner
in 1991. More recently, it was proven to quickly converge to its fix-point [HPS11] and
improved in practice [CN11a]. Yet, what it achieves asymptotically remains unchallenged.
BKZ with block-size b makes calls to an algorithm that solves the Shortest lattice
Vector Problem (SVP) in dimension b. The security of our scheme relies on the necessity
to run BKZ with a large block-size b and the fact that the cost of solving SVP is
exponential in b. The best known classical SVP solver [BDGL16b] runs in time ≈ 2^{c_C·b}
with c_C = log2 √(3/2) ≈ 0.292. The best known quantum SVP solver [Laa15, Sec. 14.2.10]
runs in time ≈ 2^{c_Q·b} with c_Q = log2 √(13/9) ≈ 0.265. One may hope to improve these run-
times, but going below ≈ 2^{c_P·b} with c_P = log2 √(4/3) ≈ 0.2075 would require a theoretical
breakthrough. Indeed, the best known SVP solvers rely on covering the b-dimensional
sphere with cones of center-to-edge angle π/3: this requires 2^{c_P·b} cones. The subscripts C,
Q, P respectively stand for Classical, Quantum and Paranoid.
The strength of BKZ increases with b. More concretely, given as input a basis (c1 , . . . , cn )
of an n-dimensional lattice, BKZ repeatedly uses the b-dimensional SVP-solver on lattices
of the form (ci+1 (i), . . . , cj (i)) where i ≤ n, j = min(n, i + b) and where ck (i) denotes
the projection of ck orthogonally to the vectors (c1 , . . . , ci ). The effect of these calls
is to flatten the curve of the ℓ_i = log2 ‖c_i(i − 1)‖'s (for i = 1, . . . , n). At the start of
the execution, the ℓ_i's typically decrease fast, at least locally. As BKZ preserves the
determinant of the c_i's, the sum of the ℓ_i's remains constant throughout the execution,
and after a (small) polynomial number of SVP calls, BKZ has made the ℓ_i's decrease less.
It can be heuristically estimated that for sufficiently large b, the local slope of the ℓ_i's
converges to

    slope(b) = 1/(b − 1) · log2( (b/(2πe)) · (π·b)^{1/b} ),

unless the local input ℓ_i's are already too small or too large. The quantity slope(b) decreases
with b, implying that the larger b the flatter the output ℓ_i's.
[Figure 11 plots omitted: Gram-Schmidt log-lengths ℓ_i before reduction, and after b-BKZ with small, medium, and large blocksize b.]

Figure 11: Evolution of Gram-Schmidt length in log-scale under BKZ reduction for various
blocksizes. The area under the curves remains constant, and the slope in Zone 2 decreases
with the blocksize. Note that Zone 3 may disappear before Zone 1, depending on the shape
of the input basis.

In our case, the input `i ’s are of the following form (cf. Fig. 11). The first ones
are all equal to log2 q and the last ones are all equal to 0. BKZ will flatten the jump,
decreasing `i ’s with small i’s and increasing `i ’s with large i’s. However, the local slope
slope(b) may not be sufficiently small to make the very first `i ’s decrease and the very
last ℓ_i's increase. Indeed, BKZ will not decrease (resp. increase) some ℓ_i's if these are
already smaller (resp. larger) than ensured by the local slope guarantee. In our case, the
`i ’s are always of the following form at the end of the execution:
• The first `i ’s are constant equal to log2 q (this is the possibly empty Zone 1).
• Then they decrease linearly, with slope slope(b) (this is the never-empty Zone 2).
• The last `i ’s are constant equal to 0 (this is the possibly empty Zone 3).
The graph is continuous, i.e., if Zone 1 (resp. Zone 3) is not empty, then Zone 2 starts
with `i = log2 q (resp. ends with `i = 0).

C.2 Solving MLWE


Any MLWE_{ℓ,k,D} instance for some distribution D can be viewed as an LWE instance of
dimensions 256·ℓ and 256·k. Indeed, the above can be rewritten as finding vec(s_1), vec(s_2) ∈
Z^{256·ℓ} × Z^{256·k} from (rot(A), vec(t)), where vec(·) maps a vector of ring elements to
the vector obtained by concatenating the coefficients of its coordinates, and rot(A) ∈
Z_q^{256·k × 256·ℓ} is obtained by replacing all entries a_{i,j} ∈ R_q of A by the 256 × 256 matrix
whose z-th column is vec(x^{z−1} · a_{i,j}).
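To make the rot(·) map concrete, the following sketch builds the 256 × 256 negacyclic matrix of a single ring element a ∈ R_q, with coefficients stored as standard representatives in [0, q). Indexing is 0-based here, so column z holds vec(x^z · a).

    #include <stdint.h>

    #define N 256
    #define Q 8380417

    /* M becomes the matrix of multiplication by a in Z_q[x]/(x^N + 1),
       i.e. M * vec(b) = vec(a*b) for any b. */
    void rot(int32_t M[N][N], const int32_t a[N]) {
      for (int z = 0; z < N; ++z)
        for (int i = 0; i < N; ++i) {
          int j = i - z;                 /* coefficient of x^i in x^z * a           */
          M[i][z] = (j >= 0) ? a[j]
                             : (Q - a[j + N]) % Q;  /* wrap-around picks up a sign  */
        }
    }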


Given an LWE instance, there are two lattice-based attacks. The primal attack and
the dual attack. Here, the primal attack consists in finding a short non-zero vector in the
lattice Λ = {x ∈ Zd : Mx = 0 mod q} where M = (rot(A)[1:m] |Im |vec(t)[1:m] ) is an m × d
matrix where d = 256·ℓ + m + 1 and m ≤ 256·k. Indeed, it is sometimes not optimal to
use all the given equations in lattice attacks.
We tried all possible numbers m of rows, and, for each trial, we increased the blocksize b
until the value ℓ_{d−b} obtained as explained above was deemed sufficiently large. As explained
in [ADPS16, Sec. 6.3], if 2^{ℓ_{d−b}} is greater than the expected norm of (vec(s_1), vec(s_2)) after
projection orthogonally to the first d − b vectors, it is likely that the MLWE solution can
be easily extracted from the BKZ output.
The dual attack consists in finding a short non-zero vector in the lattice Λ′ = {(x, y) ∈
Z^m × Z^d : M^T x + y = 0 mod q}, where M = rot(A)_{[1:m]} is an m × d matrix with d = 256·ℓ
and m ≤ 256·k. Again, for each value of m, we increased the value of b until the
value ℓ_1 obtained as explained above was deemed sufficiently small according to the analysis
of [ADPS16, Sec. 6.3].

C.3 Solving MSIS and SelfTargetMSIS


As per the discussion in Section 6.2.1, the best known attack against the SelfTargetMSIS
problem involves either breaking the security of H or solving the problem in Eq. (13). The
latter amounts to solving the MSIS_{k,ℓ+1,ζ} problem for the matrix [ A | t′ ].¹¹
Note that the MSIS instance can be mapped to a SIS_{256·k, 256·(ℓ+1), ζ} instance by consid-
ering the matrix rot(A | t′) ∈ Z_q^{256·k × 256·(ℓ+1)}. The attack against the MSIS_{k,ℓ,ζ′} instance
in Eq. (6) can similarly be mapped to a SIS_{256·k, 256·ℓ, ζ′} instance by considering the matrix
rot(A) ∈ Z_q^{256·k × 256·ℓ}. The attacker may consider a subset of w columns, and let the
solution coefficients corresponding to the dismissed columns be zero.
Remark 1. An unusual aspect here is that we are considering the infinity norm, rather
than the Euclidean norm. Further, for our specific parameters, the Euclidean norms of the
solutions are above q. In particular, the vector (q, 0, . . . , 0)^T belongs to the lattice, has
Euclidean norm below that of the solution, but its infinity norm is above the requirement.
This raises difficulties in analyzing the strength of BKZ towards solving our infinity norm
SIS instances: indeed, even with small values of b, the first `i ’s are short (they correspond
to q-vectors), even though they are not solutions.
For each number w of selected columns and for each value of b, we compute the
estimated BKZ output ℓ_i's, as explained above. We then consider the smallest i such that
ℓ_i is below log2 q and the largest j such that ℓ_j is above 0. These correspond to the vectors
that were modified by BKZ, with smallest and largest indices, respectively. In fact, for
the same cost as a call to the SVP-solver, we can obtain (4/3)^{b/2} vectors with Euclidean
norm ≈ 2^{ℓ_i} after projection orthogonally to the first i − 1 basis vectors. Now, let us look
closely at the shape of such a vector. As the first i − 1 basis vectors are the first i − 1
canonical unit vectors multiplied by q, projecting orthogonally to these consists in zeroing
the first i − 1 coordinates. The remaining w − i + 1 coordinates have total Euclidean
norm ≈ 2^{ℓ_i} ≈ q, and the last w − j coordinates are 0. We heuristically assume that these
coordinates have similar magnitudes σ ≈ 2^{ℓ_i}/√(j − i + 1); we model each such coordinate
as a Gaussian of standard deviation σ. We assume that each one of our (4/3)^{b/2} vectors
has its first i − 1 coordinates independently uniformly distributed modulo q, and finally
compute the probability that all coordinates in both ranges [0, i − 1] and [i, j] are less than
B in absolute value. Our cost estimate is the inverse of that probability multiplied by the
run-time of our b-dimensional SVP-solver.

Forgetting q-vectors. For all the parameter sets in Table 1 and Table 2, the best
parametrization of the attack above kept the basis in a shape with a non-trivial Zone 1.
We note that the coordinates in this range have a quite lower probability of passing the
`∞ constraint than coordinates in Zone 2. We therefore considered a strategy consisting
of “forgetting” the q-vectors, by re-randomizing the input basis before running the BKZ
11 Note that a solution to Eq. (13) would require the coefficient in front of t′ to be ±1, while we're
allowing any small polynomial. Furthermore, as discussed after Eq. (13), some parts of the real solution
are smaller than the bound ζ, but we're ignoring this for the sake of being conservative with our analysis.

Figure 12: Effect of forgetting q-vectors by randomization, under the same BKZ-blocksize
b.

algorithm. For the same blocksize b, this makes Zone 1 of the output basis disappear
(BKZ does not find the q-vectors), at the cost of producing a basis with first vectors of
larger Euclidean norms. This is depicted in Fig. 12.
It turns out that this strategy always improves over the previous strategy for the
parameter ranges considered in Table 2. We therefore used this strategy for our security
estimates.

C.4 On Other Attacks


For our parameters, the BKW [BKW03] and Arora–Ge [AG11] families of algorithms are
far from competitive.

Algebraic attacks. One specificity of our LWE and SIS instances is that they are inherited
from MLWE and MSIS instances. One may wonder whether the extra algebraic structure
of the resulting lattices can be exploited by an attacker. The line of work of [CGS14,
BS16, CDPR16, CDW17] did indeed find new cryptanalytic results on certain algebraic
lattices, but [CDW17] mentions serious obstacles towards breaking cryptographic instances
of Ring-LWE. By switching from Ring-LWE to MLWE, we get even further away from
those weak algebraic lattice problems.

Dense sublattice attacks. Kirchner and Fouque [KF17] showed that the existence of
many linearly independent and unexpectedly short lattice vectors (much shorter than
Minkowski’s bound) helps BKZ run better than expected in some cases. This could
happen for our primal LWE attack, by extending M = (rot(A)[1:m] |Im |vec(t)[1:m] ) to
(rot(A)[1:m] |Im |rot(t)[1:m] ): the associated lattice now has 256 linearly independent short
vectors rather than a single one. The Kirchner-Fouque analysis of BKZ works best if both
q and the ratio between the number of unexpectedly short vectors and the lattice dimension
are high. In the NTRU case, for example, the ratio is 1/2, and, for some schemes derived
from NTRU, the modulus q is also large. We considered this refined analysis of BKZ in
our setup, but, to become relevant for our parameters, it requires a parameter b which
is higher than needed with the usual analysis of BKZ. Note that [KF17] also arrived at
the conclusion that this attack is irrelevant in the small modulus regime, and is mostly a
threat to fully homomorphic encryption schemes and cryptographic multilinear maps.
Note that, once again, the switch from Ring-LWE to MLWE takes us further away
from lattices admitting unconventional attacks. Indeed, the dimension ratio of the dense
sub-lattice is 1/2 in NTRU, at most 1/3 in lattices derived from Ring-LWE, and at most
1/(` + 2) in lattices derived from MLWE.

Specialized attack against `∞ -SIS. At last, we would like to mention that it is not clear
whether the attack sketched in Appendix C.3 above for SIS in infinity norm is optimal.
Indeed, as we have seen, this approach produces many vectors, with some rather large
uniform coordinates (at indices 1, . . . , i), and smaller Gaussian ones (at indices i, . . . , j). In
our current analysis, we simply hope that one of the vector satisfies the `∞ bound. Instead,
one could combine them in ways that decrease the size of the first (large) coefficients, while
letting the other (small) coefficients grow a little bit.
This situation created by the use of `∞ -SIS (see Remark 1) has — to the best of our
knowledge — not been studied in detail. After a preliminary analysis, we do not consider
such an improved attack a serious threat to our concrete security claims, especially in light
of the approximations already made in favor of the adversary.

C.5 Beyond Core-SVP Hardness


At the time the core-SVP hardness measure was introduced by [ADPS16], the best
implementations of sieving [BDGL16a, MLB17] had performance significantly worse than
the 2^{0.292b} CPU cycles proposed as a conservative estimate by this methodology. This
was due to substantial polynomial, or even sub-exponential, overheads hidden in the
complexity analysis of 2^{0.292b+o(b)} given in [BDGL16a]. Before the work of [ADPS16],
security estimates of lattice schemes were typically based on the cost of SVP via enumeration
given in [CN11b, APS15], leading to much more aggressive parameters. Beyond the cost
of SVP-calls, this methodology also introduced a different prediction of when BKZ solves
LWE which was later confirmed in [AGVW17] and refined in [DSDGR20].
While doubts were expressed as to whether sieving would ever outperform the super-
exponential, yet practically smaller, costs of enumeration [CN11b] for relevant cryptographic
dimensions, significant progress on sieving algorithms [Duc18, ADH+ 19] has brought the
cross-over point down to dimension about b = 80. In fact, the current SVP records are
now held by algorithms that employ sieving¹². This progress mandates a revision
and refinement of the Dilithium security estimates, especially regarding classical attacks. In
particular, while it was pretty clear from experiments that the costs hidden in the o(b)
before those improvements were positive both in practice and asymptotically, the dimensions
for free technique of [Duc18] offers a sub-exponential speed-up, making it a priori unclear
whether the total o(b) term is positive or negative, both asymptotically and concretely.

C.5.1 Cost of MLWE


We follow the same methodology as for the Kyber Round 3 specification document to
count the number of gates required to solve MLWE. This analysis refines the core-SVP
methodology in the following ways:

• It uses the probabilistic simulation of [DSDGR20] rather than the GSA-intersect model
of [ADPS16, AGVW17] to determine the BKZ blocksize b for a successful attack.
The required blocksize is somewhat larger (by an added term between 10 and 20 for
our parameters), because of a "tail" phenomenon [YD17],

• It accounts for the "few dimensions for free" introduced in [Duc18], which permits
solving SVP in dimension b while running sieving in a somewhat smaller dimension
b′ = b − o(b),

• It relies on the concrete estimation for the cost of sieving in gates from [AGPS19]

• It accounts for the number of calls to the SVP oracle,


12 https://www.latticechallenge.org/svp-challenge/

• It dismisses the dual attack as realistically more expensive than the primal one,
noting that the analysis of [ADPS16] (also used here) is optimistic in assuming that the sieve
provides 2^{0.208b} many vectors as short as the shortest one. First, most of those vectors
are larger by a factor √(4/3); secondly, the trick of exploiting all those vectors is not
compatible with the "dimensions for free" trick of [Duc18].


The scripts for these refined estimates are provided in a git branch of the leaky-LWE-
estimator of [DSDGR20],13 and lead to the estimates given in Table 4. We refer to the
Kyber Round 3 specification document for the details of this analysis. We point out that
it is paired with a detailed discussion of the “known unknowns”, providing a plausible
confidence interval for these estimates. For the classical hardness of LWE at Levels 1 and
2, it is estimated that the true cost is no more than 16 bits away from this estimate, in
either direction.
We also note that a similar refined count of quantum gates seems essentially irrelevant:
the work of [AGPS19] concluded that the quantum speed-ups of sieving are rather tenuous,
while the quantum security target for each level is significantly lower than the classical
target.

Table 4: Refined estimates for the MLWE hardness

                         n      b     b′    log2(gates)   log2(memory in bits)
Dilithium Level 2      2049    433   394       158.6             97.8
Dilithium Level 3      2654    638   587       216.7            138.7
Dilithium Level 5      3540    883   818       285.4            187.4

n:  Optimal lattice dimension for the attack
b:  BKZ blocksize
b′: Sieving dimension accounting for "dimensions for free"

C.5.2 Cost of MSIS


Ideally, one would also like to have similar refined estimates for the MSIS problem in the
ℓ_∞-norm. However, this requires further technical considerations. We strongly suspect that
the refined cost of MSIS would be somewhat larger than that of LWE, the main reason
being that, as for the MLWE dual attack, the trick of exploiting all the vectors output
by sieving is not compatible with the "dimensions for free" trick of [Duc18]. We plan to
provide such a refined analysis in the near future.

13 https://github.com/lducas/leaky-LWE-Estimator/tree/NIST-round3
