An Assessment of Differential-Neural Distinguishers
Aron Gohr¹, Gregor Leander² and Patrick Neumann²
¹ Independent Researcher, aron.gohr@gmail.com
² Ruhr University Bochum, Bochum, Germany, firstname.lastname@rub.de
1 Introduction
Evaluating the security of a cipher is of great importance. In symmetric cryptography this is, for the most part, done by proving resilience against already known types of attacks and by the absence of (known) attacks against the full version of the cipher, often including some security margin. For most types of attacks, automated search strategies for the underlying properties have been developed. Examples of this are finding differential and/or linear characteristics with the highest probability/correlation for a certain number of rounds [MWGP11, MP13, SHW+14], or automatically finding integral attacks [Tod15, XZBL16]. Such tools often reduce the underlying problem to, for example, a SAT or MILP instance, which can be solved using already existing solvers.
Not too long ago, a new, machine learning assisted type of attack on the block cipher Speck was proposed in [Goh19]. Having its foundations in differential cryptanalysis, the idea is to distinguish ciphertext-pairs that belong to a fixed plaintext difference from random ones. For this, neural networks are employed, leading to an 11-round key recovery attack on Speck32/64 that is at least competitive with classical ones.
While it is clear that machine learning will not completely substitute classical cryptanalysis, especially since we will see that, at least for the problem at hand, neural networks tend to need a quite pronounced difference between the distributions of the learning problem, the question arises how generic this approach is and to what extent it can complement the work of a cryptanalyst. In other words, can we see machine learning as a tool assisting cryptanalysis, similar to how SAT and MILP solvers, among others, are seen by now?
1: When using the input difference from [CSYY22], the accuracies of our distinguishers (without any further adjustments) are 0.662 and 0.550 for 6 and 7 rounds, respectively.
Table 1: Comparison of neural distinguishers. For convenience of the reader, the results for each round-reduced cipher are ordered by accuracy.

Cipher        Full Rounds   Rounds   Accuracy   Source
Simon32/64    32            9        0.598      [HRC21b]
                                     0.628      [SZM20]
                                     <0.63      [HRC21a]
                                     0.659      [BGL+21]
                                     0.661      This Work
                            10       0.501†     [SZM20]
                                     0.566∗     [BGL+21]
                                     0.567∗     This Work
                            11       0.517∗     [BGL+21]
                                     0.520∗     This Work
Speck32/64    22            7        0.607      [CSYY22]
                                     0.616      [Goh19]
                                     0.617      This Work
                            8        0.514∗     [Goh19]
Skinny64/64   32            7        0.937      This Work
Present       31            6        0.658      [CSYY22]
                                     0.712      This Work
                            7        0.549      [CSYY22]
                                     0.563      This Work
Katan32       254           61       0.536      This Work
                            63       0.522      This Work
                            65       0.514      This Work
                            66       0.505      This Work
ChaCha        8/12/20       3        0.989      This Work

†: As far as we understand, the accuracy is evaluated using 10^6 samples, which means that the probability that this is just random noise is around 3.6%.
∗: Results need a more elaborate training procedure; there is no known way to obtain them by simple variations of direct training.
†: Result is based on a difference-polytope that consists of 3 differences (i.e. 4 plain-/ciphertexts are used per sample).
et al. [JKM20] classify input differences instead of distinguishing between real and random samples. In addition, the work of Benamira et al. [BGPT21] explores the explainability of the results in [Goh19]. Also, similar approaches to [Goh19] have been applied to other ciphers, such as ASCON [BBCD21], Chaskey [ZW22, BBCD21, CSYY22], DES [ZW22, CSYY22], GIMLI [BBCD21], KNOT [BBCD21], Present [CSYY22, ZW22, JKM20], Simon [LLL+22, ZWW22, HRC21a, HRC21b, BGL+21, SZM20] and Simeck [LLL+22]. Furthermore, Bacuieti et al. [BBP22] explored the direction of finding smaller, and therefore more efficient, neural distinguishers, which, despite inferior distinguishing performance, have the potential to lower the attack complexity.
Outline. We start in Section 2 by introducing some basic notation, the general setting of our distinguishers and the structure of the employed neural network, as well as the ciphers under scrutiny. Afterwards, we perform a case study of optimizing the original differential-neural distinguisher for six ciphers, following five different design approaches, in Section 3 and provide some general guidance on which hyper-parameters should be considered. This is followed up by pointing out some manual optimizations that can be done for the ciphers at hand in Section 4. After that, we show in Section 5 that the distinguishing performance of a differential-neural distinguisher is strongly correlated with the mean absolute distance of the two underlying distributions, which is essentially the accuracy a purely differential distinguisher is able to achieve, as well as that for a class of ciphers it is impossible for any distinguisher to use more than differential features. This is followed up by conducting the real difference experiment proposed in [Goh19] for all ciphers under scrutiny in Section 6. We then show in Section 7 that differential-neural distinguishers rather likely require a quite pronounced skewness of the ciphertext-pair distribution in order to be able to learn, followed by a rectification of distinguisher comparisons for using multiple ciphertext-pairs at once in Section 8. Finally, we conclude our work in Section 9.
2 Preliminaries
2.1 Notation
In this work, F2 denotes the finite field consisting of the two elements {0, 1}, and F2^n the n-dimensional vector space over this field. For convenience, we will implicitly identify (x_n, ..., x_1) ∈ F2^n with the integer $\sum_{i=1}^{n} x_i 2^{i-1}$, and the other way around, based on the context. We will therefore call x_1 the least significant bit (lsb) and x_n the most significant bit (msb). Furthermore, ⊕, ⊞ and ⊙ denote XOR, modular addition and bitwise product, and ≪, ≫ denote left and right rotation, respectively. Additionally, gcd(a, b) will denote the greatest common divisor of the two integers a and b.
Finally, let us introduce some basic notation about differentials. For this, let |S| denote the cardinality of the set S. As all ciphers under scrutiny are of an iterative nature, we will also recall the DP under the Markov assumption.
2: The source code for our experiments and the neural distinguishers can be found at https://github.com/differential-neural/An-Assessment-of-Differential-Neural-Distinguishers
Corollary 1 (DP under Markov Assumption). Let $F_1, \ldots, F_r : \mathbb{F}_2^n \to \mathbb{F}_2^n$, as well as $\gamma_1, \ldots, \gamma_{r+1} \in \mathbb{F}_2^n$. Then, under the Markov assumption,
$$p_{\gamma_1}^{\gamma_{r+1}}(F_r \circ \cdots \circ F_1) = \sum_{\gamma_2, \ldots, \gamma_r \in \mathbb{F}_2^n} \; \prod_{i=1}^{r} p_{\gamma_i}^{\gamma_{i+1}}(F_i).$$
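To make the corollary concrete, the following sketch (not taken from the paper; the toy round function and the 8-bit block size are illustrative assumptions) chains single-round transition probabilities exactly as in the formula above, for the common case of an iterated cipher where all F_i are the same round function.

```python
# Minimal sketch: multi-round differential probabilities under the Markov
# assumption, obtained by chaining single-round DDTs.
from collections import defaultdict

def single_round_ddt(round_f, n_bits):
    """p_gamma^delta(F) for one (unkeyed) round, computed by full enumeration."""
    size = 1 << n_bits
    ddt = defaultdict(float)
    for gamma in range(size):
        for x in range(size):
            delta = round_f(x) ^ round_f(x ^ gamma)
            ddt[(gamma, delta)] += 1.0 / size
    return ddt

def markov_dp(ddt, gamma_first, gamma_last, rounds, n_bits):
    """Sum over all intermediate differences of the product of round-wise DPs."""
    probs = {gamma_first: 1.0}
    for _ in range(rounds):
        nxt = defaultdict(float)
        for gamma, p in probs.items():
            for delta in range(1 << n_bits):
                q = ddt.get((gamma, delta), 0.0)
                if q:
                    nxt[delta] += p * q
        probs = nxt
    return probs.get(gamma_last, 0.0)

# Usage (illustrative): plug in any non-linear 8-bit round function round_f and call
#   markov_dp(single_round_ddt(round_f, 8), 0x01, 0x02, rounds=3, n_bits=8)
```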
We are mainly interested in the maximal DP, either over the choice of output difference or over both input and output difference.

Definition 2 (Maximal DP). Let $F : \mathbb{F}_2^n \to \mathbb{F}_2^n$. Then the maximal DP for a fixed input difference $\gamma \in \mathbb{F}_2^n$ is $\max_{\delta \in \mathbb{F}_2^n \setminus \{0\}} p_\gamma^\delta(F)$. In addition, the maximal DP is $\max_{\gamma, \delta \in \mathbb{F}_2^n \setminus \{0\}} p_\gamma^\delta(F)$.
3: Since ciphers are usually bijective mappings if the key is fixed, strictly speaking the case that c = c′ should be excluded. But given that n_c is at least 32 for the ciphers under scrutiny, the probability of this happening is negligible.
4: Note that fewer blocks were used to produce the final distinguishers in [Goh19].
Figure 1: Structure of the employed neural network: the ciphertext pair (c1, c2) is preprocessed, passed through 10 residual blocks (convolutions with 32 filters of size 3, batch normalization and activations), and fed into a prediction head (flatten, two dense layers with 64 neurons, batch normalization and activations) producing a score z ∈ [0, 1].

Figure 2: Round functions of (a) Simon and (b) Speck; one round maps (l_i, r_i) to (l_{i+1}, r_{i+1}) using the rotation constants α, β (and γ for Simon) and the round key k_i.
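For readers who want a concrete picture of the architecture summarized in Figure 1, the following Keras sketch reproduces its overall shape. The values of 32 filters, filter size 3, 10 residual blocks and 64 dense units follow the figure; everything else (input formatting, activation choice, the exact preprocessing) is an illustrative assumption and not the authors' exact implementation.

```python
# Sketch of a Gohr-style residual-tower distinguisher (assumptions: 4 words of
# 16 bits as input, ReLU activations, a 1x1 convolution as preprocessing).
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

def make_distinguisher(num_words=4, word_size=16, filters=32, kernel_size=3,
                       depth=10, dense_units=64, l2_penalty=1e-5):
    inp = layers.Input(shape=(num_words * word_size,))
    # Preprocess: arrange the bit vector as (bit position, word) and mix words.
    x = layers.Reshape((num_words, word_size))(inp)
    x = layers.Permute((2, 1))(x)
    x = layers.Conv1D(filters, 1, padding='same',
                      kernel_regularizer=regularizers.l2(l2_penalty))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    # Residual tower: `depth` blocks of two convolutions with a skip connection.
    for _ in range(depth):
        skip = x
        for _ in range(2):
            x = layers.Conv1D(filters, kernel_size, padding='same',
                              kernel_regularizer=regularizers.l2(l2_penalty))(x)
            x = layers.BatchNormalization()(x)
            x = layers.Activation('relu')(x)
        x = layers.Add()([x, skip])
    # Prediction head: two dense layers, then a sigmoid score z in [0, 1].
    x = layers.Flatten()(x)
    for _ in range(2):
        x = layers.Dense(dense_units,
                         kernel_regularizer=regularizers.l2(l2_penalty))(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
    out = layers.Dense(1, activation='sigmoid')(x)
    return models.Model(inp, out)
```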
The Ciphers Simon and Speck Simon and Speck are families of lightweight block ciphers designed by the NSA [BSS+13]. For the most part, we will concentrate on the variants using a 32-bit state and a 64-bit key, which we shall denote as Simon32/64 and Speck32/64, respectively. During encryption, the state is split into two equal sized words, which are then processed by the Feistel-like round functions depicted in Fig. 2. The rotational constants for Simon32/64 are α = 1, β = 8 and γ = 2, while the ones for Speck32/64 are α = 7 and β = 2.
Note that, even without knowing the key, part of the last round can always be reverted, i.e. based on a ciphertext (l_i, r_i) we are able to calculate (l_{i-1}, r_{i-1} ⊕ k_{i-1}) in the case of Simon and (l_i, r_{i-1}) in the case of Speck. We will refer to this as calculating the ciphertexts back.
For more details on Simon and Speck, the reader may consult [BSS+13].
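As a small illustration of "calculating the ciphertexts back", the following sketch reverts the key-independent part of the last round for Simon32/64 and Speck32/64, assuming 16-bit words and the rotation constants given above; it is a reading aid, not the authors' preprocessing code.

```python
MASK = 0xffff  # 16-bit words

def rol(x, s):
    return ((x << s) | (x >> (16 - s))) & MASK

def ror(x, s):
    return ((x >> s) | (x << (16 - s))) & MASK

def simon_back(l, r):
    # (l_i, r_i) -> (l_{i-1}, r_{i-1} ^ k_{i-1}), with f(x) = (x<<<1 & x<<<8) ^ (x<<<2)
    f = (rol(r, 1) & rol(r, 8)) ^ rol(r, 2)
    return r, l ^ f

def speck_back(l, r):
    # (l_i, r_i) -> (l_i, r_{i-1}); the left word still depends on the last round key
    return l, ror(r ^ l, 2)
```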
The Cipher Skinny Skinny [BJK+16] is a tweakable block cipher designed by Beierle et al. It uses a round function, depicted in Fig. 3, that is inspired by the Advanced Encryption Standard (AES). We will only consider the smaller version that uses a 64-bit block size and a (tweak-)key of the same size. Furthermore, we will call each of the 16 (4-bit) cells a word of the state. To be in line with the framework from [Goh19], we will ignore the tweak and use the tweakey as an ordinary key.
Again, part of the round function can be reverted without any knowledge about the key. This includes reverting the linear part (MixColumns and ShiftRows), as well as the addition of the round constants (AC) and the application of the S-box (SC) to the last two rows. In other words, it is possible to calculate the green cells (Fig. 3) within the last round.
More details on Skinny can be found in [BJK+16].
The Cipher Present Present [BKL+07], a cipher designed by Bogdanov et al., also has a 64-bit block size, and we focus on the version that uses an 80-bit key. Its round function is shaped like a classical Substitution-Permutation Network (SPN), as can be seen in Fig. 4.
Note that after the last round a final round key addition takes place, as otherwise most
of the last round could trivially be reverted. This means that whenever we refer to the
number of rounds for Present, this should be understood as the number of applications of the round function, and the number of key additions is the number of rounds increased by one. See [BKL+07] for more details on Present.

Figure 3: Round function of Skinny [Jea16]. The 16 green cells indicate to which point the last round function can be reverted without knowing the key.

Figure 4: Round function of Present: addition of the round key k_i, 16 parallel applications of the S-box S, followed by a bit permutation.
While the bit-permutation right before the last key addition of Present is, by nature, a linear operation that we could revert, similar to how we revert non-key-dependent operations for the other ciphers under scrutiny, we will not do so, since we do not expect changing the order of the bits to make a big difference.
The Cipher Katan Katan[DDK09] is a block cipher that uses shift registers and non-
linear feedback functions to manipulate the state during en-/decryption. Here, we only
consider the 32-bit version depicted in Fig. 5. For more details on Katan we refer to its
specification[DDK09].
As for most of the other ciphers, we will also consider reverting non-key dependent
parts of the round function. Since for Katan, this means reverting 17 (partial) rounds,
we provide the details of this in Appendix B.
The Cipher ChaCha The last cipher we consider here is ChaCha[Ber08]. While ChaCha
is a stream cipher, we will handle a keystream block as a ciphertext, and nonce and counter
(i. e. the input data that may be controlled by an adversary) as the plaintext. This enables
us to use similar techniques for all of the ciphers under scrutiny.
The key stream generation works as follows. First, an initial state, which is usually represented as a 4 × 4 matrix of 32-bit words, is filled with a constant, the key, the counter and the nonce.
Figure 5: Structure of Katan32, consisting of the two non-linear feedback shift registers L1 and L2, the round key bits k_a and k_b, and the irregular update controlled by IR.

Figure 6: Quarter-round of ChaCha, updating four words (a_i, b_i, c_i, d_i) using modular additions, XORs and rotations by 16, 12, 8 and 7.
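For reference, the quarter-round sketched in Figure 6 is the standard ChaCha quarter-round; a direct Python transcription is given below (this is textbook ChaCha, not code from the paper).

```python
# Standard ChaCha quarter-round with rotations 16, 12, 8 and 7.
def quarter_round(a, b, c, d):
    def rotl32(x, s):
        return ((x << s) | (x >> (32 - s))) & 0xffffffff
    a = (a + b) & 0xffffffff; d = rotl32(d ^ a, 16)
    c = (c + d) & 0xffffffff; b = rotl32(b ^ c, 12)
    a = (a + b) & 0xffffffff; d = rotl32(d ^ a, 8)
    c = (c + d) & 0xffffffff; b = rotl32(b ^ c, 7)
    return a, b, c, d
```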
Table 3: Overview of used input differences. Differences without a reference were chosen such that diffusion in the first few rounds is limited and were found to perform better than ones from literature.

Cipher    Difference
Speck     0x0040 0000 [Goh19, ALLW15]
Simon     0x0000 0040 [KLT15, ALLW13]
Skinny    0x0000 0000 0000 2000
Present   0x0000 0000 00d0 0000
Katan     0x00002000
ChaCha    0x00000000 00000000 00000000 00008000
expectation should still be tested. In addition, assuming that the neural network can also be used for ciphers other than Speck, can we optimize it further for each specific cipher, and which of the hyper-parameters are important?
In the remainder of this section, we will set out to answer the last two (still open) questions.
To this end, we will train and further optimize the neural network from [Goh19] for six
different ciphers, following five different design approaches, namely Simon and Speck,
representing Feistel networks, Present and Skinny, representing SPNs with a simple and a more involved (AES-like) linear layer respectively, Katan, which is based on non-linear feedback shift registers, as well as ChaCha, which, in contrast to the other ciphers, is not a block cipher but a stream cipher. As we will see, using a bit of (automatic)
optimization/customization for each cipher leads to a quite promising differential-neural
distinguisher for a not insignificant number of rounds, which further supports the generality
of the approach and of the employed neural network. In addition, we show that we can
slightly improve over the best previously known differential-neural distinguishers, but
despite considering a great number of hyper-parameters, no groundbreaking improvements
could be found.
The Search Space We started with a quite broad search space for both Simon and Speck, which we then reduced to a more reasonable one for the other ciphers, based on the results for those two ciphers.
The first search space is shown in Table 4. The last column gives the values used in [Goh19, Section 4.2]. The concrete search space for each hyper-parameter was determined by two factors: hardware limitations and/or degradation in distinguishing power. For more details on those hyper-parameters, we refer to Appendix A.
Out of all those parameters, only batch size, the use of circular convolutions, the number
of neurons in the densely connected layers, the number of filters, the filter size, the LR
(while using the cyclic LR schedule), the number of residual blocks and the L2 penalty seem
to have a significant (positive) effect. Hence, we restricted the search space for the other
ciphers to only those hyper-parameters. The overall results are shown in Table 5 (hyper-parameters not shown are as in [Goh19, Section 4.2]). Out of all the hyper-parameters we tested, the ones that appear to have the biggest impact are the LR, the L2 penalty and especially the filter size. For Simon, the larger filter size of 7 leads to an accuracy that is several percentage points higher than with a filter size of 3. One possible explanation for this is that the neural network most likely (effectively) reverts deterministic transformations
at the end of the encryption process, which in the case of Simon is one complete round
(up to key addition). Hence, reverting such calculations in advance (i. e. before providing
the samples to the neural network), which we shall denote as calculating the ciphertexts
back, can be seen as one additional hyper-parameter of our search space. Doing so indeed
improved the results for half of the ciphers in our experiments. For Speck, on the other
hand, we do not see any significant improvement, which may be due to the neural network
already being optimized to perform this calculation itself, and the same appears to be the
case for Simon if a filter size of at least 7 is used.
We want to highlight that at least a small hyper-parameter search, with different LRs,
L2 penalties and filter sizes, should be conducted if one plans to use a neural distinguisher
for another cipher. But given that such a search can be automated using a tool like
Optuna[ASY+ 19], the results in Table 5 show that the neural network from [Goh19] is
quite generic.
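As an illustration of how such an automated search could look, the following is a hypothetical Optuna study over the three hyper-parameters found to matter most; `train_and_evaluate` is a placeholder for a cipher-specific training pipeline and not a function from the paper's code base.

```python
# Hypothetical hyper-parameter search with Optuna over LR, L2 penalty and
# filter size; ranges are illustrative assumptions.
import optuna

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True)
    l2 = trial.suggest_float("l2_penalty", 1e-6, 1e-3, log=True)
    ks = trial.suggest_categorical("filter_size", [3, 5, 7])
    # Placeholder: train a fresh distinguisher with these settings and return
    # its validation accuracy (higher is better).
    return train_and_evaluate(learning_rate=lr, l2_penalty=l2, filter_size=ks)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```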
Comparison to Literature As we mentioned before, there already exist results for Simon,
Speck and Present in literature. If we take a look at Table 1, we see that, for the same
number of rounds, our results are always superior (in terms of accuracy), even if not by
much in the cases of Simon and Speck (and also Present if we use the same input
difference). Given the size of the search space, it is likely that much greater improvements
are simply not possible. We will see in Section 5 that at least for Simon this is most
certainly the case.
Note that we naturally tried to increase the number of rounds, but did not find more rounds to be learnable without any changes to the training data and/or process. This seems to be in line with the literature, as the results for more rounds are either not free from doubt, or use the more elaborate training processes introduced in [Goh19].

5: Note that we use 10 residual blocks as our default/baseline despite the fact that fewer blocks were used to produce the final distinguishers in [Goh19].
4 Manual Optimizations
In the previous section, we studied the performance of a generic neural network based
cryptanalytic training pipeline on a wide variety of different ciphers. In this section, we will
show that the presented results can be significantly improved with some limited guidance
from a cryptanalyst.
1. We can improve the choice of input differences. In principle, this does not necessarily
mean picking input differences that generate deeper distinguishers; it is perfectly
conceivable that improvements to an attack can be obtained by trading a slightly
weaker machine-learned distinguisher for more control over the event that it detects.
However, in this work, we will solely focus on optimizing input differences to obtain
stronger distinguishers.
2. The designer can also try to obtain neural distinguishers that work for more rounds by modifying the distinguisher to cover more rounds. In general, the most natural approaches to do this are staged training or extension by key-averaging, as introduced in [Goh19]; in this article, we will see that for Present, simply using the same distinguisher for more rounds works to a significant extent.
Table 6: Present distinguisher evaluated (using 10^7 samples) on more rounds than it was trained on. Every result other than the ones for 10 rounds passes a significance test with significance level 0.001.

Encryption Rounds    6-round Distinguisher    7-round Distinguisher
6                    0.658                    –
7                    0.557                    0.563
8                    0.509                    0.512
9                    0.502                    0.502
10                   0.500                    0.500
Table 7: Present distinguisher evaluated (using 10^7 samples from the real distribution and a fixed set of 10^8 samples from the random distribution) on more rounds than it was trained on. All reported results except the one on 11 rounds are significant at the 3σ level or more. The 11-round result becomes significant when 10^8 real labels are used, at an empirical deviation of ≈ 5σ_0.

Encryption Rounds    7-round Distinguisher, (µ_1 − µ_0)/σ_0
8                    229
9                    34.8
10                   4.8
11                   0.7
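A rough way to reproduce such significance judgements (not necessarily the authors' exact test) is to compare the observed accuracy against the standard deviation expected for pure guessing:

```python
# How many standard deviations does an observed accuracy lie above the 0.5
# expected for random guessing over n_samples independent trials?
from math import sqrt

def z_score(accuracy, n_samples):
    sigma = sqrt(0.25 / n_samples)   # std. dev. of the mean of fair coin flips
    return (accuracy - 0.5) / sigma

# Example: 0.501 measured on 10**6 samples is only ~2 sigma (plausibly noise),
# while the same accuracy on 10**7 samples is ~6.3 sigma.
print(z_score(0.501, 10**6), z_score(0.501, 10**7))
```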
The Accuracy of a Purely Differential Distinguisher Given the probability $p_\delta$ for every output difference $\delta \in \mathbb{F}_2^n$ (and a fixed input difference), a distinguisher could just use $p_\delta > \frac{1}{2^n-1}$ as a decision rule, i.e. the ciphertext-pair is deemed to belong to the encryption distribution if the probability of the observed ciphertext difference is greater in the encryption case than the probability of the same difference appearing in the random case. With this, the accuracy of such a distinguisher would be
$$\frac{1}{2} \sum_{p_\delta > \frac{1}{2^n-1}} p_\delta \; + \; \frac{1}{2} \sum_{p_\delta \le \frac{1}{2^n-1}} \frac{1}{2^n-1},$$
assuming that both the encryption and the random case appear with probability 1/2, as is the case for training the neural network (and evaluating its accuracy). Note that if both cases are equiprobable, this is actually the best that can be done if only the ciphertext-differences of single samples are considered.
Interestingly, this accuracy is basically the same as the mean absolute distance of the
ciphertext-difference distribution and the uniform one.
Lemma 1. The accuracy of a purely differential distinguisher is the same (up to constants)
as the mean absolute distance of the ciphertext-difference distribution and the uniform one.
7: Note that the zero difference cannot appear, which means that (in the random case) the differences are uniformly distributed over the 2^n − 1 non-zero values.
Proof. Since both $p_\delta$ and $\frac{1}{2^n-1}$ sum to 1 over all possible $\delta \in \mathbb{F}_2^n \setminus \{0\}$, we have
$$\sum_{p_\delta > \frac{1}{2^n-1}} p_\delta + \sum_{p_\delta \le \frac{1}{2^n-1}} \frac{1}{2^n-1} = 1 - \sum_{p_\delta \le \frac{1}{2^n-1}} p_\delta + \sum_{p_\delta \le \frac{1}{2^n-1}} \frac{1}{2^n-1} = 1 + \sum_{p_\delta \le \frac{1}{2^n-1}} \left( \frac{1}{2^n-1} - p_\delta \right),$$
as well as
$$\sum_{p_\delta > \frac{1}{2^n-1}} p_\delta + \sum_{p_\delta \le \frac{1}{2^n-1}} \frac{1}{2^n-1} = \sum_{p_\delta > \frac{1}{2^n-1}} p_\delta + 1 - \sum_{p_\delta > \frac{1}{2^n-1}} \frac{1}{2^n-1} = \sum_{p_\delta > \frac{1}{2^n-1}} \left( p_\delta - \frac{1}{2^n-1} \right) + 1.$$
Hence,
$$\frac{1}{2} \sum_{p_\delta > \frac{1}{2^n-1}} p_\delta + \frac{1}{2} \sum_{p_\delta \le \frac{1}{2^n-1}} \frac{1}{2^n-1} = \frac{1}{4} \left( 1 + \sum_{p_\delta \le \frac{1}{2^n-1}} \left( \frac{1}{2^n-1} - p_\delta \right) + \sum_{p_\delta > \frac{1}{2^n-1}} \left( p_\delta - \frac{1}{2^n-1} \right) + 1 \right) = \frac{1}{2} + \frac{1}{4} \sum_{\delta \in \mathbb{F}_2^n \setminus \{0\}} \left| p_\delta - \frac{1}{2^n-1} \right|.$$
This means that maximizing the accuracy of such a distinguisher (over the chosen input difference) may not lead to the same input difference as trying to maximize the probability of a differential, since the former maximizes the mean absolute distance, while the latter maximizes the maximal distance.
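The following small helper illustrates Lemma 1 numerically for a toy block size; `probs` is an assumed dictionary mapping non-zero output differences to their probabilities p_δ (missing entries are treated as zero).

```python
# Accuracy of the purely differential decision rule, and a check that it
# equals 1/2 + 1/4 * sum_delta |p_delta - 1/(2^n - 1)| (Lemma 1).
def purely_differential_accuracy(probs, n):
    u = 1.0 / (2**n - 1)            # probability of any difference in the random case
    acc = mad = 0.0
    for delta in range(1, 2**n):
        p = probs.get(delta, 0.0)
        acc += 0.5 * p if p > u else 0.5 * u
        mad += abs(p - u)
    assert abs(acc - (0.5 + 0.25 * mad)) < 1e-9
    return acc

# Example with n = 2: purely_differential_accuracy({1: 0.5, 2: 0.3, 3: 0.2}, 2) == 0.58333...
```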
By interpreting the $X_i$ as bits from $c_1$ and the $Y_i$ as bits from $c_2$, this means that if we add independent key bits (more precisely, key bits that are independent of the bits from $c_1$ and $c_2$ we are looking at) to $c_2$, the bits would look uniformly distributed. But there is also a different interpretation.
If we divide the bits from $c_1$ and $c_2$ that we are looking at into a group of bits to which the same key bits (or no key bits at all) are added, i.e. for which there are dependencies between the added key bits, and interpret them as $X_1, \ldots, X_m$ (after key addition), and interpret the rest of the bits, i.e. the ones to which independent key bits are added, as $Y_1 \oplus K_1, \ldots, Y_l \oplus K_l$, we can see that we only need to consider the group corresponding to the dependent key bits.
For illustration, let us look at a Feistel cipher with round function $F_k : \mathbb{F}_2^{2n} \to \mathbb{F}_2^{2n}$, $F_k(l, r) = (r \oplus k \oplus f(l), l)$ and Feistel function $f : \mathbb{F}_2^n \to \mathbb{F}_2^n$. Let us consider the two ciphertexts $c_1 = (s_1^{(l)}, \ldots, s_n^{(l)}, s_1^{(r)}, \ldots, s_n^{(r)})$ and $c_2 = (q_1^{(l)}, \ldots, q_n^{(l)}, q_1^{(r)}, \ldots, q_n^{(r)})$. If we would look at $s_1^{(l)}, \ldots, s_n^{(l)}$ and $q_1^{(r)}, \ldots, q_n^{(r)}$, we would see no dependency, since the last round key got added to $s_1^{(l)}, \ldots, s_n^{(l)}$ but is independent of $q_1^{(r)}, \ldots, q_n^{(r)}$, assuming independent round keys. Analogously, there can also be no dependency between $s_1^{(l)}, \ldots, s_{\lfloor n/2 \rfloor - 1}^{(l)}$ and $q_{\lceil n/2 \rceil}^{(l)}, \ldots, q_n^{(l)}$.
$$\Pr\left[ V_j = v_j \mid X_i = v_i, Y_i = v_i \oplus \delta_i \ \forall i \neq j \right] = \frac{1}{2} \qquad (1)$$
hold for all $(v_1, \ldots, v_m) \in \mathbb{F}_2^m$ and all $V_j \in \{X_j, Y_j\}$. Then we have that
$$\Pr\left[ X_i = z_i, Y_i = z_i \oplus \delta_i \ \forall i \right] = \Pr\left[ X_j = z_j, Y_j = z_j \oplus \delta_j \mid X_i = z_i, Y_i = z_i \oplus \delta_i \ \forall i \neq j \right] \cdot \Pr\left[ X_i = z_i, Y_i = z_i \oplus \delta_i \ \forall i \neq j \right].$$
Furthermore,
$$\begin{aligned} \Pr\left[ X_j = z_j, Y_j = z_j \oplus \delta_j \mid X_i = z_i, Y_i = z_i \oplus \delta_i \ \forall i \neq j \right] &= \frac{1}{2} - \Pr\left[ X_j = z_j, Y_j = z_j \oplus 1 \oplus \delta_j \mid X_i = z_i, Y_i = z_i \oplus \delta_i \ \forall i \neq j \right] \\ &= \frac{1}{2} - \left( \frac{1}{2} - \Pr\left[ X_j = z_j \oplus 1, Y_j = z_j \oplus 1 \oplus \delta_j \mid X_i = z_i, Y_i = z_i \oplus \delta_i \ \forall i \neq j \right] \right) \\ &= \Pr\left[ X_j = w_j, Y_j = w_j \oplus \delta_j \mid X_i = z_i, Y_i = z_i \oplus \delta_i \ \forall i \neq j \right]. \end{aligned}$$
Thus, the probability $\Pr[X_i = z_i, Y_i = z_i \oplus \delta_i \ \forall i]$ does not depend on the concrete values $z_i$, but only on the differences $\delta_i$.
For us this means that, as long as Eq. (1) holds, the probability distribution depends only on the difference between $c_1$ and $c_2$ and can therefore be calculated using the DDT. This can easily be seen by expressing the probability of a difference as the sum over every possible bit configuration:
$$\Pr\left[ X_i = Y_i \oplus \delta_i \ \forall i \right] = \sum_{(z_1, \ldots, z_m) \in \mathbb{F}_2^m} \Pr\left[ X_i = z_i, Y_i = z_i \oplus \delta_i \ \forall i \right].$$
Since the lemma tells us that all configurations have the same probability, calculating them
from the probability of the difference is just a matter of dividing by the number of possible
configurations. Hence, if we want to see whether it could be possible to learn additional
features aside from the ones provided by the DDT, we can check if Eq. (1) does not hold.
For example, let us assume we have a cipher with a round function of the form $R : \mathbb{F}_2^n \times \mathbb{F}_2^n \to \mathbb{F}_2^n$, $R(k, x) = f(x) \oplus k$ for some $f : \mathbb{F}_2^n \to \mathbb{F}_2^n$ and independent round key $k$. Then we know from Lemma 2 that Eq. (1) holds. Therefore, we cannot expect to learn more information from the distribution of ciphertext-pairs than we could already learn from the DDT. More generally, we have shown the following.
Theorem 1. If for a (bijective) cipher there exists a transformation of the ciphertexts, independent of the key, such that the transformed ciphertexts have the form $c' \oplus k'$, where $k'$ are independent key bits, then the encryption distribution and the ciphertext-difference distribution (where the difference is calculated after applying the key-independent transformation) contain the same information.
Hence, in order to show that the encryption distribution does not provide any information in addition to the DDT, we have to show that we can represent ciphertexts as $c' \oplus k'$ for independent key bits $k'$. Obviously, this is the case for Present if we assume independent round keys, as well as for all other key-alternating ciphers.
Additionally, this is also the case for Feistel ciphers. To see that, let us denote
the round function of such a cipher as Fk (l, r) = (r ⊕ k ⊕ f (l), l) for some function
f : Fn2 → Fn2 and round key k. Obviously, we can see the key addition as the first operation
in the round and calculate one round back up to the key addition. Hence, instead of
looking at ciphertexts of the form (ri ⊕ ki ⊕ f (li ), li ) we can look at ones of the form
(ri ⊕ ki , li ) = (ri ⊕ ki , ri−1 ⊕ ki−1 ⊕ f (li−1 )) = (ri , ri−1 ⊕ f (li−1 )) ⊕ (ki , ki−1 ). Since
ri , ri−1 , li−1 do not depend on the round keys ki , ki−1 (assuming independent round keys)
we know that we can calculate the probability distribution of the ciphertexts (calculated back up to key addition) by using the DDT. In other words, Simon is another example where only differential features can be used, again assuming independent round keys. But for Feistel ciphers, such as Simon, we first need to revert one round, up to key addition.

8: That is, if we know the real DDT, i.e. the model used to calculate the DDT does not contain any assumptions that are not met, such as the Markov assumption, which is often used even though it does not hold in most cases. Otherwise, we would only have an approximation of the DDT, and therefore of the distribution of ciphertext-pairs, and could still hope to find a better one.

Table 8: Accuracy comparison between the neural distinguisher and the DDT-based one (after reverting one round up to key addition) for Simon32/64.

Distinguisher     9 Rounds    10 Rounds    11 Rounds
Neural            0.661       0.567        0.520
DDT [BGL+21]      0.663       0.568        0.520
Knowing that, and given that Bao et al.[BGL+ 21] already calculated the accuracy
a DDT based distinguisher is able to achieve, we can actually judge the quality of the
approximation the differential-neural distinguisher is able to produce for Simon32/64.
As can be seen in Table 8, the accuracies of the two distinguishers are almost identical,
supporting the conjecture that the results we have seen in Section 3 may be (almost)
optimal. We will later see that the same is true for toy versions of Simon, which makes
it reasonable to assume that this observation can be extrapolated to the 48-, 64-, 96-
and 128-bit block versions of Simon, too. Note that, while calculating the DDT for
the 32-bit version of Simon is possible, at least under Markov assumption, it is already
computationally expensive, and one can expect it to become infeasible for greater block
sizes. Hence, the use of differential-neural distinguishers could be even more valuable for
greater block sizes.
A Chance to Still Use More Than Differential Features While the results above show
that for single ciphertext-pairs only their difference may provide information, one option to
still improve over purely differential distinguishers could be to use multiple ciphertext-pairs
at once, making use of dependencies between the pairs, especially if the key is fixed. But
as we will see in Section 8, the dependencies are either not that strong, or they simply do
not get used much by current state of the art multi-pair distinguishers.
Small Scale Simon Starting with Simon, the authors of [KLT15, Bei16] already provide
constraints for choosing reasonable rotational constants α, β and γ (using the notation
from Fig. 2a). The most obvious one is that α and β can be switched and, of course, α ≠ β should hold. Also, it should hold that gcd(|α − β|, n) = 1, where n = 8 is the word size, in order to keep the differential probabilities low. For the same reason, γ ∉ {α, β} should hold, too. In addition, Kölbl et al. also show in [KLT15] that multiplying all rotational
hold, too. In addition, Kölbl et al. also show in [KLT15] that multiplying all rotational
constants by s ∈ Z∗n results in the same cipher apart from bit permutations (and changes
within the key schedule). We will adopt their notion of structural equivalence for this.
Lemma 4. For Simon, adding n/2 to all rotational constants, where n denotes the word size, leads to a structurally equivalent cipher.

Proof. Let
$$f(x)_{(\alpha,\beta,\gamma)} = \left( (x \ll \alpha) \odot (x \ll \beta) \right) \oplus (x \ll \gamma)$$
be the Feistel function of Simon. It is easy to see that, for any integer ρ, $f(x)_{(\alpha+\rho,\beta+\rho,\gamma+\rho)} = f(x)_{(\alpha,\beta,\gamma)} \ll \rho$ holds. If we denote by
$$R^{(k)}_{(\alpha,\beta,\gamma)}(l, r) = \left( r \oplus f(l)_{(\alpha,\beta,\gamma)} \oplus k, \; l \right)$$
the round function of Simon, then by using ρ = n/2 it is easy to see that
$$R^{(k)}_{(\alpha+n/2,\beta+n/2,\gamma+n/2)}(l, r \ll n/2) = \left( \left( r \oplus f(l)_{(\alpha,\beta,\gamma)} \oplus (k \ll n/2) \right) \ll n/2, \; l \right)$$
and
$$R^{(k)}_{(\alpha+n/2,\beta+n/2,\gamma+n/2)}(l \ll n/2, r) = \left( r \oplus f(l)_{(\alpha,\beta,\gamma)} \oplus k, \; l \ll n/2 \right).$$
Hence, using (α, β, γ) and (α + n/2, β + n/2, γ + n/2) produces structurally equivalent ciphers.
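The rotation identity used in the proof is easy to confirm empirically; the following check (for the word size n = 8 used in the small scale experiments) is only a sanity check, not part of the proof.

```python
# Check that f_{(a+r, b+r, c+r)}(x) = f_{(a,b,c)}(x) <<< r for 8-bit words.
N = 8
MASK = (1 << N) - 1

def rol8(x, s):
    s %= N
    return ((x << s) | (x >> (N - s))) & MASK

def f(x, a, b, c):
    # Simon-like Feistel function with rotational constants (a, b, c)
    return (rol8(x, a) & rol8(x, b)) ^ rol8(x, c)

for x in range(1 << N):
    for (a, b, c) in [(1, 2, 3), (1, 0, 2)]:
        for r in range(N):
            assert f(x, a + r, b + r, c + r) == rol8(f(x, a, b, c), r)
```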
Lemma 5. Let n be the word size. If n is even, n ≥ 4 and gcd(|α − β|, n) = 1, then there exist unique non-negative integers b < n/2 (even) and c ≤ n − 1 such that, for Simon, using the constants (α, β, γ) is structurally equivalent to using the constants (1, b, c).
9: For Simon it is also not clear how the key schedule should be adapted, which is why we use independent round keys instead. For Speck, on the other hand, the key schedule makes use of the round function only, which makes adaptation of the key schedule straightforward.
Table 9: Maximal DP and accuracy (based on DDT) for 7-round small scale Simon and different choices of rotational constants, together with the input difference which leads to those results, and ordered by maximal DP. Rotational constant α is fixed to 1.

β  γ    Difference   log(max DP)   Difference   max Acc.
2  3    0x00 0b      -11.3         0x00 01      0.566
2  5    0x01 20      -11.0         0x00 01      0.547
2  7    0x01 80      -11.0         0x00 01      0.587
0  7    0x01 80      -11.0         0x00 01      0.688
0  5    0x01 20      -11.0         0x00 01      0.546
0  6    0x00 01      -10.4         0x00 01      0.546
0  2    0x00 01      -10.4         0x00 01      0.574
0  3    0x00 05      -10.3         0x00 01      0.547
2  6    0x00 01      -8.7          0x00 01      0.694
0  4    0x00 01      -7.4          0x00 01      0.701
2  4    0x00 01      -7.2          0x00 01      0.619
2  0    0x00 01      -6.8          0x00 01      0.737
Proof. Since gcd(| α − β |, n) = 1 and n is even, α − β must be odd. This means that one of
them (α or β) has to be even and the other one odd. Hence, by multiplying all constants
with s ∈ Z∗n we can bring the odd one to 1 while the even one remains even. For the
sake of simplicity let us assume that α = 1. Now we can choose b as either β if β < n/2
already holds or as β + n/2 by simply adding n/2 to all constants (and reducing modulo
n). Of course, this means that we could change α from 1 to 1 + n/2, but by multiplying all
constants with 1 + n/2 we can return α to (1 + n/2)² = 1 mod n (since n ≥ 4 is even) and also leave b < n/2 untouched since b · (1 + n/2) = b + n · (b/2) = b mod n, as β and therefore b
is even. The uniqueness follows from the fact that adding n/2 and multiplying by s ∈ Z∗n
is commutative and every combination of those can be written as a single multiplication by
s and at most one addition of n/2. Adding n/2 or multiplying by s alone would violate the
first constant being 1 while doing both would require 1 · s + n/2 = 1 for the first constant
to remain 1, which means that s has to be n/2 + 1, i. e.
b · s + n/2 = b · (n/2 + 1) + n/2 = b + n/2 > n/2
since b is even and b < n/2. Hence, b and c are unique.
This leaves us with the rotational constants
$$(\alpha, \beta, \gamma) \in \left\{ (1, b, c) \mid b \in \{0, 2\},\; c \in \{0, 2, 3, 4, 5, 6, 7\} \setminus \{b\} \right\}. \qquad (2)$$
We calculated the DDT for 7 rounds for those constants by making use of Theorem 3 from [KLT15]. Note that since the round function is invariant under rotations (i.e. rotating both words of the input by d is the same as doing so at the output), we only need to consider input differences up to rotational equivalence. Table 9 provides the maximal DP, as well as the maximal accuracy of a purely differential distinguisher, together with the input differences leading to those results. Here, it already becomes apparent that choosing the rotational constants such that the maximal DP is minimized, as is normally done, does not necessarily minimize the accuracy a purely differential distinguisher is able to achieve.
We already know from Section 5.1 that an r-round distinguisher would use the DDT for r − 1 rounds. Hence, we trained differential-neural distinguishers on 8 rounds using the input difference 0x00 01, which leads to the highest accuracies for the DDT-based distinguisher, no matter the rotational constants. As can be seen in Fig. 7, if the neural network did learn at all, the accuracies are virtually identical.

Figure 7: Network performance for small scale Simon and different choices of rotational constants when using the input difference 0x00 01 and one additional round, i.e. 8 rounds, while the DDT is still calculated for 7 rounds. α is fixed to 1 and the results are ordered by the accuracy of the neural network.

10: Of course, other kinds of attacks than differential ones would also be considered.
11: The hyper-parameters were chosen to be as in [Goh19, Section 4.2], i.e. no cipher-specific hyper-parameter optimization was done.
Note that the input difference 0x00 01, which is found to be the best (in terms of accuracy), is rotationally equivalent to every other input difference with only one active bit that lies in the right word. In other words, the input difference we have chosen for Simon32/64, namely 0x0000 0040, is rotationally equivalent to 0x0000 0001, which is a reasonable candidate for the best input difference for the 32-bit version, based on the results we have seen here.
Next, let us fix the rotational constants to (α, β, γ) = (1, 2, 3), i. e. the ones that lead
to the lowest maximal DP. Since 7 rounds seem to be harder for the neural network to
learn, we will use 6 rounds instead, in order to have more meaningful results.
Starting with the input differences that lead to the maximal DP, it can be seen in
Fig. 8a that both accuracies, the one based on the DDT and the one of the neural network,
are again virtually identical. The same can be seen for input differences leading to maximal
accuracy (based on the DDT), see Fig. 8b. Therefore, we conclude that for Simon the
neural network seems to be able to find a good approximation of the DDT, which may not
be that interesting for the small scale variant, but, as we have already seen, also seems to
be the case for Simon32/64, and is therefore also likely to be true for variants with greater
block sizes, where calculating the DDT becomes infeasible. Hence, such an approximation,
especially for larger block sizes, could become quite useful.
Small Scale Speck For Speck, we again need to choose the rotational constants. For this, we can make similar arguments. Once more, we need α and β to not both be divisible by 2 in order to achieve better diffusion. Also, α = n − β mod n (n being the word size) should be avoided, as the rotations would otherwise be identical. In addition, we incorporate the rationale of the THREEFISH designers [FLS+10] and ignore the rotational constants −1, 0 and 1, since diffusion in adjacent bits should already be provided by the modular addition.
The best input differences for purely differential distinguishers on 6-round Speck and
different choices of rotational constants are given in Table 10. Note that, once more, the
best input difference in terms of DP and the best in terms of accuracy based on the DDT
are not identical.
Figure 8: Network performance for small scale Simon, using rotational constants (α, β, γ) = (1, 2, 3), for the top ten input differences (up to rotational equivalence) based on and ordered by (a) DP (for 7 rounds) and (b) accuracy (DDT), respectively. The network was trained on 7 rounds, but the accuracy based on the DDT is reported for 6 rounds.

In contrast to small scale Simon, where the input difference 0x00 01 always led to
the highest accuracy and is the natural counterpart to 0x0000 0040 (up to rotational
equivalence), the choice of input difference is not as clear cut for the different variants of
small scale Speck. But if we take a closer look at the input difference used for Speck32/64
in [Goh19], we notice that at the first modular addition only the msb of the left word is
active. This leads to a deterministic first round transition and can easily be adapted for
other choices of rotational constants.
Using such differences, we can see once more that the accuracy based on the DDT is
a good approximation of the one of the neural distinguisher, see Fig. 9a. But as already
noted in [Goh19], the neural distinguisher is able to surpass the purely differential one
in most of the cases, and as can be seen in Fig. 9b, just optimizing the size of the filters
enables the neural distinguisher to at lest catch up to the purely differential one in the
cases where the purely differential distinguisher was superior before.
We would like to note that trying different filter sizes for small scale Simon did not
make any significant difference. In other words, the version of Fig. 7 optimized for filter
size would not look any different.
Given that the rotational constants α = 5 and β = 2 lead to the lowest maximal DP,
let us now fix those constants. We can see the top ten input differences, as well as the
corresponding maximal DP and accuracies for both, the neural network, as well as the
DDT, in Fig. 10a. Here, we would like to point out two things. Firstly, if the neural
network is able to learn at all, it either reaches the same accuracy as the distinguisher
based on the DDT or it surpasses it. Secondly, the figure definitely shows that there is no
clear relation between the accuracy (based on the DDT) and the maximal DP, which is
not surprising as we already know that one corresponds to the mean and the other to the
maximal distance of the distributions.
Moving on to the top ten input differences according to the accuracy of the purely
differential distinguisher, we can see in Fig. 10b that the accuracy based on the DDT is
once more a good approximation for the one of the neural distinguisher, and, again, the
neural distinguisher is at least as good as the purely differential one and even surpasses it
in most cases.
If we take a closer look at the input differences that lead to the highest accuracies for
small scale Speck, we see that the counterpart to the input difference used for Speck32/64
(i. e. 0x10 00) does not lead to the highest accuracy. Hence, the question arises if we are
able to find a better input difference based on those results.

Table 10: Maximal DP and accuracy (based on DDT) for 6-round small scale Speck and different choices of rotational constants together with the input difference which leads to those results, ordered by maximal DP.

α  β    Difference   log(max DP)   Difference   max Acc.
5  2    0x08 60      -11.7         0x04 20      0.591
3  2    0x28 01      -11.7         0x01 20      0.580
3  6    0x84 98      -11.6         0x10 02      0.596
6  5    0x20 00      -11.0         0x01 04      0.581
5  6    0x20 01      -11.0         0x10 00      0.563
2  5    0x48 10      -11.0         0x10 04      0.587
6  3    0x28 a9      -11.0         0x04 10      0.580
5  4    0x12 91      -10.0         0x10 80      0.644
3  3    0x34 02      -10.0         0x24 80      0.592
5  5    0xd0 94      -10.0         0x12 80      0.598
2  3    0x02 80      -9.7          0x40 10      0.559
4  3    0x28 32      -9.0          0x48 80      0.599
3  4    0x01 20      -8.4          0x04 80      0.615
4  5    0x84 40      -8.0          0x08 00      0.664

Figure 9: Network performance for 6-round small scale Speck and different choices of rotational constants when using the input difference such that only the msb of the left side is active right before the modular addition in the first round, extended by the accuracy calculated using the DDT (for the same number of rounds). The results are ordered by the accuracy of the neural network in the left image.

Figure 10: Network performance for 6-round small scale Speck, using rotational constants (α, β) = (5, 2), for the top ten input differences based on and ordered by (a) DP and (b) accuracy (DDT), respectively.

While it is not completely
clear how one can find a counterpart of one difference for a different variant of Speck,
we will still try to do so based on the word size and the rotational constants. For the
difference 0x04 20, for example, one can notice that after rotating the left side by α the
two sides are identical, i. e. the differences in both words are the same right before the
first modular addition. Furthermore, rotating the right side by β yields a difference in
the msb only. The counterpart of 0x04 20 is therefore likely to be 0x0010 2000. But for
Speck32/64 this difference leads to an accuracy of 0.595, which is slightly worse than the
original input difference, but still the best result of all differences such that the two sides
are identical up to rotating the left side by α. It hence looks reasonable that the choice of
input difference in [Goh19] could already be optimal, at least if we consider the neural
distinguisher on its own.
6 Real-Difference-Experiment
In [Goh19] the Real-Difference-Experiment is proposed to show that the neural distinguisher
for Speck is not a purely differential distinguisher. For this, 10^7 samples of the form
(c1 , c2 ) are drawn from the encryption distribution. Then half of the samples get blinded
as (c1 ⊕ b, c2 ⊕ b) by independent blinding values b chosen uniformly at random. The
resulting samples get judged by the distinguisher and the accuracy of distinguishing the
blinded from the un-blinded samples is evaluated using the distinguishers trained for the
scenario described in Section 2.2.
As explained in [Goh19], the rationale behind this is that the blinded samples contain
the same ciphertext difference, but all additional information is lost. Hence, if only differential features were used by a distinguisher, the accuracy would be around 0.5. If
the accuracy is significantly higher, then more than just the ciphertext difference is used.
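A schematic version of the experiment (with `real_pairs_fn` and `score_fn` as assumed stand-ins for sample generation and the trained network) could look as follows:

```python
# Real-Difference-Experiment: blind half of the real ciphertext pairs with a
# shared random value and check how well the distinguisher tells them apart.
import numpy as np

def real_difference_experiment(score_fn, real_pairs_fn, n=10**7, n_bits=32):
    """score_fn maps arrays (c1, c2) -> scores in [0, 1]; real_pairs_fn(n) returns
    n real ciphertext pairs (as unsigned integer arrays) for the fixed input difference."""
    c1, c2 = real_pairs_fn(n)
    keep = np.random.rand(n) < 0.5                          # half stay un-blinded
    blind = np.random.randint(0, 2**n_bits, size=n).astype(c1.dtype)
    c1 = np.where(keep, c1, c1 ^ blind)                     # blinding keeps c1 ^ c2
    c2 = np.where(keep, c2, c2 ^ blind)                     # but destroys everything else
    scores = score_fn(c1, c2)
    return np.mean((scores > 0.5) == keep)                  # accuracy of the experiment
```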
Actually, the distinguishing performance on the non-blinded encryption samples will
stay the same, as a sample labeled as encryption sample is still an encryption sample,
while one labeled as random sample is now a blinded encryption sample. This means we
could just compare the performance on random and blinded encryption samples, both
labeled as random samples, which is typically called the True Negative Rate (TNR) in
both the default network evaluation and the Real-Difference-Experiment. If the TNR were the same in both cases, then the blinded encryption samples would look like random samples to the network. If it goes down, then at least some differential features get used.

12: We increased the number of samples from 10^6 to 10^7 in order to get more precise estimates for Katan.
Still, we think that the accuracy of the Real-Difference-Experiment is the more intuitive
metric, which is why we will simply report both. If we take a look at the results in Fig. 11,
we see that for Simon32/64, Skinny and Present the accuracy drops to 0.5 in the
Real-Difference-Experiment, indicating that the neural distinguisher only makes use of the
ciphertext difference (or, more precisely, the difference after a deterministic transformation
of the ciphertexts in the cases of Simon and Skinny). While this was to be expected
for Simon and Present, given the results from Section 5.1, for Skinny it is not as clear
cut. Still, the experiment shows that for Skinny only differential features were used, too.
Whether this is because other features do not exist or because the neural network was
unable to learn them is not clear.
While the results for Katan also look as if only differential features are used, a closer look at the accuracy of 0.501 reveals that, as it was evaluated using 10^7 samples, it still deviates significantly from 0.5, which means that more than purely differential features were used. Something similar can be seen for ChaCha and Speck32/64, where the accuracy drops
are not too high.
We would like to point out that while the results for Simon, Present and Skinny clearly indicate that only differential features are used, we cannot draw the opposite conclusion for Katan and ChaCha, i.e. that inherently more than differential features have to be in use, as we only considered one (natural) representation. If we, for
example, look at Simon32/64 without backwards calculation, we see that the accuracy in
the Real-Difference-Experiment would still be 0.645 (with a TNR of 0.708). In other words,
we would need to repeat the Real-Difference-Experiment for all possible (deterministic)
transformations of the ciphertexts (up to linear equivalence), if we would want to make
such a conclusion plainly based on the Real-Difference-Experiment. Nevertheless, the
results show that using the given representation, more than just differential features can
be used. Additionally, at least for Speck, it was already shown in [Goh19], and also in
Section 5.3, that the neural distinguisher uses more than just differential features.
All in all, neural distinguishers are able to make use of more than just differential features, even in a setting that is differential by nature (i.e. using a fixed input difference).
It would therefore be interesting to classically explain such features, as this may have the
potential to lead to new classes of attacks, and hence improve current security estimations
as well as provide new guidance for cipher design.
ciphertext difference (except for Present, as this would just result in a permutation of the bit positions), which is motivated by Section 5.1.
14: I.e. without any form of pre-training or more elaborate training process.
Figure 12: Differential-linear biases when using the usual input differences as well as one-bit linear masks, empirically evaluated using 10^7 samples. Results that do not pass a Bonferroni-corrected significance test with significance level 0.01 are not shown. The two critical values for the significance test are shown by the two horizontal lines.
For this, it is important to note that differential-neural distinguishers, like the one from [Goh19], actually produce a score between 0 and 1 for each sample, representing the confidence that the sample belongs to the encryption class (instead of the random class). To evaluate the accuracy, we reduce this confidence to a hard decision by simply comparing it to 0.5, i.e. we lose information. But if we simply interpret each score as the probability of the sample belonging to the encryption class, and if we assume independence, we can actually make use of the whole score.
Corollary 2. Let us assume that we have m samples, each drawn from the same distribution (encryption or random). Let us denote the probability that the i-th sample is an encryption sample as $p_i$. If the probabilities $p_1, \ldots, p_m$ are independent, then the probability that all m samples are drawn from the encryption distribution is
$$\frac{1}{1 + \prod_{i=1}^{m} \frac{1 - p_i}{p_i}}.$$
Note that this was already implicitly used in [Goh19], as a ranking of m ciphertext-pairs based on this combined probability is the same as one by
$$\sum_{i} \log \frac{p_i}{1 - p_i},$$
which is how the keys are ranked in [Goh19] when using multiple ciphertext-pairs during trial decryption. In other words, [Goh19] already used multi-pair distinguishers, even if only implicitly.
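Explicitly, the combination rule can be implemented as follows; the clipping of the scores is one simple way to avoid division by zero, whereas the experiments below use the convention described in footnote 15.

```python
# Combine m single-pair scores into one multi-pair score as in Corollary 2
# (equivalently, rank by the sum of log(p_i / (1 - p_i))).
import numpy as np

def combine_scores(p, eps=1e-12):
    # eps-clipping avoids division by zero for scores of exactly 0 or 1
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1.0 - eps)
    log_odds = np.sum(np.log(p / (1.0 - p)))
    return 1.0 / (1.0 + np.exp(-log_odds))   # = 1 / (1 + prod (1-p_i)/p_i)

# Example: three mildly confident single-pair scores combine into a noticeably
# more confident multi-pair score.
print(combine_scores([0.6, 0.55, 0.65]))     # ~0.77
```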
Using this rule for combining probabilities/distinguisher responses, we are able to explicitly turn a distinguisher for one ciphertext-pair into one for an arbitrary number of ciphertext-pairs. A comparison of such distinguishers and results from the literature is shown in Table 2. As can be seen, the claimed improvement is marginal at best and non-existent in most cases.
In the case of [CSYY22], the multi-pair results are even worse than just using their single-
pair distinguishers. Only [ZWW22] manage to improve over the combined distinguishers,
but not as much as they reported, and in the case of 7- to 9-round Speck32/64, as well as
12-round Simon32/64 only.
To be fair to the authors, our distinguishers were not available for them to compare
to, but at least comparisons to published distinguishers, such as the ones for Speck32/64
from [Goh19], or even to their own single pair distinguishers, would have been desirable.
In addition to the results in Table 2, Benamira et al.[BGPT21, Section 6] claim to
improve the distinguisher from [Goh19] by averaging over the score of multiple (single-
pair) samples or combining them using a neural network, effectively creating a multi-pair
distinguisher. Unfortunately, they fail to notice that the key recovery attacks in [Goh19]
effectively combine the scores as detailed above. Despite their claims, they do not manage
to improve the results from [Goh19] at all, see Table 11.
Given the results for using multiple ciphertext-pairs when training a neural distinguisher,
one could come to the conclusion that this may be the best way to learn more rounds.
Indeed, if we look for example at the accuracies the authors of [ZWW22] achieved for
9-round Speck32/64 when using at least 8 pairs, it seems unlikely that we could tell an
accuracy for one pair apart from 0.5 using 10^6 (or even 10^7) samples. In other words,
it could be that single-pair distinguishers for 9-round Speck32/64 already were trained,
but were not identified as being useful due to the noise that comes from experimentally
evaluating the accuracy. A good example for this are our results for 66-round Katan.
15: To avoid dividing by zero, we will use the number of times that p_i = 0 or p_i = 1 holds as a decision rule in case that at least one of the p_i is zero. Also, note that we used independent keys for each ciphertext-pair, but do not expect the results to look significantly different if the same key were to be used for all pairs contained in one sample.
Table 11: Accuracy (for Speck32/64) when using the score combination method implicitly used in [Goh19] compared to the ones proposed in [BGPT21].

Rounds   Pairs   [Goh19]   Averaging [BGPT21]   2D-CNN [BGPT21]
5        5       0.999     0.998                0.994
         10      1.000     1.000                1.000
6        5       0.967     0.954                0.933
         10      0.995     0.990                0.977
         50      1.000     1.000                1.000
7        5       0.743     0.735                –
         10      0.821     0.808                –
         50      0.977     0.967                –
         100     0.997     0.997                –
While a single pair leads to an accuracy of only 0.505, which one could easily mistake for random noise, using (for instance) 8 pairs leads to an accuracy of 0.524, which is immediately more believable to be better than random guessing, and which is still considerably higher than the accuracy of 0.502 (reported in [ZWW22] for 9-round Speck32/64 and 8 pairs).
We would like to emphasise that we do think that using more than just one ciphertext-
pair could be a viable way to improve on already existing distinguishers, as there should
exist dependencies between the pairs that could be used, but we would like to ask for a
fair comparison, for example as we did here. In particular, a fair comparison does not mean using the same total amount of plain-/ciphertexts when evaluating the accuracy, but keeping the amount used for one sample constant. It should be clear that the number of samples/trials only affects the quality of the accuracy estimation, but not the distinguishing performance itself.
9 Conclusion
We have seen that differential-neural distinguishers are quite generic and can be applied to
a wide variety of ciphers and cipher designs. Furthermore, we provide guidance on how the
network hyper-parameters should be chosen in order to produce potent differential-neural
distinguishers. Despite the size of the explored search space, only a few hyper-parameters
needed to be changed for the six ciphers under scrutiny, which makes it plausible that the
same may be true for other ciphers as well.
In addition, we have seen that the best input difference with respect to the probability
of a differential is not necessarily the best for differential-neural distinguishers, and also that
there is a pronounced correlation between the accuracy a differential-neural distinguisher is
able to achieve and the mean absolute distance between the output difference distribution
and the uniform one, which is essentially the accuracy of a DDT-based distinguisher.
For a whole class of ciphers, we were able to see that only differential features can be
used, assuming independent round keys. Furthermore, we have seen examples of ciphers
not belonging to this class, where it is most likely that more than differential features are
used by the differential-neural distinguishers presented in this work.
While the skewness of the encryption distribution most likely needs to be quite pro-
nounced for the neural network to learn, this does not diminish the fact that differential-
neural distinguishers seem to be able to learn a good approximation of the ciphertext-
difference distribution, which, especially for greater block sizes, should be much faster
than calculating the corresponding part of the DDT.
Finally, we were able to rectify results from the literature for using multiple ciphertext-pairs
at once, showing that the claimed improvements are at best marginal, if not non-existent.
This is especially important as the approach of combining distinguisher responses is by
no means new, but is already used in the underlying work[Goh19], be it only implicitly.
We hope that going forward, this gives authors a straightforward way of comparing such
distinguishers to single-pair ones.
Discussion One downside of neural network based cryptanalysis is without doubt the limited explainability of the used features. It would be desirable to understand the neural distinguisher's decisions in order to gain insights for (classical) cryptanalysis and cipher design. While there does exist work in this direction (e.g. [BGPT21]), it is (so far) distinguisher (and therefore cipher) specific. On the other hand, we were able to show that for a great number of ciphers any distinguisher in the differential setting we discussed here can at most make use of the ciphertext-difference distribution, which completely
explains the features a neural network could use. For such ciphers, we can even see
differential-neural distinguishers as a tool providing (in a reasonable amount of time) a
lower bound on the mean absolute distance of the ciphertext-difference distribution and
the uniform one, but with the drawback that it may be unclear how tight this bound is.
In addition, even if unknown features get used by a machine learning based distinguisher,
knowing of the existence of such features is at least a first step in gaining insights, and is
preferable over being unaware of such features.
distinguishers can indeed lead to greatly increased distinguishing performance, but it could
be that the dependencies would need to be more pronounced for this to be possible.
It could also be interesting to try to find the best input difference by classical means.
While it is possible that this is as hard as computing the DDT, given that the best input
difference for a differential-neural distinguisher seems to be the one that maximizes the
mean absolute distance between the ciphertext-difference distribution and the uniform
one, this could also prove useful for general security estimations, as it would especially be
a nice addition to just knowing the maximal distance, i. e. the maximal probability of a
differential, since an adversary is by no means limited to one ciphertext-difference only.
Finally, for ciphers like Simon32/64, where differential-neural distinguishers are unlikely
to exceed a distinguisher based on the DDT, and where it is feasible to compute the needed
part of the DDT, it could be interesting to see if an attack that basically replaces the
differential-neural distinguisher by one based on the DDT in the attack from [Goh19]
can be competitive with existing attacks. While evaluating the distinguisher could be
a problem for the overall time complexity, making use of all possible output differences
should be superior to using just a single one.
References
[AL13] Martin R. Albrecht and Gregor Leander. An all-in-one approach to differential
cryptanalysis for small block ciphers. In Lars R. Knudsen and Huapeng Wu,
editors, SAC 2012, volume 7707 of LNCS, pages 1–15. Springer, Heidelberg,
August 2013.
[ALLW13] Farzaneh Abed, Eik List, Stefan Lucks, and Jakob Wenzel. Differential and
linear cryptanalysis of reduced-round Simon. Cryptology ePrint Archive,
Report 2013/526, 2013. https://eprint.iacr.org/2013/526.
[ALLW15] Farzaneh Abed, Eik List, Stefan Lucks, and Jakob Wenzel. Differential
cryptanalysis of round-reduced Simon and Speck. In Carlos Cid and Christian
Rechberger, editors, FSE 2014, volume 8540 of LNCS, pages 525–545. Springer,
Heidelberg, March 2015.
[ASY+ 19] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori
Koyama. Optuna: A next-generation hyperparameter optimization framework.
In Ankur Teredesai, Vipin Kumar, Ying Li, Rómer Rosales, Evimaria
Terzi, and George Karypis, editors, Proceedings of the 25th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining, KDD 2019,
Anchorage, AK, USA, August 4-8, 2019, pages 2623–2631. ACM, 2019.
[BBCD21] Anubhab Baksi, Jakub Breier, Yi Chen, and Xiaoyang Dong. Machine
learning assisted differential distinguishers for lightweight ciphers. In Design,
Automation & Test in Europe Conference & Exhibition, DATE 2021, Grenoble,
France, February 1-5, 2021, pages 176–181. IEEE, 2021.
[BBP22] Nicoleta-Norica Bacuieti, Lejla Batina, and Stjepan Picek. Deep neural
networks aiding cryptanalysis: A case study of the speck distinguisher. In
Giuseppe Ateniese and Daniele Venturi, editors, Applied Cryptography and
Network Security - 20th International Conference, ACNS 2022, Rome, Italy,
June 20-23, 2022, Proceedings, volume 13269 of Lecture Notes in Computer
Science, pages 809–829. Springer, 2022.
[Bei16] Christof Beierle. Pen and paper arguments for SIMON and SIMON-like
designs. In Vassilis Zikas and Roberto De Prisco, editors, SCN 16, volume
9841 of LNCS, pages 431–446. Springer, Heidelberg, August / September
2016.
[Ber08] Daniel J. Bernstein. ChaCha, a variant of Salsa20, 2008. http://cr.yp.to/
chacha/chacha-20080128.pdf.
[BGL+ 21] Zhenzhen Bao, Jian Guo, Meicheng Liu, Li Ma, and Yi Tu. Conditional
differential-neural cryptanalysis. Cryptology ePrint Archive, Report 2021/719,
2021. https://eprint.iacr.org/2021/719.
[BGPT21] Adrien Benamira, David Gérault, Thomas Peyrin, and Quan Quan Tan. A
deeper look at machine learning-based cryptanalysis. In Anne Canteaut and
François-Xavier Standaert, editors, EUROCRYPT 2021, Part I, volume 12696
of LNCS, pages 805–835. Springer, Heidelberg, October 2021.
[BJK+ 16] Christof Beierle, Jérémy Jean, Stefan Kölbl, Gregor Leander, Amir Moradi,
Thomas Peyrin, Yu Sasaki, Pascal Sasdrich, and Siang Meng Sim. The
SKINNY family of block ciphers and its low-latency variant MANTIS. In
Matthew Robshaw and Jonathan Katz, editors, CRYPTO 2016, Part II,
volume 9815 of LNCS, pages 123–153. Springer, Heidelberg, August 2016.
[BKL+ 07] Andrey Bogdanov, Lars R. Knudsen, Gregor Leander, Christof Paar, Axel
Poschmann, Matthew J. B. Robshaw, Yannick Seurin, and C. Vikkelsoe.
PRESENT: An ultra-lightweight block cipher. In Pascal Paillier and Ingrid
Verbauwhede, editors, CHES 2007, volume 4727 of LNCS, pages 450–466.
Springer, Heidelberg, September 2007.
[Bre17] Jakub Breier. ChaCha Quarter-Round. https://www.iacr.org/authors/
tikz/, 2017.
[BSS+ 13] Ray Beaulieu, Douglas Shors, Jason Smith, Stefan Treatman-Clark, Bryan
Weeks, and Louis Wingers. The SIMON and SPECK families of lightweight
block ciphers. Cryptology ePrint Archive, Report 2013/404, 2013. https:
//eprint.iacr.org/2013/404.
[CSYY22] Yi Chen, Yantian Shen, Hongbo Yu, and Sitong Yuan. A New Neural
Distinguisher Considering Features Derived From Multiple Ciphertext Pairs.
The Computer Journal, 03 2022. bxac019.
[DDK09] Christophe De Cannière, Orr Dunkelman, and Miroslav Knežević. KATAN
and KTANTAN - a family of small and efficient hardware-oriented block
ciphers. In Christophe Clavier and Kris Gaj, editors, CHES 2009, volume
5747 of LNCS, pages 272–288. Springer, Heidelberg, September 2009.
[FLS+ 10] Niels Ferguson, Stefan Lucks, Bruce Schneier, Doug Whiting, Mihir Bellare,
Tadayoshi Kohno, Jon Callas, and Jesse Walker. The skein hash function
family, 2010.
[Goh19] Aron Gohr. Improving attacks on round-reduced Speck32/64 using deep learning.
In Alexandra Boldyreva and Daniele Micciancio, editors, CRYPTO 2019,
Part II, volume 11693 of LNCS, pages 150–179. Springer, Heidelberg, August
2019.
[HRC21a] Zezhou Hou, Jiongjiong Ren, and Shaozhen Chen. Cryptanalysis of round-
reduced SIMON32 based on deep learning. Cryptology ePrint Archive, Report
2021/362, 2021. https://eprint.iacr.org/2021/362.
[HRC21b] Zezhou Hou, Jiongjiong Ren, and Shaozhen Chen. Improve neural distinguisher
for cryptanalysis. Cryptology ePrint Archive, Report 2021/1017, 2021. https:
//eprint.iacr.org/2021/1017.
[HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings
in deep residual networks. CoRR, abs/1603.05027, 2016.
[JKM20] Aayush Jain, Varun Kohli, and Girish Mishra. Deep learning based differential
distinguisher for lightweight cipher PRESENT. Cryptology ePrint Archive,
Report 2020/846, 2020. https://eprint.iacr.org/2020/846.
[KLT15] Stefan Kölbl, Gregor Leander, and Tyge Tiessen. Observations on the SIMON
block cipher family. In Rosario Gennaro and Matthew J. B. Robshaw, editors,
CRYPTO 2015, Part I, volume 9215 of LNCS, pages 161–185. Springer,
Heidelberg, August 2015.
[LLL+ 22] Jinyu Lu, Guoqiang Liu, Yunwen Liu, Bing Sun, Chao Li, and Li Liu. Improved
neural distinguishers with (related-key) differentials: Applications in SIMON
and SIMECK. Cryptology ePrint Archive, Report 2022/030, 2022. https:
//eprint.iacr.org/2022/030.
[LM02] Helger Lipmaa and Shiho Moriai. Efficient algorithms for computing differential
properties of addition. In Mitsuru Matsui, editor, FSE 2001, volume 2355 of
LNCS, pages 336–350. Springer, Heidelberg, April 2002.
[MP13] Nicky Mouha and Bart Preneel. Towards finding optimal differential characteristics
for ARX: Application to Salsa20. Cryptology ePrint Archive, Report
2013/328, 2013. https://eprint.iacr.org/2013/328.
[MWGP11] Nicky Mouha, Qingju Wang, Dawu Gu, and Bart Preneel. Differential and
linear cryptanalysis using mixed-integer linear programming. In Chuankun Wu,
Moti Yung, and Dongdai Lin, editors, Information Security and Cryptology -
7th International Conference, Inscrypt 2011, Beijing, China, November 30 -
December 3, 2011. Revised Selected Papers, volume 7537 of Lecture Notes in
Computer Science, pages 57–76. Springer, 2011.
[SHW+ 14] Siwei Sun, Lei Hu, Meiqin Wang, Peng Wang, Kexin Qiao, Xiaoshuang Ma,
Danping Shi, Ling Song, and Kai Fu. Towards finding the best characteristics
of some bit-oriented block ciphers and automatic enumeration of (related-key)
differential and linear characteristics with predefined properties. Cryptology
ePrint Archive, Report 2014/747, 2014. https://eprint.iacr.org/2014/
747.
[SKSS16] Anish Shah, Eashan Kadam, Hena Shah, and Sameer Shinde. Deep residual
networks with exponential linear unit. CoRR, abs/1604.04112, 2016.
[SLJ+ 14] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich.
Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[SLJ+ 15] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich.
Going deeper with convolutions. In 2015 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.
[SNPP19] Stefan Schubert, Peer Neubert, Johannes Pöschmann, and Peter Protzel.
Circular convolutional neural networks for panoramic images and laser data.
In 2019 IEEE Intelligent Vehicles Symposium (IV), pages 653–660, 2019.
[SZM20] Heng-Chuan Su, Xuan-Yong Zhu, and Duan Ming. Polytopic attack on round-
reduced Simon32/64 using deep learning. In Yongdong Wu and Moti Yung,
editors, Information Security and Cryptology - 16th International Conference,
Inscrypt 2020, Guangzhou, China, December 11-14, 2020, Revised Selected
Papers, volume 12612 of Lecture Notes in Computer Science, pages 3–20.
Springer, 2020.
[XZBL16] Zejun Xiang, Wentao Zhang, Zhenzhen Bao, and Dongdai Lin. Applying
MILP method to searching integral distinguishers based on division property
for 6 lightweight block ciphers. In Jung Hee Cheon and Tsuyoshi Takagi,
editors, ASIACRYPT 2016, Part I, volume 10031 of LNCS, pages 648–678.
Springer, Heidelberg, December 2016.
[ZWW22] Liu Zhang, Zilong Wang, and Boyang Wang. Improving differential-neural
cryptanalysis with inception blocks. Cryptology ePrint Archive, Report
2022/183, 2022. https://eprint.iacr.org/2022/183.
Learning Rates and Learning Rate Schedules In [Goh19], a cyclic LR schedule is used which
starts at $2 \cdot 10^{-3}$ (to which we refer as the high LR), linearly drops to $10^{-4}$ (the low LR)
over 10 epochs, and then jumps back to $2 \cdot 10^{-3}$, completing the cycle. Looking at
the validation accuracy and loss during training, we conjectured that returning to
the high LR of $2 \cdot 10^{-3}$ may prevent the neural network from refining its weights, as it
seems to regularly get pushed out of a (local and approximate) minimum of the loss function.
While this can be useful for finding a better minimum, it could be beneficial to commit to
one minimum at some point. With this reasoning, we tried the LR schedules depicted in
Fig. 13, in addition to constant ones. In the end, we did not find the choice of a different
LR schedule to give any significant improvement.
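For illustration, the cyclic schedule described above can be written as a simple per-epoch rule; the following is a minimal sketch against the Keras callback API, not the original training code of [Goh19], and the function names are ours.

```python
import tensorflow as tf

def cyclic_lr(high=2e-3, low=1e-4, period=10):
    """Cyclic schedule: start at `high`, decay linearly to `low` over `period`
    epochs, then jump back to `high` and repeat."""
    def schedule(epoch, lr=None):
        t = epoch % period  # position within the current cycle
        return high - (high - low) * t / (period - 1)
    return schedule

# Hypothetical usage with an already compiled Keras model `model`:
# model.fit(x, y, epochs=30,
#           callbacks=[tf.keras.callbacks.LearningRateScheduler(cyclic_lr())])
```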
Figure 13: Examples of tried LR schedules, compared to the cyclic one used in [Goh19]
(original).
Figure 14: The residual block designs we tested, each mapping an input $x_i$ to an output $x_{i+1}$ via a skip connection.
Residual Block Design We tested a variety of different residual blocks, see Fig. 14, all
using two convolutional layers. In addition, we also tried increasing or decreasing the number of
convolutional layers using the original design shown in Fig. 14a. Here, the number of
convolutions refers to the number of applications of the composition of convolutional
layer, batch normalization and activation.
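As a sketch, a residual block of the kind described above (two compositions of convolutional layer, batch normalization and activation, plus a skip connection) could look as follows in Keras; the number of filters and the filter size `ks` are placeholders, not the settings from our experiments.

```python
from tensorflow.keras import layers

def residual_block(x, filters=32, ks=3):
    """Two conv -> batch norm -> activation compositions with a skip connection.
    Assumes the input `x` already has `filters` channels so the addition is valid."""
    y = layers.Conv1D(filters, ks, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv1D(filters, ks, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    return layers.Add()([x, y])  # x_{i+1} = F(x_i) + x_i
```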
Inception Given the importance of the filter size, we also tried to use Inception[SLJ+ 14],
which allows multiple filter sizes to be used at the same time, in a way enabling the network
to learn the filter size. A graphic representation of our implementation can be found in
Fig. 15. Since calculating multiple convolutional layers with different filter sizes instead
of a single convolutional layer with a fixed filter size is computationally more expensive, the
authors of [SLJ+ 14] proposed to reduce the number of filters for the layers with a filter
size greater than one by placing additional convolutional layers with filter size one and fewer
filters before them.
Since this increased the training time, but not the accuracy, we found it more efficient
to search directly for the optimal filter size using a hyper-parameter search.
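For reference, an Inception-style residual block in the spirit of Fig. 15 could be sketched as follows; the chosen filter sizes, the reduction width and the final projection are illustrative assumptions, not the exact architecture of Fig. 15.

```python
from tensorflow.keras import layers

def inception_branch(x, filters, ks):
    """One branch: optional 1-wide reduction convolution, then a convolution of size `ks`."""
    if ks > 1:
        x = layers.Conv1D(filters // 2, 1, padding="same", activation="relu")(x)  # (Red.)
    return layers.Conv1D(filters, ks, padding="same")(x)

def inception_block(x, filters=32):
    """Branches with different filter sizes, concatenated, normalized and activated,
    projected back to `filters` channels and added to the input (residual connection)."""
    y = layers.Concatenate()([inception_branch(x, filters, ks) for ks in (1, 3, 5)])
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv1D(filters, 1, padding="same")(y)  # project back to the input width
    return layers.Add()([x, y])                       # assumes `x` has `filters` channels
```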
Circular Convolutions Since many of the ciphers under scrutiny make use of some form of
cyclic bit shift (i. e. rotation), a cyclic representation of the input seems natural. To achieve
this, Schubert et al. [SNPP19] propose to pad the input with data from the opposing side,
instead of with zeros. We adapted this approach to the neural network from [Goh19] and
found it to yield small improvements. In total, however, those improvements also seem
attainable by tweaking the remaining hyper-parameters.
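A minimal sketch of this wrap-around padding, assuming an odd filter size; the helper below is ours and only illustrates the idea, it is not the code of [SNPP19] or [Goh19].

```python
import tensorflow as tf
from tensorflow.keras import layers

def circular_conv1d(x, filters, ks):
    """Pad the sequence dimension with data from the opposite side ("circular" padding),
    then apply a valid convolution so that the output length equals the input length.
    Assumes `ks` is odd and `x` has shape (batch, length, channels)."""
    pad = (ks - 1) // 2
    if pad > 0:
        x = tf.concat([x[:, -pad:, :], x, x[:, :pad, :]], axis=1)  # wrap-around padding
    return layers.Conv1D(filters, ks, padding="valid")(x)
```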
Figure 15: Residual block using the Inception model. Convolutional layers marked with
(Red.) are used for filter reduction and use a filter size of one. All other convolutional layers
have their filter size written in parentheses. For max. pooling, the number in parentheses
represents the used pool size.
Batch Normalization The original network from [Goh19] already uses batch normalization.
We therefore wanted to find out what effect it has on the performance of the neural
distinguisher. As batch normalization after convolutions is already part of the residual
block design, we tested removing batch normalization only after the densely connected layers,
and found it to be quite essential there.
Residual Scaling The residual block takes some input $x$, calculates some function $F$ on
it and adds the input to the result. While this can enable deeper neural networks in the
first place, our idea is to control the impact of $F$ by scaling it accordingly, i. e. instead of
calculating $F(x) + x$, we calculate $\lambda \cdot F(x) + x$ for a scalar $\lambda$. This was found to have only
a limited effect.
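In a Keras-style residual block, this scaling is a one-line change; $\lambda$ is treated as a fixed hyper-parameter in the sketch below (one could also make it trainable), and the helper itself is ours.

```python
from tensorflow.keras import layers

def scaled_residual(x, f_x, lam=0.3):
    """Return lambda * F(x) + x instead of F(x) + x; `lam` controls the impact of F."""
    return layers.Add()([x, layers.Lambda(lambda t: lam * t)(f_x)])
```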
Bit Encoding In [Goh19] the samples are represented as arrays of bits, encoded as zero
and one. But note that most layers of the neural network just perform matrix multiplication
over real numbers. Therefore, using zero as one possible encoding could have the drawback
that every multiplication with zero gives zero. Based on this observation, we tried encoding
the bits as −1 and 1, as well as 1 and 2, but did not find this to make any difference.
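The alternative encodings amount to a simple affine remapping of the 0/1 input arrays, e.g. (toy data, not our training samples):

```python
import numpy as np

bits = np.random.randint(0, 2, size=(4, 64)).astype(np.float32)  # toy 0/1 samples

enc_pm1  = 2.0 * bits - 1.0  # encode the bits as -1 and 1
enc_one2 = bits + 1.0        # encode the bits as 1 and 2
```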
of the first shift register and $(x^r_{13}, \ldots, x^r_{31})$ the content of the second one. If we look at the
structure of Katan in Fig. 5, we know that
\[
x^{r-j}_i = x^r_{i+j} \quad \text{for } 0 \le i \le 12 - j \qquad \text{and} \qquad
x^{r-j}_i = x^r_{i+j} \quad \text{for } 13 \le i \le 31 - j.
\]
Let further $(k_a^r, k_b^r) \in \mathbb{F}_2^2$ be the round key and $IR^r \in \mathbb{F}_2$ the round constant of round $r$.
Hence
\[
\begin{aligned}
x^{r-1}_{12} &= x^r_{13} \oplus x^{r-1}_8 \cdot x^{r-1}_5 \oplus x^{r-1}_7 \oplus x^{r-1}_3 \cdot IR^r \oplus k_a^r \quad \text{and} \\
x^{r-1}_{31} &= x^r_0 \oplus x^{r-1}_{16} \cdot x^{r-1}_{21} \oplus x^{r-1}_{20} \oplus x^{r-1}_{23} \cdot x^{r-1}_{25} \oplus k_b^r.
\end{aligned}
\]
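A minimal sketch of how these two equations can be used to revert a single round of Katan32, assuming the round key bits $k_a^r$, $k_b^r$ and the round constant $IR^r$ are known; the bit indexing follows the notation above, and the function is ours, not part of any reference implementation.

```python
def katan32_revert_round(state, ka, kb, ir):
    """Revert one round of Katan32.
    `state` holds the 32 state bits x^r_0,...,x^r_31 (x_0..x_12: first register,
    x_13..x_31: second register); returns x^{r-1}_0,...,x^{r-1}_31 as a list of bits."""
    prev = [0] * 32
    # undo the shift: x^{r-1}_i = x^r_{i+1} within each register
    prev[0:12] = state[1:13]
    prev[13:31] = state[14:32]
    # recompute the two bits that were shifted out, using the equations above
    prev[12] = state[13] ^ (prev[8] & prev[5]) ^ prev[7] ^ (prev[3] & ir) ^ ka
    prev[31] = state[0] ^ (prev[16] & prev[21]) ^ prev[20] ^ (prev[23] & prev[25]) ^ kb
    return prev
```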
Therefore, we know every variable except the round keys for the last four rounds, which
means that we can calculate
\[
x^{r-4}_{9+i} \oplus k_a^{r-i} \quad \text{for } 0 \le i \le 3 \qquad \text{and} \qquad
x^{r-4}_{28+i} \oplus k_b^{r-i} \quad \text{for } 0 \le i \le 3.
\]
In other words, we know the state of four rounds before (up to the eight key bits). But
since we only know $x^{r-4}_9 \oplus k_a^r$ and not $x^{r-4}_9$ itself, we can no longer calculate $x^{r-5}_{12}$, as it
would require us to know $x^{r-5}_8 = x^{r-4}_9$. On the other hand, it is still possible to calculate
at least parts of some more rounds backwards. While we do not know $x^{r-5}_8$ and therefore
cannot revert the addition of $x^{r-5}_8 \cdot x^{r-5}_5$ to $x^{r-5}_{12}$, we can still revert the addition of
$x^{r-5}_7 \oplus x^{r-5}_3 \cdot IR^{r-4}$. Furthermore, we can still calculate $x^{r-5}_{31} \oplus k_b^{r-4}$ and $x^{r-6}_{31} \oplus k_b^{r-5}$,
before $x^{r-7}_{23} \cdot x^{r-7}_{25}$ becomes unknown. Hence, partly reverting six rounds gives
\[
\begin{aligned}
& x^{r-6}_i && \text{for } 0 \le i \le 6, \\
& x^{r-6}_i \oplus k_a^{r+7-i} && \text{for } 7 \le i \le 10, \\
& x^{r-6}_i \oplus x^{r+6-i}_8 \cdot x^{r+6-i}_5 \oplus k_a^{r+7-i} && \text{for } 11 \le i \le 12, \\
& x^{r-6}_i && \text{for } 13 \le i \le 25 \quad \text{and} \\
& x^{r-6}_i \oplus k_b^{r+26-i} && \text{for } 26 \le i \le 31.
\end{aligned}
\]
Continuing in the same way, partly reverting nine rounds gives
\[
\begin{aligned}
& x^{r-9}_i && \text{for } 0 \le i \le 3, \\
& x^{r-9}_i \oplus k_a^{r+4-i} && \text{for } 4 \le i \le 7, \\
& x^{r-9}_i \oplus x^{r+3-i}_8 \cdot x^{r+3-i}_5 \oplus k_a^{r+4-i} && \text{for } 8 \le i \le 12, \\
& x^{r-9}_i && \text{for } 13 \le i \le 22, \\
& x^{r-9}_i \oplus k_b^{r+23-i} && \text{for } 23 \le i \le 28 \quad \text{and} \\
& x^{r-9}_i \oplus x^{r+22-i}_{23} \cdot x^{r+22-i}_{25} \oplus k_b^{r+23-i} && \text{for } 29 \le i \le 31,
\end{aligned}
\]
and partly reverting ten rounds gives
\[
\begin{aligned}
& x^{r-10}_i && \text{for } 0 \le i \le 2, \\
& x^{r-10}_i \oplus k_a^{r+3-i} && \text{for } 3 \le i \le 6, \\
& x^{r-10}_i \oplus x^{r+2-i}_8 \cdot x^{r+2-i}_5 \oplus k_a^{r+3-i} && \text{for } 7 \le i \le 11, \\
& x^{r-10}_i \oplus x^{r-10}_8 \cdot x^{r-10}_5 \oplus x^{r-10}_7 \oplus x^{r-10}_3 \cdot IR^{r-9} \oplus k_a^{r-9} && \text{for } i = 12, \\
& x^{r-10}_i && \text{for } 13 \le i \le 21, \\
& x^{r-10}_i \oplus k_b^{r+22-i} && \text{for } 22 \le i \le 27 \quad \text{and} \\
& x^{r-10}_i \oplus x^{r+21-i}_{23} \cdot x^{r+21-i}_{25} \oplus k_b^{r+22-i} && \text{for } 28 \le i \le 31.
\end{aligned}
\]
We can now continue reverting the addition of $x^{r-i}_{20}$ for three additional rounds, until all
values within the first register are masked by a key, i. e.
\[
\begin{aligned}
& x^{r-13}_i \oplus k_a^{r-i} && \text{for } 0 \le i \le 3, \\
& x^{r-13}_i \oplus x^{r-1-i}_8 \cdot x^{r-1-i}_5 \oplus k_a^{r-i} && \text{for } 4 \le i \le 8, \\
& x^{r-13}_i \oplus x^{r-1-i}_8 \cdot x^{r-1-i}_5 \oplus x^{r-1-i}_7 \oplus x^{r-1-i}_3 \cdot IR^{r-i} \oplus k_a^{r-i} && \text{for } 9 \le i \le 12, \\
& x^{r-13}_i && \text{for } 13 \le i \le 18, \\
& x^{r-13}_i \oplus k_b^{r+19-i} && \text{for } 19 \le i \le 24, \\
& x^{r-13}_i \oplus x^{r+18-i}_{23} \cdot x^{r+18-i}_{25} \oplus k_b^{r+19-i} && \text{for } 25 \le i \le 28 \quad \text{and} \\
& x^{r-13}_i \oplus x^{r+18-i}_{16} \cdot x^{r+18-i}_{21} \oplus x^{r+18-i}_{23} \cdot x^{r+18-i}_{25} \oplus k_b^{r+19-i} && \text{for } 29 \le i \le 31.
\end{aligned}
\]
Similarly, partly reverting 17 rounds gives
\[
\begin{aligned}
& x^{r-17}_i \oplus x^{r-5-i}_8 \cdot x^{r-5-i}_5 \oplus k_a^{r-4-i} && \text{for } 0 \le i \le 4, \\
& x^{r-17}_i \oplus x^{r-5-i}_8 \cdot x^{r-5-i}_5 \oplus x^{r-5-i}_7 \oplus x^{r-5-i}_3 \cdot IR^{r-4-i} \oplus k_a^{r-4-i} && \text{for } 5 \le i \le 12, \\
& x^{r-17}_i && \text{for } 13 \le i \le 14, \\
& x^{r-17}_i \oplus k_b^{r+15-i} && \text{for } 15 \le i \le 20, \\
& x^{r-17}_i \oplus x^{r+14-i}_{23} \cdot x^{r+14-i}_{25} \oplus k_b^{r+15-i} && \text{for } 21 \le i \le 24, \\
& x^{r-17}_i \oplus x^{r+14-i}_{16} \cdot x^{r+14-i}_{21} \oplus x^{r+14-i}_{23} \cdot x^{r+14-i}_{25} \oplus k_b^{r+15-i} && \text{for } 25 \le i \le 27 \quad \text{and} \\
& x^{r-17}_i \oplus x^{r+14-i}_{16} \cdot x^{r+14-i}_{21} \oplus x^{r+14-i}_{23} \cdot x^{r+14-i}_{25} \oplus k_b^{r+15-i} \oplus k_a^{r+28-i} && \text{for } 28 \le i \le 31.
\end{aligned}
\]
From this point on, we cannot even revert parts of the round function. Hence, this is the state
we refer to when we talk about ciphertexts calculated back.