Zhanwen Chen^a, Jiageng Chen^{a,∗}, Weizhi Meng^b, Je Sen Teh^c, Pei Li^a, Bingqing Ren^d
a School of Computer, Central China Normal University, China
b Department of Applied Mathematics and Computer Science, Technical University of
Denmark, Denmark
c School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia
d Central China Normal University Wollongong Joint Institute, China
Abstract
∗ Corresponding author
Email address: chinkako@gmail.com (Jiageng Chen)
the computational complexity for identifying differentials.
Differential characteristic. A differential characteristic for a single round can be represented by a pair (α, β), where α denotes the input difference and β denotes the output difference such that α leads to β over one round (denoted by $\alpha \xrightarrow{R} \beta$). Differential characteristics with high probability $P(\alpha \xrightarrow{R} \beta)$ can be exploited in a statistical attack. For multiple rounds, the characteristics of each round are concatenated to produce a specific differential path $\alpha \xrightarrow{R} \delta_1 \xrightarrow{R} \delta_2 \xrightarrow{R} \cdots \xrightarrow{R} \beta$. The probability of concatenating these characteristics is the product of the probabilities of each single-round characteristic. When performing traditional differential cryptanalysis, most researchers identify differential characteristics with high probability, as Wang et al do in [7], whereby certain differences reappear after several rounds with high probability (called iterative characteristics). These iterative characteristics help reduce the workload of cryptanalysis. However, iterative characteristics may not always be found and, more importantly, a differential characteristic considers only one specific path from start to end, neglecting information from other paths that could potentially expose more severe weaknesses of a block cipher.
Differential. When analyzing a block cipher, attackers only know the plaintext and ciphertext differences, and nothing else. This means that for a 3-round block cipher, attackers require knowledge of the differential $\alpha \xrightarrow{R} ? \xrightarrow{R} ? \xrightarrow{R} \beta$, where '?' denotes the unknown and irrelevant difference values of the intermediate rounds. This is called a differential; it contains all the characteristics with the same input and output differences. Conventionally, an individual differential characteristic restricts the analysis to a specific differential path with a fixed evolution procedure. We can overcome this disadvantage by clustering all the characteristics, and the probability $P(\alpha \xrightarrow{R_n} \beta)$ can then be calculated in the following way:
\[
P(\alpha \xrightarrow{R_n} \beta) = \sum_{\delta_n} \cdots \sum_{\delta_2} \sum_{\delta_1} P(\alpha \to \delta_1 \to \delta_2 \to \cdots \to \delta_n \to \beta) \tag{1}
\]
Such a probability is also the theoretical probability of β's appearance for input difference α when collecting samples to recover keys. From the viewpoint of a cipher designer, preventing attacks based on individual characteristics does not guarantee sufficient security. Instead, more emphasis should be given to preventing attacks against differentials.
Full distribution. Based on the differential, we want to go even further and obtain a more precise result for cryptanalysis. Blondeau et al investigate the statistical accuracy of multiple differential cryptanalysis in [6]. They use more than one differential to allow the attacker to extract more information from the samples, because for a given input difference there exist multiple possible output differences after a large number of rounds. If we take advantage of all the output differences for some fixed input, we call this a full distribution. It is more effective because each input difference leads to a unique distribution which includes all the differential information and helps an attacker analyse the cipher. By using statistical approaches such as the χ² or LLR test, we can obtain a stronger distinguisher.
Although using the full distribution leads to a more powerful cryptanalysis approach, its computational cost increases rapidly. Therefore, distinguishers based on differential characteristics are still used as the main approach, since they are usually within our computational capabilities, making them a practical solution compared to the full differential distribution approach. To compute the full distribution in a practical manner, parallel cryptanalysis can be adopted. We have attempted to compute the full distribution on a computer with 128 logical CPU cores, but it still takes much time. Therefore we leverage the GPU's parallelism to mitigate this problem. [8] used the PlayStation 3 to solve ECDLPs thanks to the game console's graphics processing ability. [9], [10], [11], [12] have shown that the GPU has been accepted as a new implementation platform to improve the performance of block ciphers. The results of [13] and [14] show that the GPU possesses much more powerful parallel performance than the CPU and can greatly accelerate encryption algorithms. The cryptanalysis field has also started to explore the potential of this platform, as in [15], [16], where theoretically existing attacks that used to be hard to implement in the real world can eventually be achieved thanks to the GPU. Parallel computing on GPUs has also gained wide adoption in other areas such as deep learning and Bitcoin mining. Compared with a CPU framework, the GPU supports more parallel threads and is cheaper. Thus, we propose a general method that achieves full distribution differential cryptanalysis on the GPU for the SPN (substitution-permutation network) structure.
Our contribution. In this paper we take advantage of GPU parallel computing power to speed up the computation of differential distinguishers for SPN ciphers. First, we introduce an algorithm to calculate the full distribution of SPN ciphers by parallel computation. Based on this algorithm, three upper-layer methods are introduced to solve two problems: limited GPU memory space, and achieving efficient attacks on large-sized SPN block ciphers. For our experiments, we chose PRESENT [17], an ultra-lightweight SPN cipher designed by Bogdanov et al, as the target cipher, and we reduce its block size in order to calculate its security margin based on the full differential distribution. We then evaluate the performance of our cryptanalysis. We use a reduced block size because searching for the full distribution of the original PRESENT with its 64-bit block size is still impractical even with the help of GPU power. Hence we reduce the block size to 8, 12, 16, 20, 24 and 28 bits so that the search space of the distribution is reduced. By studying the security margins of the reduced versions we can obtain evidence that helps us predict the security margin of PRESENT. Based on experimental results, we found that the GPU approach leads to a significant advantage over regular CPU computation.
Outline. Section 2 introduces background information regarding differential cryptanalysis and GPU programming. In Section 3 we propose the general GPU-based algorithm for a full differential distribution search. Section 4 introduces three upper-layer methods based on the proposed base algorithm; they are used to solve the problem of inadequate memory space and to improve computational efficiency. Performance analysis is given in Section 6, where the computational costs of the GPU and the CPU are compared. Finally, we conclude the paper in Section 7.
2. Preliminaries
Figure 1 shows the structure of the original PRESENT cipher. Although the standard PRESENT cipher is widely considered lightweight, its 64-bit block size is still too large for our experimental purposes. Fortunately, PRESENT follows a very symmetric design structure, so we can tweak the cipher by shrinking the block size without changing its properties. In fact, toy versions of PRESENT have already been proposed for this purpose [18]. In our experiments we reduce the block size to 8, 12, 16, 20 and 24 bits.
For a block cipher with block size b and r rounds, the difference (plaintext) space has size $N = 2^b$. We denote all differences as $\delta_1, \delta_2, \dots, \delta_N$. Computing the full distribution means that, while doing differential cryptanalysis, we need to calculate N differentials after r rounds (assume the input difference is $\delta_2$):
\[
\delta_2 \to \delta_1,\quad \delta_2 \to \delta_2,\quad \dots,\quad \delta_2 \to \delta_N \tag{2}
\]
After the calculation we acquire a two-dimensional array of length N whose elements are $(\delta_i, P(\delta_i))$; this array represents the full distribution. Then a statistical test inspired by [19] is applied to calculate the data complexity of the distribution. The statistical test builds a distinguisher between two distributions $D_0$ and $D_1$, where $D_0$ is the full distribution obtained from the experiment and $D_1$ is the uniform distribution. We then calculate the data complexity n as follows:
\[
n = \frac{d}{\sum_{z \in Z} \Delta_z^2 / p_z} \approx \frac{d}{2\,D(D_0 \| D_1)} \tag{3}
\]
where $\Delta_z = \Pr_{D_0}[z] - \Pr_{D_1}[z]$ and $p_z = \Pr_{D_1}[z]$. The error probability of the distinguisher is $P_e \approx \Phi(-\sqrt{d}/2)$. In our test, we set $P_e$ to 0.1. D is the Kullback–Leibler distance, which is calculated by:
\[
D(D_0 \| D_1) = \sum_{z \in Z} \Pr_{D_0}[z] \log \frac{\Pr_{D_0}[z]}{\Pr_{D_1}[z]} \tag{4}
\]
The data complexity n indicates how many samples the distinguisher needs in order to distinguish between the two distributions. Intuitively, n increases with the round number r. The security margin is defined as follows: for a cipher with block size b and round number r, if $n > 2^b$ then the test needs more samples than the whole sample space contains. In other words, we cannot distinguish the cipher's distribution from a theoretical uniform distribution. The smallest value of r that yields such an indistinguishable case can be used to evaluate the security margin.
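To make this evaluation concrete, the following host-side helper computes the Kullback–Leibler distance of Equation (4) and the resulting data complexity estimate of Equation (3). It is only a minimal sketch under our own assumptions: the distribution is stored as a dense array of $2^b$ probabilities, the natural logarithm is used, and the function and variable names are ours rather than from the original implementation.
\begin{verbatim}
#include <cmath>
#include <cstddef>

// Kullback-Leibler distance D(D0 || D1) between the observed distribution p0
// (an array of n = 2^b probabilities) and the uniform distribution over the
// same space. Entries with probability 0 contribute nothing to the sum.
double kl_to_uniform(const double *p0, std::size_t n) {
    const double uniform = 1.0 / static_cast<double>(n);
    double d = 0.0;
    for (std::size_t z = 0; z < n; ++z)
        if (p0[z] > 0.0)
            d += p0[z] * std::log(p0[z] / uniform);   // natural log
    return d;
}

// Data complexity estimate n ~= d / (2 * D(D0 || D1)) from Equation (3).
// For the target error probability Pe = Phi(-sqrt(d)/2) = 0.1 one gets
// sqrt(d) ~= 2.56, i.e. d ~= 6.6.
double data_complexity(const double *p0, std::size_t n, double d = 6.6) {
    return d / (2.0 * kl_to_uniform(p0, n));
}
\end{verbatim}
The security margin check then amounts to comparing the base-2 logarithm of this estimate with the block size b.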
Figure 2: Organization of threads in GPU
Nvidia GPUs use SIMD (Single Instruction, Multiple Data) to improve efficiency: the same instructions are applied to different data. We exploit this feature by arranging the kernel function in the following way: the kernel function takes only one input difference and calculates the full distribution derived from this difference. Hence every thread is in charge of searching the full distribution for one input difference.
3.1. Search the full distribution of one input difference after one round
difference can be calculated as:
\[
P(\mathrm{Diff}) = \prod_i P(s_{i,x}) \tag{6}
\]
Permutation. Like the plaintext, the difference after the S-box layer can also be permuted. The same permutation rule is applied to the distribution obtained from the difference combination, and the differential distribution after this round is obtained:
\[
P(\delta_x) = \sum_i \bigl( P(\delta_i) \times P(\delta_i \to \delta_x) \bigr) \tag{7}
\]
Algorithm 1 shows the detailed process of searching the full distribution for one round; it is also the kernel function.
Algorithm 1 Search full distribution for one input difference after one round
Input: Input difference Diff; Probability of the input difference Pin ; Block size
l.
Output: Array A as differential distribution of ciphertext.
1: A[2l ] = [0..]
2: x1 , x2 , ... ← Diff //separate input for every S-Box
3: for all y1 in SP1 [x1 ] do
4: for all y2 in SP2 [x2 ] do
5: ...
6: for all yn in SPn [xn ] do
7: A[y1 ⊕ ... ⊕ yn ] ← Pin ∗ SP1 [x1 ][y1 ] ∗ SP2 [x2 ][y2 ] ∗ ...
8: end for
9: ...
10: end for
11: end for
For more rounds, the algorithm can be executed repeatedly, with Pin obtained from the result of the previous round. Finally, a full distribution that indicates the probability of every output difference is obtained.
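As a concrete illustration, the following CUDA sketch implements the idea of Algorithm 1 for an 8-bit toy SPN with two 4-bit S-boxes. It is only a minimal sketch under our own assumptions: both S-boxes share a single 16×16 difference distribution table ddt (probabilities, i.e. DDT counts divided by 16), the bit permutation is precomputed as a 256-entry table perm8, and all names are ours rather than the original implementation. The host loop reuses two device arrays exactly as in Figure 3.
\begin{verbatim}
#include <cuda_runtime.h>

// One thread per input difference (Algorithm 1 as a kernel function).
__global__ void one_round_kernel(const float *p_in, float *p_out,
                                 const float *ddt,           // 16x16 S-box DDT probabilities
                                 const unsigned char *perm8) // 256-entry bit-permutation table
{
    unsigned int diff = blockIdx.x * blockDim.x + threadIdx.x;
    if (diff >= 256) return;
    float p = p_in[diff];
    if (p == 0.0f) return;                              // nothing to propagate
    unsigned int x_lo = diff & 0xF, x_hi = diff >> 4;   // split input across the two S-boxes
    for (unsigned int y_lo = 0; y_lo < 16; ++y_lo) {
        float p_lo = ddt[x_lo * 16 + y_lo];
        if (p_lo == 0.0f) continue;
        for (unsigned int y_hi = 0; y_hi < 16; ++y_hi) {
            float p_hi = ddt[x_hi * 16 + y_hi];
            if (p_hi == 0.0f) continue;
            unsigned int out = perm8[(y_hi << 4) | y_lo]; // permute the combined difference
            // Different threads may hit the same output difference, so add atomically.
            atomicAdd(&p_out[out], p * p_lo * p_hi);
        }
    }
}

// Host-side round loop: two device arrays are reused, the output of one round
// becoming the input of the next (Figure 3). Returns the buffer holding the result.
float *full_distribution(float *d_a, float *d_b, const float *d_ddt,
                         const unsigned char *d_perm, int rounds)
{
    for (int r = 0; r < rounds; ++r) {
        cudaMemset(d_b, 0, 256 * sizeof(float));
        one_round_kernel<<<1, 256>>>(d_a, d_b, d_ddt, d_perm);
        cudaDeviceSynchronize();
        float *t = d_a; d_a = d_b; d_b = t;             // swap roles for the next round
    }
    return d_a;                                          // distribution after `rounds` rounds
}
\end{verbatim}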
Figure 3: Reusing two arrays to record full distribution
in cryptanalysis. Therefore we do not consider the case where the number of threads is insufficient. As can be seen in Figure 3, every thread calculates only one input difference. For a cipher with block size b, $2^b$ threads are needed for our cryptanalysis task. After the number of threads is decided, the block and grid can still be constructed in different ways. Based on the GPU features introduced in Section 2.3, we give two principles for thread organization:
Threads per block: an SM processes threads in units of a warp (32 threads). Therefore the number of threads per block should always be a multiple of 32.
Blocks per grid: the number of blocks depends on the number of SMs. One block can only be dispatched to one SM, and different SMs work in parallel. So the best situation is to make the number of blocks no less than the number of SMs. A launch-configuration sketch following these two principles is given below.
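The following helper is a minimal sketch of how such a configuration can be chosen at run time. The 256-thread block size is only illustrative and the function name is ours; it simply queries the number of SMs and applies the two principles above.
\begin{verbatim}
#include <cuda_runtime.h>

// Pick a launch configuration for 2^b threads (one thread per input difference):
// the block size is a multiple of the 32-thread warp, and the grid has at least
// as many blocks as the device has SMs whenever the total thread count allows it.
void choose_launch_config(int block_bits, unsigned int *blocks,
                          unsigned int *threads_per_block)
{
    int device = 0, sm_count = 1;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, device);

    unsigned int total = 1u << block_bits;             // one thread per difference
    *threads_per_block = 256;                           // multiple of 32
    *blocks = (total + *threads_per_block - 1) / *threads_per_block;
    if (*blocks < (unsigned int)sm_count) {
        // Too few blocks to occupy every SM: shrink the blocks to one warp each
        // so the work is spread over as many SMs as possible.
        *threads_per_block = 32;
        *blocks = (total + 31) / 32;
    }
}
\end{verbatim}
For the 20-bit version this yields $2^{20}/256 = 4096$ blocks of 256 threads, comfortably above the SM count of a Tesla V100.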
To shorten the time cost for ciphers with a large block size, we propose a method that prunes differential paths with small probability in every round. The two arrays A1 and A2 record only the differentials with high probability rather than the full distribution. Elements of A1 and A2 are recorded in the form {difference, probability}. Every time a new output differential is found by a thread, we check whether a record for that differential has already been written. If it has, the probability is added to the existing record; otherwise a new record is created at an empty address. When there is no more space to store records, a threshold probability is set for the current and subsequent rounds, and only differentials with probability higher than the threshold can be written to the array.
We set the threshold to the theoretical value obtained when every differential has the same probability, i.e., the uniform distribution: for a cipher with block size b, the uniform differential probability is $1/2^b$. Assuming the cipher also spreads characteristics ideally, the probability of any characteristic is uniform as well, namely $1/2^b \times 1/2^b = 1/2^{2b}$. This probability is chosen as the threshold value. The result of this cryptanalysis method is a semi-full distribution, and the distinguisher from Section 2.2 can still work on such a distribution. A sketch of the record-insertion step is given below.
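The per-thread insert-or-accumulate step can be realised as a small open-addressing table in global memory. The sketch below is our own illustration, not the original implementation: the Record layout, the sentinel value and the linear-probing scheme are assumptions, and it presumes the diff fields are pre-initialised to the sentinel and the probabilities to zero. The sentinel 0xFFFFFFFF is safe because all reduced block sizes satisfy $2^b < 2^{32}$.
\begin{verbatim}
struct Record {
    unsigned int diff;   // output difference (0xFFFFFFFF marks an empty slot)
    float prob;          // accumulated probability
};
#define EMPTY_SLOT 0xFFFFFFFFu

// Insert-or-accumulate a {difference, probability} record, pruning anything
// below the threshold (0 before the table first fills up, 1/2^(2b) afterwards).
// Returns false only when the table is completely full.
__device__ bool insert_record(Record *table, unsigned int capacity,
                              unsigned int diff, float prob, float threshold)
{
    if (prob < threshold) return true;                  // pruned
    unsigned int slot = diff % capacity;
    for (unsigned int probe = 0; probe < capacity; ++probe) {
        unsigned int prev = atomicCAS(&table[slot].diff, EMPTY_SLOT, diff);
        if (prev == EMPTY_SLOT || prev == diff) {       // claimed a slot, or found our record
            atomicAdd(&table[slot].prob, prob);
            return true;
        }
        slot = (slot + 1) % capacity;                   // linear probing
    }
    return false;                                        // no space left
}
\end{verbatim}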
The meet-in-the-middle approach is an efficient way to reduce the time cost of differential cryptanalysis, because in the early rounds most input differences have probability 0 and the search finishes quickly. It takes several rounds before the input probability spreads to the rest of the differential space. If the cryptanalysis starts from both the plaintext and the ciphertext, we can make use of these cheap early rounds twice.
If all threads could run fully in parallel, such a meet-in-the-middle approach would improve almost nothing. But things are different when threads are blocked: when the number of created threads exceeds what all the SMs can process, a waiting queue forms and some threads are delayed, which largely decreases the parallel efficiency. In the early rounds, however, most differentials have probability 0 and the full distribution search is not required, so the kernel function finishes quickly and frees its SM for other threads. In addition, when recording probabilities to the output array in global memory, write conflicts are unavoidable, but fewer write requests reduce the chance of a conflict.
One problem is that the meet-in-the-middle approach requires the pair of differences to be chosen before the cryptanalysis begins. A branch-and-bound algorithm is introduced to help us predict which differential may have a high probability. Matsui first gave a branch-and-bound search algorithm in [5]. Its purpose is to quickly find a characteristic with very high probability among all the characteristics. It does not guarantee a high differential probability, but research by Chen et al [21] shows that Matsui's algorithm also tends to yield a high-probability differential. Algorithm 2 is the combination of Matsui's algorithm and the meet-in-the-middle approach, in which Matsui's algorithm is used to predict which plaintext and ciphertext differences (with a potentially high-probability differential) should be chosen. Based on its result, we then use the meet-in-the-middle approach to calculate the differential probability. The branch-and-bound search returns its output very quickly and is a recursive algorithm, so we run it on the CPU rather than the GPU.
Algorithm 2 Use branch and bound to search for differential probability
Input: Total round number of the cipher R; block size b.
Output: Plaintext and ciphertext differences α, β; differential probability P.
1: α, β ← Branch&Bound()   // done by the CPU
2: A0[2^b], A1[2^b], B0[2^b], B1[2^b] = [0..]
3: A0[α] = 1; B0[β] = 1
4: P ← 0
5: for i ← 0; i < R/2; i++ do   // encryption direction
6:   A1 = full_distribution_search(A0, b)
7:   Swap(A0, A1)
8:   A1 = [0..]
9: end for
10: for i ← 0; i < R/2 + 1; i++ do   // decryption direction
11:   B1 = full_distribution_search(B0, b)
12:   Swap(B0, B1)
13:   B1 = [0..]
14: end for
15: for i ← 0; i < 2^b; i++ do
16:   if A0[i] != 0 and B0[i] != 0 then
17:     P += A0[i] × B0[i]
18:   end if
19: end for
\[
p(\delta_i) = \sum_j p_j \times p(\delta_j \to \delta_i) \tag{8}
\]
where $p(\delta_j \to \delta_i)$ is the one-round differential characteristic probability, which is always a fixed value. Therefore the only parameter that changes in the previous equation is $p_j$. If all the information of $p(\delta_j \to \delta_i)$ is calculated in advance, the process of searching the full distribution can be represented by a matrix multiplication:
\[
\begin{bmatrix} p_{1,r+1} & p_{2,r+1} & \cdots & p_{n,r+1} \end{bmatrix}
=
\begin{bmatrix} p_{1,r} & p_{2,r} & \cdots & p_{n,r} \end{bmatrix}
\times
\begin{bmatrix}
\delta_1 \to \delta_1 & \delta_1 \to \delta_2 & \cdots & \delta_1 \to \delta_n \\
\delta_2 \to \delta_1 & \delta_2 \to \delta_2 & \cdots & \delta_2 \to \delta_n \\
\vdots & \vdots & \ddots & \vdots \\
\delta_n \to \delta_1 & \delta_n \to \delta_2 & \cdots & \delta_n \to \delta_n
\end{bmatrix} \tag{9}
\]
\[
P_{r+1} = P_r \times \Delta \tag{10}
\]
\[
P_{k+i} = P_i \times \Delta^k \tag{11}
\]
Two ways can obviously be used to evaluate Equation (11). One is to start from $P_r \times \Delta$ and multiply by $\Delta$ one factor at a time (left-to-right order). The other is to change the calculation order so that $\Delta^k$ is computed first and then multiplied by $P_i$. One big advantage of the latter is that it speeds up the search for the full distribution compared to the original method of Section 3, because it does not need to proceed round by round. After $\Delta^2$ is calculated, $\Delta^k$ can be rewritten as $(\Delta^2)^{k/2}$, $(\Delta^4)^{k/4}$ and so on, so ideally the number of multiplications is reduced to $\log_2 k$. Thus the original method of Section 3 can be used to obtain $\Delta$ first, and the matrix approach is then used for further rounds.
After transforming differential cryptanalysis into pure matrix multiplication, the problem becomes clearer in mathematical terms: how to speed up large matrix multiplication on the GPU. Section 3 provides a way of searching the full distribution from the differential point of view, but many GPU features and techniques cannot be applied to it directly, so its efficiency is not always satisfactory. Matrix multiplication, on the other hand, has been studied for many years and a large number of solutions are available; taking advantage of those mature solutions makes the most of the GPU's parallel ability. MatrixMulCUBLAS, a sample highly recommended by Nvidia that relies on the cuBLAS (CUDA Basic Linear Algebra Subroutines) library, is chosen to perform matrix multiplication in our experiments because it costs little time even when the matrix is large.
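For concreteness, the following sketch shows how $\Delta^{2^s}$ can be obtained by repeated squaring with cuBLAS, which is the step that brings the cost down to a logarithmic number of multiplications. It is only an illustrative sketch: the function name is ours, the matrix is assumed to be already resident in GPU memory, and error checking is omitted. Squaring is layout-agnostic because $(A^{\mathsf T})(A^{\mathsf T}) = (A^2)^{\mathsf T}$, so row- versus column-major storage does not matter here.
\begin{verbatim}
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Compute Delta^(2^s) for an n x n transition matrix d_delta on the GPU.
// d_tmp is a scratch buffer of the same size. Returns the pointer (either
// d_delta or d_tmp) that holds the result after s squarings.
float *repeated_squaring(cublasHandle_t handle, float *d_delta, float *d_tmp,
                         int n, int s)
{
    const float one = 1.0f, zero = 0.0f;
    for (int i = 0; i < s; ++i) {
        // d_tmp = d_delta * d_delta
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &one, d_delta, n, d_delta, n, &zero, d_tmp, n);
        float *t = d_delta; d_delta = d_tmp; d_tmp = t;   // result becomes the new base
    }
    return d_delta;
}
\end{verbatim}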
In some cases the matrix-based method faces a memory limitation: $\Delta$ is too large for the GPU's memory, which is quite common considering that today's GPUs do not have memory spaces as large as RAM or hard disk. Although the memory space of a GPU cannot be increased at will, the matrix approach has some properties that help with this problem.
Sparse matrix. $\Delta$ contains the one-round differential characteristic information for every input difference. We ran a test on 16-bit PRESENT in which a number of input differences were chosen and their one-round full distributions were computed independently. It turns out that about 99% of the output differences have probability 0. In other words, $\Delta$ is a very large but sparse matrix with about 99% of its elements being 0. Therefore sparse storage formats such as COO, CSR and CSC save a lot of memory, and the corresponding computation methods remain amenable to parallel operation.
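As an illustration of the memory saving, the following host-side helper converts a dense $\Delta$ into the CSR format; the struct and function names are ours. With roughly 99% zeros, the three CSR arrays occupy only a few percent of the dense storage plus the n+1 row offsets.
\begin{verbatim}
#include <cstddef>
#include <vector>

// CSR representation of the one-round transition matrix Delta (n = 2^b).
struct CsrMatrix {
    std::vector<int>   row_ptr;   // n + 1 offsets into col_idx / val
    std::vector<int>   col_idx;   // column index of each non-zero entry
    std::vector<float> val;       // probability of each non-zero entry
};

CsrMatrix dense_to_csr(const float *dense, int n)
{
    CsrMatrix m;
    m.row_ptr.reserve(n + 1);
    m.row_ptr.push_back(0);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float p = dense[(std::size_t)i * n + j];
            if (p != 0.0f) {                 // keep only the non-zero transitions
                m.col_idx.push_back(j);
                m.val.push_back(p);
            }
        }
        m.row_ptr.push_back((int)m.col_idx.size());
    }
    return m;
}
\end{verbatim}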
Matrix partition. Although $\Delta$ can be compressed as a sparse matrix, after several multiplications $\Delta^r$ becomes dense and the compression loses its effectiveness. In this case $\Delta^r$ can only be stored in RAM or on hard disk. Matrix partitioning, however, lets us split a large matrix into sub-matrices, and those sub-matrices follow the same calculation rules as the elements of a matrix. Assuming the GPU memory can only hold two n × n matrices, $\Delta^r$ can be split into an assembly of n × n sub-matrices. Each time, two sub-matrices are transferred to the GPU and the result is written back to RAM or hard disk, as sketched below.
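The sketch below illustrates such a tiled multiplication: the full matrices stay in host RAM and only t × t tiles travel to the GPU. It is our own illustration rather than the original code. It stores everything column-major so the tiles can be fed to cublasSgemm directly, it keeps three tile buffers on the device (the two operands plus the accumulating result tile), and it assumes t divides n; error handling is omitted.
\begin{verbatim}
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// C = A * B for n x n column-major matrices in host RAM, streamed through the
// GPU in t x t tiles: C_IJ = sum_K A_IK * B_KJ, accumulated on the device.
void tiled_multiply(cublasHandle_t h, const float *A, const float *B, float *C,
                    int n, int t)
{
    size_t bytes = (size_t)t * t * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    const float one = 1.0f;
    std::vector<float> tA((size_t)t * t), tB((size_t)t * t), tC((size_t)t * t);

    for (int I = 0; I < n / t; ++I)
        for (int J = 0; J < n / t; ++J) {
            cudaMemset(dC, 0, bytes);                  // C_IJ starts at zero on the GPU
            for (int K = 0; K < n / t; ++K) {
                for (int c = 0; c < t; ++c) {          // gather tiles A_IK and B_KJ
                    const float *a = A + (size_t)(K * t + c) * n + I * t;
                    const float *b = B + (size_t)(J * t + c) * n + K * t;
                    std::copy(a, a + t, tA.data() + (size_t)c * t);
                    std::copy(b, b + t, tB.data() + (size_t)c * t);
                }
                cudaMemcpy(dA, tA.data(), bytes, cudaMemcpyHostToDevice);
                cudaMemcpy(dB, tB.data(), bytes, cudaMemcpyHostToDevice);
                // dC += dA * dB   (beta = 1 accumulates over K)
                cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, t, t, t,
                            &one, dA, t, dB, t, &one, dC, t);
            }
            cudaMemcpy(tC.data(), dC, bytes, cudaMemcpyDeviceToHost);
            for (int c = 0; c < t; ++c)                // scatter C_IJ back to host
                std::copy(tC.data() + (size_t)c * t, tC.data() + (size_t)(c + 1) * t,
                          C + (size_t)(J * t + c) * n + I * t);
        }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
\end{verbatim}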
Version Input Diff round 1 round 2 round 3 round 4 round 5 round 6 round 7 round 8
8-bits 0x7 -0.213 1.064 3.762 7.972 12.120 15.469 19.762 23.992
8-bits 0x80 -0.213 1.702 5.756 9.880 13.856 17.541 22.209 25.872
12-bits 0x8 -1.003 0.362 3.404 7.017 10.402 14.986 19.441 24.415
12-bits 0x50 -0.964 0.079 3.335 7.765 12.807 17.264 22.175 27.204
12-bits 0x400 -0.964 0.028 2.902 6.586 11.519 16.587 21.621 26.255
16-bits 0xc -1.483 -0.964 0.971 5.114 8.0956 12.263 17.201 21.634
16-bits 0x40 -1.483 -0.575 1.841 6.116 10.782 15.026 19.181 23.229
16-bits 0x200 -1.483 -0.575 1.79 6.170 10.960 15.066 19.342 23.552
16-bits 0x7000 -1.510 -0.842 0.938 5.173 9.031 12.227 16.721 22.141
20-bits 0x3 -1.863 -1.362 0.121 3.617 8.911 14.786 20.607 26.562
20-bits 0x40 -1.863 -1.294 0.817 5.975 11.941 18.139 23.985 29.904
20-bits 0xb00 -1.863 -1.420 -0.174 2.773 7.609 13.076 18.679 24.376
20-bits 0x9000 -1.884 -1.462 -0.337 2.603 7.582 13.166 18.926 24.863
20-bits 0xd0000 -1.863 -1.441 -0.221 2.727 7.710 13.402 18.919 24.790
The security margins of the block-size-reduced PRESENT versions are obtained as follows:
We can observe that the data complexity increases with the round number. In the rounds where the logarithm of the data complexity (log₂ C) is below 0, it grows slowly; once log₂ C > 0, the per-round increment jumps to around 4. Moreover, the increment between neighbouring rounds is nearly constant (about 4 per round), which indicates that the growth of the data complexity is roughly exponential. We therefore predict that the security margin of the original PRESENT is around round 19, where log₂ C ≥ 64. Another factor that affects the data complexity is the plaintext difference: for a given version, different differentials may yield different security margins, and the largest one is usually chosen as the final security margin.
In previous work we used a 128-core CPU to perform the same full distribution search on block-size-reduced PRESENT. The time taken by the GPU and the CPU to find the full distribution up to 8 rounds for a randomly chosen plaintext difference is shown below:
Block size   CPU      GPU
8 bits       2 s      4.6 ms
12 bits      10 s     4.7 ms
16 bits      30 min   14.5 ms
20 bits      12 h     5.6 s
24 bits      N/A      3.8 min
28 bits      N/A      7.6 h
It can easily be seen from Table 4 that using the GPU increases the speed greatly. The GPU finishes all of the work, whereas the CPU takes a few days and still cannot produce the results for the 24-bit version. Considering that a 128-core computer costs even more than a Tesla graphics card, using the GPU for cryptanalysis has clear advantages over the CPU setup. It can also be seen that the GPU's time cost grows sharply from the 24-bit version to the 28-bit version, which shows that beyond the 24-bit version the thread count hits the bottleneck of the GPU's maximum number of parallel threads.
Regarding the improved methods, the time cost of the pruning method in Section 4.1 depends on the array length and can be read from Table 4. The meet-in-the-middle approach costs twice the time of half the rounds for the cipher of the same block size in Table 4. Table 5 shows the time comparison between the matrix methods and the original method. When using matrix multiplication, the 12-bit version is the limit without matrix partitioning. For small block sizes such as 8 and 12 bits, Matrix 1 shows a dominant advantage over Original. Matrix 2 requires more parallel threads because it involves the multiplication of larger matrices ($\Delta^n$); thus it only beats Original for the 8-bit version, where the block size is so small that Original cannot make use of all the parallel capacity. Once matrix partitioning is involved, from the 16-bit version onwards, the problem of inadequate memory is solved, but the partitioning process makes both matrix methods slower than Original. Because the time costs of Original and Matrix 1 grow linearly with the round number while that of Matrix 2 grows logarithmically, the former are more effective when the round number is not too large. Given enough memory space, Matrix 1 is the relatively faster way to perform this differential cryptanalysis, and we expect it to be powerful in the future when GPU memory is no longer the bottleneck.
The matrix partitioning is done in the following ways. For Matrix 1, $P_r \times \Delta$ is partitioned as $P_r \times [\delta_1\; \delta_2\; \cdots\; \delta_n]$, where $P_r$ is a $2^b \times 1$ sub-matrix and each $\delta_i$ is $j \times 2^b$; j is adaptive and is chosen to make the most use of the GPU memory space. As for Matrix 2, it is always a multiplication of two n × n matrices, and a Tesla V100 can hold at most a $2^{12} \times 2^{12}$ matrix multiplication. Larger matrices are therefore all partitioned into square sub-matrices of at most that size.
Table 5: Time cost of the matrix multiplication methods and the original method for calculating the 8-round full distribution
7. Conclusion
8. Acknowledgement
The final authenticated version of this preprint has been published in the Journal of Information Security and Applications at https://doi.org/10.1016/j.jisa.2020.102565.
References
[2] M. Matsui, A. Yamagishi, A new method for known plaintext attack of FEAL cipher, in: Workshop on the Theory and Application of Cryptographic Techniques, Springer, 1992, pp. 81–91.
[4] D. Wagner, The boomerang attack, in: Fast Software Encryption, Vol. 1636 of LNCS, Springer, 1998, pp. 245–259.
[5] M. Matsui, On correlation between the order of S-boxes and the strength of DES, in: Advances in Cryptology - EUROCRYPT '94, Workshop on the Theory and Application of Cryptographic Techniques, Perugia, Italy, May 9-12, 1994, Proceedings, 1994, pp. 366–375. doi:10.1007/BFb0053451.
[7] M. Wang, Differential cryptanalysis of reduced-round PRESENT, in: International Conference on Cryptology in Africa, Springer, 2008, pp. 40–49.
[15] M. Cianfriglia, S. Guarino, Cryptanalysis on GPUs with the cube attack: design, optimization and performances gains, in: 2017 International Conference on High Performance Computing & Simulation (HPCS), IEEE, 2017, pp. 753–760.
[18] G. Leander, Small scale variants of the block cipher PRESENT, IACR Cryptology ePrint Archive 2010 (2010) 143.
[20] D. Kirk, et al., NVIDIA CUDA software and GPU parallel computing architecture, in: ISMM, Vol. 7, 2007, pp. 103–104.