Rethinking Modular Multi-Exponentiation in Real-World Applications
Rethinking Modular Multi-Exponentiation in Real-World Applications
Rethinking Modular Multi-Exponentiation in Real-World Applications
Abstract The importance of efficient multi-exponen- Mathematics Subject Classification (2020) MSC
tiation algorithms in a large spectrum of cryptographic code1 · MSC code2 · more
applications continues to grow. Previous literature on
the subject pays attention exclusively on the mini-
mization of the number of modular multiplications. 1 Introduction
However, a small reduction of the multiplicative com-
plexity can be easily overshadowed by other figures of 1.1 Modular arithmetic
merit. In this article, we demonstrate that the most
efficient algorithm for computing multi-exponentiation Modular arithmetic is a cornerstone of modern math-
changes if considering execution time instead of number ematics with innumerable applications in cryptogra-
of multi-exponentiations. We focus our work on two al- phy, coding theory, computer algebra and more. This
gorithms that perform best under the number of multi- field has been investigated since ancient Greece. For in-
exponentiation metric and show that some side opera- stance, the modular reduction problem
tions affects their theoretical ranking. We provide this
A mod N, (MR)
analysis on different hardware, such as Intel Core and
ARM CPUs and the two latest generations of Rasp- which consists of obtaining the reminder of an inte-
berry Pis, to show how the machine chosen affects the ger A modulo another integer N , was already studied
execution time of multi-exponentiation. by Euclid, as written in the Elements [12]. Over the
centuries, research produced highly sophisticated algo-
Keywords Multi-exponentiation · OpenSSL · bench- rithms to compute (MR) such as the Barrett reduction
marking · cryptography · arithmetic [2], which computes the solution of (MR) by replacing
divisions with multiplications, and the Montgomery re-
duction [18], which converts integers into a space where
V. Attias the modulo operation is replaced by (computationally
IOTA Foundation
Berlin
easier) multiplications. While the two aforementioned
Germany algorithms, Barrett and Montgomery reductions, both
E-mail: vidal.attias@iota.org replace divisions with multiplications, Montgomery’s
L. Vigneri conversion induces an overhead while its reduction op-
IOTA Foundation eration per se is faster than Barrett’s and it allows ma-
Berlin nipulated converted integers as if they were in a canoni-
Germany
E-mail: luigi.vigneri@iota.org
cal form. Thus, Montgomery reduction is more efficient
for computations involving lots of reductions using the
V. Dimitrov
University of Calgary
same transformed integers whereas Barrett reduction
Calgary is preferable for one-time reductions. Both algorithms
Canada have been thoroughly studied both from a theoretical
E-mail: vdimitro@ucalgary.ca [6] and from an implementation perspective [7, 14].
2 Vidal Attias et al.
Another essential operation is the modular exponen- when implemented in real systems. The goal of the pa-
tiation defined as per is indeed to provide the knowledge necessary to
select the best algorithm depending on the hardware
xe mod N, (ME)
used and on the input parameters, such as base and
where N is a modulus in N, x an integer in [0, N − 1], modulus sizes. Furthermore, it is often non-trivial to
and e is a natural number. Again, this problem received tune the algorithm parameters, and we show that this
an important research coverage, both on purely theo- is largely affected by exponent and modulus lengths.
retical aspects [10] and implementation perspective [4], In short, we found a lack of implementation study
with some research even focusing on producing quan- for multi-exponentiation algorithms, and the platform-
tum algorithms [21]. Sensible improvements in the solu- dependant effects are not yet fully understood. We want
tion of (ME) has been given by the introduction of the to stress out that a rigorous study of using the execution
square-and-multiply technique which uses the product time metric to compare multi-exponentiation metrics
of powers rule, i.e., mp · mq = mp+q , and the binary has not been done before to the best of our knowledge.
representation of the exponent e as follows: One can find references providing execution time mea-
surements of some algorithms [5, 8], but no survey of
x,
if e = 1, the algorithm from this metric point of view has been
2
power(x, e) = power(x , e/2), if n is even, provided.
2
x · power(x , (e − 1)/2), if e is odd.
(1)
We present an extensive performance comparison of
This technique leads to a time complexity of O(log(x))
three multi-exponentiation algorithms operating on in-
instead of O(x) and paved the way for a whole range of
tegers on single-core machines (in Section 3.6 we discuss
sophisticated algorithms using it [13].
about non-integers groups and multi-core machines).
A generalization of (ME) is called modular multi-
Specifically, we test different modern hardware: the Ap-
exponentiation, and it is expressed as the product of n
ple ARM M1 brand-new chip, a general-purpose In-
modular exponentiations:
tel Core processor and the two latest generations of
n
Y Raspberry Pis. Studying different hardware is impor-
xei i mod N. (MME) tant because platform-dependant side effects can occur.
i=1
For example, the Raspberry Pi devices use a 32-bit op-
Problem (MME) has been less profusely studied. One erating system as opposed to modern general-purpose
could naively compute each modular exponentiation computers using 64 bits systems, which makes a differ-
separately and then compute their product but this ence for multi-precision computing libraries. Moreover,
has been shown to be suboptimal [15], hence the Raspberry Pis, which do not have efficient long divi-
need for optimized algorithms. Existing literature is sion routines, make the optimization of the modular
mostly focused to compare the performance of multi- reduction an even more important task than for x86
exponentiation algorithms based on the total number devices. We reiterate that the contributions of this pa-
of multiplications to be performed. Most algorithms ag- per are twofold: (i) evaluate the performance differences
gregate the exponents during the scanning phase in or- depending on the hardware used to solve (MME); (ii)
der to reduce the amount of multiplications. This gener- provide the reader a way to understand which multi-
ally induces a precomputation overhead and adds some exponentiation algorithm to use and how to tune its
complexity resulting in a exponents filtering computa- parameters depending on application specifications.
tion overhead. Although counting the number of mod-
ular multiplication is useful to compare algorithms re-
gardless of the hardware, it does not catch side com-
The rest of the paper is organized as follows: in Sec-
putations that can actually make a difference when it
tion 2, we formally state the problem we are tackling.
comes to implementation.
Then, in Section 3, we provide a description of the rele-
vant algorithms used to solve (MME) along with pseu-
1.2 Our contributions docode and a theoretical comparison. After that, Sec-
tion 4 describes the basic concepts of the Montgomery
In this paper we tackle a practical problem: while it is reduction which will be an important component of
easy to find extensive literature on the theoretical per- our experiments. In Section 5, we show the results of
formance and guarantees of multi-exponentiation algo- these experiments. Finally, we conclude our paper in
rithms, little is known on how these techniques behave Section 6.
Rethinking Modular Multi-Exponentiation in Real-World Applications 3
where xi ∈ N with i ∈ [1, n] represents the base and has – ei [j] returns the j-th (starting with index 0) least
size λ bits, N ∈ N is the modulus and has size λ bits significant bit of ei .
and ei with i ∈ [1, n] is the exponent and has size k bits. – ei [a . . . b] returns the bits within the a-th and the
Furthermore, let Θλ,k (r, h, A) – or just Θ(r, h, A) for the b-th least significant bits.
sake of simplicity – be the computing time needed to If a or b are negative numbers, |a| or |b| zeros are
solve r by algorithm A using hardware h, where bases added as least significant digits. For example, if ei =
and modulus are of size λ and exponents are of size k. 101010112 , then ei [3 . . . 1] = 1012 and ei [3 . . . −2] =
Hence, the goal of this paper is to provide insights to 1011002 .
the following optimization problems:
xi s to use a square-and-multiply-like technique with the 3.3 Simultaneous sliding window algorithm (YLL)
n bases. This optimisation can be extended to compute
a window of size w bits of all exponents in one single Based on the previous simultaneous 2w -ary method,
multiplication. This technique is called the Simultane- Yen, Laih and Lenstra [25] have proposed an optimiza-
ous 2w -ary method and the special case w = 1 is also tion which replaces the fixed window of the previous al-
known as the “Shamir’s trick” [17]. We denote this tech- gorithm with a sliding window. It still consists of a pre-
nique as the 2w -ary algorithm. computation phase, which can be done offline, followed
More in detail, the algorithm consists of two phases: by the exponentiation itself. We display the pseudocode
of the multi-exponentiation phase in Algorithm 3. The
1. Precomputation. In this phase, the solver computes optimization is twofold:
and stores all the
1. Before making the multiplication, the algorithm
n
Y scans the most significant bits in the window that
xE
i mod N
i
(3) will be computed next to each exponent; while they
i=1 are all zeros, the window is shifted to the left and we
square the current exponentiation result (denoted
for all (E1 , . . . , En ) ∈ {0, . . . , 2w − 1}k . Please note by A in the pseudocode).
that this phase can be done offline. 2. Once the window cannot be shifted anymore, we
2. Evaluation. A fast exponentiation-like algorithm scan the least significant bits using a virtual cursor
matches precomputation entries to batches of w bits denoted J in the algorithm so that we multiply using
of each exponent and then squares w times in order only entries in which at least of exponent is odd. In
to shift the computed exponent by w bits on the that way, we save some precomputations by only
left. Algorithm 2 gives the pseudocode of this phase computing entries with at least one odd element,
of the algorithm. which represents 25% of the precomputations.
YLL w=4
YLL w=3 1.2
Speedup in percentage
103 YLL w=2
Separate 1.0
0.8
102
0.6 2w-ary w=4
2w-ary w=3
0.4 2w-ary w=2
101 YLL w=4
0.2 YLL w=3
YLL w=2
0.0 No speedup
20 21 22 23 24 25 26 27 28 29 210 211 212 213 20 21 22 23 24 25 26 27 28 29 210 211 212 213
Exponent bitsize Exponent bitsize
(a) Expected number of multiplications in function of expo- (b) Expected speedup comparison in function of exponent
nent length. length.
chitectures and we leave multi-core ones as a future {p · x mod N }{x∈[1,N −1]} is a permutation of [1, N − 1].
work: in fact, we firmly believe that there is still large In other words, the modular multiplication by p is a
interest into single-core applications as not all hard- bijective application and then can be inverted.
ware can support parallelization, especially Internet-of- Then, if we assume that our modulus N is odd,
Things devices or embedded technologies that usually which is often the case in cryptosystems such as RSA
have very low computing power. and has a size λ, then we can set R = 2λ and then
gcd(N, R) = 1. Then, for any x in G, we can define:
4 Montgomery multiplication – Montgomery representation of x : x̄ = x · R mod N .
– Montgomery product of ā · b̄: ū = ā · b̄ · R−1 mod N .
The most straightforward way to perform modular It is easy to see that ū = a · b. Furthermore can also
arithmetic computations is by applying the modulo easily see that a + b = a + b then we can do arithmetic
operation as soon as the numbers handled grow in using the Montgomery representation of each number.
order not to spend time computing huge numbers In order to retrieve a number x from its Montgomery
that will anyway be reduced. For example, comput- representation x, we can just multiply it by R−1 and
ing x = a · b mod n can be done by computing c = a · b take the remainder modulo N .
and then x = c mod n. However, doing so can be in- The trick of using Montgomery representation is
efficient as the modulus operation on big numbers is that, when R = 2λ , we have algorithms that can
a costly operation due to long divisions involved. In compute the Montgomery product without using the
particular, hardware using RISC sets of instructions modulo operation, hence speeding the computation up.
can spend much more time on the modulus operation However, this technique is not interesting when the
than on the multiplication. Thus, some techniques have number of products is small amount due to the overhead
been designed to efficiently compute a · b mod N with- required for the Montgomery conversion; conversely, it
out involving the modulus operation itself. The most is very powerful when it comes to exponentiations in
efficient ones are the Barrett reduction [2] and the which a lot of multiplications are involved as the over-
Montgomery reduction [18]. As briefly mentioned in head is amortized by the lower computation cost in the
Section 1, Montegomery reduction is better suited for Montgomery space.
multi-exponentiation use, and is based on the idea to
transform the numbers a and b into a special form in
which the reduction can be efficiently computed. We 5 Implementation results and discussion
elaborate more on this in the rest of this section.
First, an important mathematical result must be In this section, we compare some of the algorithms pre-
recalled. Let N be our modulus. If we take a natu- sented in Section 3 to provide valuable insights depend-
ral number p such that gcd(p, N ) = 1, then the set ing on the values of sizes k and λ and on the hard-
Rethinking Modular Multi-Exponentiation in Real-World Applications 7
set when starting the processes using UNIX priorities, We decided to set λ = 2048 and k = 256 because we
that is -20 (highest priority), default value set by the believe they represent reasonable values to be used in
operating system (0) and 20 (lowest priority). We can some real cryptosystems such as the Wesolowski’s veri-
see that the impact of priority during our experiments is fiable delay function construction [23]. Finally, in these
negligible. We point out that the following precautions figures, the 2w -ary (resp. YLL) method is represented
can help reduce the interferences: i) reducing the work- by dashed (resp. solid) lines, which are blue in case
load on the machine to limit as much as possible any w = 2, green if w = 3 and red if w = 4. Dash-dotted
other process running; ii) keeping CPU’s temperature yellow lines represent the separate method.
low to avoid thermal throttling. An efficient ventilation In Figure 3c we can see that the variation over λ
has proved to stabilize the measurements. leads to a similar asymptotic behavior amongst all al-
gorithms. It is important to note that, besides validat-
ing the theoretical findings, this plot shows that us-
5.2 Computing time analysis ing these multi-exponentiation algorithms can be per-
formed under the millisecond for realistic setup values
The first set of experiments shows the comparative be- like λ = 2048 and k = 256.
havior of multi-exponentiation algorithms with respect In Figure 3c we can see an increase from λ = 22
to the main parameters tested, such as exponent size to λ = 26 , suddenly followed by a decrease, and then
k, modulus size λ and, to a lesser extent, the window the plot gradually increases again. This is due to the
size. In Figure 3 and Figure 4 we display a set of eight fact that 26 = 64 is the size of a word on this 64-bit
figures showing the time spent by evaluating multi- device and marks the threshold by which using a multi-
exponentiations as in Eq. (4). Figure 3 provides plots precision library leads to improvements. This can be
with a sensibiity on the λ parameter using k = 256 and verified by looking at Figure 3a which displays the same
Figure 4 displays a sensibiity with k using λ = 2048. parameters but for a Raspberry Pi 4 hardware, running
The subfigures include: on a 32-bits operating system. There we can see that
the threshold is for λ = 32.
– Figure 3a represents the computation time using the
Figure 4c is very similar to Figure 1a, confirming
Raspberry Pi 4 as a function of the modulus bit-
the theoretical expectations. It confirms that using dif-
length λ, where the exponent size k is set to 256
ferent values of w becomes interesting as k increases
bits.
in practical settings, and implementation conditions do
– Figure 3b displays the speedup of using 2w -ary and
not sensibly affect the theoretical improvements.
YLL compared to separate, which is used as a
Figure 4b is much less consistent with theory pre-
baseline scenario, as a function of the modulus size
dicted in Figure 1b for low values of k, especially in
λ for k = 256 on the Raspberry Pi 4.
the curve w = 2. However, for values of k larger than
– Figure 3c represents the computation time using the
28 , results start to gain consistency. Furthermore, when
Apple M1 as a function of the modulus bit-length
comparing the asymptotic improvement factors, we can
λ, where the exponent size k is set to 256 bits.
observe they are all larger than the predicted one by a
– Figure 3d displays the speedup of using 2w -ary and
factor of around 5%. The discrepancies seen for low val-
YLL compared to separate, which is used as a
ues of k is due to libraries inner mechanisms that we
baseline scenario, as a function of the modulus size
do not have control over which get smoothed away with
λ for k = 256 on the Apple M1.
values of k over 28 . Furthermore, we can see that YLL
– Figure 4a represents the computation time using the
always outperforms 2w -ary by around 10%. Figure 3b
Raspberry Pi 4 as a function of the exponent size k,
is showing the expected behavior, i.e., that using differ-
where the modulus size λ is set to 2048 bits.
ent values of w for the same value of k does not lead to
– Figure 4b displays the speedup of using 2w -ary and
significant improvements for large values of λ. Someone
YLL compared to separate, which is used as a
implementing multi-exponentiations should then only
baseline scenario, as a function of the exponent size
consider k when picking a value of w. However, the al-
k for λ = 2048 on the Raspberry Pi 4.
gorithm used, i.e., 2w -ary or YLL makes a difference,
– Figure 4c represents the computation time using the
with YLL performing better for large values of λ.
Apple M1 as a function of the exponent size k, where
the modulus size λ is set to 2048 bits.
– Figure 4d displays the speedup of using 2w -ary and 5.3 Hardware Comparison
YLL compared to separate, which is used as a
baseline scenario, as a function of the exponent size An aspect that we considered important in our analy-
k for λ = 2048 on the Apple M1. sis was the different behavior depending on the hard-
Rethinking Modular Multi-Exponentiation in Real-World Applications 9
Speedup
100
2.0
1.5
10 1 1.0
0.5
22 23 24 25 26 27 28 29 210 211 212 22 23 24 25 26 27 28 29 210 211 212
Bitlength of modulus Bitlength of modulus
(a) Multiexponentiation algorithms for k = 256 on Raspberry (b) Speedup of multiexponentiation algorithms for k = 256 on
Pi 4 with variation of parameters λ and w. Raspberry Pi 4 with variation of parameters λ and w.
3.5
2w-ary w=4 2w-ary w=4
100 2w-ary w=3 2w-ary w=3
2w-ary w=3 3.0 2w-ary w=3
YLL w=4 YLL w=4
YLL w=3 YLL w=3
Computation time (ms)
2.0
10 1
1.5
1.0
10 2
0.5
22 23 24 25 26 27 28 29 210 211 212 22 23 24 25 26 27 28 29 210 211 212
Bitlength of modulus Bitlength of modulus
(c) Multiexponentiation algorithms for k = 256 on Apple M1 (d) Speedup of multiexponentiation algorithms for k = 256 on
chip with variation of parameters λ and w. Apple M1 chip with variation of parameters λ and w.
ware architectures. This can help anyone implement- k with λ = 2048. Each line represents a different hard-
ing multi-exponentiations to understand what are the ware as depicted in the legend while Figure 5b shows
expected discrepancies in case of heterogeneous ecosys- the same data in the form of speedup compared to the
tems. This is displayed in Figure 5. This analysis is separate algorithm, used as a baseline. Figures 5c and
similar to the one performed in the previous subsec- 5d display the same information but performing a sen-
tion: while in Figure 3 and Figure 4 we study how fast sitivity analysis over the modulus size λ where k is set
multi-exponentiation algorithms perform when the pa- equal to 256.
rameters k and λ change, here we restrict our study by The high-level behavior of the plots follows the ob-
setting w = 3 and we focus on the impact of differ- servations made in the previous subsection. We can see
ent hardware. We have already presented the different that using different hardware can only be observed via
behavior when using 32-bits or 64-bits operating sys- vertical shifts in performances as clearly displayed by
tems; in this section, we take this analysis a step further Figure 5a. Moreover, in Figures 5c and 5d we can ob-
and we are going to present a more exhaustive study of serve an horizontal offset between the 32 bits and 64
hardware influence. bits devices as devised in Section 5.2.
Figure 5a shows the computation time of YLL al- Another interesting observation is that in Figure 5b,
gorithm with w = 3 as a function of the exponent size the two Raspberry Pi devices have better speedup for
10 Vidal Attias et al.
2w-ary w=4
2w-ary w=3 2.00
2w-ary w=3
YLL w=4 1.75
101 YLL w=3
Computation time (ms)
Speedup
1.00
2w-ary w=4
100 0.75 2w-ary w=3
2w-ary w=3
0.50 YLL w=4
YLL w=3
0.25 YLL w=3
Separate
22 23 24 25 26 27 28 29 210 211 212 22 23 24 25 26 27 28 29 210 211 212
Bitlength of exponent k Bitlength of exponent k
(a) Multiexponentiation algorithms for λ = 2048 on Raspberry (b) Speedup of multiexponentiation algorithms for λ = 2048
Pi 4 with variation of parameters k and w. on Raspberry Pi 4 with variation of parameters k and w.
1.6
1.4
100
Computation time (ms)
1.2
1.0
Speedup
0.8
2w-ary w=4 2w-ary w=4
10 1 2w-ary w=3 0.6 2w-ary w=3
2w-ary w=3 2w-ary w=3
YLL w=4 0.4 YLL w=4
YLL w=3 YLL w=3
YLL w=3 0.2 YLL w=3
Separate Separate
22 23 24 25 26 27 28 29 210 211 212 22 23 24 25 26 27 28 29 210 211 212
Bitlength of exponent k Bitlength of exponent k
(c) Multiexponentiation algorithms for λ = 2048 on Apple M1 (d) Speedup of multiexponentiation algorithms for λ = 2048
chip with variation of parameters k and w. on Apple M1 chip with variation of parameters k and w.
YLL than the Intel Core and Apple M1 processors. configuration and algorithm. A first observation tells
The same behavior can be observed in Figure 5d in us that, without surprise, the desktop-grade processors
an asymptotic way although there is a rather chaotic perform much better than the Raspberry Pi devices,
behavior for low values of λ. This is a motivation for outperforming them by a factor of around ×11 for the
further research to understand how different hardware Raspberry Pi 4 and ×19 for Raspberry Pi 3, and these
perform multi-exponentiation algorithms. ratios are consistent throughout the whole data. Al-
In Table 2 we summarize the results of our exper- though this result should be of no surprise for anyone,
iments with respect to the different hardware for sep- we considered having precise measurements for this spe-
arate, 2w -ary and YLL with values of w in {2, 3, 4} cific application would be of interest to anyone building
when λ = 2048 and k = 256. The last column rep- applications using heterogeneous hardware.
resents the speedup achieved by the fastest algorithm
compared to the separate for each line. A second observation shows that the behavior is
We can see the same global behavior as in Sec- consistent for all devices when using different algo-
tion 5.2 but taking a closer look at the actual numeric rithms, as showcased in Figure 5c. We finally can see
values is instructive. Each cell contains the minimum of that the maximum expected speedup is 1.75 for this
around a hundred multi-exponentiation runs for each specific application.
Rethinking Modular Multi-Exponentiation in Real-World Applications 11
1.8
Raspberry Pi 3 Raspberry Pi 3
Raspberry Pi 4 Raspberry Pi 4
Apple M1 1.6 Apple M1
Intel Core i7 Intel Core i7
101 1.4
Computation time (ms)
1.2
Speedup
100 1.0
0.8
10 1 0.6
(a) Computation time when varying k with λ = 2048. (b) Speedup when varying k with λ = 2048.
100 2.0
Speedup
1.5
10 1
1.0
0.5
10 2
(c) Computation time when varying λ with k = 256. (d) Speedup when varying λ with k = 256.
2w -ary YLL
Hardware separate Max Speedup
w=2 w=3 w=4 w=2 w=3 w=4
Raspberry Pi 3 7.72 4.56 4.83 7.06 4.41 4.57 6.30 ×1.75
Raspberry Pi 4 4.66 2.81 2.95 4.34 2.74 2.80 3.87 ×1.66
Apple M1 0.41 0.26 0.28 0.38 0.25 0.26 0.38 ×1.58
Intel Core i7 0.45 0.29 0.31 0.51 0.28 0.29 0.45 ×1.61
5.4 Best multi-exponentiation algorithm cell contains a color corresponding to the fastest algo-
rithm. The different hues indicate a change of window
We believe that a very instructive from of represent- size with w = 2 in green, w = 3 in blue and w = 4 in
ing data is to show which algorithm performs the best purple while the naive implementation is in red. The
depending on the parameters k and λ. Hence, in this shade in turn will show a change of algorithm as de-
subsection we plot the fastest algorithm for each value scribed in the plot’s legend.
of k and λ in {2i | i ∈ [2, 13]} depending on the hard- Figures 6a, 6b, 6c, 6d show the best algorithm for
ware. We present the results as a grid for which each the Intel Core i7, the Apple M1, the Raspberry Pi 3
12 Vidal Attias et al.
213 213
4-YLL 4-YLL
212 212
211 4-2w-ary 211 4-2w-ary
210 210
29 3-YLL 29 3-YLL
28 28
Value of
Value of
3-2w-ary 3-2w-ary
27 27
26 2-YLL 26 2-YLL
25 25
24 2-2w-ary 24 2-2w-ary
23 23
22 Separate 22 Separate
22 23 24 25 26 27 28 29 210 211 212 213 22 23 24 25 26 27 28 29 210 211 212 213
Value of k Value of k
(a) Best algorithm as a function of λ and k for Intel Core. (b) Best algorithm as a function of λ and k for Apple M1.
Value of
27 3-2w-ary
27 3-2w-ary 26
26 2-YLL
25 2-YLL 25
24 24 2-2w-ary
2-2w-ary
23 23
22 Separate 22 Separate
22 23 24 25 26 27 28 29 210 211 212 22 23 24 25 26 27 28 29 210 211 212
Value of k Value of k
(c) Best algorithm as a function of λ and k for Raspberry Pi (d) Best algorithm as a function of λ and k for Raspberry Pi
3. 4.
Fig. 6: Heat map showing the fastest multi-exponentiation algorithm per hardware depending on modulus and
exponent sizes.
and the Raspberry Pi 4 respectively. One can observe esting to increase the value of w in order to leverage
two patterns, each one taking place along one of the the speedup that it brings in the actual evaluation part.
two axes and thus orthogonal with each other. When That explains the pattern shift on the k axis.
the exponent size k varies, adapting the value of the
window w is important. Conversely, depending on the
Concerning the λ axis, the rationale is that increas-
value of the modulus size λ, a change of algorithm from
ing the modulus length only impacts the computation
2w -ary to YLL leads to better results, and we notice
time of modular multiplications in the same way for
that this pattern is consistent for any hardware tested.
all algorithms and for all values of w. Then it becomes
The variation along the k axis is explained by the more efficient to optimize the parts of the algorithm
fact that when performing multi-exponentiation algo- that are not related to the actual modular multiplica-
rithms, there is a certain overhead in precomputations tions such as filtering and bit testing. In that sense, the
which is linear with 2w , hence it only depends on the 2w -ary method has a simpler mechanic than the sliding
value w, while the evaluation part is a function of grow- window one. Thus, it is better to use for smaller values
ing with k and decreasing with w. We can then see that of λ. However for big values, we can benefit from the
increasing k diminishes the relative overhead of the pre- reduced number of modular multiplications performed
computation phase and at some point it becomes inter- during the sliding window method.
Rethinking Modular Multi-Exponentiation in Real-World Applications 13
Another interesting point is that the pattern shifts and change how algorithms compare to each other. Sec-
around the same value of the exponent k for any hard- ond, although YLL is always theoretically faster than
ware. However, we can see that for the λ-axis, the In- 2w -ary, performing exponents scanning and bit filter-
tel Core and Apple M1 ARM have a switching value ing induce an overhead which is not taken into account
around 210 whereas the Raspberry Pi 3 and 4 have a in predictions. Thus, for low values of k, the 2w -ary
switching value around 28 . The rationale is that the algorithm actually performs better than YLL.
default Raspberry Pi operating system works in 32-bits We show that the two parameters k and λ have an
mode even though he Raspberry Pi 3 and 4 CPU have orthogonal role in double exponentiation performances.
64-bits capabilities. This is due to backward compat- An increase λ yields longest modular multiplications
ibility issues. This makes the difference in computa- whereas a larger value of k induces a more impor-
tion time bigger for big numbers modular multiplica- tant number of modular multiplications performed and
tion than all other computations involved because the make bigger precomputations affordable, hence increas-
word length is smaller. As mentioned above, difference ing the value of optimal w.
between modular multiplication and the other opera- We finally show that results are pretty consistent
tions performed in important only on the λ-axis, thus regardless the considered hardware, although using a
the behavior seen on these plots. 32-bits or 64-bits operating system changes the thresh-
old value of λ for which YLL performs better than
2w -ary. This makes algorithm selection easy for real
6 Conclusion
applications.
One of the main points of our paper is to demonstrate This paper opens some perspectives for future work.
that there is a need to rethink the use of performance Important topics yet to understand include understand-
metrics in considering the (MME) problem (and, of ing how generalizing to n > 2 exponentiations behaves,
course, many other essential cryptographic primitives). consider the case in which precomputations can be done
The algorithms proposed in the past have been designed offline and then only consider only the evaluation phase
with one chief goal only – minimize as much as possible or have a study focusing on a very large variety of hard-
the number of modular multiplications. But a small re- ware to have an even better understanding of platform-
duction of the number of modular multiplications can dependent effects. Another important aspect is that we
be easily overshadowed by other factors like more com- have focused our work on showing that optimal algo-
plex architectures, larger number of precomputations, rithms (with respect to the number of modular mul-
less opportunity to use parallelization, to name a few. tiplications) can have a sub-optimal behavior once ex-
Our research unveils the complicated relationship be- ecuted on specific platforms. This brings the question
tween many of these factors and presents the best pos- of whether other algorithms not ‘good enough’ in the
sible algorithms for a large number of cases essential in number of modular multiplications could actually be
cryptographic applications. We aim our efforts specifi- ‘good’ concerning the execution time. We argue this is
cally on the applications related to verifiable delay func- unlikely as other algorithms that we have implemented
tions, but other “users” of multi-exponentiation algo- but not described in this article, such as the Basic In-
rithms can directly benefit from the findings presented terleaving method, provided poor results.
here.
We have presented an implementation study of three
algorithms to solve double exponentiation, separate, 6.1 Generalizations and additional considerations
2w -ary, and YLL. We have focused our study on un-
derstanding how computation time is affected by prob- Over the last few years several new cryptographic
lems parameters, k and λ, respectively being the expo- schemes that make use of different algebraic struc-
nents bit-size and the modulus bit-size and understand- tures, like quaternions, have been proposed. The
ing which optimal parameter w, that is the window size task of computing modular exponentiation and multi-
in 2w -ary and YLL, to pick for a given pair (k, λ). We exponentiations for quaternions has been considered
have provided extensive experimental results for these in [22] and [24]. A similar task for computing multi-
three algorithms, considering four different hardware to exponentiations in the cases of matrices (it is essen-
understand how platform-specific behavior happen. tial in some automatic control problems [3]) was also
Although experimental results follow theoretical posed by some researchers. What is the main differ-
predictions, there are implementation side effects that ence in these cases? It is the implicit use of commu-
are highlighted in this paper. First, running a 32-bits or tativity in all of the already existing algorithms! In-
64-bits operating system does impact the performances deed, the computation of W = A2 B2 (A, B square
14 Vidal Attias et al.
matrices) has to be done as W = (AA)(BB); the non- thank Prof. Paul Zimmermann from Loria, France, who took
commutative nature of the matrix multiplication pre- a lot of time to help us understand how multi-precision li-
braries work. Then, Colin Walter who clarified a lot of details
vents us to use tricks like computing the same state- on Montgomery reduction implementation. Thank you also to
ment as (AB)2 = ((AB)(AB)). For readers who might Wolfgang Welz from the IOTA Foundation for his engineering
be interested in exploring these fascinating problems, advises that helped us define the scope of this paper. Finally,
we will provide here two basic definitions: a matrix A thank you to Ilan Zajtmann for lending us some computation
time on his Apple M1 computer to help us include cutting-
is called nilpotent if there exist a positive integer, n,
edge technology in this survey.
such that An = 0; a pair of matrices (A, B) is called
mortal if there exist a pair of positive integers (m, n)
such that Am Bn = 0. Whilst testing if a matrix A is References
nilpotent is relatively easy, testing mortality of pair of
1. Attias, V., Vigneri, L., Dimitrov, V.: Implementation
matrices is extremely difficult – there are no provable
Study of Two Verifiable Delay Functions. In: Tokenomics,
bounds on the exponents (m, n) and matrix double- 6, pp. 1–6 (2020)
exponentiation computation cannot use many of the 2. Barrett, P.: Implementing the rivest shamir and adleman
computational tricks presented in our paper, due to the public key encryption algorithm on a standard digital
signal processor. In: Lecture Notes in Computer Sci-
non-commutativity. So, for these cases one has to use a
ence (including subseries Lecture Notes in Artificial In-
radically different approach. In cryptographic settings telligence and Lecture Notes in Bioinformatics), vol. 263
this means that one has to use either separate expo- LNCS (1987). DOI 10.1007/3-540-47721-7 24
nentiations (that would be unfortunate and a serious 3. Blondel, V.D., Tsitsiklis, J.N.: When is a pair of matrices
mortal? Information Processing Letters 63(5) (1997).
drawback to the performance of quaternion-based cryp- DOI 10.1016/s0020-0190(97)00123-3
tographic schemes like the one in [22], for example) or 4. Blum, T., Paar, C.: Montgomery modular exponentiation
completely different multi-exponentiations techniques on reconfigurable hardware. Proceedings - Symposium on
have to be found. We hope this remark of ours will Computer Arithmetic (1999). DOI 10.1109/arith.1999.
762831
spark the attention of the cryptographic algorithms de- 5. Borges, F., Lara, P., Portugal, R.: Parallel algorithms
signers to take a deeper look into this under-researched for modular multi-exponentiation. Applied Mathematics
task. and Computation 292, 406–416 (2017). DOI 10.1016/j.
amc.2016.07.036
On last consideration, side channel attacks are one 6. Bosselaers, A., Govaerts, R., Vandewalle, J.: Compar-
of the most powerful cryptanalysis techniques. They ison of three modular reduction functions. In: Lec-
do not rely on the solution of some conjectured hard ture Notes in Computer Science (including subseries Lec-
ture Notes in Artificial Intelligence and Lecture Notes in
mathematical problem (e.g., factoring large integers or Bioinformatics), vol. 773 LNCS (1994). DOI 10.1007/
solving the discrete logarithm problem). Instead, they 3-540-48329-2 16
rely on the possibility to extract specific vital informa- 7. Brent, R., Zimmermann, P.: Modern Computer Arith-
tion about the secret keys by performing observations metic. October. Cambridge University Press, Cambridge
(2010). DOI 10.1017/cbo9780511921698
and measurements on the actual physical implementa- 8. Chang, C.C., Lou, D.C.: Parallel computation of the
tions of a cryptographic algorithm. The side-channel multi-exponentiation for cryptosystems. International
attacks are of no concern for our main targeted ap- Journal of Computer Mathematics 63(1-2) (1997). DOI
10.1080/00207169708804548
plication – the execution of the most popular VDFs.
9. Chang, C.C., Lou, D.C.: Fast Parallel Computation of
However, in many other real-world scenarios this is not Multi-Exponentiation for Public Key Cryptosystems. In:
the case, one prominent example being the computation Parallel and Distributed Computing, Applications and
of ECDSA (elliptic curve digital signature algorithm). Technologies, PDCAT Proceedings (2003). DOI 10.1109/
pdcat.2003.1236459
For those applications, the most popular algorithmic 10. Dimitrov, V.S., Jullien, G.A., Miller, W.C.: An
protection is the development of constant-time algo- algorithm for modular exponentiation. Informa-
rithms. In the case of double-exponentiation, this means tion Processing Letters 66(3), 155–159 (1998).
that computing xm ·y n mod p must require exactly the DOI 10.1016/s0020-0190(98)00044-1. URL
https://www.sciencedirect.com/science/article/
same number of modular multiplications regardless of abs/pii/S0020019098000441
the specific values of the exponents m and n. In or- 11. Dimitrov, V.S., Jullien, G.A., Miller, W.C.: Complex-
der to cover applications outside the domain of VDFs, ity and fast algorithms for multiexponentiations. IEEE
we are performing ongoing research on constant-time Transactions on Computers 49(2), 141–147 (2000). DOI
10.1109/12.833110. URL http://ieeexplore.ieee.org/
multi-exponentiation algorithms and plan to report our document/833110/
findings in future publications. 12. Euclid: The Elements of Euclid, 2nd edn. Dover Publi-
cations Inc., New York, New York, USA (1956)
13. Gueron, S.: Efficient software implementations of modu-
Acknowledgements We would like to thank some people lar exponentiation. Journal of Cryptographic Engineer-
that helped elaborating this paper. First, we would like to ing 2(1) (2012). DOI 10.1007/s13389-012-0031-5
Rethinking Modular Multi-Exponentiation in Real-World Applications 15