GZKP: A GPU Accelerated Zero-Knowledge Proof System
Weifang Hu‡
Huazhong University of Science and
Technology
Wuhan, Hubei, China
huwf@hust.edu.cn
ABSTRACT
Zero-knowledge proof (ZKP) is a cryptographic protocol that allows one party to prove the correctness of a statement to another party without revealing any information beyond the correctness of the statement itself. It guarantees computation integrity and confidentiality, and is therefore increasingly adopted in industry for a variety of privacy-preserving applications, such as verifiable outsourced computing and digital currency.

A significant obstacle in using ZKP for online applications is the performance overhead of its proof generation. We develop GZKP, a GPU accelerated zero-knowledge proof system that supports different levels of security requirements and brings significant speedup toward making ZKP truly usable. For polynomial computation over a large finite field, GZKP promotes a cache-friendly memory access pattern while eliminating the costly external shuffle in existing solutions. For multi-scalar multiplication, GZKP adopts a new parallelization strategy, which aggressively combines integer elliptic curve point operations and exploits fine-grained task parallelism with load balancing for sparse integer distribution. GZKP outperforms the state-of-the-art ZKP systems by an order of magnitude, achieving up to 48.1× and 17.6× speedup with standard cryptographic benchmarks and a real-world application workload, respectively.

CCS CONCEPTS
• Computer systems organization → Parallel architectures; Data flow architectures; • Security and privacy → Privacy-preserving protocols.

KEYWORDS
zero-knowledge proof; GPU acceleration

∗ Both authors contributed equally to this research.
† Corresponding author.
‡ Weiliang Ma, Qian Xiong, Xuanhua Shi, Hai Jin, Haozhao Kuang, and Weifang Hu are with the National Engineering Research Center for Big Data Technology and System, Service Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ASPLOS '23, March 25–29, 2023, Vancouver, BC, Canada
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-9916-6/23/03…$15.00
https://doi.org/10.1145/3575693.3575711

ACM Reference Format:
Weiliang Ma, Qian Xiong, Xuanhua Shi, Xiaosong Ma, Hai Jin, Haozhao Kuang, Mingyu Gao, Ye Zhang, Haichen Shen, and Weifang Hu. 2023. GZKP: A GPU Accelerated Zero-Knowledge Proof System. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '23), March 25–29, 2023, Vancouver, BC, Canada. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3575693.3575711
[Figure 1 (workflow diagram): a one-time trusted setup (key generator) produces the proving key and the verification key; the prover uses the vector sets and proving key to run polynomial computation (chains of INTT/NTT producing $h_0 \ldots h_{N-1}$) and multi-scalar multiplication (PMUL and PADD over points $M_0 \ldots M_{N-1}$, $Q_0 \ldots Q_{N-1}$ with scalars $u_0 \ldots u_{N-1}$), yielding the proof; the verifier takes the input, proof, and verification key and accepts or rejects. Inset: PADD computes $P_3 = P_1 \oplus P_2$ via $m = (y_2 - y_1)/(x_2 - x_1)$, $x_3 = m^2 - x_1 - x_2$, $y_3 = m(x_1 - x_3) - y_1$; PMUL splits $5 \otimes P = (101)_2 \otimes P$ into doublings ($2P = P \oplus P$, $4P = 2P \oplus 2P$) and an addition ($5P = 4P \oplus P$).]
Figure 1: Typical zkSNARK protocol workflow.

decisive improvement to the market, supporting a variety of ZKP systems used in the industry, such as Marlin [11], Plonk [31], and Sonic [51]. In addition, the two modules that we investigated (NTT and MSM) have applications in a wider scope of encryption algorithms: NTT is a key building block in homomorphic encryption algorithms [35] used by federated learning, while MSM is used in many pairing-based proof systems [8, 40, 54]. Techniques proposed in this work thus contribute to this growing junction of computer systems and security/privacy, by significantly enhancing the effectiveness of GPU resource utilization for crucial components of private computation.

2 BACKGROUND

2.1 The zkSNARK Protocol
zkSNARK is one of the most prominent ZKP protocols. It consists of two parties: a prover and a verifier. The prover can convince the verifier that, "for a given function $F$ and input $x$, I know a secret witness $w$ such that $F(x, w) = 0$," by generating a zkSNARK proof. zkSNARK possesses three desirable properties: (1) succinctness, i.e., small proof sizes (<1 KB) and fast verification (<10 ms) regardless of $F$'s complexity, (2) non-interactivity, i.e., a single message from the prover to the verifier, and (3) zero knowledge, i.e., a proof revealing nothing about the actual value of $w$ while still able to pass the verification.

With these properties, zkSNARK has recently been widely adopted by the blockchain community [15, 17, 19, 22, 44]. The smart contract stored on a blockchain is a program performing the function $F$ in Figure 1. Traditional blockchain applications require each node in a chain to perform the same smart contract for the chain's state updates. For example, for an online transaction, the nodes in the chain verify its validity by checking whether both parties' account information (secret witness $w$) meets the transaction condition. It is worth noting that a smart contract involving private datasets may reveal user privacy, because anyone running their blockchain node can see and download all data stored on the chain. With the number of such contracts growing, all nodes will incur huge overhead and long latency, significantly limiting the throughput of latency-sensitive applications (e.g., decentralized transactions).

With ZKP, users could generate succinct proofs for programs and then put the proofs on the chain: each node only needs to verify by checking the integrity of the proofs, a task much lighter than executing the smart contract. Also, these zero-knowledge proofs hide sensitive information involved in the smart contract computation. As a result, ZKP can greatly enhance both blockchain scalability and privacy protection.

As shown in Figure 1, various random parameters for the whole ZKP system are generated in a one-time setup phase, including the proving key and the verification key for the prover and verifier. With the prover's secret witness $w$ and public input $x$, the system outputs two kinds of data which will later be used by the prover and verifier:

• Vector sets: $\{\vec{a}, \vec{b}, \vec{c}, \vec{u}\}$. Each vector contains $N$ large integers on a finite field $F_r$, where $r$ is a large prime. The bit-width of the large-integer field here typically ranges from 256 to 753. Each such large integer can be composed as $\sum_{i=0}^{m} r_i D^i$, where $D = 2^{64}$ denotes the base, so that the components $r_i$ can be stored in word-size integers. The dimension $N$ could be up to billions and is typically a power of 2.¹

• The proving key and verification key: The proving key consists of multiple point vectors $\{\vec{M}, \vec{Q}\}$, each containing $N$ points on a pre-determined elliptic curve, expressed in the form of $\{(x, y) \in F_q \times F_q \mid y^2 = x^3 + ax + b\}$. $F_q$ is another finite field where $q$ is a large prime. Similar to $F_r$, the bit-width of $F_q$ is also between 256-bit and 753-bit. The verification key is a short point vector with a few elliptic curve points. The basic operations on elliptic curves are point addition (PADD)² and point multiplication (PMUL) [41]. We use $\oplus$ and $\otimes$ to denote the PADD and PMUL operations on elliptic curves, respectively. As demonstrated in Figure 1, PADD adds two points on the elliptic curve and requires a set of arithmetic operations over the same finite field. PMUL is defined by adding a point $P$ to itself $s$ times (through PADD), where $s$ is an element of $F_r$. The figure illustrates how it can be split into a series of PADDs following the bit sequence of the integer's binary representation.

Then the prover uses the vector sets and the proving key to generate a proof. The prover performs two-stage calculations successively, polynomial computation (POLY) and multi-scalar multiplication (MSM), to generate the proof.

In the POLY stage, the prover takes the vectors $\vec{a}$, $\vec{b}$, and $\vec{c}$ as inputs, and outputs the coefficients of the polynomial $H(x)$, where $H(x) = (A(x)B(x) - C(x))/(x^N - 1)$. As shown in Figure 1, the state-of-the-art implementation of this stage consists of a series of NTT and inverse NTT (INTT) operations [16, 50], with more details given in Section 3.

In the MSM stage, the prover takes the output $\vec{h}$ of POLY, the scalar vector $\vec{u}$, and the proving key as inputs, and computes multiple multi-scalar multiplications over the specified elliptic curves. Each multi-scalar multiplication combines a scalar vector with a point vector as $Q = \bigoplus_{i=0}^{N-1} s_i \otimes P_i$ (see Figure 3).

¹ General cases with arbitrary $N$ values can use the power-of-2 flow as building blocks [12].
² Here, we use PADD to refer to elliptic curve point addition operations, including point doubling operations.
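To make the bit-sequence decomposition of PMUL above concrete, here is a minimal sketch that reduces $s \otimes P$ to a series of PADDs. The modular-integer padd and the toy modulus are stand-ins of ours so the code is self-contained; GZKP's real PADD operates on curve points over 256-bit to 753-bit fields.

```cuda
#include <cstdint>
#include <cstdio>

typedef uint64_t Point;  // stand-in for an elliptic curve point

__host__ __device__ Point padd(Point a, Point b) {
    const uint64_t kMod = 2147483647ULL;  // toy modulus; real fields are 256-753 bits
    return (a + b) % kMod;                // stand-in for elliptic curve PADD
}

// PMUL as a series of PADDs over the scalar's binary representation: scan
// bits from most to least significant, doubling the accumulator each step
// and adding P when the bit is 1 (e.g., 5P = (101)_2 * P).
__host__ __device__ Point pmul(uint64_t s, Point p) {
    Point acc = 0;  // identity element of the toy group
    for (int i = 63; i >= 0; --i) {
        acc = padd(acc, acc);                  // doubling (PADD of acc with itself)
        if ((s >> i) & 1) acc = padd(acc, p);  // conditional PADD of P
    }
    return acc;
}

int main() {
    printf("5 (x) P for P=7: %llu\n", (unsigned long long)pmul(5, 7));  // prints 35
    return 0;
}
```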
[Figure 3 (diagram): an MSM over scalars $s_0 \ldots s_{N-1}$ and points $P_0 \ldots P_{N-1}$ is decomposed into sub-MSM tasks; each $l$-bit scalar ($l = 256$) is split into windows of size $k = 4$, and the computation proceeds in three steps: (1) point-merging (PM), (2) bucket-reduction (BR), and (3) window-reduction (WR).]
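The three steps labeled in Figure 3, which the text below walks through in detail, can be summarized in a serial sketch. As in the previous sketch, modular integer addition stands in for PADD (our simplification); the structure of the three steps is what matters here.

```cuda
#include <cstdint>
#include <cstdio>
#include <vector>

typedef uint64_t Point;
static const uint64_t kMod = 2147483647ULL;
static Point padd(Point a, Point b) { return (a + b) % kMod; }

// Q = sum_i s_i (x) P_i with l-bit scalars split into l/k windows of k bits.
static Point pippenger_msm(const std::vector<uint64_t>& s,
                           const std::vector<Point>& P, int l, int k) {
    int windows = (l + k - 1) / k;
    Point Q = 0;
    for (int t = windows - 1; t >= 0; --t) {         // most significant window first
        for (int d = 0; d < k; ++d) Q = padd(Q, Q);  // weight 2^{t*k} via doublings
        // (1) point-merging: points whose window-t digit equals j go to bucket B_j.
        std::vector<Point> B(1ull << k, 0);
        for (size_t i = 0; i < s.size(); ++i) {
            uint64_t digit = (s[i] >> (uint64_t)(t * k)) & ((1ull << k) - 1);
            if (digit) B[digit] = padd(B[digit], P[i]);
        }
        // (2) bucket-reduction: W_t = sum_j j (x) B_j via a running suffix sum,
        // so only PADDs are needed rather than one PMUL per bucket.
        Point running = 0, W = 0;
        for (uint64_t j = (1ull << k) - 1; j >= 1; --j) {
            running = padd(running, B[j]);
            W = padd(W, running);
        }
        // (3) window-reduction, folded in Horner-style: Q = 2^k * Q_prev + W_t.
        Q = padd(Q, W);
    }
    return Q;
}

int main() {
    std::vector<uint64_t> s = {5, 10, 3};
    std::vector<Point> P = {2, 4, 6};
    // Expected: 5*2 + 10*4 + 3*6 = 68.
    printf("MSM = %llu\n", (unsigned long long)pippenger_msm(s, P, 64, 4));
    return 0;
}
```

Note how bucket-reduction needs only PADDs: the running suffix sum accumulates bucket $B_j$ exactly $j$ times.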
[Figure 4 (diagram): element-to-thread mapping across batch 0 and batch 1, with background colors marking the independent groups inside GPU block 0; the zoom-in on the right shows threads $t_0 \ldots t_7$ loading (LD) contiguous words from global memory and storing (ST) them interleaved into shared memory.]
Figure 4: Iteration partitioning at $B = 2$. Here a GPU block contains $G = 4$ independent groups, each working on $2^B$ elements (in the same color). A total of $T = 2^B \cdot G / 2 = 8$ threads are required in each block. The zoom-in illustration on the right side shows the per-thread memory access pattern when loading data from the global memory to the shared memory.
operations with the distributive law. Figure 3 uses the Pippenger algorithm to illustrate this process. It represents each $l$-bit scalar $s$ under the $2^k$ base, where $k$ is the pre-specified window size. In the right part of Figure 3 ($N = 8$), $k$ is 4 and $l$ is 256, resulting in 64 windows. The PMUL operations within each window are then grouped by the scalar component into buckets. For example, in window $w_0$, different elements of $\vec{s}$ produce common components of 0xA and 0x5, producing a bucket 5 containing the points $P_0$, $P_3$, and $P_5$, and another bucket A. With the distributive law, one can therefore save computation by first summing up all the points in bucket $j$ to get $B_j$ within a window (point-merging), then calculating $\sum_{j=0}^{2^k - 1} j \otimes B_j$ for window $t$ to get $W_t$ (bucket-reduction), and finally computing $\sum_{t=0}^{\lceil l/k \rceil - 1} 2^{tk} \otimes W_t$, where $2^{tk}$ is a weight, to combine the results across windows (window-reduction).

The scale of MSM in ZKP systems is often over millions. Due to the large $N$, previous work explores data parallelism by decomposing the computation horizontally into multiple sub-MSM tasks, as shown in Figure 3 left. Each sub-MSM can be naturally assigned to a GPU block, where algorithms like Pippenger are further adopted to reduce the complexity, with concurrent processing of different windows by different threads, as shown in the figure. However, existing solutions have not explored the fact that point-merging tasks in the same window for different sub-MSMs can also be merged to reduce the subsequent reduction tasks, which is among our contributions in GZKP (Section 4).

3 GZKP DESIGN: THE POLY STAGE
In this section, we describe GZKP's new design for more efficient GPU-based large-scale NTT computation over large finite fields.

Cache-Efficient Global Memory Access. For $N$-NTT in ZKP, the total size of its input array (vectors $\{\vec{a}, \vec{b}, \vec{c}\}$ shown in Figure 1) varies from tens of MBs to GBs, with each array element in $m$ machine-word-size (64-bit) integers. The input arrays are stored in the GPU global memory with a column-major layout [10] (i.e., storing the first words of all $N$ integers contiguously, then all the second words, until $m$). In our discussion next, we use the column index to identify an array element. For example, element 0 represents the $m$ integers in the first column of this layout.
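The column-major convention can be read off from a short kernel sketch (our illustration of the layout just described, not GZKP's actual code): word $w$ of element $i$ sits at data[w * n + i], so a warp reading the same word of 32 consecutive elements touches consecutive addresses.

```cuda
#include <cstdint>

// Column-major indexing sketch: the 32 threads of a warp reading word w of
// elements i..i+31 touch consecutive 64-bit words, which coalesce into full
// memory transactions even though each thread strides by n between words.
__global__ void touch_elements(const uint64_t* __restrict__ data,
                               uint64_t* __restrict__ out,
                               size_t n, int m) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;  // element index
    if (i >= n) return;
    uint64_t acc = 0;
    for (int w = 0; w < m; ++w)          // walk the m words of element i
        acc ^= data[(size_t)w * n + i];  // per-thread strided, per-warp contiguous
    out[i] = acc;                        // placeholder combine to keep loads live
}
```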
A major reason for existing GPU-based NTT designs to perform
expensive data shuffles between batches [10, 16, 38] is to ensure
3 GZKP DESIGN: THE POLY STAGE that each warp has contiguous access to at least one GPU L2 cache
In this section, we describe GZKP’s new design for more efficient line (32 B with V100), to fully utilize data brought into the L2 cache.
GPU-based large-scale NTT computation over large finite fields. However, the cost of such shuffle operations quickly increases when
Cache-Efficient Global Memory Access. For 𝑁 -NTT in ZKP, the we get to the later iterations, as mentioned in Section 2.2.
total size of its input array (vectors {𝑎, ® 𝑐®} shown in Figure 1) varies
® 𝑏,
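For reference, the butterfly primitive that these batches are built from looks as follows in a minimal sketch over a toy 31-bit prime (our simplification; GZKP's butterflies run over 256-bit to 753-bit fields via the library of Section 4.3).

```cuda
#include <cstdint>

// P = 15 * 2^27 + 1 is an NTT-friendly prime; products of two residues fit
// in 64 bits, so no multi-word arithmetic is needed in this sketch.
static const uint64_t P = 2013265921ULL;

__host__ __device__ uint64_t addm(uint64_t a, uint64_t b) {
    uint64_t s = a + b; return s >= P ? s - P : s;
}
__host__ __device__ uint64_t subm(uint64_t a, uint64_t b) {
    return a >= b ? a - b : a + P - b;
}
__host__ __device__ uint64_t mulm(uint64_t a, uint64_t b) {
    return (a * b) % P;  // a, b < P < 2^31, so the product fits in 64 bits
}

// One butterfly with twiddle factor w: (x, y) -> (x + w*y, x - w*y) mod P.
// Each of the log2(N) NTT iterations applies N/2 such independent
// butterflies; the distance between paired elements changes from iteration
// to iteration, which is what drives the data regrouping discussed above.
__host__ __device__ void butterfly(uint64_t* x, uint64_t* y, uint64_t w) {
    uint64_t t = mulm(w, *y);
    uint64_t u = *x;
    *x = addm(u, t);
    *y = subm(u, t);
}

int main() {
    uint64_t x = 3, y = 5;
    butterfly(&x, &y, 2);  // x = 13, y = 3 - 10 mod P = 2013265914
    return (x == 13 && y == 2013265914ULL) ? 0 : 1;
}
```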
[Figure 5 (diagram): preprocessed points $P_{i\_t} = 2^{tk} \otimes P_i$ fold all windows into one ($l = 256$, $k = 8$); after the merged point-merging and bucket-reduction steps, each bucket task (e.g., Task0 for bucket $B_1$, Task1 for bucket $B_3$) is mapped to a warp, whose cooperative groups (CGs) perform the PADDs in a data-parallel manner across GPU blocks to produce the result.]
Figure 5: The bucket-based task-partitioning and task-mapping for the MSM computation.
Considering the large scale of NTT operations in ZKP applications, GZKP follows a different, shuffle-less approach, keeping the vector in the original order in the global memory across all batches of iterations (Figure 4). It then has to address the conflict between (1) the fine-grained access (byte-level) of each thread due to the column-major layout, for better global memory access speed, and (2) the growing distance between array elements accessed by each GPU block, without data shuffling between batches. Otherwise, both issues would significantly lower the L2 cache line utilization.

GZKP addresses these challenges with a cache-friendly task partitioning scheme. Suppose we have $B$ iterations within each batch, where GZKP divides the butterfly operations into $N/2^B$ independent groups, each of which works on $2^B$ elements. Figure 4 illustrates this with $B = 2$, where the 16 elements' butterfly computations form 4 such groups (grouped by background colors), each working on a disjoint subset of 4 elements. With existing GPU-based NTT designs, $B$ is maximized (under the shared memory capacity constraint) to make such independent groups large. Each group is mapped to one GPU block, to allow large chunks of work to proceed in parallel.

Considering the aforementioned L2 cache efficiency problem, GZKP adopts a different design. Rather than having each GPU block work on one large group, we assign it multiple smaller ones. In Figure 4, for example, the four independent groups are all assigned to GPU block 0. This allows GZKP threads to perform internal shuffling when transferring data between the global and shared memories: neighboring threads coalesce their accesses by visiting the global memory with a coarse-grained block-style data distribution, then create the fine-grained interleaving at the shared memory.

We illustrate more using the examples in Figure 4. Batch 0 on the left has contiguous access within each independent group, so the threads simply access the data needed, where neighboring threads visit neighboring data. This stops being the case with batch 1 (with GZKP discarding data shuffling in the global memory), where vector elements have a stride of 4: the first group will be working on elements 0, 4, 8, and 12. Nevertheless, if we schedule $G$ independent groups together, where each group processes $2^B$ elements, the overall required data nicely form $2^B$ length-$G$ contiguous chunks in the global memory. Essentially, we transform these $2^B$ length-$G$ chunks in the global memory into $G$ length-$2^B$ chunks in the shared memory. As long as $G$ is sufficiently large, e.g., at 4 or higher, accessing these chunks from the global memory ensures effective utilization of each L2 cache line. We use $T$ threads to fetch these chunks in parallel in multiple rounds. The Figure 4 example shows $G = 4$ and $B = 2$, as on the right side. In each round, the threads access contiguous elements from the global memory that can be well coalesced. They then write the data to the appropriate locations in the shared memory, producing the desired strided interleaving pattern, promoting efficient access within each independent group in their NTT computation. As to be seen later in our experiments, such a design improves the NTT performance by up to 2.1×. When the current batch finishes, the reverse process is used to write data back to the global memory, in the original data order.

The internal shuffle operation above ensures contiguous memory access on the global memory side for L2 cache efficiency. By scattering data to the desired locations as dictated by the current batch's independent-group order (such as "0 4 8 12" for the first group), subsequent iterations of NTT computation in this batch can proceed with each thread's data placed close together. Such sequential and reverse-order interleaving of contiguous data segments further reduces inter-bank access conflicts in the shared memory [20].
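The load/store pattern just described can be sketched as a CUDA kernel; everything here (names, single-word elements) is a simplified stand-in of ours rather than GZKP's actual code, which also carries the $m$-word big-integer representation.

```cuda
#include <cstdint>

// A block stages G independent groups of 2^B elements each: threads read
// 2^B contiguous, warp-coalesced runs of length G from global memory, and
// scatter them so each group's strided elements become contiguous in shared
// memory for this batch's butterfly iterations.
template <int B, int G>
__global__ void stage_batch(const uint64_t* __restrict__ gmem,
                            uint64_t* __restrict__ gout, size_t n) {
    __shared__ uint64_t smem[(1 << B) * G];
    const size_t stride = n >> B;                // distance between a group's elements
    const size_t base = (size_t)blockIdx.x * G;  // first group of this block
    for (int idx = threadIdx.x; idx < (1 << B) * G; idx += blockDim.x) {
        int chunk = idx / G;  // which of the 2^B contiguous runs
        int g = idx % G;      // which group within the run
        // coalesced global read, strided shared-memory write
        smem[g * (1 << B) + chunk] = gmem[chunk * stride + base + g];
    }
    __syncthreads();
    // ... the B butterfly iterations of this batch would run on smem here ...
    for (int idx = threadIdx.x; idx < (1 << B) * G; idx += blockDim.x) {
        int chunk = idx / G;
        int g = idx % G;
        gout[chunk * stride + base + g] = smem[g * (1 << B) + chunk];
    }
}
```

For the Figure 4 example (n = 16, B = 2, G = 4), block 0 reads the four contiguous runs {0–3}, {4–7}, {8–11}, {12–15} and lands group 0's elements 0, 4, 8, and 12 next to each other in shared memory.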
4 GZKP DESIGN: THE MSM STAGE
This section focuses on the MSM stage, which occupies most of the ZKP proof generation time. We present two major optimizations: (1) further computation consolidation beyond those exploited by existing approaches, and (2) load balancing techniques targeting scalar vectors containing large integer elements. It is worth noting that the MSM optimizations presented in this section are agnostic to elliptic curve types, and apply to both of the two common types (G1 and G2) used in current ZKP systems [23, 50] despite their different PADD operation costs.
[Figure 6 (chart): the number of points per task, with tasks sorted into similar task groups (Group0, Group1).]
Figure 6: Workload distribution in the point-merging step. Tasks with the same workload (i.e., number of points assigned to the bucket) are grouped into a similar task group.

After the cross-window point-merging step, only one bucket-reduction task is left in our MSM module. The preprocessing scheme essentially turns all windows into one with no weight. Therefore, there are no more windows but $2^k$ buckets left. For the final step in our MSM module, we calculate $\sum_{j=0}^{2^k - 1} j \otimes B_j$ by leveraging the parallel prefix sum algorithm [42], which converts certain sequential computations into equivalent parallel computations. The input array is from the point-merging step and is stored in GPU global memory. We divide it into blocks, each scanned by a single GPU block. The number of GPU blocks is determined by the input array size ($2^k$) and the number of threads needed for each PADD operation, again in CGs.
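One way to see why a scan applies here (our reading of the formulation, analogous to the serial suffix sum shown earlier): the weighted bucket sum can be rewritten as a plain sum of suffix sums,

$$\sum_{j=1}^{2^k - 1} j \otimes B_j \;=\; \bigoplus_{j=1}^{2^k - 1} S_j, \qquad S_j = \bigoplus_{i=j}^{2^k - 1} B_i,$$

since bucket $B_j$ appears in exactly the $j$ suffix sums $S_1, \ldots, S_j$. A parallel (reverse) prefix sum produces all $S_j$ in logarithmic PADD depth, and a parallel reduction then combines them.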
4.2 Workload Management
ZKP may exhibit imbalanced data distribution when working with large integers. In particular, the scalar vector $\vec{u}$ in real-world ZKP workloads is highly sparse. The system setup has a lot of bound checks and range constraints, which introduce many 0s and 1s to $\vec{u}$, and thus severe load imbalance in existing systems [16, 18]. For example, the points in bucket 0 require trivial processing and no subsequent point merging.

As an example, we analyze the workload distribution in the point-merging step with scalar vector $\vec{u}$, based on a Zcash MSM execution with a scale of $2^{17}$ and scalar bit-width of 256. As shown in Figure 6, there is up to a 2.85× difference in the numbers of points across all buckets.

When switching from the previous window-based parallel execution to the more fine-grained bucket-based parallelization strategy in GZKP, such imbalance could get more serious as more points are merged into each bucket. On the other hand, GZKP takes advantage of its fine-grained, bucket-based work partitioning strategy to dynamically improve load balance. Given the fairly large number of buckets during point-merging, GZKP groups them by loads (i.e., the number of points contained), ensuring that the tasks in the same group have similar loads and that heavy buckets are not left as stragglers.

When scheduling the tasks in each task group to the underlying GPU hardware, GZKP performs fine-grained task mapping for workload-aware resource allocation. We allocate GPU warps to the tasks in the same group according to their average loads. As demonstrated in Figure 7, a task in group 1 is mapped to one warp, while another task in group 0 is mapped to two warps since it contains more points. Therefore, each warp will independently manage at most one task (bucket), and multiple CGs in a warp can execute the task in a data-parallel manner.
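A kernel sketch of this warp/CG mapping (shapes assumed by us, launched with one 32-thread block per bucket task): the warp is split into cooperative groups of 8 threads that walk disjoint slices of the bucket's point list. Points are modeled as single uint64 values; in GZKP a whole 8-thread CG would instead cooperate on one wide-integer PADD.

```cuda
#include <cooperative_groups.h>
#include <cstdint>
namespace cg = cooperative_groups;

static const uint64_t kMod = 2147483647ULL;

__global__ void merge_one_bucket(const uint64_t* points, int n, uint64_t* out) {
    cg::thread_block_tile<32> warp =
        cg::tiled_partition<32>(cg::this_thread_block());
    cg::thread_block_tile<8> group = cg::tiled_partition<8>(warp);
    const int ngroups = 32 / 8;
    const int gid = warp.thread_rank() / 8;  // which CG within the warp
    uint64_t acc = 0;
    if (group.thread_rank() == 0)            // one lane stands in for its CG here
        for (int i = gid; i < n; i += ngroups)
            acc = (acc + points[i]) % kMod;  // stand-in for the cooperative PADD
    // Tree-combine the per-CG partials across the warp (idle lanes hold 0).
    for (int off = 16; off > 0; off >>= 1)
        acc = (acc + warp.shfl_down(acc, off)) % kMod;
    if (warp.thread_rank() == 0) *out = acc; // merged bucket value B_j
}
```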
4.3 Optimized Finite Field Library
GZKP uses a highly optimized finite field library as the underlying tool to accelerate the basic operations in ZKP, mainly modular multiplications [52] and large integer additions. As mentioned earlier, a typical proof generation in Zcash [43] involves billions of these basic operations. Our library supports large integers, currently 256-bit to 753-bit, and can be further extended if needed. Due to the length limit, we only briefly introduce it here.

Current mainstream computing devices (i.e., CPU, GPU) cannot directly support the above operations on large integers. Instead, one needs to split them into multiple device-native data types, e.g., 4 machine-word-size (64-bit) integers for a 256-bit integer. The large finite field operations are then implemented using the arithmetic operations supported by the GPU in CUDA [20]. Multiple threads in a warp cooperate to complete the arithmetic operation and use warp-level primitives to realize the data communication between threads. In this way, we can effectively deal with carry signals between threads, and merge the multiple individual 64-bit results into the final 256-bit number.

Like earlier works [29, 30], our library exploits the floating-point processing units (the forte of GPUs), which would otherwise be idle in such integer computation tasks, to accelerate modular multiplication. For example, we choose base $D = 2^{52}$ to split a large integer into multiple 52-bit components, resulting in 15 machine-word-size (64-bit) integers to represent a 753-bit integer. The higher-order bits of these integers are filled with 0 to accommodate carry bits. We then convert them to double-precision floating point (DFP), followed by applying Dekker's method [27] to complete the DFP multiplications and additions without rounding issues.
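The core primitive behind such error-free DFP arithmetic can be sketched in a few lines. Note that our sketch uses the GPU's fused multiply-add rather than Dekker's original operand-splitting scheme; it illustrates the same idea of capturing the rounding error exactly.

```cuda
#include <cstdio>

// Error-free product of two doubles: hi + lo == a * b exactly, barring
// overflow. With 52-bit limbs, such exact products let modular
// multiplication run on the double-precision units that would otherwise
// sit idle during integer work.
__device__ void two_product(double a, double b, double* hi, double* lo) {
    *hi = a * b;
    *lo = fma(a, b, -*hi);  // captures the rounding error of a*b exactly
}

__global__ void demo() {
    double hi, lo;
    two_product(4503599627370497.0, 4503599627370503.0, &hi, &lo);
    printf("hi=%.17g lo=%.17g\n", hi, lo);
}

int main() {
    demo<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```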
We implement the butterfly operation for NTT and the PADD operation for MSM based on this library. We believe the library itself could be of independent interest beyond GZKP, and can also be applied to improve existing GPU-based ZKP systems, which we demonstrated through our baselines. We report the acceleration results of the finite field library on different modules (Figure 8 and Figure 10) in Section 5.3.
5 EVALUATION

5.1 Methodology
Experimental setup. Our experiments are conducted on a server with dual 2.00 GHz Intel Xeon Gold 5117 processors (each with 28 logical cores) and 256 GB DRAM, running Ubuntu 18.04.2 with CUDA Toolkit version 11.4. It is equipped with four NVIDIA Tesla V100 GPUs (each with 32 GB memory) and one NVIDIA GTX 1080Ti GPU (with 11 GB memory).

Baselines. We compare GZKP with multiple state-of-the-art ZKP systems (using CPU or GPU) based on the zkSNARK protocol, as listed in Table 1. Among them, libsnark [50], a C++ library, is to our knowledge the most well-known zkSNARK library, supporting multiple elliptic curves as well as parallel proof generation on multiple CPU cores. MINA [18] is a GPU-accelerated ZKP system, but only accelerates the MSM stage. In addition, bellman and bellperson are recent zkSNARK implementations in Rust, adopted by Zcash [22] and Filecoin [15], respectively. They have a similar overall design, while the former is CPU-based and the latter GPU-based. Also, libsnark and GZKP support the ALT-BN128, BLS12-381, and MNT4753 elliptic curves, with bit-widths of 256, 381, and 753, respectively. MINA only supports the MNT4753 elliptic curve, while both bellperson and bellman support BLS12-381. We evaluate the performance of GZKP with different elliptic curves. For each elliptic curve, we select the best GPU system ("Best-GPU", or "BG") as the baseline, along with the best CPU solution ("Best-CPU", or "BC"), especially when the particular curve is not supported by any prior GPU solutions.

Table 1: Baselines

  System           Platform  Supported curves
  GZKP             GPU       ALT-BN128, BLS12-381, MNT4753
  MINA [18]        GPU       MNT4753
  bellperson [16]  GPU       BLS12-381
  libsnark [50]    CPU       ALT-BN128, BLS12-381, MNT4753
  bellman [23]     CPU       BLS12-381

Workloads. We use multiple real-world workloads to evaluate the overall performance of GZKP. Among them, the zkSNARK workloads are generated by xJsnark [47, 49], a high-level framework for developing applications based on zkSNARK, integrating libsnark as a backend for proof generation. The Zcash workload is generated through the tool included in its code repository [24]. The scalar vector $\vec{u}$ in these real-world workloads is sparse. In addition, we use synthetic data generated by libsnark to evaluate the performance of different GZKP modules.

5.2 Overall Performance
zkSNARK. First, we evaluate the end-to-end proof-generation time (sum of the POLY and MSM stages) using the zkSNARK workloads on one V100 GPU card. More specifically, the actual zkSNARK execution [50] contains seven NTT operations in the POLY stage and five MSM operations in the MSM stage for generating one proof. We create such zkSNARK workloads with the 753-bit MNT4753 curve.

As shown in Table 2, GZKP achieves, on average, 48.1× and 33.6× speedup against the best-performing CPU-based system (libsnark) and the best-performing GPU-based system (MINA), respectively. Since MINA only performs the MSM stage on the GPU, its overall execution time is obtained by adding libsnark's POLY stage time to its MSM stage time. While MINA provides quite limited improvement over the best CPU solution, GZKP is able to provide a speedup between 14.0× and 48.1× over the former. The reason is that with the scalar vector $\vec{u}$ being sparse in real-world workloads, MINA
Table 2: Performance results (in seconds) of different zkSNARK workloads on V100 GPU with MNT4753 (753-bit) elliptic curve
Table 3: Performance results (in seconds) of Zcash workload on V100 GPU with BLS12-381 (381-bit) elliptic curve
suffers from significant load imbalance, while GZKP's design avoids such a problem.

Zcash. Next, we report evaluation results with a Zcash workload, using the 381-bit BLS12-381 curve on one V100 GPU. There are two types of trading methods (sprout, sapling) [43] with Zcash, which involve different proof generation activities. We report results for both.

As shown in Table 3, GZKP achieves, on average, 24.7× and 8.7× speedup against BC (bellman) and BG (bellperson), respectively. To make a shielded transaction, a combination of those workloads is required. Thus, GZKP could reduce the latency of proof generation for each shielded transaction in Zcash by 37.1× against bellman and provide a speedup of 9.2× over bellperson. The reason is that GZKP accelerates the computation of the two stages better through the new design compared with bellperson, especially improving the performance of the more time-consuming MSM stage by 8×. Here the improvement echoes the zkSNARK results.

Multi-GPU tests. We also evaluate the end-to-end performance of GZKP with four V100 GPUs by using the Zcash workload. The best-performing GPU-based ZKP system (bellperson) also supports multiple GPU cards for the MSM stage. For the POLY stage, we assign data-independent NTT operations to different GPU cards for parallel computation. For the MSM stage, we decompose the computation horizontally into 4 smaller sub-MSM tasks, where each task uses all our proposed optimizations, and then assign each of them to a GPU. As shown in Table 4, although additional data communication overhead is introduced between multiple cards, GZKP reduces the proof-generation time further by an average of 2.1× relative to using only one GPU card. Also, due to better scalability, GZKP achieves, on average, 13.2× speedup against BG (bellperson).

Table 4: Performance results (in seconds) of Zcash workload on four V100 GPUs with BLS12-381 (381-bit) elliptic curve

  Zcash           Vector   Best-GPU        GZKP             Overall
  workload        size     POLY    MSM     POLY     MSM     Speedup
  Sapling_Output  8191     0.052   0.24    0.0004   0.023   12.7×
  Sapling_Spend   131071   0.16    0.31    0.0017   0.049   9.3×
  Sprout          2097151  0.69    1.08    0.027    0.074   17.6×

5.3 Breakdown Analysis
We perform breakdown analysis to assess the impact of individual GZKP optimizations, for the two ZKP stages separately.

NTT operation. Here we examine the execution time of a single NTT operation, the major computation within the POLY stage. We use synthetic data generated by libsnark with the MNT4753 and BLS12-381 curves. Note that BLS12-381 curves use 256-bit integers and 256-bit NTT operations in the POLY stage. As shown in Table 5, GZKP achieves, on average, 343.0× and 5.8× speedup against the BC (libsnark) with the MNT4753 curve and the BG (bellperson) with the BLS12-381 curve on the V100 GPU, respectively. GZKP execution time for a single NTT operation scales linearly with the NTT size as expected, while libsnark fails to achieve so, due to its redundant computation of $\omega_N^i$ in each butterfly operation. GZKP avoids this cost by preprocessing before the NTT operation. Specifically, we use a parallel strategy more suitable for the GPU. As shown in Figure 2, each iteration $i$ contains $2^i$ unique $\omega_N$-related values. Our design only needs to store these unique preprocessing values once, without redundant storage. We also avoid the memory access overhead by ensuring that the threads read the preprocessing values continuously. However, such preprocessing is not suitable for libsnark because of its specific parallel strategy, which is not exactly the same as Figure 2. We have modified libsnark to precompute all $\omega_N$-related values, but these values actually take up 16× the memory size, e.g., up to 24 GB in $2^{24}$-NTT. The performance improvement is only 1.5× on average. Since the preprocessing values increase the overall memory footprint and the data transfers from memory, the extra memory access overhead dwarfs the saved computations.
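A sketch of this twiddle preprocessing (our reading of the scheme above, with the layout and names being our assumptions): iteration $i$ of an $n$-point NTT needs $2^i$ distinct powers of $\omega_n$, so a single table of $n - 1$ entries, laid out iteration by iteration, stores every value exactly once and lets consecutive threads read consecutive entries.

```cuda
#include <cstdint>
#include <vector>

static const uint64_t P = 2013265921ULL;  // toy NTT prime, as before
static uint64_t mulm(uint64_t a, uint64_t b) { return (a * b) % P; }
static uint64_t powm(uint64_t b, uint64_t e) {
    uint64_t r = 1;
    for (; e; e >>= 1, b = mulm(b, b))
        if (e & 1) r = mulm(r, b);
    return r;
}

// Entries for iteration i occupy indices [2^i - 1, 2^{i+1} - 1).
std::vector<uint64_t> twiddle_table(uint64_t w_n, int log_n) {
    std::vector<uint64_t> tab;
    tab.reserve(((size_t)1 << log_n) - 1);
    for (int i = 0; i < log_n; ++i) {
        // Iteration i uses powers of w_n^(n / 2^{i+1}), for j = 0 .. 2^i - 1.
        uint64_t step = powm(w_n, (uint64_t)1 << (log_n - i - 1));
        uint64_t w = 1;
        for (int j = 0; j < (1 << i); ++j) {
            tab.push_back(w);
            w = mulm(w, step);
        }
    }
    return tab;  // total size: sum_i 2^i = n - 1 unique values
}
```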
Table 6 lists results of similar tests, but on a lower-end GPU (GTX1080Ti), with fewer SMs and lower memory bandwidth. GZKP still achieves on average a speedup of 8.9× against bellperson with the BLS12-381 curve. The latter is more severely affected by the reduced memory bandwidth. Note that we ignore the preprocessing time required for the butterfly operation for fairness, which is executed serially on the CPU in bellperson and takes on average 5× the latency in Table 5, while GZKP performs it in parallel on the GPU. The speedup ratio of GZKP compared to bellperson varies for different NTT sizes. The reason is that with the fixed division of a GPU block to process independent groups, bellperson suffers from sub-optimal block division, while GZKP avoids such problems.

Table 5: Performance results (in milliseconds) of single NTT operation on the V100 GPU

  NTT    753-bit                       256-bit
  Scale  Best CPU  GZKP                Best GPU  GZKP
  2^14   102       0.15 (697.4×)       0.37      0.05 (8.0×)
  2^16   212       0.49 (434.4×)       0.48      0.09 (5.3×)
  2^18   565       1.91 (295.4×)       2.89      0.28 (10.3×)
  2^20   2110      7.46 (282.5×)       5.19      1.07 (4.8×)
  2^22   8180      33.67 (242.9×)      12.69     4.96 (2.5×)
  2^24   32517     141.40 (230.0×)     46.74     20.99 (2.2×)
  2^26   131441    602.53 (218.1×)     665.84    91.05 (7.3×)

Table 6: Performance results (in milliseconds) of single NTT operation on the GTX1080Ti GPU

  NTT    753-bit                       256-bit
  Scale  Best CPU  GZKP                Best GPU  GZKP
  2^14   102       0.33 (305.0×)       0.52      0.06 (9.4×)
  2^16   212       1.16 (182.8×)       0.98      0.18 (5.5×)
  2^18   565       6.21 (91.01×)       14.64     0.70 (20.9×)
  2^20   2110      27.26 (77.30×)      23.80     2.87 (8.3×)
  2^22   8180      119.82 (68.24×)     70.50     12.83 (5.5×)
  2^24   32517     539.25 (60.30×)     234.59    56.18 (4.2×)
[Figure 8 (bar chart): normalized execution time of BG w. lib, GZKP-no-GM-shuffle, and GZKP at NTT scales $2^{18}$ and $2^{22}$.]
Figure 8: Breakdown analysis of single NTT operation with 256-bit BLS12-381 curve

Table 7: Performance results (in seconds) of single MSM operation (G1-type) on the V100 GPU

  MSM    753-bit                 381-bit                  256-bit
  Scale  Best GPU  GZKP          Best GPU  GZKP           Best CPU  GZKP
  2^14   0.36      0.02 (17.9×)  0.032     0.004 (8.0×)   0.07      0.004 (18.1×)
  2^16   0.48      0.05 (9.2×)   0.052     0.007 (7.4×)   0.18      0.006 (29.3×)
  2^18   1.99      0.16 (12.1×)  0.14      0.020 (7.2×)   0.45      0.015 (30.1×)
  2^20   7.2       0.60 (12.1×)  0.53      0.062 (8.5×)   1.48      0.045 (32.9×)
  2^22   28.1      2.66 (10.7×)  1.35      0.24 (5.6×)    4.90      0.17 (28.7×)
  2^24   -         11.3          6.55      1.10 (6.0×)    17.27     0.72 (23.8×)
  2^26   -         40.7          24.42     4.00 (6.1×)    65.70     2.79 (23.5×)
To take a deeper look, we investigate the improvement brought by major GZKP proposals for the POLY stage, using the BLS12-381 curve as an example. Here the BG (bellperson) contains a shuffle stage, assigns each GPU block only one independent group, and creates batches by grouping every 8 iterations. First, to show the improvement brought by our optimized finite field library, we implement a BG-like solution accelerated by this library ("BG w. lib"). Then, we evaluate an intermediate solution ("GZKP-no-GM-shuffle") that removes the shuffling on the global memory. Finally, the full GZKP contains all our proposed optimizations. It adopts smaller independent groups as the work granularity, creates batches by grouping fewer iterations, and performs the internal shuffling that promotes contiguous memory accesses (Section 3).

As shown in Figure 8, when the NTT scale is $2^{22}$, BG w. lib achieves 1.6× speedup against BG. GZKP achieves another 1.5× speedup against BG w. lib. In particular, for BG, when the NTT scale is $2^{18}$, the last batch has 2 iterations but $2^{16}$ GPU blocks, and each block only contains 2 threads. Because GPU scheduling uses the warp (32 threads) as the unit, there are 30 threads idling. The scheduling overhead is also high due to so many GPU blocks. As the number of iterations in the last batch increases, the influence of such sub-optimal block division in BG gradually reduces. With flexible GPU block assignment, the performance of GZKP's NTT module is almost linear with the NTT scale.

MSM operation. Then, we examine the execution time of a single MSM operation, the major computation within the MSM stage. Similarly, we conduct the experiment on the dense synthetic data created by libsnark with the 753-bit MNT4753, 381-bit BLS12-381, and 256-bit ALT-BN128 curves. As shown in Table 7, GZKP achieves, on average, 12.4× and 7.0× speedup against the best-performing GPU-based ZKP systems, MINA for the MNT4753 curve and bellperson for the BLS12-381 curve, on the V100 GPU, respectively. Table 8 gives the results on the lower-end GTX1080Ti, where GZKP still achieves on average 4.3× and 6.1× speedup against the BG MINA and bellperson, respectively. For the ALT-BN128 curve, GZKP achieves on average 26.6× speedup against the best-performing CPU-based ZKP system (libsnark).

In addition, we evaluate the memory usage of GZKP's MSM module on the V100 GPU with different elliptic curves, as shown in Figure 9. Here "GZKP-MNT4" and "GZKP-BLS" refer to GZKP with the MNT4753 curve (adopted by MINA) and the BLS12-381 curve (adopted by bellperson), respectively. For the MNT4753 curve, GZKP's memory usage grows slower than that of MINA. When the MSM scale exceeds $2^{22}$, MINA execution fails due to insufficient GPU global memory. For the BLS12-381 curve, GZKP does consume more memory than bellperson, but stays stable beyond $2^{22}$ and well within the total global memory size, due to Algorithm 1, which adapts to GPUs of different memory sizes. As GZKP significantly outperforms bellperson, this shows that it is able to utilize the available global memory for better performance, while bellperson under-utilizes the GPU memory.

Table 8: Performance results (in seconds) of single MSM operation (G1-type) on the GTX1080Ti GPU

  MSM    753-bit                 381-bit                  256-bit
  Scale  Best GPU  GZKP          Best GPU  GZKP           Best CPU  GZKP
  2^14   0.35      0.08 (4.5×)   0.093     0.015 (6.2×)   0.07      0.007 (10.3×)
  2^16   1.00      0.20 (5.0×)   0.20      0.032 (6.3×)   0.18      0.013 (13.5×)
  2^18   2.71      0.71 (3.8×)   0.64      0.073 (8.8×)   0.45      0.032 (14.1×)
  2^20   10.07     2.51 (4.0×)   1.43      0.26 (5.4×)    1.48      0.10 (14.5×)
  2^22   -         11.91         5.16      1.02 (5.2×)    4.90      0.37 (13.2×)
  2^24   -         46.83         19.86     4.16 (4.8×)    17.27     1.50 (11.5×)

[Figure 9 (chart): memory usage (GB) vs. MSM scale ($2^{14}$-$2^{26}$) for GZKP-MNT4, MINA, GZKP-BLS, and bellperson.]
Figure 9: MSM memory usage with different curves on V100

To take a deeper look, we investigate the improvement brought by major GZKP proposals for the MSM stage, using the BLS12-381 curve on the V100 GPU as an example. Here the BG (bellperson) is the baseline, which explores the data parallelism by decomposing the MSM task horizontally into multiple sub-MSM tasks. Since BG executes the point-merging and bucket-reduction steps on the GPU side and the last window-reduction step on the CPU side, we
[Figure 10 (bar chart): normalized execution time of BG, GZKP-no-LB, GZKP-no-LB w. lib, and GZKP at MSM scales $2^{18}$ and $2^{22}$.]
Figure 10: Breakdown analysis for MSM module with BLS12-381 (381-bit) elliptic curve on one V100 GPU

use GZKP's MSM module to evaluate the performance improvement brought by the GPU-based finite field library. We evaluate two intermediate solutions, "GZKP-no-LB" and "GZKP-no-LB w. lib". The former implements the optimized MSM computation process by leveraging bucket-based task partitioning. The latter uses the optimized finite field library on top of the former. Finally, the full GZKP contains all our proposed optimizations, including the computation consolidation in Section 4.1 and the workload management in Section 4.2.

As shown in Figure 10, with the MSM scale $2^{22}$, the breakdown analysis shows that GZKP-no-LB achieves 3.25× speedup compared to BG. And GZKP-no-LB w. lib further improves the performance by 33% over GZKP-no-LB. After enabling load balance, GZKP achieves 5.6× speedup against the BG. The main improvement for the MSM module in GZKP is due to the computation consolidation optimization. Besides, our MSM module ensures load balance, whether for sparse $\vec{u}$ or dense $\vec{h}$, through fine-grained task mapping.

6 RELATED WORK
zkSNARK protocol and implementation of ZKP systems. Recently, many works [2, 3, 5, 6, 9, 34, 39, 51, 54, 57, 60] have made great efforts to optimize the zkSNARK protocol and make it more practical. Since different ZKP applications [7, 15, 19, 56] use different elliptic curves, there are many ZKP system implementations based on the zkSNARK protocol (i.e., libsnark [50], bellman [23], MINA [18], and bellperson [16]), and most of these implementations only support specific curves. GZKP supports different types of elliptic curves to serve a range of applications by leveraging the optimized finite field library, and the modular design enables GZKP to be easily integrated into existing ZKP systems as a high-performance back-end to achieve fast proof generation.

Hardware acceleration for ZKP systems. There are a few designs improving the performance of ZKP with hardware acceleration, such as FPGAs [25, 28, 63] and GPUs [16, 18]. A recent work called PipeZK [63] is proposed to accelerate zkSNARK through a customized hardware accelerator, which pipelines the key operations (i.e., NTT and MSM) in proof generation. These designs cannot work well with the increasing scale of the input vector. Current GPU-based ZKP systems (i.e., MINA [18], bellperson [16]) in industry accelerate the prover computation by parallelizing those key operations. These systems suffer from huge data shuffling overhead caused by large bit-width integers and severe load imbalance caused by sparse data distribution. GZKP outperforms them as we have demonstrated earlier.

GPU acceleration for NTT and MSM. Another body of previous works [10, 26, 36, 38, 46] accelerate the NTT operation on GPUs and involve data shuffling to ensure contiguous memory accesses during computation, whose cost grows fast with increasing bit-width. These designs cannot meet the scalability requirements of the NTT operation in ZKP, where the NTT scale is up to a million and the bit-width of elements is larger than 256 bits. Some related works [1, 13, 33, 53, 59] accelerate the basic PADD/PMUL operations using GPUs to speed up elliptic curve cryptography. These acceleration methods cannot be directly applied to ZKP MSM due to the computation scale (millions of PMULs/PADDs) and the scalar bit-width (up to 753-bit). GZKP reduces the calculation complexity of the MSM by leveraging Pippenger-like algorithms [4, 55, 58] and speeds up the basic PADD operation using the optimized finite field library.

7 DISCUSSION AND FUTURE WORK
GZKP focuses on accelerating privacy-preserving applications based on ZKP protocols and provides good performance as demonstrated in Section 5. Here we discuss GZKP's wider application and its correctness.

NTT-based encryption algorithms. NTT is a key building block of many homomorphic encryption (HE) algorithms, in addition to ZKP. These two types of cryptographic algorithms have different performance requirements. ZKP focuses more on the latency of a single NTT operation, as the multiple NTTs are difficult to parallelize in many protocols [11, 51]. Therefore, in GZKP we use all the compute resources of a GPU for a single NTT operation, plus asynchronous data copying to reduce host-GPU data transfer overhead. In contrast, HE usually runs multiple NTT operations on the GPU simultaneously (NTT batching [38, 45]) to achieve higher throughput and GPU utilization, enabled by the smaller NTT scales and the independent nature of multiple NTTs. Our design adopts smaller independent groups as the task granularity, making it suitable for throughput-oriented NTT applications with the aforementioned batching techniques. We plan to expand GZKP to support HE algorithms in our future work.

Verification of GZKP. Note that GZKP focuses on efficient parallel acceleration of core operations (NTT and MSM) in the ZKP protocol. It does not modify the ZKP protocol itself, hence has no impact on its security. Furthermore, proof generation on private user input data usually runs on the user's local machine. Implementation bugs may generate wrong proofs, which could be addressed by formal verification of GZKP. We will leave this as future work.

8 CONCLUSION
Zero-knowledge proof can guarantee the integrity and confidentiality of transactions and is increasingly used in applications like cryptocurrency. However, its high computation overhead hinders its deployment in online services. In this paper, we present GZKP, a GPU-based ZKP system that offers at least an order of magnitude improvement over current state-of-the-art solutions. Our results and experience reveal that there is a significant amount of opportunity for improving the parallel execution efficiency of ZKP
proof generation. Meanwhile, ZKP requires targeted optimizations to handle the large integer operations that dominate its execution.

ACKNOWLEDGMENTS
The authors thank the anonymous reviewers and the shepherd for their valuable comments. We thank Dong Zhou and Xiaoxu Guo for their participation. This work was supported in part by the National Key R&D Program of China under Grant 2020AAA0108501, and the Key R&D Program of Hubei under Grant 2020BAA020.

REFERENCES
[1] Samuel Antao, Jean-Claude Bajard, and Leonel Sousa. 2010. Elliptic curve point multiplication on GPUs. In ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors. IEEE, 192–199. https://doi.org/10.1109/ASAP.2010.5541000
[2] Eli Ben-Sasson, Iddo Bentov, Alessandro Chiesa, Ariel Gabizon, Daniel Genkin, Matan Hamilis, Evgenya Pergament, Michael Riabzev, Mark Silberstein, Eran Tromer, and Madars Virza. 2017. Computational Integrity with a Public Random String from Quasi-Linear PCPs. In Advances in Cryptology – EUROCRYPT 2017, Jean-Sébastien Coron and Jesper Buus Nielsen (Eds.). Springer International Publishing, Cham, 551–579. https://doi.org/10.1007/978-3-319-56617-7_19
[3] Eli Ben-Sasson, Alessandro Chiesa, Eran Tromer, and Madars Virza. 2014. Succinct Non-Interactive Zero Knowledge for a von Neumann Architecture. (2014), 781–796. https://dl.acm.org/doi/10.5555/2671225.2671275
[4] Daniel J Bernstein, Jeroen Doumen, Tanja Lange, and Jan-Jaap Oosterwijk. 2012. Faster batch forgery identification. In International Conference on Cryptology in India. Springer, 454–473. https://doi.org/10.1007/978-3-642-34931-7_26
[5] Nir Bitansky, Ran Canetti, Alessandro Chiesa, and Eran Tromer. 2012. From Extractable Collision Resistance to Succinct Non-Interactive Arguments of Knowledge, and Back Again. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (Cambridge, Massachusetts) (ITCS '12). Association for Computing Machinery, New York, NY, USA, 326–349. https://doi.org/10.1145/2090236.2090263
[6] Manuel Blum, Paul Feldman, and Silvio Micali. 1988. Non-Interactive Zero-Knowledge and Its Applications. In Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing. Association for Computing Machinery, New York, NY, USA, 103–112. https://doi.org/10.1145/62212.62222
[7] Joseph Bonneau, Izaak Meckler, Vanishree Rao, and Evan Shapiro. 2020. Coda: Decentralized cryptocurrency at scale. Cryptology ePrint Archive (2020). https://eprint.iacr.org/2020/352
[8] Sean Bowe, Alessandro Chiesa, Matthew Green, Ian Miers, Pratyush Mishra, and Howard Wu. 2020. ZEXE: Enabling Decentralized Private Computation. In 2020 IEEE Symposium on Security and Privacy (SP). 947–964. https://doi.org/10.1109/SP40000.2020.00050
[9] Benedikt Bünz, Alessandro Chiesa, Pratyush Mishra, and Nicholas Spooner. 2020. Proof-carrying data from accumulation schemes. Cryptology ePrint Archive (2020). https://eprint.iacr.org/2020/499
[10] Liangyu Chen, Svyatoslav Covanov, Davood Mohajerani, and Marc Moreno Maza. 2017. Big Prime Field FFT on the GPU. In Proceedings of the 2017 ACM on International Symposium on Symbolic and Algebraic Computation (Kaiserslautern, Germany) (ISSAC '17). Association for Computing Machinery, New York, NY, USA, 85–92. https://doi.org/10.1145/3087604.3087657
[11] Alessandro Chiesa, Yuncong Hu, Mary Maller, Pratyush Mishra, Noah Vesely, and Nicholas Ward. 2020. Marlin: Preprocessing zkSNARKs with Universal and Updatable SRS. In Advances in Cryptology – EUROCRYPT 2020, Anne Canteaut and Yuval Ishai (Eds.). Springer International Publishing, Cham, 738–768. https://doi.org/10.1007/978-3-030-45721-1_26
[12] Eleanor Chu and Alan George. 1999. Inside the FFT Black Box: Serial and Parallel Fast Fourier Transform Algorithms. CRC Press.
[13] Aaron E. Cohen and Keshab K. Parhi. 2010. GPU accelerated elliptic curve cryptography in GF(2^m). In 2010 53rd IEEE International Midwest Symposium on Circuits and Systems. 57–60. https://doi.org/10.1109/MWSCAS.2010.5548560
[14] James W Cooley and John W Tukey. 1965. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation 19, 90 (1965), 297–301. https://doi.org/10.2307/2003354
[15] Filecoin Corp. 2022. https://filecoin.io/.
[16] Filecoin Corp. 2022. bellperson: gpu parallel acceleration for zk-snark. https://github.com/filecoin-project/bellperson.
[17] J.P. Morgan Quorum Corp. 2022. https://www.goquorum.com/.
[18] MINA Corp. 2019. GPU Groth16 prover (3x faster than CPU). https://github.com/MinaProtocol/gpu-groth16-prover-3x.
[19] Mina Corp. 2022. https://minaprotocol.com/.
[20] NVIDIA Corp. 2021. CUDA C++ Programming Guide. https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf.
[21] StarkWare Corp. 2022. https://starkware.co/.
[22] Zcash Corp. 2022. https://z.cash/.
[23] ZKCrypto Corp. 2022. bellman: a crate for building zk-SNARK circuits. https://github.com/zkcrypto/bellman.
[24] Zcash Corp. 2022. librustzcash: a (work-in-progress) set of Rust crates for working with Zcash. https://github.com/zcash/librustzcash.
[25] Zcash Corp. 2022. Zcash FPGA acceleration engine. https://github.com/ZcashFoundation/zcash-fpga.
[26] Wei Dai and Berk Sunar. 2015. cuHE: A homomorphic encryption accelerator library. In International Conference on Cryptography and Information Security in the Balkans. Springer, 169–186. https://doi.org/10.1007/978-3-319-29172-7_11
[27] Theodorus Jozef Dekker. 1971. A floating-point technique for extending the available precision. Numer. Math. 18, 3 (1971), 224–242. https://doi.org/10.1007/BF01397083
[28] Benjamin Devlin. 2022. FPGA SNARK prover targeting the bn128 curve. https://github.com/bsdevlin/fpga_snark_prover.
[29] Jiankuo Dong, Fangyu Zheng, Niall Emmart, Jingqiang Lin, and Charles Weems. 2018. sDPF-RSA: Utilizing floating-point computing power of GPUs for massive digital signature computations. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 599–609. https://doi.org/10.1109/IPDPS.2018.00069
[30] Niall Emmart, Fangyu Zheng, and Charles Weems. 2018. Faster modular exponentiation using double precision floating point arithmetic on the GPU. In 2018 IEEE 25th Symposium on Computer Arithmetic (ARITH). IEEE, 130–137. https://doi.org/10.1109/ARITH.2018.8464792
[31] Ariel Gabizon, Zachary J Williamson, and Oana Ciobotaru. 2019. Plonk: Permutations over lagrange-bases for oecumenical noninteractive arguments of knowledge. Cryptology ePrint Archive (2019). https://eprint.iacr.org/2019/953
[32] Hisham S. Galal and Amr M. Youssef. 2018. Verifiable Sealed-Bid Auction on the Ethereum Blockchain. In Financial Cryptography and Data Security: FC 2018 International Workshops, BITCOIN, VOTING, and WTSC, Nieuwpoort, Curaçao, March 2, 2018, Revised Selected Papers (Nieuwpoort, Curaçao). Springer-Verlag, Berlin, Heidelberg, 265–278. https://doi.org/10.1007/978-3-662-58820-8_18
[33] Lili Gao, Fangyu Zheng, Niall Emmart, Jiankuo Dong, Jingqiang Lin, and Charles Weems. 2020. DPF-ECC: Accelerating Elliptic Curve Cryptography with Floating-Point Computing Power of GPUs. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 494–504. https://doi.org/10.1109/IPDPS47924.2020.00058
[34] Rosario Gennaro, Craig Gentry, Bryan Parno, and Mariana Raykova. 2013. Quadratic Span Programs and Succinct NIZKs without PCPs. In Advances in Cryptology – EUROCRYPT 2013, Thomas Johansson and Phong Q. Nguyen (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 626–645. https://doi.org/10.1007/978-3-642-38348-9_37
[35] Craig Gentry. 2009. Fully homomorphic encryption using ideal lattices. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing. Association for Computing Machinery, New York, NY, USA, 169–178. https://doi.org/10.1145/1536414.1536440
[36] Jia-Zheng Goey, Wai-Kong Lee, Bok-Min Goi, and Wun-She Yap. 2021. Accelerating number theoretic transform in GPU platform for fully homomorphic encryption. The Journal of Supercomputing 77 (2021), 1455–1474. https://doi.org/10.1007/s11227-020-03156-7
[37] Shafi Goldwasser, Silvio Micali, and Charles Rackoff. 2019. The Knowledge Complexity of Interactive Proof-Systems. (2019), 203–225. https://doi.org/10.1145/3335741.3335750
[38] Naga K Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manferdelli. 2008. High performance discrete Fourier transforms on graphics processors. In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. IEEE, 1–12. https://doi.org/10.1109/SC.2008.5213922
[39] Jens Groth. 2010. Short pairing-based non-interactive zero-knowledge arguments. In International Conference on the Theory and Application of Cryptology and Information Security. Springer, 321–340. https://doi.org/10.1007/978-3-642-17373-8_19
[40] Jens Groth. 2016. On the Size of Pairing-Based Non-interactive Arguments. In Advances in Cryptology – EUROCRYPT 2016, Marc Fischlin and Jean-Sébastien Coron (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 305–326. https://doi.org/10.1007/978-3-662-49896-5_11
[41] Darrel Hankerson, Alfred J Menezes, and Scott Vanstone. 2006. Guide to Elliptic Curve Cryptography. Springer Science & Business Media.
[42] Mark Harris, Shubhabrata Sengupta, and John D Owens. 2007. Parallel prefix sum (scan) with CUDA. GPU Gems 3, 39 (2007), 851–876.
[43] Daira Hopwood, Sean Bowe, Taylor Hornby, and Nathan Wilcox. 2022. Zcash Protocol Specification. (2022). https://zips.z.cash/protocol/protocol.pdf
[44] QED-it Corp. 2017. https://qed-it.com/.
[45] Wonkyung Jung, Sangpyo Kim, Jung Ho Ahn, Jung Hee Cheon, and Younho Lee. 2021. Over 100x faster bootstrapping in fully homomorphic encryption through memory-centric optimization with GPUs. IACR Transactions on Cryptographic Hardware and Embedded Systems (2021), 114–148. https://doi.org/10.46586/tches.v2021.i4.114-148
[46] Sangpyo Kim, Wonkyung Jung, Jaiyoung Park, and Jung Ho Ahn. 2020. Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs. In 2020 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 264–275. https://doi.org/10.1109/IISWC50251.2020.00033
[47] Ahmed Kosba. 2021. jsnark: a Java library for building circuits for preprocessing zk-SNARKs. https://github.com/akosba/jsnark.
[48] Ahmed Kosba, Dimitrios Papadopoulos, Charalampos Papamanthou, and Dawn Song. 2020. MIRAGE: Succinct Arguments for Randomized Algorithms with Applications to Universal zk-SNARKs. In Proceedings of the 29th USENIX Conference on Security Symposium. USENIX Association, USA, Article 120, 18 pages. https://dl.acm.org/doi/10.5555/3489212.3489332
[49] Ahmed Kosba, Charalampos Papamanthou, and Elaine Shi. 2018. xJsnark: A framework for efficient verifiable computation. In 2018 IEEE Symposium on Security and Privacy (SP). IEEE, 944–961. https://doi.org/10.1109/SP.2018.00018
[50] SCIPR LAB. 2022. libsnark: a C++ library for zkSNARK proofs. https://github.com/scipr-lab/libsnark.
[51] Mary Maller, Sean Bowe, Markulf Kohlweiss, and Sarah Meiklejohn. 2019. Sonic: Zero-Knowledge SNARKs from Linear-Size Universal and Updatable Structured Reference Strings. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (London, United Kingdom) (CCS '19). Association for Computing Machinery, New York, NY, USA, 2111–2128. https://doi.org/10.1145/3319535.3339817
[52] Peter L Montgomery. 1985. Modular multiplication without trial division. Mathematics of Computation 44, 170 (1985), 519–521.
[53] Wuqiong Pan, Fangyu Zheng, Yuan Zhao, Wen-Tao Zhu, and Jiwu Jing. 2016. An efficient elliptic curve cryptography signature server with GPU acceleration. IEEE Transactions on Information Forensics and Security 12, 1 (2016), 111–122. https://doi.org/10.1109/TIFS.2016.2603974
[54] Bryan Parno, Jon Howell, Craig Gentry, and Mariana Raykova. 2013. Pinocchio: Nearly practical verifiable computation. In 2013 IEEE Symposium on Security and Privacy. IEEE, 238–252. https://doi.org/10.1109/SP.2013.47
[55] Nicholas Pippenger. 1976. On the evaluation of powers and related problems. In 17th Annual Symposium on Foundations of Computer Science (SFCS 1976). IEEE Computer Society, 258–263. https://doi.org/10.1109/SFCS.1976.21
[56] Eli Ben Sasson, Alessandro Chiesa, Christina Garman, Matthew Green, Ian Miers, Eran Tromer, and Madars Virza. 2014. Zerocash: Decentralized anonymous payments from Bitcoin. In 2014 IEEE Symposium on Security and Privacy. IEEE, 459–474. https://doi.org/10.1109/SP.2014.36
[57] Srinath Setty. 2020. Spartan: Efficient and general-purpose zkSNARKs without trusted setup. In Annual International Cryptology Conference. Springer, 704–737. https://doi.org/10.1007/978-3-030-56877-1_25
[58] Ernst G Straus. 1964. Addition chains of vectors (problem 5125). Amer. Math. Monthly 70, 806–808 (1964), 16.
[59] Robert Szerwinski and Tim Güneysu. 2008. Exploiting the power of GPUs for asymmetric cryptography. In International Workshop on Cryptographic Hardware and Embedded Systems. Springer, 79–99. https://doi.org/10.1007/978-3-540-85053-3_6
[60] Howard Wu, Wenting Zheng, Alessandro Chiesa, Raluca Ada Popa, and Ion Stoica. 2018. DIZK: A Distributed Zero Knowledge Proof System. Cryptology ePrint Archive, Paper 2018/691. (2018). https://eprint.iacr.org/2018/691
[61] Jiaheng Zhang, Zhiyong Fang, Yupeng Zhang, and Dawn Song. 2020. Zero knowledge proofs for decision tree predictions and accuracy. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. 2039–2053. https://doi.org/10.1145/3372297.3417278
[62] Yupeng Zhang, Daniel Genkin, Jonathan Katz, Dimitrios Papadopoulos, and Charalampos Papamanthou. 2017. vSQL: Verifying arbitrary SQL queries over dynamic outsourced databases. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 863–880. https://doi.org/10.1109/SP.2017.43
[63] Ye Zhang, Shuo Wang, Xian Zhang, Jiangbin Dong, Xingzhong Mao, Fan Long, Cong Wang, Dong Zhou, Mingyu Gao, and Guangyu Sun. 2021. PipeZK: Accelerating Zero-Knowledge Proof with a Pipelined Architecture. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). 416–428. https://doi.org/10.1109/ISCA52012.2021.00040
[64] Zhichao Zhao and T-H Hubert Chan. 2015. How to vote privately using Bitcoin. In International Conference on Information and Communications Security. Springer, 82–96. https://doi.org/10.1007/978-3-319-29814-6_8
[65] zkSync Corp. 2022. https://zksync.io/.

Received 2022-07-07; accepted 2022-09-22