0% found this document useful (0 votes)

91 views14 pages

GZKP- A GPU Accelerated Zero-Knowledge Proof System

The document presents GZKP, a GPU-accelerated zero-knowledge proof system designed to enhance the performance of proof generation in privacy-preserving applications. GZKP leverages parallel processing capabilities of GPUs to significantly reduce proof generation time, achieving speedups of up to 48.1× compared to existing systems. The system addresses performance bottlenecks through optimized memory access patterns and novel parallelization strategies for multi-scalar multiplication.

Uploaded by

zhenlin wu

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

91 views14 pages

GZKP- A GPU Accelerated Zero-Knowledge Proof System

Uploaded by

zhenlin wu

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 14

GZKP: A GPU Accelerated Zero-Knowledge Proof System

Weiliang Ma∗‡ Qian Xiong∗‡ Xuanhua Shi†‡

Huazhong University of Science and Huazhong University of Science and Huazhong University of Science and
Technology Technology Technology
Wuhan, Hubei, China Wuhan, Hubei, China Wuhan, Hubei, China
weiliangma@hust.edu.cn qianxiong@hust.edu.cn xhshi@hust.edu.cn

Xiaosong Ma Hai Jin‡ Haozhao Kuang‡

Hamad Bin Khalifa University Huazhong University of Science and Huazhong University of Science and
Doha, Qatar Technology Technology
xma@hbku.edu.qa Wuhan, Hubei, China Wuhan, Hubei, China
hjin@hust.edu.cn haozhaokuang@hust.edu.cn

Mingyu Gao Ye Zhang Haichen Shen

Tsinghua University Scroll Foundation Scroll Foundation
Beijing, China Seychelles Seychelles
gaomy@tsinghua.edu.cn ye@scroll.io haichen@scroll.io

Weifang Hu‡
Huazhong University of Science and
Technology
Wuhan, Hubei, China
huwf@hust.edu.cn
ABSTRACT access pattern while eliminating the costly external shuffle in ex-
Zero-knowledge proof (ZKP) is a cryptographic protocol that al- isting solutions. For multi-scalar multiplication, GZKP adopts a
lows one party to prove the correctness of a statement to another new parallelization strategy, which aggressively combines integer
party without revealing any information beyond the correctness elliptic curve point operations and exploits fine-grained task par-
of the statement itself. It guarantees computation integrity and allelism with load balancing for sparse integer distribution. GZKP
confidentiality, and is therefore increasingly adopted in industry outperforms the state-of-the-art ZKP systems by an order of mag-
for a variety of privacy-preserving applications, such as verifiable nitude, achieving up to 48.1× and 17.6× speedup with standard
outsource computing and digital currency. cryptographic benchmarks and a real-world application workload,
A significant obstacle in using ZKP for online applications is the respectively.
performance overhead of its proof generation. We develop GZKP, a
GPU accelerated zero-knowledge proof system that supports differ-
ent levels of security requirements and brings significant speedup CCS CONCEPTS
toward making ZKP truly usable. For polynomial computation • Computer systems organization → Parallel architectures;
over a large finite field, GZKP promotes a cache-friendly memory Data flow architectures; • Security and privacy → Privacy-preserving
∗ Both
protocols.
authors contributed equally to this research.
† Corresponding author.
‡ WeiliangMa, Qian Xiong, Xuanhua Shi, Hai Jin, Haozhao Kuang, and Weifang Hu
are with National Engineering Research Center for Big Data Technology and System, KEYWORDS
Service Computing Technology and System Lab, Cluster and Grid Computing Lab,
School of Computer Science and Technology, Huazhong University of Science and zero-knowledge proof; GPU acceleration
Technology, Wuhan, China.

Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
ACM Reference Format:
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the Weiliang Ma, Qian Xiong, Xuanhua Shi, Xiaosong Ma, Hai Jin, Haozhao
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or Kuang, Mingyu Gao, Ye Zhang, Haichen Shen, and Weifang Hu. 2023. GZKP:
republish, to post on servers or to redistribute to lists, requires prior specific permission A GPU Accelerated Zero-Knowledge Proof System. In Proceedings of the
and/or a fee. Request permissions from permissions@acm.org.
28th ACM International Conference on Architectural Support for Programming
ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Languages and Operating Systems, Volume 2 (ASPLOS ’23), March 25–29,
ACM ISBN 978-1-4503-9916-6/23/03. . . $15.00 2023, Vancouver, BC, Canada. ACM, New York, NY, USA, 14 pages. https:
https://doi.org/10.1145/3575693.3575711 //doi.org/10.1145/3575693.3575711

340
ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada W. Ma, Q. Xiong, X. Shi, X. Ma, H. Jin, H. Kuang, M. Gao, Y. Zhang, H. Shen, and W. Hu

1 INTRODUCTION applications; and their underlying mathematics library to support

Zero-knowledge proof (ZKP) is a cryptographic primitive through large finite field arithmetics is not optimized.
which one party can convince other parties the validity of a state- In this paper, we propose GZKP, a fast GPU accelerated zero-
ment without disclosing any information beyond the correctness of knowledge proof system that significantly reduces proof generation
the statement itself. For example, the owner of the machine learning time in ZKP. GZKP is designed to tackle the challenge of ZKP on
model can declare that the model has reached a certain accuracy large finite fields and exploit the massive hardware parallelism
on a public data set, and then use the ZKP primitive to guarantee offered by current GPU systems. GZKP is a general-purpose system
the validity of the declaration to others without disclosing any that can serve a range of ZKP applications (such as Zcash [22] and
secret information (e.g., parameters) about the model. ZKP could Mina [19]) and hence can be easily integrated into existing ZKP
provide security guarantees about the correctness and consistency libraries as a high-performance back-end.
of machine learning predictions in real products [61]. GZKP alleviates the performance bottlenecks in the parallel ex-
As one of the most powerful cryptographic tools, ZKP has the ecution of existing ZKP implementations. It avoids the use of fre-
potential to be used in many privacy-preserving applications, such quent and expensive data shuffling, exploits opportunities to merge
as verifiable outsourcing [62], verifiable machine learning [61], elec- redundant operations, adopts customized scheduling schemes to im-
tronic voting [64], and online auction [32]. As an up-and-coming prove load balance, and also integrates a highly optimized library for
field approaching commercial applications with increasingly pow- large integer arithmetics. More specifically, for the number theoretic
erful computer hardware, ZKP witnesses the establishment of a transform (NTT) stage in proof generation (§ 3), GZKP promotes
flurry of startups [17, 19, 21, 44, 65] in the past six years. Most of a cache-friendly memory access pattern, which allows threads to
them leverage ZKP for the scalable and user-manageable privacy perform on-the-fly and internal data shuffling when transferring
of cryptocurrency and Web3. data between the global and shared memories. The global memory
Protocols for ZKP have been studied for more than 30 years since layout keeps stable. We coalesce the accesses from neighbor threads
its first introduction [37], and recent advances in cryptography have by visiting the global memory with a coarse-grained block-style
brought ZKP closer to practical use. In particular, zero-knowledge pattern to ensure contiguous memory accesses. For the next stage
succinct non-interactive arguments of knowledge (zkSNARK) [40] is of multi-scalar multiplication (MSM) computation (§ 4), we pro-
one of the most promising ZKP protocols. It can generate a succinct pose a novel point merging algorithm to effectively consolidate
proof for any program (i.e., a few hundred bytes), and support fast computation and alleviate redundant expensive operations. The
verification of the proof. Due to these properties, zkSNARK has algorithm leverages space-efficient data preprocessing, and applies
been deployed in real-world systems of blockchains and cryptocur- a new parallelization strategy to reduce the serial execution bottle-
rencies [7, 15, 17, 22, 44]. neck. GZKP also utilizes fine-grained task mapping to ensure load
Despite such progress, ZKP is still notoriously hard to be widely balance.
adopted in industry due to the particularly large overhead of gen- We implement GZKP using CUDA and evaluate it on Nvidia
erating proofs. For example, a typical zkSNARK proof generation V100 GPUs. We compare GZKP with multiple state-of-the-art ZKP
process in Zcash [22] involves billions of large integer addition and systems, both CPU-based (libsnark [50] and bellman [23]) and GPU-
modular multiplication [52] on a large finite field where the bit- based (MINA [18] and bellperson [16]). Our results (§ 5) show
width is 381-bit. On mainstream servers today, each such modular that, compared with the state-of-the-art GPU-based ZKP system,
multiplication would take 230 ns, and each large integer addition GZKP achieves on average 13.2× and up to 17.6× speedup for real-
costs 43 ns. With billions of operations for one anonymous transac- world application workloads, and on average 33.6× and up to 48.1×
tion [22], one such proof generation process would need over half speedup for standard cryptographic microbenchmarks. If we mea-
a minute to finish. Such a long latency makes it impractical to use sure the two key stages separately, GZKP achieves on average 5.8×
in scenarios like real-time payment and decentralized trading trans- and up to 10.3× speedup in the NTT module, and on average 9.1×
actions with cryptocurrency, despite its great benefits in privacy and up to 17.9× speedup in the MSM module.
protection. In summary, this paper makes the following major technical
Most of the existing ZKP systems are CPU-based [3, 23, 34, 48, contributions in GZKP:
50, 54] and are forced to use clusters for larger-scale ZKP applica- • An NTT module that performs memory accesses with better
tions [60]. Recently emerged GPU-based ZKP systems [16, 18] take cache efficiency, by replacing the increasingly expensive
advantage of the massive hardware parallelism available on modern global-memory shuffle operation with flexible scheduling
GPUs to accelerate key ZKP operations. However, existing GPU- and internal shuffling;
based solutions fall short in maximizing ZKP processing efficiency • An MSM module that uses a new parallelization strategy
in several aspects. They rely on excessive data shuffling in the GPU to aggressively consolidate redundant integer elliptic curve
global memory to ensure contiguous memory accesses, whose cost point operations, with space-efficient data preprocessing and
quickly grows with increasing bit-width; they have not exploited fine-grained task load balancing; and
the opportunities of aggressively eliminating redundant computa- • A GPU finite field library that supports arithmetic operations
tions to improve performance; they cannot effectively handle the on large integers (e.g., 256-bit ∼ 753-bit) by leveraging both
significant load imbalance caused by the sparse data distribution integer and floating-point processing units.
and the large bit-width integer used in current real-world ZKP
This work focuses on emerging privacy-preserving applications
based on zero-knowledge proof. Our proposed system, GZKP, brings

341
GZKP: A GPU Accelerated Zero-Knowledge Proof System ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada

Application
overhead and long latency, significantly limiting the throughput of
latency-sensitive applications (e.g., decentralized transactions).
Trusted Setup
verification key input, verification key
Verifier With ZKP, users could generate succinct proofs for programs
key
generator proving key accept or reject
and then put the proofs on the chain: each node only needs to
vector sets： verify by checking the integrity of the proofs, a task much lighter
proving key： Proof:
than executing the smart contract. Also, these zero-knowledge
Prover MultiScalar Multiplication
Polynomial Computation proofs hide sensitive information involved in the smart contract
computation. As a result, ZKP can greatly enhance both blockchain
scalability and privacy protection.
INTT NTT h0
PMUL
+ M0 As shown in Figure 1, various random parameters for the whole
INTT
INTT
NTT
NTT
- INTT M
hN-1
PMUL
PADD
ZKP system are generated in a one-time setup phase including the
MN-1
proving key and the verification key for the prover and verifier.
P3 = P1 P2 5P = 5 P = (101)2 P u0
Q0 PMUL And with prover’s secret witness 𝑤 and public input 𝑥, the system
(x3, y3) = (x1, y1) + (x2, y2) 1 0 1
m = (y2 - y1) / (x2 - x1) PADD P 2P=P P 4P=2P 2P PADD
x3 = m2 - x1 - x3 uN-1 outputs two kinds of data which will later be used in the prover and
QN-1 PMUL
y3 = m(x1 - x3) - y1 PADD P 5P verifier:
• Vector sets: {𝑎, ® 𝑐®, 𝑢}.
® 𝑏, ® Each vector contains 𝑁 large integers
Figure 1: Typical zkSNARK protocol workflow. on a finite field 𝐹𝑟 where 𝑟 is a large prime. The bit-width of
the large integer field here typically ranges from 256 to 753. The
large integer 𝑟 can be composed as 𝑚
Í 𝑖 64
𝑖=0 𝑟𝑖 𝐷 , where 𝐷 = 2
denotes the base, so that the vector 𝑟𝑖 can be stored in word-size
decisive improvement to the market, supporting a variety of ZKP integers. The dimension 𝑁 could be up to billions and is typically
systems used in the industry, such as Marlin [11], Plonk [31], and a power of 2.1
Sonic [51]. In addition, these two modules that we investigated • The proving key and verification key: The proving key con-
(NTT and MSM) has applications in a wider scope of encryption sists of multiple point vectors {𝑀,® 𝑄® }, each containing 𝑁 points
algorithms: NTT is a key building block in homomorphic encryp- on a pre-determined elliptic curve, expressed in the form of
tion algorithms [35] used by federated learning, while MSM is used {(𝑥, 𝑦) ∈ 𝐹𝑞 × 𝐹𝑞 | 𝑦 2 = 𝑥 3 + 𝑎𝑥 + 𝑏}. 𝐹𝑞 is another finite field
in many pair-based proof systems [8, 40, 54]. Techniques proposed where 𝑞 is a large prime. Similar to 𝐹𝑟 , the bit-width of 𝐹𝑞 is also
in this work thus contribute to this growing junction of computer between 256-bit and 753-bit. The verification key is a short point
systems and security/privacy, by significantly enhancing the ef- vector with a few elliptic curve points. The basic operations of el-
fectiveness of GPU resource utilization for crucial components of liptic curves are point-addition (PADD)2 and point-multiplication
private computation. (PMUL) [41]. We use ⊕ and ⊗ to denote the PADD and PMUL
operations in elliptic curves, respectively. As demonstrated in
2 BACKGROUND Figure 1, PADD is used for adding up two points on the elliptic
2.1 The zkSNARK Protocol curve and it needs to do a set of arithmetic operations over the
same finite field. PMUL is defined by adding a point 𝑃 to itself 𝑠
zkSNARK is one of the most prominent ZKP protocols. It consists
times (through PADD), where 𝑠 is an element over 𝐹𝑟 . The figure
of two parties: a prover and a verifier. The prover can convince the
illustrates how it can be split into a series of PADD in order of
verifier that, “for given function 𝐹 and input 𝑥, I know a secret
the integer’s bit sequence by using the binary representation of
witness 𝑤 such that 𝐹 (𝑥, 𝑤) = 0,” by generating a zkSNARK proof.
the integer.
zkSNARK possesses three desirable properties: (1) succinctness,
i.e., small proof sizes (<1 KB) and fast verification (<10 ms) regard- Then the prover uses the vectors sets and the proving key to
less of 𝐹 ’s complexity, (2) non-interactive, i.e., a single message from generate a proof. The prover performs two-stage calculations suc-
the prover to the verifier, and (3) zero knowledge, i.e., a proof reveal- cessively, polynomial computation (POLY) and multi-scalar multi-
ing nothing about the actual value of 𝑤 while still able to pass the plication (MSM), to generate the proof.
verification. In the POLY stage, the prover takes the vectors 𝑎, ® and 𝑐® as
® 𝑏,
With these properties, recently zkSNARK has been widely adopted inputs, and outputs the coefficients of polynomial 𝐻 (𝑥), where
by the blockchain community [15, 17, 19, 22, 44]. The smart con- 𝐻 (𝑥) = (𝐴(𝑥)𝐵(𝑥) − 𝐶 (𝑥))/(𝑥 𝑁 − 1). As shown in Figure 1, the
tract stored on a blockchain is a program performing the function 𝐹 state-of-the-art implementation of this stage consists of a series of
in Figure 1. Traditional blockchain applications require each node NTT and its inverse (INTT) operations [16, 50], with more details
in a chain to perform the same smart contract for the chain’s state given in Section 3.
updates. For example, for an online transaction, the nodes in the In the MSM stage, the prover takes the output ℎ® of POLY, the
chain verify its validity by checking whether both parties’ account ® and the proving key as inputs and computes multiple
scalar vector 𝑢,
information (secret witness 𝑤) meets the transaction condition. It multi-scalar multiplications over the specified elliptic curves. Each
is worth noting that a smart contract involving private datasets
1 General cases with arbitrary 𝑁 values can use the power-of-2 flow as building
may reveal user privacy because anyone running their blockchain
blocks [12].
node can see and download all data stored on the chain. With 2 Here, we use PADD to refer to elliptic curve point addition operations, including
the number of such contracts growing, all nodes will incur huge point doubling operations

342
ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada W. Ma, Q. Xiong, X. Shi, X. Ma, H. Jin, H. Kuang, M. Gao, Y. Zhang, H. Shen, and W. Hu

a[0] iterations divided into multiple independent groups. The common

a[1] practice [10, 16, 26, 36, 38, 46] is to divide the iterations into batches
a[2]
such that each group of independent butterfly operations (within
Butterfly Operation
a[3]
one colored block in the figure) works on a sub-vector that can be
accommodated in the shared memory. Each of such independent
a[4]
group is naturally mapped to a GPU block for parallel computation.
a[5]
As shown in the figure, when the input vector 𝐴 is stored in an
a[6]
array in the global memory, the distance between the two inputs
a[7] of a butterfly operation increases exponentially as the iterative
iteration 0 iteration 1 iteration 2
batch 0 batch 1
computation progresses.
Such a strided access pattern with a growing stride size brings a
large overhead. When hosted in the global memory for a larger 224 -
Figure 2: The computation process and strided data access
NTT with input bit-width of 256, these non-contiguous memory
pattern of 8-NTT.
accesses occupy 13% of the overall execution time in batch 1 (batch
0 has contiguous accesses) and 53% in batch 2, despite performing
Í𝑁 −1
MSM task can be expressed as 𝑠® · 𝑃® = 𝑖=0 (𝑠𝑖 ⊗ 𝑃𝑖 ), where 𝑠® is a the same amount of computation (confirming prior findings [16]).
scalar vector (ℎ® or 𝑢)
® and 𝑃® is a point vector from the proving key This is the reason why current implementations, including ours,
(𝑀® or 𝑄).
® As shown in Figure 1, each 𝑠𝑖 ⊗ 𝑃𝑖 is calculated through partition the iterations into batches so that within each batch, the
PMUL and summed using PADD. These two inner-product results gradually enlarging strides can be accommodated within the small
are sent out by the prover as its proof, to be checked by the verifier. and fast shared memory.
At last, the verifier can use the verifying key to quickly determine Previous solutions [10, 16, 38] perform costly reordering of the
the validity of the proof. It usually takes a few milliseconds in 𝐴 vector array at the beginning of each batch, so that the next
zkSNARK. iteration can use purely contiguous accesses. This step is called the
shuffle stage. However, with the large bit-width used in ZKP, the
2.2 NTT Computation many batches require more such shuffle stages, each costing 42% ∼
Í𝑁 −1 81% of the per-batch execution time.
For a polynomial 𝐴(𝑥) = 𝑖=0 𝐴𝑖 𝑥 𝑖 , with the coefficients 𝐴𝑖 ∈ 𝐹𝑟 ,
an NTT computes the vector 𝑎® = {𝐴(1), 𝐴(𝜔 𝑁 ), . . . , 𝐴(𝜔 𝑁 𝑁 −1 )},
where 𝜔 𝑁 is the 𝑁 -th root of unity in 𝐹𝑟 . The state-of-the-art 2.3 MSM Computation
algorithm for NTT is through the Cooley-Tukey algorithm [14], it MSM performs high-dimensional vector inner-product, entailing
can reduce the complexity from 𝑂 (𝑁 2 ) to 𝑂 (𝑁𝑙𝑜𝑔𝑁 ). many expensive PMUL operations on an elliptic curve. The cost of
We denote NTT computation with scale 𝑁 as 𝑁 -NTT, which vector inner-product is proportional to the vector size 𝑁 , which
contains 𝑛 = log 𝑁 iterations with the Cooley-Tukey algorithm. Fig- is up to several billion. Meanwhile, the PMUL operation on the
ure 2 illustrates the process of 8-NTT. Each iteration contains 𝑁 /2 elliptic curve is time-consuming because it involves a large number
butterfly operations, the basic NTT operations. The zoom-in bubble of modular multiplication and addition operations over the large
in the figure gives the definition of the butterfly operation, which finite field 𝐹𝑞 as shown at the bottom of Figure 1. Therefore, MSM
reads two inputs, performs a series of modular multiplications and is the most computation-intensive part in ZKP proof generation.
additions (both of large integers in this case), and stores the result Existing CPU-based zkSNARK implementations spend over 70% of
in place. After the last iteration, the original 𝐴® vector (coefficients their total execution time on the MSM stage [50].
of the polynomial 𝐴(𝑥)) contains the result vector 𝑎. ® As mentioned earlier, exiting work [16, 18, 63] uses optimizations
Such NTT computation can be parallelized, illustrated in Figure 2 such as the Pippenger [55] and Straus [58] algorithms to reduce
with colored backgrounds, with butterfly operations in consecutive the MSM complexity. The core idea behind is to reduce the PMUL

MSM
l=256
s0 P0 sub-MSM Task0 k=4
s1 P1 s00 P00 Q0 = ... 0 1 0 1 0 1 0 1 P0 0x...55 P0 = ... 0x10 ( 0x5 P0 ) 0x5 P0
s2 PM BR WR ( 0xA B5B5 BABA ...
P2 P1 P1 )
...

Q1 = ... 1 0 1 0 1 0 1 0 0x...AA P1 = ... 0x10 0xA P1

s3 P3 s77 PP77 P0 PP5 P1 P4
P2 = ... ( 0x5 P1
...

Q2 = ... 0 1 0 1 1 0 1 0 P2 0x...5A 0x10 P2 ) 0xA P2 0

s4 P4 sub-MSM Task1 P3 P P2 P6
Q3 = ... 0 1 0 1 0 1 0 1 P3 0x...55 P3 = ... 0x10 ( 0x5 P3 ) 0x5 P3 3 P2
s5 P5
ss88 PP88 Result Q4 = ... 1 1 1 0 1 0 1 0 P4 0x...EA P4 = ... 0x10 ( 0xE P4 ) 0xA P4 BP55 = P0 P3P4 P5
s6 P6
PM BR WR 1
...

s7 P7 Q5 = ... 1 1 1 0 0 1 0 1 P5 0x...E5 P5 = ... 0x10 ( 0xE P5 ) 0x5 P5 BA = P1 P2P6 P4 P6

s15 P15
...

Q6 = ... 1 0 1 1 1 0 1 0 P6 0x...BA P6 = ... 0x10 ( 0xB P6 ) 0xA P6

...

Q7 = ... 1 1 1 0 0 0 0 0 P7 0x...E0 P7 = ... 0x10 ( 0xE P7 ) 0x0 P7 2 W0 = 0x1 B1 ... 0xF BF

...

sN-1 PN-1
W1
W WW0 B5 =P0BF P3 P5
1 0 1
(BF BE )
1 : point-merging (PM) = Q0 = Q
...0 Q
...7 Q7 3 2252 W3 63... ...0x1024W1W1
W0W0 ...
2 : bucket-reduction (BR) = ( 24 ... (( 24 W63) W62 ) ...) W0 2 (BF BE ... B1)
3 : window-reduction (WR)

Figure 3: An example MSM computation process using Pippenger algorithm.

343
GZKP: A GPU Accelerated Zero-Knowledge Proof System ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada

batch 0 batch 1

global LD shared ST global global LD shared ST global

memory memory memory memory memory memory

0 0 0 0 0 0 0 0
t0 cache-efficient data access pattern
1 1 1 1 1 4 4 1
2 2 2 2 2 8 8 2
t1 threads access elements in global memory
3 3 3 3 3 12 12 3 threads store elements in shared memory
4 4 4 4 4 1 1 4 t0 t1 t2 t3 t4 t5 t6 t7 t0 t4 t1 t5 t2 t6 t3 t7
t2 00 10 20 30 40 50 60 70 80 90 100110120130140150 00 80 10 90 20 100 30 110
5 5 5 5 5 5 5 5 LD LD
01 00 10 20 30
6 6 6 6 6 9 9 6 m m
t3 80 90 100110
7 7 7 7 7 13 13 7
8 8 8 8 8 2 2 8
t4 N 16= *G
9 9 9 9 9 6 6 9
t0 t1 t2 t3 t4 t5 t6 t7 t0 t4 t1 t5 t2 t6 t3 t7
10 10 10 10 10 10 10 10 00 10 20 30 40 50 60 70 80 90 100110120130140150
t5 00 40 80 120 10 50 90 130 20 60 100140 30 70 110150
LD LD
11 11 11 11 11 14 14 11 01 40 50 60 70 01
12 12 12 12 12 3 3 12 m 120130140150 m
t6
13 13 13 13 13 7 7 13
14 14 14 14 14 11 11 14
t7 N 16= *G
15 15 15 15 15 15 15 15
global memory L2 cache shared memory
GPU block 0
each block contains four
N-1 N-1 N-1 independent groups N-1

represents the first element, background color the color line shows the butterfly operation is represents the first word of load (LD)
0 t0 represents thread 0 00
distinguishes different independent groups executed by which thread, i.e., t0 in the case the first element store (ST)

Figure 4: Iteration partitioning at 𝐵 = 2. Here a GPU block contains 𝐺 = 4 independent groups, each working on 2𝐵 elements (in
the same color). A total of 𝑇 = 2𝐵 ∗ 𝐺/2 = 8 threads are required in each block. The zoom-in illustration on the right side shows
the per-thread memory access pattern When loading data from the global memory to the shared memory.

operations with the distributive law. Figure 3 uses the Pippenger from tens of MBs to GBs, with each array element in 𝑚 machine-
algorithm to illustrate this process. It represents each 𝑙-bit scalar word-size (64-bit) integers. The input arrays are stored in the GPU
𝑠 under the 2𝑘 base, where 𝑘 is the pre-specified window size. In global memory with a column-major layout [10] (i.e., storing the
the right part of Figure 3 (𝑁 = 8), 𝑘 is 4 and 𝑙 is 256, resulting in first words of all 𝑁 integers contiguously, then all the second words,
64 windows. The PMUL operations within each window are then until 𝑚). In our discussion next, we use the column index to identify
grouped by the scalar component into buckets. For example, in an array element. For example, element 0 represents the 𝑚 integers
window 𝑤 0 , different elements of 𝑠® produce common components in the first column of this layout.
of 0xA and 0x5, producing a bucket 5 containing the points 𝑃0, 𝑃3 , As said in Section 2.2, a group of independent butterfly opera-
and 𝑃5 , and another bucket A. With the distributive law, one can tions in a batch are mapped to a GPU block for parallel computation.
therefore save computation by first summing up all the points in The block works on a subset of array elements, which are staged to
bucket 𝑗 to get 𝐵 𝑗 within a window (point-merging), then calculating the shared memory, also in a column-major layout. The GPU shared
Í2𝑘 −1 memory is the memory local to each streaming multiprocessor (SM),
𝑗=0 𝑗 ⊗ 𝐵 𝑗 for window 𝑡 to get 𝑊𝑡 (bucket-reduction), and finally
Í ⌈𝑙/𝑘 ⌉ −1 𝑡 ∗𝑘 much faster yet smaller than the global memory. For example, each
computing 𝑡 =0 2 ⊗ 𝑊𝑖 , where 2𝑡 ∗𝑘 is a weight, to combine of the 80 SMs of the NVIDIA V100 model has a 48 KB shared mem-
the results across windows (window-reduction). ory, accessible to up to 1024 threads of the block running on the
The scale of MSM in ZKP systems is often over millions. Due to SM, at the word granularity. With one warp (usually 32 threads)
the large 𝑁 , previous work explores data parallelism by decompos- accessing global memory simultaneously, it has been found that
ing the computation horizontally into multiple sub-MSM tasks, as for vectors of large integers, the fine-grained interleaving among
shown in Figure 3 left. Each sub-MSM can be naturally assigned to a data accessed by different threads with such a column-major layout
GPU block, where algorithms like Pippenger are further adopted to could offer significantly better performance than storing each inte-
reduce the complexity, with concurrent processing of different win- ger contiguously [10]. This is confirmed by our own experiments,
dows by different threads, as shown in the figure. However, existing which show a 2× advantage of the former with a 753-bit integer
solutions have not explored the fact that point-merging tasks in the size in global memory accessing.
same window for different sub-MSMs can also be merged to reduce With each warp accessing a subset of vector data for NTT, data
the subsequent reduction tasks, which is among our contributions transfer happens at the beginning and the end of each batch, to put
in GZKP (Section 4). relevant data into and evict back results from the shared memory.
A major reason for existing GPU-based NTT designs to perform
expensive data shuffles between batches [10, 16, 38] is to ensure
3 GZKP DESIGN: THE POLY STAGE that each warp has contiguous access to at least one GPU L2 cache
In this section, we describe GZKP’s new design for more efficient line (32 B with V100), to fully utilize data brought into the L2 cache.
GPU-based large-scale NTT computation over large finite fields. However, the cost of such shuffle operations quickly increases when
Cache-Efficient Global Memory Access. For 𝑁 -NTT in ZKP, the we get to the later iterations, as mentioned in Section 2.2.
total size of its input array (vectors {𝑎, ® 𝑐®} shown in Figure 1) varies
® 𝑏,

344
ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada W. Ma, Q. Xiong, X. Shi, X. Ma, H. Jin, H. Kuang, M. Gao, Y. Zhang, H. Shen, and W. Hu

Preprocessed l=256 Merge Merge Thread Data Transfer CG: cooperative group
Points k=8 point-merging bucket-reduction
Q0 = ... 0 1 0 10101 P0 = ... P0_1 P0 P3 P3_1 ...
170 85 P0 B85 GPU Block0 GPU Block0
P0_1 = 2k P0
Q1 = ... 1 0 1 01010 P1 = ... P1_1 P1

PADD
P1_1 = 2k 234 170

CG0
B1

PADD PADD
P1
...

CG1 CG0
P2 P6_1 B1

Task0 (B1)
P2_1 = 2k P2 Q2 = ... 0 1 0 11010 P2 = ... 229 P2_1 90 P2 B90

Warp0
Q3 = ... 0 1 0 P3 = P3_1
...

10101 ... 85 85 P3 B3
...

PADD PADD PADD PADD

B170 P1 P0_1

CG1
P0_2 =22k P0 Q4 = ... 1 1 1 01010 P4 = ... 170 P4_1 234 P4 B3
P1_2 =22k P1 Q5 = ... 1 1 1 P5 = B67
00101 ... 255 P5_1 229 P5

...
P2_2 =22k P2 Q6 = ... 1 0 1 11010 P6 = ... 90 P6_1 186 P6 B186 P6 ...

CG2
Result

...
PADD PADD
...

CG1 CG0
Q7 = ... 1 1 1 0 0 0 0 0 P7 = ... 0 P7_1 224 P7 B67

Task1 (B3)
P0_t =2t*k P0

...

Warp1
B185
P1_t =2t*k P1

...

CG3
...
P2_t =2t*k P2
B85 = P0 P3 ...
...

...
1 B1 ... 255 B255 B185
B90 =P1 P6_1 ...

CG4
= B255

...
...

...
= Q0 QN-1 2 1 B
(B255 B254) 170=P1 P0_1 ...
... GPU Blocks
B186=P6 ...

...

...
(B255 B254 ... B1)

...
...
Figure 5: The bucket-based task-partitioning and task-mapping for the MSM computation

Considering the large scale of NTT operations in ZKP applica- chunks in the global memory. Essentially, we transform these 2𝐵
tions, GZKP follows a different, shuffle-less approach, keeping the length-𝐺 chunks in the global memory to 𝐺 length-2𝐵 chunks in
vector in the original order in the global memory across all batches the shared memory. As long as 𝐺 is sufficiently large, e.g., at 4 or
of iterations (Figure 4). It then has to address the conflict between higher, accessing these chunks from the global memory ensures
(1) the fine-grained access (byte-level) of each thread due to the effective utilization of each L2 cache line. We use 𝑇 threads to fetch
column-major layout, for better global memory access speed, and these chunks in parallel in multiple rounds. Figure 4 example shows
(2) the growing distance of array elements accessed by each GPU 𝐺 = 4 and 𝐵 = 2, as on the right side. In each round, the threads
block, without data shuffling between batches. Otherwise, both access contiguous elements from the global memory that can be
issues would significantly lower the L2 cache line utilization. well coalesced. They then write the data to the appropriate location
GZKP addresses these challenges with a cache-friendly task in the shared memory, producing the desired strided interleaving
partitioning scheme. Suppose we have 𝐵 iterations within each pattern, promoting efficient access within each independent group
batch, where GZKP divides the butterfly operations into 𝑁 /2𝐵 in- in their NTT computation. As to be seen later in our experiments,
dependent groups, each of which works on 2𝐵 elements. Figure 4 such a design improves the NTT performance by up to 2.1×. When
illustrates this with 𝐵 = 2, where the 16 elements’ butterfly compu- the current batch finishes, the reverse process is used to write data
tation form 4 such groups (grouped by background colors), each back to the global memory, with the original data order.
working on a disjoint subset of 4 elements. With existing GPU- The internal shuffle operation above ensures contiguous mem-
based NTT designs, 𝐵 is maximized (under the shared memory ory access at the global memory side for L2 cache efficiency. By
capacity constraint) to make such independent groups large. Each scattering them to the desired locations as dictated by the current
group is mapped to one GPU block, to allow large chunks of work batch’s independent groups order (such as “0 4 8 12” for the first
to proceed in parallel. group), subsequent iterations of NTT computation in this batch
Considering the aforementioned L2 cache efficiency problem, could proceed with one thread’s data placed close together. Such
GZKP adopts a different design. Rather than having each GPU sequential and reverse-order interleaving of contiguous data seg-
block work on one large group, we assign it with multiple smaller ments further reduces the inter-bank access conflicts on the shared
ones. In Figure 4, for example, the four independent groups are memory [20].
all assigned to GPU block 0. This allows GZKP threads to perform
internal shuffling when transferring data between the global and
shared memories: neighboring threads will coalesce their accesses
by visiting the global memory with a coarse-grained block-style
data distribution, then creating the fine-grained interleaving at the
shared memory. 4 GZKP DESIGN: THE MSM STAGE
We illustrate more using the examples in Figure 4. Batch 0 on the This section focuses on the MSM stage, which occupies most of the
left have contiguous access within each independent group, so the ZKP proof generation time. We present two major optimizations:
threads simply access the data needed, where neighboring threads (1) further computation consolidation beyond those exploited by
will visit neighboring data. This stops being the case with batch 1 existing approaches, and (2) load balancing techniques targeting
(with GZKP discarding data shuffling in the global memory), where scalar vectors containing large integer elements. It is worth noting
vector elements have a stride of 4: the first group will be working that our MSM optimizations presented in this section is agnostic to
on elements 0, 4, 8, and 12. Nevertheless, if we could schedule 𝐺 elliptic curve types, and apply to both the two common types (G1
independent groups together where each group processes 2𝐵 ele- and G2) used in current ZKP systems [23, 50] despite their different
ments, the overall required data nicely form 2𝐵 length-𝐺 contiguous PADD operation costs.

345
GZKP: A GPU Accelerated Zero-Knowledge Proof System ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada

4.1 Computation Consolidation Algorithm 1: Point Merging Based On Checkpoints

As mentioned in Section 2.3, previous systems [16, 18] exploit Input: the MSM scale 𝑁 ,
window-based parallelism by processing the windows in paral- the preprocessing interval 𝑀,
lel, as shown in Figure 3. Windows are partitioned horizontally into the windows size 𝑘,
sub-MSMs. Each sub-MSM is mapped to a GPU block, where differ- the bucket-info array 𝑝_𝑖𝑛𝑑𝑒𝑥,
ent windows within the same sub-MSM are processed by different the preprocessed points (checkpoints) array 𝑝𝑟𝑒_𝑃
threads. Using Pippenger algorithm in each sub-MSM, the points Output: the result point 𝐵
are first summed up in the buckets (point-merging), multiplied with 1 for 𝑖 ← 0 to 𝑙𝑒𝑛(𝑝_𝑖𝑛𝑑𝑒𝑥) − 1 do
the corresponding bucket scalars and summed (bucket-reduction), 2 𝑡 = 𝑝_𝑖𝑛𝑑𝑒𝑥 [𝑖]/𝑁 ;
and finally accumulated across windows (window-reduction). 3 𝑟 = 𝑝_𝑖𝑛𝑑𝑒𝑥 [𝑖]%𝑁 ;
Unlike prior systems, GZKP exploits bucket-based task partition- 4 if 𝑡%𝑀 == 0 then
ing and fine-grained task-mapping to ensure load balance, as illus- 5 𝐵 = 𝐵 ⊕ 𝑝𝑟𝑒_𝑃 [(𝑡/𝑀) ∗ 𝑁 + 𝑟 ];
trated in Figure 5. While it still adopts a Pippenger-like algorithm 6 else
(shown on the left side of the figure), GZKP pursues parallelization 7 𝑡𝑚𝑝 = 𝑝𝑟𝑒_𝑃 [(𝑡/𝑀) ∗ 𝑁 + 𝑟 ];
in a different manner by leveraging previously unexploited oppor-
8 for 𝑗 ← 0 to (𝑡%𝑀) ∗ 𝑘 do
tunities to further reduce the PMUL operations in MSM. The large
9 𝑡𝑚𝑝 = 𝑡𝑚𝑝 ⊕ 𝑡𝑚𝑝;
bit-width of current ZKP applications enables a large number of
10 end
windows available for concurrent processing. GZKP hence discards
the sub-MSM partitioning and instead consolidates all computations 11 𝐵 = 𝐵 ⊕ 𝑡𝑚𝑝;
in each “column”. In this way, more points (across all sub-MSMs) 12 end
can be merged in a bucket, thus saving PMUL operations from one 13 end

per sub-MSM to one per entire “column”.

We further adjust the ordering of the three steps in the algorithm
and merge the point-merging tasks within different “columns” to
eliminate the window-reduction step. We group all PMUL operations grows too fast with large 𝑁 , even with a small 𝑘 value. This is
with the same scalar component across different windows into buck- reflected in our performance comparison with MINA [18] (Table 7),
ets, producing only 2𝑘 cross-window point-merging tasks in our which utilizes the Straus algorithm.
MSM workflow. However, the points across different windows can- The point-merging step is the most time-consuming, taking
not be directly added together because their bucket scalars are in dif- up 90% of the overall MSM execution. To effectively parallelize
ferent bit positions, i.e., with different weights. GZKP addresses this these cross-window point-merging tasks, GZKP defines the point-
issue with a preprocessing scheme, as shown in the left-most panel merging task as a basic work unit, which sums up all points in the
in Figure 5. We pre-compute 2𝑡 ∗𝑘 ⊗ 𝑃𝑖 to get 𝑃𝑖_𝑡 in advance, where corresponding bucket. Multiple point-merging tasks are assigned to
a GPU block for parallel computation. We allocate at least one warp
𝑡 is the window index and 2𝑡 ∗𝑘 is the window weight. Note that the
for a basic task unit to execute. As shown in the right side of Figure 5,
original point vector 𝑃 is generated in the ZKP system setup phase
the tasks corresponding to the buckets (i.e. 𝐵 1, 𝐵 3, 𝐵 67, 𝐵 185, · · · ) are
and remains unchanged for a specific application. Consequently,
mapped to the GPU block0, each processed by one warp. In addition,
we can pre-compute 2𝑘 ⊗ 𝑃𝑖 , 22𝑘 ⊗ 𝑃𝑖 , . . . , 2 ( ⌈𝑙/𝑘 ⌉ −1)∗𝑘 ⊗ 𝑃𝑖 for each
data parallelism within each point-merging task is harvested by our
point 𝑃𝑖 , and save these preprocessed points in the GPU global
optimized finite field library (Section 4.3) accelerating the PADD
memory. These “weighted” points can then be directly summed
operations using the CUDA cooperative groups (CGs) programming
in the point-merging tasks. The downside of such preprocessing,
model [20]. Threads in a warp are decomposed into multiple CGs
however, is the memory space overhead required to store the pre-
(containing 4 threads in the case demonstrated by figure), each
computed points: over 5 GB at the MSM scale of 221 (with the large
parallelizing individual PADD operations. The CG size depends
integer bit-width of 381-bit), and further grows as the MSM scale
on the integer bit-width adopted by the application and needs to
increases.
be a factor of 32 to ensure full thread utilization in the warp. The
To better balance between the time and space saving in our
intermediate results from different CGs are stored in the GPU shared
design, we design a point merging algorithm based on a checkpoint
memory, before they are added up to get the final point-merging
strategy as shown in Algorithm 1. We set a preprocessing interval
result, to be stored in the GPU global memory.
𝑀 to adjust the “frequency” of the preprocessed data, as a control
It is worth noting that the window size 𝑘 is an important parame-
knob to configure based on memory space constraints. GZKP only
ter that determines the parallelism of our MSM module as well as the
preprocesses the points with fixed weights, called checkpoints (i.e.
⌈𝑙 /𝑘⌉ point-merging workload distribution. In general, a larger window
2𝑀∗𝑘 , 2 (2𝑀 )∗𝑘 , ..., 2 ⌊ 𝑀 ⌋∗𝑀∗𝑘 ). Then during point-merging, it first size helps reduce the total computational cost of the MSM using the
finds the closest checkpoint to the needed point, and then perform Pippenger algorithm [4]. On the other hand, increasing the window
at most (𝑀 − 1) ∗ 𝑘 PADD operations to get the desired point. size exponentially increases the number of point-merging tasks,
Note that the idea of preprocessing point vectors is also used in which, when much higher than the number of GPU SMs, introduces
the Straus algorithm [58]. There the points (i.e. 2⊗𝑃𝑖 , 3⊗𝑃𝑖 , . . . , (2𝑘 − non-trivial GPU scheduling overheads. Fortunately, for a specific
1) ⊗ 𝑃𝑖 ) are also pre-computed to accelerate bucket-reduction. How- ZKP application, the size of its MSM task is known. Therefore,
ever, this scheme scales poorly, as the amount of pre-computation GZKP performs profiling-based window configuration.

346
ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada W. Ma, Q. Xiong, X. Shi, X. Ma, H. Jin, H. Kuang, M. Gao, Y. Zhang, H. Shen, and W. Hu

S im ila r ta s k g r o u p
3 0 0
Group0 Group1

Warp0 Warp1 Warp0

CG0 CG1 CG0 CG1 CG0 CG1
... ... ...

PADD ... PADD ... PADD ...

T h e n u m b e r o f ta s k s

PADD PADD PADD

2 0 0
Task0 Task0

Figure 7: Fine-grained task mapping in GZKP for load bal-

1 0 0
ancing.

group have similar workloads. The bars in the histogram of Figure 6

illustrate an example grouping. GZKP schedules these groups in
0
0 1 0 0 2 0 0 3 0 0 4 0 0 5 0 0 6 0 0
order, starting from the buckets with the heaviest loads, ensuring

T h e n u m b e r o f p o in ts p e r ta s k
that heavy buckets are not left as stragglers.
When scheduling the tasks in each task group to the underly-
ing GPU hardware, GZKP performs fine-grained task mapping for
Figure 6: Workload distribution in the point-merging step. workload-aware resource allocation. We allocate GPU warps to the
Tasks with the same workload (i.e., number of points as- tasks in the same group according to their average loads. As demon-
signed to the bucket) are grouped into a similar task group. strated in Figure 7, a task in group 1 is mapped to one warp, while
another task in group 0 is mapped to two warps since it contains
more points. Therefore, each warp will independently manage at
After the cross-window point-merging step, only one bucket- most one task (bucket), and multiple CGs in a warp can execute
reduction task is left in our MSM module. The preprocessing scheme the task in a data-parallel manner.
essentially turns all windows into one with no weight. Therefore,
there are no more windows but 2𝑘 buckets left. For the final step 4.3 Optimized Finite Field Library
in our MSM module, we calculate 2𝑗=0−1 𝑗 ⊗ 𝐵 𝑗 by leveraging the
Í𝑘
GZKP uses a highly optimized finite field library as the underlying
parallel prefix sum algorithm [42], which converts certain sequen- tool to accelerate the basic operations in ZKP, mainly modular mul-
tial computations into equivalent parallel computations. The input tiplications [52] and large integer additions. As mentioned earlier,
array is from the point-merging step and is stored in GPU global a typical proof generation in Zcash [43] involves billions of these
memory. We divide it into blocks, each scanned by a single GPU basic operations. Our library supports large integers, currently 256-
block. The number of GPU blocks is determined by the input ar- bit ∼ 753-bit, and can be further extended if needed. Due to the
ray size (2𝑘 ) and the number of threads needed for each PADD length limit, we only briefly introduce it here.
operation, again in CGs. Current mainstream computing devices (i.e., CPU, GPU) cannot
directly support the above operations on large integers. Instead,
4.2 Workload Management one needs to split them into multiple device-native data types, e.g., 4
ZKP may exhibit imbalanced data distribution when working with machine-word-size (64-bit) integers for a 256-bit integer. The large
large integers. In particular, the scalar vector 𝑢® in real-world ZKP finite field operations are then implemented using the arithmetic
workloads is highly sparse. The system setup has a lot of bound operations supported by the GPU in CUDA [20]. Multiple threads
checks and range constraints, which introduce many 0s and 1s to in a warp cooperate to complete the arithmetic operation and use
® and thus severe load imbalance in existing systems [16, 18]. For
𝑢, the warp-level primitive to realize the data communication between
example, the points in bucket 0 require trivial processing and no threads. In this way, we can effectively deal with data carry signals
subsequent point merging. between threads, and merge the multiple individual 64-bit results
As an example, we analyze the workload distribution in the into the final 256-bit number.
point-merging step with scalar vector 𝑢, ® based on a Zcash MSM Like earlier works [29, 30], our library exploits the floating-point
execution with a scale of 217 and scalar bit-width of 256. As shown processing units (the forte of GPUs), which would be otherwise idle
in Figure 6, there is up to 2.85× difference in the numbers of points in such integer computation tasks, to accelerate modular multipli-
across all buckets. cation. For example, we choose base 𝐷 = 252 to split a large integer
When switching from the previous window-based parallel execu- into multiple 52-bit components, resulting in 15 machine-word-size
tion to the more fine-grained bucket-based parallelization strategy (64-bit) integers to represent a 753-bit integer. The higher-order
in GZKP, such imbalance could get more serious as more points are bits of these integers are filled with 0 to accommodate carry bits.
merged into each bucket. On the other hand, GZKP takes advan- We then convert them to double-precision floating point (DFP),
tage of its fine-grained, bucket-based work partitioning strategy to followed by applying Dekker’s Method [27] to complete the DFP
dynamically improve load balance. Given the fairly large number of multiplications and additions without rounding issues. We imple-
buckets during point-merging, GZKP groups them by loads (i.e., the ment the butterfly operation for NTT and the PADD operation for
number of points contained), ensuring that the tasks in the same MSM based on this library. We believe the library itself could be

347
GZKP: A GPU Accelerated Zero-Knowledge Proof System ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada

of independent interest beyond GZKP, and can also be applied to and GZKP support ALT-BN128, BLS12-381, and MNT4753 elliptic
improve existing GPU-based ZKP systems, which we demonstrated curves, with bit-width of 256, 381, and 753, respectively. MINA only
through our baselines. We report the acceleration results of the supports the MNT4753 elliptic curve, while both bellperson and
finite field library on different modules (Figure 8 and Figure 10) in bellman support BLS12-381. We evaluate the performance of GZKP
Section 5.3. with different elliptic curves. For each elliptic curve, we select the
best GPU system (“Best-GPU”, or “BG”) as the baseline, along with
5 EVALUATION the best CPU solution (“Best-CPU”, or “BC”), especially when the
particular curve is not supported by any prior GPU solutions.
5.1 Methodology
Workloads. We use multiple real-world workloads to evaluate
Experimental setup. Our experiments are conducted on a server the overall performance of GZKP. Among them, the zkSNARK
with dual 2.00 GHz Intel Xeon Gold CPU 5117 processors (each workloads are generated by xJsnark [47, 49], a high-level framework
with 28 logical cores) and 256 GB DRAM, running Ubuntu 18.04.2 for developing applications based on zkSNARK, integrating libsnark
with CUDA Toolkit version 11.4. It is equipped with four NVIDIA as a backend for proof generation. The Zcash workload is generated
Tesla V100 GPUs (each with 32 GB memory) and one NVIDIA GTX through the tool included in its code repository [24]. The scalar
1080Ti GPU (with 11 GB memory). vector 𝑢® in these real-world workloads is sparse. In addition, we use
Baselines. we compare GZKP with multiple state-of-the-art ZKP synthetic data generated by libsnark to evaluate the performance
systems (using CPU or GPU) based on the zkSNARK protocol, as list of different GZKP modules.
in Table 1. Among them, libsnark [50], a C++ library, is to our knowl-
edge the most well-known zkSNARK library, supporting multiple
elliptic curves as well as parallel proof generation on multiple CPU
cores. MINA [18] is a GPU-accelerated ZKP system, but only acceler- 5.2 Overall Performance
ates the MSM stage. In addition, bellman and bellperson are recent zkSNARK. First, we evaluate the end-to-end proof-generation time
zkSNARK implementations in Rust, adopted by Zcash [22] and (sum of the POLY and MSM stages) using the zkSNARK workloads
Filecoin [15], respectively. They have similar overall design, while on one V100 GPU card. More specifically, the actual zkSNARK
the former is CPU-based and the latter GPU-based. Also, libsnark execution [50] contains seven NTT operations in the POLY stage
and five MSM operations in the MSM stage, for generating one proof.
Table 1: Baselines We create such zkSNARK workloads with the 753-bit MNT4753
curve.
As shown in Table 2, GZKP achieves, on average, 48.1× and 33.6×
Platforms Supported curves
speedup against the best-performing CPU-based system (libsnark)
ALT-BN128, BLS12-381,
GZKP GPU and the best-performing GPU-based system (MINA), respectively.
MNT4753
Since MINA only performs the MSM stage on the GPU, its overall
MINA [18] GPU MNT4753
execution time is obtained by adding libsnark’s POLY stage time to
bellperson [16] GPU BLS12-381 its MSM stage time. While MINA provides quite limited improve-
ALT-BN128, BLS12-381, ment over the best CPU solution, GZKP is able to provide a speedup
libsnark [50] CPU
MNT4753 between 14.0× and 48.1× over the former. The reason is that with
bellman [23] CPU BLS12-381 the scalar vector 𝑢® being sparse in real-world workloads, MINA

Table 2: Performance results (in seconds) of different zkSNARK workloads on V100 GPU with MNT4753 (753-bit) elliptic curve

Best-CPU Best-GPU GZKP Overall Speedup Overall Speedup

Application Vector size
POLY MSM POLY MSM POLY MSM (Best-CPU) (Best-GPU)
AES 16383 0.85 0.83 0.85 0.59 0.004 0.099 16.3× 14.0×
SHA-256 32767 0.97 1.14 0.97 0.90 0.005 0.066 29.8× 26.3×
RSAEnc 98303 3.58 3.77 3.58 1.86 0.022 0.12 53.2× 39.4×
RSASigVer 131071 2.57 4.77 2.57 1.63 0.024 0.13 46.7× 26.7×
Merkle-Tree 294911 10.03 12.33 10.03 3.72 0.06 0.22 78.2× 48.1×
Auction 557055 19.46 14.27 19.46 5.41 0.15 0.37 64.3× 47.4×

Table 3: Performance results (in seconds) of Zcash workload on V100 GPU with BLS12-381 (381-bit) elliptic curve

Best-CPU Best-GPU GZKP Overall Speedup Overall Speedup

Zcash workload Vector size
POLY MSM POLY MSM POLY MSM (Best-CPU) (Best-GPU)
Sapling_Output 8191 0.17 0.21 0.052 0.26 0.001 0.033 11.1× 9.2×
Sapling_Spend 131071 0.43 1.07 0.16 0.50 0.003 0.09 16.7× 7.1×
Sprout 2097151 4.05 9.61 0.69 2.24 0.049 0.25 46.3× 9.8×

348
ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada W. Ma, Q. Xiong, X. Shi, X. Ma, H. Jin, H. Kuang, M. Gao, Y. Zhang, H. Shen, and W. Hu

suffers from significant load imbalance, while GZKP’s design avoids cost by preprocessing before the NTT operation. Specifically, we
such a problem. use a parallel strategy more suitable for GPU. As shown in Figure 2,
Zcash. Next, we report evaluation results with a Zcash workload, each iteration 𝑖 contains 2𝑖 unique 𝜔 𝑁 -related values. Our design
using the 381-bit BLS12-381 curve on one V100 GPU. There are two only needs to store these unique preprocessing values once, with-
types of trading methods (sprout, sapling) [43] with Zcash, which out redundant storage. We also avoid the memory access overhead
involve different proof generation activities. We report results for by ensuring that the threads read the preprocessing values contin-
both. uously. However, such preprocessing is not suitable for libsnark
As shown in Table 3, GZKP achieves, on average, 24.7× and 8.7× because of its specific parallel strategy, which is not exactly the
speedup against BC (bellman) and BG (bellperson), respectively. To same as Figure 2. We have modified libsnark to precompute all
make a shielded transaction, a combination of those workloads is 𝜔 𝑁 -related values, but these values actually take up 16× memory
required. Thus, GZKP could reduce the latency of proof generation size, e.g., up to 24 GB in 224 -NTT. The performance improvement
for each shielded transaction in Zcash by 37.1× against bellman is only 1.5× on average. Since the preprocessing values increase
and provide a speedup of 9.2× over bellperson. The reason is that the overall memory footprint and the data transfers from memory,
GZKP accelerates the computation of the two stages better through the extra memory access overhead dwarfs the saved computations.
the new design compared with bellperson, especially improving Table 6 lists results of similar tests, but on a lower-end GPU
the performance of the more time-consuming MSM stage by 8×. (GTX1080Ti), with fewer SMs and lower memory bandwidth. GZKP
Here the improvement echoes the zkSNARK results. still achieves on average a speedup of 8.9× against bellperson with
Multi-GPU tests. We also evaluate the end-to-end performance BLS12-381 curve. The latter is more severely affected by the reduced
of GZKP with four V100 GPUs by using the Zcash workload. The memory bandwidth. Note that we ignore the preprocessing time
best-performing GPU-based ZKP system (bellperson) also supports required for the butterfly operation for fairness, which is executed
multiple GPU cards for MSM stage. For the POLY stage, we as- serially on the CPU in bellperson and takes on average 5× latency
sign data-independent NTT operations to different GPU cards for in Table 5, while GZKP performs it in parallel on the GPU. The
parallel computation. For the MSM stage, we decompose the com- speedup ratio of GZKP compared to bellperson varies for different
putation horizontally into 4 smaller sub-MSM tasks, where each NTT sizes. The reason is that with the fixed division of a GPU block
task uses all our proposed optimizations, and then assign each of to process independent groups, bellperson suffers from sub-optimal
them to a GPU. As shown in Table 4, although additional data com- block division, while GZKP avoids such problems.
munication overhead is introduced between multiple cards, GZKP To take a deeper look, we investigate the improvement brought
reduces the proof-generation time further by an average of 2.1× by major GZKP proposals for the POLY stage, using the BLS12-381
on the basis which only uses one GPU card. Also, due to better
scalability, GZKP achieves, on average, 13.2× speedup against BG
(bellperson). Table 5: Performance results (in milliseconds) of single NTT
operation on the V100 GPU
Table 4: Performance results (in seconds) of Zcash workload
on four V100 GPUs with BLS12-381 (381-bit) elliptic curve 753-bit 256-bit
NTT
Scale Best Best
CPU GZKP GPU GZKP
Zcash Vector Best-GPU GZKP Overall
workload size POLY MSM POLY MSM Speedup 214 102 0.15 (697.4×) 0.37 0.05 (8.0×)
Sapling_Output 8191 0.052 0.24 0.0004 0.023 12.7× 216 212 0.49 (434.4×) 0.48 0.09 (5.3×)
Sapling_Spend 131071 0.16 0.31 0.0017 0.049 9.3× 218 565 1.91 (295.4×) 2.89 0.28 (10.3×)
220 2110 7.46 (282.5×) 5.19 1.07 (4.8×)
Sprout 2097151 0.69 1.08 0.027 0.074 17.6×
222 8180 33.67 (242.9×) 12.69 4.96 (2.5×)
224 32517 141.40 (230.0×) 46.74 20.99 (2.2×)
226 131441 602.53 (218.1×) 665.84 91.05 (7.3×)
5.3 Breakdown Analysis
We perform breakdown analysis to assess the impact of individual
Table 6: Performance results (in milliseconds) of single NTT
GZKP optimizations, for the two ZKP stages separately.
operation on the GTX1080Ti GPU
NTT operation. Here we examine the execution time of a single
NTT operation, the major computation within the POLY stage. We
use synthetic data generated by libsnark with the MNT4753 and 753-bit 256-bit
NTT
BLS12-381 curves. Note that BLS12-381 curves use 256-bit integers Scale Best Best
CPU GZKP GPU GZKP
and 256-bit NTT operations in the POLY stage. As shown in Table 5,
GZKP achieves, on average, 343.0× and 5.8× speedup against the 214 102 0.33 (305.0×) 0.52 0.06 (9.4×)
BC (libsnark) with MNT4753 curve and the BG (bellperson) with 216 212 1.16 (182.8×) 0.98 0.18 (5.5×)
BLS12-381 curve on the V100 GPU, respectively. GZKP execution 218 565 6.21 (91.01×) 14.64 0.70 (20.9×)
time for a single NTT operation scales linearly with the NTT size 220 2110 27.26 (77.30×) 23.80 2.87 (8.3×)
as expected, while libsnark fails to achieve so, due to its redundant 222 8180 119.82 (68.24×) 70.50 12.83 (5.5×)
computation for 𝜔 𝑖𝑁 in each butterfly operation. GZKP avoids this 224 32517 539.25 (60.30×) 234.59 56.18 (4.2×)

349
GZKP: A GPU Accelerated Zero-Knowledge Proof System ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada

B G Table 7: Performance results (in seconds) of single MSM op-

N o r m a liz e d e x e c u tio n tim e

2 .8 9 m s 1 2 .6 9 m s B G w . lib
1 .0 G Z K P - n o - G M - s h u ffle
eration (G1-type) on the V100 GPU
G Z K P

7 .6 6 m s
6 .0 0 m s
753-bit 381-bit 256-bit
0 .5 1 .1 2 m s 4 .9 6 m s
MSM
Best Best Best
0 .5 9 m s Scale
GPU GZKP GPU GZKP CPU GZKP
0 .2 8 m s
0 .0 214 0.36 0.02(17.9×) 0.032 0.004(8.0×) 0.07 0.004(18.1×)
2 1 8
2 2 2
N T T s c a le 216 0.48 0.05(9.2×) 0.052 0.007(7.4×) 0.18 0.006(29.3×)
218 1.99 0.16(12.1×) 0.14 0.020(7.2×) 0.45 0.015(30.1×)
Figure 8: Breakdown analysis of single NTT operation with 220 7.2 0.60(12.1×) 0.53 0.062(8.5×) 1.48 0.045(32.9×)
256-bit BLS12-381 curve 222 28.1 2.66(10.7×) 1.35 0.24(5.6×) 4.90 0.17(28.7×)
224 - 11.3 6.55 1.10(6.0×) 17.27 0.72(23.8×)
226 - 40.7 24.42 4.00(6.1×) 65.70 2.79(23.5×)
curve as an example. Here the BG (bellperson) contains a shuffle
stage, assigns each GPU block with only one independent group,
and creates batches by grouping every 8 iterations. First, to show Table 8: Performance results (in seconds) of single MSM op-
the improvement brought by our optimized finite field library, we eration (G1-type) on the GTX1080Ti GPU
implement a BG-like solution accelerated by this library (“BG w.
lib”). Then, we evaluate an intermediate solution (“GZKP-no-GM- 753-bit 381-bit 256-bit
MSM
shuffle”) that removes the shuffling on the global memory. Finally, Scale Best GZKP Best GZKP Best
the full GZKP contains all our proposed optimizations. It adopts GPU GPU CPU GZKP
214 0.35 0.08(4.5×) 0.093 0.015(6.2×) 0.07 0.007(10.3×)
smaller independent groups as the work granularity, creates batches
by grouping fewer iterations, and performs internal shuffling that 216 1.00 0.20(5.0×) 0.20 0.032(6.3×) 0.18 0.013(13.5×)
promotes continuous memory accesses (Section 3). 218 2.71 0.71(3.8×) 0.64 0.073(8.8×) 0.45 0.032(14.1×)
As shown in Figure 8, when the NTT scale is 222 , BG w. lib 220 10.07 2.51(4.0×) 1.43 0.26(5.4×) 1.48 0.10(14.5×)
achieves 1.6× speedup against BG. GZKP can achieve another 1.5× 222 - 11.91 5.16 1.02(5.2×) 4.90 0.37(13.2×)
speedup against BG w. lib. In particular, for BG, when the NTT 224 - 46.83 19.86 4.16(4.8×) 17.27 1.50(11.5×)
scale is 218 , the last batch has 2 iterations but has 216 GPU blocks,
3 0
and each block only contains 2 threads. Because GPU scheduling
uses warp (32 threads) as the unit, there are 30 threads idling. The G Z K P -M N T 4
scheduling overhead is also high due to such many GPU blocks. As 2 5 M IN A
G Z K P -B L S
M e m o ry u s a g e (G B )

b e llp e r s o n
the number of iterations in the last batch increases, the influence
of such sub-optimal block division in BG gradually reduces. With 2 0
flexible GPU block assignment, the performance of GZKP’s NTT 1 5
module is almost linear with the NTT scale.
MSM operation. Then, we examine the execution time of a sin- 1 0

5
gle MSM operation, the major computation within the MSM stage.
Similarly, we conduct the experiment on the dense synthetic data
created by libsnark with the 753-bit MNT4753, 381-bit BLS12-381, 0
and 256-bit ALT-BN128 curves. As shown in Table 7, GZKP achieves, 2 1 4
2 1 6
2 1 8
2 2 0
2 2 2
2 2 4
2 2 6

M S M s c a le
on average, 12.4× and 7.0× speedup against the best-performing
GPU-based ZKP systems (MINA) for MNT4753 curve and (bellper-
son) for BLS12-381 curve on V100 GPU, respectively. Table 8 gives Figure 9: MSM memory usage with different curves on V100
the results on the lower-end GTX1080Ti, where GZKP still achieves
on average 4.3× and 6.1× speedup against the BG MINA and bellper-
son, respectively. For ALT-BN128 curve, GZKP achieves on average well within the total global memory size, due to Algorithm 1 that
26.6× speedup against the best-performing CPU-based ZKP system adapts to GPUs of different memory size. As GZKP significantly
(libsnark). outperforms bellperson, this shows that it is able to utilize the avail-
In addition, we evaluate the memory usage of the GZKP’s MSM able global memory size for better performance, while bellperson
module on the V100 GPU with different elliptic curves, as shown under-utilizes the GPU memory.
in Figure 9. Here “GZKP-MNT4” and “GZKP-BLS” refer to GZKP To take a deeper look, we investigate the improvement brought
with the MNT4753 curve (adopted by MINA) and the BLS12-381 by major GZKP proposals for the MSM stage, using the BLS12-381
curve (adopted by bellperson), respectively. For MNT4753 curve, curve on the V100 GPU as an example. Here the BG (bellperson)
GZKP’s memory usage grows slower than that of MINA. When the is the baseline, which explores the data parallelism by decompos-
MSM scale exceeds 222 , MINA execution fails due to insufficient ing the MSM task horizontally into multiple sub-MSM tasks. Since
GPU global memory. For BLS12-381 curve, GZKP does consume BG executes the point-merging and bucket-reduction steps on the
more memory than bellperson, but stays stable beyond 222 and GPU side and the last window-reduction step on the CPU side, we

350
ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada W. Ma, Q. Xiong, X. Shi, X. Ma, H. Jin, H. Kuang, M. Gao, Y. Zhang, H. Shen, and W. Hu

0 .1 4 4 s 1 .3 5 s B G
1 .0 distribution. GZKP outperforms them as we have demonstrated
N o r m a liz e d e x e c u tio n tim e

G Z K P -n o -L B
G Z K P - n o - L B w . lib earlier.
G Z K P GPU acceleration for NTT and MSM. Another body of previous
0 .5
works [10, 26, 36, 38, 46] accelerate the NTT operation on GPU
0 .0 4 s 0 .4 2 s
0 .3 2 s
and involve data shuffling to ensure continuous memory accesses
0 .0 2 9 s 0 .2 4 s
0 .0 2 s during computation, whose cost grows fast with increasing bit-
0 .0
width. These designs cannot meet the scalability requirements of
2 1 8
2 2 2
M S M s c a le
NTT operation in ZKP, where the NTT scale is up to a million,
and the bit-width of elements is larger than 256-bit. Some related
works [1, 13, 33, 53, 59] accelerate the basic PADD/PMUL opera-
Figure 10: Breakdown analysis for MSM module with BLS12- tions using GPUs to speedup elliptic curve cryptography. These
381 (381-bit) elliptic curve on one V100 GPU acceleration methods can not directly applied to ZKP MSM due
to the computation scale (millions of PMUL/PADD) and the scalar
bit-width (up to 753-bit). GZKP reduces the calculation complexity
of the MSM by leveraging Pippenger-like algorithms [4, 55, 58] and
use GZKP’s MSM module to evaluate the performance improve-
speedup the basic PADD operation using the optimized finite field
ment brought by the GPU-based finite field library. We evaluate
library.
two intermediate solutions, “GZKP-no-LB” and “GZKP-no-LB w.
lib”. The former implements the optimized MSM computation pro-
cess by leveraging bucket-based task partitioning. The latter uses 7 DISCUSSION AND FUTURE WORK
the optimized finite field library based on the former. Finally, the GZKP focuses on accelerating privacy-preserving applications based
GZKP contains all our proposed optimizations, including compu- on ZKP protocols and provides good performance as demonstrated
tation consolidation in Section 4.1 and workload management in in Section 5. Here we discuss GZKP’s wider application and its
Section 4.2. correctness.
As shown in Figure 10, with the MSM scale 222 , the breakdown NTT-based encryption algorithms. NTT is a key building block
analysis shows that GZKP-no-LB achieves 3.25× speedup compared of many homomorphic encryption (HE) algorithms, in addition
to BG. And the GZKP-no-LB w. lib further improves the perfor- to ZKP. These two types of cryptographic algorithms have differ-
mance by 33% over GZKP-no-LB. After enabling load balance, GZKP ent performance requirements. ZKP focuses more on the latency
achieves 5.6× speedup against the BG. The main improvement for of a single NTT operation, as the multiple NTTs are difficult to
the MSM module in GZKP is due to the computation consolida- parallelize in many protocols [11, 51]. Therefore, in GZKP we use
tion optimization. Besides, Our MSM module ensures load balance, all the compute resources of a GPU for a single NTT operation,
® through fine-grained task mapping.
whether for sparse 𝑢® or dense ℎ, plus asynchronous data copying to reduce host-GPU data transfer
overhead. In contrast, HE usually runs multiple NTT operations
6 RELATED WORK on the GPU simultaneously (NTT batching [38, 45]) to achieve
higher throughput and GPU utilization, enabled by the smaller
zkSNARK protocol and implementation of ZKP systems. Re-
NTT scales and the independent nature of multiple NTTs. Our
cently, many works [2, 3, 5, 6, 9, 34, 39, 51, 54, 57, 60] have made
design adopts smaller independent groups as the task granularity,
lots of efforts to optimize the zkSNARK protocol and make it more
making it suitable for throughput-oriented NTT applications with
practical. Since different ZKP applications [7, 15, 19, 56] use differ-
the aforementioned batching techniques. We plan to expand GZKP
ent elliptic curves, there are many ZKP system implementations
to support HE algorithms in our future work.
based on the zkSNARK protocol (i.e. libsnark [50], bellman [23],
Verification of GZKP. Note that GZKP focuses on efficient parallel
MINA [18] and bellperson [16]), and most of these implementa-
acceleration of core operations (NTT and MSM) in the ZKP protocol.
tions only support specific curves. GZKP supports different types
It does not modify the ZKP protocol itself, hence has no impact on
of elliptic curves to serve a range of applications by leveraging
its security. Furthermore, proof generation on private user input
the optimized finite field library, and the modular design enables
data usually runs on the user’s local machine. Implementation bugs
GZKP to be easily integrated into the existing ZKP systems as a
may generate wrong proofs, which could be addressed by formal
high-performance back-end to achieve fast proof generation.
verification of GZKP. We will leave this as future work.
Hardware acceleration for ZKP system. There are a few designs
improving the performance of ZKP with hardware acceleration,
such as FPGA [25, 28, 63] and GPU [16, 18]. A recent work called 8 CONCLUSION
PipeZK [63] is proposed to accelerate zkSNARK through customized Zero-knowledge proof can guarantee the integrity and confiden-
hardware accelerator, which pipelines the key operations (i.e. NTT tiality of transactions and is increasingly used in applications like
and MSM ) in proof generation. These designs cannot work well cryptocurrency. However, its high computation overhead hinders
with the increasing scale of the input vector. Current GPU-based its deployment in online services. In this paper, we present GZKP,
ZKP systems (i.e. MINA [18], bellperson [16]) in industry accelerate a GPU-based ZKP system that offers at least an order of magnitude
the prover computation by parallelizing those key operations. These improvement over current state-of-the-art solutions. Our results
systems suffer from huge data shuffling overhead caused by large and experience reveal that there are a significant amount of op-
bit-width integers and severe load imbalance caused by sparse data portunities for improving the parallel execution efficiency of ZKP

351
GZKP: A GPU Accelerated Zero-Knowledge Proof System ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada

proof generation. Meanwhile, ZKP requires targeted optimizations [20] NVIDIA Corp. 2021. CUDA C++ Programming Guide. https://docs.nvidia.com/
to handle large integer operations that dominate its execution. cuda/pdf/CUDA_C_Programming_Guide.pdf.
[21] StarkWare Corp. 2022. https://starkware.co/.
[22] Zcash Corp. 2022. https://z.cash/.
[23] ZKCrypto Corp. 2022. bellman: a crate for building zk-SNARK circuits. https:
ACKNOWLEDGMENTS //github.com/zkcrypto/bellman.
[24] Zcash Corp. 2022. librustzcash: a (work-in-progress) set of Rust crates for working
The authors thank the anonymous reviewers and the shepherd for with Zcash. https://github.com/zcash/librustzcash.
their valuable comments. We thank Dong Zhou and Xiaoxu Guo for [25] Zcash Corp. 2022. Zcash FPGA acceleration engine. https://github.com/
their participation. This work was supported in part by the National ZcashFoundation/zcash-fpga.
[26] Wei Dai and Berk Sunar. 2015. cuHE: A homomorphic encryption accelerator
Key R&D Program of China under Grant 2020AAA0108501, and library. In International Conference on Cryptography and Information Security in
the Key R&D Program of Hubei under Grant 2020BAA020. the Balkans. Springer, 169–186. https://doi.org/10.1007/978-3-319-29172-7_11
[27] Theodorus Jozef Dekker. 1971. A floating-point technique for extending the
available precision. Numer. Math. 18, 3 (1971), 224–242. https://doi.org/10.1007/
BF01397083
REFERENCES [28] Benjamin Devlin. 2022. FPGA SNARK prover targeting the bn128 curve. https:
[1] Samuel Antao, Jean-Claude Bajard, and Leonel Sousa. 2010. Elliptic curve point //github.com/bsdevlin/fpga_snark_prover.
multiplication on GPUs. In ASAP 2010-21st IEEE International Conference on [29] Jiankuo Dong, Fangyu Zheng, Niall Emmart, Jingqiang Lin, and Charles Weems.
Application-specific Systems, Architectures and Processors. IEEE, 192–199. https: 2018. sDPF-RSA: Utilizing floating-point computing power of GPUs for massive
//doi.org/10.1109/ASAP.2010.5541000 digital signature computations. In 2018 IEEE International Parallel and Distributed
[2] Eli Ben-Sasson, Iddo Bentov, Alessandro Chiesa, Ariel Gabizon, Daniel Genkin, Processing Symposium (IPDPS). IEEE, 599–609. https://doi.org/10.1109/IPDPS.
Matan Hamilis, Evgenya Pergament, Michael Riabzev, Mark Silberstein, Eran 2018.00069
Tromer, and Madars Virza. 2017. Computational Integrity with a Public Random [30] Niall Emmart, Fangyu Zheng, and Charles Weems. 2018. Faster modular ex-
String from Quasi-Linear PCPs. In Advances in Cryptology – EUROCRYPT 2017, ponentiation using double precision floating point arithmetic on the GPU. In
Jean-Sébastien Coron and Jesper Buus Nielsen (Eds.). Springer International 2018 IEEE 25th Symposium on Computer Arithmetic (ARITH). IEEE, 130–137.
Publishing, Cham, 551–579. https://doi.org/10.1007/978-3-319-56617-7_19 https://doi.org/10.1109/ARITH.2018.8464792
[3] Eli Ben-Sasson, Alessandro Chiesa, Eran Tromer, and Madars Virza. 2014. Suc- [31] Ariel Gabizon, Zachary J Williamson, and Oana Ciobotaru. 2019. Plonk: Per-
cinct Non-Interactive Zero Knowledge for a von Neumann Architecture. (2014), mutations over lagrange-bases for oecumenical noninteractive arguments of
781–796. https://dl.acm.org/doi/10.5555/2671225.2671275 knowledge. Cryptology ePrint Archive (2019). https://eprint.iacr.org/2019/953
[4] Daniel J Bernstein, Jeroen Doumen, Tanja Lange, and Jan-Jaap Oosterwijk. 2012. [32] Hisham S. Galal and Amr M. Youssef. 2018. Verifiable Sealed-Bid Auction on
Faster batch forgery identification. In International Conference on Cryptology in the Ethereum Blockchain. In Financial Cryptography and Data Security: FC 2018
India. Springer, 454–473. https://doi.org/10.1007/978-3-642-34931-7_26 International Workshops, BITCOIN, VOTING, and WTSC, Nieuwpoort, Curaçao,
[5] Nir Bitansky, Ran Canetti, Alessandro Chiesa, and Eran Tromer. 2012. From March 2, 2018, Revised Selected Papers (Nieuwpoort, Curaçao). Springer-Verlag,
Extractable Collision Resistance to Succinct Non-Interactive Arguments of Berlin, Heidelberg, 265–278. https://doi.org/10.1007/978-3-662-58820-8_18
Knowledge, and Back Again. In Proceedings of the 3rd Innovations in Theoret- [33] Lili Gao, Fangyu Zheng, Niall Emmart, Jiankuo Dong, Jingqiang Lin, and Charles
ical Computer Science Conference (Cambridge, Massachusetts) (ITCS ’12). As- Weems. 2020. DPF-ECC: Accelerating Elliptic Curve Cryptography with Floating-
sociation for Computing Machinery, New York, NY, USA, 326–349. https: Point Computing Power of GPUs. In 2020 IEEE International Parallel and Dis-
//doi.org/10.1145/2090236.2090263 tributed Processing Symposium (IPDPS). IEEE, 494–504. https://doi.org/10.1109/
[6] Manuel Blum, Paul Feldman, and Silvio Micali. 1988. Non-Interactive Zero- IPDPS47924.2020.00058
Knowledge and Its Applications. In Proceedings of the Twentieth Annual ACM [34] Rosario Gennaro, Craig Gentry, Bryan Parno, and Mariana Raykova. 2013. Qua-
Symposium on Theory of Computing. Association for Computing Machinery, New dratic Span Programs and Succinct NIZKs without PCPs. In Advances in Cryptol-
York, NY, USA, 103–112. https://doi.org/10.1145/62212.62222 ogy – EUROCRYPT 2013, Thomas Johansson and Phong Q. Nguyen (Eds.). Springer
[7] Joseph Bonneau, Izaak Meckler, Vanishree Rao, and Evan Shapiro. 2020. Coda: Berlin Heidelberg, Berlin, Heidelberg, 626–645. https://doi.org/10.1007/978-3-
Decentralized cryptocurrency at scale. Cryptology ePrint Archive (2020). https: 642-38348-9_37
//eprint.iacr.org/2020/352 [35] Craig Gentry. 2009. Fully homomorphic encryption using ideal lattices. In
[8] Sean Bowe, Alessandro Chiesa, Matthew Green, Ian Miers, Pratyush Mishra, and Proceedings of the forty-first annual ACM symposium on Theory of computing.
Howard Wu. 2020. ZEXE: Enabling Decentralized Private Computation. In 2020 Association for Computing Machinery, New York, NY, USA, 169–178. https:
IEEE Symposium on Security and Privacy (SP). 947–964. https://doi.org/10.1109/ //doi.org/10.1145/1536414.1536440
SP40000.2020.00050 [36] Jia-Zheng Goey, Wai-Kong Lee, Bok-Min Goi, and Wun-She Yap. 2021. Ac-
[9] Benedikt Bünz, Alessandro Chiesa, Pratyush Mishra, and Nicholas Spooner. 2020. celerating number theoretic transform in GPU platform for fully homomor-
Proof-carrying data from accumulation schemes. Cryptology ePrint Archive (2020). phic encryption. The Journal of Supercomputing 77 (2021), 1455–1474. https:
https://eprint.iacr.org/2020/499 //doi.org/10.1007/s11227-020-03156-7
[10] Liangyu Chen, Svyatoslav Covanov, Davood Mohajerani, and Marc Moreno Maza. [37] Shafi Goldwasser, Silvio Micali, and Chales Rackoff. 2019. The Knowledge Com-
2017. Big Prime Field FFT on the GPU. In Proceedings of the 2017 ACM on plexity of Interactive Proof-Systems. (2019), 203–225. https://doi.org/10.1145/
International Symposium on Symbolic and Algebraic Computation (Kaiserslautern, 3335741.3335750
Germany) (ISSAC ’17). Association for Computing Machinery, New York, NY, [38] Naga K Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John
USA, 85–92. https://doi.org/10.1145/3087604.3087657 Manferdelli. 2008. High performance discrete Fourier transforms on graphics pro-
[11] Alessandro Chiesa, Yuncong Hu, Mary Maller, Pratyush Mishra, Noah Vesely, cessors. In SC’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing.
and Nicholas Ward. 2020. Marlin: Preprocessing zkSNARKs with Universal and IEEE, 1–12. https://doi.org/10.1109/SC.2008.5213922
Updatable SRS. In Advances in Cryptology – EUROCRYPT 2020, Anne Canteaut [39] Jens Groth. 2010. Short pairing-based non-interactive zero-knowledge arguments.
and Yuval Ishai (Eds.). Springer International Publishing, Cham, 738–768. https: In International Conference on the Theory and Application of Cryptology and
//doi.org/10.1007/978-3-030-45721-1_26 Information Security. Springer, 321–340. https://doi.org/10.1007/978-3-642-17373-
[12] Eleanor Chu and Alan George. 1999. Inside the FFT black box: serial and parallel 8_19
fast Fourier transform algorithms. CRC press. [40] Jens Groth. 2016. On the Size of Pairing-Based Non-interactive Arguments. In
[13] Aaron E. Cohen and Keshab K. Parhi. 2010. GPU accelerated elliptic curve Advances in Cryptology – EUROCRYPT 2016, Marc Fischlin and Jean-Sébastien
cryptography in GF(2m). In 2010 53rd IEEE International Midwest Symposium on Coron (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 305–326. https:
Circuits and Systems. 57–60. https://doi.org/10.1109/MWSCAS.2010.5548560 //doi.org/10.1007/978-3-662-49896-5_11
[14] James W Cooley and John W Tukey. 1965. An algorithm for the machine cal- [41] Darrel Hankerson, Alfred J Menezes, and Scott Vanstone. 2006. Guide to elliptic
culation of complex Fourier series. Mathematics of computation 19, 90 (1965), curve cryptography. Springer Science & Business Media.
297–301. https://doi.org/10.2307/2003354 [42] Mark Harris, Shubhabrata Sengupta, and John D Owens. 2007. Parallel prefix
[15] Filecoin Corp. 2022. https://filecoin.io/. sum (scan) with CUDA. GPU gems 3, 39 (2007), 851–876.
[16] Filecoin Corp. 2022. bellperson: gpu parallel acceleration for zk-snark. https: [43] Daira Hopwood, Sean Bowe, Taylor Hornby, and Nathan Wilcox. 2022. Zcash
//github.com/filecoin-project/bellperson. Protocol Specification. (2022). https://zips.z.cash/protocol/protocol.pdf
[17] J.P. Morgan Quorum Corp. 2022. https://www.goquorum.com/. [44] QED it Corp. 2017. https://qed-it.com/.
[18] MINA Corp. 2019. GPU Groth16 prover (3x faster than CPU). https://github. [45] Wonkyung Jung, Sangpyo Kim, Jung Ho Ahn, Jung Hee Cheon, and Younho Lee.
com/MinaProtocol/gpu-groth16-prover-3x. 2021. Over 100x faster bootstrapping in fully homomorphic encryption through
[19] Mina Corp. 2022. https://minaprotocol.com/. memory-centric optimization with GPUs. IACR Transactions on Cryptographic

352
ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada W. Ma, Q. Xiong, X. Shi, X. Ma, H. Jin, H. Kuang, M. Gao, Y. Zhang, H. Shen, and W. Hu

Hardware and Embedded Systems (2021), 114–148. https://doi.org/10.46586/tches. [56] Eli Ben Sasson, Alessandro Chiesa, Christina Garman, Matthew Green, Ian Miers,
v2021.i4.114-148 Eran Tromer, and Madars Virza. 2014. Zerocash: Decentralized anonymous
[46] Sangpyo Kim, Wonkyung Jung, Jaiyoung Park, and Jung Ho Ahn. 2020. Accel- payments from bitcoin. In 2014 IEEE Symposium on Security and Privacy. IEEE,
erating Number Theoretic Transformations for Bootstrappable Homomorphic 459–474. https://doi.org/10.1109/SP.2014.36
Encryption on GPUs. In 2020 IEEE International Symposium on Workload Charac- [57] Srinath Setty. 2020. Spartan: Efficient and general-purpose zkSNARKs without
terization (IISWC). IEEE, 264–275. https://doi.org/10.1109/IISWC50251.2020.00033 trusted setup. In Annual International Cryptology Conference. Springer, 704–737.
[47] Ahmed Kosba. 2021. jsnark: a Java library for building circuits for preprocessing https://doi.org/10.1007/978-3-030-56877-1_25
zk-SNARKs. https://github.com/akosba/jsnark. [58] Ernst G Straus. 1964. Addition chains of vectors (problem 5125). Amer. Math.
[48] Ahmed Kosba, Dimitrios Papadopoulos, Charalampos Papamanthou, and Dawn Monthly 70, 806-808 (1964), 16.
Song. 2020. MIRAGE: Succinct Arguments for Randomized Algorithms with [59] Robert Szerwinski and Tim Güneysu. 2008. Exploiting the power of GPUs for
Applications to Universal Zk-SNARKs. In Proceedings of the 29th USENIX Con- asymmetric cryptography. In International Workshop on Cryptographic hardware
ference on Security Symposium. USENIX Association, USA, Article 120, 18 pages. and embedded systems. Springer, 79–99. https://doi.org/10.1007/978-3-540-85053-
https://dl.acm.org/doi/10.5555/3489212.3489332 3_6
[49] Ahmed Kosba, Charalampos Papamanthou, and Elaine Shi. 2018. xJsnark: A [60] Howard Wu, Wenting Zheng, Alessandro Chiesa, Raluca Ada Popa, and Ion
framework for efficient verifiable computation. In 2018 IEEE Symposium on Secu- Stoica. 2018. DIZK: A Distributed Zero Knowledge Proof System. Cryptology
rity and Privacy (SP). IEEE, 944–961. https://doi.org/10.1109/SP.2018.00018 ePrint Archive, Paper 2018/691. (2018). https://eprint.iacr.org/2018/691 https:
[50] SCIPR LAB. 2022. libsnark: a C++ library for zkSNARK proofs. https://github. //eprint.iacr.org/2018/691.
com/scipr-lab/libsnark. [61] Jiaheng Zhang, Zhiyong Fang, Yupeng Zhang, and Dawn Song. 2020. Zero
[51] Mary Maller, Sean Bowe, Markulf Kohlweiss, and Sarah Meiklejohn. 2019. Sonic: knowledge proofs for decision tree predictions and accuracy. In Proceedings of
Zero-Knowledge SNARKs from Linear-Size Universal and Updatable Structured the 2020 ACM SIGSAC Conference on Computer and Communications Security.
Reference Strings. In Proceedings of the 2019 ACM SIGSAC Conference on Computer 2039–2053. https://doi.org/10.1145/3372297.3417278
and Communications Security (London, United Kingdom) (CCS ’19). Association [62] Yupeng Zhang, Daniel Genkin, Jonathan Katz, Dimitrios Papadopoulos, and
for Computing Machinery, New York, NY, USA, 2111–2128. https://doi.org/10. Charalampos Papamanthou. 2017. vSQL: Verifying arbitrary SQL queries over
1145/3319535.3339817 dynamic outsourced databases. In 2017 IEEE Symposium on Security and Privacy
[52] Peter L Montgomery. 1985. Modular multiplication without trial division. Math- (SP). IEEE, 863–880. https://doi.org/10.1109/SP.2017.43
ematics of computation 44, 170 (1985), 519–521. [63] Ye Zhang, Shuo Wang, Xian Zhang, Jiangbin Dong, Xingzhong Mao, Fan Long,
[53] Wuqiong Pan, Fangyu Zheng, Yuan Zhao, Wen-Tao Zhu, and Jiwu Jing. 2016. Cong Wang, Dong Zhou, Mingyu Gao, and Guangyu Sun. 2021. PipeZK: Acceler-
An efficient elliptic curve cryptography signature server with GPU acceleration. ating Zero-Knowledge Proof with a Pipelined Architecture. In 2021 ACM/IEEE
IEEE Transactions on Information Forensics and Security 12, 1 (2016), 111–122. 48th Annual International Symposium on Computer Architecture (ISCA). 416–428.
https://doi.org/10.1109/TIFS.2016.2603974 https://doi.org/10.1109/ISCA52012.2021.00040
[54] Bryan Parno, Jon Howell, Craig Gentry, and Mariana Raykova. 2013. Pinocchio: [64] Zhichao Zhao and T-H Hubert Chan. 2015. How to vote privately using bitcoin.
Nearly practical verifiable computation. In 2013 IEEE Symposium on Security and In International Conference on Information and Communications Security. Springer,
Privacy. IEEE, 238–252. https://doi.org/10.1109/SP.2013.47 82–96. https://doi.org/10.1007/978-3-319-29814-6_8
[55] Nicholas Pippenger. 1976. On the evaluation of powers and related problems. [65] zkSync Corp. 2022. https://zksync.io/.
In 17th Annual Symposium on Foundations of Computer Science (sfcs 1976). IEEE
Computer Society, 258–263. https://doi.org/10.1109/SFCS.1976.21 Received 2022-07-07; accepted 2022-09-22

353

2025-172
No ratings yet
2025-172
24 pages
A Survey On The Applications of Zero-Knowledge Pro
No ratings yet
A Survey On The Applications of Zero-Knowledge Pro
81 pages
Practice 19-b Creating An Oracle 19c RAC Database
No ratings yet
Practice 19-b Creating An Oracle 19c RAC Database
19 pages
Embedded Systems Dictionary Jack Ganssle and Michael Barr
No ratings yet
Embedded Systems Dictionary Jack Ganssle and Michael Barr
305 pages
Proof Sarg Sand ZK
No ratings yet
Proof Sarg Sand ZK
329 pages
A Survey on Zero-Knowledge Proof in Blockchain
No ratings yet
A Survey on Zero-Knowledge Proof in Blockchain
8 pages
Why and How ZK Snark Works
No ratings yet
Why and How ZK Snark Works
64 pages
Sok ZK Friendly Hashes
No ratings yet
Sok ZK Friendly Hashes
19 pages
Ai R16 - Unit-2
No ratings yet
Ai R16 - Unit-2
54 pages
Interactive Locking, Zero-Knowledge PCPS, and Unconditional Cryptography
No ratings yet
Interactive Locking, Zero-Knowledge PCPS, and Unconditional Cryptography
50 pages
Mobile Commerce and Ubiquitous Computing
No ratings yet
Mobile Commerce and Ubiquitous Computing
50 pages
Lecture 11
No ratings yet
Lecture 11
79 pages
Manual PDF
No ratings yet
Manual PDF
186 pages
ProofsArgsAndZK
No ratings yet
ProofsArgsAndZK
325 pages
6 Steps To Drive Social Software Adoption
No ratings yet
6 Steps To Drive Social Software Adoption
8 pages
Zero Knowledge
No ratings yet
Zero Knowledge
41 pages
ZK PDF
No ratings yet
ZK PDF
65 pages
ZK SNARKS
No ratings yet
ZK SNARKS
65 pages
Salesforce-AI-Specialist-2
No ratings yet
Salesforce-AI-Specialist-2
61 pages
Teaching and Learning With Technology
100% (1)
Teaching and Learning With Technology
315 pages
Gf45 Final Report
No ratings yet
Gf45 Final Report
55 pages
A Self-Sovereign Identity Based On Zero-Knowledge Proof and Blockchain
No ratings yet
A Self-Sovereign Identity Based On Zero-Knowledge Proof and Blockchain
11 pages
Zero Knowledge Proof: 17CP024 Dhruvil Kotecha
No ratings yet
Zero Knowledge Proof: 17CP024 Dhruvil Kotecha
31 pages
DBMS (Unit II)
No ratings yet
DBMS (Unit II)
33 pages
Ajava
No ratings yet
Ajava
30 pages
ZKBoo Faster Zero-Knowledge For Boolean Circuit
No ratings yet
ZKBoo Faster Zero-Knowledge For Boolean Circuit
24 pages
Zcash: Abstract - Bitcoin's Transparent, Immutable, Publicly
No ratings yet
Zcash: Abstract - Bitcoin's Transparent, Immutable, Publicly
5 pages
zk-proofs
No ratings yet
zk-proofs
35 pages
PPT_AI-Based Fraud Detection System for Online Transactions With Real-Time Alerts.
No ratings yet
PPT_AI-Based Fraud Detection System for Online Transactions With Real-Time Alerts.
20 pages
FPGA Acceleration of Multi-Scalar Multiplication: CycloneMSM
No ratings yet
FPGA Acceleration of Multi-Scalar Multiplication: CycloneMSM
17 pages
PSX
No ratings yet
PSX
107 pages
Assignment 2 Security Requirements
No ratings yet
Assignment 2 Security Requirements
14 pages
UNIT 1 AND 2 IT03
No ratings yet
UNIT 1 AND 2 IT03
91 pages
ICICLE v2: Polynomial API For Coding ZK Provers To Run On Specialized Hardware
No ratings yet
ICICLE v2: Polynomial API For Coding ZK Provers To Run On Specialized Hardware
19 pages
2501.03245v1
No ratings yet
2501.03245v1
20 pages
Beginner's Intro To Coding Zero-Knowledge Proofs
No ratings yet
Beginner's Intro To Coding Zero-Knowledge Proofs
16 pages
Ou: Automating The Parallelization of Zero-Knowledge Protocols
No ratings yet
Ou: Automating The Parallelization of Zero-Knowledge Protocols
21 pages
Lecture 3.1.6 - Zero Knowledge Systems
No ratings yet
Lecture 3.1.6 - Zero Knowledge Systems
16 pages
Do You Need A Zero Knowledge Proof?
No ratings yet
Do You Need A Zero Knowledge Proof?
17 pages
Zero-Knowledge Range Proofs
No ratings yet
Zero-Knowledge Range Proofs
18 pages
ZKPpuzzles
No ratings yet
ZKPpuzzles
15 pages
Ramco
No ratings yet
Ramco
6 pages
A Survey On Zero Knowledge Range Proofs and Applications: Eduardo Morais Tommy Koens Cees Van Wijk Aleksei Koren
No ratings yet
A Survey On Zero Knowledge Range Proofs and Applications: Eduardo Morais Tommy Koens Cees Van Wijk Aleksei Koren
17 pages
ACC Test Plan
No ratings yet
ACC Test Plan
9 pages
TCHES2025_1_10
No ratings yet
TCHES2025_1_10
39 pages
73647cf6-d85b-4e81-b7e4-6f8501985e81
No ratings yet
73647cf6-d85b-4e81-b7e4-6f8501985e81
34 pages
14-Zero Knowledge Proofs-20-08-2024
No ratings yet
14-Zero Knowledge Proofs-20-08-2024
16 pages
Chapter 22
No ratings yet
Chapter 22
23 pages
Bulletproofs_Shorter_Proofs_for_a_Privacy-Enhanced_Distributed_Ledger
No ratings yet
Bulletproofs_Shorter_Proofs_for_a_Privacy-Enhanced_Distributed_Ledger
16 pages
Elliptic Curve Cryptography On Embedded Multicore Systems
No ratings yet
Elliptic Curve Cryptography On Embedded Multicore Systems
6 pages
Net2008a 20201112
No ratings yet
Net2008a 20201112
6 pages
Direct Anonymous Attestation: Cse, 7th Sem
No ratings yet
Direct Anonymous Attestation: Cse, 7th Sem
8 pages
Lecture_16_Zcash_173927088453675945767ab2ae431d66
No ratings yet
Lecture_16_Zcash_173927088453675945767ab2ae431d66
19 pages
2
No ratings yet
2
19 pages
Module 5 Snark and Zcash
No ratings yet
Module 5 Snark and Zcash
14 pages
Trustless Privacy-Preserving Data Aggregation On
No ratings yet
Trustless Privacy-Preserving Data Aggregation On
17 pages
3-540-48184-2_5
No ratings yet
3-540-48184-2_5
21 pages
Hentai
No ratings yet
Hentai
2 pages
Zero-Knowledge Proof Vulnerability Analysis and Security Auditing
No ratings yet
Zero-Knowledge Proof Vulnerability Analysis and Security Auditing
21 pages
Problem Statements
No ratings yet
Problem Statements
5 pages
Optimized Cross-Chain Transactions With Aggregated Zero-Knowledge Proofs Enhanci
No ratings yet
Optimized Cross-Chain Transactions With Aggregated Zero-Knowledge Proofs Enhanci
16 pages
Zero Knowledge ML For Generative AI
No ratings yet
Zero Knowledge ML For Generative AI
9 pages
PipeMSM Hardware Acceleration
No ratings yet
PipeMSM Hardware Acceleration
14 pages
AxRMs_Approximate_Recursive_Multipliers_Using_High-Performance_Building_Blocks
No ratings yet
AxRMs_Approximate_Recursive_Multipliers_Using_High-Performance_Building_Blocks
7 pages
FPGA Implementation of Diffie-Hellman Key Exchange Algorithm For Zero Knowledge Proof
No ratings yet
FPGA Implementation of Diffie-Hellman Key Exchange Algorithm For Zero Knowledge Proof
47 pages
Operating_ZKPs_on_Blockchain_A_Performance_Analysis_Based_on_Hyperledger_Fabric
No ratings yet
Operating_ZKPs_on_Blockchain_A_Performance_Analysis_Based_on_Hyperledger_Fabric
10 pages
3658644.3670316
No ratings yet
3658644.3670316
15 pages
Zero Knowledge Protocol
100% (1)
Zero Knowledge Protocol
12 pages
CS Form No 212 Work Experience Sheet
No ratings yet
CS Form No 212 Work Experience Sheet
2 pages
System Engineering Interfaces-IEEE 2013
No ratings yet
System Engineering Interfaces-IEEE 2013
8 pages
ZKP Research Paper
No ratings yet
ZKP Research Paper
6 pages
MS Excel MCQ Quiz Set 2
No ratings yet
MS Excel MCQ Quiz Set 2
17 pages
boundless
No ratings yet
boundless
10 pages
Onenote Rubric
No ratings yet
Onenote Rubric
5 pages
Lab-10-Fuzzing A Local Network Service
No ratings yet
Lab-10-Fuzzing A Local Network Service
3 pages
Analysis_and_comparison_of_the_main_zero-knowledge_proof_scheme
No ratings yet
Analysis_and_comparison_of_the_main_zero-knowledge_proof_scheme
7 pages
A_Benchmark_for_Different_Implementations_of_Zero-Knowledge_Proof_Systems (2)
No ratings yet
A_Benchmark_for_Different_Implementations_of_Zero-Knowledge_Proof_Systems (2)
8 pages
Enhancing Account Privacy in Blockchain-Based Iot Access Control Via Zero Knowledge Proof
No ratings yet
Enhancing Account Privacy in Blockchain-Based Iot Access Control Via Zero Knowledge Proof
7 pages
On Chain Zk Machine Learning
No ratings yet
On Chain Zk Machine Learning
15 pages
Alternative
No ratings yet
Alternative
15 pages
Teachers Weekly Accomplishment Report Format
No ratings yet
Teachers Weekly Accomplishment Report Format
2 pages
Iot
No ratings yet
Iot
14 pages
Complete Guide To ZKLLVM and Proof of Market - Streamlining Zk-Enabled Ethereum Apps With ZKLLVM
No ratings yet
Complete Guide To ZKLLVM and Proof of Market - Streamlining Zk-Enabled Ethereum Apps With ZKLLVM
14 pages
An Introduction to Zero Knowledge Proofs in Blockchains and Economics
No ratings yet
An Introduction to Zero Knowledge Proofs in Blockchains and Economics
15 pages
Efficiency_of_Zero-Knowledge_Proofs_A_Through_Review_and_Analysis (1)
No ratings yet
Efficiency_of_Zero-Knowledge_Proofs_A_Through_Review_and_Analysis (1)
7 pages
A_Survey_on_Zero-Knowledge_Proof_in_Blockchain
No ratings yet
A_Survey_on_Zero-Knowledge_Proof_in_Blockchain
8 pages
Hadr Config Db2
No ratings yet
Hadr Config Db2
4 pages
Curriculum Vitae: Electrical Engineer
No ratings yet
Curriculum Vitae: Electrical Engineer
3 pages
Bayesian Network: Modeling Uncertainty in Robotics Systems
From Everand
Bayesian Network: Modeling Uncertainty in Robotics Systems
Fouad Sabry
No ratings yet
Neural Networks and Fuzzy Logic
From Everand
Neural Networks and Fuzzy Logic
C. Naga Bhaskar
No ratings yet