
HECO: Fully Homomorphic Encryption Compiler

Alexander Viand, Patrick Jattke, Miro Haller, and Anwar Hithnawi, ETH Zurich
https://www.usenix.org/conference/usenixsecurity23/presentation/viand

This paper is included in the Proceedings of the 32nd USENIX Security Symposium.
August 9–11, 2023 • Anaheim, CA, USA
ISBN 978-1-939133-37-3

Open access to the Proceedings of the 32nd USENIX Security Symposium is sponsored by USENIX.
Abstract

In recent years, Fully Homomorphic Encryption (FHE) has undergone several breakthroughs and advancements, leading to a leap in performance. Today, performance is no longer a major barrier to adoption. Instead, it is the complexity of developing an efficient FHE application that currently limits deploying FHE in practice and at scale. Several FHE compilers have emerged recently to ease FHE development. However, none of these answer how to automatically transform imperative programs to secure and efficient FHE implementations. This is a fundamental issue that needs to be addressed before we can realistically expect broader use of FHE. Automating these transformations is challenging because the restrictive set of operations in FHE and their non-intuitive performance characteristics require programs to be drastically transformed to achieve efficiency. Moreover, existing tools are monolithic and focus on individual optimizations. Therefore, they fail to fully address the needs of end-to-end FHE development. In this paper, we present HECO, a new end-to-end design for FHE compilers that takes high-level imperative programs and emits efficient and secure FHE implementations. In our design, we take a broader view of FHE development, extending the scope of optimizations beyond the cryptographic challenges existing tools focus on.

1 Introduction

Privacy and security are gaining tremendous importance across all organizations, as public perception has shifted and expectations, including regulatory demands, have increased. This has led to a surge in demand for secure and confidential computing solutions that protect data's confidentiality in transit, at rest, and in use. Fully Homomorphic Encryption (FHE) is a key secure computation technology that enables systems to preserve the confidentiality of data at any phase, hence allowing outsourcing of computations without having to grant access to the data. In the last decade, theoretical breakthroughs propelled FHE to a practical solution for a wide range of applications [1–3] in real-world scenarios. In addition, end-user facing deployments have started to appear, e.g., in Microsoft Edge's password monitor [1]. With upcoming hardware accelerators for FHE promising further speedup [4, 5], FHE will soon be competitive for an even wider set of applications.

Though promising in its potential, developing efficient FHE applications remains a complex and tedious process that even experts struggle with. A large part of this complexity arises from the need to map applications to the unique programming paradigms imposed by FHE (cf. § 2.3). Properly optimized code for this paradigm is often several orders of magnitude more efficient than poorly adapted code, making optimization essential for practical FHE applications and posing a major barrier to wider adoption.

Fully Homomorphic Encryption is a nascent field and still actively evolving, with ongoing research on the cryptography, software implementations, and, increasingly, on hardware accelerators. As a result, tools must be designed to accommodate and adapt to this fast-moving field. Existing compilers (cf. § 7), however, are mostly rigid, monolithic tools with a narrow focus on individual sub-optimizations. Whereas experts usually transform applications in ways that accelerate them by orders of magnitude, existing tools have mostly focused on smaller-scale optimizations that result in small constant-factor speedups. While these represent important contributions, they are insufficient to make the kind of qualitative performance difference that is necessary to achieve practical FHE. In order to overcome these limitations, we need to fundamentally rethink the architecture of FHE compilers and develop novel optimizations that abstract away the complexity of FHE and address the limitations of existing tools. In this paper, we present HECO, a new multi-stage optimizing FHE compiler. Our architecture provides, for the first time, a true end-to-end toolchain for FHE development. In addition, we propose novel transformations and optimizations that map imperative programs to the unique programming model of FHE.

E2E Architecture. Expert developers naturally structure the development of FHE applications into different stages. First, they consider how to efficiently map applications to the unique paradigm of FHE, e.g., how to minimize the need for data movement. Only afterward do they consider lower-level issues, such as fine-tuning the program to exploit scheme-specific optimizations. Currently, most developers then target software libraries such as Microsoft SEAL [6], which realize the underlying FHE schemes and implement a range of cryptographic optimizations. However, with more focus on hardware acceleration, there is an emerging need to optimize for individual hardware targets, which libraries are usually ill-suited to accomplish. Based on this progression of abstraction levels, we identified four phases of converting an application to an efficient FHE implementation: program transformation, circuit optimization, cryptographic optimization, and target optimization (cf. § 3.1).
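This four-phase progression can be sketched as a pass pipeline over a toy IR. The pass names, the string-based IR, and the `compile_ir` driver below are illustrative stand-ins for the idea of staged lowering, not HECO's actual interfaces.

```python
from typing import Callable, List

# Toy IR: a program is just a list of instruction strings.
IR = List[str]
Pass = Callable[[IR], IR]

def program_transform(ir: IR) -> IR:
    # Phase 1 (illustrative): map imperative loops onto SIMD batching.
    return [instr.replace("loop", "simd") for instr in ir]

def circuit_optimize(ir: IR) -> IR:
    # Phase 2 (illustrative): insert relinearization after each multiplication.
    out: IR = []
    for instr in ir:
        out.append(instr)
        if instr.startswith("mul"):
            out.append("relinearize")
    return out

def crypto_optimize(ir: IR) -> IR:
    # Phase 3 (illustrative): prepend a parameter choice for the computation.
    return ["set_params n=8192"] + ir

def target_optimize(ir: IR) -> IR:
    # Phase 4 (illustrative): lower to a specific backend's mnemonics.
    return [f"seal.{instr}" for instr in ir]

PIPELINE: List[Pass] = [program_transform, circuit_optimize,
                        crypto_optimize, target_optimize]

def compile_ir(ir: IR) -> IR:
    # Each phase consumes the previous phase's output, as in staged lowering.
    for p in PIPELINE:
        ir = p(ir)
    return ir

print(compile_ir(["loop_add", "mul"]))
```

Each stage only needs to understand the abstraction level it receives, which is the motivation for the per-stage IRs discussed in § 3.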

USENIX Association 32nd USENIX Security Symposium 4715


Compilers need to be able to accommodate a wide variety of optimizations across these different levels of abstraction. However, existing compilers usually abstract FHE computations as circuits consisting of the basic homomorphic operations and scheme-specific operations for managing noise. This is a natural representation since FHE computations are mathematically modeled as arithmetic circuits. However, many optimizations at the high level (program transformation) or at the lowest level (target optimization) cannot be fully expressed in the circuit setting because they consider aspects such as data flow or memory management which have no natural correspondence in this abstraction. We instead propose a set of Intermediate Representations (IRs) based on the requirements of each phase that allow us to naturally and efficiently express optimizations at these different levels. We realize these IRs using the MLIR compiler framework [7], which provides a standardized way to define and operate on domain-specific IRs. MLIR enables the transfer of optimizations between different projects, including across domains, and provides a powerful software framework. Specifically, it is well suited to represent and optimize high-level programs in a way that circuit-based tools are not.

[Figure 1: Using FHE, a third party can compute on encrypted values without requiring access to the underlying data. (Diagram: the client runs Gen to obtain keys ks, kp, kev, encrypts x, and sends Enc(x); the server runs Eval(f(·)) and returns Enc(f(x)), which the client decrypts to f(x).)]

Automated Mapping to Efficient FHE. HECO supports the automatic transformation of high-level programs to FHE's unique programming paradigm. Experts spend significant time considering how to best express an application in the FHE paradigm, only considering the other aspects once the program is efficiently expressible using native FHE operations. Existing tools, in contrast, typically disregard this arguably most important phase of the FHE development process.

In HECO, developers can express their algorithms conveniently in the standard imperative paradigm, e.g., using loops that access/modify individual vector elements. However, such programs do not align well with the restricted set of operations offered by FHE schemes, requiring the compiler to translate and restructure the application. We focus on transformations targeting the Single Instruction, Multiple Data (SIMD) parallelism of most modern schemes, which allows one to batch many (usually, 2^13 to 2^16) different values into a single ciphertext and compute over all simultaneously. Batching is used by experts to drastically reduce ciphertext expansion and computational overhead, frequently improving runtimes by several orders of magnitude when compared to naive implementations.

While it is arguably the single most important optimization for many applications, unlocking its performance potential currently requires significant expertise and experience in writing FHE applications. Because these schemes do not offer the data movement operations (e.g., scatter/gather/permute) usually present in the context of vector operations, existing approaches from the traditional compiler literature do not translate well. Specifically, FHE only natively supports element-wise SIMD operations and cyclical rotations of the elements inside a ciphertext, and existing algorithms must be transformed dramatically in order to be expressed solely from these operations. We devise a series of transformations and optimizations that can translate batching-amenable programs to fully exploit SIMD operations while minimizing the need for data movement (cf. § 4).

In our evaluation, we show that HECO can match the performance of expert implementations, providing up to 3500x speedup over naive non-batched implementations (cf. § 6). We open-source HECO and we hope that it will help to advance the FHE development ecosystem. HECO decouples optimizations from front- and back-end logic, allowing it to be easily extended to different languages, FHE libraries, accelerators, and novel optimizations as they emerge.

2 Background

In this section, we briefly introduce the notion of FHE and key aspects of modern FHE schemes.

2.1 Fully Homomorphic Encryption

In a homomorphic encryption (HE) scheme, there exists a homomorphism between plaintext and ciphertext operations such that, e.g., Dec(Enc(x + y)) = Dec(Enc(x) ⊕ Enc(y)) in the case of additively homomorphic encryption. While additively and multiplicatively homomorphic HE schemes (e.g., Paillier [8] and textbook RSA [9], respectively) have been known for many decades, fully homomorphic encryption (FHE) schemes that support an arbitrary combination of both operations remained practically infeasible until Gentry's breakthrough in 2009 [10]. This allows a third party to compute on encrypted data, without requiring access to the underlying data. Specifically, FHE is traditionally defined for outsourced computation, where a client provides encrypted data x and a function f to a server which computes and returns f(x) as shown in Figure 1. The client can decrypt the returned result while the server learns nothing about inputs, intermediate values, or results. Beyond this simple setting, the function (and additional inputs) can also be supplied by the server, opening up additional interesting deployment scenarios. For example, Private Set Intersection (PSI) as used in Microsoft Edge's password monitor [1], where the client sends encrypted login credentials to



the server, which compares them against its own database of leaked usernames and passwords. In the decade since its first realization, FHE performance has improved dramatically: from around half an hour to compute a single multiplication to only a few milliseconds. Nevertheless, this is still seven orders of magnitude slower than a standard, signed multiplication executed on a modern CPU. However, this gap is expected to be reduced significantly with the emergence of dedicated FHE hardware accelerators [5, 11–13].

2.2 FHE Schemes

We briefly describe the BFV scheme [14, 15] as a representative of the largest family of FHE schemes. We focus on aspects relevant to FHE application development and refer to the original papers for further details. Since most modern FHE schemes follow a similar pattern, the descriptions also mostly apply to the BGV and CKKS schemes [16, 17].

In BFV, plaintexts are polynomials of degree n (usually n > 2^12) with coefficients modulo q (usually q > 2^60). Encryption introduces noise into the ciphertext, which is initially small enough to be rounded away during decryption but accumulates during homomorphic operations. As basic operations, BFV supports additions and multiplications over ciphertexts. Additions increase the noise negligibly, but multiplications affect it significantly. Managing noise is crucial to prevent ciphertext corruption, which manifests as a failing decryption. The noise limits computations to a (parameter-dependent) number of consecutive multiplications (multiplicative depth) before decryption fails. While bootstrapping can reduce the noise homomorphically, it introduces significant overheads and must therefore be eliminated or minimized to achieve practical FHE solutions.

Using the Chinese Remainder Theorem (CRT), it is possible to encode a vector of n integers into a single polynomial, with addition and multiplication acting slot-wise (SIMD). Since the polynomial degree n is usually between 2^13 and 2^16 for security, this can significantly reduce ciphertext expansion and computation cost. BFV also supports rotation operations over such batched ciphertexts, which cyclically rotate the vector's elements. Finally, BFV includes a variety of noise-management (or ciphertext maintenance) operations, which do not change the encrypted message but can reduce noise growth during computations.

Security. Modern FHE schemes rely on post-quantum hardness assumptions, widely believed to be secure for the foreseeable future. The community has developed estimates of their concrete hardness [18], and parameter choices for several FHE schemes have been standardized [19]. Nevertheless, some attention must be paid to security when using FHE. For example, FHE does not provide integrity by default, i.e., a server might perform a different calculation than requested or none at all. There exist techniques to address that, ranging from zero-knowledge proofs to hardware attestation [20]. Additionally, FHE does not provide circuit privacy by default, i.e., a client might be able to learn information about the applied circuit. Different techniques, varying in practicality and protection level, can be used to address this [21, 22]. Finally, issues can appear when using approximate homomorphic encryption (e.g., CKKS), with attacks that can recover the secret key from the noise embedded in ciphertext decryptions [23]. Recent work has shown how adding differentially private noise can mitigate these attacks [24], but some concerns remain.

2.3 FHE Programming Paradigm

FHE imposes a variety of restrictions on developing programs: some derive from the definition of FHE and its security guarantees, while others result from scheme restrictions and cost models. For example, FHE's security guarantees make it necessarily data-independent, hence preventing branching based on secret inputs. While some forms of branching can be emulated, all branches must be evaluated, resulting in a potentially significant degradation of performance. In addition, FHE schemes only offer a limited set of data types and operations, with addition and multiplication as basic operations. Applied over binary plaintext spaces (Z_2), this technically enables arbitrary computation. However, the best performance is usually achieved with larger plaintext spaces (e.g., Z_t for t ≫ 2). In this setting, computations are equivalent to arithmetic circuits, which can only compute polynomial functions. Non-polynomial functions can be approximated, but this is typically prohibitively inefficient. While recent works have explored homomorphic conversions between binary and arithmetic settings [25, 26] and introduced programmable bootstrapping to approximate non-polynomial functions [27], these approaches are not yet practical enough for widespread adoption.

As a result, developing FHE applications requires fundamentally rethinking how programs are written. Generally, developers need to rethink their approach, e.g., using branch-free algorithms well-suited to low-degree polynomial approximations. In addition, the large size of FHE ciphertexts, which is required for security reasons, is a significant source of both communication and computation overhead. However, it also presents an opportunity, as many¹ schemes support batching, which allows encrypting many values into the same ciphertext. This reduces ciphertext expansion and enables element-wise operations in a Single Instruction, Multiple Data (SIMD) fashion. Data movement in FHE is incredibly restricted, affording no efficient ways to permute the batched data after encryption, with the exception of cyclical rotations. As a result, efficient FHE algorithms are usually drastically different from their plaintext equivalents. Adapting to this unique programming paradigm requires a lot of experience and poses a significant barrier to entry for non-experts.

¹ Specifically, schemes from the Ring-LWE family, to which B/FV, BGV, and CKKS belong.



[Figure 2: Overview of our end-to-end design, showing the compilation flow from a high-level input program to an efficient FHE kernel running on a target backend. (Diagram: source code passes through program, circuit, cryptographic, and target optimization inside the HECO compiler, emitting FHE kernels for targets such as SEAL C++ on CPU, FPGA, or ASIC; the SEAL library executes them over secret inputs to produce results.)]

3 End-to-End FHE Compiler Design

This section first provides a system overview of HECO and then presents its core components, beginning with the overall framework design, followed by a discussion of our compiler architecture, and finally, an overview of the transformations and optimizations that constitute the compilation pipeline.

3.1 System Overview

HECO proposes a multi-staged approach to FHE compilation that encompasses: (i) Program Transformations, which restructure high-level programs to be efficiently expressible using native FHE operations, (ii) Circuit Optimizations, which primarily focus on changes that reduce noise growth in the FHE computation, (iii) Cryptographic Optimizations, which instantiate the underlying scheme as efficiently as possible for the given program, and (iv) Target Optimizations, which map the computation to the capabilities of the target. We propose a set of Intermediate Representations (IRs) designed to provide a suitable abstraction of each stage, allowing us to naturally and efficiently express optimizations at these different levels. In contrast, existing compilers usually abstract FHE computations as circuits², which does not allow them to fully express many optimizations at the high level (program transformation) or at the lowest level (target optimization) because these need to consider aspects such as data flow or memory management which have no natural correspondence in a circuit representation. In HECO, high-level programs are lowered through a series of transformations, using multiple increasingly lower-level IRs to produce the target kernel. These kernels can then be targeted and run against various back-end options. We provide a user-facing Python framework that abstracts away the complexities of this process, supports a Python-embedded Domain Specific Language for FHE, and provides a unified experience for development, compilation and execution. We provide an overview of our end-to-end design in Figure 2. In the remainder of this section, we describe HECO's components, abstractions, and compilation stages.

3.2 HECO Framework

HECO's framework ties together the front end, compiler, and the various back ends into a unified development experience. It allows developers to edit, compile and deploy their applications from a familiar Python environment. In order to provide an intuition of the developer experience in our system, we provide an example of using HECO to compile and run an FHE program in Listing 1 (Server) and Listing 2 (Client). By wrapping FHE functions in with blocks, we can operate on them as first-class entities, making compilation explicit. Our framework provides the necessary infrastructure to run programs directly from the front end, allowing developers to integrate FHE functionality into larger applications easily.

    def server(x_enc, y_enc, public_context):
        p = FrontendProgram()
        with CodeContext(p):
            def euclidean_sq(x: Tensor[8, Secret[int]],
                             y: Tensor[8, Secret[int]]) -> Secret[int]:
                sum: Secret[int] = 0
                for i in range(8):
                    d = x[i] - y[i]
                    sum = sum + (d * d)
                return sum

        # compile FHE code
        f = p.compile(context=public_context)
        # run FHE code using SEAL
        r_enc = f(x_enc, y_enc)
        return r_enc

Listing 1: Example server-side code using HECO.

    def client(x: Tensor[int], y: Tensor[int]):
        # Select SEAL backend, scheme and params
        context = SEAL.BFV.new(poly_mod_degree=2048)

        # encrypt input
        x_enc = context.encrypt(x)
        y_enc = context.encrypt(y)

        # send enc input to server
        r_enc = server(x_enc, y_enc, context.pub())
        result = context.decrypt(r_enc, context)

Listing 2: Corresponding client-side code, outsourcing the computation of the (squared) euclidean distance.

Python-Embedded DSL. HECO uses Python to host its Domain-Specific Language (DSL), inheriting Python's syntax and general semantics. We want to allow developers to

² This is natural since FHE computations are usually modeled mathematically as arithmetic circuits.


write programs in as natural a fashion as possible, and merely require type annotations to denote inputs that are Secret. In order to facilitate this, HECO supports (statically sized) loops, access to vector elements, and many other high-level features that do not have a direct correspondence in FHE. Since our compilation approach requires a high-level representation of the input program, including these non-native operations and the control-flow structure, we cannot follow the approach used by most existing tools. These tend to execute the program using placeholder objects that record operations performed on them, which is equivalent if considering FHE programs as circuits but removes most of the high-level information about the program structure. Instead, we use Python's extensive introspection features to parse the input program and translate the resulting Abstract Syntax Tree (AST) directly to our high-level IR.

[Figure 3: Overview of HECO's dialects, which define the operations used in the Intermediate Representations (IRs). (Diagram: the HIR combines heco::fhe, with types secret<T> and batchedsecret<T> and operations add, sub, mul, rot, insert, extract, with the built-in mlir::tensor, mlir::arithmetic, mlir::affine, and mlir::func dialects; the SIR comprises the scheme dialects heco::bfv, heco::bgv, and heco::ckks; the PIR and RIR comprise heco::poly (polynomial ring operations and number theoretic transform) and heco::rns (residue number system representation), lowering to FHE libraries such as SEAL, to mlir::llvm, or to a hardware ISA.)]
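The difference between tracing and parsing can be illustrated with Python's standard `ast` module. This is a minimal sketch, not HECO's actual frontend; the source string is a simplified, unannotated version of the euclidean_sq example from Listing 1.

```python
import ast

# A simplified, unannotated version of the euclidean_sq example.
SRC = """
def euclidean_sq(x, y):
    s = 0
    for i in range(8):
        d = x[i] - y[i]
        s = s + d * d
    return s
"""

# Parsing (rather than tracing with placeholder objects) keeps the
# loop and the index expressions as explicit structure in the AST,
# which a batching transformation can then analyze and rewrite.
tree = ast.parse(SRC)
loops = [n for n in ast.walk(tree) if isinstance(n, ast.For)]
subscripts = [n for n in ast.walk(tree) if isinstance(n, ast.Subscript)]
print(len(loops), len(subscripts))  # 1 2
```

A tracing frontend would instead observe eight unrolled subtract/multiply operations and lose the loop and the `x[i]`/`y[i]` access pattern entirely.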
3.3 Compiler Infrastructure

The core of HECO is an optimizing compiler that translates and optimizes programs by lowering them through a series of progressively lower-level Intermediate Representations (IRs). This section describes how we build upon the MLIR framework to realize HECO's compiler design.

Multi-Level Intermediate Representations. HECO's middle end exposes multiple levels of abstraction to facilitate our multi-stage compilation & optimization approach. This is realized through a series of Intermediate Representations (IRs), as seen in Figure 3. We leverage the MLIR framework [7], which was designed specifically to facilitate progressive lowering, introducing additional IRs to reduce the complexity of each lowering step. MLIR defines a common syntax for IR operations; for example, an addition might be represented as %2 = arith.addi(%0, %1) : (i16, i16) -> i16. MLIR is strongly typed; however, for conciseness, we will omit the details of type conversions when discussing transformations.

Intermediate Representations in MLIR are composed of sets of operations known as dialects. We define a custom dialect for our high-level abstraction of FHE (heco::fhe) and combine this with built-in dialects for vector operations (mlir::tensor), plaintext arithmetic (mlir::arithmetic) and basic program structure (mlir::affine, mlir::func) to realize our High-Level Intermediate Representation (HIR). In addition, we define dialects for each of the supported FHE schemes, mirroring their natively supported operations (heco::bfv, heco::bgv, heco::ckks). MLIR also includes a variety of standard simplification passes, which can be extended to custom dialects by defining appropriate interfaces.

Supporting Different Back-Ends. FHE is actively evolving, and as such, tools need to be able to adapt to new and improved implementations, both in software and hardware. This requires a high level of modularity and flexibility from the compiler. In HECO, we achieve this by using target-specific dialects, which can be customized and extended as new back ends are introduced. While traditional library-based implementations targeting CPUs and GPUs share a common API (conceptually, if not technically), upcoming FPGA and ASIC accelerators for FHE [4, 28–30] feature a much lower-level interface. These systems are designed to efficiently realize the required mathematical operations in the modular rings of polynomials that underlie most FHE schemes, and as a result, their Instruction Set Architectures (ISAs) operate on this level. In order to support this, HECO is designed to be easily extended to match this abstraction level, featuring MLIR dialects for both bignum polynomial ring operations (heco::poly) and for the commonly used Residue Number System (RNS) approach using the Chinese Remainder Theorem (CRT) to split these large datatypes into hardware-sized elements (heco::rns). In addition to targeting hardware accelerators, the ability to lower to this level also allows targeting x86 directly via LLVM IR and the LLVM toolchain.

3.4 Transformation & Optimization

The transformations and optimizations in HECO are grouped according to the four stages of compilation we identified, and we present them accordingly in the following.

Program Transformations. The first phase of compilation focuses on high-level transformations and optimizations. This includes a wide variety of general (e.g., constant folding, common sub-expression elimination) and FHE-specific optimizations that allow developers to write code more naturally by removing the need for menial hand-optimization. Most importantly, however, it focuses on optimizations that map the input program to FHE's unique programming paradigm, such as the automated batching optimizations, which we present in more detail in § 4. Previous work has shown that performance


differences between the runtimes of well-mapped and naively-mapped implementations can easily reach several orders of magnitude [31]. As a result, a significant part of our focus in HECO is on this level of abstraction, which existing FHE tools generally do not support.

Circuit Optimizations. After mapping to the FHE paradigm, the program is conceptually equivalent to an arithmetic circuit of native FHE operations. This is the level of abstraction considered by the vast majority of existing tools. Optimizations at this stage are mostly concerned with managing the noise growth in the computation. For example, a variety of optimizations that try to re-arrange the arithmetic operations to reduce the number of sequential multiplications have been proposed [32–35]. However, even state-of-the-art optimizations in this style are likely to accelerate a program by only around 2x in practice [31]. We omit a detailed description of these techniques here as they are not the focus of this paper. More importantly, this level of abstraction is also where we must consider ciphertext maintenance operations. These do not modify the encrypted messages, but significantly affect future noise growth, making them essential for practical FHE. HECO uses a traditional approach of inserting relinearization operations [15] between all consecutive multiplications. This is always correct but not necessarily strictly optimal, and more sophisticated strategies [36–38] could offer further improvements. The modularity of our design makes it straightforward future work to include these techniques, but since these have been explored in the past, we do not focus on them in this paper.

Cryptographic Optimizations. In the third phase, we consider cryptographic optimizations focused on instantiating the underlying FHE scheme as efficiently as possible. When targeting existing FHE libraries, the primary challenge is parameter selection: identifying the smallest (i.e., most efficient) parameters that still provide sufficient noise capacity to perform the computation correctly. Different techniques have been proposed to estimate the expected noise growth of an FHE program [39–41]. These include theoretical noise analysis, where recent work has achieved tighter bounds for some schemes [41], but which generally tends to significantly overestimate noise growth [42], leading to unnecessarily large parameter choices. As a result, experts primarily still rely on a trial-and-error process to experimentally determine the point at which noise invalidates the results. HECO includes basic automatic parameter selection based on a simple multiplicative-depth heuristic but also allows experts to easily override these suggestions. When targeting hardware directly, rather than through libraries, further optimization opportunities open up. For example, many ciphertext maintenance operations can be instantiated in different ways, offering trade-offs between runtime, memory consumption, and noise behavior. While libraries tend to implement a general-purpose compromise, compilers can adaptively choose the most appropriate approach for a given computation. However, this requires re-expressing the complex underlying logic of FHE schemes inside the compiler. HECO inherits a powerful system of abstractions and optimizations for computationally intense mathematics from the MLIR framework, allowing our system to be easily extended with such optimizations in the future.

Target Optimization. Finally, in the fourth phase, we consider target-specific optimizations. In addition to general code generation optimizations, there is a significant opportunity for FHE-specific optimizations at this level. For example, when available, FHE benefits greatly from instruction set extensions such as AVX512. This concept has already been explored in the context of libraries [43], and implementing similar techniques in HECO should be straightforward given our modular design. When targeting hardware, FHE accelerators impose non-trivial constraints on memory and register usage, due to complex memory hierarchies. Initial work in this space has already shown that code generation and scheduling can have a significant impact on accelerator performance [29, 30]. HECO supports optimizations at this level through our low-level dialects for the underlying math, and can easily be extended with target-specific dialects for the Instruction Set Architectures (ISAs) of upcoming accelerators, e.g., those developed by the DARPA DPRIVE program [4].

4 Automatic SIMD Batching

In this section, we introduce our automated SIMD batching optimization, which is part of our program transformation stage and maps traditional imperative programs to the restrictive SIMD-like setting of state-of-the-art FHE schemes.

4.1 SIMD Batching

Effective use of batching is arguably the single most important optimization for many applications and is omnipresent in most state-of-the-art FHE results. Due to the large capacity of FHE ciphertexts (usually 2^13 to 2^16 slots), applying batching has the potential to drastically reduce ciphertext expansion overhead and computation time. While batching can be used to trivially increase throughput, most FHE applications are constrained by latency. However, employing batching effectively to improve performance on a single input is non-trivial due to the restrictions imposed by FHE's unusual programming paradigm. Therefore, unlocking the performance potential of batching currently requires significant expertise and experience in writing FHE applications. In the following, we present a simple example that showcases the drastic transformations that can be required to achieve efficient batching, followed by a brief introduction of a folklore technique that demonstrates common patterns of FHE batching optimizations.

Example Application – Image Processing. We consider a simple image processing application (see Listing 3a), which nevertheless features a complex loop nest structure and non-trivial index patterns. Specifically, we consider a Laplacian

4720 32nd USENIX Security Symposium USENIX Association
def foo(img: Tensor[N, Secret[f64]]):
    img_out = img.copy()
    w = [[1, 1, 1], [1, -8, 1], [1, 1, 1]]
    for x in range(n):  # loop over pixels
        for y in range(n):
            t = 0
            for j in range(-1, 2):  # apply kernel
                for i in range(-1, 2):
                    t += w[i + 1][j + 1] * img[((x + i) * n + (y + j)) % N]
            img_out[(x * n + y) % N] = 2 * img[(x * n + y) % N] - t
    return img_out

(a) Textbook implementation of a simple sharpening filter

def foo_batched(img: BatchedSecret[f64]):
    r0 = img * -8
    r1 = img << -n - 1
    r2 = img << -n
    r3 = img << -n + 1
    r4 = img << -1
    r5 = img << 1
    r6 = img << n - 1
    r7 = img << n
    r8 = img << n + 1
    return 2 * img - (r0 + r1 + r2 + r3 + r4 + r5 + r6 + r7 + r8)

(b) Optimized batched solution of the same program

Listing 3: Both of these functions apply a simple sharpening filter to an encrypted image of size n × n = N, by convolving a 3 × 3 kernel (−8 in the center, 1 everywhere else) with the image. The first version (a) encrypts each pixel individually and follows the textbook version of the algorithm, operating over a vector of N ciphertexts. The second version (b) batches all pixels into a single ciphertext and uses rotations (<<) and SIMD operations to compute the kernel over the entire image at the same time. Designing batched implementations requires out-of-the-box thinking in addition to significant expertise and experience.
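The two versions in Listing 3 can be checked for equivalence on plaintext data. The sketch below is our own illustration (not HECO code): it emulates the SIMD rotation x << k slot-wise, assuming slot s of the result holds x[(s + k) mod N], which is consistent with the offsets used in Listing 3b:

```python
n, N = 8, 64  # small 8x8 image for illustration

def foo(img):
    # Textbook sharpening filter (Listing 3a), run on a plaintext list.
    img_out = img.copy()
    w = [[1, 1, 1], [1, -8, 1], [1, 1, 1]]
    for x in range(n):
        for y in range(n):
            t = 0
            for j in range(-1, 2):
                for i in range(-1, 2):
                    t += w[i + 1][j + 1] * img[((x + i) * n + (y + j)) % N]
            img_out[(x * n + y) % N] = 2 * img[(x * n + y) % N] - t
    return img_out

def rot(x, k):
    # Assumed rotation semantics: slot s of (x << k) receives x[(s + k) % N].
    return [x[(s + k) % N] for s in range(N)]

def foo_batched(img):
    # Batched version (Listing 3b), with ciphertext ops emulated slot-wise:
    # eight rotated copies (weight 1) plus the -8-weighted center.
    r = [rot(img, k) for k in (-n - 1, -n, -n + 1, -1, 1, n - 1, n, n + 1)]
    return [2 * img[s] - (-8 * img[s] + sum(v[s] for v in r)) for s in range(N)]

img = list(range(N))
assert foo(img) == foo_batched(img)
```

Under these assumed semantics both functions agree on every input, matching the claim that the batched version computes the same kernel with a constant number of rotations instead of N scalar loop iterations.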
sharpening filter, i.e., a convolution of a 3 × 3 kernel over an image, implemented with wrap-around padding. The function is compatible with efficient arithmetic-circuit-based FHE, as it does not use data-dependent branching and only requires homomorphic addition and multiplication operations. However, its current form makes use of nested loops accessing a complex set of indices, which is not very amenable to efficient batching, as there appears to be little opportunity for operations over entire ciphertexts.

Nevertheless, there exists a significantly more efficient batched design, as seen in Listing 3b. In the optimized version, the input image is batched into a single ciphertext, and all homomorphic operations make full use of their SIMD nature. Instead of iterating the kernel over the image, nine copies of the image are made and each is rotated so that all elements interacting at a specific kernel position align at the same index. This is possible because the relative offset between different pixels in the kernel remains static, even though the indices themselves are different for each iteration. The transformation enables the runtime of the program to depend on the (small) kernel size, rather than the image size. As a result, the batched version is more than an order of magnitude more efficient than a naive implementation. These types of drastic transformations are common in state-of-the-art FHE applications, and significant experience is required to develop an intuition for this unusual programming paradigm.

Rotate-and-Sum. In the example above, only interactions between values in different ciphertexts were required. However, it is also possible to efficiently realize certain operations on the elements of a single ciphertext; we now describe a common folklore technique used to achieve this: the rotate-and-sum algorithm allows us to efficiently sum up the elements of a ciphertext using O(log n) rotations (where n is the number of ciphertext slots). The algorithm proceeds by creating a copy of the current vector, rotating it, and then adding both before repeating the same procedure with a lower offset (cf. Algorithm 1). This is visualized in Figure 4 for a vector size of four. While this technique is applicable, the performance benefits are usually overshadowed by more radical transformations, such as the example shown above. However, using rotate-and-sum and similar rotation-based approaches can be worthwhile if it enables other parts of the program to remain slot-aligned.

Algorithm 1 Rotate-and-Sum

    SumVectorPowerTwo(x, n):
        for i ← n/2 downto 1, halving i each iteration:
            x ← x + Rotate(x, i)
        return x

Figure 4: Illustration of how repeated copying and rotating can be used to compute the sum of all elements in a ciphertext in a logarithmic, rather than linear, number of steps. (The figure shows the steps for a vector of size four: t0 = x << 2; t1 = x + t0; t2 = t1 << 1; s = t1 + t2.)

4.2 Automatic Batching Approach

Experts generally rely on their experience with the FHE programming paradigm to transform and optimize programs for batching, posing a high barrier to entry for non-expert developers. Instead, formal methods to automatically translate traditional imperative programs into efficient batched FHE solutions are required. We assume the input program computes
the elements from vectors of secret values in a non-SIMD fashion (e.g., Listing 3a). Of course, this can be naively realized by encrypting each vector element into one ciphertext, but this usually does not achieve acceptable performance due to the high overhead of FHE. The goal of automated batching is to amortize the cost of each FHE operation by utilizing as many ciphertext slots as possible for meaningful computation. In the following, we discuss two potential alternative approaches and their drawbacks before introducing our approach.

Strawman Approach. Batching each vector in the input program into a ciphertext will trivially achieve a 'batched' solution. However, this raises the question of how to execute the computations over individual elements present in the program. Element-wise accesses (extract and insert) are not native FHE operations and must instead be emulated, requiring several rotations and ciphertext-plaintext multiplications. For example, Listing 4 shows how z[i] = x[i] + y[j] can be emulated in the batched setting. However, this replaces each FHE operation from the naive, vector-of-ciphertexts approach with multiple expensive FHE operations. As a result, unless ciphertext expansion is significantly more important than runtime, this approach is virtually always ill-advised. Therefore, most existing FHE tools use the vector-of-ciphertexts approach rather than attempting to perform batching.

(a) Input Program:

    %1 = tensor.extract %x[i]
    %2 = tensor.extract %y[j]
    %3 = fhe.add(%1, %2)
    %4 = tensor.insert(%3, %z[i])

(b) Naive Batching:

    %1 = fhe.mul(%x, %m_i)
    %2 = fhe.rotate(%1, -i)
    %3 = fhe.mul(%y, %m_j)
    %4 = fhe.rotate(%3, -j)
    %5 = fhe.add(%2, %4)
    %6 = fhe.rotate(%5, i)
    %7 = fhe.mul(%z, %mn_i)
    %8 = fhe.add(%6, %7)

Listing 4: Strawman batching approach for z[i] = x[i] + y[j], showing the necessary rotations and multiplications with masking vectors: %m_i, %m_j are zero everywhere except at i or j, respectively; %mn_i is one everywhere except at i.

Alternative Approaches. There have been initial attempts at performing automated batching for FHE using synthesis-based approaches [44]. While these can, in theory, achieve the drastic transformations required to exploit batching, they are not suitable for practical use in real-world code development, as they do not scale beyond toy-sized program snippets, and even those can take minutes to optimize. Alternatively, one might consider applying traditional Superword-Level Parallelism (SLP) vectorization algorithms [45–47], as these try to group operations into SIMD instructions. However, these generally rely on the ability to efficiently scatter/gather elements into and out of vectors, which is only possible at a high cost in FHE. While some recent work can reason about the cost of data movement, it does not consider how data movement introduced at the beginning of the program might affect later parts of the program [47].

HECO's Approach. HECO's batching transformation starts with the core idea of the strawman approach, i.e., batching vectors of secrets into ciphertexts, but eliminates the overhead of emulating insert/extract operations for batching-amenable programs. Rather than directly emulating these operations, we instead translate the homomorphic operations in which they appear as operands. While this still requires inserting rotation operations, it allows operations with compatible index access patterns to be mapped to the same emulated code if using an appropriate algorithm. As a result, a simple built-in simplification pass can eliminate the duplicates. In the case of well-structured programs, this can completely eliminate emulation-related code. For example, for the program from the previous section (Listing 3a), HECO produces exactly the optimized code seen in Listing 3b, which does not contain any emulation-related code.

HECO Batching Pipeline. HECO's batching transformation is composed of a series of smaller passes, each interleaved with built-in simplification passes. Before the main batching passes, we first perform a series of preprocessing steps that unroll statically-sized loops, merge sequential associative binary operations into n-ary group operations, and perform a type conversion from vectors of secrets to BatchedSecrets, which are HECO's high-level abstraction of ciphertexts. Following this, the main pass walks through the program and transforms each operation over secret vector elements into operations over entire vectors. Similar to the strawman approach, this involves introducing rotations, but our approach does not require multiplications with masks. Additionally, the way we perform these translations allows us to 'chain' them so that consecutive operations on the same vector elements do not result in separate emulation code. After the main pass and the associated simplification pass, we apply the rotate-and-sum technique where applicable, which is enabled by both the merging of operations during the preprocessing phase and the exposure of same-ciphertext operations by the main pass. In the following, we first outline some key preprocessing steps before explaining the two main optimization steps.

4.3 Preprocessing

In addition to standard simplifications and canonicalization of the IR (i.e., bringing operations into a standardized 'canonical' form to reduce the complexity of the IR), we also apply two more specialized transformations.

Merging Arithmetic Operations. During the preprocessing stage, we combine chained applications of (associative and
commutative) binary operations into larger arithmetic operations with multiple operands (e.g., merging x=a+b; y=x+c to y=a+b+c). This removes chains of dependencies and replaces them with a single operation, making it easier to identify rotate-and-sum optimization opportunities. In addition, it can also allow more efficient direct lowerings, such as when performing the product over n elements, which can be lowered efficiently to a multiplication tree with depth log n.

Type Conversion. While tensors can be arbitrarily (re)shaped, all FHE ciphertexts used in a homomorphic computation must have compatible parameters, which implies a fixed number of slots. Therefore, we perform type conversion, converting all vector operations over secret values to operations over the BatchedSecret type, an abstract representation of ciphertexts. We convert multi-dimensional tensors to vectors using column-major encoding and scale up any secret vector operands to the size of the largest (secret) vector present, padding the plaintext as necessary. This does not impact the result of the computation, as the existing code will never access these additional elements.

4.4 Automatic SIMD-fication

This pass replaces scalar operations over vector elements (e.g., x[i] + y[j]) with SIMD operations, applying the same operation to each element of the ciphertext. At its core, the pass is a linear walk over all (arithmetic) homomorphic operations in the program, as seen in Algorithm 2. For each operation, we (i) identify in which ciphertext slot the result should be computed, (ii) transform the operands so that they are suitable for such a SIMD operation, and (iii) insert extract operations in situations where a scalar is expected (including uses in later operations that have yet to be transformed). Note that, by itself, this transformation does not actually remove any code. However, it will expose common patterns (e.g., such as those occurring in loops) and cause operations over compatible indices to be translated to the same SIMD operations. This allows the following clean-up pass to remove these now-redundant operations, which frequently include duplicate arithmetic operations and many or all of the operations inserted to ensure consistency.

Algorithm 2 Batching Pass

    Algorithm BatchPass(G):
        (V, E) ← G
        foreach op ∈ V with type(op) = fhe.secret:
            ts ← SelectTargetSlot(op, V, E)
            OperandConversion(op, ts, V, E)
            foreach v ∈ V with (op, v) ∈ E:
                u ← fhe.extract[v, ts]
                Replace(v, u, V, E)

    procedure SelectTargetSlot(op, V, E):
        foreach v ∈ V with (op, v) ∈ E:
            switch v:
                case fhe.insert[_, i]: return i
                case func.return: return 0
        foreach v ∈ V with (v, op) ∈ E:
            switch v:
                case fhe.extract[_, i]: return i
        return ⊥

    procedure OperandConversion(op, ts, V, E):
        foreach v ∈ V with (v, op) ∈ E and type(v) = fhe.secret:
            switch v:
                case fhe.extract(x, i):
                    u ← fhe.rotate(x, i − ts)
                    Replace(v, u, V, E)
                case fhe.ptxt[p]:
                    p′ ← Repeat(p)
                    u ← fhe.ptxt(p′)
                    Replace(v, u, V, E)

    procedure Replace(v, u, V, E):
        V ← (V \ {v}) ∪ {u}
        foreach w ∈ V with (v, w) ∈ E:
            E ← (E \ {(v, w)}) ∪ {(u, w)}
        foreach w ∈ V with (w, v) ∈ E:
            E ← (E \ {(w, v)}) ∪ {(w, u)}

Target Slot Selection. When translating an operation with operands corresponding to different vector positions (e.g., x[i]+y[j]), we must bring the elements of interest into alignment by issuing a rotation for at least some of the operands. However, there are usually multiple valid solutions (e.g., rot(x, j-i) vs. rot(y, i-j)), especially for operations with multiple operands, which occur frequently as a result of our preprocessing stage. An obvious approach would be to rotate each operand (e.g., x[i]) so that the element of interest is moved to slot 0. Then, performing the SIMD operation over the rotated operands produces the result in the same slot. While this approach is straightforward and correct, it is unsuitable for an optimizing compiler, as it does not set the program up for further simplification. For example, in the case of a loop (for i in 0..10: z[i]=x[i]+y[i]), each iteration would result in a unique operation with distinct operands, each rotated by different amounts.

Instead, HECO introduces the notion of a target slot, determined by the further uses of the result. For example, if the result of a computation is assigned to z[k], we select k as the target slot, eliminating the rotation required afterward. If no clear target slot can be derived from the uses of the result, we use one of the operand indices as the target slot, removing the need to rotate that operand. Selecting the target slot this way reduces the immediate number of rotation operations created. More importantly, it reliably maps operations with the same relative index access patterns to the same set of ro-
tations and SIMD operations, allowing them to be eliminated by the following simplification pass. Since this approach is based purely on index access patterns, it works equally well for complex loop nests and heavily interleaved code.

Operand Conversion. In order to convert a scalar operation to a fully-batched SIMD operation, all non-batched inputs must be converted, as described in Algorithm 2 (OperandConversion). Note that, due to the linear-walk nature of the pass, all previous FHE operations have already been converted to fully-batched operations. As a result, any operand of type fhe.secret must be the result of an fhe.extract operation. This invariant allows us to 'chain' batched operations together by replacing the extraction-based operand with a rotation-based operand. Specifically, an operand extracted from slot i of vector x is replaced with a rotation of x by i − ts, where ts is the target slot determined before.

Ensuring Consistency. We maintain the consistency and correctness of the program at each step of the optimization. Towards this, we first construct the new rotations and batched operations as additions to the program. We only replace occurrences of the old operation with the optimized version after we replace uses of the old operation with an extract operation that extracts the target slot of the new batched result. This ensures that, even if no further batching opportunities are found, the program remains correct. Since batched FHE schemes do not support true scalar values, we simply interpret scalars as ciphertexts where only slot 0 contains valid data. With this convention, any remaining extractions will eventually be converted to a rotation by −ts. However, in practice, this is rare as most of the extract operations we insert will in turn be converted to rotations when the next homomorphic operation is processed. As a result, these consistency-related extract operations are frequently eliminated completely at the end of the batching pass.

4.5 Rotate-and-Sum Pass

After the main pass and the associated simplification pass, we apply the rotate-and-sum technique where applicable. Since this optimization requires a holistic view of the operation, this would be significantly harder if we did not merge sequential operations during preprocessing. While we used a sum over all elements when explaining the technique in the previous section, the technique can be generalized to any subset with a consistent stride. Additionally, it can also be used to compute products rather than sums. When applied to multiplication, it additionally has the benefit of automatically reducing the multiplicative depth of the expression as a side-effect.

Note that the pre-processing combination of binary operations into larger operations must happen before the main pass described above, as that pass would otherwise insert rotations between the different operations, making them no longer directly chained. The actual translation to a series of rotations and native binary operations, meanwhile, has to be performed after that pass, since it requires the operands to be entire ciphertexts, rather than scalars. Additionally, the de-duplication simplifications that take place after the batching transformation can widen the applicability of this transformation by reducing the number of distinct ciphertexts appearing in the program.

5 Implementation

We build HECO on top of the open-source MLIR framework [7], which is rapidly establishing itself as the go-to tool for domain-specific compilers and opening up the possibility of exchanging ideas and optimizations even beyond the FHE community. HECO consists of roughly 15k LOC of C++, with around 2k LOC of Python for the Python front-end. HECO uses the Microsoft Simple Encrypted Arithmetic Library (SEAL) as its FHE backend. SEAL, first released in 2015, is an open-source FHE library implemented in C++ that is thread-safe and heavily multi-threaded itself. SEAL implements the BFV, BGV, and CKKS schemes.

In contrast to existing monolithic compilers, HECO is highly modular and designed to be flexible and extensible. We decouple optimizations from front-end logic, allowing for a wide variety of domain-specific front-ends and the ability to easily replace back-ends to target different FHE libraries or hardware accelerators as they become available. The toolchain can easily be adapted to different needs, with certain optimizations enabled or disabled as required.

6 Evaluation

HECO is designed to compile high-level programs, written by non-experts in the standard imperative paradigm, into highly efficient batched FHE implementations that achieve the same performance as hand-crafted implementations by experts. HECO achieves its usability goals through a well-integrated Python front-end and by requiring developers to alter their code only minimally (annotating variables as secret). However, ease of use becomes moot when the performance of the generated code is not competitive. Therefore, we focus our evaluation on the performance of HECO and the code it generates, trying to answer whether or not automatic optimizations can bring naive code to the same performance level as expert implementations. In this section, we first show the effect of the batching optimizations on benchmark workloads designed to demonstrate different batching patterns, then compare against synthesized optimal batching patterns, and finally discuss a real-world application example.
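Before turning to the measurements, the rotate-and-sum pattern of Algorithm 1, which underpins the § 4.5 pass and several of the batched benchmarks below, can be checked on plaintext data. This is our own sketch, with the ciphertext rotation again emulated slot-wise:

```python
def rot(x, k):
    # Emulated SIMD rotation: slot s receives x[(s + k) % len(x)].
    return [x[(s + k) % len(x)] for s in range(len(x))]

def sum_vector_power_two(x):
    # Algorithm 1: log2(n) rotate-and-add steps; afterwards every slot
    # of the (power-of-two sized) vector holds the sum of all elements.
    i = len(x) // 2
    while i >= 1:
        x = [a + b for a, b in zip(x, rot(x, i))]
        i //= 2
    return x

assert sum_vector_power_two([3, 1, 4, 1, 5, 9, 2, 6]) == [31] * 8
```

For a vector of eight slots this performs only three rotations instead of seven additions along a dependency chain, matching the logarithmic cost claimed in § 4.1.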
Figure 5: Log-log plots of the runtime and memory consumption of the Roberts Cross and Hamming Distance benchmarks for different problem sizes, comparing a naive non-batched solution with the batched solution generated by our system. (a) Roberts Cross benchmark runtime (in seconds), for image sizes 2x2 to 64x64; (b) Hamming Distance benchmark runtime (in seconds), for vector sizes 4 to 4096; (c) Roberts Cross benchmark memory usage (in MB); (d) Hamming Distance benchmark memory usage (in MB).
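The Hamming Distance benchmark described next relies on expressing bitwise (non-)equality with arithmetic only; the identity can be verified exhaustively (a small sketch of our own):

```python
# NEQ(a, b) = XOR(a, b) = (a - b)^2 for bits, using only the
# FHE-friendly operations addition/subtraction and multiplication.
for a in (0, 1):
    for b in (0, 1):
        assert (a - b) ** 2 == (a ^ b) == int(a != b)

# Slot-wise, the Hamming distance is then simply the sum of (a - b)^2.
u = [0, 1, 1, 0]
v = [1, 1, 0, 0]
assert sum((x - y) ** 2 for x, y in zip(u, v)) == 2
```

In the batched setting, the final summation is what the rotate-and-sum pass replaces with a logarithmic number of rotations.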
6.1 Benchmarks

We evaluate HECO in terms of the speedup and reduction in memory overhead gained over non-optimized implementations, and the compile time required.

Applications. We demonstrate the speedup achieved by our batching optimization on two applications that are representative of common batching opportunities. The Roberts Cross operator is an edge-detection feature used in image processing. It approximates the gradient of an image as the square root of the sum-of-squares of two different convolutions of the image, which compute the differences between diagonally adjacent pixels. As in all other kernel-based benchmarks, wrap-around padding is used, which aligns well with the cyclical rotation paradigm of FHE. In order to enable a practical FHE evaluation, the final square root is omitted, since it would be prohibitively expensive to evaluate under encryption. The Hamming distance, meanwhile, computes the edit distance between two vectors, i.e., the number of positions at which they disagree. Here, we consider two binary vectors of the same length, a setting in which computing (non-)equality can be done efficiently using the arithmetic operations available in FHE. Specifically, this makes use of the fact that NEQ(a, b) = XOR(a, b) = (a − b)^2 for a, b ∈ {0, 1}.

Baseline. Our baseline is a naive implementation of the application without taking advantage of batching, as one might expect FHE novices to implement. In this setting, vectors of secrets are directly translated to vectors of ciphertexts. While this approach introduces significant ciphertext expansion and increases the memory required, it is actually preferable in terms of run time over a solution that batches vector data into ciphertexts but does not re-structure the program to be batching-friendly. This is because such a solution adds the overhead of rotations, masking, etc., to the base runtime of the non-batched solution.

Environment. All benchmarks are executed on AWS m5n.xlarge instances, which provide 4 cores and 16 GB of RAM. We used Microsoft SEAL [6] as the underlying FHE library, targeting its BFV [14, 15] scheme implementation. All experiments are run using the same parameters, which ensure at least 128-bit security. We report the run time of the computation itself, omitting client-side aspects such as key generation or encryption/decryption. All results are the average of 10 iterations, discarding top and bottom outliers.

Runtime & Memory Overhead. In Figure 5a we show the runtime of the Roberts Cross benchmark for varying instance sizes, comparing the non-batched baseline with the batched solution generated by HECO. While the run time of the naive version increases linearly with the image size, the batched solution maintains the same performance (until parameters must be increased to accommodate even larger images). Instead, the run time is more closely tied to the size of the kernel than to that of the image. This highlights the dramatic transformations achieved by HECO, fundamentally changing the structure of the program. As a result of these transformations, HECO achieves a speedup of 3454x over the non-batched baseline for 64x64 pixel images, demonstrating the extraordinary impact that effective use of batching can have on FHE applications: while the non-batched solution is borderline impractical at over two minutes, the batched solution takes only
a fraction of a second (0.04 s). The runtime of the generated code in this case does depend directly on the vector length. However, due to the fold-style optimization (cf. § 4.5), this dependence is only logarithmic.

In Figure 5b we can see that, for realistic problem sizes, the performance advantage of batching becomes significant, resulting in a speedup of 934x for 4096-element vectors. We also see that, while the runtime of the batched solution does increase with the vector length, this is nearly imperceptible when compared to the non-batched baseline. Finally, in Figures 5c/5d, we show the memory overhead of the non-batched baseline and the batched solution. While FHE introduces a non-negligible baseline overhead due to the large amount of key- and other context-data that must be maintained, the reduced number of ciphertexts in the batched solution has a clear impact on memory usage, increasingly so as the problem sizes increase.

Compile Time. HECO achieves these fundamental transformations efficiently, with compile times that are amenable to interactive development (see Table 1). This is in contrast to synthesis-based tools, which require more than 10 minutes to synthesize a batched solution for the Roberts Cross benchmark even for toy-sized instances and do not scale to the sizes we consider here at all [44].

    n     4     16    64    256   1024  4096
    rc    0.06  0.07  0.08  0.12  0.30  1.28
    hd    0.03  0.04  0.06  0.09  0.25  1.85

Table 1: Compile time (in seconds) of the Roberts Cross (rc) and Hamming Distance (hd) benchmarks for different problem sizes (n).

6.2 Comparison with Synthesized Solutions

We compare the baseline and HECO to synthesized optimal batching patterns. Synthesis-based approaches explore the space of all possible programs, constrained by a reference specification describing input-output behavior. While this tends to be computationally expensive, it has the potential to find optimal solutions featuring highly non-intuitive optimizations. For our evaluation, we use a set of nine benchmark applications that represent a variety of common code patterns relevant to batching. These programs range from having no inter-element dependencies (e.g., Linear Polynomial), to simple accumulator patterns (e.g., Hamming Distance), and complex dependencies across a multitude of different vector elements (e.g., Roberts Cross). As a result, they provide a useful benchmark, especially for batching optimizations. These benchmarks were first proposed in [44], which introduces Porcupine, a synthesis-based compiler for batched FHE. In addition to a reference specification, it requires a developer-provided sketch of an initial possible batched approach. Our 'synthesis' solutions are based on pseudo-code made available in an extended version of the paper [44].

Synthesis-based tools can require significant search time to find solutions, limiting them to toy-sized workloads. For example, Porcupine requires over 10 minutes to synthesize a program with ten instructions and will fail to synthesize a solution at all for sufficiently complex programs. As a result, we consider the following problem sizes here: the synthesized dot-product code targets 8-element vectors, while those for Hamming Distance and L2 Distance were provided for 4-element vectors. For the other applications, we use vectors of length 4096, representing 64x64 pixel images.

As we can see from Figure 6, HECO dramatically improves performance over the non-batched baseline approach. For example, for the Roberts Cross benchmark, our batched solution is over 3500 times faster than the non-batched solution, taking less than 0.04 seconds instead of over 2.35 minutes. More importantly, our results are nearly equivalent to the optimally batched solutions synthesized by Porcupine, especially when considering the stark contrast between the non-batched and the two batched solutions. In some cases (e.g., Box Blur), Porcupine has an advantage because it finds non-intuitive solutions beyond traditional batching patterns. Interestingly, for some applications (e.g., Hamming Distance) HECO actually outperforms Porcupine. This is because Porcupine provides an optimally batched solution but does not necessarily handle ciphertext management optimally, inserting unnecessary relinearization operations.

6.3 Real-World Application

The previous benchmarks demonstrated HECO's effectiveness and performance for different and common batching patterns. We now evaluate an application that more closely resembles the complexity that real-world settings exhibit. Specifically, we consider an application computing private statistics over two databases that might contain duplicate entries. We use this to demonstrate that HECO can produce efficient FHE code for non-trivial programs, while also highlighting that there is further room for optimizations exploiting application semantics.

Application. Privacy regulations frequently prohibit entities from combining sensitive datasets directly. Instead, they could employ threshold FHE, which extends FHE with multi-party key generation, to securely compute on the (encrypted) joint dataset. In this setting, neither party has sole access to the secret key, and they must collaborate to decrypt the results of approved queries. However, in practice their datasets might overlap (e.g., agencies at different levels of government collecting similar data), introducing duplicate items into the joint dataset. Due to the duplicates, analytics (e.g., counting
4726 32nd USENIX Security Symposium USENIX Association


[Figure: bar chart; y-axis Time [s] on a log scale from 10⁻² to 10²; one group of bars (Naive, HECO, Porcupine) per benchmark: BoxBlur, DotProduct, GxKernel, HammingDistance, L2Distance, LinearPolynomial, QuadraticPolynomial, RobertsCross.]

Figure 6: Runtime of example applications (in seconds), comparing a naive non-batched baseline, the solution generated by our system (HECO), and an optimally-batched solution synthesized by the Porcupine tool.

queries) will return incorrect results. Therefore, we must first de-duplicate the encrypted databases before executing the analytics. Since threshold FHE does not affect the server-side execution of the computation, HECO can be used directly to develop such an application. The application first computes the Private Set Union (PSU) of two databases A, B indexed by unique IDs consistent across both (e.g., a national identifier like the SSN). For simplicity, we consider databases with one data column and a simple SUM aggregation. However, the presented approach trivially extends to larger databases and more complex statistics. Listing 5 shows the application expressed using the HECO Python frontend, for a database size of 128 elements and 8-bit identifiers. We split the identifiers into individual bits, allowing us to compute the equality function even while working with arithmetic circuits. The program begins by aggregating A’s data and then proceeds to check each element of B’s database for potential duplicates in A. To compute the equality function, we compute ∧_k ¬(a_k ⊕ b_k), using the fact that, for bits a, b ∈ {0, 1}, the logical operations have arithmetic equivalents: a ⊕ b = (a − b)², a ∧ b = a · b, and ¬a = 1 − a. If a duplicate is found, the corresponding element of B is multiplied with unique == 0, i.e., it does not contribute to the overall statistics.

    def encryptedPSU(a_id:   Tensor[128, 8, Secret[int]],
                     a_data: Tensor[128, Secret[int]],
                     b_id:   Tensor[128, 8, Secret[int]],
                     b_data: Tensor[128, Secret[int]]) -> Secret[int]:
        sum: Secret[int] = 0
        for i in range(0, 128):
            sum = sum + a_data[i]
        for i in range(0, 128):
            unique: Secret[int] = 1
            for j in range(0, 128):
                # compute b_id[i] != a_id[j]
                eq: Secret[int] = 1
                for k in range(0, 8):
                    # a xor b == (a - b)^2
                    x = (b_id[i][k] - a_id[j][k]) ** 2
                    nx = 1 - x    # not x
                    eq = eq * nx  # eq and nx
                neq = 1 - eq      # not eq
                unique = unique * neq
            sum = sum + unique * b_data[i]
        return sum

Listing 5: Computing statistics over duplicated data.
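On plaintext values, the de-duplication logic of Listing 5 can be checked directly. The sketch below is our own illustration rather than HECO code (the helper `psu_sum` and its integer-ID inputs are hypothetical); it mirrors the listing's arithmetic encodings of xor as (a − b)², and as multiplication, and not as 1 − x on the identifier bits.

```python
def psu_sum(a_id, a_data, b_id, b_data, bits=8):
    # Plaintext mirror of Listing 5: aggregate A's data, then add each
    # element of B whose identifier does not occur anywhere in A.
    total = sum(a_data)
    for i in range(len(b_id)):
        unique = 1
        for j in range(len(a_id)):
            eq = 1
            for k in range(bits):
                a = (a_id[j] >> k) & 1
                b = (b_id[i] >> k) & 1
                x = (a - b) ** 2        # a xor b
                eq = eq * (1 - x)       # eq and (not x)
            unique = unique * (1 - eq)  # unique and (not eq)
        total = total + unique * b_data[i]
    return total

# A = {1: 10, 2: 20}, B = {2: 5, 3: 7}; ID 2 is a duplicate,
# so the PSU sum is 10 + 20 + 7 = 37.
print(psu_sum([1, 2], [10, 20], [2, 3], [5, 7]))  # 37
```

Replacing the `==`/`in` operators one would normally use with these products of bit differences is exactly what makes the function expressible as an arithmetic circuit.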
Performance & Discussion. We evaluated both a naive baseline and the HECO-optimized batched implementation using the same setup described earlier in this section. The naive approach has a run time of several minutes (11.3 min) and requires sending over 2000 ciphertexts between the server and the clients. The batched solution produced by our system not only requires significantly less data to be transmitted (only 4 ciphertexts), but also runs an order of magnitude faster (57.6 s), confirming the trend we observed when evaluating on smaller benchmarks. While the results achieved by HECO are more than practical already, a state-of-the-art hand-written implementation designed for this task can improve this even further, requiring only 1.4 s. However, arriving at this solution requires a significant rethinking of the program and an application-specific batching pattern, which HECO intentionally does not consider to avoid a search-space explosion. Note that synthesis-based tools such as Porcupine also cannot capture these kinds of transformations.

Specifically, instead of batching the identifiers for each database into a single ciphertext, the expert solution creates one ciphertext per bit, using significantly oversized ciphertexts with 128² = 2¹⁴ slots. This enables the expert solution to batch every possible permutation of the identifier set into one ciphertext. By applying this to the encryption of set B, while simply encrypting 128 non-permuted repetitions of set A, the expensive O(n²) duplication check can be performed in parallel on all elements at the same time. Computing the unique flag then uses a rare application of the rotate-and-multiply pattern. As a trade-off, the equality computation is no longer batched, but since the number of bits in the identifiers is, by necessity, at most logarithmic in the number of database elements, this is a profitable trade-off.

In general, exploiting application-specific packing patterns can unlock additional performance gains and is frequently combined with client-side processing in expert-designed FHE systems. However, the decision on which client-side processing (e.g., removing outliers, computing permutations, etc.) is
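The effect of this permutation packing can be simulated on plaintext vectors. The sketch below is our own illustration, not the expert implementation itself (the helper names are ours, and we use cyclic rotations, which already suffice to bring every pair of slots together): it shows how the n² duplicate checks collapse into n slot-wise vector operations, with the rotated not-equal flags multiplied into each element's unique flag.

```python
def rotate(v, r):
    # Cyclic rotation, standing in for an FHE slot rotation.
    return v[r:] + v[:r]

def psu_unique_flags(a_ids, b_ids):
    # For each element of B, compute a 0/1 flag indicating whether it is
    # absent from A, using only slot-wise (SIMD-style) vector operations.
    n = len(a_ids)
    unique = [1] * n  # one flag per element of B
    for r in range(n):
        b_rot = rotate(b_ids, r)
        # Slot-wise equality a_ids[i] == b_rot[i], as 0/1 values; in the
        # encrypted expert solution this comparison is done bit-wise.
        eq = [1 if a == b else 0 for a, b in zip(a_ids, b_rot)]
        # After rotating by r, slot i holds B element (i + r) mod n, so
        # multiply 'not eq' into that element's unique flag.
        for i in range(n):
            unique[(i + r) % n] *= 1 - eq[i]
    return unique

# B elements 2 and 8 also appear in A, so their flags are 0.
print(psu_unique_flags([1, 2, 4, 8], [2, 3, 8, 9]))  # [0, 1, 0, 1]
```

The inner loop over slots is written out explicitly here for clarity; in the batched setting it corresponds to a single element-wise ciphertext multiplication per rotation.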



sensible is not a well-defined problem that automated solutions can tackle. At the same time, the performance improvements demonstrated by HECO provide a first jump from the regime of prohibitive overheads to one of practical solutions. While further optimizations are often possible, they offer quickly diminishing returns. For example, scaling this application to real-world sizes (which, incidentally, is more complex with the expert approach) means that a naive solution might take days to compute, while HECO’s solution would complete in a few hours, which is a reasonable runtime for these kinds of secure statistics.

7 Related Work

In this section, we briefly discuss related work in the domain of FHE compilation (§ 7.1), followed by a discussion of differences to existing Multi-Party Computation (MPC) and Zero-Knowledge Proof (ZKP) compilers (§ 7.2).

7.1 FHE Compilers

The complexity of implementing FHE operations efficiently led to the development of dedicated libraries [6, 48] early on. Today, a large number of libraries provide efficient implementations of state-of-the-art schemes. These libraries mostly provide comparatively low-level APIs that allow developers to extract the best possible performance but require significant expertise to utilize effectively. As a result, a first wave of FHE tools and compilers emerged that tried to improve the usability of FHE [31, 37, 49–51]. These mostly target a circuit-level abstraction and are focused on circuit optimizations [31].

For example, Microsoft’s EVA [37] offers a user-friendly high-level interface and automatically inserts ciphertext maintenance operations into the circuit. EVA uses a custom circuit-based IR and requires developers to manually map their program to the FHE programming paradigm. In order to ease this process, recent versions [36] include a library of expert-implemented batched kernels for frequently used patterns, e.g., summing all elements in a vector. However, this still requires developers to manually transform an application to the batched paradigm.

A series of tools including Cingulata [33], E3 [51], SyFER-MLIR [52], and Google’s Transpiler [53] attempt to translate arbitrary programs without the usual restrictions of FHE. They achieve this by translating their input programs into binary circuits, encrypting each input bit individually. However, programs translated in this way are virtually always too inefficient to be of practical use because they do not support the type of high-level transformations that HECO employs to achieve practical efficiency.

Domain-specific compilers [54–56], e.g., targeting encrypted Machine Learning applications, rely on a large set of hand-written expert-optimized kernels for common functionality (mostly linear algebra operations). Since these tools rely on pre-determined mappings rather than automatically identifying optimization opportunities, they do not transfer to other domains, such as the general-purpose setting HECO targets. Besides that, their lack of flexibility prevents their use when developers’ needs are even slightly misaligned.

The Porcupine compiler [44] is closest to our work in that it also considers translating imperative programs to FHE’s batching paradigm. However, their tool has a significantly different focus, using a heavy-weight synthesis approach that tries to identify optimal solutions that can outperform even state-of-the-art approaches used by experts. Since it explores a large state space in the search for an optimal solution, compile times tend to be long (up to many minutes) and programs can contain at most a handful of statements before the approach becomes infeasible. Additionally, Porcupine requires that developers provide a sketch of the structure of the batched program, making it less suitable for non-expert users.

Finally, we want to highlight that the MLIR framework is rapidly establishing itself as the gold standard for FHE tooling. Early attempts such as SyFER-MLIR [52] relied primarily on built-in optimizations, adding only a few binary-circuit-based FHE-specific rewrite rules. No evaluation is provided for SyFER-MLIR, but prior work studying similarly simple rewrite-rule-based tools [31] leads us to predict that it would produce only relatively minor speedups. More recently, however, a variety of concurrent work has successfully realized different aspects of the FHE ecosystem using MLIR. For example, Zama’s Concrete-ML [57] internally uses an MLIR-based compiler to translate Machine Learning tasks expressed as Numpy programs to the TFHE scheme. HECATE [58], meanwhile, improves upon the rescale-allocation optimizations presented in the EVA compiler [37]. However, where the latter uses a custom Python implementation that does not provide for interoperability with other tools, HECATE builds upon common MLIR abstractions. Thanks to the modular nature of MLIR, these tools could easily be integrated into HECO’s end-to-end toolchain.

7.2 MPC & ZKP Compilers

Compilers for both Multi-Party Computation (MPC) and Zero-Knowledge Proof (ZKP) systems face similar challenges to FHE compilation, such as the need for data-independent computations and a general tendency towards trying to achieve small and low-depth circuits. However, in practice we find these similarities are too superficial to allow techniques from one domain to be lifted to another. Due to the heavy reliance on selectively revealing (potentially blinded) data during an MPC computation – a feature that has no direct correspondence in FHE – many of the optimization approaches are unlikely to transfer. State-of-the-art MPC compilers [59, 60] also frequently make heavy use of hybrid approaches, i.e., switching between different MPC settings. While there has been significant work on scheme switching for FHE [25, 26],


practical applications remain rare and few libraries currently support these techniques. As these approaches start to mature, investigating to what extent scheme-switching optimizations from the domain of MPC can transfer to FHE will present an interesting avenue for future work.

Zero-Knowledge Proof compilers also face the challenge of mapping complex operations to arithmetic circuits with limited expressiveness. However, their setting fundamentally differs from that of FHE, as the prover generally has access to all data in the computation in the clear. This allows ZKP computations to heavily rely on witness-based computation, letting compilers shift virtually all non-arithmetic operations outside the core ZKP computation. Additionally, ZKP compilers mostly use intermediate representations based on Rank-1 Constraint Systems (R1CS) or other constraint specification systems that are not suitable for expressing FHE computations. Finally, we note that compiler frameworks trying to accommodate MPC, ZKP, and potentially also FHE are emerging [61]. However, their reliance on a circuit-like IR makes them unsuitable for the high-level transformations we use in our work.

8 Discussion

As FHE has emerged into practicality, it has drawn the interest of a significantly wider audience, bringing new perspectives, requirements, and backgrounds to the area. While traditionally FHE applications were mostly developed by the same experts that designed, optimized, and implemented the underlying cryptographic schemes, this will soon no longer be true beyond the world of cutting-edge academic research. In recent years, a variety of tools has emerged in an attempt to address the needs of future non-expert developers. While some domain-specific tools have proven to be very effective [31, 56, 62], most general-purpose tools have fallen short of delivering on the promise of usable FHE. While they simplify the development process, they generally produce naive implementations which provide little real-world benefit due to their significant overhead compared to more optimal implementations. However, as performance is key for practical deployment, we believe that usability without sufficient performance is mostly meaningless.

HECO aims to bridge the gap between usability and performance for general-purpose workloads. It offers non-expert developers the ability to express applications in a familiar high-level paradigm without paying the extreme performance penalty this would usually incur. Beyond optimizing this high-level transformation, HECO proposes a new end-to-end architecture for FHE compilers based on the distinct stages of FHE optimization we identify. HECO’s modular architecture is designed to allow it to interoperate with other toolchains and easily integrate future optimization techniques. While HECO represents an important step in FHE usability, many challenges remain to be addressed. For example, HECO considers RLWE-based schemes offering SIMD operations, but recent developments in LWE-based fast-bootstrapping schemes make them an attractive alternative. Today, these worlds remain mostly separate and use significantly different paradigms. Future work needs to consider how to unify these, especially in the context of scheme-switching, i.e., the ability to move between schemes inside a single application. Finally, upcoming dedicated FHE hardware accelerators promise to deliver significant performance improvements but require sophisticated scheduling to unlock their potential. While existing work on accelerators already incorporates automated scheduling, there are likely significant further optimization opportunities in considering compilation for these systems from an end-to-end perspective. More generally, we believe that there is significant potential for interdisciplinary research that combines techniques from compiler and programming language research with insights from cryptography.

Acknowledgments

We thank Gyorgy Rethy and Nicolas Küchler for their invaluable help in evaluating HECO. We also thank our anonymous reviewers, Tobias Gross, Michael Steiner, and the PPS Lab team for their insightful input and feedback. We would also like to acknowledge our sponsors for their generous support, including Meta, Google, SNSF through an Ambizione Grant No. 186050, and the Semiconductor Research Corporation.

References

[1] S. Kannepalli, K. Laine, and R. C. Moreno, “Password monitor: Safeguarding passwords in Microsoft Edge.” https://www.microsoft.com/en-us/research/blog/password-monitor-safeguarding-passwords-in-microsoft-edge/, 21 Jan. 2021. Accessed: 2021-7-5.

[2] M. Kim, A. O. Harmanci, J.-P. Bossuat, S. Carpov, J. H. Cheon, I. Chillotti, W. Cho, D. Froelicher, N. Gama, M. Georgieva, S. Hong, J.-P. Hubaux, D. Kim, K. Lauter, Y. Ma, L. Ohno-Machado, H. Sofia, Y. Son, Y. Song, J. Troncoso-Pastoriza, and X. Jiang, “Ultrafast homomorphic encryption models enable secure outsourcing of genotype imputation,” Cell Systems, vol. 12, pp. 1108–1120.e4, 17 Nov. 2021.

[3] S. Carpov, N. Gama, M. Georgieva, and J. R. Troncoso-Pastoriza, “Privacy-preserving semi-parallel logistic regression training with fully homomorphic encryption,” BMC Medical Genomics, vol. 13, p. 88, 21 July 2020.

[4] DARPA, “Data protection in virtual environments (DPRIVE).” https://sam.gov/opp/16c71dadbe814127b475ce309929374b/view, 27 Feb. 2020.



[5] N. Samardzic, A. Feldmann, A. Krastev, S. Devadas, R. Dreslinski, C. Peikert, and D. Sanchez, “F1: A fast and programmable accelerator for fully homomorphic encryption,” in MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’21, (New York, NY, USA), pp. 238–252, Association for Computing Machinery, 18 Oct. 2021.

[6] “Microsoft SEAL (release 3.5),” Apr. 2020.

[7] C. Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. Pienaar, R. Riddle, T. Shpeisman, N. Vasilache, and O. Zinenko, “MLIR: A compiler infrastructure for the end of Moore’s law,” 25 Feb. 2020, arXiv:cs.PL/2002.11054.

[8] P. Paillier, “Public-key cryptosystems based on composite degree residuosity classes,” in Advances in Cryptology — EUROCRYPT ’99, (Prague, Czech Republic), pp. 223–238, Springer, Berlin, Heidelberg, May 1999.

[9] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Communications of the ACM, vol. 21, pp. 120–126, Feb. 1978.

[10] C. Gentry, “Fully homomorphic encryption using ideal lattices,” in Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC ’09, (Bethesda, MD, USA), p. 169, ACM Press, 2009.

[11] N. Samardzic, A. Feldmann, A. Krastev, N. Manohar, N. Genise, S. Devadas, K. Eldefrawy, C. Peikert, and D. Sanchez, “CraterLake: A hardware accelerator for efficient unbounded computation on encrypted data,” in ISCA, pp. 173–187, 2022.

[12] R. Cammarota, “Intel HERACLES: Homomorphic encryption revolutionary accelerator with correctness for learning-oriented end-to-end solutions,” in Proceedings of the 2022 Cloud Computing Security Workshop, CCSW ’22, (New York, NY, USA), p. 3, Association for Computing Machinery, 7 Nov. 2022.

[13] R. Geelen, M. Van Beirendonck, H. V. L. Pereira, B. Huffman, T. McAuley, B. Selfridge, D. Wagner, G. Dimou, I. Verbauwhede, F. Vercauteren, and D. W. Archer, “BASALISC: Flexible asynchronous hardware accelerator for fully homomorphic encryption,” 27 May 2022, arXiv:cs.CR/2205.14017.

[14] Z. Brakerski, “Fully homomorphic encryption without modulus switching from classical GapSVP,” in Advances in Cryptology – CRYPTO 2012, vol. 7417, pp. 868–886, Berlin, Heidelberg: Springer Berlin Heidelberg, 2012.

[15] J. Fan and F. Vercauteren, “Somewhat practical fully homomorphic encryption,” IACR Cryptology ePrint Archive, vol. 2012, p. 144, 2012.

[16] Z. Brakerski, C. Gentry, and V. Vaikuntanathan, “(Leveled) fully homomorphic encryption without bootstrapping,” in Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS ’12, (New York, NY, USA), pp. 309–325, Association for Computing Machinery, 2012.

[17] J. H. Cheon, A. Kim, M. Kim, and Y. Song, “Homomorphic encryption for arithmetic of approximate numbers,” in Advances in Cryptology – ASIACRYPT 2017, vol. 10624, pp. 409–437, Cham: Springer International Publishing, 2017.

[18] M. R. Albrecht, R. Player, and S. Scott, “On the concrete hardness of learning with errors,” Journal of Mathematical Cryptology, vol. 9, 1 Jan. 2015.

[19] M. Albrecht, M. Chase, H. Chen, J. Ding, S. Goldwasser, S. Gorbunov, S. Halevi, J. Hoffstein, K. Laine, K. Lauter, S. Lokam, D. Micciancio, D. Moody, T. Morrison, A. Sahai, and V. Vaikuntanathan, “Homomorphic encryption security standard,” tech. rep., HomomorphicEncryption.org, Toronto, Canada, Nov. 2018.

[20] L. Brenna, I. S. Singh, H. D. Johansen, and D. Johansen, “TFHE-rs: A library for safe and secure remote computing using fully homomorphic encryption and trusted execution environments,” Array, p. 100118, 28 Dec. 2021.

[21] F. Bourse, R. Del Pino, M. Minelli, and H. Wee, “FHE circuit privacy almost for free,” in Advances in Cryptology – CRYPTO 2016, pp. 62–89, Springer Berlin Heidelberg, 2016.

[22] L. Ducas and D. Stehlé, “Sanitization of FHE ciphertexts,” in Advances in Cryptology – EUROCRYPT 2016, Lecture Notes in Computer Science, pp. 294–310, Berlin, Heidelberg: Springer Berlin Heidelberg, 2016.

[23] B. Li and D. Micciancio, “On the security of homomorphic encryption on approximate numbers,” in Advances in Cryptology – EUROCRYPT 2021, (Cham), pp. 648–677, Springer International Publishing, 2021.

[24] B. Li, D. Micciancio, M. Schultz, and J. Sorrell, “Securing approximate homomorphic encryption using differential privacy,” Cryptology ePrint Archive, 2022.

[25] C. Boura, N. Gama, M. Georgieva, and D. Jetchev, “Chimera: Combining ring-LWE-based fully homomorphic encryption schemes,” Journal of Mathematical Cryptology, vol. 14, no. 1, pp. 316–338, 2020.



[26] W. jie Lu, Z. Huang, C. Hong, Y. Ma, and H. Qu, “PEGASUS: Bridging polynomial and non-polynomial evaluations in homomorphic encryption,” in 2021 IEEE Symposium on Security and Privacy (SP), IEEE, May 2021.

[27] I. Chillotti, M. Joye, and P. Paillier, “Programmable bootstrapping enables efficient homomorphic inference of deep neural networks.” Cryptology ePrint Archive, Report 2021/091, 2021. https://ia.cr/2021/091.

[28] N. Samardzic, A. Feldmann, A. Krastev, S. Devadas, R. Dreslinski, C. Peikert, and D. Sanchez, “F1: A fast and programmable accelerator for fully homomorphic encryption,” in MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’21, (New York, NY, USA), pp. 238–252, Association for Computing Machinery, 18 Oct. 2021.

[29] N. Samardzic, A. Feldmann, A. Krastev, N. Manohar, N. Genise, S. Devadas, K. Eldefrawy, C. Peikert, and D. Sanchez, “CraterLake: A hardware accelerator for efficient unbounded computation on encrypted data,” in ISCA, pp. 173–187, 2022.

[30] R. Geelen, M. Van Beirendonck, H. V. L. Pereira, B. Huffman, T. McAuley, B. Selfridge, D. Wagner, G. Dimou, I. Verbauwhede, F. Vercauteren, and D. W. Archer, “BASALISC: Flexible asynchronous hardware accelerator for fully homomorphic encryption,” 27 May 2022, arXiv:cs.CR/2205.14017.

[31] A. Viand, P. Jattke, and A. Hithnawi, “SoK: Fully homomorphic encryption compilers,” in 2021 IEEE Symposium on Security and Privacy (SP), pp. 1166–1182, 2021.

[32] D. W. Archer, J. M. Calderón Trilla, J. Dagit, A. Malozemoff, Y. Polyakov, K. Rohloff, and G. Ryan, “RAMPARTS: A programmer-friendly system for building homomorphic encryption applications,” in Proceedings of the 7th ACM Workshop on Encrypted Computing & Applied Homomorphic Cryptography, WAHC ’19, (New York, NY, USA), pp. 57–68, ACM Press, 2019.

[33] S. Carpov, P. Dubrulle, and R. Sirdey, “Armadillo: A compilation chain for privacy preserving applications,” in Proceedings of the 3rd International Workshop on Security in Cloud Computing, SCC ’15, (New York, NY, USA), pp. 13–19, ACM, 2015.

[34] S. Carpov, P. Aubry, and R. Sirdey, “A multi-start heuristic for multiplicative depth minimization of boolean circuits,” in Combinatorial Algorithms, pp. 275–286, Springer International Publishing, 2018.

[35] P. Aubry, S. Carpov, and R. Sirdey, “Faster homomorphic encryption is not enough: Improved heuristic for multiplicative depth minimization of boolean circuits,” in Topics in Cryptology – CT-RSA 2020, pp. 345–363, Springer International Publishing, 2020.

[36] S. Chowdhary, W. Dai, K. Laine, and O. Saarikivi, “EVA improved: Compiler and extension library for CKKS,” in Proceedings of the 9th Workshop on Encrypted Computing & Applied Homomorphic Cryptography, WAHC ’21, (New York, NY, USA), pp. 43–55, Association for Computing Machinery, 15 Nov. 2021.

[37] R. Dathathri, B. Kostova, O. Saarikivi, W. Dai, K. Laine, and M. Musuvathi, “EVA: An encrypted vector arithmetic language and compiler for efficient homomorphic computation,” in Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, 27 Dec. 2019.

[38] Y. Lee, S. Heo, S. Cheon, S. Jeong, C. Kim, E. Kim, D. Lee, and H. Kim, “HECATE: Performance-aware scale optimization for homomorphic encryption compiler,” in 2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 193–204, Apr. 2022.

[39] A. Costache, K. Laine, and R. Player, “Evaluating the effectiveness of heuristic worst-case noise analysis in FHE.” Cryptology ePrint Archive, Report 2019/493, 2019.

[40] I. Iliashenko, “Optimisations of fully homomorphic encryption,” PhD thesis, KU Leuven, 27 May 2019.

[41] A. Costache, B. R. Curtis, E. Hales, S. Murphy, T. Ogilvie, and R. Player, “On the precision loss in approximate homomorphic encryption.” https://eprint.iacr.org/2022/162.pdf. Accessed: 2022-2-26.

[42] A. Costache, K. Laine, and R. Player, “Evaluating the effectiveness of heuristic worst-case noise analysis in FHE,” in Computer Security – ESORICS 2020, pp. 546–565, Springer, Sept. 2020.

[43] F. Boemer, S. Kim, G. Seifu, F. D. M. de Souza, and V. Gopal, “Intel HEXL: Accelerating homomorphic encryption with Intel AVX512-IFMA52,” 30 Mar. 2021, arXiv:cs.CR/2103.16400.

[44] M. Cowan, D. Dangwal, A. Alaghi, C. Trippel, V. T. Lee, and B. Reagen, “Porcupine: A synthesizing compiler for vectorized homomorphic encryption,” in Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2021, (New York, NY, USA), pp. 375–389, Association for Computing Machinery, 2021.



[45] S. Larsen and S. Amarasinghe, “Exploiting superword level parallelism with multimedia instruction sets,” SIGPLAN Not., vol. 35, pp. 145–156, 1 May 2000.

[46] C. Mendis and S. Amarasinghe, “goSLP: Globally optimized superword level parallelism framework,” Proc. ACM Program. Lang., vol. 2, pp. 1–28, 24 Oct. 2018.

[47] Y. Chen, C. Mendis, M. Carbin, and S. Amarasinghe, “VeGen: A vectorizer generator for SIMD and beyond,” in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’21, (New York, NY, USA), pp. 902–914, Association for Computing Machinery, 19 Apr. 2021.

[48] S. Halevi and V. Shoup, “Design and implementation of a homomorphic-encryption library,” IBM Research (Manuscript), vol. 6, pp. 12–15, 2013.

[49] A. Viand and H. Shafagh, “Marble: Making fully homomorphic encryption accessible to all,” in Proceedings of the 6th Workshop on Encrypted Computing & Applied Homomorphic Cryptography, pp. 49–60, ACM, Oct. 2018.

[50] S. Carpov, P. Dubrulle, and R. Sirdey, “Armadillo: A compilation chain for privacy preserving applications,” in Proceedings of the 3rd International Workshop on Security in Cloud Computing, SCC ’15, (New York, NY, USA), pp. 13–19, ACM, 2015.

[51] E. Chielle, N. G. Tsoutsos, O. Mazonka, and M. Maniatakos, “Encrypt-Everything-Everywhere: ISA extensions for private computation,” IEEE Transactions on Dependable and Secure Computing, pp. 1–1, 2020.

[52] S. Govindarajan and W. S. Moses, “SyFER-MLIR: Integrating fully homomorphic encryption into the MLIR compiler framework,” 2020.

[53] S. Gorantala, R. Springer, S. Purser-Haskell, W. Lam, R. J. Wilson, A. Ali, E. P. Astor, I. Zukerman, S. Ruth, C. Dibak, P. Schoppmann, S. Kulankhina, A. Forget, D. Marn, C. Tew, R. Misoczki, B. Guillen, X. Ye, D. Kraft, D. Desfontaines, A. Krishnamurthy, M. Guevara, I. M. Perera, Y. Sushko, and B. Gipson, “A general purpose transpiler for fully homomorphic encryption,” CoRR, vol. abs/2106.07893, 2021, arXiv:2106.07893.

[54] R. Dathathri, O. Saarikivi, H. Chen, K. Laine, K. Lauter, S. Maleki, M. Musuvathi, and T. Mytkowicz, “CHET: An optimizing compiler for fully-homomorphic neural-network inferencing,” in Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, (New York, NY, USA), pp. 142–156, ACM, 8 June 2019.

[55] F. Boemer, A. Costache, R. Cammarota, and C. Wierzynski, “nGraph-HE2: A high-throughput framework for neural network inference on encrypted data,” in Proceedings of the 7th ACM Workshop on Encrypted Computing & Applied Homomorphic Cryptography, WAHC ’19, (New York, NY, USA), pp. 45–56, Association for Computing Machinery, Nov. 2019.

[56] F. Boemer, Y. Lao, R. Cammarota, and C. Wierzynski, “nGraph-HE: A graph compiler for deep learning on homomorphically encrypted data,” in Proceedings of the 16th ACM International Conference on Computing Frontiers, CF ’19, (New York, NY, USA), pp. 3–13, ACM, 2019.

[57] Zama, “concrete-numpy: Concrete Numpy is an open-source library which simplifies the use of fully homomorphic encryption.”

[58] Y. Lee, S. Heo, S. Cheon, S. Jeong, C. Kim, E. Kim, D. Lee, and H. Kim, “HECATE: Performance-aware scale optimization for homomorphic encryption compiler,” in 2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 193–204, Apr. 2022.

[59] N. Büscher and S. Katzenbeisser, Compilation for Secure Multi-party Computation. SpringerBriefs in Computer Science, Springer International Publishing, 2017.

[60] M. Hastings, B. Hemenway, D. Noble, and S. Zdancewic, “SoK: General purpose compilers for secure multi-party computation,” in IEEE Symposium on Security and Privacy (SP), (Los Alamitos, CA, USA), pp. 479–496, IEEE Computer Society, 2019.

[61] A. Ozdemir, F. Brown, and R. S. Wahby, “CirC: Compiler infrastructure for proof systems, software verification, and more,” in 2022 IEEE Symposium on Security and Privacy (SP), pp. 2248–2266, May 2022.

[62] I. Chillotti, M. Joye, and P. Paillier, “Programmable bootstrapping enables efficient homomorphic inference of deep neural networks.” Cryptology ePrint Archive, Report 2021/091, 2021.

