Alexander Viand, Patrick Jattke, Miro Haller, and Anwar Hithnawi, ETH Zurich
https://www.usenix.org/conference/usenixsecurity23/presentation/viand
HECO proposes a multi-staged approach to FHE compilation that encompasses: (i) Program Transformations, which restructure high-level programs to be efficiently expressible using native FHE operations, (ii) Circuit Optimizations, which primarily focus on changes that reduce noise growth in the FHE computation, (iii) Cryptographic Optimizations, which instantiate the underlying scheme as efficiently as possible for the given program, and (iv) Target Optimizations, which map the computation to the capabilities of the target. We propose a set of Intermediate Representations (IRs) designed to provide a suitable abstraction of each stage, allowing us to naturally and efficiently express optimizations at these different levels. In contrast, existing compilers usually abstract FHE computations as circuits², which does not allow them to fully express many optimizations at the high level (program transformation) or at the lowest level (target optimization), because these need to consider aspects such as data flow or memory management, which have no natural correspondence in a circuit representation. In HECO, high-level programs are lowered through a series of transformations, using multiple increasingly lower-level IRs to produce the target kernel. These kernels can then be targeted and run against various back-end options. We provide a user-facing Python framework that abstracts away the complexities of this process, supports a Python-embedded Domain Specific Language for FHE, and provides a unified experience for development, compilation and execution. We provide an overview of our end-to-end design in Figure 2. In the remainder of this section, we describe HECO's components, abstractions, and compilation stages.

² This is natural since FHE computations are usually modeled mathematically as arithmetic circuits.

Figure 2 (diagram): HECO's IR hierarchy. In the middle-end, programs are expressed in the HIR using the heco::fhe dialect (types secret<T> and batchedsecret<T>; operations add, sub, mul, rot) together with standard MLIR dialects: mlir::tensor (tensor<V> types, e.g., V = secret<T>, with insert/extract operations), mlir::arithmetic (iadd, imul, ...), mlir::affine (loops), and mlir::func (function calls). The SIR consists of the scheme-specific dialects heco::bfv, heco::bgv, and heco::ckks. The back ends then target the RIR (heco::rns, operations in residue number system representation), the PIR (heco::poly, operations in the polynomial ring plus the number theoretic transform), C++ FHE libraries (e.g., SEAL), mlir::llvm (MLIR to LLVM IR), or a hardware ISA (dprive::isa).

3.2 HECO Framework

HECO's framework ties together the front end, compiler, and the various back ends into a unified development experience. It allows developers to edit, compile, and deploy their applications from a familiar Python environment. In order to provide an intuition of the developer experience in our system, we provide an example of using HECO to compile and run an FHE program in Listing 1 (Server) and Listing 2 (Client). By wrapping FHE functions in with blocks, we can operate on them as first-class entities, making compilation explicit. Our framework provides the necessary infrastructure to run programs directly from the front end, allowing developers to integrate FHE functionality into larger applications easily.

Listing 2: Corresponding client-side code, outsourcing the computation of the (squared) Euclidean distance.

Python-Embedded DSL. HECO uses Python to host its Domain-Specific Language (DSL), inheriting Python's syntax and general semantics. We want to allow developers to write their programs as naturally as possible. In order to facilitate this, HECO supports (statically sized) loops, access to vector elements, and many other high-level features that do not have a direct correspondence in FHE. Since our compilation approach requires a high-level representation of the input program, including these non-native operations and the control-flow structure, we cannot follow the approach used by most existing tools. These tend to execute the program using placeholder objects that record operations performed on them, which is equivalent if considering FHE programs as circuits but removes most of the high-level information.
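Since the bodies of Listing 1 and Listing 2 are not reproduced here, the following is a minimal sketch of what such client-side code might look like. The with-block pattern and the DSL types follow the text above, but the surrounding API (the heco import path, FHEContext, compile, encrypt, decrypt) is our illustrative assumption, not HECO's documented interface.

from heco import FHEContext, Tensor, Secret  # assumed import path

with FHEContext() as ctx:  # FHE functions are wrapped in a `with` block
    def distance(x: Tensor[4, Secret[int]],
                 y: Tensor[4, Secret[int]]) -> Secret[int]:
        sum: Secret[int] = 0
        for i in range(0, 4):
            d = x[i] - y[i]
            sum = sum + d * d  # (squared) euclidean distance
        return sum

    fn = ctx.compile(distance)  # compilation is explicit

# Client side: encrypt the inputs, outsource the computation, decrypt the result.
ct_x = fn.encrypt([1, 2, 3, 4])
ct_y = fn.encrypt([4, 3, 2, 1])
print(fn.decrypt(fn(ct_x, ct_y)))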
(a) Textbook implementation of a simple sharpening filter. (b) Optimized batched solution of the same program.

Listing 3: Both of these functions apply a simple sharpening filter to an encrypted image of size n × n = N, by convolving a 3 × 3 kernel (−8 in the center, 1 everywhere else) with the image. The version on the left encrypts each pixel individually and follows the textbook version of the algorithm, operating over a vector of N ciphertexts. The version on the right batches all pixels into a single ciphertext and uses rotations (<<) and SIMD operations to compute the kernel over the entire image at the same time. Designing batched implementations requires out-of-the-box thinking in addition to significant expertise and experience.
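The listing bodies themselves did not survive extraction; the following sketch illustrates the two styles the caption contrasts, written in the same DSL style as Listing 5. The function names and the BatchedSecret type are our assumptions.

# (a) Textbook style: one secret per pixel, 3x3 kernel with wrap-around padding.
def sharpen_textbook(img: Tensor[64, Secret[int]]) -> Tensor[64, Secret[int]]:
    n = 8  # image is n x n = N pixels, stored row-major
    result: Tensor[64, Secret[int]] = img
    for x in range(0, n):
        for y in range(0, n):
            value: Secret[int] = -8 * img[x * n + y]
            for dx in [-1, 0, 1]:
                for dy in [-1, 0, 1]:
                    if dx != 0 or dy != 0:
                        # wrap-around padding via modular indexing
                        value = value + img[((x + dx) % n) * n + ((y + dy) % n)]
            result[x * n + y] = value
    return result

# (b) Batched style: all N pixels in one ciphertext; `<<` rotates the slots.
def sharpen_batched(img: BatchedSecret[int]) -> BatchedSecret[int]:
    n = 8
    result = -8 * img
    for offset in [-n - 1, -n, -n + 1, -1, 1, n - 1, n, n + 1]:
        result = result + (img << offset)  # align each neighbor, then add
    return result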
Figure 5: Log-log plots of the runtime and memory consumption of the Roberts Cross and Hamming Distance benchmarks for different image and vector sizes, comparing a naive non-batched solution with the batched solution generated by our system. (a) Roberts Cross benchmark runtime (in seconds). (b) Hamming Distance benchmark runtime (in seconds). (c) Roberts Cross benchmark memory usage (in MB). (d) Hamming Distance benchmark memory usage (in MB).
6.1 Benchmarks

We evaluate HECO in terms of the speedup and reduction in memory overhead gained over non-optimized implementations, as well as the compile time required.

Applications. We demonstrate the speedup achieved by our batching optimization on two applications that are representative of common batching opportunities. The Roberts cross operator is an edge-detection feature used in image processing. It approximates the gradient of an image as the square root of the sum-of-squares of two different convolutions of the image, which compute the differences between diagonally adjacent pixels. As in all other kernel-based benchmarks, wrap-around padding is used, which aligns well with the cyclical rotation paradigm of FHE. In order to enable a practical FHE evaluation, the final square root is omitted, since it would be prohibitively expensive to evaluate under encryption. The Hamming distance, meanwhile, computes the edit distance between two vectors, i.e., the number of positions at which they disagree. Here, we consider two binary vectors of the same length, a setting in which computing (non-)equality can be done efficiently using the arithmetic operations available in FHE. Specifically, this makes use of the fact that NEQ(a, b) = XOR(a, b) = (a − b)² for a, b ∈ {0, 1}.
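As an illustration of the kind of input program this benchmark starts from, a naive Hamming distance over encrypted bit vectors can be written directly with this NEQ trick. This is a sketch in the style of Listing 5; the function name is ours.

def encryptedHammingDistance(x: Tensor[4, Secret[int]],
                             y: Tensor[4, Secret[int]]) -> Secret[int]:
    sum: Secret[int] = 0
    for i in range(0, 4):
        # NEQ(x[i], y[i]) = (x[i] - y[i])^2 for x[i], y[i] in {0, 1}
        sum = sum + (x[i] - y[i]) * (x[i] - y[i])
    return sum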
Baseline. Our baseline is a naive implementation of the application without taking advantage of batching, as one might expect FHE novices to implement. In this setting, vectors of secrets are directly translated to vectors of ciphertexts. While this approach introduces significant ciphertext expansion and increases the memory required, it is actually preferable in terms of run time over a solution that batches vector data into ciphertexts but does not re-structure the program to be batching-friendly. This is because such a solution adds the overhead of rotations, masking, etc., to the base runtime of the non-batched solution.

Environment. All benchmarks are executed on AWS m5n.xlarge instances, which provide 4 cores and 16 GB of RAM. We used Microsoft SEAL [6] as the underlying FHE library, targeting its BFV [14, 15] scheme implementation. All experiments are run using the same parameters, which ensure at least 128-bit security. We report the run time of the computation itself, omitting client-side aspects such as key generation or encryption/decryption. All results are the average of 10 iterations, discarding top and bottom outliers.

Runtime & Memory Overhead. In Figure 5a we show the runtime of the Roberts Cross benchmark for varying instance sizes, comparing the non-batched baseline with the batched solution generated by HECO. While the run time of the naive version increases linearly with the image size, the batched solution maintains the same performance (until parameters must be increased to accommodate even larger images). Instead, the run time is more closely tied to the size of the kernel than to that of the image. This highlights the dramatic transformations achieved by HECO, fundamentally changing the structure of the program. As a result of these transformations, HECO achieves a speedup of 3454x over the non-batched baseline for 64x64 pixel images, demonstrating the extraordinary impact that effective use of batching can have on FHE applications: while the non-batched solution is borderline impractical at over two minutes, the batched solution takes only a fraction of a second.
Figure 6: Runtime of example applications (in seconds), comparing a naive non-batched baseline, the solution generated by our system (HECO), and an optimally-batched solution synthesized by the Porcupine tool.
queries) will return incorrect results. Therefore, we must first de-duplicate the encrypted databases before executing the analytics. Since threshold FHE does not affect the server-side execution of the computation, HECO can be used directly to develop such an application. The application first computes the Private Set Union (PSU) of two databases A, B indexed by unique IDs consistent across both (e.g., a national identifier like the SSN). For simplicity, we consider databases with one data column and a simple SUM aggregation. However, the presented approach trivially extends to larger databases and more complex statistics. Listing 5 shows the application expressed using the HECO Python frontend, for a database size of 128 elements and 8-bit identifiers. We split the identifiers into individual bits, allowing us to compute the equality function even while working with arithmetic circuits. The program begins by aggregating A's data and then proceeds to check each element of B's database for potential duplicates in A. In order to compute the equality function, we compute ⋀_k ¬(a_k ⊕ b_k), using the fact that, for inputs a, b ∈ {0, 1}, XOR can be computed as (a − b)², AND directly via multiplication, and NOT as 1 − a. If a duplicate is found, the element of B is multiplied with unique == 0, i.e., it does not contribute to the overall statistics.

def encryptedPSU(a_id: Tensor[128, 8, Secret[int]],
                 a_data: Tensor[128, Secret[int]],
                 b_id: Tensor[128, 8, Secret[int]],
                 b_data: Tensor[128, Secret[int]]) -> Secret[int]:
    sum: Secret[int] = 0
    for i in range(0, 128):  # aggregate all of A's data
        sum = sum + a_data[i]
    for i in range(0, 128):  # add B's data only where its ID is not in A
        unique: Secret[int] = 1
        for j in range(0, 128):
            # compute b_id[i] != a_id[j]
            eq: Secret[int] = 1
            for k in range(0, 8):
                # a xor b == (a - b)^2
                x = (b_id[i][k] - a_id[j][k]) ** 2
                nx = 1 - x    # not x
                eq = eq * nx  # eq and nx
            neq = 1 - eq           # not eq
            unique = unique * neq  # unique and neq
        sum = sum + unique * b_data[i]
    return sum

Listing 5: Computing statistics over duplicated data.
Performance & Discussion. We evaluated both a naive baseline and the HECO-optimized batched implementation using the same setup described earlier in this section. The naive approach has a run time of several minutes (11.3 min) and requires the sending of over 2000 ciphertexts between the server and the clients. The batched solution produced by our system requires not only significantly less data to be transmitted (only 4 ciphertexts), but also runs an order of magnitude faster (57.6 s), confirming the trend we observed when evaluating on smaller benchmarks. While the results achieved by HECO are more than practical already, a state-of-the-art hand-written implementation designed for this task can improve this even further, requiring only 1.4 s. However, arriving at this solution requires a significant rethinking of the program and an application-specific batching pattern, which HECO intentionally does not consider in order to avoid a search space explosion. Note that synthesis-based tools such as Porcupine also cannot capture these kinds of transformations.

Specifically, instead of batching the identifiers for each database into a single ciphertext, the expert solution instead creates one ciphertext per bit, using significantly oversized ciphertexts with 128² = 2¹⁴ slots. This enables the expert solution to batch every possible permutation of the identifier set into one ciphertext. By applying this to the encryption of set B, while simply encrypting 128 non-permuted repetitions of set A, the expensive O(n²) duplication check can be performed in parallel on all elements at the same time. Computing the unique flag then uses a rare application of the rotate-and-multiply pattern. As a trade-off, the equality computation is no longer batched, but since the number of bits in the identifiers is, by necessity, at most logarithmic in the number of database elements, this is a profitable trade-off.
In general, exploiting application-specific packing patterns can unlock additional performance gains and is frequently combined with client-side processing in expert-designed FHE systems. However, the decision on which client-side processing (e.g., removing outliers, computing permutations, etc.) is