StyleSDF: High-Resolution 3D-Consistent Image and Geometry Generation
Figure 1. Our proposed framework, StyleSDF, learns to jointly generate high-resolution, 3D-consistent images (top rows) along with their detailed, view-consistent geometry represented with SDFs (depth maps in bottom rows), while being trained on single-view RGB images.
Abstract

We introduce a high-resolution, 3D-consistent image and shape generation technique which we call StyleSDF. Our method is trained on single-view RGB data only, and stands on the shoulders of StyleGAN2 for image generation, while solving two main challenges in 3D-aware GANs: 1) high-resolution, view-consistent generation of the RGB images, and 2) detailed 3D shape. We achieve this by merging an SDF-based 3D representation with a style-based 2D generator. Our 3D implicit network renders low-resolution feature maps, from which the style-based network generates view-consistent, 1024×1024 images. Notably, our SDF-based 3D modeling defines detailed 3D surfaces, leading to consistent volume rendering. Our method shows higher-quality results compared to the state of the art in terms of visual and geometric quality.

Project Page: https://stylesdf.github.io/

1. Introduction

StyleGAN architectures [35–37] have shown an unprecedented quality of RGB image generation. They are, however, designed to generate single RGB views rather than 3D content. In this paper, we introduce StyleSDF, a method for generating 3D-consistent 1024×1024 RGB images and geometry, trained only on single-view RGB images.

Related 3D generative models [8, 49, 54, 58, 63] present shape and appearance synthesis via coordinate-based multi-layer perceptrons (MLPs). These works, however, often require 3D or multi-view data for supervision, which are difficult to collect, or are limited to low-resolution rendering outputs as they rely on expensive volumetric field sampling. Without multi-view supervision, 3D-aware GANs [8, 49, 58] typically use opacity fields as a geometric proxy, forgoing well-defined surfaces, which results in low-quality depth maps that are inconsistent across views.

At the core of our architecture lie the SDF-based 3D volume renderer and the 2D StyleGAN generator. We use a coordinate-based MLP to model Signed Distance Fields (SDF) and radiance fields which render low-resolution feature maps. These feature maps are then efficiently transformed into high-resolution images using the StyleGAN generator. Our model is trained with an adversarial loss that encourages the networks to generate realistic images from all sampled viewpoints, and an Eikonal loss that ensures proper SDF modeling. These losses automatically induce view-consistent, detailed 3D scenes, without 3D or multi-view supervision. The proposed framework effectively addresses the resolution and view-inconsistency issues of existing 3D-aware GAN approaches that are based on volume rendering. Our system design opens the door for interesting future research in vision and graphics that involves a latent space of high-quality shape and appearance.

Our approach is evaluated on the FFHQ [36] and AFHQ [12] datasets. We demonstrate through extensive experiments that our system outperforms state-of-the-art 3D-aware methods, measured by the quality of the generated images and surfaces, and their view consistency.

2. Related Work

In this section, we review related approaches in 2D image synthesis, 3D generative modeling, and 3D-aware image synthesis.

Generative Adversarial Networks: State-of-the-art Generative Adversarial Networks (GANs) [19] can synthesize high-resolution RGB images that are practically indistinguishable from real images [34–37]. Substantial work has been done to manipulate the generated images by exploring meaningful latent space directions [1–3, 13, 24, 28, 59, 60, 64, 65], introducing contrastive learning [61], inverse graphics [75], exemplar images [31], or multiple input views [40]. While 2D latent space manipulation produces realistic results, these methods tend to lack explicit camera control, have no 3D understanding, require shape priors from 3DMM models [64, 65], or reconstruct the surface as a preprocessing step [40].

Coordinate-based 3D Models: While multiple 3D representations have been proposed for generative modeling [22, 70, 71], recent coordinate-based neural implicit models [9, 44, 54] stand out as an efficient, expressive, and differentiable representation. Neural implicit representations (NIR) have been widely adopted for learning the shape and appearance of objects [4, 10, 14, 20, 45, 50, 52, 56, 57], local parts [17, 18], and full 3D scenes [7, 11, 29, 55] from explicit 3D supervision. Moreover, NIR approaches have been shown to be a powerful tool for reconstructing 3D structure from multi-view 2D supervision via fitting their 3D models to the multi-view images using differentiable rendering [46, 51, 63, 73].

Two recent seminal breakthroughs are NeRF [46] and SIREN [62]. NeRF introduced the use of volume rendering [32] for reconstructing a 3D scene as a combination of neural radiance and density fields to synthesize novel views. SIREN replaced the popular ReLU activation function with sine functions with modulated frequencies, showing great single-scene fitting results. We refer readers to [67] for a more comprehensive review.

Single-View Supervised 3D-Aware GANs: Rather than relying on 3D or multi-view supervision, recent approaches aim at learning a 3D generative model from a set of unconstrained single-view images. These methods [8, 16, 25, 30, 42, 47–49, 58] typically optimize their 3D representations to render realistic 2D images from all randomly sampled viewpoints using an adversarial loss.

Most in line with our work are methods that use implicit neural radiance fields for 3D-aware image and geometry generation (GRAF [58] and Pi-GAN [8]). However, these methods are limited to low-resolution outputs due to the high computational costs of volume rendering. In addition, the use of density fields as a proxy for geometry provides ample leeway for the networks to produce realistic images while violating 3D consistency, leading to inconsistent volume rendering w.r.t. the camera viewpoints (the rendered RGB or depth images are not 3D-consistent).

ShadeGAN [53] introduces a shading-guided pipeline which improves the surface quality, but the image output resolution (128×128) is still bounded by the computational burden of the volume rendering. GIRAFFE [49] proposed a dual-stage rendering process, where a backbone volume renderer generates low-resolution feature maps (16×16) that are passed to a 2D CNN to generate outputs at resolutions of up to 256×256. Despite improved image quality, GIRAFFE outputs lack view consistency, as the hairstyle, facial expression, and sometimes the object's identity are entangled with the camera viewpoint inputs, likely because 3D outputs at 16×16 are not descriptive enough.

StyleNeRF [23] and CIPS-3D [76] are concurrent works that adopt the two-stage rendering process or a smart sampling for high-resolution image generation, yet these works still do not model well-defined, view-consistent 3D geometry.

3. Algorithm

3.1. Overview

Our framework consists of two main components: a backbone conditional SDF volume renderer and a 2D style-based generator [37]. Each component also has an accompanying mapping network [36] that maps the input latent vector into modulation signals for each layer. An overview of our architecture can be seen in Figure 2.
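For concreteness, the forward generation pass of Figure 2 can be sketched as follows. This is a minimal illustration of the data flow only; the module names are hypothetical stand-ins, not the released implementation:

```python
import torch

def generate(z, azimuth, elevation,
             mapping_3d, volume_renderer, mapping_2d, styled_generator):
    """Minimal sketch of the two-component pipeline; all modules are hypothetical stand-ins."""
    w = mapping_3d(z)                                   # latent code -> modulation signals for the volume renderer
    feat_64, rgb_64, depth_64 = volume_renderer(w, azimuth, elevation)  # 64x64 feature map via volume rendering
    styles = mapping_2d(w)                              # the 2D mapping network reuses the w code (Sec. 3.3)
    img_1024 = styled_generator(feat_64, styles)        # styled 2D generator upsamples to a 1024x1024 image
    return img_1024, rgb_64, depth_64

# Sampling: z ~ N(0, I); azimuth and elevation are drawn from the dataset's pose distribution.
```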
Figure 2. StyleSDF Architecture: (Left) Overall architecture: the SDF volume renderer takes in a latent code and camera parameters, queries points and view directions in the volume, and projects the 3D surface features into the 2D view. The projected features are fed to the styled 2D generator that creates the high-resolution image. (Right) Our SDF volume renderer jointly models a volumetric SDF and radiance field, providing well-defined and view-consistent geometry.
To generate an image, we sample a latent vector z from the unit normal distribution, and camera azimuth and elevation angles (φ, θ) from the dataset's object pose distribution. For simplicity, we assume that the camera is positioned on the unit sphere and directed towards the origin. Next, our volume renderer outputs the signed distance value, RGB color, and a 256-element feature vector for all the sampled volume points along the camera rays. We calculate the surface density for each sampled point from its SDF value and apply volume rendering [46] to project the 3D surface features into a 2D feature map. The 2D generator then takes the feature map and generates the output image from the desired viewpoint. The 3D surface can be visualized with volume-rendered depths or with the mesh extracted by the marching cubes algorithm [41].

3.2. SDF-based Volume Rendering

Our backbone volume renderer takes a 3D query point x and a viewing direction v. Conditioned on the latent vector z, it outputs an SDF value d(x, z), a view-dependent color value c(x, v, z), and a feature vector f(x, v, z). For clarity, we omit z from here on.

The SDF value indicates the distance of the queried point from the surface boundary, and its sign indicates whether the point is inside or outside of a watertight surface. As shown in VolSDF [72], the SDF can serve as a proxy for the density function used in traditional volume rendering [46]. Assuming a non-hollow surface, we convert the SDF value into the 3D density field σ,

σ(x) = K_α(d(x)) = (1/α) · Sigmoid(−d(x)/α),   (1)

where α is a learned parameter that controls the tightness of the density around the surface boundary. α values that approach 0 represent a solid, sharp object boundary, whereas larger α values indicate a more "fluffy" object boundary. A large positive SDF value drives the sigmoid towards 0, meaning no density outside of the surface, and a high-magnitude negative SDF value pushes the sigmoid towards 1, meaning maximal density inside the surface.

We render low-resolution 64×64 feature maps and color images with volume rendering. For each pixel, we query points on a ray that originates at the camera position o and points along the camera direction, r(t) = o + tv, and calculate the RGB color and feature map as follows:

C(r) = ∫_{t_n}^{t_f} T(t) σ(r(t)) c(r(t), v) dt,
F(r) = ∫_{t_n}^{t_f} T(t) σ(r(t)) f(r(t), v) dt,   (2)
where T(t) = exp( −∫_{t_n}^{t} σ(r(s)) ds ),

which we approximate with discrete sampling along the rays. Unlike NeRF [46] and other 3D-aware GANs such as Pi-GAN [8] and StyleNeRF [23], we do not use stratified sampling. Instead, we split [t_n, t_f] into N evenly-sized bins, draw a single offset term uniformly, δ ∼ U[0, (t_f − t_n)/N], and sample N evenly-spaced points,

t_i = ((t_f − t_n)/N) · i + δ,   where i ∈ {0, . . . , N − 1}.   (3)

In addition, we forgo hierarchical sampling altogether, thereby reducing the number of samples by 50%.
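The density conversion of Eq. (1) and the per-ray sampling and compositing of Eqs. (2)–(3) can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation; tensor shapes are assumptions, and the near bound t_n is added explicitly to the sample locations:

```python
import torch

def sdf_to_density(d, alpha):
    # Eq. (1): sigma(x) = K_alpha(d(x)) = (1/alpha) * Sigmoid(-d(x) / alpha)
    return torch.sigmoid(-d / alpha) / alpha

def sample_ray_points(t_near, t_far, n_samples):
    # Eq. (3): N evenly spaced samples per ray, sharing a single uniform offset delta
    # t_near, t_far: (num_rays,) tensors
    bin_size = (t_far - t_near) / n_samples
    delta = torch.rand_like(t_near) * bin_size                      # one offset per ray
    i = torch.arange(n_samples, dtype=t_near.dtype, device=t_near.device)
    return t_near[:, None] + bin_size[:, None] * i[None, :] + delta[:, None]  # (num_rays, N)

def composite(sigma, values, t):
    # Standard discrete approximation of Eq. (2); `values` can hold colors c or features f.
    # sigma: (R, N), values: (R, N, C), t: (R, N) sample distances along each ray
    delta_t = torch.cat([t[:, 1:] - t[:, :-1], torch.full_like(t[:, :1], 1e10)], dim=-1)
    seg_alpha = 1.0 - torch.exp(-sigma * delta_t)                   # opacity of each ray segment
    trans = torch.cumprod(
        torch.cat([torch.ones_like(seg_alpha[:, :1]), 1.0 - seg_alpha[:, :-1]], dim=-1), dim=-1)
    weights = trans * seg_alpha                                     # T(t_i) * (1 - exp(-sigma_i * delta_i))
    return (weights[..., None] * values).sum(dim=1)                 # (R, C)
```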
The incorporation of SDFs provides a clear definition of the surface, allowing us to extract the mesh via marching cubes [41]. Moreover, the use of SDFs along with the related losses (Sec. 3.4.1) leads to higher-quality geometry in terms of expressiveness and view consistency (as shown in Sec. 4.4), even with a simplified volume sampling strategy.

The architecture of our volume renderer mostly matches that of Pi-GAN [8]. The mapping network consists of a 3-layer MLP with LeakyReLU activations; it maps an input latent code z into the w space and then generates a frequency modulation γ_i and phase shift β_i for each layer of the volume renderer. The volume rendering network contains eight shared modulated FC layers with SIREN [62] activation:

φ_i(x) = sin( γ_i (W_i · x + b_i) + β_i ),   i ∈ {0, . . . , 7},   (4)

where W_i and b_i are the weight matrix and bias vector of the fully connected layers. The volume renderer then splits into two paths, the SDF path and the color path. The SDF path is implemented using a single FC layer, denoted φ_d. In the color path, the output of the last shared layer, φ_7, is concatenated with the view direction input and passed into one additional FiLM SIREN layer φ_f, followed by a single FC layer φ_c that generates the color output.
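A single modulated layer of Eq. (4) can be written compactly as below; this is a minimal sketch, with class and head names chosen for illustration rather than taken from the released code:

```python
import torch
import torch.nn as nn

class FiLMSiren(nn.Module):
    """One modulated FC layer with SIREN activation, Eq. (4):
    phi_i(x) = sin(gamma_i * (W_i x + b_i) + beta_i)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, x, gamma, beta):
        # gamma (frequency) and beta (phase) are produced by the mapping network
        return torch.sin(gamma * self.fc(x) + beta)

# The backbone stacks eight such shared layers (i = 0..7). The SDF head phi_d is a single
# linear layer on top of phi_7; the color head concatenates the view direction, applies one
# more FiLM-SIREN layer phi_f, and a final linear layer phi_c for the RGB output.
```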
We observed that using a view-dependent color c(x, v) tends to make the networks overfit to biases in the dataset. For instance, people in FFHQ [36] tend to smile more when facing the camera. This makes the facial expression change with the viewpoint although the geometry remains consistent. However, when we removed the view-dependent color, the model did not converge. Therefore, to get view-consistent images, we train our model with view-dependent color, but fix the view direction v to the frontal view during inference.

3.3. High-Resolution Image Generation

Unlike the multi-view reconstruction task [46], where the reconstruction loss for each ray can be computed individually, adversarial training needs a full image to be present. Therefore, scaling a pure volume renderer to high resolution quickly becomes intractable, as we need more than 10⁸ queries to render a single 1024×1024 image with 100 samples per ray. As such, we seek to fuse a volume renderer with the StyleGAN2 network, which has proven capabilities of synthesizing high-resolution images in 2D.

To combine the two architectures, we truncate the early layers of the StyleGAN2 generator up until the 64×64 layer and feed the generator with the 64×64 feature maps generated by the backbone volume renderer. In addition, we cut StyleGAN2's mapping network from eight layers to five layers, and feed it with the w latent code from the volume renderer's mapping network, instead of the original latent vector z. The discriminator is left unchanged.

This design choice allows us to enjoy the best of both worlds. The volume renderer learns the underlying geometry, explicitly disentangles the object's pose from its appearance, and enables full control of the camera position during inference. The StyleGAN2 generator upsamples the low-resolution feature maps, adds high-frequency details, and mimics complex light transport effects such as subsurface scattering and inter-reflections that are difficult to model with the low-resolution volume renderer.

3.4. Training

We employ a two-stage training procedure. First, we train only the backbone SDF-based volume renderer; we then freeze the volume renderer weights and train the StyleGAN generator.

3.4.1 Volume Renderer Training

We penalize the viewpoint prediction error (the difference between the predicted angle θ̂ and the sampled angle θ) using a smoothed L1 loss:

L_view = (θ̂ − θ)²   if |θ̂ − θ| ≤ 1,
         |θ̂ − θ|    otherwise.   (6)

This loss is applied on both view angles for the generator and the discriminator; however, since we do not have ground-truth pose data for the original dataset, this loss is only applied to the fake images in the discriminator pass.

Eikonal Loss: This term ensures that the learned SDF is physically valid [21]:

L_eik = E_x ( ‖∇d(x)‖₂ − 1 )².   (7)

Minimal Surface Loss: We encourage the 3D network to describe the scenes with a minimal volume of zero-crossings, to prevent spurious and non-visible surfaces from being formed within the scenes. That is, we penalize SDF values that are close to zero:

L_surf = E_x ( exp(−100 · |d(x)|) ).   (8)

The overall loss function is then

L_vol = L_adv + λ_view L_view + λ_eik L_eik + λ_surf L_surf,   (9)

where λ_view = 15, λ_eik = 0.1, and λ_surf = 0.05. The weight of the R1 loss is set according to the dataset.

3.4.2 Styled Generator Training

We train our styled generator with the same losses and optimizer parameters as the original implementation: a non-saturating adversarial loss, R1 regularization, and path regularization. As in the volume renderer training, we set the weight of the R1 regularization according to the dataset. While it is possible to add a reconstruction loss between the low-resolution and high-resolution output images, we find that the inductive bias of the 2D convolutional architecture and the sharing of style codes is strong enough to preserve important structures and identities between the images (Fig. 3).
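The volume renderer's loss terms, Eqs. (6)–(9), can be sketched as follows; this is a minimal illustration in which the SDF network interface (`sdf_net`) is a hypothetical stand-in, and Eq. (6) is written out explicitly rather than via a library smooth-L1 (whose constants differ slightly):

```python
import torch

def view_loss(theta_pred, theta):
    # Eq. (6): squared error within one unit, absolute error beyond it
    err = (theta_pred - theta).abs()
    return torch.where(err <= 1.0, err ** 2, err).mean()

def eikonal_loss(points, sdf_net):
    # Eq. (7): encourage ||grad_x d(x)|| = 1 at sampled points
    points = points.requires_grad_(True)
    d = sdf_net(points)
    grad = torch.autograd.grad(d.sum(), points, create_graph=True)[0]
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()

def minimal_surface_loss(d):
    # Eq. (8): penalize SDF values close to zero to suppress spurious surfaces
    return torch.exp(-100.0 * d.abs()).mean()

# Eq. (9): L_vol = L_adv + 15 * L_view + 0.1 * L_eik + 0.05 * L_surf
```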
Figure 4. Qualitative image and geometry comparisons. We compare our sample renderings and corresponding 3D meshes against state-of-the-art 3D-aware GAN approaches [8, 47, 49, 58]. Note that HoloGAN and GIRAFFE are unable to create a 3D mesh from their representations. Both HoloGAN (a) and GRAF (b) produce renderings of lower quality. The 3D mesh reconstructed from PiGAN's learned opacity fields reveals noticeable artifacts (c). While GIRAFFE (d) produces realistic low-resolution images, the identity of the person often changes with the viewpoint. StyleSDF (e) produces 1024×1024 realistic, view-consistent RGB images while also generating high-quality 3D geometry. Best viewed digitally.
4. Experiments

4.1. Datasets & Baselines

We train and evaluate our model on the FFHQ [36] and AFHQ [12] datasets. FFHQ contains 70,000 images of diverse human faces at 1024×1024 resolution, which are centered and aligned according to the procedure introduced in Karras et al. [34]. The AFHQ dataset consists of 15,630 images of cats, dogs, and wild animals at 512×512 resolution. Note that the AFHQ images are not aligned and contain diverse animal species, posing a significant challenge to StyleSDF.

We compare our method against the state-of-the-art 3D-aware GAN baselines GIRAFFE [49], PiGAN [8], GRAF [58], and HoloGAN [47] on the above datasets by measuring the quality of the generated images, the shapes, and the rendering consistency.

4.2. Qualitative Evaluations

Comparison to Baseline Methods: We compare the visual quality of our images to the baseline methods by rendering the same identity (latent code) from 4 different viewpoints; the results are shown in Figure 4. To compare the quality of the underlying geometry, we also show the surfaces extracted by marching cubes from StyleSDF, Pi-GAN, and GRAF (note that the GIRAFFE and HoloGAN pipelines do not generate shapes). Our method generates superior images as well as more detailed 3D shapes. Additional generation results from our method can be seen in Figures 1 and 3.

Novel View Synthesis: Since our method learns strong 3D shape priors, it can generate images from viewpoints that are not well represented in the dataset distribution. Examples of out-of-distribution view synthesis are displayed in Figure 5.

Video Results: We urge readers to view our project's website, which includes a larger set of results and videos, to better appreciate the multi-view capabilities of StyleSDF.

4.3. Quantitative Image Evaluations

We evaluate the visual quality and the diversity of the generated images using the Fréchet Inception Distance (FID) [26] and the Kernel Inception Distance (KID) [6]. We compare our scores against the aforementioned baseline models on the FFHQ and AFHQ datasets.

All the baseline models are trained following their given pipelines to generate 256×256 images, with the exception of Pi-GAN, which is trained on 128×128 images and renders 256×256 images at inference time. The results, summarized in Table 1, show that StyleSDF performs consistently better than all the baselines in terms of visual quality.
Dataset:   FFHQ    AFHQ
PiGAN      11.04    8.66
Ours        0.40    0.63
Figure 9. Qualitative view-consistency comparison of RGB renderings. We project the rendering from a side view to the frontal view using its corresponding depth map. We compare the reprojection to the frontal-view rendering and compute an error map showing the mean absolute difference in RGB channels (0–255). Our SDF-based technique generates superior depth quality and significantly improves the view-consistency of the RGB renderings. Most of our errors concentrate on the occlusion boundaries, whereas PiGAN's errors spread across the whole subject (e.g., eyes, mouth, specular highlights, fur patterns).
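The reprojection test can be sketched as a depth-based forward warp followed by a masked error map. The snippet below is a minimal illustration with assumed pinhole conventions (z-depth, 3×3 intrinsics, 4×4 extrinsics), not the evaluation script used for the figures:

```python
import torch

def warp_to_frontal(rgb_side, depth_side, K_side, side_to_world, world_to_front, K_front, out_hw):
    """Forward-warp a side-view rendering into the frontal camera using its depth map.
    rgb_side: (H, W, 3) in [0, 1]; depth_side: (H, W) z-depth. Simple z-buffer splat for occlusion."""
    H, W = depth_side.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3).float()
    rays = pix @ torch.inverse(K_side).T                       # camera-space directions (z = 1)
    pts_cam = rays * depth_side.reshape(-1, 1)                 # back-project with the depth map
    pts_world = pts_cam @ side_to_world[:3, :3].T + side_to_world[:3, 3]
    pts_front = pts_world @ world_to_front[:3, :3].T + world_to_front[:3, 3]
    proj = pts_front @ K_front.T
    u = (proj[:, 0] / proj[:, 2]).round().long()
    v = (proj[:, 1] / proj[:, 2]).round().long()
    z = proj[:, 2]
    Ho, Wo = out_hw
    warped = torch.zeros(Ho, Wo, 3)
    zbuf = torch.full((Ho, Wo), float("inf"))
    colors = rgb_side.reshape(-1, 3)
    keep = (u >= 0) & (u < Wo) & (v >= 0) & (v < Ho) & (z > 0)
    for i in torch.nonzero(keep).squeeze(1).tolist():          # naive splat; fine for a sketch
        if z[i] < zbuf[v[i], u[i]]:
            zbuf[v[i], u[i]] = z[i]
            warped[v[i], u[i]] = colors[i]
    return warped, torch.isfinite(zbuf)                        # warped image and coverage mask

# Error map (0-255), evaluated only where the warp provides coverage:
# err = (255.0 * (warped - rgb_front).abs().mean(dim=-1)) * mask
```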
We attached 24 sequences in the supplementary material, featuring view-generation results on the two datasets using two different camera trajectories. For each identity, we provide two videos, one for the RGB rendering and another for the depth rendering. The videos are presented on the project's website.

C.1. Geometry-Aware StyleGAN Noise

Even though the multi-view RGB generation results shown in the main paper look highly realistic, we note that for generating a video sequence, the random noise of StyleGAN2 [36], when naïvely applied to 2D images, can result in severe flickering of high-frequency details between frames. The flickering artifacts are especially prominent for the AFHQ dataset due to the high-frequency textures of the fur patterns.

Therefore, we aim at reducing this flickering by adding the Gaussian noise in a 3D-consistent manner, i.e., we want to attach the noise to the 3D surface. We achieve this by extracting a mesh (at a 128-resolution grid) for each sequence from our SDF representation, attaching a unit Gaussian noise value to each vertex, and rendering the mesh using vertex coloring. Since higher-resolution intermediate features require up to a 1024×1024 noise map, we subdivide the triangle faces of the extracted mesh once for every layer, starting from 128×128.
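A minimal sketch of the surface-attached noise, assuming a simple midpoint subdivision (the text does not specify how noise values are assigned to newly created vertices, so giving each new vertex a fresh unit Gaussian sample is our assumption):

```python
import numpy as np

def subdivide_with_noise(vertices, faces, vert_noise):
    """One midpoint-subdivision step; each new edge-midpoint vertex receives its own
    unit Gaussian noise value, so finer StyleGAN layers see independent surface-attached noise."""
    new_verts, new_noise, edge_mid = list(vertices), list(vert_noise), {}

    def midpoint(a, b):
        key = (min(a, b), max(a, b))
        if key not in edge_mid:
            edge_mid[key] = len(new_verts)
            new_verts.append((vertices[a] + vertices[b]) / 2.0)
            new_noise.append(np.random.randn())        # fresh N(0, 1) sample for the new vertex
        return edge_mid[key]

    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [[a, ab, ca], [ab, b, bc], [ca, bc, c], [ab, bc, ca]]
    return np.asarray(new_verts), np.asarray(new_faces, dtype=np.int64), np.asarray(new_noise)

# Usage: start from the marching-cubes mesh with per-vertex N(0, 1) noise, subdivide once per
# StyleGAN resolution level, and render each mesh with vertex coloring to get that layer's noise map.
```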
Figure 10. View-consistency visualization of high-resolution renderings. We use the side-view depth maps (first rows) to warp the side-view RGB renderings (second rows) to the frontal view (first column). The reprojected pixels that pass the occlusion testing are shown in the third row. We compare the reprojections with the frontal-view renderings and show the per-pixel error maps (fourth rows). Our reprojections well align with the frontal view with errors mostly in the occlusion boundaries and high-frequency details. (Panels: frontal view and side views at φ = ±0.15, ±0.3, ±0.45 and θ = ±0.075, ±0.15, ±0.225.)
FFHQ: We trained FFHQ with an R1 regularization weight of 10. The camera field of view was fixed to 12°, and the azimuth and elevation angles were sampled from normal distributions with zero mean and standard deviations of 0.3 and 0.15, respectively. We set the near and far fields to [0.88, 1.12] and sample 24 points per ray during training. We trained our volume renderer for 200k iterations and then the 2D styled generator.

D.2. Training Details

Sphere Initialization: During our experiments we noticed that our SDF volume renderer can get stuck at a local minimum which generates concave surfaces. To avoid this optimization failure, we first initialize the MLP to generate an SDF of a sphere centered at the origin with a fixed radius. We analytically compute the signed distance of the sampled points from the sphere and fit the MLP to match these distances. We run this procedure for 10k iterations before the main training. The importance of sphere initialization is discussed in Appendix E.
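The sphere initialization can be sketched as a short pre-training loop; a minimal illustration in which the MLP interface, the radius, and the sampling range are assumptions:

```python
import torch

def sphere_init(sdf_mlp, optimizer, n_iters=10_000, radius=0.8, batch_size=10_000):
    """Fit the SDF MLP to the analytic signed distance of a sphere centered at the origin
    before the main adversarial training, to avoid the concave local minimum."""
    for _ in range(n_iters):
        x = torch.empty(batch_size, 3).uniform_(-1.2, 1.2)     # points in the rendering volume
        target = x.norm(dim=-1, keepdim=True) - radius         # analytic SDF of the sphere
        loss = (sdf_mlp(x) - target).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```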
Figure 11. Color consistency visualization with mean faces. We reproject the RGB renderings from the side views to the frontal view (as in Fig. 10). We show the mean reprojections that pass the occlusion testing and their differences to the frontal-view renderings. The mean reprojections are well aligned with the frontal rendering. The majority of the errors are in the high-frequency details, generated from the random noise maps in the StyleGAN component. This demonstrates the strong view consistency of our high-resolution renderings.
Training setup: Our system is trained with a two-stage strategy. First, we train the backbone SDF volume renderer on 64×64 images with a batch size of 24, using the ADAM [39] optimizer with learning rates of 2·10⁻⁵ and 2·10⁻⁴ for the generator and discriminator, respectively, and β₁ = 0, β₂ = 0.9. We accumulate gradients in order to fit the GPU memory constraints. For instance, a setup of 2 NVIDIA A6000 GPUs (a batch of 12 images per GPU) requires the accumulation of two forward passes (6 images per forward pass) and takes roughly 3.5 days to train. We use an exponential moving average of the model during inference.

In the second phase, we freeze the volume renderer weights and train the 2D styled generator with a setup identical to StyleGAN2 [37]. This includes the ADAM optimizer with a 0.002 learning rate and β₁ = 0, β₂ = 0.99, equalized learning rate, lazy R1 and path regularization, a batch size of 32, and an exponential moving average. We trained the styled generator on 8 NVIDIA Tesla V100 GPUs for 7 days.

E. Ablation studies

We perform two ablation studies to show the necessity of the minimal surface loss (see main paper) and the sphere initialization. As can be seen in Figure 12, on top of preventing spurious and non-visible surfaces from being formed, the minimal surface loss also helps to disambiguate between shape and radiance. Penalizing values that are close to zero essentially minimizes the surface area and makes the network prefer smooth SDFs.

In Figure 13, we show the importance of the sphere initialization in breaking the concave/convex ambiguity. Without properly initializing the weights, the network gets stuck at a local minimum that generates concave surfaces. Although concave surfaces are physically incorrect, they can perfectly explain multi-view images, as they are essentially the "mirror" surface. Concave surfaces cause the images to be rendered at the opposite azimuth angle, an augmentation that the discriminator cannot detect as fake. Therefore, the generator cannot recover from this local minimum.

F. Limitations (continued)

As mentioned in the main paper, our high-resolution generation network is based on the implementation of StyleGAN2 [36], and thus might experience the same aliasing and flickering at regions with high-frequency details (e.g., hair), which are recently addressed in Alias-free GAN [35] or Mip-NeRF [5]. Moreover, we observe that the reconstructed geometry for human eyes contains artifacts, characterized by concave, instead of convex, eyeballs. We believe that these artifacts often lead to slight gaze changes along with the camera views. As stated in the main paper, our current implementation of volume rendering during inference uses fixed frontal view directions for the RGB queries c(x, v), and thus cannot express specular highlights that move with the camera.
Figure 12. Minimal surface loss ablation study. We visualize the volume-rendered RGB and depth images from volume renderers trained with and without the minimal surface loss. The depth-map meshes are visualized from the front and side views. Note how a model trained with the minimal surface loss generates smoother surfaces and is less prone to shape-radiance ambiguities, e.g., specular highlights being baked into the geometry.
Figure 13. Sphere initialization ablation study. We visualize volume-rendered RGB and depth images from volume renderers trained with and without sphere initialization. The depth-map meshes are visualized from the front and side views. Note how a model trained without sphere initialization generates concave surfaces.
G. Additional Results

We show an uncurated set of images generated by our networks (Fig. 14).
Figure 14. Uncurated high-resolution RGB images that are randomly generated by StyleSDF.