3D Generative Models A Survey

1
3D Generative Models: A Survey

Zifan Shi*, Sida Peng*, Yinghao Xu*, Andreas Geiger, Yiyi Liao, and Yujun Shen
Abstract—Generative models aim to learn the distribution of observed data by generating new instances. With the advent of neural
networks, deep generative models, including variational autoencoders (VAEs), generative adversarial networks (GANs), and diffusion
models (DMs), have progressed remarkably in synthesizing 2D images. Recently, researchers started to shift focus from 2D to 3D
space, considering that 3D data is more closely aligned with our physical world and holds immense practical potential. However, unlike
2D images, which possess an inherent and efficient representation (i.e., a pixel grid), representing 3D data poses significantly greater
challenges. Ideally, a robust 3D representation should be capable of accurately modeling complex shapes and appearances while
being highly efficient in handling high-resolution data with high processing speeds and low memory requirements. Regrettably, existing
3D representations, such as point clouds, meshes, and neural fields, often fail to satisfy all of these requirements simultaneously.
arXiv:2210.15663v3 [cs.CV] 28 Aug 2023
In this survey, we thoroughly review the ongoing developments of 3D generative models, including methods that employ 2D and 3D
supervision. Our analysis centers on generative models, with a particular focus on the representations utilized in this context. We believe
our survey will help the community to track the field’s evolution and to spark innovative ideas to propel progress towards solving this
challenging task.
Index Terms—Generative modeling, 3D representations, deep learning, unsupervised learning, 3D vision.
✦
1 I NTRODUCTION
T HE rapid advancement of deep learning [1] has revolu-

tionized various computer vision tasks, such as visual
object recognition [2], [3], object detection [4], [5], [6], and
tions, variational autoencoders (VAEs) [16], autoregressive
models (ARs) [24], normalizing flows (NFs) [25], generative
adversarial networks (GANs) [15], and the very recent
image rendering [7], [8], [9]. It has also brought significant diffusion probabilistic models (DPMs) [17] all possess
improvements to our daily lives, enabling autonomous the ability to transform latent variables into high-quality
driving [10], [11], advancements in biological research [12], images. Nowadays, however, the application of generative
and facilitating intelligent creations [13], [14]. models exclusively in 2D falls short in meeting the demands
Among the different techniques, generative model- of several important real-world scenarios which require
ing [15], [16], [17] holds a crucial position in data analysis access to 3D information relevant to modeling the physical
and machine learning. Unlike discriminative models that fo- image formation process. Taking the film industry as a
cus on direct predictions, generative models aim to capture prime example, the need arises to design 3D digital assets
the underlying data distribution to generate new instances. instead of merely producing 2D images, enabling more
Consequently, they require a comprehensive understanding immersive experiences. Existing content creation pipelines
of the data. For instance, while a recognition model might often rely on significant expertise and extensive human
disregard task-irrelevant details (e.g., color) or suffer from effort, resulting in time-consuming and costly processes.
shortcut learning [18], a generative model is expected to Numerous pioneering attempts [26], [27], [28], [29], [30],
model every intricate detail (e.g., object arrangement and [31], as shown in Fig. 2, have been made to explore
texture) to achieve convincing results. From this perspective, automated 3D data generation. However, these studies are
learning a generative model is usually more challenging yet still in their early stages, with much progress yet to be made.
facilitates a range of applications [14], [19], [20], [21] and is One of the key distinctions between 2D generation and
important for generalization and robustness. 3D generation lies in the data format. Specifically, a 2D
The past few years have witnessed the incredible success image can be naturally represented as an array of pixel
of deep generative models [15], [16], [17] in 2D image values, which can be conveniently processed by neural
synthesis [14], [22], [23]. Despite their different formula- networks [2], [3]. On the other hand, there are numerous
3D representations available to depict a 3D instance, such as
point clouds [32], [33], meshes [34], [35], voxel grids [36],
• Z. Shi is with the Department of Computer Science and Engineering, the
Hong Kong University of Science and Technology, Hong Kong SAR. [37], multi-plane images [38], implicit neural representa-
• S. Peng is with School of Software Technology, Zhejiang University, tions [9], and so on. Each representation possesses its own
China. advantages and limitations. For instance, meshes offer a
• Y. Xu is with the Department of Information Engineering, the Chinese
University of Hong Kong, Hong Kong SAR.
compact representation of 3D shapes but are challenging
• A. Geiger is with the Autonomous Vision Group, University of Tubingen to analyze and generate using neural networks due to their
and Tubingen AI Center, Germany. irregular data structure. In contrast, voxel grids are regularly
• Y. Liao is with the College of Information Science & Electronic aligned in 3D space and work well with standard 3D
Engineering, Zhejiang University, China. She is the co-corresponding
author: yiyi.liao@zju.edu.cn convolutional neural networks. However, they are memory-
• Y. Shen is with Ant Group, China. He is the corresponding author: consuming and cannot represent high-resolution 3D scenes
shenyujun0302@gmail.com without sophisticated sparse data structures. Hence, the
* denotes equal contribution. selection of an appropriate representation is crucial in 3D
2
3D representations (Sec 3.2)

Generative models (Sec 3.1)
Input
• Point clouds
• Random noise • GAN
• Voxels Objective functions
• Conditioning (image, • Normalizing flow
• Meshes
text, partial point • VAE • Adversarial loss
• Depth 3D supervision
clouds ...) • Diffusion model • Reconstruction loss
• Neural fields
• Energy-based model (Sec. 4)
• Hybrid representations
.5 n
ec isio
)
(S perv
su
2D
Density Rendering
Color
• Volume rendering
• Surface rendering
• 2D CNN rendering
Voxel grids Point clouds Meshes Neural fields Hybrid representations • Mixed
Fig. 1: 3D generative model pipeline. To synthesize 3D data from the random noise or conditioning signal, previous
methods propose various types of generative models, such as GAN, normalizing flow, VAE, diffusion model, and energy-
based model. Popular representations of 3D data include point clouds, voxels, meshes, depth, neural fields, and hybrid
representations. The generative models are optimized under either the 2D supervision through differentiable rendering or
the 3D supervision.
content generation. 2 S COPE OF THIS SURVEY

This survey primarily focuses on approaches that train
Considering the rapidly growing field of 3D generative
networks to model the data distribution of target 3D sam-
models, this paper offers a comprehensive survey to educate
ples and support sampling for synthesizing 3D represen-
beginners and serve as a reference for practitioners and
tations and 2D images. Additionally, we include methods
experts alike. While there already exist several surveys in
that predict conditional probability distributions based on
the literature investigating generative models [39], [40], 3D
specific inputs, such as images, point clouds, or text. It is
vision [41], [42], [43], [44], as well as generation of 3D
important to note that these conditional generative methods
structures [45] and faces [46], a comprehensive review of
aim to synthesize 3D representations that respect the inputs
3D generative models is still missing. As discussed earlier,
while retaining diversity in their generations. In contrast,
accomplishing such a challenging task involves considering
classical 3D reconstruction methods establish a one-to-one
numerous candidate generative models (e.g., VAEs, and
mapping from inputs to the target 3D representations.
GANs) and representations (e.g., point clouds, and implicit
Readers seeking a review on such approaches can refer
neural representations). To overcome this problem, our
to [41], [47]. Furthermore, this survey does not include
survey categorizes 3D generative models according to 3D
test-time optimization-based 3D generative methods, such
representations and summarizes the general pipelines of
as [48], [49], [50]. While our survey includes methods
these methods, as presented in Fig. 1, Fig. 3, and Fig. 4.
that generate 3D representations for rendering, it does not
Our survey clearly illustrates the suitability of specific 3D
exhaustively cover the field of neural rendering methods,
representations for various types of generative models,
which are thoroughly discussed in [41], [51]. It is worth
taking into account factors such as neural network compat-
mentioning that this survey is complementary to existing
ibility, memory efficiency, and representation capability. In
surveys on 2D generative models [39], [40] and generative
addition, we classify generative models based on the su-
models learning from 3D data [45], as none of the previous
pervision signal and illustrate design choices of generative
methods provides a comprehensive survey of 3D generative
models with or without 3D supervision. Furthermore, we
models learning from both 2D and 3D supervision and
compare different 3D generative models quantitatively in
provides relevant insights.
terms of generation capability and efficiency. We believe
our survey helps the reader to grasp a more comprehensive
understanding of the field of 3D generative models. 3 F UNDAMENTALS
The remainder of this paper is organized as follows: 3.1 Deep Generative Models
Sec. 2 clarifies the scope of this survey. Sec. 3 introduces Generative models strive to learn the underlying data
the fundamentals of the 3D generation task, including distribution in an unsupervised manner, aiming to produce
formulations of various generative models and popular data that is as realistic as possible based on the given
3D representations. Sec. 4 and Sec. 5 summarize existing information. They enable capturing intricate details and
approaches for learning 3D representations from 2D and 3D showcasing creativity. Broadly speaking, generative models
data, respectively. Sec. 6 discusses the downstream appli- can be categorized into two main groups. The first cat-
cations of 3D generative models. Finally, Sec. 7 discusses egory consists of likelihood-based models, which include
unsolved problems and future work in the field of 3D variational autoencoders (VAEs) [16], normalizing flows
generative models. (NFs) [25], diffusion probabilistic models (DPMs) [17], and
3
3D-GAN VON IM-Net PolyGen DiffRF

Brock et al. Achlioptas et al. PointFlow ShapeGF Point-E H RODIN
3D Supersion
2016 2018 2019 2020 2021 2022 2023
2D Supersion
S2-GAN HoloGAN GRAF pi-GAN H EG3D H RenderDiffusion
PlatonicGAN GIRAFFE Get3D
Voxel grid Point cloud Mesh GAN VAE Normalizing Flow

Depth Neural fields H Hybrid Energy-based model Autoregressive Diffusion
3D Representations Generative Models
Fig. 2: 3D generative model timeline. We show representative methods trained with 3D supervision (top) and 2D
supervision (bottom), respectively. Each method is illustrated with its 3D representation and the generative model.
energy-based models (EBMs) [52]. These models learn by Variational Autoencoders.

maximizing the likelihood of the provided data, allowing Deep latent variable models (DLVMs) utilize neural
them to capture the underlying distribution. The second networks to parameterize the data distribution, denoted as
category encompasses likelihood-free models, prominently x ∼ pθ , by incorporating latent variables z sampled from a
Generative Adversarial Networks (GANs) [15]. GANs em- prior distribution z ∼ pz . However, optimizing the desired
ploy a two-player min-max game framework to find a Nash parameter θ of DLVMs presents challenges R due to the in-
equilibrium, leading to the generation of synthetic data. In tractability of the likelihood term pθ (x) = pθ (x|z)pz (z)dz.
the subsequent sections, we will provide a concise overview This intractability makes differentiation and optimization
of various generative models, including their characteristics difficult. Variational Autoencoders (VAEs) address this
and mechanisms. issue by transforming the problem of intractable posterior
Generative Adversarial Networks. inference into a tractable one. VAEs achieve this by em-
Generative Adversarial Networks (GANs), commonly ploying an efficient approximation, denoted as qϕ (z|x), to
referred to as GANs, have gained immense popularity the intractable posterior distribution. This approximation
for their exceptional performance in data synthesis tasks. enables effective learning and inference within the VAE
Typically, a GAN consists of two separate networks: a framework. Concretely, qϕ (z|x) is parameterized with a
generator G(·) and a discriminator D(·). The generator feed-forward model and optimized by minimizing the KL
network takes as input a latent code sampled from a prior divergence between itself and pθ (z|x):
distribution z ∼ pz for synthesizing data. On the other hand,
the discriminator network aims to distinguish between real DKL (qϕ (z|x)||pθ (z|x)) =log(pθ (x)) + DKL (qϕ (z|x)||pθ (z))
data samples x ∼ px and the synthesized data G(z). − Ez∼qϕ (z|x) log(pθ (x|z)). (3)
During the training process, the generator network G(·)
The log likelihood log(pθ (x)) can be re-written into the
strives to synthesize data that appear as realistic as possible,
following form:
attempting to deceive the discriminator network D(·) into
classifying the generated samples as real. Simultaneously, log(pθ (x)) =DKL (qϕ (z|x)||pθ (z|x)) − DKL (qϕ (z|x)||pθ (z))
the discriminator network is trained to accurately label + Ez∼qϕ(z|x) log(pθ (x|z))
synthesized samples generated by G(·) as fake and training
samples x as real. These two networks compete with each ≥ − DKL (qϕ (z|x)||pθ (z)) + Ez∼qϕ(z|x) log(pθ (x|z)),
other and can be formulated into a min-max game. They are (4)
jointly optimized with where the term DKL (qϕ (z|x)||pθ (z|x)) can be eliminated
LD = −Ex∼px [log(D(x))] − Ez∼pz [log(1 − D(G(z)))], (1) since KL divergence is always non-negative. The loss
function of a VAE can be expressed as follows:
LG = −Ez∼pz [log(D(G(z)))]. (2)
LV AE = −DKL (qϕ (z|x)||pθ (z)) + Ez∼qϕ(z|x) log(pθ (x|z)),
With the advent of deep learning, the two networks within
(5)
GANs have progressively been parameterized using deep
neural networks, such as DC-GAN [53]. In recent years, which is commonly known as the evidence lower bound
GAN variants like PG-GAN [54] and StyleGAN1-3 [55], (ELBO) [58].
[56], [57] have emerged, introducing improved architectures In order to optimize this objective function, the repa-
capable of synthesizing highly realistic samples. However, rameterization trick was introduced in [16]. This trick
due to the inherent challenges in optimizing the two- enables the generation of samples z ∼ qϕ (z|x) to be
player game, GANs often face difficulties in training in a differentiable, facilitating the training process. The feed-
stable manner, which can lead to non-convergence issues. forward nature of qϕ (z|x) allows for efficient inference
Additionally, GANs are prone to a phenomenon known as of new samples. Furthermore, the training process of
mode collapse, where the generator maps multiple distinct Variational Autoencoders (VAEs) is generally stable due to
latent codes to the same output, resulting in a lack of the reconstruction-based loss function. This loss function
diversity in the generated samples. encourages the VAE to accurately reconstruct the given data,
4
leading to improved performance and faithful reconstruc- model pθ to model the conditional transition probability,
tions. However, VAEs are susceptible to a phenomenon achieved by optimizing the Evidence Lower Bound (ELBO)
known as posterior collapse, wherein the learned latent in a manner similar to Variational Autoencoders (VAEs).
space becomes uninformative for reconstructing the input Because of the long Markov chain, diffusion models can
data. This can result in a degradation of the generative synthesize high-quality data and allow for stable training.
capacity of the model. Additionally, due to the injected However, it is important to note that the inference of
noise and the inherent imperfections in the reconstruction new samples in diffusion models can be computationally
process, VAEs tend to generate more blurry samples than expensive. The sampling process in diffusion models tends
those produced by GANs, which are known for their ability to be slower than GANs and VAEs.
to produce sharper and more realistic samples. Energy-based model. Energy-based models leverage the
Normalizing Flows. energy function to model the probability distribution of
Both GANs and VAEs utilize parametrized models to the data explicitly. It is built upon a fundamental idea that
implicitly learn the density of data, which prevents them any probability function can be transformed from an energy
from calculating the exact likelihood function for optimizing function by normalizing its volume:
model training. To address this limitation, normalizing
exp(−Eθ (x))
flows alleviate the problem by introducing a set of in- p(x) = R , (12)
vertible transformation functions. These functions enable x exp(−Eθ (x))
transforming a simple distribution, such as a standard where −Eθ (x) is the energy function. Clearly, data points
normal distribution, into the desired probability distribution with a high likelihood correspond to a low energy value,
of the final output. Specifically, it starts from a normal whereas data points with a low likelihood exhibit a high
distribution, and a set of the invertible function f1:N (·) energy value. However, it is difficult to optimize the
sequentially transforms the normal distribution to the
R
likelihood because calculating x exp(−E(x)) for high-
probability distribution of the final output: dimensional data is intractable. Contrastive Divergence is
proposed as a means to mitigate optimization challenges
zi = fi−1 (zi−1 ). (6)
by comparing the likelihood gradient on the true data
Owing to the invertible properties of the fi , the probability distribution p(x) with the gradient on randomly sampled
density function of the new variable zi can be easily data from the energy distribution qθ (x).
estimated from the last step zi−1 :
dfi −1 ∇θ Ex∼qθ (−log(p(x))) = Ex∼p ((Eθ (x))) − Ex∼qθ ((Eθ (x)))
p(zi ) = p(zi−1 )| | , (7)
dzi−1 (13)
dfi
logp(zi ) = logp(zi−1 ) − log| |. (8) The energy distribution qθ (x)
dzi−1 is approximated by Markov Chain Monte Carlo (MCMC)
By applying the chain rule, we can derive the density of the process.
final output zN after N transformations as follows:
N 3.2 3D Representations
X dfi
logp(zN ) = logp(z0 ) − log| |, (9) The computer vision and computer graphics communities
1
dzi−1
have developed diverse 3D scene representations, such
where the full chain consisting of zi is commonly known as voxel grids, point clouds, meshes, and neural fields.
as normalizing flow. Thanks to their invertible property, Each of these representations has its own advantages and
normalizing flows offer versatility in tasks such as novel limitations when it comes to the task of 3D generation.
sample generation, latent variable projection, and density In the subsequent sections, we will present the for-
value estimation. These flows enable straightforward uti- mulations of widely used 3D representations along with
lization in various scenarios. their notable works. This background information will
However, it struggles with balancing the parameterized lay the foundation for a comprehensive analysis of these
model’s capacity and efficiency. representations in the context of 3D generation tasks. An
Diffusion Models. Diffusion models [17] are parameterized overview of these 3D representations can be found in
by a Markov chain, which gradually adds noise to the input Figure 1, providing a visual depiction of their characteristics.
data x0 by a noise schedule β1:T with T denoting the time Furthermore, a comparison of them for 3D generation with
steps. Theoretically, when T → ∞, xT is a normal Gaussian regard to time efficiency, memory efficiency, representation
distribution. capability, and neural network compatibility is shown in
p Tab. 1.
q(xt |xt−1 ) = N (xt ; 1 − βt xt−1 , βt I), (10) Voxel grids. Voxels are Euclidean-structured data that is
T
Y regularly placed in 3D space, akin to pixels in 2D space.
q(x1:T |x0 ) = q(xt |xt−1 ). (11) They serve as a representation for 3D shapes and can store
t=1 various types of information, such as geometry occupancies
The reverse of the diffusion process is learned to reconstruct [36], [37], volume densities [59], [60], or signed distance
the input by modeling the transition q(xt−1 |xt ) from values [61].
the noise to the data. However, the posterior inference Thanks to the regularity of voxel grids, they work
q(xt−1 |xt ) is intractable. This is resolved using a parametric well with standard convolutional neural networks and are
5
TABLE 1: Comparison of different representations with regard to time efficiency, memory efficiency, representation
capability, and NN (neural network) compatibility. A larger number of stars indicates better performance.
3D Representation Point Clouds Voxel Grids Depth/Normal Maps Neural Fields Mesh Hybrid
Time Efficiency ⋆⋆⋆ ⋆ ⋆⋆⋆⋆⋆ ⋆ ⋆⋆⋆⋆ ⋆⋆⋆
Memory Efficiency ⋆⋆ ⋆⋆ ⋆⋆⋆⋆ ⋆⋆⋆⋆⋆ ⋆⋆⋆ ⋆⋆⋆
Representation Capability ⋆⋆⋆ ⋆⋆ ⋆ ⋆⋆⋆⋆⋆ ⋆⋆⋆⋆ ⋆⋆⋆⋆⋆
NN Compatibility ⋆⋆ ⋆⋆⋆⋆⋆ ⋆⋆⋆⋆⋆ ⋆⋆⋆⋆⋆ ⋆ ⋆⋆⋆⋆⋆
widely used in deep geometry learning. As a pioneer, Since the rendered images tend to contain holes, [75] uses
3D ShapeNets [36] introduces voxel grids into 3D scene 2D CNNs to refine images. Some methods [76], [77] have
understanding tasks. It converts a depth map into a 3D developed differentiable renderers for point clouds, which
voxel grid, which is further processed by a convolutional can optimize not only for point positions but also for colors
deep belief network. 3D-R2N2 [62], differently, uses 2D and opacity of points. To increase the modeling capacity of
CNNs to encode the input image into the latent vector and point clouds, [78], [79], [80], [81] attempt to anchor high-
utilizes a 3D convolutional neural network to predict the dimensional feature vectors to points and project them to
target voxel grid. Although voxel grids are well suited to 3D feature maps for the latter rendering.
CNNs, processing voxels with neural networks is typically
Meshes. Polygonal meshes are non-Euclidean data that
memory inefficient. Hence, [63], [64], [65], [66] introduce
represent shape surfaces with a collection of vertices, edges,
the octree data structure for shape modeling. Voxel grids
and faces. In contrast to voxels, meshes focus solely on
also have many applications in rendering tasks. Early
modeling the surfaces of 3D scenes, making them more
methods [67], [68] store high-dimensional feature vectors
compact representations. Compared to point clouds, meshes
at voxels to encode the scene geometry and appearance,
provide explicit connectivity information between surface
which will be interpreted into color images using projection
points, enabling the modeling of relationships among
and 2D CNNs. Neural Volumes [69] uses CNNs to predict
points. Because of these advantages, polygonal meshes are
RGB-Alpha volumes and synthesizes images with volume
widely used in traditional computer graphics applications,
rendering techniques [70]. Multi-plane image (MPI) [38] can
such as geometry processing, animation, and rendering.
be regarded as a variation of voxel grids where the 3D space
However, applying deep neural networks to meshes is
is partitioned into several depth planes, each associated
more challenging than to point clouds because mesh edges
with an RGB-alpha image. By reducing the number of
need to be taken into consideration in addition to vertices.
voxels along the depth dimension, MPI-based methods offer
[34], [82] parameterize 3D shape surfaces as 2D geometry
computational advantages by reducing the computational
images and process geometry images with 2D CNNs.With
cost to some extent.
the advancement of graph neural networks, [83], [84],
[85], [86] propose to regard meshes as graphs. Generating
Point clouds. A point cloud is an unstructured set of points
meshes using networks is equally challenging, mirroring
in 3D space, representing a discretized sampling of a 3D
the complexities encountered in mesh analysis. The task
shape’s surface. Point clouds are commonly generated as
necessitates predicting not only the vertex positions but
direct outputs from depth sensors, making them widely
also the underlying topology. [35], [87], [88], [89] pre-
used in various 3D scene understanding tasks. Depth
define a mesh with fixed connectivity and predict vertex
and normal maps can be considered as special cases of
displacements to deform the mesh to produce the target
point cloud representation. Although they are convenient to
shape. In the rendering pipeline of traditional computer
obtain, the irregularity of point clouds makes them difficult
graphics, both software and hardware have been heavily
to be processed with existing neural networks that are
optimized for rendering meshes. Some differentiable mesh
designed for regular grid data (e.g., images). Moreover, the
renderers [90], [91], [92] leverage the advances of classical
underlying 3D shape could be represented by different point
rendering techniques and design the back-propagation
clouds due to the sampling variations. Many methods [32],
process to update some properties (e.g., colors) defined on
[33], [71], [72], [73], [74] have attempted to effectively and
meshes. To improve the rendering quality, a strategy is
efficiently analyze 3D point clouds. PointNet [32] leverages
storing appearance properties on shape surfaces, which are
MLP networks to extract feature vectors from point sets
parameterized as texture maps. Learning-based methods
and summarizes features of all points via max-pooling
[93], [94] define learnable feature vectors in texture maps,
for point order invariance. PointNet++ [33] hierarchically
which are decoded into color images with a 2D renderer.
groups the point clouds into several sets and separately
processes local point sets with PointNet, which captures Neural fields. A neural field is a continuous neural implicit
the local context of point clouds at multiple levels. Some representation that encompasses the complete or partial
methods [73], [74] reformulate point clouds as other types of depiction of scenes or objects using a neural network.
data structures (e.g., graphs and sparse voxels) and attempt For each position in 3D space, the neural network maps
to exploit neural networks in other fields. To synthesize its related features (e.g., coordinate) to the attributes (e.g.,
images with point clouds, a naive way is storing colors an RGB value). Neural fields [7], [8], [9], [116], [117],
on points and rendering point clouds using point splatting. [118], [119] are able to represent 3D scenes or objects in
6
TABLE 2: Properties of representative 3D generative models. Units for “FLOPS” and “#Params.” are in G and M,
respectively. We calculate the two metrics for methods implemented using a PyTorch [95] implementation.
(a) Methods using 3D supervision. (b) Methods using 2D supervision.

Method 3D Representation FLOPS #Params. Shape Resolution Method 3D Representation FLOPS #Params. Image Resolution
PQ-Net [96] Voxel grid 0.02 13.62 64 × 64 × 64 Shi et al. [107] Depth map 63.24 55.81 256 × 256
AutoSDF [61] Voxel grid 997.41 2941.10 64 × 64 × 64 Schwarz et al. [29] Neural field 177.03 0.68 64 × 64
PointFlow [97] Point cloud 105.89 1.06 2048 Niemeyer et al. [31] Neural field 5.21 0.75 256 × 256
Hui et al. [98] Point cloud 25.79 12.71 2048 Chan et al. [30] Neural field 842.42 1.91 256 × 256
ShapeGF [99] Point cloud 413.85 4.17 2048 Pan et al. [108] Neural field 1243.66 1.91 256 × 256
SoftFlow [100] Point cloud 5.26 6.50 2048 Niemeyer et al. [109] Neural field 183.86 0.36 128 × 128
SP-GAN [101] Point cloud 1.64 0.59 2048 Xu et al. [110] Neural field 621.53 1.91 256 × 256
Generative PointNet [102] Point cloud 91.86 1.39 2048 Sun et al. [111] Neural field 2094.73 2.70 256 × 256
Luo et al. [103] Point cloud 67.33 3.30 2048 StyleNeRF [112] Neural field 13.49 7.63 1024 × 1024
IM-Net [28] Neural field 799.70 3.05 256 × 256 × 256 GRAM [113] Neural field 977.91 1.95 256 × 256
Deng et al. [104] Neural field 209.11 0.10 256 × 256 × 256 GIRAFFE HD [114] Neural field 35366.12 12.78 1024 × 1024
gDNA [105] Neural field 425.90 0.87 256 × 256 × 256 VolumeGAN [59] Hybrid 32.52 8.67 256 × 256
ShapeAssembly [106] Program 0.02 1.41 - EG3D [115] Hybrid 25.78 30.60 512×512
arbitrary resolution and unknown or complex topology due and topology with relatively low memory consumption.
to their continuity in representation. In addition, in com- However, they are usually parameterized with MLP layers
parison to the aforementioned representations, the storage and output the attribute for each coordinate, suffering from
requirements are limited to the parameters of the neural small receptive fields. Consequently, explicit supervision on
network, resulting in reduced memory consumption when surfaces is challenging, and optimization becomes difficult.
compared to alternative representations. To render an image Researchers leverage the advantages of each representation
from a neural field, there are two streams of techniques - type to compensate for the drawbacks of the other. Some
surface rendering and volume rendering. Surface rendering works [122], [125], [128] integrate voxel grids into neural
[120], [121] utilizes an implicit differentiable renderer to fields to accelerate the training and rendering processes.
trace viewing rays and determine their intersection points The features of points for differentiable rendering are
with the surface. Subsequently, the network is queried to interpolated from features of voxel grids. These represen-
obtain the RGB values associated with these intersection tations sacrifice memory consumption for rendering speed.
points, which are then used to generate a 2D image. MINE [129] merges neural fields with multi-plane images,
While surface rendering-based methods demonstrate strong which is a smaller representation than voxel grids but
performance in representing 3D objects and rendering 2D suffers from a limited viewing range. EG3D [115] uses
images, they often necessitate per-pixel object masks and tri-planes to boost the model capacity of neural fields.
meticulous initialization to facilitate optimization towards Such representations consume less memory than voxel-
a valid surface. This requirement arises from the fact based neural fields, and they allow fast rendering at the
that surface rendering solely offers gradients at the points same time. [130] and [131] build a neural field on a point
where rays intersect with the surface, posing challenges in cloud. [130] interpolates point features from K neighboring
network optimization. Volume rendering [70], in contrast, points, while [131] uses a hyper-network that takes in
is based on ray casting and samples multiple points a point cloud and then generates the weights of NeRF
along each ray. It has shown great power in modeling network. NeuMesh [132] presents mesh-based neural fields
complex scenes. NeRF [9] and its following works adopt by encoding geometry and texture codes on mesh vertices,
such differentiable volume rendering to render 2D images enabling the manipulation of neural fields through meshes.
from 3D representations, allowing gradients to propagate [133], in contrast, combines two explicit representations,
through the renderer. However, sampling a set of points mesh and voxel grids. The proposed deformable tetrahedral
along all rays may lead to low rendering speed. Recent mesh representation optimizes both vertex placement and
works focus on acceleration via various techniques, such occupancy. This representation achieves both memory and
as pruning [122], improved integration [123], and carefully- computation efficiency.
designed data structures [124], [125], [126], [127].
Hybrid representation. Given the respective advantages
and disadvantages of each representation, hybrid represen-
4 L EARNING FROM 3D DATA
tations have been proposed as a means to complement and With the availability of 3D data, a majority of recent 3D
combine their strengths. Many of these hybrid representa- generative methods focus on training deep neural networks
tions primarily concentrate on the fusion of explicit and to effectively capture the distributions of 3D shapes. Unlike
implicit representations. Explicit representations provide 2D images, 3D shapes can be represented in various ways,
explicit control over the geometry. On the other hand, such as voxel grids, point clouds, meshes, and neural
they are restricted by the resolution and topology. Implicit fields. Each of these representations possesses distinct
representations allow for the modeling of complex geometry advantages and disadvantages when applied to the task of
7
(a) Point cloud generation (b) Voxel grid generation

3D voxel grids
Point clouds
Random vector
Point cloud Random vector
z Voxel grid
generator z
generator
GAN models, VAE models

Sampled points Transformed points Transformed points
(c) Neural field generation
Random vector
Network Network z Point value
Neural field
(occupancy, signed
Point coordinate generator distance)
Normalizing flow models

(d) Mesh generation
Sampled points Denoised points Multi-chart
Random vector
Image-space
Network Point offsets + z
generator
Multi-chart representation
Denoised points Random vector
z Deformed mesh
+ Point offsets Network Template mesh Deformation Network
Diffusion models Template-based representation
(e) Generation of hybrid representations
Random vector Point value

Random vector Point value z Network MLP (occupancy, signed
z Network MLP (occupancy, signed
distance)
distance)
Feature volume Tri-plane feature maps
Fig. 3: Representative 3D generative models. We present some classical pipelines for generating (a) point clouds, (b) voxel
grids, (c) neural fields, (d) meshes, and (e) hybrid representations. Some figures are taken from [26], [97], [101], [103], [134],
[135], [136], [137]. We only present some of representative methods. Please refer to Sec. 4 for more variants.
3D generation from 3D data. Several factors come into play Wu et al. [26] adopt the architecture of generative ad-
when evaluating the compatibility of a 3D representation versarial networks to process 3D voxel grids. The generator
with deep generative models. These factors include the maps a high-dimensional latent vector to a 3D cube, which
ease of network processing for a given representation, the describes the synthesized object in voxel space. In contrast
ability to efficiently generate high-quality and intricate 3D to 2D GANs, the generator and discriminator in 3D GANs
shapes, and the costs associated with obtaining supervision are constructed using a sequence of 3D convolution blocks
signals for the generative models. Assessing these aspects to enable the processing of 3D voxel data. In practice, they
is crucial to determine the suitability of a particular 3D construct an encoder to map the 2D image into the latent
representation for successful integration with deep genera- space of its corresponding generator, closely resembling the
tive models. Fig. 3 summarizes the representative pipelines VAE-GAN [143]. In addition to the conventional adver-
on 3D-supervision-based methods, and Tab. 3 outlines the sarial loss, they incorporate two components for encoder
representative methods. Tab. 2a summarizes the properties training: a reconstruction loss and a KL divergence loss.
of the representative methods. These additional elements serve to constrain the output
distribution of the encoder. Owing to its VAE-GAN-like
design, the proposed model can also be leveraged to recover
4.1 Voxel Grids 3D shapes from the 2D observation. They also demonstrate
that the discriminator learned without any supervision can
Voxel grids are usually seen as images in 3D space. To
be successfully transferred to several 3D downstream tasks
represent 3D shapes, voxels could store the geometry
with good performance. Considering the training of GAN
occupancy, signed distance values, or density values, which
models is unstable, [138] attempts to train a variational
are implicit surface representations that define the shape
auto-encoder to model the shape distribution. It first uses
surface as a level-set function. Thanks to the regularity
an encoder network consisting of four 3D convolutional
of its data structure, the voxel grid is one of the earliest
layers and a fully connected layer to map the input voxel
representations being used in deep learning techniques for
into a latent vector and then the 3D decoder with an
3D vision tasks, such as 3D classification [36], [37], [138],
identical but inverted architecture to transform a latent
3D object detection [139], [140], and 3D segmentation [141],
vector into a 3D voxel. The downsampler in the encoder
[142].
8
and upsampler in the decoder are implemented by stride octree as a hierarchical compact representation for the voxel
convolutions. The objective function for training consists grid, where they convert the octree into a sequence by the
of two parts: one is KL divergence on the latent codes, traversal order. Besides, an adaptive compression scheme
and the other is a Binary Cross-Entropy (BCE) for voxel is used to decrease the sequence length to improve the
reconstruction. They incorporate modifications to the BCE generation efficiency.
loss to avoid gradient vanishing. Despite its ability to On the one hand, similar to 2D images, voxel grids are
handle dense objects, this method exhibits a limitation in in an Euclidean structure, which works well with 3D CNNs.
generating smooth rounded edges, akin to the behavior On the other hand, the voxel grid typically consumes much
observed in 2D VAEs, which often leads to the generation computation cost because the number of voxel elements
of blurry voxels. To solve these problems, [61] introduces grows cubically with resolution. Although the octree for
VQ-VAE to model the data distribution. Different from voxel grids can reduce a lot of computational cost, it cannot
VAE, only using a single vector serving as a latent for be processed by neural networks very efficiently due to its
input, they use VQ-VAE to project the high-dimensional non-grid structure.
3D shape into a lower-dimensional discrete latent space
which is optimized during training, not fixed like VAE. Once
the latent space is well trained, they use a transformer to 4.2 Point Clouds
autoregressively model the non-sequential data. Specifically, Since point clouds are the direct outputs of depth scanners,
they maximize the likelihood of the latent representations they are widely used in scene understanding tasks. Leverag-
using randomized orders for an autoregressive generation. ing generative models to model data priors of specific point
A well-trained autoregressive model holds the potential for cloud datasets can benefit various downstream computer
diverse downstream applications, such as shape comple- vision tasks [147], [148]. In contrast to voxel grids, point
tion and text-guided shape generation. The incorporation clouds are an explicit surface representation and directly
of conditional information into the autoregressive model characterize shape properties, which has two advantages.
enables straightforward fusion with the model, facilitating First, deep neural networks usually process point clouds
these applications efficiently. with less GPU memory. Second, they are suitable for some
In addition to modeling the overall shape of the object, generative models, such as normalizing flows and diffusion
some works endeavor aim to achieve more detailed and models. Despite the two advantages, the irregularity of
fine-grained shape generation. SAGNet [144] proposes to point clouds makes networks difficult to analyze and
use an autoencoder to jointly learn the geometry of the generate them.
part and the pairwise relationship between different parts. As the pioneer work, Achlioptas et al. [27] exploit
It uses a two-way encoder to independently extracts the generative adversarial networks to learn the distributions
features of both geometry and structure. GRU and attention of 3D point clouds. It proposes a raw point cloud GAN (r-
modules are also incorporated into the encoder to exchange GAN) and a latent-space GAN (l-GAN). The generator of
the geometric and structural information encoded by two the r-GAN is an MLP network with 5 fully connected layers,
independent encoders and summarize the input into a latent which maps the randomly sampled noise vector to the point
code. The architectural design allows for the disentangled cloud with 2048 points. The corresponding discriminator
control of the object structure. Li et al. [145] also propose to uses PointNet [32] as the network backbone. Achlioptas et
model 3D shape variations at the part level. In addition, they al. [27] found that r-GAN has difficulty in generating high-
explore the automated assembly of various parts to form a quality point clouds. A plausible reason is that GANs are
complete 3D shape. Specifically, they first learn a part-wise hard to converge to a good point. To overcome this problem,
generative network that consists of K part generators, where they present a novel training framework that trains a
K is the number of parts. The part generator is built upon generative adversarial network to model the latent space
3D VAE-GAN, which is very similar to [26]. The difference of a pre-trained auto-encoder, which is called l-GAN. The
is that Li et al. [145] additionally introduce a reflective l-GAN delivers much better performance than r-GAN. The
symmetry loss to encourage the symmetry property of the GAN generator of point clouds is typically implemented as
generated object. To assemble the synthesized parts with fully-connected networks, which cannot effectively leverage
different scales and positions, they propose a part assembler local contexts for producing point clouds. To solve this
to regress the transformation matrices for each part. Since problem, some methods [98], [149], [150], [151] propose
the assembling solution is not unique for the given parts, to construct the GAN generator based on the graph
they define a fixed anchor part while the remains are convolution. For example, given a sampled latent vector,
required to learn the transformation matrices to match the Valsesia et al. [149] first use an MLP network to predict a set
anchor part. PQ-Net [96] designs a sequence-to-sequence of point features, which is taken as a graph and processed by
(Seq2Seq) network for the part assembly. Thanks to the a graph convolutional network. When upsampling the point
Seq2Seq modeling, its network demonstrates impressive cloud, it applies the graph convolution to point features
performance for several tasks, including shape generation to obtain new feature vectors, which are concatenated
and single-view 3D reconstruction. to the original point features to produce the upsampled
Although the above approaches perform well on low- point set. Another challenge of point cloud generation
resolution voxel grids, they struggle to handle high- is that synthesizing high-resolution point clouds easily
resolution voxel grids that contain fine-grained details consumes a lot of memory. Ramasinghe et al. [152] reduce
due to the cubic growth in computational complexity. To the computational complexity by adopting a GAN model in
alleviate these issues, Ibing et al. [146] attempt to use the the spectral domain. It represents point clouds as spherical
9
harmonic moment vectors and regresses these vectors from [155], [156], [157] formulate the generation of point clouds
sampled latent vectors with MLP networks. as a reverse diffusion process, which is in contrast to the
While GAN-based methods have demonstrated remark- diffusion process in non-equilibrium thermodynamics. The
able generation performance, the inherent instability in their reverse diffusion process is implemented as a Markov chain
training process has prompted researchers to investigate that transforms the distribution of points from the noise
alternative types of generative models. Zamorski et al. distribution to the target distribution. Each transition step is
[153] extend the variational auto-encoder model (VAE) and instantiated as an auto-encoder, which takes the point cloud
adversarial auto-encoder model (AAE) to the 3D domain. as input and outputs the displacements of points.
The encoder is implemented as a PointNet-like network, PointGrow [158] develops an auto-regressive model that
and the decoder is an MLP network. Since point clouds recurrently predicts new points conditioned on previously
are irregular, the reconstruction loss for the auto-encoder generated points. It designs self-attention modules to cap-
is implemented as set-to-set matching loss, such as Earth ture long-range dependencies. However, the order of points
Mover’s distance and Chamfer distance. Similar to the is hard to define. PointGrow [158] sorts points based on the
2D VAE model, the 3D VAE model also employs the KL points’ z coordinates. Xie et al. [102] propose a deep energy-
divergence to supervise the latent space. Given the potential based model for synthesizing point clouds. The short-run
intractability of the KL divergence, the AAE model adopts MCMC is adopted as the point generator. The energy model
an alternative approach by learning the latent space through is implemented as a PointNet-like network, which predicts
adversarial training. Zamorski et al. [153] empirically find scores for generated point clouds.
that the 3D AAE model outperforms the 3D VAE model in Recently, some methods [101], [159] attempt to attain
terms of performance. part-level controllability over the generated shapes. MR-
Due to the non-Euclidean data structure of point clouds, GAN [159] proposes a multi-root GAN that consists of
the GAN-based and AE-based generative models mostly multiple branches, and each branch map a sampled latent
use MLP networks or graph convolutional networks to vector to a set of points, which are concatenated into the
map latent vectors to point clouds, which can typically final point cloud. After training, the parts of generated
produce a fixed number of points. This significantly limits objects can be edited by revising the corresponding latent
their modeling ability. Even for shapes within the same vectors. SP-GAN [101] adopt the sphere as prior for the
category, their complexity could require different numbers part-aware shape generation. It defines a fixed point cloud
of points. To overcome this problem, PointFlow [97] models on the unit sphere and anchors the sampled latent vector to
point clouds as a distribution of distributions and intro- each point, which is then fed into a generator to produce
duces a normalizing flow model to generate point clouds. the shape. The initial point cloud acts as a guidance to
Specifically, PointFlow first samples a set of points from a the generative process and provides dense correspondences
generic prior distribution, such as standard Gaussian. Then, between generated shapes, naturally enabling part-level
it samples a latent vector from the shape distribution that controllability. Changing latent vectors of some points leads
encodes the shape information and feeds the vector into to the modification of the associated part.
a conditional continuous normalizing flow (CNF), which Point clouds have been adopted in many types of
produces a vector field to move the sampled points to generative models to synthesize shapes. Although previous
generate the shape. PointFlow assumes that modeling shape methods have achieved impressive performance in shape
prior as the Gaussian distribution limits the performance generation, high-resolution shapes are still difficult to
of VAE models and thus uses an additional CNF to obtain. The reason is that modeling high-resolution shapes
model the shape prior. Since the Neural ODE solver in requires a significant number of points, which will consume
continuous normalizing flow is computationally expensive, a large amount of GPU memory.
Klokov et al. [154] propose to adopt the affine coupling
layers to build discrete normalizing flows, resulting in a
significant speedup. Another challenge encountered in flow- 4.3 Neural Fields
based models is their potential failure when the dimensions Neural fields use neural networks to predict properties for
of the data and target distributions do not match. The any point in the 3D space. Most of them adopt MLP net-
point clouds typically lie on 2D manifolds, while a generic works to parameterize 3D scenes and can model shapes of
prior distribution is defined over the 3D space, making arbitrary spatial resolutions in theory, which is more mem-
flow-based models struggle to transform point clouds to ory efficient than voxel grids and point clouds. Although
match the prior distribution. To solve this issue, SoftFlow neural representations have superiority in shape modeling,
[100] perturbs point clouds with sampled noise and uses it is not straightforward to apply common generative
a conditional normalizing flow model to map perturbed models such as GANs on these representations, due to the
points to the latent variables. lack of ground-truth data in neural representation format
Recent methods [99], [103], [155], [156], [157] model the and the difficulty of processing neural representations with
point cloud generation as a denoising process and train neural networks directly. To overcome this problem, some
a model to output vector fields to gradually move points methods use auto-decoders [160] to model the distribution.
from a generic prior distribution. ShapeGF [99] regards gDNA [161] adopts an auto-decoder for dynamic human
the shape as a distribution and assumes that points on generation. A 3D CNN-based feature generator first pro-
the shape surface have high densities. For any 3D point, cesses the shape latent code into a feature volume, which is
it trains a network to predict the point’s gradient that is further decoded into occupancy values and feature vectors
used to move the point to the high-density area. [103], through an MLP network. Deng et al. [104] aim to preserve
10
shape correspondences when generating shapes. An MLP due to two factors. First, meshes are non-Euclidean data
network is used to represent a template signed distance field and cannot be directly processed by convolutional neural
that is shared among all instances. The deformation and networks. Second, mesh generation requires synthesizing
correction fields are modeled by another two MLPs in the meshes with plausible connections between mesh vertices,
template space. DualSDF [162] learns a shared latent space which is difficult to achieve.
to enable semantic-aware shape manipulation. A sampled To avoid handling the irregular structure of meshes,
latent code will be processed by two networks that handle Ben-Hamu et al. [134] propose an image-like representation
different levels of granularity. One is with SDF that can called multi-chart structure to parameterize the mesh. They
capture fine details, while the other one is with simple shape define a set of landmark points on a base mesh, and each
primitives to represent a coarse shape. Two reconstruction triplet of landmark points corresponds to a function that
losses are calculated between the given shapes and the establishes correspondences between a chart on the image
generated shapes of two representations. domain and the shape surface. By representing meshes
To generate implicit fields based on generative adver- with multi-chart structures, this approach can utilize well-
sarial networks, some methods discriminate the generated developed image GAN techniques to generate shapes. Nev-
implicit fields either in the latent space or with the ertheless, the parametrization trick necessitates a congruent
converted explicit representations. Chen et al. [28] apply topology between the generated meshes and the base mesh
the discriminator on the latent space. An auto-encoder employed in defining the multi-chart structure. To make the
is first used to learn the features from a set of shapes, generation process easier, SDM-Net [175] registers training
where the encoder can be a 3D CNN or PointNet [32], meshes with a unit cube mesh, which enables them to
and the decoder is parameterized with an MLP network. have the same connectivity as the template cube mesh.
Then, latent-GANs [27], [163] are employed to train on SDM-Net utilizes the variational auto-encoder to learn the
the features extracted by the pre-trained encoder. Ibing et distributions of meshes, where the encoder and decoder
al. [164] also leverage latent-GANs but propose a hybrid are implemented with convolutional operators defined on
representation as the intermediate feature for learning to meshes. Based on SDM-Net [175], TM-Net [174] additionally
enable spatial control of the generated shapes. The latent defines a texture space on the template cube mesh and
representation combines voxel grids and implicit fields, and uses a CNN-based VAE to synthesize texture maps, which
therefore each cell covers a part of the shape. Kleineberg are combined with generated meshes to produce textured
et al. [165] generate signed distance fields and design two meshes. PolyGen [204] attempts to synthesize the connectiv-
types of discriminators, voxel-based (e.g., 3D CNN) and ity of meshes based on an auto-regressive generative model.
point-based (e.g., PointNet), for the training. In voxel-based It develops a transformer-based network to sequentially
cases, a fixed number of points are fed into the generator to generate mesh vertices and faces. Similar to PointGrow
query signed distance values. While in point-based cases, [158], PolyGen sorts the mesh vertices along the vertical
points of arbitrary sequence can be queried for signed axis and leverages a vertex transformer to generate vertices.
distance values. SurfGen [166] develops a differentiable Then, mesh faces are predicted using a face transformer
operator that extracts surface from implicit fields through that is conditioned on generated mesh vertices. Liu et al.
marching cubes [167] and then performs differentiable [176] parameterize 3D meshes with tetrahedral grids, and
spherical projection on the surface, which is an explicit each grid is associated with a deformation offset and an
shape representation. A spherical discriminator operates SDF value. Such a representation is treated as the input
on explicit surfaces to supervise the learning of the shape of diffusion models to model the underlying distribution.
generator. Lyu et al. [177], in contrast, introduce point clouds as
With the development of diffusion models, it is inter- the intermediate representation and leverage point cloud
esting to apply powerful generative models to learn the diffusion model for shape generation.
distributions of neural fields. [168] starts the initial attempt Representing shapes as structured computer programs is
by representing data using an implicit neural representation an attractive direction, as programs guarantee the produc-
and learning a diffusion model directly on the modulation tion of high-quality geometries and are editable by users.
weights of the implicit function. However, blurry results are ShapeAssembly [106] proposes to create programs using a
generated compared to the results from GAN-based meth- VAE model. To construct the training dataset, it turns 3D
ods. [169] and [170] leverage auto-encoders to compress SDF shapes into programs and develops a sequence VAE for
representation to a latent representation and then model the learning. To analyze the input program, the encoder utilizes
distribution on latent space using diffusion models. Some MLP networks to extract features from each program line
works [171], [172] represent neural fields with triplanes and and fuse them into a latent vector with a GRU module.
obtain the triplane representations from multi-view dataset During program generation, the decoder uses the GRU to
first. Then, diffusion models are adopted to model the sequentially predict features for each line, which are then
distribution of these triplane representations. mapped to the program line using MLP networks.
4.4 Meshes and Other Representations

Mesh is one of the most used representations in traditional 5 L EARNING FROM 2D DATA
computer graphics. It is also usually taken as the target The goal of learning 3D representation from 2D data is
object in many 3D modeling and editing softwares. Despite mostly to explicitly control the camera viewpoint when
the mesh’s popularity in traditional applications, it is chal- synthesizing images [205], [206], [207], [208], [209]. Al-
lenging to apply deep generative models to mesh generation though 2D GAN-based models deliver impressive results,
11
TABLE 3: Representative 3D generative models categorized by supervision and 3D representation. Methods in green
model 3D representation in a compositional way. Orange denotes that the method supports controllability through
semantic maps, while those in pink support relighting. Methods in blue enable control through human pose. Note that
generative models, such as energy-based models and normalizing flows, are not included as this table focuses on the most
frequently used types of 3D generative models (GANs, VAEs, and Diffusion Models).
Generative Model
Supervision 3D Representation
GAN VAE Diffusion
Achlioptas et al. [27] , Shu et al. [150] , Valsesia et al. [149] , ShapeGF [99] ,
Point Clouds Spectral-GAN [152] , Hui et al. [98] , Arshad et al. [151] , Zamorski et al. [153] Luo et al. [103] , PVD [155] ,
SP-GAN [101] , MRGAN [159] lion [156] , PSF [157]
Wu et al. [26] , Ibing et al. [146] , SAGNet [144] , PQ-Net [96] , AutoSDF [61] ,
Voxel Grids DiffRF [173]
PAGENet [145] Brock et al. [138]
3D Dupont et al. [168] ,
IM-Net [28] , Kleineberg et al. [165] , Ibing et al. [164] , Diffusion-SDF [169] ,
Neural Fields -
gDNA [161] , SurfGen [166] 3D-LDM [170] , Rodin [171] ,
Shue et al. [172]
TM-Net [174] , MeshDiffusion [176] ,

Mesh Ben-Hamu et al. [134]
SDM-NET [175] SLIDE [177]
Depth/Normal Maps S2 -GAN [178] , RGBDGAN [179] , DepthGAN [107] - -
VoxGRAF [60] , HoloGAN [68] , VON [180] ,

Voxel Grids - -
NGP [181] , PLATONICGAN [182] , BlockGAN [183]
GRAF [29] , π -GAN [30] , CAMPARI [109] , CIPS-3D [184] ,

GOF [110] , StyleNeRF [112] , StyleSDF [185] , GRAM [113] ,
MVCGAN [186] , GeoD [187] , GRAM-HD [188] , EpiGRAF [189] ,
2D PoF3D [190] , GSN [191] , Pix2NeRF [192] , Disentangled3D [193] ,
Neural Fields - -
3D-GIF [194] , NeRF-VAE [195] , LOLNeRF [196] ,
DisCoScene [197] , GIRAFFE [31] , GIRAFFE HD [114] ,
ShadeGAN [108] , Volux-GAN [198] , 3D-SGAN [199] ,
FENeRF [111] , IDE-3D [200]
GAUDI [202] ,
Hybrid EG3D [115] , VolumeGAN [59] , Next3D [201] -
RenderDiffusion [203]
finding a reasonable direction in the latent space is not easy incorporating 3D representation for 2D image generation are
and usually cannot support full control of the rendering quality and efficiency, where quality includes image realism
viewpoint. This survey focuses on works that explicitly and view consistency. Representative methods that learn
generate 3D representations for 3D-aware image synthe- from 2D data are summarized in Tab. 3. Additionally, Fig. 5
sis. In contrast to 3D data-supervised methods that are and Tab. 2b provide properties of representative methods.
directly trained with shapes, most 2D data-based generation
methods are supervised by images through differentiable 5.1 Depth/Normal Maps
neural rendering because there are few high-quality and
large-scale datasets of renderable 3D representations for The depth and normal maps are easily accessible repre-
training generative models. Due to the lack of renderable sentations that partially reveal the geometry of 3D scenes
3D representations, auto-encoder architectures are rarely or objects. Since they only show the geometry from one
used in this task. Instead, most methods adopt generative side, they are usually referred to as 2.5D representations.
adversarial models, which sample a latent vector from the The depth and normal maps can be easily involved in
latent space and decode it to the target representation, as image generation (i.e., processed by 2D convolutional neural
shown in Fig. 4. networks rather than 3D architectures) as they share a
similar data format to 2D images. Most methods [107], [178],
Similar to 3D data-based generation, there are also [179] leverage GAN models to generate depth or normal
several 3D representations commonly used in the task of maps for 3D-aware image synthesis.
2D data-based 3D generation. These include depth/normal S2-GAN [178] proposes to consider the underlying
maps, voxel grids, and neural fields. Point clouds and 3D geometry when generating images, where it refers to
meshes are not well explored in generative image synthe- the geometry as ”structure” and the image as ”style”. It
sis, partly because current differentiable neural rendering develops the Structure-GAN, which maps a sampled latent
cannot provide effective gradient signals to optimize these vector to a surface normal map. The normal map is then
two representations easily. The key factors to consider when fed into the Style-GAN with another sampled latent vector
12
Triplane Mesh Voxel

Random Vector Random Vector
z Generator z Upsampler
Density
Query
Points
MLP ∑
Camera pose Image/Feature

Image
Directions RGB/Feature Field
Fig. 4: The general pipeline of 3D-aware GAN. The 3D-aware GAN framework generates 3D representations including
Tri-plane [115], [189], Voxel [59], [60], and Mesh [137]. These representations are then utilized to predict the color and
density for volume rendering. The discriminator is omitted since it follows a similar approach as conventional 2D GANs.
50 GRAF GRAF[28]
[28] features from the geometry path and injects them into the
(49, 177.0)
(49, 177.0)
image path. DepthGAN [107] also designs a switchable
discriminator that not only classifies the RGB-D image but
also regresses from the RGB image to the depth, which is
then compared with the depth from the generator. While
GIRAFFE [30]
GIRAFFE[30]
(32,(32,
5.2) 5.2) depth or normal maps work well with 2D GAN models,
CAMPARI [108]
CAMPARI[195]
(28, 183.9) FeNeRF [110] which can efficiently synthesize high-resolution images,
FID
(28, 183.9) 𝜋-GAN[29]

(17.3, 2094.7)
(14.7, 842.4) 2D CNN-based generators do not guarantee the inter-
GRAM [112]
𝜋-GAN[29]
(17.4, 977.9)
(14.7, 842.4) view consistency of generated images [211], even when
𝜋-GAN [29]
ShadeGAN[194]
ShadeGAN
(16.2,
(16.2,
[107]
1243.7)
1243.7)
supervised with the view consistency loss. Moreover, depth
𝜋-GAN[29] GIRAFFE HD [113]
(14.7, 842.4)
(14.7, 842.4) GOF [109]
GOF[198] GIRAFFE HD[208](10.1, 5336.1)
(10.1, 5336.1) and normal maps are 2.5D representations that cannot fully
(14.2, 621.5)
(14.2, 621.5)
VolumeGAN [58]
capture the geometry of the scene.
StyleNeRF [111]
VolumeGAN[198]
(9.1, 32.5)
EG3D [114] StyleNeRF[203]
(8.1, 13.5)
(9.1, 32.5)
(4.0, 25.8)EG3D[57](8.1, 13.5)
(4.0, 25.8) 5.2 Voxel Grids
0 128 256 Resolution 512 1024
There are generally two ways to synthesize images from
Fig. 5: FID v.s. Resolution of representative 3D synthesis voxel grids. The first way is to use voxel grids only to
methods trained on FFHQ [55]. We annotate each method represent 3D shapes, which are then used to render depth
by their FID score and GigaFLOPS. maps to guide image generation. The second way is to
embed the geometry and appearance of scenes with voxel
grids. Inspired by the graphics rendering pipeline, Zhu
et al. [180] and Chen et al. [181] decompose the image
to generate the image. The Structure-GAN is implemented generation task into 3D shape generation and 2D image
using a set of convolution and deconvolution layers, and the generation. They adopt voxel grids for shape generation,
Style-GAN is designed based on the conditional GAN [210]. followed by a projection operator that renders voxel grids
By disentangling the generation of the structure and style, into the depth map under a viewpoint. The depth map is
S2-GAN makes the generative process more interpretable then used as guidance for RGB image generation. Similar
and enables the controllability of image synthesis given to S2-GAN, Zhu et al. [180] utilize a conditional GAN to
surface normal maps. To explicitly control the viewpoint synthesize the 2D image based on the rendered depth map.
of rendered images, RGBD-GAN [179] passes camera pa- To enable illumination control, Chen et al. [181] generate
rameters into the generator. It assumes that RGB and depth different image components (e.g., diffuse albedo, specular,
images can be generated simultaneously and learns from and roughness) based on the rendered depth and compose
2D image data only. To ensure the inter-view consistency the final RGB output following the physical image forma-
of synthesized images, [179] generates two RGBD images tion process. On the other hand, some methods [60], [68],
from two different viewpoints and warps the RGB-D image [183] represent the underlying geometry and appearance in
from one view to another to compute the consistency loss. voxel grids simultaneously. HoloGAN [68] leverages a 3D
However, the model fails to capture this relationship when CNN-based generator to produce a 3D feature volume from
the contents in the image become more complex (e.g., a latent vector. A projection unit then projects the voxels
bedrooms) and outputs a flat plane. In contrast, Wang et to 2D features, and the subsequent 2D decoder outputs
al. [178] and Shi et al. [107] argue that appearance should the final RGB image. To control the viewpoint, HoloGAN
be rendered conditioned on the geometry, using normal applies 3D transformations to voxel grids, which ensures
maps and depth maps, respectively. Similar to S2-GAN the underlying 3D geometric property. Based on HoloGAN,
[178], DepthGAN [107] develops a dual-path generator BlockGAN [183] generates multiple feature volumes, each
for 3D-aware image synthesis. However, in contrast to representing a 3D entity, and composes them into a feature
S2-GAN, which explicitly conditions image generation on volume for the whole 3D scene. This strategy enables
the synthesized geometry, DepthGAN extracts intermediate it to render images containing multiple objects. Instead
13
of learning a 3D feature volume, PlatonicGAN directly GRAM [113] also examines the point sampling strategy. It
predicts RGB and alpha volumes [182], but is limited to claims that the deficient point sampling strategy in prior
a low resolution. More recently, VoxGRAF [60] attempts work causes the inadequate expressiveness of the generator
to synthesize RGB and alpha volumes and render images on fine details and that the noise caused by unstable Monte
from volumes through volume rendering techniques. By Carlo sampling leads to inefficient GAN training. GRAM
removing the 2D CNN renderers, VoxGRAF promises to proposes to regulate the learning of neural fields on 2D
render view-consistent images. To achieve the rendering of manifolds and only optimizes the ray-manifold intersection
high-resolution images, VoxGRAF adopts sparse voxel grids points. MVCGAN [186] attempts to alleviate the consistency
to represent scenes. It first generates a low-resolution dense problem by adding a multi-view constraint on the generator.
volume and then progressively upsamples the resolution of For the same latent code, two camera poses are sampled for
the volume and prunes voxels in the empty space. rendering, and both the features and images are warped
from one to the other to calculate the reprojection loss.
GeoD [187] tackles the problem by making the discriminator
5.3 Neural Fields 3D-aware. A geometry branch and a consistency branch are
Image synthesis methods based on neural fields generally added to the discriminator to extract shapes and evaluate
adopt MLP networks to implicitly represent the properties consistency from the generator’s outputs and in turn,
of each point in the 3D space, followed by a differentiable supervise the generator.
renderer to output an image under a specific viewpoint.The To generate 3D-aware images on larger resolutions
volume renderer [70] is the most commonly used renderer efficiently, a variety of methods operate neural fields
for 3D-aware image synthesis. Most of the methods use in a small resolution and leverage convolutional neural
GANs to supervise the learning of neural fields. GRAF [29] networks for upsampling the image/features to a higher
firstly introduces the concept of generative neural radiance resolution. StyleNeRF [112] integrates the neural fields into
fields. An MLP-based generator is conditioned on a shape a style-based generator. The low-resolution NeRF outputs
noise and an appearance noise and predicts the density and features rather than colors which are further processed
the color of points along each ray. Then, a volume renderer by convolutional layers. StyleSDF [185] merges a signed
gathers the information along all rays to synthesize a 2D distance function-based representation into neural fields.
image. Due to the slow rendering process, a patch-based The output features from the neural fields are passed to
discriminator is used to differentiate between the real and a style-based 2D generator for high-resolution generation.
fake patches instead of the complete images. π -GAN [30] Two discriminators operate on the neural field and the
uses a similar setting as GRAF, but it adopts SIREN [212] 2D generator separately. CIPS-3D [184] uses a shallow
rather than ReLU MLPs for representation, which is more NeRF network to get the features and densities of all
capable of modeling fine details. It utilizes progressive points. A deep implicit neural representation network then
training with an image-based discriminator. Besides, instead predicts the color of each pixel independently based on the
of using two latent codes, π -GAN employs a StyleGAN-like feature vector of that pixel. Each network is accompanied
mapping network to condition the SIREN on a latent code by a discriminator. Despite the high-resolution images of
through FiLM conditioning [213], [214]. Although the above high fidelity achieved, introducing convolutional layers for
methods improve the quality of 3D-aware image synthesis 2D upsampling may harm the 3D consistency. Carefully
significantly, they still face several challenges. First, the designed architectures and regularizers are necessary to
good quality of rendered images does not imply a decent mitigate the problem. StyleNeRF [112] uses a well-thought-
underlying shape and guarantees inter-view consistency. out upsampler based on 1x1 convolutional kernels. More-
Second, due to the huge number of points queried along over, a NeRF path regularizer is designed to force the
all rays and the complex rendering process, rendering an output of CNNs to be similar to the output of NeRF.
image from the model takes a lot of time. Consequently, it is GRAM-HD [188] designs a super-resolution module that
hard to train a model on high-resolution images efficiently. upsamples the manifold to a higher resolution. Images
Third, they all assume a prior distribution of camera poses are rendered from upsampled manifolds. EpiGRAF [189]
on the dataset, which may not be accurate enough. bypasses the use of convolutional neural networks and
To mitigate the first issue, ShadeGAN [108] adds a instead operates directly on the output of a high-resolution
lighting constraint to the generator. It generates the albedo neural field. It achieves this by introducing a patch-based
instead of the color for each point. A shading model discriminator that is modulated by the patch location and
operates on the generated albedo map, the normal map scale parameters.
derived from densities, the sampled camera viewpoint, Regarding camera poses, most methods assume a prior
and the sampled lighting source to render an RGB image. distribution from the dataset. Several methods [60], [113],
GOF [110] analyzes the weight distribution on points along [115], [188], [189], however, leverage ground-truth poses
a ray. The cumulative rendering process cannot ensure of the dataset for sampling and training, resulting in a
a low-variance distribution on rays, leading to diffuse significant improvement in the quality of both the images
object surfaces. While occupancy-based neural fields can and geometry. This demonstrates the importance of having
satisfy the requirement, they suffer from the optimization accurate camera distribution. CAMPARI [109] attempts to
problem that gradients only exist on surfaces. Hence, GOF learn the pose distribution automatically from datasets and
unifies the two representations by gradually shrinking the leverages a camera generator to learn the distribution shift
sampling region to a minimal neighboring region around based on a prior distribution. However, the results are
the surface, resulting in neat surfaces and faster training. sensitive to the prior initialization. PoF3D [190], in contrast,
14
infers the camera poses from the latent code and frees level supervision can be applied in contrast to image-level
the model from prior knowledge. Besides, a pose-aware supervision used in GAN-based methods.
discriminator is designed to facilitate pose learning on the
generator side.
5.4 Hybrid Representations
With the advent of generative neural fields, researchers
also explore compositional generation. GSN [191] de- Implicit representations can effectively optimize the entire
composes global neural radiance fields into several lo- 3D shape based on 2D multi-view observations through
cal neural radiance fields to model parts of the whole differentiable rendering. As a result, many studies [29], [30],
scene. GIRAFFE [31] and GIRAFFE-HD [114] represent [31], [112], [137] opt to integrate implicit representations
scenes as compositional generative neural radiance fields, into the generator to achieve 3D-aware image synthesis
allowing the independent control over multiple foreground using 2D supervision. Implicit representations implemented
objects as well as the background. DisCoScene [197] also by MLPs are memory-efficient but tend to have a long
proposes compositional generation but introduces scene evaluation time. On the other hand, explicit representations
layout prior to better disentangle and model the scene. like voxel grids and depth work effectively with CNNs and
UrbanGIRAFFE [215] further leverages panoptic layout are efficient for high-resolution image rendering. However,
prior to additionally enabling control over large camera they suffer from memory overhead and view consistency
movement and background semantics. issues. By combining implicit and explicit representations,
Besides, editing and controllability in the generation their complementary benefits can be harnessed for 3D-
process have sparked the interest of researchers. A few aware image synthesis, potentially enhancing image quality
works focus on semantic control in 3D generation. [199] and rendering efficiency, which is in line with the results
generates a semantic map from generative radiance fields from the benchmark by Wang et al. [218].
conditioned on a semantic code as well as the sampled VolumeGAN [59] proposes to represent objects using
human pose and camera pose. It then employs an encoder- structural and textural representations. To capture the struc-
decoder structure to synthesize a human image based on ture, they first construct a feature volume and then utilize
the semantic map and a sampled texture latent code. The the features queried within this volume, along with the
content can be manipulated by interpolating codes and raw coordinates, to learn generative implicit feature fields.
sampling different poses. FENeRF [111] and IDE-3D [200] Additionally, a texture network similar to StyleGAN2 is em-
render semantic maps and color images from generative ployed to provide texture information to the feature fields.
neural fields simultaneously. The semantic maps can be By utilizing explicit voxel grids and implicit feature fields,
used to edit the 3D neural fields via GAN inversion VolumeGAN enhances both the evaluation speed and the
techniques [216], [217]. Pix2NeRF [192] employs generative quality of synthetic images. However, it is challenging to im-
models to translate a single input image to radiance prove the resolution of structural representation because a
fields, granting control in the 3D space (e.g., changing larger feature volume results in an increased computational
viewpoints). To accomplish this, an encoder is added to burden. Unlike VolumeGAN, EG3D [115] introduces the use
the GAN framework to convert the input to the latent of tri-planes as explicit representations instead of voxels.
space, thereby establishing a reconstruction objective for EG3D adopts a 2D generator to generate the tri-planes,
learning. D3D [193] decomposes the generation into two offering a computationally efficient approach to scale up
distinct components: shape and appearance. This decom- to high resolutions. Besides, EG3D renders a low-resolution
position allows for independent control over geometry and feature map that is upsampled to the target resolution using
color, respectively. 3D-GIF [194] and VoLux-GAN [198] a neural renderer, i.e., a 2D CNN, further reducing the
seek control over lights. 3D-GIF explicitly predicts the computational burden of the volume rendering. Moreover,
albedo, normal, and specular coefficients for each point. EG3D incorporates additional techniques, such as generator
To synthesize an image that takes lighting into account, pose conditioning and dual discrimination, to address
it employs volume rendering and photometric image ren- view inconsistencies. Another approach, Next3D [201],
dering techniques using sampled lighting sources. VoLux- proposes the combination of explicit mesh-guided control
GAN also predicts albedos, normals, diffuse components, and implicit volumetric representation, aiming to leverage
and specular components with the neural fields but instead the benefits of both controllability and quality.
uses an HDR environmental map to model the lighting. Recently, a few works have started exploring diffusion
This approach enables VoLux-GAN to achieve high dynamic models for 3D generation. GAUDI [202] utilizes a latent
range imaging (HDRI) relighting capabilities. representation that can be decoded into triplane features
In addition to GAN-based methods, NeRF-VAE [195] and camera poses for rendering. They then train a diffusion
leverages a variational autoencoder [16] to model the model on this latent space. RenderDiffusion [203] replaces
distribution of radiance fields. The model takes a set of U-Net, a common structure used in diffusion models, with
images and their corresponding camera poses from the same a tri-plane feature-based network for denoising. This allows
scene as input and generates a latent variable. The latent for learning 3D generation solely from 2D supervision.
variable serves as input for the generator to synthesize It has been shown that hybrid representations are
the content. Next, a reconstructed image is obtained by effective in producing high-quality 3D-aware images and
volume rendering the synthesized content using the input 3D shapes. As a result, subsequent works such as Avatar-
camera poses. The reconstructed images are expected to be Gen [219] and GNARF [220] also adopt tri-plane represen-
similar to the input images. LOLNeRF [196] employs an tations for human synthesis. However, there is still room for
auto-decoder to learn a 3D generative model, where pixel- improvement, particularly in terms of training speed and
15
et al. [145] allow users to edit a part of the voxel grid

and automatically deform and re-assemble all parts into a
plausible shape. Alternatively, a few methods study shape
editing controlled by more compact representations, such as
sparse 3D points or bounding boxes [104], [162], [164], [223].
While the aforementioned methods focus on editing rigid
shapes, gDNA [161] enables reposing articulated human
Layout editing bodies controlled by human skeletons.
Latent code editing: Another set of methods updates the
latent space of the generative model without requiring user
manipulation in the 3D space. Part-aware generative models
that assign a latent code to each part inherently possess
Relighting the ability of fine-grained part-level control [101], [159].
Furthermore, several methods explore shape editing guided
by semantic labels or language descriptions [224], [225],
[226].
6.2 3D-Aware Image Editing

We refer to 3D-aware image editing as the technique that
produces an edited 2D image as the outcome. We divide 3D-
aware image editing methods into two categories: one that
Semantic Editing
edits a set of physically meaningful 3D factors, while the
other manipulates the latent code without explicit physical
meaning.
Physical factor editing: All 3D-aware image synthesis
methods allow users to control camera poses, resulting in 2D
images from various viewpoints [29], [30], [112], [113], [115],
[185]. In addition to camera poses, several methods [31],
Reconstruction [183], [227] enable the editing of object poses and the
insertion of new objects by modeling the compositional
Fig. 6: Applications. 3D generative models have enabled nature of the scene. The study of re-lighting is also
various applications in editing, reconstruction, and repre- conducted through the disentangling of the albedo, normal,
sentation learning. Images adapted from [31] [198] [200], and specular coefficients [110], [194], [198]. More recently,
[61]. several generative models for neural articulated radiance
fields are proposed for human bodies [219], [220], [228],
strategy. For instance, VolumeGAN relies on progressive enabling shape manipulation driven by human skeletons.
training, which presents challenges when transitioning to Latent code editing: The latent code captures all the
another framework. other variations that are not represented by the physically
meaningful 3D factors, e.g., object shape and appearance.
One line of work involves editing the global attributes
6 A PPLICATIONS of the generated content. Several methods learn disentan-
The emergence of 3D generative models has facilitated gled shape and appearance code for generative radiance
numerous promising applications, as shown in Fig. 6. In fields, enabling independent interpolation of shape and
this section, we delve into the various applications of 3D appearance [29], [193]. CLIP-NeRF [229] further allows for
generative models, specifically focusing on editing, recon- controlling the shape and appearance of generative radiance
struction, and representation learning. Editing is further fields using language or a guidance image. Another line
categorized into two distinct types: shape editing and of work explores more fine-grained local editing through
3D-aware image editing. This categorization is based on semantic manipulation. FENeRF [111] and IDE-3D [200]
whether the resulting output exists in a 3D or 2D space. allow users to edit the semantic masks and generate edited
3D-aware images accordingly.
6.1 3D Shape Editing
Several classical methods are available for 3D shape editing, 6.3 3D Reconstruction
including skeleton-based deformation and cage-based de- Two common approaches exist for utilizing 3D generative
formation. We refer to [221] for a detailed survey of classical models in reconstruction, depending on whether test-time
shape editing methods. Here, our focus lies on shape editing optimization is employed or not.
methods that apply to 3D generative models. Armotized inference: Generative models equipped with an
3D space editing: Several works edit shapes relying on encoder, such as VAE, offer a natural means of reconstruct-
user manipulation in the 3D shape space. Liu et al. [222] ing 3D shapes from a given input. In contrast to standard
propose a method where users paint 3D shapes using 3D reconstruction focusing on one-to-one mapping from
voxel grids, which are then refined using a 3D GAN. Li the input to the target 3D shape, generative methods allow
16
for sampling numerous possible shapes from the posterior efficiency of 3D generative models is desirable to address
distribution. This line of work is applicable to different the high impact of the models on the environment, and
types of input by simply switching the encoder. Common enhancing the inference efficiency is crucial for downstream
input choices are single-view image [61], [230] and point applications.
cloud [230]. Training stability: The training of 3D generative models,
Model inversion: Another line of methods leverages the in particular 2D data-based generative models, is usually
inversion of generative models for reconstruction, typically highly prone to mode collapse. One possible explanation is
requiring test-time optimization. By adopting an auto- that the distribution of the physically meaningful factors,
decoder framework, a complete shape can be reconstructed e.g., camera poses and rendering parameters, may not
from a partial observation via optimizing the latent code to match that of the real images. Investigating the training
align with the partial observation [117], [231]. stability of generative models is thus particularly important.
While the aforementioned methods rely on 3D su-
pervision, recent advancements demonstrate single-view
image-based 3D reconstruction by inverting the 3D-aware 8 C ONCLUSION
image generative models using only 2D supervision [30],
[115]. This approach aligns with the classical concept of Deep 3D generative models, such as VAEs and GANs, aim
Analysis by Synthesis [232]. Notably, by inverting 3D-aware to characterize the data distribution of observed 3D data.
image generation models, it becomes possible to recover A key challenge in the task of 3D generation is that there
camera poses, thereby enabling category-level object pose are many 3D representations for describing 3D instances,
estimation [190], [233], [234]. and each representation comes with its own advantage and
disadvantage for generative modeling. This paper presents
a comprehensive review of 3D generation by discussing
6.4 Representation Learning how different 3D representations combine with different
Another common application of generative models is to generative models. We first introduce the fundamentals
learn better representations for downstream tasks, e.g., clas- of 3D generative models, including the formulation of
sification or segmentation. Notably, representation learning generative models and 3D representations. Then, we review
on point clouds has proven effective for tasks like model how 3D representations are modeled and generated in
classification and semantic segmentation [27], [102], [235]. the task of 3D generation learning from 2D and 3D data,
Similarly, generative models of voxel grids have been respectively. Afterward, we discuss the applications of
adopted for tasks such as model classification, shape 3D generative models, including shape editing, 3D-aware
recovery, and super-resolution [236]. image manipulation, reconstruction, and representation
learning. Finally, we highlight the limitations of existing
3D generative models and propose several future directions.
7 F UTURE W ORK We hope this review can help the readers to grasp a better
The development of 3D generative models has advanced understanding of the field of 3D generative models and
rapidly, but there are still a lot of challenges to overcome inspire novel, innovative research in the future.
before they can be used for downstream applications, such
as gaming, simulation, and augmented/virtual reality. Here,
we discuss current gaps in the literature and potential future R EFERENCES
directions of 3D generative models.
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature,
Universality: Most of the existing 3D generative models 2015.
are trained on simple object-level datasets, e.g., ShapeNet [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classi-
for 3D shape generation and FFHQ for 3D-aware image fication with deep convolutional neural networks,” in NeurIPS,
2012.
synthesis. We believe that developing more universal 3D
[3] K. Simonyan and A. Zisserman, “Very deep convolutional
generative models is a fruitful direction for future research. networks for large-scale image recognition,” arXiv preprint
Here, universality includes generating versatile objects (e.g., arXiv:1409.1556, 2014.
ImageNet or Microsoft CoCo), dynamic objects or scenes, [4] R. Girshick, “Fast r-cnn,” in ICCV, 2015.
[5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-
and large-scale scenes. Instead of focusing on a single time object detection with region proposal networks,” NeurIPS,
category, it is particularly interesting to learn a general 3D 2015.
generative model for various categories similar to the 2D [6] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in
generative models, e.g., DALL-E 2 and Imagen [237], [238]. ICCV, 2017.
[7] M. Oechsle, S. Peng, and A. Geiger, “Unisurf: Unifying neural im-
Controllability: The controllability of 3D generative models plicit surfaces and radiance fields for multi-view reconstruction,”
lags behind that of 2D generative models. Ideally, the user in ICCV, 2021.
should be able to control the 3D generation process via user- [8] L. Yariv, J. Gu, Y. Kasten, and Y. Lipman, “Volume rendering of
friendly input, including but not limited to language, sketch, neural implicit surfaces,” NeurIPS, 2021.
[9] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ra-
and program. Moreover, we believe that the controllability mamoorthi, and R. Ng, “Nerf: Representing scenes as neural
of physical properties should be further improved, includ- radiance fields for view synthesis,” in ECCV, 2020.
ing lighting, material, and dynamics. [10] D. Xu, D. Anguelov, and A. Jain, “Pointfusion: Deep sensor fusion
for 3d bounding box estimation,” in CVPR, 2018.
Efficiency: Many 3D generative models require 3-10 days
[11] M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett,
of training on multiple high-end GPUs and are slow D. Wang, P. Carr, S. Lucey, D. Ramanan et al., “Argoverse: 3d
during inference. We believe that improving the training tracking and forecasting with rich maps,” in CVPR, 2019.
17
[12] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ron- [39] A. Oussidi and A. Elhassouny, “Deep generative models: Sur-
neberger, K. Tunyasuvunakool, R. Bates, A. Žı́dek, A. Potapenko vey,” in ISCV, 2018.
et al., “Highly accurate protein structure prediction with al- [40] S. Bond-Taylor, A. Leach, Y. Long, and C. G. Willcocks, “Deep
phafold,” Nature, 2021. generative modelling: A comparative review of vaes, gans,
[13] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, normalizing flows, energy-based and autoregressive models,”
M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” arXiv preprint arXiv:2103.04922, 2021.
in ICML, 2021. [41] A. Tewari, J. Thies, B. Mildenhall, P. Srinivasan, E. Tretschk,
[14] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hi- W. Yifan, C. Lassner, V. Sitzmann, R. Martin-Brualla, S. Lombardi
erarchical text-conditional image generation with clip latents,” et al., “Advances in neural rendering,” in CGF, 2022.
arXiv preprint arXiv:2204.06125, 2022. [42] Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun,
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde- “Deep learning for 3d point clouds: A survey,” TPAMI, 2020.
Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative [43] J. Lahoud, J. Cao, F. S. Khan, H. Cholakkal, R. M. Anwer, S. Khan,
adversarial nets,” in NeurIPS, 2014. and M.-H. Yang, “3d vision with transformers: A survey,” arXiv
[16] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” preprint arXiv:2208.04309, 2022.
in ICLR, 2014. [44] Y. Xie, T. Takikawa, S. Saito, O. Litany, S. Yan, N. Khan,
[17] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic F. Tombari, J. Tompkin, V. Sitzmann, and S. Sridhar, “Neural
models,” NeurIPS, 2020. fields in visual computing and beyond,” in CGF, 2022.
[18] R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, [45] S. Chaudhuri, D. Ritchie, J. Wu, K. Xu, and H. Zhang, “Learning
M. Bethge, and F. A. Wichmann, “Shortcut learning in deep generative models of 3d structures,” in CGF, 2020.
neural networks,” Nature Machine Intell., 2020. [46] M. Toshpulatov, W. Lee, and S. Lee, “Generative adversarial
[19] Y. Xu, Y. Shen, J. Zhu, C. Yang, and B. Zhou, “Generative networks and their application to 3d face generation: A survey,”
hierarchical features from synthesizing images,” in CVPR, 2021. Image and Vision Computing, 2021.
[20] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image [47] M. Zollhöfer, P. Stotko, A. Görlitz, C. Theobalt, M. Nießner,
translation with conditional adversarial networks,” in CVPR, R. Klein, and A. Kolb, “State of the art on 3d reconstruction with
2017. rgb-d cameras,” in CGF, 2018.
[21] Y. Shen, C. Yang, X. Tang, and B. Zhou, “Interfacegan: Inter- [48] A. Jain, B. Mildenhall, J. T. Barron, P. Abbeel, and B. Poole, “Zero-
preting the disentangled face representation learned by gans,” shot text-guided object generation with dream fields,” in CVPR,
TPAMI, 2020. 2022.
[22] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, [49] B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “Dreamfusion:
S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes Text-to-3d using 2d diffusion,” arXiv preprint arXiv:2209.14988,
et al., “Photorealistic text-to-image diffusion models with deep 2022.
language understanding,” arXiv preprint arXiv:2205.11487, 2022.
[50] C.-H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang,
[23] J. Yu, Y. Xu, J. Y. Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, K. Kreis, S. Fidler, M.-Y. Liu, and T.-Y. Lin, “Magic3d: High-
A. Ku, Y. Yang, B. K. Ayan et al., “Scaling autoregressive resolution text-to-3d content creation,” in CVPR, 2023.
models for content-rich text-to-image generation,” arXiv preprint
[51] H. Kato, D. Beker, M. Morariu, T. Ando, T. Matsuoka, W. Kehl,
arXiv:2206.10789, 2022.
and A. Gaidon, “Differentiable rendering: A survey,” arXiv
[24] Y. Bengio, R. Ducharme, and P. Vincent, “A neural probabilistic preprint arXiv:2006.12057, 2020.
language model,” NeurIPS, 2000.
[52] G. E. Hinton, “Training products of experts by minimizing
[25] D. Rezende and S. Mohamed, “Variational inference with
contrastive divergence,” Neural computation, 2002.
normalizing flows,” in ICML, 2015.
[53] A. Radford, L. Metz, and S. Chintala, “Unsupervised represen-
[26] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum,
tation learning with deep convolutional generative adversarial
“Learning a probabilistic latent space of object shapes via 3d
networks,” in ICLR, 2016.
generative-adversarial modeling,” in NeurIPS, 2016.
[54] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing
[27] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas, “Learn-
of gans for improved quality, stability, and variation,” in ICLR,
ing representations and generative models for 3d point clouds,”
2018.
in ICML, 2018.
[28] Z. Chen and H. Zhang, “Learning implicit fields for generative [55] T. Karras, S. Laine, and T. Aila, “A style-based generator
shape modeling,” in CVPR, 2019. architecture for generative adversarial networks,” in CVPR, 2019.
[29] K. Schwarz, Y. Liao, M. Niemeyer, and A. Geiger, “Graf: Gen- [56] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila,
erative radiance fields for 3d-aware image synthesis,” NeurIPS, “Analyzing and improving the image quality of StyleGAN,” in
2020. CVPR, 2020.
[30] E. R. Chan, M. Monteiro, P. Kellnhofer, J. Wu, and G. Wetzstein, [57] T. Karras, M. Aittala, S. Laine, E. Härkönen, J. Hellsten, J. Lehti-
“pi-gan: Periodic implicit generative adversarial networks for 3d- nen, and T. Aila, “Alias-free generative adversarial networks,” in
aware image synthesis,” in CVPR, 2021. NeurIPS, 2021.
[31] M. Niemeyer and A. Geiger, “Giraffe: Representing scenes as [58] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational
compositional generative neural feature fields,” in CVPR, 2021. inference: A review for statisticians,” J. of the American statis.
[32] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning Associat., 2017.
on point sets for 3d classification and segmentation,” in CVPR, [59] Y. Xu, S. Peng, C. Yang, Y. Shen, and B. Zhou, “3d-aware image
2017. synthesis via learning structural and textural representations,” in
[33] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep CVPR, 2022.
hierarchical feature learning on point sets in a metric space,” [60] K. Schwarz, A. Sauer, M. Niemeyer, Y. Liao, and A. Geiger,
NeurIPS, 2017. “Voxgraf: Fast 3d-aware image synthesis with sparse voxel
[34] A. Sinha, A. Unmesh, Q. Huang, and K. Ramani, “Surfnet: grids,” in NeurIPS, 2022.
Generating 3d shape surfaces using deep residual networks,” in [61] P. Mittal, Y.-C. Cheng, M. Singh, and S. Tulsiani, “Autosdf: Shape
CVPR, 2017. priors for 3d completion, reconstruction and generation,” in
[35] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang, CVPR, 2022.
“Pixel2mesh: Generating 3d mesh models from single rgb [62] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese, “3d-
images,” in ECCV, 2018. r2n2: A unified approach for single and multi-view 3d object
[36] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, reconstruction,” in ECCV, 2016.
“3d shapenets: A deep representation for volumetric shapes,” in [63] C. Häne, S. Tulsiani, and J. Malik, “Hierarchical surface predic-
CVPR, 2015. tion for 3d object reconstruction,” in 3DV, 2017.
[37] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural [64] M. Tatarchenko, A. Dosovitskiy, and T. Brox, “Octree gener-
network for real-time object recognition,” in IROS, 2015. ating networks: Efficient convolutional architectures for high-
[38] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely, “Stereo resolution 3d outputs,” in ICCV, 2017.
magnification: Learning view synthesis using multiplane im- [65] G. Riegler, A. Osman Ulusoy, and A. Geiger, “Octnet: Learning
ages,” in SIGGRAPH, 2018. deep 3d representations at high resolutions,” in CVPR, 2017.
18
[66] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong, “O- [96] R. Wu, Y. Zhuang, K. Xu, H. Zhang, and B. Chen, “Pq-net: A
cnn: Octree-based convolutional neural networks for 3d shape generative part seq2seq network for 3d shapes,” in CVPR, 2020.
analysis,” TOG, 2017. [97] G. Yang, X. Huang, Z. Hao, M.-Y. Liu, S. Belongie, and B. Har-
[67] V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, iharan, “Pointflow: 3d point cloud generation with continuous
and M. Zollhofer, “Deepvoxels: Learning persistent 3d feature normalizing flows,” in ICCV, 2019.
embeddings,” in CVPR, 2019. [98] L. Hui, R. Xu, J. Xie, J. Qian, and J. Yang, “Progressive point cloud
[68] T. Nguyen-Phuoc, C. Li, L. Theis, C. Richardt, and Y.-L. Yang, deconvolution generation network,” in ECCV, 2020.
“Hologan: Unsupervised learning of 3d representations from [99] R. Cai, G. Yang, H. Averbuch-Elor, Z. Hao, S. Belongie,
natural images,” in ICCV, 2019. N. Snavely, and B. Hariharan, “Learning gradient fields for shape
[69] S. Lombardi, T. Simon, J. Saragih, G. Schwartz, A. Lehrmann, generation,” in ECCV, 2020.
and Y. Sheikh, “Neural volumes: Learning dynamic renderable [100] H. Kim, H. Lee, W. H. Kang, J. Y. Lee, and N. S. Kim, “Softflow:
volumes from images,” TOG, 2019. Probabilistic framework for normalizing flow on manifolds,” in
[70] J. T. Kajiya and B. P. Von Herzen, “Ray tracing volume densities,” NeurIPS, 2020.
SIGGRAPH, 1984. [101] R. Li, X. Li, K.-H. Hui, and C.-W. Fu, “Sp-gan: Sphere-guided 3d
[71] H. Fan, H. Su, and L. J. Guibas, “A point set generation network shape generation and manipulation,” TOG, 2021.
for 3d object reconstruction from a single image,” in CVPR, 2017. [102] J. Xie, Y. Xu, Z. Zheng, S.-C. Zhu, and Y. N. Wu, “Generative
[72] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “Pointcnn: pointnet: Deep energy-based learning on unordered point sets
Convolution on x-transformed points,” NeurIPS, 2018. for 3d generation, reconstruction and classification,” in CVPR,
[73] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. 2021.
Solomon, “Dynamic graph cnn for learning on point clouds,” [103] S. Luo and W. Hu, “Diffusion probabilistic models for 3d point
TOG, 2019. cloud generation,” in CVPR, 2021.
[74] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, “Pv- [104] Y. Deng, J. Yang, and X. Tong, “Deformed implicit field: Modeling
rcnn: Point-voxel feature set abstraction for 3d object detection,” 3d shapes with learned dense correspondence,” in CVPR, 2021.
in CVPR, 2020. [105] X. Chen, T. Jiang, J. Song, J. Yang, M. J. Black, A. Geiger, and
[75] M. Meshry, D. B. Goldman, S. Khamis, H. Hoppe, R. Pandey, O. Hilliges, “gdna: Towards generative detailed neural avatars,”
N. Snavely, and R. Martin-Brualla, “Neural rerendering in the in CVPR, 2022.
wild,” in CVPR, 2019.
[106] R. K. Jones, T. Barton, X. Xu, K. Wang, E. Jiang, P. Guerrero,
[76] W. Yifan, F. Serena, S. Wu, C. Öztireli, and O. Sorkine-Hornung, N. J. Mitra, and D. Ritchie, “Shapeassembly: Learning to generate
“Differentiable surface splatting for point-based geometry pro- programs for 3d shape structure synthesis,” TOG, 2020.
cessing,” TOG, 2019.
[107] Z. Shi, Y. Shen, J. Zhu, D.-Y. Yeung, and Q. Chen, “3d-aware
[77] C. Lassner and M. Zollhofer, “Pulsar: Efficient sphere-based
indoor scene synthesis with depth priors,” in ECCV, 2022.
neural rendering,” in CVPR, 2021.
[108] X. Pan, X. Xu, C. C. Loy, C. Theobalt, and B. Dai, “A shading-
[78] M. Wu, Y. Wang, Q. Hu, and J. Yu, “Multi-view neural human
guided generative implicit model for shape-accurate 3d-aware
rendering,” in CVPR, 2020.
image synthesis,” NeurIPS, 2021.
[79] K.-A. Aliev, A. Sevastopolsky, M. Kolos, D. Ulyanov, and
[109] M. Niemeyer and A. Geiger, “Campari: Camera-aware decom-
V. Lempitsky, “Neural point-based graphics,” in ECCV, 2020.
posed generative neural radiance fields,” in 3DV, 2021.
[80] O. Wiles, G. Gkioxari, R. Szeliski, and J. Johnson, “SynSin: End-
to-end view synthesis from a single image,” in CVPR, 2020. [110] X. Xu, X. Pan, D. Lin, and B. Dai, “Generative occupancy fields
for 3d surface-aware image synthesis,” NeurIPS, 2021.
[81] D. Rückert, L. Franke, and M. Stamminger, “Adop: Approximate
differentiable one-pixel point rendering,” TOG, 2022. [111] J. Sun, X. Wang, Y. Zhang, X. Li, Q. Zhang, Y. Liu, and J. Wang,
[82] J. Huang, H. Zhang, L. Yi, T. Funkhouser, M. Nießner, and “Fenerf: Face editing in neural radiance fields,” in CVPR, 2022.
L. J. Guibas, “Texturenet: Consistent local parametrizations for [112] J. Gu, L. Liu, P. Wang, and C. Theobalt, “Stylenerf: A style-
learning from high-resolution signals on meshes,” in CVPR, 2019. based 3d-aware generator for high-resolution image synthesis,”
[83] N. Verma, E. Boyer, and J. Verbeek, “Feastnet: Feature-steered in ICLR, 2022.
graph convolutions for 3d shape analysis,” in CVPR, 2018. [113] Y. Deng, J. Yang, J. Xiang, and X. Tong, “Gram: Generative
[84] R. Hanocka, A. Hertz, N. Fish, R. Giryes, S. Fleishman, and radiance manifolds for 3d-aware image generation,” in CVPR,
D. Cohen-Or, “Meshcnn: a network with an edge,” TOG, 2019. 2022.
[85] Y. Zhou, C. Wu, Z. Li, C. Cao, Y. Ye, J. Saragih, H. Li, and [114] Y. Xue, Y. Li, K. K. Singh, and Y. J. Lee, “Giraffe hd: A high-
Y. Sheikh, “Fully convolutional mesh autoencoder using efficient resolution 3d-aware generative model,” in CVPR, 2022.
spatially varying kernels,” NeurIPS, 2020. [115] E. R. Chan, C. Z. Lin, M. A. Chan, K. Nagano, B. Pan, S. De Mello,
[86] M. Fey, J. E. Lenssen, F. Weichert, and H. Müller, “Splinecnn: Fast O. Gallo, L. J. Guibas, J. Tremblay, S. Khamis et al., “Efficient
geometric deep learning with continuous b-spline kernels,” in geometry-aware 3d generative adversarial networks,” in CVPR,
CVPR, 2018. 2022.
[87] C. Wen, Y. Zhang, Z. Li, and Y. Fu, “Pixel2mesh++: Multi-view [116] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and
3d mesh generation via deformation,” in ICCV, 2019. A. Geiger, “Occupancy networks: Learning 3d reconstruction in
[88] S. Goel, A. Kanazawa, and J. Malik, “Shape and viewpoint function space,” in CVPR, 2019.
without keypoints,” in ECCV, 2020. [117] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove,
[89] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik, “Learning “Deepsdf: Learning continuous signed distance functions for
category-specific mesh reconstruction from image collections,” shape representation,” in CVPR, 2019.
in ECCV, 2018. [118] P. Wang, L. Liu, Y. Liu, C. Theobalt, T. Komura, and W. Wang,
[90] M. M. Loper and M. J. Black, “Opendr: An approximate “Neus: Learning neural implicit surfaces by volume rendering
differentiable renderer,” in ECCV, 2014. for multi-view reconstruction,” NeurIPS, 2021.
[91] H. Kato, Y. Ushiku, and T. Harada, “Neural 3d mesh renderer,” [119] M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger, “Differ-
in CVPR, 2018. entiable volumetric rendering: Learning implicit 3d representa-
[92] S. Liu, T. Li, W. Chen, and H. Li, “Soft rasterizer: A differentiable tions without 3d supervision,” in CVPR, 2020.
renderer for image-based 3d reasoning,” in ICCV, 2019. [120] J. C. Hart, “Sphere tracing: A geometric method for the
[93] J. Thies, M. Zollhöfer, and M. Nießner, “Deferred neural render- antialiased ray tracing of implicit surfaces,” The Visual Computer,
ing: Image synthesis using neural textures,” TOG, 2019. 1996.
[94] A. Raj, J. Tanke, J. Hays, M. Vo, C. Stoll, and C. Lassner, “Anr: [121] T. Takikawa, J. Litalien, K. Yin, K. Kreis, C. Loop,
Articulated neural rendering for virtual avatars,” in CVPR, 2021. D. Nowrouzezahrai, A. Jacobson, M. McGuire, and S. Fidler,
[95] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, “Neural geometric level of detail: Real-time rendering with
T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, implicit 3d shapes,” in CVPR, 2021.
A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chil- [122] L. Liu, J. Gu, K. Zaw Lin, T.-S. Chua, and C. Theobalt, “Neural
amkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: sparse voxel fields,” NeurIPS, 2020.
An imperative style, high-performance deep learning library,” in [123] D. B. Lindell, J. N. Martel, and G. Wetzstein, “Autoint: Automatic
NeurIPS, 2019. integration for fast neural volume rendering,” in CVPR, 2021.
19
[124] A. Yu, R. Li, M. Tancik, H. Li, R. Ng, and A. Kanazawa, [150] D. W. Shu, S. W. Park, and J. Kwon, “3d point cloud generative
“Plenoctrees for real-time rendering of neural radiance fields,” adversarial network based on tree structured graph convolu-
in ICCV, 2021. tions,” in ICCV, 2019.
[125] P. Hedman, P. P. Srinivasan, B. Mildenhall, J. T. Barron, and [151] M. S. Arshad and W. J. Beksi, “A progressive conditional
P. Debevec, “Baking neural radiance fields for real-time view generative adversarial network for generating dense and colored
synthesis,” in ICCV, 2021. 3d point clouds,” in 3DV, 2020.
[126] S. J. Garbin, M. Kowalski, M. Johnson, J. Shotton, and J. Valentin, [152] S. Ramasinghe, S. Khan, N. Barnes, and S. Gould, “Spectral-gans
“Fastnerf: High-fidelity neural rendering at 200fps,” in ICCV, for high-resolution 3d point-cloud generation,” in IROS, 2020.
2021. [153] M. Zamorski, M. Zieba, P. Klukowski, R. Nowak, K. Kurach,
[127] C. Reiser, S. Peng, Y. Liao, and A. Geiger, “Kilonerf: Speeding W. Stokowiec, and T. Trzciński, “Adversarial autoencoders for
up neural radiance fields with thousands of tiny mlps,” in ICCV, compact representations of 3d point clouds,” CVIU, 2020.
2021. [154] R. Klokov, E. Boyer, and J. Verbeek, “Discrete point flow networks
[128] S. Fridovich-Keil, A. Yu, M. Tancik, Q. Chen, B. Recht, and for efficient point cloud generation,” in ECCV, 2020.
A. Kanazawa, “Plenoxels: Radiance fields without neural net- [155] L. Zhou, Y. Du, and J. Wu, “3d shape generation and completion
works,” in CVPR, 2022. through point-voxel diffusion,” in ICCV, 2021.
[129] J. Li, Z. Feng, Q. She, H. Ding, C. Wang, and G. H. Lee, [156] X. Zeng, A. Vahdat, F. Williams, Z. Gojcic, O. Litany, S. Fidler,
“Mine: Towards continuous depth mpi with nerf for novel view and K. Kreis, “Lion: Latent point diffusion models for 3d shape
synthesis,” in ICCV, 2021. generation,” in NeurIPS, 2022.
[130] Q. Xu, Z. Xu, J. Philip, S. Bi, Z. Shu, K. Sunkavalli, and [157] L. Wu, D. Wang, C. Gong, X. Liu, Y. Xiong, R. Ranjan,
U. Neumann, “Point-nerf: Point-based neural radiance fields,” R. Krishnamoorthi, V. Chandra, and Q. Liu, “Fast point cloud
in CVPR, 2022. generation with straight flows,” arXiv preprint arXiv:2212.01747,
[131] D. Zimny, T. Trzciński, and P. Spurek, “Points2nerf: Generating 2022.
neural radiance fields from 3d point cloud,” arXiv preprint [158] Y. Sun, Y. Wang, Z. Liu, J. Siegel, and S. Sarma, “Pointgrow:
arXiv:2206.01290, 2022. Autoregressively learned point cloud generation with self-
[132] B. Yang, C. Bao, J. Zeng, H. Bao, Y. Zhang, Z. Cui, and G. Zhang, attention,” in WACV, 2020.
“Neumesh: Learning disentangled neural mesh-based implicit [159] R. Gal, A. Bermano, H. Zhang, and D. Cohen-Or, “Mrgan: Multi-
field for geometry and texture editing,” in ECCV, 2022. rooted 3d shape generation with unsupervised part disentangle-
[133] J. Gao, W. Chen, T. Xiang, A. Jacobson, M. McGuire, and S. Fidler, ment,” in ICCVW, 2020.
“Learning deformable tetrahedral meshes for 3d reconstruction,” [160] A. Zadeh, Y.-C. Lim, P. P. Liang, and L.-P. Morency, “Variational
NeurIPS, 2020. auto-decoder: A method for neural generative modeling from
[134] H. Ben-Hamu, H. Maron, I. Kezurer, G. Avineri, and Y. Lipman, incomplete data,” arXiv preprint arXiv:1903.00840, 2019.
“Multi-chart generative surface modeling,” TOG, 2018. [161] X. Chen, T. Jiang, J. Song, J. Yang, M. J. Black, A. Geiger, and
[135] R. Hanocka, G. Metzer, R. Giryes, and D. Cohen-Or, “Point2mesh: O. Hilliges, “gdna: Towards generative detailed neural avatars,”
A self-prior for deformable meshes,” TOG, 2020. in CVPR, 2022.
[136] S. Peng, Y. Zhang, Y. Xu, Q. Wang, Q. Shuai, H. Bao, and X. Zhou, [162] Z. Hao, H. Averbuch-Elor, N. Snavely, and S. Belongie, “Dualsdf:
“Neural body: Implicit neural representations with structured Semantic shape manipulation using a two-level representation,”
latent codes for novel view synthesis of dynamic humans,” in in CVPR, 2020.
CVPR, 2021. [163] M. Arjovsky and L. Bottou, “Towards principled methods
[137] J. Gao, T. Shen, Z. Wang, W. Chen, K. Yin, D. Li, O. Litany, for training generative adversarial networks,” arXiv preprint
Z. Gojcic, and S. Fidler, “GET3D: A generative model of high arXiv:1701.04862, 2017.
quality 3d textured shapes learned from images,” in NeurIPS, [164] M. Ibing, I. Lim, and L. Kobbelt, “3d shape generation with grid-
2022. based implicit functions,” in CVPR, 2021.
[138] A. Brock, T. Lim, J. M. Ritchie, and N. Weston, “Generative [165] M. Kleineberg, M. Fey, and F. Weichert, “Adversarial generation
and discriminative voxel modeling with convolutional neural of continuous implicit shape representations,” arXiv preprint
networks,” arXiv preprint arXiv:1608.04236, 2016. arXiv:2002.00349, 2020.
[139] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point [166] A. Luo, T. Li, W.-H. Zhang, and T. S. Lee, “Surfgen: Adversarial
cloud based 3d object detection,” in CVPR, 2018. 3d shape synthesis with explicit surface discriminators,” in ICCV,
[140] J. Mao, Y. Xue, M. Niu, H. Bai, J. Feng, X. Liang, H. Xu, and C. Xu, 2021.
“Voxel transformer for 3d object detection,” in ICCV, 2021. [167] W. E. Lorensen and H. E. Cline, “Marching cubes: A high
[141] M. Zhao, Q. Liu, A. Jha, R. Deng, T. Yao, A. Mahadevan-Jansen, resolution 3d surface construction algorithm,” SIGGRAPH, 1987.
M. J. Tyska, B. A. Millis, and Y. Huo, “Voxelembed: 3d instance [168] E. Dupont, H. Kim, S. Eslami, D. Rezende, and D. Rosenbaum,
segmentation and tracking with voxel embedding based deep “From data to functa: Your data point is a function and you
learning,” in MLMI, 2021. should treat it like one,” in ICML, 2022.
[142] H.-Y. Meng, L. Gao, Y.-K. Lai, and D. Manocha, “Vv-net: Voxel [169] G. Chou, Y. Bahat, and F. Heide, “Diffusionsdf: Conditional
vae net with group convolutions for point cloud segmentation,” generative modeling of signed distance functions,” arXiv preprint
in ICCV, 2019. arXiv:2211.13757, 2022.
[143] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, [170] G. Nam, M. Khlifi, A. Rodriguez, A. Tono, L. Zhou, and
“Autoencoding beyond pixels using a learned similarity metric,” P. Guerrero, “3d-ldm: Neural implicit 3d shape generation with
in ICML, 2016. latent diffusion models,” arXiv preprint arXiv:2212.00842, 2022.
[144] Z. Wu, X. Wang, D. Lin, D. Lischinski, D. Cohen-Or, and [171] T. Wang, B. Zhang, T. Zhang, S. Gu, J. Bao, T. Baltrusaitis, J. Shen,
H. Huang, “Sagnet: Structure-aware generative network for 3d- D. Chen, F. Wen, Q. Chen et al., “Rodin: A generative model
shape modeling,” TOG, 2019. for sculpting 3d digital avatars using diffusion,” arXiv preprint
[145] J. Li, C. Niu, and K. Xu, “Learning part generation and assembly arXiv:2212.06135, 2022.
for structure-aware shape synthesis,” in AAAI, 2020. [172] J. R. Shue, E. R. Chan, R. Po, Z. Ankner, J. Wu, and G. Wetzstein,
[146] M. Ibing, G. Kobsik, and L. Kobbelt, “Octree transformer: “3d neural field generation using triplane diffusion,” arXiv
Autoregressive 3d shape generation on hierarchically structured preprint arXiv:2211.16677, 2022.
sequences,” arXiv preprint arXiv:2111.12480, 2021. [173] N. Müller, Y. Siddiqui, L. Porzi, S. R. Bulo, P. Kontschieder,
[147] C.-L. Li, M. Zaheer, Y. Zhang, B. Poczos, and R. Salakhutdinov, and M. Nießner, “Diffrf: Rendering-guided 3d radiance field
“Point cloud gan,” arXiv preprint arXiv:1810.05795, 2018. diffusion,” in CVPR, 2023.
[148] C. Xie, C. Wang, B. Zhang, H. Yang, D. Chen, and F. Wen, “Style- [174] L. Gao, T. Wu, Y.-J. Yuan, M.-X. Lin, Y.-K. Lai, and H. Zhang,
based point generator with adversarial rendering for point cloud “Tm-net: Deep generative networks for textured meshes,” TOG,
completion,” in CVPR, 2021. 2021.
[149] D. Valsesia, G. Fracastoro, and E. Magli, “Learning localized [175] L. Gao, J. Yang, T. Wu, Y.-J. Yuan, H. Fu, Y.-K. Lai, and H. Zhang,
generative models for 3d point clouds via graph convolution,” “Sdm-net: Deep generative network for structured deformable
in ICLR, 2018. mesh,” TOG, 2019.
20
[176] Z. Liu, Y. Feng, M. J. Black, D. Nowrouzezahrai, L. Paull, [202] M. A. Bautista, P. Guo, S. Abnar, W. Talbott, A. Toshev, Z. Chen,
and W. Liu, “Meshdiffusion: Score-based generative 3d mesh L. Dinh, S. Zhai, H. Goh, D. Ulbricht et al., “Gaudi: A neural
modeling,” in ICLR, 2023. architect for immersive 3d scene generation,” NeurIPS, 2022.
[177] Z. Lyu, J. Wang, Y. An, Y. Zhang, D. Lin, and B. Dai, “Controllable [203] T. Anciukevičius, Z. Xu, M. Fisher, P. Henderson, H. Bilen,
mesh generation through sparse latent point diffusion models,” N. J. Mitra, and P. Guerrero, “Renderdiffusion: Image diffusion
in CVPR, 2023. for 3d reconstruction, inpainting and generation,” arXiv preprint
[178] X. Wang and A. Gupta, “Generative image modeling using style arXiv:2211.09869, 2022.
and structure adversarial networks,” in ECCV, 2016. [204] C. Nash, Y. Ganin, S. A. Eslami, and P. Battaglia, “Polygen: An
[179] A. Noguchi and T. Harada, “Rgbd-gan: Unsupervised 3d repre- autoregressive generative model of 3d meshes,” in ICML, 2020.
sentation learning from natural image datasets via rgbd image [205] Y. Shen, J. Gu, X. Tang, and B. Zhou, “Interpreting the latent space
synthesis,” arXiv preprint arXiv:1909.12573, 2019. of gans for semantic face editing,” in CVPR, 2020.
[180] J.-Y. Zhu, Z. Zhang, C. Zhang, J. Wu, A. Torralba, J. Tenenbaum, [206] A. Tewari, M. Elgharib, G. Bharaj, F. Bernard, H.-P. Seidel,
and B. Freeman, “Visual object networks: Image generation with P. Pérez, M. Zollhofer, and C. Theobalt, “Stylerig: Rigging
disentangled 3d representations,” NeurIPS, 2018. stylegan for 3d control over portrait images,” in CVPR, 2020.
[181] X. Chen, D. Cohen-Or, B. Chen, and N. J. Mitra, “Towards a [207] Y. Deng, J. Yang, D. Chen, F. Wen, and X. Tong, “Disentangled and
neural graphics pipeline for controllable image generation,” in controllable face image generation via 3d imitative-contrastive
CGF, 2021. learning,” in CVPR, 2020.
[182] P. Henzler, N. J. Mitra, and T. Ritschel, “Escaping plato’s cave: 3d [208] C. Yang, Y. Shen, and B. Zhou, “Semantic hierarchy emerges in
shape from adversarial rendering,” in ICCV, 2019. deep generative representations for scene synthesis,” IJCV, 2021.
[183] T. H. Nguyen-Phuoc, C. Richardt, L. Mai, Y. Yang, and N. Mitra,
[209] T. Leimkühler and G. Drettakis, “Freestylegan: Free-view editable
“Blockgan: Learning 3d object-aware scene representations from
portrait rendering with the camera manifold,” arXiv preprint
unlabelled images,” NeurIPS, 2020.
arXiv:2109.09378, 2021.
[184] P. Zhou, L. Xie, B. Ni, and Q. Tian, “Cips-3d: A 3d-aware
generator of gans based on conditionally-independent pixel [210] M. Mirza and S. Osindero, “Conditional generative adversarial
synthesis,” arXiv preprint arXiv:2110.09788, 2021. nets,” arXiv preprint arXiv:1411.1784, 2014.
[185] R. Or-El, X. Luo, M. Shan, E. Shechtman, J. J. Park, and [211] V. Sitzmann, M. Zollhöfer, and G. Wetzstein, “Scene represen-
I. Kemelmacher-Shlizerman, “Stylesdf: High-resolution 3d- tation networks: Continuous 3d-structure-aware neural scene
consistent image and geometry generation,” in CVPR, 2022. representations,” in NeurIPS, 2019.
[186] X. Zhang, Z. Zheng, D. Gao, B. Zhang, P. Pan, and Y. Yang, [212] V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein,
“Multi-view consistent generative adversarial networks for 3d- “Implicit neural representations with periodic activation func-
aware image synthesis,” in CVPR, 2022. tions,” NeurIPS, 2020.
[187] Z. Shi, Y. Xu, Y. Shen, D. Zhao, Q. Chen, and D.-Y. Yeung, [213] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville,
“Improving 3d-aware image synthesis with a geometry-aware “Film: Visual reasoning with a general conditioning layer,” in
discriminator,” in NeurIPS, 2022. AAAI, 2018.
[188] J. Xiang, J. Yang, Y. Deng, and X. Tong, “Gram-hd: 3d-consistent [214] V. Dumoulin, E. Perez, N. Schucher, F. Strub, H. d. Vries,
image generation at high resolution with generative radiance A. Courville, and Y. Bengio, “Feature-wise transformations,”
manifolds,” arXiv preprint arXiv:2206.07255, 2022. Distill, 2018.
[189] I. Skorokhodov, S. Tulyakov, Y. Wang, and P. Wonka, “Epigraf: [215] Y. Yang, Y. Yang, H. Guo, R. Xiong, Y. Wang, and Y. Liao, “Urban-
Rethinking training of 3d gans,” arXiv preprint arXiv:2206.10535, giraffe: Representing urban scenes as compositional generative
2022. neural feature fields,” arXiv preprint arXiv:2303.14167, 2023.
[190] Z. Shi, Y. Shen, Y. Xu, S. Peng, Y. Liao, S. Guo, Q. Chen, and D.-Y. [216] R. Abdal, Y. Qin, and P. Wonka, “Image2stylegan: How to embed
Yeung, “Learning 3d-aware image synthesis with unknown pose images into the stylegan latent space?” in ICCV, 2019.
distribution,” in CVPR, 2023. [217] S. Pidhorskyi, D. Adjeroh, and G. Doretto, “Adversarial latent
[191] T. DeVries, M. A. Bautista, N. Srivastava, G. W. Taylor, and autoencoders,” in CVPR, 2020.
J. M. Susskind, “Unconstrained scene generation with locally [218] Q. Wang, Z. Shi, K. Zheng, Y. Xu, S. Peng, and Y. Shen,
conditioned radiance fields,” in ICCV, 2021. “Benchmarking and analyzing 3d-aware image synthesis with
[192] S. Cai, A. Obukhov, D. Dai, and L. Van Gool, “Pix2nerf: a modularized codebase,” arXiv preprint arXiv:2306.12423, 2023.
Unsupervised conditional p-gan for single image to neural [219] J. Zhang, Z. Jiang, D. Yang, H. Xu, Y. Shi, G. Song, Z. Xu, X. Wang,
radiance fields translation,” in CVPR, 2022. and J. Feng, “Avatargen: a 3d generative model for animatable
[193] A. Tewari, X. Pan, O. Fried, M. Agrawala, C. Theobalt et al., “Dis- human avatars,” arXiv preprint arXiv:2208.00561, 2022.
entangled3d: Learning a 3d generative model with disentangled [220] A. W. Bergman, P. Kellnhofer, Y. Wang, E. R. Chan, D. B. Lindell,
geometry and appearance from monocular images,” in CVPR, and G. Wetzstein, “Generative neural articulated radiance fields,”
2022. arXiv preprint arXiv:2206.14314, 2022.
[194] M. Lee, C. Chung, H. Cho, M. Kim, S. Jung, J. Choo, and M. Sung, [221] Y. Yuan, Y. Lai, T. Wu, L. Gao, and L. Liu, “A revisit of shape
“3d-gif: 3d-controllable object generation via implicit factorized editing techniques: From the geometric to the neural viewpoint,”
representations,” arXiv preprint arXiv:2203.06457, 2022. JCST, 2021.
[195] A. R. Kosiorek, H. Strathmann, D. Zoran, P. Moreno, R. Schneider,
[222] J. Liu, F. Yu, and T. A. Funkhouser, “Interactive 3d modeling with
S. Mokrá, and D. J. Rezende, “Nerf-vae: A geometry aware 3d
a generative adversarial network,” in 3DV, 2017.
scene generative model,” in ICML, 2021.
[223] Z. Zheng, T. Yu, Q. Dai, and Y. Liu, “Deep implicit templates for
[196] D. Rebain, M. J. Matthews, K. M. Yi, D. Lagun, and A. Tagliasac-
3d shape representation,” in CVPR, 2021.
chi, “Lolnerf: Learn from one look,” in CVPR, 2022.
[197] Y. Xu, M. Chai, Z. Shi, S. Peng, I. Skorokhodov, A. Siarohin, [224] T. Jahan, Y. Guan, and O. van Kaick, “Semantics-guided latent
C. Yang, Y. Shen, H.-Y. Lee, B. Zhou et al., “Discoscene: Spatially space exploration for shape generation,” CGF, 2021.
disentangled generative radiance fields for controllable 3d-aware [225] Z. Liu, Y. Wang, X. Qi, and C. Fu, “Towards implicit text-guided
scene synthesis,” in CVPR, 2023. 3d shape generation,” CVPR, 2022.
[198] F. Tan, S. Fanello, A. Meka, S. Orts-Escolano, D. Tang, R. Pandey, [226] R. Fu, X. Zhan, Y. Chen, D. Ritchie, and S. Sridhar, “Shapecrafter:
J. Taylor, P. Tan, and Y. Zhang, “Volux-gan: A generative model A recursive text-conditioned 3d shape generation model,” arXiv
for 3d face synthesis with hdri relighting,” TOG, 2022. preprint arXiv:2207.09446, 2022.
[199] J. Zhang, E. Sangineto, H. Tang, A. Siarohin, Z. Zhong, N. Sebe, [227] Y. Liao, K. Schwarz, L. M. Mescheder, and A. Geiger, “Towards
and W. Wang, “3d-aware semantic-guided generative model for unsupervised learning of generative models for 3d controllable
human synthesis,” in ECCV, 2022. image synthesis,” in CVPR, 2020.
[200] J. Sun, X. Wang, Y. Shi, L. Wang, J. Wang, and Y. Liu, “Ide- [228] A. Noguchi, X. Sun, S. Lin, and T. Harada, “Unsupervised
3d: Interactive disentangled editing for high-resolution 3d-aware learning of efficient geometry-aware neural articulated represen-
portrait synthesis,” in SIGGRAPH Asia, 2022. tations,” in ECCV, 2022.
[201] J. Sun, X. Wang, L. Wang, X. Li, Y. Zhang, H. Zhang, and Y. Liu, [229] C. Wang, M. Chai, M. He, D. Chen, and J. Liao, “Clip-nerf: Text-
“Next3d: Generative neural texture rasterization for 3d-aware and-image driven manipulation of neural radiance fields,” CVPR,
head avatars,” in CVPR, 2023. 2022.
21
[230] B. Zhang, M. Nießner, and P. Wonka, “3dilg: Irregular latent grids

for 3d generative modeling,” arXiv preprint arXiv:2205.13914,
2022.
[231] S. Duggal, Z. Wang, W. Ma, S. Manivasagam, J. Liang, S. Wang,
and R. Urtasun, “Secrets of 3d implicit object shape reconstruc-
tion in the wild,” arXiv preprint arXiv:2101.06860, 2021.
[232] A. Yuille and D. Kersten, “Vision as bayesian inference: analysis
by synthesis?” Trends in Cognitive Sciences, 2006.
[233] X. Chen, Z. Dong, J. Song, A. Geiger, and O. Hilliges, “Category
level object pose estimation via neural analysis-by-synthesis,” in
ECCV, 2020.
[234] J. Guo, F. Zhong, R. Xiong, Y. Liu, Y. Wang, and Y. Liao, “A visual
navigation perspective for category-level object pose estimation,”
in ECCV, 2022.
[235] B. Eckart, W. Yuan, C. Liu, and J. Kautz, “Self-supervised learning
on 3d point clouds by learning discrete generative models,” in
CVPR, 2021.
[236] J. Xie, Z. Zheng, R. Gao, W. Wang, S.-C. Zhu, and Y. N. Wu,
“Generative voxelnet: learning energy-based models for 3d shape
synthesis and analysis,” TPAMI, 2020.
[237] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hi-
erarchical text-conditional image generation with CLIP latents,”
arXiv preprint arXiv:2204.06125, 2022.
[238] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S.
Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans,
J. Ho, D. J. Fleet, and M. Norouzi, “Photorealistic text-to-image
diffusion models with deep language understanding,” arXiv
preprint arXiv:2205.11487, 2022.

3D Generative Models A Survey

Uploaded by

Document Informationclick to expand document information

Document Informationclick to expand document information

Copyright:

Available Formats

3D Generative Models A Survey

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

3D Generative Models A Survey

Uploaded by

Copyright:

Available Formats

1

3D Generative Models: A Survey

Index Terms—Generative modeling, 3D representations, deep learning, unsupervised learning, 3D vision.

T HE rapid advancement of deep learning [1] has revolu-

3D representations (Sec 3.2)

content generation. 2 S COPE OF THIS SURVEY

3D-GAN VON IM-Net PolyGen DiﬀRF

Voxel grid Point cloud Mesh GAN VAE Normalizing Flow

energy-based models (EBMs) [52]. These models learn by Variational Autoencoders.

(a) Methods using 3D supervision. (b) Methods using 2D supervision.

(a) Point cloud generation (b) Voxel grid generation

GAN models, VAE models

Normalizing ﬂow models

+ Point oﬀsets Network Template mesh Deformation Network

Diﬀusion models Template-based representation

(e) Generation of hybrid representations

Random vector Point value

Feature volume Tri-plane feature maps

4.4 Meshes and Other Representations

TM-Net [174] , MeshDiffusion [176] ,

Depth/Normal Maps S2 -GAN [178] , RGBDGAN [179] , DepthGAN [107] - -

VoxGRAF [60] , HoloGAN [68] , VON [180] ,

GRAF [29] , π -GAN [30] , CAMPARI [109] , CIPS-3D [184] ,

Triplane Mesh Voxel

Camera pose Image/Feature

(28, 183.9) 𝜋-GAN[29]

et al. [145] allow users to edit a part of the voxel grid

6.2 3D-Aware Image Editing

[230] B. Zhang, M. Nießner, and P. Wonka, “3dilg: Irregular latent grids

You might also like