GRM: Large Gaussian Reconstruction Model For Efficient 3D Reconstruction and Generation
1 Stanford University   2 The Hong Kong University of Science and Technology   3 Shanghai AI Laboratory   4 Zhejiang University   5 Ant Group
1 Introduction
The availability of high-quality and diverse 3D assets is critical in many domains,
including robotics, gaming, and architecture, among others. Yet, creating these as-
sets has been a tedious manual process, requiring expertise in difficult-to-use
computer graphics tools.
Emerging 3D generative models offer the ability to easily create diverse
3D assets from simple text prompts or single images [70]. Optimization-based
3D generative methods can produce high-quality assets, but they typically require
a long time, often hours, to produce a single 3D asset [50, 71, 93, 98, 101].
Recent feed-forward 3D generative methods have demonstrated excellent quality
and diversity while offering significant speedups over optimization-based 3D
generation approaches [2, 12, 30, 38, 46, 54, 78, 91, 106]. These state-of-the-art
2 Related Work
Sparse-view Reconstruction. Neural representations, as highlighted in prior
works [9,62–64,69,84,86], present a promising foundation for scene representation
and neural rendering [95]. When applied to novel-view synthesis, these methods
have demonstrated success in scenarios with multi-view training images, showcasing
proficiency in single-scene overfitting. Notably, recent advancements [10,
33, 51, 59, 100, 109] have extended these techniques to operate with a sparse set
of views, displaying improved generalization to unseen scenes. However, these
methods struggle to capture the multiple modes present in large-scale datasets,
which limits their ability to generate realistic results. Recent works [30, 99, 114]
further scale up the model and datasets for better generalization, but their reliance
on neural volume-based scene representations remains inadequate for efficiently
synthesizing high-resolution, high-fidelity images. Our proposed solution combines
pixel-aligned 3D Gaussians [8, 90] with our effective transformer
architecture. This approach is designed to elevate both the efficiency and quality
of the sparse-view reconstructor when provided with only four input images.
3 Method
GRM is a feed-forward sparse-view 3D reconstructor, utilizing four input images
to efficiently infer underlying 3D Gaussians [43]. Supplied with a multi-view
image generator head [46, 79], GRM can generate 3D assets from text or
a single image. Different from LRM [30, 46, 99, 106], we leverage pixel-aligned
Gaussians (Sec. 3.1) to enhance efficiency and reconstruction quality, and we
adopt a transformer-based network to predict the properties of the Gaussians by
associating information from all input views in a memory-efficient way (Sec. 3.2).
Finally, we detail the training objectives in Sec. 3.3 and demonstrate high-quality
text-to-3D and image-to-3D generation in a few seconds (Sec. 3.4).
Fig. 2: GRM pipeline. Given 4 input views, which can be generated from text [46] or a
single image [79], our sparse-view reconstructor estimates the underlying 3D scene in a
single feed-forward pass using pixel-aligned Gaussians. The transformer-based sparse-
view reconstructor, equipped with a novel transformer-based upsampler, is capable of
leveraging long-range visual cues to efficiently generate a large number of 3D Gaussians
for high-fidelity 3D reconstruction.
Transformer-based Encoder. For a given input image $\mathbf{I}_v \in \mathbb{R}^{H\times W\times 3}$, we
first inject the camera information into every pixel following [85, 106] with Plücker
embedding [34, 85]. Then we use a convolutional image tokenizer with kernel size
and stride 16 to extract local image features, resulting in a $\nicefrac{H}{16} \times \nicefrac{W}{16}$ feature
map. The features from every view are concatenated into a single feature
vector of length $V\times \nicefrac{H}{16}\times \nicefrac{W}{16}$. Following common practice in vision
transformers, we append learnable image position encodings for each token to
encode the spatial information in the image space. The resulting feature vector
is subsequently fed to a series of self-attention layers. The self-attention layers
attend to all tokens across all the input views, ensuring mutual information
exchange among all input views, resembling traditional feature matching and
encouraging consistent predictions for pixels belonging to different images. The
output of the transformer-based encoder is a feature vector of length $V\times \nicefrac{H}{16} \times \nicefrac{W}{16}$,
denoted as $\mathbf{F}$. Formally, the encoder function can be written as
$\mathbf{F} = E_{\theta, \phi}\left(\mathcal{I}, \mathcal{C}\right),$ (2)
where $\mathcal{I}$ and $\mathcal{C}$ denote the sets of input images and cameras, and $\theta$ and $\phi$ denote
the network parameters and the learnable image position encodings.
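As a concrete illustration, a per-pixel Plücker ray embedding can be computed from the camera parameters roughly as follows. This is a minimal sketch under our own assumptions (a pinhole camera with intrinsics K and camera-to-world pose c2w; the function name and conventions are ours, not the authors' code); the resulting 6-channel map is concatenated with the RGB image before tokenization.

import torch
import torch.nn.functional as F

def plucker_embedding(K, c2w, H, W):
    """Per-pixel Plücker ray embedding (d, o x d) as a 6-channel map.

    K:   (3, 3) pinhole intrinsics
    c2w: (4, 4) camera-to-world pose
    Returns a (H, W, 6) tensor that can be concatenated with the RGB image.
    """
    # Pixel-center grid.
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32) + 0.5,
        torch.arange(W, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    # Unproject to camera-space ray directions.
    dirs_cam = torch.stack(
        [(xs - K[0, 2]) / K[0, 0], (ys - K[1, 2]) / K[1, 1], torch.ones_like(xs)], dim=-1
    )
    # Rotate to world space and normalize.
    dirs = F.normalize(dirs_cam @ c2w[:3, :3].T, dim=-1)
    origin = c2w[:3, 3].expand_as(dirs)          # camera center, broadcast per pixel
    moment = torch.cross(origin, dirs, dim=-1)   # moment o x d
    return torch.cat([dirs, moment], dim=-1)     # (H, W, 6)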
In the transformer encoder, we utilize patch convolution to tokenize the
input images, resulting in an output feature $\mathbf{F}$ with a smaller spatial dimension.
While this is advantageous for capturing broader image context, it is limited in
modeling high-frequency details. To address this, we introduce a transformer-based
upsampler to improve detail reconstruction.
Within each upsampler block, we first adjust the feature dimensions with a linear
layer and then double the spatial dimension with a PixelShuffle layer [80]. The
upsampled feature maps are grouped and passed to a self-attention layer within
sliding windows of size $W$, shifted by $\nicefrac{W}{2}$. While self-attention is performed
within each distinct window to keep memory and computation manageable, the
overlap between shifted windows promotes non-local information flow. Formally,
an upsampler block contains the following operations:
$\mathbf{F} = \mathrm{PixelShuffle}\left(\mathrm{Linear}(\mathbf{F}),\, 2\right),$ (3)
$\mathbf{F} = \mathrm{Attn}\left(\mathbf{F},\, W\right),$ (4)
$\mathbf{F} = \mathrm{Shift}\left(\mathrm{Attn}\left(\mathrm{Shift}\left(\mathbf{F},\, \nicefrac{W}{2}\right),\, W\right),\, -\nicefrac{W}{2}\right).$ (5)
After several blocks, the context length expands to the same spatial dimension
as the input. We reshape the features back to 2D tensors, resulting in V feature
maps with a resolution of $H \times W$, denoted as $\mathcal{F} = \{\mathbf{F}_v\}_{v=1}^{V}$.
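For illustration, one upsampler block could be sketched in PyTorch as follows. This is our reading of Eqs. (3)-(5), not the released implementation: the channel widths, the absence of normalization and residual connections, and the 1D grouping of tokens into windows are assumptions.

import torch
import torch.nn as nn

class UpsamplerBlock(nn.Module):
    """Minimal sketch of one upsampler block (Eqs. 3-5): linear layer,
    2x PixelShuffle, windowed self-attention, shifted windowed self-attention."""

    def __init__(self, dim, window=4096, heads=8):
        super().__init__()
        self.fc = nn.Linear(dim, dim * 2)   # expand channels so PixelShuffle(2)
        self.shuffle = nn.PixelShuffle(2)   # trades 4x channels for 2x H and W
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(dim // 2, heads, batch_first=True) for _ in range(2)]
        )
        self.window = window                # number of tokens attended to jointly

    def _window_attn(self, tokens, attn):
        b, n, c = tokens.shape              # assumes n is divisible by self.window
        w = tokens.reshape(b * n // self.window, self.window, c)
        w, _ = attn(w, w, w)
        return w.reshape(b, n, c)

    def forward(self, x):                   # x: (B, H, W, C) per-view feature map
        b, h, w, c = x.shape
        x = self.shuffle(self.fc(x).permute(0, 3, 1, 2))  # (B, C/2, 2H, 2W)
        tokens = x.flatten(2).transpose(1, 2)             # (B, 4HW, C/2)
        tokens = self._window_attn(tokens, self.attn[0])  # Eq. (4): windowed attention
        shift = self.window // 2                          # Eq. (5): shifted windows
        tokens = torch.roll(tokens, -shift, dims=1)
        tokens = self._window_attn(tokens, self.attn[1])
        tokens = torch.roll(tokens, shift, dims=1)
        return tokens.view(b, 2 * h, 2 * w, c // 2)

With the reported window size of 4096 tokens, the shifted pass lets tokens exchange information across window boundaries without ever attending over the full sequence.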
Rendering with Gaussian Splatting. From the upsampled features $\mathcal{F}$, we
predict per-view Gaussian attribute maps for the pixel-aligned Gaussians using
separate linear heads. As mentioned in Sec. 3.1, these are then unprojected along
the viewing ray according to the predicted depth, from which a final image $\mathbf{I}_{v'}$
and alpha mask $\mathbf{M}_{v'}$ (used for supervision) can be rendered at an arbitrary
camera view $\mathbf{c}_{v'}$ through Gaussian splatting.
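For clarity, the unprojection step amounts to moving each pixel's Gaussian center along its viewing ray by the predicted depth. Below is a minimal sketch under our own assumptions (per-pixel ray origins and unit directions, e.g., the same rays used for the Plücker embedding, and a depth channel from one of the linear heads; the function name is ours).

import torch

def unproject_pixel_aligned_gaussians(depth, ray_o, ray_d):
    """Place one 3D Gaussian center per pixel along its viewing ray.

    depth: (V, H, W, 1) predicted per-pixel depth
    ray_o: (V, H, W, 3) ray origins (camera centers)
    ray_d: (V, H, W, 3) unit ray directions
    Returns (V*H*W, 3) Gaussian centers for splatting.
    """
    centers = ray_o + depth * ray_d   # move along the ray by the predicted depth
    return centers.reshape(-1, 3)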
3.3 Training
During the training phase, we sample V = 4 input views that sufficiently cover
the whole scene, and supervise with additional views to guide the reconstruction.
To remove floaters, we also supervise the alpha map from Gaussian splatting with
the ground truth object mask available from the training data.
Given $V'$ supervision views, the training objective is
$\mathcal{L} = \frac{1}{V'}\sum_{1 \leq v' \leq V'} \mathcal{L}_{\mathrm{img}} + \mathcal{L}_{\mathrm{mask}},$ (6)
$\mathcal{L}_{\mathrm{img}} = L_2\left(\mathbf{I}_{v'}, \hat{\mathbf{I}}_{v'}\right) + 0.5\, L_p\left(\mathbf{I}_{v'}, \hat{\mathbf{I}}_{v'}\right),$ (7)
$\mathcal{L}_{\mathrm{mask}} = L_2\left(\mathbf{M}_{v'}, \hat{\mathbf{M}}_{v'}\right),$ (8)
where $\hat{\mathbf{I}}_{v'}$ and $\hat{\mathbf{M}}_{v'}$ denote the ground-truth image and alpha mask, respectively,
and $L_2$ and $L_p$ are the L2 loss and the perceptual loss [35].
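In code, the objective could look roughly like the sketch below, assuming the predictions and ground truth are batched over the V' supervision views and that an off-the-shelf LPIPS implementation stands in for the perceptual loss (the exact perceptual backbone used is not specified here).

import torch
import torch.nn.functional as F
import lpips  # off-the-shelf perceptual-loss [35] implementation (assumed)

lpips_fn = lpips.LPIPS(net="vgg")  # LPIPS expects inputs in [-1, 1]

def grm_loss(pred_imgs, gt_imgs, pred_masks, gt_masks):
    """Image + mask loss averaged over the V' supervision views (Eqs. 6-8).

    pred_imgs, gt_imgs:   (V', 3, H, W) renderings and ground truth in [0, 1]
    pred_masks, gt_masks: (V', 1, H, W) alpha masks
    """
    l_img = F.mse_loss(pred_imgs, gt_imgs) \
        + 0.5 * lpips_fn(pred_imgs * 2 - 1, gt_imgs * 2 - 1).mean()
    l_mask = F.mse_loss(pred_masks, gt_masks)
    return l_img + l_mask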
To further constrain the Gaussian scaling, we apply an activation function to the
output $\mathbf{s}_o$ of the linear head for scale, which linearly interpolates between
predefined scale bounds $s_{\min}$ and $s_{\max}$:
$\mathbf{s} = s_{\min}\,\mathrm{Sigmoid}(\mathbf{s}_o) + s_{\max}\left(1 - \mathrm{Sigmoid}(\mathbf{s}_o)\right).$ (9)
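In code, Eq. (9) is a one-line activation; the sketch below uses the bounds reported in Sec. 4.1 as defaults.

import torch

def scale_activation(s_o, s_min=0.005, s_max=0.02):
    """Eq. (9): map the raw output of the scale head to a bounded Gaussian scale."""
    sig = torch.sigmoid(s_o)
    return s_min * sig + s_max * (1.0 - sig)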
4 Experiments
4.1 Experimental Settings
Training Settings. The encoder E consists of 1 strided convolution layer to
tokenize the image and 24 self-attention layers with channel width 768. The
upsampler consists of 4 upsampler blocks and each block contains 2 attention
layers. For training, we use AdamW [60] with a learning rate initialized at 0.0003
and decayed with cosine annealing after 3k steps. Deferred back-propagation [110]
is adopted to reduce GPU memory usage. We train our model on 32 NVIDIA A100
GPUs for 40M images at a resolution of 512×512, using a batch size of 8 per GPU;
training takes about 4 days. The window size in the transformer-based upsampler
is 4096. The values for $s_{\min}$ and $s_{\max}$ are set to 0.005 and 0.02.
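The optimizer setup could be reproduced roughly as in the sketch below; the warm-up shape before the cosine decay and the weight-decay value are our assumptions rather than reported settings.

import torch
import torch.nn as nn

def build_optimizer(model: nn.Module, total_steps: int, warmup_steps: int = 3000):
    """AdamW with lr 3e-4, cosine-annealed after 3k steps (Sec. 4.1)."""
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
    warmup = torch.optim.lr_scheduler.LinearLR(
        opt, start_factor=1e-3, total_iters=warmup_steps)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
        opt, T_max=total_steps - warmup_steps)
    sched = torch.optim.lr_scheduler.SequentialLR(
        opt, schedulers=[warmup, cosine], milestones=[warmup_steps])
    return opt, sched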
Test Data. We use Google Scanned Objects (GSO) [20] and render a total of
64 test views with equidistant azimuths at {10, 20, 30, 40} degree elevations. For
sparse-view reconstruction, the evaluation uses full renderings from 100 objects
to assess all models. For single-view reconstruction, we restrict the analysis to
renderings generated at an elevation angle of 20 degrees from 250 objects. More
details about the training settings and data are presented in the Supplementary Material.
[Qualitative sparse-view reconstruction comparison (figure). Columns: Input, GS, SparseNeuS, IBRNet, LGM, MV-LRM, Ours, GT.]
Results. The quantitative results are presented in Tab. 2. Notably, GRM outper-
forms all baselines across all metrics. Our model takes only 5 seconds in total
to generate 3D Gaussians from the input image, which includes the runtime of
the generation head. While this is slower than TriplaneGaussian, we achieve sig-
nificantly higher reconstruction quality. Our advantage is further demonstrated
in the qualitative results shown in Fig. 4. On the top, we compare with other
3D Gaussian-based methods. The pure reconstruction method TriplaneGaussian
struggles to fill in the missing content realistically (see rows 1–2). DreamGaus-
sian, using SDS optimization, shows various geometry artifacts (row 1) and
overall noticeable inconsistencies with the input image. LGM, also using an
image-to-MV generation head, produces blurrier and inconsistent texture and
geometry.
The bottom of Fig. 4 shows non-Gaussian-based approaches. These methods
all display various geometry and appearance artifacts, inconsistent with the
input. In contrast, our scalable GRM learns robust data priors from extensive
training data, demonstrating strong generalization ability on generated multi-
view input images with accurate geometry and sharper details. This leads to fast
3D generation and state-of-the-art single-image 3D reconstruction.
By using a text-to-MV diffusion model, such as the first stage of Instant3D [46],
GRM can generate 3D assets from text prompts.
Baselines and metrics. We choose Shap-E [36], Instant3D [46], LGM [92], and
MVDream [81] as baselines. MVDream represents the state of the art among
optimization-based methods, while the others are feed-forward methods. We use
the 200 text prompts from DreamFusion [71]. The metrics we use are CLIP
Precision [32, 93], Averaged Precision [106], and CLIP Score [46, 106], which
measure the alignment between the text and images. All the comparisons are done at a resolution
of 512×512. Additionally, we include a preference study on Amazon Mechanical
Turk, where we recruited 90 unique users to compare the generations for 50 text
prompts.
Results. As shown in Tab. 3, our method consistently ranks the highest among
feed-forward methods (rows 1–3) and is on par with the optimization-based
MVDream. Visually, as shown in Fig. 5, our method excels at generating plau-
sible geometry and highly detailed texture. MVDream, using SDS-based opti-
mization, requires about an hour to generate a single scene. It delivers impressive visual
quality, but exhibits sub-optimal text-image alignment, as indicated by the CLIP
score and the ‘a cat wearing eyeglasses’ example in Fig. 4.
We analyze our model components and architectural design choices at a train-
ing resolution of 256. The results are shown in Tab. 4. Note that all ablations
are trained with 16 GPUs for 14M images.
Table 4: Ablation. Left: Using the sigmoid activation improves the visual quality
across all metrics; increasing the number of upsampler blocks also increases the
Gaussians’ density and their modeling capability, as demonstrated by the growing trend
of PSNR; finally, supervising the alpha channel further boosts the reconstruction quality
by removing outlier Gaussians. Right: We ablate the proposed transformer-based
upsampler and pixel-aligned Gaussians using alternative approaches, and demonstrate
that each component is critical for the final reconstruction quality.
5 Discussion
In this paper, we introduce the Gaussian Reconstruction Model (GRM)—a new
feed-forward 3D generative model that achieves state-of-the-art quality and speed.
At the core of GRM is a sparse-view 3D reconstructor, which leverages a novel
transformer-based architecture to reconstruct 3D objects represented by pixel-
aligned Gaussians. We plan to release the code and trained models to make this
advancement in 3D content creation available to the community.
Limitations and Future Work. The output quality of our sparse-view re-
constructor suffers when the input views are inconsistent. The reconstructor
Ethics. Generative models pose a societal threat; we do not condone using our
work to generate deep fakes intended to spread misinformation.
Acknowledgement. We would like to thank Shangzhan Zhang for his help with
the demo video, and Minghua Liu for assisting with the evaluation of One-2-3-
45++. This project was supported by Google, Samsung, and a Swiss Postdoc
Mobility fellowship.
GRM: Large Gaussian Reconstruction Model for
Efficient 3D Reconstruction and Generation
Supplementary Material
A Implementation Details
Network Architecture and Training Details. We illustrate the details of
network architecture and training in Tab. 1.
B Geometry Evaluation
Here, we demonstrate the geometry evaluation results on sparse-view reconstruc-
tion and single-image-to-3D generation. We report Chamfer Distance (CD) and
Table 1: Implementation details.
Pseudo-code for deferred back-propagation [110]: the scene is rendered without autograd in the forward pass, then re-rendered with gradients enabled so that the cached image gradients can be back-propagated to the Gaussian parameters.

def forward(gaussians, cameras):
    with torch.no_grad():
        images = Render(gaussians, cameras)
    return images

def backward(gaussians, cameras, grad_images):
    with torch.enable_grad():
        images = Render(gaussians, cameras)
        images.backward(grad_images)
    return gaussians.grad
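For context, this pattern is typically wrapped in a custom autograd function so that the full-resolution rendering runs once without building a computation graph and is re-rendered with gradients enabled only in the backward pass. The sketch below is our illustration of that wiring, not the authors' implementation; Render is assumed to be the differentiable Gaussian-splatting rasterizer, and the patch-wise re-rendering that yields the actual memory savings is omitted for brevity.

import torch

class DeferredRender(torch.autograd.Function):
    """Illustrative wrapper around the forward/backward pseudo-code above."""

    @staticmethod
    def forward(ctx, gaussians, cameras):
        ctx.save_for_backward(gaussians)
        ctx.cameras = cameras
        with torch.no_grad():              # no graph is kept for the full image
            return Render(gaussians, cameras)

    @staticmethod
    def backward(ctx, grad_images):
        (gaussians,) = ctx.saved_tensors
        gaussians = gaussians.detach().requires_grad_(True)
        with torch.enable_grad():          # re-render (ideally patch by patch)
            images = Render(gaussians, ctx.cameras)
            images.backward(grad_images)
        return gaussians.grad, None        # gradients w.r.t. (gaussians, cameras)

# usage: images = DeferredRender.apply(gaussians, cameras)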
E Limitations
Despite the high-quality reconstruction, image-to-3D, and text-to-3D generation
results we achieve, our model relies on the information present in the input views
and lacks the capability to hallucinate unseen content. For example, if a region is not observed
in any of the input images, the model may produce blurry textures for it.
References
1. Abdal, R., Yifan, W., Shi, Z., Xu, Y., Po, R., Kuang, Z., Chen, Q., Yeung,
D.Y., Wetzstein, G.: Gaussian shell maps for efficient 3d human generation. arXiv
preprint arXiv:2311.17857 (2023) 4
2. Anciukevičius, T., Xu, Z., Fisher, M., Henderson, P., Bilen, H., Mitra, N.J.,
Guerrero, P.: Renderdiffusion: Image diffusion for 3d reconstruction, inpainting
and generation. In: IEEE Conf. Comput. Vis. Pattern Recog. (2023) 1, 4
3. Beltagy, I., Peters, M.E., Cohan, A.: Longformer: The long-document transformer.
arXiv preprint arXiv:2004.05150 (2020) 3, 6
4. Brock, A., Donahue, J., Simonyan, K.: Large scale gan training for high fidelity
natural image synthesis. arXiv preprint arXiv:1809.11096 (2018) 4
5. Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., De Mello, S., Gallo,
O., Guibas, L.J., Tremblay, J., Khamis, S., et al.: Efficient geometry-aware 3d
generative adversarial networks. In: IEEE Conf. Comput. Vis. Pattern Recog.
(2022) 2, 4
6. Chan, E.R., Monteiro, M., Kellnhofer, P., Wu, J., Wetzstein, G.: pi-gan: Periodic
implicit generative adversarial networks for 3d-aware image synthesis. In: IEEE
Conf. Comput. Vis. Pattern Recog. (2021) 4
7. Chan, E.R., Nagano, K., Chan, M.A., Bergman, A.W., Park, J.J., Levy, A.,
Aittala, M., De Mello, S., Karras, T., Wetzstein, G.: Generative novel view
synthesis with 3d-aware diffusion models. Int. Conf. Comput. Vis. (2023) 4
8. Charatan, D., Li, S., Tagliasacchi, A., Sitzmann, V.: pixelsplat: 3d gaussian splats
from image pairs for scalable generalizable 3d reconstruction. arXiv preprint
arXiv:2312.12337 (2023) 3, 5
9. Chen, A., Xu, Z., Geiger, A., Yu, J., Su, H.: Tensorf: Tensorial radiance fields. In:
European Conference on Computer Vision (ECCV) (2022) 3
10. Chen, A., Xu, Z., Zhao, F., Zhang, X., Xiang, F., Yu, J., Su, H.: Mvsnerf: Fast
generalizable radiance field reconstruction from multi-view stereo. In: Int. Conf.
Comput. Vis. (2021) 3
11. Chen, G., Wang, W.: A survey on 3d gaussian splatting. arXiv preprint
arXiv:2401.03890 (2024) 4
12. Chen, H., Gu, J., Chen, A., Tian, W., Tu, Z., Liu, L., Su, H.: Single-stage diffusion
nerf: A unified approach to 3d generation and reconstruction. arXiv preprint
arXiv:2304.06714 (2023) 1, 4
13. Chen, R., Chen, Y., Jiao, N., Jia, K.: Fantasia3d: Disentangling geometry
and appearance for high-quality text-to-3d content creation. arXiv preprint
arXiv:2303.13873 (2023) 4
14. Chen, Z., Wang, F., Liu, H.: Text-to-3d using gaussian splatting. arXiv preprint
arXiv:2309.16585 (2023) 4
15. Chung, J., Lee, S., Nam, H., Lee, J., Lee, K.M.: Luciddreamer: Domain-free
generation of 3d gaussian splatting scenes. arXiv preprint arXiv:2311.13384 (2023)
4
16. Curless, B., Levoy, M.: A volumetric method for building complex models from
range images. In: Proceedings of the 23rd annual conference on Computer graphics
and interactive techniques (1996) 18
17. Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E.,
Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of
annotated 3d objects. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 13142–
13153 (2023) 8
18. Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances
in neural information processing systems 34, 8780–8794 (2021) 4
19. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is
worth 16x16 words: Transformers for image recognition at scale. arXiv preprint
arXiv:2010.11929 (2020) 3
20. Downs, L., Francis, A., Koenig, N., Kinman, B., Hickman, R., Reymann, K.,
McHugh, T.B., Vanhoucke, V.: Google scanned objects: A high-quality dataset of
3d scanned household items. In: 2022 International Conference on Robotics and
Automation (ICRA). pp. 2553–2560. IEEE (2022) 8, 10
21. Fei, B., Xu, J., Zhang, R., Zhou, Q., Yang, W., He, Y.: 3d gaussian as a new
vision era: A survey. arXiv preprint arXiv:2402.07181 (2024) 4
22. Gao, J., Shen, T., Wang, Z., Chen, W., Yin, K., Li, D., Litany, O., Gojcic, Z.,
Fidler, S.: Get3d: A generative model of high quality 3d textured shapes learned
from images. Adv. Neural Inform. Process. Syst. (2022) 4
23. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Adv. Neural Inform.
Process. Syst. (2014) 4
24. Gu, J., Liu, L., Wang, P., Theobalt, C.: Stylenerf: A style-based 3d-aware
generator for high-resolution image synthesis. arXiv preprint arXiv:2110.08985
(2021) 4
25. Gu, J., Trevithick, A., Lin, K.E., Susskind, J.M., Theobalt, C., Liu, L., Ra-
mamoorthi, R.: Nerfdiff: Single-image view synthesis with nerf-guided distillation
from 3d-aware diffusion. In: Int. Conf. Mach. Learn. (2023) 4
26. Gupta, A., Xiong, W., Nie, Y., Jones, I., Oğuz, B.: 3dgen: Triplane latent diffusion
for textured mesh generation. arXiv preprint arXiv:2303.05371 (2023) 4
27. Hertz, A., Aberman, K., Cohen-Or, D.: Delta denoising score. In: Proceedings
of the IEEE/CVF International Conference on Computer Vision. pp. 2328–2337
(2023) 4
28. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans
trained by a two time-scale update rule converge to a local nash equilibrium.
Advances in neural information processing systems 30 (2017) 10
29. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Adv. Neural
Inform. Process. Syst. (2020) 4
30. Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K.,
Bui, T., Tan, H.: Lrm: Large reconstruction model for single image to 3d. arXiv
preprint arXiv:2311.04400 (2023) 1, 3, 4, 5
31. Hu, L., Zhang, H., Zhang, Y., Zhou, B., Liu, B., Zhang, S., Nie, L.: Gaussiana-
vatar: Towards realistic human avatar modeling from a single video via animatable
3d gaussians. arXiv preprint arXiv:2312.02134 (2023) 4
32. Jain, A., Mildenhall, B., Barron, J.T., Abbeel, P., Poole, B.: Zero-shot text-guided
object generation with dream fields. In: IEEE Conf. Comput. Vis. Pattern Recog.
(2022) 12
33. Jain, A., Tancik, M., Abbeel, P.: Putting nerf on a diet: Semantically consistent
few-shot view synthesis. In: Int. Conf. Comput. Vis. (2021) 3
34. Jia, Y.B.: Plücker coordinates for lines in the space. Problem Solver Techniques
for Applied Computer Science, Com-S-477/577 Course Handout (2020) 6
35. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer
and super-resolution. In: ECCV. Springer (2016) 7
36. Jun, H., Nichol, A.: Shap-e: Generating conditional 3d implicit functions. arXiv
preprint arXiv:2305.02463 (2023) 4, 10, 11, 12, 13, 18
37. Kang, M., Zhu, J.Y., Zhang, R., Park, J., Shechtman, E., Paris, S., Park, T.:
Scaling up gans for text-to-image synthesis. In: IEEE Conf. Comput. Vis. Pattern
Recog. (2023) 4
38. Karnewar, A., Vedaldi, A., Novotny, D., Mitra, N.J.: Holodiffusion: Training a 3d
diffusion model using 2d images. In: IEEE Conf. Comput. Vis. Pattern Recog.
(2023) 1, 4
39. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for
improved quality, stability, and variation. In: Int. Conf. Learn. Represent. (2018)
4
40. Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J., Aila,
T.: Alias-free generative adversarial networks. In: Adv. Neural Inform. Process.
Syst. (2021) 4
41. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative
adversarial networks. In: IEEE Conf. Comput. Vis. Pattern Recog. (2019) 4
42. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing
and improving the image quality of StyleGAN. In: IEEE Conf. Comput. Vis.
Pattern Recog. (2020) 4
43. Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for
real-time radiance field rendering. ACM Transactions on Graphics 42(4) (2023)
4, 5, 8, 9, 10, 13
44. Keselman, L., Hebert, M.: Approximate differentiable rendering with algebraic
surfaces. In: European Conference on Computer Vision. pp. 596–614. Springer
(2022) 4
45. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T.,
Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. arXiv preprint
arXiv:2304.02643 (2023) 2
46. Li, J., Tan, H., Zhang, K., Xu, Z., Luan, F., Xu, Y., Hong, Y., Sunkavalli, K.,
Shakhnarovich, G., Bi, S.: Instant3d: Fast text-to-3d with sparse-view generation
and large reconstruction model. https://arxiv.org/abs/2311.06214 (2023) 1, 2, 3,
4, 5, 6, 8, 10, 12, 13
47. Li, X., Wang, H., Tseng, K.K.: Gaussiandiffusion: 3d gaussian splatting for
denoising diffusion probabilistic models with structured noise. arXiv preprint
arXiv:2311.11221 (2023) 4
48. Li, Z., Zheng, Z., Wang, L., Liu, Y.: Animatable gaussians: Learning pose-
dependent gaussian maps for high-fidelity human avatar modeling. arXiv preprint
arXiv:2311.16096 (2023) 4
49. Liang, Y., Yang, X., Lin, J., Li, H., Xu, X., Chen, Y.: Luciddreamer: Towards
high-fidelity text-to-3d generation via interval score matching. arXiv preprint
arXiv:2311.11284 (2023) 4
50. Lin, C.H., Gao, J., Tang, L., Takikawa, T., Zeng, X., Huang, X., Kreis, K., Fidler,
S., Liu, M.Y., Lin, T.Y.: Magic3d: High-resolution text-to-3d content creation. In:
IEEE Conf. Comput. Vis. Pattern Recog. pp. 300–309 (2023) 1, 4
51. Lin, K.E., Yen-Chen, L., Lai, W.S., Lin, T.Y., Shih, Y.C., Ramamoorthi, R.:
Vision transformer for nerf-based view synthesis from a single input image. In:
IEEE Winter Conf. Appl. Comput. Vis. (2023) 3
52. Ling, H., Kim, S.W., Torralba, A., Fidler, S., Kreis, K.: Align your gaussians:
Text-to-4d with dynamic 3d gaussians and composed diffusion models. arXiv
preprint arXiv:2312.13763 (2023) 4
53. Liu, M., Shi, R., Chen, L., Zhang, Z., Xu, C., Wei, X., Chen, H., Zeng, C., Gu, J.,
Su, H.: One-2-3-45++: Fast single image to 3d objects with consistent multi-view
generation and 3d diffusion. arXiv preprint arXiv:2311.07885 (2023) 10, 11, 17,
18
54. Liu, M., Xu, C., Jin, H., Chen, L., T, M.V., Xu, Z., Su, H.: One-2-3-45: Any single
image to 3d mesh in 45 seconds without per-shape optimization (2023) 1, 4, 8, 9,
10, 11, 17, 18
55. Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.:
Zero-1-to-3: Zero-shot one image to 3d object. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision. pp. 9298–9309 (2023) 4, 10
56. Liu, Y., Lin, C., Zeng, Z., Long, X., Liu, L., Komura, T., Wang, W.: Syncdreamer:
Generating multiview-consistent images from a single-view image. In: The Twelfth
International Conference on Learning Representations (2023) 4
57. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.:
Swin transformer: Hierarchical vision transformer using shifted windows. In:
Proceedings of the IEEE/CVF international conference on computer vision. pp.
10012–10022 (2021) 6
58. Long, X., Guo, Y.C., Lin, C., Liu, Y., Dou, Z., Liu, L., Ma, Y., Zhang, S.H.,
Habermann, M., Theobalt, C., et al.: Wonder3d: Single image to 3d using cross-
domain diffusion. arXiv preprint arXiv:2310.15008 (2023) 4, 10, 11, 18
59. Long, X., Lin, C., Wang, P., Komura, T., Wang, W.: Sparseneus: Fast generaliz-
able neural surface reconstruction from sparse views. In: Eur. Conf. Comput. Vis.
(2022) 3, 8, 9, 10, 17
60. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint
arXiv:1711.05101 (2017) 8
61. Luiten, J., Kopanas, G., Leibe, B., Ramanan, D.: Dynamic 3d gaussians: Tracking
by persistent dynamic view synthesis. arXiv preprint arXiv:2308.09713 (2023) 4
62. Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy
networks: Learning 3d reconstruction in function space. In: IEEE Conf. Comput.
Vis. Pattern Recog. (2019) 3
63. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng,
R.: Nerf: Representing scenes as neural radiance fields for view synthesis. In: Eur.
Conf. Comput. Vis. (2020) 3
64. Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives
with a multiresolution hash encoding. ACM Trans. Graph. 41(4), 102:1–102:15
(Jul 2022). https://doi.org/10.1145/3528223.3530127 3
65. Nguyen-Phuoc, T., Li, C., Theis, L., Richardt, C., Yang, Y.L.: Hologan:
Unsupervised learning of 3d representations from natural images. In: Int. Conf.
Comput. Vis. (2019) 4
66. Nichol, A., Jun, H., Dhariwal, P., Mishkin, P., Chen, M.: Point-e: A system for gen-
erating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751
(2022) 4
67. Niemeyer, M., Geiger, A.: Giraffe: Representing scenes as compositional gener-
ative neural feature fields. In: IEEE Conf. Comput. Vis. Pattern Recog. (2021)
4
68. Ntavelis, E., Siarohin, A., Olszewski, K., Wang, C., Van Gool, L., Tulyakov, S.:
Autodecoding latent 3d diffusion models. arXiv preprint arXiv:2307.05445 (2023)
4
69. Park, J.J., Florence, P., Straub, J., Newcombe, R., Lovegrove, S.: Deepsdf:
Learning continuous signed distance functions for shape representation. In: IEEE
Conf. Comput. Vis. Pattern Recog. (2019) 3
70. Po, R., Yifan, W., Golyanik, V., Aberman, K., Barron, J.T., Bermano, A.H.,
Chan, E.R., Dekel, T., Holynski, A., Kanazawa, A., et al.: State of the art on
diffusion models for visual computing. arXiv preprint arXiv:2310.07204 (2023) 1,
4
71. Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d
diffusion. In: The Eleventh International Conference on Learning Representations
(2022) 1, 4, 12
72. Qian, S., Kirschstein, T., Schoneveld, L., Davoli, D., Giebenhain, S., Nießner,
M.: Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. arXiv
preprint arXiv:2312.02069 (2023) 4
73. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G.,
Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from
natural language supervision. In: International conference on machine learning.
pp. 8748–8763. PMLR (2021) 10
74. Ren, J., Pan, L., Tang, J., Zhang, C., Cao, A., Zeng, G., Liu, Z.: Dreamgaussian4d:
Generative 4d gaussian splatting. arXiv preprint arXiv:2312.17142 (2023) 4
75. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models. In: IEEE Conf. Comput. Vis. Pattern
Recog. (2022) 4
76. Saito, S., Schwartz, G., Simon, T., Li, J., Nam, G.: Relightable gaussian codec
avatars. arXiv preprint arXiv:2312.03704 (2023) 4
77. Schwarz, K., Liao, Y., Niemeyer, M., Geiger, A.: Graf: Generative radiance fields
for 3d-aware image synthesis. In: Adv. Neural Inform. Process. Syst. (2020) 4
78. Shen, B., Yan, X., Qi, C.R., Najibi, M., Deng, B., Guibas, L., Zhou, Y., Anguelov,
D.: Gina-3d: Learning to generate implicit neural assets in the wild. In: IEEE
Conf. Comput. Vis. Pattern Recog. pp. 4913–4926 (2023) 1, 4
79. Shi, R., Chen, H., Zhang, Z., Liu, M., Xu, C., Wei, X., Chen, L., Zeng, C., Su, H.:
Zero123++: a single image to consistent multi-view diffusion base model. arXiv
preprint arXiv:2310.15110 (2023) 2, 4, 5, 6, 8, 10, 11
80. Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert,
D., Wang, Z.: Real-time single image and video super-resolution using an efficient
sub-pixel convolutional neural network. In: Proceedings of the IEEE conference
on computer vision and pattern recognition. pp. 1874–1883 (2016) 7
81. Shi, Y., Wang, P., Ye, J., Mai, L., Li, K., Yang, X.: Mvdream: Multi-view
diffusion for 3d generation. In: The Twelfth International Conference on Learning
Representations (2023) 4, 12, 13
82. Shi, Z., Peng, S., Xu, Y., Andreas, G., Liao, Y., Shen, Y.: Deep generative models
on 3d representations: A survey. arXiv preprint arXiv:2210.15663 (2022) 4
83. Shue, J.R., Chan, E.R., Po, R., Ankner, Z., Wu, J., Wetzstein, G.: 3d neural field
generation using triplane diffusion. In: IEEE Conf. Comput. Vis. Pattern Recog.
(2023) 4
84. Sitzmann, V., Martel, J., Bergman, A., Lindell, D., Wetzstein, G.: Implicit neural
representations with periodic activation functions. Advances in neural information
processing systems 33, 7462–7473 (2020) 3
85. Sitzmann, V., Rezchikov, S., Freeman, B., Tenenbaum, J., Durand, F.: Light field
networks: Neural scene representations with single-evaluation rendering. NeurIPS
(2021) 6
86. Sitzmann, V., Zollhöfer, M., Wetzstein, G.: Scene representation networks:
Continuous 3d-structure-aware neural scene representations. Advances in Neural
Information Processing Systems 32 (2019) 3
87. Skorokhodov, I., Siarohin, A., Xu, Y., Ren, J., Lee, H.Y., Wonka, P., Tulyakov,
S.: 3d generation on imagenet. In: International Conference on Learning Repre-
sentations (2023), https://openreview.net/forum?id=U2WjB9xxZ9q 4
88. Skorokhodov, I., Tulyakov, S., Wang, Y., Wonka, P.: Epigraf: Rethinking training
of 3d gans. In: Adv. Neural Inform. Process. Syst. (2022) 4
89. Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.:
Score-based generative modeling through stochastic differential equations. arXiv
preprint arXiv:2011.13456 (2020) 4
90. Szymanowicz, S., Rupprecht, C., Vedaldi, A.: Splatter image: Ultra-fast single-
view 3d reconstruction. arXiv preprint arXiv:2312.13150 (2023) 3, 4, 5
91. Szymanowicz, S., Rupprecht, C., Vedaldi, A.: Viewset diffusion:(0-) image-
conditioned 3d generative models from 2d data. arXiv preprint arXiv:2306.07881
(2023) 1
92. Tang, J., Chen, Z., Chen, X., Wang, T., Zeng, G., Liu, Z.: Lgm: Large multi-
view gaussian model for high-resolution 3d content creation. arXiv preprint
arXiv:2402.05054 (2024) 4, 8, 10, 11, 12, 13, 17, 18
93. Tang, J., Ren, J., Zhou, H., Liu, Z., Zeng, G.: Dreamgaussian: Generative gaussian
splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653 (2023)
1, 4, 10, 11, 12, 18
94. Tang, J., Wang, T., Zhang, B., Zhang, T., Yi, R., Ma, L., Chen, D.: Make-it-3d:
High-fidelity 3d creation from a single image with diffusion prior. arXiv preprint
arXiv:2303.14184 (2023) 4
95. Tewari, A., Thies, J., Mildenhall, B., Srinivasan, P., Tretschk, E., Yifan, W.,
Lassner, C., Sitzmann, V., Martin-Brualla, R., Lombardi, S., et al.: Advances in
neural rendering. In: Computer Graphics Forum. pp. 703–735 (2022) 3
96. Tewari, A., Yin, T., Cazenavette, G., Rezchikov, S., Tenenbaum, J., Durand, F.,
Freeman, B., Sitzmann, V.: Diffusion with forward models: Solving stochastic
inverse problems without direct supervision. Advances in Neural Information
Processing Systems 36 (2024) 4
97. Tosi, F., Zhang, Y., Gong, Z., Sandström, E., Mattoccia, S., Oswald, M.R., Poggi,
M.: How nerfs and 3d gaussian splatting are reshaping slam: a survey. arXiv
preprint arXiv:2402.13255 (2024) 4
98. Wang, H., Du, X., Li, J., Yeh, R.A., Shakhnarovich, G.: Score jacobian chaining:
Lifting pretrained 2d diffusion models for 3d generation. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12619–
12629 (2023) 1, 4
99. Wang, P., Tan, H., Bi, S., Xu, Y., Luan, F., Sunkavalli, K., Wang, W., Xu, Z.,
Zhang, K.: Pf-lrm: Pose-free large reconstruction model for joint pose and shape
prediction. arXiv preprint arXiv:2311.12024 (2023) 3, 5
100. Wang, Q., Wang, Z., Genova, K., Srinivasan, P.P., Zhou, H., Barron, J.T., Martin-
Brualla, R., Snavely, N., Funkhouser, T.: Ibrnet: Learning multi-view image-based
rendering. In: IEEE Conf. Comput. Vis. Pattern Recog. (2021) 3, 8, 9, 10
101. Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer:
High-fidelity and diverse text-to-3d generation with variational score distillation.
arXiv preprint arXiv:2305.16213 (2023) 1, 4
102. Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang,
X.: 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint
arXiv:2310.08528 (2023) 4
103. Xu, D., Yuan, Y., Mardani, M., Liu, S., Song, J., Wang, Z., Vahdat, A.:
Agg: Amortized generative 3d gaussians for single image to 3d. arXiv preprint
arXiv:2401.04099 (2024) 4
104. Xu, Y., Chai, M., Shi, Z., Peng, S., Skorokhodov, I., Siarohin, A., Yang, C.,
Shen, Y., Lee, H.Y., Zhou, B., et al.: Discoscene: Spatially disentangled generative
radiance fields for controllable 3d-aware scene synthesis. In: IEEE Conf. Comput.
Vis. Pattern Recog. (2023) 4
105. Xu, Y., Peng, S., Yang, C., Shen, Y., Zhou, B.: 3d-aware image synthesis via
learning structural and textural representations. In: IEEE Conf. Comput. Vis.
Pattern Recog. (2022) 4
106. Xu, Y., Tan, H., Luan, F., Bi, S., Wang, P., Li, J., Shi, Z., Sunkavalli, K.,
Wetzstein, G., Xu, Z., et al.: Dmv3d: Denoising multi-view diffusion using 3d