GRM: Large Gaussian Reconstruction Model For Efficient 3D Reconstruction and Generation
1 Introduction
The availability of high-quality and diverse 3D assets is critical in many domains,
including robotics, gaming, and architecture. Yet, creating these assets
has been a tedious manual process, requiring expertise in difficult-to-use
computer graphics tools.
Emerging 3D generative models offer the ability to easily create diverse
3D assets from simple text prompts or single images [70]. Optimization-based
3D generative methods can produce high-quality assets, but they typically take
a long time, often hours, to produce a single 3D asset [50, 71, 93, 98, 101].
Recent feed-forward 3D generative methods have demonstrated excellent quality
and diversity while offering significant speedups over optimization-based 3D
generation approaches [2, 12, 30, 38, 46, 54, 78, 91, 106]. These state-of-the-art
2 Related Work
Sparse-view Reconstruction. Neural representations, as highlighted in prior
works [9,62–64,69,84,86], present a promising foundation for scene representation
and neural rendering [95]. When applied to novel-view synthesis, these methods
have demonstrated success in scenarios with multi-view training images, show-
casing proficiency in single-scene overfitting. Notably, recent advancements [10,
33, 51, 59, 100, 109] have extended these techniques to operate with a sparse set
of views, displaying improved generalization to unseen scenes. However, these methods
struggle to capture multiple modes within large-scale datasets, which limits
their ability to generate realistic results. Recent works [30, 99, 114] further
scale up the model and datasets for better generalization, but relying on neural
volume-based scene representations proves inadequate for efficiently synthesizing
high-resolution and high-fidelity images. Our proposed solution involves the use
of pixel-aligned 3D Gaussians [8, 90] combined with our effective transformer
architecture. This approach is designed to elevate both the efficiency and quality
of the sparse-view reconstructor when provided with only four input images.
3 Method
GRM is a feed-forward sparse-view 3D reconstructor, utilizing four input images
to efficiently infer underlying 3D Gaussians [43]. Supplied with a multi-view
image generator head [46, 79], GRM can be utilized to generate 3D from text or
a single image. Unlike LRM [30, 46, 99, 106], we leverage pixel-aligned
Gaussians (Sec. 3.1) to enhance efficiency and reconstruction quality, and we
adopt a transformer-based network to predict the properties of the Gaussians by
associating information from all input views in a memory-efficient way (Sec. 3.2).
Finally, we detail the training objectives in Sec. 3.3 and demonstrate high-quality
text-to-3D and image-to-3D generation in a few seconds (Sec. 3.4).
[Figure 2 panel labels: Upsampler (PixelShuffle, Windowed Self-Attention); Pixel-aligned Gaussians; Novel-view Rendering]
Fig. 2: GRM pipeline. Given 4 input views, which can be generated from text [46] or a
single image [79], our sparse-view reconstructor estimates the underlying 3D scene in a
single feed-forward pass using pixel-aligned Gaussians. The transformer-based sparse-
view reconstructor, equipped with a novel transformer-based upsampler, is capable of
leveraging long-range visual cues to efficiently generate a large number of 3D Gaussians
for high-fidelity 3D reconstruction.
Transformer-based Encoder. For a given input image $\mathbf{I}_v \in \mathbb{R}^{H\times W\times 3}$, we
first inject the camera information to every pixel following [85,106] with Plücker
embedding [34, 85]. Then we use a convolutional image tokenizer with kernel
and stride 16 to extract local image features, resulting in a $\nicefrac{H}{16} \times \nicefrac{W}{16}$ feature
map. The features from every view are concatenated together to a single feature
vector of length $\left(V\times \nicefrac{H}{16}\times \nicefrac{W}{16}\right)$. Following common practice in vision
transformers, we append learnable image position encodings for each token to
encode the spatial information in the image space. The resulting feature vector
is subsequently fed to a series of self-attention layers. The self-attention layers
attend to all tokens across all the input views, ensuring mutual information
exchange among all input views, resembling traditional feature matching and
encouraging consistent predictions for pixels belonging to different images. The
output of the transformer-based encoder is a $V\times \nicefrac{H}{16} \times \nicefrac{W}{16}$-long feature vector,
denoted as $\mathbf{F}$. Formally, the transformer function can be written as
\begin{equation}
\mathbf{F} = E_{\theta, \phi}\left(\mathcal{I}, \mathcal{C}\right), \tag{2}
\end{equation}
where θ and ϕ denote the network parameters and the learnable image position
In the transformer encoder, we utilize patch convolution to tokenize the
input images, resulting in the output feature F with a smaller spatial dimension.
While this is advantageous for capturing broader image context, it is limited in
modeling high-frequency details. To this end, we introduce a transformer-based
upsampler to improve the detail reconstruction.
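As a rough illustration of the tokenization described above, the following NumPy sketch concatenates a per-pixel Plücker ray embedding to the RGB channels and splits each view into non-overlapping 16×16 patches. The function names, the (o × d, d) ordering of the Plücker coordinates, and the toy 64×64 resolution are assumptions made for this example. The learned projection of the convolutional tokenizer is omitted.

```python
import numpy as np

def plucker_embedding(origins, directions):
    """Per-pixel Plücker coordinates (o x d, d) for rays with origin o and direction d.

    The ordering (moment first, direction second) is an assumption of this sketch.
    """
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    moment = np.cross(origins, d)
    return np.concatenate([moment, d], axis=-1)  # (..., 6)

def patchify(x, patch=16):
    """Split a (V, H, W, C) array into non-overlapping patch tokens.

    This mirrors the unfolding step of a convolution whose kernel size and
    stride both equal `patch`; the learned projection is left out.
    """
    V, H, W, C = x.shape
    x = x.reshape(V, H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # (V, H/p, W/p, p, p, C)
    return x.reshape(V * (H // patch) * (W // patch), patch * patch * C)

# Toy inputs: 4 views at 64x64 (hypothetical values, chosen to keep the example small).
V, H, W = 4, 64, 64
rng = np.random.default_rng(0)
images = rng.random((V, H, W, 3))
origins = np.zeros((V, H, W, 3))
origins[..., 2] = -2.0
directions = rng.normal(size=(V, H, W, 3))

# 3 RGB channels + 6 Plücker channels per pixel.
pixels = np.concatenate([images, plucker_embedding(origins, directions)], axis=-1)
tokens = patchify(pixels)
print(tokens.shape)  # (V * H/16 * W/16, 16 * 16 * 9)
```

Each token covers a 16×16 pixel area, which is consistent with the observation above that fine detail is hard to model at the token level.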
feature dimensions with a linear layer and then double the spatial dimension with
a PixelShuffle layer [80]. The upsampled feature maps are grouped and passed
to a self-attention layer in a sliding window of size $W$ and shift $\nicefrac{W}{2}$. While the
self-attention operation is performed within each distinct window, to maintain
manageable memory and computation, the overlapping between shifted windows
improves non-local information flow. Formally, an upsampler block contains the
following operations:
\begin{align}
\mathbf{F} &= \textsc{PixelShuffle}\left(\textsc{Linear}\left(\mathbf{F}\right), 2\right), \\
\mathbf{F} &= \textsc{Attention}\left(\mathbf{F}, W\right), \\
\mathbf{F} &= \textsc{Shift}\left(\textsc{Attention}\left(\textsc{Shift}\left(\mathbf{F}, W/2\right), W\right), -W/2\right).
\end{align}
After several blocks, the context length expands to the same spatial dimension
as the input. We reshape the features back to 2D tensors, resulting in $V$ feature
maps with a resolution of $H \times W$, denoted as $\mathcal{F} = \{\mathbf{F}_v\}_{v=1}^{V}$.
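The three operations of an upsampler block can be sketched in NumPy as follows. This is a simplified illustration under several assumptions: a single view, identity query/key/value projections, no residual connections or normalization, a flattened row-major token order, and a cyclic shift implemented with `np.roll`. None of these details are specified in the text above.

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Rearrange (H, W, C*r*r) into (H*r, W*r, C), channel-last."""
    H, W, C = x.shape
    c = C // (r * r)
    x = x.reshape(H, W, c, r, r)    # channel index = c*r*r + i*r + j
    x = x.transpose(0, 3, 1, 4, 2)  # (H, r, W, r, c)
    return x.reshape(H * r, W * r, c)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(tokens, window):
    """Self-attention restricted to consecutive windows of an (N, C) sequence."""
    N, C = tokens.shape
    w = tokens.reshape(N // window, window, C)
    scores = w @ w.transpose(0, 2, 1) / np.sqrt(C)
    return (softmax(scores) @ w).reshape(N, C)

def upsampler_block(feat, weight, window):
    """feat: (H, W, C_in); weight: (C_in, 4 * C_out) for the linear layer."""
    feat = pixel_shuffle(feat @ weight, 2)  # doubles H and W
    H2, W2, C = feat.shape
    seq = feat.reshape(H2 * W2, C)
    seq = window_attention(seq, window)
    # Shift by half a window, attend again, then undo the shift.
    shifted = np.roll(seq, window // 2, axis=0)
    shifted = window_attention(shifted, window)
    seq = np.roll(shifted, -(window // 2), axis=0)
    return seq.reshape(H2, W2, C)

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 4, 8))
weight = rng.normal(size=(8, 16)) * 0.1
out = upsampler_block(feat, weight, window=16)
print(out.shape)  # spatial size doubled: (8, 8, 4)
```

Because the second attention pass sees windows offset by half a window, tokens that were separated by a window boundary in the first pass can exchange information in the second.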
Rendering with Gaussian Splatting. From the upsampled features $\mathcal{F}$, we
predict the Gaussian attribute maps $\{\map_v\}_{v=1}^{V}$ for pixel-aligned Gaussians using
separate linear heads. As mentioned in Sec. 3.1, these are then unprojected along
the viewing ray according to the predicted depth, from which a final image $\mathbf{I}_{v'}$
and alpha mask $\mathbf{M}_{v'}$ (used for supervision) can be rendered at an arbitrary
camera view $\mathbf{c}_{v'}$ through Gaussian splatting.
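The unprojection step can be illustrated with a small NumPy sketch that places one Gaussian center per pixel along its viewing ray. The pinhole intrinsics `K`, the camera-to-world matrix, the half-pixel offset, and the interpretation of depth as distance along the ray are assumptions for this example and are not taken from the text.

```python
import numpy as np

def unproject(depth, K, cam_to_world):
    """Lift an (H, W) depth map to one world-space 3D point per pixel.

    depth is interpreted as distance along the unit viewing ray (an assumption).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)  # pixel centers
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)            # (H, W, 3)
    dirs_cam = pix @ np.linalg.inv(K).T                         # rays in camera space
    dirs_cam /= np.linalg.norm(dirs_cam, axis=-1, keepdims=True)
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    return t + depth[..., None] * dirs_world                    # (H, W, 3)

# Hypothetical camera: focal length 50 px, principal point at the image center.
K = np.array([[50.0, 0.0, 2.0], [0.0, 50.0, 2.0], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 0.0, 0.0]
centers = unproject(np.full((4, 4), 2.0), K, pose)
print(centers.shape)  # (4, 4, 3)
```

With this parameterization every Gaussian center is tied to exactly one input pixel, so the network only has to predict a scalar depth rather than a free 3D position.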
3.3 Training
During the training phase, we sample V = 4 input views that sufficiently cover
the whole scene, and supervise with additional views to guide the reconstruction.
To remove floaters, we also supervise the alpha map from Gaussian splatting with
the ground truth object mask available from the training data.
Given $V'$ supervision views, the training objective is
\begin{align}
\mathcal{L} &= \frac{1}{V'} \sum_{1 \leq v' \leq V'} \mathcal{L}_{\text{img}} + \mathcal{L}_{\text{mask}}, \\
\mathcal{L}_{\text{img}} &= L_2\left(\mathbf{I}_{v'}, \hat{\mathbf{I}}_{v'}\right) + 0.5\, L_p\left(\mathbf{I}_{v'}, \hat{\mathbf{I}}_{v'}\right), \\
\mathcal{L}_{\text{mask}} &= L_2\left(\mathbf{M}_{v'}, \hat{\mathbf{M}}_{v'}\right),
\end{align}
where $\hat{\mathbf{I}}_{v'}$ and $\hat{\mathbf{M}}_{v'}$ denote the ground truth image and alpha mask, respectively. $L_2$
and $L_p$ are the L2 loss and perceptual loss [35].
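A minimal sketch of this objective is given below. It assumes that both the image and mask terms are averaged over the supervision views, that the L2 loss is a mean squared error, and that the perceptual loss is supplied as a callable. The `perceptual` argument here is a hypothetical placeholder, not the network-based loss of [35].

```python
import numpy as np

def l2(a, b):
    """Mean squared error between two arrays."""
    return float(np.mean((a - b) ** 2))

def training_loss(pred_images, gt_images, pred_masks, gt_masks, perceptual, w_p=0.5):
    """Average of (image loss + mask loss) over the supervision views."""
    total = 0.0
    for I, I_hat, M, M_hat in zip(pred_images, gt_images, pred_masks, gt_masks):
        loss_img = l2(I, I_hat) + w_p * perceptual(I, I_hat)
        loss_mask = l2(M, M_hat)
        total += loss_img + loss_mask
    return total / len(pred_images)

# Toy example with 2 supervision views and a stand-in perceptual term.
def stand_in_perceptual(a, b):
    return float(np.mean(np.abs(a - b)))  # placeholder only, not a learned metric

pred_images = [np.zeros((8, 8, 3)), np.zeros((8, 8, 3))]
gt_images = [np.ones((8, 8, 3)), np.ones((8, 8, 3))]
pred_masks = [np.ones((8, 8)), np.ones((8, 8))]
gt_masks = [np.ones((8, 8)), np.ones((8, 8))]
loss = training_loss(pred_images, gt_images, pred_masks, gt_masks, stand_in_perceptual)
print(loss)
```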
To further constrain the Gaussian scaling, we apply the following activation
function to the output $\mathbf{s}_o$ of the linear head for scale, which
linearly interpolates between the predefined scale values $s_{min}$ and $s_{max}$:
\begin{equation}
\mathbf{s} = s_{min}\,\textsc{Sigmoid}(\mathbf{s}_o) + s_{max}\,(1 - \textsc{Sigmoid}(\mathbf{s}_o)). \tag{9}
\end{equation}
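The scale activation of Eq. (9) can be written directly in Python. The default values 0.005 and 0.02 are the ones reported in the training settings of Sec. 4.1, and the function name is chosen for this example.

```python
import math

def scale_activation(s_o, s_min=0.005, s_max=0.02):
    """Interpolate between s_min and s_max with a sigmoid weight, as in Eq. (9)."""
    sig = 1.0 / (1.0 + math.exp(-s_o))
    return s_min * sig + s_max * (1.0 - sig)

for raw in (-10.0, 0.0, 10.0):
    print(raw, scale_activation(raw))
```

As the equation is written, a large positive raw output approaches $s_{min}$ and a large negative one approaches $s_{max}$, so the predicted scale always stays inside the interval between the two.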
4 Experiments
4.1 Experimental Settings
Training Settings. The encoder E consists of 1 strided convolution layer to
tokenize the image and 24 self-attention layers with channel width 768. The
upsampler consists of 4 upsampler blocks and each block contains 2 attention
layers. For training, we use AdamW [60] with a learning rate initialized at 0.0003
and decayed with cosine annealing after 3k steps. Deferred back-propagation [110]
is adopted to reduce GPU memory usage. We train our model on 32 NVIDIA A100
GPUs for 40M images at a resolution of 512×512, using a batch size of 8 per
GPU; training takes about 4 days. The window size in the transformer-based
upsampler is 4096. The values for $s_{min}$ and $s_{max}$ are set to 0.005 and 0.02, respectively.
Test Data. We use Google Scanned Objects (GSO) [20], and render a total of
64 test views with equidistant azimuth at {10, 20, 30, 40} degree elevations. In
sparse-view reconstruction, the evaluation uses full renderings from 100 objects
to assess all models. For single-view reconstruction, we restrict the analysis to
renderings generated at an elevation angle of 20 degrees from 250 objects. More details
about training settings and data are presented in the Supplementary Material.
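One way to lay out such test cameras is sketched below. It assumes the 64 views are split evenly into 16 azimuths per elevation, a z-up coordinate frame, and a unit camera distance. None of these choices are stated in the text.

```python
import math

def test_view_positions(elevations_deg=(10, 20, 30, 40), total_views=64, radius=1.0):
    """Camera positions on a sphere with equidistant azimuths at each elevation."""
    per_elevation = total_views // len(elevations_deg)  # assumed even split
    positions = []
    for elevation in elevations_deg:
        el = math.radians(elevation)
        for k in range(per_elevation):
            az = 2.0 * math.pi * k / per_elevation
            positions.append((
                radius * math.cos(el) * math.cos(az),
                radius * math.cos(el) * math.sin(az),
                radius * math.sin(el),
            ))
    return positions

positions = test_view_positions()
print(len(positions))
```

The single-view evaluation described above would then use only the subset of positions at the 20 degree elevation.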
Results. The quantitative results are presented in Tab. 2. Notably, GRM outper-
forms all baselines across all metrics. Our model only takes 5 seconds in total
to generate 3D Gaussians from the input image, which includes the runtime of
the generation head. While this is slower than TriplaneGaussian, we achieve sig-
nificantly higher reconstruction quality. Our advantage is further demonstrated
in the qualitative results shown in Fig. 4. On the top, we compare with other
3D Gaussian-based methods. The pure reconstruction method TriplaneGaussian
struggles to fill in the missing content realistically (see rows 1–2). DreamGaus-
sian, using SDS optimization, shows various geometry artifacts (row 1) and
overall noticeable inconsistencies with the input image. LGM, also using an
image-to-MV generation head, produces blurrier and inconsistent texture and
The bottom of Fig. 4 shows non-Gaussian-based approaches. These methods
all display various geometry and appearance artifacts, inconsistent with the
input. In contrast, our scalable GRM learns robust data priors from extensive
training data, demonstrating strong generalization ability on generated multi-
view input images with accurate geometry and sharper details. This leads to fast
3D generation and state-of-the-art single-image 3D reconstruction.
By using a text-to-MV diffusion model, such as the first stage of Instant3D [46],
GRM can generate 3D assets from text prompts.
Baselines and metrics. We choose Shap-E [36], Instant3D [46], LGM [92], and
MVDream [81] as baselines. MVDream represents the SOTA of optimization-
based methods, while others are feed-forward methods. We use the 200 text
prompts from DreamFusion [71]. The metrics we use are CLIP Precisions [32,93],
Averaged Precision [106], and CLIP Score [46, 106], which measure the alignment
between the text and images. All the comparisons are done using a resolution
of 512×512. Additionally, we include a preference study on Amazon Mechanical
Turk, where we recruited 90 unique users to compare the generations for 50 text
Results. As shown in Tab. 3, our method consistently ranks the highest among
feed-forward methods (rows 1–3) and is on par with optimization-based
MVDream. Visually, as shown in Fig. 5, our method excels at generating plausible
geometry and highly detailed texture. MVDream, using SDS-based optimization,
requires 1 hour to generate a single scene. It delivers impressive visual
quality, but exhibits sub-optimal text-image alignment, as indicated by the CLIP
score and the ‘a cat wearing eyeglasses’ example in Fig. 4.
We analyze our model components and architectural design choices at a training
resolution of 256. The results are shown in Tab. 4. Note that all ablations
are trained with 16 GPUs for 14M images.
Table 4: Ablation. Left: Using the sigmoid activation improves the visual quality
across all metrics; increasing the number of sampling blocks also increases the
Gaussians’ density and their modeling capability, as demonstrated by the growing trend
of PSNR; finally, supervising the alpha channel further boosts the reconstruction quality
by removing outlier Gaussians. Right: We ablate the proposed transformer-based
upsampler and pixel-aligned Gaussians using alternative approaches, and demonstrate
that each component is critical for the final reconstruction quality.
5 Discussion
In this paper, we introduce the Gaussian Reconstruction Model (GRM)—a new
feed-forward 3D generative model that achieves state-of-the-art quality and speed.
At the core of GRM is a sparse-view 3D reconstructor, which leverages a novel
transformer-based architecture to reconstruct 3D objects represented by pixel-
aligned Gaussians. We plan to release the code and trained models to make this
advancement in 3D content creation available to the community.
Limitations and Future Work. The output quality of our sparse-view re-
constructor suffers when the input views are inconsistent. The reconstructor
Ethics. Generative models pose a societal threat—we do not condone using our
work to generate deep fakes intending to spread misinformation.
Acknowledgement. We would like to thank Shangzhan Zhang for his help with
the demo video, and Minghua Liu for assisting with the evaluation of One-2-3-
45++. This project was supported by Google, Samsung, and a Swiss Postdoc
Mobility fellowship.
GRM: Large Gaussian Reconstruction Model for
Efficient 3D Reconstruction and Generation
Supplementary Material
A Implementation Details
Network Architecture and Training Details. We illustrate the details of
network architecture and training in Tab. 1.
B Geometry Evaluation
Here, we demonstrate the geometry evaluation results on sparse-view reconstruc-
tion and single-image-to-3D generation. We report Chamfer Distance (CD) and
Table 1: Implementation details.
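# Forward: render without recording the autograd graph, which keeps memory low.
with torch.no_grad():
    images = Render(gaussians, cameras)
return images

# Backward: re-render with gradients enabled so that they can reach the Gaussians.
with torch.enable_grad():
    images = Render(gaussians, cameras)
return gaussians.grad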
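The two fragments above can be combined into a runnable sketch of deferred back-propagation using a custom `torch.autograd.Function`. The `render` function below is a hypothetical differentiable stand-in for the Gaussian rasterizer, and re-rendering one view at a time in the backward pass is an assumption of this example; the class name is also chosen here.

```python
import torch

def render(gaussians, cameras):
    """Hypothetical stand-in for a differentiable rasterizer."""
    return torch.tanh(cameras @ gaussians)

class DeferredRender(torch.autograd.Function):
    @staticmethod
    def forward(ctx, gaussians, cameras):
        ctx.save_for_backward(gaussians, cameras)
        # No graph is stored for the full render.
        with torch.no_grad():
            return render(gaussians, cameras)

    @staticmethod
    def backward(ctx, grad_images):
        gaussians, cameras = ctx.saved_tensors
        g = gaussians.detach().requires_grad_(True)
        # Re-render one view at a time so only a small graph is alive at once.
        for i in range(cameras.shape[0]):
            with torch.enable_grad():
                image = render(g, cameras[i:i + 1])
            image.backward(grad_images[i:i + 1])  # accumulates into g.grad
        return g.grad, None  # no gradient for the cameras

torch.manual_seed(0)
gaussians = torch.randn(5, 3, requires_grad=True)
cameras = torch.randn(4, 5)

# Gradient through the deferred path.
DeferredRender.apply(gaussians, cameras).pow(2).sum().backward()
deferred_grad = gaussians.grad.clone()

# Reference gradient through ordinary autograd.
gaussians.grad = None
render(gaussians, cameras).pow(2).sum().backward()
print(torch.allclose(deferred_grad, gaussians.grad, atol=1e-6))
```

The trade-off is extra computation for lower peak memory: every view is rendered twice, but the full-resolution autograd graph never has to be held in memory all at once.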
E Limitations
Despite the high-quality reconstruction, image-to-3D, and text-to-3D generation
results we achieved, our model relies on the input information for reconstruction
and lacks the capability for hallucination. For example, if a region is not observed
in any of the input images, the model may produce blurry textures for it.
