
PartGen: Part-level 3D Generation and Reconstruction

with Multi-View Diffusion Models

Minghao Chen1,2 Roman Shapovalov2 Iro Laina1 Tom Monnier2


Jianyuan Wang1,2 David Novotny2 Andrea Vedaldi1,2
1 Visual Geometry Group, University of Oxford   2 Meta AI
arXiv:2412.18608v1 [cs.CV] 24 Dec 2024

silent-chen.github.io/PartGen

[Figure 1 panels: Part-Aware Text-to-3D (text prompt: "A beagle in a detective's outfit"), Part-Aware Image-to-3D (input image), 3D Decomposition (unstructured 3D input), and 3D Part Editing (original vs. "Magician hat", "Brown hat with police badge", "Red hat with blue texture"), each shown together with selected parts.]
Figure 1. We introduce PartGen, a pipeline that generates compositional 3D objects similar to a human artist. It can start from text,
an image, or an existing, unstructured 3D object. It consists of a multi-view diffusion model that identifies plausible parts automatically
and another that completes and reconstructs them in 3D, accounting for their context, i.e., the other parts, to ensure that they fit together
correctly. Additionally, PartGen enables 3D part editing based on text instructions, enhancing flexibility and control in 3D object creation.

Abstract

Text- or image-to-3D generators and 3D scanners can now produce 3D assets with high-quality shapes and textures. These assets typically consist of a single, fused representation, like an implicit neural field, a Gaussian mixture, or a mesh, without any useful structure. However, most applications and creative workflows require assets to be made of several meaningful parts that can be manipulated independently. To address this gap, we introduce PartGen, a novel approach that generates 3D objects composed of meaningful parts starting from text, an image, or an unstructured 3D object. First, given multiple views of a 3D object, generated or rendered, a multi-view diffusion model extracts a set of plausible and view-consistent part segmentations, dividing the object into parts. Then, a second multi-view diffusion model takes each part separately, fills in the occlusions, and uses those completed views for 3D reconstruction by feeding them to a 3D reconstruction network. This completion process considers the context of the entire object to ensure that the parts integrate cohesively. The generative completion model can make up for the information missing due to occlusions; in extreme cases, it can hallucinate entirely invisible parts based on the input 3D asset. We evaluate our method on generated and real 3D assets and show that it outperforms segmentation and part-extraction baselines by a large margin. We also showcase downstream applications such as 3D part editing.
Work completed during Minghao’s internship at Meta.
1. Introduction

High-quality textured 3D assets can now be obtained through generation from text or images [12, 14, 18, 51, 56, 58, 76, 83], or through photogrammetry techniques [15, 63, 89]. However, the resulting objects are unstructured, consisting of a single, monolithic representation, such as an implicit neural field, a mixture of Gaussians, or a mesh. This is not good enough in a professional setting, where the structure of an asset is also of paramount importance. While there are many aspects to the structure of a 3D object (e.g., the mesh topology), parts are especially important as they enable reuse, editing and animation.

In this paper, we thus consider the problem of obtaining structured 3D objects that are formed by a collection of meaningful parts, akin to the models produced by human artists. For example, a model of a person may be decomposed into its clothes and accessories, as well as various anatomical features like hair, eyes, teeth, limbs, etc. However, if the object is generated or scanned, different parts are usually 'fused' together, missing the internal surfaces and the part boundaries. This means that physically detachable parts appear glued together, with a jarring effect. Furthermore, parts carry important information and functionality that those models lack. For example, different parts may have distinct animations or different materials. Parts can also be replaced, removed, or edited independently. For instance, in video games, parts are often reconfigured dynamically, e.g., to represent a character picking up a weapon or changing clothes. Due to their semantic meaning, parts are also important for 3D understanding and applications like robotics, embodied AI, and spatial intelligence [48, 53].

Inspired by these requirements, we introduce PartGen, a method to upgrade existing 3D generation pipelines from producing unstructured 3D objects to generating objects as compositions of meaningful 3D parts. To do this, we address two key questions: (1) how to automatically segment a 3D object into parts, and (2) how to extract high-quality, complete 3D parts even when these are only partially, or not at all, visible from the exterior of the 3D object.

Crucially, both part segmentation and completion are highly ambiguous tasks. First, since different artists may find it useful to decompose the same object in different ways, there is no 'gold-standard' segmentation for any given 3D object. Hence, a segmentation method should model the distribution of plausible part segmentations rather than a single one. Second, current 3D reconstruction and generation methods only model an object's visible outer surface, omitting inner or occluded parts. Therefore, decomposing an object into parts often requires completing these parts or even entirely hallucinating them.

To model this ambiguity, we base part segmentation and reconstruction on 3D generative models. We note that most state-of-the-art 3D generation pipelines [12, 14, 18, 39, 51, 56, 58, 76, 83] start by generating several consistent 2D views of the object, and then apply a 3D reconstruction network to those images to recover the 3D object. We build upon this two-stage scheme to address both part segmentation and reconstruction ambiguities.

In the first stage, we cast part segmentation as a stochastic multi-view-consistent colouring problem, leveraging a multi-view image generator fine-tuned to produce colour-coded segmentation maps across multiple views of a 3D object. We do not assume any explicit or even deterministic taxonomy of parts; the segmentation model is learned from a large collection of artist-created data, capturing how 3D artists decompose objects into parts. The benefits of this approach are twofold. First, it leverages an image generator which is already trained to be view-consistent. Second, a generative approach allows for multiple plausible segmentations by simply re-sampling from the model. We show that this process results in better segmentation than that obtained by fine-tuning a model like SAM [35] or SAM2 [70] for the task of multi-view segmentation: while the latter can still be used, our approach better captures the artists' intent.

For the second problem, namely reconstructing a segmented part in 3D, an obvious approach is to mask the part within the available object views, and then use a 3D reconstructor network to recover the part in 3D. However, when the part is heavily occluded, this task amounts to amodal reconstruction, which is highly ambiguous and thus badly addressed by the deterministic reconstructor network. Instead, and this is our core contribution, we propose to tune another multi-view generator to complete the views of the part while accounting for the context of the object as a whole. In this manner, the parts can be reconstructed reliably even if they are only partially visible, or even not visible, in the original input views. Furthermore, the resulting parts fit together well and, when combined, form a coherent 3D object.

We show that PartGen can be applied to different input modalities. Starting from text, an image, or a real-world 3D scan, PartGen can generate 3D assets with meaningful parts. We assess our method empirically on a large collection of 3D assets produced by 3D artists or scanned, both quantitatively and qualitatively. We also demonstrate that PartGen can be easily extended to the 3D part editing task.

2. Related Work

3D generation from text and images. The problem of generating 3D assets from text or images has been thoroughly studied in the literature. Some authors have built generators from scratch. For instance, CodeNeRF [30] learns a latent code for NeRF in a Variational Autoencoder fashion, Shap-E [31] and 3DGen [21] do so using latent diffusion, PC2 [55] and Point-E [62] diffuse a point cloud, and MosaicSDF a semi-explicit SDF-based representation [94]. However, 3D training data is scarce, which
makes it difficult to train text-based generators directly.

DreamFusion [65] demonstrated for the first time that 3D assets can be extracted from T2I diffusion models with the Score Distillation Sampling (SDS) loss. Variants of DreamFusion explore representations like hash grids [41, 66], meshes [41] and 3D Gaussians (3DGS) [8, 79, 96], tweaks to the SDS loss [27, 85, 87, 104], conditioning on an input image [54, 66, 78, 80, 98], and regularizing normals or depth [68, 74, 78].

Other works focus on improving the 3D awareness of the T2I model, simplifying the extraction of a 3D output and eschewing the need for slow SDS optimization. Inspired by 3DIM [88], Zero-1-to-3 [47] fine-tunes the 2D generator to output novel views of the object. Two-stage approaches [6, 9, 18, 22, 23, 25, 45, 49, 50, 56, 74, 81, 86, 91-93] take the output of a text- or image-to-multi-view model that generates multiple views of the object and reconstruct the latter using multi-view reconstruction methods like NeRF [59] or 3DGS [32]. Other approaches reduce the number of input views generated and learn a fast feed-forward network for 3D reconstruction. Perhaps the most notable example is Instant3D [39], based on the Large Reconstruction Model (LRM) [26]. Recently, there have been works focusing on 3D compositional generation [11, 40, 64, 105]. D3LL [17] learns 3D object composition through distilling from a 2D T2I generator. ComboVerse [7] starts from a single image, but works mostly at the level of different objects instead of their parts, performs single-view inpainting and reconstruction, and uses SDS optimization for composition.

3D segmentation. Our work decomposes a given 3D object into parts. Several works have considered segmenting 3D objects or scenes represented in an unstructured manner, lately as neural fields or 3D Gaussian mixtures. Semantic-NeRF [101] was the first to fuse 2D semantic segmentation maps in 3D with neural fields. DFF [36] and N3F [84] propose to map 2D features to 3D fields, allowing their supervised and unsupervised segmentation. LERF [33] extends this concept to language-aware features like CLIP [69]. Contrastive Lift [2] considers instead instance segmentation, fusing information from several independently-segmented views using a contrastive formulation. GARField [34] and OmniSeg3D [97] consider that concepts exist at different levels of scale, which they identify with the help of SAM [35]. LangSplat [67] leverages both CLIP and SAM, creating distinct 3D language fields to model each SAM scale explicitly, while N2F2 [3] automates binding the correct scale to each concept. Neural Part Priors [4] completes and decomposes 3D scans with learned part priors in a test-time optimization manner. Finally, Uni3D [102] learns a 'foundation' model for 3D point clouds that can perform zero-shot segmentation.

Primitive-based representations. Some authors proposed to represent 3D objects as a mixture of primitives [99], which can be seen as related to parts, although they are usually non-semantic. For example, SIF [19] represents an occupancy function as a 3D Gaussian mixture. LDIF [20] uses the Gaussians to window local occupancy functions implemented as neural fields [57]. Neural Template [28] and SPAGHETTI [1] learn to decompose shapes in a similar manner using an auto-decoding setup. SALAD [37] uses SPAGHETTI as the latent representation for a diffusion-based generator. PartNeRF [82] is conceptually similar, but builds a mixture of NeRFs. NeuForm [42] and DiffFacto [61] learn representations that afford part-based control. DBW [60] decomposes real-world scenes with textured superquadric primitives.

Semantic part-based representations. Other authors have considered 3D parts that are semantic. PartSLIP [46] and PartSLIP++ [103] use vision-language models to segment objects into parts using point clouds as representation. Part123 [44] is conceptually similar to Contrastive Lift [2], but applied to objects rather than scenes, and to the output of a monocular reconstruction network instead of a NeRF.

In this paper, we address a problem different from the ones above. We generate compositional 3D objects from various modalities using multi-view diffusion models for segmentation and completion. Parts are meaningfully segmented, fully reconstructed, and correctly assembled. We handle the ambiguity of these tasks in a generative way.

3. Method

This section introduces PartGen, our framework for generating 3D objects that are fully decomposable into complete 3D parts. Each part is a distinct, human-interpretable, and self-contained element, representing the 3D object compositionally. PartGen can take different modalities as input (text prompts, image prompts, or 3D assets) and performs part segmentation and completion by repurposing a powerful multi-view diffusion model for these two tasks. An overview of PartGen is shown in Figure 2.

The rest of the section is organised as follows. In Sec. 3.1, we briefly introduce the necessary background on multi-view diffusion and explain how PartGen can be applied to text, image, or 3D model inputs. Then, in Secs. 3.2 to 3.4, we describe how PartGen automatically segments, completes, and reconstructs meaningful parts in 3D.

3.1. Background on 3D generation

First, we provide essential background on multi-view diffusion models for 3D generation [39, 74, 76]. These methods usually adopt a two-stage approach to 3D generation.
[Figure 2 diagram: text prompt or input image, multi-view generator, grid view, multi-view part segmentation, segmentation map, conditional multi-view completion network, completed parts, reconstruction network, 3D parts and object.]

Figure 2. Overview of PartGen. Our method begins with text, single images, or existing 3D objects to obtain an initial grid view of the object. This view is then processed by a diffusion-based segmentation network to achieve multi-view consistent part segmentation. Next, the segmented parts, along with contextual information, are input into a multi-view part completion network to generate a fully completed view of each part. Finally, a pre-trained reconstruction model generates the 3D parts.
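To make the overview concrete, the data flow of Figure 2 can be summarised as the following minimal sketch. The function names (multiview_generator, segmenter, completer, rm) are hypothetical placeholders for the models described below, not the released implementation.

```python
def partgen_pipeline(prompt, multiview_generator, segmenter, completer, rm,
                     num_seg_samples: int = 1):
    """Sketch of the PartGen flow: prompt -> grid view -> part masks ->
    completed per-part views -> per-part 3D reconstructions."""
    # Stage 1: sample a 2x2 multi-view grid image I from a text or image prompt.
    grid = multiview_generator(prompt)                    # tensor (3, 2H, 2W)

    # Stage 2: sample a colour-coded segmentation and split it into binary masks.
    # Re-sampling (num_seg_samples > 1) yields alternative plausible decompositions.
    masks = segmenter(grid, num_samples=num_seg_samples)  # list of (2H, 2W) masks

    parts_3d = []
    for mask in masks:
        # Stage 3: contextual completion of the (possibly occluded) part views.
        completed = completer(grid * mask, grid, mask)    # (3, 2H, 2W)
        # Stage 4: an off-the-shelf reconstruction model lifts the views to 3D.
        parts_3d.append(rm(completed))
    return parts_3d
```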

In the first stage, given a prompt y, an image generator Φ outputs several 2D views of the object from different vantage points. Depending on the nature of y, the network Φ is either a text-to-image (T2I) model [39, 74] or an image-to-image (I2I) one [73, 86]. These are fine-tuned to output a single 'multi-view' image I ∈ R^{3×2H×2W}, where views from the four cardinal directions around the object are arranged into a 2 × 2 grid. This model thus provides a probabilistic mapping I ∼ p(I | Φ, y). The 2D views I are subsequently passed to a Reconstruction Model (RM) [39, 76, 90] Ψ, i.e., a neural network that reconstructs the 3D object L in both shape and appearance. Compared to direct 3D generation, this two-stage paradigm takes full advantage of an image generation model pre-trained on internet-scale 2D data.

This approach is general and can be applied with various implementations of image-generation and reconstruction models. Our work in particular follows a setup similar to AssetGen [76]. Specifically, we obtain Φ by fine-tuning a pre-trained text-to-image diffusion model with an architecture similar to Emu [13], a diffusion model in an 8-channel latent space, the mapping to which is provided by a specially trained variational autoencoder (VAE). The detailed fine-tuning strategy can be found in Sec. 4.4 and the supplementary material. When the input is a 3D model, we render multiple views to form the grid view. For the RM Ψ we use LightplaneLRM [5], trained on our dataset.
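As a concrete illustration of the multi-view format, the sketch below packs four rendered views into the 2 × 2 grid image I ∈ R^{3×2H×2W} and unpacks it again; the particular view ordering is an assumption made for illustration, not taken from the paper.

```python
import torch

def views_to_grid(views: torch.Tensor) -> torch.Tensor:
    """Pack 4 views of shape (4, 3, H, W) into one (3, 2H, 2W) grid image.
    Assumed layout: [[view0, view1], [view2, view3]]."""
    v0, v1, v2, v3 = views
    top = torch.cat([v0, v1], dim=-1)        # (3, H, 2W)
    bottom = torch.cat([v2, v3], dim=-1)     # (3, H, 2W)
    return torch.cat([top, bottom], dim=-2)  # (3, 2H, 2W)

def grid_to_views(grid: torch.Tensor) -> torch.Tensor:
    """Inverse of views_to_grid: (3, 2H, 2W) -> (4, 3, H, W)."""
    _, H2, W2 = grid.shape
    h, w = H2 // 2, W2 // 2
    tiles = [grid[:, i * h:(i + 1) * h, j * w:(j + 1) * w]
             for i in range(2) for j in range(2)]
    return torch.stack(tiles)
```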
3.2. Multi-view part segmentation

The first major contribution of our paper is a method for segmenting an object into its constituent parts. Inspired by multi-view diffusion approaches, we frame object decomposition into parts as a multi-view segmentation task, rather than as direct 3D segmentation. At a high level, the goal is to map I to a collection of 2D masks M^1, ..., M^S ∈ {0, 1}^{2H×2W}, one for each visible part of the object. Both the image I and the masks M^i are multi-view grids.

Addressing 3D object segmentation through the lens of multi-view diffusion offers several advantages. First, it allows us to repurpose existing multi-view models Φ, which, as described in Sec. 3.1, are already pre-trained to produce multi-view consistent generations in the RGB domain. Second, it integrates easily with established multi-view frameworks. Third, decomposing an object into parts is an inherently non-deterministic, ambiguous task, as it depends on the desired verbosity level, individual preferences, and artistic intent. By learning this task with probabilistic diffusion models, we can effectively capture and model this ambiguity. We thus train our model on a curated dataset of artist-created 3D objects, where each object L is annotated with a possible decomposition into 3D parts, L = (S_1, ..., S_S). The dataset details are provided in Sec. 3.5.

Consider that the input is a multi-view image I and the output is a set of multi-view part masks M^1, M^2, ..., M^S. To fine-tune our multi-view image generator Φ for mask prediction, we quantize the RGB space into Q different colors c_1, ..., c_Q ∈ [0, 1]^3. For each training sample L = (S_k)_{k=1}^S, we assign colors to the parts, mapping part S_k to color c_{π_k}, where π is a random permutation of {1, ..., Q} (we assume that Q ≥ S). Given this mapping, we render the segmentation map as a multi-view RGB image C ∈ [0, 1]^{3×2H×2W} (Fig. 4). Then, we fine-tune Φ to (1) take as conditioning the multi-view image I, and (2) generate the color-coded multi-view segmentation map C, hence sampling from a distribution C ∼ p(C | Φ_seg, I).

This approach can produce alternative segmentations by simply re-running Φ_seg, which is stochastic. It further exploits the fact that Φ_seg is stochastic to discount the specific 'naming' or coloring of the parts, which is arbitrary. Naming is a technical issue in instance segmentation which usually requires ad-hoc solutions, and here it is solved 'for free'.

To extract the segments at test time, we sample the image C and simply quantize it based on the reference colors c_1, ..., c_Q, discarding parts that contain only a few pixels.
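For concreteness, one way to implement the colour-coding and the test-time quantization described above is sketched below; the palette, the random part-to-colour assignment, and the minimum pixel count are illustrative assumptions.

```python
import torch

def encode_parts_as_colors(masks: torch.Tensor, palette: torch.Tensor) -> torch.Tensor:
    """masks: (S, 2H, 2W) binary part masks; palette: (Q, 3) reference colours in [0, 1].
    Returns a colour-coded segmentation map C of shape (3, 2H, 2W)."""
    S = masks.shape[0]
    perm = torch.randperm(palette.shape[0])[:S]          # random part-to-colour assignment
    C = torch.zeros(3, *masks.shape[1:])
    for k in range(S):
        C += palette[perm[k]].view(3, 1, 1) * masks[k]   # paint part k with its colour
    return C

def decode_colors_to_masks(C: torch.Tensor, palette: torch.Tensor, min_pixels: int = 50):
    """Quantize a sampled map C back to binary masks by nearest reference colour,
    discarding parts that contain only a few pixels."""
    dist = ((C.unsqueeze(0) - palette.view(-1, 3, 1, 1)) ** 2).sum(dim=1)  # (Q, 2H, 2W)
    labels = dist.argmin(dim=0)
    masks = [(labels == q) for q in range(palette.shape[0])]
    return [m for m in masks if m.sum() >= min_pixels]
```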
[Figure 3 panels: multi-view renders of a whole object and of its individual parts (Part 1, Part 2, ..., Part N).]

Figure 3. Training data. We obtain a dataset of 3D objects decomposed into parts from assets created by artists. These come 'naturally' decomposed into parts according to the artist's design.

Implementation details. The network Φ_seg has the same architecture as the network Φ, with some changes to allow conditioning on the multi-view image I: we encode it into latent space with the VAE and stack it with the noised latent as the input to the diffusion network.

3.3. Contextual part completion

The method so far has produced a multi-view image I of the 3D object along with 2D segments M^1, M^2, ..., M^S. What remains is to convert those into full 3D part reconstructions. Given a mask M, in principle we could simply submit the masked image I ⊙ M to the RM Ψ to obtain a 3D reconstruction of the part, i.e., Ŝ = Ψ(I ⊙ M). However, in multi-view images, some parts can be heavily occluded by other parts and, in extreme cases, entirely invisible. While we could train the RM to handle such occlusions directly, in practice this does not work, as part completion is inherently a stochastic problem, whereas the RM is deterministic.

To handle this ambiguity, we repurpose yet again the multi-view generator Φ, this time to perform part completion. The latter model is able to generate a 3D object from text or a single image, so, properly fine-tuned, it should be able to hallucinate any missing portion of a part.

Formally, we consider fine-tuning Φ to sample a view J ∼ p(J | I ⊙ M), mapping the masked image I ⊙ M to the completed multi-view image J of the part. However, we note that sometimes parts are barely visible, so the masked image I ⊙ M provides very little information. Furthermore, we need the generated part to fit well with the other parts and the whole object. Hence, we also provide the un-masked image I to the model as context. Thus, we sample J ∼ p(J | I ⊙ M, I, M), conditioning on the masked image I ⊙ M, the unmasked image I, and the mask M. The importance of the context I increases with the extent of the occlusion.

Implementation details. The network architecture resembles that of Sec. 3.2, but extends the conditioning, motivated by the inpainting setup in [71]. We apply the pre-trained VAE separately to the masked image I ⊙ M and the context image I, yielding 2 × 8 channels, and stack them with the 8D noise image and the unencoded part mask M to obtain the 25-channel input to the diffusion model. Example results are shown in Figure 5.
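The 25-channel conditioning can be assembled as in the following sketch, assuming an 8-channel latent VAE exposed through a hypothetical vae_encode callable; resizing the mask to the latent resolution is an assumption, since the text only states that the mask is left unencoded.

```python
import torch
import torch.nn.functional as F

def build_completion_input(noisy_latent, image, mask, vae_encode):
    """Assemble the 25-channel input of the part-completion diffusion model.

    noisy_latent: (B, 8, h, w)  noised latent being denoised
    image:        (B, 3, H, W)  full multi-view grid I (context)
    mask:         (B, 1, H, W)  binary part mask M
    vae_encode:   assumed callable mapping (B, 3, H, W) -> (B, 8, h, w)
    """
    masked_latent = vae_encode(image * mask)   # 8 channels: encoded I ⊙ M
    context_latent = vae_encode(image)         # 8 channels: encoded context I
    # The mask itself stays unencoded; here it is resized to the latent resolution.
    mask_lowres = F.interpolate(mask, size=noisy_latent.shape[-2:], mode="nearest")
    return torch.cat([noisy_latent, masked_latent, context_latent, mask_lowres], dim=1)  # 25 ch.
```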
3.4. Part reconstruction

Given a multi-view part image J, the final step is to reconstruct the part in 3D. Because the part views are now complete and consistent, we can simply use the RM to obtain a predicted reconstruction Ŝ = Ψ(J) of the part. We found that the model does not require special fine-tuning to move from objects to their parts, so any good-quality reconstruction model can be plugged into our pipeline directly.

3.5. Training data

To train our models, we require a dataset of 3D models consisting of multiple parts. We have built this dataset from a collection of 140k 3D-artist-generated assets that we licensed for AI training from a commercial source. Each asset L is stored as a GLTF scene that contains, in general, several watertight meshes (S_1, ..., S_S) that often align with semantic parts, since they were created by a human who likely aimed to produce an editable asset. Example objects from the dataset are shown in Fig. 3. We preprocess the data differently for each of the three models we fine-tuned.

Multi-view generator data. To train the multi-view generator models Φ, we first have to render the target multi-view images I consisting of 4 views of the full object. Following Instant3D [39], we rendered shaded colours I from 4 views at orthogonal azimuths and 20° elevation and arranged them in a 2 × 2 grid. In the case of text conditioning, the training data consists of pairs {(I_n, y_n)}_{n=1}^N of multi-view images and their text captions. Following AssetGen [76], we chose the 10k highest-quality assets and generated their text captions using a CAP3D-like pipeline [52] based on the LLAMA3 model [16]. In the case of image conditioning, we use all 140k models, and the conditioning y_n comes in the form of a single render from a randomly sampled direction (not just one of the four used in I_n).

Part segmentation and completion data. To train the part segmentation and completion networks, we need to additionally render the multi-view part images and their depth maps. Since different creators have different ideas about part decomposition, we filter the dataset to avoid excessively granular parts which likely lack semantic meaning. To this end, we first cull the parts that take up less than 5% of the object volume, and then remove the assets that have more than 10 parts or consist of a single monolithic part. This results in a dataset of 45k objects containing a total of 210k parts. Given the asset L = (S_1, ..., S_S), we render a set of multi-view images {J^s}_{s=1}^S (shown in Fig. 3) and the corresponding depth maps {δ^s}_{s=1}^S from the same viewpoints as above.
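A minimal sketch of the volume-based filtering described above is given next, assuming the parts are loaded as watertight trimesh meshes; measuring the object volume as the sum of its part volumes is an assumption, as the paper does not state how the volume is computed.

```python
import trimesh

def filter_asset(part_meshes, min_volume_frac=0.05, max_parts=10):
    """Return the retained parts of one asset, or None if the asset is discarded.
    part_meshes: list of watertight trimesh.Trimesh objects (one per part)."""
    volumes = [abs(m.volume) for m in part_meshes]   # watertight meshes expose a signed volume
    total = sum(volumes)
    if total == 0:
        return None
    # Cull excessively small parts (< 5% of the object volume).
    kept = [m for m, v in zip(part_meshes, volumes) if v / total >= min_volume_frac]
    # Discard assets that are too granular or not decomposed at all.
    if len(kept) > max_parts or len(kept) <= 1:
        return None
    return kept
```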
[Figure 4: each row shows the input grid view followed by segmentations from SAM2, fine-tuned SAM2, SAM2 on the 4-view grid, Part123, and three samples from our method.]

Figure 4. Examples of automatic multi-view part segmentations. By running our method several times, we obtain different segmentations, covering the space of artist intents.

                         Automatic              Seeded
Method                   mAP50↑    mAP75↑      mAP50↑    mAP75↑
Part123 [44]             11.5      7.4         10.3      6.5
SAM2† [70]               20.3      11.8        24.6      13.1
SAM2∗ [70]               37.4      27.0        44.2      30.1
SAM2 [70]                35.3      23.4        41.4      27.4
PartGen (1 sample)       45.2      32.9        44.9      33.5
PartGen (5 samples)      54.2      33.9        51.3      32.9
PartGen (10 samples)     59.3      38.5        53.7      35.4

Table 1. Segmentation results. SAM2∗ is fine-tuned on our data and SAM2† is fine-tuned for multi-view segmentation.

[Figure 5: columns show the context view, the incomplete part, the mask, the ground truth, and three samples from our method.]

Figure 5. Qualitative results of part completion. The images with blue borders are the inputs. Our algorithm produces various plausible outputs across different runs. Even if given an empty part, PartGen attempts to generate internal structures inside the object, such as sand or inner wheels.

The segmentation diffusion network is trained on the dataset of pairs {(I_n, M_n)}_{n=1}^N, where the segmentation map M = [M^k]_{k=1}^S is a stack of multi-view binary part masks M^k ∈ {0, 1}^{2H×2W}. Each mask shows the pixels where the corresponding part is visible in I: M^k_{i,j} = [k = argmin_l δ^l_{i,j}], where k, l ∈ {1, ..., S} and the brackets denote the Iverson bracket. The part completion network is trained on the dataset of triplets {(I_{n'}, J_{n'}, M_{n'})}_{n'=1}^{N'}. All the components are produced in the way described above.

4. Experiments

Evaluation protocol. We first individually evaluate the two main components of our pipeline, namely part segmentation (Sec. 4.1) and part completion and reconstruction (Sec. 4.2). We then evaluate how well the decomposed reconstruction matches the original object (Sec. 4.3). For all experiments, we use 100 held-out objects from the dataset described in Sec. 3.5.

4.1. Part segmentation

Evaluation protocol. We set up two settings for the segmentation task. One is automatic part segmentation, where the input is the multi-view image I and the method is required to output all parts of the object M^1, ..., M^S. The other is seeded segmentation, where we assume that the user provides a point as an additional input for a specific mask. In both cases, the segmentation algorithm is regarded as a black box M̂ = A(I) mapping the multi-view image I to a ranked list of N part segmentations (which can in general partially overlap). This ranked list is obtained by scoring candidate regions and removing redundant ones; see the sup. mat. for more details. We then match these segments to the ground-truth segments M^k and report mean Average Precision (mAP). This precision can be low in practice due to the inherent ambiguity of the problem: many of the parts predicted by the algorithm will not match any particular artist's choice.
                                                                    View completion J            3D reconstruction S
Method                                 Compl.  Multi-view  Context  CLIP↑   LPIPS↓  PSNR↑       CLIP↑   LPIPS↓  PSNR↑
Oracle (Ĵ = J)                         GT      —           —        1.0     0.0     ∞           0.957   0.027   18.91
PartGen (Ĵ = B(I ⊙ M, I))              ✓       ✓           ✓        0.974   0.015   21.38       0.936   0.039   17.16
w/o context† (Ĵ = B(I ⊙ M))            ✓       ✓           ✗        0.951   0.028   16.80       0.923   0.046   14.83
single view‡ (Ĵ_v = B(I_v ⊙ M_v, I_v)) ✓       ✗           ✓        0.944   0.031   15.92       0.922   0.051   13.25
None (Ĵ = I ⊙ M)                       ✗       —           —        0.932   0.039   13.24       0.913   0.059   12.32

Table 2. Part completion results. We first evaluate view part completion by computing scores w.r.t. the ground-truth multi-view part image J. Then, we evaluate 3D part reconstruction by reconstructing each part S and rendering it. See text for details.
[Figure 6 panels: (a) Part-Aware Text-to-3D with prompts "A chihuahua wearing a tutu", "A cat wearing a lion costume", and "A gummy bear driving a convertible"; (b) Part-Aware Image-to-3D; (c) 3D Decomposition. Each panel shows the input, the generated or reconstructed 3D object, and example parts.]

Figure 6. Examples of applications. PartGen can effectively generate or reconstruct 3D objects with meaningful and realistic parts in different scenarios: a) Part-aware text-to-3D generation; b) Part-aware image-to-3D generation; c) 3D decomposition.

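Returning to the segmentation protocol of Sec. 4.1, the following sketch illustrates how a ranked list of predicted masks can be scored against the ground-truth parts at a single IoU threshold; it is a simplified average-precision computation with greedy matching, not the exact evaluation code.

```python
import torch

def average_precision(pred_masks, gt_masks, iou_thresh=0.5):
    """pred_masks: ranked list of (2H, 2W) bool tensors (best first);
    gt_masks: list of (2H, 2W) bool tensors. Returns AP at the given IoU threshold."""
    matched = [False] * len(gt_masks)
    tps, precisions = 0, []
    for rank, p in enumerate(pred_masks, start=1):
        best_iou, best_j = 0.0, -1
        for j, g in enumerate(gt_masks):
            if matched[j]:
                continue
            inter = (p & g).sum().item()
            union = (p | g).sum().item()
            iou = inter / union if union else 0.0
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= iou_thresh:
            matched[best_j] = True
            tps += 1
            precisions.append(tps / rank)   # precision recorded at each new true positive
    return sum(precisions) / len(gt_masks) if gt_masks else 0.0
```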
Baselines. We consider the original and fine-tuned SAM2 [70] as our baselines for multi-view segmentation. We fine-tune SAM2 in two different ways. First, we fine-tune SAM2's mask decoder on our dataset, given the ground-truth masks and randomly selected seed points for different views. Second, we concatenate the four orthogonal views into a multi-view image I and fine-tune SAM2 to predict the multi-view mask M (in this case, the seed point randomly falls in one of the views). SAM2 produces three regions for each input image and seed point. For automatic segmentation, we seed SAM2 with a set of query points spread over the object, obtaining three different regions for each seed point. For seeded segmentation, we simply return the regions that SAM2 outputs for the given seed point. We also provide a comparison with the recent work Part123 [44].

Results. We report the results in Tab. 1. As shown in the table, the mAP results for our method are much higher than the others, including SAM2 fine-tuned on our data. This is primarily because of the ambiguity of the segmentation task, which is better captured by our generator-based approach. We further provide qualitative results in Fig. 4.

4.2. Part completion and reconstruction

We utilize the same test data as in Sec. 4.1, forming tuples (S, I, M^k, J^k) consisting of the 3D object part S, the full multi-view image I, the part mask M^k, and the multi-view image J^k of the part, as described in Section 3.5. We choose one random part index k per model and omit it from the notation below to be more concise.

Evaluation protocol. The completion algorithm and its baselines are treated as a black box Ĵ = B(I ⊙ M, I) that predicts the completed multi-view image Ĵ. We then compare Ĵ to the ground-truth render J using the Peak Signal-to-Noise Ratio (PSNR) of the foreground pixels, Learned Perceptual Image Patch Similarity (LPIPS) [100], and CLIP similarity [69]. The latter is an important metric since the completion task is highly ambiguous, and thus evaluating semantic similarity can provide additional insights. We also evaluate the quality of the reconstruction of the predicted completions by comparing the reconstructed object part Ŝ = Ψ(Ĵ) to the ground-truth part S using the same metrics, but averaged after rendering the part to four random novel viewpoints.
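These image metrics can be computed as in the sketch below, assuming the lpips package and the Hugging Face transformers CLIP implementation; the CLIP backbone and the foreground-masking details are assumptions, as the paper does not specify them.

```python
import torch
import lpips
from transformers import CLIPModel, CLIPProcessor

lpips_net = lpips.LPIPS(net="vgg")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def foreground_psnr(pred, gt, fg_mask):
    """PSNR over foreground pixels only; pred, gt in [0, 1] with shape (3, H, W),
    fg_mask a (1, H, W) binary mask."""
    mse = ((pred - gt) ** 2 * fg_mask).sum() / (3 * fg_mask.sum()).clamp(min=1)
    return 10 * torch.log10(1.0 / mse)

def lpips_score(pred, gt):
    # lpips expects inputs in [-1, 1] with a batch dimension.
    return lpips_net(pred[None] * 2 - 1, gt[None] * 2 - 1).item()

def clip_similarity(pred_pil, gt_pil):
    inputs = clip_proc(images=[pred_pil, gt_pil], return_tensors="pt")
    feats = clip.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item()   # cosine similarity of image embeddings
```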
Results. We compare our part completion algorithm (Ĵ = B(I ⊙ M, I)) to several baselines and the oracle, testing no completion (Ĵ = I ⊙ M), omitting the context (Ĵ = B(I ⊙ M)), completing single views independently (Ĵ_v = B(I_v ⊙ M_v, I_v)), and the oracle (Ĵ = J). The latter provides the upper bound on the part reconstruction performance, where the only bottleneck is the RM.

As shown in Tab. 2, our model largely surpasses the baselines. Both joint multi-view reasoning and contextual part completion are important for good performance. We further provide qualitative results in Fig. 5.

4.3. Reassembling parts

Evaluation protocol. Starting from a multi-view image I of a 3D object L, we run the segmentation algorithm to obtain the segmentation (M̂^1, ..., M̂^S), reconstruct each 3D part as Ŝ_k = Ψ(Ĵ_k), and reassemble the 3D object L̂ by merging the 3D parts {Ŝ_1, ..., Ŝ_N}. We then compare L̂ = ∪_k Ψ(Ĵ_k) to the unsegmented reconstruction L̂ = Ψ(I) using the same protocol as for the parts.

Results. Table 3 shows that our method achieves performance comparable to directly reconstructing the objects using the RM (L̂ = Ψ(I)), with the additional benefit of producing a reconstruction structured into parts, which are useful for downstream applications such as editing.

Method                       CLIP↑   LPIPS↓   PSNR↑
PartGen (L̂ = ∪_k Ψ(Ĵ_k))     0.952   0.065    20.33
Unstructured (L̂ = Ψ(I))      0.955   0.064    20.47

Table 3. Model reassembling result. The quality of 3D reconstruction of the object as a whole is close to that of the part-based compositional reconstruction, which proves that the predicted parts fit together well.

4.4. Applications

Part-aware text-to-3D generation. First, we apply PartGen to part-aware text-to-3D generation. We train a text-to-multi-view generator similar to [76], which takes a text prompt as input and outputs a grid of four views. For illustration, we use the prompts from DreamFusion [65]. As shown in Fig. 6, PartGen can effectively generate 3D objects with distinct and completed parts, even in challenging cases with heavy occlusions, such as the gummy bear. Additional examples are provided in the supp. mat.

Part-aware image-to-3D generation. Next, we consider part-aware image-to-3D generation. Building upon the text-to-multi-view generator, we further fine-tune the generator to accept images as input with a strategy similar to [95]. Further training details are provided in the supplementary material. Results are shown in Fig. 6, demonstrating that PartGen is successful in this case as well.

Real-world 3D object decomposition. PartGen can also decompose real-world 3D objects. We show this using objects from Google Scanned Objects (GSO) [15]. Given a 3D object from GSO, we render different views to obtain an image grid and then apply PartGen as above. The last row of Figure 6 shows that PartGen can effectively decompose real-world 3D objects too.

3D part editing. Finally, we show that once the 3D parts are decomposed, they can be further modified through text input. As illustrated in Fig. 7, a variant of our method enables effective editing of the shape and texture of the parts based on textual prompts. The details of the 3D editing model are provided in the supplementary material.

[Figure 7 panels: an original shirt edited to "White T-shirt with logo", "Hawaii shirt", "Cloth with colorful texture"; an original hat edited to "Black magic hat", "White hat", "Cowboy hat"; an original cup edited to "Pink cup with square bottom", "Green cup with cute logo", "Yellow cup with a smile on it".]

Figure 7. 3D part editing. We can edit the appearance and shape of the 3D objects with text prompts.

5. Conclusion

We have introduced PartGen, a novel approach to generate or reconstruct compositional 3D objects from text, images, or unstructured 3D objects. PartGen can reconstruct in 3D parts that are only minimally visible, or not visible at all, utilizing the guidance of a specially designed multi-view diffusion prior. We have also shown several applications of PartGen, including text-guided part editing. This is a promising step towards the generation of 3D assets that are more useful in professional workflows.
References

[1] Hertz Amir, Perel Or, Giryes Raja, Sorkine-Hornung Olga, and Cohen-Or Daniel. SPAGHETTI: editing implicit shapes through part aware generation. In ACM Transactions on Graphics, 2022. 3
[2] Yash Sanjay Bhalgat, Iro Laina, Joao F. Henriques, Andrea Vedaldi, and Andrew Zisserman. Contrastive Lift: 3D object instance segmentation by slow-fast contrastive fusion. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2023. 3
[3] Yash Sanjay Bhalgat, Iro Laina, Joao F. Henriques, Andrew Zisserman, and Andrea Vedaldi. N2F2: Hierarchical scene understanding with nested neural feature fields. In Proceedings of the European Conference on Computer Vision (ECCV), 2024. 3
[4] Aleksei Bokhovkin and Angela Dai. Neural part priors: Learning to optimize part-based object completion in rgb-d scans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9032–9042, 2023. 3
[5] Ang Cao, Justin Johnson, Andrea Vedaldi, and David Novotny. Lightplane: Highly-scalable components for neural 3d fields. arXiv preprint arXiv:2404.19760, 2024. 4, 2
[6] Eric R. Chan, Koki Nagano, Matthew A. Chan, Alexander W. Bergman, Jeong Joon Park, Axel Levy, Miika Aittala, Shalini De Mello, Tero Karras, and Gordon Wetzstein. Generative novel view synthesis with 3D-aware diffusion models. In Proc. ICCV, 2023. 3
[7] Yongwei Chen, Tengfei Wang, Tong Wu, Xingang Pan, Kui Jia, and Ziwei Liu. Comboverse: Compositional 3d assets creation using spatially-aware diffusion guidance. arXiv preprint arXiv:2403.12409, 2024. 3
[8] Zilong Chen, Feng Wang, and Huaping Liu. Text-to-3D using Gaussian splatting. arXiv, 2309.16585, 2023. 3
[9] Zilong Chen, Yikai Wang, Feng Wang, Zhengyi Wang, and Huaping Liu. V3D: Video diffusion models are effective 3D generators. arXiv, 2403.06738, 2024. 3
[10] Zheng Chong, Xiao Dong, Haoxiang Li, Shiyue Zhang, Wenqing Zhang, Xujie Zhang, Hanqing Zhao, and Xiaodan Liang. Catvton: Concatenation is all you need for virtual try-on with diffusion models. arXiv preprint arXiv:2407.15886, 2024. 1
[11] Dana Cohen-Bar, Elad Richardson, Gal Metzer, Raja Giryes, and Daniel Cohen-Or. Set-the-scene: Global-local training for generating controllable nerf scenes. In Proc. ICCV Workshops, 2023. 3
[12] CSM. CSM text-to-3D cube 2.0, 2024. 2
[13] Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam S. Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ramanathan, Zijian He, Peter Vajda, and Devi Parikh. Emu: Enhancing image generation models using photogenic needles in a haystack. CoRR, abs/2309.15807, 2023. 4, 1
[14] Deemos. Rodin text-to-3D gen-1 (0525) v0.5, 2024. 2
[15] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022. 2, 8
[16] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Grégoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel M. Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, and Kevin Stone. The Llama 3 herd of models. arXiv, 2407.21783, 2024. 5
[17] Dave Epstein, Ben Poole, Ben Mildenhall, Alexei A. Efros, and Aleksander Holynski. Disentangled 3d scene generation with layout learning, 2024. 3
[18] Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T. Barron, and Ben Poole. CAT3D: create anything in 3d with multi-view diffusion models. arXiv, 2405.10314, 2024. 2, 3
[19] Kyle Genova, Forrester Cole, Daniel Vlasic, Aaron Sarna, William T. Freeman, and Thomas Funkhouser. Learning shape templates with structured implicit functions. In Proc. CVPR, 2019. 3
[20] Kyle Genova, Forrester Cole, Avneesh Sud, Aaron Sarna, and Thomas A. Funkhouser. Local deep implicit functions for 3D shape. In Proc. CVPR, 2020. 3
[21] Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas Oguz. 3DGen: Triplane latent diffusion for textured mesh generation. CoRR, abs/2303.05371, 2023. 2
[22] Junlin Han, Jianyuan Wang, Andrea Vedaldi, Philip Torr, and Filippos Kokkinos. Flex3d: Feed-forward 3d generation with flexible reconstruction model and input view curation. arXiv preprint arXiv:2410.00890, 2024. 3
[23] Junlin Han, Filippos Kokkinos, and Philip Torr. Vfusion3d: Learning scalable 3d generative models from video diffusion models. In European Conference on Computer Vision, pages 333–350. Springer, 2025. 3
[24] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proc. NeurIPS, 2020. 1
[25] Lukas Höllein, Aljaz Bozic, Norman Müller, David Novotný, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer, and Matthias Nießner. ViewDiff: 3D-consistent image generation with text-to-image models. In Proc. CVPR, 2024. 3
[26] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. LRM: Large reconstruction model for single image to 3D. In Proc. ICLR, 2024. 3
[27] Yukun Huang, Jianan Wang, Yukai Shi, Xianbiao Qi, Zheng-Jun Zha, and Lei Zhang. Dreamtime: An improved optimization strategy for text-to-3D content creation. CoRR, abs/2306.12422, 2023. 3
[28] Ka-Hei Hui, Ruihui Li, Jingyu Hu, and Chi-Wing Fu. Neural template: Topology-aware reconstruction and disentangled generation of 3d meshes. In Proc. CVPR, 2022. 3
[29] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier J. Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver IO: A general architecture for structured inputs & outputs. In Proc. ICLR, 2022. 1
[30] Wonbong Jang and Lourdes Agapito. CodeNeRF: Disentangled neural radiance fields for object categories. In Proc. ICCV, 2021. 2
[31] Heewoo Jun and Alex Nichol. Shap-E: Generating conditional 3D implicit functions. arXiv, 2023. 2
[32] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian Splatting for real-time radiance field rendering. Proc. SIGGRAPH, 42(4), 2023. 3
[33] Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. LERF: language embedded radiance fields. In Proc. ICCV, 2023. 3
[34] Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken Goldberg, Matthew Tancik, and Angjoo Kanazawa. Garfield: Group anything with radiance fields. arXiv.cs, abs/2401.09419, 2024. 3
[35] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. In Proc. CVPR, 2023. 2, 3
[36] Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitzmann. Decomposing NeRF for editing via feature field distillation. arXiv.cs, 2022. 3
[37] Juil Koo, Seungwoo Yoo, Minh Hieu Nguyen, and Minhyuk Sung. SALAD: part-level latent diffusion for 3D shape generation and manipulation. In Proc. ICCV, 2023. 3
[38] D. Larlus, G. Dorko, D. Jurie, and B. Triggs. Pascal visual object classes challenge. In Selected Proceeding of the first PASCAL Challenges Workshop, 2006. 3
[39] Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3D: Fast text-to-3D with sparse-view generation and large reconstruction model. Proc. ICLR, 2024. 2, 3, 4, 5, 1
[40] Yuhan Li, Yishun Dou, Yue Shi, Yu Lei, Xuanhong Chen, Yi Zhang, Peng Zhou, and Bingbing Ni. Focaldreamer: Text-driven 3d editing via focal-fusion assembly, 2023. 3
[41] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-resolution text-to-3D content creation. arXiv.cs, abs/2211.10440, 2022. 3
[42] Connor Lin, Niloy Mitra, Gordon Wetzstein, Leonidas J. Guibas, and Paul Guerrero. NeuForm: adaptive overfitting for neural shape editing. In Proc. NeurIPS, 2022. 3
[43] Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 5404–5411, 2024. 1
[44] Anran Liu, Cheng Lin, Yuan Liu, Xiaoxiao Long, Zhiyang Dou, Hao-Xiang Guo, Ping Luo, and Wenping Wang. Part123: Part-aware 3d reconstruction from a single-view image. arXiv, 2405.16888, 2024. 3, 6, 7
[45] Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3D mesh in 45 seconds without per-shape optimization. In Proc. NeurIPS, 2023. 3
[46] Minghua Liu, Yinhao Zhu, Hong Cai, Shizhong Han, Zhan Ling, Fatih Porikli, and Hao Su. PartSLIP: low-shot part segmentation for 3D point clouds via pretrained image-language models. In Proc. CVPR, 2023. 3
[47] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3D object. In Proc. ICCV, 2023. 3
[48] Weiyu Liu, Jiayuan Mao, Joy Hsu, Tucker Hermans, Animesh Garg, and Jiajun Wu. Composable part-based manipulation. In CoRL 2023, 2023. 2
[49] Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. SyncDreamer: Generating multiview-consistent images from a single-view image. arXiv, 2309.03453, 2023. 3
[50] Xiaoxiao Long, Yuanchen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, and Wenping Wang. Wonder3D: Single image to 3D using cross-domain diffusion. arXiv.cs, abs/2310.15008, 2023. 3
[51] LumaAI. Genie text-to-3D v1.0, 2024. 2
[52] Tiange Luo, Chris Rockwell, Honglak Lee, and Justin Johnson. Scalable 3d captioning with pretrained models. arXiv preprint arXiv:2306.07279, 2023. 5
[53] Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affordances over unstructured data. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2023. 2
[54] Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina, and Andrea Vedaldi. RealFusion: 360 reconstruction of any object from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 3
[55] Luke Melas-Kyriazi, Christian Rupprecht, and Andrea Vedaldi. PC2: Projection-conditioned point cloud diffusion for single-image 3d reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 2
[56] Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, and Filippos Kokkinos. IM-3D: Iterative multiview diffusion and reconstruction for high-quality 3D generation. In Proceedings of the International Conference on Machine Learning (ICML), 2024. 2, 3
[57] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger. Occupancy Networks: Learning 3D reconstruction in function space. In Proc. CVPR, 2019. 3
[58] Meshy. Meshy text-to-3D v3.0, 2024. 2
[59] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In Proc. ECCV, 2020. 3
[60] Tom Monnier, Jake Austin, Angjoo Kanazawa, Alexei Efros, and Mathieu Aubry. Differentiable blocks world: Qualitative 3d decomposition by rendering primitives. Advances in Neural Information Processing Systems, 36:5791–5807, 2023. 3
[61] George Kiyohiro Nakayama, Mikaela Angelina Uy, Jiahui Huang, Shi-Min Hu, Ke Li, and Leonidas Guibas. DiffFacto: controllable part-based 3D point cloud generation with cross diffusion. In Proc. ICCV, 2023. 3
[62] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-E: A system for generating 3D point clouds from complex prompts. arXiv.cs, abs/2212.08751, 2022. 2
[63] Xiaqing Pan, Nicholas Charron, Yongqian Yang, Scott Peters, Thomas Whelan, Chen Kong, Omkar Parkhi, Richard Newcombe, and Carl Yuheng Ren. Aria digital twin: A new benchmark dataset for egocentric 3d machine perception, 2023. 2
[64] Ryan Po and Gordon Wetzstein. Compositional 3d scene generation using locally conditioned diffusion. ArXiv, abs/2303.12218, 2023. 3
[65] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. In Proc. ICLR, 2023. 3, 8
[66] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, and Bernard Ghanem. Magic123: One image to high-quality 3D object generation using both 2D and 3D diffusion priors. arXiv.cs, abs/2306.17843, 2023. 3
[67] Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. LangSplat: 3D language Gaussian splatting. In Proc. CVPR, 2024. 3
[68] Lingteng Qiu, Guanying Chen, Xiaodong Gu, Qi Zuo, Mutian Xu, Yushuang Wu, Weihao Yuan, Zilong Dong, Liefeng Bo, and Xiaoguang Han. Richdreamer: A generalizable normal-depth diffusion model for detail richness in text-to-3D. arXiv.cs, abs/2311.16918, 2023. 3
[69] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proc. ICML, pages 8748–8763, 2021. 3, 7, 1
[70] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. arXiv, 2408.00714, 2024. 2, 6, 7
[71] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proc. CVPR, 2022. 5
[72] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022. 1
[73] Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. Zero123++: a single image to consistent multi-view diffusion base model. arXiv.cs, abs/2310.15110, 2023. 4
[74] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3D generation. In Proc. ICLR, 2024. 3, 4
[75] Aleksandar Shtedritski, Christian Rupprecht, and Andrea Vedaldi. What does clip know about a red circle? visual prompt engineering for vlms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11987–11997, 2023. 2
[76] Yawar Siddiqui, Filippos Kokkinos, Tom Monnier, Mahendra Kariya, Yanir Kleiman, Emilien Garreau, Oran Gafni, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, and David Novotny. Meta 3D Asset Gen: Text-to-mesh generation with high-quality geometry, texture, and PBR materials. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2024. 2, 3, 4, 5, 8
[77] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In Proc. ICLR, 2021. 1
[78] Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen Liu, Zhenda Xie, and Yebin Liu. DreamCraft3D: Hierarchical 3D generation with bootstrapped diffusion prior. arXiv.cs, abs/2310.16818, 2023. 3
[79] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. DreamGaussian: Generative gaussian splatting for efficient 3D content creation. arXiv, 2309.16653, 2023. 3
[80] Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make-It-3D: High-fidelity 3d creation from a single image with diffusion prior. arXiv.cs, abs/2303.14184, 2023. 3
[81] Shitao Tang, Jiacheng Chen, Dilin Wang, Chengzhou Tang, Fuyang Zhang, Yuchen Fan, Vikas Chandra, Yasutaka Furukawa, and Rakesh Ranjan. MVDiffusion++: A dense high-resolution multi-view diffusion model for single or sparse-view 3d object reconstruction. arXiv, 2402.12712, 2024. 3
[82] Konstantinos Tertikas, Despoina Paschalidou, Boxiao Pan, Jeong Joon Park, Mikaela Angelina Uy, Ioannis Z. Emiris, Yannis Avrithis, and Leonidas J. Guibas. PartNeRF: Generating part-aware editable 3D shapes without 3D supervision. arXiv.cs, abs/2303.09554, 2023. 3
[83] TripoAI. Tripo3D text-to-3D, 2024. 2
[84] Vadim Tschernezki, Iro Laina, Diane Larlus, and Andrea Vedaldi. Neural Feature Fusion Fields: 3D distillation of self-supervised 2D image representation. In Proceedings of the International Conference on 3D Vision (3DV), 2022. 3
[85] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, and Greg Shakhnarovich. Score Jacobian chaining: Lifting pretrained 2D diffusion models for 3D generation. In Proc. CVPR, 2023. 3
[86] Peng Wang and Yichun Shi. ImageDream: Image-prompt multi-view diffusion for 3D generation. In Proc. ICLR, 2024. 3, 4
[87] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. arXiv.cs, abs/2305.16213, 2023. 3
[88] Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. In Proc. ICLR, 2023. 3
[89] Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Liang Pan, Jiawei Ren, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, Dahua Lin, and Ziwei Liu. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 2
[90] Yinghao Xu, Zifan Shi, Wang Yifan, Hansheng Chen, Ceyuan Yang, Sida Peng, Yujun Shen, and Gordon Wetzstein. GRM: Large gaussian reconstruction model for efficient 3D reconstruction and generation. arXiv, 2403.14621, 2024. 4
[91] Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Jiahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, Zexiang Xu, and Kai Zhang. DMV3D: Denoising multi-view diffusion using 3D large reconstruction model. In Proc. ICLR, 2024. 3
[92] Jiayu Yang, Ziang Cheng, Yunfei Duan, Pan Ji, and Hongdong Li. ConsistNet: Enforcing 3D consistency for multi-view images diffusion. arXiv.cs, abs/2310.10343, 2023.
[93] Yunhan Yang, Yukun Huang, Xiaoyang Wu, Yuan-Chen Guo, Song-Hai Zhang, Hengshuang Zhao, Tong He, and Xihui Liu. DreamComposer: Controllable 3D object generation via multi-view conditions. arXiv.cs, abs/2312.03611, 2023. 3
[94] Lior Yariv, Omri Puny, Natalia Neverova, Oran Gafni, and Yaron Lipman. Mosaic-SDF for 3D generative models. arXiv.cs, abs/2312.09222, 2023. 2
[95] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023. 8, 1
[96] Taoran Yi, Jiemin Fang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang. GaussianDreamer: Fast generation from text to 3D gaussian splatting with point cloud priors. arXiv.cs, abs/2310.08529, 2023. 3
[97] Haiyang Ying, Yixuan Yin, Jinzhi Zhang, Fan Wang, Tao Yu, Ruqi Huang, and Lu Fang. Omniseg3d: Omniversal 3d segmentation via hierarchical contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20612–20622, 2024. 3
[98] Wangbo Yu, Li Yuan, Yan-Pei Cao, Xiangjun Gao, Xiaoyu Li, Long Quan, Ying Shan, and Yonghong Tian. HiFi-123: Towards high-fidelity one image to 3D content generation. arXiv.cs, abs/2310.06744, 2023. 3
[99] Guanqi Zhan, Qingnan Fan, Kaichun Mo, Lin Shao, Baoquan Chen, Leonidas J Guibas, Hao Dong, et al. Generative 3d part assembly via dynamic graph learning. Advances in Neural Information Processing Systems, 33:6315–6326, 2020. 3
[100] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proc. CVPR, pages 586–595, 2018. 7
[101] Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, and Andrew J. Davison. In-place scene labelling and understanding with implicit scene representation. In Proc. ICCV, 2021. 3
[102] Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu, Tiejun Huang, and Xinlong Wang. Uni3D: Exploring unified 3D representation at scale. In Proc. ICLR, 2024. 3
[103] Yuchen Zhou, Jiayuan Gu, Xuanlin Li, Minghua Liu, Yunhao Fang, and Hao Su. PartSLIP++: enhancing low-shot 3d part segmentation via multi-view instance segmentation and maximum likelihood estimation. arXiv, 2312.03015, 2023. 3
[104] Junzhe Zhu and Peiye Zhuang. HiFA: High-fidelity text-to-3D with advanced diffusion guidance. CoRR, abs/2305.18766, 2023. 3
[105] Yan Zizheng, Zhou Jiapeng, Meng Fanpeng, Wu Yushuang, Qiu Lingteng, Ye Zisheng, Cui Shuguang, Chen Guanying, and Han Xiaoguang. Dreamdissector: Learning disentangled text-to-3d generation from 2d diffusion priors. ECCV, 2024. 3
PartGen: Part-level 3D Generation and Reconstruction
with Multi-View Diffusion Models
Supplementary Material
This supplementary material contains the following parts:
• Implementation Details. Detailed descriptions of the training and inference settings for all models used in PartGen are provided.
• Additional Experiment Details. We describe the detailed evaluation metrics employed in the experiments and provide additional experiments.
• Additional Examples. We include more outputs of our method, showcasing applications with part-aware text-to-3D, part-aware image-to-3D, real-world 3D decomposition, and iteratively adding parts.
• Failure Cases. We analyse the failure modes of PartGen.
• Ethics and Limitations. We provide a discussion of the ethical considerations of data and usage, as well as the limitations of our method.

[Figure 8: input/target multi-view pairs with editing instructions (e.g. "A red cylindrical cup with a smooth matte finish and a flat bottom", "A red necktie made of smooth shiny material") and input-image/generated-caption pairs (e.g. "A dark brown, tapered, wooden leg with a smooth, glossy surface and a pointed tip", "A dead tree trunk with a rough, brown texture and several thin, bare branches").]
Figure 8. 3D part editing and captioning examples. The top section illustrates training examples for the editing network, where a mask, a masked image, and text instructions are provided as conditioning to the diffusion network, which fills in the part based on the given textual input. The bottom section demonstrates the input for the part captioning pipeline. Here, a red circle and highlights are used to help the large vision-language model (LVLM) identify and annotate the specific part.

A. Implementation Details

We provide the details of the training used in PartGen (Appendices A.1 to A.4). In addition, we provide the implementation details for the applications: part composition (Appendix A.5) and part editing (Appendix A.6).

A.1. Text-to-multi-view generator

We fine-tune the text-to-multi-view generator starting from a pre-trained text-to-image diffusion model that was trained on billions of image-text pairs and uses an architecture and data similar to Emu [13]. We change the target image to a grid of 2×2 views as described in Section 3.5, following Instant3D [39], and train with the v-prediction loss [72]. The resolution of each view is 512×512, resulting in a total size of 1024×1024. To avoid the cluttered-background problem mentioned in [39], we rescale the noise scheduler to force a zero terminal signal-to-noise ratio (SNR) following [43]. We use the DDPM scheduler with 1000 steps [24] for training; during inference, we use the DDIM scheduler [77] with 250 steps. The model is trained on 64 H100 GPUs with a total batch size of 512 and a learning rate of 10^-5 for 10k steps.

A.2. Image-to-multi-view generator

Building on the text-to-multi-view generator, we further fine-tune the model to accept images as input conditioning instead of text. The text condition is removed by setting it to a default null condition (an empty string). We concatenate the conditioning image to the noised image along the spatial dimension, following [10]. Additionally, inspired by IP-Adapter [95], we introduce another cross-attention layer into the diffusion model. The input image is first converted into tokens using CLIP [69], then reprojected into 157 tokens of dimension 1024 using a Perceiver-like architecture [29]. To train the model, we use all 140k 3D models of our data collection, selecting conditioning images with random elevation and azimuth but fixed camera distance and field of view. We use the DDPM scheduler with 1000 steps [24], rescaled SNR, and v-prediction for training. Training is conducted on 64 H100 GPUs with a batch size of 512 and a learning rate of 10^-5 over 15k steps.

A.3. Multi-view segmentation network

To obtain the multi-view segmentation network, we also fine-tune the pre-trained text-to-multi-view model. The input channels are expanded from 8 to 16 to accommodate the additional image input, where 8 corresponds to the latent dimension of the VAE used in our network. We create segmentation-image pairs as inputs. The training setup follows a similar recipe to that of the image-to-multi-view generator, employing a DDPM scheduler, v-prediction, and rescaled SNR. The network is trained on 64 H100 GPUs with a batch size of 512 and a learning rate of 10^-5 for 10k steps.
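As a concrete illustration of the input layout just described, the sketch below shows one way the channel-wise conditioning of the segmentation network could be assembled, assuming a VAE that produces 8-channel latents (with 8× spatial downsampling) and a denoising UNet whose first convolution accepts 16 channels. The names `vae`, `unet`, and `segmentation_step` are illustrative placeholders, not the actual PartGen code.

```python
import torch

def segmentation_step(unet, vae, noisy_latent, view_grid, timestep):
    """One denoising step of the multi-view segmentation network (sketch).

    noisy_latent: (B, 8, H, W) noised latent of the colour-coded segmentation grid.
    view_grid:    (B, 3, 8*H, 8*W) RGB 2x2 grid of the views to segment
                  (assuming a VAE with 8x spatial downsampling).
    """
    with torch.no_grad():
        cond_latent = vae.encode(view_grid)            # (B, 8, H, W) conditioning latent
    x = torch.cat([noisy_latent, cond_latent], dim=1)  # (B, 16, H, W): expanded input channels
    return unet(x, timestep)                           # v-prediction target
```

The completion network of Appendix A.4 follows the same pattern, concatenating additional conditioning signals (context image, masked image, and an unencoded binary mask) to reach 25 input channels.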
[Figure 9: Recall@k curves at IoU > 0.5 and IoU > 0.75 for k = 1, 3, 5, 10, comparing ours (1, 5, and 10 samples) with SAM2 (original, finetuned, and 4 views).]
Figure 9. Recall curves of different methods. Our method achieves better performance compared with SAM2 and its variants.

A.4. Multi-view completion network

The training strategy for the multi-view completion network mirrors that of the multi-view segmentation network, with the key difference lying in the input configuration. The number of input channels (in latent space) is increased to 25 by including the context image, the masked image, and the binary mask, where the mask remains a single unencoded channel. Example inputs are illustrated in Figure 5 of the main text. The network is trained on 64 H100 GPUs with a batch size of 512 and a learning rate of 10^-5 for approximately 10k steps.

A.5. Parts assembly

When compositing an object from its parts, we observed that simply combining, during rendering, the implicit neural fields of the parts reconstructed by the Reconstruction Model (RM) at their respective spatial locations achieves satisfactory results.

To describe this formally, we first review the rendering function of LightplaneLRM [5], which we use as our reconstruction model. LightplaneLRM employs a generalized Emission-Absorption (EA) model for rendering, which calculates the transmittance T_{i,j}, representing the probability that a photon emitted at position x_{ij} (the j-th sampling point on the i-th ray) reaches the sensor. The rendered feature (e.g., colour) v_i of ray r_i is then computed as

v_i = \sum_{j=1}^{R-1} (T_{i,j-1} - T_{i,j}) \, f_v(x_{ij}),

where f_v(x_{ij}) denotes the feature of the 3D point x_{ij}, T_{i,j} = \exp(-\sum_{k=0}^{j} \Delta \cdot \sigma(x_{ik})), \Delta is the distance between two sampled points, \sigma(x_{ik}) is the opacity at position x_{ik}, and T_{i,j-1} - T_{i,j} captures the visibility of the point.

We now show how we generalise this to rendering N parts. Given feature functions f_v^1, \dots, f_v^N and their opacity functions \sigma^1, \dots, \sigma^N, the rendered feature of a specific ray r_i becomes

v_i = \sum_{j=1}^{R-1} \sum_{h=1}^{N} (\hat{T}_{i,j-1} - \hat{T}_{i,j}) \, w_{ij}^h \, f_v^h(x_{ij}),

where w_{ij}^h = \sigma^h(x_{ij}) / \sum_{l=1}^{N} \sigma^l(x_{ij}) is the weight of the feature f_v^h(x_{ij}) at x_{ij} for part h, \hat{T}_{i,j} = \exp(-\sum_{k=0}^{j} \sum_{h=1}^{N} \Delta \cdot \sigma^h(x_{ik})), \Delta is the distance between two sampled points, \sigma^h(x_{ik}) is the opacity at position x_{ik} for part h, and \hat{T}_{i,j-1} - \hat{T}_{i,j} is the visibility of the point. A short numerical sketch of this compositing rule is given at the end of this appendix.

A.6. 3D part editing

As shown in the main text and Figure 7, once 3D assets are generated or reconstructed as a composition of different parts through PartGen, specific parts can be edited using text instructions. To enable this, we fine-tune the text-to-multi-view generator using triplets of part multi-view images, masks, and text descriptions. Examples of the training data are shown in Figure 8 (top). Notably, instead of supplying the mask of the part to be edited, we provide the mask of the remaining parts. This design choice encourages the editing network to imagine the part's shape without constraining the region onto which it has to project. The training recipe is similar to that of the multi-view segmentation network.

To generate captions for the different parts, we establish an annotation pipeline similar to the one used for captioning the whole object: captions for the various views are first generated using LLAMA3 and then summarized into a single unified caption, also using LLAMA3. The key challenge in this variant is that some parts are difficult to identify without knowing the context of the whole object. We thus employ a technique inspired by [75]: specifically, we use a red annulus and alpha blending to emphasize the part being annotated. Example inputs and generated captions are shown in Figure 8 (bottom). The network is trained on 64 H100 GPUs with a batch size of 512 and a learning rate of 10^-5 over 10k steps.
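To make the part compositing of Appendix A.5 concrete, here is a minimal NumPy sketch of the multi-part emission-absorption rule for a single ray, assuming the per-part opacities and features have already been sampled along the ray. It illustrates the equations above and is not the LightplaneLRM implementation.

```python
import numpy as np

def composite_parts(sigmas, feats, delta):
    """Composite one ray from N part fields (sketch of the rule in Appendix A.5).

    sigmas: (N, R) opacities sigma^h(x_ij) of each part at the R ray samples.
    feats:  (N, R, C) features f_v^h(x_ij) (e.g. RGB) at the same samples.
    delta:  scalar spacing between consecutive samples.
    Returns the composited feature v_i with shape (C,).
    """
    eps = 1e-8
    sigma_tot = sigmas.sum(axis=0)                 # total opacity per sample
    # T_hat_{i,j} = exp(-sum_{k<=j} sum_h delta * sigma^h(x_ik))
    T = np.exp(-np.cumsum(delta * sigma_tot))
    T_prev = np.concatenate([[1.0], T[:-1]])       # T_hat_{i,j-1}, taking transmittance before the ray as 1
    alpha = T_prev - T                             # visibility of each sample
    w = sigmas / (sigma_tot[None, :] + eps)        # per-part weights w_ij^h
    # sum over samples j and parts h of alpha_j * w_ij^h * f_v^h(x_ij)
    return np.einsum("r,nr,nrc->c", alpha, w, feats)

# Tiny usage example: two parts, four samples along the ray, RGB features.
rng = np.random.default_rng(0)
v_i = composite_parts(rng.random((2, 4)), rng.random((2, 4, 3)), delta=0.1)
```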
[Figure 10: part-aware generation examples ("a panda rowing a boat in a pond", "a dachshund dressed up in a hotdog costume") showing, for each input, the full object and three decomposed parts.]
Figure 10. More examples. Additional examples illustrate that PartGen can process various modalities and effectively generate or reconstruct 3D objects with distinct parts.

B. Additional Experiment Details

We provide a detailed explanation of the ranking rules applied to the different methods and the formal definition of mean average precision (mAP) used in our evaluation protocol. Additionally, we report the recall at K in the automatic segmentation setting.

Ranking the parts. For evaluation using mAP and recall at K, it is necessary to rank the part proposals. For our method, we run the segmentation network several times and concatenate the results into an initial set P of segment proposals. Then, we assign to each segment M̂ ∈ P a reliability score based on how frequently it overlaps with similar segments in the list, i.e.,

s(\hat{M}) = \Big| \big\{ \hat{M}' \in P : m(\hat{M}', \hat{M}) > \tfrac{1}{2} \big\} \Big|,

where the Intersection over Union (IoU) [38] metric is given by

m(\hat{M}, M) = \mathrm{IoU}(\hat{M}, M) = \frac{|\hat{M} \cap M| + \epsilon}{|\hat{M} \cup M| + \epsilon}.

The constant ε = 10^-4 smooths the metric when both regions are empty, in which case m(∅, ∅) = 1, and will be useful later. Finally, we sort the regions by decreasing score s(M̂) and, scanning the list from high to low, we incrementally remove duplicates, i.e., regions that overlap by more than 1/2 with any of the regions selected so far. The final result is a ranked list of multi-view masks M = (M̂_1, ..., M̂_N), where N ≤ |P| and

\forall i < j : \; s(\hat{M}_i) \geq s(\hat{M}_j) \;\wedge\; m(\hat{M}_i, \hat{M}_j) < \tfrac{1}{2}.

Other algorithms like SAM2 come with their own region reliability metric s, which we use for sorting. We otherwise apply non-maxima suppression to their ranked regions in the same way as for ours.

[Figure 11: iteratively composing "a chihuahua wearing a tutu" from one, two, and three parts into the final object.]
Figure 11. Iteratively adding parts. We show that users can iteratively add parts and combine the results of the PartGen pipeline.

Computing mAP. The image I comes from an object with parts (S_1, ..., S_S), from which we obtain the ground-truth part masks S = (M_1, ..., M_S) as explained in Section 3.5 of the main text. We assign ground-truth segments to candidates with the following procedure: we go through the list M = (M̂_1, ..., M̂_N), match each candidate in turn to the ground-truth segment with the highest IoU, exclude that ground-truth segment, and continue traversing the candidate list. We measure the degree of overlap between a predicted segment and a ground-truth segment as m(M̂, M) ∈ [0, 1]. Given this metric, we then report the mean Average Precision (mAP) at different IoU thresholds τ. Recall that, based on this definition, computing the AP for a sample involves matching predicted segments to ground-truth segments in ranking order, ensuring that each ground-truth segment is matched only once, and accounting for any unmatched ground-truth segments.

In more detail, we start by scanning the list of segments M̂_k in order k = 1, 2, .... Each time, we compare M̂_k to the ground-truth segments S and define

s^* = \arg\max_{s=1,\dots,S} m(\hat{M}_k, M_s).

If m(M̂_k, M_{s^*}) ≥ τ, we label the region M_{s^*} as retrieved by setting y_k = 1 and remove it from the list of ground-truth segments not yet recalled by setting

S \leftarrow S \setminus \{M_{s^*}\}.

Otherwise, if m(M̂_k, M_{s^*}) < τ or if S is empty, we set y_k = 0. We repeat this process for all k, which results in labels (y_1, ..., y_N) ∈ {0, 1}^N. We then set the average precision (AP) at τ to be

\mathrm{AP}(M, S; \tau) = \frac{1}{S} \sum_{k=1}^{N} \sum_{i=1}^{k} \frac{y_i \, y_k}{k}.

Note that this quantity is at most 1 because, by construction, \sum_{i=1}^{N} y_i \leq S: we cannot match more proposals than there are ground-truth regions. mAP is defined as the average of the AP over all test samples.
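The following is a small self-contained sketch of the greedy matching and AP-at-τ computation described above, assuming masks are given as boolean NumPy arrays and proposals are already ranked. It mirrors the definitions in this appendix but is not the exact evaluation code used for the paper.

```python
import numpy as np

def iou(a, b, eps=1e-4):
    """m(a, b) = (|a ∩ b| + eps) / (|a ∪ b| + eps) for boolean masks a, b."""
    return (np.logical_and(a, b).sum() + eps) / (np.logical_or(a, b).sum() + eps)

def average_precision(proposals, ground_truth, tau):
    """AP at IoU threshold tau for one sample.

    proposals:    ranked list of boolean masks (most reliable first).
    ground_truth: list of boolean ground-truth part masks.
    """
    remaining = list(range(len(ground_truth)))  # ground-truth masks not yet recalled
    y = []
    for pred in proposals:
        if remaining:
            s_star = max(remaining, key=lambda s: iou(pred, ground_truth[s]))
            if iou(pred, ground_truth[s_star]) >= tau:
                y.append(1)
                remaining.remove(s_star)
                continue
        y.append(0)
    y = np.array(y, dtype=float)
    ks = np.arange(1, len(y) + 1)
    # AP = (1/S) * sum_k (y_k / k) * sum_{i<=k} y_i
    return float((y / ks * np.cumsum(y)).sum() / len(ground_truth))
```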
Computing recall at K. For a given sample, we define the recall at K as the curve

R(K; M, S, \tau) = \frac{1}{S} \sum_{s=1}^{S} \chi\Big[ \max_{k=1,\dots,K} m(\hat{M}_k, M_s) > \tau \Big].

Hence, this is simply the fraction of ground-truth segments recovered by looking up to position K in the ranked list of predicted segments. The results in Figure 9 demonstrate that our diffusion-based method outperforms SAM2 and its variants by a large margin and shows consistent improvement as the number of samples increases.

Seeded part segmentation. To evaluate seeded part segmentation, the assessment proceeds as before, except that a single ground-truth part S and mask M are considered at a time, and the corresponding seed point u ∈ M is passed to the algorithm, (M̂_1, ..., M̂_K) = A(I, u). Note that, because the problem is still ambiguous, it makes sense for the algorithm to still produce a ranked list of possible part segments.
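For completeness, a matching sketch of the recall-at-K curve defined above, using the same boolean-mask conventions and a hypothetical `iou` helper identical to the one in the previous snippet:

```python
import numpy as np

def iou(a, b, eps=1e-4):
    # m(a, b) with the same epsilon smoothing as in Appendix B
    return (np.logical_and(a, b).sum() + eps) / (np.logical_or(a, b).sum() + eps)

def recall_at_k(proposals, ground_truth, tau, K):
    """R(K; M, S, tau): fraction of ground-truth parts hit by one of the top-K ranked proposals."""
    top_k = proposals[:K]
    hits = sum(
        1 for gt in ground_truth
        if top_k and max(iou(pred, gt) for pred in top_k) > tau
    )
    return hits / len(ground_truth)
```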

C. Additional Examples

More application examples. We provide additional application examples in Figure 10, showcasing the versatility of our approach across varying input types. These include part-aware text-to-3D generation, where textual prompts guide the synthesis of 3D models with semantically distinct parts; part-aware image-to-3D generation, which reconstructs 3D objects from a single image while maintaining detailed part-level decomposition; and real-world 3D decomposition, where complex real-world objects are segmented into different parts. These examples demonstrate the broad applicability and robustness of PartGen in handling diverse inputs and scenarios.

Iteratively adding parts. As shown in Figure 11, we demonstrate the capability of our approach to compose a 3D object by iteratively adding individual parts to it. Starting with different inputs, users can seamlessly integrate additional parts step by step, maintaining consistency and coherence in the resulting 3D model. This process highlights the flexibility and modularity of our method, enabling fine-grained control over the composition of complex objects while preserving the semantic and structural integrity of the composition.

[Figure 12: failure cases for inputs such as "an orangutan using chopsticks to eat ramen" and "a group of squirrels rowing crew"; panels show (a) grid view generation failure, (b) segmentation failure, and (c) reconstruction model failure with the corresponding depth map.]
Figure 12. Failure Cases. (a) Multi-view grid generation failure, where the generated views lack 3D consistency. (b) Segmentation failure, where semantically distinct parts are incorrectly grouped together. (c) Reconstruction model failure, where the complex geometry of the input leads to inaccuracies in the depth map.

D. Failure Cases

As outlined in the method section, PartGen incorporates several steps, including multi-view grid generation, multi-view segmentation, multi-view part completion, and 3D part reconstruction. Failures at different stages result in specific issues. For instance, as shown in Figure 12(a), failures in grid view generation can cause inconsistencies in the 3D reconstruction, such as misrepresentations of the orangutan's hands or the squirrel's oars. The segmentation method can sometimes group semantically distinct parts together, as in Figure 12(b), and is limited, in our implementation, to objects containing no more than 10 parts; otherwise, it merges different building blocks into a single part. Furthermore, highly complex input structures, such as dense grass and leaves, can lead to poor reconstruction outcomes, particularly in terms of depth quality, as illustrated in Figure 12(c).

E. Ethics and Limitations

Ethics. Our models are trained on datasets derived from artist-created 3D assets. These datasets may contain biases that could propagate into the outputs, potentially resulting in culturally insensitive or inappropriate content. To mitigate this, we strongly encourage users to implement safeguards and adhere to ethical guidelines when deploying PartGen in real-world applications.

Limitations. In this work, we focus primarily on object-level generation, leveraging artist-created 3D assets as our training dataset. However, this approach is heavily dependent on the quality and diversity of the dataset. Extending the method to scene-level generation and reconstruction is a promising direction, but it will require further research and exploration.
