Figure 1. We introduce PartGen, a pipeline that generates compositional 3D objects similar to a human artist. It can start from text,
an image, or an existing, unstructured 3D object. It consists of a multi-view diffusion model that identifies plausible parts automatically
and another that completes and reconstructs them in 3D, accounting for their context, i.e., the other parts, to ensure that they fit together
correctly. Additionally, PartGen enables 3D part editing based on text instructions, enhancing flexibility and control in 3D object creation.
3D segmentation. Our work decomposes a given 3D object into parts. Several works have considered segmenting 3D objects or scenes represented in an unstructured manner, lately as neural fields or 3D Gaussian mixtures. Semantic-NeRF [101] was the first to fuse 2D semantic segmentation maps in 3D with neural fields. DFF [36] and N3F [84] propose to map 2D features to 3D fields, allowing their supervised and unsupervised segmentation. LERF [33] extends this concept to language-aware features like CLIP [69]. Contrastive Lift [2] considers instead instance segmentation, fusing information from several independently-segmented views using a contrastive formulation. GARField [34] and OminiSeg3D [97] consider that concepts exist at different levels of scale, which they identify with the help of SAM [35]. LangSplat [67] leverages both CLIP and SAM, creating distinct 3D language fields to model each SAM scale explicitly, while N2F2 [3] automates binding the correct scale to each concept. Neural Part Priors [4] completes and decomposes 3D scans with learned part priors in a test-time optimization manner. Finally, Uni3D [102] learns a ‘foundation’ model for 3D point clouds that can perform zero-shot segmentation.

This section introduces PartGen, our framework for generating 3D objects that are fully decomposable into complete 3D parts. Each part is a distinct, human-interpretable, and self-contained element, representing the 3D object compositionally. PartGen can take different modalities as input (text prompts, image prompts, or 3D assets) and performs part segmentation and completion by repurposing a powerful multi-view diffusion model for these two tasks. An overview of PartGen is shown in Figure 2.
The rest of the section is organised as follows. In Sec. 3.1, we briefly introduce the necessary background on multi-view diffusion and how PartGen can be applied to text, image, or 3D model inputs. Then, in Secs. 3.2 to 3.4 we describe how PartGen automatically segments, completes, and reconstructs meaningful parts in 3D.

Figure 2. Overview of PartGen. Our method begins with text, single images, or existing 3D objects to obtain an initial grid view of the object. This view is then processed by a diffusion-based segmentation network to achieve multi-view consistent part segmentation. Next, the segmented parts, along with contextual information, are input into a multi-view part completion network to generate a fully completed view of each part. Finally, a pre-trained reconstruction model generates the 3D parts.

3.1. Background on 3D generation
First, we provide essential background on multi-view diffusion models for 3D generation [39, 74, 76]. These methods usually adopt a two-stage approach to 3D generation.
In the first stage, given a prompt $y$, an image generator $\Phi$ outputs several 2D views of the object from different vantage points. Depending on the nature of $y$, the network $\Phi$ is either a text-to-image (T2I) model [39, 74] or an image-to-image (I2I) one [73, 86]. These are fine-tuned to output a single ‘multi-view’ image $I \in \mathbb{R}^{3 \times 2H \times 2W}$, where views from the four cardinal directions around the object are arranged into a 2 × 2 grid. This model thus provides a probabilistic mapping $I \sim p(I \mid \Phi, y)$. The 2D views $I$ are subsequently passed to a Reconstruction Model (RM) [39, 76, 90] $\Psi$, i.e., a neural network that reconstructs the 3D object $L$ in both shape and appearance. Compared to direct 3D generation, this two-stage paradigm takes full advantage of an image generation model pre-trained on internet-scale 2D data.
This approach is general and can be applied with various implementations of image-generation and reconstruction models. Our work in particular follows a setup similar to AssetGen [76]. Specifically, we obtain $\Phi$ by finetuning a pre-trained text-to-image diffusion model with an architecture similar to Emu [13], a diffusion model in an 8-channel latent space, the mapping to which is provided by a specially trained variational autoencoder (VAE). The detailed fine-tuning strategy can be found in Sec. 4.4 and supplementary material. When the input is a 3D model, we render multiple views to form the grid view. For the RM $\Psi$ we use LightplaneLRM [5], trained on our dataset.

3.2. Multi-view part segmentation
The first major contribution of our paper is a method for segmenting an object into its constituent parts. Inspired by multi-view diffusion approaches, we frame object decomposition into parts as a multi-view segmentation task, rather than as direct 3D segmentation. At a high level, the goal is to map $I$ to a collection of 2D masks $M^1, \dots, M^S \in \{0, 1\}^{2H \times 2W}$, one for each visible part of the object. Both the image $I$ and the masks $M^i$ are multi-view grids.
Addressing 3D object segmentation through the lens of multi-view diffusion offers several advantages. First, it allows us to repurpose existing multi-view models $\Phi$, which, as described in Sec. 3.1, are already pre-trained to produce multi-view consistent generations in the RGB domain. Second, it integrates easily with established multi-view frameworks. Third, decomposing an object into parts is an inherently non-deterministic, ambiguous task as it depends on the desired verbosity level, individual preferences, and artistic intent. By learning this task with probabilistic diffusion models, we can effectively capture and model this ambiguity. We thus train our model on a curated dataset of artist-created 3D objects, where each object $L$ is annotated with a possible decomposition into 3D parts, $L = (S_1, \dots, S_S)$. The dataset details are provided in Sec. 3.5.
Consider that the input is a multi-view image $I$, and the output is a set of multi-view part masks $M^1, M^2, \dots, M^S$. To finetune our multi-view image generators $\Phi$ for mask prediction, we quantize the RGB space into $Q$ different colors $c_1, \dots, c_Q \in [0, 1]^3$. For each training sample $L = (S_k)_{k=1}^{S}$, we assign colors to the parts, mapping part $S_k$ to color $c_{\pi_k}$, where $\pi$ is a random permutation on $\{1, \dots, Q\}$ (we assume that $Q \geq S$). Given this mapping, we render the segmentation map as a multi-view RGB image $C \in [0, 1]^{3 \times 2H \times 2W}$ (Fig. 4). Then, we fine-tune $\Phi$ to (1) take as conditioning the multi-view image $I$, and (2) generate the color-coded multi-view segmentation map $C$, hence sampling from a distribution $C \sim p(C \mid \Phi_{\text{seg}}, I)$.
This approach can produce alternative segmentations by simply re-running $\Phi_{\text{seg}}$, which is stochastic. It further exploits the fact that $\Phi_{\text{seg}}$ is stochastic to discount the specific ‘naming’ or coloring of the parts, which is arbitrary. Naming is a technical issue in instance segmentation which usually requires ad-hoc solutions, and here is solved ‘for free’.
To extract the segments at test time, we sample the image $C$ and simply quantize it based on the reference colors $c_1, \dots, c_Q$, discarding parts that contain only a few pixels.
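As an illustration of the color-coding scheme of Sec. 3.2, the following NumPy sketch encodes a part-label map as an RGB image using a random permutation of reference colors, and then recovers binary masks by nearest-color quantization. The palette (`reference_colors`), the use of black as background, and the `min_pixels` threshold are assumptions made for this example only; the text does not specify them.

```python
import numpy as np

def reference_colors(q):
    # Hypothetical palette: points of a 3x3x3 RGB lattice, skipping black,
    # which this sketch reserves for the background.
    levels = np.array([0.0, 0.5, 1.0])
    grid = np.stack(np.meshgrid(levels, levels, levels, indexing="ij"), -1).reshape(-1, 3)
    assert q <= len(grid) - 1
    return grid[1:q + 1]

def encode(part_ids, colors, rng):
    """part_ids: (H, W) ints, 0 = background, 1..S = parts. Returns an (H, W, 3) image C."""
    s = int(part_ids.max())
    assert s <= len(colors)              # the text assumes Q >= S
    pi = rng.permutation(len(colors))    # random assignment of colors to parts
    image = np.zeros(part_ids.shape + (3,))
    for k in range(1, s + 1):
        image[part_ids == k] = colors[pi[k - 1]]
    return image

def decode(image, colors, min_pixels=4):
    """Quantize each pixel to the nearest reference color and drop tiny segments."""
    palette = np.concatenate([np.zeros((1, 3)), colors])           # index 0 = background
    dist = np.linalg.norm(image[..., None, :] - palette, axis=-1)  # (H, W, Q + 1)
    nearest = dist.argmin(-1)
    masks = [nearest == q for q in range(1, len(palette))]
    return [m for m in masks if m.sum() >= min_pixels]
```

Because the decoder only groups pixels by color, the particular color that a part receives does not matter, which mirrors the remark above that the ‘naming’ of parts is arbitrary.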
[Figure 3 panels: Whole object, Part 1, Part 2, …, Part N]
Figure 3. Training data. We obtain a dataset of 3D objects decomposed into parts from assets created by artists. These come ‘naturally’ decomposed into parts according to the artist’s design.

by the inpainting setup in [71]. We apply the pre-trained VAE separately to the masked image $I \odot M$ and context image $I$, yielding 2 × 8 channels, and stack them with the 8D noise image and the unencoded part mask $M$ to obtain the 25-channel input to the diffusion model. Example results are shown in Figure 5.

3.4. Part reconstruction
Given a multi-view part image $J$, the final step is to reconstruct the part in 3D. Because the part views are now complete and consistent, we can simply use the RM to obtain a predicted reconstruction $\hat{S} = \Psi(J)$ of the part. We found that the model does not require special finetuning to move from objects to their parts, so any good quality reconstruction model can be plugged into our pipeline directly.
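The channel count of the completion network input described above (2 × 8 latent channels for the masked and context images, 8 for the noise image, and 1 for the unencoded mask) can be checked with a short NumPy sketch. The function name, the channel ordering, and the assumption that the mask has already been resized to the latent resolution are all hypothetical choices for illustration.

```python
import numpy as np

def completion_input(z_masked, z_context, z_noise, mask):
    """Stack three 8-channel latents and a 1-channel mask into a 25-channel input.

    All arrays are channel-first, (C, h, w). The mask is assumed to be already
    resized to the latent resolution (an assumption of this sketch).
    """
    assert z_masked.shape == z_context.shape == z_noise.shape
    assert z_masked.shape[0] == 8
    assert mask.shape == (1,) + z_masked.shape[1:]
    return np.concatenate([z_masked, z_context, z_noise, mask], axis=0)

# Usage with random stand-ins for the VAE latents.
h = w = 32
rng = np.random.default_rng(0)
x = completion_input(rng.normal(size=(8, h, w)),
                     rng.normal(size=(8, h, w)),
                     rng.normal(size=(8, h, w)),
                     np.ones((1, h, w)))
```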
Figure 4. Examples of automatic multi-view part segmentations. By running our method several times, we obtain different segmenta-
tions, covering the space of artist intents.
[Figure columns: Context, Incomplete Part, Mask, GT, Ours (Sample 1), Ours (Sample 2), Ours (Sample 3)]
                        Automatic              Seeded
Method                  mAP50 ↑    mAP75 ↑     mAP50 ↑    mAP75 ↑
Part123 [44]            11.5       7.4         10.3       6.5
SAM2† [70]              20.3       11.8        24.6       13.1
SAM2∗ [70]              37.4       27.0        44.2       30.1
SAM2 [70]               35.3       23.4        41.4       27.4
PartGen (1 sample)      45.2       32.9        44.9       33.5
PartGen (5 samples)     54.2       33.9        51.3       32.9
PartGen (10 samples)    59.3       38.5        53.7       35.4
Table 2. Part completion results. We first evaluate view part completion by computing scores w.r.t. the ground-truth multi-view part
image J. Then, we evaluate 3D part reconstruction by reconstructing each part S and rendering it. See text for details.
[Figure 6 panels: (a) Part-Aware Text-to-3D, with columns Input, Generated 3D, Example Parts; (c) 3D Decomposition, with columns Input, Reconstructed 3D, Example Parts]
Figure 6. Examples of applications. PartGen can effectively generate or reconstruct 3D objects with meaningful and realistic parts in
different scenarios: a) Part-aware text-to-3D generation; b) Part-aware image-to-3D generation; c) 3D decomposition.
Baselines. We consider the original and fine-tuned SAM2 [70] as our baselines for multi-view segmentation. We fine-tune SAM2 in two different ways. First, we fine-tune SAM2’s mask decoder on our dataset, given the ground truth masks and randomly selected seed points for different views. Second, we concatenate the four orthogonal views in a multi-view image $I$ and fine-tune SAM2 to predict the multi-view mask $M$ (in this case, the seed point randomly falls in one of the views). SAM2 produces three regions for each input image and seed point. For automatic segmentation, we seed SAM2 with a set of query points spread over the object, obtaining three different regions for each seed point. For seeded segmentation, we simply return the regions that SAM2 outputs for the given seed point. We also provide a comparison with recent work, Part123 [44].

Results. We report the results in Tab. 1. As shown in the table, the mAP results for our method are much higher than those of the other methods, including SAM2 fine-tuned on our data. This is primarily because of the ambiguity of the segmentation task, which is better captured by our generator-based approach. We further provide qualitative results in Fig. 4.

4.2. Part completion and reconstruction
We utilize the same test data as in Sec. 4.1, forming tuples $(S, I, M^k, J^k)$ consisting of the 3D object part $S$, the full multi-view image $I$, the part mask $M^k$ and the multi-view image $J^k$ of the part, as described in Section 3.5. We choose one random part index $k$ per model, and will omit it from the notation below to be more concise.

Evaluation protocol. The completion algorithm and its baselines are treated as a black box $\hat{J} = B(I \odot M, I)$ that predicts the completed multi-view image $\hat{J}$. We then compare $\hat{J}$ to the ground-truth render $J$ using Peak Signal to Noise Ratio (PSNR) of the foreground pixels, Learned Perceptual Image Patch Similarity (LPIPS) [100], and CLIP similarity [69]. The latter is an important metric since the
completion task is highly ambiguous, and thus evaluating semantic similarity can provide additional insights. We also evaluate the quality of the reconstruction of the predicted completions by comparing the reconstructed object part $\hat{S} = \Psi(\hat{J})$ to the ground-truth part $S$ using the same metrics, but averaged after rendering the part to four random novel viewpoints.

Results. We compare our part completion algorithm ($\hat{J} = B(I \odot M, I)$) to several baselines and the oracle, testing using no completion ($\hat{J} = I \odot M$), omitting context ($\hat{J} = B(I \odot M)$), completing single views independently ($\hat{J}_v = B(I_v \odot M_v, I_v)$), and the oracle ($\hat{J} = J$). The latter provides the upper bound on the part reconstruction performance, where the only bottleneck is the RM.
As shown in Tab. 2, our model largely surpasses the baselines. Both joint multi-view reasoning and contextual part completion are important for good performance. We further provide qualitative results in Fig. 5.

4.3. Reassembling parts
Evaluation protocol. Starting from the multi-view image $I$ of a 3D object $L$, we run the segmentation algorithm to obtain the segmentation $(\hat{M}^1, \dots, \hat{M}^S)$, reconstruct each 3D part as $\hat{S}_k = \Psi(\hat{J}_k)$, and reassemble the 3D object $\hat{L}$ by merging the 3D parts $\{\hat{S}_1, \dots, \hat{S}_N\}$. We then compare $\hat{L} = \bigcup_k \Psi(\hat{J}_k)$ to the unsegmented reconstruction $\hat{L} = \Psi(I)$ using the same protocol as for parts.

Results. Table 3 shows that our method achieves performance comparable to directly reconstructing the objects using the RM ($\hat{L} = \Psi(I)$), with the additional benefit of producing a reconstruction structured into parts, which is useful for downstream applications such as editing.

Method                                              CLIP ↑    LPIPS ↓    PSNR ↑
PartGen ($\hat{L} = \bigcup_k \Psi(\hat{J}_k)$)     0.952     0.065      20.33
Unstructured ($\hat{L} = \Psi(I)$)                  0.955     0.064      20.47
Table 3. Model reassembling result. The quality of 3D reconstruction of the object as a whole is close to that of the part-based compositional reconstruction, which indicates that the predicted parts fit together well.

4.4. Applications
Part-aware text-to-3D generation. First, we apply PartGen to part-aware text-to-3D generation. We train a text-to-multi-view generator similar to [76], which takes a text prompt as input and outputs a grid of four views. For illustration, we use the prompts from DreamFusion [65]. As shown in Fig. 6, PartGen can effectively generate 3D objects with distinct and completed parts, even in challenging cases with heavy occlusions, such as the gummy bear. Additional examples are provided in the supp. mat.

Part-aware image-to-3D generation. Next, we consider part-aware image-to-3D generation. Building upon the text-to-multi-view generator, we further fine-tune the generator to accept images as input with a strategy similar to [95]. Further training details are provided in the supplementary materials. The results in Fig. 6 demonstrate that PartGen is successful in this case as well.

Real-world 3D object decomposition. PartGen can also decompose real-world 3D objects. We show this using objects from Google Scanned Objects (GSO) [15]. Given a 3D object from GSO, we render different views to obtain an image grid and then apply PartGen as above. The last row of Figure 6 shows that PartGen can effectively decompose real-world 3D objects too.

3D part editing. Finally, we show that once the 3D parts are decomposed, they can be further modified through text input. As illustrated in Fig. 7, a variant of our method enables effective editing of the shape and texture of the parts based on textual prompts. The details of the 3D editing model are provided in the supplementary materials.

[Figure 7 panels, each row showing the original followed by three edits. Row 1: “White T-shirt with logo”, “Hawaii shirt”, “Cloth with colorful texture”. Row 2: “Black magic hat”, “White hat”, “Cowboy hat”. Row 3: “pink cup with square bottom”, “Green cup with cute logo”, “Yellow cup with a smile on it”.]
Figure 7. 3D part editing. We can edit the appearance and shape of the 3D objects with text prompts.

5. Conclusion
We have introduced PartGen, a novel approach to generate or reconstruct compositional 3D objects from text, images, or unstructured 3D objects. PartGen can reconstruct in 3D parts that are even minimally visible, or not visible at all, utilizing the guidance of a specially-designed multi-view diffusion prior. We have also shown several applications of PartGen, including text-guided part editing. This is a promising step towards the generation of 3D assets that are more useful in professional workflows.
PartGen: Part-level 3D Generation and Reconstruction
with Multi-View Diffusion Models
Supplementary Material
This supplementary material contains the following parts:
• Implementation Details. Detailed descriptions of the training and inference settings for all models used in PartGen
[Figure panel labels: Input, Target. Example caption: “A red cylindrical cup with a smooth matte finish a flat bottom”]
[Figure 9 plot. Legend: ours (1 sample), ours (5 samples), ours (10 samples), SAM2 (finetuned), SAM2 (original), SAM2 (4 views). Horizontal axis ticks: 1, 3, 5, 10. Vertical axis ticks: 0.10 to 0.30.]
Figure 9. Recall curve of different methods. Our method achieves better performance compared with SAM2 and its variants.

generator, employing a DDPM scheduler, v-prediction, and rescaled SNR. The network is trained with 64 H100 GPUs, a batch size of 512, and a learning rate of $10^{-5}$, for 10k steps.

A.4. Multi-view completion network
The training strategy for the multi-view completion network mirrors that of the multi-view segmentation network, with the key difference in the input configuration. The number of input channels (in latent space) is increased to 25 by including the context image, masked image, and binary mask, where the mask remains a single unencoded channel. Example inputs are illustrated in Figure 5 of the main text. The network is trained with 64 H100 GPUs, a batch size of 512, and a learning rate of $10^{-5}$, for approximately 10k steps.

A.5. Parts assembly
When compositing an object from its parts, we observed that simply combining the implicit neural fields of parts reconstructed by the Reconstruction Model (RM) in the rendering process with their respective spatial locations achieves satisfactory results.
To describe this formally, we first review the rendering function of LightplaneLRM [5] that we use as our reconstruction model. LightplaneLRM employs a generalized

$$v_i = \sum_{j=1}^{R-1} \sum_{h=1}^{N} \left(\hat{T}_{i,j-1} - \hat{T}_{i,j}\right) w_{ij}^{h} \cdot f_v^{h}(x_{ij}),$$

where $w_{ij}^{h} = \sigma^{h}(x_{ij}) / \sum_{l=1}^{N} \sigma^{l}(x_{ij})$ is the weight of the feature $f_v^{h}(x_{ij})$ at $x_{ij}$ for part $h$; $\hat{T}_{i,j} = \exp\left(-\sum_{k=0}^{j} \sum_{h=1}^{N} \Delta \cdot \sigma^{h}(x_{ik})\right)$; $\Delta$ is the distance between two sampled points; $\sigma^{h}(x_{ik})$ is the opacity at position $x_{ik}$ for part $h$; and $\hat{T}_{i,j-1} - \hat{T}_{i,j}$ is the visibility of the point.

A.6. 3D part editing
As shown in the main text and Figure 7, once 3D assets are generated or reconstructed as a composition of different parts through PartGen, specific parts can be edited using text instructions to achieve 3D part editing. To enable this, we fine-tune the text-to-multi-view generator using part multi-view images, masks, and text description pairs. Examples of the training data are shown in Figure 8 (top). Notably, instead of supplying the mask for the part to be edited, we provide the mask of the remaining parts. This design choice encourages the editing network to imagine the part’s shape without constraining the region where it has to project. The training recipe is similar to that of the multi-view segmentation network.
To generate captions for different parts, we establish an annotation pipeline similar to the one used for captioning the whole object, where captions for various views are first generated using LLAMA3 and then summarized into a single unified caption using LLAMA3 as well. The key challenge in this variant is that some parts are difficult to identify without knowing the context information of the object. We thus employ a technique inspired by [75]. Specifically, we use a red annulet and alpha blending to emphasize the part being annotated. Example inputs and generated captions are shown in Figure 8 (bottom). The network is trained with 64 H100 GPUs, a batch size of 512, and a learning rate of $10^{-5}$ over 10,000 steps.
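To make the compositing rule of Sec. A.5 concrete, the sketch below evaluates the rendering sum for a single ray in NumPy. It is only an illustration of the formula as written there: the function name `composite_ray`, the array layout, and the small constant guarding against division by zero are assumptions of this example and are not taken from LightplaneLRM.

```python
import numpy as np

def composite_ray(sigma, feat, delta):
    """Composite N parts along one ray.

    sigma: (N, R) opacities sigma^h(x_ik) for parts h and samples k = 0..R-1.
    feat:  (N, R, C) features f_v^h(x_ik).
    delta: distance between two consecutive samples.
    Returns the rendered feature v_i of shape (C,).
    """
    total = sigma.sum(0)                          # summed opacity over parts, (R,)
    T = np.exp(-delta * np.cumsum(total))         # T[j] = exp(-sum_{k<=j} delta * total[k])
    visibility = T[:-1] - T[1:]                   # T[j-1] - T[j] for j = 1..R-1
    w = sigma / np.maximum(total, 1e-12)          # per-part weights w^h, (N, R)
    mixed = (w[:, 1:, None] * feat[:, 1:]).sum(0) # weighted feature per sample, (R-1, C)
    return (visibility[:, None] * mixed).sum(0)
```

One property that follows directly from the formula is that splitting a part into two copies with half the opacity each leaves the rendered value unchanged, since only the summed opacity enters the transmittance and the weights renormalize the features.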
[Figure 10 panels: Input, Object, Part 1, Part 2, Part 3. Example prompts: “A panda rowing a boat in a pond”; “A dachshund dressed up in a hotdog costume”.]
Figure 10. More examples. Additional examples illustrate that PartGen can process various modalities and effectively generate or reconstruct 3D objects with distinct parts.
[Figure 11 example prompt: “A chihuahua wearing a tutu”.]
Figure 11. Iteratively adding parts. We show that users can iteratively add parts and combine the results of the PartGen pipeline.
The constant $\epsilon = 10^{-4}$ smooths the metric when both regions are empty, in which case $m(\emptyset, \emptyset) = 1$, and will be useful later.
Finally, we sort the regions $M$ by decreasing score $s(M)$ and, scanning the list from high to low, we incrementally remove duplicates down the list if they overlap by more than 1/2 with the regions selected so far. The final result is a ranked list of multi-view masks $\mathcal{M} = (\hat{M}_1, \dots, \hat{M}_N)$ where $N \leq |\mathcal{P}|$ and:

$$\forall i < j: \quad s(\hat{M}_i) \geq s(\hat{M}_j) \;\wedge\; m(\hat{M}_i, \hat{M}_j) < \frac{1}{2}.$$

Other algorithms like SAM2 come with their own region reliability metric $s$, which we use for sorting. We otherwise apply non-maxima suppression to their ranked regions in the same way as ours.

Computing mAP. The image $I$ comes from an object $L$ with parts $(S_1, \dots, S_S)$ from which we obtain the ground-truth part masks $\mathcal{S} = (M^1, \dots, M^S)$ as explained in Section 3.5 in the main text. We assign ground-truth segments to candidates following the procedure: we go through the list $\mathcal{M} = (\hat{M}_1, \dots, \hat{M}_N)$ and match the candidates one by one to the ground truth segment with the highest IoU, exclude that ground-truth segment, and continue traversing the candidate list. We measure the degree of overlap between a predicted segment and a ground truth segment as $m(\hat{M}, M) \in [0, 1]$. Given this metric, we then report the mean Average Precision (mAP) metric at different IoU thresholds $\tau$. Recall that, based on this definition, computing the AP curve for a sample involves matching predicted segments to ground truth segments in ranking order, ensuring that each ground truth segment is matched only once, and considering any unmatched ground truth segments.
In more detail, we start by scanning the list of segments $\hat{M}_k$ in order $k = 1, 2, \dots$. Each time, we compare $\hat{M}_k$ to the ground truth segments $\mathcal{S}$ and define:

$$s^* = \operatorname*{argmax}_{s = 1, \dots, S} m(\hat{M}_k, M_s).$$

If $m(\hat{M}_k, M_{s^*}) \geq \tau$, then we label the region $M_{s^*}$ as retrieved by setting $y_k = 1$ and removing $M_{s^*}$ from the list of ground truth segments not yet recalled by setting

$$\mathcal{S} \leftarrow \mathcal{S} \setminus \{M_{s^*}\}.$$

Otherwise, if $m(\hat{M}_k, M_{s^*}) < \tau$ or if $\mathcal{S}$ is empty, we set $y_k = 0$. We repeat this process for all $k$, which results in labels $(y_1, \dots, y_N) \in \{0, 1\}^N$. We then set the average precision (AP) at $\tau$ to be:

$$\mathrm{AP}(\mathcal{M}, \mathcal{S}; \tau) = \frac{1}{S} \sum_{k=1}^{N} \sum_{i=1}^{k} \frac{y_i \, y_k}{k}.$$
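The matching procedure and the AP formula above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the exact form of the smoothed overlap metric in `iou` is an assumption (the text only states that $\epsilon = 10^{-4}$ makes the metric equal to 1 when both regions are empty), and the function names are hypothetical.

```python
import numpy as np

def iou(a, b, eps=1e-4):
    # Assumed smoothed IoU: equals 1 when both boolean masks are empty.
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return (inter + eps) / (union + eps)

def average_precision(pred, gt, tau):
    """pred: ranked list of predicted masks; gt: list of ground-truth masks."""
    remaining = list(range(len(gt)))     # ground-truth segments not yet recalled
    y = []
    for m_hat in pred:
        if not remaining:                # no ground truth left to match
            y.append(0)
            continue
        scores = [iou(m_hat, gt[s]) for s in remaining]
        best = int(np.argmax(scores))
        if scores[best] >= tau:
            y.append(1)
            remaining.pop(best)          # each ground-truth segment is matched once
        else:
            y.append(0)
    if not y:
        return 0.0
    y = np.array(y, dtype=float)
    precision_at_k = np.cumsum(y) / np.arange(1, len(y) + 1)
    return float((precision_at_k * y).sum() / len(gt))
```

For example, with two ground-truth masks and a ranked list consisting of a correct match, a spurious region, and a second correct match, the labels are (1, 0, 1) and the AP is (1 + 2/3) / 2 = 5/6.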
Note that the AP is at most 1 because, by construction, $\sum_{i=1}^{N} y_i \leq S$, as we cannot match more proposals than there are ground truth regions. mAP is defined as the average of the AP over all test samples.

Computing recall at K. For a given sample, we define recall at $K$ as the curve

$$R(K; \mathcal{M}, \mathcal{S}, \tau) = \frac{1}{S} \sum_{s=1}^{S} \chi\left[\max_{k = 1, \dots, K} m(\hat{M}_k, M_s) > \tau\right],$$

where $\chi$ is the indicator function.

C. Additional Examples
More application examples. We provide additional application examples in Figure 10, showcasing the versatility of our approach to varying input types. These include part-aware text-to-3D generation, where textual prompts guide the synthesis of 3D models with semantically distinct parts; part-aware image-to-3D generation, which reconstructs 3D objects from a single image while maintaining detailed part-level decomposition; and real-world 3D decomposition, where complex real-world objects are segmented into different parts. These examples demonstrate the broad applicability and robustness of PartGen in handling diverse inputs and scenarios.
Iteratively adding parts. As shown in Figure 11, we demonstrate the capability of our approach to compose a 3D object by iteratively adding individual parts to it. Starting with different inputs, users can seamlessly integrate additional parts step by step, maintaining consistency and coherence in the resulting 3D model. This process highlights the flexibility and modularity of our method, enabling fine-grained control over the composition of complex objects while preserving the semantic and structural integrity of the composition.

D. Failure Cases
As outlined in the method section, PartGen incorporates several steps, including multi-view grid generation, multi-view segmentation, multi-view part completion, and 3D part reconstruction. Failures at different stages will result in specific issues. For instance, as shown in Figure 12(a), failures in grid view generation can cause inconsistencies in 3D reconstruction, such as misrepresentations of the orangutan’s hands or the squirrel’s oars. The segmentation method can sometimes group distinct parts together, as in Figure 12(b). It is also limited, in our implementation, to objects containing no more than 10 parts; beyond that, it merges different building blocks into a single part. Furthermore, highly complex input structures, such as dense grass and leaves, can lead to poor reconstruction outcomes, particularly in terms of depth quality, as illustrated in Figure 12(c).

[Figure 12 panels: (a) columns Input, Generated Grid View, Reconstructed 3D, with example prompt “An orangutan using chopsticks to eat ramen”; (b) Segmentation Failure; (c) Reconstruction Model Failure, with columns Input, Reconstructed 3D, Depth map.]
Figure 12. Failure Cases. (a) Multi-view grid generation failure, where the generated views lack 3D consistency. (b) Segmentation failure, where semantically distinct parts are incorrectly grouped together. (c) Reconstruction model failure, where the complex geometry of the input leads to inaccuracies in the depth map.