silent-chen.github.io/PartGen
Figure 1. We introduce PartGen, a pipeline that generates compositional 3D objects similar to a human artist. It can start from text,
an image, or an existing, unstructured 3D object. It consists of a multi-view diffusion model that identifies plausible parts automatically
and another that completes and reconstructs them in 3D, accounting for their context, i.e., the other parts, to ensure that they fit together
correctly. Additionally, PartGen enables 3D part editing based on text instructions, enhancing flexibility and control in 3D object creation.
3D segmentation. Our work decomposes a given 3D object into parts. Several works have considered segmenting 3D objects or scenes represented in an unstructured manner, lately as neural fields or 3D Gaussian mixtures. Semantic-NeRF [101] was the first to fuse 2D semantic segmentation maps in 3D with neural fields. DFF [36] and N3F [84] propose to map 2D features to 3D fields, allowing their supervised and unsupervised segmentation. LERF [33] extends this concept to language-aware features like CLIP [69]. Contrastive Lift [2] considers instead instance segmentation, fusing information from several independently-segmented views using a contrastive formulation. GARField [34] and OmniSeg3D [97] consider that concepts exist at different levels of scale, which they identify with the help of SAM [35]. LangSplat [67] leverages both CLIP and SAM, creating distinct 3D language fields to model each SAM scale explicitly, while N2F2 [3] automates binding the correct scale to each concept. Neural Part Priors [4] completes and decomposes 3D scans with learned part priors in a test-time optimization manner. Finally, Uni3D [102] learns a 'foundation' model for 3D point clouds that can perform zero-shot segmentation.

This section introduces PartGen, our framework for generating 3D objects that are fully decomposable into complete 3D parts. Each part is a distinct, human-interpretable, and self-contained element, representing the 3D object compositionally. PartGen can take different modalities as input (text prompts, image prompts, or 3D assets) and performs part segmentation and completion by repurposing a powerful multi-view diffusion model for these two tasks. An overview of PartGen is shown in Figure 2.

The rest of the section is organised as follows. In Sec. 3.1, we briefly introduce the necessary background on multi-view diffusion and how PartGen can be applied to text, image, or 3D model inputs. Then, in Secs. 3.2 to 3.4 we describe how PartGen automatically segments, completes, and reconstructs meaningful parts in 3D.

3.1. Background on 3D generation

First, we provide essential background on multi-view diffusion models for 3D generation [39, 74, 76]. These methods usually adopt a two-stage approach to 3D generation.

In the first stage, given a prompt y, an image generator Φ outputs several 2D views of the object from different vantage points.
Figure 2. Overview of PartGen. Our method begins with text, single images, or existing 3D objects to obtain an initial grid view of the
object. This view is then processed by a diffusion-based segmentation network to achieve multi-view consistent part segmentation. Next,
the segmented parts, along with contextual information, are input into a multi-view part completion network to generate a fully completed
view of each part. Finally, a pre-trained reconstruction model generates the 3D parts.
Depending on the nature of y, the network Φ is either a text-to-image (T2I) model [39, 74] or an image-to-image (I2I) one [73, 86]. These are fine-tuned to output a single 'multi-view' image I ∈ R^{3×2H×2W}, where views from the four cardinal directions around the object are arranged into a 2 × 2 grid. This model thus provides a probabilistic mapping I ∼ p(I | Φ, y). The 2D views I are subsequently passed to a Reconstruction Model (RM) [39, 76, 90] Ψ, i.e., a neural network that reconstructs the 3D object L in both shape and appearance. Compared to direct 3D generation, this two-stage paradigm takes full advantage of an image generation model pre-trained on internet-scale 2D data.

This approach is general and can be applied with various implementations of the image-generation and reconstruction models. Our work in particular follows a setup similar to AssetGen [76]. Specifically, we obtain Φ by fine-tuning a pre-trained text-to-image diffusion model with an architecture similar to Emu [13], a diffusion model in an 8-channel latent space, the mapping to which is provided by a specially trained variational autoencoder (VAE). The detailed fine-tuning strategy can be found in Sec. 4.4 and the supplementary material. When the input is a 3D model, we render multiple views to form the grid view. For the RM Ψ we use LightplaneLRM [5], trained on our dataset.
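To make the two-stage interface concrete, here is a minimal Python sketch of the grid assembly and the generate-then-reconstruct chain. The function names, the view ordering within the grid, and the callables standing in for Φ and Ψ are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def views_to_grid(views):
    """Arrange four (H, W, 3) renders into the 2x2 'multi-view' image I of
    shape (2H, 2W, 3) described in Sec. 3.1. The ordering of the four
    cardinal views within the grid is an assumption made for illustration."""
    top = np.concatenate([views[0], views[1]], axis=1)
    bottom = np.concatenate([views[2], views[3]], axis=1)
    return np.concatenate([top, bottom], axis=0)

def generate_3d(prompt, sample_multiview, reconstruct):
    """Two-stage generation: sample I ~ p(I | Phi, y), then apply the RM Psi.
    `sample_multiview` and `reconstruct` are placeholders for the fine-tuned
    multi-view generator and the reconstruction model."""
    I = sample_multiview(prompt)   # (2H, 2W, 3) grid of four views
    return reconstruct(I)          # 3D object L (shape and appearance)
```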
3.2. Multi-view part segmentation

The first major contribution of our paper is a method for segmenting an object into its constituent parts. Inspired by multi-view diffusion approaches, we frame object decomposition into parts as a multi-view segmentation task, rather than as direct 3D segmentation. At a high level, the goal is to map I to a collection of 2D masks M^1, ..., M^S ∈ {0, 1}^{2H×2W}, one for each visible part of the object. Both the image I and the masks M^i are multi-view grids.

Addressing 3D object segmentation through the lens of multi-view diffusion offers several advantages. First, it allows us to repurpose existing multi-view models Φ, which, as described in Sec. 3.1, are already pre-trained to produce multi-view consistent generations in the RGB domain. Second, it integrates easily with established multi-view frameworks. Third, decomposing an object into parts is an inherently non-deterministic, ambiguous task, as it depends on the desired verbosity level, individual preferences, and artistic intent. By learning this task with probabilistic diffusion models, we can effectively capture and model this ambiguity. We thus train our model on a curated dataset of artist-created 3D objects, where each object L is annotated with a possible decomposition into 3D parts, L = (S_1, ..., S_S). The dataset details are provided in Sec. 3.5.

Consider that the input is a multi-view image I and the output is a set of multi-view part masks M^1, M^2, ..., M^S. To fine-tune our multi-view image generator Φ for mask prediction, we quantize the RGB space into Q different colors c_1, ..., c_Q ∈ [0, 1]^3. For each training sample L = (S_k)_{k=1}^S, we assign colors to the parts, mapping part S_k to color c_{π_k}, where π is a random permutation on {1, ..., Q} (we assume that Q ≥ S). Given this mapping, we render the segmentation map as a multi-view RGB image C ∈ [0, 1]^{3×2H×2W} (Fig. 4). Then, we fine-tune Φ to (1) take as conditioning the multi-view image I and (2) generate the color-coded multi-view segmentation map C, hence sampling from a distribution C ∼ p(C | Φ_seg, I).

This approach can produce alternative segmentations by simply re-running Φ_seg, which is stochastic. It further exploits the fact that Φ_seg is stochastic to discount the specific 'naming' or coloring of the parts, which is arbitrary. Naming is a technical issue in instance segmentation which usually requires ad-hoc solutions; here it is solved 'for free'. To extract the segments at test time, we sample the image C and simply quantize it based on the reference colors c_1, ..., c_Q, discarding parts that contain only a few pixels.
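As a concrete illustration of the test-time extraction step, the sketch below quantizes one sampled colour-coded grid C against the reference palette and drops tiny segments; the pixel threshold and array conventions are assumptions.

```python
import numpy as np

def masks_from_colour_map(C, palette, min_pixels=50):
    """Turn a sampled colour-coded segmentation grid C of shape (2H, 2W, 3)
    into binary multi-view part masks by nearest-reference-colour
    quantisation, discarding parts with only a few pixels (Sec. 3.2).
    `palette` holds the Q reference colours c_1..c_Q with values in [0, 1]."""
    dists = np.linalg.norm(C[..., None, :] - palette[None, None], axis=-1)
    labels = dists.argmin(axis=-1)          # (2H, 2W) index of nearest colour
    masks = []
    for q in range(len(palette)):
        m = labels == q
        if m.sum() >= min_pixels:           # drop near-empty segments
            masks.append(m)
    return masks                            # list of (2H, 2W) boolean masks
```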
Figure 3. Training data. We obtain a dataset of 3D objects decomposed into parts from assets created by artists. These come 'naturally' decomposed into parts according to the artist's design.

…by the inpainting setup in [71]. We apply the pre-trained VAE separately to the masked image I ⊙ M and the context image I, yielding 2 × 8 channels, and stack them with the 8-channel noise image and the unencoded part mask M to obtain the 25-channel input to the diffusion model. Example results are shown in Figure 5.
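The 25-channel layout can be summarised with the following sketch, assuming an 8-channel VAE encoder callable and PyTorch tensors; the channel ordering and the nearest-neighbour resizing of the mask to the latent resolution are illustrative assumptions.

```python
import torch

def build_completion_input(vae_encode, image, mask, noisy_latent):
    """Assemble the 25-channel input of the part completion model (Sec. 3.3):
    the 8-channel noisy latent, the VAE latents of the masked part image and
    of the context image (8 channels each), and the raw binary mask.
    image: (3, 2H, 2W); mask: (2H, 2W); vae_encode: image -> (8, h, w)."""
    z_part = vae_encode(image * mask)                 # latent of I * M
    z_ctx = vae_encode(image)                         # latent of the context I
    m_lat = torch.nn.functional.interpolate(          # mask stays unencoded,
        mask[None, None].float(),                     # only resized to (h, w)
        size=z_ctx.shape[-2:], mode="nearest")[0]
    return torch.cat([noisy_latent, z_part, z_ctx, m_lat], dim=0)  # (25, h, w)
```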
3.4. Part reconstruction

Given a multi-view part image J, the final step is to reconstruct the part in 3D. Because the part views are now complete and consistent, we can simply use the RM to obtain a predicted reconstruction Ŝ = Ψ(J) of the part. We found that the model does not require special finetuning to move from objects to their parts, so any good quality reconstruction model can be plugged into our pipeline directly.
Figure 4. Examples of automatic multi-view part segmentations. By running our method several times, we obtain different segmentations, covering the space of artist intents.
Figure 5 (columns): Context, Incomplete Part, Mask, GT, Ours Sample 1, Ours Sample 2, Ours Sample 3.
Table 1. Multi-view part segmentation results.

Method                 Automatic             Seeded
                       mAP50↑    mAP75↑      mAP50↑    mAP75↑
Part123 [44]           11.5      7.4         10.3      6.5
SAM2† [70]             20.3      11.8        24.6      13.1
SAM2∗ [70]             37.4      27.0        44.2      30.1
SAM2 [70]              35.3      23.4        41.4      27.4
PartGen (1 sample)     45.2      32.9        44.9      33.5
PartGen (5 samples)    54.2      33.9        51.3      32.9
PartGen (10 samples)   59.3      38.5        53.7      35.4
Table 2. Part completion results. We first evaluate multi-view part completion by computing scores w.r.t. the ground-truth multi-view part image J. Then, we evaluate 3D part reconstruction by reconstructing each part S and rendering it. See the text for details.
Figure 6. Examples of applications. PartGen can effectively generate or reconstruct 3D objects with meaningful and realistic parts in
different scenarios: a) Part-aware text-to-3D generation; b) Part-aware image-to-3D generation; c) 3D decomposition.
Baselines. We consider the original and fine-tuned SAM2 [70] as our baselines for multi-view segmentation. We fine-tune SAM2 in two different ways. First, we fine-tune SAM2's mask decoder on our dataset, given the ground-truth masks and randomly selected seed points for the different views. Second, we concatenate the four orthogonal views in a multi-view image I and fine-tune SAM2 to predict the multi-view mask M (in this case, the seed point randomly falls in one of the views). SAM2 produces three regions for each input image and seed point. For automatic segmentation, we seed SAM2 with a set of query points spread over the object, obtaining three different regions for each seed point. For seeded segmentation, we simply return the regions that SAM2 outputs for the given seed point. We also provide a comparison with the recent work Part123 [44].

Results. We report the results in Tab. 1. As shown in the table, the mAP scores of our method are much higher than those of the other methods, including SAM2 fine-tuned on our data. This is primarily because of the ambiguity of the segmentation task, which is better captured by our generator-based approach. We further provide qualitative results in Fig. 4.

4.2. Part completion and reconstruction

We utilize the same test data as in Sec. 4.1, forming tuples (S, I, M^k, J^k) consisting of the 3D object part S, the full multi-view image I, the part mask M^k, and the multi-view image J^k of the part, as described in Section 3.5. We choose one random part index k per model and omit it from the notation below for conciseness.

Evaluation protocol. The completion algorithm and its baselines are treated as a black box Ĵ = B(I ⊙ M, I) that predicts the completed multi-view image Ĵ. We then compare Ĵ to the ground-truth render J using the Peak Signal-to-Noise Ratio (PSNR) of the foreground pixels, the Learned Perceptual Image Patch Similarity (LPIPS) [100], and the CLIP similarity [69]. The latter is an important metric since the completion task is highly ambiguous, and thus evaluating semantic similarity provides additional insight. We also evaluate the quality of the reconstruction of the predicted completions by comparing the reconstructed object part Ŝ = Ψ(Ĵ) to the ground-truth part S using the same metrics, but averaged after rendering the part from four random novel viewpoints.
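For reference, a minimal sketch of the foreground-restricted PSNR used in this protocol (LPIPS [100] and CLIP [69] similarity are computed with their publicly released models); the array shapes and value ranges are assumptions.

```python
import numpy as np

def foreground_psnr(pred, gt, fg_mask, max_val=1.0):
    """PSNR over foreground pixels only. pred, gt: (H, W, 3) arrays in
    [0, 1]; fg_mask: (H, W) boolean mask, assumed non-empty."""
    diff = (pred - gt)[fg_mask]             # select foreground pixels
    mse = float(np.mean(diff ** 2))
    return 10.0 * np.log10(max_val ** 2 / max(mse, 1e-12))
```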
Results. We compare our part completion algorithm (Ĵ = B(I ⊙ M, I)) to several baselines and to the oracle, testing the use of no completion (Ĵ = I ⊙ M), omitting the context (Ĵ = B(I ⊙ M)), completing single views independently (Ĵ_v = B(I_v ⊙ M_v, I_v)), and the oracle itself (Ĵ = J). The latter provides an upper bound on the part reconstruction performance, where the only bottleneck is the RM. As shown in Tab. 2, our model largely surpasses the baselines. Both joint multi-view reasoning and contextual part completion are important for good performance. We further provide qualitative results in Fig. 5.

4.3. Reassembling parts

Evaluation protocol. Starting from a multi-view image I of a 3D object L, we run the segmentation algorithm to obtain a segmentation (M̂^1, ..., M̂^S), reconstruct each 3D part as Ŝ_k = Ψ(Ĵ_k), and reassemble the 3D object L̂ by merging the 3D parts {Ŝ_1, ..., Ŝ_N}. We then compare L̂ = ∪_k Ψ(Ĵ_k) to the unsegmented reconstruction L̂ = Ψ(I) using the same protocol as for the parts.

Results. Table 3 shows that our method achieves performance comparable to directly reconstructing the objects using the RM (L̂ = Ψ(I)), with the additional benefit of producing a reconstruction structured into parts, which is useful for downstream applications such as editing.

Method                      CLIP↑   LPIPS↓   PSNR↑
PartGen (L̂ = ∪_k Ψ(Ĵ_k))    0.952   0.065    20.33
Unstructured (L̂ = Ψ(I))     0.955   0.064    20.47

Table 3. Model reassembling results. The quality of the 3D reconstruction of the object as a whole is close to that of the part-based compositional reconstruction, which shows that the predicted parts fit together well.

4.4. Applications

Part-aware text-to-3D generation. First, we apply PartGen to part-aware text-to-3D generation. We train a text-to-multi-view generator similar to [76], which takes a text prompt as input and outputs a grid of four views. For illustration, we use the prompts from DreamFusion [65]. As shown in Fig. 6, PartGen can effectively generate 3D objects with distinct and completed parts, even in challenging cases with heavy occlusions, such as the gummy bear. Additional examples are provided in the supp. mat.

Part-aware image-to-3D generation. Next, we consider part-aware image-to-3D generation. Building upon the text-to-multi-view generator, we further fine-tune the generator to accept images as input with a strategy similar to [95]. Further training details are provided in the supplementary material. Results are shown in Fig. 6, demonstrating that PartGen is successful in this case as well.

Real-world 3D object decomposition. PartGen can also decompose real-world 3D objects. We show this using objects from Google Scanned Objects (GSO) [15]. Given a 3D object from GSO, we render different views to obtain an image grid and then apply PartGen as above. The last row of Figure 6 shows that PartGen can effectively decompose real-world 3D objects too.

3D part editing. Finally, we show that once the 3D parts are decomposed, they can be further modified through text input. As illustrated in Fig. 7, a variant of our method enables effective editing of the shape and texture of the parts based on textual prompts. The details of the 3D editing model are provided in the supplementary material.

Figure 7. 3D part editing. We can edit the appearance and shape of the 3D objects with text prompts.

5. Conclusion

We have introduced PartGen, a novel approach to generate or reconstruct compositional 3D objects from text, images, or unstructured 3D objects. PartGen can reconstruct in 3D even parts that are only minimally visible, or not visible at all, using the guidance of a specially designed multi-view diffusion prior. We have also shown several applications of PartGen, including text-guided part editing. This is a promising step towards the generation of 3D assets that are more useful in professional workflows.
References [14] Deemos. Rodin text-to-3D gen-1 (0525) v0.5, 2024. 2
[15] Laura Downs, Anthony Francis, Nate Koenig, Brandon
[1] Hertz Amir, Perel Or, Giryes Raja, Sorkine-Hornung Olga,
Kinman, Ryan Hickman, Krista Reymann, Thomas B
and Cohen-Or Daniel. SPAGHETTI: editing implicit
McHugh, and Vincent Vanhoucke. Google scanned objects:
shapes through part aware generation. In ACM Transac-
A high-quality dataset of 3d scanned household items. In
tions on Graphics, 2022. 3
2022 International Conference on Robotics and Automa-
[2] Yash Sanjay Bhalgat, Iro Laina, Joao F. Henriques, Andrea tion (ICRA), pages 2553–2560. IEEE, 2022. 2, 8
Vedaldi, and Andrew Zisserman. Contrastive Lift: 3D ob- [16] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab-
ject instance segmentation by slow-fast contrastive fusion. hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil
In Proceedings of Advances in Neural Information Process- Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh
ing Systems (NeurIPS), 2023. 3 Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra,
[3] Yash Sanjay Bhalgat, Iro Laina, Joao F. Henriques, Andrew Archie Sravankumar, Artem Korenev, Arthur Hinsvark,
Zisserman, and Andrea Vedaldi. N2F2: Hierarchical scene Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen
understanding with nested neural feature fields. In Pro- Gregerson, Ava Spataru, Baptiste Rozière, Bethany Biron,
ceedings of the European Conference on Computer Vision Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya
(ECCV), 2024. 3 Nayak, Chloe Bi, Chris Marra, Chris McConnell, Chris-
[4] Aleksei Bokhovkin and Angela Dai. Neural part priors: tian Keller, Christophe Touret, Chunyang Wu, Corinne
Learning to optimize part-based object completion in rgb- Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien
d scans. In Proceedings of the IEEE/CVF Conference on Allonsius, Daniel Song, Danielle Pintz, Danny Livshits,
Computer Vision and Pattern Recognition, pages 9032– David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego
9042, 2023. 3 Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor
[5] Ang Cao, Justin Johnson, Andrea Vedaldi, and David Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Di-
Novotny. Lightplane: Highly-scalable components for neu- nan, Eric Michael Smith, Filip Radenovic, Frank Zhang,
ral 3d fields. arXiv preprint arXiv:2404.19760, 2024. 4, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Ander-
2 son, Graeme Nail, Grégoire Mialon, Guan Pang, Guillem
[6] Eric R. Chan, Koki Nagano, Matthew A. Chan, Alexan- Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo
der W. Bergman, Jeong Joon Park, Axel Levy, Miika Ait- Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel M.
tala, Shalini De Mello, Tero Karras, and Gordon Wetzstein. Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon
Generative novel view synthesis with 3D-aware diffusion Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar,
models. In Proc. ICCV, 2023. 3 Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny
[7] Yongwei Chen, Tengfei Wang, Tong Wu, Xingang Pan, Kui Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang,
Jia, and Ziwei Liu. Comboverse: Compositional 3d assets Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak,
creation using spatially-aware diffusion guidance. arXiv Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe,
preprint arXiv:2403.12409, 2024. 3 Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani,
[8] Zilong Chen, Feng Wang, and Huaping Liu. Text-to-3D Kate Plawiak, Ke Li, Kenneth Heafield, and Kevin Stone.
using Gaussian splatting. arXiv, 2309.16585, 2023. 3 The Llama 3 herd of models. arXiv, 2407.21783, 2024. 5
[9] Zilong Chen, Yikai Wang, Feng Wang, Zhengyi Wang, and [17] Dave Epstein, Ben Poole, Ben Mildenhall, Alexei A. Efros,
Huaping Liu. V3D: Video diffusion models are effective and Aleksander Holynski. Disentangled 3d scene genera-
3D generators. arXiv, 2403.06738, 2024. 3 tion with layout learning, 2024. 3
[10] Zheng Chong, Xiao Dong, Haoxiang Li, Shiyue Zhang, [18] Ruiqi Gao, Aleksander Holynski, Philipp Henzler,
Wenqing Zhang, Xujie Zhang, Hanqing Zhao, and Xi- Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan,
aodan Liang. Catvton: Concatenation is all you need Jonathan T. Barron, and Ben Poole. CAT3D: create any-
for virtual try-on with diffusion models. arXiv preprint thing in 3d with multi-view diffusion models. arXiv,
arXiv:2407.15886, 2024. 1 2405.10314, 2024. 2, 3
[11] Dana Cohen-Bar, Elad Richardson, Gal Metzer, Raja [19] Kyle Genova, Forrester Cole, Daniel Vlasic, Aaron Sarna,
Giryes, and Daniel Cohen-Or. Set-the-scene: Global-local William T. Freeman, and Thomas Funkhouser. Learning
training for generating controllable nerf scenes. In Proc. shape templates with structured implicit functions. In Proc.
ICCV Workshops, 2023. 3 CVPR, 2019. 3
[12] CSM. CSM text-to-3D cube 2.0, 2024. 2 [20] Kyle Genova, Forrester Cole, Avneesh Sud, Aaron Sarna,
[13] Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam S. Tsai, Jialiang and Thomas A. Funkhouser. Local deep implicit functions
Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xi- for 3D shape. In Proc. CVPR, 2020. 3
aofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek [21] Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and
Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Barlas Oguz. 3DGen: Triplane latent diffusion for textured
Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Mot- mesh generation. corr, abs/2303.05371, 2023. 2
wani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ra- [22] Junlin Han, Jianyuan Wang, Andrea Vedaldi, Philip Torr,
manathan, Zijian He, Peter Vajda, and Devi Parikh. Emu: and Filippos Kokkinos. Flex3d: Feed-forward 3d genera-
Enhancing image generation models using photogenic nee- tion with flexible reconstruction model and input view cu-
dles in a haystack. CoRR, abs/2309.15807, 2023. 4, 1 ration. arXiv preprint arXiv:2410.00890, 2024. 3
[23] Junlin Han, Filippos Kokkinos, and Philip Torr. Vfusion3d: [38] D. Larlus, G. Dorko, D. Jurie, and B. Triggs. Pascal visual
Learning scalable 3d generative models from video diffu- object classes challenge. In Selected Proceeding of the first
sion models. In European Conference on Computer Vision, PASCAL Challenges Workshop, 2006. 3
pages 333–350. Springer, 2025. 3 [39] Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun
[24] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg
fusion probabilistic models. In Proc. NeurIPS, 2020. 1 Shakhnarovich, and Sai Bi. Instant3D: Fast text-to-3D with
[25] Lukas Höllein, Aljaz Bozic, Norman Müller, David sparse-view generation and large reconstruction model.
Novotný, Hung-Yu Tseng, Christian Richardt, Michael Proc. ICLR, 2024. 2, 3, 4, 5, 1
Zollhöfer, and Matthias Nießner. ViewDiff: 3D-consistent [40] Yuhan Li, Yishun Dou, Yue Shi, Yu Lei, Xuanhong Chen,
image generation with text-to-image models. In Proc. Yi Zhang, Peng Zhou, and Bingbing Ni. Focaldreamer:
CVPR, 2024. 3 Text-driven 3d editing via focal-fusion assembly, 2023. 3
[26] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, [41] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki
Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis,
Hao Tan. LRM: Large reconstruction model for single im- Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D:
age to 3D. In Proc. ICLR, 2024. 3 High-resolution text-to-3D content creation. arXiv.cs,
[27] Yukun Huang, Jianan Wang, Yukai Shi, Xianbiao Qi, abs/2211.10440, 2022. 3
Zheng-Jun Zha, and Lei Zhang. Dreamtime: An im- [42] Connor Lin, Niloy Mitra, Gordon Wetzstein, Leonidas J.
proved optimization strategy for text-to-3D content cre- Guibas, and Paul Guerrero. NeuForm: adaptive overfitting
ation. CoRR, abs/2306.12422, 2023. 3 for neural shape editing. In Proc. NeurIPS, 2022. 3
[28] Ka-Hei Hui, Ruihui Li, Jingyu Hu, and Chi-Wing Fu. Neu- [43] Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang.
ral template: Topology-aware reconstruction and disentan- Common diffusion noise schedules and sample steps are
gled generation of 3d meshes. In Proc. CVPR, 2022. 3 flawed. In Proceedings of the IEEE/CVF winter confer-
ence on applications of computer vision, pages 5404–5411,
[29] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste
2024. 1
Alayrac, Carl Doersch, Catalin Ionescu, David Ding,
[44] Anran Liu, Cheng Lin, Yuan Liu, Xiaoxiao Long, Zhiyang
Skanda Koppula, Daniel Zoran, Andrew Brock, Evan
Dou, Hao-Xiang Guo, Ping Luo, and Wenping Wang.
Shelhamer, Olivier J. Hénaff, Matthew M. Botvinick,
Part123: Part-aware 3d reconstruction from a single-view
Andrew Zisserman, Oriol Vinyals, and João Carreira.
image. arXiv, 2405.16888, 2024. 3, 6, 7
Perceiver IO: A general architecture for structured inputs
& outputs. In Proc. ICLR, 2022. 1 [45] Minghua Liu, Chao Xu, Haian Jin, Linghao Chen,
Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45:
[30] Wonbong Jang and Lourdes Agapito. CodeNeRF: Disen-
Any single image to 3D mesh in 45 seconds without per-
tangled neural radiance fields for object categories. In Proc.
shape optimization. In Proc. NeurIPS, 2023. 3
ICCV, 2021. 2
[46] Minghua Liu, Yinhao Zhu, Hong Cai, Shizhong Han, Zhan
[31] Heewoo Jun and Alex Nichol. Shap-E: Generating condi- Ling, Fatih Porikli, and Hao Su. PartSLIP: low-shot part
tional 3D implicit functions. arXiv, 2023. 2 segmentation for 3D point clouds via pretrained image-
[32] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, language models. In Proc. CVPR, 2023. 3
and George Drettakis. 3D Gaussian Splatting for real-time [47] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tok-
radiance field rendering. Proc. SIGGRAPH, 42(4), 2023. 3 makov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3:
[33] Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Zero-shot one image to 3D object. In Proc. ICCV, 2023. 3
Kanazawa, and Matthew Tancik. LERF: language embed- [48] Weiyu Liu, Jiayuan Mao, Joy Hsu, Tucker Hermans, Ani-
ded radiance fields. In Proc. ICCV, 2023. 3 mesh Garg, and Jiajun Wu. Composable part-based manip-
[34] Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken ulation. In CoRL 2023, 2023. 2
Goldberg, Matthew Tancik, and Angjoo Kanazawa. [49] Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie
Garfield: Group anything with radiance fields. arXiv.cs, Liu, Taku Komura, and Wenping Wang. SyncDreamer:
abs/2401.09419, 2024. 3 Generating multiview-consistent images from a single-view
[35] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi image. arXiv, 2309.03453, 2023. 3
Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer [50] Xiaoxiao Long, Yuanchen Guo, Cheng Lin, Yuan Liu,
Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang,
and Ross Girshick. Segment anything. In Proc. CVPR, Marc Habermann, Christian Theobalt, and Wenping Wang.
2023. 2, 3 Wonder3D: Single image to 3D using cross-domain diffu-
[36] Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitz- sion. arXiv.cs, abs/2310.15008, 2023. 3
mann. Decomposing NeRF for editing via feature field dis- [51] LumaAI. Genie text-to-3D v1.0, 2024. 2
tillation. arXiv.cs, 2022. 3 [52] Tiange Luo, Chris Rockwell, Honglak Lee, and Justin John-
[37] Juil Koo, Seungwoo Yoo, Minh Hieu Nguyen, and Min- son. Scalable 3d captioning with pretrained models. arXiv
hyuk Sung. SALAD: part-level latent diffusion for 3D preprint arXiv:2306.07279, 2023. 5
shape generation and manipulation. In Proc. ICCV, 2023. [53] Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard.
3 Grounding language with visual affordances over unstruc-
tured data. In Proceedings of the IEEE International Con- [67] Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and
ference on Robotics and Automation (ICRA), London, UK, Hanspeter Pfister. LangSplat: 3D language Gaussian splat-
2023. 2 ting. In Proc. CVPR, 2024. 3
[54] Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina, and [68] Lingteng Qiu, Guanying Chen, Xiaodong Gu, Qi Zuo,
Andrea Vedaldi. RealFusion: 360 reconstruction of any Mutian Xu, Yushuang Wu, Weihao Yuan, Zilong Dong,
object from a single image. In Proceedings of the IEEE Liefeng Bo, and Xiaoguang Han. Richdreamer: A gen-
Conference on Computer Vision and Pattern Recognition eralizable normal-depth diffusion model for detail richness
(CVPR), 2023. 3 in text-to-3D. arXiv.cs, abs/2311.16918, 2023. 3
[55] Luke Melas-Kyriazi, Christian Rupprecht, and Andrea [69] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Vedaldi. PC2: Projection-conditioned point cloud diffusion Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
for single-image 3d reconstruction. In Proceedings of the Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen
IEEE Conference on Computer Vision and Pattern Recog- Krueger, and Ilya Sutskever. Learning transferable visual
nition (CVPR), 2023. 2 models from natural language supervision. In Proc. ICML,
[56] Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Na- pages 8748–8763, 2021. 3, 7, 1
talia Neverova, Andrea Vedaldi, Oran Gafni, and Filippos [70] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang
Kokkinos. IM-3D: Iterative multiview diffusion and re- Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman
construction for high-quality 3D generation. In Proceed- Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt-
ings of the International Conference on Machine Learning ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-
(ICML), 2024. 2, 3 Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Fe-
[57] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and ichtenhofer. SAM 2: Segment anything in images and
A. Geiger. Occupancy Networks: Learning 3D reconstruc- videos. arXiv, 2408.00714, 2024. 2, 6, 7
tion in function space. In Proc. CVPR, 2019. 3 [71] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
[58] Meshy. Meshy text-to-3D v3.0, 2024. 2 Patrick Esser, and Björn Ommer. High-resolution image
synthesis with latent diffusion models. In Proc. CVPR,
[59] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik,
2022. 5
Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng.
[72] Tim Salimans and Jonathan Ho. Progressive distillation
NeRF: Representing scenes as neural radiance fields for
for fast sampling of diffusion models. arXiv preprint
view synthesis. In Proc. ECCV, 2020. 3
arXiv:2202.00512, 2022. 1
[60] Tom Monnier, Jake Austin, Angjoo Kanazawa, Alexei
[73] Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua
Efros, and Mathieu Aubry. Differentiable blocks world:
Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng,
Qualitative 3d decomposition by rendering primitives. Ad-
and Hao Su. Zero123++: a single image to consistent multi-
vances in Neural Information Processing Systems, 36:
view diffusion base model. arXiv.cs, abs/2310.15110, 2023.
5791–5807, 2023. 3
4
[61] George Kiyohiro Nakayama, Mikaela Angelina Uy, Jiahui [74] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li,
Huang, Shi-Min Hu, Ke Li, and Leonidas Guibas. Diff- and Xiao Yang. MVDream: Multi-view diffusion for 3D
Facto: controllable part-based 3D point cloud generation generation. In Proc. ICLR, 2024. 3, 4
with cross diffusion. In Proc. ICCV, 2023. 3 [75] Aleksandar Shtedritski, Christian Rupprecht, and Andrea
[62] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Vedaldi. What does clip know about a red circle? vi-
Mishkin, and Mark Chen. Point-E: A system for gener- sual prompt engineering for vlms. In Proceedings of the
ating 3D point clouds from complex prompts. arXiv.cs, IEEE/CVF International Conference on Computer Vision,
abs/2212.08751, 2022. 2 pages 11987–11997, 2023. 2
[63] Xiaqing Pan, Nicholas Charron, Yongqian Yang, Scott Pe- [76] Yawar Siddiqui, Filippos Kokkinos, Tom Monnier, Mahen-
ters, Thomas Whelan, Chen Kong, Omkar Parkhi, Richard dra Kariya, Yanir Kleiman, Emilien Garreau, Oran Gafni,
Newcombe, and Carl Yuheng Ren. Aria digital twin: A Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, and
new benchmark dataset for egocentric 3d machine percep- David Novotny. Meta 3D Asset Gen: Text-to-mesh gener-
tion, 2023. 2 ation with high-quality geometry, texture, and PBR mate-
[64] Ryan Po and Gordon Wetzstein. Compositional 3d scene rials. In Proceedings of Advances in Neural Information
generation using locally conditioned diffusion. ArXiv, Processing Systems (NeurIPS), 2024. 2, 3, 4, 5, 8
abs/2303.12218, 2023. 3 [77] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois-
[65] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Milden- ing diffusion implicit models. In Proc. ICLR, 2021. 1
hall. DreamFusion: Text-to-3D using 2D diffusion. In Proc. [78] Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen
ICLR, 2023. 3, 8 Liu, Zhenda Xie, and Yebin Liu. DreamCraft3D: Hier-
[66] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, archical 3D generation with bootstrapped diffusion prior.
Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Sko- arXiv.cs, abs/2310.16818, 2023. 3
rokhodov, Peter Wonka, Sergey Tulyakov, and Bernard [79] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and
Ghanem. Magic123: One image to high-quality 3D object Gang Zeng. DreamGaussian: Generative gaussian splat-
generation using both 2D and 3D diffusion priors. arXiv.cs, ting for efficient 3D content creation. arXiv, 2309.16653,
abs/2306.17843, 2023. 3 2023. 3
[80] Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran ation via multi-view conditions. arXiv.cs, abs/2312.03611,
Yi, Lizhuang Ma, and Dong Chen. Make-It-3D: High- 2023. 3
fidelity 3d creation from A single image with diffusion [94] Lior Yariv, Omri Puny, Natalia Neverova, Oran Gafni, and
prior. arXiv.cs, abs/2303.14184, 2023. 3 Yaron Lipman. Mosaic-SDF for 3D generative models.
[81] Shitao Tang, Jiacheng Chen, Dilin Wang, Chengzhou Tang, arXiv.cs, abs/2312.09222, 2023. 2
Fuyang Zhang, Yuchen Fan, Vikas Chandra, Yasutaka Fu- [95] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-
rukawa, and Rakesh Ranjan. MVDiffusion++: A dense adapter: Text compatible image prompt adapter for text-to-
high-resolution multi-view diffusion model for single or image diffusion models. arXiv preprint arxiv:2308.06721,
sparse-view 3d object reconstruction. arXiv, 2402.12712, 2023. 8, 1
2024. 3 [96] Taoran Yi, Jiemin Fang, Guanjun Wu, Lingxi Xie, Xi-
[82] Konstantinos Tertikas, Despoina Paschalidou, Boxiao Pan, aopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang.
Jeong Joon Park, Mikaela Angelina Uy, Ioannis Z. Emiris, GaussianDreamer: Fast generation from text to 3D gaussian
Yannis Avrithis, and Leonidas J. Guibas. PartNeRF: Gen- splatting with point cloud priors. arXiv.cs, abs/2310.08529,
erating part-aware editable 3D shapes without 3D supervi- 2023. 3
sion. arXiv.cs, abs/2303.09554, 2023. 3 [97] Haiyang Ying, Yixuan Yin, Jinzhi Zhang, Fan Wang, Tao
[83] TripoAI. Tripo3D text-to-3D, 2024. 2 Yu, Ruqi Huang, and Lu Fang. Omniseg3d: Omniversal 3d
[84] Vadim Tschernezki, Iro Laina, Diane Larlus, and Andrea segmentation via hierarchical contrastive learning. In Pro-
Vedaldi. Neural Feature Fusion Fields: 3D distillation of ceedings of the IEEE/CVF Conference on Computer Vision
self-supervised 2D image representation. In Proceedings of and Pattern Recognition, pages 20612–20622, 2024. 3
the International Conference on 3D Vision (3DV), 2022. 3 [98] Wangbo Yu, Li Yuan, Yan-Pei Cao, Xiangjun Gao, Xiaoyu
[85] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, Li, Long Quan, Ying Shan, and Yonghong Tian. HiFi-123:
and Greg Shakhnarovich. Score Jacobian chaining: Lifting Towards high-fidelity one image to 3D content generation.
pretrained 2D diffusion models for 3D generation. In Proc. arXiv.cs, abs/2310.06744, 2023. 3
CVPR, 2023. 3 [99] Guanqi Zhan, Qingnan Fan, Kaichun Mo, Lin Shao, Bao-
[86] Peng Wang and Yichun Shi. ImageDream: Image-prompt quan Chen, Leonidas J Guibas, Hao Dong, et al. Generative
multi-view diffusion for 3D generation. In Proc. ICLR, 3d part assembly via dynamic graph learning. Advances
2024. 3, 4 in Neural Information Processing Systems, 33:6315–6326,
[87] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongx- 2020. 3
uan Li, Hang Su, and Jun Zhu. ProlificDreamer: High- [100] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shecht-
fidelity and diverse text-to-3D generation with variational man, and Oliver Wang. The unreasonable effectiveness of
score distillation. arXiv.cs, abs/2305.16213, 2023. 3 deep features as a perceptual metric. In Proc. CVPR, pages
[88] Daniel Watson, William Chan, Ricardo Martin-Brualla, 586–595, 2018. 7
Jonathan Ho, Andrea Tagliasacchi, and Mohammad [101] Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, and
Norouzi. Novel view synthesis with diffusion models. In Andrew J. Davison. In-place scene labelling and under-
Proc. ICLR, 2023. 3 standing with implicit scene representation. In Proc. ICCV,
[89] Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Liang Pan 2021. 3
Jiawei Ren, Wayne Wu, Lei Yang, Jiaqi Wang, Chen [102] Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu,
Qian, Dahua Lin, and Ziwei Liu. Omniobject3d: Large- Tiejun Huang, and Xinlong Wang. Uni3D: Exploring uni-
vocabulary 3d object dataset for realistic perception, re- fied 3D representation at scale. In Proc. ICLR, 2024. 3
construction and generation. In IEEE/CVF Conference on [103] Yuchen Zhou, Jiayuan Gu, Xuanlin Li, Minghua Liu, Yun-
Computer Vision and Pattern Recognition (CVPR), 2023. 2 hao Fang, and Hao Su. PartSLIP++: enhancing low-shot
[90] Yinghao Xu, Zifan Shi, Wang Yifan, Hansheng Chen, 3d part segmentation via multi-view instance segmentation
Ceyuan Yang, Sida Peng, Yujun Shen, and Gordon Wet- and maximum likelihood estimation. arXiv, 2312.03015,
zstein. GRM: Large gaussian reconstruction model for effi- 2023. 3
cient 3D reconstruction and generation. arXiv, 2403.14621, [104] Junzhe Zhu and Peiye Zhuang. HiFA: High-fidelity
2024. 4 text-to-3D with advanced diffusion guidance. CoRR,
[91] Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Ji- abs/2305.18766, 2023. 3
ahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, [105] Yan Zizheng, Zhou Jiapeng, Meng Fanpeng, Wu Yushuang,
Zexiang Xu, and Kai Zhang. DMV3D: Denoising multi- Qiu Lingteng, Ye Zisheng, Cui Shuguang, Chen Guanying,
view diffusion using 3D large reconstruction model. In and Han Xiaoguang. Dreamdissector: Learning disentan-
Proc. ICLR, 2024. 3 gled text-to-3d generation from 2d diffusion priors. ECCV,
[92] Jiayu Yang, Ziang Cheng, Yunfei Duan, Pan Ji, and Hong- 2024. 3
dong Li. ConsistNet: Enforcing 3D consistency for multi-
view images diffusion. arXiv.cs, abs/2310.10343, 2023.
[93] Yunhan Yang, Yukun Huang, Xiaoyang Wu, Yuan-Chen
Guo, Song-Hai Zhang, Hengshuang Zhao, Tong He, and
Xihui Liu. DreamComposer: Controllable 3D object gener-
PartGen: Part-level 3D Generation and Reconstruction
with Multi-View Diffusion Models
Supplementary Material
This supplementary material contains the following parts:
• Implementation Details. Detailed descriptions of the training and inference settings for all models used in PartGen.

Figure 9. Recall curve of different methods. Our method achieves better performance compared with SAM2 and its variants.

…generator, employing a DDPM scheduler, v-prediction, and rescaled SNR. The network is trained with 64 H100 GPUs, a batch size of 512, and a learning rate of 10^-5, for 10k steps.

A.4. Multi-view completion network

The training strategy for the multi-view completion network mirrors that of the multi-view segmentation network, with the key difference lying in the input configuration. The number of input channels (in latent space) is increased to 25 by including the context image, the masked image, and the binary mask, where the mask remains a single unencoded channel. Example inputs are illustrated in Figure 5 of the main text. The network is trained with 64 H100 GPUs, a batch size of 512, and a learning rate of 10^-5, for approximately 10k steps.

A.5. Parts assembly

When compositing an object from its parts, we observed that simply combining, in the rendering process, the implicit neural fields of the parts reconstructed by the Reconstruction Model (RM), placed at their respective spatial locations, achieves satisfactory results.

To describe this formally, we first review the rendering function of LightplaneLRM [5], which we use as our reconstruction model. LightplaneLRM employs a generalized emission-absorption rendering scheme; when rendering a composition of N part fields, the feature accumulated along ray i is

v_i = Σ_{j=1}^{R−1} Σ_{h=1}^{N} (T̂_{i,j−1} − T̂_{i,j}) · w_{ij}^h · f_v^h(x_{ij}),

where w_{ij}^h = σ^h(x_{ij}) / Σ_{l=1}^{N} σ^l(x_{ij}) is the weight of the feature f_v^h(x_{ij}) at x_{ij} for part h; T̂_{i,j} = exp(−Σ_{k=0}^{j} Σ_{h=1}^{N} ∆ · σ^h(x_{ik})), where ∆ is the distance between two consecutive sampled points and σ^h(x_{ik}) is the opacity at position x_{ik} for part h; and T̂_{i,j−1} − T̂_{i,j} is the visibility of the point.
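Read as code, this is standard emission-absorption compositing with the opacities of the parts mixed at each sample. A minimal NumPy sketch for a single ray, assuming the per-part opacities and features have already been sampled, is:

```python
import numpy as np

def composite_parts_along_ray(sigmas, feats, delta):
    """Render one ray through N part fields with the compositing rule of
    Sec. A.5. sigmas: (N, R) opacities sigma^h(x_j); feats: (N, R, D)
    features f_v^h(x_j); delta: distance between consecutive samples.
    The transmittance before the first sample is taken to be 1."""
    total_sigma = sigmas.sum(axis=0)                # sum_h sigma^h(x_j)
    T = np.exp(-np.cumsum(delta * total_sigma))     # T_hat_j
    T_prev = np.concatenate([[1.0], T[:-1]])        # T_hat_{j-1}
    vis = T_prev - T                                # visibility of sample j
    w = sigmas / np.clip(total_sigma, 1e-8, None)   # per-part weights w^h_j
    blended = (w[..., None] * feats).sum(axis=0)    # sum_h w^h_j f^h(x_j), (R, D)
    return (vis[:, None] * blended).sum(axis=0)     # rendered feature v_i, (D,)
```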
A.6. 3D part editing

As shown in the main text and Figure 7, once 3D assets are generated or reconstructed as a composition of different parts through PartGen, specific parts can be edited using text instructions to achieve 3D part editing. To enable this, we fine-tune the text-to-multi-view generator using part multi-view images, masks, and text description pairs. Examples of the training data are shown in Figure 8 (top). Notably, instead of supplying the mask of the part to be edited, we provide the mask of the remaining parts. This design choice encourages the editing network to imagine the part's shape without constraining the region onto which it has to project. The training recipe is otherwise similar to that of the multi-view segmentation network.

To generate captions for the different parts, we establish an annotation pipeline similar to the one used for captioning the whole object: captions for the various views are first generated using LLAMA3 and then summarized into a single unified caption, again using LLAMA3. The key challenge in this variant is that some parts are difficult to identify without knowing the context of the object. We thus employ a technique inspired by [75]: specifically, we use a red annulus and alpha blending to emphasize the part being annotated. Example inputs and generated captions are shown in Figure 8 (bottom). The network is trained with 64 H100 GPUs, a batch size of 512, and a learning rate of 10^-5 over 10,000 steps.
Figure 10. More examples. Additional examples illustrate that PartGen can process various modalities and effectively generate or reconstruct 3D objects with distinct parts.

Figure 11. Iteratively adding parts. We show that users can iteratively add parts and combine the results of the PartGen pipeline.
The constant ϵ = 10^-4 smooths the metric when both regions are empty, in which case m(∅, ∅) = 1, and will be useful later.

Finally, we sort the regions M by decreasing score s(M) and, scanning the list from high to low, we incrementally remove duplicates down the list if they overlap by more than 1/2 with the regions selected so far. The final result is a ranked list of multi-view masks M = (M̂_1, ..., M̂_N), where N ≤ |P| and

∀ i < j : s(M̂_i) ≥ s(M̂_j) ∧ m(M̂_i, M̂_j) < 1/2.

Other algorithms like SAM2 come with their own region reliability metric s, which we use for sorting. We otherwise apply non-maxima suppression to their ranked regions in the same way as ours.
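A sketch of this ranking-and-deduplication step; the overlap metric m and the per-region scores are passed in, and the 1/2 threshold follows the text.

```python
import numpy as np

def dedup_ranked_masks(masks, scores, overlap_fn, thresh=0.5):
    """Greedy suppression over candidate multi-view masks: keep masks in
    decreasing score order and drop any mask that overlaps an already
    kept one by more than 1/2. `overlap_fn` plays the role of m(., .)."""
    order = np.argsort(scores)[::-1]
    kept = []
    for idx in order:
        if all(overlap_fn(masks[idx], masks[j]) < thresh for j in kept):
            kept.append(idx)
    return [masks[i] for i in kept]    # ranked list (M_hat_1, ..., M_hat_N)
```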
Computing mAP. The image I comes from an object L with parts (S_1, ..., S_S), from which we obtain the ground-truth part masks S = (M^1, ..., M^S) as explained in Section 3.5 of the main text. We assign ground-truth segments to candidates with the following procedure: we go through the list M = (M̂_1, ..., M̂_N), match the candidates one by one to the ground-truth segment with the highest IoU, exclude that ground-truth segment, and continue traversing the candidate list. We measure the degree of overlap between a predicted segment and a ground-truth segment as m(M̂, M) ∈ [0, 1]. Given this metric, we then report the mean Average Precision (mAP) metric at different IoU thresholds τ. Recall that, based on this definition, computing the AP curve for a sample involves matching predicted segments to ground-truth segments in ranking order, ensuring that each ground-truth segment is matched only once, and considering any unmatched ground-truth segments.

In more detail, we start by scanning the list of segments M̂_k in order k = 1, 2, .... Each time, we compare M̂_k to the ground-truth segments S and define

s* = argmax_{s=1,...,S} m(M̂_k, M_s).

If m(M̂_k, M_{s*}) ≥ τ, then we label the region M_{s*} as retrieved by setting y_k = 1 and remove M_{s*} from the list of ground-truth segments not yet recalled by setting

S ← S \ {M_{s*}}.

Otherwise, if m(M̂_k, M_{s*}) < τ or if S is empty, we set y_k = 0. We repeat this process for all k, which results in labels (y_1, ..., y_N) ∈ {0, 1}^N. We then set the average precision (AP) at τ to be

AP(M, S; τ) = (1/S) Σ_{k=1}^{N} ((Σ_{i=1}^{k} y_i) / k) · y_k.

Note that this quantity is at most 1 because, by construction, Σ_{i=1}^{N} y_i ≤ S, as we cannot match more proposals than there are ground-truth regions. mAP is defined as the average of the AP over all test samples.

Computing recall at K. For a given sample, we define the recall at K as the curve

R(K; M, S, τ) = (1/S) Σ_{s=1}^{S} χ[ max_{k=1,...,K} m(M̂_k, M_s) > τ ].
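The matching procedure and both metrics can be sketched as follows, assuming boolean multi-view masks and predictions already sorted by decreasing score; the ϵ-smoothed IoU stands in for m.

```python
import numpy as np

def overlap(a, b, eps=1e-4):
    """IoU-style metric m(A, B); equals 1 when both masks are empty."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return (inter + eps) / (union + eps)

def average_precision(pred_masks, gt_masks, tau):
    """AP at threshold tau for one sample, with greedy ranking-order matching."""
    remaining = list(range(len(gt_masks)))   # ground-truth parts not yet recalled
    y = []
    for pred in pred_masks:
        matched = False
        if remaining:
            s_star = max(remaining, key=lambda s: overlap(pred, gt_masks[s]))
            if overlap(pred, gt_masks[s_star]) >= tau:
                remaining.remove(s_star)
                matched = True
        y.append(1.0 if matched else 0.0)
    y = np.array(y)
    precision_at_k = np.cumsum(y) / (np.arange(len(y)) + 1)
    return float((precision_at_k * y).sum() / len(gt_masks))

def recall_at_k(pred_masks, gt_masks, K, tau):
    """Fraction of ground-truth parts recovered by the top-K predictions."""
    hits = sum(any(overlap(p, gt) > tau for p in pred_masks[:K]) for gt in gt_masks)
    return hits / len(gt_masks)
```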
C. Additional Examples

More application examples. We provide additional application examples in Figure 10, showcasing the versatility of our approach to varying input types. These include part-aware text-to-3D generation, where textual prompts guide the synthesis of 3D models with semantically distinct parts; part-aware image-to-3D generation, which reconstructs 3D objects from a single image while maintaining detailed part-level decomposition; and real-world 3D decomposition, where complex real-world objects are segmented into different parts. These examples demonstrate the broad applicability and robustness of PartGen in handling diverse inputs and scenarios.

Iteratively adding parts. As shown in Figure 11, we demonstrate the capability of our approach to compose a 3D object by iteratively adding individual parts to it. Starting with different inputs, users can seamlessly integrate additional parts step by step, maintaining consistency and coherence in the resulting 3D model. This process highlights the flexibility and modularity of our method, enabling fine-grained control over the composition of complex objects while preserving the semantic and structural integrity of the composition.

Figure 12. Failure cases. (a) Multi-view grid generation failure, where the generated views lack 3D consistency. (b) Segmentation failure, where semantically distinct parts are incorrectly grouped together. (c) Reconstruction model failure, where the complex geometry of the input leads to inaccuracies in the depth map.

D. Failure Cases

As outlined in the method section, PartGen incorporates several steps, including multi-view grid generation, multi-view segmentation, multi-view part completion, and 3D part reconstruction. Failures at different stages will result in specific issues. For instance, as shown in Figure 12(a), failures in grid-view generation can cause inconsistencies in the 3D reconstruction, such as misrepresentations of the orangutan's hands or the squirrel's oars. The segmentation method can sometimes group distinct parts together and is limited, in our implementation, to objects containing no more than 10 parts; beyond that, it merges different building blocks into a single part. Furthermore, highly complex input structures, such as dense grass and leaves, can lead to poor reconstruction outcomes, particularly in terms of depth quality, as illustrated in Figure 12(c).