
SAMPart3D: Segment Any Part in 3D Objects

Yunhan Yang1 Yukun Huang1 Yuan-Chen Guo2 Liangjun Lu1


Xiaoyang Wu1 Edmund Y. Lam1 Yan-Pei Cao2† Xihui Liu1
1 The University of Hong Kong   2 VAST
Project Page: https://yhyang-myron.github.io/SAMPart3D-website

Figure 1. SAMPart3D is able to segment any 3D object into semantic parts across multiple levels of granularity, without the need for
predefined part label sets or text prompts. It supports a range of applications, including part-level editing and interactive segmentation.

Abstract

3D part segmentation is a crucial and challenging task in 3D perception, playing a vital role in applications such as robotics, 3D generation, and 3D editing. Recent methods harness the powerful Vision Language Models (VLMs) for 2D-to-3D knowledge distillation, achieving zero-shot 3D part segmentation. However, these methods are limited by their reliance on text prompts, which restricts the scalability to large-scale unlabeled datasets and the flexibility in handling part ambiguities. In this work, we introduce SAMPart3D, a scalable zero-shot 3D part segmentation framework that segments any 3D object into semantic parts at multiple granularities, without requiring predefined part label sets as text prompts. For scalability, we use text-agnostic vision foundation models to distill a 3D feature extraction backbone, allowing scaling to large unlabeled 3D datasets to learn rich 3D priors. For flexibility, we distill scale-conditioned part-aware 3D features for 3D part segmentation at multiple granularities. Once the segmented parts are obtained from the scale-conditioned part-aware 3D features, we use VLMs to assign semantic labels to each part based on the multi-view renderings. Compared to previous methods, our SAMPart3D can scale to the recent large-scale 3D object dataset Objaverse and handle complex, non-ordinary objects. Additionally, we contribute a new 3D part segmentation benchmark to address the lack of diversity and complexity of objects and parts in existing benchmarks. Experiments show that our SAMPart3D significantly outperforms existing zero-shot 3D part segmentation methods, and can facilitate various applications such as part-level editing and interactive segmentation.

: Corresponding author, †: Project leader.

1. Introduction

3D part segmentation is a fundamental 3D perception task that is essential for various application areas, such as robotic manipulation, 3D analysis and generation, part-level editing [38] and stylization [7].

In the past few years, data-driven fully supervised methods [23, 30, 31, 34, 53] have achieved excellent results on closed-set 3D part segmentation benchmarks [3, 27]. However, these methods are limited to segmenting simple objects due to the restricted quantity and diversity of 3D data with part annotations. Despite the recent release of large-scale 3D object datasets [9, 10, 45], acquiring part annotations for such vast amounts of 3D assets is time-consuming and labor-intensive, which prevents 3D part segmentation from replicating the success of data scaling and model scaling in 2D segmentation [21].

To achieve zero-shot 3D part segmentation in the absence of annotated 3D data, several challenges need to be addressed. The first and most significant challenge is how to generalize to open-world 3D objects without 3D part annotations. To tackle this, recent works [1, 20, 25, 47, 56] have utilized pre-trained 2D foundation vision models, such as SAM [21] and GLIP [22], to extract visual information from multi-view renderings and project it onto 3D primitives, achieving zero-shot 3D part segmentation. However, these methods rely solely on 2D appearance features without 3D geometric cues, leading to the second challenge: how to leverage 3D priors from unlabeled 3D shapes. PartDistill [41] has made a preliminary exploration by introducing a 2D-to-3D distillation framework to learn 3D point cloud feature extraction, but it cannot scale to large 3D datasets like Objaverse [9] due to the need for predefined part labels and the constrained capabilities of GLIP. Building on existing works, we further explore the third challenge: the ambiguity of 3D parts, which manifests primarily in semantics and granularity. Semantic ambiguity arises from the vague textual descriptions of parts. Existing methods rely on vision-language models (VLMs) like GLIP, which require a part label set as text prompt. Unfortunately, not all 3D parts can be clearly and precisely described in text. Granularity ambiguity considers that a 3D object can be segmented at multiple levels of granularity. For example, the human body can be divided into broader sections, such as upper and lower halves, or into finer parts like limbs, torso, and head. Previous methods rely on fixed part label sets and lack flexible control over segmentation granularity.

To tackle the three aforementioned challenges, in this work, we propose SAMPart3D, a scalable zero-shot 3D part segmentation framework that segments object parts at multiple granularities without requiring preset part labels as text prompts. We argue that previous works overly rely on predefined part label sets and GLIP, limiting their scalability to complex, unlabeled 3D datasets and their flexibility in handling semantic ambiguity of 3D parts. To address this, we abandon GLIP and instead utilize the more low-level, text-independent DINOv2 [29] model for 2D-to-3D feature distillation, eliminating the reliance on part label sets and enhancing both scalability and flexibility. Besides, to handle the ambiguity in segmentation granularity, we employ a scale-conditioned MLP [19] distilled from SAM for granularity-controllable 3D part segmentation. The distillation from DINOv2 and SAM is divided into two training stages to balance efficiency and performance. After obtaining the segmented 3D parts, we adaptively render multi-view images for each part based on its visual area, then use the powerful Multi-modal Large Language Models (MLLMs) [6, 15, 24, 42] to assign semantic descriptions for each part based on the renderings, yielding the final part segmentation results.

In summary, our contributions are as follows:
• We introduce SAMPart3D, a scalable zero-shot 3D part segmentation framework that segments object parts at multiple granularities without requiring preset part labels as text prompts.
• We propose a text-independent 2D-to-3D distillation, which enables learning 3D priors from large-scale unlabeled 3D objects and can handle part ambiguity in both semantic and granularity aspects. The distillation is two-stage, striking a balance between segmentation performance and training efficiency.
• We introduce PartObjaverse-Tiny, a 3D part segmentation dataset which provides detailed semantic and instance-level part annotations for 200 complex 3D objects.
• Extensive experiments demonstrate that SAMPart3D achieves outstanding part segmentation results on complex and diverse 3D objects compared to existing zero-shot 3D part segmentation methods. Furthermore, our method can facilitate various applications, such as interactive segmentation and part-level editing.

2. Related Work

2D Foundation Models. Recently, 2D vision foundation models have advanced significantly due to large-scale data and model size growth. Based on learning strategies, these models can be grouped into: traditional models, textually-prompted models, and visually-prompted models. Traditional models rely solely on images and use self-supervised objectives, like masked patch reconstruction in MAE [14] and self-distillation in DINO [2, 29]. Textually-prompted models use large-scale text-image pairs, as seen in CLIP [36], which aligns images and text for strong zero-shot performance. Since text prompts are less effective for fine-grained tasks like segmentation [44], visually-prompted models use visual cues like bounding boxes, points, or masks. SAM [21], for example, employs visual prompts for zero-shot segmentation on new domains. Recently, efforts [4, 17, 32, 48, 50, 51, 57] have explored using 2D foundation models for 3D content understanding. In this work, we integrate multiple 2D vision foundation models for zero-shot 3D semantic part segmentation at varying granularities.


Figure 2. Overview of the SAMPart3D pipeline. (a) We first pre-train the 3D backbone PTv3-object on the large-scale 3D dataset Objaverse, distilling visual features from FeatUp-DINOv2. (b) Next, we train lightweight MLPs to distill 2D masks into scale-conditioned grouping features. (c) Finally, we cluster the point-cloud features, highlight the corresponding 2D part area on multi-view renderings via the 2D-3D mapping, and then query semantics from MLLMs.

3D Part Segmentation. 3D part segmentation aims to divide a 3D object into semantic parts, which is a long-standing problem in 3D computer vision. Early works [23, 30, 31, 34, 53] primarily focused on exploring network architecture designs to better learn 3D representations. Qi et al. [31] propose a hierarchical neural network named PointNet++, which leverages neighborhoods at multiple scales to achieve both robustness and detail capture. Zhao et al. [53] design an expressive Point Transformer layer, which can be used to construct high-performing backbones for semantic segmentation of 3D point clouds. These methods typically employ fully supervised training [53], requiring time-consuming and labor-intensive 3D part annotations. Limited by the scale and diversity of 3D part datasets [3, 27], they struggle to achieve generalization on complex 3D objects in open-world scenarios.

Zero-shot 3D Part Segmentation. To overcome the limitations of 3D annotated data and pursue zero-shot capabilities, recent 3D part segmentation methods [1, 25, 37, 39, 41, 47, 55–57] leverage 2D priors from foundation vision models [21, 22, 29, 36]. PartSLIP [25] leverages the image-language model GLIP [22] to solve both semantic and instance segmentation for 3D object parts, where the GLIP model is expected to predict multiple bounding boxes for all part instances. The subsequent PartSLIP++ [56] integrates the pretrained 2D segmentation model SAM [21] into the PartSLIP pipeline, yielding more accurate pixel-wise part annotations than the bounding boxes used in PartSLIP. Similarly, ZeroPS [47] introduces a two-stage 3D part segmentation and classification pipeline, bridging the multi-view correspondences and the prompt mechanism of foundational models. Unlike previous methods directly transferring 2D pixel-wise or bounding-box-wise predictions to 3D segmentation, PartDistill [41] adopts a cross-modal teacher-student distillation framework to learn a 3D student backbone for extracting point-specific geometric features from unlabeled 3D shapes. In this work, we extend cross-modal distillation to a more challenging large-scale 3D dataset, Objaverse [9], without requiring text prompts for 3D parts, and achieve granularity-controllable segmentation.

3. Method

As shown in Figure 2, the proposed framework SAMPart3D consists of three stages: large-scale pre-training to learn a 3D feature extraction backbone from a vast number of unlabeled 3D objects, as described in Section 3.1; sample-specific fine-tuning to train a lightweight MLP for scale-conditioned grouping, as described in Section 3.2; and the training-free semantic querying for assigning semantic labels to each part using a multimodal large language model, as described in Section 3.3.

3.1. Large-scale Pre-training: Distilling 2D Visual Features to 3D Backbone

In this stage, we aim to learn a 3D feature extraction backbone that leverages the geometric cues of 3D objects and learns 3D priors from a large-scale collection of unlabeled 3D objects.

Training Data. Unlike the previous work PartDistill [41], which was trained on limited categories of 3D objects, we utilize the large-scale 3D object dataset Objaverse [9] as our training data. Objaverse encompasses over 800K 3D assets spanning diverse object categories, providing rich 3D priors


Figure 3. Visualization of PartObjaverse-Tiny with part-level semantic and instance segmentation labels.

for zero-shot 3D part segmentation. Additionally, considering that current 3D feature extraction network architectures are primarily developed for 3D point clouds, we randomly sample point clouds from the mesh surfaces of 3D objects as the input to our backbone.

Backbone for 3D Feature Extraction. Building upon the state-of-the-art point cloud perception backbone, Point Transformer V3 (PTv3) [8, 46], we further tailor the architecture to accommodate the characteristics of 3D objects, resulting in PTv3-object. Specifically, PTv3 is designed for scene-level point clouds, incorporating numerous down-sampling layers for a large receptive field and low computational load. However, the number of points and spatial extent required to represent an object is much smaller than that required for a scene. Therefore, we removed most of the down-sampling layers from PTv3 and instead stacked more transformer blocks to enhance detail preservation and feature abstraction. Note that our learning framework is model-agnostic and can be used with other more advanced network architectures for 3D feature extraction.

Distilling 2D Visual Features to 3D Backbone. To train the backbone for 3D feature extraction on large-scale unlabeled 3D objects from Objaverse [9], pre-trained 2D vision foundation models are needed as supervision. Previous methods, such as PartDistill [41], use VLMs as supervision and require part label sets as text prompts, making it difficult to scale to Objaverse. Therefore, we abandon VLMs and instead utilize the more low-level, text-independent DINOv2 [29] model as supervision for visual feature distillation. In particular, the visual features extracted by DINOv2 are low-resolution and lack detail, making them unsuitable for subsequent part segmentation. To address this, we employ the recently proposed feature upsampling technique, FeatUp [13], to enhance the DINOv2 features for use as point-wise supervision in 3D feature extraction.

Specifically, for each training iteration, we sample a batch of 3D objects, with each object represented by a point cloud X ∈ R^(N×3), where N denotes the number of 3D points. We input the point cloud X into the PTv3-object backbone, resulting in 3D features F_3D ∈ R^(N×C), where C is the feature dimension of 384 consistent with DINOv2 features. Then, to obtain corresponding 2D visual features for supervision, we render images from K different views for each object and extract the corresponding DINOv2 features. Utilizing the mapping relationship between point clouds and pixels, we can directly obtain the 2D features F_2D ∈ R^(N×C) of the 3D point cloud. However, considering occlusion, not all 3D points can be assigned 2D features given a single view. To address this, we use depth information to determine the occlusion status of the point cloud following [16]. For occluded 3D points, we directly assign their original 3D features from F_3D. Finally, by averaging the 2D features from all K rendered views, we obtain the final 2D features of the point cloud:

    F_2D = (1/K) Σ_{k=1}^{K} F_2D^(k),    (1)

where F_2D^(k) represents the obtained 2D features of the point cloud at the k-th view, and we simply choose a mean squared error (MSE) loss:

    L_pre = (F_3D − F_2D)^2    (2)

as the learning objective for distilling 2D visual features to the 3D backbone.
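The following sketch (our illustration in Python, not the authors' released code) makes the supervision of Eqs. (1)-(2) concrete; the tensor names feat_3d, view_feats, pix_index and visible, as well as the toy shapes, are assumptions about how the rendered data is stored.

# Minimal sketch of the 2D-to-3D feature distillation objective (Eqs. 1-2),
# assuming point-to-pixel correspondences and visibility masks are precomputed.
import torch

def distillation_loss(feat_3d, view_feats, pix_index, visible):
    """feat_3d:    (N, C) backbone features of the point cloud.
    view_feats: (K, H*W, C) upsampled DINOv2 features of K renderings.
    pix_index:  (K, N) flattened pixel index each point projects to.
    visible:    (K, N) bool, True if the point is unoccluded in that view."""
    K, N = pix_index.shape
    per_view = []
    for k in range(K):
        # Gather the 2D feature of the pixel each point maps to.
        f2d_k = view_feats[k][pix_index[k]]            # (N, C)
        # Occluded points fall back to their own 3D feature, as described above.
        f2d_k = torch.where(visible[k].unsqueeze(-1), f2d_k, feat_3d)
        per_view.append(f2d_k)
    # Eq. (1): average the per-view 2D features.
    f2d = torch.stack(per_view, dim=0).mean(dim=0)     # (N, C)
    # Eq. (2): mean squared error between 3D and averaged 2D features.
    return torch.mean((feat_3d - f2d) ** 2)

# Toy usage with random tensors standing in for real renderings.
N, C, K, H, W = 1024, 384, 8, 32, 32
feat_3d = torch.randn(N, C, requires_grad=True)
view_feats = torch.randn(K, H * W, C)
pix_index = torch.randint(0, H * W, (K, N))
visible = torch.rand(K, N) > 0.3
loss = distillation_loss(feat_3d, view_feats, pix_index, visible)
loss.backward()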

3.2. Sample-specific Fine-tuning: Distilling 2D Masks for Multi-granularity Segmentation

After pre-training the backbone via distilling the 2D visual features, we can effectively extract 3D features of any 3D object. These 3D features are used together with 2D segmentation masks from SAM [21] for zero-shot 3D part segmentation. Furthermore, considering the ambiguity in segmentation granularity, we aim to introduce a scale value to control the granularity of the segmentation. To this end, we introduce a scale-conditioned lightweight MLP that enables 3D part segmentation at various scales, inspired by GARField [19] and GraCo [54].

Long Skip Connection. Although the pre-trained backbone is able to extract rich 3D features, the low-level cues of the point cloud (critical for point-wise prediction tasks) are lost due to overly deep networks. Therefore, we introduce an MLP-based long skip connection module to capture the low-level features of the point cloud. Specifically, we first assign the normal values of the faces to each corresponding point in the point cloud to provide the shape information of the mesh. Then, these normal values, along with color and coordinates, serve as inputs for the long skip connection module, the outputs of which are added to the outputs of the 3D backbone to complement low-level features.

Scale-conditioned Grouping. We first render multi-view images of the 3D object and utilize SAM to generate 2D masks of these multi-view renderings. For each mask, we can find the relevant points and calculate the 3D scale σ with:

    σ = sqrt( (ε σ_x)^2 + (ε σ_y)^2 + (ε σ_z)^2 ),    (3)

where σ_x, σ_y, σ_z are the standard deviations of the coordinates in the x, y, z directions, respectively; ε is a scaling factor for better distinguishing the scales of different masks, which we set to 10.

Then, we sample paired pixels on the valid region of the 2D renderings for contrastive learning. Specifically, for two 3D points p_i and p_j mapping from a 2D pixel pair, we can obtain their features:

    F_i = F_i^B(σ_i) + F_i^P(σ_i),    F_j = F_j^B(σ_j) + F_j^P(σ_j),    (4)

where F^B(σ) is the feature derived from the backbone PTv3-object, and F^P(σ) represents the positional embedding derived from the positional encoding module. The final contrastive loss is:

    L_con = ∥F_i − F_j∥,               if C(i, j) = 1
            ReLU(m − ∥F_i − F_j∥),     if C(i, j) = 0    (5)

where C(i, j) is a binary function that indicates whether the pair (i, j) is from the same mask (1) or different masks (0), and m is a lower margin.

After training the scale-conditioned MLP, we can obtain the segmentation-aware features of the 3D point cloud conditioned on a scale. By applying clustering algorithms such as HDBSCAN [26] to these grouping features, we can segment the 3D point cloud into different parts. The segmentation of a 3D mesh can be easily derived from the segmentation of the 3D point cloud using a voting algorithm.
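As a concrete reading of Eqs. (3)-(5), the sketch below computes the 3D scale of a 2D mask and the pairwise pull/push loss; the margin value, the placeholder tensor names, and the toy inputs are assumptions, and the real features would come from the scale-conditioned MLP rather than random tensors.

# Minimal sketch of the scale computation (Eq. 3) and the pairwise
# contrastive objective (Eq. 5); not the released implementation.
import torch
import torch.nn.functional as F

EPSILON = 10.0   # scaling factor eps from Eq. (3)
MARGIN = 1.0     # lower margin m (assumed value, not specified in the text)

def mask_scale(points):
    """points: (M, 3) coordinates of the 3D points covered by one 2D mask."""
    std = points.std(dim=0)                       # sigma_x, sigma_y, sigma_z
    return torch.sqrt(((EPSILON * std) ** 2).sum())

def contrastive_loss(feat_i, feat_j, same_mask):
    """feat_i, feat_j: (P, C) scale-conditioned features of paired points.
    same_mask: (P,) bool, True if the pair comes from the same 2D mask."""
    dist = torch.norm(feat_i - feat_j, dim=-1)    # ||F_i - F_j||
    pull = dist                                   # pairs from the same mask
    push = F.relu(MARGIN - dist)                  # pairs from different masks
    return torch.where(same_mask, pull, push).mean()

# Toy usage: random stand-ins for the MLP outputs and mask assignments.
P, C = 256, 64
feat_i, feat_j = torch.randn(P, C), torch.randn(P, C)
same_mask = torch.rand(P) > 0.5
print(contrastive_loss(feat_i, feat_j, same_mask))
print(mask_scale(torch.randn(500, 3)))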
3.3. Semantic Querying with MLLMs

After obtaining the part segmentation results of a 3D object, we query the semantic label of each part using powerful Multimodal Large Language Models (MLLMs), as shown in Figure 2 (c). Utilizing the 3D-to-2D mapping, we can identify the corresponding 2D area of each 3D part in the multi-view renderings, which enables view-consistent highlighting of 3D parts in 2D renderings. By inputting these highlighted results into MLLMs, we can perform per-part semantic querying.

Specifically, we first select several canonical views of an object for rendering and part highlighting. This enriches the object details in the rendered images and facilitates the perception of MLLMs. Next, we choose a view with the largest rendered area for the part of interest and highlight the corresponding part area, ensuring comprehensive incorporation of this part's details. Finally, we combine these images and feed them into MLLMs to obtain the semantic labels of the part.
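A hedged sketch of this querying step is given below; query_mllm is a placeholder for whichever multimodal model is used (e.g. GPT-4o), and the prompt wording and highlight color are our assumptions rather than the paper's exact choices.

# Minimal sketch of per-part semantic querying: highlight a part's 2D region
# in a rendering and send the images to an MLLM supplied by the caller.
import numpy as np

HIGHLIGHT = np.array([255, 0, 0], dtype=np.float32)  # red overlay (assumed)

def highlight_part(render, part_mask, alpha=0.5):
    """render: (H, W, 3) uint8 image; part_mask: (H, W) bool mask of the part
    obtained from the 3D-to-2D mapping of the segmented points."""
    out = render.astype(np.float32)
    out[part_mask] = (1 - alpha) * out[part_mask] + alpha * HIGHLIGHT
    return out.astype(np.uint8)

def query_part_label(canonical_renders, best_view, part_mask, query_mllm):
    """canonical_renders: list of (H, W, 3) renderings of the whole object.
    best_view: the rendering where the part covers the largest area.
    query_mllm: callable(images, prompt) -> str, supplied by the user."""
    images = list(canonical_renders) + [highlight_part(best_view, part_mask)]
    prompt = ("The last image highlights one part of the object in red. "
              "Give a short semantic label for the highlighted part.")
    return query_mllm(images, prompt)

# Toy usage with a dummy MLLM callable that ignores the images.
H, W = 64, 64
renders = [np.zeros((H, W, 3), dtype=np.uint8) for _ in range(3)]
mask = np.zeros((H, W), dtype=bool)
mask[16:32, 16:32] = True
print(query_part_label(renders, renders[0], mask, lambda imgs, p: "head"))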

Figure 4. Visualization of multi-granularity 3D part segmentation on GSO [11], OmniObject3D [45], Vroid [5] and 3D generated meshes.

4. Experiments

4.1. PartObjaverse-Tiny

Current part segmentation datasets typically include limited object categories and incomplete part annotations. This makes existing datasets unsuitable for evaluating the 3D segmentation performance of arbitrary objects and parts. Therefore, we annotate a subset of Objaverse, named PartObjaverse-Tiny, which consists of 200 shapes with fine-grained annotations. Following GObjaverse [35], we divide these 200 objects into 8 categories: Human-Shape (29), Animals (23), Daily-Used (25), Buildings&Outdoor (25), Transportations (38), Plants (18), Food (8) and Electronics (34). Each major category includes multiple smaller object categories; for example, Transportation includes cars, motorcycles, airplanes, cannons, ships, among others. For each object, we meticulously segment and annotate it into fine-grained, semantically coherent parts. We present examples of the PartObjaverse-Tiny dataset for semantic segmentation and instance segmentation in Figure 3.

4.2. Experimental Results

Multi-granularity 3D part segmentation. To demonstrate the generalization ability of our model, we use the model pretrained on Objaverse [9] to segment objects in the GSO [11], OmniObject3D [45] and Vroid [5] datasets, as well as on 3D meshes generated from TripoAI [40] and Rodin [18]. Multi-granularity segmentation results are shown in Figure 4.

Metrics. We utilize class-agnostic mean Intersection over Union (mIoU) to evaluate part segmentation results without semantics. Following [43, 47], for each ground-truth part, we calculate the IoU with every predicted part, and assign the maximum IoU as the part IoU. Finally, we calculate the average of the part IoUs as the class-agnostic mIoU. For semantic evaluation, we follow [25, 27], utilizing category mIoU and mean Average Precision (mAP) with a 50% IoU threshold as metrics for semantic and instance segmentation, respectively. We consider each object as a separate category, calculate the IoU/AP50 of each part separately, and compute the mIoU/mAP50 of this object.
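The class-agnostic mIoU described above can be summarized with the following sketch; it reflects our reading of the metric (per ground-truth part, take the maximum IoU over predictions, then average), not the authors' evaluation script.

# Minimal sketch of class-agnostic mIoU over per-point part labels.
import numpy as np

def class_agnostic_miou(gt_labels, pred_labels):
    """gt_labels, pred_labels: (N,) integer part ids for the same N points."""
    part_ious = []
    for g in np.unique(gt_labels):
        gt_mask = gt_labels == g
        best = 0.0
        for p in np.unique(pred_labels):
            pred_mask = pred_labels == p
            inter = np.logical_and(gt_mask, pred_mask).sum()
            union = np.logical_or(gt_mask, pred_mask).sum()
            best = max(best, inter / union if union > 0 else 0.0)
        part_ious.append(best)   # maximum IoU for this ground-truth part
    return float(np.mean(part_ious))

# Toy usage: two ground-truth parts, an imperfect prediction.
gt = np.array([0, 0, 0, 1, 1, 1])
pred = np.array([0, 0, 1, 1, 1, 1])
print(class_agnostic_miou(gt, pred))   # ~0.71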
Comparison with Existing Methods. For the PartObjaverse-Tiny dataset, we evaluate our method against PointCLIP [51], PointCLIPv2 [57], SATR [1] and PartSLIP [25] for zero-shot semantic segmentation, as shown in Table 1; against SAM3D [48] and PartSLIP for zero-shot class-agnostic part segmentation in Table 2; and against PartSLIP for instance segmentation in Table 3. For methods such as PartSLIP and SATR, which utilize the GLIP detection model, the resulting segmentation often exhibits numerous blank areas. To address this, we employ the k-Nearest Neighbors (kNN) method, assigning the label of each face to that of the nearest face with predicted results. We present qualitative comparisons in Figure 5. Note that our pre-training dataset excludes the 200 3D objects in PartObjaverse-Tiny for fair comparison.

We further compare our method with PointCLIPv2, PartSLIP, ZeroPS [47] and PartDistill [41] on the PartNetE [25] dataset, as shown in Table 4. We assign the label "others" for unlabeled areas in the PartNetE dataset.

4.3. Ablation Analysis

We conduct ablation studies on SAMPart3D, and the quantitative comparisons are shown in Table 5. To save pre-training time, we utilize a high-quality subset of Objaverse with 36k objects for ablation studies. We pre-train the original PTv3 backbone on this dataset, and we also pre-train our PTv3-object on this dataset with the same number of parameters as PTv3 for fair comparison.

Necessity of Pre-training. We ablate the proposed large-scale pre-training on Objaverse, distilling the knowledge from the powerful 2D foundation model DINOv2. Without pre-training, we randomly initialize the PTv3-object backbone and retain the other components of the pipeline. Without pre-training, the model lacks rich semantic part information, which hinders its ability to effectively encode 3D objects. This limitation not only impacts the segmentation results but also leads to unstable training.

PTv3-object vs. PTv3. We modify the original PTv3 backbone to PTv3-object, enhancing the model's encoding capabilities and ensuring the effective transmission of information at each point of the point cloud. The second and fourth rows of Table 5 show the comparison of PTv3-object and PTv3.

Significance of Long Skip Connection. When training the grouping field, we freeze the backbone and only train MLPs for efficient training. At this stage, without the long skip connection, the model can only accept 3D embedding inputs rich in part semantic information from our backbone,


Figure 5. Qualitative comparison with PartSLIP [25] and SATR [1] in the semantic segmentation task on the PartObjaverse-Tiny dataset.

Method Overall Human-Shape Animals Daily-Used Buildings Transportations Plants Food Electronics
PointCLIP 5.4 3.5 4.5 6.5 5.5 3.6 8.8 12.3 5.6
PointCLIPv2 9.5 6.8 10.0 11.3 8.4 6.5 15.8 15.3 9.9
SATR 12.3 15.6 16.5 12.7 7.9 9.4 17.2 14.5 9.7
PartSLIP 24.3 39.3 41.1 19.0 13.0 17.1 31.7 17.3 18.5
Ours 34.7 44.4 51.6 33.6 20.7 26.6 42.6 35.1 31.1
Table 1. Zero-shot semantic segmentation on PartObjaverse-Tiny, reported in mIoU (%).

Method Overall Human-Shape Animals Daily-Used Buildings Transportations Plants Food Electronics
PartSLIP 35.2 45.0 50.1 34.4 22.5 26.3 44.6 33.4 32.0
SAM3D 43.6 47.2 45.0 43.1 38.6 39.4 51.1 46.8 43.8
Ours 53.7 54.4 59.0 52.1 46.2 50.3 60.7 59.8 54.5
Table 2. Zero-shot class-agnostic part segmentation on PartObjaverse-Tiny, reported in mIoU (%).

Method Overall Human-Shape Animals Daily-Used Buildings Transportations Plants Food Electronics
PartSLIP 16.3 23.0 34.1 13.1 6.7 10.4 28.9 7.2 10.2
Ours 30.2 36.9 43.7 29.0 19.0 21.4 38.5 39.4 27.7
Table 3. Zero-shot instance segmentation on PartObjaverse-Tiny, reported in mAP50 (%).

Method PointCLIPv2 PartSLIP ZeroPS PartDistill Ours
Overall 16.1 34.4 39.3 39.9 41.2
Table 4. Zero-shot semantic segmentation on PartNetE [25], reported in mIoU (%).

but lacks direct input of 3D information. This leads to difficulty in training convergence, affecting the final results.

DINOv2's Feature vs. SAM's Feature for Pre-training. We utilize DINOv2 as our teacher model for pre-training, since DINOv2 contains extensive semantic information of the part and the whole object. We also attempt to directly utilize the features of SAM for pre-training, but find that the 3D backbone cannot learn any knowledge that is helpful for part segmentation or perceiving objects. By pre-training on 3D large-scale data with more 3D spatial information inputs, our backbone's encoding capability exceeds the direct fusion of DINOv2's features. We show a qualitative comparison of our backbone's encoding feature, DINOv2's fusion feature and SAM's fusion feature in Figure 7.

4.4. Applications

Recently, several works [7, 12, 33, 38, 49, 52] attempt to generate or edit the style or material of parts for 3D objects. However, due to the lack of accurate 3D segmentation methods, most of these methods are designed to use 2D segmentation methods or traditional simple 3D segmentation methods instead. This limitation may cause these methods to struggle with complex objects or generate inconsistencies in 3D editing. Our model is capable of segmenting any 3D object at various scales, thereby serving as a robust tool for


Figure 6. The resulting 3D part segmentation can directly support various applications, including part segmentation controlled by 2D
masks, part material editing, part geometry editing, and click-based hierarchical segmentation.
Method Pre-train Data Overall Human-Shape Animals Daily-Used Buildings Transportations Plants Food Electronics
w.o. pre. - 43.4 48.5 45.7 44.9 31.7 37.2 54.5 48.1 44.8
PTv3 36k 46.7 50.9 48.7 47.8 38.5 43.0 51.5 52.0 47.0
w.o. skip 36k 48.7 51.1 51.0 49.0 40.5 44.3 59.0 53.1 49.5
Ours 36k 50.5 53.3 53.4 51.1 41.6 45.5 58.7 57.2 51.8
Ours 200k 53.7 54.4 59.0 52.1 46.2 50.3 60.7 59.8 54.5
Table 5. Ablation study on PartObjaverse-Tiny, reported in mIoU (%).

Figure 7. Visualization and qualitative comparison of the features encoded by our backbone, DINOv2, and SAM. Due to the utilization of 3D information from point clouds, our backbone can produce more accurate and fine-grained visual semantic features.

3D generation and editing. At the same time, our method can serve as a pipeline for creating 3D part data, to create more 3D part assets for training 3D perception and generation models.

Part Segmentation Controlled by 2D Masks. As illustrated in Figure 6 (a), our model seamlessly adapts to part segmentation guided by 2D segmentation masks. By utilizing 2D masks from a given viewpoint, we substitute SAM's masks with these input masks and compute a scale factor for the visible points within that view for inference.

Part Material Editing. Leveraging precise segmentation results for 3D objects, our approach enables the customization and editing of materials and styles for individual components. This greatly improves the adaptability of 3D models to different design needs, enabling detailed personalization and optimization of textures. As shown in Figure 6 (b), we can edit the material of parts for realistic outcomes.

Part Shape Editing and Part Animation. The 3D object part segmentation results produced by our model can be directly applied to part shape editing tasks. By using these segmentation results, we can modify selected components within Blender, streamlining the workflow for 3D model refinement and customization. As shown in Figure 6 (c), we can use the segmentation results of the windmill and chimney to alter the shape of object parts. Similarly, we can adjust the angle of the windmill to showcase part animation.

Click-based Hierarchical Segmentation. SAMPart3D is capable of accepting scale control to segment objects from coarse to fine. Therefore, similar to GraCo [54] in 2D, our model can accept a click in 3D space with a scale value to perform hierarchical segmentation of 3D objects. As shown in Figure 6 (d), given a click at a location, the segmented area can be adjusted by controlling the scale.
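A minimal sketch of how such click-based querying could be wired up is shown below, assuming a trained backbone plus scale-conditioned MLP exposed as a get_scaled_features callable; the HDBSCAN clustering follows Section 3.2, while the nearest-point selection of the clicked cluster is our assumption about the interaction logic.

# Minimal sketch of click-based hierarchical segmentation: cluster the
# scale-conditioned features with HDBSCAN [26] and return the cluster
# containing the clicked point.
import numpy as np
import hdbscan

def click_segment(points, get_scaled_features, click_xyz, scale):
    """points: (N, 3) point cloud; click_xyz: (3,) clicked 3D location;
    get_scaled_features: callable(points, scale) -> (N, C) features."""
    feats = get_scaled_features(points, scale)
    labels = hdbscan.HDBSCAN(min_cluster_size=30).fit_predict(feats)
    # Pick the point nearest to the click and return its cluster as the part.
    # Note: HDBSCAN marks sparse points as noise (-1); handling that is omitted.
    idx = np.argmin(np.linalg.norm(points - click_xyz, axis=1))
    return labels == labels[idx]      # boolean mask of the selected part

# Toy usage with a dummy feature function (real features come from the MLP).
rng = np.random.default_rng(0)
pts = np.concatenate([rng.normal(0.0, 0.05, (1000, 3)),
                      rng.normal(1.0, 0.05, (1000, 3))]).astype(np.float32)
dummy_feats = lambda p, s: np.concatenate([p, s * np.ones((len(p), 1), dtype=p.dtype)], axis=1)
part_mask = click_segment(pts, dummy_feats, pts[0], scale=0.5)
print(part_mask.sum(), "points selected")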

5. Conclusion and Discussions

We propose SAMPart3D, a zero-shot 3D part segmentation framework that can segment 3D objects into semantic parts at multiple granularities. Additionally, we introduce a new 3D part segmentation benchmark, PartObjaverse-Tiny, to address the shortcomings in diversity and complexity of existing annotated datasets. Experimental results demonstrate the effectiveness of SAMPart3D.

References

[1] Ahmed Abdelreheem, Ivan Skorokhodov, Maks Ovsjanikov, and Peter Wonka. Satr: Zero-shot semantic segmentation of 3d shapes. In ICCV, 2023.
[2] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
[3] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv:1512.03012, 2015.
[4] Runnan Chen, Youquan Liu, Lingdong Kong, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao, and Wenping Wang. Clip2scene: Towards label-efficient 3d scene understanding by clip. In CVPR, 2023.
[5] Shuhong Chen, Kevin Zhang, Yichun Shi, Heng Wang, Yiheng Zhu, Guoxian Song, Sizhe An, Janus Kristjansson, Xiao Yang, and Matthias Zwicker. Panic-3d: Stylized single-view 3d reconstruction from portraits of anime characters. In CVPR, 2023.
[6] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv:2312.14238, 2023.
[7] SeungJeh Chung, JooHyun Park, Hyewon Kan, and HyeongYeop Kang. 3dstyleglip: Part-tailored text-guided 3d neural stylization. arXiv:2404.02634, 2024.
[8] Pointcept Contributors. Pointcept: A codebase for point cloud perception research. https://github.com/Pointcept/Pointcept, 2023.
[9] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In CVPR, 2023.
[10] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. In NeurIPS, 2024.
[11] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. In ICRA. IEEE, 2022.
[12] Ye Fang, Zeyi Sun, Tong Wu, Jiaqi Wang, Ziwei Liu, Gordon Wetzstein, and Dahua Lin. Make-it-real: Unleashing large multimodal model's ability for painting 3d objects with realistic materials. arXiv:2404.16829, 2024.
[13] Stephanie Fu, Mark Hamilton, Laura Brandt, Axel Feldman, Zhoutong Zhang, and William T Freeman. Featup: A model-agnostic framework for features at any resolution. arXiv:2403.10516, 2024.
[14] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
[15] Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, and Bo Zhao. Efficient multimodal learning from data-centric perspective. arXiv:2402.11530, 2024.
[16] Wenbo Hu, Hengshuang Zhao, Li Jiang, Jiaya Jia, and Tien-Tsin Wong. Bidirectional projection network for cross dimension scene understanding. In CVPR, 2021.
[17] Tianyu Huang, Bowen Dong, Yunhan Yang, Xiaoshui Huang, Rynson WH Lau, Wanli Ouyang, and Wangmeng Zuo. Clip2point: Transfer clip to point cloud classification with image-depth pre-training. In ICCV, 2023.
[18] HYPERHUMAN. Rodin website. https://hyperhuman.deemos.com/rodin, 2024.
[19] Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken Goldberg, Matthew Tancik, and Angjoo Kanazawa. Garfield: Group anything with radiance fields. In CVPR, 2024.
[20] Hyunjin Kim and Minhyuk Sung. Partstad: 2d-to-3d part segmentation task adaptation. arXiv:2401.05906, 2024.
[21] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023.
[22] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In CVPR, 2022.
[23] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. In NeurIPS, 2018.
[24] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2024.
[25] Minghua Liu, Yinhao Zhu, Hong Cai, Shizhong Han, Zhan Ling, Fatih Porikli, and Hao Su. Partslip: Low-shot part segmentation for 3d point clouds via pretrained image-language models. In CVPR, 2023.
[26] Leland McInnes, John Healy, Steve Astels, et al. hdbscan: Hierarchical density based clustering. J. Open Source Softw., 2017.
[27] Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna Tripathi, Leonidas J Guibas, and Hao Su. Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In CVPR, 2019.
[28] OpenAI. Gpt-4o website. https://openai.com/index/hello-gpt-4o/, 2024.
[29] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv:2304.07193, 2023.
[30] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
[31] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 2017.
[32] Zhangyang Qi, Ye Fang, Zeyi Sun, Xiaoyang Wu, Tong Wu, Jiaqi Wang, Dahua Lin, and Hengshuang Zhao. Gpt4point: A unified framework for point-language understanding and generation. In CVPR, 2024.
[33] Zhangyang Qi, Yunhan Yang, Mengchen Zhang, Long Xing, Xiaoyang Wu, Tong Wu, Dahua Lin, Xihui Liu, Jiaqi Wang, and Hengshuang Zhao. Tailor3d: Customized 3d assets editing and generation with dual-side images. arXiv:2407.06191, 2024.
[34] Guocheng Qian, Yuchen Li, Houwen Peng, Jinjie Mai, Hasan Hammoud, Mohamed Elhoseiny, and Bernard Ghanem. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. In NeurIPS, 2022.
[35] Lingteng Qiu, Guanying Chen, Xiaodong Gu, Qi Zuo, Mutian Xu, Yushuang Wu, Weihao Yuan, Zilong Dong, Liefeng Bo, and Xiaoguang Han. Richdreamer: A generalizable normal-depth diffusion model for detail richness in text-to-3d. arXiv:2311.16918, 2023.
[36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[37] George Tang, William Zhao, Logan Ford, David Benhaim, and Paul Zhang. Segment any mesh: Zero-shot mesh part segmentation via lifting segment anything 2 to 3d. arXiv:2408.13679, 2024.
[38] Konstantinos Tertikas, Despoina Paschalidou, Boxiao Pan, Jeong Joon Park, Mikaela Angelina Uy, Ioannis Emiris, Yannis Avrithis, and Leonidas Guibas. Generating part-aware editable 3d shapes without 3d supervision. In CVPR, 2023.
[39] Anh Thai, Weiyao Wang, Hao Tang, Stefan Stojanov, Matt Feiszli, and James M Rehg. 3x2: 3d object part segmentation by 2d semantic correspondences. arXiv:2407.09648, 2024.
[40] TripoAI. Tripoai website. https://www.tripo3d.ai/, 2024.
[41] Ardian Umam, Cheng-Kun Yang, Min-Hung Chen, Jen-Hui Chuang, and Yen-Yu Lin. Partdistill: 3d shape part segmentation by vision-language model distillation. arXiv:2312.04016, 2023.
[42] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv:2311.03079, 2023.
[43] Xiaogang Wang, Xun Sun, Xinyu Cao, Kai Xu, and Bin Zhou. Learning fine-grained segmentation of 3d shapes without part labels. In CVPR, 2021.
[44] Meng Wei, Xiaoyu Yue, Wenwei Zhang, Shu Kong, Xihui Liu, and Jiangmiao Pang. Ov-parts: Towards open-vocabulary part segmentation. In NeurIPS, 2024.
[45] Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Jiawei Ren, Liang Pan, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, et al. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. In CVPR, 2023.
[46] Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point transformer v3: Simpler, faster, stronger. In CVPR, 2024.
[47] Yuheng Xue, Nenglun Chen, Jun Liu, and Wenyun Sun. Zerops: High-quality cross-modal knowledge transfer for zero-shot 3d part segmentation. arXiv:2311.14262, 2023.
[48] Yunhan Yang, Xiaoyang Wu, Tong He, Hengshuang Zhao, and Xihui Liu. Sam3d: Segment anything in 3d scenes. arXiv:2306.03908, 2023.
[49] Yunhan Yang, Yukun Huang, Xiaoyang Wu, Yuan-Chen Guo, Song-Hai Zhang, Hengshuang Zhao, Tong He, and Xihui Liu. Dreamcomposer: Controllable 3d object generation via multi-view conditions. In CVPR, 2024.
[50] Yingda Yin, Yuzheng Liu, Yang Xiao, Daniel Cohen-Or, Jingwei Huang, and Baoquan Chen. Sai3d: Segment any instance in 3d scenes. In CVPR, 2024.
[51] Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. In CVPR, 2022.
[52] Shangzhan Zhang, Sida Peng, Tao Xu, Yuanbo Yang, Tianrun Chen, Nan Xue, Yujun Shen, Hujun Bao, Ruizhen Hu, and Xiaowei Zhou. Mapa: Text-driven photorealistic material painting for 3d shapes. arXiv:2404.17569, 2024.
[53] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In ICCV, 2021.
[54] Yian Zhao, Kehan Li, Zesen Cheng, Pengchong Qiao, Xiawu Zheng, Rongrong Ji, Chang Liu, Li Yuan, and Jie Chen. Graco: Granularity-controllable interactive segmentation. In CVPR, 2024.
[55] Ziming Zhong, Yanyu Xu, Jing Li, Jiale Xu, Zhengxin Li, Chaohui Yu, and Shenghua Gao. Meshsegmenter: Zero-shot mesh semantic segmentation via texture synthesis. In ECCV. Springer, 2024.
[56] Yuchen Zhou, Jiayuan Gu, Xuanlin Li, Minghua Liu, Yunhao Fang, and Hao Su. Partslip++: Enhancing low-shot 3d part segmentation via multi-view instance segmentation and maximum likelihood estimation. arXiv:2312.03015, 2023.
[57] Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Ziyao Zeng, Zipeng Qin, Shanghang Zhang, and Peng Gao. Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning. In ICCV, 2023.
6. Supplemental Material

6.1. Implementation Details

In the large-scale pre-training stage, we train the PTv3-object backbone on 200K high-quality objects from Objaverse with a batch size of 32, taking 7 days on eight A800 GPUs. Each object is rendered from 36 views, including 6 fixed views (+x, +y, +z, -x, -y, -z) and 30 random views. For each iteration, we pick 6 fixed views and 2 random views to cover most of the area for each object. We use the pre-trained DINOv2-ViTS14 model to encode the rendered images into visual features. These features are then upsampled by the FeatUp [13] model to obtain pixel-wise features as supervision.

In the sample-specific fine-tuning stage, we initially sample 15K points on the mesh surface of the object of interest. We render 36 views for the object, consistent with the settings used during the pre-training stage. We input the 2D rendered images into SAM to obtain the 2D segmentation masks. For each iteration, we randomly pick 90 rendered images (with replacement) and sample 256 valid pixels from each image, obtaining 23,040 3D points mapped from pixels as inputs. The modules used for scale-conditioned grouping and the long skip connection employ 6-layer and 4-layer MLPs, respectively, with hidden dimensions set to 384. This stage requires 1 minute to generate masks using SAM, followed by 5 minutes of training the MLPs.

After the fine-tuning stage, we can obtain the segmentation-aware features of the 3D point cloud conditioned on a scale. We use the clustering algorithm HDBSCAN [26] for feature grouping, and utilize GPT-4o [28] for per-part semantic querying.
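The per-iteration sampling described above can be sketched as follows; the helper names and array layouts (valid_pixels_per_view, pixel_to_point) are assumptions about how the renderer's outputs are stored, not the released training loop.

# Minimal sketch of the fine-tuning sampling: 90 views with replacement,
# 256 valid pixels per view, giving 23,040 pixel-to-point samples.
import numpy as np

def sample_iteration(valid_pixels_per_view, pixel_to_point,
                     n_views=90, n_pixels=256, rng=np.random.default_rng(0)):
    """valid_pixels_per_view: list over renderings, each an array of flattened
    pixel indices covered by the object.
    pixel_to_point: list over views, each mapping pixel index -> 3D point id."""
    point_ids, view_ids, pixel_ids = [], [], []
    views = rng.integers(0, len(valid_pixels_per_view), size=n_views)
    for v in views:                                   # sampled with replacement
        pix = rng.choice(valid_pixels_per_view[v], size=n_pixels)
        point_ids.append(pixel_to_point[v][pix])
        view_ids.append(np.full(n_pixels, v))
        pixel_ids.append(pix)
    return (np.concatenate(point_ids),                # 23,040 point samples
            np.concatenate(view_ids),
            np.concatenate(pixel_ids))

# Toy usage: 36 views of a 15K-point cloud rendered at 64x64.
n_pts, hw = 15000, 64 * 64
valid = [np.sort(np.random.choice(hw, 1500, replace=False)) for _ in range(36)]
p2p = [np.random.randint(0, n_pts, hw) for _ in range(36)]
pid, vid, pix = sample_iteration(valid, p2p)
print(pid.shape)  # (23040,)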

6.2. Ablation Analysis of Segmentation Scale

Figure 4 in the paper presents a visualization of segmentation results across different scale factors. For the quantitative analysis of scale, we use five scale values [0.0, 0.5, 1.0, 1.5, 2.0] for each object, automatically selecting the result closest to the ground truth for evaluation, and compare the individual scales with our mixed-scale results. Since the dataset is manually annotated, the number of suitable scales varies across objects, leading to differences between individual-scale and mixed-scale results. The class-agnostic part segmentation results at different scales are shown in Table 6.

6.3. Limitations

The training for the grouping field stage uses masks from SAM segmented at different scales. If some of these masks are inaccurate, it can affect the final results. Also, training the grouping field for each object is still slow. A better solution might be to utilize our method as a pipeline to annotate a large amount of part data, and then train a large 3D part segmentation model.

6.4. Visualization on PartNetE Dataset

We visualize the segmentation results of our SAMPart3D on the PartNetE [27] dataset in Fig. 8. Compared to previous methods, our SAMPart3D can segment all fine-grained parts of 3D objects from PartNetE, even for parts that are not annotated in the dataset.

6.5. More Qualitative Results of Our SAMPart3D

We show more visualization results of multi-granularity point clouds and meshes produced by our SAMPart3D in Figure 9.

6.6. More PartObjaverse-Tiny Visualization

We demonstrate more visualizations of the PartObjaverse-Tiny dataset. We present more examples of semantic segmentation annotations in Figure 10, and instance segmentation annotations in Figure 11.

Method Overall Human-Shape Animals Daily-Used Buildings Transportations Plants Food Electronics
0.0 49.1 52.6 54.8 45.3 42.4 47.8 55.2 42.5 49.5
0.5 48.9 51.1 53.1 47.1 41.3 46.9 58.0 50.9 48.0
1.0 39.6 35.2 43.0 42.0 37.4 34.1 41.9 50.4 43.4
1.5 31.5 26.7 31.0 34.3 20.9 24.9 34.1 52.3 35.6
2.0 24.4 21.5 24.6 30.6 22.5 15.4 30.3 48.4 24.9
mixed-scale 53.7 54.4 59.0 52.1 46.2 50.3 60.7 59.8 54.5

Table 6. Zero-shot class-agnostic part segmentation on PartObjaverse-Tiny across different scale values, reported in mIoU (%).

Figure 8. Visualization of segmentation results on PartNetE dataset.

Figure 9. Visualization of multi-granularity segmentation of point clouds and meshes.


Figure 10. Visualization of PartObjaverse-Tiny with part-level annotations and semantic labels for semantic segmentation.


Figure 11. Visualization of PartObjaverse-Tiny with part-level annotations and semantic labels for instance segmentation.
