
Domain Randomization-Enhanced Depth Simulation and Restoration for Perceiving and Grasping Specular and Transparent Objects

Qiyu Dai1,∗, Jiyao Zhang2,∗, Qiwei Li1, Tianhao Wu1, Hao Dong1,
Ziyuan Liu3, Ping Tan3,4, and He Wang1,†

1 Peking University  2 Xi'an Jiaotong University
3 Alibaba XR Lab  4 Simon Fraser University
{qiyudai,lqw,hao.dong,hewang}@pku.edu.cn, zhangjiyao@stu.xjtu.edu.cn,
thwu@stu.pku.edu.cn, ziyuan-liu@outlook.com, pingtan@sfu.ca

∗: equal contributions, †: corresponding author

arXiv:2208.03792v2 [cs.CV] 23 Nov 2022

Abstract. Commercial depth sensors usually generate noisy and missing depths, especially on specular and transparent objects, which poses critical issues to downstream depth or point cloud-based tasks. To mitigate this problem, we propose a powerful RGBD fusion network, SwinDRNet, for depth restoration. We further propose a Domain Randomization-Enhanced Depth Simulation (DREDS) approach to simulate an active stereo depth system using physically based rendering and generate a large-scale synthetic dataset that contains 130K photorealistic RGB images along with their simulated depths carrying realistic sensor noise. To evaluate depth restoration methods, we also curate a real-world dataset, namely STD, that captures 30 cluttered scenes composed of 50 objects with materials ranging from specular and transparent to diffuse. Experiments demonstrate that the proposed DREDS dataset bridges the sim-to-real domain gap such that, trained on DREDS, our SwinDRNet can seamlessly generalize to other real depth datasets, e.g. ClearGrasp, and outperform the competing methods on depth restoration at real-time speed. We further show that our depth restoration effectively boosts the performance of downstream tasks, including category-level pose estimation and grasping. Our data and code are available at https://github.com/PKU-EPIC/DREDS.

Keywords: Depth sensor simulation, specular and transparent objects, domain randomization, pose estimation, grasping

1 Introduction
With emerging depth-sensing technologies, depth sensors and 3D point cloud data are becoming increasingly accessible, enabling many applications in VR/AR and robotics. Compared with RGB images, depth images and point clouds contain the true 3D information of the underlying scene geometry, and thus depth cameras

[Figure 1 — pipeline diagram: depth sensor simulation (IR pattern capture, stereo matching) with domain randomization over material (specular, transparent, diffuse), object, illumination, background, and camera viewpoint; SwinDRNet depth restoration; downstream category-level pose estimation and grasping on the refined point cloud.]

Fig. 1. Framework overview. From left to right: we leverage domain randomization-enhanced depth simulation to generate paired data, on which we train our depth restoration network SwinDRNet; the restored depths are then fed to downstream tasks, improving category-level pose estimation and grasping for specular and transparent objects.

have been widely deployed in many robotic systems, e.g. for object grasping [4,14]
and manipulation [37,22,21] that care about accurate scene geometry. However, an apparent disadvantage of accessible depth cameras is that they carry non-negligible sensor noise, more significant than the noise in color images captured by commercial RGB cameras. An even more drastic failure case of depth sensing arises on objects that are transparent or have highly specular surfaces, where the captured depths are highly erroneous or even missing around the specular or transparent regions. It should be noted that specular and transparent objects are ubiquitous in daily life, given that most metallic surfaces are specular and many man-made objects are made of glass or plastic, which can be transparent. The prevalence of specular and transparent objects in real-world scenes thus poses severe challenges to depth-based vision systems and limits their application to well-controlled scenes and objects made of diffuse materials.
In this work, we devise a two-stream Swin Transformer [18] based RGB-D fusion network, SwinDRNet, for learning to perform depth restoration. However, there is a lack of real data composed of paired sensor depths and perfect depths for training such a network. Previous works on depth completion for transparent objects, such as ClearGrasp [30] and LIDF [43], leverage synthetic perfect depth images for network training. They simply remove the transparent area in the perfect depth, and their methods then learn to complete the missing depths in a feed-forward way or in combination with depth optimization. We argue that both methods only access incomplete depth images during training and never see a depth with realistic sensor noise, leading to suboptimal results when directly deployed on real sensor depths. Moreover, these two works only consider a small number of similar objects with little shape variation, all of them transparent, and hence fail to demonstrate their usefulness in scenes with completely novel object instances. Given that material specularity and transparency form a continuous spectrum, it is further questionable whether their methods can handle objects of intermediate transparency or specularity.
To mitigate the problems in the existing works, we propose to synthesize depths with realistic sensor noise patterns by simulating an active stereo depth camera resembling the RealSense D415. Our simulator is built on Blender and leverages ray tracing to mimic the IR stereo patterns and compute depths from them. To facilitate generalization, we further adopt domain randomization techniques that randomize the object textures, object materials (from specular and transparent to diffuse), object layouts, floor textures, illumination, and camera poses. This domain randomization-enhanced depth simulation method, or DREDS in short, yields 130K photorealistic RGB images and their corresponding simulated depths. We further curate a real-world dataset, the STD dataset, that contains 50 objects with specular, transparent, and diffuse materials. Our extensive experiments demonstrate that SwinDRNet trained on the DREDS dataset can handle depth restoration on object instances from both seen and unseen object categories in the STD dataset, and can even seamlessly generalize to the ClearGrasp dataset, beating the previous state-of-the-art method, LIDF [43], trained on ClearGrasp. Additionally, SwinDRNet allows real-time depth restoration (30 FPS). Our further experiments on category-level pose estimation and grasping of specular and transparent objects show that our depth restoration is both generalizable and effective.

2 Related Work

2.1 Depth Estimation and Restoration

The increasing popularity of RGBD sensors has encouraged much research on depth estimation and restoration. Many works [10,15,19] directly estimate depth from a monocular RGB image, but fail to restore accurate point cloud geometry because a color image provides few geometric constraints. Other studies [23,38,29] restore a dense depth map given the RGB image and sparse depth from LiDAR, but the estimated depth still suffers from low quality due to the limited geometric guidance of the sparse input.
research focuses on commercial depth sensors, trying to complete and refine the
depth values from the RGB and noisy dense depth images. Sajjan et al. [30]
proposed a two-stage method for transparent object depth restoration, which
firstly estimates surface normals, occlusion boundaries, and segmentations from
RGB images, and then calculates the refined depths via global optimization.
However, the optimization is time-consuming and relies heavily on the preceding
network predictions. Zhu et al. [43] proposed an implicit transparent object depth
completion model, including the implicit representation learning from ray-voxel
pairs and the self-iterating refinement, but voxelization of the 3D space results
in heavy geometric discontinuity of the refined point cloud. Our method falls
into this category and outperforms those methods, ensuring fast inference time
and better geometries to improve the performance of downstream tasks.

2.2 Depth Sensor Simulation


To close the sim-to-real gap, recent research focuses on generating simulated depth maps with realistic noise distributions. [17] simulated the pattern projection and capture system of Kinect to obtain simulated IR images and perform stereo matching, but could not simulate the sensor noise caused by object materials and scene environments. [26] proposed an end-to-end framework to simulate the mechanisms of various types of depth sensors. However, its rasterization-based rendering limits photorealism and physically correct simulation. [25] presented a differentiable structured-light depth sensor simulation pipeline, but cannot simulate transparent materials due to limitations of its renderer. Recently, [42] proposed a physics-grounded active stereovision depth sensor simulator for various sim-to-real applications, but focused on instance-level objects and the robot arm workspace. Our DREDS pipeline generates realistic RGBD images for various materials and scene environments, allowing the proposed model to generalize to unseen category-level object instances and novel categories.

2.3 Domain Randomization


Domain randomization bridges the sim-to-real gap through data augmentation. Tobin et al. [32] first explored transferring to real environments by generating training data through domain randomization. Subsequent works [33,40,27] generate synthetic data with sufficient variation by manually setting randomized features. Other studies [41] perform randomization using neural networks. These works have verified the effectiveness of domain randomization on tasks such as robotic manipulation [24], object detection, and pose estimation [16]. In this work, we combine the depth sensor simulation pipeline with domain randomization, which enables, for the first time, direct generalization to diverse unseen real instances for specular and transparent object depth restoration.

3 Domain Randomization-Enhanced Depth Simulation


3.1 Overview
In this work, we propose a simulated RGBD data generation pipeline, namely Domain Randomization-Enhanced Depth Simulation (DREDS), for the tasks of depth restoration, object perception, and robotic grasping. We build a depth sensor simulator that models the mechanism of an active stereo vision depth camera system based on physically based rendering, combined with domain randomization to handle real-world variations.
Leveraging domain randomization and active stereo sensor simulation, we present DREDS, a large-scale simulated RGBD dataset containing photorealistic RGB images and depth maps with real-world measurement noise

Table 1. Comparison of specular and transparent depth restoration datasets. S, T, and D refer to specular, transparent, and diffuse materials, respectively. #Objects refers to the number of objects. SN+CG means the objects are selected from ShapeNet and ClearGrasp (the exact number is not mentioned).

Dataset              | Type | #Objects | Type of Material | Size
ClearGrasp-Syn [30]  | Syn  | 9        | T                | 50K
Omniverse [43]       | Syn  | SN+CG    | T+D              | 60K
ClearGrasp-Real [30] | Real | 10       | T                | 286
TODD [39]            | Real | 6        | T                | 1.5K
DREDS                | Sim  | 1,861    | S+T+D            | 130K
STD                  | Real | 50       | S+T+D            | 27K

and error, especially for hand-scale objects with specular and transparent materials. The proposed DREDS dataset bridges the sim-to-real domain gap and helps RGBD algorithms generalize to unseen objects. A comparison of the DREDS dataset to existing specular and transparent depth restoration datasets is summarized in Table 1.

3.2 Depth Sensor Simulation


A classical active stereo depth camera system contains an infrared (IR) projector, left and right IR stereo cameras, and a color camera. To measure depth, the projector emits an IR pattern with dense dots onto the scene. The two stereo cameras then capture the left and right IR images, respectively. Finally, a stereo matching algorithm calculates per-pixel depth values from the disparity between the stereo images to produce the final depth scan. Our depth sensor simulator follows this mechanism, containing light pattern projection, capture, and stereo matching. The simulator is mainly built upon Blender [1].
Light Pattern Capture via Physically Based Rendering. For real-world specular and transparent objects, the IR light from the projector may not be received by the stereo cameras, due to reflection on the surface or refraction through transparent objects, resulting in inaccurate and missing depths. To simulate the physically correct IR pattern emission and capture process, we adopt physically based ray tracing, a technique that mimics the real light transport process and supports various surface materials, especially specular and transparent ones.
Specifically, a textured spotlight projects a binary pattern image into the virtual scene. The binocular IR images are then rendered from the stereo cameras. We simulate IR images via visible-light rendering, where both the light pattern and a reduced environment illumination contribute to the rendering. From a physical perspective, the difference between IR and visible light lies in the reflectivity and refractive index of the object. We note that the wavelength of the IR light used in depth sensors such as the RealSense D415 (850 nm) is close to that of visible light (400-800 nm), so the resulting effects are already well covered by the randomization of object reflectivity and refractive index used in DREDS, which constructs a superset of real IR images. To mimic the small portion of IR in environmental light, we reduce its intensity. Finally, all RGB values are converted to intensity, yielding the final IR image.
Stereo Matching. We perform stereo matching to obtain the disparity map, which is converted to a depth map using the intrinsic parameters and baseline of the depth sensor. In detail, we compute a matching cost volume over the left and right IR images along the epipolar line and take the disparity with the minimum matching cost. We then perform sub-pixel refinement via quadratic curve fitting to generate a more accurate disparity map. To produce a more realistic depth map, we apply post-processing, including a left/right consistency check, a uniqueness constraint, median filtering, etc. A minimal sketch of this procedure is given below.
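The following minimal sketch (ours, not the authors' released code) illustrates the principle on rectified IR images: a sum-of-absolute-differences cost volume, winner-take-all disparity selection, parabolic sub-pixel refinement, and triangulation into depth. The focal length and baseline values are placeholders, and the post-processing steps listed above are omitted.

```python
import numpy as np

def box_sum(img, k):
    # Sum over a k x k window centered at each pixel (edge-replicated), via an integral image.
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    ii = np.pad(padded.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    return ii[k:, k:] - ii[:-k, k:] - ii[k:, :-k] + ii[:-k, :-k]

def match_stereo(ir_left, ir_right, max_disp=64, win=5, focal_px=900.0, baseline_m=0.055):
    """ir_left, ir_right: rectified (H, W) intensity images; returns an (H, W) depth map."""
    H, W = ir_left.shape
    cost = np.full((max_disp, H, W), np.inf, dtype=np.float32)
    for d in range(max_disp):
        # SAD matching cost along the epipolar (row) direction for candidate disparity d.
        diff = np.abs(ir_left[:, d:] - ir_right[:, :W - d])
        cost[d, :, d:] = box_sum(diff, win)
    d0 = np.clip(cost.argmin(axis=0), 1, max_disp - 2)       # winner-take-all disparity
    rows, cols = np.indices((H, W))
    c_m, c_0, c_p = (np.minimum(cost[i, rows, cols], 1e9) for i in (d0 - 1, d0, d0 + 1))
    # Sub-pixel refinement: fit a parabola through the costs at d0-1, d0, d0+1.
    delta = 0.5 * (c_m - c_p) / np.maximum(c_m - 2.0 * c_0 + c_p, 1e-6)
    disp = d0 + np.clip(delta, -0.5, 0.5)
    # Triangulation: depth = focal_length * baseline / disparity (zero where invalid).
    return np.where(disp > 0.5, focal_px * baseline_m / disp, 0.0)
```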

3.3 Simulated Data Generation with Domain Randomization


Based on the proposed depth sensor simulator, we formulate the simulated RGBD data generation pipeline as D = Sim(S, C), where S = {O, M, L, B} denotes the scene-related simulation parameters in the virtual environment: O the setting of objects with random categories, poses, arrangements, and scales; M the setting of random object materials from specular and transparent to diffuse; L the setting of environment lighting from varying scenes with different intensities; and B the setting of the background floor with diverse materials. C denotes the camera parameters, consisting of the intrinsic and extrinsic parameters, the pattern image, the baseline distance, etc. Taking these settings as input, the simulator Sim generates realistic RGB and depth images D.
To construct scenes with sufficient variation so that the proposed method can generalize to the real world, we adopt domain randomization over all of these aspects (a configuration sketch is shown below); see the supplementary material for more details.

3.4 Simulated Dataset: DREDS


Making use of domain randomization and depth simulation, we construct the large-scale simulated dataset DREDS. In total, the DREDS dataset consists of two subsets: 1) DREDS-CatKnown: 100,200 training and 19,380 testing RGBD images of 1,801 objects spanning 7 categories from ShapeNetCore [8], with randomized specular, transparent, and diffuse materials; and 2) DREDS-CatNovel: 11,520 images of 60 category-novel objects, transformed from GraspNet-1Billion [11] (which provides CAD models and pose annotations) by changing the object materials to specular or transparent, to verify the ability of our method to generalize to new object categories. Examples of paired simulated RGBD images from DREDS-CatKnown and DREDS-CatNovel are shown in Figure 2.

4 STD Dataset
4.1 Real-world Dataset: STD
To further examine the proposed method in real scenes, we curate a real-world dataset composed of Specular, Transparent, and Diffuse objects, which we call the STD dataset.

[Figures 2 and 3 — image grids: RGBD examples of DREDS-CatKnown and DREDS-CatNovel; scene examples of STD-CatKnown and STD-CatNovel with RGB, depth, ground-truth depth, NOCS map, and instance mask annotations.]

Fig. 2. RGBD examples of the DREDS dataset.
Fig. 3. Scene examples and annotations of the STD dataset.

Similar to the DREDS dataset, the STD dataset contains 1) STD-CatKnown: a subset with category-level objects, for evaluating the depth restoration and category-level pose estimation tasks, and 2) STD-CatNovel: a subset with category-novel objects, for evaluating the generalization ability of the proposed SwinDRNet. Figure 3 shows scene examples and annotations of the STD dataset.

4.2 Data Collection


We collect an object set covering specular, transparent, and diffuse materials. Specifically, for the STD-CatKnown dataset, we collect 42 instances from 7 known ShapeNetCore [8] categories, plus several category-unseen objects from the YCB dataset [7] and our own collection as distractors. For the STD-CatNovel dataset, we pick 8 specular and transparent objects from unseen categories. For each object except the distractors, we utilize the photogrammetry-based reconstruction tool Object Capture API [2] to obtain a clean and accurate 3D mesh for ground-truth pose annotation, from which we can generate ground-truth depths and object masks.
We capture data from 30 different scenes (25 for STD-CatKnown, 5 for STD-CatNovel) with various backgrounds and illuminations, using a RealSense D415. In each scene, more than 4 objects with random arrangements are placed in a cluttered way, and the sensor moves around the objects along an arbitrary trajectory. In total, we capture 22,500 RGBD frames for STD-CatKnown and 4,500 for STD-CatNovel.
Overall, the proposed real-world STD dataset consists of 27K RGBD frames, 30 diverse scenes, and 50 category-level and category-novel objects, facilitating further research on generalizable object perception and grasping.

5 Method
In this section, we introduce our network for depth restoration in section 5.1 and
then introduce the methods we used for downstream tasks, i.e. category-level 6D
object pose estimation and robotic grasping, in section 5.2.

[Figure 4 — SwinDRNet architecture diagram: Phase 1, SwinT-based feature extraction from the input RGB and depth images (SwinT_color and SwinT_depth, each with patch partition and four SwinT stages); Phase 2, cross-attention transformer based RGB-D feature fusion; Phase 3, final depth prediction via confidence interpolation between the initial predicted depth and the raw input depth.]

Fig. 4. Overview of our proposed depth restoration network SwinDRNet. In phase 1, we extract multi-scale features from the RGB and depth images, respectively. In phase 2, the network fuses the features of the two modalities. Finally, we generate an initial depth map and a confidence map via two decoders, and fuse the raw depth and the initial depth using the predicted confidence map.

5.1 SwinDRNet for Depth Restoration

Overview. To restore the noisy and incomplete depth, we propose a Swin Transformer [18] based depth restoration network, namely SwinDRNet.
SwinDRNet takes as input an RGB image I_c ∈ R^{H×W×3} along with its aligned depth image I_d ∈ R^{H×W}, and outputs a refined depth Î_d ∈ R^{H×W} that restores the erroneous areas of the depth image and completes the invalid areas, where H and W are the input image dimensions.
We notice that prior works, e.g. PVN3D [12], usually leverage a heteroge-
neous architecture that extracts CNN features from RGB and extracts Point-
Net++ [28] features from depth. We, for the first time, devise a homogeneous
and mirrored architecture that only leverages SwinT to extract and hierarchi-
cally fuse the RGB and depth features.
As shown in Figure 4, the architecture of SwinDRNet is a two-stream fused encoder-decoder that can be divided into three phases. In the first phase of feature extraction, we leverage two separate SwinT backbones to extract hierarchical features {F_c^i} and {F_d^i} from the input RGB image I_c and depth I_d, respectively. In the second phase of RGB-D feature fusion, we propose a fusion module M_f that utilizes cross-attention transformers to combine the features from the two streams and generate fused hierarchical features {H^i}. Finally, in the third phase, we propose two decoder modules: the depth decoder D_depth decodes the fused features into an initial depth, and the confidence decoder D_conf outputs a confidence map of that prediction; from these outputs we compute the final restored depth by using the confidence map to select accurate depth predictions at noisy and invalid areas of the input depth while keeping the originally correct areas as much as possible.
SwinT-based Feature Extraction. To accurately restore the noisy and incomplete depth, we need to leverage visual cues from the RGB image that help depth completion, as well as geometric cues from the depth that save effort in areas with correct input depths. To extract rich features, we utilize SwinT [18] as our backbone, since it is a powerful and efficient network that produces hierarchical feature representations at different resolutions and has linear computational complexity with respect to the input image size. Given that our inputs contain two modalities, RGB and depth, we deploy two separate SwinT networks, SwinT_color and SwinT_depth, to extract features from I_c and I_d, respectively. For each of them, we basically follow the design of SwinT. Taking SwinT_color as an example: we first divide the input RGB image I_c ∈ R^{H×W×3} into non-overlapping patches, also called tokens, T_c ∈ R^{(H/4)×(W/4)×48}; we then pass T_c through the four stages of SwinT to generate the multi-scale features {F_c^i}, which are especially useful for dense depth prediction thanks to the hierarchical structure. The encoder process can be formulated as:

\{\mathcal{F}_c^i\}_{i=1,2,3,4} = \mathrm{SwinT}_{\text{color}}(\mathcal{T}_c),   (1)

\{\mathcal{F}_d^i\}_{i=1,2,3,4} = \mathrm{SwinT}_{\text{depth}}(\mathcal{T}_d),   (2)

where \mathcal{F}^i \in \mathbb{R}^{\frac{H}{4i} \times \frac{W}{4i} \times iC} and C is the output feature dimension of the linear embedding layer in the first stage of SwinT.
Cross-Attention Transformer based RGB-D Feature Fusion. Given the hierarchical features {F_c^i} and {F_d^i} from the two-stream SwinT backbone, our RGB-D fusion module M_f leverages cross-attention transformers to fuse the corresponding F_c^i and F_d^i into H^i. To attend feature F_A to feature F_B, a common cross-attention transformer T_CA first computes the query vector Q_A from F_A and the key K_B and value V_B vectors from F_B:

Q_A = \mathcal{F}_A W_q, \quad K_B = \mathcal{F}_B W_k, \quad V_B = \mathcal{F}_B W_v,   (3)

where the W's are learnable parameters, and then computes the cross-attention feature H_{F_A→F_B} from F_A to F_B:

\mathcal{H}_{\mathcal{F}_A \rightarrow \mathcal{F}_B} = T_{CA}(\mathcal{F}_A, \mathcal{F}_B) = \mathrm{softmax}\left(\frac{Q_A K_B^T}{\sqrt{d_K}}\right) V_B,   (4)

where d_K is the dimension of Q and K.


In our module M_f, we leverage bidirectional cross-attention by deploying two cross-attention transformers to obtain the cross-attention features in both directions, and then concatenate them with the original features to form the fused hierarchical features {H^i}, as shown below (a minimal PyTorch sketch follows):

\mathcal{H}^i = \mathcal{H}_{\mathcal{F}_c^i \rightarrow \mathcal{F}_d^i} \oplus \mathcal{H}_{\mathcal{F}_d^i \rightarrow \mathcal{F}_c^i} \oplus \mathcal{F}_c^i \oplus \mathcal{F}_d^i,   (5)

where \oplus represents concatenation along the channel axis.
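The following is a minimal PyTorch sketch of the bidirectional cross-attention fusion of Eqs. (3)-(5) at a single feature scale. It is a simplified stand-in (single head, no positional encoding or post-fusion projection), not the exact SwinDRNet module.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Attends feature A to feature B: softmax(Q_A K_B^T / sqrt(d)) V_B (Eqs. 3-4)."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, feat_a, feat_b):                 # (B, N, C) token features
        q, k, v = self.to_q(feat_a), self.to_k(feat_b), self.to_v(feat_b)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                                 # H_{A->B}, shape (B, N, C)

class BidirectionalFusion(nn.Module):
    """Concatenates H_{c->d}, H_{d->c}, F_c, F_d along the channel axis (Eq. 5)."""
    def __init__(self, dim):
        super().__init__()
        self.c2d = CrossAttention(dim)
        self.d2c = CrossAttention(dim)

    def forward(self, feat_color, feat_depth):
        h_c2d = self.c2d(feat_color, feat_depth)
        h_d2c = self.d2c(feat_depth, feat_color)
        return torch.cat([h_c2d, h_d2c, feat_color, feat_depth], dim=-1)  # (B, N, 4C)

# Usage with dummy single-scale features (e.g., C = 96 channels, 56 x 56 tokens):
fusion = BidirectionalFusion(dim=96)
f_c = torch.randn(2, 56 * 56, 96)
f_d = torch.randn(2, 56 * 56, 96)
fused = fusion(f_c, f_d)                                # torch.Size([2, 3136, 384])
```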
Final Depth Prediction via Confidence Interpolation. The credible areas of the input depth map (e.g., the edges of specular or transparent objects in contact with the background or diffuse objects) play a critical role in providing information about the spatial arrangement. Inspired by previous works [35,13], we make use of a confidence map to blend the raw and predicted depth maps. However, unlike [35,13], which predict confidence maps across modalities, we focus on preserving correct original values to generate more realistic depth maps with less distortion. The final depth map can be formulated as:

\hat{\mathcal{I}}_d = C \otimes \tilde{\mathcal{I}}_d + (1 - C) \otimes \mathcal{I}_d,   (6)

where \otimes represents element-wise multiplication, and \hat{\mathcal{I}}_d and \tilde{\mathcal{I}}_d denote the final restored depth and the output of the depth decoder head, respectively.
Loss Functions. For SwinDRNet training, we supervise both the final restored depth \hat{\mathcal{I}}_d and the output of the depth decoder head \tilde{\mathcal{I}}_d:

\mathcal{L} = \omega_{\tilde{\mathcal{I}}_d}\mathcal{L}_{\tilde{\mathcal{I}}_d} + \omega_{\hat{\mathcal{I}}_d}\mathcal{L}_{\hat{\mathcal{I}}_d},   (7)

where \mathcal{L}_{\hat{\mathcal{I}}_d} and \mathcal{L}_{\tilde{\mathcal{I}}_d} are the losses on \hat{\mathcal{I}}_d and \tilde{\mathcal{I}}_d, respectively, and \omega_{\hat{\mathcal{I}}_d} and \omega_{\tilde{\mathcal{I}}_d} are weighting factors. Each of the two losses is formulated as:

\mathcal{L}_i = \omega_n\mathcal{L}_n + \omega_d\mathcal{L}_d + \omega_g\mathcal{L}_g,   (8)

where \mathcal{L}_n, \mathcal{L}_d, and \mathcal{L}_g are the L1 losses between the predicted and ground-truth surface normals, depths, and depth gradient maps, respectively, and \omega_n, \omega_d, and \omega_g are the corresponding weights. We further assign a higher weight to the loss within the foreground region to push the network to concentrate more on the objects. A hedged sketch of this loss is given below.
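The PyTorch sketch below illustrates this loss together with the confidence blending of Eq. (6). The loss weights, the normal-from-depth approximation (which ignores camera intrinsics), and the foreground weighting factor are illustrative choices, not the authors' exact settings.

```python
import torch
import torch.nn.functional as F

def spatial_gradients(depth):
    # Forward differences along x and y; depth has shape (B, 1, H, W).
    dx = depth[..., :, 1:] - depth[..., :, :-1]
    dy = depth[..., 1:, :] - depth[..., :-1, :]
    return dx, dy

def normals_from_depth(depth):
    # Approximate surface normals from depth gradients (zero-padded to full size).
    dzdx = F.pad(depth[..., :, 1:] - depth[..., :, :-1], (0, 1, 0, 0))
    dzdy = F.pad(depth[..., 1:, :] - depth[..., :-1, :], (0, 0, 0, 1))
    n = torch.cat([-dzdx, -dzdy, torch.ones_like(depth)], dim=1)
    return F.normalize(n, dim=1)

def restoration_loss(pred, gt, fg_mask, w_n=1.0, w_d=1.0, w_g=1.0, fg_weight=2.0):
    # Per-pixel weights: emphasize the foreground (specular/transparent objects).
    w = 1.0 + (fg_weight - 1.0) * fg_mask
    l_d = (w * (pred - gt).abs()).mean()                       # depth L1
    pdx, pdy = spatial_gradients(pred)
    gdx, gdy = spatial_gradients(gt)
    l_g = (pdx - gdx).abs().mean() + (pdy - gdy).abs().mean()  # gradient L1
    l_n = (normals_from_depth(pred) - normals_from_depth(gt)).abs().mean()  # normal L1
    return w_n * l_n + w_d * l_d + w_g * l_g

def total_loss(initial_pred, confidence, raw_depth, gt, fg_mask, w_tilde=1.0, w_hat=1.0):
    # Blend the decoder output with the raw sensor depth via the confidence map (Eq. 6),
    # then supervise both the initial and the restored depth (Eq. 7).
    restored = confidence * initial_pred + (1.0 - confidence) * raw_depth
    return w_tilde * restoration_loss(initial_pred, gt, fg_mask) + \
           w_hat * restoration_loss(restored, gt, fg_mask)
```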

5.2 Downstream Tasks

Category-level 6D Object Pose Estimation. Inspired by [36], we use the same backbone as SwinDRNet and add two decoder heads to predict the NOCS map coordinates and the semantic segmentation mask. Following [36], we then perform pose fitting between the restored object point clouds in the world coordinate space and the predicted object point clouds in the normalized object coordinate space to obtain the 6D object pose (a minimal sketch of the fitting step is given below).
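One standard way to implement this fitting step is the Umeyama algorithm, which estimates the scale, rotation, and translation aligning the predicted NOCS coordinates with the observed object points from the restored depth. The sketch below is a generic implementation under that assumption, not the authors' exact code.

```python
import numpy as np

def umeyama(src, dst):
    """Return s, R, t such that dst ≈ s * R @ src + t. src, dst: (N, 3) arrays."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                      # cross-covariance matrix
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:    # avoid reflections
        S[2, 2] = -1
    R = U @ S @ Vt
    var_src = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t

# Usage: nocs_pts are the per-pixel predicted NOCS coordinates of one masked object,
# obj_pts are the corresponding points from the restored depth (both (N, 3) arrays);
# s, R, t = umeyama(nocs_pts, obj_pts) then gives the object pose and size.
```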
Robotic Grasping. By combining SwinDRNet with the object grasping task, we can analyze the effect of depth restoration on robotic manipulation. We adopt the end-to-end network GraspNet-baseline [11] to predict 6-DoF grasp poses directly from the scene point cloud. Given the restored depth map from SwinDRNet, the scene point cloud is computed (see the back-projection sketch below) and sent to GraspNet-baseline, which predicts grasp candidates. Finally, the parallel-jaw gripper of the robot arm executes the target rotation and position selected from those candidates.
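The back-projection from the restored depth map to the scene point cloud can be sketched as follows, assuming a pinhole camera model; the intrinsic values are placeholders rather than the actual D415 calibration.

```python
import numpy as np

def depth_to_point_cloud(depth, fx=898.0, fy=898.0, cx=640.0, cy=360.0):
    """depth: (H, W) array in meters; returns (N, 3) points in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]        # drop invalid (zero-depth) pixels
```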

6 Tasks, Benchmarks and Results


In this section, we train our SwinDRNet on the train split of the DREDS-CatKnown dataset and deploy it on tasks including category-level 6D object pose estimation and robotic grasping.

6.1 Depth Restoration


Evaluation Metrics. We follow the metrics for transparent object depth completion in [43]: 1) RMSE: the root mean squared error; 2) REL: the mean absolute relative difference; 3) MAE: the mean absolute error; 4) the percentage of pixels satisfying max(d_i/d_i^*, d_i^*/d_i) < δ, where d_i denotes the predicted depth, d_i^* is the ground truth, and δ ∈ {1.05, 1.10, 1.25} (a sketch of these metrics is given below). We resize the prediction and ground truth to 126 × 224 resolution for fair comparison, and evaluate on all object areas and on the challenging areas (specular and transparent objects), respectively.
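A small sketch of these metrics, computed over an evaluation mask, is given below (resizing to 126 × 224 is omitted).

```python
import numpy as np

def depth_metrics(pred, gt, mask=None, deltas=(1.05, 1.10, 1.25)):
    """pred, gt: (H, W) depth maps in meters; mask: boolean area to evaluate."""
    if mask is None:
        mask = gt > 0
    p, g = pred[mask], gt[mask]
    results = {
        "RMSE": float(np.sqrt(np.mean((p - g) ** 2))),
        "REL": float(np.mean(np.abs(p - g) / g)),
        "MAE": float(np.mean(np.abs(p - g))),
    }
    ratio = np.maximum(p / g, g / p)
    for d in deltas:
        results[f"delta_{d}"] = float(np.mean(ratio < d) * 100.0)  # percentage of pixels
    return results
```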
Baselines. We compare our method with several state-of-the-art methods, including LIDF [43], the SOTA method for depth completion of transparent objects, and NLSPN [23], the SOTA method for depth completion on the NYUv2 [34] dataset. All baselines are trained on the train split of DREDS-CatKnown and evaluated on four types of testing data: 1) the test split of DREDS-CatKnown: simulated images of category-known objects; 2) DREDS-CatNovel: simulated images of category-novel objects; 3) STD-CatKnown: real images of category-known objects; 4) STD-CatNovel: real images of category-novel objects.
Results. The quantitative results reported in Table 2 show that we achieve the best performance compared to the other methods on the DREDS and STD datasets, and that our model generalizes well not only to novel-category objects in simulation but also to the real world. In addition to the performance gain, ours (30 FPS) is significantly faster than LIDF (13 FPS) and a two-branch baseline that uses PointNet++ on depth (6 FPS). Although it is slightly slower than NLSPN (35 FPS), SwinDRNet achieves real-time depth restoration, and our code still has room for optimization and speedup. All methods are evaluated on an NVIDIA RTX 3090 GPU.
Sim-to-Real and Domain Transfer. We perform sim-to-real and domain transfer experiments to verify the generalization ability of the DREDS dataset. For the sim-to-real experiments, SwinDRNet is trained on DREDS-CatKnown but with different depth images as training input (one variant follows [43] and takes the cropped synthetic depth image as input; the other takes the simulated depth image). The results evaluated on STD in Table 3 reveal the potential of our depth simulation pipeline, which can significantly close the sim-to-real gap and generalize to new categories. For the domain transfer experiments, we train SwinDRNet on the train split of DREDS-CatKnown and evaluate on the ClearGrasp dataset. The results reported in Table 4 show that a model trained only on DREDS-CatKnown can easily generalize to the new domain ClearGrasp and outperform the previous best results trained directly on ClearGrasp and Omniverse [43] (LIDF is trained on Omniverse and ClearGrasp), which verifies the generalization ability of our dataset.

Table 2. Quantitative comparison to state-of-the-art methods on DREDS and STD. ↓ means lower is better, ↑ means higher is better. The left of '/' shows the results evaluated on all objects, and the right of '/' shows the results evaluated on specular and transparent objects. Note that only one result is reported on STD-CatNovel, because all of its objects are specular or transparent.

Methods RMSE↓ REL↓ MAE↓ δ1.05 ↑ δ1.10 ↑ δ1.25 ↑


DREDS-CatKnown (Sim)
NLSPN 0.010/0.011 0.009/0.011 0.006/0.007 97.48/96.41 99.51/99.12 99.97/99.74
LIDF 0.016/0.015 0.018/0.017 0.011/0.011 93.60/94.45 98.71/98.79 99.92/99.90
Ours 0.010/0.010 0.008/0.009 0.005/0.006 98.04/97.76 99.62/99.57 99.98/99.97
DREDS-CatNovel (Sim)
NLSPN 0.026/0.031 0.039/0.054 0.015/0.021 78.90/69.16 89.02/83.55 97.86/96.84
LIDF 0.082/0.082 0.183/0.184 0.069/0.069 23.70/23.69 42.77/42.88 75.44/75.54
Ours 0.022/0.025 0.034/0.044 0.013/0.017 81.90/75.27 92.18/89.15 98.39/97.81
STD-CatKnown (Real)
NLSPN 0.114/0.047 0.027/0.031 0.015/0.018 94.83/89.47 98.37/97.48 99.38/99.32
LIDF 0.019/0.022 0.019/0.023 0.013/0.015 93.08/90.32 98.39/97.38 99.83/99.62
Ours 0.015/0.018 0.013/0.016 0.008/0.011 96.66/94.97 99.03/98.79 99.92/99.85
STD-CatNovel (Real)
NLSPN 0.087 0.050 0.025 81.95 90.36 96.06
LIDF 0.041 0.060 0.031 53.69 79.80 99.63
Ours 0.025 0.033 0.017 81.55 93.10 99.84

Table 3. Quantitative results for sim-to-real transfer. Synthetic means taking the cropped synthetic depth images for training, and Simulated means taking the simulated depth images from the train split of DREDS-CatKnown for training.

Trainset RMSE↓ REL↓ MAE↓ δ1.05 ↑ δ1.10 ↑ δ1.25 ↑


STD-CatKnown (Real)
Synthetic 0.0467/0.056 0.0586/0.070 0.0377/0.047 49.12/39.42 86.50/79.85 98.98/97.66
Simulated 0.015/0.018 0.013/0.016 0.008/0.011 96.66/94.97 99.03/98.79 99.92/99.85
STD-CatNovel (Real)
Synthetic 0.065 0.101 0.053 21.04 55.87 96.96
Simulated 0.025 0.033 0.017 81.55 93.10 99.84

Table 4. Quantitative results for domain transfer. The previous best results
means that the best previous method is trained on ClearGrasp and Omniverse, and
evaluated on ClearGrasp. Domain transfer means that SwinDRNet is trained on
DREDS-CatKnown and evaluated on ClearGrasp.

Model RMSE↓ REL↓ MAE↓ δ1.05 ↑ δ1.10 ↑ δ1.25 ↑


ClearGrasp real-known
The previous best results 0.028 0.033 0.020 82.37 92.98 98.63
Domain transfer 0.022 0.017 0.012 91.46 97.47 99.86
ClearGrasp real-novel
The previous best results 0.025 0.036 0.020 79.5 94.01 99.35
Domain transfer 0.016 0.008 0.005 96.73 98.83 99.78

6.2 Category-level Pose Estimation

Evaluation Metrics. We use two types of metrics: 1) 3D IoU, which computes the intersection over union of the ground-truth and predicted 3D bounding boxes; we report thresholds of 25% (IoU25), 50% (IoU50), and 75% (IoU75). 2) Rotation and translation errors, which compute the errors between the ground-truth and predicted poses; we report 5°2cm, 5°5cm, 10°2cm, 10°5cm, and 10°10cm (a sketch of the error computation is given below).
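The rotation/translation error underlying the n° m cm metrics can be sketched as follows; handling of symmetric categories is omitted.

```python
import numpy as np

def pose_errors(R_pred, t_pred, R_gt, t_gt):
    """R_*: (3, 3) rotation matrices; t_*: (3,) translations in meters."""
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    rot_err_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    trans_err_cm = np.linalg.norm(t_pred - t_gt) * 100.0
    return rot_err_deg, trans_err_cm

def pose_accuracy(errors, deg_thresh, cm_thresh):
    """Fraction of instances with rotation error < deg_thresh and translation error < cm_thresh."""
    return float(np.mean([(r < deg_thresh) and (t < cm_thresh) for r, t in errors]))
```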
Baselines. We choose two baselines to show the usefulness of the restored depth for category-level pose estimation and the effectiveness of SwinDRNet+NOCSHead: 1) NOCS [36], which takes an RGB image as input to predict the per-pixel normalized coordinate map and obtains the pose with the help of the depth map; 2) SGPA [9], the state-of-the-art method, which leverages one object and its corresponding category prior, dynamically adapting the prior to the observed object; the adapted prior is then used to reconstruct the 3D canonical model of the specific object for pose fitting.
Results. To verify the usefulness of the restored depth, we report the results of the three methods using raw or restored (SwinDRNet output) depth in Table 5. "-only" means using raw depth throughout the experiment; "Refined depth+" means using restored depth for pose fitting in NOCS and SwinDRNet+NOCSHead. Because SGPA deforms the prior point cloud to obtain its results, which makes it sensitive to depth quality, we use the restored depth for both its training and inference. We observe that restored depth improves the performance of all three methods by large margins under all metrics on both datasets. These performance gains suggest that depth restoration is truly useful for category-level pose estimation. Moreover, SwinDRNet+NOCSHead outperforms NOCS and SGPA under all metrics.

Table 5. Quantitative results for category-level pose estimation. only means using raw depth in the whole experiment; Refined means using restored depth for training and inference in SGPA, and for pose fitting in NOCS and our method.

Methods IoU25 IoU50 IoU75 5◦ 2cm 5◦ 5cm 10◦ 2cm 10◦ 5cm 10◦ 10cm
DREDS-CatKnown (Sim)
NOCS-only 85.4 61.1 18.3 22.8 27.2 43.4 51.8 52.9
SGPA-only 77.3 63.7 30.0 30.1 33.1 49.9 55.9 56.7
Refined depth + NOCS 85.4 65.9 27.6 32.1 33.5 57.3 60.9 60.9
Refined depth + SGPA 82.1 73.4 45.4 46.5 47.4 67.5 69.4 69.5
Ours-only 94.3 78.8 36.7 34.6 37.8 55.9 62.9 63.5
Refined depth + Ours 95.3 85.0 49.9 49.3 50.3 70.1 72.8 72.8
STD-CatKnown (Real)
NOCS-only 89.1 63.7 17.2 23.0 28.9 42.1 57.4 58.2
SGPA-only 75.2 63.1 30.5 31.9 34.3 50.3 56.0 56.5
Refined depth + NOCS 88.8 71.1 28.7 29.8 31.2 57.4 60.6 60.7
Refined depth + SGPA 77.2 71.6 49.0 51.1 51.5 72.8 73.7 73.7
Ours-only 91.5 81.3 39.3 38.2 42.9 58.3 71.2 71.5
Refined depth + Ours 91.5 85.7 55.7 53.3 54.1 77.6 79.7 79.7

6.3 Robotic Grasping


Experiment Setting. We conduct real robot experiments to evaluate the effect of depth restoration on robotic grasping tasks. In our physical setup, we use a 7-DoF Panda robot arm from Franka Emika with a parallel-jaw gripper; a RealSense D415 depth sensor is mounted on a tripod in front of the arm. We run 6 rounds of table-clearing experiments. For each round, 4 to 5 specular and transparent objects are randomly picked from the STD objects to construct a cluttered scene. For each trial, the robot arm executes the grasp pose with the highest score and removes the grasped object; a round ends when the workspace is cleared or 10 attempts are reached.
Evaluation Metrics. Real grasping performance is measured with the following metrics: 1) Success Rate: the ratio of the number of successfully grasped objects to the number of attempts; 2) Completion Rate: the ratio of the number of successfully removed objects to the original number of objects in a scene.
Baselines. We use the 6-DoF grasp pose prediction network GraspNet-baseline [11] with its released pretrained model. GraspNet means GraspNet-baseline directly takes the captured raw depth as input, while SwinDRNet+GraspNet means the network receives the refined point cloud from a SwinDRNet trained only on the DREDS-CatKnown dataset.

Table 6. Results of real robot experiments. #Objects denotes the sum of grasped
object numbers in all rounds. #Attempts denotes the sum of robotic grasping attempt
numbers in all rounds.

Methods #Objects #Attempts Success Rate Completion Rate


GraspNet 19 49 38.78% 40%
SwinDRNet+GraspNet 25 26 96.15% 100%

Results. Table 6 reports the performance of the real robot experiments. SwinDRNet+GraspNet obtains a high success rate and completion rate, while GraspNet alone performs much worse. Without depth restoration, it is difficult for the robot arm to grasp specular and transparent objects due to the severely incomplete and inaccurate raw depth. The proposed SwinDRNet significantly improves the performance of specular and transparent object grasping.

7 Conclusions
In this work, we propose a powerful RGBD fusion network, SwinDRNet, for depth restoration. Our proposed DREDS pipeline synthesizes a large-scale RGBD dataset with realistic sensor noise, closing the sim-to-real gap for specular and transparent objects. Furthermore, we collect a real dataset, STD, for real-world performance evaluation. Evaluations on depth restoration, category-level pose estimation, and object grasping tasks demonstrate the effectiveness of our method.

References

1. Blender. https://www.blender.org/
2. Object capture api on macos. https://developer.apple.com/augmented-
reality/object-capture/
3. Bartell, F.O., Dereniak, E.L., Wolfe, W.L.: The theory and measurement of bidi-
rectional reflectance distribution function (brdf) and bidirectional transmittance
distribution function (btdf). In: Radiation scattering in optical systems. vol. 257,
pp. 154–160. SPIE (1981)
4. Breyer, M., Chung, J.J., Ott, L., Siegwart, R., Nieto, J.: Volumetric grasping network: Real-time 6 DOF grasp detection in clutter. In: Conference on Robot Learning (2020)
5. Burley, B.: Extending the disney brdf to a bsdf with integrated subsurface scatter-
ing. Physically Based Shading in Theory and Practice’SIGGRAPH Course (2015)
6. Burley, B., Studios, W.D.A.: Physically-based shading at Disney. In: ACM SIGGRAPH. vol. 2012, pp. 1–7 (2012)
7. Calli, B., Singh, A., Bruce, J., Walsman, A., Konolige, K., Srinivasa, S., Abbeel,
P., Dollar, A.M.: Yale-cmu-berkeley dataset for robotic manipulation research. The
International Journal of Robotics Research 36(3), 261–268 (2017)
8. Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z.,
Savarese, S., Savva, M., Song, S., Su, H., et al.: Shapenet: An information-rich
3d model repository. arXiv preprint arXiv:1512.03012 (2015)
9. Chen, K., Dou, Q.: Sgpa: Structure-guided prior adaptation for category-level 6d
object pose estimation. In: Proceedings of the IEEE/CVF International Conference
on Computer Vision. pp. 2773–2782 (2021)
10. Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using
a multi-scale deep network. Advances in neural information processing systems 27
(2014)
11. Fang, H.S., Wang, C., Gou, M., Lu, C.: Graspnet-1billion: A large-scale bench-
mark for general object grasping. In: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition. pp. 11444–11453 (2020)
12. He, Y., Sun, W., Huang, H., Liu, J., Fan, H., Sun, J.: Pvn3d: A deep point-
wise 3d keypoints voting network for 6dof pose estimation. In: Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition. pp. 11632–
11641 (2020)
13. Hu, M., Wang, S., Li, B., Ning, S., Fan, L., Gong, X.: Penet: Towards precise and
efficient image guided depth completion. In: 2021 IEEE International Conference
on Robotics and Automation (ICRA). pp. 13656–13662. IEEE (2021)
14. Jiang, Z., Zhu, Y., Svetlik, M., Fang, K., Zhu, Y.: Synergies between affordance and
geometry: 6-dof grasp detection via implicit representations. In: Robotics: Science
and Systems XVII, Virtual Event, July 12-16, 2021 (2021)
15. Jiao, J., Cao, Y., Song, Y., Lau, R.: Look deeper into depth: Monocular depth
estimation with semantic booster and attention-driven loss. In: Proceedings of the
European conference on computer vision (ECCV). pp. 53–69 (2018)
16. Khirodkar, R., Yoo, D., Kitani, K.: Domain randomization for scene-specific car
detection and pose estimation. In: 2019 IEEE Winter Conference on Applications
of Computer Vision (WACV). pp. 1932–1940. IEEE (2019)
17. Landau, M.J., Choo, B.Y., Beling, P.A.: Simulating kinect infrared and depth
images. IEEE transactions on cybernetics 46(12), 3018–3031 (2015)
18. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin
transformer: Hierarchical vision transformer using shifted windows. In: Proceedings
of the IEEE/CVF International Conference on Computer Vision. pp. 10012–10022
(2021)
19. Long, X., Lin, C., Liu, L., Li, W., Theobalt, C., Yang, R., Wang, W.: Adaptive
surface normal constraint for depth estimation. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision. pp. 12849–12858 (2021)
20. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International
Conference on Learning Representations (2018)
21. Mo, K., Guibas, L.J., Mukadam, M., Gupta, A., Tulsiani, S.: Where2act: From
pixels to actions for articulated 3d objects. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision. pp. 6813–6823 (2021)
22. Mu, T., Ling, Z., Xiang, F., Yang, D., Li, X., Tao, S., Huang, Z., Jia, Z.,
Su, H.: ManiSkill: Generalizable Manipulation Skill Benchmark with Large-Scale
Demonstrations. In: Annual Conference on Neural Information Processing Systems
(NeurIPS) (2021)
23. Park, J., Joo, K., Hu, Z., Liu, C.K., So Kweon, I.: Non-local spatial propagation
network for depth completion. In: European Conference on Computer Vision. pp.
120–136. Springer (2020)
24. Peng, X.B., Andrychowicz, M., Zaremba, W., Abbeel, P.: Sim-to-real transfer of
robotic control with dynamics randomization. In: 2018 IEEE international confer-
ence on robotics and automation (ICRA). pp. 3803–3810. IEEE (2018)
25. Planche, B., Singh, R.V.: Physics-based differentiable depth sensor simulation. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision. pp.
14387–14397 (2021)
26. Planche, B., Wu, Z., Ma, K., Sun, S., Kluckner, S., Lehmann, O., Chen, T., Hutter,
A., Zakharov, S., Kosch, H., et al.: Depthsynth: Real-time realistic synthetic data
generation from cad models for 2.5 d recognition. In: 2017 International Conference
on 3D Vision (3DV). pp. 1–10. IEEE (2017)
27. Prakash, A., Boochoon, S., Brophy, M., Acuna, D., Cameracci, E., State, G.,
Shapira, O., Birchfield, S.: Structured domain randomization: Bridging the re-
ality gap by context-aware synthetic data. In: 2019 International Conference on
Robotics and Automation (ICRA). pp. 7249–7255. IEEE (2019)
28. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learn-
ing on point sets in a metric space. Advances in neural information processing
systems 30 (2017)
29. Qu, C., Liu, W., Taylor, C.J.: Bayesian deep basis fitting for depth completion
with uncertainty. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision. pp. 16147–16157 (2021)
30. Sajjan, S., Moore, M., Pan, M., Nagaraja, G., Lee, J., Zeng, A., Song, S.: Clear
grasp: 3d shape estimation of transparent objects for manipulation. In: 2020 IEEE
International Conference on Robotics and Automation (ICRA). pp. 3634–3642.
IEEE (2020)
31. Schönberger, J.L., Frahm, J.M.: Structure-from-Motion Revisited. In: Conference
on Computer Vision and Pattern Recognition (CVPR) (2016)
32. Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., Abbeel, P.: Domain
randomization for transferring deep neural networks from simulation to the real
world. In: 2017 IEEE/RSJ international conference on intelligent robots and sys-
tems (IROS). pp. 23–30. IEEE (2017)
33. Tremblay, J., Prakash, A., Acuna, D., Brophy, M., Jampani, V., Anil, C., To, T.,
Cameracci, E., Boochoon, S., Birchfield, S.: Training deep networks with synthetic
data: Bridging the reality gap by domain randomization. In: Proceedings of the
IEEE conference on computer vision and pattern recognition workshops. pp. 969–
977 (2018)
34. Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., Geiger, A.: Sparsity
invariant cnns. In: 2017 international conference on 3D Vision (3DV). pp. 11–20.
IEEE (2017)
35. Van Gansbeke, W., Neven, D., De Brabandere, B., Van Gool, L.: Sparse and noisy
lidar completion with rgb guidance and uncertainty. In: 2019 16th international
conference on machine vision applications (MVA). pp. 1–6. IEEE (2019)
36. Wang, H., Sridhar, S., Huang, J., Valentin, J., Song, S., Guibas, L.J.: Normal-
ized object coordinate space for category-level 6d object pose and size estimation.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. pp. 2642–2651 (2019)
37. Weng, Y., Wang, H., Zhou, Q., Qin, Y., Duan, Y., Fan, Q., Chen, B., Su, H.,
Guibas, L.J.: Captra: Category-level pose tracking for rigid and articulated objects
from point clouds. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision. pp. 13209–13218 (2021)
38. Xiong, X., Xiong, H., Xian, K., Zhao, C., Cao, Z., Li, X.: Sparse-to-dense depth
completion revisited: Sampling strategy and graph construction. In: European Con-
ference on Computer Vision. pp. 682–699. Springer (2020)
39. Xu, H., Wang, Y.R., Eppel, S., Aspuru-Guzik, A., Shkurti, F., Garg, A.: Seeing
glass: Joint point-cloud and depth completion for transparent objects. In: 5th An-
nual Conference on Robot Learning (2021)
40. Yue, X., Zhang, Y., Zhao, S., Sangiovanni-Vincentelli, A., Keutzer, K., Gong, B.:
Domain randomization and pyramid consistency: Simulation-to-real generalization
without accessing target domain data. In: Proceedings of the IEEE/CVF Interna-
tional Conference on Computer Vision. pp. 2100–2110 (2019)
41. Zakharov, S., Kehl, W., Ilic, S.: Deceptionnet: Network-driven domain random-
ization. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision. pp. 532–541 (2019)
42. Zhang, X., Chen, R., Xiang, F., Qin, Y., Gu, J., Ling, Z., Liu, M., Zeng, P., Han,
S., Huang, Z., et al.: Close the visual domain gap by physics-grounded active
stereovision depth sensor simulation. arXiv preprint arXiv:2201.11924 (2022)
43. Zhu, L., Mousavian, A., Xiang, Y., Mazhar, H., van Eenbergen, J., Debnath, S.,
Fox, D.: Rgb-d local implicit function for depth completion of transparent objects.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. pp. 4649–4658 (2021)

Supplementary Material for
Domain Randomization-Enhanced Depth Simulation and Restoration for Perceiving and Grasping Specular and Transparent Objects
In the supplementary material, we present the additional sections for this
paper, including domain randomization details, network implementation details,
additional experiments and results, and additional dataset details.

1 Domain Randomization Details


In this work, we propose the Domain Randomization-Enhanced Depth Simu-
lation (DREDS) approach, leveraging domain randomization and depth sensor
simulation to generate photorealistic RGB images and simulated depths with
realistic sensor noises. Specifically, during the simulated data generation, we
perform domain randomization in the following aspects:
Scene and Object Setting. We focus on hand-scale objects and a table-top
setting. We set the scene into the following two types: 1) Category-aware scenes
that mainly utilize ShapeNetCore [8] objects from 7 object categories – camera,
car, airplane, bowl, bottle, can, and mug. We also have some distractor objects
from categories of phone, guitar, cap, etc. In total, we leverage 1536 objects for
training and 265 objects for evaluation. In our simulated scenes, we load a ran-
dom number of objects ranging from 6 to 10 with random scales and categories
and let them fall freely under gravity onto a ground plane to create random
but physically plausible spatial arrangements of objects and prepare cluttered
scenes. 2) Category-agnostic scenes. To evaluate the generalization ability to
category-novel objects and the performance of grasping, we adopt 60 objects
from GraspNet-1Billion [11]. We follow their original poses and arrangements
but assign random types of material as described in the next paragraph.
Material Modeling and Assignment. Few of the existing depth sensor
simulators consider modeling a variety of randomized real-world materials, espe-
cially specular and transparent materials. In this work, we adopt a bidirectional
scattering distribution function (BSDF) [3], a unified representation covering the
most common materials. BSDF defines how the light is scattered on a surface
to determine the material of each point on the object.
Specifically, we use the Disney principled BSDF [5,6] f_D&S(ϕ) for diffuse and specular material modeling, where ϕ is the set of scalar parameters or nested functions, including the base color, subsurface, metallic, specular, roughness, anisotropic, etc. We use a mixed BSDF f_T(ψ) to represent transparent materials, combining a glass BSDF, a transparent BSDF, and a translucent BSDF to adjust transparency, a refraction BSDF to add refraction, and a glossy BSDF to add surface reflection, where ψ is the parameter set of each BSDF function, such as surface color, index of refraction (IOR), and roughness.
Based on the above BSDF models, we collect an asset library of materials from different categories that cover common objects in daily life, including 1) 27 specular materials such as metal, porcelain, clean plastic, and paint, 2) 4 transparent materials, and 3) 36 diffuse materials such as rubber, leather, wood, fabric, coarse plastic, paper, and clay. We randomize the parameters of the BSDF function for each material within a range, generating a large-scale material collection with wide variations.
We assign one type of material to each object in the scene randomly. For objects with default colors or texture maps, we mix their colors or textures with the base color of the assigned material in a randomized ratio. This means we can easily transfer an existing synthetic object dataset into a dataset with a large number of specular and transparent objects; a hedged sketch of such an assignment is shown below.
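The sketch below shows a random specular-material assignment using Blender's Python API (bpy), assuming the default Principled BSDF node tree; the parameter ranges and the restriction to specular materials are illustrative, not the DREDS asset settings. It is intended to run inside Blender.

```python
import random
import bpy

def assign_random_specular_material(obj):
    mat = bpy.data.materials.new(name="rand_specular")
    mat.use_nodes = True
    bsdf = mat.node_tree.nodes["Principled BSDF"]
    # Random base color (mixing with the object's original texture is omitted here).
    bsdf.inputs["Base Color"].default_value = (random.random(), random.random(),
                                               random.random(), 1.0)
    bsdf.inputs["Metallic"].default_value = random.uniform(0.6, 1.0)
    bsdf.inputs["Roughness"].default_value = random.uniform(0.0, 0.3)
    obj.data.materials.clear()
    obj.data.materials.append(mat)

# Example: randomize every mesh object in the current scene.
for o in bpy.data.objects:
    if o.type == "MESH":
        assign_random_specular_material(o)
```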
Camera Setting. We follow RealSense D415 to set up the projector’s pa-
rameters (e.g., the IR pattern image, baseline distance) and other cameras’ in-
trinsic parameters. Camera locations and poses are randomized within a range,
so that the objects in each scene can be captured from arbitrary directions.
Lighting and Background Setting. We collect 74 HDRI environment
maps for training, and 23 for testing, including indoor and outdoor scenes, as
well as natural and artificial lighting. An arbitrarily chosen environment map
with random intensities is used to simulate realistic ambient illumination. For
the background, we pick 81 common indoor materials for training and 23 for
evaluation, including wood, marble, tiles, concrete, etc. A random selection of
these materials is applied to the ground plane to increase variations of the scene.

2 Network Implementation Details


We implement the proposed SwinDRNet and the downstream algorithms in PyTorch. We train SwinDRNet for 20 epochs (nearly 146,000 iterations) with batch size 32, using the AdamW [20] optimizer with β1 = 0.9, β2 = 0.999, a learning rate of 1e-4, and a weight decay of 0.01, together with a learning rate scheduler with a linear warmup of 500 iterations followed by linear decay (a sketch of this setup is given below). SwinDRNet takes RGB and raw depth images resized to 224×224 as input and outputs a restored depth image of the same size for the downstream tasks. Note that for SGPA [9], the baseline method for category-level pose estimation, performance depends on the number of points in the input point cloud, i.e., the resolution of the depth. We therefore first resize the original RGBD images to 224×448, then sample every other pixel along the row direction to obtain two 224×224 inputs and the corresponding two 224×224 outputs from the network. We finally interleave these two outputs in the same sampling pattern to obtain a 224×448 depth as the input to SGPA.
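The optimizer and scheduler described above can be sketched as follows; `model` and the total iteration count are placeholders for the actual network and schedule length.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, total_iters=146_000, warmup_iters=500):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                                  betas=(0.9, 0.999), weight_decay=0.01)

    def lr_lambda(step):
        if step < warmup_iters:
            return step / max(1, warmup_iters)                          # linear warmup
        # Linear decay from 1 to 0 over the remaining iterations.
        return max(0.0, (total_iters - step) / max(1, total_iters - warmup_iters))

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Usage in the training loop: after each optimizer.step(), call scheduler.step().
```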

3 Additional Experiments and Results

3.1 Depth Restoration

Qualitative Comparison to State-of-the-art Methods. Figure 5 shows a qualitative comparison on the STD dataset, demonstrating that our method predicts more accurate depths in areas with missing or incorrect values while preserving the depth values of the correct areas of the raw depth map.

[Figure 5 — qualitative comparison grid with columns: Input RGB, Input point cloud, LIDF, NLSPN, Ours, Ground truth.]

Fig. 5. Qualitative comparison to state-of-the-art methods. For a fair comparison, all methods are trained on the train split of DREDS-CatKnown. Red boxes highlight the specular or transparent objects.

Cross-Sensor Evaluation. In this work, depth sensor simulation and real-world data capture are both based on the Intel RealSense D415. To investigate the robustness of the proposed SwinDRNet to other types of depth sensors, we evaluate its performance on data from two STD-CatKnown scenes captured by an Intel RealSense D435. Table 7 compares the results evaluated on D415 and D435 data after training on the DREDS-CatKnown dataset. We observe that SwinDRNet performs similarly on data from these two depth sensors in each scene, which verifies its good cross-sensor generalization ability.

Table 7. Quantitative results for cross-sensor evaluation. The performance of SwinDRNet is evaluated on RGB-D data captured by the Intel RealSense D415 and D435 in each of the two scenes.

Scene  Sensor  RMSE↓  REL↓  MAE↓  δ1.05 ↑  δ1.10 ↑  δ1.25 ↑
1  D415  0.017/0.017  0.015/0.016  0.009/0.010  94.62/94.30  98.34/98.60  99.94/99.95
1  D435  0.021/0.023  0.022/0.025  0.013/0.015  89.30/86.23  97.95/97.85  99.95/99.98
2  D415  0.013/0.018  0.011/0.014  0.008/0.011  97.93/96.02  99.47/98.94  100.00/100.00
2  D435  0.016/0.024  0.015/0.024  0.010/0.017  95.25/89.29  99.16/97.69  100.00/100.00
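For reference, the depth metrics reported throughout (RMSE, REL, MAE, and the δt threshold accuracies) follow the standard definitions used in depth restoration work. A minimal NumPy sketch is given below; we assume the metrics are computed over valid object pixels with depths in meters.

```python
import numpy as np

def depth_metrics(pred, gt, mask, thresholds=(1.05, 1.10, 1.25)):
    """Standard depth-restoration metrics over masked (valid) pixels.

    Sketch following the usual definitions; the exact masking used in the
    paper (e.g., object regions only) is an assumption here.
    """
    p, g = pred[mask], gt[mask]
    rmse = np.sqrt(np.mean((p - g) ** 2))
    rel = np.mean(np.abs(p - g) / g)          # mean absolute relative error
    mae = np.mean(np.abs(p - g))
    ratio = np.maximum(p / g, g / p)
    deltas = {t: 100.0 * np.mean(ratio < t) for t in thresholds}  # in percent
    return rmse, rel, mae, deltas

if __name__ == "__main__":
    gt = np.random.uniform(0.3, 1.5, size=(224, 224))
    pred = gt + np.random.normal(0, 0.01, size=gt.shape)
    mask = np.ones_like(gt, dtype=bool)
    print(depth_metrics(pred, gt, mask))
```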

3.2 Category-level Pose Estimation

Qualitative Comparison to Baseline Methods. Figure 6 shows the qualitative results of the different experiments on the DREDS and STD datasets. Our predictions are generally better than those of the baselines. The figure also shows that NOCS [36], SGPA [9], and our method all perform better with the help of the restored depth, especially for specular and transparent objects such as the mug, bottle, and bowl, which indicates that depth restoration does help the category-level pose estimation task.
Quantitative Comparison Using Different Restored Depth Inputs. We further evaluate the influence of different restored depths on category-level pose estimation, as presented in Table 8. The proposed SwinDRNet+NOCSHead network receives the restored depth from SwinDRNet and from the competing depth restoration methods for pose fitting. Quantitative results under all metrics demonstrate the superiority of SwinDRNet over the baseline methods in boosting the performance of category-level pose estimation.
Table 8. Quantitative results for category-level pose estimation using differ-
ent restored depths from SwinDRNet and the competing baseline methods.
The left of ’/’ shows the results evaluated on all objects, and the right of ’/’ shows the
results evaluated on specular and transparent objects.
Methods IoU25 IoU50 IoU75 5◦ 2cm 5◦ 5cm 10◦ 2cm 10◦ 5cm 10◦ 10cm
DREDS-CatKnown (Sim)
NLSPN 95.1/97.5 83.8/87.4 46.4/48.9 39.6/39.6 40.4/40.5 65.5/68.1 67.9/70.7 67.9/70.8
LIDF 94.6/97.1 80.7/85.0 36.6/40.8 33.9/37.5 36.4/40.0 58.2/64.0 64.7/70.2 64.9/70.4
Ours 95.3/97.8 85.0/88.9 49.9/52.4 49.3/51.8 50.3/53.1 70.1/74.3 72.8/77.4 72.8/77.5
STD-CatKnown (Real)
NLSPN 91.4/97.2 85.2/89.4 53.6/48.8 45.5/31.6 46.5/33.4 73.1/57.2 75.7/61.1 75.7/61.1
LIDF 91.3/96.7 83.2/85.5 42.9/35.9 34.8/35.5 37.4/40.3 65.2/61.1 71.0/69.3 71.1/69.4
Ours 91.5/97.1 85.7/89.9 55.7/54.4 53.3/40.1 54.1/41.4 77.6/66.5 79.7/68.8 79.7/68.9
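For context, pose fitting here refers to aligning the predicted NOCS coordinates of an object with its camera-space points back-projected from the (restored) depth. A common way to do this is a similarity transform estimated with the Umeyama algorithm, typically wrapped in RANSAC; the minimal sketch below omits RANSAC and is not the exact implementation used in the paper.

```python
import numpy as np

def umeyama_similarity(src, dst):
    """Estimate scale s, rotation R, translation t with dst ≈ s * R @ src + t.

    src: (N, 3) predicted NOCS coordinates; dst: (N, 3) camera-space points
    back-projected from the (restored) depth. Minimal sketch of the standard
    Umeyama alignment; a robust version would wrap this in RANSAC.
    """
    mu_src, mu_dst = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / src.shape[0]
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                      # handle reflections
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / src.shape[0]
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    nocs = rng.uniform(-0.5, 0.5, size=(100, 3))
    R_gt, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    R_gt *= np.sign(np.linalg.det(R_gt))    # ensure a proper rotation
    pts = 0.2 * nocs @ R_gt.T + np.array([0.1, -0.05, 0.6])
    s, R, t = umeyama_similarity(nocs, pts)
    print(round(s, 3), np.allclose(R, R_gt, atol=1e-6))
```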

3.3 Robotic Grasping


The setup of the real robot experiment for specular and transparent object grasping is shown in Figure 7. We carry out the table-clearing task using a Franka Emika Panda robot arm with a parallel-jaw gripper, and a RealSense D415 depth sensor to capture RGBD images.

3.4 Ablation Study


To analyze the components of the proposed SwinDRNet, as well as domain randomization and the scale of the proposed DREDS dataset, we conduct ablation studies with different configurations.
Analysis of the Modules of SwinDRNet. We first evaluate the effect of the different modules of SwinDRNet with three configurations: 1) taking the concatenated RGBD images as input, without the RGB-D fusion and confidence interpolation modules; 2) removing only the confidence interpolation module from SwinDRNet; 3) the complete SwinDRNet. As shown in Table 9, depth restoration performance improves when these two modules are used. Note that the networks with and without the confidence interpolation module obtain similar depth restoration performance. However, in Table 10, we observe that SwinDRNet with this module achieves higher performance on object pose estimation, because the module keeps the correct geometric features from the original depth input, which benefits the downstream task. These results indicate the effectiveness of the RGB-D fusion and confidence interpolation modules of SwinDRNet.
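As a rough sketch of what confidence interpolation does (our reading of the description above, not the exact implementation), the final depth can be formed as a per-pixel, confidence-weighted blend of the raw input depth and the predicted depth:

```python
import torch

def confidence_interpolation(raw_depth, pred_depth, confidence):
    """Blend raw and predicted depth with a per-pixel confidence map.

    Sketch of the idea: where the raw depth is trusted (high confidence),
    its values are kept, preserving correct sensor geometry; elsewhere the
    predicted depth fills in. Tensors are (B, 1, H, W); confidence in [0, 1].
    """
    return confidence * raw_depth + (1.0 - confidence) * pred_depth

if __name__ == "__main__":
    raw = torch.rand(1, 1, 224, 224)
    pred = torch.rand(1, 1, 224, 224)
    conf = torch.sigmoid(torch.randn(1, 1, 224, 224))  # e.g. a predicted map
    print(confidence_interpolation(raw, pred, conf).shape)
```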
Analysis of Material Randomization. We analyze the effect of material randomization on depth restoration. We create a dataset of the same size as the fully randomized DREDS-CatKnown dataset, in which the original materials from ShapeNetCore [8] are directly applied to the objects without any transfer or randomization of specular, transparent, or diffuse materials. Table 11 shows the results

Fig. 6. Qualitative results of pose estimation on the DREDS and STD datasets. The ground truths are shown in green and the estimations in red. Raw only means using raw depth throughout the experiment; Refined means using restored depth for training and inference in SGPA, and for pose fitting in NOCS and our method.

Fig. 7. The setting of the real robot experiment for specular and transparent object grasping (labeled components: RealSense D415 depth sensor, parallel-jaw gripper, Panda arm, specular and transparent objects, box).

of depth restoration, evaluated on specular and transparent objects. Without material randomization the performance drops significantly: having never seen sufficient material variation, the network cannot treat real-world data as just another variation of its synthetic training data. This demonstrates the significance of material randomization.
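For illustration, material randomization can be implemented as a per-object draw of a material class and its parameters. The sketch below uses illustrative parameter ranges; the exact classes and ranges used to build DREDS are not specified here.

```python
import random

def sample_object_material(rng=random):
    """Sample a randomized material for one object.

    Minimal sketch of per-object material randomization: each object is
    assigned a specular, transparent, or diffuse material with randomized
    parameters. The ranges below are illustrative placeholders only.
    """
    cls = rng.choice(["specular", "transparent", "diffuse"])
    material = {"class": cls,
                "base_color": [rng.random() for _ in range(3)]}
    if cls == "specular":
        material.update(metallic=rng.uniform(0.8, 1.0),
                        roughness=rng.uniform(0.0, 0.3))
    elif cls == "transparent":
        material.update(transmission=rng.uniform(0.9, 1.0),
                        ior=rng.uniform(1.3, 1.6),
                        roughness=rng.uniform(0.0, 0.1))
    else:  # diffuse
        material.update(metallic=0.0,
                        roughness=rng.uniform(0.4, 1.0))
    return material

if __name__ == "__main__":
    print(sample_object_material())
```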
Analysis of the Scale of Training Data. Table 12 shows how performance depends on the dataset scale. Compared to the full scale, the depth restoration performance of SwinDRNet trained on half of the data degrades, demonstrating that the scale of the DREDS dataset is necessary for the method.

Table 9. Ablation studies for the effect of different modules on depth restoration. ✓ denotes that the module is used, ✗ that it is not.

Fusion  Confidence  RMSE↓  REL↓  MAE↓  δ1.05 ↑  δ1.10 ↑  δ1.25 ↑
STD-CatKnown
✗  ✗  0.019/0.027  0.019/0.032  0.0123/0.021  91.09/79.20  98.92/97.73  99.95/99.91
✓  ✗  0.014/0.017  0.013/0.017  0.009/0.012  96.33/94.18  99.36/99.01  99.92/99.91
✓  ✓  0.015/0.018  0.013/0.016  0.008/0.011  96.66/94.97  99.03/98.79  99.92/99.85

Table 10. The effect of the confidence interpolation module on category-level pose estimation. ✓ denotes that the module is used, ✗ that it is not.

Confidence  IoU25  IoU50  IoU75  5◦ 2cm  5◦ 5cm  10◦ 2cm  10◦ 5cm  10◦ 10cm
STD-CatKnown (Real)
✗  91.5  85.7  56.2  51.3  52.2  76.6  78.8  78.8
✓  91.5  85.7  55.7  53.3  54.1  77.6  79.7  79.7

Table 11. Quantitative results for material randomization on depth restora-


tion task. The left of ’/’ shows the results evaluated on all objects, and the right of ’/’
evaluated on specular and transparent objects. Note that only one result is reported
on STD-CatNovel, because all the objects are specular or transparent.
Model RMSE↓ REL↓ MAE↓ δ1.05 ↑ δ1.10 ↑ δ1.25 ↑
STD-CatKnown (Real)
Fixed material 0.024/0.038 0.024/0.045 0.015/0.029 86.20/65.63 96.12/90.94 99.87/99.72
Full randomization 0.015/0.018 0.013/0.016 0.008/0.011 96.66/94.97 99.03/98.79 99.92/99.85
STD-CatNovel (Real)
Fixed material 0.038 0.051 0.027 67.52 84.86 98.51
Full randomization 0.025 0.033 0.017 81.55 93.10 99.84

4 Additional Dataset Details

4.1 DREDS Dataset


We present the DREDS-CatKnown dataset, where the category-level objects are
from ShapeNetCore [8], and the DREDS-CatNovel dataset, where we transfer
random materials to the objects of GraspNet-1Billion [11]. Figure 8 shows examples and annotations of the DREDS dataset. For each virtual scene, we provide
the RGB image, stereo IR images, simulated depth, ground truth depth, NOCS
map, surface normal, instance mask, etc.

Table 12. Ablation study for the scale of training data on depth restora-
tion. SwinDRNet is trained on DREDS-CatKnown and evaluated on the specular and
transparent objects of STD.

Scale RMSE↓ REL↓ MAE↓ δ1.05 ↑ δ1.10 ↑ δ1.25 ↑


STD-CatKnown (Real)
Half 0.021 0.020 0.014 92.71 98.54 99.83
Full 0.018 0.016 0.011 94.97 98.79 99.84
STD-CatNovel (Real)
Half 0.028 0.037 0.020 80.37 91.16 99.79
Full 0.025 0.033 0.017 81.55 93.10 99.84

Fig. 8. Paired RGB and simulated depth examples and annotations of the DREDS-CatKnown and DREDS-CatNovel datasets. For each scene we show the RGB image, left and right IR images, simulated depth, ground truth depth, NOCS map, surface normal, and instance mask.

4.2 STD Dataset


Examples of CAD Models. We obtain CAD models of 42 category-level objects and 8 category-novel objects using a 3D reconstruction algorithm. For most of the objects, especially the specular and transparent ones, we spray dye and decorate the objects with ink to enhance the reconstruction quality. All 50 CAD models are shown in Figure 9.

Fig. 9. CAD models of the STD object set. The 1st to 7th rows show 42 objects
in 7 categories, and the last row shows 8 objects in novel categories.

Data Annotation. It is quite time-consuming to annotate such a large amount of real data. We therefore annotate the 6D poses of the objects only in the first frame of each scene, and propagate the annotated poses to the subsequent frames using the camera poses relative to the first frame, which we compute with COLMAP [31]. For annotation, we develop a program with a GUI that lets the user move the CAD model and switch back and forth between the 2D image and the 3D point cloud to determine its pose, which facilitates labeling specular and transparent objects whose point clouds are severely missing or incorrect. After annotating the 6D poses, we can easily render the other annotations, such as the ground truth depth and instance masks. Figure 10 shows examples and annotations of the STD dataset.
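A minimal sketch of the pose propagation step is given below, assuming COLMAP provides world-to-camera extrinsics for every frame and the object pose is annotated in the first camera's frame (all 4x4 homogeneous matrices; illustrative, not the exact annotation code):

```python
import numpy as np

def propagate_object_pose(T_obj_in_cam1, T_world_to_cam1, T_world_to_camk):
    """Propagate an object pose annotated in camera 1 to camera k.

    All inputs are 4x4 homogeneous transforms. Assuming world-to-camera
    extrinsics from COLMAP, the object pose in camera k is
    T_camk_from_cam1 @ T_obj_in_cam1, where
    T_camk_from_cam1 = T_world_to_camk @ inv(T_world_to_cam1).
    """
    T_camk_from_cam1 = T_world_to_camk @ np.linalg.inv(T_world_to_cam1)
    return T_camk_from_cam1 @ T_obj_in_cam1

if __name__ == "__main__":
    # Toy example: camera k is camera 1 translated by 0.1 m along x.
    T_obj_in_cam1 = np.eye(4); T_obj_in_cam1[:3, 3] = [0.0, 0.0, 0.6]
    T_w2c1 = np.eye(4)
    T_w2ck = np.eye(4); T_w2ck[:3, 3] = [-0.1, 0.0, 0.0]
    print(propagate_object_pose(T_obj_in_cam1, T_w2c1, T_w2ck)[:3, 3])
```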

Fig. 10. Examples and annotations of the STD-CatKnown dataset (RGB, depth, ground truth depth, instance mask, NOCS map) and the STD-CatNovel dataset (RGB, depth, ground truth depth, instance mask). The ground truth depth maps are labeled only in the areas of the 42 objects in 7 categories and the 8 objects in novel categories. Moreover, NOCS maps are not annotated in the STD-CatNovel dataset because we do not define a normalized object coordinate space for novel categories.
