Monovit: Self-Supervised Monocular Depth Estimation With A Vision Transformer

mage HR-Depth CADepth
MonoViT: Self-Supervised Monocular Depth Estimation with

a Vision Transformer
Chaoqiang Zhao1,2,∗ Youmin Zhang2,∗ Matteo Poggi2 Fabio Tosi2 Xianda Guo3
3 3 1‡
Zheng Zhu Guan Huang Yang Tang Stefano Mattoccia2
1
East China University of Science and Technology 2 University of Bologna 3 PhiGent Robotics
arXiv:2208.03543v1 [cs.CV] 6 Aug 2022
Image
Mono2 Image HR-Depth
DiffNet MonoViT
CADepth
Figure 1. Effects of global reasoning on self-supervised monocular depth estimation. The limited receptive field of
existing solutions (e.g. HR-Depth [30], in the middle) often yields inaccurate depth estimation, losing fine-grained details
Mono2 (like the car and cyclist over imposed in yellow). On the contrary, our MonoViT architecture
DiffNet (right) achieves superior results.
Ours
Abstract spite the steady progress of active depth-sensing technolo-

gies brought by devices such as LiDARs, Time-of-Flight
Self-supervised monocular depth estimation is an attrac- (ToF) cameras and more, the possibility of estimating depth
tive solution that does not require hard-to-source depth la- from standard images is generally preferable, mainly be-
bels for training. Convolutional neural networks (CNNs) cause of three (among many) advantages: higher image res-
have recently achieved great success in this task. However, olution, lower hardware costs and potentially unconstrained
their limited receptive field constrains existing network ar- working range. Although using two or more images [36] is
chitectures to reason only locally, dampening the effective- often the preferred choice, estimating depth from a single
ness of the self-supervised paradigm. In the light of the image allows for the deployment of depth-sensing solutions
recent successes achieved by Vision Transformers (ViTs), on any monocular configuration, still representing the most
we propose MonoViT, a brand-new framework combining diffused setting in most practical cases nowadays.
the global reasoning enabled by ViT models with the flex-
ibility of self-supervised monocular depth estimation. By Deep learning paradigms favoured the blooming of this
combining plain convolutions with Transformer blocks, our latter approach [12, 24, 13, 3], at the cost of requiring ex-
model can reason locally and globally, yielding depth pre- tensive collections of images annotated with depth labels
diction at a higher level of detail and accuracy, allowing in order to carry out the training effectively. However,
MonoViT to achieve state-of-the-art performance on the es- considering the high cost of collecting dense depth labels
tablished KITTI dataset. Moreover, MonoViT proves its for this purpose, self-supervised monocular depth estima-
superior generalization capacities on other datasets such tion [16, 71] has emerged in the literature enabling signifi-
as Make3D and DrivingStereo. Source code available at cant progress in recent years [66]. These approaches replace
https://github.com/zxcqlf/MonoViT supervised losses on depth labels with supervisory signals
derived from image reprojection across different views, by
exploiting the geometric relationship between frames, i.e.
1. Introduction the scene depth itself and camera pose. Since these net-
Depth perception is at the foundation of several high- works aim at learning depth, two prominent cases exist to
level computer vision applications such as autonomous deal with the relative pose across frames. They consist of
driving, robotics, and augmented reality [45]. However, de- either knowing it as a prior – for instance, by collecting
stereo images and training on them [16] – or estimating it
∗ Joint first authorship; ‡ Corresponding author. during training, allowing in this second case to train on un-
1
constrained monocular videos [71]. The latter configuration popular KITTI dataset, using the standard split by Eigen
turns out to be the preferred choice for practical deployment et al. [12]. The comparison to SoTA solutions highlights
since it simply requires a single moving camera for gather- the constantly superior accuracy of our framework. More-
ing training data. For this reason, we stick to monocular over, we also analyze the generalization capability of self-
videos for training purposes. However, view reconstruction supervised monocular depth estimation networks across dif-
based losses suffer from occlusions, dynamic objects and ferent datasets. Purposely, we compare MonoViT with
photometric changes, which severely limit the performance its main competitors on the Make3D [42] and Driving-
of the network [66]. Therefore, novel constraints [17] and Stereo [58] datasets, highlighting, even in this case, the su-
additional cues [22, 64] (like semantic segmentation, opti- perior generalization capacity of MonoViT.
cal flow and surface normals) are often used to reduce the
shortcomings mentioned above. 2. Related works
Improving the network backbone of depth networks is This section reviews the literature concerning self-
another well-known effective way to gain accuracy. Re- supervised monocular depth estimation and ViT architec-
cent research has shown that the encoder is crucial for tures, being both relevant to our work.
achieving this [71, 17]. Different kinds of backbone, such Monocular Depth Estimation. Estimating depth from a
as VGGNet, ResNet, HRNet and PackNet, made their single image is a challenging, inherently ill-posed problem.
way into the self-supervised monocular depth estimation Nonetheless, the many learning-based approaches aimed at
task [71, 17, 69, 18]. Moreover, to improve the feature ex- addressing it [66] enabled significant progress in the field.
traction and processing ability, new frameworks like HR- As fully supervised techniques [24, 27, 13, 3] for depth esti-
Depth [30] and CADepth [57] also introduced attention mation advanced rapidly, the availability of precise depth la-
modules. However, we argue that a shared shortcoming of bels in the real world became a major issue. Hence, more re-
existing self-supervised models falls in the reduced recep- cent self-supervised works provided alternatives to remove
tive field of Convolutional Neural Networks (CNNs). This it, avoiding the need for hard-to-source ground-truth depth
fact represents an implicit bottleneck for current dense es- annotations. This goal is feasible by casting the depth esti-
timation methods, dampening accuracy, and the capacity to mation task as a view-synthesis problem between adjacent
generalize to different domains. Specifically, the local na- views, in space or time, of the same observed scene. Pre-
ture of convolutions leads CNNs in their first layers – i.e., cisely, training single-view depth estimation network with
those in charge of modeling fine-grained details – to ex- stereo images [14, 16], monocular videos [71], or a com-
tract features missing long-range relationships across the bination of both [65, 17]. The supervisory signal based
same image. Going deeper with convolutions makes the on the photometric difference between real and synthesized
receptive field wider, yet it does not reach the whole im- images enables training in a self-supervised manner.
age. Fig. 1 highlights the effect of this shortcoming. CNNs Although stereo pairs enable scale recovery, with further
based frameworks sometimes fail to estimate foreground- improvements achievable by leveraging noisy proxy labels
background structures due to the lack of global perceiving [46, 54, 6], guidance from visual odometry [1] or trinoc-
and long-range relationship among modelled pixels. Vision ular assumptions [37], unlabeled video sequences repre-
Transformers (ViTs) [11, 56, 9] recently showed outstand- sent a more flexible alternative at the expense of learning
ing results on tasks such as object detection [9] and se- camera poses alongside depth. Several frameworks have
mantic segmentation [56], thanks to their capacity to model advanced this line of research by incorporating additional
long-range relationships between pixels and thus a global losses and constraints such as those based on direct visual
receptive field. The popularity of ViTs has also reached su- odometry [49], adversarial learning[68], ICP [31], normal
pervised depth estimation as well [38, 26], yet being not consistency [63, 62], semantic segmentation [23, 19] and
adopted for self-supervised monocular depth estimation. uncertainty [35, 60]. Another notable example is given
This paper takes this missing step and explores ViTs for in [17] where the authors introduced a minimum reprojec-
self-supervised monocular depth estimation by proposing tion loss between frames and an auto-masking strategy to
the MonoViT architecture. It combines both convolutional handle both occluded regions and static camera situations
layers and state-of-the-art (SoTA) Transformer blocks [25] that violate the main constraints of the view-synthesis for-
within its backbone to model both the local information mulation and, as a consequence, cause poor network con-
(objects) and global information (relationship among fore- vergence. Other works, instead, directly tackle highly com-
ground and background, among objects) within the same plex scenarios [67] and model rigid and non-rigid compo-
image. This strategy allows us to remove the bottleneck nents present in the scene using the estimated depth, relative
caused by the limited perceptive fields of CNNs encoders, camera poses, and optical flow in order to handle indepen-
leading to naturally finer-grained predictions, as shown in dent motions [47, 64, 72, 5, 39, 61] or by means of scene
Fig. 1. We evaluate the performance of MonoViT on the decomposition [41].
DIFFNet
CADepth
MonoViT
𝑾 𝑯 𝑾 𝑯 𝑾 𝑯
× × × 𝑾×𝑯
𝟖 𝟖 𝟒 𝟒 𝟐 𝟐
Figure 2. Attention maps of SoTA methods and our MonoViT. The first row shows the RGB image, and the highlighted car is the region
we want to analyze. In the next two rows, we report multiscale disparity predictions and attention maps of each method. For an object that
is small in size and hard to distinguish from the background, such as the car highlighted, we notice how MonoViT can predict its disparity
even at the lowest resolution (i.e. H
8
×W 8
). At the same time, other methods fail to capture it.
Network architectures. Playing with different archi- ing Transformer architectures [38, 59, 26]. However, these
tectures used as backbone showed a significant impact on methods focus on the supervised setting only.
the performance of monocular depth estimation itself. Yin
et al. [64] replaced the VGG encoder used by [71] with a 3. Proposed framework
ResNet. Guizilini et al. [18] designed a novel model, Pack-
Net, to learn detail-preserving compression and decompres- This section analyzes the necessity for introducing a
sion of features by using 3D convolutions. Lyu et al. [51] Transformer for self-supervised monocular depth estima-
worked on the features decoding, implementing an attention. Then, we describe our MonoViT network architecture
tion module for multi-scale feature fusion. Because of the and the loss functions used for the self-supervised training
limited receptive field of CNNs, the network performance of our framework.
still has room for further improvement. To extract the long-
range relationships between features, Yan et al. [57] pro- 3.1. Motivations
pose a channel-wise attention-based network to aggregate Unlike supervised depth estimation methods, the super-
discriminated features in channel dimensions. Consider- visory signal of self-supervised approaches derives from
ing the ability of HRNet [51] at modeling multi-scale fea- image reprojection across different, nearby viewpoints.
tures, Zhou et al. [69] introduced HRNet for self-supervised Thus, to achieve good performance, this formulation re-
monocular depth estimation. Other works, instead, focused quires the network to accurately perceive the scene struc-
on the design of efficient and fast networks suitable for low- ture: a challenging task, especially for regions with hard to
powered embedded devices [34, 33, 55]. Despite the in- distinguish foreground objects from the background. Cur-
creased accuracy achieved by the above networks, the issue rent SoTA networks [57, 30] rely on traditional convolu-
concerning long-range relationships persists [26]. tional layers for aggregating context information and grad-
Transformers in Depth Estimation. Recently, inspired ually lift the receptive field of the network through a cascade
by the success of the attention mechanism on modeling of layers and strided convolution [40]. However, given the
global context perception, ViTs [11, 28] showed great po- intrinsic locality of the convolution operator, CNNs hardly
tential in tasks such as image classification [28, 25], ob- model long-range appearance similarity among objects, in
ject detection [9, 4], and semantic segmentation [50, 56]. A particular within the shallowest features. An example of
few works also tackled monocular depth estimation by us- this occurs when the foreground objects have a texture sim-
...
Transformer based Depth Network
Upconv Layer Disparity Head
Conv-stem Upconv Layer Disparity Head

𝑾 𝑯
. ×
𝟐 𝟐
Joint CNN &
Transformer Layer
Upconv Layer Disparity Head
𝑾 𝑯 Joint CNN &

× Upconv Layer Disparity Head
𝟒 𝟒 Transformer Layer
𝑾 𝑯 Joint CNN &
×
. 𝟖 𝟖 Transformer Layer
Upconv Layer
𝑾 𝑯 Joint CNN & 𝑾 𝑯

× ×
𝟏𝟔 𝟏𝟔 Transformer Layer 𝟑𝟐 𝟑𝟐
R, t Up Sample
Conv Conv Skip Connection
...
Atten Conv
Block 3x3 3x3 3x3
S Connection
Image splits Pose Network Upconv Layer Disparity Head S Sigmoid
Figure 3. Overview of our MonoViT architecture. Our MonoViT consists of two parts, Depth Network and Pose Network. For Depth
Network, both Transformer [25] and convolutional layer are adopted to enhance the feature modeling and depth inferring. For pose
estimation between temporally adjacent images, we use a lightweight PoseNet as in previous works [17, 30, 69, 57].
Joint CNN & Transformer Layer

Transformer
Depth encoder. As pointed out by recent research [51,
LayerNorm Multi-Scale Patch Embedding 1 x 1 Conv 57, 71, 17], the encoder is crucial for effective features ex-
Factorized
MHSA Transformer Transformer Transformer Convolutional 3x3 traction. Inspired by one of the most recent Transformer ar-
Block Block Block Block DWConv
LayerNorm
chitectures – i.e., MPViT [25], in which a Multi-Path Trans-
FFN
Global-to-Local Feature Interaction 1 x 1 Conv
former Block is proposed for simultaneously representing
×𝐌
local and global context extracted from images – we fol-
low such a design to build the key components of our depth
Figure 4. Joint CNN & Transformer Layer used in depth
encoder. Each Transformer block contains M Transformer lay-
encoder in five stages. Given the current input image, we
ers, consisting of a Layer Normalization (LayerNorm) module, a adopt a Conv-stem block consisting of two convolutions
Factorized Multi-head Self Attention (MHSA) layer [25], another with kernel size 3×3 and stride of 2 only at the first convolu-
Layer Normalization and a Feed-forward Network (FFN). tion, generating features with size H W
2 × 2 . From stage two
to stage five, we stack the Multi-Path Transformer Blocks in
each stage, shown as “Joint CNN & Transformer Layer” in
ilar to the one of the background. In such a case, the feature Fig. 3. Precisely, each “Joint CNN & Transformer Layer”
backbone tends to embed them in the same semantic con- (shown in Fig. 4) consists of a Multi-Scale Patch Embed-
text, and the whole architecture cannot distinguish between ding layer, used to embed various-sized visual tokens in par-
foreground and background depths. Fig. 2 shows this be- allel – in our case, four parallel convolutional blocks extract
haviour; we can notice that the car in the middle of the road features with a receptive field of 3 × 3, 3 × 3, 5 × 5 and 7 × 7
is hard to spot from the ground due to the strong sunlight. pixels by stacking multiple 3×3 convolutional layers. Then,
CNNs such as CADepth [57] and DIFFNet [69] predict a considering the advantage of ViT at building global depen-
depth for the car similar to the one of the ground plane. This dencies while shows limitations modeling local details [25],
fact is due to their encoder network paying more attention extracted tokens are processed through both convolutional
to the ground than the car itself. Hence, we propose inte- layers and Transformers blocks, in a parallel and comple-
grating convolutions and ViT blocks to address the standard mentary manner – i.e., using the four branches shown in
limitation of the former, since the latter has more significant Fig. 4, respectively three parallel Transformer blocks and
potential for modeling long-range correlation. a convolutional block, this latter made of 1×1, 3×3 depth-
Driven by this rationale, we design our Monocular Vi- wise and 1×1 convolutions. While the convolutional branch
sion Transformer framework, MonoViT in short, as shown constructs the local relationship between neighbors within
in Fig. 3. It includes a DepthNet and a PoseNet, respec- features L, the three Transformer Blocks model the infor-
tively, designed for depth prediction of each input image mation interaction over the whole input space within fea-
and pose estimation and trained through image reconstruc- tures G0 , G1 , G2 , thanks to the self-attention mechanism.
tion losses. Specifically, these latter take a sequence of visual tokens
3.2. DepthNet Architecture embedded by the Multi-Scale Patching Embedding mod-
ule and project them into a query (Q), key (K), and value
As typical in previous works [71, 17], we design our (V ∈ RN ×C ) vectors through three separated but structure
DepthNet as an encoder-decoder architecture.
same heads (where N denotes the number of visual tokens, View reconstruction loss. By knowing camera intrin-
equal to the total number of pixels in the input space). The sics k and the predicted pose T between two nearby views,
self-attention mechanism is implemented in an efficient fac- a reconstructed target image Ĩ is obtained as a function π of
torized way [25]: intrinsics, pose, source image I † and depth D. A loss signal
Lss is computed as a function F of inputs Ĩ and I:
Q
FactorAtt(Q, K, V) = √ (softmax(K)T V), (1)
C
Lss = F(Ĩ, I) = F(π(I † , T, k, D), I). (4)
where C refers to the embedding dimension. Finally, a fea-
F is usually obtained as a weighted sum between a struc-
ture fusion block is used to collect and further enhance the
tural similarity term and an intensity difference term. Popu-
interaction between local and global features extracted by
lar choices for these two terms are the Structured Similarity
the “Joint CNN & Transformer Layer” at stage i.
Index Measure (SSIM) [53] and the L1 difference, as pro-
posed in [17]:
Ai = Concat([Li , Gi,0 , Gi,1 , Gi,2 ]), (2)
Xi+1 = H(Ai ), (3)

1 − SSIM(Ĩ, I)
with Ai ∈ RHi ×Wi ×Ci being the aggregated feature and F(Ĩ, I) = α · + (1 − α) · |Ĩ − I| (5)
2
H(·) a 1×1 convolutional layer which fuses them and yields
the final feature Xi+1 for the next stage i + 1. with α commonly set to 0.85 [17]. Besides, for each pixel
A clear benefit of integrating convolutions with Trans- p, the minimum among losses computed from forward and
former is the comprehensive – both local and global – inter- backward adjacent frames allows for softening the effect of
action between pixels. It helps the network to perceive the occlusions [17] on the reprojection process
structure and relative position of objects so that small fore-
ground objects can be preserved even at the lowest resolu- Lss (p) = min F(Ĩi (p), I(p)) (6)
tion, rather than collapsed into similar texture background i∈[1,−1]
as shown in Fig. 2.
with ‘1’ and ‘-1’ referring to the forward and backward ad-
Depth decoder. Taking the multi-scale features from the
jacent frames, respectively.
depth encoder, cross-layer and cross-scale connections are
adopted in our depth decoder to gradually increase the spa- Smoothness loss. As in previous works [17, 69], the
tial resolution, as shown in Fig. 3. Considering the context edge-aware smoothness loss is used to improve the inverse
difference between features at different scales, e.g. higher depth map d:
resolution features favour fine-grained details, we enhance
cross-scale feature fusion with both spatial and channel at- Lsmooth = |∂x d∗ |e∂x I + |∂y d∗ |e∂y I , (7)
tention mechanisms [30, 69] (i.e., our Atten Block). Fi-
nally, four heads – made of two convolutional layers and where d∗ = d/dˆ represents the mean-normalized inverse
a Sigmoid activation – are in charge of disparity (inverse depth. Besides, following [17], an auto-mask µ is calculated
depth) prediction from corresponding aggregated features, to filter static frames and objects moving with the same mo-
outputting maps at full, 12 , 14 , 18 resolution respectively. tion of the camera.
Total loss. Finally, both the view reconstruction loss Lss
3.3. PoseNet and the smoothness loss Lsmooth are computed from out-
Following [17, 30, 69, 57], our PoseNet favors a sim- puts at each scale s ∈ {1, 12 , 14 , 18 } – brought to full resolu-
ple, yet effective implementation. Specifically, our PoseNet tion – and then averaged as Ltot to train MonoViT:
uses the lightweight structure of ResNet18 [20]. Receiving
4
concatenated images [I, I † ] as input, it outputs a 6 DoF rel- 1 X
Ltot = · (µ · Lss + λ · Lsmooth ), (8)
ative pose T between adjacent frames of a video sequence. 4 s=1
3.4. Self-supervised Learning

with λ being set to 10−3 .
We cast depth estimation as an image reconstruction
task, replacing ground truth labels with unlabeled, monoc- 4. Experiments
ular videos at training time. The depth network takes a still
(i.e., target) image I and predict its dense inverse depth map In this section, we report the outcome of our ex-
d, from which depth D is derived as 1./d and forcing it to periments, clearly supporting the superior accuracy of
be in [Dmin , Dmax ] as in [17]. MonoViT at estimating depth across several benchmarks.
lower is better higher is better Method Data Resolution lower is better higher is better
Method Data Resolution Abs Rel Sq Rel RMSE RMSE log δ1 δ2 δ3 Abs Rel Sq Rel RMSE RMSE log δ1 δ2 δ3
Monodepth2 [17] M 640×192 0.115 0.903 4.863 0.193 0.877 0.959 0.981 Monodepth2 [17] M 1024×320 0.115 0.882 4.701 0.190 0.879 0.961 0.982
Sun [44] M 640×192 0.117 0.863 4.813 0.192 0.871 0.959 0.982 Sun [44] M 1024×320 0.110 0.791 4.557 0.184 0.887 0.964 0.983
SGDepth [22] M+Se 640×192 0.113 0.835 4.693 0.191 0.879 0.961 0.981 SAFENet [7] M+Se 1024×320 0.106 0.743 4.489 0.181 0.884 0.965 0.984
SAFENet [7] M+Se 640×192 0.112 0.788 4.582 0.187 0.878 0.963 0.983 HR-Depth [30] M 1024×320 0.106 0.755 4.472 0.181 0.892 0.966 0.984
VC-Depth [70] M 640×192 0.112 0.816 4.715 0.190 0.880 0.960 0.982 FeatDepth [43] M 1024×320 0.104 0.729 4.481 0.179 0.893 0.965 0.984
PackNet‡ [18] M 640×192 0.108 0.727 4.426 0.184 0.885 0.963 0.983 GCNDepth [32] M 1024×320 0.104 0.720 4.494 0.181 0.888 0.965 0.984
?
Mono-Uncertainty[35] M 640×192 0.111 0.863 4.756 0.188 0.881 0.961 0.982 CADepth [57] M 1024×320 0.102 0.734 4.407 0.178 0.898 0.966 0.984
†
HR-Depth [30] M 640×192 0.109 0.792 4.632 0.185 0.884 0.962 0.983 DIFFNet [69] M 1024×320 0.097 0.722 4.345 0.174 0.907 0.967 0.984
Johnston et al. [21] M 640×192 0.106 0.861 4.699 0.185 0.889 0.962 0.982 MonoViT (ours) M 1024×320 0.096 0.714 4.292 0.172 0.908 0.968 0.984
CADepth? [57] M 640×192 0.105 0.769 4.535 0.181 0.892 0.964 0.983 PackNet‡ [18] M 1280×384 0.104 0.758 4.386 0.182 0.895 0.964 0.982
DIFFNet† [69] M 640×192 0.102 0.749 4.445 0.179 0.897 0.965 0.983 SGDepth [22] M+Se 1280×384 0.107 0.768 4.468 0.186 0.891 0.963 0.982
MonoFormer [2] M 640×192 0.106 0.839 4.627 0.183 0.889 0.962 0.983 HR-Depth [30] M 1280×384 0.104 0.727 4.410 0.179 0.894 0.966 0.984
MonoViT (ours) M 640×192 0.099 0.708 4.372 0.175 0.900 0.967 0.984 CADepth? [57] M 1280×384 0.102 0.715 4.312 0.176 0.900 0.968 0.984
MonoViT (ours) M 1280×384 0.094 0.682 4.200 0.170 0.912 0.969 0.984
Monodepth2 [17] MS 640×192 0.106 0.818 4.750 0.196 0.874 0.957 0.979 Monodepth2 [17] MS 1024×320 0.106 0.818 4.750 0.196 0.874 0.957 0.979
HR-depth [30] MS 640×192 0.107 0.785 4.612 0.185 0.887 0.962 0.982 HR-Depth [30] MS 1024×320 0.101 0.716 4.395 0.179 0.892 0.966 0.984
CADepth? [57] MS 640×192 0.102 0.752 4.504 0.181 0.894 0.964 0.983 CADepth? [57] MS 1024×320 0.096 0.694 4.264 0.173 0.908 0.968 0.984
DIFFNet† [69] MS 640×192 0.101 0.749 4.445 0.179 0.898 0.965 0.983 DIFFNet† [69] MS 1024×320 0.094 0.678 4.250 0.172 0.911 0.968 0.984
MonoViT (ours) MS 640×192 0.098 0.683 4.333 0.174 0.904 0.967 0.984 MonoViT (ours) MS 1024×320 0.093 0.671 4.202 0.169 0.912 0.969 0.985
Table 1. Results on the KITTI benchmark using the Eigen split [15]. Each method is grouped by input resolution (low: left, high: right)
and training methodology (M: monocular videos, MS: binocular videos, Se: trained with semantic labels). The best scores are in bold. ?
refers to the current SoTA self-supervised method on the KITTI depth benchmark. † stands for the novel results from the official Github
repository, better than published ones. ‡ refers to the model pretrained on Cityscapes [8], while the others are pretrained on ImageNet [10].
lower is better higher is better

Method Data Abs Rel Sq Rel RMSE RMSE log δ1 δ2 δ3
4.2. Datasets
Monodepth2 [17] M 0.090 0.545 3.942 0.137 0.914 0.983 0.995
Johnston [21] M 0.081 0.484 3.716 0.126 0.927 0.985 0.996
KITTI [15]. The KITTI stereo dataset contains 61
HR-Depth [30] M 0.085 0.471 3.769 0.130 0.919 0.985 0.996 scenes, with a typical image size of 1242 × 375, cap-
CADepth [57] M 0.080 0.450 3.649 0.124 0.927 0.986 0.996
DIFFNet [69] M 0.076 0.412 3.494 0.119 0.935 0.988 0.996
tured using a stereo rig mounted on a moving car equipped
MonoViT (ours) M 0.075 0.389 3.419 0.115 0.938 0.989 0.997 with a LiDAR sensor. Following previous works in this
field [17, 69, 30], we use the image split of Eigen et al. [12],
PackNet‡ [18] M 0.071 0.359 3.153 0.109 0.944 0.990 0.997
MonoViT (ours) M 0.067 0.328 3.108 0.104 0.950 0.992 0.998 which consists of 39810 monocular triplets for training and
Table 2. Results on KITTI using the improved ground 4424 for validation. To compare with the existing solutions,
truth [48]. Top: 640 × 192 input resolution, bottom: 1280 × 384. we evaluate the single-view depth performance on the test
‡ pretrained on Cityscapes [8] (on ImageNet [10] otherwise). split of [12] either using raw LiDAR (697 images) or im-
proved ground truth labels [48] (652 images).
Make3D [42]. This dataset features outdoor environ-
4.1. Implementation Details ments and is typically used for testing the generalization
performance of monocular depth frameworks. We test
We implement our MonoViT in Pytorch. The model MonoViT following the same image pre-processing steps
is trained for 20 epochs on the KITTI dataset using and computing the evaluation metrics detailed in [17].
AdamW [29] as optimizer and a batch size set to 12. The DrivingStereo [58]. It is a large-scale stereo dataset de-
initial learning rate for PoseNet and depth decoder is 10−4 , picting autonomous driving scenarios. Among several se-
while the Transformer-based depth encoder is trained with quences, we use the four image splits made available on the
an initial learning rate of 5×10−5 . The number M of Trans- website, each made of 500 frames collected under different
former layers in each of the three Transformer blocks in weather conditions, respectively foggy, cloudy, rainy and
the ‘Joint CNN & Transformer Layer’ is set as 1, 3, 6, 3 sunny. We use this dataset to evaluate the generalization
from stage 2 to stage 5 in the depth encoder, respectively. capacity of MonoViT and its most recent competitors.
Both the pose encoder and depth encoder are pre-trained on
ImageNet [10]. We use a single RTX 3090 GPU for the 4.3. Depth Evaluation
low resolution (640 × 192) experiments while 4 RTX 3090 Results on KITTI: We test our model by using the stan-
GPUs for higher resolution (1024 × 320,1280 × 384) ones. dard KITTI Eigen split [12], which includes 697 images
Overall, network training requires about 15 hours. In our coupled with raw LiDAR scans. Among them, improved
experiments, we adopt the same data augmentation detailed ground truth labels [48] are provided for 652 images. Since
in [17, 30]. monocular depth models trained on video sequences suf-
For evaluation, we compute the seven standard met- fer from monocular scale ambiguity, the estimated depth is
rics (Abs Rel, Sq Rel, RMSE, RMSE log, δ1 < 1.25, scaled by the per-image median ground truth [71].
δ2 < 1.252 , δ3 < 1.253 ) proposed in [12] and used by Tab. 1 collects the results achieved by SoTA self-
most works in the literature. supervised frameworks, processing either low resolution
Images
Mono2
HR-Depth
CADepth
DIFFNet
MonoViT
Figure 5. Qualitative results on KITTI. Top row, input images. Then, predictions by SoTA methods (Mono2 [17], HR-Depth [30],
CADepth [57], DIFFNet [69]) and MonoViT (Ours). For each method, we show the depth map and the corresponding error map.
lower is better lower is better higher is better
Method Abs Rel Sq Rel RMSE RMSE log Domain Method Abs Rel Sq Rel RMSE RMSE log δ1 δ2 δ3
Monodepth2 [17] 0.125 1.514 7.927 0.195 0.849 0.950 0.980
Monodepth2 [17] 0.321 3.378 7.252 0.163 HR-Depth [30] 0.131 1.504 8.023 0.199 0.828 0.949 0.982
HR-Depth [30] 0.305 2.944 6.857 0.157 foggy CADepth [57] 0.126 1.375 7.585 0.187 0.845 0.956 0.986
CADepth [57] 0.319 3.564 7.152 0.158 DIFFNet [69] 0.111 1.232 7.047 0.169 0.869 0.966 0.989
DIFFNet [69] 0.298 2.901 6.753 0.153 MonoViT (ours) 0.096 0.934 6.313 0.150 0.893 0.974 0.993
MonoViT (ours) 0.286 2.758 6.623 0.147 Monodepth2 [17] 0.155 1.900 6.976 0.209 0.813 0.943 0.979
HR-Depth [30] 0.149 1.656 6.658 0.204 0.815 0.945 0.981
Table 3. Quantitative results on the Make3D Dataset [42]. Mod- cloudy CADepth [57] 0.147 1.811 6.785 0.201 0.832 0.948 0.981
els trained on KITTI with 640 × 192 images. DIFFNet [69] 0.140 1.571 6.298 0.192 0.837 0.950 0.983
MonoViT (ours) 0.125 1.300 5.970 0.177 0.861 0.958 0.986
Monodepth2 [17] 0.240 3.339 11.040 0.301 0.591 0.857 0.952
(left) or high resolution (right) images. We report re- HR-Depth [30] 0.222 2.962 10.494 0.281 0.631 0.868 0.959
sults for methods trained both using monocular (‘M’, top), rainy CADepth [57] 0.221 3.072 10.681 0.277 0.632 0.879 0.963
DIFFNet [69] 0.191 2.411 9.626 0.244 0.679 0.914 0.978
and binocular videos (‘MS’, bottom) for completeness. MonoViT (ours) 0.169 1.925 8.604 0.219 0.733 0.934 0.985
MonoViT significantly outperforms existing SoTA meth- Monodepth2 [17] 0.155 1.740 6.744 0.214 0.819 0.941 0.977
HR-Depth [30] 0.153 1.546 6.505 0.212 0.812 0.942 0.978
ods for any training resolution and setting on all metrics. sunny CADepth [57] 0.145 1.518 6.485 0.202 0.827 0.949 0.982
In particular, we also highlight how MonoViT greatly out- DIFFNet [69] 0.142 1.457 6.165 0.197 0.835 0.950 0.982
MonoViT (ours) 0.130 1.266 6.109 0.186 0.851 0.956 0.985
performs MonoFormer [2], a concurrent attempt to deploy
Table 4. Results on DrivingStereo Dataset [58]. Models trained
Transformers in self-supervised monocular depth estima- on KITTI with 640×192 images and tested on four different com-
tion. Tab. 2 reports the same metrics computed over the plex scenarios (foggy, cloudy, rainy and sunny).
improved ground truth labels processing 640 × 192 images.
Again, MonoViT is constantly more accurate. in [17, 30, 69, 57], we firstly train our model on KITTI using
Fig. 5 reports a comparison between MonoViT and some images at 640 × 192 resolution and, then test on Make3D
of its competitors, showing that our model can get a much without a fine-tuning procedure. For fairness, we evaluate
lower RMSE and proving that MonoViT is more powerful MonoViT and the existing self-supervised networks using
at modeling relations between objects than existing models. the same evaluation code provided by [17]. Tab. 3 demon-
Results on Make3D. We run experiments on the strates how our proposed architecture allows us to outper-
Make3D dataset [42] in order to evaluate the capability form other strategies by a large margin and to achieve SoTA
of our model to generalize on different real-world en- generalization results.
vironments. By following the same protocol indicated Results on DrivingStereo. Additionally, to further
Images
Images CADepth DIFFNet MonoViT

Images
Images
KITTI
CADepth
DIFFNet
DIFFNet
cloudy
DrivingStereo
MonoViT
foggy
Figure 6. Qualitative comparison on the Make3D dataset [42]. rainy
Predictions by CADepth [57], DIFFNet [69] and our MonoViT.

sunny
evaluate the generalization capacity of MonoViT, we also

Figure 7. Qualitative comparison on KITTI [15] (top) and
test it under four different weather conditions, including DrivingStereo [58] (bottom). Predictions by CADepth [57],
foggy, cloudy, rainy and sunny, from the DrivingStereo [58] DIFFNet [69] and our MonoViT.
dataset. In Tab. 4, we collect the output of this evaluation lower is better higher is better
Backbone Params↓ FPS↑
for MonoViT and SoTA frameworks. Any model has been Abs Rel↓ Sq Rel↓ RMSE↓ RMSE log ↓ δ1 ↑ δ2 ↑ δ3 ↑
ResNet34 [20] 27M 42 0.108 0.780 4.622 0.183 0.884 0.963 0.983
trained on KITTI and tested on DrivingStereo without any SwinT-tiny [28] 34M 41 0.101 0.698 4.404 0.177 0.894 0.966 0.984
PVT-small[52] 30M 38 0.106 0.801 4.648 0.184 0.887 0.961 0.982
re-training or fine-tuning. Once again, MonoViT perfor- MPViT-tiny 10M 24 0.102 0.733 4.459 0.177 0.895 0.965 0.984
MPViT-xsmall 13M 24 0.101 0.738 4.402 0.175 0.898 0.967 0.984
mance always results vastly superior to any CNN competi- MPViT-small 27M 18 0.099 0.708 4.372 0.175 0.900 0.967 0.984
MPViT-base 78M 15 0.100 0.747 4.427 0.176 0.901 0.966 0.984
tor. In this case, the margin is even higher than what was MPViT-small 27M 18 0.099 0.708 4.372 0.175 0.900 0.967 0.984
observed for KITTI and Make3D. This fact further suggests w/o CNN Path
2 Trans. Path
-
-
-
-
0.114
0.107
0.929
0.801
4.821
4.590
0.190
0.182
0.879 0.959 0.981
0.889 0.963 0.983
that the ViT encoder used within our framework dramati- 1 Trans. Path
w/o Trans. Path
-
-
-
-
0.120
0.127
0.876
0.931
4.799
4.972
0.196
0.203
0.864 0.956 0.981
0.850 0.951 0.980
cally affects the generalization capacity of the whole depth w/o Atten. Block - - 0.101 0.772 4.465 0.177 0.898 0.965 0.983
network, thanks to the long-range relationships among fea- Table 5. Ablation study on KITTI. Trans. refers to Transformer.
tures modeled by the Transformer blocks themselves. Input is 640×192, runtime measured on RTX 3090 GPU. En-
coders are pre-trained on ImageNet.
Qualitative results. Fig. 6 reports some qualitative
examples from the Make3D dataset, with MonoViT being Block in the decoder and the CNN path/Transformer path in
able to model the structures of objects more accurately than the “Joint CNN & Transformer Layer” (Fig. 4). As shown
its competitors. Fig. 7 shows a further qualitative com- in the table, both CNN path, Transformer path and Atten.
parison between MonoViT and SoTA frameworks on some Blocks play an important role in the architecture.
challenging images from KITTI (top) and some even more
challenging samples from DrivingStereo (bottom). For both 5. Conclusion
datasets, we notice that MonoViT can effectively model the
foreground and background because of the global receptive This paper proposed MonoViT, a new architecture for
field, resulting in more precise, finer-grained estimation. self-supervised monocular depth estimation. By combining
both convolutions and Transformers block inside the net-
4.4. Ablation study work encoder, MonoViT can model the local and global
Finally, to further validate the effectiveness of our depth context of images jointly, overcoming existing solutions
architecture, we report an ablation study in Tab. 5. com- based on CNNs. Our proposal vastly and consistently out-
paring the results yielded by MPViT variants (tiny, xsmall, performs the SoTA on the KITTI dataset. Moreover, exper-
small, base) and different recent Transformer encoders, iments on Make3D and DrivingStereo datasets show that
SwinT-tiny [28] and PVT [52] (which counts a similar num- MonoViT achieves better generalization performance than
ber of parameters compared to MPViT-small) on top, also SoTA architectures for self-supervised depth estimation.
Acknowledgements: This work was supported in part by na-
by reporting the amount of parameters and FPS for each.
tional Key Research & Development Program - National Key Re-
Benefiting from the combination of CNNs and Transform-
search and Development Program of China (2021YFB1714300),
ers, the MPViT backbone outperforms the other two SoTA National Natural Science Fund for Distinguished Young Schol-
pure Transformer backbones (SwinT [28], PVT [52]) and ars (61725301), Programme of Introducing Talents of Discipline
the pure CNN one (ResNet34 [20]) in the self-supervised to Universities (the 111 Project) under Grant B17017, Innova-
monocular depth estimation task. Besides, we assess the tion Research Funding of China National Petroleum Corporation
impact of the different modules on bottom, like the Atten. (2021D002-0902) and Shanghai AI Lab.
References work. Advances in neural information processing systems,
27, 2014. 1, 2, 6
[1] Lorenzo Andraghetti, Panteleimon Myriokefalitakis,
[13] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Bat-
Pier Luigi Dovesi, Belen Luque, Matteo Poggi, Alessandro
manghelich, and Dacheng Tao. Deep ordinal regression net-
Pieropan, and Stefano Mattoccia. Enhancing self-supervised
work for monocular depth estimation. In Conference on
monocular depth estimation with traditional visual odom-
Computer Vision and Pattern Recognition (CVPR), pages
etry. In 7th International Conference on 3D Vision (3DV),
2002–2011, Salt Lake City, Utah, 2018. IEEE. 1, 2
2019. 2
[14] Ravi Garg, Vijay Kumar BG, and Ian Reid. Unsuper-
[2] Jinwoo Bae, Sungho Moon, and Sunghoon Im. Mono-
vised cnn for single view depth estimation: Geometry to
former: Towards generalization of self-supervised monoc-
the rescue. In European Conference on Computer Vi-
ular depth estimation with transformers. arXiv preprint
sion (ECCV), pages 740–756, Amsterdam, The Netherlands,
arXiv:2205.11083, 2022. 6, 7
2016. Springer. 2
[3] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka.
[15] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel
Adabins: Depth estimation using adaptive bins. In Proceed-
Urtasun. Vision meets robotics: The kitti dataset. The Inter-
ings of the IEEE/CVF Conference on Computer Vision and
national Journal of Robotics Research, 32(11):1231–1237,
Pattern Recognition, pages 4009–4018, 2021. 1, 2
2013. 6, 8
[4] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas
Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- [16] Clément Godard, Oisin Mac Aodha, and Gabriel J. Bros-
end object detection with transformers. In European confer- tow. Unsupervised monocular depth estimation with left-
ence on computer vision, pages 213–229. Springer, 2020. 3 right consistency. In Conference on Computer Vision and
Pattern Recognition (CVPR), 2017. 1, 2
[5] Yuhua Chen, Cordelia Schmid, and Cristian Sminchis-
escu. Self-supervised learning with geometric constraints in [17] Clément Godard, Oisin Mac Aodha, Michael Firman, and
monocular video: Connecting flow, depth, and camera. In Gabriel Brostow. Digging into self-supervised monocular
ICCV, 2019. 2 depth estimation. In International Conference on Computer
[6] Hyesong Choi, Hunsang Lee, Sunkyung Kim, Sunok Kim, Vision (ICCV), 2019. 2, 4, 5, 6, 7
Seungryong Kim, Kwanghoon Sohn, and Dongbo Min. [18] Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raven-
Adaptive confidence thresholding for monocular depth esti- tos, and Adrien Gaidon. 3d packing for self-supervised
mation. In Proceedings of the IEEE/CVF International Con- monocular depth estimation. In Conference on Computer
ference on Computer Vision (ICCV), pages 12808–12818, Vision and Pattern Recognition (CVPR), 2020. 2, 3, 6
October 2021. 2 [19] Vitor Guizilini, Rui Hou, Jie Li, Rares Ambrus, and Adrien
[7] JaeHoon Choi, Dongki Jung, DongHwan Lee, and Changick Gaidon. Semantically-guided representation learning for
Kim. Safenet: Self-supervised monocular depth estimation self-supervised monocular depth. In International Confer-
with semantic-aware feature extraction. In Conference on ence on Learning Representations (ICLR), 2020. 2
Neural Information Processing Systems (NIPS), 2020. 6 [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
[8] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Deep residual learning for image recognition. In Conference
Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe on Computer Vision and Pattern Recognition (CVPR), pages
Franke, Stefan Roth, and Bernt Schiele. The cityscapes 770–778, Las Vegas, Nevada, 2016. IEEE. 5, 8
dataset for semantic urban scene understanding. In Proceed- [21] Adrian Johnston and Gustavo Carneiro. Self-supervised
ings of the IEEE conference on computer vision and pattern monocular trained depth estimation using self-attention and
recognition, pages 3213–3223, 2016. 6 discrete disparity volume. In Proceedings of the ieee/cvf con-
[9] Zhigang Dai, Bolun Cai, Yugeng Lin, and Junying Chen. ference on computer vision and pattern recognition, pages
Up-detr: Unsupervised pre-training for object detection with 4756–4765, 2020. 6
transformers. In Proceedings of the IEEE/CVF Conference [22] Marvin Klingner, Jan-Aike Termöhlen, Jonas Mikolajczyk,
on Computer Vision and Pattern Recognition, pages 1601– and Tim Fingscheidt. Self-supervised monocular depth es-
1610, 2021. 2, 3 timation: Solving the dynamic object problem by seman-
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, tic guidance. In European Conference on Computer Vision,
and Li Fei-Fei. Imagenet: A large-scale hierarchical im- pages 582–600. Springer, 2020. 2, 6
age database. In Conference on Computer Vision and Pat- [23] Varun Ravi Kumar, Marvin Klingner, Senthil Yogamani, Ste-
tern Recognition (CVPR), pages 248–255, Miami, FL, 2009. fan Milz, Tim Fingscheidt, and Patrick Mader. Syndistnet:
IEEE. 6 Self-supervised monocular fisheye camera distance estima-
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, tion synergized with semantic segmentation for autonomous
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, driving. In Proceedings of the IEEE/CVF Winter Conference
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- on Applications of Computer Vision, pages 61–71, 2021. 2
vain Gelly, et al. An image is worth 16x16 words: Trans- [24] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Fed-
formers for image recognition at scale. arXiv preprint erico Tombari, and Nassir Navab. Deeper depth prediction
arXiv:2010.11929, 2020. 2, 3 with fully convolutional residual networks. In Fourth Inter-
[12] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map national Conference on 3D Vision (3DV), pages 239–248,
prediction from a single image using a multi-scale deep net- Stanford University, California, 2016. IEEE. 1, 2
[25] Youngwan Lee, Jonghee Kim, Jeff Willette, and Sung Ju [38] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi-
Hwang. Mpvit: Multi-path vision transformer for dense pre- sion transformers for dense prediction. In Proceedings of
diction. arXiv preprint arXiv:2112.11010, 2021. 2, 3, 4, 5 the IEEE/CVF International Conference on Computer Vi-
[26] Zhenyu Li, Zehui Chen, Xianming Liu, and Junjun Jiang. sion, pages 12179–12188, 2021. 2, 3
Depthformer: Exploiting long-range correlation and local in- [39] Anurag Ranjan, Varun Jampani, Lukas Balles, Kihwan Kim,
formation for accurate monocular depth estimation. arXiv Deqing Sun, Jonas Wulff, and Michael J. Black. Competitive
preprint arXiv:2203.14211, 2022. 2, 3 collaboration: Joint unsupervised learning of depth, camera
[27] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. motion, optical flow and motion segmentation. In Confer-
Learning depth from single monocular images using deep ence on Computer Vision and Pattern Recognition (CVPR),
convolutional neural fields. IEEE Trans. on Pattern Analysis pages 12232–12241, Long Beach, California, 2019. IEEE. 2
and Machine Intelligence, 38(10):2024–2039, 2016. 2 [40] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net:
[28] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Convolutional networks for biomedical image segmentation.
Zhang, Stephen Lin, and Baining Guo. Swin transformer: In Nassir Navab, Joachim Hornegger, William M. Wells, and
Hierarchical vision transformer using shifted windows. In Alejandro F. Frangi, editors, In Medical Image Computing
Proceedings of the IEEE/CVF International Conference on and Computer-Assisted Intervention (MICCAI), 2015. 3
Computer Vision, pages 10012–10022, 2021. 3, 8 [41] Sadra Safadoust and Fatma Güney. Self-supervised monoc-
[29] Ilya Loshchilov and Frank Hutter. Decoupled weight decay ular scene decomposition and depth estimation. In 2021 In-
regularization. arXiv preprint arXiv:1711.05101, 2017. 6 ternational Conference on 3D Vision (3DV), pages 627–636.
[30] Xiaoyang Lyu, Liang Liu, Mengmeng Wang, Xin Kong, Lina IEEE, 2021. 2
Liu, Yong Liu, Xinxin Chen, and Yi Yuan. Hr-depth: High [42] Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3d:
resolution self-supervised monocular depth estimation. In Learning 3d scene structure from a single still image. IEEE
AAAI Conference on Artificial Intelligence (AAAI), 2021. 1, transactions on pattern analysis and machine intelligence,
2, 3, 4, 5, 6, 7 31(5):824–840, 2008. 2, 6, 7, 8
[31] Reza Mahjourian, Martin Wicke, and Anelia Angelova. Un- [43] Chang Shu, Kun Yu, Zhixiang Duan, and Kuiyuan Yang.
supervised learning of depth and ego-motion from monoc- Feature-metric loss for self-supervised learning of depth and
ular video using 3d geometric constraints. In The IEEE egomotion. In European Conference on Computer Vision,
Conference on Computer Vision and Pattern Recognition pages 572–588. Springer, 2020. 6
(CVPR), 2018. 2 [44] Qiyu Sun, Yang Tang, Chongzhen Zhang, Chaoqiang Zhao,
[32] Armin Masoumian, Hatem A Rashwan, Saddam Abdulwa- Feng Qian, and Jürgen Kurths. Unsupervised estimation of
hab, Julian Cristiano, and Domenec Puig. Gcndepth: Self- monocular depth and vo in dynamic environments via hybrid
supervised monocular depth estimation based on graph con- masks. IEEE Transactions on Neural Networks and Learn-
volutional network. arXiv preprint arXiv:2112.06782, 2021. ing Systems, 2021. 6
6 [45] Yang Tang, Chaoqiang Zhao, Jianrui Wang, Chongzhen
[33] Valentino Peluso, Antonio Cipolletta, Andrea Calimera, Zhang, Qiyu Sun, Wei Xing Zheng, Wenli Du, Feng Qian,
Matteo Poggi, Fabio Tosi, and Stefano Mattoccia. Enabling and Jürgen Kurths. Perception and navigation in autonomous
energy-efficient unsupervised monocular depth estimation systems in the era of learning: A survey. IEEE Transactions
on ARMv7-based platforms. In Design, Automation & Test on Neural Networks and Learning Systems, 2022. 1
in Europe Conference & Exhibition, DATE 2019, Florence, [46] Fabio Tosi, Filippo Aleotti, Matteo Poggi, and Stefano Mat-
Italy, March 25-29, 2019, pages 1703–1708, 2019. 3 toccia. Learning monocular depth estimation infusing tradi-
[34] Matteo Poggi, Filippo Aleotti, Fabio Tosi, and Stefano Mat- tional stereo knowledge. In The IEEE Conference on Com-
toccia. Towards real-time unsupervised monocular depth es- puter Vision and Pattern Recognition (CVPR), 2019. 2
timation on CPU. In IEEE/JRS Conference on Intelligent [47] Fabio Tosi, Filippo Aleotti, Pierluigi Zama Ramirez, Matteo
Robots and Systems (IROS), 2018. 3 Poggi, Samuele Salti, Luigi Di Stefano, and Stefano Mat-
[35] Matteo Poggi, Filippo Aleotti, Fabio Tosi, and Stefano Mat- toccia. Distilled semantics for comprehensive scene under-
toccia. On the uncertainty of self-supervised monocular standing from videos. In Proceedings of the IEEE/CVF Con-
depth estimation. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages
ference on Computer Vision and Pattern Recognition, pages 4654–4665, 2020. 2
3227–3237, 2020. 2, 6 [48] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke,
[36] Matteo Poggi, Fabio Tosi, Konstantinos Batsos, Philippos Thomas Brox, and Andreas Geiger. Sparsity invariant cnns.
Mordohai, and Stefano Mattoccia. On the synergies between In 2017 international conference on 3D Vision (3DV), pages
machine learning and binocular stereo for depth estimation 11–20. IEEE, 2017. 6
from images: a survey. IEEE Transactions on Pattern Anal- [49] Chaoyang Wang, Jose Miguel Buenaposada, Rui Zhu, and
ysis and Machine Intelligence, 2021. 1 Simon Lucey. Learning depth from monocular videos using
[37] Matteo Poggi, Fabio Tosi, and Stefano Mattoccia. Learning direct methods. In The IEEE Conference on Computer Vision
monocular depth estimation with unsupervised trinocular as- and Pattern Recognition (CVPR), 2018. 2
sumptions. In 6th International Conference on 3D Vision [50] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and
(3DV), 2018. 2 Liang-Chieh Chen. Max-deeplab: End-to-end panoptic
segmentation with mask transformers. In Proceedings of [62] Zhenheng Yang, Peng Wang, Wei Xu, Liang Zhao, and Ra-
the IEEE/CVF Conference on Computer Vision and Pattern makant Nevatia. Unsupervised learning of geometry with
Recognition, pages 5463–5474, 2021. 3 edge-aware depth-normal consistency, 2017. 2
[51] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, [63] Zhenheng Yang, Peng Wang, Wang Yang, Wei Xu, and
Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Nevatia Ram. Lego: Learning edge with geometry all at once
Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep by watching videos. In The IEEE Conference on Computer
high-resolution representation learning for visual recogni- Vision and Pattern Recognition (CVPR), 2018. 2
tion. IEEE transactions on pattern analysis and machine [64] Zhichao Yin and Jianping Shi. Geonet: Unsupervised learn-
intelligence, 43(10):3349–3364, 2020. 3, 4 ing of dense depth, optical flow and camera pose. In Pro-
[52] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao ceedings of the IEEE conference on computer vision and pat-
Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. tern recognition, pages 1983–1992, 2018. 2, 3
Pyramid vision transformer: A versatile backbone for dense [65] Huangying Zhan, Ravi Garg, Chamara Saroj Weerasekera,
prediction without convolutions. In Proceedings of the Kejie Li, Harsh Agarwal, and Ian Reid. Unsupervised learn-
IEEE/CVF International Conference on Computer Vision, ing of monocular depth estimation and visual odometry with
pages 568–578, 2021. 8 deep feature reconstruction. In The IEEE Conference on
[53] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Computer Vision and Pattern Recognition (CVPR), 2018. 2
Image quality assessment: From error visibility to structural [66] Chaoqiang Zhao, Qiyu Sun, Chongzhen Zhang, Yang Tang,
similarity. Trans. Img. Proc., 13(4):600–612, Apr. 2004. 5 and Feng Qian. Monocular depth estimation based on deep
learning: An overview. Science China Technological Sci-
[54] Jamie Watson, Michael Firman, Gabriel J Brostow, and
ences, 63(9):1612–1627, 2020. 1, 2
Daniyar Turmukhambetov. Self-supervised monocular depth
[67] Chaoqiang Zhao, Yang Tang, and Qiyu Sun. Unsuper-
hints. In ICCV, 2019. 2
vised monocular depth estimation in highly complex envi-
[55] Wofk, Diana and Ma, Fangchang and Yang, Tien-Ju and ronments. IEEE Transactions on Emerging Topics in Com-
Karaman, Sertac and Sze, Vivienne. FastDepth: Fast Monoc- putational Intelligence, 2022. 2
ular Depth Estimation on Embedded Systems. In IEEE In-
[68] Chaoqiang Zhao, Gary G Yen, Qiyu Sun, Chongzhen Zhang,
ternational Conference on Robotics and Automation (ICRA),
and Yang Tang. Masked gan for unsupervised depth and pose
2019. 3
prediction with scale consistency. IEEE Transactions on
[56] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Neural Networks and Learning Systems, 32(12):5392–5403,
Jose M Alvarez, and Ping Luo. Segformer: Simple and 2020. 2
efficient design for semantic segmentation with transform- [69] Hang Zhou, David Greenwood, and Sarah Taylor. Self-
ers. Advances in Neural Information Processing Systems, supervised monocular depth estimation with internal feature
34, 2021. 2, 3 fusion. In British Machine Vision Conference (BMVC), 2021.
[57] Jiaxing Yan, Hong Zhao, Penghui Bu, and YuSheng Jin. 2, 3, 4, 5, 6, 7, 8
Channel-wise attention-based network for self-supervised [70] Hang Zhou, David Greenwood, Sarah Taylor, and Han Gong.
monocular depth estimation. In 2021 International Confer- Constant velocity constraints for self-supervised monocular
ence on 3D Vision (3DV), pages 464–473. IEEE, 2021. 2, 3, depth estimation. In European Conference on Visual Media
4, 5, 6, 7, 8 Production, pages 1–8, 2020. 6
[58] Guorun Yang, Xiao Song, Chaoqin Huang, Zhidong Deng, [71] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G
Jianping Shi, and Bolei Zhou. Drivingstereo: A large-scale Lowe. Unsupervised learning of depth and ego-motion from
dataset for stereo matching in autonomous driving scenarios. video. In Proceedings of the IEEE conference on computer
In IEEE Conference on Computer Vision and Pattern Recog- vision and pattern recognition, pages 1851–1858, 2017. 1,
nition (CVPR), 2019. 2, 6, 7, 8 2, 3, 4, 6
[59] Guanglei Yang, Hao Tang, Mingli Ding, Nicu Sebe, and [72] Yuliang Zou, Zelun Luo, and Jia-Bin Huang. Df-net: Un-
Elisa Ricci. Transformer-based attention networks for supervised joint learning of depth and flow using cross-task
continuous pixel-wise prediction. In Proceedings of the consistency. In European Conference on Computer Vision
IEEE/CVF International Conference on Computer Vision, (ECCV), pages 1–18, Munich, Germany, 2018. Springer. 2
pages 16269–16279, 2021. 3
[60] Nan Yang, Lukas von Stumberg, Rui Wang, and Daniel
Cremers. D3vo: Deep depth, deep pose and deep uncer-
tainty for monocular visual odometry. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 1281–1292, 2020. 2
[61] Zhenheng Yang, Peng Wang, Yang Wang, Wei Xu, and Ram
Nevatia. Every pixel counts: Unsupervised geometry learn-
ing with holistic 3d motion understanding. In The Euro-
pean Conference on Computer Vision (ECCV) Workshops,
September 2018. 2

Monovit: Self-Supervised Monocular Depth Estimation With A Vision Transformer

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

Monovit: Self-Supervised Monocular Depth Estimation With A Vision Transformer

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Monovit: Self-Supervised Monocular Depth Estimation With A Vision Transformer

Uploaded by

Copyright:

Available Formats

mage HR-Depth CADepth

MonoViT: Self-Supervised Monocular Depth Estimation with

Abstract spite the steady progress of active depth-sensing technolo-

Upconv Layer Disparity Head

Conv-stem Upconv Layer Disparity Head

𝑾 𝑯 Joint CNN &

𝑾 𝑯 Joint CNN & 𝑾 𝑯

Joint CNN & Transformer Layer

Xi+1 = H(Ai ), (3)

3.4. Self-supervised Learning

lower is better higher is better

Images CADepth DIFFNet MonoViT

Figure 6. Qualitative comparison on the Make3D dataset [42]. rainy

Predictions by CADepth [57], DIFFNet [69] and our MonoViT.

evaluate the generalization capacity of MonoViT, we also

You might also like