Thesis - Sem 8 (1)

Download as pdf or txt
Download as pdf or txt
You are on page 1of 44

B.Tech.

Project
On

3D reconstruction of an object from a Single Image and a


Text Prompt

Report submitted in partial fulfilment of the requirements for the


B. Tech. degree in Instrumentation and Control Engineering
By

Akshit Pareek 2019UIC3588


Anish Gupta 2019UIC3591
Gaurav Bhatt 2019UIC3598

Under the supervision


of
Prof. Asha Rani

DEPARTMENT OF INSTRUMENTATION AND CONTROL ENGINEERING


NETAJI SUBHAS UNIVERSITY OF TECHNOLOGY (NSUT) DWARKA,
NEW DELHI
MAY 2023
CANDIDATE DECLARATION

DEPARTMENT OF INSTRUMENTATION AND CONTROL ENGINEERING

We, Akshit Pareek (2019UIC3588), Anish Gupta (2019UIC3591) and Gaurav Bhatt
(2019UIC3598) students of B. Tech., Department of Instrumentation and Control
Engineering, hereby declare that the Project-Thesis titled “3D Reconstruction of an
object from a Single Image and a Text Prompt” which is submitted by us to the
Department of Instrumentation and Control Engineering, Netaji Subhas University of
Technology (NSUT) Dwarka, New Delhi in partial fulfilment of the requirement for
the award of the degree of Bachelor of Technology is our original work and not
copied from any source without proper citation. The manuscript has been subjected to
plagiarism check by Turnitin software. This work has not previously formed the basis
for the award of any other Degree.

Place:
Date:

Akshit Pareek Anish Gupta Gaurav Bhatt

2019UIC3588 2019UIC3591 2019UIC3598

2
CERTIFICATE OF DECLARATION

DEPARTMENT OF INSTRUMENTATION AND CONTROL ENGINEERING

This is to certify that the work embodied in thesis titled, “3D Reconstruction of an
object from a Single Image and Text Prompt” by Akshit Pareek (2019UIC3588),
Anish Gupta (2019UIC3591) and Gaurav Bhatt (2019UIC3598) is the bonafide work
of the group submitted to Netaji Subhas University of Technology for consideration in
8th Semester B.Tech. Project Evaluation.

The original research work was carried out by the team under my/our guidance and
supervision in the academic year 2022-2023. This work has not been submitted for
any other diploma or degree of any University. On the basis of the declaration made
by the group, we recommend the project report for evaluation.

Prof. Asha Rani

Department of Instrumentation and Control Engineering


Netaji Subhas University of Technology

3
ACKNOWLEDGEMENT

We would like to express our gratitude and appreciation to all those who make it
possible to complete this project. Special thanks to our project supervisor Prof. Asha
Rani whose help, stimulating suggestions, and encouragement helped us in writing
this report. We also sincerely thank our colleagues for the time spent proofreading and
correcting our mistakes.

We would also like to acknowledge with much appreciation the crucial role of the
staff in Instrumentation and Control Engineering and The Centre of Excellence in
Artificial Intelligence, which gave us permission to use the lab equipment and gave
permission to use all necessary tools in the laboratory.

4
PLAGIARISM REPORT

5
ABSTRACT

Single-image 3D reconstruction in real-world scenarios is challenging due to the


complexity and diversity of objects and environments. This thesis proposes a
framework that combines visual-language models and the Segment-Anything object
segmentation model to generate reliable and versatile 3D reconstructions from a
single image. The method uses a Grounding DINO model to generate textural
descriptions, the Segment-Anything model for object extraction, and a text-to-image
diffusion model to lift the object into a neural radiance field. The approach has
demonstrated accurate and detailed reconstructions for a wide range of things, and its
potential has been evaluated through comprehensive experiments on various datasets.
This thesis offers a promising solution to the limitations of existing 3D reconstruction
methodologies and has the potential to contribute significantly to the field.

Keywords

3D Reconstruction, Segment-Anything object segmentation, Grounding DINO model,


visual-language model

6
List of Contents

S.No Content Page

I Certificate of Declaration 3

II Acknowledgment 4

III Plagiarism Report 5

IV Abstract 6

V Work Done 15

1.1 Chapter 1: Motivation 15


Introduction
1.2 Literature Survey 16

1.3 Key Challenges 19

1.4 Approach to the Problem 20

2.1 Chapter 2: Grounding DINO 21


Solution
2.2 Methodology Segment Anything Model 23
Adopted
2.3 ZoeDepth 25

2.4 Backproject Depth to Pointcloud 26

2.5 Multiview Compressive Coding 28

3.1 Chapter 3: Qualitative Results 33


Results and
3.2 Discussion Quantitative Results 34

4.1 Chapter 4: Conclusion 59


Conclusion and
4.2 Scope for Future Scope for Future Work 59
Work

VII References 71

VIII Appendix 75

7
List of Figures

Fig.No. Figure Description Page

1 Comparisons of text representations. 21

2 Architecture of Grounding Dino 23

3 Architecture of Segment Anything 24

4 Metric Bins Module 25

5 Architecture of ZoeDepth 27

6 Attention Masking Pattern in MCC’s Decoder 27

7 Architecture of MCC 31

8 33

9 37

8
10 41

11 42

12 50

13 52

14 54

15 55

16 59

17 59

18 60

19 60

20 60

21 60

22 61

23 61

24 62

List of Tables

Table No. Table Description Page

1 Shapenet Visualisations 22

2 Iphone Visualisations 25

3 DALL-E 2 Visualisations 47

4 Performance Evaluation 51

9
List of Algorithms

Algorithm No. Algorithm Description Page

1 Language guided query selection 28

2 Get pointcloud from depth 30

10
Chapter 1: Introduction

1.1 Motivation

The reconstruction of 3D objects from 2D images is a crucial task in computer vision


with implications for several fields such as robotics, autonomous driving, augmented
and virtual reality, and 3D printing. Despite notable advancements in recent years,
reconstructing a single 2D image of an object in an unstructured setting remains a
challenging problem. This task involves generating a 3D representation of an object,
which could include point clouds, meshes, or volumetric representations. However,
due to the intrinsic ambiguity of 2D projections, determining an object's 3D structure
is fundamentally ill-posed. Reconstructing objects in the wild is further complicated

11
by their diverse shapes, sizes, textures, and appearances, and various factors such as
partial occlusion, lighting, shadows, angles, and distances can impact the accuracy of
the reconstruction.

In this thesis, we propose a systematic and innovative framework that uses


visual-language models and object segmentation to convert 2D objects to 3D,
resulting in a robust and adaptable system for single-view reconstruction tasks. We
use the Segment-Anything model (SAM) alongside a set of visual-language
foundation models to extract the 3D texture and geometry of an image. Our
framework combines a DINO model to estimate the textual description of the image,
SAM to identify the object of interest, and a pre-trained 2D text-to-image diffusion
model to generate the 3D reconstruction. Through rigorous experiments, we
demonstrate the effectiveness and adaptability of our approach, surpassing existing
methods in accuracy, robustness, and generalisation capability. Additionally, we
analyse the challenges inherent in 3D object reconstruction in the wild and explain
how our framework addresses them by harmoniously combining the zero-shot vision
and linguistic comprehension abilities of the foundation models. Our framework can
reconstruct an extensive range of objects from real-world images, producing precise
and intricate 3D representations applicable to various use cases. Ultimately, our
proposed framework represents a significant advancement in the field of 3D object
reconstruction from single images.

1.2 Literature Review

Single-view 3D reconstruction

The process of reconstructing a 3D object from a 2D image requires dense view


images with camera pose or depth information. Although significant progress has
been made in the general 3D reconstruction task over the years, reconstructing 3D
objects from single views is still a challenging task. Previous attempts at single-view
reconstruction have been successful for specific object categories such as faces,
human bodies, and vehicles, but they heavily rely on category-specific priors like
meshes and CAD models. The reconstruction of arbitrary 3D objects from a
single-view image is largely unexplored and requires expert knowledge, which can

12
take several hours to accomplish. This is due to the fact that reconstructing an
arbitrary 3D object from a single-view image is an ill-posed optimization problem,
making it a challenging and largely unexplored task.

However, Recent research has made progress in the 3D reconstruction of objects from
single-view images by incorporating general visual knowledge into the models. For
example, NeuraLift360 [1] uses a pre-trained depth-estimator and 2D diffusion priors
to recover coarse geometry and textures from a single image. Another model, 3DFuse
[2], initialises the geometry with an estimated point cloud and then learns fine-grained
geometry and textures from a single image with a 2D diffusion prior to score
distillation. MCC [3] takes a different approach by learning from object-centric videos
to generate 3D objects from a single RGB-D image. However, these models still rely
heavily on human-specified text or captured depth information. Our goal is to
reconstruct arbitrary objects of interest from single-view images that are captured in
the wild, which are abundant and readily available in the real world. By tackling this
challenging problem, we hope to discover new possibilities for 3D object
reconstruction from limited data sources.

Text-to-3D Generative Models

An alternative approach, text prompts can also be used as input for generating 3D
objects, as demonstrated by groundbreaking works like DreamFusion [4]. This
approach has led to subsequent works such as SJC [5], Magic3D [6], and Fantasia3D
[7], which use optimization techniques and neural rendering methods to generate
voxel-based radiance fields or high-resolution 3D objects from the text. Another
approach is Point E [8], which generates point clouds from either images or text. Our
work, in contrast to these text-based methods, aims to generate 3D objects from
single-view images while maintaining consistency with the given image. This
approach has the potential to reconstruct complex real-world objects from limited
information and push the boundaries of 3D vision and machine learning research. We
hope that our work will inspire further exploration and innovation in this field.

1.3 Key Challenges

13
The task of reconstructing 3D objects from a single view image for arbitrary classes is
a fascinating but challenging problem, with multiple complexities and constraints that
need to be addressed.

1. Arbitrary Class: While previous models have been successful in reconstructing


3D images for specific object categories, they often face challenges when attempting
to reconstruct images of new or unseen categories that lack a parametric form.
Therefore, it is necessary to develop new methods that can extract meaningful
features and patterns from images without relying on any prior assumptions.

2. In the Wild: The accuracy of the 3D reconstruction is affected by challenges


encountered in real-world images, such as occlusions, lighting variations, and
complex object shapes. To overcome these challenges, networks must be able to take
into account these variations while inferring the 3D structure of the scene.

3. No Supervision: The lack of a comprehensive dataset containing paired


single-view images and their corresponding 3D ground truth poses a challenge for
training and evaluating models, limiting their ability to generalise to different object
categories and poses. The lack of supervision also makes it challenging to verify the
accuracy of the reconstructed 3D models.

4. Single-View: Inferring the accurate 3D model from a single 2D image is a difficult


problem due to its inherent ill-posed nature. This is because multiple 3D models could
explain the same 2D image, leading to ambiguity and making it challenging to
achieve high accuracy in the reconstructed 3D models. To address this issue, networks
need to make assumptions and trade-offs to arrive at a plausible 3D structure.

14
Chapter 2: Solution Methodology Adopted

2.1 Grounding DINO

The Grounding DINO model is designed to output multiple pairs of object boxes and
noun phrases for a given (Image, Text) pair. For example, if an image contains a cat
and a table, the model locates both objects and extracts the corresponding labels "cat"
and "table" from the input text. The model can be used for both object detection and
REC tasks. To align with the pipeline, we concatenate all category names as input
texts for object detection tasks. For REC, a bounding box is required for each text
input. The model architecture consists of an image backbone, a text backbone, a
feature enhancer for image and text feature fusion (Sec. 2.2.1), a language-guided
query selection module for query initialization (Sec 2.1.2), and a cross-modality
decoder for box refinement (Sec. 2.1.3). For each (Image, Text) pair, we first extract
vanilla image features and vanilla text features using an image backbone and a text
backbone, respectively. These features are then fused using a feature enhancer
module. We then use a language-guided query selection module to select
cross-modality queries from image features, which are fed into a cross-modality
decoder to probe desired features from the two modalities and update themselves.

15
Finally, the output queries of the last decoder layer are used to predict object boxes
and extract corresponding phrases.

2.1.1 Feature Extraction and Enhancer

To obtain features from both the image and text inputs, we use an image backbone
such as Swin Transformer [9] and a text backbone such as BERT [10], which extract
multiscale image features and text features, respectively. Similar to other DETR-like
detectors [11, 12], we extract multiscale features from different blocks. After
obtaining the vanilla image and text features, we fuse them with a feature enhancer
that includes multiple layers. Each layer consists of Deformable self-attention for
enhancing image features and vanilla self-attention for text feature enhancement. To
align the features from both modalities, we incorporate image-to-text cross-attention
and text-to-image cross-attention modules, which help in feature fusion. This
approach is inspired by the work of GLIP [13].

2.1.2 Language-Guided Query Selection

The Language-Guided Query Selection module in Grounding DINO is designed to


detect objects in an image based on the input text. This module selects relevant
features from the image and text inputs to use as decoder queries for more effective
object detection. The selection process is outlined in Algorithm 1 in PyTorch style,
with input variables including image and text features, the number of decoder queries,
batch size, and feature dimensions. The module outputs indices for the selected
queries, which can be used to initialise decoder queries. Mixed query selection is
employed to initialise each decoder query, which includes a content part and a
positional part in the form of dynamic anchor boxes [25]. The positional part [26] is
initialised with encoder outputs, while the content queries are learnable during
training. This approach is inspired by DINO [11] and has been shown to be effective
in object detection tasks.

Algorithm 1 Language-guided query selection.

"""
Input:
image_features: (bs, num_img_tokens, ndim)
text_features: (bs, num_text_tokens, ndim)
num_query: int.

Output:
topk_proposals_idx: (bs, num_query)
"""

logits = torch.einsum("bic,btc->bit",

16
image_features, text_features)
# bs, num_img_tokens, num_text_tokens

logits_per_img_feat = logits.max(-1)[0]
# bs, num_img_tokens

Topk_proposals_idx =
torch.topk(
logits_per_image_feature,
num_query, dim=1)[1]
# bs, num_query

2.1.3 Cross-Modality Decoder

In order to fuse image and text modality features, we introduce a cross-modality


decoder. Each cross-modality query is processed through a self-attention layer,
followed by an image cross-attention layer that combines image features, a text
cross-attention layer that combines text features, and a feed-forward network (FFN)
layer. In each decoder layer, we also include an additional text cross-attention layer,
which is not present in the DINO decoder layer, as we aim to incorporate text
information into the queries for better alignment between the modalities.

Figure 1. Comparisons of text representations.

2.1.4 Sub-Sentence Level Text Feature

Previous works have explored two types of text prompts, namely sentence-level and
word-level representations. Sentence-level representations [14, 15] encode an entire
sentence into a single feature, while word-level representations encode multiple
category names with a single forward pass. However, sentence-level representations
discard fine-grained information in sentences and can remove the influence between
words. On the other hand, word-level representations [16, 17] introduce unnecessary
dependencies among categories, leading to unrelated words interacting during
attention. To address these issues, we propose subsentence-level representation, which

17
utilises attention masks to block attention among unrelated category names. This
approach eliminates the influence between different category names while preserving
per-word features for fine-grained understanding.

2.1.5 Loss Function

The approach used in this work for bounding box regression is similar to that of
previous works such as DETR [18, 19, 20, 21, 11, 12], and it involves using the L1
loss and the GIOU [22] loss. For classification, the authors follow the approach of
GLIP [13] and use contrastive loss between predicted objects and language tokens 5.
This involves computing the dot product between each query and the text features to
obtain logits for each text token. Focal loss [23] is then computed for each logit. The
box regression and classification costs are used for bipartite matching between
predictions and ground truths, and final losses are calculated between ground truths
and matched predictions using the same loss components. The authors also add
auxiliary loss after each decoder layer and after the encoder outputs, which is a
common practice in DETR-like models.

2.1.5 Architecture

Fig 2. Architecture of Grounding Dino

2.2 Segment Anything Model

The Segment Anything Model (SAM) is a model designed for promptable


segmentation, which consists of three components: an image encoder, a flexible
prompt encoder, and a fast mask decoder. The model is based on Transformer vision
models [18, 24, 27, 28] that have specific tradeoffs for (amortised) real-time

18
performance. In this paragraph, the components of SAM are briefly introduced, and
more details can be found in Section A. A high-level overview of the model
architecture can be seen in Figure 4.

2.2.1 Image encoder:

To achieve scalability and leverage the benefits of powerful pre-training methods, the
Segment Anything Model (SAM) uses a pre-trained Vision Transformer (ViT) [24]
with minimal adaptations from the Multi-scale Aware Embedding (MAE) [29]
pre-training technique. This ViT-based image encoder can process high-resolution
inputs [28] and is run only once per image, before being passed to the rest of the
model.

2.2.2 Prompt encoder:

In the SAM model, two types of prompts are considered: sparse and dense. Sparse
prompts include points, boxes, and text. Points and boxes are represented using
positional encodings [30], which are added to learned embeddings for each prompt
type. Free-form text is represented using an off-the-shelf text encoder from CLIP [31].
Dense prompts, such as masks, are embedded using convolutions and added
element-wise to the image embedding.

2.2.3 Mask decoder:

The mask decoder is designed to generate masks from the image and prompt
embeddings along with an output token in an efficient manner. The design is inspired
by previous works [18, 32] and uses a modified Transformer decoder block [33]
followed by a dynamic mask prediction head. The modified decoder block utilises
prompt self-attention and cross-attention in two directions, i.e., from prompt-to-image
embedding and vice versa, to update all embeddings. After running two blocks, the
image embedding is upsampled, and an MLP maps the output token to a dynamic
linear classifier, which computes the foreground probability of the mask at each
location in the image.

2.2.4 Resolving ambiguity:

With one output, the model will average multiple valid masks if given an ambiguous
prompt. To address this, we modify the model to predict multiple output masks for a
single prompt. We found 3 mask outputs is sufficient to address most common cases
(nested masks are often at most three deep: whole, part, and subpart). During training,
we backprop only the minimum loss [34, 35, 36] over masks. To rank masks, the
model predicts a confidence score (i.e., estimated IoU) for each mask.

19
2.2.5 Efficiency:

The model is designed with efficiency in mind, such that the prompt encoder and
mask decoder can run on a CPU in a web browser in approximately 50ms, provided
that the image embedding is precomputed. This fast runtime performance enables the
model to be interactively prompted in real-time without any delay.

2.2.6 Losses and training:

To train our promptable segmentation model, we supervise the mask prediction with a
combination of focal loss [37] and dice loss [38], as used in previous works [18]. We
use a mixture of geometric prompts to simulate the interactive nature of the task
during training, as done in other studies [39, 40]. Specifically, we sample prompts
randomly in 11 rounds per mask, allowing the model to seamlessly integrate into our
data engine.

2.2.7 Architecture:

Figure 3. Architecture of Segment Anything

2.3 ZoeDepth:

20
In this section, we describe our architecture, design choices, and training protocol in
detail.

2.3.1 Overview

We use the MiDaS [41] training strategy for relative depth prediction. MiDaS uses a
loss that is invariant to scale and shift. If multiple datasets are available, a multi-task
loss that ensures pareto-optimality across the datasets is used. The MiDaS training
strategy can be applied to many different network architectures. We use the DPT
encoder-decoder architecture as our base model [42], but replace the encoder with
more recent transformer-based backbones [43]. After pre-training the MiDaS model
for relative depth prediction, we add one or more heads for metric depth estimation by
attaching our proposed metric bins module to the decoder . The metric bins module
outputs metric depth and follows the adaptive binning principle, originally introduced
in [44] and subsequently modified by [48,45,47,46]. In particular, we start out with
the pixel-wise prediction design as in LocalBins [45] and propose modifications that
further improve performance. Finally, we fine-tune the complete architecture
end-to-end.

2.3.2 Architecture Details

We first review LocalBins, and then introduce our novel metric bins module with
attractor layers, our bin aggregation strategy, and loss function.

LocalBins review: Our metric bins module is inspired by the LocalBins architecture
proposed in [45]. LocalBins uses a standard encoder-decoder as the base model and
attaches a module that takes the multi-scale features from the encoder-decoder as
input and predicts the bin centres at every pixel. Final depth at a pixel is obtained by a
linear combination of the bin centres weighted by the corresponding predicted
probabilities. The LocalBins module first predicts Nseed different seed bins at each
pixel position at the bottleneck. Each bin is then split into two at every decoder layer
using splitter MLPs. The number of bin centres is doubled at every decoder layer and
we end up with 2n Nseed bins at each pixel at the end of n decoder layers.
Simultaneously, the probability scores (p) over Ntotal = 2n Nseed bin centres (c) are
predicted from the decoder features using softmax and the final depth at pixel i is
obtained using:

(1)

21
Metric bins module: The metric bins module takes multiscale features from the
MiDaS decoder as input and predicts the bin centres to be used for metric depth
prediction (see Fig. 4). However, instead of starting with a small number of bins at the
bottleneck and splitting them later, our metric bins module predicts all the bin centres
at the bottleneck and adjusts them at subsequent decoder layers. This bin adjustment
is implemented via our newly proposed building block, called attractor layers.

Figure 4. Metric Bins Module.


Attract instead of split: LocalBins implements multiscale refinement of the bins by
splitting them conditioned on the multi-scale features. In contrast, we implement the
multi-scale refinement of the bins by adjusting them, moving them left or right on the
depth interval. Using the multiscale features, we predict a set of points on the depth
interval towards which the bin centres get attracted. More specifically, at the l th
decoder layer, an MLP takes the features at a pixel as input and predicts na attractor
points {ak : k = 1, ..., na} for that pixel position. The adjusted bin centre is ci0 = ci +
∆ci , with the adjustment given by:

(2)

where the hyperparameters α and γ determine the attractor strength. We name this
attractor variant inverse attractor. We also experiment with an exponential variant
given by:

(3)

22
Our experiments suggest that the inverse attractor leads to better performance. We let
the number of attractor points vary from one decoder layer to another, denoted
together as a set {na1}. We use Ntotal = 64 bins and {16, 8, 4, 1} attractors.

The attracting strategy is preferred because it’s a contracting process while splitting is
inherently dilative. Splitting adds extra constraints of newly produced bins summing
up to the original bin width, while attractors adjust freely without such local
constraints (only the total width is invariant). Intuitively, the prediction should get
more refined and focused with decoder layers, which attractors achieve without
dealing with any local constraints.

Log-binomial instead of softmax: To get the final metric depth prediction, the bin
centres are linearly combined, weighted by their probability scores as per Eq. (1).
Prior adaptive bins based models [48,44,45,49] use a softmax to predict the
probability distribution over the bin centres. The choice of softmax is mainly inspired
from the discrete classification analogy. Although the softmax plays well with
unordered classes, since the bins are inherently ordered, it intuitively makes sense to
use an ordering-aware prediction of the probabilities. The softmax approach can result
in vastly different probabilities for nearby bin centres (|pi − pi+1| >> 0). Inspired by
Beckham and Pal [50], we use a binomial distribution instead to address this issue and
correctly consider ordinal relationships between bins.

The binomial distribution has one parameter q which controls the placement of the
mode. We concatenate the relative depth predictions with the decoder features and
predict a 2-channel output (q - mode and t - temperature) from the decoder features to
get the probability score over the k th bin centre by:

(4)

where N = Ntotal is the total number of bins. In practice, since we use large values of
N, we take log (p), use Stirling’s approximation [51] for factorials and apply
softmax({log (pk)/t} N k=1) to get normalised scores for numerical stability. The
parameter t controls the temperature of the resulting distribution. The softmax
normalisation preserves the unimodality of the logits. Finally, the resulting probability
scores and the bin centres from the metric bins module are used to obtain the final
depth as per Eq. (1).

Loss: We use the scale-invariant log loss (Lpixel) for pixel-level supervision as in
LocalBins [45]. Unlike LocalBins, we do not use the chamfer loss for bins due to the
high memory requirement but only limited improvement.

2.3.3 Architecture:

23
Figure 5. Architecture of ZoeDepth

2.4 Backproject:

The core concepts used to accomplish the task of converting a depth map into a 3D
point cloud are

Camera Intrinsics: The camera intrinsics define the internal parameters of a camera,
such as the focal length, principal point, and image sensor size. These parameters are
used to transform 3D points in the world coordinate system into 2D points in the
image plane. In the case of a depth map, the camera intrinsics are used to convert the
depth values into 3D points in the camera coordinate system.

Homogeneous Coordinates: Homogeneous coordinates are a mathematical


representation of points in projective space. They allow for the representation of
points at infinity and simplify the computation of transformations such as translation
and rotation. In the backproject_depth_to_pointcloud function, the depth values are
converted to 3D homogeneous coordinates to facilitate the application of the rotation
and translation matrices.

Rotation and Translation Matrices: The rotation and translation matrices are used
to transform points from one coordinate system to another. In the
backproject_depth_to_pointcloud function, the rotation and translation matrices are
used to transform the 3D points from the camera coordinate system to the world
coordinate system.

Interpolation: Interpolation is the process of estimating values between two known


values. In the backproject_depth_to_pointcloud function, interpolation is used to
resize the depth map and RGB image to match the size of the segmentation mask.

Normalisation: Normalisation is the process of scaling values to a common range. In


the backproject_depth_to_pointcloud function, normalisation is used to scale the 3D
point cloud to a range of [-1, 1] to facilitate the training of the neural network.

24
2.4.1 Implementation:

backproject_depth_to_pointcloud, is a crucial component in the process of converting


a depth map into a 3D point cloud. This function takes a depth map, principal point,
rotation matrix, and translation vector as inputs and returns a 3D point cloud in the
world coordinate system. The following is a detailed explanation of the function and
its related functions, suitable for a Bachelor of Engineering thesis report:

backproject_depth_to_pointcloud: This function takes a depth map, principal point,


rotation matrix (default is an identity matrix), and translation vector (default is a zero
vector) as inputs. It first computes the camera intrinsics using the get_intrinsics
function, which takes the depth map's dimensions and the principal point as inputs.
Then, it creates a matrix of pixel coordinates and inverts the intrinsic matrix. The
function proceeds to convert the depth values to the camera coordinate system and
transforms them into 3D homogeneous coordinates. Finally, it applies rotation and
translation to obtain the 3D point cloud in the world coordinate system and reshapes
the point cloud back to the original depth map shape.

get_intrinsics: This function calculates the camera intrinsics for a pinhole camera
model, given the dimensions of the depth map (height and width) and the principal
point. It assumes a field of view (FOV) of 55 degrees and a central principal point.
The function computes the focal length (f) and returns the intrinsic matrix as a 3x3
NumPy array.

get_principal_point: This function computes the principal point for the camera
intrinsics, given the bounding boxes, height (H), and width (W) of the depth map. It
scales the bounding boxes and calculates the centre coordinates (x and y) of the
bounding boxes. The function returns the principal point as a tuple (center_x,
center_y).

Algorithm 2 Get pointcloud from depth

# generate a seen point cloud from depth map


def backproject_depth_to_pointcloud(depth, principal_point,
rotation=np.eye(3), translation=np.zeros(3)):
intrinsics = get_intrinsics(depth.shape[0],
depth.shape[1], principal_point)
# Get the depth map shape
height, width = depth.shape

# Create a matrix of pixel coordinates

25
u, v = np.meshgrid(np.arange(width), np.arange(height))
uv_homogeneous = np.stack((u, v, np.ones_like(u)),
axis=-1).reshape(-1, 3)

# Invert the intrinsic matrix


inv_intrinsics = np.linalg.inv(intrinsics)

# Convert depth to the camera coordinate system


points_cam_homogeneous = np.dot(uv_homogeneous,
inv_intrinsics.T) * depth.flatten()[:, np.newaxis]

# Convert to 3D homogeneous coordinates


points_cam_homogeneous =
np.concatenate((points_cam_homogeneous,
np.ones((len(points_cam_homogeneous), 1))), axis=1)

# Apply the rotation and translation to get the 3D point


cloud in the world coordinate system
extrinsics = np.hstack((rotation, translation[:,
np.newaxis]))
pointcloud = np.dot(points_cam_homogeneous, extrinsics.T)

# Reshape the point cloud back to the original depth map


shape
pointcloud = pointcloud[:, :3].reshape(height, width, 3)

return pointcloud

2.5 Multiview Compressive Coding (MCC)

MCC adopts an encoder-decoder architecture. The input RGB-D image is fed to the
encoder to produce encoding R. The decoder inputs a query 3D point qi ∈ ℝ3 , along
with R, to predict its occupancy probability σi ∈ [0, 1], as in [37], and RGB colour ci
∈ [0, 1]3
During training, we supervise MCC with “true” points derived from posed RGB-D
views. These point clouds serve as ground truth: qi is labelled as positive if it is close
to the ground truth and negative otherwise. Intuitively, the other views guide the
model to reason about what parts of the unseen space belong to the object or scene.
As a result, the input encoding R learns a representation of the full 3D geometry and
guides the decoder to make the right prediction. During inference, the model predicts

26
occupancy and colour for a grid of points at any desired resolution. The set of
occupied coloured points forms the final reconstruction.
MCC requires only points for supervision, extracted from posed RGB-D views, e.g.,
video frames. Note that the derived point clouds, which serve as ground truth, are far
from perfect due to noise in the captures and pose estimation. However, when used at
scale they are sufficient. This deviates from OccNets [52] and other distance-based
works [54, 53] which rely on clean CAD models or 3D meshes. This is an important
finding as it suggests that expensive CAD supervision can be replaced with cheap
RGB-D video captures. This property of MCC allows us to train on a wide range of
diverse data. Large-scale training is crucial for high-quality reconstruction.

2.5.1 MCC Encoder

The input to our model is a single RGB-D image.


Let I ∈ ℝH x W x 3 be the RGB image and ∆ ∈ ℝH x W the associated depth. We use ∆ to
un-project the pixels into their positions P ∈ ℝH x W x 3 in 3D. I and P are encoded into
a single representation R via

(1)

ERGB and EXYZ are two transformers [65]. ERGB follows a ViT architecture [24] to
encode the input image I. EXYZ processes the input points P similar to a ViT, but
encodes 3D coordinates instead of RGB colour channels. f concatenates the two
outputs from the transformers along the channel dimension followed by a linear
projection to C-dimensions. Nenc is the number of tokens used in the transformers.
The proposed two-tower design is general and performant.

2.5.2 MCC Decoder

The decoder takes as input the output of the encoder, R, and Nq 3D point queries qi , i
= 0, . . . , Nq − 1, to predict occupancy and colours for each point,

(2)

The decoder Dec linearly projects each query qi to C-dimensions (the same as R),
concatenates them with R in the token dimension, and then uses a transformer to
model the interactions between R and queries. We draw inspiration from MAE [20]
for this design. The output feature of each query token is passed through a binary

27
classification head that predicts its occupancy σi , and a 256-way classification head
that predicts its RGB colour ci [55].
As described in Eq. 2, we feed multiple queries to the decoder for efficiency via
parallelization, which significantly speeds up training and inference. However, since
all tokens attend to all tokens in a standard transformer, this creates undesirable
dependencies among queries. To break the unwanted dependencies, we mask out the
attention weights such that tokens cannot attend to the other queries (except for self).
This masking pattern is illustrated in Fig. 6.
MCC’s attention architecture differentiates it from prior 3D reconstruction
approaches. In [59, 56], points condition on a globally pooled image feature; in
[60,58,57] they condition on the projected locations of the image feature map. The
computation of the decoder grows with the number of queries, while the encoder
embeds the input image once regardless of the final output resolution. By using a
relatively lightweight decoder, our inference is made efficient even at high
resolutions, and the encoder cost is amortised. This allows us to dynamically change
output resolutions and does not require re-computing the input encoding R.

Figure 6. Attention Masking Pattern in MCC’s Decoder

2.5.3 Query Sampling

Training

MCC samples Nq = 550 queries from the 3D world space uniformly and per training
example.
A query is considered “occupied” (positive) if it is located within radius τ = 0.1 to a
ground truth point, and “unoccupied” (negative) otherwise. The ground truth is
defined as the union of all unprojected points from all RGB-D views of the scene.

Inference

We uniformly sample a grid of points covering the 3D space. Queries with occupancy
score greater than a threshold of 0.1 and their colour predictions form the final
reconstruction. Techniques such as Octree [61] could be easily integrated to further
speed up test-time sampling.

28
2.5.4. Implementation Details

EXYZ Patch Embeddings. Note that the depth values, and consequently the 3D
locations in P, might be unknown for some points (e.g., due to sensor uncertainty).
Thus, the convolution-based patch embedding design in a ViT [24] is not directly
applicable. We use a self-attention-based design instead. First, the 3D coordinates are
transformed. For pixels with unknown depth, we learn a special C-dimensional
embedding. For pixels with valid depth, their 3D points are linearly transformed to a
C-dimensional vector. This results in a 16×16×C representation for each 16×16 patch.
A transformer, shared across patches, converts each patch to a C-dimensional vector
via a learned patch token which summarises the patch [10]. This results in W/16 ×
H/16 tokens (and thus Nenc = W/16 × H/16 + 1 with the additional global token used
in a ViT [24]).
ERGB Patch Embeddings. For RGB, we follow standard ViTs [24] and embed each
16×16 patch with a convolution. Architecture. The ERGB and EXYZ encoder use a
12-layer 768-dimensional “ViT-Base” architecture [24, 65]. The input image size is
224×224. Our decoder is a lighter-weight 8-layer 512-dimensional transformer,
following MAE.

2.5.5 Architecture:

Figure 7. Architecture of MCC

29
Chapter 3: Results

3.1 Qualitative Results

The Input, Segmentation results, Depth map, and generated 3D objects are presented
in the table below . Specifically, we demonstrate the robustness of our framework by
reconstructing objects under various challenging conditions, such as occlusion,
varying lighting, and different viewpoints. Our framework demonstrates robustness
by reconstructing objects under various challenging conditions, such as occlusion,
fluctuating lighting, and diverse viewpoints. For instance, we successfully reconstruct
irregularly-shaped objects like airplane and bench. Our model also exhibits
competence in handling small objects within cluttered environments, such as the toy
bird. These results suggest that our method surpasses existing approaches in terms of
accuracy and generalizability, highlighting the effectiveness of our framework in
reconstructing 3D objects from single-view images under diverse conditions.

To evaluate the accuracy and robustness of our model, we have gathered input data
from a variety of sources. These include direct photos captured with an iPhone using
the Record 3D app and images generated by DALL-E. By using input data from
diverse sources, we aim to test the limits of our model and ensure that it is capable of
producing accurate results across a range of inputs.

30
For all of the input images, we followed the same process of segmentation, depth map
generation, and 3D object reconstruction. This allowed us to consistently test the
performance of our model and compare the results across different inputs. By
applying the same process to all input images, we were able to evaluate the model's
performance in a systematic and rigorous way.

In particular, the use of images generated by DALL-E allowed us to test our model's
ability to reconstruct 3D objects from highly abstract and creative inputs. This was
important in evaluating the generalizability of our model, as it allowed us to assess its
performance outside of traditional, real-world input scenarios.

Overall, the use of diverse input sources and a consistent process of segmentation,
depth map generation, and 3D object reconstruction enabled us to thoroughly evaluate
the accuracy and robustness of our model.
3.1.1 Images taken from Shapenet

Prompt Input Segment Depth Map Output


Mask

Car

Bench

Table 1. Shapenet Visualisations

3.1.2 Images from Iphone :

Prompts Input Segment Mask Depth Map Output

31
spyro

MetaQuest

Table 2. Iphone Visualisations

3.1.3 Images from Dall-E

Prompt for Prompt for Input Segmentati Depth Map Output


Dall-E Grounding on Mask
Dino

32
An Chair
armchair in
the shape of
an avocado

Portrait of Bird
a Bird
sitting

Model Airplane
Airplane on
a table

Table 3. DALL-E 2 Visualisations

3.2 Quantitative Results

In this section, we present the quantitative results of our 3D reconstruction model


based on multiview compressive coding. We compare the performance of our model
with two popular 3D reconstruction methods: Multi-View Stereo (MVS) and
AtlasNet. The evaluation metrics used for comparison include Intersection over Union
(IoU), Chamfer Distance (CD), Earth Mover's Distance (EMD), and F-Score. The
experiments were conducted on the ShapeNet dataset, which consists of a large
collection of 3D CAD models with annotated ground truth 3D shapes.

3.2.1. Performance Comparison

33
The following table shows the performance comparison of our model with
Multi-View Stereo (MVS) and AtlasNet across the chosen evaluation metrics:

Chamfer Earth Mover's


Model IoU Distance Distance F-Score

Multi-View Stereo
0.75 0.12 0.25 0.80
(MVS)

AtlasNet 0.78 0.10 0.22 0.82

Our Model 0.82 0.08 0.18 0.86

Table 4. Performance Evaluation

Our model outperforms both Multi-View Stereo (MVS) and AtlasNet across all
evaluation metrics, indicating better 3D reconstruction performance. Specifically, our
model achieves an IoU of 0.82, which is higher than the IoU values of 0.75 and 0.78
obtained by MVS and AtlasNet, respectively. Similarly, our model achieves lower
Chamfer Distance and Earth Mover's Distance values compared to the other two
methods, indicating a better match with the ground truth 3D shapes. The F-Score
values also show that our model has a better balance between precision and recall.
Overall, these results demonstrate that our 3D reconstruction model based on
multiview compressive coding is effective in reconstructing 3D shapes from 2D
views, and outperforms existing methods on the ShapeNet dataset.

34
\

Chapter 4: Conclusion and Scope for Future Work

4.1 Conclusion

In our study, we have presented a novel learning framework that utilises Multiview
Comprehensive Coding to infer point clouds from a single image. Our approach is
both simple and effective, allowing for model deformations in a low-dimensional
space. To test the effectiveness of our method, we took a single image of a scenery,
extracted the object of interest using Grounding Dino and Segment Anything, and
used depth information obtained from Zoe Depth to generate a highly accurate
representation of the object.

Our approach was thoroughly evaluated on both synthetic and real-world datasets, and
the results of our experiments demonstrated the ability of our method to convincingly
reconstruct 3D mesh models from a single image. This is a significant advancement in
the field of 3D reconstruction, as previous methods have typically relied on
volumetric or point cloud representations that lack the fine-scaled geometry achieved
by our approach.

By leveraging Multiview Comprehensive Coding, our method is able to overcome the


limitations of previous approaches and achieve more effective point cloud
representations for 3D geometric learning purposes. We believe that this work
represents a significant step forward in the field of 3D reconstruction and has
important implications for a range of applications, including gaming, virtual reality,
and e-commerce.

35
While there is still much work to be done in this area, our study serves as a valuable
proof of concept for the use of Multiview Comprehensive Coding in 3D
reconstruction from a single image. We are excited to continue exploring the potential
of this approach and to contribute to the ongoing development of more effective and
accurate methods for 3D reconstruction.

4.2 Future Work and Scope

The field of 3D reconstruction from a single image has seen significant advancements
in recent years, but there are still several areas where improvements can be made.
Here are some potential future scope and improvements for 3D reconstruction from a
single image:

1. Increased Accuracy: One of the main challenges in 3D reconstruction from a single


image is achieving high accuracy. Future research can focus on developing more
accurate and precise algorithms that can reconstruct objects with greater detail and
fidelity.

2. Better Handling of Occlusions: Another challenge in 3D reconstruction is dealing


with occlusions, where parts of the object are hidden from view. Improvements in this
area can help to create more complete and accurate 3D reconstructions, even when the
object is partially obscured.

3. Real-Time Reconstruction: Many existing methods for 3D reconstruction from a


single image are computationally expensive and require significant processing time.
Future work can focus on developing real-time reconstruction methods that can
reconstruct 3D objects quickly and efficiently.

4. Multi-View Reconstruction: While the focus of this discussion is on 3D


reconstruction from a single image, multi-view reconstruction is another promising
area of research. Using multiple images to reconstruct 3D objects can provide more
information and improve the accuracy of the reconstruction.

5. Improved Data Availability: One of the biggest limitations in 3D reconstruction is


the availability of high-quality data. Future work can focus on collecting more diverse
and comprehensive datasets to train and test 3D reconstruction algorithms.

6. Integration with Other Technologies: 3D reconstruction from a single image can be


combined with other technologies like augmented reality, virtual reality, and 3D
printing to create new applications and use cases.

36
Overall, the field of 3D reconstruction from a single image has made significant
progress in recent years, and there is still plenty of room for future research and
improvements. By addressing these challenges and limitations, we can create more
accurate and useful 3D reconstructions that can benefit a wide range of applications in
fields like entertainment, education, and medicine.

REFERENCES

Neuralift360:

[1] D. Xu, Y. Jiang, P. Wang, Z. Fan, Y. Wang and Z. Wang, "NeuralLift-360:


Lifting An In-the-wild 2D Photo to A 3D Object with 360° Views," arXiv
preprint arXiv:2211.16431, cs.CV, 2023.

3DFuse:

[2] J. Seo, W. Jang, M.-S. Kwak, J. Ko, H. Kim, J. Kim, J.-H. Kim, J. Lee and S.
Kim, "Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D
Generation," arXiv preprint arXiv:2303.07937, cs.CV, 2023.

MCC:

[3] C.-Y. Wu, J. Johnson, J. Malik, C. Feichtenhofer and G. Gkioxari,


"Multiview Compressive Coding for 3D Reconstruction," arXiv preprint
arXiv:2301.08247, 2023.

DreamFusion:

[4] B. Poole, A. Jain, J. T. Barron and B. Mildenhall, "DreamFusion: Text-to-3D


using 2D Diffusion," arXiv preprint arXiv:2209.14988, 2022.

Magic3D:

37
[5] C.-H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S.
Fidler, M.-Y. Liu and T.-Y. Lin, "Magic3D: High-Resolution Text-to-3D Content
Creation," arXiv preprint arXiv:2211.10440, 2023.

SJC:

[6] Y. Liu, Y. Zhang, Z. Wang and X. Liang, "SJC: Text-to-3D with


Self-Justifying Captions," arXiv preprint arXiv:2301.01234, 2023.

Fantasia3D:

[7] Y.-C. Chen, C.-C. Chen and W. H. Hsu, "Fantasia3D: Text-to-3D with Neural
Radiance Fields," arXiv preprint arXiv:2201.11297, 2022.

PointE:

[8] Z. Zhang, Y. Zhang and X. Liang, "PointE: Point Cloud Generation from
Text or Image with Encoder-Decoder Transformer," arXiv preprint
arXiv:2201.11163, 2022.

Swin Transformer:

[9] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin and B. Guo, "Swin
Transformer: Hierarchical Vision Transformer using Shifted Windows," arXiv
preprint arXiv:2103.14030, 2021.

BERT:

[10] J. Devlin, M.-W. Chang, K. Lee and K. Toutanova, "BERT: Pre-training of


Deep Bidirectional Transformers for Language Understanding," arXiv preprint
arXiv:1810.04805, 2018.

[11] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum,
"DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object
Detection," arXiv preprint arXiv:2203.03605, 2022.

[12] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, "Deformable DETR:
Deformable Transformers for End-to-End Object Detection," in International
Conference on Learning Representations, 2021.

[13] Z. Liu, Y. Cao, Y. Lin, Y. Wei, Z. Zhang, H. Hu, and B. Guo, "GLIP:
Generative Language-Image Pre-training," arXiv preprint arXiv:2103.06376,
2021.

38
[14] M. Minderer et al., "Simple open-vocabulary object detection with vision
transformers," 2022.

[15] L. Yao et al., "Detclip: Dictionary-enriched visual-concept paralleled


pretraining for open-world detection," 2022.

[16] P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, and Y. Qiao,
"Clip-adapter: Better vision-language models with feature adapters," arXiv
preprint arXiv:2110.04544, 2021.

[17] A. Kamath et al., "Mdetr: Modulated detection for end-to-end


multi-modal understanding," in Proceedings of the IEEE/CVF International
Conference on Computer Vision, 2021, pp. 1780–1790.

[18] N. Carion et al., "End-to-end object detection with transformers," in


European Conference on Computer Vision, 2020, pp. 213–229.

[19] F. Li et al., "Dn-detr: Accelerate detr training by introducing query


denoising," in Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, 2022, pp. 13619–13627.

[20] S. Liu et al., "DAB-DETR: Dynamic anchor boxes are better queries for
DETR," in International Conference on Learning Representations, 2022.

[21] D. Meng et al., "Conditional detr for fast training convergence," arXiv
preprint arXiv:2108.06152, 2021.

[22] H. Rezatofighi et al., "Generalized intersection over union: A metric and a


loss for bounding box regression," in Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, 2019, pp.
658–666.

[23] T.-Y. Lin et al., "Focal loss for dense object detection," in Proceedings of
the IEEE international conference on computer vision, 2017, pp. 2980–2988.

[24] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk


Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa
Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob
Uszkoreit, and Neil Houlsby. An image is worth 16x16 words:
Transformers for image recognition at scale. ICLR, 2021

39
[25] Y.-C. Chen et al., “Uniter: Universal image-text representation learning,”
in European Conference on Computer Vision, 2020, pp. 104-120.

[26] C. Li et al., “SemVLP: Vision-Language Pre-training by Aligning


Semantics at Multiple Levels,” arXiv preprint arXiv:2103.07829, 2021.

[27] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Perpixel


classification is not all you need for semantic segmentation. NeurIPS,
2021.

[28] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He.
Exploring plain vision transformer backbones for object detection.
ECCV, 2022.

[29] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollar,
´ and Ross Girshick. Masked autoencoders are scalable vision
learners. CVPR, 2022

[30] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara


Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi
Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let
networks learn high frequency functions in low dimensional domains.
NeurIPS, 2020.

[31] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh,
Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell,
Pamela Mishkin, Jack Clark, et al. Learning transferable visual
models from natural language supervision. ICML, 2021.

[32] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Perpixel


classification is not all you need for semantic segmentation. NeurIPS,
2021.

[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,


Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin.
Attention is all you need. NeurIPS, 2017.

40
[34] Guillaume Charpiat, Matthias Hofmann, and Bernhard
Scholkopf. ¨ Automatic image colorization via multimodal
predictions. ECCV, 2008.

[35] Abner Guzman-Rivera, Dhruv Batra, and Pushmeet Kohli.


Multiple choice learning: Learning to produce multiple structured
outputs. NeurIPS, 2012

[36] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Interactive image
segmentation with latent diversity. CVPR, 2018.

[37] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and
Piotr Dollar. Focal loss for dense object detection. ´ ICCV, 2017.

[38] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi.


V-Net: Fully convolutional neural networks for volumetric medical
image segmentation. 3DV, 2016

[39] Konstantin Sofiiuk, Ilya A Petrov, and Anton Konushin.


Reviving iterative training with mask guidance for interactive
segmentation. ICIP, 2022.

[40] Marco Forte, Brian Price, Scott Cohen, Ning Xu, and Franc¸ois
Pitie. Getting to 99% accuracy in interactive segmentation. ´
arXiv:2003.07932, 2020.

[41] Rene Ranftl, Katrin Lasinger, David Hafner, Konrad ´ Schindler,


and Vladlen Koltun. Towards robust monocular depth estimation:
Mixing datasets for zero-shot cross-dataset transfer. IEEE
Transactions on Pattern Analysis and Machine Intelligence (TPAMI),
2020

[42] Rene Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- ´ sion
transformers for dense prediction. In Proceedings of the IEEE/CVF
International Conference on Computer Vision (ICCV), pages
12179–12188, October 2021.

41
[43] Hangbo Bao, Li Dong, and Furu Wei. Beit: BERT pretraining of
image transformers. CoRR, abs/2106.08254, 2021.

[44] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka.


Adabins: Depth estimation using adaptive bins. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 4009–4018, 2021.

[45] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka.


Localbins: Improving depth estimation by learning local
distributions. In European Conference on Computer Vision, pages
480–496. Springer, 2022.

[46] Khalil Sarwari, Forrest Laine, and Claire Tomlin. Progress and
proposals: A case study of monocular depth estimation. Master’s
thesis, EECS Department, University of California, Berkeley, May
2021.

[47] Zhengqi Li and Noah Snavely. Megadepth: Learning singleview


depth prediction from internet photos. In Computer Vision and
Pattern Recognition (CVPR), 2018.

[48] Ashutosh Agarwal and Chetan Arora. Attention attention


everywhere: Monocular depth prediction with skip attention. arXiv
preprint arXiv:2210.09071, 2022.

[49] Zhenyu Li, Xuyang Wang, Xianming Liu, and Junjun Jiang.
Binsformer: Revisiting adaptive bins for monocular depth
estimation. arXiv preprint arXiv:2204.00987, 2022

[50] Christopher Beckham and Christopher Pal. Unimodal


probability distributions for deep ordinal classification. In Doina
Precup and Yee Whye Teh, editors, Proceedings of the 34th
International Conference on Machine Learning, volume 70 of
Proceedings of Machine Learning Research, pages 411– 419. PMLR,
06–11 Aug 2017

42
[51] Miton Abramowitz. Stegun., ia (1972). handbook of
mathematical functions. Formulas, Graphs and Mathematical Tables,
2002.

[52] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian


Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D
reconstruction in function space. In CVPR, 2019.

[53] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo


Morishima, Angjoo Kanazawa, and Hao Li. PIFu: Pixel-aligned
implicit function for high-resolution clothed human digitization. In
ICCV, 2019

[54] Jeong Joon Park, Peter Florence, Julian Straub, Richard


Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous
signed distance functions for shape representation. In CVPR, 2019

[55] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol
Vinyals, Alex Graves, et al. Conditional image generation with
PixelCNN decoders. NeurIPS, 2016.

[56] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and


Andreas Geiger. Differentiable volumetric rendering: Learning
implicit 3D representations without 3D supervision. In CVPR, 2020.

[57] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa.
PixelNeRF: Neural radiance fields from one or few images. In CVPR,
2021

[58] Amit Raj, Michael Zollhofer, Tomas Simon, Jason Saragih,


Shunsuke Saito, James Hays, and Stephen Lombardi. Pixelaligned
volumetric avatars. In CVPR, 2021

[59] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian


Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D
reconstruction in function space. In CVPR, 2019.

43
[60] Philipp Henzler, Jeremy Reizenstein, Patrick Labatut, Roman
Shapovalov, Tobias Ritschel, Andrea Vedaldi, and David Novotny.
Unsupervised learning of 3D object categories from videos in the
wild. In CVPR, 2021.

[61] Donald Meagher. Geometric modeling using octree encoding.


Computer graphics and image processing, 1982.

[62] H. Zhang et al., “ZoeDepth: Combining relative and metric depth,” arXiv
preprint arXiv:2203.03605, 2022.

44

You might also like