Thesis - Sem 8 (1)
Project
On
We, Akshit Pareek (2019UIC3588), Anish Gupta (2019UIC3591) and Gaurav Bhatt (2019UIC3598), students of B. Tech., Department of Instrumentation and Control Engineering, hereby declare that the Project-Thesis titled “3D Reconstruction of an object from a Single Image and a Text Prompt”, which is submitted by us to the Department of Instrumentation and Control Engineering, Netaji Subhas University of Technology (NSUT), Dwarka, New Delhi, in partial fulfilment of the requirement for the award of the degree of Bachelor of Technology, is our original work and has not been copied from any source without proper citation. The manuscript has been subjected to a plagiarism check using Turnitin software. This work has not previously formed the basis for the award of any other degree.
Place:
Date:
CERTIFICATE OF DECLARATION
This is to certify that the work embodied in the thesis titled “3D Reconstruction of an object from a Single Image and a Text Prompt” by Akshit Pareek (2019UIC3588), Anish Gupta (2019UIC3591) and Gaurav Bhatt (2019UIC3598) is the bona fide work of the group, submitted to Netaji Subhas University of Technology for consideration in the 8th Semester B.Tech. Project Evaluation.
The original research work was carried out by the team under my/our guidance and supervision in the academic year 2022-2023. This work has not been submitted for any other diploma or degree of any university. On the basis of the declaration made by the group, we recommend the project report for evaluation.
ACKNOWLEDGEMENT
We would like to express our gratitude and appreciation to all those who made it possible for us to complete this project. Special thanks to our project supervisor, Prof. Asha Rani, whose help, stimulating suggestions, and encouragement helped us in writing this report. We also sincerely thank our colleagues for the time they spent proofreading and correcting our mistakes.
We would also like to acknowledge with much appreciation the crucial role of the staff of the Department of Instrumentation and Control Engineering and the Centre of Excellence in Artificial Intelligence, who gave us permission to use the laboratory equipment and all the necessary tools.
PLAGIARISM REPORT
ABSTRACT
Keywords
List of Contents
I Certificate of Declaration
II Acknowledgment
IV Abstract
V Work Done
VII References
VIII Appendix
List of Figures
5 Architecture of ZoeDepth
7 Architecture of MCC
List of Tables
1 ShapeNet Visualisations
2 iPhone Visualisations
3 DALL-E 2 Visualisations
4 Performance Evaluation
List of Algorithms
Chapter 1: Introduction
1.1 Motivation
by their diverse shapes, sizes, textures, and appearances, and various factors such as
partial occlusion, lighting, shadows, angles, and distances can impact the accuracy of
the reconstruction.
Single-view 3D reconstruction
take several hours to accomplish. This is because reconstructing an arbitrary 3D object from a single-view image is an ill-posed optimization problem, making it a challenging and largely unexplored task.
However, recent research has made progress in the 3D reconstruction of objects from
single-view images by incorporating general visual knowledge into the models. For
example, NeuraLift360 [1] uses a pre-trained depth-estimator and 2D diffusion priors
to recover coarse geometry and textures from a single image. Another model, 3DFuse
[2], initialises the geometry with an estimated point cloud and then learns fine-grained geometry and textures from a single image with a 2D diffusion prior via score distillation. MCC [3] takes a different approach by learning from object-centric videos
to generate 3D objects from a single RGB-D image. However, these models still rely
heavily on human-specified text or captured depth information. Our goal is to
reconstruct arbitrary objects of interest from single-view images that are captured in
the wild, which are abundant and readily available in the real world. By tackling this
challenging problem, we hope to discover new possibilities for 3D object
reconstruction from limited data sources.
As an alternative approach, text prompts can also be used as input for generating 3D
objects, as demonstrated by groundbreaking works like DreamFusion [4]. This
approach has led to subsequent works such as SJC [5], Magic3D [6], and Fantasia3D
[7], which use optimization techniques and neural rendering methods to generate
voxel-based radiance fields or high-resolution 3D objects from the text. Another
approach is Point E [8], which generates point clouds from either images or text. Our
work, in contrast to these text-based methods, aims to generate 3D objects from
single-view images while maintaining consistency with the given image. This
approach has the potential to reconstruct complex real-world objects from limited
information and push the boundaries of 3D vision and machine learning research. We
hope that our work will inspire further exploration and innovation in this field.
The task of reconstructing 3D objects from a single view image for arbitrary classes is
a fascinating but challenging problem, with multiple complexities and constraints that
need to be addressed.
Chapter 2: Solution Methodology Adopted
The Grounding DINO model is designed to output multiple pairs of object boxes and
noun phrases for a given (Image, Text) pair. For example, if an image contains a cat
and a table, the model locates both objects and extracts the corresponding labels "cat"
and "table" from the input text. The model can be used for both object detection and
REC tasks. To align with the pipeline, we concatenate all category names as input
texts for object detection tasks. For REC, a bounding box is required for each text
input. The model architecture consists of an image backbone, a text backbone, a
feature enhancer for image and text feature fusion (Sec. 2.2.1), a language-guided
query selection module for query initialization (Sec 2.1.2), and a cross-modality
decoder for box refinement (Sec. 2.1.3). For each (Image, Text) pair, we first extract
vanilla image features and vanilla text features using an image backbone and a text
backbone, respectively. These features are then fused using a feature enhancer
module. We then use a language-guided query selection module to select
cross-modality queries from image features, which are fed into a cross-modality
decoder to probe desired features from the two modalities and update themselves.
Finally, the output queries of the last decoder layer are used to predict object boxes
and extract corresponding phrases.
To obtain features from both the image and text inputs, we use an image backbone
such as Swin Transformer [9] and a text backbone such as BERT [10], which extract
multiscale image features and text features, respectively. Similar to other DETR-like
detectors [11, 12], we extract multiscale features from different blocks. After
obtaining the vanilla image and text features, we fuse them with a feature enhancer
that includes multiple layers. Each layer consists of Deformable self-attention for
enhancing image features and vanilla self-attention for text feature enhancement. To
align the features from both modalities, we incorporate image-to-text cross-attention
and text-to-image cross-attention modules, which help in feature fusion. This
approach is inspired by the work of GLIP [13].
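To make the fusion step above concrete, the following is a minimal PyTorch sketch of one such feature-enhancer layer. It is illustrative only: standard multi-head attention stands in for the deformable self-attention used on the image side, and the module and parameter names are our own rather than those of the Grounding DINO implementation.

import torch.nn as nn

class FeatureEnhancerLayer(nn.Module):
    """Illustrative sketch of one image-text fusion layer (not the actual
    Grounding DINO code). Standard multi-head attention replaces the
    deformable self-attention used for the image features."""
    def __init__(self, dim=256, nhead=8):
        super().__init__()
        self.img_self_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.txt_self_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.img_cross_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.txt_cross_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)

    def forward(self, img, txt):
        # Enhance each modality with self-attention
        img = img + self.img_self_attn(img, img, img)[0]
        txt = txt + self.txt_self_attn(txt, txt, txt)[0]
        # Cross-modality fusion: image features attend to text and vice versa
        img = img + self.img_cross_attn(img, txt, txt)[0]
        txt = txt + self.txt_cross_attn(txt, img, img)[0]
        return img, txt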
"""
Input:
image_features: (bs, num_img_tokens, ndim)
text_features: (bs, num_text_tokens, ndim)
num_query: int.
Output:
topk_proposals_idx: (bs, num_query)
"""
logits = torch.einsum("bic,btc->bit",
16
image_features, text_features)
# bs, num_img_tokens, num_text_tokens
logits_per_img_feat = logits.max(-1)[0]
# bs, num_img_tokens
Topk_proposals_idx =
torch.topk(
logits_per_image_feature,
num_query, dim=1)[1]
# bs, num_query
Previous works have explored two types of text prompts, namely sentence-level and
word-level representations. Sentence-level representations [14, 15] encode an entire
sentence into a single feature, while word-level representations encode multiple
category names with a single forward pass. However, sentence-level representations
discard fine-grained information in sentences and can remove the influence between
words. On the other hand, word-level representations [16, 17] introduce unnecessary
dependencies among categories, leading to unrelated words interacting during
attention. To address these issues, we propose subsentence-level representation, which
utilises attention masks to block attention among unrelated category names. This
approach eliminates the influence between different category names while preserving
per-word features for fine-grained understanding.
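As an illustration of this idea, the short sketch below builds such an attention mask, assuming each text token carries an integer id identifying the category name it belongs to (an encoding we introduce here for clarity; it is not the exact Grounding DINO implementation).

import torch

def subsentence_attention_mask(category_ids):
    """category_ids: (num_text_tokens,) integer tensor; tokens of the same
    category name share an id. Returns a (T, T) boolean mask where True
    means attention is allowed."""
    ids = category_ids.unsqueeze(0)
    # A token may only attend to tokens of its own category name, which blocks
    # interactions between unrelated categories while keeping per-word features.
    return ids == ids.t()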
The approach used in this work for bounding box regression is similar to that of
previous works such as DETR [18, 19, 20, 21, 11, 12], and it involves using the L1 loss and the GIoU [22] loss. For classification, the authors follow the approach of GLIP [13] and use a contrastive loss between predicted objects and language tokens.
This involves computing the dot product between each query and the text features to
obtain logits for each text token. Focal loss [23] is then computed for each logit. The
box regression and classification costs are used for bipartite matching between
predictions and ground truths, and final losses are calculated between ground truths
and matched predictions using the same loss components. The authors also add
auxiliary loss after each decoder layer and after the encoder outputs, which is a
common practice in DETR-like models.
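The classification part of this loss can be sketched as follows. This is only an approximation of the GLIP-style alignment described above, with hypothetical tensor shapes: the dot product between decoder queries and text tokens produces the logits, and a sigmoid focal loss is then applied to each logit.

import torch
from torchvision.ops import sigmoid_focal_loss

def classification_loss(query_feats, text_feats, target):
    """query_feats: (bs, num_query, ndim); text_feats: (bs, num_text_tokens, ndim);
    target: (bs, num_query, num_text_tokens) binary query-token alignment labels."""
    # Dot product between each decoder query and each text token gives the logits
    logits = torch.einsum("bqc,btc->bqt", query_feats, text_feats)
    # Focal loss is then computed for each logit
    return sigmoid_focal_loss(logits, target.float(), reduction="mean")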
2.1.5 Architecture
performance. In this paragraph, the components of SAM are briefly introduced, and
more details can be found in Section A. A high-level overview of the model
architecture can be seen in Figure 4.
To achieve scalability and leverage the benefits of powerful pre-training methods, the
Segment Anything Model (SAM) uses a pre-trained Vision Transformer (ViT) [24]
with minimal adaptations from the masked autoencoder (MAE) [29]
pre-training technique. This ViT-based image encoder can process high-resolution
inputs [28] and is run only once per image, before being passed to the rest of the
model.
In the SAM model, two types of prompts are considered: sparse and dense. Sparse
prompts include points, boxes, and text. Points and boxes are represented using
positional encodings [30], which are added to learned embeddings for each prompt
type. Free-form text is represented using an off-the-shelf text encoder from CLIP [31].
Dense prompts, such as masks, are embedded using convolutions and added
element-wise to the image embedding.
The mask decoder is designed to generate masks from the image and prompt
embeddings along with an output token in an efficient manner. The design is inspired
by previous works [18, 32] and uses a modified Transformer decoder block [33]
followed by a dynamic mask prediction head. The modified decoder block utilises
prompt self-attention and cross-attention in two directions, i.e., from prompt-to-image
embedding and vice versa, to update all embeddings. After running two blocks, the
image embedding is upsampled, and an MLP maps the output token to a dynamic
linear classifier, which computes the foreground probability of the mask at each
location in the image.
With a single output, the model will average multiple valid masks if given an ambiguous prompt. To address this, we modify the model to predict multiple output masks for a single prompt. We found that three mask outputs are sufficient to address most common cases (nested masks are often at most three deep: whole, part, and subpart). During training, we backpropagate only the minimum loss [34, 35, 36] over the masks. To rank masks, the model predicts a confidence score (i.e., an estimated IoU) for each mask.
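A minimal sketch of this ambiguity-aware training rule, under the assumption that the per-mask loss function returns one value per example, is shown below; the names and shapes are illustrative.

import torch

def multimask_loss(pred_masks, gt_mask, mask_loss_fn):
    """pred_masks: (bs, 3, H, W), the three candidate masks for one prompt;
    gt_mask: (bs, H, W); mask_loss_fn returns a per-example loss of shape (bs,)."""
    losses = torch.stack(
        [mask_loss_fn(pred_masks[:, k], gt_mask) for k in range(pred_masks.shape[1])],
        dim=1)  # (bs, 3)
    # Backpropagate only through the best-matching candidate, so each output
    # can specialise to a different valid interpretation (whole / part / subpart)
    return losses.min(dim=1).values.mean()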
2.2.5 Efficiency:
The model is designed with efficiency in mind, such that the prompt encoder and
mask decoder can run on a CPU in a web browser in approximately 50 ms, provided that the image embedding is precomputed. This fast runtime enables the model to be prompted interactively in real time.
To train our promptable segmentation model, we supervise the mask prediction with a
combination of focal loss [37] and dice loss [38], as used in previous works [18]. We
use a mixture of geometric prompts to simulate the interactive nature of the task
during training, as done in other studies [39, 40]. Specifically, we sample prompts
randomly in 11 rounds per mask, allowing the model to seamlessly integrate into our
data engine.
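A sketch of this supervision signal is given below. The relative weighting of the two terms is a placeholder rather than the ratio actually used by SAM, and torchvision's focal loss is used for brevity.

import torch
from torchvision.ops import sigmoid_focal_loss

def dice_loss(mask_logits, target, eps=1.0):
    # mask_logits, target: (bs, H, W); target is a binary ground-truth mask
    pred = mask_logits.sigmoid().flatten(1)
    tgt = target.flatten(1)
    inter = (pred * tgt).sum(-1)
    return 1.0 - (2.0 * inter + eps) / (pred.sum(-1) + tgt.sum(-1) + eps)

def mask_supervision_loss(mask_logits, target, focal_weight=1.0, dice_weight=1.0):
    # Weighted combination of focal and dice terms (weights are illustrative)
    focal = sigmoid_focal_loss(mask_logits, target.float(), reduction="none").flatten(1).mean(-1)
    return (focal_weight * focal + dice_weight * dice_loss(mask_logits, target)).mean()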
2.2.7 Architecture:
2.3 ZoeDepth:
In this section, we describe our architecture, design choices, and training protocol in
detail.
2.3.1 Overview
We use the MiDaS [41] training strategy for relative depth prediction. MiDaS uses a
loss that is invariant to scale and shift. If multiple datasets are available, a multi-task
loss that ensures Pareto optimality across the datasets is used.
strategy can be applied to many different network architectures. We use the DPT
encoder-decoder architecture as our base model [42], but replace the encoder with
more recent transformer-based backbones [43]. After pre-training the MiDaS model
for relative depth prediction, we add one or more heads for metric depth estimation by
attaching our proposed metric bins module to the decoder. The metric bins module outputs metric depth and follows the adaptive binning principle, originally introduced in [44] and subsequently modified by [48, 45, 47, 46]. In particular, we start out with
the pixel-wise prediction design as in LocalBins [45] and propose modifications that
further improve performance. Finally, we fine-tune the complete architecture
end-to-end.
We first review LocalBins, and then introduce our novel metric bins module with
attractor layers, our bin aggregation strategy, and loss function.
LocalBins review: Our metric bins module is inspired by the LocalBins architecture
proposed in [45]. LocalBins uses a standard encoder-decoder as the base model and
attaches a module that takes the multi-scale features from the encoder-decoder as
input and predicts the bin centres at every pixel. Final depth at a pixel is obtained by a
linear combination of the bin centres weighted by the corresponding predicted
probabilities. The LocalBins module first predicts N_seed different seed bins at each pixel position at the bottleneck. Each bin is then split into two at every decoder layer using splitter MLPs. The number of bin centres is doubled at every decoder layer, and we end up with 2^n N_seed bins at each pixel at the end of n decoder layers. Simultaneously, the probability scores (p) over the N_total = 2^n N_seed bin centres (c) are predicted from the decoder features using a softmax, and the final depth at pixel i is obtained using:

d(i) = \sum_{k=1}^{N_{total}} p_i(k) \, c_i(k)    (1)
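In code, Eq. (1) amounts to a softmax-weighted sum over the bin dimension; a minimal sketch with assumed tensor shapes is:

import torch

def depth_from_bins(bin_centers, bin_logits):
    # bin_centers, bin_logits: (bs, N_total, H, W)
    probs = torch.softmax(bin_logits, dim=1)
    # Final depth is the probability-weighted linear combination of bin centres (Eq. 1)
    return (probs * bin_centers).sum(dim=1, keepdim=True)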
Metric bins module: The metric bins module takes multiscale features from the
MiDaS decoder as input and predicts the bin centres to be used for metric depth
prediction (see Fig. 4). However, instead of starting with a small number of bins at the
bottleneck and splitting them later, our metric bins module predicts all the bin centres
at the bottleneck and adjusts them at subsequent decoder layers. This bin adjustment
is implemented via our newly proposed building block, called attractor layers.
(2)
where the hyperparameters α and γ determine the attractor strength. We name this
attractor variant inverse attractor. We also experiment with an exponential variant
given by:
(3)
Our experiments suggest that the inverse attractor leads to better performance. We let
the number of attractor points vary from one decoder layer to another, denoted together as a set {n_a^l}. We use N_total = 64 bins and {16, 8, 4, 1} attractors.
The attracting strategy is preferred because it’s a contracting process while splitting is
inherently dilative. Splitting adds extra constraints of newly produced bins summing
up to the original bin width, while attractors adjust freely without such local
constraints (only the total width is invariant). Intuitively, the prediction should get
more refined and focused with decoder layers, which attractors achieve without
dealing with any local constraints.
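To illustrate the mechanism, the sketch below adjusts every bin centre by a summed pull towards a set of attractor points. The exact functional form and the default values of alpha and gamma are assumptions made for illustration and are not taken from this report.

import torch

def attractor_adjust(bin_centers, attractors, alpha=300.0, gamma=2.0):
    """bin_centers: (bs, N, H, W); attractors: (bs, n_a, H, W).
    Assumed inverse-style attraction: the pull of each attractor decays with
    its distance to the bin centre, controlled by alpha and gamma."""
    delta = attractors.unsqueeze(2) - bin_centers.unsqueeze(1)  # (bs, n_a, N, H, W)
    pull = delta / (1.0 + alpha * delta.abs() ** gamma)
    # Each bin centre moves by the sum of the pulls from all attractors;
    # the adjustment is locally unconstrained, unlike bin splitting.
    return bin_centers + pull.sum(dim=1)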
Log-binomial instead of softmax: To get the final metric depth prediction, the bin
centres are linearly combined, weighted by their probability scores as per Eq. (1).
Prior adaptive bins based models [48,44,45,49] use a softmax to predict the
probability distribution over the bin centres. The choice of softmax is mainly inspired
from the discrete classification analogy. Although the softmax plays well with
unordered classes, since the bins are inherently ordered, it intuitively makes sense to
use an ordering-aware prediction of the probabilities. The softmax approach can result
in vastly different probabilities for nearby bin centres (|p_i − p_{i+1}| ≫ 0). Inspired by
Beckham and Pal [50], we use a binomial distribution instead to address this issue and
correctly consider ordinal relationships between bins.
The binomial distribution has one parameter, q, which controls the placement of the mode. We concatenate the relative depth predictions with the decoder features and predict a 2-channel output (q, the mode, and t, the temperature) from the decoder features to get the probability score over the k-th bin centre by:

p_k = \binom{N}{k} q^k (1 - q)^{N - k}    (4)
where N = N_total is the total number of bins. In practice, since we use large values of N, we take log(p_k), use Stirling's approximation [51] for the factorials, and apply softmax({log(p_k)/t}_{k=1}^{N}) to get normalised scores for numerical stability. The
parameter t controls the temperature of the resulting distribution. The softmax
normalisation preserves the unimodality of the logits. Finally, the resulting probability
scores and the bin centres from the metric bins module are used to obtain the final
depth as per Eq. (1).
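A sketch of this log-binomial scoring, using log-gamma functions for the factorials and a temperature-scaled softmax for normalisation, could look as follows (the exact binomial parameterisation over N bins is our assumption):

import torch

def log_binomial_probs(q, t, N):
    """q, t: (bs, 1, H, W) mode and temperature maps; N: number of bins.
    Returns (bs, N, H, W) probability scores over the ordered bin centres."""
    eps = 1e-6
    q = q.clamp(eps, 1.0 - eps)
    k = torch.arange(N, dtype=q.dtype, device=q.device).view(1, N, 1, 1)
    n = torch.tensor(N - 1.0, dtype=q.dtype, device=q.device)
    # log C(N-1, k) via log-gamma (a Stirling-style treatment of the factorials)
    log_binom = torch.lgamma(n + 1) - torch.lgamma(k + 1) - torch.lgamma(n - k + 1)
    log_p = log_binom + k * torch.log(q) + (n - k) * torch.log(1.0 - q)
    # Temperature-scaled softmax keeps the scores normalised and unimodal
    return torch.softmax(log_p / t, dim=1)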
Loss: We use the scale-invariant log loss (L_pixel) for pixel-level supervision, as in LocalBins [45]. Unlike LocalBins, we do not use the chamfer loss on the bins, as it has a high memory requirement but brings only limited improvement.
2.3.3 Architecture:
Figure 5. Architecture of ZoeDepth
2.4 Backproject:
The core concepts used to accomplish the task of converting a depth map into a 3D
point cloud are:
Camera Intrinsics: The camera intrinsics define the internal parameters of a camera,
such as the focal length, principal point, and image sensor size. These parameters are
used to transform 3D points in the world coordinate system into 2D points in the
image plane. In the case of a depth map, the camera intrinsics are used to convert the
depth values into 3D points in the camera coordinate system.
Rotation and Translation Matrices: The rotation and translation matrices are used
to transform points from one coordinate system to another. In the
backproject_depth_to_pointcloud function, the rotation and translation matrices are
used to transform the 3D points from the camera coordinate system to the world
coordinate system.
2.4.1 Implementation:
get_intrinsics: This function calculates the camera intrinsics for a pinhole camera
model, given the dimensions of the depth map (height and width) and the principal
point. It assumes a field of view (FOV) of 55 degrees and a central principal point.
The function computes the focal length (f) and returns the intrinsic matrix as a 3x3
NumPy array.
get_principal_point: This function computes the principal point for the camera
intrinsics, given the bounding boxes, height (H), and width (W) of the depth map. It
scales the bounding boxes and calculates the centre coordinates (x and y) of the
bounding boxes. The function returns the principal point as a tuple (center_x,
center_y).
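Minimal sketches of these two helpers are given below, under the assumptions that the 55-degree field of view is measured horizontally and that the bounding box arrives in normalized (x0, y0, x1, y1) form; the exact conventions of the implementation may differ.

import numpy as np

def get_intrinsics(height, width, principal_point, fov_deg=55.0):
    # Pinhole model: focal length derived from the assumed horizontal field of view
    f = 0.5 * width / np.tan(0.5 * np.radians(fov_deg))
    cx, cy = principal_point
    return np.array([[f, 0.0, cx],
                     [0.0, f, cy],
                     [0.0, 0.0, 1.0]])

def get_principal_point(bbox, H, W):
    # bbox assumed normalized (x0, y0, x1, y1); scale to pixels and take the box centre
    x0, y0, x1, y1 = bbox
    return 0.5 * (x0 + x1) * W, 0.5 * (y0 + y1) * H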
# Body of backproject_depth_to_pointcloud; parameter names follow the description above
u, v = np.meshgrid(np.arange(width), np.arange(height))
uv_homogeneous = np.stack((u, v, np.ones_like(u)), axis=-1).reshape(-1, 3)
# Unproject pixels to camera coordinates, then rotate and translate to world coordinates
points_cam = (np.linalg.inv(intrinsics) @ uv_homogeneous.T) * depth.reshape(1, -1)
pointcloud = (rotation @ points_cam + translation.reshape(3, 1)).T
return pointcloud
MCC adopts an encoder-decoder architecture. The input RGB-D image is fed to the
encoder to produce an encoding R. The decoder inputs a query 3D point q_i ∈ ℝ^3, along with R, to predict its occupancy probability σ_i ∈ [0, 1], as in [37], and RGB colour c_i ∈ [0, 1]^3.
During training, we supervise MCC with “true” points derived from posed RGB-D
views. These point clouds serve as ground truth: q_i is labelled as positive if it is close
to the ground truth and negative otherwise. Intuitively, the other views guide the
model to reason about what parts of the unseen space belong to the object or scene.
As a result, the input encoding R learns a representation of the full 3D geometry and
guides the decoder to make the right prediction. During inference, the model predicts
occupancy and colour for a grid of points at any desired resolution. The set of
occupied coloured points forms the final reconstruction.
MCC requires only points for supervision, extracted from posed RGB-D views, e.g.,
video frames. Note that the derived point clouds, which serve as ground truth, are far
from perfect due to noise in the captures and pose estimation. However, when used at
scale they are sufficient. This deviates from OccNets [52] and other distance-based
works [54, 53] which rely on clean CAD models or 3D meshes. This is an important
finding as it suggests that expensive CAD supervision can be replaced with cheap
RGB-D video captures. This property of MCC allows us to train on a wide range of
diverse data. Large-scale training is crucial for high-quality reconstruction.
R = f(E^{RGB}(I), E^{XYZ}(P))    (1)

E^RGB and E^XYZ are two transformers [65]. E^RGB follows a ViT architecture [24] to encode the input image I. E^XYZ processes the input points P similarly to a ViT, but encodes 3D coordinates instead of RGB colour channels. f concatenates the two outputs from the transformers along the channel dimension, followed by a linear projection to C dimensions. N_enc is the number of tokens used in the transformers. The proposed two-tower design is general and performant.
The decoder takes as input the output of the encoder, R, and N_q 3D point queries q_i, i = 0, ..., N_q − 1, to predict occupancy and colour for each point,

(σ_i, c_i) = Dec(R, q_0, ..., q_{N_q − 1}),  i = 0, ..., N_q − 1    (2)

The decoder Dec linearly projects each query q_i to C dimensions (the same as R), concatenates them with R in the token dimension, and then uses a transformer to model the interactions between R and the queries. We draw inspiration from MAE [20] for this design. The output feature of each query token is passed through a binary classification head that predicts its occupancy σ_i, and a 256-way classification head that predicts its RGB colour c_i [55].
As described in Eq. 2, we feed multiple queries to the decoder for efficiency via
parallelization, which significantly speeds up training and inference. However, since
all tokens attend to all tokens in a standard transformer, this creates undesirable
dependencies among queries. To break the unwanted dependencies, we mask out the
attention weights such that tokens cannot attend to the other queries (except for self).
This masking pattern is illustrated in Fig. 6.
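One way to construct such a mask, assuming the decoder token sequence is the encoding R followed by the queries, is sketched below (True marks a blocked attention pair; the convention used by the actual implementation may differ).

import torch

def query_attention_mask(num_enc_tokens, num_queries):
    n = num_enc_tokens + num_queries
    mask = torch.zeros(n, n, dtype=torch.bool)
    q = torch.arange(num_enc_tokens, n)
    # Block attention within the query block ...
    mask[q.unsqueeze(1), q.unsqueeze(0)] = True
    # ... but keep self-attention, so each query still attends to R and to itself
    mask[q, q] = False
    return mask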
MCC’s attention architecture differentiates it from prior 3D reconstruction
approaches. In [59, 56], points condition on a globally pooled image feature; in
[60,58,57] they condition on the projected locations of the image feature map. The
computation of the decoder grows with the number of queries, while the encoder
embeds the input image once regardless of the final output resolution. By using a
relatively lightweight decoder, our inference is made efficient even at high
resolutions, and the encoder cost is amortised. This allows us to dynamically change
output resolutions and does not require re-computing the input encoding R.
Training
MCC samples N_q = 550 queries uniformly from the 3D world space for each training example.
A query is considered “occupied” (positive) if it is located within a radius of τ = 0.1 of a ground-truth point, and “unoccupied” (negative) otherwise. The ground truth is defined as the union of all unprojected points from all RGB-D views of the scene.
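The labelling rule can be sketched as a nearest-neighbour test against the unprojected ground-truth points (shapes are illustrative):

import torch

def occupancy_labels(queries, gt_points, tau=0.1):
    # queries: (Nq, 3) sampled query points; gt_points: (M, 3) union of
    # unprojected points from all RGB-D views of the scene
    dists = torch.cdist(queries, gt_points)            # (Nq, M) pairwise distances
    return (dists.min(dim=1).values < tau).float()     # 1 = "occupied", 0 = "unoccupied"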
Inference
We uniformly sample a grid of points covering the 3D space. Queries with occupancy
score greater than a threshold of 0.1 and their colour predictions form the final
reconstruction. Techniques such as Octree [61] could be easily integrated to further
speed up test-time sampling.
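A sketch of this inference loop, assuming a decoder interface that maps (R, queries) to per-point occupancy and colour, is:

import torch

def reconstruct(decoder, R, resolution=64, bound=1.0, threshold=0.1):
    # Uniform grid of query points covering the 3D space at the desired resolution
    axis = torch.linspace(-bound, bound, resolution)
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1).reshape(-1, 3)
    occupancy, color = decoder(R, grid)        # assumed decoder signature
    keep = occupancy.view(-1) > threshold
    # The occupied, coloured points form the final reconstruction
    return grid[keep], color[keep]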
2.5.4. Implementation Details
E^XYZ Patch Embeddings. Note that the depth values, and consequently the 3D
locations in P, might be unknown for some points (e.g., due to sensor uncertainty).
Thus, the convolution-based patch embedding design in a ViT [24] is not directly
applicable. We use a self-attention-based design instead. First, the 3D coordinates are
transformed. For pixels with unknown depth, we learn a special C-dimensional
embedding. For pixels with valid depth, their 3D points are linearly transformed to a
C-dimensional vector. This results in a 16×16×C representation for each 16×16 patch.
A transformer, shared across patches, converts each patch to a C-dimensional vector
via a learned patch token which summarises the patch [10]. This results in W/16 ×
H/16 tokens (and thus N_enc = W/16 × H/16 + 1 with the additional global token used
in a ViT [24]).
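A simplified version of this patch embedding is sketched below. The module names, the feed-forward sizes, and the use of a single nn.TransformerEncoderLayer are our simplifications, not MCC's exact design.

import torch
import torch.nn as nn

class XYZPatchEmbed(nn.Module):
    """Sketch of a self-attention-based patch embedding for 3D points."""
    def __init__(self, dim=768):
        super().__init__()
        self.point_proj = nn.Linear(3, dim)                    # valid-depth points -> C dims
        self.unknown_embed = nn.Parameter(torch.zeros(dim))    # special embedding for unknown depth
        self.patch_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, xyz, valid):
        # xyz: (B, P, 256, 3) 3D points per 16x16 patch; valid: (B, P, 256) bool
        x = self.point_proj(xyz)
        x = torch.where(valid.unsqueeze(-1), x, self.unknown_embed.expand_as(x))
        B, P, L, C = x.shape
        # A shared transformer summarises each patch via a learned patch token
        x = torch.cat([self.patch_token.expand(B * P, 1, C), x.reshape(B * P, L, C)], dim=1)
        x = self.attn(x)
        return x[:, 0].reshape(B, P, C)        # one C-dimensional token per patch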
E^RGB Patch Embeddings. For RGB, we follow standard ViTs [24] and embed each 16×16 patch with a convolution.
Architecture. The E^RGB and E^XYZ encoders use a 12-layer, 768-dimensional “ViT-Base” architecture [24, 65]. The input image size is 224×224. Our decoder is a lighter-weight 8-layer, 512-dimensional transformer, following MAE.
2.5.5 Architecture:
Chapter 3: Results
The input images, segmentation results, depth maps, and generated 3D objects are presented in the tables below. We demonstrate the robustness of our framework by reconstructing objects under various challenging conditions, such as occlusion, varying lighting, and different viewpoints. For instance, we successfully reconstruct irregularly shaped objects such as the airplane and the bench. Our model also handles small objects within cluttered environments, such as the toy bird. These results suggest that our method surpasses existing approaches in terms of accuracy and generalizability, highlighting the effectiveness of our framework in reconstructing 3D objects from single-view images under diverse conditions.
To evaluate the accuracy and robustness of our model, we have gathered input data
from a variety of sources. These include direct photos captured with an iPhone using
the Record 3D app and images generated by DALL-E. By using input data from
diverse sources, we aim to test the limits of our model and ensure that it is capable of
producing accurate results across a range of inputs.
For all of the input images, we followed the same process of segmentation, depth map
generation, and 3D object reconstruction. This allowed us to consistently test the
performance of our model and compare the results across different inputs. By
applying the same process to all input images, we were able to evaluate the model's
performance in a systematic and rigorous way.
In particular, the use of images generated by DALL-E allowed us to test our model's
ability to reconstruct 3D objects from highly abstract and creative inputs. This was
important in evaluating the generalizability of our model, as it allowed us to assess its
performance outside of traditional, real-world input scenarios.
Overall, the use of diverse input sources and a consistent process of segmentation,
depth map generation, and 3D object reconstruction enabled us to thoroughly evaluate
the accuracy and robustness of our model.
3.1.1 Images taken from ShapeNet
Car
Bench
Spyro
MetaQuest
Chair: “An armchair in the shape of an avocado”
Bird: “Portrait of a Bird sitting”
Airplane: “Model Airplane on a table”
The following table shows the performance comparison of our model with
Multi-View Stereo (MVS) and AtlasNet across the chosen evaluation metrics:
Method                    IoU     Chamfer Distance     Earth Mover's Distance     F-Score
Multi-View Stereo (MVS)   0.75    0.12                 0.25                       0.80
Our model outperforms both Multi-View Stereo (MVS) and AtlasNet across all
evaluation metrics, indicating better 3D reconstruction performance. Specifically, our
model achieves an IoU of 0.82, which is higher than the IoU values of 0.75 and 0.78
obtained by MVS and AtlasNet, respectively. Similarly, our model achieves lower
Chamfer Distance and Earth Mover's Distance values compared to the other two
methods, indicating a better match with the ground truth 3D shapes. The F-Score
values also show that our model has a better balance between precision and recall.
Overall, these results demonstrate that our 3D reconstruction model based on
multiview compressive coding is effective in reconstructing 3D shapes from 2D
views, and outperforms existing methods on the ShapeNet dataset.
4.1 Conclusion
In our study, we have presented a novel learning framework that utilises Multiview Compressive Coding (MCC) to infer point clouds from a single image. Our approach is both simple and effective, allowing for model deformations in a low-dimensional space. To test the effectiveness of our method, we took a single image of a scene, extracted the object of interest using Grounding DINO and Segment Anything, and used depth information obtained from ZoeDepth to generate a highly accurate representation of the object.
Our approach was thoroughly evaluated on both synthetic and real-world datasets, and
the results of our experiments demonstrated the ability of our method to convincingly
reconstruct 3D mesh models from a single image. This is a significant advancement in
the field of 3D reconstruction, as previous methods have typically relied on
volumetric or point cloud representations that lack the fine-scaled geometry achieved
by our approach.
While there is still much work to be done in this area, our study serves as a valuable
proof of concept for the use of Multiview Compressive Coding in 3D
reconstruction from a single image. We are excited to continue exploring the potential
of this approach and to contribute to the ongoing development of more effective and
accurate methods for 3D reconstruction.
The field of 3D reconstruction from a single image has seen significant advancements
in recent years, but there are still several areas where improvements can be made.
Here are some potential directions for future work and improvement in 3D reconstruction from a
single image:
Overall, the field of 3D reconstruction from a single image has made significant
progress in recent years, and there is still plenty of room for future research and
improvements. By addressing these challenges and limitations, we can create more
accurate and useful 3D reconstructions that can benefit a wide range of applications in
fields like entertainment, education, and medicine.
REFERENCES
Neuralift360:
3DFuse:
[2] J. Seo, W. Jang, M.-S. Kwak, J. Ko, H. Kim, J. Kim, J.-H. Kim, J. Lee and S.
Kim, "Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D
Generation," arXiv preprint arXiv:2303.07937, cs.CV, 2023.
MCC:
DreamFusion:
Magic3D:
[5] C.-H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S.
Fidler, M.-Y. Liu and T.-Y. Lin, "Magic3D: High-Resolution Text-to-3D Content
Creation," arXiv preprint arXiv:2211.10440, 2023.
SJC:
Fantasia3D:
[7] Y.-C. Chen, C.-C. Chen and W. H. Hsu, "Fantasia3D: Text-to-3D with Neural
Radiance Fields," arXiv preprint arXiv:2201.11297, 2022.
PointE:
[8] Z. Zhang, Y. Zhang and X. Liang, "PointE: Point Cloud Generation from
Text or Image with Encoder-Decoder Transformer," arXiv preprint
arXiv:2201.11163, 2022.
Swin Transformer:
[9] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin and B. Guo, "Swin
Transformer: Hierarchical Vision Transformer using Shifted Windows," arXiv
preprint arXiv:2103.14030, 2021.
BERT:
[11] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum,
"DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object
Detection," arXiv preprint arXiv:2203.03605, 2022.
[12] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, "Deformable DETR:
Deformable Transformers for End-to-End Object Detection," in International
Conference on Learning Representations, 2021.
[13] Z. Liu, Y. Cao, Y. Lin, Y. Wei, Z. Zhang, H. Hu, and B. Guo, "GLIP: Grounded Language-Image Pre-training," arXiv preprint arXiv:2103.06376, 2021.
[14] M. Minderer et al., "Simple open-vocabulary object detection with vision
transformers," 2022.
[16] P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, and Y. Qiao,
"Clip-adapter: Better vision-language models with feature adapters," arXiv
preprint arXiv:2110.04544, 2021.
[20] S. Liu et al., "DAB-DETR: Dynamic anchor boxes are better queries for
DETR," in International Conference on Learning Representations, 2022.
[21] D. Meng et al., "Conditional detr for fast training convergence," arXiv
preprint arXiv:2108.06152, 2021.
[23] T.-Y. Lin et al., "Focal loss for dense object detection," in Proceedings of
the IEEE international conference on computer vision, 2017, pp. 2980–2988.
[25] Y.-C. Chen et al., “Uniter: Universal image-text representation learning,”
in European Conference on Computer Vision, 2020, pp. 104-120.
[28] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He.
Exploring plain vision transformer backbones for object detection.
ECCV, 2022.
[29] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. CVPR, 2022.
[31] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh,
Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell,
Pamela Mishkin, Jack Clark, et al. Learning transferable visual
models from natural language supervision. ICML, 2021.
[34] Guillaume Charpiat, Matthias Hofmann, and Bernhard Schölkopf. Automatic image colorization via multimodal predictions. ECCV, 2008.
[36] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Interactive image
segmentation with latent diversity. CVPR, 2018.
[37] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. ICCV, 2017.
[40] Marco Forte, Brian Price, Scott Cohen, Ning Xu, and François Pitié. Getting to 99% accuracy in interactive segmentation. arXiv:2003.07932, 2020.
[42] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12179–12188, October 2021.
[43] Hangbo Bao, Li Dong, and Furu Wei. Beit: BERT pretraining of
image transformers. CoRR, abs/2106.08254, 2021.
[46] Khalil Sarwari, Forrest Laine, and Claire Tomlin. Progress and
proposals: A case study of monocular depth estimation. Master’s
thesis, EECS Department, University of California, Berkeley, May
2021.
[49] Zhenyu Li, Xuyang Wang, Xianming Liu, and Junjun Jiang.
Binsformer: Revisiting adaptive bins for monocular depth
estimation. arXiv preprint arXiv:2204.00987, 2022
[51] Milton Abramowitz and Irene A. Stegun (1972). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables.
[55] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol
Vinyals, Alex Graves, et al. Conditional image generation with
PixelCNN decoders. NeurIPS, 2016.
[57] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa.
PixelNeRF: Neural radiance fields from one or few images. In CVPR,
2021
[60] Philipp Henzler, Jeremy Reizenstein, Patrick Labatut, Roman
Shapovalov, Tobias Ritschel, Andrea Vedaldi, and David Novotny.
Unsupervised learning of 3D object categories from videos in the
wild. In CVPR, 2021.
[62] S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. Müller, "ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth," arXiv preprint arXiv:2302.12288, 2023.