Natasha Kholgade1    Tomas Simon1    Alexei Efros2    Yaser Sheikh1

Figure 1 panels: Original Photograph · Estimated Illumination · Object Manipulated in 3D · 3D Copy-Paste
Figure 1: Using our approach, a user manipulates the taxi cab in a photograph to do a backflip, and copy-pastes the cabs to create a traffic
jam (right) by aligning a stock 3D model (inset) obtained from an online repository. Such 3D manipulations often reveal hidden parts of the
object. Our approach completes the hidden parts using symmetries and the stock model appearance, while accounting for illumination in 3D.
Photo Credits (leftmost photograph): Flickr user Lucas Maystre.
Abstract
Photo-editing software restricts the control of objects in a photograph to the 2D image plane. We present a method that enables
users to perform the full range of 3D manipulations, including scaling, rotation, translation, and nonrigid deformations, to an object in
a photograph. As 3D manipulations often reveal parts of the object
that are hidden in the original photograph, our approach uses publicly available 3D models to guide the completion of the geometry
and appearance of the revealed areas of the object. The completion
process leverages the structure and symmetry in the stock 3D model
to factor out the effects of illumination, and to complete the appearance of the object. We demonstrate our system by producing object
manipulations that would be impossible in traditional 2D photo-editing programs, such as turning a car over, making a paper crane flap its wings, or manipulating airplanes in a historical photograph to change its story.
CR Categories: I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism - Virtual Reality;
Keywords: three-dimensional, photo-editing, 3D models
1 Introduction
Much of today's creative photo editing is done in Adobe Photoshop. Once mainly a tool of professional photographers and designers, it has become mainstream, so much so that to photoshop is now a legitimate English verb [Simpson 2003]. Photoshop lets a user creatively edit the content of a photograph with image operations such as recoloring, cut-and-paste, hole-filling, and filtering. Since the starting point is a real photograph, the final result often appears quite photorealistic as well. However, while photographs are depictions of a three-dimensional world, the allowable geometric operations in photo-editing programs are currently restricted to 2D manipulations in picture space. Three-dimensional manipulations of objects, the sort that we are used to performing naturally in the real world, are simply not possible with photo-editing software; the photograph knows only the pixels of the object's 2D projection, not its actual 3D structure.
Our goal in this paper is to allow users to seamlessly perform 3D
manipulation of objects in a single consumer photograph with the
realism and convenience of Photoshop. Instead of simply editing
what we see in the photograph, our goal is to manipulate what
we know about the scene behind the photograph [Durand 2002].
3D manipulation of essentially a 2D object sprite is highly underconstrained as it is likely to reveal previously unobserved areas of
the object and produce new, scene-dependent shading and shadows.
One way to achieve a seamless break from the original photograph is to recreate the scene in 3D in the software's internal representation. However, this operation requires significant effort that only large special-effects companies can afford. It also typically involves external scene data such as light probes, multiple images, and calibration objects, which are not available with most consumer photographs.
Instead, in this paper, we constrain the recreation of the scene's 3D
geometry, illumination, and appearance from the 2D photograph
using a publicly available 3D model of the manipulated object as
a proxy. Graphics is now entering the age of Big Visual Data:
enormous quantities of images and video are uploaded to the Internet daily. With the move towards model standardization and the
use of 3D scanning and printing technologies, publicly available
3D data (modeled or scanned using 3D sensors like the Kinect) are
also readily available. Public repositories of 3D models (e.g., 3D
Warehouse or Turbosquid) are growing rapidly, and several Internet companies are currently in the process of generating 3D models
for millions of merchandise items such as toys, shoes, clothing, and
household equipment. It is therefore increasingly likely that for
most objects in an average user photograph, a stock 3D model will
soon be available, if it is not already.
However, it is unreasonable to expect such a model to be a perfect match to the depicted object; the visual world is too varied to ever be captured perfectly, no matter how large the dataset. Therefore, our approach deals with several types of mismatch between
the photographed object and the stock 3D model:
Geometry Mismatch. Interestingly, even among standard, mass-produced household brands (e.g., detergent bottles), there are often
subtle geometric variabilities as manufacturers tweak the shape of
their products. Of course, for natural objects (e.g., a banana), the
geometry of each instance will be slightly different. Even in the
cases when a perfect match could be found (e.g., a car of a specific
make, model, and year), many 3D models are created with artistic
license and their geometry will likely not be metrically accurate, or may contain errors due to scanning.
Appearance Mismatch. Although both artists and scanning techniques often provide detailed descriptions of object appearance
(surface reflectance), these descriptions may not match the colors
and textures (and aging and weathering effects) of the particular
instance of the object in the photograph.
Illumination Mismatch. To perform realistic manipulations in
3D, we need to generate plausible lighting effects, such as shadows
on an object and on contact surfaces. The environment illumination
that generates these effects is not known a priori, and the user may
not have access to the original scene to take illumination measurements (e.g., in dynamic environments or for legacy photographs).
Our approach uses the pixel information in visible parts of the
object to correct the three sources of mismatch. The user semi-automatically aligns the stock 3D model to the photograph using a
real-time geometry correction interface that preserves symmetries
in the object. Using the aligned model and photograph, our approach automatically estimates environment illumination and appearance information in hidden parts of the object. While a photograph and 3D model may still not contain all the information needed
to precisely recreate the scene, our approach sufficiently approximates the illumination, geometry, and appearance of the underlying
object and scene to produce plausible completion of uncovered areas. Indeed, as shown by the user study in Section 8, our approach
plausibly reveals hidden areas of manipulated objects.
The ability to manipulate objects in 3D while maintaining realism
greatly expands the repertoire of creative manipulations that can
be performed on a photograph. Users are able to quickly perform
object-level motions that would be time-consuming or simply impossible in 2D. For example, from just one photograph, users can
cause grandma's car to perform a backflip, and fake a baby lifting a heavy sofa. We tie our approach to standard modeling and animation software to animate objects from a single photograph. In this way, we re-imagine typical Photoshop edits, such as object rotation, translation, rescaling, deformation, and copy-paste, as object manipulations in 3D, and enable users to more directly translate
what they envision into what they can create.
Contributions. Our key contribution is an approach that allows
out-of-plane 3D manipulation of objects in consumer photographs,
while providing a seamless break from the original image. To do
so, our approach leverages approximate object symmetries and a
new non-parametric model of image-based lighting for appearance
completion of hidden object parts and for illumination-aware compositing of the manipulated object into the image. We make no assumptions on the structure or nature of the object being manipulated
beyond the fact that an approximate stock 3D model is available.
Assumptions. In this paper, we assume a Lambertian model of
illumination. We do not model material properties such as refraction, specularities, sub-surface scattering, or inter-reflection. The
user study discussed in Section 8 shows that while these constraints are necessary to produce plausible 3D manipulations for some objects, the results can be perceptually plausible without explicit modeling when the unmodeled effects are not too pronounced. In addition, we assume that the user provides a stock 3D model with components for
all parts of the object visible in the original photograph. Finally,
we assume that the appearance of the 3D model is self-consistent,
i.e., the precise colors of the stock model need not match the photograph, but appearance symmetries should be preserved. For instance, the cliffhanger in Figure 13 is created using the 3D model
of a blueish-grey Audi A4 (shown in the supplementary material)
to manipulate the green Rover 620 Ti in the photograph.
Notation. For the rest of the paper, we refer to known quantities without the overline notation, and we use the overline notation for unknown quantities. For instance, the geometry and appearance of the stock 3D model, which are known a priori, are referred to as $X$ and $T$ respectively. The geometry and appearance of the 3D model modified to match the photograph are not known a priori and are referred to as $\bar{X}$ and $\bar{T}$ respectively. Similarly, the illumination environment, which is not known a priori, is referred to as $\bar{L}$.
2 Related Work
Modern photo-editing software such as Photoshop provides sophisticated 2D editing operations such as content-aware fill [Barnes
et al. 2009] and content-aware photo resizing [Avidan and Shamir
2007]. Many approaches provide 2D edits using low-level assumptions about shape and appearance [Barrett and Cheney 2002; Fang
and Hart 2004]. The classic work of Khan et al. [2006] uses insights
from human perception to edit material properties of photographed
objects, to add transparency, translucency, and gloss, and to change
object textures. Goldberg et al. [2012] provide data-driven techniques to add new objects or manipulate existing objects in images
in 2D. While these techniques can produce surprisingly realistic
results in some cases, their lack of true 3D limits their ability to
perform more invasive edits, such as 3D manipulations.
The seminal work of Oh et al. [2001] uses depth-based segmentation to perform viewpoint changes in a photograph. Chen et
al. [2011] extend this idea to videos. These methods manipulate
visible pixels, and cannot reveal hidden parts of objects. To address
these limitations, several methods place prior assumptions on photographed objects. Data-driven approaches [Blanz and Vetter 1999]
provide drastic view changes by learning deformable models; however, they rely on training data. Debevec et al. [1996] use the regular symmetrical structure of architectural models to reveal novel
views of buildings. Kopf et al. [2008] use georeferenced terrain
and urban 3D models to relight objects and reveal novel viewpoints
in outdoor imagery. Unlike our method, Kopf et al. do not remove
the effects of existing illumination. While this works well outdoors,
it might not be appropriate in indoor settings where objects cast soft
shadows due to area light sources.
Approaches in proxy-based modeling of photographed objects include cuboid proxies [Zheng et al. 2012] and 3-Sweep [Chen et al.
2013]. Unlike our approach, Zheng et al. and Chen et al. (1) cannot
reveal hidden areas that are visually distinct from visible areas, limiting the full range of 3D manipulation (e.g., the logo of the laptop
from Zheng et al. that we reveal in Figure 7, the underside of the
taxi cab in Figure 1, and the face of the wristwatch in Figure 13),
(2) cannot represent a wide variety of objects precisely, as cuboids
(Zheng et al.) or generalized cylinders (Chen et al.) cannot handle
highly deformable objects such as backpacks, clothing, and stuffed
toys, intricate or indented objects such as the origami crane in Figure 6 or a pair of scissors, or objects with negative space such as
cups, top hats, and shoes, and (3) cannot produce realistic shading and shadows (e.g. in the case of the wristwatch, the top hat,
the cliffhanger, the chair, and the fruit in Figure 13, and the taxi cab in Figure 1).
Figure 2: Overview: (a) Given a photograph and (b) a 3D model from an online repository, the user (c) interactively aligns the model to the
photograph and provides a mask for the ground and shadow, which we augment with the object mask and use to fill the background using
PatchMatch [Barnes et al. 2009]. (d) The user then performs the desired 3D manipulation. (e) Our approach computes the camera and corrects the 3D geometry, and (f) reveals hidden geometry during the 3D manipulation. (g) It automatically estimates environment illumination
and reflectance, (h) to produce shadows and surface illumination during the 3D manipulation. (i) Our approach completes appearance to
hidden parts revealed during manipulation, and (j) composites the appearance with the illumination to obtain the final photograph.
Mixtures of von Mises-Fisher kernels have been estimated for single-view relighting [Hara et al. 2008; Panagopoulos et al. 2009]; however, these
require estimating the number of mixtures separately. Our appearance completion approach is related to methods that texture map 3D
models using images [Kraevoy et al. 2003; Tzur and Tal 2009; Gal
et al. 2010], however, they do not factor out illumination, and may
use multiple images to obtain complete appearance. In using symmetries to complete appearance, our work is related to approaches
that extract symmetries from images and 3D models [Hong et al.
2004; Gal and Cohen-Or 2006; Pauly et al. 2005], and that use
symmetries to complete geometry [Terzopoulos et al. 1987; Mitra
et al. 2006; Mitra and Pauly 2008; Bokeloh et al. 2011], and to infer
missing appearance [Kim et al. 2012]. However, our work differs
from these approaches, which treat geometry and appearance separately: approaches focused on symmetries from geometry do not respect appearance constraints, and vice versa. Our approach uses
an intersection of geometric symmetry and appearance similarity,
and prevents appearance completion between geometrically similar but visually distinct parts, such as the planar underside and top
of a taxi-cab, or between visually similar but geometrically distinct
parts such as the curved surface of a top-hat and its flat brim.
3 Overview
The user manipulates an object in a photograph as shown in Figure 2(a) by using a stock 3D model. For this photograph, the model
was obtained through a keyword search on the online repository TurboSquid. Other repositories such as 3D Warehouse (Figure 2(b)), or semi-automated approaches such as those of Xu et al. [2011],
Lim et al. [2013], and Aubry et al. [2014] may also be used. The
user provides a mask image that labels the ground and shadow pixels. We compute a mask for the object pixels, and use this mask
to inpaint the background using the PatchMatch algorithm [Barnes
et al. 2009]. For complex backgrounds, the user may touch up the
background image after inpainting. Figure 2(c) shows the mask
with ground pixels in gray, and object and shadow pixels in white.
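As an illustration of this background-preparation step (not the authors' pipeline), the sketch below combines the shadow and object masks and fills the resulting hole with OpenCV's inpainting as a stand-in for PatchMatch [Barnes et al. 2009]; the image and mask names are hypothetical.

```python
# Illustrative stand-in for the background fill: the paper uses PatchMatch
# [Barnes et al. 2009]; here OpenCV's Telea inpainting substitutes for it.
# `photo_bgr`, `shadow_mask`, and `object_mask` are assumed uint8 inputs.
import cv2
import numpy as np

def fill_background(photo_bgr, shadow_mask, object_mask):
    # Augment the user-provided shadow mask with the object mask so that
    # every pixel covered by the object or its shadow is synthesized.
    hole = cv2.bitwise_or(shadow_mask, object_mask)
    hole = cv2.dilate(hole, np.ones((5, 5), np.uint8))  # small safety margin
    return cv2.inpaint(photo_bgr, hole, 5, cv2.INPAINT_TELEA)
```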
The user semi-automatically aligns and corrects the stock 3D model
to match the photograph using our symmetry-preserving geometry
correction interface as shown in Figure 2(c). Using the corrected geometry, our approach then estimates the environment illumination and the appearance of the object from the photograph (Figure 2(g)).
Equation (2) only estimates the appearance for parts of the object
that are visible in the original photograph I as shown in Figure 2(i).
The new pose $\Theta'$ potentially reveals hidden parts of the object. To
produce the manipulated photograph J, we need to complete the
hidden appearance. After factoring out the effect of illumination on
the appearance in visible areas, we present an algorithm that uses
symmetries to complete the appearance of hidden parts from visible
areas as described in Section 6. The algorithm uses the stock model
appearance for hidden parts of objects that are not symmetric to visible parts. Given the estimated geometry, appearance, and illumination, and the user-manipulated pose of the object, we composite
the edited photograph by replacing $\Theta$ with $\Theta'$ in Equation (1) as
shown in Figures 2(f), 2(h), and 2(j).
4 Geometry Correction
We first estimate the camera and the initial pose $\Theta$ of the object using a set $A$ of user-defined 3D-2D correspondences, $X_j \in \mathbb{R}^3$ on the model and $x_j \in \mathbb{R}^2$, $j \in A$, in the photograph. Here, $\Theta = \{R, t\}$, where $R \in \mathbb{R}^{3 \times 3}$ is the object rotation, and $t \in \mathbb{R}^3$ is the object translation. We use the EPnP algorithm [Lepetit et al. 2009] to estimate $R$ and $t$. The algorithm takes as input $X_j$, $x_j$, and the matrix $K \in \mathbb{R}^{3 \times 3}$ of camera parameters (i.e., focal length, skew, and pixel aspect ratio). We assume
a zero-skew camera, with square pixels and principal point at the
photograph center. We use the focal length computed from EXIF
tags when available, else we use vanishing points to compute the
focal length. We assume that objects in the photograph are at rest
on a ground plane. We describe focal length extraction using vanishing points, and ground plane estimation in the supplementary
material. It should be noted that there exists a scale ambiguity in
computing t. The EPnP algorithm handles the scale ambiguity in
terms of translation along the z-axis of the camera.
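A minimal sketch of this pose-estimation step using OpenCV's EPnP solver, assuming a zero-skew intrinsic matrix with the principal point at the image center as described above; the function and variable names are ours, not the paper's.

```python
# Minimal sketch: camera pose from user-clicked 3D-2D correspondences via
# EPnP, as exposed by OpenCV. `model_pts_3d` is Nx3, `image_pts_2d` is Nx2,
# and `focal` comes from EXIF tags or vanishing points.
import cv2
import numpy as np

def estimate_pose(model_pts_3d, image_pts_2d, focal, image_size):
    w, h = image_size
    # Zero skew, square pixels, principal point at the photograph center.
    K = np.array([[focal, 0.0, w / 2.0],
                  [0.0, focal, h / 2.0],
                  [0.0, 0.0, 1.0]])
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(model_pts_3d, dtype=np.float64),
        np.asarray(image_pts_2d, dtype=np.float64),
        K, None, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)   # 3x3 rotation R and translation t
    return R, tvec, K
```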
As shown in Figure 3(a), after the camera is estimated, the user
provides a set $B$ of start points $x_k \in \mathbb{R}^2$, $k \in B$, on the projection of the stock model, and a corresponding set of end points $\hat{x}_k \in \mathbb{R}^2$ on the photographed object for the purpose of geometry correction.
We used a point-to-point correction approach, as opposed to sketch
or contour-based approaches [Nealen et al. 2005; Kraevoy et al.
2009], as reliably tracing soft edges can be challenging compared
to providing point correspondences. The user only provides the
point corrections in 2D. We use them to correct $X$ to $\bar{X}$ in 3D by optimizing an objective in $\bar{X}$ consisting of a correction term $E_1$, a symmetry prior $E_2$, and a smoothness prior $E_3$:
$$E(\bar{X}) = E_1(\bar{X}) + E_2(\bar{X}) + E_3(\bar{X}). \qquad (3)$$
The correction term pulls each corrected point $\bar{X}_k$ onto the ray back-projected through the corresponding end point $\hat{x}_k$:
$$E_1(\bar{X}) = \sum_{k \in B} \left\| \bar{X}_k - \frac{\hat{v}_k \hat{v}_k^{\top}}{\| \hat{v}_k \|_2^2}\, \bar{X}_k \right\|_2^2, \qquad (4)$$
where $\hat{v}_k$ denotes the ray back-projected through $\hat{x}_k$ (Figure 3(b)).
The smoothness prior keeps the deformation locally rigid by preserving the stock-model edges up to a per-vertex rotation $R_i$:
$$E_3(\bar{X}) = \sum_{i=1}^{N} \sum_{j \in D_i} \left\| (\bar{X}_i - \bar{X}_j) - R_i\, (X_i - X_j) \right\|_2^2, \qquad (5)$$
where $D_i$ indexes the neighbors of vertex $i$.
Figure 3: Geometry correction. (a) The user makes a 2D correction by marking a start-end pair $(x_k, \hat{x}_k)$ in the photograph. (b) Correction term: the back-projected ray $v_k$ corresponding to $x_k$ is shown in black, and the back-projected ray $\hat{v}_k$ corresponding to $\hat{x}_k$ is shown in red. The top inset shows the 3D point $X_k$ for $x_k$ on the stock model, and the bottom inset shows its symmetric pair $X_{\mathrm{sym}(k)}$. We deform the stock model geometry (light grey) to the user-specified correction (dark grey) subject to smoothness- and symmetry-preserving priors.
The symmetry prior keeps the corrected geometry consistent with the symmetries of the stock model:
$$E_2(\bar{X}) = \sum_{i=1}^{N} \left\| S\, [\bar{X}_i^{\top}\ \ 1]^{\top} - \bar{X}_{\mathrm{sym}(i)} \right\|_2^2, \qquad (7)$$
where $S$ encodes the reflection about the object's plane of symmetry in homogeneous coordinates, and $\mathrm{sym}(i)$ indexes the vertex symmetric to vertex $i$.
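To make the structure of this objective concrete, here is a schematic numpy evaluation of the three terms under simplifying assumptions: the per-vertex rotations $R_i$ are held fixed, the symmetry transform is a 4x4 homogeneous reflection, and all weights are one. This is a sketch of the energy only, not the authors' solver, which would minimize it over the corrected vertices (e.g., with a standard nonlinear least-squares routine).

```python
# Schematic evaluation of the geometry-correction energy E = E1 + E2 + E3.
# X_stock: (N,3) stock vertices; X: (N,3) corrected vertices; rays: dict
# mapping a corrected vertex index k to the ray back-projected through the
# user's end point; S: 4x4 homogeneous reflection about the symmetry plane;
# sym: index array of symmetric partners; nbrs: per-vertex neighbor index
# arrays; R_locals: per-vertex 3x3 rotations (held fixed in this sketch).
import numpy as np

def correction_term(X, rays):                        # E1: pull X_k onto its ray
    e = 0.0
    for k, v in rays.items():
        proj = np.outer(v, v) / np.dot(v, v)         # projection onto the ray
        e += np.sum((X[k] - proj @ X[k]) ** 2)
    return e

def symmetry_term(X, S, sym):                        # E2: preserve symmetry pairs
    Xh = np.hstack([X, np.ones((len(X), 1))])        # homogeneous coordinates
    return np.sum(((Xh @ S.T)[:, :3] - X[sym]) ** 2)

def smoothness_term(X, X_stock, nbrs, R_locals):     # E3: locally rigid edges
    e = 0.0
    for i, Di in enumerate(nbrs):
        d_new = X[i] - X[Di]
        d_old = (R_locals[i] @ (X_stock[i] - X_stock[Di]).T).T
        e += np.sum((d_new - d_old) ** 2)
    return e

def total_energy(X, X_stock, rays, S, sym, nbrs, R_locals):
    return (correction_term(X, rays)
            + symmetry_term(X, S, sym)
            + smoothness_term(X, X_stock, nbrs, R_locals))
```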
Figure 4: We represent the environment map as a linear combination of the von Mises-Fisher (vMF) basis. We enforce constraints of
sparseness and grouping of basis coefficients to mimic area lighting
and produce soft cast shadows.
5 Illumination and Appearance Estimation

Here, $\omega$ denotes the direction to this light source from $X_i$, and $L(\omega)$ is the intensity of the light source along $\omega$. We assume that the light sources lie on a sphere, i.e., that $L(\omega)$ is a spherical environment map.
To estimate these quantities, we optimize the following objective
function in P and L, consisting of a data term F1 , an illumination
prior F2 , and a reflectance prior F3 :
$$F(P, L) = F_1(P, L) + F_2(L) + F_3(P). \qquad (9)$$
The data term penalizes the difference between the observed pixels and the diffuse rendering,
$$F_1(P, L) = \sum_{i=1}^{N_I} \left( I_i - P_i \int_{\Omega} n_i^{\top} s(\omega)\, v_i(\omega)\, L(\omega)\, d\omega \right)^{2}, \qquad (10)$$
where $n_i$ is the surface normal at $X_i$, $s(\omega)$ is the unit vector along direction $\omega$, $v_i(\omega)$ is the visibility of direction $\omega$ from $X_i$, and $N_I$ is the number of observed pixels. We represent the environment map as a non-negative linear combination of $K$ von Mises-Fisher (vMF) kernels,
$$L(\omega) = \alpha_1\, h(u(\omega); \mu_1, \kappa) + \cdots + \alpha_K\, h(u(\omega); \mu_K, \kappa),$$
and the illumination prior $F_2$ enforces sparseness and grouping of the coefficients $\alpha_1, \ldots, \alpha_K$ to mimic area lighting and produce soft cast shadows (Figure 4).
Each kernel is the vMF density on the sphere,
$$h(u(\omega); \mu_k, \kappa) = \frac{\kappa}{4\pi \sinh \kappa}\, \exp\!\left( \kappa\, \mu_k^{\top} u(\omega) \right), \qquad (11)$$
where $u(\omega)$ is the unit vector along $\omega$, $\mu_k$ is the mean direction, and $\kappa$ is the concentration. We estimate the reflectance and illumination by solving
$$\{ P^{\star}, L^{\star} \} = \arg\min_{P, L}\; F(P, L). \qquad (12)$$
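The following sketch illustrates the vMF environment-map representation above: $K$ kernel directions are placed roughly uniformly on the sphere, and $L(\omega)$ is evaluated as a non-negative combination of the kernels. The sampling scheme and symbol names (`alphas`, `kappa`) are illustrative choices, not taken from the paper.

```python
# Illustrative vMF environment-map representation: K kernels, one per
# discretized sphere direction, combined with non-negative coefficients.
import numpy as np

def sphere_directions(K):
    """K roughly uniform unit directions via a Fibonacci spiral (our choice)."""
    i = np.arange(K) + 0.5
    phi = np.arccos(1.0 - 2.0 * i / K)
    theta = np.pi * (1.0 + 5.0 ** 0.5) * i
    return np.stack([np.sin(phi) * np.cos(theta),
                     np.sin(phi) * np.sin(theta),
                     np.cos(phi)], axis=1)

def vmf_kernel(u, mu, kappa):
    """Normalized 3-D von Mises-Fisher density at unit directions u (Mx3)."""
    c = kappa / (4.0 * np.pi * np.sinh(kappa))
    return c * np.exp(kappa * (u @ mu))

def environment_map(u, mus, alphas, kappa):
    """L(u) = sum_k alphas[k] * h(u; mus[k], kappa), with alphas >= 0."""
    H = np.stack([vmf_kernel(u, mu, kappa) for mu in mus], axis=1)  # (M, K)
    return H @ alphas
```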
The above optimization is non-convex due to the bilinear interaction of the surface reflectances P with the illumination L. If we
know the reflectances, we can solve a convex optimization for the
illumination, and vice versa. We initialize the reflectances with the
stock model reflectance P for the object, and the median pixel value
for the ground plane. We alternately solve for illumination and reflectance until convergence to a local minimum. To represent the
vMF kernels and L, we discretize the sphere into K directions, and
compute K kernels, one per direction. Finally, we compute the appearance difference as the residual of synthesizing the photograph
using the diffuse reflection model, i.e.,
$$\epsilon_i^{\star} = I_i - P_i^{\star} \int_{\Omega} n_i^{\top} s(\omega)\, v_i(\omega)\, L^{\star}(\omega)\, d\omega. \qquad (13)$$
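A compact sketch of the alternating scheme just described, under the Lambertian model: with the reflectances fixed, the non-negative vMF coefficients are obtained by non-negative least squares; with the illumination fixed, the per-pixel reflectance has a closed form. The precomputed light-transport matrix `M`, whose entry (i, k) integrates the shading of pixel i against the k-th vMF kernel, and the omission of the priors $F_2$ and $F_3$ are our simplifications.

```python
# Sketch of alternating reflectance/illumination estimation (priors omitted).
# I: (N,) observed pixel intensities; M: (N, K) precomputed shading of each
# pixel under each vMF kernel; P_init: (N,) initial reflectances.
import numpy as np
from scipy.optimize import nnls

def alternate_reflectance_illumination(I, M, P_init, n_iters=10):
    P = P_init.copy()
    alphas = np.zeros(M.shape[1])
    for _ in range(n_iters):
        # Fix P: non-negative least squares for the vMF coefficients.
        alphas, _ = nnls(P[:, None] * M, I)
        shading = M @ alphas
        # Fix the illumination: closed-form per-pixel reflectance.
        P = I / np.maximum(shading, 1e-6)
    residual = I - P * (M @ alphas)   # appearance difference, as in Eq. (13)
    return P, alphas, residual
```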
Figure 5: We build an MRF over the object model to complete appearance. (a) Due to the camera viewpoint, the vertices are partitioned into a visible set Iv shown with the visible appearance, and
a hidden set Ih shown in green. Initially, the graph has a single
layer of appearance candidates, labeled Layer 1, corresponding to
visible parts. At the first iteration, we use the bilateral plane of symmetry 1 to transfer appearance candidates from Layer 1 to Layer
2. At the second iteration, we use an alternate plane of symmetry 2
to transfer appearance candidates (c) from Layer 1 to Layer 3, and
(d) from Layer 2 to Layer 4. We perform inference over an MRF
to find the best assignment of appearance candidates from several
layers to each vertex. This result was obtained after six iterations.
6 Appearance Completion

The appearance $T = \{P, \epsilon\}$ computed in Section 5 is only available for visible parts of the object, as shown in Figure 5(a). We use
multiple planes of symmetry to complete the appearance in hidden
parts of the object using the visible parts. We first establish symmetric relationships between hidden and visible parts of the object
via planes of symmetry. These symmetric relationships are used to
suggest multiple appearance candidates for vertices on the object.
The appearance candidates form the labels of a Markov Random
Field (MRF) over the vertices. To create the MRF, we first obtain a
fine mesh of object vertices $X_s$, $s \in I$, created by mapping the uv-locations on the texture map onto the 3D object geometry. Here $I$
is a set of indices for all texel locations. We use this fine mesh since
the original 3D model mesh usually does not provide one vertex per
texel location, and cannot be directly used to completely fill the appearance. We set up the MRF as a graph whose vertices correspond
to $X_s$, and whose edges consist of links from each $X_s$ to the set $K_s$ of its $k$ nearest neighbors. As described in Section 6.1, we associate each vertex $X_s$ with a set of $L$ appearance candidates $(P_{is}, \epsilon_{is})$, $i \in \{1, 2, \ldots, L\}$, through multiple symmetries.
layer, and the algorithm in Section 6.1 grows layers (Figure 5(b)
through 5(d)) by transferring appearance candidates from previous
layers across planes of symmetry. To obtain the completed appearance, shown in Figure 5(e), we find an assignment of appearance
candidates, such that each vertex is assigned one candidate and the
assignment satisfies the constraints of neighborhood smoothness,
consistency of texture, and matching of visible appearance to the
observed pixels. We obtain this assignment by performing inference over the MRF as described in Section 6.2.
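As a concrete illustration of this graph construction (not the authors' code), the texel-level 3D points can be linked to their k nearest neighbors with a KD-tree:

```python
# Illustrative construction of the MRF graph over the fine texel mesh:
# one node per texel point, edges to its k nearest neighbors.
import numpy as np
from scipy.spatial import cKDTree

def build_knn_graph(texel_points, k=6):
    tree = cKDTree(texel_points)
    # Query k+1 neighbors because the nearest neighbor of a point is itself.
    _, idx = tree.query(texel_points, k=k + 1)
    return [(s, int(t)) for s, row in enumerate(idx) for t in row[1:]]
```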
6.1 Appearance Candidates from Symmetries

Given the fine mesh on the stock model, we compute each vertex $\bar{X}_s$ on the corrected model using the barycentric coordinates of $X_s$. We then use the pose $\Theta$ of the object to determine the set
of indices Iv for vertices visible from the camera viewpoint (shown
as textured parts of the banana in Figure 5(a)), and the set of indices
Ih = I \ Iv for vertices hidden from the camera viewpoint (shown
in green in Figure 5(a)). While it is possible to pre-compute symmetry planes on an object, our objective is to relate visible parts
of an object to hidden parts through symmetries, many of which
turn out to be approximate (for instance, different parts of a banana
with approximately similar curvature are identified as symmetries
using our approach). Pre-computing all such possible symmetries
is computationally prohibitive. We proceed iteratively, and in each
iteration, we compute a symmetric relationship between Ih and
Iv using the stock model X. Specifically, we compute planes of
symmetry, shown as planes 1 and 2 in Figures 5(a) and 5(b).
Through this symmetric relationship, we associate appearance candidates $(P_{is}, \epsilon_{is})$ with each vertex $X_s$ in the graph by growing out layers of appearance candidates.
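The sketch below illustrates one layer-growing iteration in the spirit of the procedure above: vertices are reflected across a candidate symmetry plane of the stock model, and vertices whose reflection lands near an already-filled vertex inherit that vertex's candidate. The tolerance, data layout, and the choice to copy only the first candidate are illustrative simplifications.

```python
# Illustrative layer growing: transfer appearance candidates across an
# (approximate) plane of symmetry. `candidates` maps a vertex index to its
# list of (reflectance, detail) candidates; `filled_idx` is an index array of
# vertices that already have a candidate (initially the visible set).
import numpy as np
from scipy.spatial import cKDTree

def reflect(points, plane_normal, plane_point):
    n = plane_normal / np.linalg.norm(plane_normal)
    d = (points - plane_point) @ n
    return points - 2.0 * d[:, None] * n

def grow_layer(X, filled_idx, candidates, plane_normal, plane_point, tol=1e-2):
    mirrored = reflect(X, plane_normal, plane_point)
    tree = cKDTree(X[filled_idx])
    dist, nn = tree.query(mirrored)
    new_layer = {}
    for s in range(len(X)):
        if dist[s] < tol:                          # approximate symmetry match
            source = int(filled_idx[nn[s]])
            new_layer[s] = candidates[source][0]   # inherit the source's candidate
    return new_layer
```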
6.2 Inference over the MRF
To obtain the completed appearance for the entire object from the
appearance candidates, shown in Figure 5(e), we need to select a set
of candidates such that (1) each vertex on the 3D model is assigned
exactly one candidate, (2) the selected candidates satisfy smoothness and consistency constraints, and (3) visible vertices retain
their original appearance. To do this, we perform inference over
the MRF using tree-reweighted message passing (TRW-S) [Kolmogorov 2006]. While graph-based inference has been used to
complete texture in images [Kwatra et al. 2003], our approach performs the inference over the 3D surface of the object. Hidden vertices that the layer-growing procedure leaves without a symmetric match receive the stock model appearance as their first-layer candidate, with zero appearance difference. We seek the assignment of candidates $\{i_s\}$ that minimizes
$$\sum_{s \in I} \phi(P_{i_s s}) \;+\; \sum_{s=1}^{|I|} \sum_{t \in K_s} \psi(P_{i_s s}, P_{i_t t}). \qquad (14)$$
The pairwise term $\psi(\cdot, \cdot)$ in the objective function enforces neighborhood smoothness via the Euclidean distance. Here $K_s$ represents the set of indices of the $k$ nearest neighbors of $X_s$. We bias the algorithm to select candidates from the same layer using a weighting factor $\beta$, $0 < \beta < 1$. This provides consistency of texture. We use the following form for the pairwise terms:
$$\psi(P_{i_s s}, P_{i_t t}) = \begin{cases} \beta\, \| P_{i_s s} - P_{i_t t} \|_2^2 & \text{if } i_s = i_t, \\ \| P_{i_s s} - P_{i_t t} \|_2^2 & \text{otherwise.} \end{cases} \qquad (15)$$
The unary term $\phi(\cdot)$ forces visible vertices to receive the reflectance computed in Section 5. We set the unary term at the first layer for visible vertices to $\alpha$, where $0 < \alpha < 1$; otherwise we set it to 1:
$$\phi(P_{i_s s}) = \begin{cases} \alpha & \text{if } s \in I_v \text{ and } i_s \text{ indexes the first layer}, \\ 1 & \text{otherwise.} \end{cases} \qquad (16)$$
We use the tree-reweighted message passing algorithm to perform the optimization in Equation (14). We use the computed assignment to obtain the reflectance values $P^{\star}$ for all vertices in the set $I$.
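For concreteness, the unary and pairwise costs of Equations (15) and (16) can be written as small functions like the ones below and handed to an off-the-shelf discrete-MRF (e.g., TRW-S) solver; `alpha` and `beta` stand for the two weights written as $0 < \alpha < 1$ and $0 < \beta < 1$ in the text, and the default values are placeholders.

```python
# Unary and pairwise MRF costs following the form of Equations (15)-(16).
import numpy as np

def unary_cost(s, layer, visible_set, alpha=0.1):
    # Visible vertices are encouraged to keep their observed (first-layer) value.
    return alpha if (s in visible_set and layer == 1) else 1.0

def pairwise_cost(P_s, layer_s, P_t, layer_t, beta=0.5):
    diff = np.sum((np.asarray(P_s) - np.asarray(P_t)) ** 2)
    # Candidates from the same layer incur a smaller penalty, which biases
    # the labeling toward texture consistency.
    return beta * diff if layer_s == layer_t else diff
```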
Figure 6 panels: Original Photograph · Corrected Geometry · Estimated Illumination · Output
Figure 6: Top row: 3D manipulation of an origami crane. We show the corrected geometry for the crane in the original photograph, and the
estimated illumination, missing parts, and final output for a manipulation. Bottom row: Our approach uses standard animation software to
create realistic animations such as the flying origami crane. Photo Credits: Natasha Kholgade.
7 Final Composition
The user manipulates the object pose from $\Theta$ to $\Theta'$. Given the corrected geometry, estimated illumination, and completed appearance from Sections 4 to 6, we create the result of the manipulation $J$ by replacing $\Theta$ with $\Theta'$ in Equation (1). We use ray-tracing to render each pixel according to Equation (8), using the illumination $L^{\star}$, geometry $\bar{X}$, and reflectance $P^{\star}$. We add the appearance difference $\epsilon^{\star}$ to the rendering to produce the final pixels on the manipulated
object. We render pixels for the object and the ground using this
method, while leaving the rest of the photograph unchanged (such
as the corridor in Figure 6). To handle aliasing, we perform the
illumination estimation, appearance completion, and compositing
using a super-sampled version of the photograph. We filter and
subsample the composite to create an anti-aliased result.
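A minimal sketch of this anti-aliasing step: the composite is produced at a super-sampled resolution, low-pass filtered, and subsampled back to the photograph's resolution. The factor and filter width are illustrative choices, not values from the paper.

```python
# Filter-and-subsample of a super-sampled composite (H*f x W*f x 3 float array).
import numpy as np
from scipy.ndimage import gaussian_filter

def downsample_composite(composite_hi, factor=2):
    blurred = gaussian_filter(composite_hi,
                              sigma=(0.5 * factor, 0.5 * factor, 0.0))
    return blurred[::factor, ::factor, :]
```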
8 Results
Figure 7: We perform a 3D rotation of the laptop in a photograph from Zheng et al. [2012] (Copyright: ACM). Unlike their approach, we
can reveal the hidden cover and logo of the laptop.
Figure 8: Comparison of geometry correction by our approach against the alignment of Xu et al. [2011]. As shown in the insets, Xu et al.
do not align the leg and the seat accurately. Through our approach, the user accurately aligns the model to the photograph.
The appearance estimation described in Section 5 estimates grayscale appearances for the black-and-white photograph of the airplanes. Using
our approach, the user manipulates the airplanes to pose them as
if they were pointing towards the camera, an effect that would be
nearly impossible to capture in the actual scene. In the case of
paintings, our approach maintains the style of the painting and the
grain of the sheet, by transferring these through the fine-scale detail
difference in the appearance completion described in Section 6.
Geometry Evaluation. Figure 8 shows results of geometry correction through our user-guided approach compared to the semi-automated approach of Xu et al. [2011] on the chair. In the system of Xu et al., the user input involves seeding a graph-cut segmentation algorithm with a bounding box, and rigidly aligning the
model. Their approach automatically segments the photographed
object based on connected components in the model, and deforms
the 3D model to resemble the photograph. Their approach approximates the form of most of the objects. However, as shown by the
insets in Figure 8(b), it fails to exactly match the model to the photograph. Through our approach (shown in Figure 8(c)), users can
accurately correct the geometry to match the photographed objects.
Supplementary material shows comparisons of our approach with
Xu et al. for photographs of the crane, taxi-cab, banana, and mango.
Illumination Evaluation. To evaluate the illumination estimation, we captured fifteen ground-truth photographs of the chair from
Figure 2 in various orientations and locations using a Canon EOS
5d Mark II Digital SLR camera mounted on a tripod and fitted with
an aspheric lens. The photographs are shown in the supplementary
material. We also captured a photograph of the scene without the
object to provide a ground-truth background image. We aligned
the 3D model of the chair to each of the fifteen photographs using our geometry correction approach from Section 4, and evaluated our illumination and reflectance estimation approach from
Section 5 against a ground truth light probe, and three approaches:
(1) Haar wavelets with positivity constraints on coefficients [Haber et al. 2009],
(2) L1 -sparse high frequency illumination with spherical harmonics for low-frequency illumination [Mei et al. 2009], and (3) environment map completion using projected background [Khan et al.
2006]. We fill the parts of the Khan et al. environment map not seen in the image using the PatchMatch algorithm [Barnes et al. 2009].
Figure 9: Plots of mean-squared reconstruction error (MSE) versus number of basis components for the vMF basis in green (method used
in this paper) compared to the Haar basis [Haber et al. 2009], spherical harmonics with L1 prior (Sph+L1) [Mei et al. 2009], background
image projected (Bkgnd) [Khan et al. 2006], and a light probe, on the object and the ground, ground only, and object only.
Table 1: Times taken (minutes) to align 3D models to photographs.
User    Banana    Mango    Top hat    Taxi     Chair    Crane
1       12.43     7.10     8.72       16.09    45.17    32.22
2        6.33     3.08    10.08        7.75    20.32    14.92
3        2.23     2.42     4.93        2.57     6.12    22.07
4        5.63     5.13     6.17        7.52     8.53    19.03
9 Discussion
Finally, failures can occur if the model from the 3D repository is incorrectly designed or too coarse, particularly in the case of objects with complex geometry and small parts.
While large collections of 3D models are available online, there are
several objects for which models may not be found. The rapid growth in the availability of online 3D models suggests that models not found today will be available in the near future. We expect that the rising ubiquity of 3D scanning and printing technologies, and the tendency towards standardization in object design and
manufacture, will contribute to the increase in model availability.
The more pressing question will soon be not whether a particular
model exists online, but rather, whether the user can find the model
in a database of millions. A crucial area of future research will be
to automate the search and alignment of 3D models to photographs.
Finally, while we address manipulations of photographs in 3D, extending these ideas to editing videos would vastly expand creative
control in the temporal domain.
Acknowledgments. This work was funded in part by a Google
Research Grant. We would like to thank James McCann, Srinivasa Narasimhan, Sean Banerjee, Leonid Sigal, and Jessica Hodgins for their valuable comments on the paper. In addition, we thank
Spencer Diaz and Moshe Mahler for help with animations.
References
Avidan, S., and Shamir, A. 2007. Seam carving for content-aware image resizing. In Proc. ACM SIGGRAPH.

Barnes, C., Shechtman, E., Finkelstein, A., and Goldman, D. B. 2009. PatchMatch: A randomized correspondence algorithm for structural image editing. In Proc. ACM SIGGRAPH, 24:1-24:11.

Barrett, W. A., and Cheney, A. S. 2002. Object-based image editing. In Proc. ACM SIGGRAPH, 777-784.

Barron, J. T. 2012. Shape, albedo, and illumination from a single image of an unknown object. In CVPR, 334-341.

Blanz, V., and Vetter, T. 1999. A morphable model for the synthesis of 3D faces. In Proc. ACM SIGGRAPH, 187-194.

Bokeloh, M., Wand, M., Koltun, V., and Seidel, H.-P. 2011. Pattern-aware shape deformation using sliding dockers. ACM Trans. Graph. 30, 6 (Dec.), 123:1-123:10.

Chen, J., Paris, S., Wang, J., Matusik, W., Cohen, M., and Durand, F. 2011. The video mesh: A data structure for image-based three-dimensional video editing. In ICCP, 1-8.

Chen, T., Zhu, Z., Shamir, A., Hu, S.-M., and Cohen-Or, D. 2013. 3-Sweep: Extracting editable objects from a single photo. ACM Trans. Graph. 32, 6, to appear.

Debevec, P. E., Taylor, C. J., and Malik, J. 1996. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. In Proc. ACM SIGGRAPH, 11-20.

Khan, E. A., Reinhard, E., Fleming, R. W., and Bülthoff, H. H. 2006. Image-based material editing. In Proc. ACM SIGGRAPH, 654-663.

Ng, R., Ramamoorthi, R., and Hanrahan, P. 2003. All-frequency shadows using non-linear wavelet lighting approximation. In Proc. ACM SIGGRAPH, 376-381.

Ramamoorthi, R., and Hanrahan, P. 2001. On the relationship between radiance and irradiance: Determining the illumination from images of a convex Lambertian object. J. Opt. Soc. Am. A 18, 10, 2448-2459.

Romeiro, F., and Zickler, T. 2010. Blind reflectometry. In ECCV, 45-58.

Simpson, J. 2003. Oxford English Dictionary Online, 2nd edition. http://www.oed.com/, July.

Sorkine, O., and Alexa, M. 2007. As-rigid-as-possible surface modeling. In Proc. SGP, 109-116.

Terzopoulos, D., Witkin, A., and Kass, M. 1987. Symmetry-seeking models and 3D object reconstruction. International Journal of Computer Vision 1, 211-221.

Tzur, Y., and Tal, A. 2009. FlexiStickers: Photogrammetric texture mapping using casual images. ACM Trans. Graph. 28, 3 (July), 45:1-45:10.

Xu, K., Zheng, H., Zhang, H., Cohen-Or, D., Liu, L., and Xiong, Y. 2011. Photo-inspired model-driven 3D object modeling. ACM Trans. Graph. 30, 4.

Zou, H., and Hastie, T. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67, 301-320.
Figure 13 columns: Original Photograph · Object Composited in Original View · 3D Manipulation · 3D Manipulation · Original 3D Model · Corrected 3D Model · Illumination
Figure 13: 3D manipulations (rotation, translation, copy-paste, deformation) to a chair, a pen (Photo Credits: Christopher Davis), a subject's watch, fruit, a painting (Credits: Odilon Redon), a car on a cliff (Photo Credits: rpriegu), a top hat (Photo Credits: tony the bald eagle), and a historical photograph of World War II Avengers (Photo Credits: Naval Photographic Center). As shown in the second column, our approach plausibly replicates the original photograph in the first column, which enables a seamless transition in image appearance when new manipulations are applied to the object. The illumination is shown as an environment map whose x- and y-axes represent the image plane, and whose z-axis represents the direction of the camera into the scene. Through our approach, users can create dynamic compositions such as levitating chairs, flying watches, falling fruit, and diving cars, combine 3D manipulations with creative effects such as pen strokes and painting styles, and create object deformations such as the top hat resized and curved into a magician's hat.