
2D-to-3D Photo Rendering for 3D Displays

Dario Comanducci
Dip. di Sistemi e Informatica, Univ. di Firenze
Via S.Marta 3, 50139 Firenze, Italy
comandu@dsi.unifi.it

Atsuto Maki
Toshiba Research Europe
Cambridge CB4 0GZ, UK
atsuto.maki@crl.toshiba.co.uk

Carlo Colombo
Dip. di Sistemi e Informatica, Univ. di Firenze
Via S.Marta 3, 50139 Firenze, Italy
colombo@dsi.unifi.it

Roberto Cipolla
Dept. of Engineering, Univ. of Cambridge
Cambridge CB2 1PZ, UK
cipolla@eng.cam.ac.uk

Abstract

We describe a computationally fast and effective approach to 2D-3D conversion of an image pair for the three-dimensional rendering on stereoscopic displays of scenes including a ground plane. The stereo disparities of all the other scene elements (background, foreground objects) are computed after statistical segmentation and geometric localization of the ground plane. Geometric estimation includes camera self-calibration from epipolar geometry, and an original algorithm for the recovery of 3D visual parameters from the properties of planar homologies. Experimental results on real images show that, notwithstanding the simple "theatrical" model employed for the scene, the disparity maps generated with our approach are accurate enough to provide users with a stunning 3D impression of the displayed scene, and fast enough to be extended to video sequences.
1. Introduction
The recent advent of commercial 3D screens and visualization devices has renewed the interest in computer vision techniques for 2D-to-3D conversion. The appeal of 2D-to-3D conversion is two-fold. First, direct production of 3D media contents through specialised capturing equipment such as a stereoscopic video camera is still quite expensive. Second, a facility for converting monocular videos to stereo format would support the full 3D visualization of already existing contents, such as vintage movies.

A stereoscopic camera system consists of a pair of cameras producing a stereo (left and right) pair of images. Different disparities (i.e., shifts of corresponding scene points in the left and right visual channels) are interpreted by the human brain as corresponding variations of scene depth. The simplest stereoscopic system consists of two cameras with parallel axes. Such a camera system produces images with only horizontal disparity, thus avoiding the vertical image disparity arising in stereoscopic systems that verge the camera axes. Studies about viewer comfort for stereoscopic displays agree that the amount of disparity should vary in a limited range [8]. This is because, for humans, eye convergence and accommodation (focusing) are tightly related. When we watch a 3D TV we focus on the screen, and our eyes converge according to the distance from the screen. Hence, limiting disparities ensures that the viewer's perceived depth is controlled without stressing the convergence-accommodation bond. Points with zero disparity are located on the surface of the TV screen. Camera separation is indeed the most important parameter to produce realistic 3D images: as reported in [8], the length of the baseline between the two cameras depends on the viewing arrangement, disparity range, scene depth and camera features. In practice, the proper value of camera separation is usually chosen manually.

Given a single stream of uncalibrated images as input, the goal of 2D-3D conversion is essentially to generate, for each input image, a pair of stereo images having disparities that directly reflect the actual depth of the scene. In order to deal with this highly ill-posed problem, it is a prerequisite to make certain assumptions [12], and therefore existing techniques either adapt models of scene geometry, or use manual procedures such as user scribbles for guiding depth estimation (see [4] for a recent example). An alternative strategy for disparity map generation is to perform dense depth search [16]. Although such an approach appears most general and appropriate, finding dense correspondences can be quite hard in the presence of textureless regions and/or occlusions. Even when a powerful bundle optimization framework is employed [18], it is nevertheless difficult to obtain a clean segmentation between objects throughout an entire sequence, which typically results in a blurred perception of boundaries in the 3D scene. Another difficulty of dense stereo methods is that they are very time consuming, and therefore hardly usable for the stereoscopic rendering of long video sequences. To avoid visual artifacts due to inaccurately recovered 3D information, in [17] the stereoscopic video frames are generated by selecting the most suitable frames within the input video, taking stereoscopic effect, frame similarity and temporal smoothness into account. This strategy is useful only in videos with a consistent panning motion.

In this paper, we describe a practical and effective approach to 2D-3D conversion of an image pair, under the basic assumption that the scene contains a ground plane. Once such a plane is segmented in the images by a statistical algorithm [7, 14], the rest of the scene elements (background and foreground objects, the latter segmented in a semi-automatic way) can be acquired and rendered. In particular, the background is modelled as a vertical ruled surface following the ground boundary deformation, whereas foreground objects are flat, vertical layers standing upon the ground plane. The disparities of all scene elements are computed starting from the equation describing the actual position and orientation of the ground plane in the scene. To compute the ground plane equation, an original method based on scene homographies and homologies is employed, requiring as input only an estimate of the epipolar geometry of the two views. Experimental results on real images show that the disparity maps generated with the proposed method are effective in providing the users with a dramatic and vivid 3D impression of the displayed scene. This perceptual success is obtained notwithstanding the deliberate simplicity of the scene model, and is due in part to a proper rendering of texture as a dominant visual cue [1, 2]. Results also demonstrate that, for the purpose of successful 3D rendering and visualization, the correct ordering of layers in terms of their distance, together with a neat segmentation of their boundaries, is more important than a high accuracy of disparity estimates [9]. The overall process of stereoscopic rendering is fast enough to be fully automated and extended to 2D-3D video conversion.

The paper is organized as follows. The next section discusses all the theoretical aspects of the approach, from geometric modelling and estimation (subsect. 2.1) to image segmentation and stereoscopic rendering (subsect. 2.2). In sect. 3 experimental results are presented and discussed. Finally, conclusions and directions for future work are provided in sect. 4.

2. The approach

Given an uncalibrated monocular view I of a 3D scene, our goal is to synthesize the corresponding stereo pair (I_l, I_r) for a virtual stereoscopic system by exploiting a second view J of the same scene. The cameras corresponding to the actual views are placed in general position and are therefore not in a stereoscopic system configuration. I and J are referred to as reference and support images, respectively. The role of I and J can be swapped, so each of them can be rendered on a 3D TV screen.

Fig. 1 provides a general overview of the approach. By exploiting the support image, epipolar geometry estimation and camera self-calibration are first carried out. Automatic ground segmentation then allows recovering the homography induced by the ground plane on the two actual views. By combining this homography with calibration data, the ground plane equation is estimated. Hence, ground plane and calibration data are exploited to compute the two homographies generating the stereoscopic images of the ground plane. After that, the rendered ground plane images are used to render the background and foreground images, and eventually all the rendered images are merged together to form the stereoscopic image pair to be displayed.

2.1. Geometric estimation

We now develop the theory related to warping the image of the ground plane onto the stereoscopic pair I_l and I_r. Since the theory is actually general, in the following we will refer to any given planar region π in the scene. The image I_π ⊂ I of π can be warped onto I_l and I_r according to a pair of homographies H_l and H_r that depend on the plane orientation n_π in space, its signed distance d_π from the reference camera, and calibration data. Explicitly, the homography warping I_π onto the right view I_r is

H_r = K_i (I − s n_π^T / d_π) K_i^{-1} ,   (1)

where K_i is the calibration matrix for view I, and s = [δ/2 0 0]^T, δ being the baseline between the virtual cameras. The homography H_l for the left view has the same form, but with s = [−δ/2 0 0]^T. These formulas are the specialization of the general homography between two views of a plane to the case when the two cameras are only shifted by a quantity ±δ/2 along the horizontal camera axis.

We discuss hereafter the estimation of the geometric entities related to the planar region π. Estimation of the epipolar geometry between the views I and J, and camera self-calibration of both intrinsic and extrinsic parameters from the fundamental matrix F between views I and J, will be addressed later on.

The plane orientation n_π can be computed as

n_π = K_i^T ^i l_π ,   (2)

where ^i l_π is the vanishing line of π in image I.
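As an illustration of Eqs. (1) and (2), the following minimal NumPy sketch builds the two virtual-view homographies from the calibration matrix, the plane normal and distance, and the virtual baseline. The function names and the toy numerical values are assumptions of this sketch, not part of the paper.

import numpy as np

def plane_normal_from_vanishing_line(K, l_pi):
    # Eq. (2): n_pi = K^T l_pi, normalised so that d_pi is a metric distance.
    n = K.T @ l_pi
    return n / np.linalg.norm(n)

def virtual_view_homographies(K, n_pi, d_pi, delta):
    # Eq. (1): H = K (I - s n_pi^T / d_pi) K^{-1}, with s = [±delta/2, 0, 0]^T.
    K_inv = np.linalg.inv(K)
    def H(shift):
        s = np.array([shift, 0.0, 0.0])
        return K @ (np.eye(3) - np.outer(s, n_pi) / d_pi) @ K_inv
    return H(-delta / 2.0), H(+delta / 2.0)   # (H_l, H_r)

# Toy example (all values are made up for illustration only).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
l_ground = np.array([0.0, 1.0, -200.0])        # hypothetical ground vanishing line
n_ground = plane_normal_from_vanishing_line(K, l_ground)
H_l, H_r = virtual_view_homographies(K, n_ground, d_pi=2.5, delta=0.1)

Each homography can then be applied to the segmented ground region with a standard warping routine such as OpenCV's cv2.warpPerspective.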

Figure 1. Overview of the approach. (Block diagram: from the reference image I and the support image J, camera calibration, the ground plane equation, and the segmentation of ground and foreground objects are obtained; these produce the ground, background and foreground images which, given the stereoscopic baseline δ, are merged into the final stereo images.)

The signed distance d_π can be obtained by triangulating any two corresponding points under the homography H_π (induced by π between I and J, and estimated as detailed in subsect. 2.2.1) and imposing the passage of π through the triangulated 3D point. The vanishing line ^i l_π of the planar region π is composed of points that are mapped from I to J both by H_π and by the infinite homography H_∞ = K_j R K_i^{-1}. The homography

H_p = H_π^{-1} H_∞   (3)

mapping I onto itself is actually a planar homology, i.e., a special planar transformation having a line of fixed points (the axis) and a distinct fixed point (the vertex), not on the line. In the case of H_p, the vertex is the epipole ^i e_j ∈ I of view J, and the axis is the vanishing line ^i l_π, since it is the intersection of π with the plane at infinity π_∞ [5]. Thus, thanks to the properties of homologies, ^i l_π is obtained as ^i l_π = w_1 × w_2, where w_1, w_2 are the two eigenvectors of H_p corresponding to the two equal eigenvalues.
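A rough sketch of this eigenvector computation is given below, assuming the homographies H_π and H_∞ are available as 3x3 NumPy arrays; the eigenvalue-pairing heuristic is a choice of this illustration, since with noisy data the two equal eigenvalues are only approximately equal.

import numpy as np
from itertools import combinations

def vanishing_line_from_homology(H_pi, H_inf):
    # Planar homology of Eq. (3): H_p = H_pi^{-1} H_inf maps image I onto itself.
    H_p = np.linalg.inv(H_pi) @ H_inf
    H_p = H_p / np.cbrt(np.linalg.det(H_p))     # remove the arbitrary scale
    vals, vecs = np.linalg.eig(H_p)
    # The axis (vanishing line) joins the two eigenvectors whose eigenvalues
    # should coincide; pick the closest pair to cope with noise.
    i, j = min(combinations(range(3), 2),
               key=lambda p: abs(vals[p[0]] - vals[p[1]]))
    w1, w2 = np.real(vecs[:, i]), np.real(vecs[:, j])
    l_pi = np.cross(w1, w2)
    return l_pi / np.linalg.norm(l_pi)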
In order to obtain robust warping results, it is required that the homography H_π be compatible with the fundamental matrix F, i.e., H_π^T F + F^T H_π = 0. This is achieved by using a proper parametrization for H_π [5]. Given the fundamental matrix F between two views, the three-parameter family of homographies induced by a world plane π is

H_π = A − ^j e_i v^T ,   (4)

where [^j e_i]_× A = F is any decomposition (up to scale) of the fundamental matrix, and ^j e_i is the epipole of view I in image J (in other words, ^j e_i^T F = 0^T). Since [^j e_i]_× [^j e_i]_× F = −‖^j e_i‖^2 F, the matrix A can be chosen as

A = [^j e_i]_× F .   (5)
A = [j ei ]× F . (5)
the support camera with respect to the reference camera are
Both the fundamental matrix F and the ground plane then recovered by factorizing the estimated essential matrix
homography Hπ are robustly computed by running the as Ê = [t]× R [5].
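A compact prototype of this calibration step is sketched below. It assumes, as in the paper, square pixels and a principal point at the image center, so only the two focal lengths are refined; the use of a generic SciPy minimizer in place of Levenberg-Marquardt, and the relative form of the residual, are choices of this sketch rather than of [11].

import numpy as np
from scipy.optimize import minimize

def K_of(f, w, h):
    # Assumed intrinsics: focal length f, zero skew, principal point at the center.
    return np.array([[f, 0.0, w / 2.0],
                     [0.0, f, h / 2.0],
                     [0.0, 0.0, 1.0]])

def self_calibrate(F, size_i, size_j, f0):
    # size_i, size_j: (width, height) of the two images; f0: initial focal guess.
    # Force E_hat = K_j^T F K_i to have two equal non-zero singular values.
    def cost(f):
        E = K_of(f[1], *size_j).T @ F @ K_of(f[0], *size_i)
        s = np.linalg.svd(E, compute_uv=False)
        return ((s[0] - s[1]) / (s[0] + s[1])) ** 2
    res = minimize(cost, x0=[f0, f0], method='Nelder-Mead')
    K_i = K_of(res.x[0], *size_i)
    K_j = K_of(res.x[1], *size_j)
    return K_i, K_j, K_j.T @ F @ K_i

A reasonable initial guess f0 is w + h, consistent with the interval suggested in [6]; the returned essential matrix can then be factorized into R and t, e.g. with OpenCV's cv2.decomposeEssentialMat.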

Figure 2. (a): Reference image I. (b): Support image J.

Figure 3. Ground plane recovery. (a): Ground classification for image I: the brighter the color, the more probable the ground region. (b): Recovery of the ground plane vanishing line (dashed line in the picture), after camera self-calibration and ground plane homography estimation.

2.2. Stereo pair generation and rendering

So far, we have described how to compute the pair of homographies mapping the image of a generic planar region onto the two translated virtual views forming the stereoscopic pair (I_l, I_r). This section specializes the use of Eq. 1 to the case of a scene including a planar ground, and then expounds how to warp the background and foreground objects properly, given the image of the ground plane. Fig. 2 shows the images I and J that will be used to illustrate the various rendering phases.

2.2.1 Ground plane virtual view generation

The ground plane is segmented in the images I and J by exploiting the classification algorithm proposed in [7]. Fig. 3(a) shows the ground plane classification for the reference image of Fig. 2(a). Fig. 3(b) shows the computed vanishing line for the ground plane in the reference image I, after camera self-calibration and the computation of the ground plane homography H_π have been performed. The resulting two virtual views (I_l, I_r) of the ground plane are shown in Fig. 4.

Figure 4. The two virtual views for the ground plane of image I. (a): I_l. (b): I_r.

2.2.2 Background generation

Given the warped ground plane, the background of the scene is generated in a column-wise way. This is a direct consequence of modelling the background as a ruled surface perpendicular to the ground plane. For each point p of the top border of the ground in I, the corresponding point in I_r and I_l is recovered, and the whole column of pixels above p is copied in I_r and I_l starting from H_r p and H_l p respectively. When the top border of the ground is not visible because it is occluded by a foreground object, the corresponding image column cannot be copied as described before. Instead, the missing background part is recovered by linearly interpolating the corresponding background column indexes in I. In particular, the missing background pixel columns are obtained by uniformly sampling the reference image I in the range [y_l, y_r], where y_l and y_r denote the borders of the missing background in image I. If there are several connected missing parts, the procedure must be repeated for each of them. Fig. 5(a) shows an example of occlusion by a foreground object (the statue). Fig. 5(b) shows that background data have been correctly filled in. When the foreground object does not occlude the top border of the ground, but occludes some pixels of the corresponding background column, the foreground pixels are not copied. The remaining background portions, i.e., those occluded by the foreground objects, are filled in with smooth color interpolation.
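The column-wise transfer can be sketched as follows; the representation of the ground's top border as one row index per column, the helper name, and the omission of the occlusion and interpolation handling described above are simplifications of this illustration.

import numpy as np

def render_background_columns(I, top_border, H):
    # For each column u of the reference image I, copy the pixels above the
    # ground's top-border point p = (u, top_border[u]) into the virtual view,
    # so that the copied column ends at the warped point H p (ruled surface model).
    h, w = I.shape[:2]
    out = np.zeros_like(I)
    for u in range(w):
        v_top = int(top_border[u])
        q = H @ np.array([u, v_top, 1.0])
        uq, vq = int(round(q[0] / q[2])), int(round(q[1] / q[2]))
        if not (0 <= uq < w and 0 < vq <= h):
            continue
        col = I[:v_top, u]                      # pixels above the border point in I
        n = min(len(col), vq)                   # clip at the top of the virtual image
        if n > 0:
            out[vq - n:vq, uq] = col[len(col) - n:]
    return out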

Figure 5. Background generation for I_r. (a): Top border of the background not occluded. (b): Recovery of the background for the occluded part of the ground top border.

2.2.3 Displacement of the foreground objects

Foreground objects are segmented in a semi-automatic way with the GrabCut tool [13]. They are rendered in the images as flat and frontal objects, since the texture of an object is usually sufficient to provide the user with the impression of local depth variation due to the object's shape. Depth is assigned as the value corresponding to the point of contact with the ground, considered to be the bottom point of the object's silhouette. Users are allowed to change the position of the point of contact by clicking on the desired point in the image. Fig. 6 shows the final stereo pair ((a) and (b)), the two images superimposed (c) and the disparity map (d).

2.2.4 Stereoscopic display on a 3D screen

For a parallel camera stereoscopic system, points at infinity have zero disparity, and appear to the user to be on the screen surface when the images are displayed on a 3D TV without modification. When a limited range [−a, b] for disparity is introduced, the nearest and furthest points are associated with the extreme values of that range. Hence the zero disparity plane is not located anymore at infinity, but is frontal to the cameras, in a region between the nearest and furthest points. Since the scene is in front of the camera, in our approach an overall translation is applied to the two images I_r and I_l in order to have zero disparity on the bottom line of the ground. Doing so, the user has the impression that the 3D image is inside the TV, starting from the screen surface. Users are nonetheless free to change the overall shift and put other frontal regions on the screen surface, if required.
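A minimal version of this re-targeting shift is sketched below, assuming the disparity measured on the bottom line of the ground is known; the half-and-half split of the shift between the two views and the use of cv2.warpAffine are choices of this sketch.

import cv2
import numpy as np

def shift_to_zero_disparity(I_l, I_r, ground_bottom_disparity):
    # Translate each view by half the measured disparity (in opposite directions)
    # so that the bottom line of the ground ends up with zero disparity,
    # i.e. on the screen surface of the 3D display.
    h, w = I_l.shape[:2]
    dx = ground_bottom_disparity / 2.0
    M_l = np.float32([[1, 0, -dx], [0, 1, 0]])
    M_r = np.float32([[1, 0, +dx], [0, 1, 0]])
    return cv2.warpAffine(I_l, M_l, (w, h)), cv2.warpAffine(I_r, M_r, (w, h))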
3. Experimental results

The approach was tested on several image pairs with perceptually pleasing results and a convincing 3D visual effect.

Figs. 7(a) and (b) show the reference and support images of the "horse" pair together with their associated ground plane vanishing lines. The original pair does not form a parallel camera stereoscopic system, as the vanishing lines are not coincident. Figs. 7(c) and (d) show the obtained stereoscopic pair, featuring coincident vanishing lines. Notice at the bottom left (c) and right (d) corners the black (i.e., empty) regions arising after warping the original images. Fig. 7(e) shows the resulting disparity map. Finally, Fig. 7(f) shows a front-to-parallel view of the ground plane. Such a view, obtained by metric rectification of the ground plane in image I based on the vanishing line and the camera calibration data, provides clear visual proof of the good accuracy of the geometric estimates. Indeed, the ground boundaries next to the walls (dashed lines) are almost perfectly orthogonal, as they should be, despite the very slanted view of the ground in the original image.

Figure 7. The "horse" example. (a): Reference image I. (b): Support image J. (c): Left stereoscopic image I_l. (d): Right stereoscopic image I_r. (e): Resulting disparity map for (I_l, I_r). (f): Front-to-parallel view of the ground in the horse case. The ground plane corner forms a right angle as it is delimited by two perpendicular walls.

Figure 6. Stereoscopic rendering for I of Fig. 2(a). (a): I_l. (b): I_r. (c): Superimposed stereoscopic images. (d): Disparity map.

Figs. 8(a) and (b) illustrate the "bushes" pair, where two partially self-occluding foreground objects are present. Notice, from both Figs. 8(c) and (d), the small blurred regions, especially evident to the left (c) and right (d) of the closer bush, due to color interpolation inside occluded background areas. As evident from the disparity map of Fig. 8(e), the two bushes are correctly rendered as belonging to two distinct depth layers. The good quality of the disparity map obtained with our approach is confirmed by a visual comparison against the disparity map of Fig. 8(f), which was obtained with a state-of-the-art dense stereo approach [16]: the two maps look very similar. However, dense stereo is much slower than our approach, taking about 50 minutes for each image pair on a quad core Intel Xeon 2.5 GHz PC. In the present MATLAB implementation of our approach, the overall processing time for an image pair is less than 5 minutes, also taking into account the semi-automatic foreground segmentation procedure.

Fig. 9 illustrates the results obtained with the "bride statues" pair. This pair also includes two foreground objects, but, differently from the "bushes" pair, the second foreground object is almost completely occluded by the first.

Figure 8. The "bushes" example. (a): Reference image I. (b): Support image J. (c): Left stereoscopic image I_l. (d): Right stereoscopic image I_r. (e): Disparity map with our approach. (f): Disparity map with a dense stereo approach.

Figure 9. The "bride statues" example. (a): Reference image I. (b): Support image J. (c): Left stereoscopic image I_l. (d): Right stereoscopic image I_r. (e): Disparity map with our approach. (f): Disparity map with a dense stereo approach.

However, the disparity map of Fig. 9(e) clearly shows that the unoccluded part of the second foreground object was nevertheless correctly rendered in a depth layer between the first foreground object and the background. Also notice from the disparity map that, due to the ruled surface model, the background is rendered at different depths, thus reflecting the irregular shape of the ground plane upper border in the image. Although the ruled surface model is but an approximation of the real background (as evident from a comparison with the dense stereo disparity of Fig. 9(f), where the shape of the background building is nicely captured, while the second foreground object is totally missing), it still represents the visual scene accurately enough to produce an impressive 3D illusion.

Finally, some frames of a synthetic video generated from the stereo data extracted for the "horse" example (see again Fig. 7) are shown in Fig. 10. The camera performs a virtual translation along its x-axis, showing the parallax effect on the horse position w.r.t. the background.

Figure 10. Some frames of a synthetic video sequence for the "horse" example of Fig. 7. The camera translates along its x-axis from right to left. Black pixels around the horse correspond to occluded background points.

4. Conclusions and Future Work

We have described and discussed a simple yet fast and effective approach to 2D-3D conversion of an image pair for parallel stereoscopic displays, where the disparities of all scene elements are generated after statistical segmentation and geometric localization of the ground plane in the scene.

Future work will address (1) extending the approach to videos (which will lead to investigating the problem of temporal consistency among frames), (2) relaxing the ground plane assumption, (3) performing a totally automatic image segmentation based on a multi-planar scene model, thus further speeding up computations (in the current implementation, more than 90% of the time is taken by the semi-automatic foreground object segmentation) while retaining the basic geometric structure of the approach expounded in subsect. 2.1, and (4) implementing an automatic method to determine the optimal range of disparities for 3D perception.

Acknowledgements

We heartily thank Oliver Woodford for providing us with the experimental results used to compare our approach with his dense stereo method [16].

References

[1] S. Coren, L. M. Ward, and J. T. Enns. Sensation and Perception. Harcourt Brace, 1993.
[2] A. Criminisi, M. Kemp, and A. Zisserman. Bringing pictorial space to life: computer techniques for the analysis of paintings. In on-line Proc. Computers and the History of Art, 2002.
[3] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. of the ACM, 24(6):381-395, 1981.
[4] M. Guttman, L. Wolf, and D. Cohen-Or. Semi-automatic stereo extraction from video footage. In Proc. IEEE International Conference on Computer Vision, 2009.
[5] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2004.
[6] A. Heyden and M. Pollefeys. Multiple view geometry. In G. Medioni and S. B. Kang, editors, Emerging Topics in Computer Vision. Prentice Hall, 2005.
[7] D. Hoiem, A. Efros, and M. Hebert. Recovering surface layout from an image. International Journal on Computer Vision, 75(1), 2007.
[8] G. Jones, D. Lee, N. Holliman, and D. Ezra. Controlling perceived depth in stereoscopic images. In Proc. SPIE Stereoscopic Displays and Virtual Reality Systems VIII, volume 4297, 2001.
[9] J. Koenderink, A. van Doorn, A. M. L. Kappers, and J. T. Todd. Ambiguity and the 'mental eye' in pictorial relief. Perception, 30(4):431-448, 2001.
[10] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal on Computer Vision, 60(2):91-110, 2004.
[11] P. Mendonça and R. Cipolla. A simple technique for self-calibration. In Proc. Conf. Computer Vision and Pattern Recognition, 1999.
[12] V. Nedovic, A. W. M. Smeulders, A. Redert, and J. M. Geusebroek. Stages as models of scene geometry. IEEE Transactions on Pattern Analysis and Machine Intelligence, (in press), 2010.
[13] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In ACM Transactions on Graphics (SIGGRAPH), 2004.
[14] A. Saxena, M. Sun, and A. Y. Ng. Learning 3-D scene structure from a single still image. In Proc. IEEE International Conference on Computer Vision, pages 1-8, 2007.
[15] P. Sturm. On focal length calibration from two views. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2001.
[16] O. Woodford, P. Torr, I. Reid, and A. Fitzgibbon. Global stereo reconstruction under second-order smoothness priors. IEEE Trans. on Pattern Analysis and Machine Intelligence, 31(12):2115-2128, 2009.
[17] G. Zhang, W. Hua, X. Qin, T. T. Wong, and H. Bao. Stereoscopic video synthesis from a monocular video. IEEE Transactions on Visualization and Computer Graphics, 13(4):686-696, 2007.
[18] G. Zhang, J. Jia, T.-T. Wong, and H. Bao. Consistent depth maps recovery from a video sequence. IEEE Trans. on Pattern Analysis and Machine Intelligence, 31(6):974-988, 2009.