
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2018.2835659, IEEE Access

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier xx.xxxx/ACCESS.2018.DOI

Stitching for Multi-View Videos With Large Parallax Based on Adaptive Pixel Warping
KYU-YUL LEE¹ and JAE-YOUNG SIM¹, (Member, IEEE)
¹School of Electrical and Computer Engineering, Ulsan National Institute of Science and Technology, Ulsan 44919, South Korea
Corresponding author: Jae-Young Sim (e-mail: jysim@unist.ac.kr).
This work was supported in part by the National Research Foundation of Korea (NRF) within the Ministry of Science and ICT (MSIT) under Grant 2017R1A2B4011970 and within the Ministry of Education under Grant 2016R1D1A1A09919618, and in part by an Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 20170006670021001, Information-Coordination Technique Enabling Augmented Reality with Mobile Objects).

ABSTRACT Conventional stitching techniques for images and videos are based on smooth warping models, and therefore they often fail to work on multi-view images and videos with large parallax captured by cameras with wide baselines. In this paper, we propose a novel video stitching algorithm for such challenging multi-view videos. We reliably estimate the parameters of the ground plane homography, the fundamental matrix, and the vertical vanishing points, using both appearance-based and activity-based feature matches validated by geometric constraints. We alleviate the parallax artifacts in stitching by adaptively warping the off-plane pixels into geometrically accurate matching positions through their ground plane pixels based on the epipolar geometry. We also exploit the inter-view and inter-frame correspondence matching information together to estimate the ground plane pixels reliably, which are then refined by energy minimization. Experimental results show that the proposed algorithm provides geometrically accurate stitching results for multi-view videos with large parallax and outperforms state-of-the-art stitching methods both qualitatively and quantitatively.

INDEX TERMS Multi-view videos, video stitching, image stitching, large parallax, adaptive pixel warping,
epipolar geometry.

I. INTRODUCTION

Multi-view videos are widely used in many applications such as surveillance [1]–[3], sports [4]–[6], virtual training [7], and video conferencing [8], [9]. One of the essential techniques for multi-view applications is stitching, which combines multiple images, captured from different viewing positions and directions, to generate a single image with a wider field of view [10]. Image stitching has been actively studied in the literature [11]–[21], and related commercial products have also been developed, e.g., Adobe Photoshop Photomerge™ and Microsoft Image Composite Editor. Moreover, many current mobile devices with cameras are able to synthesize a panorama image by stitching multiple images captured at different time instances. Also, around view monitoring is one of the core applications of autonomous vehicles, which employs bird's eye views of stitched multiple images captured by front, side, and rear view cameras [22].

Traditional image stitching methods assume that a pair of images are taken from camera locations very close to each other and that the captured scene structures are roughly planar. Based on these assumptions, stitched images are obtained by performing three major steps: feature matching, image alignment, and image composition. First, feature points are detected from the different images and matched together by using feature descriptors, e.g., SIFT [23]. In the alignment step, a global image warping model such as a homography is estimated by using the obtained feature matches, and the multiple images are aligned to a common image domain accordingly. Finally, the pixel values in the stitched image are determined by average blending or seam cutting methods [10].
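As a concrete illustration of this three-step pipeline, the following minimal sketch matches SIFT features, estimates a single global homography with RANSAC, and composes the result by average blending. It is a baseline for the small-baseline, roughly planar setting described above, not the algorithm proposed in this paper; the input file names are hypothetical.

    import cv2
    import numpy as np

    # Classical three-step stitching baseline: feature matching, global
    # homography alignment, and average-blending composition.
    target = cv2.imread("target.png")        # hypothetical input frames
    reference = cv2.imread("reference.png")

    # Step 1: feature matching with SIFT descriptors and a ratio test.
    sift = cv2.SIFT_create()
    kp_t, des_t = sift.detectAndCompute(target, None)
    kp_r, des_r = sift.detectAndCompute(reference, None)
    pairs = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_t, des_r, k=2)
    matches = [m for m, n in pairs if m.distance < 0.7 * n.distance]

    # Step 2: alignment by a single global homography estimated with RANSAC.
    src = np.float32([kp_t[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_r[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

    # Step 3: composition by warping the target onto the reference domain
    # and average-blending the overlapping pixels.
    h, w = reference.shape[:2]
    warped = cv2.warpPerspective(target, H, (w, h))
    overlap = (warped.sum(axis=2) > 0) & (reference.sum(axis=2) > 0)
    stitched = np.where(overlap[..., None],
                        warped.astype(np.float32) / 2 + reference / 2,
                        np.maximum(warped, reference)).astype(np.uint8)
    cv2.imwrite("stitched.png", stitched)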


However, when multi-view cameras capture non-planar scene structures at camera positions relatively far from one another, the resulting multi-view images exhibit a parallax phenomenon where the relative locations of scene contents vary across different views. In such cases, the traditional stitching methods suffer from parallax artifacts. Therefore, advanced image stitching methods [11]–[21] have been studied which alleviate some amount of parallax artifact by designing locally adaptive transformations for flexible warping, employing similarity transformations to reduce perspective distortion, and/or hiding the misalignment in the composition stage based on seam-cutting methods.

Recently, in many practical applications such as surveillance and sports, static multiple cameras are placed at viewing positions very far from one another, with wide baselines. Also, captured 3D real-world scenes often include multiple foreground objects moving over a wide range of scene depths. For example, walking pedestrians are captured by static multiple cameras installed at arbitrary locations [24]–[26], and multiple players in sports games are captured by static cameras with wide baselines [27]. On these challenging multi-view images, even the aforementioned advanced image stitching techniques have limitations in combining the diverse scene structures accurately, causing ghosting artifacts in the stitching results, due to two main reasons. First, abrupt depth discontinuities among multiple foreground objects and the background are hard to treat accurately with the existing warping schemes. Second, appearance-based feature descriptors may provide large numbers of outlier matches due to severe parallax.

Compared to the image stitching research, relatively little effort has been made to develop multi-view video stitching techniques. Video stitching has been regarded as an extension of image stitching where the multiple frames from different views at a certain time instance are stitched together by using existing image stitching techniques [28]. Also, a temporal cost term has simply been added to the cost function for image stitching [29]. Therefore, stitching for challenging multi-view videos with large parallax still has the aforementioned problems of image stitching.

In this paper, we propose a geometrically accurate stitching algorithm for multi-view videos with large parallax (MVLP) which are captured by stationary cameras with wide baselines. We consider surveillance and sports applications where multiple people are moving on the ground plane at arbitrary distances from the cameras. We develop a parallax-adaptive pixel warping model, where the ground plane pixels are warped by homography, but the pixels off the plane, i.e., the pixels on the foreground objects and the distant background region, are warped through their ground plane pixels based on the epipolar geometry. We also estimate the optimal ground plane pixels by employing the reliable spatial and temporal feature matches based on an energy minimization framework. Experimental results show that the proposed algorithm stitches multi-view videos successfully without severe parallax artifacts, and yields significantly better performance than the existing state-of-the-art image stitching techniques qualitatively and quantitatively.

A preliminary result of this work was presented in [30]. The major differences between [30] and this paper are as follows.
• We propose a more generalized video stitching framework which aligns the foreground objects and the background, respectively, while our previous algorithm [30] was applied to the foreground objects only.
• We improve the warping performance by estimating optimal ground plane pixels, while our previous work [30] estimates a projective depth using the lowest pixel in each object.
• We perform more extensive experiments using 12 video sequences and provide comparative experimental results between the conventional methods and the proposed algorithm qualitatively and quantitatively.

The rest of this paper is organized as follows. Section II describes the related work on image and video stitching and static multi-camera based tracking. Section III proposes the basic concept of the proposed parallax-adaptive pixel warping model. Section IV and Section V explain the algorithms of parameter estimation and ground plane pixel estimation, respectively. Section VI presents the experimental results. Finally, Section VII concludes the paper.

II. RELATED WORK
A. IMAGE AND VIDEO STITCHING
Homography is a traditional image warping model which describes the projective relationship between two image planes based on the planar scene assumption [10], [31]. In general, an optimal homography is estimated by feature matching between two images. Homography can register multiple images associated with small camera baselines successfully; however, it fails to work on images with large camera baselines where a captured scene is composed of multiple planar structures.

To overcome this limitation, advanced image stitching methods employ spatially-varying warps which adaptively align the spatial deviation between two images caused by parallax. Gao et al. estimated dual homographies to align the ground plane and the distant background plane, respectively, by clustering the feature points according to their positions [11]. Lin et al. initialized a global affine transformation which is then iteratively refined to minimize a cost function defined by matched features [12]. Zaragoza et al. partitioned an input image into multiple cells, and estimated a homography for each cell by weighting the feature matches according to the relative distances to the feature points [13]. Zhang et al. proposed a mesh-based alignment technique to mitigate the shrinking problem of wide-baseline panorama synthesis, which designs a scale-preserving cost function using the perimeter of polygons created from feature points [14]. The spatially-varying warps reduce the parallax artifact of image stitching by a certain amount; however, they cannot reflect abrupt depth changes in a captured scene completely, since the neighboring cells are processed with smoothness constraints. Moreover, the spatially-varying warps were inherently designed to deform images assuming small baselines [32], and thus the warped images look unnatural when the relative orders of control points are changed across multiple images due to large parallax [33].
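To make the spatially-varying idea concrete, the sketch below evaluates a location-dependent homography in the spirit of the per-cell weighting of [13]: the feature matches are re-weighted by a Gaussian of their distance to a cell center before solving the direct linear transform. This is a schematic re-implementation with assumed parameter values (sigma, min_w), not the authors' released code.

    import numpy as np

    def local_homography(src, dst, center, sigma=50.0, min_w=0.01):
        """Location-adaptive homography in the spirit of the per-cell
        weighting of [13]. src and dst are (N, 2) arrays of matched
        feature positions; center is the cell center at which the local
        warp is evaluated. Matches near the center dominate the fit."""
        w = np.exp(-np.sum((src - center) ** 2, axis=1) / (2.0 * sigma ** 2))
        w = np.maximum(w, min_w)  # floor keeps the system well-conditioned
        rows = []
        for (x, y), (u, v), wi in zip(src, dst, w):
            rows.append(wi * np.array([-x, -y, -1, 0, 0, 0, u * x, u * y, u]))
            rows.append(wi * np.array([0, 0, 0, -x, -y, -1, v * x, v * y, v]))
        A = np.stack(rows)
        _, _, Vt = np.linalg.svd(A)   # weighted DLT: null vector of A
        return Vt[-1].reshape(3, 3)

Evaluating local_homography at every cell center yields a mesh of smoothly varying warps; the smoothness across neighboring cells is exactly what prevents such warps from following abrupt depth discontinuities.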

FIGURE 1: Stitching images with large parallax. (a) A target image and (b) a reference image. The resulting stitched images by using (c) a
homography based warping scheme, (d) APAP [13], and (e) the proposed parallax-adaptive stitching, respectively.

The stitched images usually exhibit perspective distortions in the non-overlapping regions among multiple images, where no valid feature matches are obtained. To alleviate the perspective distortions, shape-preserving warps were proposed which extrapolate the warping models to the non-overlapping regions using similarity transformation and/or homography linearization [15]–[18]. Chang et al. applied a homography to the overlapping region of the images and similarity transformations to the non-overlapping regions, respectively [15]. Lin et al. proposed a homography linearization method to combine homography and similarity transformations smoothly [16]. Chen et al. improved the shape-preserving warp by accurately estimating the scale and rotation of the similarity transformation [17]. Li et al. proposed quasi-homography warps which linearly extrapolate the horizontal component of the homography [18]. The shape-preserving warps provide visually plausible stitching results, but do not always produce geometrically correct results.

Attempts have also been made to align only a certain region of the input images and hide the artifacts of mismatched regions by applying seam-based composition methods. Gao et al. obtained multiple homographies by taking the groups of inlier feature matches in order, and selected the best homography that yields a minimum seam cost [19]. Zhang et al. clustered closely located feature points together and found an optimal local homography associated with a minimum seam cutting error to align a local image region [20]. They also applied content-preserving warping (CPW) [34] to further refine the local alignment. Lin et al. generated multiple local homographies using a superpixel-based grouping scheme, and further refined each homography to select the best one by using energy minimization [21]. They also designed an energy function to encourage the warp to undergo a similarity transformation and to preserve structures like curves and lines after warping. Note that these techniques register only one local region and thus inevitably cause geometrically inaccurate stitching results.

On the other hand, the previous video stitching algorithms simply apply the existing image stitching techniques to stitch the video frames at each time instance, respectively [28]. Also, they extend the image stitching techniques straightforwardly to video stitching for the purposes of improving the computation speed or reducing the flickering artifacts. El-Saban et al. computed SIFT descriptors for selected frames only and tracked the feature points to reduce the computational complexity of video stitching [28]. Jiang et al. extended the local alignment and image composition of CPW to video stitching by applying the seam cutting scheme in the spatiotemporal domain [29].

B. STATIC MULTI-CAMERA BASED TRACKING
Multi-camera based people tracking techniques detect walking pedestrians on a ground plane from multiple videos, which are captured by different static cameras set toward a common ground plane and positioned with relatively wide baselines. Specifically, moving foreground objects are first detected by background subtraction methods, and then the elongated shapes of the detected people are represented by principal axes [24], which are used for people tracking in addition to the ground plane homography. To localize each person for robust tracking, Khan et al. computed multiple homographies associated with planes parallel to the ground plane using vanishing points [25]. In addition to the homography and vanishing points, the fundamental matrix was also used to reliably find correspondence matching for the top points of people [26].

III. PARALLAX-ADAPTIVE PIXEL WARPING MODEL
In many practical applications of multi-view videos such as surveillance and sports, static multiple cameras are located with wide baselines toward a target real-world scene, which yields severely different camera parameters, e.g., rotation, translation, and zoom factor. Also, in a typical video sequence, the background is composed of a ground plane and optionally a far distant region orthogonal to the ground plane, and moreover, people moving on the ground plane at different distances from the cameras are captured as multiple foreground objects. Figs. 1(a) and (b) show two frames of the 'Soccer' sequence captured by two cameras with severely different positions and viewing directions from each other, where large parallax is observed, especially in the vicinity of the foreground objects. For example, the players denoted by red boxes in Fig. 1(a) appear in a different order in Fig. 1(b). In addition, the players denoted by yellow boxes appear only in the view of Fig. 1(a) and not in Fig. 1(b).

Such large parallax makes multi-view video stitching quite a challenging problem, and the conventional stitching techniques often fail to provide faithful results. Fig. 1(c) shows the image stitched by warping a target frame in Fig. 1(a) to a reference frame in Fig. 1(b) according to the homography. Since the homography-based warping assumes a planar scene structure, only the ground plane is accurately aligned, and the foreground objects and the distant background region yield large parallax artifacts.

FIGURE 2: Epipolar geometry.

FIGURE 3: Parallax-adaptive pixel warping.


Also, Fig. 1(d) shows the stitching result of APAP [13], which is one of the state-of-the-art image stitching techniques. APAP adaptively warps images using a mesh grid structure to reduce parallax artifacts; however, it still exhibits inaccurate alignment of the multiple foreground objects due to depth discontinuity, and furthermore, it causes perspective distortions in the non-overlapping area between the two images.

The parallax between two views can be explained based on the epipolar geometry, as shown in Fig. 2. A homography is a planar mapping from one image domain to another image domain. Suppose that a 3D real-world point X_1 is located on a plane π and projected to the pixels p_1 and q_1 in the image planes I and J, respectively. Then the relation between p_1 and q_1 is described by

q_1 = H_π p_1,    (1)

where H_π is the homography associated with the plane π. However, for the pixels p_2 and q_2 projected from a 3D point X_2, which is not on π, the relation (1) does not hold, i.e., q_2 ≠ H_π p_2, and therefore, a single homography H_π maps p_2 to a wrong pixel q̃_2 = H_π p_2, which causes a parallax artifact. On the other hand, we can describe the geometric relationship between any pair of corresponding pixels by the epipolar constraint. For example, for a given pixel p_2 ∈ I, the corresponding pixel q_2 ∈ J should be located on the epipolar line l_2 computed as

l_2 = F p_2,    (2)

where F is the fundamental matrix.

In this work, we propose an adaptive pixel warping model for parallax-free stitching of MVLP which employs faithful correspondence matching among multi-view videos based on the epipolar constraint. We first define on-plane pixels, which are projected from the ground plane in the real-world scene, and off-plane pixels, which belong to the foreground objects and the far distant background region. We generalize the concept of the epipolar constraint, used for matching the top points of people in multi-camera based tracking [26], to find reliable correspondence matching of off-plane pixels. As shown in Fig. 3, for a given off-plane pixel p in a target image I, we first estimate the ground plane pixel (GPP) g_p of p along the object direction L_p = p × v_I determined by the vertical vanishing point v_I. Since g_p is an on-plane pixel, it can be warped to the corresponding GPP g_q in the reference image J by using the homography matrix H evaluated on the ground plane:

g_q = H g_p.    (3)

The unknown pixel q corresponding to p can then be estimated as the cross point between the object direction line L_q = g_q × v_J, passing through g_q and the vertical vanishing point v_J, and the epipolar line l_p = F p specified by the fundamental matrix:

q = L_q × l_p.    (4)

Fig. 1(e) shows the resulting image stitched by using the proposed warping model, where we see that the multiple foreground objects and the background are aligned correctly, while the parallax artifacts that occur in the conventional methods, as shown in Figs. 1(c) and (d), are alleviated effectively. The proposed algorithm can also warp the foreground objects and the background in the non-overlapped areas naturally.

Consequently, to perform the proposed parallax-adaptive pixel warping, we need to estimate the parameters of the homography matrix H of the ground plane, the fundamental matrix F, and the vertical vanishing points v_I and v_J. We explain the details of the parameter estimation in Section IV. We also need to estimate an optimal GPP g_p for a given query pixel p. Note that [26] employs only a single query pixel at the top of a foreground object and roughly estimates the GPP by using the average height of objects. In this work, we estimate optimal GPPs more accurately by using the spatial and temporal feature matches based on an energy minimization framework, which is explained in Section V.

IV. PARAMETER ESTIMATION
For two given input MVLP, we first estimate the parameters of the homography matrix, the fundamental matrix, and the vertical vanishing points. Note that these parameters are fixed over all the frames, since we assume that the multi-view videos are captured by static cameras.
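The warping rule of (3) and (4) amounts to a few cross products in homogeneous coordinates. A minimal NumPy sketch, assuming H, F, and v_J have already been estimated as described in this section and that the GPP g_p of the query pixel is known:

    import numpy as np

    def hom(p):
        """Lift a 2D pixel to homogeneous coordinates."""
        return np.array([p[0], p[1], 1.0])

    def dehom(x):
        """Back to 2D pixel coordinates."""
        return x[:2] / x[2]

    def warp_off_plane_pixel(p, g_p, H, F, v_J):
        """Warp an off-plane pixel p of image I through its GPP g_p.
        H: ground plane homography from I to J (eq. 3).
        F: fundamental matrix mapping pixels of I to epipolar lines in J.
        v_J: vertical vanishing point of J as a homogeneous 3-vector."""
        g_q = H @ hom(g_p)                 # eq. (3): warp the GPP
        L_q = np.cross(g_q, v_J)           # object direction line in J
        l_p = F @ hom(p)                   # eq. (2): epipolar line of p
        return dehom(np.cross(L_q, l_p))   # eq. (4): line intersection

For an on-plane pixel, g_p = p and the result coincides with the plain homography warp H p, so the same routine covers both pixel types.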

FIGURE 4: (a) An input video sequence and (b) its background image.

FIGURE 5: Ground plane pixel estimation. p and q are given as corresponding to each other. L′_p and L′_q denote the homography-transformed lines of L_p and L_q into the other views, respectively.

A. GROUND PLANE HOMOGRAPHY


We estimate the homography associated with the ground plane using inter-view correspondence matching. In general, initial matching between two views is performed by using feature descriptors such as SIFT [23] or ASIFT [35], and then the spurious matches are removed by outlier removal schemes such as RANSAC [36]. However, the conventional appearance-based techniques may not provide reliable matching results on MVLP, especially on multiple foreground objects at different scene depths, since the neighboring pixels of a feature point in one image yield severely different values from those of the corresponding feature point in another image [37], [38]. Therefore, in this work, we estimate the homography more reliably by employing the appearance features as well as the activity information of the moving foreground objects.

Fig. 4(a) shows an input color video sequence I = {I^(k) : k = 1, 2, ..., K}, where I^(k) denotes the k-th frame and K is the total number of frames. We find B_ground, the set of feature matches on the ground plane between I and J, using the activity-based correspondence matching technique [38]. Then we compute an initial homography H_init from B_ground using RANSAC. We also obtain a background image I_BG, as shown in Fig. 4(b), by applying median filtering to all the frames in I. Then we use SIFT to find a set of feature matches B between the two background images I_BG and J_BG obtained from the two video sequences I and J, respectively. Note that B includes the matches on the ground plane and the matches in the distant background region together. Hence we first extract the matches on the ground plane only from B by selecting the inlier matches of H_init. Then we refine H_init to obtain a final homography H by using B_ground and the selected ground plane matches in B, based on RANSAC.

B. FUNDAMENTAL MATRIX
To estimate the fundamental matrix between two views, we find inter-view feature matching on the foreground objects as well. Note that, while the correspondence matching for the background is performed once over a whole video sequence, that for the foreground objects is performed at each time instance, respectively. In practice, we use SIFT to find the inter-view feature matches between I^(k) and J^(k), and obtain the set F^(k)_spatial by selecting the matches lying on the foreground regions only by using background subtraction [39]. While B_ground includes a small number of outlier matches thanks to the reliable performance of activity-based matching, F^(k)_spatial and B include relatively large numbers of spurious matches, since appearance-based matching is vulnerable to severe parallax. Therefore, we further refine the matches in F^(k)_spatial and B using geometric constraints.

FIGURE 6: Refinement of feature matching on (a) foreground objects and (b) background. Correct and spurious matches are denoted by the yellow and red lines, respectively.

As shown in Fig. 5, when a pair of corresponding off-plane pixels p ∈ I and q ∈ J are given, their GPPs g_p and g_q are corresponding on-plane pixels to each other and should be located on the object direction lines L_p and L_q, respectively. Hence we can estimate g_p and g_q as [24]

g_p = L_p × L′_q,
g_q = L_q × L′_p,    (5)

where L′_p and L′_q are the warped lines of L_p and L_q into the other views, respectively, by the ground plane homography H. Based on this property, we induce two geometric constraints to validate the obtained correspondence matches. First, g_p should be located at a position on L_p equal to or below p, such that (g_p − p) · v_I ≥ 0. Similarly, we have (g_q − q) · v_J ≥ 0. Second, g_p should be close to the lowest possible pixel p_low along L_p in a connected object area. In practice, we employ a tolerance range for g_p such that |(g_p − p_low) · v_I/‖v_I‖| is less than 40% of the height of the foreground object. This also applies to g_q and q.

We remove the false matches from F^(k)_spatial which violate the first and/or second constraints, to yield a refined set F̃^(k)_spatial. For B, we test only the first constraint and apply multi-structure guided sampling (MULTI-GS) [40] to obtain a refined set B̃.
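A sketch of this validation step: given a candidate match (p, q), the GPPs are computed via (5) by transferring the object direction lines with H, and the two constraints are then tested. The quantities lowest_pixel and height over the connected foreground component are hypothetical inputs, standing in for queries on the background subtraction mask.

    import numpy as np

    hom = lambda p: np.array([p[0], p[1], 1.0])   # lift to homogeneous
    dehom = lambda x: x[:2] / x[2]                # back to 2D pixels

    def gpps_from_match(p, q, H, v_I, v_J):
        """Estimate g_p and g_q of a matched off-plane pair via (5).
        Points map I -> J by H, so lines transfer J -> I by H^T and
        I -> J by H^{-T}; v_I, v_J are homogeneous 3-vectors."""
        L_p = np.cross(hom(p), v_I)           # object direction line in I
        L_q = np.cross(hom(q), v_J)           # object direction line in J
        Lq_in_I = H.T @ L_q                   # L'_q, transferred into I
        Lp_in_J = np.linalg.inv(H).T @ L_p    # L'_p, transferred into J
        g_p = dehom(np.cross(L_p, Lq_in_I))
        g_q = dehom(np.cross(L_q, Lp_in_J))
        return g_p, g_q

    def satisfies_constraints(p, g_p, v_I_2d, lowest_pixel, height):
        """Both geometric constraints of this subsection, per view.
        v_I_2d is the dehomogenized vanishing point, as in the text;
        lowest_pixel and height are hypothetical per-object quantities."""
        # First: the GPP must lie at or below the query pixel.
        if np.dot(g_p - np.asarray(p, float), v_I_2d) < 0:
            return False
        # Second: the GPP must be near the lowest pixel of the object
        # along the direction line (tolerance: 40% of object height).
        u = v_I_2d / np.linalg.norm(v_I_2d)
        return abs(np.dot(g_p - lowest_pixel, u)) < 0.4 * height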

Fig. 6 shows that the proposed matching refinement for MVLP removes most of the spurious matches successfully, both on the foreground objects and on the background. Finally, we estimate the fundamental matrix F by applying RANSAC to the appearance-based feature matches of the F̃^(k)_spatial's and B̃ as well as the activity-based matches of B_ground together. Note that, due to computational complexity, we empirically collect 1000 feature matches from the F̃^(k)_spatial's associated with randomly selected frames.
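A sketch of this step using OpenCV's RANSAC-based estimator, with the refined match sets pooled into (N, 2) arrays; the reprojection threshold and confidence are assumed values, while the 1000-match cap follows the text.

    import cv2
    import numpy as np

    def estimate_F(spatial_I, spatial_J, bg_I, bg_J, gnd_I, gnd_J,
                   max_spatial=1000, seed=0):
        """Fundamental matrix from the refined appearance-based matches
        (foreground matches pooled over frames and background matches)
        together with the activity-based ground plane matches."""
        rng = np.random.default_rng(seed)
        if len(spatial_I) > max_spatial:   # cap for computational cost
            idx = rng.choice(len(spatial_I), max_spatial, replace=False)
            spatial_I, spatial_J = spatial_I[idx], spatial_J[idx]
        pts_I = np.vstack([spatial_I, bg_I, gnd_I]).astype(np.float32)
        pts_J = np.vstack([spatial_J, bg_J, gnd_J]).astype(np.float32)
        F, inliers = cv2.findFundamentalMat(
            pts_I, pts_J, cv2.FM_RANSAC,
            ransacReprojThreshold=1.0, confidence=0.999)
        return F, inliers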

C. VERTICAL VANISHING POINTS


Vanishing points are the points where parallel lines converge [31]. In multi-view video sequences, people are assumed to be standing along the direction orthogonal to the ground plane, and therefore, we define a vertical vanishing point as a converging point of lines in a scene that are parallel and orthogonal to the ground plane. In practice, we estimate the vertical vanishing points by using [41]. Instead of complex people tracking, we simply select 10,000 major axis lines of people from randomly selected frames, where the lines satisfy the condition that the ratio of the length of the minor axis to the length of the major axis is below 0.3. Then, as shown in Fig. 3, the object direction L_p can be computed at each off-plane pixel p as the line passing through p and v,

L_p = p × v,    (6)

where v is the vertical vanishing point. Note that the object direction L_p is used to estimate the GPP g_p based on the constraint that g_p should be located on L_p.

V. GROUND PLANE PIXEL ESTIMATION
We estimate optimal GPPs for given query pixels in a target frame to find their warped pixels in a reference frame. Note that the proposed pixel warping model is applicable not only to off-plane pixels but also to on-plane pixels, such that g_p = p for a pixel p on the ground plane. We perform the GPP estimation for the foreground objects and the background, respectively, where the inter-view and inter-frame feature matches are used together for the foreground objects, while only the inter-view feature matches are used for the background. The estimated GPP positions are also optimized based on an energy minimization framework.

A. GROUND PLANE PIXEL AND GROUND VALUE
Multiple off-plane pixels on a same object direction line share a same GPP, since the corresponding real-world points are assumed to be located on a same vertical line perpendicular to the ground plane. For example, as shown in Fig. 7(a), the pixels r_1, r_2 and r_3 on L_r have the GPP g_r, while the pixels s_1, s_2 and s_3 on L_s have the GPP g_s. However, off-plane pixels lying on different object direction lines have different GPPs.

FIGURE 7: Relation between ground plane pixels and ground values. (a) A target frame and (b) its ground value map.

We define a ground value δ_p for the pixel p according to its GPP g_p, as shown in Fig. 7(a):

δ_p = ((v_I − p) / ‖v_I − p‖) · g_p.    (7)

Note that the ground values of off-plane pixels are almost invariant within a same foreground object or a same distant background region. We exploit this property to estimate the GPPs by estimating their ground values instead, since g_p and δ_p are put in one-to-one correspondence with each other for a given p via (7).

B. SPATIOTEMPORAL ESTIMATION FOR FOREGROUND OBJECTS
Let us first define Φ^(k)_spatial as the set of feature pixels in F̃^(k)_spatial detected from a target image I^(k). For a given feature pixel p^(k) ∈ Φ^(k)_spatial associated with an inter-view match, denoted by a yellow line in Fig. 8, a GPP g_p^(k) is found by (5). We call this procedure of GPP estimation using inter-view feature matches spatial matching based estimation (SME). We perform SME using F̃^(k)_spatial for each k-th frame, respectively.

However, some foreground objects may not provide sufficient numbers of inter-view matches, or may have no inter-view match at all, due to large parallax between the two views and/or relatively small areas in an image. Hence we additionally employ the temporal information from the previous frame to predict GPPs. Specifically, we use SIFT to obtain the set of inter-frame feature matches F̃^(k)_temporal associated with the foreground objects between a current frame I^(k) and its previous frame I^(k−1), which are denoted by the blue lines in Fig. 8. In general, F̃^(k)_temporal has a much larger number of reliable matches than F̃^(k)_spatial, since the adjacent frames in a same view exhibit similar scene contents to each other, while the frames from different views exhibit severely different appearance due to large parallax.
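Since (7) puts g_p and δ_p in one-to-one correspondence for a given p, the conversion is needed in both directions when ground values, rather than raw GPPs, are propagated between pixels and frames, as in the estimation procedures below. A minimal sketch in 2D pixel coordinates, with v_I the dehomogenized vertical vanishing point:

    import numpy as np

    def ground_value(p, g_p, v_I):
        """delta_p of pixel p given its GPP g_p, eq. (7)."""
        u = (v_I - p) / np.linalg.norm(v_I - p)  # unit direction to v_I
        return float(np.dot(u, g_p))

    def gpp_from_ground_value(p, delta_p, v_I):
        """Invert (7): recover g_p on the object direction line L_p.
        Writing g_p = p + t * u gives delta_p = u . p + t, so
        t = delta_p - u . p."""
        u = (v_I - p) / np.linalg.norm(v_I - p)
        return p + (delta_p - np.dot(u, p)) * u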

FIGURE 8: Inter-view feature matches (yellow lines) and inter-frame feature matches (blue lines).

Note that some pixels may be detected as spatial features and temporal features simultaneously; such pixels belong to both F̃^(k)_spatial and F̃^(k)_temporal.

Let us define Φ^(k)_temporal as the set of inter-frame feature pixels in F̃^(k)_temporal detected from a target image I^(k). For each pixel p^(k) ∈ Φ^(k)_temporal − Φ^(k)_spatial, we find its temporal corresponding pixel p^(k−1). In addition, we also collect the inter-view feature pixels from Φ^(k)_spatial which are located in the same foreground object as p^(k). Then, by (3) and (4), we compute a candidate pixel q̂^(k) in the reference image J^(k) corresponding to p^(k) by finding a candidate GPP ĝ_p^(k). Note that we estimate the optimal GPP by estimating the ground value via (7) instead. In practice, we take as ground values of p^(k) the ground value of p^(k−1) and the ground values of the additionally collected inter-view feature pixels, respectively, since the ground values are the same within a same foreground object while the GPPs are changeable. Then we check whether each of the candidate positions q̂^(k) lies on a foreground object region in J^(k) or not, and we discard the associated GPP ĝ_p^(k) when q̂^(k) lies outside of the foreground areas within J^(k). Finally, we evaluate the SIFT descriptors for the surviving candidate positions q̂^(k), and select the GPP of p^(k) associated with the best matching candidate position. We call this procedure temporal matching based estimation (TME).

When TME returns no available solution, we estimate the GPP by taking the ground value of the lowest possible pixel in a foreground object. We call this procedure region based estimation (RE). RE yields relatively lower accuracy of GPP estimation than SME, due to the lack of inter-view matching information; however, it can perform reasonable warping of the foreground objects lying on the non-overlapping region, which appear only in I^(k) but not in J^(k).

C. SPATIAL ESTIMATION FOR BACKGROUND
We also estimate the GPPs for the background. We assume that the background is composed of the ground plane and optionally a far distant region. To adaptively warp the background image, we first decide whether the captured scene includes a distant background region or not. To be specific, we use the inter-view feature matching on the background. From B̃, we extract the set of matches which are outliers of the ground plane homography obtained in Section IV-A. If the number of outlier matches is less than 5% of the total number of matches in B̃, we decide that the background scene includes only the ground plane without a distant region, and then we simply estimate the GPPs as g_p = p for all the background pixels p.

Otherwise, the background includes a distant region where we perform GPP estimation. We first compute the GPPs for the extracted outlier matches in B̃ by SME, and fit a line passing through the obtained GPPs using linear regression. This line is regarded as a boundary that roughly separates the distant background region from the ground plane. For the pixels p located below the boundary line, we simply estimate the GPPs as g_p = p. For the feature pixels in B̃ located above the boundary line, we estimate the GPPs by SME.

D. GROUND VALUE OPTIMIZATION
For seamless warping of the foreground objects and the distant background region, we further refine the positions of the initial GPPs for the off-plane feature pixels obtained in Section V-B and Section V-C. Specifically, we formulate an energy function E_FG to refine the associated initial ground values for the feature pixels of the foreground objects in Φ^(k) = Φ^(k)_spatial ∪ Φ^(k)_temporal:

E_FG(F^(k)) = E_FG,data(F^(k)) + α E_FG,ss(F^(k)) + β E_ts(F^(k)),    (8)

where F^(k) denotes the set of optimal ground values δ_p^(k) for all feature pixels p^(k) in Φ^(k). We set the weighting parameters as α = 0.5 and β = 0.5 experimentally. E_FG,data is the data cost designed as

E_FG,data(F^(k)) = Σ_{p^(k) ∈ Φ^(k)} (δ_p^(k) − δ̄_p^(k))²,    (9)

where δ̄_p^(k) denotes the initial ground value of p^(k). The initial ground values may be inaccurate due to errors in feature matching and/or background subtraction. Hence we employ the spatial smoothness cost given by

E_FG,ss(F^(k)) = Σ_{p_i^(k) ∈ Φ^(k)} Σ_{p_j^(k) ∈ N_i^(k)} w(p_i^(k), p_j^(k)) · (δ_{p_i}^(k) − δ_{p_j}^(k))²,    (10)

where N_i^(k) denotes the set of pixels spatially neighboring p_i^(k). Two pixels p_i^(k) and p_j^(k) are regarded as spatial neighbors to each other when they are located in a same foreground object region and satisfy the compatibility constraint: the warped pixel of p_i^(k) ∈ I^(k) using the initial GPP of p_j^(k) is located on a foreground object region in J^(k), and, at the same time, the warped pixel of p_j^(k) ∈ I^(k) using the initial GPP of p_i^(k) is also located on the same foreground object in J^(k).

In this work, we select at most the four nearest neighboring pixels to p_i^(k) to define N_i^(k). The spatial weight is given by

w(p_i, p_j) = exp(−‖p_i − p_j‖/τ),    (11)

where we set τ = 100 empirically. Moreover, to mitigate the flickering artifacts in the resulting stitched video sequence, the temporal smoothness cost is defined as

E_ts(F^(k)) = Σ_{p^(k) ∈ Φ^(k)_temporal} (δ_p^(k) − δ*_p^(k−1))²,    (12)

where δ*_p^(k−1) is the optimal ground value of the inter-frame corresponding pixel p^(k−1) in the previous frame I^(k−1). Note that we do not use the temporal cost function at the first frame.

Let Ψ represent the set of the feature pixels in B̃ located above the boundary line in the background image of a target view. We also formulate an energy function E_BG for Ψ as

E_BG(B) = E_BG,data(B) + γ E_BG,ss(B),    (13)

where B denotes the set of optimal ground values δ_p for all feature pixels p in Ψ. The weighting parameter γ is set to 1 empirically. The data term is given by

E_BG,data(B) = Σ_{p ∈ Ψ} (δ_p − δ̄_p)²,    (14)

where δ̄_p denotes the ground value of p ∈ Ψ initially obtained by SME. The spatial smoothness cost is given by

E_BG,ss(B) = Σ_{p_i ∈ Ψ} Σ_{p_j ∈ N_i} w(p_i, p_j) · (δ_{p_i} − δ_{p_j})²,    (15)

where N_i is the set of the four feature points in Ψ nearest to p_i.

We refine the ground values for all the off-plane feature pixels in the foreground objects by minimizing the energy function in (8) using a linear solver. Then the remaining non-feature pixels in the foreground objects are assigned ground values by applying nearest-neighbor interpolation to the available optimal ground values computed at the feature pixels. We also find the set of optimal ground values at the off-plane feature pixels in the distant background region by minimizing the energy function in (13), which are then interpolated to determine the ground values at all the background pixels above the boundary line. In practice, we apply linear interpolation within the convex hull of the feature pixels and nearest-neighbor interpolation outside of the convex hull. Fig. 7(b) shows the resulting ground value map of the target image frame in Fig. 7(a). Note that the off-plane pixels belonging to a same foreground object region or a distant background region have almost the same ground values, even though their GPPs are different. On the contrary, the on-plane pixels on the ground plane have different ground values according to their relative positions along the direction toward the vertical vanishing point.
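Both (8) and (13) are quadratic in the unknown ground values, so the minimization reduces to a single sparse linear least-squares solve. A minimal sketch for the foreground energy, assuming the initial values, the compatible neighbor pairs with their weights from (11), and the inter-frame links for (12) have been gathered beforehand:

    import numpy as np
    from scipy.sparse import lil_matrix
    from scipy.sparse.linalg import lsqr

    def optimize_ground_values(delta_init, pairs, weights, temporal_prev,
                               alpha=0.5, beta=0.5):
        """Minimize (8) as stacked least squares.
        delta_init:    (N,) initial ground values (data term, eq. 9).
        pairs:         list of compatible neighbor index pairs (i, j).
        weights:       w(p_i, p_j) of eq. (11) for each pair (eq. 10).
        temporal_prev: dict i -> optimal ground value of the inter-frame
                       corresponding pixel (eq. 12); empty at frame 1."""
        n = len(delta_init)
        A = lil_matrix((n + len(pairs) + len(temporal_prev), n))
        b = np.zeros(A.shape[0])
        for i in range(n):                       # data term rows
            A[i, i] = 1.0
            b[i] = delta_init[i]
        row = n
        for (i, j), w in zip(pairs, weights):    # spatial smoothness rows
            s = np.sqrt(alpha * w)
            A[row, i], A[row, j] = s, -s
            row += 1
        for i, d_prev in temporal_prev.items():  # temporal smoothness rows
            A[row, i] = np.sqrt(beta)
            b[row] = np.sqrt(beta) * d_prev
            row += 1
        return lsqr(A.tocsr(), b)[0]

The background energy (13) is handled identically, with the pairs taken from the four nearest neighbors in Ψ and without the temporal rows.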
TABLE 1: Specification of test video sequences.

Sequence     Resolution   Distant      Time    Parallax
                          Background   (min)   Angle (°)
Fountain     640×360      x            51      1.9
Tennis       640×360      o            34      12.3
Lawn         640×360      x            37      12.7
Badminton    640×360      o            52      18.2
Square       640×360      x            35      18.3
Office       640×360      o            30      18.5
Trail        640×360      o            51      18.7
Stadium      640×360      o            35      24.4
Soccer       320×240      o            55      28.0
Street       640×360      o            29      30.5
School       640×360      o            36      31.9
Garden       640×360      o            24      32.0

VI. EXPERIMENTAL RESULTS
We evaluate the performance of the proposed algorithm using 12 test video sequences, as shown in Fig. 12. Each test video sequence is composed of two videos captured at 30 frames per second by two synchronized cameras with unknown camera parameters. A captured scene includes multiple moving people on a ground plane at various scene depths. Table 1 presents the specification of the test sequences. We approximate the parallax angle by first taking, for each manually obtained ground truth matching pixel pair (p, q), the sum of the angle between L_p and L′_q and the angle between L_q and L′_p shown in Fig. 5 divided by 2, and then averaging over all the pairs. In general, a larger parallax angle results when the two videos are captured with a wider camera baseline and the captured scene is closer to the cameras. We warp each pixel in a target image frame to a reference frame based on the proposed parallax-adaptive pixel warping model. The hole pixels in the warped target frame are interpolated by using the valid warped pixels. To evaluate whether the alignment is geometrically accurate or not, we simply use the average blending scheme to combine the warped target frame and the reference frame.
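The parallax angles reported in Table 1 follow directly from the line construction of Fig. 5. A sketch over the manually obtained ground truth pairs, reusing the line-transfer convention from Section IV; the acute-angle convention for line_angle is an assumption:

    import numpy as np

    hom = lambda p: np.array([p[0], p[1], 1.0])

    def line_angle(l1, l2):
        """Acute angle in degrees between two homogeneous 2D lines."""
        n1 = l1[:2] / np.linalg.norm(l1[:2])
        n2 = l2[:2] / np.linalg.norm(l2[:2])
        return np.degrees(np.arccos(np.clip(abs(n1 @ n2), 0.0, 1.0)))

    def parallax_angle(gt_pairs, H, v_I, v_J):
        """Average of (angle(L_p, L'_q) + angle(L_q, L'_p)) / 2 over all
        ground truth matching pixel pairs (p, q), as described above."""
        angles = []
        for p, q in gt_pairs:
            L_p = np.cross(hom(p), v_I)
            L_q = np.cross(hom(q), v_J)
            angles.append(0.5 * (line_angle(L_p, H.T @ L_q)
                                 + line_angle(L_q, np.linalg.inv(H).T @ L_p)))
        return float(np.mean(angles))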
A. FOREGROUND OBJECT ALIGNMENT
The performance of video stitching highly depends on the accuracy of correspondence matching between different views. In particular, accurate inter-view matches on the foreground object regions are required to adaptively alleviate the parallax artifacts caused by the different scene depths of multiple objects. Therefore, we first evaluate the alignment performance of multiple foreground objects according to the various GPP estimation methods.

Fig. 9 compares the stitching results on selected frames from three test sequences of MVLP, using the GPPs estimated by four different methods: RE, SME+RE, SME+TME+RE without optimization, and SME+TME+RE with optimization. Figs. 9(a) and (b) show target frames and reference frames, respectively, where we mark the obtained inter-view feature pixels in F̃^(k)_spatial by crosses.

FIGURE 9: Stitching results of multiple foreground objects using the proposed ground plane pixel estimation methods. (a) Target frames and
(b) reference frames. The stitched images by using (c) RE, (d) SME+RE, (e) SME+TME+RE without optimization, and (f) SME+TME+RE with
optimization, respectively. From top to bottom, “Lawn,” “Street,” and “Garden” sequences.

In the 'Lawn' sequence, the foreground objects occupy relatively small image areas since the cameras are located far from the captured scene, and thus they yield few inter-view feature matches. RE shows an artifact on the person in red, since the associated GPPs are selected on the person in white, who is connected to the person in red in the target frame by the blob analysis. The matching accuracy on the person in red is improved by using SME+RE, but the artifact on the legs is still observed. The selected frames in the 'Street' sequence are quite a challenging case, since the two people occlude each other. SME+RE improves the results of RE using the inter-view matching information, but it still causes misalignment on the right person. However, SME+TME+RE provides accurate results of foreground object alignment on the two sequences by using the spatiotemporal information together. The 'Garden' sequence includes the false matches marked by yellow crosses in Figs. 9(a) and (b). Hence, SME+RE and even SME+TME+RE without optimization suffer from the misalignment artifact of foreground objects; however, this artifact is alleviated in SME+TME+RE with optimization.

We also quantitatively measure the matching errors of the foreground objects using the ground truth correspondence matches. We select regularly distributed query pixels on the foreground objects in a target frame, and obtain initial matching pixels in a reference frame by using the dense feature descriptor DAISY [42], which are then refined manually. We find ground truth matches on 100 selected pairs of frames for each sequence, and on average, we obtain about 20 matches on the foreground objects for each pair of frames. Fig. 10 compares the root mean squared errors (RMSEs) of the foreground matching averaged over the 12 test sequences, where the RMSEs of RE, SME+RE, and SME+TME+RE without and with optimization are 5.45, 4.38, 3.77, and 3.34 pixels, respectively.

FIGURE 10: Comparison of the average error of correspondence matching for the foreground objects using different ground plane pixel estimation methods. The matching error measures the RMSE between the resulting matches and the ground truth matches averaged over 12 test sequences.

B. VIDEO STITCHING
Fig. 11 shows the video stitching results of the proposed algorithm on six test sequences of MVLP. We select frames at five different time instances in each sequence which include various challenging scene contents. In Fig. 11, all the sequences except 'Square' are detected to include distant background regions in addition to the ground planes. We see that the ground planes and the distant background regions are well aligned simultaneously, since the on-plane pixels and the off-plane background pixels are warped adaptively. Note that the ground planes in the 'Office' and 'Soccer' sequences have little texture, which often occurs in surveillance and sports scenes, but the proposed algorithm still finds correct homographies for these ground planes by using the appearance and activity based feature matches together.

We also observe that the multiple foreground objects are accurately aligned without ghosting artifacts in most frames. For example, in the 'Tennis' sequence, the two people on the right side are moving in different directions from each other, and thus they are detected as a single object at some time instances due to overlap. The proposed algorithm provides accurate warping of these foreground objects by estimating optimal GPPs reliably using the spatiotemporal feature matches. In the 'Square' sequence, the left person moves on the overlapped area between the target and reference views at the 29571th and 29663th frames; however, it disappears from the reference frames at the 29804th and 29857th frames.

FIGURE 11: Video stitching results of the proposed algorithm. For each sequence, pairs of target and reference frames (left) and the stitched
images (right) are shown.


The proposed algorithm warps this object naturally on the non-overlapped area in the stitched images. In the 'Trail' sequence, the foreground object approaches the camera, yielding severely changing scene depths, but the proposed algorithm aligns this object correctly at various scales. On the other hand, the proposed algorithm yields artifacts in some exceptional situations. In the 'Badminton' sequence, the person marked with a red circle is jumping and never touches the ground plane at the 27767th frame. In such a case, no valid inter-view feature matches are obtained on this region due to the geometric constraint in Section IV-B, and thus RE yields a misalignment artifact. In the 'Office' sequence, we see some artifacts near the right person since a moving car behind the cameras is reflected on the background windows. The 'Soccer' is quite a challenging sequence which includes various fast moving players, where multiple people occlude one another at the 3800th and 5038th frames. In such cases, SIFT provides insufficient correct inter-view matches, or even no correct match at all, resulting in the stitching artifacts indicated by red circles.

C. COMPARISON WITH CONVENTIONAL METHODS
We compare the performance of the proposed algorithm with that of four conventional methods including the state-of-the-art image stitching techniques: Homography, CPW [34], SPHP [15] and APAP [13]. Note that CPW is used as an alignment model for the stitching methods [20], [29]. SPHP is a shape-preserving warping method which can be compared to evaluate the naturalness of warping on the non-overlapping regions. APAP is one of the most flexible warping methods, which directly estimates multiple homographies for local image regions. However, we do not compare the seam-based techniques [19]–[21], since they just hide the misalignment artifacts using seam-cutting based composition. We apply the compared image stitching techniques to the frames at each time instance, respectively. We implement Homography and CPW. The parameters for the warp in CPW are set as in [29]. We obtain the stitching results of SPHP and APAP using the source codes provided on the authors' webpages [43], [44]. In our experiment, MULTI-GS [40] used in [13] yields a better outlier removal performance than RANSAC, and thus we also apply MULTI-GS to remove the outlier matches of SIFT in Homography, CPW, and SPHP as well.

Fig. 12 compares the stitching results on selected frames of the 12 test video sequences. All the conventional methods as well as the proposed algorithm achieve good stitching results on the 'Fountain' sequence, which yields the smallest parallax angle of 1.9°. However, for the other sequences of MVLP, the conventional methods fail to align the multiple foreground objects and the background simultaneously. For example, in the 'Square' and 'Office' sequences, the feet of multiple people are well aligned on the ground planes, but the mismatch artifact gets worse toward the heads, since the ground plane warping is dominant in the conventional methods. On the other hand, in the 'Stadium,' 'Soccer,' and 'Garden' sequences, a same person appears twice at different locations without any overlap on the stitched domain, since the conventional methods extract dominant features from the distant background regions, causing misalignment artifacts on the ground planes and the foreground objects. Specifically, Homography warps all the pixels in a target frame by a global transformation derived from a dominant planar scene structure, and thus it mismatches either the ground plane or a distant background region. CPW adaptively refines the initial homography according to feature matches, and reduces the parallax artifacts on the foreground objects compared with Homography, as shown in the 'Tennis,' 'Office' and 'Street' sequences. SPHP adopts the similarity transformation to reduce the perspective distortion of the non-overlapping area, and thus it aligns the foreground objects on the non-overlapping areas well in the 'Square' sequence, as marked with a red circle. However, at the same time, SPHP distorts the line structure on the ground plane into curves, as marked with green ellipses in the 'Lawn' and 'Square' sequences. APAP estimates locally adaptive warps and reduces the spatial deviation of a same foreground object in the stitched domain compared with CPW, as shown in the 'School' sequence; however, APAP results in unnatural distortions in the 'Badminton,' 'Trail,' and 'School' sequences, as marked with green ellipses.

On the contrary, in all the frames, the proposed algorithm alleviates the parallax artifacts of video stitching successfully by adaptively aligning the multiple foreground objects and the background simultaneously. It also performs geometrically accurate warping on the non-overlapping areas as well, as shown in the 'Badminton,' 'Square,' and 'Soccer' sequences. Moreover, the proposed algorithm correctly determines the existence of distant background regions in all 12 test sequences. Thus both the ground plane and the distant background region are correctly aligned, as shown in the 'Badminton,' 'Office,' and 'School' sequences. In the 'Soccer' sequence, even though some ghost artifacts are observed due to a significant amount of occlusion, as marked by a red circle, the proposed algorithm aligns most people accurately, while the compared methods fail to work on this challenging case. Also, the umpire chair and the net in the 'Tennis' sequence and the net and the light lamp in the 'Badminton' sequence are static objects over a whole video sequence which are not detected as moving foreground objects, and therefore the proposed algorithm cannot align them correctly. However, all the compared methods also fail to align these objects, as marked with yellow ellipses. More comparative results of video stitching are provided in the supplementary video.

We also quantitatively compare the performance of the proposed algorithm with that of the conventional methods using manually obtained ground truth correspondence matches on the foreground objects and the background together. We use the same ground truth matches on the foreground objects as explained in Sec. VI-A. We generate ground truth matches on the background only once for each sequence using the background image.
FIGURE 12: Comparison of video stitching results of the proposed algorithm and the four existing methods: Homography, CPW [34], SPHP [15],
and APAP [13]. From top to bottom, “Fountain,” “Tennis,” “Lawn,” “Badminton,” “Square,” “Office,” “Trail,” “Stadium,” “Soccer,” “Street,” “School,”
and “Garden” sequences.

We first consider multiple large planar areas in the background, and compute an optimal homography for each planar area using manually obtained feature matches. Then we select regularly distributed query pixels on the background image of the target view, and find their ground truth matching pixels by warping the query pixels with the multiple homographies selectively. For the query pixels on small and/or non-planar areas, we obtain the ground truth matching pixels manually. The resulting ground truth matches on the background image are added to each of the 100 frames selected for finding the ground truth matches on the foreground objects, where we exclude the background query pixels occluded by the foreground objects. Consequently, we have on average 724 ground truth matches on the background for each of the 100 selected frames over the 12 test sequences.
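As an illustration of this ground truth generation, the sketch below warps each background query pixel with the homography of its manually assigned planar area. The function and variable names are hypothetical, and the manual plane assignment and occlusion filtering are assumed to be done beforehand.

```python
# Illustrative sketch of background ground-truth generation (not the
# authors' code): each query pixel is mapped by the homography of the
# planar area it was manually assigned to.
import numpy as np

def warp_queries(queries, plane_labels, homographies):
    """queries: (N, 2) pixel coordinates in the target background image.
    plane_labels: (N,) index of the planar area each query lies on.
    homographies: list of 3x3 arrays, one optimal H per planar area."""
    gt = np.empty_like(queries, dtype=np.float64)
    for i, (x, y) in enumerate(queries):
        H = homographies[plane_labels[i]]
        p = H @ np.array([x, y, 1.0])      # homogeneous mapping
        gt[i] = p[:2] / p[2]               # perspective division
    return gt  # ground-truth matching pixels in the reference view
```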
TABLE 2: Comparison of Execution Times of the Conventional Methods and the Proposed Algorithm. The unit is seconds per frame. PP: Preprocessing. PE: Parameter Estimation. ST: Stitching.

Sequence     Homography    CPW    SPHP   APAP      Proposed
                                                 PP     PE     ST
Fountain        0.58       13.7    5.20   4.83   0.25   0.24   9.10
Tennis          0.62       15.0    4.17   2.90   0.28   0.07   26.3
Lawn            0.54       14.8    3.91   2.75   0.27   0.14   8.20
Badminton       0.53       16.5    4.03   2.57   0.31   0.18   39.0
Square          0.59       18.1    3.91   2.55   0.29   0.17   10.0
Office          0.55       17.7    4.41   3.02   0.28   0.10   38.0
Trail           0.65       18.7    3.93   2.77   0.30   0.20   34.1
Stadium         0.69       17.2    4.20   2.87   0.30   0.08   44.8
Soccer          0.25       9.10    3.04   4.28   0.21   0.03   19.3
Street          0.57       12.2    4.19   3.32   0.30   0.09   66.0
School          0.61       16.2    4.20   3.20   0.29   0.10   39.3
Garden          0.71       15.8    4.96   4.28   0.28   0.07   72.1
Average         0.57       15.4    4.18   3.28   0.28   0.12   33.8

FIGURE 13: Quantitative comparison of the stitching performance of the proposed algorithm with that of the conventional methods. The matching error measures the average RMSE between the warped pixels and the ground truth corresponding pixels.

Fig. 13 presents the RMSE between the ground truth corresponding pixels and the warped pixels on the overlapping regions of the target and reference frames. We see that the conventional methods tend to yield large RMSEs on the test sequences with large parallax angles. For example, the RMSEs of all the stitching methods are below 2 pixels on the "Fountain" sequence, which exhibits the smallest parallax angle of 1.9°. However, on the challenging sequences of MVLP, such as "Soccer" and "School," the conventional methods yield significantly larger RMSEs than on the other sequences. On the other hand, the proposed algorithm achieves smaller RMSEs than the conventional methods on all the test sequences, and yields a much smaller average error of 5.64 pixels, while Homography, CPW, SPHP, and APAP result in average errors of 35.37, 34.91, 32.05, and 34.86 pixels, respectively.
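For reference, the matching error of a single frame reduces to the following computation, assuming each method reports the warped position of every ground truth query pixel:

```python
# Minimal sketch of the matching-error metric: the RMSE between the
# warped query pixels and their ground-truth correspondences.
import numpy as np

def matching_error(warped, gt):
    """warped, gt: (N, 2) arrays of pixel coordinates in the reference
    view for the N ground-truth matches of one frame."""
    d2 = np.sum((warped - gt) ** 2, axis=1)   # squared distances
    return np.sqrt(np.mean(d2))               # RMSE in pixels
```

The per-sequence errors in Fig. 13 are then obtained by averaging this per-frame RMSE over the selected frames.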
D. EXECUTION TIME COMPARISON
Table 2 compares the execution times of the conventional methods and the proposed algorithm, measured on a PC with a 3.4 GHz AMD Ryzen 7 1700X CPU and 32 GB RAM. Note that this may not be a fair comparison, since the optimization level of the implementations differs across the compared methods. The execution times of the conventional methods and of the stitching (ST) step of the proposed algorithm are averaged over the 100 selected frames of each sequence, and those of the preprocessing (PP) and the parameter estimation (PE) steps of the proposed algorithm are averaged over the entire frames of each sequence. Homography is the fastest method, taking 0.57 seconds per frame on average. CPW, SPHP, and APAP require relatively longer execution times, since these methods use a different warping model for each cell or mesh grid in an image. Note that CPW is a non-parametric warping scheme and takes the longest execution time among the four conventional methods, 15.4 seconds per frame. The proposed algorithm is divided into three steps to evaluate the execution times. PP includes the background subtraction and the activity extraction for the activity-based correspondence matching [38]. PE includes the homography estimation with the activity-based correspondence matching computation, the fundamental matrix estimation, and the estimation of the vertical vanishing points. ST includes the SIFT matching computation, the ground pixel estimation, the warping (sketched below), and the blending. Note that PP and PE are performed only once over the entire frames of each video sequence, and thus yield relatively short execution times per frame. However, ST consumes the major portion of the execution time of the proposed algorithm, 33.8 seconds per frame on average, mainly to compute the hole pixels in the warped target frame from the valid warped pixels. Note that the "Fountain," "Lawn," and "Square" sequences exhibit relatively short ST times, since they do not contain distant background regions.
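As a rough illustration of the per-pixel warp inside ST, the sketch below shows one way an off-plane pixel can be transferred through its ground plane pixel (GPP) using the parameters estimated in PE. This is our simplified reading based on the quantities involved (the ground plane homography, the fundamental matrix, and the vertical vanishing points), not a verbatim transcription of the proposed warp.

```python
# Hedged sketch of warping one off-plane pixel x through its GPP g:
# the warped GPP and the reference view's vertical vanishing point
# define the object's vertical line, and the epipolar line of x pins
# down the position along it. This is our reading of the adaptive
# pixel warp, not the authors' exact formulation.
import numpy as np

def warp_off_plane_pixel(x, g, H, F, v_ref):
    """x, g: (3,) homogeneous pixels in the target view (g is the GPP of x).
    H: 3x3 ground plane homography, F: 3x3 fundamental matrix,
    v_ref: (3,) vertical vanishing point of the reference view."""
    g_ref = H @ g                       # GPP transferred by the plane homography
    vertical = np.cross(g_ref, v_ref)   # vertical line through the warped GPP
    epiline = F @ x                     # epipolar line of x in the reference view
    x_ref = np.cross(vertical, epiline) # line intersection = warped position
    return x_ref / x_ref[2]             # normalize homogeneous coordinates
```

In this reading, the warped GPP fixes the vertical line on which the warped pixel must lie, and the epipolar constraint determines its height along that line.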
VII. CONCLUSIONS
We proposed a novel video stitching algorithm to achieve geometrically accurate alignment of MVLP. We warped the multiple foreground objects, the distant background, and the ground plane adaptively based on the epipolar geometry, where an off-plane pixel in a target view is warped to a reference view through its GPP. We also estimated the optimal GPPs for the foreground objects by using the spatiotemporal feature matches, and for the background by using the spatial feature matches, respectively. The initially obtained GPPs are refined by energy minimization. Experimental results demonstrated that the proposed algorithm aligns various MVLP accurately, and yields significantly better parallax artifact reduction, both qualitatively and quantitatively, than the state-of-the-art image stitching techniques. Our future research topics include the warping of static objects with large parallax and the parallax-free stitching of MVLP captured by moving cameras.

REFERENCES
[1] W. Liu, M. Zhang, Z. Luo, and Y. Cai, "An ensemble deep learning method for vehicle type classification on visual traffic surveillance sensors," IEEE Access, vol. 5, pp. 24417–24425, 2017.

[2] R. Panda and A. K. Roy-Chowdhury, "Multi-view surveillance video summarization via joint embedding and sparse optimization," IEEE Trans. Multimedia, vol. 19, no. 9, pp. 2010–2021, May 2017.
[3] M. Wang, B. Cheng, and C. Yuen, "Joint coding-transmission optimization for a video surveillance system with multiple cameras," IEEE Trans. Multimedia, Sep. 2017.
[4] K. Bilal, A. Erbad, and M. Hefeeda, "Crowdsourced multi-view live video streaming using cloud computing," IEEE Access, vol. 5, pp. 12635–12647, 2017.
[5] S. A. Pettersen, D. Johansen, H. Johansen, V. Berg-Johansen, V. R. Gaddam, A. Mortensen, R. Langseth, C. Griwodz, H. K. Stensland, and P. Halvorsen, "Soccer video and player position dataset," in Proc. ACM Multimedia Syst., 2014, pp. 18–23.
[6] Q. Yao, H. Sankoh, K. Nonaka, and S. Naito, "Automatic camera self-calibration for immersive navigation of free viewpoint sports video," in Proc. Int'l Conf. Multimedia Signal Process., Sep. 2016, pp. 1–6.
[7] B. Kwon, J. Kim, K. Lee, Y. K. Lee, S. Park, and S. Lee, "Implementation of a virtual training simulator based on 360° multi-view human action recognition," IEEE Access, vol. 5, pp. 12496–12511, 2017.
[8] B. Macchiavello, C. Dorea, E. M. Hung, G. Cheung, and W. T. Tan, "Loss-resilient coding of texture and depth for free-viewpoint video conferencing," IEEE Trans. Multimedia, vol. 16, no. 3, pp. 711–725, Apr. 2014.
[9] L. Toni, G. Cheung, and P. Frossard, "In-network view synthesis for interactive multiview video systems," IEEE Trans. Multimedia, vol. 18, no. 5, pp. 852–864, May 2016.
[10] R. Szeliski, "Image alignment and stitching: A tutorial," Foundations and Trends in Computer Graphics and Vision, vol. 2, no. 1, pp. 1–104, 2006.
[11] J. Gao, S. J. Kim, and M. S. Brown, "Constructing image panoramas using dual-homography warping," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011.
[12] W.-Y. Lin, S. Liu, Y. Matsushita, T.-T. Ng, and L.-F. Cheong, "Smoothly varying affine stitching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011.
[13] J. Zaragoza, T.-J. Chin, Q.-H. Tran, M. S. Brown, and D. Suter, "As-projective-as-possible image stitching with moving DLT," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1285–1298, Jul. 2014.
[14] G. Zhang, Y. He, W. Chen, J. Jia, and H. Bao, "Multi-viewpoint panorama construction with wide-baseline images," IEEE Trans. Image Process., vol. 25, no. 7, pp. 3099–3111, Jul. 2016.
[15] C.-H. Chang, Y. Sato, and Y.-Y. Chuang, "Shape-preserving half-projective warps for image stitching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014.
[16] C.-C. Lin, S. U. Pankanti, K. N. Ramamurthy, and A. Y. Aravkin, "Adaptive as-natural-as-possible image stitching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
[17] Y.-S. Chen and Y.-Y. Chuang, "Natural image stitching with the global similarity prior," in Proc. Eur. Conf. Comput. Vis., 2016.
[18] N. Li, Y. Xu, and C. Wang, "Quasi-homography warps in image stitching," IEEE Trans. Multimedia, vol. PP, no. 99, pp. 1–1, 2017.
[19] J. Gao, Y. Li, T.-J. Chin, and M. S. Brown, "Seam-driven image stitching," in Proc. Eurographics, 2013.
[20] F. Zhang and F. Liu, "Parallax-tolerant image stitching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014.
[21] K. Lin, N. Jiang, L.-F. Cheong, M. Do, and J. Lu, "Seagull: Seam-guided local alignment for parallax-tolerant image stitching," in Proc. Eur. Conf. Comput. Vis., 2016.
[22] M. Yu and G. Ma, "360 surround view system with parking guidance," SAE Int'l J. Commercial Vehicles, vol. 7, no. 2014-01-0157, pp. 19–24, 2014.
[23] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int'l J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[24] W. Hu, M. Hu, X. Zhou, T. Tan, J. Lou, and S. Maybank, "Principal axis-based correspondence between multiple cameras for people tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 4, pp. 663–671, Apr. 2006.
[25] S. M. Khan and M. Shah, "Tracking multiple occluding people by localizing on multiple scene planes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 3, pp. 505–519, Mar. 2009.
[26] A. Yildiz and Y. S. Akgul, "A fast method for tracking people with multiple cameras," in Proc. Eur. Conf. Comput. Vis. Workshops, 2010.
[27] M. Takahashi, K. Ikeya, M. Kano, H. Ookubo, and T. Mishina, "Robust volleyball tracking system using multi-view cameras," in Proc. Int'l Conf. Pattern Recognit., Dec. 2016, pp. 2740–2745.
[28] M. El-Saban, M. Izz, and A. Kaheel, "Fast stitching of videos captured from freely moving devices by exploiting temporal redundancy," in Proc. IEEE Int'l Conf. Image Process., 2010.
[29] W. Jiang and J. Gu, "Video stitching with spatial-temporal content-preserving warping," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2015.
[30] K.-Y. Lee and J.-Y. Sim, "Robust video stitching using adaptive pixel transfer," in Proc. IEEE Int'l Conf. Image Process., 2015.
[31] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[32] T. Igarashi, T. Moscovich, and J. F. Hughes, "As-rigid-as-possible shape manipulation," ACM Trans. Graphics, vol. 24, no. 3, pp. 1134–1141, 2005.
[33] S. Schaefer, T. McPhail, and J. Warren, "Image deformation using moving least squares," ACM Trans. Graphics, vol. 25, no. 3, pp. 533–540, 2006.
[34] F. Liu, M. Gleicher, H. Jin, and A. Agarwala, "Content-preserving warps for 3D video stabilization," ACM Trans. Graphics, vol. 28, no. 3, p. 44, 2009.
[35] J.-M. Morel and G. Yu, "ASIFT: A new framework for fully affine invariant image comparison," SIAM J. Imaging Sciences, vol. 2, no. 2, pp. 438–469, 2009.
[36] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Comm. ACM, vol. 24, no. 6, pp. 381–395, 1981.
[37] E. Ermis, P. Clarot, P. Jodoin, and V. Saligrama, "Activity based matching in distributed camera networks," IEEE Trans. Image Process., vol. 19, no. 10, pp. 2595–2613, Oct. 2010.
[38] S.-Y. Lee, J.-Y. Sim, C.-S. Kim, and S.-U. Lee, "Correspondence matching of multi-view video sequences using mutual information based similarity measure," IEEE Trans. Multimedia, vol. 15, no. 8, pp. 1719–1731, Dec. 2013.
[39] J. M. McHugh, J. Konrad, V. Saligrama, and P.-M. Jodoin, "Foreground-adaptive background subtraction," IEEE Signal Process. Lett., vol. 16, no. 5, pp. 390–393, May 2009.
[40] T.-J. Chin, J. Yu, and D. Suter, "Accelerated hypothesis generation for multistructure data via preference analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 625–638, Apr. 2012.
[41] F. Lv, T. Zhao, and R. Nevatia, "Camera calibration from video of a walking human," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 9, pp. 1513–1518, Sep. 2006.
[42] E. Tola, V. Lepetit, and P. Fua, "DAISY: An efficient dense descriptor applied to wide-baseline stereo," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 5, pp. 815–830, May 2010.
[43] [Online]. Available: https://www.cmlab.csie.ntu.edu.tw/~frank/
[44] [Online]. Available: http://cs.adelaide.edu.au/~tjchin/apap/

KYU-YUL LEE received the B.S. degree in electrical and computer engineering from Ulsan National Institute of Science and Technology, Ulsan, Korea, in 2013, where he is currently pursuing the Ph.D. degree in electrical and computer engineering. His research interests include correspondence matching, video stitching, and deep learning.

JAE-YOUNG SIM (S'02–M'06) received the B.S. degree in electrical engineering and the M.S. and Ph.D. degrees in electrical engineering and computer science from Seoul National University, Seoul, Korea, in 1999, 2001, and 2005, respectively. From 2005 to 2009, he was a Research Staff Member, Samsung Advanced Institute of Technology, Samsung Electronics Company, Ltd. In 2009, he joined the School of Electrical and Computer Engineering, Ulsan National Institute of Science and Technology, Ulsan, Korea, where he is now an Associate Professor. His research interests include image, video, and 3D visual processing, computer vision, and multimedia data compression.
