A Review of Image-based Rendering Techniques
Heung-Yeung Shum and Sing Bing Kang
Microsoft Research
{hshum, sbkang}@microsoft.com
Abstract
In this paper, we survey the techniques for image-based rendering. Unlike traditional 3D computer graphics in
which 3D geometry of the scene is known, image-based rendering techniques render novel views directly from input
images. Previous image-based rendering techniques can be classified into three categories according to how much
geometric information is used: rendering without geometry, rendering with implicit geometry (i.e., correspondence),
and rendering with explicit geometry (either with approximate or accurate geometry). We discuss the characteristics of
these categories and their representative methods. The continuum between images and geometry used in image-based
rendering techniques suggests that image-based rendering and traditional 3D graphics can be united in a joint image
and geometry space.
Keywords: Image-based rendering, survey.
1 Introduction
Image-based modeling and rendering techniques have recently received much attention as a powerful alternative to
traditional geometry-based techniques for image synthesis. Instead of geometric primitives, a collection of sample
images is used to render novel views. Previous work on image-based rendering (IBR) reveals a continuum of image-based
representations [22, 15] based on the tradeoff between how many input images are needed and how much is known
about the scene geometry.
For didactic purposes, we classify the various rendering techniques (and their associated representations) into three
categories, namely rendering with no geometry, rendering with implicit geometry, and rendering with explicit geometry.
These categories, depicted in Figure 1, should actually be viewed as a continuum rather than absolute discrete ones, since
there are techniques that defy strict categorization.
At one end of the rendering spectrum, traditional texture mapping relies on very accurate geometric models but only
a few images. In image-based rendering systems that use depth maps, such as 3D warping [25], layered depth images
(LDI) [38], and the LDI tree [5], the model consists of a set of images of a scene and their associated depth maps. When depth
is available for every point in an image, the image can be rendered from any nearby point of view by projecting the pixels
of the image to their proper 3D locations and re-projecting them onto a new picture. For many synthetic environments
or objects, it is not difficult to keep the depth information during the rendering process. However, obtaining depth
information from real images is hard even for state-of-the-art vision algorithms.
[Figure 1 diagram: a spectrum running from less geometry to more geometry. Rendering with no geometry: light field, concentric mosaics, mosaicking. Rendering with implicit geometry: lumigraph, transfer methods, view morphing, view interpolation. Rendering with explicit geometry: LDIs, texture-mapped models, 3D warping, view-dependent geometry, view-dependent texture.]
Figure 1: Categories used in this paper, with representative members.
Some image-based rendering systems do not require explicit geometric models. Rather, they require feature correspondences
(such as point matches) between images. For example, view interpolation [6] generates novel views by interpolating
optical flow between corresponding points. On the other hand, view morphing [37] generates in-between camera matrices
along the line of two original camera centers, based on point correspondences. Computer vision techniques are usually
used to generate such correspondences.
At the other extreme, light field rendering uses many images but does not require any geometric information or
correspondence. Light field rendering [23] generates a new image of a scene by appropriately filtering and interpolating
a pre-acquired set of samples. Lumigraph [12] is similar to light field rendering but it applies approximated geometry
to compensate for non-uniform sampling in order to improve rendering performance. Unlike light field and lumigraph
where cameras are placed on a two-dimensional grid, the concentric mosaics representation [39] reduces the amount
of data by capturing a sequence of images along a circular path. Light field rendering, however, has a tendency to rely
on oversampling to counter undesirable aliasing effects in output display. Oversampling means more intensive data
acquisition, more storage, and more redundancy.
How many images are necessary for anti-aliased rendering? This sampling question needs to be answered by every
image-based rendering system. Sampling analysis in image-based rendering, however, is a difficult problem because it
involves unraveling the relationship among three elements: the depth and texture information of the scene, the number
of sample images, and the rendering resolution. The answer to the sampling analysis provides design principles for
image-based rendering systems, in terms of the trade-off between the images and the geometry information needed.
The remainder of this paper is organized as follows. Three categories of image-based rendering systems, with no,
implicit, and explicit geometric information respectively, are presented in Sections 2, 3, and 4. The issue of trade-off
between images and geometric information needed for image-based rendering is discussed in Section 5. We also discuss
compact representation and efficient rendering techniques in Section 6, and provide concluding remarks in Section 7.
2 Rendering with no geometry
In this section, we describe representative techniques for rendering with unknown scene geometry. These techniques
rely on the characterization of the plenoptic function.
2.1 Plenoptic modeling
The original 7D plenoptic function [1] is defined as the intensity of light rays passing through the camera center at every
location $(V_x, V_y, V_z)$ at every possible angle $(\theta, \phi)$, for every wavelength $\lambda$, at every time $t$, i.e.,
$$P_7 = P(V_x, V_y, V_z, \theta, \phi, \lambda, t). \tag{1}$$
Adelson and Bergen [1] considered one of the tasks of early vision as extracting a compact and useful description
of the plenoptic function’s local properties (e.g., low-order derivatives). It has also been shown in [44] that light source
directions can be incorporated into the plenoptic function for illumination control. By dropping two variables, time
$t$ (therefore a static environment) and light wavelength $\lambda$ (hence a fixed lighting condition), McMillan and Bishop [28]
introduced plenoptic modeling with the 5D complete plenoptic function,
$$P_5 = P(V_x, V_y, V_z, \theta, \phi). \tag{2}$$
The simplest plenoptic function is a 2D panorama (cylindrical [7] or spherical [43]) when the viewpoint is fixed,
$$P_2 = P(\theta, \phi). \tag{3}$$
A regular image (with a limited field of view) can be regarded as an incomplete plenoptic sample at a fixed viewpoint.
Image-based rendering, therefore, becomes a problem of constructing a continuous representation of the plenoptic function
from observed discrete samples (complete or incomplete). How to sample the plenoptic function and how to reconstruct
a continuous function from discrete samples are important research topics. For example, the samples used in [28]
are cylindrical panoramas. Disparity of each pixel in stereo pairs of cylindrical panoramas is computed and used for
generating new plenoptic function samples. Similar work on regular stereo pairs can be found in [20].
[Figure 2 diagram: an object and a light ray L(u, v, s, t), indexed by its intersections with the (u, v) and (s, t) planes.]
Figure 2: Representation of a light field.
Dimension  Year  Viewing space   Name
7          1991  free            Plenoptic function
5          1995  free            Plenoptic modeling
4          1996  bounding box    Lightfield/Lumigraph
3          1999  bounding plane  Concentric mosaics
2          1994  fixed point     Cylindrical/Spherical panorama
Figure 3: A taxonomy of plenoptic functions.
2.2 Light field and lumigraph
It was observed in both light-field rendering [23] and lumigraph [12] systems that as long as we stay outside the convex
hull (or simply a bounding box) of an object (see footnote 1), we can simplify the 5D complete plenoptic function to a 4D lightfield
plenoptic function,
$$P_4 = P(u, v, s, t), \tag{4}$$
where (u, v) and (s, t) parameterize two parallel planes of the bounding box, as shown in Figure 2. To have a complete
description of the plenoptic function for the bounding box, six sets of such two-planes are needed. More restricted
versions of the lumigraph have also been developed by Sloan et al. [41] and Katayama et al. [19], in which the camera motion is
restricted to a straight line.
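As a minimal sketch of the two-plane parameterization, the Python fragment below maps a ray to (u, v, s, t) by intersecting it with two parallel planes. The plane positions z = 0 and z = 1 and the function name ray_to_uvst are assumptions made for illustration; they are not taken from [23] or [12].

```python
def ray_to_uvst(origin, direction, z_uv=0.0, z_st=1.0):
    """Intersect a ray with the planes z = z_uv and z = z_st.

    Returns (u, v, s, t), or None if the ray is parallel to the planes.
    The plane placement is an illustrative assumption.
    """
    ox, oy, oz = origin
    dx, dy, dz = direction
    if abs(dz) < 1e-12:
        return None  # the ray never crosses the two planes
    a = (z_uv - oz) / dz  # ray parameter at the (u, v) plane
    b = (z_st - oz) / dz  # ray parameter at the (s, t) plane
    u, v = ox + a * dx, oy + a * dy
    s, t = ox + b * dx, oy + b * dy
    return (u, v, s, t)
```

A ray that is parallel to the planes has no such index, which is one reason several two-plane sets are needed to cover all directions around the bounding box.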
In the light field system, a capturing rig is designed to obtain uniformly sampled images. To reduce aliasing effects,
the light field is pre-filtered before rendering. A vector quantization scheme is used to reduce the amount of data used
in light field rendering, while still allowing random access and selective decoding. On the other hand, the lumigraph can be
constructed from a set of images taken from arbitrarily placed viewpoints. A re-binning process is therefore required.
Geometric information is used to guide the choices of the basis functions. Because of the use of geometric information,
sampling density can be reduced.
2.3 Concentric mosaics
Obviously the more constraints we have on the camera location $(V_x, V_y, V_z)$, the simpler the plenoptic function becomes.
If we want to capture all viewpoints, we need a complete 5D plenoptic function. As soon as we stay in a convex hull (or
conversely viewing from a convex hull) free of occluders, we have a 4D lightfield. If we do not move at all, we have a 2D
panorama. An interesting 3D parameterization of the plenoptic function, called concentric mosaics [39], was proposed
by Shum and He, in which the camera motion is constrained to concentric circles on a plane. A taxonomy of plenoptic
functions is shown in Figure 3.
Footnote 1: The reverse is also true if camera views are restricted inside a convex hull.
Figure 4: Rendering a lobby: rebinned concentric mosaic (a) at the rotation center; (b) at the outermost circle; (c) at the
outermost circle but looking at the opposite direction of (b); (d) parallax change between the plant and the poster.
By constraining camera motion to planar concentric circles, concentric mosaics can be created by compositing slit
images taken at different locations of each circle. Concentric mosaics index all input image rays naturally by three parameters:
radius, rotation angle, and vertical elevation. Novel views are rendered by combining the appropriate captured rays in
an efficient manner at rendering time. Although vertical distortions exist in the rendered images, they can be alleviated
by depth correction. Concentric mosaics have good space and computational efficiency. Compared with a lightfield or
lumigraph, concentric mosaics have much smaller file size because only a 3D plenoptic function is constructed.
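To illustrate the indexing, the sketch below maps a horizontal viewing ray to the first two parameters, radius and rotation angle, under the assumption that each captured ray is tangent to its concentric circle. The function name and the restriction to the capture plane (vertical elevation is omitted) are simplifications introduced here, not the implementation of [39].

```python
import math


def ray_to_mosaic_index(px, py, heading):
    """Map a horizontal viewing ray to (radius, rotation_angle).

    (px, py): viewpoint in the capture plane, relative to the rotation center.
    heading:  viewing direction in radians.
    """
    dx, dy = math.cos(heading), math.sin(heading)
    # Perpendicular distance from the center to the ray's line: the radius
    # of the concentric circle to which the ray is tangent.
    radius = abs(px * dy - py * dx)
    # Foot of the perpendicular from the center onto the line (tangent point).
    along = px * dx + py * dy
    tx, ty = px - along * dx, py - along * dy
    # The angle is not meaningful when radius is 0 (ray through the center).
    rotation_angle = math.atan2(ty, tx)
    return radius, rotation_angle
```

A renderer would then look up the nearest captured slit image for that radius and angle and interpolate between neighbors; that lookup is not shown.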
Most importantly, concentric mosaics are very easy to capture. Capturing concentric mosaics is as easy as capturing a
traditional panorama except that concentric mosaics require more images. By simply spinning an off-centered camera on
a rotary table, we can construct concentric mosaics for a real scene in 10 minutes. Like panoramas, concentric mosaics
do not require the difficult modeling process of recovering geometric and photometric scene models. Yet concentric
mosaics provide a much richer user experience by allowing the user to move freely in a circular region and observe
significant parallax and lighting changes. The ease of capturing makes concentric mosaics very attractive and useful for
many virtual reality applications.
Rendering of a lobby scene from captured concentric mosaics is shown in Figure 4. A rebinned concentric mosaic at
the rotation center is shown in Figure 4(a), while two rebinned concentric mosaics taken at exactly opposite directions
are shown in Figure 4(b) and (c), respectively. It has also been shown in [32] that such two mosaics taken from a single
rotating camera can simulate a stereo panorama. In Figure 4(d), strong parallax can be seen between the plant and the
poster in the rendered images.
2.4 Image mosaicing
A complete plenoptic function at a fixed viewpoint can be constructed from incomplete samples. Specifically, a panoramic
mosaic is constructed by registering multiple regular images. For example, if the camera focal length is known and fixed,
one can project each image to its cylindrical map and the relationship between the cylindrical images becomes a simple
translation. For arbitrary camera rotation, one can first register the images by recovering the camera movement, before
converting to a final cylindrical/spherical map.
Figure 5: Tessellated spherical panorama covering the north pole (constructed from 54 images).
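The cylindrical projection described above can be sketched as follows. The code assumes image coordinates measured from the principal point, a known focal length f in pixels, and a camera that pans about its vertical axis; the function names are hypothetical.

```python
import math


def image_to_cylinder(x, y, f):
    """Map image coordinates (x, y) to cylindrical coordinates (theta, h)."""
    theta = math.atan2(x, f)      # angle around the cylinder axis
    h = y / math.hypot(x, f)      # height on a unit-radius cylinder
    return theta, h


def project(point, pan, f):
    """Pinhole projection of a 3D point for a camera panned by `pan`
    radians about the vertical (y) axis. Assumes the point is in front
    of the camera."""
    X, Y, Z = point
    # Express the point in the panned camera's frame.
    xc = math.cos(pan) * X - math.sin(pan) * Z
    zc = math.sin(pan) * X + math.cos(pan) * Z
    return f * xc / zc, f * Y / zc
```

Projecting the same scene point with two different pan angles and converting both results with image_to_cylinder gives the same h and values of theta that differ by exactly the pan angle, which is the simple translation between cylindrical images.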
Many systems have been built to construct cylindrical and spherical panoramas by stitching multiple images together,
e.g., [24, 42, 7, 28, 43] among others. When the camera motion is very small, it is possible to put together only small
stripes from registered images, i.e., slit images (e.g., [46, 33]), to form a large panoramic mosaic. Capturing panoramas
is even easier if omnidirectional cameras (e.g., [30, 29]) or fisheye lenses [45] are used.
Szeliski and Shum [43] presented a complete system for constructing panoramic image mosaics from sequences of
images. Their mosaic representation associates a transformation matrix with each input image, rather than explicitly
projecting all of the images onto a common surface (e.g., a cylinder). In particular, to construct a full view panorama,
a rotational mosaic representation associates a rotation matrix (and optionally a focal length) with each input image.
A patch-based alignment algorithm is developed to quickly align two images given motion models. Techniques for
estimating and refining camera focal lengths are also presented.
In order to reduce accumulated registration errors, global alignment (block adjustment) is applied to the whole
sequence of images, which results in an optimally registered image mosaic. To compensate for small amounts of
motion parallax introduced by translations of the camera and other unmodeled distortions, a local alignment (deghosting)
technique [40] warps each image based on the results of pairwise local image registrations. Combining both global
and local alignment significantly improves the quality of our image mosaics, thereby enabling the creation of full view
panoramic mosaics with hand-held cameras.
A tessellated spherical map of the full view panorama is shown in Figure 5. Three panoramic image sequences of
a building lobby were taken with the camera on a tripod tilted at three different angles (with 22 images for the middle
sequence, 22 images for the upper sequence, and 10 images for the top sequence). The camera motion covers more than
two thirds of the viewing sphere, including the top.
3 Rendering with implicit geometry
There is a class of techniques that relies on positional correspondences across a small number of images to render new
views. The term implicit expresses the fact that geometry is not directly available; 3D information is
computed only using the usual projection calculations. New views are computed based on direct manipulation of these
positional correspondences, which are usually point features.
3.1 View interpolation
From two input images, given dense optical flow between them, Chen and Williams’ view interpolation method [6] can
reconstruct arbitrary viewpoints. This method works well when two input views are close by, so that visibility ambiguity
does not pose a serious problem. Otherwise, flow fields have to be constrained so as to prevent foldovers. In addition,
when two views are far apart, the overlapping parts of two images become too small. Chen and Williams’ approach
works particularly well when all the input images share a common gaze direction, and the output images are restricted
to have a gaze angle less than 90 degrees.
Establishing flow fields for view interpolation can be difficult, in particular for real images. Computer vision
techniques such as feature correspondence or stereo must be employed. For synthetic images, flow fields can be obtained
from the known depth values.
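The core of flow-based interpolation can be sketched in a few lines. The fragment below moves each pixel position a fraction of the way along its flow vector; it is a simplified illustration that ignores visibility ordering, foldovers, and hole filling, and the function name is an assumption.

```python
def interpolate_points(points, flow, t):
    """Move each pixel position a fraction t along its flow vector.

    points: list of (x, y) positions in the first image.
    flow:   list of (dx, dy) offsets to the corresponding points in the second.
    t:      0.0 reproduces the first view, 1.0 the second.
    """
    return [(x + t * dx, y + t * dy)
            for (x, y), (dx, dy) in zip(points, flow)]
```

With a dense flow field, the same operation is applied to every pixel, and colors are carried along with the moved positions.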
3.2 View morphing
From two input images, Seitz and Dyer’s view morphing technique [37] reconstructs any viewpoint on the line linking
two optical centers of the original cameras. Intermediate views are exactly linear combinations of the two views only if the
camera motion associated with the intermediate views is perpendicular to the camera viewing direction. If the two input
image planes are not parallel, a pre-warp stage can be employed to rectify the two input images so that corresponding scan lines
are parallel. Accordingly, a post-warp stage can be used to un-rectify the intermediate images. Scharstein [36] extends
this framework to camera motion in a plane. He assumes, however, that the camera parameters are known.
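The linearity condition can be checked numerically. The sketch below assumes two pinhole cameras with parallel image planes and the same focal length, displaced along the x axis and looking along +z; project_parallel and lerp are hypothetical helper names.

```python
def project_parallel(point, cx, f=1.0):
    """Project a 3D point into a camera centered at (cx, 0, 0), looking along +z."""
    X, Y, Z = point
    return (f * (X - cx) / Z, f * Y / Z)


def lerp(p, q, s):
    """Linear interpolation between two image points."""
    return tuple((1.0 - s) * a + s * b for a, b in zip(p, q))
```

For any point with Z > 0, lerp(project_parallel(X, c0), project_parallel(X, c1), s) agrees with project_parallel(X, (1 - s) * c0 + s * c1), because the projection is linear in the camera offset. This is why interpolating rectified images yields a physically valid in-between view, whereas non-parallel views need the pre-warp first.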
3.3 Transfer methods
Transfer methods (a term used within the photogrammetric community) are characterized by the use of a relatively small
number of images with the application of geometric constraints (either recovered at some stage or known a priori) to
reproject image pixels appropriately at a given virtual camera viewpoint. The geometric constraints can be of the form
of known depth values at each pixel, epipolar constraints between pairs of images, or trifocal/trilinear tensors that link
correspondences between triplets of images. The view interpolation and view morphing methods above are actually
specific instances of transfer methods.
Laveau and Faugeras [21] use a collection of images called reference views and the principle of the fundamental
matrix to produce virtual views. The new view, whose viewpoint is specified by interactively choosing the positions of four
control image points, is computed using a reverse mapping or raytracing process. For every pixel in the new target image,
a search is performed to locate the pair of image correspondences in two reference views. The search is facilitated by
using the epipolar constraints and the computed dense correspondences (also known as image disparities) between the
two reference views.
Note that if the camera is only weakly calibrated, the recovered viewpoint will be that of a projective structure (see
[11] for more details). This is because there is a class of 3-D projections and structures that will result in exactly the
same reference images. Since angles and areas are not preserved, the resulting viewpoint may appear warped. Knowing
the internal parameters of the camera removes this problem.
If a trifocal tensor, which is a 3 × 3 × 3 array, is known for a set of three images, then given a pair of point
correspondences in two of these images, a third corresponding point can be directly computed in the third image without
resorting to any projection computation. This idea has been used to generate novel views from either two or three
reference images [2].
The idea of generating novel views from two or three reference images is rather straightforward. First, the “reference”
trilinear tensor is computed from the point correspondences between the reference images. In the case of only two
reference images, one of the images is replicated and regarded as the “third” image. If the camera intrinsic parameters
are known, then a new trilinear tensor can be computed from the known pose change with respect to the third camera
location. The new view can subsequently be generated using the point correspondences from the first two images and
the new trilinear tensor. A set of novel views created using this approach can be seen in Figure 6.
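The point transfer that underlies this approach can be sketched as follows. The code assumes canonical cameras with the first camera equal to [I | 0] and builds the tensor directly from the other two camera matrices, which is the textbook multiple-view-geometry construction; it is not the procedure of [2], where the tensor is estimated from point correspondences. All function names are assumptions.

```python
def trifocal_from_cameras(P2, P3):
    """Trifocal tensor for cameras P1 = [I | 0], P2 and P3 (3x4 nested lists).

    T[i][j][k] = P2[j][i] * P3[k][3] - P2[j][3] * P3[k][i]
    """
    return [[[P2[j][i] * P3[k][3] - P2[j][3] * P3[k][i]
              for k in range(3)]
             for j in range(3)]
            for i in range(3)]


def transfer_point(T, x1, x2):
    """Predict the homogeneous point in view 3 from the match (x1, x2).

    Uses the vertical line through x2; this choice degenerates if that
    line happens to be the epipolar line of x1.
    """
    line2 = (x2[2], 0.0, -x2[0])
    return [sum(x1[i] * line2[j] * T[i][j][k]
                for i in range(3) for j in range(3))
            for k in range(3)]


def project(P, X):
    """Apply a 3x4 camera matrix to a homogeneous 3D point."""
    return [sum(P[r][c] * X[c] for c in range(4)) for r in range(3)]
```

The result is homogeneous, so it should be divided by its third component before being compared with pixel coordinates. No 3D point is reconstructed along the way, which matches the remark that the third point is obtained without an explicit projection computation.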
4 Rendering with explicit geometry
In this class of techniques, the representation has direct 3D information encoded in it, either in the form of depth along
known lines-of-sight, or 3D coordinates. The more traditional 3D texture-mapped model belongs to this category (not
described here, since its rendering uses the conventional graphics pipeline).
Figure 6: Example of visualizing using the trilinear tensor: The left-most two images are the reference images, with the
rest synthesized at arbitrary viewpoints.
4.1 3D warping
When the depth information is available for every point in one or more images, 3D warping techniques (e.g., [27]) can be
used to render nearby viewpoints. An image can be rendered from any nearby point of view by projecting the pixels of the
original image to their proper 3D locations and re-projecting them onto the new picture. The most significant problem
in 3D warping is how to deal with holes generated in the warped image. Holes are due to the difference in sampling
resolution between the input and output images, and to disocclusion, where part of the scene is seen by the output image
but not by the input images. To fill in holes, the most commonly used method is to splat each pixel of the input image over
several pixels in the output image.
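The warping step for a single pixel can be sketched as below. The code assumes a pinhole camera with focal length f and principal point (cx, cy) in pixels, and a new camera that differs from the reference camera by a pure translation; these parameters and the function name are illustrative assumptions rather than the formulation of [27].

```python
def warp_pixel(u, v, depth, f, cx, cy, translation):
    """Forward-warp one pixel with known depth into a translated camera.

    The pixel is lifted to a 3D point, expressed relative to a second camera
    displaced by `translation` (no rotation, same intrinsics), and projected.
    """
    # Back-project to 3D in the reference camera frame.
    X = (u - cx) * depth / f
    Y = (v - cy) * depth / f
    Z = depth
    # Coordinates relative to the new camera center.
    tx, ty, tz = translation
    Xn, Yn, Zn = X - tx, Y - ty, Z - tz
    if Zn <= 0:
        return None  # the point is behind the new camera
    return (f * Xn / Zn + cx, f * Yn / Zn + cy)
```

Because neighboring input pixels generally land on non-adjacent output positions, a full warper would splat each result over a small footprint and resolve overlaps by depth; neither step is shown here.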
4.1.1 Relief texture
To improve the rendering speed of 3D warping, the warping process can be factored into a relatively simple pre-warping
step and a traditional texture mapping step. The texture mapping step can be performed by standard graphics hardware.
This is the idea behind relief texture, a technique proposed by Oliveira and Bishop [31]. A similar factoring approach
has been proposed by Szeliski in a two-step algorithm [38], where the depth is first forward warped before the pixel is
backward mapped onto the output image.
4.1.2 Multiple-center-of-projection images
The 3D warping techniques can be applied not only to traditional perspective images, but also to multi-perspective
images. For example, Rademacher and Bishop [35] proposed to render novel views by warping multiple-center-of-projection images, or MCOP images.
4.2 Layered depth images
To deal with the disocclusion artifacts in 3D warping, Shade et al. proposed Layered Depth Image, or LDI [38], to store
not only what is visible in the input image, but also what is behind the visible surface. In LDI, each pixel in the input
image contains a list of depth and color values where the ray from the pixel intersects with the environment.
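A minimal data-structure sketch of this idea is given below: each pixel holds a list of (depth, color) samples kept in front-to-back order. The class and method names are assumptions for illustration and do not reproduce the layout used in [38].

```python
class LayeredDepthImage:
    """Each pixel stores a front-to-back list of (depth, color) samples."""

    def __init__(self, width, height):
        self.width, self.height = width, height
        self.pixels = [[[] for _ in range(width)] for _ in range(height)]

    def insert(self, x, y, depth, color):
        """Add a surface sample along the ray through pixel (x, y)."""
        layers = self.pixels[y][x]
        layers.append((depth, color))
        layers.sort(key=lambda sample: sample[0])  # keep nearest first

    def front(self, x, y):
        """Return the visible (nearest) sample, or None if the pixel is empty."""
        layers = self.pixels[y][x]
        return layers[0] if layers else None
```

When the viewpoint moves, samples behind the front layer are warped as well, so surfaces that become disoccluded are already available to fill the gap.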
Though LDI has the simplicity of warping a single image, it does not consider the issue of sampling rate, i.e., how
dense the LDI should be. Chang et al. [5] proposed LDI trees so that the sampling rates of the reference images are
preserved by adaptively selecting an LDI in the LDI tree for each pixel. While rendering with the LDI tree, only the level
of the LDI tree that is comparable to the sampling rate of the output image needs to be traversed.
4.3 View-dependent texture maps
Texture maps are widely used in computer graphics for generating photo-realistic environments. Texture-mapped models
can be created using a CAD modeler for a synthetic environment. For real environments, these models can be generated
using a 3D scanner or applying computer vision techniques to captured images. Unfortunately, vision techniques are
not robust enough to recover accurate 3D models. In addition, it is difficult to capture visual effects such as highlights,
reflections, and transparency using a single texture-mapped model.
Figure 7: Plenoptic sampling. Quantitative analysis of the relationships among three key elements: depth and texture
information, number of input images, and rendering resolution.
To obtain these visual effects of a reconstructed architectural environment, Debevec et al. [9] used view-dependent
texture mapping to render new views, by warping and compositing several input images of an environment. A three-step
view-dependent texture mapping method was also proposed later by Debevec et al. [8] to further reduce the computational
cost and to have smoother blending. This method employs visibility preprocessing, polygon-view maps, and projective
texture mapping.
5 Trade-off between images and geometry
Rendering with no geometry is expensive in terms of acquiring and storing the database. On the other hand, using explicit
geometry, while more compact, may compromise output visual quality. So, an important question is, what is the right
mix of image sampling size and quality of geometric information required to satisfy a mix of quality, compactness, and
speed? Part of that question may be answered by analyzing the nature of plenoptic sampling.
5.1 Plenoptic sampling analysis
Many image-based rendering systems, especially light field rendering [23, 12, 39], have a tendency to rely on oversampling
to counter undesirable aliasing effects in output display. Oversampling means more intensive data acquisition, more
storage, and more redundancy. Sampling analysis in image-based rendering is a difficult problem because it involves
unraveling the relationship among three elements: the depth and texture information of the scene, the number of sample
images, and the rendering resolution, as shown in Figure 7.
Chai et al. [4] recently studied plenoptic sampling, or how many images are needed for plenoptic modeling. Plenoptic
sampling can be stated as:
How many image samples (e.g., from a 4D light field) and how much geometric and textural information are
needed to generate a continuous representation of the plenoptic function?
Specifically, the following two problems are studied under plenoptic sampling:
• Minimum sampling rate for light field rendering;
• Minimum sampling curve in the joint image and geometry space.
Chai et al. formulate the question of sampling analysis as a high dimensional signal processing problem. Rather
than attempting to obtain a closed-form general solution to the 4D light field spectral analysis, they only analyze the
bounds of the spectral support of the light field signals. A key observation in [4] is that the spectral
support of a light field signal is bounded by only the minimum and maximum depths, irrespective of how complicated
the spectral support might be because of depth variations in the scene. Given the minimum and maximum depths, a
reconstruction filter with an optimal and constant depth can be designed to achieve anti-aliased light field rendering.
Figure 8: Minimum sampling: (a) the minimum sampling rate in image space; (b) the minimum sampling curve in the
joint image and geometry space; (c) minimum sampling curves at different rendering resolutions.
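For concreteness, the optimal constant depth can be written in terms of the two depth bounds. The expression used below, in which the reciprocal of the optimal depth is the average of the reciprocals of the minimum and maximum depths, is restated from the plenoptic sampling analysis [4] rather than derived in this survey; the function name is an assumption.

```python
def optimal_constant_depth(z_min, z_max):
    """Depth whose reciprocal (disparity) lies midway between the extremes.

    Implements 1 / z_opt = (1 / z_min + 1 / z_max) / 2.
    """
    if z_min <= 0 or z_max < z_min:
        raise ValueError("require 0 < z_min <= z_max")
    return 2.0 / (1.0 / z_min + 1.0 / z_max)
```

The result is always closer to the minimum depth than the arithmetic mean would be, reflecting the fact that nearby surfaces produce the largest disparities and therefore dominate the aliasing behavior.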
The minimum sampling rate of light field rendering is obtained by compacting the replicas of the spectral support
of the sampled light field within the smallest interval after the optimal filter is applied. How small the interval can be
depends on the design of the optimal filter. More depth information results in tighter bounds of the spectral support,
thus a smaller number of images. Plenoptic sampling in the joint image and geometry space determines the minimum
sampling curve which quantitatively describes the relationship between the number of images and the information on
scene geometry under a given rendering resolution. This minimal sampling curve serves as a design principle for
IBR systems. Furthermore, it bridges the gap between image-based rendering and traditional geometry-based rendering.
Minimum sampling rate and minimum sampling curves are illustrated in Figure 8.
There are a number of techniques that can be applied to reduce the size of the representation; they are usually based on
local coherency either in the spatial or temporal domains. The following subsections describe some of these techniques.
5.2 Multiple viewpoint rendering
An approach that bridges the notions of the light field or lumigraph and 3D scene geometry is what Halle calls multiple
viewpoint rendering [13]. Assuming that the 3D scene is completely known, multiple viewpoints can be precomputed at
known camera viewpoints and preprocessed to take advantage of the perspective coherence (i.e., the similarity of images
of a static scene at different viewpoints). The tool used for such a purpose is the EPI [3] representation. In this case, the
EPI is a slice of spatio-perspective space cut parallel to the direction of camera motion.
5.3 View-dependent geometry
Another interesting representation that trades off geometry and images is view-dependent geometry, first used in the
context of 3D cartoons [34]. We can potentially extend this idea to represent real or synthetically-generated scenes more
compactly. As described in [18], view-dependent geometry is useful to accommodate the fact that stereo reconstruction
errors are less visible during local viewpoint perturbations, but may show dramatic effects over large view changes. In
areas where stereo data is inaccurate, they suggest that we may well represent these areas with view-dependent geometry,
which comprises a set of geometry extracted at various positions (in [34], this set is manually created). View-dependent
geometry may also be used to capture visual effects such as highlights and transparency, which are likely to be locally
coherent in image and viewpoint spaces. This area should be a fertile one for future investigation with potentially
significant payoffs.
5.4 Dynamically reparameterized light field
Recently, Isaksen et al. [14] proposed the notion of dynamically reparameterized light fields by adding the ability to vary
the apparent focus within a light field using variable aperture and focus ring. Compared with the original light field and
lumigraph, this method can deal with a much larger depth variation in the scene by combining multiple focal planes.
Therefore, it is suitable not only for outside-looking-in objects, but also for inside-looking-out environments. When
multiple focus planes are used for a scene, a scoring algorithm is used before rendering to determine which focus plane
is used during rendering.
While this method does not need to recover actual or approximate geometry of the scene for focusing, it does need
to decide which focus plane to use. The number of focal planes needed is not discussed.
6 Discussion
Image-based rendering is an area that straddles both computer vision and computer graphics. The continuum between
images and geometry is evident from the image-based rendering techniques reviewed in this article. However, the
emphasis of this article is more on the aspect of rendering and not so much on image-based modeling. Other important
topics such as lighting and animation are also not treated here.
In this review, image-based techniques are divided based on how much geometric information has been used, i.e.,
whether the method uses explicit geometry (e.g., LDI), implicit geometry or correspondence (e.g., view interpolation),
or no geometry at all (e.g., light field). Other ways of dividing image-based rendering techniques have also been
proposed, for example based on the nature of the pixel indexing scheme [15].
There remain many challenges in image-based rendering, including:
1. Efficient representation
What is very interesting is the trade-off between the geometry and the images needed for anti-aliased image-based
rendering. Many image-based rendering systems have made their own choices about whether accurate geometry is needed and how
much geometric information should be used. Plenoptic sampling provides a theoretical foundation for designing
image-based rendering systems.
Both light field rendering and lumigraph avoided the feature correspondence problem by collecting many light
rays with known camera poses. With the help of a specially designed rig, they are capable of generating light
fields for objects sitting on a rotary table. Camera calibration with marked features was used in the lumigraph
system to recover camera poses. Unfortunately, the resulting light field/lumigraph database is very large even for
a small object (therefore small convex hull). Walkthrough of a real scene using lightfield has not yet been fully
demonstrated.
Because of the large amount of data used to represent the 4D function, lightfield compression is necessary. It also
makes sense to compress it because of the spatial coherency among all captured images.
2. Rendering performance
How would one implement the “perfect” rendering engine? One possibility would be to utilize current hardware
accelerators to produce, say, an approximate version of an LDI or a Lumigraph by replacing it with view-dependent
texture-mapped sprites. The alternative is to design new hardware accelerators that can handle both conventional
rendering and IBR. An example in this direction is the use of PixelFlow to render image-based models [26].
PixelFlow [10] is a high-speed image generation architecture that is based on the techniques of object-parallelism
and image composition.
3. Capturing
Panoramas are relatively easy to construct. Many previous systems have been built to construct cylindrical
and spherical panoramas by stitching multiple images together (e.g., [24, 42, 7, 28, 43]). When the camera motion
is very small, it is possible to put together only narrow strips from registered images, i.e., slit images (e.g., [46, 33]),
to form a large panoramic mosaic. Capturing panoramas is even easier if omnidirectional cameras (e.g., [30, 29])
or fisheye lenses [45] are used.
It is, however, very difficult to construct a continuous 5D complete plenoptic function [28, 17] (see footnote 2)
because it requires solving the difficult feature correspondence problem. The QuickTime VR system [7] simply lets the
3D space be discretized into a number of sample nodes, and the user can only jump between these nodes.
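The restriction to a discrete set of sample nodes can be illustrated with a minimal sketch. It is a simplified model and not the QuickTime VR mechanism itself, which moves between panoramas through hot spots: here a requested viewpoint is simply mapped to the closest captured node. The function name nearest_node and the node coordinates are hypothetical and chosen only for illustration.

```python
import math

def nearest_node(viewpoint, nodes):
    """Return the index of the sample node closest to the requested viewpoint.

    Simplified model of a node-based system: any viewpoint that was not
    captured is replaced by the nearest captured panorama position.
    """
    return min(range(len(nodes)), key=lambda i: math.dist(viewpoint, nodes[i]))

# Hypothetical panorama positions (x, y, z); assumed values for illustration.
nodes = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (4.0, 0.0, 3.0)]

requested = (3.1, 0.0, 0.4)          # where the user would like to stand
idx = nearest_node(requested, nodes)
print(idx, nodes[idx])               # the view is rendered from this node instead
```

Under this model the rendered viewpoint changes abruptly whenever the nearest node changes, which is the source of the jump between samples described above.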
Footnote 2: To date, no one has yet shown the collection of a 7D complete plenoptic function, even though wandering in a
dynamic environment with varying lighting conditions is a very interesting problem.
Image-based rendering can have many interesting applications. Two scenarios, in particular, are worth pursuing:
• Large environments.
Many successful techniques, e.g., light field and concentric mosaics, have restrictions on how much a user can change
the viewpoint. For large environments, QuickTime VR is still the most popular system despite the visual discomfort
caused by hot-spot jumping between panoramas. This can be alleviated by having multiple panoramic clusters and
enabling single-DOF transitions between these clusters [16], but motion is nevertheless still restricted. To move
around in a large environment, one has to combine image-based techniques with geometry-based models in order
to avoid requiring an excessive amount of data.
• Dynamic environments.
Until now, most image-based rendering systems have focused on static environments. With the development
of panoramic video systems, it is conceivable that image-based rendering can be applied to dynamic environments
as well. Two issues must be studied: sampling (how many images should be captured), and compression (how to
reduce data effectively).
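To give a rough sense of why sampling and compression are the central issues for dynamic environments, the following sketch computes the uncompressed size of a set of captured images and the compression ratio needed to fit a storage budget. All numbers (a 32 x 32 grid of 256 x 256 RGB images, 300 time samples, a 1 GiB budget) are assumptions chosen for illustration and are not taken from any system surveyed here; the function names are likewise hypothetical.

```python
def raw_size_bytes(num_views, width, height, bytes_per_pixel=3, num_frames=1):
    """Uncompressed storage for num_views images per time sample."""
    return num_views * width * height * bytes_per_pixel * num_frames

def required_compression_ratio(raw_bytes, budget_bytes):
    """Compression ratio needed to fit raw_bytes into budget_bytes."""
    return raw_bytes / budget_bytes

MIB = 1024 ** 2
GIB = 1024 ** 3

# Assumed static capture: 32 x 32 camera positions, 256 x 256 RGB images.
static_bytes = raw_size_bytes(num_views=32 * 32, width=256, height=256)

# Assumed dynamic capture: the same sampling repeated for 300 time samples.
dynamic_bytes = raw_size_bytes(32 * 32, 256, 256, num_frames=300)

print(static_bytes / MIB, "MiB for the static scene")
print(dynamic_bytes / GIB, "GiB for the dynamic scene")
print(required_compression_ratio(dynamic_bytes, 1 * GIB), "x to fit in 1 GiB")
```

The raw size grows linearly with the number of time samples, so either the sampling density must be reduced or the compression ratio must increase by the same factor.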
7 Concluding remarks
We have surveyed recent developments in the area of image-based rendering, and in particular, categorized them based on
the extent of use of geometric information in rendering. Geometry is used as a means of compressing representations for
rendering, with the limit being a single 3D model with a single static texture. While the purely image-based representations
have the advantage of photorealistic rendering, they come with the high costs of data acquisition and storage requirements.
Demands for realistic rendering, compactness of representation, speed of rendering, and the costs and limitations of computer vision reconstruction techniques force the practical representation to fall somewhere between the two extremes.
It is clear from our survey that IBR and the traditional 3D model-based rendering techniques have complementary characteristics that can be capitalized on. As a result, we believe that it is important that future rendering hardware be customized
to handle both the traditional 3D model-based rendering as well as IBR.
References
[1] E. H. Adelson and J. Bergen. The plenoptic function and the elements of early vision. In Computational Models of Visual
Processing, pages 3–20. MIT Press, Cambridge, MA, 1991.
[2] S. Avidan and A. Shashua. Novel view synthesis in tensor space. In Conference on Computer Vision and Pattern Recognition,
pages 1034–1040, San Juan, Puerto Rico, June 1997.
[3] H. H. Baker and R. C. Bolles. Generalizing epipolar-plane image analysis on the spatiotemporal surface. International Journal
of Computer Vision, 3(1):33–49, 1989.
[4] J.-X. Chai, X. Tong, S.-C. Chan, and H.-Y. Shum. Plenoptic sampling. In Proc. SIGGRAPH, 2000.
[5] C. Chang, G. Bishop, and A. Lastra. LDI tree: A hierarchical representation for image-based rendering. Computer Graphics
(SIGGRAPH’99), pages 291–298, August 1999.
[6] S. Chen and L. Williams. View interpolation for image synthesis. Computer Graphics (SIGGRAPH’93), pages 279–288, August
1993.
[7] S. E. Chen. QuickTime VR – an image-based approach to virtual environment navigation. Computer Graphics (SIGGRAPH’95),
pages 29–38, August 1995.
[8] P. Debevec, Y. Yu, and G. Borshukov. Efficient view-dependent image-based rendering with projective texture-mapping. In
Proc. 9th Eurographics Workshop on Rendering, pages 105–116, 1998.
[9] P. E. Debevec, C. J. Taylor, and J. Malik. Modeling and rendering architecture from photographs: A hybrid geometry- and
image-based approach. Computer Graphics (SIGGRAPH’96), pages 11–20, August 1996.
[10] J. Eyles, S. Molnar, J. Poulton, T. Greer, A. Lastra, N. England, and L. Westover. PixelFlow: The realization. In Siggraph/Eurographics Workshop on Graphics Hardware, Los Angeles, CA, Aug. 1997.
[11] O. Faugeras. Three-dimensional computer vision: A geometric viewpoint. MIT Press, Cambridge, Massachusetts, 1993.
[12] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen. The lumigraph. In Computer Graphics Proceedings, Annual
Conference Series, pages 43–54, Proc. SIGGRAPH’96 (New Orleans), August 1996. ACM SIGGRAPH.
[13] M. Halle. Multiple viewpoint rendering. In Computer Graphics Proceedings, Annual Conference Series, pages 243–254, Proc.
SIGGRAPH’98 (Orlando), July 1998. ACM SIGGRAPH.
[14] A. Isaksen, L. McMillan, and S. Gortler. Dynamically reparameterized light fields. Technical Report MIT-LCS-TR-778, May 1999.
[15] S. B. Kang. A survey of image-based rendering techniques. In VideoMetrics, SPIE Vol. 3641, pages 2–16, 1999.
[16] S. B. Kang and P. K. Desikan. Virtual navigation of complex scenes using clusters of cylindrical panoramic images. In Graphics
Interface, pages 223–232, Vancouver, Canada, June 1998.
[17] S. B. Kang and R. Szeliski. 3-D scene data recovery using omnidirectional multibaseline stereo. In IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR’96), pages 364–370, San Francisco, California, June 1996.
[18] S. B. Kang, R. Szeliski, and P. Anandan. The geometry-image representation tradeoff for rendering. In International Conference
on Image Processing, Vancouver, Canada, Sept. 2000.
[19] A. Katayama, K. Tanaka, T. Oshino, and H. Tamura. A viewpoint dependent stereoscopic display using interpolation of multiviewpoint images. In S. Fisher, J. Merritt, and B. Bolas, editors, Stereoscopic Displays and Virtual Reality Systems II, Proc.
SPIE, volume 2409, pages 11–20, 1995.
[20] S. Laveau and O. Faugeras. 3-D scene representation as a collection of images and fundamental matrices. Technical Report
2205, INRIA-Sophia Antipolis, February 1994.
[21] S. Laveau and O. D. Faugeras. 3-d scene representation as a collection of images. In Twelfth International Conference on Pattern
Recognition (ICPR’94), volume A, pages 689–691, Jerusalem, Israel, October 1994. IEEE Computer Society Press.
[22] J. Lengyel. The convergence of graphics and vision. IEEE Computer, July 1998.
[23] M. Levoy and P. Hanrahan. Light field rendering. In Computer Graphics Proceedings, Annual Conference Series, pages 31–42,
Proc. SIGGRAPH’96 (New Orleans), August 1996. ACM SIGGRAPH.
[24] S. Mann and R. W. Picard. Virtual bellows: Constructing high-quality images from video. In First IEEE International Conference
on Image Processing (ICIP-94), volume I, pages 363–367, Austin, Texas, November 1994.
[25] W. Mark, L. McMillan, and G. Bishop. Post-rendering 3d warping. In Proc. Symposium on I3D Graphics, pages 7–16, 1997.
[26] D. K. McAllister, L. Nyland, V. Popescu, A. Lastra, and C. McCue. Real-time rendering of real world environments. In
Eurographics Workshop on Rendering, Granada, Spain, June 1999.
[27] L. McMillan. An image-based approach to three-dimensional computer graphics. Technical report, Ph.D. Dissertation, UNC
Computer Science TR97-013, 1999.
[28] L. McMillan and G. Bishop. Plenoptic modeling: An image-based rendering system. Computer Graphics (SIGGRAPH’95),
pages 39–46, August 1995.
[29] V. S. Nalwa. A true omnidirectional viewer. Technical report, Bell Laboratories, Holmdel, NJ, USA, February 1996.
[30] S. Nayar. Catadioptric omnidirectional camera. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR’97), pages 482–488, San Juan, Puerto Rico, June 1997.
[31] M. Oliveira and G. Bishop. Relief textures. Technical report, UNC Computer Science TR99-015, March 1999.
[32] S. Peleg and M. Ben-Ezra. Stereo panorama with a single camera. In Proc. Computer Vision and Pattern Recognition Conf.,
1999.
[33] S. Peleg and J. Herman. Panoramic mosaics by manifold projection. In IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR’97), pages 338–343, San Juan, Puerto Rico, June 1997.
[34] P. Rademacher. View-dependent geometry. SIGGRAPH, pages 439–446, Aug. 1999.
[35] P. Rademacher and G. Bishop. Multiple-center-of-projection images. In Computer Graphics Proceedings, Annual Conference
Series, pages 199–206, Proc. SIGGRAPH’98 (Orlando), July 1998. ACM SIGGRAPH.
[36] D. Scharstein. Stereo vision for view synthesis. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR’96), pages 852–857, San Francisco, California, June 1996.
[37] S. M. Seitz and C. M. Dyer. View morphing. In Computer Graphics Proceedings, Annual Conference Series, pages 21–30,
Proc. SIGGRAPH’96 (New Orleans), August 1996. ACM SIGGRAPH.
[38] J. Shade, S. Gortler, L.-W. He, and R. Szeliski. Layered depth images. In Computer Graphics (SIGGRAPH’98) Proceedings,
pages 231–242, Orlando, July 1998. ACM SIGGRAPH.
[39] H.-Y. Shum and L.-W. He. Rendering with concentric mosaics. In Proc. SIGGRAPH 99, pages 299–306, 1999.
[40] H.-Y. Shum and R. Szeliski. Construction and refinement of panoramic mosaics with global and local alignment. In Sixth
International Conference on Computer Vision (ICCV’98), pages 953–958, Bombay, January 1998.
[41] P. P. Sloan, M. F. Cohen, and S. J. Gortler. Time critical lumigraph rendering. In Symposium on Interactive 3D Graphics, pages
17–23, Providence, RI, USA, 1997.
[42] R. Szeliski. Video mosaics for virtual environments. IEEE Computer Graphics and Applications, 16(2):22–30, March 1996.
[43] R. Szeliski and H.-Y. Shum. Creating full view panoramic image mosaics and texture-mapped models. Computer Graphics
(SIGGRAPH’97), pages 251–258, August 1997.
[44] T. Wong, P. Heng, S. Or, and W. Ng. Image-based rendering with controllable illumination. In Proceedings of the 8-th
Eurographics Workshop on Rendering, pages 13–22, St. Etienne, France, June 1997.
[45] Y. Xiong and K. Turkowski. Creating image-based VR using a self-calibrating fisheye lens. In IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR’97), pages 237–243, San Juan, Puerto Rico, June 1997.
[46] J. Y. Zheng and S. Tsuji. Panoramic representation of scenes for route understanding. In Proc. of the 10th Int. Conf. Pattern
Recognition, pages 161–167, June 1990.