Statistical Tuning of Adaptive-Weight Depth Map Algorithm
Alejandro Hoyos¹, John Congote¹,², Iñigo Barandiaran², Diego Acosta³, and Oscar Ruiz¹

¹ CAD CAM CAE Laboratory, EAFIT University, Medellin, Colombia
  {ahoyossi,oruiz}@eafit.edu.co
² Vicomtech Research Center, Donostia-San Sebastián, Spain
  {jcongote,ibarandiaran}@vicomtech.org
³ DDP Research Group, EAFIT University, Medellin, Colombia
  dacostam@eafit.edu.co
Abstract. In depth map generation, the settings of the algorithm parameters that yield an accurate disparity estimation are usually chosen empirically or based on unplanned experiments. A systematic statistical approach, including classical and exploratory data analyses on over 14000 images to measure the relative influence of the parameters, allows their tuning based on the number of bad pixels. Our approach is systematic in the sense that the heuristics used for parameter tuning are supported by formal statistical methods. The implemented methodology improves the performance of dense depth map algorithms. As a result of the statistics-based tuning, the algorithm improves from 16.78% to 14.48% bad pixels, rising 7 positions in the Middlebury Stereo Evaluation Ranking Table. The performance is measured as the distance of the algorithm results from the ground truth provided by Middlebury. Future work aims to achieve the tuning with significantly smaller data sets using fractional factorial and surface-response designs of experiments.

Keywords: Stereo Image Processing, Parameter Estimation, Depth Map.
1 Introduction
Depth map calculation deals with the estimation of multiple object depths on a
scene. It is useful for applications like vehicle navigation, automatic surveillance,
aerial cartography, passive 3D scanning, automatic industrial inspection, or 3D
videoconferencing [1]. These maps are constructed by generating, at each pixel,
an estimation of the distance between the screen and the object surface (depth).
Disparity is commonly used to describe inverse depth in computer vision, and
also to measure the perceived spatial shift of a feature observed from close camera
viewpoints. Stereo correspondence techniques often calculate a disparity function
d (x, y) relating target and reference images, so that the (x, y) coordinates of
the disparity space match the pixel coordinates of the reference image. Stereo
methods commonly use a pair of images taken with known camera geometry to
A. Berciano et al. (Eds.): CAIP 2011, Part II, LNCS 6855, pp. 563–572, 2011.
© Springer-Verlag Berlin Heidelberg 2011
generate a dense disparity map with estimates at each pixel. This dense output
is useful for applications requiring depth values even in difficult regions like
occlusions and textureless areas. The ambiguity of matching pixels in heavy
textured or textureless zones tends to require complex and expensive overall
image processing or statistical correlations using color and proximity measures
in local support windows.
Most implementations of vision algorithms make assumptions about the visual appearance of objects in the scene to ease the matching problem. The steps
generally taken to compute the depth maps may include: (i) matching cost computation, (ii) cost or support aggregation, (iii) disparity computation or optimization, and (iv) disparity refinement.
This article is based on work done in [1] where the principles of the stereo
correspondence techniques and the quantitative evaluator are discussed. The literature review is presented in section 2, followed by section 3 describing the
algorithm, filters, statistical analysis and experimental set up. Results and discussions are covered in section 4, and the article is concluded in section 5.
2 Literature Review
The algorithm and filters use several user-specified parameters to generate the
depth map of an image pair, and their settings are heavily influenced by the
evaluated data sets [2]. Published works usually report the settings used for their
specific case studies without describing the procedure followed to fine-tune them
[3,4,5], and some explicitly state the empirical nature of these values [6]. The
variation of the output as a function of several settings on selected parameters is
briefly discussed while not taking into account the effect of modifying them all
simultaneously [3,2,7]. Multiple stereo methods are compared by choosing values based on experiments, but only some algorithm parameters are changed, without detailing the complete rationale behind the chosen settings [1].
2.1 Conclusions of the Literature Review
Commonly used approaches in determining the settings of depth map algorithm
parameters show all or some of the following shortcomings: (i) undocumented
procedures for parameter setting, (ii) lack of planning when testing for the best
settings, and (iii) failure to consider interactions of changing all the parameters
simultaneously.
As a response to these shortcomings, this article presents a methodology to
fine-tune user-specified parameters on a depth map algorithm using a set of
images from the adaptive weight implementation in [4]. Multiple settings are used
and evaluated on all parameters to measure the contribution of each parameter
to the output variance. A quantitative accuracy evaluation allows using main
effects plots and analyses of variance on multi-variate linear regression models
to select the best combination of settings for each data set. The initial results
are improved by setting new values of the user-specified parameters, allowing
the algorithm to give much more accurate results on any rectified image pair.
3 Methodology

3.1 Image Processing
In the adaptive weight algorithm ([3,4]), a window is moved over each pixel on
every image row, calculating a measurement based on the geometric proximity
and color similarity of each pixel in the moving window to the pixel on its center.
Pixels are matched on each row based on their support measurement with larger
weights coming from similar pixel colors and closer pixels. The horizontal shift,
or disparity, is recorded as the depth value, with higher values reflecting greater
shifts and closer proximity to the camera.
The strength of grouping by color, fs(cp, cq), for pixels p and q is defined from the Euclidean distance between colors, ∆cpq, by Equation (1). Similarly, the strength of grouping by distance, fp(gp, gq), is defined from the Euclidean distance between pixel image coordinates, ∆gpq, by Equation (2), where γc and γp are adjustable settings used to scale the measured color difference and window size, respectively.

fs(cp, cq) = exp(−∆cpq / γc)    (1)

fp(gp, gq) = exp(−∆gpq / γp)    (2)
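As a concrete illustration, the two grouping-strength functions and the combined support weight can be sketched in Python (the function names and the NumPy-based formulation are ours, not the paper's):

```python
import numpy as np

def color_strength(c_p, c_q, gamma_c):
    """Grouping strength by color, Eq. (1): exp(-dc_pq / gamma_c),
    where dc_pq is the Euclidean distance between RGB colors."""
    delta_c = np.linalg.norm(np.asarray(c_p, float) - np.asarray(c_q, float))
    return np.exp(-delta_c / gamma_c)

def proximity_strength(g_p, g_q, gamma_p):
    """Grouping strength by distance, Eq. (2): exp(-dg_pq / gamma_p),
    where dg_pq is the Euclidean distance between pixel coordinates."""
    delta_g = np.linalg.norm(np.asarray(g_p, float) - np.asarray(g_q, float))
    return np.exp(-delta_g / gamma_p)

def support_weight(c_p, c_q, g_p, g_q, gamma_c, gamma_p):
    """Combined support weight w(p, q) = fs(cp, cq) * fp(gp, gq)."""
    return color_strength(c_p, c_q, gamma_c) * proximity_strength(g_p, g_q, gamma_p)
```

An identical pixel at the same location receives the maximum weight of 1; the weight decays exponentially as color or spatial distance grows.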
The matching cost between pixels, shown in Equation (3), is measured by aggregating raw matching costs, using the support weights defined by Equations (1) and (2), in support windows based on both the reference and target images:

E(p, p̄d) = [ Σ_{q∈Np, q̄d∈Np̄d} w(p, q) w(p̄d, q̄d) Σ_{c∈{r,g,b}} |Ic(q) − Ic(q̄d)| ] / [ Σ_{q∈Np, q̄d∈Np̄d} w(p, q) w(p̄d, q̄d) ]    (3)

where w(p, q) = fs(cp, cq) · fp(gp, gq), p̄d and q̄d are the target image pixels at disparity d corresponding to pixels p and q in the reference image, Ic is the intensity on channels red (r), green (g), and blue (b), and Np is the window centered at p and containing all q pixels. The size of this movable window N is another user-specified parameter. Increasing the window size reduces the chance of bad matches at the expense of missing relevant scene features.
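A minimal per-pixel sketch of the aggregation in Equation (3), assuming float RGB images and ignoring border handling (the loop structure and names are ours; a real implementation would vectorize this):

```python
import numpy as np

def aggregated_cost(ref, tgt, p, d, radius, gamma_c, gamma_p):
    """Adaptive-weight matching cost E(p, p_d) of Eq. (3) for one pixel p
    at disparity d; ref and tgt are float RGB images of shape (H, W, 3).
    A sketch: no border handling, O(window^2) work per pixel."""
    py, px = p
    num = den = 0.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            qy, qx = py + dy, px + dx          # q in the reference window Np
            qx_t = qx - d                      # q_d in the target window Np_d
            # support weights w(p,q) and w(p_d,q_d) from Eqs. (1)-(2)
            dg = np.hypot(dy, dx)
            w_ref = np.exp(-np.linalg.norm(ref[qy, qx] - ref[py, px]) / gamma_c) \
                  * np.exp(-dg / gamma_p)
            w_tgt = np.exp(-np.linalg.norm(tgt[qy, qx_t] - tgt[py, px - d]) / gamma_c) \
                  * np.exp(-dg / gamma_p)
            # raw matching cost: sum of absolute RGB differences
            e = np.abs(ref[qy, qx] - tgt[qy, qx_t]).sum()
            num += w_ref * w_tgt * e
            den += w_ref * w_tgt
    return num / den

```

Scanning d over the disparity search range and keeping the minimizer of this cost gives the winner-takes-all disparity at p.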
Post-Processing Filters. Algorithms based on correlations depend heavily
on finding similar textures at corresponding points in both reference and target
images. Bad matches happen more frequently in textureless regions, occluded
zones, and areas with high variation in disparity. The winner-takes-all approach enforces uniqueness of matches only for the reference image, so that points on the target image may be matched more than once. This creates the need to check the disparity estimates and fill any gaps with information from neighboring pixels using post-processing filters like the ones shown in Table 1.
Table 1. User-specified parameters of the adaptive weight algorithm and filters

Filter              | Function                                                          | User-specified parameters
Adaptive Weight [3] | Disparity estimation and pixel matching                           | γaws: similarity factor; γawg: proximity factor related to the WAW pixel size of the support window
Median              | Smoothing and incorrect match removal                             | WM: pixel size of the median window
Cross-check [8]     | Validation of the disparity measurement per pixel                 | ∆d: allowed disparity difference
Bilateral [9]       | Intensity and proximity weighted smoothing with edge preservation | γbs: similarity factor; γbg: proximity factor related to the WB pixel size of the bilateral window
Median Filter. Median filters are widely used in digital image processing to smooth signals and to remove incorrect matches and holes by assigning neighboring disparities, at the expense of edge preservation. The median filter provides a mechanism for reducing image noise while preserving edges more effectively than a linear smoothing filter. It sorts the intensities of all the q pixels in a window of size M and selects the median value as the new intensity of the central pixel p. The size M of the window is another of the user-specified parameters.
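A straightforward sketch of this filter, assuming a grayscale float image and edge-replicated borders (the paper does not specify a border policy):

```python
import numpy as np

def median_filter(img, m):
    """Replace each pixel with the median of its m-by-m neighborhood
    (m odd), as described above. Borders are handled by edge replication."""
    r = m // 2
    padded = np.pad(img, r, mode="edge")
    out = np.empty_like(img)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            window = padded[y:y + m, x:x + m]
            out[y, x] = np.median(window)   # sort and take the middle value
    return out
```

A single outlier pixel in an otherwise flat region is fully removed, which is exactly the incorrect-match-removal behavior the filter is used for here.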
Cross-check Filter. The correlation is performed twice by reversing the roles of
the two images and considering valid only those matches having similar depth
measures at corresponding points in both steps. The validity test is prone to
fail in occluded areas where disparity estimates will be rejected. The allowed
difference in disparities is one more adjustable parameter.
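The cross-check described above can be sketched as follows, assuming integer disparity maps with positive left-to-right disparities (a simplification; the paper does not give implementation details, and the invalid marker −1 is our convention):

```python
import numpy as np

def cross_check(disp_lr, disp_rl, max_delta):
    """Cross-check filter: keep a left-to-right disparity only if the
    right-to-left map agrees within max_delta at the matched pixel;
    rejected pixels are marked invalid (-1)."""
    h, w = disp_lr.shape
    out = np.full_like(disp_lr, -1)
    for y in range(h):
        for x in range(w):
            d = disp_lr[y, x]
            xr = x - d                      # matched column in the target image
            if 0 <= xr < w and abs(disp_rl[y, xr] - d) <= max_delta:
                out[y, x] = d               # consistent match: keep it
    return out
```

Occluded pixels, which are visible in only one image, typically fail this test and are left for gap filling by the subsequent filters.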
Bilateral Filter. The bilateral filter is a non-iterative method of smoothing images while retaining edge detail. The intensity value at each pixel in an image is replaced by a
weighted average of intensity values from nearby pixels. The weighting for each
pixel q is determined by the spatial distance from the center pixel p, as well as
its relative difference in intensity, defined by Equation (4).
Op = [ Σ_{q∈W} fs(q − p) gi(Iq − Ip) Iq ] / [ Σ_{q∈W} fs(q − p) gi(Iq − Ip) ]    (4)

where O is the output image, I the input image, W the weighting window, fs the spatial weighting function, and gi the intensity weighting function. The size of the window W is yet another parameter specified by the user.
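Equation (4) can be sketched for a grayscale image as follows; using exponential kernels for fs and gi is our assumption, chosen to match the weighting functions used elsewhere in the paper:

```python
import numpy as np

def bilateral_filter(img, w, gamma_s, gamma_i):
    """Bilateral filter of Eq. (4) for a grayscale float image with an
    odd window size w. Exponential kernels for f_s and g_i are assumed."""
    r = w // 2
    padded = np.pad(img, r, mode="edge")
    out = np.empty_like(img)
    h, wid = img.shape
    # precompute the spatial weights f_s(q - p), identical for every window
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    f_s = np.exp(-np.hypot(yy, xx) / gamma_s)
    for y in range(h):
        for x in range(wid):
            window = padded[y:y + w, x:x + w]
            g_i = np.exp(-np.abs(window - img[y, x]) / gamma_i)  # intensity weights
            k = f_s * g_i
            out[y, x] = (k * window).sum() / k.sum()             # Eq. (4)
    return out
```

Because gi shrinks across large intensity jumps, pixels on the far side of an edge contribute little to the average, which is what preserves the edge.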
3.2 Statistical Analysis
The user-specified input parameters and the output accuracy measurements are statistically analyzed, measuring the relations among inputs and outputs with correlation analyses, while box plots give insight into the influence of groups of settings on a given factor. A multi-variate linear regression model, shown in Equation (5), relates the output variable to all the parameters in order to find the equation coefficients and the coefficient of determination, and allows an analysis of variance to measure the influence of each parameter on the output variance. Residual analyses are performed to validate the assumptions of the regression model, such as constant error variance and zero-mean errors, and, if necessary, the model is transformed. The parameters are normalized to fit the range (−1, 1) as shown in Table 2.
ŷ = β0 + Σ_{i=1}^{n} βi xi + ε    (5)

where ŷ is the predicted variable, xi are the factors, and βi are the coefficients.
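A minimal ordinary-least-squares fit of the model in Equation (5), returning the coefficients and the coefficient of determination R² (a sketch; the paper's full analysis also includes ANOVA and residual diagnostics):

```python
import numpy as np

def fit_mvlr(X, y):
    """Fit the multi-variate linear regression of Eq. (5),
    y ~ b0 + sum(bi * xi), by ordinary least squares.
    X has one column per factor; returns (coefficients, R^2)."""
    A = np.column_stack([np.ones(len(y)), X])   # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    ss_res = (residuals ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return beta, 1.0 - ss_res / ss_tot          # R^2 = 1 - SS_res / SS_tot
```

With the factors coded into (−1, 1), the magnitudes of the fitted βi become directly comparable measures of each parameter's influence.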
3.3 Experimental Set Up
The depth maps are calculated with an implementation developed for real-time videoconferencing in [4], using well-known rectified image sets: Cones from [1], Teddy and Venus from [10], and Tsukuba head and lamp from the University of Tsukuba. Other commonly used sets are also freely available [11,12]. The sample used consists of 14688 depth maps, 3672 for each data set, like the ones shown in Figure 1.
Fig. 1. Depth Map Comparison. Top: best initial, bottom: new settings. (a) Cones, (b)
Teddy, (c) Tsukuba, and (d) Venus data set.
Many recent stereo correspondence performance studies use the Middlebury
Stereomatcher for their quantitative comparisons [2,7,13]. The evaluator code,
sample scripts, and image data sets are available from the Middlebury stereo
vision site1 , providing a flexible and standard platform for easy evaluation.
1 http://vision.middlebury.edu/stereo/
Table 2. User-specified parameters of the adaptive weight algorithm

Parameter                     | Name    | Levels | Values                | Coding
Adaptive Weights Window Size  | aw win  | 4      | [1 3 5 7]             | [-1 -0.3 0.3 1]
Adaptive Weights Color Factor | aw col  | 6      | [4 7 10 13 16 19]     | [-1 -0.6 -0.2 0.2 0.6 1]
Median Window Size            | m win   | 3      | [N/A 3 5]             | [N/A -1 0.2 1]
Cross-Check Disparity Delta   | cc disp | 4      | [N/A 0 1 2]           | [N/A -1 0 1]
Cross-Bilateral Window Size   | cb win  | 5      | [N/A 1 3 5 7]         | [N/A -1 -0.3 0.3 1]
Cross-Bilateral Color Factor  | cb col  | 7      | [N/A 4 7 10 13 16 19] | [N/A -1 -0.6 -0.2 0.2 0.6 1]
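The coding column of Table 2 is consistent with a linear mapping of each factor's tested values onto [−1, 1] (the table's entries appear rounded, e.g. −0.3 for −1/3); a sketch:

```python
def code_levels(values):
    """Map a factor's tested values linearly onto [-1, 1], as in the
    Coding column of Table 2: the midpoint of the range goes to 0,
    the extreme values to -1 and +1."""
    lo, hi = min(values), max(values)
    mid, half = (lo + hi) / 2.0, (hi - lo) / 2.0
    return [(v - mid) / half for v in values]
```

For example, the aw col levels [4, 7, 10, 13, 16, 19] map exactly to [−1, −0.6, −0.2, 0.2, 0.6, 1].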
The online Middlebury Stereo Evaluation Table gives a visual indication of how well the methods perform using the proportion of bad pixels (bad pixels) metric, defined as the average over all data sets of the proportion of bad pixels in the whole image (bad pixels all), the proportion of bad pixels in non-occluded regions (bad pixels nonocc), and the proportion of bad pixels in areas near depth discontinuities (bad pixels discont).
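The underlying bad-pixel proportion can be sketched as follows; the 1-disparity-level threshold follows the Middlebury convention, and the region masks (all / nonocc / discont) are left to the caller:

```python
import numpy as np

def bad_pixels(disp, gt, mask=None, threshold=1.0):
    """Proportion of bad pixels: the fraction of (optionally masked)
    pixels whose disparity differs from the ground truth by more than
    the threshold. A sketch of the metric used by the Middlebury
    Stereomatcher evaluator."""
    err = np.abs(disp.astype(float) - gt.astype(float)) > threshold
    if mask is not None:
        err = err[mask]                     # restrict to one region
    return err.mean()
```

Averaging this value over the three region masks and the four data sets yields the ranking-table figure quoted in the paper.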
4 Results and Discussion

4.1 Variable Selection
A Pearson correlation analysis of the factors shows that they are independent and that each one must be included in the evaluation. On the other hand, a strong correlation between bad pixels and the other outputs is detected, as shown in Figure 2. This allows the selection of bad pixels as the sole output, because the other responses are expected to follow a similar trend. The other outputs are explained in Table 3.
Table 3. Result metrics computed by the Middlebury Stereomatcher evaluator

Parameter              | Description
rms error all          | Root Mean Square (RMS) disparity error (all pixels)
rms error nonocc       | RMS disparity error (non-occluded pixels only)
rms error occ          | RMS disparity error (occluded pixels only)
rms error textured     | RMS disparity error (textured pixels only)
rms error textureless  | RMS disparity error (textureless pixels only)
rms error discont      | RMS disparity error (near depth discontinuities)
bad pixels all         | Fraction of bad points (all pixels)
bad pixels nonocc      | Fraction of bad points (non-occluded pixels only)
bad pixels occ         | Fraction of bad points (occluded pixels only)
bad pixels textured    | Fraction of bad points (textured pixels only)
bad pixels textureless | Fraction of bad points (textureless pixels only)
bad pixels discont     | Fraction of bad points (near depth discontinuities)
evaluate only          | Read specified depth map and evaluate only
output params          | Text file logging all used parameters
depth map              | Evaluated image
Fig. 2. bad pixels and other output correlation
4.2 Exploratory Data Analysis
Box plot analysis of bad pixels, presented in Figure 3(a), shows lower output values when using filters, relaxed cross-check disparity delta values, large adaptive weight window sizes, and large adaptive weight color factor values. The median window size, bilateral window size, and bilateral color factor values do not show a significant influence on the output at the studied levels.
The influence of the parameters is also shown in the slopes of the main effects plots of Figure 4, and confirms the behavior found with the ANOVA of the multi-variate linear regression model. The settings that lower bad pixels from this analysis yield a result of 14.48%.
Fig. 3. (a) Box plots of bad pixels. (b) Contribution to the bad pixels variance by parameter.
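The main effects plotted in Figure 4 reduce to the mean response at each tested level of a factor; a sketch (names are ours):

```python
import numpy as np

def main_effects(levels, response):
    """Main effect of one factor: the mean response (bad pixels) at
    each tested level, as plotted in the main effects plots of Fig. 4.
    levels and response are parallel arrays over all experimental runs."""
    levels = np.asarray(levels)
    response = np.asarray(response)
    return {lv: response[levels == lv].mean() for lv in np.unique(levels)}
```

A steep slope between consecutive level means corresponds to a factor with a large influence on the output variance.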
4.3 Multi-variate Linear Regression Model
The analysis of variance on a multi-variate linear regression (MVLR) over all data sets, using the most parsimonious model, quantifies the parameters with the most influence, as shown in Figure 3(b). cc disp is the most significant factor, accounting for a third to a half of the variance in every case.
Interactions and higher-order terms are included in the multi-variate linear regression models to improve the goodness of fit. Reducing the number of input images per data set from 3456 to 1526, by excluding the worst performing cases corresponding to cc disp = 0 and aw col = [4, 7], allows using a cubic model with interactions, with an R² of 99.05%.

The residuals of the selected model fail to follow a normal distribution. Transforming the output variable or removing large residuals does not improve the residual distribution, and there is no reason to exclude any outliers from the image data set. Nonetheless, improved algorithm performance settings are found using the model, obtaining lower bad pixels values comparable to the ones obtained through the exploratory data analysis (14.66% vs. 14.48%).
In summary, the most noticeable influence on the output variable comes from having a relaxed cross-check filter, accounting for nearly half the response variance in all the study data sets. Window size is the next most influential factor, followed by color factor, and finally window size on the bilateral filter. Increasing the window sizes on the main algorithm yields better overall results at the expense of longer running times and some loss of foreground sharpness, while the support weights on each pixel have the chance of becoming more distinct, potentially reducing disparity mismatches. Increasing the color factor on the main algorithm allows better results by reducing the color differences, slightly compensating minor variations in intensity from different viewpoints.

A small median smoothing filter window size is faster than a larger one, while still having similar accuracy. Low settings on both the window size and the color factor of the bilateral filter seem to work best for a good balance between performance and accuracy.
Fig. 4. Main Effects Plots of each factor level for all data sets. Steeper slopes relate to
bigger influence on the variance of the bad pixels output measurement.
The optimal settings in the original data set are presented in Table 4 along with the proposed combinations. Low settings comprises the depth maps with all parameter settings at their minimum tested values, yielding 67.62% bad pixels. High settings refers to depth maps with all parameter settings at their maximum tested values, yielding 19.84% bad pixels. Best initial are the most accurate depth maps from the study data set, yielding 16.78% bad pixels. Exploratory analysis corresponds to the settings determined using the exploratory data analysis based on box plots and main effects plots, yielding 14.48% bad pixels. MVLR optimization is the extrapolation of the classical data analysis based on the multi-variate linear regression model, nested models, and ANOVA, yielding 14.66% bad pixels.

Table 4. Model comparison. Average bad pixels values over all data sets and their parameter settings.

Run Type             | bad pixels | aw win | aw col | m win | cc disp | cb win | cb col
Low Settings         | 67.62%     | 1      | 4      | 3     | 0       | 1      | 4
High Settings        | 19.84%     | 7      | 19     | 5     | 2       | 7      | 19
Best Initial         | 16.78%     | 7      | 19     | 5     | 1       | 3      | 4
Exploratory analysis | 14.48%     | 9      | 22     | 5     | 1       | 3      | 4
MVLR optimization    | 14.66%     | 11     | 22     | 5     | 3       | 3      | 18
The exploratory analysis estimation and the MVLR optimization tend to
converge at similar lower bad pixels values using the same image data set. The
best initial and improved depth map outputs are shown in Figure 1.
5 Conclusions and Future Work
This work presents a systematic methodology to measure the relative influence of the inputs of a depth map algorithm on the output variance, and the identification of new settings that improve the results from 16.78% to 14.48% bad pixels. The methodology is applicable to any group of depth map image sets generated with an algorithm whose user-specified parameters merit an assessment of their relative influence.
Using design of experiments reduces the number of depth maps needed to carry out the study when a large image database is not available. Further analysis of the input factors should start with exploratory fractional factorial designs comprising the full range of each factor, followed by a response surface experimental design and analysis. In selecting the factor levels, analyzing the influence of each filter independently would be an interesting criterion.
Acknowledgments. This work has been partially supported by the Spanish
Administration Agency CDTI under project CENIT-VISION 2007-1007, the
Colombian Administrative Department of Science, Technology, and Innovation;
and the Colombian National Learning Service (COLCIENCIAS-SENA) grant
No. 1216-479-22001.
References
1. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo
correspondence algorithms. Int. J. Comput. Vision 47(1-3), 7–42 (2002)
2. Gong, M., Yang, R., Wang, L., Gong, M.: A performance study on different cost aggregation approaches used in real-time stereo matching. Int. J. Comput. Vision 75,
283–296 (2007)
3. Yoon, K., Kweon, I.: Adaptive support-weight approach for correspondence search.
IEEE Trans. Pattern Anal. Mach. Intell. 28(4), 650 (2006)
4. Congote, J., Barandiaran, I., Barandiaran, J., Montserrat, T., Quelen, J., Ferrán, C., Mindan, P., Mur, O., Tarrés, F., Ruiz, O.: Real-time depth map generation architecture for 3d videoconferencing. In: 3DTV-Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON), 2010, pp. 1–4 (2010)
5. Gu, Z., Su, X., Liu, Y., Zhang, Q.: Local stereo matching with adaptive support-weight, rank transform and disparity calibration. Pattern Recogn. Lett. 29, 1230–1235 (2008)
6. Hosni, A., Bleyer, M., Gelautz, M., Rhemann, C.: Local stereo matching using
geodesic support weights. In: Proceedings of the 16th IEEE Int. Conf. on Image
Processing (ICIP), pp. 2093–2096 (2009)
7. Wang, L., Gong, M., Gong, M., Yang, R.: How far can we go with local optimization in real-time stereo matching. In: Proceedings of the Third International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT 2006),
pp. 129–136 (2006)
8. Fua, P.: A parallel stereo algorithm that produces dense depth maps and preserves
image features. Machine Vision and Applications 6(1), 35–49 (1993)
9. Weiss, B.: Fast median and bilateral filtering. ACM Trans. Graph. 25, 519–526
(2006)
10. Scharstein, D., Szeliski, R.: High-accuracy stereo depth maps using structured
light. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1,
pp. 195–202 (2003)
11. Scharstein, D., Pal, C.: Learning conditional random fields for stereo. In: IEEE
Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
12. Hirschmuller, H., Scharstein, D.: Evaluation of cost functions for stereo matching.
In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
13. Tombari, F., Mattoccia, S., Di Stefano, L., Addimanda, E.: Classification and evaluation of cost aggregation methods for stereo correspondence. In: IEEE Conference
on Computer Vision and Pattern Recognition, pp. 1–8 (2008)