Upmixing and Downmixing Two-Channel Stereo Audio For Consumer Electronics


M. R. Bai and G.-Y. Shih: Upmixing and Downmixing Two-channel Stereo Audio for Consumer Electronics 1011

Upmixing and Downmixing Two-channel Stereo Audio for Consumer Electronics
Mingsian R. Bai and Geng-Yu Shih

Abstract: In this comprehensive study, algorithms for upmixing, downmixing, and joint up/downmixing are examined and compared. Five upmixing algorithms based on signal decorrelation and reverberation are employed to convert two-channel stereo signals to five-channel signals. For downmixing, methods ranging from mixing with simple gain adjustment to more sophisticated Head Related Transfer Function (HRTF) filtering and Crosstalk Cancellation System (CCS) are utilized to downmix the center channel and the surround channels into the available two frontal loudspeakers. For situations where only two-channel content and loudspeakers are available, a number of up/downmixing schemes are used to simulate a virtual surround environment. Emphasis of comparison is placed on two consumer electronic products: a 5.1 home theater system and a dual-loudspeaker MP3 handset. The effect of loudspeaker spacing on rendering performance is examined. Listening tests are conducted to compare the processing methods in terms of three levels of subjective indices. The results are processed by using the Multivariate Analysis of Variance (MANOVA) to justify the statistical significance, followed by a multiple regression analysis to correlate the auditory preference with various timbral and spatial attributes.¹

Index Terms: Virtual surround, upmixing, downmixing, subjective listening test.

¹ The work was supported by the National Science Council in Taiwan, Republic of China, under the project number NSC94-2212-E-009-019. The authors are with the Department of Mechanical Engineering, the National Chiao-Tung University, 1001 Ta-Hsueh Road, Hsin-Chu 300, Taiwan (e-mail: msbai@mail.nctu.edu.tw). Contributed Paper. Manuscript received May 30, 2007. 0098 3063/07/$20.00 © 2007 IEEE

I. INTRODUCTION
With the growing proliferation of multichannel audio in consumer electronics, there are situations where the number of channels of either the audio content or the reproducing loudspeakers is limited and it is necessary to upmix or downmix channels to improve the listening experience. Yet another situation where combined upmixing and downmixing is necessary is the newly emerging third-generation (3G) handsets fitted with dual loudspeakers. This paper is aimed at a comprehensive study that systematically compares a variety of upmixing and downmixing strategies.

In the upmixing process, surround channels can be created from the two-channel stereo inputs by using two different approaches. One approach uses the 'decorrelated' part of the stereo inputs as the surround channels, whereas the other approach produces the surround channels by simulating the reverberant sound field in the background. In this paper, five techniques including a passive surround decoder method [1], a Least-Mean-Square (LMS)-based method [2], an adaptive panning method [3], a Principal Component Analysis (PCA)-based method [4], and an artificial reverberator-based method [5] are presented.

Contrary to the upmixing process, the downmixing process is employed to produce a decreased number of channels due to practical reasons such as availability of loudspeakers. Without loss of generality, this study is focused on remixing 5.1 audio inputs into two-channel signals. Downmixing can be accomplished by simple mixing or more complex filtering by Head Related Transfer Functions (HRTFs), as in the Sound Retrieval System (SRS) 3D stereo sound system [6]. An HRTF is a mathematical model representing the propagation process from a sound source to the human ears [7].

Another problem well known in reproduction using loudspeakers is that the crosstalk of the contralateral paths from the loudspeakers to the listener's ears can adversely affect source localization. A solution to this problem is to minimize crosstalk using a Crosstalk Cancellation System (CCS) derived from inverse filter design. In this paper, downmixing techniques ranging from mixing with simple gain adjustment [8] to more sophisticated HRTF filtering [9] and CCS-based processing [10] shall be examined.

One last situation that may call for the combined use of upmixing and downmixing is when the audio inputs and the reproducing loudspeakers are both of the two-channel stereo configuration. The interactions between upmixing and downmixing are also investigated in the paper.

Subjective tests are carried out to assess the performance of each processing method. The experimental results are processed by using the multivariate analysis of variance (MANOVA) to justify the statistical significance. In addition, a multiple regression model is employed to correlate global auditory preference with low-level attributes. In light of these comprehensive tests, it is hoped that viable upmixing and downmixing techniques can be found to cater for practical multichannel audio reproduction.

II. UPMIXING ALGORITHMS
In this section, five upmixing algorithms are introduced. These methods differ in how the additional channels are derived from the two stereo channels. The general architecture of the direct-ambient upmixing approach for creating the additional channels is shown in Fig. 1. Common to all methods, the Center (C) channel is created by filtering the correlated input using a 128-tap FIR bandpass filter with cut-off frequencies 100 Hz and 4 kHz to emphasize voice and dialog. The Rear Left (RL)
1012 IEEE Transactions on Consumer Electronics, Vol. 53, No. 3, AUGUST 2007

and the Rear Right (RR) channels are intended to provide ambience and envelopment in the background, where a 15 ms delay is added to the rear channels to satisfy the precedence effect. High-frequency absorption is simulated by filtering the rear channels with a 128-tap FIR lowpass filter with a 7 kHz cut-off. In addition, the rear channels are 180 degrees out of phase with one another, which can increase the spaciousness of the ambient field [11]. The Low Frequency Enhancement (LFE) channel is derived from the center channel with a 128-tap FIR lowpass filter to retain the signals below 120 Hz.

Five upmixing techniques are presented as follows.

Fig. 1. General architecture of direct-ambient upmixing technique for the center, the LFE, and the surround channels.

A. The Passive Surround Decoder
The first approach employed in this study follows from an early passive version of the Dolby Surround Decoder [1], as shown in Fig. 2. The center channel is the average of the original stereo channels, whereas the rear surround signals are the difference of the stereo channels. That is,

$$ \mathrm{Center} = (L + R)/2 \tag{1} $$
$$ \mathrm{Surround} = (L - R)/2 \tag{2} $$

Fig. 2. The block diagram of the passive surround decoder.

B. The LMS-based Method
The approach to be described is based on a decorrelation technique using the LMS algorithm [2]. The basic idea of the LMS filter is shown in Fig. 3, where $d(n)$ is the desired signal, $\mathbf{x}(n)$ is the input of an FIR filter, $\mathbf{w}(n)$ is the coefficient vector of a 16-tap FIR filter, $y(n)$ is the output of the FIR filter and $e(n)$ is the error signal. The LMS algorithm consists of three important equations:

$$ y(n) = \mathbf{x}^T(n)\mathbf{w}(n) = \mathbf{w}^T(n)\mathbf{x}(n) \tag{3} $$
$$ e(n) = d(n) - y(n) = d(n) - \mathbf{w}^T(n)\mathbf{x}(n) \tag{4} $$
$$ \mathbf{w}(n+1) = \mathbf{w}(n) + 2\mu e(n)\mathbf{x}(n), \tag{5} $$

where $\mu$ is a constant step size that dictates the convergence behavior. The left channel and the right channel of the original stereo channels are taken as the desired signal and the input of the FIR filter, respectively. The output $y(n)$ representing the correlated part is used as the center channel, whereas the error $e(n)$ representing the uncorrelated part is used to produce two surround channels in upmixing. A special implementation of the LMS algorithm called Normalized LMS (NLMS) takes into account the variation in the input signals and selects a step size normalized by the input power. The weight update equation is modified into

$$ \mathbf{w}(n+1) = \mathbf{w}(n) + \frac{\tilde{\mu}}{\mathbf{x}^T(n)\mathbf{x}(n) + \psi}\, e(n)\mathbf{x}(n), \tag{6} $$

where $\tilde{\mu}$ and $\psi$ are positive constants.

On some occasions, the adaptive algorithm diverges due to the low correlation between the signals of the front channels when they are dominated by diffuse components. To deal with this, a correlation-based method can be used, where the step size is selected according to a modified correlation coefficient [12]:

$$ \rho(n) = \sum_{i=0}^{M-1} x(n-i)d(n-i) \Big/ \sum_{i=0}^{M-1} \left| x(n-i)d(n-i) \right|, \tag{7} $$

where $M$ is the length of the estimation window.

Fig. 3. The adaptive LMS algorithm with the left channel as the desired signal and the right channel as the input of an FIR filter.

C. The Adaptive Panning Method
Another method to be considered in this section is proposed by Irwan and Aarts [3]. Let $x_L(n)$ and $x_R(n)$ be the stereo signals, as shown in Fig. 4. The dominant signal $y(n)$ and the remaining signal $q(n)$ are generated by simple gain adjustment of the input signals.

Fig. 4. The plot of original stereo signals. Dashed lines represent the new coordinate system based on the dominant signal y and the remaining signal q, forming the direction of the principal axes.

To maximize the energy of $y(n)$, two scalar panning weights, $w_L(n)$ and $w_R(n)$, corresponding to the left and right channels are determined by using the LMS algorithm with $y(n-1)$ being the input.

$$ \begin{aligned} w_L(n) &= w_L(n-1) + \mu\, y(n-1)\,[x_L(n-1) - w_L(n-1)\,y(n-1)] \\ w_R(n) &= w_R(n-1) + \mu\, y(n-1)\,[x_R(n-1) - w_R(n-1)\,y(n-1)] \end{aligned} \tag{8} $$

Two signals, $y(n)$ and $q(n)$, are then obtained by panning the stereo signals $x_L(n)$ and $x_R(n)$ with the weights found above.

$$ y(n) = w_L(n)x_L(n) + w_R(n)x_R(n) \tag{9} $$
$$ q(n) = w_R(n)x_L(n) - w_L(n)x_R(n) \tag{10} $$
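The decorrelation schemes of Eqs. (1)-(6) and (8)-(10) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' fixed-point DSP implementation: the function names, the default step sizes, and the initial panning weights of 1/√2 are assumptions chosen for the example, and the update of Eq. (8) is written in the equivalent form that advances the weights from sample n to n + 1.

```python
import math
import random


def passive_decoder(left, right):
    """Eqs. (1)-(2): center is the average, surround is half the difference."""
    center = [(l + r) / 2.0 for l, r in zip(left, right)]
    surround = [(l - r) / 2.0 for l, r in zip(left, right)]
    return center, surround


def lms_decorrelate(d, x, taps=16, mu=0.005, normalized=False, psi=1e-6):
    """Eqs. (3)-(6): y is the part of d predictable from x, e is the residual."""
    w = [0.0] * taps
    buf = [0.0] * taps  # x(n), x(n-1), ..., newest sample first
    y_out, e_out = [], []
    for dn, xn in zip(d, x):
        buf = [xn] + buf[:-1]
        y = sum(wi * xi for wi, xi in zip(w, buf))  # Eq. (3)
        e = dn - y                                  # Eq. (4)
        if normalized:
            step = mu / (sum(xi * xi for xi in buf) + psi)  # Eq. (6)
        else:
            step = 2.0 * mu                                 # Eq. (5)
        w = [wi + step * e * xi for wi, xi in zip(w, buf)]
        y_out.append(y)
        e_out.append(e)
    return y_out, e_out


def adaptive_panning(x_left, x_right, mu=0.01):
    """Eqs. (8)-(10); the initial weights 1/sqrt(2) are an assumption."""
    w_l = w_r = 1.0 / math.sqrt(2.0)
    y_out, q_out = [], []
    for xl, xr in zip(x_left, x_right):
        y = w_l * xl + w_r * xr  # Eq. (9), dominant signal
        q = w_r * xl - w_l * xr  # Eq. (10), remaining signal
        y_out.append(y)
        q_out.append(q)
        # Eq. (8): both weights are updated from the same old values.
        w_l, w_r = (w_l + mu * y * (xl - w_l * y),
                    w_r + mu * y * (xr - w_r * y))
    return y_out, q_out, (w_l, w_r)


if __name__ == "__main__":
    random.seed(0)
    s = [random.uniform(-1.0, 1.0) for _ in range(3000)]
    _, _, weights = adaptive_panning(s, [0.5 * v for v in s])
    print("final panning weights:", weights)
```

With the left channel passed as `d` and the right channel as `x`, `y` would feed the center channel and `e` the surround channels, as described in Section II-B. For identical left and right inputs the remaining signal `q` is zero, which matches the intuition that a fully correlated source leaves nothing for the surround channels.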

Consequently, the dominant signal $y(n)$ and the remaining signal $q(n)$, corresponding to the correlated and the decorrelated channel, respectively, are used as the center and the surround channels in upmixing. The weights $w_L(n)$ and $w_R(n)$ generally fluctuate around 0.7, as shown in Fig. 5.

Fig. 5. The weights of the adaptive panning. The solid line is the panning weight corresponding to the left channel, $w_L(n)$. The dashed line is the panning weight corresponding to the right channel, $w_R(n)$.

D. The PCA-based Method
Along the line of the decorrelation idea, a PCA-based upmixing approach [4] is presented in this section. Let $x_L(n)$ and $x_R(n)$ be the left and the right channels. A 2×2 covariance matrix $\mathbf{A}$ is calculated as follows:

$$ \mathbf{A} = \begin{bmatrix} \mathrm{cov}(x_L, x_L) & \mathrm{cov}(x_L, x_R) \\ \mathrm{cov}(x_R, x_L) & \mathrm{cov}(x_R, x_R) \end{bmatrix}, \tag{11} $$

where $\mathrm{cov}(x_p, x_q)$, $p, q = L, R$, symbolizes the covariance estimated by an $l$-sample frame-based average. Specifically,

$$ \mathrm{cov}(x_L, x_R) = \sum_{n=1}^{l} [x_L(n) - \bar{x}_L][x_R(n) - \bar{x}_R] \Big/ (l - 1), \tag{12} $$

where $x_L(n)$ and $x_R(n)$ are the left and right signals at the time instant $n$, and $\bar{x}_L$ and $\bar{x}_R$ represent the means of the left and the right signals, respectively. The symmetric covariance matrix $\mathbf{A}$ is guaranteed to have two orthonormal eigenvectors. Let the eigenvectors associated with the larger and the smaller eigenvalues be $(C_L, C_R)$ and $(S_L, S_R)$, respectively. The center and the surround channels, presumably the correlated and the uncorrelated portions, can be derived by mixing the input signals according to the eigenvectors.

$$ \mathrm{Center} = C_L x_L(n) + C_R x_R(n) \tag{13} $$
$$ \mathrm{Surround} = S_L x_L(n) + S_R x_R(n) \tag{14} $$

Since the method is essentially frame-based processing, the discontinuities in mixing coefficients may lead to artifacts at the frame boundaries. A fading procedure is proposed to smooth the crossover of weights between successive frames. Let $C_L(k)$ and $C_L(k+1)$ be the coefficients to mix the left channel into the center channel in two successive frames, $k$ and $k+1$. Each frame is further divided into four smaller sub-frames. As shown in Fig. 6(a), the fading scheme proceeds with the following mixing coefficients, $w_{m,L-C}$, $m = 1, 2, 3, 4$, for the sub-frames

$$ \begin{aligned} w_{1,L-C}(k) &= C_L(k) \\ w_{2,L-C}(k) &= 0.75 \times C_L(k) + 0.25 \times C_L(k+1) \\ w_{3,L-C}(k) &= 0.5 \times C_L(k) + 0.5 \times C_L(k+1) \\ w_{4,L-C}(k) &= 0.25 \times C_L(k) + 0.75 \times C_L(k+1) \end{aligned} \tag{15} $$

Figure 6(b) shows the correspondence between the mixing coefficients and the left channel signals. Finally, the center and surround channels are produced by mixing the front channels using the new mixing coefficients as follows:

$$ \begin{aligned} C(k, m) &= w_{m,L-C}\, x_L(k, m) + w_{m,R-C}\, x_R(k, m) \\ S(k, m) &= w_{m,L-S}\, x_L(k, m) + w_{m,R-S}\, x_R(k, m) \end{aligned} \qquad m = 1, 2, 3, 4 \tag{16} $$

Fig. 6. The interpolation procedure of the weights in the PCA-based upmixing technique. (a) Frame and sub-frame structure. (b) The correspondence of the weights and the left channel signal.

E. The Reverb-based Method
An artificial reverberator is employed in this section to produce the ambience-enriched surround channels [5]. The reverberator comprises three parallel comb filters (Fig. 7(a)) and three nested allpass filters (Fig. 7(b)), which are used to increase the modal density and echo density of reverberation. The parameters are determined by a sophisticated optimization procedure using the Genetic Algorithm (GA) [13]. In this approach, the surround channels are simply generated by feeding the average of the stereo inputs to the above-mentioned reverberator, plus 180-degree phase reversal.

Fig. 7. The allpass-comb network of the reverberator. (a) The structure of each comb filter, where $b_p$ is the gain of the absorbent lowpass filter and $k_p$ is the gain of the comb filter. (b) The reverb filter comprising 3 parallel comb filters and 3 serial nested allpass filters.
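The PCA-based method of Section II-D, Eqs. (11)-(16), can be illustrated with a short plain-Python sketch. It is a sketch under stated assumptions rather than the authors' implementation: the function names and the default frame length are hypothetical, the frame length is assumed to be divisible by four, samples beyond the last whole frame are dropped, and the last frame reuses its own coefficients because no following frame exists. The principal axis of the 2×2 covariance matrix is obtained in closed form from the angle 0.5·atan2(2·cov(x_L, x_R), cov(x_L, x_L) − cov(x_R, x_R)); the sign of an eigenvector is arbitrary, and this formula simply fixes one choice.

```python
import math
import random


def principal_axes(x_left, x_right):
    """Unit eigenvectors of the covariance matrix of Eqs. (11)-(12)."""
    n = len(x_left)
    m_l = sum(x_left) / n
    m_r = sum(x_right) / n
    a = sum((u - m_l) ** 2 for u in x_left) / (n - 1)
    b = sum((u - m_l) * (v - m_r) for u, v in zip(x_left, x_right)) / (n - 1)
    c = sum((v - m_r) ** 2 for v in x_right) / (n - 1)
    theta = 0.5 * math.atan2(2.0 * b, a - c)  # angle of the dominant axis
    center = (math.cos(theta), math.sin(theta))      # (C_L, C_R)
    surround = (-math.sin(theta), math.cos(theta))   # (S_L, S_R)
    return center, surround


def pca_upmix(x_left, x_right, frame=1024):
    """Eqs. (13)-(16): frame-wise mixing with a four-step crossfade."""
    n_frames = len(x_left) // frame
    sub = frame // 4
    axes = [principal_axes(x_left[k * frame:(k + 1) * frame],
                           x_right[k * frame:(k + 1) * frame])
            for k in range(n_frames)]
    center, surround = [], []
    for k in range(n_frames):
        (cl0, cr0), (sl0, sr0) = axes[k]
        (cl1, cr1), (sl1, sr1) = axes[min(k + 1, n_frames - 1)]
        for m in range(4):
            g = m / 4.0  # 0, 0.25, 0.5, 0.75 as in Eq. (15)
            cl = (1.0 - g) * cl0 + g * cl1
            cr = (1.0 - g) * cr0 + g * cr1
            sl = (1.0 - g) * sl0 + g * sl1
            sr = (1.0 - g) * sr0 + g * sr1
            start = k * frame + m * sub
            for i in range(start, start + sub):
                center.append(cl * x_left[i] + cr * x_right[i])    # Eq. (16)
                surround.append(sl * x_left[i] + sr * x_right[i])  # Eq. (16)
    return center, surround


if __name__ == "__main__":
    random.seed(0)
    s = [random.uniform(-1.0, 1.0) for _ in range(4096)]
    c, sur = pca_upmix(s, [0.5 * v for v in s])
    print("surround peak:", max(abs(v) for v in sur))
```

For a fully correlated input (identical left and right channels) the dominant axis is (1/√2, 1/√2), so the surround output vanishes and all the energy goes to the center channel.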

III. DOWNMIXING ALGORITHMS
In many applications such as personal computer (PC) multimedia systems and portable audio products, only two-channel stereo loudspeakers are available. Thus, downmixing is necessary to convert multichannel audio content into two channels for loudspeaker presentation. In what follows, two downmixing techniques will be presented.

A. The Standard Downmixing Method
The standard, ITU-R BS.775-1 [8], details how to downmix multichannel signals with simple gain adjustment. The architecture of such a standard downmixing technique is shown in Fig. 8. That is,

$$ \begin{aligned} L &= FL + 0.71 \times C + 0.71 \times RL \\ R &= FR + 0.71 \times C + 0.71 \times RR \end{aligned} \tag{17} $$

Fig. 8. The architecture of the standard downmixing method.

B. The HRTF-based Method
An HRTF-based downmixing technique is presented in the following. Figure 10 shows the architecture of the HRTF-based downmixing technique in which the rear surround channels are fed into a Shuffler filter [10]. For a symmetrical acoustical system, the Shuffler filter shown in Fig. 9 can be used to reduce computation cost. The filters $\Sigma$ and $\Delta$ are given as

$$ \Sigma = h_i + h_c \tag{18} $$
$$ \Delta = h_i - h_c, \tag{19} $$

where $h_i$ and $h_c$ represent the ipsilateral and the contralateral HRTFs, respectively. In the HRTF-based method, the center, the rear left, and the rear right channels are filtered by the corresponding HRTFs at 0°, +110°, and −110°, respectively, to provide directional impression before mixing with the front channels. The HRTF database, implemented by using 128-tap FIR filters, is obtained from the website of the MIT Media Lab [7]. In the downmixing process, a crosstalk problem could arise as loudspeakers are used for reproducing the binaural signals. Excessive crosstalk could degrade sound localization, especially for closely spaced loudspeakers. In order to alleviate the problem, crosstalk cancellation can be incorporated into the HRTF-based downmixing process. In this paper, a multichannel deconvolution method based on the Fast Fourier Transform (FFT) and Tikhonov regularization is adopted to calculate the frequency response functions of the CCS filters [10]:

$$ \mathbf{C}(e^{j\omega}) = [\mathbf{H}^H(e^{j\omega})\mathbf{H}(e^{j\omega}) + \beta^2(\omega)\mathbf{I}]^{-1}\mathbf{H}^H(e^{j\omega}) \tag{20} $$

where $\mathbf{H}^H(e^{j\omega})$ is the Hermitian transpose of the acoustical system $\mathbf{H}(e^{j\omega})$ and $\beta$ is the regularization parameter. The coefficients of the CCS filters can be obtained by applying inverse FFT and circular shifts to the frequency response functions.

Fig. 9. The architecture of the Shuffler filter.

Fig. 10. The architecture of the HRTF-based downmixing method.

IV. INTEGRATION OF UP/DOWNMIXING ALGORITHMS
Most audio recordings now available are still in the stereo format and most consumer electronic products are equipped with two-channel loudspeakers. Upmixing and downmixing capabilities can be integrated to simulate a multichannel audio environment. The SRS 3D stereo sound system [6] is one such system with integrated upmixing and downmixing capabilities. In this paper, a hybrid method is introduced. The hybrid method processes the two-channel stereo inputs as follows:

$$ L_{out} = K_0 L + K_1 (L + R) + K_2 (L - R)\,p \tag{21} $$
$$ R_{out} = K_0 R + K_1 (L + R) + K_2 (R - L)\,p, \tag{22} $$

where $(L+R)$ and $(L-R)$ produce the upmixed center and the surround channels, and $K_0 = 1$, $K_1 = 0.5$, and $K_2 = 0.5$ are the gains for the L and R channels, the center channel, and the surround channels, respectively. In the Shuffler filter, the HRTF filtering of the surround channels is only needed for the $\Delta$ filter since the sum of $(L-R)$ and $(R-L)$ equals zero. That is, the filter $p$ in Eqs. (21) and (22) represents the difference of the ipsilateral and the contralateral transfer functions, $(h_i - h_c)$. The above equations have combined the (Dolby-like) upmixing and downmixing (with HRTFs) into one single step. An equalizer is employed to emphasize the lower frequencies (below 1 kHz) and the higher frequencies (7~20 kHz) and avoid the coloration problem resulting from the difference of signals.

V. SUBJECTIVE EVALUATION OF UPMIXING AND DOWNMIXING ALGORITHMS
A. Experimental Arrangement
1) Experimental Setup
Subjective listening tests were conducted to assess the performance of the aforementioned upmixing and downmixing algorithms. A 5.1 home theater system including five 3.5-inch loudspeakers and a subwoofer was used for sound reproduction.

The loudspeakers were deployed according to ITU-R BS.775 [8], as shown in Fig. 11(a). On the other hand, a dual-loudspeaker MP3 handset was adopted in another subjective test, as shown in Fig. 11(b). The listening tests were conducted in compliance with the requirements of ITU-R BS.1116 [14]. As required in the CCS design, the binaural transfer functions from the loudspeakers to the microphones embedded in the ears of a KEMAR manikin were measured in an anechoic chamber by using a spectrum analyzer.

Fig. 11. The experimental setup in the listening room. (a) The standard 5.1 configuration for multichannel loudspeaker reproduction. (b) The MP3 handset equipped with dual loudspeakers.

2) Procedure of Listening Test
A modified double-blind Multi-Stimulus test with Hidden Reference and a hidden Anchor (MUSHRA) (ITU-R BS.1534 [15]) was employed as the basis of the experimental design. The original unprocessed signal was used as the hidden reference in the test. The hidden anchor employed in this test was the 'phantom mono' reproduction that broadcasts the same signal over two-channel loudspeakers. The upmixing and downmixing algorithms were implemented on the platform of a fixed-point digital signal processor (DSP) equipped with a 6-input and 6-output codec operating at 48 kHz. The subjects were allowed to switch between different stimuli at their discretion and grade the presentations. The loudness of each reproduced signal was adjusted to equal level by a group of five skilled subjects to minimize experimental errors due to loudness variation. Three-leveled hierarchical subjective indices shown in Fig. 12 were employed to assess the performance of the upmixing and downmixing techniques.

Fig. 12. Three-leveled hierarchical subjective indices employed in the listening test to assess the performance of the upmixing and downmixing techniques.

In the lowest hierarchy, the width, depth, and spaciousness refer to the perceived angular width, depth of the sound image, and the ambience, envelopment, and sensation of space pertaining to the listening environment. The fullness and brightness refer to the dominant low and high frequency content, respectively. The index, artifact, refers to any linear or nonlinear distortions. In the second hierarchy, the spatial quality and timbral quality describe the general performance of the spatial and timbral characteristics, respectively. In the top hierarchy, the overall preference represents the global impression of the reproduced program. The grading scale used for the subjective tests is shown in Table I.

TABLE I
THE GRADING SCALE USED IN THE MUSHRA PROCESS.

Performance        Grade
Much better         3
Better              2
Slightly better     1
About the same      0
Slightly worse     -1
Worse              -2
Much worse         -3

3) Test Subjects
Thirty listeners who are experienced in audio evaluation took part in the listening tests. A training phase was arranged for the listeners prior to the formal test such that the listeners were thoroughly familiarized with the test facilities, the environment, and the grading process. Different items of lowpass processing, highpass processing, phantom mono, and multichannel 5.1 mode of presentation were demonstrated in the training phase.

4) Data Analysis
The results of the tests were processed by using the MANOVA. Both the mean and the 95% confidence intervals of the grades were shown in the analysis results. Cases with significance levels below p = 0.05 indicate that the difference among methods is statistically significant. The scores obtained from the hidden reference and anchor were only intended for validating the consistency of subjects, and were excluded from the MANOVA as usually done in practice. Three statistical assumptions of MANOVA, namely independence of grading, normal distribution of scores, and homogeneity of variances (HOV), were verified in these experiments. Independence of grading has been fulfilled due to the randomization of the experimental factors. The assumptions of normality and HOV were examined by using the Shapiro-Wilk W test and Levene's test, respectively [16]. Furthermore, a multiple regression model was employed to correlate the global auditory attributes with the low-level attributes in the listening test.

TABLE II
THE STATISTICAL ANALYSIS RESULTS OF THE LMS-BASED UPMIXING TEST FOR THE HOME THEATER LOUDSPEAKERS.

Dependent variable    Type III SS   df   MS         F          p
Width                 0.545455      2    0.272727   0.387931   0.681813
Depth                 0.242424      2    0.121212   0.322581   0.726759
Spaciousness          0.424242      2    0.212121   0.472973   0.627715
Fullness              0.181818      2    0.090909   0.111940   0.894469
Brightness            0.000000      2    0.000000   0.000000   1.000000
Artifact              0.000000      2    0.000000   0.000000   1.000000
Spatial quality       0.424242      2    0.212121   0.813953   0.452649
Timbral quality       0.000000      2    0.000000   0.000000   1.000000
Overall preference    0.424242      2    0.212121   0.372340   0.692260

TABLE III
THE STATISTICAL ANALYSIS RESULTS OF THE UPMIXING TEST FOR THE HOME THEATER LOUDSPEAKERS.

Dependent variable    Type III SS   df   MS         F          p
Width                 20.72000      4    5.18000    2.749747   0.034795
Depth                 1.41333       4    0.35333    0.216706   0.928302
Spaciousness          18.34667      4    4.58667    3.698925   0.008650
Fullness              7.06667       4    1.76667    1.069781   0.378068
Brightness            7.12000       4    1.78000    1.005920   0.410419
Artifact              13.78667      4    3.44667    5.337758   0.000825
Spatial quality       14.74667      4    3.68667    2.662311   0.039562
Timbral quality       21.46667      4    5.36667    4.551696   0.002517
Overall preference    41.94667      4    10.48667   4.942101   0.001442

B. Evaluation of Upmixing Algorithms
An experiment was conducted to compare the upmixing methods for PC home theater loudspeakers. Upmixing does not apply to the case of the handset because of the limited number of the rendering loudspeakers. We ran a listening test to choose the LMS-based approach that was most effective in upmixing. The result shown in Fig. 13 revealed that the correlation-based approach slightly outperformed the other methods in spaciousness and overall preference. However, the difference in performance was not significant because the p-values of the MANOVA output summarized in Table II were above 0.05. Therefore, we chose the simple LMS algorithm for upmixing because of its computational efficiency.

Fig. 13. Listening test results of three LMS-based upmixing methods for the home theater loudspeakers. (Pha: phantom mono reproduction, LMS: LMS algorithm, NLMS: NLMS algorithm, Co.LMS: correlation-based LMS method, H.R.: hidden reference).

Figure 14 compares five upmixing methods by plotting the mean grades with 95% confidence intervals. The small p-values of the MANOVA output are summarized in Table III. As expected, the grade of the hidden reference was nearly zero. The phantom mono attained the lowest grade except for depth, fullness, and artifact. This justifies the reliability and consistency of the test subjects. The reverb-based method outperforms the other methods in the spatial attributes, albeit ringing artifacts were reported by some subjects. The reverb-based method has attained the highest grade among all methods in overall preference. This seems to suggest that reverb-based upmixing methods are subjectively superior to the correlation-based methods.

Fig. 14. Listening test results of five upmixing methods for the home theater loudspeakers. (Pha: phantom mono reproduction, Passive: passive surround decoder, LMS: LMS-based method, Ad.Pan.: adaptive panning method, PCA: PCA-based method, Reverb: reverb-based method, H.R.: hidden reference).

C. Evaluation of Downmixing Algorithms
In this section, the downmix processing using the standard downmix method, the HRTF-based method, and the HRTF-CCS-based method will be examined. The 5.1 home theater system and a dual-loudspeaker MP3 handset are used as the rendering systems, respectively. In the case of home theater, the 5.1 setup was employed as the reference. In the case of handset, the standard downmix method was employed as the reference. The test results of the downmixing for the home theater system are shown in Fig. 15. As expected, the grades of the hidden reference and phantom mono reproduction were quite low in most aspects.

Fig. 15. Listening test results of three downmixing techniques for the home theater loudspeakers (Pha: phantom mono reproduction, Std.: standard downmixing, HRTF: HRTF-based method, H.C.: HRTF-CCS-based method, H.R.: hidden reference).

The MANOVA output shown in Table IV indicated significant difference among the approaches. The HRTF-CCS-based method has attained the higher grade in spatial characteristics and overall preference, but overlapping of the confidence intervals suggested that the difference was not statistically significant. The fact that the CCS did not work as expected, especially for widely spaced loudspeakers, could be due to several reasons [17] [18]. First, the reflections from boundaries may have obscured the localization of sound images. Second, CCS can yield only a marginal performance gain for widely spaced loudspeakers because the natural separation is inherently good in such a situation. Third, the CCS is

ineffective as the listener moved outside the sweet spot for binaural reproduction, where the sweet spot of widely spaced loudspeakers is smaller than that of the closely spaced loudspeakers [18]. Figure 16 shows the test result of the downmixing for the handset. The MANOVA outputs are summarized in Table V. Among the methods, no significant difference was found in fullness because the handset loudspeakers did not have sufficient low-frequency response. Although HRTF improved spatial quality, it also had a detrimental effect on timbral quality. Consequently, the HRTF-based method was no longer superior to the standard downmixing method in overall preference as in the previous test. However, the CCS technique has significantly improved the spatial impression. This suggests that the use of CCS is critical to downmixing for closely spaced loudspeakers.

Fig. 16. Listening test results of three downmixing techniques for the MP3 handset, with the mean and 95% confidence intervals of scores indicated on the plot. (Pha: phantom mono reproduction, Std.(H.R.): standard downmixing referred as hidden reference, HRTF: HRTF-based method, H.C.: HRTF-CCS-based method).

TABLE IV
THE STATISTICAL ANALYSIS RESULTS OF THE DOWNMIXING TEST FOR THE HOME THEATER LOUDSPEAKERS.

Dependent variable    Type III SS   df   MS         F          p
Width                 16.22222      2    8.11111    4.17685    0.024154
Depth                 5.05556       2    2.52778    2.59326    0.089942
Spaciousness          11.55556      2    5.77778    5.66337    0.007682
Fullness              4.66667       2    2.33333    2.53846    0.094310
Brightness            39.38889      2    19.69444   25.40391   0.000000
Artifact              6.16667       2    3.08333    6.97714    0.002971
Spatial quality       9.55556       2    4.77778    4.45176    0.019424
Timbral quality       2.00000       2    1.00000    1.65000    0.207501
Overall preference    5.72222       2    2.86111    4.30798    0.021762

TABLE V
THE STATISTICAL ANALYSIS RESULTS OF THE DOWNMIXING TEST FOR THE MP3 HANDSET.

Dependent variable    Type III SS   df   MS         F          p
Width                 27.15152      2    13.57576   32.94118   0.000000
Depth                 1.63636       2    0.81818    1.64634    0.209694
Spaciousness          24.42424      2    12.21212   51.66667   0.000000
Fullness              0.42424       2    0.21212    0.48611    0.619774
Brightness            12.78788      2    6.39394    6.67722    0.003993
Artifact              5.09091       2    2.54545    4.11765    0.026294
Spatial quality       18.72727      2    9.36364    30.29412   0.000000
Timbral quality       1.87879       2    0.93939    1.01974    0.372853
Overall preference    9.51515       2    4.75758    5.13072    0.012119

D. Evaluation of Up/Downmixing Algorithms
Another listening test was conducted to evaluate the combined upmixing and downmixing techniques. In addition to the above-mentioned hybrid method, the procedure that integrates the passive surround decoder for upmixing and the standard method for downmixing was included in the test as a benchmark. Also, the reverb-based upmixing approach that has achieved the best performance in the preceding tests was integrated with various downmixing methods, including the standard downmixing method, the HRTF-based method, and the HRTF-CCS-based method. The front right and the front left loudspeakers of the home theater system were used as the rendering devices in the home theater test, whereas the two microspeakers in the handset were used as the rendering devices in the handset test.

Fig. 17. Listening test results of up/downmixing techniques for the home theater loudspeakers. (Pha: phantom mono reproduction, Pa.Std.: passive surround decoder upmixing + standard downmixing, Hybrid: the hybrid method, Re.Std.: reverb-based upmixing + standard downmixing, Re.HRTF: reverb-based upmixing + HRTF-based downmixing, Re.H.C.: reverb-based upmixing + HRTF-CCS-based downmixing, H.R.: hidden reference).

Fig. 18. Listening test results of the up/downmixing techniques for the MP3 handset. (Pha: phantom mono reproduction, Pa.Std.: passive surround decoder upmixing + standard downmixing, Hybrid: the hybrid method, Re.Std.: reverb-based upmixing + standard downmixing, Re.HRTF: reverb-based upmixing + HRTF-based downmixing, Re.H.C.: reverb-based upmixing + HRTF-CCS-based downmixing, H.R.: hidden reference).

TABLE VI
THE STATISTICAL ANALYSIS RESULTS OF THE UP/DOWNMIXING TEST FOR THE HOME THEATER LOUDSPEAKERS.

Dependent variable    Type III SS   df   MS         F          p
Width                 7.00000       4    1.750000   2.916667   0.031485
Depth                 1.32000       4    0.330000   0.339041   0.850181
Spaciousness          9.40000       4    2.350000   3.975564   0.007591
Fullness              21.08000      4    5.270000   4.686759   0.003011
Brightness            17.12000      4    4.280000   4.755556   0.002757
Artifact              7.08000       4    1.770000   3.403846   0.016265
Spatial quality       10.88000      4    2.720000   2.467742   0.058232
Timbral quality       13.48000      4    3.370000   6.266529   0.000427
Overall preference    6.60000       4    1.650000   1.444553   0.234981

TABLE VII
THE STATISTICAL ANALYSIS RESULTS OF THE UP/DOWNMIXING TEST FOR THE MP3 HANDSET.

Dependent variable    Type III SS   df   MS         F          p
Width                 7.35000       4    1.837500   3.868421   0.010504
Depth                 15.65000      4    3.912500   5.676166   0.001250
Spaciousness          9.40000       4    2.350000   3.576087   0.015115
Fullness              0.40000       4    0.100000   0.118644   0.974978
Brightness            8.35000       4    2.087500   2.731308   0.044484
Artifact              6.90000       4    1.725000   3.037736   0.029940
Spatial quality       11.35000      4    2.837500   4.815152   0.003356
Timbral quality       6.40000       4    1.600000   3.612903   0.014433
Overall preference    9.35000       4    2.337500   2.671429   0.048086

up/downmixing methods performed predominantly better than the unprocessed reference.

E. Regression Analysis of Auditory Attributes
The regression models are obtained as follows:

$$ \mathrm{Preference} = 0.545 \times \mathrm{Spatial} + 0.568 \times \mathrm{Timbral} + 0.048 \quad (R^2 = 0.587) \tag{23} $$
$$ \mathrm{Spatial} = 0.156 \times \mathrm{Width} + 0.051 \times \mathrm{Depth} + 0.726 \times \mathrm{Spaciousness} + 0.077 \quad (R^2 = 0.785) \tag{24} $$
$$ \mathrm{Timbral} = 0.125 \times \mathrm{Fullness} + 0.209 \times \mathrm{Brightness} + 0.751 \times \mathrm{Artifact} - 0.049 \quad (R^2 = 0.593) \tag{25} $$

The squared correlation coefficient ($R^2$) of the predicted results indicates the regression model is statistically significant. The coefficients in the model suggest the weighting that a low-level attribute contributes to the global attribute. The regression model revealed that the spatial and the timbral quality contributed comparably to the overall preference. In addition, spaciousness and artifact are the dominant attributes of the spatial quality and the timbral quality, respectively. This suggests that the top and the second level of auditory attributes may have sufficed for the listening test.

VI. CONCLUDING REMARKS
Figure 17 shows the test results of up/downmixing by using A comprehensive study has been carried out to compare
the home theater loudspeakers. The MANOVA output was various upmixing and downmixing techniques. A 5.1 home
summarized in Table VI. The benchmark method attained theater system and a dual-loudspeaker MP3 handset were used
lower grade than the others in most attributes but timbral. The as the rendering devices. Subjective listening tests were
results showed that the hybrid method and the reverb-based conducted. Conclusions drawn from the results are
upmixing combined with the HRTF-based downmixing summarized as follows.
method were preferred over the others. Surprisingly, however, Among the upmixing methods, the reverb-based technique
the HRTF-CCS-based downmixing did not improve the has attained the best performance in spatial quality as well as
performance over the HRTF-based downmixing, but decrease in overall preference. This also suggests that the reverb-based
somewhat in both spatial and timbral qualities. Timbral method is subjectively more preferred in upmixing over the
quality degraded because CCS processing involved ipsilateral correlation-based methods.
equalization which altered the perceived timbre. The gain in For downmixing techniques applied to widely spaced
crosstalk cancellation seemed to be less than the loss in loudspeakers, the HRTF-based method outperformed the
timbral quality when the CCS is applied to widely spaced standard downmixing method in the spatial quality and overall
loudspeakers. In any rate, the processed signals yielded preference. The combined HRTF-CCS-based downmixing
predominantly better performance than the unprocessed improved only marginally the performance, but the difference
reference, which justifies the necessity of up/downmixing in was not statistically significant. However, for closely spaced
multichannel audio reproduction. loudspeakers, the HRTF-based method was no longer superior
The result of up/downmixing for the MP3 handset is shown to the standard downmixing method in overall preference as in
in Fig 18. The MANOVA output summarized in Table VII the previous test. However, the CCS technique has
indicated a significant difference among the up/downmixing significantly improved the spatial impression. This suggests
approaches. The results follow a similar trend to that of the that the use of CCS is critical to downmixing for closely-
home theater. The HRTF-based downmixing technique for the spaced loudspeakers.
microspeakers did not perform as well as that in the home For combined up/downmixing when applied to widely
theater test due to the crosstalk problem. The improvement on spaced loudspeakers, the hybrid method and the reverb-based
spatial quality seemed to be totally offset by the degradation upmixing combined with the HRTF-based downmixing
of timbral quality. However, addition of CCS to HRTF in method were preferred over the others. In particular, the latter
downmixing seemed to have slightly recovered spatial quality. approach has attained better spatial quality than the former
Similar to the home theater, the signals processed by the approach. For methods using reverb-based upmixing, HRTF
downmixing did have positive effects in spatial quality and
overall preference. However, the HRTF-CCS-based
downmixing did not improve the performance over the HRTF-based
downmixing, but decreased it somewhat in both spatial and
timbral qualities. The gain in crosstalk cancellation seemed to
be less than the loss in timbral quality when the CCS is
applied to widely spaced loudspeakers. In combined use of
up/downmixing for the microspeakers, downmixing using
HRTF did not perform as well as that in the home theater test
due to the crosstalk problem. The improvement in spatial
quality seemed to be totally offset by the degradation of
timbral quality. Under such circumstances, the CCS technique
is crucial to minimize the crosstalk and recover spatial quality.
In addition, the reverb-standard up/downmixing achieved
comparable preference with the reverb-HRTF-CCS
up/downmixing because the former approach performed quite
well in the timbral quality. Therefore, the up/downmixing
method using a reverberator and simple standard downmixing is
preferred to realize virtual surround for handsets if
computation cost is of concern. At any rate, the processed
signals yielded predominantly better performance than the
unprocessed reference, which justifies the necessity of
up/downmixing in multichannel audio reproduction.

ACKNOWLEDGMENT
The work was supported by the National Science Council in
Taiwan, Republic of China, under the project number NSC94-2212-E-009-019.

REFERENCES
[1] Dolby Laboratory, Dolby surround Pro Logic decoder principles of operation, http://www.dolby.com/resources/tech_library/index.cfm
[2] B. Widrow and S. D. Stearns, Adaptive Signal Processing, Prentice-Hall, 1985.
[3] R. Irwan and R. M. Aarts, “Two-to-five channel sound processing,” J. Audio Eng. Soc., Vol. 50, pp. 914-926, 2002.
[4] I. T. Jolliffe, Principal Component Analysis, Springer-Verlag, 2002.
[5] M. R. Bai and G. Bai, “Optimal design and synthesis of reverberators with a fuzzy user interface for spatial audio,” J. Audio Eng. Soc., Vol. 54, pp. 812-825, 2005.
[6] S. Gajjar, “A 3D stereo sound system,” IEEE Colloquium on Audio and Music Technology: The Challenge of Creative DSP, London, pp. 15/1-15/7, 1998.
[7] B. Gardner and K. Martin, “HRTF measurements of KEMAR dummy-head microphone,” MIT Media Lab, 1994, http://sound.media.mit.edu/KEMAR.html
[8] ITU-R BS.775-1, “Multi-channel stereophonic sound system with or without accompanying picture,” International Telecommunications Union, Geneva, Switzerland, 1992–1994.
[9] W. G. Gardner, 3-D Audio Using Loudspeakers, Kluwer Academic Publishers, 1998.
[10] O. Kirkeby, P. A. Nelson, and H. Hamada, “Fast deconvolution of multichannel systems using regularization,” IEEE Trans. Speech and Audio Processing, Vol. 6, pp. 189-195, 1998.
[11] U. Zolzer, DAFX, Digital Audio Effects, John Wiley & Sons, 2002.
[12] P. Heitkamper, “An adaptation control for acoustic echo cancellers,” IEEE Signal Processing Letters, Vol. 4, pp. 170-172, 1997.
[13] R. L. Haupt and S. E. Haupt, Practical Genetic Algorithms, 2nd ed., Wiley, 2004.
[14] ITU-R BS.1116, “Method of subjective assessment of small impairments in audio systems including multichannel sound systems,” International Telecommunications Union, Geneva, Switzerland, 1994.
[15] ITU-R BS.1534-1, “Method for the subjective assessment of intermediate sound quality (MUSHRA),” International Telecommunications Union, Geneva, Switzerland, 2001.
[16] D. C. Howell, Statistical Methods for Psychology, Duxbury, NY, 1997.
[17] M. R. Bai, C. W. Tung, and C. C. Lee, “Optimal design of loudspeaker arrays for robust cross-talk cancellation using the Taguchi method and the genetic algorithm,” J. Acoust. Soc. Am., Vol. 117, pp. 2802-2813, 2005.
[18] M. R. Bai and C. C. Lee, “Objective and subjective analysis of effects of listening span on crosstalk cancellation in spatial sound reproduction,” J. Acoust. Soc. Am., to appear in Sept. 2006.

Mingsian R. Bai was born in 1959 in Taipei, Taiwan,
ROC. He received a bachelor’s degree in Power
Mechanical Engineering from National Tsing-Hwa
University in 1981. He also received a master’s degree in
Business Management from National Chen-Chi University
in 1984. He left Taiwan in 1984 to enter the graduate school
of Iowa State University and later received an M.S. degree
in Mechanical Engineering in 1985 and a Ph.D. in
Engineering Mechanics and Aerospace Engineering in 1989. In 1989, he
joined the Department of Mechanical Engineering of National Chiao-Tung
University in Taiwan as an associate professor and became a professor in
1996. He was also a visiting scholar at the Center of Vibration and Acoustics,
Penn State University, the University of Adelaide, Australia, and the Institute of
Sound and Vibration Research (ISVR), UK, in 1997, 2000, and 2002, respectively.
His current interests encompass acoustics, audio signal processing,
electroacoustic transducers, vibroacoustic diagnostics, active noise and
vibration control, and so forth. He currently serves as an active consultant and
project leader in these areas in industry. He has over 100 published papers
and 13 granted or pending patents. Professor Bai is a member of the Audio
Engineering Society (AES), Acoustical Society of America (ASA), Acoustical
Society of Taiwan, and Vibration and Noise Control Engineering Society in
Taiwan.

Geng-Yu Shih was born in 1982 in Tainan, Taiwan,
ROC. He received a bachelor’s degree in Mechanical
Engineering from the National Central University in
2004. He is currently working on a master’s degree in Mechanical
Engineering at the National Chiao-Tung
University. His master’s thesis is on the implementation of a
3D audio module with up/downmix and CCS for two-channel
stereo loudspeakers.
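
As a supplementary illustration of the regression models (23)-(25) in Section V-E, the following Python sketch evaluates the two-level model for a set of attribute grades. The coefficients are those reported in (23)-(25). The function names, the chaining of (24) and (25) into (23), and any input grades are assumptions made only for this illustration and are not part of the original study.

```python
def spatial_quality(width, depth, spaciousness):
    """Second-level model (24): spatial quality from its low-level attributes."""
    return 0.156 * width + 0.051 * depth + 0.726 * spaciousness + 0.077


def timbral_quality(fullness, brightness, artifact):
    """Second-level model (25): timbral quality from its low-level attributes."""
    return 0.125 * fullness + 0.209 * brightness + 0.751 * artifact - 0.049


def overall_preference(spatial, timbral):
    """Top-level model (23): overall preference from spatial and timbral quality."""
    return 0.545 * spatial + 0.568 * timbral + 0.048


def predict_preference(width, depth, spaciousness, fullness, brightness, artifact):
    """Chain (24) and (25) into (23). The chaining is an assumption of this sketch."""
    spatial = spatial_quality(width, depth, spaciousness)
    timbral = timbral_quality(fullness, brightness, artifact)
    return overall_preference(spatial, timbral)


def chained_weights():
    """Product of top-level and second-level coefficients for each low-level attribute."""
    return {
        "width": 0.545 * 0.156,
        "depth": 0.545 * 0.051,
        "spaciousness": 0.545 * 0.726,
        "fullness": 0.568 * 0.125,
        "brightness": 0.568 * 0.209,
        "artifact": 0.568 * 0.751,
    }


if __name__ == "__main__":
    # Hypothetical grades, chosen only to exercise the functions.
    print(predict_preference(1.0, 1.0, 1.0, 1.0, 1.0, 1.0))
    for name, weight in sorted(chained_weights().items(), key=lambda kv: -kv[1]):
        print(f"{name:13s} {weight:.4f}")
```

Under this chaining assumption, artifact and spaciousness carry the largest products of coefficients, which is consistent with the observation that they are the dominant attributes of the timbral quality and the spatial quality, respectively.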
