Upmixing and Downmixing Two-Channel Stereo Audio For Consumer Electronics
M. R. Bai and G.-Y. Shih
Abstract: In this comprehensive study, algorithms for upmixing, downmixing, and joint up/downmixing are examined and compared. Five upmixing algorithms based on signal decorrelation and reverberation are employed to convert two-channel stereo signals to five-channel signals. For downmixing, methods ranging from mixing with simple gain adjustment to more sophisticated Head Related Transfer Function (HRTF) filtering and Crosstalk Cancellation System (CCS) are utilized to downmix the center channel and the surround channels into the two available frontal loudspeakers. For situations where only two-channel content and loudspeakers are available, a number of up/downmixing schemes are used to simulate a virtual surround environment. The comparison emphasizes two consumer electronic products: a 5.1 home theater system and a dual-loudspeaker MP3 handset. The effect of loudspeaker spacing on rendering performance is examined. Listening tests are conducted to compare the processing methods in terms of three levels of subjective indices. The results are processed by using the multivariate analysis of variance (MANOVA) to justify the statistical significance, followed by a multiple regression analysis to correlate the auditory preference with various timbral and spatial attributes¹.

Index Terms: Virtual surround, upmixing, downmixing, subjective listening test.

¹ The work was supported by the National Science Council in Taiwan, Republic of China, under the project number NSC94-2212-E-009-019. The authors are with the Department of Mechanical Engineering, the National Chiao-Tung University, 1001 Ta-Hsueh Road, Hsin-Chu 300, Taiwan (e-mail: msbai@mail.nctu.edu.tw).
Contributed Paper. Manuscript received May 30, 2007. 0098 3063/07/$20.00 © 2007 IEEE. IEEE Transactions on Consumer Electronics, Vol. 53, No. 3, August 2007.

I. INTRODUCTION
With the growing proliferation of multichannel audio in consumer electronics, there are situations where the number of channels of either the audio content or the reproducing loudspeakers is limited and it is necessary to upmix or downmix channels to improve the listening experience. Yet another situation where combined upmixing and downmixing is necessary is the newly emerging third-generation (3G) handsets fitted with dual loudspeakers. This paper is aimed at a comprehensive study that systematically compares a variety of upmixing and downmixing strategies.
In the upmixing process, surround channels can be created from the two-channel stereo inputs by using two different approaches. One approach uses the 'decorrelated' part of the stereo inputs as the surround channels, whereas the other approach produces the surround channels by simulating the reverberant sound field in the background. In this paper, five techniques including a passive surround decoder method [1], a Least-Mean-Square (LMS)-based method [2], an adaptive panning method [3], a Principal Component Analysis (PCA)-based method [4], and an artificial reverberator-based method [5] are presented.
Contrary to the upmixing process, the downmixing process is employed to produce a decreased number of channels for practical reasons such as the availability of loudspeakers. Without loss of generality, this study is focused on remixing 5.1 audio inputs into two-channel signals. Downmixing can be accomplished by simple mixing or by more complex filtering with Head Related Transfer Functions (HRTFs), as in the Sound Retrieval System (SRS) 3D stereo sound system [6]. An HRTF is a mathematical model representing the propagation process from a sound source to the human ears [7]. Another well-known problem in reproduction using loudspeakers is that the crosstalk of the contralateral paths from the loudspeakers to the listener's ears can adversely affect source localization. A solution to this problem is to minimize crosstalk using a Crosstalk Cancellation System (CCS) derived from inverse filter design. In this paper, downmixing techniques ranging from mixing with simple gain adjustment [8] to more sophisticated HRTF filtering [9] and CCS-based processing [10] shall be examined.
One last situation that may call for the combined use of upmixing and downmixing is when the audio inputs and the reproducing loudspeakers are both of the two-channel stereo configuration. The interactions between upmixing and downmixing are also investigated in the paper.
Subjective tests are carried out to assess the performance of each processing method. The experimental results are processed by using the multivariate analysis of variance (MANOVA) to justify the statistical significance. In addition, a multiple regression model is employed to correlate global auditory preference with low-level attributes. In light of these comprehensive tests, it is hoped that viable upmixing and downmixing techniques can be found to cater for practical multichannel audio reproduction.

II. UPMIXING ALGORITHMS
In this section, five upmixing algorithms are introduced. These methods differ in how the additional channels are derived from the two stereo channels. The general architecture of the direct-ambient upmixing approach for creating the additional channels is shown in Fig. 1. Common to all methods, the Center (C) channel is created by filtering the correlated input using a 128-tap FIR bandpass filter with cut-off frequencies of 100 Hz and 4 kHz to emphasize voice and dialog. The Rear Left (RL)
and the Rear Right (RR) channels are intended to provide ambience and envelopment in the background, where a 15 ms delay is added to the rear channels to fulfill the precedence effect. High-frequency absorption is simulated by filtering the rear channels with a 128-tap FIR lowpass filter with a 7 kHz cut-off. In addition, the rear channels are 180 degrees out of phase with one another, which can increase the spaciousness of the ambient field [11]. The Low Frequency Enhancement (LFE) channel is derived from the center channel with a 128-tap FIR lowpass filter to retain the signals below 120 Hz.
Five upmixing techniques are presented as follows.

where μ is a constant step size that dictates the convergence behavior. The left channel and the right channel of the original stereo channels are taken as the desired signal and the input of the FIR filter, respectively. The output y(n) representing the correlated part is used as the center channel, whereas the error e(n) representing the uncorrelated part is used to produce two surround channels in upmixing. A special implementation of the LMS algorithm called Normalized LMS (NLMS) takes into account the variation in the input signals and selects a step size normalized with the input power. The weight update equation is modified into

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \frac{\tilde{\mu}}{\mathbf{x}^{T}(n)\mathbf{x}(n) + \psi}\, e(n)\, \mathbf{x}(n), \qquad (6)$$

where $\tilde{\mu}$ and $\psi$ are positive constants.
On some occasions, the adaptive algorithm diverges due to the low correlation level between the signals of the front channels when they are dominated by diffuse components. To deal with this, a correlation-based method can be used, where the step size is selected according to a modified correlation coefficient [12]:

$$\rho(n) = \frac{\left| \sum_{i=0}^{M-1} x(n-i)\, d(n-i) \right|}{\sum_{i=0}^{M-1} \left| x(n-i)\, d(n-i) \right|}, \qquad (7)$$

$$\begin{aligned} w_L(n) &= w_L(n-1) + \mu\, y(n-1)\left[ x_L(n-1) - w_L(n-1)\, y(n-1) \right] \\ w_R(n) &= w_R(n-1) + \mu\, y(n-1)\left[ x_R(n-1) - w_R(n-1)\, y(n-1) \right] \end{aligned} \qquad (8)$$

Two signals, y(n) and q(n), are then obtained by panning the stereo signals $x_L(n)$ and $x_R(n)$ with the weights found above.

$$y(n) = w_L(n)\, x_L(n) + w_R(n)\, x_R(n) \qquad (9)$$
$$q(n) = w_R(n)\, x_L(n) - w_L(n)\, x_R(n) \qquad (10)$$
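As a rough illustration of the weight update (8) and the panning equations (9) and (10), the following sketch processes two sample sequences in plain Python. The function name `adaptive_pan`, the step size of 0.05, and the initial weights (1, 0) are assumptions made for this example only; they are not values given in the paper.

```python
import math


def adaptive_pan(x_left, x_right, mu=0.05, w_left=1.0, w_right=0.0):
    """Sketch of Eqs. (8)-(10): adapt panning weights, then pan the stereo pair.

    mu and the initial weights are illustrative assumptions.
    Returns the y(n) sequence, the q(n) sequence, and the final weights.
    """
    y_prev = xl_prev = xr_prev = 0.0  # y(n-1), x_L(n-1), x_R(n-1)
    y_out, q_out = [], []
    for xl, xr in zip(x_left, x_right):
        # Eq. (8): both weights are updated from the previous sample.
        new_wl = w_left + mu * y_prev * (xl_prev - w_left * y_prev)
        new_wr = w_right + mu * y_prev * (xr_prev - w_right * y_prev)
        w_left, w_right = new_wl, new_wr
        y = w_left * xl + w_right * xr  # Eq. (9)
        q = w_right * xl - w_left * xr  # Eq. (10)
        y_out.append(y)
        q_out.append(q)
        y_prev, xl_prev, xr_prev = y, xl, xr
    return y_out, q_out, (w_left, w_right)


if __name__ == "__main__":
    # Identical left and right inputs (a fully correlated test signal).
    tone = [math.sin(0.1 * n) for n in range(5000)]
    _, q, weights = adaptive_pan(tone, tone)
    print(weights, q[-1])
```

For identical left and right inputs the weights settle at equal values with unit norm, so q(n) tends to zero and all of the signal ends up in y(n).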
Fig. 6. The interpolation procedure of the weights in the PCA-based upmixing technique. (a) Frame and sub-frame structure. (b) The correspondence of the weights and the left channel signal.

where $x_L(n)$ and $x_R(n)$ are the left and right signals at the time instant n, and $\bar{x}_L$ and $\bar{x}_R$ represent the means of the left and the right signals, respectively. The symmetric covariance matrix A is guaranteed to have two orthonormal eigenvectors. Let the eigenvectors associated with the larger and the smaller eigenvalues be $(C_L, C_R)$ and $(S_L, S_R)$, respectively. The center and the surround channels, presumably the correlated and the uncorrelated portions, can be derived by mixing the input signals according to the eigenvectors.

$$\text{Center} = C_L\, x_L(n) + C_R\, x_R(n) \qquad (13)$$
$$\text{Surround} = S_L\, x_L(n) + S_R\, x_R(n) \qquad (14)$$
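The eigenvector mixing of (13) and (14) can be sketched for a single frame as follows. This is only an illustration: the covariance matrix A is assumed here to be the usual sample covariance of the mean-removed left and right signals (its defining equations are not reproduced in this excerpt), the closed-form angle is the standard result for a symmetric 2x2 matrix, and the frame-wise weight interpolation of Fig. 6 is omitted. The name `pca_upmix` is hypothetical.

```python
import math


def pca_upmix(x_left, x_right):
    """One-frame sketch of Eqs. (13)-(14) using a 2x2 sample covariance (assumed)."""
    n = len(x_left)
    mean_l = sum(x_left) / n
    mean_r = sum(x_right) / n
    # Entries of the symmetric covariance matrix A = [[a_ll, a_lr], [a_lr, a_rr]].
    a_ll = sum((l - mean_l) ** 2 for l in x_left) / n
    a_rr = sum((r - mean_r) ** 2 for r in x_right) / n
    a_lr = sum((l - mean_l) * (r - mean_r) for l, r in zip(x_left, x_right)) / n
    # Principal-axis angle of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * a_lr, a_ll - a_rr)
    c_l, c_r = math.cos(theta), math.sin(theta)   # eigenvector of the larger eigenvalue
    s_l, s_r = -math.sin(theta), math.cos(theta)  # eigenvector of the smaller eigenvalue
    center = [c_l * l + c_r * r for l, r in zip(x_left, x_right)]    # Eq. (13)
    surround = [s_l * l + s_r * r for l, r in zip(x_left, x_right)]  # Eq. (14)
    return center, surround, (c_l, c_r), (s_l, s_r)


if __name__ == "__main__":
    left = [math.sin(0.1 * n) for n in range(1000)]
    right = [0.5 * v for v in left]  # fully correlated, panned toward the left
    center, surround, c_vec, s_vec = pca_upmix(left, right)
    print(c_vec, max(abs(v) for v in surround))
```

When the right channel is a scaled copy of the left, the principal eigenvector points along the panning direction and the surround output vanishes, which matches the reading of the surround channel as the uncorrelated portion.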
III. DOWNMIXING ALGORITHMS
In many applications such as personal computer (PC) multimedia systems and portable audio products, only two-channel stereo loudspeakers are available. Thus, downmixing is necessary to convert multichannel audio content into two channels for loudspeaker presentation. In what follows, two downmixing techniques will be presented.
A. The Standard Downmixing Method
The standard ITU-R BS.775-1 [8] details how to downmix multichannel signals with simple gain adjustment. The architecture of this standard downmixing technique is shown in Fig. 8. That is,

$$\begin{aligned} L &= FL + 0.71 \times C + 0.71 \times RL \\ R &= FR + 0.71 \times C + 0.71 \times RR \end{aligned} \qquad (17)$$

Fig. 9. The architecture of the Shuffler filter.

where $\mathbf{H}^{H}(e^{j\omega})$ is the Hermitian transpose of the acoustical system $\mathbf{H}(e^{j\omega})$ and β is the regularization parameter. The coefficients of the CCS filters can be obtained by applying the inverse FFT and circular shifts to the frequency response functions.
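The CCS design step that uses the Hermitian transpose of H, the regularization parameter β, an inverse FFT, and a circular shift can be sketched as follows. The inverse-filter equation itself is not reproduced in this excerpt, so the sketch assumes the regularized least-squares form of fast deconvolution described in [10], C = (H^H H + βI)^(-1) H^H, evaluated per frequency bin for a 2x2 system. The function names, the naive inverse DFT (used instead of an FFT to stay within the standard library), and the half-length circular shift are illustrative assumptions.

```python
import cmath
import math


def regularized_inverse_2x2(h, beta):
    """Assumed form C = (H^H H + beta*I)^-1 H^H for one bin; h = ((h11, h12), (h21, h22))."""
    (h11, h12), (h21, h22) = h
    # Hermitian transpose G = H^H.
    g11, g12 = h11.conjugate(), h21.conjugate()
    g21, g22 = h12.conjugate(), h22.conjugate()
    # A = H^H H + beta * I.
    a11 = g11 * h11 + g12 * h21 + beta
    a12 = g11 * h12 + g12 * h22
    a21 = g21 * h11 + g22 * h21
    a22 = g21 * h12 + g22 * h22 + beta
    det = a11 * a22 - a12 * a21
    i11, i12, i21, i22 = a22 / det, -a12 / det, -a21 / det, a11 / det
    # C = A^-1 H^H.
    return ((i11 * g11 + i12 * g21, i11 * g12 + i12 * g22),
            (i21 * g11 + i22 * g21, i21 * g12 + i22 * g22))


def inverse_dft(spectrum):
    """Naive inverse DFT; the real part is kept (assumes a conjugate-symmetric spectrum)."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * m / n) for k in range(n)).real / n
            for m in range(n)]


def ccs_filters(h_bins, beta):
    """Return a 2x2 nested list of FIR tap lists, one impulse response per CCS path."""
    n = len(h_bins)
    c_bins = [regularized_inverse_2x2(h, beta) for h in h_bins]
    shift = n // 2  # circular shift so that the response is centred (assumed amount)
    filters = []
    for row in range(2):
        filters.append([])
        for col in range(2):
            taps = inverse_dft([c[row][col] for c in c_bins])
            filters[row].append(taps[-shift:] + taps[:-shift])
    return filters


if __name__ == "__main__":
    # Trivial plant: no crosstalk, unit direct paths, 8 frequency bins.
    identity_bins = [((1 + 0j, 0j), (0j, 1 + 0j))] * 8
    print(ccs_filters(identity_bins, beta=0.0)[0][0])
```

With β set to zero the result reduces to a plain matrix inverse per bin; a positive β trades cancellation accuracy for bounded filter gain.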
The loudspeakers were deployed according to ITU-R BS.775 [8], as shown in Fig. 11(a). On the other hand, a dual-loudspeaker MP3 handset was adopted in another subjective test, as shown in Fig. 11(b). The listening tests were conducted in compliance with the requirements of ITU-R BS.1116 [14]. As required in the CCS design, the binaural transfer functions from the loudspeakers to the microphones embedded in the ears of a KEMAR were measured in an anechoic chamber by using a spectrum analyzer.

Fig. 11. The experimental setup in the listening room. (a) The standard 5.1 configuration for multichannel loudspeaker reproduction. (b) The MP3 handset equipped with dual loudspeakers.

2) Procedure of Listening Test
A modified double-blind Multi-Stimulus test with Hidden Reference and a hidden Anchor (MUSHRA) (ITU-R BS.1534 [15]) was employed as the basis of the experimental design. The original unprocessed signal was used as the hidden reference in the test. The hidden anchor employed in this test was the 'phantom mono' reproduction that broadcasts the same signal over two-channel loudspeakers. The upmixing and downmixing algorithms were implemented on the platform of a fixed-point digital signal processor (DSP) equipped with a 6-input and 6-output codec operating at 48 kHz. The subjects were allowed to switch between different stimuli at their discretion and grade the presentations. The loudness of each reproduced signal was adjusted to an equal level by a group of five skilled subjects to minimize experimental errors due to loudness variation. Three-leveled hierarchical subjective indices shown in Fig. 12 were employed to assess the performance of the upmixing and downmixing techniques.

Fig. 12. Three-leveled hierarchical subjective indices employed in the listening test to assess the performance of the upmixing and downmixing techniques.

In the lowest hierarchy, the width, depth, and spaciousness refer to the perceived angular width, the depth of the sound image, and the ambience, envelopment, and sensation of space pertaining to the listening environment. The fullness and brightness refer to the dominant low and high frequency content, respectively. The index, artifact, refers to any linear or nonlinear distortions. In the second hierarchy, the spatial quality and timbral quality describe the general performance of the spatial and timbral characteristics, respectively. In the top hierarchy, the overall preference represents the global impression of the reproduced program. The grading scale used for the subjective tests is shown in Table I.

TABLE I
THE GRADING SCALE USED IN THE MUSHRA PROCESS.
Performance        Grade
Much better          3
Better               2
Slightly better      1
About the same       0
Slightly worse      -1
Worse               -2
Much worse          -3

3) Test Subjects
Thirty listeners who are experienced in audio evaluation took part in the listening tests. A training phase was arranged for the listeners prior to the formal test such that the listeners were thoroughly familiarized with the test facilities, the environment, and the grading process. Different items of lowpass processing, highpass processing, phantom mono, and multichannel 5.1 mode of presentation were demonstrated in the training phase.
4) Data Analysis
The results of the tests were processed by using the MANOVA. Both the mean and the 95% confidence intervals of the grades were shown in the analysis results. Cases with p below 0.05 indicate that the difference among methods is statistically significant. The scores obtained from the hidden reference and anchor were only intended for validating the consistency of subjects, and were excluded from the MANOVA, as is usually done in practice. Three statistical assumptions of MANOVA, namely independence of grading, normal distribution of scores, and homogeneity of variances (HOV), were verified in these experiments. Independence of grading was fulfilled by the randomization of the experimental factors. The assumptions of normality and HOV were examined by using Shapiro-Wilk's W test and Levene's test, respectively [16]. Furthermore, a multiple regression model was employed to correlate the global auditory attributes with the low-level attributes in the listening test.

TABLE II
THE STATISTICAL ANALYSIS RESULTS OF THE LMS-BASED UPMIXING TEST FOR THE HOME THEATER LOUDSPEAKERS.
Dependent Variable    Type III SS   df   MS         F          p
Width                 0.545455      2    0.272727   0.387931   0.681813
Depth                 0.242424      2    0.121212   0.322581   0.726759
Spaciousness          0.424242      2    0.212121   0.472973   0.627715
Fullness              0.181818      2    0.090909   0.111940   0.894469
Brightness            0.000000      2    0.000000   0.000000   1.000000
Artifact              0.000000      2    0.000000   0.000000   1.000000
Spatial quality       0.424242      2    0.212121   0.813953   0.452649
Timbral quality       0.000000      2    0.000000   0.000000   1.000000
Overall preference    0.424242      2    0.212121   0.372340   0.692260
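The mean and 95% confidence interval reported for the grades could be computed along the following lines. This is a minimal sketch rather than the authors' procedure: the grades in the usage example are hypothetical values on the -3 to 3 scale of Table I, and the critical value is passed in by the caller (2.776 is the two-sided 95% Student-t value for 4 degrees of freedom; for thirty listeners the corresponding value with 29 degrees of freedom is about 2.045).

```python
import math
import statistics


def mean_and_ci(grades, t_crit):
    """Return the sample mean and a (low, high) confidence interval for it.

    t_crit is the two-sided critical value chosen by the caller for len(grades) - 1
    degrees of freedom; it is an input here, not something taken from the paper.
    """
    mean = statistics.mean(grades)
    standard_error = statistics.stdev(grades) / math.sqrt(len(grades))
    half_width = t_crit * standard_error
    return mean, (mean - half_width, mean + half_width)


if __name__ == "__main__":
    hypothetical_grades = [1, 2, 2, 3, 2]  # illustrative only, not data from the study
    print(mean_and_ci(hypothetical_grades, t_crit=2.776))
```

Non-overlapping intervals between two methods are a quick visual cue, but the significance statements in the paper rest on the MANOVA p-values.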
Fig. 16. Listening test results of three downmixing techniques for the MP3 handset, with the mean and 95% confidence intervals of scores indicated on the plot. (Pha: phantom mono reproduction, Std.(H.R.): standard downmixing used as the hidden reference, HRTF: HRTF-based method, H.C.: HRTF-CCS-based method).

TABLE IV
THE STATISTICAL ANALYSIS RESULTS OF THE DOWNMIXING TEST FOR THE HOME THEATER LOUDSPEAKERS.
Dependent Variable    Type III SS   df   MS         F          p
Width                 16.22222      2    8.11111    4.17685    0.024154
Depth                 5.05556       2    2.52778    2.59326    0.089942
Spaciousness          11.55556      2    5.77778    5.66337    0.007682
Fullness              4.66667       2    2.33333    2.53846    0.094310
Brightness            39.38889      2    19.69444   25.40391   0.000000
Artifact              6.16667       2    3.08333    6.97714    0.002971
Spatial quality       9.55556       2    4.77778    4.45176    0.019424
Timbral quality       2.00000       2    1.00000    1.65000    0.207501
Overall preference    5.72222       2    2.86111    4.30798    0.021762

Fig. 17. Listening test results of up/downmixing techniques for the home theater loudspeakers. (Pha: phantom mono reproduction, Pa.Std.: passive surround decoder upmixing + standard downmixing, Hybrid: the hybrid method, Re.Std.: reverb-based upmixing + standard downmixing, Re.HRTF: reverb-based upmixing + HRTF-based downmixing, Re.H.C.: reverb-based upmixing + HRTF-CCS-based downmixing, H.R.: hidden reference).

D. Evaluation of Up/Downmixing Algorithms
Another listening test was conducted to evaluate the combined upmixing and downmixing techniques. In addition to the above-mentioned hybrid method, the procedure that integrates the passive surround decoder for upmixing and the standard method for downmixing was included in the test as a benchmark. Also, the reverb-based upmixing approach that achieved the best performance in the preceding tests was integrated with various downmixing methods, including the standard downmixing method, the HRTF-based method, and the HRTF-CCS-based method. The front right and the front left loudspeakers of the home theater system were used as the rendering devices in the home theater test, whereas the two microspeakers in the handset were used as the rendering devices in the handset test.

TABLE V
THE STATISTICAL ANALYSIS RESULTS OF THE DOWNMIXING TEST FOR THE MP3 HANDSET.
Dependent Variable    Type III SS   df   MS         F          p
Width                 27.15152      2    13.57576   32.94118   0.000000
Depth                 1.63636       2    0.81818    1.64634    0.209694
Spaciousness          24.42424      2    12.21212   51.66667   0.000000
Fullness              0.42424       2    0.21212    0.48611    0.619774
Brightness            12.78788      2    6.39394    6.67722    0.003993
Artifact              5.09091       2    2.54545    4.11765    0.026294
Spatial quality       18.72727      2    9.36364    30.29412   0.000000
Timbral quality       1.87879       2    0.93939    1.01974    0.372853
Overall preference    9.51515       2    4.75758    5.13072    0.012119
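The columns of Table V are related by simple arithmetic that can be checked directly: each mean square is the sum of squares divided by its degrees of freedom, and, since F is the ratio of that mean square to the error mean square, the latter can be recovered as MS / F. The short sketch below applies these checks to the Table V values and lists the attributes whose p-value falls below the 0.05 threshold used in the paper. The function names are hypothetical, and the error mean square is a derived quantity that is not reported in the table.

```python
# Rows of Table V: attribute -> (Type III SS, df, MS, F, p).
TABLE_V = {
    "Width": (27.15152, 2, 13.57576, 32.94118, 0.000000),
    "Depth": (1.63636, 2, 0.81818, 1.64634, 0.209694),
    "Spaciousness": (24.42424, 2, 12.21212, 51.66667, 0.000000),
    "Fullness": (0.42424, 2, 0.21212, 0.48611, 0.619774),
    "Brightness": (12.78788, 2, 6.39394, 6.67722, 0.003993),
    "Artifact": (5.09091, 2, 2.54545, 4.11765, 0.026294),
    "Spatial quality": (18.72727, 2, 9.36364, 30.29412, 0.000000),
    "Timbral quality": (1.87879, 2, 0.93939, 1.01974, 0.372853),
    "Overall preference": (9.51515, 2, 4.75758, 5.13072, 0.012119),
}


def mean_squares_consistent(table, tol=1e-4):
    """Check MS == SS / df for every row, within rounding tolerance."""
    return all(abs(ss / df - ms) <= tol for ss, df, ms, _, _ in table.values())


def error_mean_square(row):
    """Recover the error mean square implied by F = MS / MS_error."""
    _, _, ms, f_value, _ = row
    return ms / f_value


def significant_attributes(table, alpha=0.05):
    """Attributes for which the difference among methods is significant at alpha."""
    return [name for name, row in table.items() if row[4] < alpha]


if __name__ == "__main__":
    print(mean_squares_consistent(TABLE_V))
    print(significant_attributes(TABLE_V))
```

Under this rule, depth, fullness, and timbral quality are the attributes for which the handset downmixing methods do not differ significantly.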
downmixing did have positive effects on spatial quality and overall preference. However, the HRTF-CCS-based downmixing did not improve the performance over the HRTF-based downmixing, and in fact decreased somewhat in both spatial and timbral qualities. The gain in crosstalk cancellation seemed to be less than the loss in timbral quality when the CCS is applied to widely spaced loudspeakers. In the combined use of up/downmixing for the microspeakers, downmixing using HRTF did not perform as well as in the home theater test due to the crosstalk problem. The improvement in spatial quality seemed to be totally offset by the degradation of timbral quality. Under such circumstances, the CCS technique is crucial to minimize the crosstalk and recover spatial quality. In addition, the reverb-standard up/downmixing achieved a preference comparable to that of the reverb-HRTF-CCS up/downmixing because the former approach performed quite well in timbral quality. Therefore, the up/downmixing method using a reverberator and simple standard downmixing is preferred for realizing virtual surround on handsets if computation cost is of concern. At any rate, the processed signals yielded predominantly better performance than the unprocessed reference, which justifies the necessity of up/downmixing in multichannel audio reproduction.

ACKNOWLEDGMENT
The work was supported by the National Science Council in Taiwan, Republic of China, under the project number NSC94-2212-E-009-019.

REFERENCES
[1] Dolby Laboratory, Dolby surround Pro Logic decoder principles of operation, http://www.dolby.com/resources/tech_library/index.cfm
[2] B. Widrow and S. D. Stearns, Adaptive Signal Processing, Prentice-Hall, 1985.
[3] R. Irwan and R. M. Aarts, "Two-to-five channel sound processing," J. Audio Eng. Soc., Vol. 50, pp. 914-926, 2002.
[4] I. T. Jolliffe, Principal Component Analysis, Springer-Verlag, 2002.
[5] M. R. Bai and G. Bai, "Optimal design and synthesis of reverberators with a fuzzy user interface for spatial audio," J. Audio Eng. Soc., Vol. 54, pp. 812-825, 2005.
[6] S. Gajjar, "A 3D stereo sound system," IEEE Colloquium on Audio and Music Technology: The Challenge of Creative DSP, London, pp. 15/1-15/7, 1998.
[7] B. Gardner and K. Martin, "HRTF measurements of KEMAR dummy-head microphone," MIT Media Lab, 1994, http://sound.media.mit.edu/KEMAR.html
[8] ITU-R BS.775-1, "Multi-channel stereophonic sound system with or without accompanying picture," International Telecommunications Union, Geneva, Switzerland, 1992–1994.
[9] W. G. Gardner, 3-D Audio Using Loudspeakers, Kluwer Academic Publishers, 1998.
[10] O. Kirkeby, P. A. Nelson, and H. Hamada, "Fast deconvolution of multichannel systems using regularization," IEEE Trans. Speech and Audio Processing, Vol. 6, pp. 189-195, 1998.
[11] U. Zölzer, DAFX, Digital Audio Effects, John Wiley & Sons, 2002.
[12] P. Heitkämper, "An adaptation control for acoustic echo cancellers," IEEE Signal Processing Letters, Vol. 4, pp. 170-172, 1997.
[13] R. L. Haupt and S. E. Haupt, Practical Genetic Algorithms, 2nd ed., Wiley, 2004.
[14] ITU-R BS.1116, "Method of subjective assessment of small impairments in audio systems including multichannel sound systems," International Telecommunications Union, Geneva, Switzerland, 1994.
[15] ITU-R BS.1534-1, "Method for the subjective assessment of intermediate sound quality (MUSHRA)," International Telecommunications Union, Geneva, Switzerland, 2001.
[16] D. C. Howell, Statistical Methods for Psychology, Duxbury, NY, 1997.
[17] M. R. Bai, C. W. Tung, and C. C. Lee, "Optimal design of loudspeaker arrays for robust cross-talk cancellation using the Taguchi method and the genetic algorithm," J. Acoust. Soc. Am., Vol. 117, pp. 2802-2813, 2005.
[18] M. R. Bai and C. C. Lee, "Objective and subjective analysis of effects of listening span on crosstalk cancellation in spatial sound reproduction," J. Acoust. Soc. Am., to appear in Sept. 2006.

Mingsian R. Bai was born in 1959 in Taipei, Taiwan, ROC. He received a bachelor's degree in Power Mechanical Engineering from National Tsing-Hwa University in 1981. He also received a master's degree in Business Management from National Chen-Chi University in 1984. He left Taiwan in 1984 to enter the graduate school of Iowa State University and later received an MS degree in Mechanical Engineering in 1985 and a Ph.D. in Engineering Mechanics and Aerospace Engineering in 1989. In 1989, he joined the Department of Mechanical Engineering of National Chiao-Tung University in Taiwan as an associate professor and became a professor in 1996. He was also a visiting scholar at the Center of Vibration and Acoustics, Penn State University, the University of Adelaide, Australia, and the Institute of Sound and Vibration Research (ISVR), UK, in 1997, 2000, and 2002, respectively. His current interests encompass acoustics, audio signal processing, electroacoustic transducers, vibroacoustic diagnostics, active noise and vibration control, and so forth. He currently serves as an active consultant and project leader in these areas in industry. He has over 100 published papers and 13 granted or pending patents. Professor Bai is a member of the Audio Engineering Society (AES), the Acoustical Society of America (ASA), the Acoustical Society of Taiwan, and the Vibration and Noise Control Engineering Society in Taiwan.

Geng-Yu Shih was born in 1982 in Tainan, Taiwan, ROC. He received a bachelor's degree in Mechanical Engineering from the National Central University in 2004. He is currently working toward a master's degree in Mechanical Engineering at the National Chiao-Tung University. His master's thesis is on the implementation of a 3D audio module with up/downmix and CCS for two-channel stereo loudspeakers.