
Multi-channel goes mobile: MPEG surround binaural rendering

2006

Surround sound is on the verge of broad adoption in consumers' homes, in digital broadcasting and even in Internet services. The currently developed MPEG Surround technology offers bitrate-efficient and mono/stereo-compatible transmission of high-quality multi-channel audio. This enables multi-channel services both for applications where mono or stereo backwards compatibility is required and for applications with severely bandwidth-limited distribution channels. This paper outlines a significant addition to the MPEG Surround specification which enables computationally efficient decoding of MPEG Surround data into binaural stereo, as is appropriate for appealing surround sound reproduction on mobile devices such as cellular phones. The paper describes the basics of the underlying MPEG Surround architecture, the binaural decoding process, and subjective testing results.

JEROEN BREEBAART (1), JÜRGEN HERRE (2), LARS VILLEMOES (3), CRAIG JIN (4), KRISTOFER KJÖRLING (3), JAN PLOGSTIES (2) AND JEROEN KOPPENS (5)

(1) Philips Research Laboratories, 5656 AA Eindhoven, The Netherlands, jeroen.breebaart@philips.com
(2) Fraunhofer Institute for Integrated Circuits IIS, 91058 Erlangen, Germany, {hrr;pts}@iis.fraunhofer.de
(3) Coding Technologies, 11352 Stockholm, Sweden, {lv;kk}@codingtechnologies.com
(4) Vast Audio, NSW 1430 Sydney, Australia, craig@ee.usyd.edu.au
(5) Philips Applied Technologies, 5616 LW Eindhoven, The Netherlands, jeroen.koppens@philips.com

INTRODUCTION

Approximately half a century after the broad availability of two-channel stereophony, multi-channel sound is finally on its way into consumers' homes as the next step towards higher spatial reproduction quality. While the majority of multi-channel audio consumption is still in the context of movie sound, consumer media for high-quality multi-channel audio (such as SACD and DVD-Audio) now respond to the demand for a compelling surround experience also in the audio-only market. Many existing distribution channels will be upgraded to multi-channel capability over the coming years if two key requirements can be met:

a) As with the previous transition from mono to stereophonic transmission, the plethora of existing (now stereo) users must continue to receive high-quality service, and
b) For digital distribution channels with substantially limited channel capacity (e.g. digital audio broadcasting), the introduction of multi-channel sound must not come at a significant price in terms of additional data rate.

This paper reports on the forthcoming MPEG Surround specification, which offers an efficient representation of high-quality multi-channel audio at bitrates only slightly higher than the rates currently used for coding of mono/stereo sound. Due to the underlying principle, the format is also completely backward compatible with legacy (mono or stereo) decoders and is thus ideally suited to introduce multi-channel sound into existing stereophonic or monophonic media and services. Specifically, the paper reports on a recent addition to the MPEG Surround specification which complements the original loudspeaker-oriented multi-channel decoding procedure by modes that enable a compelling surround sound reproduction on mobile devices. Using these binaural rendering modes, multi-channel audio can be rendered into a realistic virtual sound experience on a wide range of existing mobile devices, such as mp3 players and cellular phones.
Due to the nature of the task, MPEG Surround binaural rendering represents an intimate merge between binaural technologies, as they are known from virtual sound displays [1][2], and the parametric multi-channel audio coding that is at the heart of the MPEG Surround scheme. Thus, the paper is structured as follows: Sections 1 and 2 describe the basic concepts behind classic binaural technology and the MPEG Surround specification, respectively. Section 3 then introduces the new MPEG Surround binaural rendering modes as a synthesis of both worlds and characterizes them in terms of both sonic performance and implementation complexity. Finally, a number of interesting applications for MPEG Surround binaural rendering are discussed.

1 BINAURAL RENDERING

Spatial hearing relies to a great extent on binaural cues such as time, level and spectral differences between the left- and right-ear signals [3]. These cues are contained in the acoustic transfer function from a point in space to the ear canal of a listener, called a head-related transfer function (HRTF). HRTFs are measured under anechoic conditions on human or artificial heads with small microphones in the ear canal. They are strongly dependent on direction, but also on head and ear shape [4]. If acoustic transfer functions are measured in an echoic room, i.e. in the presence of reflections and reverberation, they are referred to as binaural room transfer functions (BRTFs).

The well-known concept of binaural rendering uses the knowledge of transfer functions between sound sources and the listener's ear signals to create virtual sound sources placed around the listener. This is done by convolving a signal with a pair of HRTFs or BRTFs to produce ear signals as they would have resulted in a real acoustic environment. Such signals are typically reproduced via headphones. Alternatively, cross-talk cancelled loudspeakers can be used [5]. Typical applications of binaural rendering are auditory virtual displays, gaming and other immersive environments.

The input signals for binaural rendering are monophonic sounds to be spatialized. Most existing audio content, however, is produced for loudspeaker reproduction. In order to render such material for headphone reproduction, each loudspeaker can be represented by a virtual source placed at a defined location. Each loudspeaker signal is filtered by a pair of HRTFs or BRTFs corresponding to this location. Finally, the filtered output signals for each ear are summed to form the headphone output channel.

Figure 1: Decoding and binaural rendering of multi-channel signals.

Figure 1 depicts the straightforward process of decoding and binaural rendering of a discrete multi-channel signal. First, the audio bitstream is decoded to N channels. In the subsequent binaural rendering stage, each loudspeaker signal is rendered for reproduction via two ear signals, yielding a total of 2xN filters. Depending on the number of channels and the length of the filters, this process can be demanding in terms of both computational complexity and memory usage.

Binaural rendering has several benefits for the user. Since the important cues for spatial hearing are conveyed, the user is able to localize sounds in direction and distance and to perceive envelopment. Sounds appear to originate somewhere outside the listener's head, as opposed to the in-head localization that occurs with conventional stereo headphone reproduction.
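To make the 2xN-filter structure of Figure 1 concrete, the following sketch renders a decoded multi-channel signal to two ear signals by direct convolution. It is a minimal illustration of the conventional approach, not MPEG Surround code; the array shapes and function name are our own assumptions, and time-domain head-related impulse responses (HRIRs) are assumed to be available.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(channels, hrirs):
    """Conventional binaural rendering (cf. Figure 1).

    channels : (N, S) array, one decoded loudspeaker signal per row
    hrirs    : (N, 2, T) array, a left/right head-related impulse
               response pair per virtual loudspeaker position
    Returns a (2, S+T-1) array of ear signals.
    """
    n_ch, n_smp = channels.shape
    n_tap = hrirs.shape[2]
    ears = np.zeros((2, n_smp + n_tap - 1))
    for ch in range(n_ch):          # 2xN convolutions in total
        for ear in range(2):
            ears[ear] += fftconvolve(channels[ch], hrirs[ch, ear])
    return ears
```

The cost grows linearly with the number of channels N and the filter length T, which is exactly the burden that the parameter-based rendering of Section 3 avoids.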
The quality of binaural rendering is mostly determined by localization performance, front-back discrimination, externalization and perceived sound coloration. Some studies show that there is a benefit in using individualized HRTFs for binaural rendering, i.e. using the listener's own HRTFs. However, careful selection of HRTFs allows good localization performance for many subjects [6]. In addition, tracking the listener's head and updating the HRTF filtering accordingly can further reduce localization errors and create a highly realistic virtual auditory environment [7]. Another kind of interaction is the modification of the position of the virtual sources by the user. The aforementioned scenarios require a defined interface such that different HRTFs/BRTFs can be applied to the binaural rendering process.

2 MPEG SURROUND TECHNOLOGY

This section provides a brief description of the MPEG Surround technology. First, the basics of MPEG Surround will be discussed. Afterwards, an introduction to the decoder structure will be given. The section concludes with an outline of a selection of important MPEG Surround features.

2.1 Spatial Audio Coding concept

MPEG Surround is based on a principle called Spatial Audio Coding (SAC). Spatial Audio Coding is a multi-channel compression technique that exploits perceptual inter-channel irrelevance in multi-channel audio signals to achieve higher compression rates. This irrelevance can be captured in terms of spatial cues, i.e. parameters describing the spatial image of a multi-channel audio signal. Spatial cues typically include level/intensity differences, phase differences and measures of correlation/coherence between channels, and can be represented in an extremely compact way.

During encoding, spatial cues are extracted from the multi-channel audio signal and a downmix is generated. Typically, backwards-compatible downmix signals such as mono or stereophonic signals will be used. However, any number of channels that is smaller than that of the original audio can be used for the downmix. In the remainder of this section a stereophonic downmix will be assumed. The downmix can then be compressed and transmitted without the need to update existing coders and infrastructures. The spatial cues (spatial side information) are transmitted in a low-bitrate side channel, e.g. the ancillary data portion of the downmix bitstream.

For most audio productions, both a stereo and a 5.1 multi-channel mix are produced from the original multi-track recording by an audio engineer. Naturally, the automated downmix produced by the spatial audio coder can differ significantly from the artistic stereo downmix intended by the audio engineer. For optimal backward compatibility, this artistic downmix can be transmitted instead of the automated downmix. The difference between the artistic stereo downmix and the automated stereo downmix signal, required by the decoder for optimal multi-channel reconstruction, can be coded as part of the spatial side information stream, either in a parametric fashion for low-bitrate applications or as a waveform-coded difference signal.

On the decoder side, a multi-channel up-mix is created from the transmitted downmix signal and the spatial side information. In this respect, the Spatial Audio Coding concept can be used as a pre- and post-processing step to upgrade existing systems. Figure 2 illustrates this for a 5.1 original with a stereo downmix.

Figure 2: Typical SAC encoder/decoder chain.

Conceptually, this approach can be seen as an enhancement of several known techniques, such as an advanced method for joint stereo coding of multi-channel signals [8], a generalization of Parametric Stereo [9][10][11] to multi-channel application, and an extension of the Binaural Cue Coding (BCC) scheme [12][13] towards using more than one transmitted downmix channel [14]. From a different viewpoint, the Spatial Audio Coding approach may also be considered an extension of well-known matrix surround schemes (Dolby Surround/Prologic, Logic 7, Circle Surround etc.) [15][16] by transmission of dedicated side information to guide the multi-channel reconstruction process and thus achieve improved subjective audio quality [17].
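As an illustration of the cue extraction described above, the sketch below computes a channel level difference (CLD) and inter-channel correlation (ICC) per frequency band for one channel pair, together with a trivial sum downmix. It is a toy version under our own assumptions (FFT-based band grouping, invented names); the normative MPEG Surround extraction operates on a hybrid QMF representation with non-uniform bands.

```python
import numpy as np

def spatial_cues(x1, x2, n_bands=20, eps=1e-12):
    """Estimate per-band CLD (dB) and ICC for one channel pair."""
    X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)
    edges = np.linspace(0, len(X1), n_bands + 1, dtype=int)
    cld, icc = np.empty(n_bands), np.empty(n_bands)
    for b in range(n_bands):
        s1 = X1[edges[b]:edges[b + 1]]
        s2 = X2[edges[b]:edges[b + 1]]
        e1, e2 = np.sum(np.abs(s1) ** 2), np.sum(np.abs(s2) ** 2)
        cld[b] = 10 * np.log10((e1 + eps) / (e2 + eps))
        # normalized cross-correlation; real part keeps the sign
        icc[b] = np.real(np.vdot(s1, s2)) / np.sqrt((e1 + eps) * (e2 + eps))
    downmix = 0.5 * (x1 + x2)        # trivial mono downmix of the pair
    return cld, icc, downmix
```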
2.2 MPEG Surround decoder

2.2.1 Decoder structure

In an MPEG Surround decoder, the decoded (i.e. PCM) downmix signal is up-mixed by a spatial synthesis process using the transmitted spatial cues. Due to the frequency dependence of the spatial side information, the downmix is analyzed by a hybrid filterbank before spatial synthesis, and the multi-channel reconstruction is re-synthesized by a hybrid synthesis filterbank. This is shown in Figure 3.

Figure 3: High-level MPEG Surround decoder structure.

Spatial synthesis is applied to this time-frequency representation by matrix transformations, where the matrices are calculated for defined ranges in time and frequency (tiles) parameterized by the spatial side information. A more detailed overview focusing on the spatial synthesis is shown in Figure 4. In order to be able to reconstruct the inter-channel coherence in the up-mix, decorrelated signals are required. Therefore, the up-mix process is split into three steps. A first matrix performs pre-processing and initial mixing of the downmix signals. Subsequently, some of these signals are decorrelated by independent decorrelators. Finally, a second matrix mixes the decorrelated signals with the linearly pre-processed signals into a reconstruction of the original multi-channel signal.

Figure 4: Spatial synthesis process.

2.2.2 Decoder concept

The underlying concept of the MPEG Surround up-mixing process is based on a tree structure consisting of conceptual up-mixing elements. Figure 5 shows the tree for decoding a stereophonic downmix to a 5.1 reconstruction. There are two basic elements:

• the One-To-Two (OTT) element, which up-mixes one channel into two;
• the Two-To-Three (TTT) element, which up-mixes two channels into three.

Figure 5: Conceptual synthesis structure.

The up-mixing in these elements is done by a simple matrix operation. In order to recreate coherence, a decorrelator is available in each block. Figure 6 shows the operation of the OTT element: the incoming downmix signal is decorrelated and up-mixed by a matrix Wumx for each time-frequency tile.

Figure 6: Operation of an OTT element.

The up-mix matrix is calculated based on the spatial side information. The relevant spatial cues for the OTT element are:

• Channel Level Difference (CLD) – the level difference between the two input channels;
• Inter-channel Coherence/cross-correlation (ICC) – the coherence or cross-correlation between the two input channels.

The up-mixing in the TTT element is slightly different. It estimates a third channel from two input channels and two Channel Prediction Coefficients (CPC). Additionally, a parameter is transmitted that can be used to compensate for a possible prediction loss, e.g. by means of a decorrelated signal. The above-described conceptual decoding process of separate processing blocks is lumped together into the structure shown in Figure 4. A sketch of a single OTT element is given below.
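The following sketch shows one way to realize an OTT element for a single parameter band: the mono input and its decorrelated version are mixed by a 2x2 matrix chosen so that the outputs exhibit the transmitted CLD and ICC. The matrix construction follows the parametric stereo literature [10][11] rather than the normative MPEG Surround tables, and all names are our own.

```python
import numpy as np

def ott_upmix(m, d, cld_db, icc):
    """One-To-Two up-mix for one time-frequency tile.

    m      : mono subband signal
    d      : decorrelated version of m (same power as m)
    cld_db : channel level difference in dB
    icc    : target normalized cross-correlation
    """
    g = 10.0 ** (cld_db / 10.0)
    s1 = np.sqrt(g / (1.0 + g))       # relative amplitude, channel 1
    s2 = np.sqrt(1.0 / (1.0 + g))     # relative amplitude, channel 2
    a = 0.5 * np.arccos(np.clip(icc, -1.0, 1.0))
    w = np.array([[s1 * np.cos(a),  s1 * np.sin(a)],    # Wumx
                  [s2 * np.cos(a), -s2 * np.sin(a)]])
    out = w @ np.vstack([m, d])
    return out[0], out[1]
```

With unit-power inputs, the two outputs have the prescribed level difference, and their normalized cross-correlation equals cos(2a) = ICC.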
2.3 MPEG Surround features

This section highlights some important features of MPEG Surround. For a thorough description of the features, the reader is referred to [18][19][20].

2.3.1 Rate/Distortion scalability

In order to make MPEG Surround usable in as many applications as possible, it is important to cover a broad range, both in terms of side information rates and multi-channel audio quality. There are two different focus areas in this trade-off:

• the downmix,
• the spatial side information.

Although not completely orthogonal, the perceptual quality of the downmix largely determines the sound quality of the multi-channel reconstruction, whereas the spatial side information mainly determines the quality of the spatialization of the up-mix. In order to provide the highest flexibility on the part of the spatial side information and to cover all conceivable application areas, the MPEG Surround technology was equipped with a number of provisions for rate/distortion scalability. This approach permits one to flexibly select the operating point in the trade-off between side information rate and multi-channel audio quality without any change to the generic structure. This concept is illustrated in Figure 7 and relies on several dimensions of scalability that are briefly described in the list below; a back-of-the-envelope rate estimate follows the list.

Figure 7: Rate/Distortion scalability.

• Parameter frequency resolution. A first degree of freedom results from scaling the frequency resolution of the spatial audio processing. Currently, the MPEG Surround syntax covers between 28 parameter frequency bands and a single band.

• Parameter time resolution. Another degree of freedom is available in the temporal resolution of the spatial parameters, i.e. the parameter update rate. The MPEG Surround syntax covers a wide range of update rates and also allows the temporal grid to adapt dynamically to the signal structure.

• Parameter quantization resolution. As a third possibility, different granularities for the transmitted parameters can be used. The use of low-resolution parameter descriptions is accommodated by dedicated tools, such as the Adaptive Parameter Smoothing mechanism [20].

• Parameter choice. Furthermore, there is a choice as to how extensively the transmitted parameterization describes the original multi-channel signal. As an example, the number of ICC values transmitted to characterize the wideness of the spatial image may be as low as a single value per time-frequency tile, applicable to all OTT instances.

• Residual coding. Finally, it is recognized that the quality level of the multi-channel reconstruction is limited by the limits of the parametric model used. Therefore, the MPEG Surround system supports "residual coding", a waveform coding extension that codes the error signal originating from the limits of the parametric model.
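To give a feel for how these dimensions combine into a side information rate, the sketch below multiplies out an illustrative configuration. The numbers (bands, update rate, parameter count, bits per parameter) are our own assumptions for illustration and are not taken from the MPEG Surround syntax.

```python
def side_info_rate(n_bands=10, updates_per_sec=25,
                   params_per_band=5, bits_per_param=5):
    """Rough spatial side information rate in kbit/s (illustrative only).

    params_per_band stands for the CLD/ICC/CPC values of all OTT/TTT
    elements in the tree; entropy coding and signalling overhead of a
    real bitstream are ignored here.
    """
    bits = n_bands * updates_per_sec * params_per_band * bits_per_param
    return bits / 1000.0

# 10 bands * 25 updates/s * 5 params * 5 bits = 6.25 kbit/s, within the
# 3-32 kbit/s span quoted below; coarser settings scale the rate down,
# finer settings scale it up.
print(side_info_rate())
```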
Together, these scaling dimensions enable operation at a wide range of rate/distortion trade-offs, from side information rates below 3 kbit/s to 32 kbit/s and above.

Although MPEG Surround offers the most efficient multi-channel coding to date while allowing for a backwards-compatible downmix signal, there are applications where, due to the construction of the transmission infrastructure, no transmission of additional (however small) side information is possible. To account for this, the MPEG Surround decoder can be operated in a non-guided mode. This means that the multi-channel signal is recreated based solely on the available downmix signal; no spatial side information is transmitted in this mode. However, due to its adaptive nature, this mode still provides better quality than matrix surround based systems.

2.3.2 Matrix surround capability

Besides a conventional stereo downmix, the MPEG Surround encoder is also capable of generating a matrix surround compatible stereo downmix signal. This feature ensures backward-compatible 5.1 audio playback on matrix surround decoders that do not support MPEG Surround. In this context, it is important to ensure that the perceptual quality of the multi-channel reconstruction is not affected by enabling the matrix surround feature.

The matrix surround capability is achieved by a parameter-controlled post-processing unit that acts on the stereo downmix at the encoder side. A block diagram of an MPEG Surround encoder with this extension is shown in Figure 8. The matrix surround enabling post-processing unit, implemented as a matrix transformation, operates in the time-frequency domain on the output of the spatial analysis block and is controlled by the spatial parameters. The transformation matrix is guaranteed to have an inverse which can be uniquely determined from the spatial parameters in the bitstream.

Figure 8: MPEG Surround encoder with post-processing for matrix surround (MTX) compatible downmix.

In the MPEG Surround decoder the process is reversed, i.e. a complementary pre-processing step is applied to the downmix signal before it enters the spatial synthesis process. The matrix surround compatibility comes without any significant additional side information (1 bit indicates whether it is enabled). The ability to invert the matrix surround compatibility processing guarantees that there is no negative effect on the multi-channel reconstruction quality. Furthermore, this feature enables optimal performance of the aforementioned non-guided mode within the MPEG Surround framework.
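The invertibility argument can be made concrete with a toy example: if the encoder-side post-processing of each time-frequency tile is a parameter-dependent rotation, the decoder can reconstruct the original downmix exactly by applying the inverse rotation derived from the same parameters. The rotation form is our own illustrative choice, not the normative matrix.

```python
import numpy as np

def mtx_post(stereo_tile, theta):
    """Encoder-side matrix surround post-processing (illustrative):
    a rotation by an angle theta derived from the spatial parameters."""
    r = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return r @ stereo_tile           # stereo_tile: shape (2, n_samples)

def mtx_pre_inverse(tile, theta):
    """Decoder-side pre-processing: the exact inverse (here R^-1 = R^T),
    uniquely determined from the transmitted parameters."""
    r = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return r.T @ tile

x = np.random.randn(2, 8)
theta = 0.3                          # would be derived from CLD/ICC data
assert np.allclose(mtx_pre_inverse(mtx_post(x, theta), theta), x)
```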
2.3.3 Binaural rendering

One of the most recent extensions of MPEG Surround is the capability to render a 3D/binaural stereo output. Using this mode, consumers can experience a 3D virtual multi-channel loudspeaker setup when listening over headphones. This extension is of especially significant interest for mobile devices (such as DVB-H receivers) and is outlined in more detail below.

3 MPEG SURROUND BINAURAL RENDERING

3.1 Application scenarios

Two distinct use cases are supported. In the first use case, referred to as '3D', the transmitted (stereo) downmix is converted to a 3D headphone signal at the encoder side, accompanied by spatial parameters. In this use case, legacy stereo devices will automatically render a 3D headphone output. If the same (3D) bitstream is decoded by an MPEG Surround decoder, the transmitted 3D downmix can be converted to (standard) multi-channel output optimized for loudspeaker playback. In the second use case, a conventional MPEG Surround downmix / spatial parameter bitstream is decoded using a so-called 'binaural decoding' mode. Hence, the 3D/binaural synthesis is applied at the decoder side.

Within MPEG Surround, both use cases are covered using a new technique for binaural audio synthesis. As described in Section 1, the synthesis process of conventional 3D synthesis systems comprises convolution of each virtual sound source with a pair of HRTFs (i.e., 2N convolutions, with N being the number of sound sources). In the context of MPEG Surround, this method has several disadvantages:

• Individual (virtual) loudspeaker signals are required for HRTF convolution. Within MPEG Surround this means that multi-channel decoding is required as an intermediate step.
• It is virtually impossible to 'undo' or 'invert' the encoder-side HRTF processing at the decoder (which is needed in the first use case for loudspeaker playback).
• Convolution is most efficiently applied in the FFT domain, while MPEG Surround operates in the QMF domain.

To circumvent these potential problems, MPEG Surround 3D synthesis is based on new technology that operates in the QMF domain without intermediate multi-channel decoding. The incorporation of this technology in the two use cases is outlined in the sections below.

3.2 HRTF parameters

MPEG Surround facilitates the use of HRTF parameters. Instead of describing HRTFs by means of a transfer function, the perceptually relevant properties of HRTF pairs are captured by a small set of statistical properties. The parameterization is especially suitable for anechoic HRTFs and works in a similar way as the spatial parameterization of multi-channel content used in MPEG Surround. Parameters are extracted as a function of frequency (i.e., using the concept of non-uniformly distributed parameter bands) and describe the spectral envelopes of an HRTF pair, the average phase difference and, optionally, the coherence between an HRTF pair. This process is repeated for the HRTFs of every sound source position of interest. Using this compact representation, the perceptually relevant localization cues are represented accurately, while the perceptual irrelevance of fine-structure detail in HRTF magnitude and phase spectra is effectively exploited [21][22]. More importantly, the HRTF parameters facilitate low-complexity, parameter-based binaural rendering in the context of MPEG Surround.
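The sketch below extracts such a parameter set (per-band levels, average phase difference, coherence) from one HRTF pair given on an FFT grid. The band grouping and names are our own assumptions; MPEG Surround defines its own non-uniform parameter bands on the hybrid QMF grid.

```python
import numpy as np

def hrtf_parameters(HL, HR, band_edges):
    """Parameterize one HRTF pair (complex spectra HL, HR).

    Returns per-band left/right magnitude parameters p_l, p_r,
    average phase difference phi, and coherence coh."""
    n = len(band_edges) - 1
    p_l, p_r = np.empty(n), np.empty(n)
    phi, coh = np.empty(n), np.empty(n)
    for b in range(n):
        lo, hi = band_edges[b], band_edges[b + 1]
        el = np.sum(np.abs(HL[lo:hi]) ** 2)
        er = np.sum(np.abs(HR[lo:hi]) ** 2)
        cross = np.sum(HL[lo:hi] * np.conj(HR[lo:hi]))
        p_l[b] = np.sqrt(el / (hi - lo))       # mean magnitude, left
        p_r[b] = np.sqrt(er / (hi - lo))       # mean magnitude, right
        phi[b] = np.angle(cross)               # mean interaural phase
        coh[b] = np.abs(cross) / np.sqrt(el * er + 1e-12)
    return p_l, p_r, phi, coh
```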
3.3 Parameter-based binaural rendering

As described in the previous sections, the spatial parameters describe the perceptually relevant properties of multi-channel content. In the context of binaural rendering, these parameters describe relations between virtual sound sources (e.g., virtual loudspeakers). The HRTF parameters, on the other hand, describe the relation between a certain sound source position and the resulting spatial properties of the signals presented over headphones. Consequently, spatial parameters and HRTF parameters can be combined to estimate so-called 'binaural' parameters (see Figure 9). These binaural parameters represent binaural properties (e.g., binaural cues) that result from the simultaneous playback of all virtual sound sources. Said differently, the binaural parameters represent the changes that a mono or stereo downmix signal must undergo to result in a binaural signal representing all virtual sound sources simultaneously, but without the need for an intermediate 5.1 signal representation. This shift of HRTF processing from the traditional signal domain to the parameter domain has the great advantage of reduced complexity, as will be outlined below.

Figure 9: Binaural parameters result from spatial parameters and HRTF parameters.

In Figure 10, a conceptual spatial decoder (consisting of a single OTT element) is shown which generates two output signals from a mono input signal using a Channel Level Difference (CLD) and an Inter-Channel Correlation (ICC) parameter. For each parameter band, the input signal has a power given by $\sigma^2$, and the two OTT output signals have powers given by $\sigma_1^2$ and $\sigma_2^2$, respectively. Since we are only interested in relative changes with respect to the input signal, we assume $\sigma^2 = 1$; given the energy preservation property of OTT elements, $\sigma_1^2 + \sigma_2^2 = 1$. The transmitted CLD parameter is given by

$$\mathrm{CLD} = 10 \log_{10}\left(\frac{\sigma_1^2}{\sigma_2^2}\right),$$

which gives the solution for $\sigma_1^2$ and $\sigma_2^2$:

$$\sigma_1^2 = \frac{10^{\mathrm{CLD}/10}}{1 + 10^{\mathrm{CLD}/10}}, \qquad \sigma_2^2 = 1 - \sigma_1^2.$$

The two output signals of the OTT element are subsequently subject to HRTF processing. Each of the two output signals is processed by sets of HRTF parameters which change the (sub-band) level and phase of the input signals, determined by the (mean) amplitude parameters $p$ and the average phase difference parameters $\phi$ of each HRTF pair. The resulting modified powers $\sigma_{xy}^2$ after application of the HRTF parameters are given by

$$\sigma_{xy}^2 = \sigma_x^2 \, p_{xy}^2,$$

with $p_{xy}$ being the HRTF amplitude parameter for a sound source position corresponding to OTT output channel $x$, and $y$ the index for each ear of the HRTF pair. In a last step, the signals are summed across virtual sound sources for each ear. This addition results in the estimated relative sub-band powers of the left- and right-ear signals, $\sigma_L^2$ and $\sigma_R^2$:

$$\sigma_L^2 = \sigma_{1L}^2 + \sigma_{2L}^2 + 2\,\mathrm{ICC}\cos(\phi_L)\,\sigma_{1L}\sigma_{2L},$$
$$\sigma_R^2 = \sigma_{1R}^2 + \sigma_{2R}^2 + 2\,\mathrm{ICC}\cos(\phi_R)\,\sigma_{1R}\sigma_{2R}.$$

In a similar fashion, the average phase difference and the coherence between the two binaural output signals can be estimated in the parameter domain. If all relevant binaural cues are known, the binaural rendering system only needs to re-instate these parameters given the mono input signal. This synthesis process, which is essentially a 'parametric stereo' decoder, is described in detail elsewhere [9][10][11]. In essence, it comprises a 2x2 sub-band matrix operation:

$$\begin{bmatrix} L_{\mathrm{bin}} \\ R_{\mathrm{bin}} \end{bmatrix}_b = \begin{bmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{bmatrix}_b \begin{bmatrix} M \\ D \end{bmatrix}_b,$$

with $M$ the mono input signal, $D$ the output of a decorrelator, $h_{xy}$ the up-mix matrix elements, $b$ the parameter band index, and $L_{\mathrm{bin}}$, $R_{\mathrm{bin}}$ the binaural output signals. The approach of parameter-domain estimation of binaural cues can be extended to arbitrary configurations of OTT and TTT boxes while retaining the resulting 2x2 matrix operation in the signal domain. In other words, the computational complexity of binaural rendering is largely independent of the number of simultaneous sound sources.

Figure 10: Concept of spatial and HRTF parameter combination.
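The power combination above translates directly into code. The sketch below estimates the binaural target cues (left/right powers and a resulting binaural level difference) for one parameter band of a single OTT element, following the equations of this section; the function name and interface are ours.

```python
import numpy as np

def binaural_cues(cld_db, icc, p1L, p1R, p2L, p2R, phi_L, phi_R):
    """Combine spatial (CLD/ICC) and HRTF parameters for one band.

    p1L..p2R     : HRTF amplitude parameters, sources 1/2 to ears L/R
    phi_L, phi_R : phase parameters entering the left/right cross terms
    """
    g = 10.0 ** (cld_db / 10.0)
    sig1 = np.sqrt(g / (1.0 + g))          # sigma_1, relative amplitude
    sig2 = np.sqrt(1.0 - sig1 ** 2)        # sigma_2
    s1L, s2L = sig1 * p1L, sig2 * p2L      # sigma_xy = sigma_x * p_xy
    s1R, s2R = sig1 * p1R, sig2 * p2R
    sigL2 = s1L**2 + s2L**2 + 2 * icc * np.cos(phi_L) * s1L * s2L
    sigR2 = s1R**2 + s2R**2 + 2 * icc * np.cos(phi_R) * s1R * s2R
    bin_cld = 10 * np.log10(sigL2 / sigR2)   # binaural level difference, dB
    return sigL2, sigR2, bin_cld
```

These target cues are then re-instated on the downmix by a 2x2 matrix of the form shown above, so the per-sample cost is fixed regardless of how many virtual loudspeakers are represented.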
3.4 Extension to high quality and echoic HRTFs

A starting point for the accurate (i.e. non-parametric) modeling of HRTFs/BRTFs of arbitrary length is the observation that any FIR filter can be implemented with high accuracy in the subband domain of the QMF filter bank used in MPEG Surround. The resulting subband filtering consists of simply applying one FIR filter per subband. An $N$-tap filter in the time domain is converted into a collection of 64 complex $K$-tap subband filters, where

$$K = \left\lceil \frac{N}{64} \right\rceil + 2.$$

In fact, the filter conversion algorithm itself consists of a complex modulated analysis filter bank very similar to the MPEG Surround analysis bank, albeit with a different prototype filter. It is important to note that a straightforward polyphase implementation [23] of filtering in a subband filterbank would result in cross filtering between different subbands. The absence of cross filter terms is the key enabling factor for the combination of HRTF/BRTF data with the MPEG Surround parameters, as it allows this combination to be performed independently in each of the MPEG Surround parameter frequency bands. The details of the combination algorithm are beyond the scope of this paper, but the concept is very close to that described for the parametric case in the previous section. The final result is a 2x2 synthesis matrix which is populated by time-varying subband filters.
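Once a time-domain filter has been converted to its subband representation, applying it is an independent short convolution in each of the 64 subbands, with no cross terms. The sketch below shows this application step and the tap-count formula; the conversion to subband filters itself (the modulated analysis bank mentioned above) is omitted, and the names are ours.

```python
import numpy as np

def subband_taps(n_time_taps, n_bands=64):
    """K = ceil(N / 64) + 2 complex taps per subband."""
    return -(-n_time_taps // n_bands) + 2

def apply_subband_filter(x_sub, h_sub):
    """Filter a QMF-domain signal x_sub (64, L) with per-subband complex
    FIR filters h_sub (64, K): one convolution per subband and no
    cross-filtering between subbands."""
    L, K = x_sub.shape[1], h_sub.shape[1]
    y = np.zeros((x_sub.shape[0], L + K - 1), dtype=complex)
    for k in range(x_sub.shape[0]):
        y[k] = np.convolve(x_sub[k], h_sub[k])
    return y

# e.g. a 1000-tap BRTF (as used in the tests of Section 3.9) needs
print(subband_taps(1000))   # ceil(1000/64) + 2 = 18 complex taps per subband
```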
3.5 Binaural Decoding

The binaural decoding scheme is outlined in Figure 11. The MPEG Surround bitstream is decomposed into a downmix bitstream and spatial parameters. The downmix decoder produces conventional mono or stereo signals, which are subsequently converted to the hybrid QMF domain by means of the MPEG Surround hybrid QMF analysis filter bank. A binaural synthesis stage generates the hybrid QMF-domain binaural output by means of a 2-in, 2-out matrix operation. Hence, no intermediate multi-channel up-mix is required. The matrix elements result from a combination of the transmitted spatial parameters and HRTF data. The hybrid QMF synthesis filter bank generates the time-domain binaural output signal.

Figure 11: Binaural decoder schematic.

In the case of a mono downmix, the 2x2 binaural synthesis matrix has as inputs the mono downmix signal and the same signal processed by a decorrelator. In the case of a stereo downmix, the left and right downmix channels form the input of the 2x2 synthesis matrix.

The parameter combiner that generates the binaural synthesis parameters can operate in two modes. The first mode is a high-quality mode, in which HRTFs of arbitrary length can be modelled very accurately, as described in Section 3.4. The resulting 2x2 synthesis matrix for this mode can thus have multiple taps in the time (slot) direction. The second mode is a low-complexity mode using the parameter-based rendering discussed in Section 3.3. In this mode, the 2x2 synthesis matrix has only a single tap in the time direction. Furthermore, since interaural fine-structure phase synthesis is not employed for frequencies beyond approximately 2.5 kHz, the synthesis matrix is real-valued for approximately 90% of the signal bandwidth. This mode is especially suitable for low-complexity operation and/or for representing short (e.g., anechoic) HRTFs. An additional advantage of the low-complexity mode is the fact that the 2x2 synthesis matrix can be inverted, which is an interesting property for the '3D' use case, as outlined subsequently.

3.6 3D-Stereo

In this use case, the binaural processing is applied in the encoder, resulting in a binaural stereo downmix that can be played over headphones on legacy stereo devices. A binaural synthesis module is applied as a post-processing step after spatial encoding in the hybrid QMF domain, in a similar fashion as the matrixed-surround compatibility mode (see Section 2.3.2). The 3D encoder scheme is outlined in Figure 12. The 3D post-processing step comprises the same invertible 2x2 synthesis matrix as used in the low-complexity binaural decoder, controlled by a combination of HRTF data and extracted spatial parameters. The HRTF data can be transmitted as part of the MPEG Surround bitstream using a very efficient parameterized representation.

Figure 12: 3D encoder schematic.

The corresponding decoder for multi-channel loudspeaker playback is shown in Figure 13. A 3D/binaural inversion stage operates as a pre-processing step before spatial decoding in the hybrid QMF domain, ensuring uncompromised quality for the multi-channel reconstruction.

Figure 13: 3D decoder for loudspeaker playback.

3.7 3D Stereo using individual HRTFs

In the 3D-stereo use case, HRTF processing is applied in the encoder. It is therefore difficult to facilitate 3D rendering on legacy decoders with HRTFs that are matched to the characteristics of each listener (i.e., using individual HRTFs). MPEG Surround, however, does facilitate 3D rendering with individual HRTFs, even if a 3D-stereo downmix was transmitted using generic (i.e., non-individualized) HRTFs. This is achieved by replacing the spatial decoder for loudspeaker playback (see Figure 13) by a spatial decoder for binaural synthesis, controlled by the individual's personal HRTF data (see Figure 14). The binaural inversion stage re-creates conventional stereo from the transmitted 3D downmix, the transmitted spatial parameters and the non-individualized HRTF data. Subsequently, the binaural re-synthesis stage creates a binaural stereo version based on the individual HRTFs supplied at the decoder side. In the low-complexity mode, binaural inversion and binaural synthesis each comprise a 2x2 matrix. Hence, the cascade of binaural inversion and binaural re-synthesis is again a 2x2 matrix (resulting from a matrix product) and can thus be implemented very efficiently. As a result, the decoder complexity using the combined binaural inversion and re-synthesis is similar to the complexity of the low-complexity binaural decoder alone.

Figure 14: Binaural (re)synthesis using individual HRTFs based on a 3D-stereo downmix.
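The efficiency claim for this cascade follows from simple matrix algebra: per hybrid band, inversion followed by re-synthesis collapses into a single 2x2 matrix. A minimal sketch, with invented names and random matrices standing in for the parameter-derived ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n_bands = 28

# per-band 2x2 matrices (stand-ins for the parameter-derived ones):
B_generic = rng.standard_normal((n_bands, 2, 2))    # encoder-side 3D synthesis
B_personal = rng.standard_normal((n_bands, 2, 2))   # individual-HRTF synthesis

# decoder: undo the generic 3D processing, then apply the personal one;
# the cascade is precomputed as a single 2x2 matrix per band
combined = B_personal @ np.linalg.inv(B_generic)    # shape (n_bands, 2, 2)

x = rng.standard_normal((n_bands, 2, 16))           # 3D-stereo downmix per band
y_seq = B_personal @ (np.linalg.inv(B_generic) @ x) # inversion, then synthesis
assert np.allclose(combined @ x, y_seq)             # one 2x2 multiply suffices
```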
3.8 3D sound for stereo loudspeaker systems

MPEG Surround provides standardized interfaces for HRTF data (either in the parametric domain or as an impulse response), and thus ensures maximum flexibility for content providers, broadcasters and consumers to optimize playback according to their needs and demands. Besides the use of personal HRTFs for binaural rendering, this flexibility facilitates additional functionality. One example of such functionality is the possibility of creating 3D sound using a conventional stereo playback system. A well-known approach for creating 3D sound over stereo loudspeakers is based on the concept of crosstalk cancellation [24]. Technology based on this principle aims at extending the possible range of sound sources beyond the stereo loudspeaker base by cancelling the inherent crosstalk (see Figure 15). In practice, this means that for every sound source, two filters (Hxy) are applied to generate the two signals that are fed to the two loudspeakers. In most cases, these filters differ between sound sources, so that each sound source obtains a different perceived position.

Figure 15: Crosstalk (dashed lines) and crosstalk cancellation filters (Hxy) for two sound sources.

The processing scheme for multiple simultaneous sound sources is, in fact, identical to the (conventional) HRTF processing scheme depicted in Figure 1, with the only modification that the HRTFs are replaced by crosstalk-cancellation filters. As a result, the crosstalk cancellation principle can be exploited in MPEG Surround decoders by re-using the binaural technology. For each audio channel (for example in a 5.1 setup), a set of crosstalk cancellation filters can be provided to the binaural decoder. Playback of the decoded output over a stereo loudspeaker pair will then result in the desired 3D sound experience. Compared to conventional crosstalk cancellation systems, the application of these filters in the context of MPEG Surround has the following important advantages (a filter-design sketch follows the list):

• All processing is performed in a 2x2 processing matrix, without the need for an intermediate 5.1 signal representation.
• Only two synthesis filterbanks are required.
• Freedom of crosstalk-cancellation filter design; the filters can be optimized for each application or playback device individually.
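For completeness, the classical way to obtain such crosstalk-cancellation filters [24] is to invert, per frequency, the 2x2 matrix of acoustic transfer functions from the two loudspeakers to the two ears. The regularized inversion below is a common textbook recipe, not part of the MPEG Surround specification; the names are ours.

```python
import numpy as np

def xtc_filters(H, beta=1e-3):
    """Crosstalk-cancellation filters from loudspeaker-to-ear responses.

    H : (F, 2, 2) complex array; H[f, e, s] is the transfer function
        from loudspeaker s to ear e at frequency bin f.
    Returns C with C[f] approximately inv(H[f]), regularized by beta
    to limit the gain where H is nearly singular.
    """
    Hh = np.conj(np.transpose(H, (0, 2, 1)))           # Hermitian transpose
    eye = np.eye(2)
    return np.linalg.solve(Hh @ H + beta * eye, Hh)    # (H^H H + bI)^-1 H^H
```

Feeding each virtual-source signal through its pair of filters and summing at the loudspeakers then follows the same 2xN structure as Figure 1.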
3.9 Performance

This section presents results from recent listening tests conducted within the context of the MPEG standardization process. Two types of tests were conducted: (1) individualized sound localization tests [25] to examine the spectral resolution required in the QMF domain for an adequate parameter-based representation of HRTF information; and (2) several MUSHRA [26] tests to perceptually evaluate the performance of the binaural decoding and 3D stereo encoding technologies. A brief description of the tests and their results follows.

3.9.1 Sound Localization Test

Stimuli. A sound localization experiment was conducted in virtual auditory space to examine the spatial fidelity of a Gaussian broadband noise source (150 ms duration with a raised-cosine time envelope). Two normal-hearing subjects performed the localization task for two sound conditions: (1) a control condition in which the binaural stimuli were prepared using normal HRTF filter convolution; and (2) a test condition in which the binaural stimuli were prepared using a 28-QMF-band parameter-based binaural rendering as described in Section 3.3. The subjects' individualized HRTF filters were measured for 393 positions evenly spaced around an imaginary sphere of one meter radius about the subject's head.

Experimental paradigm. Localization performance was assessed using a nose-pointing task. For this task, an electromagnetic tracking system (Polhemus Fastrak) is used to measure the subject's perceived sound source location relative to the centre of the subject's head. The sensor is mounted on top of a rigid headband worn by the subject. Prior to each stimulus presentation, the subject aligns his/her head to a calibrated start position with the aid of an LED display. After pressing a handheld pushbutton, the stimulus is played over in-ear tube phones (Etymotic Research ER-2). The subject responds by turning and pointing his/her nose to the perceived position of the sound source and once again presses the handheld pushbutton. The controlling computer records the orientation of the electromagnetic sensor, and the subject returns to the calibrated start position for the next stimulus presentation. Localization performance is assessed for three repeats of 76 test locations spaced around the subject.

Localization results. Positions on a sphere can be described using a lateral and polar angle coordinate system. The lateral angle is the horizontal angle away from the midline, where negative lateral angles (down to -90°) define the left hemisphere and positive lateral angles (up to +90°) define the right hemisphere. The lateral angle describes positions for which binaural cues, such as interaural time and level differences, are very similar. The polar angle is the angle on the circle around the interaural axis, for a given lateral angle, with 0° representing the horizontal plane in front, 90° directly above, 180° behind and 270° directly below. Localization in the polar angle dimension depends on the spectral cues generated by the directionally dependent filtering of the outer ear.
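For readers less familiar with this interaural-polar coordinate system, the conversion below maps a (lateral, polar) pair to a Cartesian direction vector. The axis convention is our own choice for illustration; the paper's definitions fix only the angles themselves.

```python
import numpy as np

def interaural_polar_to_cartesian(lateral_deg, polar_deg):
    """Unit direction vector from interaural-polar coordinates.

    Axes (our convention): x = front, y = right, z = up.
    lateral in [-90, 90] deg (negative = left hemisphere),
    polar in [0, 360) deg (0 = front, 90 = above, 180 = behind).
    """
    lat, pol = np.radians(lateral_deg), np.radians(polar_deg)
    return np.array([np.cos(lat) * np.cos(pol),   # front component
                     np.sin(lat),                 # right component
                     np.cos(lat) * np.sin(pol)])  # up component

print(interaural_polar_to_cartesian(0, 0))    # ~[1, 0, 0]: straight ahead
print(interaural_polar_to_cartesian(90, 0))   # ~[0, 1, 0]: far right
print(interaural_polar_to_cartesian(0, 90))   # ~[0, 0, 1]: directly above
```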
Figures 16 and 17 show the localization data for both subjects for the lateral and polar angle, respectively. The data indicate that there were no substantial differences in localization performance across the two conditions.

Figure 16: The lateral angle component of the localization performance data, shown for both subjects as scatter plots of response angle versus target angle for the reference and the parameter-based binaural rendering.

Figure 17: The polar angle component of the localization performance data, shown for both subjects as scatter plots of response angle versus target angle for the reference and the parameter-based binaural rendering.

3.9.2 MUSHRA Tests

Test setup. Subjective listening tests were conducted to evaluate the fidelity of the binaural rendering process within MPEG Surround. The experiments were conducted in virtual auditory space using the MUSHRA testing methodology. There were two sets of tests: one for the binaural rendering technique that operates within the decoder, referred to as the binaural decoder, and one for the binaural rendering technique that operates within the encoder, referred to as the 3D stereo encoder. For both binaural rendering techniques, tests were carried out using both a TC1 and a TC3 configuration. For TC1, a stereo AAC core coder is used, operating at 160 kbps stereo. For TC3, a monaural HE-AAC core coder is employed such that the total bitrate of core coding and spatial side information amounts to 48 kbps. For the HE-AAC core coder, KEMAR HRTF filters were used; for the AAC core coder, 1000-tap BRTF filters were used. A number of reference signals were employed during the testing; these are listed in Table 1. Table 2 shows the size of the MUSHRA tests.

Table 1 – Signals under test.

Label     Description
Ref       Original 5.1 item downmixed to binaural with a common HRTF set
Ref-3.5k  Anchor, 3.5 kHz low-pass filtered reference
RMB       MPEG Surround (RM) decoder 5.1 output downmixed to binaural with a common HRTF set
ADG       RM 5.1 decoding of a 3D/binaural downmix including Artistic Downmix Gains
RMS       MPEG Surround (RM) decoder 5.1 output downmixed to stereo

Table 2 – Size of the MUSHRA tests.

                              Binaural decoder     3D stereo encoder
                              TC3       TC1        TC3       TC1
Number of stimuli             9108      8316       3036      4235
Number of subjects            92        84         46        77
Number of rejected subjects   8         11         7         17

Test results. The MUSHRA test results for the binaural decoder are shown in Figures 18 and 19. Similarly, the MUSHRA test results for the 3D stereo encoder are shown in Figures 21 and 22.

Figure 18: MUSHRA test results for the binaural decoder in the TC3 configuration (conditions: Ref, QMF binaural filtering, QMF parameter-based binaural rendering, RMB, Ref-3.5k). The mean opinion score is shown on the vertical axis; the audio test items (applause, ARL applause, fountain music, chostakovitch, indie2, glock, jackson1, pops, poulenc, rock concert, Stomp, and their mean) are shown on the horizontal axis.

Figure 19: MUSHRA test results for the binaural decoder in the TC1 configuration. Other details as in Figure 18.

Figure 21: MUSHRA test results for the 3D stereo encoder in the TC1 configuration. Other details as in Figure 18.

Figure 22: MUSHRA test results for the 3D stereo encoder in the TC3 configuration (conditions: Ref, 3D Stereo, Ref-3.5k, ADG, RMS). Other details as in Figure 18.

For benchmarking, the complexity of the MPEG Surround binaural rendering techniques has been analyzed in terms of multiply-accumulate operations. Figure 20 shows a 2-D representation of the quality versus complexity numbers. Note that the RMB reference condition (MPEG Surround decoding followed by efficient external HRTF filtering in the FFT domain) has been included.

Figure 20: Quality versus complexity for MPEG Surround binaural rendering techniques.
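The complexity gap behind Figure 20 can be approximated with simple multiply-accumulate (MAC) counts. The sketch below compares, per output sample pair, conventional rendering (2N time-domain filters) with the low-complexity 2x2 parametric synthesis; filterbank costs are deliberately ignored, the 71-band hybrid resolution is our assumption, and all numbers are illustrative rather than the figures measured for the standardization tests.

```python
def macs_conventional(n_channels=6, filter_taps=1000):
    """Time-domain HRTF/BRTF convolution: 2 filters per channel."""
    return 2 * n_channels * filter_taps          # MACs per sample

def macs_parametric(n_hybrid_bands=71):
    """Low-complexity binaural decoder: one 2x2 matrix per hybrid band,
    i.e. 4 MACs per band for each pair of output samples."""
    return 4 * n_hybrid_bands

print(macs_conventional())   # 12000 MACs/sample for 5.1 with 1000-tap BRTFs
print(macs_parametric())     # 284 MACs/sample pair, independent of N
```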
4 APPLICATIONS

The binaural rendering capability of MPEG Surround brings surround sound to portable devices to an extent not previously possible. By means of the binaural decoding functionality, MPEG Surround offers a surround sound format that is suitable for stationary home devices, car radios etc., as well as for portable devices. The following paragraphs give a few examples of interesting applications that can be envisioned for the binaural rendering modes of MPEG Surround.

Digital Radio Broadcasting. Surround sound in radio broadcasting is particularly interesting for car-radio applications, since the listener's position is fixed with respect to the loudspeakers. Hence, MPEG Surround with its inherent stereo backwards compatibility is ideal for this application. A legacy "kitchen radio" device (where the listener is typically moving around doing other things while listening) will play the stereo signal, while the car radio can decode the MPEG Surround data and render the multi-channel signal. In a similar manner, a legacy portable radio receiver will play the backwards-compatible stereo part of the MPEG Surround stream, while a portable receiver equipped with binaural MPEG Surround will operate in the binaural decoding mode and provide a surround sound listening experience over headphones.

Digital Video Broadcasting. The MPEG Surround binaural rendering capability is particularly attractive for TV/movie consumption on portable devices. Since surround sound has an important place in TV/movie consumption, it is interesting to maintain it for portable TV/movie consumption, such as with DVB-H. For this application MPEG Surround is ideal, not only because it enables surround sound at a very low bitrate, but also because it enables the surround sound experience over headphones on portable devices.

Music Download Services. Several popular music store services are available today, either for download of music over the Internet, e.g. the "iTunes Music Store", or for download of music directly to the mobile phone, e.g. KDDI's EZ "Chaku-Uta Full™" service. These make a very interesting application for MPEG Surround. Since the MPEG Surround data adds only minimal overhead to the existing downmix data, storing surround files on devices limited by disk space poses no problem. One could envision a portable music player that, when connected to the home stereo equipment, decodes the MPEG Surround files stored on the player into surround sound played over the home speaker set-up. When the player is "mobile", i.e. carried around by the user, the MPEG Surround data is decoded into binaural stereo, enabling the surround sound experience on the mobile device. Finally, a legacy player can store the same files (with very limited penalty on storage space) and play the stereo backwards-compatible part.

5 CONCLUSIONS

Recent progress in the area of parametric coding of multi-channel audio has led to the MPEG Surround specification, which provides an efficient and backward-compatible representation of high-quality audio at bitrates comparable to those currently used for representing stereo (or even mono) audio signals. While the technology was initially conceived for conventional loudspeaker reproduction, the idea of accommodating multi-channel playback on small mobile devices led to a number of interesting additions to the specification.
These extensions combine traditional approaches for binaural rendering with the MPEG Surround framework in an innovative way to achieve high-quality binaural rendering of surround sound even with the very limited computational resources typically available on mobile devices such as cell phones, mp3 players and PDAs. Several options for binaural rendering are available, including approaches that allow binaural rendering even on legacy devices. The blending of binaural technology with parametric modeling techniques enables a wide range of attractive applications for both mobile and stationary home use, and makes MPEG Surround an attractive format for the unified, bitrate-efficient delivery of multi-channel sound.

REFERENCES

[1] D. Begault: "3D Sound for Virtual Reality and Multimedia", Academic Press, Cambridge, 1994.

[2] R. H. Gilkey, T. R. Anderson: "Binaural and Spatial Hearing in Real and Virtual Environments", Lawrence Erlbaum Associates, 1997.

[3] J. Blauert: "Spatial Hearing: The Psychophysics of Human Sound Localization", MIT Press, revised edition, 1997.

[4] H. Møller, M. F. Sørensen, D. Hammershøi, C. B. Jensen: "Head-related transfer functions of human subjects", J. Audio Eng. Soc., Vol. 43, No. 5, pp. 300-321, 1995.

[5] W. G. Gardner: "3-D Audio Using Loudspeakers", Kluwer Academic Publishers, 1998.

[6] P. J. Minnaar, S. K. Olesen, F. Christensen, H. Møller: "Localization with binaural recordings from artificial and human heads", J. Audio Eng. Soc., May 2001.

[7] P. J. Minnaar, S. K. Olesen, F. Christensen, H. Møller: "The importance of head movements for binaural room synthesis", Proc. ICAD 2001, Espoo, 2001.

[8] J. Herre: "From Joint Stereo to Spatial Audio Coding - Recent Progress and Standardization", 7th International Conference on Digital Audio Effects (DAFx-04), Naples, Italy, October 2004.

[9] H. Purnhagen: "Low Complexity Parametric Stereo Coding in MPEG-4", 7th International Conference on Digital Audio Effects (DAFx-04), Naples, Italy, October 2004.

[10] E. Schuijers, J. Breebaart, H. Purnhagen, J. Engdegård: "Low complexity parametric stereo coding", Proc. 116th AES Convention, Berlin, Germany, 2004, Preprint 6073.

[11] J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers: "Parametric coding of stereo audio", EURASIP J. Applied Signal Proc. 9:1305-1322, 2005.

[12] C. Faller, F. Baumgarte: "Efficient Representation of Spatial Audio Using Perceptual Parametrization", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, 2001.

[13] C. Faller, F. Baumgarte: "Binaural Cue Coding - Part II: Schemes and applications", IEEE Trans. on Speech and Audio Proc., Vol. 11, No. 6, November 2003.

[14] C. Faller: "Coding of Spatial Audio Compatible with Different Playback Formats", 117th AES Convention, San Francisco, 2004, Preprint 6187.

[15] Dolby Publication: "Dolby Surround Pro Logic II Decoder - Principles of Operation", http://www.dolby.com/assets/pdf/tech_library/209_Dolby_Surround_Pro_Logic_II_Decoder_Principles_of_Operation.pdf

[16] D. Griesinger: "Multichannel Matrix Decoders For Two-Eared Listeners", 101st AES Convention, Los Angeles, 1996, Preprint 4402.

[17] J. Herre, C. Faller, S. Disch, C. Ertel, J. Hilpert, A. Hoelzer, K. Linzmeier, C. Spenger, P. Kroon: "Spatial Audio Coding: Next-Generation Efficient and Compatible Coding of Multi-Channel Audio", 117th AES Convention, San Francisco, 2004, Preprint 6186.

[18] J. Herre, H. Purnhagen, J. Breebaart, C. Faller, S. Disch, K. Kjörling, E. Schuijers, J. Hilpert, F. Myburg: "The Reference Model Architecture for MPEG Spatial Audio Coding", Proc. 118th AES Convention, Barcelona, Spain, May 2005, Preprint 6477.
[19] J. Breebaart, J. Herre, C. Faller, J. Rödén, F. Myburg, S. Disch, H. Purnhagen, G. Hotho, M. Neusinger, K. Kjörling, W. Oomen: "MPEG spatial audio coding / MPEG Surround: overview and current status", Proc. 119th AES Convention, New York, USA, October 2005, Preprint 6447.

[20] L. Villemoes, J. Herre, J. Breebaart, G. Hotho, S. Disch, H. Purnhagen, K. Kjörling: "MPEG Surround: The forthcoming ISO standard for spatial audio coding", AES 28th International Conference, Piteå, Sweden, 2006.

[21] A. Kulkarni, S. K. Isabelle, H. S. Colburn: "Sensitivity of human subjects to head-related transfer-function phase spectra", J. Acoust. Soc. Am. 105:2821-2840, 1999.

[22] J. Breebaart, A. Kohlrausch: "The perceptual (ir)relevance of HRTF magnitude and phase spectra", Proc. 110th AES Convention, Amsterdam, The Netherlands, 2001.

[23] C. Lanciani, R. W. Schafer: "Subband-domain filtering of MPEG audio signals", Proc. ICASSP 1999.

[24] M. R. Schroeder: "Models of hearing", Proc. IEEE 63(9):1332-1350, 1975.

[25] S. Carlile, P. Leong, S. Hyams: "The nature and distribution of errors in sound localisation by human listeners", Hearing Research, 114:179-196, 1997.

[26] ITU-R Recommendation BS.1534-1: "Method for the Subjective Assessment of Intermediate Sound Quality (MUSHRA)", International Telecommunication Union, Geneva, Switzerland, 2001.