Direction of Arrival Estimation For Multiple Sound Sources Using Convolutional Recurrent Neural Network

arXiv:1710.10059v2 [cs.SD] 5 Aug 2018

Abstract—This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources. The proposed stacked convolutional and recurrent neural network (DOAnet) generates a spatial pseudo-spectrum (SPS) along with the DOA estimates in both azimuth and elevation. We avoid any explicit feature extraction step by using the magnitudes and phases of the spectrograms of all the channels as input to the network. The proposed DOAnet is evaluated by estimating the DOAs of multiple concurrently present sources in anechoic, matched and unmatched reverberant conditions. The results show that the proposed DOAnet is capable of estimating the number of sources and their respective DOAs with good precision and generating SPS with high signal-to-noise ratio.

I. INTRODUCTION

Direction of arrival (DOA) estimation is the task of identifying the relative position of the sound sources with respect to the microphone. DOA estimation is a fundamental operation in microphone array processing and forms an integral part of speech enhancement [1], multichannel sound source separation [2] and spatial audio coding [3]. Popular approaches to DOA estimation are based on time-delay-of-arrival (TDOA) [4], the steered-response power (SRP) [5], or on subspace methods such as multiple signal classification (MUSIC) [6] and the estimation of signal parameters via rotational invariance technique (ESPRIT) [7].

The aforementioned methods differ from each other in terms of algorithmic complexity and their suitability to various arrays and sound scenarios. MUSIC specifically is very generic with regard to array geometry and directional properties, and can handle multiple simultaneously active narrowband sources. On the other hand, MUSIC and subspace methods in general require a good estimate of the number of active sources, which is often unavailable or difficult to obtain. Furthermore, MUSIC can suffer at low signal-to-noise ratio (SNR) and in reverberant scenarios [8]. In this paper, we propose to overcome the above shortcomings with a deep neural network (DNN) method, referred to as DOAnet, that learns the number of sources from the input data, generates high-precision DOA estimates and is robust to reverberation. The proposed DOAnet also generates a spatial acoustic activity map similar to the MUSIC pseudo-spectrum (SPS) as an intermediate output. The SPS has numerous applications that rely on a directional map of acoustic activity, such as soundfield visualization [9] and room acoustics analysis [10]. In comparison, the proposed DOAnet outputs the SPS and DOAs of multiple overlapping sources similar to popular DOA estimators like MUSIC, ESPRIT or SRP, without requiring the critical information of the number of active sound sources. A successful implementation of this will enable the integration of such DNN methods into higher-level learning-based end-to-end sound analysis and detection systems.

Recently, several DNN-based approaches have been proposed for DOA estimation [11], [12], [13], [14], [15], [16]. There are six significant differences between them and the proposed method: a) All the aforementioned works focused on azimuth estimation, with the exception of [15], where the 2-D Cartesian coordinates of sound sources in a room were predicted, and [11], which trained separate networks for azimuth and elevation estimation. In contrast, we demonstrate the estimation of both azimuth and elevation for the DOA by sampling the unit sphere uniformly and predicting the probability of a sound source at each direction. b) The past works focused on the estimation of a single DOA at every time frame, with the exception of [13], where localization of azimuth for up to two simultaneous sources was proposed. On the other hand, the proposed DOAnet does not algorithmically limit the number of directions to be estimated, i.e., with a higher number of audio channels input, the DOAnet can potentially estimate a larger number of sound events.

c) Past works were evaluated with different array geometries, making comparison difficult. Although the DOAnet can be applied to any array geometry, we evaluate the method using real spherical harmonic input signals, an emerging popular spatial audio format under the name Ambisonics. Microphone signals from various arrays, such as spherical, circular, planar or volumetric, can be transformed to Ambisonic signals by an appropriate transform [17], resulting in a common representation of the 3-D sound recording. Although the DOAnet is scalable to higher-order Ambisonics, in this paper we evaluate it using the compact four-channel first-order Ambisonics (FOA).

d) Regarding classifiers, earlier methods have used fully connected (FC) neural networks [11], [12], [13], [14], [15] and convolutional neural networks (CNN) [16]. In this work, along with the CNNs we use recurrent neural network (RNN) layers. The usage of RNNs allows the network to learn long-term temporal information. Such an architecture is referred to as a convolutional recurrent neural network (CRNN) in the literature and is the state-of-the-art method in many single- [18], [19] and multichannel [20], [21] audio tasks. e) Previous methods used inter-channel features such as generalized cross-correlation with phase transform (GCC-PHAT) [15], [12], eigen-decomposition of the spatial covariance matrix [13], and inter-channel time delay (ITD) and inter-channel level differences (ILD) [11], [14]. More recently, Chakrabarty et al. [16]

* Equally contributing authors in this paper. The research leading to these results has received funding from the European Research Council under the European Union's H2020 Framework Programme through ERC Grant Agreement 637422 EVERYSOUND. The authors also wish to acknowledge CSC-IT Center for Science, Finland, for computational resources.

[Fig. 1. DOAnet architecture. Input: magnitude and phase spectrograms of all channels (100×1024×8). A first branch of four 64-filter 3×3 2D CNN layers (ReLU) with max pooling, followed by two 64-unit bidirectional GRU layers (tanh) and a 614-unit time-distributed linear dense layer, produces Output 1, the spatial pseudo-spectrum (SPS, 100×614). A second branch of two 16-filter 3×3 2D CNN layers (ReLU) with max pooling, a 32-unit time-distributed linear dense layer and two 16-unit bidirectional GRU layers (tanh), followed by a 432-unit time-distributed sigmoid dense layer, produces Output 2, the DOA estimates (100×432).]
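For context on the baseline referenced throughout, the following is a minimal NumPy sketch of the classical MUSIC pseudo-spectrum that the DOAnet's SPS output mimics. The array model, steering vectors and direction grid here are illustrative assumptions, not from the paper; the key point is that MUSIC requires the number of sources K up front, which is exactly the prior knowledge the DOAnet avoids.

```python
# Illustrative sketch (not the paper's implementation): MUSIC
# pseudo-spectrum from the noise subspace of the spatial covariance.
import numpy as np

def music_pseudo_spectrum(X, steering, K):
    """X: (M, T) complex multichannel frames for one frequency band;
    steering: (D, M) steering vectors for D candidate directions;
    K: assumed number of active sources (must be known a priori)."""
    M, T = X.shape
    R = X @ X.conj().T / T                 # spatial covariance (M, M)
    w, V = np.linalg.eigh(R)               # eigenvalues in ascending order
    En = V[:, :M - K]                      # noise subspace (M, M - K)
    proj = steering.conj() @ En            # projection onto noise subspace
    denom = np.sum(np.abs(proj) ** 2, axis=1)
    return 1.0 / np.maximum(denom, 1e-12)  # peaks at the source DOAs
```

Steering vectors of true source directions are orthogonal to the noise subspace, so the reciprocal projection energy peaks there; with an over- or under-estimated K the subspace split is wrong and the peaks degrade, which is the sensitivity to the source count discussed above.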
Fig. 2. SPS for two closely located sound sources: (a) MUSIC estimated, (b) DOAnet estimated. The black-cross markers represent the ground-truth DOA. The horizontal axis is azimuth and the vertical axis is elevation angle (in degrees).

Fig. 3. Confusion matrix for the number of DOAs estimated per frame by the DOAnet: (a) O1A, (b) O2A, (c) O1R, (d) O2R. The horizontal axis is the DOAnet estimate, and the vertical axis is the ground truth.
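The DOA read-out described in the text, picking the top probabilities from the final sigmoid prediction layer, can be sketched as follows. The 432-direction spherical grid matches the size of the output layer, but the 10-degree azimuth/elevation spacing and the 0.5 threshold are assumptions for illustration, not the paper's exact sampling.

```python
# Illustrative sketch of reading DOAs from per-direction probabilities.
# Grid spacing and threshold are assumed values, not from the paper.
import numpy as np

AZIMUTHS = np.arange(-180, 180, 10)     # 36 azimuth steps (assumed)
ELEVATIONS = np.arange(-60, 60, 10)     # 12 elevation steps (assumed)
GRID = np.array([(a, e) for e in ELEVATIONS for a in AZIMUTHS])  # (432, 2)

def pick_doas(probs, num_sources=None, threshold=0.5):
    """probs: (432,) sigmoid outputs for one frame.
    If num_sources is known, take that many top directions;
    otherwise threshold the probabilities (unknown-count case)."""
    if num_sources is None:
        idx = np.flatnonzero(probs > threshold)
        idx = idx[np.argsort(probs[idx])[::-1]]   # strongest first
    else:
        idx = np.argsort(probs)[::-1][:num_sources]
    return GRID[idx]                              # (azimuth, elevation) pairs
```

The thresholded branch corresponds to the unknown-number-of-sources evaluation, and the top-K branch to the known-number case; a smarter peak picker, as noted in the text, would replace the simple sort here.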
TABLE II
EVALUATION SCORES FOR UNMATCHED REVERBERANT ROOMS.

                                      Room 2         Room 3
Max. no. of overlapping sources      1      2       1      2
SPS SNR (in dB)                     3.53   1.49    3.49   1.46
DOAnet error (unknown number of sources)
  DOAnet                            3.44   6.88    4.59   10.89
  Correctly predicted frames (in %) 46.2   14.3    49.7   14.1
DOA error (known number of sources)
  DOAnet                            8.60   32.10   9.17   33.82
  MUSIC                             31.52  58.47   33.25  60.76

in Figure 3. We skipped the confusion matrices for the O3 datasets as they were not meaningful, for similar reasons as explained above.

With the knowledge of the number of active sources (Table I), the DOAnet performs considerably better than the baseline MUSIC for all datasets other than O2A and O3A. The MUSIC DOAs were chosen using a 2D peak finder on the MUSIC SPS, whereas the DOAs of the DOAnet were chosen by simply picking the top probabilities in the final DOA prediction layer. A smarter peak-picking method for the DOAnet, or using the number of sources as an additional input, can potentially result in better scores across all datasets. Further, the DOAnet error on unmatched reverberant data is presented in Table II. The performance of the DOAnet is consistent in comparison to the matched reverberant data in Table I, and significantly better than the performance of MUSIC.

In this paper, since the baseline was chosen to be MUSIC, for a fair comparison the DOAnet was also trained using the MUSIC SPS. In an ideal scenario, considering that the DOAnet is trained using datasets for which the ground-truth DOAs are known, we can generate accurate high-resolution SPS from the ground-truth DOAs as per the required application and use them for training. Alternatively, the DOAnet can be trained without the SPS to directly generate the DOAs; the SPS was only used in this paper to present the complete potential of the method in the limited paper space. In general, the above results show that the proposed DOAnet has the potential to learn the 2D direction information of multiple overlapping sound sources directly from the spectrogram of the input audio, without knowledge of the number of active sound sources. An exhaustive study with more detailed experiments, including both synthetic and real datasets, is planned for future work.

V. CONCLUSION

A convolutional recurrent neural network (DOAnet) was proposed for multiple source localization. The DOAnet was shown to learn the number of active sources directly from the input spectrogram, and to estimate precise DOAs in 2-D polar space. The method was evaluated on anechoic, matched and unmatched reverberant datasets. The proposed DOAnet performed considerably better than the baseline MUSIC in most scenarios, thereby showing the potential of the DOAnet in learning a highly computational algorithm without prior knowledge of the number of sources.

REFERENCES

[1] M. Woelfel and J. McDonough, Distant Speech Recognition. Wiley, 2009.
[2] J. Nikunen and T. Virtanen, "Direction of arrival based spatial covariance model for blind sound source separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 3, 2014.
[3] A. Politis et al., "Sector-based parametric sound field reproduction in the spherical harmonic domain," IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 5, pp. 852–866, 2015.
[4] Y. Huang et al., "Real-time passive source localization: a practical linear-correction least-squares approach," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 8, 2001.
[5] M. S. Brandstein and H. F. Silverman, "A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1997.
[6] R. O. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, 1986.
[7] R. Roy and T. Kailath, "ESPRIT-estimation of signal parameters via rotational invariance techniques," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 7, 1989.
[8] J. H. DiBiase et al., "Robust localization in reverberant rooms," in Microphone Arrays. Springer, 2001, pp. 157–180.
[9] A. O'Donovan et al., "Imaging concert hall acoustics using visual and audio cameras," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008.
[10] D. Khaykin and B. Rafaely, "Acoustic analysis by spherical microphone array processing of room impulse responses," The Journal of the Acoustical Society of America, vol. 132, no. 1, 2012.
[11] R. Roden et al., "On sound source localization of speech signals using deep neural networks," in Deutsche Jahrestagung für Akustik (DAGA), 2015.
[12] X. Xiao et al., "A learning-based approach to direction of arrival estimation in noisy and reverberant environments," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[13] R. Takeda and K. Komatani, "Discriminative multiple sound source localization based on deep neural networks using independent location model," in IEEE Spoken Language Technology Workshop (SLT), 2016.
[14] A. Zermini et al., "Deep neural network based audio source separation," in International Conference on Mathematics in Signal Processing, 2016.
[15] F. Vesperini et al., "A neural network based algorithm for speaker localization in a multi-room environment," in IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2016.
[16] S. Chakrabarty and E. A. P. Habets, "Broadband DOA estimation using convolutional neural networks trained with noise signals," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017.
[17] H. Teutsch, Modal Array Signal Processing: Principles and Applications of Acoustic Wavefield Decomposition. Springer, 2007, vol. 348.
[18] T. N. Sainath et al., "Convolutional, long short-term memory, fully connected deep neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[19] M. Malik et al., "Stacked convolutional and recurrent neural networks for music emotion recognition," in Sound and Music Computing Conference (SMC), 2017.
[20] T. Sainath et al., "Multichannel signal processing with deep neural networks for automatic speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, 2017.
[21] S. Adavanne et al., "Sound event detection using spatial features and convolutional recurrent neural network," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[22] E. Benetos et al., "Sound event detection in synthetic audio," http://www.cs.tut.fi/sgn/arg/dcase2016/, 2016.
[23] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 65, no. 4, 1979.
[24] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, pp. 83–97, 1955.
[25] B. Ottersten et al., "Exact and large sample maximum likelihood techniques for parameter estimation and detection in array processing," in Radar Array Processing. Springer Series in Information Sciences, 1993.