Direction of Arrival Estimation for Multiple Sound Sources Using Convolutional Recurrent Neural Network


Sharath Adavanne*¹, Archontis Politis*², Tuomas Virtanen¹
¹ Laboratory of Signal Processing, Tampere University of Technology, Finland
² Department of Signal Processing and Acoustics, Aalto University, Finland

* Equally contributing authors. The research leading to these results has received funding from the European Research Council under the European Union's H2020 Framework Programme through ERC Grant Agreement 637422 EVERYSOUND. The authors also wish to acknowledge CSC - IT Center for Science, Finland, for computational resources.

Abstract—This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources. The proposed stacked convolutional and recurrent neural network (DOAnet) generates a spatial pseudo-spectrum (SPS) along with the DOA estimates in both azimuth and elevation. We avoid any explicit feature extraction step by using the magnitudes and phases of the spectrograms of all the channels as input to the network. The proposed DOAnet is evaluated by estimating the DOAs of multiple concurrently present sources in anechoic, matched and unmatched reverberant conditions. The results show that the proposed DOAnet is capable of estimating the number of sources and their respective DOAs with good precision, and of generating SPS with high signal-to-noise ratio.

I. INTRODUCTION

Direction of arrival (DOA) estimation is the task of identifying the relative position of the sound sources with respect to the microphone. DOA estimation is a fundamental operation in microphone array processing and forms an integral part of speech enhancement [1], multichannel sound source separation [2] and spatial audio coding [3]. Popular approaches to DOA estimation are based on time-delay-of-arrival (TDOA) [4], the steered-response power (SRP) [5], or on subspace methods such as multiple signal classification (MUSIC) [6] and the estimation of signal parameters via rotational invariance technique (ESPRIT) [7].

The aforementioned methods differ from each other in terms of algorithmic complexity and their suitability to various arrays and sound scenarios. MUSIC specifically is very generic with regard to array geometry and directional properties, and can handle multiple simultaneously active narrowband sources. On the other hand, MUSIC and subspace methods in general require a good estimate of the number of active sources, which is often unavailable or difficult to obtain. Furthermore, MUSIC can suffer at low signal-to-noise ratio (SNR) and in reverberant scenarios [8]. In this paper, we propose to overcome the above shortcomings with a deep neural network (DNN) method, referred to as DOAnet, that learns the number of sources from the input data, generates high-precision DOA estimates and is robust to reverberation. The proposed DOAnet also generates a spatial acoustic activity map similar to the MUSIC pseudo-spectrum (SPS) as an intermediate output. The SPS has numerous applications that rely on a directional map of acoustic activity, such as soundfield visualizations [9] and room acoustics analysis [10]. In comparison, the proposed DOAnet outputs the SPS and DOAs of multiple overlapping sources similar to popular DOA estimators like MUSIC, ESPRIT or SRP, without requiring the critical information of the number of active sound sources. A successful implementation of this will enable the integration of such DNN methods into higher-level learning-based end-to-end sound analysis and detection systems.

Recently, several DNN-based approaches have been proposed for DOA estimation [11], [12], [13], [14], [15], [16]. There are six significant differences between them and the proposed method: a) All the aforementioned works focused on azimuth estimation, with the exception of [15], where the 2-D Cartesian coordinates of sound sources in a room were predicted, and [11], which trained separate networks for azimuth and elevation estimation. In contrast, we demonstrate the estimation of both azimuth and elevation for the DOA by sampling the unit sphere uniformly and predicting the probability of a sound source at each direction. b) The past works focused on the estimation of a single DOA at every time frame, with the exception of [13], where localization of azimuth for up to two simultaneous sources was proposed. On the other hand, the proposed DOAnet does not algorithmically limit the number of directions to be estimated, i.e., with a higher number of input audio channels, the DOAnet can potentially estimate a larger number of sound events. c) Past works were evaluated with different array geometries, making comparison difficult. Although the DOAnet can be applied to any array geometry, we evaluate the method using real spherical harmonic input signals, an emerging and popular spatial audio format known as Ambisonics. Microphone signals from various arrays, such as spherical, circular, planar or volumetric, can be transformed to Ambisonic signals by an appropriate transform [17], resulting in a common representation of the 3-D sound recording. Although the DOAnet is scalable to higher-order Ambisonics, in this paper we evaluate it using the compact four-channel first-order Ambisonics (FOA).
d) Regarding classifiers, earlier methods have used fully connected (FC) neural networks [11], [12], [13], [14], [15] and convolutional neural networks (CNN) [16]. In this work, along with CNNs we use recurrent neural network (RNN) layers. The usage of RNNs allows the network to learn long-term temporal information. Such an architecture is referred to as a convolutional recurrent neural network (CRNN) in the literature and is the state-of-the-art method in many single-channel [18], [19] and multichannel [20], [21] audio tasks. e) Previous methods used inter-channel features such as generalized cross-correlation with phase transform (GCC-PHAT) [15], [12], eigen-decomposition of the spatial covariance matrix [13], and inter-channel time delays (ITD) and inter-channel level differences (ILD) [11], [14]. More recently, Chakrabarty et al. [16] proposed to use only the phase component of the spectrogram, avoiding explicit feature extraction. In the proposed method, we use both the magnitude and the phase component. Contrary to [16], which employed omnidirectional sensors only, general arrays with directional microphones additionally encode the DOA information in magnitude differences, while the Ambisonics format especially encodes directional information mainly in the magnitude component. f) All previous methods were evaluated on speech recordings that were synthetically spatialized and spatially static. We continue to use static sound sources in the present work and extend them to a larger variety of sound events, such as impulsive and transient sounds.

[Fig. 1. DOAnet - neural network architecture for direction of arrival estimation of multiple sound sources. The figure details the layer configuration: a stack of 3x3 2-D CNN layers (64 filters each, ReLU) with max-pooling along frequency, two bidirectional GRU layers and a time-distributed linear dense layer producing Output 1, the spatial pseudo-spectrum (100x614); a second, smaller CRNN branch with a time-distributed sigmoid dense layer produces Output 2, the direction of arrival (100x432).]
II. METHOD

The block diagram of the proposed DOAnet is presented in Figure 1. The DOAnet takes multichannel audio as input and first extracts the spectrograms of all the channels. The phases and magnitudes of the spectrograms are mapped using a CRNN to two outputs sequentially. The first output, the spatial pseudo-spectrum (SPS), is generated as a regression task, followed by the DOA estimates as a classification task. The DOA is defined by the azimuth φ and elevation λ with respect to the microphone, and the SPS is the intensity of sound along the DOA, given by S(φ, λ).

In this paper, we use discrete φ and λ, uniformly sampling the 2-D polar coordinate space with a resolution of 10 degrees in both azimuth and elevation, resulting in 614 sampled directions. The SPS is computed at each sampled direction, whereas a subset of 432 directions is used for the DOA, where the elevations are limited between -60 and 60 degrees.
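The following Python sketch illustrates one plausible construction of these direction grids. Only the 10-degree resolution, the 614-direction total and the elevation limits come from the text; the exact sampling convention (azimuth origin, handling of the poles, and the precise composition of the 432-direction DOA subset) is an assumption.

```python
# Hypothetical reconstruction of the direction grids described above.
import numpy as np

def sps_grid(step_deg=10):
    """Uniform azimuth/elevation grid for the SPS output.

    36 azimuths x 17 elevations (-80..80 deg) plus the two poles
    gives the 614 directions mentioned in the text.
    """
    azis = np.arange(-180, 180, step_deg)        # 36 azimuth values
    eles = np.arange(-80, 90, step_deg)          # 17 elevation values
    grid = [(a, e) for e in eles for a in azis]  # 612 directions
    grid += [(0, -90), (0, 90)]                  # poles -> 614 total
    return np.array(grid, dtype=float)

def doa_grid(step_deg=10, max_ele=60):
    """Subset used for the DOA classification output (elevations in
    [-60, 60] deg).  Note: this naive filtering yields 468 directions,
    not the 432 reported in the paper, so the authors' exact subset
    must follow a slightly different convention."""
    g = sps_grid(step_deg)
    return g[np.abs(g[:, 1]) <= max_ele]

print(len(sps_grid()))  # 614
```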
A. Feature extraction

The spectrogram is calculated for each of the audio channels, whose sampling frequency is 44100 Hz. A 2048-point discrete Fourier transform (DFT) is calculated on Hamming windows of 40 ms with 50% overlap. We keep the 1024 values of the DFT corresponding to the positive frequencies, without the zeroth bin. L frames of features, each containing the 1024 magnitude and phase values of the DFT extracted from all the C channels, are stacked in an L × 1024 × 2C 3-D tensor and used as the input to the proposed neural network. The 2C dimension results from ordering the magnitude components of all channels first, followed by the phases. We use a sequence length L of 100 (= 2 s) in this work.
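A minimal sketch of this feature extraction, using scipy and assuming the framing follows directly from the stated parameters (40 ms Hamming windows, 50% overlap, 2048-point DFT, 1024 positive-frequency bins without the zeroth):

```python
import numpy as np
from scipy.signal import stft

FS = 44100
WIN = int(0.04 * FS)   # 40 ms -> 1764 samples
HOP = WIN // 2         # 50% overlap
NFFT = 2048
L = 100                # sequence length (= 2 s)

def extract_features(audio):
    """audio: (num_samples, C) array -> (num_seq, L, 1024, 2C) tensor."""
    mags, phases = [], []
    for ch in range(audio.shape[1]):
        _, _, Z = stft(audio[:, ch], fs=FS, window='hamming',
                       nperseg=WIN, noverlap=WIN - HOP, nfft=NFFT)
        Z = Z[1:NFFT // 2 + 1].T            # drop zeroth bin -> (T, 1024)
        mags.append(np.abs(Z))
        phases.append(np.angle(Z))
    # magnitudes of all channels first, then phases, as described above
    feat = np.stack(mags + phases, axis=-1)  # (T, 1024, 2C)
    # split into non-overlapping sequences of L frames
    T = (feat.shape[0] // L) * L
    return feat[:T].reshape(-1, L, 1024, feat.shape[-1])
```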
B. Direction of arrival estimation network (DOAnet)

Local shift-invariant features are extracted from the input spectrogram tensor (of dimension L × 1024 × 2C) using CNN layers. In every CNN layer, the intra-channel time-frequency features are processed using a receptive field of 3 × 3 and rectified linear unit (ReLU) activations, and the resulting activation map is zero-padded to keep the output dimension equal to the input. Batch normalization and a max-pooling operation along the frequency axis are performed after every CNN layer to reduce the final dimension to L × 2 × N_C, where N_C is the number of CNN filters in the last CNN layer. The CNN activations are reshaped to L × 2N_C, keeping the time axis length unchanged, and fed to RNN layers in order to learn the temporal structure. Specifically, bidirectional gated recurrent units (GRU) with tanh activations are used. Further, the RNN output is mapped to the first output, the SPS, in a regression manner using FC layers with linear activations.

The SPS is further mapped to the DOA estimates, the final output of the proposed method, using a similar CRNN network as above, with two minor architectural changes. An FC layer is introduced between the CNN and RNN layers to reduce the dimension of the RNN output. Additionally, the output layer which predicts the DOA uses sigmoid activations in order to estimate more than one DOA for a given time frame. Each node in this output layer represents a direction in the 2-D polar space. During testing, the probabilities at these nodes are thresholded with a value of 0.5: anything greater suggests the presence of a source in that direction, and otherwise the absence of a source.

We refer to the combined architecture for SPS and DOA estimation in this work as DOAnet. The DOAnet is trained using as target the SPS computed by applying MUSIC (see Section III-B) at each sampled direction for every time frame, represented using nonnegative real numbers. For the DOA output, the DOAnet aims to make a discrete decision about the presence of a source in a certain direction; during training, it uses the ground truth DOAs utilized to synthesize the audio (see Section III-A).

The DOAnet was trained for 1000 epochs using the Adam optimizer, a mean squared error loss for the SPS output and a binary cross-entropy loss for the DOA output. The sum of the two losses was used for backpropagation. Dropout was used after every layer, and early stopping was applied if the DOA metric (Section III-C) did not improve for 100 epochs. The DOAnet was implemented using the Keras framework with a Theano backend.
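For concreteness, the following sketch reproduces the two-branch CRNN in modern Keras syntax (the original implementation used Keras with a Theano backend). Layer sizes follow the configuration shown in Fig. 1, but details such as the dropout rate and the exact pooling in the DOA branch are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

L, F, C2 = 100, 1024, 8   # sequence length, frequency bins, 2C (FOA: C=4)

def conv_block(x, filters, pool):
    x = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D((1, pool))(x)
    return layers.Dropout(0.5)(x)  # dropout rate is an assumption

inp = layers.Input(shape=(L, F, C2))
x = inp
for pool in (8, 8, 4, 2):          # frequency: 1024 -> 128 -> 16 -> 4 -> 2
    x = conv_block(x, 64, pool)
x = layers.Reshape((L, 2 * 64))(x)
for _ in range(2):
    x = layers.Bidirectional(
        layers.GRU(64, activation='tanh', return_sequences=True))(x)
sps = layers.TimeDistributed(layers.Dense(614, activation='linear'),
                             name='sps')(x)

# second CRNN maps the SPS to per-direction DOA probabilities
y = layers.Reshape((L, 614, 1))(sps)
y = conv_block(y, 16, 2)
y = conv_block(y, 16, 2)           # -> roughly L x 153 x 16
y = layers.TimeDistributed(layers.Flatten())(y)
y = layers.TimeDistributed(layers.Dense(32, activation='linear'))(y)
for _ in range(2):
    y = layers.Bidirectional(
        layers.GRU(16, activation='tanh', return_sequences=True))(y)
doa = layers.TimeDistributed(layers.Dense(432, activation='sigmoid'),
                             name='doa')(y)

model = Model(inp, [sps, doa])
# sum of the two losses, as in the training procedure described above
model.compile(optimizer='adam',
              loss={'sps': 'mse', 'doa': 'binary_crossentropy'})
```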
III. EVALUATION

A. Dataset

To our knowledge, there are no publicly available real or synthetic datasets consisting of general sound events, each associated with a 2-D spatial coordinate, on which the proposed DOAnet could be evaluated. Since DNN-based methods need sufficiently large datasets to train on, most DNN-based methods proposed [11], [12], [14], [15], [16] have studied the performance on synthetic datasets. In a similar fashion, we evaluate the proposed DOAnet on synthetic datasets of about the same size as in the previous works.
We synthesize datasets consisting of static point sources, each associated with a spatial coordinate, in two contexts: anechoic and reverberant. For each context, three datasets are generated: with no temporally overlapping sources (O1), a maximum of two overlapping sources (O2), and a maximum of three overlapping sound sources (O3). We refer to the anechoic-context datasets as OxA and the reverberant ones as OxR, where x denotes the number of overlapping sources. Each of these datasets has three cross-validation (CV) splits with 240 recordings for training and 60 for testing. Recordings are sampled at 44.1 kHz and are 30 s long.

In order to generate these datasets, we use the isolated real-life sound event recordings from DCASE 2016 task 2 [22]. This dataset consists of 11 sound event classes, each with 20 examples. The classes in this dataset include speech, coughing, door slams, page turning, phone ringing and keyboard sounds. During CV, for each of the splits, we randomly chose disjoint sets of 16 and 4 examples per class for training and testing, amounting to 176 examples for training and 44 for testing. In order to synthesize a recording, a random subset of the 176 or 44 sound examples was chosen from the respective split. The subset size varied for each recording based on the chosen sound examples. We start synthesizing a recording by randomly choosing the onset of the first randomly chosen sound example within the first second of the recording. The next randomly chosen sound example is placed 250-500 ms after the end of the first sound example. On reaching the maximum recording length of 30 s, the process is repeated as many times as the number of required overlapping sound events.
Each of the sound examples was assigned a DOA randomly, subject to the following conditions. All sound events were placed on a spatial grid of ten degrees resolution along both azimuth and elevation. Two temporally overlapping sound events have at least ten degrees of spatial separation, to avoid spatial overlapping. The elevation was constrained within the range of [-60, 60] degrees, as most natural sound events occur in this range. Finally, for the anechoic dataset, the sound sources were randomly placed at a distance d in the range 1-10 m. For the reverberant dataset, the sound events were randomly placed inside a room of dimensions 10 × 8 × 4 m, with the microphone in the center of the room.

Spatialization for the anechoic case was done as follows. Each point source signal s_i with DOA (φ_i, λ_i) was converted to Ambisonics format by multiplying the signal with the vector y(φ_i, λ_i) = [Y_00(φ_i, λ_i), Y_1(-1)(φ_i, λ_i), Y_10(φ_i, λ_i), Y_11(φ_i, λ_i)]^T of orthonormalized spherical harmonics Y_nm(φ, λ). The complete anechoic sound scene multichannel recording x_A was generated as x_A = Σ_i g_i s_i y(φ_i, λ_i), with the gains g_i < 1 modeling the distance attenuation. Each entry of x_A corresponds to one channel, and g_i = √(1/10^(d/d_max)), where d_max = 10 m is the maximum distance.
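A minimal sketch of this anechoic encoding, assuming N3D-normalized real spherical harmonics in the [Y_00, Y_1(-1), Y_10, Y_11] ordering given above; the normalization convention and the reconstructed gain formula should be treated as assumptions:

```python
import numpy as np

def sh_foa(azi_deg, ele_deg):
    """Real orthonormalized spherical harmonics [Y00, Y1(-1), Y10, Y11]."""
    a, e = np.radians(azi_deg), np.radians(ele_deg)
    c = np.sqrt(3.0 / (4.0 * np.pi))
    return np.array([np.sqrt(1.0 / (4.0 * np.pi)),
                     c * np.cos(e) * np.sin(a),   # Y_1(-1)
                     c * np.sin(e),               # Y_10
                     c * np.cos(e) * np.cos(a)])  # Y_11

def encode_scene(sources, d_max=10.0):
    """sources: list of (signal, azi_deg, ele_deg, distance_m) tuples.
    Returns the 4-channel scene x_A = sum_i g_i * s_i * y(azi, ele)."""
    n = max(len(s) for s, *_ in sources)
    x = np.zeros((n, 4))
    for s, azi, ele, d in sources:
        g = np.sqrt(1.0 / 10.0 ** (d / d_max))    # distance attenuation
        x[:len(s)] += g * np.outer(s, sh_foa(azi, ele))
    return x
```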
In the reverberant case, a fast geometrical acoustics simulator was used to model natural reverberation based on the rectangular-room image-source model [23]. For each point source s_i with DOA in the dataset, K image sources were generated, modeling reflections up to a predefined time limit. Based on the room and its propagation properties, each image source was associated with a propagation filter h_ik and a DOA (φ_k, λ_k), resulting in the spatial impulse response h_i = Σ_{k=1}^{K} h_ik y(φ_k, λ_k). The reverberant scene signal was finally generated by x_R = Σ_i s_i ∗ h_i, where (∗) denotes convolution of the source signal with the spatial impulse responses. The room absorption properties were adjusted to match reverberation times of typical office spaces. Three sets of testing data were generated: with a room size similar to the training data (Room 1), with 80% of the room size (8 × 8 × 4 m) and reverberation time (Room 2), and with 60% of the room size (8 × 6 × 4 m) and reverberation time (Room 3).

B. Baseline

The proposed method is, to our knowledge, the first DNN-based implementation for 2-D DOA estimation of multiple overlapping sound events. Thus, in order to evaluate the complete features of the proposed DOAnet, we compare its performance with the conventional, high-resolution DOA estimator based on MUSIC. Similar to the SPS and DOA outputs estimated by the DOAnet, the MUSIC method also estimates the SPS and DOAs, thus allowing a direct one-to-one comparison.

The MUSIC SPS is based on a measure of orthogonality between the signal subspace (dominated by the source signals) of the spatial covariance matrix C_s and the noise subspace (dominated by diffuse and ambient sounds, late reverberation, and microphone noise). The spatial covariance matrix is calculated as C_s = E_{f,t}[X(f, t)X(f, t)^H], where the spectrogram X(f, t) is a frequency f and time t dependent C-dimensional vector, C is the number of channels, H denotes the conjugate transpose and E_{f,t} denotes the expectation over f and t. For a sound scene with O sources, the MUSIC SPS S_GT is obtained from C_s by first performing the eigenvalue decomposition C_s = EΛE^H. The eigenvectors E, sorted according to eigenvalues of decreasing magnitude, are further partitioned into the two aforementioned subspaces, E = [U_s U_n], where the signal subspace U_s is composed of the O eigenvectors corresponding to the largest eigenvalues, and the rest form the noise subspace U_n. The SPS along the direction (φ_i, λ_i) is then given by S_GT(φ_i, λ_i) = 1/(y^T(φ_i, λ_i) U_n U_n^H y(φ_i, λ_i)). Finally, the source DOAs are found by selecting the directions (φ_i, λ_i) corresponding to the O largest peaks of S_GT.
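The computation can be sketched as follows; the grid and steering vectors come from this paper's setup, while the rest is the standard MUSIC recipe:

```python
import numpy as np

def music_sps(X, steering, num_src):
    """X: (F, T, C) complex STFT; steering: (D, C) real spherical-harmonic
    vectors y(azi, ele) for the D grid directions; num_src: source count O.
    Returns the D-dimensional pseudo-spectrum S_GT."""
    F, T, C = X.shape
    Xf = X.reshape(-1, C)                    # pool over frequency and time
    Cs = (Xf.T @ Xf.conj()) / Xf.shape[0]    # spatial covariance E[X X^H]
    w, E = np.linalg.eigh(Cs)                # eigenvalues in ascending order
    Un = E[:, :C - num_src]                  # noise subspace (smallest ones)
    M = Un @ Un.conj().T
    # S(phi, lambda) = 1 / (y^T Un Un^H y) for each grid direction
    proj = np.abs(np.einsum('dc,ck,dk->d', steering, M, steering))
    return 1.0 / np.maximum(proj, 1e-12)
```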
C. Metric

The DOAnet-estimated SPS (S_E(φ, λ)) is evaluated with respect to the baseline MUSIC-estimated ground truth (S_GT(φ, λ)) using the SNR metric, calculated as SNR = 10 log10( Σ_φ Σ_λ S_GT(φ, λ)² / Σ_φ Σ_λ (S_E(φ, λ) − S_GT(φ, λ))² ).

As the DOA metric, we use the angle, in degrees, between the estimated DOA (defined by azimuth φ_E and elevation λ_E) and the ground truth DOA (φ_GT, λ_GT) used to synthesize the dataset. This great-circle angle is calculated as σ = arccos( sin λ_E sin λ_GT + cos λ_E cos λ_GT cos(φ_GT − φ_E) ) · 180/π. Further, to accommodate the scenario of an unequal number of estimated and ground truth DOAs, we calculate and report the minimum total distance between them using the Hungarian algorithm [24], along with the percentage of frames in which the number of estimated DOAs was correct. The final metric for the entire dataset, referred to as the DOA error, is calculated by normalizing the minimum distance by the total number of estimated DOAs.
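A sketch of both metrics, using scipy's linear_sum_assignment as the Hungarian solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sps_snr_db(s_est, s_gt):
    """SNR = 10 log10( sum S_GT^2 / sum (S_E - S_GT)^2 )."""
    return 10.0 * np.log10(np.sum(s_gt ** 2) /
                           np.sum((s_est - s_gt) ** 2))

def angular_error_deg(azi_e, ele_e, azi_g, ele_g):
    """Great-circle angle between two (azimuth, elevation) pairs, degrees."""
    ae, ee, ag, eg = map(np.radians, (azi_e, ele_e, azi_g, ele_g))
    return np.degrees(np.arccos(np.clip(
        np.sin(ee) * np.sin(eg) +
        np.cos(ee) * np.cos(eg) * np.cos(ae - ag), -1.0, 1.0)))

def frame_doa_error(est, gt):
    """est, gt: lists of (azi, ele) pairs.  Returns the minimum total
    angular distance over all pairings (Hungarian algorithm); the DOA
    error then normalizes the summed distances over the whole dataset
    by the total number of estimated DOAs."""
    cost = np.array([[angular_error_deg(*e, *g) for g in gt] for e in est])
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()
```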
D. Evaluation procedure

The parameter tuning for the DOAnet was performed on the O1A test data, and the best configuration is shown in Figure 1. This configuration has 677 K weights, and the same configuration is used in all of the following studies.

At test time, the SNR metric for the SPS output of the DOAnet (S_E) is calculated with respect to the SPS of the baseline MUSIC (S_GT). The DOA metrics for the DOAs predicted by the DOAnet and by the baseline MUSIC are calculated with respect to the ground truth DOAs used to synthesize the dataset.

In the above experiment, the baseline MUSIC algorithm uses knowledge of the number of active sources. In order to have a fair evaluation, we also test the DOAnet in a similar scenario where the number of sources is known. We use this knowledge to choose the top probabilities in the prediction layer of the DOAnet, instead of thresholding them with a value of 0.5.
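The two decision rules for the DOA output layer can be sketched as:

```python
import numpy as np

def decode_doas(probs, grid, num_src=None, thresh=0.5):
    """probs: (432,) sigmoid outputs for one frame; grid: (432, 2) array
    of (azimuth, elevation) in degrees; num_src: known count or None."""
    if num_src is None:
        idx = np.where(probs > thresh)[0]     # unknown source count
    else:
        idx = np.argsort(probs)[-num_src:]    # known source count: top-k
    return grid[idx]
```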
IV. RESULTS AND DISCUSSION

The results of the evaluations are presented in Table I.

TABLE I
EVALUATION METRIC SCORES FOR THE SPATIAL POWER MAP AND DOAS ESTIMATED BY THE DOANET FOR DIFFERENT DATASETS.

                                        Anechoic              Reverberant (Room 1)
Max. no. of overlapping sources      1      2      3         1      2      3
SPS SNR (in dB)                    9.90   3.35  -0.26      3.11   1.24   0.13
DOA error with unknown number of active sources (threshold of 0.5)
DOAnet                             0.57   8.03  18.34      6.31  11.46  38.41
Correctly predicted frames (in %)  95.4   42.7    1.8      59.3   15.8    1.2
DOA error with known number of active sources
DOAnet                             1.14  27.52  49.30     12.61  38.98  67.07
MUSIC                              2.29   8.60  28.66     25.80  57.33  91.72

The high SNRs for the SPS in both contexts, with up to one and two overlapping sound events, show that the SPS generated by the DOAnet (S_E) is comparable with the baseline MUSIC SPS (S_GT). Figure 2 shows S_E and the respective S_GT when two active sources are closely located. In the case of up to three overlapping sound events, the baseline MUSIC is already at its theoretical limit of estimating N − 1 sources from an N-dimensional signal space [25]. In practice, for N − 1 sources only one noise subspace vector U_n is used to generate the SPS, which for real signals is too weak for a stable estimate. In the present evaluation, where the DOAnet is trained with four-channel audio features and the MUSIC SPS, the SPS used for the case of three overlapping sound sources is therefore an unstable estimate, resulting in poor training and consequently poor results. With more than four input channels, to which the proposed DOAnet can easily be extended, it can potentially localize more than two sound sources simultaneously.

The DOA error of the proposed DOAnet when the number of active sources is unknown is presented in Table I. The DOAnet error is considerably better than that of the baseline MUSIC, which uses knowledge of the active sources, for all datasets. However, the number of frames in which the DOAnet produced the correct number of active sources was small. For example, in the case of anechoic recordings with up to two overlapping sound events, only 42.7% of the estimated frames had the correct number of DOA predictions. This percentage drops drastically when the number of sources is three, due to the theoretical limit of MUSIC as explained previously, and consequently for the DOAnet, as the MUSIC SPS is used for training. Finally, the confusion matrices for the number of DOA estimates per frame for the O1 and O2 datasets are visualized in Figure 3.

[Fig. 2. SPS for two closely located sound sources: (a) MUSIC estimated, (b) DOAnet estimated. The black-cross markers represent the ground truth DOA. The horizontal axis is azimuth and the vertical axis is elevation angle (in degrees).]

[Fig. 3. Confusion matrix for the number of DOAs estimated per frame by the DOAnet for (a) O1A, (b) O2A, (c) O1R, (d) O2R. The horizontal axis is the DOAnet estimate, and the vertical axis is the ground truth.]
We skipped the confusion matrices for the O3 datasets, as they were not meaningful for reasons similar to those explained above.

With knowledge of the number of active sources (Table I), the DOAnet performs considerably better than the baseline MUSIC for all datasets other than O2A and O3A. The MUSIC DOAs were chosen using a 2-D peak finder on the MUSIC SPS, whereas the DOAs of the DOAnet were chosen by simply picking the top probabilities in the final DOA prediction layer. A smarter peak-picking method for the DOAnet, or using the number of sources as an additional input, can potentially result in better scores across all datasets.

Further, the DOAnet error on unmatched reverberant data is presented in Table II. The performance of the DOAnet is seen to be consistent with that on the matched reverberant data in Table I, and significantly better than the performance of MUSIC.

TABLE II
EVALUATION SCORES FOR UNMATCHED REVERBERANT ROOMS.

                                      Room 2          Room 3
Max. no. of overlapping sources      1      2        1      2
SPS SNR (in dB)                    3.53   1.49     3.49   1.46
DOA error (unknown number of sources)
DOAnet                             3.44   6.88     4.59  10.89
Correctly predicted frames (in %)  46.2   14.3     49.7   14.1
DOA error (known number of sources)
DOAnet                             8.60  32.10     9.17  33.82
MUSIC                             31.52  58.47    33.25  60.76

In this paper, since the baseline was chosen to be MUSIC, the DOAnet was also trained using the MUSIC SPS for a fair comparison. In an ideal scenario, considering that the DOAnet is trained using datasets for which the ground truth DOAs are known, we could generate accurate high-resolution SPS from the ground truth DOAs as required by the application and use them for training. Alternatively, the DOAnet can be trained without the SPS to directly generate the DOAs; the SPS was only used in this paper to present the complete potential of the method in the limited paper space. In general, the above results show that the proposed DOAnet has the potential to learn the 2-D direction information of multiple overlapping sound sources directly from the spectrogram of the input audio, without knowledge of the number of active sound sources. An exhaustive study with more detailed experiments, including both synthetic and real datasets, is planned for future work.

V. CONCLUSION

A convolutional recurrent neural network (DOAnet) was proposed for multiple source localization. The DOAnet was shown to learn the number of active sources directly from the input spectrogram and to estimate precise DOAs in the 2-D polar space. The method was evaluated on anechoic, matched and unmatched reverberant datasets. The proposed DOAnet performed considerably better than the baseline MUSIC in most scenarios, showing the potential of the DOAnet to learn a highly computational algorithm without prior knowledge of the number of sources.

REFERENCES

[1] M. Woelfel and J. McDonough, Distant Speech Recognition. Wiley, 2009.
[2] J. Nikunen and T. Virtanen, "Direction of arrival based spatial covariance model for blind sound source separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 3, 2014.
[3] A. Politis et al., "Sector-based parametric sound field reproduction in the spherical harmonic domain," IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 5, pp. 852-866, 2015.
[4] Y. Huang et al., "Real-time passive source localization: a practical linear-correction least-squares approach," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 8, 2001.
[5] M. S. Brandstein and H. F. Silverman, "A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1997.
[6] R. O. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, 1986.
[7] R. Roy and T. Kailath, "ESPRIT - estimation of signal parameters via rotational invariance techniques," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 7, 1989.
[8] J. H. DiBiase et al., "Robust localization in reverberant rooms," in Microphone Arrays. Springer, 2001, pp. 157-180.
[9] A. O'Donovan et al., "Imaging concert hall acoustics using visual and audio cameras," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008.
[10] D. Khaykin and B. Rafaely, "Acoustic analysis by spherical microphone array processing of room impulse responses," The Journal of the Acoustical Society of America, vol. 132, no. 1, 2012.
[11] R. Roden et al., "On sound source localization of speech signals using deep neural networks," in Deutsche Jahrestagung für Akustik (DAGA), 2015.
[12] X. Xiao et al., "A learning-based approach to direction of arrival estimation in noisy and reverberant environments," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[13] R. Takeda and K. Komatani, "Discriminative multiple sound source localization based on deep neural networks using independent location model," in IEEE Spoken Language Technology Workshop (SLT), 2016.
[14] A. Zermini et al., "Deep neural network based audio source separation," in International Conference on Mathematics in Signal Processing, 2016.
[15] F. Vesperini et al., "A neural network based algorithm for speaker localization in a multi-room environment," in IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2016.
[16] S. Chakrabarty and E. A. P. Habets, "Broadband DOA estimation using convolutional neural networks trained with noise signals," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017.
[17] H. Teutsch, Modal Array Signal Processing: Principles and Applications of Acoustic Wavefield Decomposition. Springer, 2007, vol. 348.
[18] T. N. Sainath et al., "Convolutional, long short-term memory, fully connected deep neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[19] M. Malik et al., "Stacked convolutional and recurrent neural networks for music emotion recognition," in Sound and Music Computing Conference (SMC), 2017.
[20] T. Sainath et al., "Multichannel signal processing with deep neural networks for automatic speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, 2017.
[21] S. Adavanne et al., "Sound event detection using spatial features and convolutional recurrent neural network," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[22] E. Benetos et al., "Sound event detection in synthetic audio," http://www.cs.tut.fi/sgn/arg/dcase2016/, 2016.
[23] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 65, no. 4, 1979.
[24] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, pp. 83-97, 1955.
[25] B. Ottersten et al., "Exact and large sample maximum likelihood techniques for parameter estimation and detection in array processing," in Radar Array Processing, Springer Series in Information Sciences, 1993.
