
Applied Acoustics 223 (2024) 110067

Contents lists available at ScienceDirect

Applied Acoustics
journal homepage: www.elsevier.com/locate/apacoust

Multi-stage temporal representation learning via global and local perspectives for real-time speech enhancement

Hoang Ngoc Chau, Nguyen Thi Nhat Linh, Tuan Kiet Doan, Quoc Cuong Nguyen ∗

School of Electrical and Electronic Engineering, Hanoi University of Science and Technology, Hanoi, 100000, Viet Nam

A R T I C L E  I N F O

Keywords: Speech enhancement; Deep learning-based; Global and local modeling; Self-attention; Graph convolution

A B S T R A C T

Deep learning-based speech enhancement algorithms have been rapidly developed over the past few years. Although numerous approaches have been proposed, global and local information from speech features has not been thoroughly investigated. In this paper, we introduce a novel and highly effective speech enhancement network called the Multi-stage Global-Local Network (MSGLN), which exploits both local and global information via temporal self-attention, temporal graph convolution, and 1D convolution. Local modeling blocks capture the fast changes in speech signals, while global modeling blocks learn long-term trends in noise or speech signals through factors such as pitch, tone, resonance, timbre, and rhythm. In addition, we propose a multi-stage temporal processing module as the bottleneck of a complex convolutional encoder-decoder structure to guide our network to learn different acoustic structures at different scales. A dual-path RNN postprocessing module is then integrated to reconstruct the speech spectrum mask using a frequency-wise temporal refinement block followed by a frame-wise spectral refinement block. Experimental results demonstrate the superior performance of the proposed methodology compared to other state-of-the-art methods on both real-time single- and multi-channel speech enhancement tasks.

1. Introduction

Speech enhancement is widely applied in many real-world applications, such as telecommunication, automatic speech recognition, and hearing aids. In particular, speech enhancement attempts to recover clean speech from the corrupted mixture to improve speech intelligibility and perceptual quality. Recently, deep neural networks (DNNs) have shown impressive performance in speech enhancement tasks due to their superior capacity in handling non-stationary or impulsive noise compared to traditional statistical signal-processing-based techniques [1–3], and thus DNNs have become the de-facto standard in this field of research.

Numerous DNN-based techniques have been proposed and have successfully achieved excellent performance in speech enhancement. The majority of recent methods adhere to the following popular designs, namely encoder-decoder architectures, dual-path recurrent neural networks (RNNs), temporal convolutional networks (TCNs), transformer-based networks, and multi-stage/branch architectures. Encoder-decoder-based approaches [3–8] utilize deep convolution layers to downsample the spectrum resolution and reconstruct the target output through deconvolution layers, which enables the network to learn hierarchical and spatial features. Alternatively, dual-path RNN-based approaches [9–14] iteratively stack two RNN blocks to perform sequence modeling in both the frequency and time dimensions. These techniques obtain remarkable results in modeling long-term dependencies but usually incur a high computational cost due to the recurrent nature of RNNs. In [7,8,15–17], TCNs were leveraged to capture long-term temporal dependencies. TCNs consist of stacks of multiple dilated convolution layers that enlarge the receptive field while keeping the computational cost low. Thanks to their capability of modeling long-range dependencies, transformers [18] have dominated a wide range of tasks in recent years, including computer vision and natural language processing, which has inspired several applications in speech enhancement [8,14,19–22]. To address real-time noise suppression, transformers are implemented in a causal configuration, where one frame cannot attend to future frames. Further improvements were reported in [21–26] by introducing multi-stage or multi-branch designs, which are combinations of advanced single-stage speech enhancement architectures.

* Corresponding author.
E-mail addresses: chau.hn222175m@sis.hust.edu.vn (N.C. Hoang), linh.ntn200350@sis.hust.edu.vn (T.N.L. Nguyen), kiet.dt202753@sis.hust.edu.vn
(T.K. Doan), cuong.nguyenquoc@hust.edu.vn (Q.C. Nguyen).

https://doi.org/10.1016/j.apacoust.2024.110067
Received 9 February 2024; Received in revised form 1 May 2024; Accepted 3 May 2024
0003-682X/© 2024 Elsevier Ltd. All rights reserved.

Among single-stage architectures, the encoder-decoder structure is the most dominant choice for real-time speech enhancement. This design approach is based on the idea of leveraging local information of speech signals across frequency bins to extract local spectral relations. In addition, squeezing the frequency resolution alleviates the burden of processing temporal dependencies in the bottleneck layer, which is the pivotal objective of sequence modeling tasks such as speech processing or time series analysis, as signals fluctuate substantially through time.

Deep neural networks' powerful nonlinear approximation has shifted the focus of research to all-neural approaches [8,27,28]. Specifically, both single- and multi-channel speech enhancement were addressed using the same neural architectures in these studies. MTFAA [8,28] achieved the best results in the ICASSP 2022 Deep Noise Suppression (DNS) challenge [29], the ICASSP 2022 Acoustic Echo Cancellation (AEC) challenge [30], and the L3DAS22 challenge [31], demonstrating the feasibility of an efficient architecture that jointly exploits spatial-temporal-spectral characteristics of the speech spectrum. The current success in speech enhancement mainly comes from the effectiveness of temporal modeling with dilated convolutional networks and long short-term memory networks (LSTMs), owing to their strong internal inductive bias. Up to now, most advanced temporal modeling techniques in speech enhancement use LSTMs and TCNs as key operations, while long-term and adaptive context modeling operations like self-attention [18] are merely incremental elements [29]. To capture both global and local speech characteristics, some works applied Conformer [32], which adopts the self-attention operation to learn global features and the convolution operation to learn local features, to speech enhancement models [21,33]. However, these methods did not discuss in detail what global and local features are learned, or whether TCNs and LSTMs can be replaced in temporal modeling, since their encoders and decoders are TCN-based architectures.

In this work, we propose a novel multi-stage temporal modeling network through both global and local perspectives (MSGLN). Specifically, we adopt the encoder-decoder architecture and propose a temporal processing mechanism in the bottleneck layer, which does not utilize any dilated convolutions or LSTMs. We leverage two options for the long-term adaptive modeling operation as the global module to capture global correlations, including self-attention and graph convolution [34], along with 1D convolution to extract short-term changes in the speech signal. The global module is designed as a sequence of multiple self-attention blocks or graph convolution blocks to progressively refine the global correlations through time and capture increasingly complex patterns. In contrast, the local module is comprised of a sequence of local convolution blocks based on normal 1D CNNs to bring the locality inductive bias into the network. Our temporal processing mechanism employs multiple stages of global and local modeling, which provides deep feature extraction and diverse ranges of aggregated information. Finally, a dual-path RNN postprocessing module is proposed after the decoder to further refine the target spectrum mask at the time-frequency bin level for redundant noise reduction. MSGLN addresses both single- and multi-channel speech enhancement tasks and exhibits state-of-the-art performance on two large-scale datasets: the DNS-challenge dataset [2] and the INTERSPEECH 2021 ConferencingSpeech Challenge dataset [35]. Different from mainstream approaches relying on dilated convolutions or LSTMs for temporal modeling, our proposed method is entirely based on self-attention, graph convolutions, and dilation-free CNNs, which could motivate novel directions for future research.

The rest of this paper is organized as follows. In Section 2, we describe the signal model and review recent related work on single- and multi-channel speech enhancement using DNNs as well as temporal modeling techniques in speech enhancement. Section 3 provides a detailed presentation of the proposed methodology. In Section 4, we present information on datasets, experimental configurations, results, and discussions. Section 5 concludes this paper.

2. Background and related work

2.1. Signal model

In the Short-Time Fourier Transform (STFT) domain, a noisy speech signal captured by 𝑀 microphones can be formulated as:

𝐗(𝑡, 𝑓) = 𝐒(𝑡, 𝑓) + 𝐍(𝑡, 𝑓),   (1)

where 𝐗(𝑡, 𝑓), 𝐒(𝑡, 𝑓) and 𝐍(𝑡, 𝑓) represent the complex-valued (𝑡, 𝑓) bins of the noisy mixture, the clean speech received at the microphones, and the noise of the 𝑀 channels at frequency index 𝑓 ∈ {1, ..., 𝐹} and time index 𝑡 ∈ {1, ..., 𝑇}. Here 𝑇 is the number of time frames and 𝐹 is the number of frequency bins. For the monaural setting, the number of microphones is 𝑀 = 1.

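To make Eq. (1) concrete, the short sketch below forms a single-channel (M = 1) mixture and checks that the STFT of the mixture equals the sum of the speech and noise spectra. The waveforms are random stand-ins and the 32 ms/50% window settings are taken from Section 4.2; this is only an illustration, not the authors' code.

```python
import torch

sr = 16000
s = torch.randn(sr)          # stand-in for clean speech s(n)
n = 0.3 * torch.randn(sr)    # stand-in for additive noise n(n)
x = s + n                    # time-domain noisy mixture

def spec(sig):
    # 32 ms Hanning window, 50% overlap, 512 STFT points -> 257 frequency bins
    win = torch.hann_window(512)
    return torch.stft(sig, n_fft=512, hop_length=256, window=win, return_complex=True)

X, S, N = spec(x), spec(s), spec(n)          # complex spectra of shape (F, T)
print(torch.allclose(X, S + N, atol=1e-5))   # True: the STFT is linear, as in Eq. (1)
```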

Fig. 1. The overall network architecture of MSGLN.

2.2. Single- and multi-channel speech enhancement using DNNs

Depending on the number of channels of the input signal, a speech enhancement system can be categorized as a single-channel or multi-channel approach. Single-channel (monaural) speech enhancement techniques target removing environmental noise from speech signals captured by a single microphone. Recently, DNN-based methods have shown impressive results in real-time monaural speech enhancement [2,36]. In [5], the authors presented a complex convolutional recurrent network and ranked first in the DNS 2020 challenge's real-time track. The network leveraged complex operations in CNNs and LSTMs to preserve the phase information of the complex STFT spectrum. Later, Hao et al. [37] proposed a full-band and sub-band fusion network to capture long-range cross-band dependencies and local spectral patterns simultaneously. Li et al. [25] extended the enhancement network to a two-stage pipeline, where the first stage estimates a coarse spectral magnitude and the second stage operates on the estimated magnitude with the original phase information to produce the final estimation. Inspired by this work, the authors in [38,39] proposed collaborative-style frameworks to model the magnitude and phase information separately by two parallel paths, which further improved phase retrieval. To enhance the frequency representation, Zhao et al. [6] incorporated RNN-based frequency recurrence blocks in the encoder-decoder architecture and attained new state-of-the-art results. Alternatively, Chen et al. [23] designed a multi-domain system that utilizes different input representations, including the magnitude, the complex spectrum, and the time domain.

Different from single-channel systems, multi-channel approaches can exploit spatial information from the microphone array. Previously, many studies proposed using DNNs to estimate spatial filter weights [40–42]. However, recent results demonstrated that end-to-end DNN-based models acting as joint nonlinear spatial-spectral-temporal filters significantly outperform traditional spatial filter estimation approaches [7,11]. Consequently, the multi-channel research direction has shifted toward end-to-end DNNs, which directly estimate the clean speech signal from the multi-channel noisy input. Ren et al. [43] proposed a causal multi-input multi-output UNet structure to address real-time multi-channel speech enhancement, which ranked first in the INTERSPEECH 2021 ConferencingSpeech challenge. In [3,44], a densely-connected convolutional recurrent architecture was adopted to boost the inter-channel and feature representation through dense connectivity and convolutional skip pathways. TPARN [10] performed speech enhancement in the time domain by extending dual-path RNNs [9] to a triple-path network with a third path for spatial modeling. Each path is composed of RNNs followed by a self-attention block. In [7], Li et al. introduced an encoder-decoder architecture called EaBNet with three key components, including S-TCN for temporal modeling, LSTMs for sub-band postprocessing, and U2-Net in the encoder-decoder structure, where each encoder or decoder layer consists of a sub-UNet. This work inspired several subsequent works [45,46], where the beamforming process is revisited and reformulated using Taylor's expansion theory. In addition, the authors proposed to replace all the derivative terms in their Taylor superimposition by TCN-based networks, leading to all-neural architectures.

Thanks to the powerful nonlinear approximation of DNNs, all-neural speech enhancement networks not only can extract complex temporal-spectral characteristics in single-channel cases but also can additionally model the spatial information from the microphone array in multi-channel cases. MTFAA [8,28] validated this idea by leveraging the same network architecture and ranking first in two challenges for single-channel scenarios (DNS22, AEC22) and one challenge for multi-channel scenarios (L3DAS22). In addition, the work [27] also proposed a network inspired by EaBNet [7] and achieved impressive results in both single- and multi-channel speech enhancement. In general, such a convenient network design can offer more flexibility in practical situations.

2.3. Temporal modeling in speech enhancement

The success of recent real-time speech enhancement systems mainly comes from the effectiveness of the temporal modeling module. Based on their designs, the backbone of the temporal modeling module can be TCNs, RNNs, graph convolutional networks (GCNs), or transformers. TCNs are built upon stacks of dilated convolution layers, where the dilation rate increases progressively. This mechanism significantly enlarges the model's receptive field, which benefits long-term dependency modeling and thus makes the TCN-based design a suitable choice for speech enhancement. TCNs have been utilized in many state-of-the-art methods [7,8,16,25,47]. To ease the parameter burden of stacking multiple dilated convolution blocks, [7,25] designed a lightweight module called S-TCM while preserving network performance. MTFAA [8] introduced TCNs in the encoder, the decoder, and the bottleneck of the encoder-decoder architecture, which offers temporal representation learning at multiple scales. Alternatively, RNN-based temporal modules have also been widely adopted [5,6,48–50] due to their superiority in sequence modeling. One of the most popular applications of RNNs in speech enhancement models is employing LSTMs in the bottleneck layer of the encoder-decoder architecture [5,48,49]. In real-time encoder-decoder networks, the bottleneck layer contains high-level features at a coarse spectral resolution, which provides a time sequence with the richest features as an ideal input for LSTMs. Several works [11,13,49,50] adopted dual-path RNNs [9], where a frame-wise LSTM operates on frequency sequences and a subsequent frequency-wise LSTM models time sequences. As in [6], the frequency representation was further boosted by incorporating complex RNNs in every encoder and decoder layer, which can be viewed as multi-scale frequency representation learning. Based on the flexibility of graph convolutional networks [51], the authors in [34] proposed a different temporal modeling approach and demonstrated remarkable results of GCNs compared to TCNs and LSTMs. Besides, some attempts have been made to adapt transformers to temporal modeling [20,21,33,52]. Inspired by the dual-path network [9], TSTNN [21] designed a two-stage transformer network with two sequential transformer blocks as intra- and inter-chunk operations. Additionally, the authors proposed an improved transformer version by introducing GRUs [53] after the self-attention block. Zhao et al. [33] took advantage of the global-local modeling of the Conformer network [32] and proposed a complex dual-path conformer network. However, these works were tested on a small-scale dataset [54] and were not designed in a causal manner to address real-time speech enhancement. Despite their remarkable performance in many other aspects, transformer-based backbones for temporal modeling have not been validated thoroughly in speech enhancement. First, existing transformer-based temporal modules were not compared directly with other state-of-the-art backbones like TCNs or LSTMs. Second, recent transformer-based models utilize TCNs in the encoder and decoder, which also contribute to temporal modeling, and thus the effectiveness of transformers was not clarified. In the current state-of-the-art systems [8,20], self-attention operations are merely incremental to prevalent LSTMs and TCNs. In general, an efficient way to adopt global operations such as self-attention or graph convolution as the backbone for temporal modeling in real-time speech enhancement still needs to be investigated.

3. Methodology

This section mainly introduces the proposed MSGLN. Firstly, we give a brief overview of MSGLN. Then, we present the principles and the structure of MSGLN, followed by a detailed description of each component. Finally, the joint loss function used in MSGLN is presented.

3.1. Overview of proposed network

The overall speech enhancement network is illustrated in Fig. 1. Our network aims to estimate the clean speech signal from a noisy mixture 𝐱. The short-time Fourier transform (STFT) is first performed on the noisy signal 𝐱 to obtain a complex spectrum 𝐗 ∈ ℂ^(𝐶_in × 𝐹 × 𝑇), where 𝐶_in is the number of input channels and 𝐹 × 𝑇 represents the time-frequency resolution of the speech spectrum. The proposed network adopts a complex encoder-decoder architecture [5] to produce the complex ideal ratio mask (cIRM) [55] for subsequent complex multiplication with the input spectrum. Finally, the output waveform signal is obtained through the inverse STFT.

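The end-to-end flow just described (STFT, complex mask estimation, complex multiplication, inverse STFT) can be summarized by the following minimal sketch. The enhancement network is replaced by a one-layer placeholder, so the class names, shapes, and 512/256 frame settings are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TinyMaskNet(nn.Module):
    """Placeholder for the complex encoder-decoder and its bottleneck: it only
    needs to output a complex ratio mask with the same (F, T) shape as the input."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Conv2d(2, 2, kernel_size=1)   # acts on stacked real/imag parts

    def forward(self, spec: torch.Tensor) -> torch.Tensor:   # spec: (B, F, T) complex
        ri = self.proj(torch.stack([spec.real, spec.imag], dim=1))   # (B, 2, F, T)
        return torch.complex(ri[:, 0], ri[:, 1])                     # complex mask

def enhance(noisy: torch.Tensor, net: nn.Module, n_fft: int = 512, hop: int = 256):
    """STFT -> estimate cIRM -> complex multiplication -> inverse STFT."""
    win = torch.hann_window(n_fft)
    X = torch.stft(noisy, n_fft, hop, window=win, return_complex=True)   # (F, T)
    M = net(X.unsqueeze(0)).squeeze(0)         # estimated complex ratio mask
    S_hat = M * X                               # element-wise complex masking
    return torch.istft(S_hat, n_fft, hop, window=win, length=noisy.shape[-1])

noisy = torch.randn(16000)                      # 1 s placeholder waveform at 16 kHz
print(enhance(noisy, TinyMaskNet()).shape)      # torch.Size([16000])
```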

Fig. 2. Flow diagrams of the Global Module (a), the Local Module (b), the Self-Attention Block (c), and the Temporal Graph Convolution Block (d).

The encoder layers are composed of a complex 2D convolution layer (Conv2d), a complex batch normalization (BN), and a PReLU activation function [56]. The decoder is built similarly with 2D transposed convolutions (DeConv2d). The multi-stage temporal processing module is designed in the bottleneck to help the network learn temporal patterns through both global and local perspectives at different scales. Furthermore, a dual-path RNN postprocessing module is introduced for residual noise reduction by sequential refinement at the time-frequency bin level of the original spectrum.

3.2. Multi-stage temporal processing module

The multi-stage temporal processing module is the key for MSGLN to extract rich temporal patterns. This module contains three global-local blocks (GLBs), which benefit modeling global correlations and capturing short-term changes in speech features.

3.2.1. Global module

The global module is designed to model long-term temporal dependencies. Existing approaches commonly use LSTMs or TCNs for this objective, but in this work we propose two alternative options: the self-attention mechanism and the graph convolution, as shown in Fig. 2(a). Self-attention and graph convolution operations offer global receptive fields and dynamic information aggregation conditioned on the input, which provides a clear advantage over prevalent temporal modeling techniques like TCNs and LSTMs. In our design, the global module is a stack of 𝑚 self-attention blocks (SABs) or temporal graph convolutional blocks (TGCBs) [34].

As illustrated in Fig. 2(c), we use the expression of the canonical transformer for SABs. Given that the input of the SAB has the shape 𝐶 × 𝑇, the query, key, and value matrices are denoted as 𝐐 ∈ ℝ^(𝐻_G × 𝑇), 𝐊 ∈ ℝ^(𝐻_G × 𝑇), and 𝐕 ∈ ℝ^(𝐻_G × 𝑇), respectively. Here 𝐻_G is the hidden channel of the query, key, and value matrices, and PConv denotes a 1D pointwise convolution. Through the masked scaled dot-product operation, the score matrix can be described as:

𝐒 = Softmax(Mask(𝐐𝐊^T / √𝐶)),   (2)

where 𝐒 ∈ ℝ^(𝑇 × 𝑇) denotes the masked score matrix. The upper triangular part of the score matrix is masked to ensure the causal configuration for real-time processing. The attention score 𝐒 is multiplied with the value matrix 𝐕 to complete a self-attention operation. Subsequently, the output is projected back to the original dimension 𝐶 by a pointwise convolution layer, a BN, and a PReLU activation. To preserve feature diversity, the projected output is added with a residual connection and followed by a multi-layer perceptron (MLP), which is composed of two fully connected layers.

The TGCB is designed as depicted in Fig. 2(d) with the same configurations as described in [34]. In brief, the graph convolution is formulated as:

𝒢′ = ℎ(𝑔(𝒢), 𝒲),   (3)

where 𝒢 and 𝒢′ are the constructed graph and the graph convolution output, respectively, 𝑔(⋅) represents the aggregate operation, ℎ(⋅) denotes the update operation, and 𝒲 is the learnable update weight matrix. Here every temporal data point is treated as a graph node with 𝐶 features. We form a causal k-nearest-neighbors (k-NN) graph, where each node can only utilize the information from temporal nodes in the past. Next, the aggregate and update operations are performed on a node 𝑣_i using max-relative GCN [57] as follows:

𝑔(⋅): 𝑣′_i = concat(𝑣_i, max{𝑣_j − 𝑣_i}),
ℎ(⋅): PReLU(BN(𝑣′_i 𝒲_i)),   (4)

where 𝑣_j ∈ 𝒩(𝑣_i) and 𝒩(𝑣_i) is the set of 𝑣_i's neighbors.

To be specific, the networks with self-attention and with graph convolution as the global operation are denoted as MSGLN𝑆𝐴 and MSGLN𝐺𝐶, respectively.

3.2.2. Local module

We propose a stack of 𝑛 causal local convolution blocks (LConv) to extract local relations in neighboring features. As depicted in Fig. 2(b), the LConv operates based on normal 1D convolution with a kernel of size 𝑘, where each time frame is computed by aggregating the information from up to 𝑘 − 1 past local features. In addition, 1D pointwise convolutions are included to project the input representation into a smaller dimension space 𝐻_L, which further eases the parameter burden. The LConv is crucial for introducing more locality inductive bias into our network.

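For the graph-convolution option of the global module, the causal max-relative aggregation of Eqs. (3) and (4) could look roughly like the sketch below. The per-frame loop, the Euclidean distance used for neighbor selection, and the neighborhood size k are assumptions made for clarity; this is not the authors' implementation of the TGCB.

```python
import torch
import torch.nn as nn

class CausalMaxRelativeGConv(nn.Module):
    """Treat every time frame as a graph node with C features. For node v_i,
    select its k nearest neighbours among frames <= i, aggregate max(v_j - v_i),
    concatenate with v_i, and update with Conv1d + BN + PReLU (Eqs. (3)-(4))."""
    def __init__(self, channels: int, k: int = 8):
        super().__init__()
        self.k = k
        self.update = nn.Sequential(nn.Conv1d(2 * channels, channels, 1),
                                    nn.BatchNorm1d(channels), nn.PReLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, C, T)
        B, C, T = x.shape
        nodes = x.transpose(1, 2)                           # (B, T, C)
        agg = torch.zeros_like(nodes)
        for t in range(T):                                  # per-frame loop for clarity
            past = nodes[:, : t + 1]                        # causal candidate set
            d = torch.cdist(nodes[:, t : t + 1], past)      # (B, 1, t+1) distances
            k = min(self.k, t + 1)
            idx = d.topk(k, dim=-1, largest=False).indices.squeeze(1)      # (B, k)
            nb = torch.gather(past, 1, idx.unsqueeze(-1).expand(B, k, C))  # (B, k, C)
            agg[:, t] = (nb - nodes[:, t : t + 1]).max(dim=1).values       # max-relative
        v = torch.cat([nodes, agg], dim=-1).transpose(1, 2)  # concat(v_i, agg) -> (B, 2C, T)
        return self.update(v)                                # h(.): learnable update

gcb = CausalMaxRelativeGConv(channels=256, k=8)
print(gcb(torch.randn(1, 256, 100)).shape)                   # torch.Size([1, 256, 100])
```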

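Putting the pieces together, one stage of global-local temporal modeling could be assembled as below: a stack of causal self-attention blocks (the SAB option) followed by a stack of causal local convolution blocks, with three such stages forming the bottleneck. The hidden sizes (H_G = 256, H_L = 64, k = 5, 5 + 5 blocks, 3 stages) follow Section 4.2, but the exact layer ordering and normalization placement are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttentionBlock(nn.Module):
    """Self-attention over time with an upper-triangular mask so that frame t only
    attends to frames <= t (Eq. (2)); a pointwise Conv1d projects back to C and a
    small MLP follows, roughly mirroring the SAB description."""
    def __init__(self, channels: int, hidden: int = 256):
        super().__init__()
        self.q = nn.Conv1d(channels, hidden, 1)
        self.k = nn.Conv1d(channels, hidden, 1)
        self.v = nn.Conv1d(channels, hidden, 1)
        self.proj = nn.Sequential(nn.Conv1d(hidden, channels, 1),
                                  nn.BatchNorm1d(channels), nn.PReLU())
        self.mlp = nn.Sequential(nn.Conv1d(channels, 2 * channels, 1), nn.PReLU(),
                                 nn.Conv1d(2 * channels, channels, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, C, T)
        q, k, v = self.q(x), self.k(x), self.v(x)           # (B, H, T)
        scores = torch.einsum("bht,bhs->bts", q, k) / (x.shape[1] ** 0.5)
        future = torch.triu(torch.ones_like(scores[0]), diagonal=1).bool()
        scores = scores.masked_fill(future, float("-inf"))  # mask the upper triangle
        attn = torch.softmax(scores, dim=-1)                 # causal attention weights
        out = torch.einsum("bts,bhs->bht", attn, v)          # aggregate values
        out = x + self.proj(out)                              # residual connection
        return out + self.mlp(out)

class LocalConvBlock(nn.Module):
    """Causal 1D convolution with a small kernel: each frame only sees up to k-1
    past frames; pointwise convolutions squeeze to a smaller width H_L."""
    def __init__(self, channels: int, hidden: int = 64, kernel: int = 5):
        super().__init__()
        self.pad = kernel - 1                                 # left padding only
        self.down = nn.Conv1d(channels, hidden, 1)
        self.conv = nn.Conv1d(hidden, hidden, kernel)
        self.up = nn.Sequential(nn.Conv1d(hidden, channels, 1),
                                nn.BatchNorm1d(channels), nn.PReLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, C, T)
        y = self.conv(F.pad(self.down(x), (self.pad, 0)))     # causal padding
        return x + self.up(y)

class GlobalLocalBlock(nn.Module):
    """One stage: a stack of global blocks followed by a stack of local blocks."""
    def __init__(self, channels: int, n_global: int = 5, n_local: int = 5):
        super().__init__()
        self.global_blocks = nn.Sequential(*[CausalSelfAttentionBlock(channels)
                                             for _ in range(n_global)])
        self.local_blocks = nn.Sequential(*[LocalConvBlock(channels)
                                            for _ in range(n_local)])

    def forward(self, x):
        return self.local_blocks(self.global_blocks(x))       # global, then local

bottleneck = nn.Sequential(*[GlobalLocalBlock(256) for _ in range(3)])  # 3 stages
feats = torch.randn(1, 256, 100)                               # (B, C, T) toy input
print(bottleneck(feats).shape)                                  # torch.Size([1, 256, 100])
```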
3.2.3. Multi-stage global-local temporal modeling

Given the input of the bottleneck layer 𝐅_0 ∈ ℝ^(𝐶 × 𝑇), MSGLN performs multi-stage temporal modeling via both global and local perspectives as:

𝐅_i = GLB_i(𝐅_{i−1}),   (5)

where 𝐅_i ∈ ℝ^(𝐶 × 𝑇) and GLB_i denote the output and the GLB of stage 𝑖 ∈ {1, ..., 𝑠}. A GLB performs the following two steps. First, the global module propagates an increasingly complex global representation, which can be formulated as:

𝐅_global_i = Global_i(𝐅_{i−1}),   (6)

where Global_i and 𝐅_global_i denote the global module and the corresponding output at the 𝑖-th stage. Then, the local module captures short-term changes in the latent representation as:

𝐅_i = Local_i(𝐅_global_i),   (7)

where Local_i represents the local module at the 𝑖-th stage. In this way, the GLB can refine the speech features progressively in a coarse-to-fine manner. By utilizing multiple iterative GLBs, the network can observe diverse global and local temporal information, which will be discussed in Section 4.

3.3. Dual-path RNN postprocessing

The frequency dimension is mixed with the channel axis in the bottleneck layer. Therefore, the temporal modeling module transforms the frequency representation globally based on temporal information aggregation. To refine the speech representation at the time-frequency bin level of the original spectrum, we utilize the sequential bias advantage of RNNs and propose a dual-path RNN postprocessing module, which can be represented as follows:

𝐉 = RNN_1(Norm(𝐈)),
𝐌 = MLP(RNN_2(Reshape(𝐉))),   (8)

where 𝐈 ∈ ℝ^(𝐹 × 𝑇 × 𝐶) and 𝐉 ∈ ℝ^(𝑇 × 𝐹 × 𝐶). RNN_1 is an LSTM operating on sub-band time sequences, RNN_2 is a frame-wise LSTM modeling spectral sequences, and MLP is the final multilayer perceptron that produces the complex output mask 𝐌 ∈ ℂ^(𝐹 × 𝑇).

3.4. Loss function

The Scale-Invariant Signal-to-Noise Ratio (SI-SNR) [58] is a widely used objective metric in speech separation and enhancement, defined as:

s_target = (⟨ŝ, s⟩ ⋅ s) / ‖s‖_2²,
e_noise = ŝ − s_target,   (9)
ℒ_SI-SNR(s, ŝ) = 10 log_10(‖s_target‖_2² / ‖e_noise‖_2²),

where ŝ and s are the estimated and reference speech signals in the time domain, respectively, ‖⋅‖_2 is the Euclidean norm, and ⟨⋅, ⋅⟩ represents the dot product of two vectors. We also consider the power-law compressed magnitude loss function [59], defined as:

ℒ_spec = (1 − α) Σ_{t,f} | |S|^c − |Ŝ|^c | + α Σ_{t,f} | |S|^c e^{jφ_S} − |Ŝ|^c e^{jφ_Ŝ} |,   (10)

where Ŝ and S are the STFTs of the estimated and clean speech signals, respectively. We set α = 0.3 and c = 0.3. The loss function we finally utilize for our systems is:

ℒ(s, ŝ) = ℒ_SI-SNR(s, ŝ) + ℒ_spec.   (11)

4. Experiments

4.1. Dataset

4.1.1. DNS 2020 challenge dataset

For single-channel enhancement evaluation, we utilize the DNS 2020 dataset [2], which consists of 500 hours of clean voice recordings collected from 2150 speakers. The noise dataset comprises more than 180 hours of video in 150 different classes. We generate around 1000 hours of 6-second noisy-clean pairs sampled at 16 kHz, with the Signal-to-Noise Ratio (SNR) ranging randomly from -5 dB to 20 dB, for the training set, and another validation set of 5 hours. To facilitate comparison with other models, we adopt the non-blind synthetic test set from the challenge, which includes 150 noisy-clean pairs with a duration of 10 seconds.

4.1.2. INTERSPEECH 2021 ConferencingSpeech Challenge

We use the INTERSPEECH 2021 ConferencingSpeech Challenge dataset, described in [35], to evaluate our multi-channel models. Following the challenge's objective of algorithm development, participants are limited to training exclusively with the provided clean speech and noise lists. The clean speech list contains about 550 hours of speech utterances from the following public corpora: AISHELL-1 [60], AISHELL-3 [61], and Librispeech 360 [62]. The noise set consists of about 120 hours of data from two public databases: AudioSet [63] and MUSAN [64]. The Room Impulse Responses (RIRs) provided by the challenge simulate three different types of microphone arrays: a linear uniformly distributed microphone array, a linear non-uniformly distributed microphone array, and a circular microphone array. The given RIR set has around 2500 rooms, with room sizes varying from 3 × 3 × 3 to 8 × 8 × 3 (length (m) × width (m) × height (m)) by uniform distribution sampling. The target and noise sources were dispersed between 1.2 and 1.9 m in height and originated from any available spot throughout the room. The minimum arrival angle between two sources was 20°. The distance between the source and the array ranged from 0.5 to 5.0 m. The microphone array was positioned at random heights between 1.0 and 1.5 m. Finally, we generate 750 hours of 6-second noisy-clean pairs sampled at 16 kHz for training and 2 hours for development. For testing, we created 794 noisy mixed clips for every microphone array to evaluate the models' performance. The SNR was distributed between 0 dB and 10 dB.

4.2. Experimental setup

We utilize a 32 ms Hanning window with 50% overlap and 512 STFT points, resulting in 257 spectral dimensions. The model is optimized by AdamW [65] with an initial learning rate of 0.001, decayed by a factor of 0.5 if the validation performance does not improve after 2 consecutive epochs. The number of microphones in our multi-channel experiments is 𝑀 = 8.

The model configurations of our proposed methods in the single- and multi-channel settings are similar, except for the number of input channels of the network, which is 2 in the single-channel case and 16 in the multi-channel case. The channel numbers of the encoder and decoder are {32, 64, 128, 128, 256, 256}. The kernel size of the 2D complex convolution in each encoder and decoder layer is (5, 2) with a stride of (2, 1). The hidden channel of the bottleneck layer is 256. The multi-stage temporal processing module utilizes 3 GLBs, in which 5 blocks are used for the global module and 5 blocks are used for the local module in each GLB. In a global block (SAB or TGCB), the number of hidden channels is 𝐻_G = 256 and the MLP for feature transformation is 2 consecutive feed-forward networks with an inverted bottleneck factor of 2. The LConv block squeezes the channel dimension to 𝐻_L = 64 and uses a local kernel of size 𝑘 = 5. The channel number of the dual-path RNN postprocessing module is 64.

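The front-end and optimizer settings above translate directly into a few lines of configuration. The sketch below uses a one-layer placeholder model and a made-up validation score; only the 512-point/32 ms/50% STFT, AdamW with lr = 0.001, and the decay-by-0.5-after-2-stalled-epochs schedule come from the paper.

```python
import torch
import torch.nn as nn

SAMPLE_RATE = 16000
N_FFT = 512                # 512 samples / 16 kHz = 32 ms Hanning window
HOP = 256                  # 50% overlap -> 16 ms hop
N_BINS = N_FFT // 2 + 1    # 257 spectral dimensions

model = nn.Linear(N_BINS, N_BINS)                  # placeholder for the MSGLN network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2)  # halve LR after 2 stalled epochs

for epoch in range(5):
    val_score = 0.0                                 # stand-in validation performance
    scheduler.step(val_score)                       # no improvement -> LR decays
    print(epoch, optimizer.param_groups[0]["lr"])
```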

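Training minimizes the joint objective of Section 3.4 (Eqs. (9)-(11)) with α = c = 0.3. The helper below is a self-contained sketch of that objective; the SI-SNR term is negated here so that a smaller loss means a better SNR, which is a common convention assumed by this sketch rather than something stated in the paper.

```python
import torch

def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR of Eq. (9) for time-domain signals of shape (B, N)."""
    s_target = (torch.sum(est * ref, dim=-1, keepdim=True) * ref
                / (torch.sum(ref ** 2, dim=-1, keepdim=True) + eps))
    e_noise = est - s_target
    ratio = torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)

def compressed_spec_loss(est: torch.Tensor, ref: torch.Tensor,
                         alpha: float = 0.3, c: float = 0.3) -> torch.Tensor:
    """Power-law compressed magnitude + complex loss of Eq. (10) on (B, F, T) STFTs."""
    est_mag, ref_mag = est.abs() ** c, ref.abs() ** c
    mag_term = (ref_mag - est_mag).abs().sum(dim=(-2, -1))
    est_cplx = est_mag * torch.exp(1j * est.angle())
    ref_cplx = ref_mag * torch.exp(1j * ref.angle())
    cplx_term = (ref_cplx - est_cplx).abs().sum(dim=(-2, -1))
    return (1 - alpha) * mag_term + alpha * cplx_term

def total_loss(est_wave, ref_wave, est_spec, ref_spec):
    # Eq. (11); SI-SNR is negated so that minimizing the sum improves the SNR.
    return (-si_snr(est_wave, ref_wave) + compressed_spec_loss(est_spec, ref_spec)).mean()

def stft_c(w):
    return torch.stft(w, 512, 256, window=torch.hann_window(512), return_complex=True)

ref = torch.randn(2, 16000)                 # placeholder clean batch
est = ref + 0.1 * torch.randn(2, 16000)     # placeholder enhanced batch
print(total_loss(est, ref, stft_c(est), stft_c(ref)).item())
```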
4.3. Evaluation metrics

In this study, we adopt five popular speech evaluation metrics: Perceptual Evaluation of Speech Quality (PESQ) [66], Short-Time Objective Intelligibility (STOI) [67], Extended Short-Time Objective Intelligibility (ESTOI) [68], Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) [58], and DNSMOS [69]. PESQ evaluates the perceived quality of speech signals and ranges from -0.5 to 4.5. STOI evaluates the intelligibility of speech signals by operating on short-time segments and calculating a correlation-based measure. ESTOI is an extended version of STOI that takes into account the impact of noise, reverberation, and other distortions that affect speech intelligibility in real-world scenarios. STOI and ESTOI scores range from 0 to 1. SI-SDR quantifies the similarity between the estimated source and the reference source by measuring the ratio of their energies; it is insensitive to amplitude scaling and is expressed in decibels (dB). The DNSMOS metric follows ITU-T Rec. P.835 human ratings and measures 3 scores: speech quality (SIG), background noise quality (BAK), and overall audio quality (OVRL). For all these metrics, a higher value indicates better quality.

Table I
Effect of proposed modules evaluated on the DNS-Challenge non-blind test set.

System Param. (M) PESQ𝑊𝐵 PESQ𝑁𝐵 STOI (%) SI-SDR
Unprocessed - 1.58 2.45 91.52 9.07
MSGLN𝑆𝐴 11.25 3.36 3.74 97.72 20.31
-GLB 2.5 3.08 3.55 97.03 19.32
+Local 3.31 3.13 3.61 97.10 19.11
+Global 10.44 3.15 3.61 97.17 19.37
-DPRNN 11.19 3.16 3.58 97.20 18.98
+T-LSTM 11.22 3.29 3.69 97.56 20.04

4.4. Experimental results

4.4.1. Ablation studies

We conduct ablation studies to examine the contribution of the proposed modules to the overall performance of the network with self-attention as the core global operation (MSGLN𝑆𝐴), as demonstrated in Table I. As can be seen, removing the proposed multi-stage temporal processing module (-GLB) substantially decreases the performance. Adding the global (+Global) or local module (+Local) alone in the bottleneck layer improves the results compared to -GLB, indicating their significance in temporal modeling. However, their performance is still inferior to combining both modules as proposed (3.61 vs. 3.74 in PESQ𝑁𝐵 and 3.13/3.15 vs. 3.36 in PESQ𝑊𝐵). The DPRNN postprocessing module also shows a great contribution to the overall performance, since it performs refinement at the time-frequency bin level of the original spectrum. Adding only the sub-band LSTM (+T-LSTM) considerably enhances the performance (3.16 to 3.29 in PESQ𝑊𝐵 and 3.58 to 3.69 in PESQ𝑁𝐵). Further improvements can be observed by introducing the frame-wise LSTM as in our proposed MSGLN (3.29 to 3.36 in PESQ𝑊𝐵 and 3.69 to 3.74 in PESQ𝑁𝐵).

4.4.2. Temporal modeling performance of proposed methods

To validate the capability of our multi-stage temporal processing module, we make a comparison with the most recent state-of-the-art temporal modeling approaches and show the results in Table II. Here we omit the DPRNN postprocessing module to highlight the temporal modeling performance in the bottleneck layer. LSTM [44] and S-TCN [27] are chosen to replace our proposed bottleneck while keeping the encoder and decoder unchanged. The configurations are set the same as in the original papers. In particular, LSTM [44] consists of two grouped LSTMs with a hidden channel of 1024, and S-TCN [27] is three stages of 6 squeezed dilated convolution layers with a hidden channel of 64. It can be seen that our proposed temporal modules surpass the popular LSTM and TCN approaches in all metrics, demonstrating the superiority of the proposed methodology. In addition, MSGLN𝑆𝐴 exhibits the best performance, indicating the feasibility and effectiveness of the self-attention mechanism in speech enhancement.

4.4.3. Impact of locality

Global operations like self-attention or graph convolution lack the capability of modeling local patterns, which is important as neighboring features tend to relate to each other. Therefore, incorporating 1D-CNNs strengthens global operations by providing the network with the locality inductive bias. As mentioned before, some concurrent works utilize Conformer [32] for both global and local temporal modeling. We modify our network and create a conformer-like model by simply placing a self-attention block and a 1D convolution block sequentially, which is similar to the Conformer architecture. As a result, there are in total 15 GLBs, where each GLB has an SAB followed by an LConv. In addition, we also investigate a parallel structure proposed in [70], where each GLB captures local and global information in parallel and fuses the outputs by a feed-forward network. Since MSGLN𝑆𝐴 operates with a global-local structure, we also swap the local and global modules in the GLBs and examine a local-global structure. As shown in Table III, incorporating the locality mechanism into global modules boosts the performance of the conformer-like model (from 3.15 to 3.26 in PESQ𝑊𝐵 and 3.61 to 3.67 in PESQ𝑁𝐵) and MSGLN𝑆𝐴 (3.15 to 3.36 in PESQ𝑊𝐵 and 3.61 to 3.74 in PESQ𝑁𝐵). Notably, the parallel structure exhibits poor performance, indicating that the sequential structure is more suitable for speech enhancement. One STFT frame hardly contains meaningful information compared to a word or a sequence of words in machine translation tasks, so learning different information separately can introduce noise into temporal modeling. A sequential refining process is preferable in denoising and refinement tasks [15,32,71]. Besides, MSGLN𝑆𝐴 shows better performance than the local-global structure, which demonstrates the effectiveness of the coarse-to-fine style for temporal modeling [72]. Furthermore, the local modeling capability of the network can be investigated by examining the changes in the mean attendance distance across self-attention layers [73]. To be specific, the mean attendance distance is computed by averaging the distance between frames using the attention weights over the DNS challenge public test set. A smaller distance means more local knowledge is utilized in the network. As illustrated in Fig. 3, introducing locality helps the global block attend to a more diverse range. In particular, MSGLN𝑆𝐴 offers the most diverse attention range and richer local knowledge in higher layers. This indicates that stacking multiple global blocks and multiple local blocks together helps the network become more robust and progressively extract richer temporal patterns after each stage.

4.4.4. Comparison with other state-of-the-art networks

For the single-channel case, Table IV shows the DNSMOS results compared with NSNet2 [74] and GaGNet [38]. We use the NSNet2 model¹ and GaGNet's enhanced samples² provided by the authors to calculate the DNSMOS metric. As can be seen, our proposed models significantly outperform the baselines, and MSGLN𝐺𝐶 performs slightly better than MSGLN𝑆𝐴 in terms of MOS metrics. Table V presents comparison results of our proposed models with previous state-of-the-art methods on the DNS 2020 Challenge non-blind test set using the PESQ, STOI, ESTOI, and SI-SDR metrics. Here the information that was not mentioned in the original papers is omitted. One can observe that our proposed models exhibit the best performance in all metrics, except for the STOI score of MSGLN𝐺𝐶 (97.52% vs 97.69% of FRCRN). It is worth mentioning that our proposed models outperform the two multi-scale TCN-based models with time-frequency self-attention integration, i.e. MTFAA and CTFUNet, demonstrating the effectiveness of the proposed temporal modeling method. With a single-stage paradigm, MSGLN models still outperform TaEr, a multi-stage enhancement framework. This indicates the high efficiency of our models and their great potential for performance improvement when integrated into multi-stage frameworks.

1 https://github.com/microsoft/DNS-Challenge.
2 https://github.com/Andong-Li-speech/GaGNet.

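Of the metrics listed in Section 4.3, STOI and ESTOI can be reproduced with the open-source pystoi package (PESQ and DNSMOS require their own tools); the snippet below assumes pystoi is installed and uses random waveforms purely as placeholders.

```python
import numpy as np
from pystoi import stoi   # pip install pystoi

SR = 16000
clean = np.random.randn(3 * SR)                      # placeholder reference speech
enhanced = clean + 0.05 * np.random.randn(3 * SR)    # placeholder enhanced output

print("STOI :", stoi(clean, enhanced, SR, extended=False))  # 0..1, higher is better
print("ESTOI:", stoi(clean, enhanced, SR, extended=True))   # extended STOI
```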

Table II
Comparison with prevalent temporal modeling module evaluated on DNS-Challenge
non-blind test set. Note that the DPRNN processing module is omitted to highlight the
temporal modeling performance in the bottleneck layer.

System Param. (M) PESQ𝑊 𝐵 PESQ𝑁𝐵 STOI (%) SI-SDR

Unprocessed - 1.58 2.45 91.52 9.07


LSTM [44] 10.29 2.95 3.44 96.56 17.97
S-TCN [27] 3.37 3.07 3.52 96.89 18.71
MSGLN𝑆𝐴 -Local -DPRNN 10.37 3.01 3.50 96.71 18.25
Conformer-like -DPRNN 11.19 3.09 3.53 96.88 18.88
MSGLN𝐺𝐶 -DPRNN 11.15 3.12 3.56 97.00 18.80
MSGLN𝑆𝐴 -DPRNN 11.19 3.16 3.58 97.20 18.98

Fig. 3. Attendance distance of different mechanisms.

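The mean attendance distance plotted in Fig. 3 is, for each query frame, the attention-weighted average of |t_query − t_key|, averaged over frames and utterances. A small sketch of that computation is given below; the attention weights here are random causal stand-ins, and the exact definition in [73] may differ in detail.

```python
import torch

def mean_attendance_distance(attn: torch.Tensor) -> torch.Tensor:
    """attn: (B, T, T) attention weights whose rows sum to 1 (query over key frames).
    Returns the average attended distance in frames."""
    T = attn.shape[-1]
    pos = torch.arange(T, dtype=attn.dtype)
    dist = (pos.view(T, 1) - pos.view(1, T)).abs()    # |t_query - t_key|
    return (attn * dist).sum(dim=-1).mean()           # expectation, then average

# Causal toy attention: mask the future, normalize rows.
scores = torch.randn(4, 100, 100)
scores = scores.masked_fill(torch.triu(torch.ones(100, 100), 1).bool(), float("-inf"))
attn = torch.softmax(scores, dim=-1)
print(mean_attendance_distance(attn).item())
```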
Table III
Investigation results of the locality impact evaluated on DNS-Challenge non-blind
test set.

System Param. (M) PESQ𝑊 𝐵 PESQ𝑁𝐵 STOI (%) SI-SDR

Unprocessed - 1.58 2.45 91.52 9.07


w/o. Local 10.44 3.15 3.61 97.17 19.37
Conformer-like 11.25 3.26 3.67 97.49 20.03
Parallel structure 12.24 2.96 3.46 96.74 18.69
Local-Global structure 11.25 3.29 3.70 97.51 20.08
MSGLN𝑆𝐴 11.25 3.36 3.74 97.72 20.31

Table IV
The DNSMOS results evaluated on DNS 2020 Chal-
lenge non-blind test set.

System SIG-MOS BAK-MOS OVRL-MOS

Unprocessed 3.80 2.38 2.63


NSNet2 [74] 3.77 4.03 3.45
GaGNet [38] 3.96 4.40 3.77

MSGLN𝐺𝐶 4.07 4.48 3.88


MSGLN𝑆𝐴 4.05 4.47 3.87

Fig. 4. Comparison of computational complexity among models.


For the multi-channel case, we reimplement several state-of-the-art baselines, including LSTM-IPD [35], MIMO-UNet [79], COSPA [80], TGCN [34], and EaBNet [7]. The comparison results of the real-time multi-channel models are shown in Table VI. MSGLN models consistently outperform the other multi-channel baselines by a large margin in all benchmarks. In particular, MSGLN𝑆𝐴 and MSGLN𝐺𝐶 yield significantly higher results than the strongest baseline (e.g. 3.24 of MSGLN𝑆𝐴 and 3.30 of MSGLN𝐺𝐶 vs 2.89 of EaBNet in PESQ on the circular array). These results emphasize the efficacy of the MSGLN architecture in modeling complex spatial information in conjunction with temporal-spectral information.

4.4.5. Model complexity analysis

To validate the efficiency of our approach, we compare the multiply-accumulate operations per second (MACs) of our models with several baselines, including EaBNet, DCCRN, TaEr, and MTFAA. As shown in Fig. 4, our MSGLN models yield acceptable computational complexity compared to the other baselines. For brevity, MSGLN𝑆𝐴 and MSGLN𝐺𝐶 are illustrated by one bar because they have the same number of MACs. Notice that EaBNet and DCCRN have higher MACs than our models despite their smaller numbers of parameters. Besides, we also evaluate the real-time factor (RTF) of the proposed models on an Intel Core i5 quad-core CPU with a single thread. The RTFs of MSGLN𝑆𝐴 and MSGLN𝐺𝐶 are 0.41 and 0.59, respectively, which satisfies the DNS 2020 Challenge requirement for real-time processing. Since our models use no future information, the algorithmic latency is 16 ms (frame length minus hop length), the buffering latency is 16 ms (hop length), and the total algorithmic latency is 32 ms.

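The latency figures above follow from the frame settings (512-sample frame, 256-sample hop at 16 kHz), and the real-time factor is simply processing time divided by audio duration. The sketch below measures RTF for a placeholder model, so the resulting number will not match the 0.41/0.59 reported for MSGLN.

```python
import time
import torch
import torch.nn as nn

SR, N_FFT, HOP = 16000, 512, 256
# (N_FFT - HOP) + HOP = N_FFT samples = 32 ms total algorithmic latency at 16 kHz.
print("total algorithmic latency:", 1000 * N_FFT / SR, "ms")

model = nn.Sequential(nn.Linear(257, 257), nn.PReLU())    # stand-in for MSGLN
audio = torch.randn(10 * SR)                               # 10 s of test audio
frames = torch.stft(audio, N_FFT, HOP, window=torch.hann_window(N_FFT),
                    return_complex=True).abs().T           # (T, 257) magnitude frames

with torch.no_grad():
    start = time.perf_counter()
    _ = model(frames)                                       # process all frames at once
    elapsed = time.perf_counter() - start

rtf = elapsed / (audio.numel() / SR)                        # processing time / audio time
print(f"RTF on this machine: {rtf:.3f}")
```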

Table V
Comparison with single-channel state-of-the-arts on the DNS 2020 Challenge non-blind
test set.

System Year Param.(M) PESQ𝑊 𝐵 PESQ𝑁𝐵 STOI (%) SI-SDR

Unprocessed - - 1.58 2.45 91.52 9.07


DCCRN [5] 2020 3.67 - 3.27 - -
PoCoNet [75] 2020 50 2.75 - - -
DCCRN+ [76] 2021 3.3 - 3.33 - -
FullSubNet [37] 2021 5.64 2.78 3.31 96.11 17.29
FRCRN [6] 2022 10.27 3.23 3.60 97.69 19.78
FullSubNet+ [16] 2022 8.67 2.98 3.50 96.69 18.34
FS-CANet [77] 2022 4.21 3.02 3.51 96.74 18.08
GaGNet [38] 2022 5.94 3.17 3.56 97.13 18.91
HGCN+ [13] 2022 5.29 3.19 3.65 97.23 -
MTFAA [8] 2022 - 3.32 3.63 - -
CTFUNet [78] 2023 6.1 3.18 3.64 97.17 18.66
TaEr [27] 2023 6.03 3.26 3.60 97.56 19.64

MSGLN𝐺𝐶 (ours) 2024 11.24 3.28 3.70 97.52 20.04


MSGLN𝑆𝐴 (ours) 2024 11.25 3.36 3.74 97.72 20.31

Table VI
Comparison with multi-channel state-of-the-arts on the ConferencingSpeech Challenge test set.
System Param. (M) Microphone Array
Circular Linear uniformly distributed Linear non-uniformly distributed
PESQ STOI ESTOI SI-SDR PESQ STOI ESTOI SI-SDR PESQ STOI ESTOI SI-SDR

Unprocessed - 1.558 0.807 0.716 4.633 1.551 0.807 0.716 4.673 1.543 0.804 0.712 4.536
LSTM-IPD [35] 8.68 2.129 0.872 0.782 13.118 2.091 0.867 0.779 13.065 2.091 0.867 0.777 13.042
MIMO-UNet [79] 1.96 2.139 0.862 0.768 11.969 2.118 0.859 0.766 11.904 2.133 0.863 0.768 12.065
COSPA [80] 3.45 2.348 0.880 0.790 13.936 2.337 0.879 0.788 13.931 2.323 0.877 0.786 13.948
TGCN [34] 7.11 2.499 0.895 0.814 14.284 2.470 0.891 0.810 14.255 2.478 0.895 0.813 14.308
EaBNet [7] 2.84 2.894 0.926 0.860 16.307 2.883 0.923 0.856 16.227 2.871 0.926 0.858 16.326

MSGLN𝑆𝐴 (ours) 11.25 3.241 0.941 0.885 17.142 3.194 0.938 0.879 16.914 3.219 0.941 0.882 17.059
MSGLN𝐺𝐶 (ours) 11.24 3.298 0.943 0.888 17.288 3.250 0.940 0.883 17.054 3.281 0.943 0.886 17.179

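For the multi-channel experiments, Section 4.2 states that the network input has 16 channels for M = 8 microphones, which is consistent with stacking the real and imaginary parts of each microphone's spectrum. The sketch below shows one way to build such an input tensor; the stacking order is an assumption.

```python
import torch

M, SR, N_FFT, HOP = 8, 16000, 512, 256
mics = torch.randn(M, 6 * SR)                  # placeholder 6-second, 8-channel recording

win = torch.hann_window(N_FFT)
spec = torch.stft(mics, N_FFT, HOP, window=win, return_complex=True)   # (M, F, T)
net_input = torch.cat([spec.real, spec.imag], dim=0)                   # (2M, F, T)
print(net_input.shape)   # torch.Size([16, 257, T]): 16 input channels, as in Section 4.2
```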
5. Conclusion

In this work, we propose MSGLN, a real-time speech enhancement network that models temporal representations through both global and local perspectives. The global module employs self-attention and graph convolution for long-term temporal modeling, while the local module captures short-term relations based on local convolutions. The proposed temporal modeling mechanism performs deep feature extraction and aggregates diverse ranges of information through multiple global-local modeling stages, and it surpasses prevalent techniques in speech enhancement such as TCNs and RNNs. We analyze how our global and local modeling mechanism is superior to that of Conformer, a powerful transformer-based variant. The experimental results on two large-scale datasets show that our networks outperform current state-of-the-art methods in both single- and multi-channel speech enhancement tasks, demonstrating the effectiveness of our modeling approach. We hope these findings will open up novel breakthroughs in speech processing. In the future, we will focus on exploring efficient strategies to utilize the advantages of self-attention and graph convolution simultaneously. Besides, improving the combination of global and local information is also a promising direction.

CRediT authorship contribution statement

Hoang Ngoc Chau: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Methodology, Formal analysis, Data curation, Conceptualization. Nguyen Thi Nhat Linh: Software, Investigation. Tuan Kiet Doan: Investigation, Data curation. Quoc Cuong Nguyen: Writing – review & editing, Supervision, Resources, Funding acquisition, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgement

This research is funded by Hanoi University of Science and Technology under grant number T2023-PC-039.

References

[1] Das N, Chakraborty S, Chaki J, Padhy N, Dey N. Fundamentals, present and future perspectives of speech enhancement. Int J Speech Technol 2021;24:883–901.
[2] Reddy CK, Gopal V, Cutler R, Beyrami E, Cheng R, Dubey H, et al. The interspeech 2020 deep noise suppression challenge: datasets, subjective testing framework, and challenge results. arXiv preprint arXiv:2005.13981, 2020.
[3] Tan K, Wang Z-Q, Wang D. Neural spectrospatial filtering. IEEE/ACM Trans Audio Speech Lang Process 2022;30:605–21.
[4] Tan K, Wang D. A convolutional recurrent neural network for real-time speech enhancement. Interspeech 2018;2018:3229–33.
[5] Hu Y, Liu Y, Lv S, Xing M, Zhang S, Fu Y, et al. DCCRN: deep complex convolution recurrent network for phase-aware speech enhancement. In: Proc. interspeech 2020; 2020. p. 2472–6.
[6] Zhao S, Ma B, Watcharasupat KN, Gan W-S. Frcrn: boosting feature representation using frequency recurrence for monaural speech enhancement. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2022. p. 9281–5.


[7] Li A, Liu W, Zheng C, Li X. Embedding and beamforming: all-neural causal beamformer for multichannel speech enhancement. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2022. p. 6487–91.
[8] Zhang G, Yu L, Wang C, Wei J. Multi-scale temporal frequency convolutional network with axial attention for speech enhancement. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2022. p. 9122–6.
[9] Luo Y, Chen Z, Yoshioka T. Dual-path rnn: efficient long sequence modeling for time-domain single-channel speech separation. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2020. p. 46–50.
[10] Pandey A, Xu B, Kumar A, Donley J, Calamia P, Wang D. Tparn: triple-path attentive recurrent network for time-domain multichannel speech enhancement. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2022. p. 6497–501.
[11] Tesch K, Gerkmann T. Insights into deep non-linear filters for improved multi-channel speech enhancement. IEEE/ACM Trans Audio Speech Lang Process 2022;31:563–75.
[12] Yu J, Luo Y, Chen H, Gu R, Weng C. High fidelity speech enhancement with band-split rnn. arXiv preprint arXiv:2212.00406, 2022.
[13] Wang T, Zhu W, Gao Y, Chen Y, Feng J, Zhang S. Harmonic gated compensation network plus for icassp 2022 dns challenge. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2022. p. 9286–90.
[14] Wang T, Zhu W, Gao Y, Zhang S, Feng J. Harmonic attention for monaural speech enhancement. IEEE/ACM Trans Audio Speech Lang Process 2023.
[15] Luo Y, Mesgarani N. Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Trans Audio Speech Lang Process 2019;27(8):1256–66.
[16] Chen J, Wang Z, Tuo D, Wu Z, Kang S, Meng H. Fullsubnet+: channel attention fullsubnet with complex spectrograms for speech enhancement. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2022. p. 7857–61.
[17] Zhang Z, Zhang L, Zhuang X, Qian Y, Li H, Wang M. Fb-mstcn: a full-band single-channel speech enhancement method based on multi-scale temporal convolutional network. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2022. p. 9276–80.
[18] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv Neural Inf Process Syst 2017;30.
[19] Fan J, Yang J, Zhang X, Yao Y. Real-time single-channel speech enhancement based on causal attention mechanism. Appl Acoust 2022;201:109084.
[20] Zhang Q, Qian X, Ni Z, Nicolson A, Ambikairajah E, Li H. A time-frequency attention module for neural speech enhancement. IEEE/ACM Trans Audio Speech Lang Process 2022;31:462–75.
[21] Wang K, He B, Zhu W-P. Tstnn: two-stage transformer based neural network for speech enhancement in the time domain. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2021. p. 7098–102.
[22] Yu G, Li A, Zheng C, Guo Y, Wang Y, Wang H. Dual-branch attention-in-attention transformer for single-channel speech enhancement. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2022. p. 7847–51.
[23] Chen W, Yu R, Ye Z. Decoupling-style monaural speech enhancement with a triple-branch cross-domain fusion network. Appl Acoust 2024;217:109839.
[24] Li J, Zhu Y, Luo D, Liu Y, Cui G, Li Z. The pcg-aiid system for l3das22 challenge: mimo and miso convolutional recurrent network for multi channel speech enhancement and speech recognition. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2022. p. 9211–5.
[25] Li A, Liu W, Zheng C, Fan C, Li X. Two heads are better than one: a two-stage complex spectral mapping approach for monaural speech enhancement. IEEE/ACM Trans Audio Speech Lang Process 2021;29:1829–43.
[26] Chen L, Xu C, Zhang X, Ren X, Zheng X, Zhang C, et al. Multi-stage and multi-loss training for fullband non-personalized and personalized speech enhancement. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2022. p. 9296–300.
[27] Li A, Yu G, Zheng C, Liu W, Li X. A general unfolding speech enhancement method motivated by Taylor's theorem. IEEE/ACM Trans Audio Speech Lang Process 2023.
[28] Zhang G, Wang C, Yu L, Wei J. Multi-scale temporal frequency convolutional network with axial attention for multi-channel speech enhancement. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2022. p. 9206–10.
[29] Dubey H, Gopal V, Cutler R, Aazami A, Matusevych S, Braun S, et al. Icassp 2022 deep noise suppression challenge. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2022. p. 9271–5.
[30] Cutler R, Saabas A, Parnamaa T, Purin M, Gamper H, Braun S, et al. Icassp 2022 acoustic echo cancellation challenge. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2022. p. 9107–11.
[31] Guizzo E, Marinoni C, Pennese M, Ren X, Zheng X, Zhang C, et al. L3das22 challenge: learning 3d audio sources in a real office environment. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2022. p. 9186–90.
[32] Gulati A, Qin J, Chiu C-C, Parmar N, Zhang Y, Yu J, et al. Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020.
[33] Zhao S, Ma B. D2former: a fully complex dual-path dual-decoder conformer network using joint complex masking and complex spectral mapping for monaural speech enhancement. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2023. p. 1–5.
[34] Chau HN, Bui TD, Nguyen HB, Duong TT, Nguyen QC. A novel approach to multi-channel speech enhancement based on graph neural networks. IEEE/ACM Trans Audio Speech Lang Process 2024.
[35] Rao W, Fu Y, Hu Y, Xu X, Jv Y, Han J, et al. Conferencingspeech challenge: towards far-field multi-channel speech enhancement for video conferencing. In: 2021 IEEE Automatic Speech Recognition and Understanding workshop (ASRU). IEEE; 2021. p. 679–86.
[36] Mehrish A, Majumder N, Bharadwaj R, Mihalcea R, Poria S. A review of deep learning techniques for speech processing. Inf Fusion 2023:101869.
[37] Hao X, Su X, Horaud R, Li X. Fullsubnet: a full-band and sub-band fusion model for real-time single-channel speech enhancement. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2021. p. 6633–7.
[38] Li A, Zheng C, Zhang L, Li X. Glance and gaze: a collaborative learning framework for single-channel speech enhancement. Appl Acoust 2022;187:108499.
[39] Li A, Zheng C, Yu G, Cai J, Li X. Filtering and refining: a collaborative-style framework for single-channel speech enhancement. IEEE/ACM Trans Audio Speech Lang Process 2022;30:2156–72.
[40] Heymann J, Drude L, Haeb-Umbach R. Neural network based spectral mask estimation for acoustic beamforming. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2016. p. 196–200.
[41] Zhang Z, Xu Y, Yu M, Zhang S-X, Chen L, Yu D. Adl-mvdr: all deep learning mvdr beamformer for target speech separation. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2021. p. 6089–93.
[42] Erdogan H, Hershey JR, Watanabe S, Mandel MI, Le Roux J. Improved mvdr beamforming using single-channel mask prediction networks. In: Interspeech; 2016. p. 1981–5.
[43] Ren X, Zhang X, Chen L, Zheng X, Zhang C, Guo L, et al. A causal U-net based neural beamforming network for real-time multi-channel speech enhancement. In: Proc. interspeech 2021; 2021. p. 1832–6.
[44] Tan K, Zhang X, Wang D. Deep learning based real-time speech enhancement for dual-microphone mobile phones. IEEE/ACM Trans Audio Speech Lang Process 2021;29:1853–63.
[45] Li A, Yu G, Zheng C, Li X. TaylorBeamformer: learning all-neural beamformer for multi-channel speech enhancement from Taylor's approximation theory. In: Proc. interspeech 2022; 2022. p. 5413–7.
[46] Li A, Meng W, Yu G, Liu W, Li X, Zheng C. TaylorBeamixer: learning Taylor-inspired all-neural multi-channel speech enhancement from beam-space dictionary perspective. In: Proc. INTERSPEECH 2023; 2023. p. 1055–9.
[47] Pandey A, Wang D. Tcnn: temporal convolutional neural network for real-time speech enhancement in the time domain. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2019. p. 6875–9.
[48] Tan K, Wang D. Complex spectral mapping with a convolutional recurrent network for monaural speech enhancement. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2019. p. 6865–9.
[49] Le X, Chen H, Chen K, Lu J. DPCRN: dual-path convolution recurrent network for single channel speech enhancement. In: Proc. interspeech 2021; 2021. p. 2811–5.
[50] Zhang S, Kong Y, Lv S, Hu Y, Xie L. F-T-LSTM based complex network for joint acoustic echo cancellation and speech enhancement. In: Proc. interspeech 2021; 2021. p. 4758–62.
[51] Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. In: International conference on learning representations; 2017. [Online]. Available: https://openreview.net/forum?id=SJU4ayYgl.
[52] Nicolson A, Paliwal KK. Masked multi-head self-attention for causal speech enhancement. Speech Commun 2020;125:80–96.
[53] Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[54] Valentini-Botinhao C, Wang X, Takaki S, Yamagishi J. Investigating rnn-based speech enhancement methods for noise-robust text-to-speech. In: SSW; 2016. p. 146–52.
[55] Williamson DS, Wang D. Time-frequency masking in the complex domain for speech dereverberation and denoising. IEEE/ACM Trans Audio Speech Lang Process 2017;25(7):1492–501.
[56] He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE international conference on computer vision; 2015. p. 1026–34.
[57] Li G, Muller M, Thabet A, Ghanem B. Deepgcns: can gcns go as deep as cnns? In: Proceedings of the IEEE/CVF international conference on computer vision; 2019. p. 9267–76.

[58] Le Roux J, Wisdom S, Erdogan H, Hershey JR. Sdr–half-baked or well done? In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2019. p. 626–30.
[59] Braun S, Tashev I. Data augmentation and loss normalization for deep noise suppression. In: International conference on speech and computer. Springer; 2020. p. 79–86.
[60] Bu H, Du J, Na X, Wu B, Zheng H. Aishell-1: an open-source mandarin speech corpus and a speech recognition baseline. In: 2017 20th conference of the oriental chapter of the international coordinating committee on speech databases and speech I/O systems and assessment (O-COCOSDA). IEEE; 2017. p. 1–5.
[61] Shi Y, Bu H, Xu X, Zhang S, Li M. Aishell-3: a multi-speaker mandarin tts corpus and the baselines. arXiv preprint arXiv:2010.11567, 2020.
[62] Panayotov V, Chen G, Povey D, Khudanpur S. Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE; 2015. p. 5206–10.
[63] Gemmeke JF, Ellis DP, Freedman D, Jansen A, Lawrence W, Moore RC, et al. Audio set: an ontology and human-labeled dataset for audio events. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE; 2017. p. 776–80.
[64] Snyder D, Chen G, Povey D. Musan: a music, speech, and noise corpus. arXiv preprint arXiv:1510.08484, 2015.
[65] Loshchilov I, Hutter F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[66] Rix AW, Beerends JG, Hollier MP, Hekstra AP. Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs. In: 2001 IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No. 01CH37221), vol. 2. IEEE; 2001. p. 749–52.
[67] Taal CH, Hendriks RC, Heusdens R, Jensen J. An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Trans Audio Speech Lang Process 2011;19(7):2125–36.
[68] Jensen J, Taal CH. An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Trans Audio Speech Lang Process 2016;24(11):2009–22.
[69] Reddy CK, Gopal V, Cutler R. Dnsmos p. 835: a non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2022. p. 886–90.
[70] Wu Z, Liu Z, Lin J, Lin Y, Han S. Lite transformer with long-short range attention. arXiv preprint arXiv:2004.11886, 2020.
[71] Li Y, Zhang Y, Timofte R, Van Gool L, Yu L, Li Y, et al. Ntire 2023 challenge on efficient super-resolution: methods and results. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2023. p. 1921–59.
[72] Singhania D, Rahaman R, Yao A. C2f-tcn: a framework for semi- and fully-supervised temporal action segmentation. IEEE Trans Pattern Anal Mach Intell 2023.
[73] Li W, Lu X, Qian S, Lu J, Zhang X, Jia J. On efficient transformer-based image pre-training for low-level vision. arXiv preprint arXiv:2112.10175, 2021.
[74] Reddy CK, Dubey H, Gopal V, Cutler R, Braun S, Gamper H, et al. Icassp 2021 deep noise suppression challenge. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2021. p. 6623–7.
[75] Isik U, Giri R, Phansalkar N, Valin J-M, Helwani K, Krishnaswamy A. Poconet: better speech enhancement with frequency-positional embeddings, semi-supervised conversational data, and biased loss. arXiv preprint arXiv:2008.04470, 2020.
[76] Lv S, Hu Y, Zhang S, Xie L. Dccrn+: channel-wise subband dccrn with snr estimation for speech enhancement. arXiv preprint arXiv:2106.08672, 2021.
[77] Chen J, Rao W, Wang Z, Wu Z, Wang Y, Yu T, et al. Speech enhancement with fullband-subband cross-attention network. In: Proc. interspeech 2022; 2022. p. 976–80.
[78] Xu S, Zhang Z, Wang M. Channel and temporal-frequency attention unet for monaural speech enhancement. EURASIP J Audio Speech Music Process 2023;2023(1):30.
[79] Ren X, Zhang X, Chen L, Zheng X, Zhang C, Guo L, et al. A causal u-net based neural beamforming network for real-time multi-channel speech enhancement. In: Interspeech; 2021. p. 1832–6.
[80] Halimeh MM, Kellermann W. Complex-valued spatial autoencoders for multichannel speech enhancement. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2022. p. 261–5.
