
LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading

Leyuan Qu, Cornelius Weber, and Stefan Wermter, Member, IEEE

Abstract— The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2, which consists of an encoder–decoder architecture and a location-aware attention mechanism to map face image sequences to mel-scale spectrograms directly without requiring any human annotations. The proposed LipSound2 model is first pre-trained on ∼2400 h of multilingual (e.g., English and German) audio-visual data (VoxCeleb2). To verify the generalizability of the proposed method, we then fine-tune the pre-trained model on domain-specific datasets (GRID and TCD-TIMIT) for English speech reconstruction and achieve a significant improvement in speech quality and intelligibility compared to previous approaches in speaker-dependent and speaker-independent settings. In addition to English, we conduct Chinese speech reconstruction on the Chinese Mandarin Lip Reading (CMLR) dataset to verify the impact on transferability. Finally, we train the cascaded lip reading (video-to-text) system by fine-tuning the generated audios on a pre-trained speech recognition system and achieve state-of-the-art performance on both English and Chinese benchmark datasets.

Index Terms— Lip reading, self-supervised pre-training, speech recognition, speech reconstruction.

Manuscript received 10 December 2020; revised 28 July 2021 and 16 February 2022; accepted 6 July 2022. Date of publication 22 July 2022; date of current version 6 February 2024. This work was supported in part by the China Scholarship Council (CSC) and in part by the German Research Foundation DFG under project CML TRR 169. (Corresponding author: Leyuan Qu.) The authors are with the Knowledge Technology Institute, Department of Informatics, University of Hamburg, 22527 Hamburg, Germany (e-mail: quleyuan9826@gmail.com; cornelius.weber@uni-hamburg.de; stefan.wermter@uni-hamburg.de). Digital Object Identifier 10.1109/TNNLS.2022.3191677

I. INTRODUCTION

INSPIRED by human bimodal perception [1], in which both sight and sound are used to improve the comprehension of speech, a lot of effort has been spent on speech processing tasks that leverage visual information, for example, integrating simultaneous lip movement sequences into speech recognition [2], [3], guiding neural networks in isolating target speech signals with a static face image for speech separation [4], [5], and grounding speech recognition with visual objects and scene information [6], [7]. Multimodal audio-visual methods achieve significant improvement over single-modality models since the visual signals are invariant to acoustic noise and complementary to auditory representations [8]. Moreover, the visual contribution becomes more important as the acoustic signal-to-noise ratio decreases [9].

In most approaches, the visual information is mainly used as an auxiliary input to complement audio signals. However, in some circumstances, the auditory information may be absent or extremely noisy, which motivates speech reconstruction. Speech reconstruction aims to generate both intelligible and high-quality speech by conditioning only on image sequences of talking mouths or faces. Generating intelligible speech from silent videos enables many applications, e.g., a silent visual input method on mobile phones for privacy protection in public areas [10]; communication assistance for patients who have undergone laryngectomy [11]; surveillance video understanding when only visual signals are available [12]; enhancement of video conferences or far-field human–robot interaction scenarios in a noisy environment [13]; and nondisruptive user intervention for autonomous vehicles [14].

It is challenging to reconstruct high-quality and intelligible speech from only mouth or face movements, since human speech is produced not only by externally observable organs, such as the lips and tongue, but also by internally invisible ones that are difficult to capture in most cases [15], for instance, the vocal cords and pharynx. Consequently, it is hard to infer fundamental frequency or voicing information controlled by these organs. Moreover, some phonemes are acoustically discriminative but not easy to distinguish visually, since the phonemes share the same place of articulation but differ in manner of articulation [16]; for example, /v/ and /f/ in English are both fricatives and look the same in terms of lip and teeth movements but differ in the vibration of the vocal cords (voiced versus unvoiced) and in aspiration (unaspirated versus aspirated), which are not visible in most video recordings. Hence, predicting human voices from appearance is still a challenging task [17].
In recent years, there has been a growing interest in speech reconstruction, and various methods have been proposed. A possible technique is to run lip reading (video-to-text) and text-to-speech (TTS) systems in cascade, but the lip reading performance is still unsatisfactory and the error is propagated to the TTS system. Alternatively, other researchers directly estimate speech representations from videos, for example, linear predictive coding (LPC) [18], bottleneck features [19], and mel-scale spectrograms [20], followed by a vocoder used to transform the intermediate representations to audio, for instance, STRAIGHT [21] or the WORLD vocoder [22]. In contrast to the cascaded approach, the information about speaker identity and speaking style can be relatively well preserved. However, most existing works only focus on speaker-dependent settings with a small vocabulary or artificial grammar dataset, or even build one model for each individual speaker, which does not meet the requirements of realistic scenarios.

In our previous work, we proposed LipSound [20] to directly map visual sequences to a low-level speech representation, i.e., the mel spectrogram, which is inspired by audio-visual self-supervised representation learning. By leveraging the natural co-occurrence of audio and visual streams in videos, without requiring any human annotations, or by treating one modality as the supervision of the other, self-supervised representation learning has received substantial interest, for example, learning representations by matching the temporal synchronization [23] or spatial alignment [24] of audio and video clips for action recognition.

In comparison to our previous work LipSound, which only focuses on speaker-dependent settings for the GRID artificial grammar dataset, in this article, we further explore to what extent large-scale crossmodal self-supervised pre-training can benefit speech reconstruction in terms of generalizability (speaker-independent) and transferability (non-Chinese to Chinese) on a large vocabulary continuous speech corpus, TCD-TIMIT. In addition, we also changed the LipSound architecture substantially by replacing the 1-D convolutional neural network (CNN) with 3-D CNN blocks (Conv 3D + Batch Norm + ReLU + Max Pooling + Dropout). This should enable the model to directly learn stable representations from raw pixels, and a location-aware attention mechanism is used to make the alignments between encoder and decoder more robust to nonverbal areas. Moreover, we replace the Griffin–Lim algorithm [25] with a neural vocoder to smoothly generate waveforms and voices.

As shown in Fig. 1(a), our approach first pre-trains the LipSound2 model on a large-scale multilingual audio-visual corpus (VoxCeleb2) to map silent videos to mel spectrograms and then fine-tunes the pre-trained model on specific domain datasets [GRID, TCD-TIMIT, and Chinese Mandarin Lip Reading (CMLR)], followed by a neural vocoder (WaveGlow [26]) to reconstruct the estimated mel spectrograms to waveforms. Lip reading (video-to-text) experiments are performed by fine-tuning the generated audios on a pre-trained acoustic model (Jasper [27]), as shown in Fig. 1(b).

Fig. 1. Process of (a) video-to-waveform generation and (b) waveform-to-text transformation.

The main contributions of this article are given as follows.

1) We propose an autoregressive encoder–decoder with attention architecture, LipSound2, to directly map silent facial movement sequences to mel-scale spectrograms for speech reconstruction, which does not require any human annotations.

2) We explore the model generalizability on speaker-independent and large-scale vocabulary datasets, which few studies have focused on, and we achieve better performance on speech quality and intelligibility in the speech reconstruction task.

3) To the best of our knowledge, no previous research has investigated Chinese speech reconstruction in speaker-dependent and speaker-independent cases.

4) By leveraging the large-scale self-supervised pre-training of LipSound2 and the advanced Jasper speech recognition model, our cascaded lip reading system outperforms existing models by a margin on both English and Chinese corpora.

This article is organized as follows. Section II reviews related work on lip-to-speech reconstruction, lip reading, and self-supervised learning. Section III provides the model details, followed by the description of the datasets and evaluation metrics in Section IV. Experimental results and discussion are presented in Sections V and VI, respectively. We conclude this article in Section VII.
II. RELATED WORK

A. Lip-to-Speech Reconstruction

In recent years, researchers have investigated a variety of approaches to speech reconstruction from silent videos. We only review the neural network methods in this article. Cornu and Milner [28] proposed to use fully connected (FC) neural networks to estimate spectral envelope representations, for instance, LPC coefficients and mel filter bank amplitudes, from visual feature inputs, such as the 2-D discrete cosine transform, followed by a STRAIGHT vocoder [21], which is used to synthesize time-domain speech signals from the estimated representations. Follow-up work [29] predicts speech-related codebook entries with a classification framework to further improve speech intelligibility. Instead of using handcrafted visual features, Ephrat and Peleg [18] utilized CNNs to automatically learn optimal features from raw pixels and showed promising results on out-of-vocabulary experiments. Subsequently, improved results were reported by Ephrat et al. [30] by combining a ResNet backbone and a postprocessing network on a large-scale vocabulary dataset, TCD-TIMIT [31]. Akbari et al. [19] treated the intermediate bottleneck features learned by a speech autoencoder as training targets by conditioning on lip reading network outputs. Kumar et al. [32] validated the effectiveness of using multiple views of faces on both speaker-dependent and speaker-independent speech reconstruction. Vougioukas et al. [33] utilized generative adversarial networks (GANs) to directly predict raw waveforms from visual inputs in an end-to-end fashion without generating an intermediate representation of the audio. Inspired by the speech synthesis model Tacotron2 [34], Qu et al. [20] proposed to directly map video inputs to a low-level speech representation, the mel spectrogram, with an encoder–decoder architecture and achieved better results on lip reading experiments. Afterward, Prajwal et al. [35] improved the model performance with 3-D CNNs and skip connections. Recently, Michelsanti et al. [36] presented a multitask architecture to learn the spectral envelope, aperiodic parameters, and fundamental frequency separately, which are then fed into a vocoder for waveform synthesis. They integrate a connectionist temporal classification (CTC) [37] loss to jointly perform lip reading, which is capable of further enhancing and constraining the video encoder.

In addition to sequences of lip or face images, further signals can be used for temporal self-supervision. For instance, Gonzalez et al. [38] generated speech from articulatory sensor data and Akbari et al. [39] reconstructed speech from invasive electrocorticography. However, most existing works only focus on a speaker-dependent setting and small vocabulary or artificial grammar datasets. In this article, we evaluate our method not only in speaker-dependent experiments but also pay attention to speaker-independent and large-scale vocabulary setups.

B. Lip Reading

Lip reading, also known as visual speech recognition, is the task of predicting text transcriptions from silent videos, such as mouth or face movement sequences. Research on lip reading has a long tradition. Approaches to lip reading generally fall into two categories on the feature level: 1) handcrafted visual feature extraction, such as the discrete cosine transform [40], the discrete wavelet transform [41], or active appearance models [42] and 2) representations learned by neural networks, which have become the dominant technique for this task, for example, using convolutional autoencoders [43], spatiotemporal CNNs [44], long short-term memory [45], or residual networks [46].

Alternatively, methods can be divided by their modeling units for lip reading into word and character levels.

1) In the case of word-level units, lip reading is simplified to a classification task. Word-level lip reading datasets and benchmarks have been built, for instance, LRW [47] for English and LRW-1000 [48] for Chinese. Stafylakis and Tzimiropoulos [46] adopted spatiotemporal convolutional networks and a 2-D ResNet as the front end to extract visual features and bidirectional long short-term memory networks as the backend to capture temporal information, and attained significant improvement. Weng and Kitani [49] presented two separate deep 3-D CNN front ends to learn features from grayscale video and optical flow inputs. Martinez et al. [50] replaced the recurrent neural networks widely used in past work with temporal convolutional networks to simplify the training procedure. The word-level methods are usually able to achieve high accuracy; however, the models disregard the interaction or co-articulation phenomena between phonemes or words. A predefined lexicon with a closed-set vocabulary is used and words are usually treated as isolated units in speech. Thereby, long-term context information and assimilation or dissimilation effects are completely neglected. Moreover, it is hard to recognize out-of-vocabulary words.

2) Lip reading models at the character or phoneme level mainly use methods proposed in speech recognition. Assael et al. [44] conducted end-to-end lip reading experiments on the sentence level with a CTC loss. Subsequently, sequence discriminative training [51] and domain-adversarial training [52] were introduced to lip reading. Chung et al. [2] collected the "lip reading sentences" (LRS) dataset, which consists of hundreds of thousands of videos from BBC television, and significantly promoted research on sentence-level lip reading. Shillingford et al. [53] verified the effectiveness of large-scale data (3886 h of video) for training continuous visual speech recognition. Afouras et al. [54] compared the performance of recurrent neural networks, fully convolutional networks, and transformers on lip reading character recognition.

Different from the mainstream methods, which directly transform videos to text, we perform lip reading experiments in a cascaded manner, in which the silent videos are first mapped to audio with our LipSound2 model and, then, text transcriptions are predicted by fine-tuning on a pretrained speech recognition system.

C. Self-Supervised Learning

As a form of unsupervised learning, self-supervised learning leverages massive unlabeled data and aims to learn effective intermediate representations with the supervision of self-generated labels. Training on unlabeled data in a supervised manner relies on pretext tasks that determine which labels and loss functions are to be used. In computer vision, the pretext tasks can be predicting the angles of rotated images [55], learning the relative position of segmented regions in an image [56], placing shuffled patches back [57], or colorizing grayscale input images [58]. Video-based pretext tasks can be tracking moving objects in videos [59], validating temporal frame orders [60], video colorization [61], and so on.
Self-supervised learning is also widely used in natural language processing. Substantial progress has been made recently, where diverse pretext tasks have been proposed, for instance, predicting center words using surrounding ones or vice versa [62], generating the next word by conditioning on previous words in an autoregressive fashion [63], completing masked tokens or consecutive utterances [64], recovering the order of shuffled words [65], or recovering the permutation of rotated sentences [66].

Inspired by the strong correlation between different modalities, where, for example, the audio and visual modalities are semantically consistent or temporally synchronous, more and more researchers work on multimodal or cross-modal self-supervised learning. Multimodal self-supervised learning aims at learning joint or shared latent spaces or representations, while cross-modal self-supervised learning lets one modality supervise another. Here, we only review the audio-visual modalities, since this is the main focus of this article. Different pretext tasks are designed according to the correspondence and synchronization of the audio and visual modalities, for instance, predicting whether image and audio clips correspond, to enable neural networks to classify sounds [67], learn cross-modal retrieval [68], or locate the sound source in an image [69]. Besides, multimodal self-supervised representation learning can also be performed by matching the temporal synchronization [23] or spatial alignment [24] of audio and video clips in the context of action recognition, where a contrastive loss and a clustering loss are combined to learn high-level semantic representations for visual event and concept understanding [70]. In this article, we focus on cross-modal self-supervised learning, where the corresponding audio signals provide the supervision for face sequence inputs.

III. MODEL ARCHITECTURE

Fig. 2 shows the LipSound2 model architecture. We split the video clips into an audio stream used as the training target and a visual stream used as the model input. The system consumes the visual part to predict the audio counterpart in a self-supervised fashion. The proposed architecture is composed of an encoder–decoder and an attention model to map the soundless visual sequences to a low-level acoustic representation, mel-scale spectrograms. The advantage is that, in contrast to directly predicting the raw waveform, working with mel spectrograms not only reduces the computational complexity but also makes long-distance dependences easier to learn. Model details are listed in Table I. A pre-trained neural vocoder, WaveGlow, then follows to reconstruct the raw waveform from the generated mel spectrogram.

Fig. 2. Architecture of LipSound2. The video is split into visual and acoustic streams. The face region, which is cropped from the silent visual stream, is used as the model input. The acoustic spectrogram features extracted from the counterpart audio stream are used as the training target. During training, the ground-truth spectrogram frames are utilized to accelerate convergence, while, during inference, the outputs from previous steps are used.

TABLE I. Configuration of the LipSound2 encoder, decoder, attention, and postnet.
A. Encoder

The multitask CNN (MTCNN) [71] is used to detect face landmarks in the raw videos. We crop only the face region (112 × 112 pixels) and smooth all frame landmarks, since low-resolution videos or profile faces sometimes lead to detection failures, and landmark smoothing can eliminate frame skips in adjacent images. The cropped face sequences are then fed into 3-D CNN blocks, where each block consists of a 3-D CNN, batch normalization, ReLU activation, max pooling, and dropout, as shown in Fig. 2. Then, two bidirectional LSTM layers follow, which capture the long-distance dependence from the left and right context.
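For illustration, a minimal PyTorch sketch of such an encoder is given below: stacked 3-D CNN blocks followed by two bidirectional LSTM layers. The channel sizes, kernel sizes, and number of blocks are illustrative assumptions; the actual configuration is the one listed in Table I.

```python
# Minimal sketch of a LipSound2-style video encoder: 3-D CNN blocks
# (Conv3D + BatchNorm + ReLU + MaxPool + Dropout) followed by a BiLSTM.
# Channel sizes, kernels, and pooling are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn


class Conv3DBlock(nn.Module):
    def __init__(self, in_ch, out_ch, dropout=0.1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
            # Pool only the spatial dimensions so the temporal resolution is kept.
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.block(x)


class VideoEncoder(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.convs = nn.Sequential(
            Conv3DBlock(3, 32), Conv3DBlock(32, 64), Conv3DBlock(64, 128)
        )
        self.blstm = nn.LSTM(128, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)

    def forward(self, frames):                # (B, 3, T, 112, 112) face crops
        feats = self.convs(frames)            # (B, C, T, H', W')
        feats = feats.mean(dim=(-2, -1))      # collapse the spatial grid
        feats = feats.transpose(1, 2)         # (B, T, C) for the LSTM
        out, _ = self.blstm(feats)            # (B, T, 2 * hidden)
        return out


enc = VideoEncoder()
video = torch.randn(1, 3, 25, 112, 112)       # one clip of 25 face crops
print(enc(video).shape)                       # torch.Size([1, 25, 512])
```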

B. Location-Sensitive Attention

We use location-aware attention [72] to bridge the encoder and the decoder. The image sequence input i = (i_0, ..., i_n) is first embedded into the latent space representation vector h = (h_1, ..., h_n) by the encoder, with the same dimension n in time, and then the intermediate vector h is decoded into the mel spectrogram o = (o_0, ..., o_m). At time step t (0 ≤ t ≤ m), the attention weight a_t can be obtained by the following equations:

a_t = Softmax(W · tanh(M · h + Q · x + L · y))    (1)
x = LSTM(h · a_{t−1}, p_prenet)    (2)
y = Conv(a_{t−1}, Σ_{0≤i≤t−1} a_i)    (3)

where W, M, Q, and L are the matrices learned by the weight FC, memory FC, query FC, and location FC layers, respectively. In (3), the sum of the attention weights of all previous steps is integrated, which enables the attention at the current step to be aware of the global location and to move forward monotonically. Fig. 3 visualizes the computational flow of the attention mechanism. The attention content vector v_t can be obtained by multiplying the encoder output by the normalized attention weights (see the following equation):

v_t = a_t · h.    (4)

Fig. 3. Computational flow of location-aware attention at time step t.
C. Decoder

The decoder module consists of one unidirectional LSTM layer and one linear projection layer. The decoder LSTM consumes the attention content vector and the output from the attention LSTM to generate one frame at a time. Subsequently, the linear projection layer maps the decoder LSTM outputs to the dimension of the mel-scale filter bank. During training, we use ground-truth mel-spectrogram frames as PreNet inputs, and during inference, the predicted frames from previous time steps are used. Since the decoder only receives past information at every time step, after decoding, five Conv1D layers (the postnet) are used to further improve the model performance by smoothing the transitions between adjacent frames and using future information, which is not available during decoding.

D. Training Objective

The loss function is the sum of two mean square errors (MSEs), as shown in (5), i.e., the MSE between the decoder output O_dec and the target mel spectrogram M_tar and the MSE between the postnet output O_post and the target mel spectrogram:

Loss = MSE(O_dec, M_tar) + MSE(O_post, M_tar).    (5)
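Expressed in code, the objective is a direct transcription of (5); the helper name below is ours and this is not the authors' training script.

```python
# Sum of the two MSE terms in (5); illustrative helper, not the original code.
import torch.nn.functional as F

def lipsound2_loss(o_dec, o_post, m_tar):
    return F.mse_loss(o_dec, m_tar) + F.mse_loss(o_post, m_tar)
```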
E. WaveGlow

We use WaveGlow [26], which combines the approach of the glow-based generative model [73] with the architectural insight of WaveNet [74], to transform the estimated mel spectrogram back to audio. WaveGlow abandons autoregression [74] and speeds up the procedure of waveform synthesis at high quality and resolution. We train WaveGlow from scratch using the same settings as the original work [26], but at a 16k sampling rate, on the LJSpeech dataset [75] to meet the requirements of the follow-up ASR models. To our surprise, the WaveGlow model, which is trained with only one female voice, can effectively generalize to unseen voices and stably perform waveform reconstruction.

F. Acoustic Model and Language Model

The Jasper [27] speech recognition system, which is a fully convolutional architecture trained with skip connections and a CTC loss, is adopted to directly predict characters from speech signals. We pretrain the Jasper DR 10 × 5 model1 on the 960-h LibriSpeech and 1000-h AISHELL-2 corpora, which achieves 3.61% word error rate (WER) and 10.05% character error rate (CER) on the development sets for English and Chinese, respectively.

Beam search is utilized to decode the output character probabilities from Jasper, together with a 6-gram KenLM [76] language model,2 into grammatically and semantically correct words on the sentence level [77].

1 https://nvidia.github.io/OpenSeq2Seq/html/speech-recognition.html
2 https://github.com/PaddlePaddle/DeepSpeech
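For illustration, the snippet below shows plain greedy CTC decoding (collapse repeated symbols, then drop blanks) over frame-wise character probabilities such as those produced by Jasper. The character set is an assumed placeholder, and the actual system instead uses beam search with the KenLM language model, which is more involved.

```python
# Greedy CTC decoding sketch: collapse repeats, then drop the blank symbol.
# The real system uses beam search with a 6-gram KenLM; this only illustrates
# how characters are read off the frame-wise CTC output.
import torch

VOCAB = list(" abcdefghijklmnopqrstuvwxyz'")   # assumed character set
BLANK = len(VOCAB)                              # CTC blank index

def ctc_greedy_decode(log_probs):
    """log_probs: (T, num_chars + 1) frame-wise log probabilities."""
    best = log_probs.argmax(dim=-1).tolist()
    chars, prev = [], None
    for idx in best:
        if idx != prev and idx != BLANK:        # collapse repeats, skip blanks
            chars.append(VOCAB[idx])
        prev = idx
    return "".join(chars)

print(ctc_greedy_decode(torch.randn(50, len(VOCAB) + 1).log_softmax(-1)))
```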
IV. EXPERIMENTAL SETUP

A. Dataset

All datasets used in this article are summarized in Table II, and random frames from the audio-visual corpora are presented in Fig. 4. VoxCeleb2 is a large-scale audio-visual corpus, extracted from YouTube videos, containing over one million utterances from more than 6k different speakers of around 145 nationalities and languages. It includes noisy and unconstrained conditions; specifically, the audio stream may be recorded with background noise, such as laughter and room reverberation, and the visual part may contain variable head poses (e.g., frontal faces and profiles), variable lighting conditions, and low image quality, whereas the GRID and TCD-TIMIT datasets were recorded in controlled experimental environments with a fixed frontal face angle and a clean background in audio and vision. It is worth mentioning that the GRID dataset is designed to contain only a fixed six-word structure, and all sentences are generated by a restricted artificial grammar: command + color + preposition + letter + digit + adverb, for example, "set blue in Z three now." CMLR is collected from videos by 11 hosts of the Chinese national news program News Broadcast; it contains frontal faces and covers a large amount of Chinese vocabulary. We first pretrain LipSound2 on VoxCeleb2 and then fine-tune the model on GRID, TCD-TIMIT, and CMLR for video to mel-spectrogram reconstruction.

TABLE II. Overview of all corpora used in this article. Spk: speakers. Utt: utterances. Vocab: vocabulary.

LibriSpeech and AISHELL-2 are the currently largest open-source speech corpora and widely used speech recognition benchmarks for English and Chinese, respectively. LibriSpeech is derived from audiobooks, containing 460 h of clean speech and 500 h of noisy speech. AISHELL-2 consists of 1000 h of speech from different domains, for instance, voice command and smart home scenarios, and includes various accents from different areas of China. We use LibriSpeech and AISHELL-2 to pretrain the Jasper acoustic model to boost the performance of the waveform-to-text transformation. The generated speech on GRID, TCD-TIMIT, and CMLR is used for further fine-tuning to perform the lip reading (video-to-text) experiments.

The LJ Speech dataset, with only one female voice, is specifically designed for speech synthesis tasks and is used in this article for WaveGlow training to transform mel spectrograms back to waveforms.

Fig. 4. Random face samples from audio-visual corpora. Only the face region is cropped during training and test. Samples from audio-visual corpora [31], [78], [79], [80].

B. Evaluation Metrics

We evaluate the generated speech quality and intelligibility with the perceptual evaluation of speech quality (PESQ) [83] and the extended short-time objective intelligibility (ESTOI) [84], respectively. The speech-to-text results are measured with WER and CER, i.e., the ratio of error terms (substitutions, deletions, and insertions) to the total number of words/characters in the ground-truth sequences.
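The snippet below sketches how these metrics can be computed, using the open-source pesq and pystoi packages as assumed stand-ins for the official PESQ/ESTOI implementations and a plain edit-distance routine for WER/CER; it is illustrative rather than the authors' evaluation code.

```python
# Sketch of the evaluation metrics: PESQ via `pesq`, ESTOI via `pystoi`, and
# WER/CER as word-/character-level edit distance. Package choices are assumptions.
import numpy as np
from pesq import pesq          # pip install pesq
from pystoi import stoi        # pip install pystoi


def edit_distance(ref, hyp):
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0], d[0, :] = np.arange(len(ref) + 1), np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i, j] = min(d[i - 1, j] + 1,                               # deletion
                          d[i, j - 1] + 1,                               # insertion
                          d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
    return d[-1, -1]


def wer(ref_text, hyp_text):
    ref, hyp = ref_text.split(), hyp_text.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)


def cer(ref_text, hyp_text):
    ref, hyp = list(ref_text.replace(" ", "")), list(hyp_text.replace(" ", ""))
    return edit_distance(ref, hyp) / max(len(ref), 1)


def speech_scores(ref_wav, gen_wav, fs=16000):
    """Both inputs are 1-D float arrays at 16 kHz containing the same utterance."""
    return {"PESQ": pesq(fs, ref_wav, gen_wav, "wb"),           # wide-band PESQ
            "ESTOI": stoi(ref_wav, gen_wav, fs, extended=True)}


print(wer("set blue in z three now", "set blue at z three now"))   # ~0.167
```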
C. Training

We only describe the training settings of LipSound2 pre-training, LipSound2 fine-tuning, and Jasper acoustic model fine-tuning. More details about the Jasper1 pre-trained acoustic model, the KenLM2 language model, and WaveGlow3 can be found on the respective open-source websites.

3 https://github.com/NVIDIA/waveglow

1) Vision Stream: Face landmarks are detected using MTCNN [71] in all video frames, and only the face area is cropped and reshaped to a size of 112 × 112 as input. We also add one "visual period"—an empty frame with all values of 255—at the end of every visual stream to help the decoder stop decoding at the right time. A maximum decoder step threshold of 1000 is activated to terminate decoding when the decoder fails to capture the "visual period."
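A possible implementation of this preprocessing, assuming the facenet-pytorch MTCNN as the detector and smoothing the per-frame face boxes with a moving average, is sketched below; it is not the authors' exact pipeline.

```python
# Vision-stream preprocessing sketch: per-frame face detection, temporal box
# smoothing, cropping/resizing to 112 x 112, and the all-255 "visual period".
# The facenet-pytorch MTCNN API is an assumed stand-in for the detector used here.
import numpy as np
from PIL import Image
from scipy.ndimage import uniform_filter1d
from facenet_pytorch import MTCNN      # pip install facenet-pytorch

detector = MTCNN(keep_all=False)

def preprocess_video(frames, win=5):
    """frames: list of RGB PIL images; returns a (T + 1, 112, 112, 3) uint8 array."""
    boxes = []
    for img in frames:
        box, _ = detector.detect(img)                        # (1, 4) array or None
        if box is None:                                      # detection failure:
            box = [boxes[-1] if boxes else                   # reuse the previous box
                   np.array([0, 0, img.width, img.height])]  # or fall back to full frame
        boxes.append(box[0])
    smoothed = uniform_filter1d(np.stack(boxes), size=win, axis=0)  # moving average
    crops = [np.asarray(img.crop(tuple(int(v) for v in b)).resize((112, 112)))
             for img, b in zip(frames, smoothed)]
    crops.append(np.full((112, 112, 3), 255, dtype=np.uint8))       # "visual period"
    return np.stack(crops)
```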
2) Audio Stream: We first divide the raw waveforms by the maximum value to normalize all audios to [0, 1] and then extract the magnitude using the short-time Fourier transform (STFT) with 1024 frequency bins and a 64-ms window size with a 16-ms stride. The mel-scale spectrograms are obtained by applying an 80-channel mel filter bank to the magnitude, followed by dynamic range clipping with a minimum value of 1e−5 and log dynamic range compression.
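A sketch of this feature extraction, using librosa as an assumed stand-in for the authors' audio tooling, could look as follows (the test signal is a synthetic tone for demonstration only).

```python
# Audio-stream feature extraction sketch: peak normalization, 1024-point STFT
# magnitude (64-ms window, 16-ms hop at 16 kHz), 80-channel mel filter bank,
# clipping at 1e-5, and log compression. librosa is an assumed stand-in.
import numpy as np
import librosa

def wav_to_mel(wav, sr=16000, n_fft=1024, n_mels=80):
    wav = wav / np.abs(wav).max()                          # peak normalization
    win = int(0.064 * sr)                                  # 64-ms window = 1024 samples
    hop = int(0.016 * sr)                                  # 16-ms stride  = 256 samples
    mag = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop, win_length=win))
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel = mel_fb @ mag                                     # (80, frames)
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))    # dynamic range compression

wav = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000).astype(np.float32)
print(wav_to_mel(wav).shape)                               # (80, number of frames)
```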
3) LipSound2 Pre-Training: Horizontal image flipping, gradient clipping with a threshold of 1.0, early stopping, and scheduled sampling [85] are adopted to avoid overfitting. Linear and convolutional layers are initialized with Xavier initialization [86], with linear and tanh gains, respectively. We use a cosine learning rate decay strategy with an initial value of 0.001. Our LipSound2 model has around 100M parameters. The audio and visual sequences are both high-dimensional data, so we conduct all experiments on four NVIDIA Quadro RTX 6000 GPUs with 24-GB memory in parallel to enable a large batch size. The entire pre-training procedure took around 25 days.
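An illustrative training-loop skeleton for these optimization settings (initial learning rate of 0.001 with cosine decay and gradient clipping at 1.0) is given below; the optimizer choice and the model interface are assumptions, and early stopping, scheduled sampling, and multi-GPU parallelism are omitted for brevity.

```python
# Training-loop skeleton for the optimization details above; a sketch under
# assumed interfaces (optimizer choice, model signature), not the original code.
import torch

def train(model, loader, epochs=100):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)          # initial LR 0.001
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    for epoch in range(epochs):
        for video, mel_target in loader:
            mel_dec, mel_post = model(video, targets=mel_target)  # teacher forcing
            loss = torch.nn.functional.mse_loss(mel_dec, mel_target) + \
                   torch.nn.functional.mse_loss(mel_post, mel_target)
            opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            opt.step()
        sched.step()                                              # cosine LR decay
```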
4) Fine-Tuning: The pre-trained LipSound2 is fine-tuned on GRID, TCD-TIMIT, and CMLR videos to conduct the speech reconstruction experiments. Afterward, the produced speech for English (GRID and TCD-TIMIT) and Chinese (CMLR) is used to fine-tune the pre-trained English (LibriSpeech) and Chinese (AISHELL-2) acoustic models to perform the lip reading tasks, with a ten times smaller learning rate.

V. EXPERIMENTAL RESULTS

A. Lip-to-Speech Reconstruction

1) Speaker-Dependent Result: We report the generated speech results from two perspectives, i.e., speech quality (PESQ) and speech intelligibility (ESTOI). For a fair comparison, we keep the same settings as previous works. For speaker-dependent tasks, all datasets are randomly split 90:5:5 into training, validation, and test sets on GRID (Speakers S1–S4) and TCD-TIMIT (Lipspeakers 1–3), respectively. Different from previous works that build one model for each individual speaker, we train only one model on all speakers.

TABLE III. Speaker-dependent speech reconstruction results on the GRID and TCD-TIMIT datasets.

As shown in Table III, our LipSound2 system, which is first pre-trained on the VoxCeleb2 dataset and then fine-tuned on the specific dataset, achieves the highest scores on both PESQ and ESTOI, which reveals the effectiveness of our proposed method. The last column in Table III compares the number of LipSound2 model parameters against those of the baseline systems, showing that its best performance is obtained while staying well within the existing range of parameter counts.

2) Speaker-Independent Result: For the speaker-independent cases, we follow the same setups for GRID [33] and TCD-TIMIT [31].

TABLE IV. Speaker-independent speech reconstruction results on the GRID and TCD-TIMIT datasets.

LipSound2 achieves the best results on both metrics on the GRID dataset. Moreover, by listening to the reconstructed audios, we find that our model is capable of producing voices similar to the ground-truth speakers, instead of generating a weird voice or one of the voices in the training set, as occurred in previous works. The model has implicitly learned the mapping between voices and faces. We highly recommend readers to listen to the produced samples on our demo website.4

4 https://leyuanqu.github.io/LipSound2/

Furthermore, we find substitution errors occurring on the segment level (vowels and consonants), because the context information is still not sufficient to disambiguate phonemes that share the same visible organs, such as lips and tongue, but differ in the invisible ones.

To the best of our knowledge, we are the first to tackle the speaker-independent case on the TCD-TIMIT dataset, since TCD-TIMIT consists of limited samples (∼370) for each speaker but with a large-scale vocabulary (∼5.9K), which makes the tasks on TCD-TIMIT quite challenging. The speaker-independent results reported in Table IV show considerable performance; for example, the PESQ result is even better than some results reported in speaker-dependent settings (as shown in Table III), which suggests that the large-scale self-supervised pre-training enables the model to successfully generalize to unseen speakers.

3) Speech Reconstruction for Chinese: To explore the effectiveness of our proposed architecture, we further perform speech reconstruction in Chinese. For the speaker-dependent case, we keep the same training and test splits used in CSSMCM [80] for lip reading; for the speaker-independent case, S1 (male) and S6 (female) are used for testing and the remaining speakers are used for training and validation.

TABLE V. Speech reconstruction results for Chinese on the CMLR dataset.
In Table V, only LipSound2 results are reported, since we make a first attempt at tackling speech reconstruction in Chinese. After checking the generated audio samples, we find that, besides the confusion on segments, there are some tone errors. One of the reasons is that Chinese is a tonal language, in which lexical tones play an important role in semantic discrimination. The fundamental frequency (F0), which is produced by the vibration of the vocal cords, is not visible in the input videos (face area), and it has been reported that visual features have only a weak correlation to F0 [28]. Another reason is that the VoxCeleb2 dataset mainly consists of nontonal languages, e.g., British English, American English, and German, which makes the pre-training pay little attention to tone production.

Fig. 5. Comparison between generated mel spectrogram and ground truth in speaker-dependent and speaker-independent settings for English and Chinese [79], [31], [80].

4) Attention Alignment: We compare the attention alignments learned by LipSound [20], which is only trained on the GRID dataset, and by LipSound2 (this article). As shown in Fig. 6, the LipSound attention weights are fuzzy at nonverbal areas and at short pauses between words, which may mislead the decoder into focusing on irrelevant encoder timesteps, whereas the attention weights learned by LipSound2 are concentrated and more robust to silence or short pauses.

Fig. 6. Attention alignment comparison on the GRID dataset.

B. Lip Reading Results

Different from conventional methods, which directly transform videos into text, we perform the lip reading experiments in two steps, i.e., video-to-wav and wav-to-text.

1) Lip Reading Results for English: We follow the same splits as previous works for training and test on the GRID [44] and TCD-TIMIT [87] datasets. The comparison with related results is listed in Table VI. We report the WER of the GRID and TCD-TIMIT audio test sets on the pre-trained acoustic models (audio gold standard) and the results fine-tuned on the training audio samples (+Fine-Tuning), which are treated as the upper bound for lip reading.

TABLE VI. Lip reading results on the GRID and TCD-TIMIT datasets in WER. Spk-Dep: speaker-dependent. Spk-Indep: speaker-independent. LM: language model.

Our LipSound2 model achieves state-of-the-art performance on both the GRID and TCD-TIMIT datasets. Fine-tuning the acoustic model pretrained on 960-h LibriSpeech with the generated audios can not only significantly boost the model performance but also shorten the training time.

Further improvement can be achieved when an external language model is integrated. The benefit from the language model on the GRID dataset is not as large as on TCD-TIMIT, since the sentence structure in GRID is designed by an artificial grammar. The language model can only help to correct misspelled words but cannot contribute grammatically or semantically.

2) Lip Reading Results for Chinese: We also explore lip reading performance in Chinese, as shown in Table VII. The audio gold standard is obtained by directly evaluating the CMLR test set on the acoustic model pre-trained on the 1000-h AISHELL-2 dataset. After fine-tuning with the CMLR training audios, we obtain 3.88% CER and 4.89% CER for the speaker-dependent and speaker-independent cases, respectively.

TABLE VII. Lip reading results for Chinese on the CMLR dataset. CER: character error rate.

In comparison to other work, our LipSound2 model achieves better results. The CER further drops when decoding with an external language model. Besides, we establish a new baseline for CMLR in the speaker-independent setting.
VI. DISCUSSION

Although the proposed LipSound2 model pre-trained on a large-scale dataset achieves considerable performance on both the speech reconstruction and lip reading tasks, it still generates erroneous speech due to visual similarity in pronunciation; for example, "pill" is easily misrecognized as "bill" in English, and "ji zhi" is mistaken for "qi zhi" in Chinese. In addition, our model can generate voices quite similar to the ground truth in speaker-dependent settings, while in speaker-independent cases the model is sometimes inclined to predict a voice existing in the training set. For details and demonstrations, we refer also to the demo video on the project website.5 How to stop the fine-tuning procedure at the appropriate time and avoid the model overfitting on downstream tasks is an important direction for future research, since the MSE loss always declines when using teacher forcing during training, which hardly indicates whether the model is overfitting or not. Besides, a possible solution could be using voice embeddings as additional inputs, which can efficiently help models learn speaker identity information, as we found in our previous work [4].

5 https://leyuanqu.github.io/LipSound2/

VII. CONCLUSION

In this article, we have proposed LipSound2, which directly predicts speech representations from raw pixels. We investigated the effectiveness of self-supervised pre-training for speech reconstruction on large-scale vocabulary datasets, particularly for speaker-independent settings. Moreover, state-of-the-art results are achieved by fine-tuning the produced audios on a well-pretrained speech recognition model for both English and Chinese lip reading experiments, since our two-step method benefits not only from the large-scale crossmodal supervision, which enables the model to learn more robust representations and more diverse content information, but also from the advanced speech recognition architecture (acoustic and language models), which is pre-trained on abundant labeled data.

Although we have made great progress on speech reconstruction in controlled environments, there is still a significant gap to the requirements of real-world scenarios. Future work will focus on more realistic configurations, such as varying lighting conditions, moving head poses, and different background environments. Moreover, the current lip reading experiments are conducted separately in two steps, in which the error generated in the first step (video-to-wav) is propagated to the second step (wav-to-text). How to jointly train the two tasks in an end-to-end fashion could be another direction. Besides, we are also interested in integrating our LipSound2 model into active speaker detection, speech enhancement, and speech separation tasks to boost the performance of speech recognition systems in human–robot interaction.

ACKNOWLEDGMENT

The authors would like to thank Katja Kösters for improving the language of this article.

REFERENCES

[1] J. Besle, A. Fort, C. Delpuech, and M.-H. Giard, "Bimodal speech: Early suppressive visual effects in human auditory cortex," Eur. J. Neurosci., vol. 20, no. 8, pp. 2225–2234, Oct. 2004.
[2] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Lip reading sentences in the wild," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3444–3453.
[3] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Deep audio-visual speech recognition," IEEE Trans. Pattern Anal. Mach. Intell., early access, Dec. 21, 2018, doi: 10.1109/TPAMI.2018.2889052.
[4] L. Qu, C. Weber, and S. Wermter, "Multimodal target speech separation with voice and face references," in Proc. Interspeech, Oct. 2020, pp. 1416–1420.
[5] S.-W. Chung, S. Choe, J. S. Chung, and H.-G. Kang, "FaceFilter: Audio-visual speech separation using still images," in Proc. Interspeech, Oct. 2020, pp. 3481–3485.
[6] Y. Miao and F. Metze, "Open-domain audio-visual speech recognition: A deep learning approach," in Proc. Interspeech, Sep. 2016, pp. 3414–3418.
[7] A. Gupta, Y. Miao, L. Neves, and F. Metze, "Visual features for context-aware speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 5020–5024.
[8] J. Macdonald and H. McGurk, "Visual influences on speech perception processes," Perception Psychophys., vol. 24, no. 3, pp. 253–257, May 1978.
[9] P. L. Silsbee and A. C. Bovik, "Computer lipreading for improved accuracy in automatic speech recognition," IEEE Trans. Speech Audio Process., vol. 4, no. 5, pp. 337–351, Sep. 1996.
[10] B. Denby, T. Schultz, K. Honda, T. Hueber, J. M. Gilbert, and J. S. Brumberg, "Silent speech interfaces," Speech Commun., vol. 52, no. 4, pp. 270–287, 2010.
[11] H. R. Sharifzadeh, I. V. McLoughlin, and F. Ahmadi, "Reconstruction of normal sounding speech for laryngectomy patients through a modified CELP codec," IEEE Trans. Biomed. Eng., vol. 57, no. 10, pp. 2448–2458, Oct. 2010.
[12] M. Cristani, M. Bicego, and V. Murino, "Audio-visual event recognition in surveillance video sequences," IEEE Trans. Multimedia, vol. 9, no. 2, pp. 257–267, Feb. 2007.
[13] A. Tsiami, P. P. Filntisis, N. Efthymiou, P. Koutras, G. Potamianos, and P. Maragos, "Far-field audio-visual scene perception of multi-party human–robot interaction for children and adults," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 6568–6572.
[14] R. Tscharn, M. E. Latoschik, D. Löffler, and J. Hurtienne, "'Stop over there': Natural gesture and speech interaction for non-critical spontaneous intervention in autonomous driving," in Proc. 19th ACM Int. Conf. Multimodal Interact., Nov. 2017, pp. 91–100.
[15] B. Gick, I. Wilson, and D. Derrick, Articulatory Phonetics. Hoboken, NJ, USA: Wiley, 2012.
[16] S. Maeda, "Compensatory articulation during speech: Evidence from the analysis and synthesis of vocal-tract shapes using an articulatory model," in Speech Production and Speech Modelling. Dordrecht, The Netherlands: Springer, 1990, pp. 131–149.
[17] S. Goto, K. Onishi, Y. Saito, K. Tachibana, and K. Mori, "Face2Speech: Towards multi-speaker text-to-speech synthesis using an embedding vector predicted from a face image," in Proc. Interspeech, Oct. 2020, pp. 1321–1325.
[18] A. Ephrat and S. Peleg, "Vid2Speech: Speech reconstruction from silent video," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 5095–5099.
[19] H. Akbari, H. Arora, L. Cao, and N. Mesgarani, "Lip2Audspec: Speech reconstruction from silent lip movements video," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 2516–2520.
[20] L. Qu, C. Weber, and S. Wermter, "LipSound: Neural mel-spectrogram reconstruction for lip reading," in Proc. Interspeech, Sep. 2019, pp. 2768–2772.
[21] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Commun., vol. 27, nos. 3–4, pp. 187–207, Apr. 1999.
[22] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Trans. Inf. Syst., vol. 99, no. 7, pp. 1877–1884, 2016.
[23] B. Korbar, D. Tran, and L. Torresani, "Cooperative learning of audio and video models from self-supervised synchronization," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 7763–7774.
[24] P. Morgado, Y. Li, and N. Vasconcelos, "Learning representations from audio-visual spatial alignment," in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 1–12.
[25] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, no. 2, pp. 236–243, Apr. 1984.
[26] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2019, pp. 3617–3621.
[27] J. Li et al., "Jasper: An end-to-end convolutional neural acoustic model," in Proc. Interspeech, 2019, pp. 71–75.
[28] T. L. Cornu and B. Milner, "Reconstructing intelligible audio speech from visual speech features," in Proc. Interspeech, Sep. 2015, pp. 1–6.
[29] T. Le Cornu and B. Milner, "Generating intelligible audio speech from visual speech," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 25, no. 9, pp. 1751–1761, Sep. 2017.
[30] A. Ephrat, T. Halperin, and S. Peleg, "Improved speech reconstruction from silent video," in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2017, pp. 455–462.
[31] N. Harte and E. Gillen, "TCD-TIMIT: An audio-visual corpus of continuous speech," IEEE Trans. Multimedia, vol. 17, no. 5, pp. 603–615, May 2015.
[32] Y. Kumar, R. Jain, K. M. Salik, R. R. Shah, Y. Yin, and R. Zimmermann, "Lipper: Synthesizing thy speech using multi-view lipreading," in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 2588–2595.
[33] K. Vougioukas, P. Ma, S. Petridis, and M. Pantic, "Video-driven speech reconstruction using generative adversarial networks," in Proc. Interspeech, Sep. 2019, pp. 4125–4129.
[34] J. Shen et al., "Natural TTS synthesis by conditioning WaveNet on MEL spectrogram predictions," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 4779–4783.
[35] K. R. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. V. Jawahar, "Learning individual speaking styles for accurate lip to speech synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 13796–13805.
[36] D. Michelsanti, O. Slizovskaia, G. Haro, E. Gómez, Z.-H. Tan, and J. Jensen, "Vocoder-based speech synthesis from silent videos," in Proc. Interspeech, Oct. 2020, pp. 3530–3534.
[37] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. 23rd Int. Conf. Mach. Learn. (ICML), 2006, pp. 369–376.
[38] J. A. Gonzalez et al., "Direct speech reconstruction from articulatory sensor data by machine learning," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 25, no. 12, pp. 2362–2374, Dec. 2017.
[39] H. Akbari, B. Khalighinejad, J. L. Herrero, A. D. Mehta, and N. Mesgarani, "Towards reconstructing intelligible speech from the human auditory cortex," Sci. Rep., vol. 9, no. 1, pp. 1–12, Dec. 2019.
[40] M. Heckmann, K. Kroschel, C. Savariaux, and F. Berthommier, "DCT-based video features for audio-visual speech recognition," in Proc. 7th Interspeech, 2002, pp. 1–4.
[41] G. Potamianos, H. P. Graf, and E. Cosatto, "An image transform approach for HMM based automatic lipreading," in Proc. Int. Conf. Image Process., 1998, pp. 173–177.
[42] G. Sterpu and N. Harte, "Towards lipreading sentences with active appearance models," 2018, arXiv:1805.11688.
[43] D. Parekh, A. Gupta, S. Chhatpar, A. Y. Kumar, and M. Kulkarni, "Lip reading using convolutional auto encoders as feature extractor," 2018, arXiv:1805.12371.
[44] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, "LipNet: End-to-end sentence-level lipreading," in Proc. GPU Technol. Conf., 2017, pp. 1–13. [Online]. Available: https://github.com/Fengdalu/LipNet-PyTorch
[45] M. Wand, J. Koutník, and J. Schmidhuber, "Lipreading with long short-term memory," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2016, pp. 6115–6119.
[46] T. Stafylakis and G. Tzimiropoulos, "Combining residual networks with LSTMs for lipreading," in Proc. Interspeech, Aug. 2017, pp. 3652–3656.
[47] J. S. Chung and A. Zisserman, "Lip reading in the wild," in Proc. Asian Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 87–103.
[48] S. Yang et al., "LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild," in Proc. 14th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), May 2019, pp. 1–8.
[49] X. Weng and K. Kitani, "Learning spatio-temporal features with two-stream deep 3D CNNs for lipreading," 2019, arXiv:1905.02540.
[50] B. Martinez, P. Ma, S. Petridis, and M. Pantic, "Lipreading using temporal convolutional networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 6319–6323.
[51] K. Thangthai and R. Harvey, "Improving computer lipreading via DNN sequence discriminative training techniques," in Proc. Interspeech, Aug. 2017, pp. 1–5.
[52] M. Wand and J. Schmidhuber, "Improving speaker-independent lipreading with domain-adversarial training," in Proc. Interspeech, 2017, pp. 2415–2419.
[53] B. Shillingford et al., "Large-scale visual speech recognition," in Proc. Interspeech, 2018, pp. 4135–4139.
[54] T. Afouras, J. S. Chung, and A. Zisserman, "Deep lip reading: A comparison of models and an online application," in Proc. Interspeech, Sep. 2018, pp. 3514–3518.
[55] S. Gidaris, P. Singh, and N. Komodakis, "Unsupervised representation learning by predicting image rotations," in Proc. ICLR, 2018, pp. 1–16.
[56] C. Doersch, A. Gupta, and A. A. Efros, "Unsupervised visual representation learning by context prediction," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1422–1430.
[57] M. Noroozi and P. Favaro, "Unsupervised learning of visual representations by solving jigsaw puzzles," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 69–84.
[58] R. Zhang, P. Isola, and A. A. Efros, "Colorful image colorization," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 649–666.
[59] X. Wang and A. Gupta, "Unsupervised learning of visual representations using videos," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2794–2802.
[60] I. Misra, C. L. Zitnick, and M. Hebert, "Shuffle and learn: Unsupervised learning using temporal order verification," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 527–544.
[61] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy, "Tracking emerges by colorizing videos," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 391–408.
[62] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, arXiv:1301.3781.
[63] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," OpenAI Blog, 2018. Accessed: Aug. 6, 2020. [Online]. Available: https://openai.com/blog/language-unsupervised
[64] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2018, arXiv:1810.04805.
[65] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A lite BERT for self-supervised learning of language representations," 2019, arXiv:1909.11942.
[66] M. Lewis et al., "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," 2019, arXiv:1910.13461.
[67] R. Arandjelovic and A. Zisserman, "Look, listen and learn," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 609–617.
[68] S.-W. Chung, J. S. Chung, and H.-G. Kang, "Perfect match: Improved cross-modal embeddings for audio-visual synchronisation," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2019, pp. 3965–3969.
[69] R. Arandjelovic and A. Zisserman, "Objects that sound," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 435–451.
[70] B. Chen et al., "Multimodal clustering networks for self-supervised learning from unlabeled videos," 2021, arXiv:2104.12671.
[71] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Process. Lett., vol. 23, no. 10, pp. 1499–1503, Oct. 2016.
[72] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Proc. Adv. Neural Inf. Process. Syst., vol. 28, 2015, pp. 577–585.
[73] D. P. Kingma and P. Dhariwal, "Glow: Generative flow with invertible 1×1 convolutions," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 10215–10224.
[74] A. van den Oord et al., "WaveNet: A generative model for raw audio," in Proc. 9th ISCA Speech Synth. Workshop (SSW), 2016, p. 125.
[75] K. Ito and L. Johnson. (2017). The LJ Speech Dataset. [Online]. Available: https://keithito.com/LJ-Speech-Dataset/
[76] K. Heafield, "KenLM: Faster and smaller language model queries," in Proc. 6th Workshop Stat. Mach. Transl., 2011, pp. 187–197.
[77] S. Wermter and V. Weber, "SCREEN: Learning a flat syntactic and semantic spoken language analysis using artificial neural networks," J. Artif. Intell. Res., vol. 6, pp. 35–85, Jan. 1997.
[78] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Proc. Interspeech, 2018, pp. 1086–1090.
[79] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," J. Acoustic Soc. Amer., vol. 120, no. 5, pp. 2421–2424, 2006.
[80] Y. Zhao, R. Xu, and M. Song, "A cascade sequence-to-sequence model for Chinese Mandarin lipreading," in Proc. ACM Multimedia Asia, 2019, pp. 1–6.
[81] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2015, pp. 5206–5210.
[82] J. Du, X. Na, X. Liu, and H. Bu, "AISHELL-2: Transforming Mandarin ASR research into industrial scale," 2018, arXiv:1808.10583.
[83] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ)—A new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), vol. 2, May 2001, pp. 749–752.
[84] J. Jensen and C. H. Taal, "An algorithm for predicting the intelligibility of speech masked by modulated noise maskers," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 24, no. 11, pp. 2009–2022, Nov. 2016.
[85] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 1171–1179.
[86] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. 13th Int. Conf. Artif. Intell. Statist., 2010, pp. 249–256.
[87] K. Thangthai, H. L. Bear, and R. Harvey, "Comparing phonemes and visemes with DNN-based lipreading," 2018, arXiv:1805.02924.
[88] M. Luo, S. Yang, S. Shan, and X. Chen, "Pseudo-convolutional policy gradient for sequence-to-sequence lip-reading," 2020, arXiv:2003.03983.
[89] C. Yang, S. Wang, X. Zhang, and Y. Zhu, "Speaker-independent lipreading with limited data," in Proc. IEEE Int. Conf. Image Process. (ICIP), Oct. 2020, pp. 2181–2185.
[90] K. Xu, D. Li, N. Cassimatis, and X. Wang, "LCANet: End-to-end lipreading with cascaded attention-CTC," in Proc. 13th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), May 2018, pp. 548–555.
[91] W. Chen, X. Tan, Y. Xia, T. Qin, Y. Wang, and T.-Y. Liu, "DualLip: A system for joint lip reading and generation," 2020, arXiv:2009.05784.
[92] A. Koumparoulis and G. Potamianos, "MobiLipNet: Resource-efficient deep learning based lipreading," in Proc. Interspeech, Sep. 2019, pp. 2763–2767.
[93] Y. Zhao, R. Xu, X. Wang, P. Hou, H. Tang, and M. Song, "Hearing lips: Improving lip reading by distilling speech recognizers," in Proc. AAAI, 2020, pp. 6917–6924.

Leyuan Qu received the M.Sc. degree in computer science from Beijing Language and Culture University, Beijing, China, in 2017, and the Ph.D. degree from the Department of Informatics, University of Hamburg, Hamburg, Germany, in 2021. His main research interests include robust speech recognition, audio-visual speech recognition, speech enhancement, speech separation, lip reading, and self-supervised learning.

Cornelius Weber received the Diploma degree in physics from the University of Bielefeld, Bielefeld, Germany, in 1995, and the Ph.D. degree in computer science from the Technische Universität Berlin, Berlin, Germany, in 2000. He was a Post-Doctoral Fellow of brain and cognitive sciences with the University of Rochester, Rochester, NY, USA. From 2002 to 2005, he was a Research Scientist of hybrid intelligent systems with the University of Sunderland, Sunderland, U.K. He was a Junior Fellow with the Frankfurt Institute for Advanced Studies, Frankfurt am Main, Germany, until 2010. He is currently a Laboratory Manager with the Knowledge Technology Group, University of Hamburg, Hamburg, Germany. His current research interests include computational neuroscience with a focus on vision, unsupervised learning, and reinforcement learning.

Stefan Wermter (Member, IEEE) is currently a Full Professor with the University of Hamburg, Hamburg, Germany, where he is also the Director of the Department of Informatics, Knowledge Technology Institute. Currently, he is a co-coordinator of the International Collaborative Research Centre on Crossmodal Learning (TRR-169) and a coordinator of the European Training Network TRAIL on transparent interpretable robots. His main research interests are in the fields of neural networks, hybrid knowledge technology, cognitive robotics, and human–robot interaction. Prof. Wermter has been an Associate Editor of IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS. He is an Associate Editor of Connection Science and International Journal for Hybrid Intelligent Systems. He is on the Editorial Board of the journals Cognitive Systems Research, Cognitive Computation, and Journal of Computational Intelligence. He is serving as the President for the European Neural Network Society.
