LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading
Abstract— The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2, which consists of an encoder–decoder architecture and a location-aware attention mechanism to map face image sequences to mel-scale spectrograms directly without requiring any human annotations. The proposed LipSound2 model is first pre-trained on ∼2400 h of multilingual (e.g., English and German) audio-visual data (VoxCeleb2). To verify the generalizability of the proposed method, we then fine-tune the pre-trained model on domain-specific datasets (GRID and TCD-TIMIT) for English speech reconstruction and achieve a significant improvement in speech quality and intelligibility compared to previous approaches in speaker-dependent and speaker-independent settings. In addition to English, we conduct Chinese speech reconstruction on the Chinese Mandarin Lip Reading (CMLR) dataset to verify the impact on transferability. Finally, we train the cascaded lip reading (video-to-text) system by fine-tuning the generated audios on a pre-trained speech recognition system and achieve state-of-the-art performance on both English and Chinese benchmark datasets.

Index Terms— Lip reading, self-supervised pre-training, speech recognition, speech reconstruction.

I. INTRODUCTION

… complementary to auditory representations [8]. Moreover, the visual contribution becomes more important as the acoustic signal-to-noise ratio decreases [9].

In most approaches, the visual information is mainly used as an auxiliary input to complement audio signals. However, in some circumstances, the auditory information may be absent or extremely noisy, which motivates speech reconstruction. Speech reconstruction aims to generate both intelligible and qualified speech by conditioning only on image sequences of talking mouths or faces. Generating intelligible speech from silent videos enables many applications, e.g., a silent visual input method on mobile phones for privacy protection in public areas [10]; communication assistance for patients who have undergone laryngectomy [11]; surveillance video understanding when only visual signals are available [12]; enhancement of video conferences or far-field human–robot interaction scenarios in noisy environments [13]; and nondisruptive user intervention for autonomous vehicles [14].

It is challenging to reconstruct qualified and intelligible speech from only mouth or face movements, since human speech is produced not only by externally observable organs such as the lips and tongue but also by internally invisible ones that are difficult to capture in most cases [15], for instance, the vocal cords and pharynx. Consequently, it is hard to infer …
… Ephrat et al. [30] via combining a ResNet backbone and a postprocessing network on a large-scale vocabulary dataset, TCD-TIMIT [31]. Akbari et al. [19] treated the intermediate bottleneck features learned by a speech autoencoder as training targets by conditioning on lip reading network outputs. Kumar et al. [32] validated the effectiveness of using multiple views of faces on both speaker-dependent and speaker-independent speech reconstruction. Vougioukas et al. [33] utilized generative adversarial networks (GANs) to directly predict raw waveforms from visual inputs in an end-to-end fashion without generating an intermediate representation of the audio. Inspired by the speech synthesis model Tacotron2 [34], Qu et al. [20] proposed to directly map video inputs to a low-level speech representation, the mel spectrogram, with an encoder–decoder architecture and achieved better results in lip reading experiments. Afterward, Prajwal et al. [35] improved the model performance with 3-D CNNs and skip connections. Recently, Michelsanti et al. [36] presented a multitask architecture to learn the spectral envelope, aperiodic parameters, and fundamental frequency separately, which are then fed into a vocoder for waveform synthesis. They integrate a connectionist temporal classification (CTC) [37] loss to jointly perform lip reading, which is capable of further enhancing and constraining the video encoder.

In addition to sequences of lip or face images, further signals can be used for temporal self-supervision. For instance, Gonzalez et al. [38] generated speech from articulatory sensor data, and Akbari et al. [39] reconstructed speech from invasive electrocorticography. However, most existing works only focus on a speaker-dependent setting and small-vocabulary or artificial-grammar datasets. In this article, we evaluate our method not only in speaker-dependent experiments but also pay attention to speaker-independent and large-scale vocabulary setups.

B. Lip Reading

Lip reading, also known as visual speech recognition, is the task of predicting text transcriptions from silent videos, such as mouth or face movement sequences. Research on lip reading has a long tradition. Approaches to lip reading generally fall into two categories on the feature level: 1) handcrafted visual feature extraction, such as the discrete cosine transform [40], the discrete wavelet transform [41], or active appearance models [42] and 2) representations learned by neural networks, which have become the dominant technique for this task, for example, using convolutional autoencoders [43], spatiotemporal CNNs [44], long short-term memory [45], or residual networks [46].

Alternatively, lip reading methods can be divided by modeling unit into word and character levels.

1) In the case of word-level units, lip reading is simplified to a classification task. Word-level lip reading datasets and benchmarks have been built, for instance, LRW [47] for English and LRW-1000 [48] for Chinese. Stafylakis and Tzimiropoulos [46] adopted spatiotemporal convolutional networks and a 2-D ResNet as the front end to extract visual features and bidirectional long short-term memory networks as the back end to capture temporal information and attain significant improvement. Weng and Kitani [49] presented two separate deep 3-D CNN front ends to learn features from grayscale video and optical flow inputs. Martinez et al. [50] replaced the recurrent neural networks widely used in past work with temporal convolutional networks to simplify the training procedure. Word-level methods are usually able to achieve high accuracy; however, the models disregard the interaction or co-articulation phenomena between phonemes or words. A predefined lexicon with a closed-set vocabulary is used, and words are usually treated as isolated units of speech. Thereby, long-term context information and assimilation or dissimilation effects are completely neglected. Moreover, it is hard to recognize out-of-vocabulary words.

2) Lip reading models at the character or phoneme level mainly use methods proposed in speech recognition. Assael et al. [44] conducted end-to-end lip reading experiments on the sentence level with a CTC loss. Subsequently, sequence discriminative training [51] and domain-adversarial training [52] were introduced to lip reading. Chung et al. [2] collected the "lip reading sentences" (LRS) dataset, which consists of hundreds of thousands of videos from BBC television, and significantly promoted research on sentence-level lip reading. Shillingford et al. [53] verified the effectiveness of large-scale data (3886 h of video) for training continuous visual speech recognition. Afouras et al. [54] compared the performance of recurrent neural networks, fully convolutional networks, and transformers on lip reading character recognition.

Different from the mainstream methods, which directly transform videos to text, we perform lip reading experiments in a cascaded manner, in which the silent videos are first mapped to audios with our LipSound2 model and, then, text transcriptions are predicted by fine-tuning on a pretrained speech recognition system.

C. Self-Supervised Learning

As a form of unsupervised learning, self-supervised learning leverages massive unlabeled data and aims to learn effective intermediate representations with the supervision of self-generated labels. Training on unlabeled data in a supervised manner relies on pretext tasks that determine which labels and loss functions are used. In computer vision, the pretext tasks can be predicting the angles of rotated images [55], learning the relative position of segmented regions in an image [56], placing shuffled patches back [57], or colorizing grayscale input images [58]. Video-based pretext tasks can be tracking moving objects in videos [59], validating temporal frame orders [60], video colorization [61], and so on.

Self-supervised learning is also widely used in natural language processing. Substantial progress has been made recently, where diverse pretext tasks are proposed, for instance, predicting center words using surrounding ones or vice versa [62], generating the next word by conditioning on previous words in an autoregressive fashion [63], completing masked tokens …
Fig. 2. Architecture of LipSound2. The video is split into visual and acoustic streams. The face region, which is cropped from the silent visual stream,
is used as the model input. The acoustic spectrogram features extracted from the counterpart audio stream are used as the training target. During training, the
ground-truth spectrogram frames are utilized to accelerate convergence, while, during inference, the outputs from previous steps are used.
… low-resolution videos or profile faces sometimes lead to detection failures, and landmark smoothing can eliminate frame skips in adjacent images. The cropped face sequences are then fed into 3-D CNN blocks, and each block is based on a 3-D CNN, batch normalization, ReLU activation, max pooling, and dropout, as shown in Fig. 2. Then, two bidirectional LSTM layers follow, which capture the long-distance dependences from the left and right context.
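As a rough illustration of this encoder layout, the sketch below stacks 3-D CNN blocks (Conv3d, batch normalization, ReLU, max pooling, dropout) followed by two bidirectional LSTM layers, assuming PyTorch; the channel counts, kernel sizes, and number of blocks are illustrative guesses rather than the exact LipSound2 configuration.

```python
# Sketch of a 3-D CNN + BiLSTM video encoder (layer sizes are assumptions).
import torch
import torch.nn as nn

class Conv3dBlock(nn.Module):
    """One block: Conv3d -> BatchNorm3d -> ReLU -> MaxPool3d -> Dropout."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool spatially, keep the time axis
            nn.Dropout(0.5),
        )

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.block(x)

class VideoEncoder(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(Conv3dBlock(3, 32), Conv3dBlock(32, 64), Conv3dBlock(64, 128))
        # Two bidirectional LSTM layers capture left and right temporal context.
        self.blstm = nn.LSTM(input_size=128 * 14 * 14, hidden_size=hidden,
                             num_layers=2, bidirectional=True, batch_first=True)

    def forward(self, faces):        # faces: (batch, 3, time, 112, 112)
        feat = self.cnn(faces)       # (batch, 128, time, 14, 14)
        b, c, t, h, w = feat.shape
        feat = feat.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.blstm(feat)    # (batch, time, 2 * hidden)
        return out
```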
B. Location-Sensitive Attention

We use location-aware attention [72] to bridge the encoder and the decoder. The image sequence input i = (i_0, ..., i_n) is first embedded into the latent space representation vector h = (h_1, ..., h_n) by the encoder with the same dimension n in time, and then the intermediate vector h is decoded into the mel spectrogram o = (o_0, ..., o_m). At time step t (0 ≤ t ≤ m), the attention weight a_t can be obtained by the following equations:

    a_t = Softmax(W · tanh(M · h + Q · x + L · y))          (1)
    x = LSTM(h · a_{t-1}, p_prenet)                         (2)
    y = Conv(a_{t-1}, Σ_{0≤i≤t-1} a_i)                      (3)

where W, M, Q, and L are the matrices learned by the weight FC, memory FC, query FC, and location FC layers, respectively. In (3), the sum of the attention weights of all previous steps is integrated, which enables the current step attention to be aware of the global location and move forward monotonically. Fig. 3 visualizes the computational flow of the attention mechanism.

Fig. 3. Computational flow of location-aware attention at time step t.

The attention content vector v_t can be obtained by multiplying the encoder output by the normalized attention weights (see the following equation):

    v_t = a_t · h.                                          (4)
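A simplified sketch of the computation in (1)–(4) is given below, assuming PyTorch; the attention dimension, number of location filters, and convolution kernel size are illustrative assumptions rather than the exact settings used here.

```python
# Simplified location-aware attention following (1)-(4); sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationAwareAttention(nn.Module):
    def __init__(self, enc_dim, query_dim, attn_dim=128, loc_filters=32, loc_kernel=31):
        super().__init__()
        self.memory_fc = nn.Linear(enc_dim, attn_dim, bias=False)        # M
        self.query_fc = nn.Linear(query_dim, attn_dim, bias=False)       # Q
        self.location_fc = nn.Linear(loc_filters, attn_dim, bias=False)  # L
        self.weight_fc = nn.Linear(attn_dim, 1, bias=False)              # W
        # Convolution over the previous and cumulative attention weights, as in (3).
        self.location_conv = nn.Conv1d(2, loc_filters, kernel_size=loc_kernel,
                                       padding=(loc_kernel - 1) // 2, bias=False)

    def forward(self, query, memory, prev_attn, cum_attn):
        # query: (B, query_dim) decoder LSTM state; memory: (B, T, enc_dim)
        # prev_attn, cum_attn: (B, T) weights a_{t-1} and sum_{i<t} a_i
        loc = self.location_conv(torch.stack([prev_attn, cum_attn], dim=1))   # (B, F, T)
        loc = self.location_fc(loc.transpose(1, 2))                           # (B, T, A)
        energy = self.weight_fc(torch.tanh(
            self.memory_fc(memory) + self.query_fc(query).unsqueeze(1) + loc)).squeeze(-1)
        attn = F.softmax(energy, dim=-1)                           # a_t, as in (1)
        context = torch.bmm(attn.unsqueeze(1), memory).squeeze(1)  # v_t = a_t · h, as in (4)
        return context, attn
```

The cumulative-attention channel fed to the convolution is what lets the current step "see" where attention has already been, encouraging the monotonic forward movement described above.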
C. Decoder

The decoder module consists of one unidirectional LSTM layer and one linear projection layer. The decoder LSTM …

D. Training Objective

The loss function is the sum of two mean square errors (MSEs), as shown in (5), i.e., the MSE between the decoder output O_dec and the target mel spectrogram M_tar and the MSE between the postnet output O_post and the target mel spectrogram:

    Loss = MSE(O_dec, M_tar) + MSE(O_post, M_tar).          (5)
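A minimal sketch of (5), assuming PyTorch tensors of shape (batch, time, n_mels); the variable names are illustrative.

```python
# Training objective from (5): sum of two MSE terms against the target mel spectrogram.
import torch.nn.functional as F

def lipsound2_loss(decoder_out, postnet_out, target_mel):
    return F.mse_loss(decoder_out, target_mel) + F.mse_loss(postnet_out, target_mel)
```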
E. WaveGlow

We use WaveGlow [26], which combines the approach of the glow-based generative model [73] and the architecture insight of WaveNet [74], to transform the estimated mel spectrogram back to audio. WaveGlow abandons autoregression [74] and speeds up waveform synthesis at high quality and resolution. We train WaveGlow from scratch using the same settings as the original work [26] but at a 16-kHz sampling rate on the LJSpeech dataset [75] to meet the requirement of the follow-up ASR models. To our surprise, the WaveGlow model, which is trained with only one female voice, can effectively generalize to unseen voices and stably perform waveform reconstruction.
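As a rough illustration of the mel-to-waveform step, the sketch below mirrors the inference pattern of NVIDIA's open-source WaveGlow implementation (footnote 3); the checkpoint file name, mel shape, and sigma value are placeholders, not the exact settings used in this work.

```python
# Hedged sketch of WaveGlow inference (mel spectrogram -> waveform), modeled on the
# open-source implementation; requires the WaveGlow source on PYTHONPATH to unpickle.
import torch

waveglow = torch.load("waveglow_16khz.pt", map_location="cuda")["model"]  # hypothetical checkpoint
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow.cuda().eval()

with torch.no_grad():
    mel = torch.randn(1, 80, 400).cuda()       # predicted mel spectrogram (B, n_mels, frames), placeholder
    audio = waveglow.infer(mel, sigma=0.666)   # (B, samples) at the training sampling rate (16 kHz here)
```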
F. Acoustic Model and Language Model

The Jasper [27] speech recognition system, which is a fully convolutional architecture trained with skip connections and a CTC loss, is adopted to directly predict characters from speech signals. We pretrain the Jasper DR 10×5 model1 on the 960-h LibriSpeech and 1000-h AISHELL-2 corpora, which achieves a 3.61% word error rate (WER) and a 10.05% character error rate (CER) on the development sets for English and Chinese, respectively.

Beam search is utilized to decode the output character probabilities from Jasper, together with a 6-gram KenLM [76] language model,2 into grammatically and semantically correct words on the sentence level [77].

1 https://nvidia.github.io/OpenSeq2Seq/html/speech-recognition.html
2 https://github.com/PaddlePaddle/DeepSpeech
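A hedged sketch of this decoding step is shown below, using the third-party pyctcdecode package as a stand-in for beam-search CTC decoding with a KenLM language model; the character set, language-model file, and logits are placeholders and may differ from the actual pipeline.

```python
# Hedged sketch: beam-search CTC decoding with a KenLM language model via pyctcdecode.
import numpy as np
from pyctcdecode import build_ctcdecoder

labels = [""] + list(" abcdefghijklmnopqrstuvwxyz'")   # index 0 as the CTC blank (assumption)
decoder = build_ctcdecoder(labels, kenlm_model_path="6gram.arpa")  # hypothetical LM file

log_probs = np.random.rand(200, len(labels)).astype(np.float32)    # (time, vocab) from the acoustic model
print(decoder.decode(log_probs))   # performs beam search rescored by the KenLM model
```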
IV. EXPERIMENTAL SETUP

A. Dataset

TABLE II: Overview of All Corpora Used in This Article. Spk: Speakers. Utt: Utterances. Vocab: Vocabulary.

All datasets used in this article are summarized in Table II, and random frames from the audio-visual ones are presented in Fig. 4. VoxCeleb2 is a large-scale audio-visual corpus, extracted from YouTube videos, containing over one million utterances and more than 6k different speakers from around 145 nationalities and languages. It includes noisy and unconstrained conditions; specifically, the audio stream may be recorded with background noise, such as laughter and room reverberation, and the vision part may contain variable head poses (e.g., frontal faces and profiles), variable lighting conditions, and low image quality, while the GRID and TCD-TIMIT datasets are recorded in controlled experimental environments with a fixed frontal face angle and a clean background in audio and vision. It is worth mentioning that the GRID dataset is designed to contain only a fixed six-word structure, and all sentences are generated by a restricted artificial grammar: command + color + preposition + letter + digit + adverb, for example, "set blue in Z three now." CMLR is collected from videos by 11 hosts of the Chinese national news program News Broadcast, which contains frontal faces and covers a large amount of Chinese vocabulary. We first pretrain LipSound2 on VoxCeleb2 and then fine-tune the model on GRID, TCD-TIMIT, and CMLR for video to mel-spectrogram reconstruction.

LibriSpeech and AISHELL-2 are currently the largest open-source speech corpora and widely used speech recognition benchmarks for English and Chinese, respectively.

B. Evaluation Metrics

We evaluate the generated speech quality and intelligibility with the perceptual evaluation of speech quality (PESQ) [83] and the extended short-time objective intelligibility (ESTOI) [84], respectively. The speech-to-text results are measured with WER and CER, the ratio of error terms, i.e., substitutions, deletions, and insertions, to the total number of words/characters in the ground-truth sequences.
characters in the ground-truth sequences.
head poses (e.g., frontal faces and profile), variable lighting
conditions, and low image quality, while the GRID and TCD-
TIMIT datasets are in controlled experimental environments C. Training
with fixed frontal face angle and clean background in audio We only describe the training settings of LipSound2 pre-
and vision. It is worth mentioning that the GRID dataset training, LipSound2 fine-tuning, and Jasper acoustic model
is designed to contain only a fixed six-word structure and fine-tuning. More details about Japser1 pre-training acoustic
all sentences are generated by a restricted artificial gram- model, KenLM2 language model, and WaveGlow3 can be
mar: command + color + preposition + letter + digit + found on the open-source websites.
adverb, for example, set blue in Z three now. CMLR is 1) Vision Stream: Face landmarks are detected using
collected from videos by 11 hosts of the Chinese national news MTCNN [71] from all video frames and only the face area is
program News Broadcast, which contains frontal faces and cropped and reshaped to size of 112 × 112 as inputs. We also
covers a large amount of Chinese vocabulary. We first pretrain add one “visual period”—an empty frame with all values of
LipSound2 on VoxCeleb2 and then fine-tune the model on 255—at the end of every visual stream to help the decoder
GRID, TCD-TIMIT, and CMLR for video to mel-spectrogram stop decoding at the right time. A max decoder step threshold
reconstruction. of 1000 is activated to terminate decoding when the decoder
LibriSpeech and AISHELL-2 are the current largest open- fails to capture the “visual period.”
source speech corpora and widely used speech recogni-
tion benchmarks for English and Chinese, respectively. 3 https://github.com/NVIDIA/waveglow
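A hedged sketch of this vision-stream preprocessing is given below, using the facenet-pytorch MTCNN implementation as a stand-in detector; the landmark-smoothing step is omitted and the tensor layout is an assumption.

```python
# Hedged sketch: MTCNN face cropping to 112 x 112 plus an appended all-255 "visual period" frame.
import numpy as np
from facenet_pytorch import MTCNN

mtcnn = MTCNN(image_size=112, post_process=False)   # returns 112 x 112 face crops

def preprocess_video(frames):                        # frames: list of PIL.Image video frames
    faces = []
    for frame in frames:
        face = mtcnn(frame)                          # (3, 112, 112) tensor, or None on detection failure
        if face is not None:
            faces.append(face.numpy())
    faces.append(np.full((3, 112, 112), 255.0, dtype=np.float32))  # "visual period" end frame
    return np.stack(faces)                           # (time, 3, 112, 112)
```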
TABLE V: Speech Reconstruction Results for Chinese on the CMLR Dataset.

Fig. 5. Comparison between generated mel spectrograms and ground truth in speaker-dependent and speaker-independent settings for English and Chinese.
… external language model. Besides, we build a new baseline for CMLR in speaker-independent settings.

VI. DISCUSSION

Although the proposed LipSound2 model pre-trained on a large-scale dataset achieves considerable performance on both speech reconstruction and lip reading tasks, it still generates erroneous speech due to visual similarity in pronunciation; for example, "pill" is easily misrecognized as "bill" in English, and "ji zhi" is mistaken as "qi zhi" in Chinese. In addition, our model can generate voices quite similar to the ground truth in speaker-dependent settings, while in speaker-independent cases the model is sometimes inclined to predict a voice existing in the training set. For details and demonstrations, we refer also to the demo video on the project website.5 How to stop the fine-tuning procedure at the appropriate time and avoid overfitting on downstream tasks is an important direction for future research, since the MSE loss always declines when using teacher forcing during training, which hardly indicates whether the model is overfitting or not. Besides, a possible solution could be using voice embeddings as additional inputs, which can efficiently help models learn speaker identity information, as we found in our previous work [4].

VII. CONCLUSION

In this article, we have proposed LipSound2, which directly predicts speech representations from raw pixels. We investigated the effectiveness of self-supervised pre-training for …

5 https://leyuanqu.github.io/LipSound2/

ACKNOWLEDGMENT

The authors would like to thank Katja Kösters for improving the language of this article.

REFERENCES

[1] J. Besle, A. Fort, C. Delpuech, and M.-H. Giard, "Bimodal speech: Early suppressive visual effects in human auditory cortex," Eur. J. Neurosci., vol. 20, no. 8, pp. 2225–2234, Oct. 2004.
[2] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Lip reading sentences in the wild," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3444–3453.
[3] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Deep audio-visual speech recognition," IEEE Trans. Pattern Anal. Mach. Intell., early access, Dec. 21, 2018, doi: 10.1109/TPAMI.2018.2889052.
[4] L. Qu, C. Weber, and S. Wermter, "Multimodal target speech separation with voice and face references," in Proc. Interspeech, Oct. 2020, pp. 1416–1420.
[5] S.-W. Chung, S. Choe, J. S. Chung, and H.-G. Kang, "FaceFilter: Audio-visual speech separation using still images," in Proc. Interspeech, Oct. 2020, pp. 3481–3485.
[6] Y. Miao and F. Metze, "Open-domain audio-visual speech recognition: A deep learning approach," in Proc. Interspeech, Sep. 2016, pp. 3414–3418.
[7] A. Gupta, Y. Miao, L. Neves, and F. Metze, "Visual features for context-aware speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 5020–5024.
[8] J. Macdonald and H. McGurk, "Visual influences on speech perception processes," Perception Psychophys., vol. 24, no. 3, pp. 253–257, May 1978.
[9] P. L. Silsbee and A. C. Bovik, "Computer lipreading for improved accuracy in automatic speech recognition," IEEE Trans. Speech Audio Process., vol. 4, no. 5, pp. 337–351, Sep. 1996.
[10] B. Denby, T. Schultz, K. Honda, T. Hueber, J. M. Gilbert, and J. S. Brumberg, "Silent speech interfaces," Speech Commun., vol. 52, no. 4, pp. 270–287, 2010.
[11] H. R. Sharifzadeh, I. V. McLoughlin, and F. Ahmadi, "Reconstruction of normal sounding speech for laryngectomy patients through a modified CELP codec," IEEE Trans. Biomed. Eng., vol. 57, no. 10, pp. 2448–2458, Oct. 2010.
[12] M. Cristani, M. Bicego, and V. Murino, "Audio-visual event recognition in surveillance video sequences," IEEE Trans. Multimedia, vol. 9, no. 2, pp. 257–267, Feb. 2007.
[13] A. Tsiami, P. P. Filntisis, N. Efthymiou, P. Koutras, G. Potamianos, and P. Maragos, "Far-field audio-visual scene perception of multi-party human–robot interaction for children and adults," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 6568–6572.
[14] R. Tscharn, M. E. Latoschik, D. Löffler, and J. Hurtienne, "'Stop over there': Natural gesture and speech interaction for non-critical spontaneous intervention in autonomous driving," in Proc. 19th ACM Int. Conf. Multimodal Interact., Nov. 2017, pp. 91–100.
[15] B. Gick, I. Wilson, and D. Derrick, Articulatory Phonetics. Hoboken, NJ, USA: Wiley, 2012.
[16] S. Maeda, "Compensatory articulation during speech: Evidence from the analysis and synthesis of vocal-tract shapes using an articulatory model," in Speech Production and Speech Modelling. Dordrecht, The Netherlands: Springer, 1990, pp. 131–149.
[17] S. Goto, K. Onishi, Y. Saito, K. Tachibana, and K. Mori, "Face2Speech: Towards multi-speaker text-to-speech synthesis using an embedding vector predicted from a face image," in Proc. Interspeech, Oct. 2020, pp. 1321–1325.
[18] A. Ephrat and S. Peleg, "Vid2Speech: Speech reconstruction from silent video," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 5095–5099.
[19] H. Akbari, H. Arora, L. Cao, and N. Mesgarani, "Lip2Audspec: Speech reconstruction from silent lip movements video," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 2516–2520.
[20] L. Qu, C. Weber, and S. Wermter, "LipSound: Neural mel-spectrogram reconstruction for lip reading," in Proc. Interspeech, Sep. 2019, pp. 2768–2772.
[21] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Commun., vol. 27, nos. 3–4, pp. 187–207, Apr. 1999.
[22] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Trans. Inf. Syst., vol. 99, no. 7, pp. 1877–1884, 2016.
[23] B. Korbar, D. Tran, and L. Torresani, "Cooperative learning of audio and video models from self-supervised synchronization," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 7763–7774.
[24] P. Morgado, Y. Li, and N. Vasconcelos, "Learning representations from audio-visual spatial alignment," in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 1–12.
[25] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, no. 2, pp. 236–243, Apr. 1984.
[26] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2019, pp. 3617–3621.
[27] J. Li et al., "Jasper: An end-to-end convolutional neural acoustic model," in Proc. Interspeech, 2019, pp. 71–75.
[28] T. L. Cornu and B. Milner, "Reconstructing intelligible audio speech from visual speech features," in Proc. Interspeech, Sep. 2015, pp. 1–6.
[29] T. Le Cornu and B. Milner, "Generating intelligible audio speech from visual speech," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 25, no. 9, pp. 1751–1761, Sep. 2017.
[30] A. Ephrat, T. Halperin, and S. Peleg, "Improved speech reconstruction from silent video," in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2017, pp. 455–462.
[31] N. Harte and E. Gillen, "TCD-TIMIT: An audio-visual corpus of continuous speech," IEEE Trans. Multimedia, vol. 17, no. 5, pp. 603–615, May 2015.
[32] Y. Kumar, R. Jain, K. M. Salik, R. R. Shah, Y. Yin, and R. Zimmermann, "Lipper: Synthesizing thy speech using multi-view lipreading," in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 2588–2595.
[33] K. Vougioukas, P. Ma, S. Petridis, and M. Pantic, "Video-driven speech reconstruction using generative adversarial networks," in Proc. Interspeech, Sep. 2019, pp. 4125–4129.
[34] J. Shen et al., "Natural TTS synthesis by conditioning WaveNet on MEL spectrogram predictions," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 4779–4783.
[35] K. R. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. V. Jawahar, "Learning individual speaking styles for accurate lip to speech synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 13796–13805.
[36] D. Michelsanti, O. Slizovskaia, G. Haro, E. Gómez, Z.-H. Tan, and J. Jensen, "Vocoder-based speech synthesis from silent videos," in Proc. Interspeech, Oct. 2020, pp. 3530–3534.
[37] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. 23rd Int. Conf. Mach. Learn. (ICML), 2006, pp. 369–376.
[38] J. A. Gonzalez et al., "Direct speech reconstruction from articulatory sensor data by machine learning," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 25, no. 12, pp. 2362–2374, Dec. 2017.
[39] H. Akbari, B. Khalighinejad, J. L. Herrero, A. D. Mehta, and N. Mesgarani, "Towards reconstructing intelligible speech from the human auditory cortex," Sci. Rep., vol. 9, no. 1, pp. 1–12, Dec. 2019.
[40] M. Heckmann, K. Kroschel, C. Savariaux, and F. Berthommier, "DCT-based video features for audio-visual speech recognition," in Proc. 7th Interspeech, 2002, pp. 1–4.
[41] G. Potamianos, H. P. Graf, and E. Cosatto, "An image transform approach for HMM based automatic lipreading," in Proc. Int. Conf. Image Process., 1998, pp. 173–177.
[42] G. Sterpu and N. Harte, "Towards lipreading sentences with active appearance models," 2018, arXiv:1805.11688.
[43] D. Parekh, A. Gupta, S. Chhatpar, A. Y. Kumar, and M. Kulkarni, "Lip reading using convolutional auto encoders as feature extractor," 2018, arXiv:1805.12371.
[44] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, "LipNet: End-to-end sentence-level lipreading," in Proc. GPU Technol. Conf., 2017, pp. 1–13. [Online]. Available: https://github.com/Fengdalu/LipNet-PyTorch
[45] M. Wand, J. Koutník, and J. Schmidhuber, "Lipreading with long short-term memory," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2016, pp. 6115–6119.
[46] T. Stafylakis and G. Tzimiropoulos, "Combining residual networks with LSTMs for lipreading," in Proc. Interspeech, Aug. 2017, pp. 3652–3656.
[47] J. S. Chung and A. Zisserman, "Lip reading in the wild," in Proc. Asian Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 87–103.
[48] S. Yang et al., "LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild," in Proc. 14th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), May 2019, pp. 1–8.
[49] X. Weng and K. Kitani, "Learning spatio-temporal features with two-stream deep 3D CNNs for lipreading," 2019, arXiv:1905.02540.
[50] B. Martinez, P. Ma, S. Petridis, and M. Pantic, "Lipreading using temporal convolutional networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 6319–6323.
[51] K. Thangthai and R. Harvey, "Improving computer lipreading via DNN sequence discriminative training techniques," in Proc. Interspeech, Aug. 2017, pp. 1–5.
[52] M. Wand and J. Schmidhuber, "Improving speaker-independent lipreading with domain-adversarial training," in Proc. Interspeech, 2017, pp. 2415–2419.
[53] B. Shillingford et al., "Large-scale visual speech recognition," in Proc. Interspeech, 2018, pp. 4135–4139.
[54] T. Afouras, J. S. Chung, and A. Zisserman, "Deep lip reading: A comparison of models and an online application," in Proc. Interspeech, Sep. 2018, pp. 3514–3518.
[55] S. Gidaris, P. Singh, and N. Komodakis, "Unsupervised representation learning by predicting image rotations," in Proc. ICLR, 2018, pp. 1–16.
[56] C. Doersch, A. Gupta, and A. A. Efros, "Unsupervised visual representation learning by context prediction," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1422–1430.
[57] M. Noroozi and P. Favaro, "Unsupervised learning of visual representations by solving jigsaw puzzles," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 69–84.
[58] R. Zhang, P. Isola, and A. A. Efros, "Colorful image colorization," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 649–666.
[59] X. Wang and A. Gupta, "Unsupervised learning of visual representations using videos," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2794–2802.
[60] I. Misra, C. L. Zitnick, and M. Hebert, "Shuffle and learn: Unsupervised learning using temporal order verification," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 527–544.
[61] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy, "Tracking emerges by colorizing videos," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 391–408.
[62] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, arXiv:1301.3781.
[63] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," OpenAI Blog, 2018. Accessed: Aug. 6, 2020. [Online]. Available: https://openai.com/blog/language-unsupervised
[64] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2018, arXiv:1810.04805.
[65] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A lite BERT for self-supervised learning of language representations," 2019, arXiv:1909.11942.
[66] M. Lewis et al., "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," 2019, arXiv:1910.13461.
[67] R. Arandjelovic and A. Zisserman, "Look, listen and learn," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 609–617.
[68] S.-W. Chung, J. S. Chung, and H.-G. Kang, "Perfect match: Improved cross-modal embeddings for audio-visual synchronisation," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2019, pp. 3965–3969.
[69] R. Arandjelovic and A. Zisserman, "Objects that sound," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 435–451.
[70] B. Chen et al., "Multimodal clustering networks for self-supervised learning from unlabeled videos," 2021, arXiv:2104.12671.
[71] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Process. Lett., vol. 23, no. 10, pp. 1499–1503, Oct. 2016.
[72] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Proc. Adv. Neural Inf. Process. Syst., vol. 28, 2015, pp. 577–585.
[73] D. P. Kingma and P. Dhariwal, "Glow: Generative flow with invertible 1×1 convolutions," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 10215–10224.
[74] A. van den Oord et al., "WaveNet: A generative model for raw audio," in Proc. 9th ISCA Speech Synth. Workshop (SSW), 2016, p. 125.
[75] K. Ito and L. Johnson. (2017). The LJ Speech Dataset. [Online]. Available: https://keithito.com/LJ-Speech-Dataset/
[76] K. Heafield, "KenLM: Faster and smaller language model queries," in Proc. 6th Workshop Stat. Mach. Transl., 2011, pp. 187–197.
[77] S. Wermter and V. Weber, "SCREEN: Learning a flat syntactic and semantic spoken language analysis using artificial neural networks," J. Artif. Intell. Res., vol. 6, pp. 35–85, Jan. 1997.
[78] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Proc. Interspeech, 2018, pp. 1086–1090.
[79] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," J. Acoust. Soc. Amer., vol. 120, no. 5, pp. 2421–2424, 2006.
[80] Y. Zhao, R. Xu, and M. Song, "A cascade sequence-to-sequence model for Chinese Mandarin lipreading," in Proc. ACM Multimedia Asia, 2019, pp. 1–6.
[81] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2015, pp. 5206–5210.
[82] J. Du, X. Na, X. Liu, and H. Bu, "AISHELL-2: Transforming Mandarin ASR research into industrial scale," 2018, arXiv:1808.10583.
[83] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ)—A new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), vol. 2, May 2001, pp. 749–752.
[84] J. Jensen and C. H. Taal, "An algorithm for predicting the intelligibility of speech masked by modulated noise maskers," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 24, no. 11, pp. 2009–2022, Nov. 2016.
[85] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 1171–1179.
[86] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. 13th Int. Conf. Artif. Intell. Statist., 2010, pp. 249–256.
[87] K. Thangthai, H. L. Bear, and R. Harvey, "Comparing phonemes and visemes with DNN-based lipreading," 2018, arXiv:1805.02924.
[88] M. Luo, S. Yang, S. Shan, and X. Chen, "Pseudo-convolutional policy gradient for sequence-to-sequence lip-reading," 2020, arXiv:2003.03983.
[89] C. Yang, S. Wang, X. Zhang, and Y. Zhu, "Speaker-independent lipreading with limited data," in Proc. IEEE Int. Conf. Image Process. (ICIP), Oct. 2020, pp. 2181–2185.
[90] K. Xu, D. Li, N. Cassimatis, and X. Wang, "LCANet: End-to-end lipreading with cascaded attention-CTC," in Proc. 13th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), May 2018, pp. 548–555.
[91] W. Chen, X. Tan, Y. Xia, T. Qin, Y. Wang, and T.-Y. Liu, "DualLip: A system for joint lip reading and generation," 2020, arXiv:2009.05784.
[92] A. Koumparoulis and G. Potamianos, "MobiLipNet: Resource-efficient deep learning based lipreading," in Proc. Interspeech, Sep. 2019, pp. 2763–2767.
[93] Y. Zhao, R. Xu, X. Wang, P. Hou, H. Tang, and M. Song, "Hearing lips: Improving lip reading by distilling speech recognizers," in Proc. AAAI, 2020, pp. 6917–6924.

Leyuan Qu received the M.Sc. degree in computer science from Beijing Language and Culture University, Beijing, China, in 2017, and the Ph.D. degree from the Department of Informatics, University of Hamburg, Hamburg, Germany, in 2021. His main research interests include robust speech recognition, audio-visual speech recognition, speech enhancement, speech separation, lip reading, and self-supervised learning.

Cornelius Weber received the Diploma degree in physics from the University of Bielefeld, Bielefeld, Germany, in 1995, and the Ph.D. degree in computer science from the Technische Universität Berlin, Berlin, Germany, in 2000. He was a Post-Doctoral Fellow of brain and cognitive sciences with the University of Rochester, Rochester, NY, USA. From 2002 to 2005, he was a Research Scientist of hybrid intelligent systems with the University of Sunderland, Sunderland, U.K. He was a Junior Fellow with the Frankfurt Institute for Advanced Studies, Frankfurt am Main, Germany, until 2010. He is currently a Laboratory Manager with the Knowledge Technology Group, University of Hamburg, Hamburg, Germany. His current research interests include computational neuroscience with a focus on vision, unsupervised learning, and reinforcement learning.

Stefan Wermter (Member, IEEE) is currently a Full Professor with the University of Hamburg, Hamburg, Germany, where he is also the Director of the Department of Informatics, Knowledge Technology Institute. Currently, he is a co-coordinator of the International Collaborative Research Centre on Crossmodal Learning (TRR-169) and a coordinator of the European Training Network TRAIL on transparent interpretable robots. His main research interests are in the fields of neural networks, hybrid knowledge technology, cognitive robotics, and human–robot interaction. Prof. Wermter has been an Associate Editor of IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS. He is an Associate Editor of Connection Science and International Journal for Hybrid Intelligent Systems. He is on the Editorial Board of the journals Cognitive Systems Research, Cognitive Computation, and Journal of Computational Intelligence. He is serving as the President for the European Neural Network Society.