A Multimodal German Dataset For Automatic Lip Reading Systems and Transfer Learning
Gerald Schwiebert, Cornelius Weber, Leyuan Qu, Henrique Siqueira, Stefan Wermter
Knowledge Technology, Department of Informatics, University of Hamburg
Vogt-Koelln-Str. 30, 22527 Hamburg
{weber, qu, siqueira, wermter}@informatik.uni-hamburg.de
arXiv:2202.13403v3 [cs.CV] 11 May 2022
Large datasets as required for deep learning of lip reading do not exist in many languages. In this paper we present the dataset
GLips (German Lips) consisting of 250,000 publicly available videos of the faces of speakers of the Hessian Parliament,
which was processed for word-level lip reading using an automatic pipeline. The format is similar to that of the English
language LRW (Lip Reading in the Wild) dataset, with each video encoding one word of interest in a context of 1.16 seconds
duration, which yields compatibility for studying transfer learning between both datasets. By training a deep neural network,
we investigate whether lip reading has language-independent features, so that datasets of different languages can be used to
improve lip reading models. We demonstrate learning from scratch and show that transfer learning from LRW to GLips and
vice versa improves learning speed and performance, in particular for the validation set.
Keywords: Audio-visual, Dataset, Lip reading, Automatic Speech Recognition, Deep Learning, Transfer Learning,
Computer Vision
2.2. Copyright
Copyright law (UrhG) in Germany and its related ancillary copyrights deal with the creation of works and
the rights and powers of their creators. Videos that can be accessed and downloaded from publicly accessible
platforms are in principle subject to §1 UrhG² as a work as soon as they have a certain so-called creative
level, i.e., they are enhanced, for example, by creative editing. Choosing such videos as a source for creating
a dataset would involve a great deal of communication effort, as the permission of the author would have to be
obtained in writing for each individual work. Videos from webcams or surveillance cameras without further
significant creative editing generally lack this level of creativity, which is why they are particularly
suitable as a data source for dataset creation, as long as the rules of the DSGVO³ are followed. Furthermore,
in creating GLips, we comply with two special exceptions embedded into the German copyright law. First, we
pursue a legitimate scientific interest for helping to enhance the support for the hearing impaired through
the creation of our dataset and second, the politicians shown
² §1 UrhG-Allgemeines-dejure.org: https://dejure.org/gesetze/UrhG/1.html
³ Datenschutz-Grundverordnung (DSGVO) - dejure.org: https://dejure.org/gesetze/DSGVO
Figure 1: Word length distribution in GLips
This section gives an overview of the creation procedure of the dataset German Lips (GLips). Although this
paper and the dataset's name focus on the ALR domain, GLips should be as versatile as possible and applicable
across the whole scientific ASR domain, because, as explained in Section 2, the legally compliant creation of
large video datasets in the German-speaking area is connected with some hurdles. Therefore, the creation of
GLips is oriented towards LRW in order to ensure a high compatibility for methods such as transfer learning
and experiments on the topic of language independence, and to advance scientific knowledge in these areas.
Furthermore, we chose video material based on naturally spoken language in a natural environment, as this
approach produces more robust results for real-world ASR applications than artificially generated datasets
with as little noise as possible (Burton et al., 2018).
GLips consists of 250,000 H264-compressed MPEG-4 videos of speakers’ faces from parliamentary sessions
of the Hessian Parliament, which are divided into 500 different words of 500 instances each. The word length
distribution is shown in Fig. 1. As with LRW, each video is 1.16 s long at a frame rate of 25 fps. The audio
track was stored separately in an MPEG AAC audio file (.m4a). For each video there is an additional metadata
text file with the fields:
• Spoken word,
• Start time of utterance in seconds,
• End time of utterance in seconds,
• Duration of utterance in seconds,
• Corresponding numerical filename in the database.
The start and end times of the utterance refer to the complete original video and not to the occurrence of the
word in the clip.
3.1. Acquisition
With the permission of the Hessian Parliament, we used over 1000 videos and their respective subtitles. The
Hessian Parliament has also published a superset of these videos on its YouTube channel⁴. The subtitles are
available as separate text files and were created manually with time intervals. Similar to LRW subtitle
editing (Chung and Zisserman, 2016), this leads to the issue that not all subtitles are verbatim: in rare
cases the content but not the exact spoken words have been reproduced in the subtitle, which means that
despite checks, there are likely to be some words in the dataset that do not match the lip profile of the
speaker. In order to create GLips, we also need the exact time of pronunciation and the duration of the
utterance for each selected word. However, the subtitle files contain only one interval for each group of
several words. The solution to this problem via alignment using the WebMAUS service is discussed in
Section 3.3.
⁴ YouTube - Hessischer Landtag: https://www.
Figure 2: Pipeline for the generation of training data
3.2. Multimodal Processing Pipeline
The technical creation of the multimodal dataset GLips is thematically divided into the two areas of
extraction and processing of data. In Fig. 2, the entire pipeline is shown schematically from the existence
of the raw data to the creation of training data suitable for machine learning, of which GLips represents a
subset.
Since the original audio, video and subtitle data are already available in a separate form, the technical
part of the data extraction is limited to the acquisition of all data and the cleaning of the text data from
meta information so that only the spoken words are available as input for the next step. The more complex
part of the data processing is described in more detail in the following two sections and is divided into the
two subsections audio subtitle alignment using WebMAUS and face detection. The audio and video files are
synchronized in the last step, but are stored in separate files for the sake of more diverse processing
options.
Figure 3: Example of GLips cropped to 96×96 pixels
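The per-clip metadata text file described for GLips lists the spoken word, the start time, end time and duration of the utterance, and the numerical filename. The on-disk layout of that file is not specified in this section, so the following Python sketch assumes a hypothetical "key: value" layout with invented field names and invented example values. It also shows how the stated clip length of 1.16 s at 25 fps translates into a frame count.

```python
from dataclasses import dataclass

FPS = 25              # frame rate stated for GLips clips
CLIP_SECONDS = 1.16   # clip length stated for GLips and LRW


@dataclass
class ClipMetadata:
    word: str          # spoken word
    start_s: float     # start of utterance in the original video (seconds)
    end_s: float       # end of utterance in the original video (seconds)
    duration_s: float  # duration of utterance (seconds)
    file_id: str       # numerical filename in the database


def parse_metadata(text: str) -> ClipMetadata:
    """Parse one metadata file in the ASSUMED 'key: value' layout."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(":", 1)
        fields[key.strip().lower()] = value.strip()
    return ClipMetadata(
        word=fields["word"],
        start_s=float(fields["start"]),
        end_s=float(fields["end"]),
        duration_s=float(fields["duration"]),
        file_id=fields["file"],
    )


def frames_per_clip(seconds: float = CLIP_SECONDS, fps: int = FPS) -> int:
    """Number of video frames in one clip."""
    return round(seconds * fps)


# Invented example values, for illustration only.
example = """\
word: Hessen
start: 132.48
end: 132.91
duration: 0.43
file: 00042
"""
meta = parse_metadata(example)
n_frames = frames_per_clip()
```

Because the start and end times refer to the complete original video, a consumer of such a file would need the clip's own offset to locate the word inside the 1.16 s window.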
The output of this pipeline consists of structured, processed, and augmented data suitable as potential
training data for various areas of machine learning. The TextGrid files with their phonetic information,
which are no longer needed by GLips, longer excerpts from aligned videos for sentence-based lip reading
approaches, or video clips with several people to test attention mechanisms, are just a few ideas for
research. For the transfer learning in Section 4, we use a modified GLips dataset that was reduced in size
from 256×256 pixels to 96×96 pixels (see Fig. 3) by additionally cropping the videos to focus on lip reading
learning and to ensure better computability on consumer hardware.
Figure 4: Example full screen view of the raw video data from the Hessian Parliament
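The reduction from 256×256 to 96×96 pixels is described as a crop rather than a rescale. How the crop window is positioned is not stated here, so the Python sketch below is only one possible reading: it assumes a window centred on a given lip position and shifted where necessary so that it stays inside the frame. The function names and the lip coordinates are hypothetical.

```python
def crop_box(center_x, center_y, frame_size=256, crop_size=96):
    """Return (left, top, right, bottom) of a crop_size window around a point.

    The window is shifted, not shrunk, when the point is close to the border,
    so the result is always exactly crop_size x crop_size.
    """
    half = crop_size // 2
    left = min(max(center_x - half, 0), frame_size - crop_size)
    top = min(max(center_y - half, 0), frame_size - crop_size)
    return left, top, left + crop_size, top + crop_size


def crop_frame(frame, box):
    """Crop a frame given as a list of pixel rows."""
    left, top, right, bottom = box
    return [row[left:right] for row in frame[top:bottom]]


# Dummy 256x256 frame; the lip position (128, 170) is an invented example.
frame = [[0] * 256 for _ in range(256)]
box = crop_box(128, 170)
lips = crop_frame(frame, box)
```

Applying the same box to every frame of a clip keeps the lip region spatially stable over time, which is one reason to compute the box once per clip rather than once per frame.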
attention was paid to the aspect of compatibility between the two datasets. The camera equipment of the BBC
is clearly of higher quality than the webcam of the Hessian Parliament, so that despite nominally the same
video resolution, there is a difference in quality between the video datasets due to the dissimilar dynamic
range of the camera sensors, possibly existing camera-internal post-processing, as well as more elaborately
calculated and manufactured lenses. In addition, external factors such as the shorter camera distance (object
distance), the partially existing professional and intelligibility-oriented speech training of the news
presenters and the more professional lighting in the BBC dataset provide a clearer, higher-contrast and
sharper image of the lip movements, so that it can be expected that LRW-trained models for lip reading will
have a higher performance in terms of word recognition than will be the case with GLips. The quality of the
audio recordings, which were integrated into the .mp4 format in LRW and are available separately as .m4a in
GLips, differs less due to the use of high-quality microphones in the Hessian Parliament. However, for
training our lip reading models we will only use the visual information. Furthermore, the number of speakers
in LRW is several hundred, which is significantly higher than in GLips, where it is estimated to be around
100.
5. Model Evaluation
depicted in Figure 6, has 4D tensors as its main layers, with one dimension each for the temporal dimension
(T), height and width (H × W) of spatial dimensions and number of channels (C). It is an expanding image
processing architecture that uses channel-wise convolutions as building blocks. Synchronized stochastic
gradient descent (SGD) was performed on parallel workers following the linear scaling rule for learning rate
and minibatch size to reduce training time (Goyal et al., 2018).
We used the official model implementation¹⁰ that is included in PyTorch Lightning-Flash¹¹, and for video
processing we used the PyTorchVideo (Fan et al., 2021) library. The model was implemented and tested on a
single NVIDIA GeForce RTX 2080 Ti.
5.1. Experiments with GLips and LRW
To evaluate whether the word recognition rate of the lip reading models can be improved by transfer learning,
we conducted two experiments. To keep the transfer learning computations manageable, we created one subset
each of the LRW and GLips datasets, which we call LRW15 and GLips15, and which consist of only 15 randomly
selected words of 500 instances each instead of all 500 words. Two further subsets named GLips15-small and
LRW15-small consist only of a total of 95 word instances of the same 15 words as the former subsets. We
cropped the videos to 96×96 pixels around the lip region to increase the performance in computation and to
improve the focus of learning on the lips.
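The construction of the 15-word subsets can be sketched in Python as follows. This is only an illustration under assumptions: the seed, the function names and the toy clip identifiers are invented. Because the text does not say how the instances of the small subsets are distributed over the 15 words, the per-word count is left as a parameter and the value used below is an arbitrary placeholder.

```python
import random


def pick_words(vocabulary, n_words=15, seed=0):
    """Randomly choose n_words distinct words (seeded for reproducibility)."""
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(vocabulary), n_words))


def build_subset(clips_by_word, words, per_word=None, seed=0):
    """Keep only `words`; optionally keep just `per_word` random clips of each."""
    rng = random.Random(seed)
    subset = {}
    for word in words:
        clips = list(clips_by_word[word])
        if per_word is not None:
            clips = rng.sample(clips, per_word)
        subset[word] = clips
    return subset


# Toy stand-in for a dataset with 500 words and 500 clips per word.
toy = {
    f"word{w:03d}": [f"word{w:03d}_{i:05d}" for i in range(500)]
    for w in range(500)
}

words = pick_words(toy)                          # the same 15 words for both subsets
full_15 = build_subset(toy, words)               # 15 words x 500 clips
small_15 = build_subset(toy, words, per_word=6)  # per_word=6 is a placeholder value
```

Selecting the words once and reusing the list guarantees that the small subset covers exactly the same classes as the large one, which the experiments require.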
Figure 8: Results of training (left) and validation (right) accuracy per iteration for Experiment 2: transfer learning
from a large dataset to a small dataset
models learned on large datasets transfer to models learned on smaller datasets by learning from
LRW15 → GLips15-small and also from GLips15 → LRW15-small, and also by comparing the results to models learned
from scratch.
5.2. Experimental Results
The smoothed curves of the results were plotted in TensorBoard¹². As seen in Fig. 7, in Experiment 1 the
validation accuracies of GLips15 and LRW15 trained from scratch are lower than those of the transfer-learned
models GLips15 → LRW15 and LRW15 → GLips15. Both transfer-learned curves also rise more steeply in the
beginning, which means that the networks learn faster, and they maintain their higher level across all
epochs, manifesting their learning advantage. Additionally, GLips15 → LRW15 has a higher starting point,
which speeds up learning even more.
The results of Experiment 2 in Fig. 8 also clearly demonstrate the advantages of transfer learning, but look
different in detail. LRW15 → GLips15-small as well as GLips15 → LRW15-small achieve an advantage over the
respective networks trained from scratch. In both experiments the average validation accuracies of the
LRW-networks reach a higher score than the GLips-networks. This is particularly pronounced for
GLips15 → LRW15-small, which appears to have the largest advantage in the validation experiment. It is
surprising that GLips as source of transfer learning helps learning LRW more (dark red curve) than vice
versa, since GLips as the source of transfer learning has lower-quality videos. An explanation could be that
given the low number of data points in this experiment, GLips with its noisier visual features did not allow
the model to overfit, while on the other hand, the LRW-pretrained model (light blue) might have overfit to
distinct features in LRW.
¹² TensorBoard: https://www.tensorflow.org/
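Since the reported curves are smoothed, it can help to recall what such smoothing does to a noisy accuracy series. The Python sketch below uses a plain exponential moving average as a rough illustration. It is not taken from the paper, TensorBoard's own smoothing may differ in details, and the weight of 0.6 and the accuracy values are assumptions.

```python
def smooth(values, weight=0.6):
    """Exponential moving average; a higher weight gives a smoother curve."""
    smoothed = []
    last = None
    for value in values:
        # The first point is kept as is; later points are blended with the
        # running average.
        last = value if last is None else weight * last + (1 - weight) * value
        smoothed.append(last)
    return smoothed


# Invented validation accuracies, for illustration only.
raw_accuracy = [0.10, 0.30, 0.20, 0.45, 0.40, 0.55]
smoothed_accuracy = smooth(raw_accuracy)
```

Smoothing makes trends easier to compare across runs, but it also delays and damps sudden changes, so early differences between curves appear smaller than in the raw data.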
6. Discussion
The successful transfer learning between the two languages indicates that there are features in both datasets
regarding lip reading that are language-independent and thus can be transferred to another language.
Due to the better performance of the LRW-trained networks, we hypothesize that the difference between the
models trained on GLips and LRW lies in the quality of the data. In GLips, in comparison, overall learning is
subjected to more noise in addition to the features important for the complex task of lip reading. However,
the evaluation of the curves shows clear advantages of transfer learning compared to learning from scratch,
both in the overall performance and in the
ability support, communication in noisy environments, boosting of existing ASR systems, etc., progressing
state-of-the-art assistive technologies. Revisiting the publicly available source of the dataset, further
applications would become possible, such as learning automatic speech recognition, and extended TextGrid
information will allow the creation of a dataset for sentence-level recognition from the original videos.
8. Acknowledgements
We gratefully acknowledge partial support from the German Research Foundation (DFG) under the project
Crossmodal Learning (CML, Grant TRR 169).
