Towards End-to-end Speech-to-text Summarization

Raul Monteiro and Diogo Pernes
filtering and keeping up with the broadcast news uploaded online on a daily basis.
The rise of deep-learning-based large language models with impressive text generation capabilities has shifted the research focus toward summarization systems that produce paraphrased, compact versions of the document content, also known as abstractive summaries. End-to-end (E2E) modelling of speech-to-text (S2T) abstractive summarization is a promising approach that offers the possibility of generating rich latent representations that leverage non-verbal and acoustic information, as opposed to the use of only the linguistic information from automatically generated transcripts in cascade systems. However, the scarce literature on E2E modelling of this task fails to explore different domains, namely broadcast news, a challenging domain where large and diversified volumes of data are presented to the user every day. We model S2T summarization both with a cascade and an E2E system for a corpus of broadcast news in French. Our novel E2E model leverages external data by resorting to transfer learning from a pre-trained text-to-text (T2T) summarizer. Experiments show that both our cascade and E2E abstractive summarizers are stronger than an extractive baseline. However, the performance of the E2E model still lags behind that of the cascade one, which is the object of an extensive analysis that includes future directions to close that gap.
1 Introduction
S2T summarization is usually achieved using a cascade approach (see Fig. 1), where
an automatic speech recognition (ASR) model generates transcripts, followed by a
text-to-text (T2T) summarization model that produces summaries [18]. Deep learning,
including attention-based architectures and self-supervised pre-training, has improved
the performance of both models. Cascade abstractive systems using these components
achieve strong results when trained on unpaired data for dialogue summarization tasks
[24]. However, the transcripts produced by the ASR model may contain errors, so meth-
ods using confusion networks or language models have been proposed to improve ro-
bustness to these errors [11,23].
Cascade systems used for S2T summarization fail to utilize non-verbal and acoustic
information that could be useful for summarization [22]. E2E modelling (see Fig. 1)
has been proposed to address this issue in two different articles [21,9]. These systems
do not make use of an intermediate speech recognition step and instead jointly optimise
an acoustic and language model. However, E2E modelling requires large amounts of
paired audio/summary data and the scarcity of publicly available large corpora on the
broadcast news domain requires techniques to leverage external data.
This work proposes both a cascade and a novel E2E model for S2T abstractive summarization of broadcast news. The former uses an ASR model and a T2T abstractive summarizer, both fine-tuned on a broadcast news dataset. The E2E system follows the encoder-
decoder paradigm and utilizes speech features extracted using a self-supervised pre-
trained speech representation model as input [3]. It leverages external data from text
corpora through transfer learning from a T2T abstractive summarizer. Both models are
compared against an extractive cascade baseline and against each other using ROUGE scores and human evaluation. We release our source code publicly at https://github.com/Priberam/S2TSumm.
The remainder of this paper is organized as follows: in section 2, we present the re-
lated work; in section 3, we propose a new corpus of broadcast news in French, which
is used to evaluate the models developed in this work; in section 4, we describe the
architectures of the cascade and novel E2E abstractive summarizers, how the latter ben-
efits from the former through transfer learning, and we introduce an extractive baseline;
in section 5, we detail the architecture and pre-training of the cross-modal adapter,
which is the encoder of the E2E S2T abstractive summarizer that maps speech to tex-
tual features; in section 6, we present the results for automatic and human evaluations;
in section 7, we discuss the obtained results; section 8 concludes this work and includes
future directions for improving the performance of the E2E model.
2 Related Work
3 Dataset
For this work, we built a dataset for S2T abstractive summarization of broadcast news in French from articles available on the EuroNews website (https://www.euronews.com/about).
Each news article from EuroNews has an audio recording, an abstractive summary of the news content, and the article body. Since the latter is not always a perfect transcript of the audio, we employed an automatic procedure to select the news articles whose bodies are perfect (or almost perfect) transcripts of the audio. An XLSR-based ASR model (https://huggingface.co/facebook/wav2vec2-large-xlsr-53-french) was used to produce artificial transcripts from the audio. Afterwards, the word error rate (WER) was computed between each automatically generated transcript and the corresponding article body, and articles with a WER above a threshold of 45% were discarded. The remaining articles were randomly shuffled and separated into three splits of 13 380, 1672, and 1673 articles for train, dev, and test, respectively. We named the final corpus BNews (we are in contact with EuroNews regarding a public license for this dataset). The mean audio duration per article is about 87 s.
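For illustration, a minimal sketch of this filtering step, using the jiwer package to compute the WER; the surrounding data handling is hypothetical:

```python
import jiwer

WER_THRESHOLD = 0.45

def keep_article(article_body: str, asr_transcript: str) -> bool:
    """Keep an article only if its body is a (near-)perfect transcript
    of the audio, as measured by WER against the ASR output."""
    return jiwer.wer(article_body, asr_transcript) <= WER_THRESHOLD

# Hypothetical usage over (body, transcript) pairs:
# kept = [a for a in articles if keep_article(a["body"], a["transcript"])]
```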
4 Model Architectures
4.1 Cascade
The cascade abstractive summarizer requires both an ASR system and a T2T abstrac-
tive summarizer. Fig. 2 illustrates the realisation of the cascade and E2E abstractive
summarizers.
Automatic Speech Recognizer The ASR model was built from a W2V2 model (https://huggingface.co/LeBenchmark/wav2vec2-FR-7K-base) that was pre-trained on French speech data. The pre-trained model was loaded into a Wav2Vec2ForCTC object from the Hugging Face Transformers library (https://huggingface.co/docs/transformers/index). This model consists of the pre-trained W2V2 model followed by a linear layer and a softmax. The model is trained for speech recognition on the French sub-dataset of the Common Voice Corpus 10.0 (CV) with the CTC objective. The vocabulary contains 222 characters extracted from the dev split of the BNews corpus. The model was further fine-tuned on the latter, starting from the checkpoint that showed the lowest WER on the dev split of CV. The WER on the test split of the BNews corpus was (18.8 ± 0.3)%, where the BasicTextNormalizer from Whisper [17] was used for text normalization.
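For reference, a minimal transcription sketch with the Transformers library; since our fine-tuned checkpoint and 222-character vocabulary are part of the released code rather than of this paper, the sketch uses the public XLSR French checkpoint mentioned above:

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

ckpt = "facebook/wav2vec2-large-xlsr-53-french"
processor = Wav2Vec2Processor.from_pretrained(ckpt)
model = Wav2Vec2ForCTC.from_pretrained(ckpt).eval()

def transcribe(waveform, sampling_rate=16_000):
    # waveform: 1-D float array of raw audio samples at 16 kHz
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)       # greedy CTC decoding
    return processor.batch_decode(ids)[0]
```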
4.2 End-to-end
The novel E2E implementation for S2T abstractive summarization proposed in this
work does not directly use the audio waveform or MFB/MFCC features as input. In-
stead, it takes speech features generated by the same pre-trained W2V2 model that was
trained for ASR. The S2T abstractive summarizer takes the speech features and converts
them to a summary of the audio content.
Speech Feature Extractor Following the methodology used in [12], we computed PWCCA scores [10] between word-level embeddings extracted from each transformer layer of the W2V2 base model and pre-trained French word embeddings obtained in [4]. We found that the 7th transformer layer generates the representations most similar to the word embeddings. For this reason, the speech feature extractor is composed of all the layers of the W2V2 model up to and including the 7th transformer layer.
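For illustration, a sketch of extracting such intermediate features with the Transformers library; in the Hugging Face output, hidden_states[k] is the output of the k-th transformer layer (hidden_states[0] being the convolutional front-end projection):

```python
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("LeBenchmark/wav2vec2-FR-7K-base").eval()

def speech_features(waveform: torch.Tensor, layer: int = 7) -> torch.Tensor:
    # waveform: 1-D tensor of raw audio samples at 16 kHz
    with torch.no_grad():
        out = model(waveform.unsqueeze(0), output_hidden_states=True)
    return out.hidden_states[layer].squeeze(0)   # (frames, feature_dim)
```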
Extractive Baseline The extractive baseline uses the same ASR system as the cascade abstractive summarizer. We adopted a simple centroid-based approach, where the sentence embeddings are provided by a publicly available unsupervised extractive CamemBERT-based model [8]. The summary is constructed by concatenating the top-k sentences closest to the centroid until a maximum number of words w̄ is reached, where w̄ = 24 was set to match the average summary length in the dev split of the BNews corpus.
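A minimal sketch of this centroid heuristic, assuming the sentence embeddings have already been computed by the CamemBERT-based encoder:

```python
import numpy as np

def centroid_summary(sentences, embeddings, max_words=24):
    """Concatenate the sentences closest to the centroid embedding
    until the word budget is exhausted."""
    emb = np.asarray(embeddings, dtype=float)
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)
    centroid = emb.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    order = np.argsort(-(emb @ centroid))            # most similar first
    picked, words = [], 0
    for i in order:
        n = len(sentences[i].split())
        if picked and words + n > max_words:
            break
        picked.append(i)
        words += n
    return " ".join(sentences[i] for i in sorted(picked))
```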
5 Cross-modal Adapter
5.1 Architecture
Att. Weights:
$$\alpha^{e}_{ti} = \frac{\exp(e^{e}_{ti})}{\sum_{j=1}^{\tilde{L}} \exp(e^{e}_{tj})}, \qquad \alpha^{d}_{tt'} = \frac{\exp(e^{d}_{tt'})}{\sum_{j=1}^{t-1} \exp(e^{d}_{tj})}, \qquad \alpha^{\mathrm{eos}}_{tt'} = \frac{\exp(e^{\mathrm{eos}}_{tt'})}{\sum_{j=t-w}^{t+w} \exp(e^{\mathrm{eos}}_{tj})}$$

Cont. Vectors:
$$c^{e}_{t} = \sum_{i=1}^{\tilde{L}} \alpha^{e}_{ti}\, h^{e}_{i}, \qquad c^{d}_{t} = \sum_{t'=1}^{t-1} \alpha^{d}_{tt'}\, h^{d}_{t'}, \qquad c^{\mathrm{eos}}_{t} = \sum_{t'=t-w}^{t+w} \alpha^{\mathrm{eos}}_{tt'}\, y_{t'}$$
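To make these attention patterns concrete, a small PyTorch sketch, assuming the unnormalized energies are given; the layers that produce the energies and the hidden states are omitted:

```python
import torch

def attn_context(scores, values, mask=None):
    """Generic attention: normalize energies e_{ti} over the allowed
    positions and return context vectors c_t = sum_i alpha_{ti} h_i.
    scores: (T, L), values: (L, d), mask: (T, L) with True = disallowed."""
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ values

def causal_mask(T):
    # Decoder attention alpha^d: step t only sees steps t' < t.
    i = torch.arange(T)
    return i[None, :] >= i[:, None]

def window_mask(T, w):
    # EOS attention alpha^eos: step t sees the window [t - w, t + w].
    i = torch.arange(T)
    return (i[None, :] - i[:, None]).abs() > w
```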
5.2 Pre-training
The pre-training of the cross-modal adapter encompasses three sequential stages, which are described below. The input speech features and target textual features were normalized such that each dimension had zero mean and unit variance.
Stage 1: At this stage, we used the same Common Voice corpus that was used to train the ASR model. A proportion of the speech features in the sequence $x^{(i)}$ is randomly masked: every element of the sequence has probability $p_{\mathrm{mask}} = 6.5 \times 10^{-2}$ of starting a masked span of length $M_{\mathrm{mask}} = 10$ at that position (values identical to the ones used to train the W2V2 model). The cross-modal adapter is trained to minimise the mean squared error (MSE) between the reference embeddings $y^{(i)}$ and the ones predicted from the masked sequence, $\hat{y}^{(i)}$.
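A minimal sketch of this span-masking scheme over a sequence of frame-level features:

```python
import numpy as np

def span_mask(num_frames, p_mask=6.5e-2, span=10):
    """W2V2-style masking: each position starts a masked span of
    `span` frames with probability `p_mask`; spans may overlap."""
    mask = np.zeros(num_frames, dtype=bool)
    for start in np.flatnonzero(np.random.rand(num_frames) < p_mask):
        mask[start:start + span] = True
    return mask
```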
Stage 2: We dropped the CV dataset and used the BNews corpus during this training stage. The objective remains the minimization of the MSE. Masking is no longer used, and the default teacher forcing algorithm for training Seq2seq models is replaced by the peeling back algorithm introduced in [16]. For the $j$-th minibatch (training step), we linearly decay the teacher forcing ratio, $\lambda(j) = \max(\epsilon, k - cj)$, with $\epsilon = 5.0 \times 10^{-1}$, $k = 1.0$, and $c = 8.0 \times 10^{-6}$.
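The decay schedule itself is a one-liner with the Stage 2 constants:

```python
def teacher_forcing_ratio(step, eps=0.5, k=1.0, c=8.0e-6):
    # Linearly decayed probability of feeding the ground-truth embedding
    # (rather than the model's own prediction) at each decoding step.
    return max(eps, k - c * step)
```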
Stage 3: The cross-modal adapter is now trained to predict the end of the sequence of textual embeddings, again using the BNews dataset. Given a $T^{(i)}$-sized sequence of predicted textual embeddings $\hat{y}^{(i)}$, predicting for every $\hat{y}^{(i)}_{t}$ whether it is the end of the sequence is a binary classification problem, so minimizing a binary cross-entropy loss suffices. All the model weights are frozen except for the ones directly associated with end-of-sequence prediction ($W^{\mathrm{eos}}_{\mathrm{attn}}$ and $W_{\mathrm{eos}}$), and the peeling back algorithm with linear decay is used again, with $\epsilon = 0.0$, $k = 1.0$, and $c = 3.0 \times 10^{-4}$.
After this three-stage pre-training, the cross-modal adapter and the text decoder are
jointly trained for abstractive summarization on the BNews dataset with a multitask
objective consisting of the usual cross-entropy loss for summarization and the binary
cross-entropy for EOS detection.
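A sketch of such a multitask objective in PyTorch; the tensor shapes are assumptions:

```python
import torch.nn.functional as F

def multitask_loss(token_logits, target_ids, eos_logits, eos_labels, pad_id=0):
    # Summarization: token-level cross-entropy over the decoder vocabulary.
    # token_logits: (batch, steps, vocab), target_ids: (batch, steps)
    ce = F.cross_entropy(token_logits.transpose(1, 2), target_ids,
                         ignore_index=pad_id)
    # EOS detection: per-position binary cross-entropy.
    # eos_logits, eos_labels: (batch, steps)
    bce = F.binary_cross_entropy_with_logits(eos_logits, eos_labels.float())
    return ce + bce
```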
6 Evaluation
6.1 Automatic Evaluation
For assessing the performance of the different implementations developed in this work, we use the ROUGE package (https://huggingface.co/spaces/evaluate-metric/rouge), more specifically the ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum metrics. Decoding for the cascade and E2E abstractive summarizers is performed with beam search. Table 2 compares the ROUGE scores
for the extractive baseline and both cascade and E2E abstractive summarizers on the
test split of the BNews corpus. We include the topline performance, which is simply the
T2T abstractive summarizer from the cascade system applied on the gold transcripts
(GT), and thus serves as an upper bound for the performance of the cascade abstrac-
tive summarizer. We also performed ablation studies for the following cases: the S2T
abstractive summarizer is not fine-tuned on the BNews corpus after the pre-training
of the cross-modal adapter (nFT); there is no fine-tuning and the cross-modal adapter
additionally does not make use of its predictions for the end-of-sequence positions of
the sequences of textual embeddings and instead uses the gold ones (G-EOS); the pre-training of the cross-modal adapter described in section 5.2 is not performed and the S2T abstractive summarizer is directly trained using the BNews dataset (nPre).
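For reference, a minimal sketch of computing these metrics with the Hugging Face evaluate library (the example strings are purely illustrative):

```python
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["des milliers de personnes rassemblées à madrid"],
    references=["des milliers de personnes rassemblées à madrid pour dire non"],
)
print(scores)  # keys: rouge1, rouge2, rougeL, rougeLsum
```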
All the abstractive systems outperform the extractive baseline, which was expected
given that the target summaries from our corpus are abstractive. The cascade abstractive
summarizer yields worse scores than the topline model, which is due to ASR error
propagation. On the other hand, the E2E model performs worse than the cascade model,
as measured by ROUGE scores. This contrasts with the fact that, theoretically, E2E
modelling allows leveraging non-verbal and acoustic information besides the linguistic
one from transcripts, which is the only type of information that cascade systems have
access to. Regarding the ablation studies, comparing the performance of the E2E and E2E (nFT) models shows that fine-tuning the S2T abstractive summarizer after the pre-training of the cross-modal adapter significantly improves the ROUGE scores, with relative increases in the range of 25% to 50%. The similarity between the ROUGE scores of the E2E (nFT) and E2E (G-EOS) models allows us to conclude that the cross-modal adapter performs equally well when using its own predictions for the end-of-sequence positions of the sequences of textual embeddings and when using the ground-truth ones. Finally, the gap between E2E and E2E (nPre) shows that the proposed pre-training of the cross-modal adapter provides a very significant performance increase.
6.2 Human Evaluation

                                 FC     R     F
Extractive is better            0.17   0.10   0.13
Tie                             0.73   0.07   0.47
Cascade (ASR + T2T) is better   0.10   0.83   0.40

Extractive is better            0.63   0.23   0.30
Tie                             0.30   0.13   0.33
End-to-end is better            0.07   0.63   0.37

Cascade (ASR + T2T) is better   0.60   0.57   0.37
Tie                             0.40   0.33   0.53
End-to-end is better            0.00   0.10   0.10

Table 3: Proportion of times that each model was considered the best in each pairwise comparison, according to three criteria: factual consistency (FC), relevance (R), and fluency (F).
E2E model generates a repetitive summary, thereby compromising its fluency. When comparing only the abstractive summarizers, the cascade and the E2E one, we clearly see that the cascade system produces summaries that are better in all three evaluated attributes, which is in line with the automatic evaluation with the ROUGE metrics.
Transcript Des milliers de personnes rassemblées à Madrid pour dire ”non” à la grâce des indépendantistes catalans, en-
visagée par le chef du gouvernement espagnol. En Espagne, des milliers de personnes se sont rassemblées ce
dimanche à Madrid pour dire ”non” à la grâce des indépendantistes catalans. (...)
Reference Des milliers de personnes rassemblées à Madrid pour dire Non à la grâce des indépendantistes catalans, envisagée
par le chef du gouvernement espagnol.
Extractive Ces deux partis sont profondément oposés à l’initiative de l’actuele Premier ministre, Le socialiste Pedron
Sanchez, qui souhaite acorder la grâce au leader séparatistes condamné pour la tentative de sécesion de 2017
Cascade En Espagne, des miliers de persones se sont rasemblées ce dimanche à Madrid pour dire non à la grâce des
indépendantistes catalans.
E2E Des milliers de personnes se sont rassemblées ce dimanche à Madrid pour dire non à la grâce des indépendantistes
catalans.
Transcript La Belgique se dit prête à accueillir Jean-Pierre Bemba. Ce jeudi, le chef de la diplomatie belge, Didier Reynders,
a annoncé que le pays était prêt à accueillir l’ancien dirigeant congolais, qui souhaite rejoindre sa famille, qui vit
dans le pays. (...)
Reference Le chef de la diplomatie belge l’a annoncé ce jeudi.
Extractive Cet ancien home d’afaires devenue ensuite vice-président de la République démocratique du Congo, avait été
condamné en première instance à 18 ans de prison il y a deux ans
Cascade Didier Renders a finalement été acquité par la Cour pénale internationale, des acusations de crime de guere et
de crimes contre l’humanité la semaine dernière qui a permis la mise en liberté conditionele de celui qui a été
incarcéré pendant dix ans à La Haye.
E2E Le chef de la diplomatie belge a annoncé que le pays était prêt à accueillir Jean-Claude Juncker. Ce jeudi, le chef
de la diplomatie belge a annoncé que le pays était prêt à accueillir Jean-Claude Juncker.
Table 4: Examples of summaries produced by the different summarizers.
7 Discussion
The results from the automatic and human evaluations indicate that the E2E abstractive summarizer underperforms with respect to the cascade one. This underperformance may be explained by considering the several sub-modules of the cascade and E2E sum-
marizers. Both make use of a W2V2-based model either for speech recognition or plain
speech feature extraction. The T2T abstractive summarizer of the cascade system and
the S2T abstractive summarizer of the E2E system share the same decoder, but differ
strongly on the encoder. Thus, the limited performance of the proposed novel E2E im-
plementation when compared with the cascade system must be sourced on the particular
realization of the cross-modal adapter. We have strong reasons to believe that the large
T2T summarization corpus (MLSUM [20]), to which the encoder of the T2T summa-
rizer was exposed during its training for abstractive summarization, played a significant
role. It is likely that this enormous amount of external data makes the text encoder gen-
erate much richer textual latent representations than the ones the cross-modal adapter
could possibly generate, given that it only had access to the summarization training data
from the BNews corpus during its development.
8 Conclusion
We proposed a novel E2E model for S2T abstractive summarization of broadcast news
in French. It leverages external data from T2T summarization corpora through transfer-
ring the decoder from a T2T abstractive summarizer. Additionally, we proposed a dedicated pre-training of the cross-modal adapter that leverages external data from an ASR dataset
besides the BNews corpus. We presented an extensive analysis that took into account
automatic and human evaluations for assessing the quality of the generated summaries.
Although the E2E model did not beat the cascade, our contributions helped to close the
performance gap between the two approaches, as is shown by our ablation studies.
The small amount of abstractive summarization training data available for pre-training the cross-modal adapter was shown to be the most likely source of the underperformance of the E2E model. Future work should therefore focus on enriching the training of the cross-modal adapter, for instance, by also transferring the text encoder from the T2T abstractive summarizer and carefully training it to process speech features as input. Another possible and not mutually exclusive direction is to enlarge the training data with examples generated from T2T summarization corpora through speech synthesis. Finally, the lack of large corpora with speech/summary pairs severely limits any fully supervised approach for developing an E2E system. Future work on building such datasets is needed in order to improve the promising E2E systems.
Acknowledgments This work was supported by the EU H2020 SELMA project (grant
agreement No. 957017).
References
1. Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: A framework for self-
supervised learning of speech representations. In: Larochelle, H., Ranzato, M., Hadsell, R.,
Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp.
12449–12460. Curran Associates, Inc. (2020). https://doi.org/10.48550/arXiv.2006.11477
2. Davis, S., Mermelstein, P.: Comparison of parametric representations for monosyllabic word
recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and
Signal Processing 28(4), 357–366 (1980). https://doi.org/10.1109/TASSP.1980.1163420
3. Evain, S., Nguyen, H., Le, H., Boito, M.Z., Mdhaffar, S., Alisamir, S., Tong, Z., Tomashenko,
N., Dinarelli, M., Parcollet, T., Allauzen, A., Estève, Y., Lecouteux, B., Portet, F., Rossato,
S., Ringeval, F., Schwab, D., Besacier, L.: Task agnostic and task specific self-supervised
learning from speech with LeBenchmark. In: Thirty-fifth Conference on Neural Information
Processing Systems Datasets and Benchmarks Track (Round 2) (2021)
4. Ferreira, D.C., Martins, A.F.T., Almeida, M.S.C.: Jointly learning to embed and predict with
multiple languages. In: Proc. of the 54th Annual Meeting of the Association for Computa-
tional Linguistics (Volume 1: Long Papers). pp. 2019–2028. Association for Computational
Linguistics, Berlin, Germany (Aug 2016). https://doi.org/10.18653/v1/P16-1190
5. Furui, S.: Speaker-independent isolated word recognition based on emphasized spectral dy-
namics. In: ICASSP ’86. IEEE International Conference on Acoustics, Speech, and Signal
Processing. vol. 11, pp. 1991–1994 (1986). https://doi.org/10.1109/ICASSP.1986.1168654
6. Gupta, S., Gupta, S.K.: Abstractive summarization: An overview of the
state of the art. Expert Systems with Applications 121, 49–65 (2019).
https://doi.org/10.1016/j.eswa.2018.12.011
7. Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V.,
Zettlemoyer, L.: BART: Denoising sequence-to-sequence pre-training for natural language
generation, translation, and comprehension. In: Proc. of the 58th Annual Meeting of the
Association for Computational Linguistics. pp. 7871–7880. Association for Computational
Linguistics, Online (Jul 2020). https://doi.org/10.18653/v1/2020.acl-main.703
8. Martin, L., Muller, B., Ortiz Suárez, P.J., Dupont, Y., Romary, L., de la Clerg-
erie, É., Seddah, D., Sagot, B.: CamemBERT: a tasty French language model. In:
Proc. of the 58th Annual Meeting of the Association for Computational Linguis-
tics. pp. 7203–7219. Association for Computational Linguistics, Online (Jul 2020).
https://doi.org/10.18653/v1/2020.acl-main.645
9. Matsuura, K., Ashihara, T., Moriya, T., Tanaka, T., Ogawa, A., Delcroix, M., Ma-
sumura, R.: Leveraging large text corpora for end-to-end speech summarization (2023).
https://doi.org/10.48550/arXiv.2303.00978
10. Morcos, A., Raghu, M., Bengio, S.: Insights on representational similarity in neural networks
with canonical correlation. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-
Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 31.
Curran Associates, Inc. (2018). https://doi.org/10.48550/arXiv.1806.05759
11. Ogawa, A., Hirao, T., Nakatani, T., Nagata, M.: Ilp-based compressive speech summarization
with content word coverage maximization and its oracle performance analysis. In: ICASSP
2019 - IEEE International Conference on Acoustics, Speech and Signal Processing. pp.
7190–7194 (2019). https://doi.org/10.1109/ICASSP.2019.8683543
12. Pasad, A., Chou, J.C., Livescu, K.: Layer-wise analysis of a self-supervised speech represen-
tation model. In: 2021 IEEE Automatic Speech Recognition and Understanding Workshop
(ASRU). pp. 914–921 (2021). https://doi.org/10.1109/ASRU51503.2021.9688093
13. Paulus, R., Xiong, C., Socher, R.: A deep reinforced model for abstractive
summarization. In: International Conference on Learning Representations (2018).
https://doi.org/10.48550/arXiv.1705.04304
14. Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation.
In: Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014).
https://doi.org/10.3115/v1/D14-1162
15. Pernes, D., Mendes, A., Martins, A.F.T.: Improving abstractive summarization with energy-
based re-ranking. In: Proc. of the 2nd Workshop on Natural Language Generation, Evalua-
tion, and Metrics. pp. 1–17. Association for Computational Linguistics, Abu Dhabi, United
Arab Emirates (2022). https://doi.org/10.48550/arXiv.2210.15553
16. Peters, B., Correia, G., Mihaylova, T.: An exploration of teacher forcing techniques for neural
machine translation (2018)
17. Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever,
I.: Robust speech recognition via large-scale weak supervision (2022).
https://doi.org/10.48550/arXiv.2212.04356
18. Rezazadegan, D., Berkovsky, S., Quiroz, J.C., Kocaballi, A.B., Wang, Y., Laranjo,
L., Coiera, E.W.: Automatic speech summarisation: A scoping review (2020).
https://doi.org/10.48550/arXiv.2008.11897
19. Rothe, S., Narayan, S., Severyn, A.: Leveraging pre-trained checkpoints for sequence genera-
tion tasks. Transactions of the Association for Computational Linguistics 8, 264–280 (2020).
https://doi.org/10.1162/tacl_a_00313
20. Scialom, T., Dray, P.A., Lamprier, S., Piwowarski, B., Staiano, J.: MLSUM: The multilingual
summarization corpus. In: Proc. of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP). pp. 8051–8067. Association for Computational Linguistics,
Online (2020). https://doi.org/10.18653/v1/2020.emnlp-main.647
21. Sharma, R., Palaskar, S., Black, A.W., Metze, F.: End-to-end speech summariza-
tion using restricted self-attention. In: ICASSP 2022 - IEEE International Con-
ference on Acoustics, Speech and Signal Processing. pp. 8072–8076 (2022).
https://doi.org/10.1109/ICASSP43922.2022.9747320
22. Tündik, M.Á., Kaszás, V., Szaszák, G.: Assessing the Semantic Space Bias Caused by
ASR Error Propagation and its Effect on Spoken Document Summarization. In: Proc. Inter-
speech 2019. pp. 1333–1337 (2019). https://doi.org/10.21437/Interspeech.2019-2154
23. Weng, S.Y., Lo, T.H., Chen, B.: An effective contextual language model-
ing framework for speech summarization with augmented features. In: 2020
28th European Signal Processing Conference (EUSIPCO). pp. 316–320 (2021).
https://doi.org/10.23919/Eusipco47968.2020.9287432
24. Zhang, Y., Ni, A., Yu, T., Zhang, R., Zhu, C., Deb, B., Celikyilmaz, A., Awadallah, A.H.,
Radev, D.: An exploratory study on long dialogue summarization: What works and what’s
next. In: Findings of the Association for Computational Linguistics: EMNLP 2021. pp.
4426–4433. Association for Computational Linguistics, Punta Cana, Dominican Republic
(2021). https://doi.org/10.18653/v1/2021.findings-emnlp.377
25. Zhuang, L., Wayne, L., Ya, S., Jun, Z.: A robustly optimized BERT pre-training approach
with post-training. In: Proc. of the 20th Chinese National Conference on Computational Lin-
guistics. pp. 1218–1227. Chinese Information Processing Society of China, Huhhot, China
(Aug 2021). https://doi.org/10.48550/arXiv.1907.11692