task whose input is a free-form text paragraph or document(s), and the output sequence is expected to be a short summary of the input. ViT5 achieves state-of-the-art results on both single-document summarization tasks. We also analyze the max-length hyperparameter for input and output sequences during self-supervised learning and show that longer lengths, matching the length of the downstream documents, lead to better results.

For NER, we reformulate the per-token classification task as a generation task, in which the decoder reconstructs the original input sentence with a Named Entity tag inserted after each tagged token (Phan et al., 2021b). This simple and straightforward formulation achieves competitive results compared to direct per-token classification with encoder-only models (Nguyen and Nguyen, 2020).

2 Related Work

There are many abstractive summarization studies in English. In an early example, Gehrmann et al. (2018) employed a bottom-up content selector (BottomUp) to determine which phrases in the source document should be part of the summary, and a copy mechanism was then applied only to the pre-selected phrases during decoding. Their experiments obtained significant ROUGE improvements on several canonical summarization datasets.

In recent years, pretrained language models have been used to enhance performance on language generation tasks (Liu and Lapata, 2019). One notable framework is GSum, which effectively used different types of guidance signals as input in order to generate more suitable words and more accurate summaries; this model achieved state-of-the-art performance on four popular English summarization datasets.

Meanwhile, there are only a small number of studies on Vietnamese text summarization, and most of them focus on extractive summarization. Nguyen et al. (2018) compared a wide range of extractive methods, including unsupervised ranking methods (e.g., LexRank, LSA, KL-divergence), supervised learning methods using TF-IDF and classifiers (e.g., Support Vector Machines, AdaBoost, Learning-to-Rank), and deep learning methods (e.g., Convolutional Neural Networks, Long Short-Term Memory). Similarly, Nguyen et al. (2019) evaluated extractive methods on their own dataset, which was released publicly as a benchmark for future studies.

Recent work (Quoc et al., 2021) investigated the combination of a pretrained BERT model and an unsupervised K-means clustering algorithm for extractive text summarization. The authors utilized multilingual and monolingual BERT models to encode sentence-level contextual information and then ranked the sentences with the K-means algorithm. Their report showed that monolingual models achieved better results than multilingual models on the same extractive summarization tasks. However, due to the lack of studies on Vietnamese abstractive summarization, we compare both multilingual and monolingual encoder-decoder models.

3 ViT5

In this section, we explain our newly released ViT5 models, the vocabulary generation steps, the pretraining data, and the training setup.

Figure 1: Loss curves for the masked span prediction task used to pretrain the ViT5 models. The larger model with the larger context optimizes much better, which leads to better downstream performance.
Figure 2: An overview of the ViT5 encoder-decoder architecture, with input-output examples for two downstream tasks. Inputs are given to the encoder as "<task_name>: <input_text>" and the decoder generates "<output_text>". For Named Entity Recognition, the decoder reconstructs the sentence with inserted entity tags.

Example (summarization) input: "wikilingua: Anh ấy bắt xe tới tham gia bữa tiệc tại một nhà hàng sang trọng. Nhưng trong buổi tiệc, anh ấy ngã quỵ xuống và được đưa tới bệnh viện." (He took the car to attend a party at a luxury restaurant. But at the party, he collapsed and was taken to the hospital.)
Output: "Anh ấy đã nhập viện sau khi tham gia bữa tiệc." (He was hospitalized after attending the party.)

Example (NER) input: "pho_ner: Bệnh nhân 75 là nữ , 40 tuổi , địa chỉ Quận 2 , TP. HCM" (Patient No. 75 is a female, 40 years old, and lives in District 2, HCM city)
Output: "Bệnh nhân PATIENT_ID* 75 PATIENT_ID* là GENDER* nữ GENDER* , AGE* 40 AGE* tuổi , địa chỉ LOCATION* Quận 2 LOCATION* , LOCATION* TP. HCM LOCATION*" (Patient PATIENT_ID* No. 75 PATIENT_ID* is a GENDER* female GENDER* , AGE* 40 AGE* years old, and lives in LOCATION* District 2 LOCATION* , LOCATION* HCM city LOCATION*)
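As a concrete illustration of the text-to-text format in Figure 2, the sketch below builds (input, target) pairs for the two tasks. Only the task prefixes and the tag-insertion pattern are taken from the figure; the helper names are ours, not the authors' released preprocessing code.

```python
# Minimal sketch of the "<task_name>: <input_text>" -> "<output_text>" format.
# Task prefixes ("wikilingua:", "pho_ner:") follow Figure 2; function names are
# illustrative only.

def make_summarization_example(document: str, summary: str) -> dict:
    """Summarization: input is the prefixed document, target is the abstract."""
    return {"input": f"wikilingua: {document}", "target": summary}

def make_ner_example(tokens: list[str], tags: list[str]) -> dict:
    """NER as generation: the target repeats the sentence with each tagged
    token wrapped in its entity label (e.g. 'AGE* 40 AGE*')."""
    source = " ".join(tokens)
    target_pieces = []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            target_pieces.append(token)
        else:
            target_pieces.append(f"{tag}* {token} {tag}*")
    return {"input": f"pho_ner: {source}", "target": " ".join(target_pieces)}

print(make_ner_example(["Bệnh", "nhân", "75", "là", "nữ"],
                       ["O", "O", "PATIENT_ID", "O", "GENDER"]))
# {'input': 'pho_ner: Bệnh nhân 75 là nữ',
#  'target': 'Bệnh nhân PATIENT_ID* 75 PATIENT_ID* là GENDER* nữ GENDER*'}
```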
3.1 Model

ViT5 follows the encoder-decoder architecture proposed by Vaswani et al. (2017) and the T5 framework proposed by Raffel et al. (2019). The original T5 work proposed five model-size configurations: small, base, large, 3B, and 11B. For the purpose of practical study, we adapt the base (310M parameters) and large (866M parameters) configurations for ViT5 and leave bigger models for future work.
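For readers who want to try the released checkpoints, a minimal loading-and-generation sketch with the Hugging Face transformers library is shown below. The hub ID is an assumption based on the repository name, and the example reuses the Figure 2 input; see https://github.com/vietai/ViT5 for the actual released models.

```python
# Minimal sketch: loading a ViT5 checkpoint with Hugging Face transformers.
# "VietAI/vit5-base" is an assumed hub ID, and a summarization-finetuned
# checkpoint would be needed for the output to be a real summary.
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "VietAI/vit5-base"  # assumed hub ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

text = ("wikilingua: Anh ấy bắt xe tới tham gia bữa tiệc tại một nhà hàng "
        "sang trọng. Nhưng trong buổi tiệc, anh ấy ngã quỵ xuống và được đưa "
        "tới bệnh viện.")
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
output_ids = model.generate(**inputs, max_length=256, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```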
We train ViT5 models with two different input and output lengths: 256 and 1024. We experimented thoroughly with both settings to gain insight into the importance of pretraining data length for summarization tasks. For the self-supervised learning objective, we use the span-corruption objective with a corruption rate of 15%. Figure 1 shows the loss computed during the self-supervised training stage for the three models.
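The toy sketch below illustrates the span-corruption objective at the 15% corruption rate described above. The mean span length and the sentinel-token naming follow the original T5 convention and are assumptions here; the paper itself specifies only the corruption rate.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Toy span corruption: masks ~corruption_rate of the tokens in contiguous
    spans and builds the (input, target) pair T5-style with sentinel tokens.
    Span length and sentinel naming are assumptions borrowed from T5."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * corruption_rate))
    masked = set()
    while len(masked) < n_to_mask:
        start = rng.randrange(len(tokens))
        for i in range(start, min(len(tokens), start + mean_span_len)):
            masked.add(i)

    inputs, targets, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if i in masked:
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and i in masked:
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return " ".join(inputs), " ".join(targets)

src, tgt = span_corrupt("Hà Nội là thủ đô của Việt Nam".split())
print(src)  # e.g. "Hà Nội là <extra_id_0> Việt Nam"
print(tgt)  # e.g. "<extra_id_0> thủ đô của"
```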
3.2 Vocabulary

Unlike some other current Vietnamese Transformer-based language models, we find that an effective vocabulary can contribute a significant improvement to model performance. We therefore carefully pre-processed a 5GB subset of our pretraining corpus, normalizing punctuation and capitalization and splitting numbers, fixed the vocabulary size at 36K sub-words, and trained a SentencePiece (Kudo and Richardson, 2018) model on that dataset.
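A minimal sketch of this vocabulary step, assuming the standard SentencePiece Python API; the input file name and the unigram model type are illustrative choices, since the paper specifies only the toolkit and the 36K vocabulary size.

```python
import sentencepiece as spm

# Train a 36K-subword vocabulary on the pre-processed 5GB subset.
# "vi_pretrain_subset.txt" and the unigram model type are illustrative
# assumptions; the paper fixes only the toolkit and the vocabulary size.
spm.SentencePieceTrainer.train(
    input="vi_pretrain_subset.txt",
    model_prefix="vit5_spm",
    vocab_size=36000,
    model_type="unigram",
    character_coverage=0.9995,
)

sp = spm.SentencePieceProcessor(model_file="vit5_spm.model")
print(sp.encode("Hà Nội là thủ đô của Việt Nam", out_type=str))
```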
3.3 Pretraining Data

We use the CC100 dataset (Monolingual Datasets from Web Crawl Data) (Wenzek et al., 2020; Conneau et al., 2020), which contains monolingual data for over 100 languages and was constructed with the pipeline of Wenzek et al. (2020) applied to the January-December 2018 Common Crawl snapshots. The total size of the Vietnamese corpus is 138GB of raw text. We process and filter it into 69GB of short paragraphs for the 256-length model and 71GB of long paragraphs for the 1024-length model.
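The exact rules for splitting CC100 into the 256- and 1024-length corpora are not given in the paper; the sketch below shows one plausible way to route paragraphs by token count, with thresholds mirroring the model input lengths (an assumption on our part).

```python
def route_paragraph(paragraph: str, count_tokens, short_limit=256, long_limit=1024):
    """Route a paragraph to the short or long pretraining corpus by its token
    count. count_tokens is any callable returning a token count; the 256/1024
    thresholds mirror the model input lengths and are assumptions."""
    n = count_tokens(paragraph)
    if n <= short_limit:
        return "short"         # would go into the 69GB 256-length corpus
    if n <= long_limit:
        return "long"          # would go into the 71GB 1024-length corpus
    return "truncate_or_skip"  # handling of over-length text is not specified

# Example with a whitespace token counter (a SentencePiece counter could be
# substituted):
print(route_paragraph("Hà Nội là thủ đô của Việt Nam .", lambda s: len(s.split())))
```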
Table 1: Statistics of the finetuned datasets after filtering (size and average input/output length).

                       Wikilingua   Vietnews
Train                       13707      99134
Test                         3916      22498
#avg body length              521        519
#avg abstract length           44         38

4 Abstractive Summarization

4.1 Wikilingua

Wikilingua (Ladhak et al., 2020) is a large-scale multilingual corpus for abstractive summarization tasks. The corpus covers 18 languages, including Vietnamese. The article and summary pairs are extracted from WikiHow¹ and have been reviewed by human authors to ensure quality. The Vietnamese articles are translated from the original English articles and have been reviewed by WikiHow's international translation team.

4.2 Vietnews

Vietnews (Nguyen et al., 2019) is a single-document abstractive summarization dataset consisting of news from reputable Vietnamese news websites (tuoitre.vn, vnexpress.net, and nguoiduatin.vn). The authors removed all articles related to questionnaires, analytical comments, and weather forecasts to ensure the quality of document summarization, so the final released dataset includes only long news event documents. The data consists of 150704 word-level news articles, each a pair of summary abstract and body text. We follow the filtering pipeline of Tran et al. (2021) to deduplicate the train/dev/test sets. The statistics after filtering are shown in Table 1.

¹ https://www.wikihow.com
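The filtering pipeline of Tran et al. (2021) is not reproduced here; as a rough, hedged illustration of the deduplication idea mentioned above, the sketch below drops evaluation examples whose body text also appears verbatim in the training split.

```python
def deduplicate(train_bodies, eval_pairs):
    """Drop eval examples whose body text also occurs in the training split.
    Exact string matching only; a simplification of the full filtering
    pipeline of Tran et al. (2021)."""
    seen = {body.strip() for body in train_bodies}
    return [(body, abstract) for body, abstract in eval_pairs
            if body.strip() not in seen]
```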
Table 2: Test results on Wikilingua and Vietnews summarization.

                            WikiLingua                      Vietnews
Models                 ROUGE-1  ROUGE-2  ROUGE-L    ROUGE-1  ROUGE-2  ROUGE-L
Transformer (RND2RND)    46.25    16.57    29.82      57.56    24.25    35.53
PhoBERT2PhoBERT          50.40    19.88    32.49      60.37    29.12    39.44
mBERT2mBERT              52.82    20.57    31.55      59.67    27.36    36.73
mBART                    55.21    25.69    37.33      59.81    28.28    38.71
mT5                      55.27    27.63    38.30      58.05    26.76    37.38
BARTpho                  57.16    31.18    40.89      61.14    30.31    40.15
ViT5-base 256-length     57.86    29.98    40.23      61.85    31.70    41.70
ViT5-base 1024-length    58.61    31.46    41.45      62.77    33.16    42.75
ViT5-large 1024-length   60.22    33.12    43.08      63.37    34.24    43.55

Notes: The best scores are in bold and the second-best scores are underlined; scores in gray are from our experiments. Code and models for reproducing our experiments: https://github.com/vietai/ViT5
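The paper does not name the ROUGE implementation behind Table 2; as one possibility, the rouge-score package computes the three reported metrics as follows. The reference string is taken from Figure 2 and the prediction is a made-up system output.

```python
# One way to compute ROUGE-1/2/L; the paper does not name its implementation,
# so the rouge-score package is used here purely as an example.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
reference = "Anh ấy đã nhập viện sau khi tham gia bữa tiệc ."
prediction = "Anh ấy nhập viện sau bữa tiệc ."   # hypothetical system output
scores = scorer.score(reference, prediction)
for name, value in scores.items():
    print(name, round(value.fmeasure * 100, 2))
```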
Our ViT5 models score higher than the published BARTpho model when summarizing the Wikilingua corpus. This can be a result of the quality of the pretraining data: while BARTpho (and PhoBERT) was trained on 20GB of news data, the ViT5 models are trained on CC100, a subset of Common Crawl. The CC100 corpus contains a more diverse and general representation of the Vietnamese language than news data, whereas Wikilingua is closer to academic or instructional text than to news-like text.

5 Named Entity Recognition (NER)

Table 3: Test results on PhoNER COVID19.

Models                    Micro-F1
XLM-R-large                  93.8
PhoBERT-base                 94.2
PhoBERT-large                94.5
ViT5-base 256-length         93.19
ViT5-base 1024-length        94.5
ViT5-large 1024-length       93.8

Notes: The best scores are in bold.
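Since Micro-F1 in Table 3 is computed over per-token labels, the generated sentence with inserted tags has to be mapped back to one label per token. The sketch below is our own reconstruction of that inverse mapping, not the authors' evaluation code.

```python
def decode_tagged_sentence(generated: str) -> list[tuple[str, str]]:
    """Map a generated sentence with inserted 'TAG* token TAG*' markers back to
    (token, label) pairs; tokens outside any marker get the label 'O'.
    Our reconstruction of the inverse of the tagging format in Figure 2."""
    pairs, current = [], "O"
    for piece in generated.split():
        if piece.endswith("*"):            # e.g. "AGE*" opens or closes a span
            tag = piece[:-1]
            current = "O" if current == tag else tag
        else:
            pairs.append((piece, current))
    return pairs

print(decode_tagged_sentence(
    "Bệnh nhân PATIENT_ID* 75 PATIENT_ID* là GENDER* nữ GENDER*"))
# [('Bệnh', 'O'), ('nhân', 'O'), ('75', 'PATIENT_ID'), ('là', 'O'), ('nữ', 'GENDER')]
```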
Interestingly, a more general domain of pretraining text can lead to better domain-specific summarization performance. As shown in Section 4.4.1, our ViT5 models, while trained on the more general CC100 corpus, outperform current models that are trained on news-related corpora. More technical domains such as law, medicine, or engineering are not tested, as we leave these domain-specific summarization tasks for future studies.

The slightly better performance of ViT5-base 1024-length compared to ViT5-base 256-length suggests that longer document summarization (more than 512 tokens) needs a comparatively longer context length during the pretraining stage.

References

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. CoRR, abs/2005.14165.

The Viet Bui, Oanh Thi Tran, and Phuong Le-Hong. 2020. Improving sequence tagging for Vietnamese text using transformer-based neural models. CoRR, abs/2006.15994.
Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In EMNLP-IJCNLP.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. CoRR, abs/2001.08210.

Dat Quoc Nguyen and Anh Tuan Nguyen. 2020. PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1037–1042.

Hieu Nguyen, Long Phan, James Anibal, Alec Peltekian, and Hieu Tran. 2021. VieSum: How robust are transformer-based models on Vietnamese summarization?

Minh-Tien Nguyen, Hoang-Diep Nguyen, Thi-Hai-Nang Nguyen, and Van-Hau Nguyen. 2018. Towards state-of-the-art baselines for Vietnamese multi-document summarization. In 2018 10th International Conference on Knowledge and Systems Engineering (KSE), pages 85–90.

Van-Hau Nguyen, Thanh-Chinh Nguyen, Minh-Tien Nguyen, and Nguyen Hoai. 2019. VNDS: A Vietnamese dataset for summarization. Pages 375–380.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. CoRR, abs/1802.05365.

Huy Quoc To, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen, and Anh Gia-Tuan Nguyen. 2021. Monolingual versus multilingual BERTology for Vietnamese extractive multi-document summarization.

Nguyen Luong Tran, Duong Minh Le, and Dat Quoc Nguyen. 2021. BARTpho: Pre-trained sequence-to-sequence models for Vietnamese.

Thinh Hung Truong, Mai Hoang Dao, and Dat Quoc Nguyen. 2021. COVID-19 named entity recognition for Vietnamese. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4003–4012, Marseille, France. European Language Resources Association.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2020. mT5: A massively multilingual pre-trained text-to-text transformer. CoRR, abs/2010.11934.