The Trankit Library
Minh Van Nguyen, Viet Lai, Amir Pouran Ben Veyseh, Thien Huu Nguyen
Department of Computer and Information Science
University of Oregon, Eugene, Oregon, USA
{minhnv,vietl,apouranb,thien}@cs.uoregon.edu
[Figure 1 diagram: Joint Token and Sentence Splitter; Multi-word Token Expander; Joint Model for POS Tagging, Morphological Tagging, and Dependency Parsing; Lemmatizer; Named Entity Recognizer; all connected to a shared Multilingual Pretrained Transformer.]

Figure 1: Overall architecture of Trankit. A single multilingual pretrained transformer is shared across three components (pointed by the red arrows) of the pipeline for different languages.
...comes such limitations. Our toolkit can process raw text for fundamental NLP tasks, supporting 56 languages with 90 pre-trained pipelines on 90 treebanks of Universal Dependencies v2.5 (Zeman et al., 2019). By utilizing the state-of-the-art multilingual pretrained transformer XLM-Roberta (Conneau et al., 2020), Trankit advances state-of-the-art performance for sentence segmentation, part-of-speech (POS) tagging, morphological feature tagging, and dependency parsing, while achieving competitive or better performance for tokenization, multi-word token expansion, and lemmatization over the 90 treebanks. It also obtains competitive or better performance for named entity recognition (NER) on 11 public datasets.

Unlike previous work, our token and sentence splitter is wordpiece-based instead of character-based to better exploit contextual information, which is beneficial in many languages. Consider the following sentence:

"John Donovan from Argghhh! has put out a excellent slide show on what was actually found and fought for in Fallujah."

Trankit correctly recognizes this as a single sentence, while the character-based sentence splitters of Stanza and UDPipe are easily fooled by the exclamation mark "!", treating it as two separate sentences. To our knowledge, this is the first work to successfully build a wordpiece-based token and sentence splitter that works well for 56 languages.

Figure 1 presents the overall architecture of the Trankit pipeline, which features three novel transformer-based components: (i) the joint token and sentence splitter, (ii) the joint model for POS tagging, morphological tagging, and dependency parsing, and (iii) the named entity recognizer. One potential concern for our use of a large pretrained transformer model (i.e., XLM-Roberta) in Trankit involves GPU memory: different transformer-based components in the pipeline for one or multiple languages must be simultaneously loaded into memory to serve multilingual tasks. This could consume extensive memory if different versions of the large pre-trained transformer (finetuned for each component) were employed in the pipeline. As such, we introduce a novel plug-and-play mechanism with Adapters to address this memory issue. Adapters are small networks injected inside all layers of the pretrained transformer model that have shown their effectiveness as a lightweight alternative to the traditional finetuning of pretrained transformers (Houlsby et al., 2019; Peters et al., 2019; Pfeiffer et al., 2020a,b). In Trankit, a set of adapters (for the transformer layers) and task-specific weights (for final predictions) are created for each transformer-based component and each language, while only one single large multilingual pretrained transformer is shared across components and languages. Adapters allow us to learn language-specific features for tasks. During training, the shared pretrained transformer is fixed while only the adapters and task-specific weights are updated. At inference time, depending on the language of the input text and the currently active component, the corresponding trained adapter and task-specific weights are activated and plugged into the pipeline to process the input. This mechanism not only solves the memory problem but also substantially reduces the training time.
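As a minimal illustration of this plug-and-play mechanism, the following Python sketch uses the Pipeline interface from the trankit package as described in its documentation (the example sentences are our own, and exact argument names may differ across versions):

    from trankit import Pipeline

    # Load the shared multilingual encoder once, with English adapters active.
    p = Pipeline('english')

    # Adding another language loads only its small adapters and task-specific
    # weights, not another copy of the large pretrained transformer.
    p.add('vietnamese')

    # Plug-and-play: activate the Vietnamese adapters, then process raw text.
    p.set_active('vietnamese')
    vi_doc = p('Đây là một câu ví dụ.')

    # Switch back to English; the example above should be kept as one sentence.
    p.set_active('english')
    en_doc = p('John Donovan from Argghhh! has put out a excellent slide show '
               'on what was actually found and fought for in Fallujah.')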
2 Related Work
There have been works using pre-trained transformers to build models for character-based word segmentation for Chinese (Yang, 2019; Tian et al., 2020; Che et al., 2020); POS tagging for Dutch, English, Chinese, and Vietnamese (de Vries et al., 2019; Tenney et al., 2019; Tian et al., 2020; Che et al., 2020; Nguyen and Nguyen, 2020); morphological feature tagging for Estonian and Persian (Kittask et al., 2020; Mohseni and Tebbifakhr, 2019); and dependency parsing for English and Chinese (Tenney et al., 2019; Che et al., 2020). However, all of these works are developed only for specific languages and thus cannot readily support or scale to the multilingual setting.

Some works have designed multilingual transformer-based systems via multilingual training on the combined data of different languages (Tsai et al., 2019; Kondratyuk and Straka, 2019; Üstün et al., 2020). However, multilingual training is suboptimal (see Section 5). Also, these systems still rely on external resources to perform tokenization and sentence segmentation, and are thus unable to consume raw text. To our knowledge, this is the first work to successfully build a multilingual transformer-based NLP toolkit where different transformer-based models for many languages can be simultaneously loaded into GPU memory and process raw text inputs of different languages.

Figure 2: Left: location of an adapter (green box) inside a layer of the pretrained transformer. Gray boxes represent the original components of a transformer layer. Right: the network architecture of an adapter.

3 Design and Architecture

Adapters. Adapters play a critical role in making Trankit memory- and time-efficient for training and inference. Figure 2 shows the architecture and the location of an adapter inside a transformer layer. We use the adapter architecture proposed by Pfeiffer et al. (2020a,b), which consists of two projection layers Up and Down (feed-forward networks) and a residual connection:

    c_i = AddNorm(r_i),    h_i = Up(ReLU(Down(c_i))) + r_i        (1)

where r_i is the input vector from the transformer layer to the adapter and h_i is the output vector for the transformer layer i. During training, all the weights of the pretrained transformer (i.e., the gray boxes in Figure 2) are fixed and only the adapter weights of the two projection layers and the task-specific weights outside the transformer (for final predictions) are updated. As demonstrated in Figure 1, Trankit involves six components, described as follows.

Multilingual Encoder with Adapters. This is our core component that is shared across different transformer-based components for different languages of the system. Given an input raw text s, we first split it into substrings by spaces. Afterward, SentencePiece, a multilingual subword tokenizer (Kudo and Richardson, 2018; Kudo, 2018), is used to further split each substring into wordpieces. By concatenating the wordpiece sequences for the substrings, we obtain an overall sequence of wordpieces w = [w_1, w_2, ..., w_K] for s. In the next step, w is fed into the pretrained transformer, which is already integrated with adapters, to obtain the wordpiece representations:

    x^{l,m}_{1:K} = Transformer(w_{1:K}; θ^{l,m}_{AD})        (2)

Here, θ^{l,m}_{AD} represents the adapter weights for language l and component m of the system. As such, we have specific adapters in all transformer layers for each component m and language l. Note that if K is larger than the maximum input length of the pretrained transformer (i.e., 512), we further divide w into consecutive chunks, each with length less than or equal to the maximum length. The pretrained transformer is then applied over each chunk to obtain a representation vector for each wordpiece in w. Finally, x^{l,m}_{1:K} will be sent to component m to perform the corresponding task.
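As an illustrative PyTorch sketch of the adapter computation in Eq. (1) (not Trankit's actual implementation; the class name, hidden size, and bottleneck size are assumptions made here for exposition):

    import torch
    import torch.nn as nn

    class AdapterBlock(nn.Module):
        """Bottleneck adapter: c_i = AddNorm(r_i); h_i = Up(ReLU(Down(c_i))) + r_i."""
        def __init__(self, hidden_size=768, bottleneck_size=64):
            super().__init__()
            self.add_norm = nn.LayerNorm(hidden_size)            # produces c_i
            self.down = nn.Linear(hidden_size, bottleneck_size)  # Down projection
            self.up = nn.Linear(bottleneck_size, hidden_size)    # Up projection

        def forward(self, r):
            # r: output of a (frozen) transformer sub-layer, shape (batch, seq, hidden)
            c = self.add_norm(r)
            return self.up(torch.relu(self.down(c))) + r         # residual connection

    # In adapter-based training, only the adapter (and task-specific) parameters are
    # optimized; the weights of the shared pretrained transformer remain frozen.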
Joint Token and Sentence Splitter. Given the wordpiece representations x^{l,m}_{1:K} for this component, each vector x^{l,m}_i for w_i ∈ w will be consumed by a feed-forward network with a softmax in the end to predict if w_i is the end of a single-word token, the end of a multi-word token, or the end of a sentence. The predictions for all wordpieces in w will then be aggregated to determine token, multi-word token, and sentence boundaries for s.

Named Entity Recognizer. Given a sentence, the named entity recognizer determines spans of entity names by assigning a BIOES tag to each token in the sentence. We deploy a standard sequence labeling architecture using transformer-based representations for tokens, involving a feed-forward network followed by a Conditional Random Field.
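To make the BIOES tagging scheme concrete, the following small helper (our own illustration, not part of Trankit's code) converts a predicted BIOES tag sequence into entity spans:

    def bioes_to_spans(tags):
        """Convert BIOES tags, e.g. ['B-PER', 'E-PER', 'O', 'S-LOC'], into
        (start, end, type) spans with end-exclusive token indices."""
        spans, start = [], None
        for i, tag in enumerate(tags):
            prefix, _, label = tag.partition('-')
            if prefix == 'S':                          # single-token entity
                spans.append((i, i + 1, label))
                start = None
            elif prefix == 'B':                        # beginning of an entity
                start = i
            elif prefix == 'E' and start is not None:  # end of the current entity
                spans.append((start, i + 1, label))
                start = None
            elif prefix == 'O':                        # outside any entity
                start = None
            # 'I' tags simply continue the current entity
        return spans

    # bioes_to_spans(['B-PER', 'E-PER', 'O', 'S-LOC']) -> [(0, 2, 'PER'), (3, 4, 'LOC')]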
Table 1: Systems’ performance on test sets of the Universal Dependencies v2.5 treebanks. Performance for Stanza,
UDPipe, and spaCy is obtained using their public pretrained models. The overall performance for Trankit and
Stanza is computed as the macro-averaged F1 over 90 treebanks. Detailed performance of Trankit for 90 supported
treebanks can be found at our documentation page.
Figure 5: Output from Trankit. Some parts are collapsed to improve visualization.

...those languages. Figure 6 illustrates how to train a token and sentence splitter with TPipeline.
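As a rough sketch of such a training call (based on the TPipeline interface in Trankit's documentation; the configuration keys shown and all file paths are placeholders rather than the exact content of Figure 6):

    from trankit import TPipeline

    # Train a customized joint token and sentence splitter from CoNLL-U data.
    trainer = TPipeline(training_config={
        'category': 'customized',                # a new pipeline, not an existing UD treebank
        'task': 'tokenize',                      # joint token and sentence splitting
        'save_dir': './saved_model',
        'train_txt_fpath': './train.txt',        # raw training text
        'train_conllu_fpath': './train.conllu',  # gold tokens and sentence boundaries
        'dev_txt_fpath': './dev.txt',
        'dev_conllu_fpath': './dev.conllu',
    })
    trainer.train()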
Demo Website. A demo website for Trankit supporting the 90 pretrained pipelines is hosted at: http://nlp.uoregon.edu/trankit. Figure 7 shows its interface.
Table 2: Model performance on 9 different treebanks (macro-averaged F1 score over test sets).
Table 3 compares Trankit with Stanza (v1.1.1), Flair (v0.7), and spaCy (v2.3) on the test sets of the 11 considered NER datasets. Following Stanza, we report the performance of the other toolkits with their pretrained models on the canonical data splits if they are available. Otherwise, their best configurations are used to train the models on the same data splits (inherited from Stanza). Also, for the Dutch ...

[4] https://universaldependencies.org/conll18/evaluation.html
[5] spaCy can process 8140 tokens and 5912 tokens per second for UD and NER, respectively.

Figure 7: Demo website for Trankit.

Table 5: Model sizes for five languages.

5.4 Speed and Memory Usage

Table 4 reports the relative processing time for UD and NER of the toolkits compared to spaCy's CPU processing time [5]. For memory usage comparison, we show the model sizes of Trankit and
Stanza for several languages in Table 5. As can be seen, besides the multilingual transformer, model packages in Trankit only take dozens of megabytes, while Stanza consumes hundreds of megabytes for each package. This leads to Stanza's much higher memory usage when the pipelines for these languages are loaded at the same time. In fact, Trankit only takes 4.9GB to load all the 90 pretrained pipelines for the 56 supported languages.

5.5 Ablation Study

This section compares Trankit with two other possible strategies to build a multilingual system for fundamental NLP tasks. In the first strategy (called "Multilingual"), we train a single pipeline where all the components in the pipeline are trained with the combined training data of all the languages. The second strategy (called "No-adapters") involves eliminating adapters from XLM-Roberta in Trankit. As such, in "No-adapters", pipelines are still trained separately for each language; the pretrained transformer is fixed; and only the task-specific weights (for predictions) in the components are updated during training.

For evaluation, we select 9 treebanks for 3 different groups, i.e., high-resource, medium-resource, and low-resource, depending on the sizes of the treebanks. In particular, the high-resource group includes Czech, Russian, and Arabic; the medium-resource group includes French, English, and Chinese; and the low-resource group involves Belarusian, Telugu, and Lithuanian. Table 2 compares the average performance of Trankit, "Multilingual", and "No-adapters". As can be seen, "Multilingual" and "No-adapters" are significantly worse than the proposed adapter-based Trankit. We attribute this to the fact that multilingual training might suffer from unbalanced sizes of treebanks, causing high-resource languages to dominate others and impairing the overall performance. For "No-adapters", fixing the pretrained transformer might significantly limit the models' capacity for multiple tasks and languages.

6 Conclusion and Future Work

We introduce Trankit, a transformer-based multilingual toolkit that significantly improves the performance for fundamental NLP tasks, including sentence segmentation, part-of-speech tagging, morphological tagging, and dependency parsing, over 90 Universal Dependencies v2.5 treebanks of 56 different languages. Our toolkit is fast on GPUs and efficient in memory use, making it usable for general users. In the future, we plan to improve our toolkit by investigating different pretrained transformers such as mBERT and XLM-Roberta-large. We also plan to provide Named Entity Recognizers for more languages and add modules to perform more NLP tasks.
Acknowledgments

This research has been supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA Contract No. 2019-19051600006 under the Better Extraction from Text Towards Enhanced Retrieval (BETTER) Program. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, the Department of Defense, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. This document does not contain technology or technical data controlled under either the U.S. International Traffic in Arms Regulations or the U.S. Export Administration Regulations.

References

Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884, Minneapolis, Minnesota. Association for Computational Linguistics.

Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59, Minneapolis, Minnesota. Association for Computational Linguistics.

Darina Benikova, Chris Biemann, and Marc Reznicek. 2014. NoSta-D named entity annotation for German: Guidelines and dataset. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), pages 2524–2531, Reykjavik, Iceland. European Languages Resources Association (ELRA).

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Wanxiang Che, Yunlong Feng, Libo Qin, and Ting Liu. 2020. N-LTP: An open-source neural Chinese language technology platform with pretrained models. arXiv preprint arXiv:2009.11616.

Yoeng-Jin Chu. 1965. On the shortest arborescence of a directed graph. Scientia Sinica.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of the International Conference on Learning Representations.

Jack Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards B.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the International Conference on Machine Learning.

Hiroshi Kanayama and Ran Iwamoto. 2020. How universal are Universal Dependencies? Exploiting syntax for multilingual clause-level sentiment detection. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4063–4073, Marseille, France. European Language Resources Association.

Claudia Kittask, Kirill Milintsevich, and Kairit Sirts. 2020. Evaluating multilingual BERT for Estonian. Volume, 328:19–26.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.
Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Jian Liu, Yubo Chen, Kang Liu, and Jun Zhao. 2019a. Neural cross-lingual event detection with minimal parallel resources. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 738–748, Hong Kong, China. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the Conference on Neural Information Processing Systems.

Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer, and Noah A. Smith. 2012. Recall-oriented learning of named entities in Arabic Wikipedia. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 162–173, Avignon, France. Association for Computational Linguistics.

Mahdi Mohseni and Amirhossein Tebbifakhr. 2019. MorphoBERT: a Persian NER system with BERT and morphological analysis. In Proceedings of The First International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2019) co-located with ICNLSP 2019 - Short Papers, pages 23–30, Trento, Italy. Association for Computational Linguistics.

Dat Quoc Nguyen and Anh Tuan Nguyen. 2020. PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1037–1042, Online. Association for Computational Linguistics.

Minh Van Nguyen and Thien Huu Nguyen. 2021. Improving cross-lingual transfer for event argument extraction with language-universal sentence structures. In Proceedings of the Sixth Arabic Natural Language Processing Workshop (WANLP) at EACL 2021.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4034–4043, Marseille, France. European Language Resources Association.

Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy, and James R. Curran. 2012. Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, 194:151–175.

Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 7–14, Florence, Italy. Association for Computational Linguistics.

Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020a. AdapterHub: A framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 46–54, Online. Association for Computational Linguistics.

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020b. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online. Association for Computational Linguistics.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108, Online. Association for Computational Linguistics.

Milan Straka. 2018. UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 197–207, Brussels, Belgium. Association for Computational Linguistics.

Nasrin Taghizadeh and Heshaam Faili. 2020. Cross-lingual adaptation using universal dependencies. arXiv preprint arXiv:2003.10816.
Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Yuanhe Tian, Yan Song, Xiang Ao, Fei Xia, Xiaojun Quan, Tong Zhang, and Yonggang Wang. 2020. Joint Chinese word segmentation and part-of-speech tagging via two-way attentions of auto-analyzed knowledge. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8286–8296, Online. Association for Computational Linguistics.

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002).

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and practical BERT models for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3632–3636, Hong Kong, China. Association for Computational Linguistics.

Ahmet Üstün, Arianna Bisazza, Gosse Bouma, and Gertjan van Noord. 2020. UDapter: Language adaptation for truly Universal Dependency parsing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2302–2315, Online. Association for Computational Linguistics.

Wietse de Vries, Andreas van Cranenburgh, Arianna Bisazza, Tommaso Caselli, Gertjan van Noord, and Malvina Nissim. 2019. BERTje: A Dutch BERT model. arXiv preprint arXiv:1912.09582.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. OntoNotes release 5.0. Linguistic Data Consortium.

Haiqin Yang. 2019. BERT meets Chinese word segmentation. arXiv preprint arXiv:1909.09292.

Daniel Zeman, Joakim Nivre, Mitchell Abrams, Noëmi Aepli, Željko Agić, Lars Ahrenberg, Gabrielė Aleksandravičiūtė, Lene Antonsen, Katya Aplonova, Maria Jesus Aranzabe, Gashaw Arutie, Masayuki Asahara, Luma Ateyah, Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, Elena Badmaeva, Miguel Ballesteros, Esha Banerjee, Sebastian Bank, Verginica Barbu Mititelu, Victoria Basmov, Colin Batchelor, John Bauer, Sandra Bellato, Kepa Bengoetxea, Yevgeni Berzak, Irshad Ahmad Bhat, Riyaz Ahmad Bhat, Erica Biagetti, Eckhard Bick, Agnė Bielinskienė, Rogier Blokland, Victoria Bobicev, Loïc Boizou, Emanuel Borges Völker, Carl Börstell, Cristina Bosco, Gosse Bouma, Sam Bowman, Adriane Boyd, Kristina Brokaitė, Aljoscha Burchardt, Marie Candito, Bernard Caron, Gauthier Caron, Tatiana Cavalcanti, Gülşen Cebiroğlu Eryiğit, Flavio Massimiliano Cecchini, Giuseppe G. A. Celano, Slavomír Čéplö, Savas Cetin, Fabricio Chalub, Jinho Choi, Yongseok Cho, Jayeol Chun, Alessandra T. Cignarella, Silvie Cinková, Aurélie Collomb, Çağrı Çöltekin, Miriam Connor, Marine Courtin, Elizabeth Davidson, Marie-Catherine de Marneffe, Valeria de Paiva, Elvis de Souza, Arantza Diaz de Ilarraza, Carly Dickerson, Bamba Dione, Peter Dirix, Kaja Dobrovoljc, Timothy Dozat, Kira Droganova, Puneet Dwivedi, Hanne Eckhoff, Marhaba Eli, Ali Elkahky, Binyam Ephrem, Olga Erina, Tomaž Erjavec, Aline Etienne, Wograine Evelyn, Richárd Farkas, Hector Fernandez Alcalde, Jennifer Foster, Cláudia Freitas, Kazunori Fujita, Katarína Gajdošová, Daniel Galbraith, Marcos Garcia, Moa Gärdenfors, Sebastian Garza, Kim Gerdes, Filip Ginter, Iakes Goenaga, Koldo Gojenola, Memduh Gökırmak, Yoav Goldberg, Xavier Gómez Guinovart, Berta González Saavedra, Bernadeta Griciūtė, Matias Grioni, Normunds Grūzītis, Bruno Guillaume, Céline Guillot-Barbance, Nizar Habash, Jan Hajič, Jan Hajič jr., Mika Hämäläinen, Linh Hà Mỹ, Na-Rae Han, Kim Harris, Dag Haug, Johannes Heinecke, Felix Hennig, Barbora Hladká, Jaroslava Hlaváčová, Florinel Hociung, Petter Hohle, Jena Hwang, Takumi Ikeda, Radu Ion, Elena Irimia, Ọlájídé Ishola, Tomáš Jelínek, Anders Johannsen, Fredrik Jørgensen, Markus Juutinen, Hüner Kaşıkara, Andre Kaasen, Nadezhda Kabaeva, Sylvain Kahane, Hiroshi Kanayama, Jenna Kanerva, Boris Katz, Tolga Kayadelen, Jessica Kenney, Václava Kettnerová, Jesse Kirchner, Elena Klementieva, Arne Köhn, Kamil Kopacewicz, Natalia Kotsyba, Jolanta Kovalevskaitė, Simon Krek, Sookyoung Kwak, Veronika Laippala, Lorenzo Lambertino, Lucia Lam, Tatiana Lando, Septina Dian Larasati, Alexei Lavrentiev, John Lee, Phương Lê Hồng, Alessandro Lenci, Saran Lertpradit, Herman Leung, Cheuk Ying Li, Josie Li, Keying Li, KyungTae Lim, Maria Liovina, Yuan Li, Nikola Ljubešić, Olga Loginova, Olga Lyashevskaya, Teresa Lynn, Vivien Macketanz, Aibek Makazhanov, Michael Mandl, Christopher Manning, Ruli Manurung, Cătălina Mărănduc, David Mareček, Katrin Marheinecke, Héctor Martínez Alonso, André Martins, Jan Mašek, Yuji Matsumoto, Ryan McDonald, Sarah McGuinness, Gustavo Mendonça, Niko Miekka, Margarita Misirpashayeva, Anna Missilä, Cătălin Mititelu, Maria Mitrofan, Yusuke Miyao, Simonetta
Montemagni, Amir More, Laura Moreno Romero,
Keiko Sophie Mori, Tomohiko Morioka, Shinsuke
Mori, Shigeki Moro, Bjartur Mortensen, Bohdan
Moskalevskyi, Kadri Muischnek, Robert Munro,
Yugo Murawaki, Kaili Müürisep, Pinkey Nainwani,
Juan Ignacio Navarro Horñiacek, Anna Nedoluzhko,
Gunta Nešpore-Bērzkalne, Lương Nguyễn Thị,
Huyền Nguyễn Thị Minh, Yoshihiro Nikaido, Vi-
taly Nikolaev, Rattima Nitisaroj, Hanna Nurmi,
Stina Ojala, Atul Kr. Ojha, Adédayọ Olúòkun,
Mai Omura, Petya Osenova, Robert Östling, Lilja
Øvrelid, Niko Partanen, Elena Pascual, Marco
Passarotti, Agnieszka Patejuk, Guilherme Paulino-
Passos, Angelika Peljak-Łapińska, Siyao Peng,
Cenel-Augusto Perez, Guy Perrier, Daria Petrova,
Slav Petrov, Jason Phelan, Jussi Piitulainen,
Tommi A Pirinen, Emily Pitler, Barbara Plank,
Thierry Poibeau, Larisa Ponomareva, Martin Popel,
Lauma Pretkalniņa, Sophie Prévost, Prokopis Proko-
pidis, Adam Przepiórkowski, Tiina Puolakainen,
Sampo Pyysalo, Peng Qi, Andriela Rääbis, Alexan-
dre Rademaker, Loganathan Ramasamy, Taraka
Rama, Carlos Ramisch, Vinit Ravishankar, Livy
Real, Siva Reddy, Georg Rehm, Ivan Riabov,
Michael Rießler, Erika Rimkutė, Larissa Rinaldi,
Laura Rituma, Luisa Rocha, Mykhailo Romanenko,
Rudolf Rosa, Davide Rovati, Valentin Ros, ca, Olga
Rudina, Jack Rueter, Shoval Sadde, Benoı̂t Sagot,
Shadi Saleh, Alessio Salomoni, Tanja Samardžić,
Stephanie Samson, Manuela Sanguinetti, Dage
Särg, Baiba Saulı̄te, Yanin Sawanakunanon, Nathan
Schneider, Sebastian Schuster, Djamé Seddah, Wolf-
gang Seeker, Mojgan Seraji, Mo Shen, Atsuko
Shimada, Hiroyuki Shirasu, Muh Shohibussirri,
Dmitry Sichinava, Aline Silveira, Natalia Silveira,
Maria Simi, Radu Simionescu, Katalin Simkó,
Mária Šimková, Kiril Simov, Aaron Smith, Isabela
Soares-Bastos, Carolyn Spadine, Antonio Stella,
Milan Straka, Jana Strnadová, Alane Suhr, Umut
Sulubacak, Shingo Suzuki, Zsolt Szántó, Dima
Taji, Yuta Takahashi, Fabio Tamburini, Takaaki
Tanaka, Isabelle Tellier, Guillaume Thomas, Li-
isi Torga, Trond Trosterud, Anna Trukhina, Reut
Tsarfaty, Francis Tyers, Sumire Uematsu, Zdeňka
Urešová, Larraitz Uria, Hans Uszkoreit, Andrius
Utka, Sowmya Vajjala, Daniel van Niekerk, Gert-
jan van Noord, Viktor Varga, Eric Villemonte de la
Clergerie, Veronika Vincze, Lars Wallin, Abigail
Walsh, Jing Xian Wang, Jonathan North Washing-
ton, Maximilan Wendt, Seyi Williams, Mats Wirén,
Christian Wittern, Tsegay Woldemariam, Tak-sum
Wong, Alina Wróblewska, Mary Yako, Naoki Ya-
mazaki, Chunxiao Yan, Koichi Yasuoka, Marat M.
Yavrumyan, Zhuoran Yu, Zdeněk Žabokrtský, Amir
Zeldes, Manying Zhang, and Hanzhi Zhu. 2019.
Universal dependencies 2.5. LINDAT/CLARIAH-
CZ digital library at the Institute of Formal and Ap-
plied Linguistics (ÚFAL), Faculty of Mathematics
and Physics, Charles University.
Xingran Zhu. 2020. Cross-lingual word sense disambiguation using mBERT embeddings with syntactic dependencies. arXiv preprint arXiv:2012.05300.