
Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing

Minh Van Nguyen, Viet Lai, Amir Pouran Ben Veyseh, Thien Huu Nguyen
Department of Computer and Information Science
University of Oregon, Eugene, Oregon, USA
{minhnv,vietl,apouranb,thien}@cs.uoregon.edu

Abstract

We introduce Trankit, a light-weight Transformer-based Toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks over 100 languages, and 90 pretrained pipelines for 56 languages. Built on a state-of-the-art pretrained language model, Trankit significantly outperforms prior multilingual NLP pipelines on sentence segmentation, part-of-speech tagging, morphological feature tagging, and dependency parsing, while maintaining competitive performance for tokenization, multi-word token expansion, and lemmatization over 90 Universal Dependencies treebanks. Despite the use of a large pretrained transformer, our toolkit is still efficient in memory usage and speed. This is achieved by our novel plug-and-play mechanism with Adapters, where a multilingual pretrained transformer is shared across pipelines for different languages. Our toolkit, along with pretrained models and code, is publicly available at: https://github.com/nlp-uoregon/trankit. A demo website for our toolkit is also available at: http://nlp.uoregon.edu/trankit. Finally, we create a demo video for Trankit at: https://youtu.be/q0KGP3zGjGc.

1 Introduction

Many efforts have been devoted to developing multilingual NLP systems to overcome language barriers (Aharoni et al., 2019; Liu et al., 2019a; Taghizadeh and Faili, 2020; Zhu, 2020; Kanayama and Iwamoto, 2020; Nguyen and Nguyen, 2021). A large portion of existing multilingual systems has focused on downstream NLP tasks that critically depend on upstream linguistic features, ranging from basic information such as token and sentence boundaries for raw text to more sophisticated structures such as part-of-speech tags, morphological features, and dependency trees of sentences (called fundamental NLP tasks). As such, building effective multilingual systems/pipelines for fundamental upstream NLP tasks to produce such information has the potential to transform multilingual downstream systems.

There have been several NLP toolkits that address multilingualism for fundamental NLP tasks, including spaCy (https://spacy.io/), UDify (Kondratyuk and Straka, 2019), Flair (Akbik et al., 2019), CoreNLP (Manning et al., 2014), UDPipe (Straka, 2018), and Stanza (Qi et al., 2020). However, these toolkits have their own limitations. spaCy is designed to focus on speed and thus has to sacrifice performance. UDify and Flair cannot process raw text as they depend on external tokenizers. CoreNLP supports raw text but does not offer state-of-the-art performance. UDPipe and Stanza are recent toolkits that leverage word embeddings, i.e., word2vec (Mikolov et al., 2013) and fastText (Bojanowski et al., 2017), to deliver current state-of-the-art performance for many languages. However, Stanza's and UDPipe's pipelines for different languages are trained separately and do not share any components, especially the embedding layers that account for most of the model size. This makes their memory usage grow aggressively as pipelines for more languages are simultaneously needed and loaded into memory (e.g., for language-learning apps). Most importantly, none of these toolkits has explored contextualized embeddings from pretrained transformer-based language models, which have the potential to significantly improve the performance of NLP tasks, as demonstrated in many prior works (Devlin et al., 2019; Liu et al., 2019b; Conneau et al., 2020).
[Figure 1 omitted: pipeline diagram. Input: raw sentence/document string for a language l among the supported languages; the input flows through the Joint Token and Sentence Splitter, the Multi-word Token Expander, the Joint Model for POS, Morphological Tagging, and Dependency Parsing, the Lemmatizer, and the Named Entity Recognizer, producing a hierarchical native Python dictionary as output; a shared multilingual pretrained transformer backs three of these components.]

Figure 1: Overall architecture of Trankit. A single multilingual pretrained transformer is shared across three components (pointed by the red arrows) of the pipeline for different languages.

In this paper, we introduce Trankit, a multilingual Transformer-based NLP Toolkit that overcomes such limitations. Our toolkit can process raw text for fundamental NLP tasks, supporting 56 languages with 90 pretrained pipelines on 90 treebanks of Universal Dependencies v2.5 (Zeman et al., 2019). By utilizing the state-of-the-art multilingual pretrained transformer XLM-Roberta (Conneau et al., 2020), Trankit advances state-of-the-art performance for sentence segmentation, part-of-speech (POS) tagging, morphological feature tagging, and dependency parsing, while achieving competitive or better performance for tokenization, multi-word token expansion, and lemmatization over the 90 treebanks. It also obtains competitive or better performance for named entity recognition (NER) on 11 public datasets.

Unlike previous work, our token and sentence splitter is wordpiece-based instead of character-based, to better exploit contextual information, which is beneficial in many languages. Consider the following sentence:

"John Donovan from Argghhh! has put out a excellent slide show on what was actually found and fought for in Fallujah."

Trankit correctly recognizes this as a single sentence, while the character-based sentence splitters of Stanza and UDPipe are easily fooled by the exclamation mark "!", treating it as two separate sentences. To our knowledge, this is the first work to successfully build a wordpiece-based token and sentence splitter that works well for 56 languages.

Figure 1 presents the overall architecture of the Trankit pipeline, which features three novel transformer-based components: (i) the joint token and sentence splitter, (ii) the joint model for POS tagging, morphological tagging, and dependency parsing, and (iii) the named entity recognizer. One potential concern for our use of a large pretrained transformer model (i.e., XLM-Roberta) in Trankit involves GPU memory, where different transformer-based components in the pipeline for one or multiple languages must be simultaneously loaded into memory to serve multilingual tasks. This could consume extensive memory if different versions of the large pretrained transformer (fine-tuned for each component) were employed in the pipeline. As such, we introduce a novel plug-and-play mechanism with Adapters to address this memory issue. Adapters are small networks injected inside all layers of the pretrained transformer model that have shown their effectiveness as a light-weight alternative to the traditional fine-tuning of pretrained transformers (Houlsby et al., 2019; Peters et al., 2019; Pfeiffer et al., 2020a,b). In Trankit, a set of adapters (for transformer layers) and task-specific weights (for final predictions) is created for each transformer-based component and each language, while only one single large multilingual pretrained transformer is shared across components and languages. Adapters allow us to learn language-specific features for tasks. During training, the shared pretrained transformer is fixed while only the adapters and task-specific weights are updated. At inference time, depending on the language of the input text and the currently active component, the corresponding trained adapters and task-specific weights are activated and plugged into the pipeline to process the input. This mechanism not only solves the memory problem but also substantially reduces the training time.
2 Related Work

There have been works using pretrained transformers to build models for character-based word segmentation for Chinese (Yang, 2019; Tian et al., 2020; Che et al., 2020); POS tagging for Dutch, English, Chinese, and Vietnamese (de Vries et al., 2019; Tenney et al., 2019; Tian et al., 2020; Che et al., 2020; Nguyen and Nguyen, 2020); morphological feature tagging for Estonian and Persian (Kittask et al., 2020; Mohseni and Tebbifakhr, 2019); and dependency parsing for English and Chinese (Tenney et al., 2019; Che et al., 2020). However, all of these works are developed only for specific languages and thus are potentially unable to support and scale to the multilingual setting.

Some works have designed multilingual transformer-based systems via multilingual training on the combined data of different languages (Tsai et al., 2019; Kondratyuk and Straka, 2019; Üstün et al., 2020). However, multilingual training is suboptimal (see Section 5). Also, these systems still rely on external resources to perform tokenization and sentence segmentation, and are thus unable to consume raw text. To our knowledge, this is the first work to successfully build a multilingual transformer-based NLP toolkit where different transformer-based models for many languages can be simultaneously loaded into GPU memory and process raw text inputs of different languages.

[Figure 2 omitted: diagram of a pretrained transformer layer (Multi-Head Attention, Feed-forward, and Add & Norm blocks) with an adapter inserted, and the adapter's internal structure (Add & Norm, FF Down, FF Up, residual connection).]

Figure 2: Left: location of an adapter (green box) inside a layer of the pretrained transformer. Gray boxes represent the original components of a transformer layer. Right: the network architecture of an adapter.
3 Design and Architecture

Adapters. Adapters play a critical role in making Trankit memory- and time-efficient for training and inference. Figure 2 shows the architecture and the location of an adapter inside a layer of the transformer. We use the adapter architecture proposed by Pfeiffer et al. (2020a,b), which consists of two projection layers Up and Down (feed-forward networks) and a residual connection:

c_i = AddNorm(r_i),   h_i = Up(ReLU(Down(c_i))) + r_i    (1)

where r_i is the input vector from the transformer layer for the adapter and h_i is the output vector for transformer layer i. During training, all the weights of the pretrained transformer (i.e., the gray boxes in Figure 2) are fixed and only the adapter weights of the two projection layers and the task-specific weights outside the transformer (for final predictions) are updated.
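To make the computation in Eq. (1) concrete, the following is a minimal PyTorch sketch of such an adapter block. The class name, hidden sizes, and the way the block would be attached to a frozen transformer layer are illustrative assumptions, not Trankit's actual implementation.

    import torch
    import torch.nn as nn

    class AdapterLayer(nn.Module):
        """Pfeiffer-style adapter: normalization, down-projection, ReLU,
        up-projection, and a residual connection, as in Eq. (1)."""

        def __init__(self, hidden_size: int = 768, bottleneck_size: int = 64):
            super().__init__()
            self.layer_norm = nn.LayerNorm(hidden_size)          # AddNorm of Eq. (1), modeled here as layer normalization
            self.down = nn.Linear(hidden_size, bottleneck_size)  # Down projection
            self.up = nn.Linear(bottleneck_size, hidden_size)    # Up projection

        def forward(self, r: torch.Tensor) -> torch.Tensor:
            c = self.layer_norm(r)                          # c_i = AddNorm(r_i)
            return self.up(torch.relu(self.down(c))) + r    # h_i = Up(ReLU(Down(c_i))) + r_i

    # Only adapter and task-specific weights are trained; the shared transformer
    # stays frozen, e.g.:  for p in shared_transformer.parameters(): p.requires_grad = False

Because the bottleneck is small, each language/component pair only adds a few megabytes of trainable parameters on top of the shared encoder, which is the basis of the plug-and-play mechanism described above.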
As demonstrated in Figure 1, Trankit involves six components, described as follows.

Multilingual Encoder with Adapters. This is our core component that is shared across different transformer-based components for different languages of the system. Given an input raw text s, we first split it into substrings by spaces. Afterward, SentencePiece, a multilingual subword tokenizer (Kudo and Richardson, 2018; Kudo, 2018), is used to further split each substring into wordpieces. By concatenating the wordpiece sequences for the substrings, we obtain an overall sequence of wordpieces w = [w_1, w_2, ..., w_K] for s. In the next step, w is fed into the pretrained transformer, which is already integrated with adapters, to obtain the wordpiece representations:

x^{l,m}_{1:K} = Transformer(w_{1:K}; θ^{l,m}_{AD})    (2)

Here, θ^{l,m}_{AD} represents the adapter weights for language l and component m of the system. As such, we have specific adapters in all transformer layers for each component m and language l. Note that if K is larger than the maximum input length of the pretrained transformer (i.e., 512), we further divide w into consecutive chunks, each with length less than or equal to the maximum length. The pretrained transformer is then applied over each chunk to obtain a representation vector for each wordpiece in w. Finally, x^{l,m}_{1:K} is sent to component m to perform the corresponding task.
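As a rough illustration of the chunking step for long inputs (not Trankit's actual code), one can split the wordpiece id sequence into windows of at most 512 positions, encode each window separately, and concatenate the per-wordpiece outputs; the encoder callable below is a placeholder for the adapter-equipped XLM-Roberta.

    from typing import Callable, List
    import torch

    MAX_LEN = 512  # maximum input length of the pretrained transformer

    def encode_long_sequence(wordpiece_ids: List[int],
                             encoder: Callable[[torch.Tensor], torch.Tensor]) -> torch.Tensor:
        """Encode a wordpiece sequence of arbitrary length K by running the encoder
        over consecutive chunks of at most MAX_LEN wordpieces and concatenating the
        per-wordpiece vectors (a simplification; the real chunking details may differ)."""
        chunks = [wordpiece_ids[i:i + MAX_LEN] for i in range(0, len(wordpiece_ids), MAX_LEN)]
        outputs = []
        for chunk in chunks:
            ids = torch.tensor(chunk).unsqueeze(0)   # shape (1, chunk_len)
            outputs.append(encoder(ids).squeeze(0))  # shape (chunk_len, hidden)
        return torch.cat(outputs, dim=0)             # shape (K, hidden): x_{1:K}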

Joint Token and Sentence Splitter. Given the wordpiece representations x^{l,m}_{1:K} for this component, each vector x^{l,m}_i for w_i ∈ w will be consumed by a feed-forward network with a softmax in the end to predict if w_i is the end of a single-word token, the end of a multi-word token, or the end of a sentence. The predictions for all wordpieces in w will then be aggregated to determine token, multi-word token, and sentence boundaries for s.
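A classification head of this kind could look roughly like the sketch below. The four-way label set (no boundary, end of single-word token, end of multi-word token, end of sentence) and all names are assumptions made for illustration, not the toolkit's exact design.

    import torch
    import torch.nn as nn

    # Assumed label inventory for each wordpiece w_i.
    LABELS = ["OTHER", "END_OF_TOKEN", "END_OF_MWT", "END_OF_SENTENCE"]

    class TokenSentenceSplitterHead(nn.Module):
        """Feed-forward network with a softmax over wordpiece vectors x_i."""

        def __init__(self, hidden_size: int = 768):
            super().__init__()
            self.ffn = nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, len(LABELS)),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (K, hidden) wordpiece representations -> (K, 4) label probabilities
            return torch.softmax(self.ffn(x), dim=-1)

    # Token, multi-word token, and sentence boundaries are then read off by
    # aggregating the predicted label of each wordpiece over the input string s.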
Multi-word Token Expander. This component is responsible for expanding each detected multi-word token (MWT) into multiple syntactic words². We follow Stanza in deploying a character-based seq2seq model for this component. This decision is based on our observation that the task is done best at the character level, and the character-based model (with character embeddings) is very small.

² For languages (e.g., English, Chinese) that do not require MWT expansion, tokens and words are the same concepts.

Joint Model for POS Tagging, Morphological Tagging and Dependency Parsing. In Trankit, given the detected sentences and tokens/words, we use a single model to jointly perform POS tagging, morphological feature tagging, and dependency parsing at the sentence level. Joint modeling mitigates error propagation, saves memory, and speeds up the system. In particular, given a sentence, the representation for each word is computed as the average of its wordpieces' transformer-based representations in x^{l,m}_{1:K}. Let t_{1:N} = [t_1, t_2, ..., t_N] be the representations of the words in the sentence. We compute the following vectors using feed-forward networks FFN_*:

r^{upos}_{1:N} = FFN_upos(t_{1:N}),   r^{xpos}_{1:N} = FFN_xpos(t_{1:N}),
r^{ufeats}_{1:N} = FFN_ufeats(t_{1:N}),   r^{dep}_{0:N} = [x_cls; FFN_dep(t_{1:N})]

Vectors for the words in r^{upos}_{1:N}, r^{xpos}_{1:N}, and r^{ufeats}_{1:N} are then passed to a softmax layer to make predictions for the UPOS, XPOS, and UFeats tags of each word. For dependency parsing, we use the classification token <s> to represent the root node, and apply Deep Biaffine Attention (Dozat and Manning, 2017) and the Chu-Liu/Edmonds algorithm (Chu, 1965; Edmonds, 1967) to assign a syntactic head and the associated dependency relation to each word in the sentence.
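The following is a compressed sketch of the tagging heads and a biaffine arc scorer. The hidden sizes and names are assumptions; the separate head/dependent projections are one common biaffine variant and differ slightly from the single FFN_dep in the equation above; relation labeling and Chu-Liu/Edmonds decoding are omitted. It is not Trankit's exact architecture.

    import torch
    import torch.nn as nn

    class JointTaggerParserHeads(nn.Module):
        """FFN heads for UPOS/XPOS/UFeats plus deep biaffine scoring of
        head-dependent arcs (Dozat and Manning, 2017) over word vectors t_{1:N}."""

        def __init__(self, hidden: int, n_upos: int, n_xpos: int, n_ufeats: int, dep_dim: int = 256):
            super().__init__()
            self.ffn_upos = nn.Linear(hidden, n_upos)
            self.ffn_xpos = nn.Linear(hidden, n_xpos)
            self.ffn_ufeats = nn.Linear(hidden, n_ufeats)
            self.ffn_head = nn.Linear(hidden, dep_dim)   # word as potential head
            self.ffn_dep = nn.Linear(hidden, dep_dim)    # word as dependent
            self.biaffine = nn.Parameter(torch.zeros(dep_dim + 1, dep_dim + 1))

        def forward(self, t: torch.Tensor, cls: torch.Tensor):
            # t: (N, hidden) word vectors; cls: (hidden,) vector of <s>, used as the root node.
            upos = torch.log_softmax(self.ffn_upos(t), dim=-1)
            xpos = torch.log_softmax(self.ffn_xpos(t), dim=-1)
            ufeats = torch.log_softmax(self.ffn_ufeats(t), dim=-1)

            heads = self.ffn_head(torch.cat([cls.unsqueeze(0), t], dim=0))  # (N+1, dep_dim), index 0 = root
            deps = self.ffn_dep(t)                                          # (N, dep_dim)
            ones_h = torch.ones(heads.size(0), 1, device=t.device)
            ones_d = torch.ones(deps.size(0), 1, device=t.device)
            # Biaffine arc scores: entry (i, j) scores word i taking head j (j = 0 is the root).
            arc_scores = torch.cat([deps, ones_d], dim=-1) @ self.biaffine \
                         @ torch.cat([heads, ones_h], dim=-1).T             # (N, N+1)
            return upos, xpos, ufeats, arc_scores

In the actual toolkit, the arc scores are combined with dependency relation classification and decoded into a well-formed tree with the Chu-Liu/Edmonds algorithm.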
puts into hierarchical native Python dictionaries,
Lemmatizer. This component receives sentences and their predicted UPOS tags to produce the canonical form for each word. We also employ a character-based seq2seq model for this component, as in Stanza.

Named Entity Recognizer. Given a sentence, the named entity recognizer determines spans of entity names by assigning a BIOES tag to each token in the sentence. We deploy a standard sequence labeling architecture using transformer-based representations for tokens, involving a feed-forward network followed by a Conditional Random Field.
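A sequence labeling head of this kind could be sketched as below: a feed-forward layer producing per-token emission scores over BIOES tags, plus a learned transition matrix with Viterbi decoding standing in for the CRF layer at inference time. The training-time CRF loss is omitted and all names are assumptions, not Trankit's code.

    import torch
    import torch.nn as nn

    class NERHead(nn.Module):
        """Emission scores over BIOES tags + transition matrix; Viterbi decoding."""

        def __init__(self, hidden_size: int, num_tags: int):
            super().__init__()
            self.emit = nn.Linear(hidden_size, num_tags)                 # feed-forward emission layer
            self.trans = nn.Parameter(torch.zeros(num_tags, num_tags))   # CRF-style transition scores

        def decode(self, token_reprs: torch.Tensor) -> list:
            # token_reprs: (N, hidden) transformer-based token representations.
            emissions = self.emit(token_reprs)                           # (N, num_tags)
            n, _ = emissions.shape
            score = emissions[0]            # best score of any path ending in each tag
            backpointers = []
            for i in range(1, n):
                # score[j] + trans[j, k] + emissions[i, k] for previous tag j, current tag k
                total = score.unsqueeze(1) + self.trans + emissions[i].unsqueeze(0)
                backpointers.append(total.argmax(dim=0))
                score = total.max(dim=0).values
            # Backtrack the highest-scoring BIOES tag sequence.
            best_tag = int(score.argmax())
            path = [best_tag]
            for bp in reversed(backpointers):
                best_tag = int(bp[best_tag])
                path.append(best_tag)
            return list(reversed(path))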
4 Usage

Detailed documentation for Trankit can be found at: https://trankit.readthedocs.io.

Trankit Installation. Trankit is written in Python and available on PyPI: https://pypi.org/project/trankit/. Users can install our toolkit via pip using:

pip install trankit

Initialize a Pipeline. Lines 1-4 in Figure 3 show how to initialize a pretrained pipeline for English; it is instructed to run on a GPU and to store downloaded pretrained models in the specified cache directory. Trankit will not download pretrained models if they already exist.

Multilingual Usage. Figure 3 shows how to initialize a multilingual pipeline and process inputs of different languages in Trankit:

 1 from trankit import Pipeline
 2
 3 # initialize a multilingual pipeline
 4 p = Pipeline(lang='english', gpu=True, cache_dir='./cache')
 5 langs = ['arabic', 'chinese', 'dutch']
 6 for lang in langs:
 7     p.add(lang)
 8
 9 # tokenize English input
10 p.set_active('english')
11 en = p.tokenize('Rich was here before the scheduled time.')
12
13 # get ner tags for Arabic input
14 p.set_active('arabic')
15 ar = p.ner('ﻮﻛﺎﻧ ﻚﻨﻋﺎﻧ ﻖﺒﻟ ﺬﻠﻛ ﺮﺌﻴﺳ ﺞﻫﺍﺯ ﺍﻼﻤﻧ ﻭﺍﻼﺴﺘﻃﻼﻋ ﻞﻠﻗﻭﺎﺗ ﺎﻠﺳﻭﺮﻳﺓ ﺎﻠﻋﺎﻤﻟﺓ ﻒﻳ ﻞﺒﻧﺎﻧ.')

Figure 3: Multilingual pipeline initialization.

Basic Functions. Trankit can process inputs which are untokenized (raw) or pretokenized strings, at both sentence and document levels. Figure 4 illustrates a simple code to perform all the supported tasks for an input text. We organize Trankit's outputs into hierarchical native Python dictionaries, which can be easily inspected by users. Figure 5 demonstrates the output of the command on line 7 in Figure 4.

 1 from trankit import Pipeline
 2
 3 p = Pipeline(lang='english', gpu=True, cache_dir='./cache')
 4
 5 doc = '''Hello! This is Trankit.'''
 6 # perform all tasks on the input
 7 all = p(doc)

Figure 4: A function performing all tasks on the input.

// Output
{
  'text': 'Hello! This is Trankit.',        // input string
  'sentences': [                            // list of sentences
    {
      'id': 1, 'text': 'Hello!', 'dspan': (0, 6), 'tokens': [...]
    },
    {
      'id': 2,                              // sentence index
      'text': 'This is Trankit.', 'dspan': (7, 23),   // sentence span
      'tokens': [                           // list of tokens
        {
          'id': 1,                          // token index
          'text': 'This', 'upos': 'PRON', 'xpos': 'DT',
          'feats': 'Number=Sing|PronType=Dem',
          'head': 3, 'deprel': 'nsubj', 'lemma': 'this', 'ner': 'O',
          'dspan': (7, 11),                 // document-level span of the token
          'span': (0, 4)                    // sentence-level span of the token
        },
        {'id': 2...},
        {'id': 3...},
        {'id': 4...}
      ]
    }
  ]
}

Figure 5: Output from Trankit. Some parts are collapsed to improve visualization.
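Since the output in Figure 5 is a nested native Python dictionary, it can be traversed with ordinary dictionary and list operations. For example, a minimal loop printing each token with its POS tag and dependency head might look like the following, assuming the `all` variable from Figure 4:

    # Iterate over the hierarchical output of Figures 4/5: document -> sentences -> tokens.
    for sentence in all['sentences']:
        print(f"Sentence {sentence['id']}: {sentence['text']}")
        for token in sentence['tokens']:
            # Each token dictionary carries the predictions of all pipeline components.
            print(token['id'], token['text'], token['upos'],
                  token.get('lemma'), token['head'], token['deprel'])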

Training your own Pipelines. Trankit also provides a trainable pipeline for 100 languages via the class TPipeline. This ability is inherited from the XLM-Roberta encoder, which is pretrained on those languages. Figure 6 illustrates how to train a token and sentence splitter with TPipeline.

 1 from trankit import TPipeline
 2
 3 tp = TPipeline(training_config={
 4     'task': 'tokenize',
 5     'save_dir': './saved_model',
 6     'train_txt_fpath': './train.txt',
 7     'train_conllu_fpath': './train.conllu',
 8     'dev_txt_fpath': './dev.txt',
 9     'dev_conllu_fpath': './dev.conllu'})
10
11 tp.train()

Figure 6: Training a token and sentence splitter using the CoNLL-U formatted data (Nivre et al., 2020).

Demo Website. A demo website for Trankit supporting the 90 pretrained pipelines is hosted at: http://nlp.uoregon.edu/trankit. Figure 7 shows its interface.

[Figure 7 omitted: screenshot of the demo website.]

Figure 7: Demo website for Trankit.

5 System Evaluation

5.1 Datasets & Hyper-parameters

To achieve a fair comparison, we follow Stanza (Qi et al., 2020) to train and evaluate all the models on the same canonical data splits of 90 Universal Dependencies treebanks v2.5 (UD2.5)³ (Zeman et al., 2019), and 11 public NER datasets provided in the following corpora: AQMAR (Mohit et al., 2012), CoNLL02 (Tjong Kim Sang, 2002), CoNLL03 (Tjong Kim Sang and De Meulder, 2003), GermEval14 (Benikova et al., 2014), OntoNotes (Weischedel et al., 2013), and WikiNER (Nothman et al., 2012). Hyper-parameters for all models and datasets are selected based on the development data in this work.

³ We skip 10 treebanks whose languages are not supported by XLM-Roberta.
5.2 Universal Dependencies performance

Table 1 compares the performance of Trankit and the latest available versions of other popular toolkits, including Stanza (v1.1.1) with current state-of-the-art performance, UDPipe (v1.2), and spaCy (v2.3), on the UD2.5 test sets. The performance for all systems is obtained using the official scorer of the CoNLL 2018 Shared Task⁴. On the five illustrated languages, Trankit achieves competitive performance on tokenization, MWT expansion, and lemmatization. Importantly, Trankit outperforms the other toolkits on all remaining tasks (e.g., POS and morphological tagging), and the improvement is substantial and significant for sentence segmentation and dependency parsing. For example, English enjoys a 7.22% improvement for sentence segmentation, and a 3.92% and 4.37% improvement for UAS and LAS in dependency parsing. For Arabic, Trankit has a remarkable improvement of 16.16% for sentence segmentation, while Chinese observes 12.31% and 12.72% improvements of UAS and LAS for dependency parsing.

Over all 90 treebanks, Trankit outperforms the previous state-of-the-art framework Stanza in most of the tasks, particularly for sentence segmentation (+3.24%), POS tagging (+1.44% for UPOS and +1.55% for XPOS), morphological tagging (+1.46%), and dependency parsing (+4.0% for UAS and +5.01% for LAS), while maintaining competitive performance on tokenization, multi-word expansion, and lemmatization.

⁴ https://universaldependencies.org/conll18/evaluation.html

Treebank                System   Tokens  Sents.  Words  UPOS   XPOS   UFeats  Lemmas  UAS    LAS
Overall (90 treebanks)  Trankit  99.23   91.82   99.02  95.65  94.05  93.21   94.27   87.06  83.69
                        Stanza   99.26   88.58   98.90  94.21  92.50  91.75   94.15   83.06  78.68
Arabic-PADT             Trankit  99.93   96.59   99.22  96.31  94.08  94.28   94.65   88.39  84.68
                        Stanza   99.98   80.43   97.88  94.89  91.75  91.86   93.27   83.27  79.33
                        UDPipe   99.98   82.09   94.58  90.36  84.00  84.16   88.46   72.67  68.14
Chinese-GSD             Trankit  97.01   99.70   97.01  94.21  94.02  96.59   97.01   85.19  82.54
                        Stanza   92.83   98.80   92.83  89.12  88.93  92.11   92.83   72.88  69.82
                        UDPipe   90.27   99.10   90.27  84.13  84.04  89.05   90.26   61.60  57.81
English-EWT             Trankit  98.48   88.35   98.48  95.95  95.71  96.26   96.84   90.14  87.96
                        Stanza   99.01   81.13   99.01  95.40  95.12  96.11   97.21   86.22  83.59
                        UDPipe   98.90   77.40   98.90  93.26  92.75  94.23   95.45   80.22  77.03
                        spaCy    97.44   63.16   97.44  86.99  91.05  -       87.16   55.38  37.03
French-GSD              Trankit  99.70   96.63   99.66  97.85  -      97.16   97.80   94.00  92.34
                        Stanza   99.68   94.92   99.48  97.30  -      96.72   97.64   91.38  89.05
                        UDPipe   99.68   93.59   98.81  95.85  -      95.55   96.61   87.14  84.26
                        spaCy    99.02   89.73   94.81  89.67  -      -       88.55   75.22  66.93
Spanish-Ancora          Trankit  99.94   99.13   99.93  99.02  98.94  98.80   99.17   94.11  92.41
                        Stanza   99.98   99.07   99.98  98.78  98.67  98.59   99.19   92.21  90.01
                        UDPipe   99.97   98.32   99.95  98.32  98.13  98.13   98.48   88.22  85.10
                        spaCy    99.95   97.54   99.43  93.43  -      -       80.02   89.35  83.81

Table 1: Systems' performance on test sets of the Universal Dependencies v2.5 treebanks. Performance for Stanza, UDPipe, and spaCy is obtained using their public pretrained models. The overall performance for Trankit and Stanza is computed as the macro-averaged F1 over 90 treebanks. Detailed performance of Trankit for the 90 supported treebanks can be found at our documentation page.
5.3 NER results

Table 3 compares Trankit with Stanza (v1.1.1), Flair (v0.7), and spaCy (v2.3) on the test sets of the 11 considered NER datasets. Following Stanza, we report the performance for other toolkits with their pretrained models on the canonical data splits if they are available. Otherwise, their best configurations are used to train the models on the same data splits (inherited from Stanza). Also, for the Dutch datasets, we retrain the models in Flair as those models (for Dutch) have been updated in version v0.7. As can be seen, Trankit obtains competitive or better performance for most of the languages, clearly demonstrating the benefit of using the pretrained transformer for multilingual NER.

Language  Corpus      Trankit  Stanza  Flair  spaCy
Arabic    AQMAR       74.8     74.3    74.0   -
Chinese   OntoNotes   80.0     79.2    -      69.3
Dutch     CoNLL02     91.8     89.2    91.3   73.8
          WikiNER     94.8     94.8    94.8   90.9
English   CoNLL03     92.1     92.1    92.7   81.0
          OntoNotes   89.6     88.8    89.0   85.4
French    WikiNER     92.3     92.9    92.5   88.8
German    CoNLL03     84.6     81.9    82.5   63.9
          GermEval14  86.9     85.2    85.4   68.4
Russian   WikiNER     92.8     92.9    -      -
Spanish   CoNLL02     88.9     88.1    87.3   77.5

Table 3: Performance (F1) on NER test sets.

5.4 Speed and Memory Usage

Table 4 reports the relative processing time for UD and NER of the toolkits compared to spaCy's CPU processing time⁵. For memory usage comparison, we show the model sizes of Trankit and Stanza for several languages in Table 5. As can be seen, besides the multilingual transformer, model packages in Trankit only take dozens of megabytes, while Stanza consumes hundreds of megabytes for each package. This leads to Stanza's usage of much more memory when the pipelines for these languages are loaded at the same time. In fact, Trankit only takes 4.9GB to load all 90 pretrained pipelines for the 56 supported languages.

⁵ spaCy can process 8140 tokens and 5912 tokens per second for UD and NER, respectively.

System   GPU UD  GPU NER  CPU UD  CPU NER
Trankit  4.50×   1.36×    19.8×   31.5×
Stanza   3.22×   1.08×    10.3×   17.7×
UDPipe   -       -        4.30×   -
Flair    -       1.17×    -       51.8×

Table 4: Run time on processing the English EWT treebank and the English OntoNotes NER dataset. Measurements are done on an NVIDIA Titan RTX card.

Model Package              Trankit    Stanza
Multilingual Transformer   1146.9MB   -
Arabic                     38.6MB     393.9MB
Chinese                    40.6MB     225.2MB
English                    47.9MB     383.5MB
French                     39.6MB     561.9MB
Spanish                    37.3MB     556.1MB
Total size                 1350.9MB   2120.6MB

Table 5: Model sizes for five languages.
5.5 Ablation Study

This section compares Trankit with two other possible strategies to build a multilingual system for fundamental NLP tasks. In the first strategy (called "Multilingual"), we train a single pipeline where all the components in the pipeline are trained with the combined training data of all the languages. The second strategy (called "No-adapters") involves eliminating adapters from XLM-Roberta in Trankit. As such, in "No-adapters", pipelines are still trained separately for each language; the pretrained transformer is fixed; and only task-specific weights (for predictions) in components are updated during training.

For evaluation, we select 9 treebanks for 3 different groups, i.e., high-resource, medium-resource, and low-resource, depending on the sizes of the treebanks. In particular, the high-resource group includes Czech, Russian, and Arabic; the medium-resource group includes French, English, and Chinese; and the low-resource group involves Belarusian, Telugu, and Lithuanian. Table 2 compares the average performance of Trankit, "Multilingual", and "No-adapters". As can be seen, "Multilingual" and "No-adapters" are significantly worse than the proposed adapter-based Trankit. We attribute this to the fact that multilingual training might suffer from unbalanced sizes of treebanks, causing high-resource languages to dominate others and impairing the overall performance. For "No-adapters", fixing the pretrained transformer might significantly limit the models' capacity for multiple tasks and languages.

System                                 Tokens  Sents.  Words  UPOS   XPOS   UFeats  Lemmas  UAS    LAS
Trankit (plug-and-play with adapters)  99.05   95.12   98.96  95.43  89.02  92.69   93.46   86.20  82.51
Multilingual                           96.69   88.95   96.35  91.19  84.64  88.10   90.02   72.96  68.66
No-adapters                            95.06   89.57   94.08  88.79  82.54  83.76   88.33   66.63  63.11

Table 2: Model performance on 9 different treebanks (macro-averaged F1 score over test sets).

6 Conclusion and Future Work

We introduce Trankit, a transformer-based multilingual toolkit that significantly improves the performance for fundamental NLP tasks, including sentence segmentation, part-of-speech tagging, morphological tagging, and dependency parsing, over 90 Universal Dependencies v2.5 treebanks of 56 different languages. Our toolkit is fast on GPUs and efficient in memory use, making it usable for general users. In the future, we plan to improve our toolkit by investigating different pretrained transformers such as mBERT and XLM-Roberta-large. We also plan to provide Named Entity Recognizers for more languages and add modules to perform more NLP tasks.
Acknowledgments

This research has been supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA Contract No. 2019-19051600006 under the Better Extraction from Text Towards Enhanced Retrieval (BETTER) Program. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, the Department of Defense, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. This document does not contain technology or technical data controlled under either the U.S. International Traffic in Arms Regulations or the U.S. Export Administration Regulations.
References

Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884, Minneapolis, Minnesota. Association for Computational Linguistics.

Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59, Minneapolis, Minnesota. Association for Computational Linguistics.

Darina Benikova, Chris Biemann, and Marc Reznicek. 2014. NoSta-D named entity annotation for German: Guidelines and dataset. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), pages 2524–2531, Reykjavik, Iceland. European Languages Resources Association (ELRA).

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Wanxiang Che, Yunlong Feng, Libo Qin, and Ting Liu. 2020. N-LTP: An open-source neural Chinese language technology platform with pretrained models. arXiv preprint arXiv:2009.11616.

Yoeng-Jin Chu. 1965. On the shortest arborescence of a directed graph. Scientia Sinica.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of the International Conference on Learning Representations.

Jack Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards B.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the International Conference on Machine Learning.

Hiroshi Kanayama and Ran Iwamoto. 2020. How universal are Universal Dependencies? Exploiting syntax for multilingual clause-level sentiment detection. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4063–4073, Marseille, France. European Language Resources Association.

Claudia Kittask, Kirill Milintsevich, and Kairit Sirts. 2020. Evaluating multilingual BERT for Estonian. Volume 328:19–26.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Jian Liu, Yubo Chen, Kang Liu, and Jun Zhao. 2019a. Neural cross-lingual event detection with minimal parallel resources. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 738–748, Hong Kong, China. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the Conference on Neural Information Processing Systems.

Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer, and Noah A. Smith. 2012. Recall-oriented learning of named entities in Arabic Wikipedia. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 162–173, Avignon, France. Association for Computational Linguistics.

Mahdi Mohseni and Amirhossein Tebbifakhr. 2019. MorphoBERT: a Persian NER system with BERT and morphological analysis. In Proceedings of The First International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2019) co-located with ICNLSP 2019 - Short Papers, pages 23–30, Trento, Italy. Association for Computational Linguistics.

Dat Quoc Nguyen and Anh Tuan Nguyen. 2020. PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1037–1042, Online. Association for Computational Linguistics.

Minh Van Nguyen and Thien Huu Nguyen. 2021. Improving cross-lingual transfer for event argument extraction with language-universal sentence structures. In Proceedings of the Sixth Arabic Natural Language Processing Workshop (WANLP) at EACL 2021.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4034–4043, Marseille, France. European Language Resources Association.

Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy, and James R. Curran. 2012. Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, 194:151–175.

Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 7–14, Florence, Italy. Association for Computational Linguistics.

Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020a. AdapterHub: A framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 46–54, Online. Association for Computational Linguistics.

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020b. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online. Association for Computational Linguistics.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108, Online. Association for Computational Linguistics.

Milan Straka. 2018. UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 197–207, Brussels, Belgium. Association for Computational Linguistics.

Nasrin Taghizadeh and Heshaam Faili. 2020. Cross-lingual adaptation using universal dependencies. arXiv preprint arXiv:2003.10816.
Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Yuanhe Tian, Yan Song, Xiang Ao, Fei Xia, Xiaojun Quan, Tong Zhang, and Yonggang Wang. 2020. Joint Chinese word segmentation and part-of-speech tagging via two-way attentions of auto-analyzed knowledge. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8286–8296, Online. Association for Computational Linguistics.

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002).

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and practical BERT models for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3632–3636, Hong Kong, China. Association for Computational Linguistics.

Ahmet Üstün, Arianna Bisazza, Gosse Bouma, and Gertjan van Noord. 2020. UDapter: Language adaptation for truly Universal Dependency parsing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2302–2315, Online. Association for Computational Linguistics.

Wietse de Vries, Andreas van Cranenburgh, Arianna Bisazza, Tommaso Caselli, Gertjan van Noord, and Malvina Nissim. 2019. BERTje: A Dutch BERT model. arXiv preprint arXiv:1912.09582.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. OntoNotes release 5.0. Linguistic Data Consortium.

Haiqin Yang. 2019. BERT meets Chinese word segmentation. arXiv preprint arXiv:1909.09292.

Daniel Zeman, Joakim Nivre, Mitchell Abrams, et al. 2019. Universal Dependencies 2.5. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Xingran Zhu. 2020. Cross-lingual word sense disambiguation using mBERT embeddings with syntactic dependencies. arXiv preprint arXiv:2012.05300.
