Continual Pre-Training Mitigates Forgetting in Language and Vision
1 Introduction
Continual Learning (CL) (Lesort et al., 2020) focuses on the design of agents able to learn from
a stream of non-stationary data while preserving previously acquired knowledge. The tendency
of neural networks to catastrophically forget when confronted with new data has been the subject
of many studies (McCloskey, Cohen, 1989; French, 1999), mostly focused on the design of new
CL strategies that mitigate such problem (De Lange et al., 2021). The traditional CL scenario
currently used in the literature considers a single model tackling a sequence of tasks, one after the
other (Parisi et al., 2019). In this setting, the CL model needs to learn its features while, at the
same time, leveraging the same features to solve the supervised task. However, this scenario is
not the only conceivable one. Natural Language Processing (NLP), for example, often exploits
Transfer Learning techniques (Ruder et al., 2019) implemented through the so-called pre-training
fine-tuning setup. In this setting, the more general linguistic knowledge acquired with pre-training is
leveraged as a starting point to target specific downstream tasks. Specifically: 1) during pre-training,
language models focus on unsupervised learning tasks (e.g. predicting masked words based on the
surrounding context), and 2) during fine-tuning, the pre-trained model is further trained on supervised
learning tasks (e.g. sequence labeling). Pre-trained models are widespread also in CL (Mehta et al.,
2021; Wu et al., 2021), where they are mostly used to conveniently initialize the model weights
before learning from the non-stationary stream of data. However, the generality and robustness of the representations learned during pre-training with respect to catastrophic forgetting have not been systematically studied; this is the focus of our work.
2 Related Works
The ability of pre-trained models to solve a diverse set of tasks through fine-tuning has led to considering them as almost static models. However, it was recently shown that taking a pre-trained model and performing an additional step of pre-training on domain-specific data is beneficial for the downstream performance in that domain (e.g., Q/A in bio-medicine, as shown by Gururangan et al. (2020); Lee et al. (2020)). Pre-trained models are also helpful in CL, where leveraging a pre-trained model as the
starting point for the continual training leads to better results with respect to forgetting both in CV
(Mehta et al., 2021; Ramasesh et al., 2021) and NLP (Wu et al., 2021), especially when combined with
CL strategies. An additional pre-training step before the continual training also provides advantages
in terms of downstream performance on domain-specific tasks (Rongali et al., 2021).
The need to perform continual pre-training is present in many different applications, where updating
the pre-trained model is fundamental to incorporate new knowledge and update or erase outdated
information (Lazaridou et al., 2021; Han et al., 2021; Jang et al., 2022). While models trained directly
on a domain task may achieve similar or even better performance on downstream tasks (Gu et al.,
2021), the cost of starting from scratch each time is large and mitigating it is one of the objectives of
CL. Continual pre-training has been recently explored in the context of NLP by leveraging either
domain-specific datasets (like multi-domain research papers) (Jin et al., 2021) or news/tweets corpora
split into different temporal segments (Loureiro et al., 2022; Jang et al., 2021). The results show that
continual pre-training is beneficial to the downstream performance and that forgetting on the tasks
stream can be effectively mitigated by employing CL strategies. Moreover, continual pre-training is
also able to provide advantages in terms of temporal generalization on unseen future data (Loureiro
et al., 2022) and event temporal reasoning (Han et al., 2021). The work by Hu et al. (2021) focuses
on the performance difference between contrastive self-supervised (MoCo-v2 by Chen et al. (2020))
and supervised pre-training in CV, showing that self-supervised pre-training leads to more robust features in terms
of forgetting. A more detailed discussion of related works is presented in Appendix D. Our work
provides new evidence of the behavior of pre-trained models in the continual pre-training scenario.
We propose to evaluate the performance in terms of catastrophic forgetting on a forgetting control (FC) dataset which is not present in the CL stream. We provide results for both CV and NLP, with experiments on longer streams than most existing studies (with the exception of Qin et al. (2022)). Unlike prior works, we do not use any CL strategy, but simply employ naive fine-tuning.
3 Continual Pre-Training Scenario
The traditional CL scenario (Lomonaco et al., 2021) trains a model h_0 on a (possibly infinite) stream of experiences S = (e_1, e_2, e_3, ...), where each experience e_i brings a dataset D_i, representing the current task. The model is trained on S, one experience after the other, and needs to address the non-stationarity and drifts occurring between experiences without having access to the previously encountered data. The model h_0 is sometimes initialized with the weights of a pre-trained model. The pre-training phase is conducted on a dataset D^pr which is, however, not available during CL.
We provide a formal characterization of the continual pre-training scenario (pseudo-code in Appendix C) and highlight the differences with respect to the traditional CL setup. The continual pre-training scenario leverages a model h^pr_0 originally pre-trained on a dataset D^pr_0, which is not available anymore. The model is presented with a (possibly infinite) stream of experiences, where each experience e_i brings a dataset D^pr_i for pre-training and a downstream dataset D^ds_i for fine-tuning. For each experience e_i, the last pre-trained model h^pr_{i-1} is further pre-trained on D^pr_i. After the pre-training step, the model h^pr_i is fine-tuned on the downstream dataset D^ds_i. This fine-tuning step allows us to verify that the continual pre-training step actually contributes to learning meaningful features for the downstream task. In this way we avoid the uninteresting case where pre-training leaves features (mostly) unchanged, resulting in no catastrophic forgetting of previous knowledge but also in a lower performance on the downstream task. It is important to note that the head (last layer of the model) used during pre-training is not the one used during fine-tuning. In fact, the pre-training and downstream tasks are different and therefore require different heads. Before fine-tuning on each downstream task, the head of h^pr_i is replaced with a randomly initialized head. The model is then trained until convergence to obtain h^ds_i. During the continual pre-training step, instead, the head is not replaced.
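As a minimal illustration of this protocol (a sketch only; the helper functions pretrain_step, replace_head, finetune, and evaluate are hypothetical placeholders, and the actual pseudo-code is given in Appendix C):

```python
import copy

def continual_pretraining(h_pr, stream, fc_dataset,
                          pretrain_step, replace_head, finetune, evaluate):
    """Sketch of the continual pre-training scenario.

    h_pr: model h^pr_0, originally pre-trained on D^pr_0 (no longer available).
    stream: iterable of experiences e_i = (D^pr_i, D^ds_i).
    fc_dataset: forgetting-control dataset, never part of the stream.
    """
    fc_accuracies = []
    for d_pr_i, d_ds_i in stream:
        # 1) further pre-train the latest model on the new domain data
        #    (the pre-training head is kept across experiences)
        h_pr = pretrain_step(h_pr, d_pr_i)

        # 2) fine-tune a copy with a freshly initialized head on the downstream task
        h_ds_i = finetune(replace_head(copy.deepcopy(h_pr)), d_ds_i)

        # 3) measure forgetting: fine-tune another copy on the FC dataset and evaluate
        h_fc_i = finetune(replace_head(copy.deepcopy(h_pr)), fc_dataset)
        fc_accuracies.append(evaluate(h_fc_i, fc_dataset))
    return fc_accuracies
```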
Table 1: Combinations for the main components of the continual pre-training scenario explored in this paper. MLM = Masked Language Modeling, MIM = Masked Image Modeling, CLF = Image Classification.

Pre-training          Architecture   Data
Unsupervised (MLM)    Transformer    Words
Unsupervised (MIM)    Transformer    Images
Supervised (CLF)      Transformer    Images
Supervised (CLF)      CNN            Images

The continual pre-training scenario has different characteristics with respect to the traditional CL setup. Firstly, the continual pre-training scenario continuously updates the pre-trained model and then adapts it to specific tasks. The traditional CL setup does not consider this important distinction, using the same model both for representation learning and to solve the incoming tasks. Secondly, model evaluation in continual pre-training requires an additional training phase on the target task, while CL usually requires the model to be readily able to tackle all tasks seen so far without any additional training. Therefore, in the traditional setup the model has to focus on the new task without the opportunity to build robust, general features via pre-training protocols. As our results will show, the additional cost of a training phase in continual pre-training can be largely mitigated by a quick adaptation phase (e.g., one epoch of training). In fact, fast remembering of previous knowledge is considered one of the objectives of CL (Hadsell et al., 2020).
Ultimately, our continual pre-training scenario aims at building models which are general learners,
able to quickly adapt to unseen data while still preserving the original knowledge. We studied
continual pre-training by introducing two evaluation environments: one for NLP and one for CV.
They are designed to investigate the impact on forgetting of specific components of the scenario
(Table 1), namely the input modality, the pre-training protocol and the model architecture.
Current NLP applications are all based on the idea of leveraging large-scale pre-trained models to
then solve different tasks under fine-tuning, few- or even zero-shot learning settings. Therefore, NLP
applications based on the traditional pre-training fine-tuning setting seem to be the most natural
field for evaluating our continual pre-training scenario. For example, when dealing with a stream
of news, it is important to keep the language model updated (Lazaridou et al., 2021) so that it can
incorporate information which was not previously available. Our NLP environment employs an
unsupervised/self-supervised pre-training protocol and different Transformer architectures (Vaswani
et al., 2017). These components are standard ones in NLP and represent the state of the art of the field.
We use the popular pre-trained Transformers RoBERTa (Liu et al., 2019) and BERT (Devlin et al.,
2019), pre-trained on Wikipedia. In addition, we study a variant of RoBERTa in which the vocabulary
is dynamically expanded with the addition of new tokens. We select the most frequent tokens of
the continual pre-training dataset which were not present in the pre-trained tokenizer. Vocabulary
expansion is beneficial for downstream performance, as shown by recent works on dynamic token
expansion in both CV (Douillard et al., 2022) and NLP (Zhang et al., 2020; Han et al., 2021). Our
aim is to understand whether the addition of new tokens may result in a larger forgetting of existing
knowledge. We apply continual pre-training on a dataset of scientific abstracts from arXiv
(Geiger, 2019). The motivation behind the choice of this dataset is that scientific abstracts
represent a very specific domain for NLP both in terms of syntactic structures and domain-specific
terminology. Indeed, updating the language model before fine-tuning is particularly beneficial
under these circumstances. The downstream task is modeled as a document classification problem
aiming to associate scientific abstracts to their corresponding arXiv classes. The CL stream
includes 5 experiences, with 2 scientific domains (classes) in each experience (as in common CL
benchmarks like Split-MNIST/CIFAR-10). Please refer to Appendix A for a complete description of the splits used for pre-training and downstream fine-tuning. We test two different FC datasets to measure forgetting: sentiment analysis from tweets and Question Answering Natural Language Inference (QNLI). The idea behind these choices is that the dataset of scientific abstracts should not contain much knowledge about either sentiments or the generic facts needed for language inference. Pre-training on scientific abstracts may therefore disrupt the knowledge contained in the original language model. We additionally expand our analysis by using the 20 datasets of the SentEval benchmark (Conneau, Kiela, 2018) as FC datasets.
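For illustration, the class-incremental stream over the 10 arXiv categories (listed in Appendix A) can be built by pairing the classes two by two; this is only a sketch of the construction, with documents standing for a hypothetical list of (abstract, label) pairs:

```python
ARXIV_CLASSES = [
    "hep-ph", "astro-ph", "hep-th", "quant-ph", "cond-mat.mes-hall",
    "gr-qc", "cond-mat.mtrl-sci", "cond-mat.str-el",
    "cond-mat.stat-mech", "astro-ph.SR",
]

def make_experiences(documents, classes=ARXIV_CLASSES, classes_per_exp=2):
    """Split labeled abstracts into 5 class-incremental experiences of 2 classes each."""
    experiences = []
    for i in range(0, len(classes), classes_per_exp):
        current = set(classes[i:i + classes_per_exp])
        experiences.append([(text, label) for text, label in documents
                            if label in current])
    return experiences
```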
Table 2: Accuracy on the entire dataset of sentiment analysis with RoBERTa model. Continual pre-training has been performed sequentially over each experience of scientific abstracts. Base refers to the model pre-trained on Wikipedia, while NT refers to the model with vocabulary expansion.

RoBERTa      Accuracy                              1-epoch Accuracy
Base         93.40                                 92.40
Exp.         e1     e2     e3     e4     e5        e1     e2     e3     e4     e5
Pretr.       93.40  93.15  93.35  93.20  92.90     92.40  91.80  92.30  91.85  92.20
Pretr. NT    93.75  93.70  93.75  93.60  94.10     91.75  91.15  92.00  92.30  92.45
Table 3: Accuracy on the entire dataset of QNLI with RoBERTa model. Continual pre-training has been performed sequentially over each experience of scientific abstracts. Base refers to the model pre-trained on Wikipedia, while NT refers to the model with expanding vocabulary.

RoBERTa      Accuracy                              1-epoch Accuracy
Base         92.73                                 91.76
Exp.         e1     e2     e3     e4     e5        e1     e2     e3     e4     e5
Pretr.       91.96  91.87  91.96  91.76  92.07     90.68  91.32  90.70  90.83  90.85
Pretr. NT    92.09  91.62  91.31  91.45  91.51     91.49  91.05  91.31  89.99  90.99
We found CV to be a useful test-bed to disentangle the importance of the three components in our
continual pre-training scenario. In particular, we design the CV environment to understand to what
extent forgetting depends on the input modality (natural language against vision), on the architecture
(Transformer against CNN) and on the pre-training protocol (unsupervised/self-supervised against
supervised). To limit the large number of experiments needed to explore these three factors, in the CV
environment we do not measure the performance on the downstream task after each step of continual
pre-training. Instead, we focus on the study of forgetting on the FC dataset. In fact, the impact of
pre-training on downstream tasks similar to the ones in the pre-training stream is assessed both in the
discussion of related works (Section 2 above) and in the experiments with scientific abstracts
classification in the NLP environment (results presented below in Section 4 and Appendix B.2).
The CV environment uses iNaturalist (Van Horn et al., 2018) for continual pre-training and
CORe50 (Lomonaco, Maltoni, 2017) as FC dataset for catastrophic forgetting. We use ResNet101,
Vision Transformer (ViT) and BEiT originally pre-trained on ImageNet. The choice of ResNet and
ViT is fundamental to disentangle the role of the architecture (NLP uses only Transformers) and the
pre-training protocol (NLP uses only self-supervised pre-training). In fact, ResNet (He et al., 2016)
and ViT (Dosovitskiy et al., 2020) are pre-trained via supervised image classification. The choice
of BEiT (Bao et al., 2021), instead, allows us to understand the role of the input modality. BEiT uses
the recent self-supervised masked image modeling pre-training, which closely resembles the masked
language modeling one used in NLP. The proposed setup allows us to run experiments by changing one factor at a time among the three we studied while keeping the other two fixed. In this way, we are able
to properly compare results between the NLP and CV environments.
For NLP, we use the Huggingface’s pre-trained BERT and RoBERTa with 12 layers. The NLP
datasets, SentEval excluded, are also taken from Huggingface. For SentEval, we train our models
using the original code. We use the same pre-training protocol across all experiments, with a learning rate of 5e-5 and 30 epochs with early stopping (patience of 2 epochs). For fine-tuning, we adopt a similar setup but with a learning rate of 1e-5 and 20 epochs. For CV, we use ResNet101 and iNaturalist from Torchvision, while we retrieve the ViT and BEiT models from Huggingface, using the versions with 12 layers in order to properly compare results with the NLP experiments. We use Avalanche (Lomonaco et al., 2021) to run the continual pre-training and fine-tuning.
Table 4: Accuracy on the entire dataset of sentiment analysis (ER) and QNLI with BERT model. Continual pre-training has been performed sequentially over each experience of scientific abstracts. Base refers to the model pre-trained on Wikipedia.

BERT         Accuracy                              1-epoch Accuracy
Base ER      93.05                                 92.70
Base QNLI    90.43                                 90.43
Exp.         e1     e2     e3     e4     e5        e1     e2     e3     e4     e5
Pr. ER       92.95  92.90  92.90  92.65  92.45     92.25  92.35  91.90  92.15  91.90
Pr. QNLI     90.28  89.75  90.50  89.93  90.01     90.01  89.49  89.31  89.11  89.29
Figure 2: Accuracy on the 10 transfer tasks (left) and 10 probing tasks (right) of SentEval. Transformers are fine-tuned after 5 experiences of pre-training on the scientific abstracts. Base refers to the model pre-trained on Wikipedia. Compared models: RoBERTa (base and continually pre-trained), BERT (base and continually pre-trained), GloVe and fastText. Full results in Appendix B.1.
For fine-tuning on the FC task, we try a few combinations of learning rates (1e-5, 1e-4, 1e-3) and batch sizes (64, 128, 256) on a held-out validation set built from CORe50. We report the best performance in terms of accuracy on the test set. The experimental setup is described in detail in Appendix A.
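As an example, one continual pre-training step with masked language modeling can be run with the Huggingface Trainer roughly as follows. This is a sketch under stated assumptions: we do not claim this is the exact training loop used in the experiments, train_ds and val_ds stand for tokenized abstracts of a single experience, and only the hyperparameters mentioned above (learning rate 5e-5, 30 epochs, patience of 2) come from the text:

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

def pretrain_on_experience(model, train_ds, val_ds, output_dir):
    """One continual pre-training step (MLM) on the abstracts of one experience."""
    args = TrainingArguments(
        output_dir=output_dir,
        learning_rate=5e-5,
        num_train_epochs=30,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,           # required by the early-stopping callback
        metric_for_best_model="eval_loss",
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        eval_dataset=val_ds,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
        callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
    )
    trainer.train()
    return trainer.model
```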
4 Results
We provide strong empirical evidence supporting the hypothesis that the continual pre-training
scenario is less impacted by catastrophic forgetting than the traditional one. In particular, we found
the unsupervised pre-training objective to be the common factor for the resistance to forgetting in the
proposed environments. Our results add to the evidence discussed in Section 2 for the robustness of
unsupervised and self-supervised protocols with respect to catastrophic forgetting. Our evaluation
offers similar conclusions for the novel continual pre-training scenario.
Table 5: Fine-tuning accuracy on the entire dataset of CORe50. Pre-training has been performed sequentially over each experience of iNaturalist.

Model        Accuracy                              1-epoch Accuracy
ResNet       94.72                                 94.28
ViT Base     90.56                                 90.56
BEiT Base    90.15                                 82.51
Exp.         e1     e2     e3     e4     e5        e1     e2     e3     e4     e5
ResNet Pr.   89.88  81.29  80.82  77.78  74.35     88.40  69.93  70.43  65.91  57.60
ViT Pr.      90.29  81.36  81.47  79.71  77.42     88.48  79.33  78.60  75.01  75.72
BEiT Pr.     88.37  86.45  86.73  87.07  86.46     80.55  78.06  78.88  77.27  77.06
Table 6: Linear evaluation accuracy on the sentiment analysis (ER) and QNLI datasets. Pre-training has been performed sequentially over each experience of scientific abstracts.

Model          ER                                   QNLI
RoBERTa Base   60.05                                69.43
BERT Base      59.85                                77.87
Exp.           e1     e2     e3     e4     e5       e1     e2     e3     e4     e5
RoBERTa Pr.    59.15  59.85  57.00  54.10  58.05    68.88  68.97  67.16  68.08  67.55
BERT Pr.       60.15  59.15  59.35  58.20  56.70    75.62  74.15  72.93  73.37  73.44
The NLP results (Tables 2, 3 and 4) show that, after each experience of continual pre-training, the fine-tuned models reach an accuracy comparable to the one originally obtained by the model before continual pre-training. This happens both for sentiment analysis and QNLI. Moreover, a single epoch of
gradient descent is sufficient to retain most of the original performance, showing the quick adaptation
capabilities of the pre-trained models. Notably, the additional pre-training steps on domain-specific texts, along with the expansion of the RoBERTa vocabulary, do not worsen the effects of catastrophic
forgetting. We conducted a broader empirical assessment on a diverse set of NLP tasks by using the
SentEval benchmark. Figure 2 shows the downstream performance of BERT and RoBERTa after
the entire continual pre-training stream. GloVe and fastText results are used as baselines and are taken from Conneau, Kiela (2018), except on SNLI and on all probing tasks, for which they were not available; we therefore computed these results using the original code. The results confirm our findings: BERT and RoBERTa do not show clear signs of forgetting, neither with
respect to their original pre-trained version, nor with respect to the baselines.
The negligible role of the architecture. The type of Transformer used in the experiments does
not appear to be a fundamental component: we experimented with larger vision models with 24 layers instead of 12 (Appendix B.5) without observing any significant difference. The
difference between convolutional networks like ResNet and attention-based transformers does not
seem to have an impact, either. While ResNet sometimes exhibits worse performance than ViT, there
is no clear evidence that this kind of model is more susceptible to forgetting.
Table 7: Linear evaluation accuracy on the entire dataset of CORe50. Pre-training has been performed sequentially over each experience of iNaturalist.

Model        Accuracy
ResNet       82.50
ViT Base     91.90
BEiT Base    52.75
Exp.         e1     e2     e3     e4     e5
ResNet Pr.   61.99  31.02  34.71  26.41  22.01
ViT Pr.      79.38  55.20  57.98  60.49  48.25
BEiT Pr.     52.34  51.71  51.31  53.12  52.51

Feature Space Analysis: supervised pre-training induces larger drifts. We verified the coherence of our findings by studying the feature space of the models. We leveraged linear evaluation for a quantitative analysis and Centered Kernel Alignment (CKA) (Kornblith et al., 2019) for a qualitative analysis. Linear evaluation (i.e., training only the linear classifier while keeping the rest of the model fixed) is a powerful tool to understand the impact of the learned model representations in terms of catastrophic forgetting (Davari et al., 2022). A model which exhibits forgetting during linear evaluation is likely to possess features which are not representative of the task. Conversely, a good linear evaluation performance points to a set of strong features, since it means that the task is linearly separable in that feature space. We adopted this approach for our continual pre-training scenario.
In the NLP environment (Table 6), the features built by the models during continual pre-training
are robust and do not cause a large deviation of performance with respect to the original pre-trained
model. The lower training accuracy with respect to fine-tuning is expected, since we train only a
subset of all parameters. In the CV environment (Table 7), both ResNet and ViT suffer from forgetting,
while BEiT does not (although it reaches a lower absolute accuracy).
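A minimal sketch of the linear evaluation protocol in PyTorch (the number of epochs and the learning rate here are illustrative, not the values used in our experiments; backbone is assumed to map inputs to fixed-size feature vectors):

```python
import torch
from torch import nn

def linear_evaluation(backbone, feature_dim, num_classes, loader, epochs=10, lr=1e-3):
    """Train only a linear classifier on top of frozen features (linear probing)."""
    for p in backbone.parameters():
        p.requires_grad = False                 # keep the pre-trained features fixed
    backbone.eval()

    classifier = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.Adam(classifier.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                feats = backbone(x)             # frozen representation
            loss = criterion(classifier(feats), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```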
Following Hu et al. (2021), we used CKA with a linear kernel (Kornblith et al., 2019) to compute layer similarities between the original pre-trained model and its continually pre-trained versions. From Figure 3, we can see that all models show large correlations across the bottom layers (features are not drifting much). Instead, ViT and ResNet show lower correlation values for the final layers than BEiT. This corresponds to a larger drift in those layers (full set of results in Appendix B.4). These results are compatible with what was shown by Madaan et al. (2021) for unsupervised CL, namely that unsupervised models in the traditional CL scenario have larger correlations in the lower layers than supervised ones. Our results further extend this conclusion to continual pre-training, supporting the idea that pre-training acts mainly on the upper layers of the networks (the ones containing more specific domain knowledge) and that heavy changes in these layers are enough to cause a deterioration of performance on the FC dataset, resulting in forgetting.
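For reference, the full-batch version of linear CKA from Kornblith et al. (2019) can be written as below. In the experiments we rely on the unbiased minibatch estimator provided by the library referenced in Appendix A; this sketch only shows the quantity being estimated:

```python
import torch

def linear_cka(X, Y):
    """Linear CKA between activations X and Y of shape (n_samples, n_features).

    CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F), with centered columns.
    """
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    cross = torch.linalg.norm(Y.T @ X) ** 2      # squared Frobenius norm
    return cross / (torch.linalg.norm(X.T @ X) * torch.linalg.norm(Y.T @ Y))
```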
Figure 3: CKA for RoBERTa, BERT, BEiT, ViT and ResNet. The pre-trained model h^ds_5 obtained after the last experience (x axis) is compared with the original pre-trained model h^ds_0 (y axis). Each row is the similarity of a layer with respect to each layer of the other model. Panels: (a) RoBERTa QNLI, (b) BERT Tweets, (c) BEiT, (d) ViT, (e) ResNet.
Although each run required a modest amount of compute (about one day of continual pre-training on a single A100), the number of experiments per environment was large.
We preferred to thoroughly evaluate few environments rather than trying to address a wide range of
different datasets without being able to properly explore them (Table 1). We are well aware that a
comprehensive exploration of continual pre-training in both NLP and CV domains is an ambitious
objective, possible only in the context of a broad research program. However, we are confident of the
fact that this study has shed some light on the subject and clearly pointed towards promising research
directions.
6 Conclusion
Continual pre-training represents a novel CL scenario with promising opportunities and unexpected
characteristics. In this work, we formally defined the continual pre-training scenario and we showed
the effect that pre-training has on catastrophic forgetting, for both NLP and CV environments and
with different architectures. Our results show that forgetting can be effectively mitigated by means of
self-supervised pretraining, even with a single epoch of fine-tuning on the FC dataset. Ultimately,
this work opens up the possibility to continually train large pre-trained models in a scalable and
efficient way. Much like Deep Learning has advanced by disentangling the representation learning
objective from the solution to specific tasks, continual pre-training aims to focus on the incremental
development of robust features which are kept updated over time. This is a fundamental property
towards the achievement of agents that can truly learn continually in the real world.
References
Bao Hangbo, Dong Li, Piao Songhao, Wei Furu. BEiT: BERT Pre-Training of Image Transformers
// International Conference on Learning Representations. 2021.
Chen Xinlei, Fan Haoqi, Girshick Ross, He Kaiming. Improved Baselines with Momentum Con-
trastive Learning // arXiv:2003.04297 [cs]. 2020.
Conneau Alexis, Kiela Douwe. SentEval: An Evaluation Toolkit for Universal Sentence Repre-
sentations // Proceedings of the Eleventh International Conference on Language Resources and
Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA), 2018.
Davari MohammadReza, Asadi Nader, Mudur Sudhir, Aljundi Rahaf, Belilovsky Eugene. Probing
Representation Forgetting in Supervised and Unsupervised Continual Learning // Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
De Lange Matthias, Aljundi Rahaf, Masana Marc, Parisot Sarah, Jia Xu, Leonardis Ales, Slabaugh
Gregory, Tuytelaars Tinne. A Continual Learning Survey: Defying Forgetting in Classification
Tasks // IEEE Transactions on Pattern Analysis and Machine Intelligence. 2021.
Devlin Jacob, Chang Ming-Wei, Lee Kenton, Toutanova Kristina. BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding // Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for
Computational Linguistics, 2019. 4171–4186.
Dosovitskiy Alexey, Beyer Lucas, Kolesnikov Alexander, Weissenborn Dirk, Zhai Xiaohua, Unterthiner
Thomas, Dehghani Mostafa, Minderer Matthias, Heigold Georg, Gelly Sylvain, Uszkoreit Jakob,
Houlsby Neil. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale //
International Conference on Learning Representations. 2020.
Douillard Arthur, Ramé Alexandre, Couairon Guillaume, Cord Matthieu. DyTox: Transformers for
Continual Learning with DYnamic TOken eXpansion // IEEE/CVF Conference on Computer
Vision and Pattern Recognition. 2022.
Geiger R. Stuart. ArXiV Archive: A Tidy and Complete Archive of Metadata for Papers on Arxiv.Org,
1993-2019. 2019.
Gu Yu, Tinn Robert, Cheng Hao, Lucas Michael, Usuyama Naoto, Liu Xiaodong, Naumann Tristan,
Gao Jianfeng, Poon Hoifung. Domain-Specific Language Model Pretraining for Biomedical
Natural Language Processing // ACM Transactions on Computing for Healthcare. 2021. 3, 1.
2:1–2:23.
Gururangan Suchin, Marasović Ana, Swayamdipta Swabha, Lo Kyle, Beltagy Iz, Downey Doug, Smith
Noah A. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks // Proceedings of
the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association
for Computational Linguistics, 2020. 8342–8360.
Hadsell Raia, Rao Dushyant, Rusu Andrei A, Pascanu Razvan. Embracing Change: Continual
Learning in Deep Neural Networks // Trends in Cognitive Sciences. 2020.
Han Rujun, Ren Xiang, Peng Nanyun. ECONET: Effective Continual Pretraining of Language Models
for Event Temporal Reasoning // Proceedings of the 2021 Conference on Empirical Methods
in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for
Computational Linguistics, 2021. 5367–5380.
He Kaiming, Zhang Xiangyu, Ren Shaoqing, Sun Jian. Deep Residual Learning for Image Recognition
// 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. 770–778.
Hu Dapeng, Yan Shipeng, Lu Qizhengqiu, Hong Lanqing, Hu Hailin, Zhang Yifan, Li Zhenguo, Wang
Xinchao, Feng Jiashi. How Well Does Self-Supervised Pre-Training Perform with Streaming Data?
// International Conference on Learning Representations. 2021.
Jang Joel, Ye Seonghyeon, Lee Changho, Yang Sohee, Shin Joongbo, Han Janghoon, Kim Gyeonghun,
Seo Minjoon. TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving
Language Models // arXiv:2204.14211 [cs]. 2022.
Jang Joel, Ye Seonghyeon, Yang Sohee, Shin Joongbo, Han Janghoon, Kim Gyeonghun, Choi
Stanley Jungkyu, Seo Minjoon. Towards Continual Knowledge Learning of Language Models //
International Conference on Learning Representations. 2021.
Jin Xisen, Zhang Dejiao, Zhu Henghui, Xiao Wei, Li Shang-Wen, Wei Xiaokai, Arnold Andrew, Ren
Xiang. Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora //
arXiv:2110.08534 [cs]. 2021.
Kornblith Simon, Norouzi Mohammad, Lee Honglak, Hinton Geoffrey. Similarity of Neural Network
Representations Revisited // Proceedings of the 36th International Conference on Machine Learning.
2019. 3519–3529.
Lazaridou Angeliki, Kuncoro Adhiguna, Gribovskaya Elena, Agrawal Devang, Liska Adam, Terzi
Tayfun, Gimenez Mai, d’Autume Cyprien de Masson, Kočiský Tomáš, Ruder Sebastian, Yogatama
Dani, Cao Kris, Young Susannah, Blunsom Phil. Mind the Gap: Assessing Temporal Generalization
in Neural Language Models // Thirty-Fifth Conference on Neural Information Processing Systems.
2021.
Lee Jinhyuk, Yoon Wonjin, Kim Sungdong, Kim Donghyeon, Kim Sunkyu, So Chan Ho, Kang Jaewoo.
BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining
// Bioinformatics. 2020. 36, 4. 1234–1240.
Lesort Timothée, Lomonaco Vincenzo, Stoian Andrei, Maltoni Davide, Filliat David, Díaz-Rodríguez
Natalia. Continual Learning for Robotics: Definition, Framework, Learning Strategies, Opportuni-
ties and Challenges // Information Fusion. 2020. 58. 52–68.
Liu Yinhan, Ott Myle, Goyal Naman, Du Jingfei, Joshi Mandar, Chen Danqi, Levy Omer, Lewis
Mike, Zettlemoyer Luke, Stoyanov Veselin. RoBERTa: A Robustly Optimized BERT Pretraining
Approach // arXiv:1907.11692 [cs]. 2019.
Lomonaco Vincenzo, Maltoni Davide. CORe50: A New Dataset and Benchmark for Continuous
Object Recognition // Proceedings of the 1st Annual Conference on Robot Learning. 78. 2017.
17–26. (Proceedings of Machine Learning Research).
Lomonaco Vincenzo, Pellegrini Lorenzo, Cossu Andrea, Carta Antonio, Graffieti Gabriele, Hayes
Tyler L., De Lange Matthias, Masana Marc, Pomponi Jary, van de Ven Gido, Mundt Martin, She Qi,
Cooper Keiland, Forest Jeremy, Belouadah Eden, Calderara Simone, Parisi German I., Cuzzolin
Fabio, Tolias Andreas, Scardapane Simone, Antiga Luca, Ahmad Subutai, Popescu Adrian, Kanan
Christopher, van de Weijer Joost, Tuytelaars Tinne, Bacciu Davide, Maltoni Davide. Avalanche:
An End-to-End Library for Continual Learning // CLVision Workshop at CVPR. 2021.
Lopez-Paz David, Ranzato Marc’Aurelio. Gradient Episodic Memory for Continual Learning // NIPS.
2017.
Loureiro Daniel, Barbieri Francesco, Neves Leonardo, Anke Luis Espinosa, Camacho-Collados Jose.
TimeLMs: Diachronic Language Models from Twitter // arXiv:2202.03829 [cs]. 2022.
Madaan Divyam, Yoon Jaehong, Li Yuanchun, Liu Yunxin, Hwang Sung Ju. Representational Continu-
ity for Unsupervised Continual Learning // International Conference on Learning Representations.
2021.
McCloskey Michael, Cohen Neal J. Catastrophic Interference in Connectionist Networks: The
Sequential Learning Problem // Psychology of Learning and Motivation. 24. 1989. 109–165.
Mehta Nikhil, Liang Kevin J, Carin Lawrence. Bayesian Nonparametric Weight Factorization for
Continual Learning // arXiv. 2020. 1–17.
Mehta Sanket Vaibhav, Patil Darshan, Chandar Sarath, Strubell Emma. An Empirical Investigation
of the Role of Pre-training in Lifelong Learning // arXiv:2112.09153 [cs]. 2021.
Merity Stephen, Xiong Caiming, Bradbury James, Socher Richard. Pointer Sentinel Mixture Models
// arXiv:1609.07843 [cs]. 2016.
Nguyen Thao, Raghu Maithra, Kornblith Simon. Do Wide and Deep Networks Learn the Same
Things? Uncovering How Neural Network Representations Vary with Width and Depth //
International Conference on Learning Representations. 2020.
Parisi German I, Kemker Ronald, Part Jose L, Kanan Christopher, Wermter Stefan. Continual
Lifelong Learning with Neural Networks: A Review // Neural Networks. 2019. 113. 54–71.
Qin Yujia, Zhang Jiajie, Lin Yankai, Liu Zhiyuan, Li Peng, Sun Maosong, Zhou Jie. ELLE: Efficient
Lifelong Pre-training for Emerging Data // Findings of ACL. 2022.
Ramasesh Vinay Venkatesh, Lewkowycz Aitor, Dyer Ethan. Effect of Scale on Catastrophic Forgetting
in Neural Networks // International Conference on Learning Representations. 2021.
Rongali Subendhu, Jagannatha Abhyuday, Rawat Bhanu Pratap Singh, Yu Hong. Continual Domain-
Tuning for Pretrained Language Models // arXiv:2004.02288 [cs]. 2021.
Ruder Sebastian, Peters Matthew E., Swayamdipta Swabha, Wolf Thomas. Transfer Learning in
Natural Language Processing // Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Tutorials. Minneapolis, Minnesota:
Association for Computational Linguistics, 2019. 15–18.
Van Horn Grant, Mac Aodha Oisin, Song Yang, Cui Yin, Sun Chen, Shepard Alex, Adam Hartwig,
Perona Pietro, Belongie Serge. The INaturalist Species Classification and Detection Dataset //
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. 8769–
8778.
Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N, Kaiser
Łukasz, Polosukhin Illia. Attention Is All You Need // Advances in Neural Information Processing
Systems 30. 2017. 5998–6008.
Wu Tongtong, Caccia Massimo, Li Zhuang, Li Yuan-Fang, Qi Guilin, Haffari Gholamreza. Pretrained
Language Model in Continual Learning: A Comparative Study // International Conference on
Learning Representations. 2021.
Zhang Rong, Gangi Reddy Revanth, Sultan Md Arafat, Castelli Vittorio, Ferritto Anthony, Florian
Radu, Sarioglu Kayi Efsun, Roukos Salim, Sil Avi, Ward Todd. Multi-Stage Pre-training for Low-
Resource Domain Adaptation // Proceedings of the 2020 Conference on Empirical Methods in
Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, 2020.
5461–5468.
A Experimental Setup

NLP The continual pre-training dataset of scientific abstracts is taken from GitHub2. We
selected 10 ArXiv classes to build our continual pre-training stream, namely ‘hep-ph’, ‘astro-ph’,
‘hep-th’, ‘quant-ph’, ‘cond-mat.mes-hall’, ‘gr-qc’, ‘cond-mat.mtrl-sci’, ‘cond-mat.str-el’, ‘cond-
mat.stat-mech’ and ‘astro-ph.SR’. For both pre-training and downstream fine-tuning, we selected
10,000 abstracts for each of the 10 classes for the training set and 1,000 for the test set. Hence, an abstract present in one of the training/test sets of continual pre-training or downstream fine-tuning is not present in any of the other partitions. We chose similar abstract categories since being able to distinguish
very different kinds of abstracts may greatly simplify the problem (e.g., one term may be enough to
classify the entire abstract). We will publicly release our version of the scientific abstract dataset used
in the experiments. The dataset can be easily loaded via Huggingface.
In order to select new tokens for the expansion of the RoBERTa vocabulary at each experience of continual pre-training, we trained from scratch a tokenizer on the WikiText dataset (Merity et al., 2016). This tokenizer quickly approximates the tokens present in Wikipedia. We also trained a tokenizer on our scientific abstracts dataset and ranked the tokens which occurred in the latter but not in the former, that is, the domain tokens specific to the scientific abstracts dataset. We selected 426 new tokens for the joint training experiments (Appendix B.2) and 39/42/28/30/10 new tokens for each of the 5 experiences of continual pre-training.
We added the tokens to the tokenizer such that new tokens have precedence over already existing tokens during the tokenization process. Within the new tokens, we sorted them by decreasing token length, and precedence follows the order of addition (first in, first out). The list of new tokens is embedded in the released code. We also ran a few experiments (not reported here) adding, with the same procedure, sub-word tokens (BPE encoding) instead of word tokens. We did not find significant
differences in the results, which do not seem to depend on which specific new tokens are selected, as
long as they provide domain knowledge about the task.
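A sketch of the token selection and vocabulary expansion (the corpora iterators wikitext_corpus and abstracts_corpus, and the vocabulary size, are placeholders; ranking by frequency is simplified to a plain set difference here):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Train two throw-away tokenizers with the same settings: one approximating the
# original pre-training distribution (WikiText) and one on the current abstracts.
wiki_tok = base_tok.train_new_from_iterator(wikitext_corpus, vocab_size=30_000)
domain_tok = base_tok.train_new_from_iterator(abstracts_corpus, vocab_size=30_000)

# Domain tokens: in the abstracts vocabulary but neither in the WikiText one
# nor already known to the pre-trained tokenizer.
new_tokens = [t for t in domain_tok.get_vocab()
              if t not in wiki_tok.get_vocab() and t not in base_tok.get_vocab()]

# Add the selected tokens (longest first) and resize the embedding matrix.
base_tok.add_tokens(sorted(new_tokens, key=len, reverse=True))
model.resize_token_embeddings(len(base_tok))
```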
2 R. Stuart Geiger (2020), ArXiV Archive: A Tidy and Complete Archive of Metadata for Papers on arxiv.org,
Zenodo: http://doi.org/10.5281/zenodo.1463242
The FC dataset QNLI is available from Huggingface as part of the GLUE benchmark https://
huggingface.co/datasets/glue. The sentiment analysis from tweets dataset is also taken
from Huggingface at https://huggingface.co/datasets/emotion. The SentEval benchmark is taken from the official codebase at https://github.com/facebookresearch/SentEval.
During linear evaluation, we removed the feedforward layer right before the classifier. We observed that keeping it frozen yielded a very low training performance. On the other hand, fine-tuning it together with the linear classifier did not show the issue but resulted in a non-linear fine-tuning procedure, making it difficult to compare results against the CV setup. Therefore, linear evaluation is performed by taking the representation built for the special [CLS] token by the last hidden layer of the transformer and decoding it with a trained linear classifier.
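Concretely, a sketch of how the probed representation is obtained (the classifier here is shown untrained, just to illustrate the probing setup; for RoBERTa the first token is <s>, the equivalent of BERT's [CLS]):

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base").eval()
probe = nn.Linear(encoder.config.hidden_size, 2)      # linear classifier to be trained

texts = ["I loved this movie.", "The plot made no sense."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state        # (batch, seq_len, hidden)
cls_repr = hidden[:, 0]                                # representation of the first token
logits = probe(cls_repr)
```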
Computer Vision We adopted the Masked Image Modeling task for self-supervised pre-training
with BEiT. Following the original BEiT paper, we leveraged the DALL-E encoder, which is
kept fixed during continual pre-training. A simple example of masked image modeling can
be found at https://github.com/NielsRogge/Transformers-Tutorials/blob/master/
BEiT/Understanding_BeitForMaskedImageModeling.ipynb. Experiments which continually
pre-train the encoder as well may constitute interesting future work.
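A hedged sketch of a masked image modeling forward pass with BEiT (the checkpoint name and masking ratio are assumptions, the image is a random tensor, and the frozen DALL-E encoder that produces the target visual tokens is omitted):

```python
import torch
from transformers import BeitForMaskedImageModeling

# Assumed checkpoint: the self-supervised BEiT-base model without a classification head.
model = BeitForMaskedImageModeling.from_pretrained("microsoft/beit-base-patch16-224-pt22k")

pixel_values = torch.rand(1, 3, 224, 224)                                 # one dummy image
num_patches = (model.config.image_size // model.config.patch_size) ** 2   # 14 * 14 = 196

# Mask ~40% of the patches; the model predicts the visual token of each masked patch.
bool_masked_pos = torch.zeros(1, num_patches, dtype=torch.bool)
bool_masked_pos[:, torch.randperm(num_patches)[:78]] = True

outputs = model(pixel_values=pixel_values, bool_masked_pos=bool_masked_pos)
logits = outputs.logits       # scores over the visual-token vocabulary at each position
```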
Following the original Pytorch code at https://github.com/pytorch/vision/blob/main/
references/classification/presets.py, for continual pre-training and fine-tuning on the FC
dataset with ResNet we used a chain of augmentations: RandomResizedCrop with bilinear interpo-
lation, RandomHorizontalFlip and normalization of mean and standard deviation. On the test sets,
we resized the image to 256x256, applied center crop and normalization. ViT uses the same setup
without normalization. BEiT applies the ViT setup on the FC dataset only.
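The corresponding torchvision pipelines look roughly as follows (the crop size of 224 and the ImageNet normalization statistics are standard choices assumed here, not values stated above):

```python
from torchvision import transforms
from torchvision.transforms import InterpolationMode

# Standard ImageNet statistics, commonly used with models pre-trained on ImageNet.
mean, std = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, interpolation=InterpolationMode.BILINEAR),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean, std),
])

test_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean, std),
])
```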
For all CKA experiments, we used the Python library from https://github.com/AntixK/
PyTorch-Model-Compare, which provides the unbiased minibatch estimator of the CKA.
B Additional Results
B.1 SentEval Results
Table 9 shows the complete set of results for the SentEval benchmark. We compare the performance
of continual pre-training after 5 experiences on scientific abstracts against two baselines
(GloVe and fastText) and the original pre-trained model. For RoBERTa, we also provide the results in
case of vocabulary expansion. We used one hidden layer of 50 units for probing tasks and logistic
regression for the transfer tasks.
Table 10 shows the accuracy on the entire dataset of scientific abstracts classification after
pre-training on the entire dataset of scientific abstracts (held-out sets). Therefore, this setup
uses only one step of pre-training to assess its effectiveness on the performance on the downstream
task. We show that pre-training is beneficial to the final performance with respect to the original
model pre-trained on Wikipedia.
Similarly, Table 11 shows the impact of 1 and 5 steps of continual pre-training on the dataset of
scientific abstracts classification. Each fine-tuning step is performed on the corresponding
split of the scientific abstract dataset. Again, we see a moderate improvement in the final
performance.
It is important to note that the improvement, although small, is nonetheless present even if each
experience of continual pre-training contains a smaller set of samples with respect to the pre-training
dataset typically used in the NLP literature, like Wikipedia. For each experience, we have 20,000
samples. This aspect is particularly important for continual learning, where the model is not updated
one-shot with a large dataset, but in multiple steps with few samples.
Table 12 shows that, in a traditional CL setup, continuously fine-tuning a single model on the scientific abstracts classification tasks leads to large forgetting on the same scientific abstracts classification task (held-out dataset), unless CL strategies are employed.
Table 9: Accuracy on 10 transfer and 10 probing tasks from SentEval. For comparison, we report
the performance of the pre-trained models at the end of pre-training on the last experience (e5) of
scientific abstracts dataset.
Task            GloVe   fastText   RoBERTa Base   RoBERTa Pretr.   RoBERTa Pretr. NT   BERT Base   BERT Pretr.
CR 78.70 80.20 88.34 85.38 86.20 86.01 83.66
MR 77.40 78.20 84.35 80.95 80.65 80.46 76.37
MPQA 87.70 88.00 86.12 82.34 82.04 87.83 84.22
SUBJ 91.20 91.80 95.28 93.34 93.36 94.79 93.19
SST2 80.30 82.30 89.46 85.67 85.17 84.51 80.62
SST5 44.70 45.10 51.27 46.88 46.65 45.48 43.21
TREC 83.00 83.40 93.20 90.20 90.40 92.80 88.40
MRPC 72.70 74.40 74.20 74.78 74.67 75.07 73.39
SNLI 65.97 68.80 72.18 70.26 70.69 70.59 68.88
SICK-E 78.50 78.90 80.29 79.78 79.16 79.74 78.63
Length 71.76 64.20 87.03 87.33 86.17 86.11 87.58
Word Content 80.61 82.10 59.68 60.44 62.63 59.28 62.60
Depth 36.50 36.38 43.93 44.67 44.21 41.41 43.80
Top Constituents 66.09 66.34 75.23 76.02 75.91 75.46 77.72
Bigram Shift 49.90 49.67 90.84 85.89 85.75 88.96 85.96
Tense 85.34 87.18 88.56 88.14 87.88 89.06 88.80
Subj Number 79.26 80.78 86.89 87.81 87.44 85.53 86.44
Obj Number 77.66 80.29 84.49 84.46 84.80 83.44 83.42
Odd Man Out 53.15 49.96 68.65 62.45 61.67 65.86 60.99
Coordination Inversion 54.13 52.23 73.87 70.13 70.33 72.36 69.65
We measure the popular
ACC metric (Lopez-Paz, Ranzato, 2017) which computes the accuracy on all tasks after training
on the last task. The lower its value, the larger the forgetting effect. This shows that, although in
the traditional CL scenario we always have a model ready to tackle all the previous tasks without
retraining, the loss in terms of performance (accuracy in this case) is very large with respect to the
continual pre-training scenario.
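Concretely, with R_{i,j} denoting the accuracy on task j after training on task i and T the number of tasks, ACC = (1/T) \sum_{j=1}^{T} R_{T,j}, i.e., the average accuracy over all tasks measured after training on the last one.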
CKA is computed incrementally in minibatches, following Nguyen et al. (2020). We provide the full
set of CKA plots in Figure 4 for the NLP environment and in Figure 5 for the CV environment. We
include the CKA against the original pre-trained model and its continually pre-trained version after
each experience of continual pre-training. The upper-right corner of each image represents the upper layers of the models; the correlation there is very low only for ViT and ResNet, while it stays large for BEiT, RoBERTa and BERT on all FC datasets.
We report in Table 13 and Table 14 the performance obtained by larger Vision Transformer models with 24 transformer layers, for fine-tuning and linear evaluation respectively. The results are in line with our main findings with smaller models, except for ViT, which shows a smaller degree of forgetting.
Table 10: Accuracy on the entire downstream dataset of scientific abstracts classification
after joint training on the entire pre-training dataset of scientific abstracts. The scratch term
indicates that the model is randomly initialized at the beginning and not pre-trained on Wikipedia.
Model Accuracy 1-epoch Accuracy
RoBERTa Base 82.25 79.27
BERT Base 82.57 79.37
RoBERTa NT 81.84 77.88
RoBERTa Pr. 82.26 81.01
BERT Pr. 83.49 82.62
RoBERTa Pr. NT 83.51 81.94
RoBERTa scratch 80.48 75.79
RoBERTa scratch Pr. 82.50 81.50
Table 11: Accuracy on the downstream dataset of scientific abstracts classification after
continual pre-training. The split used for downstream classification and pre-training contains different
documents. The digit next to the model indicates the last experience the model has been trained on
(e.g., 5 means that the model has been pre-trained on all 5 experiences sequentially).
Model Accuracy 1-epoch Accuracy
RoBERTa Pr. 1 82.59 79.88
BERT Pr. 1 82.64 80.91
RoBERTa Pr. NT 1 82.37 80.58
RoBERTa Pr. 5 83.24 81.19
BERT Pr. 5 83.08 81.84
RoBERTa Pr. NT 5 83.06 81.22
However, the training curves for the large ViT show an unstable trend: the best accuracy is usually reached after one epoch, after which the value quickly degrades to a lower performance. We believe that future work investigating the impact of model depth on our results may shed light on this phenomenon.
Table 12: ACC on scientific abstracts classification for 5 experiences with RoBERTa. Pre-
trained only on the first experience of scientific abstracts dataset. Replay memory size is 500.
Joint training from Table 10. ACC around 20.00 means complete forgetting (only the last task is
correctly classified).
Model Joint Naive Replay DSLDA
RoBERTa Base 80.00 19.95 52.94 69.22
RoBERTa Pr. 82.26 19.90 50.78 72.03
RoBERTa Pr. NT 83.51 19.90 51.37 73.32
Table 13: Fine-tuning accuracy on the entire dataset of CORe50 with large transformers. Pre-training has been performed sequentially over each experience of iNaturalist.

Model        Accuracy                              1-epoch Accuracy
ViT Base     92.95                                 90.77
BEiT Base    90.41                                 89.41
Exp.         e1     e2     e3     e4     e5        e1     e2     e3     e4     e5
ViT Pr.      91.50  89.37  89.93  89.12  87.72     91.39  89.22  89.30  89.12  87.70
BEiT Pr.     89.78  89.90  89.18  88.50  90.09     86.81  85.94  87.50  88.50  88.50
on the final performance, rather than on the kind of pre-training protocol and on its impact on a
separate FC task. Moreover, the downstream tasks used to measure performance are strongly related
to the pre-training stream, making it difficult to understand the impact of each pre-training step on
catastrophic forgetting. The results they provided show that the amount of forgetting does not depend
on the specific CL strategy used. In line with our findings, a naive fine-tuning approach is robust and
does not show a catastrophic loss in performance.
The Continual Knowledge Learning (CKL) framework (Jang et al., 2021) shares some similarities
with the continual pre-training scenario adopted in our work. The CKL considers a pre-trained
model updated continuously and, throughout its training, focuses on different objectives: recognizing
invariant knowledge which does not change over time, incorporating new knowledge not present
before and updating knowledge which is outdated. The proposed benchmark is entirely based on
NLP: it consists of a continual pre-training dataset of news, a "time-invariant knowledge" dataset hand-crafted from a relations dataset, and "updated knowledge" and "new knowledge" datasets built
from scratch through Amazon Mechanical Turk and validated by a set of external experts. The
empirical evaluation provided in the paper is based on a new metric, called FUAR, which condenses
the performance of the pre-trained model in these three tasks into a single number. The experiments
are conducted on the T5 transformer endowed with existing CL strategies. The authors found that parameter expansion methods are amongst the best performing ones, although they require a
larger number of parameters with respect to static alternatives.
The study of Hu et al. (2021) focused on the impact of self-supervised pre-training on streaming data
Table 14: Linear evaluation accuracy on the entire dataset of CORe50 with large Transformers.
Pre-training has been performed sequentially over each experience of iNaturalist.
Model Accuracy
ViT Base 82.39
BEiT Base 52.04
Exp. e1 e2 e3 e4 e5
ViT Pr. 85.62 73.75 73.73 75.89 68.27
BEiT Pr. 56.67 55.62 56.12 55.74 56.76
subjected to different types of drifts (some of them ascribable to existing CL scenarios like domain-
incremental, data-incremental, class-incremental). The authors adopted the MoCo-v2 self-supervised
technique for pre-training and a vast set of downstream tasks to measure forgetting, all belonging to
CV. Importantly for our work, the authors discussed the problem of catastrophic forgetting. However,
unlike our work, the evaluation is performed on the same data used for pre-training instead of relying on a separate downstream task. In our opinion, reporting results on an FC dataset better fits
the continual pre-training scenario and delivers a clearer picture of the effect of continual pre-training.
Nonetheless, the results obtained by Hu et al. (2021) are compatible with our findings, showing
that self-supervised pre-training reduces feature drift and mitigates forgetting. The CKA analysis
provided by the authors, similar to ours, supports the experimental results.
Figure 4: CKA for RoBERTa and BERT. Pre-trained models after each experience are compared with the original pre-trained model. Panels (a)-(j): RoBERTa on QNLI and on Tweets after experiences 1-5; panels (k)-(t): BERT on QNLI and on Tweets after experiences 1-5.
Figure 5: CKA for BEiT, ViT and ResNet (CV environment). The model obtained after each experience of continual pre-training (beit0-beit4, vit0-vit4, resnet0-resnet4) is compared layer-by-layer with the original pre-trained model.