
Continual Pre-Training Mitigates Forgetting in Language and Vision

Andrea Cossu∗                        Tinne Tuytelaars
Scuola Normale Superiore             PSI, ESAT, KU Leuven
andrea.cossu@sns.it                  tinne.tuytelaars@kuleuven.be

Antonio Carta    Lucia Passaro    Vincenzo Lomonaco    Davide Bacciu
Computer Science Department, University of Pisa
{antonio.carta, lucia.passaro, vincenzo.lomonaco, davide.bacciu}@unipi.it

arXiv:2205.09357v1 [cs.LG] 19 May 2022
Abstract

Pre-trained models are nowadays a fundamental component of machine learning
research. In continual learning, they are commonly used to initialize the model
before training on the stream of non-stationary data. However, pre-training is rarely
applied during continual learning. We formalize and investigate the characteristics
of the continual pre-training scenario in both language and vision environments,
where a model is continually pre-trained on a stream of incoming data and only
later fine-tuned to different downstream tasks. We show that continually pre-trained
models are robust against catastrophic forgetting and we provide strong empirical
evidence supporting the fact that self-supervised pre-training is more effective in re-
taining previous knowledge than supervised protocols. Code is provided at https:
//github.com/AndreaCossu/continual-pretraining-nlp-vision.

1 Introduction
Continual Learning (CL) (Lesort et al., 2020) focuses on the design of agents able to learn from
a stream of non-stationary data while preserving previously acquired knowledge. The tendency
of neural networks to catastrophically forget when confronted with new data has been the subject
of many studies (McCloskey, Cohen, 1989; French, 1999), mostly focused on the design of new
CL strategies that mitigate this problem (De Lange et al., 2021). The traditional CL scenario
currently used in the literature considers a single model tackling a sequence of tasks, one after the
other (Parisi et al., 2019). In this setting, the CL model needs to learn its features while, at the
same time, leveraging the same features to solve the supervised task. However, this scenario is
not the only conceivable one. Natural Language Processing (NLP), for example, often exploits
Transfer Learning techniques (Ruder et al., 2019) implemented through the so-called pre-training
fine-tuning setup. In this setting, the more general linguistic knowledge acquired with pre-training is
leveraged as a starting point to target specific downstream tasks. Specifically: 1) during pre-training,
language models focus on unsupervised learning tasks (e.g. predicting masked words based on the
surrounding context), and 2) during fine-tuning, the pre-trained model is further trained on supervised
learning tasks (e.g. sequence labeling). Pre-trained models are also widespread in CL (Mehta et al.,
2021; Wu et al., 2021), where they are mostly used to conveniently initialize the model weights
before learning from the non-stationary stream of data. However, the generality and robustness of

∗Corresponding author

Preprint. Under review.


the neural representations may be greatly impaired during the continual training on the
sequence of tasks, since the model will tend to overfit to the task objective. By separating the goal
of building robust features from that of solving the task during the continual training, we provide a
new way to design continual learning models which are 1) kept continuously up-to-date over time
and 2) more robust to catastrophic forgetting, since pre-trained features have been reported to be
subject to milder drifts during adaptation to the task (Mehta et al., 2020; Ramasesh et al., 2021).
The former point can be better understood with an example: let us consider the case in which a model
is pre-trained on a snapshot of Wikipedia containing articles up to 2018. Part of the knowledge
contained inside the model will soon become outdated: on the one hand, the information contained in
the original articles is likely to be replaced with up-to-date versions (e.g., changes in public figures
such as a new President). On the other hand, outdated models do not incorporate the semantics of
concepts related to more recent events. For example, the semantics of a term like COVID-19, which
becomes important in a short amount of time, cannot be incorporated in the model without additional
pre-training. As a consequence, an outdated language model may perform worse on tasks like language
generation and Question Answering (Q/A), since it will not be able to generate sentences related to
recent events (Jang et al., 2022).

Figure 1: The Continual Pre-training scenario. During each stage (experience) i of continual
pre-training (top), the model h_i^pr is pre-trained (center) on the dataset D_i^pr (e.g., scientific
abstracts). Subsequently (bottom), the model is fine-tuned on one (or more) downstream task D_i^ds
(e.g., scientific abstracts classification). Forgetting is measured by fine-tuning on D^fc (e.g.,
sentiment analysis). At each stage, only the current pre-trained and downstream datasets/models are
available.

In this paper, we formalize and study the continual pre-training scenario (Figure 1), where the
model is continuously updated via an appropriate pre-training objective on
a non-stationary stream of (possibly unlabeled) data. After each stage of pre-training, we build a
new model from the pre-trained one (e.g., by substituting its final classifier head) and we train it
on a number of downstream tasks. We monitor whether continual pre-training improves/worsens
the performance on tasks which are similar/different with respect to the ones encountered during
continual pre-training. We are particularly interested in studying the possible deterioration, which rep-
resents catastrophic forgetting. For the sake of the evaluation, we specifically introduce a Forgetting
Control (FC) dataset as one of the downstream tasks. The FC dataset contains samples different from
the ones present in the non-stationary stream and more similar to the dataset used for the original
pre-training phase prior to continual training. Against this FC dataset we compare the performance
of the pre-trained model at the beginning of the sequence of tasks with the performance of the
model after each stage of continual pre-training. Our aim is to investigate the behavior of different
architectures, pre-training protocols and input modalities in the continual pre-training scenario and
how these factors impact on catastrophic forgetting. In order to explore this broad research question:
1. we formally define the continual pre-training scenario and we describe an evaluation method-
ology to assess the impact of catastrophic forgetting (Section 3);
2. we build two evaluation environments based on Natural Language Processing (NLP) and
Computer Vision (CV) tasks (Sections 3.1 and 3.2, respectively). We thoroughly study them
by using different datasets, models architectures and pre-training protocols;
3. we show that unsupervised/self-supervised pre-training protocols play a fundamental role in
the mitigation of forgetting, while supervised protocols hurt the performance. The role of
the architecture type and depth does not have an equivalent impact (Section 4);
4. we study the feature space of our pre-trained models by using linear evaluation and Centered
Kernel Alignment (Kornblith et al., 2019) (Section 4). We observe that keeping the hidden
features fixed during linear evaluation exacerbates forgetting for supervised pre-training.
Supervised pre-training also causes a larger drift in the feature space compared to self-
supervised pre-training.

2 Related Works

The ability of pre-trained models to solve a diverse set of tasks through fine-tuning has led to them
being considered almost static models. However, it was recently shown that taking a pre-trained model and
performing an additional step of pre-training on domain-specific data is beneficial for the downstream
performance in that domain (e.g., Q/A in bio-medicine as shown by Gururangan et al. (2020); Lee
et al. (2020)). Pre-trained models are helpful also in CL, where leveraging a pre-trained model as the
starting point for the continual training leads to better results with respect to forgetting both in CV
(Mehta et al., 2021; Ramasesh et al., 2021) and NLP (Wu et al., 2021), especially when combined with
CL strategies. An additional pre-training step before the continual training also provides advantages
in terms of downstream performance on domain-specific tasks (Rongali et al., 2021).
The need to perform continual pre-training is present in many different applications, where updating
the pre-trained model is fundamental to incorporate new knowledge and update or erase outdated
information (Lazaridou et al., 2021; Han et al., 2021; Jang et al., 2022). While models trained directly
on a domain task may achieve similar or even better performance on downstream tasks (Gu et al.,
2021), the cost of starting from scratch each time is large and mitigating it is one of the objectives of
CL. Continual pre-training has been recently explored in the context of NLP by leveraging either
domain-specific datasets (like multi-domain research papers) (Jin et al., 2021) or news/tweets corpora
split into different temporal segments (Loureiro et al., 2022; Jang et al., 2021). The results show that
continual pre-training is beneficial to the downstream performance and that forgetting on the tasks
stream can be effectively mitigated by employing CL strategies. Moreover, continual pre-training is
also able to provide advantages in terms of temporal generalization on unseen future data (Loureiro
et al., 2022) and event temporal reasoning (Han et al., 2021). The work by Hu et al. (2021) focuses
on the performance difference between contrastive self-supervised (MoCo-v2 by Chen et al. (2020))
and supervised pre-training in CV, showing that self-supervised pre-training leads to features that are more robust in terms
of forgetting. A more detailed discussion of related works is presented in Appendix D. Our work
provides new evidence of the behavior of pre-trained models in the continual pre-training scenario.
We propose to evaluate the performance in terms of catastrophic forgetting on a FC dataset not present
in the CL stream. We provide results for both CV and NLP, with experiments on longer streams
than most of the existing studies (with the exception of Qin et al. (2022)). Unlike prior works, we do
not use any CL strategy but simply employ naive fine-tuning.

3 Continual Pre-Training Scenario

The traditional CL scenario (Lomonaco et al., 2021) trains a model h_0 on a (possibly infinite) stream
of experiences S = (e_1, e_2, e_3, ...), where each experience e_i brings a dataset D_i representing the
current task. The model is trained on S, one experience after the other, and needs to address the
non-stationarity and drifts occurring between experiences without having access to the previously
encountered data. The model h_0 is sometimes initialized with the weights of a pre-trained model.
The pre-training phase is conducted on the dataset D^pr, which is however not available during CL.

We provide a formal characterization of the continual pre-training scenario (pseudo-code in Appendix
C) and highlight the differences with respect to the traditional CL setup. The continual pre-training
scenario leverages a model h_0^pr originally pre-trained on a dataset D_0^pr, which is not available
anymore. The model is presented with a (possibly infinite) stream of experiences, where each
experience e_i brings a dataset D_i^pr for pre-training and a downstream dataset D_i^ds for
fine-tuning. For each experience e_i, the last pre-trained model h_{i-1}^pr is further pre-trained on
D_i^pr. After the pre-training step, the model h_i^pr is fine-tuned on D_i^ds, resulting in h_i^ds. We
adopt naive fine-tuning, without any CL strategies. In order to measure catastrophic forgetting, we
leverage a FC dataset D^fc in place of the D_0^pr originally used during the first pre-training phase.
While each D_i^ds contains samples similar to the ones encountered during pre-training, the FC
dataset contains knowledge more similar to the one in D_0^pr than to the one in the union of D_1^pr,
D_2^pr, D_3^pr, .... Forgetting is assessed after each experience e_i by comparing the performance
of h_0^pr fine-tuned on D^fc with the performance of h_i^pr fine-tuned on the same dataset. We use
h_i^ds to verify that the continual pre-training step actually contributes to learning meaningful
features for the downstream task. In this way we avoid the uninteresting case where pre-training
leaves features (mostly) unchanged, resulting in no catastrophic forgetting of previous knowledge but
also in a lower performance on the downstream task. It is important to note that the head (last layer
of the model) used during pre-training is not the one used during fine-tuning. In fact, the
pre-training and downstream tasks are different and therefore require different heads. Before
fine-tuning on each downstream task, the head of h_i^pr is replaced with a randomly initialized head.
The model is then trained until convergence to obtain h_i^ds. During the continual pre-training step,
instead, the head is not replaced.
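As an illustration of the head-replacement step, the sketch below (a minimal example with Hugging Face Transformers, not the authors' released code; the checkpoint path is hypothetical) shows how the masked-language-modeling head used during pre-training is discarded and a fresh, randomly initialized classification head is attached on top of the same encoder weights before downstream fine-tuning.

```python
# Minimal sketch: swap the pre-training head for a new downstream head.
# "path/to/h_i_pr" is a hypothetical checkpoint of the continually pre-trained model.
from transformers import AutoModelForMaskedLM, AutoModelForSequenceClassification

h_i_pr = AutoModelForMaskedLM.from_pretrained("path/to/h_i_pr")  # encoder + MLM head
h_i_pr.save_pretrained("tmp_encoder")

# Reloading as a sequence classifier reuses the encoder weights and creates a new,
# randomly initialized classification head (transformers warns about the new weights).
h_i_ds = AutoModelForSequenceClassification.from_pretrained("tmp_encoder", num_labels=2)
```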
The continual pre-training scenario has different characteristics with respect to the traditional CL
setup. Firstly, the continual pre-training scenario continuously updates the pre-trained model and
then adapts it to specific tasks. The traditional CL setup does not consider this important
distinction, using the same model both for representation learning and to solve incoming tasks.
Secondly, model evaluation in continual pre-training requires an additional training phase on the
target task, while CL usually requires the model to be readily able to tackle all tasks seen so far
without any additional training. Therefore, the model has to focus on the new task without the
opportunity to build robust, general features via pre-training protocols. As our results will show,
the additional cost of a training phase in continual pre-training can be largely mitigated by a quick
adaptation phase (e.g., one epoch of training). In fact, fast remembering of previous knowledge is
considered one of the objectives of CL (Hadsell et al., 2020).

Table 1: Combinations of the main components of the continual pre-training scenario explored in this
paper. MLM = Masked Language Modeling, MIM = Masked Image Modeling, CLF = Image Classification.
Pre-training          Architecture   Data
Unsupervised (MLM)    Transformer    Words
Unsupervised (MIM)    Transformer    Images
Supervised (CLF)      Transformer    Images
Supervised (CLF)      CNN            Images
Ultimately, our continual pre-training scenario aims at building models which are general learners,
able to quickly adapt to unseen data while still preserving the original knowledge. We studied
continual pre-training by introducing two evaluation environments: one for NLP and one for CV.
They are designed to investigate the impact on forgetting of specific components of the scenario
(Table 1), namely the input modality, the pre-training protocol and the model architecture.

3.1 Natural Language Processing Environment

Current NLP applications are all based on the idea of leveraging large-scale pre-trained models to
then solve different tasks under fine-tuning, few- or even zero-shot learning settings. Therefore, NLP
applications based on the traditional pre-training fine-tuning setting seem to be the most natural
field for evaluating our continual pre-training scenario. For example, when dealing with a stream
of news, it is important to keep the language model updated (Lazaridou et al., 2021) so that it can
incorporate information which was not previously available. Our NLP environment employs an
unsupervised/self-supervised pre-training protocol and different Transformer architectures (Vaswani
et al., 2017). These components are standard in NLP and represent the state of the art in the field.
We use the popular pre-trained Transformers RoBERTa (Liu et al., 2019) and BERT (Devlin et al.,
2019), pre-trained on Wikipedia. In addition, we study a variant of RoBERTa in which the vocabulary
is dynamically expanded with the addition of new tokens. We select the most frequent tokens of
the continual pre-training dataset which were not present in the pre-trained tokenizer. Vocabulary
expansion is beneficial for downstream performance, as shown by recent works on dynamic token
expansion in both CV (Douillard et al., 2022) and NLP (Zhang et al., 2020; Han et al., 2021). Our
aim is to understand whether the addition of new tokens may result in a larger forgetting of existing
knowledge. We apply continual pre-training on a dataset of scientific abstracts from arXiv
(Geiger, 2019). The motivation behind the choice of this dataset is that scientific abstracts
represent a very specific domain for NLP both in terms of syntactic structures and domain-specific
terminology. Indeed, updating the language model before fine-tuning is particularly beneficial
under these circumstances. The downstream task is modeled as a document classification problem
aiming to associate scientific abstracts to their corresponding arXiv classes. The CL stream
includes 5 experiences, with 2 scientific domains (classes) in each experience (as in common CL
benchmarks like Split-MNIST/CIFAR-10). Please, refer to Appendix A for a complete description
of the split used for pretraining and downstream fine-tuning. We test two different FC datasets to
measure forgetting: sentiment analysis from tweets and Question Answering Natural Language
Inference (QNLI). The idea behind these choices is that the dataset of scientific abstracts
should not contain much knowledge about either sentiments or generic facts for language
inference. Pre-training on scientific abstracts may therefore disrupt the knowledge contained in the
original language model. We additionally expand our analysis by using the 20 datasets present in the
SentEval benchmark (Conneau, Kiela, 2018) as FC datasets.
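As a reference, a single continual pre-training step via masked language modeling could be sketched with Hugging Face Transformers as follows; this is a minimal illustration under stated assumptions (a plain-text file of abstracts for one experience, illustrative hyperparameters), not the released training code.

```python
# Minimal sketch of one experience of continual pre-training (masked language modeling).
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")  # stands in for h_{i-1}^pr

# Hypothetical file holding the abstracts of experience e_i, one per line.
abstracts = load_dataset("text", data_files={"train": "experience_i_abstracts.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

train_set = abstracts["train"].map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="pretrain_exp_i", learning_rate=5e-5,
                         num_train_epochs=30, per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=train_set,
        data_collator=collator).train()  # yields h_i^pr
model.save_pretrained("h_i_pr")
```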

Table 2: Accuracy on the entire dataset of sentiment analysis with RoBERTa model. Continual
pre-training has been performed sequentially over each experience of scientific abstracts.
Base refers to the model pre-trained on Wikipedia, while NT refers to the model with vocabulary
expansion.
RoBERTa Accuracy 1-epoch Accuracy
Base 93.40 92.40
Exp. e1 e2 e3 e4 e5 e1 e2 e3 e4 e5
Pretr 93.40 93.15 93.35 93.20 92.90 92.40 91.80 92.30 91.85 92.20
Pretr. NT 93.75 93.70 93.75 93.60 94.10 91.75 91.15 92.00 92.30 92.45

Table 3: Accuracy on the entire dataset of QNLI with RoBERTa model. Continual pre-training has
been performed sequentially over each experience of scientific abstracts. Base refers to the
model pre-trained on Wikipedia, while NT refers to the model with expanding vocabulary.
RoBERTa Accuracy 1-epoch Accuracy
Base 92.73 91.76
Exp. e1 e2 e3 e4 e5 e1 e2 e3 e4 e5
Pretr. 91.96 91.87 91.96 91.76 92.07 90.68 91.32 90.70 90.83 90.85
Pretr. NT 92.09 91.62 91.31 91.45 91.51 91.49 91.05 91.31 89.99 90.99

3.2 Computer Vision Environment

We found CV to be a useful test-bed to disentangle the importance of the three components in our
continual pre-training scenario. In particular, we design the CV environment to understand to what
extent forgetting depends on the input modality (natural language against vision), on the architecture
(Transformer against CNN) and on the pre-training protocol (unsupervised/self-supervised against
supervised). To limit the large number of experiments needed to explore these three factors, in the CV
environment we do not measure the performance on the downstream task after each step of continual
pre-training. Instead, we focus on the study of forgetting on the FC dataset. In fact, the impact of
pre-training on downstream tasks similar to the ones in the pre-training stream is assessed both in the
discussion of related works (Section 2 above) and in the experiments with scientific abstracts
classification in the NLP environment (results presented below in Section 4 and Appendix B.2).
The CV environment uses iNaturalist (Van Horn et al., 2018) for continual pre-training and
CORe50 (Lomonaco, Maltoni, 2017) as FC dataset for catastrophic forgetting. We use ResNet101,
Vision Transformer (ViT) and BEiT originally pre-trained on ImageNet. The choice of ResNet and
ViT is fundamental to disentangle the role of the architecture (NLP uses only Transformers) and the
pre-training protocol (NLP uses only self-supervised pre-training). In fact, ResNet (He et al., 2016)
and ViT (Dosovitskiy et al., 2020) are pre-trained via supervised image classification. The choice
of BEiT (Bao et al., 2021), instead, allows us to understand the role of the input modality. BEiT uses
the recent self-supervised masked image modeling pre-training, which closely resembles the masked
language modeling used in NLP. The proposed setup allows us to run experiments changing one
factor at a time among the three we studied while keeping the other two fixed. In this way, we are able
to properly compare results between the NLP and CV environments.

3.3 Experimental Setup

For NLP, we use Huggingface's pre-trained BERT and RoBERTa with 12 layers. The NLP
datasets, SentEval excluded, are also taken from Huggingface. For SentEval, we train our models
using the original code. We use the same pre-training protocol across all experiments, with a learning
rate of 5e-5 and 30 epochs with early stopping (patience of 2 epochs). For fine-tuning, we adopt
a similar setup but with a learning rate of 1e-5 and 20 epochs. For CV, we use ResNet101 and
iNaturalist from Torchvision, while we retrieve ViT and BEiT models from Huggingface, using the
version with 12 layers in order to properly compare results with NLP experiments. We use Avalanche
(Lomonaco et al., 2021) to run the continual pre-training and fine-tuning. For fine-tuning on the FC task,

Table 4: Accuracy on the entire dataset of sentiment analysis (ER) and QNLI with BERT
model. Continual pre-training has been performed sequentially over each experience of scientific
abstracts. Base refers to the model pre-trained on Wikipedia.
BERT Accuracy 1-epoch Accuracy
Base ER 93.05 92.70
Base QNLI 90.43 90.43
Exp. e1 e2 e3 e4 e5 e1 e2 e3 e4 e5
Pr. ER 92.95 92.90 92.90 92.65 92.45 92.25 92.35 91.90 92.15 91.90
Pr. QNLI 90.28 89.75 90.50 89.93 90.01 90.01 89.49 89.31 89.11 89.29

Figure 2: Accuracy on the 10 transfer tasks (left) and 10 probing tasks (right) of SentEval, comparing
the continually pre-trained RoBERTa and BERT with their base versions and with GloVe/fastText
baselines. Transformers are fine-tuned after 5 experiences of pre-training on the scientific
abstracts. Base refers to the model pre-trained on Wikipedia. Full results in Appendix B.1.

we try a few combinations of learning rates (1e-5, 1e-4, 1e-3) and batch sizes (64, 128, 256) on
a held-out validation set built from CORe50. We report the best performance in terms of accuracy on
the test set. The experimental setup is described in detail in Appendix A.

4 Results

We provide strong empirical evidence supporting the hypothesis that the continual pre-training
scenario is less impacted by catastrophic forgetting than the traditional one. In particular, we found
the unsupervised pre-training objective to be the common factor for the resistance to forgetting in the
proposed environments. Our result adds to the evidences discussed in Section 2 for the robustness of
unsupervised and self-supervised protocols with respect to catastrophic forgetting. Our evaluation
offers similar conclusions for the novel continual pre-training scenario.

Continual pre-training improves performance on the downstream task without forgetting on
FC datasets. We verified that continual pre-training positively impacts the performance on
the downstream scientific abstracts classification task. That is, we observed that acquiring
domain knowledge on scientific abstracts helps when solving the classification task (on held-
out data). Appendix B.2 shows that continual pre-training on 5 experiences improves the downstream
classification performance (Table 11). Performance is improved also with one step of pre-training on
the entire scientific abstracts dataset. As discussed in Appendix B.2, while the improvement
is relatively small, we were able to achieve it by using a smaller number of samples with respect to the
common pre-training datasets (e.g. Wikipedia): this points to the fact that continual pre-training does
not necessarily need enormous datasets to actually be beneficial (a very useful aspect for continual
learning).
In terms of catastrophic forgetting on the FC dataset after continual pre-training, we show that, quite
surprisingly, both RoBERTa (Tables 2 and 3) and BERT (Table 4) achieve almost zero forgetting,

Table 5: Fine-tuning accuracy on the entire dataset of CORe50. Pre-training has been performed
sequentially over each experience of iNaturalist.
Model Accuracy 1-epoch Accuracy
ResNet 94.72 94.28
ViT Base 90.56 90.56
BEiT Base 90.15 82.51
Exp. e1 e2 e3 e4 e5 e1 e2 e3 e4 e5
ResNet Pr. 89.88 81.29 80.82 77.78 74.35 88.40 69.93 70.43 65.91 57.60
ViT Pr. 90.29 81.36 81.47 79.71 77.42 88.48 79.33 78.60 75.01 75.72
BEiT Pr. 88.37 86.45 86.73 87.07 86.46 80.55 78.06 78.88 77.27 77.06

Table 6: Linear evaluation accuracy on the sentiment analysis (ER) and QNLI datasets. Pre-
training has been performed sequentially over each experience of scientific abstracts.
Model ER QNLI
RoBERTa Base 60.05 69.43
BERT Base 59.85 77.87
Exp. e1 e2 e3 e4 e5 e1 e2 e3 e4 e5
RoBERTa Pr. 59.15 59.85 57.00 54.10 58.05 68.88 68.97 67.16 68.08 67.55
BERT Pr. 60.15 59.15 59.35 58.20 56.70 75.62 74.15 72.93 73.37 73.44

reaching an accuracy comparable to the one originally obtained by the model before continual
pre-training. This happens both for sentiment analysis and QNLI. Moreover, a single epoch of
gradient descent is sufficient to retain most of the original performance, showing the quick adaptation
capabilities of the pre-trained models. Notably, the additional pre-training steps on domain-specific
texts, along with the expansion of the RoBERTa vocabulary, do not worsen the effects of catastrophic
forgetting. We conducted a broader empirical assessment on a diverse set of NLP tasks by using the
SentEval benchmark. Figure 2 shows the downstream performance of BERT and RoBERTa after
the entire continual pre-training stream. GloVe and fastText results are used as baselines and are
taken from Conneau, Kiela (May 7-12, 2018 2018), except on SNLI and on all probing tasks, for
which they were not available. Therefore, we computed these results using the original code. The results
confirm our findings: BERT and RoBERTa do not show clear signs of forgetting, neither with
respect to their original pre-trained version, nor with respect to the baselines.

Self-supervised continual pre-training mitigates forgetting. We found that the self-supervised
pre-training protocol is the main factor behind the mitigation of forgetting in continual pre-training.
Since all NLP models use the self-supervised masked language modeling task for pre-training, we
turned our attention to the CV environment. In fact, ResNet and ViT both use supervised image
classification during pre-training. In contrast, BEiT uses the recent self-supervised protocol of
masked image modeling (Bao et al., 2021) (mirroring the NLP setting). We show (Table 5) that BEiT
shares the same properties as the NLP Transformers, showing little forgetting with respect to the
original version on the FC dataset (and one epoch of fine-tuning is sufficient to recover the original
performance). Interestingly, ResNet and ViT exhibit a qualitatively different trend, with a substantial
accuracy drop of around 20% and 13%, respectively. This difference in performance suggests that
supervised pre-training is the main cause of forgetting in both ResNet and ViT.

The negligible role of the architecture. The type of Transformer used in the experiments does
not appear to be a fundamental component: we experimented with larger vision models with 24
layers instead of 12 (Appendix B.5) without observing any significant difference. The
difference between convolutional networks like ResNet and attention-based transformers does not
seem to have an impact, either. While ResNet sometimes exhibits worse performance than ViT, there
is no clear evidence that this kind of model is more susceptible to forgetting.

Feature Space Analysis: supervised pre-training induces larger drifts. We verified the coherence
of our findings by studying the feature space of the models. We leveraged linear evaluation for a
quantitative analysis and Centered Kernel Alignment (CKA) (Kornblith et al., 2019) for a qualitative
analysis. Linear evaluation (i.e., training only the linear classifier and keeping the rest of the
model fixed) is a powerful tool to understand the impact of the learned model representations in
terms of catastrophic forgetting (Davari et al., 2022). A model which exhibits forgetting during
linear evaluation is likely to possess features which are not representative of the task. Conversely,
a good linear evaluation performance points to a set of strong features, since it means that the task
is linearly separable in that feature space. We adopted this approach for our continual pre-training
scenario. In the NLP environment (Table 6), the features built by the models during continual
pre-training are robust and do not cause a large deviation of performance with respect to the
original pre-trained model. The lower training accuracy with respect to fine-tuning is expected,
since we train only a subset of all parameters. In the CV environment (Table 7), both ResNet and ViT
suffer from forgetting, while BEiT does not (although it reaches a lower absolute accuracy).

Table 7: Linear evaluation accuracy on the entire dataset of CORe50. Pre-training has been
performed sequentially over each experience of iNaturalist.
Model        Accuracy
ResNet       82.50
ViT Base     91.90
BEiT Base    52.75
Exp.         e1     e2     e3     e4     e5
ResNet Pr.   61.99  31.02  34.71  26.41  22.01
ViT Pr.      79.38  55.20  57.98  60.49  48.25
BEiT Pr.     52.34  51.71  51.31  53.12  52.51
Following Hu et al. (2021), we used CKA with a linear kernel (Kornblith et al., 2019) to compute layer
similarity between the original pre-trained model and its continually pre-trained versions. From
Figure 3, we can see that all models show large correlations across the bottom layers (features are
not drifting much). Instead, ViT and ResNet show lower correlation values for the final layers than
BEiT, which corresponds to a larger drift in those layers (full set of results in Appendix B.4). These
results are compatible with what was shown by Madaan et al. (2021) for unsupervised CL, namely that
unsupervised models in the traditional CL scenario have larger correlations in the lower layers than
supervised ones. Our results further extend this conclusion to continual pre-training, supporting the
idea that pre-training acts mainly in the upper layers of the networks (the ones containing more
specific domain knowledge) and that heavy changes in these layers are enough to cause a deterioration
of performance on the FC dataset, resulting in forgetting.
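For reference, linear CKA between two activation matrices can be computed as in the sketch below (a plain full-batch implementation of the formula from Kornblith et al. (2019); the experiments rely on a minibatch estimator, and the example tensors here are placeholders).

```python
# Minimal sketch of linear CKA between two (n_examples, n_features) activation matrices.
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    # Center each feature (column) of both representations.
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = (Y.T @ X).norm(p="fro") ** 2
    denominator = (X.T @ X).norm(p="fro") * (Y.T @ Y).norm(p="fro")
    return numerator / denominator

# Placeholder activations of the same layer before and after continual pre-training.
feats_h0 = torch.randn(512, 768)
feats_h5 = torch.randn(512, 768)
print(linear_cka(feats_h0, feats_h5).item())
```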

5 Discussion and Limitations

Our empirical evaluation provides evidence that forgetting is mitigated in continual pre-training by
the usage of self-supervised pre-training protocols (Table 8). Fine-tuning for only one epoch recovers
most of the performance: this is important since an expensive fine-tuning phase might reduce the
applicability of continual pre-training in environments with constrained resources. Deciding when to
use continual pre-training and when to use the traditional CL scenario is an open question. As
previously discussed, the properties of continual pre-training do not fit the case in which a single
model has to be readily applicable to different tasks without a step of fine-tuning. Nonetheless, we
believe that whenever knowledge must be kept updated over time, continual pre-training can deliver a
superior solution, less affected by forgetting (see Appendix B.3 for a comparison with the traditional
CL scenario). Continual pre-training offers the possibility to shift the focus from the mitigation of
forgetting to other CL objectives like quick adaptation and knowledge reuse and transfer.

Table 8: Main takeaways from the experiments presented in this paper. Only the supervised
pre-training protocols showed clear signs of forgetting, while unsupervised/self-supervised protocols
did not.
Pre-training   Architecture   Data     Forgetting
Unsupervised   Transformer    Words    ✗
Unsupervised   Transformer    Images   ✗
Supervised     Transformer    Images   ✓
Supervised     CNN            Images   ✓
(a) RoBERTa QNLI  (b) BERT Tweets  (c) BEiT  (d) ViT  (e) ResNet

Figure 3: CKA for RoBERTa, BERT, BEiT, ViT and ResNet. The pre-trained model h_5^ds after the last
experience (x axis) is compared with the original pre-trained model h_0^ds (y axis). Each row is the
similarity of a layer with respect to each layer of the other model.

The main limitation of our study is related to the scale of the experiments, as we were able to
experiment with only a limited number of datasets for each environment. While the computational cost
of each experiment was reasonable (from a few hours for fine-tuning to around one day for continual
pre-training on a single A100), the number of experiments per environment was large. We preferred to
thoroughly evaluate a few environments rather than trying to address a wide range of different
datasets without being able to properly explore them (Table 1). We are well aware that a
comprehensive exploration of continual pre-training in both NLP and CV domains is an ambitious
objective, possible only in the context of a broad research program. However, we are confident that
this study has shed some light on the subject and clearly pointed towards promising research
directions.

6 Conclusion
Continual pre-training represents a novel CL scenario with promising opportunities and unexpected
characteristics. In this work, we formally defined the continual pre-training scenario and we showed
the effect that pre-training has on catastrophic forgetting, for both NLP and CV environments and
with different architectures. Our results show that forgetting can be effectively mitigated by means of
self-supervised pretraining, even with a single epoch of fine-tuning on the FC dataset. Ultimately,
this work opens up the possibility to continually train large pre-trained models in a scalable and
efficient way. Much like Deep Learning has advanced by disentangling the representation learning
objective from the solution to specific tasks, continual pre-training aims to focus on the incremental
development of robust features which are kept updated over time. This is a fundamental property
for the achievement of agents that can truly learn continually in the real world.

Acknowledgments and Disclosure of Funding


This work has been partially supported by the H2020 TEACHING project (GA 871385).

References
Bao Hangbo, Dong Li, Piao Songhao, Wei Furu. BEiT: BERT Pre-Training of Image Transformers
// International Conference on Learning Representations. 2021.
Chen Xinlei, Fan Haoqi, Girshick Ross, He Kaiming. Improved Baselines with Momentum Con-
trastive Learning // arXiv:2003.04297 [cs]. 2020.
Conneau Alexis, Kiela Douwe. SentEval: An Evaluation Toolkit for Universal Sentence Repre-
sentations // Proceedings of the Eleventh International Conference on Language Resources and
Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA),
May 7-12, 2018 2018.
Davari MohammadReza, Asadi Nader, Mudur Sudhir, Aljundi Rahaf, Belilovsky Eugene. Probing
Representation Forgetting in Supervised and Unsupervised Continual Learning // Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
De Lange Matthias, Aljundi Rahaf, Masana Marc, Parisot Sarah, Jia Xu, Leonardis Ales, Slabaugh
Gregory, Tuytelaars Tinne. A Continual Learning Survey: Defying Forgetting in Classification
Tasks // IEEE Transactions on Pattern Analysis and Machine Intelligence. 2021.

Devlin Jacob, Chang Ming-Wei, Lee Kenton, Toutanova Kristina. BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding // Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for
Computational Linguistics, 2019. 4171–4186.

Dosovitskiy Alexey, Beyer Lucas, Kolesnikov Alexander, Weissenborn Dirk, Zhai Xiaohua, Unterthiner
Thomas, Dehghani Mostafa, Minderer Matthias, Heigold Georg, Gelly Sylvain, Uszkoreit Jakob,
Houlsby Neil. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale //
International Conference on Learning Representations. 2020.

Douillard Arthur, Ramé Alexandre, Couairon Guillaume, Cord Matthieu. DyTox: Transformers for
Continual Learning with DYnamic TOken eXpansion // IEEE/CVF Conference on Computer
Vision and Pattern Recognition. 2022.

French Robert. Catastrophic Forgetting in Connectionist Networks // Trends in Cognitive Sciences.
1999. 3, 4. 128–135.

Geiger R. Stuart. ArXiV Archive: A Tidy and Complete Archive of Metadata for Papers on Arxiv.Org,
1993-2019. 2019.

Gu Yu, Tinn Robert, Cheng Hao, Lucas Michael, Usuyama Naoto, Liu Xiaodong, Naumann Tristan,
Gao Jianfeng, Poon Hoifung. Domain-Specific Language Model Pretraining for Biomedical
Natural Language Processing // ACM Transactions on Computing for Healthcare. 2021. 3, 1.
2:1–2:23.

Gururangan Suchin, Marasović Ana, Swayamdipta Swabha, Lo Kyle, Beltagy Iz, Downey Doug, Smith
Noah A. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks // Proceedings of
the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association
for Computational Linguistics, 2020. 8342–8360.

Hadsell Raia, Rao Dushyant, Rusu Andrei A, Pascanu Razvan. Embracing Change: Continual
Learning in Deep Neural Networks // Trends in Cognitive Sciences. 2020.

Han Rujun, Ren Xiang, Peng Nanyun. ECONET: Effective Continual Pretraining of Language Models
for Event Temporal Reasoning // Proceedings of the 2021 Conference on Empirical Methods
in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for
Computational Linguistics, 2021. 5367–5380.

He Kaiming, Zhang Xiangyu, Ren Shaoqing, Sun Jian. Deep Residual Learning for Image Recognition
// 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. 770–778.

Hu Dapeng, Yan Shipeng, Lu Qizhengqiu, Hong Lanqing, Hu Hailin, Zhang Yifan, Li Zhenguo, Wang
Xinchao, Feng Jiashi. How Well Does Self-Supervised Pre-Training Perform with Streaming Data?
// International Conference on Learning Representations. 2021.

Jang Joel, Ye Seonghyeon, Lee Changho, Yang Sohee, Shin Joongbo, Han Janghoon, Kim Gyeonghun,
Seo Minjoon. TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving
Language Models // arXiv:2204.14211 [cs]. 2022.

Jang Joel, Ye Seonghyeon, Yang Sohee, Shin Joongbo, Han Janghoon, Kim Gyeonghun, Choi
Stanley Jungkyu, Seo Minjoon. Towards Continual Knowledge Learning of Language Models //
International Conference on Learning Representations. 2021.

Jin Xisen, Zhang Dejiao, Zhu Henghui, Xiao Wei, Li Shang-Wen, Wei Xiaokai, Arnold Andrew, Ren
Xiang. Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora //
arXiv:2110.08534 [cs]. 2021.

Kornblith Simon, Norouzi Mohammad, Lee Honglak, Hinton Geoffrey. Similarity of Neural Network
Representations Revisited // Proceedings of the 36th International Conference on Machine Learning.
2019. 3519–3529.

Lazaridou Angeliki, Kuncoro Adhiguna, Gribovskaya Elena, Agrawal Devang, Liska Adam, Terzi
Tayfun, Gimenez Mai, d’Autume Cyprien de Masson, Kočiský Tomáš, Ruder Sebastian, Yogatama
Dani, Cao Kris, Young Susannah, Blunsom Phil. Mind the Gap: Assessing Temporal Generalization
in Neural Language Models // Thirty-Fifth Conference on Neural Information Processing Systems.
2021.
Lee Jinhyuk, Yoon Wonjin, Kim Sungdong, Kim Donghyeon, Kim Sunkyu, So Chan Ho, Kang Jaewoo.
BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining
// Bioinformatics. 2020. 36, 4. 1234–1240.
Lesort Timothée, Lomonaco Vincenzo, Stoian Andrei, Maltoni Davide, Filliat David, Díaz-Rodríguez
Natalia. Continual Learning for Robotics: Definition, Framework, Learning Strategies, Opportuni-
ties and Challenges // Information Fusion. 2020. 58. 52–68.
Liu Yinhan, Ott Myle, Goyal Naman, Du Jingfei, Joshi Mandar, Chen Danqi, Levy Omer, Lewis
Mike, Zettlemoyer Luke, Stoyanov Veselin. RoBERTa: A Robustly Optimized BERT Pretraining
Approach // arXiv:1907.11692 [cs]. 2019.
Lomonaco Vincenzo, Maltoni Davide. CORe50: A New Dataset and Benchmark for Continuous
Object Recognition // Proceedings of the 1st Annual Conference on Robot Learning. 78. 2017.
17–26. (Proceedings of Machine Learning Research).
Lomonaco Vincenzo, Pellegrini Lorenzo, Cossu Andrea, Carta Antonio, Graffieti Gabriele, Hayes
Tyler L., De Lange Matthias, Masana Marc, Pomponi Jary, van de Ven Gido, Mundt Martin, She Qi,
Cooper Keiland, Forest Jeremy, Belouadah Eden, Calderara Simone, Parisi German I., Cuzzolin
Fabio, Tolias Andreas, Scardapane Simone, Antiga Luca, Amhad Subutai, Popescu Adrian, Kanan
Christopher, van de Weijer Joost, Tuytelaars Tinne, Bacciu Davide, Maltoni Davide. Avalanche:
An End-to-End Library for Continual Learning // CLVision Workshop at CVPR. 2021.
Lopez-Paz David, Ranzato Marc’Aurelio. Gradient Episodic Memory for Continual Learning // NIPS.
2017.
Loureiro Daniel, Barbieri Francesco, Neves Leonardo, Anke Luis Espinosa, Camacho-Collados Jose.
TimeLMs: Diachronic Language Models from Twitter // arXiv:2202.03829 [cs]. 2022.
Madaan Divyam, Yoon Jaehong, Li Yuanchun, Liu Yunxin, Hwang Sung Ju. Representational Continu-
ity for Unsupervised Continual Learning // International Conference on Learning Representations.
2021.
McCloskey Michael, Cohen Neal J. Catastrophic Interference in Connectionist Networks: The
Sequential Learning Problem // Psychology of Learning and Motivation. 24. 1989. 109–165.
Mehta Nikhil, Liang Kevin J, Carin Lawrence. Bayesian Nonparametric Weight Factorization for
Continual Learning // arXiv. 2020. 1–17.
Mehta Sanket Vaibhav, Patil Darshan, Chandar Sarath, Strubell Emma. An Empirical Investigation
of the Role of Pre-training in Lifelong Learning // arXiv:2112.09153 [cs]. 2021.
Merity Stephen, Xiong Caiming, Bradbury James, Socher Richard. Pointer Sentinel Mixture Models
// arXiv:1609.07843 [cs]. 2016.
Nguyen Thao, Raghu Maithra, Kornblith Simon. Do Wide and Deep Networks Learn the Same
Things? Uncovering How Neural Network Representations Vary with Width and Depth //
International Conference on Learning Representations. 2020.
Parisi German I, Kemker Ronald, Part Jose L, Kanan Christopher, Wermter Stefan. Continual
Lifelong Learning with Neural Networks: A Review // Neural Networks. 2019. 113. 54–71.
Qin Yujia, Zhang Jiajie, Lin Yankai, Liu Zhiyuan, Li Peng, Sun Maosong, Zhou Jie. ELLE: Efficient
Lifelong Pre-training for Emerging Data // Findings of ACL. 2022.
Ramasesh Vinay Venkatesh, Lewkowycz Aitor, Dyer Ethan. Effect of Scale on Catastrophic Forgetting
in Neural Networks // International Conference on Learning Representations. 2021.

Rongali Subendhu, Jagannatha Abhyuday, Rawat Bhanu Pratap Singh, Yu Hong. Continual Domain-
Tuning for Pretrained Language Models // arXiv:2004.02288 [cs]. 2021.
Ruder Sebastian, Peters Matthew E., Swayamdipta Swabha, Wolf Thomas. Transfer Learning in
Natural Language Processing // Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Tutorials. Minneapolis, Minnesota:
Association for Computational Linguistics, 2019. 15–18.
Van Horn Grant, Mac Aodha Oisin, Song Yang, Cui Yin, Sun Chen, Shepard Alex, Adam Hartwig,
Perona Pietro, Belongie Serge. The INaturalist Species Classification and Detection Dataset //
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. 8769–
8778.
Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N, Kaiser
Łukasz, Polosukhin Illia. Attention Is All You Need // Advances in Neural Information Processing
Systems 30. 2017. 5998–6008.
Wu Tongtong, Caccia Massimo, Li Zhuang, Li Yuan-Fang, Qi Guilin, Haffari Gholamreza. Pretrained
Language Model in Continual Learning: A Comparative Study // International Conference on
Learning Representations. 2021.
Zhang Rong, Gangi Reddy Revanth, Sultan Md Arafat, Castelli Vittorio, Ferritto Anthony, Florian
Radu, Sarioglu Kayi Efsun, Roukos Salim, Sil Avi, Ward Todd. Multi-Stage Pre-training for Low-
Resource Domain Adaptation // Proceedings of the 2020 Conference on Empirical Methods in
Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, 2020.
5461–5468.

A Extended Experimental Setup


Here, we describe the experimental setup we adopted in our work for both the NLP environment and
the CV environment. All our experiments were run on a single A100 GPU with 80 GB of memory,
on a server with 96 cores.

NLP The continual pre-training dataset of scientific abstracts is taken from GitHub2 . We
selected 10 ArXiv classes to build our continual pre-training stream, namely ‘hep-ph’, ‘astro-ph’,
‘hep-th’, ‘quant-ph’, ‘cond-mat.mes-hall’, ‘gr-qc’, ‘cond-mat.mtrl-sci’, ‘cond-mat.str-el’, ‘cond-
mat.stat-mech’ and ‘astro-ph.SR’. For both pre-training and downstream fine-tuning, we selected
10,000 abstracts for each of the 10 classes for the training set and 1,000 for the test set. Hence, an
abstract present in one of the training/test sets of continual pre-training or downstream fine-tuning is
not present in the other partitions. We chose similar abstract categories since being able to distinguish
very different kinds of abstracts may greatly simplify the problem (e.g., one term may be enough to
classify the entire abstract). We will publicly release our version of the scientific abstract dataset used
in the experiments. The dataset can be easily loaded via Huggingface.
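As a rough illustration, the 5-experience stream could be assembled from the 10 classes as in the sketch below; the class list comes from the text, while the grouping order and the helper itself are assumptions for illustration only.

```python
# Minimal sketch: split the 10 arXiv classes into 5 experiences of 2 classes each,
# in the spirit of Split-MNIST/CIFAR-10 style benchmarks.
ARXIV_CLASSES = [
    "hep-ph", "astro-ph", "hep-th", "quant-ph", "cond-mat.mes-hall",
    "gr-qc", "cond-mat.mtrl-sci", "cond-mat.str-el", "cond-mat.stat-mech", "astro-ph.SR",
]

def build_stream(samples_by_class, classes=ARXIV_CLASSES, classes_per_exp=2):
    """samples_by_class: dict mapping a class name to its list of abstracts."""
    stream = []
    for start in range(0, len(classes), classes_per_exp):
        exp_classes = classes[start:start + classes_per_exp]
        exp_data = [(text, c) for c in exp_classes for text in samples_by_class[c]]
        stream.append(exp_data)  # usable for both the pre-training and downstream splits
    return stream  # 5 experiences
```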
In order to select new tokens for the expansion of the RoBERTa vocabulary at each experience of continual
pre-training, we trained a tokenizer from scratch on the WikiText dataset (Merity et al., 2016). This
tokenizer quickly approximates the tokens present in Wikipedia. We also trained a tokenizer on our
scientific abstracts dataset and ranked the tokens occurring in the latter but not in the former,
that is, the domain tokens related to the scientific abstracts dataset. We selected 426 new
tokens for the joint training experiments (Appendix B.2) and 39/42/28/30/10 for each of the 5
experiences of continual pre-training.
We added tokens to the tokenizer such that new tokens have precedence over already existing
tokens during the tokenization process. Within the new tokens, we sorted inversely by token length;
precedence is given by the order of addition (First In, First Out). The list of new tokens is embedded
in the released code. We also ran a few experiments (not reported here) adding sub-word tokens (BPE
encoding) instead of word tokens with the same procedure. We did not find significant differences in
the results, which do not seem to depend on which specific new tokens are selected, as long as they
provide domain knowledge about the task.
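A minimal sketch of the vocabulary-expansion step with Hugging Face tokenizers is shown below; the token list is a hypothetical placeholder (not the released list), and the added-token precedence handled by add_tokens only approximates the ordering described above.

```python
# Minimal sketch: expand the RoBERTa vocabulary with new domain tokens.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

new_tokens = ["quark", "exoplanet", "entanglement"]  # illustrative placeholder list
num_added = tokenizer.add_tokens(new_tokens)  # skips tokens already in the vocabulary

# New embedding rows for the added tokens are randomly initialized.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, vocabulary size is now {len(tokenizer)}")
```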
2
R. Stuart Geiger (2020), ArXiV Archive: A Tidy and Complete Archive of Metadata for Papers on arxiv.org,
Zenodo: http://doi.org/10.5281/zenodo.1463242

The FC dataset QNLI is available from Huggingface as part of the GLUE benchmark https://
huggingface.co/datasets/glue. The sentiment analysis from tweets dataset is also taken
from Huggingface at https://huggingface.co/datasets/emotion. The SentEval benchmark is
taken from the official codebase at https://github.com/facebookresearch/SentEval.
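For reference, both FC datasets can be loaded directly from the Hugging Face hub, roughly as follows (dataset identifiers as listed above; split names follow the hub defaults).

```python
# Minimal sketch: load the two Forgetting Control datasets.
from datasets import load_dataset

qnli = load_dataset("glue", "qnli")   # Question Answering NLI, part of GLUE
emotion = load_dataset("emotion")     # sentiment/emotion classification from tweets

print(qnli["train"][0])
print(emotion["train"][0])
```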
During linear evaluation, we removed the feedforward layer right before the classifier. We observed
that keeping it frozen yielded a very low training performance. On the other hand, fine-tuning it
together with the linear classifier did not show this issue, but resulted in a non-linear fine-tuning
procedure, making it difficult to compare results against the CV setup. Therefore, linear evaluation is
performed by taking the representation built for the special classification ([CLS]) token by the last
hidden layer of the transformer and decoding it with a trained linear classifier.
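A minimal sketch of this linear-evaluation protocol (frozen transformer body, trained linear classifier on the classification-token representation) is given below; the model name, the two-class output and the toy inputs are illustrative assumptions.

```python
# Minimal sketch of linear evaluation: freeze the backbone, train only a linear head.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
backbone = AutoModel.from_pretrained("roberta-base")
for p in backbone.parameters():
    p.requires_grad = False  # keep the hidden features fixed

classifier = torch.nn.Linear(backbone.config.hidden_size, 2)  # e.g. 2 labels for QNLI
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

def logits_for(texts):
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        cls = backbone(**enc).last_hidden_state[:, 0]  # classification-token representation
    return classifier(cls)

# One toy optimization step on placeholder data.
loss = torch.nn.functional.cross_entropy(
    logits_for(["a toy premise and hypothesis", "another toy pair"]),
    torch.tensor([0, 1]))
loss.backward()
optimizer.step()
```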

Computer Vision We adopted the Masked Image Modeling task for self-supervised pre-training
with BEiT. Following the original BEiT paper, we leveraged the DALL-E encoder, which is
kept fixed during continual pre-training. A simple example of masked image modeling can
be found at https://github.com/NielsRogge/Transformers-Tutorials/blob/master/
BEiT/Understanding_BeitForMaskedImageModeling.ipynb. Experiments which continually
pre-train the encoder as well may constitute interesting future work.
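The sketch below shows, under stated assumptions, what a masked-image-modeling forward pass with the Hugging Face BEiT implementation looks like; the checkpoint name, mask pattern and placeholder image are illustrative, and the visual-token targets produced by the frozen DALL-E encoder are not shown.

```python
# Minimal sketch of a BEiT masked-image-modeling forward pass.
import numpy as np
import torch
from PIL import Image
from transformers import BeitForMaskedImageModeling, BeitImageProcessor

name = "microsoft/beit-base-patch16-224-pt22k"  # self-supervised BEiT checkpoint
processor = BeitImageProcessor.from_pretrained(name)
model = BeitForMaskedImageModeling.from_pretrained(name)

# Placeholder image standing in for an iNaturalist sample.
image = Image.fromarray(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))
pixel_values = processor(images=image, return_tensors="pt").pixel_values

num_patches = (model.config.image_size // model.config.patch_size) ** 2  # 14 * 14 = 196
bool_masked_pos = torch.zeros(1, num_patches, dtype=torch.bool)
bool_masked_pos[:, :40] = True  # mask 40 of the 196 patches (the paper uses blockwise masking)

outputs = model(pixel_values=pixel_values, bool_masked_pos=bool_masked_pos)
# Logits over the 8192 visual tokens for each patch; the loss is computed only on the
# masked positions, against the DALL-E tokens of the original (unmasked) image.
print(outputs.logits.shape)
```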
Following the original Pytorch code at https://github.com/pytorch/vision/blob/main/
references/classification/presets.py, for continual pre-training and fine-tuning on the FC
dataset with ResNet we used a chain of augmentations: RandomResizedCrop with bilinear interpo-
lation, RandomHorizontalFlip and normalization of mean and standard deviation. On the test sets,
we resized the image to 256x256, applied center crop and normalization. ViT uses the same setup
without normalization. BEiT applies the ViT setup on the FC dataset only.
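A sketch of the corresponding torchvision transform chains is given below; the 224-pixel crop size, the ImageNet normalization statistics, and the shorter-side resize are assumptions taken from the referenced torchvision recipe rather than from the paper.

```python
# Minimal sketch of the ResNet augmentation chain (ViT/BEiT drop the Normalize step).
from torchvision import transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224, interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])

test_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,
])
```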
For all CKA experiments, we used the Python library from https://github.com/AntixK/
PyTorch-Model-Compare, which provides the unbiased minibatch estimator of the CKA.

B Additional Results
B.1 SentEval Results

Table 9 shows the complete set of results for the SentEval benchmark. We compare the performance
of continual pre-training after 5 experiences on scientific abstracts against two baselines
(GloVe and fastText) and the original pre-trained model. For RoBERTa, we also provide the results in
case of vocabulary expansion. We used one hidden layer of 50 units for probing tasks and logistic
regression for the transfer tasks.

B.2 Effect of Pre-Training on the Downstream Domain Task

Table 10 shows the accuracy on the entire dataset of scientific abstracts classification after
pre-training on the entire dataset of scientific abstracts (held-out sets). Therefore, this setup
uses only one step of pre-training to assess its effectiveness on the performance on the downstream
task. We show that pre-training is beneficial to the final performance with respect to the original
model pre-trained on Wikipedia.
Similarly, Table 11 shows the impact of 1 and 5 steps of continual pre-training on the dataset of
scientific abstracts classification. Each fine-tuning step is performed on the corresponding
split of the scientific abstract dataset. Again, we see a moderate improvement in the final
performance.
It is important to note that the improvement, although small, is nonetheless present even if each
experience of continual pre-training contains a smaller set of samples with respect to the pre-training
dataset typically used in the NLP literature, like Wikipedia. For each experience, we have 20,000
samples. This aspect is particularly important for continual learning, where the model is not updated
one-shot with a large dataset, but in multiple steps with few samples.

B.3 Results With Traditional CL scenario

Table 12 shows that in a traditional CL setup, fine-tuning a single model on scientific abstracts
classification tasks continuously leads to large forgetting on the same scientific abstracts

Table 9: Accuracy on 10 transfer and 10 probing tasks from SentEval. For comparison, we report
the performance of the pre-trained models at the end of pre-training on the last experience (e5) of
scientific abstracts dataset.
RoBERTa BERT
Task GloVe fastText Base Pretr. Pretr. NT Base Pretr.
CR 78.70 80.20 88.34 85.38 86.20 86.01 83.66
MR 77.40 78.20 84.35 80.95 80.65 80.46 76.37
MPQA 87.70 88.00 86.12 82.34 82.04 87.83 84.22
SUBJ 91.20 91.80 95.28 93.34 93.36 94.79 93.19
SST2 80.30 82.30 89.46 85.67 85.17 84.51 80.62
SST5 44.70 45.10 51.27 46.88 46.65 45.48 43.21
TREC 83.00 83.40 93.20 90.20 90.40 92.80 88.40
MRPC 72.70 74.40 74.20 74.78 74.67 75.07 73.39
SNLI 65.97 68.80 72.18 70.26 70.69 70.59 68.88
SICK-E 78.50 78.90 80.29 79.78 79.16 79.74 78.63
Length 71.76 64.20 87.03 87.33 86.17 86.11 87.58
Word Content 80.61 82.10 59.68 60.44 62.63 59.28 62.60
Depth 36.50 36.38 43.93 44.67 44.21 41.41 43.80
Top Constituents 66.09 66.34 75.23 76.02 75.91 75.46 77.72
Bigram Shift 49.90 49.67 90.84 85.89 85.75 88.96 85.96
Tense 85.34 87.18 88.56 88.14 87.88 89.06 88.80
Subj Number 79.26 80.78 86.89 87.81 87.44 85.53 86.44
Obj Number 77.66 80.29 84.49 84.46 84.80 83.44 83.42
Odd Man Out 53.15 49.96 68.65 62.45 61.67 65.86 60.99
Coordination Inversion 54.13 52.23 73.87 70.13 70.33 72.36 69.65

classification task (held-out dataset), unless CL strategies are employed. We measure the popular
ACC metric (Lopez-Paz, Ranzato, 2017) which computes the accuracy on all tasks after training
on the last task. The lower its value, the larger the forgetting effect. This shows that, although in
the traditional CL scenario we always have a model ready to tackle all the previous tasks without
retraining, the loss in terms of performance (accuracy in this case) is very large with respect to the
continual pre-training scenario.
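For reference, with R_{T,i} denoting the test accuracy on task i after training on the last task T, the metric reads:

```latex
\mathrm{ACC} = \frac{1}{T} \sum_{i=1}^{T} R_{T,i}
```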

B.4 CKA Plots

CKA is computed incrementally in minibatches, following Nguyen et al. (2020). We provide the full
set of CKA plots in Figure 4 for the NLP environment and in Figure 5 for the CV environment. We
include the CKA against the original pre-trained model and its continually pre-trained version after
each experience of continual pre-training. The upper-right corner of each image represents the upper
layers of the models and its correlation is very low only for ViT and ResNet, while it stays large for
BEiT, RoBERTa and BERT on all FC datasets.

B.5 Experiments with Larger CV Models

We report in Table 13 and Table 14 the performance obtained by larger Vision Transformer models
with 24 transformer layers for fine-tuning and linear evaluation, respectively. The results are in line
with our main findings with smaller models, except for the ViT, which shows a smaller degree of

Table 10: Accuracy on the entire downstream dataset of scientific abstracts classification
after joint training on the entire pre-training dataset of scientific abstracts. The scratch term
indicates that the model is randomly initialized at the beginning and not pre-trained on Wikipedia.
Model Accuracy 1-epoch Accuracy
RoBERTa Base 82.25 79.27
BERT Base 82.57 79.37
RoBERTa NT 81.84 77.88
RoBERTa Pr. 82.26 81.01
BERT Pr. 83.49 82.62
RoBERTa Pr. NT 83.51 81.94
RoBERTa scratch 80.48 75.79
RoBERTa scratch Pr. 82.50 81.50

Table 11: Accuracy on the downstream dataset of scientific abstracts classification after
continual pre-training. The split used for downstream classification and pre-training contains different
documents. The digit next to the model indicates the last experience the model has been trained on
(e.g., 5 means that the model has been pre-trained on all 5 experiences sequentially).
Model Accuracy 1-epoch Accuracy
RoBERTa Pr. 1 82.59 79.88
BERT Pr. 1 82.64 80.91
RoBERTa Pr. NT 1 82.37 80.58
RoBERTa Pr. 5 83.24 81.19
BERT Pr. 5 83.08 81.84
RoBERTa Pr. NT 5 83.06 81.22

forgetting. However, the training curves for the large ViT show an unstable trend: the best accuracy
is usually reached after one epoch, after which the value quickly degrades to a lower performance.
We believe that future work investigating the impact of model depth on our results may shed light
on this phenomenon.

C Continual Pre-Training Pseudocode


Algorithm 1 provides a high-level description of the continual pre-training scenario, showing the
steps of continual pre-training, downstream fine-tuning and catastrophic forgetting evaluation against
the FC dataset. To obtain the configuration used in linear evaluation, it is sufficient to replace
fine-tune with linear-eval in Line 6.

D Extended Related Works


The continual pre-training scenario appeared very recently in the literature. In this section, we provide
a more detailed description of the existing works exploring continual pre-training and the differences
with respect to our work. Section 2 already provides a brief description but, due to lack of space, we
were unable to thoroughly discuss the few existing studies.
Among existing works, the CL scenario used in (Jin et al., 2021) is the closest to our definition of
continual pre-training. Like us, the authors used a dataset of research papers as the pre-training
stream and leveraged RoBERTa in their experiments. Unlike ours, however, their work focuses on
NLP tasks and on the impact that different CL strategies have

Table 12: ACC on scientific abstracts classification for 5 experiences with RoBERTa. Models are
pre-trained only on the first experience of the scientific abstracts dataset. The replay memory size is 500.
Joint training results are from Table 10. An ACC around 20.00 means complete forgetting (only the last
task is correctly classified).
Model Joint Naive Replay DSLDA
RoBERTa Base 80.00 19.95 52.94 69.22
RoBERTa Pr. 82.26 19.90 50.78 72.03
RoBERTa Pr. NT 83.51 19.90 51.37 73.32

Table 13: Fine-tuning accuracy on the entire dataset of CORe50 with large transformers. Pre-training
has been performed sequentially over each experience of iNaturalist.
Model Accuracy 1-epoch Accuracy
ViT Base 92.95 90.77
BEiT Base 90.41 89.41
Exp. e1 e2 e3 e4 e5 e1 e2 e3 e4 e5 (left block: Accuracy; right block: 1-epoch Accuracy)
ViT Pr. 91.50 89.37 89.93 89.12 87.72 91.39 89.22 89.30 89.12 87.70
BEiT Pr. 89.78 89.90 89.18 88.50 90.09 86.81 85.94 87.50 88.50 88.50

on the final performance, rather than on the kind of pre-training protocol and on its impact on a
separate FC task. Moreover, the downstream tasks used to measure performance are strongly related
to the pre-training stream, making it difficult to understand the impact of each pre-training step on
catastrophic forgetting. The results they provided show that the amount of forgetting does not depend
on the specific CL strategy used. In line with our findings, a naive fine-tuning approach is robust and
does not show a catastrophic loss in performance.
The Continual Knowledge Learning (CKL) framework (Jang et al., 2021) shares some similarities
with the continual pre-training scenario adopted in our work. CKL considers a pre-trained
model that is updated continuously and, throughout its training, focuses on different objectives: retaining
invariant knowledge that does not change over time, incorporating new knowledge that was not present
before, and updating outdated knowledge. The proposed benchmark is entirely based on
NLP: it consists of a continual pre-training dataset of news, a "time-invariant knowledge" dataset
hand-crafted from a relations dataset, and "updated knowledge" and "new knowledge" datasets built
from scratch through Amazon Mechanical Turk and validated by a set of external experts. The
empirical evaluation provided in the paper is based on a new metric, called FUAR, which condenses
the performance of the pre-trained model on these three tasks into a single number. The experiments
are conducted on the T5 transformer endowed with existing CL strategies. The authors found
that parameter-expansion methods are among the best-performing ones, although they require
more parameters than static alternatives.
The study of Hu et al. (2021) focused on the impact of self-supervised pre-training on streaming data

Algorithm 1 Continual Pre-training scenario

Require: Pre-trained model h_0^pr, stream of experiences S = (e_1, e_2, e_3, . . .), FC dataset D^fc.
1: h_0^fc ← fine-tune(h_0^pr, D^fc)   ▷ Evaluate model on FC dataset before continual pre-training
2: for e_i ∈ S do
3:     D_i^pr, D_i^ds ← split(D_i)
4:     h_i^pr ← pre-train(h_{i-1}^pr, D_i^pr)   ▷ Choose appropriate pre-training objective
5:     h_i^ds ← fine-tune(h_i^pr, D_i^ds)
6:     h_i^fc ← fine-tune(h_i^pr, D^fc)   ▷ Evaluate model on FC dataset
7:     Compare performance of h_i^fc with h_0^fc to assess forgetting.
8: end for
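To make the scenario concrete, the following is a minimal Python sketch of the loop in Algorithm 1. It is a hypothetical driver, not the code released with the paper: pre_train, fine_tune, adapt (fine-tune or linear probe, mirroring the switch in Line 6), evaluate and split are placeholders to be supplied by the caller.

from copy import deepcopy

def continual_pretraining(h0_pr, stream, d_fc, *, pre_train, fine_tune,
                          adapt, evaluate, split):
    # Reference model: fine-tune a copy of the original pre-trained model on
    # the FC dataset before any continual pre-training takes place (Line 1).
    h0_fc = fine_tune(deepcopy(h0_pr), d_fc)
    fc_scores = [evaluate(h0_fc, d_fc)]
    h_pr = h0_pr
    for e_i in stream:
        d_pr, d_ds = split(e_i)                 # pre-training / downstream split (Line 3)
        h_pr = pre_train(h_pr, d_pr)            # continual pre-training step (Line 4)
        h_ds = fine_tune(deepcopy(h_pr), d_ds)  # downstream model for this experience (Line 5)
        h_fc = adapt(deepcopy(h_pr), d_fc)      # probe forgetting on the FC dataset (Line 6)
        fc_scores.append(evaluate(h_fc, d_fc))  # compare against fc_scores[0] (Line 7)
    return fc_scores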

Table 14: Linear evaluation accuracy on the entire dataset of CORe50 with large Transformers.
Pre-training has been performed sequentially over each experience of iNaturalist.
Model Accuracy
ViT Base 82.39
BEiT Base 52.04
Exp. e1 e2 e3 e4 e5
ViT Pr. 85.62 73.75 73.73 75.89 68.27
BEiT Pr. 56.67 55.62 56.12 55.74 56.76

subject to different types of drift (some of them ascribable to existing CL scenarios such as domain-
incremental, data-incremental and class-incremental learning). The authors adopted the MoCo-v2 self-supervised
technique for pre-training and a vast set of downstream tasks to measure forgetting, all belonging to
CV. Importantly for our work, the authors discussed the problem of catastrophic forgetting. However,
unlike in our work, the evaluation is performed on the same data used for pre-training instead
of relying on a separate downstream task. In our opinion, reporting results on an FC dataset better fits
the continual pre-training scenario and delivers a clearer picture of the effect of continual pre-training.
Nonetheless, the results obtained by Hu et al. (2021) are compatible with our findings, showing
that self-supervised pre-training reduces feature drift and mitigates forgetting. The CKA analysis
provided by the authors, similar to ours, supports the experimental results.

[Figure 4 appears here as a grid of CKA heatmaps. Panels: (a)-(e) RoBERTa QNLI 1-5, (f)-(j) RoBERTa Tweets 1-5, (k)-(o) BERT QNLI 1-5, (p)-(t) BERT Tweets 1-5. Axes: layers of the model after each pre-training experience (roberta0-roberta4, bert0-bert4) vs. layers of the reference model (axis labels: roberta joint, bert joint).]
Figure 4: CKA for RoBERTa and BERT. Pre-trained models after each experience are compared with
the original pre-trained model.

[Figure 5 appears here as a grid of CKA heatmaps. Panels: (a)-(e) BEiT 1-5, (f)-(j) ViT 1-5, (k)-(o) ResNet 1-5. Axes: layers of the model after each pre-training experience (beit0-beit4, vit0-vit4, resnet0-resnet4) vs. layers of the reference model (axis labels: beit joint, vit joint, resnet joint).]
Figure 5: CKA for BEiT, ViT and ResNet. Pre-trained models after each experience are compared
with the original pre-trained model.

