2407.07263v1

Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models
Jupinder Parmar* , Sanjeev Satheesh, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
NVIDIA
Abstract

As the number of model parameters and the size of pretraining datasets have continued to grow, the computational cost of pretraining has become intractable except for the most well-resourced teams. This increasing cost makes it ever more important to be able to reuse a model after it has completed pretraining, allowing a model's abilities to further improve without needing to train from scratch. In this work, we detail a set of guidelines that cover how to design efficacious data distributions and learning rate schedules for continued pretraining of language models. When applying these findings within a continued pretraining run on top of a well-trained 15B parameter model, we show an improvement of 9% in average model accuracy compared to the baseline of continued training on the pretraining set. The resulting recipe provides a practical starting point with which to begin developing language models through reuse rather than retraining.

1 Introduction

Language modeling abilities have seen massive improvements over the past few years (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2024; Team, 2024). While these advancements have enabled language models (LMs) to become highly skilled conversational agents (OpenAI, 2024; Anthropic, 2024; Team, 2024), they have come with increased computational cost, as pretraining has become ever more expensive due to both the number of model parameters (Team et al., 2024; DeepSeek-AI et al., 2024) and the pretraining dataset size (Touvron et al., 2023; Gemma Team, 2024; Parmar et al., 2024) continuing to grow in scale. With new LMs that set state-of-the-art accuracy being released on a frequent basis, LMs developed only a couple of months back are becoming obsolete as their capabilities are no longer up to par. This leaves open the question of how existing models can be kept competitive without retraining them from scratch.

Due to the large computational cost that pretraining of modern LMs incurs, frequent complete retraining is intractable. This makes the reuse of already developed LMs via continued pretraining an attractive proposition. While most recent works (Ibrahim et al., 2024; Jang et al., 2022; Ke et al., 2023; Çağatay Yıldız et al., 2024) have recommended guidelines for continued pretraining when adapting language models to new data domains or distribution shifts, intuition or recommendations on how to improve a model's general purpose abilities from a previously finalized checkpoint with continued pretraining have not been widely explored. In this paper, we focus on this under-studied setting and identify strategies that allow already trained LMs to improve upon areas of weakness without experiencing degradations in other capabilities.

In our experiments, we start on top of a 15B parameter LM that has seen 8T tokens of pretraining data (Parmar et al., 2024). Experimenting with a well-trained model of this scale ensures that our findings will be transferable to most settings and model sizes. We first identify the type of data distribution that should be used during continued pretraining and find that it is optimal to have two distributions, with the final one more heavily weighting data sources that relate to the abilities we want to improve in the model. Second, we determine what learning rate schedules enable the most efficient learning during continued pretraining and determine that the most performant one strikes a balance between the magnitude of the learning rate and the steepness of its decay. Lastly, we show how the learning rate value at which we switch between data distributions affects downstream accuracy and identify the point at which this switch should be made.

* Correspondence to: jupinderp@nvidia.com
These findings culminate in a recipe that can be used to perform continued pretraining to improve the capabilities of an existing LM. We demonstrate that this recipe is beneficial at continued training scales from 100B to 1 trillion tokens, illustrating its flexibility and robustness for use in a wide variety of settings. We hope that this recipe will allow model providers to forgo the need to regularly retrain models from scratch, as it makes it possible to reuse a trained model to attain improved capabilities.

2 Related Works

Continued training methods aim to take an already trained model and incorporate new data, adapt it for a given domain, or specialize it on a certain task (Rolnick et al., 2019; Caccia et al., 2021; Lesort et al., 2022; Gupta et al., 2023; Lin et al., 2024). The major challenge that arises during continued training is enabling a model to learn new information without forgetting previously attained knowledge or capabilities (Robins, 1995; French, 1999). The learning rate schedule and data distribution used during continued training (Gupta et al., 2023; Ibrahim et al., 2024; Winata et al., 2023; Scialom et al., 2022) have been shown to be particularly important in preventing such catastrophic forgetting.

For LMs, one major setting of continued training has been to embed more recent knowledge into the model by using data collected at a date later than when the pretraining set was constructed (Jin et al., 2022; Jang et al., 2022, 2023; Loureiro et al., 2022; Qin et al., 2022). Results from these studies found that experience replay (Chaudhry et al., 2019) and knowledge distillation (Hinton et al., 2015) are particularly effective. Continued training is also commonly used in LMs to adapt the model to data coming from a new domain (Ke et al., 2023; Gururangan et al., 2020; Wu et al., 2024). Many of these methods for domain-adaptive continued training update a portion of the model's weights with the new data to ensure that previous knowledge is not lost. For instance, Wu et al. (2024) do so via an expansion of the transformer blocks, updating only the newly added weights.

More related to the setting which we explore, several studies utilize continued pretraining to specialize a LM on a given task or domain (Zan et al., 2022; Yadav et al., 2023; Ma et al., 2023; Yang et al., 2024; Labrak et al., 2024). Despite investigating effective strategies for continued pretraining, these studies differ from ours as they do not aim to improve the general capabilities of LMs, train for far fewer tokens, and use much smaller model sizes. The main study which offers a comparative setting to ours is Ibrahim et al. (2024), which provides a recipe, based on learning rate schedule and example replay recommendations, for maintaining general purpose abilities during continued pretraining on data distribution shifts. Their experimental setting consists of a 10B parameter model that was pretrained for 300B tokens. Our study differs from Ibrahim et al. (2024) as we aim to improve the general capabilities of the LM further, and in our experimental setting we perform continued pretraining for up to 1T tokens with a 15B parameter model that was pretrained on 8T tokens.

3 Experimental Setup

The continued pretraining process is as follows: a model is first pretrained, then a data distribution and learning rate schedule are chosen, a continued pretraining run takes place, and finally the (hopefully improved) model is returned. Before delving into the experiments that define the continued training recipe, we detail the datasets and model architecture that are used.

3.1 Data Sources

3.1.1 Pretraining

Our pretraining dataset consists of three different domains of data: English natural language data, multilingual natural language data, and source code data. Table 1 highlights the data sources that compose the pretraining set along with their respective token counts. In our English corpus, the Web Crawl data is sourced from Common Crawl (CC) snapshots while the remaining categories are comprised of high-quality sets. For instance, the miscellaneous category consists of BigScience ROOTS (Lachaux et al., 2020), Reddit, and Pile-Stories (Gao et al., 2020); the encyclopedia category contains Wikipedia and Stack Exchange; and the scientific papers category includes ArXiv and PubMed.

The multilingual dataset consists of 53 languages, with the majority of examples drawn from CC snapshots, although a small portion comes from machine translation parallel corpora (Schwenk et al., 2019; El-Kishky et al., 2019). Lastly, our source code data is drawn from permissively licensed GitHub repositories and totals over 43 languages.
Data type     Data source         Tokens (B)
English       Web Crawl           5,106
              Misc.               179
              News                93
              Scientific Papers   82
              Books               80
              Legal               50
              Encyclopedia        31
              Finance             20
Multilingual  Web Crawl           2,229
              Parallel Corpora    55
Source Code   GitHub              583

Table 1: The pretraining data composition. Appendix A.1 and A.2 break down the multilingual and coding languages.

We pretrain the model for 8T tokens. Given that current state-of-the-art LMs are pretrained for trillions of tokens, we want to experiment on top of a pretrained model that is emblematic of the type of models which the continued pretraining recipe would be used for.

The model has 3.2 billion embedding parameters and 12.5 billion non-embedding parameters. Additional architectural specifications include: 32 transformer layers, a hidden size of 6144, 48 attention heads, Rotary Position Embeddings (RoPE) (Su et al., 2023), squared ReLU activations in the MLP layers, a SentencePiece (Kudo and Richardson, 2018) tokenizer with a vocabulary size of 256k, no bias terms, and untied input-output embeddings. Additionally, we use grouped query attention (GQA) (Ainslie et al., 2023) with 8 KV heads.

The model is pretrained with a sequence length of 4,096 and uses batch size rampup over the first 5% of pretraining tokens, starting from a batch size of 384 and building up to one of 1,152. We use a cosine learning rate schedule, with warmup of 16B tokens, to decay from a maximum learning rate (LR) of ηmax = 4.5e-4 to ηmin = 4.5e-5. We train using the AdamW (Loshchilov and Hutter, 2019) optimizer with β1 = 0.9, β2 = 0.95, and a weight decay of 0.1. In continued pretraining, the only hyperparameter that is altered is the learning rate schedule.
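For concreteness, the following is a minimal sketch of the pretraining schedule described above (16B-token linear warmup, then cosine decay from ηmax = 4.5e-4 to ηmin = 4.5e-5 over 8T tokens). The constants are taken from the text; the function and variable names are our own, not the paper's training code.

```python
import math

ETA_MAX = 4.5e-4        # maximum learning rate
ETA_MIN = 4.5e-5        # minimum learning rate
WARMUP_TOKENS = 16e9    # 16B-token linear warmup
TOTAL_TOKENS = 8e12     # 8T pretraining tokens

def pretraining_lr(tokens_seen: float) -> float:
    """Linear warmup to ETA_MAX, then cosine decay to ETA_MIN over the remaining tokens."""
    if tokens_seen < WARMUP_TOKENS:
        return ETA_MAX * tokens_seen / WARMUP_TOKENS
    progress = (tokens_seen - WARMUP_TOKENS) / (TOTAL_TOKENS - WARMUP_TOKENS)
    return ETA_MIN + 0.5 * (ETA_MAX - ETA_MIN) * (1 + math.cos(math.pi * min(progress, 1.0)))
```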
Recipe

• Start with a data distribution that is similar to the pretraining set but places larger weight on high quality sources, before transitioning to a second distribution that incorporates QA data and upweights sources in areas of model weakness.

• The learning rate schedule should start from ηmin of the pretrained model and decay with cosine annealing to ηmin/100.

• The switch between data distributions should occur at ηmaxct/5 in the learning rate schedule.
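To make the recipe concrete, the sketch below implements its learning rate schedule and blend-switch rule using ηmin = 4.5e-5 from Section 3. The function names, the 300B-token default horizon, and the token-based interface are our own illustrative choices rather than details from the paper.

```python
import math

ETA_MAX_CT = 4.5e-5            # eta_min of the pretrained model becomes the starting LR
ETA_END = ETA_MAX_CT / 100     # recipe: cosine-anneal down to eta_min / 100

def continued_pretraining_lr(tokens_seen: float, total_tokens: float = 300e9) -> float:
    """Cosine annealing from ETA_MAX_CT to ETA_END, with no warmup."""
    progress = min(tokens_seen / total_tokens, 1.0)
    return ETA_END + 0.5 * (ETA_MAX_CT - ETA_END) * (1 + math.cos(math.pi * progress))

def use_qb(tokens_seen: float, total_tokens: float = 300e9) -> bool:
    """Switch from the general blend (GB) to the QA blend (QB) once the LR falls to eta_max_ct / 5."""
    return continued_pretraining_lr(tokens_seen, total_tokens) <= ETA_MAX_CT / 5
```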
5 Experiments

The results of the pretrained base model are shown in Table 3. The aim of our continued training recipe will be to define steps that help maximally improve upon this benchmark. All detailed experiments perform continued pretraining for 300B tokens. Additionally, we note that in our experiments we choose to load in the optimizer state from the pretrained model, as we found that there was a negligible difference in evaluation accuracy when the optimizer state was loaded in or initialized from scratch. Thus, we expect that whether eventual practitioners have the optimizer state of the pretrained model available or not, the resulting findings will hold.

Model        Average Accuracy
Pretrained   48.9

Table 3: Model accuracy after 8T tokens of pretraining. Per-task evaluation scores are shared in Table 12; we find the model particularly struggles on tasks that assess STEM-based reasoning capabilities.

5.1 Data Distribution

A crucial component of any training run is the data distribution – it defines the information which a model sees and directly impacts the model's capabilities. As continued pretraining builds on top of a model which has already seen a given pretraining distribution, it is important to define a data distribution which allows the model to learn new concepts without deviating so far from the pretraining distribution that the model begins to experience training instability and accuracy regression. Through a series of runs which tackle what compositions of data distributions best improve the abilities of a pretrained model, we identify general characteristics that can be applied across most continued pretraining scenarios. In these experiments, we use a learning rate schedule that starts from ηmin and decays to 0 with cosine annealing.

First, we examine if the inclusion of QA data, which improves the ability of a model to extract stored knowledge (Allen-Zhu and Li, 2023), improves model accuracy. Coupled with this question is another on how to best incorporate the QA data, or more generally any dataset which is not contained within the pretraining data distribution, into the continued training run: immediately at the beginning and throughout the entirety of continued training, or rather reserved till the end of continued training following a curriculum learning setup (Soviany et al., 2022; Blakeney et al., 2024). We hypothesize that inclusion of new data sources at the beginning of continued pretraining allows the model to best learn the new information, but may cause learning instabilities that could be mitigated by showing the new dataset at the end of the run when the learning rate is less aggressive. To answer these questions, we compare continued training entirely with the pretraining data blend, entirely with a QA data blend, and with a mix of the pretraining and QA data blends where we start with the pretraining blend and switch to the QA data blend late in the training run. The QA data blend in this scenario adds the QA dataset to the pretraining data distribution with a weight of 10%.

Data Blend                       Avg. Acc.
Pretraining                      51.5
QA                               53.4
Pretraining (250B) + QA (50B)    54.3

Table 4: Using two data distributions, with the QA data appearing in the latter, leads to the largest improvement via continued pretraining. () indicates the number of training tokens for each blend. Per-task evaluation scores are shared in Table 13.

Table 4 illustrates that the incorporation of QA data markedly outperforms solely using existing data from the pretraining set. Additionally, first using the pretraining data blend for the majority of training tokens before transitioning to the QA data blend at the end of continued pretraining exhibits improved accuracy compared to using the QA blend throughout the entirety of training. This indicates that continued pretraining runs should begin with a data distribution which more closely aligns to the pretraining one, followed by a blend that then introduces new data. Moving forward, we refer to the initial blend as the general blend, GB, and the latter blend as the QA blend, QB, and discuss how they can be refined to realize further improvements.
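The following is a minimal sketch of the two-distribution setup from Table 4: a general blend for the first 250B tokens, then a blend that adds the QA data at a 10% weight for the final 50B. The source names and the non-QA weights are placeholders of our own; the paper's actual GB and QB compositions are shown in Figures 1 and 4.

```python
import random

# Placeholder compositions; not the paper's exact blends.
GENERAL_BLEND = {"web_crawl": 0.55, "high_quality_english": 0.25, "multilingual": 0.12, "code": 0.08}
QA_BLEND = {"web_crawl": 0.45, "high_quality_english": 0.25, "multilingual": 0.12, "code": 0.08, "qa": 0.10}

GB_TOKENS = 250e9   # phase 1: stay close to the pretraining distribution
QB_TOKENS = 50e9    # phase 2: introduce the QA data at a 10% weight

def blend_for(tokens_seen: float) -> dict:
    """Return the sampling weights in effect at a given point of the 300B-token run."""
    return GENERAL_BLEND if tokens_seen < GB_TOKENS else QA_BLEND

def sample_source(tokens_seen: float) -> str:
    weights = blend_for(tokens_seen)
    return random.choices(list(weights), weights=list(weights.values()), k=1)[0]
```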
Figure 1: Breakdown of the various distributions considered for the General Blend (GB). We use Upweight Non Web w/ High Quality Web as the GB moving forward given its strong performance across all evaluation areas.

We hypothesize that the optimal GB will be one which places greater emphasis on high quality data sources and areas of model weakness, without deviating too far from the pretraining distribution. Such a blend will enhance knowledge in needed areas and prime the model for the QB blend without worry of experiencing large training instabilities. Figure 1 illustrates the various GB distributions we consider; in addition to upweighting sources of interest, we either subset web crawl to just high quality documents, as identified by being in the bottom quartile of perplexity scores from a KenLM model (Heafield, 2011) trained on Wikipedia, or remove web crawl altogether. Experimenting with the various GB distributions for all 300B tokens of continued training, Table 5 shows that each improves upon the pretraining distribution. Even though it does not achieve the highest average accuracy, we choose Upweight Non Web with High Quality Web as the GB moving forward because, compared to the others, it most consistently achieves high scores across all considered tasks, as shown in Table 13.

Data Blend                              Avg. Acc.
Pretraining                             51.5
Reweight Domains                        51.7
Pretraining w/ High Quality Web         52.5
No Web                                  52.9
Upweight Non Web w/ High Quality Web    52.0

Table 5: Evaluation results of various GB candidate distributions. Per-task evaluation scores are shared in Table 13.
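As an illustration of the high quality web filtering described above, the sketch below scores documents with a Wikipedia-trained KenLM model and keeps the lowest-perplexity quartile. The model path, preprocessing, and helper names are assumptions on our part rather than details taken from the paper.

```python
import kenlm          # Python bindings for KenLM (Heafield, 2011)
import numpy as np

# Hypothetical path to a KenLM n-gram model trained on Wikipedia text.
wiki_lm = kenlm.Model("wikipedia.arpa.bin")

def high_quality_subset(documents: list[str]) -> list[str]:
    """Keep the bottom quartile of documents by Wikipedia-LM perplexity (lower = closer to Wikipedia)."""
    ppl = np.array([wiki_lm.perplexity(doc) for doc in documents])
    cutoff = np.percentile(ppl, 25)
    return [doc for doc, p in zip(documents, ppl) if p <= cutoff]
```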
With a GB distribution in place, we now look to define the QB distribution by first refining the weights placed on the sources within the QA data and then optimizing the QB distribution as a whole. In the initial QB distribution, the QA data was added as is, and this weighting is shown as QA blend 1 in Figure 2. Given that the pretrained model struggles on STEM tasks, we create two additional blends that both upweight the QA STEM data while maintaining the original weight of either the QA world knowledge data (blend 2) or the QA chat data (blend 3), as seen in Figure 2. We choose to maintain the weight on world knowledge and chat information as such examples cover a broad range of topics and help better align model responses to questions, respectively. Table 6 highlights that upon adding each of the QA blends to the initial QB distribution following 250B tokens of the identified GB, the QA data that emphasizes both STEM and chat information leads to the largest improvement in accuracy.
Figure 2: Various distributions of QA data. We use Blend 3.
Figure 4: Breakdown of the various distributions considered for the QB. Ne refers to N epochs of the QA data. The final chosen distribution is shown as QA Blend, which used 2 epochs of QA data.
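For readers interpreting the epoch notation in Figure 4, the small helper below shows the arithmetic we assume it refers to: a source given blend weight w in a phase of T training tokens is repeated for w·T/S epochs, where S is the source's size in tokens. The QA set size used in the example is a made-up placeholder, not the paper's actual number.

```python
def blend_epochs(weight: float, phase_tokens: float, dataset_tokens: float) -> float:
    """Epochs seen by a source with blend weight `weight` over a phase of `phase_tokens` tokens."""
    return weight * phase_tokens / dataset_tokens

# e.g., a 10% QA weight over the 50B-token QB phase on a hypothetical 2.5B-token QA set:
print(blend_epochs(0.10, 50e9, 2.5e9))  # -> 2.0 epochs
```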
This finalizes our continued pretraining recipe. We highlight the utility of this recipe as it allows the model to achieve an average accuracy of 56.1, which improves upon the natural baseline of continued training on the pretraining distribution, as shared in Table 4, by 9%.

6 Ablations

6.1 Varying Token Horizons

We show the efficacy of the identified continued pretraining recipe when used at varying numbers of continued training tokens. Table 10 illustrates that on continued training horizons from 100B to 1T tokens, the identified recipe consistently achieves improved evaluation results – realizing a 16% gain over the pretrained model when using 1T tokens of continued training. We do note that the slope of accuracy improvement from 300B to 1T tokens is lower than that from 100B to 300B tokens. We hypothesize that, as we are mainly reusing documents from the pretraining set when doing a large number of continued training tokens, the repeated epochs on the same data sources have decreasing marginal utility.

Num CPT Tokens    MMLU    Avg. Acc.
0B                59.3    48.9
100B              63.0    55.0
300B              63.8    56.1
1T                65.3    56.8

Table 10: Performance of the continued pretraining (CPT) recipe across different token horizons. Per-task evaluation scores are shared in Table 19.

6.2 Document Mining

In an effort to improve the utility of the data sources that are seen for multiple epochs in long horizon continued pretraining runs, we aim to find a subset of examples that are most helpful for model improvement. As the QA dataset was shown to significantly boost model accuracies, we hypothesize that restricting each pretraining data source to the set of documents which are most similar to the QA examples would be beneficial. To do so, we use the E5-large-v2 (Wang et al., 2022) text embedding model to obtain an embedding for each document in our pretraining and QA sets. Using the Faiss library (Johnson et al., 2017), we efficiently perform a 50-nearest neighbor search across all these embeddings to obtain the 50 most similar, non-QA documents to each example in the QA set. The identified subset of examples constitutes 60B tokens, and we term this approach document mining. Table 11 shows a training run where we replace all non-QA data sources in the QB distribution solely with the examples identified via document mining. We find that these documents substantially improve the performance of the continued pretraining run and believe that document mining is a viable approach for extracting further utility from existing data sources.

Blend                   MMLU    Avg. Acc.
CT 1T                   65.3    56.8
CT 1T w/ Mined Docs     66.6    57.9

Table 11: Mining examples related to QA documents further improves accuracy. Per-task evaluation scores are shared in Table 20.
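The sketch below illustrates one way the document mining step described above could be implemented with E5-large-v2 and Faiss. The helper names, prefixes, and data handling are our own simplifications; in particular, we search only over pretraining documents rather than over all embeddings, which yields the same non-QA neighbors.

```python
import faiss                      # similarity search library (Johnson et al., 2017)
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("intfloat/e5-large-v2")   # E5-large-v2 (Wang et al., 2022)

def mine_documents(qa_examples: list[str], pretraining_docs: list[str], k: int = 50) -> set[int]:
    """Return indices of pretraining documents among the k nearest neighbors of any QA example."""
    # E5 expects "query:" / "passage:" prefixes; normalized embeddings make inner product = cosine.
    doc_emb = embedder.encode(["passage: " + d for d in pretraining_docs], normalize_embeddings=True)
    qa_emb = embedder.encode(["query: " + q for q in qa_examples], normalize_embeddings=True)
    index = faiss.IndexFlatIP(doc_emb.shape[1])
    index.add(doc_emb.astype(np.float32))
    _, neighbors = index.search(qa_emb.astype(np.float32), k)
    return set(neighbors.flatten().tolist())
```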
7 Conclusion

We investigate how to effectively continue training LMs to improve upon their existing capabilities. Our experiments show that it is especially important to carefully define the data distribution and learning rate decay schedule used during continued pretraining so that the model is able to smoothly transition away from the pretraining distribution and better learn the newly emphasized data sources. With these findings we propose a general recipe that model developers can use in order to perform continued pretraining on top of their own LMs, and show that for our base model we are able to improve cumulative accuracy by over 18%. We hope that this will be a starting point to enable future LMs to be developed through the reuse of existing models rather than retraining from scratch.

Limitations

In the development of our continued pretraining recipe, we only experiment along the axes of data distributions and hyperparameter configurations. Although we did not include them within our study, there may be added benefit in exploring other aspects such as altering the learning algorithm. Additionally, given that our study is conducted on top of a model with a given configuration and which was pretrained using a certain data distribution, the results that we highlight are likely to not extrapolate well when used in settings highly divergent from the one utilized in the study. Finally, we limited our goal within continued pretraining to improving the general purpose capabilities of the pretrained model; however, there are many additional angles to consider in model reuse, such as domain specialization and the efficient addition of new knowledge into existing models.
References

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv preprint arXiv:2305.13245.

Zeyuan Allen-Zhu and Yuanzhi Li. 2023. Physics of language models: Part 3.1, knowledge storage and extraction. Preprint, arXiv:2309.14316.

Anthropic. 2024. The Claude 3 Model Family: Opus, Sonnet, Haiku.

Cody Blakeney, Mansheej Paul, Brett W. Larsen, Sean Owen, and Jonathan Frankle. 2024. Does your data spark joy? Performance gains from domain upsampling at the end of training. Preprint, arXiv:2406.03476.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. Preprint, arXiv:2005.14165.

Massimo Caccia, Pau Rodriguez, Oleksiy Ostapenko, Fabrice Normandin, Min Lin, Lucas Caccia, Issam Laradji, Irina Rish, Alexandre Lacoste, David Vazquez, and Laurent Charlin. 2021. Online fast adaptation and knowledge accumulation: a new approach to continual learning. Preprint, arXiv:2003.05856.

Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K. Dokania, Philip H. S. Torr, and Marc'Aurelio Ranzato. 2019. On tiny episodic memories in continual learning. Preprint, arXiv:1902.10486.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. Preprint, arXiv:2107.03374.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling Language Modeling with Pathways. arXiv preprint arXiv:2204.02311.

DeepSeek-AI, Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y. K. Li, Wenfeng Liang, Fangyun Lin, A. X. Liu, Bo Liu, Wen Liu, Xiaodong Liu, Xin Liu, Yiyuan Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma, Xiaotao Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu, Tongzheng Ren, Zehui Ren, Chong Ruan, Zhangli Sha, Zhihong Shao, Junxiao Song, Xuecheng Su, Jingxiang Sun, Yaofeng Sun, Minghui Tang, Bingxuan Wang, Peiyi Wang, Shiyu Wang, Yaohui Wang, Yongji Wang, Tong Wu, Y. Wu, Xin Xie, Zhenda Xie, Ziwei Xie, Yiliang Xiong, Hanwei Xu, R. X. Xu, Yanhong Xu, Dejian Yang, Yuxiang You, Shuiping Yu, Xingkai Yu, B. Zhang, Haowei Zhang, Lecong Zhang, Liyue Zhang, Mingchuan Zhang, Minghua Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Qihao Zhu, and Yuheng Zou. 2024. DeepSeek LLM: Scaling open-source language models with longtermism. Preprint, arXiv:2401.02954.

Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, and Philipp Koehn. 2019. CCAligned: A massive collection of cross-lingual web-document pairs. arXiv preprint arXiv:1911.06154.

Robert M. French. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

Google DeepMind Gemma Team. 2024. Gemma: Open Models Based on Gemini Research and Technology.

Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L. Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort. 2023. Continual pre-training of large language models: How to (re)warm your model? Preprint, arXiv:2308.04014.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.
Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring Massive Multitask Language Understanding. arXiv preprint arXiv:2009.03300.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. Preprint, arXiv:1503.02531.

Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. MiniCPM: Unveiling the potential of small language models with scalable training strategies. Preprint, arXiv:2404.06395.

Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, and Irina Rish. 2024. Simple and scalable strategies to continually pre-train large language models. Preprint, arXiv:2403.08763.

Joel Jang, Seonghyeon Ye, Changho Lee, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, and Minjoon Seo. 2023. TemporalWiki: A lifelong benchmark for training and evaluating ever-evolving language models. Preprint, arXiv:2204.14211.

Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Stanley Jungkyu Choi, and Minjoon Seo. 2022. Towards continual knowledge learning of language models. Preprint, arXiv:2110.03215.

Xisen Jin, Dejiao Zhang, Henghui Zhu, Wei Xiao, Shang-Wen Li, Xiaokai Wei, Andrew Arnold, and Xiang Ren. 2022. Lifelong pretraining: Continually adapting language models to emerging corpora. Preprint, arXiv:2110.08534.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. Preprint, arXiv:1702.08734.

Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. 2023. Continual pre-training of language models. Preprint, arXiv:2302.03241.

Taku Kudo and John Richardson. 2018. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. arXiv preprint arXiv:1808.06226.

Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy Liang. 2019. SPoC: Search-based pseudocode to code. Preprint, arXiv:1906.04908.

Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. 2024. BioMistral: A collection of open-source pretrained large language models for medical domains. Preprint, arXiv:2402.10373.

Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, and Guillaume Lample. 2020. Unsupervised translation of programming languages. Preprint, arXiv:2006.03511.

Timothée Lesort, Massimo Caccia, and Irina Rish. 2022. Understanding continual learning settings with data distribution drift analysis. Preprint, arXiv:2104.01678.

Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, and Weizhu Chen. 2024. Rho-1: Not all tokens are what you need. Preprint, arXiv:2404.07965.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. Preprint, arXiv:1711.05101.

Daniel Loureiro, Francesco Barbieri, Leonardo Neves, Luis Espinosa Anke, and Jose Camacho-Collados. 2022. TimeLMs: Diachronic language models from Twitter. Preprint, arXiv:2202.03829.

Shirong Ma, Shen Huang, Shulin Huang, Xiaobin Wang, Yangning Li, Hai-Tao Zheng, Pengjun Xie, Fei Huang, and Yong Jiang. 2023. EcomGPT-CT: Continual pre-training of e-commerce large language models with semi-structured data. Preprint, arXiv:2312.15696.

OpenAI. 2024. GPT-4 technical report. Preprint, arXiv:2303.08774.

Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Mostofa Patwary, Sandeep Subramanian, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ameya Mahabaleshwarkar, Osvald Nitski, Annika Brundyn, James Maki, Miguel Martinez, Jiaxuan You, John Kamalu, Patrick LeGresley, Denys Fridman, Jared Casper, Ashwath Aithal, Oleksii Kuchaiev, Mohammad Shoeybi, Jonathan Cohen, and Bryan Catanzaro. 2024. Nemotron-4 15B technical report. Preprint, arXiv:2402.16819.

Yujia Qin, Jiajie Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2022. ELLE: Efficient lifelong pre-training for emerging data. Preprint, arXiv:2203.06311.
Anthony V. Robins. 1995. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7:123–146.

David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy P. Lillicrap, and Greg Wayne. 2019. Experience replay for continual learning. Preprint, arXiv:1811.11682.

Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, and Armand Joulin. 2019. CCMatrix: Mining billions of high-quality parallel sentences on the web. arXiv preprint arXiv:1911.04944.

Thomas Scialom, Tuhin Chakrabarty, and Smaranda Muresan. 2022. Fine-tuned language models are continual learners. Preprint, arXiv:2205.12393.

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2022. Language models are multilingual chain-of-thought reasoners. Preprint, arXiv:2210.03057.

Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. 2022. Curriculum learning: A survey. Preprint, arXiv:2101.10382.

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2023. RoFormer: Enhanced transformer with rotary position embedding. Preprint, arXiv:2104.09864.

Gemini Team. 2024. Gemini: A family of highly capable multimodal models. Preprint, arXiv:2312.11805.

Reka Team, Aitor Ormazabal, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu, and Zhihui Xie. 2024. Reka Core, Flash, and Edge: A series of powerful multimodal language models. Preprint, arXiv:2404.12387.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open Foundation and Fine-tuned Chat Models. arXiv preprint arXiv:2307.09288.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533.

Genta Indra Winata, Lingjue Xie, Karthik Radhakrishnan, Shijie Wu, Xisen Jin, Pengxiang Cheng, Mayank Kulkarni, and Daniel Preotiuc-Pietro. 2023. Overcoming catastrophic forgetting in massively multilingual continual learning. Preprint, arXiv:2305.16252.

Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ying Shan, and Ping Luo. 2024. LLaMA Pro: Progressive LLaMA with block expansion. Preprint, arXiv:2401.02415.

Prateek Yadav, Qing Sun, Hantian Ding, Xiaopeng Li, Dejiao Zhang, Ming Tan, Xiaofei Ma, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, Mohit Bansal, and Bing Xiang. 2023. Exploring continual learning for code generation models. Preprint, arXiv:2307.02435.

Xianjun Yang, Junfeng Gao, Wenxin Xue, and Erik Alexandersson. 2024. PLLaMa: An open-source large language model for plant science. Preprint, arXiv:2401.01600.

Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. 2022. CERT: Continual pre-training on sketches for library-oriented code generation. Preprint, arXiv:2206.06888.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In ACL.

Çağatay Yıldız, Nishaanth Kanna Ravichandran, Prishruit Punia, Matthias Bethge, and Beyza Ermis. 2024. Investigating continual pretraining in large language models: Insights and implications. Preprint, arXiv:2402.17400.

A Data

A.1 Multilingual Data

The 53 multilingual languages contained within the pretraining set are: AR, AZ, BG, BN, CA, CS, DA, DE, EL, ES, ET, FA, FI, FR, GL, HE, HI, HR, HU, HY, ID, IS, IT, JA, KA, KK, KN, KO, LT, LV, MK, ML, MR, NE, NL, NO, PL, PT, RO, RU, SK, SL, SQ, SR, SV, TA, TE, TH, TR, UK, UR, VI, and ZH.

A.2 Code Data

The 43 programming languages contained within our pretraining set are: assembly, c, c-sharp, common-lisp, cpp, css, cuda, dart, dockerfile, fortran, go, haskell, html, java, javascript, json, julia, jupyter-scripts, lua, makefile, markdown, mathematica, omniverse, pascal, perl, php, python, R, restructuredtext, ruby, rust, scala, shell, sql, swift, systemverilog, tex, typescript, verilog, vhdl, visual-basic, xml, and yaml.
B Experiments

The evaluation results across all considered tasks are shared below for each of our experiments.

Task                 Pretrained Model
MMLU                 59.3
HellaSwag            80.4
HumanEval            31.1
MGSM (ES, JA, TH)    24.9

Table 12: Model accuracy after 8T tokens of pretraining. We find that the model struggles on STEM-based reasoning tasks due to its low scores on MGSM and the STEM subtasks of MMLU.

B.1 Data Distribution

Table 13 shares the results across all tasks for each experiment mentioned within Section 5.1.

B.2 Learning Rate Schedule

In identifying a learning rate schedule for continued pretraining, we experiment with various degrees of warmup and values of ηmaxct. The combinations we consider are: warmup from ηmin to ηmaxct = 1.5 ∗ ηmin, warmup from 0.5 ∗ ηmin to ηmaxct = ηmin, and warmup from 0 to what the expected learning rate value would be had the pretraining learning rate schedule been extended to incorporate the continued training tokens (i.e., from 8T to 8.3T). We use ηmin to denote the minimum learning rate value of the pretrained model, which is 4.5e-5. Figure 6 highlights each of these schedules, and we note that these combinations were chosen to quantify different degrees of aggressiveness when using warmup in a continued pretraining learning rate schedule.

As highlighted in Table 15, we find that including any level of warmup within the continued training learning rate schedule causes regressions in evaluation accuracies, indicating that it is best to decay directly from ηmin.

In addition to cosine annealing, we experiment with the WSD learning rate scheduler (Hu et al., 2024). Table 16 compares the best found setting of WSD with cosine annealing. The WSD schedule produces significantly lower evaluation accuracies than cosine annealing. We hypothesize that in continued pretraining, switching the decay schedule from the one used during pretraining is harmful. Hence, for models pretrained with cosine annealing, the learning rate schedule in continued training should also use cosine annealing.
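For reference, a generic warmup-stable-decay schedule looks roughly like the sketch below. This is an illustrative assumption in the style of Hu et al. (2024); the exact WSD variant, phase fractions, and decay form compared against in Table 16 are not specified here, so the defaults below are our own choices.

```python
def wsd_lr(tokens_seen: float, total_tokens: float, peak_lr: float, final_lr: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay: linear warmup, a long constant phase, then a short decay to final_lr."""
    warmup_end = warmup_frac * total_tokens
    decay_start = (1.0 - decay_frac) * total_tokens
    if tokens_seen < warmup_end:
        return peak_lr * tokens_seen / warmup_end
    if tokens_seen < decay_start:
        return peak_lr
    progress = min((tokens_seen - decay_start) / (total_tokens - decay_start), 1.0)
    return peak_lr * (final_lr / peak_lr) ** progress   # exponential decay to final_lr
```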
B.3 Switch of Data Distributions

Table 18 highlights that the findings of our experiments in Section 5.3 also hold at the continued pretraining token horizon of 100B tokens. This indicates that regardless of the number of continued training tokens, transitioning between the GB and QB distributions at ηmaxct/5 is optimal.

C Ablations

C.1 Varying Token Horizons

When extending the number of continued pretraining tokens to 1T, we found that our existing QB distribution would cause the small QA dataset to be trained on for a large number of epochs. To correct for this, we reduce the weight on the QA dataset so that it would be trained on for no more than 4 epochs. Figure 7 demonstrates the distribution of the QB when used at the scale of 1T continued pretraining tokens.
Data Blend                              MMLU    HellaSwag    HumanEval    MGSM (ES, JA, TH)
Pretraining                             61.9    81.2         28.1         34.7
QA                                      62.0    78.7         32.9         40.1
Pretraining (250B) + QA (50B)           62.6    82.2         29.9         42.4
Pretraining                             61.9    81.2         28.1         34.7
Reweight Domains                        61.9    81.7         29.9         33.2
Pretraining w/ High Quality Web         62.2    80.9         34.1         32.9
No Web                                  62.3    81.8         29.9         37.7
Upweight Non Web w/ High Quality Web    62.6    81.4         31.7         32.1
QA 1                                    63.0    82.4         29.9         41.9
QA 2 (+STEM, +World Knowledge)          63.9    82.3         29.3         36.7
QA 3 (+STEM, +Chat)                     64.1    82.2         28.7         44.7
QA                                      64.2    82.4         30.5         44.5
QA w/ Upweighted STEM                   64.1    82.3         28.1         42.9
QA w/ 1.5e QA data                      64.1    82.2         28.7         44.7
QA w/ 3.5e QA data                      64.4    82.4         27.4         43.3

Table 13: Per-task evaluation results of each experiment mentioned within Section 5.1 on defining data distributions for continued pretraining.

LR Schedule              MMLU    HellaSwag    HumanEval    MGSM (ES, JA, TH)    Avg. Acc.
Warmup to 6.75e-5        64.0    81.9         31.1         42.3                 54.8
Warmup to 4.5e-5         64.0    82.1         32.9         41.5                 55.1
Warmup to Expected LR    63.3    82.1         31.7         42.5                 54.9
No Warmup                64.2    82.2         31.1         45.2                 55.7

Table 15: Comparison of including warmup within learning rate schedules for continued pretraining. No warmup achieves the best evaluation results.

LR Schedule         MMLU    HellaSwag    HumanEval    MGSM (ES, JA, TH)    Avg. Acc.
WSD                 63.6    80.2         28.1         39.5                 52.8
Cosine Annealing    64.2    82.2         31.1         45.2                 55.7

Table 16: We find that WSD causes significant regression in evaluation accuracy compared to cosine annealing. Both learning rate schedules were decayed to ηmaxct/100.
Distribution Switch        MMLU    HellaSwag    HumanEval    MGSM (ES, JA, TH)
At ηmaxct (from step 0)    65.0    78.7         29.9         37.7
At ηmaxct/2                60.9    81.6         32.3         44.1
At ηmaxct/5                63.8    82.2         32.3         46.1
At ηmaxct/10               63.9    82.2         29.3         44.7
At ηmaxct/50               63.3    81.6         31.1         42.3

Table 17: Per-task evaluation results of the experiments mentioned in Table 9 on how to switch between data distributions in continued pretraining.
Distribution Switch        MMLU    HellaSwag    HumanEval    MGSM (ES, JA, TH)    AVG
At ηmaxct (from step 0)    64.1    79.2         31.1         40.0                 53.6
At ηmaxct/2                63.2    81.6         27.4         44.1                 54.1
At ηmaxct/5                63.0    81.9         31.7         43.6                 55.0
At ηmaxct/10               63.6    81.8         30.5         39.7                 53.9
At ηmaxct/50               63.3    81.6         31.1         42.3                 54.6

Table 18: Ablation of the data distribution switch experiments at a continued pretraining scale of 100B tokens. As found for the 300B token continued training horizon, switching distributions at ηmaxct/5 achieves the highest accuracy.
Figure 7: Distribution of the QB blend when extending the number of continued pretraining tokens to 1T.
Num CT Tokens    MMLU    HellaSwag    HumanEval    MGSM (ES, JA, TH)    AVG
0B               59.3    80.4         31.1         24.9                 48.9
100B             63.0    81.9         31.7         43.6                 55.0
300B             63.8    82.2         32.3         46.1                 56.1
1T               65.3    82.4         34.1         45.5                 56.8

Table 19: Per-task evaluation results of the experiments mentioned in Table 10 on how the identified continued pretraining recipe performs at varying amounts of continued training tokens.