Universal Language Model Fine-tuning for Text Classification
Abstract

Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a language model. Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100× more data. We open-source our pretrained models and code.¹

¹ http://nlp.fast.ai/ulmfit.
* Equal contribution. Jeremy focused on the algorithm development and implementation, Sebastian focused on the experiments and writing.

1 Introduction

Inductive transfer learning has had a large impact on computer vision (CV). Applied CV models (including object detection, classification, and segmentation) are rarely trained from scratch, but instead are fine-tuned from models that have been pretrained on ImageNet, MS-COCO, and other datasets (Sharif Razavian et al., 2014; Long et al., 2015a; He et al., 2016; Huang et al., 2017).

Text classification is a category of Natural Language Processing (NLP) tasks with real-world applications such as spam, fraud, and bot detection (Jindal and Liu, 2007; Ngai et al., 2011; Chu et al., 2012), emergency response (Caragea et al., 2011), and commercial document classification, such as for legal discovery (Roitblat et al., 2010).

While Deep Learning models have achieved state-of-the-art on many NLP tasks, these models are trained from scratch, requiring large datasets, and days to converge. Research in NLP focused mostly on transductive transfer (Blitzer et al., 2007). For inductive transfer, fine-tuning pretrained word embeddings (Mikolov et al., 2013), a simple transfer technique that only targets a model's first layer, has had a large impact in practice and is used in most state-of-the-art models. Recent approaches that concatenate embeddings derived from other tasks with the input at different layers (Peters et al., 2017; McCann et al., 2017; Peters et al., 2018) still train the main task model from scratch and treat pretrained embeddings as fixed parameters, limiting their usefulness.

In light of the benefits of pretraining (Erhan et al., 2010), we should be able to do better than randomly initializing the remaining parameters of our models. However, inductive transfer via fine-tuning has been unsuccessful for NLP (Mou et al., 2016). Dai and Le (2015) first proposed fine-tuning a language model (LM) but require millions of in-domain documents to achieve good performance, which severely limits its applicability.

We show that it is not the idea of LM fine-tuning but our lack of knowledge of how to train LMs effectively that has been hindering wider adoption: LMs overfit to small datasets and suffer catastrophic forgetting when fine-tuned with a classifier. Compared to CV models, NLP models are typically shallower and thus require different fine-tuning methods.

We propose a new method, Universal Language Model Fine-tuning (ULMFiT), that addresses these issues and enables robust inductive transfer learning for any NLP task, akin to fine-tuning ImageNet models: the same 3-layer LSTM architecture, with the same hyperparameters and no additions other than tuned dropout hyperparameters, outperforms highly engineered models and transfer learning approaches on six widely studied text classification tasks. On IMDb, with 100 labeled examples, ULMFiT matches the performance of training from scratch with 10× more data and, given 50k unlabeled examples, with 100× more data.

Contributions Our contributions are the following: 1) We propose Universal Language Model Fine-tuning (ULMFiT), a method that can be used to achieve CV-like transfer learning for any NLP task. 2) We propose discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing, novel techniques to retain previous knowledge and avoid catastrophic forgetting during fine-tuning. 3) We significantly outperform the state-of-the-art on six representative text classification datasets, with an error reduction of 18-24% on the majority of datasets. 4) We show that our method enables extremely sample-efficient transfer learning and perform an extensive ablation analysis. 5) We make the pretrained models and our code available to enable wider adoption.

2 Related work

Transfer learning in CV Features in deep neural networks in CV have been observed to transition from general to task-specific from the first to the last layer (Yosinski et al., 2014). For this reason, most work in CV focuses on transferring the first layers of the model (Long et al., 2015b). Sharif Razavian et al. (2014) achieve state-of-the-art results using features of an ImageNet model as input to a simple classifier. In recent years, this approach has been superseded by fine-tuning either the last (Donahue et al., 2014) or several of the last layers of a pretrained model and leaving the remaining layers frozen (Long et al., 2015a).

Hypercolumns In NLP, only recently have methods been proposed that go beyond transferring word embeddings. The prevailing approach is to pretrain embeddings that capture additional context via other tasks. Embeddings at different levels are then used as features, concatenated either with the word embeddings or with the inputs at intermediate layers. This method is known as hypercolumns (Hariharan et al., 2015) in CV² and is used by Peters et al. (2017), Peters et al. (2018), Wieting and Gimpel (2017), Conneau et al. (2017), and McCann et al. (2017), who use language modeling, paraphrasing, entailment, and Machine Translation (MT) respectively for pretraining. Specifically, Peters et al. (2018) require engineered custom architectures, while we show state-of-the-art performance with the same basic architecture across a range of tasks. In CV, hypercolumns have been nearly entirely superseded by end-to-end fine-tuning (Long et al., 2015a).

² A hypercolumn at a pixel in CV is the vector of activations of all CNN units above that pixel. In analogy, a hypercolumn for a word or sentence in NLP is the concatenation of embeddings at different layers in a pretrained model.

Multi-task learning A related direction is multi-task learning (MTL) (Caruana, 1993). This is the approach taken by Rei (2017) and Liu et al. (2018), who add a language modeling objective to the model that is trained jointly with the main task model. MTL requires the tasks to be trained from scratch every time, which makes it inefficient and often requires careful weighting of the task-specific objective functions (Chen et al., 2017).

Fine-tuning Fine-tuning has been used successfully to transfer between similar tasks, e.g. in QA (Min et al., 2017), for distantly supervised sentiment analysis (Severyn and Moschitti, 2015), or between MT domains (Sennrich et al., 2015), but has been shown to fail between unrelated ones (Mou et al., 2016). Dai and Le (2015) also fine-tune a language model, but overfit with 10k labeled examples and require millions of in-domain documents for good performance. In contrast, ULMFiT leverages general-domain pretraining and novel fine-tuning techniques to prevent overfitting even with only 100 labeled examples and achieves state-of-the-art results also on small datasets.

3 Universal Language Model Fine-tuning

We are interested in the most general inductive transfer learning setting for NLP (Pan and Yang, 2010): given a static source task T_S and any target task T_T with T_S ≠ T_T, we would like to improve performance on T_T. Language modeling can be seen as the ideal source task and a counterpart of ImageNet for NLP: it captures many facets of language relevant for downstream tasks, such as long-term dependencies (Linzen et al., 2016), hierarchical relations (Gulordava et al., 2018), and sentiment (Radford et al., 2017). In contrast to tasks like MT (McCann et al., 2017) and entailment (Conneau et al., 2017), it provides data in near-unlimited quantities for most domains and languages. Additionally, a pretrained LM can be easily adapted to the idiosyncrasies of a target task.
Figure 1: ULMFiT consists of three stages: a) The LM is trained on a general-domain corpus to capture
general features of the language in different layers. b) The full LM is fine-tuned on target task data using
discriminative fine-tuning (‘Discr’) and slanted triangular learning rates (STLR) to learn task-specific
features. c) The classifier is fine-tuned on the target task using gradual unfreezing, ‘Discr’, and STLR to
preserve low-level representations and adapt high-level ones (shaded: unfreezing stages; black: frozen).
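
To make the three stages concrete, the following is a minimal sketch in plain PyTorch of the procedure the caption describes. The tiny SmallLM and SmallClassifier modules, the random token tensors, and the learning rates are illustrative placeholders only; they stand in for the AWD-LSTM language model, the general-domain pretraining corpus, and the training details of the paper, and are not the released implementation.

import torch
import torch.nn as nn

vocab_size, emb_dim, hidden, n_classes = 1000, 32, 64, 2

class SmallLM(nn.Module):
    """Stand-in for the 3-layer LSTM language model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden, num_layers=3, batch_first=True)
        self.decoder = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        out, _ = self.rnn(self.embed(x))
        return self.decoder(out)

def train_lm(model, tokens, lr):
    """One step of next-token prediction on a (batch, seq_len) tensor of token ids."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss = nn.CrossEntropyLoss()(
        model(tokens[:, :-1]).reshape(-1, vocab_size),
        tokens[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

lm = SmallLM()

# a) General-domain LM pretraining (random ids stand in for a large corpus).
train_lm(lm, torch.randint(0, vocab_size, (8, 21)), lr=1e-3)

# b) Target task LM fine-tuning on in-domain text.
train_lm(lm, torch.randint(0, vocab_size, (8, 21)), lr=1e-4)

# c) Classifier fine-tuning: reuse the fine-tuned embedding and LSTM, add a task head.
#    The paper's pooling over hidden states is omitted here for brevity.
class SmallClassifier(nn.Module):
    def __init__(self, lm):
        super().__init__()
        self.embed, self.rnn = lm.embed, lm.rnn
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):
        out, _ = self.rnn(self.embed(x))
        return self.head(out[:, -1])  # last hidden state only

clf = SmallClassifier(lm)

In the full method, stages b) and c) additionally use the discriminative learning rates of Eq. (2) below together with slanted triangular learning rates, and stage c) unfreezes one layer group at a time (gradual unfreezing) instead of training all layers at once.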
Discriminative fine-tuning ('Discr') uses a separate learning rate η^l for the parameters θ^l of each layer l, so that the SGD update for layer l becomes

\theta_t^l = \theta_{t-1}^l - \eta^l \cdot \nabla_{\theta^l} J(\theta) \quad (2)
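
As an illustration only, Eq. (2) maps naturally onto per-group learning rates in a PyTorch optimizer. The modules and the coarse embedding/LSTM/head grouping below are assumptions for the sketch, not the paper's exact layer split; the 2.6 factor follows the paper's suggestion of giving each lower layer the rate of the layer above divided by 2.6.

import torch
import torch.nn as nn

# Placeholder layer groups, ordered from lowest to highest layer.
embed = nn.Embedding(1000, 32)
lstm = nn.LSTM(32, 64, num_layers=3, batch_first=True)
head = nn.Linear(64, 2)

groups = [embed, lstm, head]
top_lr, factor = 1e-3, 2.6  # eta^{l-1} = eta^l / 2.6
param_groups = [
    {"params": g.parameters(), "lr": top_lr / factor ** (len(groups) - 1 - i)}
    for i, g in enumerate(groups)
]
optimizer = torch.optim.SGD(param_groups, lr=top_lr)
# Each optimizer.step() now applies theta^l_t = theta^l_{t-1} - eta^l * grad_l J(theta) per group.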
IMDb                                         TREC-6
oh-LSTM (Johnson and Zhang, 2016)   5.9      TBCNN (Mou et al., 2015)       4.0
Virtual (Miyato et al., 2016)       5.9      LSTM-CNN (Zhou et al., 2016)   3.9
ULMFiT (ours)                       4.6      ULMFiT (ours)                  3.6

Table 2: Test error rates (%) on two text classification datasets used by McCann et al. (2017).
Table 3: Test error rates (%) on text classification datasets used by Johnson and Zhang (2017).
References

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised Sequence Learning. In Advances in Neural Information Processing Systems (NIPS '15). http://arxiv.org/abs/1511.01432.

Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2014. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pages 647–655.

Timothy Dozat and Christopher D. Manning. 2017. Deep Biaffine Attention for Neural Dependency Parsing. In Proceedings of ICLR 2017.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research 11(Feb):625–660.

Rie Johnson and Tong Zhang. 2016. Supervised and semi-supervised text categorization using LSTM for region embeddings. In International Conference on Machine Learning, pages 526–534.

Rie Johnson and Tong Zhang. 2017. Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 562–570.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. arXiv preprint arXiv:1611.01368.

Liyuan Liu, Jingbo Shang, Frank Xu, Xiang Ren, Huan Gui, Jian Peng, and Jiawei Han. 2018. Empower sequence labeling with task-aware neural language model. In Proceedings of AAAI 2018.

Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015a. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440.

Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. 2015b. Learning Transferable Features with Deep Adaptation Networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML '15), volume 37.

Ilya Loshchilov and Frank Hutter. 2017. SGDR: Stochastic Gradient Descent with Warm Restarts. In Proceedings of the International Conference on Learning Representations 2017.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 142–150.

Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. 2018. Exploring the Limits of Weakly Supervised Pretraining.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in Translation: Contextualized Word Vectors. In Advances in Neural Information Processing Systems.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017a. Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017b. Pointer Sentinel Mixture Models. In Proceedings of the International Conference on Learning Representations 2017.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems.

Sewon Min, Minjoon Seo, and Hannaneh Hajishirzi. 2017. Question Answering through Transfer Learning from Large Fine-grained Supervision Data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers).

Takeru Miyato, Andrew M. Dai, and Ian Goodfellow. 2016. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725.

Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2016. How Transferable are Neural Networks in NLP Applications? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

Lili Mou, Hao Peng, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2015. Discriminative neural sentence modeling by tree-based convolution. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

EWT Ngai, Yong Hu, YH Wong, Yijun Chen, and Xin Sun. 2011. The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569.

Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10):1345–1359.

Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of ACL 2017.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL 2018.

Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. 2017. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444.

Marek Rei. 2017. Semi-supervised multitask learning for sequence labeling. In Proceedings of ACL 2017.

Herbert L. Roitblat, Anne Kershaw, and Patrick Oot. 2010. Document categorization in legal electronic discovery: computer classification vs. manual review. Journal of the Association for Information Science and Technology 61(1):70–80.

Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.

Ruslan Salakhutdinov and Geoffrey Hinton. 2009. Deep Boltzmann machines. In Artificial Intelligence and Statistics, pages 448–455.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

Aliaksei Severyn and Alessandro Moschitti. 2015. UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment Classification. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 464–469.

Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. 2014. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806–813.

Leslie N. Smith. 2017. Cyclical learning rates for training neural networks. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pages 464–472.

Vladimir Naumovich Vapnik and Samuel Kotz. 1982. Estimation of dependences based on empirical data, volume 40. Springer-Verlag New York.

Ellen M. Voorhees and Dawn M. Tice. 1999. The TREC-8 question answering track evaluation. In TREC, volume 1999, page 82.

John Wieting and Kevin Gimpel. 2017. Revisiting Recurrent Networks for Paraphrastic Sentence Embeddings. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017).

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.

Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu. 2016. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. In Proceedings of COLING 2016.