Training Objectives & Architectures: BERT, GPT, T5, BART & XLNet, Comprehensively Compared
(self-supervised) Pre-Training → Fine-Tuning on Multiple Downstream Tasks
[Figure: self-supervised pre-training on large corpora (Wikipedia, books, Quora, Reddit, web pages, etc.) yields one base model; each downstream task is then fine-tuned separately, ending up with its own θ_i, D_i, arch_i.]
Downstream tasks:
• Generative: summarization, QA
• Classification: sentiment analysis
• Information retrieval
• …
Pre-training objectives: Auto-Encoding (AE, de-noising) for the Encoder; Auto-Regressive (AR) for the Decoder. (A minimal sketch of the pre-train/fine-tune paradigm follows.)
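To make the paradigm concrete, here is a purely illustrative PyTorch sketch (an assumption of this write-up, not code from the slides): one shared pre-trained encoder is copied and fine-tuned separately for each task, so every task ends up with its own parameters, data, and head (θ_i, D_i, arch_i). The tiny encoder, random batches, and task heads are all made up for illustration.

```python
# Illustrative sketch of "one pre-training, many separate fine-tunings".
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in for a model pre-trained self-supervised on Wikipedia, books, etc."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

    def forward(self, ids):
        return self.block(self.emb(ids)).mean(dim=1)  # pooled sequence representation

pretrained = TinyEncoder()  # pretend these weights (theta) came from pre-training

# Each downstream task i gets its own head (arch_i) and its own data (D_i).
task_heads = {"sentiment": nn.Linear(64, 2), "retrieval": nn.Linear(64, 16)}

for task, head in task_heads.items():
    encoder = TinyEncoder()
    encoder.load_state_dict(pretrained.state_dict())  # start every task from the shared theta
    opt = torch.optim.AdamW(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)

    ids = torch.randint(0, 1000, (8, 16))               # fake batch standing in for D_i
    labels = torch.randint(0, head.out_features, (8,))
    loss = F.cross_entropy(head(encoder(ids)), labels)  # one task-specific fine-tuning step
    loss.backward()
    opt.step()
    print(task, float(loss))                            # each copy now drifts into its own theta_i
```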
[Figure: task taxonomy (Img Src: Refs[8]). User-facing tasks: toxicity detection, dialogue, … Intermediate tasks: named entity recognition, part-of-speech tagging, syntactic parsing, information extraction, …]
T5 (Text-To-Text Transfer Transformer) (2019.10)
[Figure: the T5 text-to-text framework (Img Src: Refs[6]).]
De-noising example: "___ is Canadian National Day?" → "When". Can any NLP problem be converted into a generative, text-to-text problem? (A sketch of the task-prefix convention follows.)
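As one way to illustrate the text-to-text idea, a small sketch using the Hugging Face transformers library (the library choice and the `t5-small` checkpoint are assumptions of this sketch, not something the slides prescribe); the task prefixes follow the convention reported in the T5 paper, and the example sentences are made up.

```python
# Sketch: different NLP problems cast as "text in, text out" with T5-style task prefixes.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompts = [
    "translate English to German: The house is wonderful.",          # translation
    "summarize: T5 casts every NLP task as text in, text out ...",    # summarization
    "cola sentence: The course is jumping well.",                     # acceptability classification
]
for prompt in prompts:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=20)                      # every task becomes generation
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```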
[Figure: the Transformer encoder-decoder architecture: N× stacked layers, attention over Query, Key, and Value (Img Src: Refs[1]).]
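The Query/Key/Value block in the figure is scaled dot-product attention; a minimal PyTorch sketch, with made-up shapes and a causal mask included to show how the decoder restricts attention to earlier positions:

```python
# Minimal scaled dot-product attention, the Query/Key/Value block of the Transformer.
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))    # similarity of queries and keys
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))   # e.g. causal mask in the decoder
    return scores.softmax(dim=-1) @ v                           # weighted sum of values

q = k = v = torch.randn(2, 5, 16)
causal = torch.tril(torch.ones(5, 5))                           # lower-triangular: no peeking ahead
print(scaled_dot_product_attention(q, k, v, causal).shape)      # torch.Size([2, 5, 16])
```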
A natural seq2seq model (autoregressive objective): the encoder processes the whole source sequence in parallel, while the decoder generates the output autoregressively, one token after another (e.g. having produced "Legumes partagent des ressources", it predicts "avec" next). [Figure, Img Src: Refs[1]]
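A toy PyTorch sketch of that control flow, assuming nothing from the slides beyond the idea itself: the encoder runs once over the whole source in parallel, while the decoder emits one token per step, each step conditioned on everything generated so far. The vocabulary and weights are random, so the generated ids are meaningless; only the loop structure matters.

```python
# Parallel encoder pass + one-token-at-a-time (autoregressive) decoding.
import torch
import torch.nn as nn

vocab, d = 100, 32
emb = nn.Embedding(vocab, d)
proj = nn.Linear(d, vocab)
transformer = nn.Transformer(d_model=d, nhead=4, num_encoder_layers=2,
                             num_decoder_layers=2, batch_first=True)

src = torch.randint(0, vocab, (1, 6))            # whole source sentence, encoded in parallel
memory = transformer.encoder(emb(src))

out = torch.tensor([[1]])                        # start-of-sequence token (id 1, arbitrary)
for _ in range(5):                               # decoder: one token after another
    causal = transformer.generate_square_subsequent_mask(out.size(1))
    dec = transformer.decoder(emb(out), memory, tgt_mask=causal)
    next_tok = proj(dec[:, -1]).argmax(-1, keepdim=True)   # greedy pick of the next token
    out = torch.cat([out, next_tok], dim=1)
print(out)                                       # token ids only; random weights, meaningless output
```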
GPT: remove the cross-attention layer (keep only the decoder).
Pre-training (Img Src: Refs[2,3])
§ Auto-Encoding (AE): de-noising, i.e. predicting masked tokens; variants: T5, BART
§ Auto-Regressive (AR): generating tokens one after another; variant: XLNet (permutation)
(A toy comparison of the two losses follows.)
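A toy sketch contrasting how the two objectives compute their losses. The random logits stand in for a real model's outputs (with a real AE model the logits would be computed from the corrupted input, and with a real AR model from the left context only); the mask positions and vocabulary size are arbitrary.

```python
# Toy comparison of the AE (masked de-noising) and AR (next-token) training losses.
import torch
import torch.nn.functional as F

vocab, L = 50, 8
tokens = torch.randint(0, vocab, (1, L))
logits = torch.randn(1, L, vocab)                 # pretend per-position model outputs

# Auto-Encoding (AE): corrupt some positions, predict only the masked tokens (BERT-style).
MASK = vocab - 1
mask_pos = torch.tensor([2, 5])
corrupted = tokens.clone()
corrupted[0, mask_pos] = MASK                     # a real model would see this corrupted input
ae_loss = F.cross_entropy(logits[0, mask_pos], tokens[0, mask_pos])

# Auto-Regressive (AR): predict token t from tokens < t, left to right (GPT-style).
ar_loss = F.cross_entropy(logits[0, :-1], tokens[0, 1:])

print(float(ae_loss), float(ar_loss))
```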
Fine-tuning: BERT vs. GPT-1 [Figure, Img Src: Refs[2,3]]
[Figures: GPT-1, GPT-2, and GPT-3 (Img Src: Refs[2,3,5]).]
T5, BART: Encoder-Decoder Architecture
De-noising seq2seq pre-training (AE encoder + AR decoder) matches the seq2seq generative form of (user-facing) downstream tasks, so there is less discrepancy between pre-training and downstream use (a span-corruption sketch follows the list below).
Pros:
1. Separate prompts (input) & generation (output)
2. Bidirectional context in the encoder
3. Cross-attention between encoder and decoder
Solves the AR limitation of left-to-right-only context learning (XLNet targets the same limitation with its permutation objective).
Con: self-supervised pre-training text is inconvenient to break into an input and an output.
(Google's PaLM-E)
[Figure: encoder-decoder pre-training and downstream use (Img Src: Refs[1,6]).]
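To show how a de-noising seq2seq pair cleanly separates an encoder input from a decoder target, a small sketch of single-span corruption; the sentinel names follow T5's <extra_id_i> convention, while the sentence, span length, and single-span simplification are assumptions made for illustration (BART's noising functions differ in detail).

```python
# Building one de-noising seq2seq training pair by corrupting a span (T5-style sentinels).
import random

def corrupt(tokens, span_len=2, seed=0):
    random.seed(seed)
    start = random.randrange(len(tokens) - span_len)
    enc_input = tokens[:start] + ["<extra_id_0>"] + tokens[start + span_len:]
    dec_target = ["<extra_id_0>"] + tokens[start:start + span_len] + ["<extra_id_1>"]
    return " ".join(enc_input), " ".join(dec_target)

sentence = "the encoder reads corrupted text and the decoder restores the missing span"
src, tgt = corrupt(sentence.split())
print("encoder input :", src)    # the "prompt" side
print("decoder target:", tgt)    # the "generation" side
```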
Fine-Tune with Human Feedback → InstructGPT
1. Supervised Fine-Tuning (SFT): labeler demonstrations (a minimal SFT sketch follows)
[Figure: InstructGPT fine-tuning, step 1 (Img Src: Refs[9]).]
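A minimal sketch of the SFT step, assuming the Hugging Face transformers library and using GPT-2 as a stand-in for the actual base model; the single prompt/demonstration pair echoes the example shown in the InstructGPT paper, and real SFT would of course loop over a full demonstration dataset.

```python
# Supervised Fine-Tuning (SFT) on one (prompt, labeler demonstration) pair, GPT-2 as stand-in.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "Explain the moon landing to a 6 year old.\n"
demonstration = "People went to the moon, and they took pictures of what they saw."

ids = tokenizer(prompt + demonstration, return_tensors="pt").input_ids
labels = ids.clone()
prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.size(1)
labels[:, :prompt_len] = -100                       # score only the demonstration tokens

loss = model(ids, labels=labels).loss               # ordinary causal-LM loss on the labeler's answer
loss.backward()
optimizer.step()
print(float(loss))
```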
[Figure: Conventional Fine-Tuning vs. InstructGPT Fine-Tuning. Conventional: one Pre-Training followed by a separate Fine-Tuning per task, each task ending up with its own θ_i, D_i, arch_i. InstructGPT: one Pre-Training followed by a single Fine-Tuning on various prompts, ending up with one unifying network setting (θ, D).]
§ An AutoRegressive (AR) model, as a generative model, has less discrepancy with downstream tasks.
§ Fine-tuning on human feedback yields a unifying network setting.
1. Vaswani et al., 2017. "Attention Is All You Need"
2. Devlin et al., 2018. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
3. Radford et al., 2018. "Improving Language Understanding by Generative Pre-Training"
4. Radford et al., 2019. "Language Models are Unsupervised Multitask Learners"
5. Brown et al., 2020. "Language Models are Few-Shot Learners"
6. Raffel et al., 2019. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
7. Dai et al., 2019. "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context"
8. Liang et al., 2022. "Holistic Evaluation of Language Models"
9. Ouyang et al., 2022. "Training Language Models to Follow Instructions with Human Feedback"
10. My blog: medium.com/@yulemoon/an-in-depth-look-at-the-transformer-based-models-22e5f5d17b6b
11. Cover page image: https://www.lifestyleasia.com/sg/tech/google-bard-vs-open-ai-chatgpt-which-chatbot-is-better-and-why/
XLNet permutation notation:
z denotes a permutation in the set Z_T, which contains all possible permutations of the index sequence [1, ..., T] of a text sequence x of length T. The t-th token under permutation z is denoted x_{z_t}, and the tokens preceding the t-th position are denoted x_{z_{<t}}.
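For reference, the permutation language-modeling objective that this notation supports, as given in the XLNet paper cited below, written out in LaTeX:

$$\max_{\theta}\;\mathbb{E}_{z\sim\mathcal{Z}_T}\Big[\sum_{t=1}^{T}\log p_{\theta}\big(x_{z_t}\mid x_{z_{<t}}\big)\Big]$$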
Yang et al., 2019. "XLNet: Generalized Autoregressive Pretraining for Language Understanding"