Training Objectives & Architectures: BERT, GPT, T5, BART & XLNet: Comprehensively Compared


Training Objectives & Architectures

BERT, GPT, T5, BART & XLNet: Comprehensively Compared

Presenter: Yule Wang, PhD


My Relevant Blog
§ Pre-training, Fine-tuning
§ Transformer
– encoder, decoder
§ Pre-training Objectives
– Auto-Encoding, Auto-Regressive
§ Emergence of In-Context Learning
§ Unifying Multi-Tasks?
§ Fine-tuning by InstructGPT

[Figure: (self-supervised) pre-training on web-scale corpora (Wikipedia, books, Quora, Reddit, webpages, etc.), followed by a separate fine-tuning run for each of multiple downstream tasks, each with its own parameters θ_i, dataset D_i and architecture arch_i.]

Downstream tasks:
• Generative: summarization, QA
• Classification: sentiment analysis
• Information retrieval
• …

Pre-training objectives:
• Auto-Encoding (AE), de-noising: encoder
• Auto-Regressive (AR): decoder
User-Facing Tasks
• Toxicity detection
• Dialogue

Intermediate Tasks
• Named Entity Recognition
• Part-of-Speech Tagging
• Syntactic Parsing
• Information extraction
• …

Img Src: Refs[8]
[Figure: the same pipeline, now with a single parameter set θ and dataset D shared across tasks: (self-supervised) pre-training with an Auto-Encoding (AE) encoder or an Auto-Regressive (AR) decoder, followed by fine-tuning.]
T5 (Unifying Text-To-Text Transfer Transformer) (2019.10)

Every task is cast as text-to-text, even classification.

"___ is Canadian National Day?" → "When"

Can any NLP problem be converted into a generative problem?

Img Src: Refs[6]
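A minimal sketch of the text-to-text idea (the task prefixes, label strings and example data below are illustrative, not T5's exact preprocessing): every task, including classification, becomes an (input text, target text) pair, so one seq2seq model can serve them all.

```python
# A minimal sketch (not T5's exact preprocessing) of casting different NLP
# tasks into the same text-to-text format: every example becomes an
# (input_text, target_text) pair for a single seq2seq model.

def to_text_to_text(task: str, example: dict) -> tuple[str, str]:
    if task == "translation_en_fr":
        return ("translate English to French: " + example["en"], example["fr"])
    if task == "sentiment":
        # Classification: the label itself is generated as text.
        return ("sentiment: " + example["sentence"],
                "positive" if example["label"] == 1 else "negative")
    if task == "summarization":
        return ("summarize: " + example["document"], example["summary"])
    raise ValueError(f"unknown task: {task}")

print(to_text_to_text("sentiment", {"sentence": "A great movie.", "label": 1}))
# ('sentiment: A great movie.', 'positive')
```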
Original Aim -- Translation:

"Legumes share resources with nitrogen-fixing bacteria." →
"Legumes partagent des ressources avec des bactéries azotantes." (French)

[Figure: the Transformer encoder-decoder; the encoder reads the source sentence and the decoder generates the translation. Each attention layer computes Query, Key and Value projections.]
Img Src: Refs[1]
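The Query/Key/Value mechanism can be sketched in a few lines; this is a single-head, unmasked version for clarity, not the full multi-head implementation of Refs[1].

```python
# A minimal sketch of scaled dot-product attention (single head, no masking or dropout).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d), K: (n_k, d), V: (n_k, d_v) -> (n_q, d_v)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                            # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                                       # weighted sum of values

# In self-attention Q, K, V all come from the same sequence; in the decoder's
# cross-attention, Q comes from the decoder and K, V from the encoder output.
Q = np.random.randn(4, 8); K = np.random.randn(6, 8); V = np.random.randn(6, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```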
A natural seq2seq model (Autoregressive Objective)

[Figure: the decoder generates the French translation one token at a time, each step conditioned on the encoder output and the tokens generated so far:
Legumes → Legumes partagent → Legumes partagent des → Legumes partagent des ressources → … → Legumes partagent des ressources avec des bactéries azotantes.
During training, all of these next-token predictions are computed in parallel.]
Img Src: Refs[1]
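A minimal sketch of why the training-time computation is parallel: with teacher forcing, the whole target sequence is fed to the decoder at once, and a causal mask blocks attention to future positions, so every next-token prediction comes out of a single forward pass (the toy mask below assumes a target length of 5).

```python
# Causal (look-ahead) mask used during autoregressive training.
import numpy as np

T = 5                                               # target length
causal_mask = np.triu(np.ones((T, T), dtype=bool), k=1)
# causal_mask[t, j] is True where position t must NOT attend to position j (j > t)
print(causal_mask.astype(int))
# [[0 1 1 1 1]
#  [0 0 1 1 1]
#  [0 0 0 1 1]
#  [0 0 0 0 1]
#  [0 0 0 0 0]]
# At inference there is no full target to feed, so generation falls back to
# one token per step: predict, append, repeat.
```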
Each step is a conditional prediction:
P( ? | "Legumes share resources with nitrogen-fixing bacteria.", … )

A question: why not abandon the encoder part?
Img Src: Refs[1]
Pre-training objectives:

§ Auto-Encoding (AE): de-noising, predicting masked tokens
  variants: T5, BART
§ Auto-Regressive (AR): generating one token after another
  variant: XLNet (permutation)

GPT: remove the cross-attention layer (decoder-only)

Img Src: Refs[2,3]
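A minimal sketch of how the two objectives turn the same raw text into training examples (toy whitespace tokens and one hand-picked mask position; real models use subword tokenizers and random masking).

```python
# Contrast the two pre-training objectives on one toy sentence.

tokens = "legumes share resources with nitrogen-fixing bacteria".split()

# Auto-Encoding (BERT-style de-noising): corrupt some positions, predict the
# originals from BIDIRECTIONAL context. Here position 2 is masked for illustration.
ae_input = tokens.copy()
ae_target = {2: ae_input[2]}          # {2: 'resources'}
ae_input[2] = "[MASK]"
print("AE input :", ae_input)
print("AE target:", ae_target)

# Auto-Regressive (GPT-style): predict each token from its LEFT context only,
# one after another.
ar_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
print("AR example:", ar_pairs[2])     # (['legumes', 'share', 'resources'], 'with')
```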
Fine-tuning:

§ Auto-Encoding (AE): BERT
§ Auto-Regressive (AR): GPT-1

Img Src: Refs[2,3]
[Figures: GPT-1 → GPT-2 → GPT-3.]

Img Src: Refs[2,3,5]
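The outline's "Emergence of In-Context Learning" refers to using GPT-3 (Refs[5]) without any fine-tuning: a few demonstrations go directly into the prompt and the model continues the pattern. A minimal sketch of such a few-shot prompt, following the style of the translation examples in Refs[5] (the final query is illustrative):

```python
# Few-shot in-context learning: task demonstrations are placed in the prompt;
# the model is only asked to continue the text, with no gradient updates.
few_shot_prompt = """Translate English to French:
sea otter => loutre de mer
cheese => fromage
plush giraffe => girafe peluche
bacteria =>"""
# A decoder-only LM continues the prompt with the next tokens, e.g. "bactéries".
print(few_shot_prompt)
```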
Encoder vs. Decoder:

§ Encoder, Auto-Encoding (AE): de-noising objective
  discrepancy with (user-facing) downstream tasks
§ Decoder, Auto-Regressive (AR): seq2seq generative objective
  less discrepancy with downstream tasks
  natural human-language interaction: a generative process?
T5, BART: Encoder-Decoder Architecture vs. Decoder

Decoder-only: Auto-Regressive (AR), seq2seq generative; less discrepancy with
(user-facing) downstream tasks.

Google's PaLM-E

T5, BART (encoder-decoder): de-noising seq2seq pre-training.

Pros:
1. Separate prompts (input) & generation (output)
2. Bidirectional encoder: solves the limitation of left-to-right context learning
   (cf. permutation in XLNet)
3. Cross-attention

Con: for self-supervised pre-training, it is inconvenient to break raw text into
an input part and an output part.

Img Src: Refs[1,6]
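A minimal sketch of the de-noising seq2seq idea (simplified span corruption with sentinel tokens, in the spirit of T5 but not its exact code): corrupted raw text becomes the encoder input, and the dropped spans become the decoder target, which is exactly the input/output split the "Con" above refers to.

```python
# Simplified span corruption: hand-picked spans, toy whitespace tokens.
tokens = "legumes share resources with nitrogen-fixing bacteria".split()
spans_to_drop = [(1, 2), (4, 5)]        # illustrative span choices

encoder_input, decoder_target = [], []
cursor = 0
for sid, (start, end) in enumerate(spans_to_drop):
    encoder_input += tokens[cursor:start] + [f"<extra_id_{sid}>"]   # keep text, insert sentinel
    decoder_target += [f"<extra_id_{sid}>"] + tokens[start:end]     # sentinel + dropped span
    cursor = end
encoder_input += tokens[cursor:]

print("encoder input :", " ".join(encoder_input))
# legumes <extra_id_0> resources with <extra_id_1> bacteria
print("decoder target:", " ".join(decoder_target))
# <extra_id_0> share <extra_id_1> nitrogen-fixing
```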
Fine-Tune by Human Feedback → InstructGPT
1. Supervised Fine-Tuning (SFT): labeler demonstrations

2. Reinforcement Learning from Human Feedback (RLHF), e.g. on a summarization task:
   the reward model is trained with a loss similar to pairwise ranking.
Img Src: Refs[9]
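A minimal sketch of the "similar to pairwise ranking" reward-model loss in RLHF (Refs[9]), assuming a generic reward model that has already produced scalar scores for the human-preferred and the rejected response (the scores below are toy values):

```python
# Pairwise-ranking-style loss for the RLHF reward model: the preferred ("chosen")
# response should score higher than the rejected one.
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy scores for a batch of 3 comparisons:
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.7, 0.9, -0.5])
print(reward_model_loss(r_chosen, r_rejected))
# lower when the chosen responses consistently score above the rejected ones
```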
Conventional Fine-Tuning: pre-training, then one fine-tuning run per task, each
with its own θ_i, D_i and arch_i.

InstructGPT Fine-Tuning: pre-training, then a single fine-tuning with a unifying
network setting (one θ, one D) across various prompts.
§ An Auto-Regressive (AR) model, as a generative model, has less discrepancy
with downstream tasks.
§ Fine-tuning on human feedback yields a unifying network setting.
1. Vaswani et al., 2017. "Attention Is All You Need"
2. Devlin et al., 2018. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
3. Radford et al., 2018. "Improving Language Understanding by Generative Pre-Training"
4. Radford et al., 2019. "Language Models are Unsupervised Multitask Learners"
5. Brown et al., 2020. "Language Models are Few-Shot Learners"
6. Raffel et al., 2019. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
7. Dai et al., 2019. "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context"
8. Liang et al., 2022. "Holistic Evaluation of Language Models"
9. Ouyang et al., 2022. "Training Language Models to Follow Instructions with Human Feedback"
10. My blog: medium.com/@yulemoon/an-in-depth-look-at-the-transformer-based-models-22e5f5d17b6b
11. Cover page image: https://www.lifestyleasia.com/sg/tech/google-bard-vs-open-ai-chatgpt-which-chatbot-is-better-and-why/
z denotes a permutation in the set Z_T, which contains all possible permutations of
the text sequence x of length T. The t-th token of permutation z is denoted x_{z_t},
and the tokens preceding the t-th position are denoted x_{z_<t}.
Yang et al., 2019. "XLNet: Generalized Autoregressive Pretraining for Language Understanding"
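With this notation, XLNet's permutation language-modeling objective can be written as the expected autoregressive log-likelihood over sampled factorization orders:

```latex
\max_{\theta} \;
\mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[ \sum_{t=1}^{T} \log p_{\theta}\!\left( x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}} \right) \right]
```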
