
ChatGPT, LLMs and RLHF

Challenges and opportunities

Speaker: Hui Yang


Index

● The Hype: ChatGPT


● Two Stories: LLMs and RLHF
○ LLM - Large Language Models, BERT vs. GPT
○ RLHF - Reinforcement Learning from Human Feedback
● Challenges and opportunities for us
The Hype: ChatGPT
The Hype: ChatGPT - Industry
The Hype: ChatGPT - Passing human tests
The Hype: ChatGPT - Understanding human humor

From George Ding's Slack message


Index

● The Hype: ChatGPT


● Two Stories: LLMs and RLHF
○ LLM - Large Language Models, BERT vs. GPT
○ RLHF - Reinforcement Learning from Human Feedback
● Challenges and opportunities for us
Two Stories: LLMs and RLHF

Language model timeline:
● 2013: Word2vec
● 2015: Attention
● 2017: Transformer
● 2018: BERT, GPT-1
● 2019: GPT-2, BART
● 2020: GPT-3
● 2022: AlexaTM, InstructGPT, Chain of Thought, RLHF (Anthropic), ChatGPT
● 2023: ToolFormer, LLaMA, Bard

Reinforcement learning timeline:
● 1999: Policy Gradient
● 2013: Deep Q-Learning (DQN)
● 2014: DPG
● 2015: DeepDPG (DDPG)
● 2016: AlphaGo
● 2017: PPO
● 2021: AlphaFold
● 2022: InstructGPT, RLHF (Anthropic), ChatGPT


Background Knowledge
● Fundamental ML and Deep Learning
○ Deep Neural Networks
○ Back-Propagation, Gradient Descent Optimizations
● Basic NLP tasks and Language Models
○ Word Embeddings
○ Pretraining and Fine Tuning
○ Zero-Shot vs. Few-Shot
● Attention and Transformers
○ Encoder and Decoders
○ BERT vs. GPT
○ Position Encoding and Masks
● Reinforcement Learning
○ Q-Learning
○ Policy Gradient
○ PPO (Proximal Policy Optimization)
Attention and Transformers

Vaswani, A., et al. "Attention is all you need." NIPS 2017.
Attention and Transformers

● Encoder-decoder structure for translation
● Positional encoding to keep the sequence-order information
● Encoder: no mask
○ Any input token is queried against all input tokens
● Decoder: masks the following words in the output
○ The currently predicted word cannot cheat by looking into the future
○ Another attention layer (cross-attention) looks at all input tokens (see the masking sketch after the citation below)

Vaswani, A., et al. "Attention is all you need." NIPS 2017.
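To make the masking concrete, here is a minimal NumPy sketch (not from the original slides) of scaled dot-product attention with an optional causal mask; the shapes, names, and toy inputs are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q, K, V: (seq_len, d_k) arrays. Returns (seq_len, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len) similarity scores
    if causal:
        # Decoder-style mask: position i may only attend to positions <= i,
        # so the model cannot "look into the future" while predicting.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    weights = softmax(scores, axis=-1)  # attention distribution per query token
    return weights @ V

# Toy example: 4 tokens, 8-dimensional keys/values
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
enc_out = scaled_dot_product_attention(x, x, x, causal=False)  # encoder: no mask
dec_out = scaled_dot_product_attention(x, x, x, causal=True)   # decoder: causal mask
```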
Attention and Transformers

"Neural machine translation by jointly learning to align and Wasw ani, A., et al. "Attention is all you need." NIPS. 2017.
translate." arXiv preprint arXiv:1409.0473 (2014).
Story 1: BERT vs. GPT: Large Language Models

● BERT - Bidirectional Encoder Representations from Transformers


○ Goal: Learn a deep representation of languages
■ Encoder only model
○ Pretrained on two tasks: Masked Language Model & Next Sentence Prediction
○ Needs supervised fine-tuning to be functional on specific NLU tasks
○ Original Model size
■ BERT-Base contains 110M parameters
■ BERT-Large contains 340M parameters
● GPT - Generative Pre-trained Transformer
○ Goal: Learn how to generate high-quality text
■ Functional out of the box
○ Decoder only
○ Original Model size
■ GPT-1 117M parameters
■ GPT-2 1.5B parameters
■ GPT-3 175B parameters
● BART - Connecting BERT with GPT
○ A sequence-to-sequence (encoder-decoder) model
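To illustrate the contrast between the two pretraining styles, the snippet below is a small sketch using the Hugging Face transformers pipelines (an illustration added here, not part of the original deck); it assumes the transformers package is installed and will download the public bert-base-uncased and gpt2 checkpoints on first use.

```python
from transformers import pipeline

# BERT: an encoder-only model pretrained with masked language modeling.
# It can fill in a masked token, but needs fine-tuning for most downstream NLU tasks.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK]."))

# GPT-2: a decoder-only model pretrained to predict the next token.
# It generates a continuation "out of the box", with no task-specific head required.
generate = pipeline("text-generation", model="gpt2")
print(generate("The capital of France is", max_new_tokens=5, do_sample=False))
```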
Story 1: BERT vs. GPT: Large Language Models

BERT: pre-training, then connected to a downstream MLP head for task-specific fine-tuning

Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
Story 1: BERT vs. GPT: Large Language Models

GPT-1: still follows the generative pre-training + task-specific fine-tuning flow
Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).


Story 1: BERT vs. GPT: Large Language Models
GPT-2 and GPT-3: fine-tuning is not needed; instead we can use zero-shot or few-shot prompting.

Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.
Story 1: BERT vs. GPT: Large Language Models

● Zero-shot
○ “Please answer, 3+5=?”
● One-shot
○ “1+8=9, please answer, 3+5=?”
● Few-shot (K=3)
○ “1+1=2, 3+4=7, 12+5=17, please answer, 3+5=?”
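A minimal sketch of how such prompts might be assembled programmatically; the build_prompt helper and the commented-out llm() call are hypothetical, added here only for illustration.

```python
def build_prompt(question, examples=()):
    """Assemble a zero-, one-, or few-shot arithmetic prompt from demonstrations."""
    demos = ", ".join(examples)
    prefix = f"{demos}, " if demos else ""
    return f"{prefix}please answer, {question}"

print(build_prompt("3+5=?"))                                 # zero-shot
print(build_prompt("3+5=?", ["1+8=9"]))                      # one-shot
print(build_prompt("3+5=?", ["1+1=2", "3+4=7", "12+5=17"]))  # few-shot (K=3)

# answer = llm(build_prompt("3+5=?", ["1+8=9"]))  # hypothetical completion call
```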
Story 1: BERT vs. GPT: Large Language Models

Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.
Story 1: BERT vs. GPT: Large Language Models

The loss scales as a power-law with model size, dataset size, and the amount of compute used for training

Kaplan, Jared, et al. "Scaling laws for neural language models." arXiv preprint arXiv:2001.08361 (2020).
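As a rough illustration of that power law in parameter count, the sketch below evaluates L(N) = (N_c / N)^alpha_N using the approximate constants reported by Kaplan et al.; treat it as a back-of-the-envelope curve, not a reproduction of the paper's fits.

```python
def loss_vs_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Approximate power-law loss in parameter count N: L(N) = (N_c / N) ** alpha_N.
    Constants are the approximate values reported in Kaplan et al. (2020)."""
    return (n_c / n_params) ** alpha_n

# Roughly GPT-1, GPT-2, and GPT-3 parameter counts
for n in (1.17e8, 1.5e9, 1.75e11):
    print(f"N = {n:.2e}  ->  predicted loss ~ {loss_vs_params(n):.2f}")
```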
Story 1: BERT vs. GPT: Large Language Models

Emergence is when
quantitative changes in a
system result in qualitative
changes in behavior.

Wei, Jason, et al. "Emergent abilities of large language models." arXiv preprint arXiv:2206.07682 (2022).
Story 1: BERT vs. GPT: Philosophical Debate

● BERT:
○ Understand the language first before generating a response
○ An encoder to learn the intermediate representation is essential
○ Fine tune for specific tasks
○ Gained more adoption in the NLP community at the beginning (60K+ citations)
● GPT:
○ Mainly focuses on predicting the next token
○ Reaches state-of-the-art one-shot or few-shot performance without fine-tuning
○ Scaled up parameters; less adopted at the beginning (GPT-1 & GPT-2, ~5K citations)
○ Prompting vs. fine-tuning
Story 1: BERT vs. GPT: Philosophical Debate

● Philosophy and Paradigm shift


○ We understand other humans by their responses
○ We rarely need to poke into other people's brains to understand what they mean
○ Rely directly on the output for any specific task
○ Closer to the idea of AGI (Artificial General Intelligence)
● With the popularity of ChatGPT, the GPT approach is the new SOTA in the industry today!
Story 1: Understanding by NLG

Please recall your memory and tell me what you were thinking.
Story 1: Understanding by NLG

Please recall your memory and tell me what you were thinking.
Story 2: RLHF - Alignment with human values

● 3Hs
○ Helpfulness
○ Honesty
○ Harmlessness
● Pretrained LLMs do not have these values aligned by nature
● Previous LLMs had issues with being toxic and biased
○ Both GPTs and other LLMs
Story 2: RLHF - Alignment with human values

https://www.cs.princeton.edu/courses/archive/fall22/cos597G/lectures/lec14.pdf
Story 2: RLHF - Reinforcement Learning
Story 2: RLHF - Reinforcement Learning

● Models an agent interacting with its environment

● s - state, the current state the agent is in
● a - action, the agent performs an action that leads to a different state
● r - reward, the agent gets either a reward or a punishment after performing an action
● Episode - everything that happens between the initial state and a terminal state
● π(a|s) - policy, which action the agent should take in state s; it can be stochastic
● V(s) - value function, the estimated total reward for the agent being in state s
● Q(s, a) - Q function, the estimated total reward of the agent taking action a in state s
Story 2: RLHF - Reinforcement Learning

● Supervised Learning
○ Newton was given 500 good apples and 500 bad ones
■ He needs to learn how to differentiate good apples from bad ones
● Reinforcement Learning
○ Newton is thrown into a forest and asked to eat apples to survive
■ If he picks a good apple and eats it, he gets a +100 score
■ If he picks a bad one and eats it, he gets a -50 score
■ If the score drops below -100, he dies
■ If the score goes above +200, he wins
■ At most 200 actions are collected before the episode is wrapped up (see the toy environment sketch below)
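The apple story maps directly onto the state/action/reward vocabulary above. Below is a toy environment written for this summary (not from the talk) that encodes roughly those rules and plays one episode with a random policy; whether an apple is good is assumed here to be a coin flip.

```python
import random

class AppleForest:
    """Toy RL environment: +100 for a good apple, -50 for a bad one,
    lose below -100, win above +200, at most 200 actions per episode."""

    def reset(self):
        self.score, self.steps = 0, 0
        return self.score  # the state here is simply the current score

    def step(self, action):
        # action 0: skip the apple (no reward), action 1: eat it
        self.steps += 1
        reward = 0
        if action == 1:
            reward = 100 if random.random() < 0.5 else -50  # good or bad apple
        self.score += reward
        done = self.score <= -100 or self.score >= 200 or self.steps >= 200
        return self.score, reward, done

env = AppleForest()
state, done, total = env.reset(), False, 0
while not done:  # one episode with a random policy
    action = random.choice([0, 1])
    state, reward, done = env.step(action)
    total += reward
print("episode return:", total)
```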
Story 2: RLHF - Deep Q Learning
● Works well for playing games in a contained environment
● Drawbacks
○ The model only learns how to map states to Q(s, a), not the actual policy
○ Requires a discrete action space
○ Requires ad hoc exploration methods for off-policy actions, such as epsilon-greedy

Optimal Q (Bellman optimality): Q*(s, a) = E[r + γ max_a' Q*(s', a')]

Loss function: L(θ) = E[(y − Q(s, a; θ))²]

Target: y = r + γ max_a' Q(s', a'; θ⁻), with θ⁻ the frozen target-network parameters

Mnih, Volodymyr, et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
https://towardsdatascience.com/deep-q-learning-tutorial-mindqn-2a4c855abffc
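A minimal NumPy sketch of the loss and target above, with simple lookup tables standing in for the online and frozen target Q-networks; the batch and the Q-values are made up for illustration, not taken from the cited implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.99

# Stand-ins for the online and frozen target Q-networks (here: lookup tables).
q_online = rng.normal(size=(n_states, n_actions))
q_target = q_online.copy()  # periodically synced copy, kept fixed between syncs

def dqn_loss(batch):
    """Mean squared TD error: L = E[(y - Q(s, a))^2] with
    y = r + gamma * max_a' Q_target(s', a'), and y = r at terminal states."""
    s, a, r, s_next, done = batch
    y = r + gamma * q_target[s_next].max(axis=1) * (1 - done)  # target Q(s, a)
    td_error = y - q_online[s, a]
    return np.mean(td_error ** 2)

# A fake batch of 4 transitions (s, a, r, s', done)
batch = (np.array([0, 1, 2, 3]),
         np.array([0, 2, 1, 0]),
         np.array([1.0, -1.0, 0.5, 0.0]),
         np.array([1, 2, 3, 4]),
         np.array([0, 0, 0, 1]))
print("DQN loss:", dqn_loss(batch))
```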
Story 2: RLHF - Policy Gradient and Actor-Critic
REINFORCE - G_t here is the discounted total reward estimated by sampling, which makes the gradient estimate very high-variance

Actor-Critic - We need to learn two models: the actor model π(θ) and the critic model V(ω)

In practice, both have been difficult to converge and expensive to compute due to the Monte Carlo sampling needed.
https://towardsdatascience.com/policy-gradients-in-a-nutshell-8b72f9743c5d
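A small PyTorch sketch of the two surrogate objectives above, assuming log-probabilities, returns, and value estimates have already been collected from rollouts; the numbers are toy values added for illustration.

```python
import torch

def reinforce_loss(log_probs, returns):
    """REINFORCE surrogate loss: minimizing -sum_t G_t * log pi(a_t | s_t)
    gives the policy-gradient update; G_t is the sampled discounted return,
    which is why the estimate has high variance."""
    return -(returns * log_probs).sum()

def actor_critic_advantage(rewards, values, gamma=0.99):
    """One-step advantage used with a critic: A_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    next_values = torch.cat([values[1:], torch.zeros(1)])
    return rewards + gamma * next_values - values

# Toy episode of length 4 (log-probs would come from the actor network pi_theta)
log_probs = torch.log(torch.tensor([0.6, 0.3, 0.8, 0.5]))
returns   = torch.tensor([3.0, 2.5, 1.5, 1.0])   # discounted returns G_t
rewards   = torch.tensor([1.0, 0.0, 1.0, 0.5])
values    = torch.tensor([2.8, 2.1, 1.4, 0.6])   # critic estimates V(s_t)

print("REINFORCE loss:", reinforce_loss(log_probs, returns).item())
print("advantages:", actor_critic_advantage(rewards, values))
```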
Story 2: RLHF - Proximal Policy Optimization (PPO)

Policy loss: L_CLIP(θ) = E[min(r_t(θ) A_t, clip(r_t(θ), 1 − ε, 1 + ε) A_t)]

Deviation from the old policy: r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)

Advantage function: A_t, an estimate of how much better action a_t is than the average action in state s_t

Actor-critic loss: the full objective combines the clipped policy loss with a value-function (critic) loss and an entropy bonus


Story 2: RLHF - Proximal Policy Optimization (PPO)

https://spinningup.openai.com/en/latest/algorithms/ppo.html#proximal-policy-optimization
Story 2: RLHF - Proximal Policy Optimization (PPO)

PPO: Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
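A minimal PyTorch sketch of the PPO-Clip policy loss described above, assuming per-action log-probabilities under the new and old policies and advantage estimates are already available; the tensors are toy values.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-Clip policy loss: L = -E[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)],
    where r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t) measures deviation from the old policy."""
    ratio = torch.exp(logp_new - logp_old)                 # probability ratio r_t
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Toy batch: log-probs under the new and old policies, plus advantage estimates
logp_new   = torch.tensor([-0.9, -1.2, -0.3])
logp_old   = torch.tensor([-1.0, -1.0, -1.0])
advantages = torch.tensor([ 1.5, -0.5,  2.0])
print("PPO policy loss:", ppo_clip_loss(logp_new, logp_old, advantages).item())
```

The clipping removes the incentive to move the new policy too far from the old one in a single update, which is what makes PPO more stable than plain policy gradients.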
ChatGPT: LLM + RLHF
● ChatGPT's methods are not fully open-sourced
● GPT-3.5 + RLHF
● (rumor) Roughly 10x the spend on the human annotation budget
● (rumor) Modifications to RLHF training, beyond PPO
● Data quality and how data is collected is one critical factor
○ Starting from GPT-3, OpenAI has had a framework to upweight high-quality text and filter out low-quality text
● We will focus on the two published methods below
○ InstructGPT from OpenAI
○ The RLHF method from Anthropic, founded by former OpenAI executives
InstructGPT - LLM with RLHF

● Based on GPT-3
● A 1.3B-parameter RLHF model outperforms the 175B GPT-3 model on human preference
● Two additional models are
learned:
○ Reward model
○ PPO policy model

Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
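A compact sketch of the two extra models, based on the paper's description rather than OpenAI's code: the pairwise reward-model loss on human comparisons, and the KL-penalized reward that the PPO policy then maximizes. The beta value and the toy tensors are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise comparison loss: the reward model should score the human-preferred
    response higher, i.e. minimize -log(sigmoid(r_chosen - r_rejected))."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def rl_reward(r_from_rm, logp_policy, logp_sft, beta=0.02):
    """Reward used during PPO: the learned reward minus a KL-style penalty
    that keeps the policy close to the supervised fine-tuned (SFT) model.
    beta is an illustrative coefficient, not the paper's exact setting."""
    return r_from_rm - beta * (logp_policy - logp_sft)

# Toy scores for a batch of 3 human comparisons and one rollout
r_chosen   = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.5, 0.4, 1.1])
print("reward-model loss:", reward_model_loss(r_chosen, r_rejected).item())
print("shaped RL reward:", rl_reward(torch.tensor([1.2]),
                                     torch.tensor([-4.0]),
                                     torch.tensor([-3.5])))
```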
InstructGPT - LLM with RLHF
LM with RLHF from Anthropic

● Can be trained statically or on a weekly cadence
● Requires continuous human feedback: which of two responses (A or B) is preferred

Bai, Yuntao, et al. "Training a helpful and harmless assistant with reinforcement learning from human feedback." arXiv preprint arXiv:2204.05862 (2022).
LM with RLHF from Anthropic

● For small LMs (<10B parameters), there is an alignment tax on standard NLP tasks
● For LLMs (>10B parameters), alignment brings a slight benefit

Bai, Yuntao, et al. "Training a helpful and harmless assistant with reinforcement learning from human feedback." arXiv preprint arXiv:2204.05862 (2022).
Evaluation on ChatGPT

Qin, Chengwei, et al. "Is ChatGPT a General-Purpose Natural Language Processing Task Solver?" arXiv preprint arXiv:2302.06476 (2023).
Evaluation on ChatGPT

Bang, Yejin, et al. "A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity." arXiv preprint arXiv:2302.04023 (2023).
Story 2: RLHF - Alignment with human values
LaMDA: 137B parameters, decoder-only, fine-tuned through supervised learning, not RLHF

Bai, Yuntao, et al. "Training a helpful and harmless assistant with reinforcement learning from human feedback." arXiv preprint arXiv:2204.05862 (2022).
ChatGPT: LLM + RLHF

Without RLHF: a killer. With RLHF: an assistant.


Index

● The Hype: ChatGPT


● Two Stories: LLMs and RLHF
○ LLM - Large Language Models, BERT vs. GPT
○ RLHF - Reinforcement Learning from Human Feedback
● Challenges and opportunities for us
Appendix - NLP milestone papers
● Word2Vec: Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality."Advances in neural information processing
systems 26 (2013).
● Attention: Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint
arXiv:1409.0473 (2014).
● Transformer: Vaswani, A., et al. "Attention is all you need." NIPS 2017.
● BERT: Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
● GPT-1: Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
● GPT-2: Radford, Alec, et al. "Language models are unsupervised multitask learners." OpenAI blog 1.8 (2019): 9.
● BART: Lewis, Mike, et al. "Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension." arXiv
preprint arXiv:1910.13461 (2019).
● GPT-3: Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.
● AlexaTM: Soltan, Saleh, et al. "Alexatm 20b: Few-shot learning using a large-scale multilingual seq2seq model." arXiv preprint arXiv:2208.01448 (2022).
● InstructGPT: Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
● Chain Of Thought: Wei, Jason, et al. "Chain of thought prompting elicits reasoning in large language models." arXiv preprint arXiv:2201.11903 (2022).
● RLHF (Anthropic): Bai, Yuntao, et al. "Training a helpful and harmless assistant with reinforcement learning from human feedback." arXiv preprint arXiv:2204.05862 (2022).
● ChatGPT: No paper, still secrets from OpenAI
● ToolFormer: Schick, Timo, et al. "Toolformer: Language models can teach themselves to use tools." arXiv preprint arXiv:2302.04761 (2023).
● LLaMA: Touvron, Hugo, et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv preprint arXiv:2302.13971 (2023).
● Bard: No paper yet, Google’s new secret product

● Policy Gradient: Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information
processing systems 12 (1999).
● Deep-Q-Learning: Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
● DPG: Silver, David, et al. "Deterministic policy gradient algorithms." International conference on machine learning. PMLR, 2014.
● DDPG: Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
● PPO: Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
● AlphaGo: Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
● AlphaFold: Jumper, John, et al. "Highly accurate protein structure prediction with AlphaFold." Nature 596.7873 (2021): 583-589.
