ChatGPT, LLM and RLHF
Timeline of methods: Policy Gradient → Deep Q-Learning (DQN) → DPG → DDPG → AlphaGo → PPO → AlphaFold → InstructGPT, RLHF (Anthropic) → ChatGPT
Attention and Transformers
Bahdanau, Dzmitry, et al. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
Vaswani, Ashish, et al. "Attention is all you need." NIPS. 2017.
Story 1: BERT vs. GPT: Large Language Models
BERT: Pre-training, then connect with a downstream MLP for fine-tuning
Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
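A minimal sketch of that pre-train-then-fine-tune pattern with the Hugging Face transformers library (the model name, 2-class head, and example sentence are illustrative assumptions):

```python
# pip install transformers torch
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")         # pre-trained encoder
classifier = torch.nn.Linear(bert.config.hidden_size, 2)      # downstream MLP head (2 classes, illustrative)

inputs = tokenizer("The movie was great!", return_tensors="pt")
hidden = bert(**inputs).last_hidden_state[:, 0]               # [CLS] token representation
logits = classifier(hidden)                                   # fine-tune encoder + head jointly on labeled data
print(logits.shape)                                           # torch.Size([1, 2])
```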
Story 1: BERT vs. GPT: Large Language Models
Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.
Story 1: BERT vs. GPT: Large Language Models
● Zero-shot
○ “Please answer, 3+5=?”
● One-shot
○ “1+8=9, please answer, 3+5=?”
● Few-shot (K=3)
○ “1+1=2, 3+4=7, 12+5=17, please answer, 3+5=?”
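The three regimes differ only in how many demonstrations are prepended to the prompt; a small sketch (the complete() call at the end is a hypothetical placeholder for any text-completion API):

```python
def build_prompt(question: str, examples: list[str]) -> str:
    """Zero-shot: no examples. One-shot: one example. Few-shot: K examples."""
    demo = ", ".join(examples)
    prefix = f"{demo}, " if examples else ""
    return f"{prefix}please answer, {question}"

print(build_prompt("3+5=?", []))                               # zero-shot
print(build_prompt("3+5=?", ["1+8=9"]))                        # one-shot
print(build_prompt("3+5=?", ["1+1=2", "3+4=7", "12+5=17"]))    # few-shot (K=3)

# answer = complete(build_prompt("3+5=?", ["1+8=9"]))          # hypothetical LM completion call
```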
Story 1: BERT vs. GPT: Large Language Models
Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.
Story 1: BERT vs. GPT: Large Language Models
The loss scales as a power-law with model size, dataset size, and the amount of compute used for training
Kaplan, Jared, et al. "Scaling laws for neural language models." arXiv preprint arXiv:2001.08361 (2020).
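The power-law form from Kaplan et al. can be written as follows (N_c, D_c, C_c and the exponents are empirical fit constants reported in the paper; only the functional form is shown here):

```latex
% Test loss as a power law in parameters N, dataset size D, and compute C,
% when each is the limiting factor (Kaplan et al., 2020).
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```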
Story 1: BERT vs. GPT: Large Language Models
Emergence is when quantitative changes in a system result in qualitative changes in behavior.
Wei, Jason, et al. "Emergent abilities of large language models." arXiv preprint arXiv:2206.07682 (2022).
Story 1: BERT vs. GPT: Philosophical Debate
● BERT:
○ Understand the language first before generating a response
○ An encoder that learns an intermediate representation is essential
○ Fine-tune for specific tasks
○ Gained more adoption in the NLP community at the beginning (60K+ citations)
● GPT:
○ Mainly focuses on predicting the next token
○ Reaches state-of-the-art one-shot or few-shot performance without fine-tuning
○ Scales up parameters; less adopted at the beginning (GPT-1 & GPT-2 ~5K citations)
○ Prompting vs. fine-tuning (the objective difference is sketched after this list)
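To make that objective difference concrete, a minimal sketch contrasting BERT-style masked-token filling with GPT-style next-token generation via Hugging Face pipelines (the model choices and example sentence are illustrative assumptions):

```python
# pip install transformers torch
from transformers import pipeline

# BERT-style: predict a masked token using context on both sides (encoder)
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("Paris is the capital of [MASK].")[0]["token_str"])

# GPT-style: predict the next tokens left-to-right (decoder)
generate = pipeline("text-generation", model="gpt2")
print(generate("Paris is the capital of", max_new_tokens=5)[0]["generated_text"])
```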
Story 1: BERT vs. GPT: Philosophical Debate
Please recall your memory and tell me what you were thinking.
Story 1: Understanding by NLG
Please recall your memory and tell me what you were thinking.
Story 2: RLHF - Alignment with human values
● 3Hs
○ Helpfulness
○ Honesty
○ Harmlessness
● The pretrained LLMs won’t have these values aligned by nature
● Previous LLMs had issues of being toxic and biased
○ Both GPTs and other LLMs
Story 2: RLHF - Alignment with human values
https://www.cs.princeton.edu/courses/archive/fall22/cos597G/lectures/lec14.pdf
Story 2: RLHF - Reinforcement Learning
● Supervised Learning
○ Newton was given 500 good apples and 500 bad ones
■ He needs to learn how to differentiate good vs. bad apples
● Reinforcement Learning
○ Newton was thrown into a forest and asked to eat apples to survive (a toy version is sketched after this list)
■ If he picked a good apple and ate it, he got +100 points
■ If he picked a bad one and ate it, he got -50 points
■ If the score goes below -100, he dies
■ If the score goes above +200, he wins
■ At most 200 actions are collected before the episode wraps up
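A minimal sketch of this episode as a toy environment (the class name, the 60% good-apple probability, and the random eat-or-skip policy are assumptions added just to make it runnable):

```python
import random

class AppleForest:
    """Toy episodic environment matching the rewards on the slide."""
    def __init__(self):
        self.score, self.steps = 0, 0

    def step(self, eat: bool):
        self.steps += 1
        reward = 0
        if eat:
            good = random.random() < 0.6            # assumed chance the picked apple is good
            reward = 100 if good else -50
            self.score += reward
        done = self.score <= -100 or self.score >= 200 or self.steps >= 200
        return reward, done

env, done = AppleForest(), False
while not done:                                      # random policy: eat with probability 0.5
    reward, done = env.step(eat=random.random() < 0.5)
print("final score:", env.score, "steps:", env.steps)
```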
Story 2: RLHF - Deep Q Learning
● Works well playing games in a contained environment
● Drawbacks
○ The model only learns how to map states into Q(s, a), not the actual policy
○ Requires discrete actions and states
○ Requires some ad hoc exploration methods for off-policy actions, such as epsilon-greedy
Optimal Q: $Q^*(s, a) = \mathbb{E}_{s'}\left[\, r + \gamma \max_{a'} Q^*(s', a') \mid s, a \,\right]$
Loss function: $L_i(\theta_i) = \mathbb{E}_{s, a, r, s'}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right]$
Target: $y_i = r + \gamma \max_{a'} Q(s', a'; \theta_{i-1})$
Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
https://towardsdatascience.com/deep-q-learning-tutorial-mindqn-2a4c855abffc
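A minimal sketch of the DQN loss above on a single transition, in PyTorch (the network sizes and transition values are illustrative assumptions):

```python
import torch, torch.nn as nn

n_states, n_actions, gamma = 4, 2, 0.99                    # illustrative sizes
q_net = nn.Linear(n_states, n_actions)                     # Q(s, ·; θ)
target_net = nn.Linear(n_states, n_actions)                # Q(s, ·; θ⁻), a delayed copy
target_net.load_state_dict(q_net.state_dict())

s = torch.randn(1, n_states); a = torch.tensor([0])        # one transition (s, a, r, s')
r, s_next = torch.tensor([1.0]), torch.randn(1, n_states)

with torch.no_grad():
    y = r + gamma * target_net(s_next).max(dim=1).values   # target y = r + γ max_a' Q(s', a'; θ⁻)
q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; θ)
loss = nn.functional.mse_loss(q_sa, y)                     # (y − Q(s, a; θ))²
loss.backward()                                            # a gradient step would follow
print(float(loss))
```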
Story 2: RLHF - Policy Gradient and Actor-Critic
REINFORCE - G_t here is the discounted total return estimated by sampling, so it has very high variance
Actor-Critic - We need to learn two models: the actor model π(θ) and the critic model V(ω)
In practice, both have been difficult to converge and expensive to compute due to the Monte Carlo sampling needed.
https://towardsdatascience.com/policy-gradients-in-a-nutshell-8b72f9743c5d
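For reference, the REINFORCE gradient and its actor-critic variant can be written in the standard form (G_t is the sampled discounted return; A_t is the advantage estimated with the critic V(ω)):

```latex
% REINFORCE (Monte Carlo policy gradient); high variance because G_t is sampled:
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_t G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]

% Actor-critic: replace G_t with an advantage estimated by the critic V_\omega:
A_t \approx r_t + \gamma V_\omega(s_{t+1}) - V_\omega(s_t), \qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_t A_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
```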
Story 2: RLHF - Proximal Policy Optimization (PPO)
Policy loss (the ratio $r_t(\theta)$ measures deviation from the old policy; $\hat{A}_t$ is the advantage function):
$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$, where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$
https://spinningup.openai.com/en/latest/algorithms/ppo.html#proximal-policy-optimization
Story 2: RLHF - Proximal Policy Optimization (PPO)
PPO: Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
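A minimal sketch of the clipped surrogate loss above in PyTorch (the log-probabilities, advantages, and ε = 0.2 are illustrative assumptions):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO-clip surrogate: keep the policy ratio within [1 - eps, 1 + eps]."""
    ratio = torch.exp(logp_new - logp_old)                     # π_θ(a|s) / π_θold(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()               # maximize objective = minimize its negative

# Toy batch of 4 actions
logp_old = torch.tensor([-1.0, -0.5, -2.0, -1.5])
logp_new = torch.tensor([-0.8, -0.6, -1.9, -1.7])
advantages = torch.tensor([1.0, -0.5, 0.3, 2.0])
print(ppo_clip_loss(logp_new, logp_old, advantages))
```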
ChatGPT: LLM + RLHF
● ChatGPT methods are not fully open source
● GPT 3.5 + RLHF
● (rumor) about 10x spend on human annotation budget
● (rumor) modification of RLHF training, beyond PPO
● Data quality and how the data is collected are critical factors
○ Starting from GPT-3, OpenAI has had a framework to upweight high-quality sources and filter out low-quality ones
● We will focus on two published methods below
○ InstructGPT from OpenAI
○ The RLHF method from Anthropic, founded by former OpenAI executives
InstructGPT - LLM with RLHF
● Based on GPT-3
● The 1.3B RLHF model outperforms the 175B GPT-3 model on human preference
● Two additional models are learned:
○ Reward model (pairwise loss sketched below)
○ PPO policy model
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
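A minimal sketch of the pairwise loss used to train the reward model on human comparisons, where the preferred response should score higher than the rejected one (the scalar rewards below are placeholders for reward-model outputs):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise comparison loss: -log sigmoid(r(x, y_chosen) - r(x, y_rejected))."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy rewards for 3 comparison pairs (placeholders for reward-model outputs)
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.5, -1.0])
print(reward_model_loss(r_chosen, r_rejected))
```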
InstructGPT - LLM with RLHF
LM with RLHF from Anthropic
Bai, Yuntao, et al. "Training a helpful and harmless assistant with reinforcement learning from human feedback." arXiv preprint arXiv:2204.05862 (2022).
LM with RLHF from Anthropic
Bai, Yuntao, et al. "Training a helpful and harmless assistant with reinforcement learning from human feedback." arXiv preprint arXiv:2204.05862 (2022).
Evaluation on ChatGPT
Qin, Chengwei, et al. "Is ChatGPT a General-Purpose Natural Language Processing Task Solver?" arXiv preprint arXiv:2302.06476 (2023).
Evaluation on ChatGPT
Bang, Yejin, et al. "A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity." arXiv preprint arXiv:2302.04023 (2023).
Story 2: RLHF - Alignment with human values
LaMDA: 137B, decoder-only, fine-tuned through supervised learning, not RLHF
Bai, Yuntao, et al. "Training a helpful and harmless assistant with reinforcement learning from human feedback." arXiv preprint arXiv:2204.05862 (2022).
ChatGPT: LLM + RLHF
● Policy Gradient: Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems 12 (1999).
● Deep-Q-Learning: Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
● DPG: Silver, David, et al. "Deterministic policy gradient algorithms." International conference on machine learning. PMLR, 2014.
● DDPG: Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
● PPO: Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
● AlphaGo: Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
● AlphaFold: Jumper, John, et al. "Highly accurate protein structure prediction with AlphaFold." Nature 596.7873 (2021): 583-589.