ChatGPT, LLM and RLHF
Timeline of methods: Policy Gradient → Deep Q-Learning (DQN) → DPG → DDPG → AlphaGo → PPO → AlphaFold → InstructGPT, RLHF (Anthropic) → ChatGPT
Attention and Transformers
Bahdanau, Dzmitry, et al. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
Vaswani, Ashish, et al. "Attention is all you need." NIPS. 2017.
Story 1: BERT vs. GPT: Large Language Models
BERT: Pre-training, then connect with a downstream MLP for fine-tuning
Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
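A minimal sketch of that pre-train-then-fine-tune pattern with the Hugging Face transformers library (the model name, 2-class head, and example sentence are illustrative assumptions):

```python
# pip install transformers torch
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")         # pre-trained encoder
classifier = torch.nn.Linear(bert.config.hidden_size, 2)      # downstream MLP head (2 classes, illustrative)

inputs = tokenizer("The movie was great!", return_tensors="pt")
hidden = bert(**inputs).last_hidden_state[:, 0]               # [CLS] token representation
logits = classifier(hidden)                                   # fine-tune encoder + head jointly on labeled data
print(logits.shape)                                           # torch.Size([1, 2])
```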
Story 1: BERT vs. GPT: Large Language Models
Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.
Story 1: BERT vs. GPT: Large Language Models
● Zero-shot
○ “Please answer, 3+5=?”
● One-shot
○ “1+8=9, please answer, 3+5=?”
● Few-shot (K=3)
○ “1+1=2, 3+4=7, 12+5=17, please answer, 3+5=?”
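The three regimes differ only in how many demonstrations are prepended to the prompt; a small sketch (the complete() call at the end is a hypothetical placeholder for any text-completion API):

```python
def build_prompt(question: str, examples: list[str]) -> str:
    """Zero-shot: no examples. One-shot: one example. Few-shot: K examples."""
    demo = ", ".join(examples)
    prefix = f"{demo}, " if examples else ""
    return f"{prefix}please answer, {question}"

print(build_prompt("3+5=?", []))                               # zero-shot
print(build_prompt("3+5=?", ["1+8=9"]))                        # one-shot
print(build_prompt("3+5=?", ["1+1=2", "3+4=7", "12+5=17"]))    # few-shot (K=3)

# answer = complete(build_prompt("3+5=?", ["1+8=9"]))          # hypothetical LM completion call
```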
Story 1: BERT vs. GPT: Large Language Models
Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.
Story 1: BERT vs. GPT: Large Language Models
The loss scales as a power-law with model size, dataset size, and the amount of compute used for training
Kaplan, Jared, et al. "Scaling laws for neural language models." arXiv preprint arXiv:2001.08361 (2020).
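The power-law form from Kaplan et al. can be written as follows (N_c, D_c, C_c and the exponents are empirical fit constants reported in the paper; only the functional form is shown here):

```latex
% Test loss as a power law in parameters N, dataset size D, and compute C,
% when each is the limiting factor (Kaplan et al., 2020).
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```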
Story 1: BERT vs. GPT: Large Language Models
Emergence is when quantitative changes in a system result in qualitative changes in behavior.
Wei, Jason, et al. "Emergent abilities of large language models." arXiv preprint arXiv:2206.07682 (2022).
Story 1: BERT vs. GPT: Philosophical Debate
● BERT:
○ Understand the language first before generating a response
○ An encoder that learns an intermediate representation is essential
○ Fine-tune for specific tasks
○ Gained more adoption in the NLP community at the beginning (60K+ citations)
● GPT:
○ Mainly focuses on predicting the next token
○ Reaches state-of-the-art one-shot or few-shot performance without fine-tuning
○ Scales up parameters; less adopted at the beginning (GPT-1 & GPT-2 ~5K citations)
○ Prompting vs. fine-tuning (the objective difference is sketched after this list)
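To make that objective difference concrete, a minimal sketch contrasting BERT-style masked-token filling with GPT-style next-token generation via Hugging Face pipelines (the model choices and example sentence are illustrative assumptions):

```python
# pip install transformers torch
from transformers import pipeline

# BERT-style: predict a masked token using context on both sides (encoder)
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("Paris is the capital of [MASK].")[0]["token_str"])

# GPT-style: predict the next tokens left-to-right (decoder)
generate = pipeline("text-generation", model="gpt2")
print(generate("Paris is the capital of", max_new_tokens=5)[0]["generated_text"])
```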
Story 1: BERT vs. GPT: Philosophical Debate
Please recall your memory and tell me what you were thinking.
Story 1: Understanding by NLG
Please recall your memory and tell me what you were thinking.
Story 2: RLHF - Alignment with human values
● 3Hs
○ Helpfulness
○ Honesty
○ Harmlessness
● The pretrained LLMs won’t have these values aligned by nature
● Previous LLMs had issues of being toxic and biased
○ Both GPTs and other LLMs
Story 2: RLHF - Alignment with human values
https://www.cs.princeton.edu/courses/archive/fall22/cos597G/lectures/lec14.pdf
Story 2: RLHF - Reinforcement Learning
● Supervised Learning
○ Newton was given 500 good apples and 500 bad ones
■ He needs to learn how to differentiate good vs. bad apples
● Reinforcement Learning
○ Newton was thrown into a forest and asked to eat apples to survive (a toy version is sketched after this list)
■ If he picked a good apple and ate it, he got +100 points
■ If he picked a bad one and ate it, he got -50 points
■ If the score goes below -100, he dies
■ If the score goes above +200, he wins
■ At most 200 actions are collected before the episode wraps up
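A minimal sketch of this episode as a toy environment (the class name, the 60% good-apple probability, and the random eat-or-skip policy are assumptions added just to make it runnable):

```python
import random

class AppleForest:
    """Toy episodic environment matching the rewards on the slide."""
    def __init__(self):
        self.score, self.steps = 0, 0

    def step(self, eat: bool):
        self.steps += 1
        reward = 0
        if eat:
            good = random.random() < 0.6            # assumed chance the picked apple is good
            reward = 100 if good else -50
            self.score += reward
        done = self.score <= -100 or self.score >= 200 or self.steps >= 200
        return reward, done

env, done = AppleForest(), False
while not done:                                      # random policy: eat with probability 0.5
    reward, done = env.step(eat=random.random() < 0.5)
print("final score:", env.score, "steps:", env.steps)
```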
Story 2: RLHF - Deep Q Learning
● Works well playing games in a contained environment
● Drawbacks
○ The model only learns how to map states into Q(s, a), not the actual policy
○ Requires discrete actions and states
○ Requires some ad hoc exploration methods for off-policy actions, such as epsilon-greedy
Optimal Q: $Q^*(s, a) = \mathbb{E}_{s'}\left[\, r + \gamma \max_{a'} Q^*(s', a') \mid s, a \,\right]$
Loss function: $L_i(\theta_i) = \mathbb{E}_{s, a, r, s'}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right]$
Target: $y_i = r + \gamma \max_{a'} Q(s', a'; \theta_{i-1})$
Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
https://towardsdatascience.com/deep-q-learning-tutorial-mindqn-2a4c855abffc
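A minimal sketch of the DQN loss above on a single transition, in PyTorch (the network sizes and transition values are illustrative assumptions):

```python
import torch, torch.nn as nn

n_states, n_actions, gamma = 4, 2, 0.99                    # illustrative sizes
q_net = nn.Linear(n_states, n_actions)                     # Q(s, ·; θ)
target_net = nn.Linear(n_states, n_actions)                # Q(s, ·; θ⁻), a delayed copy
target_net.load_state_dict(q_net.state_dict())

s = torch.randn(1, n_states); a = torch.tensor([0])        # one transition (s, a, r, s')
r, s_next = torch.tensor([1.0]), torch.randn(1, n_states)

with torch.no_grad():
    y = r + gamma * target_net(s_next).max(dim=1).values   # target y = r + γ max_a' Q(s', a'; θ⁻)
q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; θ)
loss = nn.functional.mse_loss(q_sa, y)                     # (y − Q(s, a; θ))²
loss.backward()                                            # a gradient step would follow
print(float(loss))
```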
Story 2: RLHF - Policy Gradient and Actor-Critic
REINFORCE - G_t here is the discounted total return estimated by sampling, so it has very high variance
Actor-Critic - We need to learn two models: the actor model π(θ) and the critic model V(ω)
In practice, both have been difficult to converge and expensive to compute due to the Monte Carlo sampling needed.
https://towardsdatascience.com/policy-gradients-in-a-nutshell-8b72f9743c5d
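For reference, the REINFORCE gradient and its actor-critic variant can be written in the standard form (G_t is the sampled discounted return; A_t is the advantage estimated with the critic V(ω)):

```latex
% REINFORCE (Monte Carlo policy gradient); high variance because G_t is sampled:
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_t G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]

% Actor-critic: replace G_t with an advantage estimated by the critic V_\omega:
A_t \approx r_t + \gamma V_\omega(s_{t+1}) - V_\omega(s_t), \qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_t A_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
```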
Story 2: RLHF - Proximal Policy Optimization (PPO)
Policy loss (the ratio $r_t(\theta)$ measures deviation from the old policy; $\hat{A}_t$ is the advantage function):
$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$, where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$
https://spinningup.openai.com/en/latest/algorithms/ppo.html#proximal-policy-optimization
Story 2: RLHF - Proximal Policy Optimization (PPO)
PPO: Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
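A minimal sketch of the clipped surrogate loss above in PyTorch (the log-probabilities, advantages, and ε = 0.2 are illustrative assumptions):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO-clip surrogate: keep the policy ratio within [1 - eps, 1 + eps]."""
    ratio = torch.exp(logp_new - logp_old)                     # π_θ(a|s) / π_θold(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()               # maximize objective = minimize its negative

# Toy batch of 4 actions
logp_old = torch.tensor([-1.0, -0.5, -2.0, -1.5])
logp_new = torch.tensor([-0.8, -0.6, -1.9, -1.7])
advantages = torch.tensor([1.0, -0.5, 0.3, 2.0])
print(ppo_clip_loss(logp_new, logp_old, advantages))
```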
ChatGPT: LLM + RLHF
● ChatGPT methods are not fully open source
● GPT 3.5 + RLHF
● (rumor) about 10x spend on human annotation budget
● (rumor) modification of RLHF training, beyond PPO
● Data quality and how the data is collected are critical factors
○ Starting from GPT-3, OpenAI has had a framework to upweight high-quality sources and filter out low-quality ones
● We will focus on two published methods below
○ InstructGPT from OpenAI
○ The RLHF method from Anthropic, founded by former OpenAI executives
InstructGPT - LLM with RLHF
● Based on GPT-3
● The 1.3B RLHF model outperforms the 175B GPT-3 model on human preference
● Two additional models are learned:
○ Reward model (pairwise loss sketched below)
○ PPO policy model
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
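A minimal sketch of the pairwise loss used to train the reward model on human comparisons, where the preferred response should score higher than the rejected one (the scalar rewards below are placeholders for reward-model outputs):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise comparison loss: -log sigmoid(r(x, y_chosen) - r(x, y_rejected))."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy rewards for 3 comparison pairs (placeholders for reward-model outputs)
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.5, -1.0])
print(reward_model_loss(r_chosen, r_rejected))
```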
InstructGPT - LLM with RLHF
LM with RLHF from Anthropic
Bai, Yuntao, et al. "Training a helpful and harmless assistant with reinforcement learning from human feedback." arXiv preprint arXiv:2204.05862 (2022).
LM with RLHF from Anthropic
Bai, Yuntao, et al. "Training a helpful and harmless assistant with reinforcement learning from human feedback." arXiv preprint arXiv:2204.05862 (2022).
Evaluation on ChatGPT
Qin, Chengwei, et al. "Is ChatGPT a General-Purpose Natural Language Processing Task Solver?" arXiv preprint arXiv:2302.06476 (2023).
Evaluation on ChatGPT
Bang, Yejin, et al. "A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity." arXiv preprint arXiv:2302.04023 (2023).
Story 2: RLHF - Alignment with human values
LaMDA: 137B, decoder-only, fine-tuned through supervised learning, not RLHF
Bai, Yuntao, et al. "Training a helpful and harmless assistant with reinforcement learning from human feedback." arXiv preprint arXiv:2204.05862 (2022).
ChatGPT: LLM + RLHF
● Policy Gradient: Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems 12 (1999).
● Deep-Q-Learning: Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
● DPG: Silver, David, et al. "Deterministic policy gradient algorithms." International conference on machine learning. PMLR, 2014.
● DDPG: Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
● PPO: Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
● AlphaGo: Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
● AlphaFold: Jumper, John, et al. "Highly accurate protein structure prediction with AlphaFold." Nature 596.7873 (2021): 583-589.