Flaming-hot Initiation with Regular Execution Sampling for Large Language Models

Weizhe Chen (University of Southern California) weizhech@usc.edu
Zhicheng Zhang (Carnegie Mellon University) zhichen3@cs.cmu.edu
Guanlin Liu, Renjie Zheng, Wenlei Shi, Chen Dun, Zheng Wu, Xing Jin, Lin Yan (ByteDance)
{guanlin.liu, renjie.zheng, wenlei.shi}@bytedance.com
{chen.dun, zheng.wu1, jinxing.9, neil}@bytedance.com

arXiv:2410.21236v1 [cs.LG] 28 Oct 2024
Abstract

Since the release of ChatGPT, large language models (LLMs) have demonstrated remarkable capabilities across various domains. A key challenge in developing these general capabilities is efficiently sourcing diverse, high-quality data. This becomes especially critical in reasoning-related tasks with sandbox checkers, such as math or code, where the goal is to generate correct solutions to specific problems with higher probability. In this work, we introduce Flaming-hot Initiation with Regular Execution (FIRE) sampling, a simple yet highly effective method to efficiently find good responses. Our empirical findings show that FIRE sampling enhances inference-time generation quality and also benefits training in the alignment stage. Furthermore, we explore how FIRE sampling improves performance by promoting diversity, and analyze the impact of employing FIRE at different positions within a response.

1 Introduction

Large language models (LLMs) have achieved remarkable success in a wide range of tasks since the release of ChatGPT (OpenAI, 2022). In addition to traditional natural language processing tasks such as summarization and sentiment analysis, LLMs have demonstrated effectiveness in many new domains, including code generation (Chen et al., 2023; Roziere et al., 2023), human-computer interaction (Li et al., 2023), and math problem-solving (Wei et al., 2022; Yu et al., 2024). Although standalone LLMs have limited reasoning capabilities (Sun et al., 2023; Valmeekam et al., 2023; Chen et al., 2024b), researchers have tried to enhance them by incorporating tool use and developing integrated systems known as LLM agents (Xi et al., 2023; Wang et al., 2024), which expand the applications of LLMs to more general domains like robot control (Wang et al., 2023a) and autonomous driving (Mao et al., 2023).

To develop general capabilities, LLMs are typically trained through a three-stage process: pretraining, supervised fine-tuning (SFT), and alignment (Bai et al., 2022; Ouyang et al., 2022). During pretraining, the model learns from a vast array of data gathered from publicly available sources. Then, in the SFT and alignment stages, the model's abilities are further refined, allowing it to improve its reasoning and better follow users' instructions. To enhance reasoning tasks, a sandbox checker, a tool used to verify the correctness of solutions, is often used during training (Liu et al., 2023b). Therefore, one of the key challenges in achieving effective and efficient training is determining how to obtain more successful samples within a fixed number of trials, particularly when addressing complex problems.

In this paper, we introduce Flaming-hot Initiation with Regular Execution (FIRE), a simple yet effective sampling method for training large language models. Inspired by recent findings on attention sink (Xiao et al., 2023), our approach begins by sampling the initial token at a very high temperature and proceeds with the regular sampling process for the remaining sequence. Our algorithm can be viewed as a simplified and more general version of CoT-decoding (Wang and Zhou, 2024), with a particular focus on training in math and coding domains, where a sandbox checker is available at relatively low cost.

We first show that our method, at inference time, can improve the pass rate within N trials (pass@n), also known as best-of-N (BoN), when only the correctness of the final answer is considered. To demonstrate its effectiveness in training, we show that it can be directly integrated into the reinforcement learning process of large language models. Our approach proves effective across multiple open-source models and various LLM capabilities, including mathematical reasoning and coding. We highlight how our method promotes diversity in generated samples, a key factor linked to performance improvements in pass rate. Importantly, this diversity is maintained even after training with our sampling method, indicating room for further enhancement. We also discuss how simple variations of our method, in which the temperature change occurs mid-process rather than at the start, affect performance.

2 Related Works

Researchers have been exploring two primary directions to efficiently improve response quality under a frozen pre-trained LLM. The first direction focuses on prompting techniques such as Chain-of-Thought (Wei et al., 2022) and Tree-of-Thought (Yao et al., 2023a). The second direction involves letting LLMs fix their own mistakes (Wang et al., 2023b; Yao et al., 2023b; Shinn et al., 2023; Madaan et al., 2023; Chen et al., 2024a). In line with these two directions, there has been increasing focus on controlled decoding in LLMs to enhance reasoning capabilities during inference, ranging from search-based approaches applied to policy models (Mudgal et al., 2023; Huang et al., 2024) to utilizing value models trained in the alignment phase (Liu et al., 2023a; Feng et al., 2023).

In this paper, we also focus on inference time; however, our approach extends to the sampling processes used during the training of large language models, as commonly practiced in InstructGPT (Ouyang et al., 2022). This process consists of three key stages: pretraining, supervised fine-tuning (SFT), and alignment, also known as reinforcement learning from human feedback (RLHF). Large language models trained in this paradigm exhibit properties that, while lacking strong theoretical guarantees, hold empirically and can therefore be exploited. Our work builds on one such property, the attention sink (Xiao et al., 2023): a token or set of tokens that disproportionately receives attention from other tokens in the attention mechanism of transformer architectures. In their study, Xiao et al. found the initial token to be one of the most identifiable attention sinks. While there are no theoretical guarantees, they propose the intuition that initial tokens are visible to and used in all later token generations, making them more readily trained to be attention sinks.

Our work is closely related to CoT-decoding (Wang and Zhou, 2024), which uncovers CoT paths by enumerating over the top-k alternative tokens and aggregating the responses by scoring the decoded responses with confidence on the final answer. However, our approach differs in three key aspects: (1) we introduce a differentiable sampling method that can be directly integrated with existing inference and training frameworks, (2) we focus on improving model performance in scenarios with a sandbox checker, where aggregating responses is less data-efficient, and (3) our method operates without assumptions about the prompts, even when a chain of thought (CoT) is included, extending beyond the scope of CoT-decoding.

3 Flaming-hot Initiation with Regular Execution

3.1 Method

In this work, we propose a sampling method, Flaming-hot Initiation with Regular Execution (FIRE), inspired by the attention sink phenomenon (Xiao et al., 2023), which demonstrates the importance of initial tokens.

FIRE first samples the initial token at a very high temperature p ≫ 1, combined with top-k filtering to make the candidate tokens more controllable. At higher temperatures, the candidate tokens are sampled from a probability distribution that approaches uniform sampling. After the initial token is sampled, FIRE proceeds with the decoding stage using a regular temperature setting.

Our approach is similar to CoT-decoding (Wang and Zhou, 2024), which enumerates the top-k candidates of the initial token. However, while CoT-decoding focuses more on the decoding stage and on extracting chains of thought without prompting, FIRE serves as a general differentiable sampling method that can be combined with existing sampling frameworks, and it is more efficient in the training stage, where a sandbox checker that judges whether a specific answer is correct is available at low cost.

While FIRE can be applied to any token in the decoding stage, we restrict its application to the initial token to prevent the generation of random tokens that are wrong in context. For example, if we applied FIRE after the prefix "1+2=", it would sample, in addition to the token "3", other tokens like "4" or "5", which are very likely to be wrong. In contrast, since FIRE is only applied to the initial token, it is unlikely to lead to broken sentences or code with syntax errors. In our empirical experiments, we found that the initial token frequently consists of words like "Let's", "Sure", "So", and "The", which do not directly convey any information. What these initial tokens affect is the reasoning steps afterward, with the same intuition as StreamingLLM (Xiao et al., 2023).
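Concretely, FIRE needs only two decoding calls: one that emits a single token at the flaming-hot temperature, and one that continues from it at a regular temperature. Below is a minimal sketch of this two-stage procedure using vLLM, which Appendix A reports using (with an initial-token temperature of 30); the model name and the stage-two settings are illustrative assumptions rather than the paper's fixed configuration.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-7B-Instruct")  # any of the paper's open-source models

# Stage 1 ("flaming-hot"): sample exactly one token at a very high
# temperature, with top-k filtering to keep candidates controllable.
hot = SamplingParams(temperature=30.0, top_k=16, max_tokens=1)

# Stage 2 ("regular execution"): continue with an ordinary sampling setup.
# These values are illustrative; the paper sweeps top-p, top-k, and min-p.
regular = SamplingParams(temperature=0.7, top_p=0.9, top_k=16, max_tokens=1024)

def fire_generate(prompt: str) -> str:
    # Two-stage sampling: first token hot, remainder regular.
    first = llm.generate([prompt], hot)[0].outputs[0].text
    rest = llm.generate([prompt + first], regular)[0].outputs[0].text
    return first + rest
```

Because both stages are ordinary sampling calls, the same two-stage wrapper can drop into an RLHF rollout loop unchanged, which is what makes the method easy to integrate into training.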
3.2 Experiments

In this section, we evaluate our algorithm, FIRE, by addressing several key research questions that guide our experiments.

How effective is FIRE during inference? We first showcase the effectiveness of FIRE sampling in inference-only scenarios. We test four open-source models: Qwen2-7B-Instruct (Qwen2) (Yang et al., 2024), Qwen2.5-72B-Instruct (Qwen2.5-72B) (Yang et al., 2024), DeepSeek-coder-v2-Instruct (DeepSeek) (Zhu et al., 2024), and Gemma-2-2b-it (Gemma-2) (Team et al., 2024), on a diverse set of datasets, including GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and MBPP(+) (Austin et al., 2021; Liu et al., 2023c). In GSM8K and MATH, we extend the prompts with the phrase "Please reason step-by-step" to ensure CoT reasoning in the models' responses, a setting in which the original motivation of CoT-decoding becomes less meaningful because CoT paths occur naturally.

For the regular sampling settings, we use a combination of nucleus sampling and top-k sampling. To ensure a fair comparison, we conducted a thorough enumeration over hyperparameters, including p, k, and min-p (Huggingface, 2023). Table 1 and Table 2 present the aggregated results, where the reported numbers represent the best outcomes from the enumeration. We observe that FIRE consistently improves the pass rate compared to regular settings across all models on different benchmarks. To further demonstrate the consistent improvement over different hyperparameters, we provide an example result of Qwen2-7B-Instruct on the MATH dataset in Table 3. Full results for all models and datasets are provided in the appendix. Table 3 reveals that although FIRE may alter the hyperparameter combination that yields optimal performance, it consistently outperforms regular sampling across all hyperparameter combinations.

                     Regular         FIRE
Dataset  Model       Pass%   #EA     Pass%   #EA
GSM8K    DeepSeek    97.57   2.26    98.71   2.76
         Gemma-2     86.81   3.87    87.57   4.01
         Qwen2       95.90   2.58    98.25   3.17
         Qwen2-RL    96.90   2.63    97.90   3.26
MATH     DeepSeek    76.16   5.63    78.16   7.89
         Gemma-2     49.20   9.24    51.48   10.39
         Qwen2       76.60   7.44    79.08   9.03
         Qwen2.5-72B 79.30   2.39    80.40   2.60

Table 1: Inference results for different models on different datasets with the best hyperparameter combinations. Qwen2-RL is a model fine-tuned by ourselves. We show the pass rate (%) with 40 samples, and the effective answers (EA) among the 40 samples.

        Regular           FIRE
        Pass@1  Pass@10   Pass@1  Pass@10
MBPP    61.2    82.8      50.6    86.6
MBPP+   52.7    74.2      44.1    77.0

Table 2: Pass rate (%) with different numbers of samples from Qwen2-7B-Instruct on MBPP and MBPP+.

                   Regular        FIRE
p    k   min-p     n=10   n=40    n=10   n=40
0.7  16  0.01      66.4   75.8    70.0   78.9
0.7  16  0         66.4   75.8    70.0   78.9
0.7  32  0.01      66.2   75.3    70.1   78.9
0.7  32  0         66.2   75.2    70.1   78.9
0.9  16  0.01      66.1   76.6    69.5   78.9
0.9  16  0         66.2   76.6    69.5   78.9
0.9  32  0.01      66.7   76.4    69.5   79.1
0.9  32  0         66.8   74.4    69.1   79.0

Table 3: Pass rate (%) for Qwen2-7B-Instruct on the MATH dataset with different hyperparameter combinations. p: nucleus sampling parameter; k: top-k sampling parameter; min-p: minimum probability threshold (0 indicates min-p is not used). n=10 and n=40 give the number of samples used to calculate the pass rate.
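The enumeration above is a plain grid sweep. As a sketch, it can be expressed as the loop below, mirroring the combinations in Table 3; `evaluate` is a hypothetical helper (not from the paper) that draws n samples per problem under the given settings and scores them with the sandbox checker.

```python
from itertools import product

TOP_P = (0.7, 0.9)
TOP_K = (16, 32)
MIN_P = (0.0, 0.01)  # 0 means min-p filtering is disabled

def sweep(evaluate):
    # evaluate(method, top_p=..., top_k=..., min_p=...) -> pass rate (%);
    # hypothetical helper wrapping sampling plus the sandbox checker.
    results = {}
    for top_p, top_k, min_p in product(TOP_P, TOP_K, MIN_P):
        for method in ("regular", "fire"):
            results[(method, top_p, top_k, min_p)] = evaluate(
                method, top_p=top_p, top_k=top_k, min_p=min_p
            )
    return results
```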
Why is FIRE effective? FIRE introduces more diversity to the initial token, which is generated at a high temperature, and because of the strong attention scores towards initial tokens (Xiao et al., 2023), this diversity benefits the entire subsequent generation. To measure diversity quantitatively, we use the number of unique answers (effective answers) within a set of responses as our metric. We choose not to use popular metrics like n-grams: since we only control the initial token, and since tasks with long reasoning paths, such as math and coding, will likely always contain similar n-grams, such metrics are unsuitable for measuring diversity here. As shown in Figure 1 and Table 1 (#EA), FIRE demonstrates increased diversity across various models and datasets, which contributes to enhanced pass@n performance. As anticipated, FIRE does not improve Pass@1 performance due to its focus on promoting diversity. However, it consistently delivers improvements when more samples are considered.
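As we read them from the text, the two quantities can be computed as follows; this is a plain sketch, with per-sample correctness assumed to come from the sandbox checker.

```python
def pass_at_n(per_problem_flags: list[list[bool]]) -> float:
    # A problem counts as solved if any of its n samples passes the checker;
    # pass@n is the percentage of solved problems.
    solved = sum(any(flags) for flags in per_problem_flags)
    return 100.0 * solved / len(per_problem_flags)

def effective_answers(final_answers: list[str]) -> int:
    # #EA: number of unique final answers among one problem's samples.
    return len(set(final_answers))
```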
Figure 1: Curves for pass rate and number of effective answers with different numbers of samples on GSM8K. (a) Deepseek-Code-v2-Lite-Instruct; (b) Qwen2-7B-Instruct.

Is FIRE helpful when integrated into training? Having established that our method improves pass@n by improving diversity, we directly apply FIRE to boost language model training. To test this, we use Proximal Policy Optimization (PPO) (Schulman et al., 2017) to finetune several models on the GSM8K and MATH datasets, and assess their performance through the final pass rate for single samples (Pass@1). As shown in Table 4, integrating FIRE into the training process leads to an improvement in Pass@1. Notably, even though each data point is sampled only once during PPO training, following common practice (Ouyang et al., 2022; Sheng et al., 2024), our method still yields improvements. The results also show that the improvements are consistent across different models. Furthermore, after our RL training, the model still exhibits diversity and continues to benefit from inference-time pass rate improvements, as evidenced by Qwen2-RL in Table 1. Consequently, FIRE can be applied iteratively to refine the model, leading to an even larger improvement margin.

Dataset  Model     PPO    PPO+FIRE
GSM8K    Deepseek  80.64  82.16
         Qwen2     80.16  82.02
         Gemma     40.39  42.91
         Gemma-2   58.07  61.20
MATH     Qwen2     53.50  55.07

Table 4: Pass@1 on GSM8K and MATH for different models trained with PPO with different sampling.
Can FIRE sampling work in mid-sequence? Finally, we explore the effect of applying FIRE sampling midway through a response. We first construct a dataset that guarantees the correctness of the initial sentences, by utilizing a Process Reward Model (PRM) to identify the first sentence at which a response becomes incorrect. We then evaluate the effect of applying FIRE sampling at the beginning of different sentences (1st, 2nd, and 3rd line) or at the first token deemed incorrect by the PRM ("PRM-line"). We refer the reader to the appendix for a more detailed description of the construction of this dataset. As shown in Table 5, while FIRE sampling offers benefits across these settings, its advantages diminish for tokens beyond the initial ones, despite an overall increase in accuracy due to the prefix being guaranteed correct.

         1st-line  2nd-line  3rd-line  PRM-line
Regular  46.07     74.36     74.77     75.73
FIRE     64.59     74.96     75.92     78.21

Table 5: Pass@10 results from Qwen2-7B-Instruct on the training set of the MATH dataset for FIRE variants with different sampling points, compared to a regular sampling method that does not change the temperature.

4 Conclusion

In this paper, we introduced a novel sampling method called Flaming-hot Initiation with Regular Execution (FIRE). Through empirical analysis, we demonstrated that FIRE enhances both inference-time performance and reinforcement learning, particularly when a chain of thought is integrated into the prompt. We showed that FIRE improves generation diversity, and we believe that this diversity contributes to its overall effectiveness. Additionally, we explored several variants of FIRE that modify the sampling process not only immediately after the question but also during the middle of the generation, further showcasing its versatility.
5 Limitations

While this work focuses on improving the efficiency of LLM training through better sampling methods, it has two limitations. First, our approach lacks a strong theoretical guarantee, so future models, especially ones with different architectures, may not benefit from it. Second, although our method is designed for training LLMs, the inference-time algorithm could potentially bypass safety measures by sampling out-of-distribution data. However, we argue that this concern can be inherently mitigated in models trained with our proposed sampling technique.

References

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

Weizhe Chen, Sven Koenig, and Bistra Dilkina. 2024a. Reprompt: Planning by automatic prompt engineering for large language models agents. arXiv preprint arXiv:2406.11132.

Weizhe Chen, Sven Koenig, and Bistra Dilkina. 2024b. Why solving multi-agent path finding with large language model has not succeeded yet. arXiv preprint arXiv:2401.03630.

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Xidong Feng, Ziyu Wan, Muning Wen, Ying Wen, Weinan Zhang, and Jun Wang. 2023. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. In NeurIPS.

James Y. Huang, Sailik Sengupta, Daniele Bonadiman, Yi-an Lai, Arshit Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchoff, and Dan Roth. 2024. Deal: Decoding-time alignment for large language models. arXiv preprint arXiv:2402.06147.

Huggingface. 2023. New sampling strategy dropped in transformers – min p sampling. https://huggingface.co/posts/joaogante/319451541682734. Accessed: 2024-10-15.

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative agents for "mind" exploration of large language model society. In Advances in Neural Information Processing Systems (NeurIPS).

Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, and Asli Celikyilmaz. 2023a. Making PPO even better: Value-guided Monte-Carlo tree search decoding. arXiv preprint arXiv:2309.15028.

Jiate Liu, Yiqin Zhu, Kaiwen Xiao, Qiang Fu, Xiao Han, Wei Yang, and Deheng Ye. 2023b. RLTF: Reinforcement learning from unit test feedback. arXiv preprint arXiv:2307.04349.

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023c. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems (NeurIPS).

Jiageng Mao, Junjie Ye, Yuxi Qian, Marco Pavone, and Yue Wang. 2023. A language agent for autonomous driving. arXiv preprint arXiv:2311.10813.

Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, et al. 2023. Controlled decoding from language models. arXiv preprint arXiv:2310.17022.

OpenAI. 2022. Introducing ChatGPT.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2024. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256.

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS).

Jiankai Sun, Chuanyang Zheng, Enze Xie, Zhengying Liu, Ruihang Chu, Jianing Qiu, Jiaqi Xu, Mingyu Ding, Hongyang Li, Mengzhe Geng, et al. 2023. A survey of reasoning with foundation models. arXiv preprint arXiv:2312.11562.

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.

Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. 2023. PlanBench: An extensible benchmark for evaluating large language models on planning and reasoning about change. In Advances in Neural Information Processing Systems (NeurIPS).

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345.

Xuezhi Wang and Denny Zhou. 2024. Chain-of-thought reasoning without prompting. arXiv preprint arXiv:2402.10200.

Yen-Jen Wang, Bike Zhang, Jianyu Chen, and Koushil Sreenath. 2023a. Prompt a robot to walk with large language models. arXiv preprint arXiv:2309.09969.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023b. Self-Instruct: Aligning language models with self-generated instructions. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 13484–13508. Association for Computational Linguistics.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.

Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. 2023. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864.

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023a. Tree of Thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023b. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR).

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2024. MetaMath: Bootstrap your own mathematical questions for large language models. In International Conference on Learning Representations (ICLR).

Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. 2024. DeepSeek-Coder-V2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931.

A Implementation Details

In the paper, we proposed FIRE sampling, which is similar to CoT-decoding but removes the need to calculate a confidence score. One of the biggest benefits of this simplification is an extremely easy implementation. For inference, we use vLLM (Kwon et al., 2023) and perform two-stage sampling: the first stage samples only one token at a high temperature, and the second stage continues with regular sampling. For training, we build on HybridFlow (Sheng et al., 2024), a newly released RLHF code base that supports sampling with vLLM, so we only changed the sampling part of the code in the RLHF framework. Across all experiments, the temperature used for the initial token is set to 30.
In our experiments, we enumerate the parameters of top-p sampling, top-k sampling, and min-p sampling; we list all the parameter values we tried in Appendix B. Due to computation costs, not all models are enumerated over the same number of combinations. However, our conclusion that FIRE outperforms regular sampling is consistent, as we show later. Specifically, for MBPP(+) and for Qwen2-RL, the model after our fine-tuning, we test a single hyperparameter combination of top-p = 0.9, top-k = 16, which follows the best configuration from previous trials. For Qwen2.5-72B-Instruct, we follow the recommended hyperparameters of top-p = 0.8, top-k = 16. For the reinforcement learning problems, we use the default parameters in HybridFlow, specifically top-k = 16, top-p = 1.0. For training with FIRE sampling, to enable PPO to accept the relatively out-of-distribution samples, we change the PPO clipping ratio from 0.2 to 0.5. We observe that when PPO+FIRE uses the original clip ratio, it generally matches the original performance, while pure PPO with the higher clip ratio leads training to failure, converging to a pass rate close to 0.
GSM8K, MATH, and MBPP(+). GSM8K is a
dataset with 8.5K total instances of math prob-
lem, of which 7.5K is in the training set and 1.3K
is in the test set. MATH is a math dataset that
is slightly more difficult and more comprehen-
sive than GSM8K, with 7.5K training data and
5K test data. MBPP is a benchmark consisting
of around 1,000 crowd-sourced Python program-
ming problems, and MBPP+ is a benchmark that
enlarges MBPP with some harder problems, reach-
ing around 35K total test problems. While MBPP+
is still under regular update, we use version 0.1.0
in our paper.
For the final part of the experiment, about generation in the middle of a sequence, we use a dataset that guarantees a certain number of prefix sentences to be correct. Here, sentences are delimited by '.' in the answer. This dataset is generated from the training set of the MATH dataset: we first use Qwen2-7B-Instruct to sample 10 responses for each question. Then, for each response, we enumerate the sentences and sample 20 continuations using different numbers of sentences as the prefix, obtaining an approximation of the point at which the original sample becomes wrong. Specifically, if a response goes wrong before the number of lines we enumerate in Table 5, we use the longest prefix that is still correct for that response; i.e., if for a specific sample fewer than 2 sentences are correct, the 3rd-line pass rate is calculated in the same way as PRM-line.
B Extra Experiment Results

We provide the full inference experiment results in Table 6, Table 7, and Table 8. We observe that across all hyperparameter combinations, FIRE stably outperforms regular sampling from Pass@10 onward (Pass@10, Pass@20, and Pass@40). In most settings, FIRE is also superior to regular sampling at Pass@5, and for certain settings on the MATH dataset, FIRE even shows an advantage at Pass@1.
GSM8K:
top-p top-k min-p Sampling Pass@1 Pass@5 Pass@10 Pass@20 Pass@30 Pass@40 EA@40
0.7 16 0.01 Reg 87.19 93.40 95.07 95.53 95.98 96.29 1.96
0.7 16 0.01 FIRE 85.75 94.62 96.21 97.42 97.95 98.18 2.58
0.7 16 0.0 Reg 87.19 93.40 95.07 95.53 95.98 96.29 1.96
0.7 16 0.0 FIRE 85.75 94.62 96.21 97.42 97.95 98.18 2.58
0.7 32 0.01 Reg 87.72 93.63 94.77 95.45 96.13 96.36 1.97
0.7 32 0.01 FIRE 85.06 95.07 96.74 97.73 98.18 98.26 2.64
0.7 32 0.0 Reg 86.58 93.40 94.84 96.06 96.36 96.82 1.97
0.7 32 0.0 FIRE 85.67 94.84 96.66 97.57 98.10 98.26 2.63
0.9 16 0.0 Reg 87.41 94.16 95.68 96.66 97.35 97.57 2.26
0.9 16 0.0 FIRE 84.46 95.83 96.89 97.88 98.26 98.71 2.76
0.9 32 0.01 Reg 86.05 94.31 95.83 96.74 97.04 97.27 2.28
0.9 32 0.01 FIRE 84.76 94.84 96.59 97.88 98.33 98.41 2.86
0.9 32 0.0 Reg 86.43 94.01 95.53 96.66 97.19 97.35 2.29
0.9 32 0.0 FIRE 84.76 94.84 96.59 97.88 98.33 98.41 2.86

MATH:
top-p top-k min-p Sampling Pass@1 Pass@5 Pass@10 Pass@20 Pass@30 Pass@40 EA@40
0.7 16 0.01 Reg 51.04 64.66 69.20 72.36 73.86 74.82 5.08
0.7 16 0.01 FIRE 49.68 64.94 70.16 73.52 75.58 76.68 6.33
0.7 16 0.0 Reg 51.04 64.56 68.56 71.84 73.44 74.62 5.07
0.7 16 0.0 FIRE 49.58 65.48 70.22 74.34 76.02 77.16 6.34
0.7 32 0.01 Reg 51.00 64.36 68.42 71.74 73.46 74.38 5.06
0.7 32 0.01 FIRE 49.08 65.64 70.22 73.92 76.10 77.06 6.90
0.7 32 0.0 Reg 51.00 64.36 68.42 71.74 73.46 74.38 5.06
0.7 32 0.0 FIRE 49.08 65.64 70.22 73.92 76.10 77.06 6.90
0.9 16 0.01 Reg 50.42 64.98 69.44 72.98 75.00 76.08 5.66
0.9 16 0.01 FIRE 48.96 65.34 70.36 74.12 76.16 77.64 7.29
0.9 16 0.0 Reg 50.82 65.36 69.62 73.12 75.06 76.16 5.64
0.9 16 0.0 FIRE 48.26 65.00 69.98 74.42 76.18 77.64 7.26
0.9 32 0.01 Reg 50.00 65.40 69.32 72.88 74.72 75.98 5.65
0.9 32 0.01 FIRE 47.66 65.48 70.48 74.58 76.86 78.16 7.90
0.9 32 0.0 Reg 50.00 65.40 69.32 72.88 74.72 75.98 5.65
0.9 32 0.0 FIRE 47.66 65.48 70.48 74.58 76.86 78.16 7.90

Table 6: Deepseek-coder-v2-Instruct on GSM8K and MATH with regular sampling (Reg) and FIRE (ours). We show the pass rate with different numbers of samples (Pass@n), and the effective answers (EA) among the total 40 samples.

MATH:
top-p top-k min-p Sampling Pass@1 Pass@5 Pass@10 Pass@20 Pass@30 Pass@40 EA@40
0.7 16 0.01 Reg 15.90 29.94 36.40 42.36 45.74 48.14 8.44
0.7 16 0.01 FIRE 17.20 32.28 39.22 45.30 48.52 51.18 9.82
0.7 16 0.0 Reg 15.90 29.94 36.40 42.36 45.74 48.14 8.44
0.7 16 0.0 FIRE 17.20 32.28 39.22 45.30 48.52 51.18 9.82
0.7 32 0.01 Reg 15.78 29.84 36.16 41.70 45.20 47.70 8.40
0.7 32 0.01 FIRE 16.68 32.40 38.80 45.32 48.90 51.26 9.76
0.7 32 0.0 Reg 15.78 29.84 36.16 41.70 45.20 47.70 8.40
0.7 32 0.0 FIRE 16.68 32.40 38.80 45.32 48.90 51.26 9.76
0.9 16 0.01 Reg 14.74 30.46 37.02 43.30 46.98 49.20 9.23
0.9 16 0.01 FIRE 15.12 31.48 38.30 45.48 48.90 51.48 10.39
0.9 16 0.0 Reg 14.74 30.46 37.02 43.30 46.98 49.20 9.23
0.9 16 0.0 FIRE 15.12 31.48 38.30 45.48 48.90 51.48 10.39
0.9 32 0.01 Reg 14.58 30.16 36.20 42.28 45.98 48.34 9.17
0.9 32 0.01 FIRE 15.04 31.24 37.60 44.26 47.84 50.54 10.34
0.9 32 0.0 Reg 15.02 30.06 36.48 43.12 46.52 49.08 9.15
0.9 32 0.0 FIRE 14.58 31.40 38.36 44.92 48.48 51.06 10.35

GSM8K:
top-p top-k min-p Sampling Pass@1 Pass@5 Pass@10 Pass@20 Pass@30 Pass@40 EA@40
0.7 16 0.01 Reg 36.54 66.41 75.66 82.34 84.46 86.81 3.86
0.7 16 0.01 FIRE 32.45 66.57 76.57 83.32 85.97 87.26 3.97
0.7 16 0.0 Reg 36.54 66.41 75.66 82.34 84.46 86.81 3.86
0.7 16 0.0 FIRE 32.45 66.57 76.57 83.32 85.97 87.26 3.97
0.7 32 0.01 Reg 36.92 67.25 75.66 82.11 84.08 85.52 3.91
0.7 32 0.01 FIRE 31.24 66.79 76.27 82.87 85.97 87.57 4.01
0.7 32 0.0 Reg 36.92 67.25 75.66 82.11 84.08 85.52 3.91
0.7 32 0.0 FIRE 31.24 66.79 76.27 82.87 85.97 87.57 4.01

Table 7: Gemma-2-2b-it on GSM8K and MATH with regular sampling (Reg) and FIRE (ours). We show the pass rate with different numbers of samples (Pass@n), and the effective answers (EA) among the total 40 samples.

GSM8K:
top-p top-k min-p Sampling Pass@1 Pass@5 Pass@10 Pass@20 Pass@30 Pass@40 EA@40
0.7 16 0.01 Reg 66.72 89.23 92.80 94.47 95.07 95.83 2.61
0.7 16 0.01 FIRE 66.49 92.87 94.92 96.66 97.19 97.35 3.08
0.7 16 0.0 Reg 66.72 89.23 92.80 94.47 95.07 95.83 2.61
0.7 16 0.0 FIRE 66.49 92.87 94.92 96.66 97.19 97.35 3.08
0.7 32 0.0 Reg 67.02 89.16 92.27 94.31 95.30 95.91 2.58
0.7 32 0.0 FIRE 66.34 92.95 95.75 97.19 97.88 98.26 3.17
0.9 16 0.0 Reg 64.52 90.83 94.16 95.75 96.89 97.42 2.96
0.9 16 0.0 FIRE 64.22 92.27 95.07 97.04 97.65 97.95 3.33

MATH:
top-p top-k min-p Sampling Pass@1 Pass@5 Pass@10 Pass@20 Pass@30 Pass@40 EA@40
0.7 16 0.01 Reg 35.80 59.40 66.40 71.68 74.22 75.76 6.47
0.7 16 0.01 FIRE 40.26 63.74 69.98 75.08 77.42 78.90 7.86
0.7 16 0.0 Reg 35.80 59.40 66.40 71.68 74.22 75.76 6.47
0.7 16 0.0 FIRE 40.26 63.74 69.98 75.08 77.42 78.90 7.86
0.7 32 0.01 Reg 36.42 59.42 66.22 71.10 73.52 75.26 6.47
0.7 32 0.01 FIRE 39.52 63.54 70.10 75.06 77.42 78.92 8.11
0.7 32 0.0 Reg 36.42 59.42 66.22 71.10 73.52 75.26 6.47
0.7 32 0.0 FIRE 39.52 63.54 70.10 75.06 77.42 78.92 8.11
0.9 16 0.01 Reg 35.30 59.48 66.16 71.86 74.68 76.60 7.44
0.9 16 0.01 FIRE 38.70 62.44 69.50 74.64 77.36 78.86 8.76
0.9 16 0.0 Reg 35.30 59.48 66.16 71.86 74.68 76.60 7.44
0.9 16 0.0 FIRE 38.70 62.44 69.50 74.64 77.36 78.86 8.76
0.9 32 0.01 Reg 35.82 59.84 66.70 72.02 74.64 76.38 7.40
0.9 32 0.01 FIRE 37.44 62.72 69.50 75.08 77.50 79.08 9.03
0.9 32 0.0 Reg 35.14 59.84 66.80 72.34 74.72 76.40 7.43
0.9 32 0.0 FIRE 36.70 62.52 69.12 74.54 77.10 79.04 9.04

Table 8: Qwen2-7B-Instruct on GSM8K and MATH with regular sampling (Reg) and FIRE (ours). We show the pass rate with different numbers of samples (Pass@n), and the effective answers (EA) among the total 40 samples.
