0% found this document useful (0 votes)

12 views

CodeTree

A paper about code tree

Uploaded by

hleather

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

12 views

CodeTree

A paper about code tree

Uploaded by

hleather

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 16

CodeTree: Agent-guided Tree Search for Code Generation with Large

Language Models

Jierui Li1 * , Hung Le2 , Yingbo Zhou2 , Caiming Xiong2 , Silvio Savarese2 , Doyen Sahoo2
1
The University of Texas at Austin
2
Salesforce Research
jierui@cs.utexas.edu, hungle@salesforce.com

Abstract containing no programmatic syntax errors and pass-

ing all possible test cases (Chen et al., 2021; Austin
Pre-trained on massive amounts of code and et al., 2021; Hendrycks et al., 2021). Given the ex-
arXiv:2411.04329v2 [cs.CL] 12 Nov 2024

text data, large language models (LLMs) have tremely large search space in code, early methods
demonstrated remarkable achievements in per-
propose to sample a very large number of genera-
forming code generation tasks. With additional
execution-based feedback, these models can act tion outputs (for example, Li et al. (2022) generated
as agents with capabilities to self-refine and im- up to 1 million samples per problem) to increase
prove generated code autonomously. However, the chance of generating a correct code solution.
on challenging coding tasks with extremely More recently, several approaches adopted a
large search space, current agentic approaches “vertical” strategy in which LLMs first generate
still struggle with multi-stage planning, gener- one (or very few) generation output, and then itera-
ating, and debugging. To address this problem,
tively refine this output multiple times, often con-
we propose CodeTree, a framework for LLM
agents to efficiently explore the search space in ditioned by some forms of external feedback (Le
different stages of the code generation process. et al., 2022; Chen et al., 2023b; Shinn et al., 2023).
Specifically, we adopted a unified tree struc- While these approaches are more cost-effective by
ture to explicitly explore different coding strate- focusing only on a small subset of the search space
gies, generate corresponding coding solutions, (i.e. starting from an initial output candidate), the
and subsequently refine the solutions. In each performances of these approaches are bounded by
stage, critical decision-making (ranking, termi- the local optima of the chosen search space. Re-
nation, expanding) of the exploration process
lated to our work, several methods for NLP reason-
is guided by both the environmental execution-
based feedback and LLM-agent-generated feed- ing tasks were introduced to control and enhance
back. We comprehensively evaluated CodeTree the generation procedure of LLMs. For example,
on 7 code generation benchmarks and demon- Wang et al. (2022) proposed to enhance LLMs with
strated the significant performance gains of chains of thought and statistically select the right
CodeTree against strong baselines. Using GPT- solutions based on majority voting. Zhou et al.
4o as the base model, we consistently achieved (2023) decomposed a task into smaller sub-tasks
top results of 95.1% on HumanEval, 98.7%
and addressed them by increasing the order of diffi-
on MBPP, and 43.0% on CodeContests. On
the challenging SWEBench benchmark, our ap- culty. Yao et al. (2024) proposed to improve LLMs
proach led to significant performance gains. by adopting a tree-based structure to explicitly sim-
ulate the exploration of thoughts in a tree. We are
1 Introduction motivated by this line of research and proposed
CodeTree, a new generation framework to effec-
Recently, we have witnessed significant impacts of tively explore the search space of code generation
large language models (LLMs) beyond the NLP do- tasks through a tree-based structure. An overview
main such as in coding tasks (Achiam et al., 2023; of CodeTree is given in Figure 1.
Touvron et al., 2023a; Wang et al., 2023; Rozière We define 3 standard agents, Thinker, Solver,
et al., 2023). However, different from traditional and Debugger, to equip the strategy-planning,
NLP tasks, coding tasks require generated code solution implementation, and solution improve-
to be fully executable and functionally correct i.e. ment correspondingly, posing comprehensive roles
* The work was done while Jierui Li was a research intern needed for code generation. A CodeTree starts
in Salesforce Research Asia. from the input problem as the tree root and subse-
Execution AI Multi-
Approach Explore Exploit Action
feedback feedback agent
CodeRanker (Inala et al., 2022) ✓ ✓
AlphaCode (Li et al., 2022), MBR-Exec (Shi et al., ✓ ✓
2022), CodeT (Chen et al., 2023b)
LEVER (Ni et al., 2023), Coder-Reviewer (Zhang et al., ✓ ✓ ✓
2023b)
Self-correct (Welleck et al., 2023), ILF (Chen et al., ✓ ✓
2023a), Self-refine (Madaan et al., 2023)
CodeChain (Le et al., 2024) ✓ ✓
Self-debug (Chen et al., 2023d), Self-repair (Olausson ✓ ✓ ✓
et al., 2023), Reflexion (Shinn et al., 2023)
CAMEL (Li et al., 2023a) ✓ ✓
ChatDev (Qian et al., 2024), MetaGPT (Hong et al., ✓ ✓ ✓
2023), AgentVerse (Chen et al., 2023c)
Self-collaboration (Dong et al., 2023), AgentCoder ✓ ✓ ✓ ✓
(Huang et al., 2023)
CodeTree (ours) ✓ ✓ ✓ ✓ ✓ ✓

Table 1: We compare CodeTree with related methods in 6 aspects: (1) Explore which adopts a brute-force approach
to independently generate a large number of code candidates; (2) Exploit which focuses on self-refinement using
a small subset of output solutions; (3) Execution feedback which uses code execution outcomes to improve code
qualities; (4) AI feedback which enables synthetic feedback generated by LLMs to improve output code; (5)
Multi-agent which adopts multiple LLM agents to play different roles in the code generation process; and (6) Action
where LLM agents can take different actions and facilitate decision-making.

quent nodes represent code solutions. At any node to competition-level coding tasks. Our results
of the tree, one can either explore sibling nodes demonstrated the significant and consistent perfor-
(other strategies from the same parent node) or its mance gains of CodeTree against strong baselines.
children (refinements of this node). Within Code- Using GPT-4o as the base models, we achieved the
Tree, agents can interact with each other through a top results on HumanEval+, MBPP+ (Liu et al.,
tree expansion guided by a Critic Agent, searching 2023), and CodeContests (Li et al., 2022) respec-
for the optimal code solution. tively. On the challenging SWEBench benchmark,
Rather than following heuristic rules or classic our approach led to significant performance gains.
tree traversal methods, we use Critic Agent to self- We also conducted comprehensive ablation and
evaluate the status of tree nodes at each tree expan- qualitative analysis to derive the best practices and
sion step by executing the following tasks: any limitations of the current method.

• Node scoring: Evaluate the test case outputs 2 Related Work

of generated code and assess whether the tree
Our work is broadly related to the research of large
nodes faithfully follow their corresponding
language models (LLMs) (Chowdhery et al., 2022;
coding strategies.
Achiam et al., 2023; Touvron et al., 2023a). Roz-
• Solution verification & evaluation: For a solu- ière et al. (2023); Li et al. (2023b); Wang et al.
tion that passes visible test cases, verify if it (2023); Chen et al. (2021) extended this line of re-
should be further improved; for a solution that search by allowing LLMs to learn from large-scale
fails, evaluate if it is a promising direction to code-related data such as open-sourced Github
debug and refine the solution. repositories and code commits. Treating code
generation as an autoregressive generation tasks,
Based on the outputs of the above evaluation tasks, LLMs can generate code that correctly follow pro-
Critic Agent can take action to either refine, abort, gramming syntactic rules and are functionally cor-
or accept the current solution, which automatically rect (Chen et al., 2021; Gunasekar et al., 2023;
expands or terminates the tree search. CodeTree is Nijkamp et al., 2023). Early approaches (Chen
flexible and efficient, avoiding duplicated or redun- et al., 2021; Austin et al., 2021; Hendrycks et al.,
dant exploration of functionally similar or unfeasi- 2021) adopted a brute-force principle by indepen-
ble solutions. dently generating a very large number of output
We comprehensively evaluated CodeTree on di- code candidates, so one of which might be the opti-
verse code generation benchmarks from beginner- mal code solution. Subsequently, Li et al. (2022);
Chen et al. (2023b); Ni et al. (2023) proposed to uti- cases are denoted as {(ij , oj )v } while hidden test
lize unit test outcomes to filter for more prospective cases are denoted as {(ij , oj )h }. An output code
generated code samples. Ŵ is correct when Ŵ (ij ) = oj ∀j.
Also related to our work is the studies of self- Please refer to Figure 1 for an overview of our
refinement capabilities of LLMs. These studies method and Figure 2 for a simplified version of
leverage the inherent ability of LLM to perceive instruction prompts to our LLM agents.
arbitrary natural language contexts, including in-
formation such as environmental feedback, to iter- 3.1 Coding Task-Specific Agents
atively fix problematic code and improve its cor- We first introduce three unit LLM agents, specifi-
rectness. For example, Zhang et al. (2023a) uti- cally targeting different parts of the code generation
lized test outcomes as a form of feedback while process, including strategy thinking, code imple-
Welleck et al. (2023); Le et al. (2022) introduced mentation, and code debugging.
LLM-generated feedback from predicted probabil-
Strategy Generation with Thinker Agent Con-
ities about code correctness. Shinn et al. (2023);
ventional approaches such as (Chen et al., 2021;
Chen et al. (2023d); Madaan et al. (2023) focused
Austin et al., 2021; Hendrycks et al., 2021) directly
on more natural forms of feedback such as reflec-
generate output code candidates given a problem
tions and explanations of the code. Le et al. (2024)
description. However, these approaches do not
proposed to cluster sub-module representations of
fully exploit the advantage of LLMs in generating
code as a form of collective feedback.
more expressive outputs in text. Wei et al. (2022)
More related to our work is the research for en-
showed that allowing LLMs to generate natural
hancing and controlling the generation procedure
language step-by-step thoughts can lead to signifi-
of LLMs. Yao et al. (2024); Koh et al. (2024) sim-
cant performance gains in downstream NLP tasks.
ulated the step-by-step exploration of thoughts as a
Following the setting in Yao et al. (2024), request-
tree search to support NLP and multimodal tasks
ing LLMs to generate a list of different natural
respectively. In the code domain, Song et al. (2024)
language thoughts can enhance the diversity of so-
used tree search to guide the debugging of gener-
lutions. We propose to adapt this technique to allow
ated code. Islam et al. (2024) adopted multi-agent
an LLM θT (denoted as “Thinker” agent) to sequen-
system to support different stages of code gener-
tially generate a set of high-level strategies given
ation but the exploration is not explicitly defined.
an input coding problem. Each strategy Ŝi is gen-
Chen et al. (2024) introduced a tree structure to
erated autoregressively given previously generated
explore sub-functions in generated code. Different
strategies following:
from prior approaches, we introduce a tree-based
structure as a unified search space for LLM agents Ŝi ∼ pθT (.|Ŝ1:i−1 , D) (1)
to efficiently perform exploration throughout dif-
ferent stages of code generation. See Table 1 for a By allowing models to first generate coding strate-
more systematic comparison with related methods. gies, we enable LLMs to tackle coding problems
using their reasoning capabilities learned from the
3 Method text domain. The expressiveness of generated
strategies in a natural language can potentially
To apply LLMs to code generation, we can define
guide the code-generation process toward more
this task as a sequence-to-sequence task. The input
diverse exploration paths. Notably, we let Thinker
sequence consists of a problem description D, usu-
Agent dynamically decide the number of gener-
ally in the form of a function docstring (including
ated coding strategies, given the fact that different
expected input and output) or the textual explana-
coding problems can have more or fewer feasible
tion of the problem. The output is a corresponding
strategies.
code solution, flattened into a sequence of tokens
Ŵ = (ŵ1 , ..., ŵT ) with ŵt ∈ V. Solution Generation with Solver Agent Given a
Generated codes are evaluated against hidden complete generated strategy Ŝi , we let an LLM θS
test cases to check the functional correctness (denoted as “Solver” agent) generate a set of initial
(Hendrycks et al., 2021; Chen et al., 2021; Li et al., code solutions. Since LLMs are often fine-tuned
2022). The test cases are a set of input-output pairs to follow arbitrary instructions in natural language,
{(ij , oj )} = {(ij , oj )v } ∪ {(ij , oj )h }. Visible test these models can understand novel unseen tasks
Problem Specification
Critic agent
Strategy Thinker
Strategy Strategy Strategy Strategy
agent Execution
Initial code
Initial code Initial code Initial code

Refine
Solver
Reflection Reflection Reflection Reflection agent ● Scoring
Reflection
● Expanding
Refined code Refined code Refined code Refined code ● Terminating

Abort Action:
Reflection Reflection Debugger
agent
Refined code Refined code Agent-guided
tree search
Accept

Figure 1: CodeTree creates a unified search space for exploration throughout the multi-stage code generation process:
strategy generation by a “Thinker” agent, initial code generation by a “Solver” agent, and code improvement by a
“Debugger” agent. To effectively perform exploration within the tree structure, we incorporate both environmental
execution-based feedback as well as AI-generated feedback (generated by a “Critic” LLM agent).

Critic Agent Scoring & Evaluation: (e.g. execution results, compiler signals):
Thinker: Your task is to evaluate a strategy and
Your goal is to think of multiple strategies in corresponding implementation for
English on how to [approach and solve this solving a programming problem. The
problem)]/[improve this solution]. You should
decide how many and what strategies are
solution fails on test cases.
You should score from 1 to 5 on how
Fexe,i = Ŵi ({(ij , oj )v }) (3)
feasible and list and number them line by line. … well the execution outputs match the
[Problem]: <problem description>
[Solution]: <previous solution> [Feedback] if any
expected ones. …
—----------------------------------------------
Fcri,i = θC (Ŵi , Ŝi , Fexe,i , D) (4)
You should score from 1 to 5 on how
Solver: well the solution implements the
Your goal is to implement the solution for a
programming problem based on the user's
strategy and solve the task.
Evaluate if one should keep refining
where θC is Critic Agent. Denoting the collective
instructions… this solution or try other strategies.
[Problem]: <problem description>
[Instruction]: <strategy>
[problem] [solution] [exe feedback] feedback as Fi = {Fexe,i , Fcri,i }, a set of reflec-
Debugger:
Critic Agent Solution Verification:
tions Ri about the code candidates are generated
Your goal is to improve the following solution for
You are given a programming task along
a programming problem, based on its execution
feedback on test cases, including
with a user's solution that passes all by Thinker Agent.
visible tests. Your job is to verify whether
evaluation/reflection for the solution and
this solution will pass the hidden test
instruction. …
cases. Answer True if it’s an acceptable
[Problem]: <problem description>
[Solution]: <previous solution>
solution, Answer False if it’s not. Your
answer should be a single word
R̂i,j ∼ pθT (.|R̂i,1:j−1 , Fi , Ŵi , Ŝi , D) (5)
[Feedback]: <execution feedback>, <Critic
True/False. …
Agent feedback>
[Instruction]: <reflection>
[problem][solution][exe feedback] Ŵi,j ∼ pθD (.|R̂i,j , Fi , Ŵi , Ŝi , D) (6)

Figure 2: Simplified versions of instruction prompts R̂i,j denotes the jth reflection that “Thinker” gener-
used for Thinker, Solver, Debugger, and Critic agents. ates for Ŵi . An LLM θD (“Debugger” Agent) will
Some details are omited for illustration purposes. modify Ŵi , referring this reflection, generating a
new program correspondingly.
during test time (Ouyang et al., 2022; Touvron 3.2 Tree Expanding with Critic Agent
et al., 2023b; Wang et al., 2023). Therefore, by CodeTree builds a heterogeneous tree for each
including the strategy as part of the input instruc- problem, where the tree root represents a problem
tion, we can condition Solver Agent to produce specification (D, {(ij , oj )}) and every subsequent
strategy-specific code candidates. For each candi- tree node represents a generated code solution Ŵi .
date, the generation objective is defined as: Each node has relevant attributes including its col-
Ŵi ∼ pθS (Ŝi , D) (2) lective code feedback Fi and its corresponding
strategy and reflections: Si and Ri . Typically,
Solution Refining with Thinker & Debugger adding a tree node is a two-step process: 1) gener-
Agents Prior approaches such as (Chen et al., ate a code solution from the corresponding strategy
2023a,d; Shinn et al., 2023; Madaan et al., 2023) (Eq. 2 or Eq. 6), 2) evaluate the generated solution
found that syntactic mistakes or even logical flaws Ŵi and obtain environmental feedback (Eq. 4).
in generated code can be fixed by allowing LLMs Unlike previous tree-structure search methods
to iteratively refine and regenerate the code. This (Yao et al., 2024; Islam et al., 2024), we do not con-
self-refinement capability is typically strengthened struct the entire tree in the beginning. Instead, we
by some forms of feedback about the code qualities introduce a Critic Agent to dynamically expand the
tree based on potential strategies. It will guide the 4 Experiments
expansion and spanning of the tree, taking actions
based on its evaluation of the current node. 4.1 Experimental Setup
We applied pass@1(Chen et al., 2021) as our eval-
Node Scoring and Evaluation For a given so-
uation metric: only one code candidate can be se-
lution and corresponding Fexe , Critic Agent per-
lected and submitted for the final evaluation with
forms an evaluation, measuring how promising it is
hidden test cases. We set the generation budget to
through equation 4, which results in Fcri . We sepa-
be 20 samples per coding task. To fairly compare
rately evaluate how well: 1) the execution outputs
our approach with other baselines, we adopted the
of test cases match expected outputs on visible test
same generation budget in all methods. For abla-
cases; and 2) the solution robustly implements its
tion experiments without using Critic Agent, we
corresponding strategy towards problem-solving.
followed similar strategies from (Shinn et al., 2023;
For one program Ŵi and its corresponding feed-
Chen et al., 2023d): we select a solution which
back Fi , the Critic Agent will evaluate whether the
passes all visible test cases as the final solution to
current solution is worth refining, or it should not
be evaluated with hidden test cases.
be explored, making decision between refinement
and abort. The critic score is calculated following Benchmarks We conducted experiments on 2
the equation: categories of code generation tasks: 1) Function
Score(Ŵi ) = Score(Fexe,i ) + Score(Fcri,i ) (7) implementation where a coding task is to com-
plete a single function following a specific func-
Solution Verification For one Ŵi that passes all tion signature: HumanEval (Chen et al., 2021),
visible test cases, it might potentially over-fit the MBPP (Austin et al., 2021), and their EvalPlus
visible test cases and could fail hidden test cases. variants from (Liu et al., 2023), denoted as Hu-
Hence, the critic agent θC will verify if this solution manEval+ and MBPP+; and 2) Program implemen-
is robust and generalizable to unseen test cases. tation where a coding task is to solve an algorithmic
problem: CodeContests (Li et al., 2022) and APPS
Decision-Making by Critic Agent Starting from
(Hendrycks et al., 2021). The sizes of test set are
the initial Si , Wi , Fi , Critic Agent guides the search
164, 378 and 165 for HumanEval(+), MBPP(+)
for a correct solution. At each node, it has an action
and CodeContests respectively. For APPS, we ran-
space of three actions: Refine: Continue exploring
domly sampled 150 samples (50 for each level of
from the current node by generating multiple reflec-
difficulty) from the test split.
tions for this node; Abort: Prune this node due to
its low evaluation score, and retrace the exploration Baselines We introduce the following baselines:
to its sibling nodes; and Accept: Accept the current Direct instructs the model to generate code directly
node as the final solution and terminate the search from the input problem; CoT (Wei et al., 2022)
early. instructs the model to provide chain-of-thought
3.3 Multi-agent Collaboration reasoning before giving the solution program; Re-
flexion (Shinn et al., 2023) utilizes solution’s ex-
Throughout the expansion of the tree, the task- ecution feedback to generate self-reflections. The
specific agents collaborate with Critic Agent, utiliz- reflections are used to iteratively refine the solution;
ing its feedback, and follow its guidance to perform MapCoder (Islam et al., 2024) proposes an agent
exploration. The flexibility of the tree expansion collaboration system to plan, solve, test, and refine
and search is determined by LLM agents’ decision- the solution. We set #plans=4, #debug-round=5
making, e.g. determining the number of strategies and generation budget=20; and Resample follows
and deciding the search path. During inference a similar principle as Li et al. (2022): resample so-
time, practically, we limit the number of explo- lutions repeatedly and filter them with visible test
ration steps to avoid large computation overhead. cases.1
Whenever a termination signal (i.e. to accept a
code solution) is found or the maximum number Models We studied our method on three mod-
of exploration steps is reached, a code candidate is els with different model sizes and capacities. We
selected based on its evaluation score Score(W ˆ i ).
1
We set sampling temperature=1 for Resample, and report
Please refer to Appendix A for all example instruc- the best results over 2 runs. For other methods, we report the
tion prompts of our LLM agents. single run’s results with deterministic inference.
Model Method HumanEval HumanE+ MBPP MBPP+ Codecontests Avg.
Direct 86.6% 78.7% 87.8% 73.3% 10.3% 67.3%
CoT 84.8% 78.0% 89.2% 74.3% 12.7% 67.8%
Reflexion 92.1% 83.5% 96.6% 78.6% 21.8% 74.5%
GPT-4o-mini MapCoder 91.5% 78.0% 90.0% - - -
Resample 89.0% 80.5% 94.3% 76.8% 18.2% 71.8%
CodeTree-BFS 93.3% 82.1% 91.5% 72.3% 20.6% 72.0%
CodeTree-DFS 92.7% 81.1% 87.6% 71.4% 20.6% 70.7%
Strategy List 90.2% 80.5% 90.5% 69.6% 22.4% 70.6%
CodeTree 94.5% 84.8% 96.8% 77.0% 26.4% 75.9%
Direct 88.4% 81.7% 92.3% 75.9% 20.6% 71.8%
CoT 92.1% 84.1% 93.7% 77.2% 24.8% 74.4%
Reflexion 94.5% 84.8% 97.9% 79.6% 41.8% 79.7%
GPT-4o MapCoder 92.7% 81.7% 90.9% - - -
Resample 93.9% 84.8% 96.2% 77.0% 32.7% 76.9%
CodeTree-BFS 94.5% 84.1% 93.9% 70.7% 35.8% 75.8%
CodeTree-DFS 95.1% 83.5% 91.5% 76.2% 36.4% 76.5%
Strategy List 95.1% 82.3% 92.6% 73.3% 36.4% 75.9%
CodeTree 94.5% 86.0% 98.7% 80.7% 43.0% 80.6%
Direct 63.4% 54.3% 73.4% 63.8% 6.1% 52.2%
CoT 65.9% 56.1% 74.6% 65.3% 4.2% 53.2%
Reflexion 79.9% 69.5% 90.2% 72.0% 13.5% 65.0%
Llama-3.1-8B Resample 82.3% 71.3% 91.0% 73.8% 15.2% 66.7%
CodeTree-BFS 80.5% 68.3% 91.0% 69.3% 15.8% 65.0%
CodeTree-DFS 80.5% 68.9% 89.7% 70.4% 15.2% 64.9%
Strategy List 82.3% 70.1% 91.0% 72.5% 13.9% 66.0%
CodeTree 82.3% 72.0% 90.5% 73.3% 12.1% 66.0%

Table 2: Experimental results by pass@1 on HumanEval, MBPP, EvalPlus, and CodeContests: methods are
baseline methods that generate program solution only once, are methods with solution generation budget of 20
samples like our methods. are CodeTree variants with or without Critic Agent to guide the tree search. Note that
MapCoder does not work with Llama-3.1-8B as noted by Islam et al. (2024).

experimented on large language models from the forms better than CodeTree-DFS, suggesting that
GPT and Llama 3.1 family. Specifically we use exploring diverse strategies is more effective than
GPT-4o-mini, GPT-4o2 , and Llama-3.1-8B 3 . iteratively refining from one solution. Interestingly,
on Llama-3.1-8B model, Resampling achieves the
4.2 Main Results best results on 4 benchmarks. This observation
We compared CodeTree with other baselines in indicates that small language models may not be
Table 2. We noticed that Reflexion and Resam- suitable for multi-agent frameworks like CodeTree,
pling serve as strong baselines for HumanEval and where models are required to follow task-specific
MBPP datasets given the same solution genera- roles and instructions and perform distinct tasks
tion budget, comparable to CodeTree-BFS/DFS. with reasonable accuracy.
CodeTree with Critic Agent outperforms all other
4.3 Analysis of Search Strategies
baselines in 4 out of 5 benchmarks for GPT-4o-
mini and GPT-4o. For instance, CodeTree achieves Given the performance gaps between CodeTree-
pass@1=43.0% on competition-level coding tasks BFS and DFS, we conducted additional experi-
in the Codecontests benchmark (i.e. 22.4% perfor- ments to analyze these tree search strategies with-
mance gain over the Resampling baseline), show- out Critic Agent. We reported the results on
ing its advantage in solving hard problems. HumanEval/HumanEval+ with GPT-4o-mini and
We found that CodeTree-BFS almost always per- Codecontests with GPT-4o in Table 3. Compared
to DFS/BFS strategies with d = 3 and w = 3, we
2
https://openai.com/index/hello-gpt-4o/ observed that forcing the model to search wider
3
https://huggingface.co/meta-llama/Llama-3.
1-8B-Instruct. Note that we reported our replicated results (i.e. more diverse strategies with w > 3) in BFS
which might be different from the original reported ones. and only debug up to 1 iteration (i.e. d = 2) im-
Figure 3: Results of pass@1 on the APPS test subsets. We randomly select 50 samples from Introductory, Interview,
and Competition separately. We apply Resample and our methods with GPT-4o, GPT-4o-mini, Llama-3.1-8B.

Model GPT-4o-mini GPT-4o Model GPT-4o-mini

Benchmark HumanEval HumanEval+ CodeContests Benchmark HumanEval HumanEval+
CodeTree-BFS
CodeTree 94.5% 84.8%
d = 3, w = 3 93.3% 82.3% 36.4%
d = 2, w = 4 95.1% 84.1% 37.6% w/o verification 91.5% 81.7%
d = 2, w = 5 94.5% 83.4% 39.4% w/o node abort 91.5% 81.1%
w/o scoring 92.7% 82.1%
CodeTree-DFS
d = 3, w = 3 92.7% 81.1% 36.4%
d = 4, w = 2 92.1% 81.1% 37.0% Table 4: Ablation study for different tasks of Critic
d = 5, w = 2 92.1% 81.7% 36.4% Agent. We used GPT-4o-mini to evaluate corresponding
settings and reported the pass@1 on the HumanEval
Table 3: Pass@1 results of CodeTree-BFS/DFS on Hu- and HumanEval+ benchmarks.
manEval, HumanEval+, and CodeContests: d indicates
max search depth while b indicates max search width.
Competition-level). We hypothesized that while
CodeTree improves the search efficiency towards
proved the performance of pass@1. However, for the correct solution, a generation budget= 20 is
DFS, prioritizing deeper search (i.e. larger number still limited for highly complex problems.
of iterations for refinement with d > 3) does not
boost the performance significantly. Finally, we 4.5 Ablation Study
noted that w = 4 works better for HumanEval and Results in Table 2 indicate, Critic Agent plays a
w = 5 works better for CodeContests, indicating crucial role in CodeTree over naive search meth-
that more complex problems can benefit from ex- ods like BFS/DFS. We further analyzed which task
ploring a larger number of coding strategies. This in Section 3.2 is the most crucial for Critic Agent.
finding supports our proposed CodeTree in which Specifically, we conducted the following ablation
Critic Agent can dynamically determine the num- experiments: (1) w/o Solution Verification, where
ber of child nodes to explore given the context. we excluded the verification task for any solution
passing visible tests; (2) w/o Node Abort Evalu-
4.4 Analysis by Problem Difficulty
ation, where we let the agents keep exploring till
We further evaluated CodeTree against coding prob- we reach the max depth or whenever a solution
lems with diverse difficulty levels. We randomly is accepted; (3) w/o Node Scoring, where the en-
sampled 50 problems from each level of difficulty vironmental feedback is solely execution outputs,
(“introductory”, “interview”, and “competition”) without Critic Agent’s evaluation.
from the test split of the APPS benchmark, creating The results in Table 4 show that all proposed
a total test set of 150 problems. We reported the tasks are crucial for Critic Agent to guide the tree
results in Figure 3. expanding and solution search. Among these tasks,
The results demonstrate that CodeTree performs node abort and solution verification tasks are the
better for simpler problems (Introductory for GPT- most effective and have significant impacts on final
4o-mini; Introductory and Interview for GPT-4o), performances. We include a qualitative study on a
but still struggles to solve very hard problems (e.g., real example in Appendix A.
(a) Performance on CodeContests with GPT-4o (b) Performance on HumanEval+ with GPT-4o-mini

Figure 4: Cumulative pass@1 curves while new solutions are generated within budget= 20.

4.6 Search Efficiency Approach % Resolved

(Xia et al., 2024) 24.2%
While the performance of CodeTree is very strong (Xia et al., 2024) + CoT 23.6%
(Xia et al., 2024) + Reflexion 25.3%
with a generation budget of 20 samples, it is impor- (Xia et al., 2024) + CodeTree 27.6%
tant to understand how our tree search strategy are
more efficient than other related methods. Specif- Table 5: Results on the SWEBench benchmark: All
ically, we conducted experiments when limiting methods using GPT4o-mini as the base model. Com-
pared to CoT and Reflexion, CodeTree can lead to more
the solution generation budget between 1 to 20
significant performance gain.
samples. In Figure 4, we plot the pass@1 curves
w.r.t the number of sampled solutions: (a) shows
results on CodeContests with GPT-4o; (b) shows retrieval and code patch generation stages from Xia
results on HumanEval+ with GPT-4o-mini. In both et al. (2024) and integrated our CodeTree frame-
figures, not only does our proposed CodeTree per- work for the exploration of different trajectories
form better than other methods, but also achieves a (from context retrieval to code patch generation and
relatively high pass@1 even when the generation ranking). From Table 5, we observed that CodeTree
budget is small (e.g. < 9 samples). On Code- can lead to significant performance gains as com-
Contests, even when CodeTree starts with a lower pared to related approaches like CoT (Wei et al.,
performance with limited generation budgets (i.e. 2022) and Reflexion (Shinn et al., 2023). The re-
< 5 samples), its performance soon improves sig- sults also demonstrate the versatility of our method
nificantly during later exploration of the tree. This on complex coding tasks like repo-level code gen-
observation shows that CodeTree has the potential eration which often requires extremely large search
to continue solving more problems given larger space to find an optimal solution.
solution generation budgets.
5 Conclusion
4.7 Results on SWEBench
We introduce CodeTree, a new framework of agent-
Recently, Jimenez et al. (2024) introduced a novel guided tree search for code generation tasks. In-
coding task for generating code patches given an troducing a tree-based structure as a unified search
input Github repository and the corresponding pull space, CodeTree includes a Critic agent to guide
request. We attempted to adapt CodeTree to this the tree search and make critical decisions such
benchmark by making two major changes: first, we as termination, expanding and scoring of tree
replaced the problem specification with the tex- nodes. CodeTree facilitates multi-agent collabora-
tual description of input pull request; secondly, tion (among Thinker, Solver, and Debugger agents)
we adapted the strategy generation stage as the re- to find the correct solution within a limited solution
trieval stage where we instructed the Thinker agent generation budget. In our comprehensive exper-
to explore relevant code contexts (by code files, iments, CodeTree exhibits consistent and strong
methods, or lines of code) from the input Github performance against many baselines across diverse
repository. We extended the implementation of the code generation benchmarks.
6 Limitations Angelica Chen, Jérémy Scheurer, Tomasz Korbak,
Jon Ander Campos, Jun Shern Chan, Samuel R Bow-
The main limitation of this work is the requirement man, Kyunghyun Cho, and Ethan Perez. 2023a. Im-
for LLMs with strong reasoning and instruction- proving code generation by training with natural lan-
following capabilities. Empirically, we found that guage feedback. arXiv preprint arXiv:2303.16749.
smaller models such as those with 8B parameters Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan,
struggle at following complex instructions to play Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2023b.
different agentic roles. For instance, to play the Codet: Code generation with generated tests. In
The Eleventh International Conference on Learning
role of the Critic agent, smaller models may gen- Representations.
erate unexpected output for some tasks like node
scoring or self-reflection, resulting in misleading or Jingchang Chen, Hongxuan Tang, Zheng Chu, Qiang-
long Chen, Zekun Wang, Ming Liu, and Bing Qin.
noisy feedback to guide other LLM agents. In addi-
2024. Divide-and-conquer meets consensus: Un-
tion, incorporating agent-generated guidance will leashing the power of functions in code generation.
incur additional cost as it requires LLMs to extract arXiv preprint arXiv:2405.20092.
information from long code context and generat-
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming
ing more output tokens. Finally, this work focuses Yuan, Henrique Ponde de Oliveira Pinto, Jared Ka-
on the functionality correctness of code generation plan, Harri Edwards, Yuri Burda, Nicholas Joseph,
tasks while ignoring other qualities of generated Greg Brockman, et al. 2021. Evaluating large
code such as its readability or efficiency. To im- language models trained on code. arXiv preprint
arXiv:2107.03374.
prove these qualities of output code, CodeTree can
be further enhanced to incorporate more holistic Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang,
LLM-generated feedback and perform diverse ex- Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu,
Yi-Hsin Hung, Chen Qian, et al. 2023c. Agentverse:
ploration to find optimal code solutions. Facilitating multi-agent collaboration and exploring
emergent behaviors. In The Twelfth International
7 Ethical Considerations Conference on Learning Representations.
The primary ethical concern of this paper is that Xinyun Chen, Maxwell Lin, Nathanael Schärli, and
directly executing the pipeline may pose a risk of Denny Zhou. 2023d. Teaching large language mod-
information leakage or compromise the security of els to self-debug. arXiv preprint arXiv:2304.05128.
one’s system. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin,
Maarten Bosma, Gaurav Mishra, Adam Roberts,
• Directly passing the execution error informa- Paul Barham, Hyung Won Chung, Charles Sutton,
tion to commercial large language models Sebastian Gehrmann, Parker Schuh, Kensen Shi,
might cause the leak of private information Sasha Tsvyashchenko, Joshua Maynez, Abhishek
like the execution paths, username and pack- Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vin-
odkumar Prabhakaran, Emily Reif, Nan Du, Ben
age versions in the environment.
Hutchinson, Reiner Pope, James Bradbury, Jacob
• Running AI-generated programs locally might Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin,
Toju Duke, Anselm Levskaya, Sanjay Ghemawat,
put one’s system and data at risk. Sunipa Dev, Henryk Michalewski, Xavier Garcia,
Vedant Misra, Kevin Robinson, Liam Fedus, Denny
We will address the above issues by adding imple-
Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim,
mentation of virtual environment for code execu- Barret Zoph, Alexander Spiridonov, Ryan Sepassi,
tion by the time we publish our codebase. David Dohan, Shivani Agrawal, Mark Omernick, An-
drew M. Dai, Thanumalayan Sankaranarayana Pil-
lai, Marie Pellat, Aitor Lewkowycz, Erica Moreira,
References Rewon Child, Oleksandr Polozov, Katherine Lee,
Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy
Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov,
Diogo Almeida, Janko Altenschmidt, Sam Altman, and Noah Fiedel. 2022. Palm: Scaling language mod-
Shyamal Anadkat, et al. 2023. Gpt-4 technical report. eling with pathways. CoRR, abs/2204.02311.
arXiv preprint arXiv:2303.08774.
Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. 2023.
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Self-collaboration code generation via chatgpt. arXiv
Bosma, Henryk Michalewski, David Dohan, Ellen preprint arXiv:2304.07590.
Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021.
Program synthesis with large language models. arXiv Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio
preprint arXiv:2108.07732. César Teodoro Mendes, Allie Del Giorno, Sivakanth
Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo Yujia Li, David Choi, Junyoung Chung, Nate Kushman,
de Rosa, Olli Saarikivi, et al. 2023. Textbooks are all Julian Schrittwieser, Rémi Leblond, Tom Eccles,
you need. arXiv preprint arXiv:2306.11644. James Keeling, Felix Gimeno, Agustin Dal Lago,
et al. 2022. Competition-level code generation with
Dan Hendrycks, Steven Basart, Saurav Kadavath, Man- alphacode. arXiv preprint arXiv:2203.07814.
tas Mazeika, Akul Arora, Ethan Guo, Collin Burns,
Samir Puranik, Horace He, Dawn Song, and Jacob Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Ling-
Steinhardt. 2021. Measuring coding challenge com- ming Zhang. 2023. Is your code generated by chat-
petence with apps. NeurIPS. GPT really correct? rigorous evaluation of large lan-
guage models for code generation. In Thirty-seventh
Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng
Conference on Neural Information Processing Sys-
Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven
tems.
Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. 2023.
Metagpt: Meta programming for multi-agent collabo-
rative framework. arXiv preprint arXiv:2308.00352. Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler
Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon,
Dong Huang, Qingwen Bu, Jie M Zhang, Michael Luck, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang,
and Heming Cui. 2023. Agentcoder: Multi-agent- et al. 2023. Self-refine: Iterative refinement with
based code generation with iterative testing and opti- self-feedback. arXiv preprint arXiv:2303.17651.
misation. arXiv preprint arXiv:2312.13010.
Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoy-
Jeevana Priya Inala, Chenglong Wang, Mei Yang, An- anov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin.
dres Codas, Mark Encarnación, Shuvendu K Lahiri, 2023. Lever: Learning to verify language-to-code
Madanlal Musuvathi, and Jianfeng Gao. 2022. Fault- generation with execution. In International Con-
aware neural code rankers. In Advances in Neural ference on Machine Learning, pages 26106–26128.
Information Processing Systems. PMLR.
Md Ashraful Islam, Mohammed Eunus Ali, and Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Sil-
Md Rizwan Parvez. 2024. Mapcoder: Multi-agent vio Savarese, and Yingbo Zhou. 2023. Codegen2:
code generation for competitive problem solving. Lessons for training llms on programming and natu-
arXiv preprint arXiv:2405.11403. ral languages. arXiv preprint arXiv:2305.02309.
Carlos E Jimenez, John Yang, Alexander Wettig,
Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Theo X Olausson, Jeevana Priya Inala, Chenglong
Narasimhan. 2024. SWE-bench: Can language mod- Wang, Jianfeng Gao, and Armando Solar-Lezama.
els resolve real-world github issues? In The Twelfth 2023. Demystifying gpt self-repair for code genera-
International Conference on Learning Representa- tion. arXiv preprint arXiv:2306.09896.
tions.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car-
Jing Yu Koh, Stephen McAleer, Daniel Fried, and Rus- roll L Wainwright, Pamela Mishkin, Chong Zhang,
lan Salakhutdinov. 2024. Tree search for language Sandhini Agarwal, Katarina Slama, Alex Ray, et al.
model agents. arXiv preprint arXiv:2407.01476. 2022. Training language models to follow in-
structions with human feedback. arXiv preprint
Hung Le, Hailin Chen, Amrita Saha, Akash Gokul, arXiv:2203.02155.
Doyen Sahoo, and Shafiq Joty. 2024. Codechain:
Towards modular code generation through chain of Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan
self-revisions with representative sub-modules. In Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng
The Twelfth International Conference on Learning Su, Xin Cong, et al. 2024. Chatdev: Communicative
Representations. agents for software development. In Proceedings
of the 62nd Annual Meeting of the Association for
Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Computational Linguistics (Volume 1: Long Papers),
Savarese, and Steven Chu Hong Hoi. 2022. Coderl: pages 15174–15186.
Mastering code generation through pretrained models
and deep reinforcement learning. Advances in Neural
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten
Information Processing Systems, 35:21314–21328.
Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi,
Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023.
Khizbullin, and Bernard Ghanem. 2023a. Camel: Code llama: Open foundation models for code. arXiv
Communicative agents for" mind" exploration of preprint arXiv:2308.12950.
large language model society. Advances in Neural
Information Processing Systems, 36:51991–52008. Freda Shi, Daniel Fried, Marjan Ghazvininejad, Luke
Zettlemoyer, and Sida I. Wang. 2022. Natural lan-
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas guage to code translation with execution. In Proceed-
Muennighoff, Denis Kocetkov, Chenghao Mou, Marc ings of the 2022 Conference on Empirical Methods
Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. in Natural Language Processing, pages 3533–3546,
2023b. Starcoder: may the source be with you! Abu Dhabi, United Arab Emirates. Association for
arXiv preprint arXiv:2305.06161. Computational Linguistics.
Noah Shinn, Federico Cassano, Beck Labash, Ashwin In International Conference on Machine Learning,
Gopinath, Karthik Narasimhan, and Shunyu Yao. pages 41832–41846. PMLR.
2023. Reflexion: Language agents with verbal rein-
forcement learning. Preprint, arXiv:2303.11366. Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei,
Nathan Scales, Xuezhi Wang, Dale Schuurmans,
Jialin Song, Jonathan Raiman, and Bryan Catan- Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H.
zaro. 2024. Effective large language model de- Chi. 2023. Least-to-most prompting enables com-
bugging with best-first tree search. arXiv preprint plex reasoning in large language models. In The
arXiv:2407.19055. Eleventh International Conference on Learning Rep-
resentations.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier
Martinet, Marie-Anne Lachaux, Timothée Lacroix,
Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal
Azhar, et al. 2023a. Llama: Open and effi-
cient foundation language models. arXiv preprint
arXiv:2302.13971.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Al-

bert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti
Bhosale, et al. 2023b. Llama 2: Open founda-
tion and fine-tuned chat models. arXiv preprint
arXiv:2307.09288.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le,

Ed Chi, Sharan Narang, Aakanksha Chowdhery, and
Denny Zhou. 2022. Self-consistency improves chain
of thought reasoning in language models. arXiv
preprint arXiv:2203.11171.

Yue Wang, Hung Le, Akhilesh Deepak Gotmare,

Nghi DQ Bui, Junnan Li, and Steven CH Hoi. 2023.
Codet5+: Open code large language models for
code understanding and generation. arXiv preprint
arXiv:2305.07922.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten

Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou,
et al. 2022. Chain-of-thought prompting elicits rea-
soning in large language models. Advances in Neural
Information Processing Systems, 35:24824–24837.

Sean Welleck, Ximing Lu, Peter West, Faeze Brah-

man, Tianxiao Shen, Daniel Khashabi, and Yejin
Choi. 2023. Generating sequences by learning to
self-correct. In The Eleventh International Confer-
ence on Learning Representations.

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and

Lingming Zhang. 2024. Agentless: Demystify-
ing llm-based software engineering agents. arXiv
preprint arXiv:2407.01489.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran,

Tom Griffiths, Yuan Cao, and Karthik Narasimhan.
2024. Tree of thoughts: Deliberate problem solving
with large language models. Advances in Neural
Information Processing Systems, 36.

Kechi Zhang, Zhuo Li, Jia Li, Ge Li, and Zhi Jin. 2023a.
Self-edit: Fault-aware code editor for code genera-
tion. arXiv preprint arXiv:2305.04087.

Tianyi Zhang, Tao Yu, Tatsunori Hashimoto, Mike

Lewis, Wen-tau Yih, Daniel Fried, and Sida Wang.
2023b. Coder reviewer reranking for code generation.
A Appendix
A.1 Qualitative Results
Here we show one real example using GPT-4o-mini
to solve the problem: HumanEval-36. We keep the
actual prompts and the model’s responses in Fig-
ure 5, 6, and 7. We find that the initial solution
W1 generated, which passes all visible test cases,
is rejected by the critic agent, who also suggest
improvements. Debugger follows critic agent’s
suggestion and implement W2 . We evaluate W1
and W2 on the hidden test cases, where W2 passes
and W1 fails. This indicates that the Critic Agent
is making correct judgment for false positive solu-
tion(i.e., pass visible but fail on hidden test cases),
and effectively guide to the correct solution.
In Figure 5. The Thinker Agent decide to gen-
erate 5 strategies that potentially be the solutions.
To start the tree search, the first strategy S1 was
sent to the solver agent to implement. After one
solution is generated, it was sent to the Solver to
be implemented. W1 is immediately executed on
visible test cases and observe the outputs.
In Figure 6, for a W1 that passes all visible test
cases, it will go to the critic agent to be further
verified if W1 can be applied to more general test
cases. The Critic Agent gives negative judgment
and suggest to improve this solution by considering
more situation like zeros and negative integers. We
use oracle hidden test cases to evaluate W1 , which
is indeed incorrect, endorsing the decision made by
Critic Agent.
In Figure 7, the Debugger Agent will treat the
judgement and reason from Critic Agent as addi-
tional environmental feedback, and then implement
a new solution conditioned on: 1. the problem
itself, 2.the solution to improve, 3. environmen-
tal feedback. Debugger Agent refines W1 to W2 ,
which is accepted by the Critic Agent. We evaluate
W2 on hidden test cases, it is correct.

A.2 Other Prompts

Critic Agent-test cases scoring, solution scoring,
and Thinker Agent Reflection are not included
in the above real case(HumanEval-36) solving,
we presented the detailed prompts used for these
agents in Figure 8.
Thinker Agent (Strategy):
<system>:
You are an AI assistant that provides strategy for Python programmers to code. You will be given a function signature and its
docstring by the user. Your goal is to think of multiple distinct strategies in English(Not Code) on how to approach this problem and
solve it. The number of alternatives should be determined for this speciﬁc problem.
-----------------------------------
Your answer should be formatted as below:
List and Number your strategies line by line using "1. "; "2. "; "3. " and so on.
Describe each strategy with a FEW sentences in a SINGLE Line.

<user>:
[problem description]:
```
def fizz_buzz(n: int):
"""Return the number of times the digit 7 appears in integers less than n which are divisible by 11 or 13.
>>> fizz_buzz(50)
0
>>> fizz_buzz(78)
2
>>> fizz_buzz(79)
3
"""
```

<assistant>:
1. Start by understanding the problem requirements and constraints, focusing on the need to count occurrences of the digit '7' in
speciﬁc integers.
2. Identify the range of integers to consider, which is from 0 to n-1, and clarify the conditions for divisibility by 11 or 13.
3. Create a loop to iterate through all integers from 0 to n-1, checking each integer for divisibility by 11 or 13.
4. For each integer that meets the divisibility condition, convert it to a string to facilitate counting occurrences of the digit '7'.
5. Use the string method `.count('7')` to count how many times '7' appears in the string representation of the integer.

Solver Agent:
<system>:
You are an AI that only responds with python code, NOT ENGLISH. You will be given a function signature and its docstring by the
user. Write your full implementation (restate the function signature). Don't include test cases or printing statements in the code
block.

<user>:
```
def fizz_buzz(n: int):
"""Return the number of times the digit 7 appears in integers less than n which are divisible by 11 or 13.
>>> fizz_buzz(50)
0
>>> fizz_buzz(78)
2
>>> fizz_buzz(79)
3
"""
```
To solve the problem, You can follow the hint given by an expert: Start by understanding the problem requirements and constraints,
focusing on the need to count occurrences of the digit '7' in speciﬁc integers.

<assistant>:
```
def fizz_buzz(n: int):
"""Return the number of times the digit 7 appears in integers less than n which are divisible by 11 or 13.
>>> fizz_buzz(50)
0
>>> fizz_buzz(78)
2
>>> fizz_buzz(79)
3
"""
count = 0
for i in range(n):
if i % 11 == 0 or i % 13 == 0:
count += str(i).count('7')
return count
```

Figure 5: GPT-4o-mini as the Thinker and Solver Agents to solve HumanEval-36. The thinker agent generates 5
distinct strategies and the solver agent implements the first one. By oracle evaluation, the resulted solution can pass
visible but fail on hidden test cases.
Critic Agent - Solution Verification:
<system>:
The user will give you a programming task and a solution that passes all visible tests.
Your task is to verify whether the solution can pass hidden test cases.
Focus only on functionality correctness. If the solution is correct, no improvement is necessary. Assume all inputs follow formats
and constraints described in the examples.
Answer "True" if you think this solution is acceptable, Answer "False" if you think necessary modification is needed for it to cover
more test cases.
-----------------------------------
Your answer should be formatted as below:
In the first line, answer a single word (True/False).
In the second line, indicate your reason.

<user>:
[Problem description]:
```
def fizz_buzz(n: int):
"""Return the number of times the digit 7 appears in integers less than n which are divisible by 11 or 13.
>>> fizz_buzz(50)
0
>>> fizz_buzz(78)
2
>>> fizz_buzz(79)
3
"""
```

[Implementation]:
```
def ﬁzz_buzz(n: int):
"""Return the number of times the digit 7 appears in integers less than n which are divisible by 11 or 13.
>>> fizz_buzz(50)
0
>>> fizz_buzz(78)
2
>>> fizz_buzz(79)
3
"""
# Start by understanding the problem requirements and constraints, focusing on the need to count occurrences of
the digit '7' in specific integers.
count = 0
for i in range(n):
if i % 11 == 0 or i % 13 == 0:
count += str(i).count('7')
return count
```

[Execution Feedback on Visible Test Cases]:

Tested passed:
assert fizz_buzz(50) == 0
assert fizz_buzz(78) == 2
assert fizz_buzz(79) == 3
Tests failed:

<assistant>:
False
The solution does not account for numbers that are divisible by both 11 and 13, and it may also miss edge cases where the input is
less than 11 or 13. Additionally, it does not handle negative inputs or zero, which could lead to incorrect results.

Figure 6: GPT-4o-mini as the Critic Agent to verify a solution of HumanEval-36 given by Solver Agent. It decides
to reject this solution and suggest improvements.
Debugger Agent:
<system>:
You are an AI Python assistant. You will be given a program implementation to improve, a series of unit tests and execution
feedback, and a hint to change the implementation appropriately. Write your full implementation (restate the function signature).
<user>:
[Implementation to Improve]:
```
def fizz_buzz(n: int):
"""Return the number of times the digit 7 appears in integers less than n which are divisible by 11 or 13.
>>> fizz_buzz(50)
0
>>> fizz_buzz(78)
2
>>> fizz_buzz(79)
3
"""
# Start by understanding the problem requirements and constraints, focusing on the need to count occurrences of
the digit '7' in specific integers.
count = 0
for i in range(n):
if i % 11 == 0 or i % 13 == 0:
count += str(i).count('7')
return count
```

[Execution Feedback on Visible Test Cases]:

Tested passed:
```
assert fizz_buzz(50) == 0
assert fizz_buzz(78) == 2
assert fizz_buzz(79) == 3
```
Tests failed:

[AI-generated Feedback]:
The solution does not account for numbers that are divisible by both 11 and 13, and it may also miss edge cases where the input is
less than 11 or 13. Additionally, it does not handle negative inputs or zero, which could lead to incorrect results.

<assistant>:
Here is the improved implementation of the `ﬁzz_buzz` function, incorporating the suggested changes:
```python
def fizz_buzz(n: int) -> int:
"""Return the number of times the digit 7 appears in integers less than n which are divisible by 11 or 13.
>>> fizz_buzz(50)
0
>>> fizz_buzz(78)
2
>>> fizz_buzz(79)
3
>>> fizz_buzz(10)
0
>>> fizz_buzz(-5)
0
"""
if n <= 0:
return 0

count = 0
for i in range(11, n):
if i % 11 == 0 or i % 13 == 0:
count += str(i).count('7')

return count
```
### Changes Made:
1. **Edge Case Handling**: Added a check for `n <= 0` to return 0 immediately for negative inputs and zero.
2. **Loop Starting Point**: The loop now starts from 11, as numbers less than 11 cannot be divisible by 11 or 13.
3. **Documentation**: Updated the docstring to clarify the function's behavior and added additional test cases for edge cases.
This implementation should now correctly handle all speciﬁed cases and edge conditions.

Figure 7: GPT-4o-mini as the Debugger agent to refine a solution of HumanEval-36 given by Solver Agent. It refers
to Critic Agent’s suggestion and correct the solution successfully.
Critic Agent - Test Output Scoring:
<system>:
Your task is to evaluate the execution outputs of a code implementation. The statement and code is given by the user, and the
output/expected output on a set of test cases. You should analyze the expected outputs and execution outputs. From a 0 to 5 range,
give a score on how good the execution outputs are matching the expected ones (higher score means a better match). Standards
are below:
0: Errors or time out when executing.
1: No pattern found when comparing pairs of <output, expected_output>, errors are hard to interpret.
2: Results abnormal for a part of cases(e.g., cannot handle negative elements; only half of it sorted).
3: Mismatches have clear patterns to interpret the error. For examples, all elements ojset by 1; all elements + 1; reverse all
elements etc.,
4: Lack consideration of edge condition/corner cases(e.g., error only when elements are equal), otherwise correct
5: Results exactly matched.
-----------------------------------
Your answer should be formatted as below:
In the ﬁrst line, give your brief comparison.
In the second line, give A SINGLE INTEGER NUMBER as your ﬁnal score(0 to 5)
<user>:
[problem]
[solution]
[execution results on visible test cases]

Critic Agent - Solution Scoring:

<system>: Your task is to evaluate a strategy and corresponding implementation for solving a programming problem. You should
score from 1 to 5(higher means better) on how well do the solution implement the strategy and solve the task?
-----------------------------------
Your answer should be formatted as below:
In the ﬁrst line, give your brief analysis.
In the second line, give A SINGLE INTEGER NUMBER as your ﬁnal score(0 to 5)
<user>:
[problem]
[solution]
[execution results on visible test cases]

Thinker Agent Reﬂection:

<system>: You are a programming assistant. You goal is to help the user to correct their buggy code. You will be given an incorrect
function implementation and a series of unit tests & execution results. There could be multiple ways to fix the error, you should
provide reflection alternatives using various strategies. The number of reflection alternatives depends on the situation(e.g., if you
are certain about where the bug is, you can provide only one). Each self-reflection should be complete and self-contained. If there
are more than one bugs, they should be presented in one reflection rather than separately.
-----------------------------------
Your answer should be formatted as below:
List and Number your strategies line by line using "1. "; "2. "; "3. " and so on.
Describe each strategy with a FEW sentences in a SINGLE Line.

<user>:
[problem]
[solution]
[execution results on visible test cases]
[agent feedback] if any

Figure 8: Prompts for Critic Agent-test cases scoring, solution scoring, as well as Thinker Agent Reflection.

Daniel Greenfeld, Audrey M. Roy - A Wedge of Django (2021, Two Scoops Press)
No ratings yet
Daniel Greenfeld, Audrey M. Roy - A Wedge of Django (2021, Two Scoops Press)
363 pages
The Invention of The Historic Monument
100% (1)
The Invention of The Historic Monument
27 pages
2412.12544v1
No ratings yet
2412.12544v1
19 pages
2. A_Survey_on_Evaluating_Large_Language_Models_in_Co
No ratings yet
2. A_Survey_on_Evaluating_Large_Language_Models_in_Co
26 pages
2406.00515v1
No ratings yet
2406.00515v1
49 pages
Agent Coder 2312.13010v2
No ratings yet
Agent Coder 2312.13010v2
21 pages
ASE2024_CodeGenSurvey-7
No ratings yet
ASE2024_CodeGenSurvey-7
17 pages
2024.acl-long.737
No ratings yet
2024.acl-long.737
16 pages
A_Review_on_Code_Generation_with_LLMs_Application_and_Evaluation 2 (1)
No ratings yet
A_Review_on_Code_Generation_with_LLMs_Application_and_Evaluation 2 (1)
6 pages
Code Generation With LLMs
No ratings yet
Code Generation With LLMs
59 pages
2403.00046v2
No ratings yet
2403.00046v2
13 pages
CodePori_ Large-Scale System for Autonomous Software Development Using Multi-Agent Technology - 2402.01411v2
No ratings yet
CodePori_ Large-Scale System for Autonomous Software Development Using Multi-Agent Technology - 2402.01411v2
23 pages
Evaluating Large Language Models Trained On Code
No ratings yet
Evaluating Large Language Models Trained On Code
35 pages
Assessing Large Language Models for Code Generation: A Comprehensive Framework
No ratings yet
Assessing Large Language Models for Code Generation: A Comprehensive Framework
6 pages
CODEJUDGE : Evaluating Code Generation with Large Language Models
No ratings yet
CODEJUDGE : Evaluating Code Generation with Large Language Models
20 pages
P G - C 2: B L L M C R F: AN U Oder Oosting Arge Anguage Odels For Ode With Anking Eedback
No ratings yet
P G - C 2: B L L M C R F: AN U Oder Oosting Arge Anguage Odels For Ode With Anking Eedback
15 pages
Bugs in LLms Genereated Code
No ratings yet
Bugs in LLms Genereated Code
47 pages
OpenAI Codex Arxiv
No ratings yet
OpenAI Codex Arxiv
35 pages
S 001: N Q A C LLM E: Afurai EW Ualitative Pproach For ODE Valuation
No ratings yet
S 001: N Q A C LLM E: Afurai EW Ualitative Pproach For ODE Valuation
22 pages
Code Llama
No ratings yet
Code Llama
47 pages
4. Studying the Quality of Source Code Generated by Different AI Generative Engines An Empirical Evaluation
No ratings yet
4. Studying the Quality of Source Code Generated by Different AI Generative Engines An Empirical Evaluation
19 pages
(2023) A Survey On Language Models For Code
No ratings yet
(2023) A Survey On Language Models For Code
55 pages
Code Llama: Open Foundation Models For Code
No ratings yet
Code Llama: Open Foundation Models For Code
48 pages
ChatGPT Coding CompSac 23
No ratings yet
ChatGPT Coding CompSac 23
9 pages
03-Towards An Understanding of Large Language
No ratings yet
03-Towards An Understanding of Large Language
41 pages
2401.14196
No ratings yet
2401.14196
23 pages
Code Generation 2305.10679v1
No ratings yet
Code Generation 2305.10679v1
13 pages
6. Large Language Models for Code Analysis - Do LLMs Really Do Their Job
No ratings yet
6. Large Language Models for Code Analysis - Do LLMs Really Do Their Job
18 pages
A: Active Retrieval in Knowledge Soup For Code Generation
No ratings yet
A: Active Retrieval in Knowledge Soup For Code Generation
16 pages
Assessing Cybersecurity Vulnerabilities in Code Large Language Models_2404.18567v1
No ratings yet
Assessing Cybersecurity Vulnerabilities in Code Large Language Models_2404.18567v1
12 pages
2303.06689v3
No ratings yet
2303.06689v3
29 pages
133_large_language_model_evaluatio
No ratings yet
133_large_language_model_evaluatio
12 pages
2412.17264v1
No ratings yet
2412.17264v1
20 pages
7. Agents for Software Development
No ratings yet
7. Agents for Software Development
53 pages
Case Study For Procurement
No ratings yet
Case Study For Procurement
62 pages
CIBench Evaluating Your LLMs With a Code Interpret
No ratings yet
CIBench Evaluating Your LLMs With a Code Interpret
22 pages
Rethinking the Influence of Source Code on Test Case Generation
No ratings yet
Rethinking the Influence of Source Code on Test Case Generation
23 pages
HumanEval Pro and MBPPPro Evaluating Large Language Models
No ratings yet
HumanEval Pro and MBPPPro Evaluating Large Language Models
27 pages
A Survey On Language Models For Code
No ratings yet
A Survey On Language Models For Code
125 pages
Executable Code Actions Elicit Better LLM Agents
No ratings yet
Executable Code Actions Elicit Better LLM Agents
25 pages
2402.02172v5
No ratings yet
2402.02172v5
35 pages
Liu et al. - 2023 - Invited Paper VerilogEval Evaluating Large Language Models for Verilog Code Generation
No ratings yet
Liu et al. - 2023 - Invited Paper VerilogEval Evaluating Large Language Models for Verilog Code Generation
8 pages
2022 - Expectation vs. Experience - Evaluating The Usability of Code Generation Tools Powered by Large Language Models
No ratings yet
2022 - Expectation vs. Experience - Evaluating The Usability of Code Generation Tools Powered by Large Language Models
7 pages
Large Language Models for Software Engineering_a systematic literature review
No ratings yet
Large Language Models for Software Engineering_a systematic literature review
79 pages
Large Language Models Meet NL2Code A Survey
No ratings yet
Large Language Models Meet NL2Code A Survey
22 pages
Examining Zero-Shot Vulnerability Repair With LLM
No ratings yet
Examining Zero-Shot Vulnerability Repair With LLM
18 pages
2308.08784
No ratings yet
2308.08784
10 pages
2305.01210v3
No ratings yet
2305.01210v3
15 pages
2502.17139v1
No ratings yet
2502.17139v1
13 pages
Paper 1
No ratings yet
Paper 1
10 pages
2308.10345v1
No ratings yet
2308.10345v1
18 pages
Legal 2 AI
No ratings yet
Legal 2 AI
10 pages
OpenCoder_1731317971
No ratings yet
OpenCoder_1731317971
35 pages
KeyuHe_MaxLi_JosephLiu
No ratings yet
KeyuHe_MaxLi_JosephLiu
12 pages
Automated Source Code Generation and Auto-Completion Using Deep Learning: Comparing and Discussing Current Language Model-Related Approaches
No ratings yet
Automated Source Code Generation and Auto-Completion Using Deep Learning: Comparing and Discussing Current Language Model-Related Approaches
16 pages
Big Code Bench
No ratings yet
Big Code Bench
62 pages
Abstract
No ratings yet
Abstract
1 page
Large Language Models For Software Engineering
No ratings yet
Large Language Models For Software Engineering
79 pages
tosem2hshzh024_5.pdf
No ratings yet
tosem2hshzh024_5.pdf
79 pages
2408.09701v1
No ratings yet
2408.09701v1
11 pages
The Architect's Guide to NestJS: Architectural Trade-Offs and Implementation Patterns with NestJS
From Everand
The Architect's Guide to NestJS: Architectural Trade-Offs and Implementation Patterns with NestJS
Aarav Joshi
No ratings yet
AI for Everyone: An Intermediate Guide to Artificial Intelligence
From Everand
AI for Everyone: An Intermediate Guide to Artificial Intelligence
Nova Clarke
No ratings yet
Digital Libraries
100% (1)
Digital Libraries
291 pages
Among School Children
No ratings yet
Among School Children
3 pages
Great Wall of China
No ratings yet
Great Wall of China
13 pages
Blue - MPF, Chap 01 - Student
No ratings yet
Blue - MPF, Chap 01 - Student
33 pages
Codrii 333
No ratings yet
Codrii 333
15 pages
Abdalhay Makoni Severo 2020 Preface Introduction
No ratings yet
Abdalhay Makoni Severo 2020 Preface Introduction
35 pages
Truong Minh Tien CV - Java & Android
No ratings yet
Truong Minh Tien CV - Java & Android
2 pages
The Growth of Indian Civilization
No ratings yet
The Growth of Indian Civilization
8 pages
Introduction to Electrodynamics Fourth Edition David J. Griffiths instant download
100% (1)
Introduction to Electrodynamics Fourth Edition David J. Griffiths instant download
66 pages
PL-300 Exam - Free Actual Q&As, Page 5 - ExamTopics5
No ratings yet
PL-300 Exam - Free Actual Q&As, Page 5 - ExamTopics5
5 pages
Curriculum Map - Arts 10
No ratings yet
Curriculum Map - Arts 10
22 pages
CSE 1st-SEM
No ratings yet
CSE 1st-SEM
24 pages
Grade 5 Crossword 1
No ratings yet
Grade 5 Crossword 1
2 pages
How To Configure Siebel Object Manager
No ratings yet
How To Configure Siebel Object Manager
8 pages
Lesson 2 Scratch
No ratings yet
Lesson 2 Scratch
20 pages
Pacing Guide For Compact B1 Pet
No ratings yet
Pacing Guide For Compact B1 Pet
2 pages
02 Preparing Your Project Being Open Sourced
No ratings yet
02 Preparing Your Project Being Open Sourced
3 pages
Prove: (P Q) Q P Q: Proofs Using Logical Equivalences
No ratings yet
Prove: (P Q) Q P Q: Proofs Using Logical Equivalences
4 pages
Solution Notes
No ratings yet
Solution Notes
3 pages
Paper 2018-English
No ratings yet
Paper 2018-English
7 pages
Katrina Coleen C. Carpio BSA 4 - 2: 1. Create A Flowchart That Will Compute The Area and Perimeter of A Square
No ratings yet
Katrina Coleen C. Carpio BSA 4 - 2: 1. Create A Flowchart That Will Compute The Area and Perimeter of A Square
14 pages
Com124 Data Structure Note-1-1
No ratings yet
Com124 Data Structure Note-1-1
30 pages
The Unreality of Time
No ratings yet
The Unreality of Time
13 pages
Iso Iec17020 2012documentation 140120022757 Phpapp02 PDF
No ratings yet
Iso Iec17020 2012documentation 140120022757 Phpapp02 PDF
3 pages
Hangout Slides
50% (4)
Hangout Slides
21 pages
DLL Q1 Lesson 5 Concentration of Solutions
100% (1)
DLL Q1 Lesson 5 Concentration of Solutions
3 pages
Henry Wadsworth Long Fellow
No ratings yet
Henry Wadsworth Long Fellow
24 pages
Adadmin Utility
No ratings yet
Adadmin Utility
4 pages