CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models

Jierui Li1*, Hung Le2, Yingbo Zhou2, Caiming Xiong2, Silvio Savarese2, Doyen Sahoo2
1 The University of Texas at Austin
2 Salesforce Research
jierui@cs.utexas.edu, hungle@salesforce.com
Abstract

Pre-trained on massive amounts of code and text data, large language models (LLMs) have demonstrated remarkable achievements in performing code generation tasks. With additional execution-based feedback, these models can act as agents with capabilities to self-refine and improve generated code autonomously. However, on challenging coding tasks with an extremely large search space, current agentic approaches still struggle with multi-stage planning, generating, and debugging. To address this problem, we propose CodeTree, a framework for LLM agents to efficiently explore the search space in different stages of the code generation process. Specifically, we adopted a unified tree structure to explicitly explore different coding strategies, generate the corresponding code solutions, and subsequently refine these solutions. In each stage, critical decision-making (ranking, termination, expanding) of the exploration process is guided by both environmental execution-based feedback and LLM-agent-generated feedback. We comprehensively evaluated CodeTree on 7 code generation benchmarks and demonstrated significant performance gains of CodeTree over strong baselines. Using GPT-4o as the base model, we consistently achieved top results of 95.1% on HumanEval, 98.7% on MBPP, and 43.0% on CodeContests. On the challenging SWEBench benchmark, our approach also led to significant performance gains.

1 Introduction

Recently, we have witnessed significant impacts of large language models (LLMs) beyond the NLP domain, such as in coding tasks (Achiam et al., 2023; Touvron et al., 2023a; Wang et al., 2023; Rozière et al., 2023). However, different from traditional NLP tasks, coding tasks require the generated code to be fully executable and functionally correct. Because of the extremely large search space in code, early methods propose to sample a very large number of generation outputs (for example, Li et al. (2022) generated up to 1 million samples per problem) to increase the chance of generating a correct code solution.

More recently, several approaches adopted a "vertical" strategy in which LLMs first generate one (or very few) generation outputs and then iteratively refine them multiple times, often conditioned on some form of external feedback (Le et al., 2022; Chen et al., 2023b; Shinn et al., 2023). While these approaches are more cost-effective by focusing only on a small subset of the search space (i.e., starting from an initial output candidate), their performance is bounded by the local optima of the chosen search space. Related to our work, several methods for NLP reasoning tasks were introduced to control and enhance the generation procedure of LLMs. For example, Wang et al. (2022) proposed to enhance LLMs with chains of thought and statistically select the right solutions based on majority voting. Zhou et al. (2023) decomposed a task into smaller sub-tasks and addressed them in increasing order of difficulty. Yao et al. (2024) proposed to improve LLMs by adopting a tree-based structure to explicitly simulate the exploration of thoughts. We are motivated by this line of research and propose CodeTree, a new generation framework that effectively explores the search space of code generation tasks through a tree-based structure. An overview of CodeTree is given in Figure 1.

We define 3 standard agents, Thinker, Solver, and Debugger, responsible for strategy planning, solution implementation, and solution improvement respectively, covering the comprehensive roles needed for code generation. A CodeTree starts from the input problem as the tree root, and subsequent nodes represent code solutions.

* The work was done while Jierui Li was a research intern at Salesforce Research Asia.
| Approach | Explore | Exploit | Execution feedback | AI feedback | Multi-agent | Action |
| CodeRanker (Inala et al., 2022) | ✓ | | | ✓ | | |
| AlphaCode (Li et al., 2022), MBR-Exec (Shi et al., 2022), CodeT (Chen et al., 2023b) | ✓ | | ✓ | | | |
| LEVER (Ni et al., 2023), Coder-Reviewer (Zhang et al., 2023b) | ✓ | | ✓ | ✓ | | |
| Self-correct (Welleck et al., 2023), ILF (Chen et al., 2023a), Self-refine (Madaan et al., 2023) | | ✓ | | ✓ | | |
| CodeChain (Le et al., 2024) | | ✓ | ✓ | | | |
| Self-debug (Chen et al., 2023d), Self-repair (Olausson et al., 2023), Reflexion (Shinn et al., 2023) | | ✓ | ✓ | ✓ | | |
| CAMEL (Li et al., 2023a) | | | | ✓ | ✓ | |
| ChatDev (Qian et al., 2024), MetaGPT (Hong et al., 2023), AgentVerse (Chen et al., 2023c) | | | ✓ | ✓ | ✓ | |
| Self-collaboration (Dong et al., 2023), AgentCoder (Huang et al., 2023) | | ✓ | ✓ | ✓ | ✓ | |
| CodeTree (ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Table 1: We compare CodeTree with related methods in 6 aspects: (1) Explore which adopts a brute-force approach
to independently generate a large number of code candidates; (2) Exploit which focuses on self-refinement using
a small subset of output solutions; (3) Execution feedback which uses code execution outcomes to improve code
qualities; (4) AI feedback which enables synthetic feedback generated by LLMs to improve output code; (5)
Multi-agent which adopts multiple LLM agents to play different roles in the code generation process; and (6) Action
where LLM agents can take different actions and facilitate decision-making.
At any node of the tree, one can either explore its sibling nodes (other strategies from the same parent node) or its children (refinements of this node). Within CodeTree, agents interact with each other through tree expansion guided by a Critic Agent, searching for the optimal code solution.

Rather than following heuristic rules or classic tree traversal methods, we use the Critic Agent to self-evaluate the status of tree nodes at each tree expansion step by executing the following tasks: node scoring and evaluation, solution verification, and decision-making (Section 3.2).

We evaluated CodeTree on a wide range of coding benchmarks, from function implementation to competition-level coding tasks. Our results demonstrated the significant and consistent performance gains of CodeTree against strong baselines. Using GPT-4o as the base model, we achieved the top results on HumanEval+, MBPP+ (Liu et al., 2023), and CodeContests (Li et al., 2022) respectively. On the challenging SWEBench benchmark, our approach led to significant performance gains. We also conducted comprehensive ablation and qualitative analyses to identify the best practices and the limitations of the current method.
Figure 1: CodeTree creates a unified search space for exploration throughout the multi-stage code generation process:
strategy generation by a “Thinker” agent, initial code generation by a “Solver” agent, and code improvement by a
“Debugger” agent. To effectively perform exploration within the tree structure, we incorporate both environmental
execution-based feedback as well as AI-generated feedback (generated by a “Critic” LLM agent).
Thinker:
Your goal is to think of multiple strategies in English on how to [approach and solve this problem] / [improve this solution]. You should decide how many and what strategies are feasible and list and number them line by line. …
[Problem]: <problem description>
[Solution]: <previous solution> [Feedback] if any

Solver:
Your goal is to implement the solution for a programming problem based on the user's instructions…
[Problem]: <problem description>
[Instruction]: <strategy>

Debugger:
Your goal is to improve the following solution for a programming problem, based on its execution feedback on test cases, including evaluation/reflection for the solution and instruction. …
[Problem]: <problem description>
[Solution]: <previous solution>
[Feedback]: <execution feedback>, <Critic Agent feedback>
[Instruction]: <reflection>

Critic Agent Scoring & Evaluation:
Your task is to evaluate a strategy and corresponding implementation for solving a programming problem. The solution fails on test cases. You should score from 1 to 5 on how well the execution outputs match the expected ones. … You should score from 1 to 5 on how well the solution implements the strategy and solves the task. Evaluate if one should keep refining this solution or try other strategies.
[problem] [solution] [exe feedback]

Critic Agent Solution Verification:
You are given a programming task along with a user's solution that passes all visible tests. Your job is to verify whether this solution will pass the hidden test cases. Answer True if it's an acceptable solution, answer False if it's not. Your answer should be a single word True/False. …
[problem] [solution] [exe feedback]

Figure 2: Simplified versions of the instruction prompts used for the Thinker, Solver, Debugger, and Critic agents. Some details are omitted for illustration purposes.

LLMs are trained to follow natural-language instructions during test time (Ouyang et al., 2022; Touvron et al., 2023b; Wang et al., 2023). Therefore, by including the strategy as part of the input instruction, we can condition the Solver Agent to produce strategy-specific code candidates. For each candidate, the generation objective is defined as:

Ŵi ∼ pθS(Ŝi, D)    (2)

Solution Refining with Thinker & Debugger Agents  Prior approaches (Chen et al., 2023a,d; Shinn et al., 2023; Madaan et al., 2023) found that syntactic mistakes or even logical flaws in generated code can be fixed by allowing LLMs to iteratively refine and regenerate the code. This self-refinement capability is typically strengthened by some form of feedback about code quality (e.g., execution results, compiler signals):

Fexe,i = Ŵi({(ij, oj)v})    (3)

Fcri,i = θC(Ŵi, Ŝi, Fexe,i, D)    (4)

where θC is the Critic Agent. Denoting the collective feedback as Fi = {Fexe,i, Fcri,i}, a set of reflections Ri about the code candidate is generated by the Thinker Agent:

R̂i,j ∼ pθT(· | R̂i,1:j−1, Fi, Ŵi, Ŝi, D)    (5)

Ŵi,j ∼ pθD(· | R̂i,j, Fi, Ŵi, Ŝi, D)    (6)

R̂i,j denotes the j-th reflection that the Thinker generates for Ŵi. An LLM θD (the Debugger Agent) then modifies Ŵi according to this reflection, generating a new program.
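To make Eqs. 2-6 concrete, the following is a minimal sketch of one generate-then-refine step, not the released CodeTree implementation. The `Candidate` container and the `solver`, `run_tests`, `critic`, `thinker`, and `debugger` callables are hypothetical wrappers around the prompts in Figure 2.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Candidate:
    strategy: str                      # S_i: strategy proposed by the Thinker
    code: str                          # W_i: program from the Solver / Debugger
    exe_feedback: str = ""             # F_exe,i (Eq. 3): execution results on visible tests
    critic_feedback: str = ""          # F_cri,i (Eq. 4): Critic Agent feedback
    reflections: List[str] = field(default_factory=list)       # R_i (Eq. 5)
    children: List["Candidate"] = field(default_factory=list)  # refinements (Eq. 6)

def generate_candidate(problem: str, strategy: str,
                       solver: Callable[[str, str], str]) -> Candidate:
    """Eq. 2: the Solver produces a strategy-specific candidate."""
    return Candidate(strategy=strategy, code=solver(problem, strategy))

def refine_once(problem: str, cand: Candidate,
                run_tests: Callable[[str], str],
                critic: Callable[[str, str, str, str], str],
                thinker: Callable[[str, str, str, Tuple[str, str]], List[str]],
                debugger: Callable[[str, str, Tuple[str, str], str], str]) -> Candidate:
    """One refinement step following Eqs. 3-6."""
    cand.exe_feedback = run_tests(cand.code)                                  # Eq. 3
    cand.critic_feedback = critic(problem, cand.strategy, cand.code,
                                  cand.exe_feedback)                          # Eq. 4
    feedback = (cand.exe_feedback, cand.critic_feedback)                      # F_i
    cand.reflections = thinker(problem, cand.strategy, cand.code, feedback)   # Eq. 5
    for r in cand.reflections:                                                # Eq. 6
        cand.children.append(Candidate(strategy=r,
                                       code=debugger(problem, cand.code, feedback, r)))
    return cand
```

Each reflection yields one refined child program, so a single node can spawn several children in one step.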
3.2 Tree Expanding with Critic Agent

CodeTree builds a heterogeneous tree for each problem, where the tree root represents a problem specification (D, {(ij, oj)}) and every subsequent tree node represents a generated code solution Ŵi. Each node carries relevant attributes, including its collective code feedback Fi and its corresponding strategy Si and reflections Ri. Typically, adding a tree node is a two-step process: 1) generate a code solution from the corresponding strategy (Eq. 2 or Eq. 6); 2) evaluate the generated solution Ŵi and obtain execution and Critic feedback (Eqs. 3 and 4).

Unlike previous tree-structured search methods (Yao et al., 2024; Islam et al., 2024), we do not construct the entire tree at the beginning. Instead, we introduce a Critic Agent to dynamically expand the tree based on potential strategies. It guides the expansion and spanning of the tree, taking actions based on its evaluation of the current node.
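Under the same assumptions as the sketch above, the two-step node addition could look as follows; `add_node` and its arguments are illustrative names, not the paper's API.

```python
def add_node(parent: Candidate, strategy: str, problem: str,
             solver, run_tests, critic) -> Candidate:
    # Step 1: generate a code solution from the corresponding strategy (Eq. 2 / Eq. 6)
    node = generate_candidate(problem, strategy, solver)
    # Step 2: evaluate the solution and attach execution and Critic feedback (Eqs. 3-4)
    node.exe_feedback = run_tests(node.code)
    node.critic_feedback = critic(problem, strategy, node.code, node.exe_feedback)
    parent.children.append(node)      # the tree is grown lazily, one node at a time
    return node
```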
Node Scoring and Evaluation  For a given solution and its corresponding Fexe, the Critic Agent evaluates how promising the solution is through Eq. 4, which results in Fcri. We separately evaluate: 1) how well the execution outputs on the visible test cases match the expected outputs; and 2) how robustly the solution implements its corresponding strategy towards solving the problem. For a program Ŵi and its corresponding feedback Fi, the Critic Agent assesses whether the current solution is worth refining or should not be explored further, making a decision between refinement and abort. The critic score is calculated as:

Score(Ŵi) = Score(Fexe,i) + Score(Fcri,i)    (7)
Solution Verification  A solution Ŵi that passes all visible test cases might still over-fit those tests and fail the hidden test cases. Hence, the Critic Agent θC verifies whether the solution is robust and generalizable to unseen test cases.
Decision-Making by Critic Agent  Starting from the initial Si, Wi, and Fi, the Critic Agent guides the search for a correct solution. At each node, it has an action space of three actions: Refine: continue exploring from the current node by generating multiple reflections for it; Abort: prune this node due to its low evaluation score and retrace the exploration to its sibling nodes; and Accept: accept the current node as the final solution and terminate the search early.
3.3 Multi-agent Collaboration

Throughout the expansion of the tree, the task-specific agents collaborate with the Critic Agent, utilizing its feedback and following its guidance to perform exploration. The flexibility of the tree expansion and search is determined by the LLM agents' decision-making, e.g., determining the number of strategies and deciding the search path. In practice, we limit the number of exploration steps during inference to avoid a large computation overhead. Whenever a termination signal (i.e., a decision to accept a code solution) is issued or the maximum number of exploration steps is reached, a code candidate is selected based on its evaluation score Score(Ŵi).

Please refer to Appendix A for all example instruction prompts of our LLM agents.
4 Experiments

4.1 Experimental Setup

We applied pass@1 (Chen et al., 2021) as our evaluation metric: only one code candidate can be selected and submitted for the final evaluation with hidden test cases. We set the generation budget to 20 samples per coding task. To fairly compare our approach with other baselines, we adopted the same generation budget for all methods. For ablation experiments without the Critic Agent, we followed strategies similar to (Shinn et al., 2023; Chen et al., 2023d): we select a solution that passes all visible test cases as the final solution to be evaluated with hidden test cases.
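For clarity, a sketch of this evaluation protocol: at most 20 candidates may be generated per task, exactly one is submitted, and only that one is graded against the hidden tests. `generate_and_select` and `evaluate_hidden` are hypothetical stand-ins for a concrete method and a benchmark harness.

```python
def pass_at_1(tasks, generate_and_select, evaluate_hidden, budget=20):
    """pass@1 under a hard generation budget: one submission per task."""
    solved = 0
    for task in tasks:
        submission = generate_and_select(task, budget)   # uses at most `budget` samples
        if submission is not None and evaluate_hidden(task, submission):
            solved += 1
    return solved / len(tasks)
```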
Benchmarks  We conducted experiments on 2 categories of code generation tasks: 1) Function implementation, where a coding task is to complete a single function following a specific function signature: HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and their EvalPlus variants (Liu et al., 2023), denoted as HumanEval+ and MBPP+; and 2) Program implementation, where a coding task is to solve an algorithmic problem: CodeContests (Li et al., 2022) and APPS (Hendrycks et al., 2021). The test sets contain 164, 378, and 165 problems for HumanEval(+), MBPP(+), and CodeContests respectively. For APPS, we randomly sampled 150 problems (50 for each level of difficulty) from the test split.

Baselines  We introduce the following baselines: Direct instructs the model to generate code directly from the input problem; CoT (Wei et al., 2022) instructs the model to provide chain-of-thought reasoning before giving the solution program; Reflexion (Shinn et al., 2023) utilizes the solution's execution feedback to generate self-reflections, which are used to iteratively refine the solution; MapCoder (Islam et al., 2024) proposes an agent collaboration system to plan, solve, test, and refine the solution, for which we set #plans=4, #debug-rounds=5, and a generation budget of 20; and Resample follows a similar principle to Li et al. (2022): resample solutions repeatedly and filter them with the visible test cases.¹

¹ We set the sampling temperature to 1 for Resample and report the best results over 2 runs. For the other methods, we report a single run's results with deterministic inference.

Models  We studied our method on three models with different model sizes and capacities.
| Model | Method | HumanEval | HumanEval+ | MBPP | MBPP+ | CodeContests | Avg. |
| GPT-4o-mini | Direct | 86.6% | 78.7% | 87.8% | 73.3% | 10.3% | 67.3% |
| GPT-4o-mini | CoT | 84.8% | 78.0% | 89.2% | 74.3% | 12.7% | 67.8% |
| GPT-4o-mini | Reflexion | 92.1% | 83.5% | 96.6% | 78.6% | 21.8% | 74.5% |
| GPT-4o-mini | MapCoder | 91.5% | 78.0% | 90.0% | - | - | - |
| GPT-4o-mini | Resample | 89.0% | 80.5% | 94.3% | 76.8% | 18.2% | 71.8% |
| GPT-4o-mini | CodeTree-BFS | 93.3% | 82.1% | 91.5% | 72.3% | 20.6% | 72.0% |
| GPT-4o-mini | CodeTree-DFS | 92.7% | 81.1% | 87.6% | 71.4% | 20.6% | 70.7% |
| GPT-4o-mini | Strategy List | 90.2% | 80.5% | 90.5% | 69.6% | 22.4% | 70.6% |
| GPT-4o-mini | CodeTree | 94.5% | 84.8% | 96.8% | 77.0% | 26.4% | 75.9% |
| GPT-4o | Direct | 88.4% | 81.7% | 92.3% | 75.9% | 20.6% | 71.8% |
| GPT-4o | CoT | 92.1% | 84.1% | 93.7% | 77.2% | 24.8% | 74.4% |
| GPT-4o | Reflexion | 94.5% | 84.8% | 97.9% | 79.6% | 41.8% | 79.7% |
| GPT-4o | MapCoder | 92.7% | 81.7% | 90.9% | - | - | - |
| GPT-4o | Resample | 93.9% | 84.8% | 96.2% | 77.0% | 32.7% | 76.9% |
| GPT-4o | CodeTree-BFS | 94.5% | 84.1% | 93.9% | 70.7% | 35.8% | 75.8% |
| GPT-4o | CodeTree-DFS | 95.1% | 83.5% | 91.5% | 76.2% | 36.4% | 76.5% |
| GPT-4o | Strategy List | 95.1% | 82.3% | 92.6% | 73.3% | 36.4% | 75.9% |
| GPT-4o | CodeTree | 94.5% | 86.0% | 98.7% | 80.7% | 43.0% | 80.6% |
| Llama-3.1-8B | Direct | 63.4% | 54.3% | 73.4% | 63.8% | 6.1% | 52.2% |
| Llama-3.1-8B | CoT | 65.9% | 56.1% | 74.6% | 65.3% | 4.2% | 53.2% |
| Llama-3.1-8B | Reflexion | 79.9% | 69.5% | 90.2% | 72.0% | 13.5% | 65.0% |
| Llama-3.1-8B | Resample | 82.3% | 71.3% | 91.0% | 73.8% | 15.2% | 66.7% |
| Llama-3.1-8B | CodeTree-BFS | 80.5% | 68.3% | 91.0% | 69.3% | 15.8% | 65.0% |
| Llama-3.1-8B | CodeTree-DFS | 80.5% | 68.9% | 89.7% | 70.4% | 15.2% | 64.9% |
| Llama-3.1-8B | Strategy List | 82.3% | 70.1% | 91.0% | 72.5% | 13.9% | 66.0% |
| Llama-3.1-8B | CodeTree | 82.3% | 72.0% | 90.5% | 73.3% | 12.1% | 66.0% |
Table 2: Experimental results by pass@1 on HumanEval, MBPP, EvalPlus, and CodeContests. Direct and CoT are baseline methods that generate a program solution only once; Reflexion, MapCoder, and Resample are methods with a solution generation budget of 20 samples, like our methods; CodeTree-BFS, CodeTree-DFS, Strategy List, and CodeTree are CodeTree variants with or without the Critic Agent to guide the tree search. Note that MapCoder does not work with Llama-3.1-8B, as noted by Islam et al. (2024).
We experimented on large language models from the GPT and Llama 3.1 families. Specifically, we use GPT-4o-mini, GPT-4o², and Llama-3.1-8B³.

² https://openai.com/index/hello-gpt-4o/
³ https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct. Note that we report our replicated results, which might differ from the originally reported ones.

4.2 Main Results

We compared CodeTree with the other baselines in Table 2. We noticed that Reflexion and Resampling serve as strong baselines for the HumanEval and MBPP datasets given the same solution generation budget, comparable to CodeTree-BFS/DFS. CodeTree with the Critic Agent outperforms all other baselines on 4 out of 5 benchmarks for GPT-4o-mini and GPT-4o. For instance, CodeTree achieves pass@1 = 43.0% on competition-level coding tasks in the CodeContests benchmark (i.e., a 22.4% performance gain over the Resampling baseline), showing its advantage in solving hard problems.

We found that CodeTree-BFS almost always performs better than CodeTree-DFS, suggesting that exploring diverse strategies is more effective than iteratively refining from one solution. Interestingly, with the Llama-3.1-8B model, Resampling achieves the best results on 4 benchmarks. This observation indicates that small language models may not be suitable for multi-agent frameworks like CodeTree, where models are required to follow task-specific roles and instructions and perform distinct tasks with reasonable accuracy.

4.3 Analysis of Search Strategies

Given the performance gaps between CodeTree-BFS and DFS, we conducted additional experiments to analyze these tree search strategies without the Critic Agent. We report the results on HumanEval/HumanEval+ with GPT-4o-mini and CodeContests with GPT-4o in Table 3. Compared to DFS/BFS strategies with d = 3 and w = 3, we observed that forcing the model to search wider in BFS (i.e., more diverse strategies with w > 3) and only debug up to 1 iteration (i.e., d = 2) improves performance.
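To illustrate how the width w and depth d interact with the fixed budget, the sketch below enumerates candidate slots in BFS order and stops once the budget is spent; the actual schedule used by the CodeTree-BFS/DFS variants may differ.

```python
from collections import deque

def bfs_schedule(width: int, depth: int, budget: int = 20):
    """Enumerate (level, index) slots for candidate programs in BFS order,
    stopping once the generation budget is spent. With width=6, depth=2 the
    20 samples cover 6 initial strategies plus 14 first-round refinements;
    with width=3, depth=3 they reach a third level of refinements instead."""
    slots, queue = [], deque((1, i) for i in range(width))
    while queue and len(slots) < budget:
        level, idx = queue.popleft()
        slots.append((level, idx))
        if level < depth:
            queue.extend((level + 1, idx * width + j) for j in range(width))
    return slots

print(len(bfs_schedule(6, 2)), len(bfs_schedule(3, 3)))   # 20 20
```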
Figure 3: Results of pass@1 on the APPS test subsets. We randomly select 50 samples each from the Introductory, Interview, and Competition splits. We apply Resample and our methods with GPT-4o, GPT-4o-mini, and Llama-3.1-8B.
Figure 4: Cumulative pass@1 curves as new solutions are generated within the budget of 20.
Thinker Agent:
<user>:
[problem description]:
```
def fizz_buzz(n: int):
"""Return the number of times the digit 7 appears in integers less than n which are divisible by 11 or 13.
>>> fizz_buzz(50)
0
>>> fizz_buzz(78)
2
>>> fizz_buzz(79)
3
"""
```
<assistant>:
1. Start by understanding the problem requirements and constraints, focusing on the need to count occurrences of the digit '7' in
specific integers.
2. Identify the range of integers to consider, which is from 0 to n-1, and clarify the conditions for divisibility by 11 or 13.
3. Create a loop to iterate through all integers from 0 to n-1, checking each integer for divisibility by 11 or 13.
4. For each integer that meets the divisibility condition, convert it to a string to facilitate counting occurrences of the digit '7'.
5. Use the string method `.count('7')` to count how many times '7' appears in the string representation of the integer.
Solver Agent:
<system>:
You are an AI that only responds with python code, NOT ENGLISH. You will be given a function signature and its docstring by the
user. Write your full implementation (restate the function signature). Don't include test cases or printing statements in the code
block.
<user>:
```
def fizz_buzz(n: int):
"""Return the number of times the digit 7 appears in integers less than n which are divisible by 11 or 13.
>>> fizz_buzz(50)
0
>>> fizz_buzz(78)
2
>>> fizz_buzz(79)
3
"""
```
To solve the problem, You can follow the hint given by an expert: Start by understanding the problem requirements and constraints,
focusing on the need to count occurrences of the digit '7' in specific integers.
<assistant>:
```
def fizz_buzz(n: int):
"""Return the number of times the digit 7 appears in integers less than n which are divisible by 11 or 13.
>>> fizz_buzz(50)
0
>>> fizz_buzz(78)
2
>>> fizz_buzz(79)
3
"""
count = 0
for i in range(n):
if i % 11 == 0 or i % 13 == 0:
count += str(i).count('7')
return count
```
Figure 5: GPT-4o-mini as the Thinker and Solver agents solving HumanEval-36. The Thinker agent generates 5
distinct strategies and the Solver agent implements the first one. By oracle evaluation, the resulting solution passes
the visible test cases but fails on hidden test cases.
Critic Agent - Solution Verification:
<system>:
The user will give you a programming task and a solution that passes all visible tests.
Your task is to verify whether the solution can pass hidden test cases.
Focus only on functionality correctness. If the solution is correct, no improvement is necessary. Assume all inputs follow formats
and constraints described in the examples.
Answer "True" if you think this solution is acceptable, Answer "False" if you think necessary modification is needed for it to cover
more test cases.
-----------------------------------
Your answer should be formatted as below:
In the first line, answer a single word (True/False).
In the second line, indicate your reason.
<user>:
[Problem description]:
```
def fizz_buzz(n: int):
"""Return the number of times the digit 7 appears in integers less than n which are divisible by 11 or 13.
>>> fizz_buzz(50)
0
>>> fizz_buzz(78)
2
>>> fizz_buzz(79)
3
"""
```
[Implementation]:
```
def fizz_buzz(n: int):
"""Return the number of times the digit 7 appears in integers less than n which are divisible by 11 or 13.
>>> fizz_buzz(50)
0
>>> fizz_buzz(78)
2
>>> fizz_buzz(79)
3
"""
# Start by understanding the problem requirements and constraints, focusing on the need to count occurrences of
the digit '7' in specific integers.
count = 0
for i in range(n):
if i % 11 == 0 or i % 13 == 0:
count += str(i).count('7')
return count
```
<assistant>:
False
The solution does not account for numbers that are divisible by both 11 and 13, and it may also miss edge cases where the input is
less than 11 or 13. Additionally, it does not handle negative inputs or zero, which could lead to incorrect results.
Figure 6: GPT-4o-mini as the Critic Agent verifying a solution of HumanEval-36 given by the Solver Agent. It decides
to reject this solution and suggests improvements.
Debugger Agent:
<system>:
You are an AI Python assistant. You will be given a program implementation to improve, a series of unit tests and execution
feedback, and a hint to change the implementation appropriately. Write your full implementation (restate the function signature).
<user>:
[Implementation to Improve]:
```
def fizz_buzz(n: int):
"""Return the number of times the digit 7 appears in integers less than n which are divisible by 11 or 13.
>>> fizz_buzz(50)
0
>>> fizz_buzz(78)
2
>>> fizz_buzz(79)
3
"""
# Start by understanding the problem requirements and constraints, focusing on the need to count occurrences of
the digit '7' in specific integers.
count = 0
for i in range(n):
if i % 11 == 0 or i % 13 == 0:
count += str(i).count('7')
return count
```
[AI-generated Feedback]:
The solution does not account for numbers that are divisible by both 11 and 13, and it may also miss edge cases where the input is
less than 11 or 13. Additionally, it does not handle negative inputs or zero, which could lead to incorrect results.
<assistant>:
Here is the improved implementation of the `fizz_buzz` function, incorporating the suggested changes:
```python
def fizz_buzz(n: int) -> int:
"""Return the number of times the digit 7 appears in integers less than n which are divisible by 11 or 13.
>>> fizz_buzz(50)
0
>>> fizz_buzz(78)
2
>>> fizz_buzz(79)
3
>>> fizz_buzz(10)
0
>>> fizz_buzz(-5)
0
"""
if n <= 0:
return 0
count = 0
for i in range(11, n):
if i % 11 == 0 or i % 13 == 0:
count += str(i).count('7')
return count
```
### Changes Made:
1. **Edge Case Handling**: Added a check for `n <= 0` to return 0 immediately for negative inputs and zero.
2. **Loop Starting Point**: The loop now starts from 11, as numbers less than 11 cannot be divisible by 11 or 13.
3. **Documentation**: Updated the docstring to clarify the function's behavior and added additional test cases for edge cases.
This implementation should now correctly handle all specified cases and edge conditions.
Figure 7: GPT-4o-mini as the Debugger agent refining a solution of HumanEval-36 given by the Solver Agent. It follows
the Critic Agent's suggestion and corrects the solution successfully.
Critic Agent - Test Output Scoring:
<system>:
Your task is to evaluate the execution outputs of a code implementation. The statement and code is given by the user, and the
output/expected output on a set of test cases. You should analyze the expected outputs and execution outputs. From a 0 to 5 range,
give a score on how good the execution outputs are matching the expected ones (higher score means a better match). Standards
are below:
0: Errors or time out when executing.
1: No pattern found when comparing pairs of <output, expected_output>, errors are hard to interpret.
2: Results abnormal for a part of the cases (e.g., cannot handle negative elements; only half of it sorted).
3: Mismatches have clear patterns to interpret the error. For example, all elements offset by 1; all elements + 1; reverse all elements, etc.
4: Lacks consideration of edge conditions/corner cases (e.g., error only when elements are equal), otherwise correct
5: Results exactly matched.
-----------------------------------
Your answer should be formatted as below:
In the first line, give your brief comparison.
In the second line, give A SINGLE INTEGER NUMBER as your final score(0 to 5)
<user>:
[problem]
[solution]
[execution results on visible test cases]
<user>:
[problem]
[solution]
[execution results on visible test cases]
[agent feedback] if any
Figure 8: Prompts for Critic Agent test-case scoring and solution scoring, as well as Thinker Agent reflection.