Paper 1

Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

Code Generation with AlphaCodium: From Prompt Engineering to Flow

Engineering

Tal Ridnik, Dedy Kredo, Itamar Friedman


CodiumAI
{tal.r, dedy.k, itamar.f}@codium.ai
arXiv:2401.08500v1 [cs.LG] 16 Jan 2024

Abstract els [12] have successfully generated code that solves simple
programming tasks [2, 1]. However, real-world code prob-
Code generation problems differ from common natural lems are often different in nature - they are more nuanced,
language problems - they require matching the exact syn- and can be defined by a long natural language task descrip-
tax of the target language, identifying happy paths and tion (i.e., spec), that contains multiple details and rules that
edge cases, paying attention to numerous small details in the solution code must address.
the problem spec, and addressing other code-specific is- The introduction of CodeContests [8], a dataset curated
sues and requirements. Hence, many of the optimizations from competitive programming platforms such as Code-
and tricks that have been successful in natural language forces [9], enabled the evaluation of models and flows on
generation may not be effective for code tasks. In this more challenging code problems, which usually include a
work, we propose a new approach to code generation by lengthy problem description. A private test set, with more
LLMs, which we call AlphaCodium - a test-based, multi- than 200 unseen tests per problem, enables to evaluate the
stage, code-oriented iterative flow, that improves the perfor- generated code comprehensively, and to reduce false posi-
mances of LLMs on code problems. We tested AlphaCodium tive rates to a minimum.
on a challenging code generation dataset called CodeCon- The primary work addressing the CodeContests dataset
tests, which includes competitive programming problems was AlphaCode [8], a code generation system developed by
from platforms such as Codeforces. The proposed flow con- DeepMind, that utilizes a fine-tuned network specifically
sistently and significantly improves results. On the valida- for competitive programming tasks. AlphaCode generates
tion set, for example, GPT-4 accuracy (pass@5) increased a very large number of possible solutions (up to 1M), that
from 19% with a single well-designed direct prompt to 44% are then processed and clustered, and among them a small
with the AlphaCodium flow. Many of the principles and best number (∼ 10) is chosen and submitted. While the results
practices acquired in this work, we believe, are broadly ap- of AlphaCode are impressive, the need to fine-tune a model
plicable to general code generation tasks. specifically for code-oriented tasks, and the heavy compu-
Full implementation is available at: https : / / tational brute-force-like load, makes it impractical for most
github.com/Codium-ai/AlphaCodium real-life usages. CodeChain [7] is another work to tackle
competitive programming tasks, which introduced a novel
inference framework to improve code generation in LLMs
1. Introduction through a chain of sub-module-based self-revisions.
In this paper, we present AlphaCodium, a code-oriented
With a sparse reward signal, code generation tasks re- flow that revolves around an iterative process where we re-
quire searching in the huge structured space of possible pro- peatedly run and fix a generated code against input-output
grams. Correct solutions to the same problem can look sig- tests. Two key elements for AlphaCodium flow are (a) gen-
nificantly different, and judging if a partial or incorrect so- erating additional data, such as problem reflection and test
lution is useful is a difficult challenge - a single-character reasoning, to aid the iterative process, and (b) enrichment
edit can completely alter the solution’s behavior. Due to the of public tests with additional AI-generated tests. The pro-
unique nature of code generation tasks, common prompting posed flow, which is depicted in Figure 1, is divided into
techniques that have been optimized for natural language two main phases: a pre-processing phase where we reason
tasks [4, 13, 10], may not be as effective when applied to about the problem in natural language, and an iterative code
code generation. generation phase where we generate, run, and fix a code
Recent large-scale transformer-based language mod- solution against public and AI-generated tests.

1
Pre-processing Code iterations

Input -
Generate Iterate on Iterate on AI
Problem Rank
Possible Public Tests Tests
Description + Solutions
Solutions
Public Tests

Generate
Problem Public Tests Initial Code Final
Additional AI
Reflection Reasoning Solution Solution
Tests

(a) The proposed AlphaCodium flow.

- The correct solution to the problem


- The solution represented by the generated code
Public Tests Public Tests
AI Tests

(1) - Direct prompt solution (2) - Solution when adding (3) - Solution with full
iterations on public tests AlphaCodium flow

(b) Illustrating the improvement from AlphaCodium.

Figure 1: Illustration of AlphaCodium flow contribution - with direct prompt, the model struggles to solve code problems.
Iterating on public tests stabilizes and improves the solution but leaves ”blind spots” because the public tests are not compre-
hensive. The full AlphaCodium flow, which includes a pre-processing phase as well as iterations on public and AI-generated
tests, allows the solution to be further improved, leading to increased solve ratio.

A key observation when designing AlphaCodium flow is designed single prompt, consistently and significantly
that generating additional useful tests is easier than gener- improves the performance of LLMs on CodeContests
ating a correct code solution. Adding specific tests requires problems. This is true both for open-source models
mainly understanding the problem, some insight, and basic (DeepSeek [3]) and closed-source models (GPT [5]). For
brute-force or logical reasoning. There is no need to fully GPT-4 on the validation set, for example, the pass@5 ac-
“solve” the problem when generating additional tests. curacy improved from 19% to 44%. AlphaCodium also
AlphaCodium flow also utilizes novel code-oriented de- outperforms previous works, while having a significantly
sign concepts, tricks, and best practices, such as: (1) YAML smaller computational budget - it achieves superior results
structured output; (2) bullet point analysis to encourage se- over AlphaCode, for example, with four orders of magni-
mantic reasoning; (3) generating modular code; (4) soft de- tude fewer LLM calls.
cisions with double validation; (5) encouraging exploration We believe many of the principles and best practices
and postponing direct decisions; (6) test anchors. used in this work broadly apply to general code generation
AlphaCodium’s flow, when compared against a well- tasks. In addition, we argue that utilizing harder and more

2
complicated benchmarks like CodeContests dataset will al- This illustrates the importance of problem understanding as
low the community to better evaluate LLMs, compared to part of a flow that can lead, with high probability, to gener-
more simplistic benchmarks, which are common today, like ating a correct code solution.
HumanEval [2].
3. The Proposed Flow
2. CodeContests Dataset
CodeContests [8] is a challenging code generation
3.1. Overview
dataset introduced by Google’s DeepMind, involving prob- Due to the complicated nature of code generation prob-
lems curated from competitive programming platforms such lems, we observed that single-prompt optimizations, or
as Codeforces [9]. The dataset contains 10K code problems even chain-of-thought prompts, have not led to meaning-
that can be used to train LLMs, as well as a validation and ful improvement in the accuracy of LLMs on CodeCon-
test set to assess the ability of LLMs to solve challenging tests. The model struggles to understand and comprehend
code problems. the problem, and continuously produces wrong code, or a
In this work, instead of training a dedicated model, we code that passed public tests but fails to generalize to un-
focused on developing a code-oriented flow, that can be seen private tests. Common flows, that are suitable for nat-
applied to any LLM pre-trained to support coding tasks, ural language tasks, may not be optimal for code-generation
such as GPT [5] or DeepSeek [3]. Hence, we chose to ig- tasks, which include an untapped potential - repeatedly run-
nore the train set, and focused on the validation and test ning the generated code, and validating it against known
sets of CodeContests, which contain 107 and 165 problems, examples.
respectively. Figure 2(a) depicts an example of a typical Instead of common prompt engineering techniques used
problem from CodeContests dataset. Each problem con- in NLP, we found that to solve CodeContest problems it
sists of a description and public tests, available as inputs to was more beneficial to employ a dedicated code-generation
the model. The goal is to generate a code solution that pro- and testing-oriented flow, that revolves around an iterative
duces the correct output for any (legal) input. A private test process where we repeatedly run and fix the generated code
set, which is not available to the model or contesters, is used against input-output tests (see Figure 2(a) for examples of
to evaluate the submitted code solutions. such tests). Two key elements for this code-oriented flow
are (a) generating additional data in a pre-processing stage,
What makes CodeContests a good dataset for evaluating such as self-reflection and public tests reasoning, to aid the
LLMs on code generation tasks: iterative process, and (b) enrichment of the public tests with
1) CodeContests, unlike many other competitive pro- additional AI-generated tests.
gramming datasets [6, 2], utilizes a comprehensive private In Figure 1(a) we present AlphaCodium flow for solving
set of tests to avoid false positives - each problem contains competitive programming problems. The flow is divided
∼200 private input-output tests the generated code solution into two main phases:
must pass.
2) LLMs generally do not excel at paying attention to • The pre-processing phase represents a linear flow
small details, because they typically transform the problem where AlphaCodium reasons about the problem, in
description to some ”average” description, similar to com- natural language,
mon cases on which they were trained. Real-world code
problems, on the other hand, frequently contain minor de- • The code iterations phase includes iterative stages
tails that are critical to their proper resolution. A key feature where AlphaCodium generates, runs, and fixes a so-
of the CodeContests dataset is that the problem descriptions lution code against certain tests.
are, by design, complicated and lengthy, with small details
and nuances (see a typical description in Figure 2(a)). We 3.2. Flow stages
feel that adding this degree of freedom of problem under-
standing is beneficial since it simulates real-life problems, In this section, we will review the different stages used
which are often complicated and involve multiple factors in AlphaCodium flow (Figure 1(a)):
and considerations. This is in contrast to more common
code datasets such as HumanEval [2], where the problems
are easier, and presented in a concise manner. An example Problem reflection Describe the problem, in bullet
of a typical HumanEval problem appears in appendix A. points, while addressing the problem goal, inputs, outputs,
Figure 2(b) depicts the model’s introspection on the rules, constraints, and other relevant details that appear in
problem presented in Figure 2(a). Note that proper self- the problem description.
reflection makes the problem clearer and more coherent.

3
Problem Description Public Tests Private Tests
Mr. Chanek's city can be represented as a plane. He wants to build a Input: Input:
housing complex in the city. There are some telephone poles on the 22 54
plane, which is represented by a grid a of size (n + 1) × (m + 1). There is a 101 10010
telephone pole at (x, y) if a_{x, y} = 1.For each point (x, y), define S(x, y) 000 00000
as the square of the Euclidean distance between the nearest pole and (x, 000 01000
y). Formally, the square of the Euclidean distance between two points Output: 00001
(x_1, y_1) and (x_2, y_2) is (x_2 - x_1)^2 + (y_2 - y_1)^2. To optimize the 18 00100
building plan, the project supervisor asks you the sum of all S(x, y) for ... 00010
each 0 ≤ x ≤ n and 0 ≤ y ≤ m. Help him by finding the value of ∑_{x=0}^{n} Output:
{∑_{y=0}^{m} {S(x, y)}}. 36
... ...

(a) A typical CodeContests problem.

Self-reflection on the problem


- The problem is about a city represented as a plane with a grid of size (n + 1) x (m + 1) where
some points have telephone poles.

- The task is to calculate the sum of the squares of the Euclidean distances from each point in the
grid to its nearest telephone pole.

- The grid is represented by a matrix where 1 indicates the presence of a telephone pole and 0
indicates its absence.

- The input consists of the size of the grid (n, m) and the grid itself.

- The output is a single integer, the sum of the squares of the Euclidean distances from each point
to its nearest telephone pole.

(b) An AI-generated self-reflection on the problem.

Figure 2: Problem description and reflection - an example of a typical CodeContests problem, with AI-generated self-
reflection on the problem. While the original description is lengthy and complicated, proper self-reflection makes the problem
clearer and more coherent, leading to improved code solutions.

Public tests reasoning Explain why each test input leads run-fix iterations in the next stages will have a better chance
to the output. of succeeding. The stage flow:

• Choose a potential solution. Generate a corresponding


Generate possible solutions Generate a list of 2-3 possi- code, and run it on selected public and AI tests.
ble solutions to the problem, described in natural language.
• Repeat this process until the tests pass, or until a try-
limit is reached.
Rank solutions Rank the possible solutions and choose • The first code that passes the tests, or the code with
the “best solution”, in terms of correctness, simplicity, and the closest output (see appendix D), will be used as the
robustness. (not necessarily take the “most efficient” solu- base code for the next steps.
tion).

Generate additional AI tests Generate an additional 6- Iterate on public tests Start from the base code. Itera-
8 diverse input-output tests for the problem. Try to cover tively run it on the public tests. If the code fails on a specific
cases and aspects not covered by the original public tests. test, try to fix it, given the error message.

Initial code solution The goal of this stage is to generate Iterate on AI-generated Tests Continue the run-fix iter-
an initial code solution to the problem. It is essential that ations on the AI-generated tests. Use “test anchors” (see
this code will reasonably ”close” to the correct code, so the section 4).

4
and instead allows complicated tasks to be presented in a
straightforward, code-like manner. It also makes it possi-
ble to obtain complex answers that involve several stages,
3.3. Additional insights representing a logical and methodical thinking process.
In this section, we will offer additional insights and share While newer versions of GPT models [5] have built-in
intuitions about the proposed flow. support for JSON-style output, we argue that YAML output
Firstly, the flow relies on knowledge accumulation - try- is far more suitable specifically for code generation tasks,
ing to progress from easy to hard, gaining knowledge and see appendix B.
insight along the way to help with the more difficult stages.
For example, the output of the first step, problem reflection, Semantic reasoning via bullet points analysis: when
can be utilized as prompt input to more difficult steps like asking an LLM to reason about an issue, better results are
generate possible solutions. The pre-processing phase’s obtained when demanding the output to be in bullet points
outputs are used to aid the most challenging and critical format. Bullet points encourage an in-depth understanding
phase, code iterations, where we try to generate code that of the problem, and force the model to divide the output into
correctly solves the problem. logical semantic sections, leading to improved results. For
Another key observation in designing AlphaCodium is example, with self-reflection on a problem in bullet points
that for AI, generating more tests is easier than generating (Figure 2 (b)), each bullet point represents a semantic under-
a full solution code. Generating additional tests requires standing of a different part of the problem - general descrip-
mainly understanding the problem and basic brute-force or tion, goals and rules, input structure, and output structure.
logical reasoning. There is no need to fully “solve” the
problem in order to generate additional useful input-output LLMs do better when generating a modular code:
test pairs. This is in contrast to generating a correct solu- when LLMs are asked to generate a single lengthy func-
tion code, which requires a complete algorithmic solution, tion, we observed poor results - the code often contains
equivalent to correctly solving any possible pair of input- bugs or logical mistakes. Even worse, a single monolithic
output tests. As a result, we can generate more AI tests, and code hurts the ability to perform iterative fixing - the model
then leverage them to improve the code creation phase, as struggles to pinpoint and fix problems, even when given the
described in Figure 1(b). We further amplify the contribu- error message. When clearly asking the model to: “divide
tion of these additional tests by asking the model to focus the generated code into small sub-functions, with meaning-
on aspects not addressed by the original public tests, such ful names and functionality”, we observe a better-produced
as large inputs, edge cases, and so on. code, with fewer bugs, and higher success rates for the iter-
Also note that some steps can be combined into a sin- ative fixing stages.
gle LLM call, and the flow in Figure 2(a) is a conceptual
flow, emphasizing the process’s high-level steps. In prac- Soft decisions with double validation: LLMs tend to
tice, structured output (see section 4) enables to combine struggle with code tasks that require them to think, rea-
multiple stages into a single LLM call, in order to save re- son, and make strict, non-trivial decisions. Let’s take for
sources, or because a model performs better when doing example the task of generating additional tests for a prob-
specific tasks concurrently. lem. Quite often, some of the tests the model generates will
be plain wrong. With a double validation process, we add
4. Code-Oriented Design Concepts an extra step where, given the generated output, the model
is asked to re-generate the same output, but correct it if
In this section we will present additional design con- needed. For example, given the generated AI tests as in-
cepts, tricks, and best practices we found beneficial when put, the model is asked to re-generate the same tests, while
trying to solve code generation problems. AlphaCodium correcting wrong output, if exists. We found that this step of
flow proposed in Figure 1 extensively uses these design con- double validation, while encouraging the model to be crit-
cepts. ical and to reason, is more effective than asking a direct
yes\no question: ”is this test correct?”
YAML Structured output: the usage of structured out-
put - asking the model to generate an output in YAML for- Postpone decisions, try to avoid direct questions, and
mat, equivalent to a given Pydantic class - is a key compo- leave room for exploration: when we ask the model di-
nent in our proposed flow. An example of such instruction rect questions regarding complicated issues, we consistently
(possible solutions stage) appears in Figure 3. see hallucinations and wrong answers. To counter this, we
Structured output eliminates the majority of the has- adopt a flow of gradual data accumulation, from easier tasks
sle and dark knowledge required for ”prompt engineering” to harder ones:

5
...
Your goal is to present possible solutions to the problem.
Make sure that each solution fully addresses the problem goals, rules, and
constraints.

The output must be a YAML object equivalent to type $PossibleSolutions, according to


the following Pydantic definitions:

class Solution(BaseModel):
name: str = Field(description="The name of the solution")
content: str = Field(description="A description of the solution")
why_it_works: str = Field(description="Why this solution is correct. Be specific\
and detailed regarding the problem rules and goals")
complexity: str = Field(description="The complexity of the solution")

class PossibleSolutions(BaseModel):
possible_solutions: List[Solution] = Field(max_items=3, description="A list of\
possible solutions to the problem. Make sure each solution fully addresses the\
problem rules and goals, and has a reasonable runtime - less than three seconds\
on a modern computer, given the problem constraints for large inputs.")

Figure 3: Example for a prompt with structured output (generate possible solutions stage)

• Start with the easiest tasks - self-reflection on the prob- tests.


lem, and reasoning about public tests.
• Now iterate on the AI-generated tests, one by one. If a
• Move to generating additional AI tests, and possible test passes, add it to the list of test anchors
solutions to the problem
• If a test fails, assume it’s because the code is incor-
• Only after we acquire the model’s answers to the tasks rect, and try to fix the code. However, demand that the
above, we move to actual code generation, and run-fix fixed code will also pass all the test anchors already
iterations. acquired. As a result, the test anchors will protect us
against an incorrectly fixed code.
As another example, instead of choosing a single algorith-
mic solution to the problem, we prefer to rank several pos- Another optimization for test anchors is to sort the AI-
sible solutions, and give priority, but not exclusiveness, to generated tests from easy to hard. That way, there are more
the top-ranked solution when generating initial code. Since chances that the iterative process will acquire anchors at the
the model can be wrong, it’s better to avoid irreversible de- beginning of the process, which can be used as protection
cisions, and leave room for exploration and code iterations later when iterating on the more complicated AI tests.
with different possible solutions.
What did not work for us: In appendix C we present
Test anchors: even with double validation, some AI-
additional tricks and methods we tried, which have not led
generated tests will be wrong. This makes iterations on
to improved results.
them challenging - when a test fails, how can we know if it
is because the code is wrong, or because the test is wrong?
When we ask the model directly “who is wrong”, we often
5. Results
see hallucinations, and may end up with a wrongly fixed 5.1. Direct prompt vs. AlphaCodium flow
code. To tackle this problem, we utilized a technique of test
anchors: In Table 1 we compare AlphaCodium results to the re-
sults obtained with a single well-designed direct prompt.
• Iterate first on the public tests, which we know are cor- The metric being used is pass@k, defined as the percent-
rect. When finished, set all the passed tests as anchor age of problems solved by using k generated solutions per

6
model set method score model set method score
(pass@5) AlphaCodium 25%
Direct 7% (pass@5)
validation validation
DeepSeek AlphaCodium 20% CodeChain 17%
-33B [3] Direct prompt 12% (pass@5)
test GPT-3.5
AlphaCodium 24% AlphaCodium 17%
(pass@5)
Direct prompt 15% test
validation CodeChain 14%
AlphaCodium 25% (pass@5)
GPT-3.5
Direct prompt 8% GPT-4 AlphaCodium 44%
test
AlphaCodium 17% (pass@5)
Direct prompt 19% AlphaCode 17%
validation validation
(pass@10@1K)
AlphaCodium 44% AlphaCode
GPT-4 AlphaCode 24%
Direct prompt 12%
test (pass@10@100K)
AlphaCodium 29%
GPT-4 AlphaCodium 29%
Table 1: Comparison of AlphaCodium flow results to di- (pass@5)
rect prompt on various models. AlphaCode 16%
test
(pass@10@1K)
AlphaCode
problem. As can be seen, AlphaCodium flow consistently AlphaCode 28%
and significantly improves the performance of LLMs on (pass@10@100K)
CodeContests problems. This is true both for open-source
(DeepSeek) and close-source (GPT) models, and for both Table 2: Comparison of AlphaCodium to other works
the validation and test sets. For GPT-4 on the validation set, from the literature.
for example, the pass@5 score improves from 19% to 44%
- x2.3 improvement.
ample - how to treat problems with multiple solutions, how
5.2. Comparison to previous works to address tolerance issues, timeouts, and more. We com-
pare to the numbers reported in the papers, but release a full
In Table 2 we compare AlphaCodium results to other reproducible code and evaluation script of AlphaCodium,
methods from the literature. As can be seen, when com- to enable future comparisons to be more reliable and con-
paring AlphaCodium to CodeChain with the same model sistent.
(GPT-3.5) and the same metric (pass@5), AlphaCodium
consistently does better. 5.3. Computational effort and comparison to Al-
When comparing AlphaCodium to AlphaCode work, we phaCode and AlphaCode2
need to take into account that AlphaCode uses a different
generation methodology - fine-tuning an (unknown) model With AlphaCodium flow we perform ∼15-20 LLM calls
specifically for code problems, generating a very large num- per solution, so a pass@5 submission involves ∼ 100 LLM
ber of code solutions, clustering them, and submitting K calls.
solutions from the top clusters. pass@10@100K, for exam- AlphaCode did not report how many LLM calls were
ple, means the 100K (!) solutions were generated and clus- done per run [8]. Let’s assume one call per run was done
tered, and 10 solutions were finally chosen and submitted. (unknown, could be more), then a pass@10@100K (i.e.
AlphaCode used a fine-tuned model, and utilized a brute- ten submissions, curated from 100,000 generated solutions)
force-like approach with a significantly higher number of involves 1M LLM calls, four orders of magnitude more
LLM calls. Yet, the top results achieved by AlphaCodium than AlphaCodium. Yet, the top results obtained by Alpha-
are better Codium are better.
Note that neither AlphaCode nor CodeChain papers [8, Recently, a new work called AlphaCode2 [11] was pub-
7] released a reproducible open-source solution for Code- lished, where a Gemini-Pro model was fine-tuned and eval-
Contests, including end-to-end generation and evaluation uated on code programming problems. The paper also re-
scripts. There are subtleties when evaluating results. For ex- ported results on a CodeContests benchmark, but on an up-

7
dated variant that they did not release to the public. Accord- [6] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas
ing to AlphaCode2 report: “AlphaCode2 requires about 100 Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir
samples to reach the level of performance of AlphaCode Puranik, Horace He, Dawn Song, et al. Measuring
with a million samples, making it over 10000× more sam- coding challenge competence with apps. arXiv preprint
ple efficient.” Hence both AlphaCode2 and AlphaCodium arXiv:2105.09938, 2021. 3
are four orders of magnitude more efficient than Alpha- [7] Hung Le, Hailin Chen, Amrita Saha, Akash Gokul, Doyen
Sahoo, and Shafiq Joty. Codechain: Towards modular code
Code, LLMs call-wise. But, AlphaCode2 utilized a modern
generation through chain of self-revisions with representa-
foundation model (Gemini-Pro) that was heavily fine-tuned tive sub-modules. arXiv preprint arXiv:2310.08992, 2023.
specifically for CodeContests competition, while Alpha- 1, 7
Codium uses general-purpose models as-is, and improves [8] Yujia Li, David Choi, Junyoung Chung, Nate Kush-
their performances without extra data and an expensive man, Julian Schrittwieser, Rémi Leblond, Tom Eccles,
training phase. James Keeling, Felix Gimeno, Agustin Dal Lago, et al.
Competition-level code generation with alphacode. Science,
6. Conclusions 378(6624):1092–1097, 2022. 1, 3, 7
[9] Mike Mirzayanov, Oksana Pavlova, Pavel MAVRIN, Roman
In this paper, we introduced AlphaCodium, a code- Melnikov, Andrew Plotnikov, Vladimir Parfenov, and An-
oriented flow that iteratively runs and fixes a generated code drew Stankevich. Codeforces as an educational platform for
against input-output tests. The flow is divided into two main learning programming in digitalization. Olympiads in Infor-
phases: a pre-processing phase, where AlphaCodium rea- matics, 14(133-142):14, 2020. 1, 3
sons about the problem in natural language, and a code iter- [10] Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan,
ations phase, in which AlphaCodium iterates on public and Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Lar-
AI-generated tests. son, Yuanzhi Li, Weishung Liu, et al. Can generalist founda-
AlphaCodium also utilizes additional design concepts, tion models outcompete special-purpose tuning? case study
tricks, and best practices we found beneficial for code gen- in medicine. arXiv preprint arXiv:2311.16452, 2023. 1
eration: structured output in YAML format, generating [11] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui
modular code, semantic reasoning via bullet point analysis, Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan
soft decisions with double validation, encouraging explo- Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a
family of highly capable multimodal models. arXiv preprint
ration, and test anchors.
arXiv:2312.11805, 2023. 7
We tested AlphaCodium on a challenging code gen-
[12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
eration dataset called CodeContests. The proposed flow reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
consistently and significantly improves results of various Polosukhin. Attention is all you need. Advances in neural
closed-source and open-source models. AlphaCodium also information processing systems, 30, 2017. 1
outperforms previous works from the literature, while hav- [13] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed
ing a significantly smaller computational budget. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny
Zhou. Self-consistency improves chain of thought reason-
References ing in language models. arXiv preprint arXiv:2203.11171,
2022. 1
[1] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten
Bosma, Henryk Michalewski, David Dohan, Ellen Jiang,
Carrie Cai, Michael Terry, Quoc Le, et al. Program
synthesis with large language models. arXiv preprint
arXiv:2108.07732, 2021. 1
[2] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Hen-
rique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards,
Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evalu-
ating large language models trained on code. arXiv preprint
arXiv:2107.03374, 2021. 1, 3
[3] DeepSeek. Deepseek coder: Let the code write itself, 2023.
2, 3, 7
[4] Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta
Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston.
Chain-of-verification reduces hallucination in large language
models. arXiv preprint arXiv:2309.11495, 2023. 1
[5] Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature,
scope, limits, and consequences. Minds and Machines,
30:681–694, 2020. 2, 3, 5

8
Appendices
A. Typical HumanEval code problem

/*
Check if in given vector of numbers, are any two numbers closer to each other than
given threshold. >>>
has_close_elements({1.0, 2.0, 3.0}, 0.5) false >>>
has_close_elements({1.0, 2.8, 3.0, 4.0, 5.0, 2.0}, 0.3) true
*/
#include<stdio.h>
#include<vector>
#include<math.h>
using namespace std;
bool has_close_elements(vector<float> numbers, float threshold){

B. Why YAML output is better suited for code generation tasks than JSON output
While newer GPT versions1 have inherent support for JSON-style output, we argue that YAML output is far better for
code generation. Why - generated code often contains single-quote, double-quote, special characters, and so on. LLMs will
struggle to validly place these characters inside a JSON format, since a JSON output needs to be surrounded with initial
double quotes (see Figure 4 (a)). In contrast, YAML output with block scaler2 must only obey indention. Any text or code
with proper indention will be a legal one (see Figure 4 (b)).
In addition, as can be seen in Figure 4, since YAML format doesn’t need curly brackets, quotations or escape characters,
its output has fewer tokens than JSON, hence reducing cost and inference time, and resulting in increased quality as the
model needs to pay attention to fewer tokens that are not essential.

(a) Token counting with JSON output (b) Token counting with YAML output

Figure 4: Comparison of the same output, once in JSON format, and once in YAML format. Taken from OpenAI playground.

1 https://platform.openai.com/docs/guides/text-generation/json-mode
2 https://yaml-multiline.info/

9
C. What didn’t work for us
1. Injecting the failed execution trace to the prompt: In addition to giving the model the error message when doing
iterative fixing, we also tried giving it the trace of the last X (50) lines executed. We have not observed improvement
from this.
2. Injecting the last K failed code solutions to the prompt: When doing iterative fixing, we tried injecting the last K
failed code solutions to the prompt, in order to steer the model in different directions. We have not observed improvement
from this.
3. Injecting the last git patch diff to the prompt: When doing iterative fixing, we also tried to give the last applied code
patch diff to the prompt. No improvement was seen.
4. Complicated single-stage prompts: we have not observed any significant improvement in results when trying to ma-
nipulate and optimize a single-stage prompt, or a chain of non-iterative prompts. The model still struggles to understand
the lengthy problem description, tends to ignore specific details, and consistently produces wrong code.

D. Estimating distance between tests’ outputs


When we run a solution code against an input test, the code generates an output. We compare this output to the expected
output, and end up with a boolean answer: pass or failed. However, it is also beneficial to try and estimate in some sense the
distance between the generated output and the correct output. For this matter, we employed the following logic:

• If the test output is a number, calculate the L2 distance.


• If the test output is an array of numbers, calculate the sum of L2 distances between corresponding array cells.
• If the test output is an array of strings, compare each cell separately (boolean comparison), and return the number of
non-identical cells.

This methodology enables us to produce a distance between each generated output and the correct output on CodeCon-
tests.

10

You might also like