
COMPUTER SCIENCE

Competition-level code generation with AlphaCode

Yujia Li*†, David Choi*†, Junyoung Chung†, Nate Kushman†, Julian Schrittwieser†, Rémi Leblond†, Tom Eccles†, James Keeling†, Felix Gimeno†, Agustin Dal Lago†, Thomas Hubert†, Peter Choy†, Cyprien de Masson d’Autume†, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, Oriol Vinyals*

DeepMind, London, UK.
*Corresponding author. Email: yujiali@deepmind.com (Y.L.); davidhchoi@deepmind.com (D.C.); vinyals@deepmind.com (O.V.)
†These authors contributed equally to this work.

Science 378, 1092–1097 (2022), 9 December 2022; DOI: 10.1126/science.abq1158

Programming is a powerful and ubiquitous problem-solving tool. Systems that can assist programmers or even generate programs themselves could make programming more productive and accessible. Recent transformer-based neural network models show impressive code generation abilities yet still perform poorly on more complex tasks requiring problem-solving skills, such as competitive programming problems. Here, we introduce AlphaCode, a system for code generation that achieved an average ranking in the top 54.3% in simulated evaluations on recent programming competitions on the Codeforces platform. AlphaCode solves problems by generating millions of diverse programs using specially trained transformer-based networks and then filtering and clustering those programs to a maximum of just 10 submissions. This result marks the first time an artificial intelligence system has performed competitively in programming competitions.

Automatically creating programs given a high-level description of what they should do is a long-standing task in computer science (1, 2). Creating an artificial intelligence (AI) system that can solve unforeseen problems by generating code from problem descriptions is a challenge that both affords a greater understanding of problem solving and reasoning (3) and leads to important applications, such as improving programmer productivity (4) and education (5).

Generating code that solves a specified task requires searching in the enormous space of all possible character sequences, only a tiny portion of which corresponds to valid and correct programs. Furthermore, single character edits can completely change program behavior or even cause crashes, and each task has many valid solutions that may be drastically different. These challenges make learning to generate correct programs difficult. Most prior work has been limited to either restricted domain-specific programming languages (6) or short code snippets (7, 8). Perhaps the best-known examples of program synthesis are Flash Fill (6), which synthesizes programs from input and output examples to automatically fill in data in Microsoft Excel, and code completion tools common in integrated development environments, which boost programmer productivity (9, 10).

Contrasting with the conceptually more complex systems used in most of program synthesis history, recent large-scale transformer-based (11) language models [which achieve impressive performance generating text (12)] can, with minimal modification, solve simple programming problems in Python (13, 14). A stripped-down version of our model performs similarly to these prior works (table S13). However, those problems consist mostly of simple task descriptions with short solutions, and solving them often amounts to translating a sequence of steps (e.g., adding together all even numbers in a list) directly into code. In contrast, generating entire programs often relies on understanding the task (e.g., win a board game), reasoning out the appropriate algorithm to solve it, and then writing the code to implement that algorithm.

Solving competitive programming problems (Fig. 1A) represents a big step forward. It requires understanding complex natural language descriptions, reasoning about previously unseen problems instead of simply memorizing code snippets, mastering a wide range of algorithms and data structures, and precisely implementing submissions that can span hundreds of lines. To evaluate these submissions (Fig. 1B), they are executed on an exhaustive suite of hidden tests and checked for execution speed and correctness on edge cases. Feedback is minimal; the submission is correct only if it has the correct output on all hidden tests, otherwise it is incorrect. Hidden tests are not visible to the submitter, who must instead write their own tests or rely on the trivial example tests for debugging. Because competitors are allowed to draw on solutions and algorithms from previous contests, challenging new problems are created for each competition. Competitive programming is very popular; events such as the International Collegiate Programming Competition (15) and the International Olympiad in Informatics (16) date back to the 1970s and are some of the most prestigious competitions in computer science, drawing hundreds of thousands of participants from around the world. Using problems that humans find challenging from battle-tested competitions provides a robust and meaningful benchmark for many aspects of intelligence.

Early work using program synthesis for competitive programming has shown that large transformer models can achieve low single-digit solve rates (13, 17). In contrast, we created a code generation system named AlphaCode that manages to solve 29.6% of held-out competitive programming problems in the test set of a dataset we released, named CodeContests, using at most 10 submissions per problem (comparable to humans). A key driver of AlphaCode's performance came from scaling the number of model samples to orders of magnitude more than previous work; the overall solve rate scaled log-linearly with the number of samples generated, even when only 10 of them were submitted, a sample scaling law similar to those found for training compute and model size (12).

When evaluated on simulated programming competitions hosted on the popular Codeforces platform (18), AlphaCode achieved an average ranking within the top 54.3% of human participants (a small, selected subset of all programmers). To the best of our knowledge, a computer system has never before achieved such a competitive level in programming competitions.

Table 1. AlphaCode's results on Codeforces competitions. For each contest, we show the estimated percent ranking (lower is better) using simulated time and incorrect submission penalties, as well as the best and worst possible rankings using minimum and maximum time penalties, averaged over three evaluations. Percents are how many users performed better than AlphaCode. AlphaCode achieved an overall ranking in the top 54.3% averaged across the 10 contests.

Contest ID | 1591  | 1608  | 1613  | 1615  | 1617  | 1618  | 1619  | 1620  | 1622  | 1623  | Average
Maximum    | 43.5% | 43.6% | 59.8% | 60.5% | 65.1% | 32.2% | 47.1% | 54.0% | 57.5% | 20.6% | 48.4%
Estimated  | 44.3% | 46.3% | 66.1% | 62.4% | 73.9% | 52.2% | 47.3% | 63.3% | 66.2% | 20.9% | 54.3%
Minimum    | 74.5% | 95.7% | 75.0% | 90.4% | 82.3% | 53.5% | 88.1% | 75.1% | 81.6% | 55.3% | 77.2%

Learning system

Our system (Fig. 2) was designed to address the main challenges of competitive programming: (i) searching in the huge space of programs, (ii) access to only ~13,000 example tasks for training, and (iii) a restricted number of submissions per problem. Notably, scaling the amount of model samples generated greatly improved performance (Fig. 3), and therefore many aspects of our system were designed to draw samples as quickly as possible, to ensure sample diversity, and to select the best samples for submission. See supplementary materials (SM) text section C for full details of the method.

Fig. 2. Overview of AlphaCode. (A) In pretraining, files from GitHub are randomly split into two parts. The first part goes to the encoder as input, and the decoder is trained to produce the second part. (B) In fine-tuning, problem descriptions (formatted as comments) are given to the encoder, and the decoder is trained to generate the solutions. (C) For evaluation, AlphaCode generates many samples for each problem description, then it executes them to filter out bad samples and cluster the remaining ones before finally submitting a small set of candidates.
Creating solutions to problems can be seen as a sequence-to-sequence (19) prediction task: Given a problem description X in natural language (e.g., Fig. 1A), produce a corresponding solution Y in a programming language (e.g., Fig. 1B). This task naturally motivated an encoder-decoder transformer architecture (11) for AlphaCode, which models p(Y|X). The architecture took as input to the encoder the problem description and metadata X as a flat sequence of tokenized (20) characters and sampled Y autoregressively from the decoder one token at a time until an end-of-code token was produced. The code was then ready to be compiled and run against tests.
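The sketch below illustrates the shape of this encode-then-sample loop. It is purely an illustration, not the AlphaCode implementation: the model and tokenizer objects and the END_OF_CODE token are hypothetical placeholders.

import numpy as np

def sample_from_logits(logits, temperature=0.9, rng=np.random.default_rng()):
    # Temperature-scaled softmax followed by a single categorical draw.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def sample_solution(model, tokenizer, problem_description, max_tokens=2048):
    # Encode the problem statement and metadata as one flat token sequence.
    memory = model.encode(tokenizer.encode(problem_description))
    decoded = []
    for _ in range(max_tokens):
        logits = model.decode_step(memory, decoded)   # next-token logits from the decoder
        token = sample_from_logits(logits)
        if token == tokenizer.END_OF_CODE:            # stop at the end-of-code token
            break
        decoded.append(token)
    return tokenizer.decode(decoded)  # program text, ready to compile and run against tests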
We pretrained our models on a 715-GB snapshot of human code from GitHub (21), with a cross-entropy next-token prediction loss. During pretraining, we randomly split code files into two parts, using the first part as input to the encoder and training the decoder to produce the second part. This pretraining learned a strong prior for human coding, enabling subsequent task-specific fine-tuning with a much smaller dataset.
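A toy example of how such a pretraining pair could be constructed from one source file is shown below. The real pipeline operates on tokenized text with its own splitting rules; the line-boundary cut here is an assumption made only for illustration.

import random

def make_pretraining_example(source_code, rng=random.Random(0)):
    # Split one code file into an (encoder_input, decoder_target) pair.
    # Illustrative only: we simply cut at a random line boundary.
    lines = source_code.splitlines(keepends=True)
    if len(lines) < 2:
        return source_code, ""             # degenerate file: nothing left to predict
    cut = rng.randint(1, len(lines) - 1)
    encoder_input = "".join(lines[:cut])   # first part goes to the encoder
    decoder_target = "".join(lines[cut:])  # second part is the next-token prediction target
    return encoder_input, decoder_target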
We fine-tuned and evaluated models on a 2.6-GB dataset of competitive programming problems that we created and released under the name CodeContests (22). CodeContests includes problems, correct and incorrect human submissions, and test cases. The training set contains 13,328 problems, with an average of 922.4 submissions per problem. The validation and test sets contain 117 and 165 problems, respectively. Full sets of hidden test cases used in competitions were not available, so we included extra generated tests in CodeContests (created by mutating existing tests) to decrease the false positive rate of our model's submissions from 60% (comparable to other datasets) to 4%. We still observed that 42% of solutions were correct but algorithmically inefficient, running out of time or memory on larger test inputs, which motivated the crucial step of evaluating model submissions on the official competitive programming website for the main evaluation. The validation set of CodeContests was used as an easy-to-measure, low-variance proxy of submitting to a competition, and the test set was used only for final evaluations after all training and tuning had been completed. All data used to train the models (in pretraining and fine-tuning) appeared online before all problems in the validation and test sets, ensuring that only information that humans could have seen before a competition would have been available to the models.

Fig. 1. Competitive programming problem statement and solution. (A) Problem statement of 1553D (“Backspace”; https://codeforces.com/problemset/problem/1553/D), a Codeforces problem (18). This is a medium-difficulty problem, with a rating of 1500, and its statement includes a public example test case. The entire statement is given to AlphaCode. (B) Solution to the problem statement, generated by AlphaCode. Examples of the exact formatting of problem descriptions, solutions, and what the model perceives can be found at https://alphacode.deepmind.com (32).

During fine-tuning, we encoded the natural language problem statement as a program comment, to make it appear more similar to the files seen during pretraining (which can include extended natural language comments), and used the same next-token prediction loss. We trained a variety of models ranging from 300 million to 41 billion parameters.
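For intuition, wrapping a statement as a comment might look like the sketch below. The exact prompt format used by AlphaCode differs (see https://alphacode.deepmind.com); this wrapper is purely illustrative.

def format_problem_as_comment(statement: str, language: str = "PYTHON3") -> str:
    # Wrap each line of the natural-language statement in comment syntax so the
    # fine-tuning input resembles commented source files seen during pretraining.
    # Illustrative only; not the exact AlphaCode prompt format.
    marker = "#" if language == "PYTHON3" else "//"
    commented = "\n".join(f"{marker} {line}" for line in statement.splitlines())
    return commented + "\n"

print(format_problem_as_comment("Given an array a of n integers, find ..."))
# -> "# Given an array a of n integers, find ..."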
Our models included the following enhancements on top of a standard transformer encoder-decoder architecture. Each improvement improved the solve rate, as we show in the evaluation section:

1) Multi-query attention: Multi-query attention (23) used a full set of query heads but shared key and value heads per attention block to substantially reduce memory usage and cache updates (which were bottlenecks during sampling). In combination with an encoder-decoder model instead of a decoder-only one, the change increased the sampling speed of AlphaCode more than 10-fold.

2) Masked language modeling (MLM): We included an MLM loss (24) on the encoder, which empirically improved model solve rates.

3) Tempering: Tempering (25) artificially made the training distribution sharper, producing a smoother sampling distribution (a regularization effect to prevent overfitting).

4) Value conditioning and prediction: We used value conditioning and prediction to discriminate between correct and incorrect problem submissions contained in CodeContests, providing an additional training signal and allowing the use of incorrect submission data that might otherwise mislead the model.


5) Generation by off-policy learning from demonstrations (GOLD): The CodeContests training set contains hundreds of solutions for each problem. The standard cross-entropy next-token prediction loss would put equal weight on all solutions. A successful model, however, only needs to generate a single correct solution for each problem. To resolve this discrepancy, we adopted a variation of GOLD (26), an offline reinforcement learning (RL) algorithm that focuses training on the most likely solutions for each problem instead of all possible solutions.
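The following is a minimal sketch of a GOLD-style weighted likelihood objective for one training solution, assuming per-token log-probabilities are available from the model. The clipping constant and exact weighting are our own simplification for illustration, not necessarily the variant used in AlphaCode (see SM text section C).

import numpy as np

def gold_style_loss(token_logprobs, min_weight=0.05):
    # `token_logprobs`: the model's log-probabilities of each ground-truth token.
    # Standard cross-entropy weights every token equally; a GOLD-style objective
    # instead up-weights tokens the model already finds likely, focusing learning
    # on solutions the model can plausibly generate itself.
    logp = np.asarray(token_logprobs, dtype=np.float64)
    weights = np.maximum(np.exp(logp), min_weight)  # treated as constants (no gradient flows through them)
    return -(weights * logp).sum()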
At sampling time, the diversity of samples was important, so that the millions of samples for each problem could effectively explore the space of possible solutions. As in another publication (27), we ensured sample diversity by using a high temperature and conditioning samples on random metadata: problem difficulty ratings, problem tags (which indicate which techniques a solution might use), and solution programming language.
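A toy sketch of such metadata conditioning is shown below; the tag names, rating buckets, and header format are hypothetical placeholders rather than AlphaCode's actual prompt fields.

import random

RATINGS = [800, 1200, 1600, 2000, 2400]
TAGS = ["greedy", "math", "sortings", "dp", "graphs", "brute force"]
LANGUAGES = ["PYTHON3", "C++17"]

def make_diverse_prompt(problem_statement, rng=random.Random()):
    # Condition each sample on randomly chosen metadata so that repeated
    # sampling (at high temperature) explores different styles of solution.
    rating = rng.choice(RATINGS)
    tags = rng.sample(TAGS, k=rng.randint(1, 2))
    language = rng.choice(LANGUAGES)
    header = f"RATING: {rating}\nTAGS: {', '.join(tags)}\nLANGUAGE: {language}\n"
    return header + problem_statement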
To select the 10 best samples for submission, we applied filtering and clustering to obtain a small number of candidate submissions on the basis of their program behavior. Filtering executed samples using example tests included in the problem statement and removed samples that failed those tests. This filtering removed ~99% of model samples. The possibly tens of thousands of candidate samples that remained were then clustered by executing them on inputs generated by a separate transformer model trained to do test input generation and by grouping together programs that produced the same outputs on the generated inputs. We then picked one sample from each of the 10 largest clusters for submission, approximately the most likely program behaviors from our model. Intuitively, correct programs would behave the same and form large clusters, but incorrect programs could fail in many different ways.
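The following sketch shows the shape of this filter-then-cluster selection, assuming a sandboxed run_program(program, input_text) helper and example_tests given as (input, expected output) pairs; the separate test-input-generation model and tie-breaking details are omitted.

from collections import defaultdict

def select_submissions(samples, example_tests, generated_inputs, run_program, max_submissions=10):
    # 1) Filtering: keep only programs that pass every public example test.
    passing = [p for p in samples
               if all(run_program(p, inp) == out for inp, out in example_tests)]

    # 2) Clustering: group programs that behave identically on generated inputs.
    clusters = defaultdict(list)
    for program in passing:
        signature = tuple(run_program(program, inp) for inp in generated_inputs)
        clusters[signature].append(program)

    # 3) Submit one representative from each of the largest clusters.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [cluster[0] for cluster in ranked[:max_submissions]]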
Evaluation

To assess the performance of AlphaCode, we evaluated it against programming competitions from the Codeforces platform (18). Compared with reporting the solve rate on a dataset, this evaluation avoids dataset assumptions and weaknesses that could skew results and allows us to benchmark against the best performers on this task: human competitors. We ensembled our 41 billion (41B) and 9 billion (9B) parameter models by pooling their samples and then evaluated the ensemble on all Codeforces competitions from 1 December 2021 to 28 December 2021 with >5000 participants per contest, a total of 10 competitions that we believe are a representative sample of Codeforces contests. For each contest, we simulated running AlphaCode live, generating samples for each problem, filtering with example tests (28), and clustering to get candidate submissions. We submitted these candidates to the Codeforces platform and computed AlphaCode's placement in each contest (Table 1). After the first run, we repeated this procedure two more times to measure variance.

Overall, our system achieved an average ranking in the top 54.3% when limited to 10 submissions per problem, although 66% of solved problems were solved with the first submission. This performance in competitions approximately corresponds to a novice programmer with a few months to a year of training (see SM text section G). To the best of our knowledge, this is the first time that a computer system has been competitive with human participants in programming competitions.
Training and evaluating our largest 41B model on Codeforces required a total of 2149 petaflop/s-days and 175 megawatt-hours [~16 times the average American household's yearly energy consumption (29)]. The large resource usage required for this experiment both has environmental impacts and makes replication challenging for most research entities.

To assess our modeling decisions, we evaluated our models on the validation and test sets of CodeContests. The main CodeContests metric was 10@k: the percentage of problems solved when we take k samples from the model for each problem but can only submit 10 of them for evaluation on the hidden tests. If any of the 10 selected samples solved a problem, the problem is solved. This limit of 10 submissions mimics the upper bound of the number of submissions a human would typically use, and tracking k is important for comparisons, as performance improves as k increases.
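For intuition, a naive Monte Carlo estimate of 10@k for a single problem could look like the sketch below (the exact evaluation protocol is detailed in the SM); select_10 stands in for the filtering and clustering step, and is_correct for the hidden-test check, both hypothetical helpers.

import random

def estimate_10_at_k(sample_pool, k, select_10, is_correct, trials=100, rng=random.Random(0)):
    # Naive estimate of 10@k for one problem: repeatedly draw k samples,
    # select at most 10 submissions, and check whether any passes the hidden tests.
    # Assumes len(sample_pool) >= k.
    solved = 0
    for _ in range(trials):
        drawn = rng.sample(sample_pool, k)     # k generated programs for this problem
        submissions = select_10(drawn)[:10]    # at most 10 submissions allowed
        solved += any(is_correct(p) for p in submissions)
    return solved / trials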
With up to 100,000 samples per problem (10@100K), the AlphaCode 41B model solved 29.6% of problems in the CodeContests test set. Prior work on comparable datasets achieved low single-digit solve rates (13, 17), and direct comparisons can be found in SM text section E.3.

Figure 3 shows how the model performance scaled on the 10@k metrics with more samples, that is, as we increased k, or with more compute. These scaling curves highlight a few notable facts about this problem domain and our models:

1) Solve rates scale log-linearly with more samples. As shown in Fig. 3A, the 10@k solve rates scaled approximately log-linearly with k, bending down slightly at high sample budgets. The fact that sampling much more than 10 still improved the 10@k solve rate shows how important it was to sufficiently explore the search space before committing to the final 10 submissions per problem. However, improving the solve rate required exponentially increasing amounts of samples, and the costs quickly became prohibitive.

2) Better models have higher slopes in the scaling curve. Figure 3A also shows that larger models tended to generate higher-quality samples, reflected as better solve rates with the same number of samples and higher slopes in this log-linear scaling curve. Because of log-linear scaling, a better model could reach the same solve rates with exponentially fewer samples than worse models. Improving model quality is an effective way to counter the exponential explosion of sample budget required for a higher solve rate.

3) Solve rates scale log-linearly with more compute. As shown in Fig. 3B, the solve rate also scaled approximately log-linearly with more training compute. Each vertical slice on the curves corresponds to one model size. Figure 3C shows how the solve rate scaled with sampling compute: Larger models required more compute per sample but eventually outperformed smaller models, as the better quality of samples from the larger models became the dominant factor for performance. Both ways of leveraging more compute demonstrate log-linear scaling.

4) Sample selection is critical to solve rate scaling. Figure 3D shows how 10@k solve rates scaled when we used different sample selection methods. With clustering and filtering, as was used in AlphaCode, taking more samples improved the solve rate. However, the combination was far from the theoretical upper bound of performance with perfect sample selection. Without filtering (bottom curve), 10 samples were randomly selected for submission, and taking more samples did not improve the solve rate.
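As rough shorthand for the trends in points 1 and 2 (our own summary, not a fitted equation from the paper), the sample-scaling behavior can be written as

\[ \text{solve rate}(10@k) \;\approx\; \alpha_m + \beta_m \log_{10} k, \]

where the intercept \(\alpha_m\) and slope \(\beta_m\) both tend to increase with model size m.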
Table 2 shows a buildup ablation of AlphaCode's enhancements, which significantly improved the solve rate on models at the 1B scale. We started from the base setting with no enhancements (beyond the multi-query attention change). We added one new setting at a time, with the final setting that corresponds to AlphaCode reported at the bottom of the table. Combining the six enhancements together increased the 10@1M solve rate from 19.6% to 28.4%, although the improvement depended on the number of samples.

Table 2. Buildup ablation for model enhancements. Effect of each additional model enhancement, building up from "No enhancements," a plain fine-tuned 1B encoder-decoder model trained with the standard next-token prediction loss. Numbers in parentheses represent 95% confidence intervals. For each setting, we fine-tuned and sampled from at least three different models from the same pretrained checkpoint.

Fine-tuning setting       | Solve rate 10@1K  | Solve rate 10@1M
No enhancements           | 6.7% (6.5–6.8)    | 19.6% (18.2–20.4)
+ MLM                     | 6.6% (6.2–7.0)    | 20.7% (19.1–21.3)
+ Tempering               | 7.7% (7.2–8.5)    | 21.9% (20.7–22.6)
+ Random tags and ratings | 6.8% (6.4–7.0)    | 22.4% (21.3–23.0)
+ Value                   | 10.6% (9.8–11.1)  | 23.2% (21.7–23.9)
+ GOLD                    | 12.4% (12.0–13.0) | 24.2% (23.1–24.4)
+ Clustering              | 12.2% (10.8–13.4) | 28.4% (27.5–29.3)

Model analysis

To explore the possibility of solving problems by exploiting weaknesses in the problem structure, we analyzed the capabilities and limitations of our models. A common concern for large language models trained on large amounts of data is that they solve downstream problems by simply memorizing the training set (30, 31). For competitive programming to be a good test of problem-solving ability, models need to come up with novel solutions to solve new problems. We found no evidence that our model copied core logic from the training data. Copying solutions from the training set was not sufficient to solve any problems in the unseen validation set, and the longest substrings that both human solutions and AlphaCode solutions shared with the training data were of similar lengths (fig. S10). Further manual investigation showed that core logic of model solutions necessary to solve problems could not be found in the training data.

We also analyzed the properties of model samples. Like humans, models correctly solved easier problems more often than harder problems; the 41B model solved 62.4% of validation problems rated between 800 and 1100 and 7.8% of problems rated between 2400 and 2700 (higher-rated problems are more difficult). Models had the highest solve rates for problems that deal with bitmasks, sorting, greedy algorithms, or math, but the lowest solve rates for dynamic programming, constructive algorithm, and graph problems. The former categories tend to afford shorter solutions and the latter categories afford longer ones, indicating that AlphaCode had trouble producing longer solutions.

AlphaCode was sensitive to changes in the problem descriptions, indicating that it did not exploit obvious weaknesses in the task structure. We modified a problem statement that asked for a program that can find the maximum product of two consecutive array elements. When AlphaCode generated samples for the opposite problem (minimum instead of maximum product) and a related problem (any elements instead of consecutive ones), the percentage correctness of those samples on the original problem decreased from 17.1% to 0.1% and 3.2% for the opposite and related rewordings, respectively. AlphaCode attended to important changes in problem descriptions that fundamentally changed the problem, instead of uniformly trying every related algorithm by abusing the large sample budget. Conversely, alterations that fundamentally left the problem unchanged affected solve rates less, indicating that AlphaCode could mostly ignore changes that humans could also ignore. For example, introducing 30 adjacent character transposition typos decreased the 10@1024 solve rate from 13.5% to 11.3%. More analysis can be found in SM text section F.


Fig. 3. Solve rate scaling. (A) The solve rate scaled approximately log-linearly with the number of samples, tapering off slightly in the 10@k setting. Larger models with more parameters had higher scaling slopes in this log-linear plot. (B) The solve rate scaled approximately log-linearly with the training compute when we choose near-optimal model sizes for each compute allocation. (C) As we increased the amount of compute we used for sampling, the optimal model size increased. (D) Improved sample selection methods were necessary to see solve rate scaling. In order of quality: random selection ("10@k no filtering"), filtering using example tests ("10@k with filtering"), clustering after filtering ("10@k with filtering + clustering"), and perfect sample selection. TPU, tensor processing unit.

Conclusion

Here, we present AlphaCode, a system applied to code generation for competitive programming that can generate novel solutions to unseen programming problems. When evaluated on Codeforces, AlphaCode performed roughly at the level of the median competitor. We found that massively scaling up sampling and then filtering and clustering samples to a small set, together with efficient transformer architectures to support large-scale sampling, were essential to achieving good performance. Our clean dataset and robust evaluation procedure [released as CodeContests (22)] also contributed substantially to guiding our research progress. We show through detailed analysis that there is no evidence that AlphaCode copied important parts of previous solutions or exploited weaknesses in the problem structure. The analysis indicates that our model was indeed able to solve problems it has never seen before, even though those problems require considerable reasoning.

Combined with careful research and engineering, the relatively simple architecture of transformers shows exceptional potential in performing reasoning necessary to solve complex problems with code. This line of work has exciting applications that can improve the productivity of programmers and make programming accessible to a new generation of developers. Our work in code generation still leaves much room for improvement, and we hope that further work will advance the fields of both programming and reasoning in AI.

REFERENCES AND NOTES

1. Z. Manna, R. J. Waldinger, Commun. ACM 14, 151–165 (1971).
2. A. Church, J. Symb. Log. 28, 289–290 (1963).
3. C. C. Green, in Readings in Artificial Intelligence, B. L. Webber, N. J. Nilsson, Eds. (Elsevier, 1981), pp. 202–222.
4. N. D. Matsakis, F. S. Klock II, ACM SIGAda Ada Lett. 34, 103–104 (2014).
5. M. Resnick et al., Commun. ACM 52, 60–67 (2009).
6. S. Gulwani, SIGPLAN Not. 46, 317–330 (2011).


7. V. Raychev, M. Vechev, E. Yahav, SIGPLAN Not. 49, 419–428 (2014).
8. M. Bruch, M. Monperrus, M. Mezini, "Learning from examples to improve code completion systems," Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE '09), Amsterdam, Netherlands, 24 to 28 August 2009 (Association for Computing Machinery, 2009), pp. 213–222.
9. A. Hindle, E. T. Barr, Z. Su, M. Gabel, P. Devanbu, "On the naturalness of software," Proceedings of the 34th International Conference on Software Engineering (ICSE '12), Zurich, Switzerland, 2 to 9 June 2012 (IEEE Press, 2012), pp. 837–847.
10. A. Svyatkovskiy, S. K. Deng, S. Fu, N. Sundaresan, "IntelliCode compose: code generation using transformer," Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '20), 8 to 13 November 2020 (Association for Computing Machinery, 2020), pp. 1433–1443.
11. A. Vaswani et al., Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017).
12. T. B. Brown et al., arXiv:2005.14165 [cs.CL] (2020).
13. M. Chen et al., arXiv:2107.03374 [cs.LG] (2021).
14. J. Austin et al., arXiv:2108.07732 [cs.PL] (2021).
15. International Collegiate Programming Contest (ICPC), https://icpc.global/.
16. International Olympiad in Informatics (IOI), https://ioinformatics.org/.
17. D. Hendrycks et al., arXiv:2105.09938 [cs.SE] (2021).
18. Codeforces, https://codeforces.com/.
19. I. Sutskever, O. Vinyals, Q. V. Le, Adv. Neural Inf. Process. Syst. 27, 3104–3112 (2014).
20. T. Kudo, J. Richardson, arXiv:1808.06226 [cs.CL] (2018).
21. The public GitHub dataset is hosted on Google BigQuery at https://console.cloud.google.com/marketplace/product/github/github-repos.
22. The released CodeContests dataset can be found at https://github.com/deepmind/code_contests and at https://doi.org/10.5281/zenodo.6975437.
23. N. Shazeer, arXiv:1911.02150 [cs.NE] (2019).
24. J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (Association for Computational Linguistics, 2019), pp. 4171–4186.
25. R. Dabre, A. Fujita, arXiv:2009.09372 [cs.CL] (2020).
26. R. Y. Pang, H. He, arXiv:2009.07839 [cs.CL] (2020).
27. O. Vinyals et al., Nature 575, 350–354 (2019).
28. For problems permitting multiple correct outputs, we changed the example test outputs to be the most canonical, which gives our approach a slight advantage in the evaluation. See SM text section D for more details.
29. US Energy Information Administration, Electric Sales, Revenue, and Average Price: Summary Table T5.a: 2021 Residential Average Monthly Bill by Census Division, and State; https://www.eia.gov/electricity/sales_revenue_price/.
30. A. Ziegler, "GitHub Copilot research recitation: GitHub Copilot: Parrot or Crow? A first look at rote learning in GitHub Copilot suggestions," The GitHub Blog, 30 June 2021; https://docs.github.com/en/github/copilot/research-recitation.
31. N. Carlini et al., "Extracting training data from large language models," 30th USENIX Security Symposium (USENIX Security 21), 11 to 13 August 2021, pp. 2633–2650.
32. AlphaCode examples, explanations, and visualizations can be found at https://alphacode.deepmind.com and at https://doi.org/10.5281/zenodo.6975437.
33. Y. Li et al., AlphaCode data materials, version 1.0.0, Zenodo (2022); https://doi.org/10.5281/zenodo.6975437.

ACKNOWLEDGMENTS

We thank T. Cai, J. Rae, S. Borgeaud, M. Glaese, R. Ring, L. Sifre, J. Hoffman, J. Aslanides, J.-B. Lespiau, A. Mensch, E. Elsen, G. van den Driessche, and G. Irving for developing tools that we use to train large language models and for lending their expertise in model training; K. Anderson, C. Pope, and R. Foley for project management in early stages; Y. Whye Teh, C. Dyer, D. Silver, A. Barekatain, A. Zhernov, M. Overlan, and P. Veličković for research advice and assistance; K. Simonyan, C. Dyer, and D. Yogatama for reviewing the paper; L. Bennett, K. Ayoub, and J. Stanway for logistically making the project possible; S. Dathathri for analyzing our model; E. Caballero for granting permission to use Description2Code data; R. Ke for helping connect us with E. Caballero; P. Heiber for helping connect us with Codeforces; P. Mitrichev for helping connect us with Codeforces and lending competitive programming expertise as the paper was being written; M. Mirzayanov for allowing us to evaluate on Codeforces and for lending competitive programming expertise when writing the paper; and everyone at DeepMind for their insight and support. Funding: All research in this paper was funded by DeepMind and Alphabet. There was no external funding. Author contributions: Y.L. and D.C. led the project; C.d.M.d’A., D.J.M., D.C., F.G., J.K., J.S., J.C., N.K., P.C., R.L., T.H., T.E., X.C., and Y.L. designed and implemented the learning system and algorithm; J.C. and N.K. ran the final experiments; A.D.L., D.C., F.G., J.K., J.M., J.S., J.C., N.K., P.C., R.L., T.H., T.E., and Y.L. worked on infrastructure for running and evaluating experiments; A.C., C.d.M.d’A., J.W., N.K., P.C., P.-S.H., R.L., S.G., and T.E. worked on model analysis; A.D.L., D.C., F.G., J.S., J.C., N.K., R.L., T.E., and Y.L. worked on datasets; D.C., I.B., J.C., N.K., and Y.L. worked on initial prototypes and infrastructure; D.C., J.K., J.S., J.C., N.d.F., N.K., O.V., P.C., R.L., T.E., and Y.L. wrote the paper; D.C., E.S.R., O.V., and Y.L. worked on project management; and N.d.F., O.V., K.K., and P.K. advised the project. Competing interests: This work was done in the course of employment at DeepMind, with no other competing financial interests. DeepMind has filed the following patent application related to this work: US63/306,043. Data and materials availability: The datasets used in the experiments as well as evaluation code have been made available for download on GitHub at https://github.com/deepmind/code_contests. Visualizations of the AlphaCode model, selected samples, and explanations can be found at https://alphacode.deepmind.com. Both of the above, as well as results of manual marking, can also be found in Zenodo (33). All other data needed to evaluate the conclusions in the paper are present in the paper or the supplementary materials. License information: Copyright © 2022 the authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original US government works. https://www.science.org/about/science-licenses-journal-article-reuse

SUPPLEMENTARY MATERIALS

science.org/doi/10.1126/science.abq1158
Supplementary Text
Figs. S1 to S29
Tables S1 to S23
References (34–88)

Submitted 29 April 2022; accepted 4 October 2022
10.1126/science.abq1158



Editor's summary: Machine learning systems can program too

Computer programming competitions are popular tests among programmers that require critical thinking informed by experience and creating solutions to unforeseen problems, both of which are key aspects of human intelligence but challenging to mimic by machine learning models. Using self-supervised learning and an encoder-decoder transformer architecture, Li et al. developed AlphaCode, a deep-learning model that can achieve approximately human-level performance on the Codeforces platform, which regularly hosts these competitions and attracts numerous participants worldwide (see the Perspective by Kolter). The development of such coding systems could have a huge impact on programmers' productivity. It may even change the culture of programming by shifting human work toward formulating problems, with machine learning mainly responsible for generating and executing code. —YS
