Competition-Level Code Generation With AlphaCode

Automatically creating programs given a high-level description of what they should do is a long-standing task in computer science (1, 2). Creating an artificial intelligence (AI) system that can solve unforeseen problems by generating code from problem descriptions is a challenge that both affords a greater understanding of problem […]

[…] programming problems in Python (13, 14). A stripped-down version of our model performs similarly to these prior works (table S13). However, those problems consist mostly of simple task descriptions with short solutions, and solving them often amounts to translating a sequence of steps (e.g., adding together all even numbers in a list) directly into code. […]

[…] a code generation system named AlphaCode that manages to solve 29.6% of test set held-out competitive programming problems in a dataset we released named CodeContests, using at most 10 submissions per problem (comparable to humans). A key driver of AlphaCode's performance came from scaling the number of model samples to orders of magnitude more than previous work […]
[…] problem submissions contained in CodeContests, providing an additional training signal and allowing the use of incorrect submission data that might otherwise mislead the model.

5) Generation by off-policy learning from demonstrations (GOLD): The CodeContests training set contains hundreds of solutions for each problem. The standard cross-entropy next-token prediction loss would put equal weight on all solutions. A successful model, however, only needs to generate a single correct solution for each problem. To resolve this discrepancy, we adopted a variation of GOLD (26), an offline reinforcement learning (RL) algorithm that focuses training on the most likely solutions for each problem instead of all possible solutions.
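The precise objective is given in the SM; as a rough illustration only, a GOLD-style loss reweights the per-token cross-entropy by the model's own (stop-gradient) probability of each token, so that gradient mass concentrates on solutions the model already considers likely. The PyTorch-style sketch below is an assumption: the weighting function and the alpha and beta hyperparameters are illustrative choices, not AlphaCode's exact formulation.

```python
import torch
import torch.nn.functional as F

def gold_style_loss(logits, targets, alpha=0.5, beta=0.05):
    """Sketch of a GOLD-style weighted next-token loss.

    logits: [batch, seq, vocab] model outputs; targets: [batch, seq] token ids.
    Plain cross-entropy weights every reference solution equally; here each
    token's loss is scaled by the model's own (detached) probability of that
    token, focusing training on solutions the model already finds likely.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    tok_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # [batch, seq]
    with torch.no_grad():
        # Importance weight per token; the floor beta keeps rarely predicted
        # tokens from being ignored entirely (illustrative choice).
        weight = tok_logp.exp().pow(alpha).clamp(min=beta)
    return -(weight * tok_logp).mean()
```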
At sampling time, the diversity of samples was important, so that the millions of samples for each problem could effectively explore the space of possible solutions. As in another publication (27), we ensured sample diversity by using a high temperature and conditioning samples on random metadata: problem difficulty ratings, problem tags (which indicate which techniques a solution might use), and solution programming language.
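A minimal sketch of what metadata-conditioned, high-temperature sampling could look like is shown below; the prompt layout, tag vocabulary, rating range, and model.sample call are hypothetical placeholders, not AlphaCode's actual interface.

```python
import random

TAGS = ["greedy", "dp", "graphs", "math", "strings", "brute force"]  # illustrative
LANGUAGES = ["PYTHON3", "CPP"]

def build_prompt(problem_statement: str) -> str:
    """Prefix the problem statement with randomly drawn metadata (sketch).

    Randomizing the difficulty rating, tags, and target language across samples,
    combined with a high softmax temperature, pushes the millions of samples
    drawn per problem toward different regions of the solution space.
    """
    rating = random.randint(800, 3500)                 # Codeforces-style rating
    tags = random.sample(TAGS, k=random.randint(1, 3))
    language = random.choice(LANGUAGES)
    header = f"RATING: {rating}\nTAGS: {', '.join(tags)}\nLANGUAGE: {language}\n"
    return header + problem_statement

# Hypothetical usage:
# samples = [model.sample(build_prompt(statement), temperature=0.9)
#            for _ in range(n_samples)]
```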
To select the 10 best samples for submission, we applied filtering and clustering to obtain a small number of candidate submissions on the basis of their program behavior. Filtering executed samples using example tests included in the problem statement and removed samples that failed those tests; this filtering removed ~99% of model samples. The possibly tens of thousands of candidate samples that remained were then clustered by executing them on inputs generated by a separate transformer model trained to do test input generation and by grouping together programs that produced the same outputs on the generated inputs. We then picked one sample from each of the 10 largest clusters for submission, approximately the most likely program behaviors from our model. Intuitively, correct programs would behave the same and form large clusters, whereas incorrect programs could fail in many different ways.
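As a rough sketch of this selection step (the function and parameter names are hypothetical, and the real system's sandboxed execution, timeouts, and output normalization are omitted):

```python
from collections import defaultdict

def select_submissions(samples, example_tests, generated_inputs, run, n_submit=10):
    """Filter then cluster candidate programs by behavior (sketch).

    samples: candidate programs; example_tests: [(input, expected_output), ...];
    generated_inputs: extra test inputs (in AlphaCode these come from a separate
    transformer trained to generate test inputs); run(program, test_input)
    executes a program and returns its output (or None on error).
    """
    # 1) Filtering: keep only programs that pass the example tests in the statement.
    passing = [p for p in samples
               if all(run(p, inp) == out for inp, out in example_tests)]

    # 2) Clustering: group survivors by their outputs on the generated inputs;
    #    programs with identical behavior land in the same cluster.
    clusters = defaultdict(list)
    for p in passing:
        signature = tuple(run(p, inp) for inp in generated_inputs)
        clusters[signature].append(p)

    # 3) Pick one representative from each of the largest clusters: correct
    #    programs tend to agree and form large clusters, while incorrect
    #    programs fail in many different ways.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [cluster[0] for cluster in ranked[:n_submit]]
```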
Evaluation

To assess the performance of AlphaCode, we evaluated it against programming competitions from the Codeforces platform (18). Compared with reporting the solve rate on a dataset, this evaluation avoids dataset assumptions and weaknesses that could skew results and allows us to benchmark against the best performers on this task: human competitors.

We ensembled our 41 billion (41B) and 9 billion (9B) parameter models by pooling their samples and then evaluated the ensemble on all Codeforces competitions from 1 December 2021 to 28 December 2021 with >5000 participants per contest, a total of 10 competitions that we believe are a representative sample of Codeforces contests. For each contest, we simulated running AlphaCode live, generating samples for each problem, filtering with example tests (28), and clustering to get candidate submissions. We submitted these candidates to the Codeforces platform and computed AlphaCode's placement in each contest (Table 1). After the first run, we repeated this procedure two more times to measure variance.

Overall, our system achieved an average ranking in the top 54.3% when limited to 10 submissions per problem, although 66% of solved problems were solved with the first submission. This performance in competitions approximately corresponds to a novice programmer with a few months to a year of training (see SM text section G). To the best of our knowledge, […]
[…] budget. Conversely, alterations that fundamentally left the problem unchanged affected solve rates less, indicating that AlphaCode could mostly ignore changes that humans could also ignore. For example, introducing 30 adjacent character transposition typos decreased the 10@1024 solve rate (the fraction of problems solved when 10 submissions are selected from 1024 samples per problem) from 13.5% to 11.3%. More analysis can be found in SM text section F.
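The exact perturbation procedure is described in SM text section F; a simple, purely illustrative way to inject adjacent-character transposition typos into a problem statement might look like this:

```python
import random

def add_transposition_typos(text: str, n_typos: int = 30, seed: int = 0) -> str:
    """Swap n_typos randomly chosen pairs of adjacent characters (sketch)."""
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_typos):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```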
Conclusion

Here, we present AlphaCode, a system applied to code generation for competitive programming that can generate novel solutions to unseen programming problems. When evaluated on Codeforces, AlphaCode performed roughly at the level of the median competitor. We found that massively scaling up sampling and then filtering and clustering samples to a small set, together with efficient transformer architectures to support large-scale sampling, were essential to achieving good performance. Our clean dataset and robust evaluation procedure [released as CodeContests (22)] also contributed substantially to guiding our research progress. We show through detailed analysis that there is no evidence that AlphaCode copied important parts of previous solutions or exploited weaknesses in the problem structure. The analysis indicates that our model was indeed able to solve problems it has never seen before, even though those problems require considerable reasoning.

Combined with careful research and engineering, the relatively simple architecture of transformers shows exceptional potential in performing the reasoning necessary to solve complex problems with code. This line of work has exciting applications that can improve the productivity of programmers and make programming accessible to a new generation of developers. Our work in code generation still leaves much room for improvement, and we hope that further work will advance the fields of both programming and reasoning in AI.

REFERENCES AND NOTES

1. Z. Manna, R. J. Waldinger, Commun. ACM 14, 151–165 (1971).
2. A. Church, J. Symb. Log. 28, 289–290 (1963).
3. C. C. Green, in Readings in Artificial Intelligence, B. L. Webber, N. J. Nilsson, Eds. (Elsevier, 1981), pp. 202–222.
4. N. D. Matsakis, F. S. Klock II, ACM SIGAda Ada Lett. 34, 103–104 (2014).
5. M. Resnick et al., Commun. ACM 52, 60–67 (2009).
6. S. Gulwani, SIGPLAN Not. 46, 317–330 (2011).
7. V. Raychev, M. Vechev, E. Yahav, SIGPLAN Not. 49, 419–428 (2014).
8. M. Bruch, M. Monperrus, M. Mezini, "Learning from examples to improve code completion systems," Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE '09), Amsterdam, Netherlands, 24 to 28 August 2009 (Association for Computing Machinery, 2009), pp. 213–222.
9. A. Hindle, E. T. Barr, Z. Su, M. Gabel, P. Devanbu, "On the naturalness of software," Proceedings of the 34th International Conference on Software Engineering (ICSE '12), Zurich, Switzerland, 2 to 9 June 2012 (IEEE Press, 2012), pp. 837–847.
10. A. Svyatkovskiy, S. K. Deng, S. Fu, N. Sundaresan, "IntelliCode compose: code generation using transformer," Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '20), 8 to 13 November 2020 (Association for Computing Machinery, 2020), pp. 1433–1443.
11. A. Vaswani et al., Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017).
12. T. B. Brown et al., arXiv:2005.14165 [cs.CL] (2020).
13. M. Chen et al., arXiv:2107.03374 [cs.LG] (2021).
14. J. Austin et al., arXiv:2108.07732 [cs.PL] (2021).
15. International Collegiate Programming Contest (ICPC), https://icpc.global/.
16. International Olympiad in Informatics (IOI), https://ioinformatics.org/.
17. D. Hendrycks et al., arXiv:2105.09938 [cs.SE] (2021).
18. Codeforces, https://codeforces.com/.
19. I. Sutskever, O. Vinyals, Q. V. Le, Adv. Neural Inf. Process. Syst. 27, 3104–3112 (2014).
20. T. Kudo, J. Richardson, arXiv:1808.06226 [cs.CL] (2018).
21. The public GitHub dataset is hosted on Google BigQuery at https://console.cloud.google.com/marketplace/product/github/github-repos.
22. The released CodeContests dataset can be found at https://github.com/deepmind/code_contests and at https://doi.org/10.5281/zenodo.6975437.
23. N. Shazeer, arXiv:1911.02150 [cs.NE] (2019).
24. J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (Association for Computational Linguistics, 2019), pp. 4171–4186.
25. R. Dabre, A. Fujita, arXiv:2009.09372 [cs.CL] (2020).
26. R. Y. Pang, H. He, arXiv:2009.07839 [cs.CL] (2020).
27. O. Vinyals et al., Nature 575, 350–354 (2019).
28. For problems permitting multiple correct outputs, we changed the example test outputs to be the most canonical, which gives our approach a slight advantage in the evaluation. See SM text section D for more details.
29. US Energy Information Administration, Electric Sales, Revenue, and Average Price: Summary Table T5.a: 2021 Residential Average Monthly Bill by Census Division, and State; https://www.eia.gov/electricity/sales_revenue_price/.
30. A. Ziegler, "GitHub Copilot research recitation: GitHub Copilot: Parrot or Crow? A first look at rote learning in GitHub Copilot suggestions," The GitHub Blog, 30 June 2021; https://docs.github.com/en/github/copilot/research-recitation.
31. N. Carlini et al., "Extracting training data from large language models," 30th USENIX Security Symposium (USENIX Security 21), 11 to 13 August 2021, pp. 2633–2650.
32. AlphaCode examples, explanations, and visualizations can be found at https://alphacode.deepmind.com and at https://doi.org/10.5281/zenodo.6975437.
33. Y. Li et al., AlphaCode data materials, version 1.0.0, Zenodo (2022); https://doi.org/10.5281/zenodo.6975437.

ACKNOWLEDGMENTS

We thank T. Cai, J. Rae, S. Borgeaud, M. Glaese, R. Ring, L. Sifre, J. Hoffman, J. Aslanides, J.-B. Lespiau, A. Mensch, E. Elsen, G. van den Driessche, and G. Irving for developing tools that we use to train large language models and for lending their expertise in model training; K. Anderson, C. Pope, and R. Foley for project management in early stages; Y. Whye Teh, C. Dyer, D. Silver, A. Barekatain, A. Zhernov, M. Overlan, and P. Veličković for research advice and assistance; K. Simonyan, C. Dyer, and D. Yogatama for reviewing the paper; L. Bennett, K. Ayoub, and J. Stanway for logistically making the project possible; S. Dathathri for analyzing our […] expertise as the paper was being written; M. Mirzayanov for allowing us to evaluate on Codeforces and for lending competitive programming expertise when writing the paper; and everyone at DeepMind for their insight and support. Funding: All research in this paper was funded by DeepMind and Alphabet. There was no external funding. Author contributions: Y.L. and D.C. led the project; C.d.M.d'A., D.J.M., D.C., F.G., J.K., J.S., J.C., N.K., P.C., R.L., T.H., T.E., X.C., and Y.L. designed and implemented the learning system and algorithm; J.C. and N.K. ran the final experiments; A.D.L., D.C., F.G., J.K., J.M., J.S., J.C., N.K., P.C., R.L., T.H., T.E., and Y.L. worked on infrastructure for running and evaluating experiments; A.C., C.d.M.d'A., J.W., N.K., P.C., P.-S.H., R.L., S.G., and T.E. worked on model analysis; A.D.L., D.C., F.G., J.S., J.C., N.K., R.L., T.E., and Y.L. worked on datasets; D.C., I.B., J.C., N.K., and Y.L. worked on initial prototypes and infrastructure; D.C., J.K., J.S., J.C., N.d.F., N.K., O.V., P.C., R.L., T.E., and Y.L. wrote the paper; D.C., E.S.R., O.V., and Y.L. worked on project management; and N.d.F., O.V., K.K., and P.K. advised the project. Competing interests: This work was done in the course of employment at DeepMind, with no other competing financial interests. DeepMind has filed the following patent application related to this work: US63/306,043. Data and materials availability: The datasets used in the experiments as well as evaluation code have been made available for download on GitHub at https://github.com/deepmind/code_contests. Visualizations of the AlphaCode model, selected samples, and explanations can be found at https://alphacode.deepmind.com. Both of the above, as well as results of manual marking, can also be found in Zenodo (33). All other data needed to evaluate the conclusions in the paper are present in the paper or the supplementary materials. License information: Copyright © 2022 the authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original US government works. https://www.science.org/about/science-licenses-journal-article-reuse

SUPPLEMENTARY MATERIALS

science.org/doi/10.1126/science.abq1158
Supplementary Text
Figs. S1 to S29
Tables S1 to S23