Software Testing with Large Language Models: Survey, Landscape, and Vision

Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, Qing Wang

● J. Wang, Y. Huang, Z. Liu, and Q. Wang are with the State Key Laboratory of Intelligent Game, Institute of Software Chinese Academy of Sciences, and University of Chinese Academy of Sciences, Beijing, China. J. Wang and Q. Wang are corresponding authors. E-mail: {junjie, yuchao2019, liuzhe2020, wq}@iscas.ac.cn
● C. Chen is with Monash University, Melbourne, Australia. E-mail: chunyang.chen@monash.edu
● S. Wang is with York University, Toronto, Canada. E-mail: wangsong@yorku.ca

Abstract—Pre-trained large language models (LLMs) have recently emerged as a breakthrough technology in natural language
processing and artificial intelligence, with the ability to handle large-scale datasets and exhibit remarkable performance across a wide
range of tasks. Meanwhile, software testing is a crucial undertaking that serves as a cornerstone for ensuring the quality and reliability
of software products. As the scope and complexity of software systems continue to grow, the need for more effective software testing
techniques becomes increasingly urgent, making it an area ripe for innovative approaches such as the use of LLMs. This paper provides
a comprehensive review of the utilization of LLMs in software testing. It analyzes 102 relevant studies that have used LLMs for software
testing, from both the software testing and LLMs perspectives. The paper presents a detailed discussion of the software testing tasks for
which LLMs are commonly used, among which test case preparation and program repair are the most representative. It also analyzes
the commonly used LLMs, the types of prompt engineering that are employed, and the techniques that are combined with these LLMs.
It further summarizes the key challenges and potential opportunities in this direction. This work can serve as a roadmap for future research
in this area, highlighting potential avenues for exploration, and identifying gaps in our current understanding of the use of LLMs in
software testing.

Index Terms—Pre-trained Large Language Model, Software Testing, LLM, GPT

1 INTRODUCTION

Software testing is a crucial undertaking that serves as a cornerstone for ensuring the quality and reliability of software products. Without the rigorous process of software testing, software enterprises would be reluctant to release their products into the market, knowing the potential consequences of delivering flawed software to end-users. By conducting thorough and meticulous testing procedures, software enterprises can minimize the occurrence of critical software failures, usability issues, or security breaches that could potentially lead to financial losses or jeopardize user trust. Additionally, software testing helps to reduce maintenance costs by identifying and resolving issues early in the development lifecycle, preventing more significant complications down the line [1], [2].

The significance of software testing has garnered substantial attention within the research and industrial communities. In the field of software engineering, it stands as an immensely popular and vibrant research area. One can observe the undeniable prominence of software testing by simply examining the landscape of conferences and symposiums focused on software engineering. Amongst these events, topics related to software testing consistently dominate the submission numbers and are frequently selected for publication.

While the field of software testing has gained significant popularity, there remain many challenges that have not been effectively addressed. For example, one such challenge is automated unit test case generation. Although various approaches, including search-based [3], [4], constraint-based [5], and random-based [6] techniques, have been proposed to generate suites of unit tests, the coverage and the meaningfulness of the generated tests are still far from satisfactory [7], [8]. Similarly, when it comes to mobile GUI testing, existing studies with random-/rule-based methods [9], [10], model-based methods [11], [12], and learning-based methods [13] are unable to understand the semantic information of the GUI page and often fall short in achieving comprehensive coverage [14], [15]. Considering these limitations, numerous research efforts are currently underway to explore innovative techniques that can enhance the efficacy of software testing tasks, among which large language models are the most promising ones.

Large language models (LLMs) such as T5 and GPT-3 have revolutionized the fields of natural language processing (NLP) and artificial intelligence (AI). These models, initially pre-trained on extensive corpora, have exhibited remarkable performance across a wide range of NLP tasks, including question answering, machine translation, and text generation [16]–[19]. In recent years, there has been significant advancement in LLMs with the emergence of models capable of handling even larger-scale datasets. This expansion in model size has not only led to improved performance but also opened up new possibilities for applying LLMs as Artificial General Intelligence. Among these advanced LLMs, models like ChatGPT1 and LLaMA2 boast billions of parameters. Such models hold tremendous potential for tackling complex practical tasks in domains like code generation and artistic creation. With their expanded capacity and enhanced capabilities, LLMs have become game-changers in NLP and AI, and are driving advancements in other fields like coding and software testing.

1. https://openai.com/blog/chatgpt
2. https://ai.meta.com/blog/large-language-model-llama-meta-ai/
LLMs have been used for various coding-related tasks
including code generation and code recommendation [20]–
[23]. On one hand, in software testing, there are many tasks
related to code generation, such as unit test generation [7],
where the utilization of LLMs is expected to yield good
performance. On the other hand, software testing possesses
unique characteristics that differentiate it from code gener-
ation. For example, code generation primarily focuses on
producing a single, correct code snippet, whereas software
testing often requires generating diverse test inputs to en-
sure better coverage of the software under test [1]. The ex-
istence of these differences introduces new challenges and
opportunities when employing LLMs for software testing.
Moreover, people have benefited from the excellent perfor-
mance of LLMs in generation and inference tasks, leading
to the emergence of dozens of new practices that use LLMs
for software testing.
This article presents a comprehensive review of the utilization of LLMs in software testing. We collect 102 relevant papers and conduct a thorough analysis from both the software testing and the LLMs perspectives, as roughly summarized in Figure 1.

Fig. 1: Structure of the contents in this paper (the numbers in brackets indicate the number of involved papers, and a paper might involve zero or multiple items)

From the viewpoint of software testing, our analysis involves an examination of the specific software testing tasks for which LLMs are employed. Results show that LLMs are commonly used for test case preparation (including unit test case generation, test oracle generation, and system test input generation), program debugging, and bug repair, while we do not find practices for applying LLMs in the tasks of the early testing life-cycle (such as test requirement, test plan, etc.). For each test task, we provide detailed illustrations showcasing the utilization of LLMs in addressing the task, highlighting commonly-used practices, tracking technology evolution trends, and summarizing achieved performance, so as to facilitate readers in gaining a thorough overview of how LLMs are employed across various testing tasks.

From the viewpoint of LLMs, our analysis includes the commonly used LLMs in these studies, the types of prompt engineering, the input of the LLMs, as well as the techniques that accompany these LLMs. Results show that about one-third of the studies utilize the LLMs through a pre-training or fine-tuning schema, while the others employ prompt engineering to communicate with LLMs to steer their behavior for desired outcomes. For prompt engineering, the zero-shot learning and few-shot learning strategies are most commonly used, while other advances like chain-of-thought prompting and self-consistency are rarely utilized. Results also show that traditional testing techniques like differential testing and mutation testing are usually combined with LLMs to help generate more diversified tests.

Furthermore, we summarize the key challenges and potential opportunities in this direction. Although software testing with LLMs has undergone significant growth in the past two years, there are still challenges in achieving high coverage of the testing, the test oracle problem, rigorous evaluations, and real-world application of LLMs in software testing. Since it is a newly emerging field, there are many research opportunities, including exploring LLMs in the early stages of testing, exploring LLMs for more types of software and non-functional testing, exploring advanced prompt engineering, as well as incorporating LLMs with traditional techniques.

This paper makes the following contributions:
● We thoroughly analyze 102 relevant studies that used LLMs for software testing, regarding publication trends, distribution of publication venues, etc.
● We conduct a comprehensive analysis from the perspective of software testing to understand the distribution of software testing tasks with LLMs and present a thorough discussion about how these tasks are solved with LLMs.
● We conduct a comprehensive analysis from the perspective of LLMs, and uncover the commonly-used LLMs, the types of prompt engineering, the input of the LLMs, as well as the techniques that accompany these LLMs.
● We highlight the challenges in existing studies and present potential opportunities for further studies.

We believe that this work will be valuable to both researchers and practitioners in the field of software engineering, as it provides a comprehensive overview of the current state and future vision of using LLMs for software testing. For researchers, this work can serve as a roadmap for future research in this area, highlighting potential avenues for exploration and identifying gaps in our current understanding of the use of LLMs in software testing. For practitioners, this work can provide insights into the potential benefits and limitations of using LLMs for software testing, as well as practical guidance on how to effectively integrate them into existing testing processes.
By providing a detailed landscape of the current state and future vision of using LLMs for software testing, this work can help accelerate the adoption of this technology in the software engineering community and ultimately contribute to improving the quality and reliability of software systems.

2 BACKGROUND

2.1 Large Language Model (LLM)
Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer-based models over large-scale corpora, showing strong capabilities in solving various natural language processing (NLP) tasks [16]–[19]. Studies have shown that model scaling can lead to improved model capacity, prompting researchers to investigate the scaling effect through further parameter size increases. Interestingly, when the parameter scale exceeds a certain threshold, these larger language models demonstrate not only significant performance improvements but also special abilities such as in-context learning, which are absent in smaller models such as BERT.

To distinguish language models at different parameter scales, the research community has coined the term large language models (LLMs) for the PLMs of significant size. LLMs typically refer to language models that have hundreds of billions (or more) of parameters and are trained on massive text data, such as GPT-3, PaLM, Codex, and LLaMA. LLMs are built on the Transformer architecture, which stacks multi-head attention layers in a very deep neural network. Existing LLMs adopt similar model architectures (Transformer) and pre-training objectives (language modeling) as small language models, but largely scale up the model size, pre-training data, and total compute power. This enables LLMs to better understand natural language and generate high-quality text based on a given context or prompt.

Note that, in the existing literature, there is no formal consensus on the minimum parameter scale for LLMs, since the model capacity is also related to data size and total compute. In a recent survey of LLMs [17], the authors focus on discussing the language models with a model size larger than 10B. Under their criteria, the first LLM is T5 released by Google in 2019, followed by GPT-3 released by OpenAI in 2020, and there are more than thirty LLMs released between 2021 and 2023, indicating the popularity of this direction. In another survey unifying LLMs and knowledge graphs [24], the authors categorize the LLMs into three types of network architecture: encoder-only (e.g., BERT), encoder-decoder (e.g., T5), and decoder-only (e.g., GPT-3). In our review, we take into account the categorization criteria of the two surveys and only consider pre-trained language models with the encoder-decoder and decoder-only network architectures, since they can both support generative tasks. We do not consider the encoder-only network architecture because such models cannot handle generative tasks, were proposed relatively early (e.g., BERT in 2018), and there are almost no models using this architecture after 2021. In other words, the LLMs discussed in this paper not only include models with over 10B parameters (as mentioned in [17]) but also include other models that use the encoder-decoder and decoder-only network architecture (as mentioned in [24]), such as BART with 140M parameters and GPT-2 with parameter sizes ranging from 117M to 1.5B. This also allows us to potentially include more studies and demonstrate the landscape of this topic.

2.2 Software Testing
Software testing is a crucial process in software development that involves evaluating the quality of a software product. The primary goal of software testing is to identify defects or errors in the software system that could potentially lead to incorrect or unexpected behavior. The whole life cycle of software testing typically includes the following tasks (demonstrated in Figure 4):
● Requirement Analysis: analyze the software requirements and identify the testing objectives, scope, and criteria.
● Test Plan: develop a test plan that outlines the testing strategy, test objectives, and schedule.
● Test Design and Review: develop and review the test cases and test suites that align with the test plan and the requirements of the software application.
● Test Case Preparation: the actual test cases are prepared based on the designs created in the previous stage.
● Test Execution: execute the tests that were designed in the previous stage. The software system is executed with the test cases and the results are recorded.
● Test Reporting: analyze the results of the tests and generate reports that summarize the testing process and identify any defects or issues that were discovered.
● Bug Fixing and Regression Testing: defects or issues identified during testing are reported to the development team for fixing. Once the defects are fixed, regression testing is performed to ensure that the changes have not introduced new defects or issues.
● Software Release: once the software system has passed all of the testing stages and the defects have been fixed, the software can be released to the customer or end user.
The testing process is iterative and may involve multiple cycles of the above stages, depending on the complexity of the software system and the testing requirements.
During the testing phase, various types of tests may be performed, including unit tests, integration tests, system tests, and acceptance tests.
● Unit Testing involves testing individual units or components of the software application to ensure that they function correctly.
● Integration Testing involves testing different modules or components of the software application together to ensure that they work correctly as a system.
● System Testing involves testing the entire software system as a whole, including all the integrated components and external dependencies.
● Acceptance Testing involves testing the software application to ensure that it meets the business requirements and is ready for deployment.
In addition, there can be functional testing, performance testing, unit testing, security testing, accessibility testing, etc., which explore various aspects of the software under test [25].
Fig. 2: Overview of the paper collection process (automatic search over major SE and AI venues: 14,623 papers; automatic filtering: 1,239 papers; manual search: 1,278 papers; inclusion and exclusion criteria: 109 papers; quality assessment: 102 papers; snowballing: 102 papers)

3 PAPER SELECTION AND REVIEW SCHEMA

3.1 Paper Collection Methodology
Figure 2 shows our paper search and selection process. To collect as much relevant literature as possible, we use both automatic search (from paper repository databases) and manual search (from major software engineering and artificial intelligence venues). We searched papers from Jan. 2019 to Jun. 2023 and further conducted a second round of search to include papers from Jul. 2023 to Oct. 2023.

3.1.1 Automatic Search
To ensure that we collect papers from diverse research areas, we conduct an extensive search using four popular scientific databases: the ACM Digital Library, the IEEE Xplore Digital Library, arXiv, and DBLP.
We search for papers whose title contains keywords related to software testing tasks and testing techniques (as shown below) in the first three databases. In the case of DBLP, we use additional keywords related to LLMs (as shown below) to filter out irrelevant studies, as relying solely on testing-related keywords would result in a large number of candidate studies. While using two sets of keywords for DBLP may result in overlooking certain related studies, we believe it is still a feasible strategy. This is because a substantial number of the studies present in this database can already be found in the first three databases, and the fourth database only serves as a supplementary source for collecting additional papers.
● Keywords related to software testing tasks and techniques: test OR bug OR issue OR defect OR fault OR error OR failure OR crash OR debug OR debugger OR repair OR fix OR assert OR verification OR validation OR fuzz OR fuzzer OR mutation.
● Keywords related to LLMs: LLM OR language model OR generative model OR large model OR GPT-3 OR ChatGPT OR GPT-4 OR LLaMA OR PaLM2 OR CodeT5 OR CodeX OR CodeGen OR Bard OR InstructGPT. Note that we only list the ten most popular LLMs (based on Google search), since these keywords are used for matching paper titles rather than paper content.
The above search strategy based on the paper title can recall a large number of papers, and we further conduct automatic filtering based on the paper content. Specifically, we retain the papers whose content contains "LLM", "language model", "generative model", "large model", or the name of an LLM (using the LLMs in [17], [24] except those in our exclusion criteria). This helps eliminate the papers that do not involve neural models.

3.1.2 Manual Search
To compensate for the potential omissions that may result from automated searches, we also conduct manual searches. In order to make sure we collect highly relevant papers, we conduct a manual search within the conference proceedings and journal articles from top-tier software engineering venues (listed in Table 2).
In addition, given the interdisciplinary nature of this work, we also include the conference proceedings of the artificial intelligence field. We select the top ten venues based on the h5-index from Google Scholar, and exclude three computer vision venues, i.e., CVPR, ICCV, and ECCV, as listed in Table 2.

3.1.3 Inclusion and Exclusion Criteria
The search conducted on the databases and venues is, by design, very inclusive. This allows us to collect as many papers as possible in our pool. However, this generous inclusivity results in having papers that are not directly related to the scope of this survey. Accordingly, we define a set of specific inclusion and exclusion criteria and then apply them to each paper in the pool, removing papers not meeting the criteria. This ensures that each collected paper aligns with our scope and research questions.
Inclusion Criteria. We define the following criteria for including papers:
● The paper proposes or improves an approach, study, or tool/framework that targets testing specific software or systems with LLMs.
● The paper applies LLMs to software testing practice, including all tasks within the software testing lifecycle as demonstrated in Section 2.2.
● The paper presents an empirical or experimental study about utilizing LLMs in software testing practice.
● The paper involves specific testing techniques (e.g., fuzz testing) employing LLMs.
If a paper satisfies any of the above criteria, we include it.
Exclusion Criteria. The following studies would be excluded during study selection:
● The paper does not involve software testing tasks, e.g., code comment generation.
● The paper does not utilize LLMs, e.g., it uses recurrent neural networks.
● The paper mentions LLMs only in future work or discussions rather than using LLMs in the approach.
● The paper utilizes language models with an encoder-only architecture, e.g., BERT, which cannot directly be utilized for generation tasks (as demonstrated in Section 2.1).
● The paper focuses on testing the performance of LLMs, such as fairness, stability, security, etc. [125]–[127].
● The paper focuses on evaluating the performance of LLM-enabled tools, e.g., evaluating the code quality of the code generation tool Copilot [128]–[130].
For the papers collected through automatic search and manual search, we conduct a manual inspection to check whether they satisfy our inclusion criteria and filter out those matching our exclusion criteria. Specifically, the first two authors read each paper to carefully determine whether it should be included based on the inclusion and exclusion criteria, and any paper with conflicting decisions is handed over to the third author to make the final decision.

TABLE 1: Details of the collected papers

ID Topic Paper title Year Reference


1 Unit test case generation Unit Test Case Generation with Transformers and Focal Context 2020 [26]
2 Unit test case generation Codet: Code Generation with Generated Tests 2022 [27]
3 Unit test case generation Interactive Code Generation via Test-Driven User-Intent Formalization 2022 [28]
4 Unit test case generation A3Test: Assertion-Augmented Automated Test Case Generation 2023 [29]
5 Unit test case generation An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation 2023 [30]
6 Unit test case generation An Initial Investigation of ChatGPT Unit Test Generation Capability 2023 [31]
7 Unit test case generation Automated Test Case Generation Using Code Models and Domain Adaptation 2023 [32]
8 Unit test case generation Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models 2023 [33]
9 Unit test case generation Can Large Language Models Write Good Property-Based Tests? 2023 [34]
10 Unit test case generation CAT-LM Training Language Models on Aligned Code And Tests 2023 [35]
11 Unit test case generation ChatGPT vs SBST: A Comparative Assessment of Unit Test Suite Generation 2023 [8]
12 Unit test case generation ChatUniTest: a ChatGPT-based Automated Unit Test Generation Tool 2023 [36]
13 Unit test case generation CODAMOSA: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models 2023 [37]
14 Unit test case generation Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing 2023 [38]
15 Unit test case generation Exploring the Effectiveness of Large Language Models in Generating Unit Tests 2023 [39]
16 Unit test case generation How Well does LLM Generate Security Tests? 2023 [40]
17 Unit test case generation No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation 2023 [7]
18 Unit test case generation Prompting Code Interpreter to Write Better Unit Tests on Quixbugs Functions 2023 [41]
19 Unit test case generation Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation 2023 [42]
20 Unit test case generation Unit Test Generation using Generative AI: A Comparative Performance Analysis of Autogeneration Tools 2023 [43]
21 Test oracle generation Generating Accurate Assert Statements for Unit Test Cases Using Pretrained Transformers 2022 [44]
22 Test oracle generation Learning Deep Semantics for Test Completion 2023 [45]
23 Test oracle generation; Program repair Using Transfer Learning for Code-Related Tasks 2023 [46]
24 Test oracle generation; Program repair Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning 2023 [47]
25 System test input generation Automated Conformance Testing for JavaScript Engines via Deep Compiler Fuzzing 2021 [48]
26 System test input generation Fill in the Blank: Context-aware Automated Text Input Generation for Mobile GUI Testing 2022 [49]
27 System test input generation Large Language Models are Pretty Good Zero-Shot Video Game Bug Detectors 2022 [50]
28 System test input generation Slgpt: Using Transfer Learning to Directly Generate Simulink Model Files and Find Bugs in the Simulink Toolchain 2021 [51]
29 System test input generation Augmenting Greybox Fuzzing with Generative AI 2023 [52]
30 System test input generation Automated Test Case Generation Using T5 and GPT-3 2023 [53]
31 System test input generation Automating GUI-based Software Testing with GPT-3 2023 [54]
32 System test input generation AXNav: Replaying Accessibility Tests from Natural Language 2023 [55]
33 System test input generation Can ChatGPT Advance Software Testing Intelligence? An Experience Report on Metamorphic Testing 2023 [56]
34 System test input generation Efficient Mutation Testing via Pre-Trained Language Models 2023 [57]
35 System test input generation Large Language Models are Edge-Case Generators:Crafting Unusual Programs for Fuzzing Deep Learning Libraries 2023 [58]
36 System test input generation Large Language Models are Zero Shot Fuzzers: Fuzzing Deep Learning Libraries via Large Language Models 2023 [59]
37 System test input generation Large Language Models for Fuzzing Parsers (Registered Report) 2023 [60]
38 System test input generation LLM for Test Script Generation and Migration: Challenges, Capabilities, and Opportunities 2023 [61]
39 System test input generation Make LLM a Testing Expert: Bringing Human-like Interaction to Mobile GUI Testing via Functionality-aware Decisions 2023 [14]
40 System test input generation PentestGPT: An LLM-empowered Automatic Penetration Testing Tool 2023 [62]
41 System test input generation SMT Solver Validation Empowered by Large Pre-Trained Language Models 2023 [63]
42 System test input generation TARGET: Automated Scenario Generation from Traffic Rules for Testing Autonomous Vehicles 2023 [64]
43 System test input generation Testing the Limits: Unusual Text Inputs Generation for Mobile App Crash Detection with Large Language Model 2023 [65]
44 System test input generation Understanding Large Language Model Based Fuzz Driver Generation 2023 [66]
45 System test input generation Universal Fuzzing via Large Language Models 2023 [67]
46 System test input generation Variable Discovery with Large Language Models for Metamorphic Testing of Scientific Software 2023 [68]
47 System test input generation White-box Compiler Fuzzing Empowered by Large Language Models 2023 [69]
48 Bug analysis Itiger: an Automatic Issue Title Generation Tool 2022 [70]
49 Bug analysis CrashTranslator: Automatically Reproducing Mobile Application Crashes Directly from Stack Trace 2023 [71]
50 Bug analysis Cupid: Leveraging ChatGPT for More Accurate Duplicate Bug Report Detection 2023 [72]
51 Bug analysis Employing Deep Learning and Structured Information Retrieval to Answer Clarification Questions on Bug Reports 2023 [73]
52 Bug analysis Explaining Software Bugs Leveraging Code Structures in Neural Machine Translation 2022 [74]
53 Bug analysis Prompting Is All Your Need: Automated Android Bug Replay with Large Language Models 2023 [75]
54 Bug analysis Still Confusing for Bug-Component Triaging? Deep Feature Learning and Ensemble Setting to Rescue 2023 [76]
55 Debug Detect-Localize-Repair: A Unified Framework for Learning to Debug with CodeT5 2022 [77]
56 Debug Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction 2022 [78]
57 Debug A Preliminary Evaluation of LLM-Based Fault Localization 2023 [79]
58 Debug Addressing Compiler Errors: Stack Overflow or Large Language Models? 2023 [80]
59 Debug Can LLMs Demystify Bug Reports? 2023 [81]
60 Debug Dcc –help: Generating Context-Aware Compiler Error Explanations with Large Language Models 2023 [82]
61 Debug Explainable Automated Debugging via Large Language Model-driven Scientific Debugging 2023 [83]
62 Debug Large Language Models for Test-Free Fault Localization 2023 [84]
63 Debug Large Language Models in Fault Localisation 2023 [85]
64 Debug LLM4CBI: Taming LLMs to Generate Effective Test Programs for Compiler Bug Isolation 2023 [86]
65 Debug Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting 2023 [87]
66 Debug Teaching Large Language Models to Self-Debug 2023 [88]
67 Debug; Program repair A study on Prompt Design, Advantages and Limitations of ChatGPT for Deep Learning Program Repair 2023 [89]
68 Program repair Examining Zero-Shot Vulnerability Repair with Large Language Models 2022 [90]
69 Program repair Automated Repair of Programs from Large Language Models 2022 [91]
70 Program repair Fix Bugs with Transformer through a Neural-Symbolic Edit Grammar 2022 [92]
71 Program repair Practical Program Repair in the Era of Large Pre-trained Language Models 2022 [93]
72 Program repair Repairing Bugs in Python Assignments Using Large Language Models 2022 [94]
73 Program repair Towards JavaScript Program Repair with Generative Pre-trained Transformer (GPT-2) 2022 [95]
74 Program repair An Analysis of the Automatic Bug Fixing Performance of ChatGPT 2023 [96]
75 Program repair An Empirical Study on Fine-Tuning Large Language Models of Code for Automated Program Repair 2023 [97]
76 Program repair An Evaluation of the Effectiveness of OpenAI’s ChatGPT for Automated Python Program Bug Fixing using QuixBugs 2023 [98]
77 Program repair An Extensive Study on Model Architecture and Program Representation in the Domain of Learning-based Automated Program Repair 2023 [99]
78 Program repair Can OpenAI’s Codex Fix Bugs? An Evaluation on QuixBugs 2022 [100]
79 Program repair CIRCLE: Continual Repair Across Programming Languages 2022 [101]
80 Program repair Coffee: Boost Your Code LLMs by Fixing Bugs with Feedback 2023 [102]
81 Program repair Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair 2023 [103]
82 Program repair Domain Knowledge Matters: Improving Prompts with Fix Templates for Repairing Python Type Errors 2023 [104]
83 Program repair Enhancing Genetic Improvement Mutations Using Large Language Models 2023 [105]
84 Program repair FixEval: Execution-based Evaluation of Program Fixes for Programming Problems 2023 [106]
85 Program repair Fixing Hardware Security Bugs with Large Language Models 2023 [107]
86 Program repair Fixing Rust Compilation Errors using LLMs 2023 [108]
87 Program repair Framing Program Repair as Code Completion 2022 [109]
88 Program repair Frustrated with Code Quality Issues? LLMs can Help! 2023 [110]
89 Program repair GPT-3-Powered Type Error Debugging: Investigating the Use of Large Language Models for Code Repair 2023 [111]
90 Program repair How Effective Are Neural Networks for Fixing Security Vulnerabilities 2023 [112]
91 Program repair Impact of Code Language Models on Automated Program Repair 2023 [113]
92 Program repair Inferfix: End-to-end Program Repair with LLMs 2023 [114]
93 Program repair Keep the Conversation Going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT 2023 [115]
94 Program repair Neural Program Repair with Program Dependence Analysis and Effective Filter Mechanism 2023 [116]
95 Program repair Out of Context: How important is Local Context in Neural Program Repair? 2023 [117]
96 Program repair Pre-trained Model-based Automated Software Vulnerability Repair: How Far are We? 2023 [118]
97 Program repair RAPGen: An Approach for Fixing Code Inefficiencies in Zero-Shot 2023 [119]
98 Program repair RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair 2023 [120]
99 Program repair STEAM: Simulating the InTeractive BEhavior of ProgrAMmers for Automatic Bug Fixing 2023 [121]
100 Program repair Towards Generating Functionally Correct Code Edits from Natural Language Issue Descriptions 2023 [122]
101 Program repair VulRepair: a T5-based Automated Software Vulnerability Repair 2022 [123]
102 Program repair What Makes Good In-Context Demonstrations for Code Intelligence Tasks with LLMs? 2023 [124]

TABLE 2: Conference proceedings and journals considered for manual search

SE Conference:
ICSE | International Conference on Software Engineering
ESEC/FSE | Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
ASE | International Conference on Automated Software Engineering
ISSTA | International Symposium on Software Testing and Analysis
ICST | International Conference on Software Testing, Verification and Validation
ESEM | International Symposium on Empirical Software Engineering and Measurement
MSR | International Conference on Mining Software Repositories
QRS | International Conference on Software Quality, Reliability and Security
ICSME | International Conference on Software Maintenance and Evolution
ISSRE | International Symposium on Software Reliability Engineering

SE Journal:
TSE | Transactions on Software Engineering
TOSEM | Transactions on Software Engineering and Methodology
EMSE | Empirical Software Engineering
ASE | Automated Software Engineering
JSS | Journal of Systems and Software
JSEP | Journal of Software: Evolution and Process
STVR | Software Testing, Verification and Reliability
IEEE SOFTW. | IEEE Software
IET SOFTW. | IET Software
IST | Information and Software Technology
SQJ | Software Quality Journal

AI Venues:
ICLR | International Conference on Learning Representations
NeurIPS | Conference on Neural Information Processing Systems
ICML | International Conference on Machine Learning
AAAI | AAAI Conference on Artificial Intelligence
EMNLP | Conference on Empirical Methods in Natural Language Processing
ACL | Annual Meeting of the Association for Computational Linguistics
IJCAI | International Joint Conference on Artificial Intelligence

Fig. 3: Trend in the number of papers with year (x-axis: publication year; y-axis: number of publications)

3.1.4 Quality Assessment
In addition, we establish quality assessment criteria to exclude low-quality studies, as shown below. For each question, the study's quality is rated as "yes", "partial", or "no", which are assigned values of 1, 0.5, and 0, respectively. Papers with a score of less than eight are excluded from our study.
● Is there a clearly stated research goal related to software testing?
● Is there a defined and repeatable technique?
● Is there any explicit contribution to software testing?
● Is there an explicit description of which LLMs are utilized?
● Is there an explicit explanation about how the LLMs are utilized?
● Is there a clear methodology for validating the technique?
● Are the subject projects selected for validation suitable for the research goals?
● Are there control techniques or baselines to demonstrate the effectiveness of the proposed technique?
● Are the evaluation metrics relevant to the research objectives (e.g., do they evaluate the effectiveness of the proposed technique)?
● Do the results presented in the study align with the research objectives, and are they presented in a clear and relevant manner?

3.1.5 Snowballing
At the end of searching database repositories and conference proceedings and journals, and applying the inclusion/exclusion criteria and quality assessment, we obtain the initial set of papers. Next, to mitigate the risk of omitting relevant literature from this survey, we also perform backward snowballing [131] by inspecting the references cited by the collected papers so far. Note that this procedure did not introduce new studies, which might be because the surveyed topic is quite new and the referenced studies tend to have been published earlier, and because we already include a relatively comprehensive automatic and manual search.

3.2 Collection Results
As shown in Figure 2, the collection process started with a total of 14,623 papers retrieved from four academic databases employing keyword searching. Then, after automated filtering, manual search, applying the inclusion/exclusion criteria, and quality assessment, we finally collected a total of 102 papers involving software testing with LLMs. Table 1 shows the details of the collected papers. Besides, we also use Table 5 (at the end of the paper) to provide a more comprehensive overview of these papers regarding the specific characteristics which will be illustrated in Section 4 and Section 5.
Note that there are two studies which are respectively extensions of previously published papers by the same authors ([46] and [132], [68] and [133]), and we only keep the extended versions to avoid duplicates.

3.3 General Overview of Collected Papers
Among the papers, 47% are published in software engineering venues, among which 19 papers are from ICSE, 5 papers are from FSE, 5 papers are from ASE, and 3 papers are from ISSTA. 2% of the papers are published in artificial intelligence venues such as EMNLP and ICLR, and 5% of the papers are published in program analysis or security venues like PLDI and S&P. Besides, 46% of the papers have not yet been published via peer-reviewed venues, i.e., they are disclosed on arXiv. This is understandable because this field is emerging and many works have just been completed and are in the process of submission. Although these papers did not undergo peer review, we have a quality assessment process that eliminates papers of low quality, which potentially ensures the quality of this survey.
Figure 3 demonstrates the trend of our collected papers per year. We can see that as the years go by, the number of papers in this field is growing almost exponentially. In 2020 and 2021, there were only 1 and 2 papers, respectively. In 2022, there were 19 papers, and in 2023, there have been 82
papers. It is conceivable that there will be even more papers in the future, which indicates the popularity and attention that this field is receiving.

4 ANALYSIS FROM SOFTWARE TESTING PERSPECTIVE

This section presents our analysis from the viewpoint of software testing and organizes the collected studies in terms of testing tasks. Figure 4 lists the distribution of each involved testing task, aligned with the software testing life cycle. We first provide a general overview of the distribution, followed by further analysis for each task. Note that, for each following subsection, the cumulative total of the subcategories may not always match the total number of papers, since a paper might belong to more than one subcategory.

Fig. 4: Distribution of testing tasks with LLMs (aligned with the software testing life cycle [134]–[136]; the number in brackets indicates the number of collected studies per task, and one paper might involve multiple tasks)

We can see that LLMs have been effectively used in both the mid and late stages of the software testing lifecycle. In the test case preparation phase, LLMs have been utilized for tasks such as unit test case generation, test oracle generation, and system test input generation. These tasks are crucial in the mid-phase of software testing to help catch issues and prevent further development until the issues are resolved. Furthermore, in later phases such as the test report/bug report and bug fix phases, LLMs have been employed for tasks such as bug analysis, debugging, and repair. These tasks are critical towards the end of the testing phase, when software bugs need to be resolved to prepare for the product's release.

4.1 Unit Test Case Generation
Unit test case generation involves writing unit test cases to check individual units/components of the software independently and ensure that they work correctly. For a method under test (often called the focal method), its corresponding unit test consists of a test prefix and a test oracle. In particular, the test prefix is typically a series of method invocation statements or assignment statements, which aim at driving the focal method to a testable state; the test oracle then serves as the specification to check whether the current behavior of the focal method satisfies the expected one, e.g., via a test assertion.
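To make these terms concrete, the following minimal sketch (in Python, with an invented BankAccount class and a pytest-style test) illustrates the three ingredients: the focal method, the test prefix that drives it to a testable state, and the test oracle expressed as an assertion.

```python
# Hypothetical focal method: the method under test.
class BankAccount:
    def __init__(self, balance=0):
        self.balance = balance

    def withdraw(self, amount):
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount
        return self.balance


# Unit test: the first two statements form the test prefix that drives the
# focal method to a testable state; the assertion serves as the test oracle.
def test_withdraw_reduces_balance():
    account = BankAccount(balance=100)  # test prefix: set up the object state
    remaining = account.withdraw(30)    # test prefix: invoke the focal method
    assert remaining == 70              # test oracle: check the expected behavior
```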
To alleviate the manual effort in writing unit tests, researchers have proposed various techniques to facilitate automated unit test generation. Traditional unit test generation techniques leverage search-based [3], [4], constraint-based [5], or random-based strategies [6] to generate a suite of unit tests with the main goal of maximizing the coverage of the software under test. Nevertheless, the coverage and the meaningfulness of the generated tests are still far from satisfactory.

Since LLMs have demonstrated promising results in tasks such as code generation, and given that both code generation and unit test case generation involve generating source code, recent research has extended the domain of code generation to encompass unit test case generation. Despite initial success, there are nuances that set unit test case generation apart from general code generation, signaling the need for more tailored approaches.

Pre-training or fine-tuning LLMs for unit test case generation. Due to the limitations of LLMs in their earlier stages, a majority of the earlier published studies adopt the pre-training or fine-tuning schema. Moreover, in some recent studies, this schema continues to be employed to increase the LLMs' familiarity with domain knowledge. Alagarsamy et al. [29] first pre-trained the LLM with focal methods and assert statements to give the LLM a stronger foundational knowledge of assertions, then fine-tuned the LLM for the test case generation task, where the objective is to learn the relationship between the focal method and the corresponding test case. Tufano et al. [26] utilized a similar schema by pre-training the LLM on a large unsupervised Java corpus, and then fine-tuning it in a supervised manner on a downstream translation task for generating unit tests. Hashtroudi et al. [32] leveraged the existing developer-written tests of each project to generate a project-specific dataset for domain adaptation when fine-tuning the LLM, which can facilitate generating human-readable unit tests. Rao et al. [35] trained a GPT-style language model by utilizing a pre-training signal that explicitly considers the mapping between code and test files. Steenhoek et al. [42] utilize reinforcement learning to optimize models by providing rewards based on static quality metrics that can be automatically computed for the generated unit test cases.

Designing effective prompts for unit test case generation. The advancement of LLMs has allowed them to excel at targeted tasks without pre-training or fine-tuning. Therefore, most later studies typically focus on how to design the prompt to make the LLM better at understanding the context and nuances of this task. Xie et al. [36] generated unit test cases by parsing the project, extracting essential information, creating an adaptive focal context that includes a focal method and its dependencies within the pre-defined maximum prompt token limit of the LLM, and incorporating this context into a prompt to query the LLM.
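As an illustration of this style of prompt engineering, the sketch below assembles a unit-test-generation prompt from a focal method and its dependencies while staying within a token budget. The helper names and the token-estimation heuristic are our own simplifications for illustration, not the exact design of any surveyed tool.

```python
# Illustrative sketch: building a unit-test-generation prompt from a focal
# method and its dependencies under a token budget. The 4-characters-per-token
# estimate and all names here are assumptions made for this example.
MAX_PROMPT_TOKENS = 3000

def rough_token_count(text):
    return len(text) // 4  # crude estimate, sufficient for budgeting

def build_test_generation_prompt(focal_method, dependencies):
    header = ("You are a unit test generator. Write a pytest test for the focal "
              "method below. Cover normal and edge-case inputs and include assertions.\n")
    budget = MAX_PROMPT_TOKENS - rough_token_count(header + focal_method)
    context_parts = []
    for dep in dependencies:  # add dependency code until the budget runs out
        cost = rough_token_count(dep)
        if cost > budget:
            break
        context_parts.append(dep)
        budget -= cost
    return (header
            + "\n# Focal method:\n" + focal_method
            + "\n\n# Related context:\n" + "\n".join(context_parts)
            + "\n\n# Unit test:\n")
```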
Dakhel et al. [38] introduced MuTAP to improve the effectiveness of test cases generated by LLMs in terms of revealing bugs by leveraging mutation testing. They augment prompts with surviving mutants, as those mutants highlight the limitations of the test cases in detecting bugs. Zhang et al. [40] generated security tests for vulnerable dependencies with LLMs.

Yuan et al. [7] first performed an empirical study to evaluate ChatGPT's capability of unit test generation with both a quantitative analysis and a user study in terms of correctness, sufficiency, readability, and usability. Results show that the generated tests still suffer from correctness issues, including diverse compilation errors and execution failures. They further propose an approach that leverages ChatGPT itself to improve the quality of its generated tests with an initial test generator and an iterative test refiner. Specifically, the iterative test refiner iteratively fixes the compilation errors in the tests generated by the initial test generator, following a validate-and-fix paradigm that prompts the LLM based on the compilation error messages and additional code context.
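The validate-and-fix paradigm can be summarized as a simple generate–compile–repair loop. The sketch below is a schematic rendering under our own assumptions (a hypothetical llm() completion function and a compile_and_run() harness), not the exact pipeline of the surveyed tools.

```python
# Schematic validate-and-fix loop for LLM-generated unit tests. Both `llm` and
# `compile_and_run` are hypothetical stand-ins for an LLM API call and a
# build/execution harness.
def generate_tests_with_refinement(focal_method, llm, compile_and_run, max_rounds=3):
    test_code = llm("Write a unit test for the following method:\n" + focal_method)
    for _ in range(max_rounds):
        ok, error_message = compile_and_run(test_code)
        if ok:
            break  # the test compiles and executes, stop refining
        # Feed the error message and code context back to the LLM and ask for a fix.
        test_code = llm(
            "The following test fails to compile or run.\n"
            "Error:\n" + error_message + "\n"
            "Focal method:\n" + focal_method + "\n"
            "Test:\n" + test_code + "\n"
            "Return a corrected version of the test."
        )
    return test_code
```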
Guilherme et al. [31] and Li et al. [41] respectively evaluated the quality of the unit tests generated by LLMs using different metrics and different prompts.

Test generation with additional documentation. Vikram et al. [34] went a step further by investigating the potential of using LLMs to generate property-based tests when provided with API documentation. They believe that the documentation of an API method can assist the LLM in producing logic to generate random inputs for that method and in deriving meaningful properties of the result to check. Instead of generating unit tests from the source code, Plein et al. [33] generated tests based on user-written bug reports.

LLM and search-based methods for unit test generation. The aforementioned studies utilize LLMs for the whole unit test case generation task, while Lemieux et al. [37] focus on a different direction: first letting a traditional search-based software testing technique (e.g., Pynguin [137]) generate unit test cases until its coverage improvements stall, and then asking the LLM to provide example test cases for under-covered functions. These examples can help the original test generation redirect its search to more useful areas of the search space.

Tang et al. [8] conduct a systematic comparison of test suites generated by the LLM and by the state-of-the-art search-based software testing tool EvoSuite, considering correctness, readability, code coverage, and bug detection capability. Similarly, Bhatia [43] experimentally investigates the quality of unit tests generated by the LLM compared to the commonly-used test generator Pynguin.

Performance of unit test case generation. Since the aforementioned studies of unit test case generation are based on different datasets, one can hardly derive a fair comparison, and we present the details in Table 3 to let readers obtain a general view. We can see that on the SF110 benchmark, all three evaluated LLMs have quite low performance, i.e., 2% coverage [39]. SF110 is an EvoSuite (a search-based unit test case generation technique) benchmark consisting of 111 open-source Java projects retrieved from SourceForge, containing 23,886 classes, over 800,000 bytecode-level branches, and 6.6 million lines of code. The authors did not present detailed reasons for the low performance, which can be further explored in the future.

TABLE 3: Performance of unit test case generation

Dataset | Correctness | Coverage | LLM | Paper
5 Java projects from Defects4J | 16.21% | 5%-13% (line coverage) | BART | [26]
10 Java projects | 40% | 89% (line coverage), 90% (branch coverage) | ChatGPT | [36]
CodeSearchNet | 41% | N/A | ChatGPT | [7]
HumanEval | 78% | 87% (line coverage), 92% (branch coverage) | Codex | [39]
SF110 | 2% | 2% (line coverage), 1% (branch coverage) | Codex | [39]
Note: [39] experiments with Codex, CodeGen, and ChatGPT, and the best performance was achieved by Codex.

4.2 Test Oracle Generation
A test oracle is a source of information about whether the output of a software system (or program, function, or method) is correct or not [138]. Most of the collected studies in this category target test assertion generation, where the assertion is inside a unit test case. Nevertheless, we opted to treat these studies as a separate section to facilitate a more thorough analysis.
The test assertion, which indicates the potential issues in the tested code, is an important aspect that distinguishes unit test cases from regular code. This is why some studies specifically focus on the generation of effective test assertions. Before using LLMs, researchers had proposed RNN-based approaches that aim at learning from thousands of unit test methods to generate meaningful assert statements [139], yet only 17% of the generated asserts exactly match the ground-truth asserts. Subsequently, to improve the performance, several researchers utilized LLMs for this task.
Mastropaolo et al. [46], [132] pre-trained a T5 model on a dataset composed of natural language English text and source code. They then fine-tuned the model by reusing datasets from four previous works that used deep learning techniques (such as the RNN mentioned before), covering test assertion generation, program repair, and other tasks. Results showed that the exact match rate of the generated test assertions is 57%. Tufano et al. [44] proposed a similar approach which separately pre-trained the LLM on an English corpus and a code corpus, and then fine-tuned it on an asserts dataset (with test methods, focal methods, and asserts). This further improved the performance to a 62% exact match rate. Besides the syntax-level data used in previous studies, Nie et al. [45] fine-tuned the LLMs with six kinds of code semantics data, including the execution result (e.g., types of the local variables) and execution context (e.g., the last called method in the test method), which enabled the LLMs to learn to understand code execution information. The exact match rate is 17% (note that this paper is based on a different dataset from all other studies mentioned under this topic).
The aforementioned studies utilized the pre-training and fine-tuning schema when using LLMs. With the increasingly powerful capabilities of LLMs, they can perform well on specific tasks without these specialized pre-training or fine-tuning datasets. Subsequently, Nashid et al. [47] utilized prompt engineering for this task and proposed a technique for prompt creation that automatically retrieves code demonstrations similar to the task at hand, based on embedding or frequency analysis. They also present evaluations of few-shot learning with various numbers (e.g., zero-shot, one-shot, or n-shot) and forms (e.g., random vs. systematic, or with vs. without natural language descriptions) of prompts, to investigate its feasibility for test assertion generation. With only a few relevant code demonstrations, this approach can achieve an accuracy of 76% for exact matches in test assertion generation, which is the state-of-the-art performance for this task.
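To give a flavor of this few-shot setup, the sketch below formats one retrieved demonstration (a focal method with a completed test) followed by the query test whose assertion is to be filled in. The example methods, the <ASSERTION> placeholder, and the helper are invented for illustration and are not taken from the surveyed dataset.

```python
# Sketch of a one-shot prompt for test assertion generation: one retrieved
# demonstration followed by a query whose assertion the LLM should complete.
# The Java snippets below are invented examples.
DEMONSTRATION = """\
// Focal method:
public int add(int a, int b) { return a + b; }
// Test method:
@Test public void testAdd() {
    int result = calculator.add(2, 3);
    assertEquals(5, result);
}"""

QUERY = """\
// Focal method:
public boolean isEmpty() { return size == 0; }
// Test method:
@Test public void testIsEmpty() {
    Stack stack = new Stack();
    <ASSERTION>
}"""

def build_assertion_prompt(demonstration, query):
    return ("Complete the <ASSERTION> placeholder with an appropriate assert statement.\n\n"
            "Example:\n" + demonstration + "\n\nNow complete:\n" + query)

# The LLM would be expected to return something like: assertTrue(stack.isEmpty());
print(build_assertion_prompt(DEMONSTRATION, QUERY))
```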
4.3 System Test Input Generation
This category encompasses the studies related to creating test inputs for system testing, enabling the automation of test execution. We employ three subsections to present the analysis from three orthogonal viewpoints, and each of the collected studies may be analyzed in one or more of these subsections.
The first subsection is input generation in terms of software types. The generation of system-level test inputs varies with the specific type of software being tested. For example, for mobile applications, test input generation requires providing a diverse range of text inputs or operation combinations (e.g., click a button, long press a list) [14], [49], which is the key to testing the application's functionality and user interface; while for Deep Learning (DL) libraries, the test input is a program which covers diversified DL APIs [58], [59]. This subsection demonstrates how the LLMs are utilized to generate inputs for different types of software.
The second subsection is input generation in terms of testing techniques. We have observed that certain approaches serve as specific types of testing techniques. For example, dozens of our collected studies specifically focus on using LLMs for fuzz testing. Therefore, this subsection provides an analysis of the collected studies in terms of testing techniques, showcasing how the LLMs are employed to enhance traditional testing techniques.
The third subsection is input generation in terms of input and output. While most of the collected studies take the source code or the software itself as the input and directly output the software's test input, there are studies that utilize alternative forms of input and output. This subsection provides an analysis of such studies, highlighting different approaches and their input-output characteristics.

4.3.1 Input Generation in Terms of Software Types

Fig. 5: Distribution of software under test (mobile app, deep learning library, compiler, SMT solver, autonomous driving system, cyber-physical system, Go toolchain, JavaScript engine, quantum computing platform, video game)

Figure 5 demonstrates the types of software under test in our collected studies. It is evident that the most prominent category is mobile apps, with five studies utilizing LLMs for testing, possibly due to their prevalence and importance in today's business and daily life. Additionally, there are respectively two studies focusing on testing deep learning libraries, compilers, and SMT solvers. Moreover, LLM-based testing techniques have also been applied to domains such as cyber-physical systems, quantum computing platforms, and more. This widespread adoption of LLMs demonstrates their effectiveness in handling diverse test inputs and enhancing testing activities across various software domains. A detailed analysis is provided below.
Test input generation for mobile apps. For mobile app testing, one difficulty is to generate appropriate text inputs to proceed to the next page, which remains a prominent obstacle for testing coverage. Considering the diversity and semantic requirements of valid inputs (e.g., flight departure, movie name), traditional heuristic-based or constraint-based techniques [10], [140] are far from generating meaningful text input. Liu et al. [49] employ the LLM to intelligently generate the semantic input text according to the GUI context. In detail, their proposed QTypist automatically extracts the component information related to the EditText widget for generating the prompts, and then feeds the prompts into the LLM to generate the input text.
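As a schematic example of this idea, the sketch below turns the GUI context of a text field (app name, page title, field hint, and nearby labels) into a prompt that asks the LLM for a semantically valid input string. The field names and prompt wording are illustrative assumptions rather than the exact prompt template of QTypist.

```python
# Sketch: generating semantic text input for a mobile GUI text field from its
# surrounding context. All field names and the prompt wording are illustrative.
def build_text_input_prompt(app_name, page_title, field_hint, nearby_labels):
    labels = ", ".join(nearby_labels)
    return (
        f'In the app "{app_name}" on the page "{page_title}", there is an input '
        f'field with the hint "{field_hint}" near the labels {labels}. '
        "Reply with one realistic value a user would type into this field."
    )

prompt = build_text_input_prompt(
    app_name="FlightBooker",        # hypothetical app under test
    page_title="Search flights",
    field_hint="Departure city",
    nearby_labels=["From", "To", "Date"],
)
# The prompt is sent to the LLM, and the returned string (e.g., "New York")
# is typed into the EditText so that testing can proceed to the next page.
```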
Besides the text input, there are other forms of input for mobile apps, i.e., operations like 'click a button' and 'select a list'. To fully test an app, it is required to cover more GUI pages and conduct more meaningful exploration traces through the GUI operations, yet existing studies with random-/rule-based methods [9], [10], model-based methods [11], [12], and learning-based methods [13] are unable to understand the semantic information of the GUI page and thus cannot conduct the trace planning effectively. Liu et al. [14] formulate the test input generation problem of mobile GUI testing as a Q&A task, which asks the LLM to chat with the mobile app by passing the GUI page information to the LLM to elicit testing scripts (i.e., GUI operations), executing them, and continually passing the app feedback back to the LLM, iterating the whole process. The proposed GPTDroid extracts the static context of the GUI page and the dynamic context of the iterative testing process, and designs prompts for inputting this information to the LLM, which enables the LLM to better understand the GUI page as well as the whole testing process. It also introduces a functionality-aware memory prompting mechanism that equips the LLM with the ability to retain testing knowledge of the whole process and conduct long-term, functionality-based reasoning to guide exploration. Similarly, Zimmermann et al. utilize the LLM to interpret natural language test cases and programmatically navigate through the application under test [54].
Yu et al. [61] investigate the LLM’s capabilities in the summarized requirements. Ye et al. [48] utilize the LLM
mobile app test script generation and migration task, in- for generating the JavaScript programs and then use the
cluding the scenario-based test generation, and the cross- well-structured ECMAScript specifications to automatically
platform/app test migration. generate test data along with the test programs, after that
Test input generation for DL libraries. The input for they apply differential testing to expose bugs.
testing DL libraries is DL programs, and the difficulty
in generating the diversified input DL programs is that 4.3.2 Input Generation in Terms of Testing Techniques
they need to satisfy both the input language (e.g., Python) By utilizing system test inputs generated by LLMs, the col-
syntax/semantics and the API input/shape constraints for lected studies aim to enhance traditional testing techniques
tensor computations. Traditional techniques with API-level and make them more effective. Among these techniques,
fuzzing [141], [142] or model-level fuzzing [143], [144] fuzz testing is the most commonly involved one. Fuzz test-
suffer from the following limitations: 1) lack of diverse API ing, as a general concept, revolves around generating in-
sequence thus cannot reveal bugs caused by chained API valid, unexpected, or random data as inputs to evaluate the
sequences; 2) cannot generate arbitrary code thus cannot behavior of software. LLMs play a crucial role in improv-
explore the huge search space that exists when using the DL ing traditional fuzz testing by facilitating the generation of
libraries. Since LLMs can include numerous code snippets diverse and realistic input data. This enables fuzz testing to
invoking DL library APIs in their training corpora, they uncover potential bugs in the software by subjecting it to a
can implicitly learn both language syntax/semantics and wide range of input scenarios. In addition to fuzz testing,
intricate API constraints for valid DL program generation. LLMs also contribute to enhancing other testing techniques,
Taken in this sense, Deng et al. [59] used both generative which will be discussed in detail later.
and infilling LLMs to generate and mutate valid/diverse Universal fuzzing framework. Xia et al. [67] present
input DL programs for fuzzing DL libraries. In detail, it first Fuzz4All that can target many different input languages
uses a generative LLM (CodeX) to generate a set of seed and many different features of these languages. The key
programs (i.e., code snippets that use the target DL APIs). idea behind it is to leverage LLMs as an input generation
Then it replaces part of the seed program with masked and mutation engine, which enables the approach to
tokens using different mutation operators and leverages the produce diverse and realistic inputs for any practically
ability of infilling LLM (InCoder) to perform code infilling relevant language. To realize this potential, they present
to generate new code that replaces the masked tokens. Their a novel auto-prompting technique, which creates LLM
follow-up study [58] goes a step further to prime LLMs to prompts that are well-suited for fuzzing, and a novel
synthesize unusual programs for the fuzzing DL libraries. LLM-powered fuzzing loop, which iteratively updates the
It is built on the well-known hypothesis that historical prompt to create new fuzzing inputs. They experiment
bug-triggering programs may include rare/valuable code with six different languages (C, C++, Go, SMT2, Java and
ingredients important for bug finding and show improved Python) as inputs and demonstrate higher coverage than
bug detection performance. existing language-specific fuzzers. Hu et al. [52] propose a
Test input generation for other types of software. There greybox fuzzer augmented by the LLM, which picks a seed
are also dozens of studies that address testing tasks in vari- in the fuzzer’s seed pool and prompts the LLM to produce
ous other domains, due to space limitations, we will present the mutated seeds that might trigger a new code region
a selection of representative studies in these domains. of the software. They experiment with three categories of
Finding bugs in a commercial cyber-physical system input formats, i.e., formatted data files (e.g., json, xml),
(CPS) development tool such as Simulink is even more source code in different programming languages (e.g., JS,
challenging. Given the complexity of the Simulink language, SQL, C), text with no explicit syntax rules (e.g., HTTP
generating valid Simulink model files for testing is an response, md5 checksum). In addition, effective fuzzing
ambitious task for traditional machine learning or deep relies on the effective fuzz driver, and Zhang et al. [66]
learning techniques. Shrestha et al. [51] employs a small set utilize LLMs on the fuzz driver generation, in which five
of Simulink-specific training data to fine-tune the LLM for query strategies are designed and analyzed from basic to
generating Simulink models. Results show that it can create enhanced.
Simulink models quite similar to the open-source models, Fuzzing techniques for specific software. There are
and can find a super-set of the bugs traditional fuzzing studies that focus on the fuzzing techniques tailored to
approaches found. specific software, e.g., the deep learning library [58], [59],
Sun et al. [63] utilize LLM to generate test formulas for compiler [69], SMT solvers [63], input widget of mobile app
fuzzing SMT solvers. It retrains the LLMs on a large corpus [65], cyber-physical system [51], etc. One key focus of these
of SMT formulas to enable them to acquire SMT-specific fuzzing techniques is to generate diverse test inputs so as
domain knowledge. Then it further fine-tunes the LLMs to achieve higher coverage. This is commonly achieved
on historical bug-triggering formulas, which are known by combining the mutation technique with LLM-based
to involve structures that are more likely to trigger bugs generation, where the former produces various candidates
and solver-specific behaviors. The LLM-based compiler while the latter is responsible for generating the executable
fuzzer proposed by Yang et al. [69] adopts a dual-model test inputs [59], [63]. Another focus of these fuzzing
framework: (1) an analysis LLM examines the low-level techniques is to generate the risky test inputs that can
optimization source code and produces requirements on the trigger bugs earlier. To achieve this, a common practice is to
11

collect the historical bug-triggering programs to fine-tune performance of bug-component triaging further. Zhang et
the LLM [63] or treat them as the demonstrations when al. [72] first leverage the LLM under the zero-shot setting
querying the LLM [58], [65]. to get essential information on bug reports, then use the
Other testing techniques. There are studies that utilize essential information as the input to detect duplicate bug re-
LLMs for enhancing GUI testing for generating meaningful ports. Mahbub et al. [74] proposes to explain software bugs
text input [49] and functionality-oriented exploration traces with LLM, which generates natural language explanations
[14], which has been introduced in Test input generation for for software bugs by learning from a large corpus of bug-fix
mobile apps part of Section 4.3.1. commits. Zhang et al. [70] target to automatically generate
Besides, Deng et al. [62] leverage the LLMs to carry out the bug title from the descriptions of the bug, which aims
penetration testing tasks automatically. It involves setting a to help developers write issue titles and facilitate the bug
penetration testing goal for the LLM, soliciting it for the triaging and follow-up fixing process.
appropriate operation to execute, implementing it in the
testing environment, and feeding the test outputs back to
4.5 Debug
the LLM for next-step reasoning.
This category refers to the process of identifying and locat-
4.3.3 Input Generation in Terms of Input and Output ing the cause of a software problem (i.e., bug). It involves
Other output format of test generation. Although most analyzing the code, tracing the execution flow, collecting
works use LLM to generate test cases directly, there are also error information to understand the root cause of the issue,
some works generating indirect inputs like testing code, test and fixing the issue. Some studies concentrate on the com-
scenarios, metamorphic relations, etc. Liu et al. [65] pro- prehensive debug process, while others delve into specific
pose InputBlaster which leverages the LLM to automati- sub-activities within the process.
cally generate unusual text inputs for fuzzing the text input Overall debug framework. Bui et al. [77] proposes a uni-
widgets in mobile apps. It formulates the unusual inputs fied Detect-Localize-Repair framework based on the LLM
generation problem as a task of producing a set of test gen- for debugging, which first determines whether a given code
erators, each of which can yield a batch of unusual text snippet is buggy or not, then identifies the buggy lines, and
inputs under the same mutation rule. In detail, InputBlaster translates the buggy code to its fixed version. Kang et al.
leverages LLM to produce the test generators together with [83] proposes automated scientific debugging, a technique
the mutation rules serving as the reasoning chain and uti- that given buggy code and a bug-revealing test, prompts
lizes the in-context learning schema to demonstrate the LLM LLMs to automatically generate hypotheses, uses debuggers
with examples for boosting the performance. Deng et al. to actively interact with buggy code, and thus automati-
[64] use LLM to extract key information related to the test cally reaches conclusions prior to patch generation. Chen
scenario from a traffic rule, and represent the extracted in- et al. [88] demonstrate that self-debugging can teach the
formation in a test scenario schema, then synthesize the LLM to perform rubber duck debugging; i.e., without any
corresponding scenario scripts to construct the test scenario. human feedback on the code correctness or error messages,
Luu et al. [56] examine the effectiveness of LLM in generat- the model is able to identify its mistakes by investigating the
ing metamorphic relations (MRs) for metamorphic testing. execution results and explaining the generated code in nat-
Their results show that ChatGPT can be used to advance ural language. Cao et al. [89] conducts a study of LLM’s de-
software testing intelligence by proposing MRs candidates bugging ability for deep learning programs, including fault
that can be later adapted for implementing tests, but human detection, fault localization and program repair.
intelligence should still inevitably be involved to justify and Bug localization. Wu et al. [85] compare the two LLMs
rectify their correctness. (ChatGPT and GPT-4) with the existing fault localization
Other input format of test generation. The aforemen- techniques, and investigate the consistency of LLMs in fault
tioned studies primarily take the source code or the software localization, as well as how prompt engineering and the
as the input of LLM, yet there are also studies that take length of code context affect the results. Kang et al. [79]
natural language description as the input for test generation. propose AutoFL, an automated fault localization technique
Mathur et al. [53] propose to generate test cases from the that only requires a single failing test, and during its fault
natural language described requirements. Ackerman et al. localization process, it also generates an explanation about
[60] generate the instances from natural language described why the given test fails. Yang et al. [84] propose LLMAO to
requirements recursively to serve as the seed examples for a overcome the left-to-right nature of LLMs by fine-tuning a
mutation fuzzer. small set of bidirectional adapter layers on top of the rep-
resentations learned by LLMs, which can locate buggy lines
of code without any test coverage information. Tu et al. [86]
4.4 Bug Analysis propose LLM4CBI to tame LLMs to generate effective test
This category involves analyzing and categorizing the iden- programs for finding suspicious files.
tified software bugs to enhance understanding of the bug, Bug reproduction. There are also studies focusing on a
and facilitate subsequent debug and bug repair. Mukher- sub-phase of the debugging process. For example, Kang et
jee et al. [73] generate relevant answers to follow-up ques- al. [78] and Plein et al. [81] respectively propose the frame-
tions for deficient bug reports to facilitate bug triage. Su et work to harness the LLM to reproduce bugs, and suggest
al. [76] transform the bug-component triaging into a multi- bug reproducing test cases to the developer for facilitating
classification task and a generation task with LLM, then debugging. Li et al. [87] focus on a similar aspect of finding
ensemble the prediction results from them to improve the the failure-inducing test cases whose test input can trigger
12

the software’s fault. It synergistically combines LLM and al. [95] propose to fine-tune the LLM with JavaScript code
differential testing to do that. snippets to serve as the purpose for the JavaScript program
There are also studies focusing on the bug reproduc- repair. Zhang et al. [116] employs program slicing to extract
tion of mobile apps to produce the replay script. Feng et contextual information directly related to the given buggy
al. [75] propose AdbGPT, a new lightweight approach to statement as repair ingredients from the corresponding pro-
automatically reproduce the bugs from bug reports through gram dependence graph, which makes the fine-tuning more
prompt engineering, without any training and hard-coding focused on the buggy code. Zhang et al. [121] propose a
effort. It leverages few-shot learning and chain-of-thought stage-wise framework STEAM for patching single-line bugs,
reasoning to elicit human knowledge and logical reasoning which simulates the interactive behavior of multiple pro-
from LLMs to accomplish the bug replay in a manner similar grammers involved in bug management, e.g., bug reporting,
to a developer. Huang et al. [71] propose CrashTranslator to bug diagnosis, patch generation, and patch verification.
automatically reproduce bugs directly from the stack trace. Since most real-world bugs would involve multiple lines
It accomplishes this by leveraging the LLM to predict the of code, and later studies explore these more complex situa-
exploration steps for triggering the crash, and designing a tions (although some of them can also patch the single-line
reinforcement learning based technique to mitigate the in- bugs).
accurate prediction and guide the search holistically. Taeb et Patch multiple-lines bugs. The studies in this category
al. [55] convert the manual accessibility test instructions into would input a buggy function to the LLM, and the goal is to
replayable, navigable videos by using LLM and UI element output the patched function, which might involve complex
detection models, which can also help reveal accessibility semantic understanding, code hunk modification, as well
issues. as program refactoring. Earlier studies typically employ the
Error explanation. Taylor et al. [82] integrates the LLM fine-tuning strategy to enable the LLM to better understand
into the Debugging C Compiler to generate unique, novice- the code semantics. Fu et al. [123] fine-tune the LLM by
focused explanations tailored to each error. Widjojo et al. employing BPE tokenization to handle Out-Of-Vocabulary
[80] study the effectiveness of Stack Overflow and LLMs at (OOV) issues which makes the approach generate new to-
explaining compiler errors. kens that never appear in a training function but are newly
introduced in the repair. Wang et. al. [120] train the LLM
based on both buggy input and retrieved bug-fix examples
4.6 Program Repair which are retrieved in terms of the lexical and semantical
This category denotes the task of fixing the identified similarities. The aforementioned studies (including the ones
software bugs. The high frequency of repair-related studies in patching single-line bugs) would predict the fixed pro-
can be attributed to the close relationship between this grams directly, and Hu et al. [92] utilize a different setup
task and the source code. With their advanced natural that predicts the scripts that can fix the bugs when executed
language processing and understanding capabilities, LLM with the delete and insert grammar. For example, it predicts
are well-equipped to process and analyze source code, whether an original line of code should be deleted, and what
making them an ideal tool for performing code-related content should be inserted.
tasks such as fixing bugs. Nevertheless, fine-tuning may face limitations in terms
There have been template-based [145], heuristic-based of its reliance on abundant high-quality labeled data,
[146], and constraint-based [147], [148] automatic program significant computational resources, and the possibility of
repair techniques. And with the development of deep overfitting. To approach the program repair problem more
learning techniques in the past few years, there have been effectively, later studies focus on how to design an effective
several studies employing deep learning techniques for prompt for program repair. Several studies empirically
program repair. They typically adopt deep learning models investigate the effectiveness of prompt variants of the latest
to take a buggy software program as input and generate a LLMs for program repair under different repair settings
patched program. Based on the training data, they would and commonly-used benchmarks (which will be explored
build a neural network model that learns the relations in depth later), while other studies focus on proposing
between the buggy code and the corresponding fixed code. new techniques. Ribeiro et al. [109] take advantage of
Nevertheless, these techniques still fail to fix a large portion LLM to conduct the code completion in a buggy line for
of bugs, and they typically have to generate hundreds to patch generation, and elaborate on how to circumvent the
thousands of candidate patches and take hours to validate open-ended nature of code generation to appropriately
these patches to fix enough bugs. Furthermore, the deep fit the new code in the original program. Xia et al. [115]
learning based program repair models need to be trained propose the conversation-driven program repair approach
with huge amounts of labeled training data (typically that interleaves patch generation with instant feedback
pairs of buggy and fixed code), which is time- and effort- to perform the repair in a conversational style. They first
consuming to collect the high-quality dataset. Subsequently, feed the LLM with relevant test failure information to start
with the popularity and demonstrated capability of the with, and then learns from both failures and successes
LLMs, researchers begin to explore the LLMs for program of earlier patching attempts of the same bug for more
repair. powerful repair. For earlier patches that failed to pass
Patch single-line bugs. In the early era of program re- all tests, they combine the incorrect patches with their
pair, the focus was mainly on addressing defects related to corresponding relevant test failure information to construct
single-line code errors, which are relatively simple and did a new prompt for the LLM to generate the next patch,
not require the repair of complex program logic. Lajkó et in order to avoid making the same mistakes. For earlier
13

TABLE 4: Performance of program repair


Dataset % Correct patches LLM Paper
Defects4J v1.2, Defects4J 22/40 Jave bugs (QuixBugs dataset, with InCoder-6B, correct PLBART, CodeT5, CodeGen, In- [113]
v2.0, QuixBugs, code infilling setting) Coder (each with variant pa-
HumanEval-Java rameters, 10 LLMs in total)
QuixBugs 23/40 Python bugs, 14/40 Java bugs (complete function genera- Codex-12B [100]
tion setting)
Defects4J v1.2, Defects4J 39/40 Python bugs, 34/40 Java bugs (QuixBugs dataset, with Codex, GPT-Neo, CodeT5, In- [93]
v2.0, QuixBugs, Many- Codex-12B, correct code infilling setting); 37/40 Python bugs, Coder (each with variant pa-
Bugs 32/40 Java bugs (QuixBugs dataset, with Codex-12B, complete rameters, 9 LLMs in total)
function generation setting)
QuixBugs 31/40 Python bugs (completion function generation setting) ChatGPT-175B [96]
DL programs from Stack- 16/72 Python bugs (complete function generation setting) ChatGPT-175B [89]
Overflow
Note that, for studies with multiple datasets or LLMs, we only present the best performance or in the most commonly utilized dataset.

patches that passed all the tests (i.e., plausible patches), OCaml programs (an industrial-strength programming lan-
they further ask the LLM to generate alternative variations guage) [111].
of the original plausible patches. This can further build on
and learn from earlier successes to generate more plausible Empirical study about program repair. There are several
patches to increase the chance of having correct patches. studies related to the empirical or experimental evaluation
Zhang et al. [94] propose a similar approach design by of the various LLMs on program repair, and we summa-
leveraging multimodal prompts (e.g., natural language rize the performance in Table 4. Jiang et al. [113], Xia et al.
description, error message, input-output-based test cases), [93], and Zhang et. al. [118] respectively conduct compre-
iterative querying, test-case-based few-shot selection to hensive experimental evaluations with various LLMs and
produce repairs. Moon et al. [102] propose for bug fixing on different automated program repair benchmarks, while
with feedback. It consists of a critic model to generate other researchers [89], [96], [98], [100] focus on a specific
feedback, an editor to edit codes based on the feedback, LLM and on one dataset, e.g., QuixBugs. In addition, Gao
and a feedback selector to choose the best possible feedback et al. [124] empirically investigate the impact of in-context
from the critic. demonstrations for bug fixing, including the selection, or-
der, and number of demonstration examples. Prenner et al.
Wei et. al. [103] propose Repilot to copilot the AI “copi-
[117] empirically study how the local context (i.e., code that
lots” (i.e., LLMs) by synthesizing more valid patches during
comes before or after the bug location) affects the repair per-
the repair process. Its key insight is that many LLMs pro-
formance. Horváth et al. [99] empirically study the impact
duce outputs autoregressively (i.e., token by token), and by
of program representation and model architecture on the
resembling human writing programs, the repair can be sig-
repair performance.
nificantly boosted and guided through a completion engine.
Brownlee et al. [105] propose to use the LLM as mutation There are two commonly-used repair settings when us-
operators for the search-based techniques of program repair. ing LLMs to generate patches: 1) complete function gen-
Repair with static code analyzer. Most of the program eration (i.e., generating the entire patch function), 2) cor-
repair studies would suppose the bug has been detected, rect code infilling (i.e., filling in a chunk of code given the
while Jin et al. [114] propose a program repair framework prefix and suffix), and different studies might utilize differ-
paired with a static analyzer to first detect the bugs, and ent settings which are marked in Table 4. The commonly-
then fix them. In detail, the static analyzer first detects an used datasets are QuixBugs, Defects4J, etc. These datasets
error (e.g., null pointer dereference) and the context infor- only involve the fundamental functionalities such as sorting
mation provided by the static analyzer will be sent into the algorithms, each program’s average number of lines rang-
LLM for querying the patch for this specific error. Wadhwa ing from 13 to 22, implementing one functionality, and in-
et al. [110] focus on a similar task, and additionally employ volving few dependencies. To tackle this, Cao et al. [89]
an LLM as the ranker to assess the likelihood of acceptance conducts an empirical study on a more complex dataset
of generated patches which can effectively catch plausible with DL programs collected from StackOverflow. Every pro-
but incorrect fixes and reduce developer burden. gram contains about 46 lines of code on average, imple-
Repair for specific bugs. The aforementioned studies menting several functionalities including data preprocess-
all consider the buggy code as the input for the automatic ing, DL model construction, model training, and evaluation.
program repair, while other studies conduct program re- And the dataset involves more than 6 dependencies for each
pairing in terms of other types of bug descriptions, specific program, including TensorFlow, Keras, and Pytorch. Their
types of bugs, etc. Fakhoury et al. [122] focus on program results demonstrate a much lower rate of correct patches
repair from natural language issue descriptions, i.e., gen- than in other datasets, which again reveals the potential
erating the patch with the bug and fix-related information difficulty of this task. Similarly, Haque et al. [106] introduce
described in the issue reports. Garg et al. [119] aim at re- a dataset comprising of buggy code submissions and their
pairing performance issues, in which they first retrieve a corresponding fixes collected from online judge platforms,
prompt instruction from a pre-constructed knowledge-base in which it offers an extensive collection of unit tests to
of previous performance bug fixes and then generate a re- enable the evaluations about the correctness of fixes and fur-
pair prompt using the retrieved instruction. There are stud- ther information regarding time, memory constraints, and
ies focusing on the bug fixing of Rust programs [108] or acceptance based on a verdict.
14

Others, 7 There are already 14 studies that utilize GPT-4, ranking


UniXCoder, 2 at the fourth place, which is launched on March 2023. Sev-
5% 1% StarCoder,
BART, 3
3 eral studies directly utilize this state-of-the-art LLM of Ope-
2%2% GPT-2, 4
ChatGPT, 36 nAI, since it demonstrates excellent performance across a
25% 3% CodeGPT, 4 wide range of generation and reasoning tasks. For example,
3%
T5, 5 Xie et al. utilize GPT-4 to generate fuzzing inputs [67], while
4% Vikram et al. employ it to generate property-based tests with
4% PLBART, 5 the assistance of API documentation [34]. In addition, some
4% InCoder, 5 studies conduct experiments using both GPT-4 and Chat-
GPT or other LLMs to provide a more comprehensive evalu-
4% CodeGen, 6 ation of these models’ performance. In their proposed LLM-
empowered automatic penetration testing technique, Deng
16% 5%
Codex, 23 GPT-3, 7 et al. find that GPT-4 surpasses ChatGPT and LaMDA from
Google [62]. Similarly, Zhang et al. find that GPT-4 shows
10% its performance superiority over ChatGPT when generat-
13% GPT-4, 14 ing the fuzz drivers with both the basic query strategies
CodeT5, 18 and enhanced query strategies [66]. Furthermore, GPT-4, as
a multi-modal LLM, sets itself apart from the other men-
tioned LLMs by showcasing additional capabilities such as
Fig. 6: LLMs used in the collected papers generating image narratives and answering questions based
on images [149]. Yet we have not come across any studies
5 A NALYSIS FROM LLM P ERSPECTIVE that explore the utilization of GPT-4’s image-related features
This section discusses the analysis based on the viewpoints (e.g., UI screenshots, programming screencasts) in software
of LLM, specifically, it’s unfolded from the viewpoints of testing tasks.
utilized LLMs, types of prompt engineering, input of the
LLMs, as well as the accompanied techniques when utilizing
LLM. 5.2 Types of Prompt Engineering
As shown in Figure 7, among our collected studies, 38
5.1 LLM Models studies utilize the LLMs through pre-training or fine-
As shown in Figure 6, the most commonly utilized LLM tuning schema, while 64 studies employ the prompt
in software testing tasks is ChatGPT, which was released engineering to communicate with LLMs to steer its
on Nov. 2022 by OpenAI. It is trained on a large corpus behavior for desired outcomes without updating the model
of natural language text data, and primarily designed for weights. When using the early LLMs, their performances
natural language processing and conversation. ChatGPT is might not be as impressive, so researchers often use
the most widely recognized and popular LLM up until now, pre-training or fine-tuning techniques to adjust the models
known for its exceptional performance across various tasks. for specific domains and tasks in order to improve their
Therefore, it comes as no surprise that it ranks in the top performance. Then with the upgrading of LLM technology,
position in terms of our collected studies. especially with the introduction of GPT-3 and later
Codex, an LLM based on GPT-3, is the second most com- LLMs, the knowledge contained within the models and
monly used LLM in our collected studies. It is trained on a their understanding/inference capability has increased
massive code corpus containing examples from many pro- significantly. Therefore, researchers will typically rely on
gramming languages such as JavaScript, Python, C/C++, prompt engineering to consider how to design appropriate
and Java. Codex was released on Sep. 2021 by OpenAI and prompts to stimulate the model’s knowledge.
powers GitHub Copilot– an AI pair programmer that gener- Among the 64 studies with prompt engineering, 51 stud-
ates whole code snippets, given a natural language descrip- ies involve zero-shot learning, and 25 studies involve few-
tion as a prompt. Since a large portion of our collected stud- shot learning (a study may involve multiple types). There
ies involve the source code (e.g., repair, unit test case gen- are also studies involving the chain-of-though (7 studies),
eration), it is not surprising that researchers choose Codex self-consistency (1 study), and automatic prompt (1 study).
as the LLM in assisting them in accomplishing the coding- Zero-shot learning is to simply feed the task text to the
related tasks. model and ask for results. Many of the collected studies em-
The third-ranked LLM is CodeT5, which is an open- ploy the Codex, CodeT5, and CodeGen (as shown in Section
sourced LLM developed by salesforce3 . Thanks to its open 5.1), which is already trained on source code. Hence, for the
source, researchers can easily conduct the pre-training and tasks dealing with source code like unit test case generation
fine-tuning with domain-specific data to achieve better and program repair as demonstrated in previous sections,
performance. Similarly, CodeGen is also open-sourced and directly querying the LLM with prompts is the common
ranked relatively higher. Besides, for CodeT5 and CodeGen, practice. There are generally two types of manners of zero-
there are more than half of the related studies involve the shot learning, i.e., with and without instructions. For exam-
empirical evaluations (which employ multiple LLMs), e.g., ple, Xie et al. [36] would provide the LLMs with the instruc-
program repair [112], [113], unit test case generation [39]. tions as “please help me generate a JUnit test for a specific
Java method ...” to facilitate the unit test case generation.
3. https://blog.salesforceairesearch.com/codet5/ In contrast, Siddiq et al. [39] only provide the code header
15

Fig. 7: Distribution about how LLM is used (Note that, a study can involve multiple types of prompt engineering)

of the unit test case (e.g., “class ${className}${suffix}Test (with a debugger or code execution) using an LLM, after
{”), and the LLMs would carry out the unit test case gener- that, depending on the conclusion, it either starts with a
ation automatically. Generally speaking, prompts with clear new hypothesis or opts to terminate the debugging process
instructions will yield more accurate results, while prompts and generate a fix.
without instructions are typically suitable for very specific
Automatic prompt aims to automatically generate and
situations.
select the appropriate instruction for the LLMs, instead of
Few-shot learning presents a set of high-quality demon- requiring the user to manually engineer a prompt. Xia et
strations, each consisting of both input and desired output, al. [67] introduce an auto-prompting step that automatically
on the target task. As the model first sees the examples, distils all user-provided inputs into a concise and effective
it can better understand human intention and criteria for prompt for fuzzing. Specifically, they first generate a list of
what kinds of answers are wanted, which is especially im- candidate prompts by incorporating the user inputs and
portant for tasks that are not so straightforward or intuitive auto prompting instruction while setting the LLM at high
to the LLM. For example, when conducting the automatic temperature, then a small-scale fuzzing experiment is con-
test generation from general bug reports, Kang et al. [78] ducted to evaluate each candidate prompt, and the best one
provide examples of bug reports (questions) and the corre- is selected.
sponding bug reproducing tests (answers) to the LLM, and
their results show that two examples can achieve the highest Note that there are fourteen studies that apply the it-
performance than no examples or other number of exam- erative prompt design when using zero-shot or few-shot
ples. Another example of test assertion generation, Nashid learning, in which the approach continuously refines the
et al. [47] provide demonstrations of the focal method, the prompts with the running information of the testing task,
test method containing an <AssertPlaceholder>, and the ex- e.g., the test failure information. For example, for program
pected assertion, which enables the LLMs to better under- repair, Xia et al. [115] interleave patch generation with test
stand the task. validation feedback to prompt future generation iteratively.
Chain-of-thought (CoT) prompting generates a In detail, they incorporate various information from a failing
sequence of short sentences to describe reasoning logics test including its name, the relevant code line(s) triggering
step by step (also known as reasoning chains or rationales) the test failure, and the error message produced in the next
to the LLMs for generating the final answer. For example, round of prompting which can help the model understand
for program repair from the natural language issue the failure reason and provide guidance towards generating
descriptions [122], given the buggy code and issue report, the correct fix. Another example is for mobile GUI testing,
the authors first ask the LLM to localize the bug, and then Liu et al. [14] iteratively query the LLM about the operation
they ask it to explain why the localized lines are buggy, (e.g., click a button, enter a text) to be conducted in the
finally, they ask the LLM to fix the bug. Another example is mobile app, and at each iteration, they would provide the
for generating unusual programs for fuzzing deep learning LLM with current context information like which GUI pages
libraries, Deng et al. [58] first generate a possible “bug” (bug and widgets have just explored.
description) before generating the actual “bug-triggering” Mapping between testing tasks and how LLMs are
code snippet that invokes the target API. The predicted used. Figure 8 demonstrates the mapping between the test-
bug description provides an additional hint to the LLM, ing tasks (mentioned in Section 4) and how LLMs are used
indicating that the generated code should try to cover (as introduced in this subsection). The unit test case gen-
specific potential buggy behavior. eration and program repair share similar patterns of com-
Self-consistency involves evaluating the coherence and municating with the LLMs, since both tasks are closely re-
consistency of the LLM’s responses on the same input in lated to the source code. Typically, researchers utilize pre-
different contexts. There is one study with this prompt training and/or fine-tuning and zero-shot learning methods
type, and it is about debugging. Kang et al. [83] employ a for these two tasks. Zero-shot learning is suitable because
hypothesize-observe-conclude loop, which first generates these tasks are relatively straightforward and can be easily
a hypothesis about what the bug is and constructs an understood by LLMs. Moreover, since the training data for
experiment to verify, using an LLM, then decide whether these two tasks can be automatically collected from source
the hypothesis is correct based on the experiment result code repositories, pre-training and/or fine-tuning methods
16

different setups and involve different inputs, including (i)


inputting a buggy function with the goal of outputting the
patched function, (ii) inputting the buggy location with the
goal of generating the correct replacement code (can be a
single line change) given the prefix and suffix of the buggy
function [93]. Besides, there can be variations for the buggy
location input, i.e., (i) does not contain the buggy lines (but
the bug location is still known), (ii) give the buggy lines as
lines of comments.
There are also 12 studies taking the bug description as
input for the LLM. For example, Kang et al. [78] take the
bug description as input when querying LLM and let the
LLM generate the bug-reproducing test cases. Fakhoury et
Fig. 8: Mapping between testing tasks and how LLMs are al. [122] input the natural language descriptions of bugs to
used the LLM, and generate the correct code fixes.
There are 7 studies that would provide the intermedi-
ate error information, e.g., test failure information, to the
Others, 12 LLM, and would conduct the iterative prompt (as described
View hierarchy file of UI, 6 in Section 5.2) to enrich the context provided to the LLM.
10% These studies are related to the unit test case generation
5% Error information, 7 and program repair, since in these scenarios, the running
6% information can be acquired easily.
When testing mobile apps, since the utilized LLM could
10% Bug description, 12 not understand the image of the GUI page, the view hierar-
chy file which represents the details of the GUI page usually
68% acts as the input to LLMs. Nevertheless, with the emergence
Code, 78 of GPT-4 which is a multimodal model and accepts both
image and text inputs for model input, the GUI screenshots
might be directly utilized for LLM’s input.

5.4 Incorporating Other Techniques with LLM


Fig. 9: Input of LLM There are divided opinions on whether LLM has reached
an all-powerful status that requires no other techniques. As
shown in Figure 10, among our collected studies, 67 of them
are widely employed for these two tasks, which can enhance
utilize LLMs to address the entire testing task, while 35 stud-
LLMs’ understanding of domain-specific knowledge.
ies incorporate additional techniques. These techniques in-
In comparison, for system test input generation, zero-
clude mutation testing, differential testing, syntactic check-
shot learning and few-shot learning methods are commonly
ing, program analysis, statistical analysis, etc. .
used. This might be because this task often involves gener-
The reason why researchers still choose to combine
ating specific types of inputs, and demonstrations in few-
LLMs with other techniques might be because, despite
shot learning can assist the LLMs in better understanding
exhibiting enormous potential in various tasks, LLMs still
what should be generated. Besides, for this task, the uti-
possess limitations such as comprehending code semantics
lization of pre-training and/or fine-tuning methods are not
and handling complex program structures. Therefore,
as widespread as in unit test case generation and program
combining LLMs with other techniques optimizes their
repair. This might be attributed to the fact that training data
strengths and weaknesses to achieve better outcomes in
for system testing varies across different software and is
specific scenarios. In addition, it is important to note that
relatively challenging to collect automatically.
while LLMs are capable of generating correct code, they
may not necessarily produce sufficient test cases to check
5.3 Input of LLM for edge cases or rare scenarios. This is where mutation
We also find that different testing tasks or software under and other testing techniques come into play, as they allow
test might involve diversified input when querying the for the generation of more diverse and complex code that
LLM, as demonstrated in Figure 9. can better simulate real-world scenarios. Taken in this
The most commonly utilized input is the source code sense, a testing approach can incorporate a combination
since a large portion of collected studies relate to program of different techniques, including both LLMs and other
repair or unit test case generation whose input are source testing strategies, to ensure comprehensive coverage and
code. For unit test case generation, typical code-related in- effectiveness.
formation would be (i) the complete focal method, including LLM + statistical analysis. As LLMs can often generate
the signature and body; (ii) the name of the focal class (i.e., a multitude of outputs, manually sifting through and iden-
the class that the focal method belongs to); (iii) the field in tifying the correct output can be overwhelmingly laborious.
the focal class; and (iv) the signatures of all methods defined As such, researchers have turned to statistical analysis tech-
in the focal class [7], [26]. For program repair, there can be niques like ranking and clustering [28], [45], [78], [93], [116]
17

Fig. 10: Distribution about other techniques incorporated with LLMs (Note that, a study can involve multiple types)

to efficiently filter through LLM’s outputs and ultimately whether there is a triggered bug based on the software’s
obtain more accurate results. output. For example, Ye et al. [48] first uses LLM to
LLM + program analysis. When utilizing LLMs to produce random JavaScript programs, and leverages the
accomplish tasks such as generating unit test cases and language specification document to generate test data, then
repairing software code, it is important to consider that conduct the differential testing on JavaScript engines such
software code inherently possesses structural information, as JavaScriptCore, ChakraCore, SpiderMonkey, QuickJS,
which may not be fully understood by LLMs. Hence, etc. There are also studies utilizing the LLMs to generate
researchers often utilize program analysis techniques, test inputs and then conduct differential testing for fuzzing
including code abstract syntax trees (ASTs) [74], to DL libraries [58], [59] and SAT solvers [63]. Li et al. [87]
represent the structure of code more effectively and increase employs the LLM in finding the failure-inducing test cases.
the LLM’s ability to comprehend the code accurately. In detail, given a program under test, they first request the
Researchers also perform the structure-based subsetting LLM to infer the intention of the program, then request the
of code lines to narrow the focus for LLM [94], or extract LLM to generate programs that have the same intention,
additional code context from other code files [7], to enable which are alternative implementations of the program, and
the models to focus on the most task-relevant information are likely free of the program’s bug. Then they perform
in the codebase and lead to more accurate predictions. the differential testing with the program under test and the
LLM + mutation testing. It is mainly targeting at gener- generated programs to find the failure-inducing test cases.
ating more diversified test inputs. For example, Deng et al.
[59] first use LLM to generate the seed programs (e.g., code 6 C HALLENGES AND O PPORTUNITIES
snippets using a target DL API) for fuzzing deep learning
Based on the above analysis from the viewpoints of soft-
libraries. To enrich the pool of these test programs, they
ware testing and LLM, we summarize the challenges and
replace parts of the seed program with masked tokens using
opportunities when conducting software testing with LLM.
mutation operators (e.g., replaces the API call arguments
with the span token) to produce masked inputs, and again
utilize the LLMs to perform code infilling to generate new 6.1 Challenges
code that replaces the masked tokens. As indicated by this survey, software testing with LLMs
LLM + syntactic checking. Although LLMs have shown has undergone significant growth in the past two years.
remarkable performance in various natural language pro- However, it is still in its early stages of development, and
cessing tasks, the generated code from these models can numerous challenges and open questions need to be ad-
sometimes be syntactically incorrect, leading to potential er- dressed.
rors and reduced usability. Therefore, researchers have pro-
posed to leverage syntax checking to identify and correct 6.1.1 Challenges for Achieving High Coverage
errors in the generated code. For example, in their work for Exploring the diverse behaviors of the software under test
unit test case generation, Alagarsamy et al. [29] addition- to achieve high coverage is always a significant concern
ally introduce a verification method to check and repair the in software testing. In this context, test generation differs
naming consistency (i.e., revising the test method name to from code generation, as code generation primarily focuses
be consistent with the focal method name) and the test sig- on producing a single, correct code snippet, whereas soft-
natures (i.e., adding missing keywords like public, void, or ware testing requires generating diverse test inputs to en-
@test annotations). Xie et al. [36] also validates the generated sure better coverage of the software. Although setting a high
unit test case and employs rule-based repair to fix syntactic temperature can facilitate the LLMs in generating different
and simple compile errors. outputs, it remains challenging for LLMs to directly achieve
LLM + differential testing. Differential testing is well- the required diversity. For example, for unit test case gen-
suited to find semantic or logic bugs that do not exhibit eration, in SF110 dataset, the line coverage is merely 2%
explicit erroneous behaviors like crashes or assertion and the branch coverage is merely 1% [39]. For system test
failures. In this category of our collected studies, the LLM input generation, in terms of fuzzing DL libraries, the API
is mainly responsible for generating valid and diversified coverage for TensorFlow is reported to be 66% (2215/3316)
inputs, while the differential testing helps to determine [59].
18

From our collected studies, we observe that the generate test cases based on metamorphic relations,
researchers often utilize mutation testing together with the covering a wide range of inputs.
LLMs to generate more diversified outputs. For example, The advancement of multi-model LLMs like GPT-4 may
when fuzzing a DL library, instead of directly generating open up possibilities for exploring their ability to detect
the code snippet with LLM, Deng et al. [59] replace parts bugs in software user interfaces and assist in deriving test
of the selected seed (code generated by LLM) with masked oracles. By leveraging the image understanding and reason-
tokens using different mutation operators to produce ing capabilities of these models, one can investigate their
masked inputs. They then leverage the LLM to perform potential to automatically identify inconsistencies, errors, or
code infilling to generate new code that replaces the masked usability issues in user interfaces.
tokens, which can significantly increase the diversity of the
generated tests. Liu et al. [65] leverage LLM to produce the 6.1.3 Challenges for Rigorous Evaluations
test generators (each of which can yield a batch of unusual
The lack of benchmark datasets and the potential data leak-
text inputs under the same mutation rule) together with the
age issues associated with LLM-based techniques present
mutation rules for text-oriented fuzzing, which reduces the
challenges in conducting rigorous evaluations and compre-
human effort required for designing mutation rules.
hensive comparisons of proposed methods.
A potential research direction could involve utilizing
For program repair, there are only two well-known and
testing-specific data to train or fine-tune a specialized LLM
commonly-used benchmarks, i.e., Defect4J and QuixBugs,
that is specifically designed to understand the nature of
as demonstrated in Table 4. Furthermore, these datasets are
testing. By doing so, the LLM can inherently acknowledge
not specially designed for testing the LLMs. For example, as
the requirements of testing and autonomously generate
reported by Xia et al. [93], 39 out of 40 Python bugs in the
diverse outputs.
QuixBugs dataset can be fixed by Codex, yet in real-world
practice, the successful fix rate can be nowhere near as high.
6.1.2 Challenges in Test Oracle Problem For unit test case generation, there are no widely recognized
The oracle problem has been a longstanding challenge in benchmarks, and different studies would utilize different
various testing applications, e.g., testing machine learning datasets for performance evaluation, as demonstrated in Ta-
systems [150] and testing deep learning libraries [59]. To ble 3. This indicates the need to build more specialized and
alleviate the oracle problem to the overall testing activities, diversified benchmarks.
a common practice in our collected studies is to transform it Furthermore, the LLMs may have seen the widely-used
into a more easily derived form, often by utilizing differen- benchmarks in their pre-training data, i.e., data leakage
tial testing [63] or focusing on only identifying crash bugs issues. Jiang et al. [113] check the CodeSearchNet and
[14]. BigQuery, which are the data sources of common LLMs,
There are successful applications of differential testing and the results show that four repositories used by the
with LLMs, as shown in Figure 10. For instance, when Defect4J benchmark are also in CodeSearchNet, and the
testing the SMT solvers, Sun et al. adopt differential testing whole Defects4J repository is included by BigQuery.
which involves comparing the results of multiple SMT Therefore, it is very likely that existing program repair
solvers (i.e., Z3, cvc5, and Bitwuzla) on the same generated benchmarks are seen by the LLMs during pre-training. This
test formulas by LLM [63]. However, this approach is data leakage issue has also been investigated in machine
limited to systems where counterpart software or running learning-related studies. For example, Tu et al. [151] focus
environment can easily be found, potentially restricting on the data leakage in issue tracking data, and results show
its applicability. Moreover, to mitigate the oracle problem, that information leaked from the “future” makes prediction
other studies only focus on the crash bugs which are easily models misleadingly optimistic. This reminds us that the
observed automatically. This is particularly the case for performance of LLMs on software testing tasks may not be
mobile applications testing, in which the LLMs guide the as good as reported in previous studies. It also suggests
testing in exploring more diversified pages, conducting that we need more specialized datasets that are not seen by
more complex operational actions, and covering more LLMs to serve as benchmarks. One way is to collect it from
meaningful operational sequences [14]. However, this specialized sources, e.g., user-generated content from niche
significantly restricts the potential of utilizing the LLMs for online communities.
uncovering various types of software bugs.
Exploring the use of LLMs to derive other types of 6.1.4 Challenges in Real-world Application of LLMs in Soft-
test oracles represents an interesting and valuable research ware Testing
direction. Specifically, metamorphic testing is also widely used in software testing practice to help mitigate the oracle problem, yet in most cases defining metamorphic relations relies on human ingenuity. Luu et al. [56] have examined the effectiveness of LLMs in generating metamorphic relations, yet they only experiment with straightforward prompts that directly query ChatGPT. Further exploration, potentially incorporating human-computer interaction or domain knowledge, is highly encouraged. Another promising avenue is exploring the capability of LLMs to automatically

As we mentioned in Section 5.2, in the early days of using LLMs, pre-training and fine-tuning were the common practice, since the models had relatively few parameters and correspondingly weaker capabilities (e.g., T5). As time progressed, the number of model parameters increased significantly, leading to the emergence of models with greater capabilities (e.g., ChatGPT), and in recent studies prompt engineering has become the common approach. However, due to concerns regarding data privacy, when considering real-world practice, most software organizations tend to avoid using commercial LLMs and prefer to adopt open-source ones, trained or fine-tuned on organization-specific data. Furthermore, owing to current limitations in computational power, or out of close attention to energy consumption, some companies tend to fine-tune medium-sized models. It is quite challenging for these models to achieve performance similar to what our collected papers report. For instance, on the widely-used QuixBugs dataset, it has been reported that 39 out of 40 Python bugs and 34 out of 40 Java bugs can be automatically fixed [93]. However, when it comes to DL programs collected from Stack Overflow, which represent real-world coding practice, only 16 out of 72 Python bugs can be automatically fixed [89].

Recent research has highlighted the importance of high-quality training data for improving the performance of models on code-related tasks [152], yet manually building high-quality organization-specific datasets for training or fine-tuning is time-consuming and labor-intensive. To address this, one is encouraged to use automated mining of software repositories to build such datasets; for example, key information extraction techniques for Stack Overflow [153] offer potential solutions for automatically gathering relevant data.

In addition, exploring methodologies for better fine-tuning LLMs with software-specific data is worth considering, because software-specific data differs from natural language data in that it contains more structural information, such as data flow and control flow. Previous research on code representations has shown the benefits of incorporating data flow, which captures the semantic-level structure of code and represents the relationship between variables in terms of "whether-value-comes-from" [154]. These insights can provide valuable guidance for effectively fine-tuning LLMs with software-specific data.
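To make this kind of structural signal concrete, the minimal Python sketch below extracts simple "value-comes-from" (def-use) pairs from source code with the standard ast module. It is only an illustration of the sort of data-flow information that could accompany raw code text when preparing software-specific fine-tuning data, not the actual pipeline of [154].

# Minimal sketch: extract simple def-use pairs ("which variables a value comes
# from") for plain assignments, using only the Python standard library.
import ast

def def_use_pairs(source: str) -> list[tuple[str, str]]:
    """Return (assigned_variable, source_variable) pairs for simple assignments."""
    pairs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign) and isinstance(node.targets[0], ast.Name):
            target = node.targets[0].id
            used = [n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)]
            pairs.extend((target, u) for u in used)
    return pairs

print(def_use_pairs("a = b + c\nd = a * 2"))  # [('a', 'b'), ('a', 'c'), ('d', 'a')]

Such pairs could, for instance, be serialized alongside each code snippet so that the fine-tuned model sees both the raw text and an explicit data-flow view of it.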
6.2 Opportunities
There are also many research opportunities in software testing with LLMs. While not necessarily challenges, these opportunities can contribute to advancements in software testing and benefit developers, users, practitioners, and the wider research community.

6.2.1 Exploring LLMs in the Early Stage of Testing
As shown in Figure 4, LLMs have not been used in the early stages of testing, e.g., test requirements and test planning. There might be two main reasons behind this. The first is the subjectivity of early-stage testing tasks. Many tasks in the early stages of testing, such as requirements gathering, test plan creation, and design reviews, involve subjective assessments that require significant input from human experts, which makes them less suitable for LLMs that rely heavily on data-driven approaches. The second might be the lack of open-sourced data from the early stages. Unlike the later stages of testing, there may be limited data available online from early-stage activities, which could mean that LLMs have not seen much of this type of data and therefore may not perform well on these tasks.

Adopting a human-computer interaction schema for tackling early-stage testing tasks would harness the domain-specific knowledge of human developers and leverage the general knowledge embedded in LLMs. Additionally, it is highly encouraged for software development companies to record and provide access to early-stage testing data, allowing for improved training and performance of LLMs in these critical testing activities.

6.2.2 Exploring LLMs in Other Testing Phases
We have analyzed the distribution of testing phases in the collected studies. As shown in Fig. 11, LLMs are most commonly used in unit testing, followed by system testing. However, there is still no research on the use of LLMs in integration testing and acceptance testing.

[Fig. 11: Distribution of testing phases (paper counts over unit testing, integration testing, system testing, and acceptance testing); note that we omit the studies which do not explicitly specify the testing phases, e.g., program repair.]

Integration testing involves testing the interfaces between different software modules. In some software organizations, integration testing might be merged with unit testing, which can be one reason why LLMs are rarely utilized in integration testing. Another reason might be that the size and complexity of the input data in this setting, e.g., the source code of all involved software modules, may exceed the capacity of the LLM to process and analyze, which can lead to errors or unreliable results. To tackle this, a potential reference can be found in Section 4.1, where Xie et al. [36] design a method to organize the necessary information within the pre-defined maximum prompt token limit of the LLM. Furthermore, integration testing requires diversified data to be generated to sufficiently exercise the interfaces among multiple modules. As mentioned in Section 4.3, previous work has demonstrated the LLM's capability to generate diversified test inputs for system testing in conjunction with mutation testing techniques [48], [59], and these can provide insights into generating diversified interface data for integration testing.
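As a rough illustration of such budget-aware prompt construction, the minimal Python sketch below greedily packs module interface descriptions into a fixed token budget before querying the model. It is not the actual method of Xie et al. [36]; the token counter is a crude word-count placeholder, and query_llm stands in for a real LLM client.

# Minimal sketch: pack interface descriptions of the modules under test into a
# prompt while respecting a maximum token budget (all helpers are placeholders).
MAX_PROMPT_TOKENS = 3000

def count_tokens(text: str) -> int:
    # Placeholder: approximate the token count by whitespace-separated words.
    return len(text.split())

def build_integration_prompt(interfaces: list[str], task: str) -> str:
    """Greedily add interface descriptions until the token budget is reached."""
    header = f"Task: {task}\nRelevant module interfaces:\n"
    budget = MAX_PROMPT_TOKENS - count_tokens(header)
    selected = []
    for desc in interfaces:          # assume interfaces are pre-ranked by relevance
        cost = count_tokens(desc)
        if cost > budget:
            break                    # stop once the remaining budget is exhausted
        selected.append(desc)
        budget -= cost
    return header + "\n".join(selected)

prompt = build_integration_prompt(
    interfaces=["def create_order(user_id: int, items: list) -> Order: ...",
                "def charge(order: Order, card: str) -> Receipt: ..."],
    task="Generate integration tests exercising the order-to-payment interface.",
)
# response = query_llm(prompt)   # hypothetical LLM client call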
Acceptance testing is usually conducted by business analysts or end-users to validate the system's functionality and usability. It requires more non-technical language and domain-specific knowledge, which makes it challenging to apply LLMs effectively. Since acceptance testing involves humans, it is well-suited to a human-in-the-loop schema with LLMs. This has been studied in traditional machine learning [155], but has not yet been explored with LLMs. Specifically, the LLMs can be responsible for automatically generating test cases, evaluating test coverage, etc., while human testers are responsible for checking the program's behavior and verifying the test oracles.
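A minimal sketch of such a division of labor is given below: the LLM drafts candidate acceptance test cases from a user story, while a human reviewer decides which expected outcomes (the oracles) to keep. The query_llm helper is a hypothetical placeholder for a real LLM client, and the review step is deliberately simplistic.

# Minimal sketch: LLM-drafted acceptance tests with human oracle verification.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for a real LLM client")

def draft_acceptance_tests(user_story: str, n: int = 3) -> list[str]:
    prompt = (f"User story:\n{user_story}\n"
              f"Write {n} acceptance test cases, one per line, "
              "in the form 'steps -> expected outcome'.")
    return [line for line in query_llm(prompt).splitlines() if line.strip()]

def human_review(candidates: list[str]) -> list[str]:
    approved = []
    for case in candidates:
        answer = input(f"Keep this test case? [y/n]\n{case}\n> ")
        if answer.strip().lower() == "y":
            approved.append(case)    # only human-verified oracles are kept
    return approved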
6.2.3 Exploring LLMs for More Types of Software
We analyze what types of software have been explored in the collected studies, as shown in Figure 5. Note that, since a large portion of the studies focus on unit testing or program repair, they are conducted on publicly available datasets and do not involve specific software types.

From the analysis in Section 4.3, the LLM can generate not only the source code for testing DL libraries but also the textual input for testing mobile apps, and even the models for testing CPS. Overall, the LLM provides a flexible and powerful framework for generating test inputs for a wide range of applications, and its versatility would make it useful for testing software in other domains.

From one point of view, some proposed techniques can be applied to other types of software. For example, the paper on testing deep learning libraries [58] proposes techniques for generating diversified, complicated, and human-like DL programs, and its authors state that the approach can easily be extended to test software systems from other application domains, e.g., interpreters, database systems, and other popular libraries. Beyond that, there are already studies on universal fuzzing techniques [52], [67], which are designed to be adaptable and applicable to different types of test inputs and software.

From another point of view, other types of software can also benefit from the capabilities of LLMs to design testing techniques that are better suited to their specific domain and characteristics. For instance, the metaverse, with its immersive virtual environments and complex interactions, presents unique challenges for software testing. LLMs could be leveraged to generate diverse and realistic inputs that mimic user behavior and interactions within the metaverse, which has not yet been explored.

6.2.4 Exploring LLMs for Non-functional Testing
In our collected studies, LLMs are primarily used for functional testing; there is no practice yet in performance testing, usability testing, or other non-functional testing. One possible reason for the prevalence of LLM-based solutions in functional testing is that they can convert functional testing problems into code generation or natural language generation problems [14], [59], which LLMs are particularly adept at solving.

On the other hand, performance testing and usability testing may require more specialized models that are designed to detect and analyze specific types of data, handle complex statistical analyses, or determine the criteria for judging buggy behavior. Moreover, there are already dozens of performance testing tools (e.g., LoadRunner [156]) that can generate workloads simulating real-world usage scenarios and achieve relatively satisfactory performance.

A potential opportunity is to let the LLM orchestrate these performance testing tools, acting much like LangChain [157], to better simulate different types of workloads based on real user behavior. Furthermore, the LLM could identify the parameter combinations and values that have the highest potential to trigger performance problems; this is essentially a way to rank and prioritize different parameter settings based on their impact on performance, and thus to improve the efficiency of performance testing.
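For illustration only, the sketch below asks an LLM to score candidate workload settings and then keeps the highest-ranked ones for execution by an existing load-testing tool. The helpers query_llm and run_load_test are hypothetical placeholders, and the JSON-formatted reply is an assumption about model behavior rather than a guarantee.

# Minimal sketch: LLM-based prioritization of workload parameter combinations.
import itertools
import json

def query_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for a real LLM client")

def rank_workloads(param_grid: dict[str, list]) -> list[dict]:
    combos = [dict(zip(param_grid, values))
              for values in itertools.product(*param_grid.values())]
    prompt = ("Score each workload setting from 0 to 10 by how likely it is to "
              "trigger performance problems. Reply with a JSON list of scores.\n"
              + json.dumps(combos))
    scores = json.loads(query_llm(prompt))   # assumes a well-formed JSON reply
    ranked = sorted(zip(scores, combos), key=lambda pair: -pair[0])
    return [combo for _, combo in ranked]

# top_workloads = rank_workloads({"concurrent_users": [10, 100, 1000],
#                                 "payload_kb": [1, 64, 1024]})[:3]
# for workload in top_workloads:
#     run_load_test(workload)   # hypothetical wrapper around a load-testing tool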
6.2.5 Exploring Advanced Prompt Engineering
There are a total of 11 commonly used prompt engineering techniques listed in a popular prompt engineering guide [158], as shown in Figure 12. Currently, only the first five techniques are utilized in our collected studies. The more advanced techniques have not been employed yet and can be explored for prompt design in future work.

[Fig. 12: List of advanced prompt engineering practices and those utilized in the collected papers. The first five techniques (zero-shot learning, few-shot learning, chain-of-thought, self-consistency, and automatic prompt) appear in the collected studies; the remaining, more advanced techniques (such as generate knowledge prompting, tree of thoughts, active prompt, directional stimulus prompting, ReAct prompting, multimodal chain-of-thought, graph prompting, and automatic reasoning and tool use) do not.]

For instance, multimodal chain-of-thought prompting involves using diverse sensory and cognitive cues to stimulate thinking and creativity in LLMs [159]. Providing images (e.g., GUI screenshots) or audio recordings related to the software under test can help the LLM better understand the software's context and potential issues. Besides, one can prompt the LLM to imagine itself in different roles, such as a developer, user, or quality assurance specialist. This perspective-shifting exercise enables the LLM to approach software testing from multiple viewpoints and uncover different aspects that might require attention or investigation.

Graph prompting [160] involves representing information as graphs or visual structures to facilitate understanding and problem-solving. It can be a natural match for software engineering, considering that software involves various dependencies, control flow, data flow, state transitions, and other graph structures. Graph prompting can help the LLM analyze this structural information and comprehend the software under test more effectively. For instance, testers can use graph prompts to visualize test coverage, identify untested areas or paths, and ensure adequate test execution.
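As a small illustration of what a graph prompt could look like in this setting, the sketch below serializes a toy call graph and a coverage set into text before querying the model. The graph, the coverage data, and the query_llm call are all hypothetical.

# Minimal sketch: turn a (toy) call graph plus coverage data into a textual
# "graph prompt" asking the LLM to point out untested call paths.
call_graph = {                       # caller -> callees (illustrative only)
    "checkout": ["validate_cart", "charge_card", "send_receipt"],
    "charge_card": ["retry_payment"],
}
covered = {"checkout", "validate_cart", "charge_card"}

def to_graph_prompt(graph: dict[str, list[str]], covered_nodes: set[str]) -> str:
    edges = [f"{src} -> {dst}" for src, dsts in graph.items() for dst in dsts]
    return ("Call graph edges:\n" + "\n".join(edges) +
            "\nCovered nodes: " + ", ".join(sorted(covered_nodes)) +
            "\nList the uncovered call paths and suggest one test for each.")

prompt = to_graph_prompt(call_graph, covered)
# response = query_llm(prompt)   # hypothetical LLM client call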
6.2.6 Incorporating LLMs with Traditional Techniques
There is currently no clear consensus on the extent to which LLMs can solve software testing problems. From the analysis in Section 5.4, we have seen promising results from studies that combine LLMs with traditional software testing techniques, which implies that LLMs are not a silver bullet for software testing. Considering the availability of many mature software testing techniques and tools, and the limited capabilities of LLMs, it is necessary to explore better ways of combining LLMs with traditional testing or program analysis techniques and tools.
Based on the collected studies, LLMs have been successfully utilized together with various techniques such as differential testing (e.g., [63]), mutation testing (e.g., [59]), and program analysis (e.g., [104]), as shown in Figure 10. From one perspective, future studies can explore improved integration of these traditional techniques with LLMs. Take mutation testing as an example: current practices mainly rely on human-designed mutation rules to mutate the candidate tests and let the LLMs re-generate new tests [38], [59], [67], while Liu et al. directly utilize the LLMs to produce the mutation rules alongside the mutated tests [65]. Further explorations in this direction are of great interest.
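A minimal sketch of the latter style of combination, in which the model is asked to propose mutation rules and apply them to a seed test input, is given below. It is an illustration rather than the approach of [65]; query_llm and run_system_under_test are hypothetical placeholders, and the JSON reply format is an assumption.

# Minimal sketch: ask the LLM for mutation rules and the corresponding mutants
# of a seed test input, instead of relying only on hand-written mutation operators.
import json

def query_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for a real LLM client")

def llm_mutate(seed_input: str, n_rules: int = 5) -> dict:
    prompt = (f"Seed test input:\n{seed_input}\n"
              f"Propose {n_rules} mutation rules that could expose crashes, and "
              "apply each rule to the seed. Reply as JSON with keys "
              '"rules" and "mutants".')
    return json.loads(query_llm(prompt))     # assumes a well-formed JSON reply

# result = llm_mutate('{"username": "alice", "age": 30}')
# for mutant in result["mutants"]:
#     run_system_under_test(mutant)          # hypothetical test harness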
From another point of view, more traditional techniques can be incorporated with LLMs for software testing. For instance, besides the aforementioned techniques, LLMs have been combined with formal verification to move toward self-healing software in the field of software security [161], and more attempts of this kind are encouraged. Moreover, considering the existence of numerous mature software testing tools, one can explore the integration of LLMs with these tools, allowing the LLM to act as a "LangChain"-style coordinator that better exploits the potential of these tools.
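The sketch below illustrates one simple form such orchestration could take, with the LLM choosing which existing tool to invoke next from a fixed menu. The tool names, query_llm, and dispatch are hypothetical placeholders rather than an existing LangChain integration.

# Minimal sketch: let the LLM pick which existing testing tool to run next,
# in a LangChain-like fashion, from a short description of the current situation.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for a real LLM client")

TOOLS = {
    "unit_test_generator": "generate unit tests for a single class or function",
    "fuzzer": "stress parsers or APIs with random or malformed inputs",
    "static_analyzer": "scan source code for suspicious patterns",
}

def choose_tool(situation: str) -> str:
    menu = "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
    prompt = (f"Situation: {situation}\nAvailable tools:\n{menu}\n"
              "Reply with the single most appropriate tool name.")
    return query_llm(prompt).strip()

# tool = choose_tool("A new JSON parsing module was added with no tests yet.")
# dispatch(tool)   # hypothetical wrapper that actually runs the chosen tool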
7 RELATED WORK
A systematic literature review is a crucial means of gaining insight into the current trends and future directions within a particular field. It enables us to understand and stay updated on the developments in that domain.

Wang et al. surveyed machine learning and deep learning techniques for software engineering [162]. Yang et al. and Watson et al. respectively carried out surveys on the use of deep learning in the software engineering domain [163], [164]. Bajammal et al. surveyed the utilization of computer vision techniques to improve software engineering tasks [165]. Zhang et al. provided a survey of techniques for testing machine learning systems [150].

With the advancements of artificial intelligence and LLMs, researchers have also conducted systematic literature reviews about LLMs and their applications in various fields (e.g., software engineering). Zhao et al. [17] reviewed recent advances in LLMs by providing an overview of their background, key findings, and mainstream techniques. They focused on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation; additionally, they summarized the available resources for developing LLMs and discussed the remaining issues for future directions. Hou et al. conducted a systematic literature review on using LLMs for software engineering, with a particular focus on understanding how LLMs can be exploited to optimize processes and outcomes [166]. Fan et al. conducted a survey of LLMs for software engineering and set out open research challenges for the application of LLMs to technical problems faced by software engineers [167]. Zan et al. conducted a survey of existing LLMs for the NL2Code task (i.e., generating code from a natural language description) and reviewed benchmarks and metrics [168].

While these studies either targeted the broader software engineering domain (with a limited focus on software testing tasks) or focused on other software development tasks (excluding software testing), this paper specifically focuses on the use of LLMs for software testing. It surveys related studies, summarizes key challenges and potential opportunities, and serves as a roadmap for future research in this area.

8 CONCLUSION
This paper provides a comprehensive review of the use of LLMs in software testing. We have analyzed relevant studies that have utilized LLMs in software testing from both the software testing and the LLM perspectives, and we have highlighted the challenges and potential opportunities in this direction. The results of this review demonstrate that LLMs have been successfully applied to a wide range of testing tasks, including unit test case generation, test oracle generation, system test input generation, program debugging, and program repair. However, challenges still exist in achieving high testing coverage, addressing the test oracle problem, conducting rigorous evaluations, and applying LLMs in real-world scenarios. Additionally, it is observed that LLMs are commonly used in only a subset of the entire testing lifecycle: they are primarily utilized in the middle and later stages of testing, only serve the unit and system testing phases, and only target functional testing. This highlights research opportunities for exploring the uncovered areas. Regarding how the LLMs are utilized, we find that various pre-training/fine-tuning and prompt engineering methods have been developed to enhance the capabilities of LLMs in addressing testing tasks; however, more advanced prompt design techniques have yet to be explored and can be an avenue for future research.

This survey can serve as a roadmap for future research in this area, identifying gaps in our current understanding of the use of LLMs in software testing and highlighting potential avenues for exploration. We believe that the insights provided in this paper will be valuable to both researchers and practitioners in the field of software engineering, assisting them in leveraging LLMs to improve software testing practices and ultimately enhance the quality and reliability of software systems.

REFERENCES
[1] G. J. Myers, The art of software testing (2. ed.). Wiley, 2004. [Online]. Available: http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0471469122.html
[2] M. Pezzè and M. Young, Software testing and analysis - process, principles and techniques. Wiley, 2007.
[3] M. Harman and P. McMinn, "A theoretical and empirical study of search-based testing: Local, global, and hybrid search," vol. 36, no. 2, 2010, pp. 226-247.
[4] P. Delgado-Pérez, A. Ramírez, K. J. Valle-Gómez, I. Medina-Bulo, and J. R. Romero, "Interevo-tr: Interactive evolutionary test generation with readability assessment," IEEE Trans. Software Eng., vol. 49, no. 4, pp. 2580-2596, 2023.
[5] X. Xiao, S. Li, T. Xie, and N. Tillmann, "Characteristic studies of loop problems for structural test generation via symbolic execution," in 2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013, Silicon Valley, CA, USA, November 11-15, 2013, E. Denney, T. Bultan, and A. Zeller, Eds. IEEE, 2013, pp. 246-256.
[6] C. Pacheco, S. K. Lahiri, M. D. Ernst, and T. Ball, "Feedback-directed random test generation," in 29th International Conference on Software Engineering (ICSE 2007), Minneapolis, MN, USA, May 20-26, 2007. IEEE Computer Society, 2007, pp. 75-84.
[7] Z. Yuan, Y. Lou, M. Liu, S. Ding, K. Wang, Y. Chen, and X. Peng, code generation via test-driven user-intent formalization,” arXiv
“No more manual tests? evaluating and improving chatgpt for preprint arXiv:2208.05950, 2022.
unit test generation,” arXiv preprint arXiv:2305.04207, 2023. [29] S. Alagarsamy, C. Tantithamthavorn, and A. Aleti, “A3test:
[8] Y. Tang, Z. Liu, Z. Zhou, and X. Luo, “Chatgpt vs SBST: Assertion-augmented automated test case generation,” arXiv
A comparative assessment of unit test suite generation,” preprint arXiv:2302.10352, 2023.
CoRR, vol. abs/2307.00588, 2023. [Online]. Available: https: [30] M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, “An empirical eval-
//doi.org/10.48550/arXiv.2307.00588 uation of using large language models for automated unit test
[9] A. Developers, “Ui/application exerciser monkey,” 2012. generation,” IEEE Transactions on Software Engineering, pp. 1–21,
[10] Y. Li, Z. Yang, Y. Guo, and X. Chen, “Droidbot: a lightweight ui- 2023.
guided test input generator for android,” in ICSE. IEEE, 2017. [31] V. Guilherme and A. Vincenzi, “An initial investigation
[11] T. Su, G. Meng, Y. Chen, K. Wu, W. Yang, Y. Yao, G. Pu, Y. Liu, and of chatgpt unit test generation capability,” in 8th Brazilian
Z. Su, “Guided, stochastic model-based gui testing of android Symposium on Systematic and Automated Software Testing, SAST
apps,” in Proceedings of the 2017 11th Joint Meeting on Foundations 2023, Campo Grande, MS, Brazil, September 25-29, 2023, A. L.
of Software Engineering, 2017, pp. 245–256. Fontão, D. M. B. Paiva, H. Borges, M. I. Cagnin, P. G.
[12] Z. Dong, M. Böhme, L. Cojocaru, and A. Roychoudhury, “Time- Fernandes, V. Borges, S. M. Melo, V. H. S. Durelli, and E. D.
travel testing of android apps,” in ICSE. IEEE, 2020. Canedo, Eds. ACM, 2023, pp. 15–24. [Online]. Available:
[13] M. Pan, A. Huang, G. Wang, T. Zhang, and X. Li, “Reinforcement https://doi.org/10.1145/3624032.3624035
learning based curiosity-driven testing of android applications,” [32] S. Hashtroudi, J. Shin, H. Hemmati, and S. Wang,
in Proceedings of the 29th ACM SIGSOFT International Symposium “Automated test case generation using code models and
on Software Testing and Analysis, 2020, pp. 153–164. domain adaptation,” CoRR, vol. abs/2308.08033, 2023. [Online].
[14] Z. Liu, C. Chen, J. Wang, M. Chen, B. Wu, X. Che, D. Wang, Available: https://doi.org/10.48550/arXiv.2308.08033
and Q. Wang, “Make LLM a testing expert: Bringing human- [33] L. Plein, W. C. Ouédraogo, J. Klein, and T. F. Bissyandé,
like interaction to mobile GUI testing via functionality-aware “Automatic generation of test cases based on bug reports:
decisions,” CoRR, vol. abs/2310.15780, 2023. [Online]. Available: a feasibility study with large language models,” CoRR, vol.
https://doi.org/10.48550/arXiv.2310.15780 abs/2310.06320, 2023. [Online]. Available: https://doi.org/10.
[15] T. Su, J. Wang, and Z. Su, “Benchmarking automated GUI testing 48550/arXiv.2310.06320
for android against real-world bugs,” in ESEC/FSE ’21: 29th ACM [34] V. Vikram, C. Lemieux, and R. Padhye, “Can large
Joint European Software Engineering Conference and Symposium on language models write good property-based tests?” CoRR,
the Foundations of Software Engineering, Athens, Greece, August 23- vol. abs/2307.04346, 2023. [Online]. Available: https://doi.org/
28, 2021. ACM, 2021, pp. 119–130. 10.48550/arXiv.2307.04346
[16] M. Shanahan, “Talking about large language models,” [35] N. Rao, K. Jain, U. Alon, C. L. Goues, and V. J. Hellendoorn,
CoRR, vol. abs/2212.03551, 2022. [Online]. Available: https: “CAT-LM training language models on aligned code and
//doi.org/10.48550/arXiv.2212.03551 tests,” in 38th IEEE/ACM International Conference on Automated
[17] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Software Engineering, ASE 2023, Luxembourg, September 11-
Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, 15, 2023. IEEE, 2023, pp. 409–420. [Online]. Available:
Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, https://doi.org/10.1109/ASE56229.2023.00193
P. Liu, J. Nie, and J. Wen, “A survey of large language [36] Z. Xie, Y. Chen, C. Zhi, S. Deng, and J. Yin, “Chatunitest: a
models,” CoRR, vol. abs/2303.18223, 2023. [Online]. Available: chatgpt-based automated unit test generation tool,” arXiv preprint
https://doi.org/10.48550/arXiv.2303.18223 arXiv:2305.04764, 2023.
[18] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and [37] C. Lemieux, J. P. Inala, S. K. Lahiri, and S. Sen, “Codamosa:
Y. Iwasawa, “Large language models are zero- Escaping coverage plateaus in test generation with pre-trained
shot reasoners,” in NeurIPS, 2022. [Online]. Avail- large language models,” in International conference on software
able: http://papers.nips.cc/paper files/paper/2022/hash/ engineering (ICSE), 2023.
8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html [38] A. M. Dakhel, A. Nikanjam, V. Majdinasab, F. Khomh,
[19] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, and M. C. Desmarais, “Effective test generation using
F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, pre-trained large language models and mutation testing,”
“Chain-of-thought prompting elicits reasoning in large CoRR, vol. abs/2308.16557, 2023. [Online]. Available: https:
language models,” in NeurIPS, 2022. [Online]. Avail- //doi.org/10.48550/arXiv.2308.16557
able: http://papers.nips.cc/paper files/paper/2022/hash/ [39] M. L. Siddiq, J. Santos, R. H. Tanvir, N. Ulfat, F. A. Rifat, and V. C.
9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html Lopes, “Exploring the effectiveness of large language models in
[20] J. Li, G. Li, Y. Li, and Z. Jin, “Structured chain-of-thought generating unit tests,” arXiv preprint arXiv:2305.00418, 2023.
prompting for code generation,” 2023. [Online]. Available: [40] Y. Zhang, W. Song, Z. Ji, D. Yao, and N. Meng, “How well does
https://api.semanticscholar.org/CorpusID:258615421 LLM generate security tests?” CoRR, vol. abs/2310.00710, 2023.
[21] J. Li, Y. Li, G. Li, Z. Jin, Y. Hao, and X. Hu, “Skcoder: A [Online]. Available: https://doi.org/10.48550/arXiv.2310.00710
sketch-based approach for automatic code generation,” in 2023 [41] V. Li and N. Doiron, “Prompting code interpreter to write better
IEEE/ACM 45th International Conference on Software Engineering unit tests on quixbugs functions,” CoRR, vol. abs/2310.00483,
(ICSE), 2023, pp. 2124–2135. 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.
[22] J. Li, Y. Zhao, Y. Li, G. Li, and Z. Jin, “Acecoder: Utilizing existing 00483
code to enhance code generation,” 2023. [Online]. Available: [42] B. Steenhoek, M. Tufano, N. Sundaresan, and A. Svyatkovskiy,
https://api.semanticscholar.org/CorpusID:257901190 “Reinforcement learning from automatic feedback for high-
[23] Y. Dong, X. Jiang, Z. Jin, and G. Li, “Self-collaboration quality unit test generation,” 2023.
code generation via chatgpt,” CoRR, vol. abs/2304.07590, 2023. [43] S. Bhatia, T. Gandhi, D. Kumar, and P. Jalote, “Unit test generation
[Online]. Available: https://doi.org/10.48550/arXiv.2304.07590 using generative ai : A comparative performance analysis of
[24] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu, autogeneration tools,” 2023.
“Unifying large language models and knowledge graphs: A [44] M. Tufano, D. Drain, A. Svyatkovskiy, and N. Sundaresan,
roadmap,” CoRR, vol. abs/2306.08302, 2023. [Online]. Available: “Generating accurate assert statements for unit test cases using
https://doi.org/10.48550/arXiv.2306.08302 pretrained transformers,” in Proceedings of the 3rd ACM/IEEE
[25] G. J. Myers, T. Badgett, T. M. Thomas, and C. Sandler, The art of International Conference on Automation of Software Test, 2022, pp.
software testing. Wiley Online Library, 2004, vol. 2. 54–64.
[26] M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, and N. Sun- [45] P. Nie, R. Banerjee, J. J. Li, R. J. Mooney, and M. Gligoric,
daresan, “Unit test case generation with transformers and focal “Learning deep semantics for test completion,” arXiv preprint
context,” arXiv preprint arXiv:2009.05617, 2020. arXiv:2302.10166, 2023.
[27] B. Chen, F. Zhang, A. Nguyen, D. Zan, Z. Lin, J.-G. Lou, and [46] A. Mastropaolo, N. Cooper, D. Nader-Palacio, S. Scalabrino,
W. Chen, “Codet: Code generation with generated tests,” arXiv D. Poshyvanyk, R. Oliveto, and G. Bavota, “Using transfer
preprint arXiv:2207.10397, 2022. learning for code-related tasks,” IEEE Trans. Software Eng.,
[28] S. K. Lahiri, A. Naik, G. Sakkas, P. Choudhury, C. von Veh, vol. 49, no. 4, pp. 1580–1598, 2023. [Online]. Available:
M. Musuvathi, J. P. Inala, C. Wang, and J. Gao, “Interactive https://doi.org/10.1109/TSE.2022.3183297
[47] N. Nashid, M. Sintaha, and A. Mesbah, “Retrieval-based prompt model,” CoRR, vol. abs/2310.15657, 2023. [Online]. Available:
selection for code-related few-shot learning,” in Proceedings of https://doi.org/10.48550/arXiv.2310.15657
the 45th International Conference on Software Engineering (ICSE’23), [66] C. Zhang, M. Bai, Y. Zheng, Y. Li, X. Xie, Y. Li, W. Ma, L. Sun,
2023. and Y. Liu, “Understanding large language model based fuzz
[48] G. Ye, Z. Tang, S. H. Tan, S. Huang, D. Fang, X. Sun, L. Bian, driver generation,” CoRR, vol. abs/2307.12469, 2023. [Online].
H. Wang, and Z. Wang, “Automated conformance testing for Available: https://doi.org/10.48550/arXiv.2307.12469
javascript engines via deep compiler fuzzing,” in Proceedings of [67] C. Xia, M. Paltenghi, J. Tian, M. Pradel, and L. Zhang,
the 42nd ACM SIGPLAN international conference on programming “Universal fuzzing via large language models,” ArXiv,
language design and implementation, 2021, pp. 435–450. vol. abs/2308.04748, 2023. [Online]. Available: https://api.
[49] Z. Liu, C. Chen, J. Wang, X. Che, Y. Huang, J. Hu, and Q. Wang, semanticscholar.org/CorpusID:260735598
“Fill in the blank: Context-aware automated text input generation [68] C. Tsigkanos, P. Rani, S. Müller, and T. Kehrer, “Variable
for mobile gui testing,” arXiv preprint arXiv:2212.04732, 2022. discovery with large language models for metamorphic testing
[50] M. R. Taesiri, F. Macklon, Y. Wang, H. Shen, and C.-P. Bezemer, of scientific software,” in Computational Science - ICCS 2023 -
“Large language models are pretty good zero-shot video game 23rd International Conference, Prague, Czech Republic, July 3-5,
bug detectors,” arXiv preprint arXiv:2210.02506, 2022. 2023, Proceedings, Part I, ser. Lecture Notes in Computer
[51] S. L. Shrestha and C. Csallner, “Slgpt: using transfer learning Science, J. Mikyska, C. de Mulatier, M. Paszynski, V. V.
to directly generate simulink model files and find bugs in the Krzhizhanovskaya, J. J. Dongarra, and P. M. A. Sloot, Eds.,
simulink toolchain,” in Evaluation and Assessment in Software vol. 14073. Springer, 2023, pp. 321–335. [Online]. Available:
Engineering, 2021, pp. 260–265. https://doi.org/10.1007/978-3-031-35995-8 23
[52] J. Hu, Q. Zhang, and H. Yin, “Augmenting greybox fuzzing [69] C. Yang, Y. Deng, R. Lu, J. Yao, J. Liu, R. Jabbarvand, and
with generative AI,” CoRR, vol. abs/2306.06782, 2023. [Online]. L. Zhang, “White-box compiler fuzzing empowered by large
Available: https://doi.org/10.48550/arXiv.2306.06782 language models,” CoRR, vol. abs/2310.15991, 2023. [Online].
[53] A. Mathur, S. Pradhan, P. Soni, D. Patel, and R. Regunathan, Available: https://doi.org/10.48550/arXiv.2310.15991
“Automated test case generation using t5 and gpt-3,” in 2023 9th [70] T. Zhang, I. C. Irsan, F. Thung, D. Han, D. Lo, and L. Jiang,
International Conference on Advanced Computing and Communication “itiger: an automatic issue title generation tool,” in Proceedings
Systems (ICACCS), vol. 1, 2023, pp. 1986–1992. of the 30th ACM Joint European Software Engineering Conference and
Symposium on the Foundations of Software Engineering, 2022, pp.
[54] D. Zimmermann and A. Koziolek, “Automating gui-based soft-
1637–1641.
ware testing with gpt-3,” in 2023 IEEE International Conference
on Software Testing, Verification and Validation Workshops (ICSTW), [71] Y. Huang, J. Wang, Z. Liu, Y. Wang, S. Wang, C. Chen,
2023, pp. 62–65. Y. Hu, and Q. Wang, “Crashtranslator: Automatically
reproducing mobile application crashes directly from stack
[55] M. Taeb, A. Swearngin, E. Schoop, R. Cheng, Y. Jiang, and
trace,” CoRR, vol. abs/2310.07128, 2023. [Online]. Available:
J. Nichols, “Axnav: Replaying accessibility tests from natural
https://doi.org/10.48550/arXiv.2310.07128
language,” CoRR, vol. abs/2310.02424, 2023. [Online]. Available:
[72] T. Zhang, I. C. Irsan, F. Thung, and D. Lo, “Cupid:
https://doi.org/10.48550/arXiv.2310.02424
Leveraging chatgpt for more accurate duplicate bug report
[56] Q. Luu, H. Liu, and T. Y. Chen, “Can chatgpt advance software detection,” CoRR, vol. abs/2308.10022, 2023. [Online]. Available:
testing intelligence? an experience report on metamorphic https://doi.org/10.48550/arXiv.2308.10022
testing,” CoRR, vol. abs/2310.19204, 2023. [Online]. Available:
[73] U. Mukherjee and M. M. Rahman, “Employing deep
https://doi.org/10.48550/arXiv.2310.19204
learning and structured information retrieval to answer
[57] A. Khanfir, R. Degiovanni, M. Papadakis, and Y. L. Traon, “Ef- clarification questions on bug reports,” 2023. [Online]. Available:
ficient mutation testing via pre-trained language models,” arXiv https://api.semanticscholar.org/CorpusID:259501524
preprint arXiv:2301.03543, 2023. [74] P. Mahbub, O. Shuvo, and M. M. Rahman, “Explaining software
[58] Y. Deng, C. S. Xia, C. Yang, S. D. Zhang, S. Yang, and L. Zhang, bugs leveraging code structures in neural machine translation,”
“Large language models are edge-case fuzzers: Testing deep arXiv preprint arXiv:2212.04584, 2022.
learning libraries via fuzzgpt,” arXiv preprint arXiv:2304.02014, [75] S. Feng and C. Chen, “Prompting is all your need:
2023. Automated android bug replay with large language models,”
[59] ——, “Large language models are zero shot fuzzers: Fuzzing CoRR, vol. abs/2306.01987, 2023. [Online]. Available: https:
deep learning libraries via large language models,” arXiv preprint //doi.org/10.48550/arXiv.2306.01987
arXiv:2209.11515, 2023. [76] Y. Su, Z. Han, Z. Gao, Z. Xing, Q. Lu, and X. Xu, “Still
[60] J. Ackerman and G. Cybenko, “Large language models for confusing for bug-component triaging? deep feature learning
fuzzing parsers (registered report),” in Proceedings of the and ensemble setting to rescue,” in 31st IEEE/ACM International
2nd International Fuzzing Workshop, FUZZING 2023, Seattle, Conference on Program Comprehension, ICPC 2023, Melbourne,
WA, USA, 17 July 2023, M. Böhme, Y. Noller, B. Ray, and Australia, May 15-16, 2023. IEEE, 2023, pp. 316–327. [Online].
L. Szekeres, Eds. ACM, 2023, pp. 31–38. [Online]. Available: Available: https://doi.org/10.1109/ICPC58990.2023.00046
https://doi.org/10.1145/3605157.3605173 [77] N. D. Bui, Y. Wang, and S. Hoi, “Detect-localize-repair: A unified
[61] S. Yu, C. Fang, Y. Ling, C. Wu, and Z. Chen, “LLM for framework for learning to debug with codet5,” arXiv preprint
test script generation and migration: Challenges, capabilities, arXiv:2211.14875, 2022.
and opportunities,” CoRR, vol. abs/2309.13574, 2023. [Online]. [78] S. Kang, J. Yoon, and S. Yoo, “Large language models are few-shot
Available: https://doi.org/10.48550/arXiv.2309.13574 testers: Exploring llm-based general bug reproduction,” arXiv
[62] G. Deng, Y. Liu, V. M. Vilches, P. Liu, Y. Li, Y. Xu, preprint arXiv:2209.11515, 2022.
T. Zhang, Y. Liu, M. Pinzger, and S. Rass, “Pentestgpt: [79] S. Kang, G. An, and S. Yoo, “A preliminary evaluation of
An llm-empowered automatic penetration testing tool,” llm-based fault localization,” CoRR, vol. abs/2308.05487, 2023.
CoRR, vol. abs/2308.06782, 2023. [Online]. Available: https: [Online]. Available: https://doi.org/10.48550/arXiv.2308.05487
//doi.org/10.48550/arXiv.2308.06782 [80] P. Widjojo and C. Treude, “Addressing compiler errors: Stack
[63] M. Sun, Y. Yang, Y. Wang, M. Wen, H. Jia, and Y. Zhou, overflow or large language models?” CoRR, vol. abs/2307.10793,
“SMT solver validation empowered by large pre-trained 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2307.
language models,” in 38th IEEE/ACM International Conference on 10793
Automated Software Engineering, ASE 2023, Luxembourg, September [81] L. Plein and T. F. Bissyandé, “Can llms demystify bug
11-15, 2023. IEEE, 2023, pp. 1288–1300. [Online]. Available: reports?” CoRR, vol. abs/2310.06310, 2023. [Online]. Available:
https://doi.org/10.1109/ASE56229.2023.00180 https://doi.org/10.48550/arXiv.2310.06310
[64] Y. Deng, J. Yao, Z. Tu, X. Zheng, M. Zhang, and T. Zhang, [82] A. Taylor, A. Vassar, J. Renzella, and H. A. Pearce, “Dcc
“Target: Automated scenario generation from traffic rules –help: Generating context-aware compiler error explanations
for testing autonomous vehicles,” 2023. [Online]. Available: with large language models,” 2023. [Online]. Available:
https://api.semanticscholar.org/CorpusID:258588387 https://api.semanticscholar.org/CorpusID:261076439
[65] Z. Liu, C. Chen, J. Wang, M. Chen, B. Wu, X. Che, [83] S. Kang, B. Chen, S. Yoo, and J.-G. Lou, “Explainable automated
D. Wang, and Q. Wang, “Testing the limits: Unusual text inputs debugging via large language model-driven scientific debug-
generation for mobile app crash detection with large language ging,” arXiv preprint arXiv:2304.02195, 2023.
[84] A. Z. H. Yang, R. Martins, C. L. Goues, and V. J. [102] S. Moon, Y. Song, H. Chae, D. Kang, T. Kwon, K. T. iunn Ong,
Hellendoorn, “Large language models for test-free fault S. won Hwang, and J. Yeo, “Coffee: Boost your code llms by
localization,” CoRR, vol. abs/2310.01726, 2023. [Online]. fixing bugs with feedback,” 2023.
Available: https://doi.org/10.48550/arXiv.2310.01726 [103] Y. Wei, C. S. Xia, and L. Zhang, “Copiloting the copilots:
[85] Y. Wu, Z. Li, J. M. Zhang, M. Papadakis, M. Harman, Fusing large language models with completion engines for
and Y. Liu, “Large language models in fault localisation,” automated program repair,” in Proceedings of the 31st ACM Joint
CoRR, vol. abs/2308.15276, 2023. [Online]. Available: https: European Software Engineering Conference and Symposium on the
//doi.org/10.48550/arXiv.2308.15276 Foundations of Software Engineering, ESEC/FSE 2023, San Francisco,
[86] H. Tu, Z. Zhou, H. Jiang, I. N. B. Yusuf, Y. Li, and L. Jiang, CA, USA, December 3-9, 2023, S. Chandra, K. Blincoe, and
“LLM4CBI: taming llms to generate effective test programs P. Tonella, Eds. ACM, 2023, pp. 172–184. [Online]. Available:
for compiler bug isolation,” CoRR, vol. abs/2307.00593, 2023. https://doi.org/10.1145/3611643.3616271
[Online]. Available: https://doi.org/10.48550/arXiv.2307.00593 [104] Y. Peng, S. Gao, C. Gao, Y. Huo, and M. R. Lyu, “Domain
[87] T.-O. Li, W. Zong, Y. Wang, H. Tian, Y. Wang, S.-C. Cheung, knowledge matters: Improving prompts with fix templates for
and J. Kramer, “Nuances are the key: Unlocking chatgpt to repairing python type errors,” CoRR, vol. abs/2306.01394, 2023.
find failure-inducing tests with differential prompting,” in 2023 [Online]. Available: https://doi.org/10.48550/arXiv.2306.01394
38th IEEE/ACM International Conference on Automated Software [105] A. E. I. Brownlee, J. Callan, K. Even-Mendoza, A. Geiger,
Engineering (ASE), 2023, pp. 14–26. C. Hanna, J. Petke, F. Sarro, and D. Sobania, “Enhancing
[88] X. Chen, M. Lin, N. Schärli, and D. Zhou, “Teaching large genetic improvement mutations using large language models,”
language models to self-debug,” CoRR, vol. abs/2304.05128, 2023. in Search-Based Software Engineering - 15th International
[Online]. Available: https://doi.org/10.48550/arXiv.2304.05128 Symposium, SSBSE 2023, San Francisco, CA, USA, December
8, 2023, Proceedings, ser. Lecture Notes in Computer
[89] J. Cao, M. Li, M. Wen, and S.-c. Cheung, “A study on prompt
Science, P. Arcaini, T. Yue, and E. M. Fredericks, Eds.,
design, advantages and limitations of chatgpt for deep learning
vol. 14415. Springer, 2023, pp. 153–159. [Online]. Available:
program repair,” arXiv preprint arXiv:2304.08191, 2023.
https://doi.org/10.1007/978-3-031-48796-5 13
[90] H. Pearce, B. Tan, B. Ahmad, R. Karri, and B. Dolan-Gavitt, [106] M. M. A. Haque, W. U. Ahmad, I. Lourentzou, and C. Brown,
“Examining zero-shot vulnerability repair with large language “Fixeval: Execution-based evaluation of program fixes for
models,” in 2023 IEEE Symposium on Security and Privacy (SP). programming problems,” in IEEE/ACM International Workshop on
IEEE Computer Society, 2022, pp. 1–18. Automated Program Repair, APR@ICSE 2023, Melbourne, Australia,
[91] Z. Fan, X. Gao, A. Roychoudhury, and S. H. Tan, “Automated May 16, 2023. IEEE, 2023, pp. 11–18. [Online]. Available:
repair of programs from large language models,” arXiv preprint https://doi.org/10.1109/APR59189.2023.00009
arXiv:2205.10583, 2022. [107] B. Ahmad, S. Thakur, B. Tan, R. Karri, and H. Pearce, “Fixing
[92] Y. Hu, X. Shi, Q. Zhou, and L. Pike, “Fix bugs with trans- hardware security bugs with large language models,” arXiv
former through a neural-symbolic edit grammar,” arXiv preprint preprint arXiv:2302.01215, 2023.
arXiv:2204.06643, 2022. [108] P. Deligiannis, A. Lal, N. Mehrotra, and A. Rastogi, “Fixing rust
[93] C. S. Xia, Y. Wei, and L. Zhang, “Practical program repair in compilation errors using llms,” CoRR, vol. abs/2308.05177, 2023.
the era of large pre-trained language models,” arXiv preprint [Online]. Available: https://doi.org/10.48550/arXiv.2308.05177
arXiv:2210.14179, 2022. [109] F. Ribeiro, R. Abreu, and J. Saraiva, “Framing program repair
[94] J. Zhang, J. Cambronero, S. Gulwani, V. Le, R. Piskac, G. Soares, as code completion,” in Proceedings of the Third International
and G. Verbruggen, “Repairing bugs in python assignments Workshop on Automated Program Repair, 2022, pp. 38–45.
using large language models,” arXiv preprint arXiv:2209.14876, [110] N. Wadhwa, J. Pradhan, A. Sonwane, S. P. Sahu, N. Natarajan,
2022. A. Kanade, S. Parthasarathy, and S. K. Rajamani, “Frustrated with
[95] M. Lajkó, V. Csuvik, and L. Vidács, “Towards javascript program code quality issues? llms can help!” CoRR, vol. abs/2309.12938,
repair with generative pre-trained transformer (gpt-2),” in Pro- 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2309.
ceedings of the Third International Workshop on Automated Program 12938
Repair, 2022, pp. 61–68. [111] F. Ribeiro, J. N. C. de Macedo, K. Tsushima, R. Abreu,
[96] D. Sobania, M. Briesch, C. Hanna, and J. Petke, “An analysis of and J. Saraiva, “Gpt-3-powered type error debugging:
the automatic bug fixing performance of chatgpt,” arXiv preprint Investigating the use of large language models for code
arXiv:2301.08653, 2023. repair,” in Proceedings of the 16th ACM SIGPLAN International
[97] K. Huang, X. Meng, J. Zhang, Y. Liu, W. Wang, S. Li, Conference on Software Language Engineering, SLE 2023, Cascais,
and Y. Zhang, “An empirical study on fine-tuning large Portugal, October 23-24, 2023, J. Saraiva, T. Degueule, and
language models of code for automated program repair,” E. Scott, Eds. ACM, 2023, pp. 111–124. [Online]. Available:
in 38th IEEE/ACM International Conference on Automated https://doi.org/10.1145/3623476.3623522
Software Engineering, ASE 2023, Luxembourg, September 11- [112] Y. Wu, N. Jiang, H. V. Pham, T. Lutellier, J. Davis, L. Tan,
15, 2023. IEEE, 2023, pp. 1162–1174. [Online]. Available: P. Babkin, and S. Shah, “How effective are neural networks for
https://doi.org/10.1109/ASE56229.2023.00181 fixing security vulnerabilities,” arXiv preprint arXiv:2305.18607,
[98] M. C. Wuisang, M. Kurniawan, K. A. Wira Santosa, A. Agung 2023.
Santoso Gunawan, and K. E. Saputra, “An evaluation of the [113] N. Jiang, K. Liu, T. Lutellier, and L. Tan, “Impact of code
effectiveness of openai’s chatgpt for automated python program language models on automated program repair,” arXiv preprint
bug fixing using quixbugs,” in 2023 International Seminar on Appli- arXiv:2302.05020, 2023.
cation for Technology of Information and Communication (iSemantic), [114] M. Jin, S. Shahriar, M. Tufano, X. Shi, S. Lu, N. Sundaresan,
2023, pp. 295–300. and A. Svyatkovskiy, “Inferfix: End-to-end program repair with
[99] D. Horváth, V. Csuvik, T. Gyimóthy, and L. Vidács, llms,” arXiv preprint arXiv:2303.07263, 2023.
“An extensive study on model architecture and program [115] C. S. Xia and L. Zhang, “Keep the conversation going: Fixing
representation in the domain of learning-based automated 162 out of 337 bugs for $0.42 each using chatgpt,” arXiv preprint
program repair,” in IEEE/ACM International Workshop on arXiv:2304.00385, 2023.
Automated Program Repair, APR@ICSE 2023, Melbourne, Australia, [116] Y. Zhang, G. Li, Z. Jin, and Y. Xing, “Neural program repair with
May 16, 2023. IEEE, 2023, pp. 31–38. [Online]. Available: program dependence analysis and effective filter mechanism,”
https://doi.org/10.1109/APR59189.2023.00013 arXiv preprint arXiv:2305.09315, 2023.
[100] J. A. Prenner, H. Babii, and R. Robbes, “Can openai’s codex fix [117] J. A. Prenner and R. Robbes, “Out of context: How important is
bugs? an evaluation on quixbugs,” in Proceedings of the Third local context in neural program repair?” 2023.
International Workshop on Automated Program Repair, 2022, pp. 69– [118] Q. Zhang, C. Fang, B. Yu, W. Sun, T. Zhang, and Z. Chen,
75. “Pre-trained model-based automated software vulnerability
[101] W. Yuan, Q. Zhang, T. He, C. Fang, N. Q. V. Hung, X. Hao, and repair: How far are we?” CoRR, vol. abs/2308.12533, 2023.
H. Yin, “Circle: continual repair across programming languages,” [Online]. Available: https://doi.org/10.48550/arXiv.2308.12533
in Proceedings of the 31st ACM SIGSOFT International Symposium [119] S. Garg, R. Z. Moghaddam, and N. Sundaresan, “Rapgen:
on Software Testing and Analysis, 2022, pp. 678–690. An approach for fixing code inefficiencies in zero-shot,”
CoRR, vol. abs/2306.17077, 2023. [Online]. Available: https: N. Novielli, Eds. IEEE, 2023, pp. 678–682. [Online]. Available:
//doi.org/10.48550/arXiv.2306.17077 https://doi.org/10.1109/SANER56733.2023.00070
[120] W. Wang, Y. Wang, S. Joty, and S. C. H. Hoi, “Rap- [134] G. J. Myers, The art of software testing (2. ed.). Wiley,
gen: Retrieval-augmented patch generation with codet5 for 2004. [Online]. Available: http://eu.wiley.com/WileyCDA/
automatic program repair,” in Proceedings of the 31st ACM Joint WileyTitle/productCd-0471469122.html
European Software Engineering Conference and Symposium on the [135] P. Farrell-Vinay, Manage software testing. Auerbach Publ., 2008.
Foundations of Software Engineering, ESEC/FSE 2023, San Francisco, [136] A. Mili and F. Tchier, Software testing: Concepts and operations.
CA, USA, December 3-9, 2023, S. Chandra, K. Blincoe, and John Wiley & Sons, 2015.
P. Tonella, Eds. ACM, 2023, pp. 146–158. [Online]. Available: [137] S. Lukasczyk and G. Fraser, “Pynguin: Automated unit
https://doi.org/10.1145/3611643.3616256 test generation for python,” in 44th IEEE/ACM International
[121] Y. Zhang, Z. Jin, Y. Xing, and G. Li, “STEAM: simulating Conference on Software Engineering: Companion Proceedings,
the interactive behavior of programmers for automatic bug ICSE Companion 2022, Pittsburgh, PA, USA, May 22-24,
fixing,” CoRR, vol. abs/2308.14460, 2023. [Online]. Available: 2022. ACM/IEEE, 2022, pp. 168–172. [Online]. Available:
https://doi.org/10.48550/arXiv.2308.14460 https://doi.org/10.1145/3510454.3516829
[122] S. Fakhoury, S. Chakraborty, M. Musuvathi, and S. K. Lahiri, [138] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The
“Towards generating functionally correct code edits from natu- oracle problem in software testing: A survey,” IEEE transactions
ral language issue descriptions,” arXiv preprint arXiv:2304.03816, on software engineering, vol. 41, no. 5, pp. 507–525, 2014.
2023. [139] C. Watson, M. Tufano, K. Moran, G. Bavota, and D. Poshyvanyk,
[123] M. Fu, C. Tantithamthavorn, T. Le, V. Nguyen, and D. Phung, “On learning meaningful assert statements for unit test cases,”
“Vulrepair: a t5-based automated software vulnerability repair,” in ICSE ’20: 42nd International Conference on Software Engineering,
in Proceedings of the 30th ACM Joint European Software Engineering Seoul, South Korea, 27 June - 19 July, 2020, G. Rothermel and D. Bae,
Conference and Symposium on the Foundations of Software Engineer- Eds. ACM, 2020, pp. 1398–1409.
ing, 2022, pp. 935–947. [140] Y. He, L. Zhang, Z. Yang, Y. Cao, K. Lian, S. Li, W. Yang, Z. Zhang,
[124] S. Gao, X. Wen, C. Gao, W. Wang, H. Zhang, and M. Yang, Y. Zhang, and H. Duan, “Textexerciser: Feedback-driven
M. R. Lyu, “What makes good in-context demonstrations text input exercising for android applications,” in 2020 IEEE
for code intelligence tasks with llms?” in 38th IEEE/ACM Symposium on Security and Privacy, SP 2020, San Francisco, CA,
International Conference on Automated Software Engineering, ASE USA, May 18-21, 2020. IEEE, 2020, pp. 1071–1087.
2023, Luxembourg, September 11-15, 2023. IEEE, 2023, pp. 761– [141] A. Wei, Y. Deng, C. Yang, and L. Zhang, “Free lunch for test-
773. [Online]. Available: https://doi.org/10.1109/ASE56229. ing: Fuzzing deep-learning libraries from open source,” in 44th
2023.00109 IEEE/ACM 44th International Conference on Software Engineering,
[125] C. Treude and H. Hata, “She elicits requirements and he ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022. ACM, 2022,
tests: Software engineering gender bias in large language pp. 995–1007.
models,” CoRR, vol. abs/2303.10131, 2023. [Online]. Available: [142] D. Xie, Y. Li, M. Kim, H. V. Pham, L. Tan, X. Zhang, and M. W.
https://doi.org/10.48550/arXiv.2303.10131 Godfrey, “Docter: documentation-guided fuzzing for testing
deep learning API functions,” in ISSTA ’22: 31st ACM SIGSOFT
[126] R. Kocielnik, S. Prabhumoye, V. Zhang, R. M. Alvarez, and
International Symposium on Software Testing and Analysis, Virtual
A. Anandkumar, “Autobiastest: Controllable sentence generation
Event, South Korea, July 18 - 22, 2022, S. Ryu and Y. Smaragdakis,
for automated and open-ended social bias testing in language
Eds. ACM, 2022, pp. 176–188.
models,” CoRR, vol. abs/2302.07371, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2302.07371 [143] Q. Guo, X. Xie, Y. Li, X. Zhang, Y. Liu, X. Li, and C. Shen,
“Audee: Automated testing for deep learning frameworks,” in
[127] M. Ciniselli, L. Pascarella, and G. Bavota, “To what extent do 35th IEEE/ACM International Conference on Automated Software
deep learning-based code recommenders generate predictions Engineering, ASE 2020, Melbourne, Australia, September 21-25, 2020.
by cloning code from the training set?” in 19th IEEE/ACM IEEE, 2020, pp. 486–498.
International Conference on Mining Software Repositories, MSR 2022,
[144] Z. Wang, M. Yan, J. Chen, S. Liu, and D. Zhang, “Deep learning
Pittsburgh, PA, USA, May 23-24, 2022. ACM, 2022, pp. 167–178.
library testing via effective model generation,” in ESEC/FSE
[Online]. Available: https://doi.org/10.1145/3524842.3528440
’20: 28th ACM Joint European Software Engineering Conference
[128] D. Erhabor, S. Udayashankar, M. Nagappan, and S. Al-Kiswany, and Symposium on the Foundations of Software Engineering, Virtual
“Measuring the runtime performance of code produced with Event, USA, November 8-13, 2020, P. Devanbu, M. B. Cohen, and
github copilot,” CoRR, vol. abs/2305.06439, 2023. [Online]. T. Zimmermann, Eds. ACM, 2020, pp. 788–799.
Available: https://doi.org/10.48550/arXiv.2305.06439 [145] J. Jiang, Y. Xiong, H. Zhang, Q. Gao, and X. Chen, “Shaping
[129] R. Wang, R. Cheng, D. Ford, and T. Zimmermann, “Investigating program repair space with existing patches and similar code,” in
and designing for trust in ai-powered code generation Proceedings of the 27th ACM SIGSOFT International Symposium on
tools,” CoRR, vol. abs/2305.11248, 2023. [Online]. Available: Software Testing and Analysis, ser. ISSTA 2018. New York, NY,
https://doi.org/10.48550/arXiv.2305.11248 USA: Association for Computing Machinery, 2018, p. 298–309.
[130] B. Yetistiren, I. Özsoy, M. Ayerdem, and E. Tüzün, “Evaluating [Online]. Available: https://doi.org/10.1145/3213846.3213871
the code quality of ai-assisted code generation tools: An [146] M. Wen, J. Chen, R. Wu, D. Hao, and S.-C. Cheung, “Context-
empirical study on github copilot, amazon codewhisperer, and aware patch generation for better automated program repair,”
chatgpt,” CoRR, vol. abs/2304.10778, 2023. [Online]. Available: in Proceedings of the 40th International Conference on Software
https://doi.org/10.48550/arXiv.2304.10778 Engineering, ser. ICSE ’18. New York, NY, USA: Association
[131] C. Wohlin, “Guidelines for snowballing in systematic literature for Computing Machinery, 2018, p. 1–11. [Online]. Available:
studies and a replication in software engineering,” in https://doi.org/10.1145/3180155.3180233
18th International Conference on Evaluation and Assessment [147] Y. Xiong, J. Wang, R. Yan, J. Zhang, S. Han, G. Huang, and
in Software Engineering, EASE ’14, London, England, United L. Zhang, “Precise condition synthesis for program repair,” in
Kingdom, May 13-14, 2014, M. J. Shepperd, T. Hall, and 2017 IEEE/ACM 39th International Conference on Software Engineer-
I. Myrtveit, Eds. ACM, 2014, pp. 38:1–38:10. [Online]. Available: ing (ICSE), 2017, pp. 416–426.
https://doi.org/10.1145/2601248.2601268 [148] J. Xuan, M. Martinez, F. DeMarco, M. Clément, S. L. Marcote,
[132] A. Mastropaolo, S. Scalabrino, N. Cooper, D. Nader-Palacio, T. Durieux, D. Le Berre, and M. Monperrus, “Nopol: Automatic
D. Poshyvanyk, R. Oliveto, and G. Bavota, “Studying the usage repair of conditional statement bugs in java programs,” IEEE
of text-to-text transfer transformer to support code-related tasks,” Transactions on Software Engineering, vol. 43, no. 1, pp. 34–55, 2017.
in 43rd IEEE/ACM International Conference on Software Engineering, [149] S. Song, X. Li, and S. Li, “How to bridge the gap between modal-
ICSE 2021, Madrid, Spain, 22-30 May 2021. IEEE, 2021, pp. 336– ities: A comprehensive survey on multimodal large language
347. model,” CoRR, vol. abs/2311.07594, 2023.
[133] C. Tsigkanos, P. Rani, S. Müller, and T. Kehrer, “Large [150] J. M. Zhang, M. Harman, L. Ma, and Y. Liu, “Machine learning
language models: The next frontier for variable discovery testing: Survey, landscapes and horizons,” IEEE Trans. Software
within metamorphic testing?” in IEEE International Conference Eng., vol. 48, no. 2, pp. 1–36, 2022.
on Software Analysis, Evolution and Reengineering, SANER 2023, [151] F. Tu, J. Zhu, Q. Zheng, and M. Zhou, “Be careful of when:
Taipa, Macao, March 21-24, 2023, T. Zhang, X. Xia, and an empirical study on time-related misuse of issue tracking
data,” in Proceedings of the 2018 ACM Joint Meeting on European [168] D. Zan, B. Chen, F. Zhang, D. Lu, B. Wu, B. Guan,
Software Engineering Conference and Symposium on the Foundations Y. Wang, and J. Lou, “Large language models meet nl2code:
of Software Engineering, ESEC/SIGSOFT FSE 2018, Lake Buena A survey,” in Proceedings of the 61st Annual Meeting of
Vista, FL, USA, November 04-09, 2018, G. T. Leavens, A. Garcia, the Association for Computational Linguistics (Volume 1: Long
and C. S. Pasareanu, Eds. ACM, 2018, pp. 307–318. [Online]. Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, A. Rogers,
Available: https://doi.org/10.1145/3236024.3236054 J. L. Boyd-Graber, and N. Okazaki, Eds. Association for
[152] Z. Sun, L. Li, Y. Liu, X. Du, and L. Li, “On the importance Computational Linguistics, 2023, pp. 7443–7464. [Online].
of building high-quality training datasets for neural code Available: https://doi.org/10.18653/v1/2023.acl-long.411
search,” in 44th IEEE/ACM 44th International Conference on
Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May
25-27, 2022. ACM, 2022, pp. 1609–1620. [Online]. Available:
https://doi.org/10.1145/3510003.3510160
[153] L. Shi, Z. Jiang, Y. Yang, X. Chen, Y. Zhang, F. Mu, H. Jiang, and
Q. Wang, “ISPY: automatic issue-solution pair extraction from
community live chats,” in 36th IEEE/ACM International Conference
on Automated Software Engineering, ASE 2021, Melbourne, Australia,
November 15-19, 2021. IEEE, 2021, pp. 142–154. [Online].
Available: https://doi.org/10.1109/ASE51524.2021.9678894
[154] D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu,
L. Zhou, N. Duan, A. Svyatkovskiy, S. Fu, M. Tufano,
S. K. Deng, C. B. Clement, D. Drain, N. Sundaresan, J. Yin,
D. Jiang, and M. Zhou, “Graphcodebert: Pre-training code
representations with data flow,” in 9th International Conference
on Learning Representations, ICLR 2021, Virtual Event, Austria,
May 3-7, 2021. OpenReview.net, 2021. [Online]. Available:
https://openreview.net/forum?id=jLoC4ez43PZ
[155] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and
J. Xiao, “Lsun: Construction of a large-scale image dataset us-
ing deep learning with humans in the loop,” arXiv preprint
arXiv:1506.03365, 2015.
[156] LoadRunner, Inc., “Loadrunner,” 2023, microfocus.com.
[157] LangChain, Inc., “Langchain,” 2023, https://docs.langchain.
com/docs/.
[158] Prompt engineering, “Prompt engineering guide,” 2023, https:
//github.com/dair-ai/Prompt-Engineering-Guide.
[159] Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola,
“Multimodal chain-of-thought reasoning in language models,”
CoRR, vol. abs/2302.00923, 2023.
[160] Z. Liu, X. Yu, Y. Fang, and X. Zhang, “Graphprompt: Unifying
pre-training and downstream tasks for graph neural networks,”
in Proceedings of the ACM Web Conference 2023, WWW 2023, Austin,
TX, USA, 30 April 2023 - 4 May 2023, Y. Ding, J. Tang, J. F. Sequeda,
L. Aroyo, C. Castillo, and G. Houben, Eds. ACM, 2023, pp. 417–
428.
[161] Y. Charalambous, N. Tihanyi, R. Jain, Y. Sun, M. A. Ferrag, and
L. C. Cordeiro, “A new era in software security: Towards self-
healing software via large language models and formal verifica-
tion,” 2023.
[162] S. Wang, L. Huang, A. Gao, J. Ge, T. Zhang, H. Feng, I. Satyarth,
M. Li, H. Zhang, and V. Ng, “Machine/deep learning for
software engineering: A systematic literature review,” IEEE
Trans. Software Eng., vol. 49, no. 3, pp. 1188–1231, 2023. [Online].
Available: https://doi.org/10.1109/TSE.2022.3173346
[163] Y. Yang, X. Xia, D. Lo, and J. C. Grundy, “A survey on
deep learning for software engineering,” ACM Comput. Surv.,
vol. 54, no. 10s, pp. 206:1–206:73, 2022. [Online]. Available:
https://doi.org/10.1145/3505243
[164] C. Watson, N. Cooper, D. Nader-Palacio, K. Moran, and
D. Poshyvanyk, “A systematic literature review on the use of
deep learning in software engineering research,” ACM Trans.
Softw. Eng. Methodol., vol. 31, no. 2, pp. 32:1–32:58, 2022. [Online].
Available: https://doi.org/10.1145/3485275
[165] M. Bajammal, A. Stocco, D. Mazinanian, and A. Mesbah,
“A survey on the use of computer vision to improve
software engineering tasks,” IEEE Trans. Software Eng.,
vol. 48, no. 5, pp. 1722–1742, 2022. [Online]. Available:
https://doi.org/10.1109/TSE.2020.3032986
[166] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo,
D. Lo, J. C. Grundy, and H. Wang, “Large language
models for software engineering: A systematic literature
review,” CoRR, vol. abs/2308.10620, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2308.10620
[167] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta,
S. Yoo, and J. M. Zhang, “Large language models
for software engineering: Survey and open problems,”
CoRR, vol. abs/2310.03533, 2023. [Online]. Available: https:
//doi.org/10.48550/arXiv.2310.03533
TABLE 5: All details of the collected papers
ID | Paper title | Year | Topic | Involved LLM | How LLM is used | Input to LLM | How LLM integrated | Venue | Ref
1 | Unit Test Case Generation with Transformers and Focal Context | 2021 | Unit test case generation | BART | Pre-training and/or Fine-tuning | Code | Pure LLM | Arxiv | [26]
2 | Codet: Code Generation with Generated Tests | 2022 | Unit test case generation | Codex | Zero-shot learning | Code | Pure LLM | ICLR 2023 | [27]
3 | Interactive Code Generation via Test-Driven User-Intent Formalization | 2022 | Unit test case generation | Codex | Zero-shot learning | Code | Mutation testing; Statistic analysis | Arxiv | [28]
4 | A3Test: Assertion-Augmented Automated Test Case Generation | 2023 | Unit test case generation | PLBART | Pre-training and/or Fine-tuning | Code | Syntactic repair | Arxiv | [29]
5 | An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation | 2023 | Unit test case generation | ChatGPT | Zero-shot learning | Code; Others | Syntactic repair | Arxiv | [30]
6 | An Initial Investigation of ChatGPT Unit Test Generation Capability | 2023 | Unit test case generation | ChatGPT | Zero-shot learning | Code | Pure LLM | SAST 2023 | [31]
7 | Automated Test Case Generation Using Code Models and Domain Adaptation | 2023 | Unit test case generation | CodeT5; LLaMA-2 | Pre-training and/or Fine-tuning | Code | Syntactic repair | Arxiv | [32]
8 | Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models | 2023 | Unit test case generation | CodeGPT; ChatGPT | Pre-training and/or Fine-tuning | Bug description | Pure LLM | Arxiv | [33]
9 | Can Large Language Models Write Good Property-Based Tests? | 2023 | Unit test case generation | GPT-4 | Zero-shot learning | Code; Others | Pure LLM | Arxiv | [34]
10 | CAT-LM: Training Language Models on Aligned Code And Tests | 2023 | Unit test case generation | GPT-neox | Pre-training and/or Fine-tuning | Code | Pure LLM | ASE 2023 | [35]
11 | ChatGPT vs SBST: A Comparative Assessment of Unit Test Suite Generation | 2023 | Unit test case generation | ChatGPT | Zero-shot learning | Code | Pure LLM | Arxiv | [8]
12 | ChatUniTest: a ChatGPT-based Automated Unit Test Generation Tool | 2023 | Unit test case generation | ChatGPT | Zero-shot learning | Code | Syntactic repair | Arxiv | [36]
13 | CODAMOSA: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models | 2023 | Unit test case generation | Codex | Zero-shot learning | Code | Mutation testing; Program analysis | ICSE 2023 | [37]
14 | Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing | 2023 | Unit test case generation | Codex | Few-shot learning; Zero-shot learning | Code | Mutation testing; Syntactic repair | Arxiv | [38]
15 | Exploring the Effectiveness of Large Language Models in Generating Unit Tests | 2023 | Unit test case generation | CodeGen; Codex; ChatGPT | Zero-shot learning | Code | Syntactic repair | Arxiv | [39]
16 | How Well does LLM Generate Security Tests? | 2023 | Unit test case generation | ChatGPT | Few-shot learning | Code | Pure LLM | Arxiv | [40]
17 | No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation | 2023 | Unit test case generation | ChatGPT | Zero-shot learning | Code; Error information | Program analysis | Arxiv | [7]
18 | Prompting Code Interpreter to Write Better Unit Tests on Quixbugs Functions | 2023 | Unit test case generation | GPT-4 | Few-shot learning | Code | Pure LLM | Arxiv | [41]
19 | Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation | 2023 | Unit test case generation | Codex | Pre-training and/or Fine-tuning | Code | Program analysis; Reinforcement learning | Arxiv | [42]
20 | Unit Test Generation using Generative AI: A Comparative Performance Analysis of Autogeneration Tools | 2023 | Unit test case generation | ChatGPT | Zero-shot learning | Code | Pure LLM | Arxiv | [43]
21 | Generating Accurate Assert Statements for Unit Test Cases Using Pretrained Transformers | 2023 | Test oracle generation | BART | Pre-training and/or Fine-tuning | Code | Pure LLM | AST 2022 | [44]
22 | Learning Deep Semantics for Test Completion | 2023 | Test oracle generation | CodeT5 | Pre-training and/or Fine-tuning | Code | Statistic analysis | ICSE 2023 | [45]
23 | Using Transfer Learning for Code-Related Tasks | 2022 | Test oracle generation; Program repair | T5 | Pre-training and/or Fine-tuning | Code | Pure LLM | TSE 2022 | [46]
24 | Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning | 2023 | Test oracle generation; Program repair | Codex | Few-shot learning | Code | Pure LLM | ICSE 2023 | [47]
25 | Automated Conformance Testing for JavaScript Engines via Deep Compiler Fuzzing | 2021 | System test input generation | GPT-2 | Pre-training and/or Fine-tuning | Code | Differential testing; Program analysis | PLDI 2021 | [48]
26 | Fill in the Blank: Context-aware Automated Text Input Generation for Mobile GUI Testing | 2022 | System test input generation | GPT-3 | Pre-training and/or Fine-tuning | View hierarchy file of UI | Pure LLM | ICSE 2023 | [49]
27 | Large Language Models are Pretty Good Zero-Shot Video Game Bug Detectors | 2022 | System test input generation | InstructGPT | Chain-of-Thought; Zero-shot learning | Others | Pure LLM | Arxiv | [50]
28 | Slgpt: Using Transfer Learning to Directly Generate Simulink Model Files and Find Bugs in the Simulink Toolchain | 2022 | System test input generation | GPT-2 | Pre-training and/or Fine-tuning | Others | Formal method | EASE 2021 | [51]
29 | Augmenting Greybox Fuzzing with Generative AI | 2023 | System test input generation | ChatGPT | Few-shot learning | Code | Pure LLM | Arxiv | [52]
30 | Automated Test Case Generation Using T5 and GPT-3 | 2023 | System test input generation | GPT-3; T5 | Pre-training and/or Fine-tuning; Zero-shot learning | NL specification | Pure LLM | ICACCS 2023 | [53]
31 | Automating GUI-based Software Testing with GPT-3 | 2023 | System test input generation | GPT-3 | Pre-training and/or Fine-tuning | View hierarchy file of UI | Pure LLM | ICSTW 2023 | [54]
32 | AXNav: Replaying Accessibility Tests from Natural Language | 2023 | System test input generation | GPT-4 | Chain-of-Thought | View hierarchy file of UI | Pure LLM | Arxiv | [55]
33 | Can ChatGPT Advance Software Testing Intelligence? An Experience Report on Metamorphic Testing | 2023 | System test input generation | ChatGPT | Zero-shot learning | Others | Pure LLM | Arxiv | [56]
34 | Efficient Mutation Testing via Pre-Trained Language Models | 2023 | System test input generation | CodeBert | Zero-shot learning | Code | Mutation testing | Arxiv | [57]
35 | Large Language Models are Edge-Case Generators: Crafting Unusual Programs for Fuzzing Deep Learning Libraries | 2023 | System test input generation | Codex | Chain-of-Thought; Pre-training and/or Fine-tuning; Zero-shot learning; Few-shot learning | Code | Differential testing | ICSE 2024 | [58]
36 | Large Language Models are Zero Shot Fuzzers: Fuzzing Deep Learning Libraries via Large Language Models | 2023 | System test input generation | Codex; InCoder | Zero-shot learning | Code | Mutation testing; Differential testing | ISSTA 2023 | [59]
37 | Large Language Models for Fuzzing Parsers (Registered Report) | 2023 | System test input generation | GPT-4 | Few-shot learning | NL specification | Pure LLM | FUZZING 2023 | [60]
38 | LLM for Test Script Generation and Migration: Challenges, Capabilities, and Opportunities | 2023 | System test input generation | ChatGPT | Zero-shot learning | View hierarchy file of UI | Pure LLM | Arxiv | [61]
39 | Make LLM a Testing Expert: Bringing Human-like Interaction to Mobile GUI Testing via Functionality-aware Decisions | 2023 | System test input generation | GPT-3 | Zero-shot learning | View hierarchy file of UI | Natural language processing | ICSE 2024 | [14]
40 | PentestGPT: An LLM-empowered Automatic Penetration Testing Tool | 2023 | System test input generation | ChatGPT; GPT-4; LaMDA | Chain-of-Thought; Few-shot learning | NL specification | Pure LLM | Arxiv | [62]
41 | SMT Solver Validation Empowered by Large Pre-Trained Language Models | 2023 | System test input generation | GPT-2 | Pre-training and/or Fine-tuning | Code | Differential testing | ASE 2023 | [63]
42 | TARGET: Automated Scenario Generation from Traffic Rules for Testing Autonomous Vehicles | 2023 | System test input generation | GPT-3 | Zero-shot learning | Others | Scenario testing | Arxiv | [64]
43 | Testing the Limits: Unusual Text Inputs Generation for Mobile App Crash Detection with Large Language Model | 2023 | System test input generation | ChatGPT | Few-shot learning | View hierarchy file of UI | Pure LLM | ICSE 2024 | [65]
44 | Understanding Large Language Model Based Fuzz Driver Generation | 2023 | System test input generation | ChatGPT; GPT-4 | Few-shot learning; Zero-shot learning | Code; Others | Pure LLM | Arxiv | [66]
45 | Universal Fuzzing via Large Language Models | 2023 | System test input generation | GPT-4; StarCoder | Few-shot learning; Automatic prompt | Code | Mutation testing | ICSE 2024 | [67]
46 | Variable Discovery with Large Language Models for Metamorphic Testing of Scientific Software | 2023 | System test input generation | GPT-j | Zero-shot learning | Others | Pure LLM | SANER 2023 | [68]
47 | White-box Compiler Fuzzing Empowered by Large Language Models | 2023 | System test input generation | GPT-4; StarCoder | Few-shot learning | Code | Pure LLM | Arxiv | [69]
48 | Itiger: an Automatic Issue Title Generation Tool | 2022 | Bug analysis | BART | Pre-training and/or Fine-tuning | Bug description | Pure LLM | FSE 2022 | [70]
49 | CrashTranslator: Automatically Reproducing Mobile Application Crashes Directly from Stack Trace | 2023 | Bug analysis | ChatGPT | Pre-training and/or Fine-tuning | Bug description | Reinforcement learning | ICSE 2024 | [71]
50 | Cupid: Leveraging ChatGPT for More Accurate Duplicate Bug Report Detection | 2023 | Bug analysis | ChatGPT | Zero-shot learning | Bug description | Statistic analysis | Arxiv | [72]
51 | Employing Deep Learning and Structured Information Retrieval to Answer Clarification Questions on Bug Reports | 2023 | Bug analysis | CodeT5 | Zero-shot learning | Bug description | Statistic analysis | Arxiv | [73]
52 | Explaining Software Bugs Leveraging Code Structures in Neural Machine Translation | 2023 | Bug analysis | CodeT5 | Pre-training and/or Fine-tuning | Code | Program analysis | ICSE 2023 | [74]
53 | Prompting Is All Your Need: Automated Android Bug Replay with Large Language Models | 2023 | Bug analysis | ChatGPT | Few-shot learning; Chain-of-Thought | Bug description | Pure LLM | ICSE 2024 | [75]
54 | Still Confusing for Bug-Component Triaging? Deep Feature Learning and Ensemble Setting to Rescue | 2023 | Bug analysis | CodeT5 | Pre-training and/or Fine-tuning | Bug description | Statistic analysis | ICPC 2023 | [76]
55 | Detect-Localize-Repair: A Unified Framework for Learning to Debug with CodeT5 | 2022 | Debug | CodeT5 | Pre-training and/or Fine-tuning | Code | Pure LLM | EMNLP 2022 | [77]
56 | Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction | 2022 | Debug | Codex | Few-shot learning | Bug description | Program analysis; Statistic analysis | ICSE 2023 | [78]
57 | A Preliminary Evaluation of LLM-Based Fault Localization | 2023 | Debug | ChatGPT | Few-shot learning | Code | Pure LLM | Arxiv | [79]
58 | Addressing Compiler Errors: Stack Overflow or Large Language Models? | 2023 | Debug | ChatGPT; GPT-4 | Zero-shot learning | Error information | Pure LLM | Arxiv | [80]
59 | Can LLMs Demystify Bug Reports? | 2023 | Debug | ChatGPT | Zero-shot learning | Bug description | Pure LLM | Arxiv | [81]
60 | Dcc --help: Generating Context-Aware Compiler Error Explanations with Large Language Models | 2023 | Debug | ChatGPT | Zero-shot learning | Code; Error information | Pure LLM | SIGCSE 2024 | [82]
61 | Explainable Automated Debugging via Large Language Model-driven Scientific Debugging | 2023 | Debug | CodeGen; Codex; ChatGPT | Self-consistency; Zero-shot learning | Code | Pure LLM | Arxiv | [83]
62 | Large Language Models for Test-Free Fault Localization | 2023 | Debug | CodeGen | Pre-training and/or Fine-tuning | Code | Pure LLM | ICSE 2024 | [84]
63 | Large Language Models in Fault Localisation | 2023 | Debug | ChatGPT; GPT-4 | Zero-shot learning | Code; Error information | Pure LLM | Arxiv | [85]
64 | LLM4CBI: Taming LLMs to Generate Effective Test Programs for Compiler Bug Isolation | 2023 | Debug | ChatGPT | Zero-shot learning | Code | Mutation testing; Reinforcement learning | Arxiv | [86]
65 | Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting | 2023 | Debug | ChatGPT | Zero-shot learning | Code | Differential testing | ASE 2023 | [87]
66 | Teaching Large Language Models to Self-Debug | 2023 | Debug | Codex; ChatGPT; GPT-4; StarCoder | Few-shot learning | Code | Pure LLM | Arxiv | [88]
67 | A study on Prompt Design, Advantages and Limitations of ChatGPT for Deep Learning Program Repair | 2023 | Debug; Program repair | ChatGPT | Zero-shot learning | Code | Pure LLM | Arxiv | [89]
68 | Examining Zero-Shot Vulnerability Repair with Large Language Models | 2021 | Program repair | Codex | Zero-shot learning | Code; Bug description | Pure LLM | SP 2023 | [90]
69 | Automated Repair of Programs from Large Language Models | 2022 | Program repair | Codex | Zero-shot learning | Code | Pure LLM | ICSE 2023 | [91]
70 | Fix Bugs with Transformer through a Neural-Symbolic Edit Grammar | 2022 | Program repair | CodeGPT | Pre-training and/or Fine-tuning | Code | Pure LLM | Arxiv | [92]
71 | Practical Program Repair in the Era of Large Pre-trained Language Models | 2022 | Program repair | GPT-3; Codex; CodeT5; InCoder | Few-shot learning; Zero-shot learning | Code | Statistic analysis | ICSE 2023 | [93]
72 | Repairing Bugs in Python Assignments Using Large Language Models | 2022 | Program repair | Codex | Few-shot learning; Zero-shot learning | Code; Error information | Program analysis | Arxiv | [94]
73 | Towards JavaScript Program Repair with Generative Pre-trained Transformer (GPT-2) | 2022 | Program repair | GPT-2 | Pre-training and/or Fine-tuning | Code | Pure LLM | APR 2022 | [95]
74 | An Analysis of the Automatic Bug Fixing Performance of ChatGPT | 2023 | Program repair | ChatGPT | Zero-shot learning | Code; Error information | Pure LLM | APR 2023 | [96]
75 | An Empirical Study on Fine-Tuning Large Language Models of Code for Automated Program Repair | 2023 | Program repair | PLBART; CodeT5; UniXCoder | Pre-training and/or Fine-tuning | Code | Pure LLM | ASE 2023 | [97]
76 | An Evaluation of the Effectiveness of OpenAI's ChatGPT for Automated Python Program Bug Fixing using QuixBugs | 2023 | Program repair | ChatGPT | Zero-shot learning | Code | Pure LLM | iSemantic 2023 | [98]
77 | An Extensive Study on Model Architecture and Program Representation in the Domain of Learning-based Automated Program Repair | 2023 | Program repair | T5; CodeT5 | Pre-training and/or Fine-tuning | Code | Pure LLM | APR 2023 | [99]
78 | Can OpenAI's Codex Fix Bugs? An Evaluation on QuixBugs | 2023 | Program repair | Codex | Few-shot learning; Zero-shot learning | Code | Pure LLM | APR 2022 | [100]
79 | CIRCLE: Continual Repair Across Programming Languages | 2023 | Program repair | T5 | Pre-training and/or Fine-tuning | Code | Pure LLM | ISSTA 2022 | [101]
80 | Coffee: Boost Your Code LLMs by Fixing Bugs with Feedback | 2023 | Program repair | CodeLLAMA | Pre-training and/or Fine-tuning | Code | Pure LLM | Arxiv | [102]
81 | Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair | 2023 | Program repair | CodeT5; InCoder | Zero-shot learning | Code | Statistic analysis | FSE 2023 | [103]
82 | Domain Knowledge Matters: Improving Prompts with Fix Templates for Repairing Python Type Errors | 2023 | Program repair | CodeT5 | Pre-training and/or Fine-tuning | Code | Program analysis | ICSE 2024 | [104]
83 | Enhancing Genetic Improvement Mutations Using Large Language Models | 2023 | Program repair | GPT-4 | Zero-shot learning | Code | Pure LLM | SSBSE 2023 | [105]
84 | FixEval: Execution-based Evaluation of Program Fixes for Programming Problems | 2023 | Program repair | CodeT5; PLBART | Pre-training and/or Fine-tuning | Code | Pure LLM | APR 2023 | [106]
85 | Fixing Hardware Security Bugs with Large Language Models | 2023 | Program repair | Codex; CodeGen | Few-shot learning; Zero-shot learning | Code; Bug description | Pure LLM | Arxiv | [107]
86 | Fixing Rust Compilation Errors using LLMs | 2023 | Program repair | ChatGPT; GPT-4 | Zero-shot learning | Code | Pure LLM | Arxiv | [108]
87 | Framing Program Repair as Code Completion | 2023 | Program repair | CodeGPT | Zero-shot learning | Code | Pure LLM | ICSE 2022 | [109]
88 | Frustrated with Code Quality Issues? LLMs can Help! | 2023 | Program repair | ChatGPT; GPT-4 | Zero-shot learning | Code | Pure LLM | Arxiv | [110]
89 | GPT-3-Powered Type Error Debugging: Investigating the Use of Large Language Models for Code Repair | 2023 | Program repair | GPT-3 | Zero-shot learning | Code | Program analysis | SLE 2023 | [111]
90 | How Effective Are Neural Networks for Fixing Security Vulnerabilities | 2023 | Program repair | Codex; CodeGen; CodeT5; PLBART; InCoder | Pre-training and/or Fine-tuning; Zero-shot learning | Code | Pure LLM | ISSTA 2023 | [112]
91 | Impact of Code Language Models on Automated Program Repair | 2023 | Program repair | PLBART; CodeT5; CodeGen; InCoder | Pre-training and/or Fine-tuning; Zero-shot learning | Code | Pure LLM | ICSE 2023 | [113]
92 | Inferfix: End-to-end Program Repair with LLMs | 2023 | Program repair | Codex | Few-shot learning; Pre-training and/or Fine-tuning | Code | Pure LLM | FSE 2023 | [114]
93 | Keep the Conversation Going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT | 2023 | Program repair | ChatGPT | Few-shot learning | Code; Error information | Pure LLM | Arxiv | [115]
94 | Neural Program Repair with Program Dependence Analysis and Effective Filter Mechanism | 2023 | Program repair | CodeT5 | Pre-training and/or Fine-tuning | Code | Statistic analysis | Arxiv | [116]
95 | Out of Context: How important is Local Context in Neural Program Repair? | 2023 | Program repair | CodeT5 | Pre-training and/or Fine-tuning | Code | Pure LLM | ICSE 2024 | [117]
96 | Pre-trained Model-based Automated Software Vulnerability Repair: How Far are We? | 2023 | Program repair | CodeT5; UniXCoder; CodeGPT | Pre-training and/or Fine-tuning | Code | Pure LLM | IEEE TDSC | [118]
97 | RAPGen: An Approach for Fixing Code Inefficiencies in Zero-Shot | 2023 | Program repair | Codex | Few-shot learning; Chain-of-Thought | Code | Pure LLM | Arxiv | [119]
98 | RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair | 2023 | Program repair | CodeT5 | Pre-training and/or Fine-tuning | Code | Statistic analysis | FSE 2023 | [120]
99 | STEAM: Simulating the InTeractive BEhavior of ProgrAMmers for Automatic Bug Fixing | 2023 | Program repair | ChatGPT | Zero-shot learning | Code | Pure LLM | Arxiv | [121]
100 | Towards Generating Functionally Correct Code Edits from Natural Language Issue Descriptions | 2023 | Program repair | Codex; ChatGPT | Few-shot learning; Zero-shot learning; Chain-of-Thought | Code; Bug description | Pure LLM | Arxiv | [122]
101 | VulRepair: a T5-based Automated Software Vulnerability Repair | 2023 | Program repair | T5 | Pre-training and/or Fine-tuning | Code | Pure LLM | FSE 2022 | [123]
102 | What Makes Good In-Context Demonstrations for Code Intelligence Tasks with LLMs? | 2023 | Program repair | Codex; ChatGPT | Few-shot learning | Code | Pure LLM | ASE 2023 | [124]