Software Testing With Large Language Models: Survey, Landscape, and Vision
Abstract—Pre-trained large language models (LLMs) have recently emerged as a breakthrough technology in natural language
processing and artificial intelligence, with the ability to handle large-scale datasets and exhibit remarkable performance across a wide
range of tasks. Meanwhile, software testing is a crucial undertaking that serves as a cornerstone for ensuring the quality and reliability
of software products. As the scope and complexity of software systems continue to grow, the need for more effective software testing
techniques becomes increasingly urgent, making it an area ripe for innovative approaches such as the use of LLMs. This paper provides
a comprehensive review of the utilization of LLMs in software testing. It analyzes 102 relevant studies that have used LLMs for software
testing, from both the software testing and LLMs perspectives. The paper presents a detailed discussion of the software testing tasks for
which LLMs are commonly used, among which test case preparation and program repair are the most representative. It also analyzes
the commonly used LLMs, the types of prompt engineering that are employed, and the techniques used in combination with these LLMs,
and it summarizes the key challenges and potential opportunities in this direction. This work can serve as a roadmap for future research
in this area, highlighting potential avenues for exploration, and identifying gaps in our current understanding of the use of LLMs in
software testing.
task, highlighting commonly-used practices, tracking technology evolution trends, and summarizing achieved performance, so as to facilitate readers in gaining a thorough overview of how LLMs are employed across various testing tasks.

From the viewpoint of LLMs, our analysis covers the commonly used LLMs in these studies, the types of prompt engineering, the input of the LLMs, as well as the techniques used in combination with these LLMs. Results show that about one-third of the studies utilize the LLMs through a pre-training or fine-tuning scheme, while the others employ prompt engineering to communicate with LLMs and steer their behavior toward desired outcomes. For prompt engineering, the zero-shot learning and few-shot learning strategies are most commonly used, while other advances like chain-of-thought prompting and self-consistency are rarely utilized. Results also show that traditional testing techniques like differential testing and mutation testing are usually combined with LLMs to help generate more diversified tests.

Furthermore, we summarize the key challenges and potential opportunities in this direction. Although software testing with LLMs has undergone significant growth in recent years, challenges remain that call for further research.

In summary, this paper makes the following contributions:
● We analyze 102 relevant studies that have used LLMs for software testing, regarding publication trends, distribution of publication venues, etc.
● We conduct a comprehensive analysis from the perspective of software testing to understand the distribution of software testing tasks with LLMs and present a thorough discussion about how these tasks are solved with LLMs.
● We conduct a comprehensive analysis from the perspective of LLMs, and uncover the commonly-used LLMs, the types of prompt engineering, the input of the LLMs, as well as the techniques used in combination with these LLMs.
● We highlight the challenges in existing studies and present potential opportunities for further studies.

We believe that this work will be valuable to both researchers and practitioners in the field of software engineering, as it provides a comprehensive overview of the current state and future vision of using LLMs for software testing. For researchers, this work can serve as a roadmap for future research in this area, highlighting potential avenues for exploration and identifying gaps in our current understanding of the use of LLMs in software testing. For practitioners, this work can provide insights into the potential benefits and limitations of using LLMs for software testing, as well as practical guidance on how to effectively integrate them into existing testing processes. By providing a detailed landscape of the current state and future vision of using LLMs for software testing, this work can help accelerate the adoption of this technology in the software engineering community and ultimately contribute to improving the quality and reliability of software systems.
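As a concrete illustration of the zero-shot and few-shot prompting strategies mentioned above, the following sketch assembles two prompts that ask an LLM to generate a unit test for a given function. It is a minimal sketch: the prompt wording, the focal function, the in-context example, and the send_to_llm placeholder are hypothetical and are not taken from any of the surveyed studies.

# Hypothetical sketch of zero-shot vs. few-shot prompts for unit test generation.
FOCAL_METHOD = """def is_leap_year(year: int) -> bool:
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)"""

# Zero-shot: only the task description and the code under test.
zero_shot_prompt = (
    "Write a Python unittest test case for the following function.\n\n"
    + FOCAL_METHOD
)

# Few-shot: the same request, preceded by one worked example (more could be added).
EXAMPLE = """Function:
def add(a, b):
    return a + b

Test:
import unittest
class TestAdd(unittest.TestCase):
    def test_add(self):
        self.assertEqual(add(2, 3), 5)"""

few_shot_prompt = (
    "Write a Python unittest test case for the last function, following the example.\n\n"
    + EXAMPLE
    + "\n\nFunction:\n" + FOCAL_METHOD + "\n\nTest:"
)

# A study would then send these prompts to an LLM, e.g.:
# response = send_to_llm(few_shot_prompt)  # send_to_llm is a placeholder, not a real API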
2 BACKGROUND

2.1 Large Language Model (LLM)

Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer-based models over large-scale corpora, showing strong capabilities in solving various natural language processing (NLP) tasks [16]–[19]. Studies have shown that model scaling can lead to improved model capacity, prompting researchers to investigate the scaling effect through further increases in parameter size. Interestingly, when the parameter scale exceeds a certain threshold, these larger language models demonstrate not only significant performance improvements but also special abilities such as in-context learning, which are absent in smaller models such as BERT.

To distinguish language models at different parameter scales, the research community has coined the term large language models (LLMs) for PLMs of significant size. LLMs typically refer to language models that have hundreds of billions (or more) of parameters and are trained on massive text data; representative examples include GPT-3, PaLM, Codex, and LLaMA. LLMs are built on the Transformer architecture, which stacks multi-head attention layers in a very deep neural network. Existing LLMs adopt similar model architectures (Transformer) and pre-training objectives (language modeling) as small language models, but largely scale up the model size, pre-training data, and total compute. This enables LLMs to better understand natural language and generate high-quality text based on a given context or prompt.

Note that, in the existing literature, there is no formal consensus on the minimum parameter scale for LLMs, since model capacity is also related to data size and total compute. In a recent survey of LLMs [17], the authors focus on language models with a model size larger than 10B. Under their criteria, the first LLM is T5, released by Google in 2019, followed by GPT-3, released by OpenAI in 2020, and more than thirty LLMs were released between 2021 and 2023, indicating the popularity of this direction. In another survey unifying LLMs and knowledge graphs [24], the authors categorize LLMs into three types of network architecture: encoder-only (e.g., BERT), encoder-decoder (e.g., T5), and decoder-only (e.g., GPT-3). In our review, we take into account the categorization criteria of the two surveys and only consider pre-trained language models with encoder-decoder or decoder-only architectures, since both can support generative tasks. We do not consider the encoder-only architecture because such models cannot handle generative tasks, were proposed relatively early (e.g., BERT in 2018), and almost no models have used this architecture after 2021. In other words, the LLMs discussed in this paper include not only models with more than 10B parameters (as mentioned in [17]) but also other models that use the encoder-decoder or decoder-only architecture (as mentioned in [24]), such as BART with 140M parameters and GPT-2 with parameter sizes ranging from 117M to 1.5B. This also allows us to include more studies and better demonstrate the landscape of this topic.
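To make the distinction between the decoder-only and encoder-decoder architectures more concrete, the sketch below loads one model of each kind with the HuggingFace transformers library and lets both complete a small code snippet. It is an illustrative sketch under our own assumptions: the checkpoints (gpt2, Salesforce/codet5-base), the prompt, and the generation settings are chosen purely for demonstration and are not drawn from the surveyed studies.

# Illustrative sketch: decoder-only vs. encoder-decoder generation
# (checkpoint names and prompt are placeholders, not from the surveyed studies).
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

prompt = "def add(a, b):"

# Decoder-only model (GPT-2): continues the prompt token by token, left to right.
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
ids = gpt2_tok(prompt, return_tensors="pt")
out = gpt2.generate(**ids, max_new_tokens=20, pad_token_id=gpt2_tok.eos_token_id)
print(gpt2_tok.decode(out[0], skip_special_tokens=True))

# Encoder-decoder model (CodeT5): encodes the input, then decodes a new output sequence.
t5_tok = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
t5 = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")
ids = t5_tok(prompt, return_tensors="pt")
out = t5.generate(**ids, max_new_tokens=20)
print(t5_tok.decode(out[0], skip_special_tokens=True))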
2.2 Software Testing

Software testing is a crucial process in software development that involves evaluating the quality of a software product. The primary goal of software testing is to identify defects or errors in the software system that could potentially lead to incorrect or unexpected behavior. The whole life cycle of software testing typically includes the following tasks (demonstrated in Figure 4):
● Requirement Analysis: analyze the software requirements and identify the testing objectives, scope, and criteria.
● Test Plan: develop a test plan that outlines the testing strategy, test objectives, and schedule.
● Test Design and Review: develop and review the test cases and test suites that align with the test plan and the requirements of the software application.
● Test Case Preparation: prepare the actual test cases based on the designs created in the previous stage.
● Test Execution: execute the tests designed in the previous stage; the software system is run with the test cases and the results are recorded.
● Test Reporting: analyze the results of the tests and generate reports that summarize the testing process and identify any defects or issues that were discovered.
● Bug Fixing and Regression Testing: defects or issues identified during testing are reported to the development team for fixing; once the defects are fixed, regression testing is performed to ensure that the changes have not introduced new defects or issues.
● Software Release: once the software system has passed all of the testing stages and the defects have been fixed, the software can be released to the customer or end user.

The testing process is iterative and may involve multiple cycles of the above stages, depending on the complexity of the software system and the testing requirements.

During the testing phase, various types of tests may be performed, including unit tests, integration tests, system tests, and acceptance tests.
● Unit Testing involves testing individual units or components of the software application to ensure that they function correctly (see the example below).
● Integration Testing involves testing different modules or components of the software application together to ensure that they work correctly as a system.
● System Testing involves testing the entire software system as a whole, including all the integrated components and external dependencies.
● Acceptance Testing involves testing the software application to ensure that it meets the business requirements and is ready for deployment.

In addition, there can be functional testing, performance testing, security testing, accessibility testing, etc., which explore various aspects of the software under test [25].
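As a minimal illustration of the unit testing level described above, the snippet below uses Python's built-in unittest module to check a small, hypothetical divide function; both the function and the chosen test cases are placeholders rather than material from the surveyed studies.

import unittest

def divide(a: float, b: float) -> float:
    # Hypothetical unit under test.
    if b == 0:
        raise ValueError("division by zero")
    return a / b

class TestDivide(unittest.TestCase):
    def test_normal_division(self):
        self.assertEqual(divide(6, 3), 2)

    def test_division_by_zero_raises(self):
        with self.assertRaises(ValueError):
            divide(1, 0)

if __name__ == "__main__":
    unittest.main()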
Fig. 2: Overview of the paper collection process (automatic search: 14,623 papers; after automatic filtering: 1,239 papers; combined with manual search: 1,278 papers; after inclusion and exclusion criteria: 109 papers; after quality assessment and snowballing: 102 papers)

3 PAPER SELECTION AND REVIEW SCHEMA

3.1 Paper Collection Methodology

Figure 2 shows our paper search and selection process. To collect as much relevant literature as possible, we use both automatic search (from paper repository databases) and manual search (from major software engineering and artificial intelligence venues). We searched for papers from Jan. 2019 to Jun. 2023, and further conducted a second round of search to include papers from Jul. 2023 to Oct. 2023.
3.1.1 Automatic Search

To ensure that we collect papers from diverse research areas, we conduct an extensive search using four popular scientific databases: the ACM Digital Library, the IEEE Xplore Digital Library, arXiv, and DBLP.

We search for papers whose title contains keywords related to software testing tasks and testing techniques (as shown below) in the first three databases. In the case of DBLP, we use additional keywords related to LLMs (as shown below) to filter out irrelevant studies, since relying solely on testing-related keywords would return a very large number of candidate studies. While using two sets of keywords for DBLP may result in overlooking certain related studies, we believe it is still a feasible strategy, because a substantial number of the studies in this database can already be found in the first three databases, and the fourth database only serves as a supplementary source for collecting additional papers.
● Keywords related to software testing tasks and techniques: test OR bug OR issue OR defect OR fault OR error OR failure OR crash OR debug OR debugger OR repair OR fix OR assert OR verification OR validation OR fuzz OR fuzzer OR mutation.
● Keywords related to LLMs: LLM OR language model OR generative model OR large model OR GPT-3 OR ChatGPT OR GPT-4 OR LLaMA OR PaLM2 OR CodeT5 OR CodeX OR CodeGen OR Bard OR InstructGPT. Note that we only list the ten most popular LLMs (based on Google search), since these keywords are matched against paper titles rather than paper content.

The above title-based search strategy recalls a large number of papers, so we further conduct automatic filtering based on the paper content. Specifically, we keep only the papers whose content contains “LLM”, “language model”, “generative model”, “large model”, or the name of an LLM (using the LLMs in [17], [24], except those in our exclusion criteria). This helps eliminate papers that do not involve neural models. A simplified sketch of this filtering step is shown below.
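The following sketch illustrates, under simplifying assumptions, how the title-keyword search and the content-based filtering described above could be implemented. The keyword lists mirror the ones given in this subsection, but the function names and the case-insensitive substring matching are our own simplifications, not the authors' actual scripts.

# Simplified sketch of title-keyword search and content-based filtering.
TESTING_KEYWORDS = ["test", "bug", "issue", "defect", "fault", "error", "failure",
                    "crash", "debug", "debugger", "repair", "fix", "assert",
                    "verification", "validation", "fuzz", "fuzzer", "mutation"]
LLM_KEYWORDS = ["LLM", "language model", "generative model", "large model",
                "GPT-3", "ChatGPT", "GPT-4", "LLaMA", "PaLM2", "CodeT5",
                "CodeX", "CodeGen", "Bard", "InstructGPT"]

def matches_any(text: str, keywords) -> bool:
    # Case-insensitive substring match against a list of keywords.
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)

def title_hit(title: str, from_dblp: bool = False) -> bool:
    # Titles must mention a testing keyword; for DBLP, an LLM keyword as well.
    if not matches_any(title, TESTING_KEYWORDS):
        return False
    return matches_any(title, LLM_KEYWORDS) if from_dblp else True

def content_hit(full_text: str) -> bool:
    # Keep only papers whose content mentions an LLM or a generic LLM term.
    return matches_any(full_text, LLM_KEYWORDS)

The exact matching rules used in the survey may differ (e.g., word-boundary or phrase matching), but the two-stage structure of a title search followed by content-based filtering is the same as described above.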
3.1.2 Manual Search

To compensate for potential omissions that may result from the automated searches, we also conduct manual searches. To make sure that we collect highly relevant papers, we conduct a manual search within the conference proceedings and journal articles from top-tier software engineering venues (listed in Table 2). In addition, given the interdisciplinary nature of this work, we also include the conference proceedings of the artificial intelligence field. We select the top ten venues based on the h5-index from Google Scholar, and exclude three computer vision venues, i.e., CVPR, ICCV, and ECCV, as listed in Table 2.

3.1.3 Inclusion and Exclusion Criteria

The search conducted on the databases and venues is, by design, very inclusive. This allows us to collect as many papers as possible in our pool. However, this generous inclusivity results in papers that are not directly related to the scope of this survey. Accordingly, we define a set of specific inclusion and exclusion criteria, apply them to each paper in the pool, and remove papers that do not meet the criteria. This ensures that each collected paper aligns with our scope and research questions.

Inclusion Criteria. A paper is included if it satisfies any of the following criteria:
● The paper proposes or improves an approach, study, or tool/framework that targets testing specific software or systems with LLMs.
● The paper applies LLMs to software testing practice, including all tasks within the software testing life cycle as demonstrated in Section 2.2.
● The paper presents an empirical or experimental study about utilizing LLMs in software testing practice.
● The paper involves specific testing techniques (e.g., fuzz testing) employing LLMs.

Exclusion Criteria. The following studies are excluded during study selection:
● The paper does not involve software testing tasks, e.g., code comment generation.
● The paper does not utilize LLMs, e.g., it uses recurrent neural networks.
● The paper mentions LLMs only in future work or discussions rather than using LLMs in the approach.
● The paper utilizes language models with an encoder-only architecture, e.g., BERT, which cannot directly be utilized for generation tasks (as demonstrated in Section 2.1).
● The paper focuses on testing the performance of LLMs themselves, such as fairness, stability, security, etc. [125]–[127].
● The paper focuses on evaluating the performance of LLM-enabled tools, e.g., evaluating the code quality of the code generation tool Copilot [128]–[130].

For the papers collected through automatic search and manual search, we conduct a manual inspection to check whether they satisfy our inclusion criteria and filter out those matching our exclusion criteria. Specifically, the first two authors read each paper to carefully determine whether it meets the criteria.
Acronym    Venue

SE Conference
ICSE       International Conference on Software Engineering
ESEC/FSE   Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering