1 s2.0 S0950584923002495 Main
1 s2.0 S0950584923002495 Main
1 s2.0 S0950584923002495 Main
Keywords: Context: Test flakiness arises when test cases have a non-deterministic, intermittent behavior that leads them
Test Code Flakiness to either pass or fail when run against the same code. While researchers have been contributing to the detection,
Software Testing classification, and removal of flaky tests with several empirical studies and automated techniques, little is
Mobile Apps Development
known about how the problem of test flakiness arises in mobile applications.
Mixed-Method Research
Objective: We point out a lack of knowledge on: (1) The prominence and harmfulness of the problem; (2) The
most frequent root causes inducing flakiness; and (3) The strategies applied by practitioners to deal with it in
practice. An improved understanding of these matters may lead the software engineering research community
to assess the need for tailoring existing instruments to the mobile context or for brand-new approaches that
focus on the peculiarities identified.
Methods: We address this gap of knowledge by means of an empirical study into the mobile developer’s
perception of test flakiness. We first perform a systematic grey literature review to elicit how developers
discuss and deal with the problem of test flakiness in the wild. Then, we complement the systematic review
through a survey study that involves 130 mobile developers and that aims at analyzing their experience on
the matter.
Results: The results of the grey literature review indicate that developers are often concerned with flakiness
connected to user interface elements. In addition, our survey study reveals that flaky tests are perceived as
critical by mobile developers, who pointed out major production code- and source code design-related root
causes of flakiness, other than the long-term effects of recurrent flaky tests. Furthermore, our study lets the
diagnosing and fixing processes currently adopted by developers and their limitations emerge.
Conclusion: We conclude by distilling lessons learned, implications, and future research directions.
1. Introduction is one of the most effective practices to verify that the behavior of a
mobile app meets the expectations and that non-functional attributes
The increasing world digitalization, the changes we are experienc- are preserved [12,13]. The relevance of software testing is even more
ing in terms of communication, and the long-term effects of the era evident in the context of mobile applications development, where
of social distancing are just some of the reasons why people need continuous improvements and releases represent a threat to software
more and more mobile applications that let them connect and support reliability [14].
their daily activities [1]. It is indeed not surprising to see that we Unfortunately, recent research has highlighted several limitations
currently have more mobile devices than people.1 In such a context, that make software testing still poorly applied by mobile developers,
the quality of mobile applications is crucial to let developers create like the lack of practical automated testing tools [15] or the low
sustainable, successful, and dependable apps that can stay in the market attitude of developers to write unit tests [13]. In addition, the poor
and keep acquiring users [2,3]. For instance, research has shown that quality of test cases [16] might lead developers not to rely anymore
internal properties of mobile software, like energy consumption [4], on the outcome of the testing activities and even lead them to ignore
presence of design flaws [5], or even faulty APIs [6,7], might impact the actual defects in production code [17,18].
user’s satisfaction and commercial success of mobile applications [8, One of the most recurring issues connected to test code quality is
9]. Among the various methods to control for source code quality test flakiness [19]. This problem arises when a test case has a non-
(e.g., design-by-contract [10] or formal methods [11]), software testing deterministic behavior, i.e., it may either pass or fail when run against
∗ Corresponding author at: Software Engineering (SeSa) Lab - Department of Computer Science, University of Salerno, Italy.
E-mail addresses: valeria.pontillo@vub.be (V. Pontillo), fpalomba@unisa.it (F. Palomba), fferrucci@unisa.it (F. Ferrucci).
1
Smartphone users worldwide: https://www.statista.com/statistics/330695/number-of-smartphone-users-worldwide/.
https://doi.org/10.1016/j.infsof.2023.107394
Received 27 July 2023; Received in revised form 13 December 2023; Accepted 24 December 2023
Available online 6 January 2024
0950-5849/© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
V. Pontillo et al. Information and Software Technology 168 (2024) 107394
the same code [19–21]. Over the last years, flaky tests have been deeply The results of both studies converge toward the same conclusions.
investigated from different perspectives [22]: researchers attempted to We show that test flakiness represents a key problem for mobile devel-
classify the root causes making a test flaky [19,20,23,24], their lifecy- opers as well, who would like to have further support to deal with it.
cle [23,25], and their reproducibility [26]. At the same time, notable This is especially true for flaky tests connected to the user interface. In
advances have been carried out in terms of automated detection [27– addition, we elicit the typical processes that developers conduct when
30], classification [31–34], root cause identification [24,35,36] and detecting, diagnosing, and fixing flaky tests. Last but not least, our
removal [37,38]. research lets emerge that third-party APIs and production code-related
While most of these studies targeted traditional software, our re- factors might contribute to test flakiness.
search points out a worrisome lack of knowledge on how test flakiness Based on these findings, we conclude our paper by distilling a
manifests itself in the context of mobile applications, how developers number of lessons learned, implications, and novel research avenues
deal with it, and how mobile-specific (semi-)automated support might that researchers might consider to further build on the current state of
be designed. More specifically, the current state of the art has studied the art.
mobile flakiness only tangentially or partially. On the one hand, Gruber
Structure of the paper. Section 2 discusses the literature on flaky tests.
and Fraser [39] and Habchi et al. [40] surveyed practitioners to elicit
Section 3 overviews our empirical setting. Section 4 reports the design
their perspective on various properties of the problem, including how
and results of the systematic grey literature review, while Section 5
practitioners define the problem, the root causes of flakiness, and
elaborates on the survey and the related results. In Section 6 we further
their reactions to the emergence of flaky tests. These studies targeted
discuss on the implications that our findings have for both researchers
the generic population of developers, hence tangentially considering
and practitioners. Section 7 summarizes the threats to validity of our
mobile developers as well. Nonetheless, their findings did not aim
study and how we mitigated them. Finally, Section 8 concludes the
at identifying the peculiarities of test flakiness in mobile apps, but
paper and outlines our future research agenda on the matter.
rather to provide general observations on the problem. In addition,
neither Gruber and Fraser [39] nor Habchi et al. [40] analyzed the
2. Related work
specific identification and mitigation processes put in place by mobile
developers to deal with flakiness. On the other hand, Thorve et al. [41]
The problem of flakiness is widely recognized and discussed by in-
proposed the first investigation into test flakiness in Android apps: the
dustrial practitioners worldwide: dozens of daily discussions are opened
authors replicated the seminal work by Luo et al. [19], classifying the
on the topic on social networks and blogs (e.g., [43,44]). Researchers
root causes of flakiness through manual analysis of flakiness-inducing
have started investigating the problem from multiple perspectives
commits. The findings of the study showed that mobile apps have
pushed by such an industrial movement. Systematic analyses of state
unique characteristics and, indeed, they present flaky tests whose root
of the art were presented by Barboni et al. [45], Parry et al. [22], and
causes are different from those previously investigated in traditional
Zheng et al. [46]. At the same time, to the best of our knowledge,
software, e.g., bugs related to Program Logic and the User Interface. The
there is still no systematic analysis of the grey literature [47]. This
scope of the work by Thorve et al. [41] was limited to the analysis of
point already motivates our work, which proposes the first systematic
the root causes of flakiness, yet it represented a call for further and
grey literature on test flakiness and the first exploration of specific
more comprehensive research on the matter.
aspects connected to test flakiness in mobile applications discussed by
practitioners in the grey literature.
In the following, we discuss the literature on test flakiness by
Objective of the work. discussing (1) the main scientific activities conducted on the matter in
the context of traditional software systems; (2) the papers that targeted
Our work builds on top of these previous findings and advances the developer’s perception of flaky tests; and (3) the currently available
the state of the art by conducting a targeted and comprehensive literature on test flakiness in mobile apps.
analysis of how practitioners deal with test flakiness in mobile
applications. Specifically, we aim to understand (1) How prominent 2.1. Test flakiness in the wild
and problematic flaky tests are; (2) What are the common root
causes of test flakiness and how hard it is for developers to diagnose In terms of scientific research, test flakiness has been widely studied
and fix them; and (3) What are the diagnosing and fixing activities in the context of large, open-source traditional systems, and, in the
that developers adopt when dealing with flaky tests. last few years, several automated techniques have been proposed [22].
These approaches cover various angles of the flakiness problem, such
as the detection of flaky tests [27–30,48–54], the negative effects that
Through our analysis, we aim at providing the software engineering flaky tests may create during regression testing [55], e.g., missing
research community with insights that may be helpful in designing deadlines because certain features cannot be tested sufficiently, the
novel instruments to support practitioners or to tailor existing ones to prediction of test flakiness [31–34,56–59], the identification of the root
fit the peculiarities of mobile applications. We approach our research cause of flakiness [24,35,36,60,61] and their removal [37,38,60,62],
by means of mixed-method research that combines two complementary using various algorithms (e.g., machine learning or search-based) and
studies. First, we conduct a systematic grey literature review that ex- methods (e.g., static versus dynamic analysis).
plores the developer’s discussions and perspectives about test flakiness. Interestingly, researchers and practitioners have also been working
Second, we extend the previous analysis through a survey study that together to investigate test flakiness. Indeed, a growing number of
involves 130 experienced mobile developers who were inquired about industrial studies have been carried out that report on empirical inves-
the state of the practice on test flakiness in mobile development. The tigations or propose tools. Lampel et al. [63] proposed a new approach
choice of the research method was motivated by the recent findings by that automatically classifies failing jobs as pertaining to software bugs
Zolfaghari et al. [42]: the authors showed that, despite the advances or flaky tests. Rehman et al. [64] quantified how often a test fails
made over the last decade, the academic research community is still without finding any defect in production code through an empirical
behind industries in terms of test flakiness management. As such, we investigation across four large projects at Ericsson. From an empirical
address our objectives by focusing on the practitioners’ knowledge software engineering viewpoint, Luo et al. [19] manually inspected
through two complementary research instruments that may contribute 1129 commits and reported a set of ten root causes of test flakiness.
to understanding the current state of the practice and the limitations Lam et al. [35] also conducted a large-scale study of flakiness in
thereof. Microsoft, emphasizing the problematic nature of flaky tests.
2
V. Pontillo et al. Information and Software Technology 168 (2024) 107394
Table 1
Comparison between our study and the most closely related papers.
Related work Main focus Differences
Eck et al. [20] A mixed-method study to understand the developer’s perspective on test ∙ No focus on mobile applications;
flakiness. First, 21 professional developers classified 200 flaky tests ∙ The context of the study was on industrial practitioners;
previously fixed, in terms of the nature and the origin of the flakiness. ∙ The goal of the multivocal literature review was to identify
Then, the authors conducted an online survey with 121 developers with a the challenges faced by developers when dealing with
median industrial programming experience of five years — a multivocal flakiness rather than to collect the practices employed by
literature review was conducted to inform the survey with the challenges developers.
faced by developers when dealing with flaky tests.
Habchi et al. [40] A qualitative study with 14 practitioners to identify the sources and the ∙ No focus on mobile applications;
impact of flakiness, the measures adopted by practitioners and the ∙ The study was based on the experience on 14 people, while
automation solutions. A grey literature review was conducted to inform our work is based on a large amount of mobile practitioners;
the design of the semi-structured interviews. ∙ The grey literature review was exploratory and
non-exhaustive.
Gruber and Fraser [39] An empirical study in which the authors surveyed 335 professional ∙ The target audience and the participant selection are
software developers and testers in different domains to understand the different;
prevalence of test flakiness, how flaky tests affect developers and what ∙ The work had a specific focus on the support that
developers expect from researchers. developers need from the tools;
∙ The work did not conduct a systematic literature review.
Ahmad et al. [65] A multiple-case study to understand flakiness in a closed-source ∙ The study was based on the experience on 18 developers in
development context. The study showed 19 factors categorized as test code, a closed-source context, while our work is based on a larger
system under test, CI/test infrastructure, and organization-related that are number of practitioners;
directly related to test flakiness. ∙ The work was focused on the root causes of flaky tests and
the practitioner’s perception, while our study also
investigates the strategies applied by practitioners to deal
with flakiness in mobile apps;
∙ The work did not conduct a systematic literature review.
Parry et al. [66] A mixed-method study to understand how developers define flaky tests, ∙ No focus on mobile applications;
their experiences on the impact and causes of test flakiness, and the ∙ Our work includes the investigation of the most frequent
actions taken in response to flaky tests. The authors deployed a survey root causes and the problematic nature of test flakiness in
that obtained 170 responses and analyzed 38 Stack Overflow threads mobile apps;
related to test flakiness. ∙ The work did not conduct a systematic literature review.
2.2. Developer’s perception of test flakiness out that in a non-negligible amount of time, flakiness stems from inter-
actions between the system components, the testing infrastructure, and
Multiple investigations have been previously conducted to under- other external factors. With respect to this paper, our study approaches
stand the developer’s perception of test flakiness. These studies are the problem with a different research method, i.e., a survey study
clearly closely related with respect to the work we propose. Table 1 versus an interview study. This difference allowed us to reach more
summarizes the key articles in this area along with the main differences developers than Habchi et al. [40] (150 versus 14), hence reinforcing
between them and our study. the ecological validity of our findings.
Eck et al. [20] surveyed 21 Mozilla developers in order to classify It is worth mentioning that Habchi et al. [40] did not explicitly focus
the root causes of the flaky tests they previously fixed. While the on the problem of flakiness in mobile applications, hence considering
authors confirmed most of the root causes identified by Luo et al. [19], a more general set of resources that might have provided information
they also identified additional implementation issues inducing flak- on how developers deal with flaky tests. On the contrary, our goal
iness. It is worth remarking that Eck et al. [20] also conducted a was to investigate the perspective of mobile developers and, for this
follow-up survey study to analyze the developer’s perception of test reason, we tailored the systematic grey literature process to mobile
flakiness — this survey was informed by a multivocal literature review apps, thus eliciting and analyzing information that are specific to the
targeting the challenges faced by practitioners when dealing with test mobile software development context.
flakiness. By collecting 121 responses, the authors found that flaky In the second place, Habchi et al. [40] focused on mapping the
tests are perceived as relevant by the vast majority of the traditional measures employed by practitioners when dealing with flaky tests.
developers involved. In addition, they found that the emergence of Our work is, instead, larger and more comprehensive, as we aimed to
flakiness-related concerns might impact socio-technical activities like analyze (1) the prominence, (2) the root causes, and (3) the diagnosis
resource allocation and scheduling. This study has multiple differences and mitigation strategies applied by mobile developers when dealing
with respect to ours. First, we design the study to investigate how with flakiness.
flakiness manifests itself in mobile applications: as such, our results Finally, the grey literature developed by Habchi et al. [40] did not
on the root causes of flakiness should be seen as complementary to comprehensively survey the resources available but aimed to inform
those of Eck et al. [20], who analyzed an industrial case such as the the design of semi-structured interviews — according to the authors,
one of Mozilla. Second, the authors conducted a multivocal literature they performed a literature review which was ‘‘exploratory, not ex-
review and a survey study. Nonetheless, the scope of their analyses haustive, and only aimed at informing the design of the semi-structured
is diametrically different with respect to ours. They indeed conducted interviews’’ [40]. In other terms, their work did not systematically
a review with the sole goal of identifying the challenges faced by follow the guidelines for conducting these types of studies; on the
practitioners when dealing with flakiness and validated the identified contrary, we performed multiple steps to make our literature review
challenges with a survey study. On the contrary, we are interested in as complete as possible, discarding too generic or unreliable resources
understanding the practices adopted by practitioners when identifying, that might have biased our conclusions.
diagnosing, and fixing strategies. In this sense, our study provides a Gruber and Fraser [39] surveyed 335 professional software de-
wider overview of how mobile developers face test flakiness. velopers and testers in different domains; their results confirmed the
Habchi et al. [40] built upon the findings by Eck et al. [20] and relevance of the problem, especially when using automated testing.
conducted an interview study involving 14 industrial practitioners. While the authors involved practitioners working on various domains,
Their results confirmed the relevance of the problem but also pointed including the development of mobile applications, their focus was on
3
V. Pontillo et al. Information and Software Technology 168 (2024) 107394
the automated support needed to analyze test flakiness. As such, our specific hardware, OS version, or a third-party library. At the same
survey is complementary and aims at enlarging the knowledge of how time, other root causes frequently reported in previous studies targeting
mobile developers identify, diagnose, and fix flaky tests. In addition, non-mobile applications are less common or even irrelevant in mobile
the grey literature review represents an additional contribution we apps. This is the case, for instance, of the root causes known as ‘‘Test
provide. Order Dependency’’ and ‘‘Resource Leak’’, which were never discovered
Ahmad et al. [65] interviewed 18 industrial developers on the in the analysis of Thorve et al. [41].
root causes of flakiness and their perception of the harmfulness of The observations above indicate the existence of peculiarities that
flaky tests. On the one hand, our work is different since it targets make mobile apps different from other systems and further motivate
mobile developers. On the other hand, our scope is larger as we aim to the need to investigate the problem of test flakiness in such a spe-
understand the practices used by practitioners to deal with flakiness. cific context — which is indeed the ultimate goal of our empirical
Finally, the empirical study conducted by Parry et al. [66] involved investigation.
practitioners in analyzing their definition of test flakiness, their expe- Other relevant papers that elaborated on test flakiness in mobile
riences, and the actions taken in response to newly identified flaky apps are those by Dong et al. [72], Romano et al. [73], and Silva
tests. While the goals of this paper are somehow similar to ours, we et al. [74]. In particular, Dong et al. [72] proposed FlakeShovel, a
explicitly target mobile application developers to identify the specific tool for detecting Android-specific flaky tests through the systematic
actions conducted by those developers while addressing the problem exploration of event orders. Romano et al. [73] analyzed 235 flaky UI
of flaky tests. In addition, we also elaborate on the most frequent tests to understand common root causes, the strategies used to identify
root causes of flakiness and the problematic nature of test flakiness as the flaky behavior and the fixing strategies proposed. Finally, Silva
perceived by developers. Finally, Parry et al. [66] conducted the study et al. [74] proposed SHAKER, a lightweight technique to improve the
by combining the survey study with the analysis of StackOverflow posts. ability of the ReRun approach to detect flaky tests. Their study provided
On the contrary, we conduct a grey literature review to address the guidelines for creating reliable and stable UI test suites.
research questions of the study. Therefore, also in this case, our study Despite the advances made by researchers on test code flakiness,
can be seen as complementary to the one of Parry et al. [66]. there is still a lack of knowledge on the problems and processes that
On the basis of the analysis just discussed, our contribution com- developers face when dealing with flaky tests in mobile apps. Our study
pared to the related work presented in Table 1 (1) is, therefore, the fills this gap and provides a starting point for future research to improve
first exhaustive grey literature review on test flakiness in the context mobile app development and testing practices.
of mobile applications, (2) proposes the first survey study that explic-
itly involves mobile developers inquiring them on the peculiarities of 3. Research questions and settings
flakiness in mobile apps, and (3) extends the current body of knowledge
through a complementary analysis on how mobile developers identify, The goal of the empirical study was to collect and analyze the
diagnose, and fix flaky tests. mobile app developer’s perspective on test flakiness with the purpose
of eliciting (1) the prominence of the problem, (2) the most common
2.3. Test flakiness in mobile applications root causes, and (3) the strategies implemented by developers to di-
agnose and fix flaky tests. The perspective is of both researchers and
The population-level differences between mobile apps and non- practitioners. The former are interested in assessing the current state
mobile apps have been the subject of a number of previous works. of the practice, hence understanding how to further support mobile
For instance, the seminal work by Wasserman [67] pointed out several developers when dealing with flaky tests. The latter are interested in
aspects making mobile applications different and, to some extent, understanding how other mobile developers deal with the problem,
special: the potential interaction that a mobile app may have with thus learning potential strategies to put in place.
other applications and services, the need for handling sensors, the user More specifically, our study was driven by three main research
interface which needs to adhere to externally developed guidelines, questions. In the first place, we aimed at collecting the developer’s
other than the different weight that some non-functional attributes, perception concerning the prominence and problematic nature of test
e.g., energy consumption, may have on how the app is developed flakiness. Such an investigation allowed us to challenge the results
are just some of the elements that naturally make mobile applications obtained in traditional contexts by Eck et al. [20], hence comple-
different from other applications. In addition, it is worth considering menting their observations with data coming from mobile application
that mobile apps can be native or hybrid — this further influences the developers. In particular, we asked the following research question:
way the app might interact with the telephone network or the internet
and the way the app should acquire and display data on the device.
Furthermore, software engineering researchers have also found out that RQ1 . How prominent and problematic is test flakiness as perceived
the development process behind the creation of mobile apps is different by mobile app developers?
from the traditional one (e.g., [68–70]), impacting in different manners
the way these apps are developed and tested.
Regarding test flakiness, the key aspect to consider is that mobile Once we had assessed the harmfulness of the problem, we moved
apps run on a large variety of devices with different technical character- toward understanding of the most common causes of test flakiness and
istics and display sizes [71]. This makes mobile apps more sensitive to the effort spent when dealing with them. This analysis first allowed us
various problems. Thorve et al. [41] investigated the nature of flakiness to complement the many studies that previously targeted the identifi-
in Android using a research methodology inspired by the work by Luo cation of the root causes of flakiness [19,20,23,24,35]: we were indeed
et al. [19]. The authors found various root causes which are unique interested in providing insights into the root causes that might be more
to mobile apps, like ‘‘Program Logic’’ and ‘‘UI’’. The former refers to prominent in mobile apps. Secondly, this research question also had the
the flakiness induced by wrong assumptions about the app’s program goal of challenging the findings reported by Thorve et al. [41], who
behavior, while the latter refers to flakiness due to the design and investigated the root causes of test flakiness in mobile apps. We asked:
rendering of widget layouts on user interfaces. The need to run on
multiple devices may cause additional issues with test flakiness due
to the dependency between the app and the underlying platform it is RQ2 . What are the most common causes of test flakiness and how
running on. Indeed, Thorve et al. [41] discovered a root cause, coined hard is for developers to deal with them?
‘‘Dependency’’, which concerns the flakiness caused by the usage of
4
V. Pontillo et al. Information and Software Technology 168 (2024) 107394
Finally, we focused on eliciting the strategies performed by mobile As for the search query, which represents the set of keywords
app developers when it turns to diagnosing and fixing flaky tests. This to search sources on the phenomenon of interest, we designed it to
analysis allowed us to delineate the test flakiness problem from the account for the multiple terms/synonyms that could be used to refer
process perspective, hence providing insights into how developers prac- to test flakiness, e.g., intermittent or non-deterministic test. Hence, we
tically deal with test flakiness — this might be particularly relevant for set the initial search terms and created the following query:
researchers interested in supporting developers during those processes.
We, therefore, asked:
Search Query.
RQ3 . How do mobile app developers approach flakiness in terms of
diagnosing and fixing strategies? (‘‘flaky test*’’ OR ‘‘flakiness’’ OR ‘‘intermittent test*’’ OR ‘‘non-
deterministic test*’’ OR ‘‘nondeterministic test*’’ OR ‘‘nondetermin-
ism’’) AND (‘‘mobile’’ OR ‘‘Android’’ OR ‘‘iOS’’)
To address our research questions, we proceeded with two comple-
mentary investigations. On the one hand, a systematic grey literature
review aims at analyzing how mobile app developers discuss the prob- We applied the query on Google, scanning each resulting page until
lem and which strategies they employ to identify, diagnose, and remove saturation, i.e., we stopped our search when no relevant resources
flaky tests. On the other hand, a survey study aimed to corroborate emerged from the results, as suggested by Garousi et al. [47]. We
and further extend the findings obtained through the systematic grey performed our search in an incognito mode to avoid our personal
literature review, possibly providing additional insights from the direct search bias. As recommended by Garousi et al. [47], we did our best
inquiry of practitioners involved in the flaky test management process. to exclude resources providing outdated information that might have
The conclusions drawn at the end of the empirical study aimed to biased the conclusions drawn in the study. In particular, we excluded
contribute lessons learned, implications, and open challenges on the
all the resources published before 2018: the median lifespan of a mobile
practices and issues that developers currently face when dealing with
operating system version is 4.6 years [80,81] and, therefore, it is likely
flaky tests. By definition, the empirical study has an exploratory nature,
that testing frameworks and concerns present in older versions are not
as its goal is to elicit the developer’s opinions about a problem. In
anymore relevant nowadays. In other terms, we preferred to include
terms of design, reporting, and presentation, we followed the empirical
only the most relevant resources in an effort to present the current state
software engineering guidelines by Wohlin et al. [75], other than the
of the practice.
ACM/SIGSOFT Empirical Standards.2
4. Analyzing the grey literature 4.2. Eligibility criteria and quality assessment
The first step of our analysis consisted of designing and executing On the basis of the resources identified from the search, we then ver-
a systematic grey literature review. The choice to conduct a systematic ified the eligibility criteria and performed a quality assessment. These
grey literature review has several reasons: (1) to increase the likelihood were required to select the most appropriate and relevant resources to
of a comprehensive search [77], (2) to avoid publication bias [78], synthesize.
and (3) to allow the analysis of the point of view of practitioners [78].
Inclusion/Exclusion Criteria. Inclusion and exclusion criteria allow
We followed well-established research guidelines to conduct it [47], as
the selection of resources that address the goal of our literature re-
elaborated in the remainder of the section.
view [47]. In the context of our study, we applied the following
criteria.
4.1. Data sources and search strategy
(A) Inclusion criteria: Resources that analyze test flakiness in mo-
Similarly to previous work in literature [40,79], we employed the
bile applications from different perspectives, i.e., root causes,
Google search engine to search for the grey literature. It is important
diagnosing and fixing strategies were included in our study.
to note that Garousi et al. [47] defined a model that describes the
(B) Exclusion criteria: The resources that met the following con-
grey literature’s layers (or ‘‘shades’’). The first layer concerns books,
straints were filtered out from our study:
magazines, government reports, and white papers, which are consid-
ered as high credibility sources. The second layer is annual reports, • Resources not written in English;
news articles, presentations, videos, Q/A sites, and Wiki articles, which • Resources that do not match the focus of the study;
are considered as moderate credibility sources. The third layer consists • Resources that are restricted with a paywall;
instead of blogs, emails, and tweets, which are deemed to have lower
• Duplicated resources;
credibility. In an effort to systematically analyze all the resources avail-
• Resources that analyze flakiness from the perspective of
able on test flakiness in mobile development, we decided to include
academics.
all the layers of grey literature, hence also including the risky sources
coming from the third layer. Yet, as explained in Section 4.2, we Quality Assessment: Before proceeding with the extraction of the
accounted for this aspect by means of additional quality checks aiming data from our resources, we assessed the credibility, quality, and thor-
at assessing the credibility of the sources included in our systematic oughness of the retrieved resources to discard the results that did not
analysis.
provide enough details to be used in our study. Starting from the quality
assessment checklist presented by Garousi et al. [47], we defined a
2 checklist that included the following questions:
Available at: https://github.com/acmsigsoft/EmpiricalStandards. Given
the nature of our study and the currently available empirical standards,
we followed the ‘‘General Standard’’, ‘‘Systematic Reviews’’, and ‘‘Qualitative
Q1. Is the publication organization reputable?
Surveys’’ guidelines. All the material collected and employed to address the Q2. Is the author known?
research questions have been anonymized and made publicly available in the Q3. Does the author have expertise in the area?
online appendix [76]. Q4. Does the source have a clearly stated purpose?
5
V. Pontillo et al. Information and Software Technology 168 (2024) 107394
Each question could be answered as ‘‘Yes’’, ‘‘Partially’’, and ‘‘No’’. 4.3. Execution of the grey literature review
We associated a numeric value for each label to better assess the quality
and thoroughness of each source: the label ‘‘Yes’’ was associated to the Once defined the steps for our grey literature review, we pro-
value ‘1’, ‘‘Partially’’ to ‘0.5’, while the label ‘‘No’’ was associated to ‘0’. ceeded with its execution. Fig. 1 overviews the process, reporting the
The overall quality score was computed by summing up the score of the inputs/outputs of each stage as well as summarizing the number of
answers to the four questions, and the articles with a quality score of resources considered for our study. The entire process was executed in
at least 2 were accepted. the period between November 15 to April 30, 2023, and took around
The quality assessment was mainly carried out by the first two 70 person/h. The potentially relevant results for our search query were
authors of the paper, who manually dived into the resources identified 187, but after an initial pre-screening – where we discarded resources
in order to address the questions in the checklist. More specifically, the not related to the topic, articles belonging to white literature, Wikipedia
entries, or repeated entries – we filtered out 77 sources, obtaining
inspectors applied the following steps to address each of the questions
a final list of 110 relevant sources. Next, the inspectors applied the
on the checklist:
Inclusion/Exclusion criteria. This step led to filtering out 87 sources,
• Q1. The inspectors took the name of the organization mentioned for a final list of 23 sources. The last step of our process was the quality
in the resource as input. If the organization was publicly known assessment, where six additional resources were discarded. Table 2
reports the outcome of the quality assessment step: More particularly,
and considered reputable by the inspectors, e.g., Slack,3 it was
for each resource excluded as a result of the quality assessment, the
considered reputable. If the organization was unknown to the
table overviews (i) the description of the resource; (ii) the motivation
inspectors, the inspectors performed a search over internet to
leading to the exclusion; and (iii) the score assigned to each quality
evaluate the reputation of the organization by looking at its
criterion used for the assessment. As a conclusion of the selection
mission, i.e., whether the organization was connected to software
process, we proceeded with a set of 17 online resources of different
programming and testing, its social presence, i.e., whether the
types, e.g., articles, blog posts, useful for our grey literature review.
organization had a LinkedIn account, the number of followers of
The relatively low amount of resources identified does not necessarily
the social accounts, and any additional contextual information influence the prominence of flaky tests perceived by mobile developers.
that might have led the inspectors assess its reputation, e.g., size On the one hand, all the resources selected as part of our grey literature
of the organization, number of branches of the company and their review come from organizations deemed reputable, hence suggesting
geo-dispersion. Upon the collection of these data, the inspectors that the problem of flakiness is indeed serious in multiple contexts. On
opened a discussion and finally provided an assessment of the the other hand, test flakiness is a technical matter typically discussed by
reputation of the organization. We did not set any threshold on developers with strong experience in software testing, which naturally
the pieces of information gathered, but we used such pieces of limits the amount of resources available.
information to have a contextual understanding that might have Moreover, we are aware that such a relatively low amount of
allowed us to better assess the reputation of the organizations resources left after the application of the quality assessment may lower
and authors. The use of fixed thresholds could have biased our the generalizability of the conclusions drawn from our analysis. In this
evaluation or filtered out relevant resources: as such, we preferred respect, two observations can be made. First, we aimed at performing
to conduct a qualitative assessment of the resources by discussing an extensive analysis of the grey literature on flaky tests in mobile
the organization rather than relying on quantitative measures. applications. While more resources were available when considering
• Q2–Q3. To assess the author of a post and their related expertise, the problem from a general perspective, i.e., when including all the
the inspectors conducted a similar process as described for Q1. articles that discuss flaky tests independently from the domain, only
First, they verified that the post had an author. Then, they took a limited amount of resources were valid when it came to the mo-
the name of the author as input and snowballed the search over bile application domain. As such, our analysis simply overviews the
the internet with the aim of gathering information and evaluate current state, highlighting the scarcity of resources available. Second,
the credibility of the author by looking at their previous/current and perhaps more importantly, the few results obtained motivated
positions, social media presence, and its skills in terms of software the follow-up survey study, which was designed to complement and
testing according to the information available on the internet, extend the findings obtained from the grey literature review. In our
e.g., personal website or company pages. If the inspectors were online appendix [76], we made available a file that contains both the
able to link a resource to an author with some expertise on the resources included in the analysis and those excluded at each step of
matter, they considered the author reputable and the resource the reviewing process.
Upon completion of the quality assessment, we then proceeded with
worthy to be analyzed.
the data extraction and analysis. In particular, the first two authors
• Q4. A resource was considered to have a clearly stated purpose
of the paper went through each resource, reading and summarizing
if it explicitly reported information and/or points of view on the
its content. For each of them, they also assigned a label describing
problem of test flakiness. In other terms, the inspectors verified
the main theme treated in the resource, e.g., a source reporting on
that the content of the resource was specific enough to address
‘‘Mitigation Strategies’’, and a brief summary of the content. These pieces
the research questions of the study.
of information allowed the two authors to describe and categorize the
content of the resources. These were later jointly discussed in order
The material was equally distributed between the inspectors. In
to homogenize the labels and descriptions. The discussion lasted 4 h
problematic cases of filtration, the inspectors opened a discussion
and the resulting coding exercise was used to analyze the results, as
revolving around the parameters described above and found an agree-
reported in Section 4.4.
ment on whether a resource should have been included or not. In case
of disagreement, the third author was involved and the decision was 4.4. Analysis of the results
taken based on majority voting. This did not eventually happen, as
the discussions between the first two inspectors already addressed the Before discussing the main findings coming from the systematic
source selection process. grey literature review, let us discuss the years of publication of the
resources included in the study. Fig. 2 shows the distribution of the
number of resources over the years. From the figure, we may notice
3
Slack: https://slack.com/. again that only a few developers discuss the problem of flakiness in
6
V. Pontillo et al. Information and Software Technology 168 (2024) 107394
Table 2
The outcome of the quality assessment with a brief description of the content, the motivation behind the exclusion and the score assigned to each quality criteria, i.e., reputable
organization, author known, expertise on the topic and a clearly stated purpose.
Resource Description Motivation Q1 Q2 Q3 Q4
RE1 The resource cites test flakiness as a challenge The author and his expertise are not known. The 1 0 0 0.5
when implementing mobile test automation. organization is known (Repeato is a test automation tool that
works based on computer vision and machine learning), but
the resource does not provide additional information useful
to answer our research questions.
RE2 The resource shows an Android mono repo The author’s name is reported but it has been impossible to 1 0 0 0
case study to understand the unreasonable find more information about his expertise; the article is
effectiveness of the test. written in a personal blog. For this reason, it is impossible to
analyze the expertise.
RE3 The resource presents a generic description of The author and his expertise are not known. The 0.5 0 0 1
test flakiness and how to reduce it. The focus organization is known (Testinium), but its mission is unclear.
is not on mobile apps. Does the organization provide a framework, a specialized
testing team, or other?
RE4 The resource presents a test automation tool, SpriteCloud is a community of software quality assurance and 0 0 0 0.5
i.e., Detox and reports the added advantages of cybersecurity testers in Amsterdam, but it is a sort of blog.
being less flaky without explaining how. The author and his expertise are unknown and the resource
just mentions flakiness. It is impossible to evaluate the
criteria related to the author and the organization.
RE5 The resource provides an overview of flakiness The author is unknown (the author used a nickname) and 0 0 0 1
and how to fix it. the article is written in a personal blog. For this reason, it is
impossible to analyze the expertise.
RE6 The resource provides an overview of the top The author is unknown (the author used a nickname) and 0 0 0 1
mobile app testing frameworks for 2022 and the article is in a personal blog. For this reason, it is
names flakiness about a tool (XCUITest) to say impossible to analyze the expertise.
that it minimizes it.
7
V. Pontillo et al. Information and Software Technology 168 (2024) 107394
Table 3
Final resource of the Grey Literature Review with a brief description of the content, the publication year, and at least one label describing the main theme treated in the
resource.
Resource Description Year Label
R1 Strategies to reduce flakiness in Espresso UI Tests (e.g., disable animation). 2020 Mitigation strategy
R2 Explanation of a workflow to reduce flakiness. 2021 Detection strategy
R3 The Mobile Developer Experience Team, DevXP, describes the path to minimize test 2022 Detection strategy,
flakiness. Mitigation strategy
R4 Lecture where the speaker suggests some best practices for increasing test stability are 2022 Developers’ perspective
learned. The speaker suggests that Android developers often use the word flakiness.
R5 Description of some root causes of flakiness in mobile apps (e.g., communication with 2021 Root cause, Fixing
external resources) and some strategies to fix flaky tests. strategy
R6 Description of Appium and some tips to avoid test flakiness. 2018 Mitigation strategy
R7 Overview of some root causes of test flakiness and how to avoid them. 2021 Root cause, Mitigation
strategy
R8 Tutorial on facilitating the User Interface testing and reducing test flakiness. 2020 Mitigation strategy
R9 Presentation of some strategies in XCode to detect test flakiness, i.e., the test repetition 2022 Detection strategy
mode.
R10 The author describes how to mitigate test flakiness in the emulator. 2021 Mitigation strategy
R11 Description of Gordon, an Android plugin that makes a report on test flakiness. 2019 Detection strategy
R12 The author shows an example of a test case that could have some problems related to 2021 Mitigation strategy
non-determinism. Then, the author shows how to avoid this problem.
R13 Presentation of Bitrisea , a platform to build mobile applications that can detect flakiness 2023 Detection strategy
by analyzing the result from various builds and reporting which tests change behavior.
R14 Android Documentation on UI Test. The resource suggests avoiding the use of the 2022 Mitigation strategy
sleep() function because this makes tests unnecessarily slow or flaky because running
the same test in different environments might need more or less time to execute.
R15 Since 2017, in Slack there is the Mobile Developer Experience (DevXP) team, that 2022 Detection strategy
focused on several areas of mobile application development, including ‘‘Automation test
infrastructure and automated test flakiness’’. The team analyzed flakiness in the last years
and developed an approach to reduce flakiness from 57% to 4% (the approach is
described in R3).
R16 A simple list of best Android testing tools. There is a reference to flakiness detection for 2023 Detection strategy
Waldo, a no-code testing tool useful for teams that do not have dedicated developers.
The tool analyzes the change between the various builds to detect a flaky test.
R17 This master thesis project gives guidance to mobile application developers and testers 2020 Mitigation strategy
when planning to execute automated tests. The project performed a comparison of the
three most popular mobile application testing automation in Agile/DevOps environments
and used the flakiness as a measure to understand the degree of the frameworks to
handle flaky tests based on the number of Wait statement declared.
a
The Bitrise platform: https://bitrise.io/.
‘‘Root Cause Analysis’’. Two of the twelve resources, i.e., R5 and R7, the survey study. Hence, the systematic work performed let emerge
provided information about root causes. While in R5, we could that test flakiness is not just a matter of test code when it comes to
find a detailed explanation of the most common root causes in iOS mobile applications, but rather further investigations into the root
mobile applications, in R7, we could analyze a short list of the causes due to production code might be worthwhile. In addition, R7
reasons behind test flakiness in Android applications. In the former highlights that flaky tests may be due to the specific test execution
case, the author recognizes four common root causes such as Race environment, i.e., the flakiness might arise when exercising the app
Conditions, Environment assumptions, Global State, and Communication against a real device rather than a simulator. The aspects highlighted
with external services. While some of the root causes mentioned are in R5 and R7 let emerge new perspectives of the problem, which are
similar to those reported for traditional software, e.g., race conditions, peculiar to mobile apps.
additional factors seem to come into play in mobile apps. This is Another interesting discussion point relates to the source code re-
the case of Global State and Communication with external services. In ported in Listing 1. It shows an example of UI flaky tests coming from
particular, global states pertain to variables or instances globally R12. The listing shows how the UI events reliant on network calls can
available in the source code: mutable global configurations may be take different amounts of time. In fact, this test switches between
configured differently for different test classes, possibly affecting two fragments in an activity, i.e., the click of the first button reveals
the execution environment and leading test cases to fail in certain the second button and the click on the second button reveals the
scenarios. As for the external services, R5 reported that this is a first button again. As reported in R12, the R.id.button_second
common root cause of flakiness, which is mostly due to the strong is not enabled until a network call is completed — this a flaky test
reliance of mobile apps on third-party libraries and services. because the test may attempt to access a button or a list item that is
In R7, the author recognizes four macro-categories as root causes of not available yet because it is awaiting a response from a test server.
test flakiness. These macro-categories refer to problems connected On the one hand, this example corroborates the results obtained by
to the design of production and test code, the use of a real device Romano et al. [73], who reported that the root causes behind UI flaky
or an emulator, and the hardware infrastructure. Also in this case, tests may be actually similar to those of traditional systems, e.g., they
we could recognize the existence of additional factors affecting the might be induced by network-related concerns. On the other hand,
problem of test flakiness, like the design of production and test code however, those flaky tests might be harder to manage and fix, as
— this is something that will be identified and discussed further in multiple root causes might co-occur — this example highlights, once
8
V. Pontillo et al. Information and Software Technology 168 (2024) 107394
again, that the problem of flaky tests in mobile apps is peculiar For instance, Listing 2 shows how to manage the UI flaky test
and possibly requires more sophisticated instruments that might help previously discussed. One simple solution is to use a view matcher
mobile developers to diagnose and handle them. that checks for the desired conditions before moving forward in the
test. The onViewEnabled function checks whether the UI element
being queried is enabled. If the element is not enabled, the function
1 @Test ( expected=androidx . t e s t . espresso . NoMatchingViewException : :
will use a small amount of time and check again. Once this function
class )
2 fun f l a k y T e s t ( ) { is introduced, the test case presented in Listing 1 will be modified as
3 onView ( withId (R . i d . b u t t o n _ f i r s t ) ) . perform ( c l i c k ( ) ) shown in Listing 3.
4 onView ( withId (R . i d . button_second ) ) . perform ( c l i c k ( ) )
5 onView ( withId (R . i d . b u t t o n _ f i r s t ) ) . check ( matches (
isDisplayed ( ) ) ) 1 @Test
6 } 2 fun idleViewMatcherTest ( ) {
3 onView ( withId (R . i d . b u t t o n _ f i r s t ) ) . perform ( c l i c k ( ) )
Listing 1: Example of UI flaky tests in a mobile application. This test 4 onViewEnabled ( withId (R . i d . button_second ) ) . perform ( c l i c k ( )
switches between two fragments in an activity and requires a network )
5 onView ( withId (R . i d . b u t t o n _ f i r s t ) ) . check ( matches (
call to be completed. The awaiting response from a server could cause isDisplayed ( ) ) )
flakiness. 6 }
9
V. Pontillo et al. Information and Software Technology 168 (2024) 107394
Table 4
Mitigation strategies with related resources.

Mitigation strategy           Resources
Use the wait actions          R1, R6, R10, R12, R14, R17
Use the locator strategies    R6, R8, R10
Disable animations            R1, R6
Mock the external services    R6, R7
Isolate the environment       R3, R7
Set the AppState              R6
Sort the test                 R7

‘‘Fixing Strategy’’. Only R5 provided insights into addressing flaky tests in mobile environments. The strategies discussed by the author are three: (1) Cleaning up the local environment; (2) Ensuring consistent global state and configurations; and (3) Verifying race conditions. On the one hand, it is worth noting that, from a practical standpoint, these do not really represent proper fixing strategies for test flakiness but development best practices — in other terms, the grey literature review did not let any real fixing strategy to deal with flaky tests in mobile apps emerge. On the other hand, the suggested practices may be used to address some of the root causes that emerged from our analysis. This seems to suggest that flaky tests might be mitigated or even fixed using test code quality strategies: this aspect informed the design of the follow-up survey study, in which we inquired mobile developers on the relation between test code quality aspects and the emergence of test flakiness.

‘‘Developer’s Perspective’’. Resource R4 concerned the droidcon community.5 This is one of the largest communities around Android development and periodically organizes conferences around the world. While only one resource highlighted the prominence of flaky tests in mobile apps, the resource should still be considered impactful and representative of a larger number of developers involved in developing mobile apps. In the April 2022 conference held in Lisbon, one of the speakers (Sinan Kozak, an Android engineer at the Delivery Hero company) showed some best practices to increase test stability. This talk was needed because, according to him, ‘‘the Android developers use the flakiness word more than the stability word while talking about Espresso tests’’. In other terms, the speaker highlighted that developers performing UI testing often have to deal with flakiness, since most tests are born flaky before being made stable by developers. This resource makes us recognize that the problem of flakiness in mobile applications might be managed differently than in traditional systems, since it seems to have a different frequency and impact.

5 https://www.droidcon.com/.

Connecting the dots...

The findings coming from the systematic grey literature review allow us to address the research questions posed in our empirical study in different manners.

First and foremost, we observed that flaky tests might be more prominent in the mobile context for two main reasons. On the one hand, the tight relation between an app and its user interface might cause additional issues when it comes to flaky tests, hence confirming the findings by Thorve et al. [41]. On the other hand, the analysis of resource R4 let an additional perspective emerge: especially when performing UI testing, mobile developers might deal with some sort of flakiness already when developing tests in the first place. Hence, in terms of RQ1, we can report that the systematic grey literature review revealed that the problem of flaky tests is prominent. We cannot, instead, provide any insights into the problematic nature of flaky tests; this is something that will be explored further through the survey study.

As for the root causes (RQ2), our study highlighted how some of them are similar to those identified in previous research targeting traditional software development, e.g., race conditions or resource leakage. At the same time, additional problems can be caused by tests verifying user interfaces or making use of third-party libraries. In addition, our grey literature review confirmed the findings by Romano et al. [73]: in some cases, multiple root causes might co-exist, making the flaky test diagnosis process harder for mobile developers.

Finally, with respect to RQ3, our analysis revealed the existence of some testing tools and/or frameworks that allow developers to verify the presence of flaky tests by re-running them multiple times. Perhaps more importantly, practitioners do not discuss proper fixing strategies but only apply a limited set of mitigation actions to reduce the effects of flaky tests, e.g., by disabling animations in UIs.

5. Analyzing the developer’s perception

The second step of our investigation consisted of a survey study, whose details are discussed in the following.

5.1. Survey design and structure

We followed the guidelines by Kitchenham and Pfleeger [84] to design a survey that balanced the need to be short enough with the requirement of being effective in addressing the research questions of the empirical study. In the following, we report the questions included in the survey and the type of answer requested from participants. The various answer options can be found in the online appendix [76].

Survey design. First, we included a brief introductory text where we presented ourselves and the overall goals of the empirical study. In doing so, we defined flakiness as ‘‘a phenomenon occurring when a test case is non-deterministic and exhibits both a passing and failing behavior when run against the same production code’’ [19]. This was needed to inform participants of our expectations, the expertise required to participate, and the research objectives. Participants who were not familiar with or not confident enough to discuss test flakiness could decide to abandon the survey already at this point. Furthermore, we provided information on the survey length, which was estimated at about 15 to 18 min and later empirically assessed through a pilot study (more details later in this section). We also explained that participation was voluntary, that participants could abandon the survey at any time, and that all responses would be anonymous to preserve the privacy of participants.

We finally concluded this part by asking participants for their explicit consent to use the collected data for research purposes and their authorization to proceed with the survey.

Participant’s background. The first section of the survey was related to the participant’s background. This was required to (1) characterize the sample of developers answering the survey and (2) understand whether and which answers should be removed because of the poor experience/expertise of some participants. Table 5 reports the list of questions related to this part. In particular, we asked for information on the traditional and mobile programming experience (in terms of years), whether participants mainly developed in open-source or other contexts, e.g., industrial, the environment in which they developed mobile applications, i.e., continuous integration and continuous delivery, how much and which testing they typically do when developing mobile applications, which frameworks they use, and whether they have ever dealt with test flakiness in traditional and mobile development.
Table 5
List of questions for the background section of the survey, with the type of response provided.

Section 1: Participant's background
#1  How long have you been a developer? (a)  [Multiple choice]
#2  How long have you been developing mobile apps? (a)  [Multiple choice]
#3  What kind of developer are you? (a)  [Checkboxes]
#4  You define yourself as ... (a)  [Multiple choice]
#5  When you develop mobile apps, do you work in a CI (Continuous Integration) environment, i.e., the practice of merging all developers' working copies to a shared repository several times a day? (a)  [Multiple choice]
#6  When you develop mobile apps, do you work in a CD (Continuous Delivery) environment, i.e., an approach in which teams produce software in short cycles and the deployment of a new version is automatic and does not need a human hand? (a)  [Multiple choice]
#7  How much testing do you typically perform in relation to production code? (a)  [Multiple choice]
#8  How much do you test your app with respect to the following types of testing? To answer this, consider all tests you have done for a mobile application. (a)  [Multiple-choice grid]
#9  What framework do you usually use to develop test cases in mobile apps? (a)  [Checkboxes]
#10 To what extent have you ever identified flaky tests in the code for traditional software systems (non-mobile systems) you were developing? (a)  [5-point Likert scale]
#11 Have you ever found yourself identifying flaky tests in the code you were developing, i.e., a test that is non-deterministic and exhibits both a passing and failing behavior? (a)  [Multiple choice]

(a) Answering these questions was mandatory.
Table 6
List of questions for the prominence and relevance section of the survey, with the type of response provided.

Section 2: Prominence and relevance of flakiness
#12 How often have you found flaky tests in your projects (i.e., from less than a few times per year to daily)? (a)  [5-point Likert scale]
#13 How dangerous do you consider the flaky tests identified in terms of the failures they have caused? (a)  [5-point Likert scale]
#14 What framework were you using when you detected the flakiness? (a)  [Checkboxes]
#15 At which point in the development pipeline did you identify flakiness in your project? (a)  [Checkboxes]

(a) Answering these questions was mandatory.
We designed the latter question to screen participants for the survey. Developers who had never dealt with a flaky test in mobile development were deemed unable to proceed — in these cases, we showed a ‘‘Thank You!’’ message and let participants leave the survey.

Prominence of test flakiness. The participants who continued the survey were then moved to the second section, which proposed questions concerned with the prominence of flaky tests and their harmfulness, other than additional contextual information that allowed us to elaborate on the potential co-factors associated with the emergence of test flakiness. As shown in Table 6, we specifically inquired about the frameworks participants were using when a flaky test arose and at which point of the development pipeline test flakiness is typically discovered (e.g., during code review or as a result of regression testing).

Root causes of test flakiness. Afterward, the survey proposed questions related to the root causes of test flakiness — Table 7 shows the list of questions related to this section. These questions aimed to stimulate participants to think about the characteristics of the flaky tests they dealt with. We first asked developers to indicate the most common root causes from their perspective, along with the complexity in terms of detection and fixing. Then, we asked the frequency with which they identified a root cause in both traditional and mobile development, giving the opportunity to answer ‘‘I do not remember’’. When answering these questions, they could select one or more options from a predefined list of the most common root causes coming from both previous works [19,20,23,24,35,41] and the results of the systematic grey literature review previously conducted. Developers could also indicate additional root causes not included in the list through the ‘‘Other’’ option.

In addition, we asked whether the participants identified some relations between the emergence of a flaky test and general properties of the test code, like a high number of lines of code, the use of external resources, and so on. The proposed list of properties comes from the results of a recent work by Pontillo et al. [57]: the authors assessed the relationship between flaky tests and a set of test and production code metrics and smells, reporting that some of these factors might impact the likelihood of a test being flaky. By presenting these properties in the survey, we could challenge the findings by Pontillo et al. [57], complementing them with information on the developer’s perception. Also, we could further verify the insights coming from the systematic grey literature review, where we discovered that some properties of production code might affect test flakiness. We did not explicitly mention the names of the metrics and smells, but reported their descriptions: this was done to mitigate risks due to confirmation bias [85], e.g., developers might have been more inclined to identify a relationship if they were aware of the fact that some of these properties referred to sub-optimal test code design and/or implementation choices. Participants also had the chance to write other answers to report on relations they identified in their past experiences.

Diagnosing of test flakiness. The next section of the survey was related to the strategies employed once a seemingly flaky test was identified, e.g., when a test that previously succeeded failed — Table 8 reports the list of questions. We first asked how participants verified the actual flakiness of a test (e.g., by rerunning it multiple times), how the development pipeline moved forward (e.g., by temporarily disabling the test), whether the same flaky test appeared again in later stages of the development and, if so, how participants reacted. When answering these questions, the developers could check one or more of the predefined answers that we included to ease the survey compilation. The predefined answers were based on the strategies described by previous works [20,21,26] and those coming from our systematic grey literature review. At the same time, developers could still provide additional answers through the ‘‘Other’’ option.

Fixing of test flakiness. After the diagnosis, we focused on the fixing process. Table 9 reports the list of questions. Similar to what was done earlier, we asked how participants typically fix flaky tests. Then, we provided them with a list of fixing strategies for each of the root causes presented in the previous section of the survey, asking participants to indicate how often they applied those fixing strategies to address flaky tests.
Table 7
List of questions concerning the root causes of flaky tests in the survey, with the type of response provided.

Section 3: Root causes of flaky tests
#16 Could you indicate which root causes you have most identified and analyzed in the mobile application? (a)  [Checkboxes]
#17 How many times did you identify/analyze the following root causes after detecting a flaky test during the development of a traditional software system (non-mobile application)? (a)  [Multiple-choice grid]
#18 How many times did you identify/analyze the following root causes after detecting a flaky test during the development of a mobile application? (a)  [Multiple-choice grid]
#19 Based on your experience, which flaky tests are more difficult to DETECT in terms of effort? (a)  [Multiple-choice grid]
#20 Based on your experience, which flaky tests are more difficult to FIX in terms of effort? (a)  [Multiple-choice grid]
#21 When you detected test flakiness in your mobile app, did you ever recognize the following characteristics? (a)  [Multiple-choice grid]
#22 Did you detect any additional characteristics other than those previously presented? (a)  [Paragraph]

(a) Answering these questions was mandatory.
Table 8
List of questions for the flakiness detection strategies in mobile apps, with the type of response provided.

Section 4: Detection of flakiness in mobile apps
#23 Once you identified a failure, how did you later diagnose that it was a flaky test? (a)  [Checkboxes]
#24 Once a flaky test is identified, how does the development pipeline move forward? (a)  [Checkboxes]
#25 After you have found and fixed one or more flaky tests, to what extent have you found again the same flaky test going on with the development? Please address the question independently from the difficulty of fixing the flaky test. (a)  [5-point Likert scale]
#26 If you answered with at least 3 to the previous question, could you describe what you did in the development process (e.g., you fixed the flaky test again, you disabled the flaky test, you reran the flaky test)?  [Paragraph]
#27 Would you like to give us further information about the flakiness detection process you used?  [Paragraph]

(a) Answering these questions was mandatory.
Table 9
List of questions related to the fixing strategies employed by mobile developers, with the type of response provided.

Section 5: Fixing flakiness in mobile apps
#28 How often have you fixed flaky tests in your projects (i.e., from never to daily)? (a)  [5-point Likert scale]
#29 How often have you used these strategies to fix a flaky test?  [Multiple-choice grid]
#30 Did you use any additional strategies other than those previously presented? (a)  [Paragraph]
#31 If you work in a CI environment (i.e., the practice of merging all developers' working copies to a shared repository several times a day), once a flaky test is detected and fixed, do you immediately release a new version of the application? (a)  [Paragraph]
#32 If you work in a CD environment (i.e., an approach in which teams produce software in short cycles and the deployment of a new version is automatic and does not need a human hand), how do you perceive the presence of a flaky test? Do you quickly try to fix and update the mobile app, or does knowing that the test is occasionally successful make you proceed differently? (a)  [Paragraph]
#33 In your opinion, what do you think are the similarities and/or differences in the flakiness management process (i.e., detection, mitigation, and fix) in mobile applications between CI and CD environments? (a)  [Paragraph]
#34 Would you like to give us further information about the fixing process you used? (a)  [Paragraph]

(a) Answering these questions was mandatory.
We exploited the findings by Eck et al. [20] to compile the list of fixing strategies: they have indeed provided a taxonomy of the most frequent operations performed by developers to deal with test flakiness. We did not use the results of the systematic grey literature review when proposing potential fixing strategies, as it did not let any proper strategy to consider emerge — as discussed in Section 4.4. Our participants also had the chance to answer ‘‘I do not remember’’ and indicate additional strategies to put in place. Finally, we asked the participants to give us more details about the fixing process in a Continuous Integration and Continuous Delivery environment, asking them if there were any similarities and differences based on their experiences.

Survey closure. In the last section of the survey, before thanking the participants, we allowed them to enter their e-mail addresses to (1) receive a summary of our results and/or (2) participate in a future follow-up semi-structured interview. As shown in Table 10, participants were also allowed to enter one or more links to open-source repositories of the apps they developed: this was part of the incentives we provided for the participation, namely a free test code quality analysis of the repositories shared. It is worth clarifying that the participants were made aware of the offer only upon completion of the survey. In other terms, we did not advertise the test code quality consultation in the call for participation but rather presented the offer just as compensation for the effort and time the participants spent in answering our questions.

At the end of each section, we included a free-text answer where developers could elaborate and provide additional information on how they deal with flaky tests.

Survey validation. A crucial aspect to consider when designing surveys is concerned with their quality and the time required to fill them out. While longer surveys provide more insights into the matter, excessive length may discourage participation [86]. We took great care of this point and, for this reason, after defining the first version of the survey, we conducted a pilot study [87]. A pilot study typically consists of an experiment with a small sample of trusted participants who might provide feedback on the length, clarity, and structure of the survey. In our case, the pilot study was conducted with four mobile software engineering developers with experience in software development and testing, who were selected from our contact network.
Table 10
List of questions for the survey closure section, with the type of response provided. The asterisk next to some questions indicates that it was mandatory to enter an answer.

Section 6: Survey closure
#35 Would you like to learn about the results of our study or to be contacted to participate in a follow-up interview/focus group on this topic? If yes, please write down your email.  [Short answer]
#36 If you are an open-source developer, would you like to share the mobile app code where you spotted one or more flaky tests? If yes, please write down the link to the repository.  [Paragraph]
#37 Would you like to share a link to other open-source projects you are working on that you have not experienced any flakiness issues with so far? If you are interested, we can review the code and get back to you if we spot any flaky tests. Please write down the link to the repository.  [Paragraph]
The four involved developers well matched the ideal population of our survey, i.e., they have expertise in software testing and deal with test flakiness in the context of their projects.

They were contacted via e-mail by the first author of the paper and, upon acceptance, were instructed on how to fill out the survey. Besides the link to the survey, the four mobile developers were provided with a text document where they could report their observations in terms of length, clarity, and structure of the survey. We also requested them to clock their progress so that we could have precise indications of the time required to complete the survey. They had one week to complete the task. The four pilots sent their notes back to the first author upon completion. These were later jointly analyzed by the first two authors of the paper, who identified and addressed a few clarity issues in the description of the various sections of the survey. Also, one of the pilots pointed out a lack of emphasis on the type of development to be considered when answering the questions, i.e., traditional software system or mobile application development. We recognized this problem and emphasized the concept in questions #17 and #18. In terms of timing, the pilots successfully completed the survey within 17 min. We finally double-checked our changes with the pilots, who confirmed that all their feedback was addressed.

5.2. Survey recruitment and dissemination

The following sections elaborate on the decisions taken to maximize the developer’s participation.

Survey Recruitment. A key choice of survey studies is concerned with the selection of suitable target participants [86]. We consciously avoided spreading our survey on social networks, as we could not have had control over the answers received [86]. On the contrary, we identified Prolific6 as a useful instrument to recruit our participants. This research-oriented web-based platform enables researchers to find participants for survey studies. It allows the specification of constraints over participants, which enabled us to limit participation to mobile developers. Prolific implements an opt-in strategy [88], meaning that participants get voluntarily involved. To mitigate possible self-selection or voluntary response bias, we introduced a monetary incentive of 7 USD. Incentives are well known to mitigate self-selection or voluntary response bias, other than increasing the response rate, as shown in previous studies targeting the methods to increase response rates in survey studies [89,90].

The Prolific platform has been studied by various researchers [91,92]. For example, Reid et al. [92] recently defined recommendations to conduct surveys using the platform. We followed those recommendations while preparing the survey, e.g., we pre-screened participants in order to assess their suitability for the study.

Survey Dissemination. The recruitment platform allows researchers to include external links to web pages hosting the actual survey to fill out. We implemented our survey through a Google form7 and included it within Prolific. We collected 153 responses, which were subject to quality assessment — more details are discussed in Section 5.3.

6 Prolific website: https://www.prolific.com/.
7 Google form website: https://www.google.com/forms/about.

Ethical Considerations. In our country, it is not yet mandatory to seek approval from an Ethical Review Board when releasing surveys with human subjects. Nevertheless, when designing the survey we mitigated many of the possible ethical and privacy concerns [93]. We guaranteed the participants’ privacy by gathering anonymous answers. Participants voluntarily provided their e-mail addresses to receive a summary of the results and/or request the free test code quality analysis on their apps. When recruiting developers, we stated the goal of the survey study, other than explicitly reporting that the given answers would be used in the scope of a research activity that would not have any intention of publishing sensitive data. Finally, we clarified that completed surveys would eventually become public, preserving the privacy of the participants.

5.3. Data cleaning and analysis

Once we had collected the responses, we performed a quality assessment phase. In particular, the first author of the paper went through the individual responses collected to validate them, possibly spotting cases where the participants did not take the task seriously or did not have enough experience to provide valuable insights. This procedure led to the exclusion of 23 responses, which are reported in the online appendix [76]. On the other 130 answers, we first made sense of the data by analyzing the closed answers. As the reader might see by looking at the detailed set of questions available in Section 5.1, most questions were formulated so that participants could express their opinions through check-boxes or using a 5-point Likert scale [94] with different ranges, e.g., from Very rarely to Very frequently. These answers could be analyzed using statistical analysis, collecting numeric distributions and interpreting them via bar plots and tables.

Instead, the open answers were subject to content analysis [95], a research method where one or more inspectors go over the data of interest and attempt to deduce their meaning and/or the concepts they let emerge. The process was conducted by the first two authors of the paper, who jointly analyzed the individual responses to identify and label the main insights and comments left by participants. The content analysis process took around 80 person/h and was used to better understand and contextualize the position of participants with respect to the diagnosing and fixing processes adopted when dealing with flaky tests.

5.4. Analysis of the results

This section summarizes the demographics of the 130 involved participants, other than addressing the three research questions of the study.

Demographics. Fig. 3 overviews the main information on the participants’ backgrounds. In the first place, a large portion of participants qualify themselves as freelance or open-source developers (36.6% and 28.2%, respectively), while 35.1% and 26.1% of them consider themselves as industrial and startup developers, respectively. Of the 130 responses, 32 participants qualified themselves as belonging to more than one category: 11 participants recognized themselves as open-source and freelance developers, 7 as open-source and startup developers,
4 as open-source and industrial developers, 4 as startup and freelance developers, 2 as industrial and startup developers, 2 as open-source, startup, and freelance developers, 1 as industrial and freelance developer, and finally 1 as industrial, freelance, and startup developer. As a consequence, the percentages reported in the paper do not sum up to 100%. Almost 59% of the participants have more than 3 years of development experience and 24% of them have been developing mobile applications for more than 3 years. 79% of the participants were Android developers, with others involved in developing apps using iOS, Flutter, or other platforms. Concerning the environments in which the participants work, 71% declared that they develop in a Continuous Integration environment, while 50% work in a Continuous Delivery environment — 40% of them develop in both Continuous Integration and Continuous Delivery. More important for the objectives of our study is, however, the information provided on the amount of testing that participants typically perform when developing their apps. Concerning the answers to question #4, we can observe that 69 participants (66.9%) reported doing testing for less than 60% of their production code. Finally, participants declared to develop between 21% and 40% of test cases for all types of testing levels.

Based on these pieces of information, we can conclude that our sample is quite heterogeneous and composed of mobile developers working in different contexts and having various levels of seniority.
Fig. 4. Results for questions #12 and #13, i.e., how often they have detected flakiness in their project and how dangerous they consider the failure caused by the flaky test to
be.
Fig. 5. List of frameworks where flakiness appears during development — some are for iOS development (XCUI Test) or Android development (Solar 2D, Mockito and JUnit), others
are for cross-platform development (Appium, React Native, Flutter, and Ionic). The blue bar indicates the popularity of the framework in the participants’ responses, while the green
bar shows the frequency of flakiness appearance in a specific framework based on the responses.
On the one hand, these demographics are in line with the recent findings reported by Pecorelli et al. [13], who showed that mobile apps are typically poorly tested in practice — this consideration will be important to interpret some of the results discussed later in this section. On the other hand, it is worth remarking that Pecorelli et al. [13] only dealt with open-source Android applications, while our work also collects information on mobile apps developed in industrial and startup environments, other than on different operating systems. For this reason, the results we will report in the next sections include the perspective of a larger population of mobile developers.

Despite the relatively low amount of testing done, 64.6% of developers have experienced test code flakiness at least once in the projects they developed. This percentage already suggests the diffuseness of the problem, indicating that test flakiness is not a negligible issue. More particularly, 84 participants claimed to have dealt with flaky tests and were allowed to answer the subsequent questions of the survey, which aimed to understand better the flakiness problem in mobile development. The remaining 46 participants were thanked for their participation and left the survey, i.e., we discarded those developers from the subsequent analyses that led us to address the research questions. Among these 46 participants who were not considered, 7 (15%) of them also declared that they had faced flaky tests in traditional contexts often or very often, 8 (17%) sometimes, 21 (46%) rarely, and only 10 (22%) never. We found these data interesting, as they seem to emphasize the existence of significant differences between the problem of flakiness in mobile and non-mobile applications — hence motivating the need for mobile-specific investigations like ours.

RQ1 - On the Relevance of the Problem. Fig. 4 shows the answers provided by participants to questions #12 and #13, which allowed us to address RQ1.

Regarding the frequency of flaky test appearance, 43% of the participants indicated that flakiness sometimes arises, i.e., on a monthly basis, while 12% claimed to find flakiness frequently, i.e., weekly or more. The remaining results for question #12 show that 33.3% of participants claimed that flaky tests are rarely observed, i.e., a few times per year, while 12% declared that flaky tests are very rarely observed, i.e., less than a few times per year.

In the first place, these results suggest that, despite the low amount of testing performed, the flakiness issue is observed in practice, hence corroborating the insights coming from our systematic grey literature review. Moreover, our findings are in line with what has been investigated in traditional software: for instance, Eck et al. [20] reported that over 40% of developers faced the problem only a few times per year.

The answers received on the harmfulness of flaky tests in terms of the failures they caused (question #13) corroborate the idea that the flaky test problem is serious. Around 45% of the participants reported that flakiness is dangerous or very dangerous, while 52.4% indicated that the problem could cause moderate issues. Hence, only a limited number of developers (2.4%) considered the problem a minor issue.

As for the testing frameworks used when flaky tests were discovered (question #14), Fig. 5 shows the results obtained. For Android development, participants mentioned frameworks such as Appium, Solar 2D, and Espresso Test, while for iOS development, we obtained as responses only XCUI Test.
Fig. 6. Answers to question #15, i.e., at which point of the development pipeline the participants identified the flakiness. Note that the percentages do not sum to 100 because participants could select more than one option.
In addition, frameworks like Flutter and React Native – which can be used for creating cross-platform apps – were also mentioned. Analyzing the frequency of appearance of flakiness in the various frameworks based on their popularity as reported by the participants, we could see that there seems to be a correlation — the more popular a framework is, the higher the risk of flakiness. Unfortunately, the responses obtained and analyzed are too limited to allow for more detailed statistical analysis, but we plan to investigate this in our future work.

Finally, Fig. 6 shows the results when analyzing the context where flakiness is detected (question #15). In most cases, developers discover flaky tests from bug reports (54%), through code review activities (48%), or regression testing (40%). Most likely, these results reflect the typical processes applied to guarantee software quality assurance: indeed, developers typically automate the opening of bug reports when previously successful test cases suddenly fail [96], when tests are run against production code in a continuous integration setting [97,98], or as a consequence of the code review activities performed on production code, i.e., when executing tests to verify source code during code review [17].

RQ2 - On the root causes of test flakiness. Fig. 7 overviews the most common root causes of flakiness along with their perceived detection and fixing effort. As shown, the most common pertain to API Issues and Program Logic. On the one hand, these results contrast with what has been reported by previous works [19,20]. Both indeed relate to issues in the production code rather than in the test code, which implies that the sources of flakiness are often to be searched in how the app logic is implemented or how external services are integrated. On the other hand, these findings corroborate what we discovered with the systematic grey literature review, i.e., production code factors and third-party libraries might impact the emergence of flaky tests.

The difference with respect to the knowledge acquired in traditional software might be a reflection of the peculiarities of mobile development. The continuous release approach followed by developers to introduce new features and fix defects encountered by users [14] might lead them to induce sources of non-determinism in the production code more frequently. Previous works [5,99] have indeed shown that the continuous changes applied to mobile applications are correlated with an increase in maintainability, reliability, and security concerns, which are all well-known causes of instability that may complicate software testing activities [100]. In other terms, our findings suggest that test cases mostly become or might become flaky as a consequence of poor design solutions applied by developers when evolving mobile apps. The API issues mentioned by our participants may provide additional motivation for the conclusions drawn so far. In fact, mobile apps make intensive use of REST APIs [101], which are often hard to use [102,103] or even change- and fault-prone [6,7]. This aspect may create additional issues and indirectly affect the quality and effectiveness of test cases. Interestingly, the two root causes discussed are those that developers perceive as the hardest to detect and fix. This result suggests the need for further experiments that focus on the source code quality perspective of test flakiness.

The third root cause most identified in our study is UI problems — this result confirms the results of our grey literature review and the results shown by Thorve et al. [41]. Other answers provided by developers revealed root causes similar to those identified in previous work conducted in the context of traditional systems [19] – this confirms, once again, the findings coming from the systematic grey literature review. Problems connected to randomness, network, test environment, concurrency, and test order dependency are well known in the research community – and various automated techniques to deal with them have also been proposed [72,104]. Nonetheless, we could observe differences in terms of frequency of appearance. For instance, with respect to Eck et al. [20] and Luo et al. [19], problems due to test order dependency occur far less often, according to the opinions of our participants. This difference is also visible when considering the effort required to detect and fix them, which is medium for most participants. These results further suggest the need for specific, contextual empirical investigations aiming to understand mobile app testing characteristics. The involved developers did not mention root causes different from those proposed in the survey, meaning that, in their perspective, the list was complete.

Fig. 8 shows the comparison between the most common root causes analyzed in traditional software systems and in mobile applications. We can observe that flakiness related to API Issues, Program Logic, and UI problems is more diffused in mobile applications than in traditional systems, with 49%, 48%, and 40% of the responses, respectively, varying between Monthly and Daily. Analyzing the other root causes presented in the survey for questions #17 and #18, we can see that the frequencies with which they occur in traditional systems and mobile applications are similar. We can therefore conclude that some root causes are more likely or more frequent to arise in mobile apps, justifying the need for further investigations into the matter.

Perhaps more interesting were the answers provided when linking flaky tests to test code metrics and smells (question #22) - Fig. 9 shows the results achieved. From this analysis, we could recognize that all features proposed in our question are, to some extent, related to the presence of test flakiness. For instance, design concerns pertaining to the improper management of external resources, i.e., Mystery Guest and Resource Optimism, appear in 81% and 60% of the cases with values between Sometimes and Always. Other factors that seem to be related to test flakiness are Eager Test (61% of the responses between Sometimes and Always) and the number of lines of code in the test suite (63% of the responses between Sometimes and Always). These are well-known problems in the field of test code quality [105–107], other than being already associated with test flakiness [57,108]. On a similar note, other design metrics, like coupling between tests or assertion density, seem to relate to flakiness, pointing out that how test code is designed might affect the likelihood of tests being flaky. Finally, we could also observe that factors like code coverage or the size of test suites are not frequently connected to flakiness: this seems to indicate that these factors are less relevant and insightful in identifying the sources of flakiness.

RQ3 - On the diagnosis and fixing strategies. Our third research question targeted mobile developers’ test flakiness diagnosis and fixing processes.
Fig. 7. The most common root causes identified by participants. For each root cause, we reported the developer’s perception in terms of detection and fixing effort, i.e., questions
#16, #18, and #19.
For the sake of understandability, we split the following discussion based on the lifecycle of a flaky test, i.e., verification, detection, and fixing of flakiness. Besides providing a more structured analysis of the developers’ answers, this also enables an easier comparison with existing literature targeting the developer’s perception [20] and the lifecycle of test flakiness [23].

Flakiness verification. The responses provided by developers let us first understand how they diagnose the existence of a flaky test, i.e., how they verify that a failing test is intermittent. In particular, the participants claimed to use two mechanisms, namely the so-called ReRun [26] and the manual test code inspection [17], in 81% and 74% of the cases, respectively. The former result is in line with the grey literature analysis and consists of rerunning tests multiple times with the aim of assessing their degree of reliability. One documented issue about this strategy concerns the lack of insights on the number of times a test should be rerun to verify its flakiness [44,109]. This is also perceived by the mobile developers surveyed: for example, developers #37 and #78 reported that they ‘‘Rerun the test multiple times under different conditions and using different values’’.

The manual test code inspection relates to the manual debugging of the potential issue affecting the test code, with the aim of comprehending the nature of the failure. In this respect, only 23.8% of the participants declared to use dedicated tools (question #23). This result could show that developers do not know about the existence of some tools or frameworks that can be helpful in the analysis of flaky tests. For example, developer #38 reported that they ‘‘do not know if there is any targeted tools for flakiness detection.[...]’’.

Flakiness detection. Once developers verify that a test is flaky, they tend not to ignore it. Only 14.3% of the participants declared that they voluntarily disabled one or more flaky tests in the past, while 13% ignored the sources of flakiness and proceeded with the development without caring about the issue. The vast majority of the surveyed developers (89%) reported their attempts to fix a flaky test as soon as it arises. This process is typically approached manually because of the lack of tools that might assist them in this process — our participants could not provide us with any sort of automation they were aware of.
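The ReRun mechanism described above requires very little infrastructure to reproduce. The following JUnit 4 sketch is ours, not something reported by the participants: the RerunRule name, the default of ten runs, and the failure-counting logic are illustrative assumptions. It reruns a test body several times and flags it as a flakiness suspect when the outcomes disagree, which is exactly the intermittent behavior developers look for.

import org.junit.rules.TestRule
import org.junit.runner.Description
import org.junit.runners.model.Statement

// Illustrative rule: reruns each test `times` times and reports how often it failed.
// A test that both passes and fails across the reruns is a flakiness suspect.
class RerunRule(private val times: Int = 10) : TestRule {

    override fun apply(base: Statement, description: Description): Statement =
        object : Statement() {
            override fun evaluate() {
                var failures = 0
                var lastError: Throwable? = null
                repeat(times) {
                    try {
                        base.evaluate()   // run the original test body once
                    } catch (t: Throwable) {
                        failures++
                        lastError = t
                    }
                }
                when {
                    failures == 0 -> return                   // stable pass
                    failures == times -> throw lastError!!    // stable failure: likely a real bug
                    else -> throw AssertionError(
                        "${description.displayName} is flaky: failed $failures/$times runs",
                        lastError
                    )
                }
            }
        }
}

A test class would activate it with a field such as "@get:Rule val rerun = RerunRule()". Note that this only captures in-process reruns: as discussed above, it gives no guidance on how many reruns are enough, and device or emulator state leaking between runs may still hide or amplify the flakiness.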
Fig. 8. Comparison between the root causes analyzed by developers in traditional software systems and in mobile applications, i.e., questions #17 and #18.
Fig. 9. The frequency of appearance of the test and production code factors provided in question #21.
More worrisome is what emerged from the answers to question #25. Indeed, 48.8% of participants periodically identified the same flaky tests over the evolution of their app. This means that a non-negligible amount of flaky tests, which were supposed to be fixed, arise again in a later stage of software development. On the one hand, this result suggests that developers cannot always properly verify the outcome of a fix, perhaps because of the lack of usable tools [22]. On the other hand, this aspect might be connected to what was discovered in the context of RQ2: if most flaky tests are due to problems pertaining to production code, it seems clear that developers trying to fix them by modifying the test code are likely to fail and see the flakiness appear again. Our findings point to the well-known problem of locating the source of flakiness in the code to reduce maintenance efforts [22].

The problems due to the recurrence of flaky tests do not limit themselves to the short term. The participants reported a constant interest in understanding and fixing the source of flakiness. 43% of them claimed that they try to fix the flaky tests again and keep them under constant follow-up monitoring. This is typically realized through the ReRun approach, which is used to control the severity of the problem, namely the extent to which the flakiness keeps manifesting itself. Finally, only two participants claimed that the recurrent flaky tests were disabled or even removed from the test suite.

Flakiness fixing. Turning our focus to the fixing strategies, 42.9% of developers indicated that they sometimes fix flakiness, while 8.3% fix flakiness frequently.

Concerning the strategies applied by participants to fix flaky tests, answers to question #29 showed that the most frequent strategies used to fix flakiness are Checking instance variables before the test is accessed and executed (61%) and Verify the race condition (53.5%). The first strategy is often associated with root causes like Program Logic, Concurrency, and Randomness. The fixing consists of verifying and adapting the status of the instance variables set in the fixture of a test suite so that the related test cases can work in a proper environment. Checking the context of a test case seems to be a pervasive problem in mobile applications, as Thorve et al. [41] also reported. Very likely, this is a reflection of the peculiarities of mobile applications, which rely on external sensors, APIs, etc., being, therefore, more prone to non-deterministic variations of their environment.

For the second most common strategy, race conditions refer to multiple threads that access the same data without controlling which one comes first, so that each run can differ. As reported in our grey literature review, verifying the race conditions using the Thread Sanitizer is a fixing strategy for flakiness depending on Concurrency.

Other common fixing strategies concern the removal of concurrency-related problems. The most common in this respect are Add waitFor statement (36%) and Add new code, so the conflicting object is destroyed before continuing the execution (34.5%). Finally, we also noticed that 24% of the participants mentioned the so-called Skip Non-Initialized Part strategy: this refers to the addition of code aimed at skipping non-initialized parts and making the test run faster [19] — this represents an alternative strategy to deal with API issues. The additional fixing strategies provided in the survey appear to be only rarely used in practice.

Overall, we could conclude that the analysis of the detection and fixing strategies is in line with the root causes pointed out by the participants. Most of the strategies are connected to very specific problems of mobile applications. Some participants also provided additional insights into the operations done when fixing flaky tests in a Continuous Integration environment and in a Continuous Delivery environment. First and foremost, the answers reported that, independently of the environment, the participants try to fix flakiness quickly and release a new version. Only in some cases do participants tend to ignore the flaky tests. For instance, developer #83 reported that ‘‘if the new version is needed for a demo, probably ignore/disable the test, [...] but in general we try to fix/handle the test failure’’. Finally, participants declared that they do not see any differences in the management of flaky tests between the two environments, so we could conclude that test flakiness is a relevant problem regardless of the process and development environment.

Connecting the dots...

The findings obtained from the survey study allowed us to further elaborate on the research questions.

As for RQ1, test flakiness is a frequent problem for 55% of the participants and may lead to harmful consequences. Developers identify flaky tests during bug report (54% of the responses) and code review (48% of the responses) activities. Finally, it seems that the choice of testing frameworks might not impact the emergence of flaky tests. Our results corroborate the systematic grey literature review with respect to the prominence of the problem: in this respect, while the grey literature review could only anecdotally suggest that the problem of test flakiness is relevant for mobile developers, due to the limited amount of resources available, the survey study could provide tangible insights in this respect, quantifying the relevance of the problem more precisely. Furthermore, the survey study could provide a larger overview of the problematic nature of flaky tests, which would not have been possible to gather solely by looking at the resources retrieved through the grey literature review.

In terms of root causes (RQ2), about 50% of developers indicate issues in production code as the main causes of test flakiness, which is in line with the insights arising from our systematic grey literature review. This is likely due to the continuous changes applied or the difficulty of dealing with external APIs. Also in this case, the survey study had the function of quantifying the insights arising from the grey literature review and suggested that further studies investigating the peculiarities of mobile testing are needed. Finally, the survey results could also extend the results of the grey literature review, letting the role played by test smells as a potential indicator of flakiness emerge.

As for the diagnosis and fixing procedures (RQ3), developers typically diagnose test case failures by rerunning test cases multiple times or manually inspecting test code (80%). More importantly, the detection process is mostly manual and conducted with low or no tool support, even though some automation mechanisms emerged from the systematic grey literature review. While the grey literature review revealed the presence of various tools and frameworks to deal with test flakiness, the survey results revealed that these instruments are not actually used in practice: in this sense, the survey study had the role of bridging the gap between the availability of detection and diagnosis instruments and their adoption in practice. Furthermore, the survey study identified additional insights into the state of the practice. First, the recurring emergence of the same flaky tests makes developers less and less inclined to address the causes of flakiness. Second, the key fixing strategies are connected to the verification/update of the instance variables when the test case is executed (61%) and the verification of the race conditions (53.5%) — these strategies did not emerge in the grey literature survey. Both strategies might be somehow considered as best practices when developing and verifying test cases: we might therefore claim that future investigation into the best practices to write and verify test cases might help developers avoid the introduction or speed up the fixing of flaky tests. The survey results in this respect complemented those of the grey literature review: the survey responses may indeed provide an initial catalog of practices applied by practitioners to deal with flaky tests; on the contrary, the grey literature study could not let any specific practice emerge.
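To make the most frequently mentioned fixing strategy (checking the instance variables before the test is executed) more concrete, the sketch below shows one way such a check can be placed in the test fixture. The CounterRepository singleton and its reset method are purely illustrative assumptions and do not come from the survey data:

import org.junit.Assert.assertEquals
import org.junit.Before
import org.junit.Test

// Illustrative shared collaborator: its internal state survives across test methods
// because it is a singleton, which is a typical source of order-dependent flakiness.
object CounterRepository {
    private var count = 0
    fun increment(): Int = ++count
    fun current(): Int = count
    fun reset() { count = 0 }
}

class CounterTest {

    // Fixing strategy: verify and adapt the instance variables in the fixture so that
    // every test starts from a known state instead of whatever a previous run left behind.
    @Before
    fun setUpFixture() {
        CounterRepository.reset()
        assertEquals("Fixture is not in the expected initial state", 0, CounterRepository.current())
    }

    @Test
    fun incrementStartsFromZero() {
        assertEquals(1, CounterRepository.increment())
    }

    @Test
    fun currentReflectsSingleIncrement() {
        CounterRepository.increment()
        assertEquals(1, CounterRepository.current())
    }
}

The same check-first idea extends to Android-specific shared state (preferences, databases, singletons), in line with the observation above that mobile apps are particularly exposed to non-deterministic variations of their environment.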
6. Discussion and implications

The results of our empirical study pointed out a number of valuable discussion points that are worth further elaboration, along with a number of implications for both researchers and practitioners.

6.1. Contextualizing our findings

As discussed in Section 2, the scientific literature on test flakiness is rapidly growing, and multiple papers have already assessed the developer’s perception of flaky tests, other than proposing tools and techniques to manage them. It is, therefore, worth contextualizing our findings with respect to previous research, to provide the reader with a more comprehensive view of our results and how they impact the current knowledge on test flakiness.

With respect to the root causes of flakiness, our findings extend those reported by Thorve et al. [41]. In particular, according to the surveyed participants, we identified ‘‘API issues’’ as the most prominent cause. This was never encountered by Thorve et al. [41] during their manual analysis of 77 commits pertaining to 29 open-source Android apps. On the one hand, the difference might be due to the different research methods employed, which led us to gather the experience of a large number of mobile practitioners. On the other hand, our sample is not only composed of open-source developers engaging with the Android operating system but also of practitioners working in different contexts and on different platforms. As such, we cover a larger number of environments, hence extracting additional knowledge with respect to the state of the art. At the same time, it is also worth remarking that other prominent causes of flakiness identified by Thorve et al. [41] were also frequently mentioned by the practitioners involved in our study — this is the case of ‘‘Program Logic’’ and ‘‘UI problems’’. In this respect, our findings corroborate those of Thorve et al. [41] and show that these causes are actually relevant in practice. A similar discussion can be drawn when considering the least frequent root causes. For instance, flakiness due to ‘‘Test Order Dependency’’ was confirmed to rarely arise in mobile applications.

When analyzing the resources from the systematic grey literature review, we found that in some cases the flaky tests due to ‘‘UI problems’’ were instead a consequence of other root causes. In other terms, what developers believe is due to issues with the design and rendering of widget layouts hides more sneaky issues. While this was already observed by Romano et al. [73], there are two considerations to make here. On the one hand, our findings triangulate those obtained by Romano et al. [73] using a different research approach (a survey study rather than a manual analysis of UI tests): in this respect, we could confirm that test code flakiness is particularly relevant when considering UI tests. On the other hand, however, we complemented those findings by reporting that, in most cases, practitioners could not recognize the specific root cause making a UI test flaky: in this sense, our work emphasized the need for additional instruments that may make developers aware of the rationale behind the flakiness of UI tests. As such, our findings can (1) inform the designers of flaky test detection, diagnosing, and fixing techniques of the innate relation between UI-related issues and other, more fundamental problems causing flakiness; and (2) call for more empirical investigations into potential strategies to manage multiple root causes of flakiness jointly.

Table 11 makes explicit the commonalities and differences between the results obtained in our work and those presented in previous research. With respect to Eck et al. [20], the major differences were observed in terms of the frequency of specific root causes. For instance, Eck et al. [20] reported that the most common root causes in Mozilla were due to ‘‘Concurrency’’, ‘‘Async Wait’’, and ‘‘Too Restrictive Range’’ issues, while our findings revealed ‘‘API’’ issues, ‘‘Program Logic’’, and ‘‘UI problems’’ as the most common root causes in mobile apps. It is worth remarking that Eck et al. [20] analyzed an industrial context, i.e., Mozilla, and investigated test flakiness in web applications. On the contrary, we focused on mobile applications and targeted mobile practitioners. As such, the differences observed may be due to the different context of the study: our findings must be seen as complementary.

Concerning the work conducted by Ahmad et al. [65], the authors analyzed an industrial environment, hence collecting flakiness-related information from a different context. Similarly to the cases above, the major difference lies in the frequency of the root causes identified — Ahmad et al. [65] identified ‘‘Async Wait’’, ‘‘Test Order Dependency’’, and ‘‘Randomness’’ as the most frequent root causes, while our findings reported a different perspective on the matter when considering mobile apps. In the first place, the differences observed can be due to our specific focus on mobile apps: Ahmad et al. [65] did not explicitly report the types of applications they considered, yet we may suppose these were general-purpose and targeted different application domains. As a consequence, our work more deeply explores the problem of test flakiness in mobile applications. In the second place, the differences might also be connected to the research methods employed. Ahmad et al. [65] performed a qualitative analysis of the source code of the test suites of the closed-source systems taken into account, while we performed mixed-method research to elicit the mobile practitioner’s perception. Finally, another possible reason explaining the differences observed is concerned with the number of practitioners involved in the two studies: Ahmad et al. [65] recruited 18 developers and analyzed the test suites they developed, while we recruited 130 mobile developers.

The findings by Gruber and Fraser [39] first differ from ours in terms of the frequency of specific root causes. In particular, while the findings by Gruber and Fraser [39] reported ‘‘Test Order Dependency’’ as the most common root cause for both non-mobile developers and mobile developers, ours showed that ‘‘API’’ issues are the most common cause of flakiness in mobile apps, with ‘‘Test Order Dependency’’ only considered as the ninth most mentioned root cause. Still, in terms of root causes, the mobile developers surveyed by Gruber and Fraser [39] mentioned the use of emulators as a cause of increased frequency of flakiness, which our study did not confirm. Finally, concerning the developers’ perception, Gruber and Fraser [39] suggested that flaky tests with a higher failure rate are perceived as more problematic, while our survey showed that the recurring emergence of the same flaky tests makes developers less and less inclined to address the causes of flakiness. In other terms, when compared to the paper by Gruber and Fraser [39], our work specifically characterized test code flakiness in mobile apps, providing an improved overview of the most problematic and frequent root causes of flakiness, hence emphasizing the need for treating test code flakiness differently when considering mobile apps. The differences observed might be due to two main reasons. In the first place, the sample of practitioners: Gruber and Fraser [39] surveyed 233 professionals from the general public all over the globe and 102 employees of the BMW Group, while our work targeted a sample of 130 mobile practitioners. In the second place, the specificity of the analysis: while Gruber and Fraser [39] aimed at collecting the root causes and practitioner’s perception of test flakiness of general-purpose systems, we provided a more focused analysis of the mechanisms applied by mobile developers to deal with mobile test flakiness, also extending the current knowledge by investigating the specific fixing mechanisms they typically put in place. As a consequence, the findings of the two studies should be seen as complementary, since they provide different perspectives on the same matter by analyzing different contexts.

The work by Habchi et al. [40] focused on mapping the measures employed by practitioners when dealing with flaky tests. Our work is, instead, larger and more comprehensive, as we aimed to analyze (1) the prominence, (2) the root causes, and (3) the diagnosis and mitigation strategies applied by mobile developers when dealing with flakiness. In terms of findings, Habchi et al. [40] reported that developers address
20
V. Pontillo et al. Information and Software Technology 168 (2024) 107394
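To make the ‘‘Test Order Dependency’’ category discussed above concrete, consider the following minimal Kotlin/JUnit sketch (an illustration of ours, not code from the surveyed sources; all class and test names are hypothetical), in which one test leaks state that another silently relies on:

    import org.junit.jupiter.api.Assertions.assertEquals
    import org.junit.jupiter.api.Test

    // Hypothetical session cache shared across tests through mutable static state.
    object SessionCache {
        var loggedInUser: String? = null
    }

    class LoginTest {
        @Test
        fun `login stores the user in the cache`() {
            SessionCache.loggedInUser = "alice"   // leaks state to later tests
            assertEquals("alice", SessionCache.loggedInUser)
        }
    }

    class ProfileTest {
        // Passes only if LoginTest has already run and polluted SessionCache;
        // under a randomized or parallel execution order it fails.
        @Test
        fun `profile screen shows the cached user`() {
            assertEquals("alice", SessionCache.loggedInUser)
        }
    }

Such a pair passes or fails depending on the execution order chosen by the test runner, which is why the dependency typically surfaces only intermittently, e.g., when tests are reordered or parallelized.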
Table 11
Comparison of the findings between our study and the most closely related papers.

Eck et al. [20]
  Different findings: Eck et al. [20] surveyed web developers, finding that Concurrency (26%), Async Wait (22%), and Too Restrictive Range (17%) are the most common root causes of flakiness. Our study surveyed mobile developers, finding that API issues (31%), Program Logic (30%), and UI problems (25%) are the most common root causes.
  Possible rationale: The different target of practitioners involved might explain the differences observed, along with our more specific focus on mobile apps.
  Common findings: ∙ Prominence and relevance of test flakiness; ∙ The ReRun approach is the most common technique used to detect flakiness.

Habchi et al. [40]
  Different findings: Habchi et al. [40] reported on how developers address test flakiness in the wild (e.g., building stable infrastructure), while we (i) proposed a catalog of more specific actions developers take when detecting the causes of test code flakiness in mobile apps; and (ii) reported on the fixing actions performed to deal with mobile test flakiness.
  Possible rationale: The different target of practitioners involved might explain the differences observed, along with our more specific focus on mobile apps. In addition, we surveyed a larger number of practitioners (130 vs. 14).
  Common findings: ∙ Prominence and relevance of test flakiness; ∙ The ReRun approach is the most common technique used to detect flakiness.

Gruber and Fraser [39]
  Different findings: The authors reported Test Order Dependency as the most common root cause, while in our study it appears only in 15% of the cases. In addition, the mobile developers surveyed by Gruber and Fraser [39] reported the use of emulators, while our practitioners did not mention this aspect.
  Different practitioners’ perceptions: Gruber and Fraser [39] suggested that flaky tests with a higher failure rate are perceived as more problematic, while our study showed that the recurring emergence of the same flaky tests makes developers less and less inclined to address the causes of flakiness.
  Possible rationale: The different target of practitioners involved might explain the differences observed, along with our more specific focus on mobile apps. Gruber and Fraser [39] indeed considered a sample of general practitioners, while we specifically focused on the detection and fixing procedures applied by mobile developers.
  Common findings: ∙ Prominence and relevance of test flakiness; ∙ The ReRun approach is the most common technique used to detect flakiness; ∙ The automated approaches to handle a flaky test are unknown or rarely used.

Ahmad et al. [65]
  Different findings: The study was conducted on closed-source test cases, finding that Async Wait (91%), Test Order Dependency (5%), and Randomness (4%) are the most common root causes of test flakiness. Ahmad et al. [65] did not specify the types of systems under study, hence we cannot speculate on the amount of mobile practitioners involved in the study. In any case, we discovered different root causes and fixing strategies that specifically target mobile applications. In addition, Ahmad et al. [65] analyzed the relation between test flakiness and factors such as robustness, age, and size. On the contrary, we assessed the relation between flaky tests and test smells.
  Possible rationale: The different research methods used to conduct the studies may explain the differences observed: Ahmad et al. [65] performed a qualitative analysis of the test suites of closed-source systems, while we performed mixed-method research to elicit the mobile practitioners’ perception of test flakiness. Also, our specific focus on mobile apps might be an additional reason for the differences observed, along with the larger number of practitioners involved in our study (130 vs. 18).
  Common findings: ∙ Prominence and relevance of test flakiness.

Parry et al. [66]
  Different findings: Parry et al. [66] surveyed general-purpose developers, finding that Improper Setup and Teardown and Network-related Issues are the most common root causes of flakiness — they did not report the number of practitioners who indicated the root causes as frequent. Our study surveyed mobile developers, finding that API issues (31%), Program Logic (30%), and UI problems (25%) are the most common root causes.
  Possible rationale: The different target of practitioners involved might explain the differences observed, along with our more specific focus on mobile apps.
  Common findings: ∙ Prominence and relevance of test flakiness.
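Table 11 repeatedly lists the ReRun approach as the detection technique common across studies. The following Kotlin sketch illustrates the basic idea under assumptions of ours (a Gradle build invoked via ./gradlew and a hypothetical fully qualified test filter): the suspected test is re-executed several times against unchanged code, and mixed verdicts flag it as flaky.

    import java.io.File

    // Rerun-based flakiness check: run the same test N times against unchanged code
    // and report it as flaky if the verdicts are not all identical.
    // "./gradlew" and the test filter are assumptions; adapt them to the build at hand.
    fun rerunCheck(testFilter: String, runs: Int = 10): Boolean {
        val verdicts = (1..runs).map {
            ProcessBuilder("./gradlew", "test", "--tests", testFilter, "--rerun-tasks")
                .directory(File("."))
                .inheritIO()
                .start()
                .waitFor() == 0          // true = pass, false = fail
        }
        val flaky = verdicts.distinct().size > 1
        println("$testFilter -> ${verdicts.count { it }}/$runs passes, flaky=$flaky")
        return flaky
    }

    fun main() {
        rerunCheck("com.example.ProfileTest")   // hypothetical test class
    }

As the surveyed practitioners note, such reruns only expose non-determinism: they reveal nothing about its root cause and become expensive when repeated across a large suite.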
Finally, the work by Parry et al. [66] surveyed 170 general-purpose practitioners recruited through social media channels. The goal of the study was to elicit the definition that practitioners provide for test flakiness, the impact of flaky tests, their root causes, and the actions performed to fix them. Parry et al. [66] identified ‘‘Improper Setup and Teardown’’ and ‘‘Network-related Issues’’ as the most common root causes. On the contrary, our work identified root causes that are intrinsically connected to the way mobile applications are developed, i.e., ‘‘API’’ issues, ‘‘Program Logic’’, and ‘‘UI problems’’. In other words, our work should be seen as complementary with respect to the work by Parry et al. [66]: the differences observed are likely to be due to the different context of the study and, as a consequence, to the sample of practitioners.

These findings recall the need for prioritizing in a different manner the research effort to spend to support the testing activities of mobile developers. Perhaps more importantly, one of the key results of our study concerns the different impact that flaky tests have on the maintenance and evolution process. Most of the participants of the survey study indeed reported that they tend to fix flaky tests when these are detected rather than disabling or ignoring them — this is in contrast with what has been widely reported in previous studies targeting traditional systems [20,40]. Hence, our empirical analysis reveals that the socio-technical impact of flaky tests on the evolution of mobile apps is greater, highlighting that the managerial and technical strategies employed by mobile developers to evolve mobile apps might be affected by test flakiness.

6.2. Reflections, implications, and actionable items

The findings obtained and the observations provided in the previous section allow us to formulate a number of additional considerations and implications for both researchers and practitioners.

Test Flakiness: A Key Problem for Developers. While previous work pointed out the limited amount of testing activities performed by mobile developers [13], our study revealed that the problem of test flakiness is still perceived as relevant and problematic. Not only have more than half of the surveyed participants dealt with flaky tests at least once in their experience, but the problem seems to be more serious than in traditional software because of the peculiarities of mobile apps, like the intensive reliance on external APIs.
This was also one of the main insights coming from the systematic grey literature review: especially when dealing with UI testing, developers might end up with flaky tests already when designing them. Our study, therefore, represents a call for researchers working in the field of software testing: further empirical investigations on the current developers’ practices, and on the support practitioners need, other than on the nature and the impact of test flakiness, are critical missing pieces that our research community is called to address. In this sense, it is our hope that the results of the study and the additional implications discussed later in this section will inspire and stimulate further empirical and developer-centered research.

Root Causes of Flakiness: Beyond Test Code. Researchers have frequently investigated the sources of flakiness, identifying a number of test-related factors impacting the likelihood of tests being non-deterministic [19,20,22,108]. Unfortunately, this is not enough or, at least, not fully generalizable to mobile apps. While developers recognized some of the known causes of flakiness, like concurrency, randomness, or test order dependency, they also put in the spotlight issues in production code design and in the verification of user interfaces. These represent key differences with respect to traditional software. For instance, application logic and API issues are not only the most diffused causes of flakiness according to our survey study but also those harder to detect and fix. While Thorve et al. [41] pioneered the analysis of how flaky tests manifest themselves in mobile apps, we advocate more research on the role played by the characteristics of production code that might increase the likelihood of tests becoming flaky. We can envision a number of further investigations into the impact of design-related aspects, like REST API design patterns and anti-patterns [101,110], code smells [111,112], usability patterns [113], and other usage patterns [114], on test code flakiness. We believe that new experimentations would also offer opportunities for joint research efforts among different research sub-communities in software engineering.
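To exemplify how production-code design and asynchronous API calls can induce the flakiness discussed above, consider the following sketch (an illustration of ours; ProfileLoader and all timing values are hypothetical): a component populates its state on a background thread, and a test that waits for a fixed, arbitrary amount of time races the asynchronous work, while a test that waits on an explicit completion signal does not.

    import org.junit.jupiter.api.Assertions.assertEquals
    import org.junit.jupiter.api.Test
    import java.util.concurrent.CountDownLatch
    import java.util.concurrent.TimeUnit
    import kotlin.concurrent.thread

    // Hypothetical production class: fills its state on a background thread,
    // e.g., after calling a remote API.
    class ProfileLoader {
        @Volatile var displayName: String? = null
        val ready = CountDownLatch(1)

        fun load() {
            thread {
                Thread.sleep((10..80).random().toLong())  // stand-in for variable API latency
                displayName = "alice"
                ready.countDown()
            }
        }
    }

    class ProfileLoaderTest {
        @Test
        fun `flaky - relies on an arbitrary sleep`() {
            val loader = ProfileLoader()
            loader.load()
            Thread.sleep(40)                  // sometimes enough, sometimes not
            assertEquals("alice", loader.displayName)
        }

        @Test
        fun `stable - waits on an explicit completion signal`() {
            val loader = ProfileLoader()
            loader.load()
            loader.ready.await(2, TimeUnit.SECONDS)
            assertEquals("alice", loader.displayName)
        }
    }

The fix shown here acts at the test level; as argued above, the deeper issue lies in how the production code exposes its asynchronous state, which is exactly the kind of design-related aspect whose investigation we advocate.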
On the Role of Technical Debt. As a follow-up discussion point, it is important to further remark on the role that developers assigned to technical debt. As a side result of our analysis, we could indeed confirm that a number of test and code metrics and smells are perceived as connected to flakiness. This aspect has several implications. In the first place, some forms of technical debt, like the test smells capturing issues in the management of external resources, might act as proxy metrics that developers might monitor to keep the stability of test cases under control. In this sense, additional studies on the characteristics making test cases smelly and flaky would be desirable. Furthermore, should the relation between test code design and flakiness be confirmed, this would open new interesting avenues for researchers working in predictive analytics [31,33,34] and software quality [105,115,116]: the former might be interested in assessing how well test-related metrics and smells can predict the emergence of flaky tests; the latter in devising novel mobile-specific instruments and analytical dashboards that could assist developers in monitoring and detecting potential sources of flakiness in advance. Last but not least, our results are interesting from the automated refactoring perspective. As a matter of fact, we still lack fully automated refactoring approaches to deal with test design concerns [117]. The relevance of these aspects on flakiness might boost researchers’ willingness to invest effort on the matter.
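To give a concrete flavor of the test smells capturing issues in the management of external resources (an illustrative sketch of ours; the file path and names are hypothetical), consider a test exhibiting the well-known Resource Optimism smell, whose verdict depends on the environment rather than on the code under test:

    import org.junit.jupiter.api.Assertions.assertTrue
    import org.junit.jupiter.api.Test
    import java.io.File

    class ExportServiceTest {
        // Resource Optimism: the test assumes that /tmp/export/last_run.csv exists
        // and is up to date. On CI machines, after parallel runs, or after cleanup
        // jobs the file may be missing or stale, so the same test passes locally
        // and fails remotely.
        @Test
        fun `flaky-prone - trusts an external file`() {
            val report = File("/tmp/export/last_run.csv")
            assertTrue(report.readLines().any { it.startsWith("total,") })
        }
    }

Monitoring such smells, e.g., with detectors like tsDetect [116], could provide the kind of proxy metric for test stability discussed above.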
Is the Automated Support Really Limited? Over the last years, researchers have been proposing several tools to deal with flakiness, especially when it comes to specific root causes like concurrency and test order dependency [22]. At the same time, our systematic grey literature review pointed out the availability of some automation mechanisms to ease the flaky test detection process. Nevertheless, developers do not seem to use (or even be aware of) the automated support available. We see a number of possible reasons behind this dichotomy. First, there seems to be a relevant gap between academia and industry, which should be addressed somehow. On the one hand, novel dissemination strategies might be experimented with to understand how to better communicate research results to practitioners: the availability of practitioners’ forums and/or blogs, e.g., Reddit and Medium, might represent a valuable starting point, i.e., the scientific community might consider grey literature sources as a complementary channel to advertise and disseminate research results. Second, our research community is called to devise more readily usable prototypes that make academic techniques accessible, allowing practitioners to more easily adopt them in practice. Our results further highlight the importance of devising additional policies to stimulate the publication of tools and frameworks developed in an academic setting. Last but not least, our findings might inspire educators, who might want to update the academic programs currently in place to include specific parts connected to the relation between academia and industry, other than letting the new generation of computer scientists get exposed to academic prototypes early on, in the hope that this would then favor their larger adoption in the future.

Improving Test Code Review. One of the key findings of our study, even as a consequence of the previously discussed point, concerns the fact that developers mostly deal with flakiness manually. Our participants mentioned code review [118] as a useful instrument to diagnose and possibly treat flakiness in a collaborative fashion [119] — this is reasonable, given the popularity of the code review process [120]. To the best of our knowledge, however, research in this field is still largely neglected. Only a few researchers have proposed investigations in the context of test code review [17] or even attempted to understand how to best support developers during test code reviews [121]. Therefore, it is of the greatest urgency to build on this line of research and assess how current and future instruments might assist developers in the code review process.

On the Long-Term Effects of Flakiness. Our last discussion point relates to the findings achieved about the recurrence of flaky tests. Around 30% of developers claimed that test flakiness appears multiple times during software evolution. Besides the problems that this may cause in terms of source code quality and effectiveness, we discovered that recurrent flakiness might have dramatic consequences for software testing, corroborating what was advocated by Melski [109] in a well-known blog post. Our participants indeed reported that the recurrence of flaky tests might lead them to decide to ignore, disable, or even remove the non-deterministic test case at the risk of missing relevant defects. Hence, the analysis of the evolutionary aspects of test flakiness represents a key challenge for researchers, who are called to operate more in this direction.

7. Threats to validity

Several confounding factors might have possibly influenced the results of our study.

Threats to Construct Validity. Threats in this category refer to the relationship between hypothesis and observations. To understand how mobile developers discuss test flakiness, we conducted a systematic grey literature review [47]. In this regard, possible threats refer to the soundness and completeness of the review. With respect to the former, the first two authors of this paper followed well-established guidelines to search, analyze, and select relevant sources; moreover, the joint work conducted by the two authors reduced the risk of subjective evaluations of the resources to include and allowed a quick resolution of possible disagreements. As for the latter, we analyzed all the relevant Google pages when gathering the sources, also performing the search in incognito mode to avoid biases due to previous navigation history.
As for the survey study, threats are mainly related to the way we designed the survey. Starting from the objectives posed, we attempted to define clear and explicit questions in the survey that could allow participants to properly understand the meaning and phrasing and provide an answer that could be directly mapped onto our objectives. Also, whenever needed, we defined free-text answers to let developers express themselves freely, without restriction. The survey design involved all the paper’s authors, who all have experience in software testing, empirical software engineering, and research design. While this already partially mitigated threats to construct validity, we also involved two external mobile software developers in a pilot study. This additional investigation aimed at preliminarily assessing the opinions of our sample population, highlighting possible issues to be fixed before releasing the survey on a large scale. The pilot study let some minor concerns emerge that we promptly fixed. The follow-up double-checks of the involved developers let us be even more confident of the validity of our survey. Nonetheless, replications of the survey study would be beneficial to discover additional points of view and perspectives that we might have missed. In this respect, we made all our data and scripts publicly available to make our results repeatable and reproducible [76].

Threats to Internal Validity. The recruitment strategy employed in our study might have led to the selection of a biased sample. This is especially true since we relied on voluntary participation through an online instrument like Prolific. In this respect, there are three considerations to make. First, involving developers is always challenging: this is also visible from the relatively low response rate of survey studies in software engineering [84,122,123]. As such, accepting the limitations coming from the recruitment through online platforms is often the only means to conduct these studies. In addition, we mitigated possible self-selection or voluntary response bias [89,90] by providing participants with a payment as an incentive. This incentive aims to stimulate broader participation, hence reducing threats to validity. Last but not least, our results corroborate some of the findings achieved in previous studies [19,20], hence increasing the confidence in the validity of the insights we reported. However, we still cannot exclude possible influences of the recruitment strategy on our results. On the one hand, we plan to perform additional studies on test flakiness in mobile applications: these will have the goal of triangulating the results of our survey study. On the other hand, industrial replications or further experimentation on the developers’ opinions about flaky tests in mobile apps would be beneficial to corroborate our findings.

Threats to Conclusion Validity. As for the potential threats to the conclusions drawn, the data analysis methods are worth discussing. In the systematic grey literature review, we conducted content analysis sessions to label and summarize the themes in the selected sources. To reduce the potential subjectivity of the analysis, two of the authors acted as inspectors, defining a formal protocol to assess the suitability of each resource for the research questions of the study and opening a discussion on the labels and descriptions assigned to each source.

Another discussion point concerns the fact that the practitioners involved in the study – both those writing on the grey literature sources and those who participated in the survey study – might not be mobile developers only and, therefore, might have expressed opinions that are not necessarily specific to the mobile context. Being aware of this potential limitation, we performed quality assessment checks to verify the suitability of the sources considered with respect to the research questions of the study. In addition, we included questions in the survey that explicitly asked participants to answer keeping in mind that the problem of test flakiness had to be assessed within the context of mobile apps. In some cases, e.g., when considering the diffuseness of the root causes of flakiness, we also requested participants to provide their opinion by explicitly comparing mobile and non-mobile environments: in this way, the participants were stimulated to reason on the peculiarities of flaky tests in the two different contexts, hence providing us with more reliable insights into the problem.

As for the survey study, we used statistical methods to interpret numeric distributions coming from the responses of the participants. For instance, we employed frequency analysis to describe the most common root causes reported by the involved practitioners. At the same time, we did not use additional statistical methods and tests, e.g., we did not compare the numeric distributions pertaining to two questions of the survey to identify statistically significant differences. This was done on purpose, as the distributions coming from the responses relate to different concepts and address different perspectives of the test flakiness problem. Statistical comparisons would not apply in this case. For the open questions, we had to conduct content analysis to interpret the opinions of developers. In this respect, we systematically approached the matter with two experienced researchers involved. In any case, we released all the material produced in the context of this study to enable the verifiability of the conclusions described in our online appendix [76].
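As a minimal illustration of the frequency analysis mentioned above (a sketch of ours with made-up answer strings; the actual analysis material is available in the online appendix [76]), multi-select answers about root causes can be tallied as follows:

    // Tally how often each root cause is mentioned across multi-select survey answers.
    // The answer strings below are hypothetical placeholders, not actual responses.
    fun main() {
        val answers = listOf(
            "API; Program Logic",
            "UI problems; API",
            "Program Logic",
            "API; Test Order Dependency"
        )

        val frequencies = answers
            .flatMap { it.split(";").map { cause -> cause.trim() } }
            .groupingBy { it }
            .eachCount()
            .toList()
            .sortedByDescending { it.second }

        frequencies.forEach { (cause, count) ->
            println("$cause: $count/${answers.size} respondents")
        }
    }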
An additional point of discussion concerns the data cleaning process conducted, which might have missed the exclusion of answers released by participants who clearly did not take the task seriously or did not have enough expertise to approach our survey. In this respect, the paper’s first author went through each response and validated it. Then, a second author confirmed the operations performed. Such a double-check makes us confident of not having considered answers that might have biased our conclusions.

Another possible threat to the validity of our study is related to the reliability of the opinions collected by surveying practitioners. In this respect, previous studies showed that developers are sometimes inaccurate when reporting on the amount of testing they perform [124,125]. Similarly, they might be inaccurate when reporting additional information concerned with the activities performed to deal with test flakiness. On the one hand, we made sure to profile the participants by inquiring about their experience and the testing practices they performed. The self-reported information of our sample was in line with the quantitative findings by Pecorelli et al. [13] in terms of the amount of testing performed: such a congruence may indicate that the participants approached our study in a proactive manner, providing insights that well represent their daily activities. On the other hand, other researchers have approached the problem of flaky tests through survey studies [40,66], reporting compelling evidence of the validity of this research method to understand the properties of the test flakiness problem.

Threats to External Validity. Threats in this category are concerned with the generalizability of the results. When performing the systematic grey literature review, there is a risk of missing relevant resources because concepts related to those included in our search strings are named differently in some sources; for example, some sources may refer to non-deterministic tests instead of test flakiness. To mitigate this, we explicitly included all relevant synonyms and similar words in our search strings. We also exploited the features offered by search engines, which naturally support considering related terms for all those contained in a search string. Items found using the search terms have been assessed thoroughly based on various dimensions of quality [47]. Despite our efforts, we still identified a limited number of resources. This limits the generalizability of the findings and the representativeness of the sample considered. For this reason, future replications of the grey literature review would be desirable and might lead to different results with respect to those reported in our paper.

As for the survey study, the conclusions pertain to a specific sample of 130 developers having the characteristics discussed in Section 5.4. We noticed that the survey participants were pretty heterogeneous in terms of working contexts: indeed, freelance, industrial, open-source, and startup developers composed 36.6%, 35.1%, 28.2%, and 26.1% of our sample, respectively. On the contrary, we did not observe much heterogeneity when considering the other background questions — this was already perceivable in Fig. 3, which showed how the background of participants is somehow similar in terms of years of experience as
[49] S. Habchi, G. Haben, J. Sohn, A. Franci, M. Papadakis, M. Cordy, Y. Le Traon, What made this test flake? Pinpointing classes responsible for test flakiness, in: 2022 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, 2022, pp. 352–363.
[50] A. Wei, P. Yi, Z. Li, T. Xie, D. Marinov, W. Lam, Preempting flaky tests via non-idempotent-outcome tests, in: Proceedings of the 44th International Conference on Software Engineering, 2022, pp. 1730–1742.
[51] O. Parry, G.M. Kapfhammer, M. Hilton, P. McMinn, Empirically evaluating flaky test detection techniques combining test case rerunning and machine learning models, Empir. Softw. Eng. 28 (3) (2023) 72.
[52] C. Li, M.M. Khosravi, W. Lam, A. Shi, Systematically producing test orders to detect order-dependent flaky tests, in: Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2023, pp. 627–638.
[53] R. Greca, B. Miranda, A. Bertolino, Orchestration strategies for regression test suites, in: 2023 IEEE/ACM International Conference on Automation of Software Test (AST), IEEE, 2023, pp. 163–167.
[54] J. Morán Barbón, C. Augusto Alonso, A. Bertolino, C.A. Riva Álvarez, P.J. Tuya González, et al., Flakyloc: flakiness localization for reliable test suites in web applications, J. Web Eng. 2 (2020).
[55] A. Memon, M. Cohen, Automated testing of GUI applications: models, tools, and controlling flakiness, in: International Conference on Software Engineering (ICSE’13), IEEE, 2013, pp. 1479–1480.
[56] S. Fatima, T.A. Ghaleb, L. Briand, Flakify: A black-box, language model-based predictor for flaky tests, IEEE Trans. Softw. Eng. (2022).
[57] V. Pontillo, F. Palomba, F. Ferrucci, Static test flakiness prediction: How far can we go? Empir. Softw. Eng. 27 (7) (2022) 1–44.
[58] M. Gruber, M. Heine, N. Oster, M. Philippsen, G. Fraser, Practical flaky test prediction using common code evolution and test history data, in: 2023 IEEE Conference on Software Testing, Verification and Validation (ICST), IEEE, 2023, pp. 210–221.
[59] R. Verdecchia, E. Cruciani, B. Miranda, A. Bertolino, Know you neighbor: Fast static prediction of test flakiness, IEEE Access 9 (2021) 76119–76134.
[60] N. Hashemi, A. Tahir, S. Rasheed, An empirical study of flaky tests in Javascript, in: 2022 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, 2022, pp. 24–34.
[61] J. Morán, C. Augusto, A. Bertolino, C. de la Riva, J. Tuya, Debugging flaky tests on web applications, in: WEBIST, 2019, pp. 454–461.
[62] C. Li, C. Zhu, W. Wang, A. Shi, Repairing order-dependent flaky tests via test generation, in: Proceedings of the 44th International Conference on Software Engineering, 2022, pp. 1881–1892.
[63] J. Lampel, S. Just, S. Apel, A. Zeller, When life gives you oranges: detecting and diagnosing intermittent job failures at mozilla, in: 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 1381–1392.
[64] M.H.U. Rehman, P.C. Rigby, Quantifying no-fault-found test failures to prioritize inspection of flaky tests at ericsson, in: 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 1371–1380.
[65] A. Ahmad, O. Leifler, K. Sandahl, Empirical analysis of practitioners’ perceptions of test flakiness factors, Softw. Test. Verif. Reliab. 31 (8) (2021) e1791.
[66] O. Parry, G.M. Kapfhammer, M. Hilton, P. McMinn, Surveying the developer experience of flaky tests, in: Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice, 2022, pp. 253–262.
[67] A.I. Wasserman, Software engineering issues for mobile application development, in: FSE/SDP Workshop on Future of Software Engineering Research, 2010, pp. 397–400.
[68] R. Francese, C. Gravino, M. Risi, G. Scanniello, G. Tortora, Mobile app development and management: results from a qualitative investigation, in: 2017 IEEE/ACM 4th International Conference on Mobile Software Engineering and Systems (MOBILESoft), IEEE, 2017, pp. 133–143.
[69] R. Jabangwe, H. Edison, A.N. Duc, Software engineering process models for mobile app development: A systematic literature review, J. Syst. Softw. 145 (2018) 98–111.
[70] J. Zhang, S. Sagar, E. Shihab, The evolution of mobile apps: An exploratory study, in: Proceedings of the 2013 International Workshop on Software Development Lifecycle for Mobile, 2013, pp. 1–8.
[71] M. Fazzini, A. Orso, Automated cross-platform inconsistency detection for mobile apps, in: 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), IEEE, 2017, pp. 308–318.
[72] Z. Dong, A. Tiwari, X.L. Yu, A. Roychoudhury, Flaky test detection in Android via event order exploration, in: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 367–378.
[73] A. Romano, Z. Song, S. Grandhi, W. Yang, W. Wang, An empirical analysis of UI-based flaky tests, in: 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), IEEE, 2021, pp. 1585–1597.
[74] D. Silva, L. Teixeira, M. d’Amorim, Shake it! Detecting flaky tests caused by concurrency with shaker, in: 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, 2020, pp. 301–311.
[75] C. Wohlin, P. Runeson, M. Höst, M.C. Ohlsson, B. Regnell, A. Wesslén, Experimentation in Software Engineering, Springer Science & Business Media, 2012.
[76] V. Pontillo, F. Palomba, F. Ferrucci, Test code flakiness in mobile apps: The developer’s perspective, 2022, http://dx.doi.org/10.6084/m9.figshare.24183279.
[77] K.M. Benzies, S. Premji, K.A. Hayden, K. Serrett, State-of-the-evidence reviews: advantages and challenges of including grey literature, Worldviews Evid. Based Nurs. 3 (2) (2006) 55–61.
[78] H. Zhang, X. Zhou, X. Huang, H. Huang, M.A. Babar, An evidence-based inquiry into the use of grey literature in software engineering, in: 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE), IEEE, 2020, pp. 1422–1434.
[79] I. Kumara, M. Garriga, A.U. Romeu, D. Di Nucci, F. Palomba, D.A. Tamburri, W.-J. van den Heuvel, The do’s and don’ts of infrastructure code: A systematic gray literature review, Inf. Softw. Technol. 137 (2021) 106593.
[80] C. Bakker, F. Wang, J. Huisman, M. Den Hollander, Products that go round: exploring product life extension through design, J. Clean. Prod. 69 (2014) 10–16.
[81] M. Cordella, F. Alfieri, C. Clemm, A. Berwald, Durability of smartphones: A technical analysis of reliability and repairability aspects, J. Clean. Prod. 286 (2021) 125388.
[82] A. Shi, W. Lam, R. Oei, T. Xie, D. Marinov, IFixFlakies: A framework for automatically fixing order-dependent flaky tests, in: 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 545–555.
[83] R. Wang, Y. Chen, W. Lam, iPFlakies: A framework for detecting and fixing Python order-dependent flaky tests, in: Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings, 2022, pp. 120–124.
[84] B.A. Kitchenham, S.L. Pfleeger, Principles of survey research part 2: designing a survey, ACM SIGSOFT Softw. Eng. Not. 27 (1) (2002) 18–20.
[85] J. Klayman, Varieties of confirmation bias, Psychol. Learn. Motiv. 32 (1995) 385–418.
[86] D. Andrews, B. Nonnecke, J. Preece, Conducting research on the internet: Online survey design, development and implementation guidelines, 2007.
[87] K.H. Morin, Value of a pilot study, 2013.
[88] K.J. Hunt, N. Shlomo, J. Addington-Hall, Participant recruitment in sensitive surveys: a comparative trial of ‘opt in’ versus ‘opt out’ approaches, BMC Med. Res. Methodol. 13 (1) (2013) 1–8.
[89] J.J. Heckman, Selection bias and self-selection, in: Econometrics, Springer, 1990, pp. 201–224.
[90] J.W. Sakshaug, A. Schmucker, F. Kreuter, M.P. Couper, E. Singer, Evaluating active (opt-in) and passive (opt-out) consent bias in the transfer of federal contact data to a third-party survey agency, J. Surv. Statist. Methodol. 4 (3) (2016) 382–416.
[91] F. Ebert, A. Serebrenik, C. Treude, N. Novielli, F. Castor, On recruiting experienced GitHub contributors for interviews and surveys on prolific, in: International Workshop on Recruiting Participants for Empirical Software Engineering, 2022.
[92] B. Reid, M. Wagner, M. d’Amorim, C. Treude, Software engineering user study recruitment on prolific: An experience report, 2022, arXiv preprint arXiv:2201.05348.
[93] T. Hall, V. Flynn, Ethical issues in software engineering research: a survey of current practice, Empir. Softw. Eng. 6 (4) (2001) 305–317.
[94] T. Nemoto, D. Beglar, Likert-scale questionnaires, in: JALT 2013 Conference Proceedings, 2014, pp. 1–8.
[95] S. Cavanagh, Content analysis: concepts, methods and applications, Nurse Res. 4 (3) (1997) 5–16.
[96] M. Wessel, M.A. Gerosa, E. Shihab, Software bots in software engineering: benefits and challenges, in: Proceedings of the 19th International Conference on Mining Software Repositories, 2022, pp. 724–725.
[97] S. Elbaum, G. Rothermel, J. Penix, Techniques for improving regression testing in continuous integration development environments, in: 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2014, pp. 235–245.
[98] B. Vasilescu, Y. Yu, H. Wang, P. Devanbu, V. Filkov, Quality and productivity outcomes relating to continuous integration in GitHub, in: 2015 10th Joint Meeting on Foundations of Software Engineering, 2015, pp. 805–816.
[99] A. Di Sorbo, S. Panichella, Exposed! A case study on the vulnerability-proneness of google play apps, Empir. Softw. Eng. 26 (4) (2021) 1–31.
[100] J. Ferrer, F. Chicano, E. Alba, Estimating software testing complexity, Inf. Softw. Technol. 55 (12) (2013) 2125–2139.
[101] M.A. Oumaziz, A. Belkhir, T. Vacher, E. Beaudry, X. Blanc, J. Falleri, N. Moha, Empirical study on rest apis usage in Android mobile applications, in: International Conference on Service-Oriented Computing, Springer, 2017, pp. 614–622.
[102] M. Abdellatif, R. Tighilt, A. Belkhir, N. Moha, Y. Guéhéneuc, É. Beaudry, A multi-dimensional study on the state of the practice of REST APIs usage in Android apps, Autom. Softw. Eng. 27 (3) (2020) 187–228.
[103] H. Alrubaye, D. Alshoaibi, E. Alomar, M.W. Mkaouer, A. Ouni, How does library migration impact software quality and comprehension? An empirical study, in: International Conference on Software and Software Reuse, Springer, 2020, pp. 245–260.
[104] A. Gambi, J. Bell, A. Zeller, Practical test dependency detection, in: 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST), IEEE, 2018, pp. 1–11.
[105] D.J. Kim, T.P. Chen, J. Yang, The secret life of test smells - an empirical study on test smell evolution and maintenance, Empir. Softw. Eng. 26 (5) (2021) 1–47.
[106] D. Spadini, F. Palomba, A. Zaidman, M. Bruntink, A. Bacchelli, On the relation of test smells to software code quality, in: 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, 2018, pp. 1–12.
[107] M. Tufano, F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, A. De Lucia, D. Poshyvanyk, An empirical investigation into the nature of test smells, in: 31st IEEE/ACM International Conference on Automated Software Engineering, 2016, pp. 4–15.
[108] B. Camara, M. Silva, A. Endo, S. Vergilio, On the use of test smells for prediction of flaky tests, in: Brazilian Symposium on Systematic and Automated Software Testing, 2021, pp. 46–54.
[109] E. Melski, 6 tips for writing robust, maintainable unit tests, 2020, https://blog.melski.net/tag/unit-tests/.
[110] A. Belkhir, M. Abdellatif, R. Tighilt, N. Moha, Y. Guéhéneuc, É. Beaudry, An observational study on the state of REST API uses in Android mobile applications, in: 2019 IEEE/ACM 6th International Conference on Mobile Software Engineering and Systems (MOBILESoft), IEEE, 2019, pp. 66–75.
[111] M. Fowler, Refactoring: Improving the Design of Existing Code, Addison-Wesley Professional, 2018.
[112] M. Tufano, F. Palomba, G. Bavota, R. Oliveto, M. Di Penta, A. De Lucia, D. Poshyvanyk, When and why your code starts to smell bad (and whether the smells go away), IEEE Trans. Softw. Eng. 43 (11) (2017) 1063–1088.
[113] F. Nayebi, J.-M. Desharnais, A. Abran, The state of the art of mobile application usability evaluation, in: 2012 25th IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), IEEE, 2012, pp. 1–4.
[114] Y. Jin, N. Duffield, A. Gerber, P. Haffner, W. Hsu, G. Jacobson, S. Sen, S. Venkataraman, Z. Zhang, Characterizing data usage patterns in a large cellular network, in: 2012 ACM SIGCOMM Workshop on Cellular Networks: Operations, Challenges, and Future Design, 2012, pp. 7–12.
[115] F. Palomba, A. Zaidman, A. De Lucia, Automatic test smell detection using information retrieval techniques, in: 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, 2018, pp. 311–322.
[116] A. Peruma, K. Almalki, C.D. Newman, M.W. Mkaouer, A. Ouni, F. Palomba, Tsdetect: An open source test smells detection tool, in: 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020, pp. 1650–1654.
[117] V. Garousi, B. Küçük, Smells in software test code: A survey of knowledge in industry and academia, J. Syst. Softw. 138 (2018) 52–81.
[118] A. Bacchelli, C. Bird, Expectations, outcomes, and challenges of modern code review, in: 2013 35th International Conference on Software Engineering (ICSE), IEEE, 2013, pp. 712–721.
[119] L. Pascarella, D. Spadini, F. Palomba, M. Bruntink, A. Bacchelli, Information needs in contemporary code review, ACM Hum. Comput. Interact. 2 (CSCW) (2018) 1–27.
[120] O. Kononenko, O. Baysal, M.W. Godfrey, Code review quality: How developers see it, in: ACM/IEEE 38th International Conference on Software Engineering, 2016, pp. 1028–1038.
[121] S.V.V. Subramanian, S. McIntosh, B. Adams, Quantifying, characterizing, and mitigating flakily covered program elements, IEEE Trans. Softw. Eng. (2020).
[122] D. Lo, N. Nagappan, T. Zimmermann, How practitioners perceive the relevance of software engineering research, in: 10th Joint Meeting on Foundations of Software Engineering, 2015, pp. 415–425.
[123] T. Punter, M. Ciolkowski, B. Freimut, I. John, Conducting on-line surveys in software engineering, in: 2003 International Symposium on Empirical Software Engineering (ISESE 2003), IEEE, 2003, pp. 80–88.
[124] M. Beller, G. Gousios, A. Panichella, A. Zaidman, When, how, and why developers (do not) test in their IDEs, in: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, 2015, pp. 179–190.
[125] M. Beller, G. Gousios, A. Zaidman, How (much) do developers test? in: 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 2, IEEE, 2015, pp. 559–562.
[126] N. Nachar, et al., The Mann-Whitney U: A test for assessing whether two independent samples come from the same distribution, Tutor. Quant. Methods Psychol. 4 (1) (2008) 13–20.