We presented our system in user study sessions with 10 professional accessibility testers. The goal of the study was to understand how AXNav could assist accessibility testers in their workflows; specifically, how well the system could replicate manual accessibility tests, aid testers in finding accessibility issues, and be integrated into existing test workflows.
6.1 Procedure
We conducted 10 one-on-one interview-based study sessions. During each session, we first presented an overview of AXNav to the participant. We then showed three videos generated by AXNav and the associated test instructions, in randomized order. Each video showed an accessibility test on an iOS media application for e-books, news stories, or podcasts, respectively, with different UI elements and layouts. The videos were selected from the set of videos used in Section 5 based on their coverage of different accessibility features, including VoiceOver, Dynamic Type, and Button Shapes. Two of the tests shown in the videos had the difficulty level Easy, and one had the difficulty level Hard. The tests shown in the videos represented real accessibility tests that our participants would perform, as they were selected from the set of test instructions authored and used by testers in the organization. We chose to show videos to participants because videos are the primary output produced by AXNav, offering a realistic representation of interaction with our system. Furthermore, since AXNav is not a production system, it was not optimized for speed and can take from several minutes to an hour to produce a video. In practice, this is not a critical limitation, since many tests can be run in parallel, possibly overnight, and reviewed all at once after they complete. The specific videos and associated test instructions that we used for the user studies are as follows:
(1) VO: This video shows a test of a podcast application. The test instruction prompts the system to share an episode of a podcast show through text message using VoiceOver. (Difficulty level Hard)
(2) DT: This video shows a test of Dynamic Type in a news application. The test instruction prompts the system to increase the size of the text in four different fonts in a specific tab of the application. (Difficulty level Easy)
(3) BS: This video shows a test of Button Shapes in an e-book application. The test instruction prompts the system to test the Button Shapes feature across all the tabs in the application. (Difficulty level Easy)
All three videos contained accessibility issues, which we prompted the participants to discover using the heuristics provided by the system. In addition, we deliberately retained errors and imperfect navigation in the videos: the VO video shares the podcast show itself instead of an episode, and the DT and BS videos contain some false-positive flags. We presented these imperfections intentionally, both to showcase the system’s capabilities conservatively and to prompt discussion of limitations and future directions.
For each video, the researcher asked the participant to think aloud as they watched the video to 1) point out any accessibility issues related to the input test, and 2) point out any places where the test performed by the system could be improved. After each video, we interviewed the participant about how well the test in the video met their expectations and how well the heuristics assisted them in finding accessibility issues. In addition to qualitative questions, we asked the participants to provide 5-point Likert scale ratings on how similar the tests in the videos were to their manual tests and how useful the heuristics were for identifying accessibility bugs. After the participants viewed all three videos, we asked about their overall attitude toward the system, how they envisioned incorporating it into their workflow, and any areas they identified for improvement. We also asked participants to provide 5-point Likert scale ratings assessing our system’s usefulness within their workflow, both in its current form and assuming ideal performance.
6.4 Findings
6.4.1 Performance of the Automatic Test Navigation.
Automatic test navigation replicates manual tests. Participants generally agreed that the system navigated applications along a similar path to the one they would take when conducting tests manually, especially in the BS and VO test cases. For VO, participants rated the similarity between the system’s navigation path and a human tester’s at 4.60 on average (SD = 0.52, N = 10), between a “very good match” and an “extremely good match” with their manual testing procedures. P3 was impressed by the system’s ability to execute the test: “my mind is blown that it was able to find that [shared button] buried within that actions menu.” Similarly, in the BS test case, participants rated the similarity at 4.35 on average (SD = 0.75, N = 10). In P9’s opinion, the system’s heuristics might outperform most human testers in BS, since determining what constitutes a button shape can be subjective for a human tester. Participants also reacted positively to the chapter feature, as it enabled efficient navigation through the video.
Differences between system and human approaches. Some of the system’s approaches differed from what human testers would do. Compared to BS and VO, the system’s performance in DT received a lower average rating of 3.39 (SD = 0.78, N = 9), between a “moderately good match” and a “good match” with manual testing procedures. One main difference is that the system always relaunches the application between tests of different text sizes, whereas human testers tend to adjust text sizes from Control Center within the application, without relaunching it, in order to mimic what a real user would do. Nevertheless, participants recognized a potential benefit of AXNav’s approach, as it added an additional layer of testing: “I really like that [it] launches the app in between changing the text size, because I think it’s a separate class of bug, whether or not it responds to a change in text size versus having the text size there initially.” (P8) Similarly, P9 found in the VO example that the system waited for spoken output, which is not something a human tester would typically do but might be beneficial for more thorough tests.
At the same time, participants suggested that future versions of the system could enable exploratory and alternative navigation, as well as more in-depth tests of the UI structure. For example, for BS, participants mentioned that they would have explored more nested content in the application to ensure the Button Shapes feature works for all elements (P2, P6). For VO, participants wished the system could support alternative, non-linear pathways that VO users might take (P7) and navigation using both swiping and tapping gestures (P4). Another common request was the ability to scroll through an application’s screens when testing display features such as DT and BS.
Reaction to navigation errors. The VO video contains a slight navigation error: the system shares a show instead of an episode. Only 2 out of 10 participants (P2 and P5) identified this error. Most participants overlooked it, potentially due to over-reliance on the automatic navigation, as P2 said, “it worked well enough that I almost kind of let that slip. I needed to watch this video twice. Maybe I got over-reliant on [it].” To address this error, P2 elaborated on how they would re-write the test instruction so that the agent could potentially correct the mistake: “I would have [written], like, navigate to an episode, click the dot dot dot menu... I would suspect that this model would have done a better job finding the actual episode...” P5, instead, described how they would navigate the application themselves based on the instruction: “I would definitely do it the same route as it did through the more button, [but] instead of a certain episode, I would just switch it to show.”
6.4.2 Identifying Accessibility Issues with Automatic Navigation.
In all three cases (VO, BS, and DT), every participant spotted at least one accessibility issue and agreed that the issues they discovered were significant enough to be filed in their company’s internal bug reporting system.
Heuristics aid discovery of issues. Overall, participants agreed that the heuristics provided by the system assisted them in finding the issues. For VO, BS, and DT respectively, participants rated the usefulness of the heuristics at 4.06 (SD = 1.38, N = 9; between “useful” and “very useful”), 4.75 (SD = 0.43, N = 10), and 3.67 (SD = 1.09, N = 9; between “moderately useful” and “useful”) on average. In particular, the potential issues flagged in the chapters allowed participants to navigate to where an issue occurred and review it with greater attention. The heuristics helped direct testers’ attention to potential issues that might otherwise be too subtle to discover: “Watching it in a video, as opposed to actually interacting with it, I think it is easier to potentially miss things... So, having some sort of automatic detection to surface things [is good].” (P8) Even though the heuristics sometimes produced false positives, participants appreciated the extra layer of caution they provided, as P10 said, “I actively like the red [annotation boxes around potential issues] because I think the red is like ‘take a look at this’ and then even if it’s not necessarily an issue, that’s not hurtful.”
Risks of over-reliance on heuristics. Participants expressed concern about over-reliance on the heuristics provided by the system. In some sessions, participants found issues that were not marked by the heuristics and worried that such false negatives might bias testers: “if things are marked as green, and maybe there actually is an issue in there, maybe that would dissuade somebody from looking there.” (P10) This could affect testers of different experience levels differently: an experienced tester might rely on their expertise to find issues, while a novice tester might over-rely on the bugs (or non-bugs) suggested by the system. As P8 explained: “If somebody is kind of experienced with large text testing, they kind of know what to look for... If it’s an inexperienced tester, they might not know that the false positives are false positives and might file bugs.”
A mechanism that explains how the heuristics were generated and applied to the test cases might help mitigate over-reliance. For example, P7 imagined it as a series of “human-readable strings, like what it actually found... human-readable descriptions of what the error is in addition to seeing the boxes.” Other suggestions focused on making the heuristics more digestible for the testers. Currently, we show the heuristics as annotated screenshots separate from the videos. Participants suggested that the heuristics would be easier to comprehend if they were embedded in the video and kept separate from the regular chapters (P6), and if only the potential issues were annotated (P1). P7 also suggested including a dashboard or summary mechanism in the system, so that “instead of just having a scrub through this video,” a tester could see “a summary of the errors as well.”
6.4.3 Integration in Accessibility Testing Workflow.
Overall, participants reacted positively to our system. Participants rated the usefulness of the system in their existing workflow at 4.70 on average (SD = 0.48, N = 10; between “useful” and “very useful”) assuming it performed extremely well, and at 3.95 on average (SD = 0.96, N = 10; between “moderately useful” and “useful”) in its current form. Participants expressed excitement about the potential of integrating the system and bringing automation to their workflow. For instance, when asked to rate the overall usefulness of the system, P3 answered: “[I will rate] it like a 5 million... Even with the current limitations, it is very useful... just being able to feed it some real simple steps and have it do anything at all is massively powerful.” The following paragraphs unpack a range of ways in which AXNav might be integrated into existing test workflows.
Automating test planning. A compelling use case for AXNav is automating the planning and setup of a test, which, according to our participants, is a time-consuming part of accessibility testing because it can involve an excessive amount of manual work to “go through and find all of the labels to tap through” (P3). The step-by-step executable test plan that our system generates from natural language can reduce this tedious work: “rather than having to hard code navigation logic, it seems that this is able to determine those pathways for you... I think this idea is really awesome and would definitely save a lot of hours of not having to hard code the setup steps to go through a workflow with VoiceOver.” (P4) P4 also envisioned using the system as a test authoring tool that could generate templates to be run daily.
Complementing manual tests. Participants found the system helpful for reducing workload and saving time when running tests. Some participants wanted to embrace the automation the system provides, letting it run a large number of tests in the background while the team focuses on more important tasks: “you can run it in an automated fashion. You don’t need to be there. You can run it overnight. You can run it continually without scaling up some more people” (P7). As P8 imagined, “this could run on each new build [of the software], and then what all the QA engineer has to do is potentially a review about an hour’s worth of videos that were generated by the system, potentially automatically flagging issues.” The system can also provide consistency and standardization in tests, which “ensure[s] that everything is run the same way every time.” (P8)
At the same time, some participants were more cautious about automation and would prefer to use the system as a supplement to their manual work. P4 believed that, even with the flagged issues, they would still review the system-generated videos with a degree of attention similar to testing manually. P1 imagined that they would still test manually but would use the video to validate their tests, “to see if it could catch things that I couldn’t catch.” Some also imagined handing lower-risk tests, such as testing Button Shapes, to the system, while using the time saved to manually and carefully run higher-risk tests that could become regulatory blockers (P2).
Aiding downstream bug reporting. The videos generated by the system can also facilitate bug reporting downstream. Participants agreed that the video, along with the chapters generated by the system, could be used to triage accessibility issues that they would report to engineering teams. In their current practice, testers sometimes include screenshots or screen-recording clips to demonstrate a discovered issue. Our system prepares a navigable video automatically, streamlining this process: “I thought to be able to jump to specifically when the issue is and scrub a couple of seconds back or a couple seconds forward is super useful for engineering.” (P7)
Educating novices about accessibility testing. The system can also serve as an educational tool for those who are new to accessibility testing. It can help not only new QA professionals but also developers on under-resourced teams that have no dedicated QA teams or pipelines. For example, P2 found the videos and heuristics helpful for demonstrating the kinds of accessibility bugs people should be looking for: “This will be very useful for some of the folks that never do accessibility testing and [for] they [to] have a context or starting point for even knowing what a VoiceOver bug is.” In this way, our system has the potential to demonstrate and raise awareness of accessibility issues among broader developer communities, even those without QA resources.