
The empirical investigation of Perspective-Based Reading

1996, Empirical Software Engineering


The Empirical Investigation of Perspective-Based Reading

Victor R. Basili (1), Scott Green (2), Oliver Laitenberger (3), Filippo Lanubile (1), Forrest Shull (1), Sivert Sørumgård (4), Marvin V. Zelkowitz (1)

(1) University of Maryland, USA
(2) NASA Goddard Space Flight Center, USA
(3) University of Kaiserslautern, Germany
(4) University of Trondheim, Norway

Keywords: perspective-based reading, reading technique, requirement specification, defect detection, experimental software engineering

Abstract

We consider reading techniques a fundamental means of achieving high quality software. Due to the lack of research in this area, we are experimenting with the application and comparison of various reading techniques. This paper deals with our experiences with a family of reading techniques known as Perspective-Based Reading (PBR), and its application to requirements documents. The goal of PBR is to provide operational scenarios where members of a review team read a document from a particular perspective, e.g., tester, developer, user. Our assumption is that the combination of different perspectives provides better coverage of the document, i.e., uncovers a wider range of defects, than the same number of readers using their usual technique. To test the effectiveness of PBR, we conducted a controlled experiment with professional software developers from the National Aeronautics and Space Administration / Goddard Space Flight Center (NASA/GSFC) Software Engineering Laboratory (SEL). The subjects read two types of documents, one generic in nature and the other from the NASA domain, using two reading techniques, a PBR technique and their usual technique. The results from these experiments, as well as the experimental design, are presented and analyzed. Teams applying PBR are shown to achieve significantly better coverage of documents than teams that do not apply PBR. We thoroughly discuss the threats to validity so that external replications can benefit from the lessons learned and improve the experimental design if the constraints are different from those posed by subjects borrowed from a development organization.

1. Reading Scenarios

The primary goal of software development is to generate systems that satisfy the user's needs. However, the various documents associated with software development (e.g., requirements documents, code and test plans) often require continual review and modification throughout the development life cycle. In order to analyze these documents, reading is a key, if not the key, technical activity for verifying and validating software work products. Methods such as inspections (Fagan, 1976) are considered most effective in removing defects during development. Inspections rely on effective reading techniques for success.

Reading can be performed on all documents associated with the software process and can be applied as soon as the documents are written. However, except for Mills' reading by step-wise abstraction (Linger, 1979), there has been very little written on reading techniques. Most efforts have been associated with methods that simply assume that the given document can be read effectively (e.g., inspections, walk-throughs, reviews), but techniques for reading particular documents, such as requirements documents or test plans, do not exist. In cases where techniques do exist, the required skills are neither taught nor practiced. In teaching program design, for example, almost all effort is spent learning how to write code rather than how to read code.
Thus, when it comes to reading, little exists in the way of research or practice. In the NASA/GSFC Software Engineering Laboratory (SEL) environment, we have learned much about the effectiveness of reading and reading-based approaches through the application and evaluation of methodologies such as Cleanroom. We are now part of a group, ISERN (the International Software Engineering Research Network, whose goal is to support experimental research and the replication of experiments), that has undertaken as one of its activities a research program to define and evaluate software reading techniques to support the various review methods for software development. (We define the terms "technique" and "method" as follows: a technique is a series of steps, producing some desired effect, and requiring skilled application. A method is a management procedure for applying software techniques, which describes not only how to apply a technique, but also under what conditions the technique is appropriate.)

The work reported in this paper was conducted within the confines of the SEL. The SEL, started in 1976, has been developing technology aimed at improving the process of developing flight dynamics software for NASA/GSFC. This software is typically written in FORTRAN, C, C++, or Ada. Systems can range from 20K to 1M lines of source code, with development teams of up to 15 persons working over a one to two year period.

1.1 Scenario-Based Reading

Since we believe that software development and analysis techniques need to be context dependent, well-defined, goal-oriented, and demonstrated effective for purpose, we established the following goals for defining reading techniques:
• The technique should be associated with the particular document (e.g., requirements) and the notation in which the document is written (e.g., English text). That is, it should fit the appropriate development phase and notation.
• The technique should be tailorable, based upon the project and environment characteristics. If the problem domain changes, so should the reading technique.
• The technique should be detailed, in that it provides the reader with a well-defined process. We are interested in usable techniques that can be repeated by others.
• The technique should be specific, in that each reader has a particular purpose or goal for reading the document and the procedures support that goal. This can vary from project to project.
• The technique should be focused, in that a particular technique provides a particular coverage of the document, and a combination of techniques provides coverage of the entire document.
• The technique should be studied empirically to determine if and when it is most effective.

To this end, we have defined a set of techniques, which we call proactive process-driven scenarios, in the form of algorithms that readers can apply to traverse the document with a particular emphasis. Because the scenarios are focused, detailed, and specific to a particular emphasis or viewpoint, several scenarios must be combined to provide coverage of the document. We have defined an approach to generating a family of reading techniques based upon operational scenarios, illustrated in Figure 1. An operational scenario requires the reader to first create a model of the product, and then answer questions based on analyzing the model with a particular emphasis.
The choice of abstraction and the types of questions asked may depend on the document being read, the problem history of the organization, or the goals of the organization.

So far, two different families of scenario-based reading techniques have been defined for requirements documents: perspective-based reading and defect-based reading. Defect-based reading was the subject of an earlier set of experiments. Defect-based reading was defined for reading documents written in SCR style (Heninger, 1980), a formal notation for event-driven process control systems, and focuses on different defect classes, e.g., missing functionality and data type inconsistencies. These give rise to three different scenarios: data type consistency, safety properties, and ambiguity/missing information. An experimental study (Porter, 1995) analyzed defect-based reading, ad hoc reading and checklist-based reading to evaluate and compare them with respect to their effect on defect detection rates. The major results were that (1) scenario readers performed better than ad hoc and checklist readers, with an improvement of about 35%, (2) scenarios helped reviewers focus on specific defect classes but were no less effective at detecting other defects, and (3) checklist reading was no more effective than ad hoc reading. However, the experiment discussed in this paper is concerned with an experimental validation of perspective-based reading, and so we treat it in more detail in the next section.

Figure 1. Building focused, tailored reading techniques. (A model-based process on the product: the emphasis and the model generate scenarios, analysis generates questions, and the result is a procedure for building and analyzing models with respect to a set of goals.)

1.2 Perspective-Based Reading

Perspective-based reading (PBR) focuses on the point of view or needs of the customers or consumers of a document. For example, one reader may read from the point of view of the tester, another from the point of view of the developer, and yet another from the point of view of the user of the system. To provide a proactive scenario, each of these readers produces some physical model which can be analyzed to answer questions based upon the perspective. For example, the team member reading from the perspective of the tester would design a set of tests for a potential test plan and answer questions arising from the activities being performed. Similarly, the team member reading from the perspective of the developer would generate a high level design, and the team member representing the user would create a user's manual. Each scenario is focused on one perspective. The assumption is that the union of the perspectives provides extensive coverage of the document, yet each reader is responsible for a narrowly focused view of the document, which should lead to more in-depth analysis of any potential errors in the document.

Consider, as an example, the procedure for a reader applying the test-based perspective to a requirements specification document:

Reading Procedure: For each requirement, make up a test or set of tests that will allow you to ensure that the implementation satisfies the requirement. Use your standard test approach and test criteria to make up the test suite. While making up your test suite for each requirement, ask yourself the following questions:
1. Do you have all the information necessary to identify the item being tested and to identify your test criteria? Can you make up reasonable test cases for each item based upon the criteria?
2. Is there another requirement for which you would generate a similar test case but would get a contradictory result?
3. Can you be sure the test you generated will yield the correct value in the correct units?
4. Are there other interpretations of this requirement that the implementor might make based upon the way the requirement is defined? Will this affect the test you made up?
5. Does the requirement make sense from what you know about the application and from what is specified in the general description?

These five questions form the basis for the approach that the test-based reader will use to review the document.
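To illustrate what an "operational scenario" amounts to in practice, the following is a minimal sketch of our own (not from the paper) that represents the tester-perspective procedure as an executable checklist. The Requirement and Issue records and the answer callback are illustrative assumptions; the actual judgments are, of course, made by the human reviewer.

```python
# A minimal sketch (not from the paper) of how the tester-perspective scenario
# could be represented as an executable checklist.  The Requirement fields,
# the `answer` callback, and the Issue record are illustrative assumptions;
# the five questions are quoted from the reading procedure above.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Requirement:
    req_id: str
    text: str

@dataclass
class Issue:
    req_id: str
    question: str
    note: str

TESTER_QUESTIONS = [
    "Do you have all the information necessary to identify the item being "
    "tested and to identify your test criteria?",
    "Is there another requirement for which you would generate a similar test "
    "case but would get a contradictory result?",
    "Can you be sure the test you generated will yield the correct value in "
    "the correct units?",
    "Are there other interpretations of this requirement that the implementor "
    "might make? Will this affect the test you made up?",
    "Does the requirement make sense from what you know about the application "
    "and the general description?",
]

def review(requirements: List[Requirement],
           answer: Callable[[Requirement, str], str]) -> List[Issue]:
    """Apply the tester scenario: for each requirement, sketch a test, then
    record an issue whenever a question cannot be answered satisfactorily.
    `answer` stands in for the human reviewer's judgment: it returns an empty
    string if the question raises no concern, otherwise a short note."""
    issues: List[Issue] = []
    for req in requirements:
        for q in TESTER_QUESTIONS:
            note = answer(req, q)
            if note:
                issues.append(Issue(req.req_id, q, note))
    return issues
```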
We are proposing two different series of experiments for evaluating perspective-based reading techniques. The first series of experiments, the subject of this current paper, is aimed at discovering whether perspective-based reading is more effective than current practices. We tested this using professionals within the SEL. It is expected that other studies will be run in different environments using the same artifacts where appropriate. A second series, to be undertaken later, will be used to discover under which circumstances each of the various perspectives is most effective.

1.3 Experimental Plan

Our method for evaluating PBR was to compare its effectiveness in uncovering defects with the approach people were already using for reading and reviewing requirements specifications. Thus, it assumes some experience in reading requirements documents on the part of the subjects. More specifically, the current NASA SEL reading technique (SEL, 1992) had evolved over time and was based upon recognizing certain types of concerns, which were identified and accumulated as a set of issues requiring clarification by the document authors, typically the analysts and users of the system.

To test our hypotheses concerning PBR, a series of partial factorial experiments was designed, in which subjects would be given one document and told to discover defects using their current method. They would then be trained in PBR and given another document in order to see if their performance improved. The main research question was:
• If groups of individuals (such as during an inspection meeting) were given unique PBR roles, would a larger collection of defects be detected than if each read the document in a similar way?

Our hypothesis is that "the union of the defects detected by groups of individuals with unique PBR roles provides a greater coverage of the documents than the union of defects detected by groups using the usual NASA technique." As by-products of the main research question, we were also interested in the following secondary questions:
• If individuals read a document using PBR, would a different number of defects be found than if they read the document using their usual technique?
• Does a reviewer's experience in the role (designer, tester, user) influence performance when using PBR?

While we were interested in the effectiveness of PBR within our SEL environment, we were also interested in the general applicability of the techniques in environments different from the flight dynamics software that the SEL generally builds. Thus two classes of documents were developed: a domain-specific set that would have limited usefulness outside of NASA, and a generic set that is more representative of other domains and could be reused in other contexts. For the NASA flight dynamics application domain, two small specifications derived from an existing set of requirements documentation were used.
These specification documents, seeded with errors common to the environment, were labeled NASA_A and NASA_B. For the generic application domain, two requirements documents were developed and seeded with errors. These applications were an automated parking garage (PG) control system and an automated bank teller machine (ATM).

1.4 Structure of this Paper

In Section 2, we give a short discussion of the experimental design that we employed to test the effectiveness of PBR in the SEL environment and give a short overview of how we conducted two runs of this experiment. Section 3 presents the analysis of the data we obtained in the experiment. Section 4 discusses the various threats to the validity of our results. We describe those threats that we were able to anticipate in the experimental design and address in our results. We also discuss several threats that we were unable to foresee, and the impact of those new threats on our results. Section 5 discusses our experiences regarding designing and carrying out the experiment. Finally, Section 6 summarizes our findings and concludes with some indications of future directions for this research.

2. Design of the Experiment

Two runs of the experiment were conducted. Due to our experiences from the initial run, some modifications were introduced in the second run. We therefore view the initial run as a pilot study to test our experimental design, and we have run the experiment once more under more appropriate testing conditions. For both runs, the population was software developers from the NASA SEL environment. All subjects were volunteers, so we did not have a random sample population. We accepted everyone who volunteered, and nobody participated in both runs of the experiment.

2.1 Factors in the Design

In designing the experiment, we had to consider what factors were likely to have an impact on the results. The experimental design takes these independent variables into account and allows each of them to be separable from the others in order to allow for testing a causal relationship to the defect detection rate, the dependent variable under study. Below we list the independent variables that we could manipulate in each separate run of the experiment.
• Reading technique: We have two alternatives: one is using a PBR technique, and the other is using the technique currently used for requirements document review in the NASA SEL environment, which we refer to as the "usual" technique.
• Perspective: Within PBR, a subject uses a technique based on one of the review perspectives. For this experiment we used the three perspectives previously described: Designer, Tester and User.
• Requirements documents: For each task to be carried out by the subjects, a requirements specification is handed out to be read and reviewed. The document will presumably have an impact on the results due to differences in size, domain and complexity.

There will also be other factors present that may have an impact on the outcome of the experiment, but that are hard to measure and control. These will be discussed in Section 4.

2.2 Constraints and Limitations

In designing the experiment we also took into account various constraints that restrict the way we could manipulate the independent variables. There are basically two factors that constrain the design of this experiment: time and cost.
• Time: Since the subjects in this experiment are borrowed from a development organization, we could not expect to have them available for an indefinite amount of time. This required us to make the experiment as time-efficient as possible without compromising the integrity of the design.
• Cost: For the same reason, we could not get as many subjects as we would have liked. Given the salaries typical for experienced software designers, we estimated the burdened cost (i.e., with overhead) per individual would be at least $500 per day, or over $1000 per participant. This did not include the costs of the experimenters to set up, run, and analyze the results. For this reason, a major constraint in the experimental design was to achieve meaningful results with a minimal number of subjects and a minimal number of replications.

Specifically, we knew that we could expect to get between 12 and 18 subjects for two days on any run of the experiment. Since we had to rely on volunteers, we had to provide some potential benefit to the subjects and the organization that was supporting their participation. Training in a new approach provided some benefit for their time. This had an impact on our experimental design because we had to treat people equally as far as the training they received.

2.3 Choosing a Design

Due to the constraints, we found constructing real teams of three reviewers to work together in the experiment to be infeasible. With twelve subjects, we would have only four teams, which would allow for only two treatments (use of PBR and the usual technique) with two data points in each. In order to achieve more statistical validity, we had each reviewer work independently, yielding six data points for each treatment. This decision not to use teams was supported by similar experiments (Parnas, 1985) (Porter, 1995) (Votta, 1993), where the team meetings were reported to have little effect in terms of defect coverage; the meeting gain was outweighed by the meeting loss. Therefore we present no conclusions about the effects of PBR team meetings in practice. We do, however, examine the defect coverage that can result from teams by grouping reviewers into simulated teams which combine one reviewer from each of the perspectives. We discuss this further in Section 3.1.

The tasks performed by the subjects consisted of reading and reviewing a requirements specification document and recording the identified defects on a form. The treatments, which had the purpose of manipulating one or more of the independent variables, were aimed at teaching the subjects how to use PBR. There were four ways that we could have arranged the order of tasks and treatments for a group of subjects:
1. Start by teaching PBR, then do all tasks using PBR (experimental group).
2. Do all tasks using the usual technique (control group).
3. Start by teaching PBR, then do some tasks using PBR, followed by tasks using the usual technique.
4. Do pre-task(s) with the usual technique, then teach PBR, followed by post-task(s) using PBR.

In the first two options, the reading technique is a between-groups factor, in which each subject participates in only one treatment condition, either PBR or the usual technique. These two options were rejected because of the limited number of participants in the experiment. Furthermore, the control group of volunteers would not benefit from this study since they would not be learning anything about PBR. We did not believe this was appropriate given the support we got from the development organization. In the other two options, the reading technique is a repeated-measures factor, in which each subject provides data under each of the treatment conditions.
Each subject thus serves as his or her own control, because each subject uses both PBR and the usual technique. These last two options are more efficient than the first two because they double the number of available observations. The experiment is also more attractive in terms of getting subjects, since they would all receive similar training. Option 3, where the subjects first use PBR and then switch to their usual technique, was not considered a viable alternative because their recent knowledge of PBR might have undesirable influences on the way they apply their usual technique. The opposite may also be true, that their usual technique has an influence on the way they apply PBR. However, a prescriptive technique, such as a scenario-based reading technique, is assumed to produce a greater carry-over effect than a non-prescriptive technique, such as the usual technique at NASA. An analogous decision was taken in a related experiment (Porter, 1995), where subjects starting with a defect-based reading technique continued to apply it in the second experimental task. Thus, option 4 was selected.

All documents reviewed by a subject must be different. If a document was reviewed more than once by the same subject, the results would be disturbed by the subject's non-erasable knowledge about defects found in previous readings. This meant that we had to separate the subjects into two groups - one reading the first document and one reading the second - in order to be able to compare a PBR and a usual reading of a document. Based on the constraints of the experiment, each subject would have time to read and review no more than four documents: two from the generic domain, and two from the NASA domain. In addition, we needed one sample document from each domain for training purposes. We ended up providing the following documents:
• Generic:
  - Automatic teller machine (ATM) - 17 pages, 29 seeded defects
  - Parking garage control system (PG) - 16 pages, 27 seeded defects
  - Video rental system - 14 pages, 16 seeded defects (for training)
• NASA:
  - Flight dynamics (NASA_A) - 27 pages, 15 seeded defects
  - Flight dynamics (NASA_B) - 27 pages, 15 seeded defects
  - NASA sample - 9 pages, 6 seeded defects (for training)

Since we have sets of different documents and techniques to compare, it became clear that a variant of factorial design would be appropriate. Such a design would allow us to test the effects of applying both of the techniques on both of the relevant documents. A full factorial design would be inappropriate for two reasons: (1) it would require some subjects to apply the ordering of techniques that we previously argued against, and (2) it would require each subject to use all three perspectives at some point. Given our constraints, this would require an excessive amount of training and, perhaps even more important, the perspectives would likely interfere with each other, causing an undesirable learning effect. We blocked the design on technique, perspective, document and reading sequence in order to get an equal distribution of the values of the different independent variables. Thus we ended up with two groups of subjects, where each group contains three subgroups, one for each perspective (see Figure 2). Our goal was to have a minimum of 12 subjects, yielding six for group 1 (with two in each perspective) and six for group 2 (again with two in each perspective).
Figure 2. Design of the experiment. (Each of the two groups is divided into Designer, Tester and User subgroups. On the first day, each group receives training and applies the usual technique to one NASA document and one generic document; PBR is then taught, and on the second day each group receives further training and applies its assigned PBR perspective to the remaining generic and NASA documents.)

2.4 Conducting the Experiment

We conducted the first run in November, 1994, with 12 subjects. Each run of the experiment took two days: a Monday, where each subject used the usual technique to review a NASA and then a generic document, and a Wednesday, where each subject was taught one of the three PBR perspectives and applied that to two additional documents. After analyzing the results (discussed in the next section), we held a meeting with the subjects to give them our conclusions and to obtain feedback from them on how the experiment was conducted. Several potential problems were reported:
1. We tried to assign subjects to each perspective according to their experiences. However, we had mostly software designers and did not have an equitable breakdown of testers and users for our three perspectives. We decided that we would randomize assignments in any future run of the experiment.
2. The NASA documents (at 27 pages) were deemed too long for appropriate analysis in a single session. We decided to revise these into shorter documents for any future run of the experiment.
3. We gave each subject up to three hours to review each document (i.e., one document in the morning, and one in the afternoon). Only one subject took more than two hours, so it was agreed that a two-hour limit per document would be as effective in any future replications.
4. The initial run included training sessions only for the generic documents, but the subjects felt training for the NASA documents was warranted as well. Therefore in subsequent runs, we needed training sessions before each document review. For this purpose we generated an additional sample document representative of the NASA domain.
5. Subjects were allowed to work at their own desks, subject to interruptions, telephone calls, and other hazards of office life, as long as they kept a log of time actually spent on the experiment. For any replication we believe that there would be greater internal validity in the results if we used a more uniform setting. While the first run took place at the facility of the developer, we decided to use a classroom setting at the University of Maryland for any subsequent runs.
6. We found a few inconsistencies in our specifications that we did not anticipate. A few sentences were changed to make them less ambiguous. Most importantly, the specification consisted of a general description of the problem and then a series of precise specifications, some of which were intentionally incorrect. We decided that for any replication the general description should be absolutely correct in order for the subjects to have some basis for making decisions. This required minor changes to some of the specifications.

Because of these changes, we decided that it was best to call the November, 1994 run a pilot study for our experimental design. We conducted a second run of the experiment in June of 1995 with 14 subjects. Since one of the 14 was not familiar with NASA flight dynamics applications, we only used the 13 other subjects in analyzing NASA_A and NASA_B. In the next section we present an analysis of both the pilot study and the June, 1995 run.

After the 1995 run of the experiment, we marked all reviews with respect to their defect detection rate.
Each review was graded by two individuals who did not know whether the subject was using PBR or the usual technique, in order to eliminate any potential bias in the grading. After several iterations of discussion and re-marking, we arrived at a set of defect lists that were considered representative of the documents. Since these lists were slightly different from the lists that were used in the pilot study, we re-marked all the reviews from November 1994 in order to make all results consistent. Our initial measure of defect coverage was the percentage of the seeded defects that was found by each reviewer.

3. Statistical Analysis

After the pilot study and 1995 run, we have a substantial base of observations from which to draw conclusions about PBR. This task is complicated, however, by the various sources of extraneous variability in the data. Specifically, we identify four other variables (besides the reading technique) which may have an impact on the detection rate of a reviewer: the experiment run within which the reviewer participated, the problem domain, the document itself, and the reviewer's experience. The experiment run is taken into account by performing a separate analysis for the pilot study and the 1995 run. The domain is taken into account in a similar way, by performing separate analyses for generic and NASA documents. The technique used and the document read are represented by nominal-scale variables used in our models. We measured reviewer experience as the number of years the reviewer had spent performing jobs related to the assigned perspective.

We are also careful to note that there are variables that our statistical analysis cannot measure. Perhaps most importantly, an influence due to a learning effect would be hidden within the effect of the reading technique. The full list of these threats to validity is found in Section 4, and any interpretation of results must take them into account.

In Section 3.1, we analyze the coverage that could result from PBR review teams, by simulating teams composed of one reviewer from each perspective. Section 3.2 presents the analysis of scores on an individual basis. Section 3.3 takes an initial look at the analysis with respect to the reviewer perspectives. In Section 3.4, we analyze the relationship between the reviewer's experience and the individual scores. In each section, we present the general analysis strategy and some details on the statistical tests, followed by the statistical results.

3.1 Analysis for Teams

In this section, we give a preliminary analysis concerning our primary hypothesis of the effect of PBR on inspection teams. It should be noted that since the PBR techniques are specific, any one of them will not cover the entire document. It is assumed that several of them need to be combined, in a team, to offer complete coverage of the document. In composing these teams, we might select one reader from each of the perspectives. Or, we could use the expected error profile of a project to help determine the number and types of perspectives to be used. For example, if we knew that, for a particular application domain, there is a tendency to commit a large number of errors of omission, and that the user perspective offers the most opportunity for exposing omission errors, we might create a team with a larger number of use-based readers. One interesting direction for future research will be in developing ways to tailor the selection of reviewer perspectives to the problem at hand.
However, for this experiment we examine the defect coverage of teams composed of one reviewer from each of the three perspectives. Because we are concerned only with the range of a team's defect coverage, and not with issues of how team members will interact, we simulate team results by taking the union of the defects detected by the reviewers on the team. We emphasize that this grouping of individual reviewers into teams was performed after the experiment's conclusion, and does not signify that the team members actually worked together in any way. The only real constraint on the makeup of a team which applied PBR is that it contain one reviewer using each of the three perspectives; the non-PBR teams can have any three reviewers who applied their usual technique. At the same time, the way in which the teams are composed has a very strong effect on the team scores, so an arbitrary choice can have a significant effect on the test results.

For these reasons, we used a permutation test to test for differences in team scores between the PBR and the usual technique. We examine hypothetical teams from both the pilot study and the 1995 run, but keep the analysis of each run separate. Since the generic and NASA problem domains are also very different, we compare reviewer scores on documents within the same domain only. This gives us four iterations of the permutation test, for which we present an informal description here.

In Section 2.3 we explained how reviewers were categorized into two groups, depending on which technique they applied to which document. Reviewers in Group 1 applied their usual technique to Document A and PBR to Document B, where Document A and Document B represent the two documents within either of the domains. We can thus generate the set of all possible non-PBR teams for Document A and the set of all possible PBR teams for Document B, and examine the defect coverage that could be expected from each technique by taking the average detection rate of each set. This ensures that our results are independent of any arbitrary choice of team members, but because the data points for all possible teams are not independent (i.e., each reviewer appears multiple times in this list of all possible teams), we cannot run simple statistical tests on these average values. For now, let us call these averages A1USUAL and B1PBR. We can then perform the same calculations for Group 2, in which reviewers applied their usual technique to Document B and PBR to Document A, in order to obtain averages A2PBR and B2USUAL. The test statistic (A2PBR - A1USUAL) + (B1PBR - B2USUAL) then gives us some measure of how all possible PBR teams would have performed relative to all possible usual technique teams, for each document. For each test performed, Figure 3 shows the average of the PBR scores (A2PBR and B1PBR) against the average usual technique scores (A1USUAL and B2USUAL).

Now suppose we switch a reviewer in Group 1 with someone from Group 2. The new reviewer in Group 1 will be part of a usual technique team for Document A even though he used PBR on this document, and will be part of a PBR team for Document B even though he applied the usual technique. A similar but reversed situation awaits the reviewer who suddenly finds himself in Group 2.
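Before turning to what this swapping implies, the following is a minimal sketch (ours, not the authors' code) of how the simulated team scores and the test statistic could be computed. The Reviewer record, the detection-rate formula, and the handling of splits that leave a perspective unrepresented are illustrative assumptions; only the test statistic itself follows the definition in the text.

```python
# A minimal sketch (not the authors' code) of the simulated-team computation
# and the permutation test described above.

from dataclasses import dataclass
from itertools import combinations
from statistics import mean
from typing import FrozenSet, Sequence

@dataclass(frozen=True)
class Reviewer:
    perspective: str                 # "designer", "tester" or "user"
    defects_doc_a: FrozenSet[int]    # seeded defects this reviewer found in Document A
    defects_doc_b: FrozenSet[int]    # seeded defects this reviewer found in Document B

def team_rate(members: Sequence[FrozenSet[int]], total_defects: int) -> float:
    """Detection rate of a simulated team: union of the members' defects."""
    return 100.0 * len(frozenset().union(*members)) / total_defects

def usual_avg(group: Sequence[Reviewer], doc: str, total: int) -> float:
    """Average rate over all 3-reviewer teams (any reviewers) on one document."""
    sets = [r.defects_doc_a if doc == "A" else r.defects_doc_b for r in group]
    return mean(team_rate(t, total) for t in combinations(sets, 3))

def pbr_avg(group: Sequence[Reviewer], doc: str, total: int) -> float:
    """Average rate over all teams with one reviewer per perspective."""
    by_persp = {p: [r for r in group if r.perspective == p]
                for p in ("designer", "tester", "user")}
    rates = []
    for team in ((d, t, u) for d in by_persp["designer"]
                           for t in by_persp["tester"]
                           for u in by_persp["user"]):
        sets = [r.defects_doc_a if doc == "A" else r.defects_doc_b for r in team]
        rates.append(team_rate(sets, total))
    return mean(rates)

def statistic(group1: Sequence[Reviewer], group2: Sequence[Reviewer],
              total_a: int, total_b: int) -> float:
    """(A2PBR - A1USUAL) + (B1PBR - B2USUAL) for one assignment of reviewers."""
    return ((pbr_avg(group2, "A", total_a) - usual_avg(group1, "A", total_a))
            + (pbr_avg(group1, "B", total_b) - usual_avg(group2, "B", total_b)))

def covers_all_perspectives(group: Sequence[Reviewer]) -> bool:
    return {"designer", "tester", "user"} <= {r.perspective for r in group}

def permutation_p_value(group1: Sequence[Reviewer], group2: Sequence[Reviewer],
                        total_a: int, total_b: int) -> float:
    """Rank the undiluted assignment among all splits of the reviewers into two
    groups of the original sizes; the p-value is that rank divided by the
    number of splits considered."""
    observed = statistic(group1, group2, total_a, total_b)
    everyone = list(group1) + list(group2)
    stats = []
    for subset in combinations(range(len(everyone)), len(group1)):
        g1 = [everyone[i] for i in subset]
        g2 = [everyone[i] for i in range(len(everyone)) if i not in subset]
        # Simplifying assumption: skip splits that leave some perspective
        # unrepresented, since no one-per-perspective team can then be formed.
        if not (covers_all_perspectives(g1) and covers_all_perspectives(g2)):
            continue
        stats.append(statistic(g1, g2, total_a, total_b))
    rank = sum(1 for s in stats if s >= observed)
    return rank / len(stats)
```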
If the use of PBR does in fact improve team detection scores, one would intuitively expect that as the PBR teams are diluted with usual technique reviewers, their average score will decrease (that is, A2PBR and B1PBR decrease), even as the average score of usual technique teams with more and more PBR members is being raised (that is, A1USUAL and B2USUAL increase). Thus, the test statistic computed above will decrease. On the other hand, if PBR in fact has no effect, then as reviewers are switched between groups the only changes will be due to random variation, and team scores may improve or decrease with no correlation with the reading technique of the reviewers from which they are formed.

So, let us now compute the test statistic for all possible permutations of reviewers between Group 1 and Group 2, and rank each of these scenarios in decreasing order by the statistic. We can now formulate the null hypothesis we are testing as:

H0: There is no difference in the defect detection rates of teams applying PBR as compared to teams applying the usual technique. That is, every successive dilution of a PBR team with non-PBR reviewers has only random effects on team scores.

The alternative hypothesis is:

Ha: The defect detection rates of teams applying PBR are higher compared to teams using the usual technique. That is, every time the PBR teams were diluted with non-PBR reviewers they tended to perform somewhat worse relative to the usual technique teams.

If the scenario in which no dilution has occurred appears toward the top of the list (in the top 5%), we will reject H0 and conclude that PBR does have a beneficial effect on team scores. Note that this is meant to be only a very rough and informal description of the intuition behind the test; the interested reader is referred to Edington's Randomization Tests (Edington, 1987).

Figure 3. Simulated scores for undiluted teams. (Average detection rates of PBR teams versus usual technique teams for Pilot/Generic, Pilot/NASA, 1995/Generic and 1995/NASA.)

Table 1 summarizes the results. The first row of the table, for example, shows that twelve reviewers read the generic documents in the pilot study; there are 924 distinct ways they can be assigned into groups of 6. The group in which there was no dilution had the 61st highest test statistic, corresponding to a p-value of 0.0660. Both domains in the 1995 run had results significant at the 0.05 level, that is, we can reject H0 because undiluted teams appear in the top 5% of all possible permutations between groups. The generic domain in the pilot study had results significant at the 0.1 level. However, since there were only 924 permutations generated in the pilot study, the power of the test is correspondingly less, and it may be reasonable to reject the null hypothesis at the 0.1 level.

  Experiment Run/Domain   Number of Permutations Generated   Rank of Undiluted Group   P-value
  Pilot/Generic           924                                61                        0.0660
  Pilot/NASA              924                                401                       0.4340
  1995/Generic            3003                               2                         0.0007
  1995/NASA               1716                               67                        0.0390

Table 1. Results of permutation tests for team scores.

3.2 Analysis for Individuals

Although our main hypothesis was an investigation of teams, we decided to also analyze the results on an individual basis. The purpose was to determine whether an individual performed differently when reviewing with PBR than when using the usual technique.
The dependent variable was again the defect rate: in this case, the percentage of true defects found by a single reviewer with respect to the total number of defects in the inspected document. As we did for the analysis of team scores, we analyzed separately both the experiment runs (pilot study and 1995 run) and the problem domains (generic and NASA). Thus, we performed four separate analyses, one for each combination of experiment run and problem domain.

For each analysis, the corresponding design is a 2 × 2 factorial experiment with repeated measures in blocks of size 2 (Winer, 1991). This analysis involves two factors, or treatments, on which there are repeated measures: the reading technique (RTECH) and the document reviewed (DOC). Both of these independent variables are on a nominal scale. The reading technique has two levels: PBR and usual. The document reviewed can also assume two levels, which depend on the problem domain considered: ATM and PG for the generic domain, NASA_A and NASA_B for the NASA domain. Subjects were assigned to two groups, or blocks, and had repeated measures on the two treatments within each block. The result is shown in Figure 4. For the generic domain, Group 1 applied the usual technique to the ATM document and PBR to the PG document, while Group 2 applied PBR to ATM and the usual technique to the PG document. For the NASA problem domain, subjects in Group 1 read the NASA_A document with their usual technique and the NASA_B document with PBR, while subjects in Group 2 read the documents in the opposite fashion.

Figure 4. 2 × 2 factorial experiments with repeated measures in blocks of size 2. (Generic domain: Group 1 usual/ATM and PBR/PG, Group 2 usual/PG and PBR/ATM. NASA domain: Group 1 usual/NASA_A and PBR/NASA_B, Group 2 usual/NASA_B and PBR/NASA_A.)

These plans use all the treatment conditions required for the complete factorial experiment, but the block size is reduced from four to two; that is, within any block only two treatment combinations appear instead of the four possible treatment combinations. The cost of this reduction in block size is the loss of some information on interactions. Precisely, the interaction RTECH × DOC is totally confounded with the group main effect. This means that we cannot estimate the two-factor interaction separately from the group effect. However, we do not expect this interaction to be important because both documents are within the same problem domain. In exchange, both of the main effects are completely within-block effects, and thus independent of the subject variability.

The two design plans in the pilot study have 6 subjects in each group. In contrast, the 1995 run shows an unbalanced situation. For the generic problem domain, there are 8 subjects in Group 1 and 6 subjects in Group 2, while for the NASA domain, there are 7 subjects in Group 1 and 6 subjects in Group 2. Thus, in order to perform the analysis of variance for the unbalanced design, we used the GLM procedure in the SAS statistical package (SAS, 1989), which uses the method of least squares to fit general linear models.

The analysis of variance for the 2 × 2 factorial design with repeated measures in blocks of size 2 uses F ratios to test three hypotheses:

Group effect or RTECH × DOC interaction effect
H0: There is no difference between subjects in Group 1 and subjects in Group 2 with respect to their mean defect rate scores.
Ha: There is a difference between subjects in Group 1 and subjects in Group 2 with respect to their mean defect rate scores.
Main effect RTECH
H0: There is no difference between subjects using PBR and subjects using their usual technique with respect to their mean defect rate scores.
Ha: There is a difference between subjects using PBR and subjects using their usual technique with respect to their mean defect rate scores.

Main effect DOC
H0: There is no difference between subjects reading the ATM document (or NASA_A document) and subjects reading the PG document (or NASA_B document) with respect to their mean defect rate scores.
Ha: There is a difference between subjects reading the ATM document (or NASA_A document) and subjects reading the PG document (or NASA_B document) with respect to their mean defect rate scores.

The analysis makes a number of assumptions, which we were careful to fulfill: The dependent variable is measured on a ratio scale, and the independent variables are nominal. Observations are independent. The values tested for each level of the independent variables are normally distributed; we confirmed this with the Shapiro-Wilk W test (Shapiro, 1965). The test also assumes that the sources are homogeneous with regard to the two groups. However, we note that the test is robust against violations of this last assumption for data sets such as ours, in which the number of subjects in the largest treatment group is no more than 1.5 times greater than the number of subjects in the smallest (Hatcher, 1994). Finally, the test assumes that the sample is obtained through random sampling; this is a threat to the validity of our experiment, as we had to rely on volunteers for our subjects (see Section 4).

3.2.1 Analysis of Pilot Study in the Generic Problem Domain

The analysis, summarized in Table 2, failed to reveal a significant main effect either for the reading technique (p = 0.2148) or for the document reviewed (p = 0.4068). The interaction between reading technique and document, which is totally confounded with the group effect, also proved to be non-significant (p = 0.5582). The mean defect rates obtained for each level of reading technique and document are displayed in Figure 5. The defect detection rate for PBR reviewers is slightly higher (24.92) than for reviewers using their usual technique (20.58). Although this difference is not statistically significant, it represents a 21% improvement over the usual detection rate.

  Source                        df    SS        MS       F      p > F
  Between subjects              11    1205.50
    Group or RTECH × DOC         1      42.67    42.67   0.37   0.5582
    Error                       10    1162.83   116.28
  Within subjects               12     803.01
    Reading Technique (RTECH)    1     112.67   112.67   1.75   0.2148
    Document (DOC)               1      48.17    48.17   0.75   0.4068
    Error                       10     642.17    64.22

Table 2. ANOVA summary table for pilot study in the generic problem domain.

Figure 5. Individual mean scores of pilot study for the generic problem domain. (Mean detection rates for the usual technique, PBR, and the two documents; usual 20.58, PBR 24.92.)

3.2.2 Analysis of Pilot Study in the NASA Problem Domain

The analysis, summarized in Table 3, failed to reveal a significant main effect either for the reading technique (p = 0.9629) or for the document reviewed (p = 0.7109). The interaction between reading technique and document, which is totally confounded with the group effect, also proved to be non-significant (p = 0.1339). The mean defect rates obtained for each level of reading technique and document are displayed in Figure 6. Reviewers scored poorly regardless of which reading technique was used or which document was reviewed.
  Source                        df    SS        MS       F      p > F
  Between subjects              11    1606.00
    Group or RTECH × DOC         1     337.50   337.50   2.66   0.1339
    Error                       10    1268.50   126.85
  Within subjects               12     744.01
    Reading Technique (RTECH)    1       0.17     0.17   0.00   0.9629
    Document (DOC)               1      10.67    10.67   0.15   0.7109
    Error                       10     733.17    73.32

Table 3. ANOVA summary table for pilot study in the NASA problem domain.

Figure 6. Individual mean scores of pilot study for the NASA problem domain. (Mean detection rates were low for both techniques and both documents, all falling between roughly 13 and 14.)

3.2.3 Analysis of 1995 Run in the Generic Problem Domain

The analysis, summarized in Table 4, revealed a significant main effect at the 0.05 level both for the reading technique (p = 0.0019) and for the document reviewed (p = 0.0160). However, the defect detection rate has a stronger relationship with the reading technique (R2 = 0.44) than with the document reviewed (R2 = 0.22); R2 indicates what percentage of the variance in the dependent variable is accounted for by the independent variable. Unlike the main effects, the interaction between reading technique and document, which is totally confounded with the group effect, proved to be non-significant (p = 0.5213). The mean defect rates obtained for each level of reading technique and document are displayed in Figure 7. PBR reviewers (detection rate = 32.14) scored significantly higher than reviewers using their usual technique (detection rate = 24.64), a 30% improvement over the usual detection rate. In addition, reviewers reading the PG document (detection rate = 31.29) performed significantly better than reviewers reading the ATM document (detection rate = 25.50).

  Source                        df    SS        MS       F       p > F
  Between subjects              13    3093.17
    Group or RTECH × DOC         1     108.57   108.57    0.44   0.5213
    Error                       12    2984.60   248.72
  Within subjects               14     719.99
    Reading Technique (RTECH)    1     318.24   318.24   15.72   0.0019
    Document (DOC)               1     158.81   158.81    7.84   0.0160
    Error                       12     242.94    20.24

Table 4. ANOVA summary table for 1995 run in the generic problem domain.

Figure 7. Individual mean scores of 1995 run for the generic problem domain. (Mean detection rates: usual 24.64, PBR 32.14, ATM 25.50, PG 31.29.)

3.2.4 Analysis of 1995 Run in the NASA Problem Domain

The analysis, summarized in Table 5, failed to reveal a significant main effect either for the reading technique (p = 0.4755) or for the document reviewed (p = 0.9100). The interaction between reading technique and document, which is totally confounded with the group effect, also proved to be non-significant (p = 0.5394).

  Source                        df    SS         MS        F      p > F
  Between subjects              12    14405.61
    Group or RTECH × DOC         1      506.90    506.90   0.40   0.5394
    Error                       11    13898.71   1263.52
  Within subjects               13     2137.94
    Reading Technique (RTECH)    1      100.94    100.94   0.55   0.4755
    Document (DOC)               1        2.48      2.48   0.01   0.9100
    Error                       11     2034.52    184.96

Table 5. ANOVA summary table for 1995 run in the NASA problem domain.

The mean defect rates obtained for each level of reading technique and document are displayed in Figure 8. The defect detection rate for PBR reviewers is slightly higher (51.23) than for reviewers using their usual technique (47.23). All of the mean scores in the 1995 run were considerably better than the mean scores in the pilot study, confirming the need for improving the experimental conditions.

Figure 8. Individual mean scores of 1995 run for the NASA problem domain. (Mean detection rates: usual 47.23, PBR 51.23, and the corresponding means for NASA_A and NASA_B.)
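The analyses in Sections 3.2.1 through 3.2.4 were produced with the SAS GLM procedure. As a rough illustration of the same kind of model, the following sketch fits the within-subject part of the 2 × 2 repeated-measures design with Python's statsmodels; the column names and the example data are illustrative assumptions, not the experiment's data or the authors' code.

```python
# A sketch (not the authors' SAS code) of the 2 x 2 repeated-measures ANOVA
# used in Sections 3.2.1-3.2.4.  One row per review, with assumed columns:
# subject, rtech ('PBR'/'usual'), doc, and rate (detection rate in percent).

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

def within_subject_anova(reviews: pd.DataFrame) -> pd.DataFrame:
    """Fit rate ~ subject + RTECH + DOC and return the ANOVA table.

    Fitting the subject factor absorbs between-subject variability, so the
    F tests for RTECH and DOC use the within-subject error term, as in the
    paper's Tables 2-5.  The group effect (confounded with RTECH x DOC) is a
    between-subjects effect and would need the subjects-within-groups error
    as its denominator, which this short sketch does not compute.
    """
    model = ols("rate ~ C(subject) + C(rtech) + C(doc)", data=reviews).fit()
    return sm.stats.anova_lm(model, typ=2)

# Hypothetical example with two subjects per group (Group 1: usual/ATM and
# PBR/PG; Group 2: usual/PG and PBR/ATM):
reviews = pd.DataFrame({
    "subject": [1, 1, 2, 2, 3, 3, 4, 4],
    "rtech":   ["usual", "PBR"] * 4,
    "doc":     ["ATM", "PG", "ATM", "PG", "PG", "ATM", "PG", "ATM"],
    "rate":    [21, 25, 20, 24, 23, 27, 22, 26],
})
print(within_subject_anova(reviews))
```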
3.3 Analysis for Perspectives

Aside from detection rates, we were also interested in determining whether the PBR reviewers discovered a larger class of errors than those without PBR training, and whether the errors found were orthogonal (i.e., the perspectives did not overlap in terms of the set of defects they helped detect). A full study of the correlation between the different perspectives and the types and numbers of errors they uncovered will be the subject of future work, but for now we take a qualitative look at the results for each perspective by examining each perspective's coverage of defects and how the perspectives overlap. We formulate no explicit statistical tests concerning the detection rates of reviewers using each of the perspectives, but present Figures 9a and 9b as an illustration of the defect coverage of each perspective. Results within domains are rather similar; therefore we present the ATM coverage charts as an example from the generic domain and the NASA_A charts as an example from the NASA domain. The numbers within each of the circle slices represent the number of defects found by each of the perspectives intersecting there. So, for example, ATM reviewers using the design perspective in the 1995 run found 11 defects in total: two were defects that no other perspective caught, three defects were also found by testers, one defect was also found by users, and five defects were found by at least one person from each of the three perspectives.

Figure 9a. Defect coverage for the ATM document in the generic domain. (Venn diagrams of the defects found by the Designer, Tester and Use-based perspectives in the 1994 and 1995 runs.)

Figure 9b. Defect coverage for the NASA_A document in the NASA domain. (Venn diagrams of the defects found by the Designer, Tester and Use-based perspectives in the 1994 and 1995 runs.)

3.4 Analysis for Reviewer's Experience

We measured reviewer experience via questionnaires used during the course of the experiment: a subjective question asked each reviewer to rate on an ordinal scale his or her level of comfort using such documents, and objective questions asked how many years the reviewer had spent in each of the perspective roles (designer, tester, user). As shown in Figures 10a and 10b, the relationship between PBR defect rates and experience is weak. Reviewers with more experience do not perform better than reviewers with less experience. On the contrary, it appears that some less-experienced reviewers have learned to apply PBR better. Both Spearman's and Pearson's correlation coefficients were computed in order to measure the degree of association between the two variables for each type of document in each experiment run, but in no case was there any value above 35% (values close to 100% would have indicated a high degree of correlation). A similar lack of significant relationship was found by Humphrey in analyzing the experiences with teaching the Personal Software Process (Humphrey, 1996).

Figure 10a. PBR defect rate versus years of role experience in the pilot study (generic and NASA documents).

Figure 10b. PBR defect rate versus years of role experience in the 1995 run (generic and NASA documents).
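As a small illustration of the association measures used in this subsection, the following sketch computes Pearson's and Spearman's coefficients with SciPy. The variable names and the data are hypothetical, not the experiment's.

```python
# A minimal sketch (not from the paper) of the association analysis in Section
# 3.4: Pearson's and Spearman's coefficients between years of role experience
# and PBR defect rate.  The variable names and sample data are illustrative.

from scipy.stats import pearsonr, spearmanr

def experience_correlations(years_experience, pbr_defect_rate):
    """Return (Pearson r, Spearman rho) for one document type in one run."""
    r, _ = pearsonr(years_experience, pbr_defect_rate)
    rho, _ = spearmanr(years_experience, pbr_defect_rate)
    return r, rho

# Hypothetical example: one value pair per PBR reviewer of a document.
years = [1, 3, 5, 8, 12, 20]
rates = [35, 28, 30, 27, 33, 25]
r, rho = experience_correlations(years, rates)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```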
4. Threats to Validity

Threats to validity are factors beyond our control that can affect the dependent variables. Such threats can be considered unknown independent variables causing uncontrolled rival hypotheses to exist in addition to our research hypotheses. One crucial step in an experimental design is to minimize the impact of these threats. In this section, we present both those threats which we anticipated and tried to control for, as well as those threats which were only realized after the fact.

We distinguish two classes of threats to validity: threats to internal validity and threats to external validity. Threats to internal validity constitute potential problems in the interpretation of the data from the experiment. If the experiment does not have a minimum internal validity, we can make no valid inference regarding the cause-effect relationship between independent and dependent variables. The level of external validity, on the other hand, tells us nothing about whether the data is interpretable, but is an indicator of the generalizability of the results. Depending on the external validity of the experiment, the data can be assumed to be valid in other populations and settings.

4.1. Threats to Internal Validity

The following five threats to internal validity (Campbell, 1963) are discussed in order to reveal their potential interference with our experimental design:
• History: Since there was one day between the two days of the experiment, some of the improvement that appears due to technique may be attributed to other events that took place between the tests. The subjects were instructed not to discuss the experiment or otherwise do anything between the tests that could cause an unwanted effect on the results. We trusted the professional programmers in Group 1 not to discuss their specifications with the programmers in Group 2. Thus, we do not consider this effect to be very significant, but we cannot completely ignore it.
• Maturation: This is the effect of processes taking place within the subjects as a function of time, such as becoming tired or bored, but it may also be intellectual maturation, regardless of the experimental events. For our experiment, the likely effect would be that tests towards the end of the day tend to get worse results than they would normally. We provided long breaks between tests to try to avoid such a tendency. However, since the ordering of documents and domains was different for the two days, the differences between the two days may be disturbed by maturation effects. Looking at the design of the experiment, we see that an improvement from the first to the second day would be amplified for the generic documents, while it would be lessened for the NASA documents. Based on the results from the experiment, we see that this effect seems plausible. On the other hand, if we had chosen to have the same order of domains and documents on the two days, the threat would be worse, because an improvement from the usual technique to PBR would be completely confounded with the maturation effect.
• Testing: Getting familiar with the tests may have effects on subsequent results. This threat has several components, including becoming familiar with the specifications, the technique, or the testing procedures. This effect may amplify the effects of the historical events and thus be part of the reason for improvement that has previously been considered a result of the change in technique. Testing effects may counteract maturation effects within each day.
Although our subjects were already familiar with NASA documents, we tried to overcome unwanted effects by providing training sessions before each test where the subjects could familiarize themselves with the particular kind of document and technique. Also, the subjects received no feedback regarding their actual defect detection success during the experiment, so that it would presumably be difficult for them to discover whether aspects of their performance were in fact improving their detection rate or not. Furthermore, the generic documents are dissimilar enough that there is little to be learned from the first document that could be transferred to the second. However, it would be interesting to replicate this experiment with a true control group who did not receive PBR training on day 2.
• Instrumentation: These effects are basically due to differences in the way of measuring scores. Our scores were measured independently by two people who did not know which treatment they were grading, and any disagreements were then discussed in order to resolve them consistently.
• Selection: For the 1995 run we used random assignment of subjects to perspective. Since PBR assumes the reviewers in a team use the perspectives with which they are familiar, the random assignment used in the experiment would presumably lead to an underestimation of the improvement caused by PBR.

Another threat to internal validity is the possibility that the subjects ignore PBR when they are supposed to use it. In particular, there is a danger that the subjects continue to use their usual technique. This need not be the result of a deliberate choice by the subject, but may simply reflect the fact that people unconsciously prefer to apply existing skills with which they are familiar. The only way of coping with this threat is to provide enhanced training sessions and some sort of control or measure of conformance to the assigned technique. However, as we have already shown, the PBR scenarios did have a positive effect over the usual technique on most of the documents, so even if the experiment was confounded with this effect, the true results would be stronger.

4.2. Threats to External Validity

Threats to external validity imply limitations to generalizing the results. The experiment was conducted with professional developers and with documents from an industrial context, so these factors should pose little threat to external validity. However, the limited number of data points is a potential problem which may only be overcome by further replications of the experiment. Other threats to external validity pertinent to the experimental design include (Campbell, 1963):
• Interaction of testing and treatment: A pretest may affect the subject's sensitivity to the experimental variable. Both of our groups received similar pretests and treatments, so this effect may be of concern to us. We cannot avoid the fact that this is an experimental environment, and all subjects knew that. This, by itself, may affect the results and is a limitation of almost any experimental design.
• Interaction of selection and treatment: Selection biases may have different effects due to interaction with the treatment. One factor we need to be aware of is that all our subjects were volunteers. This may imply that they are more prone to improvement-oriented efforts than the average developer - or it may indicate that they consider the experiment an opportunity to get away from normal work activities for a couple of days.
Thus, the effects can strike in either direction. Also, all subjects had received training in their usual technique, a property that developers from other organizations may not possess.

• Reactive arrangements: These effects are due to the experimental environment. In 1994, the pilot study was carried out in the subjects' own environment, and its results would thus also be valid in a real setting. We cannot assume the same for the 1995 results, since this run was done in a classroom situation. However, the change of experimental environment between the experiment runs made it easier to concentrate on the techniques and tests to be done, thus separating the techniques better.

Since this experiment was conducted using personnel from the NASA SEL environment, it is reasonable to discuss whether the results can be generalized to a NASA SEL context. This kind of generalization involves less of a change in context than is the case for an arbitrary organization; in particular, the differences in populations can be ignored since the population for the experiment is in fact all of the NASA SEL developers. Clearly, the results for the generic documents cannot be generalized to the NASA documents due to the difference in nature between the two sets of documents. The results for the NASA documents, on the other hand, may be valid since we used parts of real NASA documents.

5. Discussion

We have encountered problems in the two runs of the experiment, which we have previously discussed. However, some of these problems are of a general nature and may be relevant in other experimental situations.

• What is a good design for the experiment under investigation, given the constraints? An important constraint in the design of our experiment was the necessity of a minimal number of subjects and experiment runs due to high costs and low availability of subjects. Unlike many other academic studies, it is simply not feasible to acquire another classroom full of students and rerun the experiment. This is not meant to disparage such studies (several of the authors have conducted numerous studies of that type in the past), but is only an indication that the more usual academic model is not applicable in the industrial setting we wish to study. Analytical techniques involving only a few subjects need to be developed and explored.

• What is the optimal sample size? Small samples lead to problems in the statistical analysis, while large samples represent major expenses for the organization providing the subjects. Organizations generally have limits on the number of subjects they are willing to part with for an experiment, so the cost concerns are handled by the organizations themselves. A small sample size requires us to be careful in the design in order to get as many useful data points as possible. In our case, we chose to neglect learning effects in order to avoid having control groups. While this gave us more data points for analyzing the difference between the two techniques, it left us uncertain about the threat to internal validity caused by learning effects.

• We need to adjust to various constraints - how far can we go before the value of the experiment decreases to a level where it is not worthwhile? The problem is how controlled we can make the experimental environment and still keep the industrial organization interested in participating. Giving each subject training in PBR and giving up a control group is an example of this tradeoff.
• To what extent can experimental aspects such as design, instrumentation and environment be changed if the experiment is still to be considered a replication? The software engineering literature contains many examples of experiments, often under the guise of "case study." How often can we compare the results of two different case studies? In our own example, we viewed the 1994 pilot study and the 1995 run as distinct even though they were probably more similar than other published case studies. How to build up a body of knowledge that others may reference is a non-trivial problem.

• What threats to validity did we fail to address? The hardest part of any experiment is to admit what you did wrong and what did not go as planned. Throughout this report we addressed several aspects of our experimental design that could be improved. Some of the more important ones are:

1. The maturation internal threat to validity may be a factor if tiredness in the afternoon affects results. Our experimental design favored the generic documents over the NASA documents, as we mentioned earlier.

2. The testing internal threat to validity may be a factor if there is a learning effect. Rerunning the experiment with a third control group, or rerunning the experiment to add a third or fourth day of PBR testing to see if there is further improvement after repeated PBR practice, are ways to address this.

3. Matching subject experience with the PBR scenario would make the results more relevant to the usual NASA domain. Of course, rerunning this experiment within another domain would be necessary to generalize this technique outside of the NASA SEL flight dynamics domain.

4. There are other issues besides coverage that are important when studying review teams. Even though we have indications that teams composed of developers using distinct PBR techniques are better than teams using no formal reading technique, it is important to confirm these simulated results with a study in which real review teams are used.

A more fundamental problem that should be considered is to what extent the proposed technique is actually followed. This problem with process conformance is relevant in experiments, but also in software development, where deviations from the process to be followed may lead to wrong interpretation of the measures obtained. For experiments, one problem is that the mere action of controlling or measuring conformance may have an impact on how well the techniques work, thus decreasing the external validity. Conformance is relevant in this experiment because there seems to be a difference that corresponds to experience level. Subjects with less experience seem to follow PBR more closely (“It really helps to have a perspective because it focuses my questions. I get confused trying to wear all the hats!”), while people with more experience were more likely to fall back to their usual technique (“I reverted to what I normally do.”).

6. Conclusions and Future Directions

To get high quality software, the various documents associated with software development must be verified and validated. To perform verification or validation effectively, people must gain an understanding of the document. We consider reading the key technical activity for understanding a document. In this paper we have presented a reading technique called Perspective-Based Reading (PBR) and its application to requirements documents. We tested the effectiveness of PBR in two runs of a controlled experiment with professionals from the NASA SEL environment.
The subjects used both their usual technique and PBR on generic requirements documents and on requirements documents from their application domain (NASA documents). As PBR can be used in methods, like inspections, where two or more reviewers of a review team individually look for defects, we were especially interested in the team results. Because of time and cost constraints, we simulated team results by combining the defects detected by the reviewers on a team (see the sketch below). In the first run (pilot study) of the experiment we only got significant results for teams using PBR on the generic documents. After the pilot study we made some changes to improve the experiment. In the 1995 run, PBR teams provided significantly better coverage of both NASA and generic documents. The reasons for this observed improvement, as compared to the pilot study, may include shorter assignments in the NASA problem domain and a training session before each document review.

We have also compared PBR and the usual technique with respect to the individual performance of reviewers. Although in most cases reviewers using PBR found about the same number of defects as reviewers using their usual technique, PBR reviewers did perform significantly better on the generic documents in the 1995 run. This is an unexpected result, since the true benefit of PBR is expected to be seen at the level of teams, which combine several different perspectives for improved coverage. The results for individuals show that under certain conditions, the use of focused techniques may lead to improvements at the individual level as well.

We think that better results could be achieved by more closely tailoring PBR to the specific characteristics of the NASA documents and the NASA SEL environment. Partly, this conclusion is motivated by examining the distribution of discovered defects among the different PBR techniques: in the generic documents there were a number of defects which were found by only one of the perspectives, while on the NASA documents a much greater degree of overlap among the perspectives was observed. We also got feedback from the subjects that supported this view; several found it tempting to fall back to their usual technique when reading the NASA documents, thus underestimating the effect of using PBR. This observation is also supported by the lack of relationship between defect coverage using PBR and the reviewer's experience in the assigned role. A possible direction for further experimentation would be to do a case study of a NASA SEL project to obtain more qualitative data, so that we can understand how to control conformance to the assigned technique, and discover characteristic differences among the individual PBR techniques.

Throughout the pilot study and the 1995 run we realized that there are some threats to validity. For us it was important to describe and address all of them in detail so that other researchers benefit from the lessons we have learned and can try to avoid the threats while replicating this experiment or developing another one. Some threats have their origin in the fact that this was not an experiment with students but with professionals from industry. Some of the threats might even only be addressed through replication. We need to replicate the generic part of the experiment in other environments, perhaps even in other countries where differences in language and culture may cause effects that can be interesting targets for further investigation.
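As a concrete illustration of how the team results were simulated, the following sketch pools the defect sets reported by individual reviewers into a single simulated team score and also lists the defects found by only one perspective, the overlap measure discussed above. This is a minimal sketch written for this discussion; the reviewer names, defect identifiers, and function names are illustrative and are not taken from the experiment's materials.

```python
# Minimal, illustrative sketch (not from the experimental materials): team results
# are simulated by pooling the defects reported by the individual reviewers.
from typing import Dict, Set


def team_coverage(findings: Dict[str, Set[int]], total_defects: int) -> float:
    """Fraction of seeded defects covered by the union of all reviewers' findings."""
    covered = set().union(*findings.values()) if findings else set()
    return len(covered) / total_defects


def found_by_one_perspective(findings: Dict[str, Set[int]]) -> Dict[str, Set[int]]:
    """Defects reported by exactly one perspective, a rough indicator of overlap."""
    unique = {}
    for name, defects in findings.items():
        others = set().union(*(d for other, d in findings.items() if other != name))
        unique[name] = defects - others
    return unique


# Hypothetical reviewer data: one reviewer per PBR perspective, 15 seeded defects.
findings = {
    "tester":    {1, 2, 5, 9},
    "developer": {2, 3, 7, 11},
    "user":      {4, 5, 12},
}
print(team_coverage(findings, total_defects=15))   # simulated team score: 9/15 = 0.6
print(found_by_one_perspective(findings))          # defects unique to each perspective
```

Larger simulated teams could be formed in the same way by taking the union over any chosen subset of reviewers.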
These replications can take the form of controlled experiments with students, controlled experiments with subjects from industry using their usual technique for comparison, or case studies in industrial projects. One challenging goal of a continued series of experiments will be to assess the impact that the threats to validity have. Since it is often hard to design an experiment in a way that controls for most of the threats, a possibility would be to concentrate on certain threats in each replication to assess their impact on the results. For example, one replication may use control groups to measure the effect of repeated tests, while another replication may test explicitly for maturation effects. However, we need to keep the replications under control as far as threats to external validity are concerned, since we need to assume that the effects we observe in one replication will also occur in the others. We are currently working on a lab package to support the replication of this experiment by other researchers in other environments.

Acknowledgements

This research was sponsored in part by grant NSG-5123 from NASA Goddard Space Flight Center to the University of Maryland. We would also like to thank the members of the Experimental Software Engineering Group at the University of Maryland for their valuable comments on this paper.

References

(Campbell, 1963) Campbell, D. T. and Stanley, J. C. 1963. Experimental and Quasi-Experimental Designs for Research. Boston, MA: Houghton Mifflin Company.

(Edington, 1987) Edington, E. S. 1987. Randomization Tests. New York, NY: Marcel Dekker Inc.

(Fagan, 1976) Fagan, M. E. 1976. Design and code inspections to reduce errors in program development. IBM Systems Journal, 15(3):182-211.

(Hatcher, 1994) Hatcher, L. and Stepanski, E. J. 1994. A Step-by-Step Approach to Using the SAS System for Univariate and Multivariate Statistics. Cary, NC: SAS Institute Inc.

(Heninger, 1980) Heninger, K. L. 1980. Specifying Software Requirements for Complex Systems: New Techniques and Their Application. IEEE Transactions on Software Engineering, SE-6(1):2-13.

(Humphrey, 1996) Humphrey, W. S. 1996. Using a Defined and Measured Personal Software Process. IEEE Software, 13(3):77-88.

(Linger, 1979) Linger, R. C., Mills, H. D. and Witt, B. I. 1979. Structured Programming: Theory and Practice. The Systems Programming Series. Addison-Wesley.

(Parnas, 1985) Parnas, D. L. and Weiss, D. M. 1985. Active design reviews: principles and practices. In Proceedings of the 8th International Conference on Software Engineering, pp. 215-222.

(Porter, 1995) Porter, A. A., Votta, L. G. Jr. and Basili, V. R. 1995. Comparing Detection Methods For Software Requirements Inspections: A Replicated Experiment. IEEE Transactions on Software Engineering, 21(6):563-575.

(SAS, 1989) SAS Institute Inc. 1989. SAS/STAT User's Guide, Version 6, Fourth edition, Vol. 2. Cary, NC: SAS Institute Inc.

(SEL, 1992) Software Engineering Laboratory Series. 1992. Recommended Approach to Software Development, Revision 3. SEL-81-305, pp. 41-62.

(Shapiro, 1965) Shapiro, S. S. and Wilk, M. B. 1965. An Analysis of Variance Test for Normality (Complete Samples). Biometrika, 52:591-611.

(Votta, 1993) Votta, L. G. Jr. 1993. Does every inspection need a meeting? In Proceedings of the ACM SIGSOFT '93 Symposium on Foundations of Software Engineering. Association for Computing Machinery.

(Winer, 1991) Winer, B. J., Brown, D. R. and Michels, K. M.
1991. Statistical Principles in Experimental Design, 3rd ed. New York, NY: McGraw-Hill Inc.

A. Sample Requirements

Below is a sample requirement from the ATM document which tells what is expected when the bank computer gets a request from the ATM to verify an account:

Functional requirement 1 (From ATM document)
Description: The bank computer checks if the bank code is valid. A bank code is valid if the cash card was issued by the bank.
Input: Request from the ATM to verify card (Serial number and password).
Processing: Check if the cash card was issued by the bank.
Output: Valid or invalid bank code.

We also include a sample requirement from one of the NASA documents in order to give a picture of the difference in nature between the two domains. Below is the process step for calculating adjusted measurement times:

Calculate Adjusted Measurement Times: Process (From NASA document)

1. Compute the adjusted Sun angle time from the new packet by $t_{s,adj} = t_s + t_{s,bias}$.

2. Compute the adjusted MTA measurement time from the new packet by $t_{T,adj} = t_T + t_{T,bias}$.

3. Compute the adjusted nadir angle time from the new packet.
a. Select the most recent Earth_in crossing time that occurs before the Earth_in crossing time of the new packet. Note that the Earth_in crossing time may be from a previous packet. Check that the times are part of the same spin period by $|t_{e-in} - t_{e-out}| < E_{max} T_{spin,user}$.
b. If the Earth_in and Earth_out crossing times are part of the same spin period, compute the adjusted nadir angle time by $t_{e-adj} = \frac{t_{e-in} + t_{e-out}}{2} + t_{e,bias}$.

4. Add the new packet adjusted times, measurements, and quality flags into the first buffer position, shifting the remainder of the buffer appropriately.

5. The Nth buffer position indicates the current measurements, observation times, and quality flags, to be used in the remaining Adjust Processed Data section. If the Nth buffer does not contain all of the adjusted times (and), set the corresponding time quality flags to indicate invalid data.
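To make the difference in nature between the two domains more tangible, here is a rough sketch of how the process step above might be coded. The data structures, parameter names, the way the Earth_in crossing time is supplied, and the handling of missing data are all assumptions made for illustration; they are not taken from the NASA document.

```python
# Rough, illustrative sketch of the "Calculate Adjusted Measurement Times" step above.
# Data structures, parameter names, and the invalid-data handling are assumptions made
# for illustration only; they are not taken from the NASA document.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AdjustedEntry:
    t_sun_adj: float               # adjusted Sun angle time
    t_mta_adj: float               # adjusted MTA measurement time
    t_nadir_adj: Optional[float]   # adjusted nadir angle time, None if unavailable
    valid: bool = True             # time quality flag


def adjust_measurement_times(t_sun: float, t_mta: float,
                             t_earth_in: float, t_earth_out: float,
                             sun_bias: float, mta_bias: float, nadir_bias: float,
                             e_max: float, t_spin_user: float,
                             buffer: List[AdjustedEntry]) -> None:
    """Apply steps 1-5 of the process step to one new packet.

    t_earth_in is assumed to be the most recent Earth_in crossing time occurring
    before the new packet (it may therefore come from a previous packet).
    """
    # Steps 1 and 2: adjusted time = measured time + bias
    t_sun_adj = t_sun + sun_bias
    t_mta_adj = t_mta + mta_bias

    # Step 3: adjust the nadir angle time only if the Earth_in and Earth_out
    # crossings belong to the same spin period
    t_nadir_adj = None
    if abs(t_earth_in - t_earth_out) < e_max * t_spin_user:
        t_nadir_adj = (t_earth_in + t_earth_out) / 2.0 + nadir_bias

    # Step 4: insert the new adjusted times at the first buffer position,
    # shifting the rest of the buffer (assumed to already hold N entries)
    buffer.insert(0, AdjustedEntry(t_sun_adj, t_mta_adj, t_nadir_adj))
    buffer.pop()

    # Step 5: the Nth (last) position holds the current values; flag it invalid
    # if it does not contain all of the adjusted times
    if buffer[-1].t_nadir_adj is None:
        buffer[-1].valid = False
```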