Kane 1992


QUANTITATIVE METHODS IN PSYCHOLOGY

An Argument-Based Approach to Validity


Michael T. Kane
American College Testing
Iowa City, Iowa

This article outlines a general, argument-based approach to validation, develops an interpretive argument for a placement test as an example, and examines some key properties of interpretive arguments. Validity is associated with the interpretation assigned to test scores rather than with the test scores or the test. The interpretation involves an argument leading from the scores to score-based statements or decisions, and the validity of the interpretation depends on the plausibility of this interpretive argument. The interpretive arguments associated with most test-score interpretations involve multiple inferences and assumptions. An explicit recognition of the inferences and assumptions in the interpretive argument makes it possible to identify the kinds of evidence needed to evaluate the argument. Evidence for the inferences and assumptions in the argument supports the interpretation, and evidence against any part of the argument casts doubt on the interpretation.

Psychological Bulletin, 1992, Vol. 112, No. 3, 527-535
Copyright 1992 by the American Psychological Association, Inc. 0033-2909/92/$3.00

Correspondence concerning this article should be addressed to Michael T. Kane, who is now at the Department of Kinesiology, University of Wisconsin, 2000 Observatory Drive, Madison, Wisconsin 53706.

Validity is associated with the interpretations assigned to test scores rather than with the scores themselves or the test and involves an evaluation of the appropriateness of these interpretations (American Educational Research Association [AERA], American Psychological Association, & National Council on Measurement in Education, 1985; Cronbach, 1971; Messick, 1989). The kinds of evidence needed for the validation of a test-score interpretation can be identified systematically by focusing attention on the details of the interpretation.

At their core, interpretations involve meaning or explanation. The first definition of the verb interpret in Webster's Ninth New Collegiate Dictionary (1989) is "to explain or tell the meaning of; present in understandable terms." This definition captures much of the meaning of interpretation as it is used in discussions of the validity of test-score interpretations. To interpret a test score is to explain the meaning of the score and, thereby, to make at least some of the implications of the score clear.

A test-score interpretation always involves an interpretive argument, with the test score as a premise and the statements and decisions involved in the interpretation as conclusions. The inferences in the interpretive argument depend on various assumptions, which may be more or less credible. For example, inferences from test scores to nontest behavior involve assumptions about the relationship between test behavior and nontest behavior; inferences from test scores to theoretical constructs depend on assumptions included in the theory defining the construct. Because it is not possible to prove all of the assumptions in the interpretive argument, it is not possible to verify this interpretive argument in any absolute sense. The best that can be done is to show that the interpretive argument is highly plausible, given all available evidence.

To validate a test-score interpretation is to support the plausibility of the corresponding interpretive argument with appropriate evidence. The argument-based approach to validation adopts the interpretive argument as the framework for collecting and presenting validity evidence and seeks to provide convincing evidence for its inferences and assumptions, especially its most questionable assumptions. One (a) decides on the statements and decisions to be based on the test scores, (b) specifies the inferences and assumptions leading from the test scores to these statements and decisions, (c) identifies potential competing interpretations, and (d) seeks evidence supporting the inferences and assumptions in the proposed interpretive argument and refuting potential counterarguments.

The remainder of this article describes the argument-based approach in more detail. The next section summarizes general criteria for evaluating the kind of reasoning found in interpretive arguments. The following section outlines some inferences and assumptions that typically appear in interpretive arguments and the kinds of evidence needed to support these inferences and assumptions. A detailed example is then presented, followed by a discussion of some characteristics of interpretive arguments that have a strong influence on validation efforts.

Practical Arguments

The kind of reasoning involved in interpretive arguments has received increasing attention in the last 20 years, under various headings including "practical reasoning," "informal logic," and "rhetoric" (Cronbach, 1982, 1988; House, 1980; Perelman & Olbrechts-Tyteca, 1969; Toulmin, Rieke, & Janik, 1979). Practical arguments address issues in various disciplines and in practical affairs. Because the assumptions in such arguments cannot be taken as given and because the available evidence is often incomplete and, perhaps, questionable, the argument is, at best, convincing or plausible. The conclusions are not proven.

This is a clear departure from traditional logic and mathematics, where the emphasis is on formal rules of inference. In logic and mathematics, the assumptions are taken as given, and the conclusions are proven (i.e., the proof is logically valid) if and only if the chain of inferences from the premises to the conclusions follows certain explicit, formal rules. The rules of inference and the criteria for evaluating the application of these rules in formal arguments are sufficiently unambiguous that formal arguments can always be checked mechanically (e.g., by computer).

Practical arguments make use of traditional logic and mathematics in evaluating some inferences but also include inferences and assumptions that cannot be evaluated in this way. The evidence supporting practical arguments needs to address the appropriateness of various lines of argument in specific contexts, the plausibility of assumptions, and the impact of weak assumptions on the overall plausibility of the argument. There are three general criteria for evaluating practical arguments (see House, 1980; Toulmin et al., 1979, for discussions of practical argumentation).

Criterion 1: Clarity of the Argument

Has the argument been stated clearly? The conclusions to be drawn and the inferences and assumptions used in getting to these conclusions should be specified in enough detail so that what is being claimed by the argument is known. The explicit statement of the details of the argument helps us to understand the conclusions and is also a necessary step in meeting the second and third criteria.

Criterion 2: Coherence of the Argument

Is the argument coherent in the sense that the conclusions follow reasonably from the specified assumptions? The logical and mathematical inferences that occur in practical arguments can be judged against the rules of logic or mathematics (including probability theory and inferential statistics) and thereby judged to be either valid or invalid in the sense that these terms are used in logic and mathematics. Inferences that are based on a theory can be judged in terms of their consistency with the theory, and if the theory is stated mathematically, the theory-based inferences can also be judged unambiguously.

However, in practical arguments, there are always some inferences that are not based on logic, or mathematics, or on any formal theory. These inferences are often specific to the particular discipline or area of practice in which the argument is being developed and tend to become codified in textbooks and journals. For example, the Standards for Educational and Psychological Testing (AERA et al., 1985) provides a summary of many acceptable types of inferences for testing.

Criterion 3: Plausibility of Assumptions

Are the assumptions used in the argument inherently plausible or supported by evidence? It is not possible to prove that the assumptions used in practical arguments are true, but it is usually possible to develop some empirical evidence for doubtful assumptions. If no single type of evidence is decisive in evaluating an assumption, several types of evidence may be developed. The plausibility of an assumption is judged in terms of all of the evidence for and against it.

Some questionable assumptions may be checked directly (e.g., statistical inferences usually make distributional assumptions, and these can be checked empirically). Some assumptions may be supported by careful documentation and analysis of procedures (e.g., sampling assumptions). More general assumptions (e.g., that certain skills are being taught in a course) may be supported by several types of evidence (e.g., classroom observations, review of curriculum and teaching materials, and interviews with students).

An interpretive argument can be criticized for failing to meet any of these three criteria, but weak assumptions, especially weak "hidden" assumptions, are typically the most serious problem. A vague argument can be developed more fully. Errors in logic and mathematics can be corrected, and loose inferences can be made more explicit. Weak assumptions, once recognized, can be supported by evidence. However, because hidden assumptions are not recognized as part of the argument, no effort is made to support them with evidence, and they may not be very plausible a priori. One of the main reasons for stating the argument clearly and for examining the inferences in some detail is to identify the assumptions being made.

Parallel Lines of Evidence and Counterarguments

Parallel lines of evidence that support certain assumptions or parallel lines of argument that support certain conclusions play an important role in practical arguments. One may have high confidence in an assumption that is supported by several independent sources of evidence even though each source of evidence is questionable. Similarly, a conclusion that can be reached in several ways is less vulnerable than a conclusion that depends on a single line of argument. The use of multiple independent sources of evidence to support a conclusion is often referred to as triangulation. In formal arguments, a second line of argument does not add anything to our confidence in a conclusion once the conclusion is proven. In practical arguments, redundancy can be a virtue.

The identification and refutation of plausible counterarguments can be a particularly effective way to reinforce practical arguments. If the potential counterarguments can be shown to be implausible, confidence in the initial argument is increased. By contrast, in formal systems, once a proposition has been proven, there is no need to disprove its negation.

To sum up, practical arguments may have some inferences and assumptions that can be evaluated unambiguously. Confidence in other inferences and assumptions depends on the accumulation of various kinds of evidence, none of which is completely decisive. The plausibility of the argument as a whole is limited by its weakest assumptions and inferences. Therefore, it is important to identify the assumptions being made and to provide supporting evidence for the most questionable of these assumptions.

Evaluating the Interpretive Argument

Interpretive arguments are practical arguments, and the criteria for evaluating an interpretive argument are the same as the
criteria for evaluating any practical argument. The argument should be stated clearly so that what it claims and what it assumes are known. The argument should be coherent in the sense that the conclusions are reasonable, given the assumptions. The assumptions should be plausible a priori or supported by evidence. Parallel lines of evidence should be developed whenever this is possible, and plausible counterarguments should be considered.

The details of the interpretive argument depend on the specific interpretation being proposed, the population to which the interpretation is applied, the specific data collection procedures being used, and the context in which measurement occurs. The particular mix of evidence needed to support the interpretive argument will be different for each case. The remainder of this section describes several categories of inferences that appear regularly in interpretive arguments, the assumptions associated with these inferences, and the evidence that can support these inferences and assumptions.

Observation

Perhaps the most basic inference in interpreting a score is that the score results from an instance of the measurement procedure. This inference assumes that the methods used to assign the score were consistent with the definition of the measurement procedure. The evidence supporting this kind of inference is procedural. Some parts of the measurement procedure may be specified exactly. In attitude and personality inventories and in some tests, the questions and scoring keys are often standardized. The instructions given to examinees and the methods used to generate scores from the raw data are also specified in detail. It is assumed that these standardized procedures are followed exactly.

Other conditions of observation are specified by general guidelines rather than being uniquely determined. If responses are to be interpreted or evaluated by a scorer, the general qualifications of the scorer will be specified, usually in terms of training, experience, or knowledge. It is assumed that data collection follows these general guidelines.

Some conditions of observation are not explicitly limited and become relevant only if they are extreme. For example, the environment in which data are collected will typically be specified only in the most general terms, if at all, but can become an issue if these conditions are extreme (e.g., very hot, cold, or noisy). The existence of certain extreme conditions of observation may constitute a plausible competing interpretation for low scores.

Procedural evidence does not go very far in establishing the plausibility of an interpretive argument. However, it can be decisive in refuting an interpretive argument. If the procedures have not been followed correctly (e.g., the wrong scoring key was used, the sample of scores used to generate norms is inappropriate) or if the procedures themselves are clearly inadequate (e.g., no training for raters who are called on to make complex decisions), the interpretive argument would be effectively undermined.

Generalization

Most, if not all, test-score interpretations involve generalization from the specific observations being made to a broader universe of similar observations. In interpreting a test score, statements typically are not limited to a specific time, a specific place, a specific set of items, or a specific scorer. In reporting results with sentences such as, "John got a 60 on the test" rather than the more cumbersome statement, "John got a 60 on Form A of the test that he took on May 6, in Room 201, and that was scored by Professor Jones," one is implicitly assuming that the particular time and place of testing, the choice of scorer, and the specific form of the test are not relevant to the interpretation. The observations are treated as if they have been sampled from some universe of observations, involving different occasions, locations, and observers that could have served equally well; in generalizing over conditions of observation, one draws conclusions about the universe of possible observations on the basis of a limited sample of actual observations. The assumptions supporting such inferences are invariance laws stating that the conditions of observation involved in the measurement can be allowed to vary along certain dimensions without changing the outcomes much (Kane, 1982); the results are largely invariant with respect to changes in the conditions of observation (within the limits imposed by the measurement procedure specifications).

The evidence needed to support assumptions about invariance is collected in reliability studies (Feldt & Brennan, 1989) or generalizability studies (Brennan, 1983; Cronbach, Gleser, Nanda, & Rajaratnam, 1972), which indicate how consistent scores are across different samples of observation (e.g., across samples of items, occasions). Reliability is a necessary condition for validity because generalization is a key inference in interpretive arguments, but it is not a sufficient condition because generalization is not the only inference in the argument.

Extrapolation

Most interpretive arguments also involve extrapolation; conclusions are drawn about behavior that is different in potentially important ways from that observed in the testing procedure. Scores on a reading test are interpreted as indicating the ability to comprehend a variety of written materials in a variety of contexts, even though the test may consist of discrete, multiple-choice items administered in one testing session. The use of test scores as an indication of nontest behavior assumes that the relationship between the scores and the target behavior is understood fairly well (Cronbach, 1982; Kane, 1982). The extrapolation may be based on fairly loose notions of similarity or on a detailed analysis of the specific processes used by examinees in responding in the two situations (Snow & Lohman, 1984, 1989).

In addition to qualitative analyses of the relationship between the behavior actually observed and the nontest behavior to which inferences are drawn, extrapolation can also be supported by empirical evidence showing a relationship between the test performance and nontest behavior. Criterion-related validity evidence seeks to establish a direct link between test behavior and nontest behavior.

Theory-Based Inferences

Essentially all interpretations also involve, at least implicitly, some theory-based inferences involving possible explanations
or connections to other constructs (Cronbach & Meehl, 1955; Embretson, 1983; Messick, 1988, 1989). Some interpretations are primarily theory based, in that the observations in the measurement procedure are of interest mainly as indicators of unobservable constructs. However, even when the focus of the interpretation is more practical than theoretical, theory has a role in the interpretive argument.

Two kinds of formal theories have been widely discussed in relation to validity: nomological theories and process models (Embretson, 1983). If the construct being measured is interpreted in terms of its relationship to other constructs in a nomological theory, evidence comparing the observed pattern of relationships (based on the test scores and accepted measures of other constructs) with that predicted by the theory would be relevant to the plausibility of the interpretive argument. In addition, any other evidence supporting the theory would also support the theory-based interpretive argument. Nomological theories figured prominently in the original description of construct validity (Cronbach & Meehl, 1955) and are the basis for what Embretson (Whitely) (1983) called nomothetic span.

A second kind of theory involves the development of process models for test behavior (Pellegrino, 1988; Sternberg, 1985). If the process model provides an explanatory interpretation of test scores, the corresponding interpretive argument would incorporate the process model. To the extent that important aspects of test behavior (e.g., accuracy and speed) can be explained in terms of postulated component processes, the interpretive argument and therefore the interpretation of test scores in terms of these processes are supported. Embretson (1983) uses the term construct representation to describe the role of process models.

In many cases, the theories that are used, implicitly or explicitly, as the basis for explaining test scores are neither nomological nor process models but are, rather, loose collections of general assumptions. In such cases, validity evidence that is based on the theory is more effective in ruling out certain explanations than it is in establishing a particular theory-based interpretation. For example, analyses of multitrait-multimethod matrices can undermine proposed interpretations by showing that score differences are attributable largely to method variance (Campbell & Fiske, 1959). Similarly, evidence of racial or sex bias can cast doubt on a variety of interpretations (Cole & Moss, 1989).

Different possible assumptions about the processes involved in responding to test items are often the basis for competing interpretations. For example, Nedelsky (1965, p. 152) has suggested that items on science achievement tests must present novel situations-problems if they are to measure comprehension rather than simple recall. Of course, as Cronbach (1971) has pointed out:

An item qua item cannot be matched with a single behavioral process. Finding the answer calls for dozens of processes, from hearing the directions to complex integration of ideas. The shorthand description in terms of a single process is justified when one is certain that every person can and will carry out all the required processes save one. (p. 453)

Collateral assumptions are clearly necessary if inferences about cognitive processes are to be drawn. However, analyses of content and procedures can effectively rule out some interpretations.

The evidence needed to support theory-based inferences depends on the theory. Because different kinds of theories can be used and because any given theory may be supported by different kinds of evidence, a wide range of different kinds of empirical evidence may be relevant to theory-based inferences (Messick, 1989).

Decisions

Most tests are also linked to some decision. If the test scores were not relevant to any decision, it is not clear why the test would be given. The legitimacy of test use rests on assumptions about the possible outcomes (intended and unintended) of the decision to be made and on the values associated with these different outcomes (Guion, 1974; Messick, 1975, 1980, 1981, 1988, 1989).

Technical Inferences

There are also a number of more technical inferences that frequently appear in interpretive arguments. For example, if scores on different forms of a test are equated, the inference from a score on the current form to the estimated score on the original form rests on statistical assumptions in the equating models and on assumptions about the appropriateness of these models (Holland & Rubin, 1982). If item-response models are used, assumptions about the fit of the model to the data are needed (Hambleton, 1989; Lord, 1980).

Summary

If the evidence for validity is to support the interpretive argument effectively, it must reflect the structure of the interpretive argument. Many different types of inferences appear in interpretive arguments, and each of these inferences rests on assumptions that provide justification for the inference. The acceptance of a number as a test score assumes that the number was assigned using certain procedures. Generalizations from a sample of behavior to some domain of behaviors rest on assumptions about the invariance of observed scores for different conditions of observation. Extrapolations are based on assumptions about the relationship between the behavior actually observed and the behavior to which the results are being extrapolated. Any theory-based inferences assume that the theory is credible. Decisions that are based on test scores make assumptions about the desirability of various kinds of outcomes, that is, about values.

Validity evidence is most effective when it addresses the weakest parts of the interpretive argument. Evidence that provides further support for a highly plausible assumption does not add much to the overall plausibility of the argument. The most questionable assumptions deserve the most attention. An assumption can be questioned because of existing evidence indicating that it may not be true, because of plausible alternative interpretations that deny the assumption, because of specific objections raised by critics, or simply because of a lack of supporting evidence.
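The generalization inference summarized above is typically backed by reliability or generalizability coefficients. As a minimal illustrative sketch (not from the article; the function and data below are hypothetical), Cronbach's alpha, a common internal-consistency estimate, can be computed from an item-by-examinee score matrix:

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Estimate internal-consistency reliability (Cronbach's alpha).

    `scores` is a list of examinee rows, one numeric score per item.
    Alpha is a common lower-bound estimate of how consistently total
    scores would generalize across different samples of items.
    """
    k = len(scores[0])  # number of items
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical right/wrong (1/0) responses of six examinees to four items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(scores), 2))  # → 0.74
```

A coefficient near 1 suggests that scores are consistent across samples of items; as the text notes, such evidence is necessary for validity but not sufficient, because generalization is only one inference in the interpretive argument.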

Example

Suppose college students can take either a calculus course or a remedial algebra course designed for students who are not adequately prepared to take calculus. Assume further that an algebra placement test is going to be used to assign students to one of these two courses (Cronbach & Gleser, 1965; Frisbie, 1982; Sawyer, 1989; Willingham, 1974).

On one level, the interpretation is quite simple. The placement test scores are interpreted as measures of competence in algebra and as measures of readiness for the regular calculus course. However, even in this simple case, the interpretive argument is not trivial. To keep the length of the example within bounds, the discussion focuses on the general flow of the argument and the kinds of evidence that might be used to support various inferences and the assumptions on which they are based and omits many important details (e.g., the methodology used to select a cutoff score).

The Test as a Measure of Prerequisite Skills in Algebra

The first part of the interpretive argument involves inferences from test scores to statements about level of skill in the prerequisite domain. This part of the argument rests on four main assumptions:

Assumption 1. Certain algebraic skills are prerequisites for the calculus course in the sense that these skills are used extensively in the calculus course, and students who lack these skills are likely to have great difficulty in the calculus course.

For anyone familiar with the content of a typical first course in calculus, this assumption is likely to be considered self-evident. However, the assumption might not hold in a particular case if, for example, the necessary algebraic skills are taught during the calculus course.

One way to support this assumption would be to conduct a detailed analysis of the content and methods of instruction in the calculus course. This analysis would identify the specific algebraic skills used in the calculus course and, ideally, would assign weights to different parts of algebra that are based on their relative importance for the calculus course. The result of this analysis would be a well-defined target domain of content for the placement test.

Assumption 2. The content domain of the placement test matches the target domain of algebraic skills used in the calculus course.

An evaluation of this assumption would typically involve judgments about the correspondence between the test specifications and the target domain and about the match between the actual test items and the test specifications.

Assumption 3. Scores on the test are generalizable across samples of items, scorers, and occasions.

The interpretive argument being developed for the placement test relates a student's level of skill in the target domain to expected performance in the calculus course. It assumes, therefore, that the placement scores do not depend very much on the specific set of items, specific occasions, specific scorers, and so on used to generate them and therefore can be generalized to a universe of possible observations. Evidence supporting assumptions about the invariance of placement scores over items, occasions, and scorers could be derived from generalizability studies (Brennan, 1983; Cronbach et al., 1972).

Assumption 4. There are no sources of systematic error that would bias the interpretation of the test scores as measures of skill in algebra.

The investigation of this assumption would require the identification of possible sources of bias and the evaluation of their impact on the scores. For example, if the students do not understand the purpose of the test or are unmotivated for some other reason, have trouble understanding the instructions or reading the items (possibly because English is a second language), or have trouble responding (e.g., because of a physical handicap), the scores might not fully reflect their level of skill in algebra. The most serious threats to the interpretation could be investigated empirically using methods and data appropriate for each specific alternative explanation.

The evidence relevant to Assumptions 3 and 4 addresses a number of potential counterinterpretations. The evidence for Assumption 3 addresses the potential impact of various sources of random error associated with the sampling of conditions of observation. The evidence for Assumption 4 addresses various potential sources of systematic error, or bias.

If we accept Assumptions 1-4, it would be reasonable to accept the claim that scores on the placement test can be interpreted as measures of achievement in a domain of algebra skills needed in the calculus course. The algebraic skills used in the calculus course have been identified, these skills have been incorporated in the placement test, the generalizability of the test scores has been evaluated, and the most likely sources of bias of test scores have been ruled out.

Appropriate Placement for Students With Low Placement Scores

The second part of the argument claims that students with low scores on the placement test (i.e., below some cutoff score) will do substantially better in the calculus course if they take the remedial course before taking the calculus course.

This claim could be examined directly by having half of the students with low scores take the remedial course before the calculus course and having the other half go straight into the calculus course. The achievement of these two groups in the calculus course could then be compared. This direct approach obviously requires the existence of a measure of success in the calculus course.

Assumption 5. An appropriate measure of success in the calculus course is available.

Course grades are the usual measure of success in courses. If the course grades are considered inadequate (e.g., too unreliable), a new measure of achievement in the calculus course could be developed.

Of course, it is often not possible to conduct a study that is based on random assignment of students. The number of students with low scores on the placement test may be too small to produce statistically reliable results, or a random assignment of students to the two treatments may not be feasible for various reasons, including the obvious ethical considerations.

An alternate line of argument depends on the claim, which is based on Assumptions 1-4, that the placement test measures
algebraic skills that are prerequisites for the calculus course. If the skills assessed by the placement test are prerequisites, students with low scores on the placement test are likely to have trouble achieving success in the calculus course. With an additional assumption about the effectiveness of the remedial course (see Assumption 6 below), it is reasonable to infer that the students with low placement scores will have substantially better prospects in the calculus course if they first complete the remedial course.

Assumption 6. The remedial course is effective in teaching the algebraic skills used in the calculus course.

Evidence supporting this assumption could be obtained by giving parallel forms of the placement test to students before and after they take the remedial course or by collecting data on how well students who take the remedial course and students who do not take the remedial course subsequently do in calculus. In addition, the performance of students who have taken the remedial course could be monitored (e.g., in a series of structured interviews) to determine whether the remedial course was helpful in preparing these students for the calculus course.

Appropriate Placement for Students With High Placement Scores

The third part of the argument claims that taking the remedial course will not substantially improve performance in the calculus course for students with high scores on the placement test (i.e., above the cutoff score). If the remedial course were as helpful to students with high placement scores as it was to students with low placement scores, it would make sense to place all students in the remedial course.

It is usually not possible to check this inference empirically. It would require students who already have the required algebraic skills to take a course that seemed unlikely to provide much benefit to them. Rather, we make the following reasonable assumption.

Assumption 7. Students with a high level of skill in algebra would not substantially improve these skills in the remedial course and therefore would not substantially improve their chances of success in the calculus course.

This assumption is likely to be accepted without direct empirical evidence.

Together, the second and third parts of the argument constitute a special case of what Cronbach and Snow (1977) have called an Aptitude × Treatment interaction. If these two parts of the argument hold, we have a reasonable basis for asserting that the use of the placement test with the two courses is more effective in helping students to succeed in calculus than simply placing all students directly in the calculus course or placing all students in the remedial course.

Additional Assumptions

The appropriateness of the placement system also rests on more fundamental assumptions. For example, we are tacitly assuming that the use of a placement system is preferable to a redesign of the regular course so that the pace or sequence of instruction is flexible enough to accommodate all students. We are also assuming that minimizing the number of students who fail the regular course is a sufficiently important goal that it merits the commitment of substantial resources (i.e., a second course and the time and money required for placement testing).

Moral of the Story

This example makes three major points. First, the interpretive argument has been stated in some detail. To investigate the plausibility of an interpretive argument, it is necessary to be clear about the structure of the argument, about the major assumptions in the argument, and about possible alternative assumptions and possible counterarguments. Some assumptions may be accepted as plausible on the basis of experience, without conducting empirical studies, but it is dangerous to accept assumptions as plausible without being clear about what they claim.

Second, the evidence that is most relevant to the validity of the proposed interpretation depends on the details of the interpretive argument. Validation of the interpretive argument for the placement test requires many different kinds of evidence because the seven assumptions included in the argument require different kinds of evidence. The evidence for a proposed interpretation should support the specific interpretive argument associated with the interpretation.

Third, the structure of the interpretive argument and the assumptions built into the argument reflect choices about how to interpret the placement test scores. The interpretive argument for the placement test could have been developed along different lines if we had chosen to interpret the scores differently. As one alternative, instead of specifying the prerequisite skills in terms of skill in solving certain kinds of algebra problems, one could adopt a process model for the algebraic skills used in calculus, design the placement test to measure performance on the component processes, and design the remedial course to develop skill in using these processes. Given the current level of development of process models for algebra (Booth, 1988; Larkin, 1989; Matz, 1980), this approach would involve a far more ambitious undertaking than the kind of content analysis required to support Assumptions 2 and 3, but it is potentially more rewarding in terms of our ability to design effective instruction in algebra and calculus. The argument-based approach does not specify a particular form for the interpretive argument; however, it does require that the argument, whatever it claims, should be stated clearly and should be evaluated using appropriate evidence.

Characteristics of Interpretive Arguments

Interpretive arguments have four general characteristics that are especially relevant to validity: (a) Interpretive arguments are artifacts. They are made, not discovered. (b) Interpretive arguments are dynamic; they may expand or contract or simply shift their focus. (c) Interpretive arguments may need to be adjusted to reflect the needs of specific examinees or special circumstances. (d) Interpretive arguments are practical arguments, which are evaluated in terms of their degree of plausibility and not in terms of a simple valid or invalid decision.

Interpretive arguments are artifacts. The interpretation that
is assigned to the test scores is not uniquely determined by the observations being made. The possible interpretations for any set of test scores vary along several dimensions, including their focus and their level of abstraction. For example, a test involving passages followed by questions about the passage could be interpreted simply as a measure of skill at answering passage-related questions, or as a measure of reading comprehension defined more broadly, or as one indicator of verbal aptitude, or as an indicator of some more general construct, such as intelligence. These different interpretations necessarily involve different interpretive arguments.

Because the procedures used to obtain a score do not uniquely determine the interpretation, the interpretation must be assigned to the test score. Someone or some group decides on the interpretation to be given to the reading comprehension scores. The mathematics placement test discussed earlier was interpreted as a measure of readiness for the regular calculus course, because this was how it was to be used.

Specifying the associated interpretive argument is of fundamental importance in evaluating the validity of the interpretation. One validates the interpretation by evaluating the plausibility of the interpretive argument, and some possible interpretive arguments are more plausible than others. In the example given above, the interpretation of the scores in the reading comprehension test in terms of skill at answering passage-related questions is likely to be more solid (although perhaps less interesting) than interpretations involving more general constructs, because the more limited interpretation is likely to make fewer assumptions.

Therefore, an important first step in any effort to validate the interpretive argument is to state the argument clearly. The argument may be changed later, perhaps as a result of validation research, but if the effort to check on the assumptions and inferences in the interpretive argument is to make much progress, the effort needs to begin by specifying the details of the argument. An analogous point is made within the context of generalizability theory, where the importance of explicitly defining the universe of generalization proposed for test scores is emphasized (Brennan, 1983; Cronbach et al., 1972; Kane, 1982).

Interpretive arguments are dynamic. As new information becomes available, the interpretive argument may expand to include new types of inferences. Empirical results may support generalization to a wider domain or extrapolation to a new domain. Conversely, new results may refute assumptions that supported part of an interpretive argument, thus forcing a narrower interpretation. Society's priorities and values may change, leading to changes in how test scores are used.

As research proceeds, deeper or more sophisticated explanations for the test scores may be developed. For example, a process model describing how students solve algebra problems could greatly expand the scope and depth of the interpretation given to our placement test. Similarly, the development of new theoretical approaches to reading is bound to influence our interpretation of scores on a reading comprehension test.

The malleability of interpretations can make validation more difficult or easier. A changing interpretation presents the validator with a moving target. However, it may also be possible to make some adjustments in the intended interpretation, on the basis of validity data. That is, sometimes the case for the validity of the interpretive argument can be strengthened by changing some inferences and assumptions to fit the data. I argue later that one possible criterion for evaluating validation research is the extent to which the research improves both the interpretation (by making it clearer, more solidly based, and more accurate) and the test (by eliminating flaws and sources of error).

The interpretive argument may need to be adjusted to reflect the needs of specific examinees or special circumstances that might have an impact on the test scores. The general version of the interpretive argument cannot take explicit account of all of the special circumstances that might affect an examinee's performance. In applying the argument in a specific case, it is assumed that the examinee is drawn from an appropriate population and that there are no circumstances that might alter the interpretation.

Adjustments in the interpretive argument may need to be made for subpopulations and for individuals. For example, within the subpopulation of examinees with a certain handicap, the interpretive argument may need to be adjusted to reflect the impact of the handicap (see Willingham, 1988). If testing procedures are adjusted to accommodate the needs of a handicapped student, it may be necessary to add evidence supporting the comparability of scores obtained under special testing procedures (Willingham, 1988). The general form of the interpretive argument may also need to be modified for individual examinees to reflect special circumstances (e.g., illness or lack of motivation).

Interpretive arguments make many assumptions that are plausible under ordinary circumstances (e.g., that examinees can hear instructions that are read to them) but that may be questionable for specific examinees (e.g., hearing-impaired examinees) or under special circumstances (a noisy environment). The assignment of an interpretation to a specific test score is an instantiation of the general form of the interpretive argument. The plausibility of the resulting specific interpretive argument depends on the reasonableness of the general form of the interpretive argument and on the extent to which the interpretive argument applies to the specific situation under consideration.

Interpretive arguments are practical arguments, which are evaluated in terms of their degree of plausibility. Initially, the intended interpretation is quite likely to be stated in very general terms, for example, in terms of reading comprehension or readiness for a particular course. The interpretive argument is then correspondingly loose. The interpretive argument may be made more explicit over time, but even the most highly developed interpretive arguments do not attain the precision of mathematical derivations. Interpretive arguments are practical arguments, and their evaluation does not involve a simple valid or invalid decision, as it might in logic or mathematics. The evaluation is necessarily judgmental, leading to conclusions about the plausibility of the interpretive argument rather than a simple yes or no decision.

In general, then, interpretive arguments are artifacts, they change with time, they may need to be modified for particular examinees or circumstances, and they are more-or-less plausible.
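The Aptitude × Treatment interaction invoked earlier in the placement example has a simple empirical signature: the benefit of the remedial course should be substantial for students below the cutoff and negligible for students above it. The following sketch illustrates the comparison as a difference of group-mean differences. All scores, group labels, and function names here are invented for illustration; they are not data or code from any actual placement study.

```python
from statistics import fmean

# Hypothetical calculus outcomes (e.g., final course scores) for four groups
# defined by placement-test level (low = below cutoff, high = above cutoff)
# and treatment (remedial course first vs. direct entry into calculus).
outcomes = {
    ("low", "remedial"): [72, 68, 75, 70],
    ("low", "direct"): [55, 60, 52, 58],
    ("high", "remedial"): [85, 88, 84, 86],
    ("high", "direct"): [86, 87, 85, 88],
}

def remedial_benefit(level):
    """Mean advantage of taking the remedial course first, at one aptitude level."""
    return fmean(outcomes[(level, "remedial")]) - fmean(outcomes[(level, "direct")])

# The interaction claimed by the argument: a large benefit for low scorers
# (Assumption 6) and essentially none for high scorers (Assumption 7).
low_benefit = remedial_benefit("low")
high_benefit = remedial_benefit("high")
interaction = low_benefit - high_benefit
print(low_benefit, high_benefit, interaction)  # 15.0 -0.75 15.75
```

With real data one would fit a model with an aptitude-by-treatment interaction term and attend to sampling error, but the quantity of interest is this same difference of differences.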
Advantages of an Argument-Based Approach to Validation

The argument-based approach to validity is basically quite simple. One chooses the interpretation, specifies the interpretive argument associated with the interpretation, identifies competing interpretations, and develops evidence to support the intended interpretation and to refute the competing interpretations. The amount of evidence and the types of evidence needed in a particular case depend on the inferences and assumptions in the interpretive argument.

The argument-based approach offers several advantages. First, it can be applied to any type of test interpretation or use: It is highly tolerant. It does not preclude the development of any kind of interpretation or the use of any data collection technique. It does not identify any kind of validity evidence as being generally preferable to any other kind of validity evidence. It does require that the interpretive argument be stated as clearly as possible and that the validity evidence should address the plausibility of the specific interpretive argument being proposed.

Second, although the evaluation of an interpretive argument does not lead to any absolute decision about validity, it does provide a way to gauge progress. As the most questionable inferences and assumptions are checked and either are supported by the evidence or are adjusted so that they are more plausible, the plausibility of the interpretive argument as a whole can improve.

Third, the approach may increase the chances that research on validity will lead to improvements in measurement procedures. To the extent that the argument-based approach focuses attention on specific parts of the interpretive argument and on specific aspects of measurement procedures, evidence indicating the existence of a problem (e.g., inadequate coverage of content or the presence of some form of systematic error) may also suggest ways to solve the problem and thereby to improve the procedure.

The argument-based approach to validity is similar to what Cronbach (1989) called the strong program of construct validation: "a construction made explicit, a hypothesis deduced from it, and pointedly relevant evidence brought in" (p. 162). The term argument-based approach to validity has been used here instead of construct validity or the strong program of construct validity to emphasize the generality of the argument-based approach, applying as it does to theoretical constructs as well as to attributes defined in terms of specific content or performance domains. The term construct validity has often been associated with theory-based interpretations (Cronbach & Meehl, 1955). Interpretive arguments may be, but do not have to be, associated with formal theories.

The expression argument-based approach offers some advantages. It is an approach to validity rather than a type of validity. The term argument emphasizes the existence of an audience to be persuaded, the need to develop a positive case for the proposed interpretation, and the need to consider and evaluate competing interpretations.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Booth, L. R. (1988). Children's difficulties in beginning algebra. In A. F. Coxford (Ed.), The ideas of algebra, K-12 (pp. 20-32). Reston, VA: National Council of Teachers of Mathematics.

Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: American College Testing.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.

Cole, N. S., & Moss, P. A. (1989). Bias in test use. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 201-219). New York: American Council on Education/Macmillan.

Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443-507). Washington, DC: American Council on Education.

Cronbach, L. J. (1982). Designing evaluations of educational and social programs. San Francisco: Jossey-Bass.

Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. Braun (Eds.), Test validity (pp. 3-17). Hillsdale, NJ: Erlbaum.

Cronbach, L. J. (1989). Construct validation after thirty years. In R. L. Linn (Ed.), Intelligence: Measurement, theory, and public policy (pp. 147-171). Urbana: University of Illinois Press.

Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions. Urbana: University of Illinois Press.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.

Cronbach, L. J., & Snow, R. E. (1977). Aptitudes and instructional methods. New York: Irvington.

Embretson (Whitely), S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179-197.

Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105-146). New York: American Council on Education/Macmillan.

Frisbie, D. A. (1982). Methods of evaluating course placement systems. Educational Evaluation and Policy Analysis, 4, 133-140.

Guion, R. M. (1974). Open a window: Validities and values in psychological measurement. American Psychologist, 29, 287-296.

Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 147-200). New York: American Council on Education/Macmillan.

Holland, P. W., & Rubin, D. B. (1982). Test equating. San Diego, CA: Academic Press.

House, E. R. (1980). Evaluating with validity. Beverly Hills, CA: Sage.

Kane, M. T. (1982). A sampling model for validity. Applied Psychological Measurement, 6, 125-160.

Larkin, J. H. (1989). Robust performance in algebra: The role of the problem representation. In S. Wagner & C. Kieran (Eds.), Research issues in the learning and teaching of algebra (pp. 120-134). Reston, VA: National Council of Teachers of Mathematics.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Matz, M. (1980). Towards a computational theory of algebraic competence. Journal of Mathematical Behavior, 3, 93-166.

Messick, S. (1975). The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 30, 955-966.

Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012-1027.
Messick, S. (1981). Evidence and ethics in the evaluation of tests. Educational Researcher, 10, 9-20.

Messick, S. (1988). The once and future issues of validity: Assessing the meaning and consequences of measurement. In H. Wainer & H. Braun (Eds.), Test validity (pp. 33-45). Hillsdale, NJ: Erlbaum.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: American Council on Education/Macmillan.

Nedelsky, L. (1965). Science teaching and testing. New York: Harcourt, Brace & World.

Pellegrino, J. W. (1988). Mental models and mental tests. In H. Wainer & H. Braun (Eds.), Test validity (pp. 49-59). Hillsdale, NJ: Erlbaum.

Perelman, C., & Olbrechts-Tyteca, L. (1969). The new rhetoric: A treatise on argumentation. Notre Dame, IN: University of Notre Dame Press.

Sawyer, R. (1989). Validating the use of ACT Assessment scores and high school grades for remedial course placement in college (ACT Research Rep. No. 89-4). Iowa City, IA: American College Testing.

Snow, R. E., & Lohman, D. F. (1984). Toward a theory of cognitive aptitude for learning from instruction. Journal of Educational Psychology, 76, 347-376.

Snow, R. E., & Lohman, D. F. (1989). Implications of cognitive psychology for educational measurement. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 263-331). New York: American Council on Education/Macmillan.

Sternberg, R. J. (1985). Human abilities: An information processing approach. New York: W. H. Freeman.

Toulmin, S., Rieke, R., & Janik, A. (1979). An introduction to reasoning. New York: Macmillan.

Webster's new collegiate dictionary (9th ed., rev.). (1989). Springfield, MA: Merriam-Webster.

Willingham, W. W. (1974). College placement and exemption. New York: College Entrance Examination Board.

Willingham, W. W. (1988). Testing handicapped people—The validity issue. In H. Wainer & H. Braun (Eds.), Test validity (pp. 89-103). Hillsdale, NJ: Erlbaum.

Received January 2, 1991
Revision received November 21, 1991
Accepted November 25, 1991
