Multivariate Variance Component Analysis: An Application in Test Development

Nicholas T. Longford
Key words: latent trait, multivariate data, psychometric test, variance components
The author has benefited from discussions with Charlie Lewis. Comments of an
associate editor and of two referees have helped to improve the presentation.
Secretarial services of Liz Brophy, ETS, are acknowledged.
Units at different levels of the hierarchy may have different correlation structures of the true subscores (their means).
This paper presents a multivariate variance component analysis of data from the pilot stage of development of a test instrument intended to provide colleges, or more generally groups of students, with information about several aspects of general education outcomes. The analysis addresses issues of discriminant validity of the subtests and other statistical issues that arise in the development of a test instrument with a complex test specification. The sections of the paper are organized as follows: details of the studied test instrument and a statement of the principal research problem, an overview of variance component models, a definition of the substantively important dimension of the latent trait, data and procedures, results, and conclusions and summary.
The Test Instrument
The test instrument, in this paper referred to by the acronym GENED,
involves a matrix design, as depicted in Figure 1. There are three forms of
the test, each with 48 items. The items within a form are cross-classified into
three content areas (humanities, social sciences, and natural sciences; the
rows in Figure 1) and four academic skills (reading, writing, critical think-
ing, and mathematics; the columns in Figure 1). The numbers in Figure 1
represent the numbers of items within each cell of one of the three forms
of the test, and the row and column subtotals. Of the 48 items, there are
16 in each row and 12 in each column; however, the numbers of items within
cells vary both within a form and from form to form. Each item has four
alternative responses, one of which is correct.
The first stage of pretesting of the instrument took place in the academic year 1986-87 and involved about 1,000 students from 12 colleges. The original pool of 240 items was reduced to 144 items, and several items were reviewed and amended. The main purpose of the second stage of pretesting, the pilot year (1987-88), was to establish the content and construct validity of the test and of its components (skills and content areas), to decide what kinds of comparisons of college-mean scores can be made, and to settle the format of the report for participating colleges.
The test can be administered in two ways. The short form involves 1 hour
of testing time, and each student has to respond to 48 items of one form.
The three forms are spiralled (in the order A, B, C, A, B,...) so as to
ensure approximately equal numbers of students and equal distribution of
ability for all three forms within colleges. Note that the three forms are
unlikely to be exactly parallel, because that would mean that they have
matching characteristics either for each cell or at least for the three rows
and the four columns of the design. However, because individual examinee
scores are not reported, this stringent condition can be dispensed with.
                     College-level  College-level  Critical    Using mathe-   Content subscore
                     reading        writing        thinking    matical data   (reported)
Humanities                 5              5             4            2              16
Social sciences            4              4             4            4              16
Natural sciences           3              3             4            6              16
Column total              12             12            12           12              48

FIGURE 1. An example of the design of a form of the GENED test (the cells contain
the numbers of items; the exact number of items per cell varies from form to form)
The long form involves 3 hours of testing time, in which each student is
administered three forms of the test in three separate 1-hour sessions. The
order of administration of the forms within the long form is balanced to
enable inference about fatigue and practice effects. The score report for the
long form administration will contain both individual students' scores and
college-average scores and subscores.
Results of the short form administration are intended for colleges, or
groups of students, only. The college-average number of correct item re-
sponses would be reported for the entire test, for the three subject areas,
and for the four skills. A suitable minimum number of students per college, for whom meaningful scores and subscores could be provided, has to be recommended. Alternatively, the expected standard errors associated with these scores, expressed as functions of the number of students in a college, would suffice. Colleges may administer the test to a random sample of students, in which case the choice of the sample size is important.
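For a single subscore, the standard error of a college-mean score based on $n$ examinees is approximately $\sqrt{\omega_W/n}$, where $\omega_W$ denotes the within-college variance of the subscore. The following sketch tabulates this relationship for a hypothetical value of $\omega_W$ (the notation and the numerical value are illustrative, not estimates from the paper):

import numpy as np

# Standard error of a college-mean subscore as a function of the number of
# examinees n; omega_W is a hypothetical within-college variance.
omega_W = 5.0

for n in (25, 50, 100, 200, 400):
    se = np.sqrt(omega_W / n)      # SE of the observed college mean about the true college mean
    print(f"n = {n:4d}   SE of college mean = {se:.3f}")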
Another important point is to decide which subscore means to report,
and in what form. The subscore means are highly correlated, and therefore
it is more informative to report certain contrasts. Suppose two colleges have
substantially different values of a given contrast of true subscore means.
Then this contrast of subscore means is declared to be worth reporting, for
colleges with n examinees each, if in a future administration with n exam-
inees from each college the observed subscore means are very likely to be
significantly different, using the conventional t test.
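This criterion amounts to a routine power calculation. The sketch below uses a normal approximation to the two-sample $t$ test; the true difference in the contrast and the within-college variance of the contrast are hypothetical values chosen only to illustrate the computation:

import numpy as np
from scipy.stats import norm

# Probability that the observed contrast means of two colleges, each with n
# examinees, differ significantly (two-sided alpha), given a true difference
# `delta` and a within-college variance `within_var` of the contrast.
def power_of_contrast(delta, within_var, n, alpha=0.05):
    se_diff = np.sqrt(2.0 * within_var / n)        # SE of the difference of two college means
    z_crit = norm.ppf(1.0 - alpha / 2.0)
    return norm.cdf(delta / se_diff - z_crit) + norm.cdf(-delta / se_diff - z_crit)

print(power_of_contrast(delta=0.5, within_var=3.0, n=100))   # illustrative values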
In the variance component model (1), the subscore of examinee $j$ in college $i$ for the cell $(k, h)$ is decomposed as

$$Y_{ijkh} = \mu_{kh} + a_{ikh} + b_{ijkh} + \varepsilon_{ijkh}. \qquad (1)$$

The term $\mu_{kh}$ represents the overall mean score for the cell $(k, h)$, the college-specific term $a_{ikh}$ represents the deviation of the true college mean score $\mu_{kh} + a_{ikh}$ from the overall mean score for the cell $(k, h)$, and the examinee term $b_{ijkh}$ stands for the deviation of the examinee's true score from the true mean score for his or her college. The random term $\varepsilon_{ijkh}$ is included to account for the aggregate of the random influences associated with momentary variation in human performance and recall, external influences, guessing, and so on, and for model inadequacy.

Differences in the difficulties as well as in the numbers of items in the cells of the three forms can be accommodated by form-specific parameters $\mu_{khf}$ ($f = 1, 2, 3$) in (1). We will assume, however, that the college and student deviations, $a_{ikh}$ and $b_{ijkh}$, are form-independent.
Because each cell contains only 2-6 items, the assumptions of normality
in (1) are difficult to justify, even after a transformation of the subscores.
We circumvent these issues by aggregating the subscores to the level of
content areas and skills. Because the development for content area and skill
subscores is completely analogous, I describe in detail only the latter.
For the skill subscores we have an analysis-of-variance type model,

$$Y_{ijk.} = M_{kf} + A_{ik} + B_{ijk} + \varepsilon_{ijk.}, \qquad (2)$$

where $M_{kf} = \sum_h \mu_{khf}$, $A_{ik} = \sum_h a_{ikh}$, $B_{ijk} = \sum_h b_{ijkh}$, and $\varepsilon_{ijk.} = \sum_h \varepsilon_{ijkh}$. The assumption of normality of these subscores, or of their suitable transformations, is now more palatable than in the model (1). Assume that the college terms $\mathbf{A}_i = (A_{i1}, A_{i2}, A_{i3}, A_{i4})$, the examinee terms $\mathbf{B}_{ij} = (B_{ij1}, B_{ij2}, B_{ij3}, B_{ij4})$, and the terms $\varepsilon_{ijk.}$ form three mutually independent random samples:

$$\mathbf{A}_i \sim N_4(\mathbf{0}, \Omega_A), \qquad \mathbf{B}_{ij} \sim N_4(\mathbf{0}, \Omega_B), \qquad \varepsilon_{ijk.} \sim N(0, \sigma^2). \qquad (3)$$
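A minimal simulation sketch of model (2) under the assumptions (3) makes the nesting of the random terms explicit; all parameter values below are arbitrary illustrations, not the estimates reported later in the paper:

import numpy as np

rng = np.random.default_rng(0)

# Four skill subscores per examinee, examinees nested within colleges.
n_colleges, n_per_college = 30, 100
M = np.array([7.0, 6.5, 6.0, 5.5])                 # fixed means M_k for one form (hypothetical)
Omega_A = 0.8 * np.ones((4, 4)) + 0.1 * np.eye(4)  # between-college covariance (hypothetical)
Omega_B = 2.0 * np.ones((4, 4)) + 2.5 * np.eye(4)  # between-examinee covariance (hypothetical)
sigma2 = 1.0                                       # elementary-level variance (hypothetical)

A = rng.multivariate_normal(np.zeros(4), Omega_A, size=n_colleges)
Y = np.empty((n_colleges, n_per_college, 4))
for i in range(n_colleges):
    B = rng.multivariate_normal(np.zeros(4), Omega_B, size=n_per_college)
    eps = rng.normal(0.0, np.sqrt(sigma2), size=(n_per_college, 4))
    Y[i] = M + A[i] + B + eps                      # Y_ijk = M_k + A_ik + B_ijk + eps_ijk

print(Y.mean(axis=(0, 1)))                         # overall subscore means, close to M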
The variance matrix for the four skill subscores of an examinee is equal to $\sigma^2\mathbf{I} + \Omega_B + \Omega_A$, and the conditional variance matrix, given the college-level random terms $\mathbf{A}_i$ (i.e., the within-college variance matrix), is $\Omega_B^W = \sigma^2\mathbf{I} + \Omega_B$. In the absence of any replication for the subscores the components $\sigma^2$ and $\Omega_B$ are confounded, and therefore $\sigma^2$ cannot be estimated without some data in addition to $\{Y_{ijk.}\}$.

An upper bound for $\sigma^2$ can be obtained from the eigenvalue decomposition of $\Omega_B^W$; we have $\sigma^2 \leq e_1$, where $e_1$ is the smallest eigenvalue of $\Omega_B^W$. If $\sigma^2 < e_1$, then $\Omega_B = \Omega_B^W - \sigma^2\mathbf{I}$ is of full rank; otherwise, the rank of $\Omega_B$ is equal to the number of eigenvalues greater than $e_1$. Thus, an estimate of $\Omega_B^W$ can provide an estimate of the upper bound for $\sigma^2$. A different "guesstimate" can be obtained by considering the variation associated with binomial outcomes (0/1 responses to 12 items); as a crude approximation for the upper bound for $\sigma^2$, we have $12 \times (\tfrac{1}{2})^2 = 3$. Note, however, that the responses contributing to a subscore are probably positively correlated, even after partialling out the college and examinee terms. Reliable inference about $\sigma^2$ could be obtained only from an experiment involving replication, embedded in the administration of the test.
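The eigenvalue bound can be computed directly from an estimate of $\Omega_B^W$. The sketch below uses a hypothetical matrix, not one of the estimates reported in the Results section:

import numpy as np

# Upper bound for sigma^2 from an estimated within-college variance matrix
# Omega_B_W = sigma^2 I + Omega_B (the matrix below is hypothetical).
Omega_B_W = np.array([[5.0, 2.5, 2.4, 2.0],
                      [2.5, 5.3, 2.4, 2.2],
                      [2.4, 2.4, 5.0, 2.1],
                      [2.0, 2.2, 2.1, 5.0]])

e = np.linalg.eigvalsh(Omega_B_W)                  # eigenvalues in ascending order
print("upper bound for sigma^2:", e[0])

sigma2 = 2.0                                       # an imputed value not exceeding e[0]
Omega_B = Omega_B_W - sigma2 * np.eye(4)
print("Omega_B positive definite:", bool(np.all(np.linalg.eigvalsh(Omega_B) > 0)))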
Because the skill subscores, or their averages, are highly correlated, it is advantageous to consider a nonsingular linear transformation of the subscores. I choose the transformation $\mathbf{A}_i^* = \mathbf{A}_i\mathbf{D}_4$ for the college-level random effects, so that the first component is the average, $A_{i1}^* = (A_{i1} + A_{i2} + A_{i3} + A_{i4})/4$, with

$$\mathbf{D}_4 = \begin{pmatrix} \tfrac{1}{4} & -1 & -1 & -1 \\ \tfrac{1}{4} & 1 & 0 & 0 \\ \tfrac{1}{4} & 0 & 1 & 0 \\ \tfrac{1}{4} & 0 & 0 & 1 \end{pmatrix},$$

and $\mathbf{B}_{ij}^* = \mathbf{B}_{ij}\mathbf{D}_4$ for the examinee-level random effects. I will refer to this transformation, given by the matrix $\mathbf{D}_4$, as the centered transformation (parameterization). The first components of $\mathbf{A}_i^*$ and $\mathbf{B}_{ij}^*$ are associated with the score average, and the others with the respective contrasts of writing, critical thinking, and mathematics versus reading. The variance matrices for these random effects are $\Sigma_A = \mathbf{D}_4'\Omega_A\mathbf{D}_4$ and $\Sigma_B = \mathbf{D}_4'\Omega_B\mathbf{D}_4$. The inverse transformation is given by the matrix

$$\mathbf{C}_4 = \mathbf{D}_4^{-1} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ -\tfrac{1}{4} & \tfrac{3}{4} & -\tfrac{1}{4} & -\tfrac{1}{4} \\ -\tfrac{1}{4} & -\tfrac{1}{4} & \tfrac{3}{4} & -\tfrac{1}{4} \\ -\tfrac{1}{4} & -\tfrac{1}{4} & -\tfrac{1}{4} & \tfrac{3}{4} \end{pmatrix}$$

($\mathbf{A}_i = \mathbf{A}_i^*\mathbf{C}_4$, $\mathbf{B}_{ij} = \mathbf{B}_{ij}^*\mathbf{C}_4$).
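A short sketch verifying the centered transformation and its inverse, and illustrating the change of parameterization $\Sigma_A = \mathbf{D}_4'\Omega_A\mathbf{D}_4$ (the matrix $\Omega_A$ used here is hypothetical):

import numpy as np

# The centered transformation: the first component is the subscore average and
# the remaining components are the contrasts of writing, critical thinking, and
# mathematics versus reading.
D4 = np.array([[0.25, -1.0, -1.0, -1.0],
               [0.25,  1.0,  0.0,  0.0],
               [0.25,  0.0,  1.0,  0.0],
               [0.25,  0.0,  0.0,  1.0]])

C4 = np.linalg.inv(D4)
print(np.round(C4, 4))                 # rows (1, 1, 1, 1), (-1/4, 3/4, -1/4, -1/4), ...

Omega_A = 0.8 * np.ones((4, 4)) + 0.1 * np.eye(4)   # hypothetical between-college matrix
Sigma_A = D4.T @ Omega_A @ D4                       # centered parameterization
print(np.round(Sigma_A, 4))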
Explanatory variables, such as background information, can be directly incorporated within the systematic (fixed) part $M_{kf}$ in model (2). We can consider variables defined for examinees (age, sex, major, high school academic achievement, and so on) and for colleges (public/private, total enrollment, and so on). If the test is administered to distinct subgroups of students within a college, such as freshmen and upperclassmen, or the four grades, the model (2) can be extended by another level of clustering (examinees within grades within colleges) to

$$Y_{igjk.} = M_{kgf} + A_{ik} + G_{igk} + B_{igjk} + \varepsilon_{igjk.}, \qquad (4)$$

where the assumptions (3) are supplemented by $\mathbf{G}_{ig} = (G_{ig1}, \ldots, G_{ig4}) \sim N_4(\mathbf{0}, \Omega_G)$. Systematic differences among the grades are accommodated within $M_{kgf}$, and the within-college deviations from the mean differences among the grades are represented by the random vector $\mathbf{G}_{ig}$. Models for longitudinal data are obtained by introducing a level of clustering between the examinee and subscore levels (subscores within occasions for examinees). Similarly, repeated administrations across years correspond to the clustering of students within college-years within colleges (assuming no migration).
The number of colleges taking part in the pilot year is limited; the
participation is voluntary and recruitment of colleges requires extensive
efforts. Therefore, only a small number of college-level variables can be
used to model the fixed part $M_{kf}$ ($M_{kgf}$). There is more scope for incorporating examinee-level variables, but it should be borne in mind that such variables may be almost constant within colleges. If several college-level variables, or variables with small within-college variation, are used, then the variance matrix $\Sigma_A$ may vanish.
Voluntary participation raises several issues related to sampling. The
hope is that the colleges taking part in the pilot year are a representative
sample of the hypothetical population of colleges that will administer the
test in future. However, the composition of colleges interested in future
operational administration may change because of various circumstances,
including the results of the pilot year analysis. Another issue concerns par-
ticipation and motivation of the designated students within participating
colleges. In the short form administration the examinee's name is not
entered on the response form, and so he or she is aware that no substantial
personal stake is involved. The problem arises to a lesser extent in the long
form administration, where individual examinees' scores are reported. This
problem relates not only to the pilot year data but also to future operation
because differential motivation of the examinees may seriously influence
the intended comparisons.
Dimensionality of the Observable Trait
I subscribe to the approach of Goldstein and McDonald (1988) for factor analysis with multilevel data, but in the present context it is necessary to distinguish between statistical significance and substantive importance. I adopt a definition of the underlying multidimensional trait for a normally distributed random vector, similar to that used in factor analysis. By an $m$-dimensional normal trait I mean any $m$-variate normal vector with mean $\mathbf{0}$ and a nonsingular variance matrix. A normal vector $\mathbf{z}$ is said to have an underlying $m$-dimensional trait if it can be formed by a nonsingular linear transformation of the $m$-dimensional trait, that is,

$$\mathbf{z} = \boldsymbol{\tau} + \mathbf{U}\boldsymbol{\delta}, \qquad (5)$$

where $\boldsymbol{\delta} \sim N_m(\mathbf{0}, \boldsymbol{\Psi})$, $\boldsymbol{\Psi}$ is nonsingular, $\mathbf{U}$ is a matrix of rank $m$, and $\boldsymbol{\tau}$ is a vector of constants. The matrices $\mathbf{U}$ and $\boldsymbol{\Psi}$ are not defined uniquely, and the analyst can make a choice that facilitates a suitable interpretation. In many situations it is useful to restrict the variance matrix $\boldsymbol{\Psi}$ to a unit matrix or a diagonal matrix, but this restriction is purely formal. Two random vectors are said to have the same trait if the corresponding matrices $\mathbf{U}$ in their representations (5) span the same linear space. In particular, two random vectors of the same length with nonsingular variance matrices have the same trait.
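The substantive dimensionality of an estimated variance matrix can be judged from its eigenvalue decomposition (the estimates themselves are discussed with Tables 5 and 6 below). The sketch below uses a deliberately near-rank-one hypothetical matrix to illustrate the computation:

import numpy as np

# A hypothetical, nearly rank-one variance matrix: one dominant eigenvalue
# indicates an essentially one-dimensional trait.
v = np.array([0.90, 0.92, 0.95, 1.03])
Omega = np.outer(v, v) + 0.01 * np.eye(4)

values, vectors = np.linalg.eigh(Omega)            # ascending eigenvalues
values, vectors = values[::-1], vectors[:, ::-1]
print(np.round(values, 3))                         # one dominant eigenvalue
print(np.round(vectors[:, 0], 2))                  # leading eigenvector: nearly equal weights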
An important issue in the development of the GENED test is establishing a four-dimensional trait underlying the four skill subscores (and a three-dimensional trait underlying the three subject-area subscores).
test and its scoring would be a commercial undertaking, the testing organi-
zation would be committed to provide the college means of the subscores
even in the presence of substantial attrition, which may drastically affect
their validity. Therefore, excluding the colleges with substantial attrition
from the analysis might result in bias of the predicted variation of the
outcomes in the future operation. The attrition rate in the long form admin-
istration was negligible. In the short form administration the vast majority
of the examinees were freshmen (91%), and only three colleges had 50 or
more upperclass examinees. In the long form administration there were
seven colleges in which the test was administered to both freshmen and
upperclassmen, in six colleges only freshmen were examined, and in the
remaining five colleges only upperclassmen took part. Details of the admin-
istration sizes are given in Tables 1 and 2.
Separate analyses were carried out for short form and long form data and
for skill subscores and subject-area subscores within forms. Variance com-
ponent models were fitted using the Fisher scoring algorithm of Longford
(1987) implemented in the software VARCL (Longford, 1988). The cen-
tered parameterization was used, and instead of the variance parameters
their square roots (sigmas) were estimated. Therefore, the standard errors
for these sigmas are quoted throughout.
For model fitting in VARCL I declare, for the analysis of skill subscores
in the short form, 4 x 5,159 elementary observations (subscores) within
5,159 examinees, within 34 colleges. For the elementary observations we
have the categorical explanatory variable, indicating the subscore type (4
categories), and for the examinees the categorical explanatory variable,
indicating the form of the test (3 categories). In order to allow a general
pattern of the between-form differences in the population subscore means,
the form-by-subscore interaction is considered. The between-college and
between-examinee covariance structures are declared as associated with the
subscore type. Model fitting is iterative, but the number of iterations re-
quired in the models fitted was always less than 12. The model parameters are the location parameters associated with the form-by-subscore interaction (using the centered parameterization), an imputed value for the elementary-level variance $\sigma^2$, and the variance and covariance elements of the matrices $\Sigma_A$ and $\Sigma_B$. The choice of $\sigma^2$ has to be such that the estimate of $\Sigma_B$ is positive definite (if too large a $\sigma^2$ has been chosen, the model has to be refitted with a smaller $\sigma^2$). For each estimated parameter its maximum likelihood estimate and the associated standard error are obtained, as well as the value of $-2 \times$ log-likelihood for the fitted model (useful for likelihood ratio testing). The conditional expectations of the random effects (i.e., the residuals corresponding to the random terms $A_{ik}$ and $B_{ijk}$) are also provided. The software VARCL has no requirements for balance of the data.
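VARCL obtains full maximum likelihood estimates by Fisher scoring; as a crude stand-in (a sketch only, not the algorithm or the data used in the paper), the within-college and between-college covariance matrices can be estimated by a moment (multivariate ANOVA) decomposition for balanced data:

import numpy as np

# Moment estimates of the pooled within-college covariance (sigma^2 I + Omega_B)
# and the between-college covariance Omega_A from an array Y of shape
# (colleges, examinees per college, subscores).  Balanced data are assumed.
def moment_estimates(Y):
    n_c, n_e, _ = Y.shape
    college_means = Y.mean(axis=1)
    centered = Y - college_means[:, None, :]
    W = np.einsum('ijk,ijl->kl', centered, centered) / (n_c * (n_e - 1))
    d = college_means - college_means.mean(axis=0)
    Omega_A_hat = d.T @ d / (n_c - 1) - W / n_e
    return W, Omega_A_hat

# Quick check on hypothetical simulated data (values are illustrative only).
rng = np.random.default_rng(1)
A = rng.multivariate_normal(np.zeros(4), 0.9 * np.ones((4, 4)) + 0.1 * np.eye(4), 40)
Y = A[:, None, :] + rng.normal(0.0, 2.0, size=(40, 150, 4))
W_hat, Omega_A_hat = moment_estimates(Y)
print(np.round(W_hat, 2))                          # approximately 4 * I
print(np.round(Omega_A_hat, 2))                    # approximately the matrix used for A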
TABLE 1
Numbers of examinees from the colleges in the short form administration by form
College no.    Total    Form A    Form B    Form C    Excluded examinees(a)
1 61 21 21 19 2
2 217 72 72 73 5
3 400 134 131 135 4
4 278 105 93 80 80
5 132 45 43 44 2
6 183 69 69 45 47
7 133 44 44 45 1
8 75 27 25 23 7
9 81 27 28 26 0
10 72 47 47 48 1
11 48 14 20 14 25
12 100 34 33 33 0
13 148 49 50 49 5
14 171 57 56 58 2
15 176 76 60 40 137
16 117 39 39 39 0
17 57 19 19 19 1
18 40 13 13 14 2
19 96 33 32 31 4
20 77 25 27 25 1
21 151 52 51 48 6
22 158 54 50 54 5
23 172 60 55 57 5
24 93 39 34 26 21
25 81 22 39 20 13
26 72 24 24 24 1
27 132 47 40 45 3
28 80 30 25 25 11
29 130 66 63 61 7
30 242 88 81 78 0
31 112 37 37 38 2
32 84 28 27 29 2
33 517 172 175 170 4
34 332 116 112 104 66
Total 5,159 1,785 1,639 1,735 972
Note. N = 5,159.
(a) Examinees who scored less than 10 points (out of 48).
Results
The estimates of the college mean subscores are given in Tables 3 and 4.
They differ from the corresponding arithmetic averages of the subscores by
.01 or less. The standard errors associated with the variance component
analysis estimates of the means also differ very little from the standard
errors associated with the arithmetic averages. For example, the standard
errors of the within-skill and within-form means in Table 3 are in the range
.060-.064, whereas the corresponding values for the arithmetic averages
are .057-.058.
Examinees in the short form administration appear to score substantially
higher than in the long form administration. No college has administered
both the short form and the long form, and so we can distinguish between
genuine differences between the short form and long form populations on
one hand and fatigue and loss of interest on the other hand only by looking
at the results of the first session in the long form administration. There is
strong evidence of fatigue and loss of interest because the subscores in the
first session were much higher than in the later sessions.
TABLE 2
Numbers of students in the long form administration
College no. Number of students Freshmen Upperclassmen
1 332 235 97
2 163 57 106
3 468 0 468
4 61 61 0
5 445 445 0
6 66 0 66
7 263 0 263
8 387 387 0
9 182 182 0
10 209 161 48
11 206 178 28
12 177 136 43
13 194 194 0
14 69 0 69
15 812 812 0
16 131 0 131
17 41 18 23
18 629 432 197
TABLE 3
Mean skill- and content-area subscores in the short form administration, by form (reading, writing, critical thinking, and mathematics skill subscores of 12 items each; humanities, social sciences, and natural sciences content-area subscores of 16 items each; form totals of 48 items)
Note. The standard errors are .06 for the skill subscore means (12 items), .07 for the content-area subscore means (16 items), and .12 for the form totals (48 items).

TABLE 4
Mean skill- and content-area subscores in the long form administration
Note. The standard errors are .10 for the skill subscore means (36 items), .12 for the content-area subscore means (48 items), and .20 for the form totals (144 items).
TABLE 5
Eigenvalue decompositions of the estimated variance matrices for the short form administration; skill subscores

Examinee level, $\Omega_B$ (upper triangle), eigenvalues, and eigenvectors:
    4.96  2.56  2.49  2.00     11.95    (.50  .53  .50  .46)
          5.30  2.42  2.16      3.15    (.36  .27  .17 -.88)
                4.96  2.11      2.72    (.21 -.76  .62 -.03)
                      5.04      2.43    (.76 -.27 -.58 -.12)

College level, $\Omega_A$ (upper triangle), eigenvalues, and eigenvectors:
     .80   .82   .85   .91      3.61    ( .47  .48  .50  .54)
           .84   .87   .92       .032   ( .25  .55  .10 -.79)
                 .90   .97       .005   ( .84 -.49 -.22  .10)
                      1.07       .001   (-.10 -.48  .83 -.26)
TABLE 6
Eigenvalue decompositions of the estimated variance matrices for the short form administration; subject-area subscores

Examinee level, $\Omega_B$ (upper triangle), eigenvalues, and eigenvectors:
    7.95  4.13  3.97     15.92    (.58  .58  .57)
          7.76  4.04      3.88    (.70  .01 -.71)
                7.75      3.67    (.41 -.82  .41)

College level, $\Omega_A$ (upper triangle), eigenvalues, and eigenvectors:
    1.41  1.51  1.56      4.82    (.54  .58  .61)
          1.64  1.71       .034   (.71  .08 -.70)
                1.80       .008   (.46 -.81  .37)
Conditional Expectations of the Random Effects
Diagnosis of the model fit in ordinary regression is usually based on inspection of the residuals. In variance component models we have "residuals" associated with units at each level of the hierarchy: colleges (grades) and examinees. The separation of the total residual $y - M$ into its components $A$ and $B + \varepsilon$ is obtained by considering the conditional expectations of the random effects given the parameter estimates. The formulae and their connection to the EM algorithm are discussed in Longford (1987).
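For a balanced college with $n$ examinees, the college-level conditional expectation has the shrinkage form $\hat{\mathbf{A}}_i = \Omega_A(\Omega_A + \Omega_B^W/n)^{-1}(\bar{\mathbf{y}}_i - \boldsymbol{\mu})$, where $\bar{\mathbf{y}}_i$ is the vector of observed subscore means for the college. The sketch below illustrates this formula with hypothetical values; it is not the VARCL output:

import numpy as np

# Conditional expectation (shrinkage estimate) of the college-level random
# effect for a balanced college; all numerical values are hypothetical.
def college_effect(ybar_i, mu, Omega_A, Omega_W, n):
    shrink = Omega_A @ np.linalg.inv(Omega_A + Omega_W / n)
    return shrink @ (ybar_i - mu)

mu      = np.array([7.0, 6.5, 6.0, 5.5])
Omega_A = 0.8 * np.ones((4, 4)) + 0.1 * np.eye(4)
Omega_W = 2.0 * np.ones((4, 4)) + 3.0 * np.eye(4)   # sigma^2 I + Omega_B, within-college matrix
ybar_i  = np.array([7.6, 7.0, 6.4, 5.8])

print(np.round(college_effect(ybar_i, mu, Omega_A, Omega_W, n=100), 3))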
The college-level conditional expectations $\{\hat{\mathbf{A}}_i\}$ are also useful for comparison of the colleges that participated in the pilot year. Colleges at either extreme of the scale defined by a component of the vector of residuals are of interest for both diagnostic and substantive reasons. Table 7 lists the residuals for the skill subscores in the short form administration, for the average of the subscores and for the mathematics-reading contrast.
Equality of the covariance structure of the subscores across the forms is
the most important aspect of parallelness. Comparison of the columns in
Table 7 (in particular for colleges with large numbers of examinees) pro-
vides evidence of parallelness of the three forms. Parallelness of the forms
can be more formally explored by separate analyses for each form.
Concluding Remarks
The analyses indicate that the two-way matrix design of the test is not
justified. For colleges using the short form administration, only the total
score is worth reporting; any subscore, or a linear combination of sub-
scores, is indistinguishable from a less reliable version of the total score. In
the long form administration, there are two observable dimensions of the
latent trait for the skill subscores; it is meaningful to consider two
subtests—one containing all the mathematics items and the other all the
other items of the test (reading, writing, and critical thinking).
We cannot, on the basis of the analyses, determine why the complex
factor structure is not observable in the college-mean test subscores. On
one hand, the items may be an imperfect representation of the domain
indicated by the label, or they may be heavily contaminated by a "general"
domain. This problem may be alleviated by careful review and content
analysis of the test items. On the other hand, the true factorial structure
underlying the knowledge/skill domains may be genuinely unidimensional,
in which case the test cannot serve the intended purpose. We can only
speculate that there may be a multivariate factorial structure underlying
academic skills (and subject areas), but the traits of the skills defined in
GENED are all subsumed in a single dimension.
TABLE 7
Conditional expectations of the random effects $A^*_{i1}$ and $A^*_{i4}$ (average of subscores and mathematics-reading contrast) for each college in the short form administration, reported separately for Form A, Form B, Form C, and for all forms combined