Journal of Management 2003 29(2) 231–258

The Construct-Related Validity of Assessment Center Ratings: A Review and Meta-Analysis of the Role of Methodological Factors☆
David J. Woehr∗
Department of Management, The University of Tennessee, 419 Stokely Management Center,
Knoxville, TN 37996, USA

Winfred Arthur Jr.
Department of Psychology, Texas A&M University, College Station, TX 77843-4235, USA
Received 9 July 2001; received in revised form 6 March 2002; accepted 3 June 2002

In the present study, we provide a systematic review of the assessment center literature with
respect to specific design and methodological characteristics that potentially moderate the
construct-related validity of assessment center ratings. We also conducted a meta-analysis of
the relationship between these characteristics and construct-related validity outcomes. Results
for rating approach, assessor occupation, assessor training, and length of assessor training
were in the predicted direction such that a higher level of convergent and a lower level of discriminant
validity were obtained for the across-exercise compared to the within-exercise rating
method; psychologists compared to managers/supervisors as assessors; assessor training com-
pared to no assessor training; and longer compared to shorter assessor training. Partial support
was also obtained for the effects of the number of dimensions and assessment center purpose.
Our review also indicated that relatively few studies have examined both construct-related
and criterion-related validity simultaneously. Furthermore, these studies provided little, if any
support for the view that assessment center ratings lack construct-related validity while at
the same time demonstrating criterion-related validity. The implications of these findings for
assessment center construct-related validity are discussed.
© 2002 Elsevier Science Inc. All rights reserved.

☆ Portions of this paper were presented at the 14th annual meeting of the Society for Industrial/Organizational Psychology, Atlanta, GA, April 1999, and the 17th annual meeting of the Society for Industrial/Organizational Psychology, Toronto, Canada, April 2002.
∗ Corresponding author. Tel.: +1-865-974-1673; fax: +1-865-974-3163.

E-mail addresses: djw@utk.edu (D.J. Woehr), wea@psyc.tamu.edu (W. Arthur Jr.).

Over the past several decades, assessment centers have enjoyed increasing popularity.
They are currently used in numerous private and public organizations to assess thousands of
people each year (Lowry, 1997; Spychalski, Quiñones, Gaugler & Pohley, 1997; Thornton
& Byham, 1982). The validity of assessment centers is undoubtedly partially responsible for
their popularity. Evidence supporting the criterion-related validity of assessment center rat-
ings has been consistently documented (Arthur, Day, McNelly & Edens, in press; Gaugler,
Rosenthal, Thornton & Bentson, 1987). In addition, content-related methods of validation
are also regularly used in assessment center development in an effort to meet professional
and legal requirements (Sackett, 1987). Evidence for the construct-related validity of assess-
ment center dimensions, however, has been less promising. Specifically, assessment centers
are designed to evaluate individuals on specific dimensions of job performance across sit-
uations or exercises. Research, however, has indicated that exercise rather than dimension
factors emerge in the evaluation of assessees (Bycio, Alvares & Hahn, 1987; Highhouse
& Harris, 1993; Schneider & Schmitt, 1992; Turnage & Muchinsky, 1982). Thus, a lack
of evidence of convergent validity, as well as a partial lack of evidence of discriminant
validity, has been extensively reported in the literature (Brannick, Michaels & Baker, 1989;
Klimoski & Brickner, 1987; Sackett & Harris, 1988). These findings have led to a pre-
vailing view that assessment center ratings demonstrate criterion-related validity while at
the same time lacking construct-related validity (e.g., evidence of convergent/discriminant
validity).
It is important to note that this “prevailing view” is inconsistent with the unitarian con-
ceptualization of validity which postulates that content-, criterion-, and construct-related
validity are simply different strategies for demonstrating the construct validity of a test
or measure (Binning & Barrett, 1989). Here, consistent with Binning and Barrett (1989),
Landy (1986), and other proponents of the unitarian conceptualization of validity, we draw
a distinction between construct-related validity and construct validity (or validation) and
consider the “validation of personnel selection decisions [to be] merely a special case of the
more general validation process” (Binning & Barrett, 1989: 480). Psychological constructs
are conceptualizations regarding the arrangement and interaction of covarying groups of
behavior (i.e., theory-building). In this sense, a construct is a hypothesis concerning these
commonalities among behaviors. Within this framework, construct validation refers to “the
process for identifying constructs by developing measures of such constructs and examin-
ing relationships among the various measures” (Binning & Barrett, 1989: 474). Construct
validation is then, fundamentally, a process of assessing what a test or measurement mea-
sures and how well it does so. In contrast, construct-related validity (e.g., evidence of
convergent/discriminant validity) refers to a specific evidential approach for justifying a
specific measure-construct link and is one of several inferential strategies that can be used
to contribute to our understanding of the construct validity of a test.
Content-related and criterion-related validity are two other commonly used inferential
strategies where the former is typically a rational assessment of the content overlap between
the performance domain and that sampled by the predictor, and the latter is an empiri-
cal demonstration of predictor/criterion measure relationship. Thus, within the unitarian
framework of validity, content-related, criterion-related, and construct-related validity are
considered to be three of several evidential bases for demonstrating the construct validity
of a test or measure (AERA, APA & NCME, 1999; SIOP, 2002), where construct validity
(as differentiated from construct-related validity) generally refers to whether a test is measuring
what it purports to measure, how well it does so, and the appropriateness of inferences
that are drawn from the test scores (AERA et al., 1999; Binning & Barrett, 1989;
Landy, 1986; Lawshe, 1985; Messick, 1989, 1995, 1998). And because these evidential
bases form an interrelated, bound, logical system, demonstration of any two conceptually
implies that the third is also present (Binning & Barrett, 1989). So, within the unitarian frame-
work, at a theoretical level, if a measurement tool demonstrates criterion-related validity
and content-related validity, as has been established with assessment centers, it should also
be expected to demonstrate construct-related validity.

A Closer Look at Validity Evidence for Assessment Center Ratings

Here it may be helpful to expand on the assessment center “validity paradox.” Within
the context of the unitarian view of validity, this paradox is reflected in the idea that assess-
ment center ratings demonstrate (1) content-related validity—it is widely accepted that the
situations and exercises incorporated into assessment centers represent relatively realistic
work samples and that the knowledge, skills, and abilities required for successful assess-
ment center performance are the same as those required for successful job performance;
(2) criterion-related validity—as noted above, the predictive validity of assessment center
ratings has been consistently documented (Arthur et al., in press; Gaugler et al., 1987);
and (3) a lack of construct-related validity—again as noted above, the assessment center
literature consistently points to a lack of convergent and discriminant validity with respect
to assessment center dimensions (cf. Arthur, Woehr & Maldegan, 2000).

The Construct Misspecification Explanation of the Validity Paradox

Several explanations have been postulated for the presence of evidence supporting as-
sessment center content- and criterion-related validity in the absence of construct-related
validity evidence. One recently endorsed view is the construct misspecification explana-
tion. That is, assessment centers may be measuring constructs other than those originally
intended by the assessment center designers (Arthur & Tubre, 2002; Raymark & Binning,
1997; Russell & Domm, 1995). This explanation suggests that the lack of convergent and
discriminant validity evidence is not due to measurement error, but instead due to misspecifi-
cation of the latent structure of the construct domain. As Russell and Domm (1995: 26) note,
“simply put, assessment center ratings must be valid representations of some construct(s),
we just do not know which one(s).” An alternate perspective on the construct misspecifi-
cation explanation has also recently been advanced. Several researchers have argued that
rather than the construct domain being “misspecified,” the factors are correctly specified
but misinterpreted. They argue that the exercise factors that typically emerge represent valid
cross-situational specificity and not “method bias” (Ladd, Atchley, Gniatczyk & Bauman,
2002; Lance et al., 2000).
Although conceptually plausible, the misspecification hypothesis has yet to be demon-
strated empirically. In addition, although it may be argued that construct misspecification
may not be particularly troublesome when assessment centers are used for selection or
promotion, it has dire implications for the use of assessment centers as training and develop-
ment interventions. Specifically, the use of assessment centers as training and development
interventions is predicated on the assumption that they are indeed measuring the specified
targeted dimensions (e.g., team building, flexibility, influencing others) and consequently,
developmental feedback reports and interviews, and individual development plans are all
designed and developed around these dimensions.
It is important to note that explanations such as the construct misspecification explanation
are predicated on the idea that the assessment center validity paradox actually exists. It is
possible, however, that this paradox is illusory. Specifically, evidence supporting the validity
paradox would require that specific assessment centers which demonstrate content- and
criterion-related validity also lack construct-related validity. Yet a cursory examination of
the literature suggests that studies examining assessment center construct-related validity
and those examining criterion-related validity are largely independent. Thus, one important
question with respect to the assessment center validity paradox is how many individual
studies have demonstrated a lack of construct-related validity while also demonstrating
criterion-related validity for a specific assessment center application?
Assuming, however, that the assessment center validity paradox is not illusory, a second
explanation for this paradox posits that assessment center design, implementation, and other
methodological factors may add measurement error that prevents appropriate convergent
and discriminant validity from being obtained (Arthur et al., 2000; Jones, 1992; Lievens,
1998). Here, it may be argued that if assessment centers are implemented in a manner consis-
tent with their theoretical and conceptual basis, more consistent validity outcomes should
be obtained. Specifically, assessment center dimensions should display content-related,
criterion-related, and construct-related validity.

Methodological Explanations of the Validity Paradox

Although the lack of construct-related validity has been widely cited, conceptual and
methodological explanations have not been closely considered (Jones, 1992). However,
recent research (e.g., Arthur & Tubre, 2002; Arthur et al., 2000; Born, Kolk & van der
Flier, 2000; Howard, 1997; Jones, 1992; Kudisch, Ladd & Dobbins, 1997; Lievens, 1998,
2001; Thornton, Tziner, Dahan, Clevenger & Meir, 1997) has focused on these issues, and
subsequently called the lack of construct-related validity view into question.
There is a body of research which indicates that differences in the design and implemen-
tation of assessment centers can result in large variations in their psychometric outcomes.
For example, Schmitt, Schneider and Cohen (1990) compared the correlations of overall
assessment ratings (OARs) with teacher ratings from one assessment center implemented
at 16 different sites. Although the original implementation was the same across sites, some
sites took liberties to make changes during the time the assessment center was in use.
These changes in implementation resulted in a considerable range in predictive validity
coefficients, ranging from −.40 to .82. In addition, Lievens (1998) reviewed 21 studies
that explicitly manipulated assessment center design characteristics hypothesized to impact
the construct-related validity of assessment center ratings. Across studies, design-related
characteristics were sorted into five categories namely dimension characteristics, exercise
characteristics, assessor characteristics, observation and evaluation approach, and rating
integration approach. Results of the review indicated no clear impact of rating integra-
tion approach or observation and evaluation approaches on evidence of assessment center
construct-related validity. However, manipulations focusing on dimension characteristics
(e.g., number of dimensions, conceptual distinctiveness, and transparency), assessor char-
acteristics (e.g., type of assessor and training), and exercise characteristics (e.g., exercise
format) were all found to moderate construct-related validity evidence.
Studies such as those reviewed by Lievens (1998) provide clear evidence that assess-
ment center design-related factors can impact the validity of assessment center ratings.
However, these studies differ markedly from the vast majority of studies on which pre-
vailing views of assessment center validity are based. Specifically, the studies reviewed
by Lievens (1998) almost exclusively incorporate experimental or quasi-experimental de-
signs in which one or two design characteristics were directly manipulated. In addition,
these studies were typically conducted in relatively artificial or contrived settings and
thus, did not address criterion-related validity. In fact, of the 21 studies included in the
Lievens review, 10 were based on student samples (students serving as either assessors,
assessees or both), 7 used videotaped “hypothetical” assessees, and none presented any
criterion-related validity data. Although this research indicates that design-related charac-
teristics can impact validity, they do not provide an indication of the actual design features
of the operational assessment centers from which the validity paradox stems. Thus, in or-
der to evaluate the role of design characteristics in the validity paradox, one must have
a clear view of the existing literature with respect to methodological and design-related
factors.

Methodological and Design-Related Assessment Center Characteristics

Drawing on a relatively large body of research, it is possible to identify specific assessment
center methodological factors and design characteristics that have discernable
hypothesized positive or negative effects on the construct-related validity of assessment
center ratings. These include the number of dimensions assessors are asked to observe,
record, and subsequently rate (Bycio et al., 1987; Gaugler & Thornton, 1989; Schmitt,
1977); the participant-to-assessor ratio (Gaugler et al., 1987); the type of rating approach
(i.e., within-exercise vs. across-exercise; Harris, Becker & Smith, 1993; Robie, Adams,
Osburn, Morris & Etchegaray, 2000; Sackett & Dreher, 1982; Silverman, DeLessio, Woods
& Johnson, 1986); the type of assessor used (psychologists vs. managers and supervi-
sors; cf. Spychalski et al., 1997); assessor/rater training (Gaugler et al., 1987; Woehr &
Huffcutt, 1994); and the assessment center purpose (i.e., selection vs. development). The
hypothesized effects for these factors and the conceptual basis for these effects are next
reviewed.

Number of Dimensions and Participant-to-Assessor Ratio

The first variables of interest are the participant-to-assessor ratio and the number of di-
mensions assessors are asked to observe, record, and rate. These variables play an important
role in the validity of assessment center ratings (Bycio et al., 1987). For example, Schmitt
(1977) found that, in evaluating participants, assessors did not use the 17 designated dimensions
but instead collapsed them into three global dimensions for rating
purposes. Along similar lines, Sackett and Hakel (1979) found that only 5 dimensions, out
of a total of 17, were required to predict most of the variance in OARs. In an extension of
Sackett and Hakel, Russell (1985) also found that out of 16 dimensions, a single dimension
dominated assessors’ ratings.
Gaugler and Thornton (1989) further demonstrated that assessors have difficulty differ-
entiating between a large number of performance dimensions. In this study, assessors were
responsible for rating 3, 6, or 9 dimensions. Those assessors who were asked to rate 3 or 6
dimensions provided more accurate ratings than those asked to rate 9. Thus, it appears that
when asked to rate a large number of dimensions, the cognitive demands placed on assessors
may make it difficult for them to process information at the dimension level resulting in a
failure to obtain convergent and discriminant validity. These findings are consistent with the
upper limits of human information processing capacity reported in the cognitive psychol-
ogy literature (Miller, 1956). Relatedly, the role of cognitive processes in the performance
evaluation and rating process has been well established (Bretz, Milkovich & Read, 1992;
Ilgen, Barnes-Farrell & McKellin, 1993).
Although there is less direct evidence, a similar argument can be made with respect to
the number of assessment center participants any given assessor is required to observe and
evaluate in any given exercise. That is, as the participant-to-assessor ratio increases, the
cognitive demands placed on assessors may make it more difficult to process information at
a dimension level for each participant. In addition, assessment center ratings will be more
susceptible to bias and information processing errors under conditions of high cognitive
demand (Martell, 1991; Woehr & Roch, 1996).
In summary, this body of research would suggest that when placed under high cognitive
demands or overload due to a large number of dimensions (Gaugler & Thornton, 1989;
Reilly, Henry & Smither, 1990) or assigned participants, assessors are unable to distinguish
between and use dimensions consistently across exercises. This means that there is much
to lose from the inclusion of a large number of assessment center dimensions or a large
participant-to-assessor ratio. The inability to simultaneously process a large number of
dimensions across multiple participants may account for assessors’ tendency to rate using
more global dimensions resulting in a failure to obtain convergent and discriminant validity.
Consequently we hypothesized that:

Hypothesis 1: The number of performance dimensions assessors are asked to evaluate will be related to construct-related validity such that dimension convergent validity estimates will be higher when assessors rate fewer compared to a larger number of dimensions. In addition, dimension discriminant validity estimates will be lower when assessors rate fewer compared to a larger number of dimensions.

Hypothesis 2: The number of assessment center participants rated by an assessor (participant-to-assessor ratio) will be related to construct-related validity such that dimension convergent validity estimates will be higher when each assessor rates fewer compared to a larger number of participants. In addition, dimension discriminant validity estimates will be lower when each assessor rates fewer compared to a larger number of participants.
Type of Rating Approach

Two primary evaluation approaches have been identified across assessment centers
(Sackett & Dreher, 1982; Robie et al., 2000). In the within-exercise approach assessees
are rated on each dimension after completion of each exercise. Two variations of this
within-exercise approach have been described: (a) the same assessors observe all exercises
but provide dimension ratings after observing each exercise and (b) different sets of asses-
sors observe each exercise and provide ratings for each dimension. In the across-exercise
approach, evaluation occurs after all of the exercises have been completed and dimen-
sion ratings are based on performance from all of the exercises. Two variations of the
across-exercise approach have also been described: (a) assessors provide an overall rating
for each dimension reflecting performance across all exercises and (b) assessors provide
dimension ratings for each exercise, but after all exercises are observed.
Silverman et al. (1986) provide some evidence that the choice of approach may moderate
findings of convergent and discriminant validity in assessment center ratings. And although
their results would seem to suggest that an across-exercise approach is preferable to a
within-exercise approach, Harris et al. (1993: 677) failed to replicate their findings. Harris
et al.’s results “showed that both across- and within-exercise scoring methods produced vir-
tually the same average monotrait-heteromethod correlations and heterotrait-monomethod
correlations.” However, Robie et al. (2000) recently provided further evidence supporting
the across-exercise rating approach. Specifically, Robie et al. found that when assessors
rated one dimension across all exercises, clear dimension factors emerged. Alternately,
when assessors rated all dimensions within one exercise, clear exercise factors emerged.
Given the research to date, it may be argued that the across-exercise approach is concep-
tually more appropriate and thus, results in better evidence of construct-related validity.
Consequently, we hypothesized:

Hypothesis 3: Rating approach (across-exercise vs. within-exercise) will be related to construct-related validity such that dimension convergent validity estimates will be higher for the across-exercise approach compared to the within-exercise approach. In addition, dimension discriminant validity estimates will be lower for the across-exercise approach compared to the within-exercise approach.

Type of Assessor

The fourth factor pertains to the type of assessor, specifically psychologists vs. managers
and supervisors. In an explanation of their meta-analytic results, Gaugler et al. (1987) posit
that psychologists make better assessors because, as a result of their education and training,
they are better equipped to observe, record, and rate behavior. Sagie and Magnezy (1997)
demonstrated that type of assessor (i.e., managers vs. psychologists) significantly influenced
the construct-related validity of assessment center ratings. Thus, all things being equal,
studies that use industrial/organizational (I/O) psychologists (and similarly trained human
resource consultants and professionals) as assessors, are more likely to obtain evidence of
convergent/discriminant validity in contrast to those that use managers, supervisors, and
incumbents. Thus, we hypothesized that:
Hypothesis 4: The type of assessor used in assessment centers (psychologists vs. man-
agers/supervisors) will be related to construct-related validity such that dimension conver-
gent validity estimates will be higher when ratings are provided by psychologists compared
to managers/supervisors. In addition, dimension discriminant validity estimates will be
lower when ratings are provided by psychologists compared to managers/supervisors.

Assessor Training

Because assessment center ratings are inherently judgmental in nature, training
assessors/raters is an important element in the development and design of assessment
centers. Thus, the type of training is also an important variable (Woehr & Huffcutt, 1994).
For instance, there is consensus in the literature that frame-of-reference (FOR) is a highly
effective approach to rater training (Lievens, 2001; Noonan & Sulsky, 2001; Schleicher
& Day, 1998; Woehr & Huffcutt, 1994). However, irrespective of the training approach
used, assessment centers that have more extensive rater training are more likely to result
in ratings that display convergent/discriminant validity. Consequently, we hypothesized
that:

Hypothesis 5: Assessor training will be related to construct-related validity such that dimension convergent validity estimates will be higher when the implementation of assessor training is reported compared to when it is not reported. In addition, dimension discriminant validity estimates will be lower when the implementation of assessor training is reported compared to when it is not reported.

Hypothesis 6: Assessor training will be related to construct-related validity such that dimension convergent validity estimates will be higher for longer assessor training programs compared to shorter assessor training programs. In addition, dimension discriminant validity estimates will be lower for longer assessor training programs compared to shorter assessor training programs.

Assessment Center Purpose

Another variable that may impact assessment center construct-related validity outcomes
is the purpose for which assessment center ratings are collected. Here, it may be argued
that assessors may evaluate candidates differently depending on whether their ratings will
be used for selection or promotion decisions, or for training and development purposes.
Although research focusing on rating purpose in the assessment center literature is limited,
this issue has received a great deal of attention in the performance appraisal literature.
This literature suggests that rating purpose impacts rater cognitive processing such that
raters process incoming information differently depending on whether they begin with an
evaluative or observational goal (e.g., Feldman, 1981; Woehr & Feldman, 1993). Research
in this area has indicated that raters are more likely to form differentiated dimension-based
evaluations, as opposed to overall global evaluations, when initial processing goals focus
on observation and differentiation as opposed to pure evaluation (Woehr, 1992; Woehr &
Feldman, 1993). Thus, assessment centers conducted for training and development purposes
may lead to more differentiated ratings than would assessment centers conducted solely for
selection or promotion purposes. Thus, we hypothesized that:

Hypothesis 7: Assessment center purpose (training/development vs. selection/promotion) will be related to construct-related validity such that dimension convergent validity estimates will be higher for assessment centers conducted for training/development purposes compared to those conducted for selection/promotion. In addition, dimension discriminant validity estimates will be lower for assessment centers conducted for training/development purposes compared to those conducted for selection/promotion.

In summary, the literature reviewed above identified several assessment center method-
ological/design factors that potentially moderate assessment center dimension construct-
related validity evidence. These are (1) number of performance dimensions assessed, (2) the
participant-to-assessor ratio, (3) type of rating approach, (4) type of assessor, (5) asses-
sor training, (6) length of assessor training, and (7) assessment center purpose. Another
methodological factor that has received some attention in the assessment center litera-
ture is type of rating scale. We chose not to include this variable in the current study
for two reasons. First, we found very few studies reporting information on the type of
rating scale used and the vast majority of this small subset were laboratory-based stud-
ies and thus, would not have met our criterion for inclusion (Lievens, 1998 found only
six studies and almost all of these were laboratory-based with student samples). Second,
there is very limited evidence with respect to the impact of different rating scales in an
assessment center context. In contrast, there is a great deal more literature on the impact
of rating scales on ratings in the performance appraisal literature, and the generally ac-
cepted conclusion of this literature is that specific rating scale format has little effect on
performance ratings. In fact, over 20 years ago Landy and Farr (1980) went so far as to
call for a moratorium on rating scale format research, arguing that it had largely proved
fruitless.
Thus, the methodological factors considered here are not intended to be an exhaustive
list of all possible potential moderators. Rather these characteristics are those that appear
most likely to impact construct-related validity outcomes. That is, based on both the
conceptual and empirical arguments presented, these characteristics appear to be those for
which the hypothesized effect on construct-related validity outcomes (both positive and
negative) can be most clearly articulated.

Present Study

We propose that the lack of convergent and discriminant validity for assessment center
dimensions is not an inherent flaw of the assessment center as a measurement tool or method,
but rather these findings may be attributable to certain design and methodological features
(Gaugler et al., 1987; Jones, 1992; Schmitt et al., 1990). Given this proposition, it would
seem worthwhile to systematically re-examine the literature on which the current view
(i.e., that assessment center ratings demonstrate content-related and criterion-related but
not construct-related validity) is based.
Our primary objective in the present study was to review this literature with respect
to several methodological/design-related characteristics and to conduct a meta-analysis to
empirically examine the relationship between these characteristics and assessment center
construct-related validity. Toward this objective, we first provided a detailed review of the
existing literature examining the construct-related validity of assessment center ratings. The
goal of this review was to provide summary descriptive information on the existing literature
with respect to the seven assessment center characteristics presented above and then formu-
late specific hypotheses with respect to the effect of these methodological/design-related
characteristics on the construct-related validity of assessment center ratings. We next con-
ducted a meta-analysis to test the hypothesized effects of the specified methodological and
design characteristics.
Another goal of the present study was to review the extent to which the studies comprising
the existing literature on assessment center construct-related validity simultaneously exam-
ine multiple sources of validity evidence. Here we sought to document the extent to which
studies that examine the construct-related validity of assessment center dimensions also
present data on the criterion-related validity of the assessment center dimension ratings.
Specifically, how many individual studies have demonstrated a lack of construct-related
validity while also demonstrating criterion-related validity for a specific assessment center
application? Thus, overall we sought to provide a detailed picture of the literature on which
the prevailing view of assessment center validity is based and use meta-analytic procedures
to empirically examine the impact of specific assessment center methodological/design
characteristics on the construct-related validity of assessment center ratings.

Method

Literature Search and Inclusion Criteria

A literature search was conducted to locate studies that empirically examined the construct-related
validity of assessment center ratings, using a number of
computerized databases (i.e., PsycINFO, Social Sciences Citation Index, Web of Science).
In addition, reference lists from obtained studies were also examined in order to identify
additional studies. We used several criteria for the inclusion of studies. First, we sought
out studies that directly examined the construct-related validity of assessment center di-
mensions. Second, we included only those studies which provided information about the
construct-related validity of operational assessment center ratings. Specifically, we focused
on studies in which assessment centers were conducted in an actual organizational context
and thus, excluded studies based on “simulated” assessment centers (i.e., we did not exclude
studies in which assessment center characteristics were examined using an experimental or
quasi-experimental approach—however we did exclude studies based on student samples
[either as assessors or assessees] or those using videotaped “hypothetical” assessees). The
search resulted in the location of 32 studies spanning over 30 years (from 1966 to 2001)
reporting results for 48 separate assessment centers. This set of studies served as the basis
for our descriptive review of the literature. Finally, we also identified the subset of these
studies that reported traditional MTMM correlation-based data. We used these summary
indices of convergent and discriminant validity (i.e., mean monotrait-heteromethod and/or heterotrait-monomethod rs, respectively) as dependent measures for testing our seven hypotheses pertaining to the impact of the methodological/design factors. Of the 48 separate assessment centers, MTMM correlation-based data were available for 31.

Coding of Methodological Characteristics

Each of the 48 assessment centers was reviewed and coded with respect to the seven
methodological/design characteristics discussed above. These characteristics were:
(1) number of dimensions evaluated; (2) participant-to-assessor ratio; (3) rating approach
(within-exercise vs. across-exercise); (4) assessor occupation (manager or supervisor vs.
psychologist); (5) whether assessor training was reported; (6) the length of assessor train-
ing; and (7) the assessment center purpose. Each study was also coded with respect to
four additional pieces of descriptive information: (1) number of assessees (i.e., sample
size); (2) number of exercises included in the assessment center; (3) descriptions of the
dimensions/constructs rated; and (4) type of analysis used to examine construct-related va-
lidity (exploratory factor analysis, confirmatory factor analysis, MTMM data, nomological
net). For those studies reporting MTMM data, we also recorded convergent (i.e., mean
monotrait-heteromethod rs) and/or discriminant validity coefficients (i.e., mean heterotrait-
monomethod rs). Finally, each of the studies was reviewed to ascertain whether criterion-
related validity evidence was reported in addition to the construct-related validity evidence.
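To make the two MTMM-based indices concrete, the sketch below is a minimal illustration (not the authors' code; the three-way data layout and the function name mtmm_summary are hypothetical) of how mean monotrait-heteromethod (convergent) and heterotrait-monomethod (discriminant) correlations can be computed from ratings organized as assessees × exercises × dimensions.

```python
import numpy as np

def mtmm_summary(ratings: np.ndarray) -> tuple[float, float]:
    """Return (mean convergent r, mean discriminant r) from an
    assessees x exercises x dimensions array of ratings."""
    n, n_ex, n_dim = ratings.shape
    convergent, discriminant = [], []

    # Same dimension, different exercises (monotrait-heteromethod)
    for d in range(n_dim):
        for e1 in range(n_ex):
            for e2 in range(e1 + 1, n_ex):
                convergent.append(np.corrcoef(ratings[:, e1, d], ratings[:, e2, d])[0, 1])

    # Different dimensions, same exercise (heterotrait-monomethod)
    for e in range(n_ex):
        for d1 in range(n_dim):
            for d2 in range(d1 + 1, n_dim):
                discriminant.append(np.corrcoef(ratings[:, e, d1], ratings[:, e, d2])[0, 1])

    return float(np.mean(convergent)), float(np.mean(discriminant))

# Hypothetical example: 100 assessees, 5 exercises, 8 dimensions of simulated ratings
rng = np.random.default_rng(0)
conv_r, disc_r = mtmm_summary(rng.normal(size=(100, 5, 8)))
print(f"mean convergent r = {conv_r:.2f}, mean discriminant r = {disc_r:.2f}")
```

Construct-related validity would be indicated by the first value being high relative to the second; in the studies coded here, only the two resulting summary coefficients (not the raw rating arrays) were typically available.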

Meta-Analytic Procedures

As previously noted, we used both convergent and discriminant validity coefficients (i.e.,
mean monotrait-heteromethod and/or heterotrait-monomethod rs, respectively) as measures
of construct-related validity—in other words, the meta-analysis used the convergent and dis-
criminant validity coefficients (rs) as the outcome statistic. Consequently, the meta-analysis
was based on the 31 (out of 48) studies that reported traditional MTMM correlation-based
data.
The participant-to-assessor ratio and number of dimensions assessed were initially coded
as continuous variables, but were converted to dichotomous variables for the meta-analysis
using a median split. We also categorized the length of training into three levels—less
than 1 day, 1–5 days, and more than 5 days of training. Although it permitted us to run
the specified analyses, the limitations associated with the coding of assessor training must
be noted. First, the variable represents whether or not assessor training was reported, not
necessarily whether training actually occurred. It is possible that training was provided and
simply not reported. Second, this coding provides no indication of the nature or content of
the training provided. It would have been preferable to code for training with respect to the
content of training, but unfortunately, very few of the studies provided sufficient detail for
such an approach.
The data analyses were performed using Arthur, Bennett and Huffcutt’s (2001) SAS
PROC MEANS meta-analysis program to compute sample-weighted convergent and dis-
criminant validities for the specified levels of the methodological characteristics. Sample
weighting assigns studies with larger sample sizes more weight and reduces the effect
of sampling error since sampling error generally decreases as the sample size increases
(Hunter & Schmidt, 1990). We also computed 95% confidence intervals (CIs) for the
sample-weighted convergent and discriminant validities. CIs assess the accuracy of the
estimate of the mean validity/effect size (Whitener, 1990). CIs estimate the extent to which
sampling error remains in the sample-size-weighted validity. Thus, CI gives the range of
values that the mean validity is likely to fall within if other sets of studies were taken from
the population and used in the meta-analysis. A desirable CI is one that does not include
zero if a non-zero relationship is hypothesized.
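The sketch below illustrates the general form of these computations. It is a bare-bones Hunter and Schmidt (1990) style calculation under common conventions, not a reproduction of the Arthur, Bennett and Huffcutt (2001) SAS program; the function name, the confidence-interval convention (SE = SDr/√k), and the example data are assumptions for illustration only.

```python
import math

def bare_bones_meta(rs, ns):
    """Sample-weighted meta-analysis of correlations.

    rs: study-level correlations (e.g., mean monotrait-heteromethod rs)
    ns: corresponding sample sizes
    Returns (mean r, SDr, % variance due to sampling error, 95% CI).
    """
    k = len(rs)
    total_n = sum(ns)

    # Sample-size-weighted mean correlation
    mean_r = sum(n * r for r, n in zip(rs, ns)) / total_n

    # Sample-size-weighted observed variance and SD of the correlations
    var_obs = sum(n * (r - mean_r) ** 2 for r, n in zip(rs, ns)) / total_n
    sd_r = math.sqrt(var_obs)

    # Expected sampling-error variance and percentage of observed variance it explains
    var_err = (1 - mean_r ** 2) ** 2 / (total_n / k - 1)
    pct_acc = 100 * var_err / var_obs if var_obs > 0 else 100.0

    # 95% confidence interval around the sample-weighted mean
    se_mean = sd_r / math.sqrt(k)
    ci = (mean_r - 1.96 * se_mean, mean_r + 1.96 * se_mean)
    return mean_r, sd_r, pct_acc, ci

# Hypothetical example: five studies contributing convergent validity coefficients
print(bare_bones_meta([.25, .40, .31, .45, .29], [120, 75, 300, 60, 150]))
```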

Results

Descriptive Summary of the Literature

As expected there was a great deal of variability across studies in terms of the specified
methodological and design characteristics. Specifically, the mean sample size across studies
was 269.58 (SD = 281.69; median = 159.5; mode = 75) and ranged from 29 to 1170.
The mean number of dimensions to be evaluated was 10.60 (SD = 5.11; median = 9.00;
mode = 8) and ranged from 3 to 25. It is also interesting to note that across the 48 assessment
center studies 129 different dimension labels were recorded (a listing of the dimension labels
is available from the authors). The mean number of exercises included in the assessment
centers represented was 4.78 (SD = 1.47; median = 5; mode = 4; minimum = 2;
maximum = 8). It should be noted that this number of exercises represents only situational
exercises; several of the studies also included paper-and-pencil measures of some of the
dimensions evaluated. Only 26 of the 48 studies (54%) reported information on the ratio
of participants to assessors. For these 26, the participant-to-assessor ratio ranged from 1
participant for 4 assessors to 4 participants for each assessor with a mean ratio of 1.71
(mode = 2) participants per assessor. With respect to rating approach, 17 studies reported
using an across-exercise approach in which dimensional ratings were collected after the
completion of all exercises. Twenty-nine reported using a within-exercise approach in which
dimensional ratings were collected after the completion of each exercise.
Thirty-five of the 48 studies (73%) provided information with respect to whether rater
training was included but of these only 22 studies (44%) reported information pertaining to
length of training. For these studies, the mean length of training was 3.36 days (SD = 3.06;
median = 2; mode = 2) ranging from 1 to 15 days. Of the 48 assessment centers presented,
40 (83%) indicated that they were used primarily for selection/promotion decisions and 8
(17%) indicated training/development as the primary purpose. Finally, 26 of the 48 studies
(54%) provided information with respect to assessor occupation; of these, 21 (81%) reported using
managers or supervisors from the same organization in which the assessment center was
being implemented and 5 (19%) reported using psychologists.

Meta-Analysis of the Effect of the Methodological Characteristics

Next we empirically examined the relation between the methodological characteristics and construct-related validity outcomes. As previously noted, the meta-analysis used the
convergent and discriminant validity coefficients as the outcome statistic. Convergent valid-
ity coefficients represent the level of intercorrelation within dimensions across exercises and
in contrast, discriminant validity coefficients represent the level of intercorrelation within
exercise and across dimensions. Thus, construct-related validity is expressed by high con-
vergent and low discriminant validity coefficients.
The results of the meta-analysis, which are presented in Table 1, indicate that the results
for rating approach (Hypothesis 3), assessor occupation (Hypothesis 4), assessor training
(Hypothesis 5), and length of assessor training (Hypothesis 6) were in the predicted direc-
tion. Specifically, for rating approach, the mean dimension convergent validity was higher
for the across-exercise approach compared to the within-exercise approach (.43 vs. .29).
In addition, dimension discriminant validity was lower for the across-exercise approach
compared to the within-exercise approach (.48 vs. .58). Likewise, for type of assessor,
the mean dimension convergent validity was higher for psychologists compared to managers/supervisors
(.45 vs. .38), and the dimension discriminant validity was lower when
ratings were provided by psychologists compared to managers/supervisors (.40 vs. .64). Similar
patterns of results were obtained for assessor training compared to no assessor training.
And excluding the single data point for more than 5 days of training, the results of the
meta-analysis also indicated that longer assessor training was associated with higher levels
of convergent validity and lower levels of discriminant validity.
Partial support was obtained for the number of dimensions (Hypothesis 1) and assess-
ment center purpose (Hypothesis 7). Specifically, fewer dimensions were associated with
higher levels of convergent validity, but contrary to our hypothesis, fewer dimensions were
also associated with a higher level of discriminant validity. Likewise, although the level of
convergent validity for training/development assessment centers was higher than that for
assessment centers used for selection/promotion, the difference was not meaningful. Furthermore,
contrary to the study hypothesis, the level of discriminant validity was higher for train-
ing/development assessment centers than selection/promotion assessment centers. Finally,
the participant-to-assessor ratio hypothesis (Hypothesis 2) was not supported. Both the con-
vergent and discriminant validity results for the participant-to-assessor ratio were opposite
to what we had hypothesized—lower participant-to-assessor ratios were associated with
lower convergent validity and higher discriminant validity. In summary, 4 of the 7 study
hypotheses were fully supported, partial support was obtained for 2, and 1 hypothesis was
not supported.

Co-occurrence of Construct-Related and Criterion-Related Validity Evidence

We next examined which of the 48 studies examining assessment center construct-related validity also examined criterion-related validity as well as the nature of the validity evidence
presented. Somewhat surprisingly, in only 17% (8 of the 48) were both criterion-related and
construct-related validation strategies incorporated in analyzing assessment center ratings.
Even more surprising was the fact that these studies provided little, if any, support for the
view that assessment center ratings lack construct-related validity while at the same time
demonstrating criterion-related validity. A summary of the validity evidence presented in
each of these 8 studies is presented in Table 2. In 4 of the 8 cases there was no examination
of internal construct-related validity (i.e., MTMM-type analysis). Rather, assessor ratings
on a relatively large number of dimensions (ranging from 12 to 25 dimensions) were factor
analyzed, revealing a smaller number of factors. Scores based on these factors were then
shown to demonstrate varying levels of criterion-related validity.
Table 1
Meta-analysis results for convergent and discriminant validities

Convergent validity

|  | K | N | Mean r | SDr | % var. acc. for | 95% CI |
|---|---|---|---|---|---|---|
| Overall | 31 | 7440 | .34 | .11 | 25.11 | .33/.36 |
| Number of dimensions^a: high | 14 | 1875 | .27 | .15 | 27.35 | .23/.31 |
| Number of dimensions^a: low | 17 | 5565 | .37 | .09 | 31.62 | .35/.38 |
| Participant-to-assessor ratio^a: high | 8 | 2739 | .43 | .09 | 24.39 | .40/.45 |
| Participant-to-assessor ratio^a: low | 8 | 2302 | .33 | .06 | 75.94 | .30/.36 |
| Rating approach: within-exercise | 23 | 4297 | .29 | .10 | 43.50 | .26/.31 |
| Rating approach: across-exercise | 6 | 2869 | .43 | .06 | 34.60 | .40/.45 |
| Assessor occupation: manager/supervisor | 10 | 4249 | .38 | .07 | 30.81 | .36/.41 |
| Assessor occupation: psychologist/consultant | 2 | 287 | .45 | .15 | 18.57 | .37/.53 |
| Assessor training: no training indicated | 6 | 2115 | .29 | .12 | 16.19 | .26/.33 |
| Assessor training: training indicated | 25 | 5325 | .36 | .11 | 32.29 | .34/.38 |
| Length of assessor training: less than 1 day | 4 | 1043 | .29 | .07 | 70.81 | .24/.34 |
| Length of assessor training: 1–5 days | 7 | 2430 | .44 | .08 | 32.34 | .41/.47 |
| Length of assessor training: more than 5 days | 1 | 138 | .29 | – | – | – |
| Assessment center purpose: selection/promotion | 25 | 5746 | .34 | .12 | 24.51 | .32/.36 |
| Assessment center purpose: training/development | 6 | 1694 | .35 | .10 | 28.60 | .32/.39 |

Discriminant validity

|  | K | N | Mean r | SDr | % var. acc. for | 95% CI |
|---|---|---|---|---|---|---|
| Overall | 30 | 6412 | .55 | .13 | 13.31 | .53/.56 |
| Number of dimensions^a: high | 14 | 1875 | .48 | .10 | 43.02 | .45/.51 |
| Number of dimensions^a: low | 16 | 4537 | .58 | .13 | 8.96 | .56/.59 |
| Participant-to-assessor ratio^a: high | 7 | 1711 | .48 | .07 | 53.39 | .45/.51 |
| Participant-to-assessor ratio^a: low | 8 | 2302 | .68 | .09 | 11.93 | .66/.70 |
| Rating approach: within-exercise | 23 | 4297 | .58 | .14 | 11.43 | .56/.60 |
| Rating approach: across-exercise | 5 | 1841 | .48 | .07 | 37.74 | .45/.51 |
| Assessor occupation: manager/supervisor | 9 | 3221 | .64 | .10 | 10.65 | .63/.66 |
| Assessor occupation: psychologist/consultant | 2 | 287 | .40 | .01 | 100.00 | .32/.48 |
| Assessor training: no training indicated | 6 | 2115 | .63 | .15 | 4.44 | .61/.65 |
| Assessor training: training indicated | 24 | 4297 | .51 | .10 | 32.64 | .49/.53 |
| Length of assessor training: less than 1 day | 4 | 1043 | .59 | .06 | 46.11 | .55/.62 |
| Length of assessor training: 1–5 days | 6 | 1402 | .54 | .08 | 29.55 | .51/.57 |
| Length of assessor training: more than 5 days | 1 | 138 | .41 | – | – | – |
| Assessment center purpose: selection/promotion | 24 | 4718 | .50 | .09 | 32.41 | .49/.52 |
| Assessment center purpose: training/development | 6 | 1694 | .67 | .14 | 5.16 | .65/.69 |

^a Converted to dichotomous variables using a median split; medians were 2 and 9 for participant-to-assessor ratio and number of dimensions, respectively. K = number of convergent/discriminant validities; N = number of participants; Mean r = mean of sample-weighted convergent/discriminant validities; SDr = standard deviation of sample-weighted convergent/discriminant validities; % var. acc. for = percentage of variance due to sampling error; 95% CI = lower and upper values of 95% confidence interval. CIs estimate the extent to which sampling error remains in the sample-size-weighted mean effect size. Thus, the CI gives the range of values that the mean effect size is likely to fall within if other sets of studies were taken from the population and used in the meta-analysis. A desirable CI is one that does not include zero if a non-zero relationship is hypothesized.
Table 2
Summary of analyses used to investigate construct- and criterion-related validity evidence

| Study | Type of analysis^a | Criterion-related validity evidence |
|---|---|---|
| Bray and Grant (1966) | EFA—Hierarchical factor analysis of mean ratings on 25 dimensions resulted in 11 factors for the college sample and 8 factors for the non-college sample | Correlations between derived "factor" scores and salary progression |
| Chan (1996) | MTMM—Mean within-dimension, cross-exercise r of .07; mean within-exercise, cross-dimension r of .71. EFA—Principal components analysis with orthogonal rotation of 6 exercises × 14 dimension ratings resulted in 6 exercise factors. Nomological net—Pattern of correlations between AC dimension ratings and measures of cognitive ability and personality do not support construct validity | Mean AC rating (rxy with performance ratings = .06; with actual promotion = .59). Consensus "promotability" rating (rxy with performance ratings = .25; with actual promotion = .70) |
| Fleenor (1996) | MTMM—Mean within-dimension, cross-exercise r of .22; mean within-exercise, cross-dimension r of .42. EFA—Principal components analysis with orthogonal rotation of 8 exercises × 10 dimension ratings resulted in 8 exercise factors | Mean correlation of AC dimension ratings of .10 with subordinate performance ratings, .15 with self performance ratings, and .17 with supervisor performance ratings |
| Henderson et al. (1995) | MTMM—Mean within-dimension, cross-exercise r of .19; mean within-exercise, cross-dimension r of .42. EFA—Analysis of dimension ratings with orthogonal rotation resulted in 5 exercise factors | Job performance ratings regressed on 14 dimension scores. Results indicated only 2 dimensions were significant predictors of performance |
| Hinrichs (1969) | EFA—Principal components analysis with non-orthogonal rotation of 12 trait ratings resulting in 3 overlapping factors | Scores based on 3 factors were correlated with relative salary standing, overall management potential, and overall assessment center-based evaluation. The rs ranged from .15 to .78 |
| Huck and Bray (1976) | EFA—Principal components analysis with orthogonal rotation of 16 dimension ratings resulting in 4 factors | Overall assessment rating (rxy with overall performance rating = .41; with rated potential for advancement = .59) for whites. Overall assessment rating (rxy with overall performance rating = .35; with rated potential for advancement = .54) for blacks |
| Jansen and Stoop (2001) | MTMM—Mean within-dimension, cross-exercise r of .28; mean within-exercise, cross-dimension r of .62 | Correlations of average salary growth with dimension scores from each exercise (mean r = .09, min. = −.02, max. = .30) |
| Shore et al. (1992) | Nomological net—Pattern of correlations between self and peer AC dimension ratings and measures of cognitive ability and personality support construct validity | Correlations of job advancement with peer (mean r = .20) and self (mean r = .07) AC ratings |

^a EFA: exploratory factor analysis; CFA: confirmatory factor analysis; MTMM: multitrait-multimethod data.
Only four studies (Chan,
1996; Fleenor, 1996; Henderson, Anderson & Rick, 1995; Jansen & Stoop, 2001) specifi-
cally applied both criterion- and internal construct-related validation strategies to a single
sample and the results of these studies are mixed. Henderson et al. (1995) report results
consistent with a lack of construct-related validity (mean within-dimension, across-exercise
r of .19; mean within exercise, cross-dimension r of .42), but also report that only 2 of 14
dimension ratings were significant predictors of job performance ratings. Similarly, Jansen
and Stoop (2001) report results consistent with a lack of construct-related validity (mean
within-dimension, across-exercise r of .28; mean within exercise, cross-dimension r of .68),
but also report small dimension correlations (mean r = .09, min. r = −.02, max. r = .30)
with average salary growth. Finally, Fleenor (1996) also indicates results consistent with a
lack of construct-related validity (mean within-dimension, across-exercise r of .22; mean
within exercise, cross-dimension r of .42), but also reports small dimension correlations
with performance ratings (mean r = .10 with subordinate ratings, mean r = .15 with
self-ratings, mean r = .17 with supervisor ratings).
Only Chan (1996) reports a lack of construct-related validity (mean within-dimension,
across-exercise r of .07; mean within exercise, cross-dimension r of .71) while also reporting
a significant correlation of the mean assessment center rating with promotion rates but not
with job performance ratings. However, although Chan’s (1996) results suggest a lack of
construct-related validity accompanied by some evidence supportive of criterion-related
validity, there are a number of problematic methodological issues. Specifically, a large
number of both dimensions (14) and exercises (6) were used with a very small sample of
participants (n = 46).

Construct-Related Validity Evidence and Analytic Strategy

With respect to the type of evidence presented for construct-related validity, across all
of the 48 studies, several approaches were indicated with many studies incorporating mul-
tiple analytic strategies. Some type of exploratory factor analysis (typically examining the
number and nature of factors underlying ratings) was used to analyze data from 26 of the
assessment centers, and 16 cases used confirmatory factor analysis (most often evaluating
some form of a MTMM model incorporating dimension and exercise latent variables). Typi-
cal MTMM correlation matrix data (i.e., monotrait-heteromethod and heterotrait-monomethod
rs) were reported for 32 of the assessment centers, while only 6 cases reported using a vari-
ance partitioning approach (i.e., ANOVA) looking at proportions of dimension and exercise
variance. Finally, five studies reported data based on the relationship of assessment cen-
ter dimension ratings with measures of other constructs (a “nomological net” approach
examining patterns of relationships relative to expectations).
Again, somewhat unexpectedly, evidence with respect to the construct-related validity of
assessment center ratings was mixed and tended to depend on the analytic strategy used.
Evidence from the 31 studies reporting traditional MTMM correlation-based data indicated a
mean within-dimension, across-exercise (i.e., monotrait-heteromethod) sample-weighted r
of .34 (SDr = .11) and a mean across-dimension, within-exercise (monomethod-heterotrait)
sample-weighted r of .55 (SDr = .13). Confirmatory factor analyses tended to support models with the hypothesized number of dimension and exercise factors while indicating
higher exercise factor loadings than dimension factor loadings. Accordingly, a consistent
conclusion was that although assessment center ratings may demonstrate some convergent
validity, there was little evidence or support for discriminant validity.
Exploratory factor analyses (EFA) consistently indicated that the number of factors ex-
tracted did not correspond to the number of dimensions assessed. Two potentially problem-
atic issues emerge from our review that limit the interpretability of the EFA results. First,
most of the studies conducting exploratory factor analysis used a rotation forcing orthog-
onality on the factors. As noted by Sackett and Hakel (1979) there is no conceptual basis
for presuming that the factors underlying assessment ratings should be orthogonal. Second,
many of the factor analyses were conducted with extremely small sample sizes relative
to the number of “items” (i.e., dimensions). The median ratio of assessees to dimensions
(items) across studies in our review was 3.37 to 1. Thus, the small sample sizes relative to
the number of items increases the likelihood that factor groupings are simply the result of
sampling error (Nunnally & Bernstein, 1994).
Finally, results from those studies examining the relationship of assessment center dimen-
sion ratings with convergent and discriminant constructs measured by some other method
(e.g., paper-and-pencil measures) tended to provide evidence for both convergent and dis-
criminant validity.
In summary, despite the widely accepted view that assessment center ratings display
criterion-related validity while lacking construct-related validity, the evidence in the liter-
ature is far from clear. The use of some analytical procedures (e.g., correlating paper-and-
pencil measures with associated assessment center dimensions) has provided some
support for the existence of specified assessment center dimensions (constructs). Further-
more, studies that have shown a lack of construct-related validity have typically failed to
provide any criterion-related validity evidence. And for the few that have (we found only
four), the lack of construct-related validity has generally been coupled with a lack of criterion-related
validity. Finally, it is clear that specific design and methodological factors influence the
convergent and discriminant validity of assessment center ratings.

Discussion

The purpose of the present paper was to provide a systematic re-examination of the liter-
ature on which the current view that assessment center ratings demonstrate criterion-related
but not construct-related validity is based. We argue that these findings may be attributable to
certain design and methodological features as opposed to an inherent flaw of the assessment
center as a measurement tool or method. Thus, we examine the methodological character-
istics of studies examining the construct-related validity of assessment centers. We also
argue that the prevailing view that assessment center ratings demonstrate criterion-related
validity but not construct-related validity is inconsistent with current conceptualizations of
validity and the validation process. Thus, we also examine the extent to which evidence
with respect to criterion-related validity and construct-related validity stems from the same
empirical studies.

Methodological/Design Characteristics and Assessment Center Construct-Related Validity

We highlight seven methodological/design characteristics that have been suggested by
previous research to influence the construct-related validity of assessment center ratings.
These characteristics are: the participant-to-assessor ratio, the number of dimensions rated,
the rating approach (within-dimension vs. within-exercise), type of assessor, assessment
center purpose, assessor training, and length of assessor training. These characteristics di-
rectly pertain to the information processing load placed on the assessor and/or the assessors’
ability to accurately deal with the information processing task of rating participants. One
concern of past researchers is that assessors may be required to observe, record, and ag-
gregate too much information (Bycio et al., 1987). The high cognitive demand placed on
assessors may make it difficult for them to process information at the dimension level, thus,
negatively impacting the construct-related validity evidence for the dimensions assessed.
Both the participant-to-assessor ratio and the number of dimensions rated directly determine
the cognitive demand placed on assessors. However, despite their theoretical and conceptual
appeal, we found limited support for the number of dimensions and participant-to-assessor
ratio effects in the meta-analysis. Although a smaller number of dimensions was related
to higher levels of convergent validity as predicted, contrary to our hypothesis, it was also
related to higher levels of discriminant validity. The participant-to-assessor ratio hypothesis
was also not supported by the meta-analysis.
We believe, however, that there are a number of plausible methodological explana-
tions for our failure to support the hypothesized effects for participant-to-assessor ratio
and number of dimensions assessed. First, for the participant-to-assessor ratio effect, al-
though we found that 26 of the 48 studies we coded reported this information, only 16
of the 26 also reported MTMM-based convergent and discriminant validity coefficients.
Thus, only these 16 could be included in the meta-analysis. Furthermore, only 2 of these
16 had a participant-to-assessor ratio greater than the modal ratio found across the 26
studies (i.e., mode participant-to-assessor ratio = 2) and these 2 were only slightly above
the mode (i.e., participant-to-assessor ratio for both was 3). Consequently, it is possible
that the relatively small participant-to-assessor ratios (as well as the very low level of
variability) across the studies included in the meta-analysis led to our failure to find any
effect. That is, it is quite likely that observing and evaluating multiple participants only
imposes a discernibly higher cognitive demand on raters at larger participant-to-assessor
ratios.
Second, our finding that a smaller number of dimensions was related to higher lev-
els of convergent validity (as predicted) but, contrary to our predictions, also related to
higher levels of discriminant validity may be confounded by the level of covariance among
the dimensions assessed. Specifically, Campbell and Fiske’s (1959) original MTMM ap-
proach required the measurement of independent constructs. This requirement is based on
the fact that any intercorrelation among traits is actually reflected in the discriminant va-
lidity estimates. Thus, discriminant validity estimates derived from multiple measures of
non-independent constructs reflect the impact of both trait-based and method-based
covariation. With assessment centers, it is unlikely that the dimensions assessed are truly
independent (Arthur et al., in press; Sackett & Hakel, 1979). Furthermore, the level of
interdependence probably increases as the number of dimensions increases. So our finding
that a smaller number of dimensions was associated with higher levels of discriminant
validity may be the result of higher levels of interrelationship among dimensions as the
number of dimensions increases. Thus, although only partial support was obtained for the
number of dimensions, researchers and practitioners should consider using a smaller num-
ber of dimensions than what currently appears to be common practice (e.g., almost half of
the assessment centers we reviewed used 10 or more dimensions). For instance, in a re-
cent meta-analysis of the criterion-related validity of assessment center dimensions, Arthur
et al. (in press) showed that four dimensions accounted for the criterion-related validity of
assessment center ratings. So researchers and practitioners may be using more dimensions
than may be actually needed (Jones & Whitmore, 1995; Russell, 1985; Sackett & Hakel,
1979).
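The dimension-intercorrelation confound described above is easy to demonstrate with a toy simulation (ours, not data from any reviewed study): even with no exercise or rater effects present, raising the true intercorrelation among the dimensions raises the observed heterotrait correlations, the very quantities typically read as evidence against discriminant validity.

```python
# Toy simulation: as the true correlation among dimensions rises, observed heterotrait
# correlations (the usual discriminant validity estimates) rise with it, even though
# no exercise or method effect is present. All values are illustrative only.
import numpy as np

rng = np.random.default_rng(2)
n_assessees, n_dims = 500, 6

for true_r in (0.0, 0.3, 0.6):
    cov = np.full((n_dims, n_dims), true_r)
    np.fill_diagonal(cov, 1.0)
    true_scores = rng.multivariate_normal(np.zeros(n_dims), cov, size=n_assessees)
    ratings = true_scores + rng.normal(scale=0.5, size=true_scores.shape)  # rating error only
    R = np.corrcoef(ratings, rowvar=False)
    heterotrait = R[np.triu_indices(n_dims, k=1)]
    print(f"true dimension r = {true_r:.1f} -> mean observed heterotrait r = {heterotrait.mean():.2f}")
```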
It should also be noted that cognitive processing difficulties associated with large numbers
of dimensions are likely compounded by the nature and variety of the constructs reflected
in dimension ratings. As noted, we found 129 different dimension labels across the 48
assessment centers examined. Certainly many of these labels reflect synonyms or common
constructs and thus, the number of constructs represented is probably far fewer. However,
the degree to which these constructs represent observable behavior varies greatly (e.g.,
oral communication vs. likeability). To the extent that the constructs used in assessment
centers are not represented by easily observable behaviors, the construct-related validity of
assessment center ratings will likely be adversely affected.
In contrast to the results for the number of dimensions and participant-to-assessor ratio,
our hypotheses for the rating approach, type of assessor, and assessor training were fully
supported. These factors all directly impact the assessors’ ability to accurately deal with the
information processing task represented in assessment centers. As noted previously, Gaugler
et al. (1987) posit that psychologists make better assessors than managers or incumbents.
As a result of their education and training, they are better equipped to observe, record, and
rate behavior. Despite this, our findings indicate that 22 of the 48 assessment centers reviewed
provided no information on the type of assessor and that, of those that did, 81% reported
using managers or supervisors as opposed to psychologists.
We also found wide variability with respect to the rating approach used and the nature and
extent of assessor training.
One potential caveat should be noted with respect to our findings regarding rating ap-
proach. While we found that ratings provided across (as opposed to within) exercises
demonstrated better convergent and discriminant validity outcomes, there is a possible
confound. Specifically, assessment centers designed to incorporate across-exercise ratings
typically have the same rater or group of raters observe the same assessees across mul-
tiple exercises. Alternatively, assessment centers incorporating within-exercise ratings typ-
ically have a rater or raters observe different assessees in a single exercise. It is possible
that our findings with respect to rating approach reflect common “method” variance as-
sociated with the rater. Furthermore, in an across-exercise rating approach, this variance
would manifest as stronger dimension effects, while in a within-exercise approach it would
manifest as stronger exercise effects. Although it was not possible to separate these
effects in the present study, this would likely be an interesting avenue for future
research.
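As a starting point for such research, one could partition rating variance into rater, dimension, and exercise components once the design crosses raters with both assessees and exercises. The sketch below (our illustration, with hypothetical long-format data and column names) uses a simple fixed-effects ANOVA for this purpose; a generalizability-theory or random-effects model would be a natural refinement.

```python
# Sketch only: decomposing rating variance into rater, dimension, and exercise
# components with a fixed-effects ANOVA. The long-format data frame, column names,
# and stand-in ratings are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(3)
rows = [{"rater": r, "dimension": d, "exercise": e, "rating": rng.normal()}  # stand-in ratings
        for r in range(6) for d in range(5) for e in range(4) for _ in range(10)]
df = pd.DataFrame(rows)

fit = smf.ols("rating ~ C(rater) + C(dimension) + C(exercise)", data=df).fit()
table = anova_lm(fit, typ=2)
print(table["sum_sq"] / table["sum_sq"].sum())  # rough proportion-of-variance summary
```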

There are two limitations with the meta-analysis that should be noted. First, to investi-
gate the effects of the methodological characteristics, each was broken down into specified
levels to run the sublevel analyses. Although there is no standard for the minimum number
of data points required for a stable and interpretable meta-analysis, breaking variables
into sublevels sometimes yields a small number of data points, which can produce
second-order sampling error (Arthur et al., 2001; Hunter & Schmidt, 1990). Consequently,
because the levels of some of our methodological characteristics had a small number of
data points, their associated results should be cautiously interpreted. Second, to permit the
sublevel analyses, variables which were originally continuous (i.e., number of dimensions,
participant-to-assessor ratio, length of assessor training) had to be categorized. For the num-
ber of dimensions and the participant-to-assessor ratio, this was accomplished by using a
median split. Because of the problems associated with this procedure, we reanalyzed these
methodological characteristics by correlating each with the convergent and discriminant
validity coefficients across studies. The results of these correlational analyses replicated
those obtained for the meta-analysis.
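The supplementary correlational analysis described here is straightforward; the sketch below (our illustration, with entirely hypothetical study-level values) simply correlates an originally continuous moderator with the study-level convergent and discriminant validity coefficients rather than median-splitting it.

```python
# Sketch of the supplementary analysis: correlate a continuous moderator with
# study-level validity coefficients instead of median-splitting it. All values
# below are randomly generated placeholders, not results from the review.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_dimensions = np.array([5, 8, 9, 11, 14, 17, 18, 25])          # per-study moderator (hypothetical)
convergent_r = rng.uniform(0.20, 0.50, size=n_dimensions.size)   # stand-in monotrait-heteromethod rs
heterotrait_r = rng.uniform(0.40, 0.70, size=n_dimensions.size)  # stand-in heterotrait-monomethod rs

for label, y in (("convergent", convergent_r), ("heterotrait (discriminant)", heterotrait_r)):
    r, p = stats.pearsonr(n_dimensions, y)
    print(f"number of dimensions vs. {label} coefficients: r = {r:.2f} (p = {p:.3f})")
```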
Our primary goal in the present study was to provide a systematic review of the assess-
ment center literature with respect to specific design and methodological characteristics
that potentially moderate the validity of assessment center ratings. Given the results of this
review, we believe that studies that directly manipulate specific characteristics, where
feasible, have much to contribute to our understanding of assessment centers. Our findings
suggest a number of features and characteristics that may impact the validity of assessment
centers. We believe that future research should be directed toward systematically examining
design factors that influence the psychometric properties of assessment centers. These de-
sign factors include, but are not limited to, the number of dimensions used, characteristics
of assessors, how ratings are made (across-exercise vs. within-exercise), assessment center
purpose, and assessor training. Other potential moderators of the convergent/discriminant
validity of assessment centers that should also be examined include the use of behavior
checklists (Reilly et al., 1990), and the non-transparency of dimensions (Kleinmann, 1993;
Kleinmann, Kuptsch & Koller, 1996).

Co-occurrence of Construct-Related and Criterion-Related Validity Evidence

With respect to the extent to which findings pertaining to construct- and criterion-related
validity stem from a common literature, our results indicate that this evidence is largely
drawn from independent bodies of research. There is, of course, nothing inherently wrong
with this approach. Given that the lack of construct-related validity evidence in the presence
of criterion-related and content-related validity evidence is inconsistent with the unitarian
view of validity, it is conceivable that studies in which there was a lack of construct-related
validity may also have demonstrated a lack of criterion-related validity (which would be
consistent with the unitarian view). Because the evidence is drawn from largely independent
research studies, however, this possibility cannot be ruled out. Indeed, we found only
four studies that reported both criterion-related and construct-related validity data. And of
these, only one (Chan, 1996) reported support for criterion-related validity in the absence of
construct-related validity. For the other three (Fleenor, 1996; Henderson et al., 1995; Jansen
& Stoop, 2001), lack of construct-related validity was coupled with a lack of criterion-related
validity. Thus, although our results do not disprove the prevailing view of assessment center
validity, they do raise serious concerns about its veridicality. We believe that future research
should also be directed at providing simultaneous examinations of multiple evidential bases
of validity.
Our re-examination of the literature on which the current view, that assessment cen-
ter ratings demonstrate criterion-related (and content-related) validity evidence but not
construct-related validity evidence, is based suggests that these findings may be attributable
to certain design and methodological features as opposed to an inherent flaw of the assess-
ment center as a measurement tool. Alternatively, there may be other plausible explanations
for the presence of assessment center content- and criterion-related validity in the absence of
convergent and discriminant validity. One such explanation focuses on the idea of construct
misspecification/misidentification (Raymark & Binning, 1997; Russell & Domm, 1995).
Thus, instead of measuring the targeted constructs of interest (e.g., team building, flexibility,
influencing others), assessment centers may unwittingly be measures of unspecified
constructs such as self-monitoring or impression management (Church, 1997;
Cronshaw & Ellis, 1991), or researchers may be misinterpreting the nature of "exercise" factors
(Ladd et al., 2002; Lance et al., 2000). In the self-monitoring example, the actual explanatory
variable is a "deeper" source trait operating at a different nomological level than the
assessment center constructs ostensibly being measured.
On one hand, this construct misspecification hypothesis has yet to receive extensive
empirical attention and appears to be an area worthy of future research (cf. Arthur and
Tubre (2002); see also Russell and Domm’s (1995) investigation of role congruency as
a plausible explanatory construct). On the other hand, in our opinion, a potential prob-
lem with this reconciliation or explanation for the assessment center validity paradox is
that it may have dire implications for the current use of assessment centers as training
and development interventions. Specifically, the use of assessment centers in this manner
is largely predicated on the assumption that they are indeed measuring the specified tar-
geted dimensions (e.g., team building, flexibility, influencing others) and consequently,
developmental feedback reports and interviews, and individual development plans are
all designed and developed around these dimensions. Have all of these efforts
been fundamentally misguided? Is this important use of assessment centers fundamen-
tally flawed? We think not. Although conceptually plausible, the misspecification hypoth-
esis has yet to be demonstrated empirically. Furthermore, given the data and arguments
presented in the present study, we are inclined to believe that attributing the lack
of discriminant and convergent validity to development and design factors is a
more parsimonious explanation for the assessment center construct validity
paradox.
In conclusion, our findings question the prevailing view that assessment center ratings
do not demonstrate construct-related validity, and instead lead us to conclude that
as measurement tools, assessment centers are probably only as good as their development,
design, and implementation. Furthermore, we believe that the assessment center is a method,
and like any method there will be variability in its implementation. Thus, future research
needs to be directed toward systematically examining design factors that influence the
psychometric properties of assessment centers and at the simultaneous examination of
multiple evidential bases of validity.
Appendix A. Summary of Design-Related Characteristics of Studies that Investigated Convergent/Discriminant
Validity for Assessment Center Ratings

Study^a  Sample size  Participant-to-assessor ratio  Number of dimensions  Number of exercises  Within- vs. across-exercise rating  Assessor occupation  Purpose  Assessor training  Length of training (in days)
Archambeau (1979)a 29 2:1 10 5 Within Managers Selection Yes 5
Arthur et al. (2000)a 149 2:1 9 4 Within Psychologists Feedback Yes 2
Bray and Grant (1966)
College graduates 207 3:1 25 3b Across Managers Selection Yes ?
Non-college graduates 148 3:1 25 3b Across Managers Selection Yes ?
Bycio et al. (1987)a 1170 1:1 8 5 Within Supervisors Feedback ? ?
Carless and Allwood (1997) 875 4:1 5 5 Across ? Feedback ? ?
Chan (1996)a 46 1:2 14 6b Within Managers Selection Yes 2
Crawley et al. (1990)a
A 117 ? 13 5 ? ? Selection Yes 1
B 157 ? 9 6 ? ? Selection Yes 1
Donahue et al. (1997)a 188 2:1 9 4 Within Managers Selection Yes 1
Fleenor (1996)a 102 ? 10 8 Within ? Feedback Yes 5
Harris et al. (1993)a
A 237 ? 7 6 Within ? Feedback Yes 4
B 556 ? 7 6 Across ? Feedback Yes 4
C 63 ? 7 6 Across ? Feedback Yes 4
Henderson et al. (1995)a 311 ? 14 4b Within ? Selection ? ?
Hinrichs (1969) 47 ? 12 6b Across Managers ? ? ?
Huck and Bray (1976)
Whites 241 ? 18 4b Across Managers Selection Yes ?
Blacks 238 ? 18 4b Across Managers Selection Yes ?
Jansen and Stoop (2001)a 581 1:1 3 2 Within Managers Selection Yes 1
Joyce et al. (1994)a
Time 1 75 2:1 7 4 Within ? Selection Yes 5
Time 2 75 2:1 7 4 Within ? Selection Yes 7
Kudisch et al. (1997)a 138 1:2 12 4 Within I/O grad students Feedback Yes 3
Nedig et al. (1979) 260 2:1 19 6 Within 2nd level managers Selection ? ?
Reilly et al. (1990)a
A 120 2:1 8 8 Within ? Selection Yes ?
B 235 2:1 8 8 Within ? Selection Yes ?
Robertson et al. (1987)a
A 41 ? 10 5 Within ? Selection ? ?
B 48 ? 8 4 Within ? Selection ? ?
C 84 ? 8 3 Within ? Selection ? ?
D 49 ? 11 4 Within ? Selection ? ?
Russell (1985) 200 ? 18 4 Across Managers Selection Yes 15
Russell (1987)a 75 ? 9 4 Within ? Selection ? ?
Sackett and Dreher (1982)a
A 86 ? 8 6 Within ? Selection ? ?
B 311 ? 16 6 Within ? Selection ? ?
C 162 ? 9 6 Within ? Selection ? ?
Sackett and Hakel (1979) 719 2:1 17 ? Across Managers Selection ? ?
Sackett and Harris (1988)
A 346 ? 8 3 Within ? Selection ? ?
B 51 ? 7 6 Within ? Selection ? ?
Sagie and Magnezy (1997)
A 425 2:1 5 3 Within Psychologists Selection Yes 2
B 425 2:1 5 3 Within Managers Selection Yes 2
Schmitt (1977) 101 ? 17 4 Within Managers Selection Yes ?
Schneider and Schmitt (1992)a 89 1:2 3 4 Within Teachers & administration Feedback Yes ?
Shore et al. (1992) 394 2:1 11 5 Across Managers Selection Yes ?
Shore et al. (1990) 441 2:1 11 5 Across Psychologists Selection ? ?
Silverman et al. (1986)a
A 45 1:4 6 3 Within Managers Selection Yes 1.5
B 45 1:4 6 3 Across Managers Selection Yes 1.5
Thornton et al. (1997) 382 2:1 16 8 Across High level managers and psychologists Selection Yes ?
Turnage and Muchinsky (1982)
A 1028 3:1 8 5 Within Managers Selection Yes 2
B 1028 3:1 8 5 Within Managers Selection Yes 2
Note: (?): could not be determined from the information provided.
^a Studies that provided MTMM-based convergent (mean monotrait-heteromethod r) and discriminant (mean heterotrait-monomethod r) validity data.
^b Used paper-and-pencil measures; these were not included in the total number of exercises. Within- vs. across-exercise = method used to obtain dimension ratings (within-exercise method = rating all dimensions within an exercise before proceeding to the next exercise; across-exercise method = rating a dimension across all exercises before proceeding to the next dimension).

References

References marked with an asterisk indicate studies included in the review.


American Educational Research Association, American Psychological Association, & National Council on
Measurement in Education. 1999. Standards for educational and psychological testing. Washington, DC:
American Educational Research Association.
∗ Archambeau, D. J. 1979. Relationships among skill ratings assigned in an assessment center. Journal of

Assessment Center Technology, 2: 7–19.


Arthur, W., Jr., Bennett, W., Jr., & Huffcutt, A. I. 2001. Conducting meta-analysis using SAS. Mahwah, NJ:
Lawrence Erlbaum.
Arthur, W., Jr., Day, E. D., McNelly, T. L., & Edens, P. In press. Meta-analysis of the criterion-related validity of
assessment center dimensions. Personnel Psychology.
Arthur, W., Jr., & Tubre, T. C. 2002. The assessment center construct-related validity paradox: A case of construct
misspecification? Manuscript submitted for publication.
∗ Arthur, W., Jr., Woehr, D. J., & Maldegan, R. M. 2000. Convergent and discriminant validity of assessment center

dimensions: A conceptual and empirical reexamination of the assessment center construct-related validity
paradox. Journal of Management, 26: 813–835.
Binning, J. F., & Barrett, G. V. 1989. Validity of personnel decisions: A conceptual analysis of the inferential and
evidential bases. Journal of Applied Psychology, 74: 478–494.
Born, M. P., Kolk, N. J., & van der Flier, H. 2000. A meta-analytic study of assessment center construct validity.
Paper presented at the 15th annual conference of the Society for Industrial and Organizational Psychology,
New Orleans, LA.
Brannick, M. T., Michaels, C. E., & Baker, D. P. 1989. Construct validity of in-basket scores. Journal of Applied
Psychology, 74: 957–963.
∗ Bray, D. W., & Grant, D. L. 1966. The assessment center in the measurement of potential for business management.

Psychological Monographs: General and Applied, 80: 1–27.


Bretz, R. D., Jr., Milkovich, G. T., & Read, W. 1992. The current state of performance appraisal research and
practice: Concerns, directions, and implications. Journal of Management, 18: 321–352.
∗ Bycio, P., Alvares, K. M., & Hahn, J. 1987. Situation specificity in assessment center ratings: A confirmatory

analysis. Journal of Applied Psychology, 72: 463–474.


Campbell, D. T., & Fiske, D. W. 1959. Convergent and discriminant validation by the multitrait-multimethod matrix.
Psychological Bulletin, 56: 81–105.
∗ Carless, S. A., & Allwood, V. E. 1997. Managerial assessment centres: What is being rated? Australian

Psychologist, 32: 101–105.


∗ Chan, D. 1996. Criterion and construct validation of an assessment centre. Journal of Occupational and

Organizational Psychology, 69: 167–181.


Church, A. H. 1997. Managerial self-awareness in high-performing individuals in organizations. Journal of Applied
Psychology, 82: 281–292.
∗ Crawley, B., Pinder, R., & Herriot, P. 1990. Assessment centre dimensions, personality and aptitudes. Journal of

Occupational and Organizational Psychology, 63: 211–216.


Cronshaw, S. F., & Ellis, R. J. 1991. A process investigation of self-monitoring and leader emergence. Small Group
Research, 22: 403–420.
∗ Donahue, L. M., Truxillo, D. M., Cornwell, J. M., & Gerrity, M. J. 1997. Assessment center construct validity

and behavioral checklists: Some additional findings. Journal of Social Behavior and Personality, 12: 85–108.
∗ Fleenor, J. W. 1996. Constructs and developmental assessment centers: Further troubling empirical findings.

Journal of Business and Psychology, 10: 319–335.


Gaugler, B. B., Rosenthal, D. B., Thornton, G. C., Jr., & Bentson, C. 1987. Meta-analysis of assessment center
validity. Journal of Applied Psychology, 72: 493–511.
Gaugler, B. B., & Thornton, G. C., Jr. 1989. Number of assessment center dimensions as a determinant of assessor
generalizability of the assessment center ratings. Journal of Applied Psychology, 74: 611–618.
∗ Harris, M. M., Becker, A. S., & Smith, D. E. 1993. Does the assessment center scoring method affect the

cross-situational consistency of ratings? Journal of Applied Psychology, 78: 675–678.



∗ Henderson, F., Anderson, A., & Rick, S. 1995. Future competency profiling: Validating and redesigning the ICL
graduate assessment centre. Personnel Review, 24: 19–31.
∗ Highhouse, S., & Harris, M. M. 1993. The measurement of assessment center situations: Bem’s template matching

technique for examining exercise similarity. Journal of Applied Social Psychology, 23: 140–155.
∗ Hinrichs, J. R. 1969. Comparison of “real life” assessments of management potential with situational exercises,

paper-and-pencil ability tests, and personality inventories. Journal of Applied Psychology, 53: 425–432.
Howard, A. 1997. A reassessment of assessment centers: Challenges for the 21st century. Journal of Social
Behavior and Personality, 12: 13–52.
Huck, J. R., & Bray, D. W. 1976. Management assessment center evaluations and subsequent job performance of
white and black females. Personnel Psychology, 29: 13–30.
Hunter, J. E., & Schmidt, F. L. 1990. Methods of meta-analysis: Correcting error and bias in research findings.
Newbury Park, CA: Sage.
Ilgen, D. R., Barnes-Farrell, J. L., & McKellin, D. B. 1993. Performance appraisal process research in the 1980s:
What has it contributed to appraisals in use? Organizational Behaviors and Human Decision Processes, 54:
321–368.
∗ Jansen, P. G. W., & Stoop, B. A. M. 2001. The dynamics of assessment center validity: Results of a 7-year study.

Journal of Applied Psychology, 86: 741–753.


Jones, R. G. 1992. Construct validation of assessment center final dimension ratings: Definition and measurement
issues. Human Resources Management Review, 2: 195–220.
Jones, R. G., & Whitmore, M. D. 1995. Evaluating developmental assessment centers as interventions. Personnel
Psychology, 48: 377–388.
∗ Joyce, L. W., Thayer, P. W., & Pond, S. B., III. 1994. Managerial functions: An alternative to traditional assessment

center dimensions? Personnel Psychology, 47: 109–121.


∗ Kudisch, J. D., Ladd, R. T., & Dobbins, G. H. 1997. New evidence on the construct validity of diagnostic

assessment centers: The findings may not be so troubling after all. Journal of Social Behavior and Personality,
12: 129–144.
Kleinmann, M. 1993. Are rating dimensions in assessment centers transparent for participants? Consequences for
criterion and construct validity. Journal of Applied Psychology, 78: 988–993.
Kleinmann, M., Kuptsch, C., & Koller, O. 1996. Transparency: A necessary requirement for the construct validity
of assessment centers. Applied Psychology: An International Review, 45: 67–84.
Klimoski, R., & Brickner, M. 1987. Why do assessment centers work? The puzzle of assessment center validity.
Personnel Psychology, 40: 243–259.
Ladd, R. T., Atchley, E. K., Gniatczyk, L. A., & Baumann, L. B. 2002. An evaluation of the construct validity of an
assessment center using multiple-regression importance analysis. Paper presented at the 17th annual meeting
of the Society for Industrial/Organizational Psychology, Toronto, Canada, April 2002.
Lance, C. E., Newbolt, W. H., Gatewood, R. D., Foster, M. S., French, N. R., & Smith, D. E. 2000. Assessment
center exercise factors represent cross-situational specificity, not method bias. Human Performance, 13: 323–
353.
Landy, F. J. 1986. Stamp collecting versus science: Validation as hypothesis testing. American Psychologist, 41:
1181–1192.
Landy, F. J., & Farr, J. L. 1980. Performance rating. Psychological Bulletin, 87: 72–107.
Lawshe, C. H. 1985. Inferences from personnel tests and their validity. Journal of Applied Psychology, 70: 237–238.
Lievens, F. 1998. Factors which improve the construct validity of assessment centers: A review. International
Journal of Selection and Assessment, 6: 141–152.
Lievens, F. 2001. Assessors and use of assessment center dimensions: A fresh look at a troubling issue. Journal
of Organizational Behavior, 22: 203–221.
Lowry, P. E. 1997. The assessment center process: New directions. Journal of Social Behavior & Personality, 12:
53–62.
Martell, R. F. 1991. Sex bias at work: The effects of attentional and memory demands on performance ratings of
men and women. Journal of Applied Social Psychology, 21: 1939–1960.
Messick, S. J. 1989. Validity. In R. L. Linn (Ed.), Educational measurement: 13–103. New York: Macmillan.
Messick, S. J. 1995. The validity of psychological assessment: Validation of inferences from persons’ responses
and performances as scientific inquiry into score meaning. American Psychologist, 50: 741–749.

Messick, S. J. 1998. Alternative modes of assessment, uniform standards of validity. In M. D. Hakel (Ed.), Beyond
multiple choice: Evaluating alternatives to traditional testing for selection: 59–74. Mahwah, NJ: Lawrence
Erlbaum.
Miller, G. A. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing
information. Psychological Review, 63: 81–97.
∗ Nedig, R. D., Martin, J. C., & Yates, R. E. 1979. The contribution of exercise skill ratings to final assessment

center evaluations. Journal of Assessment Center Technology, 2: 21–23.


Noonan, L. E., & Sulsky, L. M. 2001. Impact of frame-of-reference and behavioral observation training on
alternative training effectiveness criteria in a Canadian military sample. Human Performance, 14: 2–36.
Nunnally, J. C., & Bernstein, I. H. 1994. Psychometric theory (3rd ed.). New York: McGraw-Hill.
Raymark, P. H., & Binning, J. F. 1997. Explaining assessment center validity: A test of the criterion contamination
hypothesis. Paper presented at the 1997 Academy of Management meeting, Boston, MA.
∗ Reilly, R. R., Henry, S., & Smither, J. W. 1990. An examination of the effects of using behavior checklists on the

construct validity of assessment center dimensions. Personnel Psychology, 43: 71–84.


∗ Robertson, I. T., Gratton, L., & Sharpley, D. 1987. The psychometric properties and design of managerial

assessment centres: Dimensions into exercises won’t go. Journal of Occupational Psychology, 60: 187–195.
Robie, C., Adams, K. A., Osburn, H. G., Morris, M. A., & Etchegaray, J. M. 2000. Effects of the rating process
on the construct validity of assessment center dimension evaluations. Human Performance, 13: 355–370.
∗ Russell, C. J. 1985. Individual decision processes in an assessment center. Journal of Applied Psychology, 70:

737–746.
∗ Russell, C. J. 1987. Person characteristics vs. role congruency explanations for assessment center ratings. Academy

of Management Journal, 30: 817–826.


Russell, C. J., & Domm, D. R. 1995. Two field tests of an explanation of assessment centre validity. Journal of
Occupational and Organizational Psychology, 68: 25–47.
Sackett, P. R. 1987. Assessment centers and content validity: Some neglected issues. Personnel Psychology, 40:
13–25.
∗ Sackett, P. R., & Dreher, G. F. 1982. Constructs and assessment center dimensions: Some troubling empirical

findings. Journal of Applied Psychology, 67: 401–410.


∗ Sackett, P. R., & Hakel, M. D. 1979. Temporal stability and individual differences in using assessment center

information to form overall ratings. Organizational Behavior and Human Performance, 23: 120–137.
∗ Sackett, P. R., & Harris, M. M. 1988. A further examination of the constructs underlying assessment center

ratings. Journal of Business and Psychology, 3: 214–229.


∗ Sagie, A., & Magnezy, R. 1997. Assessor type, number of distinguishable categories, and assessment center

construct validity. Journal of Occupational and Organizational Psychology, 70: 103–108.


Schleicher, D. J., & Day, D. V. 1998. A cognitive evaluation of frame-of-reference rater training: Content and
process issues. Organizational Behavior and Human Decision Processes, 73: 76–101.
∗ Schmitt, N. 1977. Interrater agreement in dimensionality and combination of assessment center judgments.

Journal of Applied Psychology, 62: 171–176.


Schmitt, N., Schneider, J. R., & Cohen, S. A. 1990. Factors affecting validity of a regionally administered
assessment center. Personnel Psychology, 43: 1–12.
∗ Schneider, J. R., & Schmitt, N. 1992. An exercise design approach to understanding assessment center dimension

and exercise constructs. Journal of Applied Psychology, 77: 32–41.


∗ Shore, T. H., Shore, L. M., & Thornton, G. C., III. 1992. Construct validity of self and peer evaluations of

performance dimensions in an assessment center. Journal of Applied Psychology, 77: 42–54.


∗ Shore, T. H., Thornton, G. C., III, & Shore, L. M. 1990. Construct validity of two categories of assessment center

dimension ratings. Personnel Psychology, 43: 101–116.


∗ Silverman, W. H., Dalessio, A., Woods, S. B., & Johnson, R. L., Jr. 1986. Influence of assessment center methods

on assessors’ ratings. Personnel Psychology, 39: 565–578.


Society for Industrial and Organizational Psychology Inc. 2002. Principles for the validation and use of personnel
selection procedures (4th ed.). Retrieved March 14th, 2002, from http://www.Siop.org/Principles.
Spychalski, A. C., Quiñones, M. A., Gaugler, B. B., & Pohley, K. 1997. A survey of assessment center practices
in organizations in the United States. Personnel Psychology, 50: 71–90.
Thornton, G. C., III, & Byham, W. C. 1982. Assessment centers and managerial performance. New York: Academic
Press.

∗ Thornton, G. C., III, Tziner, A., Dahan, M., Clevenger, J. P., & Meir, E. 1997. Construct validity of assessment
center judgments: Analysis of the behavioral reporting method. Journal of Social Behavior and Personality,
12: 109–128.
∗ Turnage, J. J., & Muchinsky, P. M. 1982. Transsituational variability in human performance within assessment

centers. Organizational Behavior and Human Performance, 30: 174–200.


Whitener, E. M. 1990. Confusion of confidence intervals and credibility intervals in meta-analysis. Journal of
Applied Psychology, 75: 315–321.
Woehr, D. J. 1992. Performance dimension accessibility: Implications for rating accuracy. Journal of
Organizational Behavior, 13: 357–367.
Woehr, D. J., & Feldman, J. M. 1993. Processing objective and question order effects on the causal relation between
memory and judgment: The tip of the iceberg. Journal of Applied Psychology, 78: 232–241.
Woehr, D. J., & Huffcutt, A. I. 1994. Rater training for performance appraisal: A meta-analytic review. Journal of
Occupational and Organizational Psychology, 67: 189–205.
Woehr, D. J., & Roch, S. 1996. Context effects in performance evaluation: The impact of ratee gender and
performance level on performance ratings and behavioral recall. Organizational Behavior and Human Decision
Processes, 66: 31–41.

David J. Woehr is currently a Professor of Management at The University of Tennessee. He
received his Ph.D. in Industrial/Organizational Psychology from the Georgia Institute of
Technology in 1989. Dr. Woehr served on the faculty of the Psychology Department in the
I/O Psychology program at Texas A&M University from 1988 to 1999. He has also served
as a Visiting Scientist to the Air Force Human Resource Laboratory and as a consultant
to private industry. His research on job performance measurement, work-related attitudes
and behavior, training development, and quantitative methods has appeared in a variety of
books and journals, as papers presented at professional meetings, and as technical reports.

Winfred Arthur Jr. is currently a Professor of Psychology and Management at Texas A&M
University. He received his Ph.D. in Industrial/Organizational Psychology from the University
of Akron in 1988. His research interests are in the areas of personnel psychology; testing,
selection, and validation; team selection and training; training development, design, delivery,
and evaluation; human performance and complex skill acquisition and retention; models of
job performance; and meta-analysis.
