EPPP Test Construction

Test construction is not a heavily emphasized area on the psychology licensing exam, so we’ll review
the terms, concepts, and procedures that you're most likely to be asked about on the exam, and
we’ll practice several questions that are similar to exam questions in terms of content, format, and
difficulty level. You'll find the information most helpful if you read the test construction section of
the written study materials first so that you are familiar with the language and have a point of
reference for the content. The test construction domain addresses the adequacy of tests used for
measuring psychological phenomena. When we talk about adequacy, we are assessing the test’s
levels of reliability and validity according to classical test theory. In other words, are the instruments
consistently measuring what we intend for them to measure? This is the main question we will be
addressing today. Let’s get started.

First, it’s important to differentiate between the two predominant schools of thought with regard to
test construction and evaluation. These are classical test theory and item response theory. Classical
test theory has a much longer history than item-response theory, and at the same time, its
theoretical assumptions are considered to be much weaker than those of item response theory. In
classical test theory, the focus is on test-level information such as how an individual’s obtained score
is representative of their true score; its approaches to validity and reliability focus on the test as a
whole. However, with item response theory, there’s a larger focus on item-level information and the
theoretical assumptions are more stringent. You will want to refer to your written study materials
for additional topics within item response and classical test theories.

A test is defined as a systematic procedure for measuring a sample of an individual's behavior. The
first part of this definition indicates that a test is a systematic procedure. There are at least three
ways in which a test is systematic. First, a test’s content is systematic because it is chosen in a
systematic manner from the domain of interest. Second, the administration procedures for a test are
systematic because they are standardized, which means that the test developer has developed and
provided specific guidelines for administering the test. Third, the scoring of a test is systematic
because the test developer has specified clear rules or steps for evaluating an examinee’s responses
to test items. The second half of the definition indicates that when we administer a test, we are
measuring a sample of the behavior of interest. For example, the EPPP, a test designed to measure
knowledge of psychology, will not assess knowledge of all of the terms, concepts, and theories from
the field, but will measure familiarity with a sample of those terms, concepts, and theories.

Because a test is a sample of a larger “population” of behavior, two problems arise. First, the items
included in the test must accurately and thoroughly represent the behavior or phenomena we want
them to represent. For example, does the Beck Depression Inventory assess symptoms of
depression? This is often a difficult goal to achieve. So the first problem in test construction is the
problem of validity—that is, how do we know if the test accurately and adequately measures what
we want it to measure? Second, when we use a test, we want to know that each examinee would
have obtained the same score if they had taken the test at a different time or had responded to a
different sample of items. In other words, we want to know that the test is reliable, meaning each
examinee's score on the test closely reflects their true score. When you weigh yourself on a scale five times in a
row, you want the measurement to be consistent; if you got on each time and the weight varied by
25-50 pounds, you would throw that scale in the trash. Moreover, if an examinee obtains a low score
on a psychology test, for instance, we would want to know that the examinee’s low score reflects
their actual level of knowledge and is not the result of error that was caused by distractions that
occurred during test administration or the fact that some of the items were unclear or ambiguous.

Before we use a test, we want to make sure that it is both valid and reliable. There are several ways
to evaluate validity and reliability, and the methods used for a particular test depend on several
factors, including the nature and purpose of the test. In other words, different methods for
evaluating validity and reliability are most appropriate for different types of tests.

Let's take a closer look at reliability. Reliability is the amount of consistency or dependability in the
scores. To understand what reliability is, you need to understand the concepts of a true score and
measurement error. According to classical test theory, an individual’s score on any given measure is
a combination of their true score and error. A true score is the score a person would get on a test if
the test were perfectly reliable and test performance was not affected by extraneous factors or
error. In the context of reliability, error is referred to as measurement error and includes all factors
that are irrelevant to the behavior measured by the test and that affect the examinee's scores in a random
or unpredictable way. As an example, assume that an examinee obtains three different scores on the
same test of self-esteem when it is administered three times at one-week intervals. Assuming that
self-esteem is a stable characteristic, the discrepancies in the examinee’s scores must be due to
measurement error. Some of the items may have been confusing, for instance, and the examinee
interpreted and answered them differently each time he took the test. Keep in mind that
measurement error is always random and affects test scores in an unsystematic or unpredictable
way. For example, distractions outside the testing room can serve as a source of random error. For
some examinees, distractions will decrease performance on a test; for some, distractions may
improve performance because they increase the examinee's level of arousal; and for others,
distractions will have no effect on test performance.

Since no psychological or educational test is perfectly reliable, examinees' scores on a test are
always a function of their true score plus the effects of measurement error, or X = T + E, where X
is an examinee's obtained score, T is their true score, and E is measurement error. At the level of a
sample of examinees, the same decomposition applies to variability: the variability in obtained scores
is the sum of the variability due to differences in true scores and the variability due to measurement
error. This equation tells us that each examinee's obtained score is a combination of true score and error.
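
If it helps to see this decomposition in action, here is a brief Python sketch. The numbers are made up purely for illustration (a true-score spread of 15 points and an error spread of 5 points); it simulates obtained scores as true score plus random error and shows that reliability is the share of obtained-score variance that comes from true scores.

```python
import numpy as np

# Illustrative simulation of X = T + E (hypothetical numbers, not from the lecture).
rng = np.random.default_rng(0)

true_scores = rng.normal(loc=100, scale=15, size=10_000)   # T: stable true scores
error = rng.normal(loc=0, scale=5, size=10_000)            # E: random measurement error
obtained = true_scores + error                             # X = T + E

# Reliability is the proportion of obtained-score variance that is true-score variance.
reliability = true_scores.var() / obtained.var()
print(round(reliability, 2))   # roughly 15^2 / (15^2 + 5^2) = 0.90
```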

What we want to know when we construct or use a test is the degree to which examinees’ obtained
scores represent true scores rather than error. This is what we're determining when we evaluate a
test’s reliability. Unfortunately, there's no direct way to measure reliability, so it has to be estimated
using an indirect method that measures the consistency of scores across time, different versions of
the test, different items, or different scorers. This method of estimating reliability is based on the
assumption that true scores are consistent, while measurement error is inconsistent. So, the more
consistent test scores are, the greater the test’s reliability.

There are several ways to estimate a test's reliability, and most involve calculating a correlation
coefficient from the test scores obtained by a single sample of examinees. This coefficient is called
the reliability coefficient, and it is symbolized with the lowercase r that has a subscript containing
two of the same letters, usually xx or yy. The identical letters in the subscript indicate that we have
correlated the test with itself rather than with another measure and consequently, that the
correlation coefficient is a reliability coefficient.

Because of the way the reliability coefficient is calculated, it differs from other correlation
coefficients in a couple of important ways. First, the reliability coefficient is already a squared
number, and as a result, it ranges in value from 0 to +1. Second, since the reliability coefficient is
already squared, it is never squared to interpret it. Instead, a reliability coefficient is interpreted
directly as the proportion of variability in obtained test scores that is due to true score variability.
For instance, if a test has a reliability coefficient of .90, this means that 90% of variability in the test
scores obtained by a group of examinees reflects variability in their true scores, while the remaining
variability, that is 10%, is due to measurement error. Note that for most types of tests, a reliability
coefficient of .80 or higher is considered adequate, but this reliability coefficient does not provide us
with any information about what the test is actually measuring. Reliability only indicates the degree
of consistency with which the test measures whatever it is measuring. To determine if a test is
actually measuring what it was designed to measure, we have to evaluate the test’s validity, which
we will talk about later. Let's take a look at the four main methods for evaluating reliability, which
include test-retest, alternate forms, internal consistency, and inter-rater. For each of these methods,
you will want to know how it is carried out, the coefficient it produces, and its limitations, including
when its use would not be appropriate.

Test-retest reliability is evaluated by administering the test to the same group of examinees on two
different occasions, and then correlating the two sets of scores. The resulting test-retest reliability
coefficient provides information on the consistency, or stability, of scores over time and is also
known as the coefficient of stability. Test-retest reliability is most appropriate for tests designed to
measure attributes that are stable over time, such as aptitude, interest, and personality traits. It is
not appropriate for characteristics that fluctuate over short periods of time, and it is not a suitable
method when performance on the second administration of the test is likely to be affected by the
first administration— for instance, when the test is susceptible to memory or practice effects that
will affect each examinee differently.
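
To make the procedure concrete, here is a minimal Python sketch with hypothetical scores for five examinees. The test-retest coefficient (coefficient of stability) is simply the Pearson correlation between the two administrations; correlating scores from two equivalent forms in the same way would give the coefficient of equivalence described next.

```python
import numpy as np

# Hypothetical scores for the same five examinees on two administrations of a test.
time_1 = np.array([52, 47, 61, 38, 55])
time_2 = np.array([50, 49, 59, 40, 57])

# The test-retest reliability coefficient (coefficient of stability) is simply
# the Pearson correlation between the two sets of scores.
r_xx = np.corrcoef(time_1, time_2)[0, 1]
print(round(r_xx, 2))
```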

The second method for estimating reliability is known by several names including alternate, parallel,
and equivalent forms reliability. When using this method, we administer two equivalent forms of the
test to the same sample of examinees, and then correlate the two sets of scores. The resulting
alternate forms reliability coefficient is also known as the coefficient of equivalence. It tells us how
consistent examinees’ scores are across the different samples of the attribute measured by the two
versions of the test. Like test-retest reliability, alternate forms reliability is appropriate for tests
designed to measure stable characteristics, but it is not suitable for tests designed to measure
attributes that fluctuate over time or when exposure to the content of one form of the test will
affect examinees’ performance on the second form.

The third method for evaluating reliability is internal consistency reliability. There are three methods
for assessing internal consistency that you want to be familiar with for the licensing exam: split-half
coefficient, Cronbach’s coefficient alpha, and KR-20. All three involve administering a test once to a
single sample of examinees and using the obtained scores to calculate a coefficient of internal
consistency. This indicates how consistent examinees’ responses are to the different items included
in the test. When using the split-half method, we would administer a single form of the test to a
group of examinees, and then divide the test into two halves. A common method for dividing a test
is by odd and even numbered items. When we do this, each examinee has two scores, one for the
odd numbered items and one for the even numbered items. We then correlate the two sets of
scores to obtain a split-half reliability coefficient. An important limitation of this method is that
shorter tests are less reliable than longer ones, and when assessing split-half reliability, we end up
correlating two forms of the test that are each one-half as long as the original test. Consequently,
the split-half reliability coefficient tends to underestimate the test’s internal consistency reliability.
To estimate what the reliability coefficient would be if it were based on the full length of the test,
the Spearman-Brown prophecy formula is often used. You don't need to memorize this formula, but
you do want to know that it's often used to correct the split-half reliability coefficient and can be
used whenever we want to estimate the effect of shortening or lengthening a test on its reliability
coefficient. The next method for evaluating internal consistency reliability is Cronbach’s coefficient
alpha. As with the split-half method, Cronbach's coefficient alpha involves administering a test
once to a single group of examinees. It's a little more complex and can be conceptualized as the
average of all possible split-half reliability coefficients corrected by the Spearman-Brown formula. The last
method for evaluating internal consistency reliability is the Kuder-Richardson Formula 20, or KR-20.
It's very similar to Cronbach’s alpha, except that it can only be used when test items are
dichotomously scored. A multiple choice question is one example of a dichotomously scored item.
Even though a multiple choice question has three or four responses to choose from, it is scored as
either right or wrong, which is dichotomous. In contrast to test-retest and alternate forms reliability,
internal consistency reliability is appropriate for estimating the reliability of tests designed to
measure characteristics that fluctuate over time and for tests that are susceptible to memory and
practice effects. However, internal consistency reliability is not appropriate for evaluating the
reliability of a speeded test since it tends to overestimate the reliability of this type of test.
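
Here is a short Python sketch of the two calculations just described, using an invented matrix of item responses. It computes a split-half coefficient from odd- versus even-numbered item totals, applies the Spearman-Brown correction for the full-length test, and then computes Cronbach's alpha from the item and total-score variances.

```python
import numpy as np

# Hypothetical item responses: rows are examinees, columns are six test items.
scores = np.array([
    [4, 3, 4, 5, 4, 3],
    [2, 2, 3, 2, 1, 2],
    [5, 4, 5, 4, 5, 5],
    [3, 3, 2, 3, 3, 2],
    [1, 2, 1, 2, 2, 1],
])

# Split-half: correlate totals on odd- vs even-numbered items, then apply the
# Spearman-Brown formula to estimate reliability for the full-length test.
odd_total = scores[:, 0::2].sum(axis=1)
even_total = scores[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_total, even_total)[0, 1]
r_full = (2 * r_half) / (1 + r_half)          # Spearman-Brown correction

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total scores).
k = scores.shape[1]
item_vars = scores.var(axis=0, ddof=1).sum()
total_var = scores.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars / total_var)

print(round(r_full, 2), round(alpha, 2))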

Finally, the last method for evaluating reliability is inter-rater or inter-scorer reliability. When scoring
a true-false, multiple choice, or other objectively scored test, different scorers or raters should be
able to derive the same test score for any given examinee. However, since the scoring of a projective,
essay, or other non-objective test is subjective, an examinee's score on these tests may vary for
different scorers. In other words, whenever a test is subjectively scored, the rater is a potential
source of measurement error, and the test’s inter-rater reliability should be evaluated. This usually
involves administering the test to a sample of examinees and then having each examinee’s test
scored independently by two or more raters. The correlation coefficient for scores assigned by the
different raters indicates the degree of consistency across different raters.

There are several special correlation coefficients that are used to measure inter-rater reliability. One
of these is the kappa statistic, which is used to assess inter-rater reliability for two or more raters
when the data are nominal. Another is the coefficient of concordance, which is used to evaluate
inter-rater reliability for two or more raters when the data are in the form of ranks.
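
As an illustration of the kappa statistic, here is a small Python sketch for the two-rater case with invented nominal ratings. It computes observed agreement, chance agreement from the category marginals, and then kappa as the chance-corrected agreement.

```python
import numpy as np

# Hypothetical nominal ratings (category 0, 1, or 2) from two raters
# for the same ten examinees.
rater_a = np.array([0, 1, 1, 2, 0, 1, 2, 2, 0, 1])
rater_b = np.array([0, 1, 2, 2, 0, 1, 2, 1, 0, 1])

# Observed agreement: proportion of cases the raters classified identically.
p_o = np.mean(rater_a == rater_b)

# Chance agreement: summed product of each category's marginal proportions.
categories = np.unique(np.concatenate([rater_a, rater_b]))
p_e = sum(np.mean(rater_a == c) * np.mean(rater_b == c) for c in categories)

# Kappa corrects observed agreement for agreement expected by chance alone.
kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 2))
```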

Let's take a look at a practice question on reliability that's similar to those you're likely to encounter
on the licensing exam. Please pause or replay the question to give yourself time to think through the
question and respond. Here’s the question.

To evaluate the reliability of a characteristic that varies in severity or intensity over time, a test
developer would be best advised to use which of the following:
A) coefficient of equivalence
B) coefficient of stability
C) coefficient of alienation
D) coefficient of internal consistency

I’ll repeat the question (PAUSE).

To evaluate the reliability of a characteristic that varies in severity or intensity over time, a test
developer would be best advised to use which of the following:
A) coefficient of equivalence
B) coefficient of stability
C) coefficient of alienation
D) coefficient of internal consistency

To answer this question, you need to know that, when a test measures a characteristic that
fluctuates in severity or intensity over time, it would not be appropriate to use a method that
requires administering the test at different times. Let's look at the answers. Answer A is the
coefficient of equivalence. A coefficient of equivalence is obtained by administering equivalent
forms of the test to the same group of examinees and correlating the two sets of scores. Since the
two forms can't be administered at exactly the same time, this method may not be the best one for
a test that measures a characteristic that fluctuates over time, so this answer is probably not the
correct response. Answer B is the coefficient of stability, which is another name for the test-retest
reliability coefficient. It's obtained by administering the same test to a group of examinees at two
different times, so it's clearly not the best method for a test that measures a characteristic that
varies in severity or intensity over time. Answer C is the coefficient of alienation. The coefficient of
alienation is not a reliability coefficient but instead is used in correlational analysis as a measure of
the degree of non-association between variables. So answer C isn't the correct response either.
Finally, answer D is the coefficient of internal consistency. Since obtaining a coefficient of internal
consistency requires administering the test only once to a sample of examinees, it would be suitable
for evaluating the reliability of a test that measures an unstable characteristic, so answer D is the
correct response. A coefficient of internal consistency is the appropriate type of reliability for a test
that measures a characteristic that varies in intensity or severity over time.

There are a couple of other things you want to know about reliability for the licensing exam. First,
since no test is completely reliable, the test score obtained by an examinee may or may not be their
true score. In other words, an examinee’s obtained score might be their true score, or the
examinee’s true score might be somewhat higher or lower than their obtained score. For this
reason, test users often interpret an examinee’s obtained test score in terms of a confidence
interval, which indicates the range within which the examinee’s true score is likely to fall given his or
her obtained score. A confidence interval is constructed using the examinee’s obtained score and
something called the standard error of measurement. To understand what the standard error of
measurement and confidence intervals are, assume that we administer the same test to an
examinee an infinite number of times. Because of the effects of measurement error, the examinee
will not obtain the same score each time they take the test; instead they will obtain a distribution of
test scores. The mean of this distribution will be the examinee’s true score, and its standard
deviation will be a measure of the variability in the examinee’s obtained scores that is due to the
effects of measurement error. This standard deviation is called the standard error of measurement.
Of course, we never actually administer the same test to the same examinee an infinite number of
times to determine the examinee’s true score or to obtain a standard error of measurement.
Instead, when we know a test’s standard deviation and reliability coefficient, we can use a formula
to obtain an estimate of the standard error of measurement, and then use this estimate to construct
the range within which the examinee's true score is likely to fall given their obtained score. This
range is called a confidence interval.

Let’s go over an example. An examinee obtains a score of 50 on a test of self-esteem. Because the
test is not completely reliable, the examinee’s obtained score may or may not be their true score,
and to avoid over interpreting the examinee’s obtained score, we can construct a confidence
interval around it. To do so, we need to know the test’s standard error of measurement. The formula
for the standard error of measurement is given in the written study materials, but it's unlikely that
you'll have to use the formula to calculate a standard error of measurement on the exam. Instead,
you're more likely to encounter a question that gives you an examinee’s obtained test score and the
test’s standard error of measurement and asks you to construct a confidence interval. To do so, you
add and subtract the standard error of measurement to and from the examinee’s obtained score.
So, if the obtained score is 50 and the standard error of measurement is 5, you would add 5 to
and subtract 5 from 50, creating a range or confidence interval of 45-55. Specifically, to construct a
68% confidence interval, you add and subtract one standard error of measurement to and from the
examinee’s obtained score, which means that there is a 68% chance the examinee’s true score falls
between 45 and 55. To construct the 95% confidence interval, you add and subtract two standard
errors to and from the examinees obtained score. In this case, given the obtained score of 50, you
would add and subtract 10. This would give us a confidence interval of 40 to 60, which means that
there is a 95% chance that the examinee’s true score falls between 40 and 60.
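
The arithmetic can be sketched in a few lines of Python. The conventional formula for the standard error of measurement is the test's standard deviation times the square root of one minus its reliability; the standard deviation of 10 and reliability of .75 below are assumed values chosen so the result matches the worked example (an SEM of 5 around an obtained score of 50).

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Conventional standard error of measurement: SD * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(obtained: float, sem_value: float, n_sems: int = 1) -> tuple:
    """68% CI uses 1 SEM, 95% CI uses roughly 2 SEMs around the obtained score."""
    return obtained - n_sems * sem_value, obtained + n_sems * sem_value

se = sem(sd=10, reliability=0.75)                 # 10 * sqrt(0.25) = 5.0
print(confidence_interval(50, se, n_sems=1))      # (45.0, 55.0) -> 68% interval
print(confidence_interval(50, se, n_sems=2))      # (40.0, 60.0) -> 95% interval
```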

The last topic we'll want to cover on reliability is the set of factors that affect the strength or magnitude
of the reliability coefficient. We’ve already noted that one factor is the length of the test. Longer
tests tend to be more reliable than shorter ones. So, all other things being equal, a 200 item test is
likely to have a larger reliability coefficient than a 100 item test. Another factor that affects the size
of the reliability coefficient is the nature of the examinees. The more heterogeneous, or diverse, the
examinees are in terms of the behavior or construct measured by the test, the higher the reliability
coefficient. The reason for this is that the more heterogeneous the examinees, the more likely they
will have an unrestricted range of scores. As described in the written study materials, the range of
scores affects any type of correlation coefficient, including a reliability coefficient, with an
unrestricted range producing a higher correlation than a restricted range. So, if the possible range
of scores on a test is 0 to 100, we’re likely to obtain a larger reliability coefficient for a sample of
examinees whose scores range from 0 to 100 than for a sample whose scores range from only 40 to
60. If there are substantive reasons to believe that reliabilities will be different for different
subpopulations, then separate reliabilities should be reported for those subpopulations.
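
The restriction-of-range effect is easy to demonstrate with a simulation. In the Python sketch below, all the numbers are invented: two parallel administrations are generated for a heterogeneous sample, and the correlation is computed once for the full range of scores and once for a subsample restricted to high scorers only. The restricted coefficient comes out smaller, which is the logic behind the practice question that follows.

```python
import numpy as np

# Simulate test-retest data for a heterogeneous sample (hypothetical numbers).
rng = np.random.default_rng(1)
true_scores = rng.normal(100, 15, size=5_000)
form_1 = true_scores + rng.normal(0, 7, size=5_000)
form_2 = true_scores + rng.normal(0, 7, size=5_000)

# Reliability for the full (unrestricted) range of scores.
r_full_range = np.corrcoef(form_1, form_2)[0, 1]

# Reliability when the sample is restricted to high scorers only.
high = form_1 > 120
r_restricted = np.corrcoef(form_1[high], form_2[high])[0, 1]

print(round(r_full_range, 2), round(r_restricted, 2))  # restricted value is smaller
```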

Finally, the size of the reliability coefficient is affected by the content of the test. All other things
being equal, a test that is homogeneous with regard to content will be more reliable than a test that
is heterogeneous. Let’s look at this through an internal consistency reliability lens. The more the
items in the test are measuring the same construct, the greater the internal consistency, and the
internal consistency reliability coefficient will be larger. For instance, we should expect a 50 item test
that measures only knowledge of test construction to have a larger coefficient alpha and thus have
greater internal consistency than a 50 item test that measures knowledge of test construction,
statistics, clinical psychology, industrial/organizational psychology, and ethics. Let's try another
practice question. Here’s the question.

For a sample of children whose scores on an IQ test range from 50 to 150, the reliability coefficient
for the test is .80. If a reliability coefficient for the IQ test is calculated for a sample that includes
intellectually gifted children only, it would likely be:
A) .80.
B) greater than .80.
C) less than .80.
D) either greater or less than .80.

I’ll repeat the question (PAUSE).

For a sample of children whose scores on an IQ test range from 50 to 150, the reliability coefficient
for the test is .80. If a reliability coefficient for the IQ test is calculated for a sample that includes
intellectually gifted children only, it would likely be:
A) .80.
B) greater than .80.
C) less than .80.
D) either greater or less than .80.

This question states that, for an unrestricted range of scores—that is, for a sample of children with a
wide range of IQ scores—the reliability coefficient is .80. It then asks you what you would expect the
reliability coefficient to be for a sample of children with very high IQ. To answer this question, you
need to know that a restricted range of scores tends to lower the reliability coefficient. Since the
second sample of only intellectually gifted children would have a restricted range of scores, you'd
expect the reliability coefficient to be smaller, so the correct answer is C. For a sample of gifted
children, you'd expect the reliability coefficient to be less than .80.

Let's turn now to the next topic, which is validity. As noted earlier, a test is valid when it accurately
measures what it was designed to measure. There are three main types of validity: content,
construct, and criterion-related. These three types correspond to the three main reasons why we
would administer a test to an examinee. First, we might administer a test to assess the degree to
which an examinee has mastered a specific content or behavior domain. This is our goal, for
instance, when we administer an achievement test to determine if examinees have mastered the
information presented in a specific course or program. When the purpose of a test is to evaluate
mastery of a content or behavior domain, we are most concerned about the test’s content validity.

Alternatively, we might administer a test to find out how much of a particular hypothetical or
intangible trait an examinee has. This is the purpose of tests designed to measure such traits as self-
esteem, intelligence, and depression. When we want to use a test to measure a hypothetical trait,
we are most interested in determining the test's construct validity.

Finally, we might want to use an examinee’s score on a test to estimate or predict their score or
status on an external measure of performance or criterion measure. For instance, we might want to
use a job applicant’s score on a selection test to predict how well the applicant can be expected to
perform on a measure of job performance several months after they’re hired. When the purpose of
a test is to predict or estimate performance on a criterion measure, we are most interested in
determining the test’s criterion-related validity. So, these three types of validity correspond to the
three major purposes or goals of testing, and it's the purpose of the test that determines if we are
most concerned about its content, construct, or criterion-related validity.

Let's take a closer look at these three types of validity, beginning with content validity. A test has
adequate content validity when it has been demonstrated that the test’s items adequately and
accurately represent the content or behavior domain the test was designed to measure. Test
developers for the EPPP would examine content validity as they develop new items. Content validity
is important for achievement type tests that are designed to measure how well examinees have
mastered a specific body of knowledge. It is also of concern for job sample tests that measure the
specific knowledge or skills required for successful job performance. Content validity is ordinarily
built into a test while the test is being constructed. It involves clearly identifying or defining the
content or behavior domain, dividing the domain into relevant categories and subcategories, and
then developing or choosing items that represent each subcategory. If we were developing a
psychology licensing exam, we'd first divide the domain of psychology into several categories such as
test construction, statistics, clinical psychology, developmental psychology, and ethics. We would
then divide each category into subcategories and write items to assess each subcategory. Once test
items have been written, the primary method for evaluating a test's content validity is to obtain the
agreement of subject matter experts. When these experts conclude that the items accurately and
adequately sample the content or behavior domain of interest, the test can be said to have content
validity.

Now, let's take a look at construct validity, which is for tests designed to measure a hypothetical
construct or trait such as self-esteem, intelligence, depression, or sense of belonging. Construct
validity is most relevant for measures in psychology, and there isn’t a single way to establish a test’s
construct validity. Instead, it's evaluated through the accumulation of evidence, and the sources of
evidence may include the judgment of experts that the test items actually measure the construct
they are supposed to measure; studies in which the test scores of individuals who differ in terms of
the construct of interest are compared; and correlations between scores on the
test and scores on measures of related and unrelated traits.

One method that is often used to evaluate construct validity is the multitrait-multimethod matrix.
When using this method, scores on the test of interest are correlated with scores on established
measures of the same or similar constructs as well as with scores of established measures of
unrelated constructs in order to obtain information on the test’s convergent and divergent validity.
A test has convergent validity when scores on the test correlate highly with scores on measures of
the same or similar traits. Conversely, a test has divergent validity when scores on the test do not
correlate highly with scores on measures of unrelated traits. Together, convergent validity and
divergent validity provide evidence that the test has construct validity. They show that the test is
actually measuring the construct it was designed to measure and not something else.

The development of the multitrait-multimethod matrix was based on two assumptions. First, a
person’s test score is a function of both the trait being measured and the specific method used to
measure the trait. And second, when the test is valid, a person's test score will be more of a function
of the trait being measured than the method of measurement. When using this method for
evaluating construct validity, we measure two or more traits, each using two or
more methods, with one of the measures being the test we are validating. After administering all of
our tests to a sample of examinees, we then correlate scores on all pairs of the tests to obtain a
multitrait-multimethod matrix, which is simply a table of correlation coefficients.
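
As a rough sketch of what that table looks like, the Python snippet below simulates scores on two traits from the lecture's running example (creativity and intelligence), each measured by two methods (self-report and teacher rating); all the scores are invented. The multitrait-multimethod matrix is then just the correlation matrix of those four measures, and the comments note which cells correspond to which type of coefficient.

```python
import numpy as np

# Hypothetical scores for 200 examinees on four measures:
# two traits (creativity, intelligence), each measured by two methods
# (self-report, teacher rating).
rng = np.random.default_rng(2)
creativity = rng.normal(size=200)
intelligence = rng.normal(size=200)

measures = np.column_stack([
    creativity + rng.normal(scale=0.3, size=200),     # creativity, self-report
    creativity + rng.normal(scale=0.3, size=200),     # creativity, teacher rating
    intelligence + rng.normal(scale=0.3, size=200),   # intelligence, self-report
    intelligence + rng.normal(scale=0.3, size=200),   # intelligence, teacher rating
])

# The multitrait-multimethod matrix is the table of correlations among all pairs
# of measures; e.g., cell [0, 1] is a monotrait-heteromethod (convergent) coefficient
# and cell [0, 2] is a heterotrait-monomethod (divergent) coefficient.
mtmm = np.corrcoef(measures, rowvar=False)
print(np.round(mtmm, 2))
```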

For the exam, you want to know that the multitrait-multimethod matrix is used to evaluate a test’s
construct validity, and you want to be familiar with the four types of correlation coefficients that the
matrix contains. That is, the monotrait-monomethod, monotrait-heteromethod, heterotrait-
monomethod, and heterotrait-heteromethod coefficients. Note that these names sound more
complicated than they actually are. Keep in mind that mono- means same while hetero- means
different.

One kind of correlation coefficient contained in the multitrait-multimethod matrix is the monotrait-
monomethod coefficient. It indicates the correlation of the test with itself—that is, the same trait
being measured using the same method, or monotrait-monomethod. The monotrait-monomethod
coefficient is a reliability coefficient, and while it isn't directly relevant to a test’s construct validity, it
is important since reliability limits test validity. A test with low reliability can't be a valid measure;
think back to that bathroom scale that we threw in the trash. With such inconsistency in its
measurements, there’s no way it could accurately be measuring weight. As such, we would want the
tests we’re validating to have a large monotrait-monomethod coefficient, since this indicates that it
has a high reliability.

The other three types of correlation coefficients contained in the multitrait-multimethod matrix
provide information about the test's construct validity. One of these is the monotrait-heteromethod
coefficient. It indicates the correlation between two different measures of the same trait—that is,
same trait, different methods. It provides information about the test’s convergent validity. Let's say,
for example, that the test we're validating is a self-report test of creativity. To evaluate the test’s
convergent validity, we correlate scores on the test with scores on a measure of creativity that has
already been validated and uses a different method of measurement, such as a teacher rating of
creativity. The correlation coefficient for our self-report test of creativity and the teacher rating of
creativity is a monotrait-heteromethod coefficient, and we'd want this coefficient to be large. If
it is, we'd have evidence of our test’s convergent validity, meaning that scores on our self-report test
of creativity correlate highly with scores on a teacher rating of creativity.

The next type of correlation coefficient is the heterotrait-monomethod coefficient. It indicates the
correlation between two different traits being measured by the same method—that is, different
traits, same method. It provides information on a test’s divergent validity. For our self-report test of
creativity, the heterotrait-monomethod coefficient would indicate the correlation between our test
of creativity and a test of an unrelated trait that uses the same method of measurement. For
example, a self-report measure of intelligence. We'd want this correlation to be small, and if it is,
we'd have evidence of divergent validity because scores on our self-report test of creativity have a
low correlation with scores on a self-report test of intelligence.

Finally, the last kind of correlation coefficient is the heterotrait-heteromethod coefficient. It
indicates the correlation between two different traits being measured by two different methods—
that is, different traits, different methods. It also provides information on divergent validity. For a
self-report test of creativity, the heterotrait-heteromethod coefficient would indicate the correlation
between our test of creativity and a test of an unrelated trait that uses a different method of
measurement, like a teacher rating of intelligence. We'd also want this correlation to be small, and if
it is, we'd have additional evidence of our test’s divergent validity since scores on our self-report test
of creativity have a low correlation with scores on a teacher rating of intelligence.

So, to summarize, when using the multitrait-multimethod matrix to assess a test’s construct validity,
we want the monotrait-heteromethod coefficient to be larger than the heterotrait-monomethod
and the heterotrait-heteromethod coefficients. When this occurs, we have evidence of our test’s
convergent and divergent validity, which in turn confirms that the test has construct validity.

Let's look at a question on the multitrait-multimethod matrix that's similar to one you might
encounter on the licensing exam. Here’s the question.

When using the multitrait-multimethod matrix to evaluate the construct validity of a test, a large
heterotrait-monomethod coefficient suggests that the test:
A) has convergent validity.
B) has discriminant validity.
C) lacks convergent validity.
D) lacks discriminant validity.

I’ll repeat the question (PAUSE).

When using the multitrait-multimethod matrix to evaluate the construct validity of a test, a large
heterotrait-monomethod coefficient suggests that the test:
A) has convergent validity.
B) has discriminant validity.
C) lacks convergent validity.
D) lacks discriminant validity.

To choose the correct response to this question, you need to know what a heterotrait-monomethod
coefficient is. Recall that in the context of the multitrait-multimethod matrix, hetero- means
different and mono- means the same. So the heterotrait-monomethod coefficient is the correlation
between two different traits being measured by the same method. Logically, it makes sense that
we'd want this coefficient to be small since it's the correlation between tests of two different traits.
However, the question states that the heterotrait-monomethod coefficient is large, which suggests
that there's a problem. Let's take a look at the answers. Answer A states that a large heterotrait-
monomethod coefficient indicates that the test has convergent validity. Since the heterotrait-
monomethod coefficient provides information on divergent not convergent validity, this can't be the
correct response. Answer B states that a large heterotrait-monomethod coefficient indicates that
the test has discriminant validity, which is another name for divergent validity. This answer can't be
correct either because a heterotrait-monomethod coefficient provides evidence of discriminant or
divergent validity when it's small, not when it's large. Answer C states that a large heterotrait-
monomethod coefficient suggests that the test lacks convergent validity. Again, this
coefficient does not provide information on convergent validity. So, this leaves us with answer D,
which states that a large heterotrait-monomethod coefficient suggests that the test lacks
discriminant validity. In fact, if the heterotrait-monomethod coefficient is large rather than small,
this indicates that the test does not have discriminant validity. In other words, the test correlates
highly with a measure of an unrelated trait, which it shouldn't do if it's actually measuring the trait it
was designed to measure. So answer D is the correct response. A large heterotrait-monomethod
coefficient suggests that the test lacks discriminant validity.

Before we move on to other types of validity, let’s take a look at factor analysis as it relates to
construct validity. Factor analysis is a set of analytical procedures that allows researchers to examine
the interrelationships among large numbers of variables or items and reduce those variables into
smaller groups known as factors or dimensions. Broadly, factor analysis can serve a number of
purposes in test construction and instrument development, which include assessing a test’s
construct validity, creating subtests, and identifying components of intelligence, attitudes, interests,
and other attributes. When used for assessing construct validity, factor analysis is another way of
obtaining information about a test’s convergent and discriminant validity.

Factor analysis is typically used to identify complex interrelationships among items that are part of a
larger unified dimension or factor. For example, let’s say you are conducting a survey and you want
to know whether the items or questions in the survey have any patterns; in other words, do the
items “hang together” to represent a larger construct? We use factor analysis to explain the
interrelationships among those variables.

A factor is a set of variables that have similar response patterns. When you conduct a factor analysis,
factors are presented in terms of factor loadings, which reflect how much variation in the data they can explain.
Factor loadings are often listed from greatest to least, and each is a correlation between a variable and
the underlying factor. To interpret a factor loading, you should square it to determine the amount of
variability in the test score that is accounted for or explained by the factor. For example, when given
a factor loading of .78 for factor 1 on a measure of resiliency, you would square this, and it would
show that 61% of the variability in the resiliency test is accounted for by factor 1. Squaring a factor
loading provides a measure of shared variability. Eigenvalues indicate the strength of the factor.
Factors with eigenvalues less than 1.0 are not considered significant and are not interpreted.
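
Here is a short Python sketch of the two interpretive rules just described. The .78 loading comes from the lecture's resiliency example; the small correlation matrix is invented purely to show the eigenvalue-greater-than-1 rule in action.

```python
import numpy as np

# Squaring a factor loading gives the proportion of variance in the measure
# that the factor explains: a loading of .78 explains about 61% of the variance.
loading = 0.78
print(round(loading ** 2, 2))   # 0.61

# Eigenvalue rule: retain factors whose eigenvalues (from the correlation
# matrix of the items) exceed 1.0. The matrix below is purely illustrative.
corr_matrix = np.array([
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
])
eigenvalues = np.linalg.eigvalsh(corr_matrix)
print(eigenvalues[eigenvalues > 1.0])   # factors considered worth interpreting
```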

There are two general approaches or types of factor analysis used to evaluate the dimensionality of
an instrument: confirmatory and exploratory. Confirmatory factor analysis is a more complex
approach and tests hypotheses about the nature of underlying factors and items. It allows us to
determine which of the many dimensional structures or factors best accounts for the empirical
relationship between the items. Exploratory factor analysis, on the other hand, is used to extract or
reduce data to a smaller set of dimensions, and the researcher does not make hypotheses or
predictions in advance. It uses exploratory procedures to identify the simplest, most straightforward
dimensional structure that can be supported by the empirical relationships between items. Your
written study materials provide a breakdown of steps in the factor analysis. You should review this
process and familiarize yourself with the key terms.

This brings us to the topic of criterion-related validity. Criterion-related validity is of interest when
the goal of testing is to use the test to estimate or predict an examinee's status on another variable.
In the context of criterion-related validity, the test is referred to as the predictor, and the other
variable is called the criterion; criterion-related validity is a measure of the degree of association
between the predictor and the criterion. There are two types of criterion-related validity: predictive
and concurrent. Both are evaluated by obtaining a sample of examinees and correlating their scores
on the predictor and criterion to obtain a criterion-related validity coefficient. The larger the
coefficient, the greater the predictor’s criterion-related validity.

Predictive validity is of most interest when we want to use a predictor to predict an examinee's
future performance on the criterion. For example, we'd be interested in the predictive validity of a
job selection test if we want to use it to predict a job applicant’s score on a measure of job
performance six months after being hired. To assess predictive validity, we administer the predictor
to a sample of examinees and at a later time determine their scores on the criterion. We then
correlate the examinees’ scores on the two measures to obtain a criterion-related or predictive
validity coefficient.

In contrast, concurrent validity is of interest when we want to use a predictor score to estimate an
examinee’s current status on the criterion. We would be interested in the concurrent validity of a
job selection test if we want to use it to estimate how well an applicant will do on the job
immediately after being hired. To assess the predictor’s concurrent validity, we administer the
predictor and criterion to a sample of examinees at about the same time, and then correlate the two
sets of scores to obtain a criterion-related or concurrent validity coefficient. The criterion-related
validity coefficient indicates the degree of association between the predictor and the criterion. It is
symbolized as rxy, where the subscript xy indicates that we have correlated two different measures—
in this case, the predictor and the criterion. We can interpret the criterion-related validity coefficient
directly as a measure of degree of association, or we can square it to obtain a coefficient of
determination, which indicates the amount of variability the two measures share in common.

For example, if the correlation between scores on a job selection test and a measure of job
performance is .60, we can say that .60 squared, or 36%, of variability in job performance scores is
shared with or is explained by variability in selection test scores. The remaining 64% of variability in
job performance scores is due to factors that are not measured by the selection test.

Another way to use the data collected in a criterion-related validity study is to calculate a regression
equation. The regression equation is used to predict or estimate a person's criterion score based on
their obtained predictor score. Of course, whenever the validity coefficient is less than plus-or-
minus one, which is usually the case, there will be error in the prediction, which means that the
person's actual score on the criterion may be somewhat lower or higher than their predicted score.
Consequently, the standard error of estimate is used to construct a confidence interval around an
examinee’s predicted criterion score. The confidence interval indicates the range within which the
examinee’s actual criterion score is likely to fall given their predicted criterion score. The formula for
the standard error of estimate is given in the written study materials, but you're not likely to have to
use the formula on the exam. Instead, you may encounter a question that requires you to know that
the standard error of estimate is used to construct a confidence interval around a predicted criterion
score or a question that gives you a person's predicted criterion score and the standard error of
estimate and requires you to construct a confidence interval. The procedure for constructing a
confidence interval around a predicted criterion score is the same as the procedure for constructing
a confidence interval around an obtained test score. To construct the 68% confidence interval, you
add and subtract one standard error to and from the predicted criterion score, and to construct the
95% confidence interval, you add and subtract two standard errors to and from the predicted
criterion score.
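
The Python sketch below ties these pieces together. The validity coefficient of .60 echoes the earlier example; the criterion standard deviation of 10 and the predicted score of 70 are assumed values. The standard error of estimate uses the conventional formula, the criterion's standard deviation times the square root of one minus the squared validity coefficient, which is the formula the study materials refer to.

```python
import math

# Hypothetical summary statistics from a criterion-related validity study.
r_xy = 0.60          # correlation between predictor (selection test) and criterion
sd_criterion = 10.0  # standard deviation of criterion (job performance) scores

# Coefficient of determination: shared variability between predictor and criterion.
print(round(r_xy ** 2, 2))   # 0.36 -> 36% of criterion variance is explained

# Conventional standard error of estimate: SD_y * sqrt(1 - r_xy^2).
se_est = sd_criterion * math.sqrt(1 - r_xy ** 2)   # 10 * sqrt(0.64) = 8.0

# Confidence interval around a predicted criterion score of, say, 70:
predicted = 70
print(round(predicted - se_est, 1), round(predicted + se_est, 1))          # 68% interval
print(round(predicted - 2 * se_est, 1), round(predicted + 2 * se_est, 1))  # 95% interval
```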

The information obtained in a criterion-related validity study can also be used to evaluate the
predictor’s incremental validity, which is the increase in decision making accuracy achieved by using
the predictor or predictors. Incremental validity can be determined from the scatterplot of data
points that have been obtained from the validity study. Each data point in the scatterplot represents
one person and indicates that person's scores on the predictor and criterion. Let's use the example
in the written study materials and assume that we're interested in determining the incremental
validity of an assertiveness test. Here, we want to evaluate the test's incremental validity by using the
assertiveness test as a selection tool for sales job applicants. We conduct a concurrent validity
study and administer the assertiveness test to a sample of 100 currently employed salespeople, and
as the measure of job performance, we determine each salesperson’s dollar amount of sales during
the previous three-month period. We then use the data we have collected to construct a scatterplot.
In this example, the predictor is the assertiveness test, and the criterion is dollar amount of sales.
The predictor scores are plotted on the horizontal axis of the scatterplot, and the criterion scores are
plotted on its vertical axis.

To evaluate the incremental validity of the assertiveness test, we first divide the scatter plot in half
by setting a criterion cutoff score. In our example, the criterion cutoff would be determined by the
minimum amount of sales the company finds acceptable for its salespeople. Salespeople who sell
above the minimum are considered successful while those who sell below the minimum are
considered unsuccessful. In the scatterplot in the written study materials, the criterion cutoff is the
horizontal line that divides the scatterplot into an upper and lower half. The data points in the upper
half of the scatter plot represent current employees who are successful salespeople, while the data
points in the lower half represent current employees who are unsuccessful.

Once we've set the criterion cutoff, we can determine the base rate, which is one of the numbers we
need to calculate a predictor’s incremental validity. The base rate is the proportion of people who
score above the criterion cutoff before we use the new predictor. In our example, the base rate is
the proportion of current salespeople who are considered successful in terms of dollar amount of
sales, and it is calculated by dividing the number of successful salespeople by the total number of
salespeople.

One hundred salespeople were included in the concurrent validity study, and of these, 61 have sales
above the criterion cutoff, so the base rate is 61 divided by 100, or 61%. In other words, of the
salespeople hired without using the new assertiveness test, 61% are successful. In order for the
assertiveness test to have incremental validity, it will have to increase the number of accurate
decisions, or the percent of successful salespeople above 61%.

The next step is to calculate the proportion of salespeople who would have been successful if we
had actually used the assertiveness test as a selection tool. This proportion is referred to as the
positive hit rate, which is the other number we need to calculate incremental validity. To determine
the positive hit rate, we have to set a predictor cutoff score. There are several ways to do this, but
for the licensing exam you don't need to be familiar with them. Let's assume that we've used one of
those techniques to set the cutoff score for the assertiveness test in a scatterplot in the written
study materials. The predictor cutoff is the vertical line that divides the scatter plot into a right and
left half. To the right of this cutoff are the positives, or the employees who scored high on the
assertiveness test and would have been hired if the test had been used as a selection tool. To the
left of this cutoff are the negatives, or the employees who scored low on the assertiveness test and
would not have been hired if the test had been used to make selection decisions.

Note that once we set both the criterion and predictor cutoff scores, we've divided the scatter plot
into four quadrants. To the right of the predictor cutoff are the positives: in the upper right are the
true positives. These are the individuals who scored high on the predictor and high on the criterion.
In the lower right are the false positives. These are the individuals who scored high on the predictor
but low on the criterion. To the left of the predictor cutoff are the negatives. In the upper left are
the false negatives. These are the individuals who scored low on the predictor but high on the
criterion. And in the lower left are the true negatives. These are the individuals who scored low on
the predictor and low on the criterion.

To determine the positive hit rate, we consider only the positives, and divide the number of true
positives by the total number of positives. In this example, the number of true positives is 21, and
the total number of positives is 22. Consequently, the positive hit rate is 21 divided by 22, or 95%.

If we had used the assertiveness test as a selection tool for the 100 current employees included in
our validity study, we would have hired 22 of them, and of the 22, 21 would have been successful.
We can now calculate the assertiveness test's incremental validity by subtracting the base rate from
the positive hit rate. Since the base rate is 61% and the positive hit rate is 95%, the assertiveness
test's incremental validity is 95% minus 61%, or 34%. In other words, use of the assertiveness test as
a selection tool for salespeople would increase the proportion of correct selection decisions by 34%.
Whether or not this increase justifies using the assertiveness test as a selection tool is a subjective
judgment. There's no absolute standard for determining what constitutes an acceptable
improvement in decision making accuracy, but 34% seems like a pretty decent increase.
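
Here is the same arithmetic as a brief Python sketch, using the counts from the lecture's scatterplot example: 61 successful salespeople out of 100, and 21 true positives out of 22 people who scored above the predictor cutoff.

```python
# Counts from the lecture's scatterplot example.
total_employees = 100
successful = 61          # data points above the criterion cutoff
true_positives = 21      # high on the predictor AND high on the criterion
total_positives = 22     # all data points to the right of the predictor cutoff

# Base rate: proportion of correct (successful) hires without the new predictor.
base_rate = successful / total_employees              # 0.61

# Positive hit rate: proportion of those who would have been hired using the
# predictor who actually turned out to be successful.
positive_hit_rate = true_positives / total_positives  # about 0.95

# Incremental validity: improvement in decision-making accuracy from the predictor.
incremental_validity = positive_hit_rate - base_rate
print(round(incremental_validity, 2))                 # about 0.34
```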

For the licensing exam, you want to know what the terms incremental validity and base rate mean.
Recall that incremental validity refers to the increase in decision making accuracy that results when
the new predictor is used to make decisions, and the base rate is the proportion of correct decisions
being made without using the new predictor. You also want to know the definitions for the four
terms used to describe the four quadrants in the scatter plot and the effects on those quadrants of
raising and lowering the criterion and predictor cutoff scores. If you look at the scatter plot in the
written study materials, you can see that raising or lowering the predictor and criterion cutoff scores
changes the number of data points in each quadrant. Let’s look at an example of the kind of
question you're likely to encounter on the exam on changing the cutoff scores in a scatter plot.
Here’s the question.

A test developer is attempting to set the optimal cutoff score for a new selection test for job
applicants for a computer programmer job. Using the data they collected in a concurrent validity
study, they find that raising the cutoff score on the selection test:
A) increases the number of true positives and true negatives.
B) decreases the number of true positives and negatives.
C) decreases the number of true and false positives.
D) increases the number of true and false positives.

I’ll repeat the question (PAUSE).

A test developer is attempting to set the optimal cutoff score for a new selection test for job
applicants for a computer programmer job. Using the data they collected in a concurrent validity
study, they find that raising the cutoff score on the selection test:
A) increases the number of true positives and true negatives.
B) decreases the number of true positives and negatives.
C) decreases the number of true and false positives.
D) increases the number of true and false positives.

This question is asking about the consequence of raising the cutoff score of the selection test that is
being used as a predictor. It will probably be easier to answer questions like this one on the exam if
you draw a picture of the scatter plot in the same orientation as it is in the written study materials,
and include the lines representing the predictor and criterion cutoff scores and the names of the four
quadrants. If you have this picture in front of you, it will be easy to see that raising the predictor
cutoff score decreases the number of positives and increases the number of negatives. Only one of
the answers describes this result. Answer C states that raising the predictor cutoff will decrease the
number of true and false positives, and that is the correct response. When you raise the predictor
cutoff score, which means you'll move the vertical line in the scatter plot to the right, the number of
true and false positives decreases.

There's just one more topic related to criterion-related validity that we should review, and that's the
relationship between a test’s reliability coefficient and its validity coefficient. We noted earlier that
reliability places a ceiling on validity. A test cannot have good content, construct, or criterion-related
validity when it has low reliability. In terms of criterion-related validity, this relationship is defined by
a formula that indicates that a test’s criterion-related validity coefficient cannot exceed the square
root of its reliability coefficient. For example, if a test’s reliability coefficient is .81, then its validity
coefficient can be no larger than the square root of .81, which is .9. Of course, the test’s validity
coefficient can be less than .9, it just can't be larger than .9. In other words, reliability is a necessary
but not sufficient condition for validity. A test must have high reliability to be valid, but high
reliability does not guarantee validity – in other words, just because a test measures something
consistently doesn’t mean that it is actually measuring what we want it to measure.
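
The ceiling relationship described above can be written as a one-line helper. This sketch simply encodes the rule stated in the lecture: the validity coefficient cannot exceed the square root of the reliability coefficient.

```python
import math

def max_validity(reliability: float) -> float:
    """Upper bound on a test's criterion-related validity coefficient,
    per the relationship described in the lecture: sqrt(r_xx)."""
    return math.sqrt(reliability)

print(max_validity(0.81))   # approximately 0.9 -> validity cannot exceed .90
```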

This concludes my lecture on Test Construction. Please note, this is not an exhaustive list of the
content within this domain. I encourage you to also study our written materials on Test Construction
as this will expand your understanding of the most important topics on the EPPP. Good luck on your
exam!
