LESSON 6 / CHAPTER 2
Test Reliability
What is test reliability?
Test reliability refers to the consistency of the scores a test produces. Among the factors that affect reliability:
1. The number of items in a test – the more items a test has, the more likely its reliability is high. A large pool of items raises the probability of obtaining consistent scores.
The different types of reliability, and how each is determined, are indicated below.
1. Linear regression
Linear regression is demonstrated when two variables are measured for the same participants, such as two sets of scores on a test taken at two different times. When the paired scores are plotted on a graph (with an X and a Y axis), they tend to form a straight line. The straight line fitted to the two sets of scores is the linear regression. When a straight line is formed, we can say that there is a correlation between the two sets of scores.
Basis of Statistical Analysis to Determine Reliability
[Figure: Example scatterplot of Score 1 (x-axis) against Score 2 (y-axis) with the fitted regression line Score 2 = 4.8493 + 1.0403 × Score 1. Each point in the scatterplot is a respondent with two scores, one for each test.]
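The fitted line shown in the figure can be reproduced in a few lines of Python. This is a minimal sketch, assuming NumPy is available; the two score lists are the Monday and Tuesday spelling scores from the worked example that follows.

```python
import numpy as np

score_1 = [10, 9, 6, 10, 12, 4, 5, 7, 16, 8]     # first administration (X)
score_2 = [20, 15, 12, 18, 19, 8, 7, 10, 17, 13]  # second administration (Y)

# Fit a first-degree polynomial (a straight line) to the paired scores.
slope, intercept = np.polyfit(score_1, score_2, 1)
print(f"Score 2 = {intercept:.4f} + {slope:.4f} * Score 1")
# -> Score 2 = 4.8493 + 1.0403 * Score 1, matching the line in the figure
```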
The index of the linear relationship is called a correlation coefficient. The closer the points in a scatterplot fall to the regression line, the stronger the correlation. When the trend of the scatterplot is directly proportional, the correlation coefficient has a positive value; when the trend is inverse, the correlation coefficient has a negative value. The statistical analysis used to determine the correlation coefficient is called the Pearson r. The computation below illustrates how the Pearson r is obtained.
Suppose a teacher gave a 20-item spelling test of two-syllable words on Monday and again on Tuesday. The teacher wanted to determine the reliability of the two sets of scores by computing the Pearson r.
Formula:

$$ r = \frac{N\sum XY - \left(\sum X\right)\left(\sum Y\right)}{\sqrt{\left[N\sum X^{2} - \left(\sum X\right)^{2}\right]\left[N\sum Y^{2} - \left(\sum Y\right)^{2}\right]}} $$
Monday Test (X)   Tuesday Test (Y)   X²         Y²          XY
10                20                 100        400         200
9                 15                 81         225         135
6                 12                 36         144         72
10                18                 100        324         180
12                19                 144        361         228
4                 8                  16         64          32
5                 7                  25         49          35
7                 10                 49         100         70
16                17                 256        289         272
8                 13                 64         169         104
ΣX = 87           ΣY = 139           ΣX² = 871  ΣY² = 2125  ΣXY = 1328
Substituting into the formula with N = 10:

r = [10(1328) − (87)(139)] / √{[10(871) − 87²][10(2125) − 139²]} = 1187 / 1483.57 ≈ 0.80
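The same computation can be scripted. The sketch below is a plain transcription of the raw-score formula, with the two score lists taken directly from the table above; it is an illustration, not part of the original lesson.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r computed directly from the raw-score formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

monday = [10, 9, 6, 10, 12, 4, 5, 7, 16, 8]       # X scores from the table
tuesday = [20, 15, 12, 18, 19, 8, 7, 10, 17, 13]  # Y scores from the table

print(round(pearson_r(monday, tuesday), 2))  # 0.8
```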
The value of a correlation coefficient cannot exceed 1.00 or fall below -1.00. Values of 1.00 and -1.00 indicate perfect correlation.
When the value of the correlation coefficient is positive, it means that the higher the scores in X, the higher the scores in Y. This is called a positive correlation. In the case of the two spelling scores, a positive correlation is obtained. When the value of the correlation coefficient is negative, it means that the higher the scores in X, the lower the scores in Y, or vice versa. This is called a negative correlation. When the same test is administered twice to the same group of participants, a positive correlation usually indicates reliability or consistency of the scores.
The obtained correlation of two variables may be due to chance. In order to determine whether the correlation is genuine rather than a chance result, it is tested for significance. When a correlation is significant, it means there is a high probability that the two variables are really related.
In order to determine whether a correlation coefficient value is significant, it is compared with the expected values of the correlation coefficient in a table of critical values. When the computed value is greater than the critical value, the correlation is significant: we can be more than 95% confident that the two variables are correlated and that the result is not due to chance.
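In practice, the comparison against a table of critical values can be delegated to statistical software. The sketch below uses SciPy's pearsonr, which returns both the coefficient and its p-value; a p-value below .05 corresponds to significance at the 95% level described above. SciPy is an assumed dependency here, not something the lesson itself uses.

```python
from scipy.stats import pearsonr

monday = [10, 9, 6, 10, 12, 4, 5, 7, 16, 8]
tuesday = [20, 15, 12, 18, 19, 8, 7, 10, 17, 13]

r, p_value = pearsonr(monday, tuesday)
print(f"r = {r:.2f}, p = {p_value:.4f}")
# p < .05 for these data, so the correlation is significant.
```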
Another statistical analysis used to determine the internal consistency of a test is Cronbach's alpha. Follow the procedure below to determine internal consistency.
Suppose that five students answered a five-item checklist about their hygiene, using a scale of 1 to 5 with the following corresponding scores:
5 – always, 4 – often, 3 – sometimes, 2 – rarely, 1 – never
The teacher wanted to determine whether the items have internal consistency.
Student   Item 1   Item 2   Item 3   Item 4   Item 5   Total (X)     Score − Mean   (Score − Mean)²
A         5        5        4        4        1        19            2.8            7.84
B         3        4        3        3        2        15            -1.2           1.44
C         2        5        3        3        3        16            -0.2           0.04
D         1        4        2        3        3        13            -3.2           10.24
E         3        3        4        4        4        18            1.8            3.24
ΣX        14       21       16       17       13       Mean = 16.2                  Σ(Score − Mean)² = 22.8
ΣX²       48       91       54       59       39
The Cronbach's alpha formula is

$$ \alpha = \frac{n}{n-1}\left(\frac{\sigma_t^{2} - \sum \sigma_i^{2}}{\sigma_t^{2}}\right) $$

where n is the number of items, σt² is the variance of the students' total scores, and Σσi² is the sum of the variances of the individual items. For the data above, σt² = 22.8/4 = 5.7 and Σσi² = 5.2, so

α = [5/(5 − 1)] × [(5.7 − 5.2)/5.7] ≈ .11
The internal consistency of the responses in the hygiene checklist is .11, indicating low internal consistency.
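The same arithmetic can be scripted. This is a minimal sketch, assuming the 5 × 5 response matrix above; it uses sample variances (denominator n − 1), which is how the worked values 5.7 and 5.2 were obtained.

```python
def variance(values):
    """Sample variance (denominator n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

# Rows are students A..E, columns are items 1..5 (from the checklist table).
responses = [
    [5, 5, 4, 4, 1],
    [3, 4, 3, 3, 2],
    [2, 5, 3, 3, 3],
    [1, 4, 2, 3, 3],
    [3, 3, 4, 4, 4],
]

n_items = len(responses[0])
item_vars = [variance([row[i] for row in responses]) for i in range(n_items)]
total_var = variance([sum(row) for row in responses])

alpha = (n_items / (n_items - 1)) * (total_var - sum(item_vars)) / total_var
print(round(alpha, 2))  # 0.11
```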
The consistency of ratings can be obtained using a coefficient of concordance. Kendall's W coefficient of concordance is used to test the agreement among raters.
Suppose a performance task was demonstrated by five students and rated by three raters. The rubric used a scale of 1 to 4, where 4 is the highest and 1 is the lowest.
Demonstration   Rater 1   Rater 2   Rater 3   Sum of Ratings   D      D²
A               4         4         3         11               2.6    6.76
B               3         2         3         8                -0.4   0.16
C               3         4         4         11               2.6    6.76
D               3         3         2         8                -0.4   0.16
E               1         1         2         4                -4.4   19.36
                                              Mean = 8.4              ΣD² = 33.2
The ratings given by the three raters are first summed to obtain the total rating for each demonstration. The mean of the sums of ratings is obtained (mean = 8.4). The mean is subtracted from each sum of ratings to give the difference D. Each difference is squared (D²), and the sum of the squared differences is computed (ΣD² = 33.2). These values are substituted in the Kendall's W formula, where m is the number of raters and N is the number of demonstrations.
$$ W = \frac{12\sum D^{2}}{m^{2}\, N\,(N^{2}-1)} $$

W = 12(33.2) / [3²(5)(5² − 1)] = 398.4 / 1080 ≈ 0.37

A Kendall's W coefficient of 0.37 estimates the agreement of the three raters on the five demonstrations. There is moderate concordance among the three raters because the value is far from 1.00.
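A minimal sketch of the same computation, assuming the 5 × 3 rating table above and following the lesson's formula (which works on the raw rating sums):

```python
# Rows are demonstrations A..E, columns are the three raters' scores.
ratings = [
    [4, 4, 3],
    [3, 2, 3],
    [3, 4, 4],
    [3, 3, 2],
    [1, 1, 2],
]

m = len(ratings[0])   # number of raters
n = len(ratings)      # number of demonstrations
sums = [sum(row) for row in ratings]
mean = sum(sums) / n
ss = sum((s - mean) ** 2 for s in sums)  # sum of squared deviations (ΣD²)

w = 12 * ss / (m ** 2 * n * (n ** 2 - 1))
print(round(w, 2))  # 0.37
```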
What is test validity?
Construct Validity – The components or factors of the test should contain items that are strongly correlated. The Pearson r can be used to correlate the items for each factor; however, there is a technique called factor analysis to determine which items are highly correlated to form a factor.
Convergent Validity – When the components or factors of a test are hypothesized to have a positive correlation. Correlation is done for the factors of the test.
Divergent Validity – When the components or factors of a test are hypothesized to have a negative correlation, for example items on intrinsic and extrinsic motivation. Correlation is done for the factors of the test.
Cases to Illustrate the Types of Validity
1. Content Validity
A science coordinator was checking the science test paper for grade 4. She asked the grade 4 science teacher to submit the table of specifications containing the objectives of the lesson and the corresponding items. The coordinator then checked whether each item was aligned with the objectives.
2. Face Validity
The assistant principal browsed the test paper made by the math teacher. She checked whether the contents of the items are about mathematics, examined whether the instructions are clear, and browsed through the items to see whether the grammar is correct and the vocabulary is within the students' level of understanding.
3. Concurrent Validity
A school guidance counselor administered a math achievement test to the grade 6 students. She also had a copy of the students' grades in math. She wanted to verify whether the students' math grades measure the same competencies as the math achievement test, so she correlated the math achievement scores with the math grades.
4. Construct Validity
A grade 10 teacher made a science test composed of four domains: matter, living things, force and motion, and earth and space, with 10 items under each domain. The teacher wanted to determine whether the 10 items written for each domain really belong to that domain, so the teacher consulted an expert in test measurement and they conducted a procedure called factor analysis. Factor analysis is a statistical procedure for determining whether the items written load under the domain to which they belong.
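Factor analysis is normally run in statistical software. The sketch below only illustrates the workflow, assuming hypothetical response data and scikit-learn's FactorAnalysis; the examinee counts and the four-domain structure are invented for the example, not taken from the teacher's actual test.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Hypothetical data: 200 examinees x 40 right/wrong items (10 per domain).
scores = rng.integers(0, 2, size=(200, 40)).astype(float)

fa = FactorAnalysis(n_components=4)  # four intended domains
fa.fit(scores)

# loadings[f, i] is how strongly item i loads on factor f; an item
# "belongs" to the domain whose factor it loads on most strongly.
loadings = fa.components_
for i in range(40):
    print(f"item {i + 1} loads highest on factor "
          f"{np.argmax(np.abs(loadings[:, i])) + 1}")
```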
5. Convergent Validity
A math teacher developed a math test, to be administered at the end of the school year, that measures number sense, patterns and algebra, measurement, geometry, and statistics. The math teacher assumed that students' competencies in number sense help students learn patterns and algebra and the other areas. After administering the test, the scores were separated by area and the five domains were inter-correlated using the Pearson r. A positive correlation between number sense and patterns and algebra indicates that as number sense scores increase, patterns and algebra scores also increase. This shows that students' learning in number sense scaffolds their patterns and algebra competencies.
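The inter-correlation step can be sketched as follows, assuming each student's test has already been split into five domain scores; the score arrays here are hypothetical, not the teacher's data.

```python
import numpy as np

# Hypothetical domain scores for eight students (columns: number sense,
# patterns and algebra, measurement, geometry, statistics).
domain_scores = np.array([
    [9, 8, 7, 8, 6],
    [5, 6, 5, 4, 5],
    [8, 9, 7, 7, 8],
    [3, 4, 4, 3, 2],
    [7, 6, 6, 7, 7],
    [4, 5, 3, 4, 4],
    [10, 9, 9, 8, 9],
    [6, 5, 6, 6, 5],
])

# np.corrcoef returns the 5 x 5 matrix of Pearson r values between domains.
r_matrix = np.corrcoef(domain_scores, rowvar=False)
print(np.round(r_matrix, 2))
# A positive r between column 0 (number sense) and column 1 (patterns and
# algebra) supports the teacher's convergent-validity hypothesis.
```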
An item can discriminate if the examinees who scored high on the test answer the item correctly more often than the examinees who scored low. Below is a data set for 5 items on addition and subtraction of integers. Follow the procedure to determine the difficulty and discrimination of each item.
Upper group         Item 1   Item 2   Item 3   Item 4   Item 5   Total score
Student 2           1        1        1        0        1        4
Student 5           0        1        1        1        1        4
Student 9           1        0        1        1        1        4
Total               2        2        3        2        3
Proportion of the
high group (pH)     0.67     0.67     1.00     0.67     1.00

Lower group         Item 1   Item 2   Item 3   Item 4   Item 5   Total score
Student 7           0        0        1        1        0        2
Student 8           0        1        1        0        0        2
Student 4           0        0        0        0        1        1
Total               0        1        2        1        1
Proportion of the
low group (pL)      0.00     0.33     0.67     0.33     0.33
4. The item difficulty is obtained using the following formula:

Item difficulty = (pH + pL) / 2

                      Item 1      Item 2    Item 3   Item 4    Item 5
Index of difficulty   0.33        0.50      0.83     0.50      0.67
Interpretation        Difficult   Average   Easy     Average   Average
5. The index of discrimination is obtained using the formula:

Item discrimination = pH − pL

The value is interpreted using the discrimination index table. A sketch pulling the difficulty and discrimination computations together is given below.
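This is a minimal sketch using the 0/1 item responses from the two tables above; the grouping into upper and lower examinees is taken as given.

```python
# 0/1 responses for items 1..5, taken from the tables above.
high_group = [  # students 2, 5, 9
    [1, 1, 1, 0, 1],
    [0, 1, 1, 1, 1],
    [1, 0, 1, 1, 1],
]
low_group = [  # students 7, 8, 4
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
]

def proportions(group):
    """Proportion of the group answering each item correctly."""
    n_items = len(group[0])
    return [sum(row[i] for row in group) / len(group) for i in range(n_items)]

p_high = proportions(high_group)
p_low = proportions(low_group)

for i, (ph, pl) in enumerate(zip(p_high, p_low), start=1):
    difficulty = (ph + pl) / 2      # step 4 above
    discrimination = ph - pl        # step 5 above
    print(f"item {i}: difficulty = {difficulty:.2f}, "
          f"discrimination = {discrimination:.2f}")
```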