Unit-3
EVALUATION TOOLS
Structure
3.0 Objectives
3.1 Introduction
3.2 Principles of Evaluation Tool Construction
3.2.1 Planning
3.2.2 Preparation
3.2.3 Try-out
3.2.4 Evaluation
3.2.5 Finalisation
3.3 Item Analysis
3.3.1 Mechanics of an Item
3.3.2 Required Functional Conditions of an Item
3.3.3 Checking the Functional Conditions of an Item
3.3.4 Behavioural Characteristics of an Item
3.3.5 Measuring Behavioural Characteristics
3.3.6 Interpreting Behavioural Characteristics
3.3.7 Use of Behavioural Indices
3.4 Guidelines for the Use of an Evaluation Tool
3.4.1 Quality of a Test: Some Focal Points
3.4.2 Validity of Tests
3.4.3 Reliability of Tests
3.4.4 Usability
3.5 Let Us Sum Up
3.6 Answers to Check Your Progress
3.0 OBJECTIVES
After going through this unit, you should be able to:
•	identify the principles of test construction;
•	explain the processes involved in test construction;
•	describe and differentiate item formats; and
•	list the qualities of a test and identify the different approaches to using a test.
3.1 INTRODUCTION
Educational testing involves four stages of activity: (i) planning the test structure;
(ii) constructing the test; (iii) administering the test; and (iv) assessing and
interpreting the learners’ performance. At the planning stage decisions have
to be taken with regard to the objectives, choice of content area, choice of
skills/abilities, the length/duration of the test, etc. At the construction stage the
choice of item formats for the proposed test-points, the nature of sampling, the
sequencing and grouping of items, drafting of instructions, etc. are to be decided
upon. At the administration stage the main concern is to provide the appropriate
conditions, facilities, accessories, etc., uniformly to all learners who take the test.
At the stage of interpretation, conclusions with regard to the testees’ performance,
ability, their achievement in terms of the standards set, their relative standing in
the group, etc., are to be arrived at.
These stages of activity are not independent of one another; rather, they are
closely interdependent. Test construction is not only guided by a test plan but
also governed by considerations of available conditions of administration. The
test-plan conceives not only the areas of test-construction activities but also the
norms by which the learner-performance is to be interpreted.
Beyond these four stages is the stage of test validation—a stage where we
undertake the evaluation of a test. This activity serves two important purposes:
i) the purpose of immediate concern—ensuring whether the test procedure and
the test are dependable in terms of the objectives and measurement;
ii) the purpose of long-term concern—envisaging and directing the reformative
efforts that are required to improve the examination procedures (through
consistent and systematic efforts over a period of years of feeding the
findings of the evaluation of the work in one examination into the planning
operations of the subsequent examinations).
There are different criteria to be taken into consideration while evaluating
a test or determining the worth or quality of a test. Chief among these are
provided by the concepts of ‘validity’ and ‘reliability’. There are a host
of other features to consider which pertain, in general, to the question of
‘usability’. We shall discuss these in the following sections. But before we
take them up for discussion, let us look into some of the general questions
concerning evaluation of educational tests.
3.2.1 Planning
Planning is the first step in test construction: whatever we do, we plan it in advance
so that our work is systematic. Naturally, many questions come to our
mind before we prepare a test. What content area is to be covered by the test?
What types of items are to be asked and what are the objectives that are going
to be tested? Are the objectives that are going to be tested clear? etc. Suppose
we want to prepare a good test in General Science for grade X. If most of the
questions are asked from Physics and Chemistry and a few from Zoology, will
it be a good test? Anybody would offer the criticism that the test is defective
because it would fail to measure the achievement of pupils in Geology,
Astronomy, Physiology, Hygiene and Botany. So it is desirable that all the
content areas be duly represented in the test. Moreover, due weightage should be
given to different contents. Whenever we want to prepare a test, the weight to be
given to different content areas must be decided at the beginning.
If we are preparing a test on Physics we must decide the weight to be given to
different chapters taught in Physics. Thus, at the planning stage, the first task is
to decide the weight to be given to different content areas. Usually the weight is
decided by taking expert advice. It must be noted that the weightage must be in
conformity with the amount of content taught under each content area. To draw
up an objective list of such weights, the average of the available expert opinions
may be taken.
Even though we give due weight to content areas, if all the questions (test
items) aim at testing only the memory of the pupil, can it be called a good test?
No. It must not only cover the area of knowledge (recall or memory comes
under knowledge), but also other objectives like understanding, application,
skill etc. A good test must aim at measuring all the behavioural areas. But can
all the objectives be tested through one test? At this stage expert opinion may
be sought as to which objectives are to be covered through our test. Usually
in achievement tests four major objectives viz., knowledge, understanding,
application and skill are tested. Now another question comes to our mind. What
weights are to be given to different objectives chosen? Here again expert opinion
can be taken.
Suppose that we are going to construct a test of Mathematics for students of
grade IX. After due expert opinion we decide that the weight to be given to
the different content areas viz. Arithmetic, Algebra, Mensuration and Geometry
are 20%, 30%, 20% and 30% respectively and the weight to be given to the
objectives of knowledge, understanding, application and skill are 40%, 30%, 20%
and 10% respectively. After the weight to be given to the different content areas
and different objectives is decided upon, a blue-print can be prepared. A blue-
print for the proposed test on Mathematics is presented below.
Table 3.1: Blue-print (two dimensional chart)
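As an illustration of how such a blue-print translates into an actual allocation of marks, the following Python sketch distributes the marks of a hypothetical 100-mark test over the content × objective cells, using the weights decided above. The 100-mark total and the rounding to whole marks are assumptions made only for this illustration.

```python
# Illustrative sketch: distributing marks over a blue-print (two-dimensional chart).
# The content and objective weights come from the text above; the 100-mark total
# is an assumed figure for illustration only.

content_weights = {"Arithmetic": 0.20, "Algebra": 0.30, "Mensuration": 0.20, "Geometry": 0.30}
objective_weights = {"Knowledge": 0.40, "Understanding": 0.30, "Application": 0.20, "Skill": 0.10}
total_marks = 100  # assumed

# Each cell of the blue-print gets (content weight x objective weight) of the total marks.
blueprint = {
    (content, objective): round(total_marks * cw * ow)
    for content, cw in content_weights.items()
    for objective, ow in objective_weights.items()
}

header = "{:<12}".format("") + "".join("{:>14}".format(o) for o in objective_weights)
print(header)
for content in content_weights:
    row = "{:<12}".format(content)
    row += "".join("{:>14}".format(blueprint[(content, o)]) for o in objective_weights)
    print(row)
```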
3.2.2 Preparation
The second step in test construction is the preparation of the test itself. At this
stage we have to prepare:
i) the test items
ii) the directions to test items
iii) the directions for administration
iv) the directions for scoring
v) a question-wise analysis chart.
i) Preparation of the test items
Items must be prepared in conformity with the blue print. We have to choose
appropriate items (test situations) which would test the specified objectives
in the specific content area. Construction of test items is not so easy. It
is the task of test-specialists and experts. An experienced teacher who is
sufficiently trained in test-construction can prepare appropriate test items.
There are certain rules and guidelines for construction of test items. Separate
guidelines are there for construction of ‘essay type’, ‘short-answer type’ and
‘objective type’ tests. Even for construction of different types of objective-
type tests, specific guidelines are prescribed. One must have access to all
these guidelines and also to the taxonomy of objectives before
constructing test items. In general, the test items must be clear, comprehensive
and free from ambiguity. They must be aimed at measuring the desired pupil-
behaviour. They must fulfil their functions to ensure validity.
After the test items are framed they must be arranged properly and assembled
into a test. If different forms of test items are being used, they should
preferably be grouped form-wise. Moreover, easy items are to be given a
place in the beginning, the difficult items at the end. The test items may
be arranged in the order of difficulty. Of course, there are various ways of
assembling the questions and we may assemble the questions according to our
purpose and convenience of interpretation.
ii) Preparation of directions to test items
Appropriate directions to test items should be prepared. The directions must
be clear and concise so that the students will understand them easily. The
students should know whether they have to write the response, put
a tick against the right response, mark the response in squares
provided on the right side of the question, or mark the response on
a separate answer sheet, etc. Sometimes the directions to test items are so
ambiguous that students cannot follow them; they then respond
to the items in whatever manner they think fit at that instant or simply
pass on to the next item, leaving it unanswered. Due to lack of clarity of
directions, students will respond differently at different times, which would
lower the reliability of the test. It is essential that the directions to the test
items must be carefully prepared and they must be as clear and simple as
possible. If necessary, full guidelines (even a demonstration) for responding to
an item may be given.
iii) Preparation of directions for administration
A clear and detailed direction as to how the test is to be administered is to
be provided. The conditions under which the test is to be administered, when
the test is to be administered (whether in the middle of the session or at the
end of the session etc.), within what time limit it is to be administered etc.
are to be stated clearly. If the test has separate sections, time limits to cover
each section must be mentioned. The materials required (if any) for the test
such as graph papers, logarithm tables etc. must be mentioned. The directions
must state clearly what precautions the administrator should take at the time
of administration. So it is important that appropriate and clear directions for
test-administration be prepared.
iv) Preparation of directions for scoring
To facilitate objectivity in scoring, ‘scoring keys’ are to be provided. A scoring
key is a prepared list of answers to a given set of objective-type questions.
Suppose there are 10 multiple-choice objective-type questions (each having
four options: A, B, C, D) in a section of the test; the scoring key will be as
follows:
Section-I Scoring key
Q.N. 1 2 3 4 5 6 7 8 9 10
Key D C C A B D B A C B
A scoring key is prepared by listing serially the key (or right answer) to each
question against each item.
For short answer type questions and essay type questions, marking schemes
are to be prepared (i.e., marks allotted to different parts of the answer or to
different important points etc. are to be mentioned). Such scoring keys and
marking schemes must be carefully prepared. They serve as guides at the time
of scoring the test and they ensure objectivity in scoring.
If a ‘correction for guessing’ is to be applied, the scoring directions may specify the formula to be used, such as:

S = R − W / (N − 1), where
S = the corrected score,
R = No. of right responses,
W = No. of wrong responses,
N = Total no. of options.
Thus, such specific directions for scoring as are likely to be necessary must be
prepared. Of course, these may vary from test to test.
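The scoring key and the correction-for-guessing formula above can be applied mechanically. The following Python sketch scores a hypothetical set of responses against the Section-I key and then applies S = R − W/(N − 1); the candidate's responses are invented for illustration.

```python
# Sketch: scoring a set of multiple-choice responses against the Section-I key
# given above, and applying the correction-for-guessing formula S = R - W/(N - 1).
# The candidate's responses below are hypothetical, for illustration only.

key = ["D", "C", "C", "A", "B", "D", "B", "A", "C", "B"]          # from the scoring key above
responses = ["D", "C", "A", "A", "B", None, "B", "D", "C", "B"]   # None = item omitted (hypothetical)
options_per_item = 4                                              # N in the formula

right = sum(1 for k, r in zip(key, responses) if r == k)
wrong = sum(1 for k, r in zip(key, responses) if r is not None and r != k)  # omissions not penalised

corrected_score = right - wrong / (options_per_item - 1)
print(f"R = {right}, W = {wrong}, corrected score S = {corrected_score:.2f}")
```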
v) Preparation of a question-wise analysis chart
A question-wise analysis chart is given here. In this chart every question is
analysed. This chart shows the content area (topic) the question covers, the
objectives (with specifications) that it intends to measure, its type, the marks
allotted to it, expected difficulty level and time taken to answer it. This chart
not only analyses the items, but also gives us a picture of coverage of contents,
objectives, type of questions and coverage of different difficulty levels, etc.
Moreover, this gives us some idea about the total time required for taking the
test. This chart further helps us check whether the test has been prepared as per
the blue print or not.
Table 3.2: Question-wise analysis chart
Note: Preparation of such a chart is a necessity for teacher made tests; but for
standardised tests we may or may not do it at the preparation stage. However,
such analysis may be necessary at the time of editing the final form of the tests.
3.2.3 Try-out
The questions may be carefully constructed, but there is no guarantee that they
will operate in the same manner as planned. So before the final form of the test is
prepared it is necessary to have a try-out.
For the trial of the items, a preliminary form of the test is generally prepared.
This contains more items than are actually required for the final form.
Usually the number of items included in the trial form should be nearly double
the number of items required for the final form, because at the item-
analysis stage many items will be discarded. A detailed scoring key of the trial
form should therefore be prepared.
i) Preliminary try-out: After the test items, directions for response,
administration and scoring are prepared, it is tried out on a few ‘sample’
students just to ascertain how it works. At this stage 10 to 15 students
of different abilities are selected and the test is administered. The aim of
doing so is to detect omissions or mistakes, if any, to examine whether
the directions to items are actually being followed by students, and to examine
whether the time allowed is sufficient, etc. Although a test is constructed
with caution, it may have some errors or ambiguities of direction here and
there. The preliminary try-out serves to bring these to light. This helps us
to modify or revise the items or directions whenever necessary. After due
corrections the test is edited.
ii) Final try-out: At this stage the test is administered to a representative
sample. The sample need not be too large. As it is just a pilot study, a sample
of 200 to 300 will do. But it must be borne in mind that this sample must
be a representative sample of poor, average and brilliant students. The aim
of such a try-out is to identify the defects and deficiencies of the test and to
provide data for evaluating the test.
The purpose of the try-out can be summed up as follows:
a) to identify the defective or ambiguous items.
b) to discover the weaknesses in the mechanism of test administration.
c) to identify the non-functioning or implausible distracters in case of multiple
choice tests.
d) to provide data for determining the discriminating value of items.
e) to determine the number of items to be included in the final form of the test.
f) to determine the time limit for the final form.
At the tryout stage the directions must be strictly followed. Conditions for test
administration must be normal. The atmosphere should be calm and quiet. There
should be proper seating arrangements, light, ventilation and water arrangements.
Proper invigilation and supervision must be ensured. A wrong administration of
the test will give us wrong data for its evaluation.
iii) Scoring: After the try-out form of the test is administered the answer sheets
are scored as per the scoring key and scoring directions. ‘Corrections for
guessing’ are also done if required under scoring directions. Now the scores
are ready for item analysis and evaluation of the test.
3.2.4 Evaluation
After scoring is complete, the test must be evaluated to examine whether the test
items are good and whether the test is reliable and valid. For this purpose we:
i) analyse the items to examine their worth of inclusion in the test (Item-analysis);
Note: The table of areas under the normal curve may be referred to. Items with
difficulty values between ±1 are usually retained.
Method 5: The item analysis procedure used to obtain a reliable ranking of
learners and indices of item difficulty and item discriminating power includes the
following steps:
i) Arrange the answer papers after scoring on the basis of merit (Highest mark
at the top and lowest mark at the bottom).
ii) Select the 27% of the answer papers from the top and 27% of the answer
papers from the bottom. The top 27% who have secured better marks
constitute the higher group (H-group) and the bottom 27% who have secured
poor marks constitute the lower group (L-group).
iii) Calculate WH i.e., for each item, determine the number of persons from the
H-group who have wrongly answered the item or omitted it.
iv) Calculate WL i.e., for each item, determine the number of persons in the
L-group who have wrongly answered the item or omitted it.
v) Calculate WH + WL.
vi) Calculate the index of difficulty:
    I.D. = [(WH + WL) / 2n] × 100, where
    n = number of persons in either the lower group or the higher group (n = 27% of N).
For multiple-choice tests (where the options may be three or four) the
following formula is used:
    I.D. = [(WH + WL) / 2n] × [No. of options / (No. of options − 1)] × 100
Usually items in the range of 16% to 84% difficulty level are retained.
We can calculate the desired WH + WL values from the following table.

Table 3.4: Calculation of item difficulty levels for multiple-choice questions

Difficulty       WH + WL values (No. of options each item has)
level              2         3         4         5
16%              .160n     .213n     .240n     .256n
84%              .840n    1.120n    1.260n    1.344n

By referring to the table we can find that for an ‘n’ of 120, the minimum WL − WH
value for an item with 4 options should be 16. So, all the items whose
WL − WH value is 16 or above are considered to be sufficiently discriminating.
If the WL − WH value of an item is less than 16, it is to be rejected.
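The 27%-group procedure of Method 5 above lends itself to a simple computation. The Python sketch below implements it for one item, under the assumption (stated in steps iii and iv) that omitted responses are counted along with wrong ones; the scores used are hypothetical.

```python
# A sketch of Method 5: rank the candidates, take the top and bottom 27%,
# count the wrong/omitted responses in each group (WH, WL) for an item, and compute
# I.D. = (WH + WL) / (2n) x 100 and the WL - WH discriminating value.
# The scores and item results below are hypothetical.

def item_indices(total_scores, item_correct):
    """total_scores: total test score per candidate; item_correct: parallel list of
    True/False, whether the candidate answered the item correctly (False = wrong/omitted)."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i], reverse=True)
    n = round(0.27 * len(total_scores))            # size of the H-group and of the L-group
    h_group, l_group = order[:n], order[-n:]
    wh = sum(1 for i in h_group if not item_correct[i])
    wl = sum(1 for i in l_group if not item_correct[i])
    difficulty = (wh + wl) / (2 * n) * 100         # index of difficulty, in per cent
    return difficulty, wl - wh                     # second value: discriminating value

totals  = [25, 24, 23, 22, 21, 20, 18, 15, 12, 10, 8, 5]      # 12 candidates (hypothetical)
item_ok = [True, True, True, True, False, True, True, False, False, True, False, False]
diff, disc = item_indices(totals, item_ok)
print(f"Index of difficulty = {diff:.1f}%, WL - WH = {disc}")
```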
c) Internal consistency of items with the whole test
Statistical methods are used to determine the internal consistency of items.
Biserial correlation gives the correlation of an item with its sub-test scores
and with total test-scores. This is the process of establishing internal validity.
There are also other methods of assessing internal consistency of items and
as they are beyond the scope of our present purpose, we have not discussed
them here.
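The text above refers to biserial correlation. As a rough computational illustration, the sketch below computes the point-biserial coefficient, a closely related and commonly used index of item-total consistency (the Pearson correlation between a dichotomous item score and the total score); the data are hypothetical.

```python
# Sketch of an item-total consistency check. The text above mentions biserial correlation;
# this sketch computes the point-biserial coefficient, a closely related index that is
# often used in practice. The data below are hypothetical.

import statistics

def point_biserial(item_scores, total_scores):
    mean_t = statistics.mean(total_scores)
    sd_t = statistics.pstdev(total_scores)
    p = sum(item_scores) / len(item_scores)            # proportion answering correctly
    q = 1 - p
    mean_correct = statistics.mean(t for i, t in zip(item_scores, total_scores) if i == 1)
    return (mean_correct - mean_t) / sd_t * (p / q) ** 0.5

item = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]                  # hypothetical 0/1 item scores
total = [25, 24, 14, 22, 12, 20, 10, 15, 23, 21]       # hypothetical total scores
print(f"Point-biserial correlation = {point_biserial(item, total):.2f}")
```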
3.2.5 Finalisation
After item analysis, only good items with appropriate difficulty level and
with satisfactory discriminating power are retained and these items form the
final test. Time required for the test is determined by taking the average time
taken by three students who represent three groups: bright, average and below
average. Now the test is administered to a large representative sample and the
test-papers are scored.
i) to be neither too difficult nor too easy for the prospective testees, and
ii) to discriminate effectively ‘the more able’ (among the testees) from ‘the less
able’.
We may also need information about the behaviour characteristics of an item
when we construct tests of special specifications, like tests of the same difficulty
level (‘parallel tests’) and tests of progressive difficulty levels (‘graded tests’).
Step 2: Divide the response sheets into three ability-groups (Col. I).
Table 3.6: Illustration: tabulation of scores to facilitate FV and DI computations

(Table 3.6 lists, for a sample of 110 candidates, the ability group (Col. I), the serial
number in order of ranking (Col. II), the roll number (Col. III), the total individual
score (Col. IV) and the item-wise scores on items 1 to 8 (Col. V). The group-wise
figures used in the worked computations below are:)

Group                            Candidates   Correct on item 1   Correct on item 4
Higher ability group (HAG)           11                9                   9
Middle ability group (MAG)           88               63                  72
Lower ability group (LAG)            11                2                   5

FV for item 1
= (Total score of the sample on the item / Total no. of candidates) × 100
= (9 + 63 + 2) / (11 + 88 + 11) × 100
= 74/110 × 100 = 67.27%

FV for item 4
= (9 + 72 + 5) / (11 + 88 + 11) × 100
= 86/110 × 100 = 78.18%

DI for item 1
= Facility in respect of HAG − Facility in respect of LAG
= 9/11 − 2/11
= 7/11 = 0.636

DI for item 4
= 9/11 − 5/11
= 4/11 = 0.363
Step 3: Draw a table of vertical columns and horizontal rows. Enter the serial
number of the Response Sheets, which you put while carrying out Step 1, one
below the other in the first column. (The corresponding Roll No. of candidates
can be given in the second column, if necessary.) Leave a gap of three rows each
below the HAG, the MAG and the LAG.
Step 4: Enter item-wise scores in the horizontal row against each candidate (Col.
V). When the item-wise scores of all candidates are entered, the total score of
the sample on each item could be calculated by adding up scores in the vertical
columns and the total score of each individual on the test could be calculated by
adding scores along the horizontal row.
Determining the facility value: Facility value is generally presented as a
percentage. In the case of an objective type item, it is calculated as the number
of learners answering the item correctly divided by the number of learners
attempting it. The fraction is multiplied by 100 to get the figure in percentage.
In the case of a supply-type question, facility value is the average mark obtained
by the sample on the question divided by the maximum mark allotted for the
question. Here too the fraction is converted into a percentage figure.
To summarise:
FV of an objective item
= (No. of learners answering the item correctly / No. of learners taking the test) × 100
FV of a free-response question
= (Average score obtained by the sample on the question / Max. mark allotted for the question) × 100
The facility value ranges from 0% to 100%. A 0% FV represents the fact that none of
the sample has answered the item correctly and hence the item has no ‘facility’
whatsoever for the given sample. A 100% FV represents the fact that everyone in
the sample has answered the item satisfactorily and the item has no difficulty
whatsoever for the given sample.
Determining the discrimination index: The discrimination index of an item is
arrived at by deducting the facility value of the LAG on the item from the
facility value of the HAG on the same item. The DI is always presented in the
form of a decimal fraction and it may range from −1.0 to +1.0.
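The FV and DI computations illustrated for item 1 in Table 3.6 can be expressed as a short routine. The sketch below reproduces those figures (HAG: 9 correct out of 11, MAG: 63 out of 88, LAG: 2 out of 11).

```python
# Sketch: computing facility value (FV) and discrimination index (DI) for an
# objective item, using the item-1 figures from the illustration above
# (HAG: 9 correct out of 11, MAG: 63 out of 88, LAG: 2 out of 11).

def facility_value(correct, taking_test):
    return correct / taking_test * 100

def discrimination_index(correct_hag, n_hag, correct_lag, n_lag):
    return correct_hag / n_hag - correct_lag / n_lag

correct_item1 = 9 + 63 + 2
candidates = 11 + 88 + 11
print(f"FV (item 1) = {facility_value(correct_item1, candidates):.2f}%")   # about 67.27%
print(f"DI (item 1) = {discrimination_index(9, 11, 2, 11):.3f}")           # about 0.636
```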
These features are the focal operational points; even a small slackening of care
with regard to any of them might prevent an educational test from meeting
its stated or intended objective. The chief attributes of a good test, namely validity,
reliability and usability, should be verified against these features to ascertain
the quality of a given test. We shall now take up each one of these attributes
and discuss them in greater detail.
3.4.2 Validity of Tests
What is validity?
The concept of the validity of a test is primarily a concern for the ‘basic
honesty’ of the test, ‘honesty’ in the sense of ‘doing’ what an item promises to
do. It is a concern for the relationship between, on the one hand, the purpose
to be achieved and, on the other hand, the efforts taken, the means
employed and what those efforts and means actually achieve. There is always
a gap, whatever its size, between the purpose of a test and the extent
of realisation of that purpose in practice. Hence absolute validity remains an
ideal in educational testing. Perfection in terms of validity, a perfect match between
purpose and practice, is hard to achieve. This is due to:
i) the nature of ‘learning’, which is the subject of measurement, and
ii) the nature of each of a number of factors that become involved in the
measurement of learning.
(We have discussed these factors briefly in Section 3.4 above.)
The less a test deviates in practice from its stated purpose, the
more valid it is. Hence validity is a measure of the degree of success with
which a test accomplishes what it sets out to accomplish. It is an attempt to
answer how close a test is in its operation to the purpose in its plan-design.
To be precise, a test is valid to the extent to which it measures what it
purports to measure.
Types of validity:
‘Purpose’ and ‘practice’, then, are the two dimensions of considerations
involved in the concept of validity. ‘Practice’ is conditioned mainly by three
operant forces—the test, the testee and the examiner. If the demands made
by the test, the performance offered by the testee and the valuation (of the
testees’ performance) done by the examiner are all aligned with
the given/set purpose of the test, then validity is ensured. Thus the
test purpose is the constant point of reference for validity. Consequently,
several types of validity are conceived of to suit the specific purposes that
tests are designed to serve. We shall take up four types of them for our
discussion. They are:
i) Content validity
ii) Criterion-related validity
a) Concurrent validity, and
b) Predictive validity
iii) Construct validity
iv) Face validity
Content validity
Content validity is the most important criterion for the usefulness of a test,
especially of an achievement test. It is a measure of the match between
the content of a test and the content of the ‘teaching’ that preceded it. The
measure is arrived at subjectively, after a careful process of inspection
comparing the content of the test with the objectives of the course of
instruction.
The key aspect in content validity is that of sampling. Every achievement
test has a content area and an ability range specified for its operation. Given
the limited human endurance in taking a test (say three hours at a stretch)
and hence a limited test-duration, no single test can ever make a total
representation of any considerable length of a content area. A test, therefore,
is always a sample of many questions that can be asked. It is a concern of
content-validity to determine whether the sample is representative of the
larger universe it is supposed to represent.
A table of specifications with a careful allocation of weight to different units
of the content area and the several abilities, keeping in view the set objectives
of the course and the relative significance of each of these, can help a test-
constructor as a road-map in the construction of items and the development
of a test. A careful scrutiny of the table of specifications (if any has been
used) and the loyalty with which it has been adhered to while developing
the test can help you assess the adequacy and appropriateness of sampling.
Where such a table of specifications is not available, you may have to develop
a ‘concept-mapping’ of the content area to check the sampling of content
represented by a test.
We should note here that a test that is content valid for one purpose may be
completely inappropriate for another. We should also note that the assessment
of content-validity is basically subjective, as it depends on the
assessor’s estimate of the degree of correspondence between what is taught
(or what should be taught) and what is tested. It requires a careful examination
of the stated objectives of the course in terms of course content and target
abilities, and a study of the size and depth of realisation of their coverage.
Such examinations lead to ‘estimates’ and not to ‘measurements’. That is,
the observations of such examinations tend to be subjective statements. They
cannot be expressed in terms of objective numerical indices.
Check Your Progress 3
Notes: a) You can work out your answer in the space given below.
b) Check your answer with the one given at the end of this unit.
We have said that an achievement test of high content validity cannot be a
content valid test for diagnostic purposes. Why?
...................................................................................................................................
...................................................................................................................................
...................................................................................................................................
...................................................................................................................................
...................................................................................................................................
...................................................................................................................................
...................................................................................................................................
...................................................................................................................................
...................................................................................................................................
...................................................................................................................................
Criterion-related validity
Unlike content validity, criterion-related validity can be objectively measured
and declared in terms of numerical indices. The concept of criterion-related
validity focuses on a set ‘external’ criterion as its yardstick of measurement.
The ‘external’ criterion may be data of ‘concurrent’ information or of a
future performance.
The ‘concurrent’ criterion is provided by a data-base of learner-performance
obtained on a test, whose validity has been pre-established. ‘Concurrent’ here
implies the following characteristics:
i) the two tests—the one whose validity is being examined and the one with
proven validity (which is taken as the criterion)—are supposed to cover the
same content area at a given level and the same objectives;
ii) the population for both the tests remains the same and the two tests are
administered in an apparently similar environment; and
iii) the performance data on both the tests are obtainable almost
simultaneously (which is not possible in the case of ‘predictive’ criterion).
The ‘predictive’ criterion is provided by the performance-data of the group
obtained on a course/career subsequent to the test which is administered to the
group and whose validity is under scrutiny.
The validity of a given test is established when its data correlate highly
(i.e. agree closely) with the ‘concurrent’ criterion.
Validity established by correlation with ‘concurrent criterion’ yields
concurrent validity and similarly validity established against the scale of
‘predictive’ criterion is called predictive validity. The former resolves the
validity of tests serving the purpose of measuring proficiency; the latter
resolves the validity of tests meant for predictive function. The ‘concurrent’
criterion has been widely used in the validation of psychological tests,
especially tests of intelligence. The general practice is that one or two
standardised tests of intelligence with proven quality are used to validate the
new test. Predictive validity is crucial in all selection and placement tests. For
example, when
a Banking Recruitment Board selects candidates for the post of clerks on
the basis of a clerical aptitude-cum-intelligence test, the selection will be
purposeful only if a high correlation is established between the test results
of the candidates and their performance ability, subsequently, in the clerical
position. The higher the predictive validity, the more emphatic this assertion
will be. In all cases of criterion-related validity, an index of the degree of
correspondence between the test being examined and the criterion can be
obtained. This index of agreement is known as the correlation coefficient in
statistical parlance.
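Since criterion-related validity is reported as a correlation coefficient, a simple illustration is to compute Pearson's r between the scores on the test under examination and the criterion scores. The paired scores in the sketch below are hypothetical.

```python
# Sketch: the validity coefficient is simply the correlation between the scores on the
# test under examination and the scores on the criterion measure. The paired scores
# below are hypothetical.

import statistics

def pearson_r(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

aptitude_test = [56, 61, 48, 72, 65, 58, 70, 50]     # scores on the test being validated
job_performance = [60, 66, 45, 78, 62, 55, 75, 52]   # criterion: later performance ratings
print(f"Validity coefficient (r) = {pearson_r(aptitude_test, job_performance):.2f}")
```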
Construct validity
The word ‘construct’ means the ideas developed in one’s mind to define,
identify or explain objects/phenomena. Let us suppose that a person is
interested in the study of intelligence. He/she hypothesises that the third or
fourth generation learners will have a higher IQ than the first generation
learners. On the basis of his/her observations he/she may build a theory
specifying the degree of difference in the IQ of the two groups of learners.
If a test is constructed, then, to measure the difference in the levels of
intelligence of first generation learners and third/fourth generation learners,
the test would be considered to have construct validity to the extent that its
scores correspond to judgements made from the observations derived by the
scorer about the intelligence of the two groups of learners. If the expected
level of difference is not established by the test scores, then the construct
validity of the assumption that the test measures the difference in the levels of
intelligence is not supported. Thus, a test will be described to have construct
validity if its scores vary in ways suggested by the theory underlying the
construct. In other words construct validity is the degree to which one can
infer certain constructs in a psychological theory from the test score.
Construct validity is an important concept to those who are engaged in
theoretical research on various constructs.
Face validity
Before we take up a detailed scrutiny of a test for any of the above validity-
types, we generally tend to make an impressionistic assessment so as to
develop some propositions which may guide our approach to the assessment
of validity. Such propositions are developed on a facial understanding of the
extent to which a test looks like a valid test or the extent to which the test
seems logically related to what is being tested. These propositions constitute
what is known as face validity.
Face validity may not be dependable. A test may look right without being
rational or even useful. For instance, a terminal examination in a course
of 10 units may appear to have reasonable face validity until you come to
realise that it contains questions on the first five units only and therefore lacks
content validity. Sometimes there may be situations where a test may appear
to have low face validity, but in practice it may turn out to be a sound one. In
such cases, the testees too may not know what is being tested and ipso facto
the measure may provide far more effective assessment. For instance, the
ability to react quickly to a flash of light may be a good test of potential as a
football player. Such a test of reaction time may be valid even if
it doesn’t have much face validity.
(Incidentally the ‘idea’ that prompts using this reaction time test for
identifying a potential football player is a construct. The idea or the construct
may perhaps be stated thus: one who is able to react speedily and
correctly to the sudden darts of an object can be a potential football player.)
Estimates of reliability
The methods used to measure reliability differ according to the source of error
under consideration. The most common approaches to estimates of reliability
are:
i) Measures of stability,
ii) Measures of equivalence,
iii) Measures of stability and equivalence,
iv) Measures of internal consistency, and
v) Scorer reliability.
Measures of stability: Measures of stability are known as ‘test-retest
estimates of reliability’. They are obtained by administering a test twice to the
same group with a considerable time-interval between the two administrations
and correlating the two sets of scores thus obtained.
In this type of estimate we do not get specific information as to which of the
sources of error contribute(s) to the variance in the scores. It gives only a
measure of the consistency, over a stretch of time, of a person’s performance on
the test.
The estimate of reliability in this case will vary according to the length of the
time-interval allowed between the two administrations. The intervening
period can be relatively long if the test is designed to measure relatively stable
traits and the testees are not subjected, during the period between the two test
administrations, to experiences which tend to affect the characteristic being
measured. The intervening time should be shorter when these conditions are
not satisfied. But it should not be so short as to allow ‘memory’ or ‘practice
effects’ to inflate the relationship between the two performances.
Measure of equivalence: In contrast to the test-retest estimate of reliability
which measures change in performance from one time to another, the estimate
of reliability with equivalent forms of tests measures changes due to the
specificity of knowledge within a domain. Instead of repeating the same test
twice with an intervening time-gap, the latter procedure administers two forms
(‘parallel’ in terms of content and difculty) of a test to the same group on
the same day (i.e. with negligible time-gap) and correlates the two sets of
scores obtained thereon.
The two methods of estimating reliability are quite different and can yield
different results. The choice between the two depends on the purpose for
which you administer the test. If your purpose is long-term prediction about
the reliability of the test, you can choose the procedure of retest reliability
estimation. If your purpose, on the other hand, is to infer one’s knowledge in
a subject matter area, you will have to depend on equivalent forms of estimate
of reliability.
Measures of stability and equivalence: When one is concerned with both
long-range prediction and inferences to the domain of knowledge, one should
obtain measures of both equivalence and stability. This could be done by
administering two similar (parallel) forms of a test with considerable time-
gap between the two administrations. The correlation between the two sets of
scores thus obtained by the same group of individuals will give the coefficient
of stability and equivalence. The estimate of reliability thus obtained will be
generally lower than the one obtained in either of the two other procedures.
Measures of internal consistency: The three methods discussed above are
concerned with consistency between two sets of scores obtained on two
different test administrations. The methods that we discuss hereafter,
collectively called ‘measures of internal consistency’, arrive at a reliability
estimate taking into consideration the scores obtained on a single test
administration. The estimates of reliability obtained through these methods are
mostly indices of the homogeneity of items in the test, or of the extent of overlap
between the responses to an item and the total test score. The three types of
measures of internal consistency are discussed below.
Split-half estimates: Theoretically the split-half method of estimating
reliability is the same as the equivalent forms method. The split-half
method, however, requires only one test administration; while scoring the items,
a sub score for each of the two halves of the test is obtained and the two
sub scores are correlated to get the reliability estimate of half the length of
the test. To estimate the reliability of the scores on the full length test, the
following formula is used:
Reliability on full test = (2 × reliability on 1/2 test) / (1 + reliability on 1/2 test)
The application of this formula assumes that the variances of the two halves
are equal. That is to say that the items in one half are supposed to match in
respect of content and difficulty with the corresponding items in the other.
The question then is how the tests can be split into two halves. Different
methods are followed, but ordinarily it is done by a preconceived plan (say,
assigning the odd numbered items to one half and the even numbered items to
the other) without obvious statistical measures to make them equivalent.
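As an illustration of the split-half procedure, the sketch below makes an odd-even split of a hypothetical item-response matrix, correlates the two half-test scores and steps the result up with the formula given above.

```python
# Sketch: split-half reliability with an odd-even split, stepped up to full length
# with the formula given above. The item-response matrix (1 = right, 0 = wrong)
# is hypothetical.

import statistics

def pearson_r(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Rows = candidates, columns = items (hypothetical dichotomous scores).
responses = [
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 0, 1, 1],
]

odd_half = [sum(row[0::2]) for row in responses]    # items 1, 3, 5, 7
even_half = [sum(row[1::2]) for row in responses]   # items 2, 4, 6, 8
r_half = pearson_r(odd_half, even_half)
r_full = 2 * r_half / (1 + r_half)                  # step-up to full-length reliability
print(f"Half-test correlation = {r_half:.2f}, full-test reliability = {r_full:.2f}")
```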
Kuder-Richardson estimates: This method of estimating the reliability of test
scores from a single administration of a single form of a test by means of
formulae KR 20 and KR 21 was developed by Kuder and Richardson. With
the help of these two formulae we can estimate whether the items in the test
are homogeneous, that is, whether each test item measures the same quality
or characteristic as every other. In other words, these formulae provide a
measure of internal consistency but do not require splitting the test in half for
scoring purposes.
The formulae are:

1. KR 20 = [n / (n − 1)] × [1 − (Σpq / σt²)]

Where n = number of items in the test,
σt = standard deviation of the test scores,
p = proportion of the group answering a test item correctly,
q = 1 − p = proportion of the group answering a test item incorrectly.

To use KR 20, we have to
1) compute the standard deviation of the test scores (i.e., σt),
2) compute p and q for each item,
3) multiply p and q to obtain the value of pq for each item, and
4) add the pq values of all the items to get Σpq.

2. KR 21 = [n / (n − 1)] × [1 − M(n − M) / (n σt²)]

Where σt² = the variance of the test scores,
n = number of test items in the test,
and M = the mean of the test scores.
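The KR 20 and KR 21 computations can be illustrated with a small dichotomous score matrix. In the sketch below the data are hypothetical, and the population variance is used for σt² (an assumption, since the unit does not say which variance to take).

```python
# Sketch: KR 20 and KR 21 for dichotomously scored items, following the formulae
# above. The score matrix is hypothetical, and the population variance is used
# for sigma_t squared (the unit does not specify which variance to take).

import statistics

responses = [                     # rows = candidates, columns = items (1 right, 0 wrong)
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 0, 1, 1],
]
n_items = len(responses[0])
totals = [sum(row) for row in responses]
var_t = statistics.pvariance(totals)                       # sigma_t squared
mean_t = statistics.mean(totals)

# KR 20: needs p and q for every item.
sum_pq = 0.0
for j in range(n_items):
    p = sum(row[j] for row in responses) / len(responses)  # proportion answering item j correctly
    sum_pq += p * (1 - p)
kr20 = (n_items / (n_items - 1)) * (1 - sum_pq / var_t)

# KR 21: needs only the mean, the variance and the number of items.
kr21 = (n_items / (n_items - 1)) * (1 - mean_t * (n_items - mean_t) / (n_items * var_t))

print(f"KR 20 = {kr20:.2f}, KR 21 = {kr21:.2f}")
```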
Cronbach alpha: Kuder-Richardson estimates are possible when the scoring
of items is dichotomous. When the scoring is not dichotomous, as in a test
consisting of essay questions, the formula developed by Cronbach can be used
to get the reliability estimate. This formula, known as Cronbach alpha, is the
same as KR 20 except for the fact that Σpq is replaced by Σsi², where si² is the
variance of a single item. The formula is:

α = [n / (n − 1)] × [1 − (Σsi² / st²)]

The Spearman-Brown prophecy formula referred to below is:

rxx = nr / (1 + (n − 1) r)
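The alpha computation can be illustrated in the same way, using hypothetical marks on essay questions (non-dichotomous scoring); population variances are again assumed.

```python
# Sketch: Cronbach alpha for non-dichotomous scores (e.g., essay questions), following
# the formula above. The marks matrix is hypothetical; population variances are used.

import statistics

marks = [                      # rows = candidates, columns = essay questions (marks awarded)
    [8, 7, 9, 6],
    [5, 6, 6, 5],
    [9, 8, 10, 8],
    [4, 5, 5, 3],
    [7, 6, 8, 7],
    [6, 7, 7, 6],
]
n_items = len(marks[0])
item_variances = [statistics.pvariance([row[j] for row in marks]) for j in range(n_items)]
total_variance = statistics.pvariance([sum(row) for row in marks])

alpha = (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)
print(f"Cronbach alpha = {alpha:.2f}")
```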
7) Multiple scorer method (measure of scorer reliability): Administer a test once.
Let it be scored by two or more scorers. Correlate the sets of scores to
measure the reliability of the scores of one scorer. Apply the Spearman-Brown
prophecy formula to obtain the reliability of the sum (or average) of the scores
of two or more scorers.
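The multiple-scorer procedure can likewise be illustrated: correlate the two scorers' marks to estimate the reliability of a single scorer's marks, then apply the Spearman-Brown prophecy formula for the average of the two scorers. The marks below are hypothetical.

```python
# Sketch of the multiple-scorer procedure: correlate the marks awarded by two scorers
# to the same answer scripts (reliability of one scorer's marks), then apply the
# Spearman-Brown prophecy formula rxx = nr / (1 + (n - 1)r) for the average of the
# two scorers. The marks below are hypothetical.

import statistics

def pearson_r(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

scorer_a = [14, 18, 11, 16, 9, 15, 19, 12]    # marks by scorer A on eight scripts
scorer_b = [15, 17, 10, 18, 11, 14, 20, 13]   # marks by scorer B on the same scripts

r_single = pearson_r(scorer_a, scorer_b)      # reliability of a single scorer's marks
n_scorers = 2
r_average = n_scorers * r_single / (1 + (n_scorers - 1) * r_single)
print(f"Single-scorer reliability = {r_single:.2f}, two-scorer average = {r_average:.2f}")
```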
Comparison of methods
As noted earlier each type of reliability measure represents different source(s).
A summary of this information is given in Table 13. Note that more sources
of error are represented by measures of equivalence and stability than by any
other type of measure. Naturally, reliability estimates obtained on measures
of equivalence and stability are likely to be lower. This should caution you
to take into account the type of measures used to report reliability estimate,
especially when you attempt to choose a test from among standardised tests
guided by reliability estimates.
The ‘X’ mark in Table 13 indicates the sources of error represented by the
reliability measures.
Table 13: Representation of sources of error

                                Types of Reliability Measures
Sources of error           Stability  Equivalence  Equivalence   Internal      Scorer
                                                   & stability   consistency   reliability
Trait instability              X                        X
Sampling error                             X            X             X
Administrator error            X           X            X
Random error within
the test                       X           X            X             X             X
Scoring error                                                                        X
3.4.4 Usability
We have discussed in detail the two chief criteria of test-validation—
validity and reliability. What remains to be seen are the considerations of
‘Usability’. ‘Usability’ mostly raises questions of feasibility with regard to
test-construction, administration, evaluation, interpretation and pedagogical
application.
While judging feasibility we should remember that the tests are usually
administered and interpreted by teachers without the desirable amount of
training in the procedures of measurement. Time available for testing and the
cost of testing also deserve attention.
Besides these, attributes like the ease of administration, which leaves little
possibility for error in giving directions, timing, etc., ease and economy of
scoring without sacrificing accuracy, and ease of interpretation and application so
as to contribute to intelligent educational decisions, are factors pertinent to
the usability of tests.