EDUCATIONAL MEASUREMENT AND EVALUATION
COURSE TEAM
Editor:
Course Coordinator: Dr. Muhammad Tanveer Afzal
CONTENTS
Sr. No  Topic
01  Foreword
02  Preface
FOREWORD
Learning comes naturally to human beings, but the efforts of teachers do much to catalyze the process of learning and raise educational attainment. The questions of how much students have learned and which instructional techniques work best are not simple to answer. Yet they are vital, and answering them requires a rigorous approach to the measurement and assessment of students' progress, which in turn leads to better decision making. To ensure and enhance the effectiveness of the teaching-learning process, teachers need information about students' performance. Based upon this information, teachers make critical instructional decisions: for example, whether to use a certain teaching method, or whether students' progress towards the attainment of educational goals is satisfactory.
For the optimization of student learning, teachers must be able to develop, administer, and score tests and report test scores to educational stakeholders. The validity and reliability of the classroom tests teachers develop for classroom use can only be enhanced by exposing teachers to the processes and procedures of test development. Experience in measurement and test development may contribute to the professional development of prospective and in-service teachers.
Vice-Chancellor
AIOU
PREFACE
Classroom tests play a central role in the assessment of student learning. Teachers use tests to assess the progress of students' learning. Tests provide relevant measures of many important learning outcomes and indirect evidence concerning others. They make expected learning outcomes explicit to students and parents and show what types of performance are valued. To ensure and enhance the effectiveness of the teaching-learning process, teachers need information about students' performance. Based upon this information, teachers make critical instructional decisions: for example, whether to use a certain teaching method, whether students' progress towards educational goals is satisfactory, whether a student has a learning deficiency, or how to motivate a student. Classroom assessment primarily aims to yield information about students' performance in order to help the teacher and/or stakeholders determine the degree to which a learner has acquired particular knowledge, understood particular concepts, or mastered a certain skill.
The competency of teachers to develop, administer, and score tests and to interpret the results is a prime consideration for tomorrow's classrooms. Therefore, it is necessary to enhance the knowledge and skills of prospective teachers in the development and use of assessment tools. This course comprises nine units. The concepts of measurement, assessment and evaluation are elaborated in the first unit. Test items are developed in line with objectives and learning outcomes, so objectives are discussed in unit two. The third and fourth units of the textbook are about the different types of tests and techniques used by teachers. The characteristics of assessment tools, such as validity and reliability, are explained in the sixth and seventh units. The eighth and ninth units of the textbook are about the interpretation and reporting of test scores. The text includes relevant examples to elaborate the concepts, and activities are provided for hands-on work, which consequently helps to develop the attitudes and skills of prospective teachers.
In the end, I am thankful to the course team and especially the course
development coordinator for this wonderful effort.
COURSE OBJECTIVES
Classrooms are busy places. Every day in every classroom, teachers make decisions about their pupils and the success of their instruction, and perform a number of other tasks. Teachers continually observe, monitor, and review learners' performance to obtain evidence for decisions. Evidence gathering and classroom marking are necessary and ongoing aspects of teachers' lives in the classroom, and decisions based on this evidence serve to establish, organize, and monitor classroom qualities such as pupil learning, interpersonal relations, social adjustment, instructional content and classroom climate. Keeping in view the tasks teachers have to perform in the classroom, this course has been organized to follow the natural progression of teachers' decision making: from organizing the classroom as a social setting, to planning and conducting instruction, to the formal assessment of pupil learning, to grading, and finally to communicating results. Assessment is an ongoing part of teaching; therefore this course covers a broad range of assessments. The course intends to achieve the following objectives.
After studying this course the prospective teachers will be able to:
1. Understand the concepts and application of classroom assessment.
2. Integrate objectives with evaluation and measurement.
3. Acquire skills of assessing the learning outcomes.
4. Interpret test scores.
5. Know about the trends and techniques of classroom assessment.
UNIT–1
Written By:
Prof. Dr. Rehana Masrur
Reviewed By:
Dr. Naveed Sultana
CONTENTS
Introduction
Objectives
OBJECTIVES
After studying this unit, the prospective teacher will be able to:
indicate the primary differences among the terms measurement, assessment and evaluation
explain the types of assessment used in the classroom milieu
compare and contrast the assessment for learning and assessment of learning
summarize the need for assessment
highlight the role of assessment in effective teaching-learning process
describe major characteristics of classroom assessment
identify the core principles of effective assessment
1.1 Concept of Measurement, Assessment and Evaluation
Despite their significant role in education, the terms measurement, assessment, and evaluation are often confused with one another. Many people use them interchangeably and find it difficult to explain the differences among them. Yet each of these terms has a specific meaning, sharply distinguished from the others.
Measurement: In general, the term measurement is used to determine the attributes or dimensions of an object. For example, we measure an object to know how big, tall or heavy it is. From an educational perspective, measurement refers to the process of obtaining a numerical description of a student's progress towards a pre-determined goal. This process provides information about how much a student has learnt. Measurement gives a quantitative description of the student's performance, for example, 'Rafaih solved 23 arithmetic problems out of 40.' It does not include qualitative aspects, for example, 'Rafaih's work was neat.'
Testing: A test is an instrument or a systematic procedure to measure a particular
characteristic. For example, a test of mathematics will measure the level of the
learners’ knowledge of this particular subject or field.
Assessment: Kizlik (2011) defines assessment as a process by which information is obtained relative to some known objective or goal. Assessment is a broad term that includes testing. For example, a teacher may assess knowledge of the English language through a test and assess the language proficiency of the students through another instrument, for example an oral quiz or a presentation. Based upon this view, we can say that every test is an assessment, but not every assessment is a test.
The term 'assessment' is derived from the Latin word 'assidere', which means 'to sit beside'. In contrast to testing, the tone of the term assessment is non-threatening, indicating a partnership based on mutual trust and understanding. This emphasizes that there should be a positive rather than a negative association between assessment and the process of teaching and learning in schools. In the broadest sense, assessment is concerned with children's progress and achievement.
Evaluation: According to Kizlik (2011), evaluation is the most complex and least understood of the three terms. Hopkins and Antes (1990) defined evaluation as a continuous inspection of all available information in order to form a valid judgment of students' learning and/or the effectiveness of an education program.
The central idea in evaluation is 'value'. When we evaluate a variable, we are basically judging its worthiness, appropriateness and goodness. Evaluation is always done against a standard, objective or criterion. In the teaching-learning process, teachers make evaluations of students, usually in the context of comparisons between what was intended (learning, progress, behaviour) and what was obtained.
Evaluation is a much more comprehensive term than measurement and assessment. It includes both quantitative and qualitative descriptions of students' performance, and it always provides a value judgment regarding the desirability of the performance, for example 'very good', 'good', and so on.
Source: Kizlik (2011). http://www.adprima.com/measurement.htm
Activity 1.1: Distinguish among measurement, assessment and evaluation with the
help of relevant examples
a) Assessment for Learning (Formative Assessment)
Assessment for learning has many unique characteristics. For example, this type of assessment is treated as 'practice': learners should not be graded on skills and concepts that have just been introduced, but should be given opportunities to practice them. Formative assessment helps teachers determine the next steps during the learning process as instruction approaches the summative assessment of student learning. A good analogy is the road test required to receive a driver's license: before the final driving test, or summative assessment, a learner practices by being assessed again and again so that deficiencies in the skill can be pointed out.
Another distinctive characteristic of formative assessment is student involvement. If students are not
involved in the assessment process, formative assessment is not practiced or implemented to its full
effectiveness. One of the key components of engaging students in the assessment of their own learning is
providing them with descriptive feedback as they learn. In fact, research shows descriptive feedback to be
the most significant instructional strategy to move students forward in their learning. Descriptive
feedback provides students with an understanding of what they are doing well. It also gives input on how
to reach the next step in the learning process.
The role of assessment for learning in the instructional process can best be understood with the help of the following diagram.
(Figure: the role of assessment for learning in the instructional process. Source: http://www.stemresources.com/index.php?option=com_content&view=article&id=52&Itemid=70)
Garrison and Ehringhaus (2007) identified some instructional strategies that can be used for formative assessment:
Observations. Observing students' behaviour and tasks can help the teacher identify whether students are on task or need clarification. Observations assist teachers in gathering evidence of student learning to inform instructional planning.
Questioning strategies. Asking better questions creates opportunities for deeper thinking and provides teachers with significant insight into the degree and depth of understanding. Questions of this nature engage students in classroom dialogue that both uncovers and expands learning.
Self and peer assessment. When students have been involved in setting criteria and goals, self-evaluation is a logical step in the learning process. With peer evaluation, students see each other as resources for understanding and check for quality work against previously established criteria.
Student record keeping. Keeping records also helps teachers assess beyond a 'grade', to see where the learner started and the progress being made towards the learning goals.
b) Assessment of Learning (Summative Assessment)
Summative assessment, or assessment of learning, is used to evaluate students' achievement at some point in time, generally at the end of a course. The purpose of this assessment is to help the teacher, students and parents know how well the student has completed the learning task. In other words, summative evaluation is used to assign a grade to a student, which indicates his/her level of achievement in the course or program.
Assessment of learning is basically designed to provide useful information about the performance of learners rather than immediate and direct feedback to teachers and learners; therefore it usually has little direct effect on learning. However, high-quality summative information can help guide teachers in organizing their courses and deciding their teaching strategies, and educational programs can be modified on the basis of the information it generates.
Many experts believe that all forms of assessment have some formative element. The difference only lies
in the nature and the purpose for which assessment is being conducted.
Comparing Assessment for Learning and Assessment of Learning

Assessment for Learning (Formative Assessment):
Checks how students are learning and whether there is any problem in the learning process; it determines what to do next.
Is designed to assist educators and students in improving learning.
Usually uses detailed, specific and descriptive feedback, in a formal or informal report.
Usually focuses on improvement, compared with the student's own previous performance.

Assessment of Learning (Summative Assessment):
Checks what has been learned to date.
Is designed to provide information to those not directly involved in classroom learning and teaching (school administration, parents, school board), in addition to educators and students.
Usually uses numbers, scores or marks as part of a formal report.
Usually compares the student's learning either with other students' learning (norm-referenced) or with the standard for a grade level (criterion-referenced).

Source: adapted from Ruth Sutton, unpublished document, 2001, in Alberta Assessment Consortium
c) Assessment as Learning
Assessment as learning means to use assessment to develop and support students' metacognitive skills.
This form of assessment is crucial in helping students become lifelong learners. As students engage in
peer and self-assessment, they learn to make sense of information, relate it to prior knowledge and use it
for new learning. Students develop a sense of efficacy and critical thinking when they use teacher, peer
and self-assessment feedback to make adjustments, improvements and changes to what they understand.
Self Assessment: ‘Formative assessment results in improved teaching learning process.’ Comment
on the statement and give arguments to support your response.
4. Assessment requires attention to outcomes but also and equally to the experiences that lead
to those outcomes.
Information about outcomes is of high importance; where students "end up" matters greatly. But to
improve outcomes, we need to know about student experience along the way -- about the curricula,
teaching, and kind of student effort that lead to particular outcomes. Assessment can help us understand
which students learn best under what conditions; with such knowledge comes the capacity to improve the
whole of their learning.
6. Assessment is effective when representatives from across the educational community are
involved.
Student education is a campus-wide responsibility, and assessment is a way of enacting that responsibility. Thus, while assessment efforts may start small, the aim over time is to involve people from across the educational community. Faculty play an important role, but assessment's questions cannot be fully addressed without participation by educators, librarians, administrators, and students. Assessment may also involve individuals from beyond the campus (alumni/ae, trustees, employers) whose experience can enrich the sense of appropriate aims and standards for learning. Thus understood, assessment is not a task for small groups of experts but a collaborative activity; its aim is wider, better-informed attention to student learning by all parties with a stake in its improvement.
7. Assessment makes a difference when it begins with issues of use and illuminates questions
that people really care about.
Assessment recognizes the value of information in the process of improvement. But to be useful,
information must be connected to issues or questions that people really care about. This implies
assessment approaches that produce evidence that relevant parties will find credible, suggestive, and
applicable to decisions that need to be made. It means thinking in advance about how the information will
be used, and by whom. The point of assessment is not to collect data and return "results"; it is a process
that starts with the questions of decision-makers, that involves them in the gathering and interpreting of
data, and that informs and helps guide continuous improvement.
9. Through effective assessment, educators meet responsibilities to students and to the public.
There is a compelling public stake in education. As educators, we have a responsibility to the public that supports or depends on us to provide information about the ways in which our students meet goals and expectations. But that responsibility goes beyond the reporting of such information; our deeper obligation, to ourselves, our students, and society, is to improve. Those to whom educators are accountable have a corresponding obligation to support such attempts at improvement. (American Association for Higher Education, 2003)
Assessment does more than allocate a grade or degree classification to students – it plays an important
role in focusing their attention and, as Sainsbury & Walker (2007) observe, actually drives their learning.
Gibbs (2003) states that assessment has six main functions:
1. Capturing student time and attention.
2. Generating appropriate student learning activity.
3. Providing timely feedback to which students pay attention.
4. Helping students to internalize the discipline's standards and notions of quality.
5. Generating marks or grades which distinguish between students or enable pass/fail decisions to be made.
6. Providing evidence for others outside the course to enable them to judge the appropriateness of standards on the course.
Surgenor (2010) summarized the role of assessment in learning in the following points:
It fulfills student expectations.
It is used to motivate students.
It provides opportunities to remedy mistakes.
It indicates readiness for progression.
Assessment serves as a diagnostic tool.
Assessment enables grading and degree classification.
Assessment works as a performance indicator for students.
It is used as a performance indicator for teachers.
Assessment is also a performance indicator for the institution.
Assessment facilitates learning in one way or another.
Activity 1.3: List the different roles of formative and summative assessment in the teaching-learning process.
Garrison, C., Chandler, D., & Ehringhaus, M. (2009). Effective Classroom Assessment: Linking Assessment with Instruction. NMSA & Measured Progress.
Burke, K. (2010). How to Assess Authentic Learning. California: Corwin Press.
Gipps, C. (1994). Beyond Testing: Towards a Theory of Educational Assessment. Routledge.
UNIT–2
Written by:
Prof. Dr. Rehana Masrur
Reviewed By:
Dr. Naveed Sultana
CONTENTS
Introduction
Objectives
2.1 Purpose of a Test
(1) Monitoring Student Progress
(2) Diagnosing Learning Problems
(3) Assigning Grades
(4) Classification and Selection of Students
(5) Evaluating Instruction
2.2 Objectives and Educational Outcomes
(1) Definition of Objectives
(2) Characteristics/Attributes of Educational Outcomes
(3) Taxonomy of Educational Objectives
2.3 Writing Cognitive Domain Objectives
2.4 Defining Learning Outcomes
(1) Different Definitions of Learning Outcomes
(2) Difference between Objectives and Learning Outcomes
(3) Importance of Learning Outcomes
(4) SOLO Taxonomy
2.5 Preparation of Content Outline
2.6 Preparation of Table of Specification
2.7 Self-Assessment Questions
LIST OF FIGURES
2.1 Defining objectives
INTRODUCTION
In this unit you will learn how important objectives and learning outcomes are in the process of assessment. A teacher should know that the main advantage of objectives is to guide teaching-learning activities; in simple words, objectives are the desired outcomes of an effort. Guided by these specific objectives, instructional activities are designed and assessment is subsequently carried out through different methods. One of the most common methods of assessing the ability of a student in any specific subject is a test, and most tests taken by students are developed by teachers. The goal of this unit is for you to be able to design, construct, and analyze a test for a given set of objectives or content area. Objectives are therefore key components in developing a test; they are the guiding principles for assessment. For achievement testing, the cognitive domain is much emphasized and widely used by educationists. The Taxonomy of Educational Objectives developed by Benjamin Bloom (1956) deals with activities like memorizing, interpreting, analyzing and so on. This taxonomy provides a useful way of describing the complexity of an objective by classifying it into one of a set of hierarchical categories ranging from simplest to most complex. One of the important tasks for a teacher while designing a test is the selection and sampling of test items from the course contents. The appropriateness of the content of a test must be considered at the earliest stages of development. Therefore, the process of developing a test should begin with the identification of the content domain at the first stage and the development of a table of specification at the second stage. In this unit we have focused on what we want students to learn and what content we want our tests to cover.
You will also learn how to work through the different stages of assessment.
OBJECTIVES
After studying this unit, you should be able to:
describe the role of objectives and outcomes in the assessment of student achievement.
explain the purpose of a test.
explain the levels of the cognitive domain.
develop achievement objectives according to Bloom's Taxonomy of Educational Objectives.
identify and describe the major components of a table of specifications.
identify and describe the factors which determine the appropriate number of items for each component in a table of specification.
What is a Test?
A test is a device which is used to measure the behaviour of a person for a specific purpose; it is an instrument that typically uses sets of items designed to measure a domain of learning tasks. Tests are a systematic method of collecting information that leads to inferences about the characteristics of people or objects. A teacher must understand that an educational test is a measuring device and therefore involves rules (for administering and scoring) by which numbers are assigned to describe the performance of an individual. You should also keep in mind that it is not possible for a teacher to test all the subject matter of a course that has been taught to the class in a semester or in a year. Therefore, the teacher prepares tests by sampling items from a pool of items in such a way that the test represents the whole subject matter. The teacher must also understand that the whole content, with the many topics and concepts taught within a semester or a year, cannot be tested in one or two hours. In simple words, a test should assess each content area in accordance with the relative importance the teacher has assigned to it. Most commonly, a test is understood to mean a simple paper-and-pencil test, but nowadays other testing procedures have been developed and are practiced in many schools.
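To make the idea of sampling from the subject matter concrete, the sketch below draws a proportional random sample of items from an item pool, so that each topic appears on the test roughly in proportion to its share of the taught content. It is a minimal illustration in Python; the topic names, pool sizes and test length are all hypothetical.

```python
import random

# Hypothetical item pool: each topic maps to its available test items.
item_pool = {
    "Fractions": [f"frac_{i}" for i in range(20)],
    "Decimals":  [f"dec_{i}" for i in range(10)],
    "Geometry":  [f"geo_{i}" for i in range(10)],
}

def sample_test(pool, test_length):
    """Draw items from each topic in proportion to its share of the pool,
    so the test represents the whole subject matter rather than one part."""
    total = sum(len(items) for items in pool.values())
    test = []
    for topic, items in pool.items():
        n = round(test_length * len(items) / total)   # proportional allocation
        test.extend(random.sample(items, min(n, len(items))))
    return test

# A 20-item test drawn from a 40-item pool: about 10 fraction items,
# 5 decimal items and 5 geometry items.
print(sample_test(item_pool, 20))
```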
Tests are of many types, but they can be placed into two main categories. These are:
(i) Subjective type tests
(ii) Objective type tests
At the elementary level, students do not have much proficiency in writing long essay-type answers to a question; therefore, objective type tests are preferred. Objective type tests are also called selected-response tests. In this type of test, the responses to an item are provided and the students are required to choose the correct response. The objective types of tests that are used at the elementary level are:
(i) Multiple choice
(ii) Multiple Binary-choice
(iii) Matching items
You will study the development process for each of these item types in the next units. In this unit you have been given just an idea of what a test means for a teacher. After going through this discussion, you should be able to see why it is important for a teacher to know about classroom tests and what purposes they serve. The job of a teacher is to teach and to test for the following purposes.
Purposes of a test:
You have learned that a test is a simple device which measures the achievement level of a student in a particular subject and grade. A test is used to serve the following purposes:
3. Assigning Grades
A teacher assigns grades after scoring the test. The best way to assign grades is to collect objective information related to student achievement and other academic accomplishments. Different institutions have different criteria for assigning grades. Usually the letters A, B, C, D, or F are assigned on the basis of numerical evidence.
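The mapping from numerical evidence to letter grades can be pictured with a short sketch. The cut-off percentages below are purely illustrative, since, as noted above, different institutions set different criteria.

```python
def assign_grade(score, total_marks):
    """Map a numerical test score to a letter grade.
    The cut-offs are illustrative; each institution defines its own criteria."""
    percent = 100 * score / total_marks
    if percent >= 80:
        return "A"
    elif percent >= 70:
        return "B"
    elif percent >= 60:
        return "C"
    elif percent >= 50:
        return "D"
    return "F"

# Rafaih's score from unit 1: 23 out of 40 arithmetic problems = 57.5%.
print(assign_grade(23, 40))  # -> "D" under these illustrative cut-offs
```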
5. Evaluating Instruction
Students' performance on tests helps teachers evaluate their own instructional effectiveness, that is, to know how effective their teaching has been. Suppose a teacher teaches a topic for two weeks and gives a test after completing the topic. If the scores obtained by the students show that they learned the skills and knowledge they were expected to learn, the instruction has worked. But if the obtained scores are poor, the teacher must decide whether to retain, alter or totally discard the current instructional activities.
Activity-2.1: Visit some schools of your area and perform the following:
In the teaching-learning process, learning objectives have a unique importance. The roles learning objectives play include, but are not limited to, the following three: firstly, they guide and direct the selection of instructional content and procedures; secondly, they facilitate the appropriate evaluation of the instruction; thirdly, they help students organize their efforts to accomplish the intent of the instruction.
Though all three characteristics are essential for stating clear objectives, in some cases one or two of these elements are easily implied by a simple statement.
(3) Taxonomy of Educational Objectives
Following the 1948 Convention of the American Psychological Association, a group of college examiners considered the need for a system of classifying educational goals for the evaluation of student performance. Years later, as a result of this effort, Benjamin Bloom formulated a classification of 'the goals of the educational process'. Eventually, Bloom established a hierarchy of educational objectives for categorizing the level of abstraction of questions that commonly occur in educational settings (Bloom, 1956). This classification is generally referred to as Bloom's Taxonomy. Taxonomy means 'a set of classification principles', or 'structure'. The following are the six levels in this taxonomy: Knowledge, Comprehension, Application, Analysis, Synthesis, and Evaluation. The details are given below:
Cognitive domain: The cognitive domain (Bloom, 1956) involves the development of intellectual skills. This includes the recall or recognition of specific facts, procedural patterns, and concepts that serve in the development of intellectual abilities and skills. There are six levels of this domain, starting from the simplest cognitive behaviour and moving to the most complex. The levels can be thought of as degrees of difficulty; that is, the earlier ones must normally be mastered before the later ones can take place.
Affective domain: The affective domain is related to the manner in which we deal with things emotionally, such as feelings, values, appreciation, enthusiasms, motivations, and attitudes. The five levels of this domain are: receiving, responding, valuing, organization, and characterizing by value.
Psychomotor domain: The focus is on physical and kinesthetic skills. The psychomotor domain includes physical movement, coordination, and use of the motor-skill areas. Development of these skills requires practice, and it is measured in terms of speed, precision, distance, procedures, or techniques in execution. There are seven levels of this domain, from the simplest behaviour to the most complex: perception, set, guided response, mechanism, complex or overt response, adaptation, and origination.
http://www.nwlink.com/~donclark/hrd/bloom.html
http://www.learningandteaching.info/learning/bloomtax.htm
Overall, Bloom's taxonomy is related to the three Hs of the education process: Head, Heart and Hand. Cognitive abilities in this taxonomy are arranged on a continuum ranging from the lower order to the higher order.
(Figure: continuum of cognitive abilities from lower order to higher order.)
Source: Jolly T. Holden, A Guide to Developing Cognitive Learning Objectives. Retrieved from http://gates.govdl.org/docs/A%20Guide%20to%20Developing%20Cogntive%20Learning%20Objectives.pdf
Activity-2.2: Develop two objectives of comprehension level for this unit by using
appropriate action verbs.
Bloom's Taxonomy underpins the classical 'Knowledge, Attitude, Skills' structure of learning. It is such
a simple, clear and effective model, both for explanation and application of learning objectives, teaching
and training methods, and measurement of learning outcomes.
Bloom's Taxonomy provides an excellent structure for planning, designing, assessing and evaluating
teaching and learning process. The model also serves as a sort of checklist, by which you can ensure that
instruction is planned to deliver all the necessary development for students.
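Since each level of the cognitive domain is signalled by characteristic action verbs, the checklist idea can be sketched in code. The verb lists below are commonly cited examples for each level, and the classifier function is hypothetical, shown only to illustrate how an objective's opening verb points to its Bloom level.

```python
# Commonly cited action verbs for the six levels of Bloom's cognitive domain.
BLOOM_VERBS = {
    "Knowledge":     ["define", "list", "name", "recall", "state"],
    "Comprehension": ["explain", "summarize", "describe", "interpret"],
    "Application":   ["apply", "solve", "use", "demonstrate"],
    "Analysis":      ["analyze", "compare", "differentiate", "classify"],
    "Synthesis":     ["design", "construct", "formulate", "compose"],
    "Evaluation":    ["judge", "justify", "evaluate", "critique"],
}

def classify_objective(objective):
    """Guess the Bloom level of an objective from its opening action verb."""
    first_word = objective.lower().split()[0]
    for level, verbs in BLOOM_VERBS.items():
        if first_word in verbs:
            return level
    return "Unclassified"

print(classify_objective("Explain the purpose of a test"))    # Comprehension
print(classify_objective("Design a table of specification"))  # Synthesis
```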
The Credit Common Accord for Wales defines learning outcomes as:
Statements of what a learner can be expected to know, understand and/or do as a result of a learning
experience. (QCA /LSC, 2004, p. 12)
Activity-2.4: Differentiate between a learning objective and a learning outcome with the help of relevant examples.
4. SOLO Taxonomy
The SOLO taxonomy stands for:
Structure of
Observed
Learning
Outcomes
The SOLO taxonomy was developed by Biggs and Collis (1982) and is further explained by Biggs and Tang (2007). This taxonomy is used in Punjab for assessment.
It describes levels of increasing complexity in a student's understanding of a subject through five stages, and it is claimed to be applicable to any subject area. Not all students get through all five stages, of course, and indeed not all teaching is designed to take them that far.
1 Pre-structural: here students are simply acquiring bits of unconnected information, which have no organisation and make no sense.
2 Unistructural: simple and obvious connections are made, but their significance is not grasped.
3 Multistructural: a number of connections are made, but the meta-connections between them are missed, as is their significance for the whole.
4 Relational: the student is now able to appreciate the significance of the parts in relation to the whole.
5 Extended abstract: the student is making connections not only within the given subject area, but also beyond it, and is able to generalise and transfer the principles and ideas underlying the specific instance.
Source: SOLO taxonomy, http://www.learningandteaching.info/learning/solo.htm
(Figures 2.5 to 2.9: diagrams in which a shaded 'test items' area is placed against the 'content taught' domain; Figure 2.7 is captioned 'Inadequate representativeness'.)
In figures 2.5 to 2.9 the shaded area represents the test items which cover the content of the subject matter, whereas the un-shaded area is the subject matter (learning domain) which the teacher has taught in the class in the subject of social studies.
Figures 2.5 to 2.8 show poor or inadequate representativeness of the content by the test items. For example, in figure 2.5 the test covers only a small portion (shaded area) of the taught content domain, and the rest of the items do not coincide with the taught domain. In figures 2.5 and 2.6 most of the test items have been taken from one specific part of the taught domain; therefore the representation of the taught content domain is inadequate, even though the items come from the same content domain. The content of the test items in figure 2.7 gives a very poor picture of a test: none of the parts of the taught domain has been assessed, so the test shows zero representativeness. None of the test items in figure 2.8 has been taken from the taught content domain. Contrary to this, look at figure 2.9: the test items effectively sample the full range of taught content.
This implies that the content from which the test items are to be taken should be well defined and structured. Without setting the boundary of the knowledge, behaviour, or skills to be measured, the test development task becomes difficult and complex, and the assessment will produce unreliable results. Therefore a good test represents the taught content to the maximum extent; a test which is representative of the entire content domain is a good test. It is therefore imperative for a teacher to prepare an outline of the content that will be covered during instruction. The next step is the selection of subject matter and the design of instructional activities. All these steps are guided by the objectives; one must consider the objectives of the unit before selecting the content domain and subsequently designing a test. It is clear from the above discussion that the outline of the test content should be based on the following principles:
1. Purpose of the test (diagnostic test, classification, placement, or job employment)
2. Representative sample of the knowledge, behaviour, or skill domain being measured.
3. Relevancy of the topic with the content of the subject
4. Language of the content should be according to the age and grade level of the students.
5. Developing table of specification.
A test which meets the criteria stated in the above principles will provide reliable and valid information for correct decisions regarding the individual. Now, keeping these principles in view, carry out the following activity.
Activity-2.5:
Visit an elementary school in your area and collect question papers/tests of the sixth class in any subject, developed by the school teachers. Now perform the following:
(1) a. How many items are related to the content?
b. How many items (what percentage) are not related to the content covered during the testing period?
c. Is the test representative of the entire content domain?
d. Does the test fulfill the criteria of test construction? Explain.
(2) Share your results electronically with your classmates, and get their opinion
on the clarification of concept discussed in unit-2
Look at table 2.2: the top of each column of the table represents a level of the cognitive domain, and the extreme left column represents the categories of the content (topics) or assessment domains. The numerals in the cells of the two-way table show the numbers of items to be included in the test. You can readily see how the fifty items in this table have been allocated to the content topics and the levels of cognitive behaviour. The teacher may add some more dimensions. The table of specification represents four levels of the cognitive domain. It is not necessary for the teacher to develop a test that completely coincides with the content of the taught domain; the teacher is required to adequately sample the content of the assessment domain. The important consideration for teachers is that they must make a careful effort in conceptualizing the assessment domain, and appropriate representativeness must be ensured. Unfortunately, many teachers develop tests without figuring out what domains of knowledge, skills, or attitudes should be promoted and, consequently, formally assessed. A classroom test should measure what was taught; in simple words, a test must emphasize what was emphasized in the class. Now look at table 2.3. The table of specification illustrates the assessment domain of unit 2 of this book:
Table 2.3: Table of Specification for Unit 2 (extract)
Topic: Purpose of a test    Items by cognitive level: 2, 1, 1    Total: 4
Table 2.3 is a very simple table of specification. It is possible to add more dimensions of the content; you may further divide the table into subtopics for each main topic. Let's have another look at a more specific table:
Table 2.4: Specific Table of Specification
The column headings give the number of test items at each cognitive level (Knowledge, Comprehension, Application, Analysis), subdivided into specific outcomes such as: knows symbols and terms; knows specific facts; understands effects of factors; interprets results; solves equations. The rows list the topics, and the item counts for each topic are entered in the cells, with row and column totals.
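The allocation logic behind such a table can be sketched briefly: given the relative emphasis (weight) of each topic and each cognitive level, the number of items per cell is simply the product of the two weights and the test length. The weights below are invented for illustration; only the 50-item total and the four cognitive levels come from the discussion above.

```python
# Illustrative table-of-specification builder. Topic and level weights are
# hypothetical; a teacher would set them to mirror classroom emphasis.
topic_weights = {
    "Purpose of a test":       0.2,
    "Objectives and outcomes": 0.4,
    "Table of specification":  0.4,
}
level_weights = {"Knowledge": 0.4, "Comprehension": 0.3,
                 "Application": 0.2, "Analysis": 0.1}
TOTAL_ITEMS = 50

for topic, tw in topic_weights.items():
    # Items per cell = test length x topic weight x level weight.
    row = {lvl: round(TOTAL_ITEMS * tw * lw) for lvl, lw in level_weights.items()}
    print(f"{topic:26s} {row}  row total = {sum(row.values())}")
# Note: rounding each cell may make the grand total differ slightly from 50.
```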
Activity 2.6: Prepare a table of specification for unit 2, which you have just studied.
Web References
SOLO taxonomy: http://www.learningandteaching.info/learning/solo.htm
http://www.nwlink.com/~donclark/hrd/bloom.html
http://www.learningandteaching.info/learning/bloomtax.html
http://gates.govdl.org/docs/A%20Guide%20to%20Developing%20Cogntive%20Learning%20Objectives.pdf
http://www.qualityresearchinternational.com/glossary/learningoutcomes.htm
UNIT–3
Written By:
Dr. Naveed Sultana
Reviewed By:
Dr. Muhammad Tanveer Afzal
CONTENTS
Introduction
Objectives
Activity 3.1: Prepare an achievement test on the content to be taught in any subject, following the steps involved, and discuss it with your course mates.
Aptitude and ability tests are designed to assess logical reasoning or thinking performance. They consist of multiple-choice questions and are administered under exam conditions. They are strictly timed, and a typical test might allow 30 minutes for 30 or so questions. Test results are compared to those of a control group so that judgments can be made about the test taker's abilities.
You may be asked to answer the questions either on paper or online. The advantages of online testing
include immediate availability of results and the fact that the test can be taken at employment agency
premises or even at home. This makes online testing particularly suitable for initial screening as it is
obviously very cost-effective.
(a) Instructional
Teachers can use aptitude test results to adapt their curricula to match the level of their students, or to
design assignments for students who differ widely. Aptitude test scores can also help teachers form
realistic expectations of students. Knowing something about the aptitude level of students in a given class
can help a teacher identify which students are not learning as much as could be predicted on the basis of
aptitude scores. For instance, if a whole class were performing less well than would be predicted from
aptitude test results, then curriculum, objectives, teaching methods, or student characteristics might be
investigated.
(b) Administrative
Aptitude test scores can identify the general aptitude level of a high school, for example. This can be
helpful in determining how much emphasis should be given to college preparatory programs. Aptitude
tests can be used to help identify students to be accelerated or given extra attention, for grouping, and in
predicting job training performance.
(c) Guidance
Guidance counselors use aptitude tests to help parents develop realistic expectations for their child's
school performance and to help students understand their own strengths and weaknesses.
Activity 3.2: Discuss with your course mates their aptitudes towards the teaching profession and analyze their opinions.
3.1.3 Attitude
Attitude is a posture, action or disposition of a figure or a statue; in psychology, it is a mental and neural state of readiness, organized through experience, exerting a directive or dynamic influence upon the individual's response to all objects and situations with which it is related.
Attitude is the state of mind with which you approach a task, a challenge, a person, love, or life in general. One definition of attitude is 'a complex mental state involving beliefs and feelings and values and dispositions to act in certain ways'. These beliefs and feelings differ because various people interpret the same events differently, and these differences arise from the previously mentioned inherited characteristics.
(i) Components of Attitude
1. Cognitive Component:
This refers to that part of attitude which relates to a person's general knowledge, for example, the statement that smoking is injurious to health. Such an idea held by a person is called the cognitive component of attitude.
2. Affective Component:
This part of attitude relates to statements which affect another person. For example, in an organization a personnel report is given to the general manager; in the report it is pointed out that the sales staff are not performing their due responsibilities, and the general manager forwards a written notice to the marketing manager to negotiate with the sales staff.
3. Behavioral Component:
The behavioral component refers to that part of attitude which reflects the intention of a person in the short run or long run. For example, before the production and launching of a product, a report is prepared by the production department which contains the intentions for the near future and the long run, and this report is handed over to top management for decision.
(ii) List of Attitudes:
In the broadest sense of the word there are only three attitudes: a positive attitude, a negative attitude, and a neutral attitude. But in a general sense, an attitude is known by what it is expressed through. Given below is a list of attitudes that people express, which are more than the personality traits you may have heard of, know of, or might even be carrying:
Acceptance
Confidence
Seriousness
Optimism
Interest
Cooperation
Happiness
Respect
Authority
Sincerity
Honesty
Activity: Develop an attitude scale for analyzing the factors motivating the prospective teachers to
join teaching profession.
(ii) Advantages
In general, intelligence tests measure a wide variety of human behaviours better than any other measure
that has been developed. They allow professionals to have a uniform way of comparing a person's
performance with that of other people who are similar in age. These tests also provide information on
cultural and biological differences among people.
Intelligence tests are excellent predictors of academic achievement and provide an outline of a person's mental strengths and weaknesses. Many times the scores have revealed talents in people, which has led to improvements in their educational opportunities. Teachers, parents, and psychologists are able to devise individual curricula that match a person's level of development and expectations.
(iii) Disadvantages
Some researchers argue that intelligence tests have serious shortcomings. For example, many intelligence tests produce a single intelligence score, which is often inadequate for explaining the multidimensional nature of intelligence.
Another problem with a single score is the fact that individuals with similar intelligence test scores can
vary greatly in their expression of these talents. It is important to know the person's performance on the
various subtests that make up the overall intelligence test score. Knowing the performance on these
various scales can influence the understanding of a person's abilities and how these abilities are
expressed. For example, two people have identical scores on intelligence tests. Although both people have
the same test score, one person may have obtained the score because of strong verbal skills while the
other may have obtained the score because of strong skills in perceiving and organizing various tasks.
Furthermore, intelligence tests only measure a sample of behaviors or situations in which intelligent
behavior is revealed. For instance, some intelligence tests do not measure a person's everyday
functioning, social knowledge, mechanical skills, and/or creativity. Along with this, the formats of many
intelligence tests do not capture the complexity and immediacy of real-life situations. Therefore,
intelligence tests have been criticized for their limited ability to predict non-test or nonacademic
intellectual abilities. Since intelligence test scores can be influenced by a variety of different experiences
and behaviors, they should not be considered a perfect indicator of a person's intellectual potential.
Activity 3.4:
Discuss intelligence testing with your course mates, identify the methods used to measure intelligence, and make a list of the problems in measuring intelligence.
Likert scaling is a bipolar scaling method, measuring either a positive or a negative response to a statement. Sometimes an even-point scale is used, where the middle option of "neither agree nor disagree" is not available. This is sometimes called a "forced choice" method, since the neutral option is removed. The neutral option can be seen as an easy option to take when a respondent is unsure, so whether it is a true neutral option is questionable. It has been shown that when a 4-point and a 5-point Likert scale are compared, where the former omits the neutral option, the overall difference in the responses is negligible.
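To illustrate how Likert responses are typically quantified, here is a minimal sketch. The statements and responses are invented; the 1 to 5 coding is the conventional scheme for a 5-point scale.

```python
# Conventional 5-point Likert coding: 1 = strongly disagree ... 5 = strongly agree.
SCALE = {"strongly disagree": 1, "disagree": 2,
         "neither agree nor disagree": 3, "agree": 4, "strongly agree": 5}

# Hypothetical responses of one student to three attitude statements.
responses = ["agree", "strongly agree", "neither agree nor disagree"]

scores = [SCALE[r] for r in responses]
mean_score = sum(scores) / len(scores)
print(scores, mean_score)   # [4, 5, 3] 4.0 -> an overall positive attitude

# A 4-point "forced choice" variant simply drops the neutral option,
# obliging the respondent to lean towards agreement or disagreement.
```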
Activity 3.5: Apply projective tests to any class and analyze the traits that differentiate the students from each other.
Activity 3.6: Discuss with your course mates the characteristics of norm-referenced and criterion-referenced tests and prepare a report about their usability.
3.2 Techniques
3.2.1 Questionnaire
A questionnaire is a research instrument consisting of a series of questions and other prompts for the
purpose of gathering information from respondents. Although they are often designed for statistical
analysis of the responses, this is not always the case.
A questionnaire is a list of written questions that can be completed in one of two basic ways.
Firstly, respondents could be asked to complete the questionnaire with the researcher not present. This is
a postal questionnaire and (loosely) refers to any questionnaire that a respondent completes without the
aid of the researcher.
Secondly, respondents could be asked to complete the questionnaire by verbally responding to questions
in the presence of the researcher. This variation is called a structured interview.
Although the two variations are similar (a postal questionnaire and a structured interview could contain
exactly the same questions), the difference between them is important. If, for example, we are concerned
with protecting the respondent’s anonymity then it might be more appropriate to use a postal
questionnaire than a structured interview.
3. Leading Questions
Leading questions are questions that steer your audience towards a particular type of answer. In a leading question, the answer choices are framed to invite a particular kind of response; an example would be a question with choices such as fair, good, great, superb or excellent. By asking a question and then giving answers such as these, you will be able to draw an opinion from your audience.
Example of a Leading Question
How would you rate the lecture method?
(i) Fair (ii) Good (iii) Excellent (iv) Superb
4. Importance Questions
In importance questions, the respondents are usually asked to rate the importance of a particular issue, on
a rating scale of 1-5. These questions can help you grasp what are the things that hold importance to your
respondents. Importance questions can also help you make business critical decisions.
5. Likert Questions
Likert questions can help you ascertain how strongly your respondent agrees with a particular statement.
Likert questions can also help you assess how your customers feel towards a certain issue, product or
service.
6. Dichotomous Questions
Dichotomous questions are simple questions that ask respondents to just answer yes or no. One major
drawback of a dichotomous question is that it cannot analyze any of the answers between yes and no.
7. Bipolar Questions
Bipolar questions are questions that have two extreme answers. The respondent is asked to mark his/her
responses between the two opposite ends of the scale.
Make the first questions interesting. Make them clearly related and useful to the topic of the
questionnaire. The beginning questions should not be open-ended or questions with a long list of
answer choices.
Arrange the order of questions to achieve continuity and a natural flow. Try to keep all questions on
one subject together. Put the more general questions first, followed by a more specific question. For
example, if you want to find out about a person’s knowledge of insurance, start with questions about
types of insurance, purpose of the different types, followed by questions about costs of these various
types.
Try to use the same type of question/responses throughout a particular train of thought. It breaks the
attention span to have a multiple choice question following a YES/NO question, then an open-ended
question.
Place demographic questions (age, gender, race/ethnicity, etc.) in the beginning of the questionnaire.
Use quality print in an easy-to-read type face. Allow sufficient open space to let the respondent feel it
is not crowded and hard to read.
Keep the whole question and its answers on the same page. Don’t cause respondents to turn a page in
the middle of a question or between the question and its answers.
Be sure that the question is distinguishable from the instructions and the answers; you may put the instructions in boldface or italics.
Try to arrange questions and answers in a vertical flow. This way, the respondent moves easily down
the page, instead of side to side.
Give directions on how to answer. Specific instructions may include: (Circle the number of your
choice.) (Circle only one.) (Check all that apply.) (Please fill in the blank.) (Enter whole numbers.)
(Please do not use decimals or fractions.)
(iv) Disadvantages
Questionnaires are not always the best way to gather information. For example, if there is little previous
information on a problem, a questionnaire may only provide limited additional insight. On one hand, the
investigators may not have asked the right questions which allow new insight in the research topic. On the
other hand, questions often only allow a limited choice of responses. If the right response is not among
the choice of answers, the investigators will obtain little or no valid information.
Another setback of questionnaires is the varying responses to questions. Respondents sometimes
misunderstand or misinterpret questions. If this is the case, it will be very hard to correct these mistakes
and collect missing data in a second round.
Activity 3.7: Prepare a five point scale questionnaire to rank the problems of
elementary school teachers of rural areas.
3.2.2 Observation
An observation is the gathering of information about objects, events, movements, attitudes and phenomena, directly using one or more of the senses. Observation can be defined as the visual study of something or someone in order to gain
information or learn about behaviour, trends, or changes. This then allows us to make informed decisions,
adjustments, and allowances based on what has been studied. Observation is a basic but important aspect
of learning from and interacting with our environment. Observation is an important part of learning how
to teach. Much of what beginner teachers need to be aware of cannot be learned solely in the class.
Therefore classroom observation presents an opportunity to see real-life teachers in real-life teaching
situations. In their reflections, many of our teacher friends mention their observations and how these
observations influence the way they plan and teach. Teachers are forever reflecting and making decisions,
and when they see someone else in action, in as much as they are seeing someone else, they are almost
simultaneously seeing themselves. This means that observation is important at every stage of a teacher’s
career. Overall, classroom observation is a form of ongoing assessment. Most teachers can "read" their
students; observing when they are bored, frustrated, excited, motivated, etc. As a teacher picks up these
cues, he/she can adjust the instruction accordingly. It is also beneficial for teachers to make observational
notes (referred to as anecdotal notes). These notes serve to document and describe student learning
relative to concept development, reading, social interaction, and communication skill.
(ii) Disadvantages:
People feel uncomfortable being watched; they may perform differently when being observed.
The work being observed may not involve the level of difficulty or volume normally experienced during that time period.
Some activities may take place at odd times; observing them might be inconvenient for the observer.
The task being observed is subject to various types of interruptions.
Some tasks may not be performed in the manner in which they are observed.
Sometimes people act temporarily, performing their job correctly only while they are being observed, although they might otherwise violate the standard manner of working.
Activity 3.8: Prepare and conduct a classroom observation focusing on the different teaching competencies of your classroom teacher; after collecting the data, analyze the teacher's performance in different subjects.
3.2.3 Interview
A conversation in which one person (the interviewer) elicits information from another person (the subject
or interviewee). A transcript or account of such a conversation is also called an interview.
2. Unstructured Interview
This interview is not planned in detail; hence it is also called a non-directed interview. The questions to be asked and the information to be collected from the candidates are not decided in advance. These interviews are unplanned and therefore more flexible. Candidates are more relaxed in such interviews and are encouraged to express themselves about different subjects, based on their expectations, motivations, background, interests, etc. Here the interviewer can make a better judgment of the candidate's personality, potential, strengths and weaknesses. However, if the interviewer is not efficient, the discussion will lose direction and the interview will be a waste of time and effort.
3. Group Interview
Here, all the candidates or small groups of candidates are interviewed together. The time of the
interviewer is saved. A group interview is similar to a group discussion. A topic is given to the group, and
they are asked to discuss it. The interviewer carefully watches the candidates. He tries to find out which
candidate influences others, who clarifies issues, who summarizes the discussion, who speaks effectively,
etc. He tries to judge the behaviour of each candidate in a group situation.
4. Exit Interview
When an employee leaves the company, he is interviewed either by his immediate superior or by the
Human Resource Development (HRD) manager. This interview is called an exit interview. An exit interview is conducted to find out why the employee is leaving the company. Sometimes, the employee may be asked to withdraw his resignation by being offered some incentives. Exit interviews are taken to create a good image
of the company in the minds of the employees who are leaving the company. They help the company to
make proper Human Resource Development (HRD) policies, to create a favourable work environment, to
create employee loyalty and to reduce labour turnover.
5. Depth Interview
This is a semi-structured interview. The candidate has to give detailed information about his background,
special interests, etc. He also has to give detailed information about his subject. A depth interview tries to
find out whether the candidate is an expert in his subject or not. Here, the interviewer must have a good
understanding of human behaviour.
6. Stress Interview
The purpose of this interview is to find out how the candidate behaves in a stressful situation. That is,
whether the candidate gets angry or gets confused or gets frightened or gets nervous or remains cool in a
stressful situation. The candidate who keeps his cool in a stressful situation is selected for the stressful
job. Here, the interviewer tries to create a stressful situation during the interview. This is done purposely
by asking the candidate rapid questions, criticizing his answers, interrupting him repeatedly, etc. The behaviour of the interviewee is then observed, and future educational planning is based on his/her stress levels and handling of stress.
7. Individual Interview
This is a 'One-To-One' Interview. It is a verbal and visual interaction between two people, the interviewer
and the candidate, for a particular purpose. The purpose of this interview is to match the candidate with
the job. It is a two way communication.
8. Informal Interview
An informal interview is an oral interview which can be arranged at any place. Different questions are asked to collect the required information from the candidate. No specific rigid procedure is followed. It is a friendly interview.
9. Formal Interview
Formal interview is held in a more formal atmosphere. The interviewer asks pre-planned questions.
Formal interview is also called planned interview.
Disadvantages of Interview
Time consuming process.
Involves high cost.
Requires highly skilled interviewer.
Requires more energy.
May sometimes involve systematic errors.
Can be a confusing and complicated method.
Different interviewers may understand and transcribe interviews in different ways.
Activity 3.9: Conduct an interview with your teachers regarding their jobs and find out the problems they face in their work.
(b) Disadvantages
Highly subjective (rater error and bias are a common problem).
Raters may rate a child on the basis of their previous interactions or on an emotional,
rather than an objective, basis.
Ambiguous terms make them unreliable: raters are likely to mark characteristics by using
different interpretations of the ratings (e.g., do they all agree on what “sometimes”
means?).
Activity 3.10: Prepare a rating scale on attributes of good teaching and administer it
in your classroom for evaluating the performance of your teachers of
different subjects.
(b) Advantages
• It can be obtained easily and used at the researcher's convenience.
• It can be adopted and implemented quickly.
• It reduces or eliminates faculty time demands in instrument development and grading.
• It is scored objectively.
• It can provide external validity evidence for the test.
• It helps to provide reference-group measures.
• It allows longitudinal comparisons.
• It can test large numbers of students.
(c) Disadvantages
• It measures relatively superficial knowledge or learning.
• Norm-referenced data may be less useful than criterion-referenced.
• It may be cost prohibitive to administer as a pre- and post-test.
• It is more summative than formative (may be difficult to isolate what changes are
needed).
• It may be difficult to receive results in a timely manner.
(d) Recommendations
• It must be selected carefully based on faculty review and determination of match
between test content and curriculum content.
• Request technical manual and information on reliability and validity from publisher.
• Check with other users.
• If possible, purchase data disk for creation of customized reports.
• If possible, select tests that also provide criterion-referenced results.
• Check results against those obtained from other assessment methods.
• Embedding the test as part of a course’s requirements may improve student motivation.
3.4 Summary
Classroom assessment tests and techniques are a series of tools and practices designed to give teachers
accurate information about the quality of student learning. The information gathered isn't used for grading or
teacher evaluation. Instead, it's used to facilitate dialogue between students and teacher on the quality of
the learning process, and how to improve it. For this purpose there are many different types and
techniques of testing that can be used during an evaluation, whether administered by the school system or
independently. Keeping in view the learning domains, different tests such as achievement tests,
aptitude tests, attitude scales, intelligence tests, personality tests, and norm- and criterion-referenced tests,
and assessment techniques such as questionnaires, interviews, observation, rating scales and standardized
testing, were discussed.
TYPES OF TESTS
Written By:
Dr. Naveed Sultana
Reviewed By:
Dr. Muhammad Tanveer Afzal
CONTENT
Sr. No Topic Page No
Introduction ................................................................................................................81
Objectives ...................................................................................................................81
examine the role, advantages and disadvantages of different types of objective and subjective type
tests for measuring the students’ achievement.
describe the learning outcomes that are best measured with selection and supply test items.
differentiate the characteristics of all types of selection and supply categories of items, concentrating on measuring higher levels of student thinking.
4.1 Selection Type Items (objective type)
There are four types of test items in the selection category which are in common use today: multiple-choice, matching, true-false, and completion items.
The stem may be stated as a direct question or as an incomplete statement. For example:
Direct question
Which is the capital city of Pakistan? --------------- (Stem)
A. Paris. --------------------------------------- (Distracter)
B. Lisbon. -------------------------------------- (Distracter)
C. Islamabad. ---------------------------------- (Key)
D. Rome. --------------------------------------- (Distracter)
Incomplete Statement
The capital city of Pakistan is
A. Paris.
B. Lisbon.
C. Islamabad.
D. Rome.
Multiple choice questions are composed of one question with multiple possible answers (options),
including the correct answer and several incorrect answers (distracters). Typically, students select the
correct answer by circling the associated number or letter, or filling in the associated circle on the
machine-readable response sheet. Students can generally respond to these types of questions quite
quickly. As a result, they are often used to test students' knowledge of a broad range of content. Creating
these questions can be time consuming because it is often difficult to generate several plausible
distracters. However, they can be marked very quickly.
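Because selection-type items are scored purely mechanically, the anatomy of an item (stem, key, distracters) can be captured in a few lines of code. The Python sketch below is illustrative only; the class and function names are invented for this example and are not part of the text.

from dataclasses import dataclass

@dataclass
class MultipleChoiceItem:
    # One selection-type item: a stem, lettered options, and the key.
    stem: str
    options: dict[str, str]   # letter -> option text (key plus distracters)
    key: str                  # letter of the correct option

def score_test(items: list[MultipleChoiceItem], answers: list[str]) -> int:
    # Objective scoring: one point per response that matches the key.
    return sum(1 for item, ans in zip(items, answers) if ans == item.key)

capital = MultipleChoiceItem(
    stem="Which is the capital city of Pakistan?",
    options={"A": "Paris", "B": "Lisbon", "C": "Islamabad", "D": "Rome"},
    key="C",
)
print(score_test([capital], ["C"]))   # 1: the key was selected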
Multiple Choice Questions Good for:
Application, synthesis, analysis, and evaluation levels
RULES FOR WRITING MULTIPLE-CHOICE QUESTIONS
There are several rules we can follow to improve the quality of this type of written examination.
8. Avoid Distracters in the Form of "All the answers are correct" or "None of the Answers is
Correct"!
Teachers use these statements most frequently when they run out of ideas for distracters. Students,
knowing what is behind such questions, are rarely misled by it. Therefore, if you do use such statements,
sometimes use them as the key answer. Furthermore, if a student recognizes that there are two correct
answers (out of 5 options), they will be able to conclude that the key answer is the statement "all the
answers are correct", without knowing the accuracy of the other distracters.
Advantages:
Multiple-choice test items are not a panacea. They have advantages and disadvantages just as any other type
of test item. Teachers need to be aware of these characteristics in order to use multiple-choice items
effectively.
Advantages
Versatility
Multiple-choice test items are appropriate for use in many different subject-matter areas, and can be used
to measure a great variety of educational objectives. They are adaptable to various levels of learning
outcomes, from simple recall of knowledge to more complex levels, such as the student’s ability to:
• Analyze phenomena
• Apply principles to new situations
• Comprehend concepts and principles
• Discriminate between fact and opinion
• Interpret cause-and-effect relationships
• Interpret charts and graphs
• Judge the relevance of information
• Make inferences from given data
• Solve problems
The difficulty of multiple-choice items can be controlled by changing the alternatives, since the more
homogeneous the alternatives, the finer the distinction the students must make in order to identify the
correct answer. Multiple-choice items are amenable to item analysis, which enables the teacher to
improve the item by replacing distracters that are not functioning properly. In addition, the distracters
chosen by the student may be used to diagnose misconceptions of the student or weaknesses in the
teacher’s instruction.
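The item analysis mentioned above involves only simple proportions. A minimal sketch follows, assuming the common (but not universal) practice of contrasting upper and lower scoring groups; the function names and data are invented for illustration.

def item_difficulty(responses: list[int]) -> float:
    # Difficulty index p: proportion answering the item correctly
    # (1 = correct, 0 = incorrect). Higher p means an easier item.
    return sum(responses) / len(responses)

def item_discrimination(upper: list[int], lower: list[int]) -> float:
    # Discrimination index D: difficulty in the high-scoring group minus
    # difficulty in the low-scoring group. Values near zero or below
    # flag items whose distracters are not functioning properly.
    return item_difficulty(upper) - item_difficulty(lower)

upper_group = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]   # 10 high scorers on one item
lower_group = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1]   # 10 low scorers on the same item
print(item_difficulty(upper_group + lower_group))      # 0.65
print(item_discrimination(upper_group, lower_group))   # 0.9 - 0.4 = 0.5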
Validity
In general, it takes much longer to respond to an essay test question than it does to respond to a multiple-
choice test item, since the composing and recording of an essay answer is such a slow process. A student
is therefore able to answer many multiple-choice items in the time it would take to answer a single essay
question. This feature enables the teacher using multiple-choice items to test a broader sample of course
contents in a given amount of testing time. Consequently, the test scores will likely be more
representative of the students’ overall achievement in the course.
Reliability
Well-written multiple-choice test items compare favourably with other test item types on the issue of
reliability. They are less susceptible to guessing than are true-false test items, and therefore capable of
producing more reliable scores. Their scoring is more clear-cut than short answer test item scoring
because there are no misspelled or partial answers to deal with. Since multiple-choice items are
objectively scored, they are not affected by scorer inconsistencies as are essay questions, and they are
essentially immune to the influence of bluffing and writing ability factors, both of which can lower the
reliability of essay test scores.
Efficiency
Multiple-choice items are amenable to rapid scoring, which is often done by scoring machines. This
expedites the reporting of test results to the student so that any follow-up clarification of instruction may
be done before the course has proceeded much further. Essay questions, on the other hand, must be
graded manually, one at a time. Overall multiple choice tests are:
Very effective
Versatile at all levels
Minimum of writing for student
Guessing reduced
Can cover broad range of content
Disadvantages
Versatility
Since the student selects a response from a list of alternatives rather than supplying or constructing a
response, multiple-choice test items are not adaptable to measuring certain learning outcomes, such as the
student’s ability to:
• Articulate explanations
• Display thought processes
• Furnish information
• Organize personal thoughts
• Perform a specific task
• Produce original ideas
• Provide examples
Such learning outcomes are better measured by short answer or essay questions, or by performance tests.
Reliability
Although they are less susceptible to guessing than are true-false test items, multiple-choice items are still
affected by guessing to a certain extent. This guessing factor reduces the reliability of multiple-choice item scores
somewhat, but increasing the number of items on the test offsets this reduction in reliability.
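One conventional way to quantify this offset is the Spearman-Brown prophecy formula. The text does not name it, so the sketch below is offered as an illustration of a standard technique rather than the authors' own method:

def spearman_brown(reliability: float, length_factor: float) -> float:
    # Predicted reliability when a test is lengthened by length_factor
    # (e.g. 2.0 means twice as many items of comparable quality).
    k, r = length_factor, reliability
    return (k * r) / (1 + (k - 1) * r)

print(round(spearman_brown(0.60, 2.0), 2))   # doubling a 0.60 test -> 0.75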
Difficulty of Construction
Good multiple-choice test items are generally more difficult and time-consuming to write than other
types of test items. Coming up with plausible distracters requires a certain amount of skill. This skill,
however, may be increased through study, practice, and experience.
Gronlund (1995) writes that multiple-choice items are difficult to construct. Suitable distracters are often
hard to come by, and the teacher is tempted to fill the void with a "junk" response, the effect of which is to
narrow the range of options available to the test-wise student. They are also exceedingly time-consuming
to fashion, one hour per question being by no means the exception. Finally, multiple-choice items
generally take students longer to complete (especially items containing fine discriminations) than do
other types of objective question.
Difficult to construct good test items.
Difficult to come up with plausible distracters/alternative responses.
Activity 4.1: Construct two direct-question items and two incomplete-statement items, following the rules for multiple-choice items.
Example
Directions: Circle the correct response to the following statements.
1. Allama Iqbal is the founder of Pakistan. T/F
2. Democracy is a system of government for the people. T/F
3. Quaid-e-Azam was the first Prime Minister of Pakistan. T/F
Good for:
Knowledge level content
Evaluating student understanding of popular misconceptions
Concepts with two logical responses
Advantages:
Easily assess verbal knowledge
Each item contains only two possible answers
Easy to construct for the teacher
Easy to score for the examiner
Helpful for poor students
Can test large amounts of content
Students can answer 3-4 questions per minute
Disadvantages:
Although they appear easy to construct, writing unambiguous true-false items is difficult.
It is difficult to discriminate between students who know the material and students who do not.
Students have a 50-50 chance of getting the right answer by guessing (a classical correction for this is sketched after this list).
Need a large number of items for high reliability.
Fifty percent guessing factor.
Assess lower-order thinking skills.
Poor representation of students' learning achievement.
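The classical correction-for-guessing referred to above is one conventional (not the only) remedy for the guessing factor; a minimal sketch, with the function name invented for this example:

def corrected_score(right: int, wrong: int, options: int = 2) -> float:
    # Classical correction: R - W/(k-1). With true-false items (k = 2),
    # every wrong answer cancels one right answer, reflecting the
    # 50-50 chance of guessing correctly.
    return right - wrong / (options - 1)

print(corrected_score(35, 15))   # 20.0 estimated truly known items out of 50
print(corrected_score(25, 25))   # 0.0: blind guessing on all 50 items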
Activity 4.2: Write five true-false items, marking each with T/F (True/False).
Matching test items are used to test a student's ability to recognize relationships and to make associations
between terms, parts, words, phrases, clauses, or symbols in one column with related alternatives in
another column. When using this form of test item, it is a good practice to provide alternatives in the
response column that are used more than once, or not at all, to preclude guessing by elimination.
Matching test items may have either an equal or unequal number of selections in each column.
Matching-Equal Columns. When using this form, providing for some items in the response column to be
used more than once, or not at all, can preclude guessing by elimination.
Good for:
Knowledge level
Some comprehension level, if appropriately constructed
Types:
Terms with definitions
Phrases with other phrases
Causes with effects
Parts with larger units
Problems with solutions
Advantages:
The chief advantage of matching exercises is that a good deal of factual information can be tested in
minimal time, making the tests compact and efficient. They are especially well suited to who, what, when
and where types of subject matter. Further students frequently find the tests fun to take because they have
puzzle qualities to them.
Maximum coverage at knowledge level in a minimum amount of space/prep time
Valuable in content areas that have a lot of facts
Disadvantages:
The principal difficulty with matching exercises is that teachers often find that the subject matter is
insufficient in quantity or not well suited for matching terms. An exercise should be confined to
homogeneous items containing one type of subject matter (for instance, authors-novels;
inventions-inventors; major events-dates; terms-definitions; rules-examples and the like). Where unlike
clusters of questions are used, the adept but poorly informed student can often recognize the ill-fitting
items by their irrelevant and extraneous nature (for instance, in a list of authors, the inclusion of the
names of capital cities).
The student identifies connected items from two lists. This format is useful for assessing the ability to
discriminate, categorize, and associate amongst similar concepts.
Time consuming for students
Not good for higher levels of learning
Activity 4.3: Keeping in view the nature of matching items, construct a matching exercise of at least five items on any topic.
I. Word the statement such that the blank is near the end of the sentence rather than near the
beginning. This will prevent awkward sentences.
II. If the problem requires a numerical answer, indicate the units in which it is to be expressed.
Good for:
Application, synthesis, analysis, and evaluation levels
Advantages:
Easy to construct
Good for "who," what," where," "when" content
Minimizes guessing
Encourages more intensive study: the student must know the answer rather than merely recognize it.
Gronlund (1995) writes that short-answer items have a number of advantages.
They reduce the likelihood that a student will guess the correct answer
They are relatively easy for a teacher to construct.
They are well adapted to mathematics, the sciences, and foreign languages where specific types of
knowledge are tested (The formula for ordinary table salt is ________).
They are consistent with the Socratic question and answer format frequently employed in the
elementary grades in teaching basic skills.
Disadvantages:
May overemphasize memorization of facts
Take care - questions may have more than one correct answer
Scoring is laborious
According to Gronlund (1995) there are also a number of disadvantages with short-answer items.
They are limited to content areas in which a student’s knowledge can be adequately portrayed by
one or two words.
They are more difficult to score than other types of objective-item tests since students invariably
come up with unanticipated answers that are totally or partially correct.
Short answer items usually provide little opportunity for students to synthesize, evaluate and
apply information.
4.2.3 Essay
Essay questions are supply or constructed response type questions and can be the best way to measure the
students' higher order thinking skills, such as applying, organizing, synthesizing, integrating, evaluating,
or projecting while at the same time providing a measure of writing skills. The student has to formulate
and write a response, which may be detailed and lengthy. The accuracy and quality of the response are
judged by the teacher.
Essay questions provide a complex prompt that requires written responses, which can vary in length from
a couple of paragraphs to many pages. Like short answer questions, they provide students with an
opportunity to explain their understanding and demonstrate creativity, but make it hard for students to
arrive at an acceptable answer by bluffing. They can be constructed reasonably quickly and easily but
marking these questions can be time-consuming and grade agreement can be difficult.
Essay questions differ from short answer questions in that the essay questions are less structured. This
openness allows students to demonstrate that they can integrate the course material in creative ways. As a
result, essays are a favoured approach to test higher levels of cognition including analysis, synthesis and
evaluation. However, the requirement that the students provide most of the structure increases the amount
of work required to respond effectively. Students often take longer to compose a five-paragraph essay than they would to compose a paragraph answer to a short answer question.
Essay items can vary from very lengthy, open-ended end-of-semester term papers or take-home tests that
have flexible page limits (e.g., 10-12 pages, no more than 30 pages, etc.) to essays with responses limited
or restricted to one page or less. Essay questions are used both as formative assessments (in classrooms)
and summative assessments (on standardized tests). There are two major categories of essay questions:
short response (also referred to as restricted or brief) and extended response.
Restricted Response: more consistent scoring, outlines parameters of responses
Extended Response Essay Items: synthesis and evaluation levels; a lot of freedom in answers
Example 1:
List the major similarities and differences in the lives of people living in Islamabad and Faisalabad.
Example 2:
Compare advantages and disadvantages of lecture teaching method and demonstration teaching method.
Example:
Identify as many different ways to generate electricity in Pakistan as you can. Give advantages and
disadvantages of each. Your response will be graded on its accuracy, comprehensiveness and practicality.
Your response should be 8-10 pages in length and it will be evaluated according to the RUBRIC (scoring
criteria) already provided.
Overall, essay type items (both restricted response and extended response) are
Good for:
Application, synthesis and evaluation levels
Types:
Extended response: synthesis and evaluation levels; a lot of freedom in answers
Restricted response: more consistent scoring, outlines parameters of responses
Advantages:
Students less likely to guess
Easy to construct
Stimulates more study
Allows students to demonstrate ability to organize knowledge, express opinions, show originality.
Disadvantages:
Can limit amount of material tested, therefore has decreased validity.
Subjective, potentially unreliable scoring.
Time consuming to score.
Activity 4.6: Develop an essay type test on this unit while covering the levels of knowledge,
application and analysis.
Written by:
Dr. Muhammad Tanveer Afzal
Reviewed by:
Prof. Dr. Rehana Masrur
CONTENT
Sr. No Topic Page No
Introduction ...............................................................................................................103
Objectives .................................................................................................................103
OBJECTIVES
After studying this unit, prospective teachers will be able to:
define reliability in their own words.
apply the different methods of assuring reliability on the tests.
identify the factors affecting reliability.
construct a test and check how reliable it is.
identify measures for reducing the problems in conducting the tests.
5.1. Reliability
What does the term reliability mean? Reliability means trustworthiness. A test score is called reliable when
we have reason to believe that it is stable and objective. For example, if the same test is given to two
classes and is marked by different teachers, and it still produces similar results, it may be considered
reliable. Stability and trustworthiness depend upon the degree to which the score is free of chance error.
We must first build a conceptual bridge between the question asked by the individual (i.e., are my scores
reliable?) and how reliability is measured scientifically. This bridge is not as simple as it may first appear.
When a person thinks of reliability, many things may come to mind: my friend is very reliable, my car is
very reliable, my internet bill-paying process is very reliable, my client's performance is very reliable, and
so on. The characteristics being addressed are concepts such as consistency, dependability, predictability
and variability. Implicit in these reliability statements is the recognition that behaviour, machine
performance, data processes, and work performance may sometimes not be reliable.
The question is: how much do test scores vary over different observations?
Where “pq” provides the test score error variance for an "average" person, we know that the sampled
people vary, i.e., the variance of their raw scores is greater than zero. Persons with high or low scores
have less score error variance than those with scores near fifty percent correct, where the score error
variance is at its maximum. Since the "average" person variance used in the KR20 formula is always larger
than the lower score error variance of persons with extreme scores, it must always overestimate their
score error variances.
The second formula, which is easier to calculate but slightly less accurate, is called KR21. It requires only
the number of items, the mean of the test scores and the standard deviation. The KR21 formula is as under:

r₁ = [n / (n − 1)] × [1 − m(n − m) / (nσ²)]

where n is the number of items, m is the mean of the test scores and σ is the standard deviation. Studies
indicate that this formula provides good results even when the item difficulties are not consistent.
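Both formulas are straightforward to compute from the raw data. The following Python sketch is a minimal illustration rather than a definitive implementation: it assumes dichotomous (0/1) item scoring and uses the population variance of the total scores, and the function names are invented for this example.

from statistics import pvariance

def kr20(item_scores: list[list[int]]) -> float:
    # item_scores[s][i] is 1 if student s answered item i correctly, else 0
    n_items = len(item_scores[0])
    n_students = len(item_scores)
    totals = [sum(student) for student in item_scores]
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(student[i] for student in item_scores) / n_students
        sum_pq += p * (1 - p)          # pq: error variance of one item
    return (n_items / (n_items - 1)) * (1 - sum_pq / pvariance(totals))

def kr21(n_items: int, mean: float, sd: float) -> float:
    # KR21 needs only the item count, the mean and the standard deviation
    return (n_items / (n_items - 1)) * (1 - mean * (n_items - mean) / (n_items * sd ** 2))

scores = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
print(round(kr20(scores), 3))   # 0.667 for this tiny illustrative matrix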
5.3.4 Difficulty
A test that is too difficult or too easy reduces reliability (e.g., very few test-takers answer correctly, or
nearly all do). A moderate level of difficulty increases test reliability.
Transparency
In simple words, transparency is a process which requires teachers to maintain objectivity and honesty in
developing, administering, marking and reporting test results. Transparency refers to the
availability of clear, accurate information to students about testing. Such information should include
outcomes to be evaluated, formats used, weighting of items and sections, time allowed to complete the
test, and grading criteria. Transparency makes students part of the testing process, so that no one can
doubt any aspect of it. It also requires setting rules and keeping records of the testing process.
Security
Most teachers feel that security is an issue only in large-scale, high-stakes testing. However, security is
part of both reliability and validity. If a teacher invests time and energy in developing good tests that
accurately reflect the course outcomes, then it is desirable to be able to recycle the tests or similar
materials. This is especially important if analyses show that the items, distracters and test sections are
valid and discriminating. In some parts of the world, cultural attitudes towards “collaborative test-taking”
are a threat to test security and thus to reliability and validity. As a result, there is a trade-off between
letting tests into the public domain and giving students adequate information about tests.
5.5 Summary
This unit dealt with the reliability and usability of a good test. First, the concepts were defined; then the
methods of estimating and assuring reliability, and the factors affecting it, were discussed in detail.
Finally, the concept of practicality was explained.
The procedures for test construction may seem tedious. However, regardless of the complexity of the
tasks in determining the reliability and usability of a test, these concepts are essential parts of test
construction. It means that in order to have an acceptable and applicable test, upon which reasonably
sound decisions can be made, test developers should go through planning, preparing, reviewing, and
pretesting processes.
Without determining these parameters, nobody is ethically allowed to use a test for practical purposes.
Otherwise, the test users are bound to make inexcusable mistakes, unreasonable decisions and unrealistic
appraisals.
VALIDITY OF THE
ASSESSMENT TOOLS
Written by:
Dr. Muhammad Tanveer Afzal
Revised by:
Prof. Dr. Rehana Masrur
CONTENT
Sr. No Topic Page No
Introduction ...............................................................................................................117
Objective ..................................................................................................................118
Examples:
1. Say you are assigned to observe the effect of strict attendance policies on class participation.
After observing for two or three weeks, you report that class participation did increase after the
policy was established.
2. Say you intend to measure intelligence. If math and vocabulary truly represent intelligence,
then a math and vocabulary test might be said to have high validity when used as a measure of
intelligence.
A test has validity evidence if we can demonstrate that it measures what it claims to measure. For instance,
if it is supposed to be a test of fifth grade arithmetic ability, it should measure fifth grade arithmetic
ability and not reading ability.
Activity 6.1: Make a test from any chapter of a 7th class science book and test whether or not it is valid with reference to its content.
There are different types of content validity; the major types, face validity and curricular validity, are described below.
1 Face Validity
Face validity is an estimate of whether a test appears to measure a certain criterion; it does not guarantee
that the test actually measures phenomena in that domain. Face validity is very closely related to content
validity. While content validity depends on a theoretical basis for assuming whether a test is assessing all
domains of a certain criterion (e.g., does assessing addition skills yield a good measure of mathematical
skills? To answer this you have to know what different kinds of arithmetic skills mathematical skills
include), face validity relates to whether a test appears to be a good measure or not. This judgment is
made on the "face" of the test; thus it can also be made by amateurs.
Face validity is a starting point, but a test should never be assumed to be valid for any given purpose on
that basis alone, as the "experts" may be wrong.
For example, suppose you were taking an instrument reportedly measuring your attractiveness, but the
questions were asking you to identify the correctly spelled word in each list. There is not much of a link
between the claim of what it is supposed to do and what it actually does.
2. Curricular Validity
The extent to which the content of the test matches the objectives of a specific curriculum as it is formally
described. Curricular validity takes on particular importance in situations where tests are used for high-
stakes decisions, such as Punjab Examination Commission exams for fifth and eighth grade students and
Boards of Intermediate and Secondary Education Examinations. In these situations, curricular validity
means that the content of a test that is used to make a decision about whether a student should be
promoted to the next levels should measure the curriculum that the student is taught in schools.
Curricular validity is evaluated by groups of curriculum/content experts. The experts are asked to judge
whether the content of the test is parallel to the curriculum objectives and whether the test and curricular
emphases are in proper balance. A table of specifications may help to improve the validity of the test.
Activity 6.3: Curricular validity affects the performance of examinees. How can you measure the curricular validity of tests? Discuss the current practice followed by secondary level teachers with two or three SSTs in your town.
Activity 6.4: Make a test for a child of class 4 which measures the shyness construct of his personality, and validate this test with reference to its construct validity.
There are different types of construct validity; the convergent and the discriminant validity are explained
as follows.
1. Convergent Validity
Convergent validity refers to the degree to which a measure is correlated with other measures that it is
theoretically predicted to correlate with. OR
Convergent validity occurs where measures of constructs that are expected to correlate do so. This is
similar to concurrent validity (which looks for correlation with other tests).
For example, if scores on a specific mathematics test are similar to students' scores on other mathematics
tests, then convergent validity is high (there is a positive correlation between the scores from similar
tests of mathematics).
2. Discriminant Validity
Discriminant validity describes the degree to which the operationalization does not correlate with other
operationalizations that it theoretically should not be correlated with. OR
Discriminant validity occurs where measures of constructs that are expected not to relate to each other do
not correlate, such that it is possible to discriminate between these constructs. For example, if discriminant
validity is high, scores on a test designed to assess students' skills in mathematics should not be positively
correlated with scores from tests designed to assess intelligence.
Convergence and discrimination are often demonstrated by correlation of the measures used within
constructs. Convergent validity and Discriminant validity together demonstrate construct validity.
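In practice, both kinds of evidence come down to computing correlations between sets of scores. A minimal Python sketch with invented, purely hypothetical scores (statistics.correlation requires Python 3.10+):

from statistics import correlation

math_test_a = [55, 62, 70, 48, 90, 75]   # hypothetical scores, same six students
math_test_b = [58, 60, 74, 50, 88, 71]   # a second mathematics test
shyness     = [10, 25, 8, 30, 12, 15]    # a construct that should not align

print(round(correlation(math_test_a, math_test_b), 2))   # high: convergent evidence
print(round(correlation(math_test_a, shyness), 2))       # low/negative: discriminant evidence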
Activity 6.5: Administer any test of English to grade 9 and, on the basis of that test, predict the students' future performance. Compare its results after a month with their monthly English test to check the criterion validity of the test with reference to the predictions made about their performance in English.
For example:
To assess the validity of a diagnostic screening test, the predictor (X) is the test and the criterion (Y) is the
clinical diagnosis. When the correlation is large, this means that the predictor is useful as a diagnostic
tool.
Examples:
1. If higher scores on the Board exams are positively correlated with higher G.P.A.s in the
universities, and vice versa, then the Board exams are said to have predictive validity.
2. We might theorize that a measure of math ability should be able to predict how well a person will
do in an engineering-based profession.
Activity 6.6: Select a teacher made test for 10th grade and discuss it with any teacher for improvement
of the validity evidences in light of factors discussed above.
6.5 Summary
The validity of an assessment tool is the degree to which it measures what it is designed to measure.
Lots of terms are used to describe the different types of evidence for claiming the validity of a test result
for a particular inference. The terms have been used in different ways over the years by different
authors. More important than the terms, is knowing how to look for validity evidence. Does the score
correlate with other measures of the same domain? Does the score predict future performance? Does the
score correlate with other domains within the same test? Does it negatively correlate with scores that
indicate opposite skills? Do the score results make sense when one simply looks at them? What impact
on student behaviour has the test had? Each of these questions relates to different kinds of validity
evidence (specifically: content validity, concurrent validity, predictive validity, construct validity, face
validity). Content validity evidence involves the degree to which the content of the test matches a content
domain associated with the construct. The concurrent validity evidences can be assured by comparing the
two tests. There are many factors that can reduce the validity of a test; teachers and test developers
have to consider these factors while constructing and administering tests. It is better to follow a
systematic procedure, and this rigorous approach may help to improve the validity and the reliability of
the tests.
Web Resources
17. http://changingminds.org/explanations/research/design/types_validity.htm
18. http://professionals.collegeboard.com/higher-ed/validity/aces/handbook/test-validity
19. http://professionals.collegeboard.com/higher-ed/validity/aces/handbook/evidence
20. http://www.socialresearchmethods.net/kb/measval.php
21. http://www.businessdictionary.com/definition/validity.html
22. http://www.cael.ca/pdf/C6.pdf
23. http://www.cambridgeassessment.org.uk/ca/digitalAssets/171263BB_CTdefinitionIAEA08.pdf
UNIT–7
Written By:
Muhammad Idrees
Reviewed By:
Dr. Naveed Sultana
CONTENTS
Sr. No Topic Page No
Introduction ...............................................................................................................135
Objectives .................................................................................................................136
According to W. Wiersma and S.G. Jurs (1990), in some matching exercises the number of premises and
responses is the same; this is termed a balanced or perfect matching exercise. In others, the number of
premises and responses may differ.
Advantages
The chief advantage of matching exercises is that a good deal of factual information can be tested in
minimal time, making the tests compact and efficient. They are especially well suited to who, what, when
and where types of subject matter. Further students frequently find the tests fun to take because they have
puzzle qualities to them.
Disadvantages
The principal difficulty with matching exercises is that teachers often find that the subject matter is
insufficient in quantity or not well suited for matching terms. An exercise should be confined to
homogeneous items containing one type of subject matter (for instance, authors-novels;
inventions-inventors; major events-dates; terms-definitions; rules-examples and the like). Where unlike
clusters of questions are used, the adept but poorly informed student can often recognize the ill-fitting
items by their irrelevant and extraneous nature (for instance, in a list of authors, the inclusion of the
names of capital cities).
The student identifies connected items from two lists. This format is useful for assessing the ability to
discriminate, categorize, and associate amongst similar concepts.
Direct question
Which is the capital city of Pakistan? -------- (Stem)
A. Lahore. -------------------------------------- (Distracter)
B. Karachi. ------------------------------------- (Distracter)
C. Islamabad. ---------------------------------- (Key)
D. Peshawar. ----------------------------------- (Distracter)
Incomplete Statement
The capital city of Pakistan is
A. Lahore.
B. Karachi.
C. Islamabad.
D. Peshawar.
EXAMPLES:
Memory Only Example (Less Effective)
6. Be Grammatically Correct
Use simple, precise and unambiguous wording
Otherwise, students may be able to select the correct answer simply by finding the grammatically correct option
9. Use Only One Correct Option (Or be sure the best option is clearly the best option)
The item should include one and only one correct or clearly best answer
With one correct answer, alternatives should be mutually exclusive and not overlapping
Using MC with questions containing more than one right answer lowers discrimination
between students
11. Use Only a Single, Clearly-Defined Problem and Include the Main Idea in the Question
Students must know what the problem is without having to read the response options
14. Don’t Use MCQ When Other Item Types Are More Appropriate
Limited distracters or assessing problem-solving and creativity
Advantages
The chief advantage of the multiple-choice question, according to N.E. Gronlund (1990), is its versatility.
For instance, it is capable of being applied to a wide range of subject areas. In contrast to short answer
items, which limit the writer to those content areas that are capable of being stated in one or two words,
the multiple-choice item is not so restricted; nor is it necessarily bound to homogeneous items containing
one type of subject matter, as are matching items. A multiple-choice question also greatly reduces the
opportunity for a student to guess the correct answer, from one chance in two with a true-false item to one
in four or five, thereby increasing the reliability of the test. Further, since a multiple-choice item contains
plausible incorrect or less correct alternatives, it permits the test constructor to fine-tune the discrimination
(the degree of homogeneity of the responses) and control the difficulty level of the test.
Disadvantages
N.E. Gronlund (1990) writes that multiple-choice items are difficult to construct. Suitable distracters are
often hard to come by, and the teacher is tempted to fill the void with a "junk" response, the effect of
which is to narrow the range of options available to the test-wise student. They are also exceedingly time-
consuming to fashion, one hour per question being by no means the exception. Finally, they generally
take students longer to complete (especially items containing fine discriminations) than do other types of
objective question.
I. If at all possible, items should require a single-word answer or a brief and definite statement.
Avoid statements that are so indefinite that they may be logically answered by several terms.
a. Poor item:
Motorway (M1) opened for traffic in ____________.
b. Better item:
Motorway (M1) opened for traffic in the year______.
II. Be sure the question or statement poses a problem to the examinee. A direct question is often
more desirable than an incomplete statement because it provides more structure.
III. Be sure the answer that the student is required to produce is factually correct. Be sure the
language used in the question is precise and accurate in relation to the subject matter area being
tested.
IV. Omit only key words; don't eliminate so many elements that the sense of the content is impaired.
a. Poor item:
The ____________ type of test item is usually more _________ than the _____ type.
b. Better item:
The supply type of test item is usually graded less objectively than the _________ type.
V. Word the statement such that the blank is near the end of the sentence rather than near the
beginning. This will prevent awkward sentences.
VI. If the problem requires a numerical answer, indicate the units in which it is to be expressed.
B. Short Answer
Student supplies a response to a question that might consist of a single word or phrase. This format is most
effective for assessing knowledge and comprehension learning outcomes but can be written for higher level
outcomes. Short answer items are of two types.
Simple direct questions
Who was the first president of Pakistan?
Completion items
Advantages
Norman E. Gronlund (1990) writes that short-answer items have a number of advantages.
They reduce the likelihood that a student will guess the correct answer
They are relatively easy for a teacher to construct.
They are well adapted to mathematics, the sciences, and foreign languages where specific types of
knowledge are tested (The formula for ordinary table salt is ________).
They are consistent with the Socratic question and answer format frequently employed in the
elementary grades in teaching basic skills.
Disadvantages
According to Norman E. Gronlund (1990) there are also a number of disadvantages with short-answer
items.
They are limited to content areas in which a student's knowledge can be adequately portrayed by
one or two words.
They are more difficult to score than other types of objective-item tests since students invariably
come up with unanticipated answers that are totally or partially correct.
Short answer items usually provide little opportunity for students to synthesize, evaluate and
apply information.
Example 1:
List the major similarities and differences in the lives of people living in Islamabad and Faisalabad.
Example 2:
Compare advantages and disadvantages of lecture teaching method and demonstration teaching method.
Example:
Identify as many different ways to generate electricity in Pakistan as you can. Give advantages and
disadvantages of each. Your response will be graded on its accuracy, comprehensiveness and practicality.
Your response should be 8-10 pages in length and it will be evaluated according to the RUBRIC (scoring
criteria) already provided.
Test Item:
Name and describe five of the most important factors of unemployment in Pakistan. (10 points)
Rubric/Scoring Criteria:
(i) 1 point for each of the factors named, to a maximum of 5 points
(ii) One point for each appropriate description of the factors named, to a maximum of 5 points
(iii) No penalty for spelling, punctuation, or grammatical errors
(iv) No extra credit for more than five factors named or described.
(v) Extraneous information will be ignored.
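Point-allocation rules like these translate directly into a mechanical scoring routine. As a minimal illustrative sketch (the function name is invented here, not taken from the text), the unemployment item above could be scored as follows in Python:

def score_unemployment_item(named: int, described: int) -> int:
    # 1 point per factor named and 1 per appropriate description, each
    # capped at 5; extra factors, mechanics errors and extraneous
    # information neither add nor subtract points.
    return min(named, 5) + min(described, 5)

print(score_unemployment_item(named=6, described=4))   # 5 + 4 = 9 out of 10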
However, when essay items are measuring higher order thinking skills of cognitive domain, more
complex rubrics are mandatory. An example of Rubric for writing test in language is given below.
Table 7.2: Scoring Criteria (Rubrics) for Essay Type Item for 8th Grade
1. Length
Unsatisfactory: Length of text is not according to the prompt.
Proficient: Length of text is according to the prompt to some extent.
Advance: Length of text is completely according to the prompt.
2. Layout
Unsatisfactory: Writing is not according to the provided format.
Proficient: Writing is according to the provided format to some extent.
Advance: Writing is completely according to the provided format.
3. Vocabulary
Unsatisfactory: Expected KEY WORDS* are not used.
Proficient: Expected KEY WORDS* are used to some extent.
Advance: Expected KEY WORDS* are mostly used.
4. Spelling
Unsatisfactory: Spellings of most words are incorrect.
Proficient: Spellings of some words are incorrect.
Advance: Spellings of all words are correct.
5. Selection and Organization of Ideas
Unsatisfactory: Few ideas are relevant to the task and the given organization.
Proficient: Some ideas are relevant to the task and the given organization.
Advance: Almost all ideas are relevant to the task and the given organization.
6. Punctuation
Unsatisfactory: Very few punctuation marks are used.
Proficient: Some punctuation marks are used.
Advance: Almost all punctuation marks are used.
7. Grammar
Unsatisfactory: Basic GRAMMAR RULES** are rarely used.
Proficient: Occasional use of basic GRAMMAR RULES**.
Advance: Basic GRAMMAR RULES** are mostly used.
* KEY WORDS: Expected key words will be provided for each writing prompt.
(i) Group together all items of similar format, e.g., group all essay type items or MCQs together.
(ii) Arrange test items from easy to hard.
(iii) Space the items for easy reading.
(iv) Keep items and their options on the same page of the test.
(v) Position illustrations, tables, charts, pictures, diagrams or maps near their descriptions.
(vi) Check answer keys carefully.
(vii) Determine how students will record answers.
(viii) Provide adequate and proper space for name and date.
(ix) Make test directions precise and clear.
(x) Proofread the test to make it error free.
(xi) Make all items unbiased (gender, culture, ethnicity, race, etc.).
II. Reproduction of the Test
Most test reproduction in schools is done by photocopy machines. As you well know, the quality of
such copies can vary tremendously. Regardless of how valid and reliable your test might be, poor
printing/copies will not make a good impact. Take the following practical steps to ensure that the time you
spent constructing a valid and reliable test does not end in illegible printing.
Manage printing of the test if test takers are large in number
Make photocopies on a proper/new machine
Use good quality paper and printing
Retain the original test in your own custody
Be careful while making sets of the test (staple different papers carefully)
Maintain confidentiality of the test
7.6 Activities
Suppose you are a teacher and you intend to have a quarterly test of grade 5 students in the
subject of General Science. Prepare a Table of Specification highlighting hierarchy of knowledge,
contents, item types and weightage.
Locate a question paper of Pakistan Studies for class X from last year's board exams and evaluate its
MCQ test items with reference to the guidelines you have learnt in this unit, mentioning any
shortcomings.
Develop an essay type question for class VIII students in the subject of Urdu Language to assess
higher order thinking skills, and prepare guidelines or scoring criteria (rubrics) for evaluators to
minimize bias and subjectivity.
INTERPRETING TEST
SCORES
Written By:
Muhammad Azeem
Reviewed By:
Dr. Muhammad Tanveer Afzal
CONTENT
Sr. No Topic Page No
Introduction ...........................................................................................165
Objectives ...........................................................................................165
8.1 Introduction of Measurement Scales and Interpretation of Test Scores ......166
8.2 Interpreting Test Scores by Percentiles........................................................167
8.3 Interpreting Test Scores by Percentages ......................................................171
8.4 Interpreting Test Scores by ordering and ranking ........................................173
8.4.1 Measurement Scales .......................................................................173
8.4.1.1 Nominal Scale ....................................................................173
8.4.1.2 Ordinal Scale......................................................................174
8.4.1.3 Interval Scale .....................................................................174
8.4.1.4 Ratio Scale .........................................................................174
8.5 Frequency Distribution ................................................................................175
8.5.1 Frequency Distribution Tables ............................................................175
8.6 Interpreting Test Scores by Graphic Displays of Distributions ..................179
8.7 Measures of Central Tendency ..................................................................184
8.7.1 Mean ...........................................................................................185
8.7.2 Median ...........................................................................................187
8.7.3 Mode ...........................................................................................188
8.8 Measures of Variability................................................................................188
8.8.1 Range ...........................................................................................189
8.8.2 Mean Deviation...............................................................................191
8.8.3 Variance ..........................................................................................192
8.8.4 Standard Deviation .........................................................................194
8.9 Estimation .......................................................................................194
8.10 Planning the Test .........................................................................................198
8.11 Constructing and Assembling the Test ......................................................202
8.12 Test Administration .....................................................................................203
8.13 Self Assessment Questions .........................................................................205
8.14 References/Suggested Readings .................................................................208
INTRODUCTION
Raw scores are the points scored on a test when the test is scored according to the set procedure
or rubric of marking. These points are not meaningful without interpretation or further information.
Criterion-referenced interpretation of test scores describes students' scores with respect to certain criteria,
while norm-referenced interpretation describes a student's score relative to the other test takers.
Test results are generally reported to parents as feedback on their children's learning achievements.
Parents have different academic backgrounds, so results should be presented to them in an understandable
and usable way. Among various objectives, three of the fundamental purposes for testing are (1) to portray
each student's developmental level within a test area, (2) to identify a student's relative strength and
weakness in subject areas, and (3) to monitor time-to-time learning of the basic skills. To achieve any one
of these purposes, it is important to select the type of score from among those reported that will permit the
proper interpretation. Scores such as percentile ranks, grade equivalents, and percentage scores differ
from one another in the purposes they can serve, the precision with which they describe achievement, and
the kind of information they provide. A closer look at various types of scores will help differentiate the
functions they can serve and the interpretations or sense they can convey.
OBJECTIVES
After completing this unit, the students will be able to:
understand what test scores are
understand the measurement scales used for test scores
describe ways of interpreting test scores
clarify the accuracy of test scores
explain the meaning of test scores
interpret test scores
assess the usability of test scores
learn basic and significant concepts of statistics
understand and use measures of central tendency in educational measurement
understand and use measures of variation in educational measurement
plan and administer a test
8.1 Introduction of Measurement Scales and Interpretation of Test Scores
Interpreting Test Scores
All types of research data, test result data, survey data, etc., are called raw data and are collected using
four basic scales: nominal, ordinal, interval and ratio. Ratio is more sophisticated than interval, interval is
more sophisticated than ordinal, and ordinal is more sophisticated than nominal. A variable measured on a
"nominal" scale is a variable that does not really have any evaluative distinction: one value is really not
any greater than another. A good example of a nominal variable is gender. With nominal variables there is
a qualitative difference between values, not a quantitative one. Something measured on an "ordinal" scale
does have an evaluative connotation: one value is greater or larger or better than the other. With ordinal
scales, we only know that one value is better than another, or that 10 is better than 9. A variable measured
on an interval or ratio scale has maximum evaluative distinction. After the collection of data, there are
three basic ways to compare and interpret the results: students' performance can be compared and
interpreted against an absolute standard, a criterion-referenced standard, or a norm-referenced standard.
Some examples from daily life and the educational context may make this clear:
1. Absolute standard
Characteristics: simply state the observed outcome.
Daily life: He is 6' 2" tall.
Educational context: He spelled 45 out of 50 English words correctly.
2. Criterion-referenced standard
Characteristics: compare the person's performance with a standard, or criterion.
Daily life: He is tall enough to catch the branch of this tree.
Educational context: His score of 40 out of 50 is greater than the minimum cutoff point of 33, so he must be promoted to the next class.
3. Norm-referenced standard
Characteristics: compare a person's performance with that of other people in the same context.
Daily life: He is the third fastest bowler in the Pakistani squad of 15.
Educational context: His score of 37 out of 50 was not very good; 65% of his class fellows did better.
All three types of score interpretation are useful, depending on the purpose for which comparisons are made.
An absolute score merely describes a measure of performance or achievement without comparing it with
any set or specified standard. Scores are not particularly useful without any kind of comparison.
Criterion-referenced scores compare test performance with a specific standard; such a comparison enables
the test interpreter to decide whether the scores are satisfactory according to established standards. Norm-
referenced tests compare test performance with that of others who were measured by the same procedure.
Teachers are usually more interested in knowing how children compare with a useful standard than how
they compare with other children; but norm-referenced comparisons may also provide useful insights.
For example, a score at the 60th percentile means that the individual's score is the same as or higher than
the scores of 60% of those who took the test. The 50th percentile is known as the median and represents
the middle score of the distribution.
Percentiles have the disadvantage that they are not equal units of measurement. For instance, a difference
of 5 percentile points between two individuals' scores will have a different meaning depending on its
position on the percentile scale, as the scale tends to exaggerate differences near the mean and collapse
differences at the extremes.
Percentiles cannot be averaged nor treated in any other way mathematically. However, they do have the
advantage of being easily understood and can be very useful when giving feedback to candidates or
reporting results to managers.
If you know your percentile score then you know how it compares with others in the norm group. For
example, if you scored at the 70th percentile, then this means that you scored the same or better than 70%
of the individuals in the norm group.
Percentile scores are easily understood even when raw scores tend to bunch up around the group average, i.e., when most of the students are of similar ability and their scores fall within a very small range.
To illustrate this point, consider a typical subject test consisting of 50 questions. Most of the students, who are a fairly similar group in terms of their ability, will score around 40. Some will score a few marks less and some a few more. It is very unlikely that any of them will score less than 35 or more than 45.
Raw achievement scores are a very poor way of analyzing such results. Percentile scores, however, can interpret the results very clearly.
Definition
A percentile is a measure that tells us what percent of the total frequency scored at or below that measure. A percentile rank is the percentage of scores that fall at or below a given score. OR
A percentile is a measure that tells us what percent of the total frequency scored below that measure. A percentile rank is the percentage of scores that fall below a given score.
The two definitions seem the same but are not statistically identical, as the following example shows.
Example No.1
If Aslam stands 25th in a class of 150 students, then 125 students rank below Aslam.
Formula:
To find the percentile rank of a score x out of a set of n scores, where x is included:
    percentile rank = ((B + 0.5E) / n) × 100
Where B = number of scores below x
      E = number of scores equal to x
      n = number of scores
Using this formula, Aslam's percentile rank would be:
    ((125 + 0.5(1)) / 150) × 100 = 83.67, i.e., the 84th percentile
Formula:
To find the percentile rank of a score x out of a set of n scores, where x is not included:
    percentile rank = (number of scores below x / n) × 100
Using this formula, Aslam's percentile rank would be:
    (125 / 150) × 100 = 83.3, i.e., the 83rd percentile
Therefore the two definitions yield different percentile ranks. This difference is significant only for small data sets. If we have the raw data, we can compute a unique percentile rank with either formula.
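A short Python sketch can verify both definitions; the function name percentile_rank and the driver lines below are illustrative, not part of any standard library:

def percentile_rank(scores, x, include_score=True):
    """Percentile rank of score x within scores.
    include_score=True  -> PR = ((B + 0.5E) / n) * 100  ('at or below' definition)
    include_score=False -> PR = (B / n) * 100            ('below only' definition)
    where B = scores below x, E = scores equal to x, n = total scores."""
    n = len(scores)
    below = sum(1 for s in scores if s < x)
    equal = sum(1 for s in scores if s == x)
    return ((below + 0.5 * equal) / n if include_score else below / n) * 100

# Aslam: 25th out of 150, modelled here as ranks 1..150 with Aslam at rank value 126.
ranks = list(range(1, 151))
print(round(percentile_rank(ranks, 126)))         # 84 (x included)
print(round(percentile_rank(ranks, 126, False)))  # 83 (x not included)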
Example No.2
The science test scores are: 50, 65, 70, 72, 72, 78, 80, 82, 84, 84, 85, 86, 88, 88, 90, 94, 96, 98, 98, 99. Find the percentile rank for a score of 84 on this test.
Solution:
First rank the scores in ascending or descending order:
50, 65, 70, 72, 72, 78, 80, 82, 84, | 84, 85, 86, 88, 88, 90, 94, 96, 98, 98, 99
Since there are 2 values equal to 84, assign one to the group "above 84" and the other to the group "below 84". This places 9 scores below the dividing line, so the percentile rank is ((8 + 0.5 × 2) / 20) × 100 = (9 / 20) × 100 = 45, i.e., the 45th percentile.
Example No.3
The science test scores are: 50, 65, 70, 72, 72, 78, 80, 82, 84, 84, 85, 86, 88, 88, 90, 94, 96, 98, 98,
99. Find the percentile rank for a score of 86 on this test.
Solution:
First rank the scores in ascending or descending order
Since there is only one value equal to 86, it will be counted as "half" of a data value for the group "above
86" as well as the group "below 86".
Solution Using Formula:
    percentile rank = ((B + 0.5E) / n) × 100 = ((11 + 0.5(1)) / 20) × 100 = (11.5 / 20) × 100 = 57.5 ≈ 58th percentile
Keep in Mind:
Percentile rank is a number between 0 and 100 indicating the percent of cases falling at or below
that score.
Percentile ranks are usually written to the nearest whole percent: 64.5% = 65% = 65th percentile
Scores are divided into 100 equally sized groups.
Scores are arranged in rank order from lowest to highest.
There is no 0 percentile rank - the lowest score is at the first percentile.
There is no 100th percentile - the highest score is at the 99th percentile.
Percentiles have the disadvantage that they are not equal units of measurement.
Percentiles cannot be averaged nor treated in any other way mathematically.
You cannot perform the same mathematical operations on percentiles that you can on raw
scores. You cannot, for example, compute the mean of percentile scores, as the results may be
misleading.
Quartiles can be thought of as percentile measures. Remember that quartiles break the data set into 4 equal parts. If 100% is broken into four equal parts, we have subdivisions at 25%, 50%, and 75%, creating the first quartile Q1 (25th percentile), the second quartile Q2 (50th percentile, the median), and the third quartile Q3 (75th percentile).
Example:
The detail of Hussan's marks in a math test is shown below. Find Hussan's percentage marks.

Question:         Q1   Q2   Q3   Q4   Q5   Total
Marks:            10   10    5    5   20    50
Marks obtained:    8    5    2    3   10    28

Solution:
Hussan's marks = 28
Total marks = 50
Hussan got = (Marks obtained / Total marks) × 100 = (28 / 50) × 100 = 56%
For example, a number can be used merely to label or categorize a response. This sort of number
(nominal scale) has a low level of meaning. A higher level of meaning comes with numbers that order
responses (ordinal data). An even higher level of meaning (interval or ratio data) is present when numbers
attempt to present exact scores, such as when we state that a person got 17 correct out of 20. Although
even the lowest scale is useful, higher level scales give more precise information and are more easily
adapted to many statistical procedures.
Scores can be summarized by using either the mode (most frequent score), the median (midpoint of the
scores), or the mean (arithmetic average) to indicate typical performance. When reporting data, you
should choose the measure of central tendency that gives the most accurate picture of what is typical in a
set of scores. In addition, it is possible to report the standard deviation to indicate the spread of the scores
around the mean.
Scores from measurement processes can be either absolute, criterion referenced, or norm referenced. An
absolute score simply states a measure of performance without comparing it with any standard. However,
scores are not particularly useful unless they are compared with something. Criterion-referenced scores
compare test performance with a specific standard; such a comparison enables the test interpreter to
decide whether the scores are satisfactory according to established standards. Norm-referenced tests
compare test performance with that of others who were measured by the same procedure. Teachers are
usually more interested in knowing how children compare with a useful standard than how they compare
with other children; but norm referenced comparisons may also provide useful insights.
Criterion-referenced scores are easy to understand because they are usually straightforward raw scores or
percentages. Norm-referenced scores are often converted to percentiles or other derived standard scores.
A student's percentile score on a test indicates what percentage of other students who took the same test
fell below that student's score. Derived scores are often based on the normal curve. They use an arbitrary
mean to make comparisons showing how respondents compare with other persons who took the same
test.
Nominal Data
classification or categorization of data, e.g. male or female
no ordering, e.g. it makes no sense to state that male is greater than female (M > F), etc.
arbitrary labels, e.g. pass = 1 and fail = 2, etc.
Ordinal Data
ordered categories, e.g. ranks such as 1st, 2nd, 3rd
differences between ranks are not meaningful, e.g. the gap between 1st and 2nd need not equal the gap between 2nd and 3rd
Interval Data
ordered, constant scale, but no natural zero
differences make sense, but ratios do not (e.g. 30° − 20° = 20° − 10°, but 20° is not twice as hot as 10°)
e.g. temperature (°C, °F), dates
Ratio Data
ordered, constant scale, natural zero
e.g. height, weight, age, length
One can think of nominal, ordinal, interval, and ratio as being ranked in their relation to one another. Ratio is more sophisticated than interval, interval is more sophisticated than ordinal, and ordinal is more sophisticated than nominal.
Distribution
The distribution of a variable is the pattern of frequencies of the observations.
Frequency Distribution
It is a representation, either in a graphical or tabular format, which displays the number of
observations within a given interval. Frequency distributions are usually used within a statistical context.
Step 1:
Figure out how many classes (categories) you need. There are no hard rules about how many classes to pick, but there are a couple of general guidelines:
Pick between 5 and 20 classes. For the list of IQ scores used in this example (given in Step 10 below), we picked 5 classes.
Make sure you have a few items in each category. For example, if you have 20 items, choose 5 classes (4 items per category), not 20 classes (which would give you only 1 item per category).
Step 2:
Subtract the minimum data value from the maximum data value. The IQ list used in this example has a minimum value of 118 and a maximum value of 154, so:
154 – 118 = 36
Step 3:
Divide your answer in Step 2 by the number of classes you chose in Step 1.
36 / 5 = 7.2
Step 4:
Round the number from Step 3 up to a whole number to get the class width. Rounded up, 7.2
becomes 8.
Step 5:
Write down your lowest value for your first minimum data value:
The lowest value is 118
Step 6:
Add the class width from Step 4 to the lowest value from Step 5 to get the next lower class limit:
118 + 8 = 126
Step 7:
Repeat Step 6 for the other minimum data values (in other words, keep on adding your class
width to your minimum data values) until you have created the number of classes you chose in
Step 1. We chose 5 classes, so our 5 minimum data values are:
118
126 (118 + 8)
134 (126 + 8)
142 (134 + 8)
150 (142 + 8)
Step 8:
Write down the upper class limits. These are the highest values that can be in each category; in most cases you can subtract 1 from the class width and add that to the minimum data value. For example:
118 + (8 – 1) = 125
118 – 125
126 – 133
134 – 141
142 – 149
150 – 157
Step 9:
Add a second column for the number of items in each class, and label the columns with appropriate headings:
IQ           Number
118 – 125
126 – 133
134 – 141
142 – 149
150 – 157
Step 10:
Count the number of items in each class, and put the total in the second column. The list of IQ scores is: 118, 123, 124, 125, 127, 128, 129, 130, 130, 133, 136, 138, 141, 142, 149, 150, 154.
IQ           Number
118 – 125    4
126 – 133    6
134 – 141    3
142 – 149    2
150 – 157    2
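As an illustrative sketch (not the only possible implementation), the ten steps above can be reproduced in a few lines of Python:

import math

# The IQ data used in the example above.
iq = [118, 123, 124, 125, 127, 128, 129, 130, 130, 133,
      136, 138, 141, 142, 149, 150, 154]
num_classes = 5

# Steps 2-4: class width = ceiling of (range / number of classes).
width = math.ceil((max(iq) - min(iq)) / num_classes)   # 36 / 5 -> 8

# Steps 5-10: walk through the classes and count the scores in each.
lower = min(iq)
for _ in range(num_classes):
    upper = lower + width - 1
    count = sum(1 for score in iq if lower <= score <= upper)
    print(f"{lower} - {upper}: {count}")
    lower += width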
Example 2
A survey was taken in Lahore. In each of 20 homes, people were asked how many cars were registered to
their households. The results were recorded as follows:
1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0
Use the following steps to present this data in a frequency distribution table.
1. Divide the results (x) into intervals, and then count the number of results in each interval. In this
case, the intervals would be the number of households with no car (0), one car (1), two cars (2)
and so forth.
2. Make a table with separate columns for the interval numbers (the number of cars per household),
the tallied results, and the frequency of results in each interval. Label these columns Number of
cars, Tally and Frequency.
3. Read the list of data from left to right and place a tally mark in the appropriate row. For example,
the first result is a 1, so place a tally mark in the row beside where 1 appears in the interval
column (Number of cars). The next result is a 2, so place a tally mark in the row beside the 2, and
so on. When you reach your fifth tally mark, draw a tally line through the preceding four marks to
make your final frequency calculations easier to read.
4. Add up the number of tally marks in each row and record them in the final column
entitled Frequency.
Your frequency distribution table for this exercise should look like this:
Table 1. Frequency table for the number of cars registered in each household
Number of cars    Frequency
0                 4
1                 6
2                 5
3                 3
4                 2
By looking at this frequency distribution table quickly, we can see that out of 20 households surveyed,
4 households had no cars, 6 households had 1 car, etc.
Relative frequency and percentage frequency
An analyst studying these data might want to know not only how many households fall in each category, but also what proportion of the households falls into each class interval.
The relative frequency of a particular observation or class interval is found by dividing the frequency (f) by the number of observations (n): that is, f ÷ n. Thus:
Relative frequency = frequency ÷ number of observations
The percentage frequency is found by multiplying each relative frequency value by 100. Thus:
Percentage frequency = relative frequency × 100 = (f ÷ n) × 100
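A minimal Python sketch of these two formulas, applied to the cars-per-household data above (using the standard collections module):

from collections import Counter

# Cars-per-household data from the survey example above.
cars = [1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0]
n = len(cars)

for value, f in sorted(Counter(cars).items()):
    rel = f / n                                  # relative frequency = f / n
    print(f"{value} cars: f = {f}, relative = {rel:.2f}, percentage = {rel * 100:.0f}%")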
8.6 Interpreting Test Scores by Graphic Displays of Distributions
The data from a frequency table can be displayed graphically. A graph can provide a visual display of the distributions, which gives us another view of the summarized data. One example is the graphic representation of the relationship between two different sets of test scores through the use of scatter plots, where we can describe in general terms the direction and strength of the relationship between scores by visually examining them as they are arranged in a graph. Other examples of these types of graphs include histograms and frequency polygons.
A histogram is a bar graph of scores from a frequency table. The horizontal x-axis represents the scores
on the test, and the vertical y-axis represents the frequencies. The frequencies are plotted as bars.
A frequency polygon is a line graph representation of a set of scores from a frequency table. The
horizontal x-axis is represented by the scores on the scale and the vertical y-axis is represented by the
frequencies.
Frequency polygons are a graphical device for understanding the shapes of distributions. They serve the
same purpose as histograms, but are especially helpful in comparing sets of data. Frequency polygons are
also a good choice for displaying cumulative frequency distributions.
To create a frequency polygon, start just as for histograms, by choosing a class interval. Then draw an X-
axis representing the values of the scores in your data. Mark the middle of each class interval with a tick
mark, and label it with the middle value represented by the class. Draw the Y-axis to indicate the
frequency of each class. Place a point in the middle of each class interval at the height corresponding to
its frequency. Finally, connect the points. You should include one class interval below the lowest value in
your data and one above the highest value. The graph will then touch the X-axis on both sides.
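As a concrete sketch, these construction steps can be reproduced with the widely used matplotlib plotting library (an assumption: the original text does not name any software); here we plot the IQ table built earlier, adding one empty class at each end so the polygon touches the X-axis:

import matplotlib.pyplot as plt

# Class midpoints for the IQ table above (width 8), padded with an empty
# class at each end.
midpoints   = [113.5, 121.5, 129.5, 137.5, 145.5, 153.5, 161.5]
frequencies = [0,     4,     6,     3,     2,     2,     0]

plt.plot(midpoints, frequencies, marker="o")
plt.xlabel("IQ (class midpoint)")
plt.ylabel("Frequency")
plt.title("Frequency polygon for the IQ data")
plt.show()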
A frequency polygon for 642 psychology test scores is shown in Figure 1. The first label on the X-axis is
35. This represents an interval extending from 29.5 to 39.5. Since the lowest test score is 46, this interval
has a frequency of 0. The point labeled 45 represents the interval from 39.5 to 49.5. There are three scores
in this interval. There are 150 scores in the interval that surrounds 85.
You can easily discern the shape of the distribution from Figure 1. Most of the scores are between 65 and
115. It is clear that the distribution is not symmetric inasmuch as good scores (to the right) trail off more
gradually than poor scores (to the left). In the terminology of Chapter 3 (where we will study shapes of
distributions more systematically), the distribution is skewed.
A cumulative frequency polygon for the same test scores is shown in Figure 2. The graph is the same as
before except that the Y value for each point is the number of students in the corresponding class interval
plus all numbers in lower intervals. For example, there are no scores in the interval labeled "35," three in
the interval "45,"and 10 in the interval "55."Therefore the Y value corresponding to "55" is 13. Since 642
students took the test, the cumulative frequency for the last interval is 642.
Frequency polygons are useful for comparing distributions. This is achieved by overlaying the frequency
polygons drawn for different data sets. Figure 3 provides an example. The data come from a task in which
the goal is to move a computer mouse to a target on the screen as fast as possible. On 20 of the trials, the
target was a small rectangle; on the other 20, the target was a large rectangle. Time to reach the target was
recorded on each trial. The two distributions (one for each target) are plotted together in Figure 3. The
figure shows that although there is some overlap in times, it generally took longer to move the mouse to
the small target than to the large one.
It is also possible to plot two cumulative frequency distributions in the same graph. This is illustrated
in Figure 4 using the same data from the mouse task. The difference in distributions for the two targets is
again evident.
8.7 Measures of Central Tendency
Suppose that a teacher gave the same test to two different classes and the following results were obtained:
Class 1: 80%, 80%, 80%, 80%, 80%
Class 2: 60%, 70%, 80%, 90%, 100%
If you calculate the mean for both sets of scores, you get the same answer: 80%. But the data from which this mean was obtained were very different in the two cases. It is also possible for two different data sets to have the same mean, median, and mode. For example:
Class A: 72 73 76 76 78
Class B: 67 75 76 76 81
Both Class A and Class B have a mean of 75, a median of 76, and a mode of 76.
The way that statisticians distinguish such cases as this is known as measuring the variability of the
sample. As with measures of central tendency, there are a number of ways of measuring the variability of
a sample.
Probably the simplest method is to find the range of the sample, that is, the difference between the largest
and smallest observation. The range of measurements in Class 1 is 0, and the range in class 2 is 40%.
Simply knowing that fact gives a much better understanding of the data obtained from the two classes. In
class 1, the mean was 80%, and the range was 0, but in class 2, the mean was 80%, and the range was
40%.
Statisticians use summary measures to describe patterns of data. Measures of central tendency refer to
the summary measures used to describe the most "typical" value in a set of values.
Here, we are interested in the typical, most representative score. The three most common measures of central tendency are the mean, the median, and the mode. A teacher should be familiar with these common measures of central tendency.
8.7.1 Mean
The mean is simply the arithmetic average: the sum of the scores divided by the number of scores. It is computed by adding all of the scores and dividing by the number of scores. When statisticians talk about the mean of a population, they use the Greek letter µ to refer to the mean score. When they talk about the mean of a sample, they use the symbol X̄ (read as "X-bar").
It is symbolized as:
    X̄ = ΣX / N
Computation example: find the mean of 2, 3, 5, and 10.
    X̄ = ΣX / N = (2 + 3 + 5 + 10) / 4 = 20 / 4 = 5
Since means are typically reported with one more digit of accuracy than is present in the data, the mean is reported as 5.0 rather than just 5.
Example 1
The marks of seven students in a mathematics test with a maximum possible mark of 20 are given below:
15 13 18 16 14 17 12
Find the mean of this set of data values.
Solution:
    X̄ = ΣX / n = (15 + 13 + 18 + 16 + 14 + 17 + 12) / 7 = 105 / 7 = 15
The mean can also be computed for grouped data, as in the following frequency distribution:

Class interval    Midpoint    f            Midpoint × f
95-99             97          1             97
90-94             92          3            276
85-89             87          5            435
80-84             82          6            492
75-79             77          4            308
70-74             72          3            216
65-69             67          1             67
60-64             62          2            124
                              Σf = 25 = N  Σ(Midpoint × f) = 2015

    X̄ = Σ(Midpoint × f) / N = 2015 / 25 = 80.6
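A short Python sketch of this grouped-mean computation (the class list simply mirrors the table above):

# Grouped mean: sum of (midpoint x f) divided by sum of f.
classes = [(97, 1), (92, 3), (87, 5), (82, 6), (77, 4), (72, 3), (67, 1), (62, 2)]

total_fx = sum(mid * f for mid, f in classes)   # 2015
total_f  = sum(f for _, f in classes)           # 25
print(total_fx / total_f)                       # 80.6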
8.7.2 Median or Md
The score that cuts the distribution into two equal halves (or the middle score in the distribution).
The median of a set of data values is the middle value of the data set when it has been arranged in
ascending order. That is, from the smallest value to the highest value.
Example
The marks of nine students in a geography test that had a maximum possible mark of 50 are given below:
47 35 37 32 38 39 36 34 35
Find the median of this set of data values.
Solution:
Arrange the data values in order from the lowest value to the highest value:
32 34 35 35 36 37 38 39 47
The fifth data value, 36, is the middle value in this arrangement.
Median = 36
In general:
    Median = ((n + 1) / 2)th value, where n is the number of data values in the sample.
If the number of values in the data set is even, then the median is the average of the two middle values.
Fortunately, there is a formula to take care of the more complicated situations, including computing the median for grouped frequency distributions:
    Md = L + ((N/2 − F) / f) × h
Where:
L = lower exact limit of the interval containing Md
F = cumulative frequency below that interval
f = frequency of the interval containing Md
h = width of the interval
N = total number of scores
8.7.3 Mode
Mode is the most frequently occurring score. Note:
o There can be more than one mode. Distributions can be bi- or tri-modal, and we then speak of major and minor modes.
o It is symbolized as Mo.
Example: Find the mode of 2, 2, 6, 0, 9, 6, 8, 5, 4, 5, 4, 6, 4, 7, 4.
Solution: 4 is the most frequently occurring score (it appears four times); therefore the mode is 4.
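A quick way to check such computations is Python's built-in statistics module; the sketch below reuses the data from the examples above:

import statistics

print(statistics.mean([2, 3, 5, 10]))                                  # 5  (mean example)
print(statistics.median([47, 35, 37, 32, 38, 39, 36, 34, 35]))         # 36 (median example)
print(statistics.mode([2, 2, 6, 0, 9, 6, 8, 5, 4, 5, 4, 6, 4, 7, 4]))  # 4  (mode example)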
8.8.1 Range
The range is probably the simplest measure of the variability of a sample: the difference between the largest/maximum/highest and the smallest/minimum/lowest observation.
Range = Highest value − Lowest value
R = XH − XL
Example:
The range of Saleem's four test scores (3, 5, 5, 7) is:
XH = 7 and XL = 3
Therefore R = XH − XL = 7 − 3 = 4
Example
Consider the previous example in which results of the two different classes are:
Class 1: 80%, 80%, 80%, 80%, 80%
Class 2: 60%, 70%, 80%, 90%, 100%
The range of measurements in Class 1 is 0, and the range in class 2 is 40%. Simply knowing that fact
gives a much better understanding of the data obtained from the two classes. In class 1, the mean was
80%, and the range was 0, but in class 2, the mean was 80%, and the range was 40%. The relationship between range and variability can also be shown graphically: the range of Distribution A and Distribution B may be identical even though Distribution A has more variability.
Co-efficient of Range
It is a relative measure of dispersion and is based on the value of the range. It is also called the range co-efficient of dispersion. It is defined as:
Co-efficient of Range = (XH – XL) / (XH + XL)
Let us take two sets of observations. Set A contains the marks of five students in Mathematics out of 25 marks and Set B contains the marks of the same students in English out of 100 marks.
Set A: 10, 15, 18, 20, 20
Set B: 30, 35, 40, 45, 50
The values of the range and co-efficient of range are calculated as:

                        Range            Co-efficient of Range
Set A (Mathematics)     20 − 10 = 10     (20 − 10) / (20 + 10) = 10/30 = 0.33
Set B (English)         50 − 30 = 20     (50 − 30) / (50 + 30) = 20/80 = 0.25
In set A the range is 10 and in set B the range is 20. Apparently it seems as if there is greater
dispersion in set B. But this is not true. The range of 20 in set B is for large observations and the range of
10 in set A is for small observations. Thus 20 and 10 cannot be compared directly. Their base is not the
same. Marks in Mathematics are out of 25 and marks of English are out of 100. Thus, it makes no sense
to compare 10 with 20. When we convert these two values into coefficient of range, we see that
coefficient of range for set A is greater than that of set B. Thus there is greater dispersion or variation in
set A. The marks of students in English are more stable than their marks in Mathematics.
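This comparison is easy to reproduce in code; a minimal Python sketch (the helper name coefficient_of_range is ours, chosen for illustration):

def coefficient_of_range(data):
    """(XH - XL) / (XH + XL): a relative, unit-free measure of dispersion."""
    xh, xl = max(data), min(data)
    return (xh - xl) / (xh + xl)

set_a = [10, 15, 18, 20, 20]   # Mathematics marks, out of 25
set_b = [30, 35, 40, 45, 50]   # English marks, out of 100
print(round(coefficient_of_range(set_a), 2))   # 0.33
print(round(coefficient_of_range(set_b), 2))   # 0.25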
8.8.2 Mean Deviation
The mean deviation is the mean of the absolute deviations of the observations from an appropriate average, usually the mean or the median:
    M.D = Σ|X − X̄| / N
Thus for sample data in which the suitable average is X̄, the mean deviation (M.D) is given by the relation:
    M.D = Σ|X − X̄| / n
For a frequency distribution, the mean deviation is given by:
    M.D = Σf|X − X̄| / Σf
Example:
Calculate the mean deviation from the arithmetic mean of the marks obtained by the nine students given below, and show that the mean deviation from the median is minimum.
Marks (out of 25): 7, 4, 10, 9, 15, 12, 7, 9, 7
Solution:
After arranging the observations in ascending order, we get
Marks: 4, 7, 7, 7, 9, 9, 10, 12, 15
    Mean X̄ = ΣX / n = 80 / 9 = 8.89

Marks (X)    |X − X̄|
4            4.89
7            1.89
7            1.89
7            1.89
9            0.11
9            0.11
10           1.11
12           3.11
15           6.11
Total        21.11

    M.D from mean = Σ|X − X̄| / n = 21.11 / 9 = 2.35
For comparison, the median of the marks is 9, and
    M.D from median = Σ|X − median| / n = 21 / 9 = 2.33
which is smaller than 2.35, confirming that the mean deviation from the median is minimum.
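A small Python sketch of this computation, reusing the marks above (the helper mean_deviation is illustrative, not a library function):

def mean_deviation(data, center=None):
    """Mean of absolute deviations about a chosen average (default: the mean)."""
    if center is None:
        center = sum(data) / len(data)
    return sum(abs(x - center) for x in data) / len(data)

marks = [4, 7, 7, 7, 9, 9, 10, 12, 15]
print(round(mean_deviation(marks), 2))             # 2.35 (about the mean)
print(round(mean_deviation(marks, center=9), 2))   # 2.33 (about the median, 9)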
8.8.3 Variance
Variance is another absolute measure of dispersion. It is defined as the average of the squared differences between each of the observations in a set of data and the mean. For sample data the variance is denoted by S² and the population variance is denoted by σ² (sigma squared). That is:
    S² = Σ(X − X̄)² / n
Thus another name for the variance is the Mean of the Squared Deviations About the Mean (or more simply, the Mean of Squares, MS). The problem with the MS is that its units are squared and thus represent an area, rather than a distance on the X-axis like the other measures of variability.
Example:
Calculate the variance for the following sample data: 2, 4, 8, 6, 10, and 12.
Solution:

X          (X − X̄)²
2          (2 − 7)² = 25
4          (4 − 7)² = 9
8          (8 − 7)² = 1
6          (6 − 7)² = 1
10         (10 − 7)² = 9
12         (12 − 7)² = 25
ΣX = 42    Σ(X − X̄)² = 70

    X̄ = ΣX / n = 42 / 6 = 7
    S² = Σ(X − X̄)² / n = 70 / 6 = 35 / 3 = 11.67
Variance = S² = 11.67
A simple solution to the problem of the MS representing a squared space is to compute its square root, which gives the standard deviation:
    S = √(Σ(X − X̄)² / n)
Since the standard deviation can be very small, it is usually reported with 2-3 more decimals of accuracy than what is available in the original data.
The standard deviation is in the same units as the units of the original observations. If the original observations are in grams, the value of the standard deviation will also be in grams. The standard deviation plays a dominating role in the study of variation in data. It is a very widely used measure of dispersion; it stands like a tower among the measures of dispersion. As far as the important statistical tools are concerned, the first important tool is the mean X̄ and the second important tool is the standard deviation σ. It is based on all the observations and is subject to mathematical treatment. It is of great importance for the analysis of data and for the various statistical inferences.
Properties of the Variance & Standard Deviation:
1. Are always positive (or zero).
2. Equal zero when all scores are identical (i.e., there is no variability).
3. Like the mean, they are sensitive to all scores.
Example: in the previous example,
Variance = S² = 11.67, and Standard Deviation = S = √11.67 = 3.42
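A minimal Python sketch of the definitional (population) variance and standard deviation for that same data:

import math

data = [2, 4, 8, 6, 10, 12]
n = len(data)
mean = sum(data) / n                                 # 7.0

variance = sum((x - mean) ** 2 for x in data) / n    # 70 / 6 = 11.67
std_dev = math.sqrt(variance)                        # about 3.42

print(round(variance, 2), round(std_dev, 2))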
8.8.9 Estimation
Estimation is the goal of inferential statistics. We use sample values to estimate population values. The
symbols are as follows:
Statistic               Sample    Population
Mean                    X̄         µ
Variance                s²        σ²
Standard Deviation      s         σ
It is important that the sample values (estimators) be unbiased. An unbiased estimator of a parameter is
one whose average over all possible random samples of a given size equals the value of the parameter.
Overall Example
Let's reconsider an example from above of two distributions (A & B):

Data:
Distribution A: 150, 145, 100, 100, 55, 50    (Sum = 600, N = 6, X̄ = 100)
Distribution B: 150, 110, 100, 100, 90, 50    (Sum = 600, N = 6, X̄ = 100)

For Distribution A:

X          X̄      X − X̄     (X − X̄)²
150        100     50        2500
145        100     45        2025
100        100      0           0
100        100      0           0
55         100    −45        2025
50         100    −50        2500
ΣX = 600           Σ(X − X̄) = 0    Σ(X − X̄)² = 9050

N = 6
    s² = Σ(X − X̄)² / (n − 1) = 9050 / (6 − 1) = 9050 / 5 = 1810
Note that calculating the variance and standard deviation in this manner requires computing the mean and subtracting it from each score. Since this is not very efficient and can be less accurate as a result of rounding error, a computational formula is typically used. It is given as follows:
    s² = (ΣX² − (ΣX)² / n) / (n − 1)

A          X²
150        22500
145        21025
100        10000
100        10000
55          3025
50          2500
ΣX = 600   ΣX² = 69050

N = 6
Then, plugging the appropriate values into the computational formula gives:
    s² = (69050 − 600² / 6) / (6 − 1) = (69050 − 60000) / 5 = 9050 / 5 = 1810
Note that the defining and computational formulas give the same result, but the computational formula is easier to work with (and potentially more accurate due to less rounding error).

B          X²
150        22500
110        12100
100        10000
100        10000
90          8100
50          2500
ΣX = 600   ΣX² = 65200

N = 6
Then, plugging the appropriate values into the computational formula gives:
    s² = (65200 − 600² / 6) / (6 − 1) = (65200 − 60000) / 5 = 5200 / 5 = 1040
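Both computations are easy to verify in code; a minimal Python sketch assuming the sample (n − 1) divisor used in this example:

def sample_variance_computational(data):
    """s^2 = (sum(X^2) - (sum(X))^2 / n) / (n - 1), with no mean subtraction."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

print(sample_variance_computational([150, 145, 100, 100, 55, 50]))  # 1810.0 (Distribution A)
print(sample_variance_computational([150, 110, 100, 100, 90, 50]))  # 1040.0 (Distribution B)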
8.10 Planning the Test
One essential step in planning a test is to decide why you are giving the test. (The word
"test" is used although we are using it in a broad sense that includes performance
assessments as well as traditional paper and pencil tests.)
Are you trying to sort the students (so you can compare them, giving higher scores to
better students and lower scores to poor students)? If so, you will want to include some
difficult questions that you expect only a few of the better students will be able to
answer correctly. Or do you want to know how many of the students have mastered the
content? If your purpose is the latter, you have no need to distribute the scores, so very
difficult questions are unnecessary. You will, however, have to decide how many
correct answers are needed to demonstrate mastery. Another way to address the "why"
question is to identify if this is to be a formative assessment to help you diagnose
students' problems and guide future instruction, or a summative measure to determine
grades that will be reported to parents.
Airasian (1994) lists six decisions usually made by the classroom teacher in the test
development process: 1. what to test, 2. how much emphasis to give to various
objectives, 3. what type of assessment (or type of questions) to use, 4. how much time to
allocate for the assessment, 5. how to prepare the students, and 6. whether to use the test
from the textbook publisher or to create your own. Other decisions, such as whether to
use a separate answer sheet, arise later.
You, as the teacher, decide what to assess. The term "assess" is used here because the term "test" is frequently associated only with traditional paper and pencil assessments, to the exclusion of alternative assessments such as performance tasks and portfolios. Classroom assessments are generally focused on content that has been covered in the class, either in the immediate past or (as is the case with unit, semester, and end-of-course tests) over a longer period of time. For example, if we were constructing a test for preservice teachers on writing test questions, we might have the following objectives:
1. Know the advantages and disadvantages of selection-type questions.
2. Be able to differentiate between well written and poorly written selection-type questions.
3. Be able to construct appropriate selection-type questions using the guidelines and rules presented in class.
Now that we have made the what decision, we can move to the next step: deciding
how much emphasis to place on each objective. We can look at the amount of time in
class we have devoted to each objective. We can also review the number and types of
assignments the students have been given. For this example, let's assume that 20% of
the assessment will be based on knowing the advantages and disadvantages, 40% will
be on differentiating between well written and poorly written questions, and the other
40% will be on writing good questions. Now our planning can be illustrated with the use
of a table of specifications (also called a test plan or a test blueprint) as shown in table
below.
Table of Specifications:

Objectives/Content area/Topics                      Knowledge   Comprehension   Application   # items / % of test
1. Know the advantages and disadvantages
   of selection-type questions                          X                                     20 / 20%
2. Be able to differentiate between well and
   poorly written selection-type questions                           X                         5 / 40%
3. Be able to construct appropriate selection-type
   questions using the guidelines and rules that
   were presented in class                                                          X           4 / 40%
A table of specifications is a two-way table that matches the objectives or content you
have taught with the level at which you expect students to perform. It contains an
estimate of the percentage of the test to be allocated to each topic at each level at which
it is to be measured. In effect we have established how much emphasis to give to each
objective or topic.
In estimating the time needed for this test, students would probably need from 5 to 10 minutes for the 20 True-False questions (15-30 seconds each), 5 to 7.5 minutes for the five comprehension questions (60-90 seconds each), and 20-30 minutes (a rough estimate) to read the material and write the four questions measuring application. The total time
needed would be from 30 to 48 minutes. If you are a middle or high school teacher,
estimated response time is an important consideration. You will need to allow enough
time for the slowest students to complete your test, and it will need to fit within a single
class period.
Accommodations
Accommodations may be needed for some of your students. It is helpful to keep those
students in mind as you plan your assessments. Some examples of accommodations
include:
Providing written instructions for students with hearing problems
Using large print, reading or recording the questions on audiotape (The student could
record the answers on tape.)
Having an aide or assistant write/mark the answers for the student who has coordination
problems, or having the student record the answers on audiotape or type the answers
Using written assessments for students with speech problems
Administering the test in sections if the entire test is too long for the attention span of a student
Asking the students to repeat the directions to make sure they understand what they are to do
Starting each sentence on a new line to help students identify it as a new sentence
Including an example with each type of question, showing how to mark answers
Written By:
Dr. Muhammad Saeed
Reviewed By:
Dr. Naveed Sultana
CONTENTS
Introduction
Objectives
OBJECTIVES
After studying the Unit, the students will be able to:
1. understand the purpose of reporting test scores
2. explain the functions of test scores
3. describe the essential features of progress report
4. enlist the different types of grading and reporting systems
5. calculate CGPA
6. conduct parent teacher conferences
1. Instructional uses
The focus of grading and reporting should be the improvement of student learning. This is most likely to occur when the report: a) clarifies the instructional objectives; b) indicates the student's strengths and weaknesses in learning; c) provides information concerning the student's personal and social development; and d) contributes to the student's motivation.
The improvement of student learning is probably best achieved by day-to-day assessments of learning and the feedback from tests and other assessment procedures. A portfolio of work developed during the academic year can be displayed to indicate the student's strengths and weaknesses periodically.
Periodic progress reports can contribute to student motivation by providing short-term goals and knowledge of results. Both are essential features of effective learning. Well-designed progress reports can also help in evaluating instructional procedures by identifying areas that need revision. When the reports of a majority of students indicate poor progress, it may be inferred that there is a need to modify the instructional objectives.
2. Feedback to students
Grading and reporting test results to students has been an ongoing practice in all the educational institutions of the world. The mechanism or strategy may differ from country to country or institution to institution, but each institution observes this practice in one form or another. Reporting test scores to students has a number of advantages for them. As the students move up through the grades, the usefulness of the test scores for personal academic planning and self-assessment increases. For most students, the scores provide feedback about how much they know and how effective their efforts to learn have been. They can identify their strengths and the areas that need special attention. Such feedback is essential if students are expected to be partners in managing their own instructional time and effort. These results help them to make good decisions for their future professional development.
Teachers use a variety of strategies to help students become independent learners who are able to take an
increasing responsibility for their own school progress. Self-assessment is a significant aspect of self-
guided learning, and the reporting of test results can be an integral part of the procedures teachers use to
promote self-assessment. Test results help students to identify areas that need improvement, areas in which progress has been strong, and areas in which continued strong effort will help maintain high levels of achievement. Test results can be used with information from teachers' assessments to help students set their own instructional goals, decide how they will allocate their time, and determine priorities for improving skills such as reading, writing, speaking, and problem solving. When students are given their own test results, they can learn about self-assessment while doing actual self-assessment (Iowa Testing Programs, 2011).
Grading and reporting results also provide students an opportunity for developing an awareness of how
they are growing in various skill areas. Self-assessment begins with self-monitoring, a skill most children
have begun developing well before coming to kindergarten.
1. Raw scores
The raw score is simply the number of points received on a test when the test has been scored according to the directions. For example, if a student responds to 65 items correctly on an objective test in which each correct item counts one point, the raw score will be 65.
Although a raw score is a numerical summary of a student's test performance, it is not very meaningful without further information. For example, what does a raw score of 65 mean on its own? How many items were in the test? What kinds of problems were asked? How difficult were the items?
2. Grade norms
Grade norms are widely used with standardized achievement tests, especially at elementary level. The
grade equivalent that corresponds to a particular raw score identifies the grade level at which the typical
student obtains that raw score. Grade equivalents are based on the performance of students in the norm
group in each of two or more grades.
3. Percentile ranking
A percentile is a score that indicates the rank of the score compared to others (of the same grade/age) using a hypothetical group of 100 students. In other words, a percentile rank (or percentile score) indicates a student's relative position in the group in terms of the percentage of students.
Percentile rank is interpreted as the percentage of individuals receiving scores equal to or lower than a given score. A percentile of 25 indicates that the student's test performance equals or exceeds that of 25 out of 100 students on the same measure.
4. Standard scores
A standard score is also derived from the raw scores, using the norm information gathered when the test was developed. Instead of indicating a student's rank compared to others, standard scores indicate how far above or below the average (Mean) an individual score falls, using a common scale, such as one with an average of 100. Basically, standard scores express test performance in terms of standard deviation (SD) units from the Mean. Standard scores can be used to compare individuals of different grades or age groups because all are converted into the same numerical scale. There are various forms of standard scores such as z-scores, T-scores, and stanines.
The z-score expresses test performance simply and directly as the number of SD units a raw score is above or below the Mean. A z-score is always negative when the raw score is smaller than the Mean. Symbolically: z-score = (X − M) / SD.
The T-score refers to any set of normally distributed standard scores that has a Mean of 50 and an SD of 10. Symbolically: T-score = 50 + 10(z).
Stanines are the simplest form of normalized standard scores that illustrate the process of normalization. Stanines are single-digit scores ranging from 1 to 9. These are groups of percentile ranks with the entire group of scores divided into nine parts, with the largest number of individuals falling in the middle stanines, and fewer students falling at the extremes (Linn & Gronlund, 2000).
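The z-score and T-score conversions are simple enough to sketch in Python; the Mean and SD below are hypothetical values chosen purely for illustration:

def z_score(x, mean, sd):
    return (x - mean) / sd                  # z = (X - M) / SD

def t_score(x, mean, sd):
    return 50 + 10 * z_score(x, mean, sd)   # T = 50 + 10(z)

# Hypothetical test with Mean = 60 and SD = 8; a raw score of 72 gives:
print(z_score(72, 60, 8))   # 1.5  (1.5 SD units above the mean)
print(t_score(72, 60, 8))   # 65.0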
7. Checklist of Objectives
To provide more informative progress reports, some schools have replaced or supplemented the
traditional grading system with a list of objectives to be checked or rated. This system is more popular at
elementary school level. The major advantage of this system is that it provides a detailed analysis of the students' strengths and weaknesses. For example, a checklist for assessing reading comprehension can include the following objectives:
Reads with understanding
Works out meaning and use of new words
Reads well to others
Reads independently for pleasure (Linn & Gronlund, 2000).
8. Rating scales
In many schools students' progress is reported on some rating scale, usually 1 to 10, instead of letter grades; 1 indicates the poorest performance while 10 indicates excellent or extraordinary performance. In the true sense, each rating level corresponds to a specific level of learning achievement. Such rating scales are also used in the evaluation of students for admission into different programmes at university level. Some other rating scales can also be seen across the world.
In rating scales, we generally assess students' abilities in terms of 'how much', 'how often', 'how good', etc. (Anderson, 2003). The continuum may be qualitative, such as 'how well a student behaves', or it may be quantitative, such as 'how many marks a student got in a test'. Developing rating scales has become a common practice nowadays, but many teachers still don't possess the skill of developing an appropriate rating scale in the context of their particular learning situations.
9. Letters to parents/guardians
Some schools keep parents informed about the progress of their children by writing letters. Writing letters to parents is usually done by the few teachers who are especially concerned about their students, as it is a time-consuming activity. At the same time, some good teachers avoid writing formal letters because they think that many aspects cannot be clearly interpreted in writing, and some parents also don't feel comfortable receiving such letters.
Linn and Gronlund (2000) state that although letters to parents might provide a good supplement to other types of reports, their usefulness as the sole method of reporting progress is limited by several of the following factors:
Comprehensive and thoughtful written reports require an excessive amount of time and energy.
Descriptions of student learning may be misinterpreted by the parents.
They fail to provide systematic and organized information.
10. Portfolio
The teachers of some good schools prepare a complete portfolio for each of their students. A portfolio is actually a cumulative record of a student which reflects his/her strengths and weaknesses in different subjects over a period of time. It indicates what strategies were used by the teacher to overcome the learning difficulties of the students. It also shows the student's progress periodically, which indicates his/her trend of improvement. Developing a portfolio is really a hard task for the teacher, as he/she has to keep all records of students, such as the teacher's lesson plans, tests, students' best pieces of work, and their assessment records, throughout an academic year.
An effective portfolio is more than simply a file into which student work products are placed. It is a purposefully selected collection of work that often contains commentary on the entries by both students and teachers.
No doubt, the portfolio is a good tool for student assessment, but it has three limitations. First, it is a time-consuming process. Second, the teacher must possess the skill of developing a portfolio, which is most of the time lacking. Third, it is ideal for a small class size; in the Pakistani context, particularly at elementary level, class size is usually large and hence the teacher cannot maintain portfolios for a large class.
3. Conduct the conference with student, parent, and advisor. The advisee takes the lead to the greatest possible extent.
Have a comfortable setting of chairs, tables, etc.
Announce a viable timetable for the conferences
Review goals set earlier
Review progress towards goals
Review progress with samples of work from learning activities
Present the student's strong points first
Review attendance and handling of responsibilities at school and home
Modify goals for the balance of the year as necessary
Determine other learning activities to accomplish goals
Describe upcoming events and activities
Discuss how the home can contribute to learning
Parents should be encouraged to share their thoughts on the student's progress
Ask parents and students for questions and new ideas
9.5 Activities
Activity 1:
Enlist three pros and cons of test scores.
Activity 2:
Give a self-explanatory example of each of the types of test scores.
Activity 3:
Write down the different purposes and functions of test scores in order of importance as per your experience. Add as many more purposes as you can.
Activity 4:
Compare the modes of reporting test scores to parents by MEAP and NCCA. Also conclude which is
relatively more appropriate in the context of Pakistan as per your point of view.
Activity 5:
In view of the strengths and shortcomings of the different grading and reporting systems above, how would you briefly comment on the following characteristics of a multiple grading and reporting system for effective assessment of students' learning?
a) Grading and reporting system should be guided by the functions to be served.
b) It should be developed cooperatively by parents, students, teachers, and other school personnel.
c) It should be based on clear and specific instructional objectives.
d) It should be consistent with school standards.
e) It should be based on adequate assessment.
f) It should provide detailed information on the student's progress, particularly its diagnostic and practical aspects.
g) It should have the space of conducting parent-teacher conferences.
Activity 6:
Explain the differences between relative grading and absolute grading by giving an example of each.
Activity 7:
Faiza Shaheen, a student of MA Education (Secondary), has earned the following marks, grades, and GPA in the 22 courses at the Institute of Education & Research, University of the Punjab. Calculate her CGPA. Note that the maximum value of GPA in each course is 4.
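A hedged sketch of the CGPA computation: since the course table is not reproduced here, the GPAs below are hypothetical, and we assume a credit-weighted average that reduces to a simple mean when all courses carry equal credit (the institution's actual weighting may differ):

def cgpa(gpas, credit_hours=None):
    """Credit-weighted CGPA; with no credit hours supplied, every course
    is weighted equally (an assumption -- actual weighting may differ)."""
    if credit_hours is None:
        credit_hours = [1] * len(gpas)
    return sum(g * c for g, c in zip(gpas, credit_hours)) / sum(credit_hours)

# Hypothetical GPAs for four courses (maximum 4.0 each):
print(round(cgpa([3.7, 3.3, 4.0, 2.7]), 2))   # 3.43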
Activity 8:
Write the Do's and Don'ts in order of priority as per your perception. You may add more points or exclude some of those mentioned above.
9.6 Self-Assessment Questions
Part-I: MCQs:
Encircle the best/correct response against each of the following statements.
1. Comparing a student's performance in a test with that of his/her classmates is referred to as:
a) Learning outcomes
b) Evaluation
c) Measurement
d) Norm-referenced assessment
e) Criterion-referenced assessment
10. Who said that 'lack of information provided to consumers about test data has negative and sweeping consequences'?
a) Hopkins & Stanley
b) Anderson
c) Linn & Gronlund
d) Barber et al.
e) Kearney
Key to MCQs