4 Essential Criteria of a Good Test


4 Essential Criteria of a Good Test | Educational Statistics

This article throws light upon the four essential criteria of a good test. The criteria are: 1. Reliability, 2. Validity, 3. Objectivity and 4. Usability.

Criterion # 1. Reliability:

The dictionary meaning of reliability is consistency, dependability or trust. A measurement procedure is reliable to the extent that repeated measurement gives consistent results for the individual.

A test is considered to be reliable if it yields consistent results in its successive administrations. So by the reliability of a test we mean how dependable or faithful the test is. To express it in a general way, if a measuring instrument measures consistently, it is reliable.

When a test is reliable, scores made by the members of a group upon retest with the

same test or with alternate forms of the same test will differ very little or not at all
from their original values.

Example 1:

If a witness gives the same statement on an issue when asked again and again by a

lawyer in court, we place confidence in his statement and take his statement to be
reliable.

Example 2:

If a watch runs 10 minutes late every day compared to Hindustan time, then we can say that the watch is a reliable instrument.

Example 3:

Suppose we ask Amit to report his date of birth. He reports it to be 13th July, 1985.

After a lapse of time we ask the same question and he reports the same, i.e. 13th July, 1985.

We may put the question again and again and if the answer is the same we feel that
Amit’s statement is a reliable one.

Definitions:

1. Thorndike:

It is the consistency with which a test measures whatever it is supposed to measure. Test reliability is usually thought of as the degree to which the test is free from compensating errors.

2. Gronlund and Linn:

Reliability refers to the consistency of measurement, that is, how consistent test scores or other evaluation results are from one measurement to another.

3. Anastasi:

Reliability refers to the consistency of scores obtained by the same individuals when

re-examined with the same test on different occasions or with different sets of
equivalent items or under variable examining conditions.

4. Davis:

The degree of relative precision of measurement of a set of test scores is defined as reliability.

5. Guilford:

Reliability is the proportion of the true variance in obtained test scores.

From the above discussion it becomes clear that the reliability of a test means the extent to which the test yields the same result on successive administrations to the same population. Other conditions remaining constant, if the same test is administered to the same population on two different occasions and the scores obtained by the individuals on both occasions remain more or less the same, the test is said to be reliable.

Reliability of a test tries to answer the following questions:

(i) How similar would pupils’ scores be if they were given the same test on two different occasions?

(ii) How would the scores vary if a different sample of equivalent items were selected?

(iii) How would the scores vary if the test is scored by a different scorer?

(iv) How would the scores vary if the test is scored by the same scorer at different
times?

Characteristics of Reliability:

Reliability has the following characteristics:


(i) An estimate of reliability always refers to a particular type of consistency.

(ii) It refers to the accuracy or precision of a measuring instrument.

(iii) Reliability refers to the test results not the test itself.

(iv) It is the coefficient of internal consistency.

(v) The reliability of a set of measurements is logically defined as the proportion of the variance that is true variance.

(vi) It is the measure of variable error, or chance error, or measurement error.

(vii) Reliability is a matter of degree. It does not exist on an all-or-none basis.

(viii) Reliability does not ensure the validity, truthfulness or purposiveness of a test.

(ix) Reliability is a necessary but not a sufficient condition for validity. Low reliability

can restrict the degree of validity that is obtained, but high reliability provides no
assurance for a satisfactory degree of validity.

(x) Reliability is primarily statistical in nature, in the sense that the scores obtained on two successive occasions are correlated with each other. This coefficient of correlation is known as self-correlation and its value is called the ‘reliability coefficient’.

Reliability and Errors of Measurement:

The definitions of reliability can be grouped under three headings:

(i) Empirical,

(ii) Theoretical, and

(iii) Logical.

(i) Empirical:

The empirical definitions of reliability refer to the extent of correlation between two

sets of scores on the same test administered on the same individual on different
occasions.

(ii) Theoretical:

The theoretical meaning refers to the consistency or precision of test scores. It means
dependability of a test score.

(iii) Logical:
The logical meaning of reliability refers to errors of measurement.

The following illustration can help us understand the concept of reliability and errors of measurement:

For example, Mr. Rohit secures 52 in a mental test. What does 52 indicate? Does it

speak of his true ability? Is it his true score? Rohit might have secured 52 by mere

chance. It may so happen that, Rohit, by chance, knew 52 items of the test and had
the items been a bit different he would not have secured this score.

All these questions are related to the fact that measurement involves some kinds of errors, viz. personal, constant, variable and interpretive errors. These are called errors of measurement. So while determining the reliability of a test, we must take into consideration the amount of error present in the measurement.

When the coefficient of reliability is perfect (i.e. 1.00), the measurement is accurate and free from all kinds of errors. But measurement in every field involves some kind of error. Therefore, reliability is never perfect.

A score on a test may be thought of as an index of true score plus errors of measurement.

Total score or Actual obtained score = True Score + Error Score

If a score has a large component of ‘true score’ and a small component of error, its reliability is high; contrariwise, if a test score has a small component of ‘true score’ and a large ‘error’ component, its reliability is low.

The relation of the actual obtained score, the true score and the error may be expressed mathematically as follows:

X = X∞ + e

in which X = the obtained score of an individual on a test,

X∞ = the true score of the same individual, and

e = the variable (chance) error.
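
To make this concrete, the model can be simulated. The following is a minimal illustrative sketch in Python (the data are simulated, not taken from the text): it builds obtained scores as X = X∞ + e and then recovers Guilford’s definition of reliability as the proportion of true variance in the obtained scores.

```python
# A minimal sketch of the classical true-score model X = X_true + e,
# using simulated (hypothetical) data.
import random
import statistics

random.seed(1)

true_scores = [random.gauss(50, 10) for _ in range(10000)]  # X_true
errors = [random.gauss(0, 5) for _ in range(10000)]         # e, chance error
obtained = [t + e for t, e in zip(true_scores, errors)]     # X = X_true + e

# Guilford: reliability is the proportion of true variance in obtained scores.
# Expected value here: 10^2 / (10^2 + 5^2) = 0.80.
reliability = statistics.pvariance(true_scores) / statistics.pvariance(obtained)
print(round(reliability, 2))
```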

Errors of Measurement:

True score is the average of the obtained scores on an infinite number of parallel

forms of a test. Each obtained score will be either more or less than the true score.

The deviations of obtained scores from the true scores are called the “Errors of
measurement”.

Sometimes the errors of measurement may be less and sometimes more. Other things being equal, the smaller the errors of measurement, the greater the reliability of the measurement.

Standard Error of Measurement:

The errors of measurement (i.e. the variation of obtained scores from the true score)

will be normally distributed and the standard deviation of these variations (or errors
of measurement) is termed as “standard errors of measurement”.

We can find out Standard Error of Measurement (SE of measurement) when the
reliability coefficient and standard deviation of the distribution is given.

The formula to calculate the standard error of measurement is as follows:

σsc = σ1 √(1 − r11)

in which σsc = the SE of an obtained score,

σ1 = the standard deviation of the test scores, and

r11 = the reliability coefficient of the same test.

Example 4:

In a group of 300 college students, the reliability coefficient of an Aptitude Test in

Mathematics is .75, the test M is 80 and the SD of the score distribution is 16. John
achieves a score of 86. What is the SE of this score?

Solution:

From the above formula we find that σsc = 16 √(1 − .75) = 16 × .50 = 8, and the odds are roughly 2 : 1 that the obtained score of any individual in the group of 300 does not miss its true value by more than ± 8 points (i.e., ± 1 σsc). The .95 confidence interval for John’s true score is 86 ± 1.96 × 8, or 70 to 102.

Generalizing for the entire group of 300 students, we may expect about 1/3 of their scores to be in error by 8 or more points, and 2/3 to be in error by less than this amount.
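
The arithmetic of Example 4 can be checked with a short script, assuming the SEM formula given above:

```python
# Worked check of Example 4: SEM = SD * sqrt(1 - r11), as given above.
import math

r11 = 0.75    # reliability coefficient of the aptitude test
sd = 16       # standard deviation of the score distribution
score = 86    # John's obtained score

sem = sd * math.sqrt(1 - r11)                # 16 * 0.5 = 8
low, high = score - 1.96 * sem, score + 1.96 * sem
print(f"SEM = {sem:.1f}")                    # 8.0
print(f".95 confidence interval: {low:.0f} to {high:.0f}")  # 70 to 102
```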

Criterion # 2. Validity:

The dictionary meaning of validity is “well based”, “efficacious”, “sound”. It refers to

“truthfulness”. Thus, anything which is truthful, well-based and which serves the
right purpose is valid.

Every test has certain objectives of its own. It is constructed for some specific

purpose and it is valid for that purpose. If a test measures what it intends to measure,

it is said to be valid. The validity provides a direct check on how well the test fulfills
its functions. Validity is the first requisite of a test becoming universal.

Reliability is a necessary but not a sufficient condition of validity. A test cannot be valid unless it is reliable, but a test may be reliable without being valid. The relevance of a test is concerned with what the test measures and how it measures it.

In brief, we can say that a test is intended to serve a prediction function, and thus its worth or validity depends on the degree to which it succeeds in estimating performance in some type of real-life situation.

Example 5:

Suppose a witness gives a statement before the judge in a court. If on successive cross-examinations or cross-questioning he repeats the same statement again and again, then he is called a reliable witness.

No doubt, his statement may be right or wrong. When his statement is true, he is said to be a valid witness. But if his statement is consistently wrong, he is reliable but not valid.

Example 6:

If a watch runs 10 minutes ahead of ‘standard time’, it is a reliable timepiece, because it gives a consistent result every day, always 10 minutes fast. But our purpose is to know the time correctly, and we cannot know it. So the very purpose is not served. Thus the watch is not valid as judged by ‘standard time’.

Thus, it is found that a test may be reliable, but it may not be valid. However, valid

measures or tests are always reliable. A test which is valid for a given purpose may
not be valid for another purpose.

A test which has been prepared to measure the computational skill of students in

mathematics may be valid for that purpose only, but not for measuring the
mathematical reasoning. So, validity refers to the very purpose of the test.

Definitions:

Anne Anastasi:

“The validity of a test concerns what the test measures and how well it does so.”

Rommel:

“The validity of an evaluating device is the degree to which it measures what it is intended to measure.”

F.S. Freeman:

“An index of validity shows the degree to which a test measures what it purports to measure when compared with an accepted criterion.”

L.J. Cronbach:
“Validity is the extent to which a test measures what it purports to measure.”

E.F. Lindquist:

Validity is the accuracy with which a test measures what it is intended to measure, or the degree to which it approaches infallibility in measuring what it purports to measure.

From the foregoing discussion we find that validity refers to the “very purpose of the test”: if the purpose is fulfilled, the test is considered valid. So, to be a valid one, a test must do the job it is intended to do.

The concept of validity of a test, therefore, is chiefly a concern for the ‘basic honesty’

of the test. Honesty in the sense of doing what one promises to do. To be precise,
validity refers to how well a tool measures what it intends to measure.

Nature of Validity:

1. Validity refers to the truthfulness or purposiveness of test scores but not to the
instrument itself.

2. Validity is a matter of degree. It does not exist on an all-or-none basis. An instrument designed for measuring a particular ability cannot be said to be either perfectly valid or not valid at all. It is generally more or less valid.

3. It is a measure of ‘constant error’ while reliability is the measure of ‘variable error’.

4. Validity ensures the reliability of a test. If a test is valid, it must be reliable.

5. Validity is not of different types; it is a unitary concept based on various types of evidence.

6. There is no such thing as general validity. A test is valid for some purpose or

situation, yet it is not valid for other purposes. In other words a tool is valid for a
particular purpose or in a particular situation; it is not generally valid.

For example, the results of a vocabulary test may be highly valid to test vocabulary
but may not be that much valid to test composition ability of the student.

Criterion # 3. Objectivity:

Objectivity is the most important characteristic of a good test. It is a prerequisite for both validity and reliability. Objectivity of a test means the degree to which different persons scoring the test arrive at the same result.

C.V. Good (1973):

C.V. Good (1973) defines objectivity in testing as “the extent to which the instrument is free from personal error (personal bias), that is, subjectivity on the part of the scorer.”

Gronlund and Linn (1995):

“Objectivity of a test refers to the degree to which equally competent scorers obtain
the same results.”

Thus, it can be said that a test is considered objective when it eliminates the scorer’s personal opinion and biased judgement.

Objectivity of a test refers to two aspects viz.:


(i) Objectivity of the items, and

(ii) Objectivity of scoring.

(i) Objectivity of the items:

Objectivity of the items means that each item must call for a definite single answer. Objective items cannot have two or more correct answers. When an item can be interpreted differently by different examinees, differences in scoring will occur.

For example:
“Explain the concept of personality.”

Here the scores given by the scorers will vary to a large extent because the question
does not clearly indicate the nature of the correct answer that is expected.

Here the child may write anything pertaining to the question. If the answer is scored
by different examiners, the marks would definitely vary.

Ambiguous questions, lack of proper directions, double-barrelled questions, questions with double negatives, broad essay-type questions etc. do not have objectivity. So, much care is to be exercised while framing the questions.

(ii) Objectivity of scoring:

A tool is objective if it gives the same score even when different scorers score the items. Objectivity in scoring may, thus, be considered as consistency in scoring by different scorers.

Quite often, in actual situations, we find that the scorer’s whims or prejudices influence marking. Questions asked about certain topics for which the scorer has an inclination may fetch more marks than other questions.

This type of irrational temperament towards scoring system is a kind of his/her

subjective treatment of the syllabus which, in turn, affects the evaluation process.
Therefore, objectivity in evaluation is to be ensured for accurate evaluation.

At the same time, subjectivity need not be condemned and entirely excluded, as that

is how most evaluations in reality are made. Subjective assessment based on careful

observation, unprejudiced and unbiased thinking and logical analysis of situations

and phenomena may also give accurate evaluation. This sort of disciplined
subjectivity can play an important role even in a school situation.

Criterion # 4. Usability:

Usability—degree to which the tool of evaluation can be successfully used by the test
users.

We have read by now, the three main criteria of a good test: Validity, reliability and

objectivity. Another important characteristic of a tool is its usability or practicability.

While selecting evaluation tools one must look for certain practical considerations like comprehensiveness, ease of administration and scoring, ease of interpretation, availability of comparable forms and cost of testing.

All these considerations induce a teacher to use tools of evaluation and such practical

considerations are referred to as the “usability” of a tool of evaluation. In other words

usability means the degree to which the tool of evaluation can be successfully used by
the teacher and school administrators.

(i) Comprehensibility:

The test items must be free from ambiguity. The direction to test items and other

directions to the test must be clear and understandable. The directions for

administration and the directions for scoring must be clearly stated so that one can

easily understand and follow them. Moreover, the procedure of test administration, scoring and score interpretation must be within the comprehension of the test user.

(ii) Ease of Administration:

It refers to the ease with which a test can be administered. Each test has its own conditions for administration. While selecting a test, one should choose, from a collection of tests, one which can be administered without much preparation or difficulty.

a. Ease of administration includes clear and concise instructions for administration. So, in order that a test is easily administered, the directions to the administrator and the directions to the testees should be easy, clear and complete.

b. Time is also a very important factor. For most administrations in schools, it is customary that a test be completed within one normal classroom period.

(iii) Ease of Scoring:

In order to be usable, a test should be easy to score. Its scoring key should be ready-made and easy to apply. Sometimes places are earmarked at the right-hand side of the questions for responses.

In some cases responses are given on separate sheets. An ideal test can be scored by anybody, or even by a machine which has been provided with a scoring key. Equal marks should be allotted to each item in the test to make the scoring easier. According to feasibility, either hand-scoring or machine-scoring devices may be provided.

(iv) Ease of Interpretation:

If the test scores obtained can be easily understood and interpreted, a test is said to

be good. For this purpose, the test manual should provide complete norms for

interpretation of scores, such as age norms, grade norms, percentile norms and
standard score norms. The norms facilitate interpretation of test scores.

(v) Get-up of the Test:

The test should have a nice get-up. It must have a good and attractive look. The letters should not be unnecessarily small or big. The quality of paper used, the typography and printing, letter size, spacing, pictures and diagrams presented, the binding, space for pupils’ responses etc. are to be examined.

(vi) Cost of the Test:

The test should not be too costly. The cost should be reduced to the extent possible, so that the test can be used widely.

7 Key Characteristics of a Good Test in Education in 10 Minutes

Sticking to as many characteristics of a good test in education as possible is a challenging process for teachers. In 10 minutes or less, you’ll get a brief on all the commonly agreed-upon characteristics, practical ways to employ them in order to make your test reliable, and top world universities that use them!
One of the major goals of education is to prepare students for the next step in their
future. They have to make sure that their learners have acquired enough knowledge about the
field of study. Only good tests ensure this. A good test is not only a score that learners
struggle to ace.
It’s feedback a student receives to improve his skills and knowledge, and feedback a good teacher always loves to get back to, to make sure their teaching strategies are on point and to see whether they need development or not.
It’s also a feedback for decision-makers in all educational institutions and governmental
positions who need good data to get to the next step of the institution or the State’s education
plan.
Nor should it be something students spend days of anxiety on, wondering how well they will do in a given test, how well the test questions are actually written, and whether they are questions they know the answers to or not.
Table of Contents

- What Are the Characteristics of a Good Test in Education?
  - What Are the Qualities of Good Assessment?
    - Reliability or Consistency
      - How to Make Sure Your Test Is Reliable?
    - Validity
    - Objectivity
    - Comprehensiveness
    - Absence of Ambiguity
    - Preparation
  - Appropriateness of Time
- Our Conclusion of Characteristics of Good Test
- 7 Outstanding Characteristics of a Good Online Test
- Characteristics of a Good Test with Examples
  - What Is the Purpose of a Test?
    - Types of Tests
    - Essay Questions Tests
    - Objective Questions Tests
    - Verbs Best Used in Good Tests
  - Qualities of Traditional and Online Assessment
    - Traditional Assessment
    - Online Assessment
  - How Do I Write a Good Test?
    - 15 Things You Need to Know about the Characteristics of a Good Test in Education
    - Are You Testing Students or Customers?

What Are the Characteristics of a Good Test in Education?
What is a good test in education? It is an evaluation through which teachers measure learners’ abilities and points of weakness and strength. It gauges their knowledge in the field of study and provides both sides with real feedback.
A good test should ensure that learners are ready to move to the next step whether this step is
a high school, college, or even the military.

In our previous event, the first free online webinar, “Ensuring Effective E-Assessment for Higher Education,” the Qorrect e-assessment team discussed the complete cycle of a good test in detail, focusing on higher education examination.

The team discussed how to analyze, design, develop, implement, and evaluate the phases that together comprise the e-assessment life cycle, going through the e-assessment life cycle and its importance to higher education, and edtech’s role in the evolution of the digital assessment process.

That’s in addition to considering the contribution of edtech to improving assessment quality, analyzing the examinees’ responses, and assessing the exam’s quality and the effectiveness of the involved questions in measuring what they are designed to measure.

What Are the Qualities of Good Assessment?

An assessment is a process through which students can share their educational experiences. In order for a test to be a good tool for measuring students’ knowledge and skills, it should have the following characteristics of examination that are essential for the success of any test.

Reliability or Consistency
Reliability or consistency of a test means that learners should perform the same or get the same score if they are exposed to different questions at different times and places. A test is considered reliable when the same result is achieved over different administrations.
As James Carlson mentions in his research memorandum, “The reliability of test scores is the
extent to which they are consistent across different occasions of testing, different editions of
the test, or different raters scoring the test taker’s responses.” He also mentions some
statistics to describe how a test can be reliable.
How to Make Sure Your Test Is Reliable?

1. Score Distribution: The percentage of test takers at each score level.


2. Mean Score: The average score, computed by summing the scores of all test takers
and dividing by the number of test takers.
3. Standard Deviation: A measure of the amount of variation in a set of scores. It can
be interpreted as the average distance of scores from the mean. (Actually, it is a
special kind of average called a “root mean square,” computed by squaring the
distance of each score from the mean score, averaging the squared distances, and
then taking the square root.)
4. Correlation: A measure of the strength and direction of the relationship between
the scores of the same people on two tests.
Reliability is the ratio of the true score variance to the observed score variance. To measure a test’s reliability, we may administer a test to the same group more than once. However, errors may occur as students may forget or have some physical problems. Thus, it is crucial to administer the same test in identical conditions to ensure that we will get the same results.
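
As a minimal sketch with hypothetical scores (statistics.correlation requires Python 3.10+), the statistics listed above can be computed for two administrations of the same test:

```python
# Descriptive statistics and test-retest correlation for two sittings
# of the same test (hypothetical marks).
from collections import Counter
import statistics

first_sitting = [72, 85, 60, 91, 78, 66, 88, 74, 69, 80]
second_sitting = [70, 83, 64, 89, 75, 69, 90, 71, 72, 78]

# Score distribution: how many test takers fall in each 10-point band.
print("distribution:", Counter(score // 10 * 10 for score in first_sitting))
print("mean:", statistics.mean(first_sitting))
print("standard deviation:", round(statistics.pstdev(first_sitting), 2))
# Correlation between the two sittings = a test-retest reliability estimate.
print("test-retest r:", round(statistics.correlation(first_sitting, second_sitting), 2))
```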
Validity
Validity of a test is achieved when the test measures what it is really intended to measure. Therefore, certain criteria must be selected.
Validity is very important to gauge the quality of a given test as questions must be in line
with the selected criteria and measures.
Here are some of the top different types of validity:
Content Validity: A test should fairly represent the content of the course or the field of
study.
Criterion Validity: It is used to predict the performance of a job applicant or a student.
Convergent validity: This is mostly used in the field of sociology or psychology.
Discriminant Validity: Discriminant validity means that a test of a concept is not highly
correlated with other tests that are set to measure theoretically different concepts.

Objectivity
According to Gronlund and Linn, “Objectivity of a test refers to the degree to which equally competent scorers obtain the same results.” The test should be kept away from any personal or subjective judgment; it should be based only on the evaluation of human development.
For example, in an essay-type test, students answer differently as each one has his/her own
style of writing.
Hence, when more than one instructor checks the test, they may give different scores
according to whether they like the style or not. So, here, the test is less objective.
To avoid such bias, sharp rules should be set in evaluating such types of tests. There should
be a unified guide for teachers to use while correcting such tests.
Personal judgment does not occur in true or false or multiple choice tests. Besides, teachers
should receive training on how to score a test as untrained teachers may give wrong scores
and not be able to maintain the required fairness and accuracy.

Comprehensiveness
A test should fully cover the entire field of study that students are exposed to during the course. Vague questions should not be included, especially in online tests, when students are confused and short on time.

Absence of Ambiguity
There has to be no place for ambiguity especially in online tests where examiners are absent.
Students should not be left in confusion and all questions have to be crystal clear.
According to Jacobs, Lucy C., from Indiana University, “ambiguous questions constitute the
major weakness in college tests.
Ambiguous questions often result when instructors put off writing test questions until the last
minute. Careful editing and an independent review of the test items can help to minimize this
problem.”

Preparation
To ensure the success of any test, instructors should take into consideration the following
factors:
 Students have to be well-prepared for the test through extensive revisions and
discussions.
 There should not be any gaps between the revision period and the exam.
 Examiners should make it clear to students which topics are expected to be
tackled in the exam.
 Students should be well-trained for the test type.

Appropriateness of Time
One of the top characteristics of a good test is when students have appropriate time to answer
all questions. For example, essay questions require more time than multiple choice or
true/false questions.
Some teachers take the test themselves first and then double or triple the time for students. A
good test is supposed to be practical and comprehensive.

Our Conclusion of Characteristics of Good Test


There is a strong sense, however, that the use of the word ‘characteristics’ or ‘criteria’ is not
optimal. It implies the development of standards against which assessments could be judged.
Instead, we believe there should be a general agreement that the word ‘framework’ captures
our desire to create a structure that might be useful in the development of a good test in
education more precisely.

7 Outstanding Characteristics of a Good Online Test


1. No logistic setback
2. Easy access from anywhere
3. High speed
4. Support essay questions, multiple-answer questions, short answers, & equation &
scientific questions
5. Built-in questions bank in quality online test systems
6. Immediate student results reports are generated
7. Highly detailed, error-free analytics reports on students’ performance as well as
test and questions quality

Characteristics of a Good Test with Examples


What Is the Purpose of a Test?
It is an evaluation process through which examiners know who you are and what you know
and think. They identify how you are different from others.

Types of Tests
Tests can be categorized into two types according to the questions they tackle:

Essay Questions Tests

1. This type is intended to gauge students’ information and knowledge of the field of
study. It measures their writing skills and how well they are able to show their
personality in writing.
2. There should not be anything to be memorized as students answer according to
their understanding of the course materials.
3. Through this type of test, instructors are able to measure students’ logical thinking and problem-solving skills.

Objective Questions Tests

Such tests are easy to mark as they have definite right and wrong answers, and hence they are free from any personal opinion or subjectivity. For example, true or false and multiple choice questions are objective tests.
In an article titled “Harvard Courses Turn to Monitored Exams, Open-Book Assessments,
and Faith in Students As Classes Move Online” Juliet E. Isselbacher and Amanda Y. Su, The
Crimson writers, showed experiences of different professors through the COVID-19
pandemic and how they were forced to switch to online learning.
Professor Robert N. Stavins decided to change the exam to be open-book so as to guarantee
equality among students especially during the absence of any monitoring during the online
tests.
Other professors preferred to keep the same old style of closed-book exam, ensuring that it is verified and monitored. As professor Chaudoin said, “We have to trust the students, and the online exam tools give us a partial way to monitor things.”
Laura Rose Smith, from the University of Manchester, shared her experience in her article titled “My Online Exam Experience and Top Tips for Students.” She made it clear that online examinations have changed her way of studying.
Instead of just thinking of passing a test, she focused on revision, knowledge, and real
understanding of the course material. She said, “I would recommend using this opportunity to
gain a deeper understanding of your subject area and expand your knowledge further than the
curriculum.”

Verbs Best Used in Good Tests


Educators recommend using a set of key verbs related to Bloom’s classification of educational objectives when writing any type of test questions/test items.

This list of verbs helps guarantee that the teacher or test creator is indeed asking the right questions, matched to the students’ level of knowledge and understanding. Here are some of these verbs, according to the California State University website.

Knowledge: arrange, define, describe, duplicate, identify, label, list, match, memorize, name, order, outline, recognize, relate, recall, repeat, reproduce, select, state

Comprehension: explain, summarize, paraphrase, describe, illustrate, classify, convert, defend, discuss, distinguish, estimate, express, extend, generalize, give example(s), identify, indicate, infer, locate, predict, recognize, rewrite, review, select, translate

Application: use, compute, solve, demonstrate, apply, construct, change, choose, discover, dramatize, employ, illustrate, interpret, manipulate, modify, operate, practice, predict, prepare, produce, relate, schedule, show, sketch, write

Analysis: analyze, categorize, compare, contrast, separate, apply, change, discover, choose, compute, demonstrate, dramatize, employ, illustrate, interpret, manipulate, modify, operate, practice, predict, prepare, produce, relate, schedule, show, sketch, solve, use, write

Synthesis: create, design, hypothesize, invent, develop, arrange, assemble, categorize, collect, combine, comply, compose, construct, devise, explain, formulate, generate, plan, prepare, rearrange, reconstruct, relate, reorganize, revise, rewrite, set up, summarize, synthesize, tell, write

Evaluation: judge, recommend, critique, justify, appraise, argue, assess, attach, choose, compare, conclude, contrast, defend, describe, discriminate, estimate, evaluate, explain, interpret, relate, predict, rate, select, summarize, support, value

Qualities of Traditional and Online Assessment
With the rise of online learning and the use of advanced software systems in education, most instructors have had to change the traditional way of testing.

Traditional Assessment
Teachers used to measure students’ knowledge only by how they scored in a given exam. They gave students only one chance to show their competencies, without discussions or classroom projects.

Online Assessment
Online assessment is a way through which teachers can improve students’ learning,
knowledge, beliefs, and skills. Online assessments can be behavioral, cognitive, or
communicative assessments.
Students may take the online assessment in the classroom or at home and this reduces their
stress. New tools are now introduced for instructors to set different types of assessments.
They can use game-based assessments through many tools such as Kahoot, as mentioned in
our previous article “11 Best Exam and Assessment Platforms of 2021.”
Teachers can also create polls and activities. Moreover, Google Forms enable teachers to
create and grade quizzes. They can choose multiple-choice quizzes or short-answer quizzes.
Some tools also provide teachers with excel reports of students’ grades and feedback can be
sent easily to students directly after the exam. Many advanced software systems allow
teachers to deliver reliable and cheat-free exams to students and grade them instantly. This
saves a lot of time for teachers.
Qorrect (e-exams system) generates automated reports of the test results. To analyze the
quality of the test, it provides feedback that no cheating happened during the online exam,
and analyzes the performance of the students during the examination.

How Do I Write a Good Test?


 Be specific.
 Do not use ambiguous questions.
 Choose a suitable format for your test.
 Avoid open-ended questions unless you are willing to accept any answer.
 Choose your words carefully and avoid any ambiguous language.
 Students should know how much each of the questions is worth.
To conclude, teachers should create their exams away from any subjectivity, ambiguity, or
lack of comprehensiveness. The appropriate format should be selected to match the course
materials and to measure students’ knowledge and skills.

15 Things You Need to Know about the Characteristics of a Good Test in
Education
Here are 15 tips the American Board concluded its workshop “Modes of Classroom
Assessment” with:
1. Bloom’s theory of educational objectives classification, in which cognitive skills
exist in a hierarchical order, is important in any assessment.
2. Assessments work better when they are ongoing and integrated into instruction, as
opposed to episodic and marginally referenced to classroom instruction.
3. Lots of faculty members use MCQs for their classes because they are able to cover
much greater content and in a very short period of time; plus they’re known to be
very easy, compared to other questions, and are quick to score.
4. Many professors and test managers prefer to use other types of tests to assess
their students/examinees: essays, papers or electronic portfolios, projects, and
presentations.
5. Both views are not incorrect! However, a great teacher would use all of the
previously mentioned assessment forms throughout the academic year.
6. The value of tests is much greater and more pronounced when they are performed
as a part of a completely comprehensive program that’s designed to enhance
learning, progress, performance, and the educational institution’s success
7. “A comprehensive assessment-instruction system should contain a variety of
assessment techniques.”
8. A test can only test what it was formed to assess. So it’s up to the decision-makers
to process the data generated from it.
9. Summative assessments are called “assessments of learning,” and formative
assessments are called “assessment for learning.”
10. There should always be a balance between the intellectual skills being assessed.
11. Specific and descriptive instructional feedback that helps students improve their
learning and prepare for mastery of the curricular topics is central to effective
formative assessments.
12. Frequent short tests are more instructionally helpful and provide better assessment
data than infrequent extended exams.
13. Diagnostic assessments measure a student’s current knowledge and skills for the
purpose of identifying a personalized program of learning for that student.
14. Quality assessments are valid, reliable and unbiased.
15. A test is no better than the quality of items it contains.

Are You Testing Students or Customers?


In their book “The Trouble with Higher Education,” Patrick Smith and Trevor Hussey had a
unique outlook on the system of education in general.
The book linked the rise of consumerism with education, addressing the effects this now has
on everything learning-related.

That includes how the system of universities works today and the high prices we face in a lot of the world’s top higher ed institutions and universities… so high that some students stay in debt for years (although the book addresses education in the UK, a lot of other countries may relate to the issues raised).
It is a significant consequence of these changes that students have come to see
themselves as customers. Increasingly their perception is that they are buying a
product. This encourages an instrumental view of education: its value lies not in
itself but in what it can be used to gain. An education that has to be purchased at
great expense is purchased for a purpose, and that purpose is what it will earn. At
the very least it must pay for itself.
Because of this we must start to raise the bar for quality education and testing, because students, or customers, are now less willing to tolerate lower-quality education, teaching, facilities, testing, grading, and reporting.

What are the main criteria for a good test?

Functionality, concurrency, compatibility, performance, usability, accessibility and security. Considering these dimensions of quality while testing would pretty much cover the completeness of testing.

3.1 VALIDITY

A test is said to be valid if it measures accurately what it is intended to measure. This seems simple enough. When closely examined, however, the concept of validity reveals a number of aspects, each of which deserves our attention.

3.1.1 Content validity

A test is said to have content validity if its content constitutes a representative sample of the language skills, structures, etc. with which it is meant to be concerned. It is obvious that a grammar test, for instance, must be made up of items testing knowledge or control of grammar. But this in itself does not ensure content validity. The test would have content validity only if it included a proper sample of the relevant structures. Just what are the relevant structures will depend, of course, upon the purpose of the test. We would not expect an achievement test for intermediate learners to contain just the same set of structures as one for advanced learners.

In order to judge whether or not a test has content validity, we need a specification of the skills or structures, etc. that it is meant to cover. Such a specification should be made at a very early stage in test construction. It is not to be expected that everything in the specification will always appear in the test; there may simply be too many things for all of them to appear in a single test.

What is the importance of content validity? First, the greater a test’s content validity, the more likely it is to be an accurate measure of what it is supposed to measure. Secondly, a test that lacks content validity is likely to have a harmful backwash effect: areas which are not tested are likely to become areas ignored in teaching and learning. The best safeguard against this is to write full test specifications and to ensure that the test content is a fair reflection of these.

The effectiveness of a content validity strategy can be enhanced by making sure that the experts are truly experts in the appropriate field and that they have adequate and appropriate tools in the form of rating scales so that their judgments can be sound and focused. However, testers should never rest on their laurels. Once they have established that a test has adequate content validity, they must immediately explore other kinds of validity of the test in terms related to the specific performances of the types of students for whom the test was designed in the first place.

3.1.2 Criterion-related validity/ Empirical validity

There are essentially two kinds of criterion-related validity: concurrent validity and predictive validity. Concurrent validity is established when the test and the criterion are administered at about the same time. To exemplify this kind of validation in achievement testing, let us consider a situation where course objectives call for an oral component as part of the final achievement test. The objectives may list a large number of ‘functions’ which students are expected to perform orally; to test all of them might take 45 minutes for each student. This could well be impractical.

The second kind of criterion-related validity is predictive validity. This concerns the degree to which a test can predict candidates’ future performance. An example would be how well a proficiency test could predict a student’s ability to cope with a graduate course at a British university. The criterion measure here might be an assessment of the student’s English as perceived by his or her supervisor at the university, or it could be the outcome of the course (pass/fail, etc.).
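
A criterion-related validity study is usually summarized by a correlation, the validity coefficient, between test scores and the criterion measure. Here is a minimal sketch with hypothetical data (Python 3.10+ for statistics.correlation):

```python
# Estimating predictive validity: correlate proficiency-test scores with
# a later criterion measure, here supervisors' ratings (hypothetical data).
import statistics

test_scores = [55, 62, 70, 48, 81, 66, 59, 74]                  # proficiency test
supervisor_ratings = [3.0, 3.5, 4.0, 2.5, 4.5, 3.5, 3.0, 4.0]   # criterion measure

validity_coefficient = statistics.correlation(test_scores, supervisor_ratings)
print(f"predictive validity coefficient: {validity_coefficient:.2f}")
```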

3.1.3 Construct validity

A test, part of a test, or a testing technique is said to have construct validity if it can be demonstrated that it measures just the ability which it is supposed to measure. The word ‘construct’ refers to an underlying ability (or trait) which is hypothesized in a theory of language ability. One might hypothesize, for example, that the ability to read involves a number of sub-abilities, such as the ability to guess the meaning of unknown words from the context in which they are met. It would be a matter of empirical research to establish whether or not such a distinct ability existed and could be measured. If we attempted to measure that ability in a particular test, then that part of the test would have construct validity only if we were able to demonstrate that we were indeed measuring just that ability.

Construct validity is the most important form of validity because it asks the fundamental validity question: what is this test really measuring? We have seen that all variables derive from constructs and that constructs are non-observable traits, such as intelligence, anxiety, and honesty, “invented” to explain behavior. Constructs underlie the variables that researchers measure. You cannot see a construct; you can only observe its effect. Why does this person act one way and that person a different way? Because one is intelligent and one is not, or one is dishonest and the other is not. We cannot prove that constructs exist, just as we cannot perform brain surgery on a person to “see” his or her intelligence, anxiety, or honesty.

3.1.4 Face validity

A test is said to have face validity if it looks as if it measures what it is supposed to measure. For example, a test which pretended to measure pronunciation ability but which did not require the candidate to speak (and there have been some) might be thought to lack face validity. This would be true even if the test’s construct and criterion-related validity could be demonstrated. Face validity is hardly a scientific concept, yet it is very important. A test which does not have face validity may not be accepted by candidates, teachers, education authorities or employers. It may simply not be used; and if it is used, the candidates’ reaction to it may mean that they do not perform on it in a way that truly reflects their ability.

3.1.5 The use of validity

What use is the reader to make of the notion of validity? First, every effort should be
made in constructing tests to ensure content validity. Where possible, the tests
should be validated empirically against some criterion. Particularly where it is
intended to use indirect testing, reference should be made to the research literature
to confirm that measurement of the relevant underlying constructs has been
demonstrated using the testing techniques that are to be used.

3.2 RELIABILITY

Reliability is a necessary characteristic of any good test: for it to be valid at all, a test must first be reliable as a measuring instrument. If a test is administered to the same candidates on different occasions (with no language practice work taking place between these occasions), then, to the extent that it produces differing results, it is not reliable. Reliability measured in this way is commonly referred to as test/re-test reliability, to distinguish it from mark/re-mark reliability. In short, in order to be reliable, a test must be consistent in its measurements. Factors affecting the reliability of a test are:

1. The extent of the sample of material selected for testing: whereas validity is
concerned chiefly with the content of the sample, reliability is concerned with
its size. The larger the sample (i.e. the more tasks the testees have to
perform), the greater the probability that the test as a whole is reliable;
hence the favoring of objective tests, which allow for a wide field to be covered.
2. The administration of the test: is the same test administered to different groups
under different conditions or at different times? Clearly, this is an important
factor in deciding reliability, especially in tests of oral production and listening
comprehension.

One method of measuring the reliability of a test is to re-administer the same test
after a lapse of time. It is assumed that all candidates have been treated in the same
way in the interval – that they have either all been taught or that none of them
have. Another means of estimating the reliability of a test is by administering parallel
forms of the test to the same group. This assumes that two similar versions of a
particular test can be constructed; such tests must be identical in the nature of their
sampling, difficulty, length, rubrics, etc.

3.2.1 How to make tests more reliable?

As we have seen, there are two components of test reliability: the performance of candidates from occasion to occasion, and the reliability of the scoring.

Take enough samples of behavior. Other things being equal, the more items you have on a test, the more reliable that test will be. This seems intuitively right. While it is important to make a test long enough to achieve satisfactory reliability, it should not be made so long that the candidates become so bored or tired that the behavior they exhibit becomes unrepresentative of their ability. At the same time, it may often be necessary to resist pressure to make a test shorter than is appropriate. The usual argument for shortening a test is that it is not practical.
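
The length-reliability trade-off described here can be quantified with the standard Spearman-Brown prophecy formula, which the text does not name. A small sketch, assuming an original reliability of 0.60:

```python
# Spearman-Brown prophecy formula: how reliability changes when a test
# is lengthened or shortened by a factor k (hypothetical starting value).
def spearman_brown(r: float, k: float) -> float:
    return k * r / (1 + (k - 1) * r)

r = 0.60  # reliability of the original test (assumed)
for k in (0.5, 1, 2, 3):  # halving, keeping, doubling, tripling the items
    print(f"{k}x items -> reliability {spearman_brown(r, k):.2f}")
# Doubling the items raises 0.60 to 0.75; halving drops it to 0.43.
```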
Do not allow candidates too much freedom. In some kinds of language test there is a tendency to offer candidates a choice of questions and then to allow them a great deal of freedom in the way that they answer the ones that they have chosen. Such a procedure is likely to have a depressing effect on the reliability of the test. The more freedom that is given, the greater the differences between performances are likely to be.

Write unambiguous items. It is essential that candidates should not be presented with items whose meaning is not clear or to which there is an acceptable answer which the test writer has not anticipated.

Provide clear and explicit instructions. This applies both to written and oral instructions. If it is possible for candidates to misinterpret what they are asked to do, then on some occasions some of them certainly will. Test writers should not rely on the students’ powers of telepathy to elicit the desired behavior.
Ensure that tests are well laid out and perfectly legible. Too often, institutional tests are
badly typed (or handwritten), have too much text in too small a space, and are poorly
reproduced. As a result, students are faced with additional tasks which are not ones meant to
measure their language ability. Their variable performance on the unwanted tasks will lower
the reliability of a test.
Candidates should be familiar with format and testing techniques. If any aspect of a test is unfamiliar to candidates, they are likely to perform less well than they would do otherwise (on subsequently taking a parallel version, for example). For this reason, every effort must be made to ensure that all candidates have the opportunity to learn just what will be required of them.
Provide uniform and non-distracting conditions of administration. The greater the
differences between one administration of a test and another, the greater the differences
one can expect between a candidate’s performance on two occasions. Great care should be
taken to ensure uniformity.
Use items that permit scoring which is as objective as possible. This may appear to be a recommendation to use multiple choice items, which permit completely objective scoring. An alternative to the multiple choice item is the open-ended item which has a unique, possibly one-word, correct response which the candidates produce themselves. This too should ensure objective scoring, but in fact problems with such matters as spelling, which may make a candidate’s meaning unclear, often make demands on the scorer’s judgment. The longer the required response, the greater the difficulties of this kind.
Make comparisons between candidates as direct as possible. This reinforces the suggestion already made that candidates should not be given a choice of items and that they should be limited in the way that they are allowed to respond. Scoring compositions all on one topic will be more reliable than if the candidates are allowed to choose from six topics, as has been the case in some well-known tests. The scoring should be all the more reliable if the compositions are guided (see “Do not allow candidates too much freedom” above).

Provide a detailed scoring key. This should specify acceptable answers and assign points for partially correct responses. For high scorer reliability the key should be as detailed as possible in its assignment of points.
Train scorers. This is especially important where scoring is most subjective. The scoring of compositions, for example, should not be assigned to anyone who has not learned to score compositions from past administrations accurately. After each administration, patterns of scoring should be analyzed. Individuals whose scoring deviates markedly and inconsistently from the norm should not be used again.
Identify candidates by number, not name. Scorers inevitably have expectations of candidates that they know. Except in purely objective testing, this will affect the way that they score.
Studies have shown that even where the candidates are unknown to the scorers, the name
on a script (or a photograph) will make a significant difference to the scores given. For
example, a scorer may be influenced by the gender or nationality of a name into making
predictions which can affect the score given. The identification of candidates only by number
will reduce such effects.
Employ multiple, independent scoring. As a general rule, and certainly where testing is
subjective, all scripts should be scored by at least two independent scorers. Neither scorer
should know how the other has scored a test paper. Scores should be recorded on separate
score sheets and passed to a third, senior, colleague, who compares the two sets of scores
and investigates discrepancies.
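
A minimal sketch of this double-scoring procedure, with hypothetical marks and an assumed tolerance for acceptable disagreement:

```python
# Two independent scorers; discrepancies beyond a tolerance are referred
# to a third, senior colleague, as described above (hypothetical marks).
TOLERANCE = 2  # assumed maximum acceptable difference in marks

scorer_a = {"cand_001": 14, "cand_002": 11, "cand_003": 18, "cand_004": 9}
scorer_b = {"cand_001": 13, "cand_002": 16, "cand_003": 18, "cand_004": 8}

for candidate in scorer_a:
    a, b = scorer_a[candidate], scorer_b[candidate]
    if abs(a - b) > TOLERANCE:
        print(f"{candidate}: scores {a} and {b} differ - refer to senior colleague")
    else:
        print(f"{candidate}: agreed final mark {(a + b) / 2}")
```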

3.3 ADMINISTRATION

A test must be practicable; in other words, it must be fairly straightforward to administer. It is only too easy to become so absorbed in the actual construction of the test items that the most obvious practical considerations concerning the test are overlooked. The length of time available for the administration of the test is frequently misjudged even by experienced test writers, especially when the complete test consists of a number of sub-tests. In such cases sufficient time may not be allowed for the administration of the test; a tryout of the test (i.e. a trial administration to a small but representative group of testees) helps in judging this.

Another practical consideration concerns the answer sheets and the stationery used. Many tests require the testees to enter their answers on the actual question paper (e.g. circling the letter of the correct option), thereby unfortunately reducing the speed of the scoring and preventing the question paper from being used a second time. In some tests the candidates are presented with a separate answer sheet, but too often insufficient thought has been given to possible errors arising from the (mental) transfer of answers to the answer sheet itself.

A final point concerns the presentation of the test paper itself. Where possible, it should be printed or typewritten and appear neat, tidy and aesthetically pleasing. Nothing is worse and more disconcerting to the testee than an untidy test paper, full of misspellings, omissions and corrections.

Features of Good Tests

A good test should be:


1-Valid: It means that it measures what it is supposed to measure. It tests what it ought to
test. A good test which measures control of grammar should have no difficult lexical items.

2- Reliable: If it is taken again by the same students under the same conditions, the score will be almost the same, provided that the time between the test and the retest is of reasonable length. If it is given twice to the same students under the same circumstances, it will produce almost the same results. In this case it is said that the test provides consistency in measuring the items being evaluated.

3- Practical: It is easy to be conducted, easy to score without wasting too much time or effort.

4- Comprehensive: It covers all the items that have been taught or studied. It includes items
from different areas of the material assigned for the test so as to check accurately the amount
of students’ knowledge

5- Relevant: It measures reasonably well the achievement of the desired objectives.

6- Balanced: It tests linguistic as well as communicative competence and it reflects the real
command of the language. It tests also appropriateness and accuracy.

7- Appropriate in difficulty: It is neither too hard nor too easy. Questions should be progressive in difficulty to reduce stress and tension.

8- Clear: Questions and instructions should be clear. Pupils should know exactly what to do.

9- Authentic: The language of the test should reflect everyday discourse.

10- Appropriate for time: A good test should be appropriate in length for the allotted time.

11- Objective: If it is marked by different teachers, the score will be the same. The marking process should not be affected by the teacher’s personality. Questions and answers should be so clear and definite that the marker gives the student the score he/she deserves.

12- Economical: It makes the best use of the teacher’s limited time for preparing and grading, and of the pupil’s assigned time for answering all items. So, we can say that oral exams in classes of 30+ students are not economical, as they require too much time and effort to conduct.

Validity

We now come to one of the most important aspects of testing: validity. According to Hughes, every test should be reliable as well as valid; both notions are crucial elements of testing. However, according to Moss (1994), there can be validity without reliability, and sometimes the border between the two notions simply blurs. Apart from those elements, a good test should be efficient as well. According to Bynom (Forum, 2001), validity deals with what is tested and with the degree to which a test measures what it is supposed to measure (Longman Dictionary, LTAL). For example, if we test students' writing skills by giving them a composition on Ways of Cooking, we cannot call such a test valid, for it can be argued that it tests not the ability to write but knowledge of cooking. It is certainly very difficult to design a proper test with good validity; therefore, the author of the paper believes it is essential for the teacher to know and understand what validity really is.

According to Weir (1990:22), there are five types of validity:

· construct validity;
· content validity;
· face validity;
· washback validity;
· criterion-related validity.

Weir (ibid.) states that construct validity is a theoretical concept that involves the other types of validity. Further, quoting Cronbach (1971), Weir writes that to construct or plan a test one should research the testee's behaviour and mental organization. The construct is the ground on which the test is based; it is the starting point for constructing the test tasks. In addition, Weir cites Kelly's idea (1978) that test design requires some theory, even if the exposure to it is indirect. Moreover, if we are able to define the theoretical construct at the beginning of test design, we will be able to use it when dealing with the results of the test. The author of the paper assumes that a test appropriately constructed at the beginning will not cause any difficulties in its administration and scoring later.

Another type of validity is content validity. Weir (ibid.) implies that content validity and construct validity are closely bound and sometimes even overlap.

Speaking about content validity, we should emphasize that it is an essential element of a good test. What is meant is that the duration of the classes or of the test itself is usually rather limited, and if we teach a rather broad topic such as "computers", we cannot design a test that would cover all the aspects of that topic. Therefore, to check the students' knowledge we have to choose from what was taught, whether that was specific vocabulary or various texts connected with the topic, for it is impossible to test the whole of the material. The teacher should not pick out tricky pieces that were only mentioned once or were not discussed in the classroom at all, even though they belong to the topic. S/he should not forget that the test is not a punishment or an opportunity for the teacher to show the students that they are less clever. Hence, we can state that content validity is closely connected with the definite material that was taught and is supposed to be tested.
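To make this sampling idea concrete, here is a small, purely illustrative Python sketch (the item pools and the helper name draw_test_items are invented for the example): test items are drawn at random, but only from the pool of material actually taught, never from untaught corners of the topic.

# A minimal sketch of content sampling: items are drawn only from material
# actually covered in class. The item pools are invented for illustration.
import random

taught_items = [
    "hardware vs. software", "input and output devices", "saving a file",
    "parts of a computer", "writing an e-mail", "using a search engine",
]
# Belongs to the topic, but only mentioned once or never discussed in class:
untaught_items = ["network protocols", "binary arithmetic"]

def draw_test_items(pool, n, seed=None):
    # Randomly sample n items so the test represents the taught syllabus
    # rather than the teacher's favourite (or trickiest) questions.
    return random.Random(seed).sample(pool, n)

test_items = draw_test_items(taught_items, 4, seed=42)
assert not set(test_items) & set(untaught_items)  # nothing untaught sneaks in
print(test_items)

The fixed seed only makes the example reproducible; in practice a fresh random draw for each version of the test would serve the same purpose.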
Face validity, according to Weir (ibid.), is not a matter of theory or sampling design. It is how the examinees and the administrative staff see the test: whether it appears construct and content valid or not. This will certainly involve debates and discussions about a test; it will involve the teachers' cooperation and the exchange of their ideas and experience.

Another type of validity to be discussed is washback validity, or backwash. According to Hughes (1989:1), backwash is the effect of testing on the teaching and learning process. It can be both negative and positive. Hughes believes that if the test is considered a significant element, then preparation for it will occupy most of the time, and other teaching and learning activities will be neglected. As far as the author of the paper is concerned, this is already a habitual situation in the schools of our country, for our teachers are faced with the centralized exams, and all they have to do is prepare their students for them. Thus, the teacher starts concentrating purely on the material that could be encountered in the exam papers, drawing on examples taken from past exams. As a result, numerous interesting activities are left behind; the teachers are concerned only with the result and forget about the different techniques that could be introduced and later used by their students to make dealing with the exam tasks easier, such as guessing from the context, applying schemata, etc. The problem arises when the objectives of the course taught during the study year differ from the objectives of the test. The result is negative backwash: the students were taught, for example, to write a review of a film, but in the test they are asked to write a letter of complaint, which the teacher has neither planned nor taught. Often negative backwash is caused by inappropriate test design.
Hughes further speaks in his book about multiple-choice activities that are designed to check the students' writing skills. The author of the paper is very confused by that, for it is hard to imagine how the writing of an essay could be tested with the help of multiple-choice items. When testing an essay, the teacher is first of all interested in the students' ability to express their ideas in writing: how it has been done, what language has been used, whether the ideas are supported and discussed, etc. For this purpose the multiple-choice technique is highly inappropriate.

Nevertheless, according to Hughes, apart from the negative side of backwash there is a positive side as well. It could be the creation of an entirely new course designed especially to enable the students to pass their final exams. A test given in the form of final exams compels the teacher to reorganize the course and to choose appropriate books and activities to achieve the set goal: passing the exam. Further, Hughes emphasizes the importance of partnership between teaching and testing. Teaching should meet the needs of testing; that is, teaching should correspond to the demands of the test. This, however, is rather complicated work, for, to the author's knowledge, the teachers in our schools are not supplied with specially designed materials that could assist them in preparing their students for the exams. The teachers are given only vague instructions and are left free to act on their own.

The last type to be discussed is criterion-related validity. Weir (1990:22) holds that it concerns the link between test scores and some other measure of the same performance: either an older, established test or a future criterion performance. The author of the paper considers that this type of validity is closely connected with the criteria and evaluation system the teacher uses to assess the test. It means that the teacher has to work out a definite evaluation system and, moreover, should explain what s/he finds important and worth evaluating, and why. Usually teachers design their own system; often this takes the form of points that the students can obtain for fulfilling a certain task. Later the points are added up and converted into the mark to be awarded. Furthermore, the teacher can have a special table relating points to marks. To the author's knowledge, the language teachers decide on the criteria together during a special meeting devoted to that topic, and afterwards keep to them for the whole study year. Moreover, the teachers are supposed to acquaint their students with the evaluation system, so that the students are aware of what they are expected to do.
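As a final, purely illustrative Python sketch (all scores, thresholds and the ten-point mark scale are invented for the example), criterion-related validity can be estimated by correlating pupils' scores on the new test with their scores on an older, established test of the same ability, and the table of points and relevant marks mentioned above can be expressed as a simple threshold lookup.

# A minimal sketch of criterion-related validity and of a points-to-marks
# table. All scores, thresholds and marks are invented for illustration.
from math import sqrt

def pearson(x, y):
    # Pearson correlation between two equal-length score lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sqrt(sum((a - mx) ** 2 for a in x)) *
                  sqrt(sum((b - my) ** 2 for b in y)))

# Scores of the same eight pupils on the new test and on an established test.
new_test         = [34, 41, 28, 45, 38, 30, 43, 36]
established_test = [36, 40, 30, 44, 37, 32, 42, 35]
print("criterion-related validity:", round(pearson(new_test, established_test), 2))

# The points-to-marks table: total points are converted into the mark
# whose threshold they first reach (an invented ten-point scale).
MARK_TABLE = ((45, 10), (40, 9), (35, 8), (30, 7), (25, 6), (20, 5))

def points_to_mark(points):
    for threshold, mark in MARK_TABLE:
        if points >= threshold:
            return mark
    return 4  # below every threshold: the lowest, failing mark (illustrative)

print([points_to_mark(p) for p in new_test])

A strong correlation with the established test would support the new test's criterion-related validity; a table of this kind, agreed on by the teachers at the start of the year, keeps the conversion of points into marks uniform across classes.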
