NOTE 4 - Measurement and Reliability


MEASUREMENT

Measurement is the process of assigning numbers to characteristics, variables, or events
according to scientific rules. It is the process of observing and recording the observations that
are collected as part of a research effort. Measurement means the description of data in terms
of numbers, and it rests on three elements: accuracy, objectivity, and communication. The
combination of these three constitutes actual measurement.

• Accuracy: The accuracy is as great as the care and the instruments of the observer will
permit.
• Objectivity: Objectivity means interpersonal agreement. Where many persons reach
agreement as to observations and conclusions, the descriptions of nature are more likely to
be free from biases of particular individuals.
• Communication: Communication is the sharing of research findings from one person to
another.

LEVELS OF MEASUREMENT
Level of measurement refers to the relationship among the values that are assigned to the
attributes for a variable. It is important because

o First, knowing the level of measurement helps you decide how to interpret the data from
that variable. When you know that a measure is nominal, then you know that the numerical
values are just short codes for the longer names.

o Second, knowing the level of measurement helps you decide what statistical analysis is
appropriate on the values that were assigned. If a measure is nominal, then you know that
you would never average the data values or do a t-test on the data.

S. S. Stevens (1946) delineated four distinct levels of measurement. The levels
are nominal, ordinal, interval, and ratio.

NOMINAL
The nominal scale (also called dummy coding) simply places people, events, perceptions, etc.
into categories based on some common trait. Some data are naturally suited to the nominal
scale such as males vs. females, white vs. black vs. blue, and American vs. Asian.
The nominal scale is the lowest form of measurement because it doesn’t capture information
about the focal object other than whether the object belongs or doesn’t belong to a category:
either you are a smoker or not a smoker; you attended university or you didn’t; a subject has
some experience with computers, an average amount of experience with computers, or
extensive experience with computers.

Coding of nominal scale data can be accomplished using numbers, letters, labels, or any symbol
that represents a category into which an object can either belong or not belong. In research
activities a Yes/No scale is nominal.
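As an illustrative sketch (the survey responses and the 0/1 codes below are hypothetical), nominal coding assigns arbitrary symbols to categories, and only counting operations such as frequencies remain meaningful:

```python
# Hypothetical smoker-survey responses; the 0/1 codes are arbitrary labels.
responses = ["yes", "no", "no", "yes", "no"]
codes = {"yes": 1, "no": 0}
coded = [codes[r] for r in responses]  # [1, 0, 0, 1, 0]

# Meaningful for nominal data: counting category membership
counts = {label: responses.count(label) for label in codes}
print(counts)  # {'yes': 2, 'no': 3}
```

Swapping the codes (yes → 0, no → 1) would change nothing of substance, which is exactly what makes the numbers mere labels rather than quantities.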

ORDINAL
An ordinal level of measurement uses symbols to classify observations into categories that are
not only mutually exclusive and exhaustive but also have some explicit relationship among
them. For example, observations may be classified into categories such as taller and shorter,
greater and lesser, faster and slower, harder and easier, and so forth.

However, each observation must still fall into one of the categories (the categories are
exhaustive) but no more than one (the categories are mutually exclusive). Most of the
commonly used questions which ask about job satisfaction use the ordinal level of
measurement. For example, asking whether one is very satisfied, satisfied, neutral, dissatisfied,
or very dissatisfied with one’s job is using an ordinal scale of measurement.

The simplest ordinal scale is a ranking. When a market researcher asks you to rank 5 types of
tea from most flavorful to least flavorful, s/he is asking you to create an ordinal scale of
preference. There is no objective distance between any two points on your subjective scale. For
you the top tea may be far superior to the second preferred tea but, to another respondent with
the same top and second tea, the distance may be subjectively small.
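A brief sketch with hypothetical satisfaction ratings shows what is and is not legitimate at the ordinal level: the median, a position-based statistic, is meaningful, while averaging would treat the unequal gaps between categories as if they were equal:

```python
# Hypothetical job-satisfaction ratings coded 1 (very dissatisfied)
# through 5 (very satisfied): order matters, distances do not.
ratings = [4, 2, 5, 3, 4, 1, 4]

# Meaningful: comparisons and position-based statistics such as the
# median (the middle value of the sorted, odd-length list).
median = sorted(ratings)[len(ratings) // 2]
print(median)  # 4

# NOT meaningful: the arithmetic mean, which pretends the gap between
# "dissatisfied" and "neutral" equals the gap between "satisfied" and
# "very satisfied".
```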

INTERVAL
An interval level of measurement classifies observations into categories that are not only
mutually exclusive and exhaustive, and have some explicit relationship among them, but the
relationship between the categories is known and exact.

This is the first quantitative application of numbers. In the interval level, a common and
constant unit of measurement has been established between the categories. For example, the
commonly used measures of temperature are interval level scales.
We know that a temperature of 75 degrees is one degree warmer than a temperature of 74
degrees, just as a temperature of 42 degrees is one degree warmer than a temperature of 41
degrees.

Numbers may be assigned to the observations because the relationship between the categories
is assumed to be the same as the relationship between numbers in the number system. For
example, 74+1= 75 and 41+1= 42.

The intervals between categories are equal, but they originate from some arbitrary origin, that
is, there is no meaningful zero point on an interval scale.
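The temperature example can be made concrete: converting between Fahrenheit and Celsius (a change of the arbitrary zero point) preserves differences but not ratios, which is why "twice as warm" is meaningless on an interval scale:

```python
# Interval scales have an arbitrary zero, so differences are meaningful
# but ratios are not. Changing units shifts the zero point.
def f_to_c(f):
    return (f - 32) * 5 / 9

# Differences survive the unit change: one Fahrenheit degree is the
# same-sized step everywhere on the scale.
d1 = f_to_c(75) - f_to_c(74)
d2 = f_to_c(42) - f_to_c(41)
assert abs(d1 - d2) < 1e-9

# Ratios do not: 80°F / 40°F = 2, but the same two temperatures in
# Celsius give a ratio of about 6, so "twice as warm" has no meaning.
print(80 / 40, f_to_c(80) / f_to_c(40))
```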

RATIO
The ratio level of measurement is the same as the interval level, with the addition of an absolute
zero point. There is a meaningful and non-arbitrary zero point from which the equal intervals
between categories originate. For example, weight, area, speed, and velocity are measured on
a ratio level scale. In public policy and administration, budgets and the number of program
participants are measured on ratio scales.

A ratio scale is the top level of measurement and is not often available in social research. The
factor which clearly defines a ratio scale is that it has a true zero point.

The simplest example of a ratio scale is the measurement of length (disregarding any
philosophical points about defining how we can identify zero length). Ratio scale data can be
analysed with the same statistical techniques as interval data.
RELIABILITY
Reliability refers to the consistency or repeatability of an operationalized measure. A reliable
measure will yield the same results over and over again when applied to the same thing. It is
the degree to which a test consistently measures whatever it measures.

If you have a survey question that can be interpreted several different ways, it is going to be
unreliable. One person may interpret it one way and another may interpret it another way. You
do not know which interpretation people are taking. Even answers to questions that are clear
may be unreliable, depending on how they are interpreted.

Reliability refers to the consistency of scores obtained by the same persons when they are re-
examined with the same tests on different occasions, or with different sets of equivalent items,
or under other variable examining conditions.

Research requires dependable measurement. Measurements are reliable to the extent that they
are repeatable and that any random influence which tends to make measurements different from
occasion to occasion or circumstance to circumstance is a source of measurement error.

Errors of measurement that affect reliability are random errors and errors of measurement that
affect validity are systematic or constant errors. Reliability of any research is the degree to
which it gives an accurate score across a range of measurement. It can thus be viewed as being
‘repeatability’ or ‘consistency’.

There are a number of ways of determining the reliability of an instrument. The procedure can
be classified into two groups:

o External Consistency Procedures


External consistency procedures compare findings from two independent processes of data
collection as a means of verifying the reliability of the measure. Examples include test-retest
reliability and parallel forms of the same test.

o Internal Consistency Procedures


The idea behind this procedure is that items measuring the same phenomenon should produce
similar results. For example, split-half technique.
TYPES OF RELIABILITY
All types of reliability are concerned with the degree of consistency or agreement between two
independently derived sets of scores. There are various types of reliability -

i. Test-Retest Reliability
The most obvious method for finding the reliability of test scores is by repeating the identical
test on a second occasion. Test-retest reliability is a measure of reliability obtained by
administering the same test twice over a period of time to a group of individuals.

The scores from ‘Time 1’ and ‘Time 2’ can then be correlated in order to evaluate the test for
stability over time. The reliability coefficient in this case is simply the correlation between the
scores obtained by the same persons on the two administrations of the test.
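A minimal sketch of a test-retest coefficient; the scores below are hypothetical, and Pearson's r is implemented directly from its definition rather than taken from a statistics library:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores for five people on two administrations of the
# same test, some weeks apart.
time1 = [85, 72, 90, 60, 78]
time2 = [82, 75, 91, 58, 80]

r_tt = pearson(time1, time2)  # close to 1 → scores are stable over time
print(round(r_tt, 3))
```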

ii. Split-Half Reliability


Split-half reliability is a subtype of internal consistency reliability. In split-half reliability we
randomly divide all items that purport to measure the same construct into two sets. We
administer the entire instrument to a sample of people and calculate the total score for each
randomly divided half.

The most commonly used method of splitting the test in two is the odd-even strategy. The
split-half reliability estimate is simply the correlation between these two total scores.
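A sketch of the odd-even split on hypothetical item-level data. Because each half contains only half the items, the half-test correlation understates the reliability of the full instrument; it is conventionally stepped up with the Spearman-Brown formula (an addition here, not described in the text above):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical responses: each row is one respondent's scores on six
# items that all purport to measure the same construct.
items = [
    [4, 5, 4, 4, 5, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [5, 4, 5, 5, 5, 4],
    [1, 2, 1, 2, 1, 1],
]

# Odd-even split: total on items 1, 3, 5 vs total on items 2, 4, 6
odd_totals = [sum(row[0::2]) for row in items]
even_totals = [sum(row[1::2]) for row in items]

r_half = pearson(odd_totals, even_totals)
# Spearman-Brown correction: estimates the reliability of the
# full-length test from the correlation between its two halves.
r_full = 2 * r_half / (1 + r_half)
print(round(r_half, 3), round(r_full, 3))
```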

iii. Inter-Rater Reliability


Inter-rater reliability is a measure of reliability used to assess the degree to which different
judges or raters agree in their assessment decisions. Inter-rater reliability is also known as inter-
observer reliability or inter-coder reliability. Inter-rater reliability is useful because human
observers will not necessarily interpret answers the same way; raters may disagree as to how
well certain responses or material demonstrate knowledge of the construct or skill being
assessed.

Inter-rater reliability might be employed when different judges are evaluating the degree to
which art portfolios meet certain standards. Inter-rater reliability is especially useful when
judgments can be considered relatively subjective. Thus, the use of this type of reliability would
probably be more likely when evaluating artwork as opposed to math problems.
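Agreement between two raters can be quantified as raw percent agreement or, more defensibly, as Cohen's kappa, which discounts the agreement expected by chance alone. A sketch with hypothetical pass/fail judgments:

```python
# Hypothetical pass/fail judgments by two raters on eight portfolios.
rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
n = len(rater_a)

# Raw percent agreement: how often the raters give the same judgment
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Agreement expected by chance, from each rater's marginal proportions
p_chance = sum(
    (rater_a.count(label) / n) * (rater_b.count(label) / n)
    for label in ("pass", "fail")
)

# Cohen's kappa: observed agreement corrected for chance agreement
kappa = (p_observed - p_chance) / (1 - p_chance)
print(p_observed, round(kappa, 3))
```

Kappa is noticeably lower than raw agreement here, because two raters who both say "pass" most of the time will agree often purely by chance.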
iv. Parallel-Forms Reliability
Parallel forms reliability is a measure of reliability obtained by administering different versions
of an assessment tool (both versions must contain items that probe the same construct, skill,
knowledge base, etc.) to the same group of individuals. The scores from the two versions can
then be correlated in order to evaluate the consistency of results across alternate versions. In
parallel forms reliability you first have to create two parallel forms. One way to accomplish
this is to create a large set of questions that address the same construct and then randomly
divide the questions into two sets. You administer both instruments to the same sample of
people.

The correlation between the two parallel forms is the estimate of reliability. For example, if
you wanted to evaluate the reliability of a critical thinking assessment, you might create a large
set of items that all pertain to critical thinking and then randomly split the questions up into
two sets, which would represent the parallel forms.

v. Cronbach’s Alpha (α) Reliability


Cronbach’s alpha is the most common measure of internal consistency (reliability). It is most
commonly used when you have multiple Likert questions in a survey/questionnaire that form
a scale and you wish to determine if the scale is reliable. For example, a researcher has devised
a nine-question questionnaire to measure how safe people feel at work at an industrial complex.

Each question was a 5-point Likert item from ‘strongly disagree’ to ‘strongly agree’. In order
to understand whether the questions in this questionnaire all reliably measure the same latent
variable (feeling of safety) [so a Likert scale could be constructed], a Cronbach’s alpha was
run on a sample size of 15 workers. The alpha coefficient for the items is .839, suggesting that
the items have relatively high internal consistency. Note that a reliability coefficient of .70 or
higher is considered ‘acceptable’ in most social science research situations.
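Cronbach's alpha can be computed directly from its definition, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores). The Likert responses below are hypothetical:

```python
# Hypothetical 5-point Likert responses: rows are respondents, columns
# are the items of the scale.
responses = [
    [4, 4, 5, 4],
    [2, 3, 2, 2],
    [5, 4, 4, 5],
    [3, 3, 3, 2],
    [1, 2, 1, 1],
]

def variance(xs):
    """Sample variance (n - 1 denominator)."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

k = len(responses[0])                              # number of items
item_vars = [variance(col) for col in zip(*responses)]
total_var = variance([sum(row) for row in responses])

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))  # high value → items hang together well
```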

vi. Alternative-form Reliability


One way of avoiding the difficulties encountered in test-retest reliability is through the use of
alternate forms of the test. The same persons can thus be tested with one form on the first
occasion and with another, equivalent form on the second. The correlation between the scores
obtained on the two forms represents the reliability coefficient of the test. It will be noted that
such a reliability coefficient is a measure of both (a) temporal stability and (b) consistency
of response to different item samples.
