NOTE 4 - Measurement and Reliability
• Accuracy: Measurements are only as accurate as the care of the observer and the precision of the instruments permit.
• Objectivity: Objectivity means interpersonal agreement. Where many persons reach
agreement as to observations and conclusions, the descriptions of nature are more likely to
be free from biases of particular individuals.
• Communication: Communication is the sharing of research findings from one person to another.
LEVELS OF MEASUREMENT
Level of measurement refers to the relationship among the values that are assigned to the
attributes for a variable. It is important because
o First, knowing the level of measurement helps you decide how to interpret the data from
that variable. When you know that a measure is nominal, then you know that the numerical
values are just short codes for the longer names.
o Second, knowing the level of measurement helps you decide what statistical analysis is
appropriate on the values that were assigned. If a measure is nominal, then you know that
you would never average the data values or do a t-test on the data.
S. S. Stevens (1946) clearly delineated four distinct levels of measurement. The levels are nominal, ordinal, interval, and ratio.
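To make these points concrete, here is a minimal sketch (with made-up values, using Python's standard statistics module) of the kind of summary that is defensible at each level: counts or the mode for nominal codes, the median for ordinal ratings, and the mean for interval or ratio measurements.

import statistics

# Nominal: the numbers are only labels, so we report counts or the mode, never a mean.
eye_color_codes = [1, 2, 2, 3, 1, 2]        # 1 = brown, 2 = blue, 3 = green (hypothetical codes)
print("mode:", statistics.mode(eye_color_codes))

# Ordinal: the order is meaningful, so the median is a sensible summary.
satisfaction = [1, 2, 2, 3, 4, 5]           # 1 = very dissatisfied ... 5 = very satisfied
print("median:", statistics.median(satisfaction))

# Interval/ratio: equal units allow means and standard deviations.
temperatures = [18.5, 21.0, 19.2, 22.8]     # hypothetical temperature readings
print("mean:", statistics.mean(temperatures))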
NOMINAL
The nominal scale (also called dummy coding) simply places people, events, perceptions, etc.
into categories based on some common trait. Some data are naturally suited to the nominal
scale such as males vs. females, white vs. black vs. blue, and American vs. Asian.
The nominal scale is the lowest form of measurement because it captures no information about the focal object other than whether the object belongs or does not belong to a category: either you are a smoker or a non-smoker; you attended university or you did not; a subject has some experience with computers, an average amount of experience, or extensive experience.
Coding of nominal scale data can be accomplished using numbers, letters, labels, or any symbol
that represents a category into which an object can either belong or not belong. In research
activities a Yes/No scale is nominal.
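As a brief illustration of such coding, the sketch below (which assumes the pandas library is available and uses an invented smoker variable) turns a nominal category into 0/1 indicator columns; the numbers only mark membership in a category and carry no order or magnitude.

import pandas as pd

# Hypothetical Yes/No nominal variable for five respondents.
respondents = pd.DataFrame({"smoker": ["yes", "no", "no", "yes", "no"]})

# get_dummies creates one 0/1 indicator column per category (dummy coding).
coded = pd.get_dummies(respondents["smoker"], prefix="smoker")
print(coded)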
ORDINAL
An ordinal level of measurement uses symbols to classify observations into categories that are not only mutually exclusive and exhaustive but also have some explicit relationship among them. For example, observations may be classified into categories such as taller and shorter, greater and lesser, faster and slower, harder and easier, and so forth.
However, each observation must still fall into one of the categories (the categories are
exhaustive) but no more than one (the categories are mutually exclusive). Most of the
commonly used questions which ask about job satisfaction use the ordinal level of
measurement. For example, asking whether one is very satisfied, satisfied, neutral, dissatisfied,
or very dissatisfied with one’s job is using an ordinal scale of measurement.
The simplest ordinal scale is a ranking. When a market researcher asks you to rank 5 types of
tea from most flavorful to least flavorful, s/he is asking you to create an ordinal scale of
preference. There is no objective distance between any two points on your subjective scale. For
you, the top tea may be far superior to your second choice, but for another respondent who ranks the same two teas first and second, the subjective distance may be small.
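The short sketch below mirrors this tea-ranking example with two invented respondents; a rank-based statistic such as Spearman's rank correlation (computed here with SciPy, an assumed dependency) respects the order of the ranks without treating the distances between them as meaningful.

from scipy.stats import spearmanr

# Hypothetical rankings of five teas (1 = most flavorful, 5 = least flavorful).
teas = ["green", "black", "oolong", "white", "herbal"]
respondent_a = {"green": 1, "black": 2, "oolong": 3, "white": 4, "herbal": 5}
respondent_b = {"green": 1, "black": 2, "oolong": 4, "white": 3, "herbal": 5}

ranks_a = [respondent_a[t] for t in teas]
ranks_b = [respondent_b[t] for t in teas]

# Spearman's rho uses only the order of the ranks, not the distances between them.
rho, _ = spearmanr(ranks_a, ranks_b)
print(f"agreement between the two rankings (Spearman's rho): {rho:.2f}")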
INTERVAL
An interval level of measurement classifies observations into categories that are mutually exclusive and exhaustive and have some explicit relationship among them; in addition, the relationship between the categories is known and exact.
This is the first quantitative application of numbers. In the interval level, a common and
constant unit of measurement has been established between the categories. For example, the
commonly used measures of temperature are interval level scales.
We know that a temperature of 75 degrees is one degree warmer than a temperature of 74
degrees, just as a temperature of 42 degrees is one degree warmer than a temperature of 41
degrees.
Numbers may be assigned to the observations because the relationship between the categories
is assumed to be the same as the relationship between numbers in the number system. For
example, 74+1= 75 and 41+1= 42.
The intervals between categories are equal, but they originate from some arbitrary origin, that
is, there is no meaningful zero point on an interval scale.
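A short worked check with invented temperatures shows why the arbitrary zero matters: converting between Celsius and Fahrenheit preserves differences (up to the change of unit), but it changes the apparent "ratio" of two temperatures, so ratios on an interval scale are not meaningful.

def c_to_f(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

t1_c, t2_c = 40.0, 20.0
print(t1_c - t2_c, c_to_f(t1_c) - c_to_f(t2_c))  # 20.0 vs 36.0: the same interval expressed in two units
print(t1_c / t2_c, c_to_f(t1_c) / c_to_f(t2_c))  # 2.0 vs ~1.53: the "ratio" depends on the arbitrary zero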
RATIO
The ratio level of measurement is the same as the interval level, with the addition of an absolute
zero point. There is a meaningful and non-arbitrary zero point from which the equal intervals
between categories originate. For example, weight, area, speed, and velocity are measured on
a ratio level scale. In public policy and administration, budgets and the number of program
participants are measured on ratio scales.
A ratio scale is the top level of measurement and is not often available in social research. The
factor which clearly defines a ratio scale is that it has a true zero point.
The simplest example of a ratio scale is the measurement of length (disregarding any philosophical points about defining how we can identify zero length). Ratio scale data can be analyzed with the same statistical procedures as interval data.
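By contrast, the following small check with invented weights shows that a ratio-scale statement such as "twice as heavy" survives a change of units, because the zero point is real rather than arbitrary.

# Hypothetical weights in kilograms.
weight_a_kg, weight_b_kg = 80.0, 40.0
KG_TO_LB = 2.20462  # unit conversion factor

print(weight_a_kg / weight_b_kg)                            # 2.0 in kilograms
print((weight_a_kg * KG_TO_LB) / (weight_b_kg * KG_TO_LB))  # still 2.0 in pounds: the ratio is unit-free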
RELIABILITY
Reliability refers to the consistency or repeatability of an operationalized measure. A reliable
measure will yield the same results over and over again when applied to the same thing. It is
the degree to which a test consistently measures whatever it measures.
If you have a survey question that can be interpreted several different ways, it is going to be
unreliable. One person may interpret it one way and another may interpret it another way. You
do not know which interpretation people are taking. Even answers to questions that are clear
may be unreliable, depending on how they are interpreted.
Reliability refers to the consistency of scores obtained by the same persons when they are re-
examined with the same tests on different occasions, or with different sets of equivalent items,
or under other variable examining conditions.
Research requires dependable measurement. Measurements are reliable to the extent that they
are repeatable and that any random influence which tends to make measurements different from
occasion to occasion or circumstance to circumstance is a source of measurement error.
Errors of measurement that affect reliability are random errors and errors of measurement that
affect validity are systematic or constant errors. Reliability of any research is the degree to
which it gives an accurate score across a range of measurement. It can thus be viewed as being
‘repeatability’ or ‘consistency’.
There are a number of ways of determining the reliability of an instrument. The main procedures are the following:
i. Test-Retest Reliability
The most obvious method for finding the reliability of test scores is by repeating the identical
test on a second occasion. Test-retest reliability is a measure of reliability obtained by
administering the same test twice over a period of time to a group of individuals.
The scores from ‘Time 1’ and ‘Time 2’ can then be correlated in order to evaluate the test for
stability over time. The reliability coefficient in this case is simply the correlation between the
scores obtained by the same persons on the two administrations of the test.
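A minimal sketch of this calculation, assuming SciPy is available and using small invented score sets, is shown below: the reliability coefficient is simply the Pearson correlation between the two administrations.

from scipy.stats import pearsonr

# Hypothetical scores of the same eight people on two administrations of the same test.
time1 = [12, 18, 25, 30, 22, 15, 28, 20]
time2 = [14, 17, 27, 29, 21, 16, 30, 19]

r, _ = pearsonr(time1, time2)
print(f"test-retest reliability (Pearson r): {r:.3f}")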
ii. Split-Half Reliability
Split-half reliability is estimated by dividing a single test into two halves, scoring each half separately for every person, and correlating the two sets of scores. The most commonly used method of splitting the test in two is the odd-even strategy. The split-half reliability estimate is simply the correlation between these two total scores.
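The sketch below illustrates the odd-even procedure on an invented matrix of item scores (rows are persons, columns are items); the Spearman-Brown step-up at the end is a common addition for estimating full-length reliability and is not part of the description above.

import numpy as np

# Hypothetical item scores: 5 persons x 6 items.
scores = np.array([
    [4, 5, 3, 4, 5, 4],
    [2, 1, 2, 3, 1, 2],
    [5, 4, 5, 5, 4, 5],
    [3, 3, 2, 3, 3, 2],
    [4, 4, 4, 3, 4, 4],
])

odd_total = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5
even_total = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6

r_half = np.corrcoef(odd_total, even_total)[0, 1]
r_full = 2 * r_half / (1 + r_half)        # Spearman-Brown correction to full test length
print(f"half-test r = {r_half:.3f}, Spearman-Brown corrected = {r_full:.3f}")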
iii. Inter-Rater Reliability
Inter-rater reliability is a measure of the degree to which different judges or raters agree in their assessments. It might be employed when different judges are evaluating the degree to which art portfolios meet certain standards. Inter-rater reliability is especially useful when judgments can be considered relatively subjective; thus, this type of reliability would more likely be used when evaluating artwork than when evaluating math problems.
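One common index for this situation is Cohen's kappa, which corrects the raw agreement between two raters for agreement expected by chance; the sketch below assumes scikit-learn is installed and uses invented portfolio ratings.

from sklearn.metrics import cohen_kappa_score

# Two judges rating the same eight portfolios as "fail", "pass", or "excellent" (hypothetical data).
judge_1 = ["pass", "fail", "excellent", "pass", "pass", "fail", "excellent", "pass"]
judge_2 = ["pass", "fail", "excellent", "pass", "fail", "fail", "excellent", "pass"]

kappa = cohen_kappa_score(judge_1, judge_2)
print(f"inter-rater agreement (Cohen's kappa): {kappa:.3f}")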
iv. Parallel-Forms Reliability
Parallel forms reliability is a measure of reliability obtained by administering different versions
of an assessment tool (both versions must contain items that probe the same construct, skill,
knowledge base, etc.) to the same group of individuals. The scores from the two versions can
then be correlated in order to evaluate the consistency of results across alternate versions. In
parallel forms reliability you first have to create two parallel forms. One way to accomplish
this is to create a large set of questions that address the same construct and then randomly
divide the questions into two sets. You administer both instruments to the same sample of
people.
The correlation between the two parallel forms is the estimate of reliability. For example, if
you wanted to evaluate the reliability of a critical thinking assessment, you might create a large
set of items that all pertain to critical thinking and then randomly split the questions up into
two sets, which would represent the parallel forms.
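The following sketch imitates that procedure with simulated item scores: a pool of items is randomly split into two forms, each form is totaled for every person, and the two totals are correlated. All numbers are invented for illustration only, and NumPy is an assumed dependency.

import numpy as np

rng = np.random.default_rng(0)
n_people, n_items = 30, 20

# Simulate item scores driven by a common ability component plus noise.
ability = rng.normal(size=(n_people, 1))
items = ability + rng.normal(scale=1.0, size=(n_people, n_items))

# Randomly divide the item pool into two parallel forms and total each form.
item_order = rng.permutation(n_items)
form_a = items[:, item_order[: n_items // 2]].sum(axis=1)
form_b = items[:, item_order[n_items // 2 :]].sum(axis=1)

r = np.corrcoef(form_a, form_b)[0, 1]
print(f"parallel-forms reliability estimate: {r:.3f}")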
v. Internal Consistency Reliability (Cronbach's Alpha)
Internal consistency reliability assesses how well the items of a single instrument measure the same underlying construct; the most widely used index is Cronbach's alpha. Suppose, for example, a questionnaire was designed to measure employees' feeling of safety at work. Each question was a 5-point Likert item from ‘strongly disagree’ to ‘strongly agree’. In order to understand whether the questions in this questionnaire all reliably measure the same latent variable (feeling of safety), so that a Likert scale could be constructed, Cronbach's alpha was run on a sample of 15 workers. The alpha coefficient for the items is .839, suggesting that the items have relatively high internal consistency. Note that a reliability coefficient of .70 or higher is considered ‘acceptable’ in most social science research situations.
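For completeness, here is a minimal sketch of how Cronbach's alpha can be computed from an item-score matrix (rows are respondents, columns are Likert items); the data below are simulated and will not reproduce the .839 reported above.

import numpy as np

def cronbach_alpha(item_scores):
    """alpha = (k / (k - 1)) * (1 - sum of item variances / variance of the total score)."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulate 15 respondents answering 6 Likert items that share one latent trait.
rng = np.random.default_rng(1)
latent = rng.normal(size=(15, 1))
items = np.clip(np.round(3 + latent + rng.normal(scale=0.7, size=(15, 6))), 1, 5)

print(f"Cronbach's alpha: {cronbach_alpha(items):.3f}")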