Psychological Assessment: Historical Perspective
Historical Perspective
Early Antecedents
Antiquity to the 19th Century
Evidence suggests that the Chinese had a
relatively sophisticated civil service testing
program more than 4,000 years ago, as
early as 2200 B.C. Every third year in China,
oral examinations were given to help
determine work evaluations and promotion
decisions.
By the Han Dynasty (206 B.C.E. to 220 C.E.), the use of test
batteries (two or more tests used in conjunction) was quite
common. These early tests related to such diverse topics as
civil law, military affairs, agriculture, revenue, and
geography. Tests had become quite well developed by the
Ming Dynasty (1368–1644 C.E.).
Reports by British missionaries and diplomats encouraged
the English East India Company in 1832 to copy the Chinese
system as a method of selecting employees for overseas
duty. Because testing programs worked well for the
company, the British government adopted a similar system
of testing for its civil service in 1855.
The French and German governments followed suit. In 1883,
the U.S. government established the American Civil Service
Commission, which developed and administered
competitive examinations for certain government jobs.
Ancient Greco-Roman writings are indicative of attempts to
categorize people in terms of personality types (i.e., by reference
to an abundance or deficiency of some bodily fluid, such as
blood or phlegm).
Charles Darwin and Individual
Differences
To develop a measuring device, we must
understand what we want to measure.
An important step toward understanding
individual differences came with the publication
of Charles Darwin’s highly influential book, The
Origin of Species, in 1859.
Darwin spurred interest in individual differences.
According to him, individual differences are of the
highest importance, for they afford materials
for natural selection to act on.
Sir Francis Galton, a relative of Darwin's, soon began applying
Darwin's theories to the study of human beings.
Given the concepts of survival of the fittest and individual
differences, Galton set out to show that some people possessed
characteristics that made them more fit than others, a theory he
articulated in his book Hereditary Genius, published in 1869. He
aspired to classify people "according to their natural gifts" and to
ascertain their "deviation from the average."
Galton realized the need for measuring the characteristics of related and
unrelated persons and focused on INDIVIDUAL DIFFERENCES.
Galton was instrumental in inducing a number of educational
institutions to keep systematic ANTHROPOMETRIC RECORDS of
their students. In 1884, Galton set up an anthropometric laboratory at
the International Exposition, where visitors could be measured on
variables such as height (standing and sitting), arm span,
weight, breathing capacity, keenness of vision and hearing,
strength of pull, strength of squeeze, swiftness of blow, memory of
form, discrimination of color, hand steadiness, reaction time, and
other simple sensorimotor functions. For these efforts, Galton
is credited with being primarily responsible for launching
the testing movement. He pioneered the application of rating-scale
and questionnaire methods (including self-report
inventories). He is also responsible for the development of
statistical methods for the analysis of data on individual differences (e.g.,
the coefficient of correlation).
Psychologist James McKeen Cattell coined
the term "mental test." Cattell's
doctoral dissertation was based on Galton's
work on individual differences in reaction
time. As such, Cattell perpetuated and
stimulated the forces that ultimately led to
the development of modern tests. He
became active in the spread of the testing
movement and was the first to use the term
"MENTAL TEST." He was also instrumental in
founding the Psychological Corporation.
1838: Esquirol
French physician whose two-volume work made the first explicit
distinction between mentally retarded and insane individuals.
More than 100 pages of his work were devoted to "mental retardation."
Esquirol pointed out that there are many degrees of mental retardation.
The individual's use of language provides the most dependable
criterion of his intellectual level.
Seguin
Another French physician, Seguin pioneered the training of mentally
retarded persons.
1837: established the first school devoted to the education of mentally
retarded children.
1848: migrated to the USA, where he made suggestions regarding the
training of mentally retarded persons.
Some of the procedures developed by Seguin were eventually
incorporated into performance or nonverbal tests of intelligence.
(Darwin had argued that chance variation in species would be selected
or rejected by nature according to adaptability and survival value.)
Experimental Psychology and
Psychophysical Measurement
J. E. Herbart developed mathematical models of
the mind, which he eventually used as the basis
for educational theories that strongly influenced
19th-century educational practices. Following
Herbart, E. H. Weber attempted to demonstrate
the existence of a psychological threshold, the
minimum stimulus necessary to activate a
sensory system. Then, following Weber, G. T.
Fechner devised the law that the strength of
a sensation grows as the logarithm of the
stimulus intensity.
Wilhelm Wundt, who set up a laboratory at the University of
Leipzig in 1879, is credited with founding the science of
psychology
Wundt was succeeded by E. B. Titchener, whose student, G.
Whipple, recruited L. L. Thurstone. Whipple provided the basis
for immense changes in the field of testing by conducting a
seminar at the Carnegie Institute in 1919 attended by
Thurstone, E. Strong, and other early prominent U.S.
psychologists. From this seminar came the Carnegie Interest
Inventory and, later, the Strong Vocational Interest Blank.
Tests were also needed for classifying and identifying the mentally
and emotionally handicapped. One of the earliest tests resembling
current procedures, the Seguin Form Board Test, was developed in an
effort to educate and evaluate the mentally disabled. Similarly,
Kraepelin devised a series of examinations for evaluating
emotionally impaired people.
The French minister of public instruction appointed a
commission to study ways of identifying intellectually
subnormal individuals in order to provide them with
appropriate educational experiences. One member of that
commission was Alfred Binet. Working in conjunction with the
French physician T. Simon, Binet developed the first major
general intelligence test.
The Evolution of Intelligence
and Standardized Achievement
Tests
The first such test, the Binet-Simon Scale, was published in 1905. This
instrument contained 30 items of increasing difficulty and
was designed to identify intellectually subnormal
individuals.
A representative sample is one that comprises individuals
similar to those for whom the test is to be used. When the
test is used for the general population, a representative
sample must reflect all segments of the
population in proportion to their actual numbers.
The 1908 Binet-Simon Scale also determined a child's
mental age, thereby introducing a historically significant
concept. By 1916, L. M. Terman of Stanford University had
revised the Binet test for use in the United States.
Terman's revision became known as the Stanford-Binet
Intelligence Scale.
World War I
During World War I, the army requested the
assistance of Robert Yerkes, who was then
the president of the American Psychological
Association. Yerkes headed a committee of
distinguished psychologists who soon
developed two structured group tests of
human abilities: the Army Alpha and the
Army Beta. The Army Alpha required reading
ability, whereas the Army Beta measured
the intelligence of illiterate
adults.
Achievement Tests
Standardized achievement tests provide
multiple-choice questions that are
standardized on a large sample to produce
norms against which the results of new
examinees can be compared.
Standards of evaluation
Psychology, tests, and public policy
Tests and Group Membership
Legal and Ethical Considerations
Laws are rules that individuals must obey for the good of the
society as a whole—or rules thought to be for the good of society
as a whole.
1. Directive
2. Nondirective
Social facilitation: we tend to act like the models around us (see Augustine, 2011). If
the interviewer is tense, anxious, defensive, and aloof, then the interviewee tends to
respond in kind. Thus, if the interviewer wishes to create conditions of openness,
warmth, acceptance, comfort, calmness, and support, then he or she must exhibit
these qualities.
Principles of Effective Interviewing
The Proper Attitudes
The session received a good evaluation by both participants when the patient
saw the interviewer as warm, open, concerned, involved, committed, and
interested, regardless of subject matter or the type or severity of the problem.
On the other hand, independent of all other factors, when the interviewer was
seen as cold, defensive, uninterested, uninvolved, aloof, and bored, the session
was rated poorly. To appear effective and establish rapport, the interviewer
must display the proper attitudes
Good interviewing is actually more a matter of attitude than skill (Duan &
Kivlighan, 2002). Experiments in social psychology have shown that
interpersonal influence (the degree to which one person can influence another)
is related to interpersonal attraction (the degree to which people share a
feeling of understanding, mutual respect, similarity, and the like) (Dillard &
Marshall, 2003). Attitudes related to good interviewing skills include warmth,
genuineness, acceptance, understanding,
openness, honesty, and fairness.
Responses to Avoid
As a rule, however, making interviewees feel uncomfortable tends to place them on
guard, and guarded or anxious interviewees tend to reveal little information about
themselves. If the goal is to elicit as much information as possible or to receive a
good rating from the interviewee, then interviewers should avoid certain responses,
including judgmental or evaluative statements, probing statements, hostility, and
false reassurance.
Each level in this system represents a degree of empathy. The levels range
from a response that bears little or no relationship to the previous statement
to a response that captures the precise meaning and feeling of the
statement.
Level-One Responses
Level-one responses bear little or no relationship to the interviewee’s response
Level-Two Responses
The level-two response communicates a superficial awareness of the meaning of
a statement. The individual who makes a level-two response never quite goes
beyond his or her own limited perspective. Level-two responses impede the flow
of communication.
Level-Three Responses
A level-three response is interchangeable with the interviewee’s statement.
According to Carkhuff and Berenson (1967), level three is the minimum level of
responding that can help the interviewee. Paraphrasing, verbatim playback,
clarification statements, and restatements are all examples of level-three
responses.
Level-Four and Level-Five Responses
Level-four and level-five responses not only provide accurate empathy but also go
beyond the statement given. In a level-four response, the interviewer adds
“noticeably” to the interviewee’s response
Active Listening
An impressive array of research has accumulated to document the power of the
understanding response. This type of responding, sometimes called active
listening, is the foundation of good interviewing
skills for many different types of interviews
Types of Interviews
Evaluation Interview
A confrontation is a statement that points out a discrepancy or inconsistency. Though
confrontation is usually most appropriate in therapeutic interviews, all experienced
interviewers should have this technique at their disposal
All contacts with clients ultimately need to be documented. However, there is some debate
over whether notes should be taken during an interview.
A moderate amount of note-taking seems worthwhile.
3. Rapport
Rapport: a word often used to characterize the relationship between patient and clinician.
In the context of the clinical interview, building good rapport involves establishing a
comfortable atmosphere and sharing an understanding of the purpose of the interview.
4. Communication
Maloney and Ward (1976) observed that the clinician’s questions may become
progressively more structured as the interview proceeds
Silence: silences can mean many things. The important point is to assess the meaning and
function of silence in the context of the specific interview. The clinician's response to
silence should be reasoned and responsive to the goals of the interview rather than to
personal needs or insecurities. (Silence may indicate some resistance, or that the client
is organizing his or her thoughts.)
Gratification of Self
The clinical interview is not the time or the place for clinicians to work out their own
problems. Clinicians must resist the temptation to shift the focus to themselves. Rather,
their focus must remain on the patient.
The Impact of the Clinician
Each of us has a characteristic impact on others, both socially and professionally. As a
result, the same behavior in different clinicians is unlikely to provoke the same response
from a patient
The Clinician's Values and Background
Clinicians must examine their own experiences and seek the bases for their own
assumptions before making clinical judgments of others. What to the clinician may appear
to be evidence of severe pathology may actually reflect the patient's culture (or gender
sensitivity).
The Patient's Frame of Reference
How the patient views the first meeting (or the whole process) is important; it may
affect the interview.
A patient may have an entirely distorted notion of the clinic and even be ashamed of
having to seek help. For many individuals, going to see a clinical psychologist arouses
feelings of inadequacy.
there are patients who start with a view of the clinician as a kind of savior
There are various types of psychometric tests, but most are objective tests
designed to measure educational achievement, knowledge, attitudes, or
personality traits. In addition to the tests themselves, there is another part of
psychometrics that deals with statistical research on the measurements that
psychometric tests are attempting to obtain.
Reliability means that the psychometrist will get roughly the same result
from the same person each time the test is administered. In other words,
does the test reliably do what it is designed to do?
Psychometrists and test writers worry a great deal about reliability and
validity, which is why new psychometric tests undergo rigorous trials
and norming periods before they go on the market. Norming refers to a
means of 'testing' the test and developing baseline scores before it's used
to test the general population.
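As a rough illustration of the reliability idea described above, test-retest reliability is often estimated as the correlation between two administrations of the same test to the same people. This sketch uses invented scores (not from the text) and a hand-rolled Pearson correlation:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    num = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    den = math.sqrt(sum(dx * dx for dx in dev_x) * sum(dy * dy for dy in dev_y))
    return num / den

# Hypothetical scores for five people tested twice, two weeks apart
time1 = [10, 12, 9, 15, 11]
time2 = [11, 13, 9, 14, 12]

r = pearson_r(time1, time2)
print(round(r, 2))  # 0.93 -- a value near 1.0 suggests high test-retest reliability
```

A correlation near zero would mean scores bounce around unpredictably between administrations, which is exactly the unreliability test developers try to rule out during norming trials.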
Types of Assessments Used in Psychology
Psychological Assessment
Projective Tests
One great thing about projective tests is that they can get at unconscious
aspects of a person. Patterns in thoughts and feelings can sometimes be
difficult to see when you are living with them, but through projective
tests, those patterns begin to emerge. For example, if you tend to see
death and destruction when you look at inkblots or the images in the TAT,
you may have an issue with depression. You may not even realize that it's
a part of the way you see the world day in and day out; it's just a part of
you. But the projective tests can get at that part of you in a way other
tests can't.
Inventory-Type Tests
To try to make a test that is more standardized and less subjective,
inventories, or inventory-type tests, were developed. These include
surveys that try to measure a person's characteristics or attitudes. They
might include things like true-false questions or questions that ask you to
rate an idea on a scale of one to five.
But inventories have their own problems. One common issue with
inventories is that they depend on a person answering questions about who
they are. Why is this an issue? Some people might lie, either intentionally or
not. For example, a person might not want to mark that family is not very
important to them because they might feel that they would then be judged
as a bad person.
So, while projective tests don't generally have an issue with people lying
about who they are, they are more subjective than inventories. And while
inventories are more objective and standardized, they do face the
possibility that the person taking them won't be completely honest, which
can throw off the results.
Aptitude Tests
So far, we've talked about two types of tests that try to answer the
question of what a person's personality is like. But there are other
questions in psychology, too, including the question, 'What are you good
at?'
Aptitude tests try to measure what you are capable of doing. Tests like the
SAT and ACT are aptitude tests: they want to see if you are capable of
handling college-level work. Aptitude tests often cover general skills, like
problem-solving, critical thinking and perceptual speed, which is just a fancy
way of saying how quickly you react to something. They can also sometimes
cover basic subject-area concepts, like math or English.
There are generally two types of aptitude tests:
1. Speed tests
These types of tests have easier, simpler questions, but they
have many more of them. They want to see how many questions you can
answer correctly in the allotted time. The instructions for a speed aptitude
test will be something like, 'You can have 30 minutes to answer as
many questions correctly as you can.' As the name implies, doing well on
a speed test takes, well, speed - the faster you are able to correctly answer
questions, the better you will do.
2. Power tests
These types of aptitude tests have fewer questions than speed tests, but
the questions are more complex. These tests are concerned with how you
are able to figure out how to answer complex questions. The instructions
for a power test might be something along the lines of, 'Take your time and
try to answer each question correctly.' Power tests might also be timed, but
it's not about finishing each question quickly; it's about figuring out how to
get the correct answer.
Types of Measurement: Direct, Indirect, &
Constructs
Measurement
Imagine that you are a psychologist, and you decide to do a study to see if
people with red hair are more temperamental than those with brown or
blonde hair. But, how will you know if your subjects are temperamental? For
that matter, how will you know if they have red hair?
Direct Observation
But, what about temperament? Can we directly observe that? Well, sort
of. We can watch subjects interact with somebody and see who loses
their temper and who keeps calm. This might give us a clue about what
their temperament is like.
If we ask our subjects to take a survey and check off if they have red hair,
brown hair, or blonde hair, we can observe their checkmarks. We are not
directly observing their hair color but are making assumptions based
on the subjects' own observations.
Of course, there's a problem with indirect observation, too: how do you know
if someone's being honest? What if we ask our subjects to tell us whether
they have a short temper or not? That's seen as a negative trait, so some
people might not want to answer yes. As a result, they might not be
completely honest.
Constructs
Remember how we tried to directly observe temperament? We
watched our subjects' interactions with others to see who lost their
tempers. But, remember that we said we weren't actually observing
temperament; we were making an inference about temperament based
on the behavior we saw.
The human brain likes things to be quick and easy. We want to be able to
find patterns and communicate them to other human beings. So,
someone who acts angry and aggressive on a regular basis is said to be
short-tempered. This allows us to communicate with others. I can warn
my sister that her new boyfriend is 'short-tempered,' and she'll know
what I mean by that.
Reliability
There are many conditions that may impact reliability. They include:
day-to-day changes in the student, such as energy level, motivation,
emotional stress, and even hunger; the physical environment, which
includes classroom temperature, outside noises, and distractions;
administration of the assessment, which includes changes in test
instructions and differences in how the teacher responds to questions
about the test; and subjectivity of the test scorer.
Validity
An assessment can be reliable but not valid. For example, if you weigh
yourself on a scale, the scale should give you an accurate measurement of
your weight. If the scale tells you that you weigh 150 pounds every time you
step on it, it is reliable. However, if you actually weigh 135 pounds, then the
scale is not valid.
Practicality
The fourth quality of a good assessment is practicality. Practicality
refers to the extent to which an assessment or assessment procedure is
easy to administer and score. Things to consider here are:
So, how do you determine which diagnosis, if any, you give your client?
One tool that can help you is a psychological test. These are
instruments used to measure how much of a specific psychological
construct an individual has. Psychological tests are used to assess many
areas, including:
Let's look at an example involving a new client. You might decide that the
best way to narrow down your client's diagnosis is to administer the
Beck Depression Inventory (BDI), PTSD Symptom Scale Interview (PSSI)
and an insomnia questionnaire. You may be able to rule out a diagnosis or
two based on the test results. These assessments may be given to your
client in one visit, since they all take less than 20 minutes on average to
complete.
Types and Examples of Psychological Tests
Attitude tests, such as the Likert Scale or the Thurstone Scale, are used
to measure how an individual feels about a particular event, place,
person or object.
Direct observation tests are measures in which test takers are observed
as they complete specific activities. It is common for this type of test to be
administered to families in their homes, in a clinical setting such as a
laboratory or in a classroom with children. They include:
There are also specific clinical tests that measure specific clinical
constructs, such as anxiety or PTSD. Some examples of specific clinical
tests include:
There are a few kinds of frequently used tests, and each one of them is
focused on evaluating a different aspect of a person's cognitive abilities.
One of the most common is the aptitude test, which is designed to
evaluate a person's ability to learn a skill or subject. That's what aptitude is:
natural skill, talent, or capacity to learn.
An aptitude test, therefore, is not testing what you have learned or how well
you have responded to education or training. It's not even an evaluation of
intelligence or intellectual capacity. It's a measurement of a person's ability
to learn a specific subject.
Many training programs, from music and the arts to vocational and technical
programs, rely heavily on aptitude tests. These tests inform the instructors
whether or not a student is likely to succeed based on their personality,
talent, and potential for growth.
Many schools also use aptitude tests to determine how well a student is
likely to perform, which can be very useful in determining the best
educational style for them. Parts of exams like the SAT and GRE are
actually aptitude-based, meant to determine whether a candidate has the
skill and capacity to be successful at the next level of education.
Since aptitude tests are based on natural talent, skill, and ability to learn, you
don't study for them in the way you would other tests. You're not being
tested on knowledge or information you've been taught, but instead on your
capacity for learning new information. This being said, there are ways to
prepare for an aptitude test.
Achievement
Standardized tests like the SAT II (Subject Tests) are achievement tests,
meant to determine whether students achieved a certain level of
education and mastery of subjects before moving on to college.
In general, short yet frequent study sessions over a length of time (as
opposed to cramming it all at once the night before) result in higher levels of
retention and understanding.
Performance Assessments: Product vs.
Process
Performance assessments are assessments in which students demonstrate
their knowledge and skills in a non-written fashion. These assessments are
focused on demonstration versus written response. Playing a musical
instrument, identifying a chemical in a lab, creating a spreadsheet in computer
class, and giving an oral presentation are just a few examples of performance
assessments.
Like with test anxiety, awareness of the anxiety source and countering
the active stress process are good ways of effectively dealing with
cautiousness.
Uncontrollable Inhibitors
Sometimes things are just beyond the control of the person doing the
examination. Cohort effects are group-wide bonding or understanding. A
group can be anything, such as an entire generation growing up knowing
only the War on Terror or a small town knowing what it is like to live through
a severe drought. When it comes to how this may influence testing results,
entire groups of people may be shifted.
For instance, if you're doing IQ tests and an entire school did an IQ test
two months ago, then the subjects would be more familiar with the test
and thus score higher. Cohort effects can be related to educational
effects, which are systemic changes made by education. People with
higher education tend to live healthier and different lives than those with
lower education. When it comes to influencing test takers, people who
take a lot of written tests just know how to answer better.
In studies of aging and disease, the educational effect may
also change the way the brain fends off disease. When testing
individuals with cognitive diseases, cognitive reserve becomes an
issue. Cognitive reserve is the resistance to a disease such as
Alzheimer's that increases or decreases based on lifestyle and education.
In studying the cognitive and behavioral components of people, you will find
that there is more resistance to decline in old age in people with higher
education and more intellectually challenging jobs. The old adage of 'use it
or lose it' is basically what you need to remember. When it comes to
performance, you need to remember that people who think more tend to
think faster and better.
When studying the elderly and cohorts, we must be aware that as we age
there is a certain point in which the individual breaks off from the mean and
changes based on his or her own timetable. There appears to be a terminal
drop - a drastic decline in cognitive abilities one to five years before death.
It seems that the brain and body begin to wobble and break down and that is
likely part of the reason the person dies soon after. If doing testing,
understanding this is crucial, as it is something beyond the researcher's
control. When it comes to performance, you may come to realize that this is
happening to someone and it will be up to you to discuss it with the family.
Lesson 2: Standardization and Norming
• Standardization
• Norms
• Standardized Assessments in Educational Setting
• Basic Statistics of Score Distribution
• Comparing Scores to Larger Population
• Mean, Median, and Mode
• Standard Deviation and Bell Curve
• Norm-Referenced vs. Criterion-Referenced Tests
• Bias in Testing
Standardization and Norms of Psychological
Tests
Imagine that you are approaching the finish line of a race. Your heart is
pumping, and you can feel the adrenaline kicking in. As you cross the
finish line, you look up and see that you ran the race in 6 minutes and 43
seconds.
Did you do well on the race or not? How do you answer that question? Is it
based on the time it took you to finish, or based on whether you came in
first or last?
Believe it or not, the question of how well you did in your race is very
similar to the question of how well people do on intelligence tests, which
are meant to measure how much innate ability a person has. There are
many different intelligence tests, which are sometimes called IQ tests.
But there are a few things that they all have in common, including
standardization and norms. Let's look closer at those things.
Standardization
Let's rewind for a moment. You're not approaching the finish line for the
race; now you're at the starting line for the race, getting ready. But when
you line up to race, you realize that something is wrong. Everyone has a
different starting point: some people are only a few feet from the finish
line, while others are far, far away. What gives?
Your race does not have standardization, which ensures that everything
is the same for all participants. In the case of the race, this means that
every racer must run the same distance. In the case of intelligence tests,
it means that every test-taker must have the same circumstances.
You might be thinking, 'But a test isn't a race. How can it be different?' Think
about this: intelligence tests are being given out to people all over the
world, all the time. Also, there's not one person giving them but many, many
people.
So, imagine that you take an intelligence test that's given to you by Amy, a
nice woman who hands the test over, tells you that you have an hour to
take it, and then walks away. You are left to figure everything out on your
own.
But imagine that your friend takes that same test, but this time it's given by
someone named Rosa. Rosa notices when your friend starts to struggle
with a question, so she gives him a hint. When he really can't get an
answer, she lets him look the answers up online.
What if you score the same as your friend? Does that mean that you are
equally adept? No, because you didn't have standardization. That is, the
test you took was harder than your friend's test, even though it had the
same questions, just by virtue of the fact that you didn't have the same
help that he did.
Norms
You might be wondering, though, why it matters what you got on the test
compared to your friend, anyway. Who cares? Let's go back to the race for
a second. You've just finished in 6:43. How well did you do?
If you're like most people, your answer is along the lines of, 'Well, it
depends on how well the other people did.' After all, 6:43 might mean that
you came in first place, or it might mean that you were four minutes behind
the next-to-last person in the race.
What does this have to do with intelligence tests? Answer this: if you get a
100 on an IQ test, how well did you do?
A raw score on an intelligence test doesn't tell you a lot. For some tests,
100 is average. In others, it could be very good or very bad. But the point
is it doesn't tell you how you did until you compare it to others in the same
age group as you.
Norms allow you to compare your test scores with others'. So, instead of
just knowing that you got a 100 on the test, you could also be told that a
score of 100 is at the 50th percentile. That tells you that you scored at or
above roughly half of the people in the same group as you. You are
average.
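The percentile idea above can be sketched as the percentage of norm-group scores at or below your score. (Definitions of percentile rank vary slightly across textbooks; this sketch uses "at or below," and the norm-group numbers are hypothetical, not from the text.)

```python
def percentile_rank(score, norm_group):
    """Percentage of norm-group scores at or below the given score."""
    at_or_below = sum(1 for s in norm_group if s <= score)
    return 100 * at_or_below / len(norm_group)

# Hypothetical norm group of IQ scores for your age group
norm_group = [85, 90, 95, 100, 100, 105, 110, 115, 120, 130]

print(percentile_rank(100, norm_group))  # 50.0 -> roughly average
print(percentile_rank(120, norm_group))  # 90.0 -> near the top of the group
```

The raw score of 100 only becomes meaningful once the norm group supplies the comparison, which is the whole point of norming a test.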
Standardized assessments are very common and can be used for several
purposes.
Achievement Assessments
Achievement assessments are designed to assess how much students
have learned from classroom instruction. Assessment items typically
reflect common curriculum used throughout schools across the state or
nation. For example, a history assessment might contain items that
focus on national history rather than history distinct to a particular state
or county.
There are advantages to achievement assessments. First, achievement
assessments provide information regarding how much a student has
learned about a subject. These assessments also provide information on
how well students in one classroom compare to other students. They also
provide a way to track student progress over time.
1. The school should choose an assessment that has high validity for
the particular purpose of testing. That is, if the school wants to
assess its students' science comprehension, it should choose an
assessment that evaluates science knowledge and skills.
2. The school should make sure the group of students used to 'norm' the
assessment is similar to the population of the school.
3. The school should take the students' age and developmental level
into account before administering any standardized assessments.
There are other forms of assessment that are given at different times.
These are referred to as formative and summative assessments.
Raw Score
Normal Distribution
For example, if we had the following raw scores from your classroom - 57,
76, 89, 92, and 95 - the variability would range from 57 being the low score to
95 being the high score. Plotting these scores along a normal distribution
would show us the variability. The midpoint of the distribution is also
illustrated.
Standard Deviation
The normal distribution curve helps us find the standard deviation of the
scores. Standard deviation is a useful measure of variability. It measures
the average deviation from the mean in standard units. Deviation, in this
case, is defined as the amount an assessment score differs from a fixed
value, such as the mean.
The mean and standard deviation can be used to divide the normal
distribution into several parts. The vertical line at the middle of the curve
shows the mean, and the lines to either side reflect the standard deviation.
A small standard deviation tells us that the scores are close together, and
a large number tells us that they are spread apart more. For example, a
set of classroom tests with a standard deviation of 10 tells us that the
individual scores were more similar than a set of classroom tests with a
standard deviation of 35.
In statistics, there is a rule called the 68-95-99.7 rule. This rule states that
for a normal distribution, almost all values lie within one, two or three
standard deviations from the mean. Specifically, approximately 68% of all
values lie within one standard deviation of the mean. Approximately 95% of
all values lie within two standard deviations of the mean, and
approximately 99.7% of all values lie within three standard deviations of
the mean.
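The 68-95-99.7 rule can be checked directly with Python's `statistics.NormalDist`; the IQ-style mean of 100 and standard deviation of 15 used here are just an illustrative choice.

```python
from statistics import NormalDist

nd = NormalDist(mu=100, sigma=15)  # hypothetical IQ-style scale
for k in (1, 2, 3):
    # Share of values falling within k standard deviations of the mean
    share = nd.cdf(nd.mean + k * nd.stdev) - nd.cdf(nd.mean - k * nd.stdev)
    print(f"within {k} standard deviation(s): {share:.1%}")
```

Running this prints roughly 68.3%, 95.4%, and 99.7%, confirming the rule holds for any normal distribution regardless of its mean and standard deviation.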
Comparing Test Scores to a Larger
Population
To calculate a Z-score, subtract the mean from the raw score and divide by
the standard deviation. For example, if we have a raw score of 85, a mean of
50 and a standard deviation of 10, we will calculate a Z-score of 3.5.
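The Z-score formula described above is a one-liner in Python:

```python
def z_score(raw, mean, sd):
    """Number of standard deviations a raw score lies from the mean."""
    return (raw - mean) / sd

# The worked example: raw score 85, mean 50, standard deviation 10
print(z_score(85, 50, 10))  # 3.5
```

A Z-score of 3.5 means the score sits three and a half standard deviations above the mean.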
For example, let's take a test score of 85, the raw score. If 85 were the
highest grade on this test, the cumulative percentage would be 100%.
Since the student scored at the 100th percentile, she did better than or the
same as everyone else in the class. That would mean that everyone else
made either an 85 or lower on the test.
The Mean
Imagine you teach a class with 20 students, and they take a test with 20
multiple choice questions. Imagine that the grades you get back from
scoring their tests look like this:
So you can see we have student #1 through student #20, and you can
see that the scores range quite a bit. Looking at these scores you can
see that one student, student #1, got a perfect score of 20 out of
20. Many students got scores somewhere in the middle, with five
students getting half of the questions right - that would be a score of
10 out of
20. A few of the students did pretty well, only missing a few questions,
while a few students did pretty badly but at least got a few of the
questions right. How can we be more precise with analyzing these
scores using statistics?
Let's say the principal of your school wants a quick summary of how
your students did on their test. How would you summarize the results?
The most common type of statistic, in either the context of the
classroom assessment or in laboratory research projects, is
statistics of summary. There are various types of statistics of
summary, but in general their purpose is to quickly give a general
impression of the overall trend in results. So, just like you'd guess,
based on the term statistics of summary, these statistics just give you a
ballpark idea of what happened on the test. Let's go over three different
types of summary statistics.
The Median
Why would you use the median instead of the mean? In this example, the
two scores are pretty similar (a mean of 10.5 versus a median of 10). So
here, it doesn't really make any difference which one you pick. The
difference between the mean and the median only really matters if you have
extreme scores on one end or the other. Let's say you were curious to know
how many state capitals the children in your kindergarten class knew. Let's
say you have five kids in the class, and maybe you get scores like these:
Child 1: 1 capital
Child 2: 2 capitals
Child 3: 3 capitals
Child 4: 3 capitals
Child 5: 47 capitals
So child #1 only knew one capital, child 2 knew two capitals, and that's
basically average until you get to child 5, who actually knew 47 capitals. Here,
one of the scores (the child who knows 47 capitals) is extremely different
from the rest of the scores. When a score is extremely different from the rest
of the scores in a distribution, that score is called an outlier. If an outlier
exists in your data, it will have a huge effect on the mean. Here, the mean
would be:
(1 + 2 + 3 + 3 + 47) / 5 = 56 / 5 = 11.2
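Python's standard library makes the contrast between the two statistics easy to see, using the five kindergarten scores from the example above:

```python
from statistics import mean, median

# The five kindergarten scores, including the outlier of 47
capitals = [1, 2, 3, 3, 47]
print(mean(capitals))    # 11.2 -- dragged upward by the outlier
print(median(capitals))  # 3   -- unaffected by the outlier
```

The mean of 11.2 suggests a typical child knows about eleven capitals, which describes none of them; the median of 3 is a far better summary of this group.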
The Mode
The third and final statistic of summary is called the mode, which is
simply the score obtained by the most people in the group. Let's go
back to our original example of scores on the history test for the
Revolutionary War. When you look at the test scores again, what is the
most common score? The answer here is the score of 10. Five students
got that score, so the mode in our example is the score of 10. Again, in
this particular example, the mode is similar to both the mean and the
median.
So why would you use the mode instead of the mean or median? Usually
the mode is used for examples when scores are not in numerical form.
Remember, the mode is telling you what the most common answer is. So
modes are good when the data involved are categorical instead of
numerical.
Think about baseball teams. Who won the World Series last year? Do
you know the team that's won the World Series the most often ever
since it began? The answer is the New York Yankees. So it's accurate to
say that the mode team for winning the World Series is the Yankees,
because it's the most common answer.
Let's go over one more example. When you get a new car, your car
insurance price is based on a lot of things, like your gender and age, but
it's also based on the color of your car. You have to pay more for
insurance if you drive a red car. Why is that? It's because the mode color
of car that gets into accidents is red. In other words, red cars get in more
accidents than any other car - so red is the mode car accident color. It
wouldn't make sense to try to use a mean or a median when talking about
colors of cars, because there aren't any numbers involved. So for
categories like colors or baseball teams, you have to use the mode if you
want to create a statistic of summary.
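Because the mode is just the most frequent value, it works on categorical data where a mean or median is undefined. A minimal sketch, with made-up car colors standing in for the insurance example:

```python
from statistics import mode

# Hypothetical categorical data: colors of cars involved in accidents
accident_colors = ["red", "blue", "red", "silver", "red", "blue", "black"]
print(mode(accident_colors))  # red
```

Trying `mean(accident_colors)` would raise an error, which is exactly the point: for categories, the mode is the only summary statistic available.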
Using Standard Deviation and Bell Curves for
Assessment
Standard Deviation
Let's use the students' scores again from the previous table. Remember,
there are 20 students in the example.
Now you want to know the basic variability within the classroom. So, did the
students' scores kind of clump up all together, meaning the students all
showed about the same amount of knowledge? Or did the scores vary
widely from each other, meaning some students did great whereas other
students failed the test?
The answer to this question can come very precisely from the standard
deviation, which is a measurement that indicates how much a group of
scores varies from the average.
1. We'll start by finding the mean, or average, of all the scores. To do this,
we add up all the scores and divide by the total number of scores. This
gives us a mean of 10.5.
2. The next step is to take each score, subtract the mean from it and
square the difference. For example, looking at the top score of 20,
we subtract 10.5 and then square the difference to get
90.25. We repeat this process for each score.
3. Now, we add up all our squared differences and divide by the total
number of scores. This gives us 353/20, or 17.65.
4. The final step is to take the square root of this number, which is 4.2.
This is the standard deviation of the scores.
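The four steps above amount to the population standard deviation. A minimal sketch in Python (the eight scores here are hypothetical, since the classroom score table itself is not reproduced in this text):

```python
from math import sqrt

def population_sd(scores):
    mean = sum(scores) / len(scores)              # Step 1: find the mean
    squared = [(s - mean) ** 2 for s in scores]   # Step 2: square each deviation
    variance = sum(squared) / len(scores)         # Step 3: average the squares
    return sqrt(variance)                         # Step 4: take the square root

print(population_sd([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
```

The same result comes from `statistics.pstdev`, which implements these steps directly.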
Now that we have our standard deviation of 4.2, what the heck does that
mean? Well, it just gives us an idea of how much the scores on the
test clumped together. To understand this better, look at the two
distributions of scores on the screen.
The one on the left shows scores that are all very similar to each other.
So, because the scores are all close together, the standard deviation is
going to be very small. But, the one on the right shows scores that are
all pretty different from each other (lots of high scores on the test, but
also lots of failing grades on the test). For this distribution, we'd have a
high number for our standard deviation.
Let's plot the test scores in a graph. The x-axis is for the score
received, and the y-axis is for the number of students who got that
score. So, still using our same example of 20 students who took a test
with 20 questions, you can see here the pattern that shows up on the
graph:
There's a big bump in the middle, showing the five students who got the
middle score of 10. Then, the graph tapers off on each side, indicating that
fewer students got very high or very low scores. The shape of this
distribution, a large rounded peak tapering away at each end, is called a bell
curve.
Remember that we said that most teachers will want their students' scores
to look kind of like what we see here. We had a lot of scores that fell in the
middle (indicated by the big bump), which might be like a letter grade of a
C. We had a few students who did really well (which might be like the
grade of A) and a few students who did poorly (in other words, they got
an F). When you have a bell curve that looks like this one, with a bump in
the middle and little ends on each side, you know you have a normal
distribution. A normal distribution has this bell shape and is called
normal because it's the most common distribution that teachers see in a
classroom.
Skewed Distribution
When a distribution is not normal and is instead weighted heavily on either side
like this, it's called a skewed distribution.
Imagine that most of the students got an A on the test. What would that
distribution look like? It would look something like this one:
You can see here that the bump falls along the right side of the graph
(where the higher scores are), with it tapering off only on the left side,
showing that most students got high scores and only a few got low scores.
This is what you call a negatively skewed distribution.
The exact opposite would be true if most students got an F, which would
look like this graph, a positively skewed distribution:
Types of Tests:
Norm-Referenced vs. Criterion-Referenced
Norm-Referenced
For example, let's look at your child's percentile score on a recent
standardized math assessment. His percentile rank is 55, meaning he
scored better than 55% of the other students taking the same
assessment.
We see here from your son's score that he falls just above the mean
(the average score of the population that took the same assessment).
This information tells us that his score is slightly above the scores
of the other students.
Norm-referenced tests are a good way to compensate for any mistakes that
might be made in designing the measurement tool. For example, what if the
math test is too easy, and everybody aces it? If it is a norm-referenced
test, that's OK because you're not looking at the actual scores of the
students but how well they did in relation to students in the same age
group, grade, or class.
Criterion-Referenced
Let's go back to our race scenario. Saying that a runner came in third
place is norm-referenced because we are comparing her to the other
runners in the race. But, if we look at her time in the race, that's criterion-
referenced. Saying she finished the race in 58:42 is an objective measure
that is not a comparison to others.
Tests that are pass-fail are criterion-referenced, as are many tests for
certifications. Any test where there's a certain score that you have to
achieve to pass is criterion-referenced. So, for example, you could say
that students have to get a 70% on the test to pass, which would make
it a criterion-referenced test.
Cultural Bias
Most test biases are considered cultural bias. Cultural bias is the
extent to which a test offends or penalizes some students based on
their ethnicity, gender or socioeconomic status.
Researchers have identified multiple types of test bias that affect the
accuracy and usability of the test results.
Construct Bias
Method Bias
Item Bias
Item Bias refers to problems that occur with individual items on the
assessment. These biases may occur because of poor use of grammar,
choice of cultural phrases and poorly written assessment items.
For example, the use of phrases, such as 'bury the hatchet' to indicate
making peace with someone or 'the last straw' to indicate the thing that
makes one lose control, in test items would be difficult for a test-taker
from a different culture to interpret. The incorrect interpretation of
culturally biased phrases within test items would lead to inaccurate test
results.
Language Differences and Test Bias