Validity and Classroom Language Testing: A Practical Approach
Citation / Para citar este artículo: Giraldo, F. (2020). Validity and Classroom Language Testing: A Practical Approach. Colomb. Appl. Linguist. J., 22(2), pp. 194-206.
Received: 05-Mar.-2020 / Accepted: 22-Dec.-2020
DOI: https://doi.org/10.14483/22487085.15998
Abstract
Validity and validation are common topics of discussion in large-scale language testing. These topics are fundamental because they help stakeholders in testing systems make accurate interpretations of individuals’ language ability and the decisions that ensue from them. However, there is limited information on validity and validation for classroom language testing, for which interpretations and decisions based on curriculum objectives are paramount, too. In this reflection
article, I provide a critical account of these two issues as they are applied in large-scale testing. Next, I use this
background to discuss and provide possible applications for classroom language education through a proposed
approach for validating classroom language tests. The approach comprises the analyses of curriculum objectives,
design of test specifications, analysis of test items, professional design of instruments, statistical calculations,
cognitive validation and consequential analyses. I close the article with implications and recommendations for
such endeavours and highlight why they are fundamental for high-quality language testing systems in classroom
contexts.
Resumen
La validez y la validación son temas de discusión comunes en la evaluación de lenguas a gran escala. Estos
temas son fundamentales porque permiten que aquellos involucrados en estos sistemas de evaluación puedan
hacer interpretaciones claras, junto con las decisiones que de ellas se desprendan. No obstante, hay poca
información en la literatura relacionada con la validez y la validación en contextos de aprendizaje de lenguas,
donde las interpretaciones y decisiones basadas en objetivos curriculares también son fundamentales. En
este artículo de reflexión, hago una revisión crítica de cómo estos dos temas son utilizados en evaluación a
gran escala. Luego uso este contexto para discutir y presentar posibles aplicaciones para el aula de idiomas
a través de una propuesta de enfoque para la validación de instrumentos de evaluación en este contexto.
El enfoque incluye un análisis de objetivos curriculares, el diseño de especificaciones, el análisis de ítems
en instrumentos de evaluación, el diseño profesional de evaluaciones, cálculos estadísticos, la validación
cognitiva y, por último, análisis de consecuencias. El artículo lo concluyo con implicaciones y recomendaciones
1 This reflection article is on the validity of classroom language testing and connects theory and practice in validation.
2 Universidad de Caldas, Colombia. ORCID: https://orcid.org/0000-0001-5221-8245. frank.giraldo@ucaldas.edu.co
Colomb. Appl. Linguist. J.
Printed ISSN 0123-4641 Online ISSN 2248-7085 • July - December 2020. Vol. 22 • Número 2 pp. 194-206.
pertinentes para este proceso, además de enfatizar las razones por las cuales es vital para tener sistemas de evaluación de alta calidad.

Palabras clave: evaluación en el aula de clases, evaluación de lenguas extranjeras, validación, validez

Introduction

Validity is the most fundamental quality of testing systems across social, professional and educational contexts. This assertion holds true whether tests are used in large-scale or classroom settings. Among assessment discussions, there is a consensus that tests themselves are not valid: Validity is not a quality of an assessment instrument (e.g. a test) but relates to how appropriate interpretations based on assessment data are for making particular decisions (Chapelle, 1999; Fulcher, 2010; Green, 2004; Messick, 1989; Popham, 2017). Thus, validity may be conceived as an abstract notion and an ideal. Because of this abstract nature, validation has emerged as the data-gathering process to argue for the validity of interpretations and decisions made from tests. The quality and the process are crucial in both large-scale and classroom language testing (Chapelle & Voss, 2013; Kane & Wools, 2019). In particular, validation supports the development and monitoring of high-quality testing systems.

Validation research in language assessment abounds, specifically for large-scale testing—tests that affect many individuals (Bachman, 2004); such research is expected because of the consequences of using these instruments. Chapelle, Enright and Jamieson (2008) argue in favour of the validity of using the Test of English as a Foreign Language (TOEFL); the researchers claim that the TOEFL helps users make admission decisions for English-speaking universities that use academic English. Other examples of validation projects are assessments of the validity of using a placement test for international teaching assistants (Farnsworth, 2013), a web-based Spanish listening test used to make placement decisions (Pardo-Ballester, 2010) and Llosa’s (2007) comparison of a classroom test and a standardised test of English proficiency. These studies have collected data to claim the validity of using these tests, used complex statistical calculations and compared these tests with other well-known instruments. Thus, validation research and discussions are predominant in assessing the validity of large-scale testing (Chapelle & Voss, 2013; Xi & Sawaki, 2017). However, the discussion on the validity and validation of classroom language testing has been limited, with researchers providing mostly a conceptual approach (see Bachman & Damböck, 2018; Chapelle & Voss, 2013; Kane, 2012).

Against this backdrop, the purpose of this reflection paper is twofold: to discuss validity as it relates to classroom language testing and language teachers, and to provide and reflect on strategies to validate classroom language tests such that they are manageable for teachers. I provide practical examples to demonstrate this process. I start the paper with an overview of definitions for validity and validation as central constructs and then discuss a practical approach for them in classroom language testing. I end the paper with implications of validating language tests, recommendations for validation and relevant limitations and conclusions.

Validity in Language Testing

Validity in language testing is about how logical and true interpretations and decisions are when made based on scores (or data in general) from assessments. Validity has been considered a trait of tests: A test is valid if it measures what it has to measure and nothing more (Brown & Abeywickrama, 2010; Lado, 1961). However, this view is no longer used in educational measurement in general or in language testing specifically.

The following definition of validity in assessment is from the American Educational Research Association (AERA), American Psychological Association and National Council on Measurement in Education (NCME; 2014, p. 11): ‘The degree to which evidence and theory support the interpretations of test scores for proposed uses of tests’. Earlier, Messick (1989, p. 13) provided a similar definition that since its inception was welcomed in language testing. To him, validity is ‘an overall evaluative judgement of the degree to which evidence and theoretical rationales support
the adequacy and appropriateness of interpretations and actions based on test scores’.

Thus, in language testing, a score represents individuals’ language ability and is used for making decisions, for example, to allow conditional admission to an English-speaking university (e.g. the aforementioned TOEFL case), or, in a classroom, to move on to another unit in a course. This decision-making process is what Messick calls interpretations and actions, or uses of tests in AERA et al. (2014). The interpretations and actions should be appropriate because they are based on clearly defined constructs (i.e. language ability as a theoretical rationale) and on student performance on a test—what Messick and AERA et al. call evidence.

A couple of teachers using a placement test of reading comprehension with a group of new students at a language institute is an example of evidence and theoretical rationale. On the basis of the score from this instrument, a student is placed in Level II (decision or use). In this case, validity depends on demonstrating 1) that the student displayed a performance in reading that merited being in Level II (evidence) and 2) that the test was based on a clear definition of language ability for reading at Level II (theoretical rationale). If students start Level II and perceive that their skills are beyond those of their classmates, the interpretation (that the student had the reading skills to be in Level II) and the decision (placing the student accordingly) are not valid. If the student is ready for Level II, there is validity in the interpretation and decision from this testing system.

To further explicate validity in language testing, the following hierarchy synthesises and simplifies this quality for the TOEFL (based on Chapelle et al., 2008). Tests serve purposes—they are not designed in a vacuum—and trigger the evidence (what test takers demonstrate) from which interpretations are derived. Subsequently, these interpretations are used to make claims and decisions about individuals.

Purpose: Measure a test taker’s proficiency in academic English.
↓
Assessment of: Performance on the TOEFL (Evidence).
↓
Interpretations of: Test taker’s state of academic English in listening, reading, speaking and writing (Theoretical Rationales).
↓
Claim: The student does or does not have sufficient academic English to study at university.
↓
Decision or use: Based on scores from the TOEFL, confer or deny conditional admission for university.

The aforementioned claim and decision must be validated; in other words, TOEFL developers must demonstrate through considerable amounts of research-based data that the claim and decision are valid, namely, logical and true. A similar approach can be used in classroom language assessment, in which the chain of logic as overviewed can be applied (see Bachman & Damböck, 2018; Chapelle & Voss, 2013; Kane, 2012). The following hierarchy is an example of a classroom language assessment for a listening quiz.

Purpose: Identify the students who are learning or having difficulty with listening skills A and B.
↓
Assessment of: Performance on a listening quiz with 20 multiple-choice questions; number of right and wrong answers (Evidence).
↓
Interpretations of: Students’ level of listening comprehension as outlined in the course syllabus (Theoretical Rationale).
↓
Claim: The student who passes the quiz has the listening skills; the student who fails does not.
↓
Decision or use: If all students pass the quiz, they have developed the skills and are ready to develop new listening skills.

To argue for the validity of the aforementioned claim and decision, the teacher using this quiz must present evidence to demonstrate at least the following about the test:

• It is designed to activate skills A and B, and they are from the curriculum objectives.
• It was well designed to activate listening skills A and B.
• It was not designed to activate listening skills C and D.
• The students took the test without disruption; there were no problems with the administration.
• There were no instances of cheating.
• The teacher correctly checked the test and provided the relevant grades accurately: pass or fail.
• The answer key (the document that contains the correct answers) is accurate, namely, all the correct answers really are the correct answers.

To reiterate, validity is about how appropriate, logical and true interpretations and decisions are when based on data from assessment instruments. If students cheated during this quiz, the score might be inflating their listening skills, the teacher is misinterpreting the data (correct answers) and those who passed may not really have the skills. Additionally, the decision to advance to other listening skills in the course is not valid. Notably, if the teacher mistakenly used a test for skills C and D, the interpretations and decisions are not valid, either. The test was not fit for purpose in this particular scenario.

Thus, validity for classroom testing can be likened to the definitions by AERA et al. (2014) and Messick (1989), with some modifications: Validity in classroom language testing depends on how appropriate interpretations and decisions are, based on the data from instruments used to activate the relevant language skills stated in a curriculum. As aforementioned, validity is an abstract concept. To make it practical, teachers can validate the tests they use for accurate interpretations and decisions, which I discuss next.

Validation in Language Testing

Validation is the process of evaluating the validity of a testing system. Validation entails the accumulation of empirical and theoretical evidence to demonstrate that a test has been used as expected and led to corresponding correct uses. Language testing professionals generally refer to validation as the process to estimate the validity of score-based interpretations, decisions and consequences (Bachman, 2005; Carr, 2011; Kane, 2006; Messick, 1994). Particularly, validation in large-scale testing requires the use of considerable amounts of quantitative and qualitative data (Xi & Sawaki, 2017), which in some cases tend to be unnecessary for classroom testing (Brookhart, 2003; Popham, 2017). However, validation must also be acknowledged in classroom contexts because the validity of tests used in the classroom must be accounted for, too (Bonner, 2013; Brown & Hudson, 2002; Popham, 2017).

Specifically, I posit that validation in classroom language testing may help scrutinise the appropriateness of curriculum objectives, the overall quality of tests and the fairness with which students are treated in assessments. The validation schemes for classroom assessment reported in the literature (Bachman & Damböck, 2018; Bonner, 2013; Chapelle & Voss, 2013; Kane, 2012) have tended to be theoretical and offer general principles. However, according to my review of the literature, there are limited resources for language teachers to reflect and act upon the idea of validating the tests they use. Therefore, in the next section of this paper, I offer one possible praxis-based approach for examining the validity of interpretations and decisions as they emerge from using classroom language tests.

One Practical Approach for Validation in Classroom Language Testing

My proposed approach for validation in language classrooms comprises three major stages: The first stage relates to the congruence between curriculum objectives and the design of tests; the second stage is a close analysis of already-made instruments and the use of basic statistics; the last stage collects feedback to examine the consequences of using tests.

Curricular Focus

Scholars in educational measurement in general and those in language testing have argued that tests should reflect the skills, tasks, or content stipulated in a curriculum. This connection is
collectively called content validity (Bonner, 2013; Brown & Hudson, 2002; Douglas, 2010; Fulcher, 2010; Popham, 2017). If instruments collect evidence on students’ standing with respect to curriculum content, this evidence can be used to argue for the validity of an assessment.

Particularly, language teachers should ascertain whether the language skills from a syllabus are language related. For example, in Colombia, language learning is based on national standards stated in a document called Guía 22 (Ministerio de Educación Nacional de Colombia, 2006, p. 22). Next, I present two examples that the document states as learning standards for Reading in English in sixth grade. I include a translation for each standard.

1) Puedo extraer información general y específica de un texto corto y escrito en un lenguaje sencillo.
I can extract general and specific information from a short text written in simple language.
2) Valoro la lectura como un hábito importante de enriquecimiento personal y académico.
I value reading as an important habit for personal and academic enrichment.

At face value, number 1) is a specific reading skill; however, number 2) is a skill that an individual can demonstrate regardless of language. Thus, 1) may be operationalised in a language test, namely, a teacher can create a reading quiz to assess the students’ ability to perform this skill. Number 2) cannot be operationalised in a language test. Of course, the standards are meant to guide learning, teaching and assessment. The main point is that language teachers should observe how connected their language assessment instruments are to the skills of their language curriculum. Therefore, the main recommendation is for teachers to analyse whether the standards (or objectives) in their curriculum are language related, i.e. that they represent language ability. This notion is best encapsulated in this question: Can I design a test that provides me with information on my students’ level/development of this learning standard (or competence) in the English language?

Test Specifications and Fit-to-Spec Analysis.

A practical approach for the curriculum level—and to have evidence for validity—relies on the design of test specifications, test specs for short (Davidson & Lynch, 2002; Fulcher, 2010). A document with specs describes how a test should be designed. Table 1 provides a simple example for a reading test.

Davidson and Lynch (2002) explain that teachers can conduct a fit-to-spec validity analysis. Once the 15 items for the test in Table 1 are designed, teachers can assess whether the items clearly align with the specs. To help teachers achieve this objective, I converted the descriptions in Table 1 into a checklist that teachers can use (Table 2).

Test specs and the results of a fit-to-spec analysis are evidence for validation for three main reasons. First, the specs should naturally be based on the language skills stated in a syllabus, which can then provide evidence for the test’s content validity. Second, the fit-to-spec analysis can unearth problematic items that are either assessing something not stated in the specs (and therefore not in the curriculum) or confusing the students. Finally, problematic items can be changed such that they better reflect the curriculum skills to be assessed. Appropriate specs and congruence between tests and curriculum objectives will most likely contribute to the validity of interpretations and, therefore, the purpose and decisions based on data.

Professional Test Design.

Another test development action in tandem with specs is the principled design of items and tasks. Language testing authors have provided guidelines for the professional design of tests (Alderson, Clapham, & Wall, 1995; Brown, 2011; Carr, 2011; Fulcher, 2010; Hughes, 2002). In particular, Giraldo (2019) synthesises ideas from these authors to provide checklists for the design of items and tasks. Table 3, which I adapted and modified from Giraldo (2019, pp. 129-130), contains descriptors for a checklist that can be used to either design or evaluate a reading or listening test.
Table 1. Sample Test Specifications for a Classroom Reading Test

Purpose & Decision: The purpose of this test is to assess how students are developing the following reading skills. On the basis of the results from this test, the teacher and students can identify what they do well and what they must improve or reinforce before advancing to other reading skills.

Types & length of texts: 1 fable; 1 classical tale (excerpt); 1 person’s narrative account. All texts are between 100 and 150 words.

Table 2. Fit-to-Spec Checklist (column headers: Questions, Yes, No)
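A fit-to-spec analysis of the kind Table 2 supports can also be sketched programmatically. The snippet below is only an illustration of the idea, not part of the article’s materials: the skill labels and item tags are invented. Each item is tagged with the skill it targets, and items targeting a skill absent from the specs are flagged for revision.

```python
# Hypothetical fit-to-spec check: flag items whose target skill is not stated
# in the test specs (and therefore, by extension, not in the curriculum).
# All skill labels and item tags below are invented for illustration.
spec_skills = {
    "extract general information from a short text",
    "extract specific information from a short text",
}

items = [
    {"number": 1, "skill": "extract general information from a short text"},
    {"number": 2, "skill": "extract specific information from a short text"},
    {"number": 3, "skill": "identify the author's tone"},  # not in the specs
]

# Items that do not fit the specs should be revised, replaced, or removed.
misfit_items = [item["number"] for item in items if item["skill"] not in spec_skills]
print(misfit_items)  # [3]
```

A flagged item is not automatically a bad item; as noted above, it may point either to a problem with the item or to a skill the specs (and the curriculum) should state explicitly.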
Table 3. Checklist of Guidelines for a Multiple-Choice Reading or Listening Test

Guidelines (Yes / No):
• The question in Item # __ does not have unknown vocabulary for students.
• All options in Item # __ are plausible, namely, they can be answered only by listening to/reading the text. (If a student can guess the answer without listening or reading, the item is not assessing this construct.)
• The correct answer (the key) for Item # __ really is the correct answer.
• Item # __ is assessing one of the skills described in the test specs.
be that the diagnostic instrument yielded useful data to examine the validity of interpretations and decisions.

• Calculate mode, median and mean. The two teachers can observe the mode score, the median score and the mean score for all students. If the mode were 2.0, then the students with 2.0 are ready for Level III; if the median is 3.5, then 50% of students are ready for Level III and 50% seem to have the skills stated in the learning objectives for the course. Finally, if the mean (the average of all the 30 scores) is 4.0, the group has the speaking skills for Level III. Notably, high scores (5.0) may inflate the mean; thus, analysis of specific cases (e.g. low, failing scores) is warranted.

• Calculate mean and standard deviation. These two statistics are useful when analysed together. If the mean for the group of 30 students is 2.5 and the standard deviation (the average distance of every score from the mean) is 0.2, then scores cluster closely around the mean: some students scored about 2.7 and others about 2.3. On the basis of this standard deviation, students are observed to have a similarly low level of speaking, interpreted as the group being ready for Level III. If the mean and standard deviation are 4.4 and 0.2, respectively, the group has the speaking skills for Level III. If the mean were 3.5 and the standard deviation for this particular test were 1.0, two phenomena are possible: The students have widely different levels of speaking, or there was little consistency in the assessment, as I explain next.

• Calculate the agreement coefficient and kappa for consistency. These two statistics help present the extent of the agreement between two test administrations, two raters, or two score-based decisions such as pass and fail. In the aforementioned diagnostic test example, suppose the two teachers assessed each student at the same time, so each student received two scores. If the agreement coefficient is 70%, the two teachers made the same decisions (pass or fail) in 70% of the cases (21 students). The performance of the other 30% (9 students) needs to be revised. If kappa, a calculation that corrects the agreement coefficient for chance agreement, is 85%, the agreement level between the two teachers is very high (Fulcher, 2010). Consistency in this scenario can be interpreted as the two teachers using the rubric accurately: They understood the constructs (e.g. grammar accuracy, fluency) and assessed them fairly while they heard students speaking during the interview.

• Calculate means and standard deviations in a differential groups study (Brown & Hudson, 2002). This type of study requires a somewhat higher level of sophistication than the previous calculations. The two teachers can use the same interview and corresponding rubric with students who are in the Level IV Speaking Course and compare their performance with the means and standard deviations of the students about to start Level III. The assumption in this case is that students in Level IV should pass the interview because they already have the skills presented in Level III: Their mean should be high and their standard deviation low. Both the mean and standard deviation for the students about to start Level III should be low. If a high percentage of students in Level IV fail the diagnostic interview for Level III, the instrument must be investigated, and the validity of inferences and decisions from it must be questioned. Perhaps determining what occurred during the Level III course is necessary.

The statistical calculations in the aforementioned speaking scenario provide information on students’ speaking skills vis-à-vis the Level III course. For validation purposes in general, statistics can be used to argue for the validity (or lack thereof) of language tests. For example, if in the aforementioned testing scenario kappa is low (20% or less), the two teachers disagreed widely and, therefore, interpretations and decisions cannot be trusted: they are not valid. The central point is that for statistics to help with validation, they must be interpreted against the constructs and the purposes for which a test is used.
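To make the calculations above concrete, the following sketch computes the descriptive statistics, the agreement coefficient, Cohen’s kappa and a differential groups comparison for a hypothetical version of the diagnostic speaking scenario. The scores, the 3.0 passing score and the two-teacher data are invented for illustration; they are not figures from this article.

```python
# Illustrative calculations for a hypothetical diagnostic speaking test
# scored from 1.0 to 5.0 by two teachers. All data are invented.
from statistics import mean, median, mode, pstdev

teacher_a = [2.0, 3.5, 4.0, 2.0, 4.5, 3.0, 2.0, 5.0, 3.5, 4.0]
teacher_b = [2.5, 3.5, 4.0, 2.0, 4.0, 2.5, 2.0, 5.0, 3.0, 4.5]

print(mode(teacher_a))             # 2.0: the most frequent score
print(median(teacher_a))           # 3.5: half the group scored at or below this
print(round(mean(teacher_a), 2))   # 3.35: group average; a 5.0 can inflate it
print(round(pstdev(teacher_a), 2)) # 1.03: scores are widely spread out

# Agreement coefficient: proportion of identical pass/fail decisions,
# assuming a (hypothetical) passing score of 3.0.
def decisions(scores, cut=3.0):
    return ["pass" if score >= cut else "fail" for score in scores]

decisions_a, decisions_b = decisions(teacher_a), decisions(teacher_b)
agreement = sum(a == b for a, b in zip(decisions_a, decisions_b)) / len(decisions_a)

# Cohen's kappa: agreement corrected for the agreement expected by chance.
def cohens_kappa(d1, d2):
    n = len(d1)
    observed = sum(a == b for a, b in zip(d1, d2)) / n
    expected = sum((d1.count(c) / n) * (d2.count(c) / n) for c in set(d1 + d2))
    return (observed - expected) / (1 - expected)

print(agreement)                                         # 0.9
print(round(cohens_kappa(decisions_a, decisions_b), 2))  # 0.78

# Differential groups sketch: a (hypothetical) group already in Level IV
# should show a higher mean than the group about to start Level III.
level_four = [4.0, 4.5, 4.0, 5.0, 4.5, 4.0, 3.5, 4.5, 5.0, 4.0]
print(mean(level_four) > mean(teacher_a))  # True
```

As stressed above, the numbers matter only against the constructs and purposes of the test: a kappa of 0.78 here would support consistency between the raters, whereas a low kappa would call the interpretations and decisions into question.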
Cognitive Validation

Authors such as Bonner (2013) and Green (2014) have suggested that teachers ask students for insights into assessment processes and instruments or observe students as they take tests. The idea of cognitive validation is to stimulate students’ thinking and reflection regarding language assessment. Bonner, for example, recommends the use of think-alouds, observations and interview protocols to tap into students’ cognition. For example, teachers can ask students the following questions (in an oral interview or written open survey) to collect evidence for the validity of interpretations and decisions:

1. How did you feel while [writing your narrative text]?
2. What skills do you feel the [narrative task] was assessing? Do you feel you had the opportunity to demonstrate these skills on this test?
3. If anything, what was difficult for you in this [narrative task]?

For ease of use, the three questions can be asked in the language with which students are most comfortable. The answers can then be used to investigate the validity of a given instrument. For instance, if a student feels the instructions for a task were difficult to understand, and the teacher notices that his/her performance was poor, maybe the instructions caused the poor performance. In this case, interpretations and decisions must be challenged and studied carefully. If students report that the instructions were clear and they performed well, this piece of evidence supports the validity of interpretations and decisions. Similarly, if students’ answers to question 2 reflect what the test specs stipulate, this observation can also be used as evidence.

Analysis of Consequences.

Generally, assessments should lead to beneficial consequences, especially when assessments are used for instructional purposes (Bachman & Damböck, 2018; Green, 2014; Kane & Wools, 2019). By and large, the consequence of classroom language testing should be improved language learning. Thus, a final proposed action for validating classroom language tests is to analyse their consequences. Table 4 presents a list of categories related to purposes for classroom language testing, with proposed courses of action.

As Kane and Wools (2019) reiterate, classroom assessments should be useful in attaining instructional purposes, and their validity should be assessed on the extent to which these objectives are fulfilled. The proposed questions for a consequential analysis in Table 4 might help teachers evaluate the reach and usefulness of their tests.

The steps in the proposed practical approach for validating classroom language tests, summarised in
Table 4. Purposes for Classroom Language Testing and Proposed Courses of Action

Diagnostic: After providing feedback on the diagnostic, ask students and teachers in the corresponding courses how students are feeling/doing. For example: If the diagnosis stated that the student needed to be in the course, she/he should feel fine in it. Is she/he improving language?

Progress: If, after a progress test, students require additional emphasis on a particular language skill, provide the necessary review/reinforcement tasks and ask students whether the tasks are helping them with the areas that need attention.

Achievement: For students who failed the test and had to repeat the course: To what extent are you now improving the language skills for this course? For students who passed the test and are now in a new course: To what extent do you feel prepared for this course? Are you doing well? Do you feel you learned the skills/contents from the last course? To the teacher: To what extent do you feel these students are prepared for this course? Are they doing well? Do you feel students achieved the learning objectives from the last course?
literacy –LAL– (Fulcher, 2012; Inbar-Lourie, 2017). In other words, teachers may need a satisfactory understanding of theoretical knowledge and skills for language testing, dimensions understudied in language education programmes (Giraldo, 2018; Herrera & Macías, 2015; López & Bernal, 2009; Vogt & Tsagari, 2014). For example, teachers must know how to calculate and, most importantly, interpret statistical information to evaluate validity in testing. As a recommendation for promoting LAL, teachers may use language testing textbooks or online resources; some of these are open source, for example, the TALE Project (Tsagari et al., 2018), which includes a handbook to study language assessment issues.

Limitations

in observing what real-life tasks individuals can perform using language (Long, 2015). Thus, in classrooms where task-based language assessment is the guiding methodology, other approaches to validation are warranted.

Finally, a limitation of the validation approach I discuss is that statistical analyses may not be a common topic for language teachers and may require further LAL, as aforementioned. As I state in this paper, validation is about collecting evidence from various sources, and statistics is only one of them. Language teachers attempting to validate classroom tests should, ultimately, analyse their expertise for their validation schemes for a given test and related purpose. The present proposal may be a guide for where to start their validity endeavour.
References

Kane, M. (2006). Validation. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). American Council on Education and Praeger.

Kane, M. (2012). Articulating a validity argument. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 34-47). Routledge.

Kane, M., & Wools, S. (2019). Perspectives on the validity of classroom assessments. In S. Brookhart & J. McMillan (Eds.), Classroom assessment and educational measurement (pp. 11-26). Routledge.

Lado, R. (1961). Language testing: The construction and use of foreign language tests. McGraw Hill.

Llosa, L. (2007). Validating a standards-based classroom assessment of English proficiency: A multitrait-multimethod approach. Language Testing, 24(4), 489-515. https://doi.org/10.1177/0265532207080770

Long, M. (2015). Second language acquisition and task-based language teaching. John Wiley and Sons, Inc.

López, A., & Bernal, R. (2009). Language testing in Colombia: A call for more teacher education and teacher training in language assessment. Profile: Issues in Teachers’ Professional Development, 11(2), 55-70.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). Macmillan.

Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23, 13–23. https://doi.org/10.3102/0013189X023002013

Ministerio de Educación Nacional de Colombia (2006). Estándares básicos de competencias en lenguas extranjeras: Inglés. Formar en lenguas extranjeras: ¡el reto! Lo que necesitamos saber y saber hacer. Imprenta Nacional.

Norris, J. (2016). Current uses for task-based language assessment. Annual Review of Applied Linguistics, 36, 230–244. https://doi.org/10.1017/S0267190516000027

Pardo-Ballester, C. (2010). The validity argument of a web-based Spanish listening exam: Test usefulness evaluation. Language Assessment Quarterly, 7(2), 137-159. https://doi.org/10.1080/15434301003664188

Popham, J. (2003). Test better, teach better: The instructional role of assessment. Association for Supervision and Curriculum Development.

Popham, J. (2017). Classroom assessment: What teachers need to know (8th ed.). Pearson.

Tsagari, D., Vogt, K., Froelich, V., Csépes, I., Fekete, A., Green, A., Hamp-Lyons, L., Sifakis, N., & Kordia, S. (2018). Handbook of assessment for language teachers. Retrieved from http://taleproject.eu/

Vogt, K., & Tsagari, D. (2014). Assessment literacy of foreign language teachers: Findings of a European study. Language Assessment Quarterly, 11(4), 374-402. https://doi.org/10.1080/15434303.2014.960046

Xi, X., & Sawaki, Y. (2017). Methods of test validation. In E. Shohamy, I. G. Or, & S. May (Eds.), Language testing and assessment: Encyclopedia of language and education (3rd ed., pp. 193-210). Springer. https://doi.org/10.1007/978-3-319-02261-1_19