Validity and Classroom Language Testing: A Practical Approach
Citation / Para citar este artículo: Giraldo, F. (2020). Validity and Classroom Language Testing: A Practical Approach. Colomb. Appl. Linguist. J., 22(2), pp. 194-206.
Received: 05-Mar.-2020 / Accepted: 22-Dec.-2020
DOI: https://doi.org/10.14483/22487085.15998
Abstract
Validity and validation are common topics of discussion in large-scale language testing. These topics are fundamental because they help stakeholders in testing systems make accurate interpretations of individuals’ language ability and the decisions that ensue from them. However, there is limited information on validity and validation for classroom language testing, for which interpretations and decisions based on curriculum objectives are paramount, too. In this reflection
article, I provide a critical account of these two issues as they are applied in large-scale testing. Next, I use this
background to discuss and provide possible applications for classroom language education through a proposed
approach for validating classroom language tests. The approach comprises the analyses of curriculum objectives,
design of test specifications, analysis of test items, professional design of instruments, statistical calculations,
cognitive validation and consequential analyses. I close the article with implications and recommendations for
such endeavours and highlight why they are fundamental for high-quality language testing systems in classroom
contexts.
Resumen
La validez y la validación son temas de discusión comunes en la evaluación de lenguas a gran escala. Estos
temas son fundamentales porque permiten que aquellos involucrados en estos sistemas de evaluación puedan
hacer interpretaciones claras, junto con las decisiones que de ellas se desprendan. No obstante, hay poca
información en la literatura relacionada con la validez y la validación en contextos de aprendizaje de lenguas,
donde las interpretaciones y decisiones basadas en objetivos curriculares también son fundamentales. En
este artículo de reflexión, hago una revisión crítica de cómo estos dos temas son utilizados en evaluación a
gran escala. Luego uso este contexto para discutir y presentar posibles aplicaciones para el aula de idiomas
a través de una propuesta de enfoque para la validación de instrumentos de evaluación en este contexto.
El enfoque incluye un análisis de objetivos curriculares, el diseño de especificaciones, el análisis de ítems
en instrumentos de evaluación, el diseño profesional de evaluaciones, cálculos estadísticos, la validación
cognitiva y, por último, análisis de consecuencias. El artículo lo concluyo con implicaciones y recomendaciones
1 This reflection article is on the validity of classroom language testing and connects theory and practice in validation.
2 Universidad de Caldas, Colombia. ORCID: https://orcid.org/0000-0001-5221-8245. frank.giraldo@ucaldas.edu.co
Colomb. Appl. Linguist. J.
Printed ISSN 0123-4641 Online ISSN 2248-7085 • July - December 2020. Vol. 22 • Número 2 pp. 194-206.
pertinentes para este proceso, además de enfatizar las razones por las cuales es vital para tener sistemas de evaluación de alta calidad.

Palabras clave: evaluación en el aula de clases, evaluación de lenguas extranjeras, validación, validez

Introduction

Validity is the most fundamental quality of testing systems across social, professional and educational contexts. This assertion holds true whether tests are used in large-scale or classroom settings. Among assessment discussions, there is a consensus that tests themselves are not valid: Validity is not a quality of an assessment instrument (e.g. a test) but relates to how appropriate interpretations based on assessment data are for making particular decisions (Chapelle, 1999; Fulcher, 2010; Green, 2004; Messick, 1989; Popham, 2017). Thus, validity may be conceived as an abstract notion and an ideal. Because of this abstract nature, validation has emerged as the data-gathering process to argue for the validity of interpretations and decisions made from tests. The quality and the process are crucial in both large-scale and classroom language testing (Chapelle & Voss, 2013; Kane & Wools, 2019). In particular, validation supports the development and monitoring of high-quality testing systems.

Validation research in language assessment abounds, specifically for large-scale testing—tests that affect many individuals (Bachman, 2004); such research is expected because of the consequences of using these instruments. Chapelle, Enright and Jamieson (2008) argue in favour of the validity of using the Test of English as a Foreign Language (TOEFL); the researchers claim that the TOEFL helps users make admission decisions for English-speaking universities that use academic English. Other examples of validation projects are assessments of the validity of using a placement test for international teaching assistants (Farnsworth, 2013), a web-based Spanish listening test used to make placement decisions (Pardo-Ballester, 2010) and Llosa’s (2007) comparison of a classroom test and a standardised test of English proficiency. These studies have collected data to claim the validity of using these tests, used complex statistical calculations and compared these tests with other well-known instruments. Thus, validation research and discussions are predominant in assessing the validity of large-scale testing (Chapelle & Voss, 2013; Xi & Sawaki, 2017). However, the discussion on the validity and validation of classroom language testing has been limited, with researchers providing mostly a conceptual approach (see Bachman & Damböck, 2018; Chapelle & Voss, 2013; Kane, 2012).

Against this backdrop, the purpose of this reflection paper is twofold: to discuss validity as it relates to classroom language testing and language teachers, and to provide and reflect on strategies to validate classroom language tests such that they are manageable for teachers. I provide practical examples to demonstrate this process. I start the paper with an overview of definitions for validity and validation as central constructs and then discuss a practical approach for them in classroom language testing. I end the paper with implications of validating language tests, recommendations for validation and relevant limitations and conclusions.

Validity in Language Testing

Validity in language testing is about how logical and true interpretations and decisions are when made based on scores (or data in general) from assessments. Validity has been considered a trait of tests: A test is valid if it measures what it has to measure and nothing more (Brown & Abeywickrama, 2010; Lado, 1961). However, this view is no longer used in educational measurement in general or in language testing specifically.

The following definition of validity in assessment is from the American Educational Research Association (AERA), American Psychological Association and National Council on Measurement in Education (NCME; 2014, p. 11): ‘The degree to which evidence and theory support the interpretations of test scores for proposed uses of tests’. Earlier, Messick (1989, p. 13) provided a similar definition that since its inception was welcomed in language testing. To him, validity is ‘an overall evaluative judgement of the degree to which evidence and theoretical rationales support
the adequacy and appropriateness of interpretations and actions based on test scores’.

Thus, in language testing, a score represents individuals’ language ability and is used for making decisions, for example, to allow conditional admission to an English-speaking university (e.g. the aforementioned TOEFL case), or, in a classroom, to move on to another unit in a course. This decision-making process is what Messick calls interpretations and actions, or uses of tests in AERA et al. (2014). The interpretations and actions should be appropriate because they are based on clearly defined constructs (i.e. language ability as a theoretical rationale) and on student performance on a test—what Messick and AERA et al. call evidence.

A couple of teachers using a placement test of reading comprehension with a group of new students at a language institute is an example of evidence and theoretical rationale. On the basis of the score from this instrument, a student is placed in Level II (decision or use). In this case, validity depends on demonstrating 1) that the student displayed a performance in reading that merited being in Level II (evidence) and 2) that the test was based on a clear definition of language ability for reading at Level II (theoretical rationale). If students start Level II and perceive that their skills are beyond those of their classmates, the interpretation (that the student had the reading skills to be in Level II) and the decision (placing the student accordingly) are not valid. If the student is ready for Level II, there is validity in the interpretation and decision from this testing system.

To further explicate validity in language testing, the following hierarchy synthesises and simplifies this quality for the TOEFL (based on Chapelle et al., 2008). Tests serve purposes—they are not designed in a vacuum—and trigger the evidence (what test takers demonstrate) from which interpretations are derived. Subsequently, these interpretations are used to make claims and decisions about individuals.

Purpose: Measure a test taker’s proficiency in academic English.
↓
Assessment of: Performance on the TOEFL (Evidence).
↓
Interpretations of: Test taker’s state of academic English in listening, reading, speaking and writing (Theoretical Rationales).
↓
Claim: The student does or does not have sufficient academic English to study at university.
↓
Decision or use: Based on scores from the TOEFL, confer or deny conditional admission for university.

The aforementioned claim and decision must be validated; in other words, TOEFL developers must demonstrate through considerable amounts of research-based data that the claim and decision are valid, namely, logical and true. A similar approach can be used in classroom language assessment, in which the chain of logic as overviewed can be applied (see Bachman & Damböck, 2018; Chapelle & Voss, 2013; Kane, 2012). The following hierarchy is an example of a classroom language assessment for a listening quiz.

Purpose: Identify the students who are learning or having difficulty with listening skills A and B.
↓
Assessment of: Performance on a listening quiz with 20 multiple-choice questions; number of right and wrong answers (Evidence).
↓
Interpretations of: Students’ level of listening comprehension as outlined in the course syllabus (Theoretical Rationale).
↓
Claim: The student who passes the quiz has the listening skills; the student who fails does not.
↓
Decision or use: If all students pass the quiz, they have developed the skills and are ready to develop new listening skills.

To argue for the validity of the aforementioned claim and decision, the teacher using this quiz must present evidence to demonstrate at least the following about the test:

• It is designed to activate skills A and B, and they are from the curriculum objectives.
• It was well designed to activate listening skills A and B.
• It was not designed to activate listening skills C and D.
• The students took the test without disruption; there were no problems with the administration.
• There were no instances of cheating.
• The teacher correctly checked the test and provided the relevant grades accurately: pass or fail.
• The answer key (the document that contains the correct answers) is accurate, namely, all the correct answers really are the correct answers.

To reiterate, validity is about how appropriate, logical and true interpretations and decisions are when based on data from assessment instruments. If students cheated during this quiz, the score might be inflating their listening skills, the teacher is misinterpreting the data (correct answers) and those who passed may not really have the skills. Additionally, the decision to advance to other listening skills in the course is not valid. Notably, if the teacher mistakenly used a test for skills C and D, the interpretations and decisions are not valid, either. The test was not fit for purpose in this particular scenario.

Thus, validity for classroom testing can be likened to the definitions by AERA et al. (2014) and Messick (1989), with some modifications: Validity in classroom language testing depends on how appropriate interpretations and decisions are, based on the data from instruments used to activate the relevant language skills stated in a curriculum. As aforementioned, validity is an abstract concept. To make it practical, teachers can validate the tests they use for accurate interpretations and decisions, which I discuss next.

Validation in Language Testing

Validation is the process of evaluating the validity of a testing system. Validation entails the accumulation of empirical and theoretical evidence to demonstrate that a test has been used as expected and led to corresponding correct uses. Language testing professionals generally refer to validation as the process to estimate the validity of score-based interpretations, decisions and consequences (Bachman, 2005; Carr, 2011; Kane, 2006; Messick, 1994). Particularly, validation in large-scale testing requires the use of considerable amounts of quantitative and qualitative data (Xi & Sawaki, 2017), which in some cases tend to be unnecessary for classroom testing (Brookhart, 2003; Popham, 2017). However, validation must also be acknowledged in classroom contexts because the validity of tests used in the classroom must be accounted for, too (Bonner, 2013; Brown & Hudson, 2002; Popham, 2017).

Specifically, I posit that validation in classroom language testing may help scrutinise the appropriateness of curriculum objectives, the overall quality of tests and the fairness with which students are treated in assessments. The validation schemes for classroom assessment reported in the literature (Bachman & Damböck, 2018; Bonner, 2013; Chapelle & Voss, 2013; Kane, 2012) have tended to be theoretical and offer general principles. However, according to my review of the literature, there are limited resources for language teachers to reflect and act upon the idea of validating the tests they use. Therefore, in the next section of this paper, I offer one possible praxis-based approach for examining the validity of interpretations and decisions as they emerge from using classroom language tests.

One Practical Approach for Validation in Classroom Language Testing

My proposed approach for validation in language classrooms comprises three major stages: The first stage relates to the congruence between curriculum objectives and the design of tests; the second stage is a close analysis of already-made instruments and the use of basic statistics; the last stage collects feedback to examine the consequences of using tests.

Curricular Focus

Scholars in educational measurement in general and those in language testing have argued that tests should reflect the skills, tasks, or content stipulated in a curriculum. This connection is
collectively called content validity (Bonner, 2013; Brown & Hudson, 2002; Douglas, 2010; Fulcher, 2010; Popham, 2017). If instruments collect evidence on students’ standing with respect to curriculum content, this evidence can be used to argue for the validity of an assessment.

Particularly, language teachers should ascertain whether the language skills from a syllabus are language related. For example, in Colombia, language learning is based on national standards stated in a document called Guía 22 (Ministerio de Educación Nacional de Colombia, 2006, p. 22). Next, I present two examples that the document states as learning standards for Reading in English in sixth grade. I include a translation for each standard.

1) Puedo extraer información general y específica de un texto corto y escrito en un lenguaje sencillo.
I can extract general and specific information from a short text written in simple language.
2) Valoro la lectura como un hábito importante de enriquecimiento personal y académico.
I value reading as an important habit for personal and academic enrichment.

At face value, number 1) is a specific reading skill; however, number 2) is a skill that an individual can demonstrate regardless of language. Thus, 1) may be operationalised in a language test, namely, a teacher can create a reading quiz to assess the students’ ability to perform this skill. Number 2) cannot be operationalised in a language test. Of course, the standards are meant to guide learning, teaching and assessment. The main point is that language teachers should observe how connected their language assessment instruments are to the skills of their language curriculum. Therefore, the main recommendation is for teachers to analyse whether the standards (or objectives) in their curriculum are language related, i.e. that they represent language ability. This notion is best encapsulated in this question: Can I design a test that provides me with information on my students’ level/development of this learning standard (or competence) in the English language?

Test Specifications and Fit-to-Spec Analysis.

A practical approach for the curriculum level—and to have evidence for validity—relies on the design of test specifications, test specs for short (Davidson & Lynch, 2002; Fulcher, 2010). A document with specs describes how a test should be designed. Table 1 provides a simple example for a reading test.

Davidson and Lynch (2002) explain that teachers can conduct a fit-to-spec validity analysis. Once the 15 items for the test in Table 1 are designed, teachers can assess whether the items clearly align with the specs. To help teachers achieve this objective, I converted the descriptions in Table 1 into a checklist that teachers can use (Table 2).

Test specs and the results of a fit-to-spec analysis are evidence for validation for three main reasons. First, the specs should naturally be based on the language skills stated in a syllabus, which can then provide evidence for the test’s content validity. Second, the fit-to-spec analysis can unearth problematic items that are either assessing something not stated in the specs (and therefore not in the curriculum) or confusing the students. Finally, problematic items can be changed such that they better reflect the curriculum skills to be assessed. Appropriate specs and congruence between tests and curriculum objectives will most likely contribute to the validity of interpretations and, therefore, the purpose and decisions based on data.

Professional Test Design.

Another test development action in tandem with specs is the principled design of items and tasks. Language testing authors have provided guidelines for the professional design of tests (Alderson, Clapham, & Wall, 1995; Brown, 2011; Carr, 2011; Fulcher, 2010; Hughes, 2002). In particular, Giraldo (2019) synthesises ideas from these authors to provide checklists for the design of items and tasks. Table 3, which I adapted and modified from Giraldo (2019, pp. 129-130), contains descriptors for a checklist that can be used to either design or evaluate a reading or listening test.
Table 1. Sample Test Specifications for a Classroom Reading Test

Purpose & Decision: The purpose of this test is to assess how students are developing the following reading skills. On the basis of the results from this test, the teacher and students can identify what they do well and what they must improve or reinforce before advancing to other reading skills.

Types & length of texts: 1 fable; 1 classical tale (excerpt); 1 person’s narrative account. All texts are between 100 and 150 words.

Table 2. Fit-to-Spec Checklist (column headers: Questions, Yes, No)
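A fit-to-spec analysis of the kind Table 2 supports can also be sketched programmatically. The snippet below is only an illustration of the idea, not part of the article’s materials: the skill labels and item tags are invented. Each item is tagged with the skill it targets, and items targeting a skill absent from the specs are flagged for revision.

```python
# Hypothetical fit-to-spec check: flag items whose target skill is not stated
# in the test specs (and therefore, by extension, not in the curriculum).
# All skill labels and item tags below are invented for illustration.
spec_skills = {
    "extract general information from a short text",
    "extract specific information from a short text",
}

items = [
    {"number": 1, "skill": "extract general information from a short text"},
    {"number": 2, "skill": "extract specific information from a short text"},
    {"number": 3, "skill": "identify the author's tone"},  # not in the specs
]

# Items that do not fit the specs should be revised, replaced, or removed.
misfit_items = [item["number"] for item in items if item["skill"] not in spec_skills]
print(misfit_items)  # [3]
```

A flagged item is not automatically a bad item; as noted above, it may point either to a problem with the item or to a skill the specs (and the curriculum) should state explicitly.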
Table 3. Checklist of Guidelines for a Multiple-Choice Reading or Listening Test

Guidelines (Yes / No):
• The question in Item # __ does not have unknown vocabulary for students.
• All options in Item # __ are plausible, namely, they can be answered only by listening to/reading the text. (If a student can guess the answer without listening or reading, the item is not assessing this construct.)
• The correct answer (the key) for Item # __ really is the correct answer.
• Item # __ is assessing one of the skills described in the test specs.
be that the diagnostic instrument yielded useful data to examine the validity of interpretations and decisions.

• Calculate mode, median and mean. The two teachers can observe the mode score, the median score and the mean score for all students. If the mode were 2.0, then the students with 2.0 are ready for Level III; if the median is 3.5, then 50% of students are ready for Level III and 50% seem to have the skills stated in the learning objectives for the course. Finally, if the mean (the average of all the 30 scores) is 4.0, the group has the speaking skills for Level III. Notably, high scores (5.0) may inflate the mean; thus, analysis of specific cases (e.g. low, failing scores) is warranted.

• Calculate mean and standard deviation. These two statistics are useful when analysed together. If the mean for the group of 30 students is 2.5 and the standard deviation (the average distance of every score from the mean) is 0.2, then scores cluster closely around the mean: some students scored about 2.7 and others about 2.3. On the basis of this standard deviation, students are observed to have a similarly low level of speaking, interpreted as the group being ready for Level III. If the mean and standard deviation are 4.4 and 0.2, respectively, the group has the speaking skills for Level III. If the mean were 3.5 and the standard deviation for this particular test were 1.0, two phenomena are possible: The students have widely different levels of speaking, or there was little consistency in the assessment, as I explain next.

• Calculate the agreement coefficient and kappa for consistency. These two statistics help present the extent of the agreement between two test administrations, two raters, or two score-based decisions such as pass and fail. In the aforementioned diagnostic test example, suppose the two teachers assessed each student at the same time, so each student received two scores. If the agreement coefficient is 70%, the two teachers made the same decisions (pass or fail) in 70% of the cases (21 students). The performance of the other 30% (9 students) needs to be revised. If kappa, a calculation that corrects the agreement coefficient for chance agreement, is 85%, the agreement level between the two teachers is very high (Fulcher, 2010). Consistency in this scenario can be interpreted as the two teachers using the rubric accurately: They understood the constructs (e.g. grammar accuracy, fluency) and assessed them fairly while they heard students speaking during the interview.

• Calculate means and standard deviations in a differential groups study (Brown & Hudson, 2002). This type of study requires a somewhat higher level of sophistication than the previous calculations. The two teachers can use the same interview and corresponding rubric with students who are in the Level IV Speaking Course and compare their performance with the means and standard deviations of the students about to start Level III. The assumption in this case is that students in Level IV should pass the interview because they already have the skills presented in Level III: Their mean should be high and their standard deviation low. Both the mean and standard deviation for the students about to start Level III should be low. If a high percentage of students in Level IV fail the diagnostic interview for Level III, the instrument must be investigated, and the validity of inferences and decisions from it must be questioned. Perhaps determining what occurred during the Level III course is necessary.

The statistical calculations in the aforementioned speaking scenario provide information on students’ speaking skills vis-à-vis the Level III course. For validation purposes in general, statistics can be used to argue for the validity (or lack thereof) of language tests. For example, if in the aforementioned testing scenario kappa is low (20% or less), the two teachers disagreed widely and, therefore, interpretations and decisions cannot be trusted: they are not valid. The central point is that for statistics to help with validation, they must be interpreted against the constructs and the purposes for which a test is used.
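To make the calculations above concrete, the following sketch computes the descriptive statistics, the agreement coefficient, Cohen’s kappa and a differential groups comparison for a hypothetical version of the diagnostic speaking scenario. The scores, the 3.0 passing score and the two-teacher data are invented for illustration; they are not figures from this article.

```python
# Illustrative calculations for a hypothetical diagnostic speaking test
# scored from 1.0 to 5.0 by two teachers. All data are invented.
from statistics import mean, median, mode, pstdev

teacher_a = [2.0, 3.5, 4.0, 2.0, 4.5, 3.0, 2.0, 5.0, 3.5, 4.0]
teacher_b = [2.5, 3.5, 4.0, 2.0, 4.0, 2.5, 2.0, 5.0, 3.0, 4.5]

print(mode(teacher_a))             # 2.0: the most frequent score
print(median(teacher_a))           # 3.5: half the group scored at or below this
print(round(mean(teacher_a), 2))   # 3.35: group average; a 5.0 can inflate it
print(round(pstdev(teacher_a), 2)) # 1.03: scores are widely spread out

# Agreement coefficient: proportion of identical pass/fail decisions,
# assuming a (hypothetical) passing score of 3.0.
def decisions(scores, cut=3.0):
    return ["pass" if score >= cut else "fail" for score in scores]

decisions_a, decisions_b = decisions(teacher_a), decisions(teacher_b)
agreement = sum(a == b for a, b in zip(decisions_a, decisions_b)) / len(decisions_a)

# Cohen's kappa: agreement corrected for the agreement expected by chance.
def cohens_kappa(d1, d2):
    n = len(d1)
    observed = sum(a == b for a, b in zip(d1, d2)) / n
    expected = sum((d1.count(c) / n) * (d2.count(c) / n) for c in set(d1 + d2))
    return (observed - expected) / (1 - expected)

print(agreement)                                         # 0.9
print(round(cohens_kappa(decisions_a, decisions_b), 2))  # 0.78

# Differential groups sketch: a (hypothetical) group already in Level IV
# should show a higher mean than the group about to start Level III.
level_four = [4.0, 4.5, 4.0, 5.0, 4.5, 4.0, 3.5, 4.5, 5.0, 4.0]
print(mean(level_four) > mean(teacher_a))  # True
```

As stressed above, the numbers matter only against the constructs and purposes of the test: a kappa of 0.78 here would support consistency between the raters, whereas a low kappa would call the interpretations and decisions into question.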
Cognitive Validation

Authors such as Bonner (2013) and Green (2014) have suggested that teachers ask students for insights into assessment processes and instruments or observe students as they take tests. The idea of cognitive validation is to stimulate students’ thinking and reflection regarding language assessment. Bonner, for example, recommends the use of think-alouds, observations and interview protocols to tap into students’ cognition. For example, teachers can ask students the following questions (in an oral interview or written open survey) to collect evidence for the validity of interpretations and decisions:

1. How did you feel while [writing your narrative text]?
2. What skills do you feel the [narrative task] was assessing? Do you feel you had the opportunity to demonstrate these skills on this test?
3. If anything, what was difficult for you in this [narrative task]?

For ease of use, the three questions can be asked in the language with which students are most comfortable. The answers can then be used to investigate the validity of a given instrument. For instance, if a student feels the instructions for a task were difficult to understand, and the teacher notices that his/her performance was poor, maybe the instructions caused the poor performance. In this case, interpretations and decisions must be challenged and studied carefully. If students report that the instructions were clear and they performed well, this piece of evidence supports the validity of interpretations and decisions. Similarly, if students’ answers to question 2 reflect what the test specs stipulate, this observation can also be used as evidence.

Analysis of Consequences.

Generally, assessments should lead to beneficial consequences, especially when assessments are used for instructional purposes (Bachman & Damböck, 2018; Green, 2014; Kane & Wools, 2019). By and large, the consequence of classroom language testing should be improved language learning. Thus, a final proposed action for validating classroom language tests is to analyse their consequences. Table 4 presents a list of categories related to purposes for classroom language testing, with proposed courses of action.

As Kane and Wools (2019) reiterate, classroom assessments should be useful in attaining instructional purposes, and their validity should be assessed on the extent to which these objectives are fulfilled. The proposed questions for a consequential analysis in Table 4 might help teachers evaluate the reach and usefulness of their tests.

The steps in the proposed practical approach for validating classroom language tests, summarised in
Table 4. Purposes for Classroom Language Testing and Proposed Courses of Action

Diagnostic: After providing feedback on the diagnostic, ask students and teachers in the corresponding courses how students are feeling/doing. For example: If the diagnosis stated that the student needed to be in the course, she/he should feel fine in it. Is she/he improving language?

Progress: If, after a progress test, students require additional emphasis on a particular language skill, provide the necessary review/reinforcement tasks and ask students whether the tasks are helping them with the areas that need attention.

Achievement: For students who failed the test and had to repeat the course: To what extent are you now improving the language skills for this course? For students who passed the test and are now in a new course: To what extent do you feel prepared for this course? Are you doing well? Do you feel you learned the skills/contents from the last course? To the teacher: To what extent do you feel these students are prepared for this course? Are they doing well? Do you feel students achieved the learning objectives from the last course?
literacy –LAL– (Fulcher, 2012; Inbar-Lourie, 2017). In other words, teachers may need a satisfactory understanding of theoretical knowledge and skills for language testing, dimensions understudied in language education programmes (Giraldo, 2018; Herrera & Macías, 2015; López & Bernal, 2009; Vogt & Tsagari, 2014). For example, teachers must know how to calculate and, most importantly, interpret statistical information to evaluate validity in testing. As a recommendation for promoting LAL, teachers may use language testing textbooks or online resources; some of these are open source, for example, the TALE Project (Tsagari et al., 2018), which includes a handbook to study language assessment issues.

Limitations

in observing what real-life tasks individuals can perform using language (Long, 2015). Thus, in classrooms where task-based language assessment is the guiding methodology, other approaches to validation are warranted.

Finally, a limitation of the validation approach I discuss is that statistical analyses may not be a common topic for language teachers and may require further LAL, as aforementioned. As I state in this paper, validation is about collecting evidence from various sources, and statistics is only one of them. Language teachers attempting to validate classroom tests should, ultimately, analyse their expertise for their validation schemes for a given test and related purpose. The present proposal may be a guide for where to start their validity endeavour.
References

Kane, M. (2006). Validation. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). American Council on Education and Praeger.

Kane, M. (2012). Articulating a validity argument. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 34-47). Routledge.

Kane, M., & Wools, S. (2019). Perspectives on the validity of classroom assessments. In S. Brookhart & J. McMillan (Eds.), Classroom assessment and educational measurement (pp. 11-26). Routledge.

Lado, R. (1961). Language testing: The construction and use of foreign language tests. McGraw Hill.

Llosa, L. (2007). Validating a standards-based classroom assessment of English proficiency: A multitrait-multimethod approach. Language Testing, 24(4), 489-515. https://doi.org/10.1177/0265532207080770

Long, M. (2015). Second language acquisition and task-based language teaching. John Wiley and Sons, Inc.

López, A., & Bernal, R. (2009). Language testing in Colombia: A call for more teacher education and teacher training in language assessment. Profile: Issues in Teachers’ Professional Development, 11(2), 55-70.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). Macmillan.

Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23, 13–23. https://doi.org/10.3102/0013189X023002013

Ministerio de Educación Nacional de Colombia (2006). Estándares básicos de competencias en lenguas extranjeras: Inglés. Formar en lenguas extranjeras: ¡el reto! Lo que necesitamos saber y saber hacer. Imprenta Nacional.

Norris, J. (2016). Current uses for task-based language assessment. Annual Review of Applied Linguistics, 36, 230–244. https://doi.org/10.1017/S0267190516000027

Pardo-Ballester, C. (2010). The validity argument of a web-based Spanish listening exam: Test usefulness evaluation. Language Assessment Quarterly, 7(2), 137-159. https://doi.org/10.1080/15434301003664188

Popham, J. (2003). Test better, teach better: The instructional role of assessment. Association for Supervision and Curriculum Development.

Popham, J. (2017). Classroom assessment: What teachers need to know (8th ed.). Pearson.

Tsagari, D., Vogt, K., Froelich, V., Csépes, I., Fekete, A., Green, A., Hamp-Lyons, L., Sifakis, N., & Kordia, S. (2018). Handbook of assessment for language teachers. Retrieved from http://taleproject.eu/

Vogt, K., & Tsagari, D. (2014). Assessment literacy of foreign language teachers: Findings of a European study. Language Assessment Quarterly, 11(4), 374-402. https://doi.org/10.1080/15434303.2014.960046

Xi, X., & Sawaki, Y. (2017). Methods of test validation. In E. Shohamy, I. G. Or, & S. May (Eds.), Language testing and assessment: Encyclopedia of language and education (3rd ed., pp. 193-210). Springer. https://doi.org/10.1007/978-3-319-02261-1_19