A Review On Validating Language Tests: Dinh Minh Thu
Abstract: Along with reliability, validity has long played a fundamental role in language testing and assessment research (Bachman & Palmer, 1996). This paper analyses basic theories and empirical research on language test validity in order to present the notion of language test validity, its classification, the validation frameworks and the trends in empirical research. Four key findings emerge from the analysis. Firstly, language test validity refers to an evaluative judgment of the quality of a language test, grounded in evidence from the integrated components of test content, criterion and consequences, through the interpretation of the meaning and utility of test scores. Secondly, construct validity is the dominant term in modern validity classification. The chronological division of construct validity into a priori and a posteriori validity can help researchers choose a validation option more clearly. Thirdly, test validation can be grounded in the frameworks of Messick (1989), Bachman and Palmer (1996) and Weir (2005). Finally, almost all the empirical research on test validity that the researcher has examined concerns international and national high-stakes proficiency tests. These results reveal gaps for future test validation research.
Keywords: language assessment, test usefulness, construct validity, validation
D.M. Thu / VNU Journal of Foreign Studies, Vol.35, No.1 (2019) 143-154
* Tel.: 84-912362656
Email: minhthu.knn.dhhp@gmail.com
1. Introduction
Testing and assessment, hereafter shortened to assessment, has been a mainstream concern in global language education for several decades (Bachman, 2000). Bachman & Palmer’s (1996) framework of test usefulness has served as a fundamental basis for professional English language test development, implementation and evaluation all over the world. It combines six components, namely reliability, validity, authenticity, interactiveness, practicality and impact. Within this combination, validity is argued to be the dominant factor in ensuring that a test measures what it claims to measure (Messick, 1989; Bachman, 1995; Hughes, 2003; Borsboom, Mellenbergh, & Van Heerden, 2004). Language test validity and validation have been investigated worldwide, largely in light of the validation theories proposed by Messick (1989), Bachman & Palmer (1996) and Weir (2005). These theories require test developers to articulate the validity of their tests. Despite its significance, the matter has only recently become a concern of Vietnamese assessment researchers and practitioners (Trần, 2011; Nguyễn, 2017; Bùi, 2016; Vũ, 2016; Nguyễn, 2018; Nguyễn, 2018). This paper aims to raise Vietnamese English language teachers’ awareness of language test validity, which can have a positive impact on their testing practice.
Four research questions are raised:
1. What is the concept of language test validity?
2. What are the types of language test validity?
3. How can a language test be validated?
4. What has previous empirical research on test validation revealed?
The research begins with the theoretical background of testing and assessment. Then, through content analysis and practical experience, the author presents the concept, the classification, the validation frameworks and the results of research on validity.
2. Methodology
This secondary research is conducted analytically: the researcher draws on available sources of information to evaluate the research problem critically (Kothari, 2004, p. 3). To reach a unified definition and classification of validity, the researcher surveys the prevailing relevant literature. Findings on test validation frameworks and empirical studies are obtained by the same method. The data in this study come from both objective and subjective reflective sources.
3. Theoretical background
3.1. Language testing and assessment
Testing and assessment are two terms in common parlance. While tests are defined as “a method of measuring a person’s ability, knowledge or performance in a given domain” (Brown, 2003, p.3), assessment is referred to as an “ongoing process” (Cizek, 1997) or “an ongoing strategy” (Brown, 2004). Cizek’s (1997) definition of assessment is adopted here for its relative completeness:
1. the planned process of gathering and synthesizing information relevant to the purposes of (a) discovering and documenting students’ strengths and weaknesses, (b) planning and enhancing instruction, or (c) evaluating progress and making decisions about students;
2. the process, instrument or method used to gather the information. (p.10)
Assessment is an umbrella term that includes tests among diverse educational practices (Brown, 2003). Popham (2002, p. 4) adds that “Educational assessment is a formal attempt to determine students’ status with respect to educational variables of interest.” The process is “formal” because it takes place professionally and systematically in the classroom context. The phrase “educational variables of interest” suggests the acceptance of variation in degrees of knowledge, learning styles and attitudes. Therefore, assessing learners’ abilities demands that teachers be open-minded enough to accept diversity while keeping the learning goals in view and ensuring equity among learners. Echoing this view, McTighe (2014, p. 2) claims that assessment should (1) serve learning, (2) use diverse measurement tools, (3) align with goals, (4) measure what matters, and (5) be fair. To reach these goals, the quality of tests must be ensured.
3.2. The quality of a good language test
Bachman and Palmer (1996, p.18) proposed a framework of test usefulness, or test qualities. It combines six components, as presented below:
Usefulness = Reliability + Construct validity + Authenticity + Interactiveness + Impact + Practicality
To put it simply, reliability is the consistency of test results across testing occasions. Before construct validity is defined, the notion of a construct should be presented. A construct is the specific definition of an ability that is used as the basis for designing a test task
and interpreting scores gained from the task. Construct validity is the degree to which a test score can be interpreted and generalised accurately as an indicator of the ability being measured. Authenticity refers to the correspondence between the test tasks and target language use. Interactiveness pertains to the engagement of test takers when performing the test. Impact means the effect of the test on stakeholders such as learners, teachers, authorities and parents. Lastly, a test is practical when the resources for developing, implementing and conducting it are available and applicable.
To reach the target of usefulness, the test developer must identify a specific test purpose, a specific test taker and a target language use domain. These qualities are integrated, although their relative weight can vary across contexts. A high-stakes test puts more emphasis on reliability and validity, while a classroom test can have more elements of authenticity, interactiveness and impact. Reliability and validity are the core measurement qualities of a test because they are closely reflected in the score interpretation, while the remaining components concern more the societal aspects of a test.
4. Findings and discussion
Validity in a language test was first mentioned by Lado (1961). He claims that a test is valid if it measures what it purports to measure. This claim is general and hard to evaluate. The American Psychological Association (1995, p.9) makes it clearer that validity is “the appropriateness, meaningfulness, and usefulness of the specific inferences made from the test scores” (cited in Bachman, 1995, p. 243). Here the test score plays the central role in evaluating the validity of the test. Hughes (1989) considers a test valid if it measures “accurately what it is intended to measure” (p. 22). In the same year, Messick (1989, p. 245) designates validity “an overall evaluative judgment of the degree to which empirical and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores and other modes of assessment”. He relates construct validity to the social consequences of testing, which can exert positive or negative washback on users, because it determines the meaningfulness, appropriateness and usefulness of the test through the interpretation of the test score. Another approach to test validity is to treat it as the property of the test being evaluated rather than as a judgment about the test (Borsboom et al., 2004).
[...] other means besides test scores. This new perspective is elaborated in the next part.
4.2. A combined framework of validity
Validity standards made their debut in 1954, when the American Psychological Association distinguished four forms, namely predictive validity, concurrent validity, content validity and construct validity (Shepard, 1993). Predictive validity is observed after test administration and concerns how well the test predicts future performance, while concurrent validity refers to the agreement between the test score and the criterion of an already-accepted test. Content validity concerns the comparison between test specifications and test contents. Among the types of validity, construct validity is the most complicated; its evidence comes from comparing the item whose validity is to be established with an item assumed to be valid. Here it is important to clarify one key concept in the field of testing and assessment, the “construct”. It is the definition of a specific ability used as the basis for designing a test task and interpreting scores gained from the task (Bachman & Palmer, 1996). Hence, construct validity denotes the degree to which a test score can be interpreted and generalised accurately as an indicator of the construct, or the ability being measured (Bachman & Palmer, 1996). Construct validity is evaluated both qualitatively and quantitatively (Bachman & Palmer, 1996; Messick, 1998; Weir, 2005).
In 1966, the Association revised the validity structure into a “Trinitarian doctrine” comprising construct validity, content validity and criterion-related validity (which combines predictive validity and concurrent validity) (Shepard, 1993). Lado (1961) and Davies (1968) add the element of face validity, which is judged from the appearance of the test, to content validity. From another angle, Campbell and Stanley (1966) introduce internal validity and external validity. The former is a vital quality, shown through the analysis of the test content, whilst the latter concerns the generalisability of the test, that is, whether it can be applied to different contexts on the basis of the test score. External validity is said to belong to criterion validity. Alderson, Clapham and Wall (1995) echo the classification of validity into internal and external classes and label external validity as criterion-oriented validity.
D’Este (2012) reviews Messick’s (1989) new contribution to the validity framework, which is grounded in the test score. The new unified framework is composed of two facets: one is the source of test justification, from “either evidence or consequence”, and the other concerns the function of the test outcome, through “interpretation or use” (Messick, 1989, p. 20).
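Concurrent validity, described earlier in this section as the agreement between scores on the test under study and scores on an already-accepted criterion test, is often quantified with a correlation coefficient. The following is a minimal sketch under that assumption, not a procedure prescribed by any of the frameworks cited here; the function name `pearson_r` and both score lists are hypothetical illustrations, not data from any study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores of the same eight test takers on two tests
new_test_scores = [52, 61, 70, 45, 88, 77, 64, 59]           # test under validation
criterion_scores = [5.0, 5.5, 6.5, 4.5, 8.0, 7.0, 6.0, 5.5]  # accepted criterion test

r = pearson_r(new_test_scores, criterion_scores)
print(f"concurrent validity coefficient r = {r:.3f}")
```

A coefficient close to 1 would suggest that the two tests rank test takers similarly; what counts as high enough is a judgment that depends on the purpose of the test.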
[...] developers, test users and test takers, it can be said that content validity leads to washback on test preparation by making test takers familiar with the test and reducing their anxiety (Messick, 1996, p. 6). Positive washback can be enhanced by a valid test (Morrow, 1986; Alderson & Wall, 1993; Frederiksen & Collins, 1989; cited in Messick, 1996). That is why it is important to find evidence of validity in a test. Through Messick’s (1989) lens, general validity consists of six aspects: the content aspect, the substantive aspect, the structural aspect, the generalizability aspect, the external aspect and the consequential aspect. Except for the content and structural aspects, the four remaining aspects pertain to the interpretation of the test score. The content aspect is shown in the relationship between the content relevance and representativeness of technical quality like the appropriate reading level. The substantive aspect includes both the theoretical ground and empirical evidence [...]
The consequential validity of the test refers to the evaluation of both intended and unintended consequences of score interpretation and use, both in the present and in the future, with evidence of bias in scoring and interpretation and of the positive or negative influence on classroom instruction and knowledge acquisition (Messick, 1996, p. 13). Interestingly, Messick (1996, p.14) claims that the validity of a test should be investigated as an assumed basis for washback.
According to Bachman (1995, pp. 244-256), a framework of validity comprises content validity (actualized by content relevance and content coverage), criterion validity (shown through concurrent validity and predictive validity) and construct validity (revealed by the meaningfulness of the construct). The consequential basis of validity is discussed in Bachman (1995) in that a test is not designed for its own sake but for an intended consequence. Consequential validity signals the shift from a technical, empirical and logical focus to a focus on test use or policy (Bachman, 1995).
Weir (2005), another significant theorist of validity, categorises validity, understood as construct validity, according to temporal sequence; the two major types are therefore a priori validity and a posteriori validity. The former can be investigated before the test event and embraces theory-based validity and context validity. By comparison, the latter accumulates evidence during and after the test event and is divided into scoring validity, criterion-related validity and consequential validity. Theory-based validity emphasizes the test developers’ knowledge of theories pertaining to the underlying language processes for real-life application (Weir, 2005, p.18). Context validity is traditionally referred to as content validity, but Weir uses this more modern term to cover both the test contents and the test administration setting (Weir, 2005, p.19). Scoring validity measures the stability of the test results over time “in terms of the content sampling and free from bias” (Weir, 2005, p.23). In this sense, scoring validity is popularly known as reliability. The two sub-types of criterion-related validity and consequential validity echo Bachman (1990) and Messick (1989).
From the above discussion, it can be concluded that validity is a very complicated concept and that the labels of validity types can overlap across different authors’ perspectives. Nonetheless, the term “construct validity” has played the key role since the beginning of the classification of validity and continues to hold a privileged place (Messick, 1989; 1996; Shepard, 1993; Weir, 2005; Bachman, 1995). So far, the framework of validity can be visualized as follows:
3. To what extent does the test task reflect the construct definition?
4. To what extent do the scoring procedures reflect the construct definition?
5. Will the scores obtained from the test help make the desired interpretations about test takers’ language ability?
6. What characteristics of the SETTING are likely to cause different test takers to perform differently?
7. What characteristics of the test RUBRIC are likely to cause different test takers to perform differently?
8. What characteristics of the TEST INPUT are likely to cause different test takers to perform differently?
9. What characteristics of the EXPECTED RESPONSE are likely to cause different test takers to perform differently?
10. What characteristics of the RELATIONSHIP BETWEEN INPUT AND RESPONSE are likely to cause different test takers to perform differently?
(pp. 140-142)
As the questions indicate, the test construct is the primary concern: it must be clearly defined and relevant to the test purpose, the test tasks and the score interpretation. Regarding test performance, the test setting, test rubric, test input, expected response, and the relationship between the test input and the test response are also worth consideration. Variations in these factors are likely to sway the test results, which makes it hard to reach an appropriate conclusion about test takers’ language ability. For example, test takers are likely to perform better on a listening test if they are in a sufficiently small room with a sufficiently loud recording. These theories sound reasonable and can be applied to validating tests in various educational areas. However, the details need more discussion; otherwise, a language test will require a more specific set of questions to be answered.
Weir (2005) proposes four socio-cognitive validation frameworks corresponding to the four language skills, at two phases, before and after the test event. As previously mentioned, Weir’s (2005) classification of validity embraces five types: context validity, theory-based validity, scoring validity, criterion-related validity and consequential validity. The same structure is depicted for validating the assessment of each skill, starting from test taker characteristics and moving to the first two types, context validity and theory-based validity. From theory-based validity, responses are collected for scoring validity, which is based on the test score, followed by consequential validity and criterion-related validity.
The four language skills follow the same validation procedure, as shown below:
Figure 3. A socio-cognitive validation framework of language skills (adapted from Weir, 2005)
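The flow described for Weir's (2005) framework (test taker characteristics, then the two a priori types of validity, then the three a posteriori types) can also be written down as a small ordered structure. This is only an illustrative sketch of the sequence as summarised in this review; the names `WEIR_FLOW` and `validation_checklist` are hypothetical and are not part of Weir's framework.

```python
# Hypothetical encoding of the validation sequence summarised in this review.
# Keys mark the phase; lists keep the order in which evidence is gathered.
WEIR_FLOW = {
    "a_priori": ["context validity", "theory-based validity"],
    "a_posteriori": [
        "scoring validity",
        "consequential validity",
        "criterion-related validity",
    ],
}


def validation_checklist(skill):
    """Return an ordered list of validation steps for one language skill."""
    steps = [f"{skill}: describe test taker characteristics"]
    for phase, validity_types in WEIR_FLOW.items():
        if phase == "a_priori":
            timing = "before the test event"
        else:
            timing = "during and after the test event"
        for validity_type in validity_types:
            steps.append(f"{skill}: gather evidence of {validity_type} ({timing})")
    return steps


# The same procedure is applied to each of the four skills
for skill in ["listening", "reading", "speaking", "writing"]:
    for step in validation_checklist(skill):
        print(step)
```

Keeping the phases as separate lists makes it easy to see which evidence can be collected before a test is administered and which depends on scores.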
In all the above validation suggestions, the quality of the test construct, test input and test characteristics is studied first. The correlation between the test input, response and scoring is investigated to validate the test. Messick (1975; cited in Shepard, 1993, pp. 414-415) did not consider the role of content validity because of the traditional view that validity comes from the test score. Nonetheless, this view has changed (Yalow & Popham, 1983; Messick, 1989, 1996; Shepard, 1993; Bachman, 1990; Weir, 2005) on the ground that content validity functions as a precursor to appropriate score interpretations. Therefore, content validity deserves serious investigation prior to the implementation of the test.
As presented, concerning language tests, while validity is widely discussed in terms of its definitions and aspects, validation procedures remain limited despite their complexity.
4.4. A review of large-scale test validation studies
High-stakes language tests have been validated by international and local researchers using both qualitative and quantitative approaches. High-stakes tests such as university entrance or placement tests, IELTS and TOEFL are widely investigated (Fulcher, 1997; Tran et al., 2010; Ito, 2001; Bui, 2016; Zahedkazemi, 2015), alongside tests measuring tertiary students’ achievement or language proficiency at individual universities (Rethinasamy & Nong, n.d.; Choi, 1993; Hiser & Ho, 2016; Graves, 1999; Sims, 2015; Choi, 1999; Zahedkazemi, 2015; Trần, 2011). Both positive and negative findings have been reported. Bachman’s (1990) and Messick’s (1989) validation frameworks are widely used.
Consider the first stream, the validation of entrance tests. According to Hitotuzi (2002), the entrance examination of the Federal University of Amazonas lacks both face and content validity. Spelling and grammar mistakes are found. The test is criticised for its complex syntax and lexis, which disregard normal language education at high school. Bui (2016) investigated the test usefulness of Vietnam’s College English Entrance Exams (VCEEE), comparing the two tests of 2014 and 2015. She also uses Bachman and Palmer’s (1996) model of language knowledge to validate the test. It is reported that validity is supported by the gap-filling and cloze test methods, but multiple-item test methods, error detection and synonym/antonym selection cause problems in interpreting test takers’ ability correctly. In addition, multiple-choice questions are the sole test method in the old version, a weakness remedied by the subjective writing parts of sentence rewriting and paragraph rewriting. Zahedkazemi (2015) conducts a construct validation of sub-tests of two global tests, IELTS and TOEFL, based on the test scores. The results show that the two tests have both differences and similarities in gauging test takers’ language proficiency. Tran et al. (2010) built the conceptual framework and methodology for validating the interpretation and use of the 2008 University Entrance Examination English test scores, using Messick’s (1989) unified validation framework. Content analysis, Rasch modelling and path analysis make up the detailed methodology.
The second stream also records interesting cases of validation. Choi (1993) examines the content and construct validity of a criterion-referenced English proficiency test in order to arrive at a valid standardized test, the Seoul National University Criterion-Referenced English Proficiency Test (SNUCREPT). Bachman’s (1990) framework of communicative language
ability is exploited. Qualitative and quantitative approaches, using native speakers and computational tools respectively, are mixed. He claims that systematic development of the test can satisfy the validity and reliability of the test. Choi (1999) validates the Test of English Proficiency (TEPS), developed and used at Seoul National University, by collecting both qualitative and quantitative feedback from test takers on the pilot test and the first administered test to examine the validity and fairness of the test. He compares the test under study, TEPS, with a test accepted as valid, TOP (Test of Oral Proficiency). The test scores are analysed, along with interviews with respondents who obtained higher TEPS scores, had TOEIC scores available and are teachers of English. In terms of the test score analysis, a high correlation is found between the data from the two tests, illustrated by correlation coefficients of over .63. Regarding the interview results, 42.7% of respondents strongly agree on the test method and fairness of the test. Ito (2001) validates the Joint First Stage Achievement Test (JFSAT), a Japanese nationwide university entrance examination, by investigating the reliability, concurrent validity, criterion validity and construct validity of the test, which is divided into five components: pronunciation, grammar, spoken English, written English and reading comprehension. The findings reveal that, apart from the low reliability coefficient of the paper-and-pencil pronunciation test (r = .208), the other figures show the JFSAT to be a relatively valid test of English ability. In terms of the construct validity of the test, low correlation coefficients are found for pronunciation (r = .238, n.s.) and spoken English (r = .600, compared with the required criterion of r > .7). The pronunciation score contributes very little to the overall score.
In Vietnam, Trần (2011) seeks evidence of the content validity of an English achievement test for second-year non-English-major university students, using survey questionnaires for both teachers and students. The study reveals an unsatisfactory level in some parts of the test due to insufficient preparation in designing the test specifications and writing the instructions for each part. Hoang (2009) reports the same results in terms of test specifications. Rethinasamy & Nong (n.d.) study the validity of the Advanced Educational Program English Test (AEPET) at a university in Vietnam in three respects: concurrent validity, predictive validity and content validity. IELTS scores are used to establish concurrent validity. AEPET scores in four components, listening, reading, speaking and writing, are used to validate the test content, revealing a high degree of validity in the speaking and reading tests and a moderate degree in the two remaining tests. The overall mean score is also moderate, at 3.35. Test preparation is included in the content validation, which shows an insufficient amount of instruction. Although the problem identified in the paper is interesting, the authors have not provided details of the validation method. Consequently, the discussion of results remains superficial. Nguyễn (2017) studies the cut-score validity of the VSTEP.3-5 listening test using Kane’s (2006) argument-based validation approach, focusing on test tasks, accuracy and precision, and cut scores. Findings show that the test tasks follow the test specification strictly and the language input largely meets the requirements. In terms of precision and accuracy, on the whole, the test can discriminate among test takers to a reasonable extent. The Angoff method and the Bookmark method are used to set the cut scores. By comparison with the expected reliability of
at least 0.88, the VSTEP listening test reliability index is 0.815, which is rather low.
All in all, a look into the empirical research on language test validity points out three pivotal matters. Firstly, in terms of methodology, both quantitative and qualitative approaches are used. Scoring validity, for example, suits the former while the latter applies to content validity. Secondly, high-stakes international and national language tests are the subjects of the studies. Last but not least, validation mainly addresses posterior or external validity.
5. Conclusion and pedagogical implication
So far, the four research questions have been answered. A language test can claim validity when it measures exactly the test taker’s language ability as actualized by the test construct. In the past, construct validity was distinguished from content validity and criterion validity, but the modern view treats construct validity as the umbrella concept and classifies validity into more types. The idea of a priori validity and a posteriori validity proposed by Weir (2005) is worth considering. Weir’s (2005) validation model is also very interesting and specific to language tests, covering the four language skills. In Vietnam, test validation also largely pertains to high-stakes tests, especially the newly designed national test VSTEP (Vietnamese Standardised Test of English Proficiency) at the University of Languages and International Studies, Vietnam National University, Hanoi. Testing has never ceased to be a public concern. However, almost all important tests have not been validated. English gate-keeping tests at universities and the university entrance English exams in 2017 and 2018, for example, all deserve validation. In addition, there remains a gap in the literature on using Weir’s (2005) models to gauge the validity of an overall internally-developed achievement test. More importantly, the result of validity will serve as evidence for washback, as Messick (1996, p.252) claims: “rather than seeking washback as a sign of test validity, seek validity by design as a likely basis for washback”. He also adds that all tests are in danger of construct irrelevance and construct under-representation. Compared with the theories of validity, empirical research has not covered all aspects. It is impossible to reach full validation, but recommendations for increasing the degree of validity include (1) making test specifications explicit, (2) maximizing direct testing, (3) closely linking the scoring of responses to the test purpose, and (4) ensuring test reliability (Hughes, 1983).
References
Bachman, L. F. (1995). Fundamental Considerations in Language Testing (Third Edition). Oxford: Oxford University Press.
Bachman, L. F. (2000). Modern language testing at the turn of the century: Assuring that what we count counts. Language Testing, 17(1), 1–42. Retrieved from https://doi.org/10.1177/026553220001700101
Borsboom, D., Mellenbergh, G. J., & Van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061–1071. Retrieved from https://doi.org/10.1037/0033-295X.111.4.1061
Bui, T. S. (2016). The Test Usefulness of the Vietnam’s college English Entrance Exam (Master’s Thesis). Korea University, Seoul.
Choi, I. (1993). Construct Validation Study on SNUCREPT (Seoul National University Criterion-Referenced English Proficiency Test). Language Research, 29(2), 243–275.
Choi, I. (1999). Test Fairness and Validity of the TEPS. Language Research, 35(4).
D’Este, C. (2012). New views of validity in language testing. EL.LE, 1(1), 61–76.
Fulcher, G. & Davidson, F. (2007). Language Testing and Assessment - an advanced resource book. London and New York: Routledge.
Fulcher, G. (1997). An English language placement test: Issues in reliability and validity. Language Testing, 14(2), 113–139. Retrieved from https://doi.org/10.1177/026553229701400201
Graves, K. (1999). Validity of the secondary level English Proficiency test at Temple University - Japan. Princeton, NJ: Educational Testing Service.
Hiser, E. A. & Ho, K. S. T. (2016). C-Tests in Vietnam: An Exploratory Study of English Proficiency. Electronic Journal of Language Teaching, 13(2), 184–202. Retrieved from http://e-flt.nus.edu.sg/
Hughes, A. (2003). Testing for Language Teachers. Australian Review of Applied Linguistics, 27. Retrieved from https://doi.org/10.1017/CBO9780511732980
Ito, A. (2001). A Validation Study on the English language test in a Japanese Nationwide University Entrance Examination. Asian EFL Journal, 7(2), 11–33.
Kothari, C. R. (2004). Research methodology: methods and techniques (Second revision). New Age International Publishers. Retrieved from https://doi.org/http://196.29.172.66:8080/jspui/bitstream/123456789/2574/1/Research%20Methodology.pdf
Messick, S. (1989). Meaning and Values in Test Validation: The Science and Ethics of Assessment. Educational Researcher, 18(2), 5–11. Retrieved from https://doi.org/10.3102/0013189X018002005
Messick, S. (1996). Validity and washback in language testing. Language Testing. Retrieved from http://ltj.sagepub.com/content/13/3/241.short
Nguyen, T. N. Q. (2018). A study on the validity of VSTEP writing tests for the sake of national and international integration. VNU Journal of Foreign Studies, 34(4), 115–129.
Nguyen, T. P. T. (2018). An investigation into the content validity of a Vietnamese standardised test of English Proficiency (VSTEP.3-5) Reading Test. VNU Journal of Foreign Studies, 34(4), 129–143.
Nguyen, T. Q. Y. (2017). Summary of doctor dissertation and investigation into the cut-score validity of the VSTEP.3-5 listening test. University of Languages and International Studies, Vietnam National University, Hanoi. Retrieved from http://saudaihoc.ulis.vnu.edu.vn/files/uploads/2017/12/Tom-tat-TA.pdf
Rethinasamy, S. & Nong, T. H. H. (n.d.). Investigating the validity of the advanced educational program English test of Vietnam with IELTS: Implications for quality management of in-house test. Universiti Malaysia.
Shepard, L. A. (1993). Chapter 9: Evaluating Test Validity. In L. Darling-Hammond (Ed.), Review of Research in Education, 19(1), 405–450. Retrieved from https://doi.org/10.3102/0091732X019001405
Sims, J. M. (2015). A Valid and Reliable English Proficiency Exam: A Model from a University Language Program in Taiwan. English as a Global Language Education (EaGLE) Journal, 1(12), 91–125. https://doi.org/10.6294/EaGLE.2015.0102.04
Tran, H. P., Griffin, P., & Nguyễn, C. (2010). Validating the university entrance English test to the Vietnam National University: A conceptual framework and methodology. Procedia - Social and Behavioral Sciences, 2(2), 1295–1304. Retrieved from https://doi.org/10.1016/j.sbspro.2010.03.190
Tran, Q. T. (2011). The Content Validity of the Current English Achievement Test for Second Year Non Major Students at Phuong Dong University (Master’s thesis). University of Languages and International Studies, Hanoi.
Vu, T. P. A. (2016). 25 years of language assessment in Vietnam: Looking back and looking forward. In New Directions in English Language Assessment in Vietnam. Retrieved from https://www.britishcouncil.vn/.../new_directions_2016_dr_vu_thi_phu...
Weir, C. J. (2005). Language Testing and Validation. An Evidence-based approach. New York: Palgrave-Macmillan.
Zahedkazemi, E. (2015). Construct Validation of TOEFL-iBT (as a Conventional Test) and IELTS (as a Task-based Test) among Iranian EFL Test-takers’ Performance on Speaking Modules. Theory and Practice in Language Studies, 5(7), 1513–1519.
Abstract (Vietnamese version): Alongside reliability, validity in language testing and assessment has long held an important role in research (Bachman & Palmer, 1996). This paper analyses basic theories and empirical studies of validity in order to present the concept of validity in language testing and assessment, the sub-types of validity, the theoretical frameworks for measuring validity and the trends in empirical validity research. Four main results are obtained from the analysis. First, validity in a language test evaluates the quality of the test on the basis of test content, test criteria and test consequences, through determining the meaning and use of scores. Second, construct validity is a key term in classifying types of validity. In addition, the classification into a priori and a posteriori validity helps researchers choose a validation direction more clearly. Third, the theoretical framework of validation rests on three main models, those of Messick (1989), Bachman (1996) and Weir (2005). A further conclusion of this study is that most of the validity studies the author has accessed concern high-stakes tests at international or national scale. The results show that the field of validity research in language tests remains wide open.
Keywords: language assessment, test usefulness, construct validity, validation