https://doi.org/10.1007/s11528-024-00988-5
ORIGINAL PAPER
Abstract
Assessment is central to teaching and learning, and recently there has been a substantive shift from paper-and-pencil assessments towards technology-delivered assessments such as computer-adaptive tests. Fairness is an important aspect of the assessment process, including design, administration, test-score interpretation, and data utility. The Universal Design for Learning (UDL) guidelines can inform assessment development to promote fairness; however, it is not explicitly clear how UDL and fairness may be linked through students’ conceptualizations of assessment fairness. This phenomenological study explores how middle grades students conceptualize and reason about the fairness of mathematics tests, including paper-and-pencil and technology-delivered assessments. Findings indicate that (a) students conceptualize fairness through unique notions related to educational opportunities and (b) students reason about fairness non-linearly. Implications of this study have potential to inform test developers and users about aspects of test fairness, as well as educators’ use of data from fixed-form, paper-and-pencil tests and computer-adaptive, technology-delivered tests.
Keywords Assessment · Computer adaptive testing · Fairness · Math · Testing · Universal Design for Learning
The classroom is a complex social environment that reflects the diversity of students’ cultural backgrounds, linguistic abilities, economic status, and cognitive ability (Brookhart, 2003), and efforts persist in improving assessment practices within K-12 education that address this diversity (e.g., Black & Wiliam, 1998; Harris et al., 2023; Shepard, 2016). The shift from paper-and-pencil assessments towards technology-delivered assessments (Choi & McClenen, 2020; Thompson, 2017) has ramifications for students’ test performance, teachers’ instructional decision making, and assessment scholarship. Educators should consider the potential negative consequences of using assessment data to inform decisions regarding students’ future learning opportunities, consequences which may stem from fairness issues. One goal of this study is to highlight an intentional assessment design process drawing from Universal Design for Learning, which can assist scholars and practitioners to better understand an assessment’s qualities and, in turn, has the capacity to promote greater student learning. A second goal is to provide evidence about how students reason about the fairness of mathematics tests, including those delivered on technological devices. This study’s purpose is to explore middle school students’ conceptualizations related to mathematics test fairness and their reasoning about test fairness. An intended outcome is to contribute to conversations about Universal Design for Learning and assessment development.

Related Literature

Issues of Fairness in Testing
… literature. Materials for any quantitative assessment implemented in schools should provide users (e.g., teachers and school administrators) with evidence that the test measures what it intends (Bostic, 2023). That evidence supports the validity of inferences drawn from those assessment scores, as well as the actions or decisions test-users take based on those inferences (AERA et al., 2014; Folger et al., 2023; Kane, 2013). Validity is defined as the degree to which evidence supports assessment-score interpretations and is a fundamental concern of testing (AERA et al., 2014). To be clear: tests themselves are not valid, and using such language does not follow current standards. Results from tests and their interpretations are valid (AERA et al., 2014).

Issues of fairness are related to validity (AERA et al., 2014; Herman & Cook, 2022). They may be inherent to the test, such as construct irrelevance—factors causing variation in student performance not related to the construct of interest (AERA et al., 2014). Fairness regarding construct-relevance may relate to the cultural sensitivity of test development (AERA et al., 2014; Herman & Cook, 2022), such as a test being designed with both monolinguals and multilinguals in mind (Cahill & Bostic, 2024). Issues of fairness may also arise from how assessment results are interpreted and used. An example is using test scores to group students in a way that has a discriminatory effect on subgroups of students based on race/ethnicity, gender, and other characteristics (AERA et al., 2014; Shepard, 2016).

Tests drawing from Universal Design for Learning (UDL) principles and guidelines are better equipped to handle the complexities of diverse learners. UDL provides an opportunity to support diverse students, as well as promote equity and inclusivity through classroom instruction, especially technology-inclusive educational experiences (Israel et al., 2022). UDL aims to “maximize accessibility for all intended examinees” (AERA et al., 2014, p. 50). The UDL framework, which applies to instruction and assessment, promotes fairness by promoting access and supporting students’ individual learning needs (AERA et al., 2014; Bostic et al., 2021; CAST, 2018). However, ways to leverage UDL guidelines to address specific issues of fairness are not definitively described in assessment literature. We briefly overview UDL, then describe how fairness has been addressed in one assessment development project.

Universal Design for Learning

The UDL guidelines (see Fig. 1) are applicable across content areas and contexts to design experiences that meet all learners’ needs (Bostic et al., 2021; CAST, 2018). The guidelines are organized by the intersection of three principles (see Table 1) and a hierarchy of supports—(1) promoting access, (2) building skills, and (3) internalizing learning. A UDL aim is to develop learners’ expertise through this support hierarchy. A core feature of UDL is promoting access to instruction or assessment for all learners (AERA et al., 2014; Bostic et al., 2021; Brown et al., 2022; CAST, 2018).

Table 1 The three UDL principles

Engagement: Engagement refers to the motivational aspect of teaching and learning. Learners differ in the ways that they are motivated to learn.
Representation: Representation refers to the content focus of teaching and learning. Learners differ in how they comprehend information, and there is not one representational method that will best support all learners.
Action and Expression: Action and expression refer to learners’ interactions throughout teaching and learning. Learners differ in the ways they navigate a learning environment, and learners differ in the ways they express what they know.

To offer an illustration of UDL, guideline seven of the engagement principle may promote access by designing authentic tasks that are relevant to students’ lived experiences. Guideline eight might involve providing timely feedback on student performance; feedback that is substantive and useful. Then, guideline nine may be operationalized through encouraging student self-assessment and providing access to devices or data visualization tools that assist students in monitoring their own learning progression (CAST, 2018).

Applying UDL to assessment development is one research-based way to promote fairness in assessment (Brown et al., 2022); “By using universal design, test developers begin the test development process with an eye toward maximizing fairness” (AERA et al., 2014, p. 57). In this sense, fairness is operationalized as maximizing accessibility while limiting test bias. For example, applying UDL to assessment design avoids item characteristics (e.g., the situational context of word problems) “that may bias scores for individuals or subgroups” (AERA et al., 2014, p. 58).

With respect to the UDL principle of engagement, word problems may be developed such that a problem’s situational context is relevant—or at least believable—based on experiences of the intended population of examinees (Brown et al., 2022). Item context relevance to students’ experiences has grown as an area of concern within scholarship and practice because these fairness-related issues can negatively impact students’ outcomes in both the short and long term (AERA et al., 2014; Brown et al., 2022; Herman & Cook, 2022). Minimizing construct-irrelevant variance is crucial to promote fairness. Extraneous factors causing variation in student performance raise concerns regarding (a) the fairness of the test and (b) the validity of the inferences drawn from students’ test scores (AERA et al., 2014). UDL applications have potential to promote assessment fairness, which ultimately should be explored during assessment development.

Research on Students’ Conceptualizations of Assessment Fairness

Multiple approaches have been used to investigate assessment fairness. Experts and bias panels can provide quantitative or qualitative data regarding alignment between assessment items and a desired construct (Sireci & Faulkner-Bond, 2014; Lane, 2014; Fan et al., 2024). Quantitative analyses can provide evidence using tools such as the Akaike Information Criterion and the Bayesian Information Criterion (Bostic, 2011; Vrieze, 2012). A qualitative approach is to ask potential respondents about their perceptions of assessment items related to issues of fairness (Cahill & Bostic, 2024; Pepper & Pathak, 2008; Sambell et al., 1997; Wendorf & Alexander, 2004).
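For readers less familiar with these two model-selection tools, both compare candidate measurement models by balancing fit against complexity:

\[
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln(n) - 2\ln\hat{L},
\]

where \(k\) is the number of estimated parameters, \(\hat{L}\) is the maximized likelihood, and \(n\) is the sample size. In both cases the candidate model with the smaller value is preferred, with BIC penalizing additional parameters more heavily as the sample grows (Vrieze, 2012).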
Exploring students’ views about test fairness has many benefits. First, assessment creators receive a first-hand account of potential respondents’ engagement with assessment items, which is relevant for future administrations. Second, it gives students a voice in the assessment development process (Bostic, Folger, et al., 2024). Third, it has the potential to strengthen relationships between developers, assessment users (e.g., schools, institutions), and other stakeholders (e.g., community members) when done in collaborative spaces (Bostic, Folger, et al., 2024).

Research on students’ conceptualizations of fairness has shown that they understand fairness and can recall fair and unfair assessments, as well as their impact (Murillo & Hidalgo, 2017; Rasooli et al., 2018). Some scholarship broadly categorized K-12 students’ conceptualizations of assessment fairness as (a) considerations of equality and (b) considerations of equity (Murillo & Hidalgo, 2017; Rasooli et al., 2018). Fairness as equality is discussed as “transparency, objectivity, and evaluation of class content” (Murillo & Hidalgo, 2017, p. 10). Fairness as equity is characterized “as adaptation, diversification of tests, and qualitative assessment, even taking into account students’ effort and attitudes” (Murillo & Hidalgo, 2017, p. 10). When given the opportunity, students communicate inequities on school assessments and voice concerns about the assessments administered to them (Murillo & Hidalgo, 2017; Rasooli et al., 2018). As one example, students recognize that multilingual learners are growing in numbers in many USA classrooms (Bialik et al., 2018) and, therefore, that tests should be developed to address all students’ needs. Test developers, scholars, and school personnel should heed students’ voices if a goal is to create educational assessments that are fair and accessible to students (National Council of Teachers of Mathematics, 2014). To date, exploring test fairness has not been described as a required part of assessment development, but instead as a recommendation.
Students’ opportunity to learn (OTL) is a key factor in research about assessment fairness. One framing of OTL is the amount of time allocated for learning (Carroll, 1963). If students are given opportunities to learn the materials, then in turn, they should have opportunities to demonstrate their understanding of the material in ways that flow from their instructional experiences (Rasooli et al., 2018; Suskie, 2000). Simply put, testing and teaching practices should reflect one another.

Assessments that promote reasonable opportunities to demonstrate understanding should be equitable, provide students multiple ways to arrive at a solution, and include developmentally appropriate chances for engagement (Rasooli et al., 2018). This also comes up as an idea related to OTL: opportunities to demonstrate learning (OTD). Formative and summative assessments (i.e., OTD) should mirror the classroom learning experiences if a goal is to assess student outcomes accurately and fairly (AERA et al., 2014; Rasooli et al., 2018).

Considerable research has examined assessment fairness with adult learners and small groups of K-12 students (Rasooli et al., 2018; Sambell et al., 1997; Suskie, 2000). A recent synthesis of classroom assessment fairness noted six major topics that influence test fairness: (i) OTL, (ii) OTD, (iii) transparency and consistency, (iv) accommodations, (v) constructive classroom environment, and (vi) construct-irrelevant variance (Rasooli et al., 2018). Each of these topics is interrelated, influencing the degree to which test takers or administrators communicate test fairness. One implication from this synthesis is that it is unclear how students reason while evaluating test fairness. Scholars discuss test fairness, but how students arrive at those conclusions is uncertain.

A purpose of this study is to better understand how middle school students (ages 11-14) draw conclusions about the fairness of mathematics assessments, which we link to their reasoning about test fairness. We ground this study within the context of a larger assessment development project using a mathematics test series called the Problem Solving Measures (PSMs; Bostic et al., 2017; Bostic & Sondergeld, 2015, 2018) and the Problem Solving Measure - Computer Adaptive Test (PSM-CAT; Bostic, May, et al., 2024). There are three research questions for this study:

RQ1: How do middle school students conceptualize fairness of mathematics assessments?
RQ2: How do middle school students reason about the fairness of mathematics assessments?
RQ3: How do middle school students’ conceptualizations or reasoning about fixed-form, paper-and-pencil mathematics assessments compare to technology-delivered, computer-adaptive mathematics assessments?

Method

This study employs a naturalistic inquiry, phenomenological approach (Creswell, 2014) to explore students’ conceptualizations of and experiences related to issues of fairness in mathematics tests. This research draws from naturalistic inquiry, which permits multiple realities to be produced and communicated by students (Lincoln & Guba, 1985). Specifically, the ways students conceptualize fairness in testing are shaped by their participation in specific learning environments, environments with unique socio-cultural contexts.

Context & Participants

This study is part of a large, multi-state grant-funded project designed to create and validate technology-delivered, computer-adaptive mathematical problem-solving measures. The team has developed paper-and-pencil, fixed-form tests for grades 3-8, which are called the PSMs (see Bostic & Sondergeld, 2015; Bostic et al., 2017). The Problem-Solving Measures Computer Adaptive Test (PSM-CAT) is intended to be delivered to grades 6-8 learners (ages 11-15; Bostic, May, et al., 2024). Sample items are found in prior studies (Bostic & Sondergeld, 2015; Bostic et al., 2017; Bostic, May, et al., 2024).

One day prior to collecting data for this study, students completed a four-item mathematics assessment drawn from the larger PSM-CAT item pool. Items were aligned to content students had seen during classroom instruction in the last six months, as confirmed by participants’ teachers. A sample item is shown in Fig. 2; it aligns to Common Core State Standard for Mathematical Content 7.Expressions and Equations.1: “Apply properties of operations as strategies to add, subtract, factor, and expand linear expressions with rational coefficients” (Common Core State Standards Initiative [CCSSI], 2010, p. 49).
All PSM-CAT items have undergone a rigorous review by (a) multiple grade-level mathematics teachers, (b) a bias panel consisting of students, educators, and community members, and (c) mathematicians. Students providing feedback were representatively sampled across different genders, ethnicities (including but not limited to Black, Latinx/Hispanic, Asian or Pacific Islander, Native American, and mixed-racial), community sizes (i.e., urban, suburban, and rural), and locations (i.e., Midwest USA, Mountain West USA, and Southwest USA). Reviewers also included multilingual students as well as students with disabilities. Educators and community members represented different subgroups as well, similar to the purposeful, representative sampling used with students. While eliminating all bias is impossible (AERA et al., 2014), items have limited bias and were deemed appropriate by all members of these panels. Administering the four-item assessment was necessary for two reasons. First, a goal was to give participants experiences with PSM-CAT items and foster reflection and reasoning about mathematics test fairness. This helped to address RQ3. Second, it provided a shared experience for conversations during small-group interviews.

Purposeful sampling was used to identify and select information-rich cases related to the phenomenon (Creswell, 2014). The sample includes students between the ages of 11-14 (i.e., grades seven through nine) from the Mountain West USA region. One large suburban district currently participating in the larger assessment project was selected for this study’s data collection because of its diversity in students and size. Five schools in the district agreed to participate. Participants include students of different genders and races, as well as students with disabilities. Demographic information is presented in Table 2. We gave participants the option to select one of multiple gender descriptors. All participants selected male or female, and any names are pseudonyms selected by the participants.

Data Collection

A goal of the present study is to understand how students reason about mathematics assessment fairness. A pilot study was conducted at a school similar to the target schools to revise data collection tools and processes. Learning from that pilot study, our team chose a two-prong data collection approach with 1-1 interviews and small-group interviews. One researcher conducted 1-1 interviews while the other conducted small-group interviews. This approach allowed us to confirm findings from one data set with the other and seek potential counter and confirmatory evidence. This supported triangulation and trustworthiness with the data (Miles et al., 2014) by using two data collection approaches, and ultimately led to robust findings.

In total, 128 students participated in 1-1 or small-group interviews. Researchers conducted 45 1-1 interviews and 26 small-group interviews. The typical small group consisted of three students. A semi-structured protocol was used in all interviews. Student(s) met in a quiet space, and an audio-recording device recorded their words. They were given an overview of the process and asked to confirm participation. During the overview, they were given information about technology-delivered computer-adaptive tests and how they differ from fixed-form, paper-and-pencil tests. The information included the statement seen in Table 3. Next, we asked participants to describe a difference between technology-delivered, computer-adaptive tests and paper-and-pencil, fixed-form tests as a check of their comprehension.
Table 3 Statement and questions used with 1-1 interviews and small-group interviews

Interview Protocol

Statement
A fixed-form, paper-and-pencil test is like what it sounds like: You write your work with a pen or pencil, and your work and solutions are written on a test printed on paper. You complete all the problems on the test. If we put that test onto a computer, then it might be called a technology-delivered test. In both cases, the tests are fixed, which means that everyone gets the same test items. One unique type of technology-delivered test is a computer-adaptive test. This test is designed to give you items that are closer to your ability, based on whether you answer each item correctly or incorrectly. Thus, you might take more or fewer items than another student. Effectively, your test might be different from a peer’s. When I ask you to compare fixed-form, paper-and-pencil tests and computer-adaptive, technology-delivered tests, I want you to consider that in this case you have the same access to resources, would be scored in the same way, and have the same amount of time regardless of the test you took.

Interview Questions
#1 How do you determine the fairness of questions on a math test?
#2 Reflect on the test questions you completed recently. Do you believe those test questions are fair for a peer?
#3 How did you determine the fairness of those questions on the math test?
#4 Are there differences in how you think about the fairness of paper-and-pencil tests versus tests on the computer?
After participants responded: How did you reach that conclusion?
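The statement above describes adaptive item selection informally. To make the mechanism concrete, the sketch below shows a minimal computer-adaptive loop under a Rasch (one-parameter logistic) model. It is an illustration only: the item bank, the maximum-information selection rule, the Newton-Raphson ability update, and the stopping rule are generic textbook choices, not the PSM-CAT’s actual item-selection or scoring rules, which this article does not detail.

```python
import math
import random

def p_correct(theta: float, b: float) -> float:
    """Rasch (1PL) probability that a student with ability theta
    answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta: float, b: float) -> float:
    """Fisher information of a Rasch item; largest when b is near theta."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

def run_cat(item_bank, answer, theta=0.0, max_items=20, se_target=0.35):
    """Administer items adaptively until the ability estimate is precise
    enough (standard error below se_target) or limits are reached.

    item_bank: dict mapping item id -> difficulty (hypothetical calibrations)
    answer:    callback taking an item id and returning True/False
    """
    remaining = dict(item_bank)
    administered = []  # (difficulty, correct) pairs
    while remaining and len(administered) < max_items:
        # Select the unused item that is most informative at the
        # current ability estimate.
        next_id = max(remaining, key=lambda i: item_information(theta, remaining[i]))
        administered.append((remaining.pop(next_id), answer(next_id)))
        # One Newton-Raphson step on the log-likelihood: score residual
        # divided by total test information.
        residual = sum(float(c) - p_correct(theta, b) for b, c in administered)
        info = sum(item_information(theta, b) for b, _ in administered)
        theta += residual / info
        if 1.0 / math.sqrt(info) < se_target:  # stop once the SE is small enough
            break
    return theta, len(administered)

# Example: a 30-item bank with difficulties spread from -3 to 3, and a
# simulated student whose true ability is 1.0.
bank = {f"item{i}": -3.0 + 6.0 * i / 29 for i in range(30)}
est, used = run_cat(bank, lambda i: random.random() < p_correct(1.0, bank[i]))
print(f"estimated ability {est:.2f} after {used} items")
```

In this loop, a correct answer raises the ability estimate and routes the student to harder items, while an incorrect answer does the opposite; this is why two students may receive different numbers and kinds of items, exactly the contrast participants were asked to consider.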
If students communicated a reasonable understanding that contrasted the test formats, then the interview continued. If not, then a researcher provided greater clarity and asked students to reshare their ideas. Most students described differences during the first attempt; a small subset (12%) needed a second opportunity. Students were then asked to self-select a pseudonym. The recording device was activated, and the interview commenced (see Table 3). Students in the small-group interviews were encouraged to build on each other’s ideas. In both 1-1 and small-group interviews, a researcher redirected participants as needed and asked clarifying questions when necessary. One-on-one interviews took roughly six minutes each. Small-group interviews took approximately 12 minutes.

Data Analysis

An inductive, multi-stage qualitative data analysis approach was used (see Fig. 3). Our team used checks-and-balances during data analysis to promote reliability across coders and promote trustworthiness (Miles et al., 2014). Disagreements about patterns in the data and potential themes were discussed until a consensus was reached. This discussion helped to address concerns with triangulation (Miles et al., 2014) across different data sources (i.e., different participants, grade levels, and schools).

Two researchers used thematic analysis to inductively code data (Creswell, 2014; Miles et al., 2014). We used a two-cycle model (see Fig. 3) to derive themes related to the research questions, like that used in recent STEM education scholarship (e.g., Roberts et al., 2018). The first cycle comprised steps one through four, and the second cycle steps five through eight. In the first cycle, our goal was to become intimately familiar with the data and prepare for coding (Miles et al., 2014). The second cycle’s goal was to code and draw out themes; it included pattern coding (Miles et al., 2014) to make sense of the data. To answer RQ2, we carefully explored the patterns by which students described their conceptualizations. That is, these patterns became data pathways about participants’ reasoning related to math test fairness.
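The team resolved coder disagreements through discussion to consensus rather than by reporting an agreement statistic. For readers who wish to quantify “reliability across coders” in a similar two-coder design, percent agreement and Cohen’s kappa are common options; the sketch below is generic, and the excerpt labels in the example are hypothetical.

```python
from collections import Counter

def percent_agreement(coder_a, coder_b):
    """Share of excerpts to which two coders assigned the same code."""
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: agreement corrected for chance, (p_o - p_e) / (1 - p_e)."""
    n = len(coder_a)
    p_o = percent_agreement(coder_a, coder_b)
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Expected chance agreement from each coder's marginal code frequencies.
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(coder_a) | set(coder_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels for eight interview excerpts (OTL, OTD, context, DEI).
a = ["OTL", "OTL", "OTD", "DEI", "OTL", "context", "OTD", "OTL"]
b = ["OTL", "OTD", "OTD", "DEI", "OTL", "context", "OTL", "OTL"]
print(percent_agreement(a, b))  # 0.75
print(cohens_kappa(a, b))       # ~0.62
```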
Findings

Our data analysis led to a single theme for RQ1: Middle school students conceptualize mathematics test fairness by relating it to ideas broadly framed as educational opportunities. This theme, educational opportunities, has four subthemes: (a) opportunity to learn (OTL) the material, (b) opportunity to demonstrate (OTD) an understanding of the material, (c) educational contexts, including but not limited to their lived experiences (e.g., classroom and school environments), and (d) topics related to diversity, equity, and inclusion (DEI). Some participants focused on one subtheme whereas others shared ideas relevant to more than one subtheme. In the ensuing paragraphs, we unpack those four subthemes.

Our findings suggest a single theme for RQ2: Middle school students reason about fairness in a nonlinear fashion, often shifting their focus from one subtheme (see RQ1) to another. Figure 8 depicts that reasoning using solid and dashed arrows. These themes for RQ1 and RQ2 stayed consistent across participants from different grade levels and schools. We provide three cases illustrating the nonlinearity in students’ reasoning (see Fig. 9). For RQ3, we found that middle school students’ conceptualizations of fairness for fixed-form, paper-and-pencil mathematics tests are the same as for computer-adaptive, technology-delivered mathematics tests, and that they communicate reasoning about the fairness of both test formats in the same fashion. Throughout the findings, we refer to interviews as data from both 1-1 interviews and small-group interviews (71 in total) to foster reading ease.
RQ1: Conceptualizations about the Fairness of Mathematics Tests

Opportunity to Learn

Students most frequently discussed that they consider the fairness of mathematics tests regarding the degree to which they were taught the material seen on the test. Seventy of the 71 interviews (99%) included mentions of OTL in some capacity. Figure 4 displays various patterns that led to OTL as a subtheme.

Two patterns relate to whether material was taught and the depth or degree to which it was taught. Akot’s comments address this point: “I feel like if somebody was given a test in our class, and we have not been specifically taught about how to do that specific thing [from classroom instruction], then it’s a lot harder. You’re just blindly guessing. That’s not fair.” Comments were also similar to Maya’s: “If the teachers make sure that every kid has the opportunity to learn [content on the test], then it is fair.” Participants voiced ideas about the content, the degree to which they received assistance during instruction, and the amount of time to learn it. Keana shared a bit about time and assistance in class learning: “it’s [fair] if we learned it [the material], and we had a few class periods to let it sink in with notes and homework… and we had help if we needed it.” Students also felt mathematics test questions should be inclusive of the terms, skills, and vocabulary used in class. Rose voiced how “the vocabulary used [on classroom-based tests] isn’t always how I’m used to my teacher teaching it.” She felt this was problematic and influenced how she reasoned about mathematics test fairness.

Not only should the content be familiar to students, but also the way that mathematics test questions are formatted should mirror their instructional experiences. Joana referenced the PSM-CAT during her interview: “The [PSM-CAT word problems] cover all things we learned. I was looking at all the questions and I was able to do them because [the teacher] taught it [content] to us. There wasn’t anything on the test we weren’t taught.” She later conveyed that the mathematical language and formatting of the word problems was something she had experienced previously.

With regards to time and learning experiences, participants shared examples of opportunities to practice skills or apply concepts over multiple days in class and on homework. Jordan divulged more about his reasoning, communicating that the “teacher taught us how to solve math questions. She [the teacher] gave us practice problems or homework to further help our understanding.” Collectively, students were actively thinking about what they experienced in school and OTL as they reasoned about mathematics test fairness.

Opportunity to Demonstrate

In addition to OTL, participants discussed an opportunity to demonstrate (OTD) their learning as something related to OTL. Thirty-three of 71 (46%) interviews included mention of OTD. Figure 5 displays various patterns leading to this subtheme. Our team considered whether OTD might be included as OTL; yet, it became evident after subsequent analysis that these were unique subthemes. Frequently, middle school participants connected OTD and OTL as aspects that inform how they reason about the fairness of mathematics tests.

One pattern for OTD is clearly written tests with vocabulary and material that was used in class. Kate commented that some tests “do not use the same vocabulary as in class,” which makes them “not straightforward. They twist it [content] around making it harder.” In turn, a test may be measuring how well students understand language and comprehension and focus less on what they know and are able to do with instructional content. Length of tests came up as well. Participants discussed that a test should be a goldilocks length—not too short and not too long. Gabriel found that “having more than 20 questions [on a mathematics test] is so hard.” Wyatt considered page length as an indicator: “Well, I’m used to one or two pages [of questions on a test], but if it’s anything past four, then I consider it a lengthy test.” Ronald noted that “if it’s a shorter test, then you can take more time on each question.” Another pattern for OTD was related to assessment opportunities. Kate brought this up as an example: “We need to take tests to make sure we know the material and all the seventh-grade knowledge we should know, so we can get a grade and know we know the stuff.” Without sufficient assessment opportunities, students may not know how well they formatively understand material.

Participants commented that they have seen ‘trick questions’ on mathematics tests that they reasoned to be unfair. Those ‘trick questions’, according to participants, tended not to draw upon content or skills learned during a unit. Participants value having a reasonable amount of time on each question and perceive that as a factor in how they reason about mathematics test fairness. Finally, participants felt that having similar resources during testing as during instruction was important, and this influenced their reasoning as well. Baylen commented that “...having a formula sheet and calculator …helps a lot. Like, if I forgot the formula, then I could still solve the question.” She noted how the presence of a formula sheet and/or calculator influences the degree to which a math test is fair. Thus, Baylen was pushing on resources available while testing, which might reflect students’ instructional experiences. Participants commented that tests across teachers and classrooms can vary, and it is important that OTDs align with their classroom experiences.

Educational Contexts

Students recognize that educational contexts vary, which implicates mathematics test fairness. There were multiple patterns that led to this subtheme (see Fig. 6). Nineteen of 71 (27%) interviews included mentions of codes related to educational context. In some schools, there might be one teacher leading all the instruction for one grade or one course. However, the learning environment in one section might differ from another. In larger schools, there can be multiple teachers delivering instruction for a topic, which again implicates test fairness. Kalia told us that she recognized how “students have different teachers in the school. And each teacher might teach students with different [instructional] methods in the class. You would have different [learning and testing] experiences because teachers don’t teach the exact same way.”
Participants also recognized variance in educational contexts across schools within a district and from district to district. Oscar captured a component of this when referencing his peer group: “...We’re all in the same school district. We should get taught the same thing but that doesn’t happen all the time.” He continued to convey how this variance implicates fairness of district-wide tests. Some participants made connections to students who were not previously enrolled in a school district. Sammy said, “if somebody had been homeschooled before, and maybe they had a different experience than us,” which connects to acknowledging the role that context plays in assessment and learning. In other instances, students, like Natalie, discussed the testing environment: “It [any mathematics test] should be given in a good environment, it should probably be quiet and there shouldn’t be lots of distractions.” Participants recognized the differences in micro- and macro-level contexts, which can even vary day to day. This, in turn, implicates the reasoning students use with their peers and how it can affect their conceptualizations of test fairness.

Diversity, Equity, and Inclusivity

Concerns related to diversity, equity, and inclusivity (DEI) were a fourth subtheme influencing how middle school students conceptualize the fairness of mathematics tests. Thirteen of 71 (18%) interviews included mentions of DEI. Patterns informing this subtheme are displayed in Fig. 7. The most frequently mentioned issue was students’ access to mathematics tests that reduced bias, managed potential language barriers, or considered the impact of learning difficulties. Students discussed personal troubles with reading and language, recognizing how a mathematics test in English might be overly challenging for a peer less familiar with the language compared to a native English speaker (NES). Students used examples of their peers who are multilingual learners as well as those who struggle with reading. Fatima mentioned the issue of not having translation options for tests, especially Spanish: “Honestly, I think for the Hispanics who are not native English speakers, like me, would like to have it [a mathematics test] translated. So, then they understand the test better. People learning English can’t do a math test because they can’t read it. Because the girl who sits next to me: I translate for her, and I help her. She barely understands English but is expected to do math in English. That’s not fair.” Caroline is a student at a different school and said that “the needs of students might not all be met [if students cannot read it]. Like, a couple of kids in my class who don’t understand English as well as me. That’s not a fair test for them.” Most students in this sample were white NESs, and yet many reasoned about test fairness through the lenses of others, including multilinguals, in their classrooms.

Another pattern related to DEI stems from those with disabilities in the classroom. Sammy shared his struggles and how they influence his reasoning about fair tests: “I’m dyslexic. So it’s harder for me to understand the math immediately. I mix things up and forget the steps and material sometimes. It’s better if I can get extra time to work on it so I don’t make mistakes.” He further communicated how testing accommodations are a key factor in his conceptualization of a fair mathematics test. Sima expressed reflecting on fairness through a lens of support: “I have dyscalculia and ADHD. I think [assignments] should be equal because I need to learn the same things as others in my class.” Showing her desire, as a student with learning disabilities, to have the same opportunities as her peers demonstrates a value for learning and fair testing. Moreover, students with and without disabilities recognized the value of DEI issues related to reasoning about mathematics test fairness.

RQ2: Reasoning about Fairness

Having explored the ways students conceptualize fairness and the main themes that arose during interviews, we turn to how students arrive at those conceptualizations. That is, we examine the thought processes students described in the interviews to see how they reached the conclusions about fairness reported for RQ1. RQ2 asks whether most students reason in similar ways or whether each student’s reasoning varies. Given the many subthemes and patterns drawn from the first research question, we looked at whether students attend to only one subtheme or to multiple subthemes when they make decisions about the fairness of a math assessment. The following analysis describes and visualizes the ways students described the process of arriving at their views of fairness.

Students reason about test fairness in many different ways, oftentimes circling back to ideas, connecting ideas to others, or holding a single idea. There was no consistent pattern by which students described issues of mathematics test fairness, which is how we arrived at the conclusion that students’ reasoning about mathematics tests is nonlinear. We share three cases (see Fig. 8) with readers to illustrate some ways that participants describe fairness of mathematics tests. While these three examples are not exhaustive, they highlight the uniqueness related to each student’s reasoning.
“…paper test are the same and they’re both fair. It’s identical, as in paper and online because it’s the same words and everything [resources] that are used.” Having the same resources, experiences, and support in either testing situation is vitally important according to these participants, and ultimately the way that items are presented should be similar if the two formats are meant to be identical.

Summary

Middle school students’ conceptualizations of mathematics test fairness included four subthemes related to educational opportunities. Subthemes included OTL, OTD, educational contexts, and DEI. Students reason about mathematics test fairness in nonlinear ways. They value being tested on what they have been taught while also recognizing that there is variance across students, classrooms, teachers, and schools. Their lived experiences play a substantive role in their views about mathematics test fairness. Reflecting across data, participants communicated reasoning about a fixed-form, paper-and-pencil mathematics test and a computer-adaptive, technology-delivered mathematics test in the same fashion.

Discussion and Implications

This study highlights some scholarly and practical considerations related to mathematics test fairness and middle school students’ reasoning about test fairness.

Scholarship

Middle grades students reason about the fairness of mathematics tests in nonlinear ways using up to four conceptualizations indicative of educational opportunities. They discussed a variety of issues including how they learn, time spent learning, differences across classroom contexts, and DEI issues. This extends scholarship (e.g., Murillo & Hidalgo, 2017; Rasooli et al., 2018) with more concrete conceptualizations from middle school students about mathematics tests. It also highlights that middle school students recognize the potential inequities of mathematics testing, which implicates the validity of score interpretations from mathematics tests.

A second finding was that students communicated reasoning about computer-adaptive, technology-delivered mathematics tests and paper-and-pencil, fixed-form mathematics tests in ways that are similar, assuming they have the same access, resources, and contexts. This fills a gap about students’ ideas related to computer-adaptive, technology-delivered mathematics tests in comparison to paper-and-pencil, fixed-form tests. As testing moves further online, having the same resources is critically important to promote fairness and build strong validity arguments.

Middle school students’ reasoning about the fairness of mathematics tests related to DEI issues highlights potential negative consequences from testing. As Fatima said, “any test written in English won’t tell you much about her [motions to her friend, a multilingual learner with low English proficiency] math knowledge because she doesn’t know English. She is embarrassed because she can’t read English.” Multilingual learners are a fast-growing population in the USA (Bialik et al., 2018; Kena et al., 2016), and minimizing issues related to bias while promoting fairness in assessments should be a priority (Fan et al., 2024). Consequences of testing and bias are validity concerns, and assessment-development teams and test-users share responsibility in examining unintended consequences from testing (AERA et al., 2014). Fatima’s comments and these findings lend support to UDL framework usage in mathematics test development and validation as critical if a goal is to equitably measure students’ knowledge and to reduce construct-irrelevant test score variance. Shepard (2016) also reminds us of the cognitive and affective implications for students when tests are used inappropriately. Minimizing bias issues through the UDL guidelines during assessment development can foster making better tests (AERA et al., 2014; CAST, 2018).

Students’ reasoning about issues of fairness in mathematics tests is aligned with UDL scholarship (Bostic et al., 2021; Brown et al., 2022; CAST, 2018). For instance, the Representation principle highlights how language and cultural differences influence students’ comprehension of information. Participants communicated that the ways they experience instruction (OTL) and their assessment experiences (OTD) influence the depth and quality of learning. OTD is an element of Action and Expression, which articulates that learners may differ in the ways they express what they know (CAST, 2018). This study demonstrates an application of UDL and reifies its importance, which was not explicit in the Standards for Educational and Psychological Testing (AERA et al., 2014). One study implication is that assessment developers and users now have greater evidence for using UDL for fixed-form, paper-and-pencil mathematics assessments (e.g., PSMs) and technology-delivered, computer-adaptive mathematics assessments (e.g., PSM-CAT). Brown et al. (2022) leveraged the UDL framework to revise word problems for mathematics assessments such that all students are provided opportunities to demonstrate their knowledge and mathematical problem-solving skills. For instance, the Engagement principle could be addressed by ensuring mathematical word problems contain a situational context that is relevant or authentic to students completing the assessment (Brown et al., 2022). Although issues of fairness were not an explicit focus of the work conducted by Brown et al. (2022), findings from the current phenomenological study suggest that such applications of the UDL framework are one way that assessment developers and/or educators promote fairness in testing.
Students should have assessment experiences that mirror learning experiences (Brookhart, 2003; Rasooli et al., 2018; Suskie, 2000). Participants noticed how instruction varies across classrooms, schools, and districts. They recognized that “assessment occurs within a classroom environment or context or climate. The context affects the assessment” (Brookhart, 2003, p. 6). To that end, UDL becomes more salient, and the guidelines provide avenues for students to access and engage with the content during learning and assessment experiences (CAST, 2018).

Practical Considerations

Participants reasoned about mathematics test fairness in ways that are grounded in knowing themselves and their peers, as well as contexts and experiences. For instance, they recognize that fairness issues can have negative implications for their education. Classroom teachers and educational supervisors should consider the ways that assessments are designed and the degree to which test development aligns with desired learning outcomes. While our sample predominantly identified as white, students were concerned about issues of diversity, equity, and inclusivity that may not necessarily implicate them. That is, they recognized how assessments should be linked with classroom instruction and students’ experiences. Middle school students recognize that these issues implicate test fairness, which upon further study may suggest older students do as well. Such fairness issues are associated with standardized large-scale testing that is intended to be broad across a diverse population. District- and teacher-created assessments, in which learning experiences can differ, can also raise fairness concerns. Taken collectively, mathematics test fairness can be a threat to students’ success.

Our study contributes to the narrative that students recognize issues of fairness and can reason about how those issues might have differing impacts on them and peers. Variance across student learning and test scores is one aspect that they acknowledge. Educators seeking to promote opportunities for student success ought to promote mathematics test fairness and consider these findings. Students—like the authors—believe that better tests, not more tests, would improve their educational experiences. As described in the previous section, UDL is one framework educators and test-developers can use to develop better tests, with considerations of fairness built into test development, or test modification, processes. For instance, testing practices that support timely feedback of student performance, feedback that is substantive and useful, align with the UDL Engagement principle. Thus, test development processes may also include a plan for communicating feedback about performance back to students.

Using technology-adapted assessments can support students’ learning and increase their access to learning opportunities. Creating a mathematics assessment that remains fair when given as a paper-and-pencil test or as a computer-adaptive test ensures that any student who takes the assessment, in either format, has the same opportunities to achieve and demonstrate their learning. The themes from this study can inform the development of assessments. As a few instances, promoting student accessibility through computer-adaptive assessments and offering multiple language options may increase the number of students who perceive the assessment as fair and, ultimately, lead to better test results and interpretations.

Teachers should ensure that students are exposed to vocabulary that will be on these assessments. Mathematics is filled with unique academic vocabulary, as seen in mathematics standards (e.g., Common Core State Standards, CCSSI, 2010). The items from the DEAP-CAT assessment are connected to a mathematics standard, which ensures that students have a greater likelihood of success when they can read and understand the problem’s language.

Limitations and Future Explorations

First, this study did not sample students from rural and urban contexts. It may be worthwhile to replicate this study within other contexts to confirm these findings. Second, findings are related to students’ conceptualizations of fairness for mathematics tests and how they reason about math tests. Future scholars might explore students’ reasoning about tests from other content areas (e.g., Language Arts, Science, and Social Studies), as well as different grade levels (e.g., elementary and high school settings). Third, most participants identified as white; hence, a future study should include greater diversity in its sample. The current study utilized a phenomenological approach to explore issues of fairness in mathematics tests based on the perceptions of study participants. Creswell (2014) recommends that researchers seeking to initially understand a phenomenon start with qualitative methods and scale up to quantitative methods as appropriate. Future research may employ quantitative or mixed methods to examine the generalizability of our findings regarding issues related to test fairness. For instance, scholars may want to use a survey design for convergent or divergent evidence from this study. Integrating qualitative findings with quantitative data has potential to (a) yield additional insight into how students consider issues of fairness in mathematics tests, and (b) promote the generalizability of our findings.

Funding Ideas in this manuscript stem from grant-funded research by the National Science Foundation (NSF #1720646; #1720661;
#1920621, #1920619, #2100988, #2101026). Any opinions, findings, conclusions, or recommendations expressed by the authors do not necessarily reflect the views of the National Science Foundation.

Data Availability Supporting data are not available due to the nature of the research's approval through the University's IRB.

Declarations

Conflict of interest This study was reviewed by the Bowling Green State University's Institutional Review Board and deemed exempt status. The PSMs and PSM-CAT, which are referenced in this study, are co-owned by Bowling Green State University and Drexel University, and not by one or more of the authors.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (2014). Standards for educational and psychological testing. American Educational Research Association.
Bialik, K., Scheller, A., & Walker, K. (2018). 6 facts about English language learners in U.S. public schools. Pew Research Center. http://tinyurl.com/4whvvabe
Black, P., & Wiliam, D. (1998). Inside the black box: Raising standards through classroom assessment. King's College London. https://doi.org/10.1177/003172171009200119
Bostic, J. (2011). The effects of teaching mathematics through problem-solving contexts on sixth-grade students' problem-solving performance and representation use (Unpublished doctoral dissertation). University of Florida.
Bostic, J. (2023). Engaging hearts and minds in assessment research. School Science and Mathematics Journal, 123(6), 217–219. https://doi.org/10.1111/ssm.12621
Bostic, J., Folger, T., May, T., Koskey, K., Matney, G., & Stone, G. (2024). Borrowing theory from engineering: Applying empathic design principles to mathematics assessment development. Paper presented at the annual meeting of the American Education Research Association.
Bostic, J., May, T., Matney, G., Koskey, K., Stone, G., & Folger, T. (2024). Computer adaptive mathematical problem-solving measure: A brief validation report. In D. Kombe & A. Wheeler (Eds.), Proceedings of the 51st annual meeting of the Research Council on Mathematics Learning (pp. 102–110). Columbia.
Bostic, J., & Sondergeld, T. (2015). Measuring sixth-grade students' problem solving: Validating an instrument addressing the mathematics Common Core. School Science and Mathematics Journal, 115, 281–291. https://doi.org/10.1111/ssm.12130
Bostic, J., & Sondergeld, T. (2018). Validating and vertically equating problem-solving measures. In D. Thompson, M. Burton, A. Cusi, & D. Wright (Eds.), Classroom assessment in mathematics: Perspectives from around the globe (pp. 139–155). Springer.
Bostic, J., Sondergeld, T., Folger, T., & Kruse, L. (2017). PSM7 and PSM8: Validating two problem-solving measures. Journal of Applied Measurement, 18(2), 151–162.
Bostic, J. D., Vostal, B., & Folger, T. (2021). Growing TTULPs through your lessons. Mathematics Teacher: Learning and Teaching PK-12, 114(7), 498–507. https://doi.org/10.5951/MTLT.2020.0341
Brookhart, S. M. (2003). Developing measurement theory for classroom assessment purposes and uses. Educational Measurement: Issues and Practice, 22(4), 5–12. https://doi.org/10.1111/j.1745-3992.2003.tb00139.x
Brown, N., Bostic, J., Folger, T., Folger, L., Hicks, T., & Nofziger, S. (2022). Revising assessments to address UDL and standards. Mathematics Teacher: Learning & Teaching, 115(4), 252–264. https://doi.org/10.5951/MTLT.2020.0365
Cahill, J., & Bostic, J. (2024). Influence of language on multilingual middle grades learners' mathematical problem-solving outcomes. Journal of Urban Mathematics Education (in press).
Carroll, J. (1963). A model for school learning. Teachers College Record, 64, 723–733. https://doi.org/10.1177/016146816306400801
CAST. (2018). Universal Design for Learning guidelines version 2.0. http://www.udlcenter.org/aboutudl/udlguidelines
Choi, Y., & McClenen, C. (2020). Development of adaptive formative assessment system using computerized adaptive testing and dynamic Bayesian networks. Applied Sciences, 10, 1–17. https://doi.org/10.3390/app10228196
Common Core State Standards Initiative. (2010). Common Core State Standards for Mathematics. http://www.corestandards.org/wp-content/uploads/Math_Standards.pdf
Creswell, J. W. (2014). Research design: Quantitative, qualitative, and mixed method approaches (4th ed.). SAGE Publications, Inc.
Fan, Y., Koskey, K., Bright, D., May, T., Matney, G., & Bostic, J. (2024). Exploring sources of bias to improve the universal design of assessments of mathematical problem-solving skills. Educational Assessment (in press).
Folger, T., Bostic, J., & Krupa, E. (2023). Defining test-score interpretation, use, and claims: Delphi study for the validity argument. Educational Measurement: Issues and Practice, 42(3), 22–38. https://doi.org/10.1111/emip.12569
Harris, C., Wiebe, E., Grover, S., & Pellegrino, J. (Eds.). (2023). Classroom-based STEM assessment: Contemporary issues and perspectives. Community for Advancing Discovery Research in Education (CADRE), Education Development Center, Inc.
Herman, J., & Cook, L. (2022). Broadening the reach of the fairness standards. In J. Jonson & K. Geisinger (Eds.), Fairness in educational and psychological testing: Examining theoretical, research, practice, and policy implications of the 2014 standards (pp. 33–59). American Educational Research Association. https://doi.org/10.3102/9780935302967_2
Israel, M., Kester, B., Williams, J. J., & Ray, M. J. (2022). Equity and inclusion through UDL in K-6 computer science education: Perspectives of teachers and instructional coaches. ACM Transactions on Computing Education, 22(3), 1–22.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000
Kena, G., Hussar, W., McFarland, J., de Brey, C., Musu-Gillette, L., Wang, X., Zhang, J., Rathbun, A., Wilkinson-Flicker, S., Diliberti, M., Barmer, A., Bullock Mann, F., & Dunlop Velez, E. (2016). The Condition of Education 2016 (NCES 2016-144). U.S. Department of Education, National Center for Education Statistics. http://nces.ed.gov/pubsearch
Lane, S. (2014). Validity evidence based on testing consequences. Psicothema, 26(1), 127–135.
Lincoln, Y. S., & Guba, E. G. (1985). Naturalistic inquiry. Sage.
Miles, M., Huberman, A. M., & Saldaña, J. (2014). Qualitative data analysis: A methods sourcebook (3rd ed.). Sage.
Murillo, F. J., & Hidalgo, N. (2017). Students' conceptions about a fair assessment of their learning. Studies in Educational Evaluation, 53, 10–16. https://doi.org/10.1016/j.stueduc.2017.01.001
National Council of Teachers of Mathematics. (2014). Access and equity in mathematics education: A position of the National Council of Teachers of Mathematics. https://www.nctm.org/Standards-and-Positions/Position-Statements/Access-and-Equity-in-Mathematics-Education
Pepper, M. B., & Pathak, S. (2008). Classroom contribution: What do students perceive as fair assessment? Journal of Education for Business, 83(6), 360–368. https://doi.org/10.3200/JOEB.83.6.360-368
Rasooli, A., Zandi, H., & DeLuca, C. (2018). Re-conceptualizing classroom assessment fairness: A systematic meta-ethnography of assessment literature and beyond. Studies in Educational Evaluation, 56, 164–181. https://doi.org/10.1016/j.stueduc.2017.12.008
Roberts, T., Jackson, C., Mohr-Schroeder, M. J., Bush, S. B., Maiorca, C., Cavalcanti, M., et al. (2018). Students' perceptions of STEM learning after participating in a summer informal learning experience. International Journal of STEM Education, 5(1), 1–14. https://doi.org/10.1186/s40594-018-0133-4
Sambell, K., McDowell, L., & Brown, S. (1997). "But is it fair?": An exploratory study of student perceptions of the consequential validity of assessment. Studies in Educational Evaluation, 23(4), 349–371. https://doi.org/10.1016/S0191-491X(97)86215-3
Shepard, L. A. (2016). Evaluating test validity: Reprise and progress. Assessment in Education: Principles, Policy & Practice, 23(2), 268–280. https://doi.org/10.1080/0969594X.2016.1141168
Sireci, S. G., & Faulkner-Bond, M. (2014). Validity evidence based on test content. Psicothema, 26(1), 100–107. http://hdl.handle.net/11162/100815
Suskie, L. (2000). Fair assessment practices: Giving students equitable opportunities to demonstrate learning. AAHE Bulletin, 52(9), 7–9.
Thompson, G. (2017). Computer adaptive testing, big data, and algorithmic approaches to education. British Journal of Sociology of Education, 38(6), 827–840. https://doi.org/10.1080/01425692.2016.1158640
Vrieze, S. I. (2012). Model selection and psychological theory: A discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Psychological Methods, 17(2), 228–243. https://doi.org/10.1037/a0027127
Wendorf, C., & Alexander, S. (2004). The influence of individual- and class-level fairness-related perceptions on student satisfaction. Contemporary Educational Psychology, 30, 190–206. https://doi.org/10.1016/j.cedpsych.2004.07.003

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.