
ORIGINAL RESEARCH
published: 07 May 2021
doi: 10.3389/feduc.2021.654212

Improving Learning: Using a Learning Progression to Coordinate Instruction and Assessment

Mark Wilson 1*† and Richard Lehrer 2†

1 Graduate School of Education, University of California, Berkeley, Berkeley, CA, United States
2 Department of Teaching and Learning, Peabody College, Vanderbilt University, Nashville, TN, United States

Edited by: Neal M. Kingston, University of Kansas, United States
Reviewed by: Bronwen Cowie, University of Waikato, New Zealand; Jere Confrey, North Carolina State University, United States
*Correspondence: Mark Wilson, MarkW@berkeley.edu
†These authors have contributed equally to this work

Specialty section: This article was submitted to Assessment, Testing and Applied Measurement, a section of the journal Frontiers in Education
Received: 15 January 2021; Accepted: 06 April 2021; Published: 07 May 2021
Citation: Wilson M and Lehrer R (2021) Improving Learning: Using a Learning Progression to Coordinate Instruction and Assessment. Front. Educ. 6:654212. doi: 10.3389/feduc.2021.654212

We describe the development and implementation of a learning progression specifying transitions in reasoning about data and statistics when middle school students are inducted into practices of visualizing, measuring, and modeling the variability inherent in processes ranging from repeated measure to production to organismic growth. A series of design studies indicated that inducting students into these approximations of statistical practice supported the development of statistical reasoning. Conceptual change was supported by close coordination between assessment and instruction, where changes in students' ways of thinking about data and statistics were illuminated as progress along six related constructs. Each construct was developed iteratively during the course of design research as we became better informed about the forms of thinking that tended to emerge as students were inducted into how statisticians describe and analyze variability. To illustrate how instruction and assessment proceeded in tandem, we consider progress in one construct, Modeling Variability. For this construct, we describe how learning activities supported the forms of conceptual change envisioned in the construct, and how conceptual change was indicated by items specifically designed to target levels of the construct map. We show how student progress can be monitored and summatively assessed using items and empirical maps of items' locations compared to student locations (called Wright maps), and how some items were employed formatively by classroom teachers to further student learning.

Keywords: learning progression, data modeling, statistical reasoning, item response models, Rasch models

INTRODUCTION

In this paper, we illustrate the use of an organized learning model, specifically, a learning progression, to support instructionally useful assessment. Learning progressions guide instructional plans for nurturing students' long-term development of disciplinary knowledge and dispositions (National Research Council, 2006). Establishing a learning progression is an epistemic enterprise (Knorr Cetina, 1999) in which students are positioned to participate in the generation and revision of forms of knowledge valued by a discipline. For that, we need both an instructional design and an assessment design, and the two need to be tightly coordinated. In particular, the assessments inform crucial aspects of the progression: (a) they provide formative information for the development and refinement of the learning progression, and (b) they provide formative and summative information for teachers using the learning progression to iteratively refine instruction in response to evidence about student learning.


The idea of a learning progression is related to curriculum and instructional concepts that have been apparent in the educational literature for many years, and it is closely tied to a learning trajectory as commonly used in mathematics education (Simon, 1995). One definition that has become prominent is the following:

    Learning progressions are descriptions of the successively more sophisticated ways of thinking about an important domain of knowledge and practice that can follow one another as children learn about and investigate a topic over a broad span of time. They are crucially dependent on instructional practices if they are to occur (Center for Continuous Instructional Improvement (CCII), 2009).

This description is broadly encompassing, but, at the same time, the description signals something more than an ordered set of ideas, curriculum pieces, or instructional events: Learning progressions should characterize benchmarks of conceptual change and, in tandem, conceptual pivots—conceptual tools and conjectures about mechanisms of learning that support the kinds of changes in knowing envisioned in the progression. Benchmarks of conceptual change are models of modal forms of student thinking and, like other models, must be judged on their utility. They are ideally represented at a "mid-level" of description that captures critical qualities of students' ways of thinking without being either so broad as to provide very little guidance for instruction and assessment, or so overwhelmingly fine-grained as to impede ready use. Similarly, conceptual pivots are ways of thinking and doing that tend to catalyze conceptual growth, and as such, are situated within a theoretically compelling framing of potential mechanisms of learning.

Further entailments of a learning progression include commitments about alignment of discipline, learning, instruction, and assessment. In our view, these include:

(a) an epistemic view of a discipline that describes how concepts are generated and warranted;
(b) representations of learning structured as descriptions of forms of student knowledge, including concepts and practices, and consequential transitions among these forms, as informed by the epistemic analysis;
(c) help for teachers to identify classes of student performances as representing particular forms of student knowledge around which teachers can craft instructional responses; and
(d) assessments and reports designed to help reveal students' ways of thinking, and to organize evidence of such thinking in ways that help teachers flexibly adapt instruction.

Thus, we consider a learning progression to be an educational system designed to support particular forms of student (and perhaps teacher) conceptual change. This system must include descriptions of learning informed by an epistemic view of a discipline, the means to support these forms of learning, well-articulated schemes of assessment, and professional development that produces pedagogical capacities oriented toward sustaining student progress.

As we later describe more completely, in this article we concentrate on a learning progression that was developed to support transitions in students' conceptions of data, chance, and statistical inference (Lehrer et al., 2020). Conceptual change was promoted instructionally by inducting students into approximations of core professional practices of statisticians. Following in the footsteps of professional practice, students invented and revised ways of visualizing, measuring, and modeling variability. In what follows, we primarily focus on the assessment component of the learning progression, with attention to the roles of student responses and teacher practices in the development and deployment of the assessment component.

Figure 1 illustrates a representation of transitions in students' ways of knowing as they learn to make and revise models of variability (MoV). This is an example of what is called a construct map—in this case, the construct is called MoV, and it will be the focus of much of the rest of this paper (so more detail will follow, shortly). Leaving aside the specifics of this construct, the construct map, then, is a structure defining a series of levels of increasing sophistication of a student's understanding of a particular (educationally important) idea, and is based on an assumption that it makes (educational) sense to see the student's progress as increasingly conceptually elaborated, with a series of qualitatively distinct levels marking transitions between the entrée to thinking about the idea (sometimes called the "lower anchor") and the most elaborated forms likely given instructional support (sometimes called the "upper anchor") (Wilson, 2005). Note that the construct map is not a full representation of a learning progression in that it neglects description of the conceptual pivots that might reliably instigate the progress of conceptual change visualized in the map, nor does it specify other elements of the educational system necessary to support student learning. However, the map has the virtue of mid-level description of students' ways of thinking that can be grasped and adapted to the needs of particular communities. For example, construct maps can be elaborated with classroom exemplars to assist teacher identification of particular levels of thinking and to exemplify how a teacher might leverage the multi-level variability of student thinking in a classroom to promote conceptual change (Kim and Lehrer, 2015). The map can be subject to a simple but robust form of psychometric scaling known as Rasch modeling (Rasch, 1960/80; Wilson, 2005).

FIGURE 1 | The MoV construct map.
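
In computational terms, a construct map is simply an ordered series of levels. The sketch below represents the MoV construct this way; the one-line glosses paraphrase the level descriptions given later in this paper and are abbreviations, not the official level definitions:

```python
# A sketch of a construct map as a data structure: ordered levels, lower
# anchor first. The glosses paraphrase the MoV levels described later in
# the text; they are abbreviations, not the official level definitions.
MOV_CONSTRUCT_MAP = [
    ("MoV1", "associates variability with particular sources"),
    ("MoV2", "informally orders contributions of different sources"),
    ("MoV3", "explicitly considers chance as contributing to variability"),
    ("MoV4", "models variability as a composition of signal and random sources"),
    ("MoV5", "considers variability when evaluating models"),
]

for i, (level, gloss) in enumerate(MOV_CONSTRUCT_MAP, start=1):
    print(f"{i}. {level}: {gloss}")
```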


To situate the development and test of this and the five related constructs, we briefly describe the iterative design of the learning progression.

Articulating the Learning Progression

As just noted, the learning progression sought to induct students into an ensemble of approximations to the practices by which professionals come to understand variability, most centrally, ways to visualize, measure, and model variability. The instruction is designed so that these practices become increasingly coordinated and interwoven over time, so that, for example, initial ways of visualizing data are subsequently employed by students to invent and interpret statistics of sample distribution. The initial construction of the progression involved analysis of core concepts and practices of data modeling (Lehrer and Romberg, 1996) that we judged to be generative, yet intelligible, to students. These were accompanied by conjectures about fruitful instructional means for supporting student induction into ways of thinking and acting on variability that were more aligned with professional practices. One form of conceptual support for learning (a conceptual pivot in the preceding) was a commitment to inducting students into approximations of professional practices through cycles of invention and critique (Ford, 2015). For example, to introduce students to statistics as measures of characteristics of distribution, they first invented a statistic to capture variation in a distribution and then participated in critique during which different inventions were compared and contrasted with an eye toward how they approached the challenges of characterizing variability. It was only after students could participate in such a cycle that the conventional statistics of variation were introduced, for now students were in a position to see how conventions resolved some of the challenges revealed by their participation in invention and critique. This participation in practice also helped students understand why there are multiple statistics for characterizing variation in distribution.

Conjectures about effective means for supporting learning were accompanied by development of an assessment system that could be employed for both summative and formative purposes. These assessments provided evidence of student learning that further assisted in the reformation of theory and practice of instruction over multiple iterations of instructional design. The learning progression was articulated during the course of a series of classroom design studies, first conducted by the designers of the progression (e.g., Lehrer et al., 2007, 2011; Lehrer and Kim, 2009; Lehrer, 2017) and subsequently elaborated by teachers who had not participated in the initial iterations of the design (e.g., Tapee et al., 2019). The movement from initial conjectures to a more stabilized progression involved meshing disparate professional communities, including teachers, statisticians, learning researchers, and assessment researchers. Coordination among communities was mediated by a series of boundary objects, ranging from curriculum units to construct maps (such as the one shown in Figure 1) to samples of student work that were judged in relation to professional practices by statistical practitioners (Jones et al., 2017). Teachers played a critical collaborative role in the development of all components of the assessment system, and teacher practices of assessment developed and changed as teachers suggested changes to constructs (e.g., clarifications of descriptions, contributions of video exemplars of student thinking) and items as they changed their instructional practices to use assessment results to advance student learning (Lehrer et al., 2014). Teachers collaborated with researchers to develop guidelines for employing student responses to formative assessments to conduct more productive classroom conversations where students' ways of thinking, as characterized by levels of a construct, constituted essential elements of a classroom dialog aimed at creating new opportunities for learning (Kim and Lehrer, 2015). During such a conversation, dubbed a "Formative Assessment Conversation" by teachers, teachers drew upon student responses to juxtapose different ways of thinking about the same idea (e.g., a measure of center). Teachers also employed items and item responses as launching pads for extending student conceptions. For example, a teacher might change the nature of a distribution presented in an item and ask students to anticipate and justify effects of this change on sample statistics.

The Six Constructs in the Learning Progression

To describe forms of conceptual change supported by student participation in data modeling practices, we generated six constructs (Lehrer et al., 2014). The constructs were developed during the course of the previously cited design studies, which collectively established typical patterns of conceptual growth as students learned to visualize, measure, and model variability generated by processes ranging from repeated measure to production (e.g., different methods for making packages of toothpicks) to organismic growth (e.g., measures of plant growth). Conceptual pivots to promote change, most especially inducting students into statistical practices of visualizing, measuring, and modeling variability, were structured and instantiated by a curriculum which included rationales for particular tasks, tools, and activity structures, guides for conducting mathematically productive classroom conversations, and a series of formative assessments that teachers could deploy to support learning.

Visualizing Data

Two of the six constructs represent progression in forms of thinking that typically emerge as students are inducted into practices of visualizing data. Students were inducted into this practice by positioning them as inventors and critics of visualizations of data they had generated (Petrosino et al., 2003). The first, Data Display (DaD), describes conceptions of data that inform how students construct and interpret representations of data. These conceptions are arranged along a dimension anchored by interpreting data through the lens of individual cases to viewing data as distributed—that is, as reflecting properties of an aggregate. At the upper anchor of the construct, aggregate properties constitute a lens for viewing cases. For example, some cases may be more centrally located in a distribution, or some cases may not conform as well as others with properties of the distributed aggregate. A closely associated construct, meta-representational competence (MRC), identifies keystone understandings as students learn to manage representations in order to make claims about data and to consider trade-offs among possible representations in light of particular claims.


Conceptions of Statistics

A third construct, conceptions of statistics (CoS), describes changes in students' CoS when they have repeated opportunities to invent and critique measures of characteristics of a distribution, such as its center and spread. Initially, students tend to think of statistics not as measures but as the result of computations. For these students, batches of data prompt computation, but without sensitivity to the data (e.g., the presence of extreme values) or to a question in mind. As students invent and revise measures of distribution (the invention and critique of measures of distribution is viewed as another conceptual pivot), their CoS encompass a view of statistics as measures, with corresponding sensitivity to qualities of the distribution being summarized, the generalizability of the statistic to other potential distributions, and to the question at hand. The upper anchor of this construct entails recognition of statistics as subject to sample-to-sample variation.

Conceptions of Chance

Chance (Cha) describes the progression of students' understanding about how elementary probability operates to produce distributions of outcomes. Initial forms of understanding are intuitive and rely on conceptions of agency (e.g., "favorite numbers"). Initial transition away from this agentive view includes the development of the concept of a trial, or repeatable event, as students investigate the behavior of simple random devices. The concept of trial, which also entails abandonment of personal influence on selected outcomes, makes possible a perspectival shift that frames chance as associated with a long-term process, a necessity for a frequentist view of probability (Thompson et al., 2007). Intermediate forms of understanding of chance include development of probability as a measure of uncertainty, and estimation of probabilities as ratios of target outcomes to all possible outcomes of a long-term, repeated process. The upper anchor coordinates sample spaces and relative frequencies as complementary ways of estimating probabilities. Transitions in conceptions of chance are supported by student investigation of the long-run behavior of chance devices, and by the use of statistics to describe characteristics of the resulting distribution of outcomes. A further significant shift in perspective occurs as students summarize a sample with a statistic (e.g., percent of red outcomes in 10 repetitions of a 2-color spinner) and then collect many samples. This leads to a new kind of distribution, that of a sampling distribution of sample statistics, and with it, the emergence of a new perspective on statistics as described by the upper anchor of the CoS construct. The constructs of CoS and Cha are related in that the upper anchor of CoS depends upon conceptions of sample-to-sample variation attributed to chance.
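
This pivot from a single statistic to a collection of statistics is easy to express as a simulation (a sketch of the idea only; the 50/50 spinner and the counts are illustrative, not taken from the curriculum):

```python
import random
from collections import Counter

random.seed(7)

def percent_red(n_spins=10, p_red=0.5):
    """One sample: spin a 2-color spinner n_spins times and summarize as % red."""
    return 100 * sum(random.random() < p_red for _ in range(n_spins)) // n_spins

# Collecting the statistic across many samples yields a sampling distribution.
samples = [percent_red() for _ in range(1000)]
print(sorted(Counter(samples).items()))  # 50% is most common; 0% and 100% are rare
```
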
Modeling Variability

Building on changing conceptions of chance, the MoV construct posits a progression in learning to construct and evaluate models that include elements of random variation. Modeling chance begins with identification of sources of variability, progresses to employing chance devices to represent sources of variability, and culminates in judging model fit by considering relations between repeated model simulations and an empirical sample. Student conceptions of models and modeling are fostered by positioning students to invent and contest models of processes, ranging from those involving signal and noise, with readily identified sources of variability, to those with less visible sources of variability, as in the natural variation of a sample of organisms.

Informal Inference

The sixth and final construct, Informal Inference (InI), describes transitions in students' reasoning about inference. The term informal is meant to convey that students are not expected to develop conceptions of probability density and related formalisms that guide professional practice, but they are nonetheless involved in making generalizations or predictions beyond the specific data at hand. The initial levels of the construct describe inferences informed by personal beliefs and experiences in which data do not play a role, other than perhaps confirmation of what one believes. A mid-level of the construct is represented by conceiving of inference as guided by qualities of distribution, such as central clumps in some visualizations or even summary statistics. In short, inference is guided by careful attention to characteristics evident in a sample. At the upper anchor, students develop a hierarchical image of sample in which an empirical sample is viewed as but one instance of a potentially infinite collection of samples generated by a long-term, repeated process (Saldanha and Thompson, 2014). Inference is then guided by this understanding of sample, a cornerstone of professional practice of inference (Garfield et al., 2015).

Generally, the construct maps are psychometrically analyzed and scaled using multidimensional Rasch models (Schwartz et al., 2017), and the requirement relationships are analyzed using structured construct models (Wilson, 2012; Shin et al., 2017).

CHECKING FOR PROGRESS: COMPARING PRE-TEST AND POST-TEST RESULTS

When conceptualizing and building a learning progression, researchers need to inquire about the extent to which progression-centered instruction influences conceptual change. Constructs inform us about the nature of such change, and pre–post, construct-based assessment informs us about the robustness of the design.


In one of the design studies mentioned previously, the project team worked with a sixth-grade teacher to conduct 4 replications of the progression (4 classes taught by the same teacher) with a total of 93 students. The study was aimed at examining variation in student understandings and activity between classes and at gauging the extent of conceptual change for individual students across classes. Students responded to tests taking approximately 1 h to complete before and after instruction. The tests shared only a few items in common, enough to link the two tests together. Item response models (Wright and Masters, 1981) were used to link the scale between the pre-test and the post-test. The underlying latent ability was then used as the metric in which to calculate student gains. This framework allowed us to select a model that accounted for having different items on the forms, varying item difficulty, and different maximum scores of the items.

In the analysis, we used data from both the pre-test and the post-test to estimate the item parameters. We used Rasch models for two reasons: (a) tests that conform to the Rasch model assumptions have desirable characteristics (Wilson, 2005), and (b) the technique of gain scores uses unweighted item scores, and that is satisfied by the Rasch family of models. The estimated item parameters were then used as anchored values, and the person ability parameters became the object of estimation. The mean differences between the pre- and post-test were estimated simultaneously with the person and item parameters.

For the scaling model, the Random Coefficients Multinomial Logit Model (RCML) (Adams and Wilson, 1996) was used. The usual IRT assumption of having a normal person ability distribution common across both test times is unlikely to be met if indeed the instruction instigates conceptual change, because one would expect to see post-test student abilities that are higher than the pre-test abilities, and this would likely lead to a bi-modal person ability distribution if data from the two tests are combined for analysis. To avoid the unidimensional normality assumption issue, we used a 2-dimensional analysis where the first dimension was the pre-test and the second dimension was the post-test. This can be achieved with a constrained version of the Multidimensional RCML (MRCML) (Adams et al., 1997; Briggs and Wilson, 2003) with common item parameters constrained to be equal across the two dimensions (pre and post). This is a simple example of what is known as Andersen's Model of Growth (Andersen, 1985). When one constrains the item difficulty of items common to both the pre- and post-test, then the metric is the same for the two test times, and the mean difference between pre- and post-test abilities is the gain in ability between the two tests (Ayers and Wilson, 2011). The MRCML model was designed to allow for flexibility in designing custom models and is the basis for the parameter estimation in the ConQuest software (Adams et al., 2020). We formulated the MRCML as a Partial Credit Model (PCM) (Masters, 1982) and used the within-items form of the Multidimensional PCM, as each common item loads onto both the pre-test and the post-test dimension.
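
As a toy illustration of this anchoring logic (a sketch only, using dichotomous Rasch items and per-person Newton estimation rather than the MRCML estimation that ConQuest performs; every value is invented), the following simulation fixes the difficulties of a set of common items across the two occasions and recovers a built-in gain of 1.0 logits as the difference between the mean estimated abilities:

```python
import numpy as np

rng = np.random.default_rng(1)
diffs = np.linspace(-1.5, 1.5, 12)     # anchored difficulties of 12 common items
theta_pre = rng.normal(0.0, 1.0, 200)  # person abilities at pre-test
theta_post = theta_pre + 1.0           # growth model: every ability shifts by the gain

def simulate(theta):
    """Dichotomous Rasch responses: P(correct) = logistic(theta - difficulty)."""
    p = 1 / (1 + np.exp(-(theta[:, None] - diffs[None, :])))
    return (rng.random(p.shape) < p).astype(int)

def mle_ability(resp, n_iter=30):
    """Per-person Newton MLE with the item difficulties held fixed (anchored).
    Estimates are clipped to avoid divergence for perfect/zero scores."""
    theta = np.zeros(resp.shape[0])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(theta[:, None] - diffs[None, :])))
        step = (resp - p).sum(axis=1) / (p * (1 - p)).sum(axis=1)
        theta = np.clip(theta + step, -4, 4)
    return theta

gain = mle_ability(simulate(theta_post)).mean() - mle_ability(simulate(theta_pre)).mean()
print(f"estimated mean gain: {gain:.2f} logits (true value 1.0)")
```
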
Results from the ConQuest analysis are summarized in Table 1. In particular, we focus on MoV, as above, for illustrative purposes: the mean ability gain for MoV was 1.312 logits.

TABLE 1 | Pre- and post-test mean ability estimates, gain scores, and Wald test significance results.

Construct   Pre-test   Post-test   Gain    Significance Test
Cha          0.052      1.092      1.040   W = 7.32,  p < 0.0001
CoS         −0.59       0.244      0.834   W = 6.80,  p < 0.0001
DaD         −0.02       0.749      0.769   W = 9.01,  p < 0.0001
InI         −0.414      0.217      0.631   W = 6.86,  p < 0.0001
MoV         −0.936      0.376      1.312   W = 12.80, p < 0.0001
MRC         −0.199      0.164      0.363   W = 3.70,  p < 0.0001

In order to test whether the difference between the post-test ability and the pre-test ability is statistically significant, we used a Wald test,

    W = (μ̂_post − μ̂_pre) / √(s²_pre/N + s²_post/N),    (1)

where μ̂_pre, μ̂_post, s²_pre, and s²_post are the sample means and variances, and N is the sample size. A size α Wald test rejects the null hypothesis (of no difference) when |W| > z_(α/2).
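
In code, Equation 1 amounts to the following minimal sketch (the ability values below are invented for illustration; in the study, W was computed from the estimated person abilities for each construct):

```python
import math

def wald_test(pre, post):
    """Two-sided Wald test for a gain in mean ability (Equation 1)."""
    n_pre, n_post = len(pre), len(post)
    mean_pre, mean_post = sum(pre) / n_pre, sum(post) / n_post
    var_pre = sum((x - mean_pre) ** 2 for x in pre) / (n_pre - 1)
    var_post = sum((x - mean_post) ** 2 for x in post) / (n_post - 1)
    w = (mean_post - mean_pre) / math.sqrt(var_pre / n_pre + var_post / n_post)
    p = math.erfc(abs(w) / math.sqrt(2))  # two-sided p-value from the standard normal
    return w, p

# Invented logit-scale ability estimates for a handful of students:
w, p = wald_test(pre=[-1.2, -0.8, -1.0, -0.5, -1.4, -0.7],
                 post=[0.1, 0.6, 0.2, 0.9, -0.1, 0.5])
print(f"W = {w:.2f}, p = {p:.4g}")  # reject the null of no gain when |W| > z_(alpha/2)
```
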
Row 5 in Table 1 shows the results for the MoV construct. Column 1 indicates the construct, columns 2–4 indicate the mean pre-test ability, mean post-test ability, and the gain, and column 5 shows the Wald test statistic and the p-value when using α = 0.05. For MoV, the test statistic is W = 12.80 and p < 0.0001. Thus, we can reject the null hypothesis and conclude that the post-test ability is significantly higher than the pre-test ability. The other rows show the results for the remaining five constructs. In each case the p-value is less than 0.0001, and thus we have statistically significant gains.

In addition to statistical significance, it is important to gauge effect size—that is, are the gains large enough to claim that these are important effects? In our studies of scaling educational achievement tests, we have found from experience that, looking at similar achievement tests, typical differences in achievement test results from 1 year to the next are approximately 0.3–0.5 logits [see, for example, Wilson et al. (2019a) and Wilson et al. (2012)]. Hence, we see these gains (which are greater than half a logit for all but one of the constructs) as representing very important gains over the briefer (7–8 weeks) period of instruction in data modeling.

With this summative illustration in mind, we turn now to some of its underpinnings and to the use of assessments by teachers. In what follows, we focus on one particular construct from among the six in the full Data Modeling learning progression, MoV, and describe (a) the development of assessments based on the construct maps, and their relationships with instruction by teachers, (b) the development of empirical maps of these constructs (referred to as "Wright maps"), and (c) the uses of reports based on these maps by teachers.

A CLOSER VIEW OF A CONSTRUCT: MODELING VARIABILITY

As noted previously, the MoV construct refers to the conceptions and practices of modeling variability. Modeling-related concepts emerge as students participate in curricular activities designed to make the role of models in statistical inference visible and tractable. In professional practice, models guiding inference rely on probability density functions, but we take a more informal approach (see Lehrer et al., 2020 for a more complete description).


One building block of inference is an image of variable outcomes that are produced by the same underlying process. Accordingly, we first have students use a 15 cm ruler to measure the length of the same object, such as the perimeter of a table or the length of their teacher's outstretched arms. Much to their surprise, students find that if they have measured independently, their measures are not all the same, and furthermore, the extent of variability is usually substantially less when other tools, such as a meter stick, are used. Students' participation informs their understanding of potential sources of differences observed, which tend to rely on perceptions of signal (the length of the object) and noise ("mistakes" measurers made, arising from small but cumulative errors in iterating the ruler or from how they treated the rounded corners of a table, etc.). A signal-noise interpretation is a conceptual pivot in that it affords an initial step toward understanding how sample variability could arise from a repeated process. And the accessibility of the process of measuring allows students to make attributions of different sources of error—a prelude to an analysis of variance.

The initial seed of an image of a long-range process, so important to thinking about probability and chance, is systematically cultivated throughout the curricular sequence. For example, as students invent visualizations of the sample of measured values of the object's length, many create displays of data that afford noticing center clumps and symmetries of the batch of data. This noticing provides an opportunity for teachers to have students account for what they have visualized. For example, what about the process tends to account for center clump and symmetry? Students critique their inventions with an eye toward what different invented representations tend to highlight and subdue about the data, so that the interplay between invention and critique constitutes opportunities for students to develop representational and metarepresentational competencies, which are described by the two associated constructs, DaD and MRC.

Students go on to invent measures of characteristics of the sample distribution, such as an estimate of the true length of the object (e.g., sample medians) and the tendency of the measurers to agree (i.e., precision of measure). Invented statistics are critiqued with an eye toward what they attend to in the sample distribution and what might happen if the distribution were to be transformed in some way (e.g., sample size increased). Invention and critique help make the conceptions and methods of statistics more intelligible to students, and transitions in students' conceptions are illustrated by the CoS construct. After revisiting visualizing and measuring characteristics of distributions in other signal-noise contexts (e.g., manufacturing Play-Doh candy rolls), students grapple with chance by designing chance devices and observing their behavior. The conceptual pivot of signal and noise comes into play with the structure of the device playing the role of signal and chance deviations from this structure playing the role of noise. Changes in conceptions of chance that emerge during the course of these investigations are described by the Cha construct and by the upper anchor of the CoS construct.

With this conceptual and experiential grounding, students are challenged to invent and critique MoV producing processes that now include chance. Initially, models are devoted to re-considerations of signal and noise processes from the perspective of reconsidering mistakes as random—for instance, despite being careful, small slippages in iteration with a ruler over a long span appear inevitable and also unpredictable. Accordingly, the process of measuring an object's length can be re-considered as a blend of a fixed value of length and sources of random error. As students invent and critique MoV in contexts ranging from signal and noise processes to those generating "natural" variation, they have opportunities to elaborate their conceptions of modeling variability.

Modeling Variability (MoV)

With the preceding in mind, we can now describe the MoV construct and its construct map—refer to Figure 1 for an outline. Students at the first level, MoV1, associate variability with particular sources, which is facilitated by reflecting on processes characterized by signal and noise. For example, when considering variability of measures of the same object's length, students may consider variability as arising from misdeeds of measurement, "mistakes" made by some measurers because they were not "careful." To be categorized at this initial level, it is sufficient that students demonstrate an attribution about one or more sources of variability, but not that they implicate chance origins of variability.

At level MoV2, students begin to informally order the contributions of different sources to variability, using language such as "a lot" or "a little." They refer to mechanisms and/or processes that account for these distinctions, and they predict or account for the effects on variability of changes in these mechanisms or processes. For example, two students who measured the perimeter of the same table attributed errors to iteration, which they perceived as substantial. Then, they went on to consider other errors that might have less impact, such as a "false start." The conversation below illustrates that students need to clarify the nature of each source of variability and decide whether or not the source is worthy of inclusion in a model.

Cameron: How would we graph–, I mean, what is a false start, anyway?

Brianna: Like you have the ruler, but you start at the ruler edge, but the ruler might be a little bit after it, so you get, like, half a centimeter off.

Cameron: So, then it would not be 33, it'd be 16.5, because it'd be half a centimeter off?

Brianna: Yeah, it might be a whole one, because on the ruler that we had, there was half a centimeter on one side, and half a centimeter on the other side, so it might be 33 still, and I think we subtract 33.

Cameron: Yeah, because if you get a false start, you're gonna miss (Lehrer et al., 2020).


An important transition in student reasoning occurs at level MoV3: here students explicitly consider chance as contributing to variability. In the curricular sequence, students first investigate the behavior of simple devices for which there is widespread acknowledgment that the behavior of the device is "random." For example, consider the spinner illustrated in the top panel of Figure 2. This is a blank ("mystery") spinner, and the task for the student is to draw a line dividing the spinner into two sectors which indicate the two proportions for the outcomes of the spinner, which are given in the lower panel. The conceptual consequences of investigations like these are primarily captured in the Cha construct, but from the perspective of modeling, students come to appreciate chance as a source of variability.

FIGURE 2 | Representation of a "mystery" spinner.

At level MoV4, there is a challenging transition to conceptualizing variability in a process as emerging from the composition of multiple, often random, sources. For example, a distribution of repeated measurements of an attribute of the same object can be modeled as a composition of a fixed or true measure of the attribute and one or more components of chance error in measure. Figure 3 illustrates a student-generated model that approximates the variability evident in a sample of class measures of the length of their teacher's arm-span. In this figure, the spinners are ordered from left to right, and are labeled above each with a 3-letter code, described below.


The first spinner (labeled MDN) is simply the sample median of the observed sample, taken by the students as their "best guess" of the true value of the teacher's arm-span—this is a deterministic effect, as represented by the fact that there is only one sector in the spinner, so it will return the same value (i.e., 157) on each spin. The remaining four spinners all model random effects, as they will return different values on each spin, with probability proportional to the proportion of the spinner area occupied by each sector¹. The second spinner (labeled GAP) represents the "gaps" that occur when students iterate application of their rulers across the teacher's back: the probabilities of these gaps are proportional to the areas of the spinner sectors, and the values that are returned are shown within each sector (i.e., −1 to −9). The students had surmised that smaller magnitudes of the gaps are more likely to occur than larger magnitudes; hence the sectors vary in area, reducing from the area for −1 to the area for −9. The values are negative because gaps are unmeasured space and hence result in underestimates of the arm-span. The third spinner (labeled LAP) represents the "overlaps" that occur when the endpoint of one iteration of the ruler overlaps with the starting point of the next iteration. The interpretation of the values and sectors is parallel to that for GAP. The values are positive because this mistake creates overestimates of the measure (i.e., the same space is counted more than once). The fourth spinner (labeled Droop) represents both under- and over-estimates—these result when the teacher becomes tired and her outstretched arms droop. The last spinner (labeled Whoops) represents probabilities and values of mis-calculations when each student generated a measure. The way the whole spinner device works is that the result from each spinner is added to a total to generate a single simulated measurement value. This then can be used to generate multiple values for a distribution (students typically used 30 repetitions in the Data Modeling curriculum because these corresponded to the number of measurers).

¹ Or, equally, the internal angle of the sector, or the proportion of the circumference occupied by the sector.

FIGURE 3 | Modeling variability as a composition of signal and multiple sources of random error.
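
In code, the behavior of such a device can be sketched as follows (the sector values and probabilities are illustrative stand-ins for the student model in Figure 3; only the median of 157 and the use of 30 repetitions come from the account above):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative stand-ins for the spinners in Figure 3: each spinner is a pair
# (returned values, sector probabilities). MDN is deterministic (one sector).
spinners = {
    "MDN":    ([157], [1.0]),
    "GAP":    ([-1, -3, -5, -7, -9], [0.40, 0.25, 0.17, 0.11, 0.07]),
    "LAP":    ([1, 3, 5, 7, 9],      [0.40, 0.25, 0.17, 0.11, 0.07]),
    "Droop":  ([-2, 0, 2],           [0.25, 0.50, 0.25]),
    "Whoops": ([-10, 0, 10],         [0.05, 0.90, 0.05]),
}

def simulated_measurement():
    """Spin every spinner once and add the results: signal plus random errors."""
    return sum(rng.choice(vals, p=probs) for vals, probs in spinners.values())

# One simulated class: 30 repetitions, matching the 30 measurers.
sample = [simulated_measurement() for _ in range(30)]
print(sample)
print("median:", np.median(sample))
```
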
MoV4 culminates with the capacity to compare two or more emergent models, a capacity which is developed as students critique models invented by others.

At level MoV5, students consider variability when evaluating models. For example, they recognize that just by chance one run of a model's simulated outcomes may fit an empirical sample well (e.g., similar median and IQR values, similar "shapes," etc.) but the next simulated sample might not. So students, often prompted by teachers to think about running the simulations "again and again," begin to appreciate the role of multiple runs of model simulations in judging the suitability of the model. Sampling distributions of estimates of model parameters, such as the simulated sample median and IQR, are used to judge whether or not the model tends to approximate characteristics of an empirical sample at hand.
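
Continuing the sketch above, a single run of the model produces one simulated sample; many runs produce sampling distributions of the simulated median and IQR, against which the empirical statistics can be checked (the empirical values here are invented):

```python
# Reuses simulated_measurement() from the previous sketch; values are invented.
import numpy as np

empirical_median, empirical_iqr = 157.0, 6.0

def run_model(n_measurers=30):
    """One simulated sample and its summary statistics (median, IQR)."""
    sample = [simulated_measurement() for _ in range(n_measurers)]
    q1, q3 = np.percentile(sample, [25, 75])
    return np.median(sample), q3 - q1

# Sampling distributions of the model's median and IQR over repeated runs.
medians, iqrs = zip(*(run_model() for _ in range(500)))
lo, hi = np.percentile(medians, [5, 95])
print(f"90% of simulated medians fall in [{lo:.1f}, {hi:.1f}]")
print("empirical median plausible under this model:",
      lo <= empirical_median <= hi)  # the IQR can be checked the same way
```
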
This is a very rich set of concepts for students to explore and, eventually, grasp. Just one example of this richness is indicated by a classroom discussion of the plausibility of sample values that were generated by a model, but which were absent in the original empirical sample. The argument for and against a model that could generate such a value arose during a formative assessment, and as in other formative assessments, the teacher conducted a follow-up conversation during which different student solutions were compared and contrasted to instigate a transition in student thinking, here from MoV4 to MoV5. As the teacher anticipated, some students immediately objected to model outcomes not represented as cases in the original sample. They proposed a revision to the model under consideration by the class which would eliminate this possibility.

Students: Take away −9 in the spinner.

Teacher: Why?

Joash: Because there's no 9 in this [the original sample].

But another student, Garth, responded, "Yeah, but that doesn't mean 9 is impossible." He went on to elaborate that the model was "focused on the probability of messing up," so to the extent that error magnitudes and probabilities were plausible, they should not be excluded, and one would have to accept the simulated values generated by the model as possible values (Lehrer et al., 2020). This discussion of "possible values" eventually assumed increasing prominence among the students, and led to a point where they began to consider empirical samples as simultaneously (a) a collection of outcomes observed in the world and (b) a member of a potentially infinite collection of samples (Lehrer, 2017). As mentioned previously, this dual recognition of the nature of a sample is an important seed stock of statistical inference.


BUILDING AN ASSESSMENT SYSTEM IN THE CONTEXT OF A CONSTRUCT MAP

Having achieved this initial step of generating a construct map that reflected benchmarks in students' conceptions of modeling variability, we designed an assessment system to provide formative feedback to teachers to help them monitor student progress and also to provide summative assessment for other classroom and school uses. In practice, design and development of the assessment system paralleled that of the development of the curricular sequence, albeit with some lag to reflect upon the robustness of emerging patterns of student thinking. In this process, we follow Wilson's (2005) construct-centered design process, the Bear Assessment System (BAS), where items are designed to measure specific learning levels for the construct, and where item modes include both multiple-choice and constructed-response types. The student responses to the constructed-response items are first mined to develop scoring guidelines consisting of descriptions of student reasoning, and to provide examples of student work, all aligned to specific levels of performance. This process involved multiple iterations of item design, scoring, comparison of the coded responses to the construct map, then fitting of items to psychometric models using the resulting data, and subsequent item revision/generation. These iterations occur over fairly long periods, and are based on the data from students across multiple project teachers—these teachers had variable amounts of expertise, but were all engaged in the professional development that was an inherent part of being a member of that team. The review teams included the Vanderbilt project leaders, who, along with other colleagues, were working directly with teachers. This brought the instructional experiences of teachers into the process, and, for some issues that arose, teachers proposed and tried out new approaches. Sometimes, due to patterns in student responses, we refined construct map levels and/or items, and refined coding exemplars, including re-descriptions of student reasoning and inclusion of more or better examples of student work (constructs were also revised to make them more intelligible and useful for guiding instruction in partnership with teachers, as suggested earlier). New items were developed where we found gaps in coverage of the landmarks in the roadmap of student learning described by the construct. In the light of student responses, some items that could not be repaired were discarded; others were redesigned to generate clearer evidence of student reasoning. Sometimes, student responses to items could not be identified as belonging to the construct but nonetheless appeared to indicate a distinctive and important aspect of reasoning: this led to revision of constructs and/or levels. A much more detailed explanation of this design approach to developing an assessment system is given in Wilson (2005).

Example Item 1—Piano Width

To illustrate the way that items are matched to construct map levels, consider the Piano Width task illustrated in Figure 4. This task capitalizes on the Data Modeling student's experiences with ruler iteration errors ("gaps and laps") in learning about measurement as a process that generates variability. We comment specifically on question 1 of the Piano Width item: the first part—1(a)—is intended mainly to have the student adopt a position, and is coded simply as correct or incorrect. The interesting question, as far as the responses are concerned, is the second part—1(b)—here the most sophisticated responses are at the MoV2 level, and typically fall into one of two categories after choosing "Yes" to question 1(a).

FIGURE 4 | The Piano Width item. A group of musicians measured the width of a piano in centimeters. Each musician in this group measured using a small ruler (15 cm long). They had to flip the ruler over and over across the width of the piano to find the total number of centimeters. A second group of musicians also measured the piano's width, using a meter stick instead. They simply laid the stick on the piano and read the length. The graphs below display the groups' measurements. 1(a) The two groups used different tools. Did the tool they used affect their measurement? (check one) ( ) Yes ( ) No. 1(b) Explain your answer. You can write on the displays if that will help you to explain better. 2(a) How does using a different tool change the precision of measurements? (check one) (a) Using different tools does not affect the precision of measurements. (b) Using the small ruler makes precision better. (c) Using the meter stick makes precision better. 2(b) Explain your answer. (What about the displays makes you think so?)


MOV2B: the student describes how a process or change in the process affects the variability; that is, the student compares the variability shown by the two displays. The student mentions specific data points or characteristics of the displays. For example, one student wrote: "The Meter stick gives a more precise measurement because more students measured 80–84 with the meter stick than with the ruler."

MOV2A: the student informally estimates the magnitude of variation due to one or more sources; that is, the student mentions sources of variability in the ruler or meter stick. For example, one student wrote: "The small ruler gives you more opportunities to mess up."

Note that this is an illustration of how the construct map levels may be manifested in multiple sub-levels, and, as in this case, there may be some ordering among the sub-levels (i.e., MoV2B is seen as a more complete answer than MoV2A).

Less sophisticated responses are also found:

MOV1: the student attributes variability to specific sources or causes; that is, the student chooses "Yes" and attributes the differences in variability to the measuring tools without referring to information from the displays. For example, one student wrote: "The meterstick works better because it is longer."

Of course, students also give unclear or irrelevant responses, such as the following: "Yes, because pianos are heavy"—these are labeled as "No Link(i)," abbreviated NL(i). In the initial stages of instruction in this topic, students also gave a level of response that is not clearly yet at level MoV1, but was judged to be better than completely irrelevant—typically these responses contained relevant terms and ideas, but were not accurate enough to warrant labeling as MoV1. For example, one student wrote: "No, Because it equals the same." This scoring level was labeled "No Link(ii)," abbreviated NL(ii), and was placed lower than MoV1. Note that a complete scoring guide for this task is shown in Appendix A.

An Empirical Version of the Learning Construct—The Wright Map

We used a sample of 1002 middle school students from multiple school districts involved in a calibration of the learning progression, which included (a) generation of tools to support professional development beyond the initial instantiations of the progression, most especially collaboration with teachers to develop curriculum and associated materials, (b) expansion of constructs to include video exemplars (Kim and Lehrer, 2015), and (c) item calibration (Schwartz et al., 2017), all of which were conducted prior to implementation of a cluster randomized trial. In a series of analyses carried out before the one on which the following results are based, we investigated rater effects for the constructed-response items; no statistically significant rater effects were found, so these are not included in the analysis. We fitted a partial-credit, one-dimensional item response theory (IRT) model, often termed a Rasch model, to the item responses related to the MoV construct. This model distinguishes among levels of the construct (Masters, 1982). For each item, we use threshold values (also called "Thurstonian thresholds") to describe the empirical characteristics of the item (Wilson, 2005; Adams et al., 2020).

The way that the item is represented is as follows:

(a) if an item has k scoring levels, then there are k−1 thresholds on the Wright map, one for each transition between scores;
(b) each item threshold gives the ability level (in logits) that a student must obtain to have a 50% chance of scoring at the associated scoring category or above (the locations of these thresholds for the MoV items are shown in the columns on the right side of Figure 5—more detail below).

For example, suppose an item has three possible score levels (0, 1, and 2). In this case there will be two thresholds. Suppose that the first threshold has a value of −0.25 logits: this means that a student with that same ability of −0.25 has an equal chance of scoring in category 0 compared to categories 1 and 2. If their ability is lower than the threshold value (−0.25 logits), then they have a higher probability of scoring in category 0; if their ability is higher than −0.25, then they have a higher probability of scoring in either category 1 or 2 (than 0). These thresholds are, by definition, ordered: in the given example, the second threshold value must be greater than −0.25. Items may have just one threshold (i.e., dichotomous items, for example, traditional multiple-choice items), or they can be polytomous. It would be very transparent if every item had as many response categories as there are construct map levels; however, this is often not the case: sometimes items will have response categories that focus only on a sub-segment of the construct, or, somewhat less commonly, sometimes items will have several different response categories that match to just one of the levels of a construct map. For this reason, it is particularly important to pay careful attention to how item response categories can be related to the levels of the construct map.
The locations of the item thresholds can be graphically summarized in a Wright map, which is a graph that simultaneously shows estimates for both the students and items on the same (logit) scale. Figure 5 shows the Wright map for MoV, with the thresholds represented by "i.k," where i is the item number and k is the threshold number, so that, say, "9.2" stands for the second threshold for the 9th item. On the left side of the Wright map, the distribution of student abilities is displayed, where ability entails knowledge of the skills and practices for MoV. The person abilities have a roughly symmetric distribution. On the right side are shown the thresholds for 9 questions from 5 tasks in MoV. In Figure 5, the thresholds for Piano Width questions 1(b) and 2(b) are labeled as 9.k and 10.k, respectively. Again, focusing on question 1(b) (item 9), the thresholds (9.1, 9.2, and 9.3) were estimated to be −0.97, 0.16, and 1.20 logits, respectively. Looking at Figure 5, one can see that they stand roughly in the middle of the segments (indicated by the horizontal lines) of the logit scale for the NL(ii), MoV1, and MoV2&3 levels, respectively.

FIGURE 5 | The Wright Map for MoV.

Looking beyond a single item, we need to investigate the consistency of the locations of these thresholds across items. We used a standard-setting procedure called "construct mapping" (Draney and Wilson, 2011) to develop cut-scores between the levels.


Following that process, we found that the thresholds fall quite consistently into the ordered levels, with a few exceptions, specifically 8.3, 10.3, and 11.1. In our initial representations of this Wright map, we found that the thresholds for levels 2 and 3 were thoroughly mixed together. We spent a large amount of time exploring this, both quantitatively, using the data, and qualitatively, examining item contents and talking to curriculum developers and teachers about the apparent anomaly. Our conclusion was that, although there is certainly a necessary hierarchy to the lower ends of these two levels—there is little hope for a student to use a chance device to represent a source of variability (MoV3) if they cannot informally describe such a source (MoV2)—the levels can and do overlap quite a bit in the classrooms of the project. Students are still improving on MoV2 when they are initially starting on MoV3, and they continue to improve on both at about the same time. Hence, at least formally, while we decided to uphold the distinction between MoV2 and MoV3, we also decided to ignore the difference in difficulty of the levels, and to label the segment of the scale (i.e., the relevant band) as MoV2&3. Thus, our MoV construct map may be modified as in Figure 6.

FIGURE 6 | The Revised MoV construct map.


These band-segments can then be used as a means of labeling estimated student locations with respect to the construct map levels NL(i) to MoV4. For example, a student estimated to be at 1.0 logits could be interpreted as being at the point of most actively learning (specifically, succeeding at the relevant levels approximately 50% of the time) within the construct map levels MoV2 and MoV3—that is, being able to informally describe the contribution of one or more sources of variability to the observed variability in the system, while at the same time developing a chance device (such as a spinner) to represent that relationship. The same student would be expected to succeed more consistently (approximately 75%) at level MoV1 (i.e., being able to identify sources of variability), and to succeed much less often (approximately 25%) at level MoV4 (i.e., developing an emergent model of variability). Thus, the average gain of the students on the Modeling Variability construct, as reported above (1.312 logits), was, in addition to being statistically significant, also educationally meaningful, representing approximately a difference of a full MoV construct level.
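
In reporting terms, this banding amounts to mapping an estimated student location onto a labeled segment of the scale. A minimal sketch (the cut-points below are hypothetical; in practice they come from the construct-mapping procedure described earlier):

```python
# Hypothetical cut-points between bands, in logits (upper bound, band label).
BANDS = [
    (-1.8, "NL(i)"), (-1.0, "NL(ii)"), (0.2, "MoV1"),
    (1.9, "MoV2&3"), (float("inf"), "MoV4"),
]

def band(theta):
    """Label an estimated student location with its construct-map band."""
    for upper, label in BANDS:
        if theta < upper:
            return label

print(band(1.0))  # -> "MoV2&3", as in the example above
```
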
USING THE BASS SYSTEM TO IMPLEMENT THE ASSESSMENTS IN THE LEARNING PROGRESSION

The components of the assessment system, as shown above, including the construct maps, the items, and the scoring guides, are implemented within the online Bear Assessment System Software (BASS), which can deliver the items, automatically score those designed that way (or manage a hand-scoring procedure for items designed to be open-ended), assemble the data into a manageable data set, analyze the data using Rasch-type models according to the designed constructs, and report on the results, in terms of (a) a comprehensive analysis report, and (b) individual and group results for classroom use (Torres Irribarra et al., 2015; Fisher and Wilson, 2019; Wilson et al., 2019b). In this account, we will not dwell on the structures and features of this program, but will focus instead on those parts of the BASS reports that will be helpful to a teacher involved in teaching based on the MoV construct.

Figure 6 gives an overall empirical picture of the MoV construct, and this is also the starting point for a teacher's use of the software². This map shows the relationship between the students in the calibration sample for the MoV construct and the MoV items, as represented by their Thurstone thresholds. The thresholds span approximately 5 logits, and the student span is about the same, though they range about 1.5 logits lower. This is a relatively wide range of probabilities of success—for a threshold located at 0.0 logits, a student at the lowest location will have approximately a 0.05 chance of achieving at that threshold level; in contrast, a student at the highest location will have approximately a 0.92 chance of achieving at that threshold level.
difference of a full MoV construct level. This very wide range reflects that this construct is not one that is
commonly taught in schools, so that the underlying variation is

USING THE BASS SYSTEM TO 2


The BASS software has been developed as an “enterprise-wide” application,
and hence, can be used to facilitate the entire sequence tasks of assessment
IMPLEMENT THE ASSESSMENTS IN system development, from the conception of constructs, to the gathering of
THE LEARNING PROGRESSION development-level data sets, the analysis of assessment data sets, the building
of an item-bank, and the use of that item bank in specific assessment activities.
The components of the assessment system, as shown above, Different types of users have different scopes of interaction with the software, and
specifically, teachers would have the roles of assembling and scheduling assessment
including the construct maps, the items, and the scoring guides, activities, receiving teacher-level reports on the results, and generating reports,
are implemented within the online Bear Assessment System such as class summary reports and student-level reports. Training for these roles
Software (BASS), which can deliver the items, automatically score has been carried out on a one-to-one basis while the software development is
being completed, and will be implemented using online training. Of course, the
those designed that way (or manage a hand-scoring procedure
interpretation of these reports requires more than just training in use of software
for items designed to be open-ended), assemble the data into a but also includes the development of a teacher’s understanding of the essential
manageable data set, analyze the data using Rasch-type models ideas of the DM learning progression.
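As a quick arithmetic check on these readings, recall that under the Rasch model the probability of success is a logistic function of the distance between the student location and the threshold location. The sketch below is our illustration: the band thresholds and the extreme student locations are approximate read-offs chosen to reproduce the probabilities quoted above, not reported estimates.

```python
# Illustrative check of the probabilities quoted in the text. The threshold
# locations for MoV1, MoV2&3, and MoV4, and the extreme student locations,
# are hypothetical read-offs from the Wright map, not reported estimates.
import math

def p_success(theta, delta):
    """Rasch model probability of success: 1 / (1 + exp(-(theta - delta)))."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# A student at 1.0 logits against illustrative band thresholds:
for band, delta in [("MoV1", -0.1), ("MoV2&3", 1.0), ("MoV4", 2.1)]:
    print(f"{band:7s} P = {p_success(1.0, delta):.2f}")   # ~0.75, 0.50, 0.25

# The extremes of the student distribution against a threshold at 0.0 logits:
print(f"lowest  P = {p_success(-3.0, 0.0):.2f}")          # ~0.05
print(f"highest P = {p_success(+2.5, 0.0):.2f}")          # ~0.92
```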


FIGURE 7 | The Wright Map for MoV: Class I (left) and Class II (right). Each box represents approximately 1 case: 102 altogether.

This is a sobering challenge for a teacher—how to educate both the students who are working at the NL(i) level, not even being able to write down appropriate words in response to the items, and the students who are working at the MoV4 level, where they are able to engage in debates about the respective qualities of different probabilistic models.

Of course, no real class will have such a large number of students to contend with, so we have illustrated the equivalent map to Figure 5 for two different classes in Figure 7—Class I and Class II. These each represent the results for a whole school rather than for a single classroom, so they should be interpreted as a group of students who might be spread across several individual classrooms. The class whose distribution is shown on the far left of the logit scale in Figure 7 is representative of students before systematic instruction in modeling variability. The majority of students are at either level NL(i) or NL(ii), indicating that there has been little successful past instruction on this topic. Nevertheless, we can see that there are a few students who are working at the lower ends of MoV2&3; that is, they can recognize that there is a qualitative ordering of the effects of different sources of variation, and are beginning to understand how these might be encapsulated in a model based on chance.


FIGURE 8 | A BASS report on the performance of an individual student on the MoV items set.

Now, contrast that with the class (Class II) whose distribution is shown immediately to the right of Class I in Figure 7. Here the whole distribution has moved up by approximately a logit, and hence, the number of students at the lowest level, NL(i), is very small. The largest group in this class is at the level MoV1; that is, they can express their thinking about possible sources of variation, and a few students are beginning to operate at the highest level we observed, MoV4. This class is representative of students with some experience with constructing models but likely not with extensive opportunities to invent and revise models across multiple variability-generating contexts. This broad envisioning of the educational challenge of the classes (i.e., the range of the extant construct levels) allows a teacher to anchor their instructional planning in reliable information on student performance on an explicitly known (to the teacher) set of tasks.

Turning now to specific interpretive and diagnostic information that a teacher can gain from the system, consider an individual student report, as shown in Figure 8. Here we see that this student (anonymously labeled as "Ver1148" in this paper) is doing moderately well in the calibration sample—(s)he is most likely located in the MoV2 level (the 95% confidence interval is indicated on the graph by the horizontal bars around the central dot). This information suggests some useful educational possibilities for what to do next with this student: they should continue practicing the informal observation and description of sources of variability, and they should be moving on to learn how to use a chance device to represent the probabilities.

Individual student reports can also assist teachers in using an item formatively, by juxtaposing student solutions at adjacent levels of the construct and inviting whole-class reflection as a means to help students extend their reasoning to higher levels of the construct (Kim and Lehrer, 2015). The formative assessment conversation described previously during the presentation of MoV5 exemplifies this adjoining-construct-level heuristic. In that conversation, the teacher invited contrast between student models that emphasized recapitulation of values observed in a single sample (a MoV4 level) and a model that instead allowed for values plausibly reflecting the variability-generating process ("possible values"), which reflected the emphasis on sampling variability characteristic of MoV5.

The teacher can also look more deeply into a student's record for the construct, and examine their performance at the item level. This is illustrated in Figure 9, where the student's responses to each item that the teacher chose for the student's class are shown graphically—this shows exactly which responses (s)he gave to each item, and matches them to the construct map levels (i.e., the items are represented in the rows and the construct map levels are illustrated in the columns). Here it can be seen3 that the item-level results are quite consistent with the overall view given in Figure 5—the student performed with a moderate level of success on the MoV2 items (i.e., the last three), and did as well as can be expected on the first three items, which do not prompt responses for levels 1 or 2.

Of course, not every student will give results that are so consistent with the expected order of items in the Wright Map. This is educationally relevant, as performances that are inconsistent with the usual pattern may indicate that the student has special interests, experiences, or even attitudes that should be considered in interpreting their results. We use a special type of graphical display, called a "kidmap," that can make this clearer (although, like any other specialist figure, it does need explanation). The kidmap for student Ver1148 is shown in Figure 10. The student's location on the map is shown as the horizontal line marked with their identifier in the middle of the figure. The logit scales to the left and right of this show, respectively, (a), to the left, the construct map levels that the student achieved, and (b), to the right, those that the student did not achieve. The extent of the measurement error around the student location is marked by the blue band around the horizontal line. The way to read the graph is to note that, when the student has performed as expected by their overall estimated location, then:

(a) the construct map levels achieved should show up in the bottom left-hand quadrant of the graph,
(b) the construct map levels not achieved should show up in the top right-hand quadrant of the graph, and
(c) there may be a region of inconsistency within and/or near to the region of uncertainty (i.e., the blue band).

These reading rules are rendered schematically in the sketch below.

3 Note that the level MoV0 shown here was reclassified as MoV1.
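The sketch uses hypothetical item-level thresholds and responses (not BASS output); the item labels, the student location, and the width of the uncertainty band are illustrative assumptions.

```python
# A schematic sketch of the kidmap quadrant logic: each item level falls in a
# quadrant depending on whether the student achieved it (left vs. right side)
# and whether its threshold lies below or above the student's location.
from dataclasses import dataclass

@dataclass
class ItemLevel:
    label: str        # e.g., "8.2" = item 8, level 2 (hypothetical labels)
    threshold: float  # Thurstonian threshold location, in logits
    achieved: bool    # did the student respond at or above this level?

def kidmap_quadrant(level, student_theta, band=0.5):
    """Classify one item level; `band` is the half-width of the uncertainty region."""
    if abs(level.threshold - student_theta) <= band:
        return "near the uncertainty band"   # inconsistency is unsurprising here
    below = level.threshold < student_theta
    if level.achieved:
        return "bottom-left (expected)" if below else "top-left (surprising success)"
    return "top-right (expected)" if not below else "bottom-right (surprising failure)"

student_theta = 1.0   # hypothetical student estimate, in logits
levels = [ItemLevel("8.2", 0.1, True), ItemLevel("10.3", 1.9, True),
          ItemLevel("11.4", 2.4, False), ItemLevel("9.2", -0.2, False)]
for lv in levels:
    print(lv.label, "->", kidmap_quadrant(lv, student_theta))
```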


FIGURE 9 | A BASS item scores report for an individual student on the MoV items set.

FIGURE 10 | A Kidmap report for an individual student on the MoV items set.

In fact, looking at Figure 10, we can see that this student's performance is very much consistent with their estimated location: there are no construct map levels showing up in the "off-diagonal" quadrants (top left and bottom right).

To illustrate the way that this type of graph can help identify student performances that are inconsistent with their overall estimated location, look now at Figure 11.


In this kidmap, for student Ver1047, we can see that the "off-diagonal" quadrants are indeed occupied by a number of item levels. Those in the top left quadrant are those for which the student has performed better than their overall estimate would predict, while those in the bottom right quadrant are those for which the student did not perform as well as expected. This can then be interpreted very specifically by examining the student scores output shown in Figure 12.

FIGURE 11 | A Kidmap report for a second individual student on the MoV items set.

FIGURE 12 | A BASS item scores report for a second individual student on the MoV items set.


Here it can be readily seen that this student has performed very inconsistently on the items that the teacher assigned, responding at a high level (MoV4) for the first three items and at the lowest level (NL(i)) for the last three items. An actual interpretation of these results would require more information about the student and the context—nevertheless, it is important that the teacher be made aware of this sort of result, as it may be very important for the continued success of this individual student. Fortunately, cases such as that of Ver1047 can be readily detected by using a "Person Misfit" index, thus avoiding the need for a teacher to examine the kidmaps for every one of their students—this index is also reported for each student by BASS. This type of result, where the student's response vector can be used as a form of quality-control information on the overall student estimate, is an important step forward in assessment practice—giving the teacher potential insights into the way that students have learned the content of instruction.
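One common person-misfit index is the outfit mean-square: the average squared standardized residual across a student's responses. The sketch below is our illustration with hypothetical dichotomous data; BASS may compute a different person-fit statistic. Values near 1.0 indicate Rasch-consistent responding, while values well above 1.0 flag response vectors like Ver1047's, where hard items are passed and easy items failed.

```python
# A minimal sketch of a person-fit (outfit) index under the Rasch model,
# using hypothetical dichotomous responses and item locations.
import math

def outfit(theta, deltas, responses):
    """Outfit mean-square: mean of squared standardized residuals."""
    z2 = []
    for delta, x in zip(deltas, responses):
        p = 1.0 / (1.0 + math.exp(-(theta - delta)))
        z2.append((x - p) ** 2 / (p * (1.0 - p)))
    return sum(z2) / len(z2)

deltas = [-1.5, -0.5, 0.0, 0.5, 1.5]   # hypothetical item locations (logits)
consistent   = [1, 1, 1, 0, 0]          # passes easy items, fails hard ones
inconsistent = [0, 0, 1, 1, 1]          # the reverse, misfitting pattern
for resp in (consistent, inconsistent):
    print(f"outfit = {outfit(theta=0.0, deltas=deltas, responses=resp):.2f}")
# -> roughly 0.53 for the consistent vector and 2.66 for the inconsistent one
```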
CONCLUSION

This paper has described how a learning progression can be defined using benchmarks of conceptual change and, in tandem, conceptual pivots that may instigate change. These together characterize the structural components of the learning progression. The assessment aspect of the progression has been characterized using constructs and their accompanying construct maps, which serve multiple purposes: as ways to highlight and describe anticipated forms of student learning, as a basis for recording and visualizing progress in student learning, and as a method to link assessment with instruction. Instruction is intertwined with assessment in ways that are manifest in the structure of the assessment system, as in the negotiation of the representation of constructs in ways that teachers find useful, intelligible, and plausible, and in the practice of assessment, where teachers use formative items to advance learning by engaging students in productive, construct-centered conversation.

Next Steps

The next major steps in continuing the assessment work described here are the following. In the first step, the many constructed-response items in the DM item bank, such as the one used as an example above, need to be augmented with similar selected-response items. The selected-response versions can be developed from the responses collected as part of the calibration of the open-ended items. These selected-response items are crucially needed in order to lighten the load on teachers, so that they can avoid having to score the open-ended responses that their students make. The aim, however, is not to then ignore the open-ended items, but to preserve both formats in the DM item bank, so that the open-ended ones can be used for instructional purposes as well as informal assessments, as part of assessment conversations. The closed-form items can then be used in more formal situations such as unit tests, longer-term summative tests, and evaluation contexts. Care is needed to avoid using similar pairs of items in these two ways with the same students, but this can be alleviated by the creation of clones of each item, in each format, although that is not always easy.

In the second step, the assessments described above need to be deployed with a system of teacher observations matched to the same set of constructs, allowing teachers to record their judgments on student performances in relatively unstructured situations, including group-work. Work is currently underway to develop and try out such an observational system and to establish connections with the BASS database so that the two systems can be mutually supportive. There are multiple issues in disentangling group-level observations and individual item responses, but some work on hierarchical Rasch modeling has already begun (Wilson et al., 2017).

In the third step, the performances of students on the DM constructs need to be connected to teacher actions. The tradition of fidelity studies is based on the observation of low-inference teacher actions, due to the relatively good reliability of judgments about such actions. However, these low-inference actions are seldom the most important educational activities carried out by teachers. Hence, an important agenda is the development of a system of observing and judging high-inference teacher activities, using the constructs and the construct maps as leverage to make the activities more judgeable. This work has begun, and sound results have been reported (Jones, 2015), but much more needs to be accomplished in this area.

DATA AVAILABILITY STATEMENT

The data analyzed in this study is subject to the following licenses/restrictions: The data sets are still undergoing analysis. Requests to access these datasets should be directed to MW, MarkW@berkeley.edu.

ETHICS STATEMENT

The studies involving human participants were reviewed and approved by the Committee for the Protection of Human Subjects, University of California, Berkeley, and by Vanderbilt University's Human Research Protections Program. Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.

AUTHOR CONTRIBUTIONS

Both authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

FUNDING

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through grant R305B110017 to the University of California, Berkeley, and grant R305A110685 to Vanderbilt University. The opinions expressed are those of the authors and do not represent the views of the Institute or the U.S. Department of Education.

SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feduc.2021.654212/full#supplementary-material

REFERENCES

Adams, R. J., and Wilson, M. (1996). "Formulating the Rasch model as a mixed coefficients multinomial logit," in Objective Measurement III: Theory Into Practice, eds G. Engelhard and M. Wilson (Norwood, NJ: Ablex).

Adams, R. J., Wilson, M., and Wang, W. (1997). The multidimensional random coefficients multinomial logit model. Appl. Psychol. Meas. 21, 1–23. doi: 10.1177/0146621697211001

Adams, R. J., Wu, M. L., Cloney, D., and Wilson, M. R. (2020). ACER ConQuest: Generalised Item Response Modelling Software [Computer Software]. Version 5. Camberwell, VIC: Australian Council for Educational Research (ACER).

Andersen, E. B. (1985). Estimating latent correlations between repeated testings. Psychometrika 50, 3–16.

Ayers, E., and Wilson, M. (2011). "Pre-post analysis using a 2-dimensional IRT model," in Paper Presented at the Annual Meeting of the American Educational Research Association (New Orleans, LA).

Briggs, D., and Wilson, M. (2003). An introduction to multidimensional measurement using Rasch models. J. Appl. Meas. 4, 87–100.

Center for Continuous Instructional Improvement (CCII) (2009). Report of the CCII Panel on Learning Progressions in Science. CPRE Research Report. New York, NY: Center for Continuous Instructional Improvement, Columbia University.

Draney, K., and Wilson, M. (2011). Understanding Rasch measurement: selecting cut scores with a composite of item types: the construct mapping procedure. J. Appl. Measur. 12, 298–309.

Fisher, W. P., and Wilson, M. (2019). An online platform for sociocognitive metrology: the BEAR assessment system software. Measur. Sci. Technol. 31:5397 [Special Section on the 19th International Congress of Metrology (CIM 2019)]. doi: 10.1088/1361-6501/ab5397

Ford, M. J. (2015). Educational implications of choosing 'practice' to describe science in the next generation science standards. Sci. Educ. 99, 1041–1048. doi: 10.1002/sce.21188

Garfield, J., Le, L., Zieffler, A., and Ben-Zvi, D. (2015). Developing students' reasoning about samples and sampling as a path to expert statistical thinking. Educ. Stud. Math. 88, 327–342. doi: 10.1007/s10649-014-9541-7

Jones, R. S. (2015). A Construct Modeling Approach to Measuring Fidelity in Data Modeling Classrooms. Unpublished doctoral dissertation. Nashville, TN: Vanderbilt University.

Jones, R. S., Lehrer, R., and Kim, M.-J. (2017). Critiquing statistics in student and professional worlds. Cogn. Instruct. 35, 317–336. doi: 10.1080/07370008.2017.1358720

Kim, M.-J., and Lehrer, R. (2015). "Using learning progressions to design instructional trajectories," in Annual Perspectives in Mathematics Education (APME) 2015: Assessment to Enhance Teaching and Learning, ed. C. Suurtamm (Reston, VA: National Council of Teachers of Mathematics), 27–38.

Knorr Cetina, K. (1999). Epistemic Cultures: How the Sciences Make Knowledge. Cambridge, MA: Harvard University Press.

Lehrer, R. (2017). Modeling signal-noise processes supports student construction of a hierarchical image of sample. Stat. Educ. Res. J. 16, 64–85.

Lehrer, R., and Kim, M. J. (2009). Structuring variability by negotiating its measure. Math. Educ. Res. J. 21, 116–133. doi: 10.1007/bf03217548

Lehrer, R., and Romberg, T. (1996). Exploring children's data modeling. Cogn. Instruct. 14, 69–108. doi: 10.1207/s1532690xci1401_3

Lehrer, R., Kim, M. J., and Jones, S. (2011). Developing conceptions of statistics by designing measures of distribution. Int. J. Math. Educ. (ZDM) 43, 723–736. doi: 10.1007/s11858-011-0347-0

Lehrer, R., Kim, M., and Schauble, L. (2007). Supporting the development of conceptions of statistics by engaging students in modeling and measuring variability. Int. J. Comput. Math. Learn. 12, 195–216. doi: 10.1007/s10758-007-9122-2

Lehrer, R., Kim, M.-J., Ayers, E., and Wilson, M. (2014). "Toward establishing a learning progression to support the development of statistical reasoning," in Learning Over Time: Learning Trajectories in Mathematics Education, eds A. Maloney, J. Confrey, and K. Nguyen (Charlotte, NC: Information Age Publishers), 31–60.

Lehrer, R., Schauble, L., and Wisittanawat, P. (2020). Getting a grip on variability. Bull. Math. Biol. 82:106. doi: 10.1007/s11538-020-00782-3

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika 47, 149–174. doi: 10.1007/bf02296272

National Research Council (2006). Systems for State Science Assessments. Committee on Test Design for K-12 Science Achievement. Board on Testing and Assessment, Center for Education, Division of Behavioral and Social Sciences and Education, eds M. R. Wilson and M. W. Bertenthal. Washington, DC: The National Academies Press.

Petrosino, A., Lehrer, R., and Schauble, L. (2003). Structuring error and experimental variation as distribution in the fourth grade. Math. Think. Learn. 5, 131–156. doi: 10.1080/10986065.2003.9679997

Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Reprinted by University of Chicago Press, 1980.

Saldanha, L. A., and Thompson, P. W. (2014). Conceptual issues in understanding the inner logic of statistical inference. Educ. Stud. Math. 51, 257–270.

Schwartz, R., Ayers, E., and Wilson, M. (2017). Mapping a learning progression using unidimensional and multidimensional item response models. J. Appl. Measur. 18, 268–298.

Shin, H.-J., Wilson, M., and Choi, I.-H. (2017). Structured constructs models based on change-point analysis. J. Educ. Measur. 54, 306–332. doi: 10.1111/jedm.12146

Simon, M. A. (1995). Reconstructing mathematics pedagogy from a constructivist perspective. J. Res. Math. Educ. 26, 114–145. doi: 10.5951/jresematheduc.26.2.0114

Tapee, M., Cartmell, T., Guthrie, T., and Kent, L. B. (2019). Stop the silence: how to create a strategically social classroom. Math. Teach. Middle Sch. 24, 210–216. doi: 10.5951/mathteacmiddscho.24.4.0210

Thompson, P. W., Liu, Y., and Saldanha, L. (2007). "Intricacies of statistical inference and teachers' understanding of them," in Thinking With Data, eds M. C. Lovett and P. Shah (New York, NY: Lawrence Erlbaum Associates), 207–231.

Torres Irribarra, D., Diakow, R., Freund, R., and Wilson, M. (2015). Modeling for directly setting theory-based performance levels. Psychol. Test Assess. Model. 57, 396–422.

Wilson, M. (2005). Constructing Measures: An Item Response Modeling Approach. Mahwah, NJ: Erlbaum.

Wilson, M. (2012). "Responding to a challenge that learning progressions pose to measurement practice: hypothesized links between dimensions of the outcome progression," in Learning Progressions in Science, eds A. C. Alonzo and A. W. Gotwals (Rotterdam: Sense Publishers), 317–343. doi: 10.1007/978-94-6091-824-7_14

Wilson, M., Morell, L., Osborne, J., Dozier, S., and Suksiri, W. (2019a). "Assessing higher order reasoning using technology-enhanced selected response item types in the context of science," in Paper Presented at the 2019 Annual Meeting of the National Council on Measurement in Education (Toronto, ON).

Wilson, M., Scalise, K., and Gochyyev, P. (2017). Modeling data from collaborative assessments: learning in digital interactive social networks. J. Educ. Measur. 54, 85–102. doi: 10.1111/jedm.12134

Wilson, M., Scalise, K., and Gochyyev, P. (2019b). Domain modelling for advanced learning environments: the BEAR assessment system software. Educ. Psychol. 39, 1199–1217. doi: 10.1080/01443410.2018.1481934

Wilson, M., Zheng, X., and McGuire, L. (2012). Formulating latent growth using an explanatory item response model approach. J. Appl. Measur. 13, 1–22.

Wright, B. D., and Masters, G. N. (1981). Rating Scale Analysis. Chicago, IL: MESA Press.

Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2021 Wilson and Lehrer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
