A R T I C L E I N F O

Keywords: Equality of chances; Validity; Assessment; Teaching paradigms; Dutch school policy

A B S T R A C T

The Netherlands has a problem regarding quality as well as equality in its school system. Many students fail to reach minimal skill levels and underachieve with respect to their own learning capacities. They end up on educational levels that do not correspond to their learning potential, especially when their parents do not speak the language or cannot afford tutoring to exploit this potential. In this article, we argue that important reasons for this development lie in a lack of validity of judgments on various levels of the educational system. We summarize these reasons in eight theses about the validity of methods, starting from methods of testing and assessment, over the design of the teaching process and underlying paradigms, and ending with the more general cybernetic effects of school policy and the meritocratic social paradigms in which they are embedded.
1. Introduction
Similar to other European countries, the Netherlands ratified international laws which guarantee equal access to educational levels
for every child, based on individual merits1. Yet, there is an increasing awareness in Dutch society that the Dutch school system has a problem concerning equality2 as well as quality. According to the Programme for International Student Assessment (PISA), the quality of the Dutch school system (which used to be outstanding) has been in decline since 20063. School inspectors warn that reading standards, as well as arithmetic, math and science skills, have dropped considerably during the last 20 years. A report of the Dutch inspectorate of education (’Onderwijsinspectie’) from 2019 found that equality of chances has likewise eroded considerably during the same period (Baltussen, 2019), and the newest edition comes to similar conclusions (Oppers, 2022).
One reason is that children whose parents cannot prepare their kids for tests or pay for private tutoring do not have the same chance
of making educational progress (Marks et al., 2006), even if they have sufficient learning potential. Dutch parents who know how the
Abbreviations: CITO, centraal instituut voor toetsontwikkeling; LVS, leerlingvolgsysteem; vwo, voorbereidend wetenschappelijk onderwijs; havo, hoger algemeen voortgezet onderwijs; vmbo, voorbereidend middelbaar beroepsonderwijs.
* Corresponding author.
E-mail address: s.scheider@uu.nl (S. Scheider).
1. According to, e.g., the European Convention on Human Rights (ECHR) (Pillai, 2012), Article 2 of Protocol Number 1, as well as the UN Universal Declaration of Human Rights, Article 26 (United Nations, 1948).
2. Equality is understood here in terms of accessibility and horizontality, cf. (McCowan, 2016).
3. https://www2.compareyourcountry.org/pisa/country/nld?lg=en.
https://doi.org/10.1016/j.ijer.2022.102097
Received 20 May 2022; Received in revised form 5 October 2022; Accepted 11 November 2022
Available online 28 November 2022
0883-0355/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/).
S. Scheider et al. International Journal of Educational Research 117 (2023) 102097
Fig. 1. Causal diagram of educational inequality. Red factors are related to social background, blue factors capture the quality of a school system,
and green factors are characteristics of a child. To reduce inequality, the role of the background in exploiting learning potentials needs to
be reduced.
system works, on the other hand, can use it to their advantage (Elffers, 2018), by exploiting the learning potential of their children from the privileged position of their own social background (red factors in Fig. 1). However, parents are not to blame. The only way to mitigate this effect on inequality is to compensate for unequal background conditions by sound assessment, selection and teaching methods in publicly financed schools (blue factors in Fig. 1). This is supported by ample empirical evidence. For example, studies
based on PISA4 data have revealed that early and institutionalized student selection into levels amplifies the effect of cultural and
socio-economic background on achievements (Dupriez and Dumay, 2006; Gorard and Smith, 2004; Marks et al., 2006). Furthermore, it
is well known that the quality of teaching is crucial for mitigating such effects on inequality (Hanselman, 2018), and that early student assessment and tracking methods5 have a significantly negative effect on educational achievements (Felouzis and Charmillot, 2013; Gamoran and Mare, 1989), as confirmed by a recent meta-study (Terrin and Triventi, 2022). While previous research has focused
on comparative educational statistics for investigating such causes of inequality (cf. Felouzis and Charmillot (2013); Yang Hansen and
Strietholt (2018)), studies on the role of the methodological validity of assessments for educational inequality, especially in a Dutch
national context, are lacking. In this article, we argue that an important reason for the existing inequality problem in the Netherlands is
a lack of validity of educational methods with respect to assessment, selection and teaching.
In our everyday lives, we constantly rely on the judgment of other people who take on the responsibility of deciding in our name about issues we cannot decide ourselves, be they judges, police officers, politicians, or teachers. For obvious reasons, our lives and
biographies depend on whether such judgments are valid. Validity implies that criteria are defined relative to specific goals of what needs
to be judged. A judgment is valid if it can be approved by everyone who accepts the criteria for this goal, including ourselves (Janich,
2001).
For example, an important form of judgment in school is the assessment of learning outcomes (cf. Moss et al. (2006)). In this case, validity criteria are defined with respect to learning goals. Valid assessments measure whether pupils have acquired precisely the skills that correspond to such goals (McKeachie and Svinicki, 2013, ch. 10). However, misjudgment of skills can arise when goals are confused.
For example, a test of math skills given in terms of a story (’verhaaltjessommen’) may require language skills that intermix with the
learning goals of algebra, leading to a misjudgment of algebra skills. Furthermore, to grant equal chances to children, valid assessments
are required not only for learning outcomes, but also for learning potentials (Vogelzang, 2016). In this case, static tests are not a valid assessment method because we are targeting a disposition to learn, not a learning history. Assessing potentials requires observing performance in dynamic assessments (Tzuriel, 2012). This can be done by observing the modifiability of learning behavior, focusing on maximum learning progress rather than on averages. To be valid, dynamic assessments therefore need to be embedded into the learning process. Validity in this case can be based on the success of trials. For example, if a student’s academic potential was assessed as low but the student later graduates from a university, this renders the assessment invalid. Finally, validity can also be considered for methods
used in an educational system that are not assessments. For example, teaching methods as well as methods of school policy become
invalid with respect to the goal of equality of chances precisely when they are not effective in exploiting learning potentials. We
therefore argue that equality of educational chances requires validity along at least three dimensions:
1. Valid assessments of skill levels6, to be able to judge the level of proficiency a child has acquired with respect to a learning goal.
2. Valid assessments of learning potentials7, to be able to judge whether a child will be able to master certain educational pathways.
4. OECD’s Programme for International Student Assessment, https://www.oecd.org/pisa.
5. Dividing the student population into groups according to their achievements.
6. This is sometimes called ‘content validity’ in assessment research (Moss et al., 2006).
7. This is roughly equivalent to ‘(predictive) criterion validity’ (Moss et al., 2006). We deviate from this terminology here because criteria are relevant for all kinds of validity, and assessing learning potentials implies special criteria.
Fig. 2. Overview of the 8 theses substantiated in this article. They address the validity of educational methods on three nested layers of the
educational system. Lines connect theses whose underlying methods are mutually dependent.
3. Effective exploitation of these learning potentials in teaching methods. Note the requirement to be effective, not efficient8.
Yet, it is doubtful whether the Dutch system in primary and secondary school performs well along these three dimensions. Karen
Heij, a researcher who has worked as an educational test designer, recently doubted the appropriateness of the Dutch selection and
testing system (Heij, 2021). In our own critique, we will reuse some of her arguments, as well as refer to certain questions raised by
Boonstra et al. (2020). In contrast to these authors, we explore reasons for a lack of validity of educational methods on different levels
of the Dutch school system, and explain how these reasons tend to reinforce each other. We substantiate our reasons in terms of 8 theses
which should be regarded as hypotheses requiring further research. As shown in Fig. 2, our theses address the question of validity on
three nested levels: The level of assessment methods, the teaching methods, as well as the policy level.
2. Eight theses about a lack of validity as a reason for educational inequality in the Dutch school system
The following methodological critique is based on arguments from learning psychology, educational science and learning
philosophy.
A critique (from the Greek for ‘the art of judgment’) is a method in philosophy of using reason to check claims for their justification. It has its roots
in Kant’s critique of reason and in language philosophy (Janich, 2001), but has also been applied in psychology and education science,
e.g., to critique the validity of PISA assessments (Fernandez-Cano, 2016; Wolf, 1998; Yang Hansen and Strietholt, 2018). In our case,
we check whether validity assumptions underlying the current educational system are justified along the three dimensions mentioned
above, based on the following sources:
• Statistical and experimental evidence from learning psychology and educational science
• Governmental reports, laws, as well as reports on educational history
• Test samples and other published educational tools
• Evidence from newspapers and online publications
Thus, our argumentation is based solely on available sources of knowledge and evidence, as referenced underneath each thesis.
8. Being effective means to exploit as much as possible of a given potential. Being efficient means to exploit a given potential with minimal effort.
To decide about future paths on educational levels (‘niveaus’), state-wide standardized tests are held regularly as part of a student
tracking system (‘leerlingvolgsysteem’ (LVS)) (cf. Appendix A.1). In addition, certain qualitative forms of assessment are frequently
used by teachers. In the following, we argue why the underlying assessments are of questionable validity, both as a means to assess skill levels and as a means to assess learning potentials.
We focus on the test developed by the Centraal instituut voor toetsontwikkeling (CITO)9, the most prominent example of an LVS. Yet, similar arguments can be made for other LVS systems, too.
9. https://nl.wikipedia.org/wiki/Centraal_Instituut_voor_Toetsontwikkeling.
10. https://www.kidsweek.nl/voor-ouders/heeft-oefenen-voor-de-cito-toets-zin.
11. https://www.leidenpsychologyblog.nl/articles/is-intelligence-testing-in-school-the-smart-thing-to-do.
12. With contingent skills, we mean skills that are not required for the tested learning goals, for example, when a math test presupposes language skills not required for the tested math skills.
13. Note that the principle of methodical order does not imply that learning must be linear. In fact, it is often cyclical, e.g., according to Kolb’s experiential learning cycle (Kolb and Kolb, 2018).
order in the period before the test. In a nutshell, the order of testing needs to follow the order of instruction and practicing; otherwise, we risk merely measuring differences in habituation to certain types of questions. We call this fundamental principle constructive alignment, following Biggs (1996).
Yet, the reality of testing in Dutch primary schools is a continuous violation of the constructive alignment principle. CITO tests are standardized in content and held at regular half-year intervals. However, this is done in a way that makes it deliberately hard to align them with the learning process. Tested skills are taken from a mix of subject areas learned over longer periods, intermixing, e.g., algebra and geometry with measurement units in unforeseeable ways14. This makes it hard to prepare for a CITO test or to use it as an assessment method for testing a particular acquired skill level. There is thus a high chance that students come across questions they have not fully grasped, because topics do not necessarily follow a methodical order. For example, tests in primary school sometimes contain geometry tasks that require pupils to calculate areas of triangles or volumes with triangular shapes. Yet, it is clear that a methodical order would require understanding the formula (1/2 * a * b) first. The latter, however, is only fully understandable when it is motivated appropriately in geometry lessons in secondary school. Similarly, primary school tests may contain questions requiring children to simplify fractions without having fully grasped fractional algebra. This is because the function of the CITO is not to give feedback to students or teachers, but to divide pupils into different educational streams, as explained in the following.
Thesis 2. CITO is not a valid system for testing the skills of students because it is not aligned with the learning process. Different subject areas that are learned at different times are mixed in a test. Ambiguous questions would require practice; however, practising is not intended. More generally, CITO fails as a test for skills as well as potentials because it has never been intended as a feedback instrument.
14. Sometimes it is argued that this delay in CITO is intentional because it would test a student’s memory. However, this is not a convincing argument, since valid memory tests require specific memorizing instructions, which are lacking in the case of CITO.
15. The sample on which these quantiles are computed can be up to 6 years old.
16. It is sometimes argued that LVS systems would be based on absolute reference norms because test questions are made with ‘referentieniveaus’ in mind. However, this is a misconception: the norm for measuring CITO grades is based on social references.
Fig. 3. The CITO grading system.
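The social-reference (norm-referenced) norming behind such grading can be sketched as follows: a raw score is graded by its rank within a reference sample of other pupils, not against an absolute criterion. The sketch below is a minimal illustration with invented scores, cut-offs and band labels; the actual CITO norming procedure is more involved.

```python
# Minimal sketch of norm-referenced (social-reference) grading:
# a raw score is graded by its percentile within a reference sample,
# not against an absolute criterion. All scores, cut-offs and band
# labels are invented for illustration.

from bisect import bisect_right

def percentile_rank(score: float, reference_sample: list) -> float:
    """Fraction of the reference sample scoring at or below `score`."""
    ordered = sorted(reference_sample)
    return bisect_right(ordered, score) / len(ordered)

def grade(score: float, reference_sample: list) -> str:
    """Map a percentile rank to one of five illustrative bands A-E."""
    p = percentile_rank(score, reference_sample)
    for cutoff, band in [(0.25, "E"), (0.50, "D"), (0.75, "C"), (0.90, "B")]:
        if p <= cutoff:
            return band
    return "A"

# The consequence: the same raw score receives a different grade when the
# reference sample changes -- the norm is social, not absolute.
weak_cohort = [10, 12, 14, 15, 16, 18, 20, 21, 22, 25]
strong_cohort = [20, 22, 24, 25, 26, 28, 30, 31, 32, 35]
print(grade(26, weak_cohort))    # "A" (above everyone in the weak cohort)
print(grade(26, strong_cohort))  # "D" (mid-rank in the strong cohort)
```

The two print statements show the point made in footnote 16: under a social reference norm, a grade reflects a pupil's position relative to a particular sample, so it cannot by itself certify an absolute skill level.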
levels. These ‘soft’ criteria are not officially recommended by the inspectorate of education17 because they lack any scientific methodological basis. Still, they are used regularly in different places, e.g., in terms of ‘leerlingprofielen’18 or as part of student reports (’Onderwijskundig rapport’) by primary schools. Though it is currently unknown how widespread such criteria are, education professionals told us in personal exchanges that they are considered a proven tool of practice, especially in situations of doubt over the
capabilities of a student. One example is shown in Fig. 4. As can be seen, the criteria cover study skills (’vaardigheden’), social-
emotional (’sociaal emotioneel’) characteristics as well as working attitude (’werkhouding’). Here is a translated summary for the
level of vwo/havo as opposed to the level vmbo-b (highlighting by the authors of this article):
• Study skills: vwo/havo students are supposed to work on abstract tasks, can apply abstract concepts, can reflect based on their own
judgement, and can draw conclusions. Students on the vmbo-b level are supposed to learn only in concrete terms and in the ‘now’.
• Social-emotional characteristics: vwo/havo students are supposed to use others’ feedback and can criticize other people’s reactions. They are also less influenced by others and less dependent on confirmation by teachers. Students on the vmbo-b level are expected to see school primarily as a social event.
• Working attitude: vwo/havo students can concentrate on their work for at least 30 minutes, can make their own plans, are assertive and
can lead group tasks. Students on the vmbo-b level can concentrate only for 10 minutes and need guidance from teachers.
In this table, we can see that CITO and other test results are treated as equivalent to intelligence test (IQ) scores. From a scientific standpoint, this mapping is highly questionable for the reasons outlined above.
Furthermore, there are numerous reasons to doubt the validity of these qualitative criteria for the purpose of assessing learning
potentials. For instance, regarding study skills, it is well known in Psychology that ‘being absorbed by the now’, denoted by the
technical term flow, is a very important concept for learning as such, regardless of the educational level (Csikszentmihalyi, 1990;
Engeser and Rheinberg, 2008). It is known that literally everyone experiences a flow when learning ‘in-depth’, not only people with
low learning potential. For this reason, it is questionable to take such a characteristic as a criterion for differentiating learning potentials. Furthermore, the idea of differentiating low and high learning potentials based on distinguishing concrete versus abstract or theoretical concepts reflects another misconception, one that has become famous as a so-called category error (Ryle, 1949): even in the most theoretical sciences, like mathematics, knowledge is not gained abstractly. Knowledge builds on know-how, i.e., the disposition to operate on concrete material. This means that we can learn theoretical concepts only by applying them in concrete practices19 (Janich, 2001; 2015). Therefore, the distinction between practice and theory is essentially unsuitable for differentiating educational skill levels.
The idea of students on vwo/havo levels supposedly being more asocial in their behaviour, i.e., less dependent on others and their
teacher’s guidance, reveals another questionable stereotype underlying this assessment scheme. In-depth understanding of a concept
always requires not only practice but also critical guidance and interaction with teachers and peers (Janich, 2001). The essential role of
social relations for education is highlighted by numerous researchers, e.g., by the social dimension of Biesta’s triangle (Biesta, 2020),
by the extensive empirical investigations of the pediatrician Remo Largo (Largo and Beglinger, 2009), as well as by learning theorist
Etienne Wenger (Wenger, 1999). Since even the most intelligent students need guidance and feedback in their learning, the idea of the
‘autistic gifted’ student is largely a myth. Or as Gooch (2019) puts it:
Since the knowledge gained comes primarily through interrogation of and by others, education is relational, depending on
personal interaction between teacher and student.
Even the qualitative criteria about working attitudes on different educational levels are hard to defend. Though the ability to concentrate and the abilities to self-structure and to make plans surely form an important prerequisite for mastering a university level, a lack of them is not a sign of a lack of intelligence or learning capacity. This can be seen from the fact that concentration and self-organization are not only a problem for slow learners; they are also a problem for intelligent and gifted students and for students with Attention Deficit Hyperactivity Disorder (ADHD). Around 30 per cent of the gifted children in the Netherlands (’hoogbegaafdheid’, with an IQ higher than 130) end up at an education level below havo, i.e., they clearly underachieve20. Furthermore, underachievement is not only a problem for the gifted. There is a significant number of children who underperform not because they lack learning potential, but because they lack self-confidence and have difficulties concentrating and organizing themselves (Dittrich, 2014).
Thesis 4. Qualitative criteria based on student characteristics (’kenmerken’, ‘leerlingprofielen’), including concentration levels,
social dependence and working attitudes, are invalid as a means to assess a student’s learning potential.
Addressing the problems of assessment mentioned in the previous section cannot be done by merely redesigning tests, or by
improving assessment norms or criteria. It needs to be accompanied by a redesign of the educational process in class and a change of
17. https://www.onderwijsinspectie.nl/onderwerpen/overgang/welke-factoren-meewegen.
18. https://swvbepo.nl/.
19. This can also be a mental practice, e.g., the practice of manipulating mathematical expressions.
20. https://centrumregenboogkind.nl/hoogbegaafdheid/.
the underlying teaching culture (Heij, 2021). In the following, we discuss this culture by contrasting different paradigms21.
Our methodical critique so far has focused on the assessment and teaching processes, but not on the political and societal conditions
in which they are embedded.
21. A paradigm is a predominant thought pattern of a culture.
22. Establishing test validity becomes a challenge if such goals are not standardized, e.g., on an international level (Wolf, 1998).
23. See, e.g., Besluit kerndoelen onderbouw VO of 2006, https://www.rijksoverheid.nl/documenten/besluiten/2010/09/17/kerndoelen-onderbouw-voortgezet-onderwijs.
24. For example, in ‘rekenen’, ‘referentieniveaus’ across levels 1F/S to 3F/S include the handling of numbers, ratios, measurements and tables/diagrams. For primary schools, similar concretizations are formulated in ‘Inhoudslijnen’ (TULE), see https://www.slo.nl/sectoren/po/inhoudslijnen-po.
25. ’Methodentoetsen’ are foreseen as a kind of formative testing designed by schools; however, they are usually regarded as less relevant for selection purposes.
26. The Munich Model of Giftedness and Talent (Ziegler and Heller, 2000) includes classroom climate and quality of instruction as important moderator variables for explaining underachievement.
Fig. 5. Unintended feedback of educational outcome assessments R3/R4 on the selection into levels in secondary school. For a school, distributing
pupils into lower levels is advantageous because it will increase the probability of scoring high in R3 and R4.
1. R1 (’onderwijspositie’) measures the difference between upflow (the percentage of pupils streaming from lower levels to higher
levels) and downflow (the percentage of pupils streaming from higher levels to lower levels) in year three of secondary school with
respect to the level of the recommendation in primary school. The more positive the difference, the better.
2. R2 (’onderbouwsnelheid’) measures the percentage of pupils that reach year 3 without repetition. The higher, the better.
3. R3 (’bovenbouwsucces’) is a combination of both factors, the upflow/downflow difference and percentage without repetition,
measured from year 3 for each level for the remainder of the school years (after year 3).
4. R4 (’gemiddeld CE’) is the average score of the central exam over all exams written in the final year.
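The first two indicators can be sketched as simple percentage computations. The sketch below uses invented cohort numbers; the inspectorate's actual computation (see footnote 27) includes further weightings and correction factors.

```python
# Illustrative computation of the inspectorate's R1 and R2 indicators
# for a single school cohort. All cohort numbers are invented.

def r1_onderwijspositie(n_upflow: int, n_downflow: int, n_total: int) -> float:
    """R1: percentage of pupils streaming upward minus percentage streaming
    downward (relative to the primary school recommendation), measured in
    year 3 of secondary school. The more positive, the better."""
    return 100.0 * (n_upflow - n_downflow) / n_total

def r2_onderbouwsnelheid(n_without_repetition: int, n_total: int) -> float:
    """R2: percentage of pupils reaching year 3 without repeating a grade.
    The higher, the better."""
    return 100.0 * n_without_repetition / n_total

# Hypothetical cohort of 200 pupils: 30 moved up, 18 moved down,
# and 170 reached year 3 without repetition.
print(r1_onderwijspositie(30, 18, 200))   # 6.0
print(r2_onderbouwsnelheid(170, 200))     # 85.0
```

Note that both quantities are ratios over the whole cohort, which is what makes them sensitive to placement decisions: every pupil placed conservatively changes the denominator's risk profile, as discussed below.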
At first glance, these criteria all seem to make perfect sense. Dutch society indeed wants schools to increase upflow and prevent downflow, to prevent pupils from repeating school levels, and to make pupils score as high as possible in the final exams. Schools that ‘succeed’ in realizing this should be rewarded, because they are obviously doing something right, or so it seems.
However, on closer inspection, there are severe problems with this scheme. For one, the different indicators depend on the results of standardized tests which were designed for pupil selection purposes, not for evaluating schools, and thus cannot be used as a valid test of school performance at the same time (Heij, 2021). Thus, there exists a validity conflict on the level of goals when using a test for both purposes. Furthermore, in the cybernetic system of Dutch school evaluation, the intended effect is easily nullified by unintended feedback. To see why, consider for a moment how the establishment of such a system will feed back on the behavior of responsible personnel and teachers in a school (cf. Fig. 5). Certainly, the responsible persons will try to minimize the chance of falling below the threshold levels defined by the inspectorate of education. For factor R3, e.g., the threshold is currently supposed to lie at 76.99% (havo) and 80.58% (vwo). However, the factors that influence the probability of upflow, downflow, repetition, and the distribution of average grades at a school are largely determined by social preconditions in the school’s neighborhood, which are beyond the influence of a school. Exactly this effect can be seen from the fact that the inspectorate of education was forced to introduce corresponding social correction factors for their threshold levels29. For this reason, a school will try to minimize its own risk simply by being conservative in its assessments of its students. Without any doubt, distributing children into lower levels at secondary-level selection moments (e.g., in the 3rd year of secondary school) raises their probability of scoring high and decreases the probability of repetition and downflow, and thus improves the criteria R3 and R4. A similar effect can be observed for R1 and R2, because primary schools tend to strategically anticipate the decision process at secondary level30. In consequence, if a school followed a more liberal selection policy on a larger scale, it would risk lower average grades, more repetitions, and a decrease in upflow. Schools therefore tend to become distrustful of the learning potentials of their students. A visible sign of this distrust is the comparably high percentage of children who need to repeat a grade in secondary school to make sure they succeed later on31. In 2012, the percentage of secondary school graduates who had repeated at least one grade was 45%, on the havo level even 52% of the graduates (Van Vuuren and Van der Wiel, 2015). Two thirds of these repetitions occurred in secondary school, so approximately half of all graduates had to repeat a grade. Compared to other countries in the same year, 27.6% of all Dutch 15-year-olds had already repeated at least one school grade, which is approximately twice the OECD average (Zapata et al., 2014).
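The feedback mechanism described above can be made concrete with a toy calculation. All probabilities below are invented for illustration and are not taken from the cited sources; the point is only the direction of the incentive.

```python
# Toy illustration (invented probabilities) of the selection incentive:
# for a borderline pupil, a lower placement carries a higher chance of
# finishing without repetition or downflow, so conservative placement
# improves the school's expected R3 success rate even when the pupil
# could have mastered the higher level.

# Hypothetical success probabilities for one borderline pupil:
p_success_if_placed_high = 0.70  # finishes havo without repetition/downflow
p_success_if_placed_low = 0.95   # finishes vmbo without repetition/downflow

# Each placed pupil contributes their success probability to the expected
# R3 success rate, so a school hovering near its R3 threshold minimizes
# its risk by choosing the placement with the higher success probability.
best_for_school = (
    "low" if p_success_if_placed_low > p_success_if_placed_high else "high"
)
print(best_for_school)  # low
```

Under any such pair of probabilities with the lower track being the safer bet, the placement that is best for the school's indicators diverges from the placement that would exploit the pupil's learning potential, which is exactly the conflict stated in Thesis 7.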
27. https://www.onderwijsinspectie.nl/onderwerpen/onderwijsresultatenmodel-vo/indicatoren.
28. Sometimes, such scores are also discussed under the term ‘slagingspercentage’; however, the latter refers only to the percentage of successful final exams, which appears as one factor in R3.
29. https://www.onderwijsinspectie.nl/onderwerpen/onderwijsresultatenmodel-vo/indicatoren.
30. This behavior is recognized under the term ‘strategisch gedrag’. In fact, the inspectorate of education has realized this problem and as a consequence stopped the evaluation of teachers based on LVS results (Borghans and Schils, 2015; Muskens and Tholen, 2015).
31. Though repetitions are also among the indicators, they are recognized as means to improve school performance (Van Vuuren and Van der Wiel, 2015), because they can improve the other indicators (downflow as well as average grades).
Table 1
Number of graduates per year with scientific, applied and vocational diplomas in the Netherlands from 2010 to 2019. Percentages of these three levels are given relative to the total. For tertiary education: wo includes university-level bachelor, master, doctoraal and beroepsdiploma; applied university diplomas (hbo) and vocational degrees (mbo) are listed separately. Source: CBS Statline, ‘Gediplomeerden en afgestudeerden; onderwijssoort, vanaf 1900’, https://opendata.cbs.nl/statline/#/CBS/nl/dataset/80384ned/table?dl=59925
Thesis 7. The Dutch school evaluation and rewarding scheme in secondary school favors conservative, elitist pupil biographies to
increase the probability of scoring high in educational outcome assessments. This conflicts with the assessment of learning potentials.
In general, the use of results of selection tests as a means to evaluate school performance makes the scheme invalid in the first place.
32. There is an increasing trend at the university level; however, this is probably due to an increase in foreign students.
33. https://www.fr.de/wissen/schulabschluesse-deutschland-11730979.html.
Fig. A1. Task from a CITO test in basic algebra (’rekenen’), group 8.
Dutch study (Lubbers et al., 2010)34. This undermines the very principles of an open society (Popper, 1971). To avert such threats,
education should not be seen as a matter of private fortune, but rather as a common good.
Thesis 8. The privilege of academic education in the Netherlands is reserved for a fixed percentage of students due to arbitrary
quotas which have remained unquestioned for decades. The underlying meritocratic justification is radical: it effectively denies a
significant part of society access to educational pathways and prevents effective exploitation of learning potentials.
In this article, we have investigated possible reasons for the increasing educational inequality and the general decrease of skill
levels in the Netherlands. The article was written in the form of a critique, which is a way to reflect on the validity of methods of
judgment. We explained why such judgments may go wrong on various levels of the Dutch school system and why this conflicts with
the central goal to provide equal access to education guaranteed by law.
Our arguments highlight the various forms of misjudgment that keep students from exploiting their learning potentials. Since such reasons are not understood well enough, professionals can still assume that, though being far from perfect, the current system would ‘work’ in principle. Correspondingly, many Dutch people conclude that it would be their own ‘fault’ or ‘fate’ if their children do not succeed in the school system. Our arguments also shatter further assumptions in the public debate.
For example, it is often argued that since CITO is standardized, it would provide an ‘objective’ way of selecting students into levels,
as opposed to the ‘subjective’ assessment of a teacher. In particular, it is argued that teachers could be more biased than CITO against
students with a migration background. However, as we have argued above, there are many reasons to doubt the validity of CITO both as a way to judge student skills and learning potentials and as a way to select students on this basis. Since an invalid test is always biased, no matter how much it has been standardized, we would be trading bias for bias. This argument is therefore not convincing.
Furthermore, assuming that teachers’ assessments are biased just because they are fallible humans or because they have diverse
backgrounds reveals a dangerous prejudice against teachers and their profession. It not only runs against the very idea of the Dutch
freedom of education, it also shows a naive conception of objectivity and a naive trust in quantification. Instead, it should be clear from our
discussion that ‘objective’ (i.e., trans-subjective) judgments are just a consequence of validity (Janich, 2001), which needs to be
defined relative to a purpose. Once a valid assessment is in place, it will also be unbiased for this purpose, regardless of who designed it
or who is assessed, and regardless of whether it is a standardized, quantitative test or not.34 Stereotypical qualitative assessment
schemes, as currently in use, do not help because they lack any methodical validity.
34
”Principally, our results showed that critical approaches to learning, often applied by autonomous students, are not rewarded in Dutch secondary education. Checking other sources and forming one’s own opinion appear to affect grades negatively” (Lubbers et al., 2010).
S. Scheider et al. International Journal of Educational Research 117 (2023) 102097
From a research perspective, further empirical research is needed to substantiate our theses and to gather more evidence for the
observations that have been made. For example, it is currently unknown how widely distributed stereotypical qualitative assessment
schemes are, and to what extent quantitative tests such as CITO are dependent on prior contingent skills. An important question is how
concrete learning goals need to be in order to serve goal-constrained teaching. From a didactic viewpoint, educational researchers
should investigate how tests can be aligned with these goals and the teaching process, and how dynamic assessments would alter the
estimation of potentials. Furthermore, the detrimental effects of indicator-driven school evaluations and fixed quotas on the
biographies of children should be further investigated by comparative cohort studies. Finally, a far-reaching consequence for research is
that statistical studies (cf. Dronkers and van de Werfhorst (2016)) which use data about achievements within CITO-like systems cannot
be used to draw conclusions about skill levels and learning potentials of Dutch students.
From a political perspective, we have argued in this article that valid assessments require, on the one hand, the definition of
concrete and testable learning goals, and on the other hand, the embedding of assessments into goal-constrained teaching. The first
would require Dutch society to standardize concrete goals across all areas of knowledge and, in turn, to let go of standardized tests.
Thus, it is not the testing or teaching that needs to be standardized, but the learning goals. Testing needs to be placed in the hands of teachers as
part of their responsibility for achieving these goals. It might be argued that both teachers and the society are overwhelmed with this
task. However, the Netherlands is the only country in Europe that selects secondary school students into educational levels based on a
standardized national test (Naayer et al., 2016). In other European countries, the selection lies in the hands of teachers and is based on
repetitive assessments of whether pupils have reached minimum learning goals required for passing on to the next grade. Furthermore,
since the selection advice given at the end of primary school is usually non-binding (Naayer et al., 2016), pupils can try out academic
level education to disprove invalid assessments of their learning potential. Realizing this in the Netherlands might require
strengthening the academic education of teachers and raising their salaries. On the level of school policy, it is first necessary to
rethink how schools are evaluated, because the current scheme is not only invalid as a means to assess a school’s quality, but also
minimizes the chances of growth of its students due to unintended feedback which reinforces conservative assessments. And second, if
evaluations are redesigned by dropping social reference norms and adopting absolute reference norms, fixed quotas need to be
abolished, because the number of students on a certain educational path can no longer be determined a priori; it is an outcome of valid
assessments. Finally, such reforms need to be embedded into a public discourse over humanistic education in the Netherlands, a
country with one of the oldest liberal traditions in the world.
Acknowledgements
We are very grateful to Martin Uunk, Greetje Timmermans, Karen Heij and Frank Ostermann for their valuable suggestions
regarding earlier versions of the manuscript. We thank Jessica Hilhorst, Annette Ter Haar, Claire Boonstra and all other members of the
BOT group for their continuous support. We also would like to thank the anonymous reviewers for their helpful suggestions. Finally,
we thank all members of Studio Moio, in particular Eva de Leeuw, Mitchell Kort, Laila Aissi, Thijn van der Geest and Rashad Mashaal,
for discussing and translating our article.
In this appendix, we summarize the most relevant characteristics of the Dutch school system and briefly illustrate the most important
standardized test.
A1. Overview
The variety of Dutch school forms, including public-authority schools, Catholic, Protestant, Montessori, Waldorf, Dalton, and
Jenaplan schools, is due to the principle of freedom of education (’Vrijheid van onderwijs’) which has its roots in the division of the
Dutch society into cultural and religious pillars (’verzuiling’) (Spiecker and Steutel, 2001). In consequence, schools have been free to
choose their teaching methods as well as learning contents, while both public-authority and private schools are publicly funded.
Main goals are formulated in nationwide guidelines which are binding for all schools and which focus mainly on algebra and text
comprehension skills, as laid down in current national laws (the Secondary Education Act35 with its attainment targets (’kerndoelen’)36 and
reference levels (’referentieniveaus’)37). To ensure that students meet the performance requirements specified in these guidelines, state-wide
standardized tests are held regularly as part of a student tracking system (’leerlingvolgsysteem’ (LVS)), starting in primary school. These
tests are meant to ensure that minimal reference levels are reached across the diverse school landscape. The most frequently used LVS is the
CITO system developed by the Centraal instituut voor toetsontwikkeling38.
Between the ages of four and twelve, children attend primary school (’basisschool’) running through eight grades (’group 1’ through
35
https://wetten.overheid.nl/BWBR0002399.
36
https://www.rijksoverheid.nl/documenten/besluiten/2010/09/17/kerndoelen-onderbouw-voortgezet-onderwijs.
37
https://www.rijksoverheid.nl/onderwerpen/taal-en-rekenen/referentiekader-taal-en-rekenen.
38
https://nl.wikipedia.org/wiki/Centraal_Instituut_voor_Toetsontwikkeling.
‘group 8’). School attendance is mandatory and starts in group 2 at age five. From group 3 onwards, all pupils are tested twice a year
with standardized LVS tests, around January and in early June. In group 8, all schools are required to administer a final aptitude
test called the ‘eindtoets basisonderwijs’. Since 2015, teachers have had the final authority to decide on the school recommendation, yet the
‘eindtoets’, together with previous LVS test results and qualitative assessments, forms the most relevant set of criteria used in this decision.
The result is a binding recommendation which constrains accessibility to the different educational levels of secondary school.
Secondary education, which begins at the age of 12 and is compulsory until the age of 18, is offered at several levels. Vocational
training (vmbo) is offered at four different levels. Non-vocational education is offered at two levels: havo (five years) and vwo (six years).
The havo diploma is the minimum requirement for admission to hbo (university of applied sciences). The vwo diploma grants access to
academic universities. There are, however, various ways in which students with a lower recommendation can still make their way to university. For
example, one can also enter university after successfully completing the first year (’propaedeuse’) of hbo.
In the first half of the 20th century, the Netherlands was searching for a way to prevent pupils from dropping out of school. The idea
was to select pupils into educational levels in a more systematic way according to their merits and according to society’s needs (Heij,
2021). The need for a standardized method of selection was answered by A.D. de Groot’s invention of CITO in 1966. Multiple choice
questions made it possible to standardize and automate grading (Groot and van Naerssen, 1973). However, on the basis of which learning goals
should a decision on educational levels be made (De Groot, 1965), in particular when society was divided into cultural and
religious pillars (’verzuiling’)? In Amsterdam in 1967, de Groot was involved in setting up a precursor of CITO, the ‘Amsterdamse
schooltoets’. At the time, approximately 20% of primary school recommendations in Amsterdam led to an academic secondary level
(Heij, 2021, ch. 4.8). De Groot found a correlation between his test scores and these recommendations, and on this basis concluded that
his tests were ‘valid’ (Heij, 2021, p.140). Thenceforth, the situation in Amsterdam was taken as a norm for future selection. This also had
the advantage that the limited educational resources (e.g. the limited number of places available on a level) could be allocated in
an efficient manner39.
CITO tests usually contain 3 text comprehension tests covering 4-7 texts each, 3 math tests with 32 questions each, mostly based on
stories (’verhaaltjessommen’), and two spelling tests. In addition, pupils read aloud to a teacher, who times them and checks for
reading mistakes. The questions of a CITO test are closed-ended questions with single-select options. This means that only a single one of
the offered answers is supposed to be true. The two examples in the following illustrate the difficulty of interpreting CITO tasks and
answers unambiguously.
Fig. A1 asks the pupils to assess the metric unit of a given height value (120) for a church tower. The expected answer is ‘decimeter’,
meaning that the tower is supposed to be only 12 meters high, and not ‘meter’, which would lead to a height of 120 meters. This answer
can be guessed by recognizing that the woman is less than 2 meters tall and that her height can be stacked six times to measure the tower’s height.
However, this approach assumes that the depicted drawing is true to scale, which remains unclear from the task description. If this
assumption is not made, then a plausible height can also be guessed based on one’s experience. Churches in other European countries
are often higher than 120 meters. For example, the Cologne Cathedral in Germany is 157 meters, and Notre Dame in Paris is 130 meters
high. Depending on the experience of a child, it can be perfectly plausible to assume that the correct answer is ‘meter’, and not
‘decimeter’.
In the following example, pupils need to read a text (excerpted here) about the painter Rembrandt and the reasons for his fame:
The painter Rembrandt van Rijn lived from 1606 to 1669. At that time, the Golden Age, the Netherlands was doing very well.
The wealth of the population ensured that the arts also prospered. There were almost 150 painters in Amsterdam alone. People
wanted a nice painting on the wall and had the money to buy it. Yet there are only a few of those 150 painters that we still know
today, while Rembrandt is world famous today. How come his paintings are sold for millions?... [translation by the authors]
After reading this text, pupils need to answer several single option multiple-choice questions (one reproduced here with our own
translations):
Opgave 1. What does the writer want to say with the sentence: there were almost 150 painters in Amsterdam alone?
39
Heij, e.g., reports on a practice of selection in Curaçao where the percentage was explicitly based on the number of available places in havo/vwo; see (Heij, 2021, p.81).
all in answering this question. Still, the most likely (and expected) answer is the last one, as it is the statement immediately preceding
the sentence cited in the question: 150 artists being a sign that art was doing well. However, it can hardly be argued that 150 artists are
not equally a sign of the richness of the population, or of the Golden Age. Furthermore, the former fact can rightfully also be
considered a sign of Amsterdam’s role as the capital of art during the Golden Age. This ambiguity can only be resolved by measuring
syntactical distances between the sentences. In summary, note that answering both questions requires special training, since it demands
skills that have nothing to do with comprehending a text or understanding an algebra task.
References
Baltussen, M.e.a. (2019). De staat van het onderwijs 2017/2018. Technical Report. Utrecht: Inspectie van het Onderwijs.
Biesta, G. (2015). Beautiful risk of education. Routledge.
Biesta, G. (2020). Risking ourselves in education: Qualification, socialization, and subjectification revisited. Educational Theory, 70(1), 89–104.
Biggs, J. (1996). Enhancing teaching through constructive alignment. Higher Education, 32(3), 347–364.
Bildungsberichterstattung, A. (2020). Bildung in Deutschland. Ein indikatorengestuetzter Bericht mit einer Analyse zu Bildung in einer digitalisierten Welt. Technical
Report. wbv Media.
Boonstra, C., de Graaf Bierbrauwer, C., & Carstens, N. (2020). Het onderwijsvragenboek: waarom doen we de dingen zoals we ze doen? Amsterdam University Press.
Borghans, L., & Schils, T. (2015). De invloed van opbrengstindicatoren op het functioneren van scholen in het primair onderwijs. Technical Report. Onderwijsinspectie,
Universiteit Maastricht.
Boterman, W. R. (2013). Dealing with diversity: Middle-class family households and the issue of black and white schools in Amsterdam. Urban Studies, 50(6),
1130–1147.
Csikszentmihalyi, M. (1990). Flow: The psychology of optimal experience. New York: Harper & Row.
De Groot, A. (1965). Vijven en zessen.
Dickhäuser, O., Janke, S., Praetorius, A.-K., & Dresel, M. (2017). The effects of teachers’ reference norm orientations on students’ implicit theories and academic self-
concepts. Zeitschrift für Pädagogische Psychologie, 31, 205–219.
Dittrich, E. (2014). Underachievement leading to downgrading at the highest level of secondary education in the Netherlands: A longitudinal case study. Roeper
Review, 36(2), 104–113.
Dronkers, J., & van de Werfhorst, H. (2016). Meritocratisering in schoolloopbanen in Nederland. Meritocratie, 25–43.
Dupriez, V., & Dumay, X. (2006). Inequalities in school systems: effect of school structure or of society structure? Comparative Education, 42(02), 243–260.
Elffers, L. (2018). De bijlesgeneratie: opkomst van de onderwijscompetitie. Amsterdam University Press.
Elliot, A. J., & McGregor, H. A. (2001). A 2× 2 achievement goal framework. Journal of Personality and Social Psychology, 80(3), 501.
Engeser, S., & Rheinberg, F. (2008). Flow, performance and moderators of challenge-skill balance. Motivation and Emotion, 32(3), 158–172.
Felouzis, G., & Charmillot, S. (2013). School tracking and educational inequality: A comparison of 12 education systems in Switzerland. Comparative Education, 49(2),
181–205.
Fernandez-Cano, A. (2016). A methodological critique of the PISA evaluations. Relieve, 22(1), 1–16.
Gamoran, A., & Mare, R. D. (1989). Secondary school tracking and educational inequality: Compensation, reinforcement, or neutrality? American Journal of Sociology,
94(5), 1146–1183.
Glasersfeld, E.v. (1995). A constructivist approach to teaching. In L. P. Steffe, & J. Gale (Eds.), Constructivism in education (pp. 3–15).
Glasersfeld, E.v. (2006). Radical constructivism in mathematics education (vol. 7). Springer Science & Business Media.
Gooch, P. W. (2019). Course Correction: A Map for the Distracted University. University of Toronto Press.
Gorard, S., & Smith, E. (2004). An international comparison of equity in education systems. Comparative Education, 40(1), 15–28.
Groot, A.d., & van Naerssen, R. (1973). Studietoetsen: construeren, afnemen, analyseren (deel I). Den Haag: De Gruyter Mouton.
Hanselman, P. (2018). Do school learning opportunities compound or compensate for background inequalities? evidence from the case of assignment to effective
teachers. Sociology of Education, 91(2), 132–158.
Heij, K. (2021). Van de kat en de bel: Tellen en vertellen met de eindtoets basisonderwijs. Ph.D. thesis.
Ingenkamp, K. (1977). Die Fragwürdigkeit der Zensurengebung [The dubiety of school grades]. Weinheim: Beltz.
Janich, P. (2001). Logisch-pragmatische Propädeutik: ein Grundkurs im philosophischen Reflektieren. Velbrück Wiss.
Janich, P. (2015). Handwerk und Mundwerk: über das Herstellen von Wissen. CH Beck.
Keeves, J. P., Morgenstern, C., & Saha, L. J. (1991). Educational expansion and equality of opportunity: Evidence from studies conducted by IEA in ten countries in
1970–71 and 1983–84. International Journal of Educational Research, 15(1), 61–80.
Kolb, A., & Kolb, D. (2018). Eight important things to know about the experiential learning cycle. Australian Educational Leader, 40(3), 8–14.
Lago, R., & Beglinger, M. (2009). Schülerjahre. Wie Kinder besser lernen. München, Zürich (Piper), 4.
Leeuw, E.d., & Lecina, N. (2016). Onderwijs dat kansrijk maakt. Technical Report. Ministerie OCW.
Lubbers, M. J., Van Der Werf, M. P., Kuyper, H., & Hendriks, A. J. (2010). Does homework behavior mediate the relation between personality and academic
performance? Learning and Individual Differences, 20(3), 203–208.
Marks, G. N., Cresswell, J., & Ainley, J. (2006). Explaining socioeconomic inequalities in student achievement: The role of home and school factors. Educational
Research and Evaluation, 12(02), 105–128.
McCowan, T. (2016). Three dimensions of equity of access to higher education. Compare: A Journal of Comparative and International Education, 46(4), 645–665.
McFarland, J., Hussar, B., Zhang, J., Wang, X., Wang, K., Hein, S., Diliberti, M., Cataldi, E. F., Mann, F. B., & Barmer, A. (2019). The condition of education 2019. NCES
2019-144. National Center for Education Statistics.
McKeachie, W., & Svinicki, M. (2013). McKeachie’s teaching tips. Cengage Learning.
Merry, M. S. (2019). Educational Justice: Liberal Ideals, Persistent Inequality, and the Constructive Uses of Critique. Palgrave McMillan.
Merry, M. S., & Boterman, W. (2020). Educational inequality and state-sponsored elite education: the case of the dutch gymnasium. Comparative Education, 1–25.
Moss, P. A., Girard, B. J., & Haniford, L. C. (2006). Chapter 4: Validity in educational assessment. Review of Research in Education, 30(1), 109–162.
Muskens, M., & Tholen, R. (2015). Onderzoek sturen op cijfers en rendementen. Technical Report. Ministerie van OCW, Research Ned Nijmegen.
Naayer, H., Spithoff, M., Osinga, M., Klitzing, N., Korpershoek, H., & Opdenakker, M. (2016). De overgang van primair naar voortgezet onderwijs in internationaal
perspectief. Technical Report. GION onderwijs/onderzoek.
Oppers, A.e.a. (2022). De staat van het onderwijs 2022. Technical Report. Utrecht: Inspectie van het Onderwijs.
Osborne, J., Erduran, S., & Simon, S. (2004). Enhancing the quality of argumentation in school science. Journal of Research in Science Teaching, 41(10), 994–1020.
Pillai, S. (2012). Right to education under european convention for the protection of human rights and fundamental freedoms 1950. Christ University Law Journal, 1(1),
101–115.
van Pinxteren, M., & de Beer, P. (2016). Meritocratie: op weg naar een nieuwe klassensamenleving. Amsterdam University Press.
Plomin, R., & Von Stumm, S. (2018). The new genetics of intelligence. Nature Reviews Genetics, 19(3), 148–159.
Popper, K. R. (1971). The open society and its enemies: The spell of Plato (vol. 1). Princeton University Press.
Resing, W. C., Elliott, J. G., & Vogelaar, B. (2020). Assessing potential for learning in school children. Oxford research encyclopedias, education.