EXPERIMENTAL DESIGNS
BARAK ARIEL
MATTHEW BLAND
ALEX SUTHERLAND
THE SAGE QUANTITATIVE RESEARCH KIT
– David Cox and Nancy Reid (2000, p. 1). The Theory of the Design of Experiments.
1 Introduction
2 R is for Random
Comparisons in Science
Counterfactuals
Randomisation
Procedures for Random Assignment (How to Randomise)
Simple Random Assignment
Alternatives to Simple Random Assignment: Restrictive Allocation
The Randomised Block Design
Trickle Flow Random Assignment
Matched Pairs Design
Minimisation
Randomisation in Practice: The ‘Randomiser’
Units of Randomisation (What to Randomise?)
Person-Based Randomisation
Place-Based Randomisation
Temporal-Based Randomisation
A Final Word Regarding the Choice of Units of Randomisation
Should We Use Significance Tests for Baseline Imbalances After Randomisation?
Inferential Statistics and Probability Theory
Randomisation and Sample Size Considerations
Summarising the Benefits of Randomisation
3 C is for Control
Glossary
References
Index
Alex Sutherland is the Chief Scientist and Director of Research and Evaluation
for the Behavioural Insights Team. He has nearly 20 years of experience in experi-
mental designs and evaluation, including work on the use and understanding of
research evidence for decision-making and policy development. His research interests
lie in criminal justice, violence and violence prevention. He previously worked at
RAND Europe and the University of Cambridge. He is currently a member of the UK
Government Trial Advice Panel.
Chapter Overview
Contextualising randomised experiments in a wide range of causal designs
Causal designs and the scientific meaning of causality
Why should governments and agencies care about causal designs?
Further Reading
Formal textbooks on experiments first surfaced more than a century ago, and thou-
sands have emerged since then. In the field of education, William McCall published
How to Experiment in Education in 1923; R.A. Fisher, a Cambridge scholar, released
Statistical Methods for Research Workers and The Design of Experiments in 1925 and
1935, respectively; S.S. Stevens circulated his Handbook of Experimental Psychology in
1951. We also have D.T. Campbell and Stanley’s (1963) classic Experimental and Quasi-
Experimental Designs for Research, and primers like Shadish et al.’s (2002) Experimental
and Quasi-Experimental Designs for Generalised Causal Inference, which has been cited
nearly 50,000 times. These foundational texts provide straightforward models for
using experiments in causal research within the social sciences.
Fundamentally, this corpus of knowledge shares a common, long-standing methodological theme: when researchers want to draw causal inferences about the relationship between interventions and outcomes, they need to conduct experiments. The basic model for demonstrating cause-and-effect relationships relies on a formal, scientific process of hypothesis testing, and that process is operationalised through the experimental design. One of these fundamental principles is that causal inference necessarily requires a comparison. A valid test of any intervention involves a situation
through which the treated group (or units) can be compared – what is termed a
counterfactual. Put another way, evidence of ‘successful treatment’ is always
relative to a world in which the treatment was not given (D.T. Campbell, 1969).
Whether the treatment group is compared to itself prior to the exposure to the inter-
vention, or a separate group of cases unexposed to the intervention, or even just
some predefined criterion (like a national average or median), contrast is needed.
While others might disagree (e.g. Pearl, 2019), without an objective comparison, we
cannot talk about causation.
Causation theories are found in different schools of thought (for discussions, see
Cartwright & Hardie, 2012; Pearl, 2019; Wikström, 2010). The dominant causal
framework is that of ‘potential outcomes’ (or the Neyman–Rubin causal framework;
Rubin, 2005), which we discuss herein and which many of the designs and exam-
ples in this book use as their basis. Until mainstream experimental disciplines revise
the core foundations of the standard scientific inquiry, one must be cautious when
recommending public policy based on alternative research designs. Methodologies
based on subjective or other schools of thought about what causality means will
not be discussed in this book. To emphasise, we do not discount these methodologies
and their contribution to research, not least for developing logical hypotheses about
the causal relationships in the universe. We are, however, concerned about risks to
the validity of these causal claims and how well they might stand a chance of being
implemented in practice. We discuss these issues in more detail in Chapter 4. For
further reading, see Abell and Engel (2019) as well as Abend et al. (2013).
However, not all comparisons can be evaluated equally. For the inference that a
policy or change was ‘effective’, researchers need to be sure that the comparison
group that was not exposed to the intervention resembles the group that was exposed
to the intervention as much as possible. If the treatment group and the no-treatment
group are incomparable – not ‘apples to apples’ – it then becomes very difficult to
‘single out’ the treatment effect from pre-existing differences. That is, if two groups differ before an intervention starts, how can we be sure that it was the introduction of the intervention, and not the pre-existing differences, that produced the result?
To have confidence in the conclusions we draw from studies that look at the causal
relationship between interventions and their outcomes means having only one
attributable difference between treatment and no-treatment conditions: the treat-
ment itself. Failing this requirement suggests that any observed difference between
the treatment and no-treatment groups can be attributed to other explanations. Rival
hypotheses (and evidence) can then falsify – or confound – the hypothesis about the
causal relationship. In other words, if the two groups are not comparable at baseline,
then it can be reasonably argued that the outcome was caused by inherent differ-
ences between the two groups of participants, by discrete settings in which data
on the two groups were collected, or through diverse ways in which eligible cases
were recruited into the groups. Collectively, these plausible yet alternative explana-
tions to the observed outcome, other than the treatment effect, undermine the test.
Therefore, a reasonable degree of ‘pre-experimental comparability’ between the two
groups is needed, or else the claim of causality becomes speculative. We spend a con-
siderable amount of attention on this issue throughout the book, as all experimenters
share this fundamental concern regarding equivalence.
Experiments are then split into two distinct approaches to achieving pre-experimental comparability: statistical designs and randomisation. Both aim to create equivalence between treatment and control conditions but achieve this goal differently. Statistical designs, often referred to as quasi-experimental methods, rely on
statistical analysis to control and create equivalence between the two groups. For
example, in a study on the effect of police presence on crime in particular neighbour-
hoods, researchers can compare the crime data in ‘treatment neighbourhoods’ before
and after patrols were conducted, and then compare the results with data from ‘con-
trol neighbourhoods’ that were not exposed to the patrols (e.g. Kelling et al., 1974;
Sherman & Weisburd, 1995). Noticeable differences in the before–after comparisons
would then be attributed to the police patrols. However, if there are also observable
differences between the neighbourhoods or the populations who live in the treat-
ment and the no-treatment neighbourhoods, or the types of crimes that take place in
these neighbourhoods, we can use statistical controls to ‘rebalance’ the groups – or
at least account for the differences between groups arising from these other variables.
Through statistically controlling for these other variables (e.g. Piza & O’Hara, 2014;
R.G. Santos & Santos, 2015; see also The SAGE Quantitative Research Kit, Volume 7),
scholars could then match patrol and no-patrol areas and take into account the con-
founding effect of these other factors. In doing so, researchers are explicitly or implic-
itly saying ‘this is as good as randomisation’. But what does that mean in practice?
While on the one hand, we have statistical designs, on the other, we have exper-
iments that use randomisation, which relies on the mathematical foundations of
probability theory (as discussed in The SAGE Quantitative Research Kit, Volume 3).
Probability theory postulates that through the process of randomly assigning cases
into treatment and no-treatment conditions, experimenters have the best shot of
achieving pre-experimental comparability between the two groups. This is owing to
the law of large numbers (or ‘logic of science’ according to Jaynes, 2003). Allocating
units at random does, with a large enough sample, create balanced groups. As we
illustrate in Chapter 2, this balance is not just apparent for observed variables (i.e.
what we can measure) but also in terms of the unobserved factors that we cannot
measure (cf. Cowen & Cartwright, 2019). For example, we can match treatment and
comparison neighbourhoods in terms of crimes reported to the police before the
intervention (patrols), and then create balance in terms of this variable (Saunders
et al., 2015; see also Weisburd et al., 2018). However, we cannot create true balance
between the two groups if we do not have data on unreported crimes, which may be
very different in the two neighbourhoods.
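As a minimal sketch of this balancing property (the crime figures, sample size and variable names below are purely hypothetical), randomly allocating a set of simulated neighbourhoods shows that both a measured characteristic (reported crime) and an unmeasured one (unreported crime) tend to even out between the two groups:

```python
# A minimal simulation sketch: neighbourhoods have a measured variable
# (reported crime) and an unmeasured one (unreported crime). Random allocation
# tends to balance both; matching on reported crime alone cannot guarantee
# balance on the unreported counts. All values are invented for illustration.
import random
import statistics

random.seed(42)

# Hypothetical neighbourhoods: (reported crimes, unreported crimes)
neighbourhoods = [(random.gauss(100, 20), random.gauss(60, 25)) for _ in range(200)]

# Randomly allocate half to 'patrol' (treatment) and half to 'no patrol' (control)
random.shuffle(neighbourhoods)
treatment, control = neighbourhoods[:100], neighbourhoods[100:]

def group_mean(group, index):
    return statistics.mean(unit[index] for unit in group)

print("Reported crime   - treatment vs control:",
      round(group_mean(treatment, 0), 1), "vs", round(group_mean(control, 0), 1))
print("Unreported crime - treatment vs control:",
      round(group_mean(treatment, 1), 1), "vs", round(group_mean(control, 1), 1))
# With 100 units per arm, both the measured and the unmeasured variable are
# expected to be close on average; the gap shrinks further as n grows.
```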
We cannot use statistical controls where no data exist or where we do not measure
something. The randomisation of units into treatment and control conditions largely
mitigates this issue (Farrington, 2003a; Shadish et al., 2002; Weisburd, 2005). This
quality makes, in the eyes of many, randomised experiments a superior approach to
other designs when it comes to making causal claims (see the debates about ‘gold
standard’ research in Saunders et al., 2016). Randomised experiments have what is
called a high level of internal validity (see review in Grimshaw et al., 2000; Schweizer
et al., 2016). What this means is that, when properly conducted, a randomised experi-
ment gives one the greatest confidence levels that the effect(s) observed arose because
of the cause (randomly) introduced by the experiment, and not due to something else.
The parallel phrase – external validity – means the extent to which the results
from this experiment can apply elsewhere in the world. Lab-based randomised
experiments typically have very high internal validity, but very low external valid-
ity, because their conditions are highly regulated and not replicable in a ‘real-world’
scenario. We review these issues in Chapter 3.
Importantly, random allocation means that randomised experiments are prospective
not retrospective – that is, testing forthcoming interventions, rather than ones that
have already been administered where data have already been produced. Prospective
studies allow researchers to maintain more control compared to retrospective studies.
The researcher is involved in the very process of case selection, treatment fidelity (the
extent to which a treatment is delivered or implemented as intended) and the data
collated for the purposes of the experiment. Experimenters using random assignment
are therefore involved in the distribution and management of units into different real-
life conditions (e.g. police patrols) ex ante and not ex post. As the scholar collaborates
with a treatment provider to jointly follow up on cases, and observe variations in the
measures within the treatment and no-treatment conditions, they are in a much better
position to provide assurance that the fidelity of the test is maintained throughout the
process (Strang, 2012). These features rarely exist in quasi-experimental designs, but at
the same time, randomised experiments require scientists to pay attention to maintain-
ing the proper controls over the administration of the test. For this reason, running a
randomised controlled trial (RCT) can be laborious.
In Chapter 5, we cover an underutilised instrument – the experimental protocol –
and illustrate the importance of conducting a pre-mortem analysis: designing and
crafting the study before venturing out into the field. The experimental protocol
requires the researcher to address ethical considerations: how we can secure the rights
of the participants, while advancing scientific knowledge through interventions that
might violate these rights. For example, in policing experiments where the partici-
pants are offenders or victims, they do not have the right to consent; the policing
strategy applied in their case is predetermined, as offenders may be mandated by
a court to attend a treatment for domestic violence. However, the allocation of the
offenders into any specific treatment is conducted randomly (see Mills et al., 2019). Of
course, if we know that a particular treatment yields better results than the comparison
treatment (e.g. reduces rates of repeat offending compared to the rates of reoffending
under control conditions), then there is no ethical justification for conducting the
experiment. When we do not have evidence that supports the hypothesised benefit
of the intervention, however, then it is unethical not to conduct an experiment. After
all, the existing intervention for domestic batterers can cause backfiring effects and
lead to more abuse. This is where experiments are useful: they provide evidence on
relative utility, based on which we can make sound policy recommendations. Taking
these points into consideration, the researcher has a duty to minimise these and other
ethical risks as much as possible through a detailed plan that forms part of the research
documentation portfolio.
Vitally, the decision to randomise must also then be followed with the question
of which ‘units’ are the most appropriate for random allocation. This is not an easy question to answer because there are multiple options; the choice is not purely theoretical but also a pragmatic one. The decision is shaped by the very nature of the
field, settings and previous tests of the intervention. Some units are more suitable for
addressing certain theoretical questions than others, so the size of the study matters,
as well as the dosage of the treatment. Data availability and feasibility also determine
these choices. Experimenters then need to consider a wide range of methods for actually conducting the random assignment, choosing between simple, ‘trickle flow’, block random assignment, cluster, stratification and other perhaps more nuanced
and bespoke sequences of random allocation designs. We review each of these design
options in Chapter 2.
We then discuss issues of control in some detail in Chapter 3. The mechanisms used to administer randomised experiments are broad, and the technical literature on these matters is rich. Issues of group imbalance, sample size and measurement are all closely linked to running an unbiased experiment. Considerations of these
problems begin in the planning stage, with a pre-mortem assessment of the possible
pitfalls that can lead the experimenter to lose control over the test (see Klein, 2011).
Researchers need to be aware of threats to internal validity, as well as the external
validity of the experimental tests, and find ways to avoid them during the experi-
mental cycle. We turn to these concerns in Chapter 3 as well.
In Chapter 4, we account for the different types of experimental designs available
in the social sciences. Some are as ‘simple’ as following up with a group of partici-
pants after their exposure to a given treatment, having been randomly assigned into
treatment and control conditions, while others are more elaborate, multistage and
complex. The choice of applying one type of test and not another is both concep-
tual and pragmatic. We rely heavily on classic texts by D.T. Campbell and Stanley
(1963), Cook and Campbell (1979) and the amalgamation of these works by Shadish
et al. (2002), which detail the mechanics of experimental designs, in addition to their
rationales and pitfalls. However, we provide more updated examples of experiments
that have applied these designs within the social sciences. Many of our examples
are criminological, given our backgrounds, but are applicable to other experimental
disciplines.
Chapter 4 also provides some common types of quasi-experimental designs that can
be used when the conditions are not conducive to random assignment (see Shadish
et al., 2002, pp. 269–278). Admittedly, much of the evidence base in causal research comprises statistical techniques, including the regression discontinuity design, propensity score matching, the difference-in-differences design, and many others. We introduce these approaches and refer the reader to the technical literature on how to estimate causal effects with these advanced statistics.
Before venturing further, we need to contextualise experiments in a wide range
of study designs. Understanding the role that causal research has in science, and
what differentiates it from other methodological approaches, is a critical first step.
To be clear, we do not argue that experiments are ‘superior’ compared to other meth-
ods; put simply, the appropriate research design follows the research question and
the research settings. The utility of experiments is found in their ability to allow
RCTs are (mostly) regarded as the ‘gold standard’ of impact evaluation research
(Sherman et al., 1998). The primary reason for this affirmation is internal validity,
which is the feature of a test that tells us that it measures what it claims to measure
(Kelley, 1927, p. 14). Simply put, well-designed randomised experiments that are cor-
rectly executed have the highest possible internal validity to the extent that they ena-
ble the researcher to quantifiably demonstrate that a variation in a treatment (what
we call changes in the ‘independent variable’) causes variation(s) in an outcome,
or the ‘dependent variable(s)’ (Cook & Campbell, 1979; Shadish et al., 2002). We
will contextualise randomised experiments against other causal designs – this is more
of a level playing field – but then illustrate that ‘basically, statistical control is not as
good as experimental control’ (Farrington, 2003b, p. 219) and ‘design trumps analysis’
(Rubin, 2008, p. 808).
Another advantage of randomised experiments is that they account for what is
called selection bias – that is, results derived from choices that have been made
or selection processes that create differences – artefacts of selection rather than true
differences between treatment groups. In non-randomised controlled designs, the treatment group may be selected in ways that favour its success, because the treatment provider has an inherent interest in recruiting members who would benefit from it.
This is natural, as the interest of the treatment provider is to assist the participants
with what they believe is an effective intervention. Usually, patients with the best
prognosis are participants who express the most desire to improve their situation, or
individuals who are the most motivated to successfully complete the intervention
1 Notably, however, researchers resort to quasi-experimental designs especially when policies have
been rolled out without regard to evaluation, and the fact that some cases were ‘creamed in’ is
not necessarily borne out of an attempt to cheat. Often, interventions are simply put in place
with the primary motivation of helping those who would benefit the most from the treatment.
This means that we should not discount quasi-experimental designs, but rather accept their
conclusions with the necessary caveats.
relies on randomised experiments will result in more precise and reliable answers to
questions about what works for policy and practice decision-makers.
In light of these (and other) advantages of randomised experiments, it might be
expected that they would be widely used to investigate the causes of offending and
the effectiveness of interventions designed to reduce offending. However, this is not
the case. Randomised experiments in criminology and criminal justice are relatively
uncommon (Ariel, 2009; Farrington, 1983; Weisburd, 2000; Weisburd et al., 1993;
see more recently Dezember et al., 2020; Neyroud, 2017), at least when compared to
other disciplines, such as psychology, education, engineering or medicine. We will
return to this scarcity later on; however, for now we return to David Farrington:
The history of the use of randomised experiments in criminology consists of feast and
famine periods . . . in a desert of nonrandomised research. (Farrington, 2003b, p. 219)
We illustrate more thoroughly why this is the case and emphasise why and how we
should see more of these designs – especially given criminologists’ focus on ‘what
works’ (Sherman et al., 1998), and the very fact that efficacy and utility are best
tested using experimental rather than non-experimental designs. Thus, in Chapter 6,
we will also continue to emphasise that not all studies in criminal justice research
can, or should, follow the randomised experiments route. When embarking on an
impact evaluation study, researchers should choose the most fitting and cost-effective
approach to answering the research question. This dilemma is less concerned with
the substantive area of research – although it may serve as a good starting point to
reflect on past experiences – and more concerned with the ways in which such a
dilemma can be answered empirically and structurally.
Causality in science means something quite specific, and scholars are usually in agreement about three minimal preconditions for declaring that a causal relationship exists between cause(s) and effect(s):
1 The cause and the effect covary (there is an association between them).
2 The cause precedes the effect in time.
3 Plausible alternative explanations for the observed association can be ruled out.
Beyond these criteria, which date back as far as the 18th-century philosopher David
Hume, others have since added the requirement (4) for a causal mechanism to be
explicated (Congdon et al., 2017; Hedström, 2005); however, more crucially in the
context of policy evaluation, there has to be some way of manipulating the cause
(for a more elaborate discussion, see Lewis, 1974; and the premier collection of papers
on causality edited by Beebee et al., 2009). As clearly laid out by Wikström (2008),
If we cannot manipulate the putative cause/s and observe the effect/s, we are stuck with
analysing patterns of association (correlation) between our hypothesised causes and
effects. The question is then whether we can establish causation (causal dependencies)
by analysing patterns of association with statistical methods. The simple answer to this
question is most likely to be a disappointing ‘no’. (p. 128)
Holland (1986) has the strictest version of this idea, which is often paraphrased as
‘no causation without manipulation’. That in turn has spawned numerous debates
on the manipulability of causes being a prerequisite for causal explanation. As Pearl
(2010) argues, however, causal explanation is a different endeavour.
Taking the three prerequisites for determining causality into account, it immedi-
ately becomes clear why observational studies are not in a position to prove causal-
ity. For example, Tankebe’s (2009) research on legitimacy is valuable for indicating
the relative role of procedural justice in affecting the community’s sense of police
legitimacy. However, this type of research cannot firmly place procedural justice as a
causal antecedent to legitimacy because the chronological ordering of the two vari-
ables is difficult to lay out within the constraints of a cross-sectional survey.
Similarly, one-group longitudinal studies have shown significant (and negative)
correlations between age and criminal behaviour (Farrington, 1986; Hirschi &
Gottfredson, 1983; Sweeten et al., 2013).2 In this design, one group of participants is
followed over a period of time to illustrate how criminal behaviour fluctuates across
different age brackets. The asymmetrical, bell-shaped age–crime curve illustrates
that the proportion of individuals who offend increases through adolescence, peaks
around the ages of 17 to 19, and then declines in the early 20s (Loeber & Farrington,
2014). For example, scholars can study a cohort of several hundred juvenile delin-
quents released from a particular institution between the 1960s and today, and learn
when they committed offences to assess whether they exhibit the same age–crime
curve. However, there is no attempt to compare their behaviour to any other group
of participants. While we can show there is a link between the age of the offender and
the number of crimes they committed over a life course, we cannot argue that age
causes crime. Age ‘masks’ the causal factors that are associated with these age brackets
(e.g. peer influence, bio-socio-psychological factors, strain). Thus, this line of obser-
vational research can firmly illustrate the temporal sequence of crime over time,
but it cannot sufficiently rule out alternative explanations (outside of the age factor)
2 We note the distinction between different longitudinal designs that are often incorrectly referred
to as a single type of research methodology. We discuss these in Chapter 4.
to the link between age and crime (Gottfredson & Hirschi, 1987). Thus, we ought to
be careful in concluding causality from observational studies.3
Even in more complicated, group-based trajectory analyses, establishing causal-
ity is tricky. These designs are integral to showing how certain clusters of cases or
offenders change over time (Haviland et al., 2007). For instance, they can convinc-
ingly illustrate how people are clustered based on the frequency or severity of their
offending over time. They may also use available data to control for various factors,
like ethnicity or other socio-economic factors. However, as we discussed earlier, they
suffer from the specification error (see Heckman, 1979): there may be variables that explain crime better than the grouping criterion (e.g. resilience, social bonds and internal
control mechanisms, to name a few), which often go unrecorded and therefore cannot
be controlled for in the statistical model.
3 On the question of causality, see Cartwright (2004), but also see the excellent reply in Casini
(2012).
the law rely more and more on research evidence to shape public policies, rather
than experience alone. When deciding to implement interventions that ‘work’, there
is a growing interest in evidence produced through rigorous studies, with a focus on
RCTs rather than on other research designs. In many situations, policies have been
advocated on the basis of ideology, pseudo-scientific methodologies and general conditions of ineffectiveness. In other words, such policies were simply not evidence-based; they were not established on systematic observation (Welsh & Farrington, 2001).
Consequently, we have seen a move towards more systematic evaluations of
crime-control practices in particular, and public policies in general, imbuing these
with a scientific research base. This change is part of a more general movement in
other disciplines, such as education (Davies, 1999; Fitz-Gibbon, 1999; Handelsman
et al., 2004), psychology (among many others, see Webley et al., 2001), economics
(Alm, 1991) and medicine. As an example, the Cochrane Library has approximately 2000 evidence-based medical and healthcare studies, and is considered the best single source of such studies. This much-needed vogue in crime prevention policy
began attracting attention some 15 years ago due to either ‘growing pragmatism or
pressures for accountability on how public funds are spent’ (Petrosino et al., 2001,
p. 16). Whatever the reason, evidence-based crime policy is characterised by ‘feast
and famine periods’ as Farrington puts it, which are influenced by either key indi-
viduals (Farrington, 2003b) or structural and cultural factors (Shepherd, 2003). ‘An
evidence-based approach’, it was said, ‘requires that the results of rigorous evaluation
be rationally integrated into decisions about interventions by policymakers and prac-
titioners alike’ (Petrosino, 2000, p. 635). Otherwise, we face the peril of implement-
ing evidence-misled policies (Sherman, 2001, 2009).
The aforementioned suggests that there is actually a moral imperative for con-
ducting randomised controlled experiments in field settings (see Welsh & Farrington,
2012). This responsibility is rooted in researchers’ obligation to rely on empirical and
compelling evidence when setting practices, policies and various treatments in crime
and criminal justice (Weisburd, 2000, 2003). For example, the Campbell Collaboration
Crime and Justice Group, a global network of practitioners, researchers and policy-
makers in the field of criminology, was established to ‘prepare systematic reviews
of high-quality research on the effects of criminological intervention’ (Farrington &
Petrosino, 2001, pp. 39–42). Moreover, other local attempts have provided policymakers
with experimental results as well (Braithwaite & Makkai, 1994; Dittmann, 2004; R.D.
Schwartz & Orleans, 1967; Weisburd & Eck, 2004). In sum, randomised experimental
studies are considered one of the better ways to assess intervention effectiveness in
criminology as part of an overall evidence-led policy imperative in public services
(Feder & Boruch, 2000; Weisburd & Taxman, 2000; Welsh & Farrington, 2001; however
cf. Nagin & Sampson, 2019).
Chapter Summary
Further Reading
Ariel, B. (2018). Not all evidence is created equal: On the importance of matching
research questions with research methods in evidence-based policing. In
R. Mitchell & L. Huey (Eds.), Evidence-based policing: An introduction (pp. 63–86).
Policy Press.
This chapter provides further reading on the position of causal designs within
research methods from a wider perspective. It lays out the terrain of research
methods and provides a guide on how to select the most appropriate research
method for different types of research questions.
Chapter Overview
Comparisons in science
Procedures for random assignment (how to randomise)
Randomisation in practice: the ‘randomiser’
Units of randomisation (what to randomise?)
A final word regarding the choice of units of randomisation
Should we use significance tests for baseline imbalances after randomisation?
Inferential statistics and probability theory
Randomisation and sample size considerations
Summarising the benefits of randomisation
Further Reading
Comparisons in science
Box 2.1
Counterfactuals
Ideally, we should follow the same people along two different parallel universes,
so that we could observe them both with and without attending AA. These are
true counterfactual conditions, and they have been discussed extensively elsewhere
(Dawid, 2000; Lewis, 2013; Morgan & Winship, 2015; Salmon, 1994). A counterfac-
tual is ‘a comparison between what actually happened and what would have hap-
pened in the absence of the intervention’ (White, 2006, p. 3). These circumstances
are not feasible in real-life settings, but the best Hollywood representation of this
idea is the 1998 film Sliding Doors, starring Gwyneth Paltrow. In that film, we see the
protagonist ‘make a train’ and ‘not make a train’, and then we follow her life as she
progresses in parallel (Spoiler alert: in one version, she finds out her partner is cheat-
ing on her, and in the other, she does not).1 However, without the ability to travel
through time, we are stuck: we cannot both observe the ‘train’ and the ‘no-train’
worlds at the same time.
In the context of the AA example above, we can use another group of people who
did not take part in AA and look at their drinking patterns at the same time as those
who were in the AA group – for example, alcoholics in another town who did not
take part in AA. This approach is stronger than a simple before–after with one group
only. It is also a reasonable design: if the comparison group is ‘sufficiently similar’ to
the AA participants, then we can learn about the effect of AA in relative terms, against
a valid reference group.
1 One of the authors uses the trailer for this film in teaching to illustrate ideal counterfactuals.
The question is, however, whether the alcoholics in the other town who did not
take part in AA are indeed ‘sufficiently similar’. The claim of similarity between the
AA participants and the non-AA participants is crucial. In ideal counterfactual set-
tings, the two groups are identical, but in non-ideal settings, we should expect some
differences between the groups, and experiments attempt to reduce this variation as
much as possible. Sometimes, however, it is not possible to artificially reduce this
variation. For example, our AA participants may have one attribute that the com-
parison group does not have: they chose to be in the AA group. The willingness to
engage with the AA group is a fundamental difference between the AA treatment
group and the comparison group – a difference that goes beyond merely attending AA.
For example, those at AA might be more open to personal change and thus have a
generally better prognosis to recover from their drinking problem. Alternatively, it
might be that those in the AA group drank much more alcohol pre-intervention;
after all, they were feeling as though they needed to seek help. It could also be that
the AA group participants were much more motivated to address their drinking
issues than other people – even amongst those who have the same level of drinking
problems. This set of issues is bundled together under the term selection bias (see
review in Heckman, 1990; Heckman et al., 1998). In short, the two groups were dif-
ferent before the AA group even started treatment, and those differences are linked
to differences in their respective drinking behaviour after the treatment has been
delivered (or not delivered).
In this context, D.T. Campbell and Stanley (1963) note the possibility that ‘the
invitation itself, rather than the therapy, causes the effect’ (p. 16). Those who are invited to participate may be qualitatively different from those who are not invited to take part in the intervention. The solution is to create experimental and control
groups from amongst seekers of the treatment, and not to assign them beforehand.
This ‘accounts for’ the motivation part of treatment seeking behaviour. To avoid
issues of withholding treatment, a waitlist experiment in which everybody ultimately
gets treatment, but at different times, might be used. This approach introduces new
challenges, which are discussed in Chapter 4.
Still, not being able to compare like with like undermines our faith in the con-
clusion that AA group participation is causally linked to less drinking – but we can
problematise this issue even further. In the AA example, we might have the data
about drinking patterns through observations (e.g. with daily breath testing). We can
collect data and measure drinking behaviour or alcohol intake, and then match the
two groups based on this variable. For example, the comparison group can comprise
alcoholics who have similar drinking intake to those in the AA group. However, the
problem is not so much with things we can measure that differ – although that can
still be difficult – the problem is with variables we cannot measure but that have
a clear effect on the outcomes (the ‘un-observables’). Statisticians are good at con-
trolling for factors that are represented in the data but cannot create equal groups
through statistical techniques when there are no data on unobservable phenomena.
For example, the motivation to ‘kick the drinking habit’ or the willingness to get help
are often unobservable or not recorded in a systematic way. If they are not recorded,
then how can we compare the AA participants with the non-participants?
Finally, participants in AA expect to gain benefits from their involvement, unlike those who did not choose to attend meetings. If those who did not participate are placed in a different treatment group, they will have a different set of expectations about their control conditions. Thus, motivations, commitments and
decisions are relevant factors in the success of any treatment that involves people –
but they are rarely measured or observed and therefore cannot be used during the
matching process (Fewell et al., 2007; M. Miller et al., 2016). We base our conclu-
sions on observed data but are resting on the untestable assumption that unobserved
differences are also balanced between the treatment and the control groups.
Given how problematic this assumption of similarity is in unobserved measures,
it is worth considering the issue further. Let us say the study was conducted in a
city where several members of the AA group were diagnosed with mental health
problems, such as depression, and a physician prescribed the ‘group’ treatment for
them as medical treatment, a process known as ‘social prescribing’ (Kimberlee, 2015;
Sullivan et al., 2005). This difference alters our expectations about the results; those
with depression might be starting with a worse prognosis, for example, but end with
a better one. Additionally, they might also benefit more from AA because their prior
mental ill health explains why their drinking might be initially higher and their
membership of the group is a result of a physician’s prescription. These conditions
may not happen at all in the comparison city – but we would not know this unless
we have had access to individual participants’ medical records. Imagine that we went
ahead with comparing the drinking behaviour of the two groups, and even assume that they have similar levels of drinking before the intervention, which is applied in the treatment group but not in the comparison group. If we then followed up both
groups of people, we can measure their consumption in the time after the interven-
tion takes place, remembering that only one set of people has been involved in the
group. Figure 2.1 shows a hypothetical comparison. We see that being in the AA
group actually seems to make people worse after attending. Given that we know they
had a pre-existing diagnosis of depression and were perhaps more likely to worsen
over time, we could explain this relationship away. Nevertheless, imagine if we did
not know about the depression: our conclusion would be that the group actually made
them worse than the comparison group – when in fact this may not necessarily be
true. What is more, if we were policymakers, we might cancel funding for AA groups
based on this evaluation.
Figure 2.1 A hypothetical comparison: treatment group, pre-existing depression and worse well-being
So what can we do? We can randomly assign eligible and willing participants to
either AA group or no-AA group (delayed AA group in the waitlist design). This
approach, which utilises the powers of science and probability theories, is detailed
below.
Randomisation
The most valid method we have in science to create two groups that are as similar
as possible to each other, thus creating the most optimal counterfactual conditions,
is randomisation. The random assignment of treatment(s) to participants is under-
stood as a fundamental feature of scientific experimentation in all fields of research,
because randomisation creates two groups that are comparable (Chalmers, 2001;
Lipsey, 1990), more than any other procedure in which participants are allocated
into groups. If properly conducted, any observed differences between the groups in
endpoint outcomes can be attributed to the treatment, rather than baseline features
of the groups (Schulz & Grimes, 2002b). If we collect enough units and then assign
them at random to the two groups, they are likely to be comparable on both known
and unknown confounding factors (Altman, 1991; M.J. Campbell & Machin, 1993).
However, what properties of randomisation give credence to the aforementioned
claims about the superiority of this procedure over other allocation procedures?
The answer lies in probability theory, which predicts that randomisation in ‘suf-
ficiently large’ trials will ensure that the study groups are similar in most aspects
(Torgerson & Torgerson, 2003). There are different mathematical considerations
When a deck of 52 playing cards is well shuffled, some players will still be dealt a better
set of cards than others. This is called the luck of the draw by card players . . . in card
games, we do not expect every player to receive equally good cards for each hand. All
this is true of the randomised experiment. In any given experiment, observed pretest
means will differ due to luck of the draw when some conditions are dealt a better set of
participants than others. But we can expect that participants will be equal over condi-
tions in the long run over many randomised experiments [emphasis added]. (p. 250)
For this reason (see more in The SAGE Quantitative Research Kit, Volume 3),
Farrington and Welsh (2005) were able to conclude that ‘statistical control (for
baseline balance) should be unnecessary in a randomised experiment, because
the randomisation should equate the conditions of all measured and unmeasured
variables’ (p. 30).
2 We simplify for the sake of the example; however, the ‘natural’ sex ratio at birth is approximately 105 boys per 100 girls (ranging from around 103 to 107 boys). See Ritchie (2019).
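The ‘luck of the draw’ point in the quotation can be made concrete with a minimal simulation (the sample size of 30 and the pretest scale are arbitrary assumptions): any single small experiment may show a pretest imbalance, but over many randomised experiments the imbalances average out to essentially zero.

```python
# A sketch of the 'luck of the draw' argument: individual small experiments can
# show pretest imbalances, but over many randomised experiments the expected
# imbalance is zero. Sample size and pretest scale are arbitrary.
import random
import statistics

random.seed(2021)

def one_experiment(n=30):
    """Randomise n units with a pretest score; return the pretest mean gap."""
    pretest = [random.gauss(50, 10) for _ in range(n)]
    random.shuffle(pretest)
    treatment, control = pretest[: n // 2], pretest[n // 2:]
    return statistics.mean(treatment) - statistics.mean(control)

gaps = [one_experiment() for _ in range(2000)]
print("largest single-trial gap:", round(max(abs(g) for g in gaps), 2))
print("average gap over 2000 trials:", round(statistics.mean(gaps), 3))
# Any one trial can be 'dealt a better hand', but the long-run average gap
# is essentially zero - the sense in which randomisation equates the groups.
```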
There are many different procedures for allocating units into treatment and control
conditions. Below, we discuss four: we have one ‘pure’ random assignment pro-
tocol, or the ‘simple random assignment’ procedure, in which units are assigned
purely by chance into the treatment and control groups. As the number of units
involved in testing any particular intervention grows, pure random assignment
will increasingly generate the most comparable treatment and control conditions.
However, researchers are aware that any one experiment may not necessarily create
two balanced groups, especially when the sample size is not large ‘enough’. For exam-
ple, in a sample of 152 domestic violence cases randomly assigned by the court to
one of two types of mandated treatments between September 2005 and March 2007,
pure random assignment produced 82 versus 70 participants (Mills et al., 2012), or a
1:1.17 ratio.
Therefore, as we discuss more fully in this section, it is not always possible to rely
purely on chance, and strategies have been developed over the years as alternatives
to pure random assignment. These are the ‘restricted random assignment’ protocols,
and while they are common in medicine, they are far less common in experimen-
tal disciplines in the social sciences. According to these procedures, the researcher
arranges the data prior to random assignment in certain ways that increase the pre-
cision of the analysis. For example, researchers often segregate the pool of partici-
pants into subgroups, or blocks, based on an important criterion; this criterion is used
because there is reason to believe that some participants react ‘better’ to the interven-
tion. Random assignment is conducted after the subgrouping has occurred, so the test
is still fair and does not disadvantage any one group of participants over the other.
There are other procedures like the ‘randomised block design’ described above: trickle
flow random assignment and batch random assignment, matched pairs design and
minimisation. We delve into these below.
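As a minimal sketch of the simplest of these procedures (the case labels and sample size below are hypothetical), simple random assignment can be written in a few lines; running it also shows how chance alone produces uneven splits such as the 82 versus 70 allocation noted above.

```python
# A minimal sketch of simple (unrestricted) random assignment, assuming each
# case is allocated by an independent fair 'coin flip'. Illustrative only.
import random

random.seed(7)

def simple_random_assignment(case_ids):
    """Assign each case to 'treatment' or 'control' purely by chance."""
    return {case: random.choice(["treatment", "control"]) for case in case_ids}

cases = [f"case_{i:03d}" for i in range(1, 153)]   # e.g. 152 eligible cases
allocation = simple_random_assignment(cases)

n_treatment = sum(1 for arm in allocation.values() if arm == "treatment")
n_control = len(cases) - n_treatment
print(n_treatment, "treatment vs", n_control, "control")
# Unequal splits (e.g. 82 vs 70) arise by chance alone in samples of this size,
# which is one reason restricted procedures are often preferred in smaller trials.
```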
random sequence. In practice, treatment providers should not be able to know how
the next patient or participant will be assigned – and this benefit is crucial in order to
reduce the likelihood of selection bias.
To consider the damaging effect of selection bias, imagine a situation where a doc-
tor who is taking part in an experiment knows that the next patient will be offered
the treatment but does not think the patient is the right ‘fit’ – this might lead to
the doctor not offering treatment (overriding the allocation). This can happen with
police experiments as well, when police officers or their partner agencies in field
experiments allocate certain offenders to prevention programmes that they (genu-
inely) believe they would benefit from, and not based on the predetermined random
assignment sequence. Interestingly, in both cases, the practitioners are hurting the
very patients or offenders they are trying to help, because such ‘contaminated’ tri-
als could lead to nil differences between the treatment and the control groups (as
patients of both groups are receiving the same intervention). The experimenter (who
is unaware of the contamination) would then conclude that the intervention did not
‘work’ and therefore not recommend its adoption as policy.
3 You can recreate this issue with a coin toss. If you flip a coin 10 times, you might end up with
9 heads and 1 tail, or 3 of one and 7 of the other. Now, imagine the 9:1 allocation represents a
characteristic like ‘being a frequent offender’ – we end up with 9 in treatment and 1 in control –
which we would expect to have quite a large impact on our results.
structured approach is taken in the random allocation of the cases, given the
importance of this criterion.
As we noted, imbalances are more likely to occur with a simple randomisation
procedure in small sample trials (Farrington & Welsh, 2005; Lachin et al., 1988).
Therefore, when there are ‘several hundred participants’ or fewer, restricted randomi-
sation techniques should be preferred over simple random allocation (Friedman
et al., 1985, p. 75; R.B. Santos & Santos, 2016). The question of sample size depends
on the context, the units of randomisation, and the putative effect size (as shown in
a series of studies by David Weisburd and his colleagues; see C.E. Gill & Weisburd,
2013; Hinkle et al., 2013; Weisburd, 2000; Weisburd & Gill, 2014).
Six varieties of oats are to be compared with reference to their yields, and 30 experi-
mental plots are available for experimentation. However, evidence is on file that indi-
cates a fertility trend running from north to south, the northernmost plots of ground
being most fertile. Thus, it seems reasonable to group the plots into five blocks of
six plots each so that one block contains the most fertile plots, the next block con-
tains the next most fertile group of plots and so on down to the fifth (southernmost)
block, which contains the least fertile plots. The six varieties would then be assigned
at random to the plots within each block, a new randomisation being made in each
block. (p. 372)
Thus, as in simple random assignment, where units are distributed at random, without restriction, to either the treatment or the control group (or to more than two groups, as the case may be), under the randomised block design units are still allocated randomly to either the treatment or the control group. However, this allocation is
conducted within pre-identified blocks. The blocking process is established and
based on a certain qualitative criterion, which is intended to separate the sam-
ple, prior to assignment, into subgroups that are more homogeneous than the
group as a whole (Hallstrom & Davis, 1988; Simon, 1979). Blocking is therefore
a design feature that reduces variance of treatment comparisons by capitalising
on the existing blocks (or strata), which are less heterogeneous than the entire
pool of units.
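A minimal sketch of the design described in the quotation might look as follows (the plot and block labels are invented for illustration): 30 plots are grouped into five fertility blocks of six, and the six varieties are randomised afresh within each block.

```python
# A sketch of the randomised block design from the quotation: 30 plots grouped
# into five fertility blocks of six, with the six oat varieties assigned at
# random within each block. Plot and block labels are illustrative.
import random

random.seed(1)

varieties = ["A", "B", "C", "D", "E", "F"]
# Blocks ordered from most fertile (north) to least fertile (south)
blocks = {f"block_{b}": [f"plot_{b}_{p}" for p in range(1, 7)] for b in range(1, 6)}

assignment = {}
for block_name, plots in blocks.items():
    shuffled = varieties[:]          # a fresh randomisation in each block
    random.shuffle(shuffled)
    for plot, variety in zip(plots, shuffled):
        assignment[plot] = variety

for plot, variety in sorted(assignment.items()):
    print(plot, "->", variety)
# Every block contains each variety exactly once, so fertility differences
# between blocks cannot be confounded with differences between varieties.
```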
Box 2.2
Equally important, everything else in the experiment remains constant. This means that
the treatment (its dosage, delivery frequency or method of administration), the instru-
ments used to measure the pre-treatment and the post-treatment observations (e.g. con-
sistent measure by the same observation instrument) and the management of the study in
continuous and uninterrupted ways remains identical throughout the experiment. When
it comes to block random assignment, this constancy remains intact, but within blocks of
data (Matts & Lachin, 1988; Ostle & Malone, 2000; Rosenberger & Lachin, 2002).
Note that blocking can also allow the researcher to oversample a particular sub-
group that is relatively small in the overall population (e.g. uncommon ethnicities,
prognoses or extreme crime values). Blocking can therefore allow for a clear represen-
tation of different levels of variables and then a fair comparison between these dif-
ferent levels. Further note that within the blocks there is usually only one treatment
group and one control group because the test statistic is considered more ‘stable’ this
way (see Gacula, 2005; however, cf. Grommon et al., 2012). There are also designs that cross multiple treatments within the same study, which D.T. Campbell and Stanley (1963) refer to as factorial designs. These designs are not discussed in this book.
Imagine a trial on the effect of a domestic violence counselling treatment for male
offenders on subsequent domestic violence arrests. Such an experiment would benefit
from blocking the sample before random assignment to the study groups, according to
number of prior domestic violence arrests. There may be a clear reason to assume that
persistent batterers will respond differently to counselling (likely to be less susceptible
to treatment). One approach can be to block the sample based on their prior arrest
history: for example, ‘substantial history’, ‘some history’ and ‘no additional history’ of
previous domestic violence4 and then randomly assigning treatment and no-treatment
within each block. We can also further block the data according to a second criterion
(e.g. whether the abuser has already undergone some sort of treatment for domes-
tic violence in the past); however, ordinarily, there will be a single blocking factor
(Ostle & Malone, 2000). With the blocking procedure, the intra-block variance of pre-
intervention domestic violence levels (measured in terms of prior domestic violence
arrests) is expected to be lower than the variance across the entire sample (Canavos &
Koutrouvelis, 2008) – that is, the blocking creates more homogeneous blocks of domes-
tic violence levels and allows the researcher to compare the intervention at these differ-
ent levels. Another way to describe this is to say that the ‘signal’ (i.e. the intervention effect) is not changed, but the ‘noise’ (i.e. the variance)5 is decreased. The researcher is thus able to gain more precision in their estimate of the treatment effect.
4 In this example, the arrest history is a continuous count variable and the three ‘blocks’ are assigned
mutually exclusive qualitative values – much like the fertility rates of the field segments in the
original Fisher (1935) and Hill (1951) illustrations of randomised block designs in agriculture.
5 In statistical terms, noise normally refers to unexplained variation in data samples.
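A minimal sketch of this blocking logic, assuming hypothetical case counts and the three history categories used above, randomises cases 1:1 within each block:

```python
# A sketch of blocked (stratified) 1:1 assignment for the domestic violence
# example: cases are grouped by prior arrest history, then randomised within
# each block. Labels and counts are hypothetical.
import random

random.seed(3)

def assign_within_blocks(cases_by_block):
    """Randomly split each block's cases 1:1 into treatment and control."""
    allocation = {}
    for block, cases in cases_by_block.items():
        shuffled = cases[:]
        random.shuffle(shuffled)
        half = len(shuffled) // 2
        for case in shuffled[:half]:
            allocation[case] = ("treatment", block)
        for case in shuffled[half:]:
            allocation[case] = ("control", block)
    return allocation

cases_by_block = {
    "substantial history": [f"sub_{i}" for i in range(20)],
    "some history": [f"some_{i}" for i in range(40)],
    "no additional history": [f"none_{i}" for i in range(60)],
}
allocation = assign_within_blocks(cases_by_block)
print(sum(1 for arm, _ in allocation.values() if arm == "treatment"), "treatment cases")
# Within each block the prior-arrest profiles of the two arms are, by design,
# far more homogeneous than across the sample as a whole.
```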
all experimental slots are filled. First, treatment effects in field experiments are usu-
ally not ‘large’ (e.g. Weisburd et al., 2016), so the dilemma about curtailing the exper-
iment before the planned randomisation sequence has been fully filled is almost
unimaginable.6 We visit these ethical considerations in Chapter 5.
Second, most field experiments take a minimalist approach: recruit the least number
of units needed for the trial to result in statistically significant differences between the
two groups (should such a difference exist). Large trials are typically more expensive
and more difficult to manage (Weisburd et al., 2003), so researchers carefully plan for
‘just the right size’. Therefore, early indications about large and observable differences
between the treatment and control groups, but which are based on small samples, can
be highly suspect. Extreme results can simply be a natural extreme variation in the
data. However, this issue will naturally resolve when more cases are recruited into the
study. For example, in an experiment on the effect of police body-worn cameras on
assaults against police recruits, several new officers from the same treatment group may
be attacked immediately after being recruited into the study; this scenario will inflate
the mean of the entire group because of the limited time of observation (i.e. assaults
against recruits are infrequent, and they are less likely as the study period is shortened).
By chance alone, an unusual number of units from the same study group exhibited
extreme scores, but this extreme overall difference in scores is likely to ‘flatten out’
when more units are assigned into the two groups over time. This issue, which is largely
a statistical power consideration, is visited in Chapter 3.
The trickle flow can take two major forms: case by case or in sequential batches of
cases. When eligible units enter the study one at a time, then each allocation is done
randomly at the moment the unit becomes eligible and available for recruitment.
For example, as soon as an offender is arrested, they are then randomly allocated
into treatment or no-treatment conditions (as in the case of the CARA experiment
depicted in Case Study 2.1; see also Braucht & Reichardt, 1993; Efron, 1971).
Trickle flow in batches means that the experimenter conducts the random assign-
ment of multiple cohorts that enter the study. An example is a training programme
for new police officers, which is primarily delivered to many candidates simultane-
ously (see Case Study 2.4): as the police department trains new recruits in multiple
batches throughout the year, each incoming batch may be slightly different than the
other batches, so random assignment is conducted within these cohorts. This design
and its analysis are similar to the block random assignment procedure, but the effect
of the temporal sequencing of the batches should be taken into account as well.
Finally, notice that in trickle flow studies there is a special need to create balance
between the treatment and the control groups in terms of the ‘time-heterogeneity’
bias, which does not occur when units are allocated in one ‘go’. This bias is created by
changes that occur in participants’ characteristics and responses between the times
of entry into the trial (Rosenberger & Lachin, 2002, pp. 41–45). Therefore, we need
to have balance not just in terms of the total number at the end of the experiment
but throughout the recruitment process as well (Hill, 1951). This issue was explored
in various statistical notes (Abou-El-Fotouh, 1976; Cochran & Cox, 1957; Lagakos &
Pocock, 1984; Matts & Lachin, 1988; Rosenberger & Lachin, 2002, p. 154). In practical
terms, however, in order to avoid the time-heterogeneity bias, randomisation can
be restricted in a way that requires that every so often (e.g. every 20 units) the split
between treatment and control conditions will be 50%:50%.
6 Although we note that with the use of administrative data it is now possible to run experiments
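One common way to implement this restriction is a ‘permuted block’ sequence, sketched below under the assumption of a block size of 20 (following the example above): every 20 consecutive cases are guaranteed a 10:10 split, while the order within each block remains unpredictable to treatment providers.

```python
# A sketch of trickle flow assignment restricted so that every 20 consecutive
# cases split 50%:50% between arms (a 'permuted block' of size 20). The block
# size follows the example in the text; everything else is illustrative.
import random

random.seed(11)

def permuted_block_sequence(n_cases, block_size=20):
    """Yield a pre-generated allocation list honouring the 50:50 restriction."""
    sequence = []
    while len(sequence) < n_cases:
        block = ["treatment"] * (block_size // 2) + ["control"] * (block_size // 2)
        random.shuffle(block)        # order within each block stays unpredictable
        sequence.extend(block)
    return sequence[:n_cases]

allocations = permuted_block_sequence(100)
# As each eligible case 'trickles' into the study, it takes the next slot:
for case_number, arm in enumerate(allocations[:5], start=1):
    print("case", case_number, "->", arm)
print("balance after 100 cases:", allocations.count("treatment"), "vs",
      allocations.count("control"))
```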
Case Study of a Batch Trickle Flow Design: Diversity Training for Police
Recruits (Platz, 2016)
Seeking to address concerns that stemmed from the public debate about bias, diversity and
policing behaviour, the Australian Queensland Police Department conducted an RCT of a
‘values education programme’ given to new police recruits. A group of 260 new police offic-
ers were assigned to treatment (in which they participated in a 2-week diversity course during
their initial training) or control group (in which they did not complete the course during train-
ing). The recruit cohorts entered the programme in three batches and thus were randomised
congruent with their intake dates. Results from the 132 experimental and 128 control par-
ticipants overall indicated that support for diversity declined over the duration of the initial
training; however, the decline was smaller for those individuals who received the diversity course.
7
On the history of matched paired designs see Welsh et al. (2020).
Matched pairs design
In the matched pairs design, units are first rank ordered on the continuous pre-test variable
of interest, and then every multiple of two units are paired (e.g. units 1 and 2, 3 and
4, 5 and 6 and so on), with each pair forming a ‘block’ (pair A is 1 and 2, pair B is 3
and 4 etc.). Within each pair, units are placed into treatment or control. This process
ensures balance on this single continuous variable. (It is worth noting that the major
drawback of this design is that if one unit from the pair withdraws from the study then
both units are lost to analysis, unless statistical corrections for missing data are used ex
post facto – an approach that controlled experiments aim to avoid.)
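As a rough illustration of the procedure described above – rank the units on a single pre-test variable, pair adjacent units and randomise within each pair – the following Python sketch uses hypothetical hotspot crime counts (the data and identifiers are invented for illustration):

```python
import random

def matched_pairs_assignment(units, key, seed=None):
    """Rank units on a single continuous pre-test variable, pair adjacent
    units (1st with 2nd, 3rd with 4th, ...) and randomly assign one member
    of each pair to treatment and the other to control."""
    rng = random.Random(seed)
    ranked = sorted(units, key=key, reverse=True)
    assignment = {}
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)  # randomise within the pair
        assignment[pair[0]['id']] = 'treatment'
        assignment[pair[1]['id']] = 'control'
    return assignment

# Hypothetical pre-test crime counts for 20 hotspots
hotspots = [{'id': f'HS{i:02d}', 'crimes': random.randint(20, 200)} for i in range(20)]
print(matched_pairs_assignment(hotspots, key=lambda h: h['crimes'], seed=1))
```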
An example from experiments on the effect of hotspots policing illustrates the
utility of this approach (and its perils, too). Ten pairs of hotspots (i.e. 20 hotspots) are
ranked according to the number of crimes that took place in the hotspots pretest, and
then randomly assigned in order to test the effect of saturated police presence in treat-
ment hotspots compared to control hotspots, which receive ordinary policing tactics.
The outcome of interest is the number of crimes reported to the police from the hot-
spots. For each pair of hotspots (which were, again, rank ordered in terms of the ‘heat’
at the dependent variable’s pretest scores), one is randomly allocated to the treatment
group and then visited by an officer several times during the day, for 15 minutes per
visit. After three months, all 20 hotspots are analysed, as shown in Table 2.1.
The benefit of this approach is clear: the noise is dramatically reduced – which can
be seen in the range of scores. Given the range in the number of crimes to which
hotspots in this city are exposed, the sample as a whole is heterogeneous; within
pairs, however, crime levels are much more similar than they are across the overall
group. As shown, this pairwise random assignment is particularly useful where the
aim is to reduce statistical variability in the data. We increase the statistical power
of the test because the intra-block variance – that is, the variance between the two
members of each pair – is lower than under the simple randomisation technique.
Experimenters can therefore benefit from this approach – especially in small n
studies and when there is a clear dependence between the two members of a pair
(twin studies, co-offenders, co-dependent victims etc.). We
return to the issue of statistical power in Chapter 5.
This benefit, however, comes with a cost, which must be taken into consideration
when considering this design. When the two units are paired based on a particular
blocking criterion, they should subsequently be viewed as matching based on that
criterion only, not any other variable. This means that the experimenter will not be
able to claim balance based on any other variable. Take, for example, a recent RCT
on the effect of hotspots policing in Sacramento, California (Telep et al., 2014). The
study was designed to reduce crime in the city during a 90-day RCT, using saturated
police presence at hotspots by spending about 15 minutes in each hotspot and
moving from hotspot to hotspot in an unpredictable order to increase the percep-
tion of the costs of offending in those areas (approximately one to six hotspots in
their patrols, once every two hours). Officers were not given specific instructions
on what to do in each hotspot; they received daily recommendations through their
on-board computers to engage in proactive activities such as proactive stops and
citizen contacts.
A total of 42 hotspots were identified as eligible units (being the ‘hottest’ hotspots in
the area), with half assigned to treatment and half to business-as-usual conditions. In
order to reduce the variability between the 21 treatment hotspots and the 21 control
hotspots, the sites were paired prior to randomisation based upon similarity in levels
of calls for service, crime incidents and similar physical appearance based on the initial
observations. After pairing, a computerised random-number generator assigned hot-
spots to either the treatment or the control group. The pairing ‘worked’: dependent or
paired samples t-tests showed no statistically significant differences between the treat-
ment and the comparison hotspots in calls for service or serious crimes in 2008, 2009
or 2010, suggesting no reason for concern about pre-randomisation baseline differ-
ences between the groups (although see the section below on significance tests for baseline
imbalances after randomisation). The results of the intervention and tightly managed
experiment suggested significant overall declines in both calls for service and crime
incidents in the treatment hotspots relative to the controls.
However, the matching criteria did not create similar hotspots in terms of other
important features – for example, time spent at hotspots, the type of activities that
officers delivered at the hotspots or the frequency of visits that officers made to the
hotspots. In fact, there were stark baseline differences (Mitchell, 2017). While we may
assume that treatment and control hotspots should have received, on average, similar
levels of the attention (measured by the time spent at the hotspots), in reality, this
was not the case. This is important, because if we are now interested in measuring
other outcomes – for example, a link between post-randomisation dosage in terms of
the GPS-recorded time spent at hotspots and differences in outcome variations – then
we are not able to do that, unless we implement statistical controls for the baseline
inequality that the pairing has inherently created in the data. Thus, this was a flaw in
the design: not considering ahead of time the likelihood of analysing other outcome
variables. If these were measurable, then they could and should have been incorpo-
rated into the study design using, for example, minimisation or stratification. These
controls become challenging with a sample size of 42, especially since the whole
premise of RCTs is not to implement statistical controls in order to create equilibrium
at baseline.
Minimisation
Introduced by Pocock and Simon (1975) and Taves (1974), minimisation is a ‘dynamic’
randomisation approach that is suitable when the experimenter wishes to stratify
many variables at once within a trickle flow experiment. In this scenario, the process
of minimisation is as follows. The first case is allocated truly randomly. The second
case is then allocated to both treatment and control, with whichever allocation mini-
mises the differences between the putative treatment and the control groups (hence
minimisation) being favoured. The third case is allocated in the same way and so on.
If at any point the groups are equal on the balance measure – which is a summary of
the standardised differences in means between groups – the next allocation is again
truly random.
To prevent this process from being deterministic (and thus open to cheating), there
is a degree of unpredictability built in. That is, even when it seems obvious that a
case would be allocated to control, it might still be allocated to treatment. (This is
sometimes governed by allocation favouring treatment over control using a ‘biased
coin’; see Taves, 2010.)
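A simplified Python sketch of the minimisation logic just described, using a hypothetical biased coin of 0.8 in favour of the imbalance-minimising arm and a marginal-imbalance measure summed over the balance factors (the factors, probability and variable names are illustrative assumptions, not prescriptions from the text):

```python
import random

def minimisation_allocate(new_case, allocated, factors, p_favoured=0.8, rng=random):
    """Allocate one trickle-flow case so as to minimise covariate imbalance.

    `allocated` is a list of (case, arm) tuples already in the trial;
    `factors` lists the categorical covariates used for balance.
    A biased coin (p_favoured) keeps the allocation unpredictable.
    """
    def imbalance_if(arm):
        # Sum over factors of |treatment count - control count| among cases
        # sharing the new case's level, if `new_case` were added to `arm`.
        total = 0
        for f in factors:
            counts = {'treatment': 0, 'control': 0}
            for case, a in allocated:
                if case[f] == new_case[f]:
                    counts[a] += 1
            counts[arm] += 1
            total += abs(counts['treatment'] - counts['control'])
        return total

    imb_t, imb_c = imbalance_if('treatment'), imbalance_if('control')
    if imb_t == imb_c:                       # groups balanced: allocate truly at random
        return rng.choice(['treatment', 'control'])
    favoured = 'treatment' if imb_t < imb_c else 'control'
    other = 'control' if favoured == 'treatment' else 'treatment'
    return favoured if rng.random() < p_favoured else other

# Example with two illustrative balance factors
allocated = []
for case in [{'sex': 'F', 'prior': 'yes'}, {'sex': 'M', 'prior': 'no'}, {'sex': 'F', 'prior': 'no'}]:
    arm = minimisation_allocate(case, allocated, factors=['sex', 'prior'])
    allocated.append((case, arm))
print(allocated)
```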
Table 2.2 School-level balance in the London Education and Inclusion Project trial

Allocation    FSM %          School Sex        ASB (PCA)         SEN %            School Size              Total
              <37    ≥37     Mixed   Single    <Mean   ≥Mean     <12.5   ≥12.5    Large  Medium  Small
Intensive      8      9       12      5         10      7         9       8        8      6       3        17
Light          9     10       14      5          9     10         9      10        7      7       5        19
Total         17     19       26     10         19     17        18      18       15     13       8        36

Note. FSM = free school meals; ASB = antisocial behaviour; PCA = Principal Components Analysis;
SEN = special educational needs.
Randomisation in practice: the ‘randomiser’
To this point, we have discussed what randomisation is and why and how it works;
however, we have not given much exposure to the issue of practicality (although
the choice of the units of randomisation is very much realpolitik). Tossing coins or
drawing names from hats or sealed envelopes is not common practice. Online solu-
tions like the ‘randomiser’ (Ariel et al., 2012; Linton & Ariel, 2020), an open-source
software that allows treatment providers to conduct random allocation processes
themselves, can provide a solution for most randomisation procedures.8 Given the
substantial costs associated with random assignments using humans, randomisers are
user-friendly, safe and cheap platforms that enable researchers and their partners to
conduct the allocation themselves. The integrity of the random allocation procedure
can be preserved, as the research team maintains full control over the process at the
back end.
Units of randomisation (what to randomise?)
Once we acknowledge that experimenters should pay close attention to the ways in which
they conduct random assignment, the next question is ‘What can they randomise?’
A fundamental aspect of the design of all experimental methods is a:
“clear identification of the experimental unit. By definition, this is the smallest object or
material that can be randomly and independently assigned to a particular treatment or
intervention in the experiment.” (N.R. Parsons et al., 2018, p. 7)
Person-based randomisation
Intuitively, the simplest approach would be to randomly allocate individual police
officers into treatment and control groups. This way, whatever n of eligible front-line
officers the participating department has, the researchers would then allocate 50%
of n to a Taser group and the remaining 50% of n to the no-Taser group. The use of
force by these officers will then be measured during a reasonable follow-up period,
and any variations in the rate of force used by officers in police–public contacts
would then be attributed to the treatment (Tasers), given the random allocation
(Henstock & Ariel, 2017).
However, this research design is not so simple in practice. On paper, indeed the
randomisation of individual officers is ideal because their allocation at random
should cancel out selection biases, differences between the officers themselves or
other factors that may affect the use of force in police–public encounters. However,
one issue that cannot be ignored is the interference, or the crossover, between treat-
ment and control officers. While in many police departments the officers work in
solo formations (i.e. a single-officer patrol unit), many departments deploy officers
in double-officer formations (Wain & Ariel, 2014). Therefore, individual officers
cannot be the unit of analysis in these departments when the basic unit of patrol is a
double-crew patrol. If an officer in the treatment group is paired with an officer in the
non-treatment group (as they normally would be), there will be a contamination effect.
When this patrol unit attends a call for service or conducts a stop and frisk, it is as if
both officers are operating under the treatment conditions. Thus, the unit in a simple
random assignment procedure would have to be the patrolling unit, rather than the
individual officer. Otherwise, there could be a scenario where one officer was randomly assigned
into treatment conditions (Taser), while their partner was randomly assigned into
control conditions (no-Taser).
This contamination is detrimental to the study. We hold the view that any experi-
ment that ignores this issue is doomed to failure – and it is difficult to correct for the
spillover once the experiment has been completed. In practical terms, under these
conditions, the scholar has lost control over the administration of the treatment;
they cannot reliably tell the difference between treatment and control conditions.
To complicate this further, consider additional risks to the independence of the
treatment and control conditions in both models (single- or double-crew forma-
tions). Operational needs within emergency response police systems often require
ad hoc, triple crewing or even larger formations, particularly when responding to
complicated crimes. This means that officers in the control group could have been
‘contaminated’ by responding to calls together with any number of members of the
treatment group – and vice versa (Ariel, Sutherland, & Sherman, 2019). As the treat-
ment is hypothesised to affect interactions with members of the public, ‘control offic-
ers’ would have had their behaviours altered in response to the presence of their
colleagues’ Tasers. At the very least, suspects and victims would behave differently
when Tasers were present (Ariel, Lawes, et al., 2019), even if only some officers were
equipped with them. The medical analogy is a clinical trial where both the treatment
and the control patients are sharing the same pill, even though it was assigned to
treatment patients only.
Finally, another reason why using the individual officer as the unit of analysis
seems illogical is that it dismisses group dynamics and organisational factors that
are very difficult to control for (Forsyth, 2018). These may include the character of
the officer or the sergeant managing the shift, the degree of the officers’ cynicism,
comradery, codes of silence and a host of institutional undercurrents that are recog-
nised in the literature (Ariel, 2016; Skolnick, 2002; Tankebe & Ariel, 2016). There
are underlying forces and cultural codes of behaviour that characterise entire shifts,
and as most of these factors are not recorded, they therefore cannot be included
in statistical models that aim to control for their confounding effects (see Maskaly
et al., 2017). Thus, in many studies, using the officer as the unit of randomisation
highlights a misunderstanding of the ways in which police officers are deployed in
the field. Police are typically deployed in formations that, a priori, create a weak
research design for individually randomised trials, leading to poor intervention
fidelity, crossovers and spillovers (see more broadly Bellg et al., 2004; Shadish et al.,
2002, pp. 450–452).
Place-based randomisation
Another unit of analysis that can be utilised in a study of the effect of Tasers is loca-
tion. Criminological research has shown that crime is heavily concentrated in dis-
crete areas called hotspots (Braga & Weisburd, 2010; Sherman et al., 1989; Sherman
et al., 1995; Weisburd, 2015). For example, Sherman et al. (1989) are credited with
being the first to systematically explore these concentrations, identifying that half of
all calls for service come from less than 3.5% of addresses in a given city. Weisburd et al.
(2004) subsequently demonstrated that 50% of the crime recorded in Seattle, USA,
occurs at 4.5% of the city’s street segments (see also Weisburd et al., 2012). Hotspots
have also been shown to be relatively stable over time and location (Weisburd et al.,
2012). Notably, Weisburd et al. (2004) also found that Seattle street segments that
recorded the highest amount of criminal activity at the beginning of the authors’
longitudinal study were similarly ranked at the end of it. Such micro-places remain
stable because they provide opportunities for criminal activity that other areas may
lack (Brantingham & Brantingham, 1999). Thus, a large body of research on hotspots
clearly demonstrates that crime and disorder tend to concentrate in very small, pre-
dictable and stable places. Such crimes include violence, robbery and shootings (see
Braga et al., 2008; Rosenfeld et al., 2014; Sherman et al., 2014).
Given this line of research, a focus on hotspots ‘provides a more stable target for
police activities, has a stronger evidence base and raises fewer ethical and legal problems’
(Weisburd, 2008, p. 2). Evidence shows that when experimenters nominate hotspots as
the unit of analysis and then apply directed police interventions, this can result in effec-
tive crime reductions (Braga, Weisburd, et al., 2019). For example, in the Philadelphia
Foot Patrol Experiment (Ratcliffe et al., 2011), violent crime fell by 23% in the treat-
ment area after three months of dedicated patrolling relative to the comparison areas.
An RCT of directed police presence at the ‘hottest’ 115 platforms within the London
Underground was also shown to cause a significant overall reduction in crime and calls
for service (Ariel, Sherman, et al., 2020; Braga, Turchan, et al., 2019).
In terms of research methods, a study on the effect of Tasers can start by con-
centrating on violent hotspots, then subsequently allocating officers equipped with
Tasers to half of the identified locations. It is important that the hotspots are allo-
cated accordingly (as discussed above), which means that officers with Tasers will go
to treatment hotspots and officers without Tasers will go to control hotspots. We can
then measure the utility of the deployment of officers with Tasers to reduce the use of
force in the hotspots, compared to hotspots without this intervention.
Temporal-based randomisation
As a final alternative, we can use time as a means to group together units and assign
them to treatment and non-treatment groups. One specific example is the use of
police shifts (e.g. 08.00–17.00 shift) as the unit of randomisation and analysis (see
Ariel et al., 2015; Ariel & Farrar, 2012). One clear benefit is sample size: even in a
department with, say, 100 front-line officers, there are thousands of shifts every year.
When designing an experiment on the effect of Tasers in policing, we can capitalise
on this factor. We can randomly divide all shifts into treatment shifts and control
shifts: during treatment shifts, all officers equipped with Tasers will conduct their
regular patrols, and during control shifts, all officers will not be equipped with Tasers
but will still conduct their regular patrols. At the end of the trial, we can compare and
contrast the rates of use of force in the treatment versus control shifts.
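To make the temporal unit concrete, here is a brief sketch (with an invented roster of day and night shifts) of randomly splitting a year of shifts into Taser and no-Taser conditions:

```python
import random
from datetime import date, timedelta

# Hypothetical roster: one day shift and one night shift per day for a year
start = date(2021, 1, 1)
shifts = [(start + timedelta(days=d), label)
          for d in range(365) for label in ('day', 'night')]

rng = random.Random(7)
rng.shuffle(shifts)
half = len(shifts) // 2
treatment_shifts = set(shifts[:half])   # all officers carry Tasers on these shifts
control_shifts = set(shifts[half:])     # no Tasers are carried on these shifts

print(len(treatment_shifts), len(control_shifts))  # 365 and 365
```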
Randomly assigning shifts as the unit of analysis is not ideal, given the potential
contamination effect on other units, but it may be the most suitable unit for the
policing context (and has applications to other shift-based work cultures, particularly
emergency services). The issue with contamination when using shifts is as follows:
the same officers experience both treatment and control shifts, so there is the likeli-
hood that behavioural modifications due to treatment conditions can be ‘carried
over’ into control conditions. If Tasers affect behaviour, then there may be a learn-
ing mechanism at play, where officers adapt their overall behaviour (and possibly
attitudes), and this broader change affects them during control conditions as well.
We might speculate that as officers begin to habituate their modified behaviour, they
adopt the ‘Taser version’ of themselves whenever on patrol.
In such an experiment, officers participating on multiple occasions in both treat-
ment and control conditions potentially violate the independence of the treatment
and control groups (Rubin, 1990a) and the requirement that observations be inde-
pendent (see Ariel et al., 2015). However, the unit of analysis in this randomisation
procedure is the shift, not the officer. The set of conditions encountered during each
shift cannot be repeated because time moves only in one direction. The manipula-
tion will be whether the shift involves Tasers or no Tasers. The use-of-force outcome
is then driven by officers with Tasers during each shift, versus shifts without Tasers.
Likewise, because the shift was randomised, and officers experienced multiple shifts
with and without Tasers, we know that, on average, all else was equal, including which
officer was involved.
Similarly, being able to define units, treatments and outcomes in this detailed way
means that we can be surgical about where violations to the experimental protocol
are occurring. More importantly, however, spillover effects often result from experi-
ments and, indeed, may be the intention (Angelucci & Di Maro, 2016). In our tests on
body-worn cameras, for example (see below for more details), officers were exposed
to both treatment and control conditions: the spillover meant that officers in control
conditions were affected by their counterpart treatment conditions and altered their
behaviour regardless of treatment condition.
Put another way, the exposure of officers to both treatment and control conditions
is likely to affect the estimation of treatment effects, we think, asymmetrically. That
is to say, officers in control shifts are likely to change their behaviour as a result of
exposure to Tasers during treatment shifts. In the absence of detailed evidence, the
working hypothesis is that during control shifts officers would change their behav-
iour to mimic that of treatment shifts. The spillover would therefore act to shrink the
gap between treatment and control conditions by making control shifts more like
treatment shifts. If true, this means that the estimated effect represents a lower-bound
estimate of the intervention effect, rather than an inflated one. In other words, this
so-called flaw acts like a more conservative statistical test: it makes our job of showing
a significant difference from the control conditions harder, not easier.
The protocol for analysing experiments requires the experimenter to decide ahead
of time the exact formulation of the units of analysis. The axiom in different experi-
mental disciplines, particularly those with more elaborate designs compared to social
science experiments, is that ‘you analyse them as you have randomised them’. This
general rule, set some time ago by R.A. Fisher (cited in Boruch, 1997, p. 195), is fun-
damental: a trial, which is primarily concerned with the causal link between the
intervention and the outcome(s), must be grounded in a pre-specified proposition
about the units under investigation.9 N.R. Parsons et al. (2018) reiterated this rule by
saying that ‘the experimental unit is usually the unit of statistical analysis’, illustrat-
ing that a deviation from the general rule can be problematic.10 Pashley et al. (2020)
summarise this issue with the following remarks:
9
See more in Kaiser (2012, p. 3867).
10
On the other hand, subgroup analyses may be a different case. As Rothwell (2005) summarised,
‘Subgroup analyses are important if there are potentially large differences between groups in the
risk of a poor outcome with or without treatment, if there is potential heterogeneity of treatment
effect in relation to pathophysiology, if there are practical questions about when to treat, or if
there are doubts about benefit in specific groups, such as elderly people, which are leading to
potentially inappropriate under treatment. Analyses must be predefined, carefully justified, and
limited to a few clinically important questions, and post hoc observations should be treated
with scepticism irrespective of their statistical significance. If important subgroup effects are
anticipated, trials should either be powered to detect them reliably or pooled analyses of several
trials should be undertaken’ (p. 176).
It is a long-standing idea in statistics that the design of an experiment should inform its
analysis. Fisher placed the physical act of randomisation at the center of his inferential
theory, enshrining it as ‘the reasoned basis’ for inference (Fisher, 1935). Building on
these insights, Kempthorne (1955) proposed a randomisation theory of inference from
experiments, in which inference follows from the precise randomisation mechanism
used in the design. This approach has gained popularity in the causal inference litera-
ture because it relies on very few assumptions (Imbens & Rubin, 2015; Splawa-Neyman
et al., 1990). (p. 2)
11
In practice, many suggest that cluster randomised controlled trials do not follow this rule, with
one unit of randomisation (e.g. the classroom) and a different unit of analysis (e.g. students’ scores).
Similarly, multiple-informant studies (for example, parent–child–teacher reporting) or ripple-effect
experiments may also seem to deviate from the ‘analyse them as you randomise them’ idiom.
However, we must be careful not to confuse data analysis with hypothesis testing: with
clustered randomisation, we are observing two worlds – one in which the intervention exists and
one in which it does not – and any unit of analysis available to measure must be connected to the
treatment/no-treatment conditions.
Should we use significance tests for baseline imbalances after randomisation?
The short answer to the question posed in this title is ‘no’. The primary reasons
why we do not have to conduct statistical tests for the pre-experimental balance are
as follows.
First, the statistical tests that are used to analyse the results of the experiment
already consider the possibility of non-equivalent conditions occurring by chance
(Senn, 1994; see also Mutz et al., 2019). The post-test statistical significance testing
allows for the possibility that pre-test scores of some covariates may fluctuate enough
to produce significant differences between the experimental arms.
Second, with a sufficient number of statistical tests, one or more comparisons is
highly likely to emerge as statistically significant by chance alone. With p < 0.05, we
expect roughly one in every 20 independent comparisons to yield a statistically
significant difference purely by chance. Therefore, we know that tests may return
significant differences no matter what the true population covariate means are (see
also Bolzern et al., 2019, in the context of cluster RCTs).
Third, experimenters are cognisant that these baseline differences are inevita-
ble, especially when the randomised groups are small (Chu et al., 2012). In the
long run, however, these differences will average out. With repeated tests on the
same phenomenon, the baseline means of the covariates in the two groups will
overlap. Over multiple tests in which participants are randomly allocated into
treatment and control conditions, random allocation will produce groups with
similar pretest levels. This property of repeated random assignments from the same
population extends to any particular variable, or group of variables, and the ways in
which they interact. Thus, if we trust random
allocation, which is underpinned by probability theory, then whatever baseline
differences may occur due to chance will cancel out over time, so they can largely
be ignored.12
As we discuss in the box below, the consequence of these assumptions is that there
is no need to prove that some factors are significantly more pronounced in one group
over the other, because we already assume the probability that these differences will
emerge. As a result, we do not need to control for any covariates – a position that some
statisticians have convincingly argued on several grounds (Altman & Doré, 1990;
De Boer et al., 2015). First, significance testing and statistical modelling as a guide to
whether baseline imbalances exist can be viewed as inappropriate (Senn, 1989, 1994,
1995). As Altman (1985) pointed out, such tests assess whether the observed differences
could have arisen by chance – yet, because allocation was random, we already know that
any observed differences are due to chance.
Furthermore, baseline equality needs to be a derivative of logical procedures, not
of a parametric formula (Peduzzi et al., 2002). This means that baseline charac-
teristics should be compared using clinical and ‘common sense knowledge’ about
which variables are important to the primary outcome, and only then adjusting for
their known effect through design, if necessary. If there are important covariates,
they should be included in the process of random allocation before the experiment
commences – a topic we cover below. Assessing baseline imbalance should be
descriptive and transparent, but there is no need to ‘add’ equality between two
groups postrandomisation – a view that was in fact embraced in the Cochrane
Collaboration Guidelines and Consolidated Standards of Reporting Trials, which we
12
We urge the reader to read the superb review and critique of Fisher’s view that randomisation
‘relieve[s] the experimenter from the anxiety of considering and estimating the magnitude of the
innumerable causes by which the data may be disturbed’ by Saint-Mont (2015).
discuss in Chapter 5. These tests are not only unnecessary but can also be counter-
productive. Balance testing is inappropriate and, as remarked by Mutz and Pemantle
(2012), conducting balance tests
equivalence’, we might conclude that there is a difference with a much greater prob-
ability than the 5% we think we are using (with 20 variables, it would in fact be a 64%
chance of falsely rejecting the null hypothesis).
This is often described as the problem of running multiple outcomes testing (see
discussion in Langley et al., 2020), and an illustration helps to understand the issue.
With M independent tests and an acceptable error rate of 5% (α), the formula for
calculating the probability of falsely rejecting the null (no difference) hypothesis is
1 − (1 − α)^M. So with M = 1, the result is 1 − (1 − 0.05)^1 = 0.05 (5%) as expected, with
a 5% significance (α) level. Nevertheless, with M = 2, the chance of falsely rejecting
the null in one of the tests increases to 10%, with M = 3, it is 14%, with M = 4, it is
19% and with M = 5, it is 23%. This means that with five covariates tested, there is a
23% chance of falsely rejecting at least one null hypothesis and incorrectly concluding
that there is a difference where there is none.
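The arithmetic above is easy to reproduce; a short Python check of the family-wise error rate 1 − (1 − α)^M:

```python
alpha = 0.05
for m in (1, 2, 3, 4, 5, 20):
    fwer = 1 - (1 - alpha) ** m  # probability of at least one false rejection across m tests
    print(f"M = {m:2d}: {fwer:.0%}")
# M =  1: 5%   M = 2: 10%   M = 3: 14%   M = 4: 19%   M = 5: 23%   M = 20: 64%
```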
Instead of significance testing, we advocate comparing the means and distributions
of baseline variables graphically and appraising any differences in terms of whether
they may have a meaningful influence on outcomes. Imagine a situation where we
could only use simple randomisation and not assess balance until the end of the
trial. We then find that the two groups differed in their means by 3% at the pre-
intervention stage. When we look at their post-intervention outcome, we see a differ-
ence of 10%. In that situation, we might conclude that the ‘gain’ of 7% is because of
the intervention and that the pre-intervention difference is not sufficient to explain
the outcome observed. (In reality, we would include the pre-intervention data in our
analysis, so we have the ‘baseline controlled’ model.)
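One way to operationalise this descriptive approach is to report raw and standardised mean differences for each baseline covariate, without p-values. A hedged sketch with invented data, using pandas and NumPy:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical baseline data for a small two-arm trial
df = pd.DataFrame({
    'arm': rng.choice(['treatment', 'control'], size=120),
    'age': rng.normal(30, 8, 120),
    'prior_arrests': rng.poisson(2, 120),
})

# Descriptive comparison of baseline means and spreads by arm
summary = df.groupby('arm')[['age', 'prior_arrests']].agg(['mean', 'std'])
print(summary.round(2))

# Standardised mean difference (no significance test): |m1 - m2| / pooled SD
for col in ('age', 'prior_arrests'):
    t = df[df.arm == 'treatment'][col]
    c = df[df.arm == 'control'][col]
    pooled_sd = np.sqrt((t.var() + c.var()) / 2)
    print(col, round(abs(t.mean() - c.mean()) / pooled_sd, 2))
```

A common rule of thumb in the trials literature treats standardised differences below roughly 0.1 as negligible, although what counts as a meaningful imbalance should still be judged against its likely influence on the outcome.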
Box 2.3
13
The question of sample size is linked to ‘statistical power’, an issue with which social scientists
like David Weisburd are particularly concerned (Weisburd & Gill, 2014). We return to this more
formally in Chapter 4.
end up in baseline inequality in terms of group sizes, it is very unusual and highly
unlikely to get anything other than a result close to 50:50 split when the sample
size is this large.
(2) Pre-randomisation equilibrium: Covariates are the important characteristics
that define the participants – for example, age, gender, ethnicity or criminal back-
ground. When covariates are found to be different between the experimental arms
before the administration of the intervention, it means that the distribution of
pre-treatment variables can influence the outcome (Berger, 2004), because base-
line imbalances create groups that are not comparable. Any apparent effect of the
intervention may be spurious and lead to biased parameter estimations (Altman,
1996). Consider a hypothetical trial with two groups, where some covariate is present
in 15% of the overall sample (e.g. Kernan et al., 1999, p. 20). The chance
that the two groups will differ by more than 10% in the proportion of participants
with that covariate is 33% for a trial of 30 patients, 24% for a trial of 50
patients, 10% for a trial of 100 patients, 3% for a trial of 200 patients and 0.3% for
a trial of 400 patients. Thus, it is commonly understood that as the sample size
increases, the effect of baseline covariates should become negligible. The larger the
study, the lower the likelihood that the covariates will be distributed unequally due
to chance (i.e. random assignment).
As another example, assume that a variable of interest occurs in 20% of the sample
(e.g. 20% of the patients will have an allergic reaction to the treatment, or 20% of the
offenders will drop out of the study once they are offered the treatment). In a sample
of 200, this variable will exist in 40 individuals. If the split between treatment and
control due to random assignment was a perfect 50:50, then you would expect 20
participants in the treatment group and 20 participants in the control group to expe-
rience that variable. Nevertheless, if those 40 individuals happen to split 60:40
between the arms (even if this happened in ‘just 5%’ of the times you ran this
experiment), you may end up with 24 participants in one group versus 16 in the
other who have experienced this variable – a relative difference of 50%.
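A rough simulation of this kind of chance imbalance – assuming, for illustration, that each of the 40 carriers of the covariate is independently assigned to either arm with probability 0.5 – suggests that a 24 versus 16 split (or worse) is not at all rare:

```python
import random

rng = random.Random(11)
trials = 20000
extreme = 0
for _ in range(trials):
    # 40 hypothetical carriers of the covariate, each assigned 50:50 at random
    in_treatment = sum(rng.random() < 0.5 for _ in range(40))
    if max(in_treatment, 40 - in_treatment) >= 24:  # a 24 vs 16 split or worse
        extreme += 1
print(round(extreme / trials, 2))  # roughly a quarter of hypothetical trials
```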
The problem is that even if the assumption that larger samples lead to baseline bal-
ance were true, many field experiments are not large enough. Studies of more than a
few hundred participants randomly assigned into treatment and control conditions
are not common in criminology. Field experiments in the criminal justice system
are usually much smaller than this, with relatively weak effects. In Farrington and
Welsh’s (2005) systematic review of ‘high-quality’ RCTs in criminology, for exam-
ple, nearly one-third (29%) of the reported trials had sample sizes below 200. In
Peter Neyroud’s (2017) review of experiments in policing, just 12 of the 122 policing
experiments identified reached a sample size of 1000. A longitudinal view of the
growth of some experiments in criminology can be found in Braga et al. (2014), and
while the trajectory of experimental criminology is up, the number of ‘large trials’
remains an exception.
Summarising the benefits of randomisation
Randomisation is beneficial for several reasons. Perhaps most importantly, it breaks the
link between treatment allocation and potential outcomes. With randomisation, we do
not need to be concerned that differences between groups arise from a non-random,
and therefore potentially selective, allocation of units to treatment. In doing so,
randomisation allows us to generate a comparison group that is ‘as similar as possible’
to our intervention group.
We have then tried to convey that, as sample size increases, randomisation is
likely to create equal groups. This is of particular concern when the studied effect is
suspected to be relatively small, and therefore more likely to ‘disappear’ in the noise
created by the covariates and the outliers. Probability theory states that imbalances
between the groups will become negligible as sample size increases – a statement that
refers to a distribution of replicated trials. We have also shown that there is still value
in studies in which baseline inequality was not ‘sorted out’ by the random assignment:
results from experiments compiled across a series of replications (e.g. the numerous
trials of police body-worn cameras) would then ‘cancel out’ any differences that may
have been detected between the groups in any one test of the treatment effect.
What does all of this mean? The good news is that if we are content that our ran-
domisation was successful, then we are able to make our causal inference. That is, if we
have randomised, then we know, by virtue of randomisation, that observed differences
between groups arose because of the randomised group membership, rather than some-
thing else. If we have randomised properly, and in enough numbers, we can compare
the outcomes for treatment and control groups and be sure that the difference between
them is our causal estimate. What remains to be seen is how other features of the experi-
ment fit in with the process of randomisation – a topic we turn to in the next chapter.
Chapter Summary
• In this chapter, we go deeper into the theory behind the ‘magic’ of randomisation –
probability theory. We also discuss some of the different forms that random assignment
can take, including methods for reducing pre-experimental differences using various
techniques.
• First, we review the importance of comparisons and relative effectiveness in causal
research. We then discuss common procedures of randomisation that help researchers
achieve baseline balance between treatment and comparison groups.
• Finally, we discuss units of randomisation that experimenters can use to analyse cause-and-
effect relationships, which is then followed by some more technical considerations such as
sample size and whether we should test for baseline imbalances after randomisation.
Further Reading
Morgan, S. L., & Winship, C. (2015). Counterfactuals and causal inference. Cambridge
University Press.
Various authors have extensively studied the issue of counterfactuals over the years,
applying different approaches to conceptualising and measuring them. Researchers
have developed a variety of methods for applying the counterfactual approach to
observational data analyses, which has been only briefly touched upon in this
chapter. This book offers further reading on these concepts, as well as informa-
tion on the statistical instruments used for causal inference under observational
conditions. Morgan and Winship’s book provides one of the most comprehensive
accounts of these approaches, published as part of a Cambridge University series on
analytical methods for social research.
Chapter Overview
What do we mean by control?  54
Threats to internal validity  54
Threats to external validity  75
Conclusion  88
Further Reading  90
The second fundamental rule of causal research is to maintain a high level of control
in the conduct of the study. This includes all the steps associated with experimental
methods, from the early design stages up to the dissemination of the results. In a
laboratory study, the environment is highly controlled, as per the quote from Cox
and Reid (2000) that opens this book. Of course, the strength of the control is always
relative to the context in which the experiment is conducted, and the purity of the
experimental design is often defied by ‘human error’, such as post-randomisation
decisions that change the course of the experimental blueprint. Participants drop out,
treatment providers often challenge the treatment fidelity by delivering the inter-
vention inconsistently (e.g. police–public interactions do not all look the same) and
random allocations are frequently broken. Real-life constraints are the price we pay
for experiments conducted in the social sciences.
These issues are critical because an error can jeopardise the integrity and the
overall validity of the test; careful planning is therefore necessary, taking into
account potential violations of these rules. There are always gaps between the
experimental protocol and the experiments researchers administer in practice. Still,
we must be mindful not to overly compromise, otherwise certain experiments will be
‘doomed to failure’ due to lack of controls before they even commence.
Experimenters need to have a detailed plan on how to increase their control over
the experiment. In this chapter, we discuss two core issues1: threats to the internal
validity of the test and threats to its external validity.
Threats to internal validity
One of the most important reasons to keep controls at reasonable levels in the experi-
ment is to reduce threats to the internal validity of the test. These threats have
been known to scholars for some time (D.T. Campbell & Stanley, 1963); however,
they are often overlooked. Criminology is no different, and while the fundamentals
of science are violated regularly in many fields, some evaluation studies are imple-
mented despite being based on poor designs that jeopardise internal validity.
Internal validity was well described by the popular social research methods website
Research Methods Knowledge Base (Trochim, 2006):
1
For a detailed and more comprehensive review, see Shadish et al. (2002). See also Bryman (2016).
Internal validity is only relevant in studies that try to establish a causal relationship. It’s
not relevant in most observational or descriptive studies, for instance. But for studies that
assess the effects of social programs or interventions, internal validity is perhaps the pri-
mary consideration. In those contexts, you would like to be able to conclude that your pro-
gram or treatment made a difference – it improved test scores or reduced symptomology.
But there may be lots of reasons, other than your program, why test scores may improve or
symptoms may reduce. The key question in internal validity is whether observed changes
can be attributed to your program or intervention (i.e. the cause) and not to other possible
causes (sometimes described as ‘alternative explanations’ for the outcome).
We prefer this definition of internal validity due to its practical focus – the factors
that jeopardise our conclusion that the intervention is the cause of the differences
between the group that was exposed to the treatment and those that served as con-
trols. Making a claim that a treatment affected the participants in some way is a
powerful statement, and we must place sufficient controls over the execution of the
experiment so that this finding cannot be attributed to some other factor(s).
The principal origin of discussions of validity, then, is the American Psychological
Association’s Technical Recommendations for Psychological Tests and Diagnostic
Techniques (1954), although the term validity was already used in and exempli-
fied by Thurstone’s handbook The Reliability and Validity of Tests: Derivation and
Interpretation of Fundamental Formulae Concerned With Reliability and Validity of Tests
and Illustrative Problems (1931) and R.A. Fisher’s 1935 Design of Experiments, which
explored in more detail the design of the experiment, including validity. As care-
fully reviewed by Heukelom (2009), the Technical Recommendations manual did
not include the concepts internal and external validity.2 It was only later that Donald
Campbell (1957) emphasised the distinction between internal and external validity,
which he then fully developed with Stanley in their premier reader Experimental and
Quasi-Experimental Designs for Research on Teaching (D.T. Campbell & Stanley, 1963):
Validity will be evaluated in terms of two major criteria. First, and as a basic minimum,
is what can be called internal validity: did in fact the experimental stimulus make some
significant difference in this specific instance? The second criterion is that of external
validity, representativeness, or generalizability: to what populations, settings, and vari-
ables can this effect be generalized. (D.T. Campbell, 1957, pp. 296–298)
Thus, internal validity is the ‘basic minimum without which any experiment is
uninterpretable: did in fact the experimental treatments make a difference in this
2
The American Psychological Association’s Technical Recommendations manual (1954)
distinguished four types of validity: (1) content validity, (2) predictive validity, (3) concurrent
validity and (4) construct validity.
specific experimental instance?’ (D.T. Campbell & Stanley, 1963, p. 5). In later
years, D.T. Campbell (1968) referred to this type of validity as ‘local molar causal
validity’, which emphasises that ‘causal’ inference is limited to the context of the
particular ‘treatments, outcomes, times, settings and persons studied’ (Shadish
et al., 2002, p. 54). The word molar, which designates here the body of matter as a
whole, suggests that ‘experiments test treatments that are a complex package con-
sisting of many components, all of which are tested as a whole within the treat-
ment condition’ (Shadish et al., 2002, p. 54). In other words, local molar causal
validity – internal validity, in short – is about whether a complex and multivariate
treatment package is causally linked to a change in the dependent variable(s) – in
particular settings, at specific times and for certain participants (see also Coldwell &
Herbst, 2004, p. 40).
We should emphasise that the treatment is always complex, often like a ‘black
box’, because most interventions in the social sciences are complex and multi-
faceted. For example, when cognitive behavioural therapy (CBT) is found to be
an effective intervention for offenders (e.g. see Barnes et al., 2017), it naturally
includes multiple elements that make the CBT stimulus ‘work’, and it is difficult
to pinpoint a singular element as the most important component. For this reason,
it is vital for experimenters to report precisely and with as many details as possible
what the intervention included, not only to assess its content validity – and its
descriptive validity (C.E. Gill, 2011) – but also to understand what it is about ‘it’
that caused an effect.
Box 3.1
without CCTV – however, the locations must be similar. To illustrate how CBT reduces
recidivism requires a group of offenders who have undergone this type of treatment
and a comparison group of offenders who have not; the two groups of offenders
need to be balanced – that is, the same. If the groups are not the same before the
treatment group was exposed to the intervention (and, again, the comparison group
did not have the intervention), then there can be serious challenges to the causal
estimate – the internal validity of the test. If the CCTV systems are installed in low-
crime areas and the control locations in high-crime areas, then the differences the
researchers have found could easily be attributed to the crime levels, rather than
the CCTV systems. Similarly, studies that have shown how court-mandated domestic
abuser intervention reduces the reporting of repeated abuse to the police must have
similar offenders in both groups – the group that received the intervention and the
group that did not. Otherwise, we cannot be sure that these state-funded interven-
tions reduce domestic violence.
The primary risk to internal validity is that we will conclude that variations in the
dependent variables are attributable to the intervention, when in fact some other fac-
tor may be responsible. For example, if we have pre-test and post-test measures – that
is, one observation of the baseline data before the intervention (e.g. last year’s crime
figures), and one more observation of the outcome data after the intervention (post-
test crime figures) – changes from the pre to the post periods might be attributable
to another factor. Under such conditions, we will not be able to conclude that the
intervention caused the changes.
However, this risk is just one of several. The overall issue is that there are systematic
rather than random variations in the differences between the treatment and the control
conditions, and these differences between the two groups are the cause of the dispar-
ity, not the treatment itself. In attempting to break down this issue in detail, we turn
to the factors most famously discussed and developed by D.T. Campbell (1957; see also
Campbell & Stanley, 1963). These are events that threaten the internal validity of the
test – and primarily challenge the conclusion that the variation from the levels of the
dependent variable before the intervention was administered and its levels afterwards
was due to the intervention, and nothing else. Without accounting for these factors – in
effect ruling them out as plausible limitations – the effect of the experimental stimulus
will be said to be confounded and the inferences made from the experiment mislead-
ing. Some of these are naturally controlled for due to the random allocation of cases
into treatment and control conditions. However, others remain a threat to the internal
validity, even though the study follows an RCT design protocol. In the following pages,
we review these threats and some solutions to remedy these hazards.
Box 3.2
History
These are the ‘life’ events occurring between the first and the second measurements
(i.e. before and after participants are exposed to a stimulus). When history effects are
controlled for, we can be more confident that extraneous events are not responsible
for the differences between the before and after measures. Researchers therefore aim to
use measures that are unaffected by real-life conditions that may cause variations
between the pre and the post measures, as opposed to the intended effect of the treat-
ment under investigation. In field settings, however, external life events do occur and
could therefore cause the participants to behave differently between the pre and the
post observations, independent of the tested intervention.
It is often argued that when the treatment and the control groups of participants
are equally exposed to historical events, then the threat to internal validity – that
is, the causal relation between the treatment and the outcome – is mitigated. If all
participants are affected equally by external factors – such as the weather, events
reported in the news or a new chief executive officer – then all units are exposed to
the same experiences (i.e. a level playing field for both treatment and control groups).
Therefore, as all units in the experiment go through the same historical processes and
stimuli, then from an internal validity perspective we are less concerned.
It would be dangerous to assume, however, that the balance in the history effects
that the two study groups experience is naturally maintained. First, the treatment
and control study participants may be exposed to these external events differentially.
This can be common in policing experiments, for example, in which batches of par-
ticipants enter their assigned group sequentially (e.g. see Alderman, 2020; Ariel &
Langley, 2019). Therefore, since history effects have a temporal component (they
happened at some time but then stopped), the events may have occurred during the
recruitment and assignment of one batch, but not in others.
Second, as we explained in detail in Chapter 2, randomisation does not always
work. By chance alone, an experimenter may be confronted with ‘more’ or ‘less’
exposure to historical events in one group. Furthermore, the differences between
the treatment and the control groups due to one group experiencing more of some
type(s) of exposures than the other are more pronounced in smaller studies, when the
effect of small yet influential outliers exposed to historical events is stronger. Under
these settings, internal validity is at risk, despite the assumption of balanced exposure
due to random assignment. The concern is exacerbated with types of historical events
that are unmeasured or unmeasurable (and therefore cannot be controlled for) and that,
due to chance alone, are more pronounced in one group than in the other.
The response to this realistic view of experiments expounds upon our earlier argu-
ment that the evolution of scientific knowledge takes place through an iterative pro-
cess: brick by brick. No one would advocate a strong case for evidence-informed policy
based on only one experiment. The more replications of a particular research question
that point in the same direction, with a normally distributed set of effect sizes,
the more we trust the results of the original test. In a similar way, if there are historical
events that jeopardise the internal validity of one experiment, we expect these extrane-
ous effects to dissipate when considering the effects in a series of experiments.
Maturation
This term refers to internal processes that the participants go through with the passage
of time, between the first and the second measurements, and that may be responsible for
the before–after variations independently of the studied treatment effect. These may
include growing older, gaining new knowledge, making new peers, experiencing
‘turning point’ events and so on. The changes occur ‘naturally’ without the treatment
effect, and when they occur, we may find it difficult to conclude that the post-treatment
observations are due to the intervention alone.
For example, if we take a group of offenders who are in their 30s, then any inter-
vention to stop their criminal behaviour will have to perform ‘above and beyond’
the natural ‘dying out’ of crime phenomenon that the overwhelming majority
of people experience as they reach their late 30s. The age–crime curve, which has
been heavily studied in criminology, repeatedly shows that as people get older, they
are substantially less likely to exhibit recidivism (Farrington, 1986; Steffensmeier
et al., 1989; Steffensmeier et al., 2020). There are different theories for this phenom-
enon; however, the evidence is robust: the overwhelming majority of criminals do
in fact step away from a life of criminal behaviour, or at the very least substantially
reduce their involvement in crime (Hirschi & Gottfredson, 1983; Piquero & Brezina,
2001; Sampson & Laub, 2003, 2017; Sweeten et al., 2013). This is great news for
society; however, it challenges our ability to evaluate an intervention designed to
assist mature offenders to stop committing crimes. Over time, the participants would
exhibit a reduction in criminal involvement versus the pretreatment measurement
scores anyway, and we would find it difficult to ascribe the change to the treatment,
rather than the natural fade out of crime patterns.
To be sure, maturation is not just an issue for longitudinal studies, or experiments
that have an extensive follow-up period. Short-term experiments can suffer the same
maturation issue – for example, in terms of reduced aptitude or ability over time;
some participants naturally get more tired, agitated, annoyed or just generally under-
perform between the first and the second measures (see review in Guo et al., 2016;
Porter et al., 2004; A. Sutherland et al., 2017).3 For instance, we may provide police
cadets with ‘procedural justice’ training to ensure that they are more trusted by mem-
bers of the community (greater legitimacy). If we conducted a before–after measure-
ment of their knowledge of procedural justice – once before the training and then
again after the training – then we might be measuring changes that the cadets have
gone through as part of their overall personal development as cops. This maturation –
natural change that would occur even in the absence of the procedural justice
training – can threaten the internal validity of our test (Antrobus et al., 2019).
One common way to reduce the maturation threat is to recruit participants simul-
taneously who are broadly similar in age, gender and geospatial parameters. Remember,
the allocation of units into treatment and control groups using randomisation
is meant to dramatically reduce any differences due to chance. Random assign-
ment is hypothesised to create balance between the groups in terms of maturation as well.
However, due to chance, this may not happen, especially in smaller studies. For
this reason – just like we have shown for the effects of historical events – the more
homogeneous the participants are at pre-allocation stage (in terms of age, gender
etc.), the less likely that the treatment and control groups would ‘mature’ differ-
ently from one another.
Testing
Testing threats are the effects of being exposed to the test in the first measure-
ment on the scores of the second measurement. The most intuitive example is
3
The issue of appropriate follow-up periods is discussed often in the literature (e.g. D.T. Campbell
& Stanley, 1963, p. 31; Farrington & Welsh, 2005), but there is not yet an agreement on what
constitutes a valid period to follow up on cases post-allocation.
taking an exam and then taking the same exam again: the repeated exposure to
the test, rather than any manipulation that came into play between the two tests,
causes variations in the subsequent scores. For example, McDaniel et al. (2007)
have shown that quizzing without additional reading improves performance on
the criterial tests relative to material not targeted by quizzes. People are also
affected by the test score itself and can be motivated to change behaviour based
on these scores – so that the second time they are tested, they exhibit a change
due to the first scores. For example, overweight individuals can be affected by
their weight measurement and as a result embark on a new diet regime – with
or without any intervention in place. This suggests that if an intervention is
being tested at the same time as the participant’s own initiative to lose weight,
we would not be able to separate its efficacy from the participant’s own volition
(Shuger et al., 2011).
This issue is particularly pertinent with control participants. If the point of the
pre-test measure is to create a baseline for both treatment and control partici-
pants, and then the measure itself becomes an active manipulation, we are biasing
the study in favour of the control group – thus eliminating the no-treatment sta-
tus of the participants. Suppose a study hypothesises that when police detectives
have a ‘warning chat’ with prolific and serious offenders, the offenders would
reduce their criminal behaviour (Ariel, Englefield, et al., 2019; Denley & Ariel,
2019; Englefield & Ariel, 2017; Frydensberg et al., 2019). Offenders are believed to
be deterred by these warning conversations because they assume that the risk of
apprehension has just been elevated (and no rational actor wishes to be caught).
Now, suppose that the measure of interest is self-control, given the vast literature
on the relationship between low self-control and crime (Gottfredson & Hirschi,
1990). If we measured all of the participants’ self-control levels before the study,
then exposure to the test itself may affect the participants’ criminal behaviour.
However, this is a more concerning issue with the control offenders who have not
been contacted by the police, because the exposure to the pretest measure can
potentially make them think that in fact the police have placed them under greater
scrutiny as well – thus leading them to exercise self-control. Therefore, instead of
having a clean comparison between treatment offenders who have been exposed
to the warning chats and control offenders who had not been exposed to these
warning chats, we are in fact comparing two ‘treated’ groups. While it is still the
case that the participants exposed to the warning chat are in a ‘better’ position
to reduce their criminal behaviour, the comparison is not clean. One could also
argue that even the treatment group has been exposed to a pretest ‘interven-
tion’ by taking a pre-intervention test (e.g. raising their awareness of their inap-
propriate lack of self-control), which therefore masks the true treatment effect.
However, the bias is more pronounced in the no-treatment group. The self-con-
trol test (Duckworth & Kern, 2011) would undermine the no-treatment status of
the control group by serving as a stimulus – which by implication undermines the
no-treatment status of the control group when they take the test again at the
post-treatment stage.
Experimenters do not pay sufficient attention to these testing effects (Ariel,
Sutherland, & Sherman, 2019; Willson & Putnam, 1982). There are some tech-
niques that could be used to minimise the effect of testing – for example, by
increasing the temporal lag between pre-test and post-test measures – however,
these solutions are far from ideal. The most accurate and scientific way of ruling
out the effect of testing is the Solomon four-group design (Solomon, 1949), which
is discussed in Chapter 4; this is a design in which some participant groups are
measured at pretest and others are not, in order to measure the potential effect
of the pretest itself on the post-test. However, the Solomon four-group design is
rare – and for good reason – as it is an elaborate design that requires a relatively large
number of units and tighter control (Dukes et al., 1995; Pretorius & Pfeifer, 2010).
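To make the logic of the Solomon four-group design concrete, the following is a minimal simulation sketch in Python (not drawn from any of the studies cited above); the group sizes, the size of the treatment effect and the assumption that merely sitting the pretest adds a small boost to post-test scores are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 5_000                    # hypothetical participants per group
true_treatment_effect = 2.0  # hypothetical gain from the intervention
testing_effect = 1.0         # hypothetical boost from simply sitting the pretest


def posttest(treated: bool, pretested: bool) -> np.ndarray:
    """Simulate post-test scores for one of the four Solomon groups."""
    scores = rng.normal(50, 10, n)            # baseline ability
    scores += true_treatment_effect * treated
    scores += testing_effect * pretested      # the threat we want to isolate
    return scores


# The four groups: pretested/not pretested crossed with treated/control
g1 = posttest(treated=True,  pretested=True)   # pretest, treatment, post-test
g2 = posttest(treated=False, pretested=True)   # pretest, no treatment, post-test
g3 = posttest(treated=True,  pretested=False)  # treatment, post-test only
g4 = posttest(treated=False, pretested=False)  # post-test only

# Comparing the two unpretested groups recovers the treatment effect free of
# any testing effect; comparing pretested with unpretested controls isolates
# the effect of the pretest itself.
print("treatment effect (no pretest):", round(g3.mean() - g4.mean(), 2))
print("testing effect (controls only):", round(g2.mean() - g4.mean(), 2))
```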
Instrumentation
Instrumentation threats derive from changes in the measurement instrument
between the two measurement points. For example, researchers may use replace-
ments of original observers or different coding techniques for calibration of the
collection tool – so that the second measure becomes invariably different. This
scenario suggests that even in the absence of the intervention, there are varia-
tions in the scores over time, due to using a different instrument, which makes
it difficult to 'see' the treatment effect. Classic examples in criminal justice are (a) variations in perceptions of crime severity over time (Apospori & Alpert, 1993), (b) amendments to the definition of a crime (e.g. moving a particular drug from class A to class B, or the definition of sexual assaults; Von Hofer, 2000), (c) deployment of more experienced students to conduct ride-alongs with the police to measure procedural justice practices (as the students gain practice in the observation procedure; Raudenbush & Sampson, 1999) and so on. In all of
these examples, differences between the pre-test and the post-test measures are
due to the variation in the instrumentation, rather than a result of the tested
intervention – hence causing differences between the pre and the post phases,
independent of the intervention.
It should become immediately clear why instrumentation is a greater threat in
longitudinal experiments where multiple measurements take place, and often over
a long period of time (C.E. Schwartz & Sprangers, 1999). As time passes, there is
greater likelihood that changes will take place in terms of the instrumentation used
in the study. Any study that evaluates an intervention over years is susceptible
to this threat to the internal validity of the test, as necessary amendments to the
instrument need to be made. For example, a study that is based on surveys of vic-
tims of crime must modernise the definition of certain victimisation types, not
least due to legal changes but also in terms of what people are willing to share in
these surveys. The case of fraud is a good example: while today it is clear that this
crime type must be incorporated in victim surveys, since it is the most prevalent
crime category in most metropolitan Western cities, for many years fraud was not
included in victimisation surveys. The Crime Survey of England and Wales is an
example of this: only recently were questions about fraud and particularly cyber-
enabled fraud incorporated (Jansson, 2007). Subsequently, comparing the different
surveys over the years – or between countries – becomes more challenging.
To clarify, instrumentation is different from testing effects because the latter refer
to changes that occur in the participants themselves whereas instrumentation effects
refer to changes in the data collection tool. Testing implies that the participant is
getting better (or worse) through exposure to the instrument, and therefore their
scores in the post-test are different from their scores in the pre-test as a result of the
exposure to the pre-test measure. On the other hand, instrumentation implies that
the change in pre and post scores is associated with the method of measuring the
phenomenon. Instrumentation is therefore different from maturation as well, because
maturation effects are linked to internal changes that the participant undergoes (inde-
pendent of the experiment).
The practical implication of this threat to the internal validity of the test is
to avoid (as much as possible) making changes to the instrument. D.M. Murray
(1998) offers an elegant method of calibrating two or more instruments, if vari-
ations indeed have taken place – however, the need to adapt or to control for
changes in the instrument is usually met with some resistance, because the results
will always be somewhat suspect given this peril. One way to address this critique
is to argue that when the treatment and the control group participants both go
through the same variations in the instruments, then instrumentation effects are
‘cancelled out’ between the two groups. As we noted already, this is generally true.
When the two groups both go through the same biases, then we can say that the
two groups are equal (except for the exposure to the intervention by the treatment
participants), which is one of the main priorities of experiments. Under random assignment, instrumentation effects should emerge at similar types and levels in both groups. Still, we must assume that the effects are equally distributed between the treatment and control groups – and this assumption can be problematic in small experiments, as we argued earlier.
Box 3.3
changes between the first and the second measurements may be attributed to these
‘regressions’ to the trend line, rather than the intervention. This is particularly con-
cerning when we select a group of participants that is extremely higher or lower than
such an average: the post-treatment measures may represent a natural return – that is,
reversion – to the overall trend. In other words, patients, crime locations, offenders,
victims or other units are selected from a population because they are significantly
more affected by a particular problem. But they may regress to the overall mean
anyway. If we see a drop in the post-treatment observation, then it could be that the
group reverted to normal behaviour.
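The dynamic is easy to reproduce with a short simulation. The sketch below is purely illustrative: the number of places, their underlying crime rates and the selection threshold are all hypothetical, and no intervention is applied at any point.

```python
import numpy as np

rng = np.random.default_rng(7)

# 5000 hypothetical places with stable underlying crime rates
true_rate = rng.gamma(shape=2.0, scale=10.0, size=5000)

# Observed yearly counts fluctuate randomly around the stable rate
year1 = rng.poisson(true_rate)
year2 = rng.poisson(true_rate)   # no treatment anywhere

# 'Select' the 100 hottest places on the basis of year-1 counts alone
hot = np.argsort(year1)[-100:]

print("year-1 mean of selected places:", year1[hot].mean())
print("year-2 mean of the same places:", year2[hot].mean())
# The year-2 mean is lower even though nothing was done: the extreme
# year-1 scores were partly luck, and the counts regress towards each
# place's underlying rate.
```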
Several examples come to mind in the context of criminal justice research. Certain
addresses may be chosen for police interventions, as they experience a disproportion-
ate level of crime (Sherman et al., 1989); the top ‘troubled families’ are assigned to
social care treatments because they are disproportionately more likely to have fam-
ily members who are offenders, victims of crime or both (Hayden & Jenkins, 2014);
hospitals apply ‘triaging’ techniques to solve overcrowding problems in emergency
departments to fast track assault patients with less severe symptoms (Oredsson et al.,
2011). The list goes on and suggests that many environments in which units are ran-
domly assigned into treatment and control conditions are chosen based on extreme
scores (low or high) and special needs. When these extreme scores are chosen – and
by implication the participants who exhibit these pre-test scores – there is a risk that
the scores would be less extreme in a retest of the original measure (see D.T. Campbell
& Kenny, 1999). For instance, domestic violence offenders chosen for police inter-
vention due to their potential high harm to their partners will not top the harm list
in the subsequent year (Bland & Ariel, 2015, 2020), and businesses that experienced
a great deal of vandalism will not be re-victimised with the same severity as prior to
police intervention. Collectively, this phenomenon is ubiquitous.4
The regression to the mean phenomenon is particularly relevant for hotspots
policing and experiments on the effectiveness of police presence in these small geo-
spatial locations. We support the argument that the definition must incorporate a
relatively extensive period of time for a place to be considered a hotspot – at least
one year. If the location is not ‘hot’ on a year-to-year basis (or another long period
of time), we should not call it a hotspot. Excluding places that experience a greater level of crime or harm for only a limited time – for example, a few weeks or a couple of months – from this definition does not mean that the police should not intervene in order to 'cool' these places down. However, the intervention does not necessarily have
4 For a historic review, see Stigler (1997), which shows that the concept was already discussed as early as 1869 by Francis Galton, who tried to understand why it was that 'talent or quality once it occurred tended to dissipate rather than grow' (p. 107).
Differential selection
Threats from differential selection are the effects associated with comparing treat-
ment and control groups that are not drawn from the same pool of participants, so
the treatment group is inherently different from the control group. The characteris-
tics of participants in the treatment group are not the same as those in the control
group. The implication of differential selection is therefore that one group of participants is artificially disadvantaged before the stimulus is applied, so the test becomes unfair.
This concern is linked directly to selection bias, and for this reason, RCTs have
an advantage over other causal designs: randomisation directly deals with selection
bias. This does not suggest that experimentalists should not be mindful of the issue
of selection bias; randomisation protocols often break down in field experiments,
and the risk of confounding is therefore present even in RCTs. In principle, at least,
the groups can be assumed to be balanced on known and unknown factors due to
randomisation, with nil differential selection effects, but this assumption is often
challenged (Berger, 2005a, 2005b, 2006).
For quasi-experimental models, however, selection bias remains one of the most
fundamental difficulties. Researchers using statistical instruments to control for dif-
ferential selection go to great lengths to convince the audience that a particu-
lar effect is statistically controlled for. However, statistical controls that attempt to
account for pre-treatment differences between the groups are unable to remove the
entire imbalance – in theory as well as in practice. In a population with an infinite
number of covariates and their interaction effects, statistical controls can be applied
only to a subset of covariates and their interaction effects on which observable data
exist. The statistical control over selection bias will therefore asymptotically approach
complete control over the effect of the covariates, conditional on the availability of
measurable covariate effects. None of this is new: the term e for error is part of the
formulae of virtually any statistical test.
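A small simulation can illustrate why statistical controls only approach, and never reach, complete control. The sketch below is hypothetical: it assumes one observed and one unobserved covariate, a self-selection process that depends on both, and a true treatment effect of -1; none of these values comes from the studies cited in this chapter.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Two confounders drive both programme uptake and the outcome,
# but only one of them is observed by the researcher.
observed = rng.normal(size=n)       # e.g. prior record (measured)
unobserved = rng.normal(size=n)     # e.g. motivation (not measured)

# Self-selected 'quasi-experimental' uptake depends on both confounders
treat_q = ((0.8 * observed + 0.8 * unobserved + rng.normal(size=n)) > 0).astype(float)


def outcome(treat: np.ndarray) -> np.ndarray:
    # True treatment effect of -1, plus both confounders and noise
    return -1.0 * treat + 1.0 * observed + 1.0 * unobserved + rng.normal(size=n)


y_q = outcome(treat_q)

# Regression adjustment for the *observed* covariate only
X = np.column_stack([np.ones(n), treat_q, observed])
beta = np.linalg.lstsq(X, y_q, rcond=None)[0]
print("quasi-experimental estimate (adjusted):", round(beta[1], 2))

# Randomised assignment breaks the link with both confounders
treat_r = rng.integers(0, 2, n).astype(float)
y_r = outcome(treat_r)
print("randomised estimate (unadjusted):      ",
      round(y_r[treat_r == 1].mean() - y_r[treat_r == 0].mean(), 2))
# Only the randomised contrast recovers the true effect of -1;
# controlling for the observed covariate alone leaves residual bias
# from the covariate that was never measured.
```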
Box 3.4
comparison groups, the internal validity of the test is jeopardised. More generally,
attrition refers to an issue where participants do not complete the intervention pro-
gramme. This is commonplace, especially in field settings, when the participants
are potentially disinclined to complete the treatment – drug users, prolific offenders,
unmotivated delinquents and so on. Domestic violence perpetrators are a case in
point (Mills et al., 2013; Mills et al., 2019). Attrition in treatment programmes
for domestic violence offenders has been an enduring problem ranging from 22%
to 99% (Daly & Pelowski, 2000), and, more recently, ranging from 30% to 50%
(Babcock et al., 2004, pp. 1028–1030; Gondolf, 2009a, 2009b; Labriola et al., 2008) –
which is concerning, because non-compliance in domestic violence treatment pro-
grammes remains a strong predictor of reoffending (Heckert & Gondolf, 2005). In
our context, it suggests that there may be systematic differences between the com-
pletion rates of one group and the other, and this alone may affect the outcome of
interest (rather than the intervention itself).
The problem of experimental mortality is exacerbated when it occurs more fre-
quently in one experimental arm than the others. In these cases, attrition interacts with group membership: the likelihood of dropping out depends on the experimental arm to which the unit belongs. Therefore, random assignment can-
not solve this problem completely – that is, the attrition may be causally linked to the
intervention itself. The tested intervention caused participants to drop out for a vari-
ety of reasons – for example, the intervention can be too demanding, its participants
may feel that the intervention lacks efficacy or the treatment deliverers themselves
simply do not ‘believe’ in the intervention and therefore are unmotivated to deliver
it as the experimental protocol dictates. Subsequently, units drop out of one of the
study groups at a higher rate than the other study group(s). Those who remain in the
programme may have unique features that their overall group does not – for example,
more (or less) motivation, more (or less) compliance or more (or less) risk aversion.
One way to deal with attrition issues in RCTs is by using an intention-to-treat
(ITT) approach. This procedure allows for an assessment of the treatment effectiveness,
whether or not it was actually delivered (Gibaldi & Sullivan, 1997). Using ITT, experimenters
simply ‘ignore’ the attrition, or the level of completion of the programme, and
focus on the potential effects of offering the treatment policy instead of the effects
of the treatment-as-delivered. This approach is usually appropriate in field experi-
ments, especially those meant to test treatments that are by nature multifaceted and complex, where part of this heterogeneity is the attrition of participants. We can-
not force completion of treatments according to an experimental protocol, particu-
larly with experiments involving human subjects, nor can we ignore human error in
any human actions. Therefore, an ITT approach provides a solution to the attrition
bias by measuring the outcomes of all participants randomly assigned to the groups, rather than just the outcomes of those who completed the trial (measuring treatment-as-delivered). If we were studying a domestic abuse intervention in an ITT trial, for example, we would consider the outcomes of all participants assigned to the experimental arms. This list would include treatment participants and
control participants who failed to turn up, dropped out before the programme was
completed or were deported.
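The contrast between the two analytical strategies can be sketched as follows. This is a hypothetical illustration, not a reanalysis of any study cited above: the sample size, reoffending risks and dropout process are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000  # hypothetical participants per arm

# Each participant has an underlying reoffending risk; completing the
# programme reduces that risk by a constant amount (all numbers hypothetical).
risk_treat = 0.25 + 0.3 * rng.uniform(size=n)
risk_ctrl = 0.25 + 0.3 * rng.uniform(size=n)
true_effect = -0.15

# Dropout is not random: higher-risk treatment participants drop out more often
completed = rng.uniform(size=n) > risk_treat + 0.2

treat_reoffend = rng.binomial(1, np.where(completed, risk_treat + true_effect, risk_treat))
control_reoffend = rng.binomial(1, risk_ctrl)

# Intention-to-treat: analyse everyone as randomised, dropouts included
itt = treat_reoffend.mean() - control_reoffend.mean()

# Treatment-as-delivered: compare completers only with the full control group
per_protocol = treat_reoffend[completed].mean() - control_reoffend.mean()

print("ITT estimate:          ", round(itt, 3))          # effect of offering the programme
print("per-protocol estimate: ", round(per_protocol, 3)) # flattered by low-risk completers
```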
Treatment spillover
In causal research, we assume that each participating unit is not affected by any
other unit. This means that we assume that any potential outcome that takes place
in the experiment is a result of the treatment, not the influence of the other units.
If we allocate participants into treatment and control conditions, each group must
be independent – otherwise, we may encounter biases in our causal estimates. When
the groups are not independent, we experience interference, and the estimated treatment effect may be either inflated or deflated, meaning that the true impact of the independent variable on the dependent variable is masked to some
degree. This is referred to as a spillover, and while it is generally overlooked in crimi-
nology (Ariel, Sutherland, & Sherman, 2019), it is one of the most concerning issues
in experimental research.
Spillover effects in RCTs contaminate the purity of the experimental design. The
diffusion can take many forms, referring to the ‘bleeding’ from treatment to control
or vice versa, between treatment groups, in particular batches, blocks or clusters, and
even between individual units within the same experimental arm (Baird et al., 2014;
D.T. Campbell et al., 1966; Shadish et al., 2002). For example, when the threat of
spillover denotes an interference from the treatment group into the control group, it
leads to ‘contaminated control conditions’, which challenge the ‘counterfactual con-
trast’ between units that were exposed to the intervention and units that were not.
More explicitly, the spillover problem was discussed in the context of the ‘stable unit
treatment value assumption’, or SUTVA (Rubin, 1980). These are circumstances akin
to having both experimental and control patients taking the pill in a drug trial: when everybody is exposed to the treatment, the intervention is 'interfering' with the
neutrality of the comparison group.
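A minimal sketch of how contaminated controls dilute the counterfactual contrast follows; the sample size, the true effect and the share of 'contaminated' control units are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 1_000                 # hypothetical units per arm
true_effect = -5.0        # hypothetical reduction in the outcome when treated
spillover_rate = 0.4      # share of controls 'contaminated' by treated peers

baseline = rng.normal(100, 15, size=2 * n)
treat = np.r_[np.ones(n), np.zeros(n)]

# Controls who interact with treated units receive part of the treatment
contaminated = (treat == 0) & (rng.uniform(size=2 * n) < spillover_rate)

outcome = baseline + true_effect * treat + 0.5 * true_effect * contaminated

estimate = outcome[treat == 1].mean() - outcome[treat == 0].mean()
print("true effect:     ", true_effect)
print("estimated effect:", round(estimate, 2))
# The contaminated controls drag the control-group mean towards the treated
# mean, so the estimated contrast is smaller than the true effect.
```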
There is another type of SUTVA issue – beyond the interference of one group
affecting the other – and that is between the individual units themselves within the
same group. When the interference is caused within the group, the treatment spills
over from some individuals onto other individuals in the same group. Sobel (2006)
referred to this situation as ‘partial interference’. This is an important issue, because
causal studies and most statistical models assume that there is a single version of
each treatment level applied wholly to each experimental unit, which is referred to
as ‘treatment homogeneity’. One example would be that every offender assigned
to a 12-session domestic violence intervention programme attended all 12 sessions
(Mills et al., 2012). Likewise, in a study on the effect of text message nudges sent to
offenders as reminders to attend their court hearings, it might be assumed that every
individual has received, read and then internalised the messages (Cumberbatch &
Barnes, 2018). Similarly, a trial on the effect of omega-3 supplements on behav-
iour problems in children might posit that the participants adhered fully to the
experimental protocol and took precisely 1 g/day of omega-3 during the days of the
experiment (Raine et al., 2015).
However, when participants from any of these groups interact with one another,
they can increase or reduce the treatment effect.5 The treatment homogeneity
assumption is challenged. For example, if members of a restorative justice exper-
imental arm talk amongst themselves, share their experiences and form a group
opinion about the efficacy of the treatment, then these collective experiences and
cross-fertilisation with ideas, norms and thoughts violate SUTVA. There is a wealth
of ethnographic research that indicates the extent to which group dynamics have
an effect on perceptions, behaviours and norms (e.g. Hare, 1976; Lewin, 1948; Shaw,
1981; Thibaut, 2017). Kruskal (1988) commented that nearly all real-life experi-
ments should assume (partial) interference, and that ignoring this dependence can have
detrimental consequences (see applications in Lájer, 2007; and a review in Kenny et
al., 2006). When researchers expect ‘treatment propagation’ (spillover; Bowers et al.,
2018; Johnson et al., 2017), they should arguably incorporate it into the experimental
design using a group-based model (see Box 3.5 for an illustration).
Box 3.5
5 The same rule also applies to the other arms of the test – that is, if there are more treatment conditions, then we assume that each condition was applied fully and equally across units, and that the counterfactual condition (placebos, no-treatment, business as usual intervention etc.) was maintained fully and equally across non-treatment units as well. As you can imagine, this is not always easy.
police officers. In this study, officers from a sample of 2224 who were randomly assigned to the treatment group were instructed to wear BWCs while on patrol, in comparison to the control group, who were not given the devices. In principle, this design
is powerful enough to detect small differences between groups. However, there
is a catch; the design does not take into account the fact that many – if not most –
police–public encounters that require the use of force are dealt with by at least
two officers on site. In fact, most police patrols in the contemporary USA are in
double formations ('double crewing'). Given these facts, there is a strong degree of 'treatment diffusion' onto 'control officers', who attend calls with the 'treatment officers' and are, by definition, 'contaminated' by being exposed to the manipulation, and who subsequently behave differently than they would if no camera were present. The risk of spillover (another common term for diffusion and contamination) becomes even more pronounced when three or more officers attend the same call. The study's treatment fidelity is therefore at risk, because both arms are exposed to the same intervention.
This spillover explains why such a randomised controlled trial concludes that BWCs
are not effective in reducing the rates of complaints or the use of force. It appears that
the contamination is so pervasive that an ‘intention-to-treat’ analysis – that is, one in
which all units are analysed in the groups to which they were randomised – would result
in no measurable impact (Gravel, 2007). Such a conclusion is unsurprising; after all,
both treatment and control officers were treated.
One research design that can greatly reduce spillover effects is the cluster-randomised trial, in which entire, mutually separate groups are randomly assigned into
treatment and control conditions (see Donner & Klar, 2010). Such designs assume
that contamination will occur and incorporate it in the model. It requires numerous
groups – entire police forces, entire departments, entire schools and so on. The issue
is usually obtaining enough clusters to achieve sufficient statistical power, as power
is largely a function of the number of clusters rather than units within clusters. This
difficulty explains why cluster-randomised trials are not commonly used in the social
sciences, with the notable exception of education. This design could be achieved if a
single department operated over a sufficiently large geographical area with enough
subdivisions or partners so that it was possible to allocate entire stations, precincts
and so on to different conditions. Other options can be used, as forces ‘naturally’ roll
out BWCs that also take into account the clustering.
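A rough way to see why the number of clusters matters so much is the standard design effect for equal cluster sizes, DEFF = 1 + (m - 1) * ICC, which deflates the nominal sample size. The sketch below applies this formula to invented numbers (officers per station and an assumed intracluster correlation of 0.05); it is an illustration, not a power calculation for any study discussed here.

```python
# Back-of-the-envelope sketch of why power in cluster-randomised trials
# depends mainly on the number of clusters. DEFF = 1 + (m - 1) * ICC,
# where m is the cluster size and ICC is the intracluster correlation.

def effective_sample_size(clusters: int, cluster_size: int, icc: float) -> float:
    deff = 1 + (cluster_size - 1) * icc
    return clusters * cluster_size / deff

# 20 stations of 100 officers each vs 100 stations of 20 officers each
print(effective_sample_size(clusters=20, cluster_size=100, icc=0.05))   # ~336
print(effective_sample_size(clusters=100, cluster_size=20, icc=0.05))   # ~1026
# The same 2000 officers yield far more information when spread over many
# clusters, because randomisation (and hence power) operates at the cluster level.
```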
Validity threats need not operate singly. Several can operate simultaneously. If they do,
the net bias depends on the direction and the magnitude of each individual bias plus
whether they combine additively or multiplicatively (interactively). (p. 51)
analysis is the crime hotspot and the test measures police effectiveness (Ariel, Sutherland,
et al., 2016a; Sherman & Weisburd, 1995), but it has yet to catch on in epidemiological and
longitudinal studies of crime (cf. Sherman, Neyroud, et al., 2016). When this distinction
is not applied, it then becomes unclear whether the police’s data represent a genu-
ine variation in crime rates or police actions; they are linked, but they are not the same.
One alternative explanation is that police tactics and recording strategies dictate fluctua-
tions in some crime figures (Ariel, Bland, et al., 2017; Ariel & Partridge, 2017). When the
police target drug or gun offences, these output figures go up (Sherman et al., 1995). For
crime categories that are conditional on police outputs – stop and search, arrests, seizures,
crackdowns and so on – we would expect an increase in these categories when the police
increase their attention, because these outputs have been set as performance targets (Martin et al., 2010).
Box 3.6
On a wider level, however, we can consider external validity in three layers (Shadish
et al., 2002).6 These layers will become important as we review the different aspects
of external validity:
1 Narrow to broad: this is probably the most intuitive and well-known type of
concern for external validity. It deals with the question ‘Can we generalise from
the “persons, settings, treatments and outcomes” of one particular experiment to a
larger population?’ For example, can we generalise the effect in an experiment on
police presence on the London Underground train platforms in 2011–2012 onto all
train stations across in England and Wales (Ariel, Sherman, et al., 2020)? In order to
be able to reach this conclusion, we must ensure that the platforms and the crimes
experienced, the police who have delivered the treatment, the type of treatment
delivered and the types of measures the researchers looked at are similar. Otherwise,
the transition from the narrow to the broad may be questionable.
2 Broad to narrow: here we target a subsection of the experimental sample – for
example, in an attempt to check whether the study findings are relevant to a
single person or a group of individuals, one could ask whether restorative justice
conferences that were found to reduce overall post-traumatic stress symptoms in
victims compared to similar victims who did not go through this intervention
(Angel et al., 2014) would work for one particular crime victim. If the treatment effect were totally ubiquitous (like gravity), then we could see how – and why – the broad finding would generalise: the intervention would be effective in reducing post-traumatic stress symptoms for any one individual victim. However,
in the social sciences, we rarely ever have such strong interventions, which is
why external validity always remains a concern, especially in broad to narrow
generalisations.
6 Shadish et al. (2002) in fact discuss five targets. However, we feel that the remaining two are somewhat confusing and do not discuss them here.
The second important factor is directly related to the first, but on a more
technical level. Indeed, the assumption is that we cannot achieve 100% repro-
ducibility in the social sciences, and we are unlikely to repeat the same results
with the exact same direction, statistical significance level and effect size – and
realistically we should expect some variations from the findings of any original
experiment. However, how can we aim to achieve as much replicability of find-
ings as possible? This is a methodological concern: implementing guidelines
for creating the necessary conditions in which the experimental settings can be reasonably reproduced. For example, any social science experiment conducted
in entirely bespoke settings is unlikely to carry high external validity, because
these settings cannot be repeated outside the scope of the study. Similarly, if
the participants are unique to the point that no other persons outside the study
parameters resemble them, the experiment is likely to suffer from low external
validity as well. Therefore, we need technical guidelines on how to set up experi-
ments that have a reasonable level of external validity; otherwise, the outcomes
will not matter much outside the specific environment of the original study. We
will provide some of these guidelines in Chapter 4. First, though, let us examine
in more detail the major threats to external validity in terms of participants,
places, settings and time.
Participants
The very first threat that experimenters must deal with is the issue of selecting –
intentionally or unintentionally – units that are inherently different from the
population from which these units were sampled (R.M. Martin & Marcuse, 1958;
Rosenthal, 1965). One of the most important examples is the use of volunteers in
field experiments: participants who choose to take part in a study on a charitable
basis may be unrepresentative of the population from which they were recruited.
They may, for example, be altruists, a quality that makes them different from the
population to which they belong. The problem is exacerbated when the unit of
analysis is an entire volunteering organisation: the willingness of an entire police
force, an entire school or a whole treatment facility to participate in an experi-
ment may be an indication of difference from all other organisations (Levitt &
List, 2007b).
Similarly, those who participate for cash incentives may be different from those
who are uninterested. The same types of volunteers may not be available in other
settings – and in fact may not be available at all in real-life settings once the experi-
menters have left the research site. Take, for example, experiments that use online
platforms to recruit participants (e.g. Amazon MTurk, SurveyMonkey): they suffer
from misrepresentation,7 and while they are appealing to researchers, the extent to which the people hired to participate in an online experiment resemble the kinds of populations criminology is interested in remains an unanswered question. In policing, we also have a similar issue when it comes to volunteers. There is nothing methodologically wrong with using volunteers, for-pay participants
or enthusiastic do-gooders – however, the researcher must hedge the conclusions
derived from these specific populations – as opposed to the overall population from
which these volunteers were recruited (Jennings et al., 2015; Ready & Young, 2015).
The concern with external validity in terms of people is not just about the willing-
ness of the participants to take part in an experiment. When the sample under investi-
gation is unique, or when the cohort that takes part in the study has distinct features,
we may not be able to conclude that the findings are transferable to other samples or
cohorts that do not share these features. Can we generalise the findings from an exper-
iment conducted on Danish gang members to non-Nordic gang members (Højlund &
Ariel, 2019)? Are lessons learned about gang injunctions in Merseyside, UK, relevant
to US gang problems (Carr et al., 2017)? To what extent are prisons in Israel similar to
prisons in England and Wales (Hasisi et al., 2016)? Will the same conclusions found on
the mass deployment of Tasers in one small force in London be found in larger forces
outside of London (Ariel, Lawes et al., 2019)? If there are social, cultural or background
differences between the original experiment and those of the target population, then
we risk making errors in the translation of the conclusions to the target population(s).
To emphasise, we are not suggesting that these experiments are not translatable, but
simply that we need evidence to support the hypothesis of generalisability.
Finally, we note that the issue of external validity is more profound when only one
group – for example, the treatment group – is required to participate in the study, but
not participants from other groups. Not only will the groups be unbalanced in their
willingness to participate, a source of concern in terms of internal validity, but we may
also find it difficult to generalise to the overall population if the treatment group par-
ticipants must express their consent to participate after random allocation. Under these
conditions, the treatment group is ‘better off’ than the control group, as those who are
willing to take part in the treatment are often more motivated, engaged and have a bet-
ter prognosis to succeed than the control group, which is made up of such individuals
as well as individuals who are unmotivated, disengaged and in poorer conditions.
7 We question the generalisability of all experiments which use these volunteers or for-pay participants because they may have different qualities than those who do not go on these websites to search for studies in which they can participate. While we see the merit in these studies, especially in terms of costs and convenience, we also see external validity concerns that are difficult to control – particularly about how representative the population of participants is of the overall population (not least by way of access to the internet, language barriers etc.).
where both offender and victim were willing to participate was the case assigned
to the facilitating officer. If either party was unwilling to participate, the case was
not conferenced and, thus, was processed through normal channels like the control
cases. (McCold & Wachtel, 1998, p. 17)
Thus, those who eventually participated in the conferences were qualitatively different from
those who dropped out in the treatment group and from those who were randomly assigned
into control conditions. Dropping out was not an option for participants in the control group.
This means that the final two study groups – the adjudication-only group and the volunteers in the restorative justice group – were inherently different even before the study commenced. The study group participants were inadvertently 'creamed' and, by definition, better off than the population of juvenile delinquents from which the units were drawn (not least in comparison to the
comparison group, which creates systematic variations that lead to self-selection issues) – thus
making any conclusion about the efficacy and cost-effectiveness of the intervention weak.
gang-related violence. The core police tactic was to increase the certainty, swiftness
and severity of punishment in a number of innovative ways, often by directly inter-
acting with offenders and communicating clear incentives for compliance with and
consequences for criminal activity.
However, are the settings that characterise urban US gangs relevant to rural UK
gangs? Is the level of resourcing obtained by an expensive programme like ‘Pulling
Levers’ replicable to Coventry anti-gang units (see discussion in Delaney, 2006)?
Is the level of harm perpetrated through firearms in American cities similar to
the hand-to-hand combat and knife crime that characterises the majority of urban
street gangs in the UK? Is the level of funding for research similar to the amount of
money available for scholars in the UK? The experimental settings in the ‘Pulling
Levers’ projects may not be immediately transferable to other places in the USA or
to other countries – or even to the same locations where these projects were implemented, but in later years.
Time
One of the major difficulties in generalising from a particular experiment is time.
How much can we conclude from a study that was conducted in the 1950s, for exam-
ple, to the ‘temporal settings’ of 2020?
Take the famous ‘Connecticut crackdown on speeding’ experiment (D.T. Campbell,
1968). In 1955, Connecticut experienced a heavy death toll in highway traffic acci-
dents. Speeding was suggested as the major cause of this phenomenon, so Governor
Abraham Ribicoff increased sanctions against speeding by suspending drivers' licences for 30 days, for 60 days or for life, depending on the number of
times they were apprehended. After implementing the new initiative in 1956, the
results were encouraging for the first six months: a 15% reduction in deaths com-
pared to the same period in 1955. While there are fundamental flaws inherent in this
conclusion, let us assume that we can take the conclusion of the original analysis
at face value: that speeding can be causally associated with fatal accidents and that
the enforcement of traffic violation through tickets has a causal effect on speeding.8
However, this study was conducted more than half a century ago. There have been sub-
stantial improvements in road safety systems, safety features of vehicles, medical tech-
nology and information technology. Speed cameras, airbags, seat belts and artificial
8 More recently, Luca (2015) has argued that tickets significantly reduce accidents but that there is limited evidence that tickets lead to fewer fatalities, and a meta-analysis synthesising the evidence concluded that the state of the art of the research is generally weak, suggesting that 'estimates of changes in violations or accidents should be treated as provisional and do not necessarily reflect causal relationships' (Elvik, 2016, p. 202).
intelligence sensors are only a few elements that make the original Connecticut study
somewhat obsolete (Ariel, 2019; Høye, 2010, 2014; Phillips et al., 2011). It is also the
case that the driving culture has shifted since, not least in terms of drunk driving (see
Jacobs, 1989). Therefore, the tremendous leap we have made in road safety casts doubts
on the generalisability of the original study to the present day.
practical terms, if we decided that the experiment will include 100 participants out of
a population of 10,000, then each participant has a 1% chance of being selected into
the study – just like everybody else. Then, by either using a simple RAND function
in Microsoft Excel or any other computer software, we can randomly choose the 100
units and exclude the other 9900 (McCullough & Wilson, 2005).
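For readers who prefer a scripted alternative to the spreadsheet RAND approach described above, the following is a minimal sketch of the same procedure (100 units drawn from 10,000, each with an equal 1% chance of selection); the seed value is arbitrary and only serves to make the draw reproducible.

```python
import random

random.seed(2021)  # arbitrary seed so the draw can be reproduced and audited

population = list(range(1, 10_001))        # 10,000 eligible units
sample = random.sample(population, k=100)  # each unit has a 1% chance of selection
excluded = set(population) - set(sample)   # the remaining 9,900 units

print(len(sample), len(excluded))          # 100 9900
```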
The second and third rules about random sampling are that no unit in the sampling frame has a guaranteed chance of being included in the study (rule 2) or a guaranteed chance of not being included in the study (rule 3). These rules ensure that
we will not have a unique set of characteristics in the sample vis-à-vis the population,
a scenario that reduces the generalisability of the conclusions.
Despite these acknowledged rules (see Lohr, 2019), experiments based on true
probability samples are rare, and the truth is that most police experiments are
based on convenience or purposeful samples. These are sampling techniques in which the experimenter selects units based on their availability, for practical reasons (e.g. treating all hotspots that exceed a certain threshold of crime) or on statistical grounds (e.g. achieving statistical power – see explanation and expansion
in Britt & Weisburd, 2010, as well as a primer by Jacob Cohen, 2013). Any experi-
ment in which all known hotspots in a city above a certain threshold of ‘heat’
are selected to participate in the experiment (see Duckett & Griffiths, 2016), all
domestic offenders in a certain area are placed in either treatment or control con-
ditions (Strang et al., 2017) or all eligible victims and witnesses in a particular force
are chosen to receive text messages to mobile phones inviting them to appear in
court to reduce their non-appearance rates (Cumberbatch & Barnes, 2018), is not a
probability sample. This means that, by definition, we should not expect the units
to represent all places, domestic violence offenders or victims and witnesses who
are scheduled to appear in court. Thus, real-life RCTs are often conducted with a
specific population or problem in mind, so the issue of external validity, in the
statistical sense, is unanswered.
An exception to this is experiments in which there are more eligible participants
in the population from which the units are drawn than available treatments, and
then the researcher can implement probability sampling. For example, a drug treatment facility may have the capacity to treat n participants at any given time, while the geographic region has a larger number of drug addicts who could be assigned to treatment and control conditions. Another example is tax compli-
ance research, where the entire population of taxpayers are potentially eligible to
participate in an experiment; however, the tested intervention can only be applied
on a limited number of taxpayers (e.g. Ariel, 2012). These studies, however, seem
to be rare in criminology given the types of populations with which the police
normally interact.
In the 'Ballad of John Henry', the title character, John Henry, works as a steel driver whose
occupation involves hammering spikes and drill bits into railroad ties to lay new
tracks. John Henry’s occupation is threatened by the invention of the steam drill,
a machine designed to do the same job in less time. The ‘Ballad of John Henry’
describes an evening competition in which Henry competes with the steam drill one
on one and defeats it by laying more track. Henry’s effort to outperform the steam
drill causes a misleading result, however, because although he did in fact win the
competition, his overexertion causes his death the next day (Salkind, 2010).
settings – e.g. a school, a police beat or a hospital) aim to overcome these issues.
Levitt and List (2007a) argued that we could find it difficult to interpret data from
lab experiments, as they are not immediately generalisable to the real world. The
authors contend that laboratory findings fail to generalise to real-life settings. Study
participants behave differently when they are in the sterile conditions of a laboratory,
when the observed behaviour is often hypothetical rather than real, and when the natural confounds that field settings impose on the independent or the dependent variables are absent.
To explicate the external validity issue of lab experiments, take, for example,
those that aim to understand how Taser stun guns affect police officers’ use-of-
force decisions (e.g. Sousa et al., 2010). This is an important area of study, not
least because Tasers remain a contentious tactical option in policing. At the same
time, these studies are rarely true field experiments and should be viewed as lab-
oratory experiments – even though the participants are actual law enforcement
agents – because they ask how officers behave ‘as they would in a natural setting’
(Sousa et al., 2010, p. 42), not how they actually do behave in the field. Instead,
scholars tested training scenarios involving different levels of suspect resistance,
with police trainers performing the roles of suspects. The ‘suspects’ are police offic-
ers and the decision to ‘use’ force does not happen in the stressful settings of the
field. Therefore, the experimental settings do not fully mimic natural settings –
and in fact, they ought not to be judged as such. These lab experiments are crucial
to understanding what might possibly happen in police–public contacts; however,
we are nevertheless unclear as to what extent they reflect what does happen to cops
who are confronted with resisting suspects.
This is not to say that laboratory experiments are not important. As remarked by
D.T. Campbell and Stanley (1963), ‘an ivory tower artificial laboratory science is a
valuable achievement even if unrepresentative, and artificiality may often be essen-
tial to the analytic separation of variables fundamental to the achievements of many
sciences' (p. 18). Thus, there are merits to laboratory experimentation insofar as it helps lay out future hypotheses and construct the necessary dimensions of theoretical development.
Still, we think that Levitt and List’s (2007a) critique remains valid (cf. Kessler &
Vesterlund, 2015), for the reasons we highlight in this section. Stimuli used in lab
experiments do not fully resemble the stimuli of interest in the real world, the lab participants do not resemble the individuals who are ordinarily confronted with these stimuli, and the context within which actors operate does not resemble the context of real-life interest (Gerber & Green, 2011). For these (and other) reasons, field experimentation directly attempts to simulate, as closely as
is possible, the conditions under which a causal process occurs, with the aim to
enhance the external validity of experimental findings (Boruch, Snyder, et al., 2000).
The essential differentiation between the two experimental settings – laboratory versus
field environments – is the extent to which field experiments aim to mimic the
natural surroundings of the participants. For impact evaluations, field research
is optimal, as it provides the strongest case for generalisation (Farrington, 2006).
However, even under these natural conditions of real-life settings, there may
still be perils to the external validity of the test – as we have tried to illustrate
in this section.
With that being said, we must also take into account the laboratorial narra-
tive, which may be suspicious of field tests given the lack of experimental control
with which field conditions are usually characterised. Clinical trialists, psychological experimenters and biologists are often more than happy to sacrifice external validity in favour of internal validity, which in its purest form necessitates clinical settings. To
control for all exogenous factors implies shutting down any interaction between
the intervention and the externalities – an impossible task for field trials but a pos-
sibility for laboratory trials. To single out a treatment effect, in a total way, is an
option that cannot be materialised in real-life settings. Thus, to surgically probe a
hypothesis (Popper, 2005) entails sterile laboratory conditions. The degree to which
these findings are applicable to non-laboratory conditions then becomes a contest-
able question, and it is often impossible to move these field settings into the labora-
tory; but there is no doubt that the most stringent levels of controls can be applied
in closed settings.
are several examples: spending the same number of minutes – researchers are rarely
dedicated to only one study site and there will be situational, professional and sub-
stantive reasons – legitimate and otherwise – why the experimental protocol will be
breached. In this sense, spillover effects – especially in the administration of a single
value and version of the treatment – can be unavoidable. They are especially inescap-
able when the sample size is large and precise administration across all units is more
challenging (Weisburd et al., 1993).
For example, some participants will take up their allocated treatment, such as
therapy or ‘pathway treatment’, as ascribed by the treatment provider, while others
will take part only partially. Likewise, some hotspots may be visited by the police as
assigned by the experimental protocol – for example, 15-minute visits, three times a
day – nonetheless, other hotspots will be patrolled to a lesser degree. In both these
examples, the overall treatment effect may lead to statistically significant differences
between the study arms; however, the effect size may be diluted. This was the case in
several experiments testing the application of technological innovations in policing
(see review in Ariel, 2019), in hotspots policing studies and in batterer intervention programmes, to name a few.
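The dilution of the effect size under partial delivery can be shown with simple arithmetic. The sketch below uses invented numbers (the full-dose effect, the share of units treated per protocol and the benefit retained by under-dosed units) purely to illustrate the attenuation.

```python
# Minimal sketch of treatment dilution under partial delivery
# (all numbers hypothetical). If only a share of treatment units receive
# the full dose, the arm-level contrast shrinks accordingly.

full_dose_effect = -10.0   # e.g. crime-count reduction in fully patrolled hotspots
share_fully_treated = 0.6  # 60% of hotspots received the protocol as assigned
partial_dose_effect = 0.3 * full_dose_effect  # under-patrolled hotspots retain ~30% of the benefit

observed_effect = (share_fully_treated * full_dose_effect
                   + (1 - share_fully_treated) * partial_dose_effect)
print(observed_effect)  # -7.2: still a difference between arms, but a diluted effect size
```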
Conclusion
Our aim in this chapter was to cover the common threats to the internal and external
validity of the test. Insofar as internal validity concerns are raised, it is vital that the
risks to the claim of causal inference are mitigated as much as possible, preferably
through design but also using statistical models, when permissible. When it comes
to external validity, there is no true statistical solution, because the degree to which
one trial’s results are transferable to other settings is ad hoc. The burden of proof of
generalisability is empirical, not logical, and is met through replication.
We paid attention to the distinction between true experiments and quasi-experimental
models. However, we stress that randomisation alone does not solve all internal validity
concerns (Fitzgerald & Cox, 1994), and it provides no solution to external validity con-
cerns. While these threats are therefore more pronounced in quasi-experimental designs,
close attention should be given to these threats in RCTs as well. As we explained, there
are at least two reasons for that. First, not all randomised experiments are created equal.9
There are different types. Some designs, like the pre-test–post-test control group design,
are more rigorous than others, such as the post-only control group design; the former
9 For a comparison of different experimental designs, see, for example, the Maryland Scale (Farrington et al., 2002; Sherman et al., 1998) and other metrics (e.g. Hadorn et al., 1996).
have a greater capacity to eliminate the threats to internal validity. Some designs, like
the Solomon four-group design, are so rigorous that they are almost un-implementa-
ble in field settings (or at least we rarely see them in use in the social sciences). Since
there are different experiments, there are also different levels of each threat, and we
cover these in Chapter 4.
Second, the very assumption of equivalence due to randomisation depends on
a number of postulations – that is, the necessary conditions covered in Chapter 2
needed for the effect of randomisation to take place. These are not easy to assem-
ble. Even in laboratory conditions, sterility often breaks: dirty test tubes, heat-
ing/air conditioning malfunctions or uncooperative undergraduate students who
take part in experiments for university credit. Randomisation alone thus cannot
remove these threats. One clear reason is that the threats do not occur in silos:
they interact amongst each other and create new concerns that, again, random
assignment cannot eliminate. Therefore, threats to internal validity remain a con-
cern in any trial.
Still, we need to think about the threats to internal validity in a contextual way –
that is, in relation to other experimental designs. There are many different types of
experimental designs, and some are better equipped to deal with internal validity
than others. As we discussed in Chapter 2, the random allocation of units into dif-
ferent groups deals directly with issues of internal validity and minimises them the
most in comparison with other experimental designs. Randomisation – if conducted
properly – reduces the likelihood that our conclusion about the causal relationship
between the intervention and the outcome, relative to the control group, is errone-
ous. But within the world of RCTs, there are different designs, and each is a better fit for particular experimental settings under investigation. Another type of threat to the valid-
ity of the test is external validity.
External validity is multifaceted and complicated and stands at the heart of criti-
cism against experimental designs. To what extent can we generalise from a single
experiment to other settings, people, places or times? This is a tough question,
because it can only be illustrated ad hoc, not a priori – unless the sampling frame
from which units were randomly assigned is taken from the same population of
units. Access to pure random samples of this sort is difficult to achieve in the social
sciences. Thus, total external validity is aspirational. There are ways to mitigate this concern
somewhat, like running field experiments in lieu of clinical trials, or having suffi-
ciently large probability samples, but the primary method of ascertaining generali-
sation at different levels is to replicate the test, in diversified settings with different
experimental designs but achieving the same statistical result. The story of different
experimental designs is told in the next chapter.
Chapter Summary
• In this chapter, we take a closer look at the controlled aspect of RCTs, including the
purpose of establishing counterfactuals, the supervision of experimental conditions
and practical solutions to managing trials in the field.
• The chapter provides a rigorous treatment of the theory and practice of establishing
internal and external validity, the potential threats to these and how controls can wholly
or partially remove these threats.
• We show how the need for control can be materialised by applying a set of guidelines
as to what experiments should consider.
Further Reading
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-
experimental designs for generalized causal inference. Houghton Mifflin.
Shadish et al. (2002) are often credited with providing one of the most detailed
explanations of the various threats to the validity of the conclusions of a study.
While this chapter discussed internal and external validity, many other hazards to
validity must be guarded against. For a more comprehensive exposition of these
threats, as well as other validity concerns in experimental and quasi-experimental
designs, this seminal book should be consulted.
Chapter Overview
Different kinds of trials  92
Pre-experimental designs  95
True experimental designs  100
Quasi-experimental designs  108
Conclusion  119
Further Reading  120
In a seminal text, Campbell and Stanley (1963) lay out a taxonomy of exper-
imental design options for the social sciences. There are many such designs, which differ in terms of their ability to overcome threats to internal validity and external validity (Campbell, 1957), as well as in their applicability given the real-life conditions experimenters must face. Campbell and Stanley (1963), Shadish et al. (2002),
Cook and Campbell (1979), as well as most accepted texts on quantitative research
methods in various fields (Bernard, 2017; Cox, 1958; Davies & Francis, 2018;
Dawson, 1997; Fitzgerald & Cox, 1994; Jupp, 2012), categorise experiments into
three types: (1) pre-experimental designs, (2) true experimental designs and
(3) quasi-experimental designs. Within each type, several options for causal research
are possible, depending on the research questions and the availability of experimen-
tal settings. This breakdown is useful, because it sorts experimental designs based on
their ability to control for internal validity threats. Scholars can then broadly agree
on the strength of evidence, based on which policy recommendations can emerge.
Of course, the ability of the researcher to execute a trial that deals with all threats to
its validity defines the quality of any particular piece of evidence. However,
at a conceptual level, ranking studies using this logical framework is helpful – and we
will follow the same tradition.
Figure 4.1 summarises these designs, while this chapter presents these experi-
mental methods more elaborately. We rely greatly on the original taxonomy offered
by Campbell and Stanley (1963) but provide a selective ontology of contemporary
experimental criminology to illustrate the designs. Altogether, we describe 13 experi-
mental designs – three pre-experimental designs, three true experimental designs and
seven quasi-experimental designs.
As we describe in greater detail later in this chapter, experiments can be arranged
on a scale, ranked by the degree of control the researcher has over the administra-
tion of the test. True experiments incorporate the random assignment of participants
into the treatment or control groups of the study, as well as reliable measures before
and after the administration of the intervention. In these prospective designs, the
researcher is actively – and therefore prospectively – involved in the systematic
exposure of treatment participants to the stimulus. Pre-experimental designs are also
prospective, and the researcher has a degree of control over the implementation
of the treatment(s), except that they omit one or more features of the true experi-
mental design. For example, they often do not include a pre-treatment measure
prior to the exposure to the manipulation, or lack a parallel comparison group, or
randomisation in the allocation of the participants into the experimental arms.
On the other hand, in quasi (i.e. resembling) experimental research designs the
researcher is usually not explicitly involved in the assignment of participants into
the study groups, nor do they have control over the exposure of the participants to the
treatment, because the intervention usually had already occurred before the experimenter became
involved in the study. Instead of random allocation, quasi-experimentalists aim to
control for the lack of comparability between the groups using statistical models
and matching techniques.
[Figure 4.1 Pre-experimental, true experimental and quasi-experimental designs (notation: R = randomisation; X = exposure to the treatment; O = observation).]
Of the three research designs, quasi-experimental models are the most prevalent
in the social sciences, followed by pre-experimental designs, and then true experi-
ments. However, our intent is to push experimenters to conduct true RCTs and use
other modalities only when true experimental designs are not feasible, for whatever
reason. The choice of design will depend on practical considerations, the ability of
the researcher to control for issues that may affect the validity of the test and what
options are obtainable, out of the different methodological scenarios the experi-
menter encounters. Our underlying position, however, is that true RCTs are inher-
ently stronger and provide the most valid causal estimates of all research designs.
Any other design is, paradigmatically, a compromise, for the reasons we discussed in
Chapter 2 concerning the benefits of randomisation and the issues associated with
statistical matching.
That being said, we do not prima facie dismiss evidence gathered by way of other
designs. While some evidence can be considered ‘better’ than others (Sherman
et al., 1998), we nevertheless take the view that scientific exploration is founded on
cumulative wisdom, and there are useful takeaways from every study. We would not
endorse a general rule based on the outcomes of a single experiment, even a trial that
is based on the most rigorous experimental conditions – as internal validity is part of
a wider concern regarding the strength of evidence (like external validity). No single
trial should be deemed authoritative (Lanovaz & Rapp, 2016). The accretion of mul-
tiple blocks of evidence shapes our knowledge about causal relationships between
independent variables, dependent variables and factors that shape their interactions
(Schmidt, 1992). We discuss these points as we go through the various experimental
designs in this chapter.
We also present common methodological approaches to analysing the results of
these designs, in terms of statistical tests, but in broad terms only. One frustrating
problem in experimental designs is the inability of scholars, especially technicians,
to agree on ‘best practices’ for analysing the results of a given experiment. It is likely
that a ‘best practice’ proposition is a fallible suggestion; experimenters know that
there are many ways of estimating the causal relations using statistics, and statisti-
cians cannot always agree on what is the most fitting statistical architecture for
each experimental design. As the old British expression goes, ‘there are more ways of
killing a cat than choking it with cream’, and in the case of statistics, this is especially true.
Pre-experimental designs
1 As highlighted by Singal et al. (2014), efficacy is defined as the performance of an intervention
under ideal and controlled experimental conditions, whereas effectiveness refers to its
performance under ‘real-world’ experimental conditions.
the numerous problems associated with this type of pre-experimental design make it
extremely difficult to evaluate the results of the treatment effort. . . . It is impossible
to firmly determine whether ‘observed’ or measured changes are a result of exposure
to an independent variable or the result of ‘uncontrolled’, extraneous variables such as
history, maturation, or regression. (p. 301)
Therefore, the one-group, post-only design remains a common design but should only
be applied with a proviso that the study lacks the necessary precision to guide practice
and inform us about the causal relationship (between the independent and dependent
variables). Our position is therefore that its value both for scientific knowledge build-
ing and for policy implications remains limited.
a baseline, and then measured again after the module has been completed (Israel
et al., 2014). Any variation between the two time points (pre and post) is hypothesised
to be a result of the training module. A gain in scores would be interpreted as a
‘success’, while a reduction from the baseline will be viewed as a ‘backfire effect’.
However, a lot can happen between the two observation points in time, which may
have caused the variation in scores. To illustrate why the pre-test–post-test designs
are indeed ‘poor’, we will consider two threats that such experiments usually fail to
control: history and maturation. Nonetheless, nearly all risks to internal validity are
at elevated levels in this research design, including regression to the mean, testing
and instrumentation.2
Historical events interfere with the causal inferences postulated by the experi-
menter and offer rival explanations for the differences, or the delta, between the
two measures. When evaluating training programmes, for example, negative pub-
licity involving police officers can have a meaningful effect on their confidence
in their authority (Nix & Wolfe, 2017). If exposure to these high-profile cases
occurs between O1 and O2 (i.e. observation at pre- and at post-stages), then we can
no longer be sure whether a change from O1 is a result of the training module or of the
negative publicity surrounding the police. Clearly, the longer the temporal gap between the
two measurements, the more likely it is that other, extraneous factors have affected
the variation, but when considering the complexities of life, there are countless
rival historical hypotheses that can debunk, exacerbate or reduce the treatment
effect, to the point that we cannot rely on the trial results, even at short-term
spans. It is not possible to achieve experimental isolation in field settings, and
these confounding factors affect the conclusions without a reasonable method to
control for them.
Similarly, there may also be maturation effects – the psychological and biologi-
cal changes that participants may go through between O1 and O2 – which, again,
confound the effect of the tested intervention. For example, Crolley et al. (1998)
evaluated an outpatient behaviour therapy programme by examining 16 child
sexual molestation offenders in Atlanta, Georgia. Using various psychological
batteries and recidivism rates measured before and after completing treatment,
the data were deemed to support the intervention. However, numerous system-
atic, naturally occurring internal changes – for example, spontaneous regression,
natural mellowing of participants and so on – may have resulted in the before–
after differences, rather than those differences being a result of the intervention.
2 See a description of these threats to internal validity in Chapter 3.
We stress that these can take place at short-term durations as well. Unfavourable
trainers who affect certain participants, poor settings (too hot or cold) and day-
of-week or hour-of-day effects can all affect the personal engagement, attention
or overall attitude of participants (e.g. Danziger et al., 2011), and they are all pos-
sible within-group variations outside the scope of the causal relationship tested
in the trial.
with internal validity. It is also derived from probability theory, but its aim is
to help assure that the treatment and control groups are equivalent prior to the
treatment.)
Once recruitment has concluded, consent for participation has been collected
(whenever possible, see Vollmann & Winau, 1996; Weisburd, 2003) and the settings
are optimal for assigning participants to the various study groups, the experimenter
then conducts random assignment using one of the allocation protocols discussed in
Chapter 2. Then, the experimenter would measure the dependent variable in each
group at baseline – or O1 for the treated group and O3 for the untreated group (see
design 4 in Figure 4.1). The studied treatment is subsequently applied to the group
that was assigned exposure to the treatment stimulus (CBT, restorative justice confer-
ences or therapy), but not to the other groups. Finally, another measure is taken of
the dependent variable at the post-test observation stage (O2 and O4, respectively) of
both the treated and the untreated participants.
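To make the notation concrete, the short sketch below simulates this classic two-group pre-test–post-test design under simple random assignment; the outcome scale, sample size and treatment effect are illustrative assumptions rather than values from any study cited here.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 200                                  # illustrative sample size
baseline = rng.normal(50, 10, size=n)    # O1/O3: pre-test scores
treated = rng.permutation(n) < n // 2    # simple random assignment

true_effect = 5.0                        # assumed treatment effect
post = baseline + rng.normal(0, 5, size=n) + true_effect * treated  # O2/O4

# Post-test difference between the randomised groups
diff_post = post[treated].mean() - post[~treated].mean()

# Gain-score version: change from pre to post, compared across groups
gain = post - baseline
diff_gain = gain[treated].mean() - gain[~treated].mean()

print(f"Baseline difference (should be near 0): "
      f"{baseline[treated].mean() - baseline[~treated].mean():.2f}")
print(f"Post-test difference:  {diff_post:.2f}")
print(f"Difference in gains:   {diff_gain:.2f}")
```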
Notwithstanding this ambiguity, the most appealing feature of the classic experi-
mental design is that it neatly controls for rival hypotheses. The effects of history are
controlled for, because any external event that may have affected the treatment group
would also have affected the participants in the control group. The variation
from O1 to O2 in one group and then from O3 to O4 in the other group due to history is
equal in the two groups, because both are systematically and simultaneously exposed
to the same factors. For this reason, it should be immediately clear why a pre-test–
post-test without a control group design is unable to control for history, whereas an
experiment with two parallel groups can.
Note, however, that controlling for history also requires a regulation for simul-
taneity. As we suggested earlier, if the experimental group is run before the control
group, or vice versa, history effects remain a concern: different times of the day
or days of the week may cause the groups to be systematically different (Danziger
et al., 2011). The optimal solution for this problem (amongst others discussed
below) is to randomise experimental occasions – that is, therapy sessions, moments
of exposure to the intervention or participation in individual training sessions.
If there are extraneous history effects, they will be equally distributed across the
experimental and control units (Leppink, 2019, pp. 248–249). All those in the same
occasion share the same history, and therefore have sources of similarity other than
the intervention.
In a similar way, maturation, testing and instrumentation effects are also controlled for
using this design (Campbell & Stanley, 1963, p. 14; Shadish et al., 2002, pp. 257–278).
Any psychological and biological factors are distributed equally in both groups due to
randomisation. Importantly, both manifested and latent factors are similarly distrib-
uted in the two groups – both measurable and unmeasurable variables and those that
were measured and that went unmeasured by the research team.3
Similarly, utilising the same fixed measurement instrument in the two groups is
expected to result in comparable overall scores if the treatment condition were not
applied. Had there not been a stimulus, the pre-test and post-test measures should
produce similar results. One exception to this neatness is the use of a relatively
small number of observers, in a way that makes it impossible to randomly assign
them between the experimental sessions. For example, in policing studies, research-
ers often deploy observers in systematic social observations like ride-alongs with
the police to document interactions between officers and citizens (e.g. see Berk
& Sherman, 1985; Hirschel & Hutchison, 1992; Sherman & Berk, 1984, p. 264).
However, these sessions can become very expensive (paying for assistants’ time is a
major cost) and complicated (Sahin, 2014), and usually only a small research team
is available to conduct these observations. Therefore, to reduce any bias associated
with some observers but not others, they ought to be randomly assigned to different
experimental sessions.
3 To clarify, the ‘measurable’ variables refer to information that exists in the data available for the
researcher to analyse, whereas the ‘unmeasurable’ variables refer to possible confounders that
affect the results, but for which no data are available for the researcher to measure and to use in
the analysis.
4 One method for measuring how well raters agree with each other, and the consistency of the
rating system, is the intraclass correlation coefficient. It is a widely used reliability index in test–
retest, intrarater and interrater reliability analyses (Koo & Li, 2016).
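As a worked illustration of this reliability index, the sketch below computes a one-way random-effects ICC, usually denoted ICC(1), from a small matrix of invented ratings (subjects in rows, raters in columns); Koo and Li (2016) describe other ICC forms that may suit different rating designs.

```python
import numpy as np

# Rows = subjects (targets), columns = raters; the ratings are invented.
ratings = np.array([
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
], dtype=float)

n, k = ratings.shape
grand_mean = ratings.mean()
row_means = ratings.mean(axis=1)

# One-way ANOVA decomposition: between-subject and within-subject mean squares
ss_between = k * ((row_means - grand_mean) ** 2).sum()
ss_within = ((ratings - row_means[:, None]) ** 2).sum()
ms_between = ss_between / (n - 1)
ms_within = ss_within / (n * (k - 1))

# One-way random-effects ICC for a single rating, ICC(1)
icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(f"ICC(1) = {icc1:.2f}")
```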
5 As per Chapter 3, the ITT model is an analytical framework in which all units randomly
assigned to the study groups are included in the final outcome analysis as if they completed
the trial as assigned – even if they dropped out, died, switched study groups or were exposed
to a treatment that was not initially assigned to them. ITT is particularly informative when the
tested intervention is a certain policy offered to people, because in real-life settings, people do
tend to drop out, die and switch between policies. Due to the random allocation of units into
the experimental arms, the pretreatment conditions for not complying with the experimental
protocol are distributed randomly as well. Therefore, ITT freely assumes that dropping out is part
of the package, but with an equal probability of dropping out between the study groups. This
balance allows the researcher to ignore these occurrences in the analyses and not treat them as a
covariate or an endogenous factor. In practical terms, if 100 participants were randomly assigned
into two even groups, then the denominator for each group, for any overall outcome of interest,
would be all 50 originally assigned to the group.
However, there are conditions in which the dropout rate can be extreme, and statistical
corrections to the causal inference model should be considered (see e.g., Angrist, 2006). While
these tools should be used very cautiously, if at all (Sherman, 2009, p. 12), they do present a
viable solution for estimating and then correcting for dropout effects.
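The arithmetic of the ITT principle described in this footnote can be written out in a few lines; the outcome counts below are invented purely for illustration.

```python
# Invented illustration of intention-to-treat (ITT) denominators.
# 100 participants randomised evenly; some treated-arm participants drop out
# or never receive the intervention, but they stay in the treated denominator.

assigned_treatment = 50
assigned_control = 50

# Outcome of interest (e.g. reoffending within 12 months), counted by the
# arm participants were *assigned* to, not the treatment they actually got.
reoffended_treatment_arm = 12   # includes drop-outs and non-compliers
reoffended_control_arm = 20

itt_rate_treatment = reoffended_treatment_arm / assigned_treatment
itt_rate_control = reoffended_control_arm / assigned_control

print(f"ITT reoffending rate, treatment arm: {itt_rate_treatment:.0%}")
print(f"ITT reoffending rate, control arm:   {itt_rate_control:.0%}")
print(f"ITT effect (risk difference):        "
      f"{itt_rate_treatment - itt_rate_control:+.0%}")
```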
the no-treatment participants would undermine their no-treatment status. For exam-
ple, control participants may react to the pre-test observation, and this reaction will
cause their pre-test measures to reflect awareness of the test in places where it is not
wanted: a new policy, intervention or status associated with the intervention (Ariel,
Sutherland, & Bland, 2019).
Given that the pre-test measure O1 may interact with the treatment X and the post-test observation O2, and that the
pre-test measure O3 may interact with the post-test observation O4, we can control
for these interactions by establishing how the participants behave without these
pre-test measures (i.e. O5 and O6 only).
Using this incredibly useful but underutilised model, the threats to internal
validity are controlled, with particular emphasis on testing effects – but it also
controls neatly for external validity threats. If across all comparisons the treat-
ment effect remains consistent and pronounced, then the strength of the inference
is increased.
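A hedged sketch of the comparisons this four-group layout affords is shown below, using simulated data in which both a treatment effect and a small testing effect are built in deliberately; all group sizes, means and effect sizes are assumptions made solely for the illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100                             # illustrative group size
effect, testing_bump = 4.0, 1.5     # assumed treatment and testing effects

def post_scores(treated, pretested):
    """Simulate post-test scores for one of the four groups."""
    base = rng.normal(50, 8, n)
    return base + effect * treated + testing_bump * pretested

o2 = post_scores(treated=True,  pretested=True)    # R O1 X O2
o4 = post_scores(treated=False, pretested=True)    # R O3    O4
o5 = post_scores(treated=True,  pretested=False)   # R    X  O5
o6 = post_scores(treated=False, pretested=False)   # R       O6

# Treatment effect with and without pre-testing; if these diverge, the
# pre-test is interacting with the treatment (a testing effect).
print(f"Effect among pre-tested groups:     {o2.mean() - o4.mean():.2f}")
print(f"Effect among non-pre-tested groups: {o5.mean() - o6.mean():.2f}")
print(f"Testing effect among controls:      {o4.mean() - o6.mean():.2f}")
```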
• directed police presence in crime hotspots reduces crime and disorder (Ariel, Sherman,
et al., 2020; Ariel, Weinborn, et al., 2016; Ratcliffe et al., 2011; Ratcliffe et al., 2020;
Sherman & Weisburd, 1995);
• face-to-face restorative justice led by police officers reduces recidivism, increases victims’
satisfaction, lowers their post-traumatic stress symptoms and saves revenues, compared
to usual criminal justice system processes (Angel et al., 2014; Sherman et al., 2015; Strang
et al., 2013; see also Mills et al., 2013; Mills et al., 2019);
• innocent suspects are less likely to be mistakenly identified and guilty suspects are
more likely to be correctly identified in simultaneous rather than sequential police line-
ups (Amendola & Wixted, 2015);
• nudges in the criminal justice system do not usually lead to desired effects as the
theory predicts, with non-significant differences between various reminders and control
conditions (Chivers & Barnes, 2018; Cumberbatch & Barnes, 2018; Monnington-Taylor
et al., 2019); however, see the Science paper that suggests otherwise: https://www.ideas42.
org/wp-content/uploads/2020/10/Behavioral-nudges-reduce-failure-to-appear-for-
court_Science.full_.pdf;
• placing a marked police patrol in certain areas reduces property crimes (Ratcliffe et al.,
2020);
• police body-worn cameras can lead to reductions in complaints against the police
(Ariel, Sutherland, et al., 2017; Ariel, Sutherland, et al., 2016a, 2016b) as well as
assaults against security guards (Ariel, Newton, et al., 2019) and increase the perceived
legitimacy of the police (Ariel, Mitchell, et al., 2020; Mitchell et al., 2018; however, cf.
Lum et al., 2020);
• police-led target-hardening crime prevention strategy to burglary victims and their
close neighbours does not lead to statistically significant reductions in repeat or near-
repeat burglary (Johnson et al., 2017);
• policing interventions directed at increasing collective actions with citizens at crime
hotspots increase citizens’ fear of crime (Weisburd et al., 2020);
• problem-oriented policing strategies reduce the incidence of violence in hotspots
(Braga et al., 1999; Telep et al., 2014);
• procedural justice practices affect citizens’ perceptions of police legitimacy, trust in the
police and social identity (Mazerolle, Bennett et al., 2013; Murphy et al., 2014);
• requirement to attend brief group therapy with cautioning leads to reduced
subsequent reoffending in low-level intimate partner violence (Strang et al., 2017);
• ‘scared straight’ programmes backfire (Petrosino et al., 2000);
• second response programmes (i.e. police interventions that follow the initial police call
for service) to tackle domestic violence do not ‘work’ (Davis et al., 2010);
• standard or reduced frequency of mandatory community supervision for low-risk
offenders does not lead to different recidivism rates (Barnes et al., 2010);
• training police recruits on procedural justice results in short-term benefits for police–
public relations (Antrobus et al., 2019);
• truancy interventions in schools lead to reductions in violent behaviour (Bennett et al.,
2018; Cardwell et al., 2019; Mazerolle et al., 2019);
• using civil remedies (i.e. engaging non-offending third parties such as property owners) controls
drug use and sale (Mazerolle et al., 2000); and
• working 10-hour shifts is healthier than working 8-hour shifts among police officers
(Amendola et al., 2011).
Quasi-experimental designs
treatment, the outcomes and the allocation of cases have already occurred. Second,
it may not be ethical to conduct true experiments (see Mitchell & Lewis, 2017; we
discuss these issues in Chapter 5). Third – and perhaps most crucially – treatment
providers are not always open to the idea of random assignment. From anecdotal
experience, we can say that prison authorities, lawyers and police departments are
often apprehensive about RCTs. Therefore, despite the limitations we have discussed,
‘natural’ experiments and quasi-experimental designs often fit the bill.
For example, testing the effect of allocating court cases to judges of a different
ethnicity, with a view to measuring how different judges treat defendants of dif-
ferent backgrounds, is likely to be best studied as a ‘natural experiment’, as
the allocation of cases has already happened (e.g. Gazal-Ayal & Sulitzeanu-Kenan,
2010). Such studies are still considered experiments because the distribution of
court cases to the judges is done with a certain degree of randomness, that is, with-
out a systematic pattern that prefers certain judges to others in the allocation of
new cases – unless the case requires a judge with expertise in a subject matter. Under
these conditions, we would be able to test the null hypothesis of no differences
between the groups of judges and understand whether their backgrounds matter in
terms of court outcomes.6
Far more common, however, are the retrospective causal studies within the
‘quasi-experimental designs’ category. In these studies, the researcher obtains
existing datasets – police records, survey responses or court cases – about a par-
ticular phenomenon and is interested in observing a set of independent variables
and their relationship with a set of dependent variables. The researcher is not
involved in the allocation of the treatment, but still attempts to extrapolate the
cause-and-effect relationship based on the exposure to a certain intervention in
one subgroup of cases (offenders, officers, places, etc.) and then compare it to
another subgroup of similar cases. The ‘trick’ in these designs is the creation of a
comparison group that is equal to the ‘treated’ group, so that the claim of causal
relationship is sufficiently compelling.
These quasi-experimental designs come in many forms, including regression discon-
tinuity, interrupted time series and propensity-based methods (Angrist & Pischke,
2014; Cook & Campbell, 1979; Morgan & Winship, 2007, 2012; Shadish et al.,
2002; cf. Shadish, 2013). These approaches tend to blur the distinction between
‘design’ and ‘analysis’, because they are inherently statistical in nature and are still
6 Recently, some scholars have applied a ‘synthetic group’ experiment (Abadie & Gardeazabal,
2003) in situations where there is no comparison group running simultaneously with the
treatment group. A recent and interesting example can be found in Bartos et al. (2020).
in every T13…24). We can attribute the increase in the temperature in the room
to the change in the thermostat, and nothing else; it is difficult (if not scary)
to think of another factor that could explain why the temperature rose in observations
following T12, other than the adjustment of the thermostat. This is particularly the
case here, because we have a perfect dose–response relationship (a change in the
thermostat of 10 degrees has led to a change of exactly 10 degrees in the room
temperature, in a consistent manner).
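The thermostat illustration can be expressed as a short one-group time-series calculation; the temperature values below are invented to mirror the perfect dose–response pattern just described.

```python
import numpy as np

# Invented one-group time series: room temperature at T1..T24,
# with the thermostat raised by 10 degrees after T12.
pre = np.full(12, 20.0)          # T1..T12: stable at 20 degrees
post = np.full(12, 30.0)         # T13..T24: stable at 30 degrees
temps = np.concatenate([pre, post])

# Simple interrupted-time-series summary: compare the means before and
# after the intervention point (there is no trend in either segment).
delta = post.mean() - pre.mean()
print(f"Mean before T12: {pre.mean():.1f}  Mean after T12: {post.mean():.1f}")
print(f"Step change attributed to the thermostat: {delta:.1f} degrees")
```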
This design suffices for straightforward physical expressions of cause and effect.
The question, however, is whether the time-series experimental model is suffi-
ciently strong to control for threats to the internal validity of quasi-experiments in
the social sciences – and the answer is plainly no. We discuss these issues in Chapter 2
(and more robustly in Nagin et al., 2009; Nagin et al., 2015, p. 92; Sherman, 2009,
p. 13), but we provide here an example to emphasise this issue. An experiment by
Loftin et al. (1991) looked at the relationship between restrictions on firearms sales
in Washington, D.C., and levels of violent crime. The study showed that, follow-
ing the restrictions, a reduction in gun-related crime was recorded.
Though this cause-and-effect relationship is logical, the evidence of the drop in
gun-related injuries following the change in rules limiting the sale and licensing
of firearms can still be reasonably explained through rival hypotheses. Have gun-
related injuries dropped in jurisdictions without gun restrictions? Are there sce-
narios where gun licensing restrictions resulted in no impact or even an increase
in gun-related injuries? Has the increase in police efficiency in combating gang-
related crime affected the frequency of gun-related injuries, but not the licensing
rules? Finding comparable control conditions in these settings is very difficult, as
the units of analysis are entire jurisdictions that can cover many millions of people
(Toh & Hernán, 2008). Unlike the room temperature example, or any other experi-
ment in nature where a variation in the data is incontestably a result of the stimu-
lus, the social sciences do not ‘work’ this way. This one-group time-series analysis
cannot control for alternative explanations to the apparent reduction in gun crime
when gun laws are put in place.
However, there is one possible scenario in which an extension of this time-series
experimental design would be considered sufficiently powerful to remove rival expla-
nations: multiplicities. As we stressed, any one study utilising a time-series model is
deemed insufficient, but a series of studies using time-series analyses, in different
settings, with diverse populations and at different times, which collectively result
in similar trends across the time-series, can produce convincing causal estimates.
Indeed, these will remain asymptotic: it is a method that approaches the necessary
conditions for true cause and effect, but never fully arrives at these necessary condi-
tions. But multiple time-series experiments, from different sources, showing the same
direction and magnitude of effects, can be quite convincing.
We note that, overall, this and similar studies were subsequently incorporated into
systematic reviews by Crandall et al. (2016) and Lee et al. (2017) on the link between
restrictive licensing laws and firearm-related injuries, and the conclusions were similar:
stronger firearm laws are associated with reductions in firearm homicide rates.
are evaluated by looking at whether the treatment group deviates from its baseline
trend by a greater amount than the comparison group. However, the DID design . . .
evaluates the impact of a program by looking at whether the treatment group deviates
from its baseline mean by a greater amount than the comparison group. (Somers et al.,
2013, p. iii)
The DID calculation expresses the causal effect as the difference between the
changes in the before and after means of the two groups (see Figure 4.2).
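A minimal sketch of that calculation, using invented before and after means, is given below; it treats the comparison group's change as the change the treated group would have experienced without the intervention, under the assumption, discussed below, that both groups would otherwise have moved in parallel.

```python
# Invented before/after means for a treated and a comparison group.
treated_pre, treated_post = 120.0, 90.0
control_pre, control_post = 118.0, 105.0

# Difference-in-differences: the change in the treated group minus the
# change in the comparison group (the counterfactual change).
did = (treated_post - treated_pre) - (control_post - control_pre)
print(f"Change in treated group:    {treated_post - treated_pre:+.1f}")
print(f"Change in comparison group: {control_post - control_pre:+.1f}")
print(f"DID estimate of the effect: {did:+.1f}")
```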
The design may prove useful in natural experiment circumstances (e.g. see Bilach
et al., 2020). However, the major drawback of DID is that it requires a substantial set
of assumptions about the comparability of the comparison and treatment groups at
pre-intervention stages (known as the parallel trends assumption). It is unlikely that
the research team can be confident that this design accounts appropriately for all
confounding variables required for a causal inference, which therefore places limits
on our confidence in conclusions from DID designs.
[Figure 4.2 Difference-in-differences: the dependent variable (0–300) plotted over 16 time periods, with the point at which the intervention is applied marked.]
7 A related model is the equivalent materials design, which includes different treatment materials; its
rationale and design are ostensibly the same as the one presented here.
participants with extreme scores: the hottest hotspots, the most harmful felons,
the most frequent reoffenders, the most harmed victims or those otherwise most in
need of an intervention (Dudfield et al., 2017; Liggins et al., 2019; Sherman et al.,
2016; J. Sutherland & Mueller-Johnson, 2019). Based on these extreme scores, the
control group is then selected for matching purposes; after all, the selection of the
comparison group must at least be done in a way that matches with the treatment
group based on the dependent variable, to create the pre-experimental equivalence
of groups. However, as we have shown in Chapter 3, there is a natural tendency for
patterns to regress to the mean. If the means of the two groups are substantially
different and require a matching procedure, then the process of matching ‘not only
fails to provide the intended equation but in addition insures the occurrence of
unwanted regression effects’ (Campbell & Stanley, 1963, p. 49). This implies that
the two groups will differ on their post-test scores independently of any effects of
the treatment.
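This regression artefact can be reproduced with a short simulation: in the sketch below nobody receives any treatment, yet selecting 'treated' cases on extreme baseline scores from a higher-scoring pool, and comparison cases above the same cut-off from a lower-scoring pool, produces groups that look similar at baseline but drift apart at follow-up. All distributions and cut-offs are invented, and the shared cut-off is a simplified stand-in for matching on the baseline score.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 20000

# Two pools with different underlying means (e.g. a referred, high-need pool
# and a general comparison pool); observed scores = true score + noise.
true_t = rng.normal(60, 10, N)
true_c = rng.normal(45, 10, N)
pre_t = true_t + rng.normal(0, 10, N)
pre_c = true_c + rng.normal(0, 10, N)

# 'Match' the groups by selecting, from each pool, only cases with extreme
# baseline scores above the same cut-off -- and apply no treatment to anyone.
cutoff = 70
sel_t, sel_c = pre_t >= cutoff, pre_c >= cutoff

post_t = true_t + rng.normal(0, 10, N)   # follow-up, no intervention
post_c = true_c + rng.normal(0, 10, N)

print(f"Baseline means:  selected 'treated' {pre_t[sel_t].mean():.1f}, "
      f"selected comparison {pre_c[sel_c].mean():.1f}")
print(f"Follow-up means: selected 'treated' {post_t[sel_t].mean():.1f}, "
      f"selected comparison {post_c[sel_c].mean():.1f}")
# Each group regresses towards its own population mean, so the follow-up
# scores diverge even though nobody was treated.
```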
Another issue is the self-selection of treatment group participants, as they either
deliberately sought out or were ‘forced into’ exposure to the treatment, whereas the
control group participants were deliberately not. This makes all the difference in the
world: we cannot simply assume that the participants share the same motivations, perceptions, prog-
nosis or recruitment opportunities. For this reason alone, experimenters should be
concerned about relying on findings from this design – unless these issues can be
factored out. Often, this is not the case (but cf. Hasisi et al., 2016; Haviv et al., 2019;
Kovalsky et al., 2020; G. Perry et al., 2017).
[Figure: the proportion of cases revictimised (0.00–0.45) plotted against the number of risk ‘points’ (0–30), illustrating a regression discontinuity threshold.]
However, RDD is a proxy in lieu of randomisation only when two conditions are
met: first, that there is continuity around the threshold point, and second, that all
participants comply with their ‘assigned’ treatment condition. In practice, these con-
ditions may not always be observed, and they interfere with the internal validity of
RDD. Finally, RDD tends to have issues with external validity. Since RDD only esti-
mates the treatment effect at the threshold, any generalisation for cases further away
from this threshold level is compromised (Y. Kim & Steiner, 2016).
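To make the threshold logic concrete, the sketch below simulates an assignment rule based on a risk score and recovers the treatment effect as the gap between two local linear fits at the cut-off; the scales, bandwidth and effect size are illustrative assumptions and not features of any study cited here.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 4000

risk = rng.uniform(0, 30, n)            # running variable ('risk points')
cutoff = 15.0
treated = risk >= cutoff                # treatment assigned by the threshold

# Invented outcome: smoothly related to risk, plus a jump of -0.08 at the
# cut-off caused by the treatment, plus noise.
y = 0.10 + 0.01 * risk - 0.08 * treated + rng.normal(0, 0.05, n)

# Local linear fits within a bandwidth on each side of the threshold.
bw = 5.0
left = (risk >= cutoff - bw) & (risk < cutoff)
right = (risk >= cutoff) & (risk <= cutoff + bw)

b_left = np.polyfit(risk[left], y[left], 1)
b_right = np.polyfit(risk[right], y[right], 1)

# The RDD estimate is the difference between the two fits *at* the cut-off.
at_cut_left = np.polyval(b_left, cutoff)
at_cut_right = np.polyval(b_right, cutoff)
print(f"Estimated effect at the threshold: {at_cut_right - at_cut_left:+.3f}")
```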
Conclusion
Not all experimental designs are created equal, and they vary in terms of the degree
of control they exercise over the experimental process. We reviewed the 11 classic
types of experimental designs laid out by Campbell and Stanley (1963) and two spe-
cific quasi-experimental innovations (RDD and PSM). We have made the case for
implementing true experiments over other designs. When presented with the choice
of either conducting an RCT or implementing a different causal design, RCTs should
be preferred, for all the reasons we discussed in this chapter.
Our preference towards RCTs is not just about the benefits of randomisation for
the creation of optimal counterfactual conditions. As we reviewed in Chapter 2 and
then in more detail in Chapter 3, science has yet to develop a more convincing causal
inference model other than the RCT. Indeed, experiments using random allocation
are not always feasible when conducting field tests. They can be difficult to handle:
case attrition, diffusion of treatments, inconsistent treatment fidelity and problems
associated with small samples are common features. The experimenter ought to have
a strong protocol to follow, as we discuss in Chapter 5, and conduct a robust pre-
mortem analysis to plan ahead and then offset issues that will arise during the experi-
ment. Still, true experiments have in them a set of conditions that protect the validity
of the trial from a long list of threats, as listed in Chapter 3. The other experimental
designs we reviewed in this chapter have more limited control. Therefore, when the
conditions are possible, RCTs should be implemented.
Of course, this is not to say that other experimental designs have no value. As the
body of evidence accumulates on any particular intervention, the pre-experimental
designs can become informative as well. Why should we discount pre–post-only stud-
ies if a series of them show the same result, across different populations, settings and
time? When these designs present outcomes that do not conflict with or strongly
deviate from the overall body of evidence, they should not be ignored. Our knowl-
edge of the tested intervention will only get richer, not more confusing, with the
accumulation of evidence gathered from a range of research designs.
Finally, quasi-experimental designs, or statistical modelling on retrospective data,
will always be required. Even though they have inherent issues associated with weaker
control conditions to which the tested intervention is compared (e.g. selection bias),
these models can provide great insight on tested hypotheses. Data sets created by the
state (e.g. crime data), historic data and secondary data analysis more broadly will
continue to provide opportunities to sharpen our theories of causal inference, using
quasi-experimental designs. For these and other reasons, this chapter does not ignore
non-RCTs at all: it celebrates their contributions and focuses on the settings
in which they are most appropriate.
Chapter Summary
Further Reading
Weisburd, D., Hasisi, B., Shoham, E., Aviv, G., & Haviv, N. (2017). Reinforcing the
impacts of work release on prisoner recidivism: The importance of integrative
interventions. Journal of Experimental Criminology, 13(2), 241–264.
Apel, R. J., & Sweeten, G. (2010). Propensity score matching in criminology and
criminal justice. In A. Piquero & D. Weisburd (Eds.), Handbook of quantitative
criminology (pp. 543–562). Springer.
The discussion in this chapter briefly introduced PSM as one of the most popular
techniques today for creating statistically balanced arms in quasi-experimental
designs. Experimenters evaluating interventions in prison settings appear to have
gained more experience with this technique than those in other areas of criminology. Some
examples are particularly noteworthy, such as this recent study by Weisburd et al.
on the benefits of work release on prisoner recidivism. For further reading on PSM
in criminology, this chapter by Apel and Sweeten should be consulted.
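For readers who want to see the mechanics, a deliberately simplified PSM sketch on simulated data follows: a logistic regression estimates each case's probability of treatment from observed covariates, and every treated case is paired with the nearest-scoring untreated case. Real applications require checks on overlap and covariate balance that are omitted here, and all variables and effect sizes are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 2000

# Invented covariates and a treatment whose take-up depends on them
# (self-selection), plus an outcome with a true effect of -2.
age = rng.normal(30, 8, n)
priors = rng.poisson(2, n)
X = np.column_stack([age, priors])
p_take_up = 1 / (1 + np.exp(-(-3 + 0.05 * age + 0.4 * priors)))
treated = rng.random(n) < p_take_up
outcome = 10 + 0.1 * age + 1.5 * priors - 2.0 * treated + rng.normal(0, 2, n)

# Step 1: estimate propensity scores from the observed covariates.
ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]

# Step 2: nearest-neighbour matching (with replacement) on the score.
t_idx = np.where(treated)[0]
c_idx = np.where(~treated)[0]
matches = c_idx[np.abs(ps[c_idx][None, :] - ps[t_idx][:, None]).argmin(axis=1)]

# Step 3: compare outcomes between treated cases and their matched controls.
att = outcome[t_idx].mean() - outcome[matches].mean()
naive = outcome[treated].mean() - outcome[~treated].mean()
print(f"Naive difference:   {naive:.2f}")
print(f"Matched difference: {att:.2f}   (true effect assumed: -2.00)")
```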
Chapter Overview
Recipes for experiments�������������������������������������������������������������������������� 124
Experimental protocols and quality control tools������������������������������������ 124
Protocols for reporting the findings of the experiment��������������������������� 132
Implementation of experiments��������������������������������������������������������������� 136
Ethics in experiments������������������������������������������������������������������������������� 140
Conclusion������������������������������������������������������������������������������������������������ 144
Appendix: CrimPORT template��������������������������������������������������������������� 146
Further Reading��������������������������������������������������������������������������������������� 153
One of the most important stages of scientific inquiry is the meticulous planning
ahead of all the necessary steps, hazards and possibilities that may occur during the
experiment. Unlike random discoveries, which have always been part of the scientific
journey, running a strategic experiment presents various obstacles. Trials are meant
to be methodical, disciplined and structured. In this sense, prospective studies are dif-
ferent from retrospective analyses of data because the prospective experimenter can-
not go back in time and reformulate the research hypotheses, or switch to a different
intervention midcourse. Field and clinical trials usually have to ‘stick’ to the original
game plan (see Sherman, 2010). Deviations often necessitate a restarting of the trial.
As experiments tend to be expensive, and changes are often difficult to explain with-
out jeopardising the integrity of the test, the recommended course of action is to act
according to the plan. If nothing else, an experimental cycle that does not follow its
own guidelines is a messy test to analyse and should therefore be avoided.
As alluded to in the synopsis, there are three issues at stake here. First, experi-
menters should plan how they will conduct their experiments using a protocol or
template. Second, there are organisational frameworks that are more conducive to
the efficient implementation of the experimental process. Finally, there are crucial
ethical considerations to consider. We discuss these issues in this chapter and begin
by exploring experimental protocols.
Researchers can use protocols, or blueprints, to help them conduct sound and
robust experiments. In many ways, the protocol – a detailed plan of the overall
experimental process – nudges the experimenter into compliance with the study
design and, by implication, with accepted industry practices. Experimental protocols are
‘key when planning, performing and publishing research in many disciplines, espe-
cially in relation to the reporting of materials and methods’ (Giraldo et al., 2018,
p. 1). Protocols are important, especially in the form of checklists, because they
serve as active aide-memoires (see the review of nudges and checklists by Langley
et al., 2020). The closer the experiment sticks to the original recipe and follows its
game plan, the more credible and transparent the results. This is true for the early
planning stages, as well as the stage at which the final reports are disseminated.
Thus, the protocol provides a detailed account of activities that the experimenter
will face. For example, protocols often include a practical timetable; specific guides
of actions; details of the parties who will be involved in the study; comprehensive
definitions of the intervention, data and measurements and the overall experimen-
tal procedures and materials.
The protocol generally follows a master template. These checklists often include
established key components that any test should consider. For example, clinicaltrials.
gov, provided by the US National Institutes of Health, invites researchers to register
their trials ahead of time, and requires information on a set of 13 study elements
(https://register.clinicaltrials.gov):
1 Study identification
2 Study status
3 Sponsor/collaborators
4 Oversight
5 Study description
6 Conditions
7 Study design
8 Arms/groups and interventions
9 Outcome measures
10 Eligibility
11 Contacts/locations
12 Sharing statement
13 References
These sets of rules not only focus the experimenter on the salient issues they ought
to take into account but also allow the scientific community an opportunity to sys-
tematically and methodically assess how the test was executed, and the validity of
the causal estimates that it proposes. In part, this additional layer of peer review is
made possible by the requirement to publish the experimental protocol; experimenters
are less likely to ‘bury’ the study should they not like the results or to ‘fiddle’ with
the figures. They are also less likely to ‘go fishing’ for statistically significant results,
as we discuss below. Overall, this creates transparency, accountability and safety, as well
as consistency and efficiency, in science.
The FDA’s guidelines suggest that having a protocol serves four main purposes.
First, using protocols introduces uniformity in research practices. Unlike qualitative
research and observational research strategies, which are characterised by more free-
dom in the way in which a study is administered, evaluative research necessitates
a more formal framework. Once a research protocol is used for evaluative research,
then not only is the ordering of the content controlled, but the terminology, key
terms and methods are also homogenised and standardised as much as possible. This
uniformity is welcome, as the research process and its final report are not works of prose and
should not be left to creative interpretation. Whether a requirement set out by the FDA
should be an industry standard is arguable, but it highlights a growing recognition of
the need for uniform practices.
Similarly, protocols are particularly pivotal for the purpose of replications in science.
When they include all the necessary information for obtaining consistent results,
protocols are akin to a ‘cookbook’, in which all the processes and actions that form
part of the experiment are detailed. Freedman et al. (2017), as well as Baker (2016),
have convincingly shown that adequate and comprehensive reporting in the protocol
facilitates reproducibility (Casadevall & Fang, 2010; Festing & Altman, 2002; see
additional practical stages in Drover & Ariel, 2015; Harrison & List, 2004; Leeuw &
Schmeets, 2016; List, 2011; Welsh et al., 2013).
The second utility of a protocol to which we alluded earlier is that it drives the
scholar to think about the adverse effects of the intervention. Importantly, the protocol
can be viewed as a pre-mortem diagnosis or pre-mortem analysis: a consideration
of all the factors that can potentially go wrong and ways to mitigate these concerns.
This provides the context for scrutiny into potential problems and their solutions,
with a balanced emphasis on implementation as well as safety and ethics. After all,
field experiments involve human beings, so experimenters must consider the potential
adverse effects on the participants and minimise them as much as possible. As such,
this is relevant to all studies involving human participants, and institutional review
boards (IRBs) and ethics committees have a long tradition of looking precisely at
these questions: Is the welfare of the participants meticulously considered? Is there a
potential for a backfire effect and, if so, what can be done to minimise this risk? Are
there any potential risks to the researchers themselves? These are issues that, once
introduced in the experimental protocol, are more likely to be considered (see more
broadly in Neyroud, 2016).
A third and crucial purpose of the protocol is to reduce the number of revisions
to the experiment. Indeed, changes in protocols are inevitable. There are natural
variations over time – for example, availability or definition of data, randomisation
procedures, change in treatment providers, revisions to the eligibility criteria, defini-
tion of the treatment and so on. Before the final version of the protocol is approved
and signed off on, multiple iterations are commonplace. However, the researcher
must be aware that each revision is expensive and will result in delays in the trial.
More importantly, revisions may increase the risk of harm to the persons
being studied or add to the nuisance of being studied (see review in Meinert, 2012,
pp. 195–204). Therefore, revisions should be kept to a minimum whenever possible.
The crucial recommendation, however, is to keep track of the changes and transparently
report on these revisions.
The final reason – arguably the most important for an experimental protocol – is to
force more complete and full reporting. To place this rationale in perspective, consider
the following finding reported by the author of Bad Pharma: How Drug Companies
Mislead Doctors and Harm Patients, Ben Goldacre (2014)1:
Overall, for the treatments that we currently use today, the chances of a trial being pub-
lished are around 50 percent. The trials with positive results are about twice [as likely]
to be published as trials with negative results. So, we’re missing half of the evidence that
we’re supposed to be using to make informed decisions. [And] we’re not just missing any
old half, we’re selectively missing the unflattering half.
We note that going ‘fishing for statistical significance’, ‘data dredging’ or ‘p-hack-
ing’ are concerning practices, especially when it comes to evaluative research
with real policy implications (see discussions by Head et al., 2015; Payne, 1974;
Weisburd & Britt, 2014; Wicherts et al., 2016). Here, the researcher misuses data
analysis in order to locate statistically significant patterns because these outcomes
are more publishable. A common way of detecting statistically significant out-
comes outside the parameters of the main effects is by conducting multiple tests of
significance on selective subgroups of participants, by combining variables, omit-
ting uncomfortable participants, altering the follow-up periods or breaking down
the treatment components – until one or more of the statistical tests yield a finding
that is significant under the usual p = 0.05 level (Gelman & Loken, 2013). Since any
data set with a degree of randomness is likely to contain relationships, experiment-
ers may use one or more of these techniques to celebrate a statistically significant
outcome, which again is more publishable.
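The inflation produced by this kind of multiplicity is easy to demonstrate by simulation: in the sketch below there is no true effect anywhere, yet testing many arbitrary subgroups makes it very likely that at least one comparison crosses the p < 0.05 line. The number of subgroups and the sample sizes are arbitrary choices for the illustration, and each 'subgroup' is simulated independently for simplicity.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_trials, n_subgroups, n_per_arm = 1000, 20, 50

false_positive_trials = 0
for _ in range(n_trials):
    # A null experiment: treatment and control outcomes come from the same
    # distribution, so any 'significant' subgroup finding is a false positive.
    found_significant = False
    for _ in range(n_subgroups):
        treat = rng.normal(0, 1, n_per_arm)
        control = rng.normal(0, 1, n_per_arm)
        _, p = ttest_ind(treat, control)
        if p < 0.05:
            found_significant = True
            break
    false_positive_trials += found_significant

print("Single test: about 5% of null experiments look 'significant' by design.")
print(f"With {n_subgroups} subgroup tests: "
      f"{false_positive_trials / n_trials:.0%} of null experiments "
      f"produce at least one p < 0.05 result.")
```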
Thus, there is an inherent and systematic concern that published evidence is
one-sided: only showing ‘successful’ interventions and established cause-and-effect
relationships. What does not ‘work’ or is not statistically significant is less likely to
be published, and therefore remains in the ‘grey literature’ that is more difficult to
find. Indeed, the issue of selective reporting and what is referred to as ‘selection bias’
1 See interview in Time magazine (28 February 2013): https://healthland.time.com/2013/02/28/
how-drug-companies-distort-science-qa-with-ben-goldacre/
2 The FDA Code of Federal Regulations is a codification of the general and permanent rules
published in the Federal Register by the executive departments and agencies of the Federal
Government. Title 21 [Revised as of 1 April 2019] includes a ‘Good Laboratory Practice For
Nonclinical Laboratory Studies’, and under it, in subpart G, there are details of the basics of a
‘Protocol for and Conduct of a Nonclinical Laboratory Study’ in ‘Sec. 58.120 Protocol’.
description of the treatment and comparison conditions. However, the overall frame-
work and the minimum set of details that any reasonable protocol should include
have remained the same:
Such protocols provide the impetus for scholars around the globe to develop proto-
cols that are more closely linked to their own disciplines (e.g. Cameli et al., 2018).
While experimental designs are meant to be universal and follow certain research
architectures, there are still differences between the disciplines in the ways in which
these designs are instituted. Therefore, there are different protocols. For example, the
recruitment of cases in psychology laboratory experiments can look very different
from the recruitment of participants for law enforcement experiments in field set-
tings. Issues such as consent, blinding or double blinding and incentives paid to par-
ticipants for taking part in the study, for example, are inherently different between
research fields. Similarly, the question of sample size looks very different in exper-
iments that analyse places as the units of analysis rather than individuals (as we
reviewed in Chapter 3). Therefore, bespoke protocols for different research settings
are welcomed.
One example is the protocol template called SPIRIT (Standard Protocol Items:
Recommendation for Interventional Trials; see www.spirit-statement.org/trial-protocol-
template). SPIRIT is a protocol based on widely endorsed industry standards and the
accepted protocol template for the journal Trials. We recommend having a look at
this tool.
Another detailed and practical template is the CrimPORT (Criminological Protocol
for Operating Randomised Trials; Sherman & Strang, 2009). CrimPORT focuses on
both the internal mechanisms of trials as well as the ‘managerial elements involved
in making the experiment happen’ (Sherman & Strang, 2012, p. 402). The 12 sections
of this protocol call on the researcher, prior to the administration of the experiment,
to consider the fundamental definitions of the various factors linked to the study
(e.g. the treatment, the measurement, the sample and other internal elements of the
test). By paying close attention to the operational and practical details of the upcom-
ing experiment, the CrimPORT leads to a pre-mortem diagnosis of the possible
pitfalls the study may encounter and to ways of remedying them. This includes
the organisational dynamics that any trial entails, which are often overlooked when
planning experiments. For ease of reference, the protocol is provided in the appendix
to this chapter, and the core factors are listed below:
Box 5.1
The elements of the CrimPORT are generally intuitive and lay out the framework
of the necessary conditions for successful trials. At the same time, we note that some
articles of the CrimPORT can be complicated due to the nature of the issues they raise.
As an illustration of the difficulties these protocols highlight, consider the first item of
the CrimPORT: ‘defining the hypotheses of the experiment’. The CrimPORT logically
posits that any experiment requires a hypothesis that specifies an anticipated causal
relationship between the independent and the dependent variables (or no relationship,
as in the null hypothesis), so that statistical tests of significance and measures of effect
size can be performed (for a more detailed review, see Kendall, 2003). Experimental
hypotheses ought to be quantifiable, specific and formulated in advance of the experi-
ment. The specificity and accuracy of the empirical research question are vital, and
therefore great attention should be given to its conceptual and operational definition.
What stimulus is being tested, precisely? To what counterfactual conditions or alterna-
tive treatments is the treatment effect compared? What are the conditions under which
the null hypothesis will be falsified? These can be difficult questions to answer.
One element that is not required by the CrimPORT is the need for a literature
review. The APA, which publishes guidelines on how to write academic papers and
is considered by many to be an authoritative guideline for these matters, requests a
brief literature review to support a study’s hypothesis. A reasonable literature review
must summarise the state of available knowledge as well as the systematic process
that was used to arrive at this literature review (see Baumeister & Leary, 1997). The
literature review should incorporate a summary of findings from systematic reviews
and meta-analyses, if there are any, as they should form the strongest basis for laying
out the hypotheses of the experiment.
Box 5.2
We emphasise that the literature review requirement does not mean that the exper-
iment must endorse a particular theory, as a theoretical framework is not a necessary
condition for hypothesis testing. For experiments to reliably and validly demonstrate
The presumption at the outset of the trial [in a criminal case] is that the defendant is
innocent. In theory, there is no need for the defendant to prove that he or she is inno-
cent. The burden of proof is on the prosecuting attorney, who must marshal enough
evidence to convince the jury that the defendant is guilty beyond a reasonable doubt.
Likewise, in a test of significance, a scientist can only reject the null hypothesis by pro-
viding evidence for the alternative hypothesis. If there is not enough evidence in a trial
to demonstrate guilt, then the defendant is declared ‘not guilty’. This claim has nothing
to do with innocence; it merely reflects the fact that the prosecution failed to provide
enough evidence of guilt. In a similar way, a failure to reject the null hypothesis in a
significance test does not mean that the null hypothesis is true. It only means that the
scientist was unable to provide enough evidence for the alternative hypothesis.
While the pre-experimental protocol such as the CrimPORT is useful for planning
experiments, we need bespoke templates to report the findings as well. The need
for better reporting is shown by multiple reviews of existing experiments that have
found dramatic underreporting or misreporting of key information that would make
the experiment reproducible. Moher et al. (2015) have shown that fewer than 20%
of popular publications in the life sciences have adequate descriptions of study
design and analytic methods. A.E. Perry et al. (2010) found even more concerning
non-compliance with proper reporting rules in criminology, reaching the conclu-
sion that ‘the state of descriptive validity in crime and justice is inadequate’ (p. 245).
Thus, more ‘accurate and comprehensive documentation for experimental activities
is critical’, remark Giraldo et al. (2018, p. 2), because ‘knowing how the data were
produced is . . . important’.
By using templates for sharing the evidence and research methods with the wider
scientific community, we hope to introduce consistency, clarity, transparency and
accountability to the process of disseminating the results of experiments. Reporting
standards for experiments were therefore developed over the years. One of the most
popular choices is the CONSORT (Consolidated Standards of Reporting Trials; see
review in M.K. Campbell et al., 2004; Montgomery et al., 2018). Another is TIDieR – the
Template for Intervention Description and Replication (Hoffmann et al., 2014),
although the CONSORT is more prevalent.
CONSORT is a 25-item checklist that guides reporting of trials. It lists a set of rec-
ommendations for reporting the results of the experiment, using a standard way to
prepare ‘reports of trial findings, facilitating their complete and transparent report-
ing, and aiding their critical appraisal and interpretation’ (www.consort-statement.
org). For example, the checklist items focus on reporting how the trial was designed,
analysed and interpreted. Perhaps the most obvious point in the CONSORT checklist
is to include ‘randomised controlled trial’ in the title of the publication. This is both
to make clear what the paper is about and also to make it easier for others to find the
publication (e.g. if conducting a systematic review or just looking for evidence on a
given topic).
CONSORT also includes a flow diagram, which displays the progress of all par-
ticipants through the trial. The usefulness of the flowchart cannot be overstated and
should form part of any report, tracking trial participants from the point of recruit-
ment or randomisation through to outcome reporting and analysis. This is where rig-
our in terms of ‘attention to detail’ comes in – we need to be able to track and report
on all the individuals included in the trial from the point they were randomised
through to our analysis.
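Because the flow diagram is ultimately a set of counts, even a very small script can help keep them consistent as the trial progresses; the sketch below tallies the enrolment, allocation, follow-up and analysis stages from an invented participant table, and the field names are our own assumptions rather than CONSORT requirements.

```python
# Invented participant records; each dict tracks one person through the trial.
participants = [
    {"eligible": True,  "arm": "treatment", "followed_up": True,  "analysed": True},
    {"eligible": True,  "arm": "control",   "followed_up": True,  "analysed": True},
    {"eligible": True,  "arm": "treatment", "followed_up": False, "analysed": True},
    {"eligible": False, "arm": None,        "followed_up": False, "analysed": False},
    {"eligible": True,  "arm": "control",   "followed_up": True,  "analysed": False},
]

assessed = len(participants)
excluded = sum(not p["eligible"] for p in participants)
randomised = [p for p in participants if p["eligible"]]

print(f"Assessed for eligibility n={assessed}, excluded n={excluded}, "
      f"randomised n={len(randomised)}")
for arm in ("treatment", "control"):
    allocated = [p for p in randomised if p["arm"] == arm]
    lost = sum(not p["followed_up"] for p in allocated)
    analysed = sum(p["analysed"] for p in allocated)
    print(f"{arm}: allocated n={len(allocated)}, "
          f"lost to follow-up n={lost}, analysed n={analysed}")
```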
In addition, extensions of the CONSORT statement have been developed to give
additional guidance for RCTs with specific designs, data and interventions. There
are several versions of CONSORT, tailored to minimum reporting requirements for a
range of trial scenarios (e.g. individually randomised and cluster-randomised trials).
Box 5.3
Introduction: background (item 2a)
Introduction: objectives (item 2b)
Methods: interventions (items 5, 5a, 5b, 5c)
Randomisation: implementation (item 10)
Results: harms (item 19)
Discussion: limitations (item 20)
Discussion: generalisability (item 21)
Discussion: interpretation (item 22)
Important information: registration (item 23); protocol (item 24); declaration of interests (item 25)
[CONSORT participant flow diagram template: excluded (n = ) – not meeting inclusion criteria (n = ), declined to participate (n = ), other reasons (n = ); randomised (n = ); then, for each arm, allocation, follow-up and analysis – analysed (n = ), excluded from analysis (give reasons) (n = ).]
Implementation of experiments
For the most part, issues associated with ‘how to’ conduct experiments have been left
undiscussed in mainstream experimental criminology, largely because they are not
necessarily ‘interesting’ as assessed by leading peer-reviewed journals. However, the
management, administration and politics of experiments are notoriously com-
plex and multifaceted. In these frameworks, one can locate the ‘craft’ of running
experiments, and at least three steps are crucial: (1) the creation of a coalition for the
purpose of the experiment, (2) the incorporation of field managers and pracademics
in field experiments and (3) intensifying our focus on implementation sciences. Let
us now consider these implementation elements in greater detail.
Researcher–practitioner coalitions
In experiments, particularly field trials, where the experimenter must rely on a treat-
ment provider to deliver the intervention – the police, courts, charities or schools –
a special type of relationship emerges between the researcher and the practitioner.
Unlike medicine, where relationships between research universities and hospitals
are well established, we have yet to find such an intertwined network of academics
and practitioners in the social sciences. Most experimental projects are short-lived,
purpose-oriented and depend on the actors rather than systems for their longevity.
Whether the interest in conducting an experiment arises from the researcher or from
the practitioner, collaboration is required throughout the entire experimental cycle
(Garner & Visher, 2003). This cooperative process requires the intimate and con-
tinuous involvement of all sides – a conclusion reached in a wide range of research
settings (Braga & Hinkle, 2010; Feder et al., 2011; J.R. Greene, 2010; Sherman, 2015;
Sherman et al., 2014; Weisburd, 2005). Reflecting on these studies, Strang (2012)
concluded that
experiments require close cooperation between the parties because of the need for
maintenance and monitoring. . . . Relationships which may be characterised as temporary
coalitions [emphasis added] for a common purpose may, under the right conditions,
ultimately mature into true research partnership. (p. 211)
Continuous attention must be given by this coalition to the ongoing cooperation. The day-to-day pressures of running a major project can hinder the success of the experiment, whether
a legal memorandum of understanding and willingness on behalf of the leadership
is present or not. Operational staff must be invested, both emotionally and profes-
sionally, in the success of the intervention; otherwise, the experiment is unlikely
to succeed. This is particularly the case when there is lack of respect between field
staff members and headquarters leaders (as seems to be the case in Israel, for both
the police and the national education system; see Brants-Sabo & Ariel, 2020; Jona-
than-Zamir et al., 2019).
Meaningful experiments often disrupt the daily routines of treatment providers,
and consequently, the programme of change requires active agreement to comply
with the research protocol. Otherwise, a host of problems can ‘go wrong’, ranging
from getting cases into the experiment, screening for eligibility, managing random
assignment, attrition from the study, consistency of delivery of the experimen-
tal conditions and monitoring and measuring programme delivery (Strang, 2012,
pp. 217–222). Therefore, a strong and ongoing collaborative approach is needed
to enhance the likelihood that the experiment will be delivered with integrity
(Weisburd, 2000). Mutual consent, emotional investment and genuine belief in the
purpose of the experiment are required; otherwise, the experiment is less likely to
succeed (Boruch, 1997).
In brief, we note that in clinical trials there are five phases – but we can extrapolate
from these guidelines for field tests in the social sciences as well: the earlier phases
look at whether an intervention is safe or has side effects, while later phases test
the effectiveness of the intervention versus control conditions. Phase 0 and phase 1
are small trials, usually with between 10 and 50 participants, which aim to test whether the intervention is harmful, without comparison groups. Phase 2 may or may not have randomised allocation of participants but would have a comparison group to test the effect of the intervention, on a sample of around 100 participants, against control conditions.
Phase 3 refers to larger trials, which usually use random assignment of hundreds
or thousands of participants – the common experiments we discussed in this book.
Phase 4 usually refers to longitudinal studies to investigate long-term benefits and
side effects. For more details, see Cancer Research UK (2019).
Pracademics – practitioners who are also trained researchers – are well placed to mobilise the agency, vertically and horizontally, for the purpose of the experimental coalition. While we cannot offer causal evidence to that effect, observational data suggests that the pracademic route is often successful (Ariel, Garner, et al., 2019).
The implementation science literature suggests that programmes are implemented successfully when
(a) carefully selected practitioners receive coordinated training, coaching, and frequent
performance assessments;
(b) organizations provide the infrastructure necessary for timely training, skilful super-
vision and coaching, and regular process and outcome evaluations;
(c) communities and consumers are fully involved in the selection and evaluation of
programs and practices; and
(d) state and federal funding avenues, policies, and regulations create a hospitable envi-
ronment for implementation and program operations. (p. vi.)
On the other hand, one feature that is consistently crucial in any experiment is the tracking of inputs, outputs and outcome data (see Sherman, 2013). While the
concept of tracking has been used in evidence-based policing more broadly as a line
of research (see e.g. Damen, 2017; De Brito & Ariel, 2017; Dulachan, 2014; Gresswell,
2018; Henderson, 2014; Jenkins, 2018; Pegram, 2016; Rowlinson, 2015; Young,
2014), we can generalise two key lessons from this body of evidence for experimental designs, which can help to mainstream implementation science more robustly.
First, experimenters must keep a detailed account of data – and not just about the
experiment but also about the ways in which the experiment was implemented: how
many meetings were held, with whom and under what conditions? What is the com-
position of the research team, and how and what was done by each member to exe-
cute the research plan? What is the process of endorsement of the research within the
treatment-delivery team? How many hours were dedicated for each element of the
experimental cycle? What mechanisms were placed to assure buy-in from stakehold-
ers? These and other relevant tracking questions about the implementation of the
experiment, from an organisational and structural side, are crucial if we want to be
able to assess the application of the test and the fidelity of the experiment. Without
tracking data, we may be left without an answer about the mechanisms that lead to
the causal estimates found in the study (e.g. Grossmith et al., 2015).
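As a minimal sketch – with hypothetical fields and example entries, not a prescribed standard – such an implementation log might be kept as a simple CSV file alongside the trial data:

```python
import csv
from datetime import date

# Illustrative fields for an implementation-tracking log; neither the fields
# nor the example entries are a prescribed standard.
FIELDS = ["date", "activity", "who", "hours", "notes"]

events = [
    {"date": date(2021, 3, 1), "activity": "briefing meeting",
     "who": "research team and shift supervisors", "hours": 2,
     "notes": "agreed the eligibility screening procedure"},
    {"date": date(2021, 3, 8), "activity": "random assignment audit",
     "who": "trial manager", "hours": 1,
     "notes": "checked the allocation log against the referral list"},
]

# One row per implementation event, written to a CSV log.
with open("implementation_log.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(events)
```

However simple, a log of this kind provides the qualitative and organisational record that later allows fidelity and process questions to be answered.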
Second, implementation queries are not just important for process evaluations but
also critical from an external validity perspective. As we reviewed in Chapter 3, exter-
nal validity is always a concern in experimental research, as many argue that, at least in field settings, an experiment does not mimic real-life conditions but rather provides a synthetic appearance of true population means. In part, this argument has weight
because we are unable to quantify the environmental and contextual settings of the
experiment due to a lack of observable data. Implementation science helps to correct
for this bias by requiring a detailed qualitative account of the experiment. The more
details shared, the more we can learn about the generalisability of the study’s outcomes.
Ethics in experiments
The issue of ethics is not currently covered in the CrimPORT template, but we sense
that it should be included, as it is part of the experimental architecture (e.g. Stern &
Lomax, 1997, and as introduced in The SAGE Quantitative Research Kit, Volume 1),
especially within those studies involving human participants. Concerns about the
ethical viability of randomised experiments have been consistently cited as one of
the main impediments to their proliferation (Clarke & Cornish, 1972; Neyroud,
2017). While ethical considerations may come in a variety of shapes and sizes, there
are two prevalent themes of ethical concerns. One is a perceived lack of fairness in
denying treatments to control group participants. A second is the perception of risk
of harm to control group participants and the subsequent implications for the legiti-
macy of the agency. These themes unintentionally ignore fundamental points about
random assignment – that there is an equal chance of receiving the treatment for all
participants and that the benefit of treatment is usually unquantified at the point of
experimentation. Yet they reflect a realpolitik concern present in many agencies and
merit the attention of experimenters at the earliest stages of their design processes,
principally in two respects: (1) securing the support of stakeholders and participants
and (2) ensuring that the participants are as safe as possible.
Social scientists can draw ethical lessons from experiments in clinical trials (Boruch,
Victor, et al., 2000; Weisburd, 2005). In medical trials, the dominant fundamental
ethos is taken from the Hippocratic Oath often summarised as ‘first, do no harm’.3 If
participants are at risk, then there must be a good justification for potentially harm-
ing them. Experimentation may be considered necessary only when there is ‘collective
equipoise’, where the medical community is genuinely uncertain about the effects of
a desired treatment. Under these conditions, tests are needed to make sure that future
patients will not be harmed as a result of the tested treatment. There is always at least
a marginal risk inherent in any situation where a participant takes part in a study, and
if the study places an individual at risk, it had better be for a good reason.
This ethical synopsis in medicine is reasonable, as one would expect given the
maturity of clinical trial practice in medical research. However, social science–based tri-
als are considerably less mature and experienced than biomedical trials (see the Global
Policing Database by Mazerolle et al., 2017; as well as Higginson et al., 2014). We do not
have enough experience in terms of the ethical considerations. This lack of experience leads many to the wrong conclusion: the answer is to conduct more experiments, not fewer (see review in Neyroud, 2016; see also Mark & Lenz-Watson, 2011).
One example is randomisation. Contrary to what antagonists of RCTs think, the
random allocation of participants into the study arms is more ethical than any other
treatment allocation procedure. If the study aims to test the effectiveness of a treat-
ment, and the researcher knows that this particular treatment is indeed effective, then
not allocating this treatment to patients may be seen as unethical. After all, the main purpose of science is to discover how to reduce harm and promote well-being. If the science behind the treatment has been 'proven' to the point that we know that the treatment 'works', we must give it to those who can benefit from it. However, the issue is that we often do not know that the treatment is effective; we assume that it should, and therefore we conclude that randomly excluding participants from the treatment is unethical. Yet how do we know that it should without a test? Something that is logical is not necessarily true, nor does the fact that a treatment has been practised for a long time (prisons, arrests, fines and counselling) mean that it has a desirable effect. If you know that the intervention positively affects well-being, then flipping a coin that would prevent treatment is indeed unfair. When you do not know – and hence the reason for conducting an experiment in the first place – then distributing the risk of a backfire effect randomly is the fairest approach.

3 For this reason, the World Health Organization stopped an experiment on the efficacy and safety of hydroxychloroquine and azithromycin for the treatment of ambulatory patients with mild COVID-19, as it was found during the trial that the drug presents serious health risks to those who take it (see story on BBC, 22 May 2020; www.bbc.co.uk/news/world-52779309). The evidence did not deter President Donald Trump from publicly endorsing it and claiming to have taken it to prevent COVID-19 (which, as we know, did not stop him from catching and possibly super-spreading the virus). As in other areas of his presidency, Trump epitomised an evidence-free policymaking approach (see Drezner, 2020).
A related ethical concern is informed consent: participants in a study may be exposed to known and unknown risks. Therefore, the rule and common practice is that every time an individual takes part in a study, their written consent should be required. Institutional review boards (IRBs) and ethics committees take this issue very seriously.
However, there are exceptions to this rule. Rebers et al. (2016) have identified four
exemptions to the rule of informed consent for research with an intervention: ‘data
validity and quality, major practical problems, distress or confusion of participants . . . and
privacy protection measures' (p. 9). For example, if an individual is legally required to take part in treatment, then consent is neither possible nor required, because no procedures for obtaining informed consent exist outside the research context. Offender treatment programmes, for instance, do not require consent from the offenders prior to assignment into
treatment because the offender has no choice but to participate. Domestic abuser
interventions are mandated by law in many states in the USA (see Mills et al., 2012),
for example, and failure to take part in treatment results in punishment. There are
many types of batterer intervention programmes that courts can mandate domestic
offenders to attend (Babcock et al., 2004; Cheng et al., 2019), and the offenders do
not get to pick them (unless there are compelling reasons). Consequently, an experi-
ment that investigates the effects of such a programme cannot seek consent because
it interferes with the natural processes of treatment. As the IRB guidance issued by the University of Michigan states,4 the requirement of informed consent can be
waived when ‘the research . . . involves no procedures for which written consent is
normally required outside the research context’.
Furthermore, assignment to an experimental treatment can be seen as fair, even
without obtaining informed consent, when the tested intervention is assumed to
benefit the offender. If we hypothesise that the treatment has a harmful effect on the
offender and that there are no benefits to anybody else from exposing the offender to
the experimental treatment, then we should not conduct the experiment (in fact, we
should not use it at all). The research hypothesis must therefore present a compelling
reason why exposure to the stimulus would benefit the treatment participants – even if the term benefit is defined broadly – when consent is not sought.
4 https://bit.ly/3pF3GHp
A poorly designed or poorly executed experiment may produce misleading conclusions, and this may have negative consequences for participants, stakeholders, the wider research community and the general public, whose tax funds have probably paid for the experiment. It is important, then, that experimental designs are planned with the involvement of a suitably qualified designer.
We do not prescribe what such an appropriate qualification may be, but we sug-
gest that completing an ethics assessment should address this issue and give specific
regard to the experience, and independence, of the researchers involved. Devereaux
et al. (2005) go as far as suggesting that ‘increased use of the expertise based design
will enhance the validity, applicability, feasibility, and ethical integrity of randomised
controlled trials’ (p. 388). At minimum, the qualified researcher must be well-versed
in the experimental architecture, because a poorly executed study can then create
unnecessary risk to the participants. If the experiment fails, then whatever level of risk the stimulus and participation in the experiment posed to the participants cannot be justified. For example, if we plan an experiment that is underpowered (i.e. there are too few participants to reliably detect an effect of the expected size), then exposure to a risky intervention would be deemed unnecessary and patently unfair.
Knowledge in these and related matters is therefore an ethical requirement.
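As a rough illustration of such a pre-mortem power check – using the statsmodels library (an assumed dependency) and purely illustrative planning values for effect size, significance level and power – the required sample size per arm for a simple two-group comparison can be computed as follows:

```python
from statsmodels.stats.power import TTestIndPower

# Illustrative planning values only: a small-to-moderate standardised effect,
# the conventional 5% significance level and 80% power.
effect_size = 0.3    # assumed Cohen's d
alpha = 0.05
power = 0.80

n_per_arm = TTestIndPower().solve_power(effect_size=effect_size,
                                        alpha=alpha,
                                        power=power,
                                        alternative="two-sided")
print(f"Participants needed per arm: {n_per_arm:.0f}")   # roughly 175 per arm
```

A back-of-the-envelope calculation like this is no substitute for a full statistical analysis plan, but it is often enough to flag, before fieldwork begins, that a proposed trial cannot justify the risks it asks participants to bear.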
Conclusion
Experimental tools that promote the well-being of participants as well as the effi-
cient operation of the experiment are important. Checklists, reviews and templates
promote good scientific practice, as they enhance transparency, accountability and
overall competence. As we covered in this chapter, protocols have many benefits. For
example, they force the experimenter to follow best practice guidelines: a detailed
and publicised blueprint limits the likelihood of going ‘fishing for statistical signifi-
cance’. Similarly, having a protocol implies that the researcher is obliged to pub-
lish the results of the experiment, analysed based on the parameters set forth in the
experimental protocol. There are no strict rules about the publication format.
5 For example, https://www.crim.cam.ac.uk/global/docs/crimport.pdf/view
6 https://trialsjournal.biomedcentral.com/
There is much that the implementation of field experiments can teach us, and the evidence gradually accumulated in the 'how to' sciences of experi-
ments can be informative and helpful. Tracking of all the inputs, outputs and out-
comes of the experiment can be useful as well, because such research informs us as
to the conditions under which experiments can be more successfully implemented.
Finally, we discussed ethics in experiments. One element to highlight is our argu-
ment for the ethical standing of randomisation: it generates a fair comparison group. Randomisation should therefore be used in situations where one treatment may be better than another, but we want to establish that with greater certainty.
However, other benefits to randomisation make random assignment much more use-
ful to researchers, practitioners and policymakers. Within this context, we suggested
a three-tier process for considering the ethical dimensions of the experiment, which supplements the usual IRB ethics review processes. The more researchers consider
issues such as the benefits and risks to participants, the consent needed from partici-
pants and what to do in lieu of informed consent, as well as the involvement of an
experienced experimenter, the better it would be for proper scientific exploration.
Contents
1. NAME AND HYPOTHESES
2. ORGANISATIONAL FRAMEWORK
3. UNIT OF ANALYSIS
4. ELIGIBILITY CRITERIA
5. PIPELINE: RECRUITMENT OR EXTRACTION OF CASES
6. TIMING
7. RANDOM ASSIGNMENT
8. TREATMENT AND COMPARISON ELEMENTS
9. MEASURING AND MANAGING TREATMENTS
10. MEASURING OUTCOMES
11. ANALYSIS PLAN
12. DUE DATE AND DISSEMINATION PLAN
2. Organisational framework
(Check only one from a, b, c, or d)
3. Unit of analysis
(Check only one)
3.3 __C. Situations (describe category: police–citizen encounters, fights, etc.) ______________
3.4 __D. Other (describe) _________________________________________
4. Eligibility criteria
7. Random assignment
7.1 How will a random assignment sequence be generated?
(Coin toss, every Nth case, and other non-random tools are banned from
CCR-RCT).
7.1.1 Random numbers table with case number sequence; sealed envelopes with case numbers outside and treatment assignment inside, with two-sheet paper surrounding the treatment assignment __
7.1.2 Random numbers case-treatment generator programme on secure
computer __
7.1.3 Other (please describe below) __
7.2 Who is entitled to issue random assignments of treatments?
7.2.1 Role:
7.2.2 Organisation:
8.1.2 What elements must not happen, with dosage level (if measured) indicated.
8.1.2.1 Element A:
8.1.2.2 Element B:
8.1.2.3 Element C:
8.1.2.4 Other elements:
8.2.1.2 Element B:
8.2.1.3 Element C:
8.2.1.4 Other elements:
8.2.2 What elements must not happen, with dosage level (if measured) indicated.
8.2.2.1 Element A:
8.2.2.2 Element B:
8.2.2.3 Element C:
8.2.2.4 Other elements:
9.2 Management
9.2.1 Who will see the treatment measurement data?
9.2.2 How often will treatment measures be circulated to key leaders?
9.2.3 If treatment integrity is challenged, whose responsibility is correction?
10.2 Monitoring
10.2.1 How often will outcome data be monitored?
10.2.2 Who will see the outcome monitoring data?
10.2.3 When will outcome measures be circulated to key leaders?
10.2.4 If the experiment finds early significant differences, what procedure is to be followed?
12.7 Does the principal investigator agree to post any changes in agreements
affecting items 12.1 to 12.6 above?
12.8 Does the principal investigator agree to file a final report within two years
of cessation of experimental operations, no matter what happened to the
experiment? (e.g., ‘random assignment broke down after three weeks and
the experiment was cancelled’ or ‘only 15 cases were referred in the first 12
months and experiment was suspended’).
Chapter Summary
• This chapter lays out the steps that experimenters need to consider when developing
and managing field trials. We introduce three interconnected topics. First, we discuss
the experimental protocol, a vehicle through which better research practices are made
possible.
• There is the pre-experimental protocol, which is the blueprint of the research project. It
is useful because writing up the detailed plan forces the researcher to conduct a
‘pre-mortem diagnosis’ of the possible pitfalls and risks the experiment may encounter –
and then to offer solutions to remedy these threats.
• There is also the post-experimental protocol, usually in the form of a checklist, which
addresses the reporting of the findings of the experiment. There are industry-standard
templates that can help researchers report their findings and identify the key elements
necessary for the scientific community to then be able to critically assess the study’s
findings. Importantly, these protocols are central for reproducibility purposes.
• We then discuss some of the more practical steps in the implementation of
experiments in field settings. These operational issues need to be addressed if we
want to produce the most valid causal estimates. Having the necessary organisational
framework can ‘make or break’ experiments, no matter how potentially effective the
tested treatment.
• We highlight the need for creating an efficient researcher–practitioner coalition, the
necessary role of field managers and why we should pay attention to implementation
science, which can inform us under which conditions the field experiment is more likely
to succeed.
• We talk about the emergence of the pracademic as an entity who can streamline
experiments in field settings, given their double role as both practitioners and
academics.
• Finally, we discuss some of the ethical considerations that one must be aware of when
conducting experiments. We make the case for ethical experimentation and suggest a
framework for considering ethics in experimental designs.
• Our aim for this chapter is to provide a roadmap for anyone interested in designing
and reporting on an effective trial.
Further Reading
World Health Organization, Global Health Ethics Unit. (n.d.). Ethical standards and procedures for research with human beings. www.who.int/ethics/research/en
The issue of ethics in experimental research is so crucial that the World Health Organization has issued special guidelines on the principles to observe for the protection of research
participants. This collection of documents, principles and standards is informative
for any scholar considering applying experimental methods when human beings
are involved. It also illustrates the global view on these ethical issues, indicating the
universality of such ethics.
Schulz, K. F., Altman, D. G., Moher, D., & CONSORT Group. (2010). CONSORT
2010 statement: Updated guidelines for reporting parallel group randomised
trials. Trials, 11(1), 32.
Turner, L., Shamseer, L., Altman, D. G., Schulz, K. F., & Moher, D. (2012). Does use
of the CONSORT statement impact the completeness of reporting of randomised
controlled trials published in medical journals? A Cochrane review. Systematic
Reviews, 1(1), 60.
This chapter briefly introduced the benefits of research protocols for systematic
reviews and meta-analyses. A reliable, consistent approach to reporting the findings
from experiments is crucial for scholars when studying the overall body of evidence
on one research question from multiple sources. Without systematic reporting of all
relevant information about the study and its outcomes, future reviews of evidence
may not accurately synthesise the data in meta-analyses of research. Schulz et al.
offer a review of these issues and how protocols can address such concerns; however, note their varying degrees of success, as discussed in the review
by Turner et al.
Chapter Overview
Not all experimental designs are created equal; choose the most appropriate wisely
The future of experimental disciplines
How do we mainstream experiments into policymaking processes?
Further Reading
This book delved into the type of research method that is meant to uncover causal
relationships between variables. There are many types of research strategies, but to
answer questions that involve cause-and-effect expressions, experiments are required.
There are different experimental designs at our disposal, and the fitness of any model
is dependent on the contextual, theoretical and organisational factors that determine
the choice of design.
In Chapter 1 as well as throughout this book, we have made the case for using RCTs
over other types of causal designs when conducting evaluative research – particularly
in impact evaluations involving policy and tactics. However, we are cognisant that
other experimental designs exist, including advanced statistical designs that aim to
create the conditions analogous to true experiments. These will remain popular and
prevalent. Still, prospective, true experiments with random allocation of units are
preferable over other experimental strategies.
Chapter 2 has paid considerable attention to the benefits of randomisation in
causal research. The primary reason for the superiority of the random allocation of
units into treatment and control arms of the study is its ability to create the most con-
vincing counterfactual conditions. To talk about causality means using a comparison
group, a reasonable benchmark. The ability of randomisation to create a comparable
group through a fair and systematic process of allocating units into the study groups
is supported by the fundamentals of mathematics and probability theory. Over time,
with enough iterations (i.e. trials), the sample means would be as close to the true
population means as possible. There are different methods of randomisation, includ-
ing simple, block or minimisation approaches. Collectively, however, these processes create control groups that are comparable to the treatment groups on all known and unknown, measured and unmeasured, factors, except that the treatment group is exposed to a stimulus but the control group is not. These processes can be applied to different units of analysis and, again, they are usually more advantageous than non-randomisation techniques for creating pretreatment balance between the groups.
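As a minimal sketch of these mechanics – with an arbitrary seed and hypothetical case identifiers, not any particular trial's procedure – simple and block random assignment might be generated as follows:

```python
import random

cases = [f"case_{i:03d}" for i in range(1, 21)]   # 20 hypothetical case identifiers
rng = random.Random(2021)                          # fixed seed so the sequence can be audited

# Simple random assignment: an independent 50/50 draw per case
# (group sizes can drift away from 10/10 by chance).
simple = {case: rng.choice(["treatment", "control"]) for case in cases}

# Randomised block design: within every block of four consecutive cases,
# exactly two are assigned to treatment and two to control.
def block_assign(case_list, rng, block_size=4):
    assignments = {}
    for start in range(0, len(case_list), block_size):
        block = case_list[start:start + block_size]
        labels = ["treatment", "control"] * (block_size // 2)
        rng.shuffle(labels)
        assignments.update(zip(block, labels))
    return assignments

blocked = block_assign(cases, rng)
print(sum(label == "treatment" for label in blocked.values()))   # exactly 10 of 20
```

The blocked version guarantees balanced group sizes throughout recruitment, which is one reason restrictive allocation procedures are attractive in smaller field trials.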
We then offered an exposition of the various threats that different research designs
could face while making causal estimations. Chapter 3, which focused on the ways
through which more control over the administration of the test can be applied,
has highlighted the common threats to the validity of a test and how they can be
mitigated. These hazards can be detrimental, and experimenters go to considerable
lengths to convince the audience of their ability to control for these threats. With this
in mind, however, researchers tend to agree that the threats of history, maturation, testing or instrumentation effects are serious concerns that may undermine the validity of the study's conclusions. In a similar way, biases associated with regression to
the mean, differential selection and treatment spillover effects remain a concern in
every experimental design. As we reviewed the various risks, we arrived at the same conclusion:
Experimental designs are required to make valid causal estimates. A controlled test,
with convincing counterfactual conditions to which the intervention is compared, is
needed if we want to make reasonably sound causal expressions (Farrington, 1983).
The experimental approach has the power to control the course of the trial and to
single out the consequences of the manipulation of interest. Observations alone are
not enough. Surveys and non-experimental designs have the power to present corre-
lational, but not causal, relationships about the world. To estimate causal inferences,
to predict consequences and to make sensible statements about how people may be
affected by a particular intervention, causal designs are needed (Ariel, 2018).
Yes, there is a degree of hierarchy in causal research methods, ranging from pre-experimental, through quasi-experimental, to true experiments. Within true experi-
ments, some models are more dependable than others. All experimenters seek as much control as possible against threats to the validity and reliability of their tests – and there are plenty – but they would tend to agree that true experiments exert more
control than non-true experiments. It is not easy to rule out alternative hypotheses
to an observed effect, even though it is fundamental for the causal stipulation. As
some methodological approaches have a better chance of removing certain threats,
they should be preferred. Experiments with a prospective assignment of units into
treatment and control conditions using random allocation are the answer. Relying on
theorems in mathematics and probability theory, randomisation has the best chance
to control many of these threats. Science has yet to produce a more robust approach
for determining causation, so pre-experimental or quasi-experimental designs are a
compromise when true experiments are not possible.
The case for the principal superiority of true experiments over non-true experiments was made long ago (Fisher, 1935; McCall, 1923): experimental rather than quasi-experimental
designs are preferred for causal assessment, whenever they are possible. Experiments
that use statistical controls in lieu of randomisation do so for practical or ethical reasons. Selection biases and misspecification errors – that is, the inability to
statistically control for confounding variables – are still a major source of concern.
This does not mean we should discount studies based on statistical matching tech-
niques, but when all things are considered, carefully executed randomised controlled
designs have a stronger chance of providing valid causal estimates of the treatment
effect, beyond those produced through statistical models.
To emphasise, there are many situations when prospective RCTs are impossible,
unethical or impractical. If the conditions are not appropriate – for example, when
there is an insufficient number of cases to randomise – then randomisation is inap-
propriate. The test would be ‘doomed to failure’, and a pre-mortem analysis should
have advised against running an RCT under these conditions. Non-true experimental
designs should then be considered.
Furthermore, the contribution to science from any experiment is incontestable.
There are learning points from every study. We have repeatedly tried to make the
case for viewing causal expressions as building blocks of knowledge. No stand-alone
trial is enough. Yet if replications of a test produce a series of reliable outcomes, or
pre-experimental designs reproduce findings similar to those indicated in the literature, then
why should we immediately discount them? There are different ways to isolate the
causal factor from alternative predictors, and when considered holistically, evidence
from all experimental designs is important, even if not all evidence is created equal
(Ariel, 2019).
Sometimes, however, we have a limited body of evidence on a particular research
question, and we are then forced to critically assess the level of evidence produced
against a benchmark. The evidence will be evaluated against the model used – and against how well, as a stand-alone piece of research, the test convinces us of its conclusion. How well would a quasi-experiment control for all the threats to
the validity of the test? Can a pretest–post-test-only experiment reduce our anxiety
about selection biases? Would one experiment with only two units (e.g. one treat-
ment vs one control hotspot) produce valid estimates that are sufficiently strong to
dictate future policing policy? Categorically, the answers to these questions are no.
The risk is too great.
Despite the hierarchical relationship we have just portrayed between the different experimental designs within causal research, how true experiments are viewed by different
social scientists is an ongoing debate. If RCTs are the ‘gold standard’,1 why do we
not see more of them in applied sciences, such as criminology, education or political
sciences? After all, there are just a few hundred randomised experiments in policing,
and even fewer in the political sciences. As the theory behind experimental models
is not contested in mainstream quantitative research (though their applicability is),
why would they not be more prevalent in the social sciences?
In the remaining pages, we try to unearth some of the reasons why we do not have
a blossoming community of experimenters in these sciences, as we see, for instance,
in the biomedical disciplines. We contextualise these reasons within three trends:
proliferation, diversification and collaboration. These discussions will lead us to sug-
gestions on how to mainstream experiments into the social sciences more robustly.
Proliferation
As the experimental community grows and more social science researchers apply
experimental methods to the study of crime and crime policy, the sheer number of randomised trials is rapidly accumulating. Rigorous evalu-
ations are increasingly required, particularly during periods of public expenditure
austerity. With the current budget cuts to crime policies, we simply cannot afford to keep doing the same 'things' that may not provide any benefit. This is a global phenomenon.
Yet, these financial crises should not be wasted and can be viewed as an opportu-
nity: our experience with policymakers around the globe is that they all recognise
that practices must be subject to rigorous testing. Expensive policies and tactics that
only a decade ago would be implemented based on hunches and goodwill are now
required to undergo piloting, impact evaluations and academic assessment prior to
full deployment.
At this juncture, experimental disciplines shine. As more police forces, justice
departments, judiciaries and treatment providers undergo similar processes,
more collaborations between these agencies and experimenters emerge, the num-
ber of research projects increases and we move closer to a vision of experimental
disciplines in which major policy decisions are not taken in the absence of multiple
experiments on an issue (Sherman & Cohn, 1989).
1 The popular title 'gold standard' may be unfitting (see arguments by Sampson, 2010), although we still argue that, in principle, the hierarchical nature of research methods remains intact.
Diversification
With the proliferation and infiltration of experimental work into decision-making
comes a wide gamut of research questions. As is evident from the work of George Mason University's excellent Centre for Evidence-Based Crime Policy, we see an expansion of the units of analysis,
population types and treatment categories under investigation. This is true for all
research methods, but it seems particularly true for experimental designs. For instance,
proactive, rather than reactive, crime policies – embodied chiefly in mature police depart-
ments such as the West Midlands Police and Philadelphia Police Department – seem to
introduce a new model for policymaking: test first, apply later. New methods of pre-
venting crime, managing criminals, caring for vulnerable populations and focusing
on hotspots are constantly mushrooming, and within this general movement, it is no
wonder that experimental criminology is rapidly increasing worldwide.
Randomised trials provide valid causal estimates of treatment effects, which is why
there is a strong consensus that experiments are the ‘gold standard’ for evaluation
research in criminology. What works, what does not work and what is (currently) only promising given limited evidence can be effectively demonstrated through proper tests. A growing number of research questions, mirroring the various tactics that can be effectively analysed experimentally, are put to the test prior to procurement or force-wide implementation. Body-worn cameras, recruiters of criminal networks, honeypots,
soft policing in hotspots, parental patrols in hotspots, GPS and WiFi tracking of offic-
ers, legitimacy in counterterrorism and so on mirror this diversification (Sherman
et al., 2014; Wain et al., 2017).
Collaboration
However, the state of the art of experimental fields can improve in two major areas,
which we feel have transferrable implications for other fields developing experi-
mental approaches. As evident in Braga et al.’s (2014) work on experimenters’
networks, ‘neighbourhoods’ are small and disconnected. Compared perhaps to
experimental psychology, the size of co-authorship and the number of cross-insti-
tutional collaborations should equally grow. The quantity of ties each node – or
experimenter – is associated with is often limited, except those neighbourhoods
associated with Professor Lawrence Sherman and Professor David Weisburd. We
believe that with new experimenters covering new grounds, indeed, the network
will expand with a greater number of degrees of connection between institutions
and researchers. Thus, the more we talk to one another, the better. Yet to do this,
we need more tutoring and a greater willingness by supervisors to work with junior
experimenters on their trials. The Jerry Lee Centre for Experimental Criminology (https://www.crim.cam.ac.uk/research/research-centres/experimental-criminology) is one example of such a mentoring hub.
Linked to our need for more influential nodes in experimental criminology, a major issue that we face is the misalignment between the moving parts associated with controlled trials. As we all know, experiments require great(er) attention to proper
planning and design before the first participant is rolled into the study. In field trials,
where the stakes are high, this process can take months, possibly years, of piloting
and recalibrating, until the proper protocol is established. Yet, this process does not
always correspond with institutional processes, procurement cycles or how long the
decision-maker has before taking on a new role within an organisation. Just as important, funding cycles rarely match the needs of the experiment, which in many ways leaves the game to senior experimenters who can divert available research
funds into new projects. If a chief of police asked us to conduct an experiment on
body-worn cameras and we had to wait for a funding agency to provide resources in
12 to 18 months, clearly the opportunity to conduct the experiment would vanish.
Finally, a stronger link between research institutions and in-house pracademics is
required. As we discussed in Chapter 5, such links will not only widen the network of interested partnerships and cement our role in evidence-based policy, but will also help to streamline experiments. Our experience with the British Transport Police (BTP) and
the leadership of (then) Assistant Chief Constable Mark Newton is an example of an
incredible academic–practitioner relationship. BTP has materialised Sherman’s model
of ‘Totally Evidence-Based Policing Agency’ (Sherman, 2015) with more experiments
under way, a strong steering towards using the scientific approach in policing and
continuous learning. This is the ‘scientification’ of policing. A strong and dedicated
team of analysts constantly seek 'testable' questions in a policing environment and then design and conduct tests and analyse the results prior to deployment. Only time will tell
whether the infiltration of pracademics into experimental disciplines will be a game
changer in the (much-needed) proliferation of experimental designs in policymaking.
Further Reading
Farrington, D. P., Lösel, F., Boruch, R. F., Gottfredson, D. C., Mazerolle, L., Sherman,
L. W., & Weisburd, D. (2019). Advancing knowledge about replication in
criminology. Journal of Experimental Criminology, 15(3), 373–396.
Control group: A comparison group that is not exposed to the studied treatment. In experiments, participants can be assigned to the control group either randomly (by chance) or using statistical matching techniques when randomisation is not possible. Control groups can comprise no-treatment, placebo or alternative-treatment conditions.
Covariate: A variable associated with the outcome variable that can therefore affect
the relationship between the studied intervention and the outcome variable. These
extraneous variables are included in quasi-experimental designs to rule out alternative
explanations to the observed change in the outcome variable, as well as to increase
the precision of the overall causal model.
Effect size: The magnitude of the difference between treatment and control
conditions following the intervention, expressed in standardised units.
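One widely used standardised measure, for example, is Cohen's d: the difference between the treatment and control group means divided by the pooled standard deviation,

$$
d = \frac{\bar{X}_{\mathrm{treatment}} - \bar{X}_{\mathrm{control}}}{s_{\mathrm{pooled}}},
\qquad
s_{\mathrm{pooled}} = \sqrt{\frac{(n_T - 1)\,s_T^{2} + (n_C - 1)\,s_C^{2}}{n_T + n_C - 2}}
$$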
External validity: The degree to which the study outcomes can be generalised to
different people, places, times and contexts; often expressed in narrative rather than
mathematical terms.
Implementation: A set of processes that have taken place (or that have been
withheld) as an indispensable part of the studied treatment and its effects.
Interaction effect: Situations where two or more variables jointly affect the dependent variable, such that their combined effect differs from the sum of their separate effects; the combination can thus be considered a new treatment term.
Internal validity: The degree to which the inference about the causal relationship
between the independent and dependent variables is valid.
Null hypothesis: A statement about the lack of a relationship between the variables under investigation. As the statement that is tested (and potentially rejected) in the experiment, the null hypothesis serves as the starting point in experimental research.
Participant: Any type of unit that takes part in a study, such as individuals, cases
or groups.
explore how a trend in the dependent variable was ‘interrupted’ by a treatment effect
(also known as interrupted time-series analysis).
Check out the next title in the collection, Linear Regression: An Introduction to Statistical Models, for guidance on linear regression, the fundamental statistical model used in quantitative social research.
Abadie, A., & Gardeazabal, J. (2003). The economic costs of conflict: A case study
of the Basque Country. American Economic Review, 93(1), 113–132. https://doi.
org/10.1257/000282803321455188
Abell, P., & Engel, O. (2019). Subjective causality and counterfactuals in the social
sciences: Toward an ethnographic causality? Sociological Methods & Research.
Advance online publication. https://doi.org/10.1177/0049124119852373
Abend, G., Petre, C., & Sauder, M. (2013). Styles of causal thought: An empirical
investigation. American Journal of Sociology, 119(3), 602–654. https://doi.
org/10.1086/675892
Abou-El-Fotouh, H. A. (1976). Relative efficiency of the randomised complete
block design. Experimental Agriculture, 12(2), 145–149. https://doi.org/10.1017/
S0014479700007213
Abramowitz, M., & Stegun, I. A. (Eds.). (1972). Handbook of mathematical functions
with formulas, graphs, and mathematical tables (Vol. 55, 10th Printing). National
Bureau of Standards.
Alderman, T. (2020). Can a police-delivered intervention enhance students’ online
safety? A cluster randomised controlled trial on the effect of the ThinkUKnow
programme in the Australian Capital Territory [Unpublished master’s
dissertation]. University of Cambridge.
Allen, M. (Ed.). (2017). The SAGE encyclopaedia of communication research methods.
Sage. https://doi.org/10.4135/9781483381411
Alm, J. (1991). A perspective on the experimental analysis of taxpayer reporting.
Accounting Review, 66(3), 577–593.
Altman, D. G. (1985). Comparability of randomised groups. Journal of the
Royal Statistical Society: Series D (The Statistician), 34(1), 125–136. https://doi.
org/10.2307/2987510
Altman, D. G. (1990). Practical statistics for medical research. CRC Press. https://doi.
org/10.1201/9780429258589
Apel, R. J., & Sweeten, G. (2010). Propensity score matching in criminology and
criminal justice. In A. Piquero & D. Weisburd (Eds.), Handbook of quantitative
criminology (pp. 543–562). Springer. https://doi.org/10.1007/978-0-387-77650-7_26
Apospori, E., & Alpert, G. (1993). Research note: The role of differential experience
with the criminal justice system in changes in perceptions of severity of
legal sanctions over time. Crime & Delinquency, 39(2), 184–194. https://doi.
org/10.1177/0011128793039002004
Ariel, B. (2009, January). Systematic review of baseline imbalances in randomised
controlled trials in criminology [Paper presentation]. The communicating
complex statistical evidence conference, University of Cambridge, UK.
Ariel, B. (2011, July 5). London underground crime data (2009–2011) & Operation
“BTP-LU-RCT” [Paper presentation]. The 4th international NPIA-Cambridge
conference on evidence-based policing, Cambridge, UK.
Ariel, B. (2012). Deterrence and moral persuasion effects on corporate tax
compliance: Findings from a randomised controlled trial. Criminology, 50(1),
27–69. https://doi.org/10.1111/j.1745-9125.2011.00256.x
Ariel, B. (2016). Police body cameras in large police departments. Journal of Criminal
Law & Criminology, 106(4), 729–768.
Ariel, B. (2018). Not all evidence is created equal: On the importance of matching
research questions with research methods in evidence based policing. In R. Mitchell
& L. Huey (Eds.), Evidence based policing: An introduction (pp. 63–86). Policy Press.
Ariel, B. (2019). Technology in policing. In D. Weisburd & A. A. Braga (Eds.),
Innovations in policing: Contrasting perspectives (2nd ed., pp. 521–516). Cambridge
University Press.
Ariel, B., & Bland, M. (2019). Is crime rising or falling? A comparison of police-
recorded crime and victimization surveys. Methods of Criminology and Criminal
Justice Research (Sociology of Crime, Law and Deviance, Vol. 24, pp. 7–31). Emerald
Publishing Limited.
Ariel, B., Bland, M., & Sutherland, A. (2017). ‘Lowering the threshold of effective
deterrence’—Testing the effect of private security agents in public spaces on
crime: A randomized controlled trial in a mass transit system. PLOS ONE, 12(12),
e0187392.
Ariel, B., Englefield, A., & Denley, J. (2019). “I heard it through the grapevine”: A
randomised controlled trial on the direct and vicarious effects of preventative
specific deterrence initiatives in criminal networks. Journal of Criminal Law &
Criminology, 109(4), 819–867.
Ariel, B., & Farrar, W. (2012). The Rialto Police Department wearable cameras
experiment experimental protocol: CRIMPORT. Institute of Criminology,
University of Cambridge.
Ariel, B., Farrar, W. A., & Sutherland, A. (2015). The effect of police body-worn
cameras on use of force and citizens’ complaints against the police: A randomised
controlled trial. Journal of Quantitative Criminology, 31(3), 509–535. https://doi.
org/10.1007/s10940-014-9236-3
Ariel, B., & Farrington, D. P. (2014). Randomised block designs. In G. Bruinsma & D.
Weisburd (Eds.), Encyclopedia of criminology and criminal justice (pp. 4273–4283).
Springer. https://doi.org/10.1007/978-1-4614-5690-2_52
Ariel, B., Garner, G., Strang, H., & Sherman, L. W. (2019, June 6). Creating a
critical mass for a global movement in evidence-based policing: The Cambridge
Pracademia [Paper Presentation]. 2019 Drapkin Symposium, Hebrew University,
Jerusalem, Israel.
Ariel, B., & Langley, B. (2019, July 10). Procedural justice in preventing terrorism:
An RCT and rollout [Paper presentation]. The 12th international evidence-based
policing conference, Cambridge, UK.
Ariel, B., Lawes, D., Weinborn, C., Henry, R., Chen, K., & Brants Sabo, H. (2019).
The ‘less-than-lethal weapons effect’ – Introducing TASERs to routine police
operations in England and Wales: A randomised controlled trial. Criminal Justice
and Behavior, 46(2), 280–300. https://doi.org/10.1177/0093854818812918
Ariel, B., Mitchell, R. J., Tankebe, J., Firpo, M. E., Fraiman, R., & Hyatt, J. M. (2020).
Using wearable technology to increase police legitimacy in Uruguay: The case of
body-worn cameras. Law & Social Inquiry, 45(1), 52–80. https://doi.org/10.1017/
lsi.2019.13
Ariel, B., Newton, M., McEwan, L., Ashbridge, G. A., Weinborn, C., & Brants,
H. S. (2019). Reducing assaults against staff using body-worn cameras
(BWCs) in railway stations. Criminal Justice Review, 44(1), 76–93. https://doi.
org/10.1177/0734016818814889
Ariel, B., & Partridge, H. (2017). Predictable policing: Measuring the crime control
benefits of hotspots policing at bus stops. Journal of Quantitative Criminology,
33(4), 809–833. https://doi.org/10.1007/s10940-016-9312-y
Ariel, B., & Sherman, L. W. (2012). Mandatory arrest for misdemeanour domestic
violence effects on repeat offending: Protocol (1–30). Campbell Systematic Reviews,
8(1), 1–30. https://doi.org/10.1002/CL2.85
Ariel, B., Sherman, L. W., & Newton, M. (2020). Testing hot-spots police patrols
against no-treatment controls: Temporal and spatial deterrence effects in the
London Underground experiment. Criminology, 58(1), 101–128. https://doi.
org/10.1111/1745-9125.12231
Ariel, B., Sutherland, A., & Bland, M. (2019). The trick does not work if you have
already seen the gorilla: How anticipatory effects contaminate pre-treatment
Baird, S., Bohren, J. A., McIntosh, C., & Özler, B. (2014). Designing experiments to
measure spillover effects. The World Bank. https://doi.org/10.1596/1813-9450-6824
Baker, M. (2016). Reproducibility crisis. Nature, 533(7604), 353–366.
Barnard, J., Du, J., Hill, J. L., & Rubin, D. B. (1998). A broader template for analyzing
broken randomised experiments. Sociological Methods & Research, 27(2), 285–317.
https://doi.org/10.1177/0049124198027002005
Barnes, G. C., Ahlman, L., Gill, C., Sherman, L. W., Kurtz, E., & Malvestuto,
R. (2010). Low-intensity community supervision for low-risk offenders: A
randomised, controlled trial. Journal of Experimental Criminology, 6(2), 159–189.
https://doi.org/10.1007/s11292-010-9094-4
Barnes, G. C., Hyatt, J. M., & Sherman, L. W. (2017). Even a little bit helps: An
implementation and experimental evaluation of cognitive-behavioral therapy for
high-risk probationers. Criminal Justice and Behavior, 44(4), 611–630. https://doi.
org/10.1177/0093854816673862
Barnes, G. C., Williams, S., Sherman, L. W., Parmar, J., House, P., & Brown, S. A.
(2020). Sweet spots of residual deterrence: A randomized crossover experiment in
minimalist police patrol. SocArXiv. https://doi.org/10.31235/osf.io/kwf98
Bartos, B. J., McCleary, R., Mazerolle, L., & Luengen, K. (2020). Controlling gun violence: Assessing the impact of Australia's Gun Buyback Program using a
synthetic control group experiment. Prevention Science, 21(1), 131–136.
Baumeister, R. F., & Leary, M. R. (1997). Writing narrative literature reviews. Review
of General Psychology, 1(3), 311–320. https://doi.org/10.1037/1089-2680.1.3.311
Beebee, H., Hitchcock, C., & Menzies, P. (Eds.). (2009). The Oxford handbook
of causation. Oxford University Press. https://doi.org/10.1093/oxfor
dhb/9780199279739.001.0001
Beller, E. M., Gebski, V., & Keech, A. C. (2002). Randomisation in clinical
trials. Medical Journal of Australia, 177(10), 565–567. https://doi.
org/10.5694/j.1326-5377.2002.tb04955.x
Bellg, A. J., Borrelli, B., Resnick, B., Hecht, J., Minicucci, D. S., Ory, M., Ogedegbe,
G., Orwig, D., Ernst, D., & Czajkowski, S. (2004). Enhancing treatment fidelity in
health behavior change studies: Best practices and recommendations from the
NIH Behavior Change Consortium. Health Psychology, 23(5), 443–451. https://doi.
org/10.1037/0278-6133.23.5.443
Bennett, S., Mazerolle, L., Antrobus, E., Eggins, E., & Piquero, A. R. (2018). Truancy
intervention reduces crime: Results from a randomised field trial. Justice Quarterly,
35(2), 309–329. https://doi.org/10.1080/07418825.2017.1313440
Bennett, S., Newman, M., & Sydes, M. (2017). Mobile police community office: A
vehicle for reducing crime, crime harm and enhancing police legitimacy? Journal
Braga, A. A., Papachristos, A., & Hureau, D. (2012). Hot spots policing effects on
crime. Campbell Systematic Reviews, 8(1), 1–96. https://doi.org/10.4073/csr.2012.8
Braga, A. A., Pierce, G. L., McDevitt, J., Bond, B. J., & Cronin, S. (2008). The strategic
prevention of gun violence among gang-involved offenders. Justice Quarterly,
25(1), 132–162. https://doi.org/10.1080/07418820801954613
Braga, A. A., Turchan, B. S., Papachristos, A. V., & Hureau, D. M. (2019). Hot spots
policing and crime reduction: An update of an ongoing systematic review and
meta-analysis. Journal of Experimental Criminology, 15(3), 289–311. https://doi.
org/10.1007/s11292-019-09372-3
Braga, A. A., & Weisburd, D. L. (2010). Policing problem places: Crime hot spots and
effective prevention. Oxford University Press. https://doi.org/10.1093/acprof:
oso/9780195341966.001.0001
Braga, A. A., Weisburd, D. L., & Turchan, B. (2019). Focused deterrence strategies
effects on crime: A systematic review. Campbell Systematic Reviews, 15(3), Article
e1051. https://doi.org/10.1002/cl2.1051
Braga, A. A., Weisburd, D. L., Waring, E. J., Mazerolle, L. G., Spelman, W., &
Gajewski, F. (1999). Problem-oriented policing in violent crime places: A
randomised controlled experiment. Criminology, 37(3), 541–580. https://doi.
org/10.1111/j.1745-9125.1999.tb00496.x
Braga, A. A., Welsh, B. C., Papachristos, A. V., Schnell, C., & Grossman, L. (2014).
The growth of randomised experiments in policing: The vital few and the
salience of mentoring. Journal of Experimental Criminology, 10(1), 1–28. https://
doi.org/10.1007/s11292-013-9183-2
Braithwaite, J., & Makkai, T. (1994). Trust and compliance. Policing and Society, 4(1),
1–12. https://doi.org/10.1080/10439463.1994.9964679
Brantingham, P. L., & Brantingham, P. J. (1999). A theoretical model of crime hot
spot generation. Studies on Crime & Crime Prevention, 8(1), 7–26.
Brants-Sabo, H., & Ariel, B. (2020). Evidence map of school-based violence
prevention programs in Israel. International Criminal Justice Review. Advance
online publication. https://doi.org/10.1177/1057567720967074
Braucht, G. N., & Reichardt, C. S. (1993). A computerized approach to trickle-
process, random assignment. Evaluation Review, 17(1), 79–90. https://doi.org/10.1
177/0193841X9301700106
Braver, M. W., & Braver, S. L. (1988). Statistical treatment of the Solomon four-group
design: A meta-analytic approach. Psychological Bulletin, 104(1), 150–154. https://
doi.org/10.1037/0033-2909.104.1.150
Britt, C. L., & Weisburd, D. (2010). Statistical power. In A. Piquero & D. Weisburd
(Eds.), Handbook of quantitative criminology (pp. 313–332). Springer. https://doi.
org/10.1007/978-0-387-77650-7_16
Cardwell, S. M., Mazerolle, L., & Piquero, A. R. (2019). Truancy intervention and
violent offending: Evidence from a randomised controlled trial. Aggression and
Violent Behavior, 49, Article 101308. https://doi.org/10.1016/j.avb.2019.07.003
Carr, R., Slothower, M., & Parkinson, J. (2017). Do gang injunctions reduce violent
crime? Four tests in Merseyside, UK. Cambridge Journal of Evidence-Based Policing,
1(4), 195–210. https://doi.org/10.1007/s41887-017-0015-x
Cartwright, N. (2004). Causation: One word, many things. Philosophy of Science,
71(5), 805–819. https://doi.org/10.1086/426771
Cartwright, N., & Hardie, J. (2012). Evidence-based policy: A practical guide to
doing it better. Oxford University Press. https://doi.org/10.1093/acprof:os
obl/9780199841608.001.0001
Casadevall, A., & Fang, F. C. (2010). Reproducible science. Infection and Immunity,
78(12), 4972–4975. https://doi.org/10.1128/IAI.00908-10
Casini, L. (2012). Causation: Many words, one thing? THEORIA, 27(2), 203–219.
https://doi.org/10.1387/theoria.4067
Chalmers, I. (2001). Comparing like with like: Some historical milestones in the
evolution of methods to create unbiased comparison groups in therapeutic
experiments. International Journal of Epidemiology, 30(5), 1156–1164. https://doi.
org/10.1093/ije/30.5.1156
Cheng, S. Y., Davis, M., Jonson-Reid, M., & Yaeger, L. (2019). Compared to what?
A meta-analysis of batterer intervention studies using nontreated controls or
comparisons. Trauma, Violence, & Abuse. Advance online publication. https://doi.
org/10.1177/1524838019865927
Chivers, B., & Barnes, G. (2018). Sorry, wrong number: Tracking court attendance
targeting through testing a “nudge” text. Cambridge Journal of Evidence-Based
Policing, 2(1–2), 4–34. https://doi.org/10.1007/s41887-018-0023-5
Chow, S.-C., & Liu, J.-P. (2004). Design and analysis of clinical trials: Concepts and
methodologies. Wiley-IEEE.
Chu, R., Walter, S. D., Guyatt, G., Devereaux, P. J., Walsh, M., Thorlund, K., &
Thabane, L. (2012). Assessment and implication of prognostic imbalance in
randomised controlled trials with a binary outcome: A simulation study. PLOS
ONE, 7(5), Article e36677. https://doi.org/10.1371/journal.pone.0036677
Clarke, R. V. G., & Cornish, D. B. (1972). The controlled trial in institutional research:
Paradigm or pitfall for penal evaluators? (Home Office Research Studies No. 15). Her
Majesty’s Stationery Office. http://library.college.police.uk/docs/hors/hors15.pdf
Cochran, W. G., & Cox, G. M. (1957). Experimental designs. Wiley.
Cohen, J. (2013). Statistical power analysis for the behavioral sciences. Academic Press.
https://doi.org/10.4324/9780203771587
Danziger, S., Levav, J., & Avnaim-Pesso, L. (2011). Extraneous factors in judicial
decisions. Proceedings of the National Academy of Sciences of the United States of
America, 108(17), 6889–6892. https://doi.org/10.1073/pnas.1018033108
Davies, P. (1999). What is evidence-based education? British Journal of Educational
Studies, 47(2), 108–121. https://doi.org/10.1111/1467-8527.00106
Davies, P., & Francis, P. (2018). Doing criminological research. Sage.
Davis, R. C., Weisburd, D., & Hamilton, E. E. (2010). Preventing repeat incidents of
family violence: A randomised field test of a second responder program. Journal
of Experimental Criminology, 6(4), 397–418. https://doi.org/10.1007/s11292-010-
9107-3
Dawid, A. P. (2000). Causal inference without counterfactuals. Journal of the
American Statistical Association, 95(450), 407–424. https://doi.org/10.1080/016214
59.2000.10474210
Dawson, T. E. (1997). A primer on experimental and quasi-experimental design
(ED406440). ERIC. https://files.eric.ed.gov/fulltext/ED406440.pdf
Day, S. J., & Altman, D. G. (2000). Blinding in clinical trials and other studies.
British Medical Journal, 321(7259), 504. https://doi.org/10.1136/bmj.321.7259.504
DeAngelo, G., Toger, M., & Weisburd, S. (2020). Police response times and injury
outcomes (CEPR Discussion Paper No. DP14536). SSRN. https://ssrn.com/
abstract=3594157
De Boer, M. R., Waterlander, W. E., Kuijper, L. D., Steenhuis, I. H., & Twisk, J. W.
(2015). Testing for baseline differences in randomised controlled trials: An
unhealthy research behavior that is hard to eradicate. International Journal of
Behavioral Nutrition and Physical Activity, 12(1), Article 4. https://doi.org/10.1186/
s12966-015-0162-z
De Brito, C., & Ariel, B. (2017). Does tracking and feedback boost patrol time in
hot spots? Two tests. Cambridge Journal of Evidence-Based Policing, 1(4), 244–262.
https://doi.org/10.1007/s41887-017-0018-7
Delaney, C. (2006). The effects of focused deterrence on gang homicide: An evaluation of
Rochester’s ceasefire program [Master’s thesis, Rochester Institute of Technology].
RIT Scholar Works. https://scholarworks.rit.edu/cgi/viewcontent.cgi?article=8208
&context=theses
Denley, J., & Ariel, B. (2019). Whom should we target to prevent? Analysis
of organized crime in England using intelligence records. European Journal
of Crime, Criminal Law and Criminal Justice, 27(1), 13–44. https://doi.
org/10.1163/15718174-02701003
Devereaux, P. J., Bhandari, M., Clarke, M., Montori, V. M., Cook, D. J., Yusuf, S.,
Sackett, D. L., Cina, C. S., Walter, S. D., Haynes, B., Schunemann, H. J.,
Norman, G. R., & Guyatt, G. H. (2005). Need for expertise based randomised
controlled trials. British Medical Journal, 330(7482), 88. https://doi.
org/10.1136/bmj.330.7482.88
De Winter, J. C. (2013). Using the Student’s t-test with extremely small sample sizes.
Practical Assessment, Research, and Evaluation, 18, Article 10.
Dezember, A., Stoltz, M., & Marmolejo, L. (2020). The lack of experimental research
in criminology: Evidence from Criminology and Justice Quarterly. Journal of
Experimental Criminology. Advance online publication. https://doi.org/10.1007/
s11292-020-09425-y
Dittmann, M. (2004). What makes good people do bad things. Monitor on
Psychology, 35(9), 68. https://doi.org/10.1037/e309182005-051
Donner, A., & Klar, N. (2010). Design and analysis of cluster randomisation trials in
health research. Arnold.
Drezner, D. W. (2020). The toddler in chief: What Donald Trump teaches us about
the modern presidency. University of Chicago Press. https://doi.org/10.7208/
chicago/9780226714394.001.0001
Drover, P., & Ariel, B. (2015). Leading an experiment in police body-worn
video cameras. International Criminal Justice Review, 25(1), 80–97. https://doi.
org/10.1177/1057567715574374
Duckett, S., & Griffiths, K. (2016). Perils of place: Identifying hotspots of health
inequalities (Report No. 2016-10). Grattan Institute. https://grattan.edu.au/
wp-content/uploads/2016/07/874-Perils-of-Place.pdf
Duckworth, A. L., & Kern, M. L. (2011). A meta-analysis of the convergent validity
of self-control measures. Journal of Research in Personality, 45(3), 259–268. https://
doi.org/10.1016/j.jrp.2011.02.004
Dudfield, G., Angel, C., Sherman, L. W., & Torrence, S. (2017). The “power curve”
of victim harm: Targeting the distribution of crime harm index values across
all victims and repeat victims over 1 year. Cambridge Journal of Evidence-Based
Policing, 1(1), 38–58. https://doi.org/10.1007/s41887-017-0001-3
Dukes, R. L., Ullman, J. B., & Stein, J. A. (1995). An evaluation of DARE (Drug Abuse
Resistance Education), using a Solomon four-group design with latent variables.
Evaluation Review, 19(4), 409–435. https://doi.org/10.1177/0193841X9501900404
Dulachan, D. (2014). Tracking citizens’ complaints against police in Trinidad (Trinidad
and Tobago) [Unpublished master’s thesis]. Institute of Criminology, University of
Cambridge.
Eby, L. T., Allen, T. D., Evans, S. C., Ng, T., & DuBois, D. L. (2008). Does mentoring
matter? A multidisciplinary meta-analysis comparing mentored and non-
mentored individuals. Journal of Vocational Behavior, 72(2), 254–267. https://doi.
org/10.1016/j.jvb.2007.04.005
Eck, J. E., & Weisburd, D. (Eds.). (1995). Crime and place (Vol. 4). Criminal Justice Press.
Efron, B. (1971). Forcing a sequential experiment to be balanced. Biometrika, 58(3),
403–417. https://doi.org/10.1093/biomet/58.3.403
Ellis, S., & Arieli, S. (1999). Predicting intentions to report administrative and
disciplinary infractions: Applying the reasoned action model. Human Relations,
52(7), 947–967. https://doi.org/10.1177/001872679905200705
Elvik, R. (2016). Association between increase in fixed penalties and road safety
outcomes: A meta-analysis. Accident Analysis & Prevention, 92, 202–210. https://
doi.org/10.1016/j.aap.2016.03.028
Engel, R. J., & Schutt, R. K. (2014). Fundamentals of social work research. Sage.
Englefield, A., & Ariel, B. (2017). Searching for influencing actors in co-offending
networks: The recruiter. International Journal of Social Science Studies, 5(5), 24–45.
https://doi.org/10.11114/ijsss.v5i5.2351
Fagerland, M. W. (2012). T-tests, non-parametric tests, and large studies: A paradox
of statistical practice? BMC Medical Research Methodology, 12(1), Article 78. https://
doi.org/10.1186/1471-2288-12-78
Farrington, D. P. (1983). Randomised experiments on crime and justice. Crime and
Justice, 4, 257–308. https://doi.org/10.1086/449091
Farrington, D. P. (1986). Age and crime. Crime and Justice, 7, 189–250. https://doi.
org/10.1086/449114
Farrington, D. P. (2003a). British randomised experiments on crime and justice.
Annals of the American Academy of Political and Social Science, 589(1), 150–167.
https://doi.org/10.1177/0002716203254695
Farrington, D. P. (2003b). A short history of randomized experiments in
criminology. Evaluation Review, 27(3), 218–227. https://doi.
org/10.1177/0193841X03027003002
Farrington, D. P. (2006). Developmental criminology and risk-focused prevention.
In M. Maguire, R. Morgan, & R. Reiner (Eds.), The Oxford handbook of criminology
(pp. 657–701). Oxford University Press.
Farrington, D. P., Gottfredson, D. C., Sherman, L. W., & Welsh, B. C. (2002). The
Maryland scientific methods scale. In D. P. Farrington, D. L. MacKenzie,
L. W. Sherman, & B. C. Welsh (Eds.), Evidence-based crime prevention (pp. 13–21).
Routledge. https://doi.org/10.4324/9780203166697_chapter_2
Farrington, D. P., & Petrosino, A. (2001). The Campbell Collaboration Crime and
Justice Group. Annals of the American Academy of Political and Social Science, 578(1),
35–49. https://doi.org/10.1177/000271620157800103
Farrington, D. P., & Welsh, B. C. (2005). Randomised experiments in criminology:
What have we learned in the last two decades? Journal of Experimental
Criminology, 1(1), 9–38. https://doi.org/10.1007/s11292-004-6460-0
Gelman, A., Skardhamar, T., & Aaltonen, M. (2020). Type M error might explain
Weisburd’s paradox. Journal of Quantitative Criminology, 36(2), 295–304. https://
doi.org/10.1007/s10940-017-9374-5
Gerber, A. S., & Green, D. P. (2011, July). Field experiments and natural
experiments. In R. E. Goodin (Ed.), The Oxford handbook of political science
(pp. 1108–1132). Oxford University Press. https://doi.org/10.1093/oxfor
dhb/9780199604456.013.0050
Gesch, C. B., Hammond, S. M., Hampson, S. E., Eves, A., & Crowder, M. J. (2002).
Influence of supplementary vitamins, minerals and essential fatty acids on the
antisocial behaviour of young adult prisoners: Randomised, placebo-controlled
trial. British Journal of Psychiatry, 181(1), 22–28. https://doi.org/10.1192/
bjp.181.1.22
Gibaldi, M., & Sullivan, S. (1997). Intention-to-treat analysis in randomized trials:
Who gets counted? Journal of Clinical Pharmacology, 37(8), 667–672. https://doi.
org/10.1002/j.1552-4604.1997.tb04353.x
Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on
“estimating the reproducibility of psychological science.” Science, 351(6277),
1037. https://doi.org/10.1126/science.aad7243
Gill, C. E. (2011). Missing links: How descriptive validity impacts the policy
relevance of randomised controlled trials in criminology. Journal of Experimental
Criminology, 7(3), Article 201. https://doi.org/10.1007/s11292-011-9122-z
Gill, C. E., & Weisburd, D. (2013). Increasing equivalence in small-sample place-
based experiments: Taking advantage of block randomisation methods. In
B. Welsh, A. Braga, & G. Bruinsma (Eds.), Experimental criminology: Prospects for
advancing science and public policy (pp. 141–162). Cambridge University Press.
https://doi.org/10.1017/CBO9781139424776.011
Gill, J. L. (1984). Heterogeneity of variance in randomised block experiments.
Journal of Animal Science, 59(5), 1339–1344. https://doi.org/10.2527/
jas1984.5951339x
Giraldo, O., Garcia, A., & Corcho, O. (2018). A guideline for reporting experimental
protocols in life sciences. PeerJ, 6, Article e4795. https://doi.org/10.7717/
peerj.4795
Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational
Researcher, 5(10), 3–8. https://doi.org/10.3102/0013189X005010003
Goldacre, B. (2014). Bad pharma: How drug companies mislead doctors and harm
patients. Macmillan.
Gondolf, E. W. (2009a). Implementing mental health treatment for batterer
program participants: Interagency breakdowns and underlying issues. Violence
Against Women, 15(6), 638–655. https://doi.org/10.1177/1077801209332189
Grossmith, L., Owens, C., Finn, W., Mann, D., Davies, T., & Baika, L. (2015). Police,
camera, evidence: London’s cluster randomised controlled trial of body worn video.
College of Policing. https://bja.ojp.gov/sites/g/files/xyckuh186/files/bwc/pdfs/
CoPBWVreportNov2015.pdf
Guo, Y., Kopec, J. A., Cibere, J., Li, L. C., & Goldsmith, C. H. (2016). Population
survey features and response rates: A randomised experiment. American
Journal of Public Health, 106(8), 1422–1426. https://doi.org/10.2105/
AJPH.2016.303198
Haberman, C. P., Clutter, J. E., & Henderson, S. (2018). A quasi-experimental
evaluation of the impact of bike-sharing stations on micro-level robbery
occurrence. Journal of Experimental Criminology, 14(2), 227–240. https://doi.
org/10.1007/s11292-017-9312-4
Hadorn, D. C., Baker, D., Hodges, J. S., & Hicks, N. (1996). Rating the quality of
evidence for clinical practice guidelines. Journal of Clinical Epidemiology, 49(7),
749–754. https://doi.org/10.1016/0895-4356(96)00019-4
Hallstrom, A., & Davis, K. (1988). Imbalance in treatment assignments in stratified
blocked randomisation. Controlled Clinical Trials, 9(4), 375–382. https://doi.
org/10.1016/0197-2456(88)90050-5
Handelsman, J., Ebert-May, D., Beichner, R., Bruns, P., Chang, A., DeHaan, R.,
Gentile, J., Lauffer, S., Stewart, J., Tilghman, S. M., & Wood, W. B. (2004).
Scientific teaching. Science, 304(5670), 521–522. https://doi.org/10.1126/
science.1096022
Hare, A. P. (1976). Handbook of small group research. Free Press.
Harris, W. S., Gowda, M., Kolb, J. W., Strychacz, C. P., Vacek, J. L., Jones, P. G.,
Forker, A., O’Keefe, J. H., & McCallister, B. D. (1999). A randomized, controlled
trial of the effects of remote, intercessory prayer on outcomes in patients
admitted to the coronary care unit. Archives of Internal Medicine, 159(19),
2273–2278. https://doi.org/10.1001/archinte.159.19.2273
Harrison, G. W., & List, J. A. (2004). Field experiments. Journal of Economic Literature,
42(4), 1009–1055. https://doi.org/10.1257/0022051043004577
Hasisi, B., Shoham, E., Weisburd, D., Haviv, N., & Zelig, A. (2016). The “care
package,” prison domestic violence programs and recidivism: A quasi-
experimental study. Journal of Experimental Criminology, 12(4), 563–586. https://
doi.org/10.1007/s11292-016-9266-y
Haviland, A. M., & Nagin, D. S. (2005). Causal inference with group-based trajectory
models. Psychometrika, 70(3), 557–578. https://doi.org/10.1007/s11336-004-
1261-y
Haviland, A. M., Nagin, D. S., & Rosenbaum, P. R. (2007). Combining propensity
score matching and group-based trajectory analysis in an observational study.
Psychological Methods, 12(3), 247–267.
Hollis, S., & Campbell, F. (1999). What is meant by intention to treat analysis?
Survey of published randomised controlled trials. British Medical Journal,
319(7211), 670–674. https://doi.org/10.1136/bmj.319.7211.670
Hough, M., Bradford, B., Jackson, J., & Quinton, P. (2016). Does legitimacy necessarily
tame power? Some ethical issues in translating procedural justice principles into
justice policy (LSE Legal Studies Working Paper No. 13/2016). SSRN. https://doi.
org/10.2139/ssrn.2783799
Høye, A. (2010). Are airbags a dangerous safety measure? A meta-analysis of the
effects of frontal airbags on driver fatalities. Accident Analysis & Prevention, 42(6),
2030–2040. https://doi.org/10.1016/j.aap.2010.06.014
Høye, A. (2014). Speed cameras, section control, and kangaroo jumps: A meta-
analysis. Accident Analysis & Prevention, 73, 200–208. https://doi.org/10.1016/j.
aap.2014.09.001
Huey, L., & Mitchell, R. J. (2016). Unearthing hidden keys: Why pracademics are an
invaluable (if underutilized) resource in policing research. Policing, 10(3), 300–307.
https://doi.org/10.1093/police/paw029
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and
bias in research findings. Sage.
Irving, B., & Hilgendorf, L. (1980). Police interrogation: A case study of current practice.
Her Majesty’s Stationery Office.
Israel, T., Harkness, A., Delucio, K., Ledbetter, J. N., & Avellar, T. R. (2014).
Evaluation of police training on LGBTQ issues: Knowledge, interpersonal
apprehension, and self-efficacy. Journal of Police and Criminal Psychology, 29(2),
57–67. https://doi.org/10.1007/s11896-013-9132-z
Israel Police Service. (2019). Annual report of Israel Police for 2018. www.gov.il/
BlobFolder/reports/police_annual_report_under_the_freedom_of_information_
law_2018/he/annual_report_under_the_freedom_of_information_law_2018.pdf
Jacobs, J. B. (1989). Drunk driving: An American dilemma. University of Chicago Press.
https://doi.org/10.7208/chicago/9780226222905.001.0001
Jaitman, L. (2018). Frontiers in the economics of crime: Lessons for Latin America and
the Caribbean (Technical Note No. IDB-TN-01596). Office of Strategic Planning
and Development Effectiveness, Inter-American Development Bank. https://doi.
org/10.18235/0001482
Jansson, K. (2007). British Crime Survey: Measuring crime for 25 years. Home Office.
https://webarchive.nationalarchives.gov.uk/20100408175022/http://www.
homeoffice.gov.uk/rds/pdfs07/bcs25.pdf
Jaycox, L. H., McCaffrey, D., Eiseman, B., Aronoff, J., Shelley, G. A., Collins,
R. L., & Marshall, G. N. (2006). Impact of a school-based dating violence
prevention program among Latino teens: Randomised controlled effectiveness
trial. Journal of Adolescent Health, 39(5), 694–704.
Mitchell, R. J., & Lewis, S. (2017). Intention is not method, belief is not evidence,
rank is not proof: Ethical policing needs evidence-based decision making.
International Journal of Emergency Services, 6(3), 188–199. https://doi.org/10.1108/
IJES-04-2017-0018
Moher, D., Schulz, K. F., & Altman, D. G. (2001). The CONSORT statement:
Revised recommendations for improving the quality of reports of parallel group
randomised trials. BMC Medical Research Methodology, 1, Article 2. https://doi.
org/10.1186/1471-2288-1-2
Moher, D., Shamseer, L., Clarke, M., Ghersi, D., Liberati, A., Petticrew, M., Shekelle,
P., Stewart, L. A., & PRISMA-P Group. (2015). Preferred reporting items for
systematic review and meta-analysis protocols (PRISMA-P) 2015 statement.
Systematic Reviews, 4, Article 1. https://doi.org/10.1186/2046-4053-4-1
Monnington-Taylor, E., Bowers, K., Hurle, P. S., Ward, L., Ruda, S., Sweeney,
M., Murray, A., & Whitehouse, J. (2019). Testimony at court: A randomised
controlled trial investigating the art and science of persuading witnesses and
victims to attend trial. Crime Science, 8, Article 10. https://doi.org/10.1186/
s40163-019-0104-1
Montgomery, P., Grant, S., Mayo-Wilson, E., Macdonald, G., Michie, S., Hopewell,
S., Moher, D., & CONSORT-SPI Group. (2018). Reporting randomised trials of
social and psychological interventions: The CONSORT-SPI 2018 Extension.
Trials, 19, Article 407. https://doi.org/10.1186/s13063-018-2733-1
Morgan, S. L., & Winship, C. (2007). Counterfactuals and causal inference. Cambridge
University Press. https://doi.org/10.1017/CBO9780511804564
Morgan, S. L., & Winship, C. (2012). Bringing context and variability back into
causal analysis. In H. Kincaid (Ed.), Oxford handbook of the philosophy of the social
sciences (pp. 319–354). Oxford University Press. https://doi.org/10.1093/oxfor
dhb/9780195392753.013.0014
Morgan, S. L., & Winship, C. (2015). Counterfactuals and causal inference (2nd ed.).
Cambridge University Press. https://doi.org/10.1017/CBO9781107587991
Murphy, K., Mazerolle, L., & Bennett, S. (2014). Promoting trust in police: Findings
from a randomised experimental field trial of procedural justice policing. Policing
and Society, 24(4), 405–424. https://doi.org/10.1080/10439463.2013.862246
Murray, A. (2013). Evidence-based policing and integrity. Translational Criminology,
(5), 4–6.
Murray, D. M. (1998). Design and analysis of group-randomized trials (Vol. 29). Oxford
University Press.
Mutz, D., & Pemantle, R. (2012). The perils of randomisation checks in the analysis of
experiments. ScholarlyCommons. https://repository.upenn.edu/cgi/viewcontent.
cgi?article=1767&context=asc_papers
Mutz, D. C., Pemantle, R., & Pham, P. (2019). The perils of balance testing in
experimental design: Messy analyses of clean data. The American Statistician,
73(1), 32–42. https://doi.org/10.1080/00031305.2017.1322143
Nagin, D. S., Cullen, F. T., & Jonson, C. L. (2009). Imprisonment and re-offending.
In M. Tonry (Ed.), Crime and justice: A review of research (Vol. 38, pp. 115–200).
University of Chicago Press. https://doi.org/10.1086/599202
Nagin, D. S., & Sampson, R. J. (2019). The real gold standard: Measuring
counterfactual worlds that matter most to social science and policy.
Annual Review of Criminology, 2, 123–145. https://doi.org/10.1146/annurev-
criminol-011518-024838
Nagin, D. S., Solow, R. M., & Lum, C. (2015). Deterrence, criminal opportunities,
and police. Criminology, 53(1), 74–100. https://doi.org/10.1111/1745-9125.12057
Nagin, D. S., & Weisburd, D. (2013). Evidence and public policy: The example
of evaluation research in policing. Criminology & Public Policy, 12(4), 651–679.
https://doi.org/10.1111/1745-9133.12030
Near, J. P., & Miceli, M. P. (1985). Organizational dissidence: The case of whistle-
blowing. Journal of Business Ethics, 4(1), 1–16. https://doi.org/10.1007/
BF00382668
Nelson, M. S., Wooditch, A., & Dario, L. M. (2015). Sample size, effect size,
and statistical power: A replication study of Weisburd’s paradox. Journal of
Experimental Criminology, 11(1), 141–163. https://doi.org/10.1007/s11292-014-
9212-9
Neyman, J. (1923). Próba uzasadnienia zastosowań rachunku prawdopodobieństwa
do doświadczeń polowych [On the application of probability theory to
agricultural experiments]. Roczniki Nauk Rolniczych Tom X, 10, 1–51.
Neyroud, P. W. (2011). Operation turning point: An experiment in “offender-desistance
policing.” Experimental Protocol: CRIMPORT. Institute of Criminology, University
of Cambridge. www.criminologysymposium.com/download/18.62fc8fb415c2ea1
06932dcb4/1500042535947/MON11+Peter+Neyroud.pdf
Neyroud, P. W. (2016). The ethics of learning by testing: The police, professionalism
and researching the police. In M. Cowburn, L. Gelsthorpe, & A. Wahidin (Eds.),
Research ethics in criminology (pp. 89–106). Routledge.
Neyroud, P. W. (2017). Learning to field test in policing: Using an analysis of
completed randomised controlled trials involving the police to develop a
grounded theory on the factors contributing to high levels of treatment integrity
in police field experiments [Doctoral dissertation, University of Cambridge].
Apollo. https://doi.org/10.17863/CAM.14377
Neyroud, P. W., & Slothower, M. (2015). Wielding the sword of Damocles: The
challenges and opportunities in reforming police out-of-court disposals in
England and Wales.
Sherman, L. W., Gottfredson, D. C., MacKenzie, D. L., Eck, J., Reuter, P., & Bushway,
S. D. (1998). Preventing crime: What works, what doesn’t, what’s promising. National
Institute of Justice, Office of Justice Programs, U.S. Department of Justice. www.
ncjrs.gov/pdffiles/171676.PDF
Sherman, L. W., Neyroud, P. W., & Neyroud, E. (2016). The Cambridge crime harm
index: Measuring total harm from crime based on sentencing guidelines. Policing,
10(3), 171–183. https://doi.org/10.1093/police/paw003
Sherman, L. W., Schmidt, J. D., & Rogan, D. P. (1992). Policing domestic violence:
Experiments and dilemmas. Free Press.
Sherman, L. W., Shaw, J. W., & Rogan, D. P. (1995). The Kansas City gun
experiment (Research in Brief). National Institute of Justice, U.S. Department of
Justice. https://doi.org/10.1037/e603872007-001
Sherman, L. W., & Strang, H. (2009). Testing for analysts’ bias in crime prevention
experiments: Can we accept Eisner’s one-tailed test? Journal of Experimental
Criminology, 5(2), 185–200.
Sherman, L. W., & Strang, H. (2012). Restorative justice as evidence-based sentencing.
In The Oxford handbook of sentencing and corrections (pp. 215–243). Oxford University Press.
Sherman, L. W., Strang, H., Mayo-Wilson, E., Woods, D. J., & Ariel, B. (2015). Are
restorative justice conferences effective in reducing repeat offending? Findings
from a Campbell systematic review. Journal of Quantitative Criminology, 31(1),
1–24. https://doi.org/10.1007/s10940-014-9222-9
Sherman, L. W., & Weisburd, D. (1995). General deterrent effects of police patrol in
crime “hot spots”: A randomised, controlled trial. Justice Quarterly, 12(4), 625–648.
https://doi.org/10.1080/07418829500096221
Sherman, L. W., Williams, S., Ariel, B., Strang, L. R., Wain, N., Slothower, M.,
& Norton, A. (2014). An integrated theory of hot spots patrol strategy:
Implementing prevention by scaling up and feeding back. Journal of Contemporary
Criminal Justice, 30(2), 95–122. https://doi.org/10.1177/1043986214525082
Shuger, S. L., Barry, V. W., Sui, X., McClain, A., Hand, G. A., Wilcox, S.,
Meriwether, R. A., Hardin, J. W., & Blair, S. N. (2011). Electronic feedback
in a diet- and physical activity-based lifestyle intervention for weight loss: A
randomised controlled trial. International Journal of Behavioral Nutrition and
Physical Activity, 8, Article 41. https://doi.org/10.1186/1479-5868-8-41
Siegel, S. (1957). Nonparametric statistics. The American Statistician, 11(3), 13–19.
https://doi.org/10.1080/00031305.1957.10501091
Simon, R. (1979). Restricted randomisation designs in clinical trials. Biometrics,
35(2), 503–512. https://doi.org/10.2307/2530354
Sindall, K., Sturgis, P., & Jennings, W. (2012). Public confidence in the police: A
time-series analysis. British Journal of Criminology, 52(4), 744–764. https://doi.
org/10.1093/bjc/azs010
Singal, A. G., Higgins, P. D., & Waljee, A. K. (2014). A primer on effectiveness and
efficacy trials. Clinical and Translational Gastroenterology, 5(1), e45. https://doi.
org/10.1038/ctg.2013.13
Skolnick, J. (2002). Corruption and the blue code of silence. Police Practice and
Research, 3(1), 7–19. https://doi.org/10.1080/15614260290011309
Smith, G. C., & Pell, J. P. (2003). Parachute use to prevent death and major trauma
related to gravitational challenge: Systematic review of randomised controlled
trials. British Medical Journal, 327(7429), 1459–1461. https://doi.org/10.1136/
bmj.327.7429.1459
Sobel, M. E. (2006). What do randomised studies of housing mobility demonstrate?
Causal inference in the face of interference. Journal of the American Statistical
Association, 101(476), 1398–1407. https://doi.org/10.1198/016214506000000636
Solomon, R. L. (1949). An extension of control group design. Psychological Bulletin,
46(2), 137–150. https://doi.org/10.1037/h0062958
Somers, M. A., Zhu, P., Jacob, R., & Bloom, H. (2013). The validity and precision of
the comparative interrupted time series design and the difference-in-difference
design in educational evaluation. MDRC.
Sousa, W., Ready, J., & Ault, M. (2010). The impact of TASERs on police use-of-
force decisions: Findings from a randomised field-training experiment. Journal of
Experimental Criminology, 6(1), 35–55. https://doi.org/10.1007/s11292-010-9089-1
Spiegelhalter, D. (2019). The art of statistics: Learning from data. Penguin.
St. Clair, T., Hallberg, K., & Cook, T. D. (2016). The validity and precision of the
comparative interrupted time-series design: Three within-study comparisons.
Journal of Educational and Behavioral Statistics, 41(3), 269–299. https://doi.
org/10.3102/1076998616636854
Steffensmeier, D. J., Allan, E. A., Harer, M. D., & Streifel, C. (1989). Age and the
distribution of crime. American Journal of Sociology, 94(4), 803–831. https://doi.
org/10.1086/229069
Steffensmeier, D. J., Lu, Y., & Na, C. (2020). Age and crime in South Korea: Cross-
national challenge to invariance thesis. Justice Quarterly, 37(3), 410–435. https://
doi.org/10.1080/07418825.2018.1550208
Stern, J. E., & Lomax, K. (1997). Human experimentation. In D. Elliott & J. E.
Stern (Eds.), Research ethics: A reader (pp. 286–295). University Press of New
England.
Stevens, J. P. (2012). Applied multivariate statistics for the social sciences. Routledge.
https://doi.org/10.4324/9780203843130
Stevens, J. R. (2017). Replicability and reproducibility in comparative psychology.
Frontiers in Psychology, 8, Article 862.
Stevens, S. S. (Ed.). (1951). Handbook of experimental psychology. Wiley.
Sykes, J. (2015). Leading and testing body-worn video in an RCT [Unpublished master’s
dissertation]. University of Cambridge.
Sytsma, V. A., & Piza, E. L. (2018). The influence of job assignment on
community engagement: Bicycle patrol and community-oriented policing.
Police Practice and Research, 19(4), 347–364. https://doi.org/10.1080/15614263.
2017.1364998
Tankebe, J. (2009). Public cooperation with the police in Ghana: Does procedural
fairness matter? Criminology, 47(4), 1265–1293. https://doi.org/10.1111/j.1745-
9125.2009.00175.x
Tankebe, J., & Ariel, B. (2016). Cynicism towards change: The case of body-worn cameras
among police officers (Hebrew University of Jerusalem Legal Research Paper No.
16-42). SSRN. https://doi.org/10.2139/ssrn.2850743
Taves, D. R. (1974). Minimization: A new method of assigning patients to treatment
and control groups. Clinical Pharmacology & Therapeutics, 15(5), 443–453. https://
doi.org/10.1002/cpt1974155443
Taves, D. R. (2010). The use of minimization in clinical trials. Contemporary Clinical
Trials, 31(2), 180–184. https://doi.org/10.1016/j.cct.2009.12.005
Taxman, F. S., & Rhodes, A. G. (2010). Multisite trials in criminal justice settings:
Trials and tribulations of field experiments. In A. Piquero & D. Weisburd
(Eds.), Handbook of quantitative criminology (pp. 519–540). Springer. https://doi.
org/10.1007/978-0-387-77650-7_25
Taylor, C. (2019, January 28). What “fail to reject” means in a hypothesis test.
ThoughtCo. www.thoughtco.com/fail-to-reject-in-a-hypothesis-test-3126424
Telep, C. W., Mitchell, R. J., & Weisburd, D. (2014). How much time should
the police spend at crime hot spots? Answers from a police agency directed
randomised field trial in Sacramento, California. Justice Quarterly, 31(5), 905–933.
https://doi.org/10.1080/07418825.2012.710645
Thibaut, J. W. (2017). The social psychology of groups. Routledge. https://doi.
org/10.4324/9781315135007
Thurstone, L. L. (1931). The reliability and validity of tests: Derivation and interpretation
of fundamental formulae concerned with reliability and validity of tests and illustrative
problems. Edwards Brothers. https://doi.org/10.1037/11418-000
Thyer, B. A. (2006). Faith-based programs and the role of empirical research. Journal
of Religion & Spirituality in Social Work: Social Thought, 25(3–4), 63–82. https://doi.
org/10.1300/J377v25n03_05
Toby, J. (1957). Social disorganization and stake in conformity: Complementary
factors in the predatory behavior of hoodlums. Journal of Criminal Law,
Criminology & Police Science, 48(1), 12–17. https://doi.org/10.2307/1140161
Toh, S., & Hernán, M. A. (2008). Causal inference from longitudinal studies with
baseline randomisation. International Journal of Biostatistics, 4(1), Article 22.
https://doi.org/10.2202/1557-4679.1117
Torgerson, D. J., & Torgerson, C. J. (2003). Avoiding bias in randomised controlled
trials in educational research. British Journal of Educational Studies, 51(1), 36–45.
https://doi.org/10.1111/1467-8527.t01-2-00223
Travis, L. F., III. (1983). The case study in criminal justice research: Applications
to policy analysis. Criminal Justice Review, 8(2), 46–51. https://doi.
org/10.1177/073401688300800208
Trochim, W. M. K. (2006). Internal validity. Research Methods Knowledge Base.
https://conjointly.com/kb/internal-validity/
Trowman, R., Dumville, J. C., Torgerson, D. J., & Cranny, G. (2007). The impact
of trial baseline imbalances should be considered in systematic reviews: A
methodological case study. Journal of Clinical Epidemiology, 60(12), 1229–1233.
https://doi.org/10.1016/j.jclinepi.2007.03.014
Ttofi, M. M., & Farrington, D. P. (2011). Effectiveness of school-based programs to
reduce bullying: A systematic and meta-analytic review. Journal of Experimental
Criminology, 7(1), 27–56. https://doi.org/10.1007/s11292-010-9109-1
Tyler, T. R., Jackson, J., & Bradford, B. (2014). Procedural justice and cooperation. In
G. Bruinsma & D. Weisburd (Eds.), Encyclopedia of criminology and criminal justice
(pp. 4011–4024). Springer. https://doi.org/10.1007/978-1-4614-5690-2_64
Van Mastrigt, S., Gade, C. B., Strang, H., & Sherman, L. W. (2018). Restorative justice
conferences in Denmark. Experimental Protocol: CRIMPORT. Institute of Criminology,
University of Cambridge. www.crim.cam.ac.uk/documents/KIP-CrimPORT
Vickers, A. J., & Altman, D. G. (2001). Analysing controlled trials with baseline and
follow up measurements. British Medical Journal, 323(7321), 1123–1124. https://
doi.org/10.1136/bmj.323.7321.1123
Vidal, B. J., & Kirchmaier, T. (2018). The effect of police response time on
crime clearance rates. Review of Economic Studies, 85(2), 855–891. https://doi.
org/10.1093/restud/rdx044
Villaveces, A., Cummings, P., Espitia, V. E., Koepsell, T. D., McKnight, B., &
Kellermann, A. L. (2000). Effect of a ban on carrying firearms on homicide rates
in 2 Colombian cities. JAMA: The Journal of the American Medical Association, 283(9),
1205–1209. https://doi.org/10.1001/jama.283.9.1205
Vollmann, J., & Winau, R. (1996). Informed consent in human experimentation
before the Nuremberg code. British Medical Journal, 313(7070), 1445–1447.
https://doi.org/10.1136/bmj.313.7070.1445
Volpe, M. R., & Chandler, D. (2001). Resolving and managing conflicts in academic
communities: The emerging role of the “pracademic.” Negotiation Journal, 17(3),
245–255. https://doi.org/10.1111/j.1571-9979.2001.tb00239.x
Von Hofer, H. (2000). Crime statistics as constructs: The case of Swedish rape
statistics. European Journal on Criminal Policy and Research, 8(1), 77–89. https://doi.
org/10.1023/A:1008713631586
Wain, N., & Ariel, B. (2014). Tracking of police patrol. Policing, 8(3), 274–283.
https://doi.org/10.1093/police/pau017
Wain, N., Ariel, B., & Tankebe, J. (2017). The collateral consequences of GPS-led
supervision in hot spots policing. Police Practice and Research, 18(4), 376–390.
https://doi.org/10.1080/15614263.2016.1277146
Walker, D. (2010, October 10–13). Being a pracademic: Combining reflective
practice with scholarship [Keynote address]. AIPM Conference, Darwin, Northern
Territory, Australia.
Walker, D. H. T., Cicmil, S., Thomas, J., Anbari, F. T., & Bredillet, C. (2008).
Collaborative academic/practitioner research in project management: Theory and
models. International Journal of Managing Projects in Business, 1(1), 17–32. https://
doi.org/10.1108/17538370810846397
Walters, G. D., & Bolger, P. C. (2019). Procedural justice perceptions, legitimacy
beliefs, and compliance with the law: A meta-analysis. Journal of Experimental
Criminology, 15(3), 341–372.
Webley, P., Lewis, A., & Mackenzie, C. (2001). Commitment among ethical
investors: An experimental approach. Journal of Economic Psychology, 22(1), 27–42.
https://doi.org/10.1016/S0167-4870(00)00035-0
Wei, L., & Zhang, J. (2001). Analysis of data with imbalance in the baseline outcome
variable for randomised clinical trials. Drug Information Journal, 35(4), 1201–1214.
https://doi.org/10.1177/009286150103500417
Weisburd, D. (2000). Randomised experiments in criminal justice policy:
Prospects and problems. Crime & Delinquency, 46(2), 181–193. https://doi.
org/10.1177/0011128700046002003
Weisburd, D. (2003). Ethical practice and evaluation of interventions in crime and
justice: The moral imperative for randomised trials. Evaluation Review, 27(3),
336–354. https://doi.org/10.1177/0193841X03027003007
Weisburd, D. (2005). Hot spots policing experiments and criminal justice research:
Lessons from the field. Annals of the American Academy of Political and Social
Science, 599(1), 220–245. https://doi.org/10.1177/0002716205274597
Weisburd, D. (2008). Place-based policing (Ideas in American Policing Series No. 9).
Police Foundation. www.policefoundation.org/wp-content/uploads/2015/06/
Weisburd-2008-Place-Based-Policing.pdf
Weisburd, D., Groff, E. R., & Yang, S. M. (2012). The criminology of place: Street
segments and our understanding of the crime problem. Oxford University Press.
https://doi.org/10.1093/acprof:oso/9780195369083.001.0001
Weisburd, D., Lum, C. M., & Petrosino, A. (2001). Does research design
affect study outcomes in criminal justice? Annals of the American
Academy of Political and Social Science, 578(1), 50–70. https://doi.
org/10.1177/000271620157800104
Weisburd, D., Lum, C. M., & Yang, S. M. (2003). When can we conclude that
treatments or programs “don’t work”? Annals of the American Academy of Political
and Social Science, 587(1), 31–48. https://doi.org/10.1177/0002716202250782
Weisburd, D., Petrosino, A., & Mason, G. (1993). Design sensitivity in
criminal justice experiments. Crime and Justice, 17, 337–379. https://doi.
org/10.1086/449216
Weisburd, D., & Taxman, F. S. (2000). Developing a multicenter randomised trial in
criminology: The case of HIDTA. Journal of Quantitative Criminology, 16(3),
315–340. https://doi.org/10.1023/A:1007574906103
Welsh, B. C., Braga, A. A., & Bruinsma, G. J. (Eds.). (2013). Experimental criminology:
Prospects for advancing science and public policy. Cambridge University Press.
https://doi.org/10.1017/CBO9781139424776
Welsh, B. C., & Farrington, D. P. (2001). Toward an evidence-based approach to
preventing crime. Annals of the American Academy of Political and Social Science,
578(1), 158–173. https://doi.org/10.1177/000271620157800110
Welsh, B. C., & Farrington, D. P. (2012). Crime prevention and public policy.
In D. P. Farrington & B. C. Welsh (Eds.), The Oxford handbook of crime
prevention (pp. 3–19). Oxford University Press. https://doi.org/10.1093/oxfor
dhb/9780195398823.013.0001
Welsh, B. C., Podolsky, S. H., & Zane, S. N. (2020). Between medicine and
criminology: Richard Cabot’s contribution to the design of experimental
evaluations of social interventions in the late 1930s. James Lind Library Bulletin:
Commentaries on the History of Treatment Evaluation. www.jameslindlibrary.org/
articles/between-medicine-and-criminology-richard-cabots-contribution-to-the-
design-of-experimental-evaluations-of-social-interventions-in-the-late-1930s/
White, H. (2006). Impact evaluation: The experience of the independent evaluation group
of the World Bank. The World Bank.
Whitehead, T. N. (1938). The industrial worker. Harvard University Press.
Wicherts, J. M., Veldkamp, C. L., Augusteijn, H. E., Bakker, M., Van Aert, R., &
Van Assen, M. A. (2016). Degrees of freedom in planning, running, analyzing,
and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in
Psychology, 7, Article 1832. https://doi.org/10.3389/fpsyg.2016.01832
clusters 11, 29, 47, 73, 77, 105, 116
coalitions 136–7, 152
cognitive behavioural therapy (CBT) 56, 118
Cohen, J. 83
Cohn, E. G. 160
collaboration 161–2
community-oriented policing 99, 118
compensatory rivalry effect 84–5
confidence levels 4
confounding variables 4, 19–20, 37, 38, 57, 68, 97, 110, 112, 118, 159
Congdon, W. J. 9
‘Connecticut crackdown on speeding’ experiment 81–2, 112
Conover, W. J. 31
consent 129, 142–3, 146
CONSORT 133–5, 145, 158
content validity 56
control groups 3–4, 8, 18, 62–3
  definition 165
  see also randomised controlled trials (RCTs)
controlled interrupted time-series design 112
convenience or purposeful samples 83
Cook, T. D. 6, 7
Cornish, D. B. 140
costs of study 48
counterfactuals 2, 17–20, 56, 90, 158, 165
covariance 47
  ANCOVA 47, 105, 114
  baseline 48, 49
covariates 34, 48, 50, 68
  definition 165
  pre-randomisation equilibrium 49
Cox, D. R. v, 54
Cox regression 47
Crandall, M. 112
creaming 80
CrimPORT 129–31, 140, 146–52
Crolley, J. 97–8
Cumberbatch, J. R. 72, 83
Danziger, S. 98, 102
data dredging 125, 127, 145
Davis, K. 26
Day, S. J. 8
De Boer, M. R. 45
Delaney, C. 81
dependent variables 7, 100–1
  definition 165
  specification error 8
Devereaux, P. J. 144
Dezember, A. 9, 11
Di Maro, V. 41
difference-in-differences design (DID) 112–13
differential selection 67–8, 156
diversification 85, 161
domestic violence 5, 23, 24, 27–8, 56–7, 66, 70, 71, 72, 75, 83, 108, 117, 130, 143, 157
Donner, A. 73
Doré, C. J. 45
double-blind studies 103, 129
Duckworth, A. L. 63
Eck, J. E. 157
effect size 8, 128, 130–1, 166
effectiveness of interventions 9, 12, 13, 70, 74, 87, 96, 137, 141–2, 165
efficacy 9, 62, 70, 72, 80, 95, 166
Engel, R. J. 99
equivalent time-samples design 113–14
error 54, 159
  definition 68
  see also bias
ethics 109, 126, 140–4, 146, 152
  experimental protocol 5
  trickle flow 29
evidence-based practice 7, 11, 12, 138, 140, 161, 162, 166
Excel 22, 83
external validity 4, 6, 54, 65, 75–90, 92, 94, 100–1, 106–7, 113–14, 157
  definition 166
  implementation 140
  RDD 117
Fagerland, M. W. 105
Farrar, W. 41, 130
Farrington, D. P. 4, 7, 9, 10, 11–12, 21, 23, 26, 49, 60, 76, 87, 158
Fewell, Z. 19
field managers 136, 137–9, 145, 152, 158
‘first, do no harm’ 142
Fisher, R. A. 2, 42–3, 45, 55, 75, 159
Fixsen, D. L. 138, 139
Food and Drug Administration (FDA) 125–6, 128
Forsyth, D. R. 38
Freedman, L. P. 126
Friedman, L. M. 25, 26
Garner, J. 25, 136
Gelman, A. 127
generalisation 80–3
  time 81–2
  validity 76–7, 85–7, 88–9
generalised linear models 105
Gibaldi, M. 70
Gill, C. E. 56
Giraldo, O. 124, 133
Glass, G. V. 131
Goldacre, B. 127
Gondolf, E. W. 70
Gottfredson, M. 10, 11, 62