Clayson 2019

Download as pdf or txt
Download as pdf or txt
You are on page 1of 17

Received: 3 May 2019 

|
  Revised: 28 June 2019 
|  Accepted: 28 June 2019

DOI: 10.1111/psyp.13437

ORIGINAL ARTICLE

Methodological reporting behavior, sample sizes, and statistical


power in studies of event‐related potentials: Barriers to
reproducibility and replicability

Peter E. Clayson1,2   | Kaylie A. Carbine3   | Scott A. Baldwin3   | Michael J. Larson3,4

1
Veterans Affairs Greater Los Angeles
Healthcare System, Los Angeles, California
Abstract
2
Department of Psychiatry and Methodological reporting guidelines for studies of ERPs were updated in
Biobehavioral Sciences, David Geffen Psychophysiology in 2014. These guidelines facilitate the communication of key
School of Medicine, University of
methodological parameters (e.g., preprocessing steps). Failing to report key param-
California Los Angeles, Los Angeles,
California eters represents a barrier to replication efforts, and difficulty with replicability in-
3
Department of Psychology, Brigham creases in the presence of small sample sizes and low statistical power. We assessed
Young University, Provo, Utah whether guidelines are followed and estimated the average sample size and power in
4
Neuroscience Center, Brigham Young recent research. Reporting behavior, sample sizes, and statistical designs were coded
University, Provo, Utah
for 150 randomly sampled articles from five high‐impact journals that frequently
Correspondence published ERP research from 2011 to 2017. An average of 63% of guidelines were
Peter E. Clayson, VA Greater Los Angeles
reported, and reporting behavior was similar across journals, suggesting that gaps in
Healthcare System, MIRECC 210A, Bldg.
210, 11301 Wilshire Blvd., Los Angeles, reporting is a shortcoming of the field rather than any specific journal. Publication
CA 90073. of the guidelines article had no impact on reporting behavior, suggesting that editors
Email: pclayson@ucla.edu
and peer reviewers are not enforcing these recommendations. The average sample
Funding information size per group was 21. Statistical power was conservatively estimated as .72‒.98 for
Office of Academic Affiliations, Advanced
a large effect size, .35‒.73 for a medium effect, and .10‒.18 for a small effect. These
Fellowship Program in Mental Illness
Research and Treatment, United States findings indicate that failing to report key guidelines is ubiquitous and that ERP stud-
Department of Veterans Affairs ies are primarily powered to detect large effects. Such low power and insufficient
following of reporting guidelines represent substantial barriers to replication efforts.
The methodological transparency and replicability of studies can be improved by the
open sharing of processing code and experimental tasks and by a priori sample size
calculations to ensure adequately powered studies.

KEYWORDS
ERPs, replicability, reporting guidelines, sample size, statistical power

1  |   IN T RO D U C T ION 2018; Forstmeier, Wagenmakers, & Parker, 2017; Nelson,


Simmons, & Simonsohn, 2018; Shrout & Rodgers, 2018).
Replication difficulties in psychological science have fo- A common target of criticism is flexible data analysis that
cused attention on research practices that contribute to inflates the chance of erroneously observing significant ef-
replication failures (Chambers, 2017; De Boeck & Jeon, fects. Unfortunately, this practice is endemic in cognitive
neuroscience and psychophysiology, such as in studies of
This article has been contributed to by US Government employees and ERPs (Baldwin, 2017; Larson & Carbine, 2017; Luck &
their work is in the public domain in the USA. Gaspelin, 2017), because collecting and analyzing ERPs is

© 2019 Society for Psychophysiological Research     1 of 17


Psychophysiology. 2019;56:e13437. wileyonlinelibrary.com/journal/psyp |
https://doi.org/10.1111/psyp.13437
|
2 of 17      CLAYSON et al.

computationally intensive and requires many methodologi- Sample sizes are an important determinant of statistical
cal choices. Given the numerous possible researcher degrees power, and statistical power is impacted by multiple other fac-
of freedom in ERP studies, publication guidelines for ERP tors, such as the statistical analysis approach, number of obser-
studies were published in Psychophysiology by a committee vations/trials, effect size, and reliability of measurements. Low
convened by the Society for Psychophysiological Research statistical power is prevalent in neuroscience studies (Button
(Keil et al., 2014). These guidelines help to facilitate meth- et al., 2013b) and has been observed for some ERP research
odological transparency by identifying the key parameters (Baldwin, 2017). In a meta‐analysis of the relationship between
that should be reported in ERP studies. Such information is the error‐related negativity and anxiety, the average number of
critical for evaluating the quality of research and for ensuring participants per group was 22 (Moser, Moran, Kneip, Schroder,
sufficient information is present to conduct replication stud- & Larson, 2016). The average statistical power of these stud-
ies. The first purpose of the present study was to evaluate the ies was conservatively estimated at around 10% to detect the
extent to which these publication guidelines are followed and estimated effect size of Cohen's d  =  −.36 (Baldwin, 2017).
to determine how the publication of these guidelines influ- According to the ERP guidelines article, statistical power and/
enced reporting behavior in ERP research. or effect sizes should be explicitly stated when they are relevant
Adhering to systematic publication guidelines for ERP to the research question (Keil et al., 2014). However, a system-
studies ensures that key details are reported and sheds light on atic review of 100 clinical EEG and ERP studies recently found
critical data processing steps. Ambiguous or missing experi- that only 40% of studies reported effect sizes and no study (0%)
mental details hinder replication efforts and can elevate false reported a priori power analyses, suggesting that statistical power
positive rates in the face of undisclosed flexibility (Carp, 2013). is rarely reported in ERP research (Larson & Carbine, 2017).
Hence, a purpose of a Methods section is to clearly communi- Low statistical power is also associated with the exagger-
cate all of these steps and to provide justification of the rele- ation of significant effects in the presence of researcher bias
vant decisions in the data processing pipeline. For example, a (Button et al., 2013a; Ioannidis, 2005, 2008). For example,
recent study conducted an informal analysis on whether ERP exploiting researcher degrees of freedom and conducting many
studies justified the measurement windows and sites used for different tests of effects increases the likelihood of finding a
ERP analysis (Luck & Gaspelin, 2017), which is a requirement spuriously large effect, because only large statistical effects
of the publication guidelines (Keil et al., 2014). Four of the will satisfy a statistical significance threshold in a study with
fourteen studies (29%) published in Psychophysiology failed low power (Ioannidis, 2005, 2008). Such researcher degrees of
to sufficiently justify measurement windows and sites. This freedom are reflected in some ERP analysis approaches. For
small analysis indicated that only a minority of studies failed to example, if a researcher fails to find a significant effect at the
report this critical step. However, the sample was small and the a priori electrode of interest, it is possible to look at additional
extent to which all necessary data processing steps are reported electrodes, such as all sites along the midline or multiple lat-
for entire studies remains unclear. eralized sites. In this way, ERP studies can reduce statistical
Another important research practice associated with rep- power but inflate chances to find significant effects by examin-
lication failures is the use of small sample sizes, an issue ing multiple channels (Baldwin, 2017; Luck & Gaspelin, 2017).
that is exacerbated when researcher degrees of freedom are Taking multiple analysis approaches until a significant finding
exploited (Forstmeier et al., 2017). Studies of small samples is obtained potentially leads to erroneous conclusions, but this
can lead to the attenuation or exaggeration of effect sizes practice can be difficult to detect when such exploratory anal-
(i.e., magnitude error) or flip the direction of the relationship yses are presented as confirmatory. Because of the relationship
between variables (i.e., sign error; Brand & Bradley, 2016; between statistical power and the statistical analysis performed,
Gelman, 2018; Gelman & Carlin, 2014; Loken & Gelman, it is important to consider the statistical analysis approaches
2017). Hence, significant findings based on small samples used in ERP studies to estimate the power in the literature.
can lead to erroneous statistical inferences that fail to rep- The present study focused on reporting behavior and sam-
licate. The bias to believe such findings in small samples is ple sizes of ERP studies, which are two important aspects
referred to as the “law of small numbers fallacy” (Tversky for replication. The first aim was to determine the extent
& Kahneman, 1971). This fallacy reflects the belief that to which ERP studies published in multiple journals that
because it is more difficult to observe statistical significance focus on ERP research follow the publication guidelines in
in a small sample than it is to observe significance in a large Psychophysiology (Keil et al., 2014). Then, we examined
sample, then finding a statistically significant effect in a small whether the publication of the guidelines article impacted
sample must be “true” and represent a robust and real effect. reporting behavior by testing differences in guideline report-
However, in studies of small samples, magnitude and sign ing pre‐ and post‐guideline publication. We also sought to
errors are common because of the noisy nature of the data. determine the typical sample sizes of ERP studies; the aver-
Thus, small sample sizes can undermine the replicability and age statistical power of the ERP literature was then computed
generalizability of scientific research. based on the typical sample size and statistical designs.
CLAYSON et al.
|
      3 of 17

2  |   M ET H OD power to detect a Cohen's d of .46 (i.e., a medium effect


size).
The aims, inclusion/exclusion criteria, design, and analy-
ses were preregistered on Open Science Framework (OSF;
2.2  |  Article rating
https​://osf.io/pdbw3/​). Raw data, source code for all analy-
ses, and supplementary material are also posted on OSF Rating of the articles followed a rubric that was compiled
(https​://osf.io/mbsvy/​). based on the Keil et al. (2014) guidelines article. The rubric
included information from the following eight categories:
participant information, EEG recording, stimulus and tim-
2.1  |  Article selection ing, preprocessing, ERP measurement, statistics, principal
To determine the journals to use for coding ERP stud- components analysis (PCA), and independent components
ies, the top 10 journals with the most ERP studies pub- analysis (ICA). Information related to sample size, number
lished between 2011 and 2013 and between 2015 and of groups, statistical analyses, and software packages was
2017 were identified from PubMed. Journals were then also coded.
ranked based on their 2017 impact factors. The top three Each guideline was coded based on whether the informa-
journals were selected and included NeuroImage, Clinical tion was clearly presented and adequate for confident direct
Neurophysiology, and Journal of Cognitive Neuroscience. replication of the study. Given the emphasis on estimating
We also included the two flagship society journals for ERP observed statistical power of ERP studies, the final sample
research, Psychophysiology and International Journal of sizes used for ERP analyses were extracted. The final sample
Psychophysiology. size is typically smaller than the initial sample size reported
The ERP guidelines article was published in 2014 (Keil in ERP studies, because participants are often excluded due
et al., 2014). Fifteen ERP studies were randomly sampled to too few trials retained for analysis, poor ERP score reliabil-
without replacement from each abovementioned journal ity, or hardware/software malfunction.
for the following years: 2011, 2012, 2013, 2015, 2016, and Prior to coding the 150 studies, 10 studies were pilot
2017. Articles from 2014 were not examined, because it was coded by authors P.E.C. and K.A.C. Mismatches were re-
expected that there would be a delay in the adoption of the solved by unanimous consensus of P.E.C., K.A.C., and
guidelines. The first five articles sampled from each journal M.J.L. The same procedure for coding all 150 articles was
from each year that satisfied the inclusion and exclusion cri- followed so that each article was coded by two raters with
teria were selected for subsequent coding. This resulted in any discrepancies adjudicated by consensus of all three rat-
a total of 150 articles. Inclusion criteria were (a) the study ers. Percent agreement across variables among the two raters
reported an ERP experiment, (b) the study was conducted was high (median: 96.7%, mean: 95.0%, minimum to maxi-
in human participants, and (c) the study was published in mum = 79.3% to 100%). Median Cohen's kappa was accept-
one of the five journals of interest during one of the 6 years able (median: .83, mean: .78, minimum to maximum: .25 to
of interest. Exclusion criteria included (a) multimodal stud- 1.00) but appeared low for a few variables due to the low
ies (e.g., fMRI and ERP), (b) EEG time‐frequency analyses variability in response options (there were only two rating
(without ERP analyses), (c) poster abstracts, and (d) studies options for many variables). The preregistered rubric for ar-
coded as part of the piloting coding procedures for the raters. ticle coding, description of specific information relevant to
each coded guideline, and the ratings (raw data) are posted
on OSF.
2.1.1  |  Statistical power analysis
We had no a priori predictions regarding what effect size
2.3  |  Data analysis
would be expected for the change in reporting behavior
following the publication of the ERP guidelines article. We first investigated whether there were differences in re-
Alternatively, we conducted sensitivity analyses to deter- porting behavior across the five journals using a one‐way
mine the statistical effect size that a two‐tailed independent analysis of variance (ANOVA). Then, we determined
samples t test would be powered to detect. An independent whether the ERP guidelines article impacted reporting be-
samples t test was chosen in order to compare reporting havior. Reporting behavior (i.e., the proportion of reported
behavior prior to publication of the guidelines article to re- guidelines) from articles published between 2011 and 2013
porting behavior after publication of the guidelines article. was compared to reporting behavior from articles published
Sensitivity analyses were conducted using a power of .80, between 2015 and 2017. A two‐tailed independent samples
a sample size of 75 for each group (preguideline publica- t test was first conducted to determine whether there was a
tion and postguideline publication; 150 total articles), and change in reporting behavior. In order to avoid the biasing
an alpha level of .05. The present analyses had sufficient effects of heterogeneity of variances, equal variances were
|
4 of 17      CLAYSON et al.

not assumed. Hence, Welch's t test was used, and adjusted Group) ANOVA were computed using G*Power (version 3;
degrees of freedom are reported (Welch, 1947). Pooled Faul, Erdfelder, Buchner, & Lang, 2009).
standard deviations were used in the calculation of Cohen's
d (Bonett, 2008).
To conclude that there was no meaningful effect of the
3  |  RESULTS
guidelines article on reporting behavior, tests of equivalence
3.1  |  Reporting guidelines
were performed using the “two one‐sided tests” procedure
(Lakens, 2017; Schuirmann, 1987). The two one‐sided tests The percentage of guidelines that were reported per article
procedure provides a framework for estimating that an effect was similar across the five journals, F(4, 145) = 1.21, p = .31,
is statistically equivalent to zero or, in other words, that there η2 = .03 (see Table 1). Articles reported an average of 63%
is “no effect.” The equivalence test requires specifying an ef- (SD  =  7%) of guidelines. The range of reported guidelines
fect size of interest. Because the current study was powered across articles was quite wide (range  =  39% to 82%), and
to detect a difference for a Cohen's d of only .46, the test of no article reported all guidelines. As mentioned above, these
equivalence used a Cohen's d of .50 as the smallest effect size percentages refer to the proportion of articles that reported
of interest. The two one‐sided tests procedure tested whether the guideline when it was necessary to do so. For example,
the difference pre‐ and postguidelines reporting behavior was if an article did not use PCA, it was not necessary to specify
between Cohen's ds of −.50 and .50. any PCA parameters. Considering that reporting behavior
Power and sample size estimates for the independent was consistent across the five journals (see Table 1), sum-
groups t test and paired samples t test were computed using maries of reporting behavior collapse across journal member-
the “power” command in Stata (version 15.1; StatCorp, ship. Information for each journal is separately presented in
2017). Estimates for the 2 (Between Group)  ×  2 (Within the supplementary material posted on OSF.

T A B L E 1   Summary statistics for


Journal Mean Median SD Range
reported guidelines and sample sizes
  Percent of reported guidelines per article
Clinical Neurophysiology 63% 65% 10% 39% to 82%
International Journal of 64% 65% 7% 51% to 77%
Psychophysiology
Journal of Cognitive Neuroscience 62% 62% 6% 51% to 72%
NeuroImage 61% 61% 7% 44% to 72%
Psychophysiology 64% 65% 6% 53% to 74%
All journals 63% 64% 7% 39% to 82%
  Total participants per article
Clinical Neurophysiology 43 33 32 5 to 146
International Journal of 27 22 17 8 to 80
Psychophysiology
Journal of Cognitive Neuroscience 23 21 12 10 to 66
NeuroImage 20 17 8 10 to 46
Psychophysiology 31 24 22 10 to 96
All journals 29 22 21 5 to 146
  Total participants per group per article
Clinical Neurophysiology 25 21 17 5 to 86
International Journal of 18 17 7 8 to 55
Psychophysiology
Journal of Cognitive Neuroscience 20 17 10 9 to 66
NeuroImage 18 16 5 10 to 27
Psychophysiology 21 21 7 10 to 40
All journals 21 18 11 5 to 86
Note: Total participants per article indicates the number of participants in each coded article (ignoring group
membership). Total participants per group per article indicates the number of participants in each group for a
given article. The “All journals” row shows the summary statistics across all journals.
CLAYSON et al.
|
      5 of 17

Reporting guidelines were binned into eight categories half‐power cutoffs were used (4%), the online filter roll‐off
that were consistent with Keil et al. (2014). Figure 1 shows (3%), or filter family (2%).
the percentage of guidelines reported within each category
across all 150 articles. In the order of the most guidelines re-
3.1.3  |  Stimulus and timing
ported to the fewest, stimulus and timing (86% of guidelines
reported) was the highest reported category and was followed This category refers to the stimulus and timing parameters
by participant information (79%), statistics (76%), prepro- of the paradigm used during EEG recording (see Figure 2).
cessing (68%), PCA (65%), EEG recording (52%), ERP mea- Most articles reported clear information about the timing
surement (51%), and then ICA (36%). The guidelines for each characteristics of the paradigm (90%). However, only 56% of
category are discussed in detail below. articles reported enough information about the stimuli (such
as specific information about color, size, or which pictures
were used) that would allow direct replication of the para-
3.1.1  |  Participant information
digm to be possible.
Participant information comprises demographic character-
istics of the participants, which includes gender, age, and
3.1.4  | Preprocessing
education level of participants (see Figure 2). Most articles
reported the gender (99%) and age (95%) of participants, but The preprocessing category comprises information related
fewer articles reported education level (43%). to offline EEG data reduction and processing (see Figure 2).
The order in which preprocessing steps was performed was
clear in 90% of articles. When applicable, all articles (100%)
3.1.2  |  EEG recording
clearly reported the offline reference and the offline filter
The EEG recording category consisted of guidelines related cutoffs. However, few articles reported whether half‐ampli-
to the online recording of EEG, such as information about tude or half‐power cutoffs (2%) were used or the filter roll‐off
the sensors, amplifiers, and online filtering (see Figure 2). (26%) and filter family (18%).
Most articles reported the EEG sampling rate (97%) and the Preprocessing steps are implemented in software packages,
online reference electrodes (87%). Although the majority of and the reporting of such information can sometimes be used
articles also reported the online filter cutoffs (75%), very few to infer some specific parameters of preprocessing. Of the 150
articles reported specific information about filter characteris- articles coded, 86 (57%) reported the software packages used
tics. Most articles failed to report whether half‐amplitude or for EEG data analysis, and some of these articles reported
using multiple software packages (see Table 2). There were
17 different software packages reported, and the most fre-
quently used software packages were EEGLAB (n = 31) and
100%
BrainVision Analyzer (n = 28).

3.1.5  |  ERP measurement


75%
This category mostly consisted of information related to ERP
quantification (see Figure 3). Most articles reported inferen-
tial statistics (97%), measurement sensors (95%), measure-
50%
ment timing window (95%), and the measurement approach
(91%). However, very few articles reported whether a priori
sensors (5%) and temporal windows (3%) were used. For peak
25% amplitude measures, only 8% of articles reported whether a
local or absolute peak amplitude approach was used.

0% 3.1.6  |  Statistical analyses


A

A
on

ics
ing
ing

ent
ng

PC

IC
ati

tist
rdi

This category refers to information related to statistical anal-


ess
Tim
orm

ure

Sta
eco

c
pro
and

eas
Inf

yses (see Figure 3). Most articles reported p values (99%),


GR

Pre

PM
lus
nt

EE
ipa

clearly described the statistical procedures used (95%), and


mu

ER
rtic

Sti

provided inferential test statistics (91%). When permutation


Pa

F I G U R E 1   Proportion of guidelines reported across the eight statistics were used, about half of articles reported the num-
categories of interest ber of permutations (44%) and the method for identifying
|
6 of 17      CLAYSON et al.

100% 100%

75% 75%

50% 50%

25% 25%

0% 0%

Fil oll− s
Fil er Cu ors
l

Se Filte catio r

del ped ffs

Fa f
Am ass r Mat ier

ly
er

efe te
On enso mpl ce

Mo or Im Cuto s

Ha ctiv enso mpli s

Po Sen l
eve

p/− ive eria


e Lo fie

ter Of
R off
n

e
Ag

Ma ine R g Ra

mi
nd

anc
f A ren

s
i

t
nL
Ge

n
tio

On mpli

S fA
r

ter
w
uca

r
o

o
Sa

ke
Ed

lf− e/P
ns
lin
S

A
100% 100%

75% 75%

50% 50%

25% 25%

0% 0%

nso t Re ow
tifa Refe fs

ffs
ing

val

ing

ing

orr ce

ff
we amily
Cle ction

ion
li

Ar on W der

on
mu

Fil ll−O
f
ct C ren

i
uto

uto
ind
ter
Tim

Tim

Tim

r In ject
Or

t
Fil rpola
Sti

e
l In

F
Of lter C

rC
o
ar

rR
er
nse

lus

ar

ar
ria

te

t
Cle

Cle

e
e
mu

c
spo

ti
ert

Po
lin

tifa
eF

Cle enta
Sti
Int

p/−
f
Re

flin

Ar

gm

Am
Se
ar
Of

ar

Se

lf−
Cle

Ha
F I G U R E 2   Proportion of guidelines reported within the following four categories: participant information, EEG recording, stimulus and
timing, and preprocessing

T A B L E 2   Frequency table of software packages a significance threshold (56%). Appropriate corrections for
violating assumptions of statistical models was reported in
Software package Frequency
55% of articles, and 40% of articles considered corrections
EEGLAB 31 for multiple comparisons.
BrainVision Analyzer 28
Brain Electrical Source Analysis (BESA) 7 3.1.7  | PCA
Cartool 6
When PCA was used, all articles (100%) provided sufficient
Scan 6
information so that the preprocessing steps implemented prior
ERP PCA (EP) Toolkit 5
to PCA were clear (see Figure 3). Most articles described
Fieldtrip 5
the structure of EEG data submitted to the PCA (75%), the
ERPLAB 4 rotation applied to the data (62%), and the decision rule for
EEProbe 3 retaining or discarding components (62%). Half of the arti-
NetStation 2 cles (50%) described the PCA algorithm, and relatively few
BrainStorm 1 articles described the type of association matrix (38%).
EMSE 1
EPlyzer 1 3.1.8  | ICA
ERPSS 1
When ICA was used, all articles provided sufficient infor-
Fully Automated Statistical Thresholding for EEG 1
mation regarding preprocessing steps prior to the ICA (see
Artifact Rejection (FASTER)
Figure 3). The majority of articles described how ICA com-
Statistical Parametric Mapping (SPM) 1
ponents were selected (58%). However, very few articles
CLAYSON et al.
|
      7 of 17

F I G U R E 3   Proportion of guidelines 100% 100%


reported within the following four
75% 75%
categories: ERP measurement, statistics,
PCA, and ICA 50% 50%

25% 25%

0% 0%

A P ori Se pe

d
scr t Ap ow

ow

s
s
s

ues
n N ber o stics
ics

Nu e Sta ch

ri W sors

lds

ns

par
on
ber Trial

l
asu men ensor

rte
ibe
a
Ty

tio
a

Tri
Me rem tatist

ind

ind

om
val

ati
pro

po
n

h
scr
ti

uta
s
A P eak

l
rem t W

e
f
of
S

hre

lt C
o
dp

R
lS

Vi

erm
sD
ent

nT
s
rte

Mu
Me rentia

tat
rio
iv

of
en

ri

ure

fP
m
um
ipt

po

tio
st S

on
re

for
ro
ced
Re
asu

uta
asu
e

cti
Te
Inf

on
De

Pro

mb
rre
rm
Mi

cti
Me

Co
Pe

Nu

rre
Co
100% 100%

75% 75%

50% 50%

25% 25%

0% 0%
r

ta

le

trix

on

ta

s
lea

lea

ent
tio
Da

Ru

Da
rith

rith
ati
Ma
gC

gC
ota

on
ret
of

ion

of
lgo

lgo

mp
AR
sin

sin
ion

erp
ure

ure
cis

AA

AA

Co
ces

ces
iat

Int
PC
uct

uct
De

PC

IC
soc

of
pro

pro

or
Str

Str

er
of
As
Pre

Pre

mb
Inf

Nu
T A B L E 3   Percentage of guidelines reported by year performed to determine if the results were statistically equiv-
alent to the absence of an effect. The equivalence tests set
Year Mean Median SD Range
equivalence bounds to ± 3.62%. Both lower‐bound and upper‐
2011 62% 64% 8% 44% to 72% bound equivalence tests were significant, t(142.33)  =  2.33,
2012 63% 64% 9% 39% to 74% p = .01; t(142.33) = −3.79, p < .01, 95% CI [−2.8%, 1.1%],
2013 62% 64% 7% 49% to 74% respectively.
2015 64% 64% 6% 55% to 82% Based on the combination of the null‐hypothesis signifi-
2016 63% 65% 6% 51% to 73% cant test and the equivalence tests, the observed impact of the
2017 63% 62% 7% 51% to 77% guidelines article was not statistically different from zero and
was statistically equivalent to zero. In short, the publication
Note: Estimates represent the percentage of guidelines that were reported across
all five journals. of the ERP guidelines article had no impact on reporting be-
havior in the 3 years after publication.
reported the ICA algorithm (13%), structure of the data sub-
mitted to the ICA (8%), or the number of components re-
tained/removed (3%). 3.3  |  Sample sizes
Summary statistics for the number of participants exam-
3.2  |  Impact of guidelines article
ined in each article are presented in Table 1. For all ar-
To determine whether the publication of guidelines for ERP ticles coded, each article contained an average of 29
studies impacted reporting behavior, reporting behavior for (median = 22, SD = 21, range 5 to 146) participants. The
the 3 years prior to the publication of the guidelines article sample size decreased when considering the number of
was compared to the reporting behavior for the 3 years fol- participants in each group examined in each article. When
lowing its publication (see Table 3). There was no significant considering participants per group, each article contained
difference in reporting behavior, t(142.33) = 0.73, p = .47, an average of 21 (median  =  18, SD  =  11; range  =  5 to
Cohen's d = .12, 95% CI [−1.5%, 3.2%]. 86) participants per group. Of those articles that reported
Beyond determining whether the guidelines article in- data on more than one group, 77% examined two groups of
creased or decreased reporting behavior, analyses were participants.
|
8 of 17      CLAYSON et al.

3.4  |  Statistical power needed to achieve a statistical power of .80 for a small effect
size (Cohen's d = .2, Cohen's f = .1) and an alpha level of .05
There was a great deal of heterogeneity in the statistical mod-
was 788 participants (394 participants per group) for an inde-
els used in the coded articles. A summary of the between‐
pendent samples t test, 199 participants for a paired samples t
and/or within‐subject ANOVA models used in at least three
test, and 298 participants (194 participants per group) for a 2
articles are shown in Table 4 (see supplementary material
(Between Group) × 2 (Within Group) ANOVA interaction. A
on OSF for a description of all statistical models). In order
summary of the number of participants needed for each effect
to estimate the statistical power in the coded articles, we
size and statistical design is provided in Table 6.
estimated power for three models: independent samples t
Next, the average sample size in the coded articles was
tests, paired samples t tests, and a 2 (Between Group)  ×  2
considered to determine the average effect size that the coded
(Within Group) ANOVA interaction (see Figure 4, Table
articles were powered to detect. The effect size (Cohen's d
5). Additional power analyses were not conducted for the
or Cohen's f) for a two‐tailed test that a study with 21 par-
2 (Within Group) × 2 (Within Group) ANOVA design, be-
ticipants per group, a statistical power of .80, and an alpha
cause it is equivalent to a paired t test on difference scores.
level of .05 would be able to detect is .89 (large effect size)
The number of participants needed to achieve a statistical
for an independent samples t tests, .62 (medium‐to‐large ef-
power of .80 for a large effect size (Cohen's d = .8, Cohen's
fect size) for a paired samples t test, and .44 (large effect
f = .4) and an alpha level of .05 was 52 participants (26 par-
size) for 2 (Between Group) × 2 (Within Group) ANOVA
ticipants per group) for an independent samples t test, 15 par-
interaction.
ticipants for a paired samples t test, and 22 participants (11
Lastly, the achieved statistical power was computed based
participants per group) for a 2 (Between Group) × 2 (Within
on a sample size of 21 participants per group (see Table 7),
Group) ANOVA interaction. The number of participants
and this analysis provides a conservative estimate of the
statistical power of coded ERP studies. Studies achieved
T A B L E 4   Frequency table of statistical models a power of only .72 to detect a large effect size (Cohen's d
Factors and levels in ANOVAs Frequency
=.80) for an independent samples t test. Studies that used
paired samples t tests achieved a power of .94 to detect a large
2 (Within) 15
effect size (Cohen's d = .80) but were insufficiently powered
2 (Within) × 2 (Within) 11 (achieved power: .59) to detect a medium effect size (Cohen's
2 (Within) × 2 (Between) 7 d = .50). For a 2 (Between Group) × 2 (Within Group) in-
2 (Within) × 2 (Within) × 2 (Between) 6 teraction, studies achieved .98 power to detect a large effect
3 (Within) 6 size (Cohen's f = .40), but they were underpowered (achieved
2 (Within) × 2 (Within) × 3 (Within) 4 power: .73) to detect a medium effect size (Cohen's f = .25).
2 (Within) × 4 (Within) 4 Taken together, independent samples t tests were insuffi-
3 (Within) × 2 (Within) 4 ciently powered to detect large effect sizes, and paired sam-
ples t tests and tests of the 2 (Between Group) × 2 (Within
2 (Within) × 2 (Within) × 2 (Within) 3
Group) interactions were sufficiently powered to detect only
2 (Within) × 3 (Within) 3
large effect sizes.
2 (Within) × 3 (Within) × 2 (Within) 3
3 (Within) × 2 (Within) × 2 (Within) × 2 (Between) 3
4 (Within) 3 4  |  DISCUSSION
Number of factors in ANOVAs Frequency
Across 150 ERP studies, an average of 63% of guidelines were
1 29
reported, which suggests that published ERP studies omit
2 41
key information required for independent replication. This
3 43
reporting behavior was consistent across five prominent ERP
4 26 journals: Clinical Neurophysiology, International Journal
5 8 of Psychophysiology, Journal of Cognitive Neuroscience,
Note: # (Within) indicates the number of within‐subject levels. # (Between) indi- NeuroImage, and Psychophysiology. Hence, gaps in methods
cates the number of between‐subject levels (i.e., groups). For the sake of brevity, reporting appear to be a shortcoming of the field, rather than
this table shows the number of factors and levels for those statistical models that any specific journal or impact factor level. Notably, the ERP
were used at least three times in the coded articles, which represents only 72
of the 147 coded articles (49%). Three articles were not included, because they
guidelines article (Keil et al., 2014) had no impact on report-
used approaches such as multilevel modeling. All models are shown in the sup- ing behavior for the 3 years following its publication. With
plementary material on OSF. Abbreviation: ANOVA, analysis of variance. regard to the sample size of ERP studies, the average sample
CLAYSON et al.
|
      9 of 17

Independent Samples Paired Samples


3.0 1.50

2.5 1.25

2.0 1.00

Cohen's d

Cohen's d
1.5 0.75

1.0 0.50

0.5 0.25

0.0 0.00
25 50 75 100 25 50 75 100
Total Sample Size Total Sample Size

Between x Within Interaction


1.50 Power Level
.80
1.25 .85
.90
1.00 .95
Cohen's f

0.75

0.50

0.25

0.00
25 50 75 100
Total Sample Size

F I G U R E 4   Plots show the relationship between statistical power, effect sizes, and total sample sizes for an independent samples t test, a
paired samples t test, and a test of the 2 (Between Group) × 2 (Within Group) interaction in an ANOVA. Dotted lines represent small (Cohen's
d = .2; Cohen's f = .1), medium (Cohen's d = .5; Cohen's f = .25), and large effect sizes (Cohen's d = .8; Cohen's f = .4)

size per group was 21 participants. Considering this sample and inadequate reporting of research (Chalmers & Glasziou,
size, ERP studies had sufficient power to observe only large 2009). It is likely that missing details in the ERP data analysis
statistical effects (Cohen's d > .8, Cohen's f > .4). Taken to- pipeline might similarly lead to wasted resources.
gether, the present study revealed critical shortcomings in the A potential reason that Methods sections lack enough
reporting of common ERP practices that hinder the ability to information for replication is an underappreciation of the
independently replicate ERP studies and the probability for importance of direct (close) replications (Chambers, 2017;
ERP studies to find replicable effects. Nosek & Lakens, 2014; Schmidt, 2009; Simons, 2014). A
direct replication tests the repeatability of a finding by du-
plicating the study design and analysis of the original study.
4.1  |  Reporting behavior
Direct replications are important for increasing the preci-
The widespread omission of over a third of the required re- sion of effect size estimates in meta‐analyses, establishing
porting guidelines serves as a substantial barrier to replication generalizability of effects, identifying boundary conditions
efforts and to the evaluation of research quality. It is unclear for “real” effects, and correcting scientific theory (Nosek
how to judge whether experimental manipulations and data & Lakens, 2014). It is important for direct replications
collection and processing are sound without sufficient details to be conducted by outside laboratories in order to verify
to be reproducible. Poor research reporting documentation is the robustness of the effects found in the original research
not unique to ERP studies and has been observed in other (Nosek & Lakens, 2014; Schmidt, 2009; Simons, 2014).
subfields of neuroscience (Carp, 2012; Guo et al., 2014; Without reporting all experimental details, it is unlikely
Muncy, Hedges‐Muncy, & Kirwan, 2017; Poldrack et al., that replication studies will be successful. In fact, failure to
2017) and the biomedical sciences (Chalmers & Glasziou, report important methodological details that impact study
2009; Glasziou et al., 2014; Ioannidis et al., 2014). It is esti- findings or failure to disclose flexibility in data analysis
mated that, in the biomedical sciences, billions of dollars are is considered a questionable research practice that con-
wasted every year due to the consequences of misreporting tributes to replication difficulties (Forstmeier et al., 2017;
|
10 of 17      CLAYSON et al.

T A B L E 5   Numerical summary of the


Detectable effect
relationship between sample size, statistical
Analysis Statistical power Required participants for typical n
power, large effect sizes, and statistical
Independent samples .80 52 .89 analysis
t test
  .85 60 .95
  .90 68 1.03
  .95 84 1.14
Paired samples t test .80 16 .64
  .85 18 .69
  .90 20 .74
  .95 24 .83
2 (Between .80 52 .44
Group) × 2 (Within .85 60 .47
Group) interaction
  .90 68 .51
  .95 84 .57
Note: Required participants indicates the number of participants needed to obtain a given level of statistical
power to detect a large effect size. A large effect size was considered a Cohen's d of .80 for independent sam-
ples t tests and paired samples t test. A large effect size was considered a Cohen's f of .40 for the 2 (Between
Group) × 2 (Within Group) interaction. The detectable effect for typical n indicates the effect size that a study
with 21 participants per group and a given level of statistical power would be able to detect. Alpha level was
set to .05 for all analyses.

John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & T A B L E 6   Numerical summary of the number of participants
needed to achieve statistical power of .80 for each effect size and
Simonsohn, 2011).
statistical design
Unfortunately, direct replications are quite rare, and in
psychology they are frequently replaced with conceptual rep- Number of
lications (LeBel & Peters, 2011; Nosek & Lakens, 2014; Analysis Effect size participants
Schmidt, 2009). A conceptual replication1 seeks to test a phe- Independent samples t test .20 788
nomenon using a different method, and conceptual replica-   .50 128
tions are essential for theory testing. The consequence of   .80 52
replacing direct replications with conceptual replications is
Paired samples t test .20 199
that conceptual replications cannot conclusively disprove the
  .50 34
original finding and failures to replicate are often attributed
to methodological changes (LeBel & Peters, 2011; Simons,   .80 15

2014). Hence, science cannot self‐correct when theories are 2 (Between Group) × 2 .10 298
built on conceptual replications. Additionally, conceptual (Within Group) interaction .25 50
replications can exploit researcher degrees of freedom to er-   .40 22
roneously support the original study idea by changing analy- Note: Number of participants for the independent samples t test and interaction
sis approaches (Forstmeier et al., 2017; Simmons et al., 2011; effect reflect total number of participants (not number of participants per group).
Simons, 2014).
Some information was consistently underreported across
coded studies, suggesting an underappreciation of the impact cutoffs, specifying whether those cutoffs were half‐amplitude
of reporting some EEG data reduction parameters on final or half‐power, describing the filter roll‐offs, and identifying
ERP scores. The most frequently underreported parameters the filter family. The most commonly reported aspect of fil-
related to characteristics of the online and offline filters. tering was the filter cutoff. However, even when using the
The ERP guidelines article recommends reporting the filter same filter cutoffs, the signal quality is differentially impacted
based on the other mentioned filter characteristics (Widmann
1 & Schröger, 2012; Widmann, Schröger, & Maess, 2015). The
 Using the term replication when referring to a study of the robustness of
an effect under different methodological parameters is a misnomer. characteristics of the filter used should be determined by the
Nothing is being replicated per se—rather, a phenomenon of interest is ERP components of interest, and the same filter cutoffs and
simply being tested under different conditions. filter characteristics are not well suited to all studies of any
CLAYSON et al.
|
      11 of 17

T A B L E 7   Numerical summary of the achieved power for each et al., 2017) for single‐group studies and from 14.75 (Carp,
statistical design, effect size, and a group sample size of 21 2012) to 19 (Poldrack et al., 2017) per group for studies with
Analysis Effect size Achieved power
multiple groups. The estimated statistical power of fMRI
studies is between .08 and .31 (Button et al., 2013b), and a
Independent samples t test .20 .10
recent large assessment of statistical power in the fields of
  .50 .35 cognitive neuroscience and psychology estimated a median
  .80 .72 power of .73 for large effects, .44 for medium effects, and
Paired samples t test .20 .14 .12 for small effects (Szucs & Ioannidis, 2017). Although
  .50 .59 statistical power of ERP studies is on par with the fields of
  .80 .94 cognitive neuroscience and psychology, the consequences
2 (Between Group) × 2 .10 .18 of small samples and low power nonetheless limit the inter-
(Within Group) .25 .73 pretability and potential replicability of ERP studies.
interaction The coded ERP studies were only powered at a level of
  .40 .98 .80 to detect large statistical effects for paired samples t
tests and 2 (Between Group) × 2 (Within Group) ANOVA
interactions, which together accounted for only 15% of
ERP component (Cook & Miller, 1992; Edgar, Stewart, & coded studies. Given that statistical power was low for
Miller, 2005; Nitschke, Miller, & Cook, 1998; Widmann & studies of smaller effects and that most studies used more
Schröger, 2012; Widmann et al., 2015). complicated statistical analyses, many observed statistical
It was also rare for a study to report using a priori deter- effect sizes are likely exaggerated due to the statistical sig-
mined sensors and temporal windows for ERP measurement. nificance threshold commonly applied to published studies
A common approach to scoring ERPs is to choose sensors (Gelman, 2018; Rosenthal, 1979; Simmons et al., 2011).
and temporal windows based on grand averages where the This bias to publish statistically significant effects incen-
effect of interest appears maximal. This practice often leads tivizes studying small samples and noisy measurements,
to finding significant effects but results in a high rate of spu- because researchers can exploit the garden of forking paths
rious findings (Luck & Gaspelin, 2017). When possible, it (or researcher degrees of freedom) to find statistically sig-
is considered best practice to select sensors and temporal nificant effects (Baldwin, 2017; Brand & Bradley, 2016;
windows a priori to avoid biased measurement and analysis. Clayson & Miller, 2017b; Gelman, 2018; Gelman & Carlin,
However, in some cases it is not possible to have definite 2014; Larson & Carbine, 2017; Loken & Gelman, 2017).
a priori predictions, such as when using a novel paradigm. When researcher degrees of freedom are intentionally
In these instances, there are alternative approaches, such as exploited, such as when multiple iterations in the data
using a functional localizer, collapsed localizer, a window‐ processing pipeline are tested until a significant result is
independent or mass univariate measurement approach, or obtained, the likelihood of finding replicable effects is re-
factor analysis (see Dien, 2017; Luck & Gaspelin, 2017). duced. Furthermore, some meta‐analytic approaches for
Regardless of the approach used, the measurement sensors estimating effect sizes are unable to adjust for the presence
and temporal windows used should be clearly reported and of this type of bias. Most of these approaches are only de-
justified (Keil et al., 2014). signed to adjust for journal publication bias (i.e., the bias
that journals are more likely to publish significant stud-
ies than nonsignificant studies), but they are not designed
4.2  |  Sample size and statistical power
to adjust for questionable research practices that inflate
The average overall sample size of studies included was the likelihood of finding statistical significance (Carter,
29 (median  =  22); the typical sample size of a group of Schönbrodt, Gervais, & Hilgard, 2018; Simonsohn, Nelson,
participants was 21 on average (median = 18) in the coded & Simmons, 2014a, 2014b; Simonsohn, Simmons, &
articles, and this resulted in an estimated statistical power Nelson, 2015). As a result, some meta‐analytic approaches
of .72‒.98 for a large effect size, .35‒.73 for a medium ef- are unable to identify the true effect size in the literature
fect size, and .10‒.18 for a small effect size (see Table 7). and suffer from inflated false positives (Carter et al., 2018).
However, these power estimates are considered conserva- Hence, ERP meta‐analyses might consider employing ap-
tive, because the majority of studies used more complicated proaches that identify whether the literature is biased due
statistical designs that will reduce power. Regardless, the to questionable research practices, such as undisclosed re-
observed sample size and statistical power for the statistical searcher degrees of freedom. A p‐curve analysis is one such
designs of focus are similar to other subfields of neurosci- approach that operates under the assumption of journal
ence. For example, estimates of the median sample sizes of publication bias and can identify the use of questionable
fMRI studies range from 15 (Carp, 2012) to 28.5 (Poldrack research practices to obtain statistically significant effects
|
12 of 17      CLAYSON et al.

(Clayson, Carbine, & Larson, in press; Simonsohn et al., even though it is likely that interpolation was conducted.
2015). Indeed, a preregistered p‐curve analysis on current Hence, it is likely that the reporting behavior presented
and 10‐year‐past psychophysiological studies showed gen- here was somewhat overestimated. The reported analy-
erally good evidential value and low selective reporting in ses focused on ERP studies, but other types of EEG stud-
the field, but demonstrated relatively low average statistical ies, such as time‐frequency analyses, might more or less
power (Carbine, Lindsey, Rodeback, & Larson, 2019). closely adhere to reporting guidelines (Cohen, 2017). It is
The estimate of statistical power of the coded articles is also possible that the publication of the Keil et al. (2014)
considered conservative, because it was based on the most guidelines article had no impact on reporting behavior,
frequently used approaches for statistical analysis. For exam- because reporting behavior has remained stable since the
ple, most studies used more factors or levels in ANOVAs publication of earlier ERP reporting guidelines (Donchin,
than were considered in the power analyses, which would Callaway, Cooper, & Desmedt, 1977; Picton et al., 2000;
lead to lower power (see Table 4) and do not appear to follow Pivik et al., 1993). Nonetheless, only two thirds of the re-
best practices for repeated measures ANOVAs of ERP data quired information for replicating contemporary ERP stud-
(Dien, 2017). Including many factors in exploratory ies is being routinely reported.
ANOVAs also leads to an increase in the familywise error Furthermore, the spirit of the Keil et al. (2014) guidelines
rate, because it is not common practice to correct for multiple article was to facilitate communication among researchers by
comparisons in ANOVAs (Cramer et al., 2015; Luck & providing explicit recommendations for reporting (see How
Gaspelin, 2017). Luck and Gaspelin showed that the proba- To Use This Document section; Keil et al., 2014, p. 2). It is
bility of a Type I error is 5%, 14%, 30%, 54%, and 80% for a possible that authors consciously chose to deviate from the
one‐, two‐, three‐, four‐, and five‐factor ANOVA, respec- recommended guidelines for reasons specific to their study,
tively. For the coded articles, most of the articles used more which is acceptable. However, “such deviations are [to be]
than one factor (see Table 4), indicating that the familywise explicitly documented and explained” (Keil et al., 2014, p. 2).
error rate is above 5%. ERP studies2 often conduct many In the present coding procedure, a guideline was coded as ac-
ANOVAs, such as an ANOVA on multiple ERP components, ceptable as long as it was addressed. For example, sensors did
on amplitude and latency measurements, or across multiple not need to be chosen a priori so long as how sensors would
time windows or electrodes. For example, some coded ERP be chosen was specified, such as through a mass univariate or
studies performed multifactorial ANOVAs on 50‐ms chunks functional localizer approach. Hence, it is unlikely that such
of activity across the entire ERP epoch (e.g., −200 to 800 ms, deviations accounted for the low reporting behavior. In line
resulting in 20 separate ANOVAs). Such practices virtually with the spirit of the guidelines article, we believe that com-
ensure that a statistically significant, although possibly spu- munication of methodological parameters among researchers
rious or inflated, effect will be observed in an ERP study. could, and should, be improved.
When conducting ANOVAs, correcting for multiple compar- For the present manuscript, we randomly selected articles
isons, reducing the number of factors, and removing unnec- from five different journals; these articles examined healthy
essary analyses should be used to reduce the familywise and participants as well as various clinical and developmental
experiment‐wise error rates (see Luck & Gaspelin, 2017). populations, but this was not explicitly coded. It is possible
that some populations require special considerations when
recording and analyzing ERP data. Consistent with recom-
4.3  | Limitations
mendations from the guidelines article, deviations from stan-
The present study has some limitations. We only coded dard protocol should be documented and justified. Although
whether authors followed guidelines based on what was such data can be costly and difficult to acquire, “the rules
stated by the authors in the published studies. It is pos- of statistical inference have no empathy for how hard it is to
sible that additional analysis steps were performed, but not acquire data” (Nosek, Ebersole, DeHaven, & Mellor, 2018, p.
reported. Some guidelines, such as the type of interpola- 5). Despite that data collection might be slow, the driving re-
tion used or offline filtering, were coded only when they search questions are important enough to answer rigorously.
were reported, but it is likely that additional unreported
steps were performed by authors. For example, some stud-
4.4  |  Moving forward
ies reported conducting topographical analyses but did not
mention how bad channels were interpolated or handled. Failing to report all key methodological parameters appears
In such instances, the type of interpolation was not coded, to be commonplace in ERP research, which serves as a sub-
2
stantial barrier to replication efforts. Because each meth-
 We chose not to provide specific citations as examples in the Discussion
section. All of the issues discussed occurred in multiple articles, and thus it
odological parameter can impact ERP findings, reporting all
is likely a field‐wide issue rather than an issue with one particular group or parameters is a best practice for evaluating research quality
lab. and replicability. The question remains as to how the field
CLAYSON et al.
|
      13 of 17

moves forward to resolve these issues. We offer a few sugges- removes one incentive (the publication of a manuscript) to
tions below based on how other fields are addressing the rep- exploit researcher degrees of freedom at the data analysis
lication problem in science (see Button et al., 2013a, 2013b; stage, because the manuscript is accepted for publication at
Ioannidis, 2005, 2008; Ioannidis et al., 2014; Lilienfeld, the completion of the first phase regardless of whether signif-
2017; Tackett, Brandes, King, & Markon, 2019; Yom, 2018) icant effects are observed. Of the primary psychophysiology
and discuss some issues that are specific to ERP research. journals, only the International Journal of Psychophysiology
The disclosure of key methodological parameters will in- has, thus far, implemented the registered reports format
crease transparency that hopefully unveils when researcher (Larson, 2016). Many have argued that the incentive struc-
degrees of freedom are exploited and calibrates a careful ture for academia needs to shift away from rewarding volu-
reader's confidence in reported effects. As such, it would be minous publishing to rewarding rigorous, careful research
helpful for editors and reviewers to enforce ERP reporting (Baldwin, 2017; Ioannidis et al., 2014; Nelson, Simmons, &
guidelines. A reason that key information might be omitted Simonsohn, 2012). An advantage of a registered report for-
from ERP studies is due to space limitations for journal ar- mat is that it shifts the incentive away from massaging data to
ticles. In such an event, authors could be encouraged to post uncover a statistically significant effect to designing a careful
all study details necessary for direct replication as support- test of a specific hypothesis.
ing information or to online repositories, such as OSF. Open Another barrier to reporting might be that researchers do
sharing of processing code and experimental tasks could also not know the specific ways in which data were processed
enhance reporting of most pipeline steps and further facilitate and analyzed due to a reliance on various software analysis
replication. The ERP guidelines article has a checklist in the packages (for a similar discussion, see Software as a Black
Appendix for ensuring that all key parameters are reported Box section in Clayson & Miller, 2017b). There are numer-
(Keil et al., 2014). A completed checklist could be submitted ous software packages available for processing and analyzing
with journal articles to ensure that all methodological param- ERP data, and it is impossible to be an expert in all meth-
eters are communicated in the manuscript. odologies. Hence, the appeal of prepackaged code is under-
It is possible that some researchers are carefully consider- standable, but such code can become a “black box” (Clayson
ing each methodological parameter and simply not reporting & Miller, 2017b). The extent to which popular software anal-
each parameter due to oversight. Alternatively, researchers ysis packages can build in validity checks would be helpful
might be exploiting researcher degrees of freedom to find for researchers who are not easily able to judge when anal-
statistically significant effects. One approach that combats ysis approaches are appropriate. Another useful feature for
the exploitation of researcher degrees of freedom is study popular software packages that process ERP data would be
preregistration or the registered reports format adopted in functions that generate a printout of how the data were pro-
some journals (Larson, 2016; Munafò et al., 2017; Nosek cessed. Ideally such a processing summary would mirror the
et al., 2018; Nosek & Lakens, 2014). In essence, a prereg- Appendix of the ERP guidelines article (Keil et al., 2014) and
istration is a locked analysis plan that is sealed before any provide all information that should be reported in a Methods
data analysis (and ideally data collection) is conducted. section.
Although it is easy to suggest that researchers collect more participants to improve statistical power, there may be practical barriers that prevent some from doing so (e.g., limited access to particular patient populations). One approach to improving statistical power is collaboration through multisite ERP studies (Baldwin, 2017), which is already a popular approach among some fMRI groups. Such multisite studies can increase sample sizes, statistical power, and the generalizability of findings. In addition to multisite studies, depositing EEG data in repositories can facilitate the sharing and combining of data sets to hopefully improve statistical power. A few repositories currently exist for this purpose, including the OpenfMRI database (https://www.openfmri.org/; Poldrack et al., 2013; Poldrack & Gorgolewski, 2017) and the Patient Repository for EEG Data + Computational Tools (PRED+CT; http://predict.cs.unm.edu/; Cavanagh, Napolitano, Wu, & Mueen, 2017). Furthermore, funding agencies have started to require depositing data to repositories.
For example, all grant applications and awards submitted to the National Institute of Mental Health (NIMH) that involve human subjects after January 1, 2020, will be required to deposit all raw and analyzed data, including psychophysiological data, unless an explicit exception is granted (https://grants.nih.gov/grants/guide/notice-files/NOT-MH-19-033.html).

Along the lines of statistical power, one feature that is often underappreciated when conducting a priori power calculations is the impact of score reliability on statistical power. Unreliable scores can reduce statistical power (Boudewyn, Luck, Farrens, & Kappenman, 2017; Clayson & Miller, 2017b; Fischer, Klein, & Ullsperger, 2017; Kolossa & Kopp, 2018; Luck & Gaspelin, 2017). Given the relationship between the number of trials included in an ERP average and internal consistency estimates of reliability (Clayson & Miller, 2017a), power contour plots can be used to estimate the optimal balance among the number of experimental trials, the number of participants, and the statistical power for a given effect size (Baker et al., 2019). Positive relationships between reliability and effect sizes have been shown in both between-group (Hajcak, Meyer, & Kotov, 2017) and within-person (Clayson & Miller, 2017a) ERP studies. For example, between-group effect sizes (healthy controls vs. people with generalized anxiety disorder) increased with increases in internal consistency (Hajcak et al., 2017).
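To make this trade-off concrete, the following minimal Python sketch approximates one such power contour. It assumes (a) the Spearman-Brown prophecy formula for how internal consistency scales with trial count, (b) classical-test-theory attenuation of a between-group standardized effect (observed d equals true d times the square root of reliability), and (c) a conventional two-sample t test computed with statsmodels. The anchor reliability of .70 at 20 trials and the assumed true effect size of d = 0.80 are illustrative assumptions for demonstration, not estimates from the present review.

    # Illustrative sketch only: anchor reliability, true effect size, and the
    # attenuation and Spearman-Brown assumptions are ours, not study estimates.
    import numpy as np
    from statsmodels.stats.power import TTestIndPower

    REF_TRIALS, REF_RELIABILITY = 20, 0.70
    TRUE_D, ALPHA = 0.80, 0.05

    def reliability_for_trials(n_trials):
        """Spearman-Brown projection of internal consistency to a new trial count."""
        k = n_trials / REF_TRIALS
        return (k * REF_RELIABILITY) / (1 + (k - 1) * REF_RELIABILITY)

    def attenuated_d(true_d, rel):
        """Observed between-group effect size under classical test theory."""
        return true_d * np.sqrt(rel)

    analysis = TTestIndPower()
    print("trials  n/group  reliability  power")
    for n_trials in (10, 20, 40, 80):
        rel = reliability_for_trials(n_trials)
        d_obs = attenuated_d(TRUE_D, rel)
        for n_per_group in (15, 25, 40, 80):
            power = analysis.power(effect_size=d_obs, nobs1=n_per_group,
                                   alpha=ALPHA, ratio=1.0)
            print(f"{n_trials:6d}  {n_per_group:7d}  {rel:11.2f}  {power:5.2f}")

    # A priori sample size for the attenuated effect at 40 trials and 80% power
    n_needed = analysis.solve_power(effect_size=attenuated_d(TRUE_D, reliability_for_trials(40)),
                                    power=0.80, alpha=ALPHA, ratio=1.0)
    print("Participants per group for .80 power at 40 trials:", int(np.ceil(n_needed)))

A grid of this kind makes explicit that adding trials (which raises reliability) and adding participants are partially interchangeable routes to the same statistical power, which is the central point of the power-contour approach (Baker et al., 2019); the final lines show the kind of a priori sample size calculation recommended in the Conclusions.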
4.5 | Conclusions
An average of 63% of key methodological parameters was reported across 150 ERP studies from five prominent journals, which suggests that the underreporting of recommended guidelines is a ubiquitous practice. Hopefully, this underreporting is due to oversight on the part of authors. However, it is possible that underreporting reflects attempts to obscure data processing and analysis practices when authors exploit researcher degrees of freedom to find statistically significant effects. We have recommended some solutions, such as the preregistration of data processing and analysis plans, to motivate rigor over novelty. We hope that, moving forward, authors, reviewers, and editors encourage the use of the ERP reporting guidelines (Keil et al., 2014) to facilitate communication among researchers and improve the replicability of ERP research.

Small sample sizes and low statistical power appear endemic to ERP studies, which is consistent with the larger field of cognitive neuroscience. Our findings suggest that ERP research is powered to detect only large statistical effects for simple statistical designs. Anecdotally speaking, one of the advantages of using ERPs over some other neuroimaging techniques, such as fMRI, is that conducting ERP studies is more affordable. The affordability is often cited as an advantage because larger sample sizes can be obtained for a lower cost. Although ERP research is indeed more affordable, the typical ERP study seems to suffer from the same problems associated with small sample sizes as fMRI studies, and this fact appears to be underappreciated, at least anecdotally. Small samples, low statistical power, and undisclosed researcher flexibility contribute to the low replicability of ERP studies. The replicability of ERP studies can be improved by conducting a priori sample size calculations to ensure adequately powered samples for relevant effect sizes and by conducting multisite ERP studies to increase sample size and the generalizability of findings.

ACKNOWLEDGMENTS
Writing of this manuscript was supported in part by the Office of Academic Affiliations, Advanced Fellowship Program in Mental Illness Research and Treatment, Department of Veterans Affairs. Michael J. Larson, Ph.D., is the editor-in-chief of the International Journal of Psychophysiology and receives an honorarium from Elsevier, the journal publisher. The International Journal of Psychophysiology was one of the journals reviewed for the present manuscript.

ORCID
Peter E. Clayson https://orcid.org/0000-0003-4437-6598
Kaylie A. Carbine https://orcid.org/0000-0003-2696-8880
Scott A. Baldwin https://orcid.org/0000-0003-3428-0437
Michael J. Larson https://orcid.org/0000-0002-8199-8065

REFERENCES
Baker, D. H., Vilidaite, G., Lygo, F. A., Smith, A. K., Flack, T. R., Gouws, A. D., & Andrews, T. J. (2019). Power contours: Optimising sample size and precision in experimental psychology and human neuroscience. ArXiv. https://arxiv.org/abs/1902.06122
Baldwin, S. A. (2017). Improving the rigor of psychophysiology research. International Journal of Psychophysiology, 111, 5–16. https://doi.org/10.1016/j.ijpsycho.2016.04.006
Bonett, D. G. (2008). Confidence intervals for standardized linear contrasts of means. Psychological Methods, 13, 99–109. https://doi.org/10.1037/1082-989X.13.2.99
Boudewyn, M. A., Luck, S. J., Farrens, J. L., & Kappenman, E. S. (2017). How many trials does it take to get a significant ERP effect? It depends. Psychophysiology, 55(6), e13049. https://doi.org/10.1111/psyp.13049
Brand, A., & Bradley, M. T. (2016). The precision of effect size estimation from published psychological research: Surveying confidence intervals. Psychological Reports, 118, 154–170. https://doi.org/10.1177/0033294115625265
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013a). Confidence and precision increase with high statistical power. Nature Reviews Neuroscience, 14, 585–586. https://doi.org/10.1038/nrn3475-c4
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013b). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365–376. https://doi.org/10.1038/nrn3475
Carbine, K. A., Lindsey, H. M., Rodeback, R. E., & Larson, M. J. (2019). Quantifying evidential value and selective reporting in recent and 10-year past psychophysiological literature: A pre-registered P-curve analysis. International Journal of Psychophysiology, 142, 33–49. https://doi.org/10.1016/j.ijpsycho.2019.06.004
Carp, J. (2012). The secret lives of experiments: Methods reporting in the fMRI literature. NeuroImage, 63, 289–300. https://doi.org/10.1016/j.neuroimage.2012.07.004
Carp, J. (2013). Better living through transparency: Improving the reproducibility of fMRI results through comprehensive methods reporting. Cognitive, Affective, & Behavioral Neuroscience, 13, 660–666. https://doi.org/10.3758/s13415-013-0188-0
Carter, E. C., Schönbrodt, F. D., Gervais, W. M., & Hilgard, J. (2018). Correcting for bias in psychology: A comparison of meta-analytic methods. PsyArXiv. https://doi.org/10.31234/osf.io/9h3nu
Cavanagh, J. F., Napolitano, A., Wu, C., & Mueen, A. (2017). The patient repository for EEG data + computational tools (PRED+CT). Frontiers in Neuroinformatics, 11, 67. https://doi.org/10.3389/fninf.2017.00067
Chalmers, I., & Glasziou, P. (2009). Avoidable waste in the production and reporting of research evidence. Lancet, 374, 86–89. https://doi.org/10.1016/S0140-6736(09)60329-9
Chambers, C. (2017). The seven deadly sins of psychology: A manifesto for reforming the culture of scientific practice. Princeton, NJ: Princeton University Press.
Clayson, P. E., Carbine, K. A., & Larson, M. J. (in press). Error-related negativity and reward positivity as biomarkers of depression: P-curving the evidence. International Journal of Psychophysiology.
Clayson, P. E., & Miller, G. A. (2017a). ERP Reliability Analysis (ERA) Toolbox: An open-source toolbox for analyzing the reliability of event-related potentials. International Journal of Psychophysiology, 111, 68–79. https://doi.org/10.1016/j.ijpsycho.2016.10.012
Clayson, P. E., & Miller, G. A. (2017b). Psychometric considerations in the measurement of event-related brain potentials: Guidelines for measurement and reporting. International Journal of Psychophysiology, 111, 57–67. https://doi.org/10.1016/j.ijpsycho.2016.09.005
Cohen, M. X. (2017). Rigor and replication in time-frequency analyses of cognitive electrophysiology data. International Journal of Psychophysiology, 111, 80–87. https://doi.org/10.1016/j.ijpsycho.2016.02.001
Cook, E. W., III, & Miller, G. A. (1992). Digital filtering: Background and tutorial for psychophysiologists. Psychophysiology, 29, 350–362. https://doi.org/10.1111/j.1469-8986.1992.tb01709.x
Cramer, A. O. J., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P. P. P., … Wagenmakers, E.-J. (2015). Hidden multiplicity in exploratory multiway ANOVA: Prevalence and remedies. Psychonomic Bulletin & Review, 23, 640–647. https://doi.org/10.3758/s13423-015-0913-5
De Boeck, P., & Jeon, M. (2018). Perceived crisis and reforms: Issues, explanations, and remedies. Psychological Bulletin, 144, 757–777. https://doi.org/10.1037/bul0000154
Dien, J. (2017). Best practices for repeated measures ANOVAs of ERP data: Reference, regional channels, and robust ANOVAs. International Journal of Psychophysiology, 111, 42–56. https://doi.org/10.1016/j.ijpsycho.2016.09.006
Donchin, E., Callaway, E., Cooper, E., & Desmedt, R. (1977). Publication criteria for studies of evoked potentials (EP) in man. Report of the methodology committee. In J. E. Desmedt (Ed.), Progress in clinical neurophysiology: Vol. 1. Attention, voluntary contraction and event-related cerebral potentials (pp. 1–11). Basel, Switzerland: Karger.
Edgar, J. C., Stewart, J. L., & Miller, G. A. (2005). Digital filters in ERP research. In T. C. Handy (Ed.), Event-related potentials: A methods handbook (pp. 33–56). Cambridge, MA: MIT Press.
Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149–1160. https://doi.org/10.3758/BRM.41.4.1149
Fischer, A. G., Klein, T. A., & Ullsperger, M. (2017). Comparing the error-related negativity across groups: The impact of error- and trial-number differences. Psychophysiology, 54, 998–1009. https://doi.org/10.1111/psyp.12863
Forstmeier, W., Wagenmakers, E.-J., & Parker, T. H. (2017). Detecting and avoiding likely false-positive findings—A practical guide. Biological Reviews of the Cambridge Philosophical Society, 92, 1941–1968. https://doi.org/10.1111/brv.12315
Gelman, A. (2018). The failure of null hypothesis significance testing when studying incremental changes, and what to do about it. Personality and Social Psychology Bulletin, 44, 16–23. https://doi.org/10.1177/0146167217729162
Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9, 641–651. https://doi.org/10.1177/1745691614551642
Glasziou, P., Altman, D. G., Bossuyt, P., Boutron, I., Clarke, M., Julious, S., … Wager, E. (2014). Reducing waste from incomplete or unusable reports of biomedical research. Lancet, 383, 267–276. https://doi.org/10.1016/S0140-6736(13)62228-X
Guo, Q., Parlar, M., Truong, W., Hall, G., Thabane, L., McKinnon, M., … Pullenayegum, E. (2014). The reporting of observational clinical functional magnetic resonance imaging studies: A systematic review. PLOS One, 9, e94412. https://doi.org/10.1371/journal.pone.0094412
Hajcak, G., Meyer, A., & Kotov, R. (2017). Psychometrics and the neuroscience of individual differences: Internal consistency limits between-subjects effects. Journal of Abnormal Psychology, 126, 823–834. https://doi.org/10.1037/abn0000274
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2, e124. https://doi.org/10.1371/journal.pmed.0020124
Ioannidis, J. P. A. (2008). Why most discovered true associations are inflated. Epidemiology, 19, 640–648. https://doi.org/10.1097/EDE.0b013e31818131e7
Ioannidis, J. P. A., Greenland, S., Hlatky, M. A., Khoury, M. J., Macleod, M. R., Moher, D., … Tibshirani, R. (2014). Increasing value and reducing waste in research design, conduct, and analysis. Lancet, 383, 166–175. https://doi.org/10.1016/S0140-6736(13)62227-8
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524–532. https://doi.org/10.1177/0956797611430953
Keil, A., Debener, S., Gratton, G., Junghöfer, M., Kappenman, E. S., Luck, S. J., … Yee, C. M. (2014). Committee report: Publication guidelines and recommendations for studies using electroencephalography and magnetoencephalography. Psychophysiology, 51, 1–21. https://doi.org/10.1111/psyp.12147
Kolossa, A., & Kopp, B. (2018). Data quality over data quantity in computational cognitive neuroscience. NeuroImage, 172, 775–785. https://doi.org/10.1016/j.neuroimage.2018.01.005
Lakens, D. (2017). Equivalence tests: A practical primer for t tests, correlations, and meta-analyses. Social Psychological and Personality Science, 8, 355–362. https://doi.org/10.1177/1948550617697177
Larson, M. J. (2016). Commitment to cutting-edge research with rigor and replication in psychophysiological science. International Journal of Psychophysiology, 102, ix–x.
Larson, M. J., & Carbine, K. A. (2017). Sample size calculations in human electrophysiology (EEG and ERP) studies: A systematic review and recommendations for increased rigor. International Journal of Psychophysiology, 111, 33–41. https://doi.org/10.1016/j.ijpsycho.2016.06.015
LeBel, E. P., & Peters, K. R. (2011). Fearing the future of empirical psychology: Bem's (2011) evidence of psi as a case study of deficiencies in modal research practice. Review of General Psychology, 15, 371–379. https://doi.org/10.1037/a0025172
Lilienfeld, S. O. (2017). Psychology's replication crisis and the grant culture: Righting the ship. Perspectives on Psychological Science, 12, 660–664. https://doi.org/10.1177/1745691616687745
Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355, 584–585. https://doi.org/10.1126/science.aal3618
Luck, S. J., & Gaspelin, N. (2017). How to get statistically significant effects in any ERP experiment (and why you shouldn't). Psychophysiology, 54, 146–157. https://doi.org/10.1111/psyp.12639
Moser, J. S., Moran, T. P., Kneip, C., Schroder, H. S., & Larson, M. J. (2016). Sex moderates the association between symptoms of anxiety, but not obsessive compulsive disorder, and error-monitoring brain activity: A meta-analytic review. Psychophysiology, 53, 21–29. https://doi.org/10.1111/psyp.12509
Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., du Sert, N. P., … Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, 1–9. https://doi.org/10.1038/s41562-016-0021
Muncy, N. M., Hedges-Muncy, A. M., & Kirwan, C. B. (2017). Discrete pre-processing step effects in registration-based pipelines, a preliminary volumetric study on T1-weighted images. PLOS One, 12, e0186071. https://doi.org/10.1371/journal.pone.0186071
Nelson, L. D., Simmons, J., & Simonsohn, U. (2018). Psychology's renaissance. Annual Review of Psychology, 69, 511–534. https://doi.org/10.1146/annurev-psych-122216-011836
Nelson, L. D., Simmons, J. P., & Simonsohn, U. (2012). Let's publish fewer papers. Psychological Inquiry, 23, 291–293. https://doi.org/10.1080/1047840X.2012.705245
Nitschke, J. B., Miller, G. A., & Cook, E. W., III. (1998). Digital filtering in EEG/ERP analysis: Some technical and methodological comparisons. Behavior Research Methods, Instruments, and Computers, 30, 54–67. https://doi.org/10.3758/BF03209416
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences of the United States of America, 115, 2600–2606. https://doi.org/10.1073/pnas.1708274114
Nosek, B. A., & Lakens, D. (2014). Registered reports: A method to increase the credibility of published results. Social Psychology, 45, 137–141. https://doi.org/10.1027/1864-9335/a000192
Picton, T. W., Bentin, S., Berg, P., Donchin, E., Hillyard, S. A., Johnson, R., … Taylor, M. J. (2000). Guidelines for using human event-related potentials to study cognition: Recording standards and publication criteria. Psychophysiology, 37, 127–152.
Pivik, R. T., Broughton, R. J., Coppola, R., Davidson, R. J., Fox, N., & Nuwer, M. R. (1993). Guidelines for the recording and quantitative analysis of electroencephalographic activity in research contexts. Psychophysiology, 30, 547–558.
Poldrack, R. A., Baker, C. I., Durnez, J., Gorgolewski, K. J., Matthews, P. M., Munafò, M. R., … Yarkoni, T. (2017). Scanning the horizon: Towards transparent and reproducible neuroimaging research. Nature Reviews Neuroscience, 18, 115–126. https://doi.org/10.1038/nrn.2016.167
Poldrack, R. A., Barch, D. M., Mitchell, J. P., Wager, T. D., Wagner, A. D., Devlin, J. T., … Milham, M. P. (2013). Towards open sharing of task-based fMRI data: The OpenfMRI project. Frontiers in Neuroinformatics, 7, 12. https://doi.org/10.3389/fninf.2013.00012
Poldrack, R. A., & Gorgolewski, K. J. (2017). OpenfMRI: Open sharing of task fMRI data. NeuroImage, 144, 259–261. https://doi.org/10.1016/j.neuroimage.2015.05.073
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86, 638–641.
Schmidt, S. J. (2009). Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Review of General Psychology, 13, 90–100. https://doi.org/10.1037/a0015108
Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15, 657–680.
Shrout, P. E., & Rodgers, J. L. (2018). Psychology, science, and knowledge construction: Broadening perspectives from the replication crisis. Annual Review of Psychology, 69, 487–510. https://doi.org/10.1146/annurev-psych-122216-011845
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. https://doi.org/10.1177/0956797611417632
Simons, D. J. (2014). The value of direct replication. Perspectives on Psychological Science, 9, 76–80. https://doi.org/10.1177/1745691613514755
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014a). p-Curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666–681. https://doi.org/10.1177/1745691614553988
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014b). p-Curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143, 534–547. https://doi.org/10.1037/a0033242
Simonsohn, U., Simmons, J. P., & Nelson, L. D. (2015). Better p-curves: Making p-curve analysis more robust to errors, fraud, and ambitious p-hacking, a reply to Ulrich and Miller (2015). Journal of Experimental Psychology: General, 144, 1146–1152. https://doi.org/10.1037/xge0000104
StataCorp. (2017). Stata Statistical Software: Release 15. College Station, TX: StataCorp LLC.
Szucs, D., & Ioannidis, J. P. A. (2017). Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLOS Biology, 15, e2000797. https://doi.org/10.1371/journal.pbio.2000797
Tackett, J. L., Brandes, C. M., King, K. M., & Markon, K. E. (2019). Psychology's replication crisis and clinical psychological science. Annual Review of Clinical Psychology, 15, 579–604. https://doi.org/10.1146/annurev-clinpsy-050718-095710
Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105–110.
Welch, B. L. (1947). The generalization of 'Student's' problem when several different population variances are involved. Biometrika, 34, 28–35. https://doi.org/10.2307/2332510
Widmann, A., & Schröger, E. (2012). Filter effects and filter artifacts in the analysis of electrophysiological data. Frontiers in Psychology, 3, 233. https://doi.org/10.3389/fpsyg.2012.00233
Widmann, A., Schröger, E., & Maess, B. (2015). Digital filter design for electrophysiological data—A practical approach. Journal of Neuroscience Methods, 250, 34–46. https://doi.org/10.1016/j.jneumeth.2014.08.002
Yom, S. (2018). Analytic transparency, radical honesty, and strategic incentives. Political Science & Politics, 51, 416–421. https://doi.org/10.1017/S1049096517002554

How to cite this article: Clayson PE, Carbine KA, Baldwin SA, Larson MJ. Methodological reporting behavior, sample sizes, and statistical power in studies of event-related potentials: Barriers to reproducibility and replicability. Psychophysiology. 2019;56:e13437. https://doi.org/10.1111/psyp.13437