
ORIGINAL RESEARCH

published: 18 February 2022


doi: 10.3389/fnbeh.2022.835444

Role of Environment and Experimenter in Reproducibility of Behavioral Studies With Laboratory Mice

Martina Nigri 1,2*, Johanna Åhlgren 3, David P. Wolfer 1,2 and Vootele Voikar 3,4*

1 Faculty of Medicine, Institute of Anatomy, University of Zurich, Zurich, Switzerland
2 Department of Health Sciences and Technology, Institute of Human Movement Sciences and Sport, ETH Zürich, Zurich, Switzerland
3 Laboratory Animal Center, HiLIFE, University of Helsinki, Helsinki, Finland
4 Neuroscience Center, HiLIFE, University of Helsinki, Helsinki, Finland

Behavioral phenotyping of mice has received a great deal of attention during the past three decades. However, there is still a pressing need to understand the variability caused by environmental and biological factors, human interference, and poorly standardized experimental protocols. The inconsistency of results is often attributed to the inter-individual difference between the experimenters and environmental conditions. The present work aims to dissect the combined influence of the experimenter and the environment on the detection of behavioral traits in two inbred strains most commonly used in behavioral genetics due to their contrasting phenotypes, the C57BL/6J and DBA/2J mice. To this purpose, the elevated O-maze, the open field with object, the accelerating rotarod and the Barnes maze tests were performed by two experimenters in two diverse laboratory environments. Our findings confirm the well-characterized behavioral differences between these strains in exploratory behavior, motor performance, learning and memory. Moreover, the results demonstrate how the experimenter and the environment influence the behavioral tests with a variable-dependent effect, often with mutually exclusive contributions. In this context, our study highlights how both the experimenter and the environment can have an impact on the strain effect size without altering the direction of the conclusions. Importantly, the general agreement on the results is reached by converging evidence from multiple measures addressing the same trait. In conclusion, the present work elucidates the contribution of both the experimenter and the laboratory environment in the intricate field of reproducibility in mouse behavioral phenotyping.

Keywords: mouse behavioral phenotyping, inbred strains, reproducibility, experimenter effect, environment effect

Edited by: Alena Savonenko, Johns Hopkins University, United States

Reviewed by: Thomas J. Gould, The Pennsylvania State University, United States; Ioannis Zalachoras, Swiss Federal Institute of Technology Lausanne, Switzerland; Laurel Seemiller, The Pennsylvania State University, United States, in collaboration with reviewer TG

*Correspondence: Martina Nigri, martina.nigri@anatomy.uzh.ch; Vootele Voikar, vootele.voikar@helsinki.fi

Specialty section: This article was submitted to Learning and Memory, a section of the journal Frontiers in Behavioral Neuroscience

Received: 14 December 2021
Accepted: 26 January 2022
Published: 18 February 2022
Citation: Nigri M, Åhlgren J, Wolfer DP and Voikar V (2022) Role of Environment and Experimenter in Reproducibility of Behavioral Studies With Laboratory Mice. Front. Behav. Neurosci. 16:835444. doi: 10.3389/fnbeh.2022.835444

INTRODUCTION

Behavior, representing the final output of the nervous system in all living organisms, results from the interaction between genotype and environment. Measures of behavioral outcomes are therefore essential for characterizing the animal models of neurodegenerative and neuropsychiatric diseases. As a consequence, behavioral phenotyping of genetically modified mice has become a commonly used approach in behavioral neuroscience and genetics over the last 25 years (Voikar, 2020).

Frontiers in Behavioral Neuroscience | www.frontiersin.org 1 February 2022 | Volume 16 | Article 835444


Nigri et al. Reproducibility in Mouse Behavioral Phenotyping

Along with the widespread use of this approach, some serious concerns about the validity and interpretation of data derived from knockout mice in general were raised, related to the problems with defining the genetic background of mutant mice (Gerlai, 1996; Silva et al., 1997). In addition, it appeared that conflicting results from different laboratories using supposedly the same mutant (or inbred) mouse lines were rather common, and the solution was seen in standardization.

In order to test the success of standardization, a seminal study was carried out in three laboratories (Crabbe et al., 1999). Despite rigorous standardization of test protocols, equipment, animals and many environmental variables, the outcome revealed systematic differences between the laboratories. Moreover, and more importantly, some phenotypic differences were dependent on the specific testing lab. These findings opened the debate over the need and usefulness of standardization (Würbel, 2000, 2002; Wahlsten, 2001; Van der Staay and Steckler, 2002) and, in a way, paved the way to more extensive discussions about reproducibility (Editorial, 2009, 2013). Revisiting the 1999 study with a more detailed analysis revealed that the most salient difference between the laboratories might have been introduced by the persons having contact with the experimental animals (Wahlsten et al., 2003). The role of the experimenter effect has been further addressed and confirmed by other studies (Lariviere et al., 2001; Chesler et al., 2002; Bohlen et al., 2014; Sorge et al., 2014). The method of handling the animals also deserves full appreciation (Hurst and West, 2010).

Another conclusion of the extended analysis was that even if there were advantages of test standardization, the laboratory environments could never be made sufficiently similar to guarantee identical results (Wahlsten et al., 2003). In fact, for many assays achieving an “identical” result is not needed; a more important measure of reproducibility is reaching consensus in the direction of the effect (Goodman et al., 2016; Kafkafi et al., 2018). However, this may not be possible to discuss or assess if the design and reporting of animal studies is deficient (Kilkenny et al., 2009; Editorial, 2019). To this end, the authors should familiarize themselves with guidelines for preparing, conducting and reporting before even starting the experiments (Smith et al., 2018; Percie du Sert et al., 2020). In addition, for sound and rigorous research, confirmation studies by different groups and coordinated multicenter trials are recommended (Mogil and Macleod, 2017).

Several multi-laboratory studies have been carried out since 1999. For instance, Lewejohann et al. concluded that the reliability of behavioral phenotyping is not challenged seriously by experimenter and laboratory environment as long as appropriate standardizations are met and suitable controls are involved (Lewejohann et al., 2006). In addition, development of standard operating procedures for a large-scale phenotyping project generated reproducible results between laboratories for a number of the test output parameters (Mandillo et al., 2008). Another study demonstrated that analysis of mouse timing behavior led to robust and reliable endophenotypes across different labs (Maggi et al., 2014). Yet one more project addressed the standardization of experimental conditions in a multi-laboratory effort (Richter et al., 2011). Overall, these studies recognize the need for good planning and expertise in behavioral testing as a prerequisite for reliable and reproducible research. It would be important to add here that even more reproducible results have been obtained when animals are studied by means of an automated home-cage based approach (Krackow et al., 2010; Robinson and Riedel, 2014; Robinson et al., 2018; Arroyo-Araujo et al., 2019). On the other hand, such automated and unbiased measurements are still able to detect differences in behavior between the laboratories, which may need to be considered in evaluation (Pernold et al., 2019).

The availability of well-characterized inbred mouse strains allows investigators to study the gene-environment interactions. Efforts are made toward establishing a ‘mouse phenome’ database where reference values of common inbred strains in a variety of behavioral tasks and physiological measurements can be found (Paigen and Eppig, 2000; Moldin et al., 2001). The C57BL/6 and DBA/2J mice are the oldest, and probably the most commonly used inbred strains in behavioral genetics. For many behavioral domains, they are considered to display a moderate phenotype (Crawley et al., 1997), which allows a feasible detection of behavioral changes at the baseline and in response to various manipulations (Stiedl et al., 1999; Cabib et al., 2000; Voikar et al., 2005; Youn et al., 2012).

The aim of the present study was to further evaluate the relative impact of the experimenter and the environment on replicability of mouse behavioral phenotype. To this aim, a battery of behavioral tests was performed by two experimenters in two diverse laboratory environments. Selection of behavioral tests was based on the assumption that both objective (automated recording by video-tracking) and subjective (handling, manually recorded behavior) measures were considered. The C57BL/6 and DBA/2J inbred strains were deliberately chosen for their markedly different and well-characterized behaviors. However, no particular emphasis was placed on standardizing environmental parameters.

MATERIALS AND METHODS

All the behavioral tests were carried out by a 25-year-old female experimenter (M) and a 49-year-old male experimenter (V) in two diverse laboratory environments: the Institute of Anatomy in Zürich (Z) and the Laboratory Animal Center in Helsinki (H). All the experimental procedures were carried out in accordance with the European legislation (Directive 2010/63/EU), having been approved by the veterinary office of the Canton of Zürich (license number 060/2021) and National Animal Experiment Board of Finland (license ESAVI/10165/04.10.07/2016).

Animals and Environment

Four batches of eight-week-old female C57BL/6J (n = 12) and DBA/2J (n = 12) mice were obtained from Charles River Laboratories (France). Thus, the total number of animals used was 96 (48 C57BL/6J and 48 DBA/2J). The mice were kept in same strain-groups of 4 in standard Type III cages (ZH: temperature 21.9 ± 0.3 °C and relative humidity 60.2 ± 9.6%) or in individually ventilated cages (HE: temperature 21.7 ± 0.4 °C


and relative humidity 55.5 ± 5.3%) for an adaptation period of three weeks before the behavioral testing. Food and water were available ad libitum (see the Table 1 and Supplementary Figure 3, for details). Cage changes occurred once a week since the mice arrived at the testing animal facility. Before testing, the animal caretakers took care of clean cages. To avoid stress during behavioral testing, cage changes were always performed on Fridays, allowing animals to adapt to new cages over the weekend. During the experiment (starting with handling, marking and weighing the mice), the experimenter taking care of the entire behavioral test battery was also moving the mice to the clean cages. The first two batches were housed under a 12/12 inverted light-dark cycle (light on 20:00–8:00) and the testing occurred during the dark phase in Zurich (in August 2018). The mice in Helsinki (third and fourth batch) were exposed to normal light (light on 6:00–18:00) with the behavioral testing occurring during the light phase (in September 2018).

TABLE 1 | Details of housing and husbandry in two laboratories.

Parameter | Zurich | Helsinki
Light cycle | Reversed (light on 20:00–8:00) | Normal (light on 6:00–18:00)
Food | KLIBA NAFAG – Switzerland; Aliment for mice and rats – 3436 (pellet 15 mm) | Envigo Global Diet 2916C (pellet 12 mm)
Water | Tap water, ad libitum | Filtered and UV-irradiated, ad libitum
Bedding | Aspen chips 2.5–3.5 mm; J.RETTENMAIER & SÖHNE GMBH + CO KG; Rosenberg, Germany | aspen chips 5 mm × 5 mm × 1 mm, 4HP; Tapvei, Estonia
Nest material | Tissue paper | aspen strips, PM90L, Tapvei, Estonia
Additional enrichment | Red plastic shelter (Zoonlab); cardboard shelter | 3 aspen bricks (50 mm × 10 mm × 10 mm, Tapvei, Estonia)
Cage dimensions | Eurostandard Type III cage, overall cage dimensions 425 mm × 276 mm × 153 mm, floor area 820 cm²; covered with filter top; Tecniplast, Italy | Mouse IVC Green Line – 391 mm × 199 mm × 160 mm, floor area 501 cm²; Tecniplast, Italy
Cage change | once/week | once/week
Temperature (measured during exp) | 21.9 ± 0.3 °C (mean, SEM) | 21.7 ± 0.4 °C (mean, SEM)
Humidity (measured during exp) | 60.2 ± 9.6% (mean, SEM) | 55.5 ± 5.3% (mean, SEM)
Animal facility | Conventional | Standard Pathogen Free
Protecting clothing | Disposable cap and coat on top of personal clothing, shoes, gloves | Full re-dressing - cap, mask, lab coat, socks, lab shoes, gloves, entry to animal facility through air shower
Time of experiments | Between 8:30 and 15:00; 20.7.-10.8.2018 | Between 8:30 and 15:00; 31.8.-28.9.2018

Video Tracking

During the elevated O-maze, open field with object and Barnes maze tests, the mice were video tracked using a Noldus Ethovision XT15 system (Noldus Information Technology, Wageningen, The Netherlands). The data were exported to custom designed software Wintrack (Wolfer et al., 2001) for further analysis.

Conventional Behavioral Testing

The behavioral testing started when the mice were 12 weeks old and each experimenter was introduced to them by a gentle handling (∼3 min – picking up from the cage, tail marking, measuring body weight, and allowing to explore on the experimenter’s palm) three days before start of testing. Sample size calculation was based on previous experience. Same protocols and similar testing procedures were applied by two experimenters in the two laboratories. The behavioral tests were carried out in the following order: elevated O-maze, open field with object, rotarod and Barnes maze tests. Order of testing the animals was randomized and counterbalanced. A schematic overview of the experimental approach is presented in Figure 1.

Behavioral Procedures

Elevated O-Maze

The test is used to assess unconditioned anxiety like-behaviors in mice (Shepherd et al., 1994). The behavioral device consists of a 5.5 cm wide annular runway with an outer diameter of 46 cm. The apparatus was placed inside the large open field arena approximately 40 cm above the floor. The two opposing 90° closed sectors are protected by 16 cm high inner and outer walls of grey polyvinyl chloride. The remaining two open sectors (30 × 5 cm) have no walls. Illumination was applied by indirect diffuse room light (20–25 lux). During the experiment the animals were placed in the center of the maze facing one of the closed sectors and observed for 10 min. Exploratory head dips, stretched attends, grooming and rearing events were manually recorded using the keyboard event-recorder provided by the video tracking system.

Open Field With Object

The test is used to measure locomotion, anxiety, explorative and stereotypical behaviors such as grooming and rearing in rodents (Walsh and Cummins, 1976; Voikar and Stanford, 2021). The behavioral apparatus consisted of four 50 cm × 50 cm arenas (with wall height of 40 cm) placed under camera for recording. The illumination was applied by indirect diffuse room light (20–25 lux). Each animal was released in one of the corners and monitored for 15 min. The mice were then removed and placed in the holding cage, the number of the fecal boli was counted and a 12 cm × 4 cm semi-transparent 50 ml falcon tube was placed in the center of each arena. The animals were then released in the arena and observed for additional 15 min.

Rotarod

Motor coordination and learning was tested by using the digitally controlled mouse rotarod apparatus (Ugo Basile, Italy). The device has a drum with diameter of 30 mm and provides adjustable speed (2–80 rpm) and acceleration (600 –60000 ). The illumination was applied by indirect diffuse room light (20–25 lux). Four mice were simultaneously placed on the rotarod


apparatus with the rod rotating at 4 rpm during the first minute. The rotation speed is increased every 30 s by 4 rpm and a trial terminates either when the mouse falls down or when 5 min are completed. Each animal was submitted to five trials with an inter-trial interval of 30 min. The time to fall, digitally and manually recorded, provides the measure of motor ability and improvement across trials measures the motor learning.

FIGURE 1 | Overview of the experimental design. The behavioral testing was carried out using two batches of eight weeks old C57BL/6J (n = 12) and DBA/2J (n = 12) mice. All the behavioral procedures were applied by a 25-year-old female experimenter (M) and a 49-year-old male experimenter (V) in two different laboratories: the Institute of Anatomy in Zürich (Z) and the Laboratory Animal Center in Helsinki (H). An adaptation phase of four weeks was followed by the elevated O-maze, open field with object, rotarod and Barnes maze tests.

Barnes Maze

The test is used to assess spatial learning and memory in mice and rats (Barnes, 1979). The maze consists of a circular platform (100 cm diameter) with 20 holes (5 cm diameter) around the perimeter (Ugo Basile, Italy). One of the holes was connected with a dark chamber filled with bedding material and two food pellets, the escape box. Two days before the experiment, each animal was introduced to the escape box for 2–3 min. The bright light (500–600 lux on the platform) was used to induce the mice to find and enter the escape box. The mice were trained to find the escape box in three training trials per day (inter-trial interval at least 60 min) over three days. The training trial ended when the mouse entered the escape box or after 3 min as cut-off time (in this case, the mouse was gently directed to the escape box). The memory test was carried out during the first trial on day 4 when the mice were monitored on the platform without escape box for 90 s. Thereafter, reversal learning was carried out, where the escape box was moved under the opposite hole and the mice received three training trials on day 4 and 5. After the last training trial on day 5, the second memory test was performed.

Statistical Analyses

The statistical analysis, blinded and performed by a third person, was conducted using an ANOVA model with strain (B6 = C57BL/6J, D2 = DBA/2J), experimenter (M = female experimenter, V = male experimenter) and laboratory environments (Z = Zürich, H = Helsinki) as between subject factors. Significant interactions were further explored by pairwise t-tests or by splitting the ANOVA model, as appropriate. Variables with strongly skewed distributions or strong correlations between variances and group means were subjected to Box-Cox transformation before the statistical analysis. The significance threshold was set at 0.05 and the false discovery rate (FDR) control procedure of Hochberg was applied to groups of conceptually related variables within the single tests to correct significance thresholds for multiple comparisons. Cohen’s d was used as measure of the size of strain differences, partial omega squared as measure of the size of ANOVA effects and interactions. Pooled data of the four experiments was additionally analyzed using Bayesian statistics (R package “BayesFactor”), permitting to probe the data not only for presence but also for absence of a strain effect (Keysers et al., 2020). Precisely, a Bayes factor (BF) was computed as the likelihood ratio between alternative models with and without strain effect, given the observed data. A BF > 3 was taken as moderate evidence for, a BF < 1/3 as moderate evidence against presence of a strain effect. BF > 10 and BF < 1/10 were interpreted as strong evidence for and against a strain effect, respectively. The pooled data as pseudo-population permitted to tentatively identify false positive (positive test outcome despite evidence for absence of a strain effect in the pseudo-population) and false negative results (negative test outcome despite evidence for presence of a strain effect in the pseudo-population) in individual experiments. The statistical analyses and graphs were obtained using R version 4.1.2, complemented with the packages “effectsize” and “ggplot2.” In bar and line graphs, untransformed data are plotted as mean + SEM with individual data points shown in the background.

RESULTS

Phenotypic Profile of C57BL/6J and DBA/2J Mice

To deeply investigate the well documented behavioral differences between C57BL/6J and DBA/2J mice, a battery of behavioral tests was performed by both experimenters (M, V) in both


laboratory environments (Z, H, Figures 2, 3). The body weight of mice, measured before and during the behavioral testing, revealed a significant strain effect with DBA/2J showing a higher body weight than C57BL/6J mice (F1,88 = 57.46, p < 0.0001, Figure 2A). Moreover, the weight gain was more pronounced in C57BL/6J than in DBA/2J mice during the behavioral testing (F3,264 = 16.10, p < 0.0001, Figure 2A). Specifically looking at the locomotor activity and coordination ability, the significant main effects of strain on locomotion revealed how DBA/2J mice displayed higher walking velocity (F1,88 = 74.48, p < 0.0001, ω2 = 0.46, Figure 2B) combined with a higher tortuosity index (F1,88 = 36.52, p < 0.0001, Figure 2C) in the open field. Overall, performance improved across trials in the rotarod indicating motor learning with the DBA/2J mice falling earlier during the initial phase of testing (F1,376 = 17.90, p < 0.0001, Figure 2D). These data indicated DBA/2J being characterized by a faster and less linear locomotion combined with a poorer coordination. To address anxiety like behaviors in C57BL/6J and DBA/2J mice, the elevated O-maze test was performed by both experimenters in both laboratories. Results elucidated a much stronger avoidance of open sectors and less preference for transition zones, in

FIGURE 2 | Results of the behavioral battery of tests. (A) Body weight (g) during the behavioral testing (ANOVA: strain F1,88 = 57.46, p < 0.0001, ω2 = 0.40,
measure F3,264 = 133.0, p < 0.0001, ω2 = 0.60). DBA/2J mice were much heavier than C57BL/6J mice and weight gain was more sustained in C57BL/6J than
DBA/2J mice. The strain effect was detected by both experimenters and in both laboratories (post-hoc test: *p < 0.05, ***p < 0.001 for strain effect). (B) Lingering
(m/s) defined as sum of resting and any deceleration and walking velocity (m/s) in the open field (ANOVA: strain F1,88 = 74.48, p < 0.0001, ω2 = 0.46, strain × state
F1,88 = 36.38, p < 0.0001, ω2 = 0.29). DBA/2J mice displayed higher walking velocity compared to C57BL/6J mice. The strain effect was detected by both
experimenters and in both laboratories (post-hoc test: ***p < 0.001 for strain effect). (C) Tortuosity index defined as sum of unsigned direction changes divided by
total distance moved in the open field (ANOVA: strain F1,88 = 36.52, p < 0.0001, ω2 = 0.29). DBA/2J showed a higher tortuosity index compared to C57BL/6J
mice. The strain effect was detected in 2 of 4 individual experiments: missed in VZ and VH (post-hoc test: ***p < 0.001, ~p < 0.1 for strain effect). (D) Time to fall (s)
as measure of motor learning and coordination ability in the rotarod (ANOVA: strain F1,88 = 18.44, p < 0.0001, ω2 = 0.17, trial F1,376 = 49.97, p < 0.0001,
ω2 = 0.12, strain × trial F1,376 = 17.90, p < 0.0001, ω2 = 0.05). Overall, performance improved across trials in the rotarod indicating motor learning with the
DBA/2J mice falling earlier during the initial phase of testing. The main effect of strain was missed when the mice were tested in Helsinki (post-hoc test: *p < 0.05,
***p < 0.001, ~p < 0.1 for strain effect).
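
The tortuosity index in panel (C) is defined as the sum of unsigned direction changes divided by the total distance moved. As a rough illustration only, the following Python sketch applies that definition to a list of (x, y) positions. The function name `tortuosity_index`, the use of radians, and the handling of zero-length steps are assumptions made for this example; it is not the Wintrack implementation, whose units and filtering are not described here.

```python
import math


def tortuosity_index(track):
    """Sum of unsigned heading changes (radians) divided by path length.

    `track` is a sequence of (x, y) positions. Zero-length steps are
    skipped because they carry no heading information (an assumption).
    """
    steps = []
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        dx, dy = x1 - x0, y1 - y0
        length = math.hypot(dx, dy)
        if length > 0:
            steps.append((math.atan2(dy, dx), length))

    total_distance = sum(length for _, length in steps)
    if total_distance == 0:
        return 0.0

    total_turning = 0.0
    for (h0, _), (h1, _) in zip(steps, steps[1:]):
        # Wrap the heading difference into [-pi, pi] before taking its size.
        delta = math.atan2(math.sin(h1 - h0), math.cos(h1 - h0))
        total_turning += abs(delta)
    return total_turning / total_distance


# Hypothetical tracks, not data from the study.
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
zigzag = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
```

Under these assumptions a straight path scores 0, while the zigzag path makes two right-angle turns over three units of distance and scores π/3 radians per unit distance. A higher value therefore corresponds to the less linear locomotion reported for DBA/2J mice.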


FIGURE 3 | Results of the behavioral battery of tests. (A) Preference for sector (%) in the elevated O-maze (ANOVA: strain × sector F1,184 = 28.02, p < 0.0001,
ω2 = 0.13). DBA/2J mice showed much stronger avoidance of open sectors and less preference for the transition zones, in favor of a much stronger preference for
the closed sectors. This was detected by both experimenters in both laboratories (post-hoc-test: ◦◦◦ p < 0.001 for strain × sector interaction). (B) Object scanning
(m/min) as measure of exploratory activity in the open field with object (ANOVA: strain F1,88 = 48.45, p < 0.0001, ω2 = 0.36). DBA/2J spent more time exploring the
object in the open field with object whereas C57BL/6J showed the strain-typical absence of object exploration. This was detected by both experimenters in both
laboratories (post-hoc-test: ∗ p < 0.05, ∗∗ p < 0.01, ∗∗∗ p < 0.001 for strain effect). (C) Percentage of time (%) × state as measure of the motor profile in the open
field (ANOVA: strain × state F1,184 = 12.79, p = 0.0004, ω2 = 0.06). DBA/2J mice displayed higher resting time percentage and lower percentage of walking time in
the open field. This was detected in 2 of 4 individual experiments: missed in MZ and MH (post-hoc test: ◦ p < 0.05, ◦◦ p < 0.01 for strain × state interaction). (D)
Distance moved (m) as measure of spatial learning abilities in the Barnes maze (ANOVA: strain F1,88 = 51.44, p < 0.0001, ω2 = 0.37, day F4,352 = 21.92,
p < 0.0001, ω2 = 0.20). Overall, distance moved to find the escape hole showed a robust learning, reversal and re-learning effect, indicating that the protocol
worked as intended. DBA/2J moved a longer distance to find the escape hole in the Barnes maze. The strain effect was missed in MZ experiment (post-hoc test:
∗ p < 0.05, ∗∗ p < 0.01 for strain effect).

favor of a much stronger preference for closed sectors in DBA/2J compared to the C57BL/6J mice (F1,184 = 28.02, p < 0.0001, Figure 3A). This was also confirmed in the open field test where DBA/2J mice showed much stronger avoidance of the center zone in favor of a much stronger preference for the transition and wall zones (strain × zone F2,176 = 60.01, p < 0.0001, ω2 = 0.41, Supplementary Figure 4B). In addition, C57BL/6J showed the strain-typical absence of object exploration in the open field with object whereas DBA/2J mice spent more time exploring the object without sign of habituation (F1,88 = 48.45, p < 0.0001, Figure 3B). Focusing on the motor profile, the main effect of strain on activity revealed how DBA/2J mice displayed higher resting time percentage and lower percentage of walking time in the open field (F1,184 = 12.79, p = 0.0004, Figure 3C). Overall, distance moved to find the escape hole showed a robust learning, reversal and re-learning effect in the Barnes maze test performed by both experimenters in both laboratories. Interestingly, DBA/2J mice moved a longer distance to find the escape hole (F1,88 = 51.44, p < 0.0001, Figure 3D), taking a longer time to find it (F1,88 = 34.16, p < 0.0001, Supplementary Figure 4C), indicating worse spatial learning abilities. Remarkably, our data detected


FIGURE 4 | Heatmaps indicating the direction of the strain effect. Behavioral measures with multiple observations per animal (repeated measures) were converted to factorial measures by taking the average across observations (205 mean of selected repetitions) or by computing the slope across observations (321 slope across selected repetitions). All the 526 behavioral variables have a primary assignment to a behavioral domain; 44 variables have a secondary assignment in addition. Open field and OE test slope variables (∼time: bin 1-2-3, 5 min each, ∼stage: open field-OE test, ∼zone: (prospective) object-transition-wall, ∼direction: centripetal-fugal, ∼state: rest-linger-walk); BM training and probe slope variables (∼acquisition: day 1-2-3, ∼relearning: day 1-4-5, ∼relocation: day (3| 5)-4, ∼state: rest-linger-walk, ∼criterion: primary-extra, ∼strategy: mixed-serial-direct, ∼place: control-target, ∼angle: 72-54-36-18-0° deviation); physical slope variables (∼testing: time during behavioral testing, ∼pretest: arrival to begin of testing); rotarod slope variables (∼trial 1-2-3-4-5, ∼begin-end 1-5); O-maze slope variables (∼time: bin 1-2, 5 min each, ∼sector: open-transition-closed, ∼position: free-protected). Individual columns show effects obtained by individual experiments, persons and labs with the second lane indicating the p-value of the overall ANOVA strain effect. In addition, experiments were analyzed using a pseudo population approach also with Bayesian stats permitting to obtain evidence for absence of effect. (A) Overview of 318 anxiety related measures, sorted by overall Cohen’s d as measure of size of the strain effect in a strain × person × lab ANOVA model. Measures related to exploration were treated as negative measures of anxiety and included in the table after multiplying d with –1. DBA/2J mice earned higher scores on measures of anxiety and lower scores on measures of exploration. (B) Overview of 64 activity related measures, sorted by overall Cohen’s d as measure of size of the strain effect in a strain × person × lab ANOVA model. Measures related to resting and lingering were treated as negative measures of activity and included in the table after multiplying d with –1. DBA/2J mice moved less and later, earning lower scores on measures of activity and higher scores on measures of inactivity. (C) Overview of 56 motor related measures, sorted by overall Cohen’s d as measure of size of the strain effect in a strain × person × lab ANOVA model. Locomotion of DBA/2J mice was characterized by faster walking as well as less linear and less predictable trajectories combined with coordination deficit. (D) Overview of 7 physical related measures, sorted by overall Cohen’s d as measure of size of the strain effect in a strain × person × lab ANOVA model. DBA/2J mice showed higher body weight but gained less weight during the behavioral testing. (E) Overview of 125 cognition related measures, sorted by overall Cohen’s d as measure of size of the strain effect in a strain × person × lab ANOVA model. Error scores were treated as negative measures of learning performance and included in the table after multiplying d with –1. DBA/2J mice earned poor scores of spatial selectivity during training and probe trials on the Barnes maze.
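
The caption describes two conversions that are easy to show in miniature: reducing repeated observations per animal to a mean or a slope, and flipping the sign of Cohen's d for measures that count against a domain (for example, exploration within the anxiety domain). The Python sketch below is a simplified assumption, not the authors' analysis: it uses an ordinary least-squares slope over equally spaced observations and a two-group pooled-SD Cohen's d computed as DBA/2J minus C57BL/6J, whereas the paper derives d from a strain × person × lab ANOVA model in R. All function names and numbers are hypothetical.

```python
from statistics import mean, variance


def slope_across_observations(values):
    """Least-squares slope of values against their index 0, 1, 2, ...

    Needs at least two observations.
    """
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def cohens_d(d2, b6):
    """Cohen's d of group d2 relative to group b6, using the pooled SD."""
    n_d2, n_b6 = len(d2), len(b6)
    pooled_var = (
        (n_d2 - 1) * variance(d2) + (n_b6 - 1) * variance(b6)
    ) / (n_d2 + n_b6 - 2)
    return (mean(d2) - mean(b6)) / pooled_var ** 0.5


def domain_signed_d(d2, b6, negative_measure=False):
    """Multiply d by -1 for measures treated as negative for the domain."""
    d = cohens_d(d2, b6)
    return -d if negative_measure else d


# Hypothetical values, not data from the study.
rotarod_times = [40.0, 55.0, 70.0]   # one animal, three trials
d2_scores = [2.0, 4.0, 6.0]          # one measure, DBA/2J animals
b6_scores = [1.0, 3.0, 5.0]          # same measure, C57BL/6J animals
```

With these made-up numbers, `slope_across_observations(rotarod_times)` is 15.0 per trial and `cohens_d(d2_scores, b6_scores)` is 0.5; passing `negative_measure=True` to `domain_signed_d` gives -0.5, so that all measures in one domain point in the same direction before sorting.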

the notorious strain behavioral differences between C57BL/6J and DBA/2J mice.

Experiments Agree Over Direction of Effect
To examine in depth the consistency of the direction of the observed strain differences obtained by the two experimenters in the two laboratories, a novel statistical approach based on the analysis of multiple tests addressing a single behavioral domain was developed. To this end, all the 526 measures (Supplementary Table 1) obtained from the behavioral experiments were assigned to at least one behavioral domain: physical, motor, anxiety, activity and cognition. All the measures related to each behavioral domain were then sorted by overall Cohen’s d as measure of size of the strain effect in a strain × person × lab ANOVA model and heatmaps were generated accordingly (Figure 4). Using our novel approach and looking specifically at the anxiety domain (Figure 4A), results agreed on the direction of the strain effect with DBA/2J obtaining higher scores on measures of anxiety and lower scores on measures of exploration and habituation. This was most evident in the Barnes maze where many measures reflect the fact that DBA/2J mice disappeared more rapidly after having found the escape box. In this context, wall-related measures in the open field test yield the largest strain effects since DBA/2J mice avoided both the center and the transition zones more than C57BL/6J mice. Due to the notorious avoidance reaction of C57BL/6J in the test, scores of object exploration show a reversal pattern. Looking at the activity profile of C57BL/6J and DBA/2J mice, the heatmap presented in Figure 4B confirmed a good agreement on the direction of the strain effect with DBA/2J mice collecting lower scores on measures of activity. Precisely, data confirmed that they moved less and later. While the strain effect on latency related variables and head dips may be boosted by their increased anxiety, distance related measures tended to show smaller effects due to their increased speed of locomotion. Agreement in the context of the motor profile related measures was also achieved (Figure 4C). In this context, the heatmap revealed DBA/2J to be characterized by a faster and less linear locomotion combined with coordination deficit. The less predictable trajectories observed in DBA/2J mice were less pronounced in the open field test due to their increased wall preference. Specifically looking at the rotarod related measures, their performance was poor with a very strong tendency to hold and rotate on the drum instead of walking on it. General agreement was also obtained in the context of physical related measures with DBA/2J mice showing higher body weight but gaining less weight during the testing compared to C57BL/6J mice (Figure 4D). Focusing on the cognitive profile of C57BL/6J and DBA/2J mice, agreement on the direction of the strain effect in learning and memory abilities was reached (Figure 4E). In this context, data showed that DBA/2J mice earned poor scores of spatial selectivity during training and probe trials on the Barnes maze. This would also imply higher error scores, which was counteracted by their generally reduced locomotor activity. Surprisingly and in light of the mentioned results, data suggested that experiments agree over direction of the strain effect.

Each Single Experiment Agrees With the Others
The reproducibility of each single experiment was then investigated in depth. To evaluate how well one single experiment agrees with the others in terms of strain effect direction of each behavioral measure, all six possible comparisons between individual experiments (MZ, MH, VZ, VH) have been made (Figure 5). According to agreement on both the presence and the direction of the strain effect, three outcome categories are obtained: concordance, uncertainty and discordance. The latter two are considered as failure of one experiment to replicate the other. In this context, a precision score was assigned to each behavioral measure based on the presence of either concordant or discordant effects. Precisely, a score of 1 was assigned when concordant effects with identical size based on Cohen’s d were observed. In contrast, discordant effects obtained a score of −1. Surprisingly, 75% of precision scores are > 0, indicating reproducible results, either true positives or true negatives. Additionally, and in agreement with the threshold of 5% set for type-I error, false positive results are 4.6%. Interestingly, our data elucidated that discordant strain effects are very few and mostly explained by the compromised detection of body size by the video tracking system. Remarkably, results highlighted that the strain effect was highly reproducible for all the behaviors tested.

High Versus Low Reproducible Measures
The degree of reproducibility of variables belonging to each behavioral domain was then evaluated in depth. The previously mentioned approach based on the presence of either concordant or discordant effects was used, and precision scores were

Frontiers in Behavioral Neuroscience | www.frontiersin.org 8 February 2022 | Volume 16 | Article 835444


Nigri et al. Reproducibility in Mouse Behavioral Phenotyping

FIGURE 5 | Results showing reproducibility between experiments. All possible six comparisons between individual experiments (MZ, MH, VZ, VH) have been made,
as presented in the heatmaps. According to agreement on both the presence and the direction of the strain effect, three outcome categories are obtained:
concordance, uncertainty and discordance. A score of 1 was assigned when concordant effects with identical size based on Cohen’s d were observed. In
contrast, discordant effects obtained a score of –1. False positives appear less frequent than false negatives, hence more non-concordance when there is an effect
in the pseudo population – without evident relationship between size of the strain effect and rate of non-concordance. In addition, learning and memory related
measures were strongly overrepresented in the subset of the most reproducible measures. In contrast, measures of activity and size determined by video tracking
are overrepresented in the subset of the least reproducible measures.
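The comparison scheme described in this caption can be sketched as follows. This is only an illustrative simplification under stated assumptions: each experiment is reduced to a hypothetical direction code (+1, −1, or 0 when there is no evidence for a strain effect), and the rule used in classify_pair is an assumed reading of "agreement on presence and direction", not the graded precision score based on Cohen's d that the article reports.

```python
from itertools import combinations

# The four individual experiments named in the article.
EXPERIMENTS = ("MZ", "MH", "VZ", "VH")


def classify_pair(effect_a, effect_b):
    """Classify two experiments' results for one measure.

    effect_a, effect_b: +1 or -1 for the direction of a detected strain
    effect, 0 when no effect was detected (assumed coding, for illustration).
    """
    if effect_a == effect_b:
        return "concordance"   # same direction, or both without an effect
    if effect_a == 0 or effect_b == 0:
        return "uncertainty"   # only one experiment detects an effect
    return "discordance"       # effects in opposite directions


def pairwise_outcomes(effects):
    """Return the outcome for each of the six pairs of experiments."""
    return {
        (a, b): classify_pair(effects[a], effects[b])
        for a, b in combinations(EXPERIMENTS, 2)
    }


# Hypothetical direction codes for a single behavioral measure.
outcomes = pairwise_outcomes({"MZ": 1, "MH": 1, "VZ": 0, "VH": -1})
```

Four experiments give exactly six unordered pairs, which is why each measure contributes six cells to the heatmaps.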

assigned accordingly. The heatmaps presented in Figure 5 elucidated that measures of learning and memory were strongly overrepresented in the subset of the most reproducible measures and to a lesser degree also motor performance related measures. Importantly, measures of activity determined by video tracking are overrepresented in the subset of the least reproducible measures. Interestingly, our data show object exploration and O-maze related parameters being overrepresented in the subset of the least reproducible measures. Importantly, our data were able to detect both the most and the least reproducible measures belonging to the addressed behavioral domains.

Experimenter Impact on Size and Direction of the Strain Effect
Having defined both the most and the least reproducible measures, we were then interested in evaluating in depth the


FIGURE 6 | Results showing the experimenter contribution to overall variance. Behavioral measures assigned to 3 sections according to strain effect in pseudo
population: evidence for, inconclusive, evidence against (false positives). Based on partial omega squared of the interaction, 30 variable subsets with overall strain
effect and large vs small person × strain interactions, were extracted. A modest enrichment of anxiety measures in the subset with large person × strain interaction
is observed. Additionally, motor related measures seem relatively resistant against person × strain interaction.
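For readers unfamiliar with partial omega squared, a common textbook estimator for a fixed-effects, between-subjects ANOVA can be written in terms of the F statistic. The sketch below is an assumption for illustration: this section does not state which estimator the authors used, and a design with repeated measures would require a different formula, so values computed this way need not match those reported in the article.

```python
def partial_omega_squared(f_value, df_effect, n_total):
    """Partial omega squared from an F statistic (between-subjects design).

    Assumed estimator:
        df_effect * (F - 1) / (df_effect * (F - 1) + N)
    where N is the total number of observations. The estimate can be
    negative when F < 1; it is returned unchanged here.
    """
    numerator = df_effect * (f_value - 1.0)
    return numerator / (numerator + n_total)


# Hypothetical numbers, not taken from the study.
interaction_effect = partial_omega_squared(f_value=5.0, df_effect=1, n_total=100)
```

Ranking variables by this quantity for the person × strain term is one way to separate measures with large interactions from those with small ones.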

relative influence of the experimenter on size/direction of the strain effect (experimenter × strain interaction). To this aim, behavioral measures were assigned to 3 sections according to strain effect in pseudo population: evidence for, inconclusive, evidence against. Precisely and based on partial omega squared of the interaction, 30 variable subsets with overall strain effect and large vs small experimenter × strain interactions were extracted and analyzed. Heatmaps presented in Figure 6 elucidated that measures of motor performance are strongly overrepresented in the subset of the least affected measures. Interestingly, data showed measures of learning and memory as well as size being not present in the subset of the most affected measures. In contrast, both activity and anxiety related parameters appeared to be affected by the experimenter. Importantly, while the Barnes maze is overrepresented in the subset of least affected related measures, open field and object exploration are overrepresented in the subset of the most affected related parameters. Rotarod and physical examination, by contrast, do not contribute to the subset of the most affected measures. Interestingly, data presented in Supplementary Figure 1 showed that the impact


FIGURE 7 | Results showing the environment contribution to overall variance. Behavioral measures assigned to 3 sections according to strain effect in pseudo
population: evidence for, inconclusive, evidence against (false positives). Based on partial omega squared of the interaction, 30 variable subsets with overall strain
effect and large vs small environment × strain interactions, were extracted. Importantly, results elucidated how enrichment of activity related measures in the subset
with large lab × strain interactions was detected. In contrast, physical related parameters seem particularly resistant against lab × strain interaction.

of the experimenter on concordance is minor (5–10%) compared to the total strain effect size (70%).

Environment Impact on Size and Direction of the Strain Effect
The relative impact of the environment on size/direction of the strain effect (environment × strain interaction) was also investigated using the previously mentioned approach. In this context, results (Figure 7) elucidated measures of learning and memory as well as size determined by physical examination being the least affected by the laboratory environment. In contrast, both measures of activity and size determined by video tracking appeared to belong to the most affected measures. Importantly, learning and memory related parameters were not present in the subset of the measures most affected by the laboratory environment. Looking specifically at the O-maze and object exploration, results highlighted their related measures being influenced by the laboratory environment. Remarkably, our results elucidated that experimenter and environment effects are mutually exclusive and independent of strain effects. Considering the mentioned results, data presented in Supplementary Figure 2 showed the impact of the laboratory environment being similar to that of the experimenter, accounting for a minor impact on concordance (5–10%) compared to the strain effect size (70%).


DISCUSSION

The reproducibility and replicability of the experimental work in biomedical research has been a hot topic fueling intensive debates during the past 10 years (Baker, 2016; Fitzpatrick et al., 2018). Indeed, irreproducibility prevalence rates have been estimated to range between 50 and 90% (Prinz et al., 2011; Begley and Ellis, 2012; Begley and Ioannidis, 2015). Several reasons for poor reproducibility have been identified: publication bias, inappropriate statistical analysis, lack of randomization and blinding, validation of reagents (Landis et al., 2012; Begley, 2013; Munafò et al., 2017). Recently concluded and published results of the cancer reproducibility project highlight many of these issues (Editorial, 2021; Mullard, 2021). Working with animals requires consideration of many more issues which are critical for the validity of an experiment (Smith, 2020).

With the present study, our aim was to add information on the role of environment and experimenter in behavioral phenotyping of mouse models. Importantly, the aim was not to evaluate the standardization of the procedures. However, the two laboratories had extensive experience (>25 years) in behavioral testing and had similar equipment available. Therefore, the “standardization” covered only agreement on the behavioral protocols, and the source (and strain) of animals used. Inbred strains provide an important tool for understanding genetic mechanisms underlying behavior. By and large, the phenotypic differences between inbred strains are suggested to be stable over time and across laboratories, although the behaviors related to emotional, cognitive and social processes may be labile and affected by laboratory-specific parameters in husbandry and testing (Wahlsten et al., 2006). In order to investigate the experimenter and the environment contribution separately, our study explored in depth their relative impact on the detection of behavioral traits in C57BL/6J and DBA/2J female mice. Overall, this approach is in line with the concept of systematic heterogenization (Richter, 2017, 2020; Voelkl et al., 2020) recommended for enhancing external validity and generalization (Karp, 2018; Eggel and Wurbel, 2021).

Several previous studies have elucidated how the experimenter and the laboratory environment may account for the variation across replicate studies within or between laboratories (Chesler et al., 2002; Wahlsten et al., 2003; Bohlen et al., 2014). Considering that highly reproducible findings under highly standardized conditions may poorly generalize to other experimental conditions (lab or experimenter), the same protocols and similar testing procedures were applied in our study, consisting of four replications. This approach allowed us to systematically collect data in a large cohort of mice (pooled data as a pseudo-population) and thereafter, to focus on the measures of reliability and validity (precision and accuracy) of each replication (mini-experiments).

As expected, significant strain differences were revealed for each behavioral test. In accordance with previous findings, the DBA/2J mice were less active, showing enhanced anxiety-like behavior and avoidance of exposed areas (in open field and elevated O-maze) with impaired motor performance (on rotarod) when compared to C57BL/6J mice (Voikar et al., 2005; Kulesskaya and Voikar, 2014; Ahlgren and Voikar, 2019). In contrast, exploration of the novel object in the center of the open field was enhanced in the DBA/2J mice as also shown in previous studies (Kim et al., 2005). In spatial learning and memory tasks, and fear conditioning, the DBA/2 mice have been usually described as inferior compared to the C57BL/6 strain (Crawley et al., 1997; Logue et al., 1997; Holmes et al., 2002; Youn et al., 2012). In line with earlier reports, the difference in spatial learning abilities between the two strains was also detected in our study.

Confirming the differences between the two strains allowed us to evaluate in depth the relative impact of both the experimenter and the environment on the behavioral results. We developed and applied a novel statistical approach based on the analysis of multiple tests addressing a single behavioral domain. Interestingly, variable dependent effects of both the experimenter and the environment are detectable but not capable of altering the direction of conclusions. Surprisingly, main effects of the experimenter and the environment are mutually exclusive and a remarkably good deal of consistency of the strain effect is observed. Importantly, accounting for 75% of the total variability, the strain effect was highly reproducible for all the behaviors tested and well-documented strain differences were detected. Additionally, in agreement with the threshold of 5% set for type-I error, false positives are 4.6%.

Two major environmental differences between the laboratories were the phase of light cycle when the testing occurred and the housing system used. Alarmingly, up to 70% of publications fail to disclose the circadian time when the animals are administered the treatment (Alitalo et al., 2021). Although testing during the dark period may be intuitively and ethologically more relevant, the fact is that many laboratories do not apply an inverted light cycle because of various practical and logistic reasons. Moreover, for basic behavioral testing it has been shown that many parameters are not affected by the time of testing, and discriminate the strains well in the active or inactive period (Hossain et al., 2004; Beeler et al., 2006; Deacon, 2006; Yang et al., 2008; Robinson et al., 2018). Even if differences depending on the time of testing (during light or dark phase) are detected, the comparison to other studies is often complicated because of specific design (only male or female animals, single or group housed) or missing information on test conditions (Roedel et al., 2006; Richetto et al., 2019). Importantly, it is suggested that mice can adapt to the daily activity of laboratory personnel (Robinson-Junker et al., 2018). Taken together, the findings of all these previous experiments and our data can be summarized as follows: if differences between testing during light and dark phase exist, they may be heavily dependent on a variety of factors (strain, sex, housing conditions, illumination during testing, the test situation) (Peirson et al., 2018). Additionally, little or no evidence is reported for impaired welfare or sleep deprivation when mice are disturbed for testing or husbandry procedures during the light phase (Robinson-Junker et al., 2018, 2019).

The individually ventilated cages (IVC) are becoming a mainstream housing condition for laboratory rodents. Although there are clear benefits for monitoring hygiene, microbiological status and importantly, health hazards for personnel, the impact on animal physiology and behavior has been extensively discussed. The data so far show that the changes in the phenotype


of mice may be dependent on the parameters studied and laboratories (Mineur and Crusio, 2009; Logge et al., 2013; Ahlgren and Voikar, 2019). Based on our data, we suggest that neither light cycle nor housing system obscured the phenotypic differences between the C57BL/6 and DBA/2 mice.

Using only female mice in our study may be considered as a limitation. However, the main aim of this study was to investigate the impact of environment and experimenter in behavioral phenotyping experiments. Therefore, we planned it by employing two inbred strains, to identify the genotype × environment interactions. We did not include male mice because (1) the sex difference was not of major interest in this proof-of-principle study (including three factors: genotype, experimenter and laboratory) and (2) personal experience is that ordering male mice from the commercial supplier at the age of 6 weeks or later often results in fighting and the need for single housing/re-grouping, which may be a major drawback for the design and conduct of the study (Weber et al., 2017). In addition, convincing evidence exists that phenotypic variability may be higher in males than females and exact information on the phase of the estrous cycle is not necessary in basic studies with laboratory rodents (Prendergast et al., 2014; Becker et al., 2016; Fritz et al., 2017; Shansky and Murphy, 2021). Examining the influence of the estrous cycle on a particular experimental question is always an option, but is not required for research in females, just as assessing testosterone levels (which can vary up to tenfold across a cohort) is not a standard practice for experiments in males (Shansky, 2019). However, this should not be taken as underestimating the importance of sex differences in biomedical research (Karp et al., 2017; Breznik et al., 2021).

Testing animals in more than one laboratory in a coordinated preclinical trial can definitely support the reliability and generalization of findings. However, involving more than one laboratory certainly requires more attention to planning and logistics of the study. We have been partners in several such endeavors, which have produced a lot of useful data but have also emphasized how important the coordination of the project is, because many things can go wrong already before the actual start of the experiments (Krackow et al., 2010; Richter et al., 2011; Codita et al., 2012). For instance, ordering the mice from a commercial vendor may seem easy and straightforward, but it may turn out that suddenly they do not have mice available at the desired age, or in a particular breeding facility. Therefore, planning checklists and culture of care (good communication) cannot be promoted enough (Smith, 2020; Robinson et al., 2021). Finally, we want to emphasize experience and training for conducting behavioral experiments, because failure to consider essential factors affecting behavior of mice, interaction of mice and experimenters, and scoring behavior, may strongly influence the reproducibility, validity and reliability of the experiments (Blizard et al., 2007; Rodgers, 2007; Stanford, 2007; Schellinck et al., 2010; Voikar, 2020).

In summary, by applying a novel statistical approach, we elucidated that large strain differences are robust and that experimenter and environment effects are unlikely to alter the direction of the behavioral results. Highlighting how reproducible results can be reached by converging evidence from multiple measures addressing the same behavioral domain, our work examined in depth the contribution of the experimenter and the environment and provided novel insights in the intricate field of behavioral phenotyping.

DATA AVAILABILITY STATEMENT

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding authors.

ETHICS STATEMENT

All the experimental procedures were carried out in accordance with the European legislation (Directive 2010/63/EU), having been approved by the veterinary office of the Canton of Zürich (license number 060/2021) and National Animal Experiment Board of Finland (license ESAVI/10165/04.10.07/2016).

AUTHOR CONTRIBUTIONS

MN, JÅ, DW, and VV: design and concept of the study, local support and coordination with planning, protocols, equipment and animal orders. MN and VV: mouse behavioral phenotyping. DW: statistical analysis. All authors discussed the data and wrote the manuscript.

FUNDING

This study was financed by a research grant from Jane and Aatos Erkko Foundation to VV. Mouse Behavioral Phenotyping Facility in Helsinki is supported by Biocenter Finland and Helsinki Institute of Life Science.

ACKNOWLEDGMENTS

We want to thank Irmgard Amrein for help and advice in planning the study and preparing the facility in Zurich. Sonia Matos (Zurich) and Nelli Koivisto (Helsinki) are acknowledged for taking care of mice in the study.

SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fnbeh.2022.835444/full#supplementary-material

Supplementary Figure 1 | Results indicating the impact of person × strain interaction. Person × strain interactions are by definition expected to negatively impact on precision. Their impact in comparison to the effect of the size of the strain effect was examined and reported in the graphs. Person × strain interactions have a detectable impact on the measurement of effect size and a smaller one on the detection of presence and direction of effects. Importantly, the impact of person × strain interactions (5–10%) on concordance is minor compared to the impact of strain effect size or power (70%).


Supplementary Figure 2 | Results indicating the impact of laboratory × strain interaction. Laboratory × strain interactions are by definition expected to negatively impact on precision. Their impact in comparison to the effect of the size of the strain effect was examined and reported in the graphs. Laboratory × strain interactions have a detectable impact on the measurement of effect size and a smaller one on the detection of presence and direction of effects. Large laboratory × strain interactions increase false negative as well as false positive rate and true discordance. Importantly, the impact of laboratory × strain interactions (5–10%) on concordance is minor compared to the impact of strain effect size or power (70%).

Supplementary Figure 3 | Details on the cage environment. (A) Cages equipped with two shelters (one cardboard and one red plastic, Zoonlab) and paper tissue as nesting material in Zürich. (B) Cages equipped with wooden gnawing blocks and abundant nesting material providing also a shelter (aspen strips) in Helsinki.

Supplementary Figure 4 | Results of the behavioral battery of tests. (A) Distance moved (m) × time in the open field (ANOVA: bin F(1,184) = 118.9, p < 0.0001, ω² = 0.39, strain F(1,88) = 1.398 ns, strain × bin F(1,184) = 87.76, p < 0.0001, ω² = 0.32). Overall, distance moved decreased with time indicating habituation. Additionally, no evidence for an overall strain effect was observed. DBA/2J mice moved less during the first 5 min of the experiment compared to C57BL/6J mice. This was detected by both experimenters in both laboratories (post-hoc-test: ◦◦ p < 0.01, ◦◦◦ p < 0.001 for strain × bin interactions). (B) Preference × zone (%) in open field (ANOVA: strain × zone F(2,176) = 60.01, p < 0.0001, ω² = 0.41). DBA/2J mice showed much stronger avoidance of center zone in favor of a much stronger preference for the transition and wall zones. This was detected by both experimenters in both laboratories (post-hoc-test: ◦◦ p < 0.01, ◦◦◦ p < 0.001 for strain × zone interactions). (C) Latency (s) primary × as measure of spatial learning abilities in the Barnes maze (ANOVA: strain F(1,88) = 34.16, p < 0.0001, ω² = 0.28, day F(4,352) = 69.05, p < 0.0001, ω² = 0.44). Overall, latency to find the escape hole showed a robust learning, reversal and re-learning effect, indicating that the protocol worked as intended. DBA/2J mice took longer to find the escape hole. The strain effect was missed in MH and MZ experiments (post-hoc test: ∗∗∗ p < 0.001 for strain effect, ~p < 0.1).

Supplementary Table 1 | List of the 526 behavioral related variables.

REFERENCES

Ahlgren, J., and Voikar, V. (2019). Housing mice in the individually ventilated or open cages-Does it matter for behavioral phenotype? Genes Brain Behav. 18:e12564. doi: 10.1111/gbb.12564
Alitalo, O., Saarreharju, R. I, Henter, D., Zarate, C. A., Kohtala, S., and Rantamäki, T. (2021). A wake-up call: sleep physiology and related translational discrepancies in studies of rapid-acting antidepressants. Prog. Neurobiol. 206:102140. doi: 10.1016/j.pneurobio.2021.102140
Arroyo-Araujo, M., Graf, R., Maco, M., van Dam, E., Schenker, E., and Drinkenburg, W. (2019). Reproducibility via coordinated standardization: a multi-center study in a Shank2 genetic rat model for Autism Spectrum Disorders. Sci. Rep. 9:11602. doi: 10.1038/s41598-019-47981-0
Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454. doi: 10.1038/533452a
Barnes, C. A. (1979). Memory deficits associated with senescence: a neurophysiological and behavioral study in the rat. J. Comp. Physiol. Psychol. 93, 74–104. doi: 10.1037/h0077579
Becker, J. B., Prendergast, B. J., and Liang, J. W. (2016). Female rats are not more variable than male rats: a meta-analysis of neuroscience studies. Biol. Sex Differ. 7:34. doi: 10.1186/s13293-016-0087-5
Beeler, J. A., Prendergast, B., and Zhuang, X. (2006). Low amplitude entrainment of mice and the impact of circadian phase on behavior tests. Physiol. Behav. 87, 870–880. doi: 10.1016/j.physbeh.2006.01.037
Begley, C. G. (2013). Six red flags for suspect work. Nature 497, 433–434. doi: 10.1038/497433a
Begley, C. G., and Ellis, L. M. (2012). Drug development: raise standards for preclinical cancer research. Nature 483, 531–533. doi: 10.1038/483531a
Begley, C. G., and Ioannidis, J. P. (2015). Reproducibility in science: improving the standard for basic and preclinical research. Circ. Res. 116, 116–126. doi: 10.1161/CIRCRESAHA.114.303819
Blizard, D. A., Takahashi, A., Galsworthy, M. J., Martin, B., and Koide, T. (2007). Test standardization in behavioural neuroscience: a response to Stanford. J. Psychopharmacol. 21, 136–139. doi: 10.1177/0269881107074513
Bohlen, M., Hayes, E. R., Bohlen, B., Bailoo, J., Crabbe, J. C., and Wahlsten, D. (2014). Experimenter effects on behavioral test scores of eight inbred mouse strains under the influence of ethanol. Behav. Brain Res. 272, 46–54. doi: 10.1016/j.bbr.2014.06.017
Breznik, J. A., Schulz, C., Ma, J., Sloboda, D. M., and Bowdish, D. M. E. (2021). Biological sex, not reproductive cycle, influences peripheral blood immune cell prevalence in mice. J. Physiol. 599, 2169–2195. doi: 10.1113/JP280637
Cabib, S., Orsini, C., Le Moal, M., and Piazza, P. V. (2000). Abolition and reversal of strain differences in behavioral responses to drugs of abuse after a brief experience. Science 289, 463–465. doi: 10.1126/science.289.5478.463
Chesler, E. J., Wilson, S. G., Lariviere, W. R., Rodriguez-Zas, S. L., and Mogil, J. S. (2002). Identification and ranking of genetic and laboratory environment factors influencing a behavioral trait, thermal nociception, via computational analysis of a large data archive. Neurosci. Biobehav. Rev. 26, 907–923. doi: 10.1016/s0149-7634(02)00103-3
Codita, A., Mohammed, A. H., Willuweit, A., Reichelt, A., Alleva, E., and Branchi, I. (2012). Effects of spatial and cognitive enrichment on activity pattern and learning performance in three strains of mice in the IntelliMaze. Behav. Genet. 42, 449–460. doi: 10.1007/s10519-011-9512-z
Crabbe, J. C., Wahlsten, D., and Dudek, B. C. (1999). Genetics of mouse behavior: interactions with laboratory environment. Science 284, 1670–1672. doi: 10.1126/science.284.5420.1670
Crawley, J. N., Belknap, J. K., Collins, A., Crabbe, J. C., Frankel, W., Henderson, N. (1997). Behavioral phenotypes of inbred mouse strains: implications and recommendations for molecular studies. Psychopharmacology 132, 107–124. doi: 10.1007/s002130050327
Deacon, R. M. (2006). Housing, husbandry and handling of rodents for behavioral experiments. Nat. Protoc. 1, 936–946. doi: 10.1038/nprot.2006.120
Editorial (2009). Troublesome variability in mouse studies. Nat. Neurosci. 12:1075. doi: 10.1038/nn0909-1075
Editorial (2013). Enhancing reproducibility. Nat. Meth. 10, 367–367. doi: 10.1038/nmeth.2471
Editorial (2019). Considerations for Experimental Design of Behavioral Studies Using Model Organisms. J. Neurosci. 39, 1–2. doi: 10.1523/JNEUROSCI.2794-18.2018
Editorial (2021). Replicating scientific results is tough - but essential. Nature 600, 359–360. doi: 10.1038/d41586-021-03736-4
Eggel, M., and Wurbel, H. (2021). Internal consistency and compatibility of the 3Rs and 3Vs principles for project evaluation of animal research. Lab. Anim. 55, 233–243. doi: 10.1177/0023677220968583
Fitzpatrick, B. G., Koustova, E., and Wang, Y. (2018). Getting personal with the reproducibility crisis: interviews in the animal research community. Lab. Anim. 47, 175–177. doi: 10.1038/s41684-018-0088-6
Fritz, A. K., Amrein, I., and Wolfer, D. P. (2017). Similar reliability and equivalent performance of female and male mice in the open field and water-maze place navigation task. Am. J. Med. Genet. C Semin. Med. Genet. 175, 380–391. doi: 10.1002/ajmg.c.31565
Gerlai, R. (1996). Gene-targeting studies of mammalian behavior: is it the mutation or the background genotype? Trends Neurosci. 19, 177–181. doi: 10.1016/s0166-2236(96)20020-7
Goodman, S. N., Fanelli, D., and Ioannidis, J. P. (2016). What does research reproducibility mean? Sci. Transl. Med. 8:341ps12.
Holmes, A., Wrenn, C. C., Harris, A. P., Thayer, K. E., and Crawley, J. N. (2002). Behavioral profiles of inbred strains on novel olfactory, spatial and emotional tests for reference memory in mice. Genes Brain Behav. 1, 55–69. doi: 10.1046/j.1601-1848.2001.00005.x
Hossain, S. M., Wong, B. K., and Simpson, E. M. (2004). The dark phase improves genetic discrimination for some high throughput mouse behavioral phenotyping. Genes Brain Behav. 3, 167–177. doi: 10.1111/j.1601-183x.2004.00069.x

Frontiers in Behavioral Neuroscience | www.frontiersin.org 14 February 2022 | Volume 16 | Article 835444


Nigri et al. Reproducibility in Mouse Behavioral Phenotyping

Hurst, J. L., and West, R. S. (2010). Taming anxiety in laboratory mice. Nat. Meth. 7, 825–826. doi: 10.1038/nmeth.1500
Kafkafi, N., Agassi, J., Chesler, E. J., Crabbe, J. C., Crusio, W. E., and Eilam, D. (2018). Reproducibility and replicability of rodent phenotyping in preclinical studies. Neurosci. Biobehav. Rev. 87, 218–232. doi: 10.1016/j.neubiorev.2018.01.003
Karp, N. A. (2018). Reproducible preclinical research-Is embracing variability the answer? PLoS Biol. 16:e2005413. doi: 10.1371/journal.pbio.2005413
Karp, N. A., Mason, J., Beaudet, A. L., Benjamini, Y., and Bower, L. (2017). Prevalence of sexual dimorphism in mammalian phenotypic traits. Nat. Commun. 8:15475. doi: 10.1038/ncomms15475
Keysers, C., Gazzola, V., and Wagenmakers, E.-J. (2020). Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence. Nat. Neurosci. 23, 788–799.
Kilkenny, C., Parsons, N., Kadyszewski, E., Festing, M. F. W., Cuthill, I. C., and Fry, D. (2009). Survey of the quality of experimental design, statistical analysis and reporting of research using animals. PLoS One 4:e7824. doi: 10.1371/journal.pone.0007824
Kim, D., Chae, S., Lee, J., Yang, H., and Shin, H. S. (2005). Variations in the behaviors to novel objects among five inbred strains of mice. Genes Brain Behav. 4, 302–306. doi: 10.1111/j.1601-183X.2005.00133.x
Krackow, S., Vannoni, E., Codita, A., Mohammed, A. H., Cirulli, F., and Branchi, I. (2010). Consistent behavioral phenotype differences between inbred mouse strains in the IntelliCage. Genes Brain Behav. 9, 722–731. doi: 10.1111/j.1601-183X.2010.00606.x
Kulesskaya, N., and Voikar, V. (2014). Assessment of mouse anxiety-like behaviour in the light-dark box and open-field arena: role of equipment and procedure. Physiol. Behav. 133, 30–38. doi: 10.1016/j.physbeh.2014.05.006
Landis, S. C., Amara, S. G., Asadullah, K., Austin, C. P., Blumenstein, R., and Bradley, E. W. (2012). A call for transparent reporting to optimize the predictive value of preclinical research. Nature 490, 187–191. doi: 10.1038/nature11556
Lariviere, W. R., Chesler, E. J., and Mogil, J. S. (2001). Transgenic studies of pain and analgesia: mutation or background genotype? J. Pharmacol. Exp. Ther. 297, 467–473.
Lewejohann, L., Reinhard, C., Schrewe, A., Brandewiede, J., Haemisch, A., and Gortz, N. (2006). Environmental bias? Effects of housing conditions, laboratory environment and experimenter on behavioral tests. Genes Brain Behav. 5, 64–72. doi: 10.1111/j.1601-183X.2005.00140.x
Logge, W., Kingham, J., and Karl, T. (2013). Behavioural consequences of IVC cages on male and female C57BL/6J mice. Neuroscience 237, 285–293. doi: 10.1016/j.neuroscience.2013.02.012
Logue, S. F., Paylor, R., and Wehner, J. M. (1997). Hippocampal lesions cause learning deficits in inbred mice in the Morris water maze and conditioned-fear task. Behav. Neurosci. 111, 104–113. doi: 10.1037//0735-7044.111.1.104
Maggi, S., Garbugino, L., Heise, I., Nieus, T., Balcı, F., Wells, S., et al. (2014). A Cross-Laboratory Investigation of Timing Endophenotypes in Mouse Behavior. Timing Time Percept. 2, 35–50.
Mandillo, S., Tucci, V., Holter, S. M., Meziane, H., Banchaabouchi, M. A., and Kallnik, M. (2008). Reliability, robustness, and reproducibility in mouse behavioral phenotyping: a cross-laboratory study. Physiol. Genomics 34, 243–255. doi: 10.1152/physiolgenomics.90207.2008
Mineur, Y. S., and Crusio, W. E. (2009). Behavioral effects of ventilated micro-environment housing in three inbred mouse strains. Physiol. Behav. 97, 334–340. doi: 10.1016/j.physbeh.2009.02.039
Mogil, J. S., and Macleod, M. R. (2017). No publication without confirmation. Nature 542, 409–411. doi: 10.1038/542409a
Moldin, S. O., Farmer, M. E., Chin, H. R., and Battey, J. F. Jr. (2001). Trans-NIH neuroscience initiatives on mouse phenotyping and mutagenesis. Mamm. Genome 12, 575–581. doi: 10.1007/s00335-001-4005-7
Mullard, A. (2021). Half of top cancer studies fail high-profile reproducibility effort. Nature 600, 368–369. doi: 10.1038/d41586-021-03691-0
Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie du Sert, N., et al. (2017). A manifesto for reproducible science. Nat. Hum. Behav. 1:0021. doi: 10.1038/s41562-016-0021
Paigen, K., and Eppig, J. T. (2000). A mouse phenome project. Mamm. Genome 11, 715–717. doi: 10.1007/s003350010152
Peirson, S. N., Brown, L. A., Pothecary, C. A., Benson, L. A., and Fisk, A. S. (2018). Light and the laboratory mouse. J. Neurosci. Meth. 300, 26–36.
Percie du Sert, N., Hurst, V., Ahluwalia, A., Alam, S., Avey, M. T., Baker, M., et al. (2020). The ARRIVE guidelines 2.0: updated guidelines for reporting animal research. PLoS Biol. 18:e3000410. doi: 10.1371/journal.pbio.3000410
Pernold, K., Iannello, F., Low, B. E., Rigamonti, M., and Rosati, G. (2019). Towards large scale automated cage monitoring - Diurnal rhythm and impact of interventions on in-cage activity of C57BL/6J mice recorded 24/7 with a non-disrupting capacitive-based technique. PLoS One 14:e0211063. doi: 10.1371/journal.pone.0211063
Prendergast, B. J., Onishi, K. G., and Zucker, I. (2014). Female mice liberated for inclusion in neuroscience and biomedical research. Neurosci. Biobehav. Rev. 40, 1–5. doi: 10.1016/j.neubiorev.2014.01.001
Prinz, F., Schlange, T., and Asadullah, K. (2011). Believe it or not: how much can we rely on published data on potential drug targets? Nat. Rev. Drug Discov. 10, 712–712. doi: 10.1038/nrd3439-c1
Richetto, J., Polesel, M., and Weber-Stadlbauer, U. (2019). Effects of light and dark phase testing on the investigation of behavioural paradigms in mice: relevance for behavioural neuroscience. Pharmacol. Biochem. Behav. 178, 19–29. doi: 10.1016/j.pbb.2018.05.011
Richter, S. H. (2017). Systematic heterogenization for better reproducibility in animal experimentation. Lab. Anim. 46:343. doi: 10.1038/laban.1330
Richter, S. H. (2020). Automated Home-Cage Testing as a Tool to Improve Reproducibility of Behavioral Research? Front. Neurosci. 14:383. doi: 10.3389/fnins.2020.00383
Richter, S. H., Garner, J. P., Zipser, B., Lewejohann, L., Sachser, N., and Touma, C. (2011). Effect of population heterogenization on the reproducibility of mouse behavior: a multi-laboratory study. PLoS One 6:e16461. doi: 10.1371/journal.pone.0016461
Robinson, L., and Riedel, G. (2014). Comparison of automated home-cage monitoring systems: emphasis on feeding behaviour, activity and spatial learning following pharmacological interventions. J. Neurosci. Meth. 234, 13–25. doi: 10.1016/j.jneumeth.2014.06.013
Robinson, L., Spruijt, B., and Riedel, G. (2018). Between and within laboratory reliability of mouse behaviour recorded in home-cage and open-field. J. Neurosci. Meth. 300, 10–19. doi: 10.1016/j.jneumeth.2017.11.019
Robinson, S., White, W., Wilkes, J., and Wilkinson, C. (2021). Improving culture of care through maximising learning from observations and events: addressing what is at fault. Lab. Anim. 8, 00236772211037177. doi: 10.1177/00236772211037177
Robinson-Junker, A., O’Hara, B., Durkes, A., and Gaskill, B. (2019). Sleeping through anything: the effects of unpredictable disruptions on mouse sleep, healing, and affect. PLoS One 14:e0210620. doi: 10.1371/journal.pone.0210620
Robinson-Junker, A. L., O’Hara, F., and Gaskill, B. N. (2018). Out Like a Light? The Effects of a Diurnal Husbandry Schedule on Mouse Sleep and Behavior. J. Am. Assoc. Lab. Anim. Sci. 57, 124–133.
Rodgers, R. J. (2007). More haste, considerably less speed. J. Psychopharmacol. 21, 141–143. doi: 10.1177/0269881107074493
Roedel, A., Storch, C., Holsboer, F., and Ohl, F. (2006). Effects of light or dark phase testing on behavioural and cognitive performance in DBA mice. Lab. Anim. 40, 371–381. doi: 10.1258/002367706778476343
Schellinck, H. M., Cyr, D. P., and Brown, R. E. (2010). How Many Ways Can Mouse Behavioral Experiments Go Wrong? Confounding Variables in Mouse Models of Neurodegenerative Diseases and How to Control Them. Adv. Stud. Behav. 41, 255–366.
Shansky, R. M. (2019). Are hormones a female problem for animal research? Science 364, 825–826. doi: 10.1126/science.aaw7570
Shansky, R. M., and Murphy, A. Z. (2021). Considering sex as a biological variable will require a global shift in science culture. Nat. Neurosci. 24, 457–464. doi: 10.1038/s41593-021-00806-8
Shepherd, J. K., Grewal, S. S., Fletcher, A., Bill, D. J., and Dourish, C. T. (1994). Behavioural and pharmacological characterisation of the elevated zero-maze as an animal model of anxiety. Psychopharmacology 116, 56–64. doi: 10.1007/BF02244871
Silva, A. J., Simpson, E. M., Takahashi, J. S., Lipp, H. P., Nakanishi, S., and Wehner, J. M. (1997). Mutant mice and neuroscience: recommendations concerning genetic background. Banbury Conference on genetic background in mice. Neuron 19, 755–759. doi: 10.1016/s0896-6273(00)80958-7


Smith, A. J. (2020). Guidelines for planning and conducting high-quality research and testing on animals. Lab. Anim. Res. 36:21. doi: 10.1186/s42826-020-00054-0
Smith, A. J., Clutton, R. E., Lilley, E., Hansen, K. E. A., and Brattelid, T. (2018). PREPARE: guidelines for planning animal research and testing. Lab. Anim. 52, 135–141. doi: 10.1177/0023677217724823
Sorge, R. E., Martin, L. J., Isbester, K. A., Sotocinal, S. G., Rosen, S., and Tuttle, A. H. (2014). Olfactory exposure to males, including men, causes stress and related analgesia in rodents. Nat. Meth. 11, 629–632. doi: 10.1038/nmeth.2935
Stanford, S. C. (2007). Open fields (unlike wheels) can be any shape but still miss the target. J. Psychopharmacol. 21:144. doi: 10.1177/0269881107074492
Stiedl, O., Radulovic, J., Lohmann, R., Birkenfeld, K., Palve, M., and Kammermeier, J. (1999). Strain and substrain differences in context- and tone-dependent fear conditioning of inbred mice. Behav. Brain Res. 104, 1–12. doi: 10.1016/s0166-4328(99)00047-9
Van der Staay, F. J., and Steckler, T. (2002). The fallacy of behavioral phenotyping without standardisation. Genes Brain Behav. 1, 9–13. doi: 10.1046/j.1601-1848.2001.00007.x
Voelkl, B., Altman, N. S., Forsman, A., Forstmeier, W., Gurevitch, J., and Jaric, I. (2020). Reproducibility of animal research in light of biological variation. Nat. Rev. Neurosci. 21, 384–393. doi: 10.1038/s41583-020-0313-3
Voikar, V. (2020). Reproducibility of behavioral phenotypes in mouse models - a short history with critical and practical notes. J. Reproducibility Neurosci. 1:1375. doi: 10.31885/jrn.1.2020.1375
Voikar, V., Polus, A., Vasar, E., and Rauvala, H. (2005). Long-term individual housing in C57BL/6J and DBA/2 mice: assessment of behavioral consequences. Genes Brain Behav. 4, 240–252. doi: 10.1111/j.1601-183X.2004.00106.x
Voikar, V., and Stanford, S. C. (2021). The Open Field Test. PsyArXiv [preprint] doi: 10.31234/osf.io/8m52y
Wahlsten, D. (2001). Standardizing tests of mouse behavior: reasons, recommendations, and reality. Physiol. Behav. 73, 695–704. doi: 10.1016/s0031-9384(01)00527-3
Wahlsten, D., Bachmanov, A., Finn, D. A., and Crabbe, J. C. (2006). Stability of inbred mouse strain differences in behavior and brain size between laboratories and across decades. Proc. Natl. Acad. Sci. U S A 103, 16364–16369. doi: 10.1073/pnas.0605342103
Wahlsten, D., Metten, P., Phillips, T. J., Boehm, S. L., Burkhart-Kasch, S., and Dorow, J. (2003). Different data from different labs: lessons from studies of gene-environment interaction. J. Neurobiol. 54, 283–311. doi: 10.1002/neu.10173
Walsh, R. N., and Cummins, R. A. (1976). The Open-Field Test: a critical review. Psychol. Bull. 83, 482–504. doi: 10.1037/0033-2909.83.3.482
Weber, E. M., Dallaire, J. A., Gaskill, B. N., Pritchett-Corning, K. R., and Garner, J. P. (2017). Aggression in group-housed laboratory mice: why can’t we solve the problem? Lab. Anim. 46, 157–161. doi: 10.1038/laban.1219
Wolfer, D. P., Madani, R., Valenti, P., and Lipp, H. (2001). Extended analysis of path data from mutant mice using the public domain software Wintrack. Physiol. Behav. 73, 745–753. doi: 10.1016/s0031-9384(01)00531-5
Würbel, H. (2000). Behaviour and the standardization fallacy. Nat. Genet. 26:263. doi: 10.1038/81541
Würbel, H. (2002). Behavioral phenotyping enhanced–beyond (environmental) standardization. Genes Brain Behav. 1, 3–8. doi: 10.1046/j.1601-1848.2001.00006.x
Yang, M., Weber, M. D., and Crawley, J. N. (2008). Light phase testing of social behaviors: not a problem. Front. Neurosci. 2, 186–191. doi: 10.3389/neuro.01.029.2008
Youn, J., Ellenbroek, B. A., van Eck, I., Roubos, S., Verhage, M., and Stiedl, O. (2012). Finding the right motivation: genotype-dependent differences in effective reinforcements for spatial learning. Behav. Brain Res. 226, 397–403. doi: 10.1016/j.bbr.2011.09.034

Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Copyright © 2022 Nigri, Åhlgren, Wolfer and Voikar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
